1. Motivation
Driven by advances in computation, ensemble modeling, and data assimilation, probabilistic forecast information is rapidly spreading in the weather enterprise. Many scientists agree that this is a positive development, but incorporating probability information into risk communication can be challenging, as probabilities are notoriously difficult to communicate effectively to lay audiences (e.g., National Research Council 2006; AMS 2008; National Research Council 2012). What does the research literature say about the best way to include probability information in risk communication? What is the evidence base for different communication practices? This project endeavors to address these questions by initiating a “living systematic review” that synthesizes relevant research from past studies and incorporates new studies as they become available.
2. Methodology
A systematic review is a literature review that uses a transparent and repeatable methodology to identify relevant research from past studies, evaluate results from those studies, and synthesize findings. Historically, systematic reviews have been static; they synthesize the literature at a point in time and become out of date almost as soon as they are complete. To prevent this, living systematic reviews are beginning to replace static reviews. Living systematic reviews follow the same steps but are updated as new research becomes available. Most systematic reviews, living or static, include the following steps:
Define the study domain.
Search for and identify relevant studies.
Extract key topics, questions, methods, and findings from relevant studies.
Evaluate the quality of relevant studies.
Analyze and combine the studies to identify common topics, questions, methods, and findings.
This review includes two additional steps:
Assess common findings to develop recommendations to assist forecasters in communicating uncertainty and probabilities.
Develop a living platform that incorporates new studies and relevant findings into the review as they become available.
We use these steps in the sections that follow to describe our living review of research literature on the use of probability information in risk communication.
a. Study domain
The review focuses on research studies that directly examine the impact of probability information on protective action decision-making, intentions, and behaviors. Most of the studies in the review focus on the “best” or most effective way to communicate probability information. They address questions like: are people more likely to take protective action when probability information is given verbally or numerically? We do not include studies that indirectly examine these relationships by way of implication or suggestion. For example, we do not include studies that explore the relationship between numeracy and risk perceptions, which may have important implications for how people use probability information when making decisions.
b. Identification of relevant studies
We use a combination of three methods to search for and identify relevant studies: 1) electronic search databases, 2) past literature reviews, and 3) citation chains. Across all three methods, we identified 327 unique studies that were relevant to the study domain. This list will continue to grow as the review proceeds.
1) Electronic search databases
In phase 1 of the search, we use ProQuest, Web of Science, and EBSCO Academic Search Elite to identify potentially relevant studies that focus on communicating probability information in the weather and climate domains. We restrict the domains at this point in the process to ensure that we are identifying the studies that are most relevant to the audience of this review (see Table 1 for a list of search terms by domain). We rely on the next two phases (past reviews and citation chains) to identify potentially relevant studies in adjacent domains (health, insurance, etc.). The first database searches were conducted on 29 July 2019; additional searches were undertaken on 1 September 2020 to update the list of potentially relevant studies. Through these searches, we identified 1559 potentially relevant studies; 725 of these were unique across the three databases.
Table 1. Query parameters (search terms) and resulting number of studies by database.
After identifying potentially relevant studies, two researchers independently screen the title and abstract of each study to ensure that three inclusion criteria are met: 1) the study fits within the study domain (see above); 2) the study reports on new findings from a new research project (e.g., it is not a literature review, essay, or workshop report); and 3) the study uses a generalizable, transparent, and repeatable methodology to conduct the research. In many (but not all) cases, this leads to the exclusion of qualitative studies because most of them do not meet the generalizability criterion. If the reviewers do not agree during this screening phase, they review the entire study to see if these criteria are met. Following the first screening phase, the researchers move on to a more in-depth screening phase in which they independently review the entire contents of the study to see if it meets the criteria above. If they disagree at this phase, they discuss and come to an agreement. Of the 725 unique studies identified at the start of the review, 93 met the inclusion criteria in the first screening. Of these, 29 met the criteria following in-depth review.
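To make the deduplication and dual-reviewer bookkeeping concrete, the following is a minimal sketch in Python; the file names, column names, and matching rule are hypothetical and illustrative only, not the exact procedure used in the review.

```python
import pandas as pd

# Hypothetical export of all database search hits (one row per hit).
records = pd.read_csv("database_exports.csv")

# Collapse hits returned by more than one database, e.g., by DOI, falling
# back to a normalized title when the DOI is missing.
records["key"] = records["doi"].fillna(records["title"].str.lower().str.strip())
unique_studies = records.drop_duplicates(subset="key")

# Title/abstract screening: each study coded independently by two reviewers
# against the inclusion criteria (1 = include, 0 = exclude).
screen = pd.read_csv("title_abstract_screening.csv")  # study_id, reviewer_1, reviewer_2
screen["agree"] = screen["reviewer_1"] == screen["reviewer_2"]

# Disagreements are resolved by reviewing the full study; agreed inclusions
# move on to the in-depth (full-text) screening phase.
needs_full_review = screen.loc[~screen["agree"], "study_id"].tolist()
included_first_pass = screen.loc[
    screen["agree"] & (screen["reviewer_1"] == 1), "study_id"
].tolist()
```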
2) Past literature reviews
In phase 2, we use past literature reviews to identify potentially relevant studies. To begin this review, we identified 12 past literature reviews that provide valuable information about potentially relevant studies (see Table 2 for a list of these past reviews). Using the two-stage screening methodology described above, we identified 37 new studies that met the inclusion criteria, bringing the set of relevant studies to 66.
Table 2. The 12 previous literature reviews that we used to identify potentially relevant studies.
3) Citation chains
In phase 3, we use citation chains to identify potentially relevant studies. First, we use “backwards” citation chains that include all references IN the relevant studies identified in phases 1 and 2. Next, we use “forwards” citation chains that include all references TO the relevant studies identified in the first two phases. In all, there were 1523 unique references IN and 2279 unique references TO the 66 relevant studies from the first two phases. Using the two-stage screening methodology described above, we identified 255 new studies that met the inclusion criteria, bringing the cumulative set of relevant studies to 327.
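The following is a minimal sketch of how the backward and forward chains could be pooled before screening; the identifiers and data structures are hypothetical (in practice the reference lists would come from a citation index or manual extraction).

```python
# The relevant studies identified in phases 1 and 2 (hypothetical IDs).
relevant_studies = {"study_001", "study_002"}

# "Backwards" chain: works cited BY each relevant study.
references_in = {
    "study_001": {"ref_a", "ref_b"},
    "study_002": {"ref_b", "ref_c"},
}
# "Forwards" chain: works that cite each relevant study.
references_to = {
    "study_001": {"cit_x"},
    "study_002": {"cit_x", "cit_y"},
}

# Pool both chains, drop anything already in the review, and pass the
# remaining candidates to the two-stage screening described above.
candidates = set().union(*references_in.values(), *references_to.values())
new_candidates = candidates - relevant_studies
```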
Because this is a living systematic review, it is important to reiterate that this is only the beginning of the review. We plan to repeat these phases every few months to make sure that we are including the most up-to-date research.
c. Extracting key information
Following identification, we review all relevant studies, extract key information, and store it in a spreadsheet. In addition to basic bibliographic information, we note relevant research questions, variables, research methodologies, information about research subjects, domains of study, and primary findings.
d. Evaluating study quality
While extracting key information about each study, we also assess its quality. There are many ways to assess quality; we use three indicators of validity: 1) external validity (are the results generalizable to the population of interest?); 2) internal validity (are we sure that variation in x causes variation in y?); and 3) domain validity (how relevant is the study domain to weather hazards and forecasting?). Each dimension is independently scored by two researchers on a 3-point scale (1 = low; 2 = medium; 3 = high). We use the mean value of these scores to measure validity along each dimension and the overall validity of each study.
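As a worked illustration of this scoring, the short sketch below averages two hypothetical reviewers' scores along each dimension and across dimensions; the score values are invented for illustration.

```python
import statistics

# Hypothetical scores for one study: [reviewer 1, reviewer 2] per dimension
# (1 = low, 2 = medium, 3 = high).
scores = {
    "external": [2, 3],
    "internal": [3, 3],
    "domain":   [1, 2],
}

# Mean of the two reviewers' scores along each dimension ...
dimension_means = {dim: statistics.mean(vals) for dim, vals in scores.items()}
# ... and the overall validity of the study as the mean across dimensions.
overall_validity = statistics.mean(dimension_means.values())

# e.g., external ~2.5, internal ~3.0, domain ~1.5, overall ~2.3
print(dimension_means, overall_validity)
```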
3. Results
a. Common topics: Quantity of evidence
We identified 5 primary research topics and 13 secondary topics in the 327 studies we are using to begin this review. While this list of topics may evolve as the review continues, an examination of the relative frequency of each topic provides a snapshot of the literature at this point in time and valuable information about the quantity of evidence available in each topic area.
Figure 1 provides this information by showing the percentage of studies that address the primary topics. Note that the topics are not mutually exclusive; many studies address more than one topic. As Fig. 1 indicates, the most common topics in the current set of relevant studies are public understanding and use of probability information in decision-making (label a), communicating probability information using words and phrases (label b), and communicating probability information using visualizations (label c). The least common topic is communicating probability information to a heterogeneous population (label e). The discrepancy between these topics suggests that we know a lot about how people use probability information to make decisions and the extent to which words and phrases facilitate this process, whereas we know relatively little about how different types of people use and interpret different types of probability information.
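Because topics are not mutually exclusive, the percentages in Fig. 1 are computed against the total number of studies rather than the total number of topic labels, so they need not sum to 100. The sketch below illustrates this tallying with hypothetical study records and topic labels.

```python
import pandas as pd

# Hypothetical study records; each study may carry several topic labels.
studies = pd.DataFrame({
    "study_id": ["s1", "s2", "s3"],
    "topics":   [["a", "b"], ["a"], ["c", "e"]],
})

# Count how many studies address each topic, then express as a percentage
# of all studies (the denominator is studies, not labels).
topic_counts = studies.explode("topics")["topics"].value_counts()
topic_pct = 100 * topic_counts / len(studies)
print(topic_pct.round(1))
```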
Figure 2 provides more information by showing the percentage of studies within each primary topic that address secondary topics of relevance. For example, the figure shows that studies on the communication of probability information using words and phrases (Fig. 2b) frequently address 1) numeric translations of words and phrases, but rarely address 3) severity and probability conflation. Likewise, studies on communication of probability information using numbers (Fig. 2c) are more likely to focus on 1) probabilities as percentages than 2) probabilities as frequencies. Again, these discrepancies provide valuable data on the amount of evidence we have on each of the secondary topics.
b. Common topics: Quality of evidence
Information about quantity provides one metric for assessing the strength of evidence in each of the topic areas. Information about quality of evidence provides a second metric. We assess the quality of evidence from each study in the review by evaluating the external, internal, and domain validity of the study. When we average these across topics, we can discriminate between topics with high-quality evidence and topics with low-quality evidence. These averages will change as the review continues, but they provide valuable information about the current state of research.
Figure 3 provides this information by showing the mean validity of studies that address the primary topics. Note that the sizes of the points in the plot reflect mean domain validity; large points indicate high domain validity, and small points indicate low domain validity. The dashed lines indicate overall mean validity scores; on average, studies on topics to the right of the vertical line have more internal validity than studies on topics to the left, and studies on topics above the horizontal line have more external validity than studies on topics below it. Given this orientation, the relatively large point in the top-right quadrant indicates that, on average, studies on public understanding and use of probability information in decision-making (label a) have high validity. By comparison, the relatively small point in the bottom-left quadrant indicates that studies on communicating probability information using words and phrases (label b) generally have low validity. Interestingly, studies on communicating probability information using numbers (label c) typically have more internal but less external validity than studies on the use of words and phrases; this is because these studies often use experiments to identify causality, but the subjects of the experiments are rarely representative of the U.S. population.
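The topic-level averaging behind Fig. 3 can be sketched as follows; the validity scores are hypothetical, and the comparison with the overall means corresponds to the dashed reference lines in the figure.

```python
import pandas as pd

# Hypothetical per-study validity scores, one row per study.
df = pd.DataFrame({
    "topic":    ["a", "a", "b", "c"],
    "internal": [2.5, 3.0, 1.5, 2.5],
    "external": [3.0, 2.5, 1.5, 1.0],
    "domain":   [3.0, 2.5, 2.0, 2.0],
})

# Mean validity by topic (point positions and sizes in Fig. 3) and the
# overall means (positions of the dashed lines).
topic_means = df.groupby("topic")[["internal", "external", "domain"]].mean()
overall = df[["internal", "external"]].mean()

# Topics right of the vertical line / above the horizontal line have
# above-average internal / external validity, respectively.
topic_means["right_of_line"] = topic_means["internal"] > overall["internal"]
topic_means["above_line"] = topic_means["external"] > overall["external"]
print(topic_means)
```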
Figure 4 provides more information by plotting the average validity of studies by secondary topic within each of the primary topics. The plots illustrate, for example, that studies on the value of probability information (Fig. 4a, label 1) and understanding probability information (Fig. 4a, label 2) have high levels of validity whereas studies on verbal directionality (Fig. 4b, label 2) and reference class ambiguity (Fig. 4c, label 3) typically have low levels of validity. If quality is the metric, these findings suggest that we know a lot about the first two areas of study, whereas more research is necessary on the latter two.
c. Common findings
We identified more than 100 unique findings in the 327 studies we are using to begin this review. It is not possible to describe all of these findings in a single article; we therefore synthesize and summarize as many as possible in the sections below.
1) Public understanding and use of probability information in decision-making
(i) Value of probability information
Do members of the public benefit from probability information, or are they better off with deterministic statements? Some forecasters express a desire to “boil down” complex probability information to a deterministic point forecast for fear of confusing members of the public (Pappenberger et al. 2013). Strong evidence in the research literature indicates that these fears are unfounded. Nearly all of the studies we review indicate that people make better decisions, have more trust in information, and/or display more understanding of forecast information when forecasters use probability information in place of deterministic statements (Ash et al. 2014; Bolton and Katok 2018; Grounds and Joslyn 2018; Grounds et al. 2017; Joslyn and LeClerc 2012, 2016; Joslyn and Demnitz 2019; Joslyn et al. 2007; LeClerc and Joslyn 2012; Marimo et al. 2015; Miran et al. 2019; Nadav-Greenberg and Joslyn 2009; Roulston and Kaplan 2009; Roulston et al. 2006; Joslyn and Grounds 2015). However, it is important to note that both experts and the public sometimes have difficulty interpreting probability information, and different communication formats can affect understanding (Bramwell et al. 2006). Along these lines, many studies emphasize the importance of making probabilistic forecasts as straightforward and easy to understand as possible in order to avoid “information overload” (Durbach and Stewart 2011). It is also important to note that these findings refer to probability information beyond a general acknowledgment of epistemological uncertainty (e.g., “this forecast is based on estimates; it is impossible to ever know for sure what will happen”), as these types of overly broad statements can undermine trust in forecasts (Howe et al. 2019).
(ii) Understanding probability information
Many studies also examine how people understand and interpret probability information. Often, these studies simply give subjects a probability of precipitation (PoP) forecast and ask them to interpret it in order to assess their level of understanding. Gigerenzer et al. (2005), Morss et al. (2008), Sink (1995), Zabini et al. (2015), and Abraham et al. (2015) all show, to various degrees, that a majority or a substantial proportion of the public is unable to give the correct interpretation of a PoP forecast, generally considered to be something like “it will rain somewhere in the forecast area on X% of days like today.” However, Juanchich and Sirota (2019) argue that previous studies use a cumbersome “correct” answer, and that “X% of simulations predict rain in the forecast area” is a more “fluent” and more easily understood response category; when this “correct” answer is used, the vast majority are able to give the correct PoP interpretation. Another prior study, Murphy et al. (1980), argues something similar: that “people do not have trouble understanding what ‘30% chance’ means, but . . . they do have trouble understanding exactly what the probability refers to in this kind of forecast.” These findings should reassure forecasters that the public can correctly understand probabilities but should underscore the need to explain the events the forecast refers to in an intuitive and clear way.
Multiple studies also offer findings about the process by which members of the public think through probabilistic risk information. For instance, several studies indicate that most people intuitively infer uncertainty even when given a deterministic forecast (Savelli and Joslyn 2012; Joslyn and Savelli 2010; Morss et al. 2008, 2010). Moreover, including uncertainty information can increase worry, although this effect can be mitigated through careful use of textual and visual formats (Han et al. 2011). These findings suggest that people think about forecast events in probabilistic terms even when not explicitly told to do so. Several more studies find that higher probabilities (regardless of context or direction) may lead people to view a forecast as more accurate; for instance, the same forecast would likely be taken to be more accurate if it reported a 70% chance of sun rather than a 30% chance of rain (Bagchi and Ince 2016; Løhre et al. 2019; Juanchich and Sirota 2017). In a similar vein, some studies suggest that people consistently misinterpret confidence intervals and forecast periods. For instance, two studies find that both experts and nonexperts implicitly interpret forecast events as being more likely toward the end of the forecast period (e.g., if there were an X% chance that a given event would occur sometime in a given week, people will, on average, perceive that the event is more likely to happen on Friday than on Monday) (Doyle et al. 2014; McClure et al. 2015), and one study finds that a significant proportion of the public is unsure of how to understand the distribution of possible outcomes denoted by confidence intervals (Dieckmann et al. 2015). These findings underscore the need to clarify what forecast periods and confidence intervals mean in the context of a given forecast rather than assuming that this will be clear to the audience.
A few studies also identify some common ways that people misinterpret or dismiss probability information. Perhaps the most significant is motivated reasoning, the tendency to interpret new information in a way that supports preexisting beliefs. Four studies in the review explicitly focus on the effect of motivated reasoning on how people interpret probability information; all indicate that probability information, especially information about politically sensitive topics like climate change, is susceptible to misinterpretation when it contradicts preexisting beliefs (Budescu et al. 2012; Dieckmann et al. 2017; Piercey 2009; Nurse and Grant 2020). Interestingly, more numerate people seem to be more susceptible to these effects (Nurse and Grant 2020) and using verbal probability expressions seems to encourage more motivated reasoning than using visual or numeric expressions (Budescu et al. 2012; Dieckmann et al. 2017; Piercey 2009). Another common way that probability information can become “distorted” is what Hohle and Teigen (2015) call the “trend effect.” In short, people often interpret recent forecasts in light of past forecasts. A “moderate” risk, for instance, will cause more worry if it has been upgraded from a “low” risk than if it has been downgraded from a “high” risk (Hohle and Teigen 2015; Løhre 2018).
2) Communicating probability information using words and phrases
(i) Numeric translations of words and phrases
Experts and nonexperts routinely use verbal probability expressions like “unlikely” or “a good chance” to indicate uncertainty; this practice is particularly common in the weather domain. The first core finding in this area of the review is very simple: there is strong evidence that risk communicators should always include a numeric “translation” for any verbal probability expressions used, and that translation should appear directly in or next to the verbal expression itself (Carey et al. 2018; Connelly and Knuth 1998; Dorval et al. 2013; Fortin et al. 2001; Hill et al. 2010; Zabini et al. 2015; Wintle et al. 2019). For example, a verbal expression like “severe thunderstorms are possible this evening” would be more effectively rephrased as “severe thunderstorms are possible (20% chance) this evening” (Budescu et al. 2014). Explicit statements of the upper and lower bounds (e.g., 0%–33%) implied by an expression (e.g., “likely” or “unlikely”) improve the accuracy of interpretation compared with the verbal statement alone (Harris et al. 2017). This is important not only because it helps people to correctly interpret the meaning of a forecast, but also because people generally prefer mixed formats (e.g., a numeric probability and a verbal probability expression together, or a number and a visualization) to singular ones (Carey et al. 2018; Connelly and Knuth 1998; Dorval et al. 2013; Fortin et al. 2001; Hill et al. 2010; Sink 1995; Zabini et al. 2015). Members of the public demonstrate a basic understanding of probabilistic forecasts; however, uncertainty is best communicated through combined use of numeric and verbal expressions to meet the needs of heterogeneous audiences (Kox et al. 2015). Translations also matter because audiences differ in how they process risk information: less numerate people tend to focus on narrative evidence when evaluating risk communications (the context, their perceptions about the likelihood of comparable events, etc.), while more numerate people tend to focus on the numeric probability of the risk (Budescu et al. 2009; Dieckmann et al. 2009; Budescu et al. 2012; Juanchich et al. 2013; Mandel 2015).
(ii) Verbal directionality
The next core finding in this area addresses the importance of “directionality” in verbal probability statements. “Directionality” can be positive or negative (Teigen and Brun 1995). Positive statements focus on the probability that an event will happen (e.g., “it is possible that the hurricane will affect town x”) and negative statements focus on the probability that it will not happen (e.g., “it is likely that the hurricane will miss town x”). Generally, research in this area suggests that positive statements can cause people to overestimate the baseline probability of an event and, consequently, engage in behaviors that are in line with the target outcome even if that outcome is very unlikely (e.g., take protective action even if there is a very small chance that a hurricane will affect town x). Negative statements can have the opposite effect; they can cause people to underestimate the probability of an event and decide not to engage in protective actions (Honda and Yamagishi 2006, 2009, 2017; Teigen and Brun 1995, 1999, 2000, 2003; Budescu et al. 2003). Researchers are still exploring the communicative function of these statements, but some evidence suggests that the direction of a statement conveys implicit information about a speaker’s reference point (McKenzie and Nelson 2003; Honda and Yamagishi 2017). Positive statements may indicate that a probability is increasing or higher than a speaker’s reference point. Negative statements indicate the opposite: that the probability is decreasing or lower than the speaker’s reference point.
(iii) Severity and probability conflation
Another core finding on verbal probability expressions pertains to the “severity effect,” which is the tendency of people to implicitly interpret verbal probability expressions as more likely when they describe more severe or undesirable outcomes (Bonnefon and Villejoubert 2006; Fischer and Jungermann 1996; Harris and Corner 2011; Weber and Hilton 1990). For example, someone who interprets a “slight chance” of rain showers to mean a 1%–5% chance will likely interpret a “slight chance” of a hurricane to mean something closer to a 10%–15% chance. This is important for forecasters to consider when using verbal probability statements, as it may suggest different interpretations of the same words and phrases, depending on the situation.
(iv) Choosing words and phrases
The studies in this area of the review provide a few core findings on word choice. For instance, when deciding whether to use a word like “can” or “will,” be aware of the “extremity effect”: when shown a probability and asked what “can” happen, people tend to focus on the most extreme possible values, and when asked what “will” happen, they tend to focus on the more likely scenarios (Teigen and Filkuková 2013; Teigen et al. 2018, 2014). In a similar vein, Teigen et al. (2013) find that people often use and interpret words like “improbable” to refer to events that are not just unlikely (something like a 10%–20% chance, for instance), but nearly impossible (closer to a ∼1% chance, for instance). Often, “improbable” is implicitly understood to refer to events that have not happened yet but have a small chance of happening in the future, even when experts have another definition in mind. Clarifying such terms and providing explicit numeric “translations” helps to reduce these misunderstandings (Teigen et al. 2013). Last, research strongly indicates that forecasters should avoid vague verbal probability terms (such as “it is possible” or “there is a chance”), as they can be particularly problematic in communication due to variable interpretation (Fillenbaum et al. 1991; Reyna 1981; Lenhardt et al. 2020). In summary, words and phrases play an important role in the communication of probability information. As a result, forecasters and audiences would benefit from careful consideration of translations, directionality, severity, and word choice to ensure clear communication.
3) Communicating probability information using numbers
(i) Probabilities as percentages
Experts and nonexperts also use a variety of numeric formats (e.g., percentages, frequencies, odds) to communicate probability information. Numeric probabilities are most commonly expressed as a percentage (e.g., “a 30% chance of rain”) or as