1. Introduction
During warning operations, weather forecasters rely heavily on radar technology to observe and monitor potentially severe thunderstorms (Andra et al. 2002). The National Weather Service (NWS) currently utilizes a network of 158 Weather Surveillance Radar-1988 Dopplers (WSR-88Ds) that are located across the United States (Whiton et al. 1998). Given that the WSR-88D was initially designed with a projected lifetime of 20 yr (Zrnić et al. 2007), continuous upgrades are required to maintain its functionality (e.g., Saffle et al. 2009; Crum et al. 2013). However, eventually the WSR-88D network will have to be replaced. A replacement candidate under consideration is phased-array radar (PAR; Zrnić et al. 2007). To explore the suitability of PAR for weather observation, a phased-array antenna was loaned to the NOAA/National Severe Storms Laboratory (Forsyth et al. 2005) in Norman, Oklahoma by the U.S. Navy. A key characteristic of this PAR is its capability to provide volume updates in less than 1 min (Heinselman and Torres 2011).
When exploring future replacement technologies for the WSR-88D, an important consideration is forecaster needs. In a survey conducted by LaDue et al. (2010), forecasters expressed a need for higher-temporal-resolution radar data during rapidly evolving weather events. In particular, forecasters reported that the 4–6-min updates provided by the WSR-88D are insufficient for observing radar precursor signatures of thunderstorms such as downbursts (LaDue et al. 2010). Fujita and Wakimoto (1983) define a downburst as, “A strong downdraft which induces an outburst of damaging winds on or near the ground.” Radar precursor signatures, such as a descending high-reflectivity core and strong midlevel convergence, can be used to identify storms capable of producing a downburst (e.g., Roberts and Wilson 1989; Campbell and Isaminger 1990). Such precursor signatures, however, can evolve too quickly for trends to be sampled sufficiently by the WSR-88D. Such limitations may result in delayed warnings and therefore reduced lead time or, worse, missed events. These limitations are of concern because downbursts can produce damaging winds at the surface, presenting a threat to life and property. Therefore, for improvement in warning operations, a future radar system should be capable of sampling the atmosphere on a shorter time scale, which PAR can provide.
Heinselman et al. (2008) examined the weather surveillance capabilities of the PAR during severe weather events. In particular, microburst precursor signatures observed by the PAR were compared to those observed by the WSR-88D. During a 13-min observation period when a storm was sampled by both radars, the PAR and WSR-88D collected 23 and 3.5 volume scans, respectively. The considerably faster PAR sampling resulted in an improved ability to observe and track microburst precursor signatures, prior to the detection of divergent outflow at the lowest scans. Additionally, Heinselman et al. (2008) analyzed a hailstorm observed by PAR. Although a comparison to the WSR-88D was not available, the development of radar features indicative of a hail threat (e.g., bounded weak-echo region and three-body scatter spike) was clearly visible in PAR data as the storm quickly evolved. These findings by Heinselman et al. (2008) suggest that the use of PAR data could provide forecasters with the ability to detect impending severe weather earlier, which in turn may provide the public with longer warning lead times.
The Phased Array Radar Innovative Sensing Experiment (PARISE) was designed to assess the impacts of higher-temporal-resolution radar data on the warning decision process of forecasters (Heinselman et al. 2012; Heinselman and LaDue 2013). The work of PARISE is critical to ensuring that the implementation of PAR technology would be beneficial to the NWS. The 2010 and 2012 PARISE focused on low-end tornado events (Heinselman et al. 2012; Heinselman and LaDue 2013). Both experiments reported enhanced forecaster performance with the use of 1-min radar updates compared to forecasters using traditional 5-min radar updates, as demonstrated through warnings issued with longer tornado lead times. The purpose of this study was to extend the work of PARISE to include severe hail and wind events, with a focus on downbursts (see section 3b for the NWS definition of severe). Based on the findings of Heinselman et al. (2012) and Heinselman and LaDue (2013), we hypothesized that during such events, rapidly updating radar data would positively impact the warning decision process of NWS forecasters. To assess this hypothesis, data collection focused on both quantitative and qualitative aspects of the forecaster warning decision process. In particular, details of warning products were recorded so that forecaster performance could be assessed from a verification standpoint. The data collected revealed that the warning decision process comprised three key decision stages. For this reason, verification was assessed with regard to what has been termed the compound warning decision process, which recognizes that forecasters detect, identify, and reidentify severe weather (see section 3a). Additionally, confidence ratings were obtained each time a forecaster made a key decision, along with reasoning for each confidence rating. 
Through the use of a confidence-based assessment, these ratings were analyzed to address the question of whether increasing the temporal availability of radar data leads to better decisions. Specifically, decisions were categorized into four types: doubtful, uninformed, misinformed, and mastery. The reasoning for each confidence rating provides insight into why each decision type occurred, and whether the temporal resolution of radar data played a role.
2. Methods
a. Experimental design
From two NWS Weather Forecast Offices (WFOs), 12 forecasters were recruited to participate in the 2013 PARISE. The two WFOs were located in the NWS’s Southern and Eastern Regions, and therefore given the climatology of these regions, the 12 forecasters would have experienced working severe hail and wind events (Kelly et al. 1985). During each of the six experiment weeks, one forecaster from each WFO visited Norman, Oklahoma. The experiment adopted a two-independent-group design, where each week forecasters were assigned to either a control or an experimental group. The volume update time acted as the independent variable, where the control group received 5-min updates from temporally degraded PAR data, and the experimental group received 1-min updates from full-temporal-resolution PAR data.
To ensure balanced groups in terms of knowledge and experience, matched random assignment was incorporated into the experiment design. Matching was accomplished through an online survey that was issued to participants prior to the experiment. Participants’ experience was measured by the number of years they had worked in the NWS (Table 1, columns 1 and 3). Although experience is important with respect to the amount of exposure one has had in their work environment, experience does not imply expertise. As described by Jacoby et al. (1986), experience and expertise are “conceptually orthogonal,” with a distinguishing factor being that expertise is achieved through acquiring a “qualitatively higher level of either knowledge or skill.” Therefore, to assess aspects of forecaster expertise relevant to this study, knowledge was measured through four questions regarding familiarity (Table 1, columns 2 and 3), understanding, knowledge of precursors, and training with respect to downburst events (Table 1, columns 4–7). For knowledge, the three questions requiring qualitative responses were compared to criteria that were based on downburst conceptual models (e.g., Atkins and Wakimoto 1991). Based on their survey responses, all participants were assigned an experience and knowledge score ranging between 1 and 5 (Fig. 1). The experience score was based on the single experience question, whereas the knowledge score was generated by averaging the points obtained from the four knowledge questions. Among the participants, experience was spread fairly evenly, and knowledge was clustered around the medium range (Fig. 1). For all possible group combinations, the Mahalanobis distance was computed to assess the similarity between groups by using experience and knowledge scores as variables (McLachlan 1999). The smallest distance represented the greatest similarity between groups, which therefore determined the group assignment for each participant (Fig. 1).
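The matching step can be illustrated with a minimal sketch. It assumes that every equal-sized split of the participants is a candidate and that the distance is taken between group means using the pooled within-group covariance; the text does not specify either choice, and any constraint from pairing one forecaster per WFO each week is ignored here. The scores in `SCORES` and the names `group_distance` and `best_split` are hypothetical and are not the values or procedures used in the experiment.

```python
from itertools import combinations

# Hypothetical (experience, knowledge) scores for 12 participants; not the study's data.
SCORES = [(1, 2.5), (2, 3.0), (3, 2.75), (4, 3.25), (5, 3.0), (1, 3.5),
          (2, 2.25), (3, 3.0), (4, 2.5), (5, 3.5), (2, 3.75), (4, 2.0)]


def mean_vec(rows):
    """Mean of each of the two score variables."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def pooled_cov(g1, g2):
    """Pooled within-group 2x2 covariance matrix (an assumed choice)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (g1, g2):
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(g1) + len(g2) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def group_distance(g1, g2):
    """Mahalanobis distance between the mean score vectors of two groups."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    (a, b), (c, e) = pooled_cov(g1, g2)
    det = a * e - b * c
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Quadratic form d^T S^-1 d with the closed-form 2x2 inverse.
    q = (d[0] * (e * d[0] - b * d[1]) + d[1] * (a * d[1] - c * d[0])) / det
    return max(q, 0.0) ** 0.5


def best_split(scores):
    """Return (distance, group 1 indices, group 2 indices) for the most similar split."""
    n = len(scores)
    best = None
    # Keep participant 0 in the first group so mirror-image splits are not repeated.
    for rest in combinations(range(1, n), n // 2 - 1):
        first = (0,) + rest
        second = tuple(i for i in range(n) if i not in first)
        try:
            dist = group_distance([scores[i] for i in first],
                                  [scores[i] for i in second])
        except ValueError:
            continue
        if best is None or dist < best[0]:
            best = (dist, first, second)
    return best


if __name__ == "__main__":
    dist, first, second = best_split(SCORES)
    print(f"smallest distance {dist:.3f}: groups {first} and {second}")
```

The smallest distance identifies the most similar pair of groups; which of the two groups becomes the control group would still need to be decided separately.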
Table 1. Criteria for points assigned to questions from the preexperimental online survey. Columns 1 and 3 refer to how experience scores were assigned, and columns 2–7 refer to how knowledge scores were assigned. In column 2, a scale from 1 to 10 is used (where 1 indicates no familiarity and 10 indicates extensive familiarity).
Although efforts were made to match groups, the limitations associated with the applied methodology should be acknowledged. A limitation that arose following the distribution of the survey was that participants may not have always interpreted the questions correctly, leading to discussions on tangential topics. For example, participants were asked to explain their understanding of a downburst. Although most participants perceived this question as intended (Table 1, column 4), some responses focused on the type of damage observed from downbursts. In addition, the amount of time and effort that participants invested into the survey was likely variable. For these reasons, it is possible that survey responses did not provide a complete representation of participants’ knowledge. However, despite this possibility, the consistent assessment of survey responses and the use of a similarity metric provided a means to objectively match groups.
b. Case studies
The National Weather Radar Testbed located in Norman, Oklahoma, is home to an S-band PAR that is being evaluated and tested for weather applications. Given that the PAR is a single flat-panel array, data collection is limited to a 90° sector at any one time. PAR’s electronic beam steering means that it operates with a nonconformal beamwidth increasing from 1.5° to 2.1° as the beam is steered from boresight to ±45° (Zrnić et al. 2007). Additionally, the electronic beam steering allows the atmosphere to be scanned noncontiguously, enabling weather-focused observations, which further reduce the volume update time to less than 1 min (Heinselman and Torres 2011; Torres et al. 2012).
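The quoted beamwidth range is consistent with a commonly used approximation for flat-panel arrays, in which the beam broadens by the reciprocal of the cosine of the steering angle. The sketch below applies that approximation; the relation is an assumption introduced here for illustration and is not stated in the text, and `steered_beamwidth` is a hypothetical helper name.

```python
import math


def steered_beamwidth(broadside_deg, steer_deg):
    """Approximate beamwidth (deg) of a flat array steered steer_deg off boresight.

    Assumes the 1/cos(steering angle) broadening approximation.
    """
    return broadside_deg / math.cos(math.radians(steer_deg))


if __name__ == "__main__":
    for angle in (0, 15, 30, 45):
        print(f"{angle:2d} deg off boresight: {steered_beamwidth(1.5, angle):.2f} deg")
```

Under this approximation, a 1.5° beam at boresight broadens to roughly 2.1° at ±45°, in line with the values cited above.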
Based on the following criteria, two cases from archived PAR data were selected for the 2013 PARISE (Table 2). First, the cases needed to be long enough to allow participants to settle into their roles and demonstrate their warning decision processes as the weather evolved. Second, severe hail and/or wind reports needed to be associated with the event, preferably toward the end of the case to give participants an opportunity to interrogate the storms beforehand and make warning decisions as necessary. Third, for consistent low-level sampling of the weather event, the PAR data needed to be uninterrupted and within a range of 100 km from the radar.
Table 2. Descriptions of cases 1 and 2.
Case 1 presented multicell clusters of storms that occurred at 0134–0210 UTC 20 April 2012 (Figs. 2a,b; Table 2). This marginally severe (i.e., at or slightly greater than the severe criteria) hail event was observed by the PAR using an enhanced volume coverage pattern (VCP) 12 strategy. Specifically, this VCP scanned 19 elevation angles ranging between 0.51° and 52.90°. Although only one severe hail report occurred during case time, an additional six hail reports were associated with the same storm 1 h after case end time.
Case 2 included multicellular storms with some rotation that were sampled by PAR at 2053–2139 UTC 16 July 2009 (Figs. 2c,d; Table 2). PAR collected data using a VCP that was composed of 14 elevation angles ranging between 0.51° and 38.80°. Both severe hail and wind events were reported and associated with a downburst event that occurred in central Oklahoma. During case time, there was one severe wind and two severe hail reports. Within the hour after case end time, an additional 16 reports of severe hail and wind events were associated with the same storm.
All storm reports were obtained from Storm Data, which is logged in the NWS Performance Management System (https://verification.nws.noaa.gov/). Because the spatial and temporal accuracy of Storm Data is limited (e.g., Witt et al. 1998; Trapp et al. 2006), it was important to ensure consistency between the location and timing of storm reports with the radar data. Additionally, weather reports obtained during the Severe Hazards Analysis and Verification Experiment (SHAVE; Ortega et al. 2009) were examined to validate confidence in the storms that did not produce severe weather. Both SHAVE and Storm Data were in agreement with storms classified as null events. The occurrence of both severe and nonsevere storms during the cases provided a realistic scenario whereby participants were challenged to differentiate between storms that would and would not produce severe weather.
c. Working the cases
Before working each case, participants viewed a weather briefing video that was prepared by J. LaDue of the Warning Decision Training Branch. This video provided participants with an overview of the environmental conditions associated with the case, along with satellite and radar imagery leading up to the case start time. The weather briefing gave all participants the same information from which they could form expectations. Participants were then told that they had just come on shift, that no warnings were in progress, and that it was their job to determine whether a warning was required for the storms they would encounter. All participants worked independently in separate rooms. They were reminded that the data collected from their participation would remain anonymous, and participants were encouraged to work as they would in their usual WFOs. In this study, participants are referred to as P1–P6 for the control group and P7–P12 for the experimental group.
Cases were played in simulated real time using the next-generation Advanced Weather Interactive Processing System-2 (AWIPS-2). Given that during the summer of 2013 participants were using AWIPS-1 within their WFOs, a short familiarization session with the newer software prior to working events was provided to increase the participants’ comfort level using AWIPS-2 as their forecasting tool. Therein, participants were able to view base velocity, reflectivity, and spectrum width products from the PAR. During the case, participants received verbal information of storm reports that were timed according to the details provided in Storm Data. All warning products (e.g., special weather statements, severe thunderstorm warnings, and severe weather statements) were issued using the Warning Generation (WARNGEN) software. Whenever participants issued a product, they were asked to indicate their level of confidence on a scale that ranged from not sure (0%), to partially sure (50%), to sure (100%; Fig. 3). Following the case, participants were asked a set of probing questions that targeted the reasons for each decision and the decision maker’s associated confidence level.
3. Forecaster performance
a. The compound warning decision process
Decisions are oftentimes not a one-step procedure. Rather, decision makers can find themselves in a compound decision environment that consists of multiple decision elements. For example, search and rescue operations require locating a target followed by identifying that the correct target has been recovered (Duncan 2006), and medical diagnoses can involve first detecting an abnormality, and then correctly localizing the abnormality for treatment (Obuchowski et al. 2000). Observations of participants during the 2013 PARISE revealed that weather forecasters also encounter multiple problems when working toward a solution. In particular, these problems are focused on warning decisions and are recognized as detection, identification, and reidentification, together forming the compound warning decision process (Fig. 4). Detection relates to the decision to warn; a forecaster perceives and comprehends information that leads to the belief that severe weather will occur. The decision to issue a warning prompts the forecaster to open the WARNGEN software, at which point the forecaster progresses to the identification stage. For instance, when issuing a severe thunderstorm warning, the forecaster must identify the expected weather threats (i.e., hail and/or wind) from the storm in question. Once the warning is issued, the forecaster continues to monitor the storm’s evolution and updates the warning by issuing severe weather statements. It is through these updates that the forecaster reidentifies the weather threats; the threat may be maintained, changed in magnitude, changed in type, or canceled.
Distinguishing severe hail and wind events from one another is a challenge that NWS forecasters regularly encounter during warning operations. Currently, though, the NWS only assesses forecaster performance at the detection level. The compound warning decision process, however, allows for a more comprehensive assessment of warning decisions. A correct decision at the detection level does not necessarily mean that the forecaster has accurately comprehended information regarding the storm’s potential. For example, while working case 2, P3 (control participant) issued a severe thunderstorm warning, identifying only wind as the weather threat. Although at the detection level P3 made a correct decision, P3 had missed the hail threat during identification. The participant maintained this threat expectation through the issuance of two warnings, only realizing after the first hail report at 2135 UTC 16 July 2009 that hail was also a threat. At this point, P3 issued a severe weather statement to reidentify both hail and wind as weather threats, but unfortunately had not communicated the hail threat until after the event had occurred. This example demonstrates that to fully understand the quality of a forecast, a more intricate analysis of warning decisions is required.
b. Verification
To measure forecaster performance at the three levels of the compound warning decision process, forecaster decisions were verified using the NWS severe criteria. In operations, a severe thunderstorm warning is verified by the occurrence of wind of 50 knots (kt; 1 kt = 0.51 m s−1) or higher and/or hail of at least 1-in. diameter, whereas a tornado warning is verified by reports of a tornado within the spatiotemporal limits of the warning polygon (NOAA 2011). Storm reports associated with the severe weather events in cases 1 and 2 were treated as instantaneous events (Table 2). Additionally, since participants worked only a portion of a severe weather event, there were occasions where warnings were verified only by severe hail and/or wind events that occurred after the case had ended. In these instances, storm reports recorded 1 h after case end time were used for verification purposes. For detection, individual warnings were verified by assessing whether the warning encompassed an event both spatially and temporally. Each event that was not warned for was recorded as a miss. For identification, weather threats were first considered individually. For example, in case 1, only hail reports were associated with the severe storm. If both hail and wind were identified as threats in the warning, then hail was a hit, and wind was a false alarm. The results from each weather threat were then combined for overall identification statistics. Reidentification was verified in a similar manner to identification, but this time for the updated warning information detailed in severe weather statements.
Table 3. Mean POD and FAR statistics for the control and experimental groups for detection, identification, and reidentification.
1) Case 1
All but one participant (control) successfully detected the severe hail event in case 1, which resulted in POD scores of 0.83 and 1 for the control and experimental groups, respectively (Fig. 5a). All control and five experimental participants also decided to issue warnings on a storm that was not associated with severe weather. Although at the detection level the experimental group’s performance scores were more variable, the overall performance of the experimental group resulted in a lower FAR score (0.45) than the control group (0.58; Fig. 5a). Five control and two experimental participants obtained FAR scores of 0.5 by warning once on the severe storm and once on a nonsevere storm. Two warnings were also issued by P5, but neither verified (FAR = 1). Three warnings were issued by P7 and P12, of which only one verified at the detection level (FAR = 0.67). Three warnings were also issued by P10, but two of his warnings verified (FAR = 0.33). The only experimental participant who did not incorrectly detect severe weather (FAR = 0) was P11.
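The individual scores quoted above follow from the standard definitions, in which POD is the fraction of events that were warned and FAR is the fraction of warnings that did not verify. The sketch below assumes these definitions; the function names `pod` and `far` are illustrative, and returning `None` when no events or warnings exist is a choice made here rather than a rule from the study.

```python
def pod(hits, misses):
    """Probability of detection: warned events divided by all events."""
    total = hits + misses
    return hits / total if total else None


def far(verified, unverified):
    """False alarm ratio: unverified warnings divided by all warnings issued."""
    total = verified + unverified
    return unverified / total if total else None


if __name__ == "__main__":
    # Warning counts quoted in the text for case 1 at the detection level.
    print(f"one verified, one unverified: FAR = {far(1, 1):.2f}")
    print(f"P7 and P12 (one of three verified): FAR = {far(1, 2):.2f}")
    print(f"P10 (two of three verified): FAR = {far(2, 1):.2f}")
    print(f"P5 (neither of two verified): FAR = {far(0, 2):.2f}")
    # Control group: five participants with FAR = 0.5 and P5 with FAR = 1.
    control = [far(1, 1)] * 5 + [far(0, 2)]
    print(f"control mean FAR = {sum(control) / len(control):.2f}")
```

With five control participants at FAR = 0.5 and one at FAR = 1, the group mean is 0.58, matching the value reported for the control group.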
Following detection, participants identified the weather threat associated with the storm that was being warned. All participants identified hail in each of their warnings, which only verified for warnings that were successful at the detection level. Therefore, the POD scores for identification match those for detection (Fig. 5b). Participants were also assigned FAR scores because of incorrect identifications (Fig. 5b). Incorrect identifications were a result of two reasons: 1) a weather threat was identified for a warning that did not verify at the detection level and 2) incorrect identification of a wind threat was made on the severe storm that was associated with only severe hail events. Both groups’ FAR scores increased from detection to identification, though the experimental group continued to achieve a lower FAR score than the control group (Table 3). We surmise that the increase in FAR scores from the detection to identification level is due to the added challenge of having to discern between potential weather threats.
Reidentifications of weather threats during case 1 were made while updating a warning. No updates were issued by P6, and therefore statistics were not calculated for this participant. Hail and wind threats were only reidentified on warnings that were false alarms at the detection level by P4, P5, and P7. These participants therefore received POD and FAR scores of 0 and 1, respectively (Fig. 5c). The remaining participants reidentified a hail threat on the correct storm at least once, achieving POD scores of 1. The variable FAR scores at the reidentification level resulted from 1) what storms participants decided to update (i.e., the severe storm or the nonsevere storm) and 2) whether participants were able to correctly reidentify hail as the only threat. Whereas the mean FAR score from identification to reidentification remained nearly steady for the experimental group, the control group’s mean FAR score increased (Table 3). Overall, the experimental group was more successful at reidentifying the correct weather threat on the correct (i.e., severe) storm than the control group.
2) Case 2
During case 2, all participants except P5 successfully detected the three severe weather events (Fig. 6a). This participant missed one event, which resulted in the control group achieving a slightly lower mean POD score of 0.95 compared to the experimental group’s mean POD score of 1 (Table 3). In comparison to POD scores, FAR scores were more variable among participants (Fig. 6a). For participants obtaining FAR scores of 0, each warning that was issued encompassed the severe storm and therefore was verified with respect to detection. Participants obtaining FAR scores of 0.5 typically issued two warnings, of which only one was verified by severe weather, while the other targeted a storm to the north that was not associated with severe weather reports. Three warnings were issued by both P4 and P6. For P4, one of these warnings verified (FAR = 0.67), whereas for P6, two warnings verified (FAR = 0.33). Overall, the experimental group had fewer false alarms, as demonstrated by their lower mean FAR score of 0.25 compared to 0.33 for the control group (Table 3).
Unlike in case 1, case 2 presented a storm that produced both severe hail and wind. Of these events, all participants identified the wind event successfully, and four participants in each group identified the hail events (Fig. 6b). This similar performance between groups led to matching mean POD scores at the identification level of 0.88 (Table 3). The experimental group, though, performed better than the control group regarding false alarms. In case 2, false alarms at the identification level occurred mostly within warnings that did not verify at the detection level. While all control participants incorrectly identified weather threats within these warnings, three experimental participants achieved an FAR score of 0. Additionally, two control participants incorrectly identified a tornado threat. The resulting mean FAR scores for identification were 0.36 for the control group and 0.22 for the experimental group (Table 3).
When participants began to reidentify weather threats, group POD scores increased and the FAR scores decreased (Table 3). As the severe storm evolved over time, participants realized that the southern storm had more potential than the storm to the north, which was beginning to dissipate. The wind threat associated with the severe storm was correctly reidentified by all participants, while all control and four experimental participants also correctly reidentified the hail threat (Fig. 6c). Some participants in both groups also incorrectly reidentified weather threats. Whereas the experimental group’s FAR score decreased slightly from identification to reidentification, the control group’s FAR score decreased more substantially (Table 3). However, the accuracy of the control group’s decisions during reidentification improved to a level of accuracy similar to that demonstrated by the experimental group during the identification stage.
c. Lead time
The lead time was calculated as the time of the severe hail or wind event minus the time of warning issuance. For events that were unwarned, a lead time of 0 min was assigned. On occasions where multiple warnings encompassed one event, the earliest issued warning was used to calculate lead time. Lead time was calculated for all 12 participants for one event in case 1, and three events in case 2.
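The lead-time rules above can be written as a short sketch. The function name `lead_time_minutes` and the warning issuance times in the example are hypothetical; the 2135 UTC hail report time is taken from the case 2 discussion in section 3a. The sketch assumes the caller passes only warnings that encompassed the event.

```python
from datetime import datetime


def lead_time_minutes(event_time, warning_times):
    """Lead time (min) for one event.

    warning_times holds the issuance times of warnings that encompassed the
    event; an empty list means the event was unwarned and gets 0 min.
    """
    if not warning_times:
        return 0.0
    earliest = min(warning_times)  # the earliest issued warning defines lead time
    return (event_time - earliest).total_seconds() / 60.0


if __name__ == "__main__":
    report = datetime(2009, 7, 16, 21, 35)  # hail report at 2135 UTC 16 July 2009
    # Hypothetical issuance times for two warnings covering the same event.
    warnings = [datetime(2009, 7, 16, 21, 20), datetime(2009, 7, 16, 21, 10)]
    print(lead_time_minutes(report, warnings))
    print(lead_time_minutes(report, []))
```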
Participants’ lead times during case 1 ranged from 0 to 30 min (Fig. 7a). The experimental group, however, demonstrated a tendency toward longer lead times. With the exception of P7, all experimental participants achieved a lead time of at least 20 min, compared to just half of the control participants. Group mean lead times were 16.4 and 22.0 min for the control and experimental groups, respectively (Table 4). For case 2, lead time was calculated for three events that both spatially and temporally occurred close to one another. Therefore, often one warning verified the three events. Within the experimental group, four participants achieved a lead time of at least 20 min for all three events, compared to just one control participant (Fig. 7b). Group mean lead times for case 2 were 16.4 and 21.8 min for the control and experimental groups, respectively.
Table 4. Mean lead times for control and experimental groups for cases 1 and 2, along with the group differences in mean lead time.
Combining the lead time results of both cases, we find that the control group’s mean lead time was 16.4 min compared to 21.9 min for the experimental group. Therefore, the experimental group’s mean lead time exceeded the mean lead time of the control group by 5.5 min. While this difference in mean lead time is similar to the temporal resolution provided to the control group, the variability among participants’ lead time results within the same group suggests that factors in addition to temporal resolution may be important for explaining participant performance. Additionally, the Wilcoxon rank sum nonparametric test (Wilks 2006) was used to assess the difference between the median lead times of the control (17.3 min) and experimental (21.5 min) groups. The test yielded a p value of 0.0252, indicating that the difference in median lead times was statistically significant at the 95% confidence level. Although the results from this study cannot be generalized because of the small sample size, the performance of the experimental group is encouraging and the lead time results are in favor of the use of higher-temporal-resolution radar data.
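For readers unfamiliar with the rank sum test, the following is a minimal sketch using the large-sample normal approximation with average ranks for ties and no tie correction to the variance. It is not necessarily the computation used in this study, which may have relied on exact probabilities or different software. The lead times in the example are hypothetical and are not the participants' values.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (rank sum of x, z score, two-sided p value) by normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, z, p


if __name__ == "__main__":
    # Hypothetical lead times (min) for two groups of six.
    control = [0.0, 12.5, 15.0, 18.0, 21.0, 24.5]
    experimental = [14.0, 20.5, 22.0, 23.0, 26.0, 30.0]
    w, z, p = rank_sum_test(control, experimental)
    print(f"W = {w}, z = {z:.3f}, p = {p:.4f}")
```

A p value below 0.05 from such a test corresponds to significance at the 95% confidence level; with samples this small, an exact test would generally be preferable to the normal approximation.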
4. Decision types
a. Confidence-based assessment
The increased flux of information provided by PAR raises the question of how rapidly updating radar data will impact forecaster confidence, and what the resulting effects will be on the decisions that are made. To investigate this question, the relationship between confidence and correctness was assessed using a two-dimensional testing method (Bruno 1993). This method, referred to as the confidence-based assessment behavioral model, requires a decision maker to indicate the confidence associated with each decision on a scale ranging from “not sure” (0%) to “partially sure” (50%) to “sure” (100%; Fig. 3). In particular, confidence-based assessment can identify three states of mind, namely confidence, doubt, and ignorance (i.e., lacking knowledge), and can help categorize decisions into four types: doubtful, uninformed, misinformed, and mastery (Fig. 8; Bruno et al. 2006; Adams and Ewen 2009). According to Bruno et al. (2006), doubtful decisions, although correct, lack confidence and are made with hesitance. Decisions that are both incorrect and made without confidence are uninformed. Decisions that are incorrect yet made with confidence are misinformed, and perhaps are the riskiest types of decisions. The most desirable type of decision is mastery, which arises from smart and informed choices that are both confident and correct.
b. Categorizing decisions
When participants made a key decision (i.e., decision to issue or update a warning), a corresponding confidence rating was assigned. Since there was variability in their confidence baselines, results (which ranged from 26% to 100%) were normalized by linear transformation onto a new scale ranging from 0 to 7. Ratings of at least 5 were considered confident decisions, since this value indicated that the decision was closer to sure than partially sure (i.e., ≥75%). The key decisions made during cases 1 and 2 were combined, yielding a total of N = 53 and 54 key decisions for the control and experimental groups, respectively. Decisions were classified as correct if the decision to issue or maintain a warning corresponded with the occurrence of severe weather. Similarly, decisions to not issue or to cancel a warning were correct for instances when severe weather did not occur.
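The categorization of key decisions can be sketched as follows. The sketch skips the normalization step and assumes that each confidence rating is already expressed as a fraction between 0 and 1, with the 75% criterion above serving as the confidence threshold. The names `classify_decision` and `tally`, and the example decisions, are hypothetical.

```python
from collections import Counter


def classify_decision(correct, confidence, threshold=0.75):
    """Assign one of the four decision types.

    correct: True if the decision matched the occurrence of severe weather.
    confidence: rating as a fraction in [0, 1] (assumed already normalized).
    threshold: minimum rating treated as confident (assumed to be 0.75).
    """
    confident = confidence >= threshold
    if correct:
        return "mastery" if confident else "doubtful"
    return "misinformed" if confident else "uninformed"


def tally(decisions):
    """Count decision types for a list of (correct, confidence) pairs."""
    return Counter(classify_decision(c, conf) for c, conf in decisions)


if __name__ == "__main__":
    # Hypothetical key decisions for one participant.
    example = [(True, 0.9), (True, 0.5), (False, 0.8), (False, 0.3), (True, 1.0)]
    print(tally(example))
```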
Of these key decisions, a larger proportion of the decisions made by the experimental group were classified as mastery (63%) compared to those of the control group (51%). Individual participants in the experimental group made a higher number of mastery decisions and a lower number of uninformed and misinformed decisions compared to individual participants in the control group (Fig. 9a). The majority of the key decisions in both groups were categorized as misinformed and mastery. This result is unsurprising since one may expect decisions to be made more frequently when a decision maker is confident rather than unsure. The Wilcoxon rank sum nonparametric test (Wilks 2006) was used to assess the difference in the median number of decisions made by the control and experimental groups for all four decision types. The p values yielded were 0.862, 0.673, 0.802, and 0.325 for the doubtful, uninformed, misinformed, and mastery decision types, respectively. Therefore, although statistical significance was not established, these results indicate that of the four decision types, the control and experimental groups differed most with respect to mastery decisions.
c. Explanations for decision types
Following each case, participants were questioned on the reasons for the confidence ratings that they had provided. The qualitative data collected from this questioning give insight into why doubtful, uninformed, misinformed, and mastery decisions were made. Although the reasoning provided by participants varied somewhat, common themes also emerged.
The control and experimental groups made five and four decisions, respectively, that were correct but made without confidence (i.e., doubtful; Fig. 9b). Of these hesitant decisions, the majority were made during case 2, with just one doubtful decision being recorded during case 1 for both groups. Three control participants explained that their hesitation was due to their warning criteria not being fully satisfied. For example, in case 1, P3 said that she was “flirting with the criteria” since the storm appeared “more marginal,” and P4 went ahead with issuing a tornado warning in case 2 despite being “not sure [that the] environment was conducive” for tornadogenesis. Similarly, some experimental participants found themselves making warning decisions without confidence. During case 2, P10 questioned the severe potential of a storm on which he had decided to warn. His doubt arose because despite seeing that the storm had a “good” and “healthy” core, he was “just not sure [whether] the environment” was supportive of severe storms. For P12 and P8, though, conflict arose as a result of earlier warnings not being verified. For example, P12 explained that during case 1 she wanted “any kind of determination on previous storms.” Additionally, P8 lacked confidence in case 2 after observing a “downward trend in reflectivity and velocity data” while also having “not received reports at that time.” The absence of reports on storms that were already warned on resulted in P12 and P8 being hesitant in their subsequent warning decisions.
Decisions categorized as uninformed were made on eight occasions in the control group and five occasions in the experimental group (Fig. 9b). Participants that did not make incorrect decisions without confidence (i.e., uninformed) also did not make correct decisions without confidence (i.e., doubtful). These participants are identified as P5 and P6 of the control group, and P7, P9, and P11 of the experimental group (Fig. 9b). For the eight uninformed decisions recorded in the control group, three control participants explained that they did not have sufficient data to make a confident and informed decision. In particular, P1 described going “off [his] gut” when he decided to warn during case 1, P4 projected that a storm in case 2 would “continue to grow” despite “[not] having a lot of information,” and P2 decided to issue a tornado warning in case 2 because she thought that if she had waited for more information, it would have been “too late.” Control participants made incorrect decisions without confidence for other reasons also, including the warning decision being the “first one of the day” (P3; case 2), feeling that a warning could not be canceled despite it “[not] look[ing] severe anymore” (P4; case 1), and maintaining a tornado warning because it was “approaching a major interstate” despite having “reservations about [the] tornado aspects” of the storm (P4; case 2). Experimental participants’ reasoning for their lack of confidence varied, but unlike control participants, their reasons were not associated with the amount of radar data they had available. Furthermore, all uninformed decisions made by experimental participants were made during case 1. For P12, not having “reports of ground truth” led to an incorrect decision being made without confidence on two occasions. Both P8 and P10 reported that their lack of confidence in case 1 was due to the storm of interest appearing weaker than a storm that they had already warned on.
P10 explained that, “reflectivity-wise, it did not seem as robust as the southern storm.” Similarly, P8 noted that the storm was not “as strong as the southern storm.” P8 made a second decision without confidence in case 1 after observing an apparent weakening in a storm of interest, which was evidenced by the “lowering hail core to less than 20 kft.”
Misinformed decisions, which were incorrect but made with confidence, made up the second largest decision-type category for both the control and experimental groups. Whereas all control participants made at least one incorrect decision with confidence, only four experimental participants did so (Fig. 9b). No key decisions made by P10 or P12 were categorized as misinformed. Across the two cases, the experimental group’s misinformed decisions were distributed evenly, whereas the control group’s occurred predominantly during case 1. Most incorrect yet confident decisions made by the control (N = 11 of 13) and experimental (N = 10 of 11) groups were made with the belief that severe weather was a threat. Typically within the forecast office, warning criteria are established based on experience and climatology. Many participants applied their usual warning criteria to the storms they encountered during these cases. For example, in case 1, P7 reported seeing “60 dBZ above 20 kft,” which she explained “fit [her] conceptual model for [severe] hail.” This warning criterion was common among participants, because, as P9 explained during case 2, “hail is very predictable when the core is that high.” However, given that warning criteria are established with respect to a certain location, participants’ warning criteria may not have been as suited to the environment in Oklahoma, ultimately leading to participants making incorrect decisions with confidence. Misinformed decisions were also recorded twice in the control group and once in the experimental group for participants who had decided to trim the warning polygon since the storm had moved “out of the county” (P1). Although confidence was associated with the decision to cancel a threat in some location, these three participants chose to incorrectly maintain the severe threat elsewhere in the warning polygon, resulting in false alarms at the reidentification level.
More than half of the decisions made by both groups fell into the mastery decision category. In total, the control and experimental groups made 27 and 34 confident and correct decisions, respectively (Fig. 9b). At least three key decisions made by each participant were categorized as mastery. A maximum of eight key decisions were categorized as mastery for one participant in each group (P1 and P11; Fig. 9b). Mastery decisions were common in both cases, with approximately 40% occurring during case 1 and 60% during case 2. Explanations for confidence that was associated with correct decisions revolved around two reasons. The first reason was that participants compared storm characteristics on radar. For example, in case 1, P4 noted that the severe storm had a “much larger and deeper high-reflectivity core” than other storms, and P8 described the severe storm as being the “most intense” on radar. Similar to these observations, in case 2, P6 explained that the severe storm was “more impressive” than the storm to the north that he had already warned on. Making comparative observations of storms provided participants with confidence in their warning decisions. This type of reasoning was provided on 12 occasions by the control group, but only 4 occasions by the experimental group.
The second reason for mastery decisions was based on perceived severe radar signatures of specific storms. Participants observed features and trends of individual storms that justified their warning decisions. The experimental group made confident and correct decisions using this reasoning on 30 occasions compared to 15 occasions by the control group. One possible reason that the experimental group provided this reasoning twice as often as the control group is that the use of rapidly updating radar data aided experimental participants in obtaining more-detailed observations of storms. For example, in case 2, P7 saw that the severe storm was “increasing in intensity aloft,” leading to concern that there was “precipitation loading producing high winds near the ground.” Another example of specific storm interrogation was when P8 observed that the hail core was “[continuing] to grow on upper-level reflectivity,” while the “midlevel convergence signature [was] getting stronger and stronger.” Examples such as these demonstrate the experimental group’s ability to track individual characteristics of storms, which was sufficient for developing an understanding of the storm dynamics and correctly projecting the occurrence of severe weather.
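The decision-type counts reported in this section can be cross-checked against the group totals and mastery proportions with simple arithmetic. The short Python sketch below uses only numbers stated in the text; the dictionary layout and function name are illustrative choices, not part of the study.

```python
# Counts of key decisions by type, as reported in the text (Fig. 9b).
decision_counts = {
    "control": {"doubtful": 5, "uninformed": 8, "misinformed": 13, "mastery": 27},
    "experimental": {"doubtful": 4, "uninformed": 5, "misinformed": 11, "mastery": 34},
}

# Occasions on which each reason was given for a mastery decision.
mastery_reasons = {
    "control": {"storm_comparison": 12, "individual_storm_signatures": 15},
    "experimental": {"storm_comparison": 4, "individual_storm_signatures": 30},
}


def summarize(group):
    """Return total key decisions and the mastery percentage for one group."""
    counts = decision_counts[group]
    total = sum(counts.values())
    mastery_pct = round(100 * counts["mastery"] / total)
    return total, mastery_pct


for group in decision_counts:
    total, mastery_pct = summarize(group)
    reasons_total = sum(mastery_reasons[group].values())
    print(group, total, mastery_pct, reasons_total)
```

The totals (53 and 54), the mastery percentages (51% and 63%), and the reason counts (27 and 34) all agree with the values quoted in the text.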
5. Discussion and summary
The purpose of the 2013 PARISE was to extend the work of earlier experiments (Heinselman et al. 2012; Heinselman and LaDue 2013) to investigate whether the use of higher-temporal-resolution radar data during severe hail and wind events would be beneficial to the warning decision process of NWS forecasters. The experiment design allowed for a comparison between control and experimental participants, who utilized PAR data with temporal updates of 5 and 1 min, respectively. While working two severe hail and/or wind case studies in simulated real time, all participants exhibited a decision process that was composed of multiple components. Observing participants detecting, identifying, and reidentifying severe weather led to the designation of the compound warning decision process. This process introduced a new verification approach, where the accuracy of warning decisions was considered with respect to the detection, identification, and reidentification of severe weather. This verification approach was important for fully understanding and comparing each group’s performance, since warning decisions and perceived severe weather threats changed as storms evolved with time. Given that the elements that compose the compound warning decision process are a part of real-time warning operations, we suggest that evaluating forecaster warning decisions beyond the detection level may provide a more thorough assessment of forecaster performance for the duration of a severe weather event, rather than for the initial warning decision only.
The POD and FAR statistics were calculated for all three stages of the compound warning decision process (Table 3). Overall, the experimental group made more accurate warning decisions than did the control group. The experimental group also made more timely warnings (Table 4), as demonstrated by the significantly higher (p value = 0.0252) median lead time obtained by the experimental group (21.5 min) compared to the control group (17.3 min). The finding that the experimental group made more accurate and timely warning decisions was not necessarily expected, since earlier studies have shown that the skill of operational meteorologists did not increase notably with increased information (Stewart et al. 1992; Heideman et al. 1993). Research has also shown that increasing the amount of information a decision maker receives may increase confidence and satisfaction, yet decrease actual performance (O’Reilly 1980). This effect was not observed during the 2013 PARISE. Rather, the experimental group outperformed the control group through the use of 1-min radar updates.
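For readers less familiar with the verification statistics named above, the following Python sketch computes POD and FAR from contingency-table counts, assuming the standard definitions (POD as hits divided by hits plus misses, and FAR as false alarms divided by hits plus false alarms). The function names and the example counts are hypothetical and are not taken from Table 3.

```python
def probability_of_detection(hits, misses):
    """POD: fraction of observed events that were warned on."""
    events = hits + misses
    if events == 0:
        raise ValueError("POD is undefined when there are no observed events")
    return hits / events


def false_alarm_ratio(hits, false_alarms):
    """FAR: fraction of warnings that were not followed by an event."""
    warnings = hits + false_alarms
    if warnings == 0:
        raise ValueError("FAR is undefined when no warnings were issued")
    return false_alarms / warnings


# Hypothetical counts for one stage of the compound warning decision process.
hits, misses, false_alarms = 8, 2, 2
print(probability_of_detection(hits, misses))
print(false_alarm_ratio(hits, false_alarms))
```

Under the compound warning decision process, the same two functions would be applied separately to counts for the detection, identification, and reidentification stages.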
The findings from each experiment support the use of higher-temporal-resolution radar data during warning operations, with increased lead time being a consistent finding across all three experiments (Heinselman et al. 2012; Heinselman and LaDue 2013). However, a limitation of the 2013 PARISE, along with the 2010 and 2012 PARISE, is the sample size. Given that a total of 12 participants were recruited for each experiment, and that each PARISE focused on a particular weather threat, the generalizability of the results to the wider forecasting community is questionable. Future experiments should include a wider variety of cases that together are more representative of what forecasters encounter in the real world.
Participants’ decisions were also assessed with respect to both confidence and correctness. Rather than simply identifying decisions as right or wrong, the goal of this confidence-based assessment was to categorize decisions into four different types, namely, doubtful, uninformed, misinformed, and mastery (Bruno et al. 2006; Adams and Ewen 2009). Both groups made decisions that fell into each category. However, while the control group made slightly more doubtful, uninformed, and misinformed decisions than the experimental group, the experimental group made more mastery decisions than the control group. Qualitative reasoning for each confidence rating was important for understanding the factors that led to each decision type. The reasons leading to uninformed and misinformed decisions highlight some of the limitations associated with working in a simulated environment. Not having available radar data prior to the case start time resulted in control participants making incorrect decisions without confidence, while a change in geographic location and therefore unsuitable application of warning criteria resulted in both groups making incorrect decisions with confidence. Avoiding limitations such as these could be accomplished by experimenting with the use of PAR data during real-time operations in the local forecast office. Mastery decisions resulted from participants either making a comparison between storms or observing and tracking individual storm characteristics. While both reasons explained the confident and correct decisions made by the control group, the mastery decisions in the experimental group were predominantly explained by the latter reason. As discussed previously, LaDue et al. (2010) reported that forecasters expressed a need for faster radar updates in order to observe rapidly evolving weather. 
The qualitative reasoning provided for mastery decisions suggests that the experimental group’s ability to observe storm evolution on a finer temporal scale was enhanced through the use of 1-min radar updates.
Acknowledgments
Thank you to the 12 NWS forecasters for participating in this study, to the participating WFOs’ MICs for supporting recruitment, and to Michael Scotten for participating in the pilot experiment. We also thank A/V specialist James Murnan, software expert Eddie Forren, and GIS expert Ami Arthur. Advice from committee members Robert Palmer and David Parsons, along with insightful discussion with Harold Brooks and Lans Rothfusz, aided the development of this study. We are grateful to Kurt Hondl, Michael Scotten, and the two anonymous reviewers for providing comments on this paper. Funding was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA11OAR4320072, U.S. Department of Commerce.
REFERENCES
Adams, T. M., and G. W. Ewen, 2009: The importance of confidence in improving educational outcomes. 25th Annual Conf. on Distance Teaching and Learning, Madison, WI, University of Wisconsin–Madison, 1–5.
Andra, D. L., E. M. Quoetone, and W. F. Bunting, 2002: Warning decision making: The relative roles of conceptual models, technology, strategy, and forecaster expertise on 3 May 1999. Wea. Forecasting, 17, 559–566, doi:10.1175/1520-0434(2002)017<0559:WDMTRR>2.0.CO;2.
Atkins, N. T., and R. M. Wakimoto, 1991: Wet microburst activity over the southeastern United States: Implications for forecasting. Wea. Forecasting, 6, 470–482, doi:10.1175/1520-0434(1991)006<0470:WMAOTS>2.0.CO;2.
Bruno, J. E., 1993: Using testing to provide feedback to support instruction: A reexamination of the role of assessment in educational organizations. Item Banking: Interactive Testing and Self-Assessments, D. A. Leclercq and J. E. Bruno, Eds., Springer-Verlag, 190–209.
Bruno, J. E., C. J. Smith, P. G. Engstrom, T. M. Adams, K. D. Warr, M. J. Cushman, B. D. Webster, and F. M. Bollin, 2006: Method and system for knowledge assessment using confidence-based measurement. U.S. Patent 2006/0029920 A1, filed 23 July 2005, issued 9 February 2006.
Campbell, S. D., and M. A. Isaminger, 1990: A prototype microburst prediction product for the Terminal Doppler Weather Radar. Preprints, 16th Conf. on Severe Local Storms, Kananaskis Park, AB, Canada, Amer. Meteor. Soc., 393–396.
Crum, T., S. D. Smith, J. N. Chrisman, R. E. Saffle, R. W. Hall, and R. J. Vogt, 2013: WSR-88D radar projects–Update 2013. Proc. 29th Conf. on Environmental Information Processing Technologies, Austin, TX, Amer. Meteor. Soc., 8.1. [Available online at https://ams.confex.com/ams/93Annual/webprogram/Paper221461.html.]
Duncan, M., 2006: A signal detection model of compound decision tasks. Defense Research and Development Canada Tech. Rep. TR2006–256, 56 pp.
Forsyth, D. E., and Coauthors, 2005: The National Weather Radar Testbed (phased array). Preprints, 32nd Conf. on Radar Meteorology, Albuquerque, NM, Amer. Meteor. Soc., 12R.3. [Available online at https://ams.confex.com/ams/pdfpapers/96377.pdf.]
Fujita, T. T., and R. Wakimoto, 1983: Microbursts in JAWS depicted by Doppler radars, PAM and aerial photographs. Preprints, 21st Conf. on Radar Meteorology, Edmonton, AB, Canada, Amer. Meteor. Soc., 19–23.
Heideman, K. F., T. R. Stewart, W. R. Moninger, and P. Reagan-Cirincione, 1993: The Weather Information and Skill Experiment (WISE): The effect of varying levels of information on forecast skill. Wea. Forecasting, 8, 25–36, doi:10.1175/1520-0434(1993)008<0025:TWIASE>2.0.CO;2.
Heinselman, P. L., and S. M. Torres, 2011: High-temporal-resolution capabilities of the National Weather Radar Testbed Phased-Array Radar. J. Appl. Meteor. Climatol., 50, 579–593, doi:10.1175/2010JAMC2588.1.
Heinselman, P. L., and D. S. LaDue, 2013: Supercell storm evolution observed by forecasters using PAR data. Proc. 36th Conf. on Radar Meteorology, Breckenridge, CO, Amer. Meteor. Soc., 3B.4. [Available online at https://ams.confex.com/ams/36Radar/webprogram/Paper228747.html.]
Heinselman, P. L., D. L. Priegnitz, K. L. Manross, T. M. Smith, and R. W. Adams, 2008: Rapid sampling of severe storms by the National Weather Radar Testbed Phased Array Radar. Wea. Forecasting, 23, 808–824, doi:10.1175/2008WAF2007071.1.
Heinselman, P. L., D. S. LaDue, and H. Lazrus, 2012: Exploring impacts of rapid-scan radar data on NWS warning decisions. Wea. Forecasting, 27, 1031–1044, doi:10.1175/WAF-D-11-00145.1.
Jacoby, J., T. Troutman, A. Kuss, and D. Mazursky, 1986: Experience and expertise in complex decision making. Adv. Consum. Res., 13, 469–472.
Kelly, D. L., J. T. Schaefer, and C. A. Doswell III, 1985: Climatology of nontornadic severe thunderstorm events in the United States. Mon. Wea. Rev., 113, 1997–2014, doi:10.1175/1520-0493(1985)113<1997:CONSTE>2.0.CO;2.
LaDue, D. S., P. L. Heinselman, and J. F. Newman, 2010: Strengths and limitations of current radar systems for two stakeholder groups in the southern plains. Bull. Amer. Meteor. Soc., 91, 899–910, doi:10.1175/2009BAMS2830.1.
McLachlan, G. J., 1999: Mahalanobis distance. Resonance, 4, 20–26, doi:10.1007/BF02834632.
NOAA, 2011: Verification. NWS Rep. NWSI 10-1601, 100 pp. [Available online at http://www.nws.noaa.gov/directives/sym/pd01016001curr.pdf.]
Obuchowski, N. A., M. L. Lieber, and K. A. Powell, 2000: Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad. Radiol., 7, 516–525, doi:10.1016/S1076-6332(00)80324-4.
O’Reilly, C. A., 1980: Individuals and information overload in organizations: Is more necessarily better? Acad. Manage. J., 23, 684–696, doi:10.2307/255556.
Ortega, K. L., T. M. Smith, K. L. Manross, K. A. Scharfenberg, A. Witt, A. G. Kolodziej, and J. J. Gourley, 2009: The Severe Hazards Analysis and Verification Experiment. Bull. Amer. Meteor. Soc., 90, 1519–1530, doi:10.1175/2009BAMS2815.1.
Roberts, R. D., and J. W. Wilson, 1989: A proposed microburst nowcasting procedure using single-Doppler radar. J. Appl. Meteor., 28, 285–303, doi:10.1175/1520-0450(1989)028<0285:APMNPU>2.0.CO;2.
Saffle, R. E., M. J. Istok, and G. Cate, 2009: NEXRAD product improvement—Update 2009. 25th Conf. on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, Phoenix, AZ, Amer. Meteor. Soc., 10B.1. [Available online at https://ams.confex.com/ams/pdfpapers/147971.pdf.]
Stewart, T. R., K. F. Heideman, W. R. Moninger, and P. Reagan-Cirincione, 1992: Effects of improved information on the components of skill in weather forecasting. Organ. Behav. Hum. Decis. Processes, 53, 107–134, doi:10.1016/0749-5978(92)90058-F.
Torres, S. M., and Coauthors, 2012: ADAPTS Implementation: Can we exploit phased-array radar’s electronic beam steering capabilities to reduce update time? Extended Abstract, 28th Conf. on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, New Orleans, LA, Amer. Meteor. Soc., 6B.3. [Available online at https://ams.confex.com/ams/92Annual/webprogram/Paper196416.html.]
Trapp, R. J., D. M. Wheatley, N. T. Atkins, R. W. Przybylinski, and R. Wolf, 2006: Buyer beware: Some words of caution on the use of severe wind reports in postevent assessment and research. Wea. Forecasting, 21, 408–415, doi:10.1175/WAF925.1.
Whiton, R. C., P. L. Smith, S. G. Bigler, K. E. Wilk, and A. C. Harbuck, 1998: History of operational use of weather radar by U.S. weather services. Part II: Development of operational Doppler weather radars. Wea. Forecasting, 13, 244–252, doi:10.1175/1520-0434(1998)013<0244:HOOUOW>2.0.CO;2.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.
Witt, A., M. D. Eilts, G. J. Stumpf, E. D. Mitchell, J. T. Johnson, and K. W. Thomas, 1998: Evaluating the performance of WSR-88D severe storm detection algorithms. Wea. Forecasting, 13, 513–518, doi:10.1175/1520-0434(1998)013<0513:ETPOWS>2.0.CO;2.
Zrnić, D. S., and Coauthors, 2007: Agile beam phased array radar for weather observations. Bull. Amer. Meteor. Soc., 88, 1753–1766, doi:10.1175/BAMS-88-11-1753.