Abstract

There are numerous challenges in forecasting and detecting flash floods, one of the deadliest weather phenomena in the United States. Statistical metrics of flash flood warnings over recent years depict generally stagnant warning performance, and the regional flash flood guidance used in warning operations was shown to have low skill scores. The Hydrometeorological Testbed—Hydrology (HMT-Hydro) experiment was created to allow operational forecasters to assess emerging products and techniques designed to improve the prediction and warning of flash flooding. Scientific goals of the HMT-Hydro experiment included the evaluation of gridded products from the Multi-Radar Multi-Sensor (MRMS) and Flooded Locations and Simulated Hydrographs (FLASH) product suites, including the experimental Coupled Routing and Excess Storage (CREST) model; the application of user-defined probabilistic forecasts in experimental flash flood watches and warnings; and the utility of the Hazard Services software interface with flash flood recommenders in real-time experimental warning operations. The HMT-Hydro experiment ran in collaboration with the Flash Flood and Intense Rainfall (FFaIR) experiment at the Weather Prediction Center to simulate the real-time workflow between a national center and a local forecast office, as well as to facilitate discussions on the challenges of short-term flash flood forecasting. Results from the HMT-Hydro experiment highlighted the utility of MRMS and FLASH products in identifying the spatial coverage and magnitude of flash flooding; the experiment also evaluated the perception and reliability of probabilistic forecasts in flash flood watches and warnings.

NOAA/National Severe Storms Laboratory and National Weather Service forecasters evaluate new tools and techniques through real-time test bed operations for the improvement of flash flood detection and warning operations.

Flooding is one of the deadliest weather-related phenomena in the United States. Over the 20-yr period from 1995 to 2014, flash flooding, river flooding, and coastal flooding combined contributed to an average of 77.3 fatalities per year. This was the third highest average fatality rate per year, behind tornadoes and extreme heat (www.nws.noaa.gov/os/hazstats/resources/weather_fatalities.pdf). Flash flooding is the greatest contributor to flood-related fatalities (Ashley and Ashley 2008). A flash flood is defined by the National Oceanic and Atmospheric Administration (NOAA)/National Weather Service (NWS) as a “rapid and extreme flow of high water into a normally dry area, or a rapid water level rise in a stream or creek above a predetermined flood level, beginning within six hours of the causative event” (NWS 2015). Most flash flood events are generated from intense rainfall rates over a prolonged duration at a given point; however, flash floods can also be induced by rapid snowmelt, ice jams, or structural failures.

Forecasting a flash flood event requires both meteorological and hydrologic information. Identifying synoptic and mesoscale patterns and ingredients conducive to significant rainfall totals, along with analyzing model quantitative precipitation forecasts (QPFs), helps generate local precipitation forecasts and determine the potential for flash flooding. Precipitation measurements from automated gauges can provide accurate point observations during an event; however, gauge networks generally lack sufficient density to capture the spatial variability of precipitation (e.g., Goodrich et al. 1995). Quantitative precipitation estimates (QPEs) from radar can provide high spatial resolution and coverage but have inherent limitations, including ground clutter and other nonmeteorological echoes (Harrison et al. 2000), beam blockage (Young et al. 1999), and bright banding within the melting layer (e.g., Smith 1986). Hydrologic ingredients, such as soil type, degree of saturation, land use, and geometric basin characteristics, influence how much rainfall becomes surface runoff and potential flash flooding.

The NWS River Forecast Centers (RFCs) use these hydrologic ingredients to develop a flash flood guidance (FFG) product, which estimates for each basin the amount of rainfall over a given duration needed to produce bankfull conditions on small creeks and streams (Sweeney 1992). Regional NWS RFCs employ various methods for generating FFG to assist forecasters at NWS Weather Forecast Offices (WFOs) in the prediction of flash flooding (Clark et al. 2014). This combination of precipitation information from gauges and radar during an event, along with FFG values, forms the basis for flash flood detection and warning decision-making; however, the FFG product has varying skill across the conterminous United States (CONUS). Clark et al. (2014) determined that the regional critical success index (CSI) of the NWS FFG product ranged from 0.00 to 0.19 when evaluated against NWS Storm Data reports and from 0.00 to 0.44 when evaluated against U.S. Geological Survey (USGS) stream gauge measurements.
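
The CSI values above, and the POD and FAR metrics shown in Fig. 1, all derive from a 2 × 2 contingency table of hits, misses, and false alarms. The minimal Python sketch below uses the standard definitions; the counts are hypothetical and are not taken from Clark et al. (2014) or the NWS Performance Management System. It also checks the algebraic identity CSI = 1/(1/POD + 1/(1 − FAR) − 1), which recovers CSI when only POD and FAR are reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Counts from a 2x2 verification table (example values below are hypothetical)."""
    hits: int          # warned and observed
    misses: int        # observed but not warned
    false_alarms: int  # warned but not observed

    def pod(self) -> float:
        # Probability of detection: fraction of observed events that were warned.
        return self.hits / (self.hits + self.misses)

    def far(self) -> float:
        # False-alarm ratio: fraction of warnings with no observed event.
        return self.false_alarms / (self.hits + self.false_alarms)

    def csi(self) -> float:
        # Critical success index: hits over all warned-or-observed cases.
        return self.hits / (self.hits + self.misses + self.false_alarms)


def csi_from_pod_far(pod: float, far: float) -> float:
    """Recover CSI from POD and FAR alone; algebraically equal to Contingency.csi."""
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)


if __name__ == "__main__":
    table = Contingency(hits=30, misses=70, false_alarms=60)  # hypothetical counts
    print(f"POD={table.pod():.2f} FAR={table.far():.2f} CSI={table.csi():.4f}")
```

With these made-up counts, POD = 0.30, FAR ≈ 0.67, and CSI = 0.1875 by either route.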

Despite recent advances in precipitation estimation and hydrologic modeling, challenges remain in predicting flash flooding and determining its potential magnitude. The NWS issued an average of 4,040 flash flood warnings (FFWs) per year in the storm-based warning (SBW) era from 2008 to 2014. Annual statistical metrics of FFWs for this period were obtained from the NWS Performance Management System (https://verification.nws.noaa.gov/). Trends in FFW statistical metrics showed generally stagnant warning performance (Fig. 1). Approximately 1,808 FFWs (about 44.8% of all FFWs) per year went unverified (Fig. 2a). Of the approximately 3,600 flash flood events per year in the United States, an average of 460 (12.8%) went completely unwarned (Fig. 2b). An additional 1,062 flash flood events (29.5%) per year were only partially warned, meaning that part of the area experiencing flash flooding fell outside the warned area (Fig. 2b).
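
The percentages above follow directly from the annual averages. This short sketch reproduces them and derives one figure the text does not state: the number of fully warned events, about 2,078 per year (57.7%), which is only approximate because the 3,600-event total is itself approximate.

```python
# Annual averages for 2008-2014 as quoted in the text.
FFW_PER_YEAR = 4040
UNVERIFIED_FFW_PER_YEAR = 1808
EVENTS_PER_YEAR = 3600          # "approximately 3,600" events
UNWARNED_EVENTS = 460
PARTIALLY_WARNED_EVENTS = 1062


def pct(part: float, whole: float) -> float:
    """Percentage rounded to one decimal place, matching the text's precision."""
    return round(100.0 * part / whole, 1)


# Derived, not stated in the text: events that were neither unwarned nor partially warned.
fully_warned = EVENTS_PER_YEAR - UNWARNED_EVENTS - PARTIALLY_WARNED_EVENTS

print(pct(UNVERIFIED_FFW_PER_YEAR, FFW_PER_YEAR))     # share of FFWs unverified
print(pct(UNWARNED_EVENTS, EVENTS_PER_YEAR))          # share of events unwarned
print(pct(PARTIALLY_WARNED_EVENTS, EVENTS_PER_YEAR))  # share of events partially warned
print(fully_warned, pct(fully_warned, EVENTS_PER_YEAR))
```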

Fig. 1.

Average annual FFW (a) probability of detection (POD), (b) false-alarm ratio (FAR), (c) critical success index (CSI), and (d) lead time from 2008 to 2014. The dashed gray lines represent the linear trend line for each statistical metric. The annual trend of each statistical metric is displayed in the top-right corner of each panel. FFW metric data were obtained from the NWS Performance Management System.

Fig. 2.

(a) Number of verified (green) and unverified (gray) FFWs per year from 2008 to 2014. (b) Number of warned (blue), partially warned (purple), and unwarned (red) flash flood events per year for the same period. FFW verification data were obtained from the NWS Performance Management System.

The NOAA/National Severe Storms Laboratory (NSSL) is developing new technologies and applications for improved QPE and flash flood prediction. The Multi-Radar Multi-Sensor (MRMS) system combines seamless, quality-controlled radar mosaics with hourly gauge observations, numerical weather prediction, and background climatologies to create QPE products with high spatiotemporal resolution (Zhang et al. 2016). Built upon the MRMS framework is the Flooded Locations and Simulated Hydrographs (FLASH) system (Gourley et al. 2017). The FLASH system uses the MRMS Radar-Only QPE (Q3RAD; Zhang et al. 2016) as the precipitation input for multiple distributed hydrologic model cores over the CONUS at a resolution of 0.01° × 0.01° (approximately 1 km × 1 km) to produce flash flood forecasts every 2–10 min. The Ensemble Framework for Flash Flood Forecasting (EF5) within the FLASH system includes the experimental Coupled Routing and Excess Storage (CREST; Wang et al. 2011), the Sacramento Soil Moisture Accounting (SAC-SMA; Burnash et al. 1973), and hydrophobic (HP) water balance models coupled with kinematic wave routing. The primary CREST output is the CREST maximum unit streamflow, which is discharge normalized by the upstream drainage area (m³ s⁻¹ km⁻²; Gourley et al. 2017). The CREST maximum unit streamflow product is most directly applicable to flash flood detection because it highlights areas experiencing significant flows (Gourley et al. 2017). Other model and QPE-related products are available within the FLASH system, including the following:

  • QPE average recurrence intervals (ARIs) of the MRMS Q3RAD product based on gridded output derived from the NOAA Atlas 14 point precipitation frequency estimates (Perica et al. 2013)

  • QPE-to-FFG ratios calculated between the MRMS Q3RAD values and the NWS FFG values

  • Gridded soil moisture saturation values from the CREST and SAC-SMA models

  • Precipitable water values and anomalies based on radiosonde observations (raobs) and the Rapid Refresh (RAP) model
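
The ARI and QPE-to-FFG products listed above reduce to simple per-cell operations. In the sketch below, the 3-h precipitation-frequency depths are invented placeholders standing in for one grid cell of the gridded NOAA Atlas 14 estimates, and the lookup returns the largest tabulated ARI that the QPE meets or exceeds. The operational FLASH product may interpolate between return periods, so treat this step-function lookup as an assumption.

```python
from bisect import bisect_right

# Hypothetical 3-h precipitation-frequency depths (mm) for one grid cell, keyed by ARI (yr).
# Real values would come from gridded NOAA Atlas 14 estimates; these numbers are made up.
ARI_YEARS = [1, 2, 5, 10, 25, 50, 100]
DEPTH_MM = [38.0, 45.0, 56.0, 65.0, 78.0, 88.0, 99.0]


def qpe_ari(qpe_mm: float) -> int:
    """Largest tabulated ARI whose depth the QPE meets or exceeds (0 if below the 1-yr depth)."""
    idx = bisect_right(DEPTH_MM, qpe_mm)  # number of depths <= qpe_mm
    return 0 if idx == 0 else ARI_YEARS[idx - 1]


def qpe_to_ffg_percent(qpe_mm: float, ffg_mm: float) -> float:
    """QPE expressed as a percentage of flash flood guidance for the same duration."""
    return 100.0 * qpe_mm / ffg_mm


print(qpe_ari(60.0), qpe_to_ffg_percent(60.0, 75.0))
```

For a 60-mm accumulation against this made-up table and a 75-mm FFG, the cell maps to the 5-yr ARI and an 80% QPE-to-FFG ratio.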

THE HMT-HYDRO EXPERIMENT.

The Hydrometeorological Testbed—Hydrology (HMT-Hydro) experiment provided an opportunity to evaluate MRMS and FLASH products prior to their transition from research to operations. In addition, the HMT-Hydro experiment facilitated the development and assessment of emerging technologies and operational “best practices” in support of flash flood prediction and warning decision-making. The NOAA Hazardous Weather Testbed (HWT) Experimental Warning Program (EWP) hosted the inaugural HMT-Hydro experiment in 2014. In 2015, the U.S. Weather Research Program (USWRP) HMT hosted the second annual HMT-Hydro experiment. In both years, the HMT-Hydro experiment took place at the National Weather Center (NWC) in Norman, Oklahoma, during the climatologically favorable flash flood period from July through early August. Participating forecasters came from NWS WFOs and RFCs across the CONUS.

The scientific goals of the HMT-Hydro experiment focused on multiple aspects of the prediction and warning of flash flood events, where participating forecasters evaluated the following:

  • The skill of experimental short-term flash flood prediction tools from the FLASH product suite, along with QPE-related products from MRMS (Table 1)

  • The benefit of lead time versus the potential loss of spatial accuracy and magnitude of experimental FFWs through the use of QPFs as forcing in EF5

  • The utility of communicating the uncertainty and magnitude of potential flash flooding through probabilistic forecasts of nuisance and major flash flooding within experimental flash flood watches (FFAs) and FFWs

  • The utility of forecasting tools, such as the Hazard Services software interface and experimental flash flood recommenders

Table 1.

List of MRMS and FLASH products evaluated during the HMT-Hydro experiment with product scale and spatiotemporal resolution. The spatial resolution of 0.01° × 0.01° is approximately 1 km × 1 km. The temporal resolution of the products made available in the HMT-Hydro experiment varied from the operational release of the products. See Gourley et al. (2017) for notes about the temporal resolution of the FLASH products and Zhang et al. (2017) for the MRMS products. Products that underwent specific evaluations in the HMT-Hydro experiment are denoted by an asterisk.

Evaluations related to the scientific goals were conducted both during and after experimental operations. A “Tales from the Testbed” webinar series hosted by the NWS Warning Decision Training Division (WDTD) allowed forecasters to present their observations and findings from the HMT-Hydro experiment to the operational community at the end of each forecast week. The weekly “Tales from the Testbed” webinars were recorded and are available online (at https://blog.nssl.noaa.gov/flash/hwt-hydro/). The complete schedule for all activities during a typical week in the HMT-Hydro experiment is presented in Table 2.

Table 2.

Schedule of activities during a typical week of the HMT-Hydro experiment.

Participating forecasters used MRMS and FLASH products in their warning decision-making process and issued experimental FFAs and FFWs. They were asked to define the probability of nuisance and major flash flooding within each experimental FFA and FFW. A nuisance flash flood was defined as overflowing creeks or streams, cropland/yard flooding, road flooding, road closures, stranded vehicles, or USGS stream gauges reaching minor flood stage. A major flash flood was defined as inundated structures, vehicles/structures that have been swept away, evacuations, numerous water rescues, or USGS stream gauges reaching major flood stage or record levels. Forecasters could assign probabilities in 25% intervals in the 2014 HMT-Hydro experiment; based on forecaster feedback, the interval changed to 1% in 2015. Experimental FFAs and FFWs were issued across the CONUS without the restriction of geopolitical boundaries (i.e., polygons could extend beyond an NWS WFO county warning area). Experimental FFAs were generally valid for approximately 6 h from the time of issuance, comparable to convective watches issued by the Storm Prediction Center (SPC). Experimental FFWs were generally valid for approximately 3 h from the time of issuance.
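
The probability rules above can be encoded as a validation step. The record type and field names below are hypothetical, not the Hazard Services data model. The check that the major probability does not exceed the nuisance probability is an assumption inferred from the verification rule that a major report also satisfies the nuisance criteria; the text does not say forecasters were required to enforce it.

```python
from dataclasses import dataclass

# Allowed probability steps by experiment year, per the text (25% in 2014, 1% in 2015).
STEP_BY_YEAR = {2014: 25, 2015: 1}


@dataclass(frozen=True)
class ExperimentalProduct:
    """Hypothetical record for an experimental FFA or FFW; field names are illustrative."""
    kind: str            # "FFA" or "FFW"
    year: int
    p_nuisance: int      # percent
    p_major: int         # percent
    valid_hours: float   # informational: ~6 h for FFAs, ~3 h for FFWs in the text

    def validate(self) -> None:
        """Raise ValueError if the probabilities break the sketch's rules."""
        step = STEP_BY_YEAR[self.year]
        for name, p in (("nuisance", self.p_nuisance), ("major", self.p_major)):
            if not 0 <= p <= 100 or p % step:
                raise ValueError(f"{name} probability {p}% is not a multiple of {step}% in [0, 100]")
        # Assumption: a major flash flood also meets nuisance criteria,
        # so P(major) should not exceed P(nuisance).
        if self.p_major > self.p_nuisance:
            raise ValueError("major probability exceeds nuisance probability")
```

For example, a 2015 FFW with 63% nuisance and 12% major probabilities passes, while a 2014 FFA with a 60% nuisance probability is rejected because 60 is not a multiple of 25.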

Subjective performance evaluations of key MRMS and FLASH products, experimental FFAs and FFWs, and the probabilistic information within FFAs and FFWs occurred on each day following operations. Flash flood reports used in these evaluations included NWS local storm reports (LSRs), USGS stream gauges, the Meteorological Phenomena Identification Near the Ground (mPING) project (Elmore et al. 2014), and the Severe Hazards Analysis and Verification Experiment (SHAVE; Ortega et al. 2009). Experimental FFAs and FFWs were subjectively assessed on the skill of the user-defined probabilistic forecasts of nuisance and major flash flooding and on their coverage relative to operational FFAs and FFWs. A single flash flood report was sufficient to verify an experimental watch or warning, and the details of that report determined whether the event verified as nuisance or major. A major flash flood report also satisfied the nuisance flash flood criteria.
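
The verification rules in this paragraph can be written as a small function. This sketch assumes the reports have already been matched to the product's polygon and valid period, which is not shown; the `Severity` ordering encodes the rule that a major report also verifies the nuisance category.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Report severity; higher values also satisfy the lower categories."""
    NUISANCE = 1
    MAJOR = 2


def verification(reports: list[Severity]) -> dict[str, bool]:
    """Which categories a watch/warning verified, given the reports matched to it."""
    worst = max(reports, default=None)  # most severe report, or None if there are none
    return {
        "product": worst is not None,  # a single report of any kind verifies the product
        "nuisance": worst is not None and worst >= Severity.NUISANCE,
        "major": worst is not None and worst >= Severity.MAJOR,
    }
```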

The HMT-Hydro experiment also focused on the user-centered design of the forecasting tools. Forecasters used the System Usability Scale (SUS; Brooke 1996) to assess the usability and learnability of the FLASH product suite in 2014. The usability of the prototype Hazard Services interface and an experimental flash flood recommender algorithm (described fully below) was evaluated in 2015 through a user-testing approach based on a Questionnaire for User Interaction Satisfaction (QUIS; Chin et al. 1988). Evaluations occurred on the first day of each experimental week in order to assess how novice users interact with the new products and software, while the assessment of utilizing flash flood recommenders within the flash flood warning process was conducted throughout the week.
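
For readers unfamiliar with the SUS, its standard scoring (Brooke 1996) converts ten 1–5 Likert responses to a 0–100 score: odd-numbered (positively worded) items contribute the response minus 1, even-numbered (negatively worded) items contribute 5 minus the response, and the sum is multiplied by 2.5. The function below implements that scoring; it does not reproduce any scores from the experiment.

```python
def sus_score(responses: list[int]) -> float:
    """System Usability Scale score (0-100) from ten 1-5 responses, item 1 first."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs exactly ten responses on a 1-5 scale")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd items are positively worded; even items are negatively worded.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5
```

A respondent answering 3 ("neutral") to every item scores 50; strongly agreeing with every positive item and strongly disagreeing with every negative item scores 100.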

The HMT-Hydro experiment operated in conjunction with the Flash Flood and Intense Rainfall (FFaIR) experiment at the Hydrometeorology Testbed at the Weather Prediction Center (HMT-WPC; Barthold et al. 2015). The FFaIR experiment used experimental high-resolution convection-allowing models (CAMs) and probabilistic flash flood forecasting tools to generate probabilistic heavy rainfall and flash flood forecasts. This collaboration simulated a real-time workflow between the issuance of forecast and guidance products at the WPC and the issuance of FFAs and FFWs at an NWS WFO. Running the two experiments together also facilitated collaborative discussions among the operational forecasting, research, and academic communities on the challenges of short-term flash flood forecasting.

Figure 3 diagrams the workflow and feedback between the HMT-Hydro experiment and the FFaIR experiment. On days when the HMT-Hydro experiment conducted operations, participating forecasters and scientists with the FFaIR experiment first held a hydrometeorological forecast discussion at 1800 UTC. These forecast discussions included probabilistic excessive rainfall outlooks for the 1500–1200 UTC time frame, probabilistic flash flood forecasts for both the 1800–0000 UTC and the 0000–0600 UTC periods, and a domain-limiting discussion (DLD), which is similar to the operational mesoscale precipitation discussion (MPD) issued by WPC, for the region of greatest flash flood potential from 1800 to 0000 UTC. The detailed forecast discussions and associated probabilistic products focused the areas of responsibility for operations at the HMT-Hydro experiment and the issuance of experimental FFAs by forecasters. Qualitative evaluations of and feedback on the forecast discussions and probabilistic forecast products were delivered to the FFaIR experiment the following day.

Fig. 3.

Diagram of simulated workflow between the HMT-Hydro and FFaIR experiments.

ASSESSMENT OF MRMS AND FLASH PRODUCTS.

Four MRMS and FLASH products were subjectively rated on their ability to capture the spatial coverage and potential magnitude of one notable verified flash flood event per day: MRMS Q3RAD, QPE ARIs, QPE-to-FFG ratios, and CREST maximum unit streamflow (Fig. 4). Products were rated on a scale from 0 to 100. The mean spatial coverage rankings of the four evaluated products were nearly identical; however, the QPE ARI and CREST maximum unit streamflow products had greater standard deviations. Rankings for the potential magnitude of flash flooding showed similar results. An analysis of variance (ANOVA) test found no significant difference among the product rankings for spatial coverage, F(3, 44) = 0.039, p = 0.99, or for magnitude, F(3, 44) = 0.677, p = 0.57. Although the subjective evaluations showed no statistically significant differences among these products, forecasters noted that using the MRMS and FLASH products collectively, rather than relying on a single experimental product, benefited the warning decision-making process.
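
The reported degrees of freedom, F(3, 44), are consistent with a one-way ANOVA across the four products with 48 ratings in total. The sketch below computes the F statistic and its degrees of freedom from raw ratings using only the standard library. It does not compute the p value, which requires the F distribution (for example, `scipy.stats.f_oneway` returns both the statistic and the p value).

```python
from statistics import fmean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA across independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of ratings around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Four products with 12 ratings each give df_between = 3 and df_within = 44, matching the degrees of freedom reported above.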

Fig. 4.

Box-and-whisker plots of subjective ranking of MRMS Q3RAD, QPE ARI, QPE-to-FFG ratio, and CREST maximum unit streamflow (forced by Q3RAD) to detect the (a) spatial coverage and (b) magnitude of flash flooding during the 2015 HMT-Hydro experiment. The top (bottom) of each box represents the 75th (25th) percentile with the line in the middle of each box representing the median subjective ranking value. The top (bottom) whisker represents the maximum (minimum) ranking. The black dot represents the mean subjective ranking. The mean (µ) and standard deviation (σ) values for each product are shown below each box-and-whisker plot.

The CREST maximum unit streamflow product had the greatest statistical variability in both the spatial coverage and magnitude evaluations. This variability was most evident between rural and urbanized locations. In urban areas, the CREST maximum unit streamflow product received average subjective rankings of 82.5 for spatial coverage and 80.0 for the potential magnitude of flash flooding, whereas in more rural areas the average rankings were 63.8 and 53.8, respectively. Participating forecasters noted the benefit of the CREST maximum unit streamflow product over urbanized regions because the CREST model accounts for impervious surfaces. A flash flood event in Wichita, Kansas, on 6 July 2015 demonstrated the potential of the CREST-based product over an urbanized area (Fig. 5). The MRMS Q3RAD 3-h accumulation of about 50–70 mm (2.00–2.75 in.) over the metropolitan region resulted in maximum QPE-to-FFG ratios of less than 80% and a QPE ARI of less than 5 years. While those product values would suggest a low flash flood potential, the highlighted region of CREST maximum unit streamflow values ≥ 1 m³ s⁻¹ km⁻² (100 ft³ s⁻¹ mi⁻²) corresponded well with reports of extensive street flooding and water rescues of stranded motorists.

Fig. 5.

Image of (a) MRMS Q3RAD 3-h accumulation (in.; 1 in. = 2.54 cm), (b) maximum QPE-to-FFG ratio (%), (c) maximum QPE ARI (yr), and (d) CREST maximum unit streamflow (ft³ s⁻¹ mi⁻²) as displayed in AWIPS-II at 2345 UTC 6 Jul 2015 during a flash flood event in Wichita. The red contour in (a)–(c) represents the approximate area of CREST maximum unit streamflow values ≥ 1 m³ s⁻¹ km⁻² (100 ft³ s⁻¹ mi⁻²). The light blue contour in (d) represents the AWIPS-defined urban boundary of Wichita. The white dots represent the real-time NWS LSRs of flash flooding.

Participating forecasters were asked to define threshold values of CREST maximum unit streamflow for use in the warning decision-making process. For the evaluated flash flood events, the minimum CREST maximum unit streamflow values that forecasters would consider a threshold for flash flooding ranged from <1 to 4 m³ s⁻¹ km⁻² (100–400 ft³ s⁻¹ mi⁻²; Fig. 6); however, forecasters were most comfortable using a value from 1 to 2 m³ s⁻¹ km⁻² (100–200 ft³ s⁻¹ mi⁻²) as an overall warning decision threshold for flash flooding. There were no identifiable patterns in forecaster-defined CREST maximum unit streamflow thresholds for flash flooding in rural versus urban areas.
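
The paired SI and U.S. customary values in this section can be checked with the conversion below. Unit streamflow is discharge divided by upstream drainage area; the discharge and area in the example are hypothetical. Exact unit definitions give 1 m³ s⁻¹ km⁻² ≈ 91.5 ft³ s⁻¹ mi⁻², so the 100 ft³ s⁻¹ mi⁻² paired with 1 m³ s⁻¹ km⁻² in the text is a rounded equivalent.

```python
# Conversion factors from the exact definitions of the international foot and mile.
FT3_PER_M3 = 1.0 / 0.3048**3      # ~35.3147 ft^3 per m^3
MI2_PER_KM2 = 1.0 / 1.609344**2   # ~0.386102 mi^2 per km^2


def unit_streamflow(discharge_m3s: float, drainage_area_km2: float) -> float:
    """Discharge normalized by upstream drainage area, in m^3 s^-1 km^-2."""
    if drainage_area_km2 <= 0:
        raise ValueError("drainage area must be positive")
    return discharge_m3s / drainage_area_km2


def to_cfs_per_mi2(q_si: float) -> float:
    """Convert a unit streamflow from m^3 s^-1 km^-2 to ft^3 s^-1 mi^-2."""
    return q_si * FT3_PER_M3 / MI2_PER_KM2


q = unit_streamflow(discharge_m3s=50.0, drainage_area_km2=25.0)  # hypothetical cell
print(q, round(to_cfs_per_mi2(q), 1))
```

Here the hypothetical 50 m³ s⁻¹ over 25 km² gives 2.0 m³ s⁻¹ km⁻², about 182.9 ft³ s⁻¹ mi⁻², at the top of the 1–2 m³ s⁻¹ km⁻² range forecasters preferred.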

Fig. 6.

Subjective evaluation of the minimum CREST maximum unit streamflow values (m³ s⁻¹ km⁻²) that a forecaster would consider as a threshold for flash flooding to occur.

UTILIZING SHORT-TERM QPFS.

Two 0–6-h QPF inputs were ingested within the EF5 framework: the experimental High-Resolution Rapid Refresh model (HRRRX; http://rapidrefresh.noaa.gov/hrrr/) from the NOAA/Earth System Research Laboratory (ESRL) and the Advective-Statistical Forecasts of Rainfall (ADSTAT) package based on the work by Kitzmiller et al. (2011). The experimental ADSTAT products were generated from a blend of radar-extrapolation and RAP model QPFs via processes executed in real time by the NWS National Water Center (D. Kitzmiller 2015, personal communication). Output based on the HRRRX and ADSTAT QPFs was available to forecasters during real-time operations.

Twelve subjective comparisons of CREST maximum unit streamflow output forced by ADSTAT versus HRRRX QPFs yielded inconclusive results (Fig. 7). In the five flash flood events in which the HRRRX was judged to perform better, its advantage was attributed to its skill in predicting convective initiation. In the three events in which the ADSTAT performed better, its advantage was generally credited to a smaller false-alarm area than the HRRRX. The ADSTAT and HRRRX displayed similar results for four events in which neither system generated positive QPFs for a subsequently observed flash flood event; thus, the QPE became the dominant contributor to the CREST model output. Both the ADSTAT and HRRRX were noted to contribute to situational assessment of the potential for flash flooding, but challenges with the spatial placement and coverage of the QPF, or inconsistencies between model runs, limited the use of QPFs for generating FFWs with greater lead time.

Fig. 7.

Subjective evaluation of the ADSTAT QPF compared with the HRRRX QPF, each used as input to the CREST hydrologic model, for flash flood events during the 2015 HMT-Hydro experiment.

EXPERIMENTAL FLASH FLOOD WATCHES AND WARNINGS.

Fifteen of the 32 experimental FFAs issued during the 2015 HMT-Hydro experiment were subjectively evaluated on their spatial coverage relative to operational FFAs and on their assigned probabilistic values (Fig. 8). Nine experimental FFAs were rated as having “slightly better” or “much better” spatial coverage than the collocated operational FFAs. Forecaster feedback for these ratings cited reduced false-alarm area, better coverage with respect to storm reports, and the absence of an operational flash flood watch where flash flooding subsequently occurred. These better ratings were likely due to the experimental FFAs being issued for a 6-h forecast period, which allows them to be more spatially focused, whereas operational FFAs can be issued up to 48 h before a possible flash flood event (NWS 2011), which could make them susceptible to greater forecast uncertainty. Three of the experimental FFAs were rated worse than their collocated operational FFAs. Forecaster feedback cited a lack of storm reports within the experimental FFA and a greater false-alarm area for the experimental FFA, possibly resulting from the lack of geopolitical constraints on polygon generation. Most evaluated experimental FFAs had probabilistic forecasts rated as “about right”; the four experimental FFAs not rated this way were described as having major flash flood probabilities that were too high or too low, or as lacking storm reports for proper verification.

Fig. 8.

Subjective evaluation of (a) the spatial coverage of experimental FFAs compared to operational FFAs and (b) the probabilistic threat in experimental FFAs.

A similar subjective evaluation was also conducted on 34 of the 157 experimental FFWs issued during the 2015 HMT-Hydro experiment (Fig. 9). The most common rating among the assessed experimental FFWs was “slightly worse” spatial coverage than the collocated operational FFWs. The primary feedback for this rating was a greater false-alarm area in the experimental FFWs and a lack of storm reports within experimental FFW polygons. Experimental FFWs rated better than the collocated operational FFWs were credited with reduced false-alarm area, better coverage of the flash flood event, and the absence of an operational FFW when flash flooding occurred. While 16 of the 34 probabilistic forecasts in evaluated experimental FFWs were rated as “about right,” another 15 were rated as having probabilities that were “too high.” Of these, two were described by forecasters as having major probabilities that were too high owing to a lack of major flash flood reports or overall probabilities that were too high owing to a lack of storm reports.

Fig. 9.

Subjective evaluation of (a) the spatial coverage of experimental FFWs compared to operational FFWs and (b) the probabilistic threat in experimental FFWs.

Objective verification of probabilistic forecast reliability for all 32 experimental FFAs and 157 experimental FFWs is depicted in Fig. 10. For experimental FFAs, nuisance flash flooding was underforecast in both years at probabilities around and below 50% (Fig. 10a). The 0% major flash flood probability was also underforecast in 2014, but most probabilistic forecasts for major flash flooding were closer to perfect reliability. Experimental FFWs generally showed overforecasting, with forecast probabilities exceeding the observed frequencies (Fig. 10b). The only underforecasting occurred with the nuisance flash flood probability of 25% and the major flash flood probability of 0% in 2014. A notable trend in the forecast reliability analysis was that overforecasting in the experimental FFWs increased with higher forecast probabilities of nuisance or major flash flooding; however, the inadequate verification of flash floods, especially in sparsely populated areas, can skew probabilistic forecast reliability assessments.
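
A reliability curve like those in Fig. 10 pairs each forecast-probability bin with the observed frequency of verification. The sketch below assumes bins of the form (0, w], (w, 2w], and so on, with 0% placed in the first bin and each bin plotted at its midpoint; the Fig. 10 caption confirms only the example of 81%–100% being plotted at 90%, so the exact edges are an assumption. An observed frequency below the bin probability indicates overforecasting.

```python
import math
from collections import defaultdict


def bin_midpoint(p: int, width: int) -> float:
    """Map a percent probability to its bin midpoint, e.g. 81-100 -> 90 for width 20."""
    # Assumed bin edges: (0, w], (w, 2w], ...; 0% is placed in the first bin.
    idx = max(0, math.ceil(p / width) - 1)
    return idx * width + width / 2


def reliability(forecasts: list[tuple[int, bool]], width: int) -> dict[float, tuple[float, int]]:
    """Observed frequency and sample size per forecast-probability bin midpoint."""
    counts: dict[float, list[int]] = defaultdict(lambda: [0, 0])  # [verified, total]
    for prob, verified in forecasts:
        c = counts[bin_midpoint(prob, width)]
        c[0] += int(verified)
        c[1] += 1
    return {mid: (v / n, n) for mid, (v, n) in sorted(counts.items())}
```

With hypothetical inputs [(90, True), (85, False), (10, False), (0, False)] and 20-point bins, the 90% bin verifies half the time (overforecast) and the 10% bin never verifies.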

Fig. 10.

Reliability of probabilistic forecasts for nuisance (light shades) and major (dark shades) flash flooding during the 2014 (blue) and 2015 (orange) HMT-Hydro experiments for (a) experimental FFAs and (b) experimental FFWs. Perfectly reliable probabilistic forecasts would lie along the 1:1 line (dashed). For experimental FFAs, 2015 probability values were binned in 20% intervals and plotted at the middle of each bin range (e.g., reliability for all values between 81% and 100% was plotted at 90%). An interval of 10% was used for the 2015 experimental FFWs. Sample sizes N for experimental FFAs and FFWs for both years are given in the upper-left corner of each graph.


USING HAZARD SERVICES IN HMT-HYDRO OPERATIONS.

New for the 2015 HMT-Hydro experiment was the integration of the Hazard Services interface into the second-generation Advanced Weather Interactive Processing System (AWIPS-II), the weather forecasting and display platform used by NWS forecasters. Hazard Services was designed to consolidate the various AWIPS-II functions for generating short- and long-fused weather watches and warnings into a single hazard-alerting interface. The HMT-Hydro experiment provided an opportunity to evaluate a beta version of Hazard Services in an operational, real-time test bed environment through the generation of experimental FFAs and FFWs.

Flash flood recommenders were introduced into the warning decision process through the Hazard Services interface. Flash flood recommenders are contoured areas generated by a selection algorithm designed to highlight regions that underlying computational models judge to be at high risk. The flash flood recommender tool available in the 2015 HMT-Hydro experiment was designed to contour areas of at least 0.001 square degrees (approximately 10 km²). A contoured area had to exceed a user-defined threshold for one of four available products: CREST maximum unit streamflow, maximum QPE-to-FFG ratio, maximum QPE ARI, or MRMS Q3RAD 3-h accumulation. Participating forecasters were given basic guidance on which thresholds to use for each input product but were free to set those threshold values. The contoured areas created by the recommender algorithm were converted into individual hazard events, each of which produced a proposed warning polygon. The utility of these recommender-based proposed warning polygons was evaluated daily during warning operations.
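The recommender logic described above can be sketched in a few lines of Python. It thresholds a gridded product, groups the exceedance cells into contiguous areas, and discards areas smaller than 0.001 square degrees. This is a minimal illustration, not the Hazard Services implementation. The 0.01° grid spacing, 4-connectivity, the function names, and the toy field are all assumptions. The helper `sqdeg_to_km2` shows why 0.001 square degrees corresponds to roughly 10 km² at midlatitudes, using a spherical-Earth approximation.

```python
import math
from collections import deque

KM_PER_DEG = 111.32  # approximate km per degree on a spherical Earth (assumption)


def sqdeg_to_km2(area_sqdeg, lat_deg):
    """Approximate area in km^2 of a lat/lon region given its area in square degrees."""
    return area_sqdeg * KM_PER_DEG ** 2 * math.cos(math.radians(lat_deg))


def recommend_regions(grid, threshold, cell_deg=0.01, min_area_sqdeg=0.001):
    """Return 4-connected regions where grid values meet or exceed `threshold`.

    `grid` is a list of rows on a regular lat/lon grid with spacing `cell_deg`
    (hypothetical). Regions smaller than `min_area_sqdeg` are discarded. Each
    region is a list of (row, col) cells that a downstream step could turn into
    one hazard event and one proposed warning polygon.
    """
    nrows, ncols = len(grid), len(grid[0])
    # Minimum number of cells; the small epsilon guards against float round-off.
    min_cells = math.ceil(min_area_sqdeg / (cell_deg * cell_deg) - 1e-9)
    seen = [[False] * ncols for _ in range(nrows)]
    regions = []
    for r in range(nrows):
        for c in range(ncols):
            if seen[r][c] or grid[r][c] < threshold:
                continue
            # Breadth-first flood fill of one exceedance region.
            seen[r][c] = True
            queue, cells = deque([(r, c)]), []
            while queue:
                i, j = queue.popleft()
                cells.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < nrows and 0 <= nj < ncols
                            and not seen[ni][nj] and grid[ni][nj] >= threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(cells) >= min_cells:
                regions.append(cells)
    return regions


# Toy field of QPE-to-FFG ratios (hypothetical values) with a threshold of 1.0.
grid = [[0.0] * 6 for _ in range(6)]
for i in range(3):
    for j in range(4):
        grid[i][j] = 1.5  # 12-cell exceedance area (about 0.0012 square degrees)
grid[5][4] = grid[5][5] = 1.5  # 2-cell area, below the minimum size
print([len(cells) for cells in recommend_regions(grid, 1.0)])  # prints [12]
```

Because each surviving region becomes its own proposed polygon, two nearby exceedance areas separated by a few sub-threshold cells produce two polygons with a gap between them.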

An evaluation of 157 recommender-based proposed warning polygons during the 2015 HMT-Hydro experiment found that forecasters would accept fewer than 6% of the proposed polygons without modification (Fig. 11a). Approximately half of the recommender-based polygons were considered usable by forecasters with some modification to the polygon area. The forecasters rejected the remaining 44.6% of recommender-based polygons. Some variation in the use of the proposed polygons was found when assessing each individual input source (Figs. 11b–e). Proposed polygons based on the MRMS Q3RAD 3-h accumulations and the maximum QPE-to-FFG ratio were more likely to be used as is or with modification, while proposed polygons generated using the maximum QPE ARI were most likely to be rejected.
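As a consistency check, the counts implied by these percentages for the 157 evaluated polygons can be worked out directly. The short Python sketch below does only arithmetic on the numbers in the text. It assumes that the 44.6% figure was rounded to one decimal place and that every polygon fell into exactly one of the three categories.

```python
import math


def implied_counts(total=157, rejected_pct=44.6, accepted_below_pct=6.0):
    """Back out polygon counts implied by the reported percentages.

    Returns (rejected, maximum accepted as is, (min, max) accepted with modification).
    """
    rejected = round(rejected_pct / 100 * total)
    # "Fewer than X%" means count / total < X / 100, i.e., strictly below X/100 * total.
    max_accepted = math.ceil(accepted_below_pct / 100 * total) - 1
    modified = (total - rejected - max_accepted, total - rejected)
    return rejected, max_accepted, modified


rejected, max_accepted, modified = implied_counts()
print(rejected, max_accepted, modified)
print(f"{100 * modified[0] / 157:.1f}% to {100 * modified[1] / 157:.1f}% modified")
```

The result is 70 rejected and at most 9 accepted as is. That leaves 78 to 87 polygons (about 50% to 55%) usable with modification, consistent with "approximately half."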

Fig. 11.

Usability evaluation of proposed warning polygons based on flash flood recommenders. Shown are results for (a) all input products combined and for each individual input product: (b) MRMS Q3RAD 3-h accumulation, (c) maximum QPE-to-FFG ratio, (d) maximum QPE ARI, and (e) CREST maximum unit streamflow.


Forecaster feedback from free-response questions noted some limitations of applying the recommender-based polygons without modification. The proposed warning polygons contoured the areas directly affected by rainfall but might not capture downstream impacts or storm motion. Because the recommender produced clusters of separate polygons, small gaps could remain between them where flash flood conditions potentially existed. A single user-defined polygon, by contrast, would provide continuous coverage over the entire threat area. Nevertheless, forecasters also indicated that the recommender-based polygons aided the situational assessment process.

One could assume that technical improvements to recommenders would allow forecasters to issue more recommender-based polygons exactly as created; however, this outcome may not be completely desirable from a decision-making perspective. Previous research has found an inverse relationship between level of automation use and situation awareness (Dao et al. 2009; Endsley and Kiris 1995); thus, it is hypothesized that total dependence on recommender-based polygons, a form of decision-aiding automation, may lead to out-of-the-loop forecasting, performance decrements, and reduced situation awareness. It will be critical going forward to consider the role of recommenders in the warning decision-making process and to ensure that their utility is balanced against the pitfalls of overreliance.

SUMMARY AND FUTURE ASSESSMENTS.

Challenges remain in the prediction and warning of flash flood events, one of the deadliest weather phenomena in the United States. In addition, flash flood reports are sparse and inadequate compared with reports of other weather-related hazards. The HMT-Hydro experiment was designed to investigate, in a real-time environment, new products and techniques that could assist in flash flood prediction and warning decision-making. It also assessed the use of forecaster-defined probabilistic forecasts for nuisance and major flash flooding in experimental FFAs and FFWs, as well as the utility of the Hazard Services interface and flash flood recommenders.

Products from the FLASH suite helped identify focused regions of increased flash flood potential. Moreover, the CREST maximum unit streamflow demonstrated its ability to highlight the potential for flash flooding in urban areas. Various challenges with 0–6-h QPFs limited their usefulness for issuing FFWs earlier. Forecasters conveyed the uncertainty and possible magnitude of flash flooding through probabilistic information. Reliability assessments of user-defined probabilistic forecasts of nuisance and major flash flooding showed distinct patterns. Experimental FFAs tended to underforecast, particularly for nuisance flash flooding, while experimental FFWs tended to overforecast. The addition of user-defined probabilistic information provided insight into the perceptions and challenges of probabilistic forecasting.

The HMT-Hydro experiment will continue to evaluate emerging techniques and technologies for probabilistic flash flood prediction and warning decision-making. Future HMT-Hydro experiments will explore the use of probabilistic quantitative precipitation estimation (PQPE; Kirstetter et al. 2015) to generate probabilistic hydrologic outputs in the CREST model. New experimental warning paradigms are being prototyped in the Forecasting a Continuum of Environmental Threats (FACETs; Rothfusz et al. 2014) project. These paradigms use a continuum of probabilistic hazard information (PHI) instead of deterministic, product-centric watches and warnings, delivered through a system that can identify and predict high-impact hazards (Karstens et al. 2015). While current efforts are evaluating the use of PHI for severe convective events, a similar approach could be applied to flash flooding in the near future.

Feedback on automatically generated warning polygons derived from the single-product flash flood recommenders indicated that they were useful for increasing situational awareness of heightened flash flood potential. However, these proposed polygons were rarely accepted without modification. Limiting factors included their failure to portray downstream impacts or storm motion and the small gaps left between clustered contours. Future work in this area includes developing a multivariable flash flood recommender algorithm that weights different inputs from the MRMS and FLASH product suites. It also includes the eventual use of proposed warning polygons based on gridded flash flood probabilities within the FACETs paradigm.
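The multivariable recommender mentioned above is future work, and its weighting scheme is not specified here. One simple possibility is to normalize each input by its user-defined threshold and take a weighted mean. The sketch below shows that approach purely as a hypothetical. The product keys, thresholds, weights, and the rule of flagging a cell when the score is at least 1.0 are all invented for illustration.

```python
def combined_score(values, thresholds, weights):
    """Weighted mean of threshold-normalized inputs for one grid cell.

    Each input is divided by its own threshold, so 1.0 means "at threshold";
    under this hypothetical scheme a score >= 1.0 would flag the cell.
    """
    keys = values.keys()
    total_weight = sum(weights[k] for k in keys)
    return sum(weights[k] * values[k] / thresholds[k] for k in keys) / total_weight


# Hypothetical inputs, thresholds, and weights for a single cell.
values = {"qpe_to_ffg_ratio": 1.2, "qpe_ari_years": 50.0, "crest_unit_streamflow": 3.0}
thresholds = {"qpe_to_ffg_ratio": 1.0, "qpe_ari_years": 100.0, "crest_unit_streamflow": 2.0}
weights = {"qpe_to_ffg_ratio": 0.4, "qpe_ari_years": 0.3, "crest_unit_streamflow": 0.3}
score = combined_score(values, thresholds, weights)
print(f"combined score = {score:.2f}")
```

A weighted mean lets a strong signal in one product compensate for a weak one in another. Here the ARI input at half its threshold is offset by the other two. Whether that trade-off is desirable for warning decisions is exactly the kind of question a test bed evaluation would need to answer.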

Advances in numerical weather prediction, precipitation estimation, and distributed hydrologic modeling will be beneficial to the quality and accuracy of flash flood forecasts and operational products. Moreover, continued research is needed on the application and reliability of probabilistic forecasts in operational flash flood products. NSSL and the HMT-Hydro experiment will continue to provide a platform for research scientists, operational forecasters, and hydrologic modelers to collaborate on improving flash flood prediction and to facilitate the evaluation and transition of new products and techniques from research to operations.

ACKNOWLEDGMENTS

Funding for this research was provided by the Disaster Relief Appropriations Act of 2013 (P.L. 113-2), which provided support to the Cooperative Institute for Mesoscale Meteorological Studies at the University of Oklahoma under Grant NA14OAR4830100 and the Hydrometeorological Testbed Program under Grant NA15OAR4590158. Regional NWS headquarters also provided funding for some participants. The authors thank the anonymous reviewers for their time and feedback on this manuscript. The authors also thank Dr. Stephen Cocks (CIMMS/OU–NSSL) for his thoughts and review of the manuscript. The NWS WDTD provided the use of its conference room, office space, and equipment for the experiment. Tiffany Meyer (CIMMS/OU–NSSL), Darrel Kingfield (CIMMS/OU–NSSL), and Andre Reddington (WDTD) assisted with the use of the AWIPS-II workstations, and Kevin Manross (GSD) provided expertise on the installation and use of Hazard Services. Gabe Garfield (CIMMS/OU–NWS) provided coordination between the experiment and the HWT. The following undergraduate students from the OU School of Meteorology called members of the public for flash flood verification as part of the SHAVE project: Mary Beers, Kevin Biehl, Taylor Faires, Rachel Gaal, Paul Goree, Bria Hieatt, Corey Howard, David King, Brittany Newman, Derek Rosseau, and Alexandra Wright. The HMT-Hydro experiment would not have been possible without the involvement and insight from participating NWS forecasters, as well as the collaboration with the scientists and forecasters participating in the FFaIR experiment.

REFERENCES

Ashley, S. T., and W. S. Ashley, 2008: Flood fatalities in the United States. J. Appl. Meteor. Climatol., 47, 805–818.
Barthold, F., T. Workoff, B. Cosgrove, J. Gourley, D. Novak, and K. Mahoney, 2015: Improving flash flood forecasts: The HMT-WPC Flash Flood and Intense Rainfall Experiment. Bull. Amer. Meteor. Soc., 96, 1859–1866.
Brooke, J., 1996: SUS: A ‘quick and dirty’ usability scale. Usability Evaluation in Industry, P. W. Jordan et al., Eds., Taylor & Francis, 189–194.
Burnash, R. J. C., R. L. Ferral, and R. A. McGuire, 1973: A general streamflow simulation system—Conceptual modeling for digital computers. Joint Federal and State River Forecast Center Tech. Rep., U.S. National Weather Service and California Dept. of Water Resources, 204 pp.
Chin, J. P., V. A. Diehl, and K. L. Norman, 1988: Development of an instrument for measuring user satisfaction of the human-computer interface. CHI ’88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, J. J. O’Hare, Ed., ACM, 213–218.
Clark, R. A., J. J. Gourley, Z. L. Flamig, Y. Hong, and E. Clark, 2014: CONUS-wide evaluation of National Weather Service flash flood guidance products. Wea. Forecasting, 29, 377–392.
Dao, A.-Q. V., S. L. Brandt, V. Battiste, K.-P. L. Vu, T. Strybel, and W. W. Johnson, 2009: The impact of automation assisted aircraft separation on situation awareness. Human Interface and the Management of Information: Information and Interaction; Symposium on Human Interface 2009, G. Salvendy and M. J. Smith, Eds., Lecture Notes in Computer Science, Vol. 5618, Springer, 738–747.
Elmore, K. L., Z. L. Flamig, V. Lakshmanan, B. T. Kaney, V. Farmer, H. D. Reeves, and L. P. Rothfusz, 2014: MPING: Crowd-sourcing weather reports for research. Bull. Amer. Meteor. Soc., 95, 1335–1342.
Endsley, M. R., and E. O. Kiris, 1995: The out-of-the-loop performance problem and level of control in automation. Hum. Factors, 37, 381–394.
Goodrich, D. C., J.-M. Faures, D. A. Woolhiser, L. J. Lane, and S. Sorooshian, 1995: Measurement and analysis of small-scale convective storm rainfall variability. J. Hydrol., 173, 283–308.
Gourley, J. J., and Coauthors, 2017: The FLASH project: Improving the tools for flash flood monitoring and prediction across the United States. Bull. Amer. Meteor. Soc., 98, 361–372.
Harrison, D. L., S. J. Driscoll, and M. Kitchen, 2000: Improving precipitation estimates from weather radar using quality control and correction techniques. Meteor. Appl., 7, 135–144.
Karstens, C., and Coauthors, 2015: Evaluation of a probabilistic forecasting methodology for severe convective weather in the 2014 Hazardous Weather Testbed. Wea. Forecasting, 30, 1551–1570.
Kirstetter, P.-E., J. J. Gourley, Y. Hong, J. Zhang, S. Moazamigoodarzi, C. Langston, and A. Arthur, 2015: Probabilistic precipitation rate estimates with ground-based radar networks. Water Resour. Res., 51, 1422–1442.
Kitzmiller, D., W. Wu, S. Wu, and D. Miller, 2011: Development of a short-range probabilistic precipitation forecast algorithm based on radar and numerical prediction model input. 35th Conf. on Radar Meteorology, Pittsburgh, PA, Amer. Meteor. Soc., 145. [Available online at https://ams.confex.com/ams/35Radar/webprogram/Paper191725.html.]
NWS, 2011: Weather Forecast Office hydrologic products specification. U.S. Dept. of Commerce, National Weather Service, Instruction 10-922, 84 pp. [Available online at www.nws.noaa.gov/directives/sym/pd01009022curr.pdf.]
NWS, 2015: National Weather Service glossary. [Available online at http://w1.weather.gov/glossary/index.php.]
Ortega, K. L., T. M. Smith, K. L. Manross, A. G. Kolodziej, K. A. Scharfenberg, A. Witt, and J. J. Gourley, 2009: The Severe Hazards Analysis and Verification Experiment. Bull. Amer. Meteor. Soc., 90, 1519–1530.
Perica, S., and Coauthors, 2013: Southeastern States (Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi). Vol. 9, Version 2.0, Precipitation-Frequency Atlas of the United States, NOAA Atlas 14, 171 pp. [Available online at www.nws.noaa.gov/oh/hdsc/PF_documents/Atlas14_Volume9.pdf.]
Rothfusz, L., C. D. Karstens, and D. Hilderbrand, 2014: Next-generation forecasting of high-impact weather. Eos, Trans. Amer. Geophys. Union, 95, 325–326.
Smith, C. J., 1986: The reduction of errors caused by bright bands in quantitative rainfall measurements made using radar. J. Atmos. Oceanic Technol., 3, 129–141.
Sweeney, T. L., 1992: Modernized areal flash flood guidance. NOAA Tech. Rep. NWS HYDRO 44, 21 pp.
Wang, J., and Coauthors, 2011: The coupled routing and excess storage (CREST) distributed hydrological model. Hydrol. Sci. J., 56, 84–98.
Young, C. B., B. R. Nelson, A. A. Bradley, J. A. Smith, C. D. Peters-Lidard, A. Kruger, and M. L. Baeck, 1999: An evaluation of NEXRAD precipitation estimates in complex terrain. J. Geophys. Res., 104, 19 691–19 703.
Zhang, J., and Coauthors, 2016: Multi-Radar Multi-Sensor (MRMS) quantitative precipitation estimation: Initial operating capabilities. Bull. Amer. Meteor. Soc., 97, 621–638.