Machine Learning–Derived Severe Weather Probabilities from a Warn-on-Forecast System

Adam J. Clark, NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma, and School of Meteorology, University of Oklahoma, Norman, Oklahoma

and
Eric D. Loken, NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma, and Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma



Abstract

Severe weather probabilities are derived from the Warn-on-Forecast System (WoFS) run by NOAA’s National Severe Storms Laboratory (NSSL) during spring 2018 using the random forest (RF) machine learning algorithm. Recent work has shown this method generates skillful and reliable forecasts when applied to convection-allowing model ensembles for the “Day 1” time range (i.e., 12–36-h lead times), but it has been tested in only one other study for lead times relevant to WoFS (e.g., 0–6 h). Thus, in this paper, various sets of WoFS predictors, which include both environment and storm-based fields, are input into a RF algorithm and trained using the occurrence of severe weather reports within 39 km of a point to produce severe weather probabilities at 0–3-h lead times. We analyze the skill and reliability of these forecasts, sensitivity to different sets of predictors, and avenues for further improvements. The RF algorithm produced very skillful and reliable severe weather probabilities and significantly outperformed baseline probabilities calculated by finding the best performing updraft helicity (UH) threshold and smoothing parameter. Experiments where different sets of predictors were used to derive RF probabilities revealed 1) storm attribute fields contributed significantly more skill than environmental fields, 2) 2–5 km AGL UH and maximum updraft speed were the best performing storm attribute fields, 3) the most skillful ensemble summary metric was a smoothed mean, and 4) the most skillful forecasts were obtained when smoothed UH from individual ensemble members were used as predictors.

© 2022 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Adam J. Clark, adam.clark@noaa.gov


1. Introduction

Convection-allowing models (CAMs) do not explicitly forecast severe weather events (defined here as occurrence of a tornado, hail ≥ 1.0 in., or wind speeds ≥ 50 kt; 1 kt ≈ 0.51 m s−1), but storm-based diagnostics such as hourly maximum updraft helicity (UH; Kain et al. 2010) have been shown to work very well as severe weather proxies (e.g., Sobash et al. 2011, 2016a,b, 2019; Clark et al. 2013, 2018; Schwartz et al. 2015; Sobash and Kain 2017; Adams-Selin et al. 2019; Roberts et al. 2020). However, these proxies are sensitive to many aspects of model configuration including resolution, model core, physics, and diffusion (e.g., Potvin et al. 2017; Clark et al. 2018). Thus, deriving severe weather probabilities from the proxies alone requires careful calibration, which must be repeated for model configuration changes/upgrades.

Furthermore, the likelihood that a particular proxy will be associated with a severe weather event can be strongly dependent upon environmental conditions. For example, Gallo et al. (2016) found that filtering UH based on where environmental fields were favorable for tornadoes improved UH-derived tornado probabilities, and Gallo et al. (2018a) found further improvement by combining UH and significant tornado parameter (STP; Thompson et al. 2003) forecasts with climatological information on the frequencies of observed tornadoes within specified ranges of STP. Clark et al. (2012b) showed that correlations between UH and tornado pathlengths could be significantly improved when UH was considered only in environments favoring surface-based storms with low lifting condensation level heights. Finally, in a study examining 1- and 3-km CAM forecasts, Sobash et al. (2019) found that, even though 1-km grid-spacing UH-based tornado forecasts were more skillful than the 3-km forecasts, the 3-km UH forecasts became just as skillful as the 1-km forecasts when combined with information on STP.

To identify the complex, nonlinear relationships that exist between severe weather proxies, forecast environmental conditions, and the associated likelihood of severe weather, machine learning approaches are very appealing since multiple predictors can be used as input to generate forecast probabilities. Specifically, random forests (RFs; Breiman 2001) have been used in several recent studies that have demonstrated considerable skill relative to more naïve approaches that are tuned to a single proxy. For example, Loken et al. (2020) used RFs driven by input from the Storm Scale Ensemble of Opportunity (SSEO; Jirak et al. 2016) to derive Day 1 (1200–1200 UTC) severe hazard probabilities. These RF-based forecasts were more skillful than calibrated UH-based forecasts, and at times even outperformed Storm Prediction Center (SPC) convective outlooks. Also, Hill et al. (2020) used environmental fields from the Global Ensemble Forecast System Reforecast (GEFS/R) dataset (Hamill et al. 2013) to generate RF-derived severe weather probabilities covering the Days 1–3 period. These forecasts slightly underperformed SPC outlooks at Day 1 but significantly outperformed SPC outlooks for Days 2–3. Other machine learning approaches, such as neural networks, have also been used to derive skillful severe weather probabilities (e.g., Sobash et al. 2020), and machine learning is being used for many other forecasting applications like precipitation (e.g., Herman and Schumacher 2018; Loken et al. 2019; Hill and Schumacher 2021; Schumacher et al. 2021), hail (e.g., Gagne et al. 2017, 2019; Burke et al. 2020), convective mode (Jergensen et al. 2020), tornadoes (Lagerquist et al. 2020), convective wind (Lagerquist et al. 2017), and aviation applications (Muñoz-Esparza et al. 2020). While these studies note that machine learning approaches have several advantages over using single severe weather proxies, the machine learning approaches still require retraining when significant model changes/upgrades are implemented, which is a similar disadvantage to using single proxies. See Haupt et al. (2022) for an overview of AI applications in the environmental sciences.

For very short-term severe weather forecasting applications (i.e., 0–6-h lead times), use of machine-learning-based postprocessing has only been attempted in one other study, Flora et al. (2021). In that study, several machine learning methods were used within an object-based framework to derive the probability that 30-min ensemble storm tracks, defined using a 10 m s−1 minimum updraft strength threshold, would be associated with a tornado, wind, or hail report. The storm tracks and associated predictors came from the National Severe Storms Laboratory’s (NSSL) Warn-on-Forecast System (WoFS). For their algorithm, predictors consisted of storm diagnostics, environment fields, and object shape properties computed over each storm object. Then, the storm objects were classified according to whether they contained a tornado, hail, and/or wind report. That study found very encouraging results, namely that the ML algorithms had good discrimination ability for all three hazards and produced more reliable probabilities than a set of baseline predictions. However, the authors acknowledged important limitations in their object-based framework. Specifically, forecasts and the predictors input into the ML algorithms are limited to where WoFS places storms, and verification does not account for misses (i.e., severe weather that occurs outside of storm objects).

This study is similar to Flora et al. (2021) in that machine learning is used to derive severe weather probabilities using predictors from WoFS. However, a grid-based, as opposed to object-based, framework is utilized so that nonzero severe weather probabilities can occur anywhere within the WoFS domains and missed events are incorporated in the verification. Also, herein we analyze the probability that any type of severe weather will occur, as opposed to the hazard-specific probabilities analyzed in Flora et al. (2021). The any-severe definition was used mainly to increase the number of “targets” for the RF, which was important given our smaller set of cases compared to Flora et al. (2021). Predictors used for this study include both environment and storm-based fields, which are used to produce probabilities of severe weather within 39 km of a point at 0–3-h lead times. The skill and reliability of these forecasts are analyzed along with the sensitivity to different sets of predictors. The remainder of this paper is organized as follows: section 2 contains the data and methodology, section 3 the results, and section 4 the summary and conclusions.

2. Data and methodology

NSSL’s Warn-on-Forecast System (WoFS; Stensrud et al. 2009; Miller et al. 2022) is a rapidly updating, convection-allowing ensemble that is designed to dramatically increase lead times for hazardous weather. Furthermore, WoFS can help address the relative lack of probabilistic model-based guidance within the 0–6-h lead time range, so that the NWS can provide more meaningful guidance products after convective watches are issued and before warnings are needed. In this context, machine learning can be used as a tool to calibrate WoFS guidance and relate the guidance more directly to severe weather likelihood to streamline the forecasting process. In 2018, a prototype configuration of WoFS was run experimentally during the NOAA Hazardous Weather Testbed Spring Forecasting Experiment (e.g., Clark et al. 2012a; Gallo et al. 2017, 2018b). This system was initialized every 30 min from 1900 to 0300 UTC the next day with 6-h forecasts targeting watch-to-warning lead times. WoFS used the Advanced Research Weather Research and Forecasting (WRF) Model version 3.8 (Skamarock et al. 2008) with 36-member ensemble data assimilation cycled every 15 min starting at 1800 UTC using the ensemble adjustment Kalman filter (Anderson 2001) included in NCAR’s Data Assimilation Research Testbed (DART; Anderson et al. 2009) software. Forecasts were launched from 18 members, which used 3-km grid spacing, 50 vertical levels, and varied physics, and covered a 900 km × 900 km domain. The varied physics schemes include the Yonsei University (YSU; Noh et al. 2003), Mellor–Yamada–Janjić (MYJ; Mellor and Yamada 1982; Janjić 2001), and Mellor–Yamada–Nakanishi–Niino (MYNN; Nakanishi 2000, 2001; Nakanishi and Niino 2004, 2006) turbulence schemes, which are paired with either the Dudhia and Rapid Radiative Transfer Model (RRTM; Mlawer et al. 1997) parameterizations for short- and longwave radiation, respectively, or the Rapid Radiative Transfer Model for Global Climate Models (RRTMG; Iacono et al. 2008) for both. All members used the NSSL two-moment microphysics parameterization (Mansell 2010) and the Rapid Update Cycle land surface model (Smirnova et al. 1997, 2000). Observations assimilated included surface observations, Doppler velocity from Weather Surveillance Radar-1988 Dopplers (WSR-88Ds) with coverage in the WoFS domain, Multi-Radar Multi-Sensor (MRMS) reflectivity > 20 dBZ, and cloud water path from GOES-16. The Global Systems Laboratory’s experimental High-Resolution Rapid Refresh Ensemble (HRRRE; Kalina et al. 2021) provided the background analyses and boundary conditions.

Complete sets of WoFS forecasts were available for 24 days in 2018 (specific days listed in Fig. 1). For each day, there are eight hourly initializations covering 1900–0200 UTC the next day. Thus, the full dataset in this study comprises 8 × 24 = 192 unique WoFS initializations of 18-member forecasts. While a larger sample size would potentially give better results, our set of cases should be more than adequate as a proof of concept and for examining sensitivity to different sets of predictors. Another potential limitation of our dataset is applicability outside of the spring. While the spring 2018 time period sampled a variety of convective scenarios, there is an emphasis over the High Plains, where April–June is the climatological severe weather peak. Further work will be required to sample events more typical of the summer (“weakly forced” high-CAPE and low-shear events) and cool season (“strongly forced” low-CAPE and high-shear events). To illustrate the variety of convective events sampled herein, Fig. 1 provides information on WoFS domain placement, maximum SPC categorical risks, and dominant hazard types for each day. The cases include several classic dryline setups resulting in supercells that produced tornadoes and hail (e.g., 1–2 and 28–29 May), squall lines that produced primarily damaging wind (e.g., 4, 15, and 30–31 May), and marginal events that produced very limited severe weather (e.g., 29 April; 7 and 16 May).

Fig. 1.

(a) Frequency of WoFS domain placement for the 24 days examined herein. The bottom right contains a list of cases classified by region. (b) Calendar covering the experiment time period. The days are shaded according to the maximum categorical SPC risk (issued at 1630 UTC) within the WoFS domain. For each day, the dominant hazard types that occurred within the WoFS domain are listed. These were determined through subjective examination of daily archived SPC storm report maps.


The RF-derived probabilities in this study are created using random forest classifiers from the Python module Scikit-Learn (Pedregosa et al. 2011) with the following hyperparameters: 200 trees, a maximum tree depth of 15 levels, at least 20 samples per leaf node, the minimization of entropy for splits, and evaluation of the square root of the total number of predictors at each node. This is the same set of hyperparameters used in Loken et al. (2019, 2020). In those studies, sensitivity tests found very little dependence on the particular set of values chosen, and no additional hyperparameter tuning or sensitivity tests were performed herein. The strategy for RF application again follows that of Loken et al. (2019, 2020); for further details on RFs see those studies and references therein.
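For concreteness, a minimal sketch of this classifier configuration in Scikit-Learn is given below. The array names in the comments (X_train, y_train, X_test) are illustrative placeholders rather than names from the actual WoFS processing code.

    # Random forest configuration described above (Scikit-Learn).
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=200,      # 200 trees
        max_depth=15,          # maximum tree depth of 15 levels
        min_samples_leaf=20,   # at least 20 samples per leaf node
        criterion="entropy",   # splits chosen by minimizing entropy
        max_features="sqrt",   # sqrt of the number of predictors evaluated at each node
        n_jobs=-1,
    )

    # X_train: (n_samples, n_predictors) array of grid-point predictors
    # y_train: (n_samples,) array of 0/1 labels (severe report within 39 km)
    # rf.fit(X_train, y_train)
    # severe_probs = rf.predict_proba(X_test)[:, 1]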

Because our sample of cases is relatively limited, we use k-fold cross validation to estimate model skill on “unseen” data (e.g., Platt 1999). In k-fold cross validation, the dataset is split into k equal portions. In this study, one portion is held out for testing, while the remaining portions are lumped together for training. This process is repeated so that each portion is held out for testing once. In this way, we are able to generate forecasts for every sample in our dataset, where each forecast was generated from training on a separate, independent dataset. For our purposes, sixfold cross validation with 32 initializations per fold is used. This means that for each iteration of cross validation, forecasts are generated for 32 WoFS initializations using training on the other 160 WoFS initializations. Note that we do not set aside a portion of our data for validation in this study because we use the same set of hyperparameters for each RF.
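A minimal sketch of this cross-validation loop is shown below, assuming each grid-point sample carries the index of its parent WoFS initialization (init_ids, an illustrative name) so that whole initializations are held out together.

    # Sixfold cross validation in which each fold holds out whole WoFS initializations:
    # every sample from a held-out initialization is predicted by an RF trained on the
    # samples from the remaining initializations.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold

    def crossval_probs(X, y, init_ids, n_folds=6):
        probs = np.full(len(y), np.nan)
        splitter = GroupKFold(n_splits=n_folds)
        for train_idx, test_idx in splitter.split(X, y, groups=init_ids):
            rf = RandomForestClassifier(n_estimators=200, max_depth=15,
                                        min_samples_leaf=20, criterion="entropy",
                                        max_features="sqrt", n_jobs=-1)
            rf.fit(X[train_idx], y[train_idx])
            probs[test_idx] = rf.predict_proba(X[test_idx])[:, 1]
        return probs

    # With 192 initializations and n_folds=6, each fold contains 32 initializations.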

The RFs are trained with various combinations of 48 predictors. To make training computationally feasible, careful consideration was given to how to preprocess the predictors in order to reduce the dataset dimensionality. The predictors can be classified as either environment, hourly-maximum storm fields, or miscellaneous. For the 11 environment fields, an ensemble mean was calculated at forecast hours 1, 2, and 3. Then, a mean was taken over all three forecast hours. For the six storm fields, the maximum value over forecast hours 0–3 was found for each member. Then, at each grid point, three different quantities were computed that represent different aspects of the ensemble distribution: 1) the maximum of all members, 2) the 90th percentile of all members, and 3) the smoothed mean. The smoothed mean was computed by finding the average of all members at each grid point and then applying a Gaussian kernel density function with standard deviation of σ = 18 km, which was chosen based on testing a range of values and finding that 18 km seemed to depict a reasonable amount of spatial uncertainty without spreading out the values over too large an area. Using statistical descriptions of the data allows us to retain information from the ensemble member distributions without using all 18 members as predictors. This reduces the dimensionality from 6 fields × 18 members = 108 predictors to 6 fields × 3 statistics = 18 predictors. Meanwhile, applying a Gaussian kernel density function is an effective way to account for spatial uncertainty in an ensemble that is underdispersive, and it helps account for “under-sampling” in an ensemble with small membership (e.g., Sobash et al. 2016a). Miscellaneous predictors include the WoFS initialization time and the smoothed 2–5 km AGL UH fields from individual members. Figure 2 shows an example of each type of storm field for 2–5 km AGL UH and selected environmental fields, while Table 1 provides a full list of the predictors. The RF using the first 30 predictors in Table 1 is hereafter referred to as RFCNTL.
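As an illustration of this preprocessing, a minimal sketch for one storm field and one environment field is given below. The array names and shapes are illustrative, and SciPy’s gaussian_filter is used as a stand-in for the Gaussian kernel density smoothing described above; at 3-km grid spacing, σ = 18 km corresponds to 6 grid points.

    # Reduction of one hourly-maximum storm field (e.g., 2-5 km AGL UH) to the three
    # ensemble statistics used as predictors. "storm_field" is an illustrative array of
    # shape (18 members, n hourly-maximum outputs covering forecast hours 0-3, ny, nx).
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def storm_field_statistics(storm_field, sigma_km=18.0, dx_km=3.0):
        run_max = storm_field.max(axis=1)                  # max over hours 0-3, per member
        ens_max = run_max.max(axis=0)                      # 1) maximum of all members
        ens_90th = np.percentile(run_max, 90, axis=0)      # 2) 90th percentile of all members
        smoothed_mean = gaussian_filter(run_max.mean(axis=0),
                                        sigma=sigma_km / dx_km)  # 3) smoothed mean
        return ens_max, ens_90th, smoothed_mean

    # Each environment field is reduced to the mean of the ensemble means at forecast
    # hours 1, 2, and 3. "env_field" is an illustrative array of shape (18, 3, ny, nx).
    def environment_predictor(env_field):
        return env_field.mean(axis=(0, 1))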

Fig. 2.

Example of six predictors used in the RF algorithms for the 2100 UTC WoFS initialization on 1 May 2018. (a) Hourly maximum UH (2–5 km AGL), (b) 90th percentile UH (2–5 km AGL), and (c) smoothed mean UH (2–5 km AGL). Ensemble mean fields of (d) mixed layer CAPE, (e) storm relative helicity (0–3 km AGL), and (f) significant tornado parameter. In each panel, the area outlined with black contours and filled with transparent gray shading shows regions where severe weather occurred within 39 km of a point.


Table 1

List of predictors used in the RF algorithms.


For training and verification, filtered local storm reports (LSRs) were obtained from the SPC website (SPC 2021) for tornadoes, wind, and hail ≥ 1.0 in. The LSRs were gridded to the 3-km WoFS domains and dilated using a radius of 39 km, which approximately corresponds to the definition of severe weather used in SPC’s convective outlooks (i.e., severe weather occurrence within 25 miles of a point). Note, for simplicity, 39 km, rather than 40 km, was used since 39 is divisible by the native model grid spacing of 3 km. The resulting grid contains 0 and 1 values, where the 1 values indicate that a severe weather report occurred within 39 km of the point. The observations for the example case in Fig. 2 are overlaid with transparent shading. It is well known that LSRs are subject to errors and reporting biases (e.g., Brooks et al. 2003; Verbout et al. 2006; Cintineo et al. 2012). However, using a large radius of influence at least helps mitigate undersampling, although some spatial precision is sacrificed.
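A minimal sketch of this target construction is given below, assuming the LSRs have already been mapped to a 0/1 grid on the 3-km WoFS domain (report_grid is an illustrative name); the 39-km radius corresponds to 13 grid points.

    # Construction of the 0/1 target grid: gridded LSRs are dilated with a circular
    # footprint of radius 39 km so that a value of 1 marks any grid point with a
    # report within 39 km.
    import numpy as np
    from scipy.ndimage import binary_dilation

    def dilate_reports(report_grid, radius_km=39.0, dx_km=3.0):
        r = int(radius_km / dx_km)                        # 39 km / 3 km = 13 grid points
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        disk = (yy ** 2 + xx ** 2) <= r ** 2              # circular footprint
        return binary_dilation(report_grid.astype(bool), structure=disk).astype(int)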

To put the performance of the RF-based forecasts in appropriate context, formulation of a rigorous set of baseline forecasts is necessary. Since 2–5 km AGL UH is widely used by operational forecasters and generally recognized as a very skillful storm-based proxy for severe weather (e.g., Sobash et al. 2011, 2016a; Gallo et al. 2016; Clark et al. 2018), our strategy is to optimize UH-derived severe weather probabilities to give the most skillful Brier scores (Brier 1950). This strategy generally matches that of Loken et al. (2020), Sobash et al. (2020), and Flora et al. (2021), all of whom also used UH-based baselines for comparison to ML-derived severe weather probabilities. The Brier score was chosen because, unlike some other probabilistic metrics like area under the relative operating characteristic curve, it is strongly linked to forecast reliability and is a direct measure of probability error magnitudes. The surrogate severe methodology applied to ensembles (Sobash et al. 2016a) was used to derive the UH-based probabilities. For each WoFS initialization, the maximum UH is computed for each member over forecast hours 0–3. Then, 39-km neighborhood exceedance probabilities are calculated using 10 UH percentiles ranging from 0.97 to 0.999. Percentiles are preferred as opposed to raw thresholds because the different physics configurations in the WoFS members introduce some slight variability in the model UH climatologies (Potvin et al. 2020). The percentiles are computed from the UH values aggregated over all initializations. For each of these percentiles, a Gaussian smoother is applied with σ values ranging from 3 to 120 km, resulting in 90 sets of probabilities. Finally, the Brier score is computed for each set of probabilities to find which combination of σ and UH percentile gives the best (i.e., lowest) Brier score. Note, by using all 192 initializations to find the optimal σ and UH percentile we are giving an advantage to the baseline over the RFs, since the same data used to optimize the probabilities are used to generate the forecasts. Thus, the baseline skill can be considered an upper bound. A similar approach was used in Sobash et al. (2020).
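The sketch below illustrates this baseline construction under simplifying assumptions: uh_runmax, obs, and uh_all are illustrative inputs (run-maximum member UH per initialization, the matching 0/1 target grids, and the aggregated UH values used to convert percentiles to thresholds), and the percentile and σ lists are abbreviated relative to the 10 percentiles (0.97–0.999) and σ values (3–120 km) actually searched.

    # Surrogate-severe baseline: neighborhood UH exceedance probabilities smoothed with
    # a Gaussian of width sigma; the (percentile, sigma) pair with the lowest Brier
    # score is retained.
    import numpy as np
    from scipy.ndimage import gaussian_filter, binary_dilation

    def disk_footprint(radius_km=39.0, dx_km=3.0):
        r = int(radius_km / dx_km)
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        return (yy ** 2 + xx ** 2) <= r ** 2

    def surrogate_probs(member_uh, threshold, sigma_km, footprint, dx_km=3.0):
        # member_uh: (n_members, ny, nx) run-maximum UH for one initialization
        exceed = np.stack([binary_dilation(m > threshold, structure=footprint)
                           for m in member_uh])           # 39-km neighborhood exceedance
        return gaussian_filter(exceed.mean(axis=0), sigma=sigma_km / dx_km)

    def best_baseline(uh_runmax, obs, uh_all,
                      percentiles=(0.97, 0.98, 0.99, 0.995, 0.998, 0.999),
                      sigmas_km=(12, 24, 48, 96, 120)):
        # uh_runmax: (n_init, n_members, ny, nx); obs: (n_init, ny, nx) 0/1 targets
        footprint = disk_footprint()
        best = (np.inf, None, None)
        for p in percentiles:
            threshold = np.quantile(uh_all, p)            # percentile -> UH threshold
            for s in sigmas_km:
                probs = np.stack([surrogate_probs(u, threshold, s, footprint)
                                  for u in uh_runmax])
                bs = np.mean((probs - obs) ** 2)          # Brier score
                if bs < best[0]:
                    best = (bs, p, s)
        return best                                       # (Brier score, percentile, sigma)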

This exercise found that σ = 48 km and UH percentile = 0.998 result in the optimal baseline. Within the WoFS members, the 0.998 UH percentile corresponds to UH values of 120–148 m2 s−2. The optimal value of σ = 48 km is noticeably smaller than values found in previous works applying CAM ensembles to Day 1 forecasting applications. For example, Sobash et al. (2016a) and Clark et al. (2018) both found that fractions skill scores for UH-based ensemble forecasts were maximized for σ in the range 150–180 km. The smaller optimal σ for WoFS is consistent with the focus on shorter lead times, which equate to more accurate and precise probabilities (i.e., higher magnitudes over smaller areas). The use of smaller smoothing length scales in CAM applications is also discussed in Hill et al. (2021), and Schwartz and Sobash (2017) discuss more generally the use of neighborhood ensemble probabilities with CAMs.

3. Results

a. Example cases

First, to gain some understanding of the RF-based severe weather probabilities, spatial plots were generated comparing the RFCNTL to the baseline probabilities with LSRs overlaid for every WoFS initialization. Examples showing select WoFS initializations for 1 and 4 May 2018 are shown in Fig. 3. These two cases were chosen because they were particularly active severe weather days, they depict two different dominant convective modes, and they represent the range in performance of RFCNTL relative to baseline.

Fig. 3.

Severe weather probabilities derived from 1900 UTC WoFS initializations on 1 May 2018 using the (a) baseline probabilities and (b) RFCNTL probabilities. (c),(d) As in (a) and (b), but for WoFS initializations on 4 May 2018. (e)–(h),(i)–(l),(m)–(p) As in (a)–(d), but for 2100, 2300, and 0100 UTC WoFS initializations, respectively. In each panel, the observed storm report locations are overlaid [legend in (a)].


The synoptic pattern on 1 May 2018 featured a dryline stretching through Nebraska, Kansas, and Oklahoma with a frontal boundary stretching from southeast Nebraska into Iowa. This region was downstream of a broad midtropospheric trough with a corridor of 40–55-kt southwesterly 500-hPa winds. The warm sector air mass became very unstable by early afternoon and storms, including supercells, began to form by 2000 UTC. The storms, which left a broad corridor of tornado, hail, and wind reports, lasted into the evening and nighttime hours, eventually transitioning into a training line of convection over northeast Kansas.

RFCNTL and baseline probabilities generally highlight the same regions, which was typical for cases in which the WoFS members have a strong UH signal. The RFCNTL probabilities have sharper gradients than the baseline because the RFCNTL probabilities do not require smoothing. The sharper gradients were particularly advantageous for the 2100 UTC initialization, when storms were developing/evolving along well-defined boundaries (Figs. 3e,f). The RFCNTL probabilities contain noticeable regions of false alarm, especially in the 2100 and 2300 UTC initializations (Figs. 3f,j), which forecast >30% probabilities over northeast Kansas where no LSRs were reported. Closer inspection of the WoFS forecasts revealed spurious convection that was likely related to the members underestimating the strength of the capping inversion. Even though these spurious storms were not as intense as the storms farther west along the dryline [as indicated by their lower UH intensities (Figs. 2a,b)], the RF algorithm likely recognized that the environment there was very favorable for severe weather and thus generated relatively high probabilities. This illustrates an important characteristic of RFCNTL: if most of the WoFS members forecast spurious storms in a favorable environment, the RFCNTL probabilities will not recognize the false alarm. Thus, it is important for forecasters to recognize whether higher severe weather probabilities are associated with imminent or ongoing storms, as opposed to storms that have not yet formed and for which convective initiation is uncertain.

The 4 May 2018 case featured a midtropospheric trough axis that moved quickly from southwest to northeast across the Great Lakes and toward northern New England. Widespread convection, mostly in the form of a long, narrow squall line, developed along an associated cold front and progressed from west to east during the early afternoon and into the late evening over Pennsylvania, upstate New York, and much of New England. Although the thermodynamics were only marginally favorable for severe weather, the intense low- to midlevel wind fields resulted in a widespread damaging wind event. This was clearly a case in which RFCNTL provided value by forecasting 15% and greater severe weather probabilities across large portions of Pennsylvania and New York, where there were numerous severe wind reports and the baseline probabilities were less than 2% (Figs. 3c,d,g,h,k,l,o,p). The low probabilities in the baseline forecasts reflect that the WoFS members were not producing strong UH. As shown later, favorable environmental parameters and other storm attribute fields elevated the RFCNTL probabilities for this case.

b. Objective metrics

To quantify the skill of the baseline and RFCNTL severe weather probabilities, an attributes diagram (e.g., Hsu and Murphy 1986) was constructed using all 192 WoFS initializations and probability bins of [0%–2%), [2%–5%), [5%–10%), [10%–20%), …, and [90%–100%]. These diagrams provide a visualization of forecast reliability, i.e., how well the observed relative frequencies of LSRs correspond to the forecast probabilities. Additionally, the Brier score (BS; e.g., Murphy 1973), reliability component of the Brier score (BSrely; e.g., Atger 2003), Brier skill score (BSS; e.g., Wilks 2006), and area under the relative operating characteristic curve (AUC; Mason 1982) were computed. Unlike BS, the BSS accounts for climatological skill, and, for our purposes, BSS is calculated using the sample climatology defined as the average fraction of grid points for each initialization time that experiences a severe weather report within 39 km of a point. BSrely provides an objective measure of reliability by taking the squared difference between the forecast probability and observed relative frequency for each point in the attributes diagram, weighted according to the number of forecasts within each probability bin. The AUC measures the ability to discriminate between events and nonevents and is computed by plotting the probability of detection (POD) versus the probability of false detection (POFD) for a range of probabilistic thresholds, and then calculating the area under the curve connecting the POD-POFD points using a trapezoidal approximation (e.g., Wandishin et al. 2001). To test for statistical significance, a Welch’s t test was used for BS and BSrely, and the resampling approach described by Hamill (1999), which uses paired differences between experiments, was used with resampling repeated 1000 times for AUC and BSS. In all cases, significance was tested at α = 0.05 and each of the 192 initializations was treated as a separate sample.
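For reference, the sketch below computes these metrics from flattened arrays of forecast probabilities and 0/1 observations (p, o, climo, and bin_edges are illustrative names); AUC is obtained from Scikit-Learn’s roc_auc_score, which applies the trapezoidal rule over all distinct thresholds rather than the fixed set of thresholds used in the paper.

    # Probabilistic verification metrics, computed from flattened forecast
    # probabilities p, 0/1 observations o, and the sample climatology climo.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def brier_score(p, o):
        return np.mean((p - o) ** 2)

    def brier_skill_score(p, o, climo):
        bs_ref = np.mean((climo - o) ** 2)        # reference forecast: sample climatology
        return 1.0 - brier_score(p, o) / bs_ref

    def reliability_component(p, o, bin_edges):
        # BSrely: squared difference between mean forecast probability and observed
        # relative frequency in each bin, weighted by the number of forecasts per bin
        total = 0.0
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            in_bin = (p >= lo) & ((p < hi) if hi < bin_edges[-1] else (p <= hi))
            if in_bin.any():
                total += in_bin.sum() * (p[in_bin].mean() - o[in_bin].mean()) ** 2
        return total / len(p)

    # Discrimination: area under the ROC curve
    # auc = roc_auc_score(o, p)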

The attributes diagram (Fig. 4) shows that each set of forecasts has good reliability, even up to forecast probabilities ≥ 75%. However, the RFCNTL almost exactly follows the perfect reliability line all the way up to the 25%–35% bin, while the baseline slightly underforecasts in the lowest bin and overforecasts in all of the higher bins. For all four objective metrics, improvements in the RFCNTL were statistically significant (Fig. 4).

Fig. 4.

Attributes diagram for all 192 WoFS initializations in the study period for the RFCNTL (black) and baseline (red) severe weather probabilities. The vertical dashed line at the bottom indicates the sample climatology, the next dashed gray line is the no-skill line, and the gray dashed line oriented at 45° is the perfect reliability line. The inset at the upper left shows the number of forecasts in each probability bin (log scale), and the table inset at the bottom right shows various skill metrics, where bold and italics in the “baseline” column indicate that differences with respect to RFCNTL were statistically significant.


To examine whether there is any dependence on initialization time, AUC and BSS are shown for each initialization time (Figs. 5a,b). Improvements in the RFCNTL relative to the baseline are generally consistent across the times, except at 1900 UTC, when the RFCNTL and baseline are quite similar. The AUCs (Fig. 5a) show a gradual increase in skill with later initialization times, while the BSS has a maximum at 2300 UTC (Fig. 5b).

Fig. 5.

(a) AUC and (b) BSS for the RFCNTL, baseline, RFStorm-only, and RFEnv-only severe weather probabilities for each initialization time. (c),(d) As in (a) and (b), but for RFCNTL, RFUH25-only, RFUP-only, RFUH02-only, RFWZ-only, RFHAIL-only, and RFWS-only. (e),(f) As in (a) and (b), but for RFCNTL, RFSmooth-only, RF90th-only, RFMax-only, and RFBest.


c. Sensitivity to predictors

To study the impact that different types of predictors in the RF algorithm have on forecast skill, the RF was trained with only specified subsets of predictors in three different experiments. In the first experiment, two RFs are trained, one with only storm fields (RFStorm-only) and one with only environmental fields (RFEnv-only); in the second experiment, six RFs are trained where each uses only a single set of storm fields (RFUH25-only, RFUP-only, RFUH02-only, RFWZ-only, RFHail-only, and RFWS-only); and in the third experiment, three RFs are trained where each uses only a single ensemble statistic for the storm fields (RFSmooth-only, RF90th-only, and RFMax-only). Finally, as part of experiment 3, we construct an optimized RF using the most skillful storm field and ensemble statistic applied to individual ensemble members. Table 2 provides specific information on the predictors used in each experiment, and the results for each experiment are discussed in the following sections.

Table 2

Skill metrics for baseline and various sets of RF-derived severe weather probabilities. For each experiment, the subsets are ranked from best to worst according to BSS (except for RFBest). Bold and italics indicate that differences relative to the best performing set of probabilities in each experiment were statistically significant using α = 0.05. For RFBest, statistical significance was tested relative to RFCNTL. The predictors used for each RF are provided in brackets, where the numbers correspond to the ID numbers used in Table 1.


1) Storm versus environment fields

The aggregate skill measures (Table 2) indicate that RFStorm-only performs significantly better than RFEnv-only. In fact, RFStorm-only performs almost as well as RFCNTL, indicating that the storm fields contribute the vast majority of forecast skill to RFCNTL. The differences between RFStorm-only and RFEnv-only widen with later initialization times, indicating that the storm fields have increasing importance for later WoFS initializations (Figs. 5a,b). We speculate that the relatively rapid drop in skill for RFEnv-only at the later initialization times is related to boundary layer stabilization, with some of the environment fields, especially thermodynamic ones, becoming less useful as storm coverage decreases and storms become elevated (i.e., inflow parcels originate above a stable boundary layer).

To understand how each set of predictors impacts the RFCNTL probabilities for individual cases, Figs. 6a–e compare severe weather probabilities from RFCNTL, RFStorm-only, and RFEnv-only for the 2200 UTC 1 May 2018 WoFS initializations. RFCNTL and RFStorm-only are much more similar than RFCNTL and RFEnv-only, and while RFEnv-only probabilities capture the areas where severe weather occurred, probability magnitudes are lower and there is greater false alarm area compared to RFStorm-only (Figs. 6a–c). To illustrate the impact of each set of predictors in the context of the full RF probabilities, differences between RFCNTL and RFStorm-only and between RFCNTL and RFEnv-only are shown in Figs. 6d–e. In Fig. 6d (RFCNTL − RFStorm-only), positive (negative) differences can be interpreted as areas where the environment fields increase (decrease) the probabilities in RFCNTL, and in Fig. 6e (RFCNTL − RFEnv-only), positive (negative) differences indicate areas where the storm fields increase (decrease) the probabilities in RFCNTL. For this case, it could be argued that the environment fields slightly improve the RFCNTL forecasts by increasing probabilities where there were LSRs in Iowa and east central Nebraska, and decreasing probabilities where LSRs were not reported over northeast Nebraska (Fig. 6d). For the storm fields, the impact on RFCNTL was much clearer compared to the environment fields. The storm fields clearly improved the RFCNTL forecasts by increasing the probabilities where LSRs were reported in central Kansas, Nebraska, and Iowa, while decreasing the probabilities where no LSRs were reported across much of southeast Kansas, northern Oklahoma, and western Missouri (Fig. 6e). However, it should be noted that the storm fields degraded the RFCNTL forecast by increasing probabilities where no LSRs occurred over northeast Kansas and northwest Missouri, and decreasing the probabilities where LSRs were reported near the Kansas–Oklahoma border.

Fig. 6.

Severe weather probabilities from 2200 UTC 1 May 2018 WoFS initializations derived from (a) RFCNTL, (b) RFStorm-only, and (c) RFEnv-only. (d) Difference between RFCNTL and RFStorm-only and (e) RFCNTL and RFEnv-only. (f)–(j) As in (a)–(e), but for 2200 UTC 4 May 2018 WoFS initializations.


Compared to the 1 May case, the environment fields for 4 May appeared to have a much greater positive impact on the forecast probabilities (Figs. 6f–j). For almost the entire corridor of wind reports stretching from Pennsylvania, through eastern New York and into Vermont, probabilities were increased by the environment fields. For this particular case, a combination of only marginally favorable thermodynamics, but extremely favorable kinematics for severe storms featuring very strong low-level wind fields, resulted in a very narrow, fast moving squall line that produced the majority of severe wind reports. Although the WoFS members depicted a narrow line of storms, there was not a strong signal from the storm fields, especially compared to events typical of the central plains featuring much more instability. We speculate that for these types of cases that feature shallower storms with narrower updrafts, higher resolution is needed to adequately depict the storm dynamics, and thus environmental indicators become more important within the RF algorithm.

2) Individual storm field types

To study the impact that different types of storm fields in the RF algorithm have on forecast skill, another set of experiments was conducted in which the RF was trained using only a single type of storm field along with the environment predictors (Table 2; Experiment 2). Aggregate skill measures (Table 2) indicate that RFUH25-only and RFUP-only performed best, with both AUC and BSS almost as skillful as RFCNTL. The other storm field RFs (i.e., RFUH02-only, RFWZ-only, RFHail-only, and RFWS-only) performed significantly worse than RFUH25-only, with RFWS-only having the lowest skill. Examining the storm field RF forecast skill as a function of initialization time (Figs. 5c,d) shows that RFUH25-only and RFUP-only have the biggest advantage for the 2000–2200 UTC WoFS initializations when both their AUCs and BSSs are almost identical to RFCNTL. At initializations after 2200 UTC, some of the other RFs (with the exception of RFWS-only) are competitive with RFUH25-only and RFUP-only, and the advantage of RFCNTL is larger. Note, most of the differences between pairs of forecasts at individual initialization times in Experiment 2 are not statistically significant, except for pairs including RFWS-only (not shown). Because the sample size for the individual initialization times is much smaller (by a factor of 8), fewer significant differences occur relative to Table 2, which combines the forecasts from all the initializations. The finding that midlevel UH and maximum updraft speed work so well as predictors is consistent with past works finding that these diagnostics work well as proxies for any type of severe weather (e.g., Sobash et al. 2011, 2016a; Clark et al. 2018), since strong updrafts and midlevel rotation are commonly present in multiple modes of severe convection. While some of the other storm fields RFs do not perform as well when verified against all types of LSRs, it is quite possible that they would perform better when used to predict specific hazards for which they were designed (e.g., using RFHail-only for hail verification or RFWS-only for wind verification).

To examine the impact of the individual storm field RFs for the 2200 UTC 1 May 2018 WoFS initialization, the severe weather probabilities for each RF are displayed in Figs. 7a–f. With the exception of RFWS-only (Fig. 7f), each of the storm field RFs highlights similar regions of higher probabilities, with the main differences being in the areal coverage of the lower-magnitude probabilities. Specifically, RFUH02-only and RFWZ-only (Figs. 7c,d) highlight larger areas with probabilities of 2%–15% over eastern Kansas and western Missouri compared to RFUH25-only, RFUP-only, and RFHail-only. Inspection of the 0–2 km AGL UH and 0–2 km AGL vertical vorticity revealed that much of the area with lower probabilities contains nonzero values that likely result from boundary layer processes outside of strong convection (not shown). Thus, we speculate that, because these predictors are not as strongly related to severe weather as 2–5 km AGL UH and maximum updraft speed, the smaller values outside of strong convection can still result in the RF producing low, nonzero probabilities. For RFHail-only, longevity and updraft criteria within the HAILCAST algorithm ensure that nonzero values are only produced for storms of sufficient magnitude; thus, the expansion of the lower probabilities is not observed in RFHail-only (Fig. 7e). The severe weather probabilities for RFWS-only (Fig. 7f) are clearly the most different relative to the other storm field RFs. Probabilities in the range 2%–30% occur over a large area covering eastern Kansas and western Missouri where probabilities from the other storm field RFs were much lower. Inspection of the 80 m AGL maximum winds revealed that much of this area contained modest wind speeds, and it was difficult to separate strong environmental winds from those resulting from convection.

Fig. 7.

Severe weather probabilities from 2200 UTC 1 May 2018 WoFS initializations derived from (a) RFUH25-only, (b) RFUP-only, (c) RFUH02-only, (d) RFWZ-only, (e) RFHAIL-only, and (f) RFWS-only. (g)–(l) As in (a)–(f), but for 2200 UTC 4 May WoFS initializations.


For the 4 May 2018 case (Figs. 7g–l), the RFUH25-only and RFUP-only once again highlight similar areas. RFUH02-only and RFWZ-only have higher magnitude probabilities that extend further east relative to RFUH25-only and RFUP-only. RFHail-only highlights a much smaller region than the other storm fields, while the RFWS-only highlights a much larger region.

3) Storm field statistics

To study the impact that the different storm field statistics in the RF algorithm have on forecast skill, the third set of experiments examines RFs trained using only a single type of summary measure for the storm fields along with the environment predictors (Table 2; Experiment 3). The aggregate skill measures presented in Table 2 indicate that RFSmooth-only performs best and is almost as skillful as RFCNTL. RF90th-only is slightly less skillful, while RFMax-only is the least skillful, with differences in both AUC and BSS that are statistically significant compared to RFSmooth-only. The differences in AUC and BSS are quite consistent across the different initialization times (Figs. 5e,f). As illustrated in the 1 and 4 May 2018 example cases (Fig. 8), the smoothed mean results in RF-derived probabilities that are less noisy and generally higher in amplitude. Visual inspection of several other cases also revealed that RFSmooth-only often has lower probabilities compared to RF90th-only and RFMax-only in areas without observed severe weather (not shown). We speculate that these are cases in which two or three members have a strong signal for severe storms that do not occur. In such cases, there would be a strong signal in the maximum and 90th percentile fields, while the smoothed mean would dampen the signal considerably.

Fig. 8.

Severe weather probabilities from 2200 UTC 1 May 2018 WoFS initializations derived from (a) RFSmooth-only, (b) RF90th-only, and (c) RFMax-only. (d)–(f) As in (a)–(c), but for 2200 UTC 4 May 2018 WoFS initializations.


To see if we can further optimize forecast skill, an RF is constructed in which the three 2–5 km AGL UH predictors are replaced by 18 predictors composed of the smoothed 2–5 km AGL UH from each individual WoFS member. The 2–5 km AGL UH was used since it was the storm field that resulted in the most skillful RF, and the members were smoothed since the smoothed mean was the ensemble statistic that gave the most skillful RF. The smoothing of the members is applied as it was for the ensemble mean (i.e., a Gaussian kernel density function with σ = 18 km), and this RF is named RFBest. RFBest is different from the other RFs in that it is the only RF that uses information from individual ensemble members for any of the storm fields. It could be thought of as using a “paintball plot” of information rather than a single ensemble statistic. Note, RFBest still uses the ensemble statistics for the five other storm attribute fields.
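A minimal sketch of assembling the RFBest predictor matrix at one analysis time is given below; member_uh and other_predictors are illustrative arrays holding the run-maximum member UH and the remaining 27 predictors (the RFCNTL set with the three UH statistics removed), respectively.

    # Assembly of the RFBest predictors: the three 2-5 km AGL UH ensemble statistics are
    # replaced by the smoothed run-maximum UH from each of the 18 members
    # (sigma = 18 km, i.e., 6 grid points at 3-km spacing).
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def rfbest_predictors(member_uh, other_predictors, sigma_km=18.0, dx_km=3.0):
        # member_uh: (18, ny, nx) run-maximum 2-5 km AGL UH per member
        # other_predictors: (ny, nx, 27) remaining RFCNTL predictors
        smoothed = np.stack([gaussian_filter(m, sigma=sigma_km / dx_km)
                             for m in member_uh], axis=-1)          # (ny, nx, 18)
        return np.concatenate([other_predictors, smoothed], axis=-1)  # (ny, nx, 45)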

RFBest outperforms all other RFs (Table 2). In hypothesis tests comparing RFBest to RFCNTL, BSS differences were significant, while AUC differences were not quite significant. Interestingly, the advantage for RFBest is heavily weighted toward earlier initializations (Figs. 5e,f), at which time the differences are quite dramatic. For both BSS and AUC, the largest advantage for RFBest is at 1900 UTC with differences decreasing each subsequent initialization. For AUC (BSS), RFBest has about the same skill as RFCNTL by about 2200 UTC (0100 UTC). We believe that these differences are linked to the timing of convective initiation (CI) and associated uncertainty in the WoFS forecasts. At early WoFS initializations (i.e., 1900–2100 UTC), storm coverage is typically limited with the majority of CI having not yet occurred. Thus, without having assimilated radar data depicting ongoing storms, the location of CI and subsequent storm evolution is much more uncertain relative to later times when there is more ongoing convection. At these early times, the ensemble summary measures would be much less effective at characterizing the actual ensemble member distribution, and thus the information from the ensemble members provides more useful and unique information to the RF. To illustrate this point, consider two situations, both where the ensemble mean 2–5 km AGL UH is 50 m2 s−2. In the first situation, all 18 members have storms with UH of 50 m2 s−2. In the second, 9 members have no storm (UH of 0 m2 s−2), while 9 members have a storm with UH of 100 m2 s−2. In the first case, there is a strong likelihood of a moderately intense storm, while in the second case there is more uncertainty (i.e., a lower probability) of a higher-intensity storm. Knowing the difference between these situations likely boosts RF skill, and the scenario in the second case likely happens more often during the earlier initialization times, before storms have developed.

To better understand the differences between RFCNTL and RFBest for earlier WoFS initializations, the daily differences in AUC and BSS at 1900 UTC were computed (not shown). For AUC, there was only one case in which RFCNTL performed better than RFBest, and for BSS there were only three cases. Thus, the advantage of RFBest was quite consistent. Furthermore, we visually inspected the RFCNTL and RFBest forecasts for all 24 cases initialized at 1900 UTC. In Fig. 9, two cases are shown for which RFBest provided obvious advantages over RFCNTL (12 and 15 May; Figs. 9a–f), one case is shown for which there were similar advantages and disadvantages (19 May; Figs. 9g–i), and one case where RFBest performed worse than RFCNTL (16 May; Figs. 9j–l). For each of these four cases, a common theme was that RFBest had higher magnitude probabilities, but over smaller areas. Thus, including the individual members appears to sharpen the probability distributions. For the 12 and 15 May cases, the higher magnitude probabilities corresponded to the locations of LSRs and probabilities were reduced or eliminated over areas without LSRs (Figs. 9a–f). For the 19 May case, although RFBest probabilities were increased over areas of north central Oklahoma and southeast Kansas where there were LSRs, false alarm area was increased over western Oklahoma and probabilities were decreased where there were LSRs in Missouri (Figs. 9g–i). Finally, for the 16 May case, there were no LSRs at all, but RFBest generally increased probabilities, resulting in larger errors relative to RFCNTL. For the 12 May case (Figs. 9a–c), CI had not yet occurred or was ongoing, so this is consistent with our hypothesis about individual members providing more useful information in the pre-CI period when uncertainty is higher. However, for the 15 May case, there was an ongoing squall line at 1900 UTC, but it is still possible that there was considerable uncertainty regarding the evolution of the squall line so that the individual members provided unique information that improved RFBest.

Fig. 9.

Severe weather probabilities from 1900 UTC 12 May 2018 WoFS initializations derived from (a) RFCNTL, (b) RFBest, and (c) RFBest − RFCNTL. (d)–(f),(g)–(i),(j)–(l) As in (a)–(c), but for 1900 UTC 15, 19, and 16 May 2018 WoFS initializations, respectively.


Herman and Schumacher (2016, 2018) also examined the impact of including additional ensemble information in GEFS/R-based ML forecasts of fog/visibility and extreme precipitation, respectively. They found that the value provided by using several specified percentiles or all individual members was small and attributed this to GEFS/R being underdispersive at the lead times examined so that ensemble member information was not providing unique information in addition to the mean. This reasoning is consistent with our own speculation on when WoFS forecasts have the most uncertainty and the corresponding impact on forecasts using additional member information.

4) Applicability to another year (2019)

An important consideration for the RF is whether the results generalize to years other than 2018. To address this concern, we generated predictors from the 2019 WoFS dataset in the same way as for the 2018 WoFS dataset. The 2019 WoFS dataset consists of 25 cases. We used the same 8 initialization times as for 2018, so for 2019 there were 8 initializations × 25 cases = 200 sets of forecasts. For 2019, two environmental variables used for 2018 were not available: 2 m AGL virtual potential temperature and 0–3 km AGL CAPE. We tested the impact of removing these predictors on the 2018 dataset and found it to be negligible. Then, we trained an RF using the 2018 predictors (without 2-m virtual potential temperature and 0–3 km AGL CAPE) and used it to generate probabilities from the 2019 dataset. A very important caveat here is that there are several differences between the 2018 and 2019 WoFS configurations. These differences include: 1) increased domain size in 2019, 2) data assimilation software switched from DART to GSI [Gridpoint Statistical Interpolation; Developmental Testbed Center (2017)], 3) changes in the way some variables were postprocessed (e.g., UH calculations and mixed-layer environmental variables were computed differently), and 4) changes to the HRRRE configuration used for ICs/LBCs. Because of these inconsistencies, we do not expect 2019 forecasts generated from an RF trained with 2018 data (hereafter, RF2019) to perform as well as 2019 forecasts generated from an RF trained with 2019 data via k-fold cross validation (hereafter, RF2019KFOLD).

Nevertheless, we tested these RFs and compared against a baseline for 2019 generated in the same way as for 2018 (i.e., finding an optimized 2–5 km AGL UH percentile and smoothing level). The results of this test are displayed in Fig. 10. Note, instead of sixfold cross validation, we used fivefold cross validation, which results in 40 initializations per fold. This means that for each iteration of cross validation, forecasts are generated for 40 WoFS initializations using training on the other 160 WoFS initializations. The results reveal that—even though there are inconsistencies between the 2018 training and 2019 testing WoFS datasets—RF2019 still outperforms the baseline skill as measured by AUC, BS, and BSS. However, BSrely is slightly better in the baseline relative to RF2019. The degradation in reliability is very likely a direct result of the inconsistencies between the 2018 and 2019 WoFS datasets. It is also possible that differences between the 2018 and 2019 cases meant that the 2018 training data did not adequately capture the range of convective events that occurred in 2019. On the other hand, if we apply k-fold cross validation to generate probabilities using solely 2019 WoFS data, BSrely is dramatically improved in RF2019KFOLD relative to RF2019. The other skill metrics—AUC, BS, and BSS—are also improved in RF2019KFOLD. Visual inspection of the forecasts reveals that the differences between RF2019 and RF2019KFOLD are at times quite noticeable. Similar to 2018, there were many cases in which the RF-derived forecasts were clearly an improvement relative to the baseline. To illustrate some of these cases, Fig. 11 displays these three sets of forecasts for three different WoFS initializations from 2019. In each of these cases, there are obvious improvements over the baseline in both RFs, and RF2019KFOLD clearly reduced false alarm area relative to RF2019. While these results illustrate that it is possible to obtain useful forecasts even when using inconsistent datasets for training, the best results are obtained when the datasets are consistent.

Fig. 10.

Attributes diagram for all 200 WoFS initializations in 2019 for the RF2019, RF2019KFOLD, and baseline severe weather probabilities. The vertical dashed line at the bottom indicates the sample climatology, the next dashed gray line is the no-skill line, and the gray dashed line oriented at 45° is the perfect reliability line. The inset at the upper left shows the number of forecasts in each probability bin (log scale), and the table inset at the bottom right shows various skill metrics, where bold and italics in the RF2019 and RF2019KFOLD columns indicate that differences with respect to the baseline were statistically significant.


Fig. 11.

Severe weather probabilities derived from 2200 UTC WoFS initializations on 17 May 2019 using the (a) baseline probabilities, (b) RF2019 probabilities, and (c) RF2019KFOLD probabilities. (d)–(f),(g)–(i) As in (a)–(c), but for 2000 UTC 21 May and 2200 UTC 30 May WoFS initializations, respectively. In each panel, the observed storm report locations are overlaid [legend in (a)].


4. Summary and conclusions

Severe weather probabilities were derived from a machine learning algorithm known as the random forest (RF), with predictors from a prototype 18-member, 3-km grid-spacing Warn-on-Forecast System (WoFS) that was run during the 2018 NOAA Hazardous Weather Testbed Spring Forecasting Experiment. The forecasts covered 0–3-h lead times for hourly WoFS initializations from 1900 to 0200 UTC the following day for 24 days during spring 2018, resulting in a total of 192 sets of 18-member forecasts (24 cases × 8 initializations = 192). Recent work has shown that RFs can generate very skillful and reliable forecasts when applied to convection-allowing model ensembles for the “Day 1” time range (i.e., 12–36-h lead times), but they have been tested in only one other study for the shorter lead times relevant to WoFS (e.g., 0–3 h). Thus, this paper developed an RF using WoFS predictors, trained with local storm reports (LSRs), to forecast the probability of severe weather within 39 km of a point at 0–3-h lead times.

A main emphasis of this research was how to preprocess the WoFS predictors to reduce the dataset dimensionality while retaining the most important information from the forecasts. With 18 ensemble members, dozens of forecast fields, and output provided every 5 min, it is easy to see how the number of predictors can become unmanageable. For our purposes, most of the RFs analyzed used various combinations of 30 predictors that were classified as environment, hourly maximum storm fields, or miscellaneous. For each environment field (11 total), ensemble means were computed at forecast hours 1, 2, and 3, and then the mean over those three hours was used as a predictor. For the storm fields (6 total), the maximum value over forecast hours 0–3 was found for each member, and three ensemble statistics were computed to characterize the ensemble distribution and serve as predictors: 1) the maximum of all members, 2) the 90th percentile of all members, and 3) the smoothed mean (Gaussian kernel density function with σ = 18 km). Additionally, a non-RF baseline forecast was formulated to put the RF forecasts in appropriate context. The baseline forecast was computed by finding the 2–5 km AGL UH percentile and Gaussian smoothing parameter (σ) that result in the most skillful forecasts according to the Brier score. Main findings are summarized below.

  1. The RF using a set of 30 predictors (RFCNTL) significantly outperformed the baseline forecasts according to several objective metrics of forecast skill. Examination of example cases illustrated that the RFCNTL forecasts contained sharper probability gradients that skillfully delineated where LSRs occurred. However, at times, the RFCNTL forecasts contained notable areas of false alarm.

  2. Storm fields were found to make a much greater contribution to RFCNTL skill than environment fields, with the storm fields becoming even more important at later initialization times. The storm fields that resulted in the most skillful RFs were 2–5 km AGL UH and maximum updraft speed. This finding is consistent with past studies showing that these diagnostics are skillful proxies for any type of severe weather, since strong updrafts and midlevel rotation occur in multiple modes of severe convection.

  3. The storm field ensemble statistic that resulted in the most skillful RF was the smoothed mean.

  4. To see whether RF skill could be further optimized, a RF was constructed in which the ensemble statistics for 2–5 km AGL UH were replaced by the individual-member smoothed 2–5 km AGL UH fields (a minimal sketch of this predictor substitution follows this list). This optimized RF (named RFBest) outperformed all other RFs, and the advantage of RFBest was particularly large at earlier initializations. This early advantage is likely linked to the timing of convective initiation (CI) and the associated uncertainty in the WoFS forecasts, which is typically larger at earlier, pre-CI initializations. At these times, the ensemble summary measures are likely less effective at characterizing the actual ensemble member distributions, and thus information from the members themselves provides more useful and unique information to the RF. Inspection of example cases finds that RFBest sharpens the probability distributions relative to the other RFs so that they better correspond with LSRs.
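The sketch below illustrates the predictor substitution described in item 4 under the same assumptions as the earlier sketches (placeholder arrays, Gaussian-filter smoothing with σ = 6 grid points on the 3-km grid); it is illustrative only and is not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

n_mem, ny, nx = 18, 300, 300
SIGMA_GRIDPTS = 18.0 / 3.0                        # 18-km smoothing on the 3-km grid
member_max_uh = np.random.rand(n_mem, ny, nx)     # 0-3-h maximum 2-5 km AGL UH per member (placeholder)

# One smoothed predictor map per member: 18 UH predictors in place of the 3 ensemble statistics.
smoothed_member_uh = np.stack(
    [gaussian_filter(member_max_uh[m], sigma=SIGMA_GRIDPTS) for m in range(n_mem)],
    axis=-1)                                      # (ny, nx, 18)
X_uh = smoothed_member_uh.reshape(-1, n_mem)      # rows = grid points, columns = members
```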

While the results of this study provide a meaningful proof of concept and give some insight into the role of different types of predictors, a major limitation is the relatively small sample size. It remains an open question whether the results would generalize to cases beyond late spring and to a more diverse set of severe weather environments. However, our result that environmental field predictors added little value to RF-derived severe weather probabilities suggests that, as long as WoFS has a reasonable depiction of the observed storms, the storm attribute fields will provide most of the skill to the RF, so it is possible that the RF would perform well over a more diverse set of environments. Furthermore, testing on the 2019 WoFS dataset illustrates applicability across different years. Work is ongoing at NSSL to develop a much larger database of WoFS cases so that these issues can be further addressed. Operational application of RF-based WoFS severe weather probabilities will require additional considerations. At short lead times, NWS forecast and warning products can be issued and changed quickly based on the rapid initiation and evolution of observed storms. Since WoFS latency is at least 20 min, observed radar and satellite data, and algorithms derived from them such as ProbSevere (Cintineo et al. 2014, 2018, 2020), contain more up-to-date information and are thus potentially more useful to forecasters making warning decisions. Therefore, we believe that the ideal way to apply machine learning for short-term forecasting applications is to combine dynamical forecast data from WoFS with radar- and satellite-based tools like ProbSevere in a system that updates very rapidly (e.g., every 5 min) using the most up-to-date output from both systems. This type of probabilistic forecasting system fits well within the FACETs paradigm (Forecasting A Continuum of Environmental Threats; Rothfusz et al. 2018), which is envisioned to shift the NWS from deterministic to probabilistic watch and warning products. Future work is planned to pursue these avenues of research.

Acknowledgments.

EDL was provided support for this work by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma (OU) Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce. This work comprised regular duties at the federally funded NOAA/NSSL for AJC. Additional support was provided by the Developmental Testbed Center (DTC) Visitor Program. The DTC Visitor Program is funded by the National Oceanic and Atmospheric Administration, the National Center for Atmospheric Research, and the National Science Foundation. We would also like to acknowledge high-performance computing support from Cheyenne (doi:10.5065/D6RX99HX), provided by NCAR's Computational and Information Systems Laboratory and sponsored by the National Science Foundation.

Data availability statement.

The WoFS forecast data used in this study are not currently available in a publicly accessible repository. However, the data and code used to generate the results herein are available from the authors upon request.

REFERENCES

  • Adams-Selin, R. D., A. J. Clark, C. J. Melick, S. R. Dembek, I. L. Jirak, and C. L. Ziegler, 2019: Evolution of WRF-HAILCAST during the 2014–16 NOAA/Hazardous Weather Testbed Spring Forecasting Experiment. Wea. Forecasting, 34, 61–79, https://doi.org/10.1175/WAF-D-18-0024.1.
  • Anderson, J. L., 2001: An ensemble adjustment Kalman filter for data assimilation. Mon. Wea. Rev., 129, 2884–2903, https://doi.org/10.1175/1520-0493(2001)129<2884:AEAKFF>2.0.CO;2.
  • Anderson, J. L., T. Hoar, K. Raeder, H. Liu, N. Collins, R. Torn, and A. Avellano, 2009: The Data Assimilation Research Testbed: A community facility. Bull. Amer. Meteor. Soc., 90, 1283–1296, https://doi.org/10.1175/2009BAMS2618.1.
  • Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. Mon. Wea. Rev., 131, 1509–1523, https://doi.org/10.1175//1520-0493(2003)131<1509:SAIVOT>2.0.CO;2.
  • Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  • Brooks, H. E., C. A. Doswell III, and M. P. Kay, 2003: Climatological estimates of local daily tornado probability. Wea. Forecasting, 18, 626–640, https://doi.org/10.1175/1520-0434(2003)018<0626:CEOLDT>2.0.CO;2.
  • Burke, A., N. Snook, D. J. Gagne, S. McCorkel, and A. McGovern, 2020: Calibration of machine learning-based probabilistic hail predictions for operational forecasting. Wea. Forecasting, 35, 149–168, https://doi.org/10.1175/WAF-D-19-0105.1.
  • Cintineo, J. L., T. M. Smith, V. Lakshmanan, H. E. Brooks, and K. L. Ortega, 2012: An objective high-resolution hail climatology of the contiguous United States. Wea. Forecasting, 27, 1235–1248, https://doi.org/10.1175/WAF-D-11-00151.1.
  • Cintineo, J. L., M. J. Pavolonis, J. M. Sieglaff, and D. T. Lindsey, 2014: An empirical model for assessing the severe weather potential of developing convection. Wea. Forecasting, 29, 639–653, https://doi.org/10.1175/WAF-D-13-00113.1.
  • Cintineo, J. L., and Coauthors, 2018: The NOAA/CIMSS ProbSevere Model: Incorporation of total lightning and validation. Wea. Forecasting, 33, 331–345, https://doi.org/10.1175/WAF-D-17-0099.1.
  • Cintineo, J. L., M. J. Pavolonis, J. M. Sieglaff, L. Cronce, and J. Brunner, 2020: NOAA ProbSevere v2.0–ProbHail, ProbWind, and ProbTor. Wea. Forecasting, 35, 1523–1543, https://doi.org/10.1175/WAF-D-19-0242.1.
  • Clark, A. J., and Coauthors, 2012a: An overview of the 2010 Hazardous Weather Testbed Experimental Forecast Program Spring Experiment. Bull. Amer. Meteor. Soc., 93, 55–74, https://doi.org/10.1175/BAMS-D-11-00040.1.
  • Clark, A. J., J. S. Kain, P. T. Marsh, J. Correia Jr., M. Xue, and F. Kong, 2012b: Forecasting tornado pathlengths using a three-dimensional object identification algorithm applied to convection-allowing forecasts. Wea. Forecasting, 27, 1090–1113, https://doi.org/10.1175/WAF-D-11-00147.1.
  • Clark, A. J., J. Gao, P. T. Marsh, T. Smith, J. S. Kain, J. Correia Jr., M. Xue, and F. Kong, 2013: Tornado pathlength forecasts from 2010 to 2011 using ensemble updraft helicity. Wea. Forecasting, 28, 387–407, https://doi.org/10.1175/WAF-D-12-00038.1.
  • Clark, A. J., and Coauthors, 2018: The Community Leveraged Unified Ensemble (CLUE) in the 2016 NOAA/Hazardous Weather Testbed Spring Forecasting Experiment. Bull. Amer. Meteor. Soc., 99, 1433–1448, https://doi.org/10.1175/BAMS-D-16-0309.1.
  • Developmental Testbed Center, 2017: Gridpoint statistical interpolation user’s guide version 3.6. Developmental Testbed Center Doc., 158 pp., https://dtcenter.org/com-GSI/users/docs/.
  • Flora, M. L., C. K. Potvin, P. S. Skinner, S. Handler, and A. McGovern, 2021: Using machine learning to generate storm-scale probabilistic guidance of severe weather hazards in the Warn-on-Forecast System. Mon. Wea. Rev., 149, 1535–1557, https://doi.org/10.1175/MWR-D-20-0194.1.
  • Gagne, D. J., A. McGovern, S. E. Haupt, R. A. Sobash, J. K. Williams, and M. Xue, 2017: Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Wea. Forecasting, 32, 1819–1840, https://doi.org/10.1175/WAF-D-17-0010.1.
  • Gagne, D. J., S. E. Haupt, D. W. Nychka, and G. Thompson, 2019: Interpretable deep learning for spatial analysis of severe hailstorms. Mon. Wea. Rev., 147, 2827–2845, https://doi.org/10.1175/MWR-D-18-0316.1.
  • Gallo, B. T., A. J. Clark, and S. R. Dembek, 2016: Forecasting tornadoes using convection-permitting ensembles. Wea. Forecasting, 31, 273–295, https://doi.org/10.1175/WAF-D-15-0134.1.
  • Gallo, B. T., and Coauthors, 2017: Breaking new ground in severe weather prediction: The 2015 NOAA/Hazardous Weather Testbed Spring Forecasting Experiment. Wea. Forecasting, 32, 1541–1568, https://doi.org/10.1175/WAF-D-16-0178.1.
  • Gallo, B. T., A. J. Clark, B. T. Smith, R. L. Thompson, I. Jirak, and S. R. Dembek, 2018a: Blended probabilistic tornado forecasts: Combining climatological frequencies with NSSL-WRF ensemble forecasts. Wea. Forecasting, 33, 443–460, https://doi.org/10.1175/WAF-D-17-0132.1.
  • Gallo, B. T., and Coauthors, 2018b: Spring Forecasting Experiment 2018: Program overview and operations plan. NOAA/NSSL, 46 pp., https://hwt.nssl.noaa.gov/sfe/2018/docs/HWT_SFE2018_operations_plan.pdf.
  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
  • Hamill, T. M., G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J. Galarneau Jr., Y. Zhu, and W. Lapenta, 2013: NOAA’s second-generation global medium-range ensemble reforecast dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565, https://doi.org/10.1175/BAMS-D-12-00014.1.
  • Haupt, S. E., and Coauthors, 2022: The history and practice of AI in the environmental sciences. Bull. Amer. Meteor. Soc., 103, E1351–E1370, https://doi.org/10.1175/BAMS-D-20-0234.1.
  • Herman, G. R., and R. S. Schumacher, 2016: Using reforecasts to improve forecasting of fog and visibility for aviation. Wea. Forecasting, 31, 467–482, https://doi.org/10.1175/WAF-D-15-0108.1.
  • Herman, G. R., and R. S. Schumacher, 2018: Money doesn’t grow on trees, but forecasts do: Forecasting extreme precipitation with random forests. Mon. Wea. Rev., 146, 1571–1600, https://doi.org/10.1175/MWR-D-17-0250.1.
  • Hill, A. J., and R. S. Schumacher, 2021: Forecasting excessive rainfall with random forests and a deterministic convection-allowing model. Wea. Forecasting, 36, 1693–1711, https://doi.org/10.1175/WAF-D-21-0026.1.
  • Hill, A. J., G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. Mon. Wea. Rev., 148, 2135–2161, https://doi.org/10.1175/MWR-D-19-0344.1.
  • Hill, A. J., C. C. Weiss, and D. C. Dowell, 2021: Influence of a portable near-surface observing network on experimental ensemble forecasts of deep convection hazards during VORTEX-SE. Wea. Forecasting, 36, 1141–1167, https://doi.org/10.1175/WAF-D-20-0237.1.
  • Hsu, W.-R., and A. H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293, https://doi.org/10.1016/0169-2070(86)90048-8.
  • Iacono, M. J., J. S. Delamere, E. J. Mlawer, M. W. Shephard, S. A. Clough, and W. D. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
  • Janjić, Z. I., 2001: Nonsingular implementation of the Mellor–Yamada level 2.5 scheme in the NCEP Meso Model. NCEP Office Note 437, 61 pp., http://www.emc.ncep.noaa.gov/officenotes/newernotes/on437.pdf.
  • Jergensen, G. E., A. McGovern, R. Lagerquist, and T. Smith, 2020: Classifying convective storms using machine learning. Wea. Forecasting, 35, 537–559, https://doi.org/10.1175/WAF-D-19-0170.1.
  • Jirak, I. L., C. J. Melick, and S. J. Weiss, 2016: Comparison of the SPC storm-scale ensemble of opportunity to other convection-allowing ensembles for severe weather forecasting. 28th Conf. on Severe Local Storms, Portland, OR, Amer. Meteor. Soc., 102, https://ams.confex.com/ams/28SLS/webprogram/Paper300910.html.
  • Kain, J. S., S. R. Dembek, S. J. Weiss, J. L. Case, J. J. Levit, and R. A. Sobash, 2010: Extracting unique information from high-resolution forecast models: Monitoring selected fields and phenomena every time step. Wea. Forecasting, 25, 1536–1542, https://doi.org/10.1175/2010WAF2222430.1.
  • Kalina, E. A., I. Jankov, T. Alcott, J. Olson, J. Beck, J. Berner, D. Dowell, and C. Alexander, 2021: A progress report on the development of the High-Resolution Rapid Refresh Ensemble. Wea. Forecasting, 36, 791–804, https://doi.org/10.1175/WAF-D-20-0098.1.
  • Lagerquist, R., A. McGovern, and T. Smith, 2017: Machine learning for real-time prediction of damaging straight-line convective wind. Wea. Forecasting, 32, 2175–2193, https://doi.org/10.1175/WAF-D-17-0038.1.
  • Lagerquist, R., A. McGovern, C. R. Homeyer, D. J. Gagne, and T. Smith, 2020: Deep learning on three-dimensional multiscale data for next-hour tornado prediction. Mon. Wea. Rev., 148, 2837–2861, https://doi.org/10.1175/MWR-D-19-0372.1.
  • Loken, E. D., A. J. Clark, A. McGovern, M. Flora, and K. Knopfmeier, 2019: Postprocessing next-day ensemble probabilistic precipitation forecasts using random forests. Wea. Forecasting, 34, 2017–2044, https://doi.org/10.1175/WAF-D-19-0109.1.
  • Loken, E. D., A. J. Clark, and C. D. Karstens, 2020: Generating probabilistic next-day severe weather forecasts from convection-allowing ensembles using random forests. Wea. Forecasting, 35, 1605–1631, https://doi.org/10.1175/WAF-D-19-0258.1.
  • Mansell, E. R., 2010: On sedimentation and advection in multimoment bulk microphysics. J. Atmos. Sci., 67, 3084–3094, https://doi.org/10.1175/2010JAS3341.1.
  • Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
  • Mellor, G. L., and T. Yamada, 1982: Development of a turbulence closure model for geophysical fluid problems. Rev. Geophys., 20, 851–875, https://doi.org/10.1029/RG020i004p00851.
  • Miller, W. J., and Coauthors, 2022: Exploring the usefulness of downscaling free forecasts from the Warn-on-Forecast System. Wea. Forecasting, 37, 181–203, https://doi.org/10.1175/WAF-D-21-0079.1.
  • Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono, and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated-k model for the longwave. J. Geophys. Res., 102, 16 663–16 682, https://doi.org/10.1029/97JD00237.
  • Muñoz-Esparza, D., R. D. Sharman, and W. Deierling, 2020: Aviation turbulence forecasting at upper levels with machine learning techniques based on regression trees. J. Appl. Meteor. Climatol., 59, 1883–1899, https://doi.org/10.1175/JAMC-D-20-0116.1.
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  • Nakanishi, M., 2000: Large-eddy simulation of radiation fog. Bound.-Layer Meteor., 94, 461–493, https://doi.org/10.1023/A:1002490423389.
  • Nakanish, M., 2001: Improvement of the Mellor-Yamada turbulence closure model based on large-eddy simulation data. Bound.-Layer Meteor., 99, 349–378, https://doi.org/10.1023/A:1018915827400.
  • Nakanishi, M., and H. Niino, 2004: An improved Mellor-Yamada level-3 model with condensation physics: Its design and verification. Bound.-Layer Meteor., 112, 1–31, https://doi.org/10.1023/B:BOUN.0000020164.04146.98.
  • Nakanishi, M., and H. Niino, 2006: An improved Mellor-Yamada level-3 model: Its numerical stability and application to a regional prediction of advection fog. Bound.-Layer Meteor., 119, 397–407, https://doi.org/10.1007/s10546-005-9030-8.
  • Noh, Y., W. G. Cheon, S.-Y. Hong, and S. Raasch, 2003: Improvement of the K-profile model for the planetary boundary layer based on large eddy simulation data. Bound.-Layer Meteor., 107, 401–427, https://doi.org/10.1023/A:1022146015946.
  • Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
  • Platt, J. C., 1999: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, A. J. Smola et al., Eds., MIT Press, 61–74.
  • Potvin, C. K., and Coauthors, 2019: Systematic comparison of convection-allowing models during the 2017 NOAA HWT Spring Forecasting Experiment. Wea. Forecasting, 34, 1395–1416, https://doi.org/10.1175/WAF-D-19-0056.1.
  • Potvin, C. K., and Coauthors, 2020: Assessing systematic impacts of PBL schemes on storm evolution in the NOAA Warn-on-Forecast System. Mon. Wea. Rev., 148, 2567–2590, https://doi.org/10.1175/MWR-D-19-0389.1.
  • Roberts, B., B. T. Gallo, I. Jirak, A. J. Clark, D. C. Dowell, X. Wang, and Y. Wang, 2020: What does a convection-allowing ensemble of opportunity buy us in forecasting thunderstorms? Wea. Forecasting, 35, 2293–2316, https://doi.org/10.1175/WAF-D-20-0069.1.
  • Rothfusz, L. P., R. Schneider, D. Novak, K. Klockow-McClain, A. E. Gerard, C. D. Karstens, G. J. Stumpf, and M. Smith, 2018: FACETs: A proposed next-generation paradigm for high-impact weather forecasting. Bull. Amer. Meteor. Soc., 99, 2025–2043, https://doi.org/10.1175/BAMS-D-16-0100.1.
  • Schumacher, R. S., A. J. Hill, M. Klein, J. A. Nelson, M. J. Erickson, S. M. Trojniak, and G. R. Herman, 2021: From random forests to flood forecasts: A research to operations success story. Bull. Amer. Meteor. Soc., 102, E1742–E1755, https://doi.org/10.1175/BAMS-D-20-0186.1.
  • Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 3397–3418, https://doi.org/10.1175/MWR-D-16-0400.1.
  • Schwartz, C. S., G. S. Romine, M. L. Weisman, R. A. Sobash, K. R. Fossell, K. W. Manning, and S. B. Trier, 2015: A real-time convection-allowing ensemble prediction system initialized by mesoscale ensemble Kalman filter analyses. Wea. Forecasting, 30, 1158–1181, https://doi.org/10.1175/WAF-D-15-0013.1.
  • Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp., https://doi.org/10.5065/D68S4MVH.
  • Smirnova, T. G., J. M. Brown, and S. G. Benjamin, 1997: Performance of different soil model configurations in simulating ground surface temperature and surface fluxes. Mon. Wea. Rev., 125, 1870–1884, https://doi.org/10.1175/1520-0493(1997)125<1870:PODSMC>2.0.CO;2.
  • Smirnova, T. G., J. M. Brown, S. G. Benjamin, and D. Kim, 2000: Parameterization of cold-season processes in the MAPS land-surface scheme. J. Geophys. Res., 105, 4077–4086, https://doi.org/10.1029/1999JD901047.
  • Sobash, R. A., and J. S. Kain, 2017: Seasonal variations in severe weather forecast skill in an experimental convection-allowing model. Wea. Forecasting, 32, 1885–1902, https://doi.org/10.1175/WAF-D-17-0043.1.
  • Sobash, R. A., J. S. Kain, D. R. Bright, A. R. Dean, M. C. Coniglio, and S. J. Weiss, 2011: Probabilistic forecast guidance for severe thunderstorms based on the identification of extreme phenomena in convection-allowing model forecasts. Wea. Forecasting, 26, 714–728, https://doi.org/10.1175/WAF-D-10-05046.1.
  • Sobash, R. A., C. S. Schwartz, G. S. Romine, K. R. Fossell, and M. L. Weisman, 2016a: Severe weather prediction using storm surrogates from an ensemble forecasting system. Wea. Forecasting, 31, 255–271, https://doi.org/10.1175/WAF-D-15-0138.1.
  • Sobash, R. A., G. S. Romine, C. S. Schwartz, D. J. Gagne, and M. L. Weisman, 2016b: Explicit forecasts of low-level rotation from convection-allowing models for next-day tornado prediction. Wea. Forecasting, 31, 1591–1614, https://doi.org/10.1175/WAF-D-16-0073.1.
  • Sobash, R. A., C. S. Schwartz, G. S. Romine, and M. L. Weisman, 2019: Next-day prediction of tornadoes using convection-allowing models with 1-km horizontal grid spacing. Wea. Forecasting, 34, 1117–1135, https://doi.org/10.1175/WAF-D-19-0044.1.
  • Sobash, R. A., G. S. Romine, and C. S. Schwartz, 2020: A comparison of neural-network and surrogate-severe probabilistic convective hazard guidance derived from a convection-allowing model. Wea. Forecasting, 35, 1981–2000, https://doi.org/10.1175/WAF-D-20-0036.1.
  • SPC, 2021: Severe weather event summaries: NWS local storm reports. Accessed 1 July 2019, https://www.spc.noaa.gov/climo/online/.
  • Stensrud, D. J., and Coauthors, 2009: Convective-scale Warn-on-Forecast System: A vision for 2020. Bull. Amer. Meteor. Soc., 90, 1487–1500, https://doi.org/10.1175/2009BAMS2795.1.
  • Thompson, R. L., R. Edwards, J. A. Hart, K. L. Elmore, and P. Markowski, 2003: Close proximity soundings within supercell environments obtained from the Rapid Update Cycle. Wea. Forecasting, 18, 1243–1261, https://doi.org/10.1175/1520-0434(2003)018<1243:CPSWSE>2.0.CO;2.
  • Verbout, S. M., H. E. Brooks, L. M. Leslie, and D. M. Schultz, 2006: Evolution of the U.S. tornado database: 1954–2003. Wea. Forecasting, 21, 86–93, https://doi.org/10.1175/WAF910.1.
  • Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747, https://doi.org/10.1175/1520-0493(2001)129<0729:EOASRM>2.0.CO;2.
  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Vol. 100, Academic Press, 648 pp.