Extracting explicit severe weather forecast guidance from convection-allowing ensembles (CAEs) is challenging since CAEs cannot directly simulate individual severe weather hazards. Currently, CAE-based severe weather probabilities must be inferred from one or more storm-related variables, which may require extensive calibration and/or contain limited information. Machine learning (ML) offers a way to obtain severe weather forecast probabilities from CAEs by relating CAE forecast variables to observed severe weather reports. This paper develops and verifies a random forest (RF)-based ML method for creating day 1 (1200–1200 UTC) severe weather hazard probabilities and categorical outlooks based on 0000 UTC Storm-Scale Ensemble of Opportunity (SSEO) forecast data and observed Storm Prediction Center (SPC) storm reports. RF forecast probabilities are compared against severe weather forecasts from calibrated SSEO 2–5-km updraft helicity (UH) forecasts and SPC convective outlooks issued at 0600 UTC. Continuous RF probabilities routinely have the highest Brier skill scores (BSSs), regardless of whether the forecasts are evaluated over the full domain or regional/seasonal subsets. Even when RF probabilities are truncated at the probability levels issued by the SPC, the RF forecasts often have BSSs better than or comparable to corresponding UH and SPC forecasts. Relative to the UH and SPC forecasts, the RF approach performs best for severe wind and hail prediction during the spring and summer (i.e., March–August). Overall, it is concluded that the RF method presented here provides skillful, reliable CAE-derived severe weather probabilities that may be useful to severe weather forecasters and decision-makers.
With horizontal grid spacing less than approximately 4 km, convection-allowing models (CAMs) are important tools for severe weather forecasters, since they adequately resolve the dominant circulations of individual convective storms without convective parameterization (e.g., Weisman et al. 1997; Done et al. 2004). As a result, CAMs more accurately predict storm initiation, evolution, intensity, and mode compared to convection-parameterizing models (e.g., Kain et al. 2006, 2008). Depiction of storm mode is especially useful to severe weather forecasters (e.g., Kain et al. 2006; Clark et al. 2012a) since a storm’s morphology is related to its attendant hazards (e.g., Gallus et al. 2008; Duda and Gallus 2010; Schoen and Ashley 2011; Smith et al. 2012). However, CAMs currently lack horizontal grid spacing fine enough to explicitly simulate individual tornadoes, hailstones, or microscale severe wind events. Therefore, forecasters using CAM guidance must infer simulated severe weather occurrence from modeled storm attributes that are correlated with observed severe weather (e.g., Sobash et al. 2011).
An example of a commonly used simulated severe storm “surrogate” (Sobash et al. 2011, 2016, 2019), or proxy, is hourly maximum 2–5 km above ground level updraft helicity (hereafter, UH; e.g., Kain et al. 2008, 2010; Guyer and Jirak 2014; Loken et al. 2017; Sobash et al. 2011, 2016, 2019). Large values of UH identify not only rotating updrafts associated with supercells, but also the sheared updrafts associated with severe mesoscale convective systems (MCSs; Sobash et al. 2011). As a result, UH has been found to be a skillful predictor of all-hazards severe weather (Kain et al. 2008; Sobash et al. 2011, 2016, 2019). UH has also been used—generally in conjunction with simulated environmental variables—to forecast tornadoes (Clark et al. 2013; Guyer and Jirak 2014; Gallo et al. 2016; Sobash et al. 2019) and severe wind and hail (Jirak et al. 2014). Other common simulated severe weather proxies include large values of hourly maximum upward vertical velocity (e.g., Roberts et al. 2019), low-level vertical vorticity (e.g., Skinner et al. 2016; Sobash et al. 2019) and UH integrated from 0 to 1 km above the surface (Sobash et al. 2019).
One major drawback of these proxies is that they require extensive calibration to perform optimally. For example, Sobash and Kain (2017) demonstrated that the best UH threshold to use for all-hazards severe weather prediction varies by location and time of year. Moreover, if binary proxies are smoothed spatially to obtain probabilistic forecasts (e.g., Sobash et al. 2011, 2016, 2019; Loken et al. 2017), the degree of spatial smoothing must be properly calibrated as well. Too little smoothing results in overforecasting bias, while too much can yield underforecasting and degrade sharpness and resolution (e.g., Sobash et al. 2011, 2016; Loken et al. 2017, 2019a,b). Additionally, these calibrations are CAM and hazard dependent. For example, Clark et al. (2012b, 2013) used a larger UH threshold and smaller degree of spatial smoothing to forecast tornado pathlengths compared to that used by Sobash et al. (2011) to forecast all-hazards severe weather, while Gagne et al. (2017) used different UH thresholds to predict 25- and 50-mm diameter hail.
Another important drawback of simulated severe weather proxies is that they use limited information to determine the severe weather threat. For example, Clark et al. (2012b) and Gallo et al. (2016) noted that large values of UH may exist in environments that are not conducive to severe weather. However, even when proxies are filtered based on the simulated environment (e.g., Clark et al. 2012b; Jirak et al. 2014; Gallo et al. 2016), the resulting predictions may still be suboptimal since severe weather can still occur in locations with unfavorable simulated environments if the CAM has biases or is not representing the observed environment well. Moreover, the use of environment-based filtering does not mean the resulting prediction has considered all relevant forecast variables.
Another way to extract explicit severe weather guidance from CAMs is to statistically relate multivariate CAM output with the observed occurrence of severe weather. Indeed, this is the general approach of Model Output Statistics (MOS; Glahn and Lowry 1972; Klein and Glahn 1974), which has shown promise for a variety of forecast fields, including: probability of precipitation, maximum and minimum temperatures, cloud coverage, near-surface wind, conditional probability of precipitation, and thunderstorms (e.g., Glahn and Lowry 1972; Klein and Glahn 1974; Carter 1975; Bermowitz 1975; Schmeits et al. 2005; Kang et al. 2011). However, MOS relationships tend to be linear and based on regression while relationships between CAM forecast variables and observed severe weather are likely to be flow-dependent and nonlinear (e.g., Legg and Mylne 2004; Melhauser and Zhang 2012; Torn and Romine 2015; Trier et al. 2015). Thus, machine learning (ML) techniques, which can model nonlinear relationships, may be more appropriate for diagnosing the severe weather threat conveyed by CAM or convection-allowing ensemble (CAE) guidance.
Indeed, recent studies have successfully used ML techniques to create probabilistic precipitation (e.g., Gagne et al. 2014; Herman and Schumacher 2018; Loken et al. 2019a) and severe weather (e.g., Gagne et al. 2017; Lagerquist et al. 2017; Burke et al. 2020) forecasts based partly or entirely on numerical weather prediction (NWP) predictors. For severe weather prediction, a common approach has been to use predictors associated with storm “objects,” which are identified by thresholding a certain simulated storm attribute (e.g., maximum hourly column total graupel mass in Gagne et al. 2017; maximum hourly upward vertical velocity in Burke et al. 2020). Thus, the object identification process “filters out” areas of weaker or nonexistent simulated storms. Such an approach is efficient for ML training since it eliminates the need to consider predictors from all grid points but can underperform if there is poor correspondence between simulated and observed storms (Gagne et al. 2017). Conversely, when gridpoint-based predictors are used, training takes longer, but higher performance may be achieved when the CAE is imperfect, since the gridpoint predictors offer the ML algorithm more (and more relevant) information. Moreover, when gridpoint-based predictors and predictands are used, output probabilities are directly given in two-dimensional (rather than object) space, facilitating user interpretation of ML output.
While gridpoint-based methods have been used to obtain skillful probabilistic precipitation forecasts (Herman and Schumacher 2018; Loken et al. 2019a), they are untested for severe weather prediction. Therefore, this study seeks to develop and evaluate a random forest (RF)-based method for creating individual-hazard day 1 (i.e., 1200–1200 UTC) severe weather probabilities from gridpoint-based CAE forecast output. Due to its skill (Jirak et al. 2016, 2018) and long data archive, the SPC’s 7-member Storm-Scale Ensemble of Opportunity (SSEO; Jirak et al. 2012, 2016, 2018) is used as the underlying dynamical forecast system. For evaluation against operationally relevant baselines, the RF-based severe weather forecasts are compared to SSEO UH-based probabilistic forecasts and SPC day 1 convective outlooks (COs) issued at 0600 UTC.1 While multiple previous studies have applied ML to severe weather prediction, the RF method described herein is unique in that it uses gridpoint-based CAE forecast fields as predictors, produces probabilistic forecasts for multiple severe weather hazards over the full contiguous United States (CONUS), and is directly evaluated against top-performing human and NWP baselines.
The forecast and observational datasets used herein span 629 days from late April 2015 to early July 2017 (Table 1). RF- and UH-based severe weather forecasts are derived from the SSEO (Jirak et al. 2012, 2016), a 7-member CAE with members that use different initial and lateral boundary conditions, initialization times, and microphysics and turbulence parameterizations. Since SPC forecasters began using the SSEO in 2011 (Jirak et al. 2016), its convection-related forecasts have compared favorably with those from other experimental CAEs (Jirak et al. 2016). As a result, the SSEO was ultimately formalized as the High-Resolution Ensemble Forecast System Version 2 (HREFv2), which became the first operational CAE run by the National Oceanic and Atmospheric Administration’s (NOAA’s) Environmental Modeling Center in November 2017 (Jirak et al. 2018; Roberts et al. 2019; Loken et al. 2019b). All SSEO member forecasts are provided on a 4-km contiguous United States (CONUS) domain with 1199 × 799 points. Full SSEO specifications are summarized in Table 2.
SSEO forecasts are compared against SPC day 1 COs, which are issued daily by 0600 UTC and are valid from 1200 to 1200 UTC the following day. These COs include probabilistic forecasts of tornadoes, severe wind [i.e., wind speeds of at least 50 kt (58 mph)], and severe hail (i.e., a maximum hailstone diameter of 1 in. or greater), with probabilities valid for within 25 mi of a point (about a 40-km radius). The COs also denote locations with a 10% or greater probability of observing significant tornadoes [i.e., those with an enhanced Fujita (EF) rating of 2 or higher], significant severe wind [i.e., wind speeds at least 65 kt (75 mph)], and significant severe hail (i.e., a maximum hailstone diameter of 2 in. or greater) within 25 mi. Individual hazard probabilities are then used to determine a categorical outlook forecast based on the criteria in Table 3.
One limitation of the SPC day 1 COs is that only certain probability levels (i.e., 2%, 5%, 10%, 15%, 30%, 45%, and 60% for tornadoes; 5%, 15%, 30%, 45%, and 60% for severe wind and hail; and 10% for significant severe weather) are contoured. As a result, it is difficult to equitably compare SPC forecasts with the continuous RF- and UH-based forecasts from the SSEO. There are two potential remedies to this problem. The first is to truncate the SSEO-derived forecasts at the same probability levels as used by the SPC. The second is to spatially interpolate the SPC probabilities between contour levels (e.g., Herman et al. 2018). Both methods are used herein. For the second, continuous SPC probabilities are created using a method developed at the SPC (Karstens et al. 2019). First, raw SPC contours are filled/gridded using a top-hat distribution, such that all grid points enclosed by a contour are assigned that contour value. The gridding procedure is done using the General Meteorological Package (GEMPAK; desJardins et al. 1991) within a 1° expanded CONUS domain to negate chronic dampening of probabilities near the edges of the forecast domain. Next, unique probability areas are identified using watershed segmentation (e.g., Lakshmanan et al. 2009), and adjacent probability areas are bilinearly interpolated using a Euclidean distance transformation. Finally, the maximum probability level is assumed to be 25% greater than the maximum nonzero contoured probability level present in the forecast. Continuous SPC probabilities created using this method are henceforth referred to as “full” SPC probabilities, while the raw, discrete SPC probabilities are referred to as “original” SPC probabilities. Importantly, full SPC probabilities do not exist for significant severe weather forecasts, since the SPC only issues a 10% or greater probability contour for significant severe events.
Additionally, the SPC does not issue day 1 outlook probabilities for all-hazards severe or significant severe weather.
Severe weather observations used for verification and RF training are taken from the SPC website (SPC 2019b) for wind and hail and the SPC Storm Events Database (SSED; SPC 2019a) for tornadoes. The SSED was required for tornadoes since it contains information about each tornado’s EF rating, which is necessary for the prediction/verification of significant tornadoes. Unfiltered reports are used to account for all reported instances of severe weather.
b. UH-based forecasts
UH-based probability forecasts for each severe weather hazard are derived from the SSEO. These forecasts are created in the same manner described by Loken et al. (2017). Namely, the fraction of ensemble members exceeding a given UH threshold is noted at each grid point, and that fraction is smoothed using a two-dimensional isotropic Gaussian kernel density function. Therefore, the UH-based probability p at a given grid point can be expressed as
$$p = \frac{1}{2\pi\sigma^{2}} \sum_{n=1}^{N} f_{n} \exp\left(-\frac{d_{n}^{2}}{2\sigma^{2}}\right), \qquad (1)$$

where $f_n$ is the fraction of ensemble members exceeding some UH threshold at the nth point, N is the number of points with at least one member exceeding the threshold, $d_n$ is the distance between the current grid point and the nth point, and σ is the standard deviation of the Gaussian kernel. To determine the combination of UH threshold and σ to use for each hazard, the UH threshold is varied from 10 to 200 m2 s−2 in increments of 10 m2 s−2 while σ is varied from 30 to 210 km in increments of 30 km. The combination that optimizes the Brier skill score (BSS; e.g., Wilks 2011) for a given hazard over the entire dataset is used (right column of Fig. 1), with BSS measured relative to a constant forecast of observed hazard climatology during the 629-day dataset. The calibration is done on the 80-km verification grid (see below) rather than the native 4-km grid.
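The fraction-then-smooth procedure and the accompanying grid search over UH thresholds and kernel widths might be sketched as follows. This is a minimal illustration, not the authors' code: the function names are invented here, SciPy's `gaussian_filter` stands in for the explicit kernel sum, and smoothing is applied on the 80-km grid, consistent with the calibration described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def uh_probabilities(member_uh, threshold, sigma_km, dx_km=80.0):
    """Smoothed UH exceedance probabilities on a regular grid.

    member_uh : array (n_members, ny, nx) of 24-h maximum UH.
    threshold : UH threshold (m^2 s^-2).
    sigma_km  : standard deviation of the Gaussian kernel (km).
    dx_km     : grid spacing (km); 80 km matches the verification grid.
    """
    # Fraction of members exceeding the threshold at each grid point.
    frac = (member_uh > threshold).mean(axis=0)
    # Two-dimensional isotropic Gaussian smoothing of the fraction field.
    return gaussian_filter(frac, sigma=sigma_km / dx_km)

def brier_score(p, o):
    return np.mean((np.asarray(p) - np.asarray(o)) ** 2)

def calibrate(member_uh, obs, climo):
    """Grid search for the (threshold, sigma) pair maximizing the BSS."""
    bs_ref = brier_score(np.full(obs.shape, climo), obs)
    best = (None, None, -np.inf)
    for thresh in range(10, 210, 10):      # 10-200 m^2 s^-2, step 10
        for sigma in range(30, 240, 30):   # 30-210 km, step 30
            p = uh_probabilities(member_uh, thresh, sigma)
            bss = 1.0 - brier_score(p, obs) / bs_ref
            if bss > best[2]:
                best = (thresh, sigma, bss)
    return best
```

In practice, `gaussian_filter`'s boundary handling (`mode`) and truncation radius would also need to be chosen to match the intended kernel behavior near domain edges.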
c. Random forest forecasts
1) RF method overview
An RF is an ensemble of decision trees (Breiman 2001). Individual decision trees (Breiman et al. 1984) work by recursively splitting a dataset until a stopping criterion is reached (e.g., the tree reaches a specified maximum number of levels, the number of samples at a node falls below a specified threshold, etc.). Splitting criteria are determined by the algorithm during training. Specifically, at each node, the algorithm chooses the predictor variable and threshold that split the data so as to maximize the dissimilarity between the resulting subsets, as measured by a metric such as information gain or the decrease in Gini impurity. Class predictions can then be made on unseen data by running a testing example through the tree and analyzing the training samples in the appropriate leaf node (i.e., terminal node). For example, class probabilities are expressed as the fraction of training examples associated with the given class in the leaf node containing the testing example.
Although individual decision trees are human-readable and relatively easy to interpret, they are prone to overfitting, such that small changes to a testing example’s predictor variables can produce very different class predictions (e.g., Gagne et al. 2014). The RF algorithm helps remedy this overfitting tendency by growing multiple trees, which are made unique by 1) growing each tree based on a random subset of training examples, and 2) determining the best split at each node by considering a random subset of predictor variables (Breiman 2001). During testing, RF class probabilities are simply the mean probability from each tree in the RF. In this study, RFs and corresponding RF probabilities are created using random forest classifiers from the Python module Scikit-Learn (Pedregosa et al. 2011). More information on RFs can be found in Loken et al. (2019a) and works cited therein.
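For concreteness, the sketch below shows how gridpoint class probabilities would be obtained from Scikit-Learn's `RandomForestClassifier`. The training data are synthetic stand-ins, and the hyperparameter values are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real training set: rows are (80-km grid
# point, day) pairs, columns are SSEO-derived predictors, and labels
# indicate whether a severe report occurred in the grid box.
X_train = rng.normal(size=(5000, 20))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1]
           + rng.normal(size=5000) > 1.5).astype(int)

# Hyperparameters here are illustrative, not the paper's settings.
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=20,
                            random_state=0)
rf.fit(X_train, y_train)

# Forecast probabilities: mean class-1 probability across all trees.
X_test = rng.normal(size=(100, 20))
probs = rf.predict_proba(X_test)[:, 1]
```

Each tree sees a bootstrap sample of the rows and a random predictor subset at every split, which is what decorrelates the ensemble members and curbs the overfitting of any single tree.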
2) Predictor variables
The first step of creating RF-based probabilities is to determine which predictors (or input variables) the RF will consider. Here, predictor variables are based on SSEO forecast fields. However, only a small number of variables relevant to severe weather forecasting are originally stored within the SSEO data archive (i.e., the variables without asterisks in Table 4). To enhance RF skill, several predictor variables (i.e., those with asterisks in Table 4) are added to these original variables. For example, the product of most unstable convective available potential energy (MUCAPE) and 0–6-km wind shear is computed at each native 4-km grid point and stored as a predictor variable. Latitude, longitude, and smoothed UH probabilities are also added as predictors during preprocessing.
3) Data preprocessing
While the SSEO archive contains only a limited number of forecast fields, the amount of data potentially available to the RF is nevertheless overwhelming, since each SSEO member forecasts each variable at 4-km grid spacing over the CONUS every hour. To make training the RF computationally feasible, the dimensionality of the SSEO dataset must be reduced through several steps of data preprocessing.
The first preprocessing step is to reduce the temporal dimension of the dataset. This is accomplished by taking a 24-h (1200–1200 UTC) temporal maximum (for the storm attribute variables; Table 4) or mean (for the environment-related variables) at each 4-km grid point. Next, these temporally aggregated forecast variables—as well as the observed storm reports—are remapped to an approximately 80-km grid (i.e., NCEP grid 211) to further reduce dataset dimensionality and to match the verification scales used by the SPC. For the storm attribute fields, remapping is done by selecting the maximum forecast value on the 4-km grid within each 80-km grid box. For the environment-related fields, remapping to the 80-km grid is done using a neighbor budget method (Accadia et al. 2003), which approximately conserves the remapped quantity. After remapping, the ensemble mean, maximum, minimum, and standard deviation values are computed for each forecast variable at every 80-km grid point. Additionally, smoothed UH probabilities (to be used as predictors) are derived based on the method in section 2b. However, the combination of UH threshold and Gaussian kernel standard deviation used is the one that maximizes the area under the relative operating characteristic curve (AUC; e.g., Wilks 2011; left column of Fig. 1) rather than the BSS, since AUC is a measure of potential skill after bias calibration (Wilks 2011) and RF outputs typically have low bias (e.g., Breiman 2001).
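The aggregation and remapping steps above can be sketched with NumPy as follows. This is an illustrative approximation: the function and dictionary names are invented, a simple block reduction onto a uniform 80-km grid stands in for remapping to NCEP grid 211, and a block mean stands in for the neighbor budget method of Accadia et al. (2003).

```python
import numpy as np

def preprocess(storm_fields, env_fields, block=20):
    """Reduce (n_members, n_hours, ny, nx) SSEO fields to 80-km ensemble stats.

    storm_fields/env_fields: dicts of 4-D arrays on the native 4-km grid.
    block: number of 4-km points per ~80-km grid box (illustrative).
    """
    def block_reduce(a, func):
        m, ny, nx = a.shape
        a = a[:, :ny - ny % block, :nx - nx % block]
        a = a.reshape(m, a.shape[1] // block, block,
                      a.shape[2] // block, block)
        return func(func(a, axis=4), axis=2)

    coarse = {}
    for name, data in storm_fields.items():
        agg = data.max(axis=1)                 # 24-h temporal maximum
        coarse[name] = block_reduce(agg, np.max)   # max within each box
    for name, data in env_fields.items():
        agg = data.mean(axis=1)                # 24-h temporal mean
        coarse[name] = block_reduce(agg, np.mean)  # mean in lieu of neighbor budget

    # Ensemble mean, max, min, and standard deviation at each 80-km point.
    return {name: {"mean": c.mean(axis=0), "max": c.max(axis=0),
                   "min": c.min(axis=0), "std": c.std(axis=0)}
            for name, c in coarse.items()}
```

The choice of maximum for storm attributes preserves extreme values (e.g., a single intense UH track), whereas the mean is appropriate for slowly varying environmental fields.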
After preprocessing, a final set of predictors is obtained for input into the RF. Here, these predictors include the ensemble mean, maximum, minimum, and standard deviation of SSEO forecast fields as well as latitude, longitude, and UH-based probabilities (Table 4). For a given grid point prediction, the RF considers these quantities at the 25 closest 80-km grid points.
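Gathering each predictor at the 25 closest 80-km points amounts to stacking a 5 × 5 neighborhood of every grid point into the feature matrix, which might be sketched as follows. The function name and the edge-padding boundary treatment are assumptions; the paper does not specify how domain edges are handled.

```python
import numpy as np

def neighborhood_features(field, half=2):
    """Stack each point's 5x5 (=25-point) neighborhood as predictor columns.

    field: 2-D array of one ensemble statistic on the 80-km grid.
    Edge points are handled by edge-padding (an assumption).
    Returns an array of shape (ny*nx, (2*half+1)**2).
    """
    padded = np.pad(field, half, mode="edge")
    ny, nx = field.shape
    # Each shifted view of the padded array supplies one neighbor offset.
    cols = [padded[i:i + ny, j:j + nx].ravel()
            for i in range(2 * half + 1) for j in range(2 * half + 1)]
    return np.stack(cols, axis=1)
```

Applying this to every ensemble statistic of every field and concatenating the results column-wise yields the final predictor matrix, one row per 80-km grid point.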
4) RF predictions
The RF gives probabilistic predictions of whether a given 80-km grid box will experience at least one observed severe weather report (all hazards or individual hazard) over the 24-h day 1 CO period (i.e., 1200–1200 UTC). Separate RFs are used to predict the occurrence of all-hazards severe weather, all-hazards significant severe weather, any tornadoes, significant tornadoes, any severe wind, significant severe wind, any severe hail, and significant severe hail. Finally, the predictions from these separate RFs are used to construct an RF-based day 1 categorical outlook using the same guidelines employed by the SPC (i.e., those in Table 3).
5) Discrete/truncated RF probabilities
To facilitate a fair comparison with the SPC day 1 outlooks, discrete RF probabilities are created for individual-hazard severe and significant severe weather forecasts using the same probability levels as the SPC (Table 3). Discrete RF probabilities (henceforth referred to as truncated RF forecasts) are created by simply converting all continuous RF probabilities between discrete SPC probability levels to the lower probability. For example, continuous severe hail probabilities between 5% (inclusive) and 15% (exclusive) are converted to 5% probabilities, since they would all be contained within a 5% SPC contour. Similarly, for individual-hazard significant severe forecasts, truncated RF probabilities are 10% if the continuous RF probabilities meet or exceed 10% and 0% otherwise.
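The truncation rule can be expressed compactly; the probability levels below are those listed above for severe wind and hail, and the function name is invented for illustration.

```python
import numpy as np

# SPC severe wind/hail probability levels; continuous probabilities
# between two levels are mapped down to the lower (inclusive) level.
SPC_LEVELS = np.array([0.0, 0.05, 0.15, 0.30, 0.45, 0.60])

def truncate(probs, levels=SPC_LEVELS):
    """Map continuous probabilities onto discrete SPC contour levels."""
    idx = np.searchsorted(levels, probs, side="right") - 1
    return levels[idx]
```

For example, a continuous 7% severe hail probability falls inside the 5% contour and is assigned 5%; probabilities at or above the top level stay at 60%.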
Probabilistic severe weather forecasts are evaluated over the entire CONUS (Fig. 2a) as well as over the West, Midwest, and East (Fig. 2b), which are defined based on temperature and precipitation climatology and represent an aggregation of regions described in Bukovsky (2011). Forecasts are also analyzed seasonally, with winter, spring, summer, and fall defined as December–February, March–May, June–August, and September–November, respectively.
Forecasts are verified on the ~80-km NCEP grid 211 to approximately match the verification definitions used by the SPC, which evaluates the occurrence of severe weather within 40 km of a point, and to save computational expense during verification. Continuous RF, truncated RF, original SPC, full/continuous SPC, and (continuous) UH-based probabilities are evaluated and compared against each other whenever possible. Unfortunately, due to the limitations of the SPC forecasts, full SPC probabilities are not created for significant severe weather forecasts, and neither original nor full SPC probabilities exist for all-hazard severe or significant severe forecasts. Additionally, no quantitative verification is performed on the RF- and SPC-based categorical outlooks, since these are not true probabilistic forecasts, but rather summary products that merge probability and intensity information. Forecast evaluation is done using 17-fold cross validation with 37 days per fold. A total of 17 folds are used here to balance the trade-off between computational expense and training set size and to provide an equal number of days (37) in each fold. As in Loken et al. (2019a), verification statistics are computed over the full set of 629 forecasts derived from each fold’s testing set.
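The cross-validation splits might be generated as follows; assigning consecutive days to each fold is an assumption made here for illustration, as the text does not state how days are allocated to folds.

```python
import numpy as np

n_days, n_folds = 629, 17
days = np.arange(n_days)
folds = days.reshape(n_folds, n_days // n_folds)   # 17 folds x 37 days each

splits = []
for k in range(n_folds):
    test_days = folds[k]
    train_days = np.setdiff1d(days, test_days)     # remaining 592 days
    splits.append((train_days, test_days))         # fit on train, predict on test
```

Concatenating the 17 testing-set predictions recovers forecasts for all 629 days, over which the verification statistics are then computed.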
Metrics used for verification include: BSS, BS components (e.g., Wilks 1995), attributes diagrams (e.g., Hsu and Murphy 1986), and performance diagrams (Roebber 2009). While AUC is used to set the UH threshold and Gaussian kernel standard deviation for smoothed UH-based predictors, it is not used for forecast evaluation since it is not sensitive to bias and it tends to increase nonlinearly with increasing forecast skill such that two well-performing but differently skilled forecast systems may have similar AUC values near 1 (Marzban 2004).
The BS (e.g., Wilks 1995), which measures the magnitude of forecast probability errors, can be decomposed into reliability, resolution, and uncertainty components (Murphy 1973; Wilks 1995), and is defined as
$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_{i}-o_{i}\right)^{2} = \frac{1}{N}\sum_{k=1}^{K} n_{k}\left(\bar{p}_{k}-\bar{o}_{k}\right)^{2} - \frac{1}{N}\sum_{k=1}^{K} n_{k}\left(\bar{o}_{k}-\bar{o}\right)^{2} + \bar{o}\left(1-\bar{o}\right), \qquad (2)$$

where N is the total number of forecast/observation pairs, K is the number of forecast probability bins, $p_i$ is the forecast probability at point i, $o_i$ is the binary observation (i.e., 0 or 1) at point i, $n_k$ is the number of forecasts in bin k, $\bar{p}_k$ is the mean forecast probability in bin k, $\bar{o}_k$ is the mean observed relative frequency in bin k, and $\bar{o}$ is the overall sample climatological frequency. The three terms on the right of Eq. (2) represent the reliability, resolution, and uncertainty components of the BS, respectively. Meanwhile, the BSS compares the BS to that of a reference forecast, thus enabling a fair comparison for events with different climatological relative frequencies (Wilks 1995). Specifically, the BSS is defined as
$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}}, \qquad (3)$$

where, herein, $\mathrm{BS}_{\mathrm{ref}}$ is the BS resulting from always forecasting the observed climatology of the relevant dataset. A BSS of 1 (0) indicates perfect (no) skill relative to the reference forecast. The 95% confidence intervals (95CIs) for each forecast’s BSS values are determined using resampling with replacement (i.e., bootstrapping; e.g., Wilks 2011). Specifically, 629 random samples (with replacement) are drawn from a given forecast’s 629 individual-day BS values. The aggregate BS and BSS over the random sample are then computed and stored. After 10 000 iterations of this process, the 95% BSS confidence interval is noted by observing the 2.5- and 97.5-percentile values of the stored BSS distribution.
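A sketch of the BS decomposition, the BSS, and the bootstrap confidence intervals described above; the function names are invented, and the binning and resampling details are assumptions consistent with the description in the text.

```python
import numpy as np

def brier_decomposition(p, o, bins):
    """Reliability, resolution, and uncertainty components of the BS.

    p: forecast probabilities; o: binary observations; bins: bin edges.
    """
    p, o = np.asarray(p, float), np.asarray(o, float)
    N, obar = len(o), o.mean()
    rel = res = 0.0
    k = np.digitize(p, bins)
    for j in np.unique(k):
        mask = k == j
        n_k, pbar_k, obar_k = mask.sum(), p[mask].mean(), o[mask].mean()
        rel += n_k * (pbar_k - obar_k) ** 2     # reliability term
        res += n_k * (obar_k - obar) ** 2       # resolution term
    return rel / N, res / N, obar * (1 - obar)  # last term: uncertainty

def bss(p, o, climo):
    """Brier skill score relative to a constant climatology forecast."""
    bs = np.mean((np.asarray(p) - np.asarray(o)) ** 2)
    bs_ref = np.mean((climo - np.asarray(o)) ** 2)
    return 1.0 - bs / bs_ref

def bootstrap_bss_ci(daily_bs, daily_bs_ref, n_iter=10000, seed=0):
    """95% CI for BSS by resampling daily BS values with replacement."""
    rng = np.random.default_rng(seed)
    n = len(daily_bs)
    samples = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, n)
        samples.append(1.0 - daily_bs[idx].mean() / daily_bs_ref[idx].mean())
    return np.percentile(samples, [2.5, 97.5])
```

For forecasts whose probabilities are constant within each bin, the identity BS = reliability − resolution + uncertainty holds exactly.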
While the reliability component of the BS provides a single-number summary of how well forecast probabilities correspond with observations, attributes diagrams allow users to assess reliability separately for each of the K probability bins. Herein, bins are defined by the following probability level ranges: [0%–1%), [1%–2%), [2%–5%), [5%–15%), [15%–25%), …, [85%–95%), and [95%–100%]. Perfectly reliable forecasts fall along a line of slope 1 passing through the origin; over (under) forecasts fall below (above) this line. Attributes diagrams also contain horizontal and vertical lines plotted at the sample climatological relative frequency as well as a no-skill line located halfway between the horizontal climatology line and the perfect reliability line. Points above (below) the no-skill line contribute positively (negatively) to the BSS when a reference forecast of climatology is used (Wilks 1995).
Performance diagrams (Roebber 2009) binarize probabilistic forecasts at specific probability levels (herein, 0%, 1%, 2%, 5%–95% in increments of 10%, and 100%) and display probability of detection (POD), success ratio (SR), bias, and critical success index (CSI) on a single plot [e.g., see Eqs. (1)–(4) in Roebber (2009)]. Points falling closer to the top-right-hand corner of the diagram exhibit greater skill, since POD, SR, CSI, and bias are all optimized at a value of 1. Moreover, POD, SR, CSI, and bias are all independent of the number of correct negatives, making the performance diagram a good tool for evaluating forecasts with many trivial correct negatives.
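The four performance-diagram quantities follow directly from the 2 × 2 contingency table obtained after binarizing the probabilities; a minimal sketch (with an invented function name):

```python
import numpy as np

def performance_metrics(probs, obs, level):
    """POD, success ratio, bias, and CSI after binarizing at `level`."""
    f = np.asarray(probs) >= level          # binary forecasts
    o = np.asarray(obs).astype(bool)        # binary observations
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    pod = hits / (hits + misses)
    sr = hits / (hits + false_alarms)       # success ratio = 1 - FAR
    bias = (hits + false_alarms) / (hits + misses)
    csi = hits / (hits + misses + false_alarms)
    return pod, sr, bias, csi
```

Note that correct negatives never appear in these expressions, which is why the diagram is insensitive to the many trivial non-events in a rare-event verification grid.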
a. Full-domain, full-period results
The continuous RF forecasts have the greatest overall BSS values for each of the hazards examined (Fig. 3a). Compared to the UH-based forecasts, the continuous RF forecasts give substantially better predictions for all hazards except significant tornadoes (Fig. 3a). This is an important result given that UH is a skillful predictor of severe weather (e.g., Kain et al. 2008; Sobash et al. 2011, 2016, 2019) and is widely used in test bed settings (e.g., Kain et al. 2008; Clark et al. 2012a; Guyer and Jirak 2014; Gallo et al. 2017; Roberts et al. 2019). The continuous RF forecasts always have better resolution (Fig. 3b) and frequently—though not always—have better reliability (Fig. 3c) than the UH forecasts. Of course, it is likely that the UH-based forecasts would obtain a higher BSS if a time- and space-varying UH threshold were used instead of a constant one (Sobash and Kain 2017). However, calibrating the UH threshold in space and time requires substantially more computational resources compared to a constant threshold calibration. While training a RF is also computationally intensive, the RF considers multiple variables, and its multivariate “calibration” occurs implicitly as the algorithm is run.
The continuous RF forecasts also perform substantially better than the full SPC forecasts for hail and wind but not tornado prediction (Fig. 3a), an unsurprising result given this study’s lack of tornado-specific predictors [e.g., significant tornado parameter (STP; Thompson et al. 2003), low-level storm relative helicity (e.g., Coffer et al. 2019), etc.]. Thus, it is possible that adding predictors with a stronger correlation to observed tornado and/or low-level mesocyclone occurrence could improve the RF tornado and significant tornado forecasts. However, even without tornado-specific predictors, the continuous RF forecasts have better resolution (Fig. 3b) and better (i.e., smaller) reliability values (Fig. 3c) than the continuous SPC forecasts for all hazards.
When the continuous RF forecasts are truncated at the probabilities used by the SPC, BSS values are, unsurprisingly, reduced (Fig. 3a). Much of this reduction comes from degraded reliability (Fig. 3c) rather than decreased resolution (Fig. 3b). However, the truncated RF probabilities still have substantially greater BSSs than the original SPC probabilities for severe wind (Fig. 3a). Truncated RFs also have higher BSSs relative to the original SPC forecasts for severe hail, with the 95CIs of the two forecasts just barely overlapping. For the significant severe hazards, the truncated RFs do not substantially outperform the original SPC forecasts. However, the continuous RF forecasts do have notably greater BSSs than the original SPC forecasts for significant severe wind and significant severe hail. This outperformance is due to the improved resolution (Fig. 3b) and reliability (Fig. 3c) that is possible with access to continuous rather than binary (i.e., ≥10%) forecast probabilities.
While the RF-based forecasts have the best resolution for all hazards (Fig. 3b), they do not necessarily have the best reliability (Fig. 3c); however, reliability among all forecasts for all hazards is generally very good (Figs. 4 and 5). Large deviations from perfect reliability are typically associated with small sample size in the relevant forecast probability bin(s) [e.g., the UH significant severe weather forecasts (Figs. 5a,c,e,g) at higher forecast probabilities and the RF and UH tornado probabilities greater than 30% (Fig. 4c)]. Interestingly, both the original and full SPC probabilities underforecast tornadoes (Fig. 4c) and severe wind (Fig. 4e). For the original SPC forecasts, this underforecasting is at least partially due to their use of discrete probabilities (i.e., probabilities between two discrete levels are mapped to the lower level). However, the underforecasting may also reflect a general philosophy of the SPC to emphasize higher-end tornado and wind events, given that SPC categorical outlooks are directly dependent on forecast hazard probability (Table 3). For example, it is possible that forecasters may wish to convey a message other than “moderate” or “high risk” to emergency managers or other users when they anticipate higher probabilities of marginally severe wind [e.g., ~50 kt (58 mph)] or low-end (e.g., EF0) tornado reports. Similarly, the SPC may wish to have high POD—even at the expense of false alarm—for significant tornadoes and significant severe wind events, which could explain the SPC overforecasting for these hazards (Figs. 5c,e). The SPC does not have the same overforecasting bias for severe (Fig. 4g) and significant severe (Fig. 5g) hail, perhaps since these events have less potential for truly devastating impacts. Importantly, the UH and RF forecasts give equal weight to all observed storm reports and do not consider the potential societal impacts of observed severe weather.
As statistical methods, the UH and RF forecasts tend to struggle most for the rarest events, which have the least amount of data. For example, the UH forecasts have good reliability for most hazards but some overforecasting at higher probabilities for tornadoes (Fig. 4c) and significant severe weather hazards (Figs. 5a,c,e,g). Meanwhile, the continuous RF forecasts tend to have either near-perfect reliability (e.g., Figs. 4g and 5a,e,g) or slight underforecasting (e.g., Figs. 4a,e) at most probability levels for most hazards. Unsurprisingly, the truncated RF forecasts tend to underforecast relative to the continuous RF forecasts, since—like the original SPC forecasts—all continuous forecast probabilities less than a given discrete level are assigned to the next lowest level. Nevertheless, both the continuous and truncated RF forecasts have excellent reliability for the prediction of all hazards at probabilities with a sufficiently large sample size.
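The truncation procedure described above (mapping each continuous probability down to the next-lowest discrete level) can be sketched in a few lines. The levels shown are the SPC's day 1 tornado outlook probability levels and serve only as an example; the wind/hail outlooks use a slightly different set:

```python
import numpy as np

# SPC day 1 tornado outlook probability levels (as fractions); the
# wind/hail outlooks use a slightly different set of levels.
SPC_TORNADO_LEVELS = np.array([0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.60])

def truncate_probs(p, levels=SPC_TORNADO_LEVELS):
    """Map each continuous probability down to the highest discrete level
    it meets or exceeds; values below the lowest level become 0."""
    p = np.asarray(p, dtype=float)
    idx = np.searchsorted(levels, p, side="right") - 1
    return np.where(idx >= 0, levels[np.clip(idx, 0, None)], 0.0)
```

Because every probability is rounded downward, the truncated forecasts are biased low relative to the continuous forecasts by construction, which explains the systematic underforecasting noted above.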
Performance diagrams (Figs. 6a–h) generally corroborate the BSS-based results (Figs. 3a–c), showing a clear outperformance of the RF-based method for most hazards. For example, the continuous RF forecasts substantially outperform the UH forecasts for all-hazard severe (Fig. 6a) and significant severe (Fig. 6b) weather at all probability levels. The continuous and truncated RF forecasts also clearly outperform both the SPC and UH forecasts for severe wind (Fig. 6e), significant severe wind (Fig. 6f), severe hail (Fig. 6g), and significant severe hail (Fig. 6h). Interestingly, for tornadoes, the RF-based forecasts perform as well as (for the lower forecast probabilities) or slightly worse than (for the higher forecast probabilities) those from the SPC, with the UH-based forecasts noticeably inferior (Fig. 6c). Again, the RF-based forecasts’ worse performance for tornado prediction potentially reflects the lack of tornado-specific predictors in this study. For significant severe hazards (Figs. 6d,f,h), skill is relatively low for all forecasts, but the RF forecasts have CSI values at least as high as those from SPC and UH forecasts.
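Performance diagrams summarize POD and success ratio (1 − FAR) jointly, with CSI and frequency bias recoverable from the same 2 × 2 contingency counts. A minimal sketch of that arithmetic, assuming hit/false-alarm/miss counts have already been tallied for a given probability threshold (correct negatives are not needed for these scores):

```python
def performance_metrics(hits, false_alarms, misses):
    """Metrics underlying a performance diagram, computed from 2 x 2
    contingency-table counts at a single probability threshold."""
    pod = hits / (hits + misses)                    # probability of detection
    sr = hits / (hits + false_alarms)               # success ratio = 1 - FAR
    csi = hits / (hits + false_alarms + misses)     # critical success index
    bias = (hits + false_alarms) / (hits + misses)  # frequency bias
    return pod, sr, csi, bias
```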
In general, the continuous and truncated RF forecasts have similar CSI scores. There is one interesting exception, however: for the significant tornado forecasts, the truncated RF method is associated with a noticeably higher CSI (Fig. 6d). The likely cause is the poor reliability of the continuous RF forecasts at greater than 10% probabilities due to small sample size (Figs. 5c,d). Because the continuous RF probabilities dramatically overforecast significant tornadoes above 10% probability, the truncation procedure dramatically improves reliability (Fig. 5c) and CSI (Fig. 6d) at the 10% level.
b. Seasonal and regional results
Consistent with Sobash and Kain (2017), it is found herein that the “best” UH threshold to use (i.e., the one that maximizes BSS) for all-hazards severe (Figs. 7a,b) and significant severe (Figs. 7c,d) weather prediction depends on season and region. The best-performing UH threshold is particularly sensitive to region: values of 60, 40, and 30 m2 s−2 (140, 110, and 130 m2 s−2) are best for the West, Midwest, and East regions, respectively, for all-hazards severe (significant severe) weather (Figs. 7b,d). While the best UH threshold does not change much seasonally for the all-hazard severe weather forecasts (Fig. 7a), seasonal variations are more apparent for all-hazard significant severe weather forecasts (Fig. 7c). Importantly, the continuous RF always outperforms the best UH forecast over a given region or season (Figs. 7a–d).
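Finding the "best" UH threshold of this kind amounts to a brute-force sweep: form exceedance probabilities at each candidate threshold, score them with BSS, and keep the maximizer. The sketch below uses raw ensemble member fractions and skips the calibration and spatial smoothing applied to the actual UH forecasts; the default threshold list simply mirrors the values quoted above:

```python
import numpy as np

def best_uh_threshold(uh_members, obs, thresholds=(30, 40, 60, 110, 130, 140)):
    """Sweep candidate UH thresholds (m2 s-2) and return the one whose
    exceedance probabilities maximize BSS, along with that BSS.

    uh_members: (n_members, n_points) forecast max 2-5-km UH per grid point
    obs:        (n_points,) binary observed severe weather occurrence
    """
    uh_members = np.asarray(uh_members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    unc = obs.mean() * (1.0 - obs.mean())        # climatological Brier score
    best = None
    for t in thresholds:
        probs = (uh_members >= t).mean(axis=0)   # fraction of members >= t
        bss = 1.0 - np.mean((probs - obs) ** 2) / unc
        if best is None or bss > best[1]:
            best = (t, bss)
    return best
```

Repeating this sweep over regional or seasonal subsets of the verification grid reproduces the kind of sensitivity analysis shown in Figs. 7a–d.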
When all forecasts are verified seasonally, a similar pattern emerges: with just one exception (i.e., fall tornado forecasts; Fig. 8a), the continuous RF forecasts have the highest BSSs for all hazards during all seasons (Figs. 8a–f). Both the continuous and truncated RF forecasts have substantially greater BSSs for summer severe wind prediction compared to either the UH or SPC forecasts (Fig. 8c). The continuous RF forecasts also dramatically outperform the best-performing SPC forecast for the prediction of spring and summer severe hail (Fig. 8e) and spring significant severe hail (Fig. 8f). Additionally, the continuous RF forecasts substantially outperform the UH-based forecasts—but not the continuous SPC forecasts—for spring severe wind (Fig. 8c) and winter tornadoes (Fig. 8a). However, it should be noted that using a spatiotemporally varying UH threshold would likely improve the BSSs of the UH forecasts (e.g., Sobash and Kain 2017), especially for the winter tornado forecasts. While the continuous RF forecasts generally exhibit noticeably larger BSSs than the other forecasts for significant severe hazards (e.g., Figs. 8b,d,f), the seasonal 95CIs are typically quite large for these hazards. Truncated RF forecast BSSs are generally higher—but not dramatically higher—than those from the original SPC forecasts for subsignificant severe weather prediction (i.e., Figs. 8a,c,e), although the truncated RF forecasts do have substantially better summer severe wind forecasts. For the significant severe hazards, the truncated RF probabilities have BSSs similar to the original SPC probabilities during each season.
When BSS is tabulated regionally, it is apparent that the RF method struggles at predicting tornadoes (Fig. 9a), significant tornadoes (Fig. 9b), and significant severe wind (Fig. 9d) in the West region. However, for all other hazards and regions (Figs. 9a–f), the continuous RF forecasts have the greatest BSSs. Regionally, the RF approach gives the greatest relative benefit for East severe wind prediction (Fig. 9c); both the continuous and truncated RF forecasts have substantially greater BSSs than the UH- or SPC-based forecasts. The continuous RF also noticeably outperforms the UH and SPC forecasts for the prediction of West and East severe wind (Fig. 9c) and Midwest severe hail (Fig. 9e) and significant severe hail (Fig. 9f). As with the seasonal verification results, truncated RF significant severe probabilities (Figs. 9b,d,f) tend to have similar BSSs to original SPC probabilities for each region.
4. Case studies
a. 26–27 May 2015
At 1200 UTC 26 May, a midlevel trough and an associated mesoscale convective complex (MCC) were centered over central Iowa. A line of thunderstorms extended southeastward from the MCC along a surface front into eastern Mississippi. As the period progressed, the midlevel trough moved northeastward over the Great Lakes region and helped deepen an associated surface cyclone, ultimately leading to several tornado and severe wind reports in Illinois and Wisconsin before 1900 UTC. The cyclone’s cold front also advanced eastward and helped force the development of severe-wind-producing thunderstorms over eastern Alabama, western Georgia, and the Ohio Valley. Farther west, storms initiated along a dryline extending from west-central Oklahoma southward into central Texas by 2300 UTC. These storms produced numerous reports of severe wind and hail, with multiple significant severe hail reports and one significant severe wind report.
The RF and SPC outlooks (Figs. 10a,b) had some notable differences on this day, including the RF outlook’s use of the enhanced risk over two locations as well as the RF outlook’s greater areal coverage of slight risk areas. In the Upper Midwest, the RF shifted the 2% and greater tornado probabilities westward compared to the SPC (Figs. 11a,b). As a result, the RF had better POD for tornadoes in eastern Iowa, southern Wisconsin, and northern Illinois. Along the Oklahoma–Texas border, the RF issued 10% tornado probabilities with 6% significant tornado probabilities. Ultimately, no significant tornadoes were observed in this region, although multiple tornado reports occurred near the RF’s 10% tornado area. Unlike the SPC forecast, the RF forecast issued 30% severe wind probabilities in a region extending from the Ohio Valley to the western Florida Panhandle (Figs. 11c,d). Numerous severe wind reports were observed near these locations, giving the RF a better POD. The RF also moved the 15% severe wind area slightly southeastward into southern Oklahoma and northern Texas, which better captured some severe wind reports—including a significant severe wind report—in that region. Notably, the one significant severe wind report fell near the RF’s 2% contour for significant severe wind. Indeed, one advantage of the RF forecast is its ability to communicate nonzero (but still less than 10%) probabilities for significant severe weather. For severe hail, the RF forecasts gave a much larger 5% area than the SPC (Figs. 11e,f) but focused on a similar area for its 15% probabilities. However, unlike the SPC, the RF forecasts produced a large area of 30% severe hail probabilities and indicated a greater than 10% chance of significant severe hail in western Oklahoma and northern Texas. Ultimately, numerous severe and significant severe hail reports occurred in this region. 
Two significant severe hail reports in central Texas also fell outside of the RF’s 10% “hatched area” for significant severe hail but within the RF’s 2% significant severe hail contour. However, the RF forecast did have greater false alarm than the SPC in eastern Louisiana (where the RF issued 15% probabilities) and over a large area extending from central Wisconsin to the Gulf Coast (where the RF generally issued 5% probabilities). Nevertheless, the RF generally made improvements over the SPC forecast. A human forecaster with access to the RF probabilities on this day might have had more confidence in a Texas–Oklahoma significant severe hail event and a widespread severe wind event in the Ohio Valley and Southeast.
UH-based probabilities might have communicated only part of this story. For example, compared to RF all-hazard probabilities (Fig. 12a), UH-based probabilities (Fig. 12b) were much lower over the Ohio Valley and Southeast United States. However, UH all-hazards severe and significant severe probabilities (Figs. 12b,d) were generally similar to those from the RF (Figs. 12a,c) over the southern plains.
b. 18–19 May 2017
The SPC identified 18 May 2017 as a high-risk day in the southern plains (e.g., Fig. 13b), with their 0600 UTC outlook highlighting the potential for widespread long-track tornadoes in parts of Oklahoma and Kansas. At the surface, a cyclone was developing in the western Oklahoma Panhandle by 1200 UTC. Strong southerly winds throughout central Texas and Oklahoma advected rich low-level moisture into the southern plains, where strong deep-layer vertical wind shear was in place. Storms began forming in the warm sector along the dryline in western Oklahoma and northern Texas by 1830 UTC and quickly became severe. Severe storms also formed along the warm front in central Kansas by 2130 UTC. Meanwhile, in the Northeast, severe hail- and wind-producing storms initiated ahead of a cold front in an unstable, sheared environment.
While the RF and SPC forecasts identified similar threat areas in their outlooks (Figs. 13a,b), they issued different maximum outlook categories, with the RF (SPC) issuing a moderate (high) risk in the southern plains and an enhanced (slight) risk in the Northeast. Interestingly, although the RF produced smaller tornado probability magnitudes in the southern plains (Fig. 14a), it gave larger areas of higher-end (i.e., >10%) tornado probabilities there. Indeed, most of the observed tornadoes occurred within these areas of higher-end RF probabilities. The RF tornado forecast also expanded its 2% tornado probabilities farther east compared to the SPC, enabling it to better capture the QLCS tornado reports in Missouri (Figs. 14a,b). While the RF and SPC agreed on the area with the largest significant tornado probability (i.e., southern Kansas and northern Oklahoma; Figs. 14a,b), most of the observed significant tornadoes fell outside of this region but within/near the RF’s 2% significant tornado probability contour. The RF and SPC forecasts had very similar tornado forecasts in the Northeast, with the RF forecasts having slightly less false alarm area.
RF and SPC severe wind probability magnitudes were quite different on this day, with the RF having higher probabilities in both the eastern United States and the southern plains (Figs. 14c,d). These higher probabilities led to better POD for the RF in New York, northern Pennsylvania, and southern Oklahoma but greater false alarm in most of West Virginia and northern Texas. The RF also expanded the 15% probability area farther eastward compared to the SPC, giving it greater POD in Arkansas and Missouri.
RF and SPC hail forecasts were similar, although the RF extended the 30% probability area and 10% significant severe hail area farther south into central Texas, where severe and significant severe hail occurred (Figs. 14e,f). Additionally, the RF indicated 2% significant severe hail probabilities in New York and Kansas where significant severe hail was observed but fell outside of the RF or SPC 10% significant severe hail probabilities. Finally, the RF demonstrated better severe hail POD in Maryland (Figs. 14e,f). Overall, the RF-based outlook (Fig. 13a) and individual-hazard probabilities (Figs. 14a,c,e) compared favorably against the corresponding SPC forecasts on this day.
RF all-hazards severe and significant severe weather probabilities (Figs. 15a,c) also compared favorably with UH-based probabilities (Figs. 15b,d), especially in the Northeast, where the RF had better POD for severe and significant severe weather. In the southern plains, RF and UH forecasts were generally similar. However, it is noteworthy that the RF significant severe forecasts (Fig. 15c) shift the maximum probabilities southwest into western Oklahoma, close to a cluster of significant severe reports, while the UH probability maximum is in central Kansas, away from any such cluster (Fig. 15d).
5. Discussion
Compared to the SPC forecasts, the RF probabilities frequently highlighted similar areas for severe weather but gave different probability magnitudes. However, the RF forecasts herein occasionally assigned higher probabilities (e.g., greater than 15% or even 30%) to areas outside of the SPC’s marginal risk. When this happened, many times the areas with the higher RF probabilities did experience observed severe weather. This occurred most often for severe wind events in the East region. In these instances, it is possible that the differences between the SPC and RF forecasts could be partially explained by biases and nonmeteorological artifacts in the severe wind report observations, given the high ratio of estimated to measured severe wind reports in the eastern and southeastern United States (Edwards et al. 2018). While the RF algorithm views all observed storm reports equally (i.e., as unbiased, perfect observations) and does not consider storm coverage, density, intensity, or potential societal impacts when constructing its probabilities, SPC forecasters may be mindful of how their forecast probabilities equate to outlook categories (Table 3) and may emphasize higher-impact events that pose a greater threat to life and property.
The biggest advantage of the RF method described herein is its ability to create skillful CAE-derived severe weather guidance products analogous to those issued by the SPC. However, it must be emphasized that the goal in creating these RF-based products is not to replace human forecasters but to augment them. Indeed, this augmentation could potentially take a variety of (nonmutually exclusive) forms. First, RF-based forecasts could provide a skillful, reliable first guess (e.g., Karstens et al. 2018) product, which forecasters could modify based on other data sources (e.g., satellite and radar trends, surface analyses, etc.) and their expertise. Such a product could increase forecaster efficiency and facilitate proper forecast calibration (Karstens et al. 2018). Used as a first guess or “last check” product, the RF guidance may also identify a threat area that a forecaster might have overlooked for a given hazard (e.g., significant severe hail in the southern plains; Figs. 11e,f). The RF forecasts may also help simply by providing useful uncertainty information in challenging forecasting situations. Such uncertainty information may be especially valuable for more precisely quantifying the threat of significant severe weather, which is rare but extremely threatening to life and property. Finally, it is conceivable that the RF-based forecasts—when properly interrogated using ML interpretability metrics (e.g., McGovern et al. 2019)—may give forecasters and researchers insight into ensemble biases or complex relationships between CAE forecast output and observed severe weather. Human forecasters learning from ML would not be unprecedented, as artificial intelligence techniques have recently provided new knowledge to human experts in other complex domains, such as the game of Go (Silver et al. 2016, 2017) and multiplayer no-limit Texas Hold’em poker (Brown and Sandholm 2019). 
A future study is planned to determine how and why RF-based severe weather probabilities differ from human and UH-based forecasts.
6. Summary and conclusions
This paper used a random forest (RF) to produce CONUS-wide 1200–1200 UTC day 1 convective outlooks (COs) and individual-hazard severe weather probabilities from Storm-Scale Ensemble of Opportunity (SSEO) forecast output. Temporally aggregated gridpoint-based forecast variables were used as predictors. The gridpoint-based approach is advantageous because it allows users to interpret RF output directly in two-dimensional space and does not require the assumption of perfect correspondence between simulated and observed storms.
Continuous and discrete (i.e., truncated) RF forecasts created herein were compared against calibrated, spatially smoothed 2–5-km updraft helicity (UH) forecasts as well as original and continuous (i.e., full) SPC day 1 COs issued at 0600 UTC. The continuous RF forecasts almost always produced the highest BSSs, both when the forecasts were verified over the entire dataset and when verification was performed regionally or seasonally. The truncated RF forecasts frequently had the second-highest BSSs and were often better—but never substantially worse—than the corresponding original SPC forecasts. In general, the RF method performed best relative to the SPC and UH forecasts for severe wind and hail prediction in the Midwest and East regions during the spring and summer. All forecasts—including the RF-based ones—generally had very good reliability, while the RF forecasts tended to have the best resolution.
Given the promising results of the RF technique described herein, it is important to evaluate its skill and value to forecasters in an operational environment. To this end, efforts are under way to apply the technique described herein to the operational HREFv2 with the goal of evaluating real-time RF forecasts in future Hazardous Weather Testbed Spring Forecasting Experiments (e.g., Clark et al. 2012a; Gallo et al. 2017). While such formal evaluation is necessary to draw more robust conclusions, it is speculated that real-time RF-based guidance will aid human day 1 severe weather forecasts by providing forecasters with calibrated CAE-based severe hazard probabilities and outlooks.
Support for this work was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA11OAR4320072, U.S. Department of Commerce. Additional support was provided by the Developmental Testbed Center (DTC). The DTC Visitor Program is funded by the National Oceanic and Atmospheric Administration, the National Center for Atmospheric Research, and the National Science Foundation. We would also like to acknowledge high-performance computing support from Cheyenne (doi:10.5065/D6RX99HX) provided by NCAR’s Computational and Information Systems Laboratory, sponsored by the National Science Foundation. Additionally, we extend our thanks to two anonymous reviewers, whose feedback improved the quality of the manuscript. AJC and CDK contributed to this work as part of regular duties at the federally funded NOAA/National Severe Storms Laboratory. The statements, findings, conclusions, and recommendations presented herein are those of the authors and do not necessarily reflect the views of NOAA or the U.S. Department of Commerce.
0600 UTC SPC COs are used because SPC forecasters, like the RFs, have access to 0000 UTC SSEO guidance during that forecast period.