1. Introduction
Operational predictions of severe weather hazards (i.e., tornadoes, hail, and wind) are a key focus of the National Oceanic and Atmospheric Administration (NOAA) Storm Prediction Center (SPC), which is responsible for “timely and accurate forecasts and watches for severe thunderstorms and tornadoes over the contiguous United States” (SPC 2020). The SPC uses forecast guidance from numerical weather prediction (NWP) models, including postprocessed products, diagnostic parameters (e.g., storm relative helicity), as well as current observations, to issue outlooks 1–8 days in advance of the threat of severe weather; outlooks issued in the 1- and 2-day timeframe delineate threats for specific hazards whereas day 3–8 outlooks highlight risk areas of any severe hazard. Because of the tremendous societal and financial impacts of severe weather events, including 10 severe weather–attributed billion-dollar disasters in 2021 alone (NCEI 2022), it is imperative that SPC forecasters receive reliable and valuable forecast information to inform their operational products to provide sufficient lead time to stakeholders of the threat of severe weather.
Deterministic and ensemble NWP model predictions of severe weather and associated hazards have improved substantially over the last decade as dynamical models have leveraged increased computing power to decrease grid spacing and increase effective resolution. Real-time high-resolution models are now capable of explicitly resolving discrete convective storms and mesoscale convective systems (e.g., Done et al. 2004; Kain et al. 2008; Dowell et al. 2022); these prediction systems are commonly referred to as convection-allowing models (CAMs). These advances provided opportunities to forecast weather hazards (e.g., tornadoes) by using proxies (e.g., updraft helicity; Sobash et al. 2011, 2016b,a; Hill et al. 2021) in NWP model output, or generating calibrated guidance products (e.g., Gallo et al. 2018; Harrison et al. 2022; Jahn et al. 2022) that probabilistically depict severe weather threat areas. However, few of these methods have carried over into longer-range forecasting because of their dependence on CAMs, which are operationally limited to the near-term (<4 days) due to computational constraints and rapid increases in forecast spread from small-scale errors (Zhang et al. 2003, 2007). As a result, SPC forecasters must rely on global ensemble and deterministic prediction systems to provide guidance for the day 4–8 outlooks (hereafter referred to as the “medium range”). Because these systems parameterize convection, they tend to provide more limited value for severe weather forecasting than CAMs.
Efforts have been made to leverage the large global ensemble datasets to generate postprocessed and calibrated forecast products for medium-range forecast events as well (e.g., Carbin et al. 2016; Gensini and Tippett 2019). Postprocessed fields become increasingly valuable at these extended lead times for coarse global models with convective parameterization, offering a simplistic probabilistic estimate for the threat of severe weather (e.g., calibrated severe thunderstorm probabilities; Bright and Wandishin 2006) that may not be contained in highly variable deterministic model output fields (e.g., run to run variability) or ensemble output diagnostics (e.g., ensemble mean, variance). For example, the U.S. National Blend of Models (Hamill et al. 2017; Craven et al. 2020) uses quantile mapping to generate postprocessed precipitation forecasts in the medium range. Additionally, analog forecasting (Lorenz 1969) has been used to forecast high-impact weather environments (e.g., Hamill and Whitaker 2006), and typically involves training a regression model (e.g., logistic regression) on past environmental states that were coincident with severe weather reports and learning how similar environments relate to severe weather frequency (e.g., CIPS Analog-Based Severe Guidance).1 Using these developed statistical relationships, the predicted atmospheric patterns are interrogated to determine comparable analogs on which to base a new forecast. To the authors’ knowledge, no analog techniques have been formally published addressing severe weather forecasting in the medium range, and a need still exists for techniques and tools to generate skillful medium-range severe weather forecast guidance.
Recently, more advanced machine learning (ML) models have emerged into the meteorology domain as a complementary and alternative option to forecast severe weather hazards. Whereas dynamical models cannot explicitly forecast hazards below their effective resolution, ML models can be trained to forecast any hazard given a sufficiently accurate observational dataset (i.e., labels) and related environmental predictors (i.e., features). ML has become a widely used technique to postprocess NWP model output, for example, to generate forecasts mimicking operational NCEP center products (e.g., Herman and Schumacher 2018b; Loken et al. 2019; Hill et al. 2020; Loken et al. 2020; Sobash et al. 2020; Hill and Schumacher 2021; Schumacher et al. 2021) and forecast severe weather or hazards more generally (e.g., Gagne et al. 2014, 2017; Jergensen et al. 2020; McGovern et al. 2017; Burke et al. 2020; Flora et al. 2021; Loken et al. 2022). Others have developed ML prediction systems using observations or reanalysis datasets (e.g., Gensini et al. 2021; Shield and Houston 2022), demonstrating ML-based prediction systems capable of highlighting conducive severe weather environments, for example. Hill et al. (2020) and Loken et al. (2020) employed random forests (RFs; Breiman 2001) to generate forecasts analogous to SPC outlooks, effectively creating postprocessed and probabilistic first-guess forecasts of severe weather that could be used by forecasters when generating their outlooks. Loken et al. (2020) used CAM fields as inputs to train an RF and derive day-1 hazard forecast probabilities whereas Hill et al. (2020) used global ensemble output as inputs to generate day-1–3 forecasts. Both studies demonstrated that RFs could produce skillful and reliable forecasts of severe weather at short lead times, and Hill et al. (2020) further highlighted that incorporating the statistical product information into the human-generated SPC forecast (i.e., through a statistical weighting procedure) yielded a better forecast at day 1 than either individual component forecast; the statistical models outperformed SPC forecasters at days 2 and 3. Therefore, it is reasonable to hypothesize that a similar RF-based prediction system devoted to forecasting severe hazards beyond day 3 would benefit SPC forecasters issuing medium-range forecasts.
Building upon the work of Hill et al. (2020), this study trains and develops RFs to forecast severe thunderstorm hazards in the medium range (i.e., days 4–8). The RF infrastructure of the Colorado State University Machine Learning Probabilities (CSU-MLP) prediction system (e.g., Hill et al. 2020; Schumacher et al. 2021) is used herein to explore medium-range severe weather predictions and determine their utility in relation to operational forecasts. Feature engineering (e.g., Loken et al. 2022) is also explored (i.e., how to organize predictors from the dynamical model output used to train the RF models) to determine if medium-range RF-based forecast skill is impacted by the way predictors are gathered and trained with observations. All RF-based forecasts are evaluated alongside corresponding SPC outlooks to assess the relative value of incorporating the statistical guidance into forecast operations.
2. Methods
a. Random forests
RFs combine the simplicity of the decision tree architecture with the robustness of ensemble methods. Individual decision trees are constructed beginning with the root node (i.e., top of the tree) where a subset of training examples (i.e., instances of severe weather or no severe weather observation) is extracted from the full training set via bootstrapping and a feature is selected that best splits the examples, i.e., the feature best describes separation of severe weather events from nonevents. Successive nodes are similarly constructed along the branches of the tree until a maximum tree depth is reached or a minimum number of training samples needed to split a node is breached, ending that particular branch in a terminal or “leaf” node. The leaf node either contains all the same observation types (e.g., all examples are severe weather events) or a mixture. As new inputs are supplied to a decision tree, e.g., from real-time NWP forecast output, the tree can be traversed to a leaf node using the inputs, producing a deterministic or probabilistic prediction of severe weather from the single tree. The aggregation of all decision tree predictions from a forest of decision trees provides a probabilistic prediction for the threat of severe weather.
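For illustration, the short sketch below shows how a forest's probabilistic prediction arises from aggregating the predictions of its individual trees, as described above. It uses scikit-learn with synthetic data; the array shapes and variable names are illustrative only and do not correspond to the operational CSU-MLP code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))    # 500 synthetic training examples, 12 features
y = rng.integers(0, 2, size=500)  # 0 = no severe weather, 1 = severe weather (toy labels)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

x_new = rng.normal(size=(1, 12))  # e.g., features derived from real-time NWP output

# Each tree is traversed to a leaf node and returns the class fractions stored there;
# the forest probability is the average of the per-tree estimates.
per_tree = np.stack([tree.predict_proba(x_new) for tree in forest.estimators_])
forest_prob = per_tree.mean(axis=0)

assert np.allclose(forest_prob, forest.predict_proba(x_new))
print("P(severe) =", forest_prob[0, 1])
```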
RF-based forecasts in this context are analogous to the SPC outlooks at days 4–8, i.e., the probability of severe weather occurrence within 40 km (or 25 mi) of a point over the daily 24-h period defined by 1200–1200 UTC, and forecast products are constructed to mimic those produced by SPC (e.g., Fig. 1). For RF training, severe weather observations are encoded onto a grid by interpolating point-based severe storm reports from the SPC Severe Weather Database (SPC 2022b) to NCEP grid 4 (0.5° grid spacing); the same grid is used to define RF features, further discussed in section 2a(1). Any severe weather report during the 24-h forecast window is interpolated to the 0.5° grid using a 40-km neighborhood, resulting in at least one grid point encoded for each severe weather report; the 0.5° grid has approximately 55-km grid spacing. Severe weather reports are defined as tornadoes of any EF rating, hail at least 1 in. (2.54 cm) in diameter, and convective wind speeds of at least 58 mph (93 km h−1); damage indicators (e.g., tree limb sizes) are routinely used to estimate wind speeds and issue wind reports. Training examples are encoded as 0 (no severe), 1 (nonsignificant severe), or 2 (significant severe) across the CONUS (Table 1); the significant severe designation is a specific class of tornadoes of intensity F2/EF2 or greater, hail exceeding 2 in. (5.08 cm) in diameter, and convective wind gusts exceeding 74 mph (119 km h−1). Thus, the RFs are tasked with predicting three classes. However, because the current version of the SPC day-4–8 outlooks does not include significant severe probability contours, CSU-MLP RF significant severe forecasts will not be formally evaluated in this work; readers are referred to Hill et al. (2020), who examine significant severe forecast skill in depth for their day-1 forecasts.
Table 1. Classification labels.
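As a concrete illustration of the label-encoding step described above, the sketch below maps point reports to 0.5° grid points within a 40-km neighborhood and assigns class 1 (severe) or class 2 (significant severe). The grid and report data structures are hypothetical and are used only for demonstration; this is not the operational code.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def encode_labels(grid_lat, grid_lon, reports, radius_km=40.0):
    """grid_lat/grid_lon: 2D arrays for the 0.5-deg grid; reports: list of
    (lat, lon, is_significant) tuples valid in one 1200-1200 UTC window."""
    labels = np.zeros(grid_lat.shape, dtype=int)               # 0 = no severe weather
    for lat, lon, is_sig in reports:
        near = haversine_km(lat, lon, grid_lat, grid_lon) <= radius_km
        labels[near] = np.maximum(labels[near], 2 if is_sig else 1)
    return labels
```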
1) Feature assembly
Meteorological features surrounding the encoded observations used for training are obtained from the Global Ensemble Forecast System version 12 (GEFSv12) Reforecast dataset (hereafter GEFS/R; Hamill et al. 2022; Zhou et al. 2022), a 5-member ensemble system initialized daily at 0000 UTC that utilizes the finite volume cubed sphere (FV3; Harris and Lin 2013) dynamical core. GEFSv12 reforecasts date back to 1 January 2000 and extend through the end of 2019. Feature variables with known or presumed relationships to severe weather are extracted from the GEFS/R, including surface-based convective available potential energy (CAPE), precipitable water (PWAT), and bulk vertical wind shear (e.g., SHEAR500); a full list of features is provided in Table 2. Because of inconsistent resolution in the GEFS/R dataset, e.g., near-surface features have higher horizontal resolution (0.25° grid) than upper-tropospheric features (0.5° grid), each retrieved meteorological output field is interpolated to the 0.5° grid, which also aligns the simulated meteorological environments with the encoded observations. Some meteorological features are limited over portions of the CONUS; for example, the 850-hPa pressure level may be below ground in areas of high terrain, rendering SHEAR850 undefined. In these instances, bounded-variable features (e.g., wind speeds) are set to 0 and continuous-variable features (e.g., 2-m temperature) are set to the climatological average computed over the training dataset. Additionally, latitude, longitude, and Julian day are used as static features, coincident with the encoded severe weather report location.
Table 2. Meteorological features used for RF training and forecasts.
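The handling of undefined features described above (bounded variables set to 0, continuous variables set to a training-period climatology) can be sketched as follows; the NaN-masking convention is an assumption made purely for illustration.

```python
import numpy as np

def fill_undefined(field_2d, climo_2d=None):
    """field_2d: one GEFS/R field on the 0.5-deg grid with NaN where undefined
    (e.g., SHEAR850 where 850 hPa is below ground). If climo_2d is None the
    variable is treated as bounded (filled with 0); otherwise the gridpoint
    climatological mean from the training period is used."""
    filled = field_2d.astype(float).copy()
    mask = np.isnan(filled)
    filled[mask] = 0.0 if climo_2d is None else climo_2d[mask]
    return filled
```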
Meteorological features are assembled in a forecast-point relative framework in which raw values are gathered both at the observation training example point and over a predefined radius around the point. The predefined radius is set to 3 grid points for all models trained in this manner. Additionally, the GEFS/R has 3-hourly temporal resolution across the 24-h forecast windows (i.e., instantaneous fields at 3-h intervals), allowing for both spatial and temporal sampling of the environment surrounding a severe weather report that occurred within the window; a depiction of the assembly method is provided in Fig. 2. The raw values accumulated in space and time represent the ensemble median for that given point; previous work demonstrated the superiority of using the ensemble median as opposed to the ensemble mean or outliers of the ensemble distribution (Herman and Schumacher 2018b). In other words, a two-dimensional time series is constructed for each meteorological feature at every grid point across the forecast window.
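A minimal sketch of the forecast-point relative assembly is given below, assuming the GEFS/R fields for one 24-h window are held in arrays of shape (members, times, ny, nx): the ensemble median of each variable is sampled at every 3-h time and at every grid point within r points of the training example, then flattened into a single feature vector. The function and array names are hypothetical.

```python
import numpy as np

def assemble_point_relative(ens_fields, j, i, r=3):
    """ens_fields: dict mapping variable name -> array (members, times, ny, nx)
    for one 24-h window (times = 9 with 3-hourly output); (j, i) is the
    training-example grid point, assumed at least r points from the domain edge.
    Returns a 1D feature vector of length m * times * (2r + 1)**2."""
    features = []
    for name, arr in ens_fields.items():
        median = np.median(arr, axis=0)                         # ensemble median
        patch = median[:, j - r:j + r + 1, i - r:i + r + 1]     # times x (2r+1) x (2r+1)
        features.append(patch.ravel())
    return np.concatenate(features)
```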
2) Alternative assembly methods
Three alternative feature assembly methods are also applied. Previous research employing the CSU-MLP prediction system has nearly always applied the forecast-point relative framework to assemble features (e.g., Herman and Schumacher 2018b,a; Hill et al. 2020; Schumacher et al. 2021). However, Hill and Schumacher (2021) showed that spatially averaging the features (e.g., a 1D time series) yielded improved RF-based excessive rainfall forecasts, and they hypothesized that characterizing the environment at a particular time with a single mean value reduced noise during model training. Similarly, Loken et al. (2022) demonstrated that improved RF forecast skill could be achieved by using the ensemble mean at each spatial point rather than all individual members at each point because it reduced training noise. While Hill and Schumacher (2021) and Loken et al. (2022) both used CAM inputs to train RFs, which may have inherently more noise than a coarser-resolution global model, the same spatial averaging procedure of Hill and Schumacher (2021) is employed here (e.g., Fig. 3) as the first alternative approach, to explore whether the medium-range RF predictions suffer from noisy input features and whether skill could be improved by removing that noise.
A second alternative assembly method can be explored when spatially averaging predictors because the total number of features is substantially reduced, decreasing the computational time needed to train the RFs. For example, the traditional CSU-MLP feature assembly method (the forecast-point relative method described above) contains N = mp(2r + 1)² features, where m is the number of meteorological variables (m = 12), p is the number of forecast hours in the 24-h window (p = 9), and r is the number of grid points used surrounding each training example (r = 3); N is 5292 when the traditional CSU-MLP method is used and 108 (m = 12, p = 9, r = 0) when features are spatially averaged. The reduced number of features motivates a second exploration into how preceding environments [i.e., the day(s) leading up to a severe weather event] may add information to RF predictions and change subsequent skill. For instance, it is well understood that moisture return from the Gulf of Mexico into the Great Plains often precedes a severe weather event, providing deeper boundary layer moisture and ample CAPE to support deep convection and severe weather (Johns and Doswell 1992)—can the RFs learn to better predict severe weather by considering how the atmosphere becomes “primed” before cyclogenesis in the lee of the Rockies? Furthermore, ensemble spread increases as a function of lead time, so considering GEFS/R ensemble median meteorological features on prior days for a day-6 forecast, for example, may yield additional confidence in a forecast outcome compared with considering only features spanning the day-6 forecast window. Therefore, three experiments are conducted that use the preceding 1, 2, and 3 days of features—hereafter referred to as p1, p2, and p3, respectively. For simplicity, only the surrounding environment near the training example is sampled rather than employing a trajectory analysis to sample the upstream environment; such a method could be explored in future work. Using four total days of spatially averaged features (i.e., 3 preceding days and 1 day valid over the forecast window) requires only 396 features (p = 33), an order of magnitude fewer than the traditional CSU-MLP feature assembly method. For brevity, and because this method evaluation is exploratory in nature, the spatial averaging methodology, using 0–3 preceding days (i.e., p0, p1, p2, and p3), is applied only for day-6 model training and corresponding forecast evaluation.
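The feature counts quoted above follow directly from N = mp(2r + 1)²; a quick check for each configuration:

```python
def n_features(m, p, r):
    """m: meteorological variables, p: 3-hourly times, r: gridpoint radius."""
    return m * p * (2 * r + 1) ** 2

print(n_features(12, 9, 3))    # traditional forecast-point relative method -> 5292
print(n_features(12, 9, 0))    # spatially averaged, forecast window only -> 108
print(n_features(12, 33, 0))   # spatially averaged with 3 preceding days -> 396
```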
A third alternative assembly method using time-lagged GEFS/R data is also explored. Time-lagging has been shown to be effective at creating forecast spread, yielding improved forecasts over the deterministic component forecasts, and to be competitive with multimodel and initial-condition perturbation ensembles at convective scales (e.g., Jirak et al. 2018; Wade and Jirak 2022). One specific benefit of time-lagging is that no additional computation is required: the time-lagged forecasts have already been created and are available, and they can be combined relatively easily to compute an ensemble median. In this work, time-lagging is used to increase the GEFS/R ensemble size, ideally yielding a more representative ensemble median when assembling features. The 10-member time-lagged ensembles (denoted as tl10 experiments) are generated using sequential reforecast initializations, with features from each initialization valid over the same forecast window. As a result, the 10-member time-lagged RF models use one fewer day for training, as discussed in the next subsection; the first GEFS/R initialization has no corresponding time-lagged members from the previous day.
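A sketch of the 10-member time-lagged pooling is shown below, assuming the 5 members of the current initialization and the 5 members of the previous day's initialization have been subset to the same valid window; the array layout and names are illustrative.

```python
import numpy as np

def time_lagged_median(current, previous_lagged):
    """current, previous_lagged: arrays of shape (5, times, ny, nx) valid over the
    same 24-h window; previous_lagged comes from the initialization one day earlier
    (its lead times are shifted by 24 h). Returns the 10-member ensemble median."""
    pooled = np.concatenate([current, previous_lagged], axis=0)  # (10, times, ny, nx)
    return np.median(pooled, axis=0)
```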
3) Training, validation, and testing
While the GEFS/R comprises 20 years of daily reforecasts, only ∼9 years are used for training the medium-range forecast models. This decision was made largely to facilitate comparison with the day-1–3 models developed by Hill et al. (2020), who used a previous version of the GEFS; those models are not explored in this work. Daily initializations of the GEFS/R from 12 April 2003 to 11 April 2012 are used to assemble features, and severe weather reports are aggregated and encoded for this same period. It should be noted that because the first initialization is 12 April 2003, the first training examples (for the day-4 models) are valid 1200–1200 UTC 15/16 April 2003. The k-fold cross-validation technique (Raschka 2015), with k set to 4, is employed over the 9-yr period to select an optimal model as well as to avoid overfitting the RFs. Testing of the trained RFs is conducted from 2 October 2020 to 1 April 2022; the operational GEFSv12 was implemented in late September 2020, and full datasets to make real-time predictions were available beginning 2 October 2020.
RFs are trained for distinct regions of the country that represent unique diurnal and seasonal climatologies of severe weather (e.g., Brooks et al. 2003; Fig. 4); these regions are consistent with those used by Hill et al. (2020). For each region, an RF is cross validated (i.e., four models are trained) and the model that minimizes the Brier score (BS; Brier 1950) on its withheld fold is selected. After a fold is chosen, RF hyperparameters are varied during training and the trained models are again validated on the withheld fold to tune the RFs. It should be noted that during RF tuning the BSs of the trained RFs were more sensitive to the cross-validation fold than to the hyperparameters varied. Hyperparameters varied included the number of trees in the forest and the minimum number of samples needed to split a node. Entropy was set as the splitting criterion, and a random selection of features, equal in size to the square root of the total number of features, was evaluated at each node. RFs trained with the alternative feature assembly methods undergo the same cross-validation and hyperparameter tuning procedure.
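A hedged sketch of this tuning loop is given below: for each hyperparameter combination an RF is fit with the entropy splitting criterion and a square-root feature subset per node, and the Brier score on the withheld fold selects the configuration. The hyperparameter grid and the use of the “any severe” probability as the scored event are assumptions made for illustration, not the operational settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def brier_score(p_event, y_event):
    return np.mean((p_event - y_event) ** 2)

def tune_rf(X_train, y_train, X_val, y_val):
    """y_* contain labels in {0, 1, 2}; all three classes are assumed present in
    y_train. Returns the (Brier score, hyperparameters, model) of the best fit."""
    best = (np.inf, None, None)
    for n_trees in (100, 200, 500):            # example grid: number of trees
        for min_split in (2, 20, 100):         # example grid: min samples to split a node
            rf = RandomForestClassifier(
                n_estimators=n_trees,
                min_samples_split=min_split,
                criterion="entropy",           # splitting criterion used in this study
                max_features="sqrt",           # random feature subset at each node
                n_jobs=-1,
            ).fit(X_train, y_train)
            p_severe = rf.predict_proba(X_val)[:, 1:].sum(axis=1)  # P(class 1 or 2)
            bs = brier_score(p_severe, (y_val > 0).astype(float))
            if bs < best[0]:
                best = (bs, {"n_estimators": n_trees, "min_samples_split": min_split}, rf)
    return best
```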
Forecasts across the testing set are made using the GEFSv12 operational prediction system with 21 (the control and first 20) of the available 31 ensemble members. This decision to use fewer ensemble members was motivated by real-time computing constraints and the need to get real-time forecasts to SPC expeditiously. While the operational GEFS is run at approximately 25-km horizontal resolution, output fields are also available on the 0.5° resolution grid; the coarser resolution grids are used to assemble real-time features and make RF predictions. Forecasts are made with each regional, optimized RF and the severe weather probabilities are stitched together and smoothed at regional borders with a 7 × 7 gridpoint distance-dependent smoothing filter to limit discontinuities and create a CONUS-wide forecast. As with training, only the 21-member ensemble median is used to generate real-time features, which feed into all RF versions to generate predictions. It should be noted that the GEFS/R dataset could have been used for testing between 2012 and 2019, but real-time forecasts are generated with the operational GEFSv12 and the authors felt the most appropriate evaluation should use products that SPC forecasters would be using in the future. Additionally, the GEFS/R was created using the GEFSv12, which provides consistency between training and real-time inputs.
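The exact form of the 7 × 7 distance-dependent smoother is not specified above, so the sketch below assumes a Gaussian distance weighting applied only near regional borders; it is included purely to illustrate the stitching step.

```python
import numpy as np
from scipy.ndimage import convolve

def distance_weighted_kernel(size=7, length_scale=2.0):
    """Normalized kernel whose weights decay with distance from the center point."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    weights = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * length_scale ** 2))
    return weights / weights.sum()

def smooth_near_borders(prob_grid, border_mask, size=7):
    """prob_grid: stitched CONUS-wide probabilities; border_mask: True within a few
    grid points of a regional boundary. Only masked points are replaced."""
    smoothed = convolve(prob_grid, distance_weighted_kernel(size), mode="nearest")
    return np.where(border_mask, smoothed, prob_grid)
```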
b. Verification
Traditional methods used to quantify probabilistic prediction skill—e.g., Brier skill score (BSS), area under the receiver operating characteristic curve (AUROC), and reliability diagrams—are employed to evaluate the CSU-MLP and SPC forecasts.2 Observations of severe weather are obtained from the SPC archive of National Weather Service preliminary local storm reports since the SPC severe weather database was not updated through 2021 at the time the analysis was conducted. For a more direct comparison between continuous RF-generated forecasts and discrete SPC probabilities, the RF probabilities are discretized to resemble the outlooks issued by SPC forecasters. Discretization converts all RF probabilities within a probabilistic bin to the midpoint of the bin (e.g., Schumacher et al. 2021)—i.e., all probabilities below the 15% minimum SPC probability contour are set to 7.5%, probabilities between 15% and 30% to 22.5%, and probabilities above 30% are set to 65%. The bin midpoint conversion is also applied to the SPC contours such that areas of 0% are assigned 7.5%, 15% contours are converted to 22.5%, and 30% contours are assigned a value of 65%. Additional discretized contours (e.g., a 5% contour from the RF forecasts) are introduced in the verification section to elucidate factors influencing RF forecast skill. While the continuous RF probabilities could be evaluated alongside interpolated SPC contours (e.g., Hill et al. 2020), which have been shown to be more skillful than discrete contours (Herman et al. 2018), the limited number of possible SPC contours at days 4–8 reduces the utility of interpolation; the discrete 15% and 30% SPC contours are retained for verification. All medium-range SPC outlooks evaluated herein were issued at ∼0900 UTC each day and the shapefiles are converted to a gridded domain with ArcGIS as in Herman et al. (2018) as well as upscaled to NCEP grid 4 for comparison to the RF-based forecasts.
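The discretization described above amounts to mapping each continuous RF probability to a representative value for the SPC bin it falls in; a minimal sketch:

```python
import numpy as np

def discretize(prob, edges=(0.15, 0.30), values=(0.075, 0.225, 0.65)):
    """prob: continuous RF probabilities (0-1). Values below 15% map to 7.5%,
    15%-30% to 22.5%, and 30% and above to 65%."""
    return np.asarray(values)[np.digitize(prob, edges)]

# e.g., discretize(np.array([0.02, 0.20, 0.40])) -> array([0.075, 0.225, 0.65])
```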
AUROC (e.g., Marzban 2004) measures forecast resolution and in this work the prediction system’s ability to discriminate severe weather from nonsevere weather at various probabilistic thresholds. Reliability diagrams (Murphy and Winkler 1977) are also used to characterize the relative forecast probability against the observed frequency of events, highlighting forecast calibration at the various RF probabilities and the discrete 15% and 30% SPC levels. Finally, the percent of forecast area covered by observations is computed to evaluate consistent biases in contour size and/or forecast probability magnitude. If a 15% contour frequently contains more than 30% fractional coverage of observations (i.e., the next contour interval), that issued contour is considered too small. Alternatively, if the fractional coverage is less than 15%, the contour is too large. Fractional coverage has been used extensively to evaluate probabilistic ML-based forecasts against human-generated outlooks (e.g., Erickson et al. 2019; Hill et al. 2020; Erickson et al. 2021; Hill et al. 2021; Schumacher et al. 2021).
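For reference, the fractional-coverage diagnostic reduces to the fraction of grid points inside a probability contour that verified with at least one report; a short sketch with illustrative array names follows.

```python
import numpy as np

def fractional_coverage(forecast_prob, observed_event, level):
    """forecast_prob: gridded probabilities for one forecast day; observed_event:
    boolean grid of report occurrence; level: contour value (e.g., 0.15)."""
    inside = forecast_prob >= level
    if not inside.any():
        return np.nan                      # contour not issued on this day
    return observed_event[inside].mean()   # e.g., 0.22 -> 22% of the contour verified
```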
Finally, the statistical relationships identified by the RFs are inspected to glean additional insights about the features that the models rely on to make predictions and how those relationships align with our physical understanding of severe weather forecasting. Feature importances (FIs) are computed and evaluated to assess the use of feature information in each tree of the forest. Specifically, the “Gini importance” metric (e.g., Pedregosa et al. 2011; Herman and Schumacher 2018b; Whan and Schmeits 2018; Hill et al. 2020) is used to quantify the FIs. Each feature is assigned an importance value based on the number of times it is used to split a decision tree node, weighted by the number of training examples at the split (Friedman 2001). The importances are then summed over all splits and trees in the forest and can be aggregated to characterize temporal or spatial importance patterns (e.g., Hill et al. 2020). The higher the importance, the more value the RFs place on that predictor to make forecasts. While other FI techniques exist (e.g., McGovern et al. 2019), the Gini importance metric is used here as an initial glance and not a holistic interrogation of FIs of the developed RFs; a follow-on manuscript is being prepared to fully interrogate model FIs with sufficient breadth and depth that is beyond the scope of this study.
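As an illustration of how Gini importances can be aggregated by meteorological variable (e.g., to build Fig. 16-style summaries), the sketch below sums scikit-learn's feature_importances_ by the variable token in each feature name; the naming convention is assumed, not taken from the operational code.

```python
def aggregate_importances(rf, feature_names):
    """rf: fitted RandomForestClassifier; feature_names: strings such as
    'CAPE_f138_dx-2_dy+1' whose leading token identifies the variable."""
    totals = {}
    for name, importance in zip(feature_names, rf.feature_importances_):
        variable = name.split("_")[0]                 # e.g., 'CAPE'
        totals[variable] = totals.get(variable, 0.0) + importance
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```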
3. Verification period overview
a. Frequency of severe weather
Tornado event frequency highlights an active 1.5 years across the Southeast United States (Fig. 6a), likely attributable to two fall–winter seasons in the dataset; the climatological frequency of tornado activity across the Southeast United States has two peaks: one in the early fall and the other in late winter/early spring (e.g., Horgan et al. 2007; Guyer and Dean 2010; Dixon et al. 2011; Gensini and Ashley 2011; Cintineo et al. 2012; Smith et al. 2012; Gensini et al. 2020). With only one full spring season of severe weather (2021) in the verification period, relatively few tornadoes were reported across the Great Plains (e.g., Fig. 6b). Unsurprisingly, hail reports are primarily confined to the Great Plains, and wind reports are more uniformly distributed across the United States east of the Rocky Mountains (Fig. 6c) compared to the other two hazards. Across the mid-Atlantic up into New England, multiple high-impact weather events produced numerous wind reports (Figs. 6c,d). These reports were anomalous compared to the long-term severe weather climatology (Fig. 5b).
b. Frequency of forecasts over extended period (days 4–8)
The frequency of medium-range forecasts is first assessed spatially across the CONUS. Day-4–7 15% forecasts from the CSU-MLP system and outlooks from SPC were issued across areas that experienced frequent severe weather, including the Southeast United States and to some extent the southern Great Plains out to day 7 (Fig. 7); 30% forecast contours are qualitatively similar, but omitted for brevity. The RF-based forecasts were issued much more frequently, however, compared to SPC. At day 7, where SPC issued limited 15% contours across the CONUS (Fig. 7g), the RF issued forecasts in some point locations as many as 10 times (2% of the days; Fig. 7h). Moreover, the RFs forecast areas of severe weather across larger areas of the CONUS, covering nearly all states east of the Rockies at day 5 (Fig. 7d), whereas SPC limited their day 5 outlooks primarily south of 37°N (i.e., the northern border of Oklahoma).
Despite the relatively short verification period, and only one spring forecast season in the verification dataset, there appears to be seasonality to the issuance of SPC medium-range outlooks and CSU-MLP forecasts. SPC issued more 15% contours in the spring and fall seasons than summer and winter (Fig. 8a). This pattern is not necessarily surprising given the climatological frequency of strong synoptic weather systems and corresponding severe weather across the CONUS in the spring and fall (e.g., Brooks et al. 2003; Smith et al. 2012). In contrast, the CSU-MLP had a nearly uniform distribution of 15% contours across the late spring and summer months for days 4–6 (Fig. 8b), with perhaps a slight peak in frequency across the summer months—15% contours at days 7 and 8 were more common between January and August. SPC issued only a handful of 30% contours in the months of March and October (Fig. 8c) whereas the CSU-MLP issued a number of higher probability contours through the spring months, primarily at days 4 and 5 (Fig. 8d), highlighting the RF-based system’s ability to forecast both predictable and less predictable (i.e., warm season) severe weather regimes.
4. Forecast verification
RF and SPC forecast skill is first evaluated in aggregate across the entire verification period. Control RF (i.e., CSU-MLP) forecast skill decreases with increasing lead time; the RF forecasts are significantly more skillful than SPC outlooks (i.e., 95% confidence intervals do not overlap) at days 4 and 5, and both forecast systems have near-climatological skill beyond day 5 (Fig. 9a). As lead time increases, the RFs are confronted with learning how GEFS/R environments (specifically, the median environments) associate with severe weather as forecast variability and ensemble variance also increase. The limited number of GEFS/R ensemble members likely prohibits a proper depiction of all future atmospheric states and forces the RFs to “learn” relationships between severe weather events and simulated environments that may not be conducive to severe weather. As a result, the ability of the RFs to discriminate events from nonevents similarly decreases with increasing lead time, with AUROC falling to ∼0.5 by day 8 (Fig. 9b); AUROC is highest for day-4 forecasts (0.62). However, none of the AUROC values eclipse the 0.7 mark denoting “good” resolution (Mandrekar 2010). On the other hand, when 5% probability contours are included in the RF-forecast discretization process, skill increases substantially (BSS > 0.07 at day 4) and remains significantly larger than 0 (i.e., climatology) at day 8. Meanwhile, AUROC surpasses 0.8 at day 4 and is nearly 0.7 at day 8. While the higher-probability 15% and 30% contours do not have significant skill beyond day 5 (or good discrimination at any forecast lead time), the adjusted skill metrics that result from including 5% probability contours demonstrate that low-probability forecast contours may have considerable value for forecasting severe weather out to day 8.
The skill and discrimination ability of forecasts derived from the alternatively trained RFs are also assessed to determine whether medium-range prediction skill can be improved by learning to associate severe weather events with features in the days leading up to events (p0, p1, p2, and p3 experiments), by reducing the impact of noisy features, and by increasing the representative sample of atmospheric states in the underlying GEFS/R ensemble (tl10 experiment) used to assemble features. Aggregated BSSs are computed over the same verification period for the p0, p1, p2, p3, and tl10 experimental forecasts at the day-6 lead time (Fig. 10a), with the 5% contour included for comparison against the best control CSU-MLP forecasts (e.g., Fig. 9a). The forecast skill for all predictor-averaged models is statistically indistinguishable from that of the control CSU-MLP RF model (Fig. 10a). The BSSs alone imply that reasonably skillful day-6 forecasts can be derived by simply considering how the atmosphere evolves locally before a severe weather event. Moreover, the relatively equal skill among forecasts suggests that the raw GEFS/R features used in the control model (e.g., at all points in space within the radius) do not add significant value in training and perhaps exhibit less noise than their CAM counterparts (e.g., Hill et al. 2021; Loken et al. 2022). The time-lagged model forecasts exhibit significantly less skill than the control and predictor-averaged model forecasts, but their AUROC is significantly larger (Fig. 10b). Furthermore, the predictor-averaged model forecasts all improve upon the discrimination ability of the control system, with p0 exhibiting the highest AUROC (Fig. 10b).
An assessment of spatial forecast skill underscores the frequency biases of the RF-based and SPC forecasts. The SPC outlooks feature patchy areas of positive skill associated with instances of severe weather that were correctly highlighted days in advance (Figs. 11a,b); in other words, when SPC forecasters do issue outlooks, they do so quite skillfully, particularly at day 4. However, small areas of slightly negative skill interspersed with small areas of slightly positive skill in the southern Great Plains suggest there were missed opportunities to forecast severe weather events 4 and 6 days in advance. The smaller number of SPC forecasts at day 6 (Fig. 7e) further limits the extent of positive forecast skill (Fig. 11b) compared to day 4. It is possible that forecasters had too little confidence in the forecast evolution to warrant a 15% or 30% outlook contour at these lead times for various severe weather events. Furthermore, the limited verification dataset likely creates localized areas of high/low skill, and extending the verification to multiple years would help to clarify SPC forecast skill in the medium range.
The spatial skill of the control RF forecasts exhibits more expansive and smoothed areas of positive and negative BSSs (Figs. 11c,d). For brevity and since the spatially averaged feature models display similar skill, only the control forecast spatial skill is considered here. BSS differences between SPC outlooks and RF forecasts (Figs. 11g,h) demonstrate the complexities of verifying these forecasts over a short period of time. Across the upper Midwest and mid-Atlantic RF forecasts were notably better than SPC as a result of SPC issuing far fewer outlooks (Figs. 7a,g). The CSU-MLP spatial forecast skill at days 4 and 6 is improved further when the 5% probability contour is included (Figs. 11e,f), which amplifies areas of positive and negative skill. In some instances, including the 5% forecast contour reverses areas of negative skill (cf. Figs. 11d,f in Georgia) or replaces neutral skill with positive skill as in the Great Plains (cf. Figs. 11d,f in North and South Dakota). However, other areas have particularly low skill due to substantial reporting biases (e.g., underreporting in the Great Lakes; Figs. 11c–f) and more frequent 5% probability contours (not shown).
Forecast skill is also assessed by computing the fractional coverage of observations within specific forecast contours. While the SPC contours are defined as discontinuous single probabilistic levels and not a discrete range (e.g., 15%–30%), it is reasonable to suggest that the fractional coverage of observations for a probabilistic contour should not exceed the next probabilistic value, otherwise a higher contour would be warranted. Figure 12 shows the fractional coverage of observations, in which the objective is to be within the green and red horizontal bars for a particular probabilistic threshold. When below the green horizontal line, forecast contours are believed to be too large on average, and when above the red line, forecast areas are too small. SPC and CSU-MLP control forecasts are well calibrated at the 15% threshold at almost all forecast days; SPC forecasts are potentially too small at day 8, but a small sample size limits a complete analysis for that lead time (Fig. 12a). On the other hand, the day-4 and day-5 30% outlooks from SPC are typically smaller than the CSU-MLP control forecasts, which appear generally well-calibrated prior to day 7 (Fig. 12b). The CSU-MLP 5% probability contours are also well-calibrated at all forecast lead times (not shown).
For a more complete depiction of calibration, reliability diagrams are constructed for all SPC and control RF forecasts (Fig. 13). Reliability curves that fall above or below the perfect-reliability line (dashed black lines in both panels of Fig. 13) indicate underforecasting or overforecasting of severe weather events, respectively. All SPC outlooks are generally well calibrated prior to day 8, whereas RF forecasts generally have a small underforecast bias above the 15% probability threshold but achieve reliability at lower thresholds. Day-7 and day-8 control RF forecasts lose reliability quickly after 15% (Fig. 13b), plummeting effectively to no skill above 30% and no resolution (i.e., below the orange dashed line) at the highest probabilities considered; SPC maintains skill at day 7 but underforecasts severe weather events at day 8 (Fig. 13b). For the RF probabilities, while reliability for day-4–6 forecasts above 30% is considerably more variable (Fig. 13a) owing to smaller sample sizes (e.g., inset figure in Fig. 13), the forecasts still hover near perfect reliability. Analysis of reliability for the alternative feature assembly methods reveals that spatially averaging features maintains reliability relative to the control forecasts up to slightly higher probabilistic thresholds (e.g., 25%) for the day-6 forecasts, whereas the day-6 tl10 experiments overforecast observed events (not shown), aligning with the aggregate BSS statistics (Fig. 10a). This analysis underscores the utility of the continuous RF probabilities as a forecast tool at medium-range lead times, and also stresses the difficulty of accurately capturing severe weather threat areas multiple days in advance (i.e., most RF-based forecasts slightly underforecast the observed event frequencies), even for skillful statistical models.
a. Example forecasts
Two example forecasts are provided that display some of the skill and discrimination attributes of the RF outlooks described previously. For the first case (Fig. 14), all day-4–8 example forecasts are valid for the 24-h period ending 1200 UTC 16 December 2021. This particular event featured a compact shortwave trough with strong low- and midlevel wind fields. Robust low-level warm air advection with low 60s dewpoints contributed to an atmosphere primed for severe thunderstorms and a highly anomalous severe weather event for mid-December; a thorough discussion of the meteorological parameter space for this event is provided in the day-1 SPC forecast discussion archive (SPC 2022a). Medium-range forecasts from NWP models depicted the shortwave trough days in advance, but underpredicted instability at day 4 (not shown). SPC forecasters did not highlight severe weather areas at days 4–8 (i.e., no contours ≥ 15% were issued) but decided to issue a 5% severe hazard risk at day 3, noting the impressive kinematic support for damaging wind gusts.3 Increasing forecaster confidence in a high-impact severe weather event with decreasing lead time resulted in a 15% severe wind area on day 2 and moderate categorical risk being issued at day 1, with a 45% probability of severe wind and a large area of significant severe wind delineated; a 10% probability of tornadoes also accompanied the SPC day-1 forecast.
The CSU-MLP control forecasts (Fig. 14) depicted a 5% probability severe weather threat area across the upper Mississippi valley into the southern Plains eight days in advance (Fig. 14e). By day 6, a 15% probability contour was introduced in the forecasts with a 30% contour added in subsequent, shorter lead-time forecasts (Figs. 14a–c). Forecast skill scores (BSS) gradually increased from 0.04 at day 8 to 0.19 at day 5 with a slight decrease back to 0.15 at day 4. In short, the probabilistic RF-based guidance showed skill out to day 8 for this particular case, and the progression of forecasts from day 8 to 4 showcases the utility of the forecast system to highlight areas that may experience severe weather.4 On the other hand, this case also exemplifies an underforecast bias across all forecast days with observations overlapping forecast contours at higher percentages than the probability levels, consistent with previous analysis of the forecast system (e.g., Figs. 12, 13), and suggesting that the probabilities themselves were too low in these forecasts.
A second example is provided to reinforce the similarities between the experimental, alternative-assembly forecast systems for one particular case. On this day, the 24-h period ending 1200 UTC 14 August 2021, numerous wind reports were recorded across Ohio, Pennsylvania, West Virginia, and other mid-Atlantic states (Fig. 15). All RF-based forecasts have a broad 5% contour across this region, extending westward into Indiana, Illinois, and Missouri. None of the forecasts indicate greater than a 15% probability of severe weather, despite dense sets of wind reports in two corridors across the East. The 15% contour in the control system seems subjectively well positioned (BSS = 0.1076), but the p0 model forecast inaccurately extends the 15% contour westward (BSS = 0.1064). The p1 model forecast contracts the 15% contour back eastward, but also eliminates an area in New York and Pennsylvania that experienced wind reports (the BSS falls to 0.0979). The p2 and p3 model forecasts further contract the 15% contour (BSSs of 0.059 and 0.0441, respectively), leaving the 5% area relatively unchanged; the time-lagged model forecast is nearly identical to the p3-model forecast. This example illustrates case-to-case differences between the experiments even though the control system and alternative RF models are objectively similar in aggregate. A more comprehensive case-study evaluation would be needed to characterize these forecast differences over the entire forecast period, which is beyond the scope of this exploration but is an active area of research. Additionally, the computational savings of the spatial-averaging models, particularly in training the RFs, may support a continued investigation alongside the CSU-MLP control forecast system into their utility as an operational tool.
b. Feature importances
To better understand what the RFs have learned about severe weather prediction from the training process, FIs are aggregated by meteorological feature and region for the day-4, day-6, and day-8 CSU-MLP control models (Fig. 16). Consistent with the day-1–3 models developed by Hill et al. (2020), CAPE, CIN, MSLP, SHEAR500, and SHEAR850 are the most important features for the day-4 models (Fig. 16a); CAPE is also less important in the east region as MSLP and SHEAR850 increase in importance. As lead time increases, CAPE becomes less important in the central region (Figs. 16b,c), being replaced by Q2M. Q2M also becomes more important in the west region, but there CIN replaces CAPE as the most important feature (e.g., Fig. 16c). In the east region, CIN also replaces CAPE as the most important feature.
FIs of the p1 and p2 models are explored to further understand how the RFs leverage features in the day(s) leading up to severe weather events. FIs for both models peak during the day-6 forecast window (i.e., forecast hours 132–156), but they also ramp up from the days leading up to the event (Fig. 17), suggesting the local meteorological environment is being used by the RFs in predictions. Slight differences in FIs exist by regional model as well. For example, the West region p1 and p2 models have secondary peaks near forecast hours 120–123 (Fig. 17a), approximately 0000–0300 UTC the day before the forecast window, and p2 has a tertiary peak near forecast hour 99 (Fig. 17b). Not only are the prior-day features being used, but there is a notable cyclical nature that matches the diurnal climatology of severe weather (e.g., Hill et al. 2020). In contrast, FIs in the central and east regions do not have the same diurnal pattern, but rather have a nearly constant ramp up in FIs (e.g., orange and yellow bars in Fig. 17a) that is perhaps tied to a gradual change in the local environment that is less sensitive to time of day. However, since these FIs are a summation over all meteorological features, it is not clear what aspects of the environment prior to a severe weather event are being learned by the RFs to make day-6 predictions.
To further clarify the prior-day FIs, FIs are separated into thermodynamic and kinematic subgroups (refer to Table 2) for the p1 models (Fig. 18). In the west region models, which exhibited a strong diurnal FI pattern, the thermodynamic features (e.g., CAPE, Q2M) are the primary contributor to the prior-day FI secondary peak, with a sharp increase at forecast hour 123 (Fig. 18a)—the kinematic FIs in the west p1 model have a more subtle diurnal pattern, but still peak during the day-6 forecast window. In the central region model, the FIs have a broad and uniform peak from forecast hours 135–147 and a smaller contribution compared to the west region in the day prior period (Fig. 18b). The East region models lean on prior-day thermodynamic features slightly more than the Central region models, with a longer ramp-up of FIs from forecast hours 123–141 (Fig. 18c), but the kinematic FIs for both the central and east models are markedly smaller than the thermodynamic feature contributions. A full explanation for these FI patterns is reserved for future work, but as an initial assessment, the FIs highlight unique regional and meteorological relationships learned by the experimental models that were exploited by the RFs to make severe weather predictions. Changes to the local environment that preceded severe weather events clearly influenced the RFs during training, but to what extent those features contributed to forecast probabilities is not discernible from Gini importances alone.
5. Summary and discussion
Nine years of reforecasts from the GEFSv12 model are used along with historical records of severe weather to construct a novel RF prediction system capable of explicitly and probabilistically predicting severe weather at 4–8-day lead times, i.e., the medium range. Human forecasts issued by the SPC are evaluated alongside the RF-based predictions to assess the operational utility of the ML forecasts. A handful of experiments are also conducted to explore whether RF control forecasts could be improved through feature engineering and expanding the GEFSv12 ensemble size. The main conclusions are as follows:
- RF forecasts have more skill and more discrimination ability than the human-based outlooks, which is partly a reflection of the continuous probabilities of the RFs and their ability to issue lower probability contours more frequently that add considerable skill and resolution (i.e., AUROC) to the forecast system.
- The CSU-MLP forecasts tend to underforecast the occurrence of severe weather in the medium range at probabilistic contours above 15% whereas SPC forecasts are calibrated prior to day 8. This latter finding at least partially reflects the broader range of observed frequency of severe report occurrence that is used to verify SPC forecast probabilities.
- Using spatially averaged GEFS/R features yielded similarly skillful forecasts as the control CSU-MLP method while also allowing for prior-day meteorological information to inform the forecasts; the models learned to associate the buildup of thermodynamically and kinematically favorable environments with next-day severe weather events. Additionally, the similar skill among models suggests that spatiotemporally displaced GEFS/R features are not particularly noisy but do not provide tremendous value to the RFs.
- Time-lagging the GEFSv12 reforecasts to produce larger ensembles for RF training degraded forecast skill but increased forecast resolution.
- Feature importances revealed relationships known to be important for severe weather forecasting, providing confidence in the RF forecasts.
- The performance of the RF-based forecasts alongside the human-generated outlooks demonstrates their utility and potential value as a guidance tool in the human forecast process.
The comparisons between RF-based predictions and human forecasts provided in this work have some important caveats to consider. SPC forecasters have often employed specific philosophies in generating day-4–8 outlooks (Steven Weiss, personal communication) that likely limit the number of forecasts issued and hamper more skillful human-based medium-range forecasts. First, SPC forecasters are tasked with forecasting the probability of “organized severe thunderstorms,” not necessarily severe weather reports, so they will rarely if ever outline a high CAPE, low shear scenario in the medium range despite a high likelihood of thunderstorms. Second, SPC forecasters are very aware of maintaining day-to-day continuity in the outlooks when possible and prefer ramping up the forecast threat in days leading up to an event. For example, forecasters may opt to introduce a forecast area in a day-3 or day-2 outlook rather than days 4–8 because they may not want to highlight a low confidence threat area that has to be shifted, enlarged, or removed altogether in subsequent outlooks—forecasters are hesitant to add outlook areas when confidence is too low knowing they can add it into subsequent outlooks when considerable lead time may still be present. For these reasons, the SPC day-4–8 outlook product design includes provisions to issue “predictability too low” medium range outlooks when forecast uncertainty is large and confidence in a particular scenario unfolding in the day-4–8 timeframe is limited. In these cases, the accompanying written discussion will mention areas of possible severe weather concern and explain the limiting factors that preclude delineation of severe weather probabilities at that time. Relatedly, SPC forecaster perception is that NWS weather forecast offices may object when SPC removes or reduces severe weather probabilities in days before a potential event because it affects public messaging of severe weather. As a result, it is not uncommon for the initial SPC day-4–8 outlooks to be relatively small (e.g., Fig. 12) and infrequent, particularly when atmospheric predictability for severe weather wanes in the warm season (e.g., Fig. 8). As confidence increases in a severe weather threat area, the probabilities can be increased and the area expanded in day-1–3 outlooks. These and other internal constraints, along with the relative dearth of useful NWP model severe weather guidance, impact SPC forecast skill at longer lead times. On the other hand, the ML guidance generated from the CSU-MLP could substantially aid SPC by increasing confidence and consistency to provide more lead time to operational partners and end users to the threat of severe weather.
The results presented highlight some ML success against baseline forecasts and a number of unique avenues that could be explored moving forward to enhance and improve both ML-based guidance and the SPC human-based forecasts, as well as increase interpretability of the ML “black box.” While the alternative feature assembly experiments (e.g., p2, tl10) did not yield forecasts that surpassed the skill of the traditional CSU-MLP system, the simplification of features could be exploited to include other ensemble diagnostic or summary metrics (e.g., mean, high or low member values) that characterize ensemble spread into the medium range. The meteorological features could also be varied in any of the ML configurations to explore which features add the most value, or objective methods (e.g., permutation importance) could be used to reduce feature redundancy and select a more optimal subset of features. It will also be vitally important that alternative interpretability metrics [e.g., tree interpreter (Saabas 2016), Shapley additive values (SHAP; Shapley 2016), accumulated local effect (ALE; Apley and Zhu 2020)] are employed to interrogate how the RFs make predictions; this exploration is underway and will be the focus of a follow-on manuscript. Additionally, the added benefit of the ML system against the underlying GEFS model could be quantified more explicitly. Traditionally, ML-based forecasts have been measured against the very model that generates the ML predictors, with demonstrated success improving upon the raw dynamical models (e.g., Herman and Schumacher 2018b). In this instance, with notable 2-m dry and low-instability biases in the GEFSv12 system5 (Manikin et al. 2020), it would be informative to quantify the value added by the ML system to correct for these biases when making severe weather predictions.
Equipped with calibrated statistical products such as the CSU-MLP RF system and expert human knowledge, there is promise that SPC forecasters may be able to increase medium-range outlook skill. Furthermore, by incorporating these types of robust, skillful statistical guidance products into the forecast process, SPC forecaster confidence in a generalized severe weather forecast outcome may increase starting days in advance. Being able to unveil how and why the ML models are issuing probabilities in a certain area will provide additional confidence to forecasters to rely on the products as a forecast tool. We expect that the usefulness of the CSU-MLP prediction system and others like it is not necessarily limited to the medium range, as demonstrated for days 1–3 by Hill et al. (2020), and the applicability to subseasonal or seasonal predictions of severe weather is planned for future investigation. Additionally, with continued effort from the meteorological community to explain AI/ML methodologies in comprehensible terms (e.g., Chase et al. 2022) and to make these tools more common in academic settings, there will be more opportunities to pursue new and improved forecast methodologies. Finally, it is crucial that constant communication is maintained between ML developers and SPC forecasters in order to generate guidance products that SPC operations finds useful and valuable. One such avenue includes continued participation in, and development, testing, and evaluation of these ML products within, the Hazardous Weather Testbed Spring Forecast Experiment (Clark et al. 2021). It is important to leverage these testbed environments to better quantify how ML-based guidance products add value to the forecast process, which will be a priority moving forward.
1 Cooperative Institute for Precipitating Systems (CIPS) Analog-Based Probability of Severe Guidance available at https://www.eas.slu.edu/CIPS/SVRprob/SPG_Guidance_whitepaper.pdf.
2 SPC forecasters began looking at CSU-MLP forecasts in real time beginning in early 2022. As a result, the independence of RF forecasts and human-based outlooks cannot be guaranteed. Due to an already limited verification period, forecast dependence is ignored when comparing the verification statistics.
3 Day-3 discussion available at https://www.spc.noaa.gov/products/outlook/archive/2021/day3otlk_20211213_0830.html.
4 SPC forecasters used the CSU-MLP forecasts during this event, highlighting their value in upgrading SPC outlooks as the event neared. See the note from SPC forecaster Andrew Lyons: https://twitter.com/TwisterKidMedia/status/1471585397440487433?s=20&t=cSYwf08xjtvvuwIYr2TgHQ.
5 Internal SPC surveys have suggested these biases exist and are reducing forecaster confidence in deterministic Global Forecast System and GEFSv12 forecasts.
Acknowledgments.
This work is supported by NOAA Joint Technology Transfer Initiative Grant NA20OAR4590350. We thank SPC forecasters for their invaluable perspectives about these forecast products and continued collaboration to develop cutting edge medium-range guidance products. We greatly appreciate thoughtful comments and suggestions from three anonymous reviewers who helped improve the clarity and presentation of the manuscript.
Data availability statement.
All RF-based forecasts are available upon request from Colorado State University. SPC outlooks are available via a public archive at https://www.spc.noaa.gov/. The GEFS/R dataset and GEFSv12 forecasts are publicly available from Amazon AWS at https://registry.opendata.aws/noaa-gefs/. A complete dataset of forecasts and verification data associated with this manuscript is available from Dryad at https://doi.org/10.5061/dryad.c2fqz61cv.
REFERENCES
Apley, D. W., and J. Zhu, 2020: Visualizing the effects of predictor variables in black box supervised learning models. J. Roy. Stat. Soc., 82, 1059–1086, https://doi.org/10.1111/rssb.12377.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Bright, D. R., and M. S. Wandishin, 2006: Post processed short range ensemble forecasts of severe convective storms. 18th Conf. on Probability and Statistics in the Atmospheric Sciences, Atlanta, GA, Amer. Meteor. Soc., 5.5, https://ams.confex.com/ams/Annual2006/techprogram/paper_98458.htm.
Brooks, H. E., C. A. Doswell III, and M. P. Kay, 2003: Climatological estimates of local daily tornado probability for the United States. Wea. Forecasting, 18, 626–640, https://doi.org/10.1175/1520-0434(2003)018<0626:CEOLDT>2.0.CO;2.
Burke, A., N. Snook, D. J. Gagne II, S. McCorkle, and A. McGovern, 2020: Calibration of machine learning–based probabilistic hail predictions for operational forecasting. Wea. Forecasting, 35, 149–168, https://doi.org/10.1175/WAF-D-19-0105.1.
Carbin, G. W., M. K. Tippett, S. P. Lillo, and H. E. Brooks, 2016: Visualizing long-range severe thunderstorm environment guidance from CFSv2. Bull. Amer. Meteor. Soc., 97, 1021–1031, https://doi.org/10.1175/BAMS-D-14-00136.1.
Chase, R. J., D. R. Harrison, A. Burke, G. M. Lackmann, and A. McGovern, 2022: A machine learning tutorial for operational meteorology. Part I: Traditional machine learning. Wea. Forecasting, 37, 1509–1529, https://doi.org/10.1175/WAF-D-22-0070.1.
Cintineo, J. L., T. M. Smith, V. Lakshmanan, H. E. Brooks, and K. L. Ortega, 2012: An objective high-resolution hail climatology of the contiguous United States. Wea. Forecasting, 27, 1235–1248, https://doi.org/10.1175/WAF-D-11-00151.1.
Clark, A. J., and Coauthors, 2021: A real-time, virtual spring forecasting experiment to advance severe weather prediction. Bull. Amer. Meteor. Soc., 102, E814–E816, https://doi.org/10.1175/BAMS-D-20-0268.1.
Craven, J. P., D. E. Rudack, and P. E. Shafer, 2020: National Blend of Models: A statistically post-processed multi-model ensemble. J. Oper. Meteor., 8, 1–14, https://doi.org/10.15191/nwajom.2020.0801.
Dixon, P. G., A. E. Mercer, J. Choi, and J. S. Allen, 2011: Tornado risk analysis: Is Dixie Alley an extension of Tornado Alley? Bull. Amer. Meteor. Soc., 92, 433–441, https://doi.org/10.1175/2010BAMS3102.1.
Done, J., C. A. Davis, and M. Weisman, 2004: The next generation of NWP: Explicit forecasts of convection using the Weather Research and Forecasting (WRF) Model. Atmos. Sci. Lett., 5, 110–117, https://doi.org/10.1002/asl.72.
Dowell, D. C., and Coauthors, 2022: The High-Resolution Rapid Refresh (HRRR): An hourly updating convection permitting forecast model. Part I: Motivation and system description. Wea. Forecasting, 37, 1371–1395, https://doi.org/10.1175/WAF-D-21-0151.1.
Erickson, M. J., J. S. Kastman, B. Albright, S. Perfater, J. A. Nelson, R. S. Schumacher, and G. R. Herman, 2019: Verification results from the 2017 HMT–WPC flash flood and intense rainfall experiment. J. Appl. Meteor. Climatol., 58, 2591–2604, https://doi.org/10.1175/JAMC-D-19-0097.1.
Erickson, M. J., B. Albright, and J. A. Nelson, 2021: Verifying and redefining the Weather Prediction Center’s Excessive Rainfall Outlook forecast product. Wea. Forecasting, 36, 325–340, https://doi.org/10.1175/WAF-D-20-0020.1.
Flora, M. L., C. K. Potvin, P. S. Skinner, S. Handler, and A. McGovern, 2021: Using machine learning to generate storm-scale probabilistic guidance of severe weather hazards in the Warn-on-Forecast System. Mon. Wea. Rev., 149, 1535–1557, https://doi.org/10.1175/MWR-D-20-0194.1.
Friedman, J. H., 2001: Greedy function approximation: A gradient boosting machine. Ann. Stat., 29, 1189–1232, https://doi.org/10.1214/aos/1013203451.
Gagne, D. J., II, A. McGovern, and M. Xue, 2014: Machine learning enhancement of storm-scale ensemble probabilistic quantitative precipitation forecasts. Wea. Forecasting, 29, 1024–1043, https://doi.org/10.1175/WAF-D-13-00108.1.
Gagne, D. J., II, A. McGovern, S. E. Haupt, R. A. Sobash, J. K. Williams, and M. Xue, 2017: Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Wea. Forecasting, 32, 1819–1840, https://doi.org/10.1175/WAF-D-17-0010.1.
Gallo, B. T., A. J. Clark, B. T. Smith, R. L. Thompson, I. Jirak, and S. R. Dembek, 2018: Blended probabilistic tornado forecasts: Combining climatological frequencies with NSSL–WRF ensemble forecasts. Wea. Forecasting, 33, 443–460, https://doi.org/10.1175/WAF-D-17-0132.1.
Gensini, V. A., and W. S. Ashley, 2011: Climatology of potentially severe convective environments from the North American Regional Reanalysis. Electron. J. Severe Storms Meteor., 6 (8), https://doi.org/10.55599/ejssm.v6i8.35.
Gensini, V. A., and M. K. Tippett, 2019: Global ensemble forecast system (GEFS) predictions of days 1–15 U.S. tornado and hail frequencies. Geophys. Res. Lett., 46, 2922–2930, https://doi.org/10.1029/2018GL081724.
Gensini, V. A., A. M. Haberlie, and P. T. Marsh, 2020: Practically perfect hindcasts of severe convective storms. Bull. Amer. Meteor. Soc., 101, E1259–E1278, https://doi.org/10.1175/BAMS-D-19-0321.1.
Gensini, V. A., C. Converse, W. S. Ashley, and M. Taszarek, 2021: Machine learning classification of significant tornadoes and hail in the United States using ERA5 proximity soundings. Wea. Forecasting, 36, 2143–2160, https://doi.org/10.1175/WAF-D-21-0056.1.
Guyer, J. L., and A. R. Dean, 2010: Tornadoes within weak CAPE environments across the continental United States. 25th Conf. on Severe Local Storms, Norman, OK, Amer. Meteor. Soc., 1.5, https://ams.confex.com/ams/25SLS/techprogram/paper_175725.htm.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, https://doi.org/10.1175/MWR3237.1.
Hamill, T. M., E. Engle, D. Myrick, M. Peroutka, C. Finan, and M. Scheuerer, 2017: The U.S. National Blend of Models for statistical postprocessing of probability of precipitation and deterministic precipitation amount. Mon. Wea. Rev., 145, 3441–3463, https://doi.org/10.1175/MWR-D-16-0331.1.
Hamill, T. M., and Coauthors, 2022: The reanalysis for the Global Ensemble Forecast System, version 12. Mon. Wea. Rev., 150, 59–79, https://doi.org/10.1175/MWR-D-21-0023.1.
Harris, L. M., and S.-J. Lin, 2013: A two-way nested global-regional dynamical core on the cubed-sphere grid. Mon. Wea. Rev., 141, 283–306, https://doi.org/10.1175/MWR-D-11-00201.1.
Harrison, D. R., M. S. Elliott, I. L. Jirak, and P. T. Marsh, 2022: Utilizing the high-resolution ensemble forecast system to produce calibrated probabilistic thunderstorm guidance. Wea. Forecasting, 37, 1103–1115, https://doi.org/10.1175/WAF-D-22-0001.1.
Herman, G. R., and R. S. Schumacher, 2018a: “Dendrology” in numerical weather prediction: What random forests and logistic regression tell us about forecasting extreme precipitation. Mon. Wea. Rev., 146, 1785–1812, https://doi.org/10.1175/MWR-D-17-0307.1.
Herman, G. R., and R. S. Schumacher, 2018b: Money doesn’t grow on trees, but forecasts do: Forecasting extreme precipitation with random forests. Mon. Wea. Rev., 146, 1571–1600, https://doi.org/10.1175/MWR-D-17-0250.1.
Herman, G. R., E. R. Nielsen, and R. S. Schumacher, 2018: Probabilistic verification of Storm Prediction Center convective outlooks. Wea. Forecasting, 33, 161–184, https://doi.org/10.1175/WAF-D-17-0104.1.
Hill, A. J., and R. S. Schumacher, 2021: Forecasting excessive rainfall with random forests and a deterministic convection-allowing model. Wea. Forecasting, 36, 1693–1711, https://doi.org/10.1175/WAF-D-21-0026.1.
Hill, A. J., G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. Mon. Wea. Rev., 148, 2135–2161, https://doi.org/10.1175/MWR-D-19-0344.1.
Hill, A. J., C. C. Weiss, and D. C. Dowell, 2021: Influence of a portable near-surface observing network on experimental ensemble forecasts of deep convection hazards during VORTEX-SE. Wea. Forecasting, 36, 1141–1167, https://doi.org/10.1175/WAF-D-20-0237.1.
Horgan, K. L., D. M. Schultz, J. E. Hales Jr., S. F. Corfidi, and R. H. Johns, 2007: A five-year climatology of elevated severe convective storms in the United States east of the Rocky Mountains. Wea. Forecasting, 22, 1031–1044, https://doi.org/10.1175/WAF1032.1.
Jahn, D. E., I. L. Jirak, A. Wade, and J. Milne, 2022: Extracting storm mode and tornado potential from statistical moments of updraft helicity distribution. 31st Conf. on Weather Analysis and Forecasting/27th Conf. on Numerical Weather Prediction, Houston, TX, Amer. Meteor. Soc., 460, https://ams.confex.com/ams/102ANNUAL/meetingapp.cgi/Paper/392702.
Jergensen, G. E., A. McGovern, R. Lagerquist, and T. Smith, 2020: Classifying convective storms using machine learning. Wea. Forecasting, 35, 537–559, https://doi.org/10.1175/WAF-D-19-0170.1.
Jirak, I., B. Roberts, B. Gallo, and A. Clark, 2018: Comparison of the HRRR time-lagged ensemble to formal CAM ensembles during the 2018 HWT Spring Forecasting Experiment. 29th Conf. on Severe Local Storms, Stowe, VT, Amer. Meteor. Soc., 123, https://ams.confex.com/ams/29SLS/webprogram/Paper348703.html.
Johns, R. H., and C. A. Doswell III, 1992: Severe local storms forecasting. Wea. Forecasting, 7, 588–612, https://doi.org/10.1175/1520-0434(1992)007<0588:SLSF>2.0.CO;2.
Kain, J. S., and Coauthors, 2008: Severe-weather forecast guidance from the first generation of large domain convection-allowing models: Challenges and opportunities. 24th Conf. on Severe Local Storms, Savannah, GA, Amer. Meteor. Soc., 12.1, https://ams.confex.com/ams/24SLS/techprogram/paper_141723.htm.
Loken, E. D., A. J. Clark, A. McGovern, M. Flora, and K. Knopfmeier, 2019: Postprocessing next-day ensemble probabilistic precipitation forecasts using random forests. Wea. Forecasting, 34, 2017–2044, https://doi.org/10.1175/WAF-D-19-0109.1.
Loken, E. D., A. J. Clark, and C. D. Karstens, 2020: Generating probabilistic next-day severe weather forecasts from convection-allowing ensembles using random forests. Wea. Forecasting, 35, 1605–1631, https://doi.org/10.1175/WAF-D-19-0258.1.
Loken, E. D., A. J. Clark, and A. McGovern, 2022: Comparing and interpreting differently designed random forests for next-day severe weather hazard prediction. Wea. Forecasting, 37, 871–899, https://doi.org/10.1175/WAF-D-21-0138.1.
Lorenz, E. N., 1969: Atmospheric predictability as revealed by naturally occurring analogues. J. Atmos. Sci., 26, 636–646, https://doi.org/10.1175/1520-0469(1969)26<636:APARBN>2.0.CO;2.
Mandrekar, J. N., 2010: Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol., 5, 1315–1316, https://doi.org/10.1097/JTO.0b013e3181ec173d.
Manikin, G., A. Bentley, L. Dawson, and S. Shields, 2020: Model Evaluation Group (MEG) GEFSv12 evaluation overview. Accessed 3 August 2022, https://www.emc.ncep.noaa.gov/users/meg/gefsv12/pptx/MEG_4-23-20_GEFSv12_Overview.pptx.
Marzban, C., 2004: The ROC curve and the area under it as performance measures. Wea. Forecasting, 19, 1106–1114, https://doi.org/10.1175/825.1.
McGovern, A., K. L. Elmore, D. J. Gagne II, S. E. Haupt, C. D. Karstens, R. Lagerquist, T. Smith, and J. K. Williams, 2017: Using artificial intelligence to improve real-time decision-making for high-impact weather. Bull. Amer. Meteor. Soc., 98, 2073–2090, https://doi.org/10.1175/BAMS-D-16-0123.1.
McGovern, A., R. Lagerquist, D. J. Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. J. Roy. Stat. Soc., 26, 41–47, https://doi.org/10.2307/2346866.
NCEI, 2022: U.S. billion-dollar weather and climate disasters. NOAA/National Centers for Environmental Information (NCEI), accessed 19 January 2022, https://www.ncdc.noaa.gov/billions/.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Raschka, S., 2015: Python Machine Learning. Packt Publishing Ltd., 454 pp.
Saabas, A., 2016: Random forest interpretation with scikit-learn. Accessed 13 July 2022, https://blog.datadive.net/random-forest-interpretation-with-scikit-learn/.
Schumacher, R. S., A. J. Hill, M. Klein, J. A. Nelson, M. J. Erickson, S. M. Trojniak, and G. R. Herman, 2021: From random forests to flood forecasts: A research to operations success story. Bull. Amer. Meteor. Soc., 102, E1742–E1755, https://doi.org/10.1175/BAMS-D-20-0186.1.
Shapley, L. S., 2016: A value for n-person games. Contributions to the Theory of Games (AM-28), H. W. Kuhn and A. W. Tucker, Eds., Annals of Mathematics Studies, Vol. II, Princeton University Press, 307–318.
Shield, S. A., and A. L. Houston, 2022: Diagnosing supercell environments: A machine learning approach. Wea. Forecasting, 37, 771–785, https://doi.org/10.1175/WAF-D-21-0098.1.
Smith, B. T., R. L. Thompson, J. S. Grams, C. Broyles, and H. E. Brooks, 2012: Convective modes for significant severe thunderstorms in the contiguous United States. Part I: Storm classification and climatology. Wea. Forecasting, 27, 1114–1135, https://doi.org/10.1175/WAF-D-11-00115.1.
Sobash, R. A., and J. S. Kain, 2017: Seasonal variations in severe weather forecast skill in an experimental convection-allowing model. Wea. Forecasting, 32, 1885–1902, https://doi.org/10.1175/WAF-D-17-0043.1.
Sobash, R. A., J. S. Kain, D. R. Bright, A. R. Dean, M. C. Coniglio, and S. J. Weiss, 2011: Probabilistic forecast guidance for severe thunderstorms based on the identification of extreme phenomena in convection-allowing model forecasts. Wea. Forecasting, 26, 714–728, https://doi.org/10.1175/WAF-D-10-05046.1.
Sobash, R. A., G. S. Romine, C. S. Schwartz, D. J. Gagne II, and M. L. Weisman, 2016a: Explicit forecasts of low-level rotation from convection-allowing models for next-day tornado prediction. Wea. Forecasting, 31, 1591–1614, https://doi.org/10.1175/WAF-D-16-0073.1.
Sobash, R. A., C. S. Schwartz, G. S. Romine, K. R. Fossell, and M. L. Weisman, 2016b: Severe weather prediction using storm surrogates from an ensemble forecasting system. Wea. Forecasting, 31, 255–271, https://doi.org/10.1175/WAF-D-15-0138.1.
Sobash, R. A., G. S. Romine, and C. S. Schwartz, 2020: A comparison of neural-network and surrogate-severe probabilistic convective hazard guidance derived from a convection-allowing model. Wea. Forecasting, 35, 1981–2000, https://doi.org/10.1175/WAF-D-20-0036.1.
SPC, 2020: About the SPC. NOAA/National Weather Service, accessed 28 April 2022, https://www.spc.noaa.gov/misc/aboutus.html.
SPC, 2022a: Dec 15, 2021 0600 UTC day 1 convective outlook. NOAA/National Weather Service, accessed 6 May 2022, https://www.spc.noaa.gov/products/outlook/archive/2021/day1otlk_20211215_1200.html.
SPC, 2022b: Storm Prediction Center WCM page. NOAA/National Weather Service, accessed 12 January 2022, https://www.spc.noaa.gov/wcm/.
Wade, A. R., and I. L. Jirak, 2022: Exploring hourly updating probabilistic guidance in the 2021 Spring Forecasting Experiment with objective and subjective verification. Wea. Forecasting, 37, 699–708, https://doi.org/10.1175/WAF-D-21-0193.1.
Whan, K., and M. Schmeits, 2018: Comparing area-probability forecasts of (extreme) local precipitation using parametric and machine learning statistical post-processing methods. Mon. Wea. Rev., 146, 3651–3673, https://doi.org/10.1175/MWR-D-17-0290.1.
Zhang, F., C. Snyder, and R. Rotunno, 2003: Effects of moist convection on mesoscale predictability. J. Atmos. Sci., 60, 1173–1185, https://doi.org/10.1175/1520-0469(2003)060<1173:EOMCOM>2.0.CO;2.
Zhang, F., N. Bei, R. Rotunno, C. Snyder, and C. C. Epifanio, 2007: Mesoscale predictability of moist baroclinic waves: Convection-permitting experiments and multistage error growth dynamics. J. Atmos. Sci., 64, 3579–3594, https://doi.org/10.1175/JAS4028.1.
Zhou, X., and Coauthors, 2022: The development of the NCEP Global Ensemble Forecast System version 12. Wea. Forecasting, 37, 1069–1084, https://doi.org/10.1175/WAF-D-21-0112.1.