1. Introduction
Tennekes et al. (1986) famously stated “no forecast is complete without a forecast of forecast skill.” This philosophy has long inspired the development of objective ensemble-spread-based methods to predict the skill of ensemble-mean or high-resolution deterministic forecasts (e.g., Leith 1974; Hoffman and Kalnay 1983; Kalnay and Dalcher 1987; Murphy 1988; Buizza 1997). Forecasters today routinely use subjective or objective measures of ensemble spread to calibrate their confidence in NWP forecasts and, in turn, the operational forecasts they issue (Novak et al. 2008; Evans et al. 2014; Demuth et al. 2020). However, using ensemble spread to predict forecast skill has important limitations. Spread–error correlations tend to be high only when the ensemble spread is extremely high or low (e.g., Houtekamer 1993; Whitaker and Loughe 1998; Grimit and Mass 2002). Ensembles tend to be underdispersive (overconfident) due largely to model errors and computationally limited ensemble sizes (Houtekamer et al. 1996; Buizza 1997; Toth and Kalnay 1997; Buizza et al. 1999; Stensrud et al. 2000). Underdispersion in convection-allowing model (CAM) ensembles (e.g., Clark et al. 2010; Roberts et al. 2020) in particular often arises from slow ensemble spinup of new or rapidly evolving storms, which can substantially bias the ensemble analysis and ensuing forecast. Finally, the predictability of forecast skill from ensemble spread is intrinsically limited since an ensemble produces a probability distribution, not a deterministic prediction, of forecast error about the ensemble mean (Murphy 1988; Barker 1991; Houtekamer 1993; Whitaker and Loughe 1998; Grimit and Mass 2007). This limitation is most pronounced for large-spread forecasts, which will sometimes have small ensemble-mean forecast errors no matter how well calibrated the ensemble.
Machine learning (ML) methods can mitigate these ensemble limitations by exploiting statistical relationships between forecast skill and multiple predictors, which can include ensemble spread metrics. Previous ML approaches have used multiple linear regression to predict the skill of large-scale, medium-to-long-range deterministic forecasts (Leslie et al. 1989; Molteni and Palmer 1991; Wobus and Kalnay 1995). While “forecasting forecast skill” has a long history in coarser models, the authors are unaware of any published evaluations of skill prediction methods for convection-allowing forecasts. This gap in CAM guidance is notable given how much forecasters value forecast uncertainty information. CAM ensemble forecast uncertainty metrics tend to focus on storm-tailored variables (e.g., maximum composite reflectivity) and use neighborhood approaches to accommodate tolerable storm phase errors (e.g., Schwartz and Sobash 2017; Roberts et al. 2019, 2020). Forecasters routinely assess CAM ensemble spread (whether from a formal ensemble, time-lagged ensemble, or an ad hoc ensemble of deterministic CAMs) to estimate and communicate uncertainty information to users, while recognizing that CAM ensembles can poorly represent the true forecast uncertainty (Evans et al. 2014; Demuth et al. 2020). It therefore seems likely that forecasters would benefit from additional forecast skill guidance that mitigates the limitations of traditional spread-based metrics. The present study addresses this need by developing and testing an ML-based forecast skill prediction technique for the NSSL Warn-on-Forecast System (WoFS), an experimental CAM ensemble designed primarily to provide thunderstorm hazard guidance at 0–6-h lead times (Stensrud et al. 2009, 2013; Wheatley et al. 2015; Jones et al. 2016; Heinselman et al. 2024).
Given the focus of WoFS and other convection-allowing ensembles on thunderstorms, we develop storm-oriented predictors for our ML models and train the models to predict composite reflectivity forecast skill within localized patches rather than over the entire WoFS domain. We frame the ML task as a multiclassification problem and predict where forecast skill will be POOR (worst 20% of forecasts for a given lead time), FAIR (middle 60% of forecasts for a given lead time), or GOOD (best 20% of forecasts for a given lead time). These skill classifications are determined using a percentile-based method involving the extended fraction skill score. We experiment with several ML algorithms, ultimately adopting the ordinal logistic regression method due in part to its low computational cost, a critical consideration for operational applications. We systematically reduce the initial numbers of predictors that are input to the models to further decrease their computational cost and increase their explainability, which should enhance their usability by forecasters. While the immediate goal of this work is to investigate the efficacy of ML-based prediction of forecast skill for the WoFS, we also hope to encourage the development of analogous methods for the entire range of deterministic and ensemble NWP models.
2. Data
a. WoFS and MRMS
The WoFS is a regional, 3-km ensemble that presently uses the Advanced Research version of the Weather Research and Forecasting (WRF-ARW) Model (Skamarock et al. 2008). The WoFS domain is 900 km × 900 km and is relocated each day that it is run over a region where high-impact (typically convective) weather is anticipated. The WoFS uses 36 members to assimilate radar, satellite, and Oklahoma Mesonet data (when available) every 15 min and conventional observations every hour. Forecasts are initialized every 30 min and run out to 3 or 6 h (at the bottom or top of the hour, respectively) using 18 of the analysis members. Initial and lateral boundary conditions are provided by the High-Resolution Rapid Refresh Data Assimilation System (HRRRDAS; Dowell et al. 2022). The WoFS is run primarily during the warm season, including during the Hazardous Weather Testbed Spring Forecasting Experiment (SFE; Gallo et al. 2017; Clark et al. 2023). While currently experimental, WoFS output often informs forecast products issued by the NOAA NWS’s Storm Prediction Center and Weather Prediction Center and an increasing number of NWS Forecast Offices (Wilson et al. 2023). The WoFS is scheduled to be transitioned to operations within the Unified Forecast System between 2027 and 2030. Additional WoFS details can be found in Miller et al. (2022), Heinselman et al. (2024), and references therein.
This study uses WoFS analyses and 1-, 2-, and 3-h forecasts initialized every 30 min from 2000 to 0300 UTC for 106 days during the 2017–21 SFEs. WoFS composite reflectivity (“REFLCOMP”) forecast skill is measured using one of the two REFLCOMP products in the Multi-Radar Multi-Sensor (MRMS) System suite. The MRMS REFLCOMP used herein is the column maximum of the exponential inverse-distance-weighted average of the reflectivity from each contributing radar (Smith et al. 2016). The MRMS REFLCOMP was generated at NSSL using the MRMS version 12 codebase and then interpolated from its native 1-km grid to the 3-km WoFS grid. The MRMS system generates numerous other products related primarily to severe weather and precipitation using model data and observations from a variety of platforms.
b. Calibration of WoFS REFLCOMP
The WoFS has a well-documented positive bias relative to MRMS at moderate and high REFLCOMP (e.g., Skinner et al. 2018; Fig. 1) attributed largely to overprediction of graupel in the NSSL 2-moment microphysics scheme (Mansell 2010). In this study, we remove this systematic bias so that it cannot dominate the forecast errors. We consider this a best practice for postprocessing output from models known to have large biases since it is a straightforward way to improve analyses and forecasts. The WoFS REFLCOMP bias correction proceeds as follows. First, for each forecast lead time used in this study (0, 1, 2, and 3 h), we compute the cumulative distributions of the WoFS and MRMS REFLCOMP over our entire dataset (106 cases) using 1000 bins. Since systematic differences between the WoFS member reflectivity climatologies are minimal (Potvin et al. 2020), a single WoFS cumulative distribution function (CDF) is generated by using one randomly selected member per ensemble forecast. Then, for each analysis/forecast, we convert the WoFS REFLCOMP to the WoFS cumulative distribution bin space. Finally, we compute the corresponding MRMS REFLCOMP using the MRMS cumulative distribution. Thus, for example, the 90th percentile WoFS REFLCOMP is mapped to the 90th percentile MRMS REFLCOMP and so forth. This remapped REFLCOMP (Fig. 1) is used in place of the original WoFS REFLCOMP for all subsequent analyses unless noted otherwise. We use linear interpolation for mapping between REFLCOMP and REFLCOMP CDF space.
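The remapping described above can be illustrated with a minimal sketch. It assumes synthetic samples in place of the WoFS and MRMS REFLCOMP climatologies, and the function name `quantile_map` is hypothetical. It follows the two-step procedure in the text (reflectivity to cumulative probability, then cumulative probability to MRMS reflectivity) with linear interpolation, but it is not the code used in the study. Real reflectivity fields contain many tied values (e.g., no-echo points), which this sketch does not treat specially.

```python
import numpy as np

def quantile_map(wofs_values, wofs_sample, mrms_sample, n_bins=1000):
    """Map WoFS values to the MRMS values with the same cumulative probability."""
    levels = np.linspace(0.0, 1.0, n_bins + 1)   # CDF levels bounding the bins
    wofs_q = np.quantile(wofs_sample, levels)     # WoFS value at each CDF level
    mrms_q = np.quantile(mrms_sample, levels)     # MRMS value at each CDF level
    # Step 1: reflectivity -> WoFS cumulative probability (linear interpolation).
    prob = np.interp(wofs_values, wofs_q, levels)
    # Step 2: cumulative probability -> MRMS reflectivity.
    return np.interp(prob, levels, mrms_q)

# Synthetic stand-ins (assumption): the "WoFS" sample is biased high.
rng = np.random.default_rng(0)
mrms_sample = rng.gamma(2.0, 8.0, size=20000)
wofs_sample = 1.15 * rng.gamma(2.0, 8.0, size=20000) + 2.0
remapped = quantile_map(wofs_sample, wofs_sample, mrms_sample)
```

After remapping, the percentiles of the synthetic WoFS sample should closely match those of the MRMS sample, which is the property the calibration relies on.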
c. Spread–error relationship for WoFS REFLCOMP
To illustrate the limitations of using ensemble spread alone to predict forecast skill, we present WoFS REFLCOMP spread–error scatterplots computed over all 2-h forecasts in our dataset (Fig. 2). Following Fortin et al. (2014), we compute the (scalar) spread of each 18-member forecast as the root mean of the (2D) ensemble variance in REFLCOMP and the (scalar) error as the root mean of the (2D) squared difference between the ensemble mean and MRMS REFLCOMP. Similarly to Wang and Bishop (2003), we then bin the data into spread-based deciles to clarify how the expected ensemble-mean error varies with the ensemble spread. We see that the WoFS is characterized by a moderately high spread–error correlation (Fig. 2a) that approaches 1.0 upon binning the ensemble statistics (Fig. 2b) and that while the WoFS is substantially underdispersive, the rate of change in expected ensemble-mean error with ensemble spread is nearly ideal (i.e., the slope approaches one).1 Thus, the WoFS ensemble spread provides valuable information about the expected ensemble-mean forecast error. However, on a case-by-case basis (Fig. 2a), a given value of REFLCOMP spread can correspond to a wide range of REFLCOMP errors, which is unavoidable (as explained in section 1), and will nearly always underestimate the error (sometimes modestly; sometimes severely). These problems did not improve by applying maximum filters with various scales to mitigate modest phase errors nor by discarding grid points where the MRMS and ensemble-maximum REFLCOMP did not meet a threshold to focus upon deep convection (not shown). The limited ability to predict WoFS forecast errors from ensemble spread alone motivates incorporating additional predictors via ML models to improve predictions of forecast skill.
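The scalar spread and error definitions and the decile binning can be sketched as follows. This is an illustrative sketch on synthetic data, not the study's code. The unbiased variance estimator (`ddof=1`), the simple averaging within each decile, and all function names are assumptions.

```python
import numpy as np

def spread_and_error(ens, obs):
    """ens: (n_members, ny, nx) forecast; obs: (ny, nx) verifying field."""
    # Spread: root of the domain-mean ensemble variance.
    spread = np.sqrt(np.mean(np.var(ens, axis=0, ddof=1)))
    # Error: root of the domain-mean squared ensemble-mean error.
    error = np.sqrt(np.mean((ens.mean(axis=0) - obs) ** 2))
    return spread, error

def bin_by_spread_deciles(spreads, errors):
    """Average spread and error within each spread-based decile."""
    spreads = np.asarray(spreads, dtype=float)
    errors = np.asarray(errors, dtype=float)
    edges = np.quantile(spreads, np.linspace(0.0, 1.0, 11))
    idx = np.clip(np.searchsorted(edges, spreads, side="right") - 1, 0, 9)
    mean_spread = np.array([spreads[idx == k].mean() for k in range(10)])
    mean_error = np.array([errors[idx == k].mean() for k in range(10)])
    return mean_spread, mean_error

# Synthetic, deliberately underdispersive ensemble (assumption).
rng = np.random.default_rng(1)
pairs = []
for _ in range(200):
    sigma = rng.uniform(2.0, 10.0)
    ens = rng.normal(0.0, sigma, size=(18, 20, 20))
    obs = rng.normal(0.0, 1.5 * sigma, size=(20, 20))
    pairs.append(spread_and_error(ens, obs))
spreads, errors = map(np.array, zip(*pairs))
binned_spread, binned_error = bin_by_spread_deciles(spreads, errors)
```

In this synthetic setup the binned error exceeds the binned spread in every decile, mimicking an underdispersive ensemble whose spread is still informative about the expected error.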
3. Methods
a. Prediction task
The unique value of the WoFS lies largely in its ability to skillfully predict convection at finer scales than other CAM systems. Much of this additional skill owes to the rapid assimilation of radar and satellite observations and is associated primarily with storms that exist at forecast initialization time, especially those that have been well assimilated into the ensemble (Guerra et al. 2022). Since WoFS forecasts are intended to inform operational prediction of individual storms and convective regions, we configure our ML models to predict forecast skill not over the entire WoFS domain nor over fixed subdomains, but within square domains positioned according to the locations and WoFS-estimated motions of observed storms at forecast initialization (i.e., analysis) time. Hereafter, we refer to these domains as “forecast patches.” As we discuss later in this subsection, these forecast patches do not always contain the storm observed at initialization and can contain other storms. Thus, it is more accurate to say that the ML models are trained to predict forecast skill within the forecast patches, rather than the skill with which the initial storms per se are forecast.
We identify storms present at analysis time from the MRMS REFLCOMP field using the storm segmentation technique of Potvin et al. (2022). For each identified storm, we define a square domain centered on the storm object centroid; this is the “initial patch” for that storm. The diameter of the initial patch is determined by the lead time at which the forecast skill is to be predicted. We then average the WoFS Bunkers right-mover storm motion (Bunkers et al. 2000) over the initial patch and assume the storm translates at that velocity through the forecast period.2 We thus estimate the future location of the storm at the given forecast valid time. We define the forecast patch to be a square centered on the projected storm location with diameter set to 150, 180, or 210 km at lead times of 1, 2, and 3 h, respectively. These diameters were chosen empirically to be large enough to account for typical errors in the estimated storm motion and potential upscale storm growth. However, storms that deviate strongly from the mean flow (e.g., left-moving supercells) will often end up outside the forecast patch. The diameters of initial patches are quasi-arbitrarily set to the diameters of their corresponding forecast patches. We use larger initial patches for longer-lead forecasts in order to increase the probability that the initial patch encompasses the source region of the convection in the forecast patch. Initial-forecast patch pairs are discarded if one or the other extends outside the WoFS domain.
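The patch placement logic can be sketched in a few lines. The coordinate convention (kilometers from the southwest corner of the domain), the function names, and the use of a single precomputed mean storm motion are assumptions made for illustration. The patch diameters and the 900-km domain size are taken from the text.

```python
PATCH_DIAMETER_KM = {1: 150.0, 2: 180.0, 3: 210.0}  # by lead time (h)
DOMAIN_SIZE_KM = 900.0

def project_centroid(centroid_km, motion_ms, lead_time_h):
    """Advect the storm centroid at a constant velocity (m/s) for lead_time_h hours."""
    seconds = lead_time_h * 3600.0
    return (centroid_km[0] + motion_ms[0] * seconds / 1000.0,
            centroid_km[1] + motion_ms[1] * seconds / 1000.0)

def patch_bounds(center_km, lead_time_h):
    """Return (x0, x1, y0, y1) of the square patch centered on center_km."""
    half = PATCH_DIAMETER_KM[lead_time_h] / 2.0
    x, y = center_km
    return (x - half, x + half, y - half, y + half)

def inside_domain(bounds):
    x0, x1, y0, y1 = bounds
    return x0 >= 0.0 and y0 >= 0.0 and x1 <= DOMAIN_SIZE_KM and y1 <= DOMAIN_SIZE_KM

def patch_pair(centroid_km, motion_ms, lead_time_h):
    """Initial and forecast patches, or None if either leaves the domain."""
    initial = patch_bounds(centroid_km, lead_time_h)
    forecast = patch_bounds(
        project_centroid(centroid_km, motion_ms, lead_time_h), lead_time_h)
    if not (inside_domain(initial) and inside_domain(forecast)):
        return None
    return initial, forecast

pair = patch_pair((300.0, 400.0), (10.0, 5.0), 2)
```

For a storm at (300, 400) km moving at (10, 5) m/s, the 2-h forecast patch is centered at (372, 436) km. A storm too close to the domain edge returns `None`, which corresponds to discarding the patch pair.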
Figure 3 depicts four pairs of initial and forecast patches for a single WoFS forecast. In some cases, particularly at longer lead times, the convection within the forecast patch will not have evolved from the initially observed convection. For example, other observed or model storms may have moved into, or developed within, the forecast patch; the initial storm(s) may not be captured by the forecast patch due to a poor storm motion estimate; or the initial storm(s) may already have decayed. Thus, the ML models are not trained to predict the skill with which the initial convection is forecast, but rather the forecast skill for convection within the forecast patch.
The ML model classifies the forecast skill within each forecast patch as POOR, FAIR, or GOOD (section 3b) based on predictors computed within the initial and forecast patches (section 3c). To further clarify the prediction task, and to demonstrate how we intend to visualize the predictions on the WoFS web viewer (https://cbwofs.nssl.noaa.gov/forecast), we show final model predictions and verification for a representative case in Fig. 4. This is a typical case in that the majority of the predictions are correct, no GOOD predictions verify as POOR or vice versa, but a nontrivial fraction of the predictions are off by one class (section 4). While the ML predictions are valid within square patches, we display them as circles (sized such that they are circumscribed by the forecast patch boundary) since we found the circular regions are more easily distinguished from one another when many overlapping forecast patches are present. To further prevent overcrowding of the ML prediction visualization, if the overlapping area of two forecast patches exceeds 25% of the area of a single forecast patch, one of the patches (and the associated ML prediction) is randomly selected to not be plotted. This is the case for one of the northern forecast patches in the 3-h forecast. The southern forecast patch present in the 1- and 2-h forecasts is absent from the 3-h forecast since it would intersect the WoFS domain boundary by this time.
We settled on our strategy for defining the forecast patches after carefully considering alternative approaches. For example, we considered positioning the forecast patches over WoFS storms valid at forecast time, rather than downstream of storms observed at analysis time. This alternative option would ensure all WoFS storms at a given forecast time are contained within forecast patches, but would also sometimes exclude regions downstream of existing storms that are spuriously absent from the ensemble (due, e.g., to slow ensemble spinup or premature dissipation of the storm in the forecast). For example, the westernmost forecast patch in Fig. 3b is void of storms and therefore would not exist under this alternative strategy, preventing the prediction of forecast skill for the storms present in Colorado (Fig. 3a). Conversely, our strategy promotes coverage of all storms present at analysis time but can exclude regions where the WoFS produces spurious storms. Thus, returning to our example, the spurious WoFS storms in northern Kansas and northeastern Colorado are not included in the forecast patches (Fig. 3b), which precludes predictions of low forecast skill that could signal forecasters to disregard the affected regions. We chose our current approach since it seems better to anchor the forecast patches on real storms present at analysis time than on model predictions of storms. In future work, we may develop the alternative framework (based on WoFS storm forecasts) and then explore the utility of combining both approaches to ensure coverage of all regions of interest.
We also considered predicting the forecast skill of individual storms rather than convective regions. Training such a model would require associating each MRMS storm object at forecast initialization time with the corresponding storm object at forecast valid time. This strategy would greatly shrink the training datasets at the 2- and 3-h lead times given the typically short lifetimes of thunderstorms. More fundamentally, since the trained ML model would assume the observed storm still exists at the forecast time, the predictions would be biased in cases where the storm has already dissipated. While beyond the scope of this study, it may be valuable to develop and evaluate a complementary framework that focuses on lead times < 1 h so that the ML predictions remain focused on individual storms rather than regions of storms.
b. Labels
As stated previously, we decided to measure WoFS forecast skill in terms of REFLCOMP. Real-time REFLCOMP observations are reliably available and reasonably accurate, and accurate forecasts of storm occurrence and intensity are necessary for accurate forecasts of storm impacts. In selecting what verification metric to use to quantify WoFS REFLCOMP forecast skill, we adopted two requirements. First, the metric must tolerate modest phase errors, which can cause good forecasts to be assigned low scores by point-to-point verification methods, potentially causing better forecasts (as judged by experts) to be ranked lower than poorer forecasts. Second, the metric must account for the full ensemble forecast distribution, rather than inputting, e.g., a simple or probability-matched mean of the ensemble REFLCOMP forecast, which can filter out information that is valuable to forecasters. The extended fraction skill score (eFSS; Duc et al. 2013), an extension of the FSS (Roberts and Lean 2008), meets both requirements.
The FSS compares the fractional coverage of observed and forecast events within 2D spatial neighborhoods of prescribed width. Typically, “events” are instances where a field exceeds a prescribed threshold (e.g., hourly rainfall > 1 cm). With the eFSS, neighborhoods are defined over two spatial dimensions plus a third dimension, which can be height, time, or as in our case, the ensemble member space. Both FSS and eFSS range from 0, which occurs when events are observed but not forecast or are forecast but not observed, to 1, which indicates that the forecast and observed coverage of events are identical within all neighborhoods of the prescribed width.
While the eFSS (or FSS) is valid for a single scale and threshold, there is no one scale and threshold for which the eFSS encapsulates the skill of a REFLCOMP ensemble forecast. For example, were a single small eFSS neighborhood used, a forecast that correctly predicted the general placement of a region of convection but had little skill in the placement of individual storms could receive a similar eFSS as a forecast that misplaced even the general location of convection. Were a single large eFSS neighborhood used instead, a forecast that correctly predicted even the locations of individual storms could receive a similar eFSS as a forecast that correctly predicted the general region of convection but displaced individual storms. Similar arguments can be made against using a single REFLCOMP threshold. We therefore compute the eFSS for all combinations of three prescribed scales and four prescribed thresholds and average the scores together to obtain a single verification score eFSSavg. Based on trial and error, we selected eFSS thresholds of 30, 35, 40, and 45 dBZ, and eFSS neighborhoods, expressed as half-diameters in (3-km) grid lengths, that increased with forecast lead time: 6, 9, and 12 for 1-h forecasts; 9, 14, and 18 for 2-h forecasts; and 12, 18, and 24 for 3-h forecasts.
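A simplified sketch of the eFSSavg computation follows. It treats the ensemble dimension by averaging the member neighborhood fractions, which is one reading of the eFSS rather than a reproduction of Duc et al. (2013). The square windows of width 2h + 1, the zero padding at the edges, and all function names are assumptions. The thresholds and half-widths are those listed above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

THRESHOLDS_DBZ = (30.0, 35.0, 40.0, 45.0)
HALF_WIDTHS = {1: (6, 9, 12), 2: (9, 14, 18), 3: (12, 18, 24)}  # by lead time (h)

def neighborhood_fraction(binary, half_width):
    """Fraction of event points in a square window around each grid point."""
    width = 2 * half_width + 1
    padded = np.pad(binary.astype(float), half_width)  # zero padding at edges
    return sliding_window_view(padded, (width, width)).mean(axis=(2, 3))

def efss(ens, obs, threshold, half_width):
    """ens: (n_members, ny, nx); obs: (ny, nx)."""
    obs_frac = neighborhood_fraction(obs > threshold, half_width)
    # Pool the ensemble dimension by averaging the member fractions.
    fcst_frac = np.mean(
        [neighborhood_fraction(m > threshold, half_width) for m in ens], axis=0)
    mse = np.mean((fcst_frac - obs_frac) ** 2)
    ref = np.mean(fcst_frac ** 2) + np.mean(obs_frac ** 2)
    return np.nan if ref == 0 else 1.0 - mse / ref

def efss_avg(ens, obs, lead_time_h):
    """Average eFSS over all threshold and scale combinations."""
    scores = [efss(ens, obs, t, h)
              for t in THRESHOLDS_DBZ for h in HALF_WIDTHS[lead_time_h]]
    return float(np.nanmean(scores))

# Idealized example: one 50-dBZ block on a small grid, six members.
obs = np.zeros((30, 30))
obs[8:18, 8:18] = 50.0
perfect = np.repeat(obs[None], 6, axis=0)
shifted = np.roll(perfect, 4, axis=2)   # 4-grid-point phase error
empty = np.zeros_like(perfect)
scores = {name: efss_avg(ens, obs, 1)
          for name, ens in [("perfect", perfect), ("shifted", shifted), ("empty", empty)]}
```

The perfect forecast scores 1, the forecast with no storms scores 0, and the phase-shifted forecast falls in between, illustrating the tolerance for modest displacement that motivated the choice of metric.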
While assessing the correspondence between eFSSavg and subjective assessment of forecast quality for numerous cases, we identified two common scenarios where unduly low eFSS values occur. The first is when both the MRMS REFLCOMP and WoFS member REFLCOMP exceed at least one of the prescribed eFSS thresholds over a tiny portion of the forecast patch, often causing modest MRMS versus WoFS amplitude differences to drastically reduce the eFSS. We therefore calculate eFSS for a given threshold only if the MRMS coverage of threshold exceedance (hereafter, simply “coverage”) exceeds 30 grid points (a quasi-arbitrary threshold) and the WoFS coverage accumulated over all 18 members exceeds 30 × 18 = 540 points. Otherwise, the eFSS for that threshold is excluded from the eFSSavg calculation. If the MRMS and WoFS coverages are too low even for the lowest threshold (30 dBZ), the case is discarded. The second scenario where unduly low eFSS arises is when MRMS coverage exists primarily near the forecast patch boundary, in which case modest WoFS phase errors can cause dramatic differences in WoFS versus MRMS coverages within the forecast patch. To address this problem, we include a buffer zone that extends 6 × (1 + LT) grid cells outward from the boundary, where LT is the forecast lead time in hours. This buffer zone allows the eFSS computation patches to extend outside the forecast patch and thereby capture MRMS and WoFS storms that would otherwise be excluded. Cases in which the forecast patch or buffer zone extends outside the WoFS domain are excluded from the dataset to avoid unrepresentative eFSS values.
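The coverage screening and buffer width can be written compactly. The sketch below is illustrative only. The function names are hypothetical, and it assumes the coverage counts are taken over whatever region (forecast patch plus buffer) is passed in.

```python
import numpy as np

MIN_POINTS = 30  # minimum coverage per field, from the text

def buffer_width(lead_time_h):
    """Buffer width in grid cells: 6 x (1 + LT)."""
    return 6 * (1 + lead_time_h)

def usable_thresholds(ens, obs, thresholds=(30.0, 35.0, 40.0, 45.0)):
    """Thresholds with enough MRMS and accumulated WoFS coverage.

    ens: (n_members, ny, nx); obs: (ny, nx). An empty list means the
    case would be discarded.
    """
    n_members = ens.shape[0]
    keep = []
    for t in thresholds:
        obs_cov = np.count_nonzero(obs > t)
        ens_cov = np.count_nonzero(ens > t)   # accumulated over all members
        if obs_cov > MIN_POINTS and ens_cov > MIN_POINTS * n_members:
            keep.append(t)
    return keep

obs = np.zeros((40, 40))
obs[5:15, 5:15] = 42.0                       # 100 points above 30, 35, and 40 dBZ
ens = np.repeat(obs[None], 18, axis=0)
kept = usable_thresholds(ens, obs)

small = np.zeros((40, 40))
small[5:10, 5:10] = 42.0                     # only 25 points of coverage
kept_small = usable_thresholds(np.repeat(small[None], 18, axis=0), small)
```

With LT = 1, 2, and 3 h, the buffer is 12, 18, and 24 grid cells wide, matching the largest eFSS neighborhood half-diameter at each lead time.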
The target variable for the ML prediction task is not eFSSavg itself; rather, we frame the task as a classification (not regression) problem. To do this, we first compute the distribution of eFSSavg over the dataset for each lead time. Kernel density estimates (KDEs) of eFSSavg for 2-h lead times are shown in Fig. 5; the KDEs for the 1- and 3-h lead times are similar (not shown). Each eFSSavg is then converted to the index of the pentile (i.e., 20% interval) in which it falls: 0%–20%, 20%–40%, 40%–60%, 60%–80%, or 80%–100%. In early work, ML models trained on this 5-class prediction scheme discriminated very poorly between the middle three classes. We therefore trained new ML models on a 3-class scheme obtained by combining those middle three classes. Thus, the percentile bins of the 3-class scheme are 0%–20%, 20%–80%, and 80%–100%, which we quasi-arbitrarily label POOR, FAIR, and GOOD, respectively (Fig. 5). For example, an ML prediction could be P(POOR) = 70%, P(FAIR) = 20%, and P(GOOD) = 10%, which would mean the model is confident that the eFSS will be in the lowest 20% of forecasts at that lead time. The new models produce much better predictions of the 20%–80% (FAIR) class than the original models did of the individual 20%–40%, 40%–60%, and 60%–80% classes, but mildly degraded predictions of the 0%–20% (POOR) and 80%–100% (GOOD) classes, which are arguably of greatest interest to forecasters. To obtain the advantages of both approaches, we train the final ML models on the 5-class prediction scheme, but then convert the predictions to the 3-class scheme. Based on visual comparison of WoFS and MRMS composite reflectivity fields for hundreds of analyses and forecasts at lead times of 1 and 3 h, we found that the final class labels correspond qualitatively well with our subjective impressions of WoFS forecast accuracy.
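The labeling and the final 5-class to 3-class conversion can be sketched as follows. The function names are hypothetical, and the scores here are synthetic. In the study the percentiles are computed separately for each lead time.

```python
import numpy as np

CLASS_NAMES = ("POOR", "FAIR", "GOOD")

def quintile_index(scores):
    """Index (0-4) of the 20% interval in which each score falls."""
    scores = np.asarray(scores, dtype=float)
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(edges, scores, side="right")

def collapse_to_three(probs5):
    """Sum the middle three class probabilities into a single FAIR class."""
    probs5 = np.asarray(probs5, dtype=float)
    poor = probs5[..., 0]
    fair = probs5[..., 1:4].sum(axis=-1)
    good = probs5[..., 4]
    return np.stack([poor, fair, good], axis=-1)

scores = np.arange(100) / 100.0              # synthetic eFSSavg values
labels5 = quintile_index(scores)
probs3 = collapse_to_three([0.5, 0.1, 0.1, 0.2, 0.1])
predicted = CLASS_NAMES[int(np.argmax(probs3))]
```

Here a 5-class prediction of (0.5, 0.1, 0.1, 0.2, 0.1) collapses to P(POOR) = 0.5, P(FAIR) = 0.4, and P(GOOD) = 0.1.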
c. Predictors
We began with a very large set of predictors (N = 323) to increase the odds that any important predictor is included in the (much smaller) final models. Both grid- and object-based methods are used to generate the predictors. The predictors are designed to measure a wide range of characteristics of the MRMS storms, the WoFS storms, the accuracy of the WoFS analyses of storms, and the WoFS initial and forecast environments. The predictor generation process is summarized in Fig. 6.
1) Object-based predictors
Storm objects were generated from REFLCOMP as in Potvin et al. (2022), prior to calibrating the WoFS REFLCOMP (to avoid the potential need for parameter tuning; this is the only instance where the uncalibrated WoFS REFLCOMP is used in our experiments). The storm mode of each object was determined using both the 7-class and 3-class schemes in Potvin et al. (2022). Predictors were developed using the properties of MRMS storm objects within the initial patch at analysis time and of WoFS storm objects both within the initial patch at analysis time and within the forecast patch at forecast valid time (Table 1; Fig. 6a; N = 50). Object-based predictor names are prepended with “storm,” “init,” or “fcst” if they are valid for the patch-centered MRMS storm, all MRMS storms within the initial patch (computed using the mean or mode), or all WoFS storms within the forecast patch, respectively.
Table 1. Predictors generated from WoFS and MRMS storm objects (50 total).
2) Grid-based predictors
Two-dimensional fields derived from the raw WoFS model output in real time are used to generate ensemble-spatial statistics (Table 2; Fig. 6b; N = 236). These predictors are computed for both the initial patch (at analysis time; prefixed init) and the forecast patch (at forecast time; prefixed fcst). The spatial statistics vary with the field and can include the maximum (“max”), minimum (“min”), median (“med”), 90th percentile (“90pc”), and 10th percentile (“10pc”). The ensemble median (“med”) and standard deviation (“std”) of each spatial statistic are computed. The ensemble statistic suffix precedes the spatial statistic suffix; thus, init_mlcape_med_90pc is the ensemble median of the 90th percentile of mixed-layer convective available potential energy (CAPE) for the initial patch. The spatial statistics of environmental fields (labeled “Environmental” in Table 2) are computed only where WoFS REFLCOMP < 20 dBZ (a quasi-arbitrary threshold) to mitigate convective contamination of the environmental predictors.
Table 2. WoFS postprocessed fields and spatial statistics used to generate grid-based predictors (236 total). Negatively oriented variables are indicated by an asterisk.
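The construction and naming of the grid-based predictors can be sketched as below. The function name, the dictionary layout, and the population standard deviation are assumptions. The naming pattern (patch prefix, field, ensemble statistic, spatial statistic) and the 20-dBZ environmental mask follow the text.

```python
import numpy as np

SPATIAL_STATS = {
    "max": lambda a: np.nanmax(a, axis=1),
    "min": lambda a: np.nanmin(a, axis=1),
    "med": lambda a: np.nanmedian(a, axis=1),
    "90pc": lambda a: np.nanpercentile(a, 90, axis=1),
    "10pc": lambda a: np.nanpercentile(a, 10, axis=1),
}

def grid_predictors(field, name, patch, refl=None, refl_max=20.0):
    """field, refl: (n_members, ny, nx) arrays over one patch.

    If refl is given, only points with refl < refl_max contribute
    (the environmental mask described in the text).
    """
    n_members = field.shape[0]
    values = field.reshape(n_members, -1).astype(float)
    if refl is not None:
        values = np.where(refl.reshape(n_members, -1) < refl_max, values, np.nan)
    out = {}
    for stat, fn in SPATIAL_STATS.items():
        per_member = fn(values)              # one spatial statistic per member
        out[f"{patch}_{name}_med_{stat}"] = float(np.median(per_member))
        out[f"{patch}_{name}_std_{stat}"] = float(np.std(per_member))
    return out

# Tiny example: three members on a 2 x 2 patch.
field = np.arange(4.0).reshape(1, 2, 2) + 10.0 * np.arange(3).reshape(3, 1, 1)
preds = grid_predictors(field, "mlcape", "init")

refl = np.zeros((3, 2, 2))
refl[:, 1, 1] = 50.0                         # mask the largest value in each member
masked = grid_predictors(field, "mlcape", "init", refl=refl)
```

In this example `preds["init_mlcape_med_90pc"]` is the ensemble median of the members' 90th-percentile values, matching the naming convention described above.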
3) Additional predictors
Spread, error, spread–error ratio, and entropy (e.g., Ziehmann 2001; Grimit and Mass 2007) metrics for REFLCOMP were computed within each initial patch at analysis time, and spread and entropy metrics for REFLCOMP were computed at forecast valid time within each forecast patch (Table 3; Fig. 6c; N = 37). We calculated these metrics for the original REFLCOMP fields, for REFLCOMP fields maximum filtered with a neighborhood radius set to 3 + 3 × LT (denoted by “upscaled”), for REFLCOMP fields masked below 30 or 40 dBZ for initial or forecast patches, respectively (denoted by “thres”), and for REFLCOMP fields with both of these modifications.
Table 3. Predictors relating to forecast spread and error, REFLCOMP coverage and intensity, and diurnal time (37 total).
Since the target variable is computed from REFLCOMP eFSSavg within the forecast patch, we include as predictors the REFLCOMP eFSSavg and FSSavg valid in the initial and forecast patches at analysis time. The FSS calculations operate on the WoFS probability-matched REFLCOMP, and FSSavg uses the same scales and thresholds as eFSSavg. These predictors are motivated by our expectation that convection that is well analyzed in the WoFS, or moving into a region well analyzed by the WoFS, is more likely to be well predicted (as we show in section 4, this relationship breaks down at longer lead times). We also include the coverage of MRMS REFLCOMP > 35 dBZ and WoFS probability-matched mean REFLCOMP > 35 dBZ in the initial and forecast patches at analysis time, as well as the MRMS and WoFS REFLCOMP 99th percentiles and the ratio of the two valid over the initial patch at analysis time. Finally, the forecast initialization and valid times were included as predictors since WoFS forecast accuracy is known to vary with each (Guerra et al. 2022).
d. ML model development
1) Model tuning, training, and testing
We use nested cross validation to train, validate (tune), and test our models (Fig. 7). The dataset is split into five folds (subsets) containing similar numbers of forecasts but unique dates that are randomly distributed among the folds (thus irrespective of year). Confining samples from a given date to a single fold limits autocorrelations between folds and thereby enables more representative evaluation of how the models generalize to new data. Distributing cases from all the years among the folds reduces the impacts of years with anomalous thunderstorm activity and mitigates data drift due to the evolution of the WoFS over the 2017–21 period. For each of the five outer loop iterations, one data subset is held out for testing the final model, and the remaining four become cross-validation folds for hyperparameter tuning. Each of the four cross-validation folds is alternately held out for validating provisional models trained on the remaining three folds (inner loop). Using Bayesian hyperparameter optimization (Wu et al. 2019), 100 provisional models (hyperparameter combinations) are trained and validated. The hyperparameters that optimize the all-folds-mean cross-validation ranked probability skill score (RPSS) are used to train a model on the complete training dataset (comprising the four cross-validation folds). Finally, the trained model is used to predict the test fold labels. This procedure is performed for each of the five final model test folds. Thus, our approach yields five trained models and therefore five sets of predictions for evaluation. We collate the predictions to compute some verification metrics; other metrics are computed for each of the five test folds, and their means and standard deviations are computed.
The use of nested cross validation enables crude assessment of the uncertainty in how well the model generalizes to new data. For example, large variability in model verification across folds (potential sources of which are described in section 4a) would mean the real-world performance of the model is uncertain. Nested cross validation also allows every sample in the dataset to participate in training, validating, and testing, making maximum use of the dataset, and mitigating the influence of any unrepresentative cases (which, in a training framework with a single test fold, could unduly bias the model evaluation if assigned to that fold). Prior to deployment, a final model would be trained and tuned on the entire dataset.
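The fold structure can be sketched without reference to any particular ML library. This is an illustrative sketch. The function names are hypothetical, and the round-robin assignment balances the number of dates per fold rather than the number of forecasts, which only approximates the balancing described above.

```python
import numpy as np

def assign_date_folds(dates, n_folds=5, seed=0):
    """Assign every sample from a given date to the same randomly chosen fold."""
    dates = np.asarray(dates)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(dates))
    fold_of_date = {d: i % n_folds for i, d in enumerate(shuffled)}
    return np.array([fold_of_date[d] for d in dates])

def nested_splits(folds, n_folds=5):
    """Yield (test, val, train_mask, val_mask, test_mask) for each loop pair."""
    for test in range(n_folds):                      # outer loop: held-out test fold
        inner = [f for f in range(n_folds) if f != test]
        for val in inner:                            # inner loop: validation fold
            train = [f for f in inner if f != val]
            yield test, val, np.isin(folds, train), folds == val, folds == test

# Synthetic example: 20 dates with 7 samples each (dates are placeholders).
dates = np.repeat([f"2019-05-{d:02d}" for d in range(1, 21)], 7)
folds = assign_date_folds(dates)
splits = list(nested_splits(folds))
```

Each of the 5 × 4 = 20 splits trains on three folds, validates on one, and leaves the outer test fold untouched. In the study, the hyperparameters with the best mean validation RPSS are then used to refit a model on all four inner folds before predicting the test fold.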
The numbers of forecast samples in the training/testing datasets averaged 9774/2444, 7417/1854, and 5370/1342 across folds at forecast times of 1, 2, and 3 h, respectively. The reduction in samples at longer forecast times arises primarily from the increasing fraction of forecast patches that extend outside the WoFS domain and are discarded. The more frequent truncation of forecast patches at longer forecast times is due to their increasing size and the tendency for convection to be concentrated further east within the WoFS domain later in the diurnal cycle.
2) Learning algorithms
Because our target classes are ordered (from low to high forecast skill), unlike the unordered classes of many multiclassification problems, we adopt ordinal versions of the logistic regression and random forest algorithms (OLR and ORF, respectively). We implemented the ordinal classification models using the OrdinalClassifier class at https://github.com/leeprevost/OrdinalClassifier. This class implements the strategy of Frank and Hall (2001), namely, decomposing a k-class ordinal classification problem into a series of k − 1 binary classification problems that encode the original class ordering. The loss function for all models was the RPSS (e.g., Weigel et al. 2007). As in Herman and Schumacher (2016), we adopt the RPSS since it accounts for the predicted probability of each class (not just the highest-probability class) and more heavily penalizes particularly bad predictions (in our case, predictions of GOOD that verify as POOR and vice versa).
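The ordinal decomposition and the ranked probability scores can be illustrated with a short sketch. It shows how k − 1 binary outputs P(y > j) are combined into class probabilities and how the RPS penalizes distant misses. It does not reproduce the internals of the OrdinalClassifier class. The function names, the clipping and renormalization step, and the unnormalized form of the RPS are assumptions.

```python
import numpy as np

def ordinal_class_probs(p_greater):
    """Combine k-1 binary probabilities P(y > j) into k class probabilities."""
    p = np.atleast_2d(np.asarray(p_greater, dtype=float))   # shape (n, k-1)
    n = p.shape[0]
    upper = np.concatenate([np.ones((n, 1)), p], axis=1)    # P(y > j-1), with P(y > -1) = 1
    lower = np.concatenate([p, np.zeros((n, 1))], axis=1)   # P(y > j), with P(y > k-1) = 0
    probs = np.clip(upper - lower, 0.0, None)               # guard against inconsistent outputs
    return probs / probs.sum(axis=1, keepdims=True)

def rps(probs, y):
    """Mean ranked probability score (lower is better)."""
    probs = np.atleast_2d(probs)
    onehot = np.eye(probs.shape[1])[np.asarray(y)]
    diff = np.cumsum(probs, axis=1) - np.cumsum(onehot, axis=1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def rpss(probs, y, climatology):
    """Skill relative to always forecasting the climatological probabilities."""
    ref = np.tile(np.asarray(climatology, dtype=float), (len(y), 1))
    return 1.0 - rps(probs, y) / rps(ref, y)

probs = ordinal_class_probs([[0.9, 0.6, 0.3, 0.1]])
far_miss = rps(np.eye(5)[[4]], [0])    # highest class predicted, lowest observed
near_miss = rps(np.eye(5)[[1]], [0])   # off by one class
perfect = rpss(np.eye(5)[[0, 2, 4]], [0, 2, 4], np.full(5, 0.2))
```

The far miss receives an RPS of 4 versus 1 for the off-by-one miss, which is the property that motivates using a ranked score for ordered skill classes.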
3) Model reduction
In meteorological applications, large predictor sets typically can be substantially reduced with acceptable loss in model skill, owing to high collinearity (redundant information) between predictors. Using a method similar to one described in Kuhn and Johnson (2013, p. 47), we identify all pairs of continuous predictors with correlations exceeding a threshold—herein, 0.7—and remove the member of each pair that is less correlated with the target variable. We subsequently remove predictors whose correlation with the target variable does not exceed 0.1. These steps reduce the number of predictors from 323 to 44, 55, and 53 at forecast times of 1, 2, and 3 h, respectively, with no model skill loss.
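The two correlation-based filtering steps can be sketched as below. The sketch assumes that "correlation" means the absolute Pearson correlation and that pairs are processed from most to least correlated; neither detail is specified above, and the function name and synthetic predictors are hypothetical. The function also returns a record of each eliminated predictor and its retained counterpart.

```python
import numpy as np
import pandas as pd

def reduce_by_correlation(X, y, pair_thresh=0.7, target_thresh=0.1):
    """Return (kept predictor names, {dropped name: retained counterpart})."""
    target_corr = X.corrwith(y).abs()   # |r| between each predictor and target
    pair_corr = X.corr().abs()          # |r| between predictor pairs
    cols = list(X.columns)
    pairs = [(pair_corr.loc[a, b], a, b)
             for i, a in enumerate(cols) for b in cols[i + 1:]
             if pair_corr.loc[a, b] > pair_thresh]
    kept, dropped = list(cols), {}
    for _, a, b in sorted(pairs, reverse=True):
        if a in dropped or b in dropped:
            continue  # one member of this pair has already been removed
        # Remove the member that is less correlated with the target.
        loser, winner = (a, b) if target_corr[a] < target_corr[b] else (b, a)
        dropped[loser] = winner
        kept.remove(loser)
    # Second step: remove predictors only weakly correlated with the target.
    kept = [c for c in kept if target_corr[c] > target_thresh]
    return kept, dropped

# Synthetic demonstration (assumption: not the study's predictors).
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.5 * rng.normal(size=n),  # highly collinear with "a"
    "b": rng.normal(size=n),
    "noise": rng.normal(size=n),             # unrelated to the target
})
y = X["a"] + X["b"] + 0.5 * rng.normal(size=n)

kept, dropped = reduce_by_correlation(X, y)
print(kept, dropped)
```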
We then use a nonrecursive predictor elimination procedure to further reduce the predictor sets (footnote 3). Using the same nested cross-validation strategy and folds as for the original models, we train new models on the reduced predictor sets and use a multipass permutation method (e.g., Lakshmanan et al. 2015; McGovern et al. 2019; Flora et al. 2024a) to rank the predictors by importance for each of the five training folds. Permutation methods randomly shuffle a predictor’s values among the examples, thus destroying the predictor’s relationship to the target variable and effectively removing the predictor from the model. In terms of the “forward” and “backward” nomenclature introduced by Flora et al. (2022a), we use the forward multipass permutation (FMP) method. The FMP measures predictor importance by permuting all the predictors (resulting in a model that always predicts the mean target value; in our case, the event base rate), individually unpermuting (i.e., adding back to the model) each predictor and measuring the increase in model accuracy to determine the most important predictor, finding the second-most important predictor with the most important predictor unpermuted, and so forth (Flora et al. 2022a). We choose the FMP for its superior performance as a predictor selection method for the WoFS severe weather prediction models in Flora et al. (2022b). The FMP predictor rankings vary considerably across folds, so we compute the median rankings for each predictor. New models are trained on the most important N predictors, where N is decreased from 40 to 5 in increments of 5. The smallest set of N predictors that does not unduly degrade the model performance is adopted as the final predictor set. This model reduction procedure is performed for each of the three lead times and each of the two learning algorithms (OLR and ORF).
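The forward multipass logic can be illustrated with a short standalone sketch. It does not use the scikit-explain API; the function name is hypothetical, the data are synthetic, a binary logistic regression stands in for the ordinal models, and AUC is assumed as the scoring metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def forward_multipass_ranking(score_fn, X, y, rng):
    """Rank predictors (column indices) from most to least important."""
    n_pred = X.shape[1]
    # Start with every predictor permuted, i.e., effectively removed.
    X_work = np.column_stack([rng.permutation(X[:, j]) for j in range(n_pred)])
    remaining = list(range(n_pred))
    ranking = []
    while remaining:
        scores = {}
        for j in remaining:
            X_try = X_work.copy()
            X_try[:, j] = X[:, j]            # unpermute predictor j only
            scores[j] = score_fn(X_try, y)
        best = max(scores, key=scores.get)   # largest score once restored
        X_work[:, best] = X[:, best]         # stays unpermuted in later passes
        ranking.append(best)
        remaining.remove(best)
    return ranking

# Synthetic data: predictor 0 matters most, predictor 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
logit = 2.0 * X[:, 0] + 1.0 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
score = lambda X_in, y_in: roc_auc_score(y_in, model.predict_proba(X_in)[:, 1])
ranking = forward_multipass_ranking(score, X, y, rng)
print(ranking)
```

In a cross-validated setting, this ranking would be repeated per fold and the per-predictor median rank taken, as described above.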
e. ML model evaluation
The OLR and ORF algorithms produce probabilistic predictions, i.e., one probability for each of the POOR, FAIR, and GOOD classes. We remind the reader that each WoFS forecast sample is assigned to one of these three classes based on how its eFSSavg ranks within the entire dataset (section 3b; Fig. 5).
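As a reminder of how the three classes arise, the rank-based assignment can be sketched as follows. The sketch assumes thresholds at the 20th and 80th percentiles of eFSSavg, consistent with class base rates of 0.2, 0.6, and 0.2; the function name and the uniform random values are placeholders rather than actual eFSSavg data.

```python
import numpy as np

CLASS_NAMES = ["POOR", "FAIR", "GOOD"]

def assign_skill_classes(efss_avg, lower_pct=20, upper_pct=80):
    """Map each sample to 0 (POOR), 1 (FAIR), or 2 (GOOD) by percentile rank."""
    lo, hi = np.percentile(efss_avg, [lower_pct, upper_pct])
    # digitize returns 0 below lo, 1 in [lo, hi), and 2 at or above hi.
    return np.digitize(efss_avg, [lo, hi])

rng = np.random.default_rng(0)
efss_avg = rng.uniform(0.0, 1.0, size=1000)  # placeholder skill values
classes = assign_skill_classes(efss_avg)
for code, name in enumerate(CLASS_NAMES):
    print(name, int(np.sum(classes == code)))
```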
1) Verification methods
We use both deterministic and probabilistic verification methods since we envision forecast skill predictions being used in both ways. The natively probabilistic predictions from the ML models are converted to deterministic predictions (i.e., the class with the highest probability) for input to the deterministic verification methods. Class-respective verification methods are computed over the collated predictions from each test fold. The first four of these methods are the probability of detection (POD), false alarm ratio (FAR), critical success index (CSI), and classification accuracy (ACC). In multiclassification, the ACC for a given class is the number of instances where the class is correctly predicted or correctly not predicted (i.e., true positives plus true negatives) divided by the number of predictions. The POD, FAR, CSI, ACC, and balanced classification accuracy (BACC; defined below) are deterministic metrics; all other metrics discussed below are probabilistic.
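The per-class deterministic scores can be illustrated with a small worked example. The probabilities and labels below are invented for illustration; classes are coded 0 = POOR, 1 = FAIR, 2 = GOOD, and the function name is hypothetical.

```python
import numpy as np

def per_class_scores(y_true, y_pred, cls):
    """One-vs-rest contingency-table scores for a single class."""
    obs = y_true == cls
    fcst = y_pred == cls
    hits = np.sum(obs & fcst)
    misses = np.sum(obs & ~fcst)
    false_alarms = np.sum(~obs & fcst)
    correct_negatives = np.sum(~obs & ~fcst)
    # Note: FAR is undefined if the class is never predicted.
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "CSI": hits / (hits + misses + false_alarms),
        "ACC": (hits + correct_negatives) / obs.size,
    }

# Six invented probabilistic predictions: columns are P(POOR), P(FAIR), P(GOOD).
proba = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.7, 0.2],
                  [0.3, 0.4, 0.3],
                  [0.1, 0.3, 0.6],
                  [0.1, 0.2, 0.7]])
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = proba.argmax(axis=1)  # deterministic prediction: most probable class

poor_scores = per_class_scores(y_true, y_pred, cls=0)
print(poor_scores)  # POD = 0.5, FAR = 0.0, CSI = 0.5, ACC = 5/6 for POOR
```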
To assess model discrimination ability, we use relative operating characteristic (ROC) curves, area under the ROC curve (AUC), performance diagrams (PDs), normalized area under the PD curve (NAUPDC; Flora et al. 2021; Miller et al. 2022), and maximum normalized CSI (MAX_NCSI; Flora et al. 2021; Miller et al. 2022) on the PD curve. Both NAUPDC and MAX_NCSI are skill scores ranging from zero (no skill) to 1 (maximum skill). Each point on the ROC and PD curves corresponds to a probability threshold (footnote 4) for a given class; predictions are binarized on this threshold to define the contingency table entries (hits, misses, false alarms, and true negatives) required for computing POD, false-positive rate, and success ratio. To assess reliability, we inspect reliability diagrams (footnote 5) along with the reliability component of the Brier skill score BSSrel, which is the mean squared difference between bin-averaged probabilistic predictions and event frequencies, weighted by the sample sizes for each bin (Murphy 1986). For context, for a prediction system whose mean forecast probability differs from the event frequency in each bin by 10 percentage points, the BSSrel is 0.01 if all bins have equal samples.
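The 0.01 value quoted above can be reproduced with a short calculation. The sketch below implements the sample-weighted mean squared difference described in the text, using the reliability-diagram bin edges [0, 0.1, ..., 0.7, 1]; the function name and the constructed forecasts (100 samples per bin, each bin's event frequency 10 percentage points above its forecast probability) are assumptions made for illustration.

```python
import numpy as np

BIN_EDGES = np.array([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0])

def reliability_component(prob, event, edges=BIN_EDGES):
    """Sample-weighted mean squared gap between forecast prob. and event freq."""
    bin_idx = np.digitize(prob, edges[1:-1])  # bin index 0 .. len(edges) - 2
    total = 0.0
    for b in range(len(edges) - 1):
        in_bin = bin_idx == b
        n_b = in_bin.sum()
        if n_b == 0:
            continue  # empty bins contribute nothing
        gap = prob[in_bin].mean() - event[in_bin].mean()
        total += n_b * gap ** 2
    return total / prob.size

# Construct forecasts whose event frequency is 0.10 higher than forecast in every bin.
probs, events = [], []
for p in [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.85]:
    n_events = int(round((p + 0.10) * 100))
    probs += [p] * 100
    events += [1] * n_events + [0] * (100 - n_events)

rel = reliability_component(np.array(probs), np.array(events))
print(round(rel, 6))  # 0.01, i.e., 0.10 squared
```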
We use three verification methods that evaluate all three classes collectively. We compute RPSS [section 3d(2)] over the collated predictions from each test fold. The remaining two metrics are computed for each of the five test folds, after which the cross-fold mean and standard deviation are obtained. These metrics are the macroaverage AUC, which is the unweighted average of the AUCs per class, and the BACC, which is the unweighted average of the classification accuracy per class.
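For readers less familiar with the ranked probability score, the sketch below shows one common formulation: the RPS is the sum of squared differences between cumulative forecast and cumulative observed probabilities, and the RPSS is one minus the ratio of the model RPS to a reference RPS. Using the class base rates (0.2, 0.6, 0.2) as the reference forecast is an assumption of this sketch, as are the function names and the invented forecasts.

```python
import numpy as np

def rps(proba, y_true):
    """Mean ranked probability score for ordered classes 0..k-1."""
    k = proba.shape[1]
    obs = np.eye(k)[y_true]  # one-hot observed class
    cum_diff = np.cumsum(proba, axis=1) - np.cumsum(obs, axis=1)
    return np.mean(np.sum(cum_diff ** 2, axis=1))

def rpss(proba, y_true, base_rates):
    """Skill relative to always forecasting the base rates (assumed reference)."""
    ref = np.tile(base_rates, (len(y_true), 1))
    return 1.0 - rps(proba, y_true) / rps(ref, y_true)

# Observed class is POOR (0). A forecast favoring FAIR is penalized less
# than a forecast favoring GOOD, which is two categories away.
obs_poor = np.array([0])
rps_fair = rps(np.array([[0.1, 0.8, 0.1]]), obs_poor)  # 0.9^2 + 0.1^2 = 0.82
rps_good = rps(np.array([[0.1, 0.1, 0.8]]), obs_poor)  # 0.9^2 + 0.8^2 = 1.45
print(rps_fair, rps_good)

base_rates = np.array([0.2, 0.6, 0.2])
y_true = np.array([0, 1, 1, 1, 2])
perfect = np.eye(3)[y_true]
print(rpss(perfect, y_true, base_rates))  # a perfect forecast scores 1
```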
2) Baselines
To assess how much our final ML models improve upon methods that are much easier and quicker to develop and maintain, we develop two baseline methods for predicting WoFS forecast skill. Both baseline methods are single-predictor OLR models; using the ORF instead did not improve the baseline model skill on the testing dataset (not shown). The first, persistence-based, baseline model (PERS) operates on init_eFSSavg, the eFSSavg valid within the initial patch at analysis time. The init_eFSSavg varies enough from case to case to provide a meaningful baseline for fcst_eFSSavg, in part since storms are often poorly analyzed in CAM ensembles if they are newly formed (e.g., Guerra et al. 2022) or lie within a radar coverage gap. The second, spread-based, baseline model (SPRD) operates on fcst_REFLCOMP_std_max, the ensemble spread of maximum REFLCOMP within the forecast patch at forecast valid time. A large number and variety of gridpoint-to-gridpoint REFLCOMP spread metrics were tested, including some that used neighborhood methods and/or intensity thresholding to mitigate impacts of modest phase differences between ensemble members and/or focus upon convectively active regions, respectively (as mentioned in section 2c, filtering and/or thresholding REFLCOMP did not improve spread–error relationships). Of all the spread metrics tested, fcst_REFLCOMP_std_max was the simplest, yet most skillful, predictor for the spread-based baseline model.
f. ML model explainability methods
We use the backward single-pass (BSP) permutation method to assess model predictor importance for the final models. The BSP measures predictor importance by permuting predictors one at a time and computing the reductions in model skill (e.g., Lakshmanan et al. 2015; McGovern et al. 2019; Flora et al. 2022a). The BSP was among the best-performing explanation methods analyzed in Flora et al. (2022b), especially after correlated predictors were limited. As with the FMP predictor rankings used for predictor reduction, BSP rankings were generated for each of the five training folds and then the medians and standard deviations of the rankings were computed. Model prediction effects of the most important predictors were examined using per-fold accumulated local effect (ALE) curves for predictions of POOR and GOOD. The ALE curve displays the expected change in the model prediction as the value of a given predictor is varied (Apley and Zhu 2016; Molnar et al. 2020; Flora et al. 2022a). ALE curves are only shown for GOOD in section 4; the POOR ALE curves were qualitatively the sign inverse of the GOOD ALE curves for most predictors. We implement all three explainability methods (FMP, BSP, and ALE) using the Python scikit-explain package (Flora and Handler 2022).
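The backward single-pass idea (permute one predictor at a time and measure the loss of skill) can be sketched with scikit-learn's `permutation_importance`, used here as a stand-in for scikit-explain. The synthetic data, the binary logistic regression, and the AUC scoring choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: predictor 0 matters most, predictor 2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
logit = 2.0 * X[:, 0] + 1.0 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)

# Each predictor is shuffled on its own (others left intact); the importance
# is the resulting drop in AUC, averaged over n_repeats shuffles.
result = permutation_importance(model, X, y, scoring="roc_auc",
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print(ranking, result.importances_mean.round(3))
```

Repeating this per training fold and summarizing the per-predictor ranks across folds would mirror the procedure described above.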
4. Results
a. ML model performance
Using the procedure described in section 3d(3), we decreased the numbers of predictors in the original models (OLR_ALL and ORF_ALL) from 323 to 10 or 15 in the final models (OLR_FNL and ORF_FNL) with little loss in accuracy (Fig. 8; Tables 4–6). Overall, for both the original and reduced (final) predictor sets, the OLR and ORF algorithms had comparable performance on the testing dataset for all verification metrics considered. We subsequently adopted OLR_FNL as our final model given its low computational cost relative to OLR_ALL and the ORF models and its greater explainability than the all-predictors models.
Verification statistics for 1-h lead time models. The term N is the number of model predictors, BACC is the all-folds-mean balanced classification accuracy, and MA AUC is the all-folds-mean macroaverage AUC. The per-class ACC and per-class AUC are also shown. For all these metrics, higher values indicate better performance. All the examined verification metrics produce nearly identical conclusions about the relative accuracy of the ML models; to save space, therefore, we omit POD, FAR, CSI, BSS reliability component, and RPSS.
The variation of model performance across folds (Fig. 8) increases with lead time, becoming quite large for the 3-h models. This increasing cross-fold variability with lead time could arise from various factors, including increasingly complex predictor–target relationships (which would increase the training data size requirements) and decreasing training and testing sample sizes. The rather large variation in model performance at 3-h lead times underscores our motivation for using nested cross validation: to assess the uncertainty in future model performance and to assess whether the addition of training data may improve the models. While the 1-h models seem unlikely to substantially benefit from additional training cases, in the case of the 2-h and (especially) 3-h models, it is possible the model performance would improve in the future. Alternatively, if the greater variability in model performance at longer lead times arises primarily from increased representativeness errors in the smaller testing folds and not from differences in general model skill between folds (due to the smaller training folds, a factor that could be exacerbated by more complex predictor–target relationships at longer lead times), then expanding our dataset may not substantially improve the 2- and 3-h models but would still improve our estimates of their skill.
The performance of both the OLR and ORF models degrades with lead time (Fig. 9). The performance of the persistence-based model (PERS) substantially degrades with lead time, whereas the spread-based model (SPRD) performs similarly across lead times. The rapid decrease in correlation between the magnitudes of the analysis and forecast errors is broadly consistent with the intrinsically limited predictability of deep convection (e.g., Zhang et al. 2016) and likely also arises in part from the decreasing correspondence between storms in the analysis and forecast patches (due, e.g., to initial storms decaying before forecast valid time and storms developing after initialization time and moving into the forecast patch). Both the OLR and ORF models substantially outperform both baselines at all lead times. Thus, the use of ML to incorporate information from a large and diverse set of predictors, while time-consuming, yields meaningfully better forecasts of forecast skill in our application.
We now examine in more detail the performance of the OLR_FNL model. The ROC curves demonstrate substantial discrimination ability for all three classes (Fig. 10). The AUCs are lower than what is often seen in the meteorology literature since we are not predicting rare events; the base rates of our POOR, FAIR, and GOOD classes are 0.2, 0.6, and 0.2, respectively, whereas base rates for extreme weather are often O(0.01). The rarer an event, the more false alarms can occur while still maintaining a high POD and low false-positive rate since it becomes trivial to correctly predict nonevents (Fawcett 2004). Thus, the fact that our model better discriminates POOR and GOOD forecasts than FAIR forecasts in terms of AUC is not surprising. On the other hand, since the POOR and GOOD base rates are approximately equal by construction, we can conclude unambiguously that the model more skillfully discriminates POOR than GOOD forecasts with respect to AUC (i.e., when correct negatives are included).
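The dependence of false alarms on base rate can be made concrete with a little arithmetic. In the sketch below, the POD of 0.8 and false-positive rate of 0.1 are arbitrary illustrative values (not results from this study); the same ROC point is converted to a success ratio (1 minus FAR) for a base rate of 0.2 and for a rare-event base rate of 0.01.

```python
def success_ratio(pod, pofd, base_rate):
    """Success ratio (1 - FAR) implied by a ROC point and an event base rate."""
    hits = pod * base_rate                    # fraction of all cases
    false_alarms = pofd * (1.0 - base_rate)   # fraction of all cases
    return hits / (hits + false_alarms)

for base_rate in (0.2, 0.01):
    sr = success_ratio(pod=0.8, pofd=0.1, base_rate=base_rate)
    print(f"base rate {base_rate}: success ratio = {sr:.3f}, FAR = {1 - sr:.3f}")
# base rate 0.2:  success ratio = 0.667, FAR = 0.333
# base rate 0.01: success ratio = 0.075, FAR = 0.925
```

An identical point on the ROC curve thus corresponds to far more false alarms per hit when the event is rare, which is why AUC values are not directly comparable across classes with different base rates.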
We now turn to the PD curves as a complementary measure of model discrimination that ignores correct negatives (Fig. 11). The CSI is much higher for POOR than for GOOD at most probability thresholds, including those that yield frequency biases near unity. The superior model discrimination of POOR relative to GOOD under these metrics is consistent with the ROC curves. On the other hand, the FAIR performance curves are located primarily rightward of the POOR and GOOD curves, indicating that the model discriminates FAIR cases better than POOR and GOOD cases if correct negatives are not considered. This is the opposite of the result given by the ROC curves (which account for correct negatives). Which view of the model discrimination ability is preferred depends on whether one wants to account for the influence of correct negatives, a decision that becomes more critical if one or more prediction classes have a low base rate (since correct negatives become common).
To compare the skill with which each class is discriminated requires accounting for the differences between the class base rates. The NAUPDC and MAX_NCSI are skill scores designed to facilitate comparison of discrimination ability between models and/or classes with different event rates (Flora et al. 2021). The near-zero MAX_NCSI for FAIR (Fig. 11) indicates the model’s ability to discriminate FAIR forecasts is not much better than a no-skill model (i.e., a model based solely on the class base rate). Fortunately, this is not the case for POOR and GOOD forecasts, which are arguably of greater operational interest than FAIR forecasts. Both the NAUPDC and the MAX_NCSI are substantially higher for POOR than for GOOD. The gaps between the POOR and GOOD PD curves and ROC curves increase with lead time (Figs. 10 and 11) since the model discrimination ability for GOOD degrades faster than for POOR.
The OLR_FNL models are very well calibrated for all three classes (Fig. 12). The strong overconfidence of high-probability (p ≥ 0.6) GOOD forecasts at 3-h lead times is not particularly detrimental overall (BSSrel = 0.003) since such forecasts are very rare (<3% of all 3-h forecasts). Egregious predictions (POOR classified as GOOD or vice versa) are relatively rare (Fig. 13). In summary, for the classes the model was designed primarily to predict (POOR and GOOD), the model exhibits moderate discrimination ability and high reliability and only rarely produces egregious predictions. Still, misclassifications are common; for example, predictions of GOOD verify only 34% of the time at 3-h lead times in our testing dataset (Fig. 13). It is unclear a priori whether the model is accurate enough to be of value to forecasters. In the authors’ experience, however, forecasters are often able to leverage modestly skillful guidance and to provide valuable suggestions on how it can be improved. We therefore plan to engage with forecasters to assess whether it is worthwhile to continue developing these forecast skill prediction models for the WoFS (section 5).
b. ML model explanations
For each lead time, the OLR and ORF models shared the majority of their top 10 most important predictors, and for most of those predictors, the median ranking for one model occurred within the interquartile range (IQR) of that predictor ranking in the other model (Figs. 14 and 15). ALE curves for both learning algorithms (Fig. 16), but especially for the ORF models, are flatter than the corresponding event rate curves. This insensitivity of the model predictions likely indicates that the models capture higher-order predictor relationships to forecast skill (i.e., feature interactions). The generally larger ranges of the ALE curves in the OLR models indicate that first-order effects (i.e., the average changes in the model prediction as features are individually varied) account for larger portions of the model predictions than in the ORF models. The greater prominence of first-order effects (which are more easily understood than higher-order effects) in the OLR models renders them more explainable than the ORF models.
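To clarify what a first-order ALE curve measures, a minimal standalone sketch follows. It is not the scikit-explain implementation; the function name, the quantile binning, and the toy linear "model" are assumptions. Within each bin of the predictor, the prediction change between the bin edges is averaged over the samples in that bin, and these local effects are then accumulated and centered.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """First-order accumulated local effects of predictor j at the bin edges."""
    edges = np.unique(np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)))
    bin_idx = np.digitize(X[:, j], edges[1:-1])  # 0 .. len(edges) - 2
    n_local = len(edges) - 1
    local = np.zeros(n_local)
    counts = np.zeros(n_local)
    for b in range(n_local):
        rows = X[bin_idx == b]
        if len(rows) == 0:
            continue
        lower, upper = rows.copy(), rows.copy()
        lower[:, j] = edges[b]       # move predictor j to the bin's lower edge
        upper[:, j] = edges[b + 1]   # ... and to its upper edge
        local[b] = np.mean(predict(upper) - predict(lower))
        counts[b] = len(rows)
    ale = np.concatenate([[0.0], np.cumsum(local)])  # accumulate local effects
    # Center so that the sample-weighted mean effect is zero.
    mid = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.sum(mid * counts) / counts.sum()
    return edges, ale

# Toy check with two correlated predictors and a known linear response.
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
X = np.column_stack([x0, 0.8 * x0 + 0.6 * rng.normal(size=1000)])
predict = lambda A: 3.0 * A[:, 0] + 1.0 * A[:, 1]

edges, ale = ale_1d(predict, X, j=0)
slopes = np.diff(ale) / np.diff(edges)  # recovers the coefficient of x0
```

For a fitted classifier, `predict` could be replaced by a function returning the predicted probability of a class of interest, for example `lambda A: model.predict_proba(A)[:, 2]` for P(GOOD) under a 0/1/2 class coding (hypothetical names).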
We now examine in more detail the predictor rankings and ALE curves for both the ORF and OLR models to gain a better understanding of which predictors are generally most important to predicting forecast skill, regardless of how important they are to a particular model. Unsurprisingly, init_eFSS (the sole predictor in the persistence models) is the most important predictor for both 1-h models (Figs. 14a,b) and is absent from the 3-h models (Figs. 15a,b), consistent with the complete lack of skill of the 3-h persistence baseline (Table 6). Similarly, fcst_REFLCOMP_std_max (the sole predictor in the spread-based models) is the most important predictor of the 3-h models (Figs. 15a,b) but is progressively less important (but still highly ranked) at shorter lead times (Figs. 14a,b). The two baseline model predictors have generally expected effects on the model predictions of GOOD (Figs. 16a,c,j,l). For example, in the 1-h models, high init_eFSS substantially increases P(GOOD) (Figs. 16a,j), and in the 3-h models, moderate and high fcst_REFLCOMP_std_max substantially decrease P(GOOD) (not shown). Many of the most important predictors appear to function as proxies of storm organization, consistent with the expectation that more organized storms are more predictable (e.g., Droegemeier and Levit 1993); this hypothesis has recently been verified for MCSs in the WoFS (Britt et al. 2024). For example, fcst_vort_med_max is one of the top two predictors in the 1-h (Figs. 14a,b) and 2-h models (not shown) and is positively correlated with P(GOOD) (Figs. 16b,k), consistent with the expectation that storms containing stronger rotation are generally more organized and therefore more predictable. Similarly, the importance of fcst_ws_med_max (Figs. 14a and 15a,b) is likely due to its association with strong, organized convection. The init_pw_med_90pc or fcst_pw_med_90pc is among the top 10 predictors in five of the six models (Figs. 14a,b and 15b; 2-h feature rankings not shown) and is positively correlated with P(GOOD) (Figs. 16f,n). This could be because more humid environments tend to produce higher storm coverage, which in turn reduces the influence of storm phase errors on the eFSS. The importance of init_fed_med_90pc is likely due in large part to its correlation with storm organization, as well as to its correlation with storm maturity, since older storms are generally better assimilated and forecast (Guerra et al. 2022).
These explanations indicate that the models are generally consistent with our understanding of thunderstorm predictability, which suggests that real-time local explainability products may help forecasters calibrate their trust in the ML model predictions on a case-by-case basis. The importance of some of the model predictors, however, is more difficult to explain conceptually. In some instances, the OLR ALE curve is dramatically different from the event base rate curve (Figs. 16d,p). This could be due to the compensating effects of other predictors. In addition, some spread-based predictors positively correlate with P(GOOD), despite the general negative correlation between spread and skill. These include the spread in certain patch-maximum storm intensity statistics, e.g., fcst_wdn_std_min and fcst_rain_std_max (at 3-h lead times; not shown), or in the number of quasi-linear convective system (QLCS)-embedded ordinary or rotation-bearing cells (fcst_qlcs-ord_std and fcst_qlcs-meso_std; Figs. 16e,o).
We hypothesize that the positive correlation of certain spread-based predictors with P(GOOD) arises in part from the tendency for more predictable storms to be associated with larger ensemble spread in metrics that scale with storm organization (since the ensemble median and standard deviation are correlated). For example, forecast patches containing QLCSs will tend to be characterized by large medians and standard deviations of the numbers of QLCS-embedded ordinary and/or rotating cells. It was initially unclear, however, why the spread-based predictors, but not the corresponding median-based predictors, would be included among the most important predictors in the final models. Inspection of the predictor sets before and after the model reduction procedure (section 3d) revealed that in all cases where a spread-based predictor was positively correlated with P(GOOD), the median counterpart to that predictor was eliminated because of its high correlation with another predictor that was better correlated with the target variable (not shown). Thus, the spread-based predictors in question are not necessarily intrinsically better predictors than their median counterparts; rather, the former may contain more predictive information than the latter that is not already contained among the other predictors in the original predictor set (footnote 6). Additional analysis, likely involving inspection of many individual cases, is required to confirm our physical understanding of the importance of the predictors in our final models.
It was shown in section 4a that the model discriminates POOR forecasts better than GOOD forecasts, especially at later lead times. Comparison of the fcst_REFLCOMP_std_max ALE curves and centered event rates between POOR and GOOD for lead times of 1 and 3 h (not shown) reveals this spread-based predictor has a larger effect on P(POOR) than on P(GOOD), both in our dataset and in the model. Since fcst_REFLCOMP_std_max is one of the most important predictors across lead times (Figs. 14 and 15), we hypothesize that the superior predictions of POOR can be largely attributed to the stronger correspondence between ensemble spread and the probability of a POOR forecast than of a GOOD forecast. We further speculate that this stronger correspondence arises in part from the underdispersion of the WoFS, which produces many instances of low spread but high error (Fig. 2a), thus weakening the correspondence between fcst_REFLCOMP_std_max and P(GOOD) at low values of fcst_REFLCOMP_std_max. Frequently slow ensemble spinup of new storms in the WoFS analysis (e.g., Guerra et al. 2022) is likely a major contributor to this low-spread/high-error problem.
5. Summary and conclusions
We have demonstrated the capability of ML to improve objective prediction of WoFS forecast skill. Our ML models effectively exploit relationships between forecast skill and numerous observation-based and (especially) model-based predictors, rather than relying solely upon measures of ensemble forecast spread or analysis accuracy. While ML has been extensively used to predict thunderstorms (e.g., McGovern et al. 2023), we believe this is the first application of ML to predicting thunderstorm forecast skill.
Using WoFS and MRMS data from 106 Hazardous Weather Testbed (HWT) SFE cases, we trained ordinal logistic regression and ordinal random forest models to predict WoFS composite reflectivity forecast skill at lead times of 1, 2, and 3 h. Forecast skill was classified as POOR (bottom 20% of forecasts), FAIR (middle 60% of forecasts), or GOOD (top 20% of forecasts). We began with a very large predictor set (N = 323) to help ensure that the final models would include all the most important predictors and then we dramatically reduced the model dimensionality (N = 10 or 15) with little skill loss on the testing dataset. As a result, the final models run faster, are more explainable, and are therefore well suited to real-time operation. The ordinal logistic regression models are very nearly as skillful as the ordinal random forest models. The ability of relatively simple ML models to perform similarly to much more complex ML models is a typical phenomenon both in meteorology (e.g., Flora et al. 2022a) and in other disciplines (Rudin 2019).
We plan to engage with NWS forecasters to better evaluate the operational utility of our algorithm and identify any potential improvements to the prediction task framing or visualization. We will solicit feedback on our forecast patch strategy; for example, forecasters may prefer a static set of forecast patches (obtained by tiling the full WoFS domain) to the present storm-dependent approach or that patches be determined by WoFS forecast storms rather than observed storms. It will be important to assess how well forecasters’ subjective impressions of forecast skill match the objective metric adopted herein (which uses the eFSS) and whether forecasters prefer a different forecast skill metric (e.g., a different set of percentile ranges for the eFSSavg classes; eFSSavg itself rather than eFSSavg classes; or individual models for different eFSS scales and/or thresholds). Depending on initial feedback, we may explore the value of explainability graphics like those examined in the 2022 HWT SFE for WoFS-based ML models that predict severe thunderstorm hazards (Clark et al. 2023; Flora et al. 2024b).
We expect that ML could improve forecast skill prediction not just at the short lead times (1–3 h) examined herein, but at much longer lead times (days to weeks), especially given the successful early forays into this endeavor using deterministic models (Leslie et al. 1989; Molteni and Palmer 1991; Wobus and Kalnay 1995) and the existence of large-scale patterns associated with diminished or enhanced predictability (e.g., Miller and Gensini 2022). Given forecasters’ ongoing need for objective forecast skill guidance across NWP scales, we hope that ML will continue to be leveraged for forecasting forecast skill.
Footnote 1: The unbinned spread–error correlation is very similar for all three lead times, while the spread–error slope increases from 0.62 for 1-h lead times to 0.70 for 3-h lead times.
Footnote 2: The only other variable relevant to storm motion that is stored for WoFS archived cases is the pressure-weighted wind within the 850–300-hPa layer. This variable provided inferior estimates of storm motion as judged by visually comparing observed and projected storm positions at 2- and 3-h lead times.
Footnote 3: In future work, we will consider instead using recursive feature elimination since this approach can work better for predictor selection (Gregorutti et al. 2017).
Footnote 4: When generating the ROC and PD curves, we use the set of probability thresholds provided by the Python scikit-learn roc_curve method.
Footnote 5: We use the following forecast probability bin edges for the reliability diagrams: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1].
Footnote 6: Given this result, we recommend that when doing this type of predictor set reduction, a list should be created of the eliminated predictors and their retained (highly correlated) counterparts to facilitate model explainability.
Acknowledgments.
This work was prepared by the authors with funding from the NOAA/National Severe Storms Laboratory (C. K. P. and A. E. R.) and the NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce (M. L. F., P. S. S., and B. C. M.). We thank Thea Sandmael for reviewing a preliminary version of this manuscript. Many of the analyses and visualizations were produced using the freely provided Anaconda Python distribution. The contents of this paper do not necessarily reflect the views or official position of any organization of the United States.
Data availability statement.
The experimental WoFS ensemble forecasts and MRMS data used in this study are not currently available in a publicly accessible repository. All of the ML features are available by request from CKP. The Python scripts used for model training, reduction, verification, and explanation are available at https://github.com/coreypotvin/wofs_fcst_skill.
REFERENCES
Apley, D. W., and J. Zhu, 2016: Visualizing the effects of predictor variables in black box supervised learning models. arXiv, 1612.08468v2, https://doi.org/10.48550/arxiv.1612.08468.
Barker, T. W., 1991: The relationship between spread and forecast error in extended-range forecasts. J. Climate, 4, 733–742, https://doi.org/10.1175/1520-0442(1991)004<0733:TRBSAF>2.0.CO;2.
Britt, K. C., P. S. Skinner, P. L. Heinselman, C. K. Potvin, M. L. Flora, B. Matilla, K. H. Knopfmeier, and A. E. Reinhart, 2024: Verification of quasi-linear convective systems predicted by the Warn-on-Forecast System (WoFS). Wea. Forecasting, 39, 155–176, https://doi.org/10.1175/WAF-D-23-0106.1.
Buizza, R., 1997: Potential forecast skill of ensemble prediction and spread and skill distributions of the ECMWF ensemble prediction system. Mon. Wea. Rev., 125, 99–119, https://doi.org/10.1175/1520-0493(1997)125<0099:PFSOEP>2.0.CO;2.
Buizza, R., M. Miller, and T. N. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 125, 2887–2908, https://doi.org/10.1002/qj.49712556006.
Bunkers, M. J., B. A. Klimowski, J. W. Zeitler, R. L. Thompson, and M. L. Weisman, 2000: Predicting supercell motion using a new hodograph technique. Wea. Forecasting, 15, 61–79, https://doi.org/10.1175/1520-0434(2000)015<0061:PSMUAN>2.0.CO;2.
Clark, A. J., W. A. Gallus Jr., M. Xue, and F. Kong, 2010: Growth of spread in convection-allowing and convection-parameterizing ensembles. Wea. Forecasting, 25, 594–612, https://doi.org/10.1175/2009WAF2222318.1.
Clark, A. J., and Coauthors, 2023: The third real-time, virtual spring forecasting experiment to advance severe weather prediction capabilities. Bull. Amer. Meteor. Soc., 104, E456–E458, https://doi.org/10.1175/BAMS-D-22-0213.1.
Demuth, J. L., and Coauthors, 2020: Recommendations for developing useful and usable convection-allowing model ensemble information for NWS forecasters. Wea. Forecasting, 35, 1381–1406, https://doi.org/10.1175/WAF-D-19-0108.1.
Dowell, D. C., and Coauthors, 2022: The High-Resolution Rapid Refresh (HRRR): An hourly updating convection-allowing forecast model. Part I: Motivation and system description. Wea. Forecasting, 37, 1371–1395, https://doi.org/10.1175/WAF-D-21-0151.1.
Droegemeier, K. K., and J. Levit, 1993: The sensitivity of numerically-simulated storm evolution to initial conditions. Preprints, 17th Conf. on Severe Local Storms, St. Louis, MO, Amer. Meteor. Soc., 431–435.
Duc, L., K. Saito, and H. Seko, 2013: Spatial-temporal fractions verification for high-resolution ensemble forecasts. Tellus, 65A, 18171, https://doi.org/10.3402/tellusa.v65i0.18171.
Evans, C., D. F. Van Dyke, and T. Lericos, 2014: How do forecasters utilize output from a convection-permitting ensemble forecast system? Case study of a high-impact precipitation event. Wea. Forecasting, 29, 466–486, https://doi.org/10.1175/WAF-D-13-00064.1.
Fawcett, T., 2004: ROC graphs: Notes and practical considerations for data mining researchers. Pattern Recognit. Lett., 31, 1–38.
Flora, M., and S. Handler, 2022: Scikit-Explain. Github, accessed 1 January 2024, https://github.com/monte-flora/scikit-explain.
Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022a: Comparing explanation methods for traditional machine learning models Part 1: An overview of current methods and quantifying their disagreement. arXiv, 2211.08943v1, https://doi.org/10.48550/arxiv.2211.08943.
Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022b: Comparing explanation methods for traditional machine learning models Part 2: Quantifying model explainability faithfulness and improvements with dimensionality reduction. arXiv, 2211.10378v1, https://doi.org/10.48550/arXiv.2211.10378.
Flora, M. L., C. K. Potvin, P. S. Skinner, S. Handler, and A. McGovern, 2021: Using machine learning to generate storm-scale probabilistic guidance of severe weather hazards in the Warn-on-Forecast System. Mon. Wea. Rev., 149, 1535–1557, https://doi.org/10.1175/MWR-D-20-0194.1.
Flora, M. L., C. K. Potvin, A. McGovern, and S. Handler, 2024a: A machine learning explainability tutorial for atmospheric sciences. Artif. Intell. Earth Syst., 3, e230018, https://doi.org/10.1175/AIES-D-23-0018.1.
Flora, M. L., B. Gallo, C. K. Potvin, A. J. Clark, and K. Wilson, 2024b: Exploring the usefulness of machine learning severe weather guidance in the Warn-on-Forecast System: Results from the 2022 NOAA Hazardous Weather Testbed Spring Forecasting Experiment. Wea. Forecasting, 39, 1023–1044, https://doi.org/10.1175/WAF-D-24-0038.1.
Fortin, V., M. Abaza, F. Anctil, and R. Turcotte, 2014: Why should ensemble spread match the RMSE of the ensemble mean? J. Hydrometeor., 15, 1708–1713, https://doi.org/10.1175/JHM-D-14-0008.1.
Frank, E., and M. Hall, 2001: A simple approach to ordinal classification. Machine Learning: ECML 2001, Lecture Notes in Computer Science, Vol. 2167, Springer, 145–156, https://doi.org/10.1007/3-540-44795-4_13.
Gallo, B. T., and Coauthors, 2017: Breaking new ground in severe weather prediction: The 2015 NOAA/Hazardous Weather Testbed Spring Forecasting Experiment. Wea. Forecasting, 32, 1541–1568, https://doi.org/10.1175/WAF-D-16-0178.1.
Gregorutti, B., B. Michel, and P. Saint-Pierre, 2017: Correlation and variable importance in random forests. Stat. Comput., 27, 659–678, https://doi.org/10.1007/s11222-016-9646-1.
Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205, https://doi.org/10.1175/1520-0434(2002)017<0192:IROAMS>2.0.CO;2.
Grimit, E. P., and C. F. Mass, 2007: Measuring the ensemble spread–error relationship with a probabilistic approach: Stochastic ensemble results. Mon. Wea. Rev., 135, 203–221, https://doi.org/10.1175/MWR3262.1.
Guerra, J. E., P. S. Skinner, A. Clark, M. Flora, B. Matilla, K. Knopfmeier, and A. E. Reinhart, 2022: Quantification of NSSL Warn-on-Forecast System accuracy by storm age using object-based verification. Wea. Forecasting, 37, 1973–1983, https://doi.org/10.1175/WAF-D-22-0043.1.
Heinselman, P. L., and Coauthors, 2024: Warn-on-Forecast System: From vision to reality. Wea. Forecasting, 39, 75–95, https://doi.org/10.1175/WAF-D-23-0147.1.
Herman, G. R., and R. S. Schumacher, 2016: Using reforecasts to improve forecasting of fog and visibility for aviation. Wea. Forecasting, 31, 467–482, https://doi.org/10.1175/WAF-D-15-0108.1.
Hoffman, R. N., and E. Kalnay, 1983: Lagged average forecasting, an alternative to Monte Carlo forecasting. Tellus, 35A, 100–118, https://doi.org/10.1111/j.1600-0870.1983.tb00189.x.
Houtekamer, P. L., 1993: Global and local skill forecasts. Mon. Wea. Rev., 121, 1834–1846, https://doi.org/10.1175/1520-0493(1993)121<1834:GALSF>2.0.CO;2.
Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242, https://doi.org/10.1175/1520-0493(1996)124<1225:ASSATE>2.0.CO;2.
Jones, T. A., K. Knopfmeier, D. Wheatley, G. Creager, P. Minnis, and R. Palikonda, 2016: Storm-scale data assimilation and ensemble forecasting with the NSSL experimental Warn-on-Forecast System. Part II: Combined radar and satellite data experiments. Wea. Forecasting, 31, 297–327, https://doi.org/10.1175/WAF-D-15-0107.1.
Kalnay, E., and A. Dalcher, 1987: Forecasting forecast skill. Mon. Wea. Rev., 115, 349–356, https://doi.org/10.1175/1520-0493(1987)115<0349:FFS>2.0.CO;2.
Kuhn, M., and K. Johnson, 2013: Applied Predictive Modeling. 1st ed. Springer, 600 pp.
Lakshmanan, V., C. Karstens, J. Krause, K. Elmore, A. Ryzhkov, and S. Berkseth, 2015: Which polarimetric variables are important for weather/no-weather discrimination? J. Atmos. Oceanic Technol., 32, 1209–1223, https://doi.org/10.1175/JTECH-D-13-00205.1.
Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. Mon. Wea. Rev., 102, 409–418, https://doi.org/10.1175/1520-0493(1974)102<0409:TSOMCF>2.0.CO;2.
Leslie, L. M., K. Fraedrich, and T. J. Glowacki, 1989: Forecasting the skill of a regional numerical weather prediction model. Mon. Wea. Rev., 117, 550–557, https://doi.org/10.1175/1520-0493(1989)117<0550:FTSOAR>2.0.CO;2.
Mansell, E. R., 2010: On sedimentation and advection in multimoment bulk microphysics. J. Atmos. Sci., 67, 3084–3094, https://doi.org/10.1175/2010JAS3341.1.
Markowski, P., M. Majcen, Y. Richardson, J. Marquis, and J. Wurman, 2011: Characteristics of the wind field in three nontornadic low-level mesocyclones observed by the Doppler On Wheels radars. Electron. J. Severe Storms Meteor., 6 (3), https://doi.org/10.55599/ejssm.v6i3.30.
McGovern, A., R. Lagerquist, D. J. Gagne II, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
McGovern, A., R. J. Chase, M. Flora, D. J. Gagne II, R. Lagerquist, C. K. Potvin, N. Snook, and E. Loken, 2023: A review of machine learning for convective weather. Artif. Intell. Earth Syst., 2, e220077, https://doi.org/10.1175/AIES-D-22-0077.1.
Miller, D. E., and V. E. Gensini, 2022: Characteristics of GEFSv12 high and low skill day 10 forecasts for tornadoes in the United States. 30th Conf. on Severe Local Storms, Santa Fe, NM, Amer. Meteor. Soc., 12.1B, https://ams.confex.com/ams/30SLS/meetingapp.cgi/Paper/407547.
Miller, W. J. S., and Coauthors, 2022: Exploring the usefulness of downscaling free forecasts from the Warn-on-Forecast System. Wea. Forecasting, 37, 181–203, https://doi.org/10.1175/WAF-D-21-0079.1.
Molnar, C., G. Casalicchio, and B. Bischl, 2020: Interpretable machine learning – A brief history, state-of-the-art and challenges. arXiv, 2010.09337v1, https://doi.org/10.48550/arXiv.2010.09337.
Molteni, F., and T. N. Palmer, 1991: A real-time scheme for the prediction of forecast skill. Mon. Wea. Rev., 119, 1088–1097, https://doi.org/10.1175/1520-0493(1991)119<1088:ARTSFT>2.0.CO;2.
Murphy, A. H., 1986: A new decomposition of the Brier score: Formulation and interpretation. Mon. Wea. Rev., 114, 2671–2673, https://doi.org/10.1175/1520-0493(1986)114<2671:ANDOTB>2.0.CO;2.
Murphy, J. M., 1988: The impact of ensemble forecasts on predictability. Quart. J. Roy. Meteor. Soc., 114, 463–493, https://doi.org/10.1002/qj.49711448010.
Novak, D. R., D. R. Bright, and M. J. Brennan, 2008: Operational forecaster uncertainty needs and future roles. Wea. Forecasting, 23, 1069–1084, https://doi.org/10.1175/2008WAF2222142.1.
Potvin, C. K., and Coauthors, 2020: Assessing systematic impacts of PBL schemes on storm evolution in the NOAA Warn-on-Forecast System. Mon. Wea. Rev., 148, 2567–2590, https://doi.org/10.1175/MWR-D-19-0389.1.
Potvin, C. K., and Coauthors, 2022: An iterative storm segmentation and classification algorithm for convection-allowing models and gridded radar analyses. J. Atmos. Oceanic Technol., 39, 999–1013, https://doi.org/10.1175/JTECH-D-21-0141.1.
Roberts, B., I. L. Jirak, A. J. Clark, S. J. Weiss, and J. S. Kain, 2019: Postprocessing and visualization techniques for convection-allowing ensembles. Bull. Amer. Meteor. Soc., 100, 1245–1258, https://doi.org/10.1175/BAMS-D-18-0041.1.
Roberts, B., B. T. Gallo, I. L. Jirak, A. J. Clark, D. C. Dowell, X. Wang, and Y. Wang, 2020: What does a convection-allowing ensemble of opportunity buy us in forecasting thunderstorms? Wea. Forecasting, 35, 2293–2316, https://doi.org/10.1175/WAF-D-20-0069.1.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206–215, https://doi.org/10.1038/s42256-019-0048-x.
Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 3397–3418, https://doi.org/10.1175/MWR-D-16-0400.1.
Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp., https://doi.org/10.5065/D68S4MVH.
Skinner, P. S., and Coauthors, 2018: Object-based verification of a prototype Warn-on-Forecast System. Wea. Forecasting, 33, 1225–1250, https://doi.org/10.1175/WAF-D-18-0020.1.
Smith, T. M., and Coauthors, 2016: Multi-Radar Multi-Sensor (MRMS) severe weather and aviation products: Initial operating capabilities. Bull. Amer. Meteor. Soc., 97, 1617–1630, https://doi.org/10.1175/BAMS-D-14-00173.1.
Stensrud, D. J., J.-W. Bao, and T. T. Warner, 2000: Using initial condition and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. Mon. Wea. Rev., 128, 2077–2107, https://doi.org/10.1175/1520-0493(2000)128<2077:UICAMP>2.0.CO;2.
Stensrud, D. J., and Coauthors, 2009: Convective-scale Warn-on-Forecast System. Bull. Amer. Meteor. Soc., 90, 1487–1500, https://doi.org/10.1175/2009BAMS2795.1.
Stensrud, D. J., and Coauthors, 2013: Progress and challenges with Warn-on-Forecast. Atmos. Res., 123, 2–16, https://doi.org/10.1016/j.atmosres.2012.04.004.
Tennekes, H., A. P. M. Baede, and J. D. Opsteegh, 1986: Forecasting forecast skill. Workshop on Predictability in the Medium and Extended Range, Shinfield Park, Reading, ECMWF, 277–302, https://www.ecmwf.int/node/12652.
Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319, https://doi.org/10.1175/1520-0493(1997)125<3297:EFANAT>2.0.CO;2.
Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. J. Atmos. Sci., 60, 1140–1158, https://doi.org/10.1175/1520-0469(2003)060<1140:ACOBAE>2.0.CO;2.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. Mon. Wea. Rev., 135, 118–124, https://doi.org/10.1175/MWR3280.1.
Wheatley, D. M., K. H. Knopfmeier, T. A. Jones, and G. J. Creager, 2015: Storm-scale data assimilation and ensemble forecasting with the NSSL experimental Warn-on-Forecast System. Part I: Radar data experiments. Wea. Forecasting, 30, 1795–1817, https://doi.org/10.1175/WAF-D-15-0043.1.
Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292–3302, https://doi.org/10.1175/1520-0493(1998)126<3292:TRBESA>2.0.CO;2.
Wilson, K. A., and Coauthors, 2023: The NOAA Weather Prediction Center’s use and evaluation of experimental Warn-on-Forecast System guidance. J. Oper. Meteor., 11, 82–94, https://doi.org/10.15191/nwajom.2023.1107.
Wobus, R. L., and E. Kalnay, 1995: Three years of operational prediction of forecast skill at NMC. Mon. Wea. Rev., 123, 2132–2148, https://doi.org/10.1175/1520-0493(1995)123<2132:TYOOPO>2.0.CO;2.
Wu, J., X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, and S.-H. Deng, 2019: Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol., 17, 26–40, https://doi.org/10.11989/JEST.1674-862X.80904120.
Zhang, Y., F. Zhang, D. J. Stensrud, and Z. Meng, 2016: Intrinsic predictability of the 20 May 2013 tornadic thunderstorm event in Oklahoma at storm scales. Mon. Wea. Rev., 144, 1273–1298, https://doi.org/10.1175/MWR-D-15-0105.1.
Ziehmann, C., 2001: Skill prediction of local weather forecasts based on the ECMWF ensemble. Nonlinear Processes Geophys., 8, 419–428, https://doi.org/10.5194/npg-8-419-2001.