Real-time prediction of storm longevity is a critical challenge for National Weather Service (NWS) forecasters. These predictions can guide forecasters when they issue warnings and implicitly inform them about the potential severity of a storm. This paper presents a machine-learning (ML) system that was used for real-time prediction of storm longevity in the Probabilistic Hazard Information (PHI) tool, making it a Research-to-Operations (R2O) project. Currently, PHI provides forecasters with real-time storm variables and severity predictions from the ProbSevere system, but these predictions do not include storm longevity. We specifically designed our system to be tested in PHI during the 2016 and 2017 Hazardous Weather Testbed (HWT) experiments, which are a quasi-operational naturalistic environment. We considered three ML methods that have proven in prior work to be strong predictors for many weather prediction tasks: elastic nets, random forests, and gradient-boosted regression trees. We present experiments comparing the three ML methods with different types of input data, discuss trade-offs between forecast quality and requirements for real-time deployment, and present both subjective (human-based) and objective evaluation of real-time deployment in the HWT. Results demonstrate that the ML system has lower error than human forecasters, which suggests that it could be used to guide future storm-based warnings, enabling forecasters to focus on other aspects of the warning system.
Accurately predicting storm longevity in real time is a critical task for National Weather Service (NWS) forecasters in a variety of situations. When issuing a warning, forecasters can use longevity predictions to guide the spatial and temporal extent of the warning. Likewise, forecasters in an outbreak situation can use longevity predictions to help focus their attention on the storms likely to last longer. Longevity prediction is also an important task for air travel, as convection closes access points to the airport, leading to long delays in the system as planes are delayed or rerouted (MacKeen et al. 1999). Improved longevity prediction at airports would have economic benefit for both airlines and customers.
The NWS project called Forecasting a Continuum of Environmental Threats (FACETs; Rothfusz et al. 2014, 2018) will change the current watch and warning system in two ways. First, it will evolve from deterministic to probabilistic forecasts; second, it will ensure a smooth transition of forecast information across spatiotemporal scales. The Probabilistic Hazard Information tool (PHI; Karstens et al. 2015, 2017, 2018) is a key part of this new paradigm, focusing on storm-based prediction. The system we propose was tested in PHI during the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT; Clark et al. 2012; Gallo et al. 2017), which is a quasi-operational naturalistic environment.
Thus, the goal of this research is to produce a longevity-prediction system that improves upon the existing predictions within PHI, using data and resources available in real time. This goal changes the approach from a purely research-based (hindcasting) system to a quasi-operational system, which adds constraints. For example, in an operational or hindcasting system, additional and higher-quality data would likely be available to improve the predictions. However, we do not study this effect, since the goal is to produce a quasi-operational system.
Most prior research on forecasting storm longevity consists of modeling studies, though some have been case studies of individual storms or outbreaks. In all cases, it has proven to be a challenging task. Within the modeling domain, it has been demonstrated that small differences in soundings can produce vastly different storm longevities. For example, Elmore et al. (2002) show similar soundings that produce storms with very different longevities, as well as very different soundings that produce storms with similar longevities. Brooks (1992) and Lilly (1990) show differences in longevity with small changes in the thermal bubble used to initiate the simulations. While not studying longevity, Dahl (2014) demonstrates that storm-scale vortices are highly sensitive to small perturbations in the initial conditions of a numerical simulation.
Other modeling studies have examined the influences of different environmental processes on storm longevity. Thorpe et al. (1982) demonstrate that strong low-level shear is needed to create long-lived storms. Rotunno et al. (1988) demonstrate that long-lived squall lines are dependent on the interaction of low-level shear and the surface cold pool. Weisman and Klemp (1982, 1986) demonstrate that wind shear and buoyancy are critical to both storm mode and longevity. Houston and Wilhelmson (2011) numerically study the issue of storm longevity in a low-shear environment and demonstrate that a deep cold pool is crucial for sustaining long-lived storms in low-shear environments. Parker (2007) shows similar results in a moderate-shear environment. Shear is one of the environmental parameters available to our ML algorithm. Cintineo and Stensrud (2013) examine the predictability of supercells in simulation, under a variety of different initial conditions. They do not specifically examine lifetime except to note that supercells have predictable lifetimes of around 90 min. Many other storm features are extremely sensitive to the initial conditions and cannot be predicted beyond 2 h in advance.
Although fewer in number, there have been some observational studies of storm longevity. Bunkers et al. (2002) study long-lived supercells, specifically focusing on storms that last longer than four hours. They show that high wind shear and isolation from other convection are crucial for such long-lived storms. Wilson and Megenhardt (1997) focus on storms near Cape Canaveral, Florida, examining the relationship between wind shear and the convergence zone that often causes Florida storms.
Another approach to forecasting longevity comes from algorithms such as the Thunderstorm Identification, Tracking, Analysis, and Nowcasting system (TITAN; Dixon and Wiener 1993; Li et al. 2012; Wolfson et al. 1994). Both Li et al. (2012) and Wolfson et al. (1994) use machine learning (ML) to automate part of the tracking and identification process, but neither uses ML to predict storm longevity.
MacKeen et al. (1999) is most related to our work, as they use linear regression to predict storm longevity. Their database consists of 879 storms—some single cells and some multicell clusters—near Memphis, Tennessee. Specifically, they apply both univariate and multivariate linear regression to radar- and sounding-derived variables. They demonstrate that automating the prediction of storm longevity is difficult, because there is no clear correlation between storm longevity and any set of radar- or sounding-derived variables.
This work is unique in several aspects. First, the data comprise multiple years of observations across the full continental United States (CONUS). Previous observational studies have focused on specific storms or regions of the United States. Second, the predictions were used in a real-time system: the PHI Experiment (Karstens et al. 2015, 2017), which was part of the NOAA HWT, in spring 2016 and 2017. Third, the predictions are generated by ML. A very preliminary version of this work appeared in McGovern et al. (2017). This paper represents a significant extension, in both the sophistication of the methods and the subjective and objective analysis of the results. This paper also adds HWT 2017 to the testing data.
Data used for this project fall into two categories. The first is training data, used to build and objectively evaluate the ML models. The second is human data, consisting of both subjective and objective evaluations from the human forecasters in HWT 2016 and 2017.
a. Training data
Because the goal of this project is to create models that can eventually be transitioned to full-time operations and to evaluate those models in a quasi-operational naturalistic environment in the HWT, the training data must be available in real time. The main source of training data is ProbSevere (Cintineo et al. 2014), a real-time decision-support system for severe convection. Its main components are automated storm tracking and ML. Storm tracking is performed with segmotion (Lakshmanan et al. 2009), an algorithm in the Warning Decision Support System–Integrated Information (WDSS-II; Lakshmanan et al. 2007) software package. The tracking variable used in ProbSevere is composite reflectivity from the Multi-Radar Multi-Sensor (MRMS; Smith et al. 2016) system, which is updated every ~2 min. Storm objects¹ are defined as areas ≥ 20 km² with composite reflectivity ≥ 35 dBZ.
ProbSevere’s ML is done with a naive-Bayes classifier (NBC), which predicts the probability of severe weather (wind gust ≥ 25.7 m s⁻¹ or hail diameter ≥ 25.4 mm or tornado) for each storm cell. These predictions are not temporally specific (e.g., a 25% tornado prediction means the system is 25% confident that the storm will produce a tornado at some time in the future) and ProbSevere does not automatically draw warning polygons. Thus, automated guidance on storm longevity should help the human forecasters in drawing warning polygons and focusing their attention (e.g., larger polygons and higher priority for longer-lived storms). ProbSevere’s ML uses five predictors, including MUCAPE, bulk shear, and maximum estimated size of hail (MESH). Our ML uses these and all other variables listed in Table 1.
We hypothesized that data pertaining to the near-storm environment (NSE), in addition to the storm itself, would improve predictions. Thus, we interpolate soundings from the Rapid Refresh model (RAP; Benjamin et al. 2016) to the center of each storm object. The interpolation method is nearest neighbor in space and previous neighbor in time: the entire sounding is taken from one grid cell of the most recent RAP analysis (0-h forecast), which preserves physical consistency among the sounding variables. Then we compute 97 sounding indices for each storm object, using the Sounding and Hodograph Analysis and Research Program in Python (SHARPpy; Blumberg et al. 2017). The sounding indices include convective available potential energy (CAPE), convective inhibition (CIN), the supercell composite parameter (SCP), and many others with which forecasters are familiar. For a detailed description of all 97 indices, see Table A1 of Lagerquist et al. (2017). We omit the detailed description here because, due to limited computing resources, we cannot use the sounding indices in real time.
In preliminary work we also investigated the use of radar-derived variables from the MRMS. Although there is tremendous value in real-time radar data, much of the information therein is subsumed by the ProbSevere variables. Thus, using MRMS data added little predictive performance, which we determined not to be worth the additional processing time. If this were an operational or hindcasting system, not a quasi-operational one, using all available data would likely be more important.
Every ML model needs predictors (listed in Table 1) and truth, which is the actual storm longevity. Since ProbSevere includes storm tracking, we initially considered these longevities as true labels. However, it has been observed in the HWT that segmotion (the tracking algorithm used by ProbSevere) creates a large number of storm splits, where it splits what a human meteorologist would call one storm track (Harrison 2018). This is because segmotion is a real-time algorithm, meaning that it can look only at current and past data to make tracking decisions. WDSS-II includes a postevent tracking algorithm, called best track (Lakshmanan et al. 2015). “Postevent” means that best track is run in hindcast mode, allowing it to use both past and future data for tracking decisions, which reduces the frequency of track splitting. Thus, we use best track to create true labels of storm longevity.
As expected, best track significantly increases the storm longevities in our dataset, as shown in Fig. 1 for spring 2015 and 2016. For example, ~80% of ProbSevere storms live less than 60 min, which best track decreases to ~70%. We use best track only to create labels and not for predictors, which must use real-time data only. In HWT 2016, storm tracks (and thus, all predictors listed in Table 1) were taken directly from ProbSevere. In HWT 2017, storm tracks were altered by a real-time version of best track (Harrison 2018) leading to the “postprocessed” variables in Table 1.
We acquired data for every ProbSevere storm from 9 April 2015 to 24 March 2017. To make training more computationally feasible, we use only storms from April to June 2015 and 2016. We focus on the spring season, because (i) the HWT occurs in spring and (ii) the temporal frequency and spatial coverage of storm-damage reports are climatologically maximized in spring (Kelly et al. 1985). Storms are split into training, validation, and testing (Table 2). The purpose of training is to optimize the parameters of the ML model [e.g., the linear-regression coefficients in Eq. (1)]; the purpose of validation is to find the best hyperparameters (user-selected values that remain constant throughout training, such as the number of trees in a random forest); and the purpose of testing is to evaluate the chosen model on unseen data, which provides a reasonable expectation of future performance “in the wild” (e.g., in the HWT).
b. NOAA HWT data
Figure 2 shows a screenshot of the PHI tool (Karstens et al. 2015), which highlights the importance of longevity predictions to the forecasters. Longevity is used explicitly to determine the temporal extent of a warning/advisory and the spatial extent of a warning/advisory polygon. It is also used by a separate ML algorithm that predicts the probability of severe weather for each storm cell for the predicted lifetime of the storm (Harrison 2018). In the absence of a prediction algorithm, PHI originally used a constant longevity prediction (60 min remaining), which is replaced with our ML system for the tests described in this paper.
ML predictions were integrated into PHI during HWT 2016 and 2017 (Karstens et al. 2018; Ling et al. 2017). Each year, the ML system was tested for three weeks by three NWS forecasters each week. The ML system used in the PHI experiments was an ensemble of gradient-boosted regression trees (GBRT), as described in section 3, and predictions were capped at 120 min due to logistical considerations. The ML models were trained on different data for 2016 versus 2017 (since 2017 data were not available in 2016), but the training procedures (section 3) were the same. Although the ML models whose results are presented in this paper (RMSE, reliability, etc., shown below) are not the same as the models trained for the HWT, the differences are small enough that the analysis of the results would be the same. Forecasters used the predictions in two settings: (i) displaced real-time events, where the forecasters receive pseudo-real-time data for a historical severe-thunderstorm outbreak, focusing on a single county warning area (CWA); and (ii) actual real-time events, occurring in the late afternoon and evening. A listing of these events and testing periods is provided in Karstens et al. (2018). In both settings the ProbSevere data, augmented with the longevity predictions described in this paper and other ML products (e.g., Lagerquist et al. 2017; McGovern et al. 2018), were provided to the forecasters as a first guess for issuing severe thunderstorm warnings or subsevere products (“advisories”) at their discretion.
For each storm, forecasters expressed their level of confidence that severe wind or hail would occur within a certain time window (e.g., “0–45 min into the future”). The default time window was the storm longevity predicted by the ML system. In this work we objectively evaluate forecasters’ use of the ML longevity predictions—specifically, whether or not they modified the predictions, which is possible because their activities were logged.
3. Methods: Training and evaluation of machine learning
For the ML methods described in this paper, we use the implementation in Python’s scikit-learn library (Pedregosa et al. 2011). We chose three algorithms, based on prior experience using ML for weather prediction (e.g., McGovern et al. 2017). First, as a baseline method, we chose linear regression with elastic-net regularization (Zou and Hastie 2005). Linear regression produces an equation in the form of Eq. (1), where the xj are predictors; β0 and βj are adjustable coefficients; and f is the resulting prediction, to be compared with the true value y. There are M predictors and N examples:
f = \beta_0 + \sum_{j=1}^{M} \beta_j x_j \qquad (1)
Without regularization, the loss function (minimized by training) is the mean squared error between the predicted and true values [the first term in Eq. (2)]. Elastic-net regularization is a combination of the lasso penalty (Tibshirani 1996), which is the second term in Eq. (2), and the ridge penalty (Hoerl and Kennard 1970, 1988), which is the third term. The variable λ ∈ [0, ∞) determines the amount of regularization, and α ∈ [0, 1] determines the trade-off between the ridge and lasso penalties. Both penalties encourage the model to produce smaller regression coefficients (the βj), and the lasso penalty specifically encourages the model to “zero out” coefficients, which effectively removes predictors from the model. Thus, elastic-net regularization encourages a simpler model, which often generalizes better to unseen situations and noisy data. In ML terminology, regularization mitigates “overfitting” to the training data.
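For reference, one form of the loss function in Eq. (2) that is consistent with the description above is sketched here. It assumes that α multiplies the lasso term and (1 − α) the ridge term, and that no additional constant factors are present; implementations differ on such constants (scikit-learn, for example, includes factors of 1/2), so this should be read as an illustration rather than the exact expression used.

```latex
\begin{equation*}
E = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - y_i\right)^2
  + \lambda\,\alpha\sum_{j=1}^{M}\left|\beta_j\right|
  + \lambda\,(1-\alpha)\sum_{j=1}^{M}\beta_j^2
\end{equation*}
```

Here f_i and y_i are the predicted and true values for the ith example. Under this assumed convention, α = 1 recovers the pure lasso penalty, α = 0 recovers the pure ridge penalty, and λ = 0 recovers unregularized least squares.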
The second and third algorithms are decision tree based. Decision trees are popular in many applications, because they can identify the most important predictors and produce human-readable models. Figure 3 shows an example of a regression tree from one of the trees in our trained random forest. Because of the size of the full tree, this is just a small subset chosen for illustration. It shows the yes and no branches of the tree as well as the questions identified by the tree growing algorithm. At each leaf node, a regression tree predicts a constant value, which is shown in the rectangular nodes.
Decision-tree ensembles, such as random forests and GBRT, have recently been successful in many meteorological applications (Williams et al. 2008a,b; Gagne et al. 2009; McGovern et al. 2014; Williams 2014; McGovern et al. 2015; Clark et al. 2015; Elmore and Grams 2016). Ensembles usually have smaller bias and variance (mean squared error) than a single decision tree. In a random forest (Breiman 2001), each tree is trained with a bootstrap-resampled (Efron 1979) version of the training set. Thus, on average each tree sees only 63.2% of examples in the training set, which encourages diversity among the trees and improves the performance of the final ensemble. Conversely, in GBRT (Friedman 2002), each tree is trained on the full training set. However, the kth tree is fit to the residual error from the first k − 1 trees, rather than fitting to the true label (observed longevity) as in random forests. Also, examples with the largest residuals are weighted the most heavily (Schapire 2003), which encourages the GBRT to improve its worst predictions.
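The 63.2% figure follows from sampling with replacement: the probability that a given example is never drawn in N draws is (1 − 1/N)^N, which approaches 1/e ≈ 0.368 as N grows. The sketch below checks this by simulation with the Python standard library; the training-set size and number of replicates are hypothetical values chosen for illustration, not those of this study.

```python
import random


def bootstrap_unique_fraction(n_examples, rng):
    """Draw one bootstrap replicate (n draws with replacement) and return
    the fraction of distinct training examples that it contains."""
    drawn = {rng.randrange(n_examples) for _ in range(n_examples)}
    return len(drawn) / n_examples


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_examples = 10_000     # hypothetical training-set size
n_trees = 50            # hypothetical number of trees (one replicate each)

fractions = [bootstrap_unique_fraction(n_examples, rng) for _ in range(n_trees)]
mean_fraction = sum(fractions) / n_trees

# Closed-form expectation: 1 - (1 - 1/N)^N, which tends to 1 - 1/e.
expected_fraction = 1.0 - (1.0 - 1.0 / n_examples) ** n_examples

print(f"simulated: {mean_fraction:.3f}, expected: {expected_fraction:.3f}")
```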
We also experiment with bias-correcting predictions for each model, using isotonic regression (Niculescu-Mizil and Caruana 2005). Isotonic regression learns a stepwise function that maps from the base model’s predictions to calibrated predictions, in a way that minimizes mean squared error [the first term in Eq. (2)]. The sole input (predictor) for isotonic regression is the base model’s prediction, and the sole output is the calibrated prediction. The training set for isotonic regression must be independent of that for the base model (if the two training sets are the same, isotonic regression will be trained only on cases for which the base model performs uncharacteristically well, so it will learn to calibrate only uncharacteristically good predictions). In this work we use the validation data (Table 2) to train isotonic regression.
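To illustrate how such a stepwise calibration function can be learned, the following is a minimal sketch of the pool-adjacent-violators procedure, a standard way to fit isotonic regression. It is not the implementation used in this work (scikit-learn provides its own `IsotonicRegression` class); the function names and the five validation pairs are hypothetical.

```python
from bisect import bisect_right


def fit_isotonic(base_predictions, observations):
    """Pool-adjacent-violators fit. Returns (thresholds, values) describing a
    non-decreasing step function that minimizes squared error."""
    pairs = sorted(zip(base_predictions, observations))
    # Each block holds [sum of observations, count, smallest base prediction].
    blocks = []
    for x, y in pairs:
        blocks.append([y, 1, x])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(prediction, thresholds, values):
    """Map one base-model prediction to its calibrated (stepwise) value."""
    index = max(bisect_right(thresholds, prediction) - 1, 0)
    return values[index]


# Hypothetical validation pairs (minutes): base-model prediction, observation.
base = [20.0, 40.0, 60.0, 80.0, 100.0]
observed = [15.0, 50.0, 45.0, 70.0, 90.0]

thresholds, values = fit_isotonic(base, observed)
calibrated_60 = calibrate(60.0, thresholds, values)
```

In this toy case the pairs at 40 and 60 min violate monotonicity (observations of 50 and 45 min), so they are pooled into a single step whose value is their mean.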
When evaluating forecast quality (the performance of an ML algorithm), we compare to two non-ML baselines. The first is the constant method (originally used in PHI), which is to predict a remaining longevity of 60 min for all storms. Although this baseline is easy to outperform, it is important to establish that we have improved upon the previously used method. The second baseline is “persistence,” where the remaining longevity of the storm is predicted to equal its current longevity. This is also known as the Lindy effect (Goldman 1964). For example, if the storm is 15 min old, persistence predicts that it will last another 15 min, so its total longevity will be 30 min.
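The two baselines are simple enough to state directly in code. The sketch below uses hypothetical function names and works in minutes; it reproduces the worked example in which a 15-min-old storm is predicted by persistence to last another 15 min.

```python
def constant_baseline(current_age_min, remaining_min=60.0):
    """Constant method: predict the same remaining longevity for every storm."""
    return remaining_min


def persistence_baseline(current_age_min):
    """Persistence: predict remaining longevity equal to the current age."""
    return current_age_min


current_age_min = 15.0
remaining_constant = constant_baseline(current_age_min)
remaining_persistence = persistence_baseline(current_age_min)
total_persistence = current_age_min + remaining_persistence
```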
Performance is measured in four ways. First, we measure the root-mean-squared error (RMSE), which is $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i - y_i)^2}$, with variables defined as in Eq. (2). Second, we compare the predicted and observed cumulative distribution functions (CDFs) of storm longevity, which helps to identify model bias, both in an overall sense and within certain longevity ranges. Third, we compare the predicted and observed probability density functions (PDFs) of storm longevity. For this we use violin plots², which synthesize the information shown in a typical PDF and a typical boxplot. Fourth, we plot reliability curves, which show the mean observed longevity (y axis) for each bin of predicted longevity (x axis). A perfect reliability curve is the line x = y, which means that the conditional expected value is always the predicted value (given a predicted longevity of T seconds, the mean observation is always T seconds). The main purpose of the reliability curve is to identify conditional bias, or bias within certain ranges of the prediction space.
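As an illustration of the first and fourth measures, the sketch below computes the RMSE and a binned reliability curve in plain Python. The 30-min bin width and the five prediction/observation pairs are assumptions made for this example only; they are not values from the study.

```python
import math


def rmse(predicted, observed):
    """Root-mean-squared error between predictions and observations."""
    n = len(predicted)
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(predicted, observed)) / n)


def reliability_curve(predicted, observed, bin_width_min=30.0):
    """Return {bin start: (mean predicted, mean observed)} for each occupied
    bin of predicted longevity. Perfect reliability gives equal means."""
    sums = {}
    for f, y in zip(predicted, observed):
        bin_start = bin_width_min * math.floor(f / bin_width_min)
        total_f, total_y, count = sums.get(bin_start, (0.0, 0.0, 0))
        sums[bin_start] = (total_f + f, total_y + y, count + 1)
    return {
        bin_start: (total_f / count, total_y / count)
        for bin_start, (total_f, total_y, count) in sorted(sums.items())
    }


# Hypothetical predicted and observed longevities (minutes).
predicted = [10.0, 20.0, 40.0, 50.0, 70.0]
observed = [20.0, 10.0, 30.0, 70.0, 70.0]

error = rmse(predicted, observed)
curve = reliability_curve(predicted, observed)
```

A bin whose mean observation exceeds its mean prediction indicates underforecasting within that range of predictions, and the reverse indicates overforecasting.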
Our random forests and GBRT ensembles each contain hundreds of trees, which partly impairs human readability (although each tree alone is human readable, reading them all would take many hours and synthesizing the information thereby gleaned would be nearly impossible). The permutation method introduced by Breiman (2001) partly circumvents this problem, by quantifying the importance of each predictor. Specifically, for each predictor xj, the training values of xj are randomly permuted across examples, yielding a perturbed training set. Then an already-trained model (random forest or GBRT) is used to generate predictions for the perturbed training set. The “importance” of predictor xj is defined as the mean squared error [MSE; first term in Eq. (2)] on the perturbed training set, minus the MSE on the original training set. Thus, the most “important” predictors are those whose random permutation leads to the greatest decrease in performance (increase in MSE). Predictors can be ranked by importance, which allows some human insight into the workings of the model.
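The permutation method can be sketched in a few lines. In the example below, `toy_model` is a hypothetical stand-in for an already-trained model (it uses only the first predictor), and the data are synthetic; the point is only to show that permuting a predictor the model relies on raises the MSE, while permuting one it ignores does not.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a model over a list of predictor rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, rng):
    """MSE after shuffling one predictor column, minus MSE on the intact data."""
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return mse(model, permuted_rows, targets) - mse(model, rows, targets)


def toy_model(row):
    """Hypothetical trained model: depends on predictor 0 only."""
    return 30.0 + 0.5 * row[0]


rng = random.Random(0)
# Synthetic data: predictor 0 is informative, predictor 1 is pure noise.
rows = [[float(age), rng.uniform(0.0, 50.0)] for age in range(0, 100, 5)]
targets = [toy_model(row) for row in rows]

importance_used = permutation_importance(toy_model, rows, targets, 0, rng)
importance_unused = permutation_importance(toy_model, rows, targets, 1, rng)
```

Because a single shuffle is random, repeating the permutation several times and averaging the results gives a more stable ranking.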
The training of elastic nets requires very little computing time, and they often perform nearly as well as more sophisticated methods. For these reasons we use elastic nets in our first experiment, to determine which predictors are necessary for training. Specifically, we train elastic nets with and without NSE data (section 2a), with and without temporal data (where the predictors in Table 1 are computed for both the current and previous time steps of the storm), and with and without bias correction (section 3). This yields eight models (2 × 2 × 2), for which the performance is shown in Fig. 4. All results in Figs. 4 and 5 are computed on the testing data, as detailed in Table 2. To create a distribution, such as shown in Figs. 4b and 5b, the RMSE is computed on each day in the testing set independently. Bootstrapping with 1000 replicates is used to create distributions.
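One plausible reading of this procedure is sketched below: the per-day RMSE values are resampled with replacement, and the mean of each replicate forms the distribution. This interpretation, the function name, and the daily values are assumptions for illustration; the text does not specify the exact resampling unit or summary statistic.

```python
import random


def bootstrap_means(daily_rmse, n_replicates=1000, seed=0):
    """Resample daily RMSE values with replacement and return the mean of
    each replicate, giving an empirical distribution of the mean RMSE."""
    rng = random.Random(seed)
    n_days = len(daily_rmse)
    means = []
    for _ in range(n_replicates):
        replicate = rng.choices(daily_rmse, k=n_days)
        means.append(sum(replicate) / n_days)
    return means


# Made-up daily RMSE values (minutes), one per testing day.
daily_rmse = [31.0, 28.5, 35.2, 30.1, 27.8, 33.4]

replicate_means = sorted(bootstrap_means(daily_rmse))
# Approximate 2.5th and 97.5th percentiles of the 1000 replicate means.
lower, upper = replicate_means[24], replicate_means[974]
```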
Figure 4a shows the CDFs of observed storm longevity, predicted longevity from the eight elastic nets, and predicted longevity from the two baselines. The eight elastic nets are clustered together so tightly that they are almost impossible to distinguish. The same is true in all panels of Fig. 4, where the elastic nets have nearly identical errors but are clearly distinguishable from the baselines.
Three conclusions can be drawn from Fig. 4. First, NSE and temporal data yield very little performance gain. Also, computing NSE and temporal data takes about four minutes for each ProbSevere update (updates come at intervals of about two minutes and are accompanied by MRMS data). This latency is unacceptable, given that (i) many other storm-based predictions are included in the ProbSevere data, so their latency is four minutes less; and (ii) thunderstorms evolve quickly, so using 4-min-older predictions can be a serious disadvantage. Thus, we chose the smallest predictor set (Table 1) for deployment in the HWT. The lack of improvement with NSE data may be surprising, but (i) these values have little temporal variance relative to the radar-derived predictors in ProbSevere, which can change significantly with each 2-min update; and (ii) some ProbSevere predictors (MUCAPE and bulk shear) are already based on NWP soundings.
Second, although bias correction has proven important in prior experiments with this and other meteorological data, it did not provide any performance gain for this problem. Since bias correction adds computing time, which is cumbersome for real-time deployment, we did not use it in the HWT or the remaining experiments.
Third, comparing to the two baseline methods (60 min and persistence), Fig. 4 indicates the need for learning. Elastic nets with all eight parameter settings outperform the baseline methods, according to all performance metrics. As shown in Figs. 4a and 4d, persistence has a strong underforecasting bias, while the constant method has no resolution at all. As shown in Fig. 4c, although the baselines have a similar RMSE for some values of predicted longevity, their RMSE is generally much higher than the ML models.
Given the recent success of decision-tree ensembles in meteorology (section 3), we hypothesized that they would perform the best of the ML models. Figure 5 compares random forests and GBRT ensembles to elastic nets. Each model is trained with only the predictors in Table 1. We used cross validation (not shown) to choose the number of trees per ensemble and maximum depth per tree. However, in choosing these values, we had to consider the computing time needed to run models in a quasi-operational naturalistic environment. There may be hundreds of storm cells in the CONUS at one time, and applying a large model to all storms could take too long. We empirically decided on 250 trees per ensemble (random forest or GBRT), with a maximum depth of five branch nodes per tree. Deeper trees or larger forests could slightly improve predictive power but at a significant computational cost, limiting their use in real time. Elastic nets used the default hyperparameters.
As shown in Fig. 5, both types of decision-tree ensembles improve the predictions, especially for short-lived storms. They are capable of predicting longevities of 10–30 min, whereas elastic nets rarely predict less than 30 min. Figure 5 shows that, for most values of predicted longevity, the three ML models unanimously outperform the baselines and that, while the three ML models have near-perfect reliability, the decision-tree ensembles are more reliable at the lower end of the range (improve on the elastic net’s overforecasting bias) and the elastic net is more reliable at the higher end of the range (improves on the decision trees’ underforecasting bias). We chose the GBRT for HWT deployment, since it (i) predicts short-lived storms better than the elastic net and (ii) produces smoother error graphs than the random forest.
Figure 6 shows forecasters’ usage of the predictions in 2016 and 2017. Most evident is a drastic reduction in usage frequency (by “all forecasters”), from ~75% in 2016 to ~40% in 2017. To explain this reduction, Fig. 7 shows, for each year, the distribution of all ML-predicted longevities, those that were modified by humans (before modification), and those that were unmodified. The ML predictions (and observed storm durations) were generally higher in 2017 than in 2016, and in both years forecasters preferentially modified the higher predictions. This action is also evident when the distributions are split into storms for which the forecaster issued a warning or advisory (Fig. 7).
Figure 7 shows how the ML predictions were modified, as well as how the modified and unmodified distributions compare to observations. In 2017 both the ML predictions and observations were longer than in 2016, which seems to justify the longer ML predictions. In 2016 human modification brings the median prediction closer to the median observation, but in 2017 it has the opposite effect, exacerbating the underforecasting bias of the ML system. However, in both years the ML system has slightly lower error than the humans.
Despite the large interannual difference between ML predictions, there is little interannual difference between the modified predictions. This implies that human forecasters have a preferred range of longevity values and prefer not to adjust this range much to accommodate new situations. Similar behavior is shown in Fig. 8, especially in 2016, where the predictions are generally adjusted downward for storms with warnings and upward for storms with advisories, resulting in very similar postmodification distributions.
Figure 9 (from Harrison and Karstens 2017) shows the duration distribution for storm-based warnings (SBW). By comparison with Fig. 7, human-modified storm longevity has a very similar distribution to severe thunderstorm warnings. Thus, the human forecasters’ tendency to more frequently change/reduce longevity predictions in 2017 was likely caused in part by their application of SBW-era training to the HWT experiment.
Using the permutation method (section 3), we examined differences in predictor importance across the three models (elastic net, random forest, and GBRT). The models are consistent in their ranking of the most important variables. All three models choose the current storm longevity as the most important variable. This is not surprising, especially given that the persistence method (which uses only current longevity as a predictor) performs reasonably (Figs. 4 and 5). The second-most important variable (again, chosen unanimously by all three models) is MESH within the storm. This also makes sense intuitively, as storms with large hail tend to have stronger updrafts and to be longer lived. Shear was identified by many of the studies discussed in the introduction as the most important parameter and it shows up indirectly in the most important variables.
Figure 7 shows that human modification of the ML predictions slightly increased the error, implying that forecasters are perhaps better served by using the raw ML prediction for storm longevity. However, at least two outstanding questions remain. First, these distributions are based on a large number of cases, and there may be certain situations (e.g., storm modes, mesoscale regimes, or synoptic regimes) where human modification generally improves the predictions. A more detailed analysis could reveal these regimes and other aspects of human-machine interdependence that we have not considered. Second, the ML predictions for warned storms (those associated with a warning) were generally greater than those for subsevere storms (those associated with advisories), and forecasters modified the ML predictions for warned storms more often (Fig. 7). It makes sense that stronger, more organized storms last longer, while weaker ones remain more transient. However, this rationale considers only the temporal dimension of storm longevity. Uncertainty in the spatial dimension (i.e., the forecast location of a storm) increases with lead time, which may be one reason that forecasters compressed longevity predictions into a smaller range with sound empirical basis from the SBW era, especially for warning decisions. This insight implies that spatially joining a warning with an advisory (i.e., warning for shorter lead times and advising for longer lead times thereafter) may be a viable way to transition between the current and FACETs warning paradigms, which bifurcates the plume into warning duration and storm longevity, respectively.
We have demonstrated that machine learning can be used for real-time prediction of storm longevity and provide valuable information to forecasters. As additional storm data become available in real time, and as we develop faster and more sophisticated ways to process said data, the performance of the ML system should improve. For example, although MRMS data are available in real time, processing is slow (the grids are CONUS-wide with 0.01° spacing) and our processing methods led to minimal performance gain. We anticipate that data from high-resolution convection-allowing models, and new sensing systems such as the Geostationary Operational Environmental Satellite-R Series, will allow significant performance gains. Furthermore, since HWT 2016 and 2017, the algorithms for computing velocity-derived variables (azimuthal shear and convergence) in MRMS have improved. All other MRMS variables are reflectivity-derived, so azimuthal shear and convergence may contain valuable new information. Last, we are currently working on a real-time system for classifying storm mode (e.g., supercell, multicell cluster, linear system), which will be useful to forecasters and may provide a valuable input to the longevity model.
The authors thank David Harrison for his assistance in refining the best track algorithm used in this study. The authors also thank the anonymous reviewers for their help in refining the manuscript. Funding was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA16OAR4320115, U.S. Department of Commerce. This work was supported by the NEXRAD Product Improvement Program, by NOAA/Office of Oceanic and Atmospheric Research. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of NOAA, the U.S. Department of Commerce, or the University of Oklahoma. The computing for this project was performed at the OU Supercomputing Center for Education and Research (OSCER) at the University of Oklahoma (OU).
¹ One “storm object” is one storm cell at one time step; in other words, a snapshot of a storm cell.