A Machine Learning Approach to Improve the Usability of Severe Thunderstorm Wind Reports

Elizabeth Tirone, Subrata Pal, William A. Gallus Jr., Somak Dutta, Ranjan Maitra, Jennifer Newman, and Eric Weber, Iowa State University, Ames, Iowa; Israel Jirak, NOAA/Storm Prediction Center, Norman, Oklahoma

Abstract

Many concerns are known to exist with thunderstorm wind reports in the National Centers for Environmental Information Storm Events Database, including the overestimation of wind speed, changes in report frequency due to population density, and differences in reporting due to damage tracers. These concerns are especially pronounced with reports that are not associated with a wind speed measurement, but are estimated, which make up almost 90% of the database. We have used machine learning to predict the probability that a severe wind report was caused by severe intensity wind, or wind ≥ 50 kt (∼25 m s⁻¹). A total of six machine learning models were trained on 11 years of measured thunderstorm wind reports, along with meteorological parameters, population density, and elevation. Objective skill metrics such as the area under the ROC curve (AUC), Brier score, and reliability curves suggest that the best performing model is the stacked generalized linear model, which has an AUC around 0.9 and a Brier score around 0.1. The outputs from these models have many potential uses such as forecast verification and quality control for implementation in forecast tools. Our tool was evaluated favorably at the Hazardous Weather Testbed Spring Forecasting Experiments in 2020, 2021, and 2022.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Elizabeth Tirone, elizabeth.tirone@noaa.gov


Straight-line winds from thunderstorms are one of the most destructive types of weather, as was evident in the August 2020 Midwestern derecho that caused over $12 billion in damage (NOAA 2020). Each year around 15,000 reports of severe wind or wind damage appear in the National Centers for Environmental Information Storm Events Database (NCEI 2023). Severe thunderstorm wind reports include winds from thunderstorms that are 50 kt (1 kt ≈ 0.51 m s⁻¹) or more, or winds less than 50 kt that cause fatalities, injuries, and/or damage (NOAA 2021a).

There are two different types of thunderstorm wind reports (SRs)—measured SRs, which are assigned a wind speed based on a measurement from an approved weather station, and estimated SRs, which are assigned an estimated value based on damage or radar information. Issues with estimated SRs include the overestimation of wind speeds (Edwards et al. 2018), changes in report frequency due to population density (Trapp et al. 2006), and differences in reporting due to differences in damage tracers (Weiss 2002). We have also noted evidence of subjectivity in the assigned values: the number of estimated SRs peaks at an estimate of exactly 50 kt, the threshold for a thunderstorm to be considered severe, a peak that is not present for measured SRs (Fig. 1). While there may be a philosophical argument for alerting the public to thunderstorms producing damage at wind speeds below the 50 kt threshold, the fixed threshold allows for consistency in warnings. A meteorologist can enter wind reports for damage resulting from subsevere wind (<50 kt); however, these reports are not used in the verification of severe thunderstorm warnings and are rare in the database (<1% of all reports) (NOAA 2021b).

Fig. 1.

Percentage of SRs with wind speeds from 50 to 59 kt for estimated (blue) and measured (orange) SRs from 2007 to 2018.


The potential errors in assigned wind speeds can have serious consequences for future research. Wind reports are used for verification of forecasts and forecasting tools, and in various research studies focused on development of new forecast methods and assessing climatological wind risk, which require reliable data (Hitchens and Brooks 2014; Bunkers et al. 2020).

To mitigate these issues, we have created a tool using machine learning (ML) which assigns to SRs a probability that they are caused by winds ≥ 50 kt. Output from our tool has been examined at the Hazardous Weather Testbed (HWT) Spring Forecasting Experiments during 2020–22 where participants evaluated and discussed output valid for the previous day’s wind reports. Our tool has numerous potential uses, and participants in the experiments considered it valuable. This paper describes the ML tool and provides examples of its applications.

Data

The data sources discussed in this section represent everything we used during model development. Numerous types of data were investigated as inputs to the ML algorithms, but some were eliminated from the final product when they did not improve performance as measured by the objective skill metrics discussed later. The impact of adding or removing different data sources, as well as the combination used in the final version of the ML models, is discussed in detail in the results.

To avoid problems with inaccurate wind estimates in the training of the ML models, we trained using only measured SRs from 2007 to 2017, with 2018 used as the test set. Many variables are available for each SR in the Storm Events Database; however, we chose to use only the date and time, latitude and longitude, the text of the episode and event narratives, and the wind speed measurement. To perform ML classification, the wind speeds were used as labels, with measured wind speeds ≥ 50 kt labeled “severe” and wind speeds < 50 kt labeled “subsevere.” Two approaches were taken to represent time: the first treats the year, month, day, hour, and minute as separate variables, and the second applies a Fourier transform to the date to form an additional variable.
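To make the labeling and the two time representations concrete, the Python sketch below uses hypothetical column names (`datetime`, `speed_kt`); the sine/cosine day-of-year pair is one common Fourier-style encoding and only an assumption about the exact transform used.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per measured SR, with a timestamp and a measured speed (kt)
reports = pd.DataFrame({
    "datetime": pd.to_datetime(["2012-06-29 18:45", "2015-07-13 21:10"]),
    "speed_kt": [70, 45],
})

# Classification labels: >= 50 kt is "severe", < 50 kt is "subsevere"
reports["label"] = np.where(reports["speed_kt"] >= 50, "severe", "subsevere")

# Representation 1: separate calendar components as individual variables
for part in ("year", "month", "day", "hour", "minute"):
    reports[part] = getattr(reports["datetime"].dt, part)

# Representation 2: a cyclic (Fourier-style) encoding of the day of year, so that
# 31 December and 1 January end up adjacent in feature space
doy = reports["datetime"].dt.dayofyear
reports["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
reports["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
```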

The episode and event narratives contain information about the meteorological setup for the SR and the damage that was reported, respectively. Because there were many instances of missing event narratives, we combined the event and episode narratives into one text narrative for evaluation. These text data were fed into a statistical text analysis model, the correlated topic model (CTM), which is an extension of the popular latent Dirichlet allocation (LDA) model (Blei et al. 2003; Blei and Lafferty 2007). The CTM analyzes the text to discover “topics” that occur, where a specific topic contains words that are found to be related. One example topic from the CTM contains words such as wind, tree, power, line, limb, large, Kansas, damage, blown, and across; this topic could be summarized as damage to tree limbs and/or power lines in Kansas. A total of 30 topics was chosen to reduce the dimensionality relative to using the raw text. For each SR, the CTM produces the probability that the merged narrative belongs to each topic.
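The CTM itself is not available in common Python toolkits, so the sketch below substitutes scikit-learn’s latent Dirichlet allocation (the model the CTM extends) to illustrate how 30 topic probabilities per merged narrative could be produced; the narratives shown are invented examples, not database text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical merged episode + event narratives, one string per SR
narratives = [
    "numerous trees and power lines were blown down across the county",
    "a 65 kt gust was measured at the airport asos during the squall line",
]

# Bag-of-words representation of the merged narratives
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(narratives)

# LDA stand-in for the CTM: 30 topics, yielding one topic-probability vector per SR
lda = LatentDirichletAllocation(n_components=30, random_state=0)
topic_probs = lda.fit_transform(counts)  # shape (n_reports, 30); each row sums to ~1
```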

While using measured SRs avoided many issues known to exist with estimated SRs, there were still issues that can affect the quality of the ML tool. For instance, on some occasions SRs were assigned a beginning and end time and location to represent a swath of damage/severe wind. To simplify the inclusion of other supplementary data that represent the meteorological environment, we only used reports lasting less than 20 min. In addition, a small fraction of reports were assigned unreasonably low wind speeds (some <10 kt), and these SRs were assumed to be erroneous; we only used SRs with wind speeds ≥ 30 kt in our training set. After quality controlling for these two issues, 2.1% of SRs were eliminated, and the final training dataset included 17,797 SRs while the test set included 2,581 SRs. The locations of the training set are shown in Fig. 2. Similar to what was found by both Smith et al. (2013) and Edwards et al. (2018), more measured reports were located in the central United States.
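A minimal sketch of these two quality-control filters, assuming hypothetical `begin`, `end`, and `speed_kt` columns:

```python
import pandas as pd

# Hypothetical columns: report begin/end times and the assigned wind speed (kt)
reports = pd.DataFrame({
    "begin":    pd.to_datetime(["2010-05-01 20:00", "2010-05-01 20:00"]),
    "end":      pd.to_datetime(["2010-05-01 20:10", "2010-05-01 20:45"]),
    "speed_kt": [55, 8],
})

duration_min = (reports["end"] - reports["begin"]).dt.total_seconds() / 60

# Keep only reports lasting less than 20 min with speeds of at least 30 kt
qc_reports = reports[(duration_min < 20) & (reports["speed_kt"] >= 30)]
```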

Fig. 2.

Locations of measured SRs as denoted by blue dots for 2007–17. Vertical orange lines denote subjective regional borders of the “west,” “central,” and “east” regions of the United States. Measured wind speeds (kt) are represented by the color shading of each SR point.


Based on the idea that an SR is more likely to have been caused by severe intensity wind if other measured SRs with winds of severe intensity occurred nearby in time and space, the distance to the nearest measured severe wind report was added as an input. A filter restricted the spatial and temporal window over which a measured SR was considered for this calculation: SRs had to be within ±11 km and ±20 min.
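A sketch of how the distance to the nearest measured severe SR within the ±11 km and ±20 min window could be computed; the column names and the haversine helper are illustrative assumptions rather than the authors’ code.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (works elementwise on arrays/Series)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_measured_severe(report, measured):
    """Distance (km) to the closest measured severe SR within +/-11 km and +/-20 min.

    `report` is a Series with `time`, `lat`, and `lon`; `measured` is a DataFrame of
    measured severe SRs with the same (hypothetical) column names.
    """
    dt_min = (measured["time"] - report["time"]).abs().dt.total_seconds() / 60
    dist_km = haversine_km(report["lat"], report["lon"], measured["lat"], measured["lon"])
    candidates = dist_km[(dt_min <= 20) & (dist_km <= 11)]
    return candidates.min() if len(candidates) else np.nan  # missing if none qualify
```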

Meteorological data were added to supply information about the near-storm environment at the storm report. Thirty-one parameters from the Storm Prediction Center’s (SPC) mesoanalysis dataset (Bothwell et al. 2002) were collected on a 5 × 5 grid centered on each report. The mesoanalysis data were based on the RUC (before May 2012) and RAP (after May 2012) analyses modified by surface observations via a two-pass Barnes objective analysis and were on a 40 km grid (Bothwell et al. 2002). The meteorological parameters, listed in Table 1, were selected based on their relevance to environments supportive of severe wind. Further discussion of parameters will use the abbreviations listed in Table 1.

Table 1.

A list of the SPC mesoanalysis parameters. Abbreviations, units, and descriptions are listed. Italicized abbreviations represent parameters that were not available for the full period of training. “Surface” refers to 10 m AGL.


Of the 31 mesoanalysis meteorological parameters, 6 did not exist in the earliest years of our dataset, as italicized in Table 1. S6MG was not introduced until 2008, 3KRH was introduced in 2009, and UEIL, VEIL, RH80, and RH70 were introduced in 2013. Due to their importance in severe wind forecasting, a substitution method was applied to SRs missing these variables, replacing each with the most similar available mesoanalysis parameter. Substitutions were as follows (in parentheses): S6MG (U6SV), 3KRH (RHLC), UEIL (U8SV), VEIL (V8SV), RH80 (RHLC), and RH70 (RHLC).

Within the mesoanalysis data, there were instances of values such as −9,999 and 9,999, which represented an inability to calculate the field. To avoid erroneous parameter values, the missing-indicator method was applied, which replaced these values with zero and added an indicator variable alerting the ML model that the zero was a placeholder and not a real value.
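The missing-indicator method can be illustrated with a short helper; the column name and values below are purely illustrative.

```python
import pandas as pd

def add_missing_indicator(df, column, sentinels=(-9999, 9999)):
    """Replace sentinel values with 0 and add a 0/1 flag column.

    The flag tells the ML model that the zero is a placeholder rather than a real value.
    """
    missing = df[column].isin(sentinels)
    df[column + "_missing"] = missing.astype(int)
    df.loc[missing, column] = 0.0
    return df

# Example with a hypothetical mesoanalysis column containing an uncomputable value
meso = pd.DataFrame({"PARAM": [1250.0, -9999.0, 300.0]})
meso = add_missing_indicator(meso, "PARAM")
```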

Although the training initially used all 31 mesoanalysis parameters over the 5 × 5 array of points, having almost 800 features was found to degrade the skill of our predictions, and we thus reduced the dimensionality of our covariates. Among the 31 variables, we retained three (UWND, VWND, SBCP) at all 25 points. Because boundaries play an important role in convective development, these were retained at all points based on their ability to discern surface boundaries: the spatial variations of UWND and VWND are the most likely among all parameters to reflect the presence of a surface boundary, and SBCP was chosen to reflect thermodynamic gradients. For all other mesoanalysis parameters, only the minimum, maximum, and average values were input into the ML algorithms.
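A sketch of this dimension reduction, assuming each SR’s mesoanalysis data arrive as a dictionary of 5 × 5 arrays keyed by the Table 1 abbreviations:

```python
import numpy as np

KEEP_ALL_POINTS = ("UWND", "VWND", "SBCP")  # retained at all 25 grid points

def reduce_mesoanalysis(fields):
    """Collapse a dict of 5x5 mesoanalysis arrays into the reduced feature set.

    `fields` maps a parameter abbreviation to its 5x5 array centered on the SR.
    UWND, VWND, and SBCP keep all 25 values; every other parameter is summarized
    by its minimum, maximum, and mean over the 25 points.
    """
    features = {}
    for name, grid in fields.items():
        if name in KEEP_ALL_POINTS:
            for i, value in enumerate(np.asarray(grid).ravel()):
                features[f"{name}_{i:02d}"] = float(value)
        else:
            features[f"{name}_min"] = float(np.min(grid))
            features[f"{name}_max"] = float(np.max(grid))
            features[f"{name}_avg"] = float(np.mean(grid))
    return features
```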

Feedback from the 2020 Hazardous Weather Testbed Spring Forecasting Experiment prompted the inclusion of some additional data to strengthen the depiction of the spatial setting for each SR. We investigated three additional datasets including elevation, land-use category, and population density. Elevation data were added due to the impact that higher elevations have on some of the meteorological parameters used as input. We believed that the land-use category might play an important role in the resultant damage from severe wind. For example, it may be more likely that wind damage will be reported in an area with abundant trees compared to an area with grassy plains. The Moderate Resolution Imaging Spectroradiometer (MODIS) land category was collected for the location of each storm report (Friedl et al. 2010). Population density has been discussed as a major factor influencing severe weather reporting (Trapp et al. 2006; Smith et al. 2013; Bunkers et al. 2020). It has been found that the number of reports was generally biased toward areas with a higher population density. To account for this fact, the average population density over a 2 km × 2 km area around the storm report was collected using the LandScan population dataset (Oak Ridge National Laboratory 2023).

Because reliability was found to be poor when only the Storm Events Database was used for training (discussed in more detail below), likely because 98.5% of all measured values were at least 50 kt, we tested the impact of including in the training a supplemental dataset of thunderstorm winds under 50 kt. Subsevere wind reports were sampled from Automated Surface Observing System (ASOS) and Automated Weather Observing System (AWOS) sites that were reporting a thunderstorm, were within 32 km and 30 min of estimated SRs, and were impacted by reflectivity of at least 50 dBZ within 32 km and 30 min. These criteria were used to ensure that the weather stations were impacted by strong thunderstorms, likely of the same mode that resulted in an SR. A total of around 8,000 of these subsevere reports were found during the same 2007–17 period as the SRs used in training. Mesoanalysis data and the other environmental data were included with these subsevere reports in training to make them as similar as possible to the other SRs used for training. However, these subsevere data do not contain any text data, and hence we used the best model without text data.

Testing also was done to see the impact of including radar data in the training of the algorithms. Because the single-component limitation of radar velocity data has so far prevented development of a national composite velocity product, we used only radar reflectivity. Because national composites such as GridRad and MRMS did not exist for our entire training period, or the techniques used to create them changed, we developed our own dataset using the same algorithms that had been used for the GridRad products (Bowman and Homeyer 2017). Composite reflectivity data were interpolated to a 2 km × 2 km horizontal grid with a time frequency of 15 min (0, 15, 30, and 45 min past the top of the hour). We used data over a 64 km × 64 km area centered on the SR at the nearest 15 min interval to the time of the SR, along with 30 and 15 min before and after that time. Peak reflectivity, peak gradient in reflectivity, and average reflectivity in the nearest 8 km × 8 km area centered on the SR (a 4 × 4 array of the 2 km × 2 km radar pixels) were used as inputs to the models. Radar data were also used to sample the storm motion over ±15 min centered on the wind report, and the average speed of the most intense radar echo over that time period was used as an additional predictor. The radar predictors were evaluated together using the methods described below.

Machine learning methods

The machine learning models were constructed to classify SRs as being produced by severe or nonsevere wind. The classification probabilities, i.e., the probability that an SR is classified as severe, were then used to evaluate skill and visualize model output. Six ML models were trained and tested on our data; three are base models and the other three are ensemble models constructed from the three base models. The ML models were chosen based on their prior use in meteorology by studies such as Lee et al. (2004), Li et al. (2017), and Jergensen et al. (2020). For a more detailed explanation of ML models and terminology in the context of meteorology, refer to Chase et al. (2022, 2023).

Repeated cross validation (CV) with 6 folds and 36 repetitions was conducted to select hyperparameters for each base model. Possible combinations of hyperparameters were considered and the best combinations were determined from the CV errors. The top hyperparameter combinations will be summarized below, though more information on tested solutions can be found at https://github.com/etirone/SevereWindMachineLearning.

The first base model, the gradient boosted machine (GBM; Friedman 2002), consists of a sequence of decision trees, each of which can be understood as a flowchart with data-driven decision points that aim to maximize the separation between classes at each split. Despite being very interpretable, decision trees are typically weak classifiers (Hastie et al. 2009). GBM uses the nth tree to fit a proportion of the residuals from the prior (n − 1) trees and continues recursively, creating a strong classifier. While this model is itself a typical ensemble model, in our context we define an ensemble model to be an ensemble of different classifier models, and thus, in our case, an ensemble of the “base” models. Given the hyperparameters evaluated with CV, the final parameters were as follows: 250 trees, maximum tree depth of 4, shrinkage of 0.1, and a minimum of 10 observations per node.

The next model, the support vector machine (SVM; Cortes and Vapnik 1995), attempts to separate distinct data classes using a multidimensional plane, or a plane in a transformed space obtained with a kernel (the “kernel trick”), with the goal of maximizing the separation between the classes. For the SVM, the hyperparameter evaluated with CV is the sigma value that controls the linearity of the model, where larger values represent more linear solutions. The selected sigma value was 0.004, the average of the 0.1 and 0.9 quantiles of all the pairwise squared Euclidean distances among inputs (Caputo et al. 2002); this small value favors nonlinearity.

The last base model we used was the artificial neural network (ANN; McCulloch and Pitts 1943; Goodfellow et al. 2016), more specifically a multilayer perceptron (MLP) consisting of neural layers and activation functions. A total of three fully connected layers were selected with the Adam optimizer, with a decreasing number of neurons in each layer. The optimal numbers of neurons in the hidden layers were selected with CV: the first layer had 60 neurons, the second 15, and the third 5, with two output neurons corresponding to the two classes. Furthermore, a dropout rate of 0, a learning rate of 0.001, and the rectified linear unit (ReLU) activation function were selected.
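The stated configuration could be expressed, for example, in Keras as below; this is a sketch of the described architecture, not the authors’ implementation.

```python
import tensorflow as tf

def build_mlp(n_features):
    """Sketch of the selected MLP: hidden layers of 60, 15, and 5 ReLU neurons,
    no dropout, two softmax output neurons, and Adam with a 0.001 learning rate."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(60, activation="relu"),
        tf.keras.layers.Dense(15, activation="relu"),
        tf.keras.layers.Dense(5, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # severe / subsevere
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```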

The ensemble models, or meta models, were built by combining the three base models with the intention of leveraging each model’s strongest aspects. For this purpose, we chose a generalized linear model (GLM; Nelder and Wedderburn 1972), an average ensemble (AVG; Alpaydin 2014), and a random forest (RF; Opitz and Maclin 1999; Breiman 2001). In the ensemble approach with k-fold cross validation, a few (say b) base models are first trained for each fold, giving predicted probabilities as outputs. These values are collated to form an n × b feature matrix, where n is the number of data points. This matrix, along with the original response vector, is called the “level-one” data. In stacking (Wolpert 1992), an ensemble model (e.g., GLM or RF) is trained on this level-one data and can then be used to generate predictions on a test set. The average ensemble is a simpler way to aggregate the base models, using the average of the base models’ predicted probabilities as the final predicted probability of severe wind.
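A minimal stacking sketch with scikit-learn stand-ins for the three base models; the specific estimator settings are assumptions, but the level-one construction follows the description above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def stacked_ensembles(X, y, X_test):
    """Out-of-fold base-model probabilities form the n x b 'level-one' matrix;
    a logistic regression (a GLM with a logit link) is the stacked meta model,
    and a simple column average gives the AVG ensemble."""
    base_models = [
        GradientBoostingClassifier(),                                  # GBM
        SVC(probability=True),                                         # SVM
        MLPClassifier(hidden_layer_sizes=(60, 15, 5), max_iter=500),   # ANN
    ]
    level_one = np.column_stack([
        cross_val_predict(m, X, y, cv=6, method="predict_proba")[:, 1]
        for m in base_models
    ])
    meta_glm = LogisticRegression().fit(level_one, y)

    # Refit the base models on all training data, then stack their test probabilities
    test_level_one = np.column_stack([
        m.fit(X, y).predict_proba(X_test)[:, 1] for m in base_models
    ])
    return meta_glm.predict_proba(test_level_one)[:, 1], test_level_one.mean(axis=1)
```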

The ML models can be evaluated by their predicted classification (severe or not severe) and by their predicted probability (classification probability). A standard 2 × 2 contingency table was constructed using the binary classification. Probability of detection [POD; also known as recall (R)], probability of false detection (POFD), and precision (P) were calculated from the contingency table elements (Wilks 2011; Hitchens and Brooks 2012, 2014). The area under the receiver operating characteristic curve (AUC; Hanley and McNeil 1982) and the area under the precision–recall curve (PR; Davis and Goadrich 2006) were calculated from the plots of POD/POFD and P/R, respectively. For both AUC and PR, the closer the value is to 1, the better the performance. The Brier score (BS; Brier 1950) was computed from the ML predicted probabilities, with values closer to 0 indicating higher accuracy. Reliability diagrams were plotted as observed frequency against forecast probability, with good reliability indicated by points falling along the one-to-one line (Wilks 2011).
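These metrics can be computed with standard library calls, as in the sketch below (average precision is used as a stand-in for the area under the precision–recall curve):

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

def evaluate(y_true, prob_severe):
    """Objective metrics used here: AUC, area under the precision-recall curve
    (approximated by average precision), Brier score, and reliability-curve points
    (observed frequency vs. forecast probability)."""
    obs_freq, fcst_prob = calibration_curve(y_true, prob_severe, n_bins=10)
    return {
        "AUC": roc_auc_score(y_true, prob_severe),
        "PR": average_precision_score(y_true, prob_severe),
        "BS": brier_score_loss(y_true, prob_severe),
        "reliability": list(zip(fcst_prob, obs_freq)),
    }
```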

We tried multiple data engineering methods and compared them using BS. These data engineering methods are (i) inclusion or exclusion of spatiotemporal distance to the nearest measured SR, (ii) dimension reduction of mesoanalysis data versus no dimension reduction, (iii) inclusion or exclusion of missing-indicator variables for erroneous mesoanalysis data, (iv) substitution of missing mesoanalysis parameters from earlier years with similar parameters; and inclusion or exclusion of (v) text data from the event and episode narratives, (vi) environmental characteristics (elevation and population), (vii) land category, and (viii) radar. Each method has two possibilities, giving rise to 2⁸ = 256 possible combinations. For each such combination, we optimized the hyperparameters of each ML model. Training and testing samples were consistent across feature combinations and ML models. The methods that were found to negatively affect skill (increased BS) were discarded in further implementations of the ML models.
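Enumerating the on/off state of the eight choices yields the 256 configurations, each of which was tuned and scored separately; the flag names in this sketch are shorthand for the options listed above.

```python
from itertools import product

OPTIONS = ("closest_measured", "dim_reduction", "missing_indicator", "substitution",
           "text", "environment", "land_category", "radar")

# Every included/excluded combination of the eight data-engineering choices:
# 2**8 = 256 configurations, each tuned and scored (by Brier score) separately
configs = [dict(zip(OPTIONS, flags))
           for flags in product((False, True), repeat=len(OPTIONS))]
assert len(configs) == 256
```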

Results

Changes in BS are shown in Fig. 3 for the top-performing base model (GBM) and ensemble model (GLM) based on a general evaluation of skill metrics (see Fig. 4). The implementations that negatively affected skill were disregarded in the final version of the models. The data engineering methods that were retained in the final version of the ML models are the addition of text data, reducing the dimensionality of the mesoanalysis data, and substituting mesoanalysis parameters for early years of the training data. While the text data were shown to improve skill, these were not retained in iterations of the models that included subsevere measurements since no text is associated with these reports.

Fig. 3.

Boxplots comparing the BS of the (a) GLM and (b) GBM when each data/method is included or excluded. Data are calculated on the test set. Whiskers represent the first (third) quartile minus (plus) 1.5 times the interquartile range. “Closest measured” represents information added about the existence of measured SRs nearby in time and space, “dim reduction” represents the methodology of reducing dimensionality of mesoanalysis data, “missing indicator” represents the treatment of erroneous mesoanalysis data, “substitution” represents the substitution of similar mesoanalysis parameters in early years, “text” represents the inclusion of the text data in the training/testing, “environment” represents the inclusion of elevation and population, “land category” represents the inclusion of land category information, and “radar” represents the use of radar reflectivity.


Fig. 4.

(left) Brier score (BS), (center) area under the ROC curve (AUC), and (right) area under the precision recall curve (PR) for each ML model over the test dataset. Upper and lower bounds of the 95% confidence intervals are denoted as tails on the points.


Subsevere wind reports were added to both the training and testing sets to increase the sample size of subsevere reports; the test set then consisted of 75% severe SRs and 25% subsevere. AUC and BS did not change noticeably from this inclusion; however, reliability improved greatly. The reliability plots of the top-two objectively performing models (as discussed below) are shown in Fig. 5. The histograms of predicted probabilities show that high probabilities are predicted most often, a result of the majority of the test data being ≥50 kt.

Fig. 5.

Reliability plots of the top-performing base model (GBM) and ensemble model (GLM) with and without the subsevere reports included in the training and testing sets. Plots are created using the test dataset. (left) The GBM (a) without subsevere and (c) with subsevere. (right) The GLM (b) without subsevere and (d) with subsevere. Bootstrap confidence intervals based on 2.5th and 97.5th percentiles (500 replicates) are shown on the one-to-one lines and histograms of the predicted probabilities are shown in the lower-right corner of each panel.


Training of the final version of the ML models included the closest measured SR, mesoanalysis data with dimension reduction, substitution of mesoanalysis parameters in early years, environmental data (population and elevation), and subsevere thunderstorm winds. The objective skill metrics for all six ML models are shown in Fig. 4. Multiple skill metrics were used (BS, AUC, PR, reliability) to test model skill given an unbalanced dataset, with most reports being severe. Evaluation of all metrics, including the subjective ratings discussed below, suggests that the best performing ML model is the GLM, followed by the GBM. The ANN and SVM showed consistently poorer performance, likely because the nature of the data does not easily fit the architecture of these models.

Use in Hazardous Weather Testbed Spring Forecasting Experiments

Our tool has been evaluated at the HWT over the 3-yr period of 2020–22. The HWT is a platform that allows researchers to test new tools and techniques in pseudo–real time with operational meteorologists (Kain et al. 2003; Clark et al. 2012; Gallo et al. 2017). This opportunity allowed us to gain valuable feedback about the output of the tool and ways to improve it, as well as to observe how the tool performs in different meteorological and regional scenarios. Participants navigated the site with the ML output available for each of the SRs from the previous day while answering a series of questions. These questions aimed to gauge participants’ impressions of the value and usefulness of the tool and website and which model they thought performed best. For each year, the ML model names and approaches were concealed to allow for unbiased evaluations.

Daily results were shown to participants as color-shaded scatterplots with a map background and in a tabular format with SR information. After the 2020 HWT, data were moved from a Google site with static maps to another website with interactive plots. Participants could zoom in and out, hover over the points to get SR information including the ML-assigned probability, text information, and magnitude of measured winds, and toggle SPC probabilistic severe thunderstorm wind forecasts. An example of this website is shown in Fig. 6. In 2022, the map was updated to include peak wind reports from any ASOS/AWOS station that was impacted by reflectivity of at least 50 dBZ and was within 16 km and 15 min of an SR. A more restrictive threshold than that used for the subsevere training reports was chosen to reduce clutter on the plots. The inclusion of these subsevere reports gave participants a better idea of the nearby environment and likely reduced the bias that might come from seeing only reports that would end up in the Storm Events Database, where they would likely be assigned an estimate of 50 kt or more.

Fig. 6.

Example screenshot of the website shown at the 2022 HWT. SRs are represented by the shaded points and the color shading represents the assigned ML probabilities, with red shades representing probabilities greater than 50% and blue shades representing probabilities less than 50%. Nearby ASOS/AWOS measurements to SRs are plotted as gray stars. These ASOS/AWOS measurements are not tested by the ML models and are therefore not assigned color shading. The 1630 UTC convective wind outlook is represented as the background shading.


Our tool was evaluated by participants using a rating scale of 1–10, with 10 being the highest rating. A different set of ML models was shown to participants each year, reflecting the top objectively performing models and the data and engineering methods in use at the time. In the 2020 HWT, all six ML models were shown to participants, and evaluations asked which model subjectively performed best. It was noted that there seemed to be a regional bias in the models, with higher probabilities output for storm reports in the central United States. To gain insight into possible regional biases, for the 2021 HWT we included land-use category, population, elevation, and radar data in the models, and also created regional versions of the algorithms by training the models over smaller regions based on longitudinal dividers (−90° and −105°). Participants were shown three approaches (with radar data, without radar, and regionally trained without radar) for two ML models (GBM and AVG). The 2021 HWT participant evaluations, in addition to objective skill metrics, revealed that the regional models did not improve performance, likely because of the smaller datasets used in the training. Land category and radar also did not improve objective performance during that experiment and only marginally improved subjective evaluation scores, so both were dropped as covariates for later testing of the models. In 2022, new versions of the models that included subsevere reports in the training were compared to versions without. These two methods were shown to participants for two ML models: the GBM and the GLM. The top subjectively rated ML model from each year is shown in Fig. 7, with the AVG rated highest in 2020, the GBM that included radar data rated highest in 2021, and the GLM with subsevere reports included in the training rated highest in 2022.

Fig. 7.

Boxplot of evaluation scores for the top-ranked machine learning model/approach for each year the tool was shown at the Hazardous Weather Testbed Spring Forecasting Experiment. Mean evaluation scores are shown as the white dot and number marking. Whiskers represent the first (third) quartile minus (plus) 1.5 times the interquartile range.


Each subsequent year resulted in a higher rating of the best-performing model. This likely indicates an improvement in the performance of the models and/or the website used to display the output, although a small subset of participants each year were returning from prior participation and may simply have felt more comfortable or had gained a better understanding of the goal of the tool. The evaluations suggested that the models were performing as participants would expect, with higher probabilities assigned to SRs associated with environments more conducive to strong thunderstorm winds. This point was evaluated further, and it was confirmed that higher probabilities were assigned to SRs with meteorological environments more conducive to the production of severe wind and vice versa. Participants expressed increasing confidence from year to year in the ability of the ML models to produce realistic results, as shown by the scores in Fig. 7 and by verbal and written feedback.

Preliminary investigation into applications

Output from our tool has many potential uses. Two examples follow, showing the impact of our output when used in the verification of SPC wind forecasts and in the creation of practically perfect hindcasts. The examples discussed below are presented simply to show that using the ML probabilities in skill metrics, to account for uncertainty in the estimated wind values, can have a noticeable impact on the resulting metrics.

SPC wind forecast verification.

For the first example, ML probabilities from the GLM were used in the verification of SPC 1630 UTC day one wind forecasts from 2018 to 2021. Verification methods closely follow Hitchens and Brooks (2012, 2014). Traditional SPC forecast verification assigns all SRs a value of 1 in the contingency tables used to calculate verification metrics, so that measured and estimated storm reports are treated equally. However, our output can be used to give the reports different weights. Thus, for this example, estimated SRs were assigned a value equal to their ML probability of being due to severe intensity wind, while measured values kept a value of 1. In doing this, estimated reports whose severe intensity is more likely to be in error are given less weight.

The forecast data for the severe wind outlooks were on a roughly 80 km × 80 km grid over the contiguous United States. Because we were evaluating the 1630 UTC day one outlook, the period being verified runs from 1630 UTC on the first day through 1200 UTC the following day. For each day, grid points were assigned a value based on the SPC forecast contours (e.g., 5%, 15%, 30%). SRs were then assigned to the nearest grid point based on their latitude/longitude. If more than one SR was valid for the same point, the highest probability from the ML tool was selected for that location. If an SR did not fall within a region forecast to have at least a 5% probability, it was assigned an SPC forecast probability of zero. For every binned SPC forecast contour, contingency table elements were calculated based on a standard 2 × 2 contingency table.

To show the possible impacts of the weighting on skill metrics, forecast skill was represented both with AUC computed from contingency table elements and with BS computed using Eq. (1), where N is the total number of forecast points, f_t is the SPC forecast contour value, and o_t is the observed value (1 for an unweighted measured observation and the ML probability for a weighted estimated observation). SPC forecast contour values (e.g., 0.05, 0.15) are interpolated to represent intermediate contours (0, 0.1, 0.225, 0.375, 0.525). The results comparing the weighted and unweighted approaches are shown in Table 2. Note that the use of weighting decreases the AUC value in this example, showing the forecast to have less skill, due to a decrease in the probability of detection, while the BS decreases, showing the forecast to have more skill, since the many estimated observations receive a value less than 1 in Eq. (1):
\mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N}\left(f_t - o_t\right)^{2}. \quad (1)
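A sketch of Eq. (1) with the ML weighting applied (measured SRs keep o_t = 1; grid points with no SR contribute o_t = 0):

```python
import numpy as np

def weighted_brier(forecast_prob, is_measured, ml_prob):
    """Eq. (1) with ML-weighted observations.

    forecast_prob : interpolated SPC contour value (f_t) at each grid point
    is_measured   : True where the SR assigned to the point is a measured report (o_t = 1)
    ml_prob       : GLM probability of severe-intensity wind for estimated SRs;
                    pass 0 (with is_measured = False) for points with no SR
    """
    f = np.asarray(forecast_prob, dtype=float)
    o = np.where(is_measured, 1.0, np.asarray(ml_prob, dtype=float))
    return float(np.mean((f - o) ** 2))
```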
Table 2.

SPC forecast verification results for 2018–21 SRs using Brier score (BS) and area under the ROC curve (AUC). The weighting method utilizes probabilities from the GLM.


Practically perfect hindcast.

As a second test of an application, the probabilities from the GLM were used in the development of practically perfect hindcasts (Hitchens et al. 2013; Gensini et al. 2020). Daily plots were made both using all SRs as input without weighting and using SRs weighted by the probabilities from the GLM trained with subsevere reports. An example of these plots is shown in Fig. 8.
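A minimal sketch of a probability-weighted practically perfect grid following the Gaussian-smoother idea of Hitchens et al. (2013); the smoothing length here is an illustrative assumption rather than the exact value used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def practically_perfect(report_grid, sigma=1.5):
    """Gaussian-smoothed 'practically perfect' field on the forecast grid.

    report_grid : 2D array on the ~80 km grid; 1 at points with an SR for the
                  unweighted version, or the GLM probability for the weighted version
    sigma       : smoothing length in grid units (the value here is illustrative)
    """
    return gaussian_filter(np.asarray(report_grid, dtype=float), sigma=sigma,
                           mode="constant")
```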

Fig. 8.

Practically perfect hindcasts for an example day (25 May 2022) during the 2022 HWT, with all SRs and (a) the 1630 UTC SPC wind outlook, (b) unweighted practically perfect forecast, and (c) the GLM-weighted practically perfect forecast. Observed SRs are shown in (a) and (b) as blue dots and are shaded in (c) based on the GLM predicted probabilities.


Subjectively, participants found value in these ML-adjusted hindcasts, especially on days with a large number of SRs that they believed were likely not caused by wind ≥ 50 kt. This was especially noted on more marginal days and in cases with a high number of reports due to proximity to urban or heavily treed areas, where participants felt that the ML weighting provided a more reasonable representation by decreasing the intensity of the resulting hindcasts.

Conclusions

We have shown that an ML tool can be used to add valuable information on uncertainty to the wind reports in the NCEI Storm Events Database. By training six ML models on various meteorological and environmental data, we output a probability that an SR was caused by wind ≥ 50 kt. These probabilities give forecasters and researchers uncertainty information that could improve verification and the development of future forecast tools. Two examples were shown of possible ways to use the output in forecast verification and practically perfect hindcasts.

Various data sources and training approaches were evaluated to determine their impact on ML skill, resulting in models that utilize meteorological information, environmental characteristics, text data, information on the closest measured SR, and data substitution. After numerous rounds of rigorous model tuning and testing, we determined objectively that the best performing model was the GLM based on AUC, BS, and PR.

Evaluation of the ML tools at the HWT in 2020, 2021, and 2022 led to positive feedback from potential future users of the tool. Daily evaluations led to detailed examination of the tool’s performance and nuances that could not be identified with evaluations of statistical skill metrics alone. For example, before the 2020 HWT the GBM seemed to be performing the best objectively, but participant evaluations noted that the model had polarizing output with either extremely high or extremely low probabilities that they believed were likely not justified. This demonstrates the importance of evaluating ML models using more than objective skill metrics alone.

Future work with the ML tools will focus on a deeper meteorological analysis of the weather conditions accompanying the output, especially when the output seems unusual, such as high probabilities assigned to reports in a region where no substantial damage or strong measured winds were reported, and low probabilities when significant damage or significantly severe wind was observed. Preliminary results reveal that lower probabilities are assigned to high wind speed SRs that are seemingly caused by wet microbursts in the eastern United States (east of −90°), and conversely high probabilities to low wind speed SRs in complex terrain. These follow-up analyses reveal times when the output should be used with caution and could guide future adjustments to improve the ML algorithm. For instance, the present study was limited in that it only examined the use of radar reflectivity, since no national composite exists for radar velocity data, but future work could explore the addition of velocity-derived parameters such as convergence or mesocyclone occurrence. Also, since ML has been used in some recent studies (e.g., Haberlie and Ashley 2018; Jergensen et al. 2020) to simplify the process of classifying convective mode, that information should be tested as an added predictor in our algorithms.

Acknowledgments.

Discussions and evaluations from the HWT have been advantageous in the development of our tool. Feedback has helped guide our data, methodology, and display resulting in a higher quality product. The assistance of Jonathan Thielen was invaluable in the data processing and creation of gridded radar reflectivity datasets. Data collection and processing was done with the assistance of Daryl Herzmann, Mark De Bruin, and Nathan Erickson. This material is based upon work supported by the HWT Program within the NOAA/OAR Weather Program Office, Department of Commerce, under Grant NA19OAR4590133. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of NOAA or the U.S. Department of Commerce. The research reported in this paper is partially supported by the HPC@ISU equipment at Iowa State University, some of which has been purchased through funding provided by NSF under MRI Grants 1726447 and MRI2018594.

Data availability statement.

The data that support the findings of this work can be found at the NOAA NCEI website (www.ncdc.noaa.gov/stormevents/), Oak Ridge National Laboratory (https://landscan.ornl.gov/), and Open Topo Data (www.opentopodata.org/). The SPC mesoanalysis data (Bothwell et al. 2002) are available upon request. Processed wind reports, meteorological environmental output, and additional technical details are available at https://github.com/etirone/SevereWindMachineLearning. Any use of the data in the GitHub repository should cite this paper.

References

  • Alpaydin, E., 2014: Introduction to Machine Learning. MIT Press, 613 pp.

  • Blei, D. M., and J. D. Lafferty, 2007: A correlated topic model of science. Ann. Appl. Stat., 1, 17–35, https://doi.org/10.1214/07-AOAS114.

  • Blei, D. M., A. Y. Ng, and M. I. Jordan, 2003: Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022, www.jmlr.org/papers/v3/blei03a.html.

  • Bothwell, P. D., J. Hart, and R. L. Thompson, 2002: An integrated three-dimensional objective analysis scheme in use at the Storm Prediction Center. 21st Conf. on Severe Local Storms, San Antonio, TX, Amer. Meteor. Soc., JP3.1, https://ams.confex.com/ams/SLS_WAF_NWP/techprogram/paper_47482.htm.

  • Bowman, K. P., and C. R. Homeyer, 2017: GridRad—Three-dimensional gridded NEXRAD WSR-88D radar data. NCAR CISL Research Data Archive, accessed 8 August 2020, https://doi.org/10.5065/D6NK3CR7.

  • Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

  • Bunkers, M. J., S. R. Fleegel, T. Grafenauer, C. J. Schultz, and P. N. Schumacher, 2020: Observations of hail–wind ratios from convective storm reports across the continental United States. Wea. Forecasting, 35, 635–656, https://doi.org/10.1175/WAF-D-19-0136.1.

  • Caputo, B., K. Sim, F. Furesjö, and A. Smola, 2002: Appearance-based object recognition using SVMs: Which kernel should I use? Proc. Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision, Red Hook, NY, NeurIPS.

  • Chase, R. J., D. R. Harrison, A. Burke, G. M. Lackmann, and A. McGovern, 2022: A machine learning tutorial for operational meteorology. Part I: Traditional machine learning. Wea. Forecasting, 37, 1509–1529, https://doi.org/10.1175/WAF-D-22-0070.1.

  • Chase, R. J., D. R. Harrison, G. M. Lackmann, and A. McGovern, 2023: A machine learning tutorial for operational meteorology. Part II: Neural networks and deep learning. Wea. Forecasting, 38, 1271–1293, https://doi.org/10.1175/WAF-D-22-0187.1.

  • Clark, A. J., and Coauthors, 2012: An overview of the 2010 Hazardous Weather Testbed Experimental Forecast Program spring experiment. Bull. Amer. Meteor. Soc., 93, 55–74, https://doi.org/10.1175/BAMS-D-11-00040.1.

  • Cortes, C., and V. Vapnik, 1995: Support-vector networks. Mach. Learn., 20, 273–297, https://doi.org/10.1007/BF00994018.

  • Davis, J., and M. Goadrich, 2006: The relationship between precision-recall and ROC curves. Proc. 23rd Int. Conf. on Machine Learning, New York, NY, ACM, 233–240, https://doi.org/10.1145/1143844.1143874.

  • Edwards, R., J. T. Allen, and G. W. Carbin, 2018: Reliability and climatological impacts of convective wind estimations. J. Appl. Meteor. Climatol., 57, 1825–1845, https://doi.org/10.1175/JAMC-D-17-0306.1.

  • Friedl, M. A., D. Sulla-Menashe, B. Tan, A. Schneider, N. Ramankutty, A. Sibley, and X. Huang, 2010: MODIS Collection 5 global land cover: Algorithm refinements and characterization of new datasets. Remote Sens. Environ., 114, 168–182, https://doi.org/10.1016/j.rse.2009.08.016.

  • Friedman, J. H., 2002: Stochastic gradient boosting. Comput. Stat. Data Anal., 38, 367–378, https://doi.org/10.1016/S0167-9473(01)00065-2.

  • Gallo, B. T., and Coauthors, 2017: Breaking new ground in severe weather prediction: The 2015 NOAA/Hazardous Weather Testbed Spring Forecasting Experiment. Wea. Forecasting, 32, 1541–1568, https://doi.org/10.1175/WAF-D-16-0178.1.

  • Gensini, V. A., A. M. Haberlie, and P. T. Marsh, 2020: Practically perfect hindcasts of severe convective storms. Bull. Amer. Meteor. Soc., 101, E1259–E1278, https://doi.org/10.1175/BAMS-D-19-0321.1.

  • Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 775 pp., www.deeplearningbook.org.

  • Haberlie, A. M., and W. S. Ashley, 2018: A method for identifying midlatitude mesoscale convective systems in radar mosaics. Part I: Segmentation and classification. J. Appl. Meteor. Climatol., 57, 1575–1598, https://doi.org/10.1175/JAMC-D-17-0293.1.

  • Hanley, J. A., and B. J. McNeil, 1982: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36, https://doi.org/10.1148/radiology.143.1.7063747.

  • Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics, Springer, 764 pp., https://doi.org/10.1007/978-0-387-84858-7.

  • Hitchens, N. M., and H. E. Brooks, 2012: Evaluation of the Storm Prediction Center’s day 1 convective outlooks. Wea. Forecasting, 27, 1580–1585, https://doi.org/10.1175/WAF-D-12-00061.1.

  • Hitchens, N. M., and H. E. Brooks, 2014: Evaluation of the Storm Prediction Center’s convective outlooks from day 3 through day 1. Wea. Forecasting, 29, 1134–1142, https://doi.org/10.1175/WAF-D-13-00132.1.

  • Hitchens, N. M., H. E. Brooks, and M. P. Kay, 2013: Objective limits on forecasting skill of rare events. Wea. Forecasting, 28, 525–534, https://doi.org/10.1175/WAF-D-12-00113.1.

  • Jergensen, G. E., A. McGovern, R. Lagerquist, and T. Smith, 2020: Classifying convective storms using machine learning. Wea. Forecasting, 35, 537–559, https://doi.org/10.1175/WAF-D-19-0170.1.

  • Kain, J. S., P. R. Janish, S. J. Weiss, M. E. Baldwin, R. S. Schneider, and H. E. Brooks, 2003: Collaboration between forecasters and research scientists at the NSSL and SPC: The Spring Program. Bull. Amer. Meteor. Soc., 84, 1797–1806, https://doi.org/10.1175/BAMS-84-12-1797.

  • Lee, Y., G. Wahba, and S. A. Ackerman, 2004: Cloud classification of satellite radiance data by multicategory support vector machines. J. Atmos. Oceanic Technol., 21, 159–169, https://doi.org/10.1175/1520-0426(2004)021<0159:CCOSRD>2.0.CO;2.

  • Li, N., M. Wei, Y. Yu, and W. Zhang, 2017: Evaluation of a support vector machine–based single-Doppler wind retrieval algorithm. J. Atmos. Oceanic Technol., 34, 1749–1761, https://doi.org/10.1175/JTECH-D-16-0199.1.

  • McCulloch, W. S., and W. Pitts, 1943: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5, 115–133, https://doi.org/10.1007/BF02478259.

  • NCEI, 2023: Storm events database. Accessed 2 February 2023, www.ncdc.noaa.gov/stormevents/.

  • Nelder, J. A., and R. W. M. Wedderburn, 1972: Generalized linear models. J. Roy. Stat. Soc., 135A, 370–384, https://doi.org/10.2307/2344614.

  • NOAA, 2020: August 10, 2020 derecho. NWS, www.weather.gov/dmx/2020derecho.

  • NOAA, 2021a: Storm data preparation. National Weather Service Instruction 10-1605, 28 pp., www.nws.noaa.gov/directives/010/archive/pd01016005b.pdf.

  • NOAA, 2021b: Storm data preparation. National Weather Service Instruction 10-1605, 110 pp., https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf.

  • Oak Ridge National Laboratory, 2023: LandScan. Accessed 21 July 2023, https://landscan.ornl.gov/.

  • Opitz, D., and R. Maclin, 1999: Popular ensemble methods: An empirical study. J. Artif. Intell. Res., 11, 169–198, https://doi.org/10.1613/jair.614.

  • Smith, B. T., T. E. Castellanos, A. C. Winters, C. M. Mead, A. R. Dean, and R. L. Thompson, 2013: Measured severe convective wind climatology and associated convective modes of thunderstorms in the contiguous United States, 2003–09. Wea. Forecasting, 28, 229–236, https://doi.org/10.1175/WAF-D-12-00096.1.

  • Trapp, R. J., D. M. Wheatley, N. T. Atkins, R. W. Przybylinski, and R. Wolf, 2006: Buyer beware: Some words of caution on the use of severe wind reports in postevent assessment and research. Wea. Forecasting, 21, 408–415, https://doi.org/10.1175/WAF925.1.

  • Weiss, S., 2002: An examination of severe thunderstorm wind report climatology: 1970–1999. 21st Conf. on Severe Local Storms, San Antonio, TX, Amer. Meteor. Soc., 11B.2, https://ams.confex.com/ams/SLS_WAF_NWP/techprogram/paper_47494.htm.

  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.

  • Wolpert, D. H., 1992: Stacked generalization. Neural Networks, 5, 241–259, https://doi.org/10.1016/S0893-6080(05)80023-1.

  • Fig. 1. Percentage of SRs with wind speeds from 50 to 59 kt for estimated (blue) and measured (orange) SRs from 2007 to 2018.

  • Fig. 2. Locations of measured SRs, denoted by dots, for 2007–17. Vertical orange lines denote subjective regional borders of the “west,” “central,” and “east” regions of the United States. Measured wind speeds (kt) are represented by the color shading of each SR point.

  • Fig. 3. Boxplots comparing the BS of the (a) GLM and (b) GBM when each data source or method is included or excluded. Scores are computed on the test set. Whiskers extend to the first (third) quartile minus (plus) 1.5 times the interquartile range. “Closest measured” represents information added about the existence of measured SRs nearby in time and space, “dim reduction” represents the methodology of reducing the dimensionality of the mesoanalysis data, “missing indicator” represents the treatment of erroneous mesoanalysis data, “substitution” represents the substitution of similar mesoanalysis parameters in early years, “text” represents the inclusion of the text data in the training/testing, “environment” represents the inclusion of elevation and population, “land category” represents the inclusion of land category information, and “radar” represents the use of radar reflectivity.

  • Fig. 4. (left) Brier score (BS), (center) area under the ROC curve (AUC), and (right) area under the precision–recall curve (PR) for each ML model over the test dataset. Upper and lower bounds of the 95% confidence intervals are denoted as tails on the points.

  • Fig. 5. Reliability plots of the top-performing base model (GBM) and ensemble model (GLM) with and without the subsevere reports included in the training and testing sets, created using the test dataset. (left) The GBM is shown (a) without and (c) with subsevere reports; (right) the GLM is shown (b) without and (d) with subsevere reports. Bootstrap confidence intervals based on the 2.5th and 97.5th percentiles (500 replicates) are shown on the one-to-one lines, and histograms of the predicted probabilities are shown in the lower-right corner of each panel (an illustrative sketch of this bootstrap procedure is given after the figure captions).

  • Fig. 6. Example screenshot of the website shown at the 2022 HWT. SRs are represented by the shaded points, with the color shading showing the assigned ML probabilities: red shades represent probabilities greater than 50% and blue shades represent probabilities less than 50%. ASOS/AWOS measurements near SRs are plotted as gray stars; these measurements are not tested by the ML models and are therefore not assigned color shading. The 1630 UTC convective wind outlook is shown as the background shading.

  • Fig. 7. Boxplot of evaluation scores for the top-ranked machine learning model/approach for each year the tool was shown at the Hazardous Weather Testbed Spring Forecasting Experiment. Mean evaluation scores are shown as white dots with numeric labels. Whiskers extend to the first (third) quartile minus (plus) 1.5 times the interquartile range.

  • Fig. 8. Practically perfect hindcasts for an example day (25 May 2022) during the 2022 HWT, with all SRs and (a) the 1630 UTC SPC wind outlook, (b) the unweighted practically perfect forecast, and (c) the GLM-weighted practically perfect forecast. Observed SRs are shown in (a) and (b) as blue dots and are shaded in (c) based on the GLM predicted probabilities.
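
The weighting in Fig. 8c can be illustrated with a minimal sketch, assuming SRs have already been binned onto a regular grid and that each report carries a GLM-assigned probability. The grid size, smoothing length (sigma), and the synthetic report locations and probabilities below are illustrative assumptions only, not the configuration used in this study.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Illustrative grid of report counts and summed GLM probabilities.
    ny, nx = 50, 60
    counts = np.zeros((ny, nx))      # number of SRs in each grid box
    prob_sum = np.zeros((ny, nx))    # sum of GLM probabilities in each box

    rng = np.random.default_rng(0)
    iy = rng.integers(0, ny, 30)     # hypothetical report locations
    ix = rng.integers(0, nx, 30)
    p = rng.uniform(0.2, 0.95, 30)   # hypothetical GLM probabilities
    np.add.at(counts, (iy, ix), 1.0)
    np.add.at(prob_sum, (iy, ix), p)

    sigma = 1.5                      # smoothing length in grid units (illustrative)

    # Unweighted field: every grid box containing at least one report contributes 1.
    unweighted = np.clip(gaussian_filter((counts > 0).astype(float), sigma), 0.0, 1.0)

    # Weighted field: each report contributes its GLM probability instead,
    # with each box's contribution capped at 1 before smoothing.
    weighted = np.clip(gaussian_filter(np.minimum(prob_sum, 1.0), sigma), 0.0, 1.0)

Capping each grid box's weighted contribution at 1 keeps the weighted field directly comparable to the unweighted one, in which a box with any report contributes exactly 1 before smoothing.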

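Likewise, the reliability curves and percentile-bootstrap confidence intervals described for Fig. 5 can be sketched in a few lines. This is not the study's code: the bin edges, sample size, and synthetic forecast–observation pairs (probs and obs) are assumptions made only to keep the example self-contained; in practice probs would be a model's predicted probabilities and obs the 0/1 severe-wind labels on the test set.

    import numpy as np

    # Hypothetical, roughly calibrated forecast-observation pairs.
    rng = np.random.default_rng(1)
    probs = rng.uniform(0, 1, 2000)
    obs = (rng.uniform(0, 1, 2000) < probs).astype(int)

    bins = np.linspace(0.0, 1.0, 11)   # ten forecast-probability bins (illustrative)

    def reliability(p, y, edges):
        """Observed relative frequency of the event in each forecast-probability bin."""
        idx = np.digitize(p, edges[1:-1])
        return np.array([y[idx == k].mean() if np.any(idx == k) else np.nan
                         for k in range(len(edges) - 1)])

    curve = reliability(probs, obs, bins)

    # Percentile bootstrap: resample forecast-observation pairs with replacement,
    # recompute the curve 500 times, and take the 2.5th/97.5th percentiles per bin.
    reps = np.array([reliability(probs[s], obs[s], bins)
                     for s in (rng.integers(0, len(probs), len(probs))
                               for _ in range(500))])
    lo, hi = np.nanpercentile(reps, [2.5, 97.5], axis=0)

Resampling forecast and observation pairs together, rather than resampling each separately, preserves their joint distribution, which is what the percentile interval is intended to characterize.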