In this study, we use machine learning (ML) to improve hail prediction by postprocessing numerical weather prediction (NWP) data from the new High-Resolution Ensemble Forecast system, version 2 (HREFv2). Multiple operational models and ensembles currently predict hail; however, ML models are more computationally efficient and do not require the physical assumptions associated with explicit predictions. Calibrating the ML-based predictions toward familiar forecaster output allows for a combination of higher skill associated with ML models and increased forecaster trust in the output. The observational dataset used to train and verify the random forest model is the Maximum Estimated Size of Hail (MESH), a Multi-Radar Multi-Sensor (MRMS) product. To build trust in the predictions, the ML-based hail predictions are calibrated using isotonic regression. The target datasets for isotonic regression include the local storm reports and Storm Prediction Center (SPC) practically perfect data. Verification of the ML predictions indicates that the probability magnitudes output from the calibrated models closely resemble the day-1 SPC outlook and practically perfect data. The ML model calibrated toward the local storm reports exhibited better or similar skill to the uncalibrated predictions, while decreasing model bias. Increases in reliability and skill after calibration may increase forecaster trust in the automated hail predictions.
Hail is a high-impact severe weather hazard, annually causing in excess of $1 billion (U.S. dollars) of property damage and $1 billion of crop damage (Jewell and Brimelow 2009). Isolated hail events, especially those impacting large urban areas, are particularly damaging. For example, a single hailstorm during the afternoon rush hour in the Denver, Colorado, metropolitan area on 8 May 2017 resulted in $2.3 billion of insurance claims (Svaldi 2018). The economic impacts of severe hail underscore the need for accurate and timely predictions, which allow individuals and businesses to take action toward mitigating risk to their property and safety. Accurate predictions of hail remain a challenge given the rapid evolution of hail-producing convective storms, coupled with uncertainties and limitations of atmospheric observation data needed to properly resolve the small-scale convective environment.
To produce skillful hail forecasts through explicit hail prediction, numerical weather prediction (NWP) models must accurately predict the development of convective storms, as well as produce reasonably accurate representations of hail within the model’s microphysical scheme (Labriola et al. 2017). At these small scales, model forecast errors can lead to large uncertainties in the timing and location of convective storms (e.g., Kain et al. 2010b; Durran and Weyn 2016). For hail prediction on longer time scales (up to 48 h), most methods rely on approximating environmental data at the convective scale (e.g., Johns and Doswell 1992). However, the spatial and temporal coverage of atmospheric soundings is generally insufficient to provide accurate initial conditions for explicit prediction of storms on the convective scale.
Where limitations in scale cause uncertainties in local storm characteristics, convection-allowing models (CAMs) have shown skill in predicting convective morphologies (e.g., Weisman et al. 2008). In recent years, CAMs have been employed in NOAA’s Hazardous Weather Testbed (HWT) Spring Forecasting Experiment (SFE). For example, during the 2010 HWT SFE operational forecasters subjectively indicated that the CAM guidance improved convection forecasts, compared to traditional convective-parameterizing schemes (Clark et al. 2012). Also, Gallo et al. (2017) noted that CAMs played an important role in reliable short-term forecasts, especially hourly forecasts, during the 2015 HWT SFE. For day-ahead forecasts (12–36-h lead time), CAM ensemble forecasts have shown improved skill compared to individual deterministic CAM forecasts (Loken et al. 2017).
Explicit hail prediction using storm-scale ensembles has been previously studied (e.g., Adams-Selin and Ziegler 2016; Snook et al. 2016; Labriola et al. 2017, 2019). However, limitations exist when explicitly predicting hail, including the sensitivity of microphysical scheme choice, initial and boundary conditions varying across ensembles, and physical assumptions needed to predict hail with computational efficiency. Additionally, the multitude of operational models and ensembles predicting hail that are available to forecasters can lead to cognitive overload. Wilson et al. (2017) found that forecaster overload increased with an increase in the number of datasets monitored, especially when multiple warning decisions are needed.
Recently, studies have focused on using machine learning (ML) to synthesize large amounts of atmospheric data, reducing the amount of monitored data while producing skillful forecast products without explicit prediction assumptions (e.g., Gagne 2016; Gagne et al. 2017, hereafter G17; McGovern et al. 2017; Lagerquist et al. 2017; Herman and Schumacher 2018a,b). Instead of explicit prediction, ML models map a set of inputs to a given output by optimizing the model’s structure, such that the differences between the ML predictions and the output observations, or “ground truth”, are minimized. Using these learned structures, ML models are able to make predictions on new sets of model data with relatively minimal computational expense. Low computational expense, compared to other postprocessing methods applied to gridded NWP output for hail prediction (e.g., Adams-Selin and Ziegler 2016), is a major advantage of ML forecasts. ML-based hail prediction studies over the contiguous United States (CONUS) have demonstrated that ML predictions exhibit greater forecasting skill than direct prediction of hail from NWP model output or the use of proxy variables (Gagne 2016; G17). However, subjective commentary from the 2018 HWT SFE indicated that forecasters do not trust ML guidance if the output is unfamiliar or dissimilar to human-produced forecasts.
Previous studies on applying automation in the forecasting process emphasize human interaction (e.g., Snellman 1977; Bosart 1989; Moller et al. 1994), and the importance of reliable guidance over simple competence (Hoffman et al. 2013). Similar to the 2018 HWT SFE, forecasters during the 2014 HWT SFE did not trust guidance without knowing the reliability and skill of new products (Karstens et al. 2015). However, Karstens et al. (2018) found that when proper training and forecast verification results are provided, the addition of automation can increase forecaster productivity. One way to increase the skill and reliability of probabilistic forecasts is through calibration (e.g., Raftery et al. 2005; Hagedorn et al. 2008; Hamill et al. 2008). In addition to increases in forecast performance, calibrating ML output to resemble existing operational forecasts, specifically those produced by the Storm Prediction Center (SPC), could result in greater trust in automated guidance for operations.
In this study, adapted from Burke (2019), we present newly-developed hail forecast guidance products using ML algorithms and output from the operational HREFv2 model. We demonstrate that these day-ahead forecast products can be successfully calibrated to increase reliability and skill, as well as resemble SPC hail products, all to increase forecaster trust in automated hail guidance.
2. Data and methods
The ML-based hail prediction models investigated in this study use HREFv2 (Jirak et al. 2018) model output as input data. Starting in April 2017, the SPC began running the High-Resolution Ensemble Forecast system, version 2 (HREFv2). The HREFv2 is based on the Storm Scale Ensemble of Opportunity (SSEO), a “poor man’s ensemble” (Ebert 2001) that combines existing operational CAMs produced by NOAA (Jirak et al. 2012) to produce a computationally efficient ensemble. Between the 2012–15 HWT SFEs, the SSEO scored positively in both objective and subjective metrics and provided a baseline for evaluation of CAM model guidance during the 2016 HWT SFE (Clark et al. 2016). The success of the SSEO in probabilistic severe weather prediction has brought attention to the HREFv2 dataset for use in skillful weather prediction. Compared to the SSEO, the HREFv2 uses slightly smaller grid spacing and includes an Advanced Research version of the Weather Research and Forecasting (WRF) Model (WRF-ARW; Skamarock and Klemp 2008) for a total of eight members.
Developed by National Centers for Environmental Prediction (NCEP)/Environmental Modeling Center (EMC), the HREFv2 is run daily by NCEP Central Operations (NCO). The HREFv2 is an eight-member ensemble, with time-lagged members initialized at 0000, 0600, 1200, and 1800 UTC. Only the members initialized at 0000 and 1200 UTC were available for use in this study. The HREFv2 is a diverse ensemble consisting of multiple microphysical schemes, including the WRF single-moment six-class (Hong et al. 2010) and Ferrier–Aligo (Aligo et al. 2014) schemes, as well as multiple initial conditions. The planetary boundary layer (PBL) schemes are the Yonsei University (YSU) (Hong et al. 2006) and the local Mellor–Yamada (MYJ) (Janjić 1990, 1994). All members use a horizontal model grid spacing of approximately 3 km. The number of vertical levels at which data are produced differs between ensemble members; the four high-resolution window (HiResW) members, including the time-lagged members, produce 50 vertical levels, the two “National Severe Storms Laboratory (NSSL) like” ARW models output 40 vertical levels, and the two nested North American Mesoscale Model (NAM-NEST) members include 60 vertical levels. Forecast products are generated from 1200 UTC to 1200 UTC the next day, the same period as a SPC day-1 convective outlook. A detailed description of the eight members of the HREFv2 ensemble is provided in Table 1.
The target data to train the ML models are the Maximum Expected Size of Hail (MESH) (Witt et al. 1998) derived from NOAA/NSSL Multi-Radar Multi-Sensor (MRMS) radar data (Zhang et al. 2011; Smith et al. 2016). Because MESH outputs exhibit greater skill for values exceeding 19 mm (Wilson et al. 2009), only values greater than 19 mm are considered. Although MESH has known biases, such as overprediction of higher values (e.g., Wilson et al. 2009; Cintineo et al. 2012; Ortega 2018; Murillo and Homeyer 2019), MESH was chosen over the local storm reports (LSRs) as the observational dataset for training the ML models because of the known population and size biases with LSRs (e.g., Schaefer et al. 2004; Cintineo et al. 2012). Also, Melick et al. (2014) found that MESH hail swaths were more skillful than LSRs in observing hail objects and act as a useful independent dataset in low population areas.
The ML models are trained on data between 1 April and 31 July 2017, and tested on data from 1 May through 31 August 2018. Different years are used for training and testing to create independent datasets and reduce the chance of overfitting. The duration of the training period (April–July) is selected based on greater hail potential and number of observations over the CONUS in the spring and early summer. The testing period includes the 2018 HWT SFE, from 30 April through 1 June 2018, during which forecasts were provided to HWT SFE participants for evaluation and feedback. April 2018 is not included for testing because of incomplete HREFv2 data.
b. Data preprocessing
Both the input and observational datasets are preprocessed with object tracking to address the relative rarity of hail (and severe weather in general) at any given location within the CONUS. The object-tracking method and ML models evaluated for hail prediction are based upon those used in G17. This object-tracking algorithm identifies potential storm objects where a chosen variable field exceeds a user-specified threshold. For the HREFv2 dataset, storm objects are identified using the maximum hourly upward vertical velocity (MAXUVV) with a threshold of 8 m s−1, rather than column total graupel greater than 3 kg m−2 from G17. The selection of updraft speed rather than updraft helicity or column total graupel, and the use of a relatively low threshold value, were designed to capture all possible hail storms rather than only high-end supercells. Although supercells are typically responsible for the most severe hail events, marginal hail is also important to the public, the insurance industry, and agriculture. For observations, potential MESH storm objects are generated for values exceeding 19 mm, differing from the 12-mm threshold used in G17, for reasons outlined above.
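The thresholding step described above can be illustrated with a minimal sketch. It assumes simple 4-connected component labeling on a toy grid, which stands in for, and is not, the G17 object-identification algorithm. The function name `label_objects` and the toy `maxuvv` values are hypothetical; only the 8 m s−1 threshold comes from the text.

```python
from collections import deque


def label_objects(field, threshold):
    """Label 4-connected regions where field > threshold; 0 marks background."""
    n_rows, n_cols = len(field), len(field[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    n_objects = 0
    for i in range(n_rows):
        for j in range(n_cols):
            if field[i][j] <= threshold or labels[i][j]:
                continue
            # Start a new object and flood-fill its connected neighbors.
            n_objects += 1
            labels[i][j] = n_objects
            queue = deque([(i, j)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < n_rows and 0 <= cc < n_cols
                            and field[rr][cc] > threshold
                            and not labels[rr][cc]):
                        labels[rr][cc] = n_objects
                        queue.append((rr, cc))
    return labels, n_objects


# Toy hourly maximum updraft field (m/s); values are illustrative only.
maxuvv = [
    [2.0, 9.5, 10.1, 1.0, 0.5],
    [1.0, 8.5, 3.0, 0.0, 12.0],
    [0.0, 0.5, 1.0, 0.0, 9.0],
]
labels, n_objects = label_objects(maxuvv, threshold=8.0)
```

In this toy field the three adjacent points near the upper left form one object and the two points on the right edge form a second. The same routine could be applied to a MESH grid with a 19-mm threshold.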
After identification, potential storm objects are matched in time and space to create storm tracks. For the HREFv2 storm tracks, additional hourly maximum variables are extracted at each grid point throughout the tracks. While instantaneous model fields may miss variations in storm intensity at time scales less than 1 h, hourly maximum fields can record maximum intensities without needing to output model data at every time step. Kain et al. (2010a) found that hourly maximum values provide skill for severe weather forecasting, particularly in determining hail threats in nonsupercellular storms. In addition, hourly maxima have been found skillful as guidance when forecasting severe weather, with minimal calibration needed (Sobash et al. 2011). Although CAMs, which output hourly maximum variables, cannot resolve individual hazards (such as severe wind, severe hail, or tornadoes), they can resolve severe hazard proxies such as updraft helicity or updraft speed.
The HREFv2 maximum hourly variables, statistically evaluated over each storm track after extraction, include storm and environmental variables (Table 2). Storm variables are directly related to convection, such as hourly maximum reflectivity, storm relative helicity (SRH), hourly maximum updraft helicity (UH), and so on. Environmental variables consist of near-storm fields, such as temperature, dewpoint temperature, geopotential height, and so on. The environmental variables are extracted from the previous forecast hour to mitigate contamination of the environmental conditions by the storms themselves, with storm variables extracted at the current forecast hour. Statistical evaluations of the storm tracks for both variable types include the mean, maximum, minimum, standard deviation, skewness, and 10th, 50th, and 90th percentiles. The HREFv2 hourly maximum variable statistics serve as the input data to the ML models. The number of HREFv2 storm tracks identified for training the ML models ranges from 10 000 to 25 000 per member, and from 23 000 to 63 000 per member in the input test set. The 2018 test set is larger as it contains more days with model runs, while the 2017 data were limited because it was the first year of running the HREFv2 operationally. Increasing the number of storm tracks in the 2017 dataset would be ideal; however, any changes would also increase the number of tracks in 2018, maintaining the dataset imbalance.
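As an illustration of the statistics listed above, the following sketch summarizes the values of one variable over one storm object. It assumes the population standard deviation, the moment-based skewness, and linearly interpolated percentiles, none of which are specified in the text. The function names and the input values are hypothetical.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    spread = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + spread * (position - lower)


def track_statistics(values):
    """Summary statistics used as ML predictors for one variable."""
    ordered = sorted(values)
    mean = statistics.fmean(ordered)
    std = statistics.pstdev(ordered)  # population form; an assumption
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in ordered) / len(ordered) / std ** 3
    else:
        skewness = 0.0  # constant field: skewness is undefined, report 0
    return {
        "mean": mean,
        "max": ordered[-1],
        "min": ordered[0],
        "std": std,
        "skewness": skewness,
        "p10": percentile(ordered, 10),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
    }


# Hypothetical hourly maximum reflectivity values (dBZ) inside one object.
stats = track_statistics([35.0, 42.0, 48.0, 55.0, 60.0])
```

Applying `track_statistics` to every storm and environmental variable yields one row of predictors per storm object.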
The final preprocessing step matches the HREFv2 storm tracks to MESH tracks. Track pairings occur where the calculated distance between member storm tracks and observed hail tracks is less than 80 km. If any of the fields used to calculate distance (differences in the observed and modeled track starting times, locations, durations, and sizes) exceed the thresholds defined in G17, the tracks are not matched. Match classifications, a binary dataset, as well as the shape and scale parameters of the paired MESH tracks provide the observational datasets for training the ML models.
c. Machine learning methods
The algorithm for operationally calibrating the ML hail predictions includes three models (Fig. 1): a random forest (RF) (Breiman 2001) classification model, RF regression model, and an isotonic regression (IR) model (Niculescu-Mizil and Caruana 2005). Both RF model configurations include 500 trees, a square root number of features chosen per tree, and a requirement of 1 sample per leaf. The number of random features chosen for each tree is relatively low (about 14) to reduce the chance of overfitting to the limited training dataset. The regression model optimizes the mean square error, while the classification RF optimizes the Gini index. These hyperparameters are similar to those in G17 and are relatively standard. The choice of RFs for producing operational severe hail forecasts is partly due to the speed for both training and forecasting. Computational efficiency and cost are large considerations for operations in addition to skillful forecasts. For forecasting rare events, RFs have shown greater skill over other models that assume linearity, such as logistic regression or elastic nets (G17; Herman and Schumacher 2018a). The randomness associated with RFs decreases model bias and variance, creating strong classifiers and regressors (Breiman 2001). Finally, RFs are easily interpretable (Herman and Schumacher 2018a), which is also a large consideration when employing a postprocessing method for operations.
Before calibration, described in section 2d, a RF classification model is trained to predict the probability of HREFv2 member storm tracks being matched with hail observations, where the binary truth dataset is described in section 2b. For each ensemble member, fivefold cross validation (Breiman and Spector 1992) produces five probability predictions, one per validation test fold, from the training dataset. Calculation of contingency table metrics over all five probability estimations, at 1% intervals, determines the threshold with the highest equitable threat score (ETS). This threshold “filters out” ensemble member storm objects because only storm tracks with probabilities exceeding the estimated threshold are classified as hail producing. Table 1 includes the thresholds for each HREFv2 member.
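The threshold search can be sketched as follows, assuming the out-of-fold probabilities from the five validation folds have already been pooled into a single list. The function names and toy values are hypothetical, and ties are resolved in favor of the lowest threshold, a choice the text does not specify.

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS: threat score adjusted for hits expected by chance."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denominator = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denominator if denominator else 0.0


def best_ets_threshold(probabilities, matched):
    """Scan thresholds 0.01-0.99 in 1% steps; return (threshold, ETS)."""
    best_threshold, best_score = None, float("-inf")
    for percent in range(1, 100):
        threshold = percent / 100.0
        hits = misses = false_alarms = correct_negatives = 0
        for prob, is_match in zip(probabilities, matched):
            predicted = prob > threshold  # "exceeding" the threshold
            if predicted and is_match:
                hits += 1
            elif predicted:
                false_alarms += 1
            elif is_match:
                misses += 1
            else:
                correct_negatives += 1
        score = equitable_threat_score(hits, misses, false_alarms,
                                       correct_negatives)
        if score > best_score:  # strict: keep the lowest tied threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy pooled cross-validation output: probability and match flag per track.
probabilities = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90]
matched = [0, 0, 1, 0, 1, 1, 1, 1]
threshold, score = best_ets_threshold(probabilities, matched)
```

Storm tracks with probabilities above the returned threshold would then be passed to the regression step.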
The member storm tracks classified as hail producing (with probability values exceeding the threshold values mentioned above, about 600–2000 storm tracks per ensemble member in the training dataset and 4000–7500 tracks in the testing set) are evaluated using a RF regression model. The regression model predicts the scale and shape parameters of a MESH gamma distribution for each input storm track. As in G17, the scale and shape parameters of the hail size gamma distribution are log-normalized and predicted together, to maintain the negative correlation between the shape and scale parameters. Comparing the predicted shape and scale parameters from each ensemble member to the values found through object-tracking produces average root mean squared error (RMSE) values of 0.64 for the shape parameter, and 4.51 for the scale parameter. The average ensemble mean absolute error (MAE) values are 0.51 and 3.28 for the shape and scale parameters, respectively. A higher scale parameter error could indicate that the RF regression model predicts a larger range of hail size values than observed.
Using the predicted shape and scale parameters, hail sizes from the gamma distribution are extracted such that the highest storm object values, MAXUVV in this case, in a track are associated with the largest MESH values. Unlike G17, the distribution of storm objects created for each member is based on data from the entire training period. To output hail sizes for a given storm object, G17 matched percentiles from a daily hail size distribution with a daily storm object distribution. However, if the range of storm objects on a given day is relatively low, higher percentile hail values are matched to storm object values that would not result in large hail on a different day. Preliminary testing identified a subjective high MESH bias on marginal days when using the daily values.
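One possible reading of this percentile matching is sketched below. The sketch ranks a storm object value against a reference distribution from the training period and reads the hail size at the same percentile of the predicted gamma distribution. The quantile is estimated by Monte Carlo sampling to keep the example dependency-free, which is an assumption of the sketch and not the authors' procedure. All names and numbers are hypothetical.

```python
import bisect
import random


def percentile_rank(sorted_reference, value):
    """Fraction of reference values less than or equal to value."""
    return bisect.bisect_right(sorted_reference, value) / len(sorted_reference)


def gamma_quantile(shape, scale, q, n_samples=20000, seed=0):
    """Monte Carlo estimate of the q-th quantile of a gamma distribution."""
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(shape, scale) for _ in range(n_samples))
    index = min(int(q * n_samples), n_samples - 1)
    return samples[index]


def hail_size_for_object(value, sorted_reference, shape, scale):
    """Map a storm object value to a hail size at the matching percentile."""
    q = percentile_rank(sorted_reference, value)
    return gamma_quantile(shape, scale, q)


# Hypothetical training-period MAXUVV values (m/s), sorted ascending.
reference = sorted([8.5, 9.0, 10.0, 12.0, 15.0, 18.0, 22.0, 30.0])

# Hypothetical predicted gamma parameters for one storm track.
weak_size = hail_size_for_object(9.0, reference, shape=2.0, scale=10.0)
strong_size = hail_size_for_object(22.0, reference, shape=2.0, scale=10.0)
```

Because the reference distribution spans the whole training period, a modest updraft on a marginal day maps to a low percentile and therefore to a small hail size, which is the behavior the change from daily distributions is intended to achieve.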
After predicting hail sizes from each ensemble member, the ensemble maximum size and neighborhood maximum ensemble probability (NMEP) of hail within 42 km of a point are calculated from 1200 UTC to 1200 UTC the following day. The NMEP predictions, based on the definition of ensemble probabilities in Schwartz and Sobash (2017), are evaluated at the severe (>25 mm) and significant severe (>50 mm) hail thresholds on the 3-km HREFv2 grid. The grid is further smoothed with a 2D Gaussian filter (σ = 42 km). In addition to managing the uncertainty of severe weather, the smoothed grid allows for direct comparison of predictions to SPC products for verification. The uncalibrated NMEP predictions were generated daily at 0700 UTC during the test period, valid at 1200 UTC. For a complete description of the data preprocessing and machine learning methods, we refer the reader to G17.
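A minimal sketch of the NMEP calculation follows, assuming each member supplies a gridded field of predicted maximum hail size for the forecast period. A point counts as an event for a member if any grid point within the neighborhood radius exceeds the size threshold, and the NMEP is the fraction of members with an event. The function names and toy grids are hypothetical, the radius is given in grid points (42 km would be 14 points on a 3-km grid), and the subsequent Gaussian smoothing is omitted.

```python
import math


def neighborhood_exceedance(field, threshold, radius_pts):
    """1 where any point within radius_pts exceeds threshold, else 0."""
    n_rows, n_cols = len(field), len(field[0])
    offsets = [(di, dj)
               for di in range(-radius_pts, radius_pts + 1)
               for dj in range(-radius_pts, radius_pts + 1)
               if math.hypot(di, dj) <= radius_pts]
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in offsets:
                ii, jj = i + di, j + dj
                if (0 <= ii < n_rows and 0 <= jj < n_cols
                        and field[ii][jj] > threshold):
                    out[i][j] = 1
                    break
    return out


def nmep(member_fields, threshold, radius_pts):
    """Fraction of members with an event in the neighborhood of each point."""
    binaries = [neighborhood_exceedance(f, threshold, radius_pts)
                for f in member_fields]
    n_members = len(binaries)
    n_rows, n_cols = len(binaries[0]), len(binaries[0][0])
    return [[sum(b[i][j] for b in binaries) / n_members
             for j in range(n_cols)] for i in range(n_rows)]


# Two toy members on a 5 x 5 grid (hail size in mm).
member_a = [[0.0] * 5 for _ in range(5)]
member_a[2][2] = 30.0                      # one severe hail point
member_b = [[20.0] * 5 for _ in range(5)]  # sub-severe everywhere
prob = nmep([member_a, member_b], threshold=25.0, radius_pts=1)
```

Only one of the two toy members produces hail larger than 25 mm, so the probability is 0.5 within one grid point of that cell and zero elsewhere.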
d. Calibration methods
In the uncalibrated NMEP hail predictions, probabilities range from 0% to 100% while SPC hail outlooks never predict probabilities of hail exceeding 60%. Ideally, to build forecaster trust in the ML-based model, the hail forecasts should be reliable and comparable in probability magnitude with operational products such as the SPC hail outlook. To address these issues, an IR model, chosen for its computational efficiency and lack of linearity assumptions, calibrates the RF output for operations. Previous studies have shown that RF probabilistic forecasts are more reliable after calibration using IR (Niculescu-Mizil and Caruana 2005; Lagerquist et al. 2017). With IR, which uses a nondecreasing function to map an input dataset toward a target dataset, only nonzero values are altered.
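To make the nondecreasing mapping concrete, the following sketch fits an isotonic regression with a hand-written pool-adjacent-violators routine. It is a generic illustration and not the authors' implementation. It assumes distinct input probabilities, interpolates linearly between fitted points, and does not reproduce the handling of zero values described above. All names and values are hypothetical.

```python
import bisect


def fit_isotonic(x, y):
    """Pool-adjacent-violators fit; returns sorted x and nondecreasing y."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    xs = [x[k] for k in order]
    blocks = []  # each block holds [sum of targets, number of points]
    for k in order:
        blocks.append([y[k], 1])
        # Merge backward while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted


def predict_isotonic(xs, fitted, value):
    """Interpolate the fitted step function, clipped to the training range."""
    if value <= xs[0]:
        return fitted[0]
    if value >= xs[-1]:
        return fitted[-1]
    upper = bisect.bisect_right(xs, value)
    lower = upper - 1
    weight = (value - xs[lower]) / (xs[upper] - xs[lower])
    return fitted[lower] + weight * (fitted[upper] - fitted[lower])


# Toy uncalibrated probabilities and binary targets (1 = hail observed).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
target = [0, 0, 1, 0, 0, 1, 0, 1]
xs, fitted = fit_isotonic(raw, target)
calibrated = predict_isotonic(xs, fitted, 0.75)
```

In the toy data the raw probabilities from 0.3 to 0.5 are pooled to a calibrated value of one-third, showing how the fit lowers probabilities where the raw output overforecasts while preserving their rank order.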
The operational target datasets used to calibrate the RF NMEP hail forecasts include the local storm reports (LSRs) and SPC practically perfect (PP) (Davis and Carr 2000; Hitchens et al. 2013) probability of severe hail occurrence. The LSR target dataset is a binary field, such that the observed hail grid points are within 40 km of at least one severe or significant severe hail report. The SPC PP probability field serves, during the HWT SFE, as an estimate of the optimal outlook a forecaster would issue if all LSR locations were known beforehand. After identifying the LSRs as a binary field, application of a two-dimensional Gaussian smoothing filter (σ = 2 for operations) creates PP smoothed probabilistic forecasts to account for uncertainties in LSR placement.
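The smoothing step that converts the binary LSR field into PP probabilities can be sketched as below, assuming the binary grid has already been built. The sketch takes σ in grid-point units, truncates the kernel at four standard deviations, and treats values beyond the grid edge as zero. These details and all names are assumptions of the sketch, not documented SPC settings.

```python
import math


def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1D Gaussian weights out to truncate * sigma grid points."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (k / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(grid, kernel):
    """Convolve each row with the kernel; points off the grid count as zero."""
    radius = len(kernel) // 2
    n_cols = len(grid[0])
    return [[sum(kernel[k + radius] * row[j + k]
                 for k in range(-radius, radius + 1)
                 if 0 <= j + k < n_cols)
             for j in range(n_cols)] for row in grid]


def transpose(grid):
    return [list(column) for column in zip(*grid)]


def gaussian_smooth(grid, sigma):
    """Separable 2D Gaussian filter: smooth along rows, then along columns."""
    kernel = gaussian_kernel_1d(sigma)
    smoothed_rows = smooth_rows(grid, kernel)
    return transpose(smooth_rows(transpose(smoothed_rows), kernel))


# Toy binary LSR grid with a single observed hail point at the center.
binary = [[0.0] * 21 for _ in range(21)]
binary[10][10] = 1.0
pp = gaussian_smooth(binary, sigma=2.0)
```

A single report spreads into a smooth probability bump whose grid sum is preserved, which is how the PP field accounts for uncertainty in report placement.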
The calibration model trains and tests over the RF predictions (121 days between 1 May and 31 August 2018), as the model for creating calibrated probabilities can only be applied to existing probability predictions. A total of 84 training (70% of data) and 37 testing (30% of data) days were randomly selected to limit biases resulting from synoptic or seasonal patterns. We acknowledge that including a validation dataset would be ideal, as well as separate years of training and testing data to decrease the chances of overfitting. We plan to split the data accordingly in future iterations when a larger dataset is available. The calibrated probabilities range in time from the day-ahead forecasts to three 4-h periods used by 2018 HWT SFE forecasters (1700–2100, 1900–2300, and 2100–0100 UTC). The target PP data does not include data from the full 24-h period, instead considering only a 20-h period from 16 to 36 h of forecast time. However, the IR model maps the two datasets together for calibration purposes.
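The random 70/30 split of forecast days can be sketched as follows, assuming each day is identified by an integer index and that the training count is rounded down. The function name and seed are hypothetical.

```python
import random


def split_days(days, train_fraction=0.7, seed=0):
    """Randomly assign forecast days to training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(days)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# 121 forecast days, indexed 0-120 for illustration.
train_days, test_days = split_days(range(121))
```

With 121 days this rule gives the 84 training and 37 testing days quoted above.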
The calibrated day-ahead NMEP predictions are qualitatively verified against the SPC daily hail outlook and the 20-h PP output, valid the same forecast day. We analyze two case studies to examine the robustness of the ML model to accurately predict hail occurrence over varied weather regimes. Both the severe and significant severe (sig-severe) thresholds are mapped for consistency with the G17 study. Additional quantitative evaluations over the IR test period (37 random days between 1 May and 31 August 2018) investigate the ML-based hail predictions, both uncalibrated and calibrated, against the PP dataset, the 1200 UTC SPC hail outlook, and a hail proxy variable (updraft helicity). The 2–5-km HREFv2 UH data at thresholds related to severe (>75 m2 s−1) and sig-severe (>150 m2 s−1) hail (G17) provide another non-ML baseline.
a. Marginal hail case study
On 8 May 2018, a trough and surface cold front moved into Oregon, where a deep mixed layer provided enough support for a few nonsevere hail-producing storms despite relatively weak convective available potential energy (CAPE) values. Storms over the mid to lower Missouri Valley had ample CAPE, but a dry boundary layer restricted growth ahead of a surface trough. The SPC hail outlook valid 1200 UTC displays two regions with 5% probability of severe hail (Fig. 2a), located over Oregon and portions extending from South Dakota to Missouri. The PP probability of severe hail on 8 May 2018 includes values between 5% and 15% highlighting South Dakota, Minnesota, and northwestern Iowa (Fig. 2b). There were no areas of sig-severe hail probability (Fig. 2c) and no severe hail reports received over the western United States.
The uncalibrated ML NMEP prediction, valid 1200 UTC 8 May 2018, displaces the severe hail probabilities up to 35% over portions of Iowa, South Dakota, Nebraska, and Missouri to the southeast of the observed severe hail reports (black dots) (Fig. 3a). Portions of Oklahoma, Kansas, and Oregon also exhibit severe probabilities up to 22%; however, the PP output does not contain probabilities in these regions (Fig. 2b). Although severe hail was not reported in Oregon, the SPC hail outlook (Fig. 2a) displays a similar area of probabilities as the uncalibrated prediction. At the sig-severe threshold (Fig. 3b), the uncalibrated model outputs probabilities up to 4% in eastern Iowa and southern Minnesota, while the PP output does not produce sig-severe probabilities (Fig. 2c). Overall, the uncalibrated prediction exhibits higher magnitude probabilities than the SPC outlook and PP output; however, the severe hail prediction produces spatially similar areas of nonzero hail threat as the SPC outlook.
Compared to the uncalibrated ML output, the probability magnitudes of the LSR calibrated severe hail prediction (Fig. 3c) better resemble the SPC hail outlook and the severe hail PP output. The LSR calibrated output exhibits the same spatial coverage of nonzero probabilities as the uncalibrated prediction, although values less than 1% are not displayed. Differing from the uncalibrated output, the LSR calibrated probabilities do not exceed 14%, similar to the severe PP output and SPC outlook magnitudes (Figs. 2a,b). At the sig-severe threshold, the LSR calibrated prediction (Fig. 3d) outputs spatially similar probabilities as the uncalibrated model but again overforecasts compared to the PP sig-severe probabilities (Fig. 2c). In general, the LSR calibrated model corrects the high magnitude bias present in the uncalibrated ML severe hail forecasts, although a high spatial bias persists at both thresholds.
The PP calibrated predictions, where the target dataset for calibration is the PP dataset, output lower probabilities of severe (Fig. 3e) and sig-severe (Fig. 3f) hail compared to the LSR calibrated forecast (Figs. 3c,d). The severe hail PP dataset provides higher magnitude probabilities (up to 14%) than the PP calibrated prediction (up to 4%). At the sig-severe threshold, the PP calibrated model correctly predicts no areas exceeding a 1% chance of sig-severe hail (Fig. 3f). For a marginal case, the PP calibrated model decreases the severe hail threat in areas observing hail reports but more closely resembles observations at the sig-severe threshold.
In general, the uncalibrated hail predictions overestimate the probability of severe hail, both spatially and in magnitude. An added calibration step decreases the high magnitude bias of the uncalibrated output, where the LSR calibrated model outputs severe hail probability values more comparable to the SPC outlook and PP data. Conversely, the predictions calibrated to the PP output underestimate the severe hail threat over South Dakota and Iowa. However, the PP calibrated model predicts the low sig-severe hail threat, compared to the slightly overestimated threat from the LSR calibrated model. At both size thresholds, at least one calibration model decreases the overprediction bias associated with the uncalibrated output.
b. Hail outbreak case study
Examination of a high-end severe hail event, occurring on 29 July 2018, investigates the robustness of the ML-based hail predictions in differing environments. On this day, multiple hail-producing storms resulted in 106 severe and 30 sig-severe hail reports concentrated in Colorado and extending into Wyoming, Kansas, South Dakota, and Nebraska. A strengthening upper-level trough and midlevel jet over the Midwest, strong diurnal heating, and a moist boundary layer set the stage for severe storms over the central high plains. The day-1 SPC hail outlook valid 1200 UTC (Fig. 4a) predicts a 15% chance of severe hail over eastern Colorado and 5% from Montana to Arkansas. Sig-severe hail was not anticipated until the outlook valid 1300 UTC. The PP output indicates severe hail probabilities up to 38% over northeastern Colorado (Fig. 4b), and up to 23% for sig-severe hail over eastern Colorado (Fig. 4c).
The uncalibrated hail prediction (Fig. 5a), valid 1200 UTC, features similar areas of 5% probability as the SPC hail outlook and PP output. However, neither the uncalibrated prediction nor the SPC outlook indicates the reported severe hail threat in North Dakota or Minnesota. In eastern Colorado, the uncalibrated severe hail prediction exceeds 60%, substantially overestimating the hail threat compared to observed PP severe hail output and SPC outlook (Figs. 4a,b). The uncalibrated sig-severe hail prediction (Fig. 5b) displays comparable probabilities in eastern Colorado compared to the sig-severe PP output (Fig. 4c) where values exceed 15%. Despite the overestimated hail probabilities, the uncalibrated predictions are closely collocated with the bulk of observed hail and spatially more comparable with the SPC hail outlook and PP output than in the marginal hail case.
The LSR calibrated model also outputs similar hail threat regions as the PP output and SPC hail outlook, but produces calibrated probabilities closer in magnitude to the observed severe PP output compared to the uncalibrated prediction (Fig. 5c). The calibrated probabilities do not exceed 29% over eastern Colorado, slightly lower than the maximum observed probabilities in the PP output (37%), but comparable to the hail threat in the SPC outlook. For sig-severe hail, the LSR calibrated probabilities (Fig. 5d) do not exceed 4% over portions of Colorado, again lower than the 22% maxima observed in the sig-severe PP output. In general, the LSR calibrated model predicts severe probability magnitudes closer to the PP output and SPC hail outlook, compared to the uncalibrated output, but underpredicts the sig-severe hail probabilities.
The last ML-based severe hail prediction, calibrated toward the PP output, underforecasts compared to both the PP output and SPC hail outlook (Fig. 5e). As in the marginal hail case study, the severe PP calibrated prediction exhibits very low probabilities of severe hail, not exceeding 14% in regions of eastern Colorado that observe PP probabilities of up to 22%. The sig-severe PP calibrated predictions (Fig. 5f) also underforecast compared to the PP output, where probabilities do not exceed 1% on this day, while the sig-severe PP probabilities reach 22% in this region. Both the severe and sig-severe PP calibrated predictions underestimate the severe hail threat in eastern Colorado; however, the calibrated severe hail forecast highlights the area of greatest observed hail reports.
For the high-end severe hail case, the ML-based hail predictions focus the highest hail threat in eastern Colorado where the largest number of severe hail reports are observed. As in the marginal hail case, calibration decreases the high probability magnitude bias associated with the severe uncalibrated ML output. Also, the LSR calibrated severe hail predictions more closely resemble the observed probability magnitudes than the PP calibrated output. Last, both calibrated models underpredict the probability of sig-severe hail. As the uncalibrated model outputs relatively similar probability magnitudes compared to the observed PP output, calibration may not be necessary at the sig-severe threshold for this case.
c. Quantitative verification
Four metrics quantitatively verify the ML models, PP output, 1200 UTC SPC outlook, and UH proxy variable. The metrics [reliability, Brier skill score (BSS; Brier 1950), equitable threat score (ETS; Clark et al. 2010), and bias] evaluate the 24-h predictions (12–36 h) over the isotonic regression test set. ETS and bias are favorable for evaluating gridded forecasts (Hamill 1999), where ETS measures the fraction of forecast and observed events that were correctly predicted, after accounting for events forecasted correctly by random chance. Higher ETS values are more skillful. Bias compares the frequency of forecast events to the frequency of observed events, where a bias of 1 is preferred. Both metrics require dichotomous forecasts to calculate contingency table metrics (Tables 3 and 4). The probabilistic forecasts are made deterministic at 5% interval thresholds, where forecasts equal to or greater than the threshold are predicted events. The observations for calculating the contingency table metrics are the LSRs (already binary) and the PP dataset (thresholded in the same way as the forecasts).
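The thresholding and contingency table calculations described above can be sketched as follows. This is a minimal illustration using the standard ETS and frequency bias definitions, not the verification code used in this study; the function names and the toy probabilities and reports are hypothetical.

```python
def contingency(probs, obs, threshold):
    """Count hits, false alarms, misses, and correct negatives at one threshold."""
    hits = false_alarms = misses = correct_negs = 0
    for p, o in zip(probs, obs):
        forecast_yes = p >= threshold  # at or above the threshold is a predicted event
        if forecast_yes and o:
            hits += 1
        elif forecast_yes:
            false_alarms += 1
        elif o:
            misses += 1
        else:
            correct_negs += 1
    return hits, false_alarms, misses, correct_negs


def ets_and_bias(probs, obs, threshold):
    """Return (ETS, frequency bias) for probabilistic forecasts made dichotomous."""
    a, b, c, d = contingency(probs, obs, threshold)
    n = a + b + c + d
    # Hits expected by chance, given the forecast and observed event frequencies.
    hits_random = (a + b) * (a + c) / n
    denom = a + b + c - hits_random
    ets = (a - hits_random) / denom if denom else float("nan")
    bias = (a + b) / (a + c) if (a + c) else float("nan")
    return ets, bias


# Hypothetical gridpoint probabilities and binary storm reports.
probs = [0.02, 0.10, 0.30, 0.50, 0.04, 0.20, 0.00, 0.60]
reports = [0, 0, 1, 1, 0, 0, 0, 1]

# Evaluate at 5% interval thresholds (5%, 10%, ..., 95%).
curve = [(pct, *ets_and_bias(probs, reports, pct / 100)) for pct in range(5, 100, 5)]
```

In this toy sample the 5% threshold yields a bias above 1 (five forecast events against three observed), which is the overforecasting signature discussed in the text.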
Evaluating the forecasts using reliability and BSS, with the LSRs as observational truth, reveals that the PP dataset consistently underpredicts while the uncalibrated ML predictions overpredict, at both hail size thresholds (Fig. 6). The 1200 UTC SPC outlook is only available at discrete intervals; however, the forecasts exhibit near perfect reliability, slightly overforecasting at 45% (Fig. 6a). The severe uncalibrated hail prediction and UH proxy exhibit comparable reliability, although the uncalibrated model BSS is higher (−0.18 versus −0.4). With calibration, we expect the predictions to have reliability characteristics similar to those of their target datasets. As expected, the severe LSR calibrated predictions exhibit near perfect reliability up to 45%, and the severe PP calibrated predictions are comparable with the PP output. Both calibrated predictions feature higher BSSs than the uncalibrated dataset, UH proxy, and SPC outlook, although the PP output displays the highest score (0.17). At the sig-severe threshold (Fig. 6b), a high bias persists with the UH proxy and uncalibrated dataset, while the LSR calibrated output displays near perfect reliability up to about 15% before overforecasting. Additionally, the LSR calibrated model output shows a higher BSS (0.0) than the uncalibrated output (−0.02) and UH proxy (−0.73). The sig-severe PP calibrated probabilities are sufficiently low that they do not appear, but achieve a BSS of 0.01. The 1200 UTC SPC sig-severe outlook is binary and therefore does not appear on the probabilistic reliability diagram. In addition to changes in forecasting bias, calibration decreases output probability magnitudes, as the LSR calibrated model does not exceed 45% and 20% at the severe and sig-severe thresholds, respectively. The PP calibrated model predictions do not exceed 20% and 1% at the severe and sig-severe thresholds, respectively.
Overall, calibration reliably maps the uncalibrated predictions toward two different target datasets and increases the BSS.
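For reference, the reliability and BSS calculations can be sketched as below. The reference forecast for the BSS is assumed here to be the sample climatology (the observed event frequency); the text does not state which reference was used, so this choice, the function names, and the bin count are illustrative assumptions.

```python
def brier_score(probs, obs):
    """Mean squared difference between forecast probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, obs)) / len(probs)


def brier_skill_score(probs, obs):
    """BSS relative to a constant forecast of the sample event frequency (assumed reference)."""
    base_rate = sum(obs) / len(obs)
    bs_ref = brier_score([base_rate] * len(obs), obs)
    if bs_ref == 0:
        return float("nan")  # undefined when every observation is identical
    return 1.0 - brier_score(probs, obs) / bs_ref


def reliability_curve(probs, obs, n_bins=20):
    """Return (mean forecast probability, observed frequency, count) per occupied bin."""
    bins = {}
    for p, o in zip(probs, obs):
        idx = min(int(p * n_bins), n_bins - 1)  # a probability of 1.0 joins the top bin
        prob_sum, obs_sum, count = bins.get(idx, (0.0, 0, 0))
        bins[idx] = (prob_sum + p, obs_sum + o, count + 1)
    return [(ps / c, os_ / c, c) for _, (ps, os_, c) in sorted(bins.items())]
```

A perfectly reliable forecast has an observed frequency equal to the mean forecast probability in every bin; points below that diagonal indicate overforecasting. A BSS above zero indicates improvement over the reference, which is why the negative scores of the uncalibrated output and UH proxy are notable.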
In Fig. 7, the LSR dataset serves as observations for calculating ETS and bias. Of the severe hail predictions, the LSR calibrated model outputs maxima and minima in ETS at similar forecast probabilities as the optimal PP output and 1200 UTC SPC outlook (Fig. 7a). However, the PP output achieves a higher skill score (0.32) than the LSR calibrated model (0.12). The PP calibrated probabilities also achieve an ETS value of 0.12, but at a lower forecast probability (around 5%). Both the uncalibrated output and UH proxy achieve higher skill than the PP output past 20% forecast probability; however, these datasets also produce the highest bias (Fig. 7b), with the uncalibrated model outputting lower bias below 70%. Although the calibrated models produce high biases at low forecast probabilities, the biases are lower than the SPC outlook across all forecast probabilities. Of all the severe hail predictions, the LSR calibrated model best resembles the PP output and SPC hail outlook in terms of skill.
Calculations of ETS and bias, again compared to the LSRs, at the sig-severe threshold show that the uncalibrated model outputs the highest ETS skill (0.03) of the ML predictions, while maintaining overall lower bias than the UH proxy (Figs. 7c,d). Both calibrated predictions have zero skill above 20%; however, the LSR calibrated model bias is most comparable to the PP output. As with the reliability diagram, the binary sig-severe SPC outlook does not appear on the probabilistic diagram. The UH field displays the greatest ETS skill above 20%, but maintains the highest bias. Similar to the case studies, the uncalibrated ML model performs better at the sig-severe threshold. The differences in bias between the uncalibrated output and UH proxy, at both size thresholds, indicate that the ML algorithm better focuses on hail threat regions before calibration. At the sig-severe threshold, the uncalibrated model displays the greatest skill of the different ML predictions; however, the LSR calibrated model produces bias values closer to the PP output.
In addition to the LSRs as observations, ETS and bias are calculated using the PP dataset to verify the hail predictions against a probabilistic field (Fig. 8). The severe PP calibrated prediction achieves similar ETS skill as the SPC hail outlook at 5%, but decreases in both ETS and bias above the 5% probability threshold (Fig. 8a). The severe LSR calibrated model produces the highest ETS skill below 30%. Above 30%, the SPC outlook achieves a higher score. For bias, the LSR calibrated model predicts lower values than the SPC hail outlook across all probabilities (Fig. 8b). The severe hail uncalibrated model and UH proxy display comparable skill, but again the uncalibrated output achieves lower bias. Verification of the sig-severe hail predictions shows that the UH proxy achieves the highest ETS value (0.04) (Fig. 8c), but continues to produce the highest bias of all the datasets, similar to previous examinations (Fig. 8d). At the sig-severe threshold, the uncalibrated model outputs the greatest ETS skill of the ML predictions, but the LSR calibrated model produces lower bias values. Overall, verifying against the PP dataset indicates that the severe LSR calibrated model produces similar skill as the SPC hail outlook, but calibration decreases ETS skill at the sig-severe threshold. Despite decreases in ETS skill, calibration lowers bias values at both size thresholds, compared to the PP dataset.
Due to the subjective success of the LSR calibrated model from the previous verification metrics, model timing information is explored. Only the severe hail calibrated predictions are examined over the different forecast periods (1700–2100, 1900–2300, 2100–0100, and 1200–1200 UTC the next day) because of the relative rarity of sig-severe hail over short time periods. A similar analysis at the sig-severe threshold could be enlightening; however, a much larger dataset would be needed. The calibrated predictions evaluated using ETS against the LSRs (Fig. 9a) and PP output (Fig. 9c) show that the 2100–0100 UTC and 24-h forecast periods achieve the highest ETS values of all the periods. Additionally, bias calculations when compared to the LSRs (Fig. 9b) and PP dataset (Fig. 9d) show similar values among the different time periods, although the 1900–2300 UTC prediction against the PP dataset shows lower bias at higher probabilities. In general, the later time periods outperform the earlier periods, but large biases are apparent across all time periods regardless of the observational dataset.
4. Discussion and summary
A random forest (RF) machine learning (ML) method for day-ahead hail prediction, based on that of G17, predicts severe hail probabilities with data from HREFv2 numerical model forecasts and observations from the Maximum Estimated Size of Hail (MESH) dataset. An added isotonic regression (IR) model calibrates the RF severe hail predictions toward the local storm reports (LSRs) and SPC’s practically perfect (PP) output. The resulting hail predictions, including an uncalibrated ML model as well as the LSR and PP calibrated models, were verified through two case studies, reliability diagrams, Brier skill score (BSS), and plots of equitable threat score (ETS) and bias.
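To make the calibration step concrete, isotonic regression can be sketched with the pool-adjacent-violators algorithm. The following is a minimal standard-library illustration, not the implementation used in this study; the function names and toy data are hypothetical, and the step-function lookup in `calibrate` is a simplifying assumption (library implementations often interpolate between fitted points instead).

```python
from bisect import bisect_right


def fit_isotonic(raw_probs, outcomes):
    """Fit a nondecreasing map from raw probabilities to observed event frequencies."""
    # Pool outcomes that share an identical raw probability.
    sums = {}
    for x, y in zip(raw_probs, outcomes):
        total, count = sums.get(x, (0.0, 0))
        sums[x] = (total + y, count + 1)
    xs = sorted(sums)
    # Each block stores [outcome sum, sample count, number of distinct x values].
    blocks = []
    for x in xs:
        total, count = sums[x]
        blocks.append([total, count, 1])
        # Merge while the previous block mean exceeds the newest block mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count, width = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] += width
    ys = []
    for total, count, width in blocks:
        ys.extend([total / count] * width)
    return xs, ys


def calibrate(xs, ys, raw_prob):
    """Look up the fitted value at the nearest training point at or below raw_prob."""
    i = bisect_right(xs, raw_prob) - 1
    return ys[max(i, 0)]  # values below the training range use the first fitted value


# Hypothetical raw RF probabilities and binary target observations (e.g., LSRs).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
target = [0, 0, 1, 0, 0, 1]
xs, ys = fit_isotonic(raw, target)
calibrated = calibrate(xs, ys, 0.45)
```

In this toy sample the raw probabilities from 0.3 to 0.5 are pooled to a single calibrated value of one-third, showing how the monotonic mapping can lower overconfident raw probabilities toward the observed frequency in the target dataset.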
The marginal and high-end severe hail studies used to evaluate the HREFv2 ML hail predictions indicate that the uncalibrated severe hail predictions are spatially similar to the SPC hail outlook, but exhibit a high spatial bias compared to the severe PP dataset. The severe uncalibrated model also produces a high magnitude bias compared to both verification datasets. The significant severe (sig-severe) uncalibrated case study forecasts do not exhibit as high spatial or magnitude biases; however, a high magnitude bias persists over the IR test set at both hail size thresholds, evidenced by the reliability diagrams and plots of ETS and bias.
The high magnitude bias (and spatial bias compared to the severe PP dataset) may be due in part to the different training and verification datasets used, where MESH observations are more numerous than LSRs. Verifying against MESH may decrease the uncalibrated ML model bias; however, the LSRs are a common operational verification dataset. Beyond verification datasets, the member classification thresholds for “filtering out” storms as hail producing may contribute to the high size bias. A lower threshold classifies more storms as hail producing, leading to more nonzero hail sizes output from the RF regression model. Changing the skill metric, such as BSS, that determines the filtering thresholds may increase the thresholds. In addition, training both the classification and regression models on a larger dataset could increase the filtering threshold, while also decreasing the scale parameter errors previously mentioned. Finally, changing the object-tracking algorithm thresholds, like increasing the MAXUVV and MESH thresholds, or decreasing the areal intensity used for creating storm tracks could decrease the number and spatial extent of predicted storm tracks; however, this may also lead to more misses. Even though the uncalibrated ML model exhibits a high magnitude bias, the severe hail proxy variable (updraft helicity) displays higher bias values and a lower BSS than the uncalibrated output at both size thresholds.
Adding calibration reduces the high magnitude bias associated with the uncalibrated ML severe hail predictions, across both calibrated datasets. For both case studies, the LSR calibrated model most closely resembles the operational day-1 SPC hail outlook and PP output at the severe hail threshold. However, the sig-severe uncalibrated ML model outputs similar probabilities, both in magnitude and spatially, as the two verification datasets. Calibrating the sig-severe ML model decreases the probability magnitudes below the observed PP output. Although the qualitative case studies display decreases in calibrated sig-severe probabilities, the reliability diagrams calculated over the IR test set show that calibration can reliably map the RF hail predictions toward either the PP or LSR data, and increase BSS, at both size thresholds. The LSR calibrated model achieves a higher BSS, and is more reliable at certain forecast probabilities, than the 1200 UTC SPC outlook, which has access to 12 h more data than the HREFv2 model runs.
Increases in reliability are important when forecasters consider using automated guidance (Hoffman et al. 2013; Karstens et al. 2018). Indicating to forecasters that the LSR calibrated dataset is reliable compared to the LSRs, a common operational verification dataset, could increase trust in the guidance. In addition to increased reliability, the severe LSR calibrated predictions exhibit lower bias than the SPC outlook, while producing higher ETS values across most forecast probabilities, against both the PP dataset and LSRs. Although both the uncalibrated ML output and UH proxy variable produce higher biases than the PP dataset or SPC outlook, calibrating the UH field toward the LSRs would not produce the same results as the LSR calibrated ML model. Calibrating the UH field would only be appropriate in environments where hail results from strongly rotating storms. Conversely, all storm modes can be skillfully predicted by the RFs because of the multiple different storm and environment fields input to the models.
In addition to day-ahead forecast skill, ETS and bias calculations verify timing information from the LSR calibrated model against both the LSR and PP datasets. The 1200–1200 and 2100–0100 UTC time periods displayed the highest skill across both verification datasets. The poorer performance at early times likely results from the higher prevalence of severe weather, including severe hail, during the afternoon and evening hours. The decrease in performance may indicate that the LSR calibrated ML model exhibits an overforecasting bias during early time periods.
Similar to the LSR calibrated model, the PP calibrated model decreased the uncalibrated model output bias. However, the PP calibrated predictions were extremely conservative in predicting both hail size thresholds. The low forecasting bias likely results from the underforecasting bias of the PP dataset. The Gaussian smoother applied to the LSR locations smooths the probabilistic dataset, causing the PP dataset to underforecast the LSRs by design. Calibration further decreases the overall probability magnitudes of the RF ML hail predictions, leading to an enhanced low bias with the PP calibrated model. Comparatively, the LSR calibrated model calibrates directly to the LSRs, similar to the PP output, and therefore does not include a pre-existing bias. The lack of underforecasting bias associated with the LSR calibrated target dataset most likely causes the LSR calibrated output to resemble the PP output better than the PP calibrated output does. The low forecast skill and reliability of the PP calibrated model, compared to the LSR calibrated model, indicate that the LSR calibrated model may be more trusted in an operational setting.
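The smoothing effect can be illustrated with a small sketch that spreads binary report locations with a normalized Gaussian kernel. The grid size, the kernel width `sigma` (in grid lengths), and the truncation at three standard deviations are hypothetical choices for illustration and are not the settings used to produce the PP dataset.

```python
import math


def gaussian_smooth(grid, sigma):
    """Smooth a 2D binary report grid with a normalized Gaussian kernel cut off at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = {
        (di, dj): math.exp(-(di * di + dj * dj) / (2 * sigma * sigma))
        for di in offsets
        for dj in offsets
    }
    norm = sum(kernel.values())
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if grid[i][j]:
                # Spread each report onto the surrounding grid points.
                for (di, dj), weight in kernel.items():
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        out[ii][jj] += weight / norm
    return out


# A single hypothetical report at the center of a 9 x 9 grid.
reports = [[0] * 9 for _ in range(9)]
reports[4][4] = 1
smoothed = gaussian_smooth(reports, sigma=1.0)
peak = smoothed[4][4]
```

With these settings the smoothed value at the report location is well below 1 even though a report occurred there, which is the sense in which a smoothed field underforecasts the point reports it was built from.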
Beyond reliability, model interpretation can be important in increasing forecaster acceptance of automation (McGovern et al. 2019). Burke (2019), which uses two separate RF regression models to predict the shape and scale parameters, investigated permutation variable importance (Lakshmanan et al. 2015) for interpretation of the ML-based hail prediction algorithm. The regression method differences could influence the produced variable importance ranks, therefore further exploration applied to the single RF regression model method is needed.
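As a rough illustration of the idea behind permutation variable importance, the sketch below shuffles one predictor at a time and records the resulting increase in error. It is a simplified single-pass version with a stand-in predictor, not the procedure of Lakshmanan et al. (2015) or Burke (2019); `toy_predict`, the toy data, and the use of mean squared error are assumptions made for the example.

```python
import random


def mean_squared_error(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def permutation_importance(predict, rows, targets, n_repeats=5, seed=0):
    """Mean increase in error after shuffling each predictor column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error([predict(r) for r in rows], targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            error = mean_squared_error([predict(r) for r in permuted], targets)
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Stand-in model that depends only on the first predictor."""
    return 2.0 * row[0]


rows = [[float(i), float((7 * i) % 10)] for i in range(10)]
targets = [2.0 * r[0] for r in rows]
importances = permutation_importance(toy_predict, rows, targets)
```

Here the second predictor receives an importance of zero because the stand-in model never uses it, while shuffling the first predictor degrades the predictions. Different model formulations can distribute importance differently, which is consistent with the need to repeat the analysis for the single RF regression model.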
Overall, the LSR calibrated ML-based hail guidance using the HREFv2 dataset provides increased reliability and skill over the uncalibrated ML output and SPC hail outlook, indicating that the forecasts may provide operationally useful predictions. All of the ML models overforecast hail in terms of spatial extent, but this could benefit forecasters preparing forecasts in terms of highlighting potential hail risk areas and addressing storm placement uncertainty. The severe hail predictions were particularly skillful after calibration, which is encouraging for this initial application of the technique to the HREFv2 dataset. Preliminary results indicate that calibrating the ML NMEP forecasts before smoothing produces predictions with decreased spatial bias and false alarms. The next iteration of calibrated severe hail forecasts will include this step to further increase the skill of the LSR calibrated output. Even without this update, the calibrated predictions produce output that, compared to the uncalibrated predictions, is more comparable to the PP probabilities while outperforming the SPC hail outlook. However, any improvements of the calibrated models over the SPC hail outlook may be within the ranges of forecast uncertainty. Subjective ratings from the 2019 HWT SFE indicate that the LSR calibrated hail guidance performs similarly to an SPC hail guidance product, with a slightly higher mean score. In general, producing reliable forecasts with skill comparable to that of SPC hail outlooks has the potential to increase operational forecaster trust in the LSR calibrated automated predictions.
This work was primarily supported by the Joint Technology Transfer Initiative (JTTI) Grant NA16OAR4590239 provided by NOAA, and supplemented with funding provided by NOAA JTTI Grant NA18OAR4590371. This material is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation under Cooperative Agreement 1852977. Computing was primarily executed on the University of Texas Advanced Computing Center (TACC) Stampede supercomputer. The authors thank Timothy Supinie and Chris Cook for data support, Ryan Lagerquist for implementation feedback, and Jonathon Labriola for contribution of ideas. We also thank Steven Weiss, Israel Jirak, and the participants of the 2018 HWT SFE for useful assessments and remarks. We thank David Harrison and Christopher Karstens for providing the gridded SPC outlook data. Finally, we thank the anonymous reviewers for their feedback, which helped improve this manuscript.
Forecasters from the 2018 HWT SFE on 14–18 May.