Forecasting severe hail accurately requires predicting how well atmospheric conditions support the development of thunderstorms, the growth of large hail, and the minimal loss of hail mass to melting before reaching the surface. Existing hail forecasting techniques incorporate information about these processes from proximity soundings and numerical weather prediction models, but they make many simplifying assumptions, are sensitive to differences in numerical model configuration, and are often not calibrated to observations. In this paper a storm-based probabilistic machine learning hail forecasting method is developed to overcome the deficiencies of existing methods. An object identification and tracking algorithm locates potential hailstorms in convection-allowing model output and gridded radar data. Forecast storms are matched with observed storms to determine hail occurrence and the parameters of the radar-estimated hail size distribution. The database of forecast storms contains information about storm properties and the conditions of the prestorm environment. Machine learning models are used to synthesize that information to predict the probability of a storm producing hail and the radar-estimated hail size distribution parameters for each forecast storm. Forecasts from the machine learning models are produced using two convection-allowing ensemble systems and the results are compared to other hail forecasting methods. The machine learning forecasts have a higher critical success index (CSI) at most probability thresholds and greater reliability for predicting both severe and significant hail.
Hail causes billions of dollars in property and crop damage in the United States each year (Changnon 2009) and is a major liability for insurance companies (Brown et al. 2015) with $850 million in average annual claims. Urban sprawl and population growth in large cities such as Dallas/Fort Worth, Texas; St. Louis, Missouri; Chicago, Illinois; and Denver, Colorado, have made large amounts of property damage from hail events more likely (Rosencrants and Ashley 2015). Hail growth and melting is governed by the microphysical processes that a hailstone experiences along its trajectory through a storm. Foote (1984) and Nelson (1983) found that the amount of hail growth strongly depends on the path through the updraft and the amount of supercooled liquid water encountered. Rasmussen and Heymsfield (1987) identified how differences in ice density, relative humidity, and hail diameter can affect the rate of melting. However, hail forecasting methods have relied on a combination of coarse approximations of the expected storm environment and local hail climatology with varying degrees of success (Johns and Doswell 1992). Most existing hail forecast methods (e.g., Fawbush and Miller 1953; Moore and Pino 1990; Brimelow et al. 2002) are based on information extracted from a sounding representative of the convective environment. All of these models estimate potential updraft strength from the integrated buoyancy between the melting layer and the equilibrium level. To determine the hail size at the ground, the hail size models also include a melting component based on the height where the wet-bulb temperature reaches 0°C, which is approximately where melting would begin in hail falling through downdraft air. Shallower, cooler, and drier melting layers lead to less melting (Rasmussen and Heymsfield 1987), and larger hailstones have higher terminal fall speeds, leading to shorter transit times (Johns and Doswell 1992).
Sounding-based hail forecasting has fundamental limitations. For best results, the sounding should be representative of the storm environment, which can be problematic when observed soundings are only available at discrete locations and twice a day. This issue can be addressed by correcting the sounding based on surface observations (Moore and Pino 1990) or by utilizing model soundings near the time and location of expected hail, but this may not be representative of the updrafts that actually produce hail. Also, updraft speed from CAPE is estimated at the top of the updraft, not within the hail growth region. The largest hail may not come from the strongest part of the updraft because hail may be lofted elsewhere if the speed is too high (Foote 1984; Johns and Doswell 1992). In storms with broader updrafts, hailstones may have significant horizontal motions in their growth trajectories (Nelson 1983; Foote 1984), leading to additional growing time. Hail embryos also may not traverse the optimal growth trajectory through the updraft and thus fail to achieve the largest size possible for that updraft speed (Dennis and Kumjian 2017).
Some of the shortcomings with sounding-based hail diagnostics were previously addressed by feeding sounding information into a one-dimensional hail growth model called HAILCAST (Brimelow et al. 2002). This model creates an ensemble of updrafts based on perturbations of the temperature and moisture profile. The energy shear index estimates the updraft lifetime and the amount of entrainment. Hail embryos are released into the updraft and grow until they can no longer be sustained or the updraft collapses. A bulk melting scheme is then applied to approximate the hail size at the ground. Jewell and Brimelow (2009) found that HAILCAST generally provides a reliable hail size forecast, especially when compared to other sounding methods. Adams-Selin and Ziegler (2016) implemented an updated version of HAILCAST with many physical improvements based on observational hail studies in the WRF Model. WRF-HAILCAST, using modeled cloud liquid and ice water and vertical velocities in place of those estimated from a sounding and in spatial verification experiments, was found to forecast hail sizes within 12 mm of observed 66% of the time (Adams-Selin and Ziegler 2016). Although WRF-HAILCAST does incorporate many physical processes into its hail growth model, it does have to make some simplifying assumptions about hail growth trajectories to be computationally efficient. It is also sensitive to the sizes of the initial hail embryos and the density of the hailstones (Adams-Selin and Ziegler 2016), but that sensitivity is addressed by growing an ensemble of hail embryos with varied initial sizes.
As computational speed and storage capacities have increased in the past 20 years, research and operational groups have run real-time NWP models capable of explicitly representing deep, moist convection (Kain et al. 2008). While not all convective processes are adequately resolved, individual updrafts and their associated heat and moisture fluxes can be represented reasonably well. Because some convective processes are not fully resolved, these models are called convection-allowing models (CAMs). CAMs have shown skill over mesoscale NWP models in precipitation forecasting and in forecasting convective evolution and morphology (Weisman et al. 2008). Storms generally occur over a small subset of an area with favorable conditions, so explicit forecasts of convection help constrain the forecast severe threat area. However, errors in the initial conditions and model physics lead to spatial errors in storm placement and timing errors in convective initiation and evolution. To account for this uncertainty, CAM ensembles perturb the initial and boundary conditions, physics parameterizations, and the dynamical cores to capture the range of possible convective solutions. CAM ensembles have run in real time as part of NOAA’s Hazardous Weather Testbed Spring Experiment since 2007 (Clark et al. 2012). The National Weather Service is already running deterministic CAMs and will be deploying ensembles soon (Benjamin 2014).
While CAMs cannot resolve severe hazards, such as tornadoes, hail, and high winds, storm diagnostics have shown skill in inferring hazard potential. Because storms develop and intensify in minutes, hourly snapshots of CAM output may miss the time when the storm is at its greatest intensity. More frequent model output may capture these extremes, but this would require more computational resources. A compromise is to track the maximum value at a point over a given time period and regularly output that field. These hourly maximum fields provide a surrogate for both storm track and intensity (Kain et al. 2010). For severe weather forecasting, updraft helicity (UH), the integrated product of vertical velocity and vertical vorticity between 2 and 5 km, is a proxy for strong, rotating updrafts (Kain et al. 2008) and correlates with tornado pathlength (Clark et al. 2013). Other hourly maximum fields include updraft and downdraft speeds, simulated radar reflectivity at the −10°C level and column total graupel (CTG) mass. Storm surrogate products have some limitations. Current storm surrogates do not directly predict the occurrence of severe weather hazards. The intensity distributions of the storm surrogates are dependent on the dynamical core, grid spacing, and parameterization scheme (Kain et al. 2008), which make them more challenging for forecasters to interpret when examining mixed ensembles. Most storm surrogates are indirectly verifiable through storm reports. A forecast product that directly forecasts the probability of a severe hazard is most useful if the predictions are well calibrated and physically consistent with the model forecast.
Machine learning (ML) models integrate data from multiple sources together to produce calibrated hazard predictions. Unlike physics-based models, which derive prognostic and diagnostic equations from primitive equations, a set of assumptions, and initial and boundary conditions, ML models identify patterns in a set of data and then find a relationship between those patterns and an outcome that optimizes an error metric. In meteorology, ML and statistical models are often used to link observations or NWP model fields to hazards and other events of importance that are not resolved or described well by NWP models or our current observing systems. Examples of ML for hazard identification include storm classification (Lakshmanan et al. 2010; Gagne et al. 2009), convection initiation (Ahijevych et al. 2016), aircraft turbulence (Williams 2014), and hurricane power outages (Nateghi et al. 2014). In the hail prediction domain, Manzato (2013) used neural network ensembles to predict hail occurrence and spatial extent in Italy using sounding parameters as input, and Marzban and Witt (2001) trained a neural network on radar parameters to nowcast hail size categories. ML models trained on observed data generally can only make accurate predictions out to a few hours ahead, but when given NWP model output within a model output statistics framework (Glahn and Lowry 1972), calibrated predictions for one or more days in advance are possible. ML and statistical models generally require a large dataset of historical NWP model runs and observations to build an accurate predictive model. These historical model runs should have similar configurations so that the ML model can identify systematic patterns and biases. Domain knowledge is still necessary to identify a reasonable set of relevant input variables for the ML models and to transform the output variables between forms that are easier for the ML models to fit and forms that are interpretable for end users.
The purpose of this paper is to describe track-based day-ahead probabilistic hail forecasts generated from ML models. The ML hail forecasts are compared with hail-related storm surrogate variables (Sobash et al. 2011), such as UH and CTG, and with HAILCAST. This evaluation tests the hypotheses that ML models will produce accurate and reliable hail forecasts, that the ML forecasts can detect more hailstorms and produce fewer false alarms than other hail size methods, and that ML model performance is more consistent across different NWP configurations than other hail size diagnostics. This work builds upon the hail size regression model described in Gagne et al. (2015), which predicted a storm’s maximum hail size but could not predict hail larger than 50 mm. The ML regression model in this paper predicts the parametric distribution parameters of the radar-estimated maximum expected size of hail over the area of a storm instead of predicting the maximum hail size within the storm. Predicting the distribution of hail sizes instead of the most likely size value allows us to estimate the probability of hail exceeding any size threshold consistently with one model. The predicted spatial hail size distribution is mapped to the areas of forecast storms in a manner akin to the probability-matched mean (Ebert 2001) in order to produce spatially realistic deterministic gridded hail size forecasts that can be compared directly with other hail forecast products from CAMs. This paper is adapted from the dissertation of Gagne (2016), and a summary description of the ML hail forecast model appears in McGovern et al. (2017).
The ML hail forecast method was evaluated on two CAM ensemble systems to demonstrate the robustness of the hail forecast method to differences in member configurations, data assimilation choices, training periods, and input variables. The first comes from the Center for Analysis and Prediction of Storms’ (CAPS) Storm-Scale Ensemble Forecast system, called the CAPS ensemble, which consists of 12 WRF-ARW models with varied combinations of microphysics and planetary boundary layer (PBL) parameterizations and perturbed initial and boundary conditions. The 2014 CAPS ensemble uses 4-km horizontal grid spacing, while the 2015 CAPS ensemble uses 3-km grid spacing. The CAPS ensemble is initialized with three-dimensional variational data assimilation (3DVAR) that includes radar and begins at 0000 UTC with hourly output for 60 h. Individual member parameterization configurations are listed in Table 1. The CAPS ensemble uses the following microphysics schemes: Thompson (Thompson et al. 2004), Morrison [with graupel settings for the rimed ice species; Morrison et al. (2009)], Milbrandt and Yau two-moment (MY; Milbrandt and Yau 2005), WRF double-moment 6-class (WDM6; Lim and Hong 2010), and Predicted Particle Properties (P3; Morrison and Milbrandt 2015; Morrison et al. 2015).
The second ensemble forecast set is from the National Center for Atmospheric Research (NCAR) ensemble (Schwartz et al. 2015), which consists of ten 3-km-grid-spacing WRF members initialized from mesoscale analyses produced using the Data Assimilation Research Testbed (DART; Anderson et al. 2009) ensemble adjustment Kalman filter data assimilation system on a 15-km grid. All members use the Thompson microphysics scheme (Thompson et al. 2004) and the Mellor–Yamada–Janjić (MYJ; Mellor and Yamada 1982) PBL scheme. The ensemble began running daily in April 2015 at 0000 UTC with plans to continue running through at least summer 2017. For this project, model runs from 1 May through 31 July 2015 were used.
Hail size observations are derived from the NOAA/National Severe Storms Laboratory Multi-Radar Multi-Sensor (MRMS) radar mosaic (Zhang et al. 2011). MRMS merges radar reflectivity from multiple radars onto a 0.01° × 0.01° uniform grid with 2-min updates and performs a series of quality control procedures. The MRMS maximum expected size of hail (MESH; Witt et al. 1998; Cintineo et al. 2012) product is used for the hail size observations. MESH is a power-law relationship between the severe hail index, which is a product derived from radar reflectivity values above the melting level, and hail reports from nine hail events in 1992 in Oklahoma and Florida. The MESH relationship is calibrated such that the MESH value should exceed 75% of reported hail sizes for a given severe hail index value. Because MRMS provides more complete information about the full depth of a storm than a single radar, and because MESH indirectly accounts for melting effects through melting-layer height information and calibration to hail reports, MRMS MESH shows good spatial coverage when compared with high-resolution hail reports (Cintineo et al. 2012) and is less biased by population compared to NOAA Storm Data hail reports. However, MESH hail sizes exhibit a large size bias compared with high-resolution hail reports (Cintineo et al. 2012) and were not designed for identifying hail sizes below 19 mm. If the distribution of MESH sizes below 19 mm is extremely biased, it would also bias the size distribution forecasts from the ML models. Even with these known issues, MRMS MESH offers the best spatial and temporal coverage, and any size or location issues with MESH are small compared to the spatial and intensity forecast errors in the 24-h NWP predictions.
b. Storm object identification, tracking, and matching
The storm data processing procedure is summarized in Fig. 1. An object-identification method locates storms, and data extraction and forecasting processes are performed. Object-based forecasting provides the advantages of identifying relevant areas for predicting severe weather and reducing the volume of data processed. These methods exhibit parameter sensitivity, which can vary the composition of the object set. The enhanced watershed method (Lakshmanan et al. 2009) identifies potential hailstorm objects from a storm field by identifying local maxima (Fig. 2) and then growing the objects until they meet area and intensity criteria. Hourly maximum CTG mass is used as a potential hailstorm surrogate because it identifies any storm containing a large amount of graupel, including both ordinary thunderstorms and supercells. Graupel is used instead of hail because most of the CAM members’ microphysics schemes feature one rimed ice species with fixed graupel-like properties. The exceptions are the MY scheme, which has separate hail and graupel species, and the P3 scheme, which allows the particle properties to vary based on atmospheric conditions. Large CTG mass may originate from large graupel or many small graupel, but either could lead to surface hail. ML models utilize environmental information to determine the difference. A minimum intensity of 3 kg m−2 captures both weak and strong storms and captures the varied distributions of graupel found with the different microphysics schemes. A minimum area of 144 km2 (16 grid cells on a grid with 3-km spacing) and a maximum area of 900 km2 are specified to isolate individual storm cells. The minimum area corresponds to the minimum resolvable area for a single storm cell, and the maximum area roughly corresponds to the area that a storm cell of area 300 km2 could traverse within an hour. Grid points outside the contiguous United States are excluded because of the lack of reliable verification data.
Once storm objects are identified in the models and observations, the objects are linked together through time into tracks. The tracking algorithm identifies trends within the storms and captures the full life cycle of each storm. The Hungarian method (Munkres 1957), a globally optimal matching algorithm, links tracks. The Hungarian method also forms the basis of the Thunderstorm Identification, Tracking, Analysis, and Nowcasting (TITAN) storm tracking algorithm (Dixon and Wiener 1993). Although Euclidean centroid distance works well with circular objects and frequent temporal updates, hourly maximum storm objects tend to be elongated along the axis of motion. To mitigate this issue, spatial cross-correlation motion estimation (Lakshmanan and Smith 2010) translates the centroid before matching (Fig. 2). The cross-correlation search is constrained by the storm dimensions to minimize motion estimation error. The maximum centroid distance for tracking is 24 km, which is empirically found to connect storms appropriately while minimizing the issue of tracks jumping to adjacent storms. Although tracking has been empirically tuned, not all connections will be correct, which adds some uncertainty to the tracking-related variables.
Forecast and observed storm tracks are matched assuming that the storms in similar locations and times should have a stronger connection. Matching tracks requires a metric that can account for both spatial and temporal differences and can compare different durations. This track matching distance is
Equation (1) contains the distance between the starting points , the difference in start times , the duration difference , and the mean area difference . Tight maximum distances, which are the denominators in (1), for start location difference (160 km) and start time difference (3 h) are used to limit the search area. Duration (16 h) and area (200 km2) break ties. The distances have units of kilometers, time and duration differences use hours, and area differences use square kilometers. If any components exceed their maximum, the track pairing is excluded from consideration. The scaling values are chosen empirically based on inspection of matches and an examination of the percentage of matched storms in the training data. Each forecast track is paired with the closest observed track, so multiple forecast tracks can be matched with the same observed track. This approach accounts for inherent spatial and temporal errors in storm placement. The main limitation is that small changes in the weights could result in different matches. The choice of matching function parameters is inherently subjective, but some combination of space, time, and storm properties should provide reasonable results.
c. Machine learning procedure
After storm tracking and matching has occurred, other meteorological variables are extracted from each forecast storm. This approach extracts statistics about the distribution of meteorological variable values from grid cells within the extent of a storm. These statistics are the mean, standard deviation, minimum, maximum, median, 10th and 90th percentiles, and skew. The full list of input variables for the CAPS ensemble is listed in Table 2, and the full list for the NCAR ensemble is in Table 3. Storm variables are extracted from model grids at the same forecast time as the storm, and environment variables are extracted from the previous hour to sample the environment before the storm traverses that area. The CAPS and NCAR ensembles use different archive and postprocessing systems, so some variables are available in one system but not the other.
The ML hail forecast procedure is summarized in Fig. 3. The goal of the ML algorithm is to use the meteorological variables within each storm footprint to forecast hail occurrence and a parametric approximation of the observed spatial MESH distribution. To distinguish the ML forecast product from the output of NWP model microphysics schemes, the output of the ML models will be named as a forecast MESH distribution instead of a forecast hail size distribution. If a forecast storm track is matched with a MESH track, then hail is assumed to have occurred. The spatial MESH distribution is modeled as a gamma distribution fit to the MESH values using maximum likelihood estimation. Fitting a parametric distribution to the spatial MESH values reduces the dimensionality of the prediction problem and preserves the proportional relationships between small and large hail areas within each storm. The gamma distribution probability density function takes the following form:
where is the gamma function, α is the shape parameter, is the location parameter, and β is the scale parameter. Unlike a bulk microphysics scheme, which estimates the moments of a parametric hail or graupel size distribution describing the sizes within each grid cell, this MESH distribution describes the radar-estimated hail sizes over the entire area of a storm. An example of a MESH object and the gamma distribution fit to its MESH distribution is shown in Fig. 4. The shape parameter affects the skewness, with small shape parameter values causing more skew and larger values leading to less skew. The location parameter determines the minimum value of the distribution and is fixed at 6 mm. The scale parameter adjusts the width of the gamma distribution and has the same units as the modeled quantity (Wilks 2011).
We use ML models from the Scikit-learn package version 0.16.1 (Pedregosa et al. 2011). A stacked model approach is used with a classifier model predicting whether or not hail occurs, along with a regression model that predicts the MESH distribution parameters. A random forest (RF; Breiman 2001) model is used as the hail occurrence classifier. RF is an ensemble of decision trees in which each decision tree is grown from a resampled version of the original training set, and in which a different random subset of input variables is evaluated for inclusion into the tree at each branch in the growing process. These sources of randomness ensure the independence of each tree. The RF classifier uses a configuration that weights each training example by the inverse of its class frequency, has 500 trees, and 10 minimum samples at each leaf node. The maximum number of features sampled at each split is chosen from the square root of the total number of features: 20, 30, or 50 features based on which choice maximizes the area under the receiver operating characteristic (ROC) curve (Mason 1982) through a grid search with fivefold cross validation.
Multitask learning (Caruana 1997) is used by RF and an elastic net to predict the shape and scale parameters of the MESH distribution simultaneously. During training, the models choose parameters to minimize the total error across all outputs. Multitask learning provides additional information about the fitting process and maintains correlations among the output values. If separate ML models predict the shape and scale parameter values independently, then the predicted values will be uncorrelated. The spatial MESH distribution parameters are log transformed and normalized before being fit by the ML models in order to capture the log-linear relationship between the shape and scale parameters, as shown in Fig. 5, and to ensure that both parameters are weighted equally by the ML loss functions. Both RFs and elastic net regression (Zou and Hastie 2005) support multitask learning and are used in this experiment. The elastic net is a linear regression model with additional terms in its optimization function that penalize large weights. A default RF, called random forest, has 500 trees, minimum samples at the split node of 10, and sampling square root of the total number of features. An optimized RF, called random forest CV, has 500 trees; fivefold cross-validated grid searching of the maximum number of features from the square root, 30, 50, and 100 features; and minimum samples at a splitting node from 5, 10, and 20 samples. The elastic net determines the ratio between the ridge and lasso terms from an independent validation set and normalizes the input features.
An RF classifier forecasts the probability of hail occurring, and RF and elastic net forecast the MESH size distribution for each storm in the testing set. To compare the ML model forecasts with other hail forecasting methods, the ML MESH distribution is translated into a deterministic set of MESH values overlaid onto the original storm in the NWP model (Fig. 3). The resulting grid can then be treated like any other diagnostic product and different gridded products, such as storm surrogate ensemble probabilities and ensemble maximum fields, can be generated. These products are used by forecasters at NOAA’s Storm Prediction Center to create day-ahead hail outlooks, so presenting the forecasts in a familiar format helps build trust in the product and simplifies incorporating the information into a forecaster’s existing workflow. If a forecast storm has a probability of hail occurrence less than 50%, it is removed from the grid. Weighting each training example by the inverse of the class frequency (examples from the no-hail class are weighted less than examples from the hail class) ensures that the 50% threshold maximizes the separation between the prediction distributions of the two classes. A deterministic threshold is used to retain or remove individual forecast hailstorms because we did not want to decrease the range of forecast hail sizes artificially by combining the hail occurrence probability with the spatial MESH distribution. Because the gamma distribution has no upper limit, we utilized Monte Carlo sampling to generate hail size values for each storm grid point. For each grid point occupied by a forecast storm, 1000 random samples are drawn from the forecast spatial MESH distribution [Eq. (2)], corresponding to 1000 statistical realizations of the MESH spatial field. The MESH values within each statistical realization are sorted by size. The sorted sizes are then associated with each grid cell occupied by the forecast storm in order of CTG intensity. For example, the set of largest sampled MESH sizes are associated with the largest CTG grid point. Within each set of MESH samples, the mean and percentiles of the MESH sizes are then calculated and output as gridded deterministic fields. The mean forecast MESH size field is used for further evaluation and comparison with other hail forecast methods.
Storm-surrogate ensemble probabilities (Sobash et al. 2011) of hail at different sizes are created from the ML models and are compared with similar probabilities derived from thresholding the storm surrogate variables and HAILCAST. A storm surrogate probability forecast creates a smooth probability field from a deterministic forecast that mimics a convective outlook from the Storm Prediction Center, which is the intended product for which these hail forecasts would provide guidance. The 3-km grid for each ensemble member is subsampled into a 42-km grid, and each point on the 42-km grid is marked with a 1 if at least one 3-km grid point within a 42-km radius over a 24-h period exceeds a specified intensity threshold. The 42-km radius has been chosen based on the Storm Prediction Center convective outlook criteria of at least one report within 40 km of a point (Hamill et al. 2005) and to be divisible by the 3-km grid spacing. A Gaussian filter with a standard deviation of 42 km is then applied to the coarse binary grid and spreads the probability mass to surrounding grid points to reflect the estimated spatial uncertainty. Larger standard deviations for the Gaussian filter result in a greater probability of detection but also decrease the sharpness and increase the false alarm ratio. Storm-surrogate neighborhood probabilities are also calculated for HAILCAST and storm surrogate variables, including simulated radar reflectivity at the −10°C level, UH, and CTG. UH thresholds of 75 and 150 m2 s−2, a reflectivity threshold of 60 dBZ (used for both 25- and 50-mm hail), and CTG thresholds of 25 and 50 kg m−2 were used to discriminate 25- and 50-mm-diameter hail, respectively. These thresholds were chosen based on previous experience with storm surrogate parameters and sensitivity studies for severe weather forecasting, such as Sobash et al. (2016).
The training data for the ML models were pooled so as to maximize the size of the dataset while preserving the temporal independence of the evaluation data. For the CAPS ensemble, each ML model is trained on a pool of ensemble members sharing the same microphysics parameterization scheme. The 2014 CAPS ensemble output is used to train the ML models, and the 2015 ensemble is used for evaluation. The training set consists of 20 model runs on weekdays between 6 May and 6 June 2014. Because the 2015 ensemble used the P3 microphysics scheme in place of WDM6, the P3 members are evaluated using models trained on the Milbrandt and Yau members. The testing dataset includes 18 runs on weekdays from 12 May to 5 June 2015. One run with missing MESH observation data during that period is excluded from the evaluation.
Training and evaluation for the NCAR ensemble is performed cyclically with a new round of training performed every 2 weeks and testing performed over the subsequent 2 weeks. Since each member uses the same parameterizations and is equally likely, all members are pooled into the training dataset for the ML models. The forecasting period runs from 15 May through 30 July 2015. Because the CAPS and NCAR ensembles have different training and evaluation periods, the evaluation statistics cannot be used to determine whether one ensemble configuration is better than the other.
The forecast methods are evaluated using ROC curves, performance diagrams, reliability diagrams, and attributes diagrams, as well as metrics related to all of those diagrams. The formulas for every score are shown in Table 4. ROC curves (Mason 1982) divide a probabilistic forecast into a set of binary forecasts by varying the decision threshold. The goal of a ROC curve is to show how well a forecasting method discriminates between two outcomes. The probability of detection (POD) for each decision threshold is plotted against the probability of false detection (POFD). Skilled forecasts will consistently have a higher POD than POFD for each decision threshold. The amount of skill is quantified by the area under the ROC curve (AUC). An AUC higher than 0.5 is considered skilled. The performance diagram (Roebber 2009) is a variation of the ROC curve in which the POFD is replaced with the success ratio (SR), which is the additive inverse of the false alarm ratio (FAR). In addition to these scores, the frequency bias (FB) is shown as rays expanding outward from the origin and the critical success index (CSI) as contours that maximize in the top-right corner of the plot. These statistics are all independent of the number of true negatives, so they are more sensitive to how well the positive event is predicted. This feature is particularly important for rare event forecast evaluation.
Probabilistic forecasts are also assessed for their reliability and resolution. Reliability is the difference between the forecast probability and the relative frequency of the forecast event occurring conditional on that probability being forecast. Resolution is the difference between the climatological probability of an event and the relative frequency of the event occurring for a given forecast probability. Information about reliability and resolution is contained within the Brier score (BS; Brier 1950), which can be decomposed into reliability, resolution, and uncertainty components (Murphy 1973). These components can be reorganized to calculate the Brier skill score (BSS), which is the difference between reliability and resolution divided by the uncertainty. The reliability diagram displays the observed relative frequency versus the forecast probability. If the forecast probability is greater than the observed relative frequency, then the model is overforecasting. The attributes diagram (Hsu and Murphy 1986) is a reliability diagram that also shows the climatological probability and the regions where the resolution is larger than the reliability and thus the BSS is positive. The attributes diagram also contains an inset showing the frequency of forecasts made at each probability threshold, which can be used to assess the sharpness, or variance across probabilities, of the forecasts. Reliability diagrams can also be used to evaluate the predictions of regression models by discretizing the forecast values and determining the mean observed value for a given range of forecast values.
a. Individual hailstorm forecast evaluation
Probabilistic forecasts of hail occurrence from the RF classifier are evaluated for the CAPS and NCAR ensembles. The top panel in Fig. 6 shows the ROC curve and reliability diagram for the CAPS ensemble members. Both diagrams evaluate probabilities in increments of 0.05 from 0 to 1. The observed relative frequency in the reliability diagrams refers to the percentage of hailstorms observed when a certain probability was forecast. The member ROC curves cluster by microphysics. The ML forecasts for Thompson members possess the lowest AUCs (about 0.73), and P3 members have higher POD at low probability thresholds and AUCs around 0.78. In the reliability diagram, all members display a consistent underforecasting bias, which may be caused by the change in grid spacing between the 2014 and 2015 CAPS ensembles and any resulting changes in the forecast variable distributions. At low forecast probabilities, the Thompson microphysics members display a higher observed relative frequency than other members, resulting in a lower BSS. For the NCAR ensemble (Fig. 6, bottom), the RF shows AUCs of about 0.70. The reliability diagram indicates that the probability forecasts are reliable but have a small overforecasting bias. The CAPS ensemble members show more variability in performance, which is expected from using multiple microphysics schemes. The differences in scores have not been evaluated for statistical significance.
Reliability diagrams for the shape and scale parameters in the CAPS and NCAR ensembles indicate how closely the forecast values correspond to the observed values on average (Fig. 7). The CAPS ensemble exhibits poor calibration with an underforecasting bias for small values of the shape and scale parameter and an overforecasting bias for large values. All of the ensemble members show similar issues. The NCAR ensemble reliability curves exhibit better calibration than the CAPS ensemble but still show biases at extreme parameter values. Poor sharpness in the shape and scale parameter forecasts results in forecast MESH distributions that underestimate the probability of very large hail when conditions are favorable and overestimate the probability of large hail when conditions are not favorable.
b. Full-period forecast evaluation
Storm surrogate probabilities were calculated over the 24-h period from 1200 to 1200 UTC for the ML models and raw storm surrogate variables over the testing sets. These probabilities are analogous to Storm Prediction Center convective outlooks and allow for a more direct comparison between the two types of forecast methods.
The performance diagrams in Fig. 8 show the POD and success ratio at 5% probability intervals. Lower probability thresholds will have higher POD and appear in the upper portion of the diagram. Because POD decreases faster than FAR for all models, we focus our analysis on the differences in FAR over a range of probability thresholds. For the 25-mm-diameter hail threshold, the ML methods all have lower FAR values than every other model at all probability thresholds (Fig. 8). The FAR for UH is nearly as low as the ML models. For the 50-mm hail threshold, all models have a much higher FAR, but their relative rankings stay the same (Fig. 8). At higher probability thresholds (near the bottom of the diagram), the ML models have much lower FAR compared with the other methods but are only able to detect a small percentage of events. The NCAR ensemble performance diagrams highlight the difference in false alarms for each method at high probability thresholds. UH has a lower POD than total graupel for 25-mm hail but also has a lower FAR for most probability thresholds (Fig. 8). The ML methods are able to maintain a lower FAR than UH for all thresholds. The RFs maintain a higher CSI consistently (Fig. 8). The elastic net underperforms compared to the RFs, and the performance of UH is comparable.
The attributes diagrams in Fig. 9 indicate the reliability and sharpness of each method for predicting the probability of 25- and 50-mm hail over a 40-km area, unlike the reliability diagrams in Fig. 6, which evaluated the probability of hail occurring for each potential hail storm object. The ML methods and UH are all reliable at the 25-mm threshold (Fig. 9) while the other storm surrogate methods are all overconfident. The overconfidence improves their AUC, because they detect more events at a given probability threshold, but hurts their reliability. All of the methods are sharp but HAILCAST has the largest number of forecasts at high probabilities. Every method is overconfident and less sharp at the 50-mm threshold (Fig. 9). The ML methods have less of an overconfidence bias than reflectivity and greater sharpness than CTG. HAILCAST is more overconfident than the ML models but still produces sharp forecasts. In addition to being better discriminators for the NCAR ensemble, the ML methods also produce more reliable probabilities. While an overconfidence bias is apparent at both hail size thresholds, the ML methods have higher BSSs for 25- (Fig. 9) and 50-mm-diameter hail (Fig. 9). Both UH and the ML models are underconfident at low probability thresholds and then trend toward overconfidence at higher thresholds while still showing skill. The elastic net produces the most reliable probabilities for hail sizes of 50 mm at the expense of some sharpness compared with UH (Fig. 9). The RFs are overconfident with 50-mm-diameter hail probabilities.
The spatial distribution of probabilistic hail forecasts from the 2015 CAPS ensemble over all evaluated runs is shown in Fig. 10. The chosen 10% probability threshold displays the maximum spatial extent of the forecasts at a threshold commonly used by forecasters to assess hail risk. RF and UH capture the two observed maxima in 25-mm hail frequency in the Texas Panhandle and across northeast Colorado. Overall, the RF and UH methods are comparable in their spatial forecast frequencies. The frequency maxima for HAILCAST and CTG are displaced eastward compared to the observed maximum. HAILCAST and CTG also produce large numbers of hail forecasts along the Gulf coast whereas the RF and UH have none in that area. The RF captures the extent of the 50-mm reports (Fig. 10) while the other methods produce more 50-mm hail forecasts to the east of the observed maximum. RF and UH present two local maxima in Texas and northeast Colorado whereas CTG and HAILCAST have one maximum in north-central Texas.
The spatial distribution of hail forecasts from the NCAR ensemble is displayed in Fig. 11. Most of the observed hail reports occur along the high plains east of the Rocky Mountains with a smaller secondary maximum in the Southeast. For 25-mm hail, the ML methods capture the full extent of the high plains hail observations while slightly underforecasting in the Southeast. UH follows a similar pattern. CTG greatly overforecasts hail frequency in the Southeast and underforecasts hail in the high plains with an eastward bias in the maximum. An eastward bulge in eastern Nebraska from overnight MCSs is also visible in both the forecast and the observed relative frequencies from all models. The ML models overforecast the frequency of hail in the high plains while CTG does the same over the eastern half of the United States. The ML methods and UH have the most misses along the Gulf Coast. For 50-mm hail, the RF captures the full extent of the hail events in the high plains while the elastic net is more concentrated in central Kansas (Fig. 11). The UH forecasts concentrate their presence in western Nebraska with a secondary maximum in Iowa, while CTG features the local maximum in Iowa but not in Nebraska.
ML storm-based hail forecasts were evaluated and compared with other diagnostic hail forecasting methods to determine what value may be added to raw CAM ensemble output by these approaches. The different forecasting methods were evaluated using spring and summer 2015 runs of the CAPS and NCAR CAM ensembles and validated against gridded radar-estimated hail sizes. The ML models showed skill in discriminating between forecast storms that produced hail and those that did not but were underdispersive with the hail size forecasts. Additional calibration of the raw ML forecasts improved the sharpness of the size forecasts for the NCAR ensemble. For 24-h hail outlooks, the ML methods were better able to identify hail threat areas while minimizing false alarms compared with HAILCAST and other storm-surrogate methods. Of the existing storm surrogates, UH provided the best indicator of large hail. This result is expected given that UH is a good proxy for the strong, rotating updrafts found in supercells. The skill of ML hail forecasts is more sensitive to the choices made during preprocessing than to the choice of the ML model. Any statistical or ML model will produce less skilled forecasts if NWP model configuration changes result in large alternations to the distributions of important input variables because the correlation with the forecast output variable will have changed. The NCAR ensemble ML algorithms benefited from having a larger amount of training data due to the longer training period and the ability to aggregate across 10 diverse but similarly configured ensemble members.
The verification statistics and spatial maps showed that the hail forecasting methods exhibit more pronounced forecasting biases in particular areas. The ML models and UH work very well for discriminating hail in plains supercells, particularly storms in the high plains and just east of the Rockies. Both of these methods take into account shear and updraft intensity. CTG and the 2015 HAILCAST are more influenced by CAPE and less by shear, so they produce hail in any storm that has a strong updraft. For a given CAPE environment, the amount of shear has a significant impact on the amount of hail growth (Dennis and Kumjian 2017). The ML model uses the LCL height to account for relative humidity effects on melting and gives lower probabilities to areas with low LCL heights, but it is too aggressive with lowering probabilities in the Southeast and misses too many storms. In addition, CTG does not account for melting, and HAILCAST melts hail in cloud based on hailstone temperature, cloud temperature, and collisions with liquid water and below cloud level based on the mean wet-bulb temperature in that layer. HAILCAST and CTG do not fully account for the melting that occurs with storms in the Southeast and tend to overpredict hail there. Because of the updraft issue, HAILCAST was significantly updated following the 2015 HWT Spring Experiment to account for hail growth occurring more on the edge of updrafts than in the center; these improvements are discussed by Adams-Selin and Ziegler (2016). The hailstorms in the Southeast in May–July tend to be pulse thunderstorms (Smith et al. 2012), which lack rotating updrafts, so hail methods calibrated to plains supercells that make up the bulk of hail events will not detect them as well. Using a lower decision threshold may help capture some of these cases that are currently missed with the 50% decision threshold. Alternatively, regionally calibrated decision thresholds may capture different environments more accurately.
While constraining ML hail forecasts to areas where the NWP model produces storms does produce better forecasts, there are situations in which this approach will struggle compared with ingredients-based methods. If the CAM forecast contains errors with the placement, timing, and evolution of storms, then the hail forecasts dependent on those storm forecasts will also contain those errors. These struggles tend to occur in scenarios where large-scale forcing is weaker, leading to convection initiation and evolution being governed by poorly observed mesoscale effects. When an area receives multiple days of convection, errors in forecasting the diurnal cycle of convection lead to additional spatial and temporal biases. Hail forecasts based solely on environmental parameters will tend to have better coverage in these situations and will detect hail threat areas that storm-based methods may miss.
Further improvements to ML model forecasts are possible with some additions to the current modeling framework. Additional diagnostic variables describing conditions within the hail growth region and the melting layer near the ground may provide stronger links between those processes and the forecast hail size. A more thorough sensitivity study of the forecast and observed storm object identification and matching procedure could lead to stronger correlations between the different weather variables and radar-estimated hail size. Longer convection-allowing model archives with more consistent configurations over multiple severe weather seasons could allow the ML models to capture a larger sample of extreme hail events. The evaluation of hail forecast accuracy conditioned on the storm environment could assist forecasters with knowing when to trust certain models more. Probabilities from the MESH distributions themselves could be interrogated by a forecaster in an interactive viewer to identify which areas have storms with a higher probability of very large hail. Creating and evaluating these visualizations is an area of future research. The ML hail forecasts were produced using the Hagelslag (https://github.com/djgagne/hagelslag) open-source Python package developed by the lead author. Data used for this project are available upon request.
The authors thank dissertation committee members Jeffrey Basara, Andrew Fagg, and Michael Richman for their helpful feedback. This work is part of the Severe Hail Analysis, Representation, and Prediction (SHARP) Project, which is funded by the National Science Foundation through Grant AGS-1261776. The SHARP team includes Nathan Snook, Youngsun Jung, and Jonathan Labriola in addition to coauthors DJG, AM, and MX. The authors thank James Correia Jr., Michael Coniglio, Adam Clark, Israel Jirak, David Imy, Kent Knopfmeier, Chris Melick, and Burkely Gallo for helping integrate the ML hail forecasts into the NOAA Hazardous Weather Testbed Experimental Forecast Program and providing valuable feedback about the forecasts. Rebecca Adams-Selin provided helpful feedback regarding HAILCAST and the overall paper. The anonymous reviewers also provided very thorough feedback and strengthened the quality of the paper. The CAPS ensemble model runs were funded through the NOAA Collaborative Science, Technology, and Applied Research (CSTAR) Program Grant DOC–NOAA NA13NWS4680001 and were run on the Texas Advanced Computing Center Stampede supercomputer. Scientists at CAPS, including Fanyou Kong, Kevin Thomas, Yunheng Wang, and Keith Brewster, contributed to the design and production of the CAPS ensemble forecasts. The NCAR ensemble was developed by Craig Schwartz, Glen Romine, Ryan Sobash, and Kate Fossell and is run on the Yellowstone supercomputer. NCAR is sponsored by the National Science Foundation.
Current affiliation: National Center for Atmospheric Research, Boulder, Colorado.