1. Introduction
Mesoscale snowbands are narrow regions of locally enhanced snowfall rates, most commonly forced by midlevel frontogenesis below a layer of reduced static stability (Nicosia and Grumm 1999; Novak et al. 2004; Baxter and Schumacher 2017). Snowbands, with widths generally between 20 and 100 km, occur frequently within midlatitude cyclones and can cause significant societal impacts, but they are challenging to forecast because of the small scale of both the bands and their forcings (Novak et al. 2006; Radford et al. 2019). In addition, numerical model predictions of snowbands are highly sensitive to initial conditions and microphysics parameterization (Nicosia and Grumm 1999; Novak et al. 2006, 2008; Evans and Jurewicz 2009; Novak and Colle 2012; Colle et al. 2014; Molthan et al. 2016). Thus, there is substantial value in measuring the current snowband forecast capabilities of convection-allowing model ensembles with traditional contingency table statistics and in investigating strategies to produce more accurate and precise snowband forecasts.
Prior to the operational implementation of convection-allowing models (CAMs), Novak et al. (2006) advocated that forecasters apply an ingredients-based approach to short-range snowband forecasts. At the time, the highest resolution model investigated by Novak et al. (2006) was the 12-km Eta model (Black 1994; Rogers et al. 2001). This approach encourages the identification of the most at-risk regions based upon predicted locations of enhanced midlevel frontogenesis beneath layers of weak moist symmetric stability. Corroboration with model vertical velocities and quantitative precipitation forecasts (QPF) can then be applied to determine whether the forcing is strong enough for band development. Within 12 h of band development, comparison of the model-forecast evolution of banding ingredients to observations allows for refinement of band timing and location. Though Novak et al. (2006) focused on moist symmetric (in)stability, other work has suggested that the release of moist symmetric instability may not be the dominant instability associated with band formation and that bands are more likely attributable to conditional instability (Trapp et al. 2001; Morales 2008; Novak et al. 2010). Cloud-top convective generating cells are more typically associated with smaller-scale multibands, but they are another commonly observed phenomenon that may contribute to primary bands (Rosenow et al. 2014; Rauber et al. 2015; McMurdie et al. 2022).
Since then, increased accessibility to CAMs has changed the forecasting landscape for mesoscale features. With grid spacings of 4 km or less, CAMs can explicitly represent some convective-scale processes, including small-scale frontogenetical forcing and slantwise convection (Novak et al. 2008). Other, smaller-scale processes, such as convective generating cells, remain unresolved at this grid spacing. Theoretically, this should result in more accurate depictions of mesoscale snowbands in short-range CAM forecasts. Novak et al. (2008) compared the evolution of an intense mesoscale snowband in a model simulation with 4-km grid length to high-resolution observations, finding under-forecast precipitation and an axis displacement of about 50 km. Radford et al. (2019) evaluated 3 years of snowband forecasts by the High-Resolution Rapid Refresh (HRRR) model and found mediocre forecast skill. In particular, the HRRR demonstrated large temporal error, even at forecast leads of less than 12 h. Despite the potential of CAMs, many forecasters still favor an ingredients-based approach rather than relying upon CAM simulated reflectivity or QPF (Radford et al. 2023).
Given the apparent limitations of deterministic CAMs, Radford et al. (2019) suggested a probabilistic approach to snowband forecasting using CAM ensemble output, such as the High-Resolution Ensemble Forecast (HREF; Roberts et al. 2019) system. A probabilistic approach was also suggested and tested by Novak and Colle (2012), though with a 12-km, non-convection-allowing ensemble prediction system. Their system provided valuable uncertainty information and helped discriminate between cases with high and low predictive skill; they identified model initial condition uncertainty as an important distinguishing characteristic.
CAM ensembles offer tremendous potential for mesoscale snowband forecasting, but they also present new challenges. Namely, how can we sift through the vast amount of output from all of the members to identify only the material relevant to snowband forecasts? Furthermore, how do we combine the output from individual members to produce calibrated snowband forecasts? One option is to apply machine learning methods, which search for patterns between the input data and the outcome that minimize prediction error. A relatively simple example of machine learning is a decision tree, which uses a series of conditionals to make a prediction. Random forests (Breiman 2001; Chase et al. 2022) generate an ensemble of decision trees, each of which is trained on a random sample of the training data and considers a random subset of the input variables at each split, increasing the independence of the trees and reducing overfitting. The predictions of all of the trees are averaged to produce an overall prediction. Random forests have been applied in numerous meteorological classification and prediction studies, including severe winds, tornadoes, and hail (e.g., Gagne et al. 2017; Burke et al. 2020; Hill et al. 2020; Loken et al. 2020). Outside of severe weather, random forests have also been applied to the prediction of extreme precipitation (Herman and Schumacher 2018a,b).
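As a minimal illustration of this bagging-and-averaging idea (a self-contained sketch with synthetic data, not the configuration used in this study; section 2 describes the actual setup):

```python
# Minimal random forest sketch with synthetic data (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                   # 1000 samples, 8 predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)  # synthetic binary outcome

# Each tree fits a bootstrap sample of rows and considers a random subset
# of predictors at each split; class probabilities are averaged over trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
print(rf.predict_proba(X[:5])[:, 1])             # averaged tree "votes"
```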
We train three random forests with HREF output to predict the occurrence of snowbands within forecast precipitation features and compare the forecast skill of each to a simple HREF probability model. The first random forest uses only explicit HREF output, such as simulated reflectivity and explicit snowband identifications (Radford et al. 2019), to predict band development. The second uses only environmental ingredients from HREF output, such as midlevel frontogenesis, midlevel stability, and vertical motion. The third uses both the environmental ingredients and the explicit precipitation variables. Analysis of the performance of these three models serves two purposes: First, comparing the skill of the environmental-predictors random forest with that of the explicit-predictors random forest allows us to estimate the relative value of an ingredients-based forecasting strategy versus an explicit CAM forecasting strategy. Second, the evaluation of all three random forests enables us to produce the best possible predictions of snowband occurrence given all of the available HREF output.
2. Data and methods
Following Radford et al. (2019) and Radford and Lackmann (2023), the occurrence of snowbands is determined from hourly CONUS-wide base reflectivity mosaics provided by the Iowa Environmental Mesonet (IEM) between the months of November and March for the 2017/18, 2018/19, 2019/20, 2020/21, and 2021/22 winter seasons. March 2022 is excluded from this study because it was unavailable at the time our analysis began. Reflectivity images are supplemented with the HRRR analysis categorical precipitation type to estimate the predominant precipitation type of each feature.
The detection of observed snowbands from reflectivity mosaics follows the same procedure as Radford et al. (2019), with the same alterations as Radford and Lackmann (2023): HRRR analysis categorical “p-type” is used as the proxy for precipitation type rather than 2-m temperature, the minimum length criterion (referring to the length of the feature exceeding the variable reflectivity threshold) is increased from 250 to 300 km, and a minimum average intensity of 15 dBZ is added. As a reminder, this procedure first identifies low-intensity precipitation features with a fixed 0-dBZ threshold, then calculates an intense-precipitation threshold for each feature at 1.25 standard deviations above the mean feature reflectivity, consistent with the concept that snowbands are intense features relative to their surroundings. The 1.25-standard-deviation threshold is allowed to vary slightly, between 1.10 and 1.40 standard deviations, to yield more consistent identifications from time step to time step. Positive band identifications and null identifications serve as the predictands for our models.
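The sketch below illustrates this detection logic on a single 2D mosaic; the grid spacing, helper names, and use of scikit-image are our illustrative assumptions, not a reproduction of the Radford et al. (2019) code.

```python
# Sketch of variable-threshold band identification on one reflectivity
# mosaic (2D array of dBZ). GRID_KM is an assumed grid spacing.
import numpy as np
from skimage.measure import label, regionprops

GRID_KM = 2.0          # assumed mosaic grid spacing (km)
MIN_LENGTH_KM = 300.0  # minimum length of the intense region
MIN_MEAN_DBZ = 15.0    # minimum average intensity

def find_bands(refl, sigma=1.25):
    """Return region properties of candidate bands in one mosaic."""
    bands = []
    # Step 1: low-intensity precipitation features at a fixed 0-dBZ threshold.
    for feat in regionprops(label(refl > 0.0)):
        vals = refl[feat.coords[:, 0], feat.coords[:, 1]]
        # Step 2: feature-relative intense threshold (mean + sigma*std;
        # sigma is nominally 1.25 but allowed to vary from 1.10 to 1.40).
        intense = np.zeros(refl.shape, dtype=bool)
        intense[feat.coords[:, 0], feat.coords[:, 1]] = (
            vals > vals.mean() + sigma * vals.std())
        # Step 3: length and mean-intensity criteria on the intense regions.
        for cand in regionprops(label(intense)):
            cand_vals = refl[cand.coords[:, 0], cand.coords[:, 1]]
            if (cand.major_axis_length * GRID_KM >= MIN_LENGTH_KM
                    and cand_vals.mean() >= MIN_MEAN_DBZ):
                bands.append(cand)
    return bands
```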
The predictors for our random forest models are sourced from HREF output, provided in an archive maintained by the National Severe Storms Laboratory (NSSL). The HREF is an ensemble of opportunity, or “poor man’s ensemble” (Ebert 2001), comprising 8 members between 2017 and 2019 and 10 members between 2019 and 2022, with various overlapping physics configurations and initial conditions. The configuration of the HREF changed twice during the period of study, as discussed by Radford and Lackmann (2023). HREFv1 included the Advanced Research version of the Weather Research and Forecasting (WRF-ARW) Model, the Nonhydrostatic Multiscale Model on the B-grid (WRF-NMMB), the NSSL version of the WRF (WRF-NSSL), the North American Mesoscale Nest (NAM Nest), and 12-h time-lagged initializations of these same models. HREFv2 added the HRRR and a 6-h time-lagged HRRR, and HREFv2.1 replaced the WRF-NMMB and 12-h time-lagged WRF-NMMB with the FV3 and a 12-h time-lagged FV3. For the most part, we evaluate the HREF as a single entity for the 5-yr period, with missing data populating the NMMB, HRRR, and FV3 features for various periods, but, where possible, we break down performance by HREF version.
We chose a “precipitation feature”–based approach to forecasting bands. That is to say, given forecast information about a highly predictable precipitation feature, how well can we predict whether a snowband will be observed within that feature at a given time? This is similar to the approach of Gagne et al. (2017), which predicted the occurrence of hail within individual storm objects. The approach first requires identification of the high-predictability precipitation features. We do this by identifying regions longer than 300 km in which the median HREF 1000-m simulated reflectivity exceeds 0 dBZ, as this indicates majority agreement on at least light precipitation. This minimum length corresponds to the smallest possible feature that could contain a band. Note that while we generally think of this process as identifying a stratiform feature that may or may not contain an embedded snowband, it works equally well for isolated bands, as they still meet the criterion of reflectivity greater than 0 dBZ. Whether these different banding contexts have differing predictabilities or should be handled independently is an open question for future work, though sample sizes will likely be limiting. We restrict our precipitation features to those with a centroid east of 104°W due to radar beam blockage in the western United States, which severely limits observing capabilities there (Lundquist et al. 2019; James et al. 2022).
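A corresponding sketch of the HREF-forecast precipitation feature identification is given below; the grid spacing and array layout are assumptions for illustration.

```python
# Sketch of HREF-forecast precipitation feature identification.
import numpy as np
from skimage.measure import label, regionprops

GRID_KM = 3.0  # assumed HREF grid spacing (km)

def find_precip_features(member_refl, lons):
    """member_refl: (n_members, ny, nx) 1000-m reflectivity (dBZ);
    lons: (ny, nx) longitudes in degrees east (negative west)."""
    median_refl = np.nanmedian(member_refl, axis=0)  # member agreement
    feats = []
    for feat in regionprops(label(median_refl > 0.0)):
        # Minimum length: the smallest feature that could contain a band.
        if feat.major_axis_length * GRID_KM < 300.0:
            continue
        # Keep only features with centroids east of 104°W (beam blockage).
        row, col = (int(round(c)) for c in feat.centroid)
        if lons[row, col] > -104.0:
            feats.append(feat)
    return feats
```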
We then use the same geographical area to extract our model predictors. The snowband identification procedure used for observed bands is applied to the 1000-m simulated reflectivity field of each HREF member (Fig. 1). This determines whether a snowband is explicitly forecast in the region by each member and also allows us to quantify the overlap between member band forecasts. Comparing these individual band identifications to observed bands allows us to assess the forecast skill of each HREF member with simple contingency table statistics. Furthermore, these member predictions allow us to develop a simple probabilistic band prediction model that serves as the baseline against which we compare our random forest models. This baseline, which we refer to as the “HREF threshold probability” model, makes band predictions based upon the fraction of HREF members that depict a band, with some intermediate probability threshold maximizing forecast skill by balancing probability of detection against false alarm ratio.
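The baseline reduces to a sweep over member-fraction thresholds, as in the following sketch; the input arrays and 10% threshold increments are illustrative assumptions.

```python
# Sketch of the "HREF threshold probability" baseline.
import numpy as np

def csi(hits, false_alarms, misses):
    """Critical success index = hits / (hits + false alarms + misses)."""
    denom = hits + false_alarms + misses
    return hits / denom if denom else 0.0

def best_threshold(member_bands, observed):
    """member_bands: (n_features, n_members) 0/1 explicit band forecasts;
    observed: (n_features,) 0/1 observed-band labels."""
    prob = member_bands.mean(axis=1)  # fraction of members with a band
    scores = {}
    for thresh in np.arange(0.1, 1.0, 0.1):
        forecast = prob >= thresh
        hits = int(np.sum(forecast & (observed == 1)))
        fas = int(np.sum(forecast & (observed == 0)))
        misses = int(np.sum(~forecast & (observed == 1)))
        scores[round(thresh, 1)] = csi(hits, fas, misses)
    return max(scores, key=scores.get), scores  # CSI-maximizing cutoff
```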
We hypothesize that the presence or absence of a snowband in each HREF member within the forecast precipitation feature is likely a strong predictor of whether a snowband will ultimately be observed, and it therefore serves as our first set of random forest predictors. If a snowband is forecast by an HREF member and its centroid falls within the HREF-forecast precipitation feature as defined above, a value of one is assigned to that member as a binary model feature; otherwise, a value of zero is assigned. This is performed for each HREF member, producing ten model features representing whether each member forecasts a snowband within the HREF-forecast precipitation feature. The individual member explicit band predictions are combined with statistics of the simulated 1000-m reflectivity and categorical snow fields to form the features for our “explicit” random forest model. For each of these fields, we calculate the mean, standard deviation, minimum, median, maximum, 10th and 90th percentiles, and skew within the precipitation feature for each HREF member. In addition to statistics of the precipitation feature at the given forecast hour, we also calculate statistics for the hour prior and the hour following to allow for small timing errors.
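The statistical summaries can be computed with a small helper along these lines (the field names, footprint mask, and dictionary layout are illustrative assumptions):

```python
# Sketch of per-member summary statistics within a feature footprint.
import numpy as np
from scipy.stats import skew

def feature_stats(field, mask):
    """Eight summary statistics of a 2D field within a boolean footprint."""
    vals = field[mask]
    return {"mean": vals.mean(), "std": vals.std(), "min": vals.min(),
            "p10": np.percentile(vals, 10), "median": np.median(vals),
            "p90": np.percentile(vals, 90), "max": vals.max(),
            "skew": skew(vals)}

# Usage sketch: reflectivity statistics for member m at hours t-1, t, t+1,
# where refl has shape (n_members, n_times, ny, nx):
# stats = {dt: feature_stats(refl[m, t + dt], mask) for dt in (-1, 0, 1)}
```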
The process for calculating environmental predictors for the random forest models begins with identifying a precipitation feature as defined above based upon the HREF-forecast median reflectivity (Fig. 2a). Any area with a length of at least 300 km where the HREF-forecast median reflectivity exceeds 0 dBZ is identified as an HREF-forecast feature (Fig. 2b, orange contour). We then use the area footprint of this feature to calculate environmental characteristics, such as the minimum, mean, standard deviation, and maximum simulated reflectivity of each HREF member within the orange contour area (Fig. 2c) for the valid hour, the hour prior, and the following hour. These statistics are the model features. Next, we determine whether the HREF-forecast precipitation feature is associated with an observed snowband at the valid forecast hour, the hour prior, or the hour following. Finally, the random forest is trained to use these environmental statistics to predict the probability that the HREF-forecast precipitation feature will be associated with an observed snowband in future cases. The predictors for the “explicit” model are shown in Table 1.
Table 1. Predictors for the “explicit” random forest. The mean, standard deviation, minimum, 10th percentile, median, 90th percentile, maximum, and skewness are calculated for each variable for each member within the precipitation features.
The implicit predictors, or environmental banding ingredients, comprise the same statistical summaries of categorical snow, with the addition of statistical summaries of maximum upward and downward vertical velocity (maximum in the vertical; these are still 2D fields), frontogenesis, and saturation equivalent potential vorticity (SEPV) at the 500-, 700-, and 850-hPa levels. SEPV, a proxy for moist symmetric stability, is calculated using MetPy (May et al. 2022), which follows the Bluestein (1993) formulation. “Metadata” are included as predictors in both models, including morphological traits of the precipitation feature (area, eccentricity, major axis length, minor axis length, orientation, and extent) as well as location and timing information (centroid latitude and longitude, initialization time, forecast lead, valid hour, and month). The predictors for the “implicit” model are shown in Table 2. The predictors in Tables 1 and 2 are then combined in a final random forest to determine whether combining explicit and implicit band predictors yields better performance than either on its own. Though we believe this set of input features is representative of the dominant processes associated with large, primary snowbands, we cannot rule out the possibility of other contributing forcing mechanisms, the absence of which could limit the skill of the implicit random forest.
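A sketch of the SEPV and frontogenesis calculations with MetPy is shown below. We assume MetPy 1.x with xarray inputs that carry units and grid metadata (so grid arguments can be inferred); the wrapper function itself is our own illustration.

```python
# Sketch of SEPV and frontogenesis via MetPy (assumes MetPy ~1.x and
# xarray DataArrays on isobaric levels with units/grid metadata).
import metpy.calc as mpcalc

def sepv_and_frontogenesis(pressure, temperature, theta, u_wind, v_wind):
    """Return SEPV and 2D kinematic frontogenesis on isobaric levels."""
    # Saturation equivalent potential temperature (theta_es).
    theta_es = mpcalc.saturation_equivalent_potential_temperature(
        pressure, temperature)
    # Substituting theta_es for theta in the baroclinic PV calculation
    # yields SEPV; negative values imply moist symmetric instability.
    sepv = mpcalc.potential_vorticity_baroclinic(
        theta_es, pressure, u_wind, v_wind)
    # Petterssen 2D frontogenesis from theta and the horizontal wind.
    fgen = mpcalc.frontogenesis(theta, u_wind, v_wind)
    return sepv, fgen
```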
Table 2. Predictors for the “implicit” random forest. The mean, standard deviation, minimum, 10th percentile, median, 90th percentile, maximum, and skewness are calculated for each environmental variable for each member within the precipitation features. Each statistic is calculated at the 500-, 700-, and 850-hPa levels for frontogenesis, SEPV, and maximum upward vertical velocity.
In the context of this study, “bands” refer to individual identifications in hourly radar reflectivity output rather than to banding events. That is to say, bands at consecutive timestamps are treated as independent, even though they may be part of the same banding event. Evaluating banding events would require tracking thousands of HREF-forecast precipitation features through time and is beyond the scope of this study. Additionally, we do not stratify bands by forecast lead, as this would severely limit sample sizes; however, Radford et al. (2019) showed relatively small differences in predictability for leads under 18 h. We acknowledge that these are significant limitations of our analysis, but they are necessary given computational constraints.
All random forest models are developed with the random forest classifier of the Python Scikit-Learn package (Pedregosa et al. 2011). We adopt the methods of Gagne et al. (2017) to configure our models, using inverse class weights and a fivefold cross-validation grid search that maximizes the area under the receiver operating characteristic curve (ROC-AUC; Mason 1982) to select the maximum number of features considered at each split (the square root of the number of features, 25, 50, or 100) and the minimum number of samples required to split a node (10, 25, or 50). Each forest is composed of 500 decision trees, and the splitting criterion is Gini impurity (Breiman et al. 1984). We evaluate model performance primarily using the critical success index [CSI; Donaldson et al. 1975; CSI = hits/(hits + misses + false alarms)], a metric analogous to the F score that is well suited to evaluating forecast skill for rare events because it excludes true negatives. CSI is effectively visualized on the Roebber (2009) performance diagram as a function of success ratio and probability of detection. Though CSI is our primary verification metric, we also examine ROC curves and reliability diagrams for a more complete picture of model performance. Following evaluation, we investigate the feature importance of each variable, based upon the decrease in Gini impurity, to determine the most valuable predictors. Results presented are for a 30% held-out test set composed of the most recent banding events, rather than a randomized test set, to reduce the effects of autocorrelation between temporally adjacent events.
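The configuration described above corresponds to a Scikit-Learn setup along the following lines (a sketch; X_train and y_train are placeholders for the assembled predictor matrix and band labels):

```python
# Sketch of the random forest configuration and grid search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(
    n_estimators=500,         # 500 trees per forest
    criterion="gini",         # Gini impurity splitting criterion
    class_weight="balanced",  # inverse class weights for the rare band class
    random_state=0,
)
param_grid = {
    "max_features": ["sqrt", 25, 50, 100],  # predictors considered per split
    "min_samples_split": [10, 25, 50],      # samples required to split a node
}
search = GridSearchCV(rf, param_grid, cv=5, scoring="roc_auc")
# search.fit(X_train, y_train)  # best model in search.best_estimator_
```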
3. Results
a. Baseline HREF verification
We identified 21 949 candidate precipitation features (those with HREF median 1000-m reflectivity greater than 0 dBZ, at least some snow, and a 300-km minimum length), 1676 of which were associated with snowbands. Based on the 305 banding events identified in Radford and Lackmann (2023) and the inclusion of transient bands here, we estimate that there are between 350 and 400 independent banding events among these precipitation features. Forecast performance for each HREF member is shown on the Roebber (2009) performance diagram in Fig. 3. We first evaluated performance by requiring a forecast band to occur at the same time a band is observed (orange circles), and then using fuzzy verification (Ebert 2008) with 1-h error tolerances (brown circles) in the forecast and observations. There is some variation in member CSIs, with the time-lagged NMMB performing the worst (CSIs of 0.16 and 0.26) and the time-lagged FV3 performing the best (CSIs of 0.25 and 0.32) without and with the 1-h tolerance, respectively. The mean CSIs across members for the no-tolerance and 1-h tolerance cases are 0.21 and 0.30, respectively. All members exhibit a slight propensity toward underforecasting bands, as misses outnumber false alarms.
We next tested the performance of a binary band forecast based upon different discrimination thresholds, calculated as the fraction of HREF members forecasting a band at a given time, producing curves on the performance diagram (Fig. 3) similar to ROC curves for the no-tolerance and 1-h tolerance scenarios. Here we do not break down performance by HREF version. In this HREF “threshold probability” model, CSI is maximized when a positive forecast is made at an HREF probability of roughly 20%–30%; that is, we forecast a band if at least 30% of HREF members depict a band and forecast no band otherwise. Without temporal tolerances, this strategy yields a CSI of 0.28, an improvement of 12% over the best performing HREF member and 33% over the mean HREF member. With a 1-h window, this strategy yields a CSI of 0.38, an improvement of 19% over the best performing member and 27% over the mean member CSI. Improvements are predominantly due to increased probability of detection rather than reduced false alarms.
The above HREF threshold probability model demonstrates the overall HREF performance across all versions. We next examined each HREF version independently to determine whether model development has improved HREF snowband forecast skill over time (Fig. 4). The HREFv1 (n = 810) and HREFv2 (n = 599) systems exhibit nearly indistinguishable performance, while the HREFv2.1 (n = 267) system demonstrates modest improvement. This is consistent with the previous finding of the FV3 as the most skillful HREF member, but improved performance may also be partially attributable to upgrades to individual members between the HREFv2 and HREFv2.1, rather than the changing HREF membership. For example, the HRRRv4 was only introduced in early 2021 (James et al. 2022). Additionally, it is possible that some of the difference is associated with the higher variance of the smaller sample size for the HREFv2.1 or inherent differing predictabilities of banding events occurring during each HREF phase.
In Fig. 5, we assess the reliability of the HREF threshold probability model. The climatological occurrence of snowbands within precipitation features exceeding 0 dBZ, at least 300 km long, and containing at least some snow is 7.6%, increasing to 13.0% when including bands occurring the hour prior or the hour after. The average HREF snowband probability when a band was observed within 1 h was 41.3%, compared to 6.1% when no snowband was observed. The HREF probability model demonstrates skill relative to climatology, with a Brier skill score (BSS) of 0.28. The model is slightly underconfident at low probability thresholds but significantly overconfident at higher probability thresholds. The HREF is an “ensemble of opportunity” (Ebert 2001) rather than a true ensemble, so members are neither independent nor designed to induce dispersion; this overconfidence may therefore be due to similar solutions from members and their time-lagged counterparts or from overlapping physics configurations. The inset panel shows the forecast sharpness. Because snowbands are rare events, a probability of zero is by far the most common outcome, but the HREF probability model is clearly capable of producing probabilities as high as 100%.
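For reference, the BSS compares the Brier score of the forecasts with that of a constant climatological forecast; a minimal sketch, assuming probability forecasts p, binary outcomes o, and the 13.0% base rate quoted above:

```python
# Sketch of the Brier skill score relative to climatology.
import numpy as np

def brier_skill_score(p, o, climo=0.130):
    """p: forecast probabilities; o: 0/1 outcomes; climo: base rate."""
    bs = np.mean((np.asarray(p) - np.asarray(o)) ** 2)  # forecast Brier score
    bs_ref = np.mean((climo - np.asarray(o)) ** 2)      # climatology reference
    return 1.0 - bs / bs_ref                            # BSS > 0 beats climo
```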
At this point we theorized that training a simple random forest with the individual HREF member band forecasts might yield a modest CSI improvement over the HREF probability model, with specific combinations of members correlating with band predictability. For example, positive forecasts from both the HRRR and the ARW may have greater predictive value than positive forecasts from the ARW and the NSSL, as the former indicates that models with two different microphysics schemes yield the same solution, while the latter two both use the WSM6 microphysics scheme. However, this was not the case: the random forest trained with binary member band forecasts (which incorporates which member made which forecast) produced CSIs nearly identical to those of the HREF probability model (Fig. 6). Thus, it appears that the proportion of members producing a band is far more important than which member, or combination of members, produces a band.
b. Random forest evaluation
The CSI results for the explicit features, implicit features, and combined random forest models are presented in Fig. 7, alongside the HREF threshold probability model and HREF members. The random forest model trained on explicit member band predictions, simulated reflectivity, categorical snow, and feature metadata yields a maximum CSI of 0.42, an improvement of 11% (0.04 CSI points) over the HREF threshold probability model. Similar to the threshold probability model, skill is maximized when the random forest forecast probability is greater than 30%. The explicit random forest underperforms the threshold probability model at higher thresholds.
The precipitation features’ areas and lengths were the individual predictors with the greatest value in the explicit random forest, as larger features are more likely to contain bands. Grouping predictors by type (Fig. 8), we see that even though area and length had high individual value, the reflectivity statistics collectively yielded the greatest decrease in impurity, followed by the individual member band identifications, the categorical snow field, and finally the precipitation feature metadata.
The implicit and combined random forest models generally demonstrate similar performance (Fig. 7). At low to medium probability thresholds, performance matches that of the HREF threshold probability model and peaks at a CSI of 0.40; at higher thresholds, the two models underperform the threshold probability model. Examining feature importance for these two models revealed that the forests are highly reliant on SEPV statistics, with more than 50% of the impurity reduction attributed to SEPV. This ran counter to our expectation that SEPV would contribute less impurity reduction than vertical motion, given that SEPV is noisy and is generally a precursor to vertical motion, which is more directly related to snowband development. As such, we tested the impact of simplifying these two models by excluding the SEPV features altogether. Performance was nearly unchanged or slightly improved without SEPV, and the feature importances aligned more closely with expectations (Fig. 9a): categorical snow and vertical motion each account for approximately 30% of the impurity decrease in the implicit random forest, followed by 24% for frontogenesis and 17% for precipitation feature metadata. The ranking for the combined model is explicit detections, reflectivity, categorical snow, vertical motion, precipitation feature metadata, and frontogenesis (Fig. 9b). Two possible explanations for this apparent contradiction in the importance of SEPV are that high correlations between SEPV and other inputs make SEPV largely redundant, or that a high degree of noise in many SEPV inputs leads to overfitting and poor generalization to the test dataset.
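Grouped importances of this kind can be obtained by summing the impurity-based importances over predictor groups, as in the following sketch (the prefix-based group naming is an illustrative assumption):

```python
# Sketch of grouping Gini importances by predictor type.
from collections import defaultdict

def grouped_importance(forest, feature_names):
    """Sum a fitted forest's impurity-based importances over groups encoded
    as name prefixes, e.g., 'sepv_...', 'fgen_...', 'snow_...', 'meta_...'."""
    totals = defaultdict(float)
    for name, imp in zip(feature_names, forest.feature_importances_):
        totals[name.split("_")[0]] += imp
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```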
For completeness, we present ROC curves (Fig. 10a) and reliability diagrams (Fig. 10b) for each of our models. The HREF threshold probability model and the implicit random forest slightly underperform the other two models according to AUC, but the differences are negligible owing to the prevalence of true negatives for these relatively rare events. All four models are reliable, though the explicit and combined random forest models demonstrate less overconfidence at high probabilities than the HREF threshold probability model and the implicit random forest.
Finally, we tested the performance of a random forest trained with the same parameters used in the explicit random forest, but in a model-agnostic way: we used the mean values of each parameter across all HREF members as input rather than the individual member output. This serves two purposes. First, it helps shed light on whether individual member output provides value beyond that of ensemble averages. Second, assuming the model-agnostic random forest performs adequately, it increases the transferability of the model between HREF versions as the HREF updates and changes members. The explicit and model-agnostic random forests are compared in Fig. 11. The explicit random forest with individual member output holds a slight edge over the model-agnostic random forest in maximum CSI. However, the difference is small (∼5%, or 0.02 CSI points), and the increased simplicity and transferability of the model-agnostic random forest may outweigh the small performance benefit provided by individual member output. This also reiterates the earlier finding that which member makes which forecast is generally of little importance.
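In the model-agnostic variant, the per-member predictor columns collapse to their across-member means, as in this short sketch (the array layout is an assumed convention):

```python
# Sketch of model-agnostic predictors: average statistics across members.
import numpy as np

def member_mean_predictors(stats):
    """stats: (n_features, n_members, n_stats) per-member statistics.
    Returns (n_features, n_stats) ensemble means, so the predictors no
    longer depend on which members a given HREF version includes."""
    return np.nanmean(stats, axis=1)
```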
4. Conclusions
We evaluated the predictive skill of each HREF member for short-range mesoscale snowband forecasts and then presented four simple models seeking to improve upon these baselines. HREF members verified similarly, with CSIs ranging from 0.26 to 0.32 when allowing for 1-h timing errors, with the time-lagged NMMB performing the worst and the time-lagged FV3 performing the best. Missed bands were more prevalent than false alarms for all members.
Our first model calculated an uncalibrated probability of snowband occurrence within a forecast precipitation feature based upon the number of HREF members depicting a band. This model produced a maximum CSI of 0.38 when a band was detected by at least 30% of HREF members, an improvement of 19% (CSI of 0.38 compared to 0.32 for the time-lagged FV3; an increase of 0.06 points) over any individual HREF member. This improvement was primarily the result of increased probability of detection (∼60% compared to ∼40% for individual HREF members). The threshold probability model was generally reliable but overconfident at higher probabilities, likely due to overlapping physics configurations and initial conditions. There was some variation in forecast skill by HREF version, with the HREFv2.1 verifying slightly better than the HREFv1 and HREFv2.
The threshold probability model was then compared to three different random forest models. The first random forest was trained with features explicitly related to snowband occurrence, such as band detections according to the Radford et al. (2019) algorithm, simulated reflectivity, and the categorical snow field. This random forest improved the maximum CSI to 0.42, an additional 11% (0.04 CSI points) increase over the HREF threshold probability model. The second random forest was trained with environmental snowband ingredients, such as midtropospheric frontogenesis, SEPV, and vertical motion. This random forest was comparable in skill to the HREF threshold probability model and inferior to the explicit-feature random forest model.
The final random forest was trained with both the explicit and implicit (environmental) features. This model was again comparable to the HREF threshold probability model and slightly worse than the explicit random forest model. This seems counterintuitive given that this model had more information than either model individually, but more features do not necessarily result in better performance in a random forest model, as each tree only has access to a random subsample of features (Hastie et al. 2016). In particular, noisy and correlated features can sometimes prevent the identification of stronger predictors, as seemed to be the case with SEPV. All three random forests demonstrated high AUCs and reliability, though the explicit and combined models were more reliable at higher probability thresholds than the implicit model or threshold probability model. We also tested a model agnostic random forest using mean values of the explicit input parameters rather than individual member output. This model performed slightly worse than the explicit random forest incorporating the member output but simplifies the model and may improve transferability to newer versions of the HREF.
Ultimately, our results suggest that a short-term snowband forecasting strategy based upon variables explicitly related to band identification, such as simulated reflectivity, has greater value in a machine learning context than a strategy based upon common environmental snowband ingredients. There are two caveats to this conclusion. First, to make the ingredients “machine learning ready,” we calculated only a limited set of statistics for each variable within each precipitation feature. The data are therefore highly structured and simplified, removing contextual information and spatial attributes that may be valuable to band prediction, such as the relative positioning of ingredients. Second, the relative value of explicit versus ingredients-based forecast strategies portrayed here may not extend outside the context of machine learning, as forecasters supplement the ingredients with crucial domain expertise. Additionally, ingredients-based approaches are likely superior to explicit approaches at lead times longer than those evaluated in this study. Regardless of the strategy chosen, it is clear that calibrated probabilistic snowband forecasts yield substantial improvements over deterministic forecasts, as evidenced by the threshold probability model’s CSI increases of 19% (0.06 points) over the best deterministic CAM and 27% (0.08 points) over the mean CAM, driven by increased probability of detection. Thus, even if forecasters forego machine learning to aid in snowband forecasting, investigation of ensemble probabilities of banding is a reasonable compromise that adds considerable forecast value.
There is ample opportunity for future work based on these results. First, examination of other variables and different vertical levels could identify stronger predictors, and given the short lead times investigated in this study, recent observed radar reflectivity could be a beneficial predictor of near-term band development. Second, with slight variations to the methods, we could produce gridpoint probabilities of experiencing a band, rather than precipitation-feature probabilities, to gain more forecast precision. Finally, with minor modifications, the methods applied here could be extended to the more general problem of heavy snowfall prediction, rather than mesoscale snowbands (e.g., Herman and Schumacher 2018a,b).
Acknowledgments.
Support for this research was provided by NOAA Grant NA19NWS4680001, awarded to North Carolina State University. We thank the National Severe Storms Laboratory, Dr. Brett Roberts, and Dr. Adam Clark for making an archive of HREF system data available, as well as the Iowa Environmental Mesonet (IEM) for providing base radar reflectivity mosaics. We also wish to thank Dr. Benjamin Radford, who provided valuable advice on machine learning procedures. Finally, we thank three anonymous reviewers for their insightful comments and suggestions.
Data availability statement.
HREF data are available through the National Severe Storms Laboratory at https://data.nssl.noaa.gov/thredds/catalog/FRDD/HREF.html, ERA5 data are available through the European Centre for Medium-Range Weather Forecasts (ECMWF) at https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5, and base reflectivity mosaics are available through the Iowa Environmental Mesonet (IEM) at https://mesonet.agron.iastate.edu/docs/nexrad_mosaic/.
REFERENCES
Baxter, M. A., and P. N. Schumacher, 2017: Distribution of single-banded snowfall in central U.S. cyclones. Wea. Forecasting, 32, 533–554, https://doi.org/10.1175/WAF-D-16-0154.1.
Black, T. L., 1994: The new NMC mesoscale Eta model: Description and forecast examples. Wea. Forecasting, 9, 265–278, https://doi.org/10.1175/1520-0434(1994)009<0265:TNNMEM>2.0.CO;2.
Bluestein, H. B., 1993: Synoptic-Dynamic Meteorology in Midlatitudes. Vol. II, Observations and Theory of Weather Systems, Oxford University Press, 606 pp.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984: Classification and Regression Trees. 1st ed. Routledge, 368 pp.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Burke, A., N. Snook, D. J. Gagne II, S. McCorkle, and A. McGovern, 2020: Calibration of machine learning-based probabilistic hail predictions for operational forecasting. Wea. Forecasting, 35, 149–168, https://doi.org/10.1175/WAF-D-19-0105.1.
Chase, R. J., D. R. Harrison, A. Burke, G. M. Lackmann, and A. McGovern, 2022: A machine learning tutorial for operational meteorology. Part I: Traditional machine learning. Wea. Forecasting, 37, 1509–1529, https://doi.org/10.1175/WAF-D-22-0070.1.
Colle, B. A., D. Stark, and S. E. Yuter, 2014: Surface microphysical observations within East Coast winter storms on Long Island, New York. Mon. Wea. Rev., 142, 3126–3146, https://doi.org/10.1175/MWR-D-14-00035.1.
Donaldson, R. J., R. M. Dyer, and M. J. Kraus, 1975: An objective evaluator of techniques for predicting severe weather events. Preprints, Ninth Conf. on Severe Local Storms, Norman, OK, Amer. Meteor. Soc., 321–326.
Ebert, E. E., 2001: Ability of a poor man’s ensemble to predict the probability and distribution of precipitation. Mon. Wea. Rev., 129, 2461–2480, https://doi.org/10.1175/1520-0493(2001)129<2461:AOAPMS>2.0.CO;2.
Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.
Evans, M., and M. L. Jurewicz Sr., 2009: Correlations between analyses and forecasts of banded heavy snow ingredients and observed snowfall. Wea. Forecasting, 24, 337–350, https://doi.org/10.1175/2008WAF2007105.1.
Gagne, D. J., II, A. McGovern, S. E. Haupt, R. A. Sobash, J. K. Williams, and M. Xue, 2017: Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Wea. Forecasting, 32, 1819–1840, https://doi.org/10.1175/WAF-D-17-0010.1.
Hastie, T., R. Tibshirani, and J. Friedman, 2016: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer, 767 pp.
Herman, G. R., and R. S. Schumacher, 2018a: Money doesn’t grow on trees, but forecasts do: Forecasting extreme precipitation with random forests. Mon. Wea. Rev., 146, 1571–1600, https://doi.org/10.1175/MWR-D-17-0250.1.
Herman, G. R., and R. S. Schumacher, 2018b: “Dendrology” in numerical weather prediction: What random forests and logistic regression tell us about forecasting extreme precipitation. Mon. Wea. Rev., 146, 1785–1812, https://doi.org/10.1175/MWR-D-17-0307.1.
Hill, A. J., G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. Mon. Wea. Rev., 148, 2135–2161, https://doi.org/10.1175/MWR-D-19-0344.1.
James, E. P., and Coauthors, 2022: The High-Resolution Rapid Refresh (HRRR): An hourly updating convection permitting forecast model. Part II: Forecast performance. Wea. Forecasting, 37, 1397–1417, https://doi.org/10.1175/WAF-D-21-0130.1.
Loken, E. D., A. J. Clark, and C. D. Karstens, 2020: Generating probabilistic next-day severe weather forecasts from convection-allowing ensembles using random forests. Wea. Forecasting, 35, 1605–1631, https://doi.org/10.1175/WAF-D-19-0258.1.
Lundquist, J., M. Hughes, E. Gutmann, and S. Kapnick, 2019: Our skill in modeling mountain rain and snow is bypassing the skill of our observational networks. Bull. Amer. Meteor. Soc., 100, 2473–2490, https://doi.org/10.1175/BAMS-D-19-0001.1.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
May, R. M., and Coauthors, 2022: MetPy: A python package for meteorological data. Unidata, accessed 25 April 2022, https://github.com/Unidata/MetPy.
McMurdie, L. A., and Coauthors, 2022: Chasing snowstorms: The Investigation of Microphysics and Precipitation for Atlantic Coast-Threatening Snowstorms (IMPACTS) campaign. Bull. Amer. Meteor. Soc., 103, E1243–E1269, https://doi.org/10.1175/BAMS-D-20-0246.1.
Molthan, A. L., B. A. Colle, S. E. Yuter, and D. Stark, 2016: Comparisons of modeled and observed reflectivities and fall speeds for snowfall of varied riming degrees during winter storms on Long Island, New York. Mon. Wea. Rev., 144, 4327–4347, https://doi.org/10.1175/MWR-D-15-0397.1.
Morales, R. F., Jr., 2008: The historic Christmas 2004 South Texas snow event: Diagnosis of the heavy snow band. Natl. Wea. Dig., 32, 135–152.
Nicosia, D. J., and R. H. Grumm, 1999: Mesoscale band formation in three major Northeastern United States snowstorms. Wea. Forecasting, 14, 346–368, https://doi.org/10.1175/1520-0434(1999)014<0346:MBFITM>2.0.CO;2.
Novak, D. R., and B. A. Colle, 2012: Diagnosing snowband predictability using a multimodel ensemble system. Wea. Forecasting, 27, 565–585, https://doi.org/10.1175/WAF-D-11-00047.1.
Novak, D. R., L. F. Bosart, D. Keyser, and J. S. Waldstreicher, 2004: An observational study of cold season–banded precipitation in Northeast U.S. cyclones. Wea. Forecasting, 19, 993–1010, https://doi.org/10.1175/815.1.
Novak, D. R., J. S. Waldstreicher, D. Keyser, and L. F. Bosart, 2006: A forecast strategy for anticipating cold season mesoscale band formation within eastern U.S. cyclones. Wea. Forecasting, 21, 3–23, https://doi.org/10.1175/WAF907.1.
Novak, D. R., B. A. Colle, and S. E. Yuter, 2008: High-resolution observations and model simulations of the life cycle of an intense mesoscale snowband over the Northeastern United States. Mon. Wea. Rev., 136, 1433–1456, https://doi.org/10.1175/2007MWR2233.1.
Novak, D. R., B. A. Colle, and A. R. Aiyyer, 2010: Evolution of mesoscale precipitation band environments within the comma head of Northeast U.S. cyclones. Mon. Wea. Rev., 138, 2354–2374, https://doi.org/10.1175/2010MWR3219.1.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Radford, J. T., and G. M. Lackmann, 2023: Assessing variations in the predictive skill of ensemble snowband forecasts with object-oriented verification and self-organizing maps. Wea. Forecasting, https://doi.org/10.1175/WAF-D-23-0004.1, in press.
Radford, J. T., G. M. Lackmann, and M. A. Baxter, 2019: An evaluation of snowband predictability in the High-Resolution Rapid Refresh. Wea. Forecasting, 34, 1477–1494, https://doi.org/10.1175/WAF-D-19-0089.1.
Radford, J. T., G. M. Lackmann, J. Goodwin, J. Correia Jr., and K. Harnos, 2023: An iterative approach toward development of ensemble visualization tools for high-impact winter weather hazards. Part I: Product development. Bull. Amer. Meteor. Soc., in press.
Rauber, R. M., and Coauthors, 2015: The role of cloud-top generating cells and boundary layer circulations in the finescale radar structure of a winter cyclone over the Great Lakes. Mon. Wea. Rev., 143, 2291–2318, https://doi.org/10.1175/MWR-D-14-00350.1.
Roberts, B., B. T. Gallo, I. L. Jirak, and A. J. Clark, 2019: The high resolution ensemble forecast (HREF) system: Applications and performance for forecasting convective storms. 2019 AGU Fall Meeting, San Francisco, CA, Amer. Geophys. Union, Abstract A31O-2797.
Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Rogers, E., T. Black, B. Ferrier, Y. Lin, D. Parrish, and G. DiMego, 2001: Changes to the NCEP meso Eta analysis and forecast system: Increase in resolution, new cloud microphysics, modified precipitation assimilation, modified 3DVAR analysis. NOAA/NWS Tech. Procedures Bull. 488, NOAA/NWS, 21 pp.
Rosenow, A. A., D. M. Plummer, R. M. Rauber, G. M. McFarquhar, B. F. Jewett, and D. Leon, 2014: Vertical velocity and physical structure of generating cells and convection in the comma head region of continental winter cyclones. J. Atmos. Sci., 71, 1538–1558, https://doi.org/10.1175/JAS-D-13-0249.1.
Trapp, R. J., D. M. Schultz, A. V. Ryzhkov, and R. L. Holle, 2001: Multiscale structure and evolution of an Oklahoma winter precipitation event. Mon. Wea. Rev., 129, 486–501, https://doi.org/10.1175/1520-0493(2001)129<0486:MSAEOA>2.0.CO;2.