1. Introduction
Wind gusts, brief and intense bursts of wind, arise when high-momentum air is transferred to the surface (Kahl 2020). According to the U.S. National Weather Service, “gusts are reported when the peak wind speed reaches at least 16 knots and the variation in wind speed between the peaks and lulls is at least 9 knots. The duration of a gust is usually less than 20 seconds.” (1 kt ≈ 0.51 m s−1). Wind gusts are frequently associated with severe hazards and structural damage, making gust prediction a crucial element of weather forecasting services. The effects of gusts are manifold; for example, wind gusts have a significant impact on wind energy generation and the occurrence of power outages. Accurate gust forecasts are therefore crucial for wind power plants and electric utilities to ensure smooth operation and to prepare for weather-induced blackouts. However, wind gusts are influenced by various small-scale processes such as turbulence due to friction, wind shear, surface roughness, topography, and solar heating of the ground. These complex interactions make gust prediction challenging, even for advanced numerical weather prediction (NWP) models capable of convection-permitting simulations (Schulz and Lerch 2022), primarily because the turbulent eddies that contain gusts are typically too small to be resolved by observation networks and mesoscale numerical models. Consequently, parameterizations and downscaling of model output become necessary (Fovell and Cao 2014). Moreover, reporting artifacts in observational wind data hinder the development and validation of gust forecasting models (Harris and Kahl 2017).
NWP models such as WRF (Skamarock et al. 2008) use postprocessing tools to estimate wind gusts. There have been numerous efforts to statistically correct NWP forecasts against observational data using multiple linear regression techniques known as model output statistics (MOS) (Glahn and Lowry 1972). Ensemble model output statistics (EMOS) is the subsequent extension of MOS to treat ensemble NWP model outputs (Gneiting et al. 2005; Woodcock and Engel 2005). The U.S. National Weather Service has been using these statistical learning techniques since 1968 to reduce systematic model error (Carter et al. 1989; Wilks and Hamill 2007). Even though these techniques exhibit significant skill improvement in correcting forecasts from 0 to 10 days ahead (Hemri et al. 2014) and are continuously refined with new observational data, MOS and EMOS are fundamentally linear methods that are particularly rigid: they require specification of a predictive distribution and estimation of its parameters, for example, the mean and the standard deviation in the case of a Gaussian distribution (Haupt et al. 2021). Methods based on machine learning (ML), on the other hand, typically allow the incorporation of complex nonlinear processes (Zhang et al. 2019; Murdoch et al. 2019) and thus have the potential for more effective and robust corrections.
Application of artificial intelligence (AI) or ML to high-impact weather prediction has been the focus of recent research interest (McGovern et al. 2017). AI has shown effectiveness in a wide variety of meteorological applications, including the classification of wind gust events (Sallis et al. 2011) and precipitation types (Anagnostou 2004; Baldwin et al. 2005), tornado prediction (Marzban and Stumpf 1996), automated tornado warnings (Steinkruger et al. 2020), storm tracking (Miller et al. 2013), and probabilistic forecasting of damaging straight-line wind (Lagerquist et al. 2017). ML techniques have also been used to forecast wind power production by combining NWP model output with meteorological observations (Du 2019). One of the most promising applications of AI is the postprocessing of NWP model outputs to correct systematic NWP errors by comparing hindcasts to observations (Haupt et al. 2021). Mercer and Dyer (2014) used output from the North American Mesoscale Forecast System (NAM) to train a support vector regression (SVR) algorithm to forecast daily peak wind gusts of the year 2013 for 10 stations in the Midwest and central U.S. plains. They minimized both the root-mean-square error (RMSE) of wind gusts and the spread in the median residual. Wang et al. (2021) proposed a random forest (RF) (Breiman 2001) based method to adjust four WRF outputs: zonal (west–east) wind at 10 m, meridional (south–north) wind at 10 m, temperature at 2 m, and sea level pressure. Their RF-based adjusting method improved the prediction accuracy of these variables by nearly 40% compared to the original WRF Model. Vassallo et al. (2020) used RF both as a standalone model to predict absolute wind speed at the time of interest and as an error correction mechanism to predict the difference between the wind speed at the current time period and that at the time period of interest. The error-correcting RF outperformed its standalone counterpart in all study cases and proved more effective in a region with high interannual variability of wind speed.
Shanmuganathan and Sallis (2014) used several data mining techniques to discover wind gust patterns at a vineyard in the north of New Zealand. One of their proposed methods, based on an artificial neural network (ANN), showed over 99% accuracy for gusts and over 85% accuracy for gust classes defined by the authors. Coburn et al. (2022) used ANNs of varying complexity along with multiple logistic and linear regression for short-term forecasting of wind gusts. They modeled wind gust occurrence and magnitude using gust observations from eight airport stations across the contiguous United States (CONUS), with upper-air variables from the ERA5 reanalysis and one autoregressive term as predictors. While ANNs improved the prediction of wind gust magnitude over linear regression, they offered only marginal improvement over logistic regression models for wind gust occurrence. This was in line with their previous analysis (Coburn and Pryor 2022), where they used the same predictors and predictands for three airport stations in the northeast (NE) United States. Schulz and Lerch (2022) used several techniques for postprocessing ensemble forecasts of wind gusts from the operational ensemble prediction system of the German Weather Service. Their methods ranged from comparatively simple statistical methods up to complex ML-based approaches, including quantile regression forest (QRF), gradient boosting (GB) methods, and neural networks (NNs). Even though all postprocessing methods could correct the systematic errors of the raw ensemble predictions, their study indicated that no single best method existed, as all approaches had certain advantages and caveats. The results indicated that the NN-based approach provided better forecasts than other methods and was able to learn physically consistent relations associated with the diurnal cycle, such as the evening transition of the planetary boundary layer. However, the NNs imposed higher computational requirements in terms of data, computing resources, and the effort needed to find the optimal model configuration and parameters. The authors suggested that an NN-based approach should be implemented when various additional features and a large dataset for model training and validation are available. If the dataset is of a smaller scale, e.g., containing a limited number of stations, their results suggested that QRF and GB might still be able to obtain valuable information from the predictors. Wang et al. (2020) used an ensemble learning approach to forecast wind gust speed. Their ensemble model included long short-term memory (LSTM), RF, and Gaussian process regression (GPR) models. The outputs from LSTM and RF were used as input to train the GPR model to create one-step-ahead wind gust forecasts.
The aim of this study is to align the prediction of wind gusts for the NE United States with observations by developing a system that integrates ML algorithms and WRF atmospheric variables. We have used the Unified Post Processor (UPP), developed by the National Centers for Environmental Prediction (NCEP), which estimates wind gusts by postprocessing WRF outputs. However, UPP estimates maximum gust potential through a simple parameterization, which results in higher-than-expected gust predictions (Benjamin et al. 2020). Hereafter, the gust forecast made by UPP is referred to as WRF-UPP. For our study, we have used one tree-based algorithm, RF, and one extreme gradient boosting algorithm (XGB) to establish a wind gust predictive model that brings the WRF-UPP surface wind gust closer to observations. We have also used two generalized linear models (GLMs), a statistical technique, to examine whether simpler statistical methods differ significantly from the ML models. Our modeling approach takes explanatory variables from WRF (WRF outputs) and feeds them into the machine learning and statistical models together with observed wind gust values. Another feature of our work is the construction of learning curves that quantify the improvement in ML-based gust forecasts with increased data availability. As more data require increased computational power, the learning curves can inform the trade-off between a physically acceptable level of error and computational resources.
The paper is structured as follows. Section 2 describes the types of data: gust observations and the WRF weather variables used for gust forecast. Section 3 explains the ML and statistical algorithms, cross-validation technique, hyperparameter tuning, and the performance metrics used for model evaluation. Section 4 contains the discussion of the results. The conclusions follow in section 5.
2. Wind gust observations and WRF Model configuration
We have focused on the NE United States as the study domain for this work. WRF has been used operationally in our research group and the Eversource Energy Center at the University of Connecticut since 2015 to support weather-induced power outage prediction for several New England utility territories (Wanik et al. 2018; Cerrai et al. 2017, 2019; Yang et al. 2017, 2018, 2019; Samalot et al. 2019). From our extensive database of simulated storms, created with WRF v3.8.1 at 4-km grid spacing, we selected 61 extratropical and tropical storms spanning 2005–2020 (Table A1 in the appendix lists the dates of the selected storms). The majority of the storms we considered were low pressure systems with cold fronts, which are commonly associated with gusty winds; however, some also featured warm, stationary, and occluded fronts co-occurring with the cold fronts. Additionally, our dataset included notable tropical storms such as Isaias and Jose, as well as Hurricane Wilma.
The WRF Model configuration included the Thompson et al. (2008) scheme for microphysics, the Rapid Radiative Transfer Model (Mlawer et al. 1997) for longwave radiation, the Goddard shortwave scheme (Chou and Suarez 1994) for shortwave radiation, the YSU scheme (Hong et al. 2006) for the planetary boundary layer, the Noah land surface model, and the revised MM5 surface layer scheme. The selected parameterization schemes were the outcome of approximately 10 years of research for optimum daily weather forecasts in our region using WRF (Wanik et al. 2015; Yang et al. 2017, 2018, 2019; Samalot et al. 2019; Khaira and Astitha 2023). Each storm simulation was 48 h in duration, with an additional 12 h of spinup time (for a total of 60 h). WRF was initialized with the 12-km gridded NAM analysis.
To provide a training and validation dataset, wind gust observations were obtained for periods matching the storms described in Table A1. These gust data originated from the Global Hourly–Integrated Surface Database (ISD) of NOAA. The observations obtained from this dataset comprise all wind gusts above the Automated Surface Observing System (ASOS) minimum gust speed threshold of 14 kt (≈7.2 m s−1) for the duration of each storm event. The ISD, which consists of global surface observations, is compiled from numerous sources and has passed through multiple automated quality control algorithms that subject each observation to a series of validity checks, extreme value checks, internal consistency checks, and temporal continuity checks against other observations from the same station (Lott 2004). The primary data sources in the ISD include the ASOS, the Automated Weather Observing System (AWOS), METAR, the Coastal-Marine Automated Network (C-MAN), buoys, and several others (Smith et al. 2011). Most of the stations within our study domain are ASOS and AWOS, along with a limited number of observatories, C-MAN stations, and buoys. Since gusts last only for a brief period and do not occur at the same frequency as wind, gust observations recorded at the stations are intermittent and are not available every hour for the entire duration of a storm. Stations within four grid cells (16 km) of the WRF domain boundary were discarded so that boundary conditions would not affect our analysis. The WRF Model domain, the weather stations’ locations within the domain, and the frequency distribution of the observed gusts collected from the ISD are shown in Fig. 1.
Fig. 1. (a) WRF 4-km model domain, location of weather stations, and station elevation. (b) Frequency distribution of the reported gusts by all stations over 61 storms.
Figure 1b shows a right-skewed distribution of the observed gusts, consistent with a recent study by Coburn et al. (2022), where the authors used 15 years of gust data from eight airport stations. Their study revealed that gusts within the range of 17.5–25.7 m s−1 (NWS definition of strong wind gusts) occurred for 200–2500 h, while gusts exceeding 25.7 m s−1 (NWS definition of damaging wind gusts) were observed for only 10–70 h. Our storm-based analysis comprised 121 125 gust observations from 180 stations within the domain outlined in Fig. 1a, covering 61 storms.
To train the machine learning models, weather variables from WRF that have a physical correlation with wind gusts were selected as features. Many of the selected features listed in Table 1 are in line with previous similar research (Schulz and Lerch 2022; Coburn and Pryor 2022; Coburn et al. 2022). Our initial study included more variables, such as temperature at 2 m, potential temperature at 2 m, surface pressure (atmospheric pressure near the surface), and potential temperature lapse rate at different levels (between the surface and 1-km height, between the surface and 2-km height, and between the surface and the planetary boundary layer height). Later, we dropped the temperature at 2 m, which had a very high positive correlation (close to 1) with the potential temperature at 2 m. We also dropped the potential temperature lapse rates from the feature variables, as they exhibited a high negative correlation (close to −1) with the temperature lapse rates at the corresponding levels. In addition, we made feature importance plots over several trials to further aid the feature selection process and identified that potential temperature at 2 m and surface pressure were consistently of lower importance. We therefore removed potential temperature at 2 m and surface pressure from consideration, since including redundant or irrelevant features can degrade the performance of the learning algorithms (Hastie et al. 2009). In fact, while computing the permutation feature importance, shuffling potential temperature at 2 m resulted in a decrease of the mean-squared error over some trials, implying that the original feature was worse than noise and that a model would do better without it.
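To make the correlation-based screening concrete, a minimal Python sketch is given below. It illustrates the procedure described above and is not the exact code used in this study; the threshold value is a placeholder.

```python
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later feature of any pair whose |Pearson r| exceeds the threshold."""
    corr = features.corr().abs()
    to_drop: set[str] = set()
    cols = list(corr.columns)
    for i, ci in enumerate(cols):
        if ci in to_drop:
            continue  # a dropped feature cannot serve as the kept member of a pair
        for cj in cols[i + 1:]:
            if cj not in to_drop and corr.loc[ci, cj] >= threshold:
                to_drop.add(cj)  # e.g., T2 is dropped when theta at 2 m is kept
    return features.drop(columns=sorted(to_drop))
```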
Table 1. WRF variables used in the ML and statistical models.
To align the WRF outputs with observed gusts, we performed bilinear interpolation of the WRF outputs. This involved interpolating the WRF variables to the station locations given by their latitude and longitude. The interpolated WRF variables were then paired with observed gusts by matching both the time (WRF output time to observed gust time) and the unique station identification number assigned to each station. The ML models predicted wind gusts at the same intermittent temporal resolution as the observations.
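A minimal sketch of this matching step is shown below. For simplicity it assumes a regular latitude–longitude grid (an operational WRF grid is curvilinear, so a projection-aware tool would be used in practice), and the DataFrame column names are hypothetical.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def interp_to_stations(field2d, grid_lats, grid_lons, st_lats, st_lons):
    """Bilinearly interpolate one 2D model field to station coordinates.

    grid_lats and grid_lons must be strictly ascending 1D coordinate arrays.
    """
    f = RegularGridInterpolator((grid_lats, grid_lons), field2d, method="linear")
    return f(np.column_stack([st_lats, st_lons]))

# The interpolated values are then paired with observations on valid time and
# station ID, e.g. (hypothetical column names):
# merged = wrf_df.merge(obs_df, on=["valid_time", "station_id"], how="inner")
```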
3. Methodology
We have considered two ML algorithms, RF and XGB, and two statistical models: a generalized linear model with an identity link function (GLM-Identity) and a generalized linear model with a log link function (GLM-Log). A brief description of the ML algorithms and the statistical models is provided below.
a. Machine learning algorithms and statistical models
A random forest is an ensemble learning algorithm that fits a number of decision trees on various subsamples of the dataset and averages their outputs to improve predictive accuracy and control overfitting. We trained the RF model using scikit-learn’s (Pedregosa et al. 2011) random forest regressor in Python. XGB is also a tree-based algorithm consisting of multiple decision trees; however, the trees are constructed differently than in a random forest. The “boosting” technique adds new trees sequentially, each fit to the residuals or errors of the prior trees, until the errors can no longer be significantly reduced. We have used the Python-based scikit-learn application programming interface (API) (Buitinck et al. 2013) for XGB.
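The following sketch shows one way to set up the four models in Python; the hyperparameter values are illustrative defaults rather than the tuned values of section 3b, and the GLM specification assumes the statsmodels package (Seabold and Perktold 2010).

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor  # scikit-learn-style API for XGBoost
import statsmodels.api as sm

# Tree-based models (hyperparameter values are placeholders, not the tuned ones).
rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=-1)
xgb = XGBRegressor(n_estimators=500, learning_rate=0.1, random_state=0)

# GLMs with a Gaussian family: identity link (GLM-Identity) and log link (GLM-Log).
# X and y stand for the feature matrix (with an intercept column) and observed gusts.
# glm_identity = sm.GLM(y, X, family=sm.families.Gaussian(sm.families.links.Identity())).fit()
# glm_log      = sm.GLM(y, X, family=sm.families.Gaussian(sm.families.links.Log())).fit()
```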
b. Hyperparameter tuning and cross validation
Hyperparameter tuning for RF and XGB was done using Hyperopt’s (Bergstra et al. 2013) Tree of Parzen Estimators (TPE) algorithm (Bergstra et al. 2011), which is available in Python. TPE is a sequential algorithm that selects model hyperparameters by leveraging Bayesian optimization. Based on a user-defined objective function (the mean-squared error in this study), it guides the hyperparameter selection in each round according to the scores of previous rounds. Thus, instead of randomly choosing the next set of hyperparameters, the algorithm uses the knowledge accumulated from past evaluations to restrict the search domain for optimized tuning. To perform the hyperparameter tuning, we divided the storms into six folds, each consisting of 51 training and 10 test storms. The test storms were kept unique across the folds. The best hyperparameters were searched using the data of the 51 training storms during each iteration.
For RF, the hyperparameters we tuned were n_estimators, max_depth, max_features, max_samples, min_samples_split, and bootstrap. For XGB, the selected hyperparameters were n_estimators, max_depth, min_child_weight, subsample, learning_rate, gamma, reg_lambda, colsample_bylevel, colsample_bynode, and colsample_bytree. The definitions of these hyperparameters can be found in the scikit-learn (Pedregosa et al. 2011) and XGBoost documentation for the respective regressors. Once the optimized hyperparameters were found for an iteration, the model was trained using those hyperparameters and tested on the 10 test storms to compare the tuned model against a model with default hyperparameter settings. This process was repeated six times, and the best hyperparameters from the six iterations were then averaged (majority voting for nonnumerical hyperparameters) to obtain the optimized hyperparameter values that were later applied in the leave-one-storm-out cross validation (LOSO CV).
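A condensed sketch of the TPE search for XGB is shown below. The search space, the number of evaluations, and the synthetic placeholder data are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Synthetic placeholders standing in for one fold (51 training / 10 test storms).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_test, y_test = rng.normal(size=(200, 10)), rng.normal(size=200)

# Illustrative search space; hp.quniform returns floats, hence the int() casts below.
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 1000, 50),
    "max_depth": hp.quniform("max_depth", 3, 12, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    model = XGBRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
    )
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))  # quantity to minimize

best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=Trials())
```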
For the LOSO CV, the ML models were trained on 60 storms using their respective optimized hyperparameters and then tested on the remaining one storm. The process was repeated 61 times so that each storm in the dataset was used as an independent test storm. The predictions were made on a single storm at each iteration, with no influence from the training process of the previous iterations, as the models were retrained each time. Like the ML models, the GLMs were trained on 60 storms to get the coefficients of the feature variables. Then, those feature coefficients were used to make predictions on the remaining test storm over the 61 iterations.
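The LOSO CV loop can be expressed compactly with scikit-learn’s LeaveOneGroupOut, using the storm ID of each sample as the group label; in this sketch, make_model stands in for a constructor that returns a fresh RF or XGB model with the tuned hyperparameters.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_predict(X, y, storm_ids, make_model):
    """Leave-one-storm-out predictions; storm_ids labels each sample's storm."""
    logo = LeaveOneGroupOut()
    preds = np.full(len(y), np.nan)
    for train_idx, test_idx in logo.split(X, y, groups=storm_ids):
        model = make_model()                          # retrained at each iteration
        model.fit(X[train_idx], y[train_idx])         # 60 training storms
        preds[test_idx] = model.predict(X[test_idx])  # the single held-out storm
    return preds
```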
A flowchart of the methodology is shown in Fig. 2, highlighting the connections between the WRF Model outputs and the ML and statistical models toward the final product (wind gust) and its evaluation.
Fig. 2. Flowchart of the methodology (CV = cross validation).
In addition to the LOSO CV, we performed an independent validation of the XGB model to test whether there was a significant difference in model performance between the two validation strategies (results not shown). We trained the XGB model on a dataset of 51 storms and then evaluated it on 10 storms that were never used during training or hyperparameter tuning. A two-sample t test was conducted on pairs of error metrics (described in section 3c) derived from the independent validation and the LOSO CV. Notably, the p values obtained for each error metric were well above the 0.05 significance level, suggesting no statistically significant difference in model performance between the two validation techniques. Therefore, we followed the LOSO CV for the entirety of our analysis. Cerrai et al. (2019) also observed a small difference (3% or less) in error metrics between a LOSO CV framework and different pairs of cross validations using a randomized holdout dataset.
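As a sketch, this significance check can be written as follows, where each input array holds one error metric (e.g., per-storm RMSE) from one validation strategy; the use of SciPy’s standard two-sample t test here is an assumption about the implementation.

```python
from scipy import stats

def same_performance(metrics_loso, metrics_indep, alpha=0.05):
    """Two-sample t test on error metrics from the two validation strategies."""
    _, p_value = stats.ttest_ind(metrics_loso, metrics_indep)
    return p_value > alpha  # True -> no statistically significant difference
```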
c. Evaluation metrics
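The evaluation metrics used throughout this paper are the Pearson correlation coefficient (CC), mean bias, mean absolute error (MAE), root-mean-square error (RMSE), and centered root-mean-square error (CRMSE), which removes the bias contribution from the RMSE (cf. Table 2 and Fig. 4). The sketch below computes them with their standard definitions from paired observations and predictions; the study’s exact formulations may differ in detail.

```python
import numpy as np

def evaluation_metrics(obs, pred):
    """Standard definitions of the five metrics used in this paper."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    err = pred - obs
    bias = err.mean()
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    crmse = np.sqrt(((err - bias) ** 2).mean())  # satisfies CRMSE**2 = RMSE**2 - bias**2
    cc = np.corrcoef(obs, pred)[0, 1]
    return {"CC": cc, "Bias": bias, "MAE": mae, "RMSE": rmse, "CRMSE": crmse}
```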
4. Results and discussion
a. Baseline model
As evident from Fig. 3, the gusts predicted by UPP diverge strongly from the 45° line (solid black line), with the predicted gust potential mostly higher than the observed gusts, as expected. In the following sections, we assess how much improvement the ML-based models or the simpler statistical models can provide over UPP.
Fig. 3. (top) WRF-UPP estimated wind gusts vs observed wind gusts for 61 storms. Wind gust predictions by (a) RF, (b) XGB, (c) GLM-Identity, and (d) GLM-Log.
b. Predictions of wind gusts by ML and statistical models
Predictions of wind gusts were made for each storm in the dataset. After completing hyperparameter tuning, the ML models (RF and XGB) were trained on 60 storms and tested on one storm at each iteration, resulting in 61 iterations. This leave-one-storm-out cross-validation technique was adopted to avoid overfitting and to ensure that each storm in the dataset is independently tested against observed gusts. The statistical models (GLM-Identity and GLM-Log) were developed in the same fashion over 61 iterations to ensure consistency of the training processes between the ML and statistical models: the coefficients of GLM-Identity and GLM-Log were estimated by training on 60 storms and then applied to the remaining storm to make predictions. Both the ML and statistical models were able to remove the bias. However, some of the higher gust values, specifically gusts exceeding 25 m s−1, are considerably underpredicted (Fig. 3). The underprediction of higher gust values can be attributed to the rarity of such extreme wind values in nature (Fig. 1b), which makes it challenging to train ML algorithms on a sufficient number of high gusts. Our initial study was conducted on 48 storms, where this predictive ceiling for strong and damaging gusts was also present (not shown here). We later expanded the dataset to 61 storms to investigate whether adding more storms could address the issues of intermittent gusts and the predictive ceiling. However, as depicted in Fig. 3, the models still underpredict strong gusts for the expanded dataset of 61 storms. A predictive ceiling for high-intensity gusts was also present in the studies by Coburn and Pryor (2022) and Coburn et al. (2022). They used deep ANNs with up to 20 hidden layers, which showed slightly improved predictability of intense gusts over the regression models; however, that improvement came at the cost of model overfitting and enhanced false alarm rates, and the strong and damaging gusts were still substantially underestimated (by 50%) irrespective of season (warm or cold).
Figure 3 shows that RF and XGB yielded similar predictive performance, while GLM-Log was slightly better than GLM-Identity for observed wind gusts between 25 and 30 m s−1. GLM-Identity could only predict up to around 20 m s−1 when the actual gust was in the range of 25–30 m s−1. Additionally, predicted gusts for observations between 25 and 30 m s−1 appear closer to the 45° line for RF, XGB, and GLM-Log than for GLM-Identity. Note, however, that some features exhibit multicollinearity; e.g., wind speeds at different levels are linearly correlated, as are lapse rates at different levels. While collinearity among variables is not a problem for models like RF and XGB, it might have affected the GLMs.
Table 2 shows the values of the evaluation metrics and their percent changes compared to WRF-UPP. A positive change indicates an increase and a negative change a decrease in a metric for an ML or GLM model relative to WRF-UPP. The percentage improvement in the error metrics shows that the ML models and GLMs outperform WRF-UPP, with the ML models doing better than the statistical GLMs. The correlation coefficients increased for all models, while absolute bias, RMSE, CRMSE, and MAE decreased. XGB performed better than the other models except for mean bias, where RF achieved the maximum reduction; however, all the ML and statistical models reduced the bias almost entirely. As previously stated, since increasing the dataset from 48 to 61 storms did not remove the predictive ceiling for strong and damaging gusts, we also attempted to train the ML algorithms to predict the WRF gust bias. The wind gust fields reconstructed from the predicted WRF gust bias did not improve the predictions for high gusts and yielded lower-quality results; thus, they are not included in the manuscript.
Table 2. Evaluation metrics for wind gust predictions. Bold values highlight the best statistical metrics, i.e., the highest CC or the lowest error, and the maximum percent improvement of the evaluation metrics.
c. Confidence intervals for statistical error metrics
Bootstrapped confidence intervals were calculated for all evaluation metrics except bias, since all ML and statistical models effectively removed the bias. The Python library scikits-bootstrap was used to compute the 95% bootstrapped confidence intervals of the statistic functions on the 121 125 paired observations and predictions. A total of 10 000 bootstrap samples were drawn from the paired observations and predictions, and the correlation coefficient, MAE, RMSE, and CRMSE were calculated for each sample to construct the lower and upper limits of the 95% confidence intervals. The black circle in the middle (Fig. 4) denotes the value of each evaluation metric computed from the original data, and the two solid horizontal lines below and above the dot show the lower and upper confidence limits, respectively, computed from the 10 000 bootstrap samples.
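The sketch below illustrates the computation with a simple percentile bootstrap in NumPy; it is an illustration of the idea rather than the study’s implementation, which used the scikits-bootstrap package (whose default is the bias-corrected and accelerated, BCa, interval).

```python
import numpy as np

def bootstrap_ci(obs, pred, statfunc, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI of a metric computed on paired obs/pred arrays."""
    rng = np.random.default_rng(seed)
    obs, pred = np.asarray(obs), np.asarray(pred)
    n = len(obs)
    samples = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        samples[b] = statfunc(obs[idx], pred[idx])
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

# Example: 95% CI of the MAE.
# lo, hi = bootstrap_ci(obs, pred, lambda o, p: np.abs(p - o).mean())
```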
Fig. 4. Bootstrapped 95% confidence intervals of (a) correlation coefficient, (b) MAE, (c) RMSE, and (d) CRMSE. Confidence intervals were generated over 10 000 bootstrapped samples. Dotted lines along the upper confidence intervals of XGB were added in (b)–(d) to help visualize that the confidence intervals of RF and XGB do not overlap. Table A2 lists the numerical values of the confidence intervals related to the plots in this figure.
As can be seen from Table 2, XGB achieved greater improvement than the other models. The purpose of the confidence interval plots is to visualize whether the improvement achieved by XGB is statistically distinct and how it compares to the WRF-UPP wind gust statistics. XGB’s confidence intervals do not overlap with those of the other models (Fig. 4), indicating a statistically significant improvement; in particular, there is no overlap between XGB and RF in Figs. 4b–d, as denoted by the dotted lines. However, GLM-Log and GLM-Identity overlap for all the evaluation metrics, so the performance of these two models is not significantly different. WRF-UPP clearly exhibits significantly larger errors than all models (Fig. 4). Table A2 in the appendix lists the bootstrapped 95% lower and upper confidence limits for the evaluation metrics.
d. Importance of feature variables
Permutation feature importance plots were made for the training storms over the 61 LOSO iterations. We randomly shuffled each feature and computed the resulting increase in the mean-squared error (MSE) at each iteration. As discussed previously, the process was repeated 61 times, altering the test storm at every iteration. The boxplots in Fig. 5 illustrate the MSE increase for each feature over all iterations. The higher the MSE increase for a feature, the more important that feature is, since shuffling it had the largest impact on model performance. The order of feature importance is similar for RF and XGB, indicating consistency of the most important features between the two models. The relatively short boxes suggest that model performance was not driven by any single storm, as one storm in the training set was replaced at each iteration; this is an indication of the stability of the models.
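For one LOSO iteration, this computation can be sketched with scikit-learn’s permutation_importance, which matches the shuffle-and-score procedure described above; the scoring choice and repeat count below are illustrative.

```python
from sklearn.inspection import permutation_importance

def mse_increase_importance(fitted_model, X_test, y_test, n_repeats=10):
    """Mean MSE increase per feature when that feature alone is shuffled."""
    result = permutation_importance(
        fitted_model, X_test, y_test,
        scoring="neg_mean_squared_error",  # drop in this score = increase in MSE
        n_repeats=n_repeats, random_state=0,
    )
    return result.importances_mean  # one value per feature column
```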
Fig. 5. Permutation feature importance boxplots of (a) RF and (b) XGB for wind gust predictions.
Wind speed at 10-m height (WS_10m) and at the 850-mb pressure level (WS_850mb; 1 mb = 1 hPa) are the two most important features for both RF and XGB (Fig. 5). For RF, friction velocity (Ustar) emerges as the next significant predictor after WS_10m and WS_850mb (Fig. 5a). In the case of XGB, terrain height is the next important predictor after WS_10m and WS_850mb (Fig. 5b). Considering the improved error metrics of XGB over RF (Table 2), it is plausible that terrain height, which RF did not fully exploit, is indeed a significant predictor that contributed to the superior performance of XGB. The variability of station elevation in the model domain (Fig. 1a) corroborates this plausible connection. The most relevant features for gust prediction found in this study align closely with past research. In a recent study by Coburn et al. (2022), wind speed at 950 mb, wind speed at 850 mb, and lapse rate at 900 mb derived from the ERA5 reanalysis were the dominant predictors for both gust occurrence and magnitude models at all stations and in both warm and cold seasons. However, they did not consider surface wind speed (wind speed at 10-m height), which RF and XGB identified as a critically important predictor in our study. The similarity between the most significant features in our study and past research suggests that upper-air wind speeds are possibly the governing weather variables for wind gust generation in regional weather prediction models such as WRF and in climate reanalyses such as ERA5. However, these predicted atmospheric variables are difficult to evaluate, given the lack of spatially dense upper-air observations, and such analysis is beyond the scope of this paper.
e. Spatial variability of gust error metrics
After the models were trained and predictions were made over the leave-one-storm-out cross validation, each station’s average prediction error was computed. As XGB showed statistically better performance than RF (Fig. 4), we generated spatial plots exclusively for XGB to assess the extent of improvement achieved at each station (Fig. 6). It is important to note that these spatial plots are not intended to identify error patterns based on station location (such as coastal vs inland stations). As explained in section 2, the intermittent nature of gusts results in different numbers of observations across stations, making it infeasible to compare the average error across stations directly.
Fig. 6. Wind gust MAE, RMSE, and CRMSE for (a),(c),(e) XGB and (b),(d),(f) WRF-UPP. Each circle in the spatial plot shows the average error over 61 storms.
Consistent with the aggregated validation combining all station data (Fig. 3b), XGB resulted in lower errors than WRF-UPP at the single-station level (Fig. 6). The percentage reduction of error at each weather station achieved by XGB is shown in Fig. 7.
Fig. 7. Percentage change of wind gust (a) MAE, (b) RMSE, and (c) CRMSE by XGB, computed relative to WRF-UPP, i.e., (XGB − WRF-UPP)/WRF-UPP × 100%.
f. Learning curves
The objective of constructing learning curves was to gain insight into the required size of the training dataset, as the cost of training the models increases with the addition of more data. Even though, in theory, ML models tend to improve with more training data, the uncertainty in the error metrics and in the WRF outputs might limit how much improvement can be achieved. Learning curves can reveal that limit and help the modeler decide whether the intended improvement is physically relevant. We considered four error metrics: correlation coefficient, MAE, RMSE, and CRMSE. The curves were constructed over 61 trials, and each trial consisted of 12 rounds as the training sample size was increased from 5 to 60 storms. In each trial, one of the 61 storms was kept separate and used to evaluate model performance as the training sample size increased from 5 to 60 storms, in increments of five. At the end of a trial, a new storm was picked as the test storm, and the process was repeated. The gray dots in Figs. 8 and 9 represent the error metrics averaged over the 61 trials for a given training sample size. To assess whether the improvement in the error metrics was statistically significant, 95% bootstrapped confidence intervals were calculated as more storms were added to the training data.
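One trial of this procedure can be sketched as follows, with storm_data standing in for a hypothetical mapping from storm ID to that storm’s feature matrix and observed gusts.

```python
import numpy as np

def learning_curve_trial(storm_data, test_storm, make_model, score):
    """One trial: hold out test_storm, grow the training set from 5 to 60 storms."""
    pool = [s for s in storm_data if s != test_storm]
    X_te, y_te = storm_data[test_storm]
    scores = []
    for n in range(5, 61, 5):                 # 12 rounds: 5, 10, ..., 60 storms
        X_tr = np.vstack([storm_data[s][0] for s in pool[:n]])
        y_tr = np.concatenate([storm_data[s][1] for s in pool[:n]])
        model = make_model().fit(X_tr, y_tr)  # fresh model for each round
        scores.append(score(y_te, model.predict(X_te)))
    return scores                             # e.g., RMSE vs training-set size
```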
Fig. 8. Learning curves with 95% bootstrapped confidence intervals for RF: (a) correlation coefficient, (b) MAE, (c) RMSE, and (d) CRMSE.
Fig. 9. Learning curves with 95% bootstrapped confidence intervals for XGB: (a) correlation coefficient, (b) MAE, (c) RMSE, and (d) CRMSE.
As the number of training storms increased from 5 to 10 and then from 10 to 15, the error metrics showed a noticeable and statistically significant improvement, evident from the absence of overlap between the confidence intervals (Figs. 8a–d). However, beyond 15 training storms, adding another five storms caused the intervals to begin to overlap. This indicates that, for the improvement to remain statistically significant, the increments to the training dataset must grow as more storms are added. For instance, comparing 15 with 30 storms, the improvement in the error metrics was statistically significant, demonstrating that the addition of 15 storms led to a substantial improvement. Once 30 storms were reached, 20 more storms were needed to achieve a statistically better correlation coefficient at 50 storms (Fig. 8a). The curves for the four error metrics plateaued at 55 storms. This experiment provided insight into whether expanding the existing dataset would lead to significant improvement, which, for the RF model, is likely to be minimal beyond 50–55 storms for gust prediction.
The prediction of gusts by XGB showed a similar trend (Fig. 9). However, one notable difference between Figs. 8 and 9 is the steeper learning curve of XGB compared to RF. For example, the evaluation metrics at 25 storms were statistically better than those at 15 training storms (Figs. 9a–d). This means that, starting from a training set of 15 storms, XGB required an additional 10 storms to achieve a significant improvement, while RF required an additional 15 (Figs. 8a–d). Also, beyond 30 storms, the changes in the error metrics with further addition of storms were no longer statistically distinguishable for XGB, whereas for RF there were statistical differences in the correlation coefficient and MAE between 30 and 55 storms. Comparing Figs. 8 and 9, the highest correlation and the lowest errors that RF achieved with 60 training storms were attained by XGB with far fewer training storms, between 15 and 20. This indicates that XGB has the potential to learn effectively from a smaller number of storms than RF.
5. Concluding remarks
With this study, we aimed to improve the deterministic prediction of storm wind gusts affecting the NE United States by developing a WRF-guided ML model. The UPP, a postprocessing tool that estimates wind gusts from WRF outputs, served as the baseline for evaluating the ML models. All ML models (RF and XGB) and statistical models (GLM-Identity and GLM-Log) demonstrated improved predictions compared to WRF-UPP; however, the statistical models were unable to surpass the performance of the ML models. While RF and XGB showed similar improvement, XGB’s predictions were still significantly better than RF’s. Even though the order of feature importance was similar between RF and XGB, indicating consistency between the two models, terrain height had a greater impact on the XGB model’s predictions, which may have contributed to its superior performance over RF. Wind speeds in the upper atmosphere were potentially significant factors influencing the generation of gusts in the regional weather model WRF. However, the relationship between these variables and near-surface gusts can be extremely complex, making it challenging for tree-based machine learning models to replicate the full spectrum of wind gust dynamics accurately. As a result, the ability to predict strong gusts remained limited, consistent with prior research findings. In the future, we plan to incorporate uncertainty quantification along with the predictions so that the results from the gust prediction models can support critical decision-making involving intense gusts. The modeling framework developed in this study utilized low pressure systems coupled with frontal systems and three tropical storms that have affected the NE United States. In the future, we plan to test the models on other meteorological conditions that are accompanied by wind gusts, for example, thunderstorms or nor’easters.
Finally, learning curves were constructed for the ML models, indicating that as more storms were added to the training set, the increments in training storms needed to grow to achieve statistically significant changes. Even though the overall trends of the learning curves of RF and XGB were similar, one notable distinction was that XGB exhibited a steeper learning curve and the potential for better prediction with fewer storms. Therefore, XGB might be more appropriate when the model must learn from a limited amount of storm data. The remaining challenges are the slowing rate of improvement of each error metric as the training sample size increases and the underprediction of higher wind gusts by both RF and XGB. As the distribution of wind gusts is right skewed, the machine learning algorithms we implemented showed their limits in capturing the upper tail of the gust distribution. Nevertheless, decreasing the RMSE and the random component of the error (CRMSE) is an important outcome of our work, as this error reduction supports impact modeling studies that rely on more accurate wind gust prediction. In the future, we aim to expand this work from station-based to grid-based training by incorporating reanalysis products such as ERA5 or RTMA. Weather variables such as wind speed at the 850- and 950-mb pressure levels pose challenges when it comes to their evaluation against observations; therefore, incorporating reanalysis products in our future work could provide valuable insights to diagnose WRF Model errors for variables associated with wind gust generation.
Acknowledgments.
This work was supported by the Research Excellence Program (REP) from the University of Connecticut’s Office of the Vice President for Research and partially supported by the Eversource Energy Center at the University of Connecticut through the research grant “Improving Extreme Weather Forecasting Capabilities in Support of Power Outage Prediction Activities” awarded to Professor Astitha. The authors declare no conflicts of interest. We would also like to acknowledge high-performance computing support from Cheyenne (https://doi.org/10.5065/D6RX99HX) provided by the NCAR’s Computational and Information Systems Laboratory, sponsored by the National Science Foundation.
Data availability statement.
Wind gust observations analyzed in this study are archived at the Global Hourly–Integrated Surface Database (ISD) of the National Centers for Environmental Information (NCEI) and are publicly accessible (https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database). We acknowledge the North American Mesoscale Forecast System (NAM), one of the major weather forecast models of the National Centers for Environmental Prediction (NCEP). We used NAM analyses (https://www.ncei.noaa.gov/data/north-american-mesoscale-model/access/) as the initialization data for our WRF simulations. Outputs from the ML-based models and comparisons with observations will become publicly available from the Open Science Framework (OSF) once the paper is accepted and published.
APPENDIX
List of Simulated Storms and Additional Statistical Metrics
Tables A1 and A2 show the dates of the 61 storms used in this study and the results of the bootstrapped 95% confidence intervals for the evaluation metrics, respectively.
Table A1. Dates of the 61 simulated storms used in this study.
Table A2. Bootstrapped 95% confidence intervals (CI) for the correlation coefficient, MAE, RMSE, and CRMSE (related to Fig. 4).
REFERENCES
Anagnostou, E. N., 2004: A convective/stratiform precipitation classification algorithm for volume scanning weather radar observations. Meteor. Appl., 11, 291–300, https://doi.org/10.1017/S1350482704001409.
Baldwin, M. E., J. S. Kain, and S. Lakshmivarahan, 2005: Development of an automated classification procedure for rainfall systems. Mon. Wea. Rev., 133, 844–862, https://doi.org/10.1175/MWR2892.1.
Benjamin, S. G., E. P. James, J. M. Brown, E. J. Szoke, J. S. Kenyon, and R. Ahmadov, 2020: Diagnostic fields developed for hourly updated NOAA weather models. NOAA Tech. Memo. OAR GSL-66, 58 pp.
Bergstra, J., R. Bardenet, Y. Bengio, and B. Kégl, 2011: Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems 24 (NIPS 2011), Curran Associates Inc., 2546–2554.
Bergstra, J., D. Yamins, and D. Cox, 2013: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. Proc. 30th Int. Conf. on Machine Learning (ICML 2013), Atlanta, GA, PMLR, 115–123.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Buitinck, L., and Coauthors, 2013: API design for machine learning software: Experiences from the scikit-learn project. arXiv, 1309.0238v1, https://doi.org/10.48550/arxiv.1309.0238.
Carter, G. M., J. P. Dallavalle, and H. R. Glahn, 1989: Statistical forecasts based on the National Meteorological Center’s numerical weather prediction system. Wea. Forecasting, 4, 401–412, https://doi.org/10.1175/1520-0434(1989)004<0401:SFBOTN>2.0.CO;2.
Cerrai, D., E. N. Anagnostou, J. Yang, and M. Astitha, 2017: Predicting power outages using multi-model ensemble forecasts. 2017 Fall Meeting, New Orleans, LA, Amer. Geophys. Union, Abstract H31G-1583.
Cerrai, D., D. W. Wanik, M. A. E. Bhuiyan, X. Zhang, J. Yang, M. E. B. Frediani, and E. N. Anagnostou, 2019: Predicting storm outages through new representations of weather and vegetation. IEEE Access, 7, 29 639–29 654, https://doi.org/10.1109/ACCESS.2019.2902558.
Chou, M.-D., and M. J. Suarez, 1994: An efficient thermal infrared radiation parameterization for use in general circulation models. NASA Tech. Memo. 104606, 98 pp.
Coburn, J., and S. C. Pryor, 2022: Do machine learning approaches offer skill improvement for short-term forecasting of wind gust occurrence and magnitude? Wea. Forecasting, 37, 525–543, https://doi.org/10.1175/WAF-D-21-0118.1.
Coburn, J., J. Arnheim, and S. C. Pryor, 2022: Short‐term forecasting of wind gusts at airports across CONUS using machine learning. Earth Space Sci., 9, e2022EA002486, https://doi.org/10.1029/2022EA002486.
Du, P., 2019: Ensemble machine learning-based wind forecasting to combine NWP output with data from weather station. IEEE Trans. Sustainable Energy, 10, 2133–2141, https://doi.org/10.1109/TSTE.2018.2880615.
Fovell, R. G., and Y. Cao, 2014: Wind and gust forecasting in complex terrain. 15th WRF Users’ Workshop, Boulder, CO, NCAR, 5A.2, http://www2.mmm.ucar.edu/wrf/users/workshops/WS2014/ppts/5A.2.pdf.
Glahn, H. R., and D. A. Lowry, 1972: The use of Model Output Statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Harris, A. R., and J. D. W. Kahl, 2017: Gust factors: Meteorologically stratified climatology, data artifacts, and utility in forecasting peak gusts. J. Appl. Meteor. Climatol., 56, 3151–3166, https://doi.org/10.1175/JAMC-D-17-0133.1.
Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning. Springer, 745 pp.
Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc., A379, 20200091, https://doi.org/10.1098/rsta.2020.0091.
Hemri, S., M. Scheuerer, F. Pappenberger, K. Bogner, and T. Haiden, 2014: Trends in the predictive performance of raw ensemble weather forecasts. Geophys. Res. Lett., 41, 9197–9205, https://doi.org/10.1002/2014GL062472.
Hong, S.-Y., Y. Noh, and J. Dudhia, 2006: A new vertical diffusion package with an explicit treatment of entrainment processes. Mon. Wea. Rev., 134, 2318–2341, https://doi.org/10.1175/MWR3199.1.
Kahl, J. D. W., 2020: Forecasting peak wind gusts using meteorologically stratified gust factors and MOS guidance. Wea. Forecasting, 35, 1129–1143, https://doi.org/10.1175/WAF-D-20-0045.1.
Khaira, U., and M. Astitha, 2023: Exploring the real-time WRF forecast skill for four tropical storms, Isaias, Henri, Elsa and Irene, as they impacted the northeast United States. Remote Sens., 15, 3219, https://doi.org/10.3390/rs15133219.
Lagerquist, R., A. McGovern, and T. Smith, 2017: Machine learning for real-time prediction of damaging straight-line convective wind. Wea. Forecasting, 32, 2175–2193, https://doi.org/10.1175/WAF-D-17-0038.1.
Lott, J. N., 2004: The quality control of the integrated surface hourly database. 14th Conf. on Applied Climatology, Seattle, WA, Amer. Meteor. Soc., 7.8, https://www1.ncdc.noaa.gov/pub/data/inventories/ish-qc.pdf.
Marzban, C., and G. J. Stumpf, 1996: A neural network for tornado prediction based on Doppler radar-derived attributes. J. Appl. Meteor., 35, 617–626, https://doi.org/10.1175/1520-0450(1996)035<0617:ANNFTP>2.0.CO;2.
McGovern, A., K. L. Elmore, D. J. Gagne II, S. E. Haupt, C. D. Karstens, R. Lagerquist, T. Smith, and J. K. Williams, 2017: Using artificial intelligence to improve real-time decision-making for high-impact weather. Bull. Amer. Meteor. Soc., 98, 2073–2090, https://doi.org/10.1175/BAMS-D-16-0123.1.
Mercer, A., and J. Dyer, 2014: A new scheme for daily peak wind gust prediction using machine learning. Proc. Comput. Sci., 36, 593–598, https://doi.org/10.1016/j.procs.2014.09.059.
Miller, M. L., V. Lakshmanan, and T. M. Smith, 2013: An automated method for depicting mesocyclone paths and intensities. Wea. Forecasting, 28, 570–585, https://doi.org/10.1175/WAF-D-12-00065.1.
Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono, and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated-k model for the longwave. J. Geophys. Res., 102, 16 663–16 682, https://doi.org/10.1029/97JD00237.
Murdoch, W. J., C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, 2019: Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. USA, 116, 22 071–22 080, https://doi.org/10.1073/pnas.1900654116.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Sallis, P. J., W. Claster, and S. Hernández, 2011: A machine-learning algorithm for wind gust prediction. Comput. Geosci., 37, 1337–1344, https://doi.org/10.1016/j.cageo.2011.03.004.
Samalot, A., M. Astitha, J. Yang, and G. Galanis, 2019: Combined Kalman filter and universal kriging to improve storm wind speed predictions for the northeastern United States. Wea. Forecasting, 34, 587–601, https://doi.org/10.1175/WAF-D-18-0068.1.
Schulz, B., and S. Lerch, 2022: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257, https://doi.org/10.1175/MWR-D-21-0150.1.
Seabold, S., and J. Perktold, 2010: Statsmodels: Econometric and statistical modeling with Python. Proc. Ninth Python in Science Conf., Austin, TX, SCIPY, 92–96.
Shanmuganathan, S., and P. Sallis, 2014: Data mining methods to generate severe wind gust models. Atmosphere, 5, 60–80, https://doi.org/10.3390/atmos5010060.
Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp.
Smith, A., N. Lott, and R. Vose, 2011: The integrated surface database: Recent developments and partnerships. Bull. Amer. Meteor. Soc., 92, 704–708, https://doi.org/10.1175/2011BAMS3015.1.
Steinkruger, D., P. Markowski, and G. Young, 2020: An artificially intelligent system for the automated issuance of tornado warnings in simulated convective storms. Wea. Forecasting, 35, 1939–1965, https://doi.org/10.1175/WAF-D-19-0249.1.
Thompson, G., P. R. Field, R. M. Rasmussen, and W. D. Hall, 2008: Explicit forecasts of winter precipitation using an improved bulk microphysics scheme. Part II: Implementation of a new snow parameterization. Mon. Wea. Rev., 136, 5095–5115, https://doi.org/10.1175/2008MWR2387.1.
Vassallo, D., R. Krishnamurthy, T. Sherman, and H. J. S. Fernando, 2020: Analysis of random forest modeling strategies for multi-step wind speed forecasting. Energies, 13, 5488, https://doi.org/10.3390/en13205488.
Wang, A., L. Xu, Y. Li, J. Xing, X. Chen, K. Liu, Y. Liang, and Z. Zhou, 2021: Random-forest based adjusting method for wind forecast of WRF model. Comput. Geosci., 155, 104842, https://doi.org/10.1016/j.cageo.2021.104842.
Wang, H., Y.-M. Zhang, J.-X. Mao, and H.-P. Wan, 2020: A probabilistic approach for short-term prediction of wind gust speed using ensemble learning. J. Wind Eng. Ind. Aerodyn., 202, 104198, https://doi.org/10.1016/j.jweia.2020.104198.
Wanik, D. W., E. N. Anagnostou, B. M. Hartman, M. E. B. Frediani, and M. Astitha, 2015: Storm outage modeling for an electric distribution network in northeastern USA. Nat. Hazards, 79, 1359–1384, https://doi.org/10.1007/s11069-015-1908-2.
Wanik, D. W., and Coauthors, 2018: A case study on power outage impacts from future hurricane sandy scenarios. J. Appl. Meteor. Climatol., 57, 51–79, https://doi.org/10.1175/JAMC-D-16-0408.1.
Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390, https://doi.org/10.1175/MWR3402.1.
Woodcock, F., and C. Engel, 2005: Operational consensus forecasts. Wea. Forecasting, 20, 101–111, https://doi.org/10.1175/WAF-831.1.
Yang, J., M. Astitha, E. N. Anagnostou, and B. M. Hartman, 2017: Using a Bayesian regression approach on dual-model windstorm simulations to improve wind speed prediction. J. Appl. Meteor. Climatol., 56, 1155–1174, https://doi.org/10.1175/JAMC-D-16-0206.1.
Yang, J., M. Astitha, L. Delle Monache, and S. Alessandrini, 2018: An analog technique to improve storm wind speed prediction using a dual NWP model approach. Mon. Wea. Rev., 146, 4057–4077, https://doi.org/10.1175/MWR-D-17-0198.1.
Yang, J., M. Astitha, and C. S. Schwartz, 2019: Assessment of storm wind speed prediction using gridded Bayesian regression applied to historical events with NCAR’s real‐time ensemble forecast system. J. Geophys. Res. Atmos., 124, 9241–9261, https://doi.org/10.1029/2018JD029590.
Zhang, Z., Z. Wu, D. Rincon, and P. D. Christofides, 2019: Real-time optimization and control of nonlinear processes using machine learning. Mathematics, 7, 890, https://doi.org/10.3390/math7100890.