High-impact weather events, such as severe thunderstorms, tornadoes, and hurricanes, cause significant disruptions to infrastructure, property loss, and even fatalities. High-impact weather can also benefit society, for example through cost savings from renewable energy. Prediction of these events has improved substantially with greater observational capabilities, increased computing power, and better model physics, but there is still significant room for improvement. Artificial intelligence (AI) and data science technologies, specifically machine learning and data mining, bridge the gap between numerical model prediction and real-time guidance by improving accuracy. AI techniques also extract otherwise unavailable information from forecast models by fusing model output with observations to provide additional decision support for forecasters and users. In this work, we demonstrate that applying AI techniques along with a physical understanding of the environment can significantly improve the prediction skill for multiple types of high-impact weather. The AI approach is also a contribution to the growing field of computational sustainability. We specifically discuss the prediction of storm duration, severe wind, severe hail, precipitation classification, forecasting for renewable energy, and aviation turbulence. We also discuss how AI techniques can process “big data,” provide insights into high-impact weather phenomena, and improve our understanding of high-impact weather.
Modern artificial intelligence (AI) techniques can aid forecasters on a wide variety of high-impact weather phenomena.
Weather significantly impacts society for better and for worse. For example, severe weather hazards caused over $7.9 billion of property damage in 2015 (National Oceanic and Atmospheric Administration/National Centers for Environmental Information 2016; CoreLogic 2016). The National Academies of Sciences, Engineering, and Medicine (2016) cites improving forecasting of such events as a critical priority, and the European Centre for Medium-Range Weather Forecasts (ECMWF) recently announced goals for 2025 (ECMWF 2016) that stress the importance of improving these forecasts. On the positive side, improvements in forecasting solar power, which increasingly impacts the electrical grid, are expected to save utility companies $455 million by 2040 (Haupt et al. 2016). Additional savings can be found through improved forecasting in other areas of computational sustainability. Computational sustainability is a new and growing interdisciplinary research area focusing on computational solutions for questions of Earth sustainability.
In recent years, operational numerical weather prediction (NWP) models have significantly increased in resolution (e.g., Weygandt et al. 2009). At the same time, the number and quality of observational systems have grown, and new systems, such as Geostationary Operational Environmental Satellite R series (GOES-R), will generate high-quality data at fine spatial and temporal resolutions. These data contain valuable information, but their variety and volume can be overwhelming to forecasters, and this can hinder decision-making if not handled properly (Karstens et al. 2015, 2016). This data deluge is commonly termed “big data.” Artificial intelligence (AI) and related data science methods have been developed to work with big data across a variety of disciplines.
Applying AI techniques in conjunction with a physical understanding of the environment can substantially improve prediction skill for multiple types of high-impact weather. This approach expands on traditional model output statistics (MOS) techniques (Glahn and Lowry 1972), which derive probabilistic, categorical, and deterministic forecasts from NWP model output. Because of their simplicity and longevity, forecasters have gained trust in MOS techniques. AI techniques provide a number of advantages, including easily generalizing spatially and temporally, handling large numbers of predictor variables, integrating physical understanding into the models, and discovering additional knowledge from the data. In recent years, forecasters and researchers have begun to adopt AI techniques much more widely, as they demonstrate their power in a wide variety of applications, including postmodel bias correction, handling large datasets, reducing cognitive overload, and discovering new knowledge in large datasets. With the growth in applications for data science techniques outside of atmospheric science as well, AI techniques promise to continue to enhance prediction and understanding of many weather-related phenomena. The primary goals of this paper are to introduce modern AI techniques to a broad audience and to demonstrate their utility in predicting a wide variety of high-impact weather phenomena.
The rest of this paper is organized as follows: We first review related work and provide a brief overview of some AI techniques highlighted in this paper, followed by demonstrations of how we have applied AI techniques to multiple high-impact weather applications. We discuss the benefits of AI and automation to both researchers and forecasters and conclude by discussing how AI techniques can be further used to help meteorologists and decision-makers.
Statistical models for postprocessing NWP model output have evolved within two general frameworks. “Perfect prog” models fit relationships between observed or analyzed variables and observations of a weather feature, such as temperature or precipitation (Klein et al. 1959). The models are then applied to NWP forecasts, thus implicitly assuming that the NWP model is perfect. In contrast, MOS fits a statistical model between NWP output at a given time horizon and subsequent observations at that time (Glahn and Lowry 1972), often using linear regression. Because MOS fits use the NWP output directly, they can correct for systematic biases in a model. When NWP model configurations are updated, MOS must be retrained after a sufficient number of new model forecasts are collected. Perfect-prog models are generally less accurate than a well-tuned MOS model, but they are less sensitive to model configuration changes and tend to be more robust over time. AI techniques can be used in both frameworks.
Haupt et al. (2008) provide an overview of AI techniques applied to the environmental sciences, including artificial neural networks (ANNs), decision trees, genetic algorithms (Allen et al. 2007), fuzzy logic, and principal component analysis (Elmore and Richman 2001). Baldwin et al. (2005) used hierarchical clustering to classify precipitation areas, Gagne et al. (2009) used k-means clustering to segment a radar image, Lakshmanan et al. (2010, 2014) used k-means clustering to segment a map of radar-echo classifications, and Miller et al. (2013) used clustering to identify storm tracks.
ANNs are interconnected networks of weighted nonlinear functions. When connected and trained in multiple layers, ANNs can, in principle, approximate a very wide range of nonlinear functions. They also provide the foundation for deep learning methods. ANNs have been used in a wide variety of meteorology applications since the late 1980s (Key et al. 1989), including cloud classification (Bankert 1994), tornado prediction and detection (Marzban and Stumpf 1996; Lakshmanan et al. 2005), damaging winds (Marzban and Stumpf 1998), hail size (Marzban and Witt 2001; Manzato 2013), precipitation classification (Anagnostou 2004; Lakshmanan et al. 2014), tracking storms (Lakshmanan et al. 2000), and radar quality control (Lakshmanan et al. 2007; Newman et al. 2013).
Support vector machines (SVMs) have also been used to detect and predict tornadoes (Trafalis et al. 2003; Adrianto et al. 2009). SVMs learn a linear model in a nonlinear space by transforming the data to the nonlinear space using kernels. Both ANNs and SVMs are flexible and powerful but produce models that are often difficult to interpret in terms of underlying physical concepts that the model has identified. For the ANNs, it is difficult to interpret the weights through the nonlinear functions. For the SVMs, the data transformation makes it difficult to identify the most important features of the data or what the model has identified.
One of the simplest and most well-known statistical learning methods, linear regression, has been used in weather prediction since at least the early 1950s (Malone 1955). Kitzmiller et al. (1995) used regression to forecast the probability of severe weather, Billet et al. (1997) used it to forecast maximum hail size and large-hail probability, and Mecikalski et al. (2015) used logistic regression to forecast the probability of convective initiation, to name just a few recent examples. In linear regression, a set of weights is chosen to combine input features xi so as to best predict an output variable ŷ, for example, to minimize the summed squared prediction error. The weights can be trained using matrix inversion or other optimization schemes, ranging from basic gradient descent through genetic algorithms if the matrix is poorly conditioned. Although linear regression learns quickly even on large datasets, it works best with problems that require a linear model and have a limited feature set. If features are redundant or not predictive, they can make learning more challenging. Ridge regression (Hoerl and Kennard 1988) penalizes the sum of squared weights in order to simplify models and improve their generalization. The lasso method penalizes the sum of the weights’ absolute values, which tends to remove irrelevant variables (Tibshirani 1996). Elastic nets (Zou and Hastie 2005) combine both penalties.
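For reference, the three penalized regressions described above can be written as one objective. This is a notational sketch only: the symbols w_0 (intercept), λ (overall penalty strength), and ρ (mixing ratio between the two penalties) are introduced here for illustration and are not taken from the original text, and software packages scale these terms in different ways.

```latex
\hat{y} = w_0 + \sum_{i=1}^{p} w_i x_i ,
\qquad
\min_{w_0,\dots,w_p} \; \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2
  + \lambda \left[ \rho \sum_{i=1}^{p} \lvert w_i \rvert
  + (1 - \rho) \sum_{i=1}^{p} w_i^{2} \right]
```

Setting ρ = 0 leaves only the squared-weight penalty (ridge regression), ρ = 1 leaves only the absolute-value penalty (lasso), and intermediate values give an elastic net. With λ = 0 the objective reduces to ordinary least squares.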
Decision-tree-based methods are popular in data science for handling big data. They are able to identify and learn with only the most relevant variables, enabling users to provide many possible predictive features without worrying whether extraneous variables will overwhelm the training process. Decision trees are also human readable, which can provide insight into what relationships the model has identified related to the event being forecasted. Decision-tree-based methods have proven quite powerful in a wide variety of weather applications (Williams et al. 2008a,b; Gagne et al. 2009; McGovern et al. 2014; Williams 2014; McGovern et al. 2015; Clark et al. 2015; Elmore and Grams 2016).
Although the first objective decision-tree learning method was not developed until the mid-1980s (Quinlan 1986, 1993), subjective (human derived) decision trees have been used in meteorology since at least the mid-1960s (Chisholm et al. 1968). A decision tree splits data recursively by identifying the most relevant question at each level of the data. The tree shown in Fig. 1 was automatically developed to predict whether hail will occur. At the root node, the data are split with the question “Is the mean radar reflectivity ≤ 43.4 dBZ?” The data are further refined down each of the yes and no branches until a prediction is made at a leaf node, which may contain a class label (e.g., hail: yes), probability p [e.g., p(hail) = 0.8; Provost and Domingos 2000], scalar prediction [hail size = 3.1 in. (∼7.9 cm)], or a linear predictive function.
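The search for the "most relevant question" at a single node can be illustrated with a short sketch. It assumes binary labels and uses Gini impurity as the split criterion, which is one common choice and not necessarily the one used to build the tree in Fig. 1. The toy reflectivity values are hypothetical and were chosen so that the best split falls at the threshold quoted above.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best 'value <= threshold?' question."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split: impurity of the whole node
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # cannot split between identical values
        threshold = (lo + hi) / 2.0
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: mean radar reflectivity (dBZ) and whether hail occurred.
reflectivity = [35.0, 38.5, 41.0, 42.8, 44.0, 47.5, 51.0, 55.0]
hail = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(reflectivity, hail)
```

A full tree applies the same search recursively to the data on each side of the chosen threshold, over all available predictors, until a stopping rule such as a maximum depth is reached.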
A powerful related method is random forests (RFs; Breiman 2001). An RF is an ensemble of decision trees, each of which is trained on a separate set of bootstrapped resampled training data and selects from a random subset of questions at each node. Since they are trained on different data and using different predictors, the individual trees in the forest are diverse, providing an “ensemble of experts” that performs better than any individual tree.
Gradient boosted regression trees (GBRT; Friedman 2002) construct an ensemble of decision trees trained using boosting (Schapire 2003). Whereas each tree in an RF is equally weighted and trained on equally weighted examples, a GBRT trains on differently weighted subsets of data, where the weights are determined by the error residuals of the previous training step.
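The boosting idea can be sketched in its most common squared-error form, in which each new tree is fit to the residuals left by the trees before it. This is an illustrative simplification under stated assumptions: the trees are one-split stumps on a single predictor, the loss is squared error, and the toy data, function names, and parameter values are all hypothetical.

```python
def fit_stump(x, targets):
    """Fit a one-split regression tree (a stump) by minimizing squared error."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for v, t in zip(x, targets) if v <= threshold]
        right = [t for v, t in zip(x, targets) if v > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, v):
    threshold, left_mean, right_mean = stump
    return left_mean if v <= threshold else right_mean


def fit_boosted_stumps(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, v)
                       for pi, v in zip(predictions, x)]
    return base, learning_rate, stumps


def predict_boosted(model, v):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, v) for s in stumps)


def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Hypothetical toy data: one predictor and a storm duration in minutes.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 33.0]
model = fit_boosted_stumps(x, y)
mse_mean_only = mean_squared_error(y, [sum(y) / len(y)] * len(y))
mse_boosted = mean_squared_error(y, [predict_boosted(model, v) for v in x])
```

The learning rate shrinks each tree's contribution, so many small corrections accumulate into the final prediction. Other loss functions change only how the residual-like targets are computed at each step.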
We will demonstrate the use of both RFs and GBRTs in several of the high-impact weather domains described below. The two methods perform similarly in some cases; however, because the individual trees in the forest are equally weighted, an RF tends to regress toward the mean prediction and thus produces a less sharp forecast. GBRTs can address this issue, but sometimes, postmodel correction is also needed. We typically use isotonic regression for postmodel correction (Niculescu-Mizil and Caruana 2005).
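Isotonic regression fits a nondecreasing function to a sequence, and the standard way to compute it is the pool-adjacent-violators algorithm. The sketch below is a minimal version, assuming the observed outcomes have already been sorted by the raw model probability. The function name and toy outcomes are hypothetical.

```python
def isotonic_fit(values):
    """Pool-adjacent-violators: nondecreasing least-squares fit to a sequence."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backward while a block's mean exceeds the mean of the block after it.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical observed outcomes (1 = event occurred), ordered by raw forecast probability.
outcomes_sorted_by_forecast = [0, 0, 1, 0, 1, 1]
calibrated = isotonic_fit(outcomes_sorted_by_forecast)
```

Each fitted value is the calibrated probability assigned to the corresponding raw forecast. To calibrate a new forecast, one would look up or interpolate between the raw probabilities used in the fit.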
Both RFs and GBRTs provide the ability to measure the importance of each attribute in the dataset, which is called variable importance. After the trees are trained, each variable’s data are permuted, and performance is measured with both the permuted and original data. The most important variables are those that cause the largest drop in performance. These importance estimates can be used to gain insight into the choices made by the forests, enabling physical interpretation of the models.
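The permutation procedure described above can be sketched independently of any particular forest. The example assumes accuracy as the performance measure and uses a hypothetical stand-in for a trained model that depends only on the first predictor. All names and data are illustrative.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each predictor column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the labels
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained forest; it looks only at predictor 0."""
    return 1 if row[0] > 43.4 else 0


# Hypothetical data: [mean reflectivity (dBZ), an irrelevant second predictor].
rows = [[35.0, 3.1], [38.5, 7.2], [41.0, 1.5], [42.8, 9.9],
        [44.0, 2.4], [47.5, 8.8], [51.0, 0.7], [55.0, 5.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(toy_model, rows, labels)
```

Shuffling the predictor that the model relies on degrades accuracy, whereas shuffling the ignored predictor leaves it unchanged, so the first importance is positive and the second is zero.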
AI FOR HIGH-IMPACT WEATHER.
This section presents some of our recent work in applying AI to a variety of high-impact weather applications. The diversity in applications is intentional, to demonstrate to the reader that AI can be used for multiple problems.
Predicting a storm’s lifetime is important for forecasters as it helps to guide the creation of watches and warnings. This task requires knowledge of the current status of the storm as well as knowledge of the nearby environment. The training data for this task come from a preoperational product called ProbSevere (Cintineo et al. 2014). ProbSevere identifies and tracks storms in real time using composite reflectivity (maximum column reflectivity derived from multiple radars simultaneously) from the operational Multi-Radar Multi-Sensor (MRMS; Smith et al. 2016) system over the continental United States. ProbSevere also provides a small number of attributes that summarize information about the environment near the storm along with information on the current speed of the storm. The training labels are provided by running a post hoc storm-tracking program called best track (Lakshmanan et al. 2015). These labels were obtained by using data from the Multi-Year Reanalysis of Remotely Sensed Storms (MYRORSS; Ortega et al. 2012) project. The training and testing data were drawn from 9 April 2015 through 31 January 2016. Data are available for each storm cell on an approximate 2-min basis. To avoid cross contamination between the training and testing sets, we trained on all data except July and tested on July, dropping from training the day closest to the testing data. For bias correction, we withheld an extra month of data (August). Training data for storms lasting less than 7,200 s were subsampled, with only 10% of these data used for training; all storms in the training set lasting longer than 7,200 s were retained. This still yielded 2,872,680 samples for training. Testing data were evaluated independently on each day in July, enabling us to bootstrap the results for statistical analysis. Testing data were not subsampled.
We tested three machine-learning methods: GBRT, RF, and elastic nets. We also examined posttraining bias correction using isotonic regression. We examined multiple settings of the standard parameters to the RF and GBRT using a validation set (results not shown because of space). The best choices for the RF and GBRT were 100 trees and a maximum depth of 5. For the GBRTs, the Huber loss function was significantly better than the other loss functions. For elastic nets, we used an alpha of 0.05 and the L1 ratio of 0.9.
Figure 2 displays the predicted distributions versus the observed distributions. GBRT stands out as the best-performing method across the range of predictions. While bias correction is able to improve the performance at the end points, it is not an overall improvement on the models and was left out of the real-time testing.
The best duration prediction method, GBRT, was implemented into a real-time system running in National Oceanic and Atmospheric Administration (NOAA)’s Hazardous Weather Testbed (HWT) called Probabilistic Hazard Information (PHI; Karstens et al. 2015) that uses ProbSevere to generate automated probabilistic forecasts for thunderstorm hazards. The forecasts were tested and evaluated with nine National Weather Service (NWS) forecasters in a 3-week human–machine-mix experiment during May and June of 2016, and the acceptance of the duration predictions was evaluated. As shown in Fig. 3, forecasters on average used the predicted ProbSevere duration in approximately 75% of all forecasts, while individual acceptance of these predictions varied from as low as approximately 25% to as much as 100%. These results imply that most forecasters trust these predictions or that the predictions are within an acceptable range at the time of warning decision. However, evidence (not shown) suggests that forecasters have a strong tendency to accept the default duration value so long as it is “good enough,” and the default duration value during the experiment was assigned from our duration predictions. Therefore, forecasters may not be giving much thought to this predictive aspect of the forecast. Interestingly, research in optimizing decision-making suggests that “choice architects” should account for inaction bias by assigning the most likely best option to the available default (Milkman et al. 2008).
Real-time prediction of severe wind, defined by the NWS as a gust ≥50 knots (kt; 25.7 m s−1), is another important task for forecasters. This project uses AI techniques to predict the probability of severe wind within various buffer distances (0, 5, and 10 km around the storm cell) and time windows (0–15, 15–30, 30–45, 45–60, and 60–90 min into the future). We use two datasets to create predictors: quality-controlled radar images from MYRORSS and near-storm environment soundings from the Rapid Update Cycle (RUC) model (Benjamin et al. 2004). MYRORSS has a resolution of 1 km and 5 min, while the RUC has a resolution of 13 km (20 km for earlier times) and 1 h. To determine when and where severe winds occurred (verification data), we use surface observations from four datasets: the Meteorological Assimilation Data Ingest System (MADIS; McNitt et al. 2008), Oklahoma Mesoscale Network (Mesonet; McPherson et al. 2007), 1-min meteorological aerodrome reports (METARs; National Climatic Data Center 2006), and NWS local storm reports (Storm Prediction Center 2015).
Before training the models, four types of data processing are applied. First, storm cells are identified and tracked through time using both real-time (Lakshmanan and Smith 2010) and postevent (Lakshmanan et al. 2015) methods. Real-time tracking outlines the edge of each storm cell, and postevent tracking corrects deficiencies in real-time tracking, mainly false truncations. Data are processed for 804 days in the continental United States (all days from 2004 to 2011 with ≥30 NWS wind reports and available MYRORSS data). This results in nearly 20 million storm objects, where a “storm object” is one storm cell at one time step. Second, wind observations are causally linked to storm cells. For each wind observation W, storm objects are interpolated along their respective tracks to the same time as W. If the edge of the nearest storm object S is within the given buffer distance (0, 5, or 10 km), W is linked to S and all other storm objects in the same track.
Third, predictors are calculated for each storm object. There are four types of predictors: radar statistics (mean, standard deviation, skewness, kurtosis, and seven percentiles calculated for each of 12 variables, based only on pixels inside the storm object; the same statistics are calculated for gradient magnitudes of the 12 variables), storm motion (speed and direction), shape parameters (area, orientation, eccentricity, etc., of the storm object), and sounding indices (both dynamic and thermodynamic). Sounding indices are calculated from interpolated RUC data using the Sounding and Hodograph Analysis and Research Program in Python (SHARPpy) software (Halbert et al. 2015). There are a total of 431 predictors. The fourth step is to label each storm object S. If S is linked to a wind observation ≥50 kt (25.7 m s−1) over the given buffer distance and time window, its label is “true” (Fig. 4a).
For each buffer distance and time window, a GBRT ensemble is trained. Then, isotonic regression (IR) is trained with independent data (no case within 24 h of a GBRT-training case) to bias correct the GBRTs. Next, the calibrated model (GBRT + IR) is tested on independent data. Results are shown in Fig. 4 for the median buffer distance (5 km) and lead time (30–45 min). The model shown in Fig. 4 is an ensemble of 500 GBRTs trained with the AdaBoost algorithm (Freund and Schapire 1997), resampling factor of 0.15 (with replacement), learning rate of 0.1, 25 variables tested per branch node, and a minimum of 10 storm objects per leaf node. Results are based on 12,155 test cases. No premodel variable selection was done, because decision trees perform built-in variable selection. The area under the receiver operating characteristic (ROC) curve (AUC) is >0.9, which is generally considered excellent (Luna-Herrera et al. 2003; Muller et al. 2005; Mehdi et al. 2011), and the reliability curve (Fig. 4d) is very close to perfect (x = y). Furthermore, the maximum critical success index (CSI) occurs with a frequency bias of 1.0 (unbiased model), which suggests that bias need not be sacrificed for other performance metrics. These results are based on Lagerquist (2016).
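The contingency-table scores mentioned above follow directly from counts of hits, misses, and false alarms once a probability threshold converts each forecast to yes or no. The sketch below uses the standard definitions of probability of detection (POD), frequency bias, and CSI. The forecast probabilities and observations are hypothetical and are unrelated to the results in Fig. 4.

```python
def contingency_scores(probabilities, observed, threshold):
    """POD, frequency bias, and CSI for yes/no forecasts at a probability threshold."""
    hits = misses = false_alarms = 0
    for p, event in zip(probabilities, observed):
        forecast_yes = p >= threshold
        if forecast_yes and event:
            hits += 1
        elif forecast_yes:
            false_alarms += 1
        elif event:
            misses += 1
    nan = float("nan")
    n_observed = hits + misses                 # all observed events
    n_any = hits + misses + false_alarms       # everything except correct nulls
    return {
        "pod": hits / n_observed if n_observed else nan,
        "frequency_bias": (hits + false_alarms) / n_observed if n_observed else nan,
        "csi": hits / n_any if n_any else nan,
    }


# Hypothetical calibrated probabilities and matching observations (1 = severe wind).
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
observed = [1, 1, 0, 1, 0, 0, 0, 0]
scores = contingency_scores(probabilities, observed, threshold=0.5)
```

A frequency bias of 1.0 means that the event is forecast exactly as often as it is observed. Repeating the calculation over a range of thresholds shows where CSI peaks and what the bias is at that threshold.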
Prediction of hail occurrence and size days to hours ahead is needed to guide the issuance of convective outlooks and watches. Convection-allowing model (CAM) ensembles provide information about storm intensity, location, and evolution but do not forecast maximum hail size at the surface directly. Machine-learning models have been developed to predict the probability of hail occurrence and the expected hail-size distribution given information about storms and their environment from CAM output. The machine-learning hail models have been run in real time on two CAM ensemble systems and have been validated against the HAILCAST diagnostic (Adams-Selin and Ziegler 2016) and storm surrogate variables, such as updraft helicity (Sobash et al. 2016).
A storm-centered method is used for producing machine-learning hail forecasts. First, potential hailstorms are identified from the hourly maximum column total graupel field in the 2014 and 2015 Center for Analysis and Prediction of Storms (CAPS) CAM ensemble using the enhanced watershed feature identification technique. Observed hailstorms are identified from the maximum expected size of hail (MESH) field (Witt et al. 1998) in the NOAA National Severe Storms Laboratory (NSSL) Multi-Radar Multi-Sensor mosaic (Smith et al. 2016). Both forecast and observed storms are tracked through time and then matched based on proximity in space and time. Statistics describing the storm and environmental variables from within the bounds of each forecast storm are extracted and are used as input into the machine-learning models. A gamma distribution is fit to the distribution of MESH within an observed hailstorm, and the parameters of the gamma distribution are used as target labels for the machine-learning models.
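As an illustration of how gamma parameters can be derived from the MESH values within one observed storm, the sketch below uses the method of moments. The fitting method is an assumption, since the text does not state how the fit was performed, and maximum likelihood is a common alternative. The function name and the MESH values are hypothetical.

```python
import statistics


def gamma_method_of_moments(sizes):
    """Estimate gamma shape and scale from the sample mean and variance.

    For a gamma distribution, mean = shape * scale and variance = shape * scale**2.
    Requires at least two distinct values so that the variance is nonzero.
    """
    mean = statistics.mean(sizes)
    variance = statistics.pvariance(sizes)
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale


# Hypothetical MESH values (mm) at the grid points inside one observed hailstorm.
mesh_mm = [10.0, 20.0, 30.0, 40.0]
shape, scale = gamma_method_of_moments(mesh_mm)
```

The resulting shape and scale pair is the kind of target label described above, with one pair per matched storm.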
An RF classification model predicts whether hail will occur based on whether an observed storm was matched with a given forecast storm and the hail-size distribution parameters given that hail occurred. An RF regression model estimates both the shape and scale parameters of the gamma distribution simultaneously to preserve the correlations among the parameters in the predictions. Gridded hail-size forecasts are produced by sampling hail sizes from the predicted distribution and applying them in rank order onto the column total graupel field. Potential hailstorms with less than 50% chance of hail occurrence are removed from the grid.
Verification results and a single forecast case are shown in Fig. 5 for the machine-learning hail forecasts and other storm surrogate probability forecasts, including HAILCAST, column total graupel, and updraft helicity. The RF used for this experiment was trained on CAPS ensemble forecasts from May to June 2014 and evaluated on CAPS ensemble forecasts for the same period in 2015. These results were based on analysis from Gagne (2016). The performance diagram (Roebber 2009) in Fig. 5a shows that for a given probability threshold, the machine-learning models tend to have fewer false alarms, lower frequency bias, and higher accuracy than other methods. The attributes diagram (Hsu and Murphy 1986) in Fig. 5b indicates the probabilities from the machine-learning models and updraft helicity are generally reliable, while other methods tend to produce probabilities that are overconfident. The case study in Fig. 5c shows that the RF model performed best at capturing the area where 50-mm hail occurred. The other two methods had both lower probabilities and enhanced probabilities in areas where 50-mm hail did not occur.
The Meteorological Phenomena Identification Near the Ground (mPING; Elmore et al. 2014) project has collected over 1.1 million observations since its launch on 19 December 2012. The mPING project uses crowdsourced observations of precipitation type (ptype) submitted anonymously through a smartphone app. Various other weather conditions can also be reported, such as floods, visibility restrictions, wind damage, hail, and tornadoes. The ptype observations have been used to help characterize the sensitivity of various ptype algorithms to model errors (Reeves et al. 2014) and to verify current NWP model performance and, in the process, find an outright error within the postprocessing of the RAP model (Elmore et al. 2015).
Given that the skill of NWP ptype forecasts has been characterized with mPING observations, a compelling next step is to use the mPING observations to build a new, hopefully improved, ptype algorithm. As a first attempt, the wet-bulb temperature Tw profiles from 5,000 m AGL to the surface created by each NWP model are characterized as one of four different types that are identical to the four types described in Schuur et al. (2012): type 1 is all Tw below freezing (273.16 K); type 2 has one freezing level such that Tw at the surface is above freezing; type 3 has three freezing levels with an elevated warm layer, an elevated cold layer, and Tw at the surface above freezing; and type 4 is the “classic” elevated-warm-layer profile with Tw at the surface below freezing. Multiple predictors are computed for each profile type, including area above and below freezing for each layer, height of the various freezing levels, wind shear [both bulk and in zonal and meridional (u and υ, respectively) directions] in warm and cold layers and across the entire depth of the profile, area of relative humidity (RH) above and below 0.8 for each layer along with the mean RH in each layer, and minimum Tw in the cold surface layers. Each profile type has a different set of predictors, though some predictors are common across all profile types. Overall, type 1 profiles have 28 predictors, type 2 profiles have 23, type 3 profiles have 49, and type 4 profiles have 38.
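The four profile types can be distinguished by counting how many times Tw crosses the freezing threshold between the top of the profile and the surface. The sketch below is one possible reading of the definitions above, assuming the profile is supplied as a list of Tw values ordered from top to surface and that a value exactly at the threshold counts as not below freezing. The function name and sample profiles are hypothetical.

```python
FREEZING_K = 273.16  # freezing threshold (K) as quoted in the text


def profile_type(tw_top_to_surface):
    """Classify a wet-bulb temperature profile (K, ordered top to surface)."""
    warm = [tw >= FREEZING_K for tw in tw_top_to_surface]
    # Each change between warm and cold marks one freezing level.
    freezing_levels = sum(a != b for a, b in zip(warm, warm[1:]))
    surface_warm = warm[-1]
    if freezing_levels == 0 and not surface_warm:
        return 1  # entirely below freezing
    if freezing_levels == 1 and surface_warm:
        return 2  # single freezing level, warm surface
    if freezing_levels == 3 and surface_warm:
        return 3  # elevated warm layer, elevated cold layer, warm surface
    if freezing_levels == 2 and not surface_warm:
        return 4  # elevated warm layer above a cold surface layer
    return None  # profile does not match any of the four types


# Hypothetical coarse profiles (K), top to surface.
examples = {
    "all cold": [260.0, 265.0, 270.0],
    "warm surface": [260.0, 270.0, 275.0],
    "warm-cold-warm": [260.0, 275.0, 270.0, 276.0],
    "elevated warm layer": [260.0, 275.0, 270.0],
}
types = {name: profile_type(tw) for name, tw in examples.items()}
```

Once a profile is assigned a type, the predictors specific to that type would be computed and passed to the corresponding model.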
Because each profile type has a different set of predictors, each has its own RF. Training data consist of 80% of the available hours of data selected randomly. The remaining 20% of the hours are used for testing. Hours, instead of individual observations, are chosen so as to lessen cross contamination of testing data with training data. Thus, a training profile and a testing profile cannot come from the same hour.
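Splitting by hour rather than by observation can be sketched as follows. The record layout (a dictionary with an "hour" key), the function name, and the toy observations are hypothetical. Only the 80/20 split over hours comes from the text.

```python
import random


def split_by_hour(observations, test_fraction=0.2, seed=0):
    """Assign whole hours to either training or testing, never both."""
    hours = sorted({obs["hour"] for obs in observations})
    rng = random.Random(seed)
    rng.shuffle(hours)
    n_test = max(1, round(test_fraction * len(hours)))
    test_hours = set(hours[:n_test])
    train = [obs for obs in observations if obs["hour"] not in test_hours]
    test = [obs for obs in observations if obs["hour"] in test_hours]
    return train, test


# Hypothetical reports: three observations in each of ten hours.
observations = [{"hour": f"2015-01-05T{h:02d}", "ptype": "snow"}
                for h in range(10) for _ in range(3)]
train, test = split_by_hour(observations)
train_hours = {obs["hour"] for obs in train}
test_hours = {obs["hour"] for obs in test}
```

Because every observation from a given hour lands on the same side of the split, reports that are close together in time cannot appear in both sets.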
These data are not balanced, in that there are far more snow and rain reports than ice pellets and freezing rain. Sampling weights and maximum tree size are adjusted by trial and error such that the bias for each of the four classes generated by each random forest is close to 1. No other adjustments are made.
Applying RFs in this way results in marked improvement in the ptype prediction from NWP models. Figure 6 is an example of this improvement for the Rapid Refresh (RAP) model over the cold season of 2014/15 with confidence intervals for each score. The right set of bars shows the output of the RAP postprocess ptype algorithm while the left set of bars displays the results of an RF ptype algorithm. Scores for the RF algorithm come from a smaller number of cases (the test data) than the scores for the RAP, which use the entire available dataset. There is not much room for improvement in predicting rain and snow, but the improvement for freezing rain and ice pellets is quite dramatic. In addition, the RF ptype output is unbiased, unlike the postprocessed ptype output. RFs can also provide probabilistic information about the ptype, which will likely be useful to operational forecasters and those involved in maintaining infrastructure systems. Clearly, if sufficient data are available, an RF approach to forecast ptype can lead to significant improvement to the most troublesome winter precipitation types.
Variable importance is examined for each forest for each model. No variable stands out as much more important than the others: at the most extreme, the most important variable is roughly twice as important as the least important variable. Because of this characteristic, variable selection is deemed unnecessary.
Forecasting for renewable energy resources is another example of high-impact weather forecasts. In this case, forecasting enables using clean, locally available, but highly variable renewable resources to produce energy in place of fossil fuel energy sources. Because the wind, water, and solar resources are highly variable, forecasting is needed to blend renewable power with other energy sources to assure reliable, efficient, and economic deployment. Utilities require forecasts on various scales. Here, we describe two shorter-range scales: the nowcast, for the next 3–6 h, and the day-ahead forecast (which can extend to 72 h to cover weekends). The nowcast is necessary to blend renewable energy into the grid in order to meet the electric load in real time. The day-ahead forecast is used for planning unit allocation and trading energy with other utilities. We specifically discuss how AI is used for forecasting for wind and solar energy, with more detailed descriptions of additional prediction methods being provided by Ahlstrom et al. (2013), Orwig et al. (2014), Tuohy et al. (2015), and Haupt et al. (2016).
The nowcast typically leverages observations from the wind or solar plant or remotely sensed data. The goal is to improve upon a persistence forecast at the location of the plant. Statistical learning and AI methods capture changes or deviations from persistence. One statistical learning method for wind speed nowcasts is the Markov-switching vector autoregressive model (Hering et al. 2015). Solar power nowcasting has leveraged various statistical learning methods. Hassanzadeh et al. (2010) and Yang et al. (2012) used autoregressive integrated moving average (ARIMA) models to predict solar irradiance and power, demonstrating lower errors than other time series models. ANNs are commonly used for nonlinear solar predictions (Mellit 2008) and have shown skill over other baseline techniques (Marquez and Coimbra 2011; Wang et al. 2012; Chu et al. 2013). Support vector machines have also shown skill over linear regression in postprocessing NWP model output (Sharma et al. 2011).
Solar models typically predict the clearness index, the ratio of the global horizontal irradiance (GHI) that reaches the surface of Earth to that at the top of the atmosphere. The clearness index ranges between 0 and 1 and depicts the depletion of solar energy via absorption and scattering by clouds and aerosols on its path through the atmosphere. It also removes the effects of the seasonal cycles and partially accounts for diurnal effects. One can explicitly compute the GHI at the top of the atmosphere given the solar angle and location information.
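The clearness index calculation can be sketched as below. The solar zenith angle is taken as an input instead of being derived from location and time. The solar constant of 1361 W m−2 and the cosine approximation for the Earth–Sun distance correction are assumptions of this sketch, not values from the text, and all function names are hypothetical.

```python
import math

SOLAR_CONSTANT = 1361.0  # W m^-2; assumed value of the solar constant


def toa_horizontal_irradiance(day_of_year, zenith_deg):
    """Irradiance on a horizontal plane at the top of the atmosphere (W m^-2)."""
    # Common approximation for the seasonal Earth-Sun distance correction.
    eccentricity = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    cos_zenith = max(0.0, math.cos(math.radians(zenith_deg)))
    return SOLAR_CONSTANT * eccentricity * cos_zenith


def clearness_index(ghi_surface, day_of_year, zenith_deg):
    """Ratio of surface GHI to top-of-atmosphere GHI; None when the sun is down."""
    toa = toa_horizontal_irradiance(day_of_year, zenith_deg)
    if toa <= 0.0:
        return None
    return ghi_surface / toa


# Hypothetical observation: 800 W m^-2 at the surface near the June solstice,
# with the sun 30 degrees from the zenith.
kt = clearness_index(800.0, day_of_year=172, zenith_deg=30.0)
```

For this hypothetical case the index is about 0.7. Dividing by the top-of-atmosphere value is what removes most of the seasonal and diurnal signal, leaving variability that is mainly due to clouds and aerosols.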
Some recent work has sought to identify regimes and forecast solar irradiance changes specific to those regimes through both implicit and explicit methods. The implicit method employs a regression tree approach (Quinlan 1996) with an embedded nearest neighbor scheme to forecast both deterministic irradiance and its variability (McCandless et al. 2015). Explicit regime identification using k-means clustering and training ANNs for each cluster was shown to improve over training a single ANN on the entire training dataset (McCandless et al. 2016b,a). These approaches to statistical forecasting outperformed a “smart persistence” approach that includes the change in solar angle. When compared to other nowcasting products, the statistical forecasting approach outperformed all others for the first hour (Haupt et al. 2016), as demonstrated in Fig. 7.
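The "smart persistence" baseline mentioned above can be illustrated with a short Python sketch. We assume here one common formulation, in which the current clearness index is held constant and only the top-of-atmosphere irradiance changes with the solar angle; the cited studies may define the baseline differently, and the function name is hypothetical.

```python
def smart_persistence_ghi(ghi_now, toa_now, toa_future):
    """Forecast GHI by persisting the current clearness index.

    ghi_now    : observed surface GHI at the current time (W m^-2)
    toa_now    : top-of-atmosphere horizontal irradiance now (W m^-2)
    toa_future : top-of-atmosphere horizontal irradiance at the valid time
    """
    if toa_now <= 0.0 or toa_future <= 0.0:
        # No usable clearness index at night; this sketch simply returns zero.
        return 0.0
    clearness_now = ghi_now / toa_now
    # Only the solar geometry changes; the cloud and aerosol state is persisted.
    return clearness_now * toa_future


# Hypothetical values: half of the available irradiance reaches the surface now.
forecast = smart_persistence_ghi(ghi_now=400.0, toa_now=800.0, toa_future=600.0)
```

A statistical nowcast adds value over this baseline only when it captures changes in cloud cover, because the deterministic change in solar angle is already accounted for.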
Day-ahead forecasting approaches use AI models to postprocess and correct NWP model output toward observations. Common methods of postprocessing include ANNs and blended optimization methods. The Dynamic Integrated Forecast (DICast) system (Myers et al. 2011; Mahoney et al. 2012) first applies a dynamic MOS approach followed by optimized blending. This system has improved forecasts of wind and solar power by at least 15% (Mahoney et al. 2012; Haupt et al. 2016).
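To make the idea of correction followed by blending concrete, the Python sketch below removes each model's recent mean bias and then combines the corrected forecasts with weights inversely proportional to their remaining error variance. This is a deliberately simplified stand-in and is not the DICast algorithm; the function name, the inverse-variance weighting, and the data layout are assumptions made for illustration.

```python
def bias_corrected_blend(past_forecasts, past_obs, new_forecasts):
    """Blend several model forecasts after a simple bias correction.

    past_forecasts : dict mapping model name -> list of past forecasts,
                     aligned in time with past_obs
    past_obs       : list of verifying observations
    new_forecasts  : dict mapping model name -> new forecast to be corrected
    """
    n = len(past_obs)
    corrected = {}
    inverse_variance = {}
    for name, forecasts in past_forecasts.items():
        errors = [f - o for f, o in zip(forecasts, past_obs)]
        bias = sum(errors) / n
        # Error variance that remains after the mean bias is removed.
        variance = sum((e - bias) ** 2 for e in errors) / n
        corrected[name] = new_forecasts[name] - bias
        inverse_variance[name] = 1.0 / (variance + 1e-9)  # guard against zero
    total = sum(inverse_variance.values())
    return sum(inverse_variance[m] / total * corrected[m] for m in corrected)


# Hypothetical wind speed history (m/s) for two models.
obs = [10.0, 12.0, 14.0]
history = {"model_a": [11.0, 13.0, 15.0], "model_b": [8.0, 13.0, 15.0]}
blended = bias_corrected_blend(history, obs, {"model_a": 21.0, "model_b": 25.0})
```

In this example the first model has a constant bias and is therefore nearly perfect once corrected, so the blend leans almost entirely on it.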
For true decision support, utilities and grid operators do not want only wind speed or GHI forecasts; they require power predictions. Although manufacturers of wind turbines and solar panels provide average power curves, these are not perfectly representative of actual power produced at a site because of variation in terrain elevation, turbulence, and other factors. Thus, training an AI method to convert from wind speed or GHI to power can produce better power predictions for a specific site (Parks et al. 2011) and does not require the detailed metadata needed to apply alternative methods for solar irradiance (Haupt and Kosovic 2016). The National Center for Atmospheric Research (NCAR) has successfully applied the Cubist regression tree approach to power conversion for both wind (Kosovic et al. 2015) and solar (Haupt and Kosovic 2016) energy.
Finally, many utilities request probabilistic predictions to estimate the forecast uncertainty and to plan their reserve requirements. Although NWP model ensembles traditionally provide probabilistic forecasts, the analog ensemble approach (AnEn; Delle Monache et al. 2013) has successfully produced probabilistic forecasts based on a single high-quality forecast from a consistent prediction system. The AnEn searches through historical forecasts for those most similar to the current forecast. Observations associated with each historical forecast form a probability density function that defines the forecast uncertainty. The AnEn mean can correct for systematic biases. This approach has proven to be at least as reliable as some of the best dynamical ensembles for wind speed (Delle Monache et al. 2013; Haupt and Delle Monache 2014), wind power (Kosovic et al. 2015), and solar power (Alessandrini et al. 2015).
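The analog search can be sketched in a few lines of Python. This minimal version ranks historical forecasts by plain Euclidean distance to the current forecast and returns the matching observations as ensemble members. Published AnEn implementations use a weighted, normalized metric evaluated over a short time window, so the distance measure, the function names, and the sample data here are simplifying assumptions.

```python
import math


def analog_ensemble(current, history, n_analogs=3):
    """Return observations paired with the historical forecasts closest to `current`.

    current : tuple of predictor values from the current forecast
    history : list of (past_forecast_predictors, verifying_observation) pairs
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(history, key=lambda pair: distance(current, pair[0]))
    return [obs for _, obs in ranked[:n_analogs]]


def exceedance_probability(members, threshold):
    """Fraction of ensemble members above a threshold."""
    return sum(1 for m in members if m > threshold) / len(members)


# Hypothetical archive: (forecast wind speed m/s, forecast temperature C) -> observed speed.
archive = [
    ((5.0, 20.0), 4.5),
    ((5.5, 21.0), 5.5),
    ((10.0, 15.0), 11.0),
    ((4.0, 19.0), 4.0),
    ((12.0, 10.0), 12.5),
]
members = analog_ensemble((5.0, 20.0), archive, n_analogs=3)
mean_forecast = sum(members) / len(members)
prob_above = exceedance_probability(members, threshold=4.2)
```

The members approximate the probability density function described above: their mean gives a bias-corrected deterministic forecast, and their spread gives an estimate of the forecast uncertainty.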
AI methods are directly providing decision support for utilities and grid operators around the world and are enabling increased deployment of variable renewable energy resources. All of the methods described in this section have been operationalized and used by utilities. By enabling higher capacities of renewable energy, these methods can contribute to energy security, lower the use of water in energy production, and lower the emissions of carbon dioxide and other pollutants, thus helping to provide the world with a clean source of sustainable energy.
Although much of the severe weather of concern to humans occurs near the surface, conditions far above the ground may be equally hazardous. Commercial aviation is impacted by various weather threats, including airframe icing by supercooled liquid water, engine flameouts in areas of high ice water content, hail, lightning, and atmospheric turbulence. Turbulence is one of the most significant en route aviation hazards from an operational standpoint. Flying through turbulent eddies causes an aircraft to bounce from side to side and up and down, making passengers and crew uncomfortable and occasionally injuring them or damaging the aircraft. Turbulence is created by wind shear in regions of low stability, which may result from jet streams and fronts, mountain-wave or convectively induced gravity wave breaking, or the updrafts and downdrafts of thunderstorms. Because it is often a small-scale and fundamentally stochastic phenomenon, turbulence is difficult to forecast or even nowcast. Moreover, NWP models are not generally tuned to accurately forecast aviation-scale turbulence, and output variables such as subgrid turbulent kinetic energy (TKE) are not skillful in predicting aircraft observations of turbulence (Sharman 2016).
AI has become a key tool for observing, nowcasting, and forecasting aviation turbulence. For observing turbulence in clouds and storms, a fuzzy logic algorithm was developed to carefully quality control ground-based Doppler radar spectrum width measurements, allowing them to be scaled and combined into an estimate of the turbulence eddy dissipation rate (EDR). Fuzzy logic is a tool for building expert systems that mimic human reasoning, smoothly combining various sources of evidence to form a final assessment (Williams 2009). For the turbulence detection algorithm, the likelihood of radar spectrum width contamination is scored as a “confidence” between 0 and 1 for each of several diagnostic quantities derived from the radar signal or its spatial context and then these are combined in a geometric average to obtain an overall assessment. The spectrum widths are scaled to EDR based on distance from the radar, and a confidence-weighted average is performed to obtain the final EDR estimate (Williams and Meymaris 2016).
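The two combination steps described above, a geometric average of per-diagnostic confidences followed by a confidence-weighted average of the scaled measurements, can be sketched in Python as follows. The function names and sample values are hypothetical, and the sketch omits the membership functions that map each diagnostic to a confidence as well as the range-dependent scaling of spectrum width to EDR.

```python
import math


def overall_confidence(confidences):
    """Geometric mean of per-diagnostic confidences, each in [0, 1]."""
    if any(c <= 0.0 for c in confidences):
        return 0.0  # a single zero confidence vetoes the measurement
    return math.exp(sum(math.log(c) for c in confidences) / len(confidences))


def confidence_weighted_edr(edr_values, confidences):
    """Confidence-weighted average of EDR estimates from nearby measurements."""
    total = sum(confidences)
    if total == 0.0:
        return None  # no trustworthy measurements available
    return sum(e * c for e, c in zip(edr_values, confidences)) / total


# Hypothetical confidences from two diagnostics for one radar measurement.
conf = overall_confidence([1.0, 0.25])
# Hypothetical EDR values from two measurements with equal confidence.
edr = confidence_weighted_edr([0.1, 0.3], [0.5, 0.5])
```

A geometric rather than arithmetic average means that one strongly negative piece of evidence pulls the overall confidence down sharply, which is a conservative choice for quality control.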
Aviation turbulence forecasting utilizes diagnostics, or indices, computed from NWP model-resolved wind shear, stability, and various other functions of the modeled variables (Sharman 2016). Although none of these explicitly represents aircraft-scale turbulence, the form of the turbulent energy cascade means that they may be related to it and thus may be transformed and weighted to form a good estimate. The Graphical Turbulence Guidance (GTG) algorithm (Sharman et al. 2006) evaluates each diagnostic against aircraft observations of turbulence, rescales it using a piecewise linear function, and uses weights based on the resulting skill scores to compute a weighted-mean consensus. More recent versions of GTG incorporate lognormal remapping functions. A weakness of this approach is that it does not take into account the linear and nonlinear dependencies between the diagnostics, many of which are highly correlated.
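A skill-weighted consensus of rescaled diagnostics, in the spirit of the description above, might look like the following Python sketch. It is not the GTG implementation: the diagnostic names, breakpoints, and skill values are hypothetical, and the skill scores are used directly as weights for simplicity.

```python
def piecewise_linear(x, xs, ys):
    """Map x through breakpoints (xs ascending), clamping outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            fraction = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + fraction * (ys[i] - ys[i - 1])


def weighted_consensus(values, remaps, skills):
    """Skill-weighted mean of rescaled diagnostics.

    values : dict mapping diagnostic name -> raw diagnostic value
    remaps : dict mapping diagnostic name -> (xs, ys) breakpoints
    skills : dict mapping diagnostic name -> skill score used as a weight
    """
    total = sum(skills[name] for name in values)
    weighted = sum(
        skills[name] * piecewise_linear(values[name], *remaps[name])
        for name in values
    )
    return weighted / total


# Hypothetical diagnostics: one increases with turbulence, one decreases.
values = {"shear": 5.0, "richardson": 1.0}
remaps = {
    "shear": ([0.0, 10.0], [0.0, 1.0]),
    "richardson": ([0.0, 4.0], [1.0, 0.0]),
}
skills = {"shear": 3.0, "richardson": 1.0}
consensus = weighted_consensus(values, remaps, skills)
```

Because each diagnostic is rescaled and weighted independently, this form cannot exploit interactions between diagnostics, which is the weakness noted above.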
Decision-tree-based techniques offer the ability to incorporate features not proportional or even monotonically related to turbulence severity. Williams (2014) used RFs to combine both NWP diagnostics and features derived from satellite and radar products to create turbulence nowcasts. Predictors included NWP-derived turbulence diagnostics and thermodynamic variables such as convective available potential energy (CAPE) and convective inhibition (CIN); distances to relevant reflectivity, echo top, lightning, and in-cloud turbulence objects; and disc statistics over various radii from both the radar and satellite imagery. Several hundred candidate predictors were whittled down first through the RF’s variable importance analyses and then through forward and backward selection, where an RF is trained and evaluated on independent datasets and the predictor variables producing the best discrimination skill are preserved. The RF is then calibrated to produce either EDR or turbulence probability, and the resulting algorithm is run at every point in a predefined grid to produce a map suitable for use by pilots, dispatchers, or air traffic controllers. Although the benefits of an AI approach are particularly clear for fusing multiple data sources for turbulence nowcasting, Table 1 indicates that logistic regression, k-nearest neighbor, and especially RF exceed GTG’s skill even in the case when only NWP model data are used as predictors. A similar approach has been used to forecast convection (Mecikalski et al. 2015; Ahijevych et al. 2016). A downside of this approach is the need for significant feature engineering, that is, calculating many different features and then testing which are relevant. This requirement is somewhat mitigated by McGovern et al. (2014), who used spatiotemporal relational random forests guided by a schema identifying possibly relevant relationships between an aircraft location and various storm-related objects. 
In the future, convolutional neural networks operating in a deep learning framework may reduce the need for feature engineering even further. The use of AI for turbulence prediction will continue to make flights safer and more comfortable.
Application of modern AI techniques to high-impact weather forecasting is improving our ability to sift through the deluge of big data to extract insights and accurate, timely guidance for human weather forecasters and decision-makers. AI techniques build on traditional methods, such as MOS, by providing more flexible and powerful models capable of identifying complex relationships between a huge number of modeled and observed weather features or derived quantities. In addition, AI methods extend easily to directly predicting impacts of high-impact weather, such as power generated by variable sources such as solar or wind, energy consumption in an area, or airport arrival capacity.
This paper raises the interesting question of the role of automated guidance in forecasts. While we have demonstrated that AI/data science techniques can be used to significantly improve forecasts in a variety of high-impact weather domains, it is not simply a matter of bringing these techniques to operations. The forecasters must be able to trust the forecast produced by such techniques, as has been demonstrated in the HWT/PHI experiments (Karstens et al. 2016).
For forecasts of standard weather variables, such as temperature and precipitation, the NWS currently operates with a human-in-the-loop paradigm in which forecasters subjectively blend and adjust multiple sources. Local offices add predictive value in situations where local effects have a larger impact on the forecast. At the NWS Weather Prediction Center, which issues temperature and precipitation forecasts over the entire United States, the human forecasts now perform significantly worse than downscaled, bias-corrected ensemble forecasts for temperature and precipitation (Novak et al. 2014). Official NWS track forecasts of hurricanes, a major form of high-impact weather, also perform worse than weighted ensemble consensus forecasts (Cangialosi and Franklin 2015). There are also issues with spatial discontinuities in forecasts and warnings between the domains of different forecast offices (Gilbert et al. 2015). Private weather firms, including The Weather Company, operate in a human-over-the-loop paradigm in which an optimal blend of bias-corrected model output is generated as needed by users, and human forecasters can add filters and qualifiers to account for observed short-term biases or data quality issues (Williams et al. 2016). This approach scales easily and only requires a small team of meteorologists to oversee a mostly automated system. The downside of a heavily automated approach is that forecasters may become disengaged from the forecast process (Pliske et al. 2004) and struggle to take appropriate corrective action when automation fails (Skitka et al. 1999; Pagano et al. 2016).
By studying the error characteristics of different machine-learning methods in high-impact weather situations, researchers and forecasters can identify when the automated guidance should be trusted and when it is more likely to struggle. The methods presented in this paper are able to blend physical knowledge with automated corrections to produce critical products in this age of information overload.
This material is based upon work supported by the National Science Foundation under Grant SHARP NSF AGS-126-1776. Funding was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA11OAR4320072, U.S. Department of Commerce. This work was supported by the NEXRAD Product Improvement Program, by NOAA/Office of Oceanic and Atmospheric Research. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of NOAA, the U.S. Department of Commerce, or the University of Oklahoma. NCAR is sponsored by the National Science Foundation. The authors thank Tara Jensen for producing Fig. 7.
CURRENT AFFILIATIONS: Karstens—NOAA/National Weather Service/Storm Prediction Center, Norman, Oklahoma; Williams—The Weather Company, An IBM Business, Andover, Massachusetts