1. Introduction
Improvements to numerical weather prediction (NWP) models have been and continue to be made to predict weather more accurately. However, forecasts from even state-of-the-art NWP models remain uncertain because of the growing, but still limited, observational information available to initialize these models and our incomplete understanding of atmospheric processes and their representations in NWP models, e.g., parameterizations of unresolved cloud and turbulence processes (Buizza et al. 2005; Slingo and Palmer 2011; Bauer et al. 2015). Hence, a single deterministic forecast is inevitably accompanied by some uncertainty, as it can represent only one weather state among a range of possible weather states.
These uncertainties carry over to operational aviation turbulence forecasts that are based on operational NWP models. The resolution of operational NWP models is becoming finer, reaching a few kilometers in the case of limited-area models, e.g., the National Oceanic and Atmospheric Administration (NOAA) High-Resolution Rapid Refresh (HRRR) (Benjamin et al. 2016), the Met Office (UKMO) Unified Model (UM) Regional Atmosphere (Bush et al. 2020), and the Météo-France AROME (Seity et al. 2011); and O(10) km in the case of global models, e.g., the National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) (NCEP 2022), the UKMO UM Global Atmosphere (Walters et al. 2019), the European Centre for Medium-Range Weather Forecasts (ECMWF) High-Resolution Forecast (HRES) (Haiden et al. 2021), and the German Weather Service (DWD) Icosahedral Nonhydrostatic (ICON) model (Zängl et al. 2015). However, these resolutions are still not fine enough to resolve the small-scale turbulence that affects aircraft operations (on the order of a few hundred meters or smaller). This is not expected to change for some time, for global models in particular, as tremendous computing power would be needed to accommodate the small grid spacing required to resolve aviation-scale turbulence. Therefore, today’s operational turbulence forecasts rely on turbulence diagnostics, which assume that the large-scale atmospheric processes resolved by NWP models are tied to the generation of unresolved aircraft-scale turbulence (Ellrod and Knapp 1992; Ellrod and Knox 2010; Gill and Buchanan 2014; Sharman et al. 2006; Sharman and Pearson 2017). However, these diagnostics all suffer from some uncertainty as well. First, turbulence diagnostics do not explicitly resolve aircraft-scale turbulence but only infer it from the resolved large-scale NWP fields. Second, individual diagnostics can represent only certain aspects of the known turbulence generation mechanisms, and some of these mechanisms are themselves not completely understood. This uncertainty in diagnosing turbulence can adversely affect the skill of operational deterministic turbulence forecasts.
As with NWP models themselves, these uncertainties in aviation turbulence forecasts can be expressed through probabilistic forecasts. Probabilistic turbulence forecasts may represent the uncertainty in the NWP model by generating a range of possible weather states (Gill and Buchanan 2014; Storer et al. 2019; Lee et al. 2020), the uncertainty in diagnosing turbulence by using an ensemble of turbulence diagnostics for a single deterministic weather state (Kim et al. 2018), or both (Storer et al. 2020). Gill and Buchanan (2014) used ensemble weather forecasts from the Met Office Global and Regional Ensemble Prediction Systems (MOGREPS) to compute probabilities of turbulence events and showed that probabilistic forecasts based on the NWP ensembles improve the overall prediction skill, in terms of the area under the relative (or receiver) operating characteristic (ROC) curve (AUC), compared to existing operational deterministic forecasts. Storer et al. (2019) also confirmed the benefits of probabilistic forecasts based on the MOGREPS and the ECMWF Ensemble Prediction System (EPS), and showed further enhancement in skill by combining the two ensemble forecast models.
Such a multi-model-based turbulence forecasting system was also developed by Lee et al. (2020) using seven global operational deterministic NWP models available from The International Grand Global Ensemble (TIGGE) (Swinbank et al. 2016) database. Their multi-model-based probabilistic forecasts, which use a single turbulence diagnostic, were shown to have a small ensemble spread, as indicated by the relatively flat reliability line for intermediate probabilities (from 30% to 90%) in comparison to the perfect reliability line (Lee et al. 2020). They suggested that this small spread is due at least in part to the small number of ensemble members used, i.e., seven NWP models, and in part to the fact that for the short-term forecasts relevant for aviation purposes (anywhere from 1 or 2 h to about 36-h lead) the NWP models have simply not had enough time to diverge significantly. Also, with these ensemble-model and multimodel approaches the computational costs increase in proportion to the number of weather forecasts used, i.e., the number of ensemble members or the number of NWP models.
Conversely, the multidiagnostic approach proposed by Kim et al. (2018) is distinct from the ensemble-model and multimodel approaches in that it uses an ensemble of individual turbulence diagnostics derived from a single deterministic weather state to compute probabilities. In their comparisons using an earlier version of the NCEP GFS model, the AUC of the probabilistic forecasts was comparable to the skill of the deterministic forecasts, suggesting that better performance may be obtained by combining the multidiagnostic approach with NWP model ensembles.
The objective of this study is to compare various single-NWP-model-based probabilistic approaches to producing turbulence forecasts, including a multi-diagnostic ensemble, a time-lagged ensemble, a forecast-model ensemble, and combined time-lagged multi-diagnostic ensemble and forecast-model multi-diagnostic ensemble approaches. Only upper-level turbulence forecasts—flight levels (FL) ≥ 20 000 ft—are considered, and the focus is on evaluating “clear-air” turbulence (CAT) and mountain wave turbulence (MWT) performance. The results of this comparative study are expected to inform decisions on the best approaches (in terms of forecast skill and cost effectiveness) for producing next-generation operational probabilistic turbulence forecasts. To the best of our knowledge, these approaches have not previously been compared directly to each other in a way that allows recommendations on which are the most suitable for operational implementation. Section 2 describes the NWP dataset, the turbulence forecasting system, and the probabilistic forecast approaches used in this study. Section 3 presents the comparison results based on case studies and statistical evaluations. The summary and conclusions follow in the final section.
2. Methods
The procedure for producing automated aviation turbulence forecasts consists of three components: 1) retrieving the appropriate meteorological variables from NWP model output for use as input into the turbulence forecasting algorithms, 2) computing one or more turbulence diagnostics derived from the NWP meteorological variables, and 3) combining the diagnostics and NWP meteorological variables to derive the final deterministic and/or probabilistic turbulence forecasts. In this study, an experimental NWP model dataset was created to enable a direct and fair comparison of different probabilistic forecasting approaches within a single NWP model framework. The Graphical Turbulence Guidance (GTG) system (Sharman et al. 2006; Sharman and Pearson 2017) was used to calculate multiple turbulence diagnostics and to generate the final deterministic and probabilistic turbulence forecasts. The following subsections provide overviews of the experimental NWP dataset, the GTG system, and the probabilistic forecasting approaches.
a. An experimental NWP dataset: Pseudo–global ensemble forecast system (GEFS)
An experimental NWP dataset, hereafter referred to as the “pseudo-GEFS,” was created and used as input into GTG. The pseudo-GEFS consists of 21-member ensemble forecasts at 12-, 18-, 24-, 30-, and 36-h lead times at a 13-km horizontal grid spacing on 50 vertical sigma levels with a model top at 10 hPa. This dataset was generated through dynamical downscaling of the publicly available GEFS forecast dataset from the NOAA National Centers for Environmental Information (NCEI) (https://www.ncei.noaa.gov/products/weather-climate-models/global-ensemble-forecast), which is provided at 1° horizontal resolution on 26 pressure levels from 1000 to 10 hPa and consists of 21 NWP ensemble members—1 control (unperturbed) and 20 perturbation forecasts. The GEFS ensemble is configured by perturbing initial conditions and employing stochastic physics schemes (Zhou et al. 2017), and the pseudo-GEFS is designed to mimic these characteristics of the GEFS ensemble.
For the dynamical downscaling, the Weather Research and Forecasting (WRF) Model version 4.2.1 was configured over the contiguous United States (CONUS) domain and run at a 13-km horizontal grid spacing. The WRF Model and simulation setup are described in the appendix. The 1° GEFS 0-h forecasts were used to initialize the pseudo-GEFS forecasts, and the 1° GEFS forecasts provided the lateral boundary condition (LBC) forcing, as illustrated in Fig. 1. The 21-member pseudo-GEFS forecasts were initialized at 0600 UTC each day over a 1-yr period between 21 May 2019 and 20 May 2020, so that the 12-h forecasts are valid at 1800 UTC.
The 13-km horizontal grid spacing used for the pseudo-GEFS is comparable to those currently employed in operational global deterministic forecasts—e.g., 13 km for the NCEP GFS (NCEP 2022), 10 km for the UKMO UM (Walters et al. 2019), and 9 km for the ECMWF HRES (Haiden et al. 2021)—but is somewhat finer than those currently used in operational global ensemble forecasts—e.g., 25 km for the NCEP GEFS and 20 km for the UKMO MOGREPS and ECMWF EPS. The vertical resolution is also similar to that of the “raw” GEFS forecasts (∼50–54 hybrid levels below 10 hPa), i.e., before the data were interpolated to constant pressure levels. This downscaled GEFS ensemble output is needed because high-resolution input (close to operational NWP resolution) with all of the variables required by GTG is not available from the publicly available 1° GEFS data. The downscaling thus provides an experimental ensemble NWP system that mimics the operational NWP ensemble forecasting method. The pseudo-GEFS dataset covers the 345 days within the 1-yr period between 21 May 2019 and 20 May 2020 on which the full 21-member GEFS ensemble forecasts are available from the NOAA NCEI archive.
We begin by examining the ensemble spreads for two particular cases of observed widespread CAT. Figure 2 shows the observations—in situ eddy dissipation rate (EDR) reports and pilot reports (PIREPs)—associated with the two cases. The 3 December 2019 case (Fig. 2a) was studied in detail by Trier et al. (2022) and was shown to be mainly associated with strong jet stream shears (CAT) but also with some embedded MWT over Colorado. The 11 January 2020 case (Fig. 2b) was associated with MWT over the Colorado Rockies, CAT over Illinois, and convectively induced turbulence (CIT) over the Mississippi Valley.
The GEFS and the pseudo-GEFS forecasts at different forecast lead times are compared in terms of the ensemble spread of 250-hPa geopotential height and wind speed for these two cases in Fig. 3. For the 3 December 2019 case (Fig. 3a), the pseudo-GEFS ensemble compares well with the GEFS ensemble in terms of the location and magnitude of the ensemble spread and the increase of the spread with forecast lead time, while providing the needed meteorological variables at higher horizontal and vertical resolution for GTG. For the 11 January 2020 case (Fig. 3b), the ensemble spread of the pseudo-GEFS compares well with that of the GEFS, except for the large spread of wind speed extending from Mississippi all the way to Michigan and Wisconsin. This large spread in the pseudo-GEFS coexists with squall lines that developed at the time, as indicated by lightning observations (Fig. 2b); these are partially resolved in the 13-km pseudo-GEFS but remain unresolved in the 1° GEFS. Because this study concentrates on the prediction of CAT and MWT, the similarity between the pseudo-GEFS and GEFS over the nonconvective areas supports the use of the pseudo-GEFS as a substitute for the operational GEFS.
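Throughout this study, ensemble spread is shown as a scalar field for each forecast variable. A minimal Python sketch of this computation follows, assuming (since the precise formula is not spelled out here) that spread is the sample standard deviation across ensemble members:

```python
import numpy as np

def ensemble_spread(field_members):
    """Ensemble spread as the across-member sample standard deviation.

    field_members : array of shape (n_members, ny, nx), e.g., 250-hPa
                    geopotential height or wind speed from the 21
                    pseudo-GEFS members at one valid time.
    """
    return np.std(field_members, axis=0, ddof=1)
```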
To show the effects of the increased horizontal and vertical resolutions, vertical profiles of the gradient Richardson number (Ri = N²/S²) derived from the publicly available, vertically interpolated 1° GEFS and the 13-km pseudo-GEFS 12-h forecasts are compared in Fig. 4, together with its component terms: the vertical wind shear S, where S² = (∂u/∂z)² + (∂υ/∂z)², u and υ are the zonal and meridional wind components, and z is height; and the square of the Brunt–Väisälä frequency, N² = (g/θ)∂θ/∂z, where θ is the potential temperature of dry air. S, N², and Ri were calculated at every vertical level of the pseudo-GEFS and GEFS forecasts using simple centered differences. These forecast soundings are compared against high vertical resolution radiosonde data (HVRRD) observations at the Grand Junction station in Colorado (39.12°N, 108.53°W). The location of the Grand Junction station, indicated in Fig. 2, is close to where a number of moderate and/or severe turbulence reports were recorded. The raw HVRRD soundings at a 1-s sampling frequency are available from the NOAA NCEI (https://www.ncei.noaa.gov/data/us-radiosonde-bufr/), and we used soundings at a 5-m grid spacing interpolated from the 1-s raw soundings (as in Ko et al. 2019). Since no sounding observations are available at 1800 UTC for either case, we used the 1200 UTC soundings for the comparison. The HVRRD soundings are displayed every 50 m for clarity; however, no interpolation or smoothing was applied in the calculations of S, N², and Ri.
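For concreteness, the following Python sketch illustrates the centered-difference computation of S, N², and Ri from a discrete sounding; the function interface is illustrative and not taken from the GTG code.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m s^-2)

def richardson_profile(z, u, v, theta):
    """Compute S, N^2, and Ri on the interior levels of a sounding
    using simple centered differences, as described in the text.

    z     : height (m), monotonically increasing
    u, v  : zonal and meridional wind components (m s^-1)
    theta : potential temperature of dry air (K)
    """
    dz = z[2:] - z[:-2]                    # centered vertical spacing
    dudz = (u[2:] - u[:-2]) / dz           # du/dz
    dvdz = (v[2:] - v[:-2]) / dz           # dv/dz
    dthdz = (theta[2:] - theta[:-2]) / dz  # dtheta/dz

    S2 = dudz**2 + dvdz**2                 # squared vertical wind shear
    N2 = (G / theta[1:-1]) * dthdz         # Brunt-Vaisala frequency squared
    Ri = N2 / S2                           # gradient Richardson number
    return np.sqrt(S2), N2, Ri
```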
The 3 December 2019 case (Figs. 4a–c) shows that the pseudo-GEFS forecast has larger shears, and consequently smaller Ri, at upper levels (e.g., 11–13 km MSL), similar to the HVRRD observations, creating more favorable conditions for turbulence generation than the publicly available 1° GEFS. The 11 January 2020 case (Figs. 4d–f) also shows that the pseudo-GEFS forecast predicts larger shears and smaller Ri in the lower stratosphere. The coarse, vertically interpolated, publicly available 1° GEFS forecasts (vertical grid spacings are indicated by the dots in the figure) predict Ri larger than 10 at 11–13 km MSL for the December case and above ∼8 km MSL for the January case. For both cases, the general vertical structure of the computed quantities for the most part follows the HVRRD data, but with much smoother distributions.
b. The GTG system and deterministic forecasts
Table 1. The list of the 11 CAT and 8 MWT diagnostics used. The variable ds (m s⁻¹) is a near-surface MWT diagnostic: ds = 0 if the model terrain height h < 500 m or the terrain height gradient < 6.0 m km⁻¹; otherwise, ds is the vertical wind speed maximum in the lowest 1500 m.
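The decision rule for ds in the caption above can be written compactly. The Python sketch below is illustrative only; in particular, treating the vertical wind speed maximum as a magnitude is our assumption.

```python
import numpy as np

def ds_diagnostic(h, grad_h, w_column, z_agl):
    """Near-surface MWT diagnostic ds (m s^-1) per the Table 1 caption:
    ds = 0 if terrain height h < 500 m or terrain gradient < 6.0 m km^-1;
    otherwise ds is the vertical wind speed maximum in the lowest 1500 m.

    h        : model terrain height at the grid point (m)
    grad_h   : terrain height gradient magnitude (m km^-1)
    w_column : vertical wind speed profile at the grid point (m s^-1)
    z_agl    : heights above ground of the profile levels (m)
    """
    if h < 500.0 or grad_h < 6.0:
        return 0.0
    low = z_agl <= 1500.0                     # levels in the lowest 1500 m
    return float(np.max(np.abs(w_column[low])))  # assumes magnitude is intended
```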
c. Probabilistic forecasts
Probabilistic turbulence forecasts are generated by calculating the percentage of ensemble members that fall within predefined EDR ranges: e.g., “light” (0.15 ≤ EDR < 0.22), “moderate” (0.22 ≤ EDR < 0.34), and “severe or greater (SOG)” (EDR ≥ 0.34) turbulence for midsized aircraft, the same ranges as used in Sharman and Pearson (2017) based on the guidance provided by Sharman et al. (2014). Herein, we tested three stand-alone methods for producing probabilistic turbulence forecasts—a multi-diagnostic ensemble (MDE) approach, a time-lagged NWP ensemble (TLE) approach, and a forecast-model ensemble (FME) (or ensemble NWP model) approach—and two combined methods—a time-lagged multi-diagnostic ensemble (TMDE) approach and a forecast-model multi-diagnostic ensemble (FMDE) approach. Table 2 summarizes the five probabilistic forecast methods tested in this study, with details as follows.
Table 2. Summary of the probabilistic forecast methods used in this study: the number of weather forecasts N_Wx (identical to the number of GTG forecast runs), the number of turbulence diagnostics N_D (= N_CAT + N_MWT), the type of ensemble member (individual diagnostics vs GTGMAX), the total number of ensemble members used to compute probability N_ENS, and approximate computing runtimes using the runtime of MDE-19D as a baseline. N_ENS = N_Wx × N_D if individual turbulence diagnostics are used as ensemble members; N_ENS = N_Wx if the final GTG products, GTGMAX, are used as ensemble members.
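The probability computation common to all five methods then reduces to counting ensemble members within an EDR range. A minimal Python sketch follows; the array layout is an illustrative assumption.

```python
import numpy as np

# EDR category bounds for midsized aircraft (Sharman and Pearson 2017)
LIGHT, MODERATE, SEVERE = 0.15, 0.22, 0.34

def turbulence_probability(edr_members, lower, upper=np.inf):
    """Probability (%) that EDR falls in [lower, upper) at each grid
    point, computed as the percentage of ensemble members in the range.

    edr_members : array of shape (n_ens, ny, nx) stacking all ensemble
                  members (N_ENS = N_Wx * N_D for diagnostic members,
                  or N_ENS = N_Wx when GTGMAX fields are the members).
    """
    in_range = (edr_members >= lower) & (edr_members < upper)
    return 100.0 * in_range.mean(axis=0)

# e.g., probability of severe-or-greater (SOG) turbulence:
# p_sog = turbulence_probability(edr_members, SEVERE)
```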
1) Multi-diagnostic ensemble (MDE) approach
An example of three representative ensemble members of the MDE method is shown in Figs. 5a–c; these are derived from the control (unperturbed) NWP member of the pseudo-GEFS 12-h forecasts valid at 1800 UTC 3 December 2019. Ellrod3 (Table 1) captures the moderate and severe turbulence events over eastern Colorado, as well as a cluster of light and moderate turbulence events along the Nevada–Utah border, but misses the light turbulence events near the Arizona–New Mexico border (Fig. 5a). FTH/Ri (Table 1) predicts the light turbulence events near the Arizona–New Mexico border, but does not capture the turbulence events in Florida (Fig. 5b). VARE/Ri (Table 1) tends to predict larger areas of turbulence; this leads to the prediction of the light turbulence events near the Arizona–New Mexico border and over Florida, but also to a large number of false alarms over the southeastern United States (Fig. 5c). These differences among the turbulence diagnostics illustrate the uncertainty that may be associated with diagnosing turbulence.
For this comparative study, two different sets of diagnostics were tested. One uses 19 diagnostics (11 CAT and 8 MWT, as listed in Table 1) and the other uses all 77 diagnostics (62 CAT and 15 MWT) available in the GTG version used in this study; these are hereafter referred to as MDE-19D and MDE-77D, respectively (Table 2). All diagnostics used have been mapped to EDR values using the method outlined in Sharman and Pearson (2017).
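Conceptually, that remapping assumes EDR is approximately lognormally distributed and fits ln(EDR) as a linear function of the logarithm of the raw diagnostic value; the fit coefficients come from climatologies not reproduced here. A sketch under those assumptions:

```python
import numpy as np

def map_to_edr(raw_diag, a, b):
    """Map a raw diagnostic value D to an EDR estimate via the lognormal
    remapping ln(EDR) = a + b*ln(D) of Sharman and Pearson (2017).

    a, b : climatological fit coefficients (not given here).
    D is clipped to a small positive floor before taking the log.
    """
    d = np.maximum(np.asarray(raw_diag, dtype=float), 1e-10)
    return np.exp(a + b * np.log(d))
```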
2) Time-lagged ensemble (TLE) approach
An example of selected ensemble members of the TLE method is shown in Figs. 5d–f. Overall, there is more agreement among the ensemble members of TLE than among those of MDE. The light turbulence events over Arizona/New Mexico and Florida, as well as the intensity of the moderate and severe turbulence events over Colorado, are better captured by the 12- and 24-h forecasts (i.e., shorter lead times) than by the 36-h forecasts.
3) Forecast-model ensemble (FME) (or ensemble NWP model) approach
The FME approach represents the uncertainty in the NWP input into GTG due to NWP initial conditions and model formulations by using the full 21 weather forecast ensemble members of the pseudo-GEFS, which is designed to mimic the characteristics of the GEFS ensemble (section 2a and Fig. 3). The FME approach uses GTGMAX from the ensemble NWP members to compute probability, as in the TLE approach [Eq. (5)], with N_Wx = 21. Operationally, the TLE approach may have an advantage over the FME approach since it is (or could be) based on higher-resolution deterministic forecasts.
4) Combined approaches
3. Assessments of the different probabilistic forecast approaches
The characteristics of ensembles from the different probabilistic forecast approaches and the resultant probabilistic turbulence forecasts are investigated based on two cases in section 3a. Statistical evaluations based on probabilistic forecasts over a 1-yr period follow in section 3b.
a. Case studies: Characteristics of different approaches
Ensemble spreads of the 12-h turbulence forecasts (EDR) are compared in Fig. 6 for the 3 December 2019 case, together with the corresponding deterministic EDR forecast derived from the control (unperturbed) NWP member of the pseudo-GEFS weather forecasts. The 3 December 2019 case was dominated by CAT along a jet, with the jet core at approximately 300 hPa exceeding 65 m s⁻¹ over southern Colorado (not shown). The deterministic forecast overlaid with in situ EDR reports shows moderate and severe turbulence reports over eastern Colorado, as well as a cluster of light and moderate turbulence reports along the Nevada–Utah border (Fig. 6h). Some MWT reports were also recorded over the Colorado Rockies (Trier et al. 2022).
Increasing the number of diagnostics in the MDE method further increases the peak values of the ensemble spread, e.g., over Colorado and Montana and over the Pacific Ocean (cf. Figs. 6a,b). Overall, the spatial distributions and magnitudes of the ensemble spread of the two combined methods—TMDE-5TL×19D and FMDE-21EM×19D (Figs. 6e,f)—are similar to those of MDE-19D alone (Fig. 6a), which uses the same number of turbulence diagnostics, except that the areas of small ensemble spread of approximately 0.04–0.06 m²/³ s⁻¹ are more widely distributed in the combined methods. Increasing the number of diagnostics in the combined TMDE method further smooths these areas of small ensemble spread while increasing the peak values of the spread (cf. Figs. 6e,g).
Ensemble spreads for the 11 January 2020 case are compared in Fig. 8. This case has a cluster of moderate and severe turbulence reports associated with MWT over the Colorado Rockies at altitudes around FL340. A number of moderate and severe turbulence reports were also recorded at lower altitudes over the Mississippi Valley in association with strong convection, as well as with CAT over Illinois (Fig. 2b). Overall, the characteristic features of the different probabilistic approaches found for the 3 December 2019 case hold, including the larger spreads of MDE, TMDE, and FMDE (cf. Figs. 6 and 8). Over the Colorado Rockies, the spread of the MWT diagnostics makes a relatively larger contribution to the dispersion of the MDE method, while the spread of the CAT diagnostics is larger elsewhere (cf. Figs. 7c,d). The ensemble spreads of the TLE and FME methods over the mountainous region are even smaller in this January case than in the 3 December case. Compared to the synoptic-scale jet streams that dominate the December case, mountain waves are less resolvable in the 13-km pseudo-GEFS NWP model. This difference in the scales of the primary turbulence generation mechanisms between the two cases may contribute to the less dispersive NWP ensembles—i.e., TLE and FME—over the MWT region in the January case.
The results of the 12-h probabilistic forecasts of light, moderate, and SOG turbulence at FL370 for the 3 December 2019 case are displayed in Figs. 9–11, together with the corresponding 12-h deterministic forecasts. For the prediction of light turbulence (Fig. 9), MDE-19D shows a relatively wide dispersion of intermediate probabilities ranging between 20% and 80% (Fig. 9a), indicative of a relatively large ensemble spread in comparison to the TLE and FME forecasts (Figs. 9c,d). With the larger number of diagnostics, MDE-77D predicts light turbulence over larger areas at lower probabilities, rarely exceeding 60% (Fig. 9b). Such a large ensemble spread may yield a high probability of detection for low percentage thresholds (e.g., 10%), but if the spread is too large the number of false alarms may also increase. The light turbulence events over southern Arizona–New Mexico and Florida, which are captured by MDE-77D but missed by MDE-19D, are good examples of the characteristics of a large ensemble spread.
The TLE- and FME-based forecasts show probabilities exceeding 80% in most regions where light turbulence is predicted (Figs. 9c,d), indicative of a relatively small ensemble spread (i.e., a large degree of agreement among ensemble members). Such a small spread may reduce the number of false alarms for low probability thresholds but also increase the number of missed events, and its performance generally depends strongly on the performance of the underlying NWP model. Also, when the TLE and FME methods capture a turbulence event, they tend to capture it at higher probabilities (i.e., with higher confidence). These characteristics of the TLE- and FME-based forecasts are better seen in the moderate and SOG turbulence forecasts (Figs. 10 and 11). TLE-5TL and FME-21EM predict moderate turbulence over Colorado at higher probabilities than the other methods, while missing the moderate events along the Nevada–Utah border and predicting the SOG events over Colorado at probabilities lower than 10% (Figs. 10c,d and 11c,d).
Combining the MDE method with the other two—i.e., TMDE and FMDE—leads to more widely spread, lower probabilities compared to TLE and FME, as expected from the ensemble spread (Figs. 9e–g, 10e–g, and 11e–g). Similar to the MDE forecasts, the TMDE- and FMDE-based forecasts capture the moderate events along the Nevada–Utah border and the SOG events over Colorado, which are missed in the TLE- and FME-based forecasts. The differences between the combined approaches and the stand-alone MDE approach are not noticeable in this case. The probabilistic forecast results for the 11 January 2020 case are presented in Fig. 12, confirming the characteristics of the different approaches: the severe turbulence events over Colorado are predicted by the MDE approach at low probabilities (10%–20%) but entirely missed by the TLE and FME approaches, while the light and moderate turbulence events over Colorado are predicted at higher probabilities (with higher confidence) by the TLE and FME approaches.
b. Statistical evaluations
Table 3. The Brier skill score (BSS) of 12-h probabilistic forecasts of light-or-greater (LOG) and moderate-or-greater (MOG) turbulence events, derived from 345 cases over a 1-yr period from 21 May 2019 to 20 May 2020. The BSS is calculated using the 12-h deterministic forecasts as the reference predictions. The best scores among the stand-alone methods and among all methods are highlighted in italics and in bold, respectively.
Among the three stand-alone approaches, the MDE method that uses all 77 diagnostics exhibits the highest BSS for LOG and MOG turbulence events. Noteworthy is the improvement of the MDE-based probabilistic forecasts attained by increasing the number of diagnostics from 19 to 77, which improves statistical performance by incorporating a larger variety of turbulence generation mechanisms: the BSS increases from 0.353 to 0.479 for LOG and from 0.157 to 0.359 for MOG. Increasing the number of diagnostics in the MDE framework does not require any additional GTG runs (i.e., N_Wx = 1), so this performance gain is achievable with only a small increase in computational cost: a single MDE-77D forecast run takes only 1.2–1.3 times longer than a single MDE-19D forecast, whereas for the TLE or FME approaches the computational time increases linearly with the number of ensemble members used (Table 2). Among the three stand-alone approaches, MDE-19D shows the second highest BSS for LOG turbulence, and TLE-5TL for MOG turbulence. Between the two NWP ensemble approaches, the TLE approach outperforms the FME approach. This is at least in part due to the slightly larger ensemble spread of TLE than of FME where turbulence events are reported (Figs. 6c,d and 8c,d).
Combining MDE with either TLE or FME also improves the BSS: the skill score improves from 0.353 to 0.437 (TMDE-5TL×19D) or 0.409 (FMDE-21EM×19D) for LOG turbulence, and from 0.157 to 0.343 (TMDE-5TL×19D) or 0.273 (FMDE-21EM×19D) for MOG turbulence. The performance gain from combining MDE with TLE is comparable to that from increasing the number of diagnostics in the MDE approach, but comes at the expense of additional computational costs proportional to the number of weather forecasts (Table 2). Increasing the number of diagnostics from 19 to 77 in the combined TMDE framework further improves the skill score; in terms of the BSS, the largest improvement over the reference (deterministic) forecast is obtained when using the TMDE approach with 77 diagnostics.
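For reference, the BSS values quoted above follow from the standard Brier score. The sketch below treats the deterministic reference forecast as a 0/1 probability (event predicted or not), which is our interpretation of how the reference enters Table 3:

```python
import numpy as np

def brier_score(p, o):
    """Brier score: mean squared difference between the forecast
    probability p (0-1) and the binary observation o (1 if observed)."""
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(p_prob, p_det, o):
    """BSS of the probabilistic forecast relative to the deterministic
    reference forecast p_det, treated as a 0/1 'probability'."""
    return 1.0 - brier_score(p_prob, o) / brier_score(p_det, o)
```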
Table 4. The 2 × 2 contingency table.
The slopes of the ROC curves presented in Fig. 13 are estimated as the ratio of the change in the probability of detection (POD) to the change in the false alarm rate (FAR), centered at FAR ∼ 0.2 (0.05) where the TLE and FME ROC curves for LOG (MOG) bend. A larger slope indicates better prediction skill, i.e., a larger increase in POD for the same increase in FAR over the practical regions of interest (low FAR and high POD). Similar reasoning using a partial ROC was used by Gill (2016). For LOG (MOG) turbulence, FME-21EM shows the steepest slopes for FAR < 0.2 (FAR < 0.05), followed by TLE-5TL and TMDE-5TL×77D; MDE-19D shows the least steep slope. Increasing the number of diagnostics improves the MDE-based probabilistic forecasts (i.e., the slopes of the ROC curves are steeper for MDE-77D), and combining MDE with either TLE or FME also leads to steeper slopes and therefore better prediction skill. This is consistent with the findings based on the BSS.
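The POD and FAR entering the ROC curves are derived from the 2 × 2 contingency table by sweeping the probability threshold. A minimal sketch follows; computing FAR as the false alarm rate (probability of false detection) is the usual ROC convention assumed here:

```python
import numpy as np

def pod_far(p_forecast, observed, threshold):
    """POD and false alarm rate for one probability threshold, from the
    2 x 2 contingency table (assumes at least one event and one non-event)."""
    yes_fcst = p_forecast >= threshold
    hits   = np.sum(yes_fcst & observed)
    misses = np.sum(~yes_fcst & observed)
    fas    = np.sum(yes_fcst & ~observed)
    cns    = np.sum(~yes_fcst & ~observed)
    pod = hits / (hits + misses)
    far = fas / (fas + cns)          # false alarm rate (POFD)
    return pod, far

def roc_curve(p_forecast, observed, thresholds=np.arange(0.0, 1.01, 0.05)):
    """Sweep probability thresholds to trace the ROC curve; the local
    slope dPOD/dFAR near, e.g., FAR ~ 0.2 can then be estimated by
    finite differences between neighboring points."""
    pts = [pod_far(p_forecast, observed, t) for t in thresholds]
    pod, far = map(np.array, zip(*pts))
    return far, pod
```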
Histograms of the predicted probabilities corresponding to in situ EDR reports and PIREPs, and the reliability diagrams derived from those histograms, are presented in Figs. 14 and 15, respectively. As was seen in the case studies, the MDE method predicts intermediate probabilities more frequently than the TLE and FME methods and 0% probability less often. It rarely predicts 100% probability, as expected from the large disagreement among the diagnostics (Figs. 5a–c). The TLE and FME methods, on the other hand, predict both 0% and 100% probabilities more frequently than the other methods, indicative of a small ensemble spread. Combining MDE with either TLE or FME decreases the frequency of 0% probability while increasing the frequency of low probabilities between 0% and 20%. The relative frequency of forecasted events (i.e., the sum of frequencies for probability > 0% in the histograms) is much higher than the relative frequency of observed turbulence events with respect to the total number of observations (2.43% and 0.63% for LOG and MOG turbulence events, respectively). This indicates an overall overprediction of events by all methods, if not calibrated, as presented in Fig. 15. This overprediction of turbulence events by probabilistic forecasts was also found in previous studies (e.g., Gill and Buchanan 2014; Storer et al. 2019, 2020; Lee et al. 2020). It is due in part to a suspected bias in the observations toward nonturbulent events created when pilots try to avoid regions of known turbulence (Sharman et al. 2014; Lee et al. 2020). Lee (2021) demonstrated that the reliability curves can be improved by using subsets of in situ EDR reports randomly sampled so that the ratio of turbulence events to the total number of in situ reports is similar to that of the forecasts.
To address the overprediction of the probabilistic forecasts, which is obvious in the reliability curves (Fig. 15), recalibration can be introduced following Storer et al. (2019, 2020). Multiplying the forecast probabilities of LOG and MOG turbulence events by calibration constants of 1/6 and 1/12, respectively, results in the reliability diagrams plotted in the insets of Fig. 15. The calibration constants depend on the turbulence threshold (i.e., LOG or MOG) and the underlying NWP model; for example, Storer et al. (2020) used 1/50 for the UKMO MOGREPS and 1/40 for the ECMWF EPS. For LOG turbulence (Fig. 15a), the MDE, TMDE, and FMDE methods show similar slopes that parallel the perfect reliability line (solid diagonal line) for calibrated forecast probabilities up to ∼3%. For calibrated probabilities above ∼4%, the TMDE-5TL×77D forecasts follow the perfect reliability line with only slight overprediction. The combined forecasts that use the selected diagnostics (TMDE-5TL×19D and FMDE-21EM×19D) and the MDE forecasts deviate further from the perfect reliability line (i.e., overforecast). The TLE and FME forecasts show relatively flat lines, and it is hard to improve the reliability of these two methods by modulating the calibration constant: decreasing the constant exacerbates the underprediction at low probabilities, while increasing it results in overprediction at all probabilities.
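A sketch of this recalibration and of the binning used to build a reliability diagram is given below; the bin count and binning scheme are illustrative assumptions, while the calibration constants are those quoted above:

```python
import numpy as np

CALIB = {"LOG": 1.0 / 6.0, "MOG": 1.0 / 12.0}   # constants used here

def reliability_points(p_forecast, observed, category="LOG", n_bins=10):
    """Bin calibrated forecast probabilities and return, for each bin,
    the mean forecast probability and the observed event frequency;
    plotting one against the other gives the reliability diagram."""
    p_cal = CALIB[category] * np.asarray(p_forecast, dtype=float)
    obs = np.asarray(observed, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_cal, edges) - 1, 0, n_bins - 1)
    mean_p, obs_freq = [], []
    for b in range(n_bins):
        sel = idx == b
        if np.any(sel):
            mean_p.append(p_cal[sel].mean())   # mean calibrated probability
            obs_freq.append(obs[sel].mean())   # observed relative frequency
    return np.array(mean_p), np.array(obs_freq)
```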
For MOG turbulence (Fig. 15b), MDE-77D and TMDE (TMDE-5TL×19D and TMDE-5TL×77D) show slopes closer to the perfect reliability line than the other forecasts, but underforecast for calibrated probabilities above ∼6%–7%. The TLE and FME forecasts show relatively flat lines, as for LOG turbulence. The improvement in statistical performance achieved by combining the NWP-based TLE or FME methods with the MDE method stems mainly from their different ensemble-spread characteristics: the larger spread of the MDE approach leads to a wider dispersion of intermediate probabilities, while the smaller spreads of TLE and FME result in probabilities at the extremes.
4. Summary and conclusions
In this study, we implemented various probabilistic approaches to aviation turbulence forecasting and compared their characteristic features and prediction skill to identify the most suitable approach for operational implementation of probabilistic turbulence forecasts. These included the MDE method, which represents the uncertainty in turbulence diagnostics; the TLE and FME methods, which represent the uncertainty in weather forecasts; and the combined TMDE and FMDE methods, which account for both types of uncertainty. Two case studies were conducted to compare the characteristic features of the different approaches, focusing on the ensemble spread of the forecasted EDR and the resulting probabilistic forecasts. Large ensemble spreads were found over the areas of large (deterministically predicted and observed) EDR for all methods, and the ensemble spreads can be comparable to or even larger than the intensity of MOG turbulence. This indicates large uncertainties in predicting aviation turbulence events, supporting the necessity of probabilistic forecasts. The stand-alone and combined multi-diagnostic approaches—MDE, TMDE, and FMDE—all exhibited larger spreads than either of the NWP ensemble approaches tested—TLE and FME. The larger spreads allowed a higher probability of detection for low percentage thresholds at the cost of increased false alarms. The small spreads of TLE and FME resulted in either hits with higher probability or missed events, depending strongly on the performance of the underlying NWP model. Operationally, the TLE approach may have an advantage over the FME approach since it is (or could be) based on higher-resolution deterministic forecasts.
Each approach has its own advantages and disadvantages; hence, statistical evaluations against in situ EDR reports and PIREPs were performed to quantify the trade-offs. The three evaluation metrics considered—the Brier skill score, the ROC curve, and the reliability diagram—all indicate superior skill for the multi-diagnostic approach MDE compared to the NWP-model-based ensemble approaches, TLE and FME, especially when 77 turbulence diagnostics were used. The multi-diagnostic approach should therefore be preferred operationally for describing the uncertainty of turbulence forecasts, even when a relatively small number of turbulence diagnostics (19 in this case) is used. Also, the computational resources required are lower for the MDE approach than for the TLE or FME approaches (Table 2). Combining multiple diagnostics with either NWP time-lagged ensembles or NWP forecast ensembles can further improve prediction skill if sufficient computational resources are available.
For the pseudo-GEFS NWP dataset tested in this study, representing the uncertainty in diagnosing turbulence was found to be more important, i.e., to provide higher forecast skill, than representing the uncertainty in the NWP. This is because the pseudo-GEFS has limited spread, as can be inferred from the spread of the GEFS (Zhou et al. 2017) from which it was generated. In other words, the pseudo-GEFS ensemble spread is not sufficiently large to represent the NWP forecast error, and this limitation carries over to the ensemble spread of the turbulence forecasts (Figs. 6 and 8). Given the limited NWP spread, representing the uncertainty in the turbulence diagnostics plays a more important role than representing the uncertainty in the NWP for the NWP ensemble model tested in this study.
Finally, it should be noted that these results are based on the GEFS NWP ensemble model; if other NWP ensemble models with larger or smaller spreads than the GEFS are considered, the results may change. Increasing the number of ensemble members of the time-lagged or forecast-model approaches can further increase the ensemble spread of these methods. The spreads of these methods can also be sensitive to NWP model resolution, considering that the degree to which moist convection and mesoscale processes are resolved depends on resolution. However, it is likely that the MDE approach will still have larger spreads than these other ensembles. Implementing recently developed CIT diagnostics (e.g., Kim et al. 2021) into the MDE framework could help improve the performance of the probabilistic forecasts. For the GEFS NWP ensemble model tested in this study, probabilistic forecast skill was shown to improve with increasing spread of the NWP/diagnostic ensembles. The majority of previous studies on probabilistic forecasts of aviation turbulence also pointed out the lack of spread in their ensembles (e.g., Kim et al. 2018; Lee et al. 2020), and improved forecast skill can be expected by increasing the spread. However, if NWP models with smaller forecast errors (forecast uncertainty) are considered, the results may change, as such models need smaller spreads to represent their forecast uncertainty.
Acknowledgments.
This research is in response to requirements and funding by the Federal Aviation Administration (FAA). The views expressed are those of the authors and do not necessarily represent the official policy or position of the FAA. The National Center for Atmospheric Research is sponsored by the National Science Foundation. We thank Teddie Keller (NCAR) for providing the turbulence and lightning observation plots used in Fig. 2, and Han-Chang Ko and Hye-Yeong Chun (Yonsei University) for their help with the HVRRD data processing. Matthias Steiner and Ken Stone (NCAR) and Jung-Hoon Kim (Seoul National University) are acknowledged for informative discussions that benefited this work. Junkyung Kay (NCAR) and Jiwoo Lee (LLNL) are acknowledged for discussions that helped with the pseudo-GEFS model setup. Peer reviews provided by Joshua Scheck (NOAA AWC) and two anonymous reviewers helped improve and clarify the manuscript.
Data availability statement.
All simulations presented in the case studies will be made available from the corresponding author upon request. For the simulations used for statistical evaluations, relevant post-processed output including EDR reports and matching forecasts will be made available from the corresponding author upon request. The PIREPs used here can be obtained from NOAA’s Family of Services (https://weather.gov/noaaport/), and the in situ EDR data are available through NOAA’s MADIS data (https://madis-data.noaa.gov/madisPublic1/data/archive). The raw HVRRD soundings at a 1-s sampling frequency are available from the NOAA NCEI (https://www.ncei.noaa.gov/data/us-radiosonde-bufr/).
APPENDIX
Pseudo-GEFS Model Setup
The pseudo-GEFS ensemble forecasts were produced by dynamically downscaling the publicly available 1° GEFS ensemble forecasts. For the dynamical downscaling, the Weather Research and Forecasting (WRF) Model version 4.2.1 was configured over the contiguous United States (CONUS) domain (Fig. A1) on a Lambert conformal projection and run at a 13-km horizontal grid spacing on 50 vertical sigma levels with a model top at 10 hPa.
The physics parameterizations include the Thompson microphysics scheme (Thompson et al. 2008), the Grell–Freitas convective parameterization (Grell and Freitas 2014), the Rapid Radiative Transfer Model for General Circulation Models (RRTMG) longwave and shortwave radiation schemes (Iacono et al. 2008), the Mellor–Yamada–Nakanishi–Niino (MYNN) level-2.5 planetary boundary layer and surface layer schemes (Nakanishi and Niino 2009), and the unified Noah land surface model (Tewari et al. 2004). The model resolution and physics parameterization schemes used for the pseudo-GEFS are similar to those of the NOAA RAP NWP model.
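For reference, the scheme choices above correspond to a set of &physics namelist options, expressed here as a Python dict; the numeric option codes assume the standard WRF v4 mappings and are not taken from the simulation files themselves.

```python
# Hypothetical &physics namelist entries consistent with the setup above,
# using standard WRF v4 option codes (an assumption, not the actual files).
physics_namelist = {
    "mp_physics": 8,          # Thompson microphysics
    "cu_physics": 3,          # Grell-Freitas convective scheme
    "ra_lw_physics": 4,       # RRTMG longwave radiation
    "ra_sw_physics": 4,       # RRTMG shortwave radiation
    "bl_pbl_physics": 5,      # MYNN level-2.5 PBL
    "sf_sfclay_physics": 5,   # MYNN surface layer
    "sf_surface_physics": 2,  # unified Noah land surface model
}
```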
REFERENCES
Bauer, P., A. Thorpe, and G. Brunet, 2015: The quiet revolution of numerical weather prediction. Nature, 525, 47–55, https://doi.org/10.1038/nature14956.
Ben Bouallègue, Z., and D. S. Richardson, 2021: On the ROC area of ensemble forecasts for rare events. Wea. Forecasting, 37, 787–796, https://doi.org/10.1175/WAF-D-21-0195.1.
Benjamin, S. G., and Coauthors, 2016: A North American hourly assimilation and model forecast cycle: The Rapid Refresh. Mon. Wea. Rev., 144, 1669–1694, https://doi.org/10.1175/MWR-D-15-0242.1.
Buizza, R., P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Wea. Rev., 133, 1076–1097, https://doi.org/10.1175/MWR2905.1.
Bush, M., and Coauthors, 2020: The first Met Office Unified Model–JULES regional atmosphere and land configuration, RAL1. Geosci. Model Dev., 13, 1999–2029, https://doi.org/10.5194/gmd-13-1999-2020.
Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3–18, https://doi.org/10.1002/met.52.
Ellrod, G. P., and D. I. Knapp, 1992: An objective clear-air turbulence forecasting technique: Verification and operational use. Wea. Forecasting, 7, 150–165, https://doi.org/10.1175/1520-0434(1992)007<0150:AOCATF>2.0.CO;2.
Ellrod, G. P., and J. A. Knox, 2010: Improvements to an operational clear-air turbulence diagnostic index by addition of a divergence trend term. Wea. Forecasting, 25, 789–798, https://doi.org/10.1175/2009WAF2222290.1.
Gill, P. G., 2016: Aviation turbulence forecast verification. Aviation Turbulence: Processes, Detection, Prediction, R. Sharman and T. Lane, Eds., Springer, 261–283, https://doi.org/10.1007/978-3-319-23630-8_13.
Gill, P. G., and P. Buchanan, 2014: An ensemble based turbulence forecasting system. Meteor. Appl., 21, 12–19, https://doi.org/10.1002/met.1373.
Grell, G. A., and S. R. Freitas, 2014: A scale and aerosol aware stochastic convective parameterization for weather and air quality modeling. Atmos. Chem. Phys., 14, 5233–5250, https://doi.org/10.5194/acp-14-5233-2014.
Haiden, T., M. Janousek, F. Vitart, Z. Ben-Bouallegue, L. Ferranti, C. Prates, and D. Richardson, 2021: Evaluation of ECMWF forecasts, including the 2020 upgrade. ECMWF Tech. Memo. 880, 54 pp., https://www.ecmwf.int/en/elibrary/19879-evaluation-ecmwf-forecasts-including-2020-upgrade.
Iacono, M. J., J. S. Delamere, E. J. Mlawer, M. W. Shephard, S. A. Clough, and W. D. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
Kim, J.-H., W. N. Chan, B. Sridhar, and R. Sharman, 2015: Combined winds and turbulence prediction system for automated air-traffic management applications. J. Appl. Meteor. Climatol., 54, 766–784, https://doi.org/10.1175/JAMC-D-14-0216.1.
Kim, J.-H., R. Sharman, M. Strahan, J. W. Scheck, C. Bartholomew, J. C. H. Cheung, P. Buchanan, and N. Gait, 2018: Improvements in nonconvective aviation turbulence prediction for the World Area Forecast System. Bull. Amer. Meteor. Soc., 99, 2295–2311, https://doi.org/10.1175/BAMS-D-17-0117.1.
Kim, S.-H., H.-Y. Chun, D.-B. Lee, J.-H. Kim, and R. D. Sharman, 2021: Improving numerical weather prediction–based near-cloud aviation turbulence forecasts by diagnosing convective gravity wave breaking. Wea. Forecasting, 36, 1735–1757, https://doi.org/10.1175/WAF-D-20-0213.1.
Ko, H.-C., H.-Y. Chun, R. Wilson, and M. A. Geller, 2019: Characteristics of atmospheric turbulence retrieved from high vertical‐resolution radiosonde data in the United States. J. Geophys. Res. Atmos., 124, 7553–7579, https://doi.org/10.1029/2019JD030287.
Lee, D.-B., 2021: Development and evaluation of global Korean aviation turbulence forecast systems using the operational numerical weather prediction model outputs. Ph.D. dissertation, Yonsei University, 143 pp., https://dcollection.yonsei.ac.kr/common/orgView/000000539276.
Lee, D.-B., H.-Y. Chun, and J.-H. Kim, 2020: Evaluation of multimodel-based ensemble forecasts for clear-air turbulence. Wea. Forecasting, 35, 507–521, https://doi.org/10.1175/WAF-D-19-0155.1.
Nakanishi, M., and H. Niino, 2009: Development of an improved turbulence closure model for the atmospheric boundary layer. J. Meteor. Soc. Japan, 87, 895–912, https://doi.org/10.2151/jmsj.87.895.
NCEP, 2022: NCEP Service Change Notice SCN22-104 Updated: Upgrade NCEP Global Forecast System to v16.3.0: Effective November 29, 2022. NCEP, 5 pp., https://www.weather.gov/media/notification/pdf2/scn22-104_gfs.v16.3.0_aaa.pdf.
Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.
Sharman, R. D., and J. M. Pearson, 2017: Prediction of energy dissipation rates for aviation turbulence. Part I: Forecasting nonconvective turbulence. J. Appl. Meteor. Climatol., 56, 317–337, https://doi.org/10.1175/JAMC-D-16-0205.1.
Sharman, R. D., C. Tebaldi, G. Wiener, and J. Wolff, 2006: An integrated approach to mid- and upper-level turbulence forecasting. Wea. Forecasting, 21, 268–287, https://doi.org/10.1175/WAF924.1.
Sharman, R. D., L. B. Cornman, G. Meymaris, T. Farrar, and J. Pearson, 2014: Description and derived climatologies of automated in situ eddy-dissipation-rate reports of atmospheric turbulence. J. Appl. Meteor. Climatol., 53, 1416–1432, https://doi.org/10.1175/JAMC-D-13-0329.1.
Slingo, J., and T. Palmer, 2011: Uncertainty in weather and climate prediction. Philos. Trans. Roy. Soc., A369, 4751–4767, https://doi.org/10.1098/rsta.2011.0161.
Storer, L. N., P. G. Gill, and P. D. Williams, 2019: Multi-model ensemble predictions of aviation turbulence. Meteor. Appl., 26, 416–428, https://doi.org/10.1002/met.1772.
Storer, L. N., P. G. Gill, and P. D. Williams, 2020: Multi-diagnostic multi-model ensemble forecasts of aviation turbulence. Meteor. Appl., 27, e1885, https://doi.org/10.1002/met.1885.
Swinbank, R., and Coauthors, 2016: The TIGGE project and its achievements. Bull. Amer. Meteor. Soc., 97, 49–67, https://doi.org/10.1175/BAMS-D-13-00191.1.
Tewari, M., and Coauthors, 2004: Implementation and verification of the unified Noah land surface model in the WRF model. 20th Conf. on Weather Analysis and Forecasting/16th Conf. on Numerical Weather Prediction, Seattle, WA, Amer. Meteor. Soc., 14.2a, https://ams.confex.com/ams/84Annual/techprogram/paper_69061.htm.
Thompson, G., P. R. Field, R. M. Rasmussen, and W. D. Hall, 2008: Explicit forecasts of winter precipitation using an improved bulk microphysics scheme. Part II: Implementation of a new snow parameterization. Mon. Wea. Rev., 136, 5095–5115, https://doi.org/10.1175/2008MWR2387.1.
Thompson, G., M. K. Politovich, and R. M. Rasmussen, 2017: A numerical weather model’s ability to predict characteristics of aircraft icing environments. Wea. Forecasting, 32, 207–221, https://doi.org/10.1175/WAF-D-16-0125.1.
Trier, S. B., R. D. Sharman, D. Munoz-Esparza, and T. L. Keller, 2022: Effects of distant organized convection on forecasts of widespread clear-air turbulence. Mon. Wea. Rev., 150, 2593–2615, https://doi.org/10.1175/MWR-D-22-0077.1.
Walters, D., and Coauthors, 2019: The Met Office Unified Model Global Atmosphere 7.0/7.1 and JULES Global Land 7.0 configurations. Geosci. Model Dev., 12, 1909–1963, https://doi.org/10.5194/gmd-12-1909-2019.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. International Geophysics Series, Vol. 100, Academic Press, 704 pp.
Xu, M., G. Thompson, D. R. Adriaansen, and S. D. Landolt, 2019: On the value of time-lag-ensemble averaging to improve numerical model predictions of aircraft icing conditions. Wea. Forecasting, 34, 507–519, https://doi.org/10.1175/WAF-D-18-0087.1.
Zängl, G., D. Reinert, P. Rípodas, and M. Baldauf, 2015: The ICON (ICOsahedral Non-hydrostatic) modelling framework of DWD and MPI-M: Description of the non-hydrostatic dynamical core. Quart. J. Roy. Meteor. Soc., 141, 563–579, https://doi.org/10.1002/qj.2378.
Zhou, X., Y. Zhu, D. Hou, Y. Luo, J. Peng, and R. Wobus, 2017: Performance of the new NCEP ensemble global forecast system in a parallel experiment. Wea. Forecasting, 32, 1989–2004, https://doi.org/10.1175/WAF-D-17-0023.1.