## Abstract

An automated procedure for forecasting mid- and upper-level turbulence that affects aircraft is described. This procedure, termed the Graphical Turbulence Guidance system, uses output from numerical weather prediction model forecasts to derive many turbulence diagnostics that are combined as a weighted sum with the relative weights computed to give best agreement with the most recent available turbulence observations (i.e., pilot reports of turbulence or PIREPs). This procedure minimizes forecast errors due to uncertainties in individual turbulence diagnostics and their thresholds. Thorough statistical verification studies have been performed that focused on the probabilities of correct detections of yes and no PIREPs by the forecast algorithm. Using these statistics as a guide, the authors have been able to intercompare individual diagnostic performance, and test various diagnostic threshold and weighting strategies. The overall performance of the turbulence forecast and the effect of these strategies on performance are described.

## 1. Introduction

Commercial, air taxi, and general aviation (GA) aircraft encounters with turbulence continue to be a source of occupant injuries, and in the case of GA, fatalities and loss of aircraft. Although the number of fatalities related to commercial airline turbulence encounters have been very low (only three in the last 10 yr), turbulence encounters do account for a significant percentage (about 65%) of all weather-related commercial aircraft incidents. The average number of air carrier turbulence-related injuries is about 45 per year according to the National Transportation Safety Board (NTSB), but these are of course only those cases that were reported to the NTSB. The actual number is probably much higher: one major carrier reported almost 400 injury-causing turbulence encounters over a 3-yr period; another estimated about 200 turbulence-related customer injury claims per year. Over the last 12 yr, the average number of moderate-or-greater and severe-or-greater encounters of turbulence actually reported and recorded amounts to over 63 000 and 5000 per year, respectively. Costs to the airlines are difficult to establish, but a vice president for one major air carrier, in a presentation delivered at the National Aeronautics and Space Administration–Federal Aviation Administration (NASA–FAA)-sponsored Aircraft Turbulence Accident Prevention First Users' and Technologists' Workshop in Hampton, Virginia, in 1998, estimated that it pays out “tens of millions per year“ for customer injuries, and loses about 7000 days in employee injury-related disabilities. The vast majority of air carrier turbulence incidents occur above 10 000 ft, where passengers and flight attendants are more likely to be unbuckled.

A large number of these turbulence encounters might be avoided if better turbulence forecast products were available to air traffic controllers, airline flight dispatchers, and flight crews. In fact, previous studies (e.g., Fahey 1993) have shown that for commercial air carriers, strategic planning to avoid turbulence encounters can lead to a reduction in cabin injuries and costs. However, current forecasting methods have not generally provided acceptably high detection rates and at the same time acceptably low false alarm rates to achieve significant reductions. The term “acceptable” does not have a universal quantitative definition, but the Turbulence Joint Safety Implementation Team, a team of representatives from the FAA, NASA, various federal laboratories, and end users, recommended probabilities of moderate or greater (MOG) turbulence detection should be >0.8 and probabilities of null detection should be >0.85 for turbulence forecasts to be most useful. These goals are currently not achievable by either automated or experienced human forecasters.

The turbulence forecasting difficulty is due in large part to the fact that, from the meteorological perspective, turbulence is a “microscale” phenomenon. In the atmosphere, turbulent “eddies” are contained in a spectrum of sizes, from 100s of kilometers down to centimeters. But aircraft bumpiness is most pronounced when the size of the turbulent eddies encountered is about the size of the aircraft; for commercial aircraft this would correspond to eddy dimensions of ∼100 m. It is impossible to directly and routinely forecast atmospheric motion at this scale, now or even in the foreseeable future. Fortunately, it appears that most of the energy associated with turbulent eddies on this scale cascade down from the larger scales of atmospheric motion [e.g., Dutton and Panofsky (1970) and more recently Tung and Orlando (2003) and Koshyk and Hamilton (2001)], and these larger scales may be resolved by current weather observation networks and numerical weather prediction (NWP) models. Assuming the large-scale forecasts are sufficiently accurate, the turbulence forecasting problem is then one of identifying large-scale features that are conducive to the formation of aircraft-scale eddies.

Empirically based linkages between large-scale atmospheric features (i.e., observable by routine meteorological observations and resolvable by NWP models) and aircraft-scale turbulence (i.e., forecasting “rules of thumb”) have been developed over the years by National Weather Service (NWS) and airline meteorological forecasters. The successful application of these rules, however, depends on the forecaster, and any perceived skill diminishes rapidly with forecast lead time. Because there is now a tremendous amount of meteorological data available to forecasters, more than can be digested in a reasonable length of time, automated turbulence forecasting tools could aid humans in making decisions about where to locate regions of potential turbulence that may be hazardous to aircraft.

To address the need for an automated turbulence forecasting tool, the Research Applications Laboratory at the National Center for Atmospheric Research (NCAR/RAL) and the National Oceanic and Atmospheric Administration's (NOAA) Earth System Research Laboratory/Global Systems Division (NOAA-Research-ESRL/GSD), under sponsorship from the FAA's Aviation Weather Research Program, have been developing and testing a completely automated turbulence forecasting system. This system was originally dubbed the Integrated Turbulence Forecasting Algorithm (ITFA; Sharman et al. 1999, 2002) and concentrated only on the prediction of clear-air turbulence (CAT) related to jet streams and fronts at upper-levels [flight pressure altitudes > 20 000 ft MSL or “flight levels”^{1} (FLs) > 200]. The term “integrated” was used to describe the blending of NWP model based turbulence diagnostics with available turbulence observations used to produce the forecasts. The ITFA system became operational for qualified meteorologists and dispatchers to use as a guide for making turbulence avoidance decisions in March 2003 and at that time was renamed the Graphical Turbulence Guidance (GTG) product. The two Gs emphasize the nature of the turbulence product: “graphical” because, as opposed to traditional AIRMET (Airmen's Meteorological Information) and SIGMET (Significant Meteorological Information) polygons, the output is provided as Web-based contours of turbulence potential, and “guidance” stresses that the output should be used as a decision support tool in addition to other information that may not be available to GTG (e.g., satellite imagery). The first generation GTG, or GTG1, provides gridded CAT forecasts stratified by flight levels with graphical displays of turbulence potential provided on NOAA's Aviation Digital Data Service (ADDS) Web site: (information online at http://adds.aviationweather.gov/turbulence). An example GTG1 image from the ADDS Web site is provided in Fig. 1.

This has been followed up by a second version, GTG2, which expands the capabilities of GTG1 by extending the turbulence forecasts down to FL100 and includes some diagnostics for mountain wave–related turbulence. The FL100–FL200 altitude band is especially significant for air taxis. Thus, in the new system, there are turbulence predictions at both upper (>FL200) and midlevels (FL100–FL200).

Both subjective and objective evaluations of the ITFA/GTG system based on comparisons with available turbulence reports (or pilot reports, PIREPs) have been an integral part of the algorithm's development since its inception. Independent *subjective* evaluations include those from the Aviation Weather Center (AWC) for the 2000–2003 winter seasons (Kelsch et al. 2004), Delta Airlines's Meteorology Department (winter 2001), United Airlines's Meteorology Department [winter 2002 and 2003; Kelsch et al. (2004)], Comair Dispatch (winter 2002), and the FAA's William J. Hughes Technical Center (winter 2000) severe case studies (Passetti et al. 2000; Weinrich and Sims 2002).

*Objective* evaluations based on comparisons with PIREPs have been on going by both the developers and an independent verification team composed of researchers from NOAA-Research-ESRL/GSD and NCAR/RAL. Results of evaluations from previous years can be found in Brown et al. (2000). Complete objective evaluations in the form of probability of detection (POD) statistics are also available on a daily basis on NOAA-Research-ESRL/GSD's Real-Time Verification System (RTVS) (see Mahoney et al. 2002 for a description) Web site (http://www-ad.fsl.noaa.gov/fvb/rtvs) since 1999.

This paper describes the current GTG algorithm (GTG2) and provides some statistical evaluations of its performance. The GTG methodology will be described in section 2. GTG performance derived from 1 yr (2003) of evaluations against turbulence PIREPs is presented in section 3. Development and tuning is an ongoing task, and current problem and work areas are outlined in section 4.

## 2. GTG procedure

The GTG process starts by automatically ingesting gridded NWP data, which are supposed to accurately represent the large-scale features of the atmosphere that may be related to aircraft-scale turbulence. In principle, any NWP model could be used, but the National Centers for Environmental Prediction's (NCEP's) Rapid Update Cycle (RUC-2) model was chosen because of the higher effective vertical resolution provided by the isentropic vertical coordinate system at upper levels in the model (Benjamin et al. 2004). The essence of the GTG forecasting method is to integrate a combination of many separate turbulence diagnostics, with each diagnostic weighted to get the best agreement with available observations (i.e., PIREPs). This idea of using a weighted combination of diagnostics to provide turbulence forecasts is not in itself a new one. For example, Dutton (1980) evaluated the performance of 11 diagnostics compared with pilot reports of CAT over the North Atlantic and parts of Europe. He found the weighted sum of the vertical and horizontal wind shears provided the best agreement with his observations. Also Clark et al. (1975) used a set of five weighted diagnostics, where the set used depended on elevation bands and the weights were determined by the best fit to data from several XB-70 stratospheric turbulence encounters over the western United States. Similar procedures have been used by Russian investigators. For example, Leshkevich (1988) used a weighted sum of 12 diagnostics, and Buldovskii et al. (1976) used a weighted combination of horizontal temperature gradient and vertical wind shear to predict CAT, again with the weights determined by best agreement to available observations. However, all of these studies were based on a limited set of observations and the weights determined by the best fit to this limited set. These weights, once established, are static; that is, they never change. The GTG procedure also obtains weights for a set of diagnostics based on the best fit to observations, when a sufficient number of PIREPs are available in real time, and are determined dynamically and updated with every RUC model update. Alternatively, a set of climatologically derived static weights can be used when the number of observations is insufficient for robust assessment of the dynamic weights. In particular, PIREPs undergo a strong diurnal period with considerably fewer at night, roughly 0200–1300 UTC (see Fig. 2), making it difficult to use the dynamic weighting method during those times. In the following, the GTG version that makes use of the dynamic weighting strategy will be referred to as GTGD, and the version that uses the climatological weights will be referred to as *GTGC*. The entire GTG process involves a six-step procedure described in the following sections.

### a. Step 1

In step 1 a set of *n* turbulence indices or diagnostics *D _{n}* (e.g., a local Richardson number) is computed from the native resolution NWP output at each grid point in the model domain at the current analysis time. Most of the current diagnostics used are intended to diagnose regions of high turbulence potential due to the presence of upper-level fronts and jet streams, but some are derived from turbulence theory and therefore should be valid for any turbulence source. The suite of diagnostics selected depends on the overall performance of each diagnostic. In addition, the set of diagnostics is selected to ensure that the indices appropriately represent the variety of atmospheric processes that may be contributing to the existing turbulence conditions (i.e., to ensure that the diagnostics are uncorrelated with each other). In general, the diagnostic performance is highly variable; see, for example, the box plots in Fig. 2 of Tebaldi et al. (2002). We have tried as many as 40 different turbulence diagnostics, but currently use a subset that has demonstrated minimum scatter and therefore the best overall performance. The algorithms in this subset are listed below with appropriate references, and the algorithmic expressions for these and others in the GTG suite are provided in appendix A. At upper levels, the following 10 algorithms are used:

Colson–Panofsky index (Colson and Panofsky 1965);

Richardson number (Ri; e.g., Endlich 1964; Kronebach 1964; Dutton and Panofsky 1970);

diagnostic turbulent kinetic energy (TKE) formulation (DTF3; Marroquin 1998);

frontogenesis function (isentropic coordinates; e.g., Bluestein 1993);

unbalanced flow diagnostic (Knox 1997; McCann 2001; Koch and Caracena 2002);

horizontal temperature gradient (Buldovskii et al. 1976);

Turbulence Index 1 (TI1; Ellrod and Knapp 1992);

North Carolina State University index (NCSU1; Kaplan et al. 2004);

structure function–derived eddy dissipation rate (EDR; Frehlich and Sharman 2004a; Lindborg 1999); and

structure function–derived sigma vertical velocity (SIGW; Frehlich and Sharman 2004b).

And at midlevels the nine algorithms used are

TI1 (Ellrod and Knapp 1992);

wind speed × horizontal deformation (Reap 1996);

Absolute value “inertial advection − centrifugal wind” (ABSIA; McCann 2001);

horizontal temperature gradient (Buldovskii et al. 1976);

wind speed (e.g., Endlich 1964);

NCSU1 (Kaplan et al. 2004);

structure function–derived EDR (Frehlich and Sharman 2004a; Lindborg 1999);

structure function–derived SIGW (Frehlich and Sharman 2004b); and

frontogenesis function (pressure coordinates) (e.g., Bluestein 1993).

### b. Step 2

In step 2 *D _{n}* is interpolated to common flight levels (in increments of 1000 ft) and mapped to a common turbulence intensity scale 0 ≤

*D**

_{n}≤1, where 0 corresponds to no turbulence (null) and 1 corresponds to extreme turbulence. This same scale is also used for PIREP intensities to allow quantitative comparisons. A required input for combining the various turbulence diagnostics is the set of threshold values that distinguish the null-light, light-moderate, moderate-severe, and severe-extreme turbulence categories. These thresholds are derived by comparing the PIREP values with the index values for many index–PIREP pairs (essentially a climatology), and computing the median index value corresponding to each turbulence intensity category. The median values of null, light, moderate, severe, and extreme are in turn associated with values of 0, 0.25, 0.5, 0.75, and 1.0, respectively, on the common turbulence intensity scale. The mapping process is performed using a piecewise linear function as shown schematically in Fig. 3. The breakpoints at which the derivative (slope) of the function changes are the thresholds and are given in appendix B for each index in the current GTG suite. Linear interpolation is performed within each range.

### c. Step 3

When using the dynamic weighting strategy, each diagnostic is compared with the available observations (PIREPs) within a time window (currently ±90 min) around the current NWP model time. For each altitude band of interest, a “score” is determined that measures the relative error between the turbulence intensity as predicted by each diagnostic and the available turbulence PIREPs. There are a number of options available for scoring, but based on previous verification studies of icing and turbulence (e.g., Brown et al. 1997, 2000; Tebaldi et al. 2002), a particularly robust method is to score using probabilities of detection. In this method a contingency table of observations (PIREPs) and forecasted turbulence index values is formed and a PODY, representing the probability of detection of a moderate-or-greater (MOG) event (“yes”), and a PODN, representing the probability of detection of a null or smooth event (“no”), are computed. As shown by Brown et al. (1997) and Brown and Young (2000), combinations of PODY and PODN are preferable to the use of the false alarm ratio in assessing statistical performance since they are less susceptible to the relative frequencies of yes and no PIREPs, that is, reporting biases.

Another important performance metric is the MOG-forecasted volume occupied by the diagnostic. Based on various atmospheric sampling programs, small MOG volumes are expected at any given time. An example is given in Fig. 4. This is a plot of the distribution of binned eddy dissipation rate (actually *ɛ*^{1/3}) automated measurements (Cornman et al. 1995) from about 85 United Airlines (UAL) B757 aircraft in cruise collected over a 3-month period. Note that 99.6% of the measurements fall in the first bin. Since this bin contains what corresponds to both null and light turbulence PIREPs, the fraction of the atmosphere at upper levels containing MOG turbulence should be 1% at most. These percentages are consistent with the results of Dutton (1980).

Therefore, an appropriate metric for evaluating the overall skill of each turbulence diagnostic should include PODY, PODN, and the fraction of the grid volume occupied by MOG turbulence forecasts compared with the computational grid volume *f*_{MOG}. Similar arguments were made by Brown et al. (1997) for evaluating icing forecasts. A simple scoring function that involves these quantities is

where the constants *C* and *p* can be used to adjust the relative importance of *f*_{MOG}. In the results to be presented here, *C* = 1 and *p* = 0.25. In the numerator of Eq. (1), the true skill statistic TSS = PODY + PODN − 1, and ranges in value from −1 to +1, with +1 indicating the diagnostic has perfect skill at classifying yes and no PIREPs, and values less than 0 indicating negative skill (0 represents no skill). The TSS is computed by comparing the maximum value of the turbulence diagnostic values *D**_{n} at the four grid points surrounding each PIREP. Using the average of the four points produces similar results. The thresholds used in evaluating the TSS are given in appendix B (see Table B1). However, not all PIREPs available during the scoring time period are actually used. The goal of the current GTG is to predict turbulence not related to convection. Therefore, a turbulence PIREP that is convectively related should not be used for scoring. Since a PIREP does not specify the source of turbulence, convectively induced turbulence encounters must be identified by indirect means. This is accomplished by comparing each PIREP location with cloud-to-ground (CG) lightning flash data from the National Lightning Detection Network. If the PIREP is within a certain radial distance of a CG lightning flash and is within a certain time window (currently 50 km and 40 min), it is discarded and not used for scoring.

### d. Step 4

Once the scoring function *ϕ _{n}* has been computed from step 3, a set of weights

*W*can be formed from each diagnostic

_{n}*n:*

Because the number of PIREPs available at any given time is still a small number (for a three hour time window this may be anywhere from less than 10 to as many as 250 depending on the situation), it is not possible to form weights regionally or vertically, so the weights assigned are constant throughout the domain of interest.

### e. Step 5

The (dynamically or statically) weighted diagnostics are combined to form the GTG combination. At this point all the diagnostics have been computed and remapped to the 0–1 scale at the initialization time at each grid point (*i, j, k*). The GTG diagnostic is now computed for the initialization time as the simple weighted sum of the diagnostics:

### f. Step 6

The GTG forecasts are formed. When using the dynamic weighting method, an assumption must be made about the temporal variability of the weights. The simplest assumption is that they are constant in time throughout the maximum forecast duration (12 h for RUC-2). This essentially is equivalent to assuming that the state of the atmosphere responsible for turbulence generation processes persists for this length of time. This may be on the long side; Vinnichenko et al. (1980) estimate that the probability of a turbulence patch persisting for longer than 6 h is no more than 50%. In the current GTGD implementation though, the weights are reevaluated every major RUC update cycle, that is, every 3 h, so that the change in large-scale behavior is at least captured at these intervals. In addition, advection (and perhaps growth and decay) of turbulence generation zones should be handled to some extent by the NWP forecast.

Assuming the weights are constant for a given model run, the GTGD forecast procedure is simple. The NWP gridded data are obtained for all *forecast* times (3, 6, 9, and 12 h), the same set of diagnostics is computed from the forecast fields, and the weights derived from the *analysis* time are used in Eq. (3) to get the GTG forecast.

Currently, the entire cycle repeats with every major NWP update; for RUC-2 this is every 3 h. The process is performed separately for midlevels and upper levels, and the results are merged at the FL200 boundary. This was necessary since it was found that

the best set of turbulence diagnostics was not the same at midlevels and upper levels,

their optimum threshold values were not the same, and

the number of available PIREPs was substantially less at midlevels than at upper levels.

When using climatological weights within *GTGC*, steps 3 and 4 are bypassed and a constant set of weights are used for the analysis time and all forecast times. The procedure for deriving the climatological or default weights is given in the next section and their current values are provided in appendix B.

## 3. GTG2 performance statistics

In this section various performance statistics are provided for GTG and its component diagnostics. The computed performance is based on inputs from the RUC-2 NWP model, in its 2003 configuration (roughly 20-km horizontal resolution and 50 vertical levels). Alternative methods of combining the individual diagnostics are also examined, and sensitivity studies are provided to estimate the effects of uncertainties in the verification data sources. Here, performance statistics are derived from comparisons with the only routine observations of aircraft-scale atmospheric turbulence available: verbal reports of turbulence by pilots, or PIREPs. This verification source, although not ideal, has been used in other aviation weather verification studies of both icing (e.g., Brown et al. 1997) and turbulence (Brown et al. 2000; Tebaldi et al. 2002; Lee et al. 2003).

### a. PIREPs

PIREPs provide information about a turbulence encounter (time, latitude, longitude, altitude, and severity). A fairly comprehensive review of PIREP reporting and dissemination practices is given in Schwartz (1996). The PIREPs used in this study are received through the NWS's Family of Services communication gateway (see http://www.nws.noaa.gov/datamgmt/fos/fospage.html for a description) and augmented by proprietary reports from two major airlines. The raw textual PIREPs are decoded automatically with some data checking to remove reports with one or more invalid parameters and to discard duplicates. These duplicate and bad data records were a very small percentage of the total (<1%). It should be noted that, because of pilot reporting and recording biases, the distribution of PIREP intensities is not what would be expected from Fig. 4; we find the reported intensity distribution is about 55% null, 27% light, 17% moderate, and 1% severe based on 12 yr of turbulence PIREPs between FL100 and FL450.

As noted by Schwartz (1996), PIREP inaccuracies in time, position, and intensity can lead to some uncertainty in the verification results that will be quantified here to the extent possible when presenting the final results. Further, it must be realized that a report is based on a turbulence experience along a flight path, that is, along a line, but is usually reported as a single point value. If the model-derived diagnostics are supposed to be a grid volume average, the correspondence to a line is not necessarily direct.

Position and time uncertainties in PIREPs were evaluated by comparing the locations of positive UAL in situ turbulence measurements using the automated Cornman et al. (1995) algorithm with PIREPS from the same aircraft. We found that, based on about 450 comparisons over a 4-month period, the median uncertainty was about 50 km horizontally, 200 s in time, and 70 m vertically. These time and vertical position differences are well within the windows used for verification. The horizontal position uncertainty of 50 km corresponds to about two to three RUC-2 grid points, and since the resulting turbulence forecast fields are fairly smooth both horizontally (cf. Fig. 1) and vertically, this uncertainty should have a negligible effect on the results.

To assess the uncertainty in PIREP turbulence intensities, the intensity values reported by two aircraft close together in space and time (using the 50-km uncertainty found above, at the same flight level, and within a time window of 600 s) were compared using 12 yr of PIREP data at the flight levels assessed in this study (FL100–FL200 and FL200–FL450). For the upper-altitude band this provided about 8200 pairs for comparison, with the results that roughly 84%, 68%, 85%, and 86% agree on intensities of smooth, light, moderate, and severe, respectively, regardless of aircraft weight class. For the midlevel altitude band about 3600 pairs were compared giving percentage agreements of 64%, 46%, 75%, and 77%, respectively, for the major intensity categories. The larger discrepancies at midlevels are probably because of the greater mix of aircraft types at lower flight levels. The relative uncertainty in the light intensity values is understandable given the wide range of aircraft sizes and weights reporting, and it is for this reason they are not used for verification. However, at least statistically, turbulence PIREPS seem to have acceptable position and timing errors, and the null and MOG reports are very consistent. These numbers bolster our confidence in the use of PIREPs for verification, and provide a means for assessing the uncertainty in the results to be presented.

### b. GTG2 performance

In previous studies that used PIREPs for both icing (e.g., Brown et al. 1997) and turbulence (e.g., Brown et al. 2000; Tebaldi et al. 2002; Lee et al. 2003) verification, one metric used to evaluate performance was the area contained under the PODY–PODN curves, similar to receiver (or relative) operating characteristic (ROC) curves. In this procedure a set of thresholds is assumed for each diagnostic, and for each threshold the diagnostic performance based on comparisons with available turbulence PIREPs is evaluated by computing both a PODN and a PODY. These curves essentially measure the ability of a forecast algorithm to discriminate between yes and no observations. For small values of the chosen threshold, PODY will be high, near unity, while PODN will be low, near 0, and vice versa for large values of the chosen threshold. For the range of thresholds selected, higher combinations of PODY and PODN, and therefore larger areas under the PODY–PODN curves, imply greater skill in discriminating between null and MOG turbulence events. The area under the curve (or AUC) ranges from 0.5 for no skill to 1.0 for perfect skill. For a more complete discussion of the use of the AUC as a discrimination metric, see, for example, Hanley and McNeil (1982), Mason (1982), Kharin and Zwiers (2003), and Marzban (2004).

Samples of PODY–PODN statistical performance derived for GTG2 as described above are provided in Fig. 5 (for upper levels) and Fig. 6 (for midlevels). The curves are based on 1 yr (2003) of PIREP comparisons for 0-h analyses and 6-h forecasts. The 0-h results give some indication of the ability of the individual turbulence diagnostics and the GTG combinations to account for the observed turbulent state of the atmosphere. The 6-h forecast was chosen for assessment because it is an adequate lead time for route planning purposes for almost all continental U.S. (CONUS) flights. All 0-h analyses are taken at 1800 UTC with the GTGD weights formed based on the performance of the individual diagnostics computed from the RUC 1800 UTC analysis. The 6-h forecasts are derived from diagnostics computed from the 6-h RUC forecast, initialized at 1800 UTC, valid at 0000 UTC the next day, and with weights provided from the 1800 UTC analysis (GTGD). These times were chosen because both times correspond to daylight hours over the CONUS where air traffic density is sufficient to provide large numbers of PIREPs for both initialization and verification.

Figures 5 and 6 demonstrate that both the GTGD and *GTGC* combinations are superior to the individual diagnostics as measured by the ROC AUC metric. As expected, the AUC is higher for the analysis time (*GTGC* AUC is 0.878 for upper levels and 0.818 for midlevels) than for the 6-h forecasts (0.852 and 0.792, respectively). For both times the upper-level performance is slightly better than the midlevel performance. This is probably because of the fewer PIREPs available to fit at midlevels and to the experience derived from the GTG1 development, which has allowed formulation of better turbulence diagnostics for upper levels.

For comparison, the AIRMET performance is also shown. AIRMETs are the operational forecasts of turbulence produced by the AWC every 6 h and are valid for up to 6 h (refer to http://aviationweather.gov/exp/product_overlay/help/p-airmets.html for a description) but may be amended as needed between the standard issue times. They are textual products that describe as three-dimensional polygons the regions of forecasted turbulence. Since the polygons are necessarily relatively simple, and are assumed valid for the entire 6 h (although we did allow for amendments), the comparison here is not exact; nevertheless, AIRMETs are the current operational product, and some comparisons must be made to assess the ability of automated forecasts to provide benefit to the aviation community. The AIRMET performance (with amendments) was retrieved from the RTVS archives for the same time window (centered at 2100 UTC) as the model-based algorithms. They are competitive with some of the individual diagnostics, but are not as good as either of the GTG combinations by this measure. Comparisons with turbulence SIGMETs were not attempted because they are usually based on observations of severe turbulence, most of which tend to be related to convection.

For *GTGC*, consistent with Eq. (2), the static weights are proportional to the ROC AUC squared for each diagnostic. Note from Figs. 5 and 6 that the use of climatological weights degrades the performance only slightly for the analysis time, and is nearly identical in performance for the 6-h forecast.

An equivalent representation of the relative performance of the individual diagnostics and the GTGD combination is provided by plots of the probability density functions (PDFs) for both null and MOG turbulence encounters. These are shown in Fig. 7. A perfect diagnostic would have no overlap between the null and MOG PDF curves, and so the amount of overlap is a measure of the PODs. Qualitatively, there is substantial overlap for all indices, but the overlap is clearly minimized with the GTG combination, reinforcing the overall robust nature of the GTG approach.

Other performance measures are shown in Table 1, namely, the 6-h forecast average PODN, PODY, TSS, root-mean-square error (rmse) per PIREP, *f*_{MOG}, *ϕ*, and dynamic weight for each turbulence diagnostic; the GTG combinations; and AIRMETS. By almost any measure, the GTG combinations provide superior performance for both mid- and upper levels. From the average weights given in Table 1, the single best diagnostic is the frontogenesis function [Eq. (A9)] evaluated in isentropic coordinates at upper levels and evaluated on constant pressure surfaces [Eq. (A10)] at midlevels. Consistent with Figs. 5 and 6 the AIRMET performance has similar skill to some of the individual diagnostics in terms of both statistical error and fraction of MOG airspace forecasted.

### c. Sensitivity studies

As stated earlier there are reporting biases, as well as position, timing, and intensity errors associated with PIREPs, consequently the verification performance statistics are subject to some amount of uncertainty. To attempt to quantify this uncertainty, we have performed two sensitivity studies. The first study addresses the irregular nature of the PIREPs distribution (in frequency and location) by degrading the quality of the data distribution by randomly resampling only a fraction of the available PIREPs and using that fraction for verification. Specifically, from the full set of PIREP–GTG forecast data pairs, five subsets of 1/2, 1/3, 1/4, 1/5, and 1/6 of the available pairs were used for scoring. Two hundred subsets were used, and for each subset a ROC curve was computed. The solid curves in Fig. 8 show the derived envelope of “uncertainty” around the original GTG ROC 6-h forecast curves. The results plotted are for *GTGC*, but identical results were obtained with GTGD. The uncertainty bands are very tight for high levels (about ±2%), but are somewhat wider for midlevels (about ±6%) because of the smaller number of PIREPs available at midlevels. Note that the lower bound of the uncertainty curve still provides significantly better PODY–PODN performance than AIRMETs. A comparison with the best single index was also performed (not shown) by computing the difference between these curves for each subset of PIREPs at the highest point on the curves corresponding to the best TSS value, across the 200 randomized subsets. The 95% confidence interval for this difference (GTG-best index) is always positive for both PODN and PODY. These results are consistent with those of Kane and Brown (2000) using an earlier version of GTG.

The second study addresses the uncertainty in the reported intensity values of the PIREPs. This problem was discussed in section 3a, and there it was stated that, climatologically, the percentage agreement in intensities reported by nearby aircraft was somewhere between 70% and 80% on average. Based on these results, a “perturbation” experiment was performed by assuming that 25% of the verification PIREPS may be incorrectly reported by one full intensity category (e.g., from light to moderate). The sensitivity to PIREP intensity uncertainty was then assessed by either randomly increasing or decreasing by one intensity level 25% of the available verification PIREPs. The process was repeated 200 times for different subsets of the verification PIREPS.

The dashed curves in Fig. 9 show the results in terms of an envelope of “downgraded” ROC curves. Note that the degradation (from about −5% to −6%) in the upper-level ROC curves is greater than for the midlevel ROC curves (from about −1% to −3%). This may be due in part to the greater inherent uncertainty in PIREP intensities at midlevels noted above in section 3a, and in part by the finer GTG tuning achieved at upper levels compared with midlevels making the upper-level results more sensitive to degradation. Still the ROC curve for GTG significantly outperforms the unperturbed AIRMETS and the single indices' curves (not shown), as confirmed through the computation of the differences as described above in the resampling exercise.

This is a rather severe test; when the process was repeated by perturbing the PIREPs by only a one-half intensity category (e.g., from light to light-moderate) the resulting GTG ROC curves were almost indistinguishable from the original.

### d. Other combinatorial methods

Other methods for combining the indices were also tested. In particular, three well-established statistical models for forecasting a binary outcome (with null encounters of turbulence being 0s, and MOG encounters being 1s) were investigated and their predictive performance compared with the GTG method:

logistic regression,

tree classification, and

neural networks.

Logistic regression fits a linear combination of the diagnostics to the probability of a positive (i.e., MOG) outcome. For a description of the method, see, for example, McCullagh and Nelder (1989). The estimation is likelihood based, so that the optimal solution for the coefficients of the linear combination maximizes the probability of the training set of data, under the assumption of a binomial distribution; namely,

where *D _{n}* is the value of diagnostic

*n*matched to each PIREP and

*β*is the derived coefficient for

_{n}*D*.

_{n}Classification based on recursive binary partitions (tree classification) is another popular model for predictions (see, e.g., Breiman et al. 1984). Starting with the full set of observation–individual diagnostic “pairs,” the algorithm looks at every diagnostic in turn, in order to determine an optimal value for splitting the dataset. The value must be such that the two subsets of PIREPs thus created (PIREPs matched to values of the diagnostic less than the splitting threshold, and PIREPs matched to values of the diagnostic greater than the splitting threshold) are as homogeneous as possible within and as heterogeneous as possible between. Ideally, if a diagnostic were a perfect discriminator of null versus MOG turbulence encounters, a value of that diagnostic would exist that would separate the PIREP set into two groups perfectly homogeneous “within” (all nulls on one side, all MOGs on the other) and thus perfectly heterogeneous “between.” After the first split is performed, the algorithm is applied independently to the two groups of observations thus formed. Each split chooses a diagnostic and a value within the diagnostic range. The algorithm may separate the data perfectly by splitting the groups until only one observation is left in each of the “leaves” of the tree. However the splitting rules thus defined are bound to be too ad hoc with respect to the training set, and would certainly constitute a bad model for forecasting purposes. Thus, tree classification algorithms must be optimized in order to achieve a balance between fitting the training set of data and performing accurate prediction on an independent set.

Neural networks are flexible regression methods that substitute a nonlinear combination of diagnostics for the linear form assumed by logistic models. Similar to tree models, a balance has to be found between extremely flexible structures that are able to fit the training set almost perfectly, and a more parsimonious representation that has better generalization ability. For a more complete discussion, see, for example, Ripley (1996).

Each of the three statistical models was executed using the R, version 2.0 (see Ihaka and Gentleman 1996 for a description), statistical analysis package (routines glm, rpart, and nnet for logistic regression, decision trees, and neural nets, respectively). The performance for each is based on the same set of observations as used to obtain the GTG results. PIREP–diagnostic “pairs” from the analysis time were used to train the models, and then evaluated on the 6-h forecast. As the curves in Fig. 9a demonstrate, GTG does outperform the estimates derived from all three competing methods. This is probably because of the relative scarcity of the training data available, which is insufficient for robust estimations using highly multivariate models.

The performances become very close (almost undistinguishable for logistic and neural networks, still subpar for tree regression) as the training set is increased to include 14 days of PIREP–diagnostic pairs, as Fig. 9b shows. By increasing the number of cases in the training set, the estimates of the many parameters in the statistical models stabilize and are not so susceptible to small, nonrepresentative (at least in terms of the functional forms that are being fitted) groups of observations. Evidently, the functional form used in the GTG optimization procedure is simple enough, and tuned explicitly to the POD performance measure, to allow an effective estimation on the basis of a single day of training, thus offering an extremely manageable solution to the operational implementation of the algorithm. Obviously, it would be much more expensive from a storage and computational perspective if a longer set of training data had to be processed at every forecast issue time.

## 4. Summary and conclusions

In summary, the overall performance of GTG seems to be skillful enough to provide useful information to meteorologists and dispatchers for strategic planning for turbulence avoidance. In particular the following points have been shown.

The GTG combination provides superior PODY–PODN performance over that available by using a single turbulence diagnostic or by AIRMETS.

Although the dynamic weighting method gives the best nowcast performance, the use of a set of static climatologically derived weights provides forecast performance comparable to the dynamic weighting method, making it an attractive alternative for use in data-sparse (PIREPs) regions.

The simpler weighting strategies used within the GTG framework provide performance comparable to more complicated procedures, such as neural networks, and generally requires less “training.”

In this investigation, based on the average dynamic weights chosen by GTGD (see Table 1) the single best diagnostic at both midlevels and upper levels is the frontogenesis function.

At this point a few caveats should be stated. The results presented here are based on inputs from the RUC-2 NWP model, in its 2003 configuration (roughly 20-km horizontal resolution with 50 vertical levels). The use of a different NWP model and/or different vertical or horizontal resolutions may be expected to change some of these results, including the relative performance of the individual turbulence diagnostics and the GTG combinations. Further, the analyses are based on the performance derived from an entire year of data over the entire RUC-2 computational domain. Daily, seasonal, and regional dependencies would be expected in the performance statistics.

The ability to provide still more accurate aircraft-scale turbulence nowcasts and forecasts is hampered by several fundamental difficulties. First, the resolution of current NWP models (several 10s of km) is about two orders of magnitude too coarse to resolve aircraft-scale turbulence (roughly 100s of m). Therefore, aircraft-scale turbulence diagnoses and predictions must be based on resolvable- (by the NWP) scale features. However, and this is the second difficulty, the performance of turbulence diagnostics is hampered by our current lack of understanding of the linkage between NWP resolvable-scale features and aircraft-scale turbulence. An implicit assumption underlying the use of all these diagnostics is that turbulence-generating mechanisms have their origin at resolvable scales and that energy cascades down to aircraft scales, but it is unclear what the exact cascade mechanism is. Recent high-resolution simulations by Lane et al. (2004, 2005) indicate that the linkage is related at least in some cases to gravity wave production by features such as upper-level fronts and convection, and subsequent breakdown of the waves into turbulence. Third, even if it is true that aircraft-scale turbulence has its origins at the resolvable scales, the turbulence forecast system has all the inherent NWP errors associated with the resolvable scales. Fourth, it is not clear that the current suite of turbulence diagnostics is in fact capturing all the relevant information that the larger-scale representations can provide.

Then there is the difficult matter of model fitting and verification. The GTG system uses PIREPs for tuning and verification. But as shown, an individual PIREP is subject to spatial, temporal, and intensity misrepresentations. The quantitative automated in situ turbulence reporting system (Cornman et al. 1995, 2004) should eliminate most of the uncertainty associated with PIREPs but will still not alleviate the nighttime underreporting bias. The advantages of this data for GTG are obvious. First, the data will be more accurate than PIREPs, for both intensity and position. Second, the amount of data will be vastly increased, since the current plan is to relay turbulence every minute in cruise. This will provide a much more complete mapping of the turbulent state of the atmosphere (at least at upper levels), and will allow GTG to fit that state much more precisely than has been possible using the current set of scattered PIREPs. Just as the accuracy of upper-level winds in NWP models has increased with the use of Aircraft Communications Addressing and Reporting System wind data (e.g., Schwartz et al. 2000), the GTG upper-level forecasts should become more accurate with the ingest of in situ data.

Efforts continue to provide a better turbulence forecasting system through the following research areas.

Better diagnostics—this is a continued research area at major laboratories and universities. But any diagnostic must be judged by its overall performance, not just on a few select cases. In addition, information about when a particular diagnostic performs well and when it does not could be used in dynamically assigning its weight within the GTG framework. Hopkins (1977) and Lester (1994) describe synoptic conditions that are known to be conducive to CAT, and these could be developed into automated algorithms. Although some of the current diagnostics are independent of the source of turbulence (e.g., Ri), most are tuned for CAT associated with upper-level fronts and enhanced wind shears in the vicinity of jet streams. Diagnostics for other known sources of turbulence, for example, those related to deep convection, must be developed, tested, and implemented into future versions of GTG. Turbulence related to convection has been shown by Kaplan et al. (2005) to be coincident with some particularly severe turbulence encounters. Nevertheless, the current version of GTG does capture some turbulence cases related to convection, if the convection is associated with mid- to upper-level disturbances that may be identified by one or more of the current diagnostics.

“Local” fits—within the current GTG framework, the best fit of diagnostics is determined for the entire volume of atmosphere between the altitude bands of interest. Better fits are probably attainable in subvolumes that could be overlapped to give smooth transitions from one subvolume to another. Although the number of PIREPs available for regional or local fits is probably insufficient at the current time, the use of the turbulence in situ measurements may allow for local fits, both horizontally and vertically.

## Acknowledgments

We appreciate the careful readings and suggestions for improvement of an earlier version of the manuscript by Barbara Brown, Jennifer Abernethy, Teddie Keller, and Rod Frehlich. We also thank the three anonymous reviewers for their comments that helped improve the presentation of the material. We also thank Jennifer Mahoney, NOAA-Research-ESRL/GSD, for supplying the relevant AIRMET data. This research is in response to requirements and funding by the FAA. The views expressed are those of the authors and do not necessarily represent the official policy or position of the FAA.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

### APPENDIX A

#### GTG Turbulence Diagnostics

This appendix lists the current suite of turbulence diagnostic algorithms within GTG. Although all these are or have been computed and evaluated, only a (user selectable) subset is actually included in the GTG combination. Note that in some cases the constituent components of a diagnostic may themselves be used as a turbulence index.

##### Richardson number and its components

The Richardson number and its components are well-known turbulence diagnostics (e.g., Endlich 1964; Kronebach 1964; Dutton and Panofsky 1970, etc.). Theory and observations have shown that at least in some situations clear-air turbulence patches are produced by Kelvin–Helmholtz instabilities. This occurs when Ri becomes small. Therefore, theoretically, regions of small Ri should be favored regions of turbulence, where

with

and

Here, *θ* is potential temperature, *θ _{e}* is equivalent potential temperature,

*g*is the acceleration due to gravity,

*z*is the vertical direction, and

**v**is the horizontal wind vector with components

*u*and

*υ*in the east–west and north–south directions, respectively.

##### TKE

TKE formulations are based on the TKE balance equation, assuming horizontal homogeneity and stationarity. The Colson–Panofsky index (CP; Colson and Panofsky 1965) uses dimensional arguments in a stable atmosphere to estimate clear-air turbulence intensities as

where Λ is a length scale, taken as the local value of vertical grid increment Δ*z*, and Ri_{crit} is an empirical constant (≈0.5).

Laikhtman and Al'ter-Zalik (1966; also Vinnichenko et al. 1980) developed a similar equation but with a different prescription for the length scale:

where *C* is an adjustable constant and

where *α* = 1*/*Pr *= K _{H}/K_{M}* is taken as another adjustable constant, Pr is the turbulent Prandtl number, and

*K*and

_{H}*K*are the eddy diffusivities of heat and momentum, respectively.

_{M}Marroquin (1998) diagnostic TKE formulations (DTF) used *k*–*ɛ* closure equations (e.g., Stull 1988) and other simplifications to derive diagnostics for TKE and/or *ɛ*, giving, for example, for DTF3,

where *c*_{1} = 1.44, *c*_{2} = 1.0, and *c*_{3} = 1.92 (Stull 1988, p. 219), and *K _{M}* and Pr are taken as adjustable constants to get the best agreement with observations.

##### Eddy dissipation rates

Eddy dissipation rates are estimated from second-order structure functions (Frehlich and Sharman 2004a, b). The one-dimensional structure function of variable *q*(*x*) is defined as

where *s* is the separation distance and 〈〉 denotes an ensemble average. The structure functions of the velocity components parallel or normal to the displacement vector **s** = (*x*, *y*, *z*) can be related to turbulence intensity *ɛ* (for *q* = *u*, *υ*) or *σ _{w}*

^{2}(for

*q*=

*w*, the vertical velocity component) through

and

where *C _{q}*(

*s*) and

*C*(

_{w}*s*) take into account NWP model specific spatial filtering effects and

*D*

_{REF}is given by Lindborg (1999); for small separations

*s*it is proportional to

*s*

^{+2/3}. In the text and tables the relation (A7) to derive

*ɛ*

^{1/3}is indicated as EDR and relation (A8) to derive

*σ*is indicated as SIGW.

_{w}##### Frontogenesis function

Fronts contain regions of low Ri and therefore may be conducive to turbulence (e.g., Jacobi et al. 1996) and can also be a source of gravity waves that may be unstable (e.g., Lane et al. 2004). The definition of the frontogenesis function is (e.g., Bluestein 1993, p. 253)

where *D*/*Dt* is the Eulerian time derivative. This can be rewritten in two dimensions, using the thermal wind relation, as

Expanding on a constant-*θ* surface and invoking continuity gives

This is the form used in GTG at upper levels. Note that its formulation is based on an isentropic coordinate system (as used at upper levels in the RUC model). In that case an alternative is to use a constant pressure *p* surface formulation:

This is the form used in GTG at midlevels.

##### Richardson number tendency

The index, *dRi/dt* (Roach 1970; Keller 1990), is based on attempts by several investigators to forecast turbulence by using a time tendency equation for the Richardson number. The version used within GTG is based on a formulation of this equation in isentropic coordinates by Keller (1990), termed Specific CAT Risk (SCATR), given by

with

and

Brown's index (Brown 1973) is a simplification of the Ri tendency equation originally derived by Roach (1970). The simplifications involve use of the thermal wind relation, the gradient wind as an approximation to the horizontal wind, and some empiricism. Form (A13) is the simplified Ri tendency equation, while form (A14) is an extension to provide a measure of turbulence intensity:

and

where the shearing deformation *D*_{SH} = ∂*υ*/∂*x* + ∂*u*/∂*y*, the stretching deformation *D*_{ST} = ∂*u*/∂*x* − ∂*υ*/∂*y*, absolute vorticity *ζ _{a}* =

*ζ*+

*f*, with

*ζ*= ∂

*υ*/∂

*x*− ∂

*u*/∂

*y*and

*f*is the Coriolis frequency.

##### Ellrod indices

Ellrod indices (Ellrod and Knapp 1992) are derived from simplifications of the frontogenetic function. As such it depends mainly upon the magnitudes of the horizontal gradients of *θ* (proportional to *S _{V}* through the thermal wind relation) and the total deformation. The two variants developed were

and

where

##### Potential vorticity

Another index is the potential vorticity (PV) (Knox 2001) or horizontal gradient of PV (Shapiro 1978):

and

where

##### Clark's CAT algorithm

Clark's CAT (*CCAT*; Vogel and Sampson 1996) index has been used by U.S. Navy forecasters for at least two decades. It is defined as

where *T* is absolute temperature.

##### Curvature measures

Tight curvatures in upper-level troughs and ridges have been empirically related to turbulence outbreaks (e.g., Lester 1994; Hopkins 1977; Arakawa 1952; Stone 1966). A measure of curvature is relative vorticity *ζ*, with a maximum (positive value) in upper-level troughs and a minimum (negative value) in upper-level ridges. Therefore,

should be a measure of curvature in both troughs and ridges.

Another measure may be related to inertial instability in strongly anticyclonic flows (Arakawa 1952; Stone 1966; Knox 1997):

##### Horizontal temperature gradient

The horizontal temperature gradient is a measure of the deformation and also vertical wind shear from the thermal wind relation, and is routinely used by airline forecasters. It was also used in Buldovskii et al. (1976):

##### Wind-related indices

Besides the speed vertical shear (A3), the wind speed

may be related to turbulence. Also Endlich (1964) noted that jet stream turbulence tended to be associated with both high wind speeds and large directional wind shears; thus,

could be a turbulence diagnostic, where *ψ* is the wind direction.

##### Dutton's empirical index

Dutton's empirical index (Dutton 1980) is based on linear regression analyses of turbulence reports over the North Atlantic and northwest Europe during 1976 and various synoptic-scale turbulence indices:

where *S _{H}* is the horizontal wind shear,

and the constants were empirically determined by Dutton.

##### MOS CAT probability predictor indices

The following indices were found to be the most skillful subset of a suite of turbulence diagnostics used in NCEP's Nested Grid Model (NGM) model output statistics (MOS) probability forecasts (Reap 1996):

and

##### Unbalanced flow

There is some evidence that regions of strong imbalance may be related to turbulence aloft (e.g., Knox 1997; McCann 2001; Koch and Caracena 2002). The UBF diagnostic (Knox 1997, 2001; McCann 2001; O'Sullivan and Dunkerton 1995; Koch and Caracena 2002) formulation used within GTG was developed by Koch and Caracena (2002) and McCann (2001), and derives from the residual of the nonlinear balance equation

where Φ is geopotential, *J* is the Jacobian operator, and *β* is the Coriolis frequency gradient.

The Lagrangian Rossby number RoL is another unbalanced flow diagnostic originally suggested by Van Tuyl and Young (1982). Following the definition of Ro in Bluestein (1992, Eq. 4.1.93) and evaluating in isentropic coordinates gives

where *M* is the Montgomery streamfunction.

The hydrostatic NWP model derived or nonhydrostatic computed vertical velocity

and the horizontal divergence,

may also be measures of unbalanced flow.

Other unbalanced flow related diagnostics developed by McCann (2001) and used in a case study by Knox (2001) include

where

and

with

and where *K _{s}* is the streamline curvature.

##### NCSU1

The NCSU1 is described in Kaplan et al. (2004) and was developed from investigations of several severe turbulence encounters:

##### Negative vorticity advection

A rule-of-thumb forecasting approach used by the airlines is to look for regions of large negative vorticity advection (NVA) computed as follows (Bluestein 1992, p. 335):

### APPENDIX B

#### Thresholds Used with the 20-km RUC

Table B1 provides thresholds and default weights for the upper- and midlevel turbulence diagnostics used within the current version of GTG. The thresholds are determined from median values for 1 yr (2003) of 1800 UTC 6-h forecast (valid 0000 UTC) index–PIREP pairs for each of the five major turbulence categories. The default weights are derived from the area^{2} under the PODY–PODN 6-h forecast curves (Figs. 5 and 6), normalized so that the resultant sum is unity.

## Footnotes

*Corresponding author address:* Dr. Robert Sharman, Research Applications Laboratory, National Center for Atmospheric Research, P.O. Box 3000, Boulder, CO 80307-3000. Email: sharman@ucar.edu

* The National Center for Atmospheric Research is sponsored by the National Science Foundation.

^{1}

Flight levels are actually isobaric surfaces that correspond to a particular geopotential altitude according to the *U. S. Standard Atmosphere*.