Accurate cloud-ceiling-height forecasts derived from numerical weather prediction (NWP) model data are useful for aviation and other interests where low cloud ceilings have an impact on operations. A demonstration of the usefulness of data-mining methods in developing cloud-ceiling forecast algorithms from NWP model output is provided here. Rapid Update Cycle (RUC) 1-h forecast data were made available for nearly every hour in 2004. Various model variables were extracted from these data and stored in a database of hourly records for routine aviation weather report (METAR) station KJFK at John F. Kennedy International Airport along with other single-station locations. Using KJFK cloud-ceiling observations as ground truth, algorithms were derived for 1-, 3-, 6-, and 12-h forecasts through a data-mining process. Performance of these cloud-ceiling forecast algorithms, as evaluated through cross-validation testing, is compared with persistence and Global Forecast System (GFS) model output statistics (MOS) performance (6 and 12 h only) over the entire year. The 1-h algorithms were also compared with the RUC model cloud-ceiling (or cloud base) height translation algorithms. The cloud-ceiling algorithms developed through data mining outperformed these RUC model translation algorithms, showed slightly better skill and accuracy than persistence at 3 h, and outperformed persistence at 6 and 12 h. Comparisons to GFS MOS (which uses observations in addition to model data for algorithm derivation) at 6 h demonstrated similar performance between the two methods with the cloud-ceiling algorithm derived through data mining demonstrating more skill at 12 h.
1. Introduction
Numerical weather prediction (NWP) models provide guidance that gives operational forecasters useful insight into future atmospheric conditions. However, generating accurate cloud-ceiling forecasts from these models remains a difficult and complex problem. Accurate cloud-ceiling analyses and forecasts are important for aviation operations, both in terms of monetary costs and human safety. The Federal Aviation Administration’s (FAA’s) Aviation Weather Research Program (AWRP) is investigating this problem through a research and development effort of the National Ceiling and Visibility (NCV) Product Development Team (PDT). The forecast product being developed by the NCV PDT involves the integration of numerical models, guidance products, and observation-based methods (Herzegh et al. 2006).
Model output statistics (MOS) (Glahn and Lowery 1972) was one of the first methodologies employed to find a statistical relationship between NWP model output and cloud-ceiling height (Bocchieri and Glahn 1972; Bocchieri et al. 1974). MOS schemes are still applied to many of today’s numerical models. Along with NWP model data, observations are also used to develop MOS regression equations for ceiling and visibility, temperature and dewpoint, wind speed and direction, probability of precipitation, precipitation amount, and cloud cover. MOS techniques, applied to various NWP models, have also been used to develop specific algorithms to forecast the occurrence of Levante wind regimes during the warm season (Godfrey 1982), marine fog probability for the northern Pacific Ocean (Koziara et al. 1983), maximum and minimum temperatures for seven Australian cities (Woodcock 1984), and the probability of precipitation and rain amount in Australia (Tapp et al. 1986). More recently, Norquist (1999) presented a MOS approach to diagnose cloud characteristics, including cloud-base height, from a mesoscale NWP model. Weiss and Ghiradelli (2005) describe the redevelopment of the Localized Aviation MOS Program (LAMP). LAMP focuses on the predictions of probabilities of eight ceiling-height and five total-sky categories.
Model data and MOS schemes are not the only resources being used to predict cloud cover and ceiling conditions. Vislocky and Fritsch (1997) investigated an observation-based system for short-term ceiling and visibility prediction using a network of surface stations. Other observation-based ceiling estimation and prediction schemes recently developed include a statistical system for low-ceiling forecasts at San Francisco International Airport using surface and upper-air observations (Hilliker and Fritsch 1999), an algorithm that combines satellite-derived cloud classification data and surface observations to improve cloud-base height estimates (Forsythe et al. 2000), and the use of total-sky imagers for the retrieval of cloud-base heights (Kassianov et al. 2005). Leyton and Fritsch (2003, 2004) describe the utilization of high-density and high-frequency surface observations for short-term probabilistic forecasts of ceiling and visibility. A cloud-ceiling and visibility forecast system that applies fuzzy logic and case-based reasoning to historical data is presented in Hansen and Riordan (2003).
MOS and observation-based methods rely on observations as predictors within the algorithms or regression equations. An algorithm that can be applied to an NWP model grid directly (without using observations) would be beneficial, particularly when and/or where observations are not available. Other research and applications of NWP model data for specific weather phenomena include the use of pattern recognition in locating fronts in model output (Fine and Fraser 1990), model output calibration for temperature forecasts (Mao et al. 1999), the training of neural networks for temperature forecasts for 31 stations (Marzban 2003), maximum and minimum temperature forecasts at 12 locations in India based on the perfect prog method (Maini et al. 2003), the application of quantile regression for probabilistic precipitation forecasts (Bremnes 2004), and turbulence estimations (Frehlich and Sharman 2004).
With the work in Hansen and Riordan (2003) being one example, machine learning, data mining, and other artificial intelligence tools are being used more often in the diagnosis and forecasting of meteorological phenomena. Abdel-Aal and Elhadidy (1995) applied a machine-learning modeling tool for forecasting daily maximum temperatures in Dhahran, Saudi Arabia. Induction of rules has been applied to develop maritime fog forecasts (Tag and Peak 1996), and, as mentioned earlier, neural networks have been used for temperature forecasts (Marzban 2003). There has been much research and many algorithms developed in the area of cloud classification that included the use of various machine-learning tools. One of the more recent classification methodologies has involved the use of support vector machines (Lee et al. 2004). Classification trees and machine learning are used to develop rules for predicting the onset of Australian winter rainfall (Firth et al. 2005). Related to this effort, Bankert et al. (2004) demonstrated the use of data-mining methods in the diagnosis of cloud-ceiling height through satellite and numerical model data. Data-mining techniques have also been applied to the discovery of associations between drought and oceanic parameters (Tadesse et al. 2005).
Data mining is applied here to NWP model output to create a station-specific (John F. Kennedy International Airport, New York, New York, in this experiment) translation algorithm for the Rapid Update Cycle (RUC) model. The potential of this approach will be analyzed and future enhancements and application of this methodology to an entire model grid will be discussed.
2. Data and data mining
RUC is a regional mesoscale high-frequency data assimilation and short-range numerical prediction system (Benjamin et al. 2004a, b). RUC runs on a 1-h assimilation cycle to produce frequently updated 3D mesoscale analyses and short-range forecasts. The version of the RUC model that produced the data used here is RUC20. The RUC20 horizontal resolution is 20 km (301 × 225 grid points) and the model uses a hybrid isentropic-sigma coordinate (50 vertical levels) in which most of the atmosphere is resolved on isentropic surfaces except for layers near the ground where sigma coordinates are used. The RUC20 microphysics scheme (Brown et al. 2000) explicitly predicts mixing ratios of cloud water, rainwater, snow, cloud ice, and graupel. Ice particle number concentration is also predicted. The model also employs the Grell–Devenyi ensemble-based convective parameterization scheme (Grell and Devenyi 2001).
RUC data from the entire year of 2004 were made available for use in the development of single-station cloud-ceiling forecast algorithms. The hourly RUC model output was saved in a database for data-mining exploration. Using a bilinear interpolation of neighboring grid points, the values of RUC variables at various routine aviation weather report (METAR) locations compose the database. Developed algorithms, applied to RUC model output, will produce ceiling estimations at specific locations (or grid points). The cloud-ceiling height is defined here as the lowest level at which broken or overcast cloud cover exists. Ceilometer measurements provide the data for METAR cloud coverage (NWS 1998). KJFK is the METAR station at John F. Kennedy International Airport and was the focus of this study.
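The interpolation and ceiling definition described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the fractional grid-index convention, and the (cover, base) layer format are assumptions made for illustration. Converting a station's latitude and longitude to fractional RUC grid indices depends on the model map projection and is not shown.

```python
def bilinear_interpolate(field, x, y):
    """Interpolate a 2D gridded field to fractional grid coordinates.

    field[j][i] is the value at column i, row j. x and y are the
    station's fractional column and row indices (hypothetical convention).
    """
    i0, j0 = int(x), int(y)
    # Clamp so the 2 x 2 stencil of neighboring points stays inside the grid.
    i0 = min(max(i0, 0), len(field[0]) - 2)
    j0 = min(max(j0, 0), len(field) - 2)
    dx, dy = x - i0, y - j0
    # Interpolate along x on the two bracketing rows, then along y.
    lower = (1 - dx) * field[j0][i0] + dx * field[j0][i0 + 1]
    upper = (1 - dx) * field[j0 + 1][i0] + dx * field[j0 + 1][i0 + 1]
    return (1 - dy) * lower + dy * upper


def ceiling_height_ft(layers):
    """Return the lowest broken (BKN) or overcast (OVC) base, or None.

    layers is a list of (cover, base_ft) tuples taken from a METAR report.
    None means that no ceiling was reported.
    """
    bases = [base for cover, base in layers if cover in ("BKN", "OVC")]
    return min(bases) if bases else None
```

For example, a report with layers `[("FEW", 800), ("BKN", 2500), ("OVC", 6000)]` has a ceiling of 2500 ft under this definition, because the few-cloud layer at 800 ft does not constitute a ceiling.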
The RUC variables selected for inclusion in the database are listed in Table 1. They were chosen based on a priori assumptions of their potential influence or relationship to cloud-ceiling height. The available RUC data are 1-h forecast data valid at nearly every hour in 2004. Not all variables are actually used by the final classification algorithms.
After establishing the database, a data-mining tool was applied to find the relationships between the RUC data and 1-, 3-, 6-, and 12-h cloud-ceiling observations. C5.0 is the supervised inductive-learning tool used here to generate classifiers from data in the form of decision trees (Quinlan 1993; Bankert et al. 2004). It recursively breaks down a training set into smaller sets, using a single independent variable test at each iteration. Variables and test conditions on those variables are chosen such that each of the resulting subsets is as homogeneous as possible. The result is a set of classification decisions, organized as a decision tree, where each leaf of the tree denotes a final classification. Because these empirically developed classification models are expressed as decision trees, they are less of a “black box” and more open to interpretation than classifiers from other data-mining methods (e.g., neural networks). The decision tree size and complexity are determined by the training data (used to generate the trees) and by the balance between training set classification accuracy and the ability of the classifier to generalize to events not adequately represented in the training data. Decision tree “pruning” is implemented to approach this balance. Pruning removes decision boundaries and introduces an acceptable degree of error in order to compact, and thus generalize, the tree. Additional details on decision tree generation can be found in Quinlan (1993).
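The recursive partitioning described above can be sketched in a few lines of Python. This is an illustrative stand-in and not the C5.0 implementation: homogeneity is measured here with Shannon entropy, the depth limit is a crude substitute for pruning, and all function and variable names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the single-variable threshold test with the largest entropy reduction.

    rows is a list of dicts mapping variable name to a numeric value.
    Returns (gain, variable, threshold); variable is None if no test helps.
    """
    base = entropy(labels)
    best = (0.0, None, None)
    for var in rows[0]:
        values = sorted(set(r[var] for r in rows))
        # Candidate thresholds lie midway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[var] <= thr]
            right = [lab for r, lab in zip(rows, labels) if r[var] > thr]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            if base - remainder > best[0]:
                best = (base - remainder, var, thr)
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively split the training set; leaves hold the majority class."""
    gain, var, thr = best_split(rows, labels)
    if max_depth == 0 or var is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    left = [(r, lab) for r, lab in zip(rows, labels) if r[var] <= thr]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[var] > thr]
    return {
        "var": var,
        "threshold": thr,
        "le": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        "gt": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    }
```

Each internal node tests one variable against one threshold, which is what keeps the resulting classifier readable: a forecaster can follow the path from the root to a leaf and see which model variables drove a given classification.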
Data mining is initially performed to create a seven-class classifier algorithm. The seven classes, listed and defined in Table 2, were chosen based on the FAA flight regulations regarding cloud ceilings: visual flight rules (VFR), marginal visual flight rules (MVFR), instrument flight rules (IFR), and low instrument flight rules (LIFR). Cloud-ceiling heights that fall into the VFR category were further distributed into four classes as defined in Table 2. The total number of database records used in this study was 7682 and these records were distributed among the classes as displayed in Fig. 1. To evaluate the algorithms developed through C5.0, 10 tenfold cross-validation (CV) tests were performed. For each tenfold CV, the data are randomly split into 10 data record subsets with each subset held out for testing and the data mining performed on the remaining data. The average skill and accuracy of the 10 tenfold CV tests are reported here.
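The evaluation procedure described above (10 repetitions of tenfold cross validation, with results averaged) can be sketched as follows. The sketch assumes a generic `train_fn` that returns a prediction function; a majority-class classifier stands in for C5.0, and all names are hypothetical. Averaging the per-fold accuracies is one reasonable reading of "average skill and accuracy" and is an assumption here.

```python
import random
from collections import Counter


def tenfold_indices(n_records, rng):
    """Randomly split record indices into 10 disjoint subsets."""
    idx = list(range(n_records))
    rng.shuffle(idx)
    return [idx[k::10] for k in range(10)]


def repeated_tenfold_accuracy(records, labels, train_fn, n_repeats=10, seed=0):
    """Mean held-out accuracy over n_repeats independent tenfold CV tests."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        for test_idx in tenfold_indices(len(records), rng):
            held_out = set(test_idx)
            train = [i for i in range(len(records)) if i not in held_out]
            # Train on the nine remaining subsets, test on the held-out one.
            predict = train_fn([records[i] for i in train],
                               [labels[i] for i in train])
            correct = sum(predict(records[i]) == labels[i] for i in test_idx)
            accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def train_majority(train_records, train_labels):
    """Placeholder learner: always predicts the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda record: majority
```

Repeating the random split 10 times reduces the dependence of the reported accuracy on any one partition of the records.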
3. Performance evaluation
a. 1-h forecasts
Using 1-h RUC data, 1-h cloud-ceiling-height classification algorithms were created first. Arbitrary discretization of continuous cloud-height data into the seven aviation-defined height categories blurs any natural “clusters” of the height data. These artificial class boundaries make for a more difficult classification problem. Initial results are presented for the seven-class problem, but additional evaluations for a two-class grouping are also shown.
The 1-h forecast overall accuracy and individual class accuracies are presented in Table 3 for the C5.0-developed algorithm [referred to as data-mining cloud ceiling (DMCC) from here on], for persistence (Murphy 1992; Wilks 2006), which is used as a standard of reference, and for three cloud-ceiling estimations from RUC output: the mixing ratio threshold (RUC: qc + qi), the relative humidity threshold (RUC: RH), and the Stoelinga–Warner translation algorithm (Stoelinga and Warner 1999) applied to RUC data (RUC: SW). The cloud-ceiling-height estimate from the mixing ratio data is the lowest level at which the combined cloud water (qc) and cloud ice (qi) mixing ratios exceed 10⁻⁶ g g⁻¹. For the RH threshold, the cloud-ceiling height is the lowest level at which RH exceeds 90%. The Stoelinga–Warner translation algorithm is based on empirical and theoretical relationships between hydrometeor attributes and light extinction. The DMCC algorithm produced higher overall accuracy than did the current RUC estimates of cloud-ceiling height at 1 h. With the DMCC algorithm depending solely on RUC data (i.e., no historical or current observations used as predictors), persistence had a higher accuracy for 1-h forecasts. Including observations and/or satellite data should improve DMCC accuracies, as demonstrated in previous and related work for cloud-ceiling-height diagnosis (Bankert et al. 2004).
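The two threshold-based estimates can be written compactly. The following Python sketch is a simplification under stated assumptions: it takes the height of the first model level that exceeds the threshold, with no interpolation between levels, and the function names and input layout (lists ordered from the surface upward) are hypothetical rather than taken from the RUC postprocessing code.

```python
def lowest_level_exceeding(heights_ft, values, threshold):
    """Height of the lowest level whose value exceeds threshold, else None.

    heights_ft and values must be ordered from the surface upward.
    """
    for height, value in zip(heights_ft, values):
        if value > threshold:
            return height
    return None


def ceiling_from_condensate(heights_ft, qc, qi, threshold=1e-6):
    """Ceiling from combined cloud water and cloud ice mixing ratio (g/g)."""
    combined = [c + i for c, i in zip(qc, qi)]
    return lowest_level_exceeding(heights_ft, combined, threshold)


def ceiling_from_rh(heights_ft, rh_percent, threshold=90.0):
    """Ceiling from the relative humidity threshold (percent)."""
    return lowest_level_exceeding(heights_ft, rh_percent, threshold)
```

Both estimates return `None` when no level exceeds the threshold, which corresponds to a "no ceiling" classification.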
Note some of the relatively low individual class accuracies. Three of the seven classes had less than 50% accuracy for the DMCC algorithm. Specific RUC algorithms had more classes with even lower accuracies. This outcome is related, at least in part, to the problem of classifying continuous data. However, very few of the misclassifications for the DMCC algorithm (1.3% of all testing samples) would be considered very bad misclassifications, with a majority of all misclassifications falling into neighboring height classes. As displayed in Table 3, the overall accuracy for each method improves substantially when combining VFR1 (no ceiling) with VFR2 (>12 000 ft ceiling) to produce a six-class outcome. As a baseline climatology, a 1-h forecast of “no ceiling” for all samples produces an overall accuracy of 42.0%.
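The accuracy measures and the class merger discussed above can be sketched as follows. The class accuracy is taken here to be the fraction of observed cases in a class that were forecast correctly, which is an assumption because the definition is not stated in the text. The merged label name and the function names are hypothetical.

```python
from collections import Counter


def overall_accuracy(observed, forecast):
    """Fraction of all samples whose forecast class matches the observed class."""
    return sum(o == f for o, f in zip(observed, forecast)) / len(observed)


def class_accuracies(observed, forecast):
    """Per-class fraction of observed cases that were forecast correctly."""
    totals = Counter(observed)
    hits = Counter(o for o, f in zip(observed, forecast) if o == f)
    return {cls: hits[cls] / totals[cls] for cls in totals}


def merge_vfr1_vfr2(labels):
    """Combine VFR1 (no ceiling) and VFR2 (>12 000 ft) into one class."""
    return ["VFR1+2" if lab in ("VFR1", "VFR2") else lab for lab in labels]


def climatology_baseline(observed, constant_class="VFR1"):
    """Accuracy of forecasting the same class for every sample."""
    return overall_accuracy(observed, [constant_class] * len(observed))
```

Merging raises the overall accuracy whenever forecasts confuse the two highest classes with each other, because those cases count as correct in the six-class outcome.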
To get a more extensive evaluation of the accuracy and skill of the DMCC algorithm compared with other methods, the output data were distributed into two classes: less than 1000 ft (<1000 ft) and greater than or equal to 1000 ft (≥1000 ft). The resulting 2 × 2 contingency tables produced for each classification were analyzed with low ceilings (<1000 ft) being the event of interest. As noted in Wilks (2006) and Doswell et al. (1990), three or more performance parameters are needed to fully represent the forecast performance. Probability of detection is a measure of the discrimination (or detection) capability. False alarm ratio is a reliability attribute and is the proportion of forecasted events (ceilings <1000 ft) that are incorrect. The low-ceiling class (<1000 ft) occurs much less frequently in this data source than does the high-ceiling class (≥1000 ft); therefore, critical success index (CSI) is a more useful accuracy measure than the total percent correct. The Heidke skill score (HSS) is computed to measure the performance of each algorithm relative to random chance. A positive HSS value indicates a performance better than random chance (perfect score is 1.0), an HSS value of zero indicates no better (or worse) than random chance, and a negative HSS value indicates forecast skill that is worse than random chance. These results are presented in graphical form in Fig. 2. When all statistics are considered, the DMCC algorithm came closest to the performance level of persistence at 1 h.
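For reference, the four measures can be computed from the 2 × 2 contingency table as in the following Python sketch, which uses the standard definitions (e.g., Wilks 2006) with a = hits, b = false alarms, c = misses, and d = correct negatives. The function names are hypothetical, and a ceiling of `None` is assumed to mean "no ceiling" (a nonevent). The scores are undefined when a denominator is zero, a case this sketch does not handle.

```python
def contingency_counts(observed_ft, forecast_ft, threshold_ft=1000.0):
    """Count hits, false alarms, misses, and correct negatives.

    The event of interest is a ceiling below threshold_ft.
    """
    def is_low(height_ft):
        return height_ft is not None and height_ft < threshold_ft

    hits = false_alarms = misses = correct_negatives = 0
    for obs, fcst in zip(observed_ft, forecast_ft):
        if is_low(fcst) and is_low(obs):
            hits += 1
        elif is_low(fcst):
            false_alarms += 1
        elif is_low(obs):
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def scores(a, b, c, d):
    """POD, FAR, CSI, and HSS from the four contingency-table counts."""
    return {
        "POD": a / (a + c),          # fraction of observed events detected
        "FAR": b / (a + b),          # fraction of forecast events that failed
        "CSI": a / (a + b + c),      # ignores correct negatives
        "HSS": 2.0 * (a * d - b * c)
               / ((a + c) * (c + d) + (a + b) * (b + d)),
    }
```

Because CSI excludes the correct negatives, it is not inflated by the many high-ceiling cases that dominate the sample, which is why it is preferred here over percent correct.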
b. 3-h forecasts
With 1-h RUC output being the only available data for this study, C5.0 data mining had to be applied to those data with appropriate 3-, 6-, and 12-h METAR observation times. For example, the 1-h RUC data with a 0300 UTC initial time (data valid at 0400 UTC) are paired with the METAR observation at 0600 UTC for a 3-h forecast from the initial time. It is not known whether the DMCC algorithm developed in this manner would be more or less accurate than one developed using 3-h RUC data (valid at 0600 UTC in this example). The resulting forecast skill would be dependent on the skill of the 3-h RUC forecast, as well as the ability of the C5.0 algorithms to find the relationships.
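The pairing of 1-h RUC records with later observations can be sketched as follows. The dictionary layout and names are hypothetical, and the persistence forecast is assumed to be the METAR observation at the RUC initial time carried forward unchanged, a definition that the text does not spell out.

```python
from datetime import datetime, timedelta


def pair_records(ruc_by_init, metar_by_time, lead_hours):
    """Pair each RUC record with the METAR observation lead_hours after
    the model initial time.

    ruc_by_init maps the initial time to the 1-h forecast predictors
    (valid 1 h after the initial time). metar_by_time maps observation
    time to the observed ceiling class. Records with no matching target
    observation are skipped.
    """
    pairs = []
    for init_time, predictors in sorted(ruc_by_init.items()):
        target_time = init_time + timedelta(hours=lead_hours)
        if target_time not in metar_by_time:
            continue
        pairs.append({
            "init": init_time,
            "predictors": predictors,
            "target": metar_by_time[target_time],
            # Assumed persistence forecast: the observation at initial time.
            "persistence": metar_by_time.get(init_time),
        })
    return pairs
```

With `lead_hours=3`, a record initialized at 0300 UTC (RUC data valid at 0400 UTC) is paired with the 0600 UTC observation, matching the example in the text.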
Without 3-h RUC data, the DMCC 3-h cloud-ceiling forecast algorithm is compared with persistence only. The seven-class results are presented in Table 4. DMCC algorithm class accuracies compare very favorably to persistence at 3 h. In fact, the DMCC algorithm produces higher (by nearly 5%) overall accuracy. As expected, the persistence accuracies at 3 h are much lower than those produced at 1 h. For the DMCC algorithm, there is not much difference between the 1- and 3-h accuracies. Combining the VFR1 and VFR2 classes into one class increases the overall accuracy by approximately 10% for both DMCC and persistence.
As was done with the 1-h results, the seven-class output for 3-h forecasts is regrouped into two classes (<1000 ft and ≥1000 ft). The performance measures are presented in Fig. 3 with the DMCC algorithm demonstrating slightly higher accuracy (CSI) and skill (HSS) than persistence.
c. 6- and 12-h forecasts
For both 6- and 12-h cloud-ceiling forecasts, Global Forecast System (GFS) MOS output was available for comparison. The seven classes output by GFS MOS are slightly different from those used for DMCC and persistence. The VFR1 and VFR2 classes have been combined for DMCC and persistence to match GFS MOS. The two lowest-height classes used in GFS MOS have been combined to match the LIFR class. These class mergers create a six-class system. The comparison still is not perfect, with GFS MOS having a class boundary at 6500 ft as opposed to 5000 ft for persistence and DMCC. However, a representative comparison is possible.
Six-class accuracy results for the three methods can be found in Table 5 for 6-h forecasts and Table 6 for 12-h forecasts. The two-class (<1000 ft and ≥1000 ft) accuracy and skill evaluations are presented in Figs. 4 and 5 for 6 and 12 h, respectively. Not surprisingly, persistence fares poorly at these longer forecast times. DMCC and GFS MOS have comparable performance at 6 h, but the DMCC algorithm performs noticeably better at 12 h as evidenced by higher accuracy and skill scores for the lower-ceiling-height class in the two-class test (Fig. 5).
4. Model analysis
As outlined in Bankert et al. (2004), C5.0 produces decision trees by selecting those attributes (RUC variables in this study) that best partition the data. Attributes selected at each tree node are selected using an information gain principle (Shannon 1948). The attributes in the top branches of the decision tree are assumed to be the most significant contributors to the classification. Examining the top three branches of each decision tree for the 1-, 3-, 6-, and 12-h cloud-ceiling-height classification produced some interesting findings. The top contributing RUC variables (in order) for each of the forecast times are listed in Table 7.
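Listing the variables that appear in the top branches of a tree is a simple traversal. The sketch below assumes a tree stored as nested dictionaries with hypothetical keys (`var`, `threshold`, `le`, `gt`, `leaf`); this is not the C5.0 output format, and the example tree and its thresholds are invented for illustration only.

```python
def variables_by_level(tree, max_levels=3):
    """Return the split variables found at each of the top tree levels.

    The result is a list of lists: one list of distinct variable names
    per level, starting at the root.
    """
    levels = []
    frontier = [tree]
    for _ in range(max_levels):
        names = []
        next_frontier = []
        for node in frontier:
            if "var" in node:  # internal node; leaves carry only "leaf"
                if node["var"] not in names:
                    names.append(node["var"])
                next_frontier.extend([node["le"], node["gt"]])
        if not names:
            break
        levels.append(names)
        frontier = next_frontier
    return levels


# Invented example tree; variable names and thresholds are placeholders.
example_tree = {
    "var": "pwr", "threshold": 0.8,
    "le": {"var": "u_30mb", "threshold": 2.0,
           "le": {"leaf": "VFR1"}, "gt": {"leaf": "MVFR"}},
    "gt": {"var": "rh_avg", "threshold": 92.0,
           "le": {"leaf": "MVFR"}, "gt": {"leaf": "IFR"}},
}
```

Calling `variables_by_level(example_tree)` lists the root variable first and the second-level variables next, which mirrors the ordering by branch level used to rank contributors.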
Precipitable water ratio (PWR) was computed for the lowest 150 mb and is the ratio of precipitable water to precipitable water assuming vapor saturation. While the PWR variable is prevalent at all forecast times, the average relative humidity (in the lowest 150 mb) becomes less significant at longer forecast times and the u wind component in the lowest 30 mb becomes more significant. Because the 1-h RUC variables were used to create all decision trees, the 1-h moisture variables appear to play a more significant role in classifying cloud-ceiling heights at 1 and 3 h, and the 1-h momentum (e.g., the u wind component) variables contribute more at the longer forecast times.
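Because precipitable water is proportional to the pressure integral of specific humidity, the constant factors cancel in the ratio, and PWR reduces to the integral of q divided by the integral of saturation q over the layer. The Python sketch below illustrates this under several assumptions that are not taken from the text: Bolton's (1980) approximation for saturation vapor pressure, trapezoidal integration on the available levels, and no interpolation to the exact top of the 150-mb layer. All names are hypothetical.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over water (hPa), Bolton (1980) form."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def saturation_specific_humidity(pressure_hpa, temp_c):
    """Saturation specific humidity (kg/kg) at the given pressure and temperature."""
    es = saturation_vapor_pressure_hpa(temp_c)
    return 0.622 * es / (pressure_hpa - 0.378 * es)


def layer_integral(pressures_hpa, values):
    """Trapezoidal integral of values over pressure (surface upward)."""
    total = 0.0
    for k in range(len(pressures_hpa) - 1):
        dp = pressures_hpa[k] - pressures_hpa[k + 1]
        total += 0.5 * (values[k] + values[k + 1]) * dp
    return total


def precipitable_water_ratio(pressures_hpa, temps_c, spec_hum, depth_hpa=150.0):
    """Ratio of precipitable water to its saturated value in the lowest layer.

    Profiles are ordered from the surface upward; only levels within
    depth_hpa of the lowest level are used.
    """
    top = pressures_hpa[0] - depth_hpa
    keep = [k for k, p in enumerate(pressures_hpa) if p >= top]
    p = [pressures_hpa[k] for k in keep]
    q = [spec_hum[k] for k in keep]
    qs = [saturation_specific_humidity(pressures_hpa[k], temps_c[k]) for k in keep]
    return layer_integral(p, q) / layer_integral(p, qs)
```

A value near 1 indicates a lowest 150 mb that is close to saturation throughout, which is consistent with PWR being a prominent predictor of low ceilings.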
Other significant contributors in the top three branches included the dewpoint temperature (at lowest RUC level) and the lifting condensation level (LCL). Both of these variables played a larger role at shorter forecast times (along with yes or no cloud, as noted in Table 7). Average turbulent kinetic energy (TKE) and planetary boundary layer (PBL) depth were strong contributors at the longer forecast times.
While a detailed analysis of the physical reasoning of how these variables (valid at 1-h forecast time) relate to current (defined here as 1 h) and future (3, 6, and 12 h) cloud-ceiling heights is beyond the scope of this research, a general analysis is provided here. At 1 h, one would expect the current state of the atmosphere, in terms of temperature and moisture, to play a prominent role. This is indeed the case when considering the strongest contributing variables selected: average RH, PWR, yes–no cloud, dewpoint temperature (lowest level), and LCL height. At 3 h, the current moisture state of the atmosphere is still important, but the current wind speed and direction (u wind component in the lowest 30 mb) also appear to be a factor. In the coastal environment in which KJFK exists, the magnitudes of the temperature and moisture advection are key in determining future cloud-ceiling conditions, possibly explaining the relationship of future cloud-ceiling height with the current wind data. The 1-h u wind (lowest 30 mb) is an even stronger contributor in the 6- and 12-h forecasts and the vertically averaged TKE appears as a significant variable. The dynamic elements (wind, TKE) at 1 h appear to be stronger factors in determining the 6- and 12-h cloud-ceiling-height forecasts than the current moisture state. If the state of the environment is changing from 1 to 6 or 12 h, current data that relate to change should play a more prominent role.
Application of a data-mining technique to 1-h RUC model data and METAR observations at a single station (KJFK) has resulted in RUC-based cloud-ceiling-height classification algorithms producing estimates at 1-, 3-, 6-, and 12-h forecast times. The DMCC algorithms were able to maintain six-class (no ceiling and >12 000 ft ceiling combined into one class) overall accuracy from 1 to 12 h (Fig. 6) indicating the ability to find useful relationships between RUC output and cloud-ceiling observations out to 12 h. Persistence and GFS MOS both have a steeper decrease in accuracy at the longer forecast times. Similar results can be seen in the two-class (<1000 ft and ≥1000 ft) HSS statistics (Fig. 7).
Improvements to the DMCC model could be accomplished by taking into account day–night differences and warm–cold seasonal variations. Additionally, a more complex algorithm, developed through data mining, that also includes station observations, satellite data, and climatological parameters should produce a more robust cloud-ceiling forecast.
Future work includes generating algorithms for multiple locations within the contiguous United States and applying those algorithms in postprocessing RUC data for appropriate grid points. Some locations will prove to be more challenging than others. Mountainous terrain, microclimates, and other local factors not adequately represented in the RUC model can contribute to the development of cloud-ceiling algorithms with unsatisfactory results. Also, the sensitivity of developed algorithms to RUC model upgrades needs to be considered.
The support of the sponsor, the FAA through the AWRP, is gratefully acknowledged. The views expressed are those of the authors and do not necessarily represent the official policy or position of the FAA. Without the assistance, advice, and support of Paul Herzegh, Gerry Weiner, and Jim Cowie at NCAR RAL and John Brown and Stan Benjamin at the Global System Division of ESRL, this work would not have been possible. Their assistance is greatly appreciated.
Corresponding author address: Richard Bankert, Naval Research Laboratory, 7 Grace Hopper Ave., Monterey, CA 93943-5502.