Data-mining methods are applied to numerical weather prediction (NWP) output and satellite data to develop automated algorithms for the diagnosis of cloud ceiling height in regions where no local observations are available at analysis time. A database of hourly records that include Coupled Ocean–Atmosphere Mesoscale Prediction System (COAMPS) output, satellite data, and ground truth observations [aviation routine weather reports (METAR)] has been created. Data were collected over a 2.5-yr period for specific locations in California. Data-mining techniques have been applied to the database to determine relationships in the collected physical parameters that best estimate cloud ceiling conditions, with an emphasis on low ceiling heights. Algorithm development resulted in a three-step approach: 1) determine if a cloud ceiling exists, 2) if a cloud ceiling is determined to exist, determine if the ceiling is high or low (below 1 000 m), and 3) if the cloud ceiling is determined to be low, compute ceiling height. A sample of the performance evaluation indicates an average absolute height error of 120.6 m with a 0.76 correlation and a root-mean-square error of 168.0 m for the low-cloud-ceiling testing set. These results are a significant improvement over the ceiling-height estimations generated by an operational translation algorithm applied to COAMPS output.
The U.S. Navy operational meteorologist is often required to assess and forecast specific weather conditions at remote locations where there are no direct observations available because of data-void or data-denied conditions. In these situations, generating an immediate diagnosis of remote meteorological parameters would enhance the support these meteorologists provide to the tactical decision-making process. One parameter vital to operations is cloud ceiling.
Both numerical weather prediction (NWP) models and satellite imagery provide extremely useful information to an operational forecaster, but only to a degree. As an example, neither the model output nor the satellite imagery can reliably provide the cloud ceiling height at a specific location. Translation algorithms (Stoelinga and Warner 1999, hereinafter SW99) are applied to NWP model output in order to extract cloud ceiling fields from numerical model fields. SW99 is based on empirical and theoretical relationships between hydrometeor attributes and light extinction. As noted in that study, SW99 ceilings were consistently higher than observed. Various satellite techniques have been developed and applied with limited success. Ellrod (2002) used a surface temperature and infrared cloud-top temperature difference to estimate low cloud ceiling height at night.
In general, the modeling of weather phenomena has been theory driven: parameters are determined by equations developed from physical laws and subsequently verified by data. However, in some highly complex situations, the physical laws governing these phenomena are unknown, too complicated to represent, or not fully understood. For example, while a system of physical equations allows for the modeling of temperature and winds, phenomena such as cloud processes are more complex and their relevant parameters must be approximated. In these circumstances, the conceptual modeling can be data driven. That is, through proper analysis, data relationships representing the physics implicit in the data are empirically discovered. We hypothesize that these relationships approximate the physical laws and allow development of the required application.
Supervised machine-learning techniques are used to discover patterns in data and to develop associated classification and parameter estimation algorithms. These data-mining methods, used in a Knowledge Discovery from Databases (KDD) procedure, are applied to the cloud-ceiling-height assessment problem. Within the KDD methodology, raw data are collected, processed, and stored in a database. Data-mining tools are then applied to the database records to uncover the patterns and relationships that represent physical laws implicit in the data. This research attempts to find relationships in satellite and NWP data that can provide accurate estimates of cloud ceiling height. The discovered relationships are incorporated into a cloud ceiling classifier with the goal of generating an operational product in regions where local observations are not available.
Sources for NWP output, meteorological satellite sensor data, and ground observations [aviation routine weather reports (METAR)] are established, and a unique meteorological research environment is developed for the automated collection, processing, and storage of meteorological data records. Parameter values from the disparate data sources are extracted with location and time markers, and these values are combined (“fused”) to form a single data record with elements that are coincident in space and time. These heterogeneous data are collected continuously from the various sources to form a flexible and powerful environment for meteorological discovery. Hourly data records are included in a single database optimized for data mining. In addition to the automated routines to populate the database, this research environment also includes Web-based monitoring tools and 2D images of satellite and NWP data that can be accessed in near–real time. The monitoring tools provide supplemental information related to individual NWP model runs, missing data (from any source), and database statistics. This unique KDD environment has been introduced in Bankert et al. (2001).
Parameter values for each data type are collected for 18 METAR station locations in California. NWP data are obtained from the Coupled Ocean–Atmosphere Mesoscale Prediction System (COAMPS; Hodur 1997). Real-time data from the Geostationary Operational Environmental Satellite-10 (GOES-10) are available with hourly data extracted and stored in the database.
Relevant observation elements are parsed from METAR reports for the selected stations and stored in the database. They represent ground truth and serve as the dependent variables in the subsequent search for patterns in the data that relate GOES-10 and COAMPS variables to locally observable parameters. Cloud ceiling height will be the parameter examined in this study.
Section 2 is a description of related research in the determination of cloud ceiling and the application of certain artificial intelligence methods to scientific data analysis. Section 3 provides detailed information on all data sources, and section 4 is a description of the research database development. A discussion of the applied data-mining processes and the experimental methodology are found in section 5. The results and conclusions are presented in sections 6 and 7. A table of acronyms is also included in the appendix.
Determination of cloud ceiling height
Currently, the primary approach for correcting numerical model output is via model output statistics (MOS; Miller 1995; Vislocky and Fritsch 1997). MOS is a technique for postprocessing NWP output in order to predict surface observations. Using multiple linear regression, MOS correlates model output variables, such as relative humidity and precipitable water, and climatic variables, such as station climatology, with surface observations in order to predict observed variables such as cloud-base height (Hebenstreit 1981; Erickson 1988; Jacks et al. 1990; Miller 1995). Two drawbacks can be identified with using only MOS for model correction via output postprocessing: 1) the latest surface observation predictors are often the most important terms in MOS forecast equations and dominate over all model-based terms (Miller 1995; Vislocky and Fritsch 1997), whereas this research addresses exactly those situations in which local observations are not available; and 2) MOS methodology is generally limited to multiple linear regression, so any nonlinear relationships will be missed. More generally, any MOS result is limited by the statistical model imposed on the solution.
Using an MOS approach, Norquist (1999) describes multiple linear regression (MLR) and multiple discriminant analysis (MDA) methods applied to mesoscale NWP model data in order to provide predictions of cloud characteristics, including cloud ceiling height. Both MLR and MDA methods demonstrated improved skill compared to persistence for the 10-day studies in both January and July over southern Europe. In Forsythe et al. (2000) a fusion of satellite-derived cloud classification and surface observations is used to estimate cloud-base height at locations between observation stations. Using one month of data, this technique provided a superior estimate of cloud-base height when compared with a distance-weighted interpolation of surface data alone.
Scientific data mining
Although KDD has been primarily and most visibly applied in the business domain (analyzing customer transactions, market activity), KDD and data mining have also demonstrated great success in the analysis of scientific data. Significant increases in computer storage and computing power have enabled new data-mining methods and procedures. These methods are collectively grouped in the field of KDD. Specifically, KDD refers to the overall process of discovering useful information from data. This process includes data storage and access, all forms of data preparation, the addition of necessary domain knowledge, and the interpretation and visualization of the data-mining results. Data mining refers to the specific algorithms applied to the data in order to extract patterns. Data-mining techniques, both traditional statistical methods and artificial intelligence machine-learning algorithms, include inductive learning, regression, clustering, summarization, generalization, dependency modeling, and link analysis (Fayyad et al. 1996b; Berry and Linoff 1997; Weiss and Indurkhya 1998). The mining of scientific data is well reviewed in Fayyad et al. (1996a). Notable systems and results include the Sky Image Cataloging and Analysis Tool (SKICAT; Weir et al. 1995), the Jet Propulsion Laboratory (JPL) Adaptive Recognition Tool (JARtool; Burl et al. 1994), and the Open Architecture Scientific Information System (OASIS; Mesrobian et al. 1996). SKICAT uses decision-tree methods to predict the classes of faint astronomical objects in photographic image data. JARtool learned to recognize small volcanoes in satellite images of the surface of Venus. The OASIS data-mining environment was designed for discovery and visualization in large geophysical datasets. It has been applied successfully in the study of spatiotemporal features of cyclonic storms.
Additionally, clustering data-mining techniques have been applied in both the discovery of climate indices (Steinbach et al. 2003) and 700-hPa height-field anomalies (Smyth et al. 1999).
To maximize chances of discovering useful relationships between a variety of physical variables, both calculated and measured, both an NWP model and a satellite sensor are identified as data sources. Each source provides a unique type of data: satellite imagery provides a view “from above” in the form of actual observed radiances in various spectral channels, while the model provides a large number of calculated variables at multiple levels in an atmospheric column at any particular model grid point. These data sources provide coincident (temporal and spatial) data that can be explored individually (NWP-only or satellite-only) or in combination. A research environment consisting of data collection from COAMPS, GOES-10, and METAR observations has been created to uncover data relationships useful for estimating meteorological parameters of interest. COAMPS output parameters and coincident GOES-10 parameters are computed and extracted at 18 METAR observation sites (listed in Table 1). Automated data collection routines were written and data were collected hourly over a 2.5-yr period (12 July 2000–29 November 2002).
COAMPS is a nonhydrostatic, multiply nested, mesoscale NWP model used in this research to generate output values of selected relevant parameters. The model is run over the U.S. West Coast and configured with three horizontal nested grids at 81-, 27-, and 9-km resolution (Fig. 1). There are 33 vertical levels for all grids, with the top located at 32.1 km. Grid points are strongly compressed near the surface to resolve the shallow boundary layer. COAMPS is run for a 12-h forecast cycle for this domain configuration at 0000 and 1200 UTC each day. The Navy Operational Global Atmospheric Prediction System (NOGAPS) provides the time-dependent boundary conditions for the 81-km domain. The model is described by Hodur (1997), Hodur et al. (2002), and Chen (2003). COAMPS features a full suite of physical parameterizations, including the Mellor and Yamada (1982) level-2.5 turbulence parameterization, radiation (Harshvardhan et al. 1987), and cloud microphysics (Rutledge and Hobbs 1983) schemes.
To parameterize boundary layer processes, COAMPS uses a prognostic equation for turbulent kinetic energy (TKE) with diagnostic equations for other second-moment quantities. The eddy coefficients are based on Yamada (1983), while surface fluxes and surface stress are computed from the Louis (1979) scheme. The cloud microphysics scheme features prognostic equations (continuity equations) for mixing ratios of five species of water substance: water vapor, cloud liquid water, rainwater, cloud ice, and snow. Source and sink terms are from 13 different microphysical conversion processes.
In the absence of observational data, data assimilation is accomplished using a scheme that retains horizontal and vertical structure developed in the previous forecast in the initial conditions of a subsequent forecast. When available, observational data are incorporated into the initial conditions using an increment formed between the observational analysis and the previous forecast. The 12-h data assimilation cycle was maintained throughout the study period.
The closest land grid point (within the 9-km domain) to each of the selected METAR stations is determined, and all points were found to be within 5 km of the METAR station location. COAMPS output values at those grid points for each hour are extracted and written to the database. Table 2 is a list of the COAMPS parameters utilized for the present study. The database currently has 660 000 NWP hourly records of these selected COAMPS parameters.
COAMPS parameters were selected based on a priori assumptions about which parameters might have the most influence on cloud ceiling height. Most of the descriptions in Table 2 are self-explanatory. However, some require additional attention: u∗, t∗, and q∗ are the friction-scale velocity, temperature, and moisture from the Monin–Obukhov similarity treatment of the surface layer; the “10-m–surface temperature (mixing ratio) difference” is the difference in degrees (g kg−1) between the value at 10 m and the value at the surface; the cloud parameters, “cloud-base height (qc)” and “cloud-top height (qc),” are determined from examination of the prognostic cloud-liquid-water and cloud-ice mixing ratio (qc) field while those followed by “(RH)” are based on relative humidity; LCL is the lifting condensation level and CCL is the convective condensation level; z/L is the height of the surface layer divided by the Monin–Obukhov length scale (in COAMPS, the height of the lowest model grid point is used for the depth of the surface layer as is the common practice in mesoscale models); and the cloud/no cloud determination is based on the existence of cloud liquid water or cloud ice in the atmospheric column above the point of interest.
While COAMPS cannot forecast cloud ceiling height with sufficient reliability and accuracy to be of direct use operationally, three COAMPS parameters represent the cloud ceiling height directly. One of these ceiling-height estimations, “cloud-base height (qc),” is based on cloud water and ice. A search is made in a vertical column for the lowest altitude at which cloud-liquid-water mixing ratio or the cloud-ice mixing ratio exceeds a threshold of 1.0 × 10−6. A second COAMPS ceiling-height parameter is cloud-base height (RH). The method for computing this height is similar to cloud-base height (qc) except that a search is made for the lowest height level at which the relative humidity is greater than 95%. Ceiling height (SW99) first uses the mixing ratios for cloud liquid water, cloud ice, snow, and rainwater to compute concentrations for each species. The extinction coefficients are then computed and integrated upward to the lowest altitude at which a light beam from the surface decreases to 0.02 times the original intensity.
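The two threshold-based searches described above can be sketched as follows. The function and variable names are illustrative assumptions for this example (as is the interpretation of the 1.0 × 10−6 threshold as applying to mixing ratios in kilograms per kilogram); this is not COAMPS source code.

```python
# Sketch of the COAMPS cloud-base searches. Names and units are assumed
# for illustration; the 1.0e-6 threshold is taken here to apply to mixing
# ratios expressed in kg/kg.

QC_THRESHOLD = 1.0e-6  # cloud-liquid-water / cloud-ice mixing ratio threshold

def cloud_base_height_qc(heights_m, q_cloud, q_ice):
    """Lowest altitude at which cloud water or cloud ice exceeds the
    threshold, or None when no cloud base is found in the column.
    All arrays are ordered from the surface upward."""
    for z, qc, qi in zip(heights_m, q_cloud, q_ice):
        if qc > QC_THRESHOLD or qi > QC_THRESHOLD:
            return z
    return None

def cloud_base_height_rh(heights_m, rh_percent, rh_threshold=95.0):
    """Analogous search: lowest level with relative humidity above 95%."""
    for z, rh in zip(heights_m, rh_percent):
        if rh > rh_threshold:
            return z
    return None
```

For a column in which cloud water first exceeds the threshold at 600 m, `cloud_base_height_qc` returns 600; a cloud-free column yields no base at all.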
In addition to extracting values for the database and within the context of the KDD process, COAMPS output can be viewed in static or animated 2D form over the appropriate domains for further interpretation and analysis.
Hourly data are available from GOES-10. Pixel data are extracted and written to the database. These data consist of all sensor channel data at a given pixel whose center (over land) is closest (no greater than 4-km distance) to the latitude–longitude of each of the METAR stations. The visible channel value is corrected for the solar zenith angle. In addition to the channel data, a satellite rain-rate algorithm (Turk et al. 2001) is applied to determine the instantaneous rain rate at the selected locations. A cloud optical depth algorithm (Wetzel and Stowe 1999) is applied to the GOES-10 data (daytime only). A low-cloud product (Lee et al. 1997) is derived by computing the difference between the shortwave and longwave infrared (IR) channels. In addition, a cloud-top-height estimate is made using the satellite-derived longwave IR temperature and the corresponding COAMPS temperature profile. Tables 3 and 4 summarize this information and include resolution and coverage information. There are 436 032 GOES-10 database records.
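One plausible implementation of the satellite–model cloud-top-height estimate is to locate the layer of the COAMPS temperature profile whose temperatures bracket the longwave-IR brightness temperature and interpolate within it. The sketch below is an assumed implementation for illustration, not the operational retrieval.

```python
# Illustrative satellite/model fusion for cloud-top height: find the model
# layer bracketing the longwave-IR brightness temperature and interpolate
# linearly within that layer. Assumed implementation, for illustration only.

def cloud_top_height(ir_temp_k, heights_m, temps_k):
    """Scan adjacent COAMPS levels from the surface upward; return the
    interpolated height (m) where the profile crosses ir_temp_k, or None
    when the brightness temperature lies outside the profile."""
    levels = list(zip(heights_m, temps_k))
    for (z0, t0), (z1, t1) in zip(levels, levels[1:]):
        lo, hi = sorted((t0, t1))
        if lo <= ir_temp_k <= hi and t0 != t1:
            frac = (ir_temp_k - t0) / (t1 - t0)
            return z0 + frac * (z1 - z0)
    return None
```

With a profile of 290 K at the surface, 283.5 K at 1000 m, and 277 K at 2000 m, an IR temperature of 280 K maps to roughly 1538 m.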
Similar to the COAMPS output, satellite imagery can be viewed in static or animated 2D form. In addition to this visualization, monitoring tools have been developed to allow for a quick view of model and satellite-retrieval performance.
Ground truth data: METAR
METAR reports were collected in near–real time each hour from the Fleet Numerical Meteorology and Oceanography Center (FNMOC) data server. Observed cloud ceiling height is one of the sensible weather elements parsed from the hourly METAR reports for the 18 selected stations. The map in Fig. 2 is marked with the locations and names of those selected stations. These METAR stations were chosen for their coastal nature (for most stations), the availability of satellite data over their location, and the reliability and robustness of the METAR reports. The entire list of parsed parameters can be found in Table 5. The ceiling-height values represent the observer ground truth and serve as the dependent variable in the regression relating GOES-10 and/or COAMPS variables to local cloud ceiling height. The observed cloud ceiling height is defined as the lowest level that has at least broken-sky conditions (equal to or greater than 6/8 cloud coverage). The database currently stores 1 207 000 METAR records.
Automated routines have been developed to collect, process, and store the parameter values discussed in section 3. Database tables are updated once a day after all model output is available and all postprocessing has been completed. Each table row represents the available information (variables) for a particular location at a specific hour. For data-mining purposes, a table of “event” records is exported from the database. Each record of the event table represents all the known variables at one time and one location.
The flow of data involves five steps:
Data generation/collection: NWP data are generated by the COAMPS program. Satellite and METAR data are gathered from real-time ingest antennas and various file transfer protocol (FTP) sites.
Data quality control and preload processing/transformation: Preload processing includes
time rounding: adjusting the time stamp of METAR reports and satellite points to the closest hour, to correspond with the model data,
satellite filtering: identifying missing data,
calculation of satellite-derived products: estimation of low-cloud presence, cloud optical depth, environmental data records, and so on,
METAR report processing: computation of vapor pressure, ceiling/no ceiling (based on fraction of cloud coverage), variable wind directions, and so on,
METAR quality control: removing duplicate, later corrected, or mislabeled reports, and
COAMPS–satellite combination products: a cloud-top height variable is computed using satellite (cloud top) IR temperatures together with COAMPS model output temperature profiles.
Data loading into individual, source-specific tables.
Data postload processing (calculate/update various derived fields): Before exporting a table of event records, various fields are corrected as necessary, for example, adjusting certain height fields to make them relative to sea level, or creating special fields indicating ceiling presence, and low-cloud presence.
Data consolidation: generating an event record for each date–time for which complete information is available (i.e., data from all three sources).
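As a concrete illustration of the time-rounding step in the preload processing above, a minimal sketch (assuming Python `datetime` time stamps; the function name is hypothetical) is:

```python
from datetime import datetime, timedelta

def round_to_hour(ts):
    """Round a report time stamp to the nearest whole hour so that METAR
    and satellite records line up with the hourly model output."""
    rounded = ts.replace(minute=0, second=0, microsecond=0)
    if ts.minute >= 30:
        rounded += timedelta(hours=1)
    return rounded
```

A 1753 UTC METAR report is thus matched with the 1800 UTC model record, while a 1712 UTC report is matched with the 1700 UTC record.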
The data flow and database organization are represented graphically in Fig. 3. This figure shows the flow of data from the three sources: COAMPS, satellite, and METAR. Two-dimensional data are stored as graphic images and are made available in a visualization database via a Web interface. One-dimensional (1D) point data are filtered, then additional parameters are calculated, and finally a loader program stores the data in a database table. The index key of each table is the day–time group and the location identifier. The climatology and location tables are constant-valued reference tables, while the NWP (COAMPS data), OBS (METAR data), and SAT (satellite data) tables are updated daily with new data, consisting of records for each specific location and day–time group.
To collect, visualize, interpret, and exploit the vast amount of digital data available in the environmental sciences, researchers have turned to artificial intelligence (AI) methods, and specifically, data mining (Hand et al. 2001). Data mining is a discipline born of machine learning and statistics, enhanced by large database concerns, pattern recognition, knowledge representation, and other areas of computer science and AI. In contrast to standard statistical approaches, data-mining methods in the KDD tool chest typically relax requirements of sample size and prespecified model hypotheses. They are driven by the data and do not require the correct preselection of a hypothesis. Statistical methods generally hypothesize the form of the model and then use data to confirm the hypothesis. Furthermore, data-mining methods are designed to be able to handle larger datasets (millions of records, and hundreds of variables).
A wide variety of data-mining tools and algorithms are available. These methods include, but are not limited to, decision rules, decision trees, clustering, k nearest neighbor, neural networks, association rules, and statistical regression. Unsupervised methods (e.g., clustering and associations) summarize, count, or organize data in ways that may reveal patterns or relationships among groups of data records. Supervised methods (e.g., decision trees/rules, neural networks, and k nearest neighbor) extract, from a set of data, patterns that relate the independent variables to one or more dependent variables. This study will focus primarily on supervised methods. KDD describes a methodology and selection of tools for handling all these tasks. As a tool for scientific data analysis, data mining utilizes induction to determine empirical models from the observed data. This is in contrast to traditional methods of analysis, where a hypothesis is made based on understanding of physical laws, and data are used to confirm or refute the hypothesis.
C5.0 and Cubist
C5.0 is a data-mining algorithm used for producing classification models in the form of decision trees or if–then rules. The software is designed to explore hundreds of thousands of database records with hundreds of numeric fields. Because these classification models are expressed as decision trees or rules, they are easier to interpret than the models of “black box” data-mining tools such as neural networks.
C5.0 constructs models for classification by using inductive, supervised machine learning. Input consists of a set of training items, each of which is described by a single record consisting of attribute-value pairs. Each item in the training set is assigned one of a predefined set of discrete classes (this is supervised learning). The program uses inductive generalization to generalize from the specific items in the training dataset to a general model representing the same classification information. The resulting decision tree is evaluated by classification of a previously unseen, independent set of testing data.
Decision trees are constructed by a repeated partitioning of the description space. Existing regions of the space are split into smaller regions that eventually will consist of only, or predominantly, a single class of events (points in the description space). For certain distributions of points, where the classes are not easily separated, the partitioning may become extremely complex. Pruning the decision tree amounts to reducing the complexity of the partitioning, but requires acceptance of a certain amount of error in the classifications.
In this work, C5.0 is used to generate two types of classifiers. The first classifier classifies event records into ceiling and no ceiling categories. The second classifies ceiling records into high-ceiling and low-ceiling categories.
Each classification model is represented as a decision tree. A decision tree consists of a set of nodes and links between them. Each node represents either a test on a variable or a final classification. Links emanating from a node denote the possible test outcomes. They lead to nodes indicating further tests, or to tree “leaf” nodes (which have no nodes below them) indicating the final classification. Annotating each classification is an error estimate. A simple example of a partial decision tree is shown in Fig. 4 (without error estimates). In the example, the nonleaf nodes represent tests on variables V1 through V6, and the leaf nodes represent the possible classifications, class A, B, and C.
The size of the tree is determined by the data and the desired accuracy–generality trade-off. A large, complex tree generally produces highly accurate classifications of the training data, at the expense of model generality. This is referred to as overfitting. In order to avoid overfitting the data, a certain amount of classification accuracy on the training set is sacrificed by “pruning” the leaves of the tree. In return, we gain generality, which may help to classify events not adequately represented in the training data. The process of generating decision trees is outside the scope of this paper but is detailed extensively by Quinlan (1993).
A data record is classified by performing the test specified by the top-level “root node” (e.g., test the value of variable V1 in Fig. 4). The link to follow to the next test is indicated by the value of V1 (the test result). If V1 is a discretely valued attribute, there will be one link for each possible value. If it is a continuous value, V1 is tested against a threshold value that splits the set of records into two classes, each represented by one link. This process repeats until a leaf node is reached (with no links below it). At this point, the record is assigned to the class associated with that leaf.
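The traversal just described can be sketched as follows; the `Node` layout and variable names are illustrative and are not taken from the C5.0 implementation. Only the continuous-attribute case (threshold split into two links) is shown.

```python
# Minimal sketch of decision-tree classification: walk from the root,
# performing one variable test per node, until a leaf supplies the class.
# Structure and names are illustrative, not from C5.0.

class Node:
    def __init__(self, variable=None, threshold=None,
                 left=None, right=None, label=None):
        self.variable = variable    # attribute tested at this node
        self.threshold = threshold  # split point for a continuous attribute
        self.left = left            # followed when value <= threshold
        self.right = right          # followed when value > threshold
        self.label = label          # class label (leaf nodes only)

def classify(node, record):
    """Repeat the test-and-follow-link step until a leaf node is reached."""
    while node.label is None:
        if record[node.variable] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label
```

A one-split tree on relative humidity, for example, would send a record with `rh = 98.2` down the "ceiling" link and one with `rh = 40.0` down the "no ceiling" link.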
The Cubist algorithm produces rule-based predictive models for numerical prediction (also known as regression). Each model is expressed as a set of rules. Each rule applies to only a small part of the input space and has a set of conditions and an associated local multivariate linear model. If a rule's conditions are satisfied, the associated model is used to calculate the predicted value. This approach works well in high-dimensional problems, such as the one addressed in this work, as only a small number of variables may be required for a particular rule and model. As a result, the rule set is more easily interpreted than a standard regression equation on all input variables, or a neural network. An example Cubist rule is shown in Fig. 5.
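A minimal sketch of this rule-plus-local-linear-model structure follows; the conditions, variable names, and coefficients are invented for illustration and do not come from any trained Cubist model.

```python
# Each rule pairs a set of conditions with a local multivariate linear
# model; the first rule whose conditions are satisfied supplies the
# prediction. Conditions and coefficients are invented for illustration.

rules = [
    {"when": lambda r: r["rh_925"] > 95.0,
     "model": lambda r: 150.0 + 2.1 * r["ir_temp"] - 0.8 * r["lcl_m"]},
    {"when": lambda r: r["rh_925"] <= 95.0,
     "model": lambda r: 900.0 - 0.5 * r["lcl_m"]},
]

def predict(record, rule_set, default=None):
    """Apply the first rule whose conditions match the record."""
    for rule in rule_set:
        if rule["when"](record):
            return rule["model"](record)
    return default
```

Note that each local model involves only a few of the input variables, which is what makes the rule set easier to read than a single regression over the full variable list.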
The goal of this research is to create a set of algorithms that produce the most accurate assessment of cloud ceiling presence and low-cloud-ceiling height. Through a KDD process, a three-step method was developed for generating cloud ceiling classifiers and ceiling-height estimators:
step 1—create a cloud ceiling/no-cloud-ceiling classifier (using C5.0 with all training data),
step 2—create a low-cloud-ceiling/high-cloud-ceiling classifier (using C5.0 with training data consisting only of cases where a cloud ceiling is present), and
step 3—create a low-cloud-ceiling height estimator (using Cubist with training data consisting only of cases where a low cloud ceiling is present).
The resulting system is executed in the same stepwise fashion. If the result of step-1 classification for a given data point is “cloud ceiling,” then the step-2 classifier is executed for that point. The step-3 estimator is executed if the result of the step-2 classification is “low cloud ceiling.”
The “no-cloud-ceiling” class in the first step also includes data records for which the METAR ceiling is above 12 000 ft (3657.6 m). Most METAR reporting stations are automated and observation instruments at such stations cannot detect clouds above that altitude. Therefore, if the lowest observed ceiling is above this limit, it is indistinguishable from a no-cloud-ceiling condition. As a result, all observed ceilings greater than 3657.6 m are classified as no cloud ceiling. For step 2, the threshold to separate low and high ceilings is 1000 m.
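The three-step cascade, including the 3657.6-m no-ceiling limit and the 1000-m low/high threshold, can be sketched as below. The classifier and estimator arguments stand in for the trained C5.0 and Cubist models; the function and key names are illustrative.

```python
# Sketch of the three-step execution. The clf/estimator arguments stand in
# for the trained C5.0 classifiers and Cubist height estimator.

NO_CEILING_LIMIT_M = 3657.6   # ceilings above 12,000 ft treated as no ceiling
LOW_CEILING_LIMIT_M = 1000.0  # step-2 threshold separating low and high

def assess_ceiling(record, ceiling_clf, low_clf, height_estimator):
    # Step 1: is there a cloud ceiling at all?
    if ceiling_clf(record) == "no ceiling":
        return {"ceiling": False}
    # Step 2: is the ceiling low or high?
    if low_clf(record) == "high":
        return {"ceiling": True, "low": False}
    # Step 3: estimate height only for low ceilings.
    return {"ceiling": True, "low": True,
            "height_m": height_estimator(record)}
```

Only records that pass each classification step reach the next one, so the height estimator is never applied to high-ceiling or ceiling-free cases.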
The basic classifier/estimator algorithm development was performed using the following procedures:
export a set of event records data from the database,
randomly split data into equal-sized training and testing datasets,
perform C5.0 data mining for step 1 on training data, and test resultant algorithm on testing data,
perform C5.0 data mining for step 2 on “ceiling” cases only in the training data, and test the resultant algorithm on ceiling cases from the testing data,
perform Cubist data mining for step 3 on “low-cloud-ceiling” cases only in the training data, and test the resulting algorithm on the low-cloud-ceiling cases in the testing data, and
when satisfied with results, output an algorithm trained on all available data for incorporation into the final, production algorithm.
Several experiment permutations were performed in order to confirm the generality of our results:
A random split of all data: The available data (for all locations together) were randomly divided into equal-sized training and testing sets. The analysis was done for all California locations as one.
A random split of the data for each location: The data for each location were individually split at random into two equal sets for training and testing. Each location was studied independently.
Training on the first year of data, testing on the second year for all data: In order to avoid any dependence introduced into the training and testing data by possibly having a training record for location X at hour T, and then testing on record for location X at time T ± 1, the data were split by year. The system was trained on the first year's data and tested on the second year's data. This tested the generality of our results across time.
Leave-one-location-out testing: In 18 separate trials (one per each California location), the classifier and estimator were trained on all data for 17 of the 18 locations and then tested on the missing location. This scheme tested the generality of the results across coastal locations.
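The leave-one-location-out scheme corresponds to what scikit-learn calls leave-one-group-out cross-validation. The sketch below uses synthetic placeholder data; the station labels, features, and classifier are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(900, 4))
y = rng.integers(0, 2, size=900)          # ceiling / no-ceiling labels
station = rng.integers(0, 18, size=900)   # one group label per METAR station

# Each fold trains on 17 stations and tests on the held-out 18th.
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=station):
    clf = DecisionTreeClassifier(max_depth=4).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
# len(scores) == 18: one held-out evaluation per station
```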
All learning experiments were based on three different sets of variables:
1) COAMPS variables only,
2) GOES-10 variables only, and
3) fused (combined) COAMPS and GOES-10 variables.
In addition to splitting the data into training and testing sets, the data were further separated into day and night as determined by the solar zenith angle (where an angle of less than or equal to 80° is considered daytime). As mentioned previously, complete hourly records for each of the 18 METAR stations were collected over a 2.5-yr period. Complete records used for performance evaluation included COAMPS output, GOES-10 data, and the METAR cloud ceiling height. There are 263 483 complete records for the California stations. Data mining and testing analysis for these California data records have been completed, with a performance evaluation presented in section 6.
To demonstrate the potential performance of the KDD-produced cloud ceiling algorithm, the results of the data-mining experiment for the California data (daytime only) are presented below. For each of the three steps defined in section 5b, four algorithms are compared using bias, accuracy measures, and skill scores. The four algorithms are
KDD-produced algorithm using GOES-10 data,
KDD-produced algorithm using COAMPS data,
KDD-produced algorithm using both GOES-10 and COAMPS data, and
SW99 translation algorithm applied to COAMPS data.
For step 1 (ceiling/no-ceiling classification), there are 51 611 randomly selected training records and 51 690 randomly selected testing records. The training records are used to create the algorithm. Within the testing set, 21.7% of the records are ceiling cases. Table 6 is a listing of the performance statistics for step 1.
In Table 6, bias is defined as the ratio of “predicted” ceilings to observed ceilings for the testing set output. Bias provides a measurement of reliability for diagnosing a ceiling event and does not measure how well the algorithm output corresponds to observations. Bias ranges from 0 to infinity, with a value of 1 indicating no bias. All four algorithms tested produced a bias value of less than 1.0 (Table 6), indicating an underprediction of ceiling events, or a bias toward the no-ceiling classification. This bias is particularly strong for SW99.
To measure the accuracy of the algorithms in determining cloud ceiling events, the percent correct (% correct), probability of detection (POD), false-alarm ratio (FAR), and critical success index (CSI; Marzban 1998) are computed and presented in Table 6. These measures are described as follows, with “event” defined as ceiling for this step:
% correct is the total percentage correct for both events and nonevents,
POD is the fraction of observed events that were correctly predicted to exist (ignores false alarms),
FAR is the fraction of predicted events that are nonevents (ignores missed events), and
CSI, also known as the threat score, is the ratio of correctly predicted events (hits) to the total number of hits, misses, and false alarms (it does not distinguish source of forecast error).
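These accuracy measures, along with bias, all follow directly from the 2×2 contingency table of hits, misses, false alarms, and correct negatives. A minimal sketch (the function name is illustrative):

```python
def ceiling_scores(hits, misses, false_alarms, correct_negatives):
    """Bias and accuracy measures from a 2x2 contingency table, following
    the definitions in the text ("event" = ceiling for step 1)."""
    n = hits + misses + false_alarms + correct_negatives
    bias = (hits + false_alarms) / (hits + misses)    # predicted / observed
    pct_correct = (hits + correct_negatives) / n      # events and nonevents
    pod = hits / (hits + misses)                      # ignores false alarms
    far = false_alarms / (hits + false_alarms)        # ignores missed events
    csi = hits / (hits + misses + false_alarms)       # threat score
    return bias, pct_correct, pod, far, csi
```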
By any of these accuracy measurements (Table 6), the combination of GOES-10 and COAMPS in the KDD cloud ceiling algorithm produced the best results. Satellite-only (GOES-10) algorithm scores are only slightly lower. All three KDD algorithms are much more accurate than SW99.
The skill scores computed here include the following:
Equitable Threat Score (ETS) is a measure of skill that uses chance as the benchmark. It accounts for climatological event frequencies. The range of values is −0.333 to +1.0 (0 is no skill).
True Skill Score (TSS; Hanssen and Kuipers 1965) examines the ability of the algorithm to separate events from nonevents (accuracy of events + accuracy of nonevents − 1.0), with scores ranging from −1.0 to +1.0. It does not depend on data distribution. The benchmark for this score is the “naive” prediction: for example, always (or never) predicting the event produces a TSS of 0.0 (no skill).
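Both skill scores can likewise be computed from the same contingency-table counts. A sketch under the definitions above; the chance-hits term in ETS uses the standard random-hits formulation.

```python
def ets_tss(hits, misses, false_alarms, correct_negatives):
    """Equitable threat score and true skill score from a 2x2 contingency
    table, per the definitions in the text."""
    n = hits + misses + false_alarms + correct_negatives
    # Hits expected by chance, given observed and predicted event counts.
    hits_random = (hits + misses) * (hits + false_alarms) / n
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    # Accuracy of events + accuracy of nonevents - 1.0.
    tss = (hits / (hits + misses)
           + correct_negatives / (correct_negatives + false_alarms) - 1.0)
    return ets, tss
```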
Based on the skill scores presented in Table 6, the KDD algorithm incorporating both GOES-10 and COAMPS data demonstrated the most skill. As was the case with the accuracy measurements, all three KDD cloud ceiling algorithms have much higher skill scores than SW99.
For step 2, the low-cloud-ceiling/high-cloud-ceiling classification (with low cloud ceiling as the event being analyzed), there are 11 279 randomly selected training samples and 11 199 randomly selected testing samples. Within the testing set, 74.9% of the records are low-ceiling cases. Similar to step 1, bias, accuracy, and skill are computed and presented in Table 7.
The bias values of the four algorithms for this classification step indicate a very slight bias for overprediction of low ceilings in the KDD algorithms and an extreme underprediction of low ceilings (or bias toward high ceilings) for SW99.
Accuracy measurements shown in Table 7 are similar for the three KDD-produced algorithms, with the combined-data algorithm having slightly better results. While SW99 has a FAR comparable to that of the KDD algorithms, its other three accuracy scores are much lower. These results indicate that the KDD cloud ceiling algorithms minimize both misses and false alarms, whereas SW99 frequently misses low-ceiling events.
Similar conclusions can be drawn from the skill scores (Table 7) at this step. Interestingly, in comparison with step 1 (Table 6), the skill level in step 2 (Table 7) is much lower for all algorithms except the KDD algorithm using only COAMPS data. This algorithm actually has higher scores for the low-ceiling/high-ceiling classification. While the KDD algorithm using only COAMPS data showed relatively lower skill in the ceiling/no-ceiling classification step, the information from the atmospheric column as seen in the COAMPS data is helpful in distinguishing low ceilings from high. Contrast that result with the KDD cloud ceiling algorithm using only GOES-10 data. The satellite data features represent the atmosphere from an above-cloud perspective only; therefore, the skill in detecting the existence of a ceiling (through the satellite data alone) is expected to be higher than the skill in determining whether the ceiling is high or low.
For the third step of each algorithm, the heights of the low-ceiling (<1000 m) cases (training: 8429 samples; testing: 8389 samples) are estimated. Performance measures are presented in Table 8 for this step. The bias computed for this step is equivalent to the average error of the testing set. The KDD-produced algorithms have very small negative bias, and SW99 has a more substantial positive bias (Table 8).
The three accuracy measures in Table 8 were computed as follows:
Correlation coefficient (CC) is a measure of the relationship of the algorithm output with observation. Values range from −1.0 (perfect negative correlation) to +1.0 (perfect positive correlation). A value of 0.0 is no correlation.
Mean absolute error (MAE) is the average absolute difference between the algorithm output and the observation for the testing set.
Root-mean-square error (rmse) is
$$\mathrm{rmse} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(e_i - o_i\right)^2\right]^{1/2},$$
where N is the number of testing samples, $e_i$ is the estimated ceiling height for each testing sample, and $o_i$ is the observed ceiling height for each sample. Rmse is more strongly affected by large errors than MAE is.
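The three accuracy measures can be sketched in a few lines (the function name is illustrative):

```python
import numpy as np

def height_errors(est, obs):
    """CC, MAE, and rmse between estimated and observed ceiling heights."""
    est, obs = np.asarray(est, float), np.asarray(obs, float)
    cc = np.corrcoef(est, obs)[0, 1]            # correlation coefficient
    mae = np.mean(np.abs(est - obs))            # mean absolute error
    rmse = np.sqrt(np.mean((est - obs) ** 2))   # root-mean-square error
    return cc, mae, rmse
```

Note that rmse is always at least as large as MAE, with equality only when all errors have the same magnitude.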
The four CC values (Table 8) are similar, ranging from 0.57 for SW99 to 0.76 for the KDD algorithm using both datasets (GOES-10 and COAMPS). However, the MAE and rmse are much lower for the KDD algorithms (Table 8) with the combined data and COAMPS-only data producing similar MAE and rmse values. To measure the skill of the KDD cloud ceiling algorithms relative to SW99, the following equation was used:
$$\mathrm{SS} = 1 - \frac{\mathrm{rmse}_{\mathrm{KDD}}}{\mathrm{rmse}_{\mathrm{SW99}}}.$$
This skill score provides a single value (with a maximum score of 1.0) of the KDD-produced algorithms' skill level relative to SW99 (Table 8). As was the case with the first two steps, the KDD algorithm using both GOES-10 and COAMPS data demonstrated the most skill. Overall, the KDD “fused”-data algorithm had only slightly better testing results than the KDD GOES-10-only algorithm in step 1 and the KDD COAMPS-only algorithm in steps 2 and 3. However, when all three steps are examined as a total cloud ceiling estimation algorithm, the KDD fused-data algorithm provides the most accurate and highest skill in ceiling diagnosis. Table 9 provides a quick-look summary of the skill scores (steps 1 and 2) and correlation (step 3) for all four algorithms.
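Since the paper's exact formula is not reproduced here, the sketch below uses the common rmse-ratio skill score, which matches the stated maximum of 1.0 (perfect) and yields 0.0 for no improvement over the reference; treat the specific form as an assumption.

```python
def skill_vs_reference(rmse_alg, rmse_ref):
    """Skill of an algorithm relative to a reference, expressed as the
    fractional reduction in rmse.  1.0 = perfect (zero error); 0.0 = no
    improvement over the reference; negative = worse than the reference."""
    return 1.0 - rmse_alg / rmse_ref
```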
Testing individual METAR stations
The performance statistics presented in section 6a provide a general indication of algorithm accuracy and skill for the 18 California locations used in the study. In other words, data records from each METAR station are used in both training and testing of each KDD algorithm. To determine how the three-step cloud ceiling algorithm would perform at locations outside of the training data, a leave-one-station-out analysis was performed. The data records for one METAR station were used to test an algorithm developed from the data records of the remaining 17 locations. If the results of these leave-one-out tests are comparable to the results of the random split, the potential success of the algorithm on unseen data (new locations) will be demonstrated.
As shown in Table 10, the step 1 (ceiling/no ceiling) bias values for the algorithms applied to each individual station are, for the most part, comparable to the bias computed for the random data testing (RANDOM) described earlier. Note that the leave-one-station-out KDD algorithms discussed here are fused-data (GOES-10 and COAMPS) algorithms. While RANDOM had a slight no-ceiling bias (0.91), 13 of 18 station KDD algorithms also exhibited a no-ceiling bias. The extreme biases were 0.56 for KVNY (Van Nuys, California) and 1.74 for KWJF (Lancaster, California). The overprediction of ceilings for KWJF was not surprising, considering the small percentage of true ceiling cases at that station, which is located in the Antelope Valley of the Mojave Desert region of California. In Fig. 6, comparisons of bias between the KDD algorithms and SW99 applied to each station indicate that SW99 has a stronger no-ceiling bias at each station, with the exception of KSFO (San Francisco, California), for which SW99 actually has a stronger ceiling bias.
Accuracy statistics, represented by CSI in Fig. 7, and skill scores, represented by TSS in Fig. 8, suggest that the KDD algorithms perform at a higher level for ceiling/no-ceiling classifications than SW99 at individual station locations. Also, as seen in Table 10, these performance measures for the KDD algorithms are similar in nearly all instances to RANDOM, intimating that the performance of the KDD algorithm applied in RANDOM is representative of the level of performance on unseen data for ceiling/no-ceiling discrimination. Among those statistics that are not as similar are the much higher FAR for those stations with smaller percentage of ceiling records—KWJF, KMOD (Modesto, California), and KBFL (Bakersfield, California). These are the most inland stations. Also, KLPC (Lompoc, California) and KVNY skill scores and accuracy (excluding FAR and % correct) are noticeably lower than RANDOM. These stations had a higher number of missed ceiling cases.
For the low-ceiling/high-ceiling classification step of the algorithm (Table 11), there is more variability (among the stations) in the results as compared with the ceiling/no-ceiling step. This variability could be an indication that the atmospheric characteristics (as seen in GOES-10 and COAMPS data) at a given location are more distinctive when discriminating low and high ceilings than when determining whether a ceiling exists at all. SW99 has a relatively strong high-ceiling bias (underpredicts low ceiling) for all stations (Fig. 9), but the KDD algorithms have a bias close to 1.0 for most stations—similar to RANDOM (Table 11). A notable exception to this is KWJF, which has a large bias (1.98) for overprediction of low ceilings. This bias is due, in part, to the small number of low-ceiling records.
Similar to step 1, the inland stations (KWJF, KBFL, and KMOD), with few low-ceiling records, exhibit relatively high FAR (Table 11). All three stations also have relatively low CSI (in comparison with RANDOM), which is an indication of a higher number of missed low ceilings. This is particularly true for KWJF. The skill scores (ETS, TSS) for this station are extremely low, suggesting almost no skill at all. Apparently, the decision-tree rules developed using data from every station other than KWJF are not sufficient to adequately classify data points at KWJF. This result is probably due to the unique location of KWJF relative to the other stations. The majority of the stations in the leave-one-out tests produce favorable results, indicating the KDD algorithm should work well on new locations. However, the performance can be degraded at locations, such as KWJF, that are unique (in terms of the data parameters) when compared with the training data.
SW99 performance measures (other than FAR) at step 2 are at a lower level than the KDD algorithms. The single exception is KWJF. The % correct and ETS values for the KDD algorithm are lower than SW99 at this station. Figures 10 and 11 are CSI and TSS comparisons and provide representative results of the performance statistics.
For step 3 (low-ceiling-height estimation) the bias for RANDOM (Table 12) was slightly negative (−1.2 m), but the individual station KDD algorithms had biases that ranged from −258.1 m (KWJF) to 231.1 m (KVNY). Bias comparison for KDD versus SW99 is also mixed (see Fig. 12).
While SW99 has much higher errors (rmse; Fig. 13) for individual stations than KDD, SW99 does have a higher CC for a handful of stations (Fig. 14). KDD station algorithms produce higher errors and lower correlations (Table 12) than RANDOM. As was the case for steps 1 and 2, KWJF accuracy statistics compare least favorably to RANDOM. The skill of the KDD algorithms relative to SW99 (Table 12) has values ranging from 0.27 [KSAN (San Diego, California)] to 0.80 [KSMX (Santa Maria, California)], with 12 of the 18 stations having values greater than 0.60.
The C5.0 machine-learning program performs a greedy, top-down search through the space of possible decision trees. That is, at each tree node, the algorithm selects the best possible attribute with which to partition the data. Data partitioning recurses down the links of the tree until a stopping condition is reached. Best attributes are selected according to the information gain achieved when the value of that attribute is known. Information gain is based on the reduction of entropy (Shannon 1948). Given a dataset S with c classes, where $p_i$ is the probability that an element of S belongs to class i, then
$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i.$$
For some attribute A describing the elements of S, where values(A) is the set of possible values of A and $S_\upsilon$ is the subset of S for which attribute A has value υ, then
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{\upsilon \in \mathrm{values}(A)} \frac{|S_\upsilon|}{|S|}\,\mathrm{Entropy}(S_\upsilon).$$
This information, combined with knowledge of the number of cases satisfying a tree node test, gives an indication of the relative importance of each attribute in the classification tree.
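The entropy and information-gain computations described here can be sketched directly. The dict-based record layout is an illustrative assumption, not the C5.0 data format.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(records, attribute, labels):
    """Gain(S, A): entropy reduction from partitioning `records` (a list
    of dicts) on `attribute`, with `labels` the parallel class labels."""
    n = len(records)
    by_value = {}
    for rec, lab in zip(records, labels):
        by_value.setdefault(rec[attribute], []).append(lab)
    # Weighted average entropy of the partitions, subtracted from the total.
    return entropy(labels) - sum(len(subset) / n * entropy(subset)
                                 for subset in by_value.values())
```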
In the C5.0 training phase, the boosting method is used. Boosting is a model-averaging approach that generates multiple classification trees for higher accuracy (Schapire et al. 1998). Table 13 shows the most highly represented root node variables in the set of trees created during the C5.0 boosting process for the California daytime data. It also shows those variables that occur most frequently anywhere in the top three levels of a decision tree. The nodes at the tree roots are selected because they lead to the highest information gain, so we will assume that they are significant contributors to the classifications.
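C5.0's boosting is proprietary, but the general idea, building an ensemble of trees on reweighted data and then inspecting each tree's root-node split variable, can be illustrated with AdaBoost over decision stumps. The synthetic data and ensemble settings here are stand-in assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=500)) > 0  # synthetic two-class target

# Boosted ensemble of depth-1 trees (decision stumps): each round reweights
# the data toward previously misclassified samples.
boosted = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# Root-node split variable of each tree in the ensemble (cf. the root-node
# tallies in Table 13); for stumps the root is the only split.
root_vars = [stump.tree_.feature[0] for stump in boosted.estimators_]
```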
The most highly represented root variable in step 1 (ceiling/no ceiling), GOES-10 visible channel, conforms with our intuition that the visible channel is a strong indicator of cloud presence during the daytime.
To distinguish between low- and high-cloud ceilings (step 2), C5.0 selects the COAMPS 1500–10-m temperature difference (ΔTemp) as the primary determining variable. The connection to the low-versus high-ceiling classification is not intuitively obvious but is discussed later in this section. Not surprisingly, GOES-10 variables dominate step 1, and COAMPS variables are predominant at step 2.
In contrast to the daytime results, Table 14 shows that for nighttime training, when no visible satellite channel is available, cloud coverage, a COAMPS variable based on liquid water and relative humidity, is the predominant variable but is closely tied by the potential temperature and GOES-10 channel-4 (longwave IR) temperature. Once again, ΔTemp is the predominant variable in the classification of low versus high cloud ceilings.
In order to further examine each of the resultant KDD cloud-ceiling-height algorithms, the algorithms are output in rule format for each of the classification steps (ceiling/no ceiling; low ceiling/high ceiling). Hundreds of rules are generated from the C5.0 analysis of the training data for each step. To simplify the rule analysis, only those rules that apply to a high number of training data records at a high confidence are examined. Because the rules reflect the data, and with the aid of domain knowledge, the variables within the rules can be analyzed both to confirm expectations based on the physics of the problem and to discover new insights.
As discussed earlier, GOES-10 variables tend to be most dominant and provide more accurate results for ceiling/no-ceiling classifications (step 1). Two of the most common GOES-10 variables in the rule sets examined here are longwave IR [11 μm (channel 4) and 12 μm (channel 5)] temperature and the longwave IR temperature difference (channel 4 − channel 5). As expected, ceiling cases require lower IR temperature than no-ceiling (clear sky) cases. Two rules for the GOES-10 algorithm provide an example of how different weather situations affect the longwave IR difference variable:
If latitude > 34.43°N and channel 5 > 24.79°C and longwave IR difference > 2.49°C, then no ceiling.
If latitude > 34.43°N and channel 4 > 17.03°C and seasonal date (time of year) ≤ 149.21 and longwave IR difference ≤ 0.79°C, then no ceiling.
Both rules result in no-ceiling classification, but while the longwave IR temperatures (channel-4 and channel-5 variables) are similarly used, the longwave IR difference is required to be significantly different. The top rule requires a larger positive difference, which is a signature for thin cirrus. High clouds are probably present, but either a very high ceiling (no-ceiling class) or no ceiling is present. The bottom rule, with longwave IR difference near zero, is most likely derived from clear-sky situations.
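The two quoted rules can be transcribed as a small predicate. This is purely illustrative; the variable names and the day-of-year encoding of "seasonal date" are assumptions.

```python
def goes10_no_ceiling_rules(lat, ch4, ch5, lw_diff, yday):
    """Illustrative encoding of the two GOES-10 decision rules quoted in
    the text.  Returns True when either rule fires (no-ceiling class).
    Units as in the text: temperatures in deg C, latitude in deg N,
    yday = seasonal date (time of year)."""
    # Rule 1: large positive ch4 - ch5 difference -- thin-cirrus signature.
    rule1 = lat > 34.43 and ch5 > 24.79 and lw_diff > 2.49
    # Rule 2: near-zero difference early in the year -- clear-sky signature.
    rule2 = lat > 34.43 and ch4 > 17.03 and yday <= 149.21 and lw_diff <= 0.79
    return rule1 or rule2
```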
Considering the nature of the data sources (satellite and numerical model) and the demonstrated results, analysis of the COAMPS portion of the rules for low-ceiling/high-ceiling classification (step 2) is discussed here. As was the case in the decision-tree analysis discussed earlier, the variable that occurs most frequently is the temperature difference (ΔTemp) between 10 and 1500 m. For low-ceiling cases, ΔTemp is required to be smaller than for high-ceiling cases. This variable may be reflecting the cool, less stable environment for low-ceiling situations. The steeper lapse rate for high-ceiling situations could be occurring with low-level subsidence or a “dirty ridge” environment. While the dominance of ΔTemp is surprising, this temperature difference does characterize the lower atmosphere to a certain extent. Further investigation into the relationship or correlation of this variable with cloud ceiling height may prove useful. The resultant algorithms in this study are indicating that the lapse rate of the lowest 1500 m of the troposphere is an important parameter to examine when discriminating low-ceiling environments from high-ceiling environments. Two other prominent variables are used as expected in low-ceiling rules. According to the derived rules, the difference in mixing ratio between the surface and 10 m is required to be large for low-ceiling cases. This result is consistent with the expectation that strong upward surface latent heat flux may be required to form low clouds. Also, a high maximum TKE in the PBL is required for certain low-ceiling situations. This is consistent with physical reasoning in that, in a low cloud-topped boundary layer, strong radiative cloud-top cooling and latent heat release generate high levels of TKE.
While some rules need deeper analysis to determine their “basis,” there are physically plausible explanations for these empirical rules. The extracted data relationships are reflective of the data collected and can be analyzed or studied in a number of ways. The purpose of this article is to demonstrate the feasibility of using KDD methodology to establish a cloud ceiling algorithm. Future papers will provide analysis of the algorithms as done briefly in this section, but with more in-depth examination.
Situations arise in which an operational meteorologist requires sensible weather information from remote locations where local observations do not exist. Motivated by military and aviation requirements for more accurate assessment of cloud ceiling, an improved utilization of both satellite data and NWP output in determining cloud ceiling conditions has been developed and demonstrated.
Through the creation of a meteorological research environment that allows for the automation of data collection, processing, and storage (along with satellite and COAMPS parameter visualization and monitoring tools), discovery of data relationships for cloud ceiling diagnosis was successfully undertaken. Through the KDD process, a cloud-ceiling-height-analysis algorithm has been developed, specifically targeting low-cloud-ceiling conditions. This algorithm will be developed into an operational product for the California coast.
The results from analysis of the California daytime dataset demonstrate the potential and viability of using KDD to develop algorithms from NWP and/or satellite data for estimating cloud ceiling conditions. Taking advantage of the unique characteristics of each data type, a KDD-produced algorithm that applies both COAMPS and GOES-10 data performed the best over the entire three-step cloud ceiling estimation system. All three KDD-produced algorithms performed significantly better than the currently operational SW99 algorithm. Similar accuracy and skill scores were produced for the nighttime California ceiling algorithms. This KDD process has also been applied to appropriate satellite data and COAMPS model runs for both the Adriatic Sea region in Europe and the Korean peninsula. Initial testing shows similar results for Adriatic and Korea KDD-produced algorithms.
The initial expectation that a combination of satellite and COAMPS data in a KDD-produced algorithm would produce the highest skill scores was met. It is also worth noting that the skill level of the KDD-produced algorithm using satellite data alone to diagnose quantitative cloud ceiling height is remarkably high. Of course, multilayered cloud situations would present a problem if satellite data were the only data source available.
Leave-one-station-out tests provided another indication of KDD algorithm performance on unseen data or at “new” locations. Almost all of these individual station results compare favorably to the random-split test at all three steps. However, there is some indication (based on KWJF results) that the algorithms could be geographically or climatologically dependent.
The opportunities for future research using the current database include, but are not limited to, data mining for a cloud ceiling algorithm using data from all three regions studied to determine if and how a generalized (not region specific) algorithm can be developed, determining a method to incorporate polar-orbiting satellite data, developing and examining satellite and COAMPS combined forecast (i.e., not diagnostic) algorithms, and data mining for other weather elements, including visibility.
The support of the sponsor, the Office of Naval Research, under Program Element 0602435N is gratefully acknowledged. The assistance of Jeff Hawkins and Joe Turk in the establishment of useful satellite data at NRL, and of John Cook, Sue Chen, Tracy Haack, and other COAMPS researchers (all with NRL), is very much appreciated. Melanie Wetzel of the Desert Research Institute is acknowledged for her contribution in the area of cloud optical depth estimation. Thanks also are given to FNMOC for assistance on acquisition of METAR reports.
List of Acronyms
AI Artificial intelligence
CC Correlation coefficient
CCL Convective condensation level
COAMPS Coupled Ocean–Atmosphere Mesoscale Prediction System
CSI Critical success index
ETS Equitable threat score
FAR False-alarm ratio
FNMOC Fleet Numerical Meteorology and Oceanography Center
GOES Geostationary Operational Environmental Satellite
KDD Knowledge Discovery from Databases
LCL Lifting condensation level
MAE Mean absolute error
MDA Multiple discriminant analysis
METAR Aviation routine weather report
MLR Multiple linear regression
MOS Model output statistics
NOGAPS Navy Operational Global Atmospheric Prediction System
NWP Numerical weather prediction
POD Probability of detection
RH Relative humidity
rmse Root-mean-square error
TKE Turbulent kinetic energy
TSS True skill score
Corresponding author address: Richard Bankert, Naval Research Laboratory, 7 Grace Hopper Avenue, Monterey, CA 93943-5502. email@example.com
COAMPS is a trademark of the Naval Research Laboratory.