• Allett, T., 2004: Linate crash report. Airports Int., 37, 24–25.

• Bremnes, J. B., and S. C. Michaelides, 2007: Probabilistic visibility forecasting using neural networks. Pure Appl. Geophys., 164, 1365–1381, doi:10.1007/s00024-007-0223-6.

• Chmielecki, R. M., and A. E. Raftery, 2011: Probabilistic visibility forecasting using Bayesian model averaging. Mon. Wea. Rev., 139, 1626–1636, doi:10.1175/2010MWR3516.1.

• Dallavalle, J. P., and R. L. Cosgrove, 2005: GFS-based MOS guidance – The short-range alphanumeric messages from the 0000/1200 UTC forecast cycles. Meteorological Development Laboratory Tech. Procedures Bull. TPB 05-03, National Weather Service, 13 pp. [Available online at http://www.nws.noaa.gov/mdl/synop/tpb/mdltpb05-03.pdf.]

• FAA, 2015: Electronic Code of Federal Regulations. [Available online at http://www.ecfr.gov/cgi-bin/text-idx?SID=2b4c7d7a623d7fd9284d18a7c9ed756d&mc=true&node=pt14.2.91&rgn=div5.]

• Fritsch, J. M., and R. Carbone, 2004: Improving quantitative precipitation forecasts in the warm season: A USWRP research and development strategy. Bull. Amer. Meteor. Soc., 85, 955–965, doi:10.1175/BAMS-85-7-955.

• Ghirardelli, J. E., and B. Glahn, 2010: The Meteorological Development Laboratory's Aviation Weather Prediction System. Wea. Forecasting, 25, 1027–1051, doi:10.1175/2010WAF2222312.1.

• Gilbert, K. K., R. L. Cosgrove, and J. Maloney, 2008: NAM-based MOS guidance – The 0000/1200 UTC alphanumeric messages. Meteorological Development Laboratory Tech. Procedures Bull. TPB 08-01, National Weather Service, 11 pp. [Available online at http://www.nws.noaa.gov/mdl/synop/tpb/mdltpb08-01.pdf.]

• Glahn, B., K. Gilbert, R. Cosgrove, D. P. Ruth, and K. Sheets, 2009: The gridding of MOS. Wea. Forecasting, 24, 520–529, doi:10.1175/2008WAF2007080.1.

• Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, doi:10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.

• Gultepe, I., and Coauthors, 2007: Fog research: A review of past achievements and future perspectives. Pure Appl. Geophys., 164, 1121–1159, doi:10.1007/s00024-007-0211-x.

• Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, doi:10.1175/2007MWR2411.1.

• Hamill, T. M., G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J. Galarneau Jr., Y. Zhu, and W. Lapenta, 2013: NOAA's second-generation global medium-range ensemble reforecast dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565, doi:10.1175/BAMS-D-12-00014.1.

• Hansen, B., 2007: A fuzzy logic–based analog forecasting system for ceiling and visibility. Wea. Forecasting, 22, 1319–1330, doi:10.1175/2007WAF2006017.1.

• Mass, C., 2008: The Weather of the Pacific Northwest. University of Washington Press, 281 pp.

• NCDC, 2015: Automated Surface Observing System. National Climatic Data Center. [Available online at http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/automated-surface-observing-system-asos.]

• Novak, D. R., C. Bailey, K. F. Brill, P. Burke, W. A. Hogsett, R. Rausch, and M. Schichtel, 2014a: Precipitation and temperature forecast performance at the Weather Prediction Center. Wea. Forecasting, 29, 489–504, doi:10.1175/WAF-D-13-00066.1.

• Novak, D. R., K. F. Brill, and W. A. Hogsett, 2014b: Using percentiles to communicate snowfall uncertainty. Wea. Forecasting, 29, 1259–1265, doi:10.1175/WAF-D-14-00019.1.

• Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

• Roebber, P. J., S. L. Bruening, D. M. Schultz, and J. V. Cortinas Jr., 2003: Improving snowfall forecasting by diagnosing snow density. Wea. Forecasting, 18, 264–287, doi:10.1175/1520-0434(2003)018<0264:ISFBDS>2.0.CO;2.

• Weick, K. E., 1990: The vulnerable system: An analysis of the Tenerife air disaster. J. Manage., 16, 571–593.

• Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.

• Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390, doi:10.1175/MWR3402.1.

Using Reforecasts to Improve Forecasting of Fog and Visibility for Aviation

Department of Atmospheric Science, Colorado State University, Fort Collins, Colorado

Abstract

Fifteen years of forecasts from the National Oceanic and Atmospheric Administration’s Second-Generation Global Medium-Range Ensemble Reforecast (GEFS/R) dataset were used to develop a statistical model that generates probabilistic predictions of cloud ceiling and visibility. Four major airports—Seattle–Tacoma International Airport (KSEA), San Francisco International Airport (KSFO), Denver International Airport (KDEN), and George Bush Intercontinental Airport (KIAH) in Houston, Texas—were selected for model training and analysis. Numerous statistical model configurations, including the use of several different machine learning algorithms, input predictors, and internal parameters, were explored and verified through cross validation to develop skillful forecasts at each station. The final model was then compared with both probabilistic climatology-based forecasts and deterministic operational guidance. Results indicated significantly enhanced skill within both deterministic and probabilistic frameworks from the model trained in this study relative to both operational guidance and climatology at all stations. Probabilistic forecasts also showed substantially higher skill within the framework used than any deterministic forecast. Dewpoint depression and cloud cover forecast fields from the GEFS/R model were typically found to have the highest correspondence with observed flight rule conditions of the atmospheric fields examined. Often forecast values nearest the prediction station were not found to be the most important flight rule condition predictors, with forecast values along coastlines and immediately offshore, where applicable, often serving as superior predictors. The effect of training data length on model performance was also examined; it was determined that approximately 3 yr of training data from a dynamical model were required for the statistical model to robustly capture the relationships between model variables and observed flight rule conditions (FRCs).

Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/WAF-D-15-0108.s1.

Corresponding author address: Gregory R. Herman, Dept. of Atmospheric Science, Colorado State University, 200 West Lake Street, 1371 Campus Delivery, Fort Collins, CO 80523. E-mail: gherman@atmos.colostate.edu


1. Introduction

Fog, heavy precipitation, and other obstructions to surface visibility pose a significant hazard to both general and commercial aviation. Reduced visibility is frequently responsible for travel delays and has contributed to many of the most devastating aviation disasters in history (e.g., Weick 1990; Allett 2004). Accordingly, the Federal Aviation Administration (FAA) enforces different regulations based on the observed ceiling and visibility at the airport terminal in question (FAA 2015, section 91). Fog is a particularly difficult phenomenon to simulate accurately with an operational dynamical model. This is in part due to the high sensitivity of fog formation to small differences in surface winds, low-level stability, and dewpoint depression, among other factors. However, perhaps the more significant limitation, especially for operational modeling, is the extremely high vertical resolution (on the order of meters or less) required in the lower boundary layer to accurately simulate the phenomenon (e.g., Gultepe et al. 2007). Heavy precipitation and snowfall, other major contributors to reduced ceiling and visibility, have been characterized as two of the most poorly forecast atmospheric fields in current modeling (e.g., Roebber et al. 2003; Fritsch and Carbone 2004; Novak et al. 2014a,b).

The difficulty of dynamically modeling the atmospheric phenomena that impact ceiling and visibility, in conjunction with the ease of observation and verification and the relatively robust understanding of the processes leading to fog and low cloud formation, makes this forecasting problem particularly well suited to statistical forecasting via postprocessing of dynamical model output. Since its operational implementation in 1976, model output statistics (MOS) has been the leading operational technique for statistically postprocessing model output. MOS, essentially a simple yet effective multivariate linear regression approach (Glahn and Lowry 1972), includes categorical forecasts for ceiling and visibility at approximately 1700 stations around the country (Dallavalle and Cosgrove 2005; Gilbert et al. 2008). However, the operational MOS approach has several limitations for this forecasting problem. First, although categorical probabilities are computed, MOS text bulletins include only deterministic visibility and ceiling forecasts, which provide no information about the confidence or uncertainty associated with the forecast. This can substantially limit the utility to end users, particularly those most sensitive to the possibility of fog (such as pilots who are not certified to fly in low-visibility conditions). Second, the forecast problem here is an ordinal classification problem: the goal is to accurately place conditions into one of several discrete bins or categories, with forecasts predicting a category closer to the verifying category being superior to those predicting a more distant one. MOS, however, uses linear regression rather than classification to generate these forecasts; given this mismatch between the class of forecasting problem and the algorithmic technique applied to solve it, one should anticipate diminished results compared with applying a technique appropriate to the problem. Third, MOS regressions are based on a very limited number (15) of predictors, when robust statistical relationships between candidate predictors and the predictand can often be found with many more predictors (Glahn and Lowry 1972). This problem is partially alleviated by MOS's use of regionalized regression equations for both its cloud height and visibility forecasts, particularly for forecasting rare events (Wilks 2011).

Over the past several years, there have been numerous attempts to construct statistical models to improve upon this important forecast problem. Bremnes and Michaelides (2007) built a neural network to generate probabilistic visibility forecasts for two European cities at short (1–6-h) lead times using observations. Hansen (2007) developed a skillful deterministic analog forecasting system for combined cloud ceiling and visibility categories, using model data from the Canadian Meteorological Centre's (CMC) Global Environmental Multiscale Model (GEM). Chmielecki and Raftery (2011) applied the Bayesian model averaging technique with a regional mesoscale ensemble to make 12-h probabilistic visibility predictions for sites in the Pacific Northwest. While each of these studies produced encouraging results, none of them comprehensively examined different approaches to statistical model construction, looked extensively at lead times beyond the appreciable utility of current observations, or demonstrated significantly improved skill relative to operational model guidance.

This study seeks to develop a statistical forecast system to improve upon some of the limitations of MOS ceiling and visibility forecasts, as well as other approaches described in the literature. Operational dynamical model output is ingested and statistically postprocessed to produce probabilistic forecasts of flight rule conditions, based on ceiling and visibility, at several airports around the contiguous United States. Model performance is evaluated to determine an optimal configuration for the forecast system, which is then compared with climatology and operational forecast guidance. Section 2 of this paper outlines the procedure used to develop this model, section 3 presents findings about the climatological characteristics of the stations used for model training in this study, section 4 presents the results obtained from model training and development, section 5 presents results from the final evaluation of the developed forecast system, and section 6 discusses the implications of these findings both to the aviation forecasting problem and to the forecasting community at large.

2. Data and methods

In the United States, the FAA classifies terminal cloud height and visibility conditions into one of four flight rule categories (FRCs): visual flight rules (VFR), marginal visual flight rules (MVFR), instrument flight rules (IFR), and low instrument flight rules (LIFR). Though in FAA definitions MVFR and LIFR conditions are subsets of VFR and IFR conditions, respectively, for the purposes of this study all categories are treated as mutually exclusive. VFR conditions are defined as conditions in which the cloud ceiling is above 3000 ft above ground level (AGL) and visibility is in excess of 5 mi. MVFR conditions are defined as a ceiling from 1000 to 3000 ft AGL or visibility between 3 and 5 mi, inclusive. IFR conditions are defined by a ceiling of at least 500 but less than 1000 ft AGL or a visibility of at least 1 mi but less than 3 mi. Finally, LIFR conditions are defined as those with either a ceiling below 500 ft AGL or visibility less than 1 mi (FAA 2015, section 91).
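These thresholds map directly to a simple categorization routine. The following is a minimal sketch in Python, assuming ceiling in feet AGL and visibility in statute miles; the function name and category labels are chosen for illustration rather than taken from the study's code.

```python
def classify_frc(ceiling_ft, visibility_mi):
    """Map a ceiling/visibility pair to a mutually exclusive FRC.

    The cascade checks the most restrictive category first, so each
    observation falls into exactly one of the four classes.
    """
    if ceiling_ft < 500 or visibility_mi < 1:
        return "LIFR"
    if ceiling_ft < 1000 or visibility_mi < 3:
        return "IFR"
    if ceiling_ft <= 3000 or visibility_mi <= 5:
        return "MVFR"
    return "VFR"
```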

Four major airports were selected for examination in this study: Seattle–Tacoma International Airport (KSEA), San Francisco International Airport (KSFO), Denver International Airport (KDEN), and George Bush Intercontinental Airport (KIAH) in Houston, Texas. Time series of FRCs used for climatology construction and verification were created by inspection of 15 yr of human-augmented 5-min Automated Surface Observing System (ASOS) observations at each airport from January 2000 through December 2014 (NCDC 2015). These data allow the separate determination of both the ceiling and visibility at report time, which were then converted to FRC values using the criteria outlined above. The prevailing visibility, the minimum of the surface sensor reading and the subjectively reported tower visibility, was used for the visibility observations in this study. As noted by an anonymous reviewer, the aviation industry uses runway visual range, which approximates surface visibility but is distinct from prevailing visibility, as the visibility input for determining FRCs. However, inspection of the remarks over a subset of the observing record indicated that use of surface visibility rather than prevailing visibility resulted in a different categorical FRC in only approximately 0.1% of observations, so the use of prevailing visibility in FRC determinations was not considered to appreciably influence the results. Additionally, the ASOS data may be further modified through manual input of cloud ceiling or surface visibility observations from contract weather observers (CWOs) stationed at major airports, including those used in this study. The 5-min time series was then converted to an hourly time series using the minimum observed FRC for each hour of record.
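As a sketch of the observation processing just described, suppose the 5-min FRC record is held in a pandas Series indexed by observation time, with an ordinal encoding in which lower values are more restrictive (e.g., LIFR = 0, IFR = 1, MVFR = 2, VFR = 3); the variable names and encoding are illustrative assumptions.

```python
import pandas as pd

# obs: pandas Series of ordinally encoded 5-min FRCs (LIFR=0 ... VFR=3),
# indexed by observation time. With this encoding, the hourly minimum is
# the most degraded FRC observed during each hour of record.
hourly = obs.resample("1h").min()

# Hour-and-month climatological FRC frequencies over the training period,
# of the kind later used as the reference forecast when computing skill:
clim = (hourly.groupby([hourly.index.month, hourly.index.hour])
              .value_counts(normalize=True))
```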

Fifteen years of reforecasts associated with the National Oceanic and Atmospheric Administration's (NOAA) Second-Generation Global Ensemble Forecast System Reforecast (GEFS/R; Hamill et al. 2013), spanning the same period as the verification record (2000–14), were used to train and evaluate statistical models for FRC forecasting at 15–36-h lead times for KSEA, KDEN, KSFO, and KIAH. It was conjectured that forecasts generated from synoptic-scale variables of a convection-parameterized model such as the GEFS/R would have the most utility at this range of lead times, since these lead times are thought to be sufficiently long for nowcasting based on current observations and trends to have only marginal skill, while still sufficiently short to be of value to the decision-making of many end users in aviation. The GEFS/R dataset contains reforecasts of the February 2012 version of the Global Ensemble Forecast System (GEFS) from December 1984 to the present. Eleven forecast members are included in this dataset for each initialization; the ensemble was rerun once daily for the 0000 UTC forecast cycle. The model is run at T254L42 resolution; many forecast fields are available at this resolution, but some are only available on a 1° × 1° grid. Since forecast variables from the GEFS/R are available only at 3-h intervals, forecasts were produced only for those times, for a total of eight forecasts per initialization (from 1500 UTC on the day of initialization to 1200 UTC on the day after initialization) (Hamill et al. 2013). This policy is also consistent with MOS, the primary operational statistical model guidance for visibility and ceiling forecasts, which also produces forecasts at 3-h intervals centered about the model initialization time. The data were separated into a training period and a test period; the training dataset spans the 12-yr period from 2000 to 2011, while the test dataset encompasses the remaining 3 yr from 2012 to 2014.

A Statistical Aviation Forecasting Model (SAFM) was developed over the training period using cross validation. The training data were further segmented into four 3-yr chunks; four statistical models with identical configurations were trained using three of the four 3-yr chunks, with the remaining chunk used for validation. The most skillful model of those tested during cross validation was selected as the final forecast system (FFS). The optimal model configuration—the configuration corresponding to the FFS—was then retrained over the entire training period and final model evaluation was made using the previously unused test data.
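The cross-validation segmentation described above can be sketched as follows; the year boundaries match the stated 12-yr training period, while the variable names and training interface are illustrative.

```python
# Four 3-yr chunks of the 2000-2011 training period; each candidate SAFM
# configuration is trained on three chunks and validated on the held-out one.
chunks = [range(2000, 2003), range(2003, 2006),
          range(2006, 2009), range(2009, 2012)]

for k, valid_years in enumerate(chunks):
    train_years = [yr for i, c in enumerate(chunks) if i != k for yr in c]
    # fit the candidate configuration on train_years,
    # then score its ARPSS over valid_years
```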

A plethora of SAFM configurations were examined during cross validation. Numerous base statistical algorithms were explored in detail, including logistic regression (LOG_REG), random forests (RAND_FOR), gradient boosting (GRAD_BOOST), K-nearest neighbors clustering (KNN), and support vector machine classification (SVC). Additionally, a final method incorporating all of the individual classifiers into an aggregate ensemble of classifiers was examined, with forecasts from this method generated as an average of all tested classifiers, weighted by the skill of the individual classifiers obtained from model training (WAVG_ALL). Python's Scikit-learn package (Pedregosa et al. 2011) was employed for the implementation of each of these algorithms. For more details about the mechanics, theory, application, and tuning of the algorithms used in this study, please see the online supplement to this article. Beyond algorithmic configurations, numerous different predictor configurations were explored, spanning three "dimensions": 1) atmospheric field/variable selection, 2) field radius, and 3) ensemble information. Atmospheric fields from the GEFS/R model explored in this study are summarized in Table 1. Most predictors were extracted from the GEFS/R's native T254L42 resolution grids (~40-km equivalent horizontal grid spacing at 40° latitude); however, 925-hPa temperature and cloud cover were only provided on a 1° × 1° grid. Predictor radii spanning from zero to five grid boxes from the nearest native GEFS/R grid point to the station being forecast were explored; a radius of zero indicates that only data from the nearest grid point to the forecast station are used for model training, while a radius of five indicates that, for each GEFS/R field, any grid point within five grid points in either direction latitudinally or longitudinally is included, with each spatial location counting as a separate model feature, for a total of 121 features per atmospheric field. Further, with the GEFS/R being an 11-member ensemble, different types of ensemble information were also examined: 1) the control run forecast only (CTRL); 2) the ensemble mean and ensemble standard deviation (MNSPRD); 3) the ensemble median, second-lowest ranking, and second-highest ranking members (CNFDB); 4) all of the individual members, unsorted (i.e., each perturbation number serves as a distinct feature) (MEMS); and 5) all of the individual members, ranked on forecast value (RMEMS) (Hamill et al. 2013). A fourth predictor selection dimension, whether to standardize the predictors prior to ingestion into the statistical algorithm, was also examined. It was found, however, that doing so either had no effect or improved results, depending on the algorithm in question, and thus all features were standardized prior to model training. The dimensionality of this problem was too large to test all possible configurations, so a monotonic relationship was assumed between varying parameters and measured skill, allowing the best-guess optimal configuration to be determined greedily.
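As an illustration of two of these predictor dimensions, the sketch below extracts CNFDB-style features (ensemble median plus the second-lowest and second-highest members) for every grid point within a given predictor radius; the array layout and function name are assumptions, not the study's code.

```python
import numpy as np

def cnfdb_features(field, station_ij, radius=4):
    """field: (n_members, ny, nx) GEFS/R forecasts of one variable.

    Returns a flat feature vector with three values (second lowest,
    median, second highest of the 11 members) per grid point inside
    the (2*radius + 1)-point box centered on the station's grid point.
    """
    i, j = station_ij
    box = field[:, i - radius:i + radius + 1, j - radius:j + radius + 1]
    ranked = np.sort(box, axis=0)   # order the 11 members at each point
    cnfdb = ranked[[1, 5, 9]]       # 2nd lowest, median, 2nd highest
    return cnfdb.reshape(-1)
```

With the radius of four grid boxes ultimately adopted, this arrangement yields 3 × 81 = 243 features per atmospheric field.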

Table 1. Summary of dynamical model fields examined in this study, including the abbreviated symbol by which each variable is referred to throughout the paper, a description of each variable, and the highest resolution at which the field can be obtained from the GEFS/R dataset.

Evaluation of probabilistic ordinal categorical forecasts is a surprisingly complex and challenging task. One would like the uncertainty, or distribution of possibilities, inherent in the probabilistic formulation to be accounted for in the forecast verification process, as opposed to just verifying based on the category with the largest assigned probability. Further, because the categories are ordinal and not nominal, putting forecast probability (FP) in a category closer to the verifying observation should evaluate as more skillful than assigning the same probability to a more distant class; the skill should not simply be a function of the probability assigned to the verifying class. The only metric used previously in the literature that satisfies these requirements is the rank probability score (RPS). For a forecast vector $\mathbf{f} = (f_1, f_2, \ldots, f_K)$ corresponding to the probabilities of the forecast verifying in categories 1 through $K$, and an observation vector $\mathbf{o} = (o_1, o_2, \ldots, o_K)$, with $o_k = 1$ for the verifying category and $o_k = 0$ otherwise, the RPS is given by

$$\mathrm{RPS} = \sum_{k=1}^{K} \left( \sum_{i=1}^{k} f_i - \sum_{i=1}^{k} o_i \right)^{2}.$$

This was aggregated over all $D$ forecasts in the period and compared with climatology to form an aggregated rank probability skill score (ARPSS):

$$\mathrm{ARPSS} = 1 - \frac{\sum_{d=1}^{D} \mathrm{RPS}_d}{\sum_{d=1}^{D} \mathrm{RPS}_d^{\mathrm{clim}}},$$

where $\mathrm{RPS}_d^{\mathrm{clim}}$ is the score of a reference forecast possessing the climatological frequency of observation in each FRC, as discerned from analysis over the training period of this study, at the hour and month corresponding to the forecast time. As with any skill score, a score of 1.0 indicates a perfect forecast, and a score of 0.0 indicates model performance equivalent to acting based on climatology (Wilks 2011).
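A direct implementation of these two equations, written against arrays of forecast probabilities, one-hot observations, and hour/month-matched climatological probabilities (all of shape D × K), might look like the following sketch; the names are illustrative.

```python
import numpy as np

def rps(f, o):
    """Rank probability score of each forecast; f and o have shape (D, K)."""
    cdf_f = np.cumsum(f, axis=1)   # cumulative forecast probabilities
    cdf_o = np.cumsum(o, axis=1)   # cumulative (step function) observations
    return np.sum((cdf_f - cdf_o) ** 2, axis=1)

def arpss(f, o, clim):
    """Aggregated rank probability skill score relative to climatology."""
    return 1.0 - rps(f, o).sum() / rps(clim, o).sum()
```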

3. Results: Station climatologies

Figures 1–4 show the diurnal cycles of FRCs at KSEA, KDEN, KSFO, and KIAH, respectively, in different seasons throughout the year, illustrating the diurnal cycle in (a) January, (b) April, (c) July, and (d) October. Several deductions are apparent by inspection of these figures. All cities examined in this study experience a peak in the frequency of degraded FRCs during the early morning hours in all seasons of the year; this diurnal oscillation is weakest in winter at all stations. The diurnal cycle is very strong at KSFO during the summer season, and also quite strong at KIAH in the spring and at KSEA outside of winter. KSEA tends to have a moderate to high frequency of reduced ceiling and visibility, as evidenced by the elevated frequency of IFR and LIFR conditions relative to the other stations examined. During the winter, MVFR, IFR, or LIFR conditions are roughly as common as VFR conditions, while in other seasons VFR conditions are still dominant. During summer, and to a lesser extent during spring, IFR and LIFR conditions are extremely uncommon during the afternoon and evening hours. Dense fog, as evidenced by LIFR conditions, appears to be most common during autumn mornings. KSFO, in comparison, is not as foggy/cloudy as KSEA at almost any time of the year. LIFR conditions are very rare outside of spring mornings. However, reduced FRCs are still very common during summer mornings, with MVFR and IFR conditions occurring in the majority of cases. KIAH, despite being located in a rather different climate, exhibits many properties similar to KSEA. IFR and LIFR conditions are very rare during the summer and are most common during the winter. Autumn and especially spring mornings are also common times for reduced FRCs. KDEN, the only station with no marine influence, experiences predominantly VFR conditions at all times of day and year, much more so than any other station studied here. The most substantial change relative to the other stations is the reduction in MVFR and IFR frequency, with the LIFR frequency actually remaining comparable to that at other stations, suggesting that a combination of dense fog, heavy rainfall, and snowfall is perhaps as common as at other locations, but general cloudiness is less common.

Figures 5a–d depict the seasonal cycle of the FRCs at the KSEA, KDEN, KSFO, and KIAH stations, respectively. There are some differences in the seasonal cycles between the stations. Beyond what has already been described from Figs. 1–4 above, KSEA attains peak cloudiness during the late autumn and early winter, while KIAH's peak lags by several months, occurring in winter and early spring. KDEN does not experience a significant seasonal cycle, though the mid- to late summer has even more frequent VFR conditions compared with the surrounding months. KSFO experiences two peaks, weaker than those observed at KSEA and KIAH, the first occurring during the winter and the second during the summer. The winter peak, however, has a higher frequency of LIFR conditions than does the summer peak.

Fig. 1. Climatological conditions at KSEA for the diurnal cycle of FRCs as a function of season based on years 2000–14: (a) January, (b) April, (c) July, and (d) October. Relative frequencies of FR classifications are filled in green for VFR, blue for MVFR, red for IFR, and pink for LIFR conditions. All times are local.

Fig. 2. As in Fig. 1, but for KDEN.

Fig. 3. As in Fig. 1, but for KSFO.

Fig. 4. As in Fig. 1, but for KIAH.

Fig. 5. Seasonal climatology of FR conditions at (a) KSEA, (b) KDEN, (c) KSFO, and (d) KIAH based on 2000–14 observations. Relative frequencies of VFR, MVFR, IFR, and LIFR conditions are filled in green, blue, red, and pink, respectively. Frequency distributions for a given month are averaged over all hours of the day.

4. Results: Cross validation

Figure 6 depicts the skill score sensitivity to two of the three dimensions of the model input predictor selection; the neglected dimension is the effect of the addition of individual GEFS/R model fields to the cross-validation ARPSS. The values in Fig. 6 have had the mean of each data series removed to allow for a more appropriate comparison between the two skill score series. The predictor radius, or how many grid points from the forecast station are considered, is found to have a very substantial impact, with a change in ARPSS approaching 0.1 between considering only the local grid point and considering all forecast values within an 11 × 11 point box centered about the point nearest the forecast station. Increasing the predictor radius was found to continuously add value, though the magnitude of the increase decreased as the radius increased, indicating model robustness to overfitting and valuable relationships between observed FRCs and remote predictors; this is explored in more detail in subsequent sections. The type of ensemble information included played some role as well, though the sensitivity was an order of magnitude less than that seen with the predictor radius. In general, including individual members (MEMS) yielded only marginally better performance than including just the control run data, which is not necessarily surprising for an underdispersive ensemble with only initial-condition perturbations such as the GEFS/R. Compared with CTRL, MNSPRD also added only marginal value, but CNFDB, which includes the median, second-lowest, and second-highest forecast values as input predictors, appreciably enhanced skill, by approximately 0.01. Including a ranking of all 11 ensemble members improved cross-validation skill only minimally beyond the three-predictor CNFDB approach. The results shown in Fig. 6 correspond to the sensitivity at KSEA for the GRAD_BOOST algorithm; incomplete tests of the same manner were performed for different locations and algorithms and indicated similar results. Results of cross validation on algorithmic parameter tuning are summarized in the online supplement to this article.

Fig. 6. ARPSS sensitivity to ensemble information type and predictor radius is shown in red and blue, respectively. The mean skill score has been removed from each to give an accurate comparison of the respective sensitivities; the zero line corresponds to the mean value of each set of sensitivity experiments. Sensitivity tests shown here are performed at KSEA for the GRAD_BOOST algorithm.

Results of experiments conducted to assess the sensitivity of forecast skill to SAFM training data length are presented in Fig. 7. By inspection, it is apparent that, for all forecast stations used in this study, no discernible increase in forecast skill is gained by allowing the statistical model to train over the 12 yr of forecast data utilized throughout this study compared with training over a truncated period of only 6 yr. Similarly, while slight improvements are seen at each station when expanding from 3 yr of training data to 6 yr, none of these increases was found to be statistically significant. However, a very different pattern emerges beyond the truncation to 3 yr. The truncation from 3 yr of training data to 1.5 yr decreased the skill scores by approximately 10%–20% at each station, with another 5%–44% decrease upon a further factor-of-2 truncation to 0.75 yr, yielding net skill differences between 0.75 and 3 yr of training data ranging from 21% at KSEA to almost 52% at KIAH. All of these differences are found to be statistically significant. These findings speak to the value of reforecasts. Even though the more restrictive FRCs are not especially common, they are not especially rare either, with IFR or LIFR conditions prevailing approximately 2%–20% of the time, depending on the station and time of year. Yet 3 yr of training data are still required to adequately capture the statistical relationships between the dynamical model variables and the observed FRCs at each station. Especially since operational models are often updated more frequently than triennially, and these upgrades may sever the relationship between model variables and observations by eliminating old biases and introducing new ones, this finding suggests that conducting reforecasts of new model implementations may be necessary to realize the full potential of statistical models for FRC forecasting and many other applications.

Fig. 7. ARPSS sensitivity to model training data length. Results from KSEA, KDEN, KSFO, and KIAH are shown in red, blue, green, and yellow, respectively. Skill scores were obtained through cross validation using three-quarters of the allotted training data, excluding the quarter of the training data that includes the day tested, to train the SAFM. Verification for the 1.5–12-yr training lengths is conducted over the same period, with the 1.5-yr period spanning January 2000 through June 2001; the 0.75-yr training length verification was performed over the first half of this period. All skill scores reflect those obtained from the RAND_FOR classifier. Error bars denote 95% confidence bounds obtained via bootstrapping.

The final forecast system consisted of a weighted average of the FPs provided by four classifiers: 1) KNN, 2) RAND_FOR, 3) GRAD_BOOST, and 4) SVC. LOG_REG, along with other approaches attempted briefly such as pure decision trees, produced demonstrably inferior results to the aforementioned four and was not considered to merit the added computational expense of inclusion in the WAVG_ALL model. The ensemble information used for each classifier was CNFDB, with a predictor radius of four grid boxes. Though RMEMS and higher predictor radii (five was the highest tested) produced superior cross-validation skill scores, the difference was marginal and the additional computational resources, particularly memory, were quite significant, with the number of features increasing by factors of 11/3 and 121/81, respectively. Weighting was based on the cross-validation scores appearing in Fig. 8.
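A minimal sketch of the WAVG_ALL combination, assuming each component classifier's class probabilities are already in hand and the weights come from the cross-validation skill scores in Fig. 8; the names and interface are illustrative.

```python
import numpy as np

def wavg_all(component_probs, cv_skill):
    """component_probs: list of (n_forecasts, K) class-probability arrays,
    one per classifier (KNN, RAND_FOR, GRAD_BOOST, SVC).
    cv_skill: cross-validation ARPSS of each component, used as its weight.
    """
    w = np.asarray(cv_skill, dtype=float)
    w /= w.sum()                             # normalize weights to sum to 1
    stacked = np.stack(component_probs)      # (n_components, n_forecasts, K)
    return np.tensordot(w, stacked, axes=1)  # weighted mean over components
```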

Fig. 8. ARPSSs evaluated over the training period and obtained by cross validation. Values shown correspond to the scores obtained from the greedily determined optimal algorithmic parameter configuration. KNN corresponds to the K-nearest neighbors algorithm, GRAD_BOOST to the gradient boosting classification algorithm, RAND_FOR to the random forest classification algorithm, SVC to the support vector classification algorithm, and WAVG_ALL to the scores obtained from a weighted average of the above classifiers, with the weights determined by the cross-validation skill scores appearing in this figure.

5. Results: Final forecast system

To gain an appreciation for the otherwise black-box nature of the FFS, feature importances from the random forest component of the systems are included in Figs. 9–12. While not shown, similar metrics for other components of the FFS tended to produce fairly similar results. Figures 9–12 correspond to the feature importances of the systems at KSEA, KDEN, KSFO, and KIAH, respectively. The first striking observation from these figures is that the classic pattern of concentric isopleths of importance centered about the station location, with importance decreasing with distance, is a very rare signature in these plots; such "classic" patterns appear only for dewpoint depression at KSFO and total cloud cover at KIAH. This suggests that the SAFM is discerning statistical relationships between station FRCs and other atmospheric fields at spatially distant locations. This is also consistent with the significant improvement in ARPSS with increasing predictor radius observed in the cross-validation phase. In general, absolute values of surface temperature, moisture, or pressure (MSLP) are not found to be important predictors of FRC conditions. This makes sense: temperature and specific humidity exhibit a strong seasonal cycle, so their raw values are then only a good indicator of the time of year, while MSLP fluctuates for many reasons, only some of which relate significantly to the environmental favorability of degraded FRCs. The two places these fields do appear to have some elevated importance are MSLP at KSFO, several degrees north of the airport, and interior surface temperature at KIAH, again to the north of the station.
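For reference, impurity-based (information gain) importances of the kind pictured in Figs. 9–12 are exposed directly by Scikit-learn's random forest, already normalized to sum to one. The sketch below, with assumed training arrays and feature layout, shows how they can be recovered and regrouped for mapping.

```python
from sklearn.ensemble import RandomForestClassifier

# X_train: (n_samples, n_features) standardized predictors; y_train: FRCs.
# Both are assumed to exist; n_estimators is an illustrative choice.
rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
importances = rf.feature_importances_   # information-gain based, sums to 1.0

# With CNFDB features and a radius of four grid boxes, the flat vector can
# be regrouped as (field, ensemble statistic, ny, nx) for plotting, e.g.:
# per_field = importances.reshape(n_fields, 3, 9, 9)
```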

Fig. 9. Feature importances for FR classification at KSEA corresponding to the random forest component of the forecast system trained over the entire training period. Numbers represent the magnitude of change in the entropy (information gain) of the random forest by removal of the atmospheric variable at the depicted location from the forecast system; numbers are normalized so that the summed influence of all features of the algorithm is unity. Yellows indicate relatively important features, while dark magentas suggest unimportant features. Using the variable notation indicated in Table 1, all plots depict the importance of the ensemble median for the atmospheric fields of (a) U10, (b) V10, (c) UV10, (d) WME80, (e) TDD, (f) T2M, (g) TSK_SL, (h) T2M_SK, (i) T925_2, (j) Q2M, (k) MSLP, (l) SSHF, (m) SLHF, and (n) CLDC. Numbers to the bottom right of each subplot indicate the relative rank in area-averaged feature importance for the corresponding atmospheric field, including both the low- and high-confidence bounds in addition to the pictured ensemble median importances.

Fig. 10. As in Fig. 9, but for KDEN.

Fig. 11. As in Fig. 9, but for KSFO.

Fig. 12. As in Fig. 9, but for KIAH.

At KSEA, low-level winds appear to be somewhat important predictors of FRCs, though not necessarily in the locations one would anticipate. The zonal component of the wind (Fig. 9a), the meridional component of the wind (Fig. 9b), and the surface wind speed (Fig. 9c) all show some enhanced importance immediately to the west of KSEA; this may be because the overwater winds show more variability, both in the observations and in the model, than winds directly over the station, and thus calm forecast winds over water may serve as better indicators of calm winds at KSEA than the winds forecast at nearer points. This also likely explains the swath of enhanced importance of the meridional wind extending well to the north. Interestingly, the most important component of the influence of surface wind on FRC prediction at KSEA identified by the RAND_FOR classifier appears to be the zonal wind component in the vicinity of the Columbia Gorge, several hundred kilometers to the southeast. The gorge being perhaps one of the largest gaps in the Cascade Range, strong easterly wind surges resulting from significant across-mountain pressure gradients are best resolved there in the GEFS/R. In the late autumn and early winter, persistent low clouds and fog often form east of the Cascades, and the same air can be advected to KSEA through these easterly surges. The temperature difference between the surface and 925 hPa (Fig. 9i) also appears to be an important FRC indicator at KSEA. While this lapse rate appears to be relatively important throughout much of the domain, it is most important off the coast of southwestern Washington and northwestern Oregon, near the mouth of the Columbia River. This peculiar signal may result from indications of a marine push or some other local phenomenon (Mass 2008). Cloud cover (Fig. 9n), the closest direct model proxy to an FRC forecast, is also, as expected, an important FRC indicator; it follows a more classic signal, with a swath of enhanced importance throughout the Puget Sound and interior Washington lowlands and diminishing importance outside this region. Dewpoint depression (Fig. 9e) is also, as one would intuitively expect, an important FRC predictor, but TDD is most important at the mouth of the Columbia and secondarily where Puget Sound intersects the Strait of Juan de Fuca. Perhaps, as seen with the zonal wind (Fig. 9a), these nearby marine locales are simply better indicators of the local TDD at KSEA than the local forecast value, as fewer terrain-based complications yield more accurate forecasts there. Last, latent heat flux (Fig. 9m), especially over the Olympic Mountains, appears to be a quite important feature for KSEA FRC forecasts, having the second-highest area-averaged importance.

Zonal winds (Fig. 10a) along the Front Range appear to be a fairly important FRC predictor at KDEN, perhaps related to the cloud formation observed in association with upslope winds. Meridional wind (Fig. 10b) and surface wind speed (Fig. 10c) appear to exhibit less importance overall but are still somewhat important in the immediate vicinity of the station, perhaps reflecting the correspondence between low wind speed and low cloud formation. Because of its nonphysicality at KDEN, where the station elevation lies above the 925-hPa level, T925_2 (Fig. 10i) is not found to be an important FRC predictor anywhere in the vicinity of the airport; instead, the difference between surface air and skin temperatures (Fig. 10h) serves as a much better proxy for low-level stability here and is considerably more important, being the second-most-important atmospheric field in area-averaged terms. The field with the highest feature importances, dewpoint depression (Fig. 10e), has importances following a similar spatial pattern to T2M_SK. The high overall importance of TDD is not at all surprising, but the importance maximum well to the north, over Cheyenne, Wyoming, and vicinity, is rather unexpected. Cloud cover (Fig. 10n), while ranking only fourth in area-averaged importance, has a broad region of high importance to the north and east of KDEN, exhibiting a similar spatial pattern to T2M_SK. This northward displacement of importance is observed in T2M_SK (Fig. 10h), TDD (Fig. 10e), SSHF (Fig. 10l), and CLDC (Fig. 10n), and the spatial pattern suggests a higher correspondence between station FRCs and GEFS/R forecast values over the Cheyenne Ridge than directly over KDEN. This may relate to the resolution of the terrain of the South Platte valley and vicinity in the GEFS/R, but may also relate to the depth and persistence of cool, moist air dammed along the Front Range, with a farther spatial extent corresponding to denser, longer-lived low clouds.

KSFO presents several peculiarities relative to those previously observed at KSEA and KDEN. Surprisingly, the zonal wind (Fig. 11a) appears to be a rather unimportant feature for KSFO FRC prediction and even exhibits a reversed version of the classic signal, with feature importance largely increasing with distance from the station, especially to the south. It was anticipated that U10 would be an important indicator of advection fog, but it is possible that the GEFS/R resolves the local features of the Bay Area, such as the mountains directly to the west of KSFO, quite poorly, resulting in poor predictive skill. The meridional wind (Fig. 11b), in contrast, is a very important predictor in the model, but the feature importance maximum is displaced to the northeast of the airport, in the gap between the northern and southern coastal ranges. It is conjectured that, this being a large, resolvable mountain gap region, calm winds in this vicinity may be a more robust and consistent indicator of calm winds conducive to fog formation over KSFO, similar to the hypothesized reasoning for the wind importance displacement observed at KSEA. Somewhat as at KSEA, the offshore low-level lapse rates, as represented in the T925_2 (Fig. 11i) and T2M_SK (Fig. 11h) fields, also appear to be significant indicators of favorability for degraded FRCs at KSFO. This makes a great deal of sense, as they indicate conditions conducive to the formation of fog immediately offshore, which can then be advected over land, including over the airport. Unsurprisingly, TDD (Fig. 11e) is again a very important indicator of FRC conditions and exhibits a very classic importance signal, with importance decreasing radially from the forecast location. Latent (Fig. 11m) and, to a lesser extent, sensible (Fig. 11l) surface heat fluxes along the coast are moderately important predictors of FRCs, perhaps serving as good indicators of the favorability of low cloud and fog formation in that region. Perhaps the most surprising outcome of Fig. 11 is the volte-face in the importance of cloud cover (Fig. 11n) compared with the models trained for KSEA and KDEN, with very low feature importances throughout the region. This perhaps suggests lower predictability of clouds over the Bay Area, or reflects the smaller-scale advective processes responsible for the majority of low clouds and fog at KSFO, which cannot be adequately resolved by the GEFS/R.

Figure 12 depicts the feature importances at KIAH. Unlike at the previously examined stations, both components of the surface wind (Figs. 12a,b), T925_2 (Fig. 12i), and T2M_SK (Fig. 12h) were determined to be fairly unimportant predictors of KIAH FRCs. Surface wind speed (Fig. 12c) and T2M (Fig. 12f) were instead found to be more significant, especially farther inland, to the north of the station. These may act as a proxy for the strength of the land and sea breezes that are a significant source of fog and low clouds in Houston and vicinity, but could instead be an indicator of stalling fronts draped along interior Texas, another relatively common phenomenon. TDD (Fig. 12e) is again found to be an extremely important predictor of KIAH FRCs, and further, for the same reasons as at the Pacific coastal airports, the TDDs along the coast are again found to be most important. Latent heat fluxes (Fig. 12m) off the coast are similarly found to be quite significant at KIAH. Cloud cover (Fig. 12n) is found to be a somewhat important atmospheric field in parts of the domain as well, exhibiting a classic pattern, with forecast cloud cover becoming quite unimportant with distance from the station. This perhaps suggests higher predictability of clouds and associated FRCs at KIAH, which is also consistent with the slightly elevated skill scores at KIAH discussed below.

The results of the FFS validated over the test period are presented in Fig. 13. Skill scores range from 0.3 to 0.4, with the least skill relative to climatology at KSEA and the largest climatology-relative skill at KIAH. This may suggest lower predictability of low clouds and fog in Seattle; that the test period featured anomalously foggy winters may have also contributed to the reduced score to some extent. Although skill scores do not inherently carry concrete quantitative meaning, skill scores in this range indicate that the FFS has substantial value, providing much more skillful forecasts than climatology. Inspection of Fig. 13 also reveals that the FFS performs better than any individual component at all stations; noting further that the Fig. 13 ARPSS values are in line with or better than the scores obtained in the cross validation displayed in Fig. 8, this suggests that the FFS is not substantially overfit. To draw a fair comparison with current operational forecast guidance, and to gain an estimate of the difference in forecast skill between probabilistic and deterministic forecasts, albeit using a probabilistic verification framework, the probabilistic forecasts produced by the FFS were transformed into deterministic forecasts by placing all FP into the class assigned the highest FP in the probabilistic forecast. The final verification of these forecasts, in addition to the verification of the operational Global Forecast System (GFS) MOS and North American Mesoscale (NAM) MOS, appears in Fig. 13. At least within an ARPSS framework, it is apparent from inspection of the figure that both GFS MOS and NAM MOS exhibit almost no skill relative to climatology, with the only positive scores occurring at KDEN and substantially negative scores appearing at KSFO. The individual components of the FFS verify somewhat better, with at least marginal climatology-relative skill observed at all stations except KSEA. However, a dramatic improvement is observed using the weighted average of the components, as in the FFS; statistically significantly positive skill scores are obtained at all four stations, with ARPSS values of 0.08, 0.19, 0.23, and 0.27 at KSEA, KSFO, KDEN, and KIAH, respectively. These results suggest that the FFS exhibits statistically significant skill over both climatology and other available forecast guidance at FRC forecasting within both a deterministic and a probabilistic framework. The results also illustrate the highly statistically significant skill added by communicating forecast uncertainty in forecast dissemination, by providing categorical FPs rather than only a single best-guess categorical forecast. Comparing the deterministic and probabilistic forecasts directly within Fig. 13, ARPSS values ranged from 0.13 to 0.23 higher for the probabilistic than for the deterministic versions of the FFS forecasts, with the smallest difference observed at KIAH and the largest at KSEA. This is also consistent with the suggestion of higher forecast uncertainty at KSEA, as more value is added from probabilistic forecasting in a high-uncertainty environment than in a low-uncertainty setting.
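The probabilistic-to-deterministic transformation described above amounts to a one-hot reassignment of probability; a minimal sketch, with illustrative names:

```python
import numpy as np

def to_deterministic(probs):
    """Assign all probability to each forecast's most probable FRC.

    probs: (n_forecasts, K) array of categorical forecast probabilities.
    """
    det = np.zeros_like(probs)
    det[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
    return det
```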

Fig. 13.

ARPSSs over the test period for both deterministic and probabilistic forecasts at all analyzed stations. The forecasts from the final forecast system, determined by cross validation over the training period, appear in green and purple for the probabilistic and deterministic versions of the system’s forecasts, respectively; skill scores of the individual components of the forecasting system are included for comparison. GFSMOS (blue) and NAMMOS (red) correspond to the test-period verification of the operational GFS MOS and NAM MOS, respectively. Error bars correspond to 95% confidence bounds obtained by bootstrapping.


While not shown, the different forecast systems exhibit very different bias characteristics. At each station examined, both MOS systems are positively biased to varying degrees, in some cases forecasting IFR and LIFR conditions much more often than is observed. The deterministic FFS, in contrast, tends to be negatively biased, with the transition from a probabilistic to a deterministic framework usually resulting in increased confidence in the climatologically most frequent FRC: VFR. The forecast probabilities in the probabilistic FFS tend to be approximately proportional to the observed frequencies of the categorical FRCs, indicating a fairly unbiased forecast system. Because the ARPSS is an approximately strictly proper scoring metric (e.g., Wilks 2011), the biases observed in the deterministic systems, which represent FPs different from their true expected probabilities of verification, result in diminished skill scores for those systems.
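
The bias behavior described here can be quantified with a simple per-category frequency bias (forecast count divided by observed count). A minimal sketch with a hypothetical function name and toy data; values above 1 correspond to the over-forecasting of IFR/LIFR noted for the MOS systems:

```python
import numpy as np

def frequency_bias(forecast_classes, observed_classes, n_classes=4):
    """Per-category frequency bias: forecast count / observed count.
    Values > 1 indicate over-forecasting of that category."""
    fc = np.bincount(forecast_classes, minlength=n_classes).astype(float)
    oc = np.bincount(observed_classes, minlength=n_classes).astype(float)
    return np.divide(fc, oc, out=np.full(n_classes, np.nan), where=oc > 0)

# Illustrative: a system that forecasts IFR (class 2) too often shows a
# bias > 1 there, mirroring the positive MOS biases described above.
fcst = np.array([0, 0, 2, 3, 2, 1])
obsv = np.array([0, 0, 0, 2, 1, 1])
print(frequency_bias(fcst, obsv))
```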

6. Discussion and conclusions

A SAFM was developed using 15 yr of GEFS/R data to improve FRC forecasting. Four major airports—KSEA, KDEN, KSFO, and KIAH—were selected as a representative sample for further study. The same length of METAR record was examined to determine explicit FRC climatologies at the four stations of interest. Many statistical model configurations were explored and verified through cross validation to develop skillful probabilistic FRC forecasts at each station. The final model was then compared with both probabilistic climatology-based forecasts and deterministic operational guidance. The final system showed statistically significantly enhanced skill within both deterministic and probabilistic frameworks relative to both operational guidance and climatology at all stations. Probabilistic forecasts showed substantially higher skill within the framework used than any deterministic forecast, indicating that the FFS captures forecast uncertainty accurately and illustrating the considerable forecast capability added by using reliable, resolved forecast probabilities. This study also illustrates the potential of a long record of consistent forecasts such as that provided for the GEFS by GEFS/R. It was found that at least 3 yr of consistent model data are necessary to fully capture the relationship between model predictors and observed station FRCs, highlighting the importance of reforecasts in providing the additional data needed to accurately diagnose these relationships.

Several other findings are also of note. Perhaps in part because of the coarseness of the GEFS/R model used here, significant relationships were found between local station FRCs and far-field dynamical model forecasts of numerous atmospheric variables. These findings suggest that previous work on statistical postprocessing of small-scale processes may have taken too narrow a geographic focus and may have missed important statistical relationships between dynamical model output and observations of the small-scale phenomenon. The nonintuitive nature of some of the discerned relationships also illustrates the forecasting capability that a statistical model may yield but that a human forecaster may overlook. Many previous statistical models appearing in the literature and, to a lesser extent, in operations use the ensemble mean and spread of a given atmospheric field from a dynamical model as input features/predictors for a statistical model (e.g., Hamill et al. 2008; Wilks and Hamill 2007). This study’s results suggest that the median and confidence bounds, or a ranked ordering of the individual ensemble members, may serve as more robust predictors, and that using this information instead may improve statistical model skill. Most statistical models appearing in the literature either select a single algorithmic approach and tune it to the extent possible or train several different algorithms and select the best-performing model. This paper is among the first to form predictions from an ensemble of statistical classifiers. Within a probabilistic framework, improved skill relative to any individual component is observed, similar to the typically improved error statistics of a dynamical ensemble mean compared with any individual member. The difference in skill within the probabilistic framework is small, however, and it can reasonably be debated whether the added expense in training is worth the additional skill relative to the best-performing component. Within a deterministic framework, however, which many end users will continue to interpret and use even if probabilistic information is provided, the results are dramatically improved by the ensemble-of-classifiers approach relative to any individual component: at each station, either the ensemble's skill scores are 200%–300% or more better than those of the best component verified deterministically, or the ensemble is the only forecast to produce deterministic forecasts exhibiting climatology-relative skill.
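
The suggested switch from mean/spread predictors to median and confidence-bound predictors is easy to illustrate. In this sketch the ensemble values are synthetic, and the 10th and 90th percentiles stand in for the confidence bounds; a single outlying member inflates the mean and spread while leaving the percentile features nearly unchanged:

```python
import numpy as np

def mean_spread_features(members):
    """Traditional predictors: ensemble mean and standard deviation."""
    return np.stack([members.mean(axis=0), members.std(axis=0)], axis=-1)

def percentile_features(members):
    """Predictors suggested here: median plus approximate confidence
    bounds (10th and 90th percentiles), more robust to outlying members."""
    return np.stack(np.percentile(members, [10, 50, 90], axis=0), axis=-1)

# Hypothetical ensemble: 11 members forecasting one field at 4 grid points.
rng = np.random.default_rng(1)
members = rng.normal(loc=280.0, scale=2.0, size=(11, 4))
members[0] += 15.0  # a single outlier member inflates the mean/spread ...
print(mean_spread_features(members))
print(percentile_features(members))  # ... but barely moves the percentiles
```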

There are also some takeaways regarding the proxy variables for the formation of fog, low clouds, and other obstructions to visibility identified by the SAFM trained herein. In all areas studied, surface dewpoint depression was among the most important predictors of FRCs, likely because of its known correspondence with surface and low cloud formation. The direct GEFS/R cloud cover forecasts were typically the second-most-important forecast variable after dewpoint depression among those examined, except at KSFO, where they were found to have very weak correspondence with observed FRCs. In coastal areas, dewpoint depression, latent heat flux, and the differences between the 2-m air temperature and both the 925-hPa and surface/skin temperatures along and just offshore appear to be particularly important proxies for FRC observations, likely because of cloud formation in, and subsequent advection from, those regions. Surface winds were usually somewhat important, but perhaps less so than one would expect, as they were consistently found to be less important than some of the thermodynamic and moisture variables. Strong, nonintuitive relationships were also identified between local station FRCs and nonlocal values of the wind fields.
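
For concreteness, the derived thermodynamic predictors named above follow directly from native fields. A minimal sketch using synthetic values and assuming the definitions implied by the variable names (these definitions are inferred, not quoted from the paper):

```python
import numpy as np

# Synthetic native fields at three grid points (all in K).
t2m   = np.array([285.0, 281.5, 279.0])   # 2-m temperature
td2m  = np.array([283.0, 281.0, 278.8])   # 2-m dewpoint
tskin = np.array([286.0, 280.0, 278.0])   # surface/skin temperature
t925  = np.array([284.0, 283.0, 280.0])   # 925-hPa temperature

tdd    = t2m - td2m    # dewpoint depression; small values favor fog/low cloud
t2m_sk = t2m - tskin   # 2-m air minus skin temperature contrast
t925_2 = t925 - t2m    # 925-hPa minus 2-m temperature (low-level stability)
print(tdd, t2m_sk, t925_2)
```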

This study leaves open many promising avenues for continued research. The models built in this study are quite naïve, primarily using only “native” model variables from the GEFS/R dataset and using no specific knowledge of local topography or meteorology. Several of the derived variables, such as dewpoint depression, were consistently indicated as being among the most important features for FRC forecasting. It is anticipated that an improved predictor set of derived variables, each with an a priori expectation of a physical or statistical relationship with the station’s FRCs, rather than native variables with no such a priori expectation, could notably improve the quality of the FFS. Some of the hypotheses regarding why the algorithms selected particular features, especially at remote locations, could be explicitly tested by appropriately subsetting the model training data; see the sketch following this paragraph. Doing so could help distinguish the findings identified here that reflect robust physical relationships in the model from those that are merely statistical artifacts of the analysis. The findings from this study, including the robust statistical relationships between local predictands and far-field predictors and the added predictive ability of using confidence bounds (approximately the ensemble 10th and 90th percentiles, in addition to the ensemble median) as statistical model predictors compared with the more traditional use of ensemble mean and spread, may remain valid for a suite of other forecasting applications beyond FRC forecasting. This would have far-reaching implications for the forecasting community at large; future research geared toward other forecasting problems should examine these questions and attempt to corroborate these findings. Last, statistical models such as the SAFM trained in this study are much simpler to develop and validate than modifications to a dynamical model; models like those presented here are readily implementable in an operational setting and have been demonstrated to exhibit significantly enhanced skill relative to the operationally available forecast guidance.
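
As one example of the subsetting experiment proposed above, a far-field hypothesis could be tested by masking the suspect region out of the predictor matrix, retraining, and comparing cross-validated skill. Everything in this sketch (function name, grid, bounds) is hypothetical:

```python
import numpy as np

def mask_region(X, lat, lon, lat_bounds, lon_bounds):
    """Zero out all predictors whose grid points fall inside a lat/lon box,
    so a retrained model cannot draw on that region."""
    in_box = ((lat >= lat_bounds[0]) & (lat <= lat_bounds[1]) &
              (lon >= lon_bounds[0]) & (lon <= lon_bounds[1]))
    X_masked = X.copy()
    X_masked[:, in_box.ravel()] = 0.0
    return X_masked

# Toy grid of 10 x 12 points; predictors flattened to one row per case.
lat, lon = np.meshgrid(np.linspace(25, 35, 10), np.linspace(-100, -90, 12),
                       indexing="ij")
X = np.random.default_rng(2).normal(size=(100, lat.size))

# E.g., remove a coastal box, retrain, and compare skill with the original.
X_no_coast = mask_region(X, lat, lon, (25, 29), (-97, -93))
print(X.shape, X_no_coast.shape)
```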

Acknowledgments

The authors thank Susan Van den Heever, Philip Schumacher, Erik Nielsen, and two anonymous reviewers for their insightful input on the scientific research conducted and on helpful revisions to the original manuscript. This work derived from a class project for ATS 712 at Colorado State University; the authors would also like to thank the entire class for their insights and suggestions. This research would not have been possible without the METAR data access provided by the National Climatic Data Center and GEFS/R model data generously supplied by the Earth System Research Laboratory.

REFERENCES

  • Allett, T., 2004: Linate crash report. Airports Int., 37, 24–25.

  • Bremnes, J. B., and S. C. Michaelides, 2007: Probabilistic visibility forecasting using neural networks. Pure Appl. Geophys., 164, 1365–1381, doi:10.1007/s00024-007-0223-6.

  • Chmielecki, R. M., and A. E. Raftery, 2011: Probabilistic visibility forecasting using Bayesian model averaging. Mon. Wea. Rev., 139, 1626–1636, doi:10.1175/2010MWR3516.1.

  • Dallavalle, J. P., and R. L. Cosgrove, 2005: GFS-based MOS guidance – The short-range alphanumeric messages from the 0000/1200 UTC forecast cycles. Meteorological Development Laboratory Tech. Procedures Bull. TPB 05-03, National Weather Service, 13 pp. [Available online at http://www.nws.noaa.gov/mdl/synop/tpb/mdltpb05-03.pdf.]

  • FAA, 2015: Electronic Code of Federal Regulations. [Available online at http://www.ecfr.gov/cgi-bin/text-idx?SID=2b4c7d7a623d7fd9284d18a7c9ed756d&mc=true&node=pt14.2.91&rgn=div5.]

  • Fritsch, J. M., and R. Carbone, 2004: Improving quantitative precipitation forecasts in the warm season: A USWRP research and development strategy. Bull. Amer. Meteor. Soc., 85, 955–965, doi:10.1175/BAMS-85-7-955.

  • Ghirardelli, J. E., and B. Glahn, 2010: The Meteorological Development Laboratory’s Aviation Weather Prediction System. Wea. Forecasting, 25, 1027–1051, doi:10.1175/2010WAF2222312.1.

  • Gilbert, K. K., R. L. Cosgrove, and J. Maloney, 2008: NAM-based MOS guidance – The 0000/1200 UTC alphanumeric messages. Meteorological Development Laboratory Tech. Procedures Bull. TPB 08-01, National Weather Service, 11 pp. [Available online at http://www.nws.noaa.gov/mdl/synop/tpb/mdltpb08-01.pdf.]

  • Glahn, B., K. Gilbert, R. Cosgrove, D. P. Ruth, and K. Sheets, 2009: The gridding of MOS. Wea. Forecasting, 24, 520–529, doi:10.1175/2008WAF2007080.1.

  • Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, doi:10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.

  • Gultepe, I., and Coauthors, 2007: Fog research: A review of past achievements and future perspectives. Pure Appl. Geophys., 164, 1121–1159, doi:10.1007/s00024-007-0211-x.

  • Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, doi:10.1175/2007MWR2411.1.

  • Hamill, T. M., G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J. Galarneau Jr., Y. Zhu, and W. Lapenta, 2013: NOAA’s second-generation global medium-range ensemble reforecast dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565, doi:10.1175/BAMS-D-12-00014.1.

  • Hansen, B., 2007: A fuzzy logic–based analog forecasting system for ceiling and visibility. Wea. Forecasting, 22, 1319–1330, doi:10.1175/2007WAF2006017.1.

  • Mass, C., 2008: The Weather of the Pacific Northwest. University of Washington Press, 281 pp.

  • NCDC, 2015: Automated Surface Observing System. National Climatic Data Center. [Available online at http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/automated-surface-observing-system-asos.]

  • Novak, D. R., C. Bailey, K. F. Brill, P. Burke, W. A. Hogsett, R. Rausch, and M. Schichtel, 2014a: Precipitation and temperature forecast performance at the Weather Prediction Center. Wea. Forecasting, 29, 489–504, doi:10.1175/WAF-D-13-00066.1.

  • Novak, D. R., K. F. Brill, and W. A. Hogsett, 2014b: Using percentiles to communicate snowfall uncertainty. Wea. Forecasting, 29, 1259–1265, doi:10.1175/WAF-D-14-00019.1.

  • Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

  • Roebber, P. J., S. L. Bruening, D. M. Schultz, and J. V. Cortinas Jr., 2003: Improving snowfall forecasting by diagnosing snow density. Wea. Forecasting, 18, 264–287, doi:10.1175/1520-0434(2003)018<0264:ISFBDS>2.0.CO;2.

  • Weick, K. E., 1990: The vulnerable system: An analysis of the Tenerife air disaster. J. Manage., 16, 571–593.

  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.

  • Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390, doi:10.1175/MWR3402.1.
1 It should be noted, however, that other MOS products, such as the gridded MOS and the Localized Aviation MOS Product (LAMP), do produce cloud ceiling and visibility forecasts at a higher, hourly temporal resolution (Glahn et al. 2009; Ghirardelli and Glahn 2010).
