References

• Brennan, M. J., and G. M. Lackmann, 2005: The influence of incipient latent heat release on the precipitation distribution of the 24–25 January 2000 U.S. east coast cyclone. Mon. Wea. Rev., 133, 1913–1937.
• Buizza, R., and P. Chessa, 2002: Prediction of the U.S. storm of 24–26 January 2000 with the ECMWF Ensemble Prediction System. Mon. Wea. Rev., 130, 1531–1551.
• Charles, M., 2008: Verification of extratropical cyclones within NCEP forecast models using an automated cyclone tracking algorithm. M.S. thesis, School of Marine and Atmospheric Sciences, Stony Brook University, Stony Brook, NY, 135 pp. [Available from SoMAS, Stony Brook University, Stony Brook, NY 11794-5000.]
• Charles, M., and B. A. Colle, 2009: Verification of extratropical cyclones within the NCEP operational models. Part I: Analysis errors and short-term NAM and GFS forecasts. Wea. Forecasting, 24, 1173–1190.
• Davis, C. A., and K. A. Emanuel, 1991: Potential vorticity diagnostics of cyclogenesis. Mon. Wea. Rev., 119, 1929–1953.
• Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459.
• Du, J., J. McQueen, G. DiMego, Z. Toth, D. Jovic, B. Zhou, and H. Chuang, 2006: New dimension of NCEP Short-Range Ensemble Forecasting (SREF) System: Inclusion of WRF members. Preprints, WMO Expert Team Meeting on Ensemble Prediction System, Exeter, United Kingdom, World Meteorological Organization. [Available online at http://wwwt.emc.ncep.noaa.gov/mmb/SREF/reference.html.]
• Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.
• Froude, L. S. R., L. Bengtsson, and K. I. Hodges, 2007: The prediction of extratropical storm tracks by the ECMWF and NCEP ensemble prediction systems. Mon. Wea. Rev., 135, 2545–2567.
• Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205.
• Hacker, J. P., E. S. Krayenhoff, and R. B. Stull, 2003: Ensemble experiments on numerical weather prediction error and uncertainty for a North Pacific forecast failure. Wea. Forecasting, 18, 12–31.
• Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
• Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
• Jones, M. S., B. A. Colle, and J. S. Tongue, 2007: Evaluation of a mesoscale short-range ensemble forecast system over the northeast United States. Wea. Forecasting, 22, 36–55.
• Langland, R. H., M. A. Shapiro, and R. Gelaro, 2002: Initial condition sensitivity and error growth in forecasts of the 25 January 2000 East Coast snowstorm. Mon. Wea. Rev., 130, 957–974.
• McQueen, J., J. Du, B. Zhou, G. Manikin, B. Ferrier, H.-Y. Chuang, G. DiMego, and Z. Toth, 2005: Recent upgrades to the NCEP Short-Range Ensemble Forecasting System (SREF) and future plans. Preprints, 17th Conf. on Numerical Weather Prediction/21st Conf. on Weather Analysis and Forecasting, Washington, DC, Amer. Meteor. Soc., 11A.2. [Available online at http://ams.confex.com/ams/pdfpapers/94665.pdf.]
• Murphy, A. H., 1973: Hedging and skill scores for probability forecasts. J. Appl. Meteor., 12, 215–223.
• Novak, D. R., D. Bright, and M. Brennan, 2008: Operational forecaster uncertainty needs and future roles. Wea. Forecasting, 23, 1069–1084.
• Silberberg, S. R., and L. F. Bosart, 1982: An analysis of systematic cyclone errors in the NMC LFM-II model during the 1978–79 cool season. Mon. Wea. Rev., 110, 254–271.
• Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. Mon. Wea. Rev., 127, 433–446.
• Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.
• Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: Generalization of the discrete Brier and ranked probability skill scores for weighted multimodel ensemble forecasts. Mon. Wea. Rev., 135, 2778–2785.
• Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.
• Zhang, F., C. Snyder, and R. Rotunno, 2002: Mesoscale predictability of the "surprise" snowstorm of 24–25 January 2000. Mon. Wea. Rev., 130, 1617–1632.
Figures

Fig. 1. Geographical regions (1–6) used in the cyclone verification.

Fig. 2. The central pressure MAEs (mb) vs forecast hour averaged for the 2004–07 cool seasons for the GFS, NAM, and SREF mean (and its subcomponents) for regions (a) 1–2, (b) 3–4, and (c) 5–6. Confidence intervals at the 90% significance level are given by the vertical bars. The SREF forecast hours are labeled at the bottom of each panel, while the GFS and NAM forecast hours are at the top. The inset legend shows the symbols and line types used for the various members.

Fig. 3. As in Fig. 2, but for the cyclone central pressure ME (mb).

Fig. 4. As in Fig. 2, but for the cyclone displacement (km).

Fig. 5. As in Fig. 2, but for SREF (and its subcomponents, including the six WRF members).

Fig. 6. As in Fig. 5, but for the cyclone displacement (km).

Fig. 7. Time series of SREF MAEs of the cyclone central pressure minus the GFS MAEs at hour 48 for regions (a) 1, (b) 4, and (c) 6, during the 2004–07 cool seasons (separated by gray vertical lines). Horizontal black dashed lines denote one standard deviation above and below the mean SREF − GFS error difference for each region. The solid black line is the 30-day running mean of the cyclone central pressure error differences. Periods of interest for the text discussion are highlighted with boxes and numbered.

Fig. 8. SREF rank histograms of the cyclone central pressure for hours 3–15 for regions (a) 1, (c) 2, and (e) 3, and for hours 51–63 for regions (b) 1, (d) 2, and (f) 3.

Fig. 9. As in Fig. 8, but for regions 4–6.

Fig. 10. Histograms showing the percentage of time each SREF member is best during the 2004–07 cool seasons for the cyclone central pressure for regions (a)–(f) 1–6, respectively, for hours 3–15.

Fig. 11. As in Fig. 10, but at hours 51–63.

Fig. 12. As in Fig. 10, but for the cyclone position forecasts at hours 51–63.

Fig. 13. BSs during the 2004–07 cool seasons for regions (a) 1–2, (b) 3–4, and (c) 5–6 calculated for the cyclone central pressure over a range of thresholds for the SREF, GFS, NAM, and GFS + NAM blend for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars. The inset legend shows the symbols and the line types used for the various members.

Fig. 14. As in Fig. 13, but for the BSSs of the SREF forecasts at F27–63 using the GFS, NAM, and GFS + NAM blend as the reference forecasts. Values of BSS greater than zero indicate that the SREF forecasts improve on the reference forecasts from the various models. The inset legend shows the symbols and line types used for the various members.

Fig. 15. BSs and their components (REL, RES, and UNC) during the 2004–07 cool seasons for regions (a) 1–2, (b) 3–4, and (c) 5–6, calculated for the cyclone central pressure over a range of thresholds for the SREF for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars. The inset legend shows the symbols and line types used for the various members.

Fig. 16. As in Fig. 13, but the calculation of the Brier score includes only events in which the forecast probability and/or the observed probability of the event occurring was greater than 0. Confidence intervals at the 90% significance level are given by the vertical bars.

Fig. 17. As in Fig. 13, but for SREF (including the WRF members), GFS, NAM, and GFS + NAM blend for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars.

Fig. 18. As in Fig. 17, but for the BSS.

Fig. 19. Reliability diagrams for the SREF 15-member ensemble at F27–63 with respect to the average cyclone central pressure for regions (a) 1–2, (c) 3–4, and (e) 5–6, as well as for cyclones that are 1.5 standard deviations below the average central pressure for regions (b) 1–2, (d) 3–4, and (f) 5–6. Reliability is represented by the solid curve with markers plotted every 20%. A perfect ensemble forecast is shown by the 1:1 solid line. The tilted dashed line indicates an ensemble with no skill, and the horizontal dashed line indicates climatology (no resolution).

Fig. 20. Surface analysis and station models (full barb, 10 kt) from the Hydrometeorological Prediction Center (HPC) valid at 1200 UTC 3 Feb 2006.

Fig. 21. (a) Cyclone tracks from F00 to F51 (every 6 h) for the SREF EBM, EKF, and RSM members initialized at 0900 UTC 2 Feb 2006, the GFS and NAM initialized at 1200 UTC 2 Feb 2006, and the observed track (solid boldface and black). The inset legend shows the symbols and line types used for the various members. (b) As in (a), but showing the central pressure vs forecast hour. The SREF forecast hours are labeled at the bottom of each panel, while the GFS and NAM forecast hours are at the top.

Fig. 22. As in Fig. 21, but for the SREF cyclone event initialized at 0900 UTC 15 Dec 2005.

Verification of Extratropical Cyclones within the NCEP Operational Models. Part II: The Short-Range Ensemble Forecast System

  • 1 School of Marine and Atmospheric Sciences, Stony Brook University, Stony Brook, New York

Abstract

This paper verifies the strengths and positions of extratropical cyclones around North America and the adjacent oceans within the Short Range Ensemble Forecast (SREF) system at the National Centers for Environmental Prediction (NCEP) during the 2004–07 cool seasons (October–March). The SREF mean for cyclone position and central pressure has a smaller error than the various subgroups within SREF and the operational North American Mesoscale (NAM) model in many regions on average, but not the operational Global Forecast System (GFS) for many forecast times. Inclusion of six additional Weather Research and Forecasting (WRF) model members into SREF during the 2006–07 cool season did not improve the SREF mean predictions.

The SREF has slightly more probabilistic skill over the eastern United States and western Atlantic than the western portions of the domain for cyclone central pressure. The SREF also has slightly greater probabilistic skill than the combined GFS and NAM for central pressure, which is significant at the 90% level for many regions and thresholds. The SREF probabilities are fairly reliable, although the SREF is overconfident at higher probabilities in all regions. The inclusion of WRF did not improve the SREF probabilistic skill. Over the eastern Pacific, eastern Canada, and western Atlantic, the SREF is overdispersed on average, especially early in the forecast, while across the central and eastern United States the SREF is underdispersed later in the forecast. There are relatively large biases in cyclone central pressure within each SREF subgroup. As a result, the best-member diagrams reveal that the SREF members are not equally accurate for the cyclone central pressure and displacement. Two cases are presented to illustrate examples of SREF developing large errors early in the forecast for cyclones over the eastern United States.

* Current affiliation: NOAA/NWS/NCEP/Climate Prediction Center, Camp Springs, Maryland.

Corresponding author address: Dr. Brian A. Colle, School of Marine and Atmospheric Sciences, Stony Brook University, Stony Brook, NY 11794-5000. Email: brian.colle@stonybrook.edu


1. Introduction

a. Background

This paper is the second in a series verifying extratropical cyclones within the operational models at the National Centers for Environmental Prediction (NCEP). Charles and Colle (2009, hereafter Part I) highlighted the performance of the North American Mesoscale (NAM) and Global Forecast System (GFS) models around North America and its adjacent oceans for the 0–60-h forecasts of extratropical cyclones during the 2002–07 cool seasons.

Cyclone forecast errors result from uncertainties in initial conditions (ICs) and the model physics (Hacker et al. 2003). Silberberg and Bosart (1982) examined cyclone forecast errors in the NMC Limited-Area Fine Mesh (LFM) model for one cool season and showed the importance of IC errors in reducing the accuracy of 1–2-day forecasts for several cyclone events. For the 24–25 January 2000 "surprise" snowstorm, Langland et al. (2002) attributed most of the forecast error to the growth of IC uncertainties 96 h before the event over the eastern Pacific. Meanwhile, Brennan and Lackmann (2005) found that operational models did not forecast the large area of convection over Alabama and Georgia for the 25 January 2000 event. As a result, the models missed the low-level potential vorticity (PV) maximum from the convection, which can be important in intensifying the low-level circulation associated with a cyclone (Davis and Emanuel 1991). Zhang et al. (2002) found that the fifth-generation Pennsylvania State University–National Center for Atmospheric Research (PSU–NCAR) Mesoscale Model (MM5) better resolved the precipitation patterns for the surprise snowstorm with increasing horizontal resolution. They also initialized the MM5 with several different initial conditions and found that an IC ensemble significantly improved the forecast.

To estimate the forecast uncertainty in operational models, a number of slightly perturbed ICs can be used to initialize an ensemble of model runs, such as the “breeding” method used at NCEP (Toth and Kalnay 1997). Regional short-range ensembles have been tested using members that apply either different operational IC analyses or physical parameterizations (Grimit and Mass 2002; Jones et al. 2007). The different ICs can be combined with different physics to further diversify the ensemble (Eckel and Mass 2005). Du et al. (1997) studied the impacts of short-range ensemble forecasting on the precipitation forecasts for the 14–16 December 1987 cyclone over the continental United States, which deepened to 978 mb and produced heavy snow over a region from the Texas Panhandle to Michigan. It was found that simply averaging the precipitation at forecast hour 36 from a 25-member Mesoscale Model 4 (MM4) ensemble reduced the root-mean-squared error (RMSE) of the precipitation forecast by about 50%–60%. They showed that an ensemble of PSU–NCAR MM4 members at 80-km grid spacing could match or exceed the accuracy of a deterministic model at 40-km grid spacing when forecasting the 500-mb geopotential height, 850-mb temperature, and surface precipitation.

Stensrud et al. (1999) illustrated the benefits of ensemble forecasting of cyclone positions by comparing the performance of a 29-km Meso-Eta Model and a 10-member 80-km Eta Model ensemble from the NCEP Short-Range Ensemble Forecast (SREF) system in 1994. It was found that the mean of the lower-resolution ensemble Eta Model members was as accurate as the higher-resolution deterministic Eta Model in forecasting cyclone positions on average across the United States. The ensemble spread was greatly increased by including the Regional Spectral Model (RSM) members, suggesting that model diversity within an ensemble is important in determining the forecast uncertainty.

Froude et al. (2007) verified extratropical cyclone tracks in the European Centre for Medium-Range Weather Forecasts (ECMWF) Ensemble Prediction System (EPS) and the GFS-based NCEP EPS between 6 January and 5 April 2005. The ECMWF EPS consisted of 50 perturbed members out to 10 days, with a spectral resolution of T255L40, while the 10-member NCEP EPS was run at a resolution of T126L28 (T62L28) for the first 7.5 (last 8.5) days. They found that the ECMWF ensemble had greater accuracy than the NCEP ensemble for cyclones in the Northern Hemisphere, while in the Southern Hemisphere the NCEP ensemble had smaller errors. The ECMWF ensemble mean had greater accuracy than the control member, although the ECMWF ensemble was underdispersed.

b. Motivation

Novak et al. (2008) surveyed many operational managers in the National Weather Service and found that forecasters want to use ensembles more in the forecast process; however, they need information on how ensembles perform for significant weather events. While ensembles have been shown to improve cyclone forecasts on an individual case study basis (Buizza and Chessa 2002), there has not been much assessment of the ensemble performance of extratropical cyclone strength and position over a few seasons. Froude et al. (2007) provided a quantitative assessment of the ECMWF and NCEP EPSs; however, their available data were limited to 1 yr, and they focused primarily on the extended forecasts. Using cyclone errors over the contiguous United States compiled by the Hydrometeorological Prediction Center for 2004–07, Novak et al. (2008) showed that for cyclone central pressure a simple NAM–GFS pressure average was superior deterministically to the mean pressures from the SREF or GFS-based EPS. The goal of this study is to verify SREF more comprehensively for cyclone events (both deterministically and probabilistically) over many regions and forecast periods.

Specifically, this study will address the following questions:

  • How do the SREF cyclone errors vary across North America and its adjacent oceans?
  • How does the accuracy of the ensemble mean cyclone central pressure and position compare with that of the deterministic GFS and NAM models?
  • Does SREF have probabilistic skill and reliability for extratropical cyclones?

2. Data and methods

The SREF was available for the cool seasons (October–March) of 2004–07 at ∼40-km grid spacing over the conterminous United States (CONUS) and its adjacent oceans (Fig. 1). There were two runs daily at 0900 and 2100 UTC, with 3-hourly forecast intervals to 63 h. For the first 2 yr, SREF consisted of 15 members (McQueen et al. 2005): 5 Eta members using Betts–Miller–Janjić convection (EBM), 5 Eta members using Kain–Fritsch convection (EKF), and 5 RSM members using the Arakawa–Schubert convective scheme. During the last (2006–07) cool season, six additional Weather Research and Forecasting (WRF) model members were also available, with three members from the Nonhydrostatic Mesoscale Model (NMM) core and three from the Advanced Research WRF (ARW) core (Table 1). The 0900 and 2100 UTC SREF forecasts were compared with the 1200 and 0000 UTC GFS and NAM predictions for the same cyclones (closest cyclone within 800 km).

The SREF system creates initial condition perturbations using the breeding method (Toth and Kalnay 1997). First, a small arbitrary perturbation is added to the 3-day-old global analysis (from the global ensemble's breeding cycle). This perturbed analysis is then truncated to the SREF domain and used as an initial perturbed field for the SREF breeding cycle. The model is integrated out to 12 h from both the original initial state and the perturbed initial state. The difference between these forecasts is then scaled down so that the RMS difference between the two 850-mb temperature forecasts is less than 0.5°C (J. Du 2007, personal communication). This scaled perturbation is then used to initialize another 12-h run, and this process is repeated for consecutive runs up to the current analysis time. After several days, the errors reach an asymptotic value. This new perturbation is added to and subtracted from the analysis produced by the Eta Data Assimilation System (EDAS) for Eta members and the Global Data Assimilation System (GDAS) for RSM members to generate positive and negative ensemble members, respectively, for the current SREF cycle.
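The rescale-and-cycle logic above can be sketched in a few lines of Python. This is an illustrative toy, not the operational SREF code: `integrate` is a made-up stand-in for a 12-h model integration, and the `rms_cap` of 0.5 plays the role of the 0.5°C 850-mb temperature threshold.

```python
import numpy as np

def integrate(state, hours=12):
    """Toy stand-in for a 12-h model integration (NOT a real NWP model):
    a damped nonlinear map applied once per 'hour'."""
    x = np.asarray(state, dtype=float).copy()
    for _ in range(hours):
        x = x + 0.05 * np.sin(x) - 0.01 * x
    return x

def breeding_cycle(analyses, seed_perturbation, rms_cap=0.5):
    """Sketch of the breeding method: run control and perturbed forecasts
    from each successive analysis, rescale their difference whenever its
    RMS exceeds rms_cap, and carry the rescaled (bred) perturbation into
    the next 12-h cycle."""
    pert = np.asarray(seed_perturbation, dtype=float)
    for analysis in analyses:
        control = integrate(analysis)
        perturbed = integrate(np.asarray(analysis, dtype=float) + pert)
        diff = perturbed - control
        rms = np.sqrt(np.mean(diff ** 2))
        if rms > rms_cap:
            diff *= rms_cap / rms  # scale the bred vector back down
        pert = diff                # perturbation for the next cycle
    return pert
```

After several cycles the perturbation's amplitude is pinned near the cap while its structure evolves with the flow, which is the essence of breeding.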

There are some known limitations to the SREF approach as compared to other operational NCEP models. First, SREF uses 3-h forecasts from the GFS for initial conditions for the RSM and WRF members. Second, as noted above, SREF is started at 0900 and 2100 UTC, which is 3 h before the 1200 and 0000 UTC operational models (NAM and GFS), so SREF errors may be slightly larger because of its earlier start times. Finally, SREF is run at 32–45-km grid spacing, while the NAM is run at 12-km grid spacing.

Cyclones were identified using the approach outlined in Part I. The sea level pressure from the GFS analysis was used as truth for the cyclone central pressure, given its greater accuracy for cyclones as compared to the NAM and the North American Regional Reanalysis (NARR; see Part I). Sea level pressure is obtained similarly across the models during postprocessing, which allows direct comparisons. As highlighted in Charles (2008), using the NAM as truth for cyclone central pressure does not change the verification results between the NAM and GFS forecasts. As in Part I, the observed cyclone positions were a vector average of the GFS and NAM analyses. All model data were interpolated to a 0.8° latitude–longitude grid, since it is closest to the GFS data, which has the coarsest resolution of all of the models. To be included in the verification dataset, each cyclone had to be objectively identified in three out of the five members for each of the three SREF subgroups (EKF, EBM, and RSM). This resulted in at least 100 cyclones per forecast hour for 2004–07 in each of the regions shown in Fig. 1; however, to increase the statistical significance of some of the results below, some of the regions are combined. To obtain the average cyclone pressure error in the ensemble at a particular time, each ensemble member's error was calculated separately and then averaged to obtain an ensemble mean error. In the case of displacement, each component (x and y) was averaged separately over all ensemble members and then used to calculate the displacement error. To test for statistical significance, a bootstrapping approach was used to resample the data and obtain 90% confidence intervals around the means, as described in Part I.
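The error-averaging and significance conventions above (member-mean pressure error, component-wise displacement averaging, bootstrap confidence intervals) can be sketched as follows; the function names and resampling count are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_pressure_error(member_pressures, analysis_pressure):
    """Ensemble mean central pressure error: compute each member's
    signed error first, then average over members."""
    errors = np.asarray(member_pressures, dtype=float) - analysis_pressure
    return errors.mean()

def ensemble_displacement_error(member_dx, member_dy):
    """Ensemble displacement error: average the x and y displacement
    components over members first, then take the vector magnitude."""
    return np.hypot(np.mean(member_dx), np.mean(member_dy))

def bootstrap_ci(samples, n_resamples=2000, alpha=0.10):
    """90% bootstrap confidence interval around the mean, resampling
    with replacement (analogous to the significance testing in Part I)."""
    samples = np.asarray(samples, dtype=float)
    means = [rng.choice(samples, size=samples.size, replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Note the ordering matters for displacement: averaging components before taking the magnitude verifies the mean position, rather than the mean of the individual position errors.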

There are several approaches to quantifying the probabilistic skill of an ensemble. One widely used skill score for categorical forecasts is the Brier score (BS), defined as (Wilks 1995)
$$\mathrm{BS} = \frac{1}{n}\sum_{k=1}^{n}\left(p_k - o_k\right)^2, \quad (1)$$
where n is the number of cases, p is the forecast probability, and o is the observed probability (1 if an event occurs, 0 if it does not). An ensemble with perfect probabilistic predictive skill would have a BS of 0, and one with no predictive skill would have a BS of 1. The BS is generally used to show how well an ensemble estimates the probability of a given variable [e.g., sea level pressure (SLP)] falling below or above a certain threshold or category. For the purposes of this study, a slightly different approach was used to calculate the BS. Instead of defining an event based on whether or not the cyclone will fall below a threshold SLP value, an event was defined based on whether or not the predicted cyclone central pressure is within a certain range X on either side of a selected SLP value, where X = 1.0 × E (E is the average GFS MAE for SLP at that forecast hour and region). This approach was used to determine whether SREF could provide probabilistic skill for an SLP range of similar magnitude to the average errors in the deterministic models.
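Under this range-based event definition, the Brier score of Eq. (1) might be computed as in the sketch below, where the ensemble probability is simply the fraction of members whose cyclone central pressure falls within the window; the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def brier_score_range(member_slp, analysis_slp, center, half_width):
    """Brier score where the 'event' is the cyclone central pressure
    falling within +/- half_width of a selected SLP value (per Eq. 1).
    member_slp: (n_cases, n_members); analysis_slp: (n_cases,)."""
    member_slp = np.asarray(member_slp, dtype=float)
    analysis_slp = np.asarray(analysis_slp, dtype=float)
    in_range = np.abs(member_slp - center) <= half_width
    p = in_range.mean(axis=1)  # forecast probability = member fraction
    o = (np.abs(analysis_slp - center) <= half_width).astype(float)
    return np.mean((p - o) ** 2)
```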
The Brier score can also be decomposed into several components (Murphy 1973):
$$\mathrm{BS} = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC}, \quad (2)$$
where,
$$\mathrm{REL} = \frac{1}{n}\sum_{i=1}^{I} N_i\left(p_i - \overline{o}_i\right)^2, \quad (3)$$
$$\mathrm{RES} = \frac{1}{n}\sum_{i=1}^{I} N_i\left(\overline{o}_i - \overline{o}\right)^2, \quad (4)$$
$$\mathrm{UNC} = \overline{o}\left(1 - \overline{o}\right). \quad (5)$$

The N_i in Eqs. (3) and (4) represents the number of forecast–event pairs in probability category i (of the I discrete categories), while n is the total number of forecast–event pairs for a specific threshold. The REL (reliability) term summarizes the calibration of the ensemble, describing the mean squared difference between the ensemble's forecast probability and the actual frequency of the event [Eq. (3)]. For a well-calibrated ensemble, each forecast probability (p_i) will be close to the fraction of those events that actually occur (ō_i). A perfectly reliable ensemble will have an REL of 0, thereby contributing to a lower (better) Brier score. The RES (resolution) term [Eq. (4)] summarizes the ensemble's ability to distinguish between events. It measures the average of the squared differences between the sampled climatologies of each discrete forecast probability the ensemble can produce (ō_i) and the overall climatology of the event (ō). Finally, the UNC (uncertainty) term [Eq. (5)] is a function of only the observations, with a minimum of 0 when the climatological probability (ō) is 0 or 1, and a maximum of 0.25 when ō is 0.5. When an event is very likely or very unlikely (e.g., in the tails of a distribution of events), there is little uncertainty in how the event will unfold. However, when the probability of an event occurring is ∼0.5, it is less certain how the event will unfold, and the uncertainty is higher. A smaller UNC leads to a smaller Brier score.
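A minimal sketch of the decomposition in Eqs. (2)–(5), assuming the forecast probabilities are exact multiples of 1/n_members (the function name and binning are illustrative choices):

```python
import numpy as np

def murphy_decomposition(p, o, n_members):
    """Decompose the Brier score into REL, RES, and UNC (Eqs. 3-5).
    p: forecast probabilities (multiples of 1/n_members); o: 0/1 outcomes."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n = p.size
    obar = o.mean()                       # overall event climatology
    rel = res = 0.0
    for i in range(n_members + 1):        # discrete probability categories
        pi = i / n_members
        mask = np.isclose(p, pi)
        Ni = mask.sum()                   # forecast-event pairs in category
        if Ni == 0:
            continue
        oi = o[mask].mean()               # conditional event frequency
        rel += Ni * (pi - oi) ** 2
        res += Ni * (oi - obar) ** 2
    unc = obar * (1.0 - obar)
    return rel / n, res / n, unc
```

For consistent inputs, REL − RES + UNC reproduces the Brier score of Eq. (1) exactly, which is a useful self-check when implementing the decomposition.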

To test the improvement of the ensemble over another ensemble, deterministic model, or climatology, the Brier skill score (BSS) is defined as
$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS_{ref}}}, \quad (6)$$
and
$$\mathrm{BSS_{clim}} = 1 - \frac{\mathrm{BS}}{\mathrm{BS_{clim}}}, \quad (7)$$
where BSref is the reference BS and BSSclim is the BSS with respect to climatology. The reference BS is usually based on climatology, but it can be based on another ensemble or deterministic model as well (Weigel et al. 2007). A BSS of 1 would indicate that the ensemble provides a perfect probabilistic forecast as compared to the reference. A BSS of 0 would indicate that the ensemble shows no improvement over the reference score (deterministic model or climatology), while negative BSS values show that the forecast ensemble is less skillful than the reference forecast.

While the BS is used mainly to evaluate ensembles, it was also applied to the NCEP NAM and GFS models, since the deterministic forecast distributions are essentially binary yes–no forecasts of meeting a certain threshold range (1 or 0). By using the NAM and GFS as references in calculating the BSS, one can determine whether SREF has an advantage over the NAM and GFS for probabilistic forecasting.
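Treating a deterministic model as a degenerate ensemble whose probabilities are only 0 or 1, the BSS of Eq. (6) against that reference might be computed as below; the sample outcomes and probabilities are hypothetical.

```python
import numpy as np

def brier_score(p, o):
    """Brier score per Eq. (1)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(bs, bs_ref):
    """BSS = 1 - BS/BS_ref (Eq. 6); positive values mean the forecast
    beats the reference."""
    return 1.0 - bs / bs_ref

# Hypothetical example: ensemble probabilities vs a deterministic
# yes/no forecast of the same four events.
o = [1, 0, 1, 0]
p_ens = [0.8, 0.2, 0.6, 0.4]
p_det = [1, 0, 0, 0]   # the deterministic model missed the third event
bss = brier_skill_score(brier_score(p_ens, o), brier_score(p_det, o))
```

Because a missed deterministic forecast contributes a full (0 − 1)² = 1 to its Brier score, even a modestly sharp ensemble can earn a positive BSS against it, which is the comparison made with the NAM and GFS here.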

3. Results

a. Model error versus forecast hour

1) Cyclone central pressure

To quantify how the SREF compares with the deterministic NAM and GFS, the errors in each subgroup of SREF (EKF, EBM, and RSM), as well as the full ensemble mean, were averaged over each region in Fig. 1 for combined periods of forecast hours from F03 to F63. The regions were also combined (1–2, 3–4, and 5–6) for some of the analysis in order to improve their statistical significance. Figure 2 shows the mean absolute errors (MAEs) of the cyclone central pressure for the GFS, NAM, and SREF. The performance of the SREF mean relative to each of its subgroups varies by region. For regions 1–2 (eastern Pacific and western United States; see Fig. 2a), the SREF mean has more accuracy than its three subgroups at all forecast hours (given 90% confidence intervals), with the RSM subgroup having the largest errors before hour 24. For regions 3–4 (central United States and central-eastern Canada; see Fig. 2b), the SREF mean is only significantly better than all subcomponents at hours 3–15 and 45–51, with the RSM group performing the worst before hour 39. In regions 5–6 (eastern United States and western Atlantic; see Fig. 2c), the SREF mean is significantly better than all its components at hours 3–15 and 33–51.

The GFS has the lowest MAEs of all the models and the SREF mean for all regions and times except 57–63 h (of SREF) in regions 1–2 (Fig. 2). The GFS MAEs are 0.5–1.0 mb smaller than the SREF mean on average, but this is within the average analysis error of the GFS analysis used for verification (cf. Fig. 2a in Part I). Thus, the MAEs were recalculated for only those events with GFS or SREF mean cyclone errors greater than 1 mb (greater than the average analysis error), and the GFS error was still 0.5–1.0 mb smaller than the SREF mean on average (not shown). Meanwhile, the SREF mean has lower MAEs than the NAM for all forecast hours in regions 1–2, and for 42–60 h in regions 3–4 and 5–6 (Fig. 2).

Figure 3 shows the mean error of the cyclone central pressure for the SREF, NAM, and GFS. For regions 1–2 (Fig. 3a), the cyclone SLP errors have a positive trend (0.5 to 1.5 mb) in the EKF and EBM from hours 15–39, and this positive (underdeepening) bias remains during the rest of the forecast. In contrast, the RSM group has a large negative (overdeepening) bias (−1.5 to −2.5 mb) throughout the forecast. The offsetting positive and negative member errors in SREF for regions 1–2 result in a SREF mean error within 0.5 mb of zero throughout the forecast, which is less than the other models for hours 33–63. The NAM bias is similar to the SREF Eta Model members, while the GFS develops a negative bias after hour 24. In regions 3–4 (Fig. 3b), the SREF mean and all its components have a slight negative pressure error (−0.5 to −0.05 mb) before hour 48, with the RSM having the largest negative error (−1.25 to −1.0 mb). In regions 5–6 (Fig. 3c), the SREF mean and its subgroups begin the simulation (hours 3–15) with a negative bias on average, but there is a positive error trend for the rest of the forecast, with either a slight positive error or no bias by hours 57–63. These results suggest that cyclones in regions 3–4, which are overdeepened on average (Fig. 3b), move into regions 5–6 early in the forecast, but then SREF no longer overdeepens them later in the forecast over the East Coast and the Atlantic storm track. The NAM and GFS also experience similar positive trends in their biases in regions 5–6, but after hour 36.

Overall, Table 2 shows the percentage of time each model and combination of models is best for the individual regions. For the MAE of the cyclone central pressure, the GFS is the best 30%–45% of the time. Meanwhile, the SREF mean is best as often as the NAM or an average of the GFS and NAM predictions (15%–20%). If the SREF mean is combined with GFS and NAM, the percentage of time it is best is less than 10%. If the individual SREF members are combined with GFS and NAM (rather than using the SREF mean), this combination remains best only 5%–10% of the time (not shown). Thus, a multimodel average of SREF, NAM, and GFS does not improve the deterministic predictions of the cyclone central pressure very often. This suggests that this multimodel ensemble often has some outliers or a skewed distribution of pressures around the observed that hurts its deterministic performance relative to the various ensemble components (SREF, GFS, and NAM).

2) Cyclone displacement

Figure 4 shows cyclone displacement errors for the GFS, NAM, and SREF versus forecast hour for regions 1–2, 3–4, and 5–6. The cyclone displacement error is the distance between the forecast cyclone position (the SREF mean position for the ensemble) and the observed cyclone position. The SREF mean is more accurate (at the 90% level) than all of its subgroups for region 1 at all times, for 3–15 and 33–63 h in regions 3–4, and for 39–51 h in regions 5–6. Most of the SREF subgroups are clustered together with similar accuracy; however, the RSM group has significantly larger displacement errors (by 30–60 km) than the other models in regions 3–4 before hour 39. The SREF mean has smaller displacement errors than the GFS for 57–63 h in regions 1–2, while the GFS errors are smaller than those of the SREF mean for 3–15 h in regions 1–2, for 3–27 h in regions 3–4, and for 3–51 h in regions 5–6. The SREF mean has smaller errors than the NAM from 21 to 63 h in region 1, 45 to 63 h in regions 3–4, and 57 to 63 h in regions 5–6.
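The displacement error is a great-circle distance between the forecast and observed cyclone centers. A minimal sketch using the standard haversine formula (a spherical Earth with a 6371-km mean radius is assumed):

```python
import math

def displacement_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between a forecast and an observed
    cyclone center, with positions given in degrees."""
    r = 6371.0  # mean Earth radius (km), an assumed constant
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2.0 * r * math.asin(math.sqrt(a))
```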

Overall, the GFS has the best cyclone position 30%–40% of the time as compared to the NAM, SREF mean, NAM and GFS, and an ensemble of the NAM, GFS, and SREF mean (Table 3). A combination of NAM and GFS is best (20%–25%), as it is slightly more effective than SREF (15%–20%), while all models together (SREF mean, NAM, GFS, and GFS+NAM) are best only 5%–10% of the time.

3) Impact of including WRF members

During the last cool season of this study (2006–07), six WRF members were added to SREF (Du et al. 2006). An important question is whether the WRF ensemble members helped increase the accuracy of the overall ensemble mean. Figure 5 shows the MAEs of the cyclone central pressure versus forecast hour for the 2006–07 cool season for the GFS, NAM, and SREF (including the six WRF members) for regions 1–2, 3–4, and 5–6. For all regions, the inclusion of WRF did not improve the accuracy of the SREF mean. For regions 1–2 (Fig. 5a), the SREF WRF errors are similar to those of the RSM, both of which are significantly larger than those of the other SREF groups before hour 51. The WRF and RSM have large negative biases (overdeepening by 2–4 mb) in regions 1–2 during the forecast (not shown). For regions 3–4 (Fig. 5b), the WRF absolute errors are larger than those of any of the other SREF components after hour 21, given an overdeepening bias of 2–3 mb (not shown). This overdeepening problem is less pronounced in the WRF (1–2 mb) for regions 5–6 (not shown); thus, its MAEs are more comparable to those of the other members (Fig. 5c).

The addition of the six WRF members had little impact on the cyclone position errors during 2006–07 (Fig. 6). The WRF shows a slight decrease in cyclone displacement errors after hour 21 for the eastern Pacific and western United States (regions 1–2; see Fig. 6a); however, the results are not significant at the 90% level. The WRF also adds little benefit to the SREF mean for regions 3–4 (Fig. 6b), since the WRF and RSM are the poorest performing subgroups for most hours. The WRF group errors are more comparable to those of all the other SREF members for regions 5–6 (Fig. 6c), but this has no impact on the accuracy of the SREF mean. Overall, the six additional WRF members did little to improve the SREF ensemble mean cyclone forecasts. Therefore, unless specified otherwise below, the remainder of this paper includes the 15 SREF members without the WRF.

b. Time series of SREF and GFS pressure errors

As highlighted in Part I, short-term operational model forecasts have intraseasonal variability in cyclone errors, with periods lasting from several days to several weeks. To compare the SREF with the GFS over each cool season, Fig. 7 shows a time series of the SREF MAEs for the cyclone central pressure minus the GFS MAEs for events during the 2004–07 cool seasons at F45 for the GFS and at F48 for the SREF. For region 1 (Fig. 7a), a large majority (∼65%) of the cyclones are forecast better by the GFS than by the SREF. For example, for much of February 2005, the SREF cyclone SLP errors are consistently greater than those of the GFS (Fig. 7a, box 1). Also, for nearly all of the 2006–07 cool season, the SREF cyclone SLP errors are larger than those of the GFS, especially from December through February (Fig. 7a, box 2). For region 4 (Fig. 7b), the GFS errors are not always smaller than those of the SREF. For example, at the beginning of the 2004–05 cool season (Fig. 7b, box 1) and the end of the 2006–07 cool season (Fig. 7b, box 3), there are periods when the SREF pressure errors are less than those of the GFS. In contrast, for most of January–March 2006, the mean SREF has larger errors than the GFS (Fig. 7b, box 2). In region 6 (Fig. 7c), from the end of December 2005 through January 2006, most cyclone events are better forecast by the GFS than by the SREF (Fig. 7c, box 1).

c. Ensemble evaluation

1) Rank histograms

Another way of measuring the performance of an ensemble is through a rank histogram (Talagrand diagram). Rank histograms show how representative the ensemble spread is relative to the true uncertainty. These results do not include the WRF members in SREF (15 total members). Random noise on the order of ∼0.5 mb was added to each member as outlined by Hamill (2001) in order to represent the uncertainty in the GFS analyses of the central pressure (Fig. 4 in Part I).

Rank histograms for cyclone central SLP were constructed by first taking the 12 SREF members with the smallest central pressure errors at a particular forecast run and time period and then ordering the member central pressures from lowest to highest. The observed central pressure was then placed in the appropriate pressure bin. For example, an observed cyclone is placed in the leftmost bin if its central pressure is less than that of the member with the lowest pressure. The number of cyclones in each pressure bin was summed over each forecast containing at least 12 ensemble members with a matched cyclone for the 3–15- and 51–63-h forecast periods (Fig. 8). In region 1 at all forecast times (Figs. 8a and 8b), there is a suggestion of SREF underdeepening cyclones for many members, since a majority of the observed cyclones have central pressures deeper than those of nearly half of the SREF members. The SREF is also overdispersed (inverted U shape) early in the forecast in region 1 (Fig. 8a), which suggests that the SREF spread is too great. Some of this overdispersion is by design, since SREF perturbations are added around a control analysis, which may be similar to the GFS analysis used to verify. Later in the forecast, the SREF is slightly underdispersed given the more U-shaped histogram (Fig. 8b), thus suggesting too little spread.
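The construction can be sketched as below. This is an illustrative reconstruction, not the authors' code; the selection of the 12 smallest-error members is assumed to have been done already, and Gaussian noise with a ∼0.5-mb standard deviation is added to each member as in Hamill (2001) to represent analysis uncertainty.

```python
import random

def rank_histogram(forecasts, noise_sd=0.5, seed=0):
    """Rank-histogram counts for the observed central pressure among
    n ensemble members (n + 1 bins). `forecasts` is a list of
    (member_pressures, observed_pressure) pairs."""
    rng = random.Random(seed)
    n = len(forecasts[0][0])
    counts = [0] * (n + 1)
    for members, obs in forecasts:
        # Perturb each member to represent analysis uncertainty.
        perturbed = sorted(p + rng.gauss(0.0, noise_sd) for p in members)
        # Rank = number of members deeper (lower pressure) than observed.
        rank = sum(1 for p in perturbed if p < obs)
        counts[rank] += 1
    return counts
```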

In region 2 (Figs. 8c and 8d), the SREF cyclones are consistently deeper than observed for all forecast hours. In region 3, there is overdispersion early in the forecast (Fig. 8e), which tends to decrease later in the forecast (Fig. 8f). Meanwhile, some overdeepening tends to develop in region 3 later in the forecast. The SREF experiences overdeepening early in the forecast in region 4 (Fig. 9a), while afterward the SREF forecasts in this region tend to be slightly underdispersed (Fig. 9b). There is also overdeepening and some overdispersion in region 5 early in the forecast (Fig. 9c), and this overdeepening is less pronounced later in the forecast (Fig. 9d). Meanwhile, there is some overdispersion in region 6 (Figs. 9e and 9f), especially early in the forecast. Overall, the overdispersion tends to be over the storm track oceanic regions early in the forecast, while some of the largest underdispersion and overdeepening biases occur over the central United States. The ensemble spread is the best over the central and eastern United States (regions 4–5), with only a slight underdispersion later in the forecast.

2) Best member diagrams

If the ensemble probability distribution is correctly reproducing natural uncertainty, each member should have nearly the same probability of being the best (Hamill and Colucci 1997). This can be quantified by using a best-member histogram, which was constructed by incrementing a counter for the member with the smallest absolute error for all cyclone events in which at least three out of five members of each group forecasted a cyclone (if a member did not have a cyclone match, it was not considered best). The total number of times that each member was best was divided by the total number of cyclone forecasts in order to obtain the percentage best.
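A minimal sketch of this counting, assuming per-event lists of member central-pressure errors with `None` marking an unmatched member (the requirement that at least three of five members per subgroup have a match is omitted here for brevity):

```python
def best_member_percentages(events, n_members=15):
    """Percentage of events for which each member has the smallest
    absolute central-pressure error; members with no matched cyclone
    (None) are never counted as best."""
    best = [0] * n_members
    counted = 0
    for errors in events:
        valid = [(abs(e), i) for i, e in enumerate(errors) if e is not None]
        if not valid:
            continue
        counted += 1
        best[min(valid)[1]] += 1
    return [100.0 * b / counted for b in best] if counted else best
```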

All members were best for some fraction of the events, but the SREF members were not equally accurate (Figs. 10 and 11). For region 1 at F03–15 and 51–63 h (Figs. 10a and 11a), the best SREF member on average (14%–15%) is the RSM control member, while the EKF.N4 member is the best least often (∼4%). The RSM control is likely the best early in the forecast, since its analysis is strongly related to the GFS analysis. When combining regions 1–2, it was shown above that the RSM group was one of the worst performers (Fig. 2a), especially early in the forecast, but this is the result of its poor performance in region 2 (western United States; see Fig. 10b). By 51–63 h in region 2 (Figs. 10b and 11b), the EBM subgroup is consistently (6%–10%) the best group, along with the RSM and EKF control members later in the forecast. The EBM and RSM subgroups are better more often than the EKF subgroup between F03 and F15 in region 3 (Fig. 10c); however, by F51 all subgroups have similar accuracies (Fig. 11c). For region 4 at F03–15 (Fig. 10d), the EBM and RSM subgroups tend to be the best on average, but the RSM members are the best by F51 (Fig. 11d). The RSM and EBM control members are the best in region 5 for F03–15 (Fig. 10e), and three of the five RSM members are the best for F51–63 (Fig. 11e). In region 6 (Fig. 10f), the RSM control member is best, especially at F03–15 (∼20%) given the GFS analysis input, while after F51 the EBM control member is the best on average (Fig. 11f).

Best-member diagrams were also created for F51–63 to show the distribution of cyclone position accuracy. For region 1 (Fig. 12a), the RSM control member is the best member on average (13%–18%). For region 2 (Fig. 12b), the RSM group is best at hours 51–63 (8%–10% versus 4%–7% for the other groups). There is large variability in the member accuracy in region 3 by F51–63 (Fig. 12c), with the RSM control and RSM.P1 the best on average. In region 4 (Fig. 12d), the RSM and EBM members are better than the EKF members by F51–63. In region 5 (Fig. 12e), the RSM and EKF control members tend to be the best. For the Atlantic (region 6; Fig. 12f), the RSM control member is the best, and there is large variability in performance within each subgroup. Overall, the position accuracy is not evenly distributed across all 15 ensemble members. For example, for regions 1, 5, and 6, the RSM control member is best most often. This suggests the need for postcalibration to extract more useful probabilistic information from the ensemble.

3) Brier score decomposition

Figure 13 shows the Brier score for the cyclone central pressure, calculated every 5 mb for a range of pressures on either side of each threshold determined by the average GFS MAE (Table 4) (±1.0 × GFS MAE on either side of the threshold), for the SREF, GFS, NAM, and a GFS + NAM blend for hours 27–63. The regions and forecast times were combined (1–2, 3–4, and 5–6) in order to increase the size of the dataset, and the range of the plots was limited to 980–1010 mb, since SLP thresholds outside of this range had fewer than 100 cases. For the GFS + NAM blend, pk in Eq. (1) was calculated by treating the GFS and NAM as a two-member ensemble. The NAM has the largest Brier score (least accurate) for most thresholds in all regions. The GFS has a lower Brier score than the NAM but still a larger one than the SREF, which has the greatest accuracy (significant at the 90% level). However, the GFS + NAM blend has a Brier score close to that of the SREF in all regions, with the 90% confidence intervals overlapping except for the 1000-mb threshold in regions 1–4 and the 990-mb threshold in regions 5–6. These results suggest that a miniensemble of GFS + NAM can predict cyclone event probabilities nearly as well as the SREF for many thresholds. The results are similar if the threshold range is increased to 1.5 times the GFS MAE on either side of a threshold (not shown).
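Under the event definition above (cyclone central pressure within ±1.0 × GFS MAE of a threshold), the forecast probability and Brier score can be sketched as follows; this is an illustrative reconstruction, not the operational code:

```python
def event_probability(member_pressures, threshold, half_width):
    """p_k: fraction of ensemble members placing the cyclone central
    pressure within +/- half_width mb of the threshold."""
    hits = sum(1 for p in member_pressures if abs(p - threshold) <= half_width)
    return hits / len(member_pressures)

def brier_score(pairs):
    """Brier score [Eq. (1)]: mean squared difference between forecast
    probability p_i and binary outcome o_i (1 if the event occurred)."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
```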

Figure 14 shows the SREF BSS [Eq. (6)] for the central pressure calculated using the GFS, NAM, the GFS + NAM blend, and climatology as a reference. For all regions the SREF has much more probabilistic skill improvement over the NAM, and it has nearly the same level of improvement over the deterministic GFS run and climatology for most thresholds. Finally, the BSS for SREF is lowest (0.10–0.15) when using the GFS + NAM blend as a reference, but the confidence intervals suggest that the SREF skill improvement is still significant at the 90% level for all thresholds except the 995- and 1010-mb thresholds in regions 3–4 as well as 980 mb in regions 5–6. The results are similar for a pressure range of 1.5 times the GFS MAE, except that the SREF BSS improvement increases relative to climatology, with the BSS being similar between climatology and the NAM.
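The skill score of Eq. (6) compares the SREF BS against a reference such as the GFS + NAM blend, which is scored by treating the two deterministic runs as a two-member ensemble. A sketch under those assumptions:

```python
def blend_probability(gfs_pressure, nam_pressure, threshold, half_width):
    """GFS + NAM treated as a two-member ensemble: the event probability
    is the fraction of the two runs inside the threshold window."""
    hits = sum(1 for p in (gfs_pressure, nam_pressure)
               if abs(p - threshold) <= half_width)
    return hits / 2.0

def brier_skill_score(bs_forecast, bs_reference):
    """BSS = 1 - BS/BS_ref [Eq. (6)]; positive values indicate the
    forecast improves on the reference."""
    return 1.0 - bs_forecast / bs_reference
```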

To see which properties of the ensemble contribute to the overall predictive skill shown by the BS, the components of the BS [Eqs. (2)–(5)] were calculated. Figure 15 shows the SREF BS and its components (REL, RES, and UNC) calculated over a range of thresholds for hours 27–63. In all regions, REL is small, suggesting that the SREF forecasts events with probabilities close to the true climatological probability of the events. According to Eq. (2), RES contributes negatively to the BS, so a low SREF RES increases (worsens) the BS. This implies that when events are binned according to the SREF's forecast probability, each subsample's observed frequency is close to the overall climatological frequency of the event.
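A sketch of the decomposition BS = REL − RES + UNC used here, with forecast probabilities grouped into equal-width bins (the equal-width binning is an assumption; Eqs. (2)–(5) in the text define the terms):

```python
def brier_decomposition(pairs, n_bins=4):
    """Decompose the Brier score of (p_i, o_i) pairs into reliability,
    resolution, and uncertainty, so that BS = REL - RES + UNC."""
    n = len(pairs)
    obar = sum(o for _, o in pairs) / n  # climatological frequency
    groups = [[] for _ in range(n_bins)]
    for p, o in pairs:
        groups[min(int(p * n_bins), n_bins - 1)].append((p, o))
    rel = res = 0.0
    for sub in groups:
        if not sub:
            continue
        pk = sum(p for p, _ in sub) / len(sub)  # mean forecast in bin
        ok = sum(o for _, o in sub) / len(sub)  # observed frequency in bin
        rel += len(sub) * (pk - ok) ** 2
        res += len(sub) * (ok - obar) ** 2
    return rel / n, res / n, obar * (1.0 - obar)
```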

One problem with the BS as calculated above is that for relatively high (>1005 mb) and relatively low (<990 mb) thresholds, a large percentage of the events fall outside the range defined as 1.0 × the GFS MAE, likely resulting in pi and oi both being 0%. In Eq. (1), this equates to a perfect score for such an event, and when iterating the BS over all the events, the resulting score is very low (good), since the SREF correctly forecasts when these events do not occur. Operationally, a forecaster is interested in the SREF skill when an event is likely to occur, hence with pi and/or oi greater than 0. To better capture the SREF's ability to predict events that are more likely to occur, the BS was recalculated using only events in which pi and/or oi were greater than 0. Figure 16 shows the recalculated BS for the SREF, GFS, NAM, and the GFS + NAM blend. With this new method, the BS is more uniform across the range of cyclone depths, and the SREF BS is very similar across all regions. The NAM has the least accuracy, followed by the GFS, and the skill of the GFS + NAM blend is close to that of the SREF. With many more members than the GFS + NAM blend, the SREF should have a lower Brier score than the GFS + NAM, but the SREF is likely suffering from poor calibration and/or poor resolution of events.
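The recalculation restricted to these "active" events can be sketched as:

```python
def brier_score_active(pairs):
    """Brier score over only those (p_i, o_i) events in which the
    forecast probability or the observation is nonzero."""
    active = [(p, o) for p, o in pairs if p > 0 or o > 0]
    if not active:
        return 0.0  # no active events to score
    return sum((p - o) ** 2 for p, o in active) / len(active)
```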

The SREF central pressure probabilities were also evaluated when including the six additional WRF members in 2006–07 (Figs. 17 and 18). As compared with the results without the WRF (Figs. 13 and 14), the BS and BSS scores in all regions suggest that the WRF did not improve the SREF performance relative to the other models. Unlike the verification without the WRF, the 90% confidence intervals for the BS overlap between the SREF and a blend of the GFS and NAM at all thresholds (Fig. 17). Meanwhile, the BSS is negative for some pressure thresholds when compared to the GFS + NAM blend (Fig. 18), especially for regions 1–2, although the confidence intervals are large given the relatively small sample size of one cool season. The limited improvement with the WRF may be the result of some of the overdeepening bias noted above for the WRF members.

Reliability diagrams were created for the SREF ensemble without WRF (Fig. 19), which show the observed frequency of events as a function of forecast probability. First, for each region, only events in which at least 12 out of 15 members had a cyclone were kept in the analysis. For each of these events, only the top 12 members were used. Then, all events were binned into 7 discrete forecast probabilities from the SREF [0, (1/12 or 2/12), (3/12 or 4/12), … , (11/12 or 1)], instead of 13 (0, 1/12, 2/12, 3/12, … , 1), to increase the sample size of each bin. Finally, the frequency of events observed within each bin was plotted as a function of that bin’s forecast probability. The solid 1:1 line in Fig. 19 indicates perfect reliability, the sloped dashed line indicates no skill, and the horizontal dashed line indicates no resolution (climatology). Probabilities are defined as whether or not a cyclone central pressure is or will be within 1.0 × X mb of either an average or deep cyclone for each region, where X is the GFS MAE for each region (Table 4). A deep cyclone is defined as having a central pressure of more than 1.5 standard deviations below the mean central pressure for each region (Part I).
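The binning into seven merged probability categories can be sketched as follows; an illustrative reconstruction, where `events` pairs the number of members forecasting the event (0–12) with a 0/1 observation:

```python
def reliability_points(events, n_members=12):
    """Observed frequency as a function of forecast probability, with the
    13 possible probabilities (0/12 ... 12/12) merged into 7 bins:
    [0], [1/12 or 2/12], [3/12 or 4/12], ..., [11/12 or 1]."""
    bins = [[] for _ in range(7)]
    for hits, obs in events:
        idx = 0 if hits == 0 else (hits + 1) // 2  # 0 alone, then pairs
        bins[idx].append((hits / n_members, obs))
    points = []
    for sub in bins:
        if sub:
            fprob = sum(p for p, _ in sub) / len(sub)  # mean forecast prob
            ofreq = sum(o for _, o in sub) / len(sub)  # observed frequency
            points.append((fprob, ofreq))
    return points
```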

In regions 1–2, the SREF forecast probabilities are close to the observed frequencies for probabilities <0.4 for average cyclones, but the SREF is somewhat overconfident (Fig. 19a) at higher probabilities (>0.6), since the reliability curve has a smaller slope than the 1:1 line. For example, during events for which the SREF gives a 70% chance of the cyclone falling in a given threshold range, only ∼60% actually do. The same is true for regions 3–6 (Figs. 19c and 19e), where the SREF is overconfident when forecasting average-intensity cyclones at the higher probabilities. When the threshold noted above is increased from 1.0 × X to 1.5 × X, the SREF reliability is improved somewhat (not shown), with good reliability up to 0.7 for all regions.

When the SREF predicts relatively deep cyclones, there is also a bias toward greater forecast probabilities than observed (overforecasting) for the high forecast probabilities (>0.7) in all regions using the 1.0 × X mb criteria (Figs. 19b, 19d, and 19f). Regions 3–4 have good reliability for forecast probabilities <0.6, while the other regions have slightly lower reliability at the smaller probabilities than the average cyclones.

d. Case examples

The above results suggest that the SREF is fairly reliable, although the SREF is overconfident at higher probabilities. Also, in a deterministic sense, the SREF mean was not better than the GFS on average for many forecast times. To illustrate how these problems develop, so that forecasters can be more aware of them, two representative cyclone cases are briefly presented. The first is the SREF run initialized at 0900 UTC 2 February 2006 for a cyclone that developed in southeast Texas (not shown). The cyclone moved northeastward across the Mississippi and Ohio Valleys and was located over Lake Erie by 1200 UTC 3 February (Fig. 20).

Figure 21a shows the SREF forecast of the cyclone track, including the EBM, EKF, and RSM members as well as the operational GFS and NAM. The envelope of the 15-member SREF track is quite good, since it encompasses the observed cyclone track. The mean absolute displacement of the SREF mean is 196 km, while the mean absolute displacements of the GFS and NAM tracks are 211 and 260 km, respectively. Some SREF tracks are better than the operational GFS and NAM, although some of the tracks are clustered for a particular modeling system and share relatively large errors. For example, four out of five RSM members have an eastward displacement error of about 200 km.

Even though the cyclone track is well forecast by the SREF system, the cyclone central pressures for most members are too deep (Fig. 21b). Many SREF members do follow the same pressure trend as observed, and two of the SREF members have pressures relatively close to the observed (better than the GFS and NAM); thus, the ensemble did produce a ∼14% probability of predicting the observed cyclone within a given range (within the average GFS MAE). However, the probabilities could have been higher if the members did not overdeepen or cluster by modeling system (error trends are similar for the RSM, EKF, and EBM). As a result, the SREF was overconfident about the strength of the system and its mean error was fairly large, both of which were problems noted in the above SREF statistics.

The cyclones in all the SREF members were initialized at 0900 UTC 2 February at ∼1000 mb (Fig. 21b), which is within 1 mb of the NCEP analysis. However, from F00 to F03 (0900–1200 UTC), the SREF EBM and EKF members deepen the cyclone 4–7 mb in 3 h, while there was little observed pressure change (∼1 mb in 3 h). In contrast, the operational GFS run at 1200 UTC deepens the cyclone <1 mb in the first 3 h, while the NAM weakens the cyclone by 1.5 mb over 3 h. These SREF pressure falls during the first 3 h of the forecast are suggestive of a startup problem, which resulted in nearly all member cyclones having pressures less than the observed and the deterministic NAM and GFS for much of the 54-h forecast. The mean central pressure error for the 15-member SREF mean is −4.3 mb, while the errors for the NAM and GFS are −1.3 and −3.5 mb, respectively. Even though SREF may have some probabilistic skill, this case illustrates how relatively large mean pressure errors can develop early in the SREF forecast even though the IC perturbation added is relatively small.

Another cyclone event involving large errors in the SREF impacted the East Coast on 15–17 December 2005. The surface cyclone formed along the Gulf Coast near 0900 UTC 15 December (not shown). The cyclone developed in response to a short-wave trough moving over the southeast United States as a broad closed long-wave trough progressed eastward toward the Great Lakes (not shown). During the next 24 h, the cyclone moved quickly up the coast and was centered over New York City at 1200 UTC 16 December.

Figure 22a shows the observed and SREF forecast tracks for the SREF run initialized at 0900 UTC 15 December 2005. The initialization of the cyclone position by SREF is relatively good. However, by hour 3 of the forecast (1200 UTC 15 December), all the SREF members were to the north of the observed track, while the initialized cyclone in the GFS was closer to the observed as determined by the surface observations (not just GFS analysis). Most SREF members remain too far north and west during the forecast and they tend to cluster by model, with the RSM members displaced westward by more than 250 km. Although there are a few members that are relatively close to the observed track (∼20% probability) and are better than GFS and NAM at times during the forecast, the 15-member mean forecast track is ∼100 km west of the observed track as a result of the poor SREF startup. For this particular event there was freezing rain over much of North and South Carolina, so a 100-km track error can significantly impact the forecast for some regions.

The cyclone central pressure in SREF is also too deep for many members for this December 2005 event (Fig. 22b). Within the first few hours of the forecast, the SREF envelope of pressures surrounds the observed cyclone central pressure; however, many SREF members quickly overdeepen the cyclone, such that by hour 15 all SREF members are too deep by 1–6 mb. The errors tend to once again cluster by SREF subgroup, with the RSM members deepening the cyclone more than most other members. This creates a probabilistic forecast of a cyclone <1000 mb at hour 27 that is moderately high (∼50%) while the observed cyclone deepened to only ∼1002 mb. This is consistent with the SREF’s slight overconfidence in ensemble reliability noted above, and suggests that the member clustering may at times contribute to this issue.

4. Summary and conclusions

The goal of this study is to quantify the performance of NCEP’s SREF system for forecasting extratropical cyclones for six regions across the United States, southern Canada, and their adjacent oceans (cf. Fig. 1) during the 2004–07 cool seasons. For this period of study, SREF run cycles were at 0900 and 2100 UTC, which were compared to the 1200 and 0000 UTC run cycles of the NAM and GFS, respectively. The SREF 15-member mean (without WRF members) does provide a better forecast than its subgroups (EBM, EKF, and RSM) for all hours except 21–39 h in regions 3–4 for the cyclone central pressure and 21–27 h for the cyclone displacement. The SREF mean also has smaller pressure and displacement errors than the NAM except for 3–27 h in all regions. The NAM mean errors are likely smaller than those of the SREF early in the forecast, since SREF is initialized 3 h earlier, has coarser grid spacing (12 km in NAM and 40 km in SREF), and uses an older version of the NAM model.

The SREF mean errors for the cyclone central pressure are <0.5 mb in regions 1–2, which is less than in the other models for hours 33–63. This is the result of a cancellation of the negative pressure errors in the RSM members and the positive pressure errors in the other SREF members. This illustrates some of the benefit of running a multimodel ensemble.

Meanwhile, the GFS is more accurate than the SREF mean for the cyclone central pressure in all regions and at all forecast times except hours 57–63 in regions 1–2. The GFS is also more accurate than the SREF mean for the cyclone position for several forecast times in all regions, while a combination of the GFS and NAM is also better than the SREF more often deterministically in all regions. Inclusion of six additional WRF members in SREF during the 2006–07 cool season slightly reduced the cyclone errors relative to the SREF without the WRF, but these improvements were not significant at the 90% level. The SREF position errors even without the WRF are comparable to those of the GFS for 2006–07, which suggests there is some interannual variability in the model performance.

Since the 0000/1200 UTC NAM and GFS are initialized 3 h after the 2100/0900 UTC SREF, one might expect the GFS + NAM to outperform the SREF deterministically. However, these comparisons are still useful, since the 0900/2100 UTC SREF is available operationally only a few hours before the 0000/1200 UTC GFS + NAM; thus, many forecasters synthesize all of these modeling systems simultaneously (e.g., 0900 UTC SREF and 1200 UTC GFS + NAM). Obviously, one way to improve the SREF accuracy slightly relative to the GFS + NAM is to start the SREF 3 h later at 0000/1200 UTC.

Several statistical approaches were used to evaluate SREF in a probabilistic way, such as the rank histogram, best-member diagrams, and Brier skill scores. Over the eastern Pacific, eastern Canada, and the western Atlantic, the SREF is overdispersed. The overdispersion in the early hours is partially by design, since the GFS is used as truth on which the perturbations are centered at the initial time for many members. Although SREF is underdispersive (but not severely) over the central and eastern United States, its ensemble spread performance is better for these two regions (4–5) than others in later forecast hours (F51–63). The best-member diagrams reveal that the SREF members are not equally accurate. The unequal performance of the ensemble members implies that SREF may not accurately portray the true uncertainty in the forecast. The RSM control is best for a few regions early in the forecast given the selection of the GFS analysis as truth.

The SREF has slightly more probabilistic skill over the eastern United States and the western Atlantic than over the western portions of the domain for the cyclone central pressure. The SREF is slightly better probabilistically than a GFS + NAM blend at the 90% significance level for many regions and forecast times, while the SREF has significantly better probabilistic performance than the GFS and NAM individually. The inclusion of the WRF did not improve the probabilistic forecasts relative to the GFS + NAM blend. The SREF probabilities are fairly reliable, although the SREF is overconfident at higher probabilities in all regions, as given by the reliability (REL) term, and it has poor resolution (RES) for average-depth cyclone events, which increases the Brier score (less skill). Although the ensemble strategies applied to the SREF are reasonable, these results suggest room for improvement in the quality of the base models and ICs (data assimilation). For example, the RES term is determined largely by the model and IC quality, while the REL term is determined in part by the quality of the ensemble strategies.

Finally, two cases are shown to illustrate situations in which the SREF develops large errors in cyclone position and central pressure early in the forecast. They illustrate examples in which the sign of the central pressure error or cyclone track error (either left or right of the observed) is similar between the various SREF subgroups, which can diminish the probabilistic performance. This is consistent with the overconfident higher probabilities noted in the REL analysis for the three cool seasons. For one case, even though the SREF cyclone track was well forecast, all SREF members overdeepened the cyclone during the first few hours of model spinup, which detrimentally impacted the rest of the forecast. In another case, all SREF members tracked the cyclone north and west of the observed early in the forecast.

Further improvement of the SREF system will require postcalibration to remove some of the biases and member clustering. The clustering of members within a particular subgroup (RSM, Eta, etc.) suggests that more physics diversity (and better physics) is needed. Some subgroups perform better than others within the SREF, which suggests that the initial conditions and numerics may also be important.

Acknowledgments

This work represents a portion of the first author’s M.S. thesis at Stony Brook University. The research was supported by UCAR–COMET (Grant S07-66814). The authors wish to thank Dr. Jun Du, David Novak, and the three anonymous reviewers for their helpful suggestions concerning this work.

REFERENCES

  • Brennan, M. J., and G. M. Lackmann, 2005: The influence of incipient latent heat release on the precipitation distribution of the 24–25 January 2000 U.S. east coast cyclone. Mon. Wea. Rev., 133, 1913–1937.

  • Buizza, R., and P. Chessa, 2002: Prediction of the U.S. storm of 24–26 January 2000 with the ECMWF Ensemble Prediction System. Mon. Wea. Rev., 130, 1531–1551.

  • Charles, M., 2008: Verification of extratropical cyclones within NCEP forecast models using an automated cyclone tracking algorithm. M.S. thesis, School of Marine and Atmospheric Sciences, Stony Brook University, Stony Brook, NY, 135 pp. [Available from SoMAS, Stony Brook University, Stony Brook, NY 11794-5000.]

  • Charles, M., and B. A. Colle, 2009: Verification of extratropical cyclones within the NCEP operational models. Part I: Analysis errors and short-term NAM and GFS forecasts. Wea. Forecasting, 24, 1173–1190.

  • Davis, C. A., and K. A. Emanuel, 1991: Potential vorticity diagnostics of cyclogenesis. Mon. Wea. Rev., 119, 1929–1953.

  • Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459.

  • Du, J., J. McQueen, G. DiMego, Z. Toth, D. Jovic, B. Zhou, and H. Chuang, 2006: New dimension of NCEP Short-Range Ensemble Forecasting (SREF) system: Inclusion of WRF members. Preprints, WMO Expert Team Meeting on Ensemble Prediction System, Exeter, United Kingdom, World Meteorological Organization. [Available online at http://wwwt.emc.ncep.noaa.gov/mmb/SREF/reference.html.]

  • Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.

  • Froude, L. S. R., L. Bengtsson, and K. I. Hodges, 2007: The prediction of extratropical storm tracks by the ECMWF and NCEP ensemble prediction systems. Mon. Wea. Rev., 135, 2545–2567.

  • Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205.

  • Hacker, J. P., E. S. Krayenhoff, and R. B. Stull, 2003: Ensemble experiments on numerical weather prediction error and uncertainty for a North Pacific forecast failure. Wea. Forecasting, 18, 12–31.

  • Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.

  • Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.

  • Jones, M. S., B. A. Colle, and J. S. Tongue, 2007: Evaluation of a mesoscale short-range ensemble forecast system over the northeast United States. Wea. Forecasting, 22, 36–55.

  • Langland, R. H., M. A. Shapiro, and R. Gelaro, 2002: Initial condition sensitivity and error growth in forecasts of the 25 January 2000 East Coast snowstorm. Mon. Wea. Rev., 130, 957–974.

  • McQueen, J., J. Du, B. Zhou, G. Manikin, B. Ferrier, H.-Y. Chuang, G. DiMego, and Z. Toth, 2005: Recent upgrades to the NCEP Short-Range Ensemble Forecasting System (SREF) and future plans. Preprints, 17th Conf. on Numerical Weather Prediction/21st Conf. on Weather Analysis and Forecasting, Washington, DC, Amer. Meteor. Soc., 11A.2. [Available online at http://ams.confex.com/ams/pdfpapers/94665.pdf.]

  • Murphy, A. H., 1973: Hedging and skill scores for probability forecasts. J. Appl. Meteor., 12, 215–223.

  • Novak, D. R., D. Bright, and M. Brennan, 2008: Operational forecaster uncertainty needs and future roles. Wea. Forecasting, 23, 1069–1084.

  • Silberberg, S. R., and L. F. Bosart, 1982: An analysis of systematic cyclone errors in the NMC LFM-II model during the 1978–79 cool season. Mon. Wea. Rev., 110, 254–271.

  • Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. Mon. Wea. Rev., 127, 433–446.

  • Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.

  • Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: Generalization of the discrete Brier and ranked probability skill scores for weighted multimodel ensemble forecasts. Mon. Wea. Rev., 135, 2778–2785.

  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.

  • Zhang, F., C. Snyder, and R. Rotunno, 2002: Mesoscale predictability of the “surprise” snowstorm of 24–25 January 2000. Mon. Wea. Rev., 130, 1617–1632.

Fig. 1.

Geographical regions (1–6) used in the cyclone verification.

Citation: Weather and Forecasting 24, 5; 10.1175/2009WAF2222170.1

Fig. 2.

The central pressure MAEs (mb) vs forecast hour averaged for the 2004–07 cool seasons for the GFS, NAM, and SREF mean (and its subcomponents) for regions (a) 1–2, (b) 3–4, and (c) 5–6. Confidence intervals at the 90% significance level are given by the vertical bars. The SREF forecast hours are labeled at the bottom of each panel, while the GFS and NAM forecast hours are at the top. The inset legend shows the symbols and line types used for the various members.

Fig. 3.

As in Fig. 2, but for the cyclone central pressure ME (mb).

Fig. 4.

As in Fig. 2, but for the cyclone displacement (km).

Fig. 5.

As in Fig. 2, but here for SREF (and its subcomponents, including six WRF members).

Fig. 6.

As in Fig. 5, but for the cyclone displacement (km).

Fig. 7.

Time series of SREF MAEs of the cyclone central pressure minus the GFS MAEs at hour 48 for regions (a) 1, (b) 4, and (c) 6, during the 2004–07 cool seasons (separated by gray vertical lines). Horizontal black dashed lines denote one standard deviation above and below the mean SREF − GFS error difference for each region. The solid black line is the 30-day running mean of the cyclone central pressure error differences. Interesting periods for text discussion are highlighted with boxes and numbered.

Fig. 8.

SREF rank histograms of the cyclone central pressure for hours 3–15 for regions (a) 1, (c) 2, and (e) 3, as well as for hours 51–63 for regions (b) 1, (d) 2, and (f) 3.

Fig. 9.

As in Fig. 8, but for regions 4–6.

Fig. 10.

Histograms showing the percent time each member in SREF is best during the 2004–07 cool seasons for the cyclone central pressure for regions (a)–(f) 1–6, respectively, for hours 3–15.

Fig. 11.

As in Fig. 10, but at hours 51–63.

Fig. 12.

As in Fig. 10, but for the cyclone position forecasts at hours 51–63.

Fig. 13.

BSs during the 2004–07 cool seasons for regions (a) 1–2, (b) 3–4, and (c) 5–6 calculated for the cyclone central pressure over a range of thresholds for the SREF, GFS, NAM, and GFS + NAM blend for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars. The inset legend shows the symbols and the line types used for the various members.

Fig. 14.

As in Fig. 13, but for SREF forecasts F27–63 using the GFS, NAM, and GFS + NAM blend as the reference forecasts. Values of BSS greater than zero indicate that the SREF forecasts are an improvement over the reference forecasts from the various models. The inset legend shows the symbols and line types used for the various members.

Fig. 15.

BSs and their components (REL, RES, and UNC) during the 2004–07 cool seasons for regions (a) 1–2, (b) 3–4, and (c) 5–6, calculated for the cyclone central pressure over a range of thresholds for the SREF for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars. The inset legend shows the symbols and line types used for the various members.

Fig. 16.

As in Fig. 13, but the calculation of the Brier score includes only events in which the forecast probability and/or the observed probability of the event occurring was greater than 0. Confidence intervals at the 90% significance level are given by the vertical bars.

Fig. 17.

As in Fig. 13, but for SREF (including WRF members), GFS, NAM, and GFS + NAM blend for F27–63. Confidence intervals at the 90% significance level are given by the vertical bars.

Fig. 18.

As in Fig. 17, but for the BSS.

Fig. 19.

Reliability diagrams for the SREF 15-member ensemble at F27–63 with respect to the average cyclone central pressure for regions (a) 1–2, (c) 3–4, and (e) 5–6, as well as for cyclones that are 1.5 standard deviations below the average central pressure for regions (b) 1–2, (d) 3–4, and (f) 5–6. Reliability is represented by the solid curve with markers plotted every 20%. A perfect ensemble forecast is shown by the 1:1 solid line. The tilted dashed line indicates an ensemble with no skill, and the horizontal dashed line indicates the climatology (no resolution).

Fig. 20.

Surface analysis and station models (full barb, 10 kt) from the Hydrometeorological Prediction Center (HPC) valid at 1200 UTC 3 Feb 2006.

Fig. 21.

(a) Cyclone tracks from F00 to F51 (every 6 h) for the SREF EBM, EKF, and RSM members initialized at 0900 UTC 2 Feb 2006, the GFS and NAM initialized at 1200 UTC 2 Feb 2006, and the observed track (solid boldface and black). The inset legend shows the symbols and line types used for the various members. (b) As in (a), but showing the central pressure vs forecast hour. The SREF forecast hours are labeled at the bottom of each panel, while the GFS and NAM forecast hours are at the top.

Fig. 22.

As in Fig. 21, but for the SREF cyclone event initialized at 0900 UTC 15 Dec 2005.

Table 1.

Significant changes made to the NCEP SREF system from 2004 to 2007.

Table 2.

Percentage best for the cyclone central pressure for regions 1–6 for periods 27–39 and 51–63 h, in which S is the SREF mean, N is the NAM, G is the GFS, and N + G is the combination of NAM and GFS.

Table 3.

As in Table 2, but for the percentage best for the cyclone position error.

Table 4.

MAEs from the GFS for the cyclone central pressure (mb) for regions 1–6.

1

To increase the sample size for the rank histogram, the number of bins used was decreased to 12, since a cyclone could not always be objectively identified for all members. When there were more than 12 members with a cyclone, the best 12 SREF members were selected. However, some of the SREF underdispersion may be the result of using this more limited membership.
