• Alhamed, A., and S. Lakshmivarahan, 2002: Cluster analysis of multimodel ensemble data from SAMEX. Mon. Wea. Rev., 130, 226–256.
• Anthes, R. A., 1986: The general question of predictability. Mesoscale Meteorology and Forecasting, P. S. Ray, Ed., Amer. Meteor. Soc., 636–656.
• Betts, A. K., and M. J. Miller, 1986: A new convective adjustment scheme. Part I: Observational and theoretical basis. Quart. J. Roy. Meteor. Soc., 112, 677–692.
• Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
• Bright, D. R., and S. L. Mullen, 2002: The sensitivity of the numerical simulation of the southwest monsoon boundary layer to the choice of PBL turbulence parameterization in MM5. Wea. Forecasting, 17, 99–114.
• Colle, B. A., K. J. Westrick, and C. F. Mass, 1999: Evaluation of MM5 and Eta-10 precipitation forecasts over the Pacific Northwest during the cool season. Wea. Forecasting, 14, 137–154.
• Colle, B. A., J. B. Olson, and J. S. Tongue, 2003a: Multiseason verification of the MM5. Part I: Comparison with the Eta Model over the central and eastern United States and impact of MM5 resolution. Wea. Forecasting, 18, 431–457.
• Colle, B. A., J. B. Olson, and J. S. Tongue, 2003b: Multiseason verification of the MM5. Part II: Evaluation of high-resolution precipitation forecasts over the northeastern United States. Wea. Forecasting, 18, 458–479.
• Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459.
• Du, J., G. DiMego, M. S. Tracton, and B. Zhou, 2003: NCEP short-range ensemble forecasting (SREF) system: Multi-IC, multi-model and multi-physics approach. Research Activities in Atmospheric and Oceanic Modeling, J. Côté, Ed., CAS/JSC Working Group on Numerical Experimentation Rep. 23, WMO/TD 1161, 5.09–5.10.
• Eckel, F. A., and C. F. Mass, 2005: Aspects of effective short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.
• Frank, W. M., 1983: The cumulus parameterization problem. Mon. Wea. Rev., 111, 1859–1871.
• Grell, G. A., 1993: Prognostic evaluation of assumptions used by cumulus parameterizations. Mon. Wea. Rev., 121, 764–787.
• Grell, G. A., Y.-H. Kuo, and R. J. Pasch, 1991: Semiprognostic tests of cumulus parameterization schemes in the middle latitudes. Mon. Wea. Rev., 119, 5–31.
• Grimit, E. P., 2004: Probabilistic mesoscale forecast error prediction using short-range ensembles. Ph.D. dissertation, University of Washington, 146 pp. [Available from Dept. of Atmospheric Sciences, University of Washington, Seattle, WA 98195.]
• Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205.
• Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
• Hong, S.-Y., and H.-L. Pan, 1996: Nonlocal boundary layer vertical diffusion in a medium-range forecast model. Mon. Wea. Rev., 124, 2322–2339.
• Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.
• Janjić, Z. I., 1994: The step-mountain eta coordinate model: Further developments of the convection, viscous sublayer, and turbulence closure schemes. Mon. Wea. Rev., 122, 927–945.
• Jones, M. S., 2004: Evaluation of a mesoscale short-range ensemble forecasting system over the northeast U.S. M.S. thesis, Marine Sciences Research Center, Stony Brook University, 135 pp. [Available from MSRC, Stony Brook University/SUNY, Stony Brook, NY 11794-5000.]
• Kain, J. S., 2004: The Kain–Fritsch convective parameterization: An update. J. Appl. Meteor., 43, 170–181.
• Kain, J. S., and J. M. Fritsch, 1990: A one-dimensional entraining/detraining plume model and its application in convective parameterization. J. Atmos. Sci., 47, 2784–2802.
• Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141.
• McMurdie, L., and C. F. Mass, 2004: Major numerical forecast failures over the northeast Pacific. Wea. Forecasting, 19, 338–356.
• Richardson, D. S., 2001: Ensembles using multiple models and analyses. Quart. J. Roy. Meteor. Soc., 127, 1847–1864.
• Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. Environment Canada Research Rep. 89-5, 114 pp. [Available from Forecast Research Division, Atmospheric Environment Service, 4905 Dufferin St., Downsview, ON M3H 5T4, Canada.]
• Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. Mon. Wea. Rev., 131, 2510–2524.
• Stensrud, D. J., J. Bao, and T. T. Warner, 2000: Using initial condition and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. Mon. Wea. Rev., 128, 2077–2107.
• Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.
• Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.
• Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.
• Wang, W., and N. L. Seaman, 1997: A comparison study of convective parameterization schemes in a mesoscale model. Mon. Wea. Rev., 125, 252–278.
• Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.
• Yussouf, N., D. J. Stensrud, and S. Lakshmivarahan, 2004: Cluster analysis of multimodel ensemble data over New England. Mon. Wea. Rev., 132, 2452–2462.
• Zhang, D.-L., and R. A. Anthes, 1982: A high-resolution model of the planetary boundary layer—Sensitivity tests and comparisons with SESAME-79 data. J. Appl. Meteor., 21, 1594–1609.
• Zhang, D.-L., and W.-Z. Zheng, 2004: Diurnal cycles of surface winds and temperatures as simulated by five boundary layer parameterizations. J. Appl. Meteor., 43, 157–169.
• Zhang, F., C. Snyder, and R. Rotunno, 2002: Mesoscale predictability of the “surprise” snowstorm of 24–25 January 2000. Mon. Wea. Rev., 130, 1617–1632.
• Zou, X., and Y.-H. Kuo, 1996: Rainfall assimilation through an optimal control of initial and boundary conditions in a limited-area mesoscale model. Mon. Wea. Rev., 124, 2859–2882.
Figure captions

Fig. 1. Location of the (a) 36- and (b) 12-km MM5 domains. The SA and COOP sites are plotted in (b) using open and filled circles, respectively. The 2-m temperature, 10-m wind speed and direction, and sea level pressure verification statistics use the SA sites, while the 24-h precipitation verification includes both SA and COOP sites.

Fig. 2. Diurnal RMSEs every 1 h for (a) 2-m temperature, (b) SLP, (c) 10-m wind speed, and (d) 10-m wind direction for the warm season 0000 UTC forecasts made by the three ensemble means and the 0000 and 1200 UTC CTL members, averaged over the 12-km domain.

Fig. 3. Same as in Fig. 2 but for the cool season.

Fig. 4. ETSs for the warm season (black lines) and cool season (gray lines) 0000 UTC 24-h QPFs (12–36 h) made by the three ensemble means and the 0000 UTC CTL member vs the 24-h QPF event threshold.

Fig. 5. BSSs for 2-m temperature forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Skill scores were compiled for the (a) warm season night, (b) warm season day, (c) cool season night, and (d) cool season day, using forecast hours 25–36 and 37–48 as night and day, respectively.

Fig. 6. Reliability diagrams for 2-m temperature (2mT) forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Reliability statistics were compiled under the conditions of (a) warm season nighttime 2mT over 18°C, (b) warm season daytime 2mT over 29°C, (c) cool season nighttime 2mT below 0°C, and (d) cool season daytime 2mT over 10°C, using forecast hours 25–36 and 37–48 as night and day, respectively. The solid 1:1 line represents perfect reliability, while the lower diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 7. BSSs for SLP forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Skill scores were compiled for the (a) warm and (b) cool seasons.

Fig. 8. Reliability diagrams for SLP forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Reliability statistics were compiled under the conditions of (a) warm season SLP over 1024 mb, (b) warm season SLP below 1010 mb, (c) cool season SLP over 1024 mb, and (d) cool season SLP below 1010 mb. The solid 1:1 line represents perfect reliability, while the lower diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 9. BSSs for 24-h precipitation forecasts made by the three ensemble means. Skill scores were compiled for the (a) warm and (b) cool seasons.

Fig. 10. Reliability diagrams for three 24-h precipitation thresholds: (a) 0.1 in. (2.54 mm), (b) 0.5 in. (12.70 mm), and (c) 0.7 in. (17.78 mm) during the warm season. The solid 1:1 line represents perfect reliability, while the lower straight diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 11. Same as in Fig. 10 but for the cool season.

Fig. 12. Verification rank histograms for the 18-member 12-km ensemble for (top) 2-m temperature, (middle) SLP, and (bottom) 10-m wind speed. Statistics were gathered over the warm season for the night (forecast hours 0–12 and 25–36) and day (forecast hours 13–24 and 37–48) forecast periods. Raw (bias calibrated) ensemble forecasts correspond with the black (white) histogram bars. Inset percentages give the adjusted missing rate (MRadj) for each rank histogram.

Fig. 13. The frequency at which each 12-km ensemble member and the 18-member ensemble mean verify with the (a) lowest (best) and (b) highest (worst) MAEs for 2-m temperature, SLP, 10-m wind speed, and 10-m wind direction warm season forecasts. The raw and bias-corrected forecasts are given by the black and white bars, respectively.

Fig. 14. Same as in Fig. 13 but for the cool season.

Fig. 15. Scatterplots of 12-km domain-average variance (abscissa) vs domain-average MAE (ordinate) before (gray circles) and after (black x’s) a 14-day bias calibration is applied, for (a), (b) 2-m temperature, (c), (d) SLP, (e), (f) 10-m wind speed, and (g), (h) 10-m wind direction forecasts by the 18-member ensemble mean during the warm season. The MAE–variance correlation coefficients are shown in (i) and (j) before (white) and after (black) the bias calibration is applied.

Fig. 16. Diurnal MEs every 1 h for (a) 2-m temperature, (b) SLP, (c) 10-m wind speed, and (d) 10-m wind direction for the warm season 0000 UTC forecasts for all members, the three ensemble means, and the 14-day bias-calibrated 18-member ensemble mean, averaged over the 12-km domain.

Fig. 17. The 24-h quantitative precipitation BIAS over the 12-km domain during the (a) warm and (b) cool seasons for the three ensemble means (PHS, IC, and ALL) and four convective parameterization groups (BM, GR, KF, and KF2).

Fig. 18. RMSEs averaged over the 12-km domain every 1 h for SLP during the (a) cool and (b) warm seasons for each ensemble member (color coded by PBL type), the three ensemble means (PHS, IC, and ALL), and the 14-day bias calibration applied to the ALL ensemble (ALLBC). See Table 1 for run abbreviations.


Evaluation of a Mesoscale Short-Range Ensemble Forecast System over the Northeast United States

  • 1 Institute for Terrestrial and Planetary Atmospheres, Stony Brook University, Stony Brook, New York
  • 2 NOAA/National Weather Service, Upton, New York

Abstract

A short-range ensemble forecast system was constructed over the northeast United States down to 12-km grid spacing using 18 members from the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5). The ensemble consisted of 12 physics members with varying planetary boundary layer schemes and convective parameterizations as well as seven different initial conditions (ICs) [five National Centers for Environmental Prediction (NCEP) Eta-bred members at 2100 UTC and the 0000 UTC NCEP Global Forecast System (GFS) and Eta runs]. The full 18-member ensemble (ALL) was verified at the surface for the warm (May–September 2003) and cool (October 2003–March 2004) seasons. A randomly chosen subset of seven physics (PHS) members at each forecast hour was used to quantitatively compare with the seven IC members. During the warm season, the PHS ensemble predictions for surface temperature and wind speed had more skill than the IC ensemble and a control (shared PHS and IC member) run initialized 12 h later (CTL12). During the cool and warm seasons, a 14-day running-mean bias calibration applied to the ALL ensemble (ALLBC) added 10%–30% more skill for temperature, wind speed, and sea level pressure, with the ALLBC far outperforming the CTL12. For the 24-h precipitation, the PHS ensemble had comparable probabilistic skill to the IC ensemble during the warm season, while the IC subensemble was more skillful during the cool season. All ensemble members had large diurnal surface biases, with ensemble variance approximating ensemble uncertainty only for wind direction. Selection of ICs was also important, because during the cool season the NCEP-bred members introduced large errors into the IC ensemble for sea level pressure, while none of the subensembles (PHS, IC, or ALL) outperformed the GFS–MM5 for sea level pressure.

Corresponding author address: Dr. Brian A. Colle, Marine Sciences Research Center, Stony Brook University, Stony Brook, NY 11794-5000. Email: brian.colle@stonybrook.edu


1. Introduction

Significant model errors can develop in relatively short-range predictions (0–48-h forecasts), such as for the January 2000 “surprise” East Coast snowstorm (Zhang et al. 2002) and the major forecast failures documented over the northeast Pacific (McMurdie and Mass 2004). These errors in numerical weather prediction result from uncertainty in the initial conditions (ICs; Lorenz 1963; Anthes 1986) and imperfect physical parameterizations (Frank 1983; Grell et al. 1991). As a result, several recent studies have explored the benefits and shortcomings of short-range ensemble forecast (SREF) modeling systems (Stensrud et al. 2000; Wandishin et al. 2001; Grimit and Mass 2002; Alhamed and Lakshmivarahan 2002; among others). Developers of these SREF systems have quantified the impact of initial condition uncertainty, model dynamics diversity, and model physics variability on short-term forecasts.

The relative importance of physics (PHS) versus IC uncertainty is important when constructing an ensemble. For example, because there is relatively large IC uncertainty over the Pacific Ocean given the limited in situ data over that region, Grimit and Mass (2002) used a multianalysis (IC) approach drawing on several different operational centers to show the ensemble benefits for forecasting wind direction and precipitation over the Pacific Northwest at 12-km grid spacing. It has also been shown that IC uncertainties can lead to large precipitation errors in short-range predictions for midlatitude cyclones over the central and eastern United States (Du et al. 1997; Zou and Kuo 1996). However, the importance of accounting for model error to improve ensemble performance is also well established (Houtekamer et al. 1996; Stensrud et al. 2000), because there are large model sensitivities to parameterized deep convection (Wang and Seaman 1997) and boundary layer processes (Bright and Mullen 2002; Zhang and Zheng 2004). Therefore, utilizing multiple physical parameterizations (Stensrud et al. 2000; Bright and Mullen 2002) or combining different physics with multiple ICs or models has been shown to be a useful ensemble approach (Wandishin et al. 2001; Alhamed and Lakshmivarahan 2002).

The performance of a multiphysics or multimodel ensemble may be sensitive to which parameterizations are used. For example, Yussouf et al. (2004) noted that the fifth-generation Pennsylvania State University–National Center for Atmospheric Research (PSU–NCAR) Mesoscale Model (MM5) members tended to cluster over the northeast United States for 2-m surface temperature; however, they used the Medium-Range Forecast model (MRF; Hong and Pan 1996) and Blackadar (Zhang and Anthes 1982) boundary layer parameterizations, which share nearly identical surface flux representations. It is hypothesized that if a more diverse subset of model physical parameterizations can be identified, a more useful ensemble probability density function can be produced even using the same model. For example, Zhang and Zheng (2004) showed that a turbulent kinetic energy–based PBL scheme can produce markedly different surface temperature variations than the Blackadar and MRF boundary layer parameterizations.

Many operational ensemble systems combine different initial conditions with different physics and/or models. For example, the National Centers for Environmental Prediction (NCEP) SREF was recently updated to include convective parameterization diversity in the bred members for the Eta and regional spectral members (Du et al. 2003). Meanwhile, the University of Washington’s ensemble system over the Pacific Northwest also includes physics, sea surface temperature, and land-use diversity in addition to eight analyses from different operational centers (Eckel and Mass 2005). These efforts have shown some promise; however, the ensemble surface predictions are still underdispersed and require postprocessing bias removal (Eckel and Mass 2005). Unfortunately, when using this hybrid (combined physics and IC perturbation) approach, one cannot easily identify poorly performing parameterizations or IC analyses that may persistently create ensemble outliers or clustering, since each hybrid member includes more than one IC or physics change. Therefore, improving an ensemble also requires a more critical understanding of the individual physics packages, IC analyses, and model numerical cores. Until the strengths and weaknesses of available parameterizations and ICs are better quantified, SREF systems will likely continue to be constructed by simply combining as many ICs, physics packages, surface parameter variations, and models as computer power will allow, in the hope that some combination will dramatically improve ensemble performance. Therefore, a goal of our study was not to develop the most sophisticated hybrid or multimodel ensemble, but rather to construct an ensemble in which each member has only one physical parameterization or IC change, so that the components of the ensemble can be more easily evaluated.

Most SREF studies have focused on the Pacific Northwest or the central United States, and there have been few published SREF verification studies over the northeast United States. Stensrud and Yussouf (2003) and Yussouf et al. (2004) focused on summer temperature prediction over the northeast United States using a multimodel approach with varied physics, lagged initializations, varied resolutions, and a 7-day bias correction. The weather across the Northeast also poses different challenges than the regions where other SREF systems have been documented. The Great Lakes, the Appalachian Mountains, urban centers, irregular coastlines, and the Gulf Stream and Labrador Currents all add mesoscale complexity and result in model errors that vary significantly from season to season (Colle et al. 2003a, b). Thus, a SREF system over this region requires evaluation for both the warm and cool seasons in order to quantify the relative importance of the IC and PHS uncertainties.

This paper summarizes a SREF system that was developed at Stony Brook University over the northeast United States in collaboration with several National Oceanic and Atmospheric Administration National Weather Service (NWS) forecast offices as part of a Cooperative Program for Operational Meteorology, Education, and Training (COMET) project. The 18-member SREF system utilized the MM5 at 12-km grid spacing, which at the time of this research was the highest-resolution and largest operational SREF ensemble over the Northeast. This study addresses the following questions:

  • How well does the full 18-member ensemble (ALL), as well as the PHS and IC subensembles, perform for surface temperature, winds, and precipitation over the northeast United States?
  • How is ensemble skill influenced by individual member biases and the fact that all members may not be equally skillful on average?
  • Does a simple postprocessing technique significantly mitigate ensemble bias and improve probabilistic skill?

2. Model and verification methods

a. SREF design

A mesoscale SREF system was constructed using 18 members of the MM5, version 3.6. The MM5 was run on an outer 36-km domain that extended from the Rocky Mountains to the western Atlantic Ocean and a 12-km (one way) nested grid that covered much of the northeast United States (Fig. 1a). Thirty-three sigma levels were used in the vertical, with maximum resolution in the boundary layer. The terrain for the 36- and 12-km grids was analyzed using 5′ and 30″ terrain datasets, respectively, while a 30″ land-use dataset was used to initialize 25 land surface categories.

The full 18-member (ALL) ensemble comprises 12-member and 7-member subensembles using different model PHS and ICs, respectively. Member 6 is shared between the PHS and IC subensembles and is labeled the CTL member, because it applies the same physics and ICs as the real-time MM5 that has been run twice daily over the northeast United States for the past several years (Colle et al. 2003a). We realize that not combining the mixed physics with the different ICs may limit ensemble performance, especially for IC members during the warm season. However, as mentioned in the introduction, our purpose was to better understand the relative performance of the PHS and IC ensembles separately over the northeast United States. The 12 PHS members were initialized using the 12-km Eta interpolated to the NCEP-221 grid (32-km grid spacing, 25-mb vertical resolution). These analyses were bilinearly interpolated to the MM5 grid, while boundary conditions were obtained by linearly interpolating the 3-h Eta-104 forecasts (90-km grid spacing, 25-mb vertical resolution). The U.S. Navy Optimum Thermal Interpolation System sea surface temperature analyses (∼30-km grid spacing) were used to initialize the MM5 sea surface temperatures, the daily U.S. Air Force snow distribution grids (∼45-km grid spacing) were used to initialize the model snow cover, and the Eta-221 grid was used to initialize the soil moisture. Table 1 lists the 12 PHS members, which combine three MM5 PBL schemes described in the appendix of Zhang and Zheng (2004), namely the Blackadar (BLK; Zhang and Anthes 1982), Mellor–Yamada–Janjić (MYJ; Janjić 1994, as used in the Eta), and MRF (Hong and Pan 1996) schemes, with four MM5 convective parameterizations (CPs): Betts–Miller (BM; Betts and Miller 1986), Grell (GR; Grell 1993), Kain–Fritsch (KF; Kain and Fritsch 1990), and Kain–Fritsch-2 (KF2; Kain 2004). Although the lateral boundary conditions (LBCs) were not perturbed for the PHS members, the impact of using the same LBC forcing (NCEP Eta) was likely relatively small over the northeast United States, given the large impact of the model parameterizations on ensemble surface errors highlighted in subsequent sections.
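Table 1 is not reproduced here, but assuming the 12 PHS members span the full cross product of the three PBL schemes and four CPs named above, the roster can be sketched in a few lines (the member labels are hypothetical, not the run abbreviations of Table 1):

```python
from itertools import product

# PBL schemes and convective parameterizations named in the text.
PBL_SCHEMES = ["BLK", "MYJ", "MRF"]
CP_SCHEMES = ["BM", "GR", "KF", "KF2"]

# Assumed cross product: 3 PBL x 4 CP = 12 physics (PHS) members.
PHS_MEMBERS = [f"{pbl}-{cp}" for pbl, cp in product(PBL_SCHEMES, CP_SCHEMES)]
assert len(PHS_MEMBERS) == 12
```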

Five IC members were initialized at 0000 UTC using the 3-h forecasts from the 2100 UTC NCEP Eta-bred members, which used the Betts–Miller–Janjić (BMJ) convective parameterization (Janjić 1994). In addition to the control Eta member at 2100 UTC, two positive and two negative perturbations are generated at NCEP using the breeding of growing modes approach (Toth and Kalnay 1993, 1997). The Eta-bred forecasts at 3-h intervals were interpolated to a 90-km grid (the 104 grid), and these data were interpolated spatially to the MM5 grid to create the initial conditions and linearly in time to construct the boundary conditions. Another IC member was initialized from the 0000 UTC NCEP Global Forecast System (GFS) analysis at 1° resolution, with boundary conditions from the GFS at 6-h intervals. Table 1 lists the seven IC members. The ensemble system was run once daily at 0000 UTC, with the warm season of 1 May–30 September 2003 and the cool season of 1 October 2003–31 March 2004 examined in this study.

b. Verification approaches

One goal of this study was to determine the relative importance of the PHS and IC members during the warm and cool seasons. However, because the PHS and IC subensembles were unequal in size, a random subset of seven PHS members was chosen at each forecast hour in order to compare fairly with the seven IC members. Hereafter, PHS refers to this randomly generated seven-member physics subensemble. The performance of the ALL ensemble, which includes all 18 members, was also quantified and compared, because this was the full ensemble available to the NWS each day, and it provides some measure of the combined impact of including both PHS and IC members in an ensemble. It has been shown that the application of a bias correction can improve ensemble-mean forecasts (Richardson 2001; Eckel and Mass 2005). Thus, a bias correction was applied to the ALL ensemble forecasts of surface temperature, wind speed, wind direction, and sea level pressure (SLP) by using a 14-day running-mean bias calibration for each individual member (ALLBC). Specifically, the mean errors from the previous 14-day period for a particular forecast hour, ensemble member, and observation location were added to that member’s most recent forecast for the same hour and site. A 14-day window produced better results than 7- or 21-day windows (not shown).
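As a minimal sketch of this running-mean calibration, assume the forecasts and observations are held in NumPy arrays indexed by (day, member, forecast hour, station); the array layout, function name, and sign convention are illustrative, not the authors' code:

```python
import numpy as np

def bias_calibrate(fcst, obs, window=14):
    """14-day running-mean bias calibration (sketch).

    fcst: (days, members, hours, stations) raw forecasts
    obs:  (days, hours, stations) verifying observations
    The first `window` days are left uncorrected because no
    training period exists for them yet.
    """
    corrected = fcst.copy()
    for d in range(window, fcst.shape[0]):
        # Mean error (forecast minus observation) over the previous
        # window for each member, forecast hour, and station.
        me = np.nanmean(fcst[d - window:d] - obs[d - window:d, None], axis=0)
        # Removing each member's own mean error is equivalent to
        # adding the mean of (obs - fcst) to the newest forecast.
        corrected[d] = fcst[d] - me
    return corrected
```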

Verification results were compiled over the warm and cool seasons using conventional observations from the North American surface airways (SA) observing sites, which were also supplemented with the Coastal-Marine Automated Network and buoy stations (Fig. 1b). Those observations with wind speeds of less than 5 kt (∼2.5 m s−1) were not counted in the wind direction statistics in order to prevent large errors during light and variable winds. An additional 350 cooperative observer (COOP) stations were used to verify the MM5 precipitation forecasts (Fig. 1b). Model forecast data at the four grid points surrounding each observation location were bilinearly interpolated to the observation site for most low-level metrics. For precipitation, an inverse-distance Cressman weighting method was used to interpolate the model-forecasted precipitation from the grid points to the observation location (Colle et al. 2003b).
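For the precipitation interpolation, a sketch of one common inverse-distance Cressman weighting is given below; the classic weight function (R² − r²)/(R² + r²) and the 25-km influence radius are assumptions on my part, since the exact form used here is documented in Colle et al. (2003b):

```python
import numpy as np

def cressman_interp(grid_vals, grid_dist_km, radius_km=25.0):
    """Interpolate gridded precipitation to one observation site
    using Cressman inverse-distance weights (sketch).

    grid_vals:    precipitation at nearby model grid points
    grid_dist_km: distances from those grid points to the site
    radius_km:    influence radius R (illustrative value)
    """
    r2 = np.asarray(grid_dist_km, dtype=float) ** 2
    R2 = radius_km ** 2
    # Classic Cressman weight: near 1 at the site, 0 at radius R.
    w = np.where(r2 < R2, (R2 - r2) / (R2 + r2), 0.0)
    if w.sum() == 0.0:
        return np.nan  # no grid point within the influence radius
    return float(np.sum(w * np.asarray(grid_vals)) / w.sum())
```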

Standard measures of forecast skill were calculated for each ensemble member and ensemble-mean forecast, such as the mean error (ME) and root-mean-square error (RMSE). These errors were averaged over the 12-km MM5 domain. A Brier skill score (BSS),
\[ \mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}} \qquad (1) \]
was used to determine the skill improvement relative to climatology and the skill of the ensemble’s probabilistic forecasts (Wilks 1995), in which
\[ \mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n} (f_i - o_i)^2, \qquad (2) \]
where f_i is the forecast probability of occurrence for the ith forecast and o_i is the observed occurrence (1 for occurrence, 0 for nonoccurrence). The BSref term represents the uncertainty of an event, Cf(1 − Cf), in which Cf is the climatological frequency of a specific event, obtained by summing the number of events meeting a certain threshold (e.g., T < 0°C or SLP > 1000 mb) and dividing by the total number of cases. Following Wilks (1995), the Brier score (BS; Brier 1950) was partitioned in order to produce reliability diagrams that further illustrate the ensemble forecast quality. A perfectly reliable system issues forecast probabilities that match the observed relative frequency of the event.
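A compact sketch of Eqs. (1)–(2), with event probabilities taken as the fraction of ensemble members forecasting the event (the function and variable names are illustrative):

```python
import numpy as np

def brier_skill_score(prob_fcst, occurred):
    """Brier skill score vs the sample climatology, Eqs. (1)-(2).

    prob_fcst: forecast probabilities, e.g. the fraction of the 18
               members exceeding the event threshold for each case
    occurred:  1 where the event was observed, 0 otherwise
    """
    p = np.asarray(prob_fcst, dtype=float)
    o = np.asarray(occurred, dtype=float)
    bs = np.mean((p - o) ** 2)        # Eq. (2)
    cf = o.mean()                     # climatological event frequency
    bs_ref = cf * (1.0 - cf)          # uncertainty of the event
    return 1.0 - bs / bs_ref          # Eq. (1); > 0 beats climatology
```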
Verification rank histograms were constructed to examine the dispersion qualities of the ensemble (Hamill 2001). These histograms were built by tallying how often the observed value fell within or outside the ensemble’s forecast distribution. The adjusted missing rate (MRadj) is the fraction of cases, beyond the fraction expected by chance (MRexp), that fall outside the ensemble’s forecast distribution:
\[ \mathrm{MR}_{\mathrm{adj}} = \mathrm{MR} - \mathrm{MR}_{\mathrm{exp}}, \qquad (3) \]
where MR is the percentage of observations that fall outside the ensemble distribution and MRexp is the percentage expected by chance for an ensemble of M = 18 members:
\[ \mathrm{MR}_{\mathrm{exp}} = \frac{2}{M+1}. \qquad (4) \]
A perfect ensemble has an MRadj of zero.
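A sketch of the rank histogram and the adjusted missing rate of Eqs. (3)–(4); ties between members and the observation are ignored for brevity:

```python
import numpy as np

def rank_histogram(ens, obs):
    """Verification rank histogram and MRadj, Eqs. (3)-(4) (sketch).

    ens: (M members, n cases) ensemble forecasts
    obs: (n cases,) verifying observations
    """
    M, n = ens.shape
    # Rank 0..M of each observation in the ordered ensemble;
    # ranks 0 and M mean the observation fell outside the envelope.
    ranks = np.sum(ens < obs, axis=0)
    hist = np.bincount(ranks, minlength=M + 1)
    mr = (hist[0] + hist[-1]) / n     # missing rate
    mr_exp = 2.0 / (M + 1)            # Eq. (4): ~10.5% for M = 18
    return hist, mr - mr_exp          # Eq. (3): MRadj (0 is perfect)
```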

To examine the 24-h precipitation bias and skill of the individual ensemble members and ensemble-mean forecasts, verification was based on the contingency table from Colle et al. (1999). The contingency bias score (BIAS) quantifies the forecast frequency of occurrence relative to the observed frequency for precipitation greater than or equal to a given threshold. The equitable threat score (ETS) measures the skill in predicting precipitation amounts over a given threshold (Stanski et al. 1989).
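A sketch of the two contingency-table scores under their standard definitions, which I believe match Colle et al. (1999) and Stanski et al. (1989); the correction for hits expected by chance is what makes the threat score "equitable":

```python
import numpy as np

def bias_and_ets(fcst, obs, threshold):
    """Contingency BIAS and equitable threat score for 24-h
    precipitation >= threshold (sketch, standard definitions)."""
    f = np.asarray(fcst) >= threshold   # forecast yes/no
    o = np.asarray(obs) >= threshold    # observed yes/no
    hits = np.sum(f & o)
    # BIAS > 1: the event is forecast more often than observed.
    bias = f.sum() / o.sum()
    # Hits expected from a random forecast with the same frequency.
    hits_rand = f.sum() * o.sum() / f.size
    ets = (hits - hits_rand) / (f.sum() + o.sum() - hits - hits_rand)
    return bias, ets
```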

To estimate statistical significance, a nonparametric resampling (bootstrap) method was applied using a 90% confidence interval (Wilks 1995). For this method, a large number (∼1000) of random samples were drawn with replacement from the original population for each season, and the skill scores were recalculated for these artificial datasets. The resulting array of artificial skill scores was used to determine whether the true statistics fall within the probability ranges defining the confidence interval (the 5th and 95th percentiles). Given the large sample size, the 90% confidence bars are small; therefore, for presentation purposes these bars have been omitted, and significance is discussed in the text where appropriate.
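A sketch of the resampling test, assuming the verification statistic can be recomputed from per-forecast values; the sample count, seed, and names are illustrative:

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, nboot=1000, ci=90):
    """Nonparametric bootstrap confidence interval (sketch).

    values: per-forecast verification values for one season
    Returns the (5th, 95th) percentile bounds for ci=90.
    """
    rng = np.random.default_rng(0)
    n = len(values)
    # Recompute the statistic on artificial datasets drawn with
    # replacement from the original population.
    boot = np.array([stat(rng.choice(values, size=n, replace=True))
                     for _ in range(nboot)])
    half = (100 - ci) / 2
    return tuple(np.percentile(boot, [half, 100 - half]))
```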

3. Surface verification

a. Ensemble skill

This section highlights the diurnal variation of 12-km ensemble skill at the surface over the northeast United States during the warm and cool seasons.

1) Temperature, sea level pressure, and winds

Figure 2 shows the 48-h RMSEs starting at 0000 UTC for the PHS, IC, and ALL ensembles as compared with the CTL member (No. 6 in Table 1) and a CTL member initialized 12 h later at 1200 UTC (CTL12). During the warm season, the ensemble mean surface temperature errors oscillate diurnally, with the largest errors during the late-night and midday periods (Fig. 2a). The IC ensemble, CTL, and CTL12 have nocturnal temperature errors 10%–20% larger than the PHS ensemble, which exceeds the 95% confidence level based on nonparametric resampling tests. The ALL ensemble skill is similar to that of the PHS, even though the ALL has more members. During the day the IC ensemble is 10%–15% better than the CTL member, which is significant at the 95% level. An additional 15%–25% improvement in skill for the ALL ensemble is obtained after applying a 14-day bias correction (ALLBC).

In contrast, for sea level pressure there is little skill improvement of the ensemble means over the CTL (Fig. 2b). In fact, the CTL12 run is slightly better than the ensemble means after hour 20. There is little diurnal variation of sea level pressure errors, but rather a steady increase in error with forecast lead times. A 14-day bias correction (ALLBC) results in a dramatic (30%–50%) reduction in RMSEs, with the ALLBC outperforming the deterministic sea level pressure forecast initialized 12 h later (CTL12).

Wind speed also has large diurnal variations in RMSEs during the warm season (Fig. 2c), with much larger errors occurring at night. There is a 10%–15% reduction in wind speed errors for the PHS and ALL ensembles over the CTL and CTL12, which is significant at the 95% level. However, the largest wind speed improvement occurs for the ALLBC, which reduces the ALL ensemble RMSEs by 30%–50% at night and helps to remove most of the diurnal variation in RMSEs. In contrast, the wind direction errors are similar to the sea level pressure, with less diurnal variation and little difference between ensemble means (Fig. 2d). The wind direction ensemble means are 5%–10% better than the CTL, which is significant at the 95% level, but there is little bias correction improvement for wind direction.

During the cool season (Fig. 3), the RMSEs for the various subensembles are similar to those of the warm season. However, the temperature RMSEs for the PHS and ALL are only slightly less than the CTL member at night, while during the day the ALL mean shows no improvement as compared with the CTL or CTL12 runs (Fig. 3a). The ALLBC outperforms the CTL12 for temperature during the late day by 5%–10%, which is significant at the 95% level. Meanwhile, the cool season sea level pressure errors are 20%–50% larger than those of the warm season (Fig. 3b), with little or no increase in skill of the ensemble means over the CTL or CTL12. The wind speed errors during the cool season are similar in magnitude to those of the warm season (Fig. 3c), while cool season wind direction errors are 20%–30% smaller than for the warm season. The impact of the bias correction for wind speed and wind direction is similar between seasons.

2) Precipitation

The 24-h (12–36 h) quantitative precipitation forecast (QPF) skill was calculated during the warm and cool seasons over the 12-km domain. Figure 4 shows the equitable threat scores (ETSs) for the ensemble means and CTL. During the warm season the PHS ensemble has better (higher) ETSs than the IC members and CTL at all thresholds (Fig. 4), which is significant at the 95% level. The ALL performance is similar to that of the PHS.

The ALL ensemble mean has cool season ETSs that are 0.15–0.30 larger than for the warm season (Fig. 4). As in the warm season, the cool season PHS members have greater skill than the IC members on average, but the PHS benefit is smaller than that of the warm season, and only the improvements for thresholds less than 1.5 cm are significant at the 95% level. The ALL ensemble had the best ETSs for thresholds greater than 1.0 cm.

b. Probability-based measures

To evaluate the ensemble’s probabilistic forecasts of temperature, sea level pressure, and precipitation, BSSs were computed for the PHS, IC, and ALL ensembles during the warm and cool seasons. Reliability diagrams showing forecast probabilities versus observed relative frequencies are also presented, in which the solid 1:1 line represents perfect reliability (REL), while the lower diagonal line represents the line of “no skill,” where REL equals the resolution and BSS = 0 (Wilks 1995).
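A sketch of how the reliability points can be compiled: with an M-member ensemble the forecast probabilities take the discrete values k/M, and the observed relative frequency and sample size (shown as insets in the figures) are computed within each probability bin (names illustrative):

```python
import numpy as np

def reliability_points(prob_fcst, occurred, M=18):
    """Forecast probability vs observed relative frequency (sketch).

    prob_fcst: probabilities in {0, 1/M, ..., 1}; occurred: 0/1
    Returns (probability, observed frequency, sample size) per bin.
    """
    p = np.asarray(prob_fcst)
    o = np.asarray(occurred)
    points = []
    for k in range(M + 1):
        sel = np.isclose(p, k / M)
        if sel.any():
            # Perfect reliability: observed frequency equals k/M.
            points.append((k / M, o[sel].mean(), int(sel.sum())))
    return points
```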

1) Temperature

Figures 5a and 5b show the warm season BSSs for temperature predictions or observations exceeding a given threshold ranging from 5° to 25°C at night (25–36 h) and 10° to 30°C during the day (37–48 h). At night (Fig. 5a), the PHS subensemble has more probabilistic skill than the IC ensemble for all temperatures; however, both subensembles have no skill for very warm evening temperatures of >23°C. A 14-day bias correction (ALLBC) improves the skill of the ALL ensemble for temperatures between 20° and 25°C. The daytime BSSs are similar except that the rapid drop in skill in the raw ensemble data occurs for temperatures >27°C. The IC ensemble has no skill at 30°C (BSS = 0), while the PHS and ALLBC have BSSs of 0.32 and 0.54, respectively.

Figure 6 shows warm season reliability diagrams during the night (25–36 h) and day (37–48 h) periods for selected temperature thresholds. For relatively warm nights of >18°C during the warm season (Fig. 6a), the various subensembles overpredict the probabilities and have no skill at moderate to high (0.4–0.8) probabilities. The ALLBC forecast probabilities still exceed the observed relative frequencies. In contrast, during the day (Fig. 6b), the warm season temperatures of >29°C have some reliability, except for the PHS group at low to moderate probabilities (0.3–0.7). Meanwhile, the ALLBC results in nearly perfect reliability at all forecast probabilities, thus illustrating the benefit of postprocessing for relatively warm daytime temperatures over the northeast United States.

The cool season BSSs rapidly decrease at night for temperatures >11°C (Fig. 5c), and there is little difference among the subensembles or improvement using bias correction. In contrast, during the day the PHS members have 10%–20% more skill than do the IC and ALL ensembles (Fig. 5d), while the ALLBC adds 30%–40% more skill to the ALL ensemble. The ALL and ALLBC reliabilities are relatively good for temperatures <0°C at night for some low to moderate probabilities (Fig. 6c), but there is little reliability for probabilities of 0.7–0.9. For temperatures >10°C during the day (Fig. 6d), the various subensembles and ALLBC underforecast the low probabilities, while the IC and ALL ensembles overforecast some of the moderate to high probabilities (0.6–0.8).

2) Sea level pressure

Because there is little diurnal variability in the sea level pressure errors (cf. Figs. 2 and 3), the full 25–48-h forecast period was used in the probabilistic evaluations. During the warm season (Fig. 7a), the PHS ensemble has 10%–20% larger (better) BSSs than the IC ensemble for most thresholds. There is an additional 10%–15% improvement in the ALL ensemble after bias correction for sea level pressure forecasts >1010 mb. For the warm season periods >1024 mb (Fig. 8a), all ensemble components have fairly good reliability at most probabilities, with the ALLBC outperforming the ALL ensemble at moderate to high probabilities. In contrast, for <1010 mb during the warm season (Fig. 8b), the PHS ensemble has no reliability at most probabilities, while there is some forecast reliability in the IC and ALL ensembles for probabilities greater than 0.5. During the cool season (Fig. 7b), the sea level pressure BSSs are similar among the various subensembles, so there is much less PHS benefit than during the warm season (Fig. 7a). The bias correction (ALLBC) also adds little skill to the sea level pressure probabilistic forecasts at most thresholds. Most of the reliability problems occur at moderate probabilities (0.4–0.8) for relatively high pressure (>1024 mb) events (Fig. 8c). There is some reliability for cool season sea level pressures <1010 mb (Fig. 8d), but no subensemble outperforms the others.

3) Precipitation

During the warm season the precipitation BSSs decrease rapidly with increasing QPF threshold (Fig. 9a), with the PHS and IC ensembles showing less skill than climatology at thresholds >1.78 cm (0.7 in.). The PHS ensemble is only slightly better than the IC ensemble for the light to moderate thresholds, but this difference is still significant at the 95% level. The ALL ensemble has skill with respect to the sample climatology at thresholds up to 2.54 cm (1.0 in.).

For the 0.25-cm (∼0.1-in.) threshold during the warm season (Fig. 10a), lower probabilities tend to be underforecast and higher probabilities tend to be overforecast for the PHS, IC, and ALL ensembles. For the >1.27 cm (0.5 in.) threshold (Fig. 10b), the subensembles have little reliability at moderate probabilities (0.3–0.7), while the ALL ensemble has slightly more reliability than the IC and PHS at higher probabilities. Meanwhile, for >1.78 cm (0.7 in.) (Fig. 10c), the IC and PHS ensembles have little or no reliability at all probabilities, whereas the ALL ensemble has more reliability at high probabilities (>0.7).

The cool season BSSs for the 24-h (12–36 h) precipitation increase initially up to the 0.51-cm threshold (Fig. 9b) and then decrease gradually with increasing precipitation amount. The BSSs during the cool season are 2–3 times larger than those of the warm season for >1.27 cm (0.5 in.), with all cool season ensembles showing skill with respect to the sample climatology at all thresholds. The IC ensemble has greater skill than the PHS ensemble during the cool season, and the ALL ensemble has slightly more skill than the IC ensemble at >1.78 cm (0.7 in.).

During the cool season (Fig. 11), all subensembles tend to overpredict the 24-h precipitation probabilities, except at very low (<0.25) probabilities. Each subensemble’s reliability approaches the no-skill line even at the 0.254-cm (0.1-in.) threshold for low to midrange forecast probabilities (0.2–0.5) (Fig. 11a). The IC ensemble forecast probabilities have greater reliability than the PHS or ALL ensembles at all thresholds for probabilities over 0.6. Overall, these results suggest that an ensemble weighted more toward IC members can prove beneficial for cool season precipitation forecasts over the northeast United States.

c. Rank histograms

Figure 12 shows the rank histograms using the ALL and ALLBC ensembles for surface temperature, winds, and sea level pressure during the warm season. With no bias correction (black bars in Fig. 12), the extreme ranks of the histograms are overpopulated for all parameters (i.e., the histograms are U or L shaped) during the night (0–12 and 25–36 h) and day (13–24 and 37–48 h), with many observations falling outside the envelope of solutions. The pronounced L shape of the nighttime (0–12 and 25–36 h) histograms indicates a positive bias, and this feature is most prevalent in the 2-m temperature and 10-m wind speed distributions. The raw ensemble has adjusted missing rate (MRadj) values of over 30% for all parameters at all forecast periods in the warm season. The largest missing rates occur at night for temperature (46%) and wind speed (57%).

During the cool season (not shown), the ensemble captures even less of the observed temperature distribution than in the warm season. The 2-m temperature distribution is more U shaped in the cool season, especially during the day, which indicates that observed temperatures fall both below and above the forecast envelope. The observed cool season temperatures and wind speeds fall outside the ensemble’s distribution over 50% of the time.

After applying the 14-day bias correction (ALLBC) during the warm season (white bars in Fig. 12), the MRadj is reduced (improved) by ∼10% during both the day and night. The sea level pressure and wind speed distributions show the greatest MRadj reductions, with distributions becoming less L shaped. The cool season rank histograms are also improved (not shown), with over 10% MRadj reductions for all forecast periods and most parameters. Overall, a bias calibration shifts the ensemble envelope closer to reality by reducing the ensemble-wide bias; however, even after calibration, the ensemble remains underdispersed.

d. Percentage of best and worst

Figures 13a and 13b show the percentage of forecasts (0–48-h average) for which an individual member or the ALL mean (member 19) verifies the best (smallest RMSE) and worst (largest RMSE), respectively, during the warm season. These diagrams are useful in the probabilistic evaluation of an ensemble system, since it is important that all members have similar skill, and they also help gauge the relative performance of the various physics packages and IC analyses. For the raw ensemble (black bars in Fig. 13), the MYJ PBL members verify best at a greater frequency than the other members for 2-m temperature and 10-m wind speed (Fig. 13a), especially during the night (not shown). The IC members, particularly the bred members (Nos. 13–16) and the GFS–MM5 member (No. 18), verify best more often than the MRF and BLK PBL members (Nos. 5–12) for all variables. The GFS–MM5 member (No. 18) verifies best more often than any other member for temperature, sea level pressure, and wind direction. The ensemble mean verifies best about as often as the MRF and BLK PBL members. The application of a 14-day bias correction improves the best-member distribution slightly, with the MYJ PBL and IC members verifying best less often and the other PBL schemes more often.
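The bookkeeping behind these diagrams reduces to counting argmin/argmax across members for each forecast; a sketch (the error-array layout and function name are assumed for illustration):

```python
import numpy as np

def best_worst_frequency(errors):
    """Fraction of forecasts in which each column (18 members plus
    the ensemble mean as a 19th column) verifies best or worst.

    errors: (forecasts, columns) of 0-48-h average RMSEs
    """
    n, m = errors.shape
    best = np.bincount(errors.argmin(axis=1), minlength=m) / n
    worst = np.bincount(errors.argmax(axis=1), minlength=m) / n
    return best, worst   # each sums to 1 across the columns
```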

Meanwhile, the MYJ PBL and many IC members verify worst at a greater frequency than the other PHS members for 2-m temperature during the warm season (Fig. 13b), with most of the poor MYJ PBL forecasts occurring during the day (not shown). Most IC members verify worst at the highest frequency for sea level pressure and for 10-m wind speed and direction (Fig. 13b). Within the IC ensemble, the 0000 and 2100 UTC Eta–MM5 members (Nos. 6 and 17) verify worst at a lower frequency than the bred and GFS–MM5 members (Nos. 13–16 and 18). The 14-day bias correction does not substantially improve the distribution of the worst-member percentages.

During the cool season (Fig. 14), the GFS–MM5 member verifies best at a substantially higher frequency than the other members (Fig. 14a), especially for temperature and sea level pressure. On the other hand, the IC bred members verify worst more often than the other members (Fig. 14b), especially for sea level pressure and 10-m wind direction. The unequal performance of the bred members (Nos. 13–16) compared with their control (No. 17) is bothersome and will be discussed in section 4. Overall, the skill inequality among the 18 ensemble members is more prevalent in the cool season than in the warm season, and there is little improvement using bias correction (white bars in Fig. 14).

e. Predictability of ensemble skill

The ability of an ensemble to predict the skill of its mean forecast can be inferred from the degree of correlation between the amount of variance in the ensemble and the magnitude of the error of the ensemble mean. A higher correlation implies greater predictability of ensemble-mean skill, suggesting that ensemble variance is a good representation of ensemble uncertainty. This correlation provides only an estimate of the predictability of ensemble skill, because the spread–error correlation assumes a linear dependence between ensemble spread and forecast error, which has been shown to be invalid for many error metrics (Grimit 2004).
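A sketch of the spread–error correlation as described, assuming one domain-average value per forecast period (the array names and layout are illustrative):

```python
import numpy as np

def spread_error_correlation(ens, obs):
    """Correlation between domain-average ensemble variance and the
    domain-average MAE of the ensemble mean (sketch).

    ens: (cases, members, stations); obs: (cases, stations)
    Each case is one forecast period (e.g., one night or day).
    """
    spread = ens.var(axis=1).mean(axis=1)               # variance per case
    mae = np.abs(ens.mean(axis=1) - obs).mean(axis=1)   # MAE per case
    # Pearson correlation; linearity is assumed, which is one reason
    # this is only a rough measure of skill predictability (Grimit 2004).
    return float(np.corrcoef(spread, mae)[0, 1])
```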

Figure 15 summarizes the ability of the ALL ensemble to predict the skill of its ensemble-mean forecasts during the warm season. Each point in the scatterplot represents the 12-km domain-averaged ensemble variance versus the mean absolute error (MAE) averaged during the night (0–12 and 25–36 h) or day (13–24 and 37–48 h) periods for the ALL and ALLBC. The MAEs span a much wider range than the variances, producing a “column” pattern in the temperature scatterplots during the night and day (Figs. 15a and 15b). In other words, a wide range of 2-m temperature errors is associated with little variance between members, which suggests that many members share similar temperature biases. This results in relatively poor spread–error correlations, with correlation coefficients between 0.20 and 0.40 (Figs. 15i and 15j). The 2-m temperature error–variance patterns vary less for the cool season forecast periods than for the warm season (not shown), resulting in correlation coefficients of only 0.07–0.09. Unfortunately, the temperature correlation results do not change substantially after the calibration is applied (Figs. 15a and 15b).

The sea level pressure MAEs during the warm season also have little correlation (0.26–0.37) with the variance during the day or night (Figs. 15c and 15d), while the cool season error–variance correlations improve slightly, from 0.39 to 0.43 (not shown). For the 10-m wind speeds during the warm season night (Fig. 15e), the error–variance patterns cover a large range of MAE values but only a relatively small range of variance levels because of the ensemble-wide bias. During the day (Fig. 15f), the wind speed correlation coefficients are slightly larger (0.35) than at night (0.17).

Figures 15g and 15h show the 10-m wind direction error–variance patterns for the warm season during the night and day, respectively. In general, the errors tend to increase with increasing variance, with a correlation coefficient much higher than for the other variables. The 10-m wind direction correlations of ∼0.65 change only slightly after applying a bias calibration for the day period, while at night there is a large (50%) decrease in correlation (to ∼0.35) after bias correction, because many large-MAE events were introduced with little ensemble spread. The cool season 10-m wind direction forecasts tend to have less variance than the warm season forecasts (not shown), with correlation coefficients ranging from 0.60 to 0.64. Overall, the surface wind direction is the only surface parameter that shows some ability to predict forecast skill over the northeast United States.

4. Discussion and summary

This paper describes the performance of a SREF system that used 18 MM5 members at 12-km grid spacing. The ensemble contained different boundary layer schemes and CPs within the MM5 (12 members) as well as seven different ICs for the MM5, derived from the 2100 UTC NCEP Eta-bred members and the 0000 UTC Eta and GFS runs. To compare the PHS and IC subensembles fairly, seven members from the PHS ensemble were randomly chosen for each forecast hour.

The PHS members were most useful during the warm season, with the PHS ensemble having smaller RMSEs and more probabilistic skill for surface temperature and wind speed than the IC ensemble and the CTL run (the shared PHS and IC member), while the IC ensemble was more competitive during the cool season. This reinforces the results of Stensrud et al. (2000), who found that a model physics ensemble is more useful when the large-scale forcing for upward motion is weak, whereas different initial conditions become more useful when the large-scale forcing is strong.

The PHS ensemble had better (higher) ETSs for 24-h precipitation than did the IC members and the CTL during the warm season, while the PHS probabilistic skill for 24-h precipitation was slightly better than that of the IC ensemble. In contrast, during the cool season the PHS ensemble had much lower probabilistic skill for precipitation than did the IC ensemble.

The seven IC and 12 PHS members (one member is shared) were combined into the ALL ensemble. The ALL ensemble had errors comparable to a CTL run initialized 12 h later (CTL12) for surface wind speed and temperature; however, the ensemble performance was degraded by persistent model biases at the surface. When a 14-day running-mean bias correction was applied to the ALL ensemble (ALLBC), the ALLBC outperformed the CTL12 forecasts of temperature, sea level pressure, and wind speed on average. Bias correction enhanced the ALLBC skill by 15%–25% for surface wind speed and temperature, and by 30%–50% for sea level pressure. These results suggest that a bias calibration can dramatically increase the performance of an ensemble, such that the ensemble can outperform a deterministic run initialized 12 h later.

The probabilistic skill and ensemble reliabilities for the ALL ensemble were also evaluated after bias correction. The ALL ensemble had little probabilistic skill or reliability for unseasonably warm periods near 30°C, but did have moderate skill and reliability after the bias correction. In contrast, the ensemble had poor reliability during the warm season for sea level pressures less than 1010 mb and cool season pressures greater than 1024 mb.

The Stony Brook University (SBU) SREF has some ability to predict forecast skill and to estimate the uncertainty of a forecast through ensemble variance; however, the inherent problems of model bias and member clustering in the SBU ensemble system are similar to those noted for the hybrid PHS–IC ensemble system over the Pacific Northwest (Eckel and Mass 2005) and a multimodel ensemble over the Northeast (Yussouf et al. 2004). For the SBU SREF, the 10-m wind direction has the best error–variance correlation (0.6–0.7), which is slightly better than that over the Pacific Northwest (0.4–0.6 for all cases) in Grimit and Mass (2002).

The surface temperature, wind speed, and sea level pressure ensemble forecasts showed low (0.2–0.4) error–variance correlations even after bias calibration. The SBU SREF spread–error correlations for surface temperature are lower than those obtained by Stensrud and Yussouf (2003), who used a multimodel SREF over the northeast United States with varied physics, lagged initializations, and varied resolutions, and applied a 7-day bias correction. They also found that the MM5 members clustered for temperature, but their full ensemble apparently benefited from including other modeling systems.

The above results illustrate the difficulty of constructing a short-range ensemble system over the northeast United States. As found in other regions, such as the Pacific Northwest, large model biases limit the raw ensemble performance. For example, Fig. 16 shows the 12-km domain-averaged MEs during the warm season for each ensemble member (color coded by PBL type), the three ensemble means (PHS, IC, and ALL in boldface), and the ALLBC (gray). All members have a warm (0.5°–1.0°C) 2-m temperature bias at night (Fig. 16a), while the MYJ PBL members develop a 1°–2°C cool bias during the day. These strong diurnal temperature biases are likely due to imperfect land surface and boundary layer physics. For example, the daytime MYJ PBL cool bias is also associated with a large (20%–30%) moist bias at the surface (Jones 2004). In addition, too much mixing at night favors a near-surface warm bias, as well as the nocturnal high wind speed bias (0.5–1.75 m s−1) and positive (clockwise, or too geostrophic) wind direction bias (5°–10°) in all members (Figs. 16c and 16d), with the MYJ PBL having smaller wind biases than the other PBL members. The sea level pressures for most members tend to have a weak (∼0.5 mb) negative bias late in the day and a weak positive bias late at night (Fig. 16b).
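
The diurnal MEs in Fig. 16 are simply forecast-minus-observation averages stratified by forecast hour; a one-function sketch under assumed array shapes:

```python
import numpy as np

def diurnal_mean_error(fcst, obs):
    """Mean error (forecast minus observation) per forecast hour,
    averaged over cases and stations (cf. Fig. 16).
    fcst, obs: (n_cases, n_hours, n_stations) arrays; NaNs mark
    missing observations and are skipped."""
    return np.nanmean(np.asarray(fcst) - np.asarray(obs), axis=(0, 2))
```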

The large variation in low-level temperature among the MYJ PBL members using different CPs is interesting. Jones (2004) noted that these differences were associated with days that had convective precipitation over the northeast United States. Because the Grell CP does not trigger as often as the other CPs over the northeast United States (Colle et al. 2003b), more low-level cloud water is produced by the explicit cloud scheme (not shown). This additional low-level cloud water reduces the incoming shortwave radiation at the surface, which leads to cooler daytime temperatures than in the other CP members. The temperature variations are most prominent in the MYJ PBL members because the MYJ PBL has a low-level moist bias (Jones 2004), which favors more explicit cloud water production.

Precipitation forecasts also suffer from large biases that vary among members. For example, Fig. 17a shows the 24-h (12–36 h) precipitation bias over the northeast United States during the warm season, based on the contingency table, for the IC, PHS, and ALL subensembles, as well as for members grouped by convective parameterization. All individual members except the KF CP show increasing bias with increasing threshold amount. The intrinsic smoothing created by averaging the individual members' forecasts to create the ensemble mean leads to overprediction at low and middle thresholds and underprediction at high thresholds. During the cool season (Fig. 17b), there is less variation among the CP schemes, and the 24-h precipitation biases are larger than in the warm season by 0.1–0.2, although some of this increase may be attributable to rain gauge undercatchment of snow during the winter.
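
For clarity, the bias score referenced here is the frequency bias from the same contingency table used for the ETS above; a minimal sketch:

```python
import numpy as np

def frequency_bias(fcst, obs, thresh):
    """BIAS = (hits + false alarms) / (hits + misses): the ratio of
    forecast to observed event frequency at a given threshold
    (>1 indicates overprediction, <1 underprediction; cf. Fig. 17)."""
    f = np.asarray(fcst) >= thresh
    o = np.asarray(obs) >= thresh
    return np.sum(f) / np.sum(o)  # = (hits + false alarms) / (hits + misses)
```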

Obtaining an equally skillful set of ICs is also challenging. The unequal performance of the Eta-bred members (Nos. 13–16) as compared with the 2100 UTC control Eta member (No. 17) is bothersome (cf. Fig. 14b). This suggests that the NCEP bred members are initialized with, or develop, large errors that degrade their performance relative to the control member. For example, Fig. 18a shows the sea level pressure RMSEs for the individual members and subensembles during the cool season. The Eta-bred members have 20%–50% larger pressure errors than the other members, including the 2100 UTC Eta CTL member (ECT in Table 1), while the GFS–MM5 member (No. 18 in Table 1) is the most skillful member. The GFS–MM5 is also comparable to the ALLBC, which shows the potential benefit of using the GFS–MM5 to initialize ensemble members. In contrast, the RMSEs for the bred members during the warm season are not much larger than those of some of the other members after hour 24 (Fig. 18b). This suggests that most of the breeding problem occurs during the cool season, when there is more potential for large IC perturbations.

The large sea level pressure RMSEs at hour 0 in the Eta-bred members suggest that the problem was introduced during initialization (2100 UTC). This problem was also apparent during the 2004–2005 cool season verification (not shown). NCEP noted the same behavior in its SREF verification for the 2004–2005 cool season over the northeast United States (J. McQueen 2005, personal communication), with the 2100 UTC Eta-bred members having, on average, 20%–25% larger sea level pressure RMSEs than the control member (not shown). Apparently, this problem was passed on to the MM5 ensemble, which diminished its overall performance.

This work was a first step toward understanding the ensemble performance of many surface parameters over the northeast United States. It has shown the benefit of completing individual member evaluations in order to better interpret the overall ensemble performance. This work suggests that large deficiencies in the model physical parameterizations are the Achilles' heel of ensemble prediction. Therefore, more work is needed to further understand and fix some of the fundamental physics problems. Meanwhile, a more effective bias correction or postprocessing scheme (e.g., Bayesian model averaging, member dressing, model output statistics, or an ensemble Kalman filter) may also help alleviate some of the ensemble errors.

Acknowledgments

This work represents a portion of the first author's M.S. thesis. The research was supported by ONR (Grant N000014-00-1-0407) and UCAR–COMET (Grant S0238662). The authors thank Jun Du, David Novak, four anonymous reviewers, and the Joint-Chief Editor (Dr. D. Stensrud) for their helpful comments concerning this work. Collective insights from the COMET NWS partners improved the direction of this research. Use of the MM5 was provided by the Mesoscale and Microscale Meteorology (MMM) Division of NCAR, which is supported by the National Science Foundation.

REFERENCES

  • Alhamed, A., and S. Lakshmivarahan, 2002: Cluster analysis of multimodel ensemble data from SAMEX. Mon. Wea. Rev., 130, 226–256.

  • Anthes, R. A., 1986: The general question of predictability. Mesoscale Meteorology and Forecasting, P. S. Ray, Ed., Amer. Meteor. Soc., 636–656.

  • Betts, A. K., and M. J. Miller, 1986: A new convective adjustment scheme. Part I: Observational and theoretical basis. Quart. J. Roy. Meteor. Soc., 112, 677–692.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

  • Bright, D. R., and S. L. Mullen, 2002: The sensitivity of the numerical simulation of the southwest monsoon boundary layer to the choice of PBL turbulence parameterization in MM5. Wea. Forecasting, 17, 99–114.

  • Colle, B. A., K. J. Westrick, and C. F. Mass, 1999: Evaluation of MM5 and Eta-10 precipitation forecasts over the Pacific Northwest during the cool season. Wea. Forecasting, 14, 137–154.

  • Colle, B. A., J. B. Olson, and J. S. Tongue, 2003a: Multiseason verification of the MM5. Part I: Comparison with the Eta Model over the central and eastern United States and impact of MM5 resolution. Wea. Forecasting, 18, 431–457.

  • Colle, B. A., J. B. Olson, and J. S. Tongue, 2003b: Multiseason verification of the MM5. Part II: Evaluation of high-resolution precipitation forecasts over the northeastern United States. Wea. Forecasting, 18, 458–479.

  • Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459.

  • Du, J., G. DiMego, M. S. Tracton, and B. Zhou, 2003: NCEP short-range ensemble forecasting (SREF) system: Multi-IC, multi-model and multi-physics approach. Research Activities in Atmospheric and Oceanic Modeling, J. Cote, Ed., CAS/JSC Working Group Numerical Experimentation Rep. 23, WMO/TD 1161, 5.09–5.10.

  • Eckel, F. A., and C. F. Mass, 2005: Aspects of effective short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.

  • Frank, W. M., 1983: The cumulus parameterization problem. Mon. Wea. Rev., 111, 1859–1871.

  • Grell, G. A., 1993: Prognostic evaluation of assumptions used by cumulus parameterizations. Mon. Wea. Rev., 121, 764–787.

  • Grell, G. A., Y. Kuo, and R. J. Pasch, 1991: Semiprognostic tests of cumulus parameterization schemes in the middle latitudes. Mon. Wea. Rev., 119, 5–31.

  • Grimit, E. P., 2004: Probabilistic mesoscale forecast error prediction using short-range ensembles. Ph.D. dissertation, University of Washington, 146 pp. [Available from Dept. of Atmospheric Sciences, University of Washington, Seattle, WA 98195.]

  • Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205.

  • Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.

  • Hong, S.-Y., and H.-L. Pan, 1996: Nonlocal boundary layer vertical diffusion in a medium-range forecast model. Mon. Wea. Rev., 124, 2322–2339.

  • Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.

  • Janjić, Z. I., 1994: The step-mountain eta coordinate model: Further developments of the convection, viscous sublayer, and turbulence closure schemes. Mon. Wea. Rev., 122, 927–945.

  • Jones, M. S., 2004: Evaluation of a mesoscale short-range ensemble forecasting system over the northeast U.S. M.S. thesis, Marine Sciences Research Center, Stony Brook University, 135 pp. [Available from MSRC, Stony Brook University/SUNY, Stony Brook, NY 11794-5000.]

  • Kain, J. S., 2004: The Kain–Fritsch convective parameterization: An update. J. Appl. Meteor., 43, 170–181.

  • Kain, J. S., and J. M. Fritsch, 1990: A one-dimensional entraining/detraining plume model and its application in convective parameterization. J. Atmos. Sci., 47, 2784–2802.

  • Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141.

  • McMurdie, L., and C. F. Mass, 2004: Major numerical forecast failures over the northeast Pacific. Wea. Forecasting, 19, 338–356.

  • Richardson, D. S., 2001: Ensembles using multiple models and analyses. Quart. J. Roy. Meteor. Soc., 127, 1847–1864.

  • Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. Environment Canada Research Rep. 89-5, 114 pp. [Available from Forecast Research Division, Atmospheric Environment Service, 4905 Dufferin St., Downsview, ON M3H 5T4, Canada.]

  • Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. Mon. Wea. Rev., 131, 2510–2524.

  • Stensrud, D. J., J. Bao, and T. T. Warner, 2000: Using initial condition and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. Mon. Wea. Rev., 128, 2077–2107.

  • Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.

  • Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.

  • Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.

  • Wang, W., and N. L. Seaman, 1997: A comparison study of convective parameterization schemes in a mesoscale model. Mon. Wea. Rev., 125, 252–278.

  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.

  • Yussouf, N., D. Stensrud, and S. Lakshmivarahan, 2004: Cluster analysis of multimodel ensemble data over New England. Mon. Wea. Rev., 132, 2452–2462.

  • Zhang, D.-L., and R. Anthes, 1982: A high-resolution model of the planetary boundary layer—Sensitivity tests and comparisons with SESAME-79 data. J. Appl. Meteor., 21, 1594–1609.

  • Zhang, D.-L., and W.-Z. Zheng, 2004: Diurnal cycles of surface winds and temperatures as simulated by five boundary layer parameterizations. J. Appl. Meteor., 43, 157–169.

  • Zhang, F., C. Snyder, and R. Rotunno, 2002: Mesoscale predictability of the “surprise” snowstorm of 24–25 January 2000. Mon. Wea. Rev., 130, 1617–1632.

  • Zou, X., and Y.-H. Kuo, 1996: Rainfall assimilation through an optimal control of initial and boundary conditions in a limited-area mesoscale model. Mon. Wea. Rev., 124, 2859–2882.

Fig. 1. Location of the (a) 36- and (b) 12-km MM5 domains. The SA and COOP sites are plotted in (b) using open and filled circles, respectively. The 2-m temperature, 10-m wind speed and direction, and sea level pressure verification statistics use the SA sites, while the 24-h precipitation verification includes both SA and COOP sites.

Fig. 2. Diurnal RMSEs every 1 h for (a) 2-m temperature, (b) SLP, (c) 10-m wind speed, and (d) 10-m wind direction for the warm season 0000 UTC forecasts made by the three ensemble means and 0000–1200 UTC CTL member averaged over the 12-km domain.

Fig. 3. Same as in Fig. 2 but for the cool season.

Fig. 4. ETSs for the warm season (black lines) and cool season (gray lines) 0000 UTC 24-h QPFs (12–36 h) made by the three ensemble means and 0000 UTC CTL member vs 24-h QPF event threshold.

Fig. 5. BSSs for 2-m temperature forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Skill scores were compiled for the (a) warm season night, (b) warm season day, (c) cool season night, and (d) cool season day using forecast hours 25–36 and 37–48 as night and day, respectively.

Fig. 6. Reliability diagrams for 2-m temperature (2mT) forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Reliability statistics were compiled under the conditions of (a) warm season nighttime 2mT over 18°C, (b) warm season daytime 2mT over 29°C, (c) cool season nighttime 2mT below 0°C, and (d) cool season daytime 2mT over 10°C using forecast hours 25–36 and 37–48 as night and day, respectively. The solid 1:1 line represents perfect reliability, while the lower diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 7. BSSs for SLP forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Skill scores were compiled for the (a) warm and (b) cool seasons.

Fig. 8. Reliability diagrams for SLP forecasts made by the three ensemble means and the 14-day bias-calibrated 18-member ensemble mean. Reliability statistics were compiled under the conditions of (a) warm season SLP over 1024 mb, (b) warm season SLP below 1010 mb, (c) cool season SLP over 1024 mb, and (d) cool season SLP below 1010 mb. The solid 1:1 line represents perfect reliability, while the lower diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 9. BSSs for 24-h precipitation forecasts made by the three ensemble means. Skill scores were compiled for the (a) warm and (b) cool seasons.

Fig. 10. Reliability diagrams for three 24-h precipitation thresholds: (a) 0.1 in. (2.54 mm), (b) 0.5 in. (12.70 mm), and (c) 0.7 in. (17.78 mm) during the warm season. The solid 1:1 line represents perfect reliability, while the lower straight diagonal line represents the line of no skill in a probability forecast. Inset plots show the sample size of each forecast probability.

Fig. 11. Same as in Fig. 9 but for the cool season.

Fig. 12. Verification rank histograms for the 18-member 12-km ensemble for (top) 2-m temperature, (middle) SLP, and (bottom) 10-m wind speed. Statistics were gathered over the warm season for the night (forecast hours 0–12 and 25–36) and day (forecast hours 13–24 and 37–48) forecast periods. Raw (bias calibrated) ensemble forecasts correspond with the black (white) histogram bars. Inset percentages represent the adjusted missing rates (MRadj) for each rank histogram.

Fig. 13. The frequency at which each 12-km ensemble member and 18-member ensemble mean verifies with the (a) lowest (best) and (b) highest (worst) MAEs for 2-m temperature, SLP, 10-m wind speed, and 10-m wind direction warm season forecasts. The raw and bias corrected forecasts are given by the black and white bars, respectively.

Fig. 14. Same as in Fig. 13 but for the cool season.

Fig. 15. Scatterplots of 12-km domain-average variance (abscissa) vs domain-average MAE (ordinate) before (gray circles) and after (black ×'s) a 14-day bias calibration is applied. Scatterplots are for (a), (b) 2-m temperature; (c), (d) SLP; (e), (f) 10-m wind speed; and (g), (h) 10-m wind direction forecasts for the 18-member ensemble mean during the warm season. The MAE–variance correlation coefficients in (i), (j) are shown before (white) and after (black) the bias calibration is applied.

Fig. 16. Diurnal MEs every 1 h for (a) 2-m temperature, (b) SLP, (c) 10-m wind speed, and (d) 10-m wind direction for the warm season 0000 UTC forecasts for all members, the three ensemble means, and the 14-day bias-calibrated 18-member ensemble mean averaged over the 12-km domain.

Fig. 17. The 24-h quantitative precipitation BIAS over the 12-km domain during the (a) warm and (b) cool seasons for the three ensemble means (PHS, IC, and ALL) and the four convective parameterization groups (BM, GR, KF, and KF2).

Fig. 18. RMSEs averaged over the 12-km domain every 1 h for SLP during the (a) cool and (b) warm seasons for each ensemble member (color coded by PBL type), the three ensemble means (PHS, IC, and ALL), and the 14-day bias calibration applied to the ALL ensemble (ALLBC). See Table 1 for run abbreviations.

Table 1. List of PHS (Nos. 1–12) and IC (Nos. 6, 13–18) members used in the MM5 SREF ensemble.