## 1. Introduction

Operational weather prediction centers strive to improve forecasts by continually upgrading both their forecast models and the analyses used to initialize those models. In a recent study, Hamill et al. (2004) investigated an alternative approach to improving forecasts: using the statistics of past forecast errors to correct errors in independent forecasts. They used a dataset of retrospective ensemble forecasts (or “reforecasts”) made with a 1998 version of the operational global forecast model from the National Centers for Environmental Prediction (NCEP; part of the U.S. National Weather Service). Probabilistic forecasts of surface temperature and precipitation in week 2 (days 8–14) were not skillful relative to climatology when computed directly from the model output, but were more skillful than the operationally produced forecasts from NCEP’s Climate Prediction Center when computed using forecast-error statistics from the reforecast dataset.

Ensembles of forecasts with a single model suffer from a deficiency in spread, partly because they do not represent the error in the forecast model itself. This causes probabilistic forecasts derived from such ensembles to be overconfident. Statistical postprocessing using reforecast datasets can ameliorate this problem to some extent, yielding reliable probabilities (e.g., Hamill et al. 2004). However, the use of multimodel ensembles (ensembles consisting of forecasts from different models, perhaps initialized from different analyses) has also been shown to significantly increase the skill of probabilistic forecasts (Krishnamurti et al. 1999; Palmer et al. 2004; Rajagopalan et al. 2002; Mylne et al. 2002), even without statistical postprocessing. In this note, we examine whether the results of Hamill et al. (2004) can be improved upon if the reforecast dataset used to statistically correct the forecasts includes more than one model. Here we combine the reforecast dataset described in Hamill et al. (2006), generated with a 1998 version of the NCEP Global Forecast System (GFS) model, with a reforecast dataset generated at the European Centre for Medium-Range Weather Forecasts (ECMWF) to support monthly forecasting (Vitart 2004). The ECMWF forecasts are run at roughly 1° grid spacing, compared with the 2.5° grid spacing used for the NCEP 1998 model forecasts.

In this note we address two questions. The first is whether the large gains in skill reported by Hamill et al. (2004) were a consequence of the fact that the forecast model was run at a relatively low resolution and is now nearly 6 yr behind the state-of-the-art model. In other words, do the benefits of statistically correcting forecasts using reforecast datasets found in that study also apply to newer, higher-resolution models? Second, we investigate whether the methods described in Hamill et al. (2004) can be applied to multimodel ensembles. Specifically, we ask whether statistically corrected multimodel forecasts are more skillful than the corrected forecasts generated from the component models alone (using the statistics from the component reforecast datasets).

## 2. Datasets and methodology

### a. The analyses

Forecasts for 850-hPa temperature are presented in this note. These forecasts have been verified using both the NCEP–National Center for Atmospheric Research (NCAR) reanalysis (Kistler et al. 2001) and the 40-yr ECMWF Re-Analysis (ERA-40; Uppala et al. 2005). Both of these analyses are on a 2.5° grid, and only those points poleward of 20°N are included. Because the verification statistics computed with the NCEP–NCAR and ERA-40 analyses are so similar, all of the results described here have been computed using the ERA-40 reanalysis unless otherwise noted.

### b. The forecasts

A reforecast dataset was created at the National Oceanic and Atmospheric Administration (NOAA) Climate Diagnostics Center (CDC), using a T62 resolution (roughly 2.5° grid spacing) version of NCEP’s GFS model, which was operational until January 1998. This model was run with 28 vertical sigma levels. A 15-member ensemble run out to a lead time of 15 days is available for every day from 1979 to the present, starting from 0000 UTC initial conditions. The ensemble initial conditions consisted of a control initialized with the NCEP–NCAR reanalysis (Kistler et al. 2001) and a set of seven bred pairs of initial conditions (Toth and Kalnay 1997), centered each day on the reanalysis initial condition. The breeding method and the forecast model are the same as those used operationally at NCEP in January 1998. Sea surface conditions were specified from the NCEP–NCAR reanalysis and were held fixed at their initial values throughout the forecast. Further details describing this dataset are available in Hamill et al. (2004) and Hamill et al. (2006).

An independent set of reforecasts has been generated at the ECMWF to support operational monthly forecasting (Vitart 2004). The atmospheric component of the forecast model is a T159 (roughly 1° grid) version of the ECMWF Integrated Forecast System (IFS) with 40 levels in the vertical. The oceanic component is the Hamburg Ocean Primitive Equation model. Atmospheric initial conditions were taken from the ERA-40 reanalysis, and ocean conditions were taken from the ocean data assimilation system used to produce seasonal forecasts at ECMWF. A five-member ensemble was run out to 32 days once every 2 weeks for a 12-yr period from 27 March 1990 to 18 June 2002. Initial atmospheric perturbations were generated using the same singular vector technique used in ECMWF operations (Buizza and Palmer 1995). Further details describing the ECMWF monthly forecasting system can be found in Vitart (2004).

A combined reforecast dataset was created by subsampling five members from the NCEP CDC reforecasts for just those dates in December, January, and February for which the ECMWF forecasts were available (a total of 84). Only week-2 (8–14-day average) forecast means for 850-hPa temperature in the Northern Hemisphere poleward of 20°N were included.
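The subsampling step can be sketched as follows. This is a minimal illustration with hypothetical date axes and array shapes; the original does not describe the dataset handling in code, so every name and dimension here is an assumption:

```python
import numpy as np

# Hypothetical date axes for the two reforecast datasets (YYYYMMDD integers).
ncep_dates = np.arange(19900101, 19900132)        # daily NCEP reforecasts
ecmwf_dates = np.arange(19900101, 19900132, 14)   # ECMWF runs once every 2 weeks

# Keep only the dates for which both datasets have forecasts.
common = np.intersect1d(ncep_dates, ecmwf_dates)
ncep_idx = np.searchsorted(ncep_dates, common)

# Subsample five of the 15 NCEP members on the common dates.
# ncep_ens has a hypothetical shape (n_dates, n_members, ny, nx).
ncep_ens = np.zeros((31, 15, 4, 4))
ncep_subset = ncep_ens[ncep_idx, :5]
```

The same index array can then be used to pull the matching verifying analyses, so that the two models are always compared on identical dates.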

### c. Methodology

Three-category probability forecasts for 850-hPa temperature in week 2 were produced for the Northern Hemisphere poleward of 20°N. The three categories are the lower, middle, and upper terciles of the climatological distribution of analyzed anomalies, so that the climatological forecast for each category is always 33%. The climatological distribution is defined for all winter (December–February) seasons from 1971 to 2000. The climatological mean for each day is computed by smoothing the 30-yr mean analyses for each calendar day with a 31-day running mean. The terciles of the climatological distribution are calculated using the analyzed anomalies over a 31-day window centered on each calendar day. Further details may be found in Hamill et al. (2004).
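As a concrete illustration of the tercile calculation described above, the climatological tercile boundaries at a single grid point might be computed along the following lines. This is a minimal sketch with synthetic anomalies; the array names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical analyzed anomalies at one grid point:
# 30 winters (1971-2000) x 90 winter days (Dec-Feb), synthetic for illustration.
anomalies = rng.normal(size=(30, 90))

def tercile_bounds(anomalies, day, half_window=15):
    """Terciles of the climatological distribution for one calendar day,
    pooling a 31-day window centered on that day across all years."""
    lo = max(day - half_window, 0)
    hi = min(day + half_window + 1, anomalies.shape[1])
    pooled = anomalies[:, lo:hi].ravel()
    return np.percentile(pooled, [100.0 / 3.0, 200.0 / 3.0])

lower, upper = tercile_bounds(anomalies, day=45)
```

Pooling the 31-day window gives 930 values per grid point here, which stabilizes the tercile estimates relative to using a single calendar day.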

The multimodel forecast is constructed by first combining the two ensemble means with a linear regression,

*F*_{combined} = *b*_{0} + *b*_{NCEP}*F*_{NCEP} + *b*_{ECMWF}*F*_{ECMWF}, (1)

where *F*_{NCEP} is the NCEP ensemble mean, *F*_{ECMWF} is the ECMWF ensemble mean, *F*_{combined} is the multimodel ensemble mean, *b*_{NCEP} is the weight given to the NCEP ensemble mean, *b*_{ECMWF} is the weight given to the ECMWF ensemble mean, and *b*_{0} is a constant term. The coefficients (*b*_{0}, *b*_{NCEP}, and *b*_{ECMWF}) are estimated at each Northern Hemisphere grid point using an iterative least squares regression algorithm (NAG library routine G02GAF).

Instead of combining the ensemble means and using the result as a predictor in the logistic regression, a multimodel forecast can be computed by simply using each ensemble mean as a separate predictor in the logistic regression. We have found that forecasts obtained with a two-predictor logistic regression are slightly less skillful, so only results using the two-step procedure (using the combined ensemble mean as a single predictor in the logistic regression) are presented here.
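A minimal sketch of this two-step procedure at a single grid point follows. Synthetic data stand in for the reforecasts, NumPy's least squares solver stands in for the NAG routine, and the logistic regression (here for the probability of one category, the upper tercile) is fit by a plain Newton iteration; none of these implementation choices are from the original:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 84  # number of winter forecast dates in the combined dataset

# Synthetic stand-ins for the week-2 ensemble-mean anomalies at one grid
# point and the verifying analyzed anomaly (all values hypothetical).
truth = rng.normal(size=n)
f_ncep = truth + rng.normal(scale=1.0, size=n)
f_ecmwf = truth + rng.normal(scale=0.7, size=n)

# Step 1: least squares regression gives b0, b_NCEP, b_ECMWF of Eq. (1).
X = np.column_stack([np.ones(n), f_ncep, f_ecmwf])
(b0, b_ncep, b_ecmwf), *_ = np.linalg.lstsq(X, truth, rcond=None)
f_combined = b0 + b_ncep * f_ncep + b_ecmwf * f_ecmwf

# Step 2: logistic regression from the combined ensemble mean to the
# probability of the observed anomaly falling in the upper tercile,
# fit by Newton-Raphson on the log-likelihood.
occurred = (truth > np.percentile(truth, 200.0 / 3.0)).astype(float)
A = np.column_stack([np.ones(n), f_combined])
w = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-A @ w))          # current fitted probabilities
    grad = A.T @ (occurred - p)                # gradient of log-likelihood
    hess = A.T @ (A * (p * (1 - p))[:, None])  # (negative) Hessian
    w += np.linalg.solve(hess, grad)

prob_upper = 1.0 / (1.0 + np.exp(-A @ w))
```

With an intercept in the model, the fitted probabilities average to the observed category frequency, so the regression is calibrated by construction on the training sample.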

## 3. Results

Following Richardson (2001), the Brier skill score (BSS) of a finite ensemble can be related to that of a hypothetical infinite ensemble by

*B*_{∞} = 1 − *M*(1 − *B*_{M})/(*M* + 1), (2)

where *B*_{M} is the BSS for an ensemble of size *M* and *B*_{∞} is the BSS for an infinite ensemble. This equation assumes the ensemble prediction system is perfect (i.e., the verifying analysis is statistically indistinguishable from a randomly chosen ensemble member). According to this formula, if a five-member ensemble with a perfect model has no skill (*B*_{M} = 0), then an infinite ensemble should have a BSS of 0.166. However, we have found that the RPSS of the NCEP forecast is still negative for an ensemble size of 15 (the total number of ensemble members available). This suggests that model error, not the small ensemble size used to generate the probabilities, is primarily responsible for the low skill of the NCEP model forecasts. Although we cannot test the sensitivity of the RPSS to ensemble size with the ECMWF model (since the reforecasts were only run with five members), the similarity of the reliability diagrams shown in Fig. 1 suggests that the same holds true for the ECMWF forecasts. The lack of reliability shown in Fig. 1 is a consequence of the spread deficiency in both models: on average the ensemble spread is nearly a factor of 2 smaller than the ensemble mean error (not shown).
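The ensemble-size relation discussed above, written in the form *B*_{∞} = 1 − *M*(1 − *B*_{M})/(*M* + 1) following Richardson (2001), reproduces the quoted numbers and can be checked in a few lines:

```python
def bss_infinite(bss_m, m):
    """BSS of a hypothetically infinite perfect ensemble, given the BSS
    bss_m of an m-member perfect ensemble (Richardson 2001)."""
    return 1.0 - m * (1.0 - bss_m) / (m + 1.0)

# A five-member perfect-model ensemble with no skill (B_M = 0) implies an
# infinite-ensemble BSS of 1/6, i.e. the 0.166 quoted in the text.
b_inf = bss_infinite(0.0, 5)
```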

Applying a logistic regression to the ensemble means produces forecasts that are more reliable and have higher skill for both ensembles (Fig. 2). The corrected ECMWF model forecasts are significantly more skillful than the corrected NCEP model forecasts (RPSS of 0.155 for the ECMWF versus 0.113 for the NCEP). The relative improvement in the ECMWF forecasts is nearly the same as for the NCEP forecasts, demonstrating that the value of having a reforecast dataset is just as large for the ECMWF model as it was for the NCEP model. This is despite the fact that the ECMWF model has the benefit of twice the resolution and five extra years of model development. The improvement in skill is limited by the small sample size: if all 25 yr of forecasts available for the NCEP model are used to compute the logistic regression, the RPSS increases to 0.15.

The multimodel forecast, using the combined ensemble mean in the logistic regression, produces a forecast with an average RPSS of 0.17 (Fig. 3), about a 10% improvement over the corrected ECMWF forecast alone. The fact that the multimodel ensemble forecast is more skillful than the ECMWF forecast, even though the ECMWF forecast is, on average, superior to the NCEP forecast, is a consequence of there being places where the NCEP model is consistently more skillful than the ECMWF model (Krishnamurti et al. 2000). This is reflected in the weights used to combine the two ensemble means (Fig. 4). Although the ECMWF model receives more weight on average, there are regions where the NCEP model receives a weight of greater than 0.5 (the red regions in the left panel of Fig. 4). In other words, the NCEP model, though inferior on average to the ECMWF model, supplies independent information that can be used to improve upon the ECMWF forecast. This is obviously only possible if a reforecast dataset is available for both models.
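For reference, the RPSS used to compare these forecasts can be computed from three-category tercile probabilities along the following lines. This is a minimal sketch scoring against the 33% climatological forecast; the variable names are hypothetical:

```python
import numpy as np

def rps(probs, obs_category, n_cat=3):
    """Ranked probability score for one forecast: squared error of the
    cumulative forecast distribution against the cumulative observation."""
    cum_f = np.cumsum(probs)
    cum_o = np.cumsum(np.eye(n_cat)[obs_category])
    return np.sum((cum_f - cum_o) ** 2)

def rpss(forecast_probs, obs_categories, clim_probs=(1 / 3, 1 / 3, 1 / 3)):
    """RPSS relative to the climatological forecast (33% per tercile).
    Positive values indicate skill; 0 means no better than climatology."""
    rps_f = np.mean([rps(p, o) for p, o in zip(forecast_probs, obs_categories)])
    rps_c = np.mean([rps(np.asarray(clim_probs), o) for o in obs_categories])
    return 1.0 - rps_f / rps_c

# A sharp forecast that verifies scores close to 1; the climatological
# forecast itself scores exactly 0.
score = rpss([np.array([0.8, 0.15, 0.05])], [0])
```

Because the RPS penalizes cumulative-distribution errors, misses by one category cost less than misses by two, which is why it is preferred over the Brier score for ordered tercile categories.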

All of the results discussed in this section have been computed using the ERA-40 reanalysis as the verifying analysis. The results are not substantively changed if the NCEP–NCAR reanalysis is used instead (not shown).

## 4. Discussion

The study of Hamill et al. (2004) has been extended to include more than one forecast model. The results show that the benefits of reforecasts apply to older, lower-resolution forecast systems just as they do to higher-resolution, state-of-the-art forecast systems. In this case, logistic regression was used to combine the NCEP and ECMWF week-2 forecasts. The resulting multimodel forecast was more skillful than the statistically corrected ECMWF forecast, which was run at more than twice the horizontal resolution of the other model, a 1998 version of the NCEP global forecast model. These results clearly demonstrate that all operational centers can benefit from sharing both reforecast datasets and real-time forecasts. This is true even if the models run by the various centers differ substantially in resolution, as long as they have some skill and provide independent information to the statistical scheme used to combine the forecasts. For week-2 tercile probability forecasts of 850-hPa temperature, the benefits of multimodel ensembles cannot be realized unless all the models have corresponding reforecast datasets with which to estimate the regression equations used to combine (and calibrate) the forecasts.

## Acknowledgments

Fruitful discussions with Tom Hamill are gratefully acknowledged. The NOAA “Weather–Climate Connection” program funded this project.

## REFERENCES

Buizza, R., and T. N. Palmer, 1995: The singular vector structure of the atmospheric global circulation. *J. Atmos. Sci.*, **52**, 1434–1456.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. *Bull. Amer. Meteor. Soc.*, **87**, 33–46.

Kistler, R., and Coauthors, 2001: The NCEP–NCAR 50-Year Reanalysis: Monthly means CD-ROM and documentation. *Bull. Amer. Meteor. Soc.*, **82**, 247–268.

Krishnamurti, T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, E. C. Williford, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensembles. *Science*, **285**, 1548–1550.

Krishnamurti, T. N., C. M. Kishtawal, Z. Zhang, T. LaRow, D. Bachiochi, E. Williford, S. Gadgil, and S. Surendran, 2000: Multimodel ensemble forecasts for weather and seasonal climate. *J. Climate*, **13**, 4196–4216.

Mylne, K. R., R. E. Evans, and R. T. Clark, 2002: Multi-model multi-analysis ensembles in quasi-operational medium-range forecasting. *Quart. J. Roy. Meteor. Soc.*, **128**, 361–384.

Palmer, T. N., and Coauthors, 2004: Development of a European multimodel ensemble system for seasonal to interannual prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. *Mon. Wea. Rev.*, **130**, 1792–1811.

Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. *Quart. J. Roy. Meteor. Soc.*, **127**, 2473–2489.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319.

Uppala, S. M., and Coauthors, 2005: The ERA-40 reanalysis. *Quart. J. Roy. Meteor. Soc.*, **131**, 2961–3012.

Vitart, F., 2004: Monthly forecasting at ECMWF. *Mon. Wea. Rev.*, **132**, 2761–2779.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. Academic Press, 467 pp.

Fig. 2. Same as in Fig. 1, but for forecasts corrected with a logistic regression.

Citation: Monthly Weather Review 134, 8; 10.1175/MWR3175.1

Fig. 3. Same as in Fig. 1, but for the multimodel ensemble. The NCEP and ECMWF forecast ensemble means are combined using linear regression, and the combined ensemble mean is then used to predict tercile probabilities using logistic regression. The combined ensemble is more skillful than the logistic regression correction applied to either the ECMWF or NCEP model alone (Fig. 2).

Fig. 4. Maps of the weights used to combine the ECMWF and NCEP ensemble mean forecasts. These maps correspond to *b*_{NCEP} and *b*_{ECMWF} in Eq. (1). The constant term *b*_{0} is not shown.