1. Introduction
An ongoing challenge with short- and medium-range ensemble prediction systems (EPSs) is how to generate probabilistic forecasts that account for the system errors in the ensemble. System errors include sampling error due to the finite ensemble size; errors introduced by model imperfections such as grid truncation and the use of deterministic parameterizations (Houtekamer and Mitchell 2005); and imperfections in the assimilation system and the observations. There are many methods for treating system error, including introducing stochastic elements into the ensemble prediction system (Buizza et al. 1999; Shutts 2005; Berner et al. 2009; Palmer et al. 2009; Charron et al. 2010), using multiple parameterizations (Charron et al. 2010; Berner et al. 2011), using multiple models (Bougeault et al. 2010), and applying statistical postprocessing.
Two methods that will be explored and contrasted here are multimodel methods and statistical postprocessing. The underlying hypothesis of multimodel ensembles (Krishnamurti et al. 2000; Wandishin et al. 2001; Mylne et al. 2002; Doblas-Reyes et al. 2005; Hagedorn et al. 2005; Weigel et al. 2008; Candille 2009; Johnson and Swinbank 2009; Bougeault et al. 2010; Iversen et al. 2011) is that the many differences between the constituent EPSs will cause them to generate ensemble forecasts with quasi-independent systematic errors, so that their combination may provide a more accurate estimate of the forecast uncertainty. Practically speaking, these are also ensembles of opportunity: if all centers are willing to share rather than sell their forecast data, the additional members can be used for little more than the cost of data transmittal and storage, providing an inexpensive way to improve forecast skill. However, multimodel ensembles have some potential disadvantages. Developing an accurate, stable weather prediction system is costly, so multimodel ensembles are likely to be less useful when formed from immature systems. System outages may prevent routine access to other centers' ensembles. One or another of the models is likely to have been changed recently, making it difficult to understand the multimodel system error characteristics. The hypothesis of quasi-independent errors may not always hold, either. Each operational center has a practical interest in providing a high-quality product without depending on another center's data, so when one center develops a method that improves the forecast significantly, that method is often adopted at other operational centers. The resulting similarity could produce some collinearity of errors and decrease the collective usefulness of the ensembles (Lorenz et al. 2011).
Another method for addressing system error is statistical postprocessing. Discrepancies between time series of past forecasts from a fixed model and the verifying observations or analyses can be used to adjust the real-time forecasts. For some applications, such as short-range forecasts of surface temperature, a short training time series may be sufficient (Stensrud and Yussouf 2003; Yussouf and Stensrud 2007; Hagedorn et al. 2008). For others, such as heavy precipitation and longer-lead forecasts, a long time series of reforecasts has been shown to dramatically improve the reliability and skill of the probabilistic forecasts (Hamill et al. 2004, 2006; Hamill and Whitaker 2007; Wilks and Hamill 2007; Hamill et al. 2008). A drawback of reforecasts is that a forecast time series spanning many years or even decades may be necessary to produce a sample large enough to adjust for systematic errors in rare-event forecasts. Because forecast models are frequently updated, which may change the systematic error characteristics, either the forecast model must be frozen once a reforecast dataset has been generated or a new reforecast dataset must be generated every time the modeling system changes significantly. Hence, reforecasting can be computationally expensive and can restrict a forecast center's ability to upgrade its system rapidly. Recently, statistical postprocessing methods have been the subject of much investigation (Gneiting et al. 2005; Raftery et al. 2005; Sloughter et al. 2007; Wilson et al. 2007; Vannitsem and Nicolis 2008; Glahn et al. 2009; Bao et al. 2010).
To date, however, there have been no systematic comparisons of multimodel and reforecast-calibrated probabilistic quantitative precipitation forecasts (PQPFs) verified over a large enough area and a long enough period of time to confidently assess the relative strengths and weaknesses of these two approaches. This study attempts to provide such a comparison for this high-impact forecast parameter. Using The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) forecast data from the National Centers for Environmental Prediction (NCEP), the Canadian Meteorological Centre (CMC), the Met Office (UKMO), and the European Centre for Medium-Range Weather Forecasts (ECMWF), multimodel ensemble 24-h accumulated probabilistic forecasts of precipitation were generated and then compared against ECMWF forecasts that were statistically adjusted using their reforecast dataset. The comparison was performed over the contiguous United States (CONUS) during the period of July–October 2010. Statistical adjustments were also attempted with multimodel forecasts, trained on the previous 30 days of forecasts and analyses.
Below, section 2 describes the datasets used in this experiment, the verification methodology, and the statistical postprocessing method. Section 3 provides results, and section 4 some conclusions.
2. Datasets and methods
a. Analysis data used
A recently created precipitation dataset, NCEP's Climatology-Calibrated Precipitation Analysis (CCPA), was used for verification. CCPA attempts to combine the relative advantages of the 4-km, hourly NCEP stage-IV precipitation analysis (Lin and Mitchell 2005) and the daily, 0.25° NCEP Climate Prediction Center (CPC) Unified Precipitation Analysis (Higgins et al. 1996). The former is based on gauge and radar data, the latter solely on gauge data. A disadvantage of the stage-IV product is that it may inherit some of the biases in the estimation of rainfall from radars; a disadvantage of the CPC product is that some areas of the United States are only sparsely covered by gauge data. CCPA regressed the stage-IV analysis (the predictor) onto the CPC analysis (the predictand), thereby reducing the bias with respect to the in situ observations. In several of the driest locations of the western United States, the CCPA analysis was set to missing because the regression was singular or otherwise untrustworthy where neither analysis product recorded precipitation; in such cases, CCPA was simply replaced for this study with the stage-IV analysis. For our purposes, the CCPAs were also upscaled to 1° and accumulated over 24-h periods in a manner that preserved the total precipitation, similar to the "remapping" procedure described in Accadia et al. (2003). CCPAs were available from 2002 to the present, a shorter period than the ECMWF reforecasts, thus limiting the amount of training data that could be used in the statistical postprocessing.
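The total-preserving upscaling step can be illustrated with a short sketch. This is a minimal, hypothetical illustration, assuming a fine grid that nests evenly within the 1° boxes and roughly equal cell areas; it is not the operational CCPA processing code, and the function name and synthetic field are invented for the example.

```python
import numpy as np

def upscale_preserving_total(fine, factor):
    """Box-average a fine precipitation grid onto a coarser one.

    Averaging every factor x factor block of fine cells into one coarse
    cell preserves the domain-mean (and hence total) precipitation,
    similar in spirit to the remapping of Accadia et al. (2003).
    """
    ny, nx = fine.shape
    blocks = fine.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.mean(axis=(1, 3))

# Example: a synthetic 0.25-degree field upscaled to 1 degree (factor of 4).
rng = np.random.default_rng(0)
fine = rng.gamma(shape=0.5, scale=4.0, size=(120, 240))  # mm (24 h)^-1
coarse = upscale_preserving_total(fine, 4)
assert np.isclose(fine.mean(), coarse.mean())            # totals preserved
```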
b. Forecast and reforecast model data
For this experiment, 20 perturbed-member forecasts of 24-h accumulated precipitation were extracted from the UKMO, CMC, NCEP, and ECMWF ensemble systems archived in the TIGGE database at ECMWF. Probabilities were calculated directly from the ensemble relative frequency, referred to as "raw" probabilities henceforth. The forecast period was July–October 2010; only 0000 UTC initial-time forecasts were extracted, to allow comparison with postprocessed forecasts using ECMWF's reforecasts, which were generated only from 0000 UTC initial conditions. Daily forecasts of 24-h accumulated precipitation were examined at +1- to +5-day leads. Regardless of the original model resolution, all centers' forecasts were bilinearly interpolated to a 1° latitude–longitude grid covering the CONUS using ECMWF's TIGGE portal software. ECMWF's interpolation procedure set the amount to zero if there was no precipitation at the nearest neighboring point and the interpolated value was less than 0.05 mm. No control forecasts were included, just the forecasts from the perturbed initial conditions. Other forecast centers' contributions to the TIGGE archive were not used here for various reasons, such as the unavailability of 0000 UTC ensemble forecasts from the Japan Meteorological Agency. For consistency of ensemble size and to facilitate skill comparisons, only the first 20 of the full 50 ECMWF member forecasts were used in the generation of the multimodel ensemble, though the 50-member ECMWF forecasts were also evaluated for skill and reliability. The configurations of these four ensemble systems are described in more detail in appendix A.
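The interpolation-plus-zeroing rule just described can be sketched as follows. This is a minimal illustration of the stated rule only, not a reproduction of ECMWF's TIGGE portal software; the function name and argument layout are assumptions for the example.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def interp_precip_to_1deg(lats, lons, field, out_lats, out_lons):
    """Bilinear interpolation of a precipitation field, with the result
    set to zero wherever the nearest native grid point is dry and the
    interpolated amount is below 0.05 mm (the rule described in the text).

    lats, lons must be ascending; field has shape (len(lats), len(lons)).
    """
    bilinear = RegularGridInterpolator((lats, lons), field, method="linear")
    nearest = RegularGridInterpolator((lats, lons), field, method="nearest")
    pts = np.array([(la, lo) for la in out_lats for lo in out_lons])
    out = bilinear(pts)
    out[(nearest(pts) == 0.0) & (out < 0.05)] = 0.0
    return out.reshape(len(out_lats), len(out_lons))
```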
When calibrating ECMWF data with reforecasts, the five-member weekly reforecast precipitation data were extracted from ECMWF’s weekly reforecast archive (Hagedorn 2008) and similarly interpolated to the 1° grid. The control reforecasts were initialized from the ECMWF Interim reanalysis (ERA-Interim; Dee et al. 2011), which used version Cy31r2 of the ECMWF Integrated Forecast System (IFS) in the data assimilation process. The 2010 real-time ensemble forecasts and the reforecasts were then run using IFS model version Cy36r2 (more detail is provided in appendix A).
The four perturbed initial conditions for the reforecasts were generated with a combination of ECMWF's singular-vector approach (Molteni et al. 1996; Barkmeijer et al. 1999) and its ensemble of data assimilations (EDA; Isaksen et al. 2010), which used a cycled, reduced-resolution, four-dimensional variational data assimilation (4D-Var) with perturbed observations. However, for the initialization of the reforecasts, ECMWF applied the EDA perturbations from 2010 to the 2002–09 data rather than running the EDA during the reforecast period. Running the EDA for dates in the past would have been computationally expensive, but not doing so may have given the reforecast perturbations a less flow-dependent character, possibly making them somewhat statistically inconsistent with ECMWF's real-time ensemble forecasts.
Since precipitation analysis data were only available for the period from 2002 forward, the training data were limited to the reforecasts for the period from June to November 2002–09, or 8 yr. To limit the possible deleterious effects of seasonal biases in the forecast model, only the reforecast data for the month of interest and the months before and after were used. For example, when calibrating July forecasts, June–August reforecast data were used. With reforecasts generated once per week, this typically meant there were ~13 once-weekly samples × 8 yr = 104 samples. This was in many cases an insufficient sample size, so data from other grid points were used to increase the training sample size (appendix B).
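The training-window arithmetic above can be made concrete with a small sketch. The weekly calendar assumed here (a reforecast every 7 days from 1 January) is purely illustrative; ECMWF's actual reforecast dates differ, and the function name is invented for the example.

```python
from datetime import date, timedelta

def training_dates(target_month, years=range(2002, 2010)):
    """Collect once-weekly reforecast dates falling in the target month
    or an adjacent month, for each training year."""
    months = {(target_month - 2) % 12 + 1, target_month, target_month % 12 + 1}
    dates = []
    for yr in years:
        d = date(yr, 1, 1)
        while d.year == yr:
            if d.month in months:
                dates.append(d)
            d += timedelta(days=7)
    return dates

print(len(training_dates(7)))  # ~13 weekly dates x 8 yr, i.e., roughly 104
```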
c. Statistical postprocessing methodology
The statistical postprocessing technique used here was extended logistic regression (ELR; Wilks 2009). ELR augments a conventional logistic regression with a function of the precipitation threshold itself as an additional predictor, so that a single set of regression coefficients yields mutually consistent probabilities across all thresholds rather than requiring a separate regression for each threshold (see also Schmeits and Kok 2010; Roulin and Vannitsem 2012).
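As a concrete illustration, the following minimal sketch fits ELR coefficients by maximum likelihood, assuming square-root transforms of the ensemble mean and of the threshold as predictors, one common choice following Wilks (2009). The function `fit_elr` and its three-coefficient predictor set are illustrative only; they do not reproduce the exact four-parameter formulation used in this study (appendix B).

```python
import numpy as np
from scipy.optimize import minimize

def fit_elr(ens_mean, obs, thresholds):
    """Fit extended logistic regression,
        P(y <= q | ensemble) = 1/(1 + exp(-(b0 + b1*sqrt(xbar) + b2*sqrt(q)))),
    by minimizing the negative Bernoulli log-likelihood accumulated over
    all training cases and thresholds.  xbar is the ensemble mean."""
    x = np.sqrt(ens_mean)

    def nll(b):
        total = 0.0
        for q in thresholds:
            z = b[0] + b[1] * x + b[2] * np.sqrt(q)
            p = 1.0 / (1.0 + np.exp(-z))           # P(y <= q)
            y = (obs <= q).astype(float)
            eps = 1e-9
            total -= np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
        return total

    return minimize(nll, x0=np.zeros(3), method="Nelder-Mead").x

# Probability of exceeding 10 mm from fitted coefficients b and ensemble mean m:
#   1 - 1/(1 + exp(-(b[0] + b[1]*sqrt(m) + b[2]*sqrt(10.0))))
```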
ELR was applied both to calibrating real-time multimodel forecasts and to calibrating ECMWF forecasts alone using the weekly reforecasts. It was found that forecast skill increased if some method was applied to increase the modest training sample sizes. A discussion of how sample sizes were augmented using data from other nearby forecast grid points is provided in appendix B.
Roulin and Vannitsem (2012) noted that since the ECMWF reforecast ensemble (5 members) was smaller than the operational ensemble (50 members, or in the case here, 20 members selected from the 50), regression coefficients trained on the small ensemble may be somewhat biased relative to those that would be obtained from a larger one. Hence, when the coefficients are used to correct the larger real-time ensemble, they may produce somewhat biased probabilistic forecasts. Roulin and Vannitsem adjusted the values of the five-member training data to better estimate the values that would be obtained with the larger real-time ensemble. An analogous approach was tried here but did not improve the forecast skill, so the results discussed below omit this adjustment.
d. Verification methods
The primary verification methods used here were Brier skill scores (BSSs), continuous ranked probability skill scores (CRPSSs), and reliability diagrams (Wilks 2006). The BSS and CRPSS as conventionally calculated (see section 7.4.2 of Wilks 2006) can exaggerate forecast skill, attributing skill to variations in climatological event probabilities. Thus, the procedures suggested in Hamill and Juras (2006) were used here to avoid this.
In brief, grid points were grouped into classes according to their climatological probability of the event being verified, Brier scores for the forecast and for the climatological reference were accumulated separately within each class, and the skill scores were computed class by class before being combined. Computed this way, no apparent skill can be generated merely by mixing locations with very different event climatologies. Figure 1 illustrates how the climatological classes were determined.
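A minimal sketch of this class-based calculation follows. The use of eight equally populated classes and the simple average across classes are assumptions for illustration, not necessarily the exact configuration used here.

```python
import numpy as np

def class_based_bss(p_fcst, obs_event, p_clim, n_classes=8):
    """BSS computed separately within classes of similar climatological
    event probability and then averaged, so that apparent skill does not
    arise merely from climatological variations (Hamill and Juras 2006).

    Inputs are flat arrays over all grid points and dates; obs_event is
    1 where the event occurred and 0 otherwise."""
    edges = np.quantile(p_clim, np.linspace(0.0, 1.0, n_classes + 1))
    edges[-1] += 1e-9                        # make the top class inclusive
    skills = []
    for k in range(n_classes):
        m = (p_clim >= edges[k]) & (p_clim < edges[k + 1])
        if not m.any():
            continue
        bs_fcst = np.mean((p_fcst[m] - obs_event[m]) ** 2)  # forecast Brier score
        bs_clim = np.mean((p_clim[m] - obs_event[m]) ** 2)  # climatological reference
        skills.append(1.0 - bs_fcst / bs_clim)
    return float(np.mean(skills))
```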
Fig. 1. Illustration of the process for determining precipitation classes used in the calculation of BSS. (a) Climatological probability of >1 mm (24 h)−1 precipitation as determined from stage-IV data for September 2002–09. (b) Climatological class assigned to each grid point for the 1 mm (24 h)−1 event in September.
BSS confidence intervals were estimated using the paired block bootstrap approach of Hamill (1999). The input data to the bootstrap approach consisted of arrays of daily verification scores for the two forecast systems being compared, paired by date.
These block bootstrap confidence intervals should be regarded as approximations. An assumption underlying this process is that there were 123 independent data samples, one for each day in July–October 2010. However, forecast errors were correlated from day to day, so the effective number of independent samples was smaller than 123, and the confidence intervals shown here are consequently likely to be somewhat too narrow.
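The following is a minimal sketch of such a paired bootstrap, assuming one score per day per system and treating each day as an independent block; the function name and the 1000-replication default are illustrative.

```python
import numpy as np

def paired_bootstrap_ci(daily_score_a, daily_score_b, n_boot=1000, seed=0):
    """Confidence interval for the difference in daily verification scores
    of two systems via a paired block bootstrap (Hamill 1999).  Whole days
    are resampled with replacement, keeping the two systems paired on the
    same days."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(daily_score_a) - np.asarray(daily_score_b)
    n = len(diffs)
    means = np.array([rng.choice(diffs, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return np.percentile(means, [5.0, 95.0])   # 5th/95th percentile interval
```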
Two other common verification statistics were also used—root-mean-square (RMS) errors and bias (the average forecast divided by the average analyzed amount).
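For reference, these two statistics can be computed as follows; the helper below is a hypothetical convenience, with bias defined exactly as in the text.

```python
import numpy as np

def rmse_and_bias(fcst, analyzed):
    """RMS error, and bias defined as the average forecast amount divided
    by the average analyzed amount (1.0 = unbiased)."""
    fcst, analyzed = np.asarray(fcst), np.asarray(analyzed)
    rmse = np.sqrt(np.mean((fcst - analyzed) ** 2))
    bias = fcst.mean() / analyzed.mean()
    return rmse, bias
```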
3. Results
a. Properties of forecasts from the individual centers
Before considering the multimodel and ECMWF reforecast-calibrated forecast properties, let us consider the verification of PQPFs from the individual centers. Figure 2 shows >1 mm (24 h)−1 and >10 mm (24 h)−1 event BSSs and CRPSSs. ECMWF generally produced the most skillful raw precipitation PQPFs. Depending on the metric, either NCEP or UKMO produced the least skillful forecasts.
Fig. 2. BSSs of various forecasts for the (a) >1 mm (24 h)−1 event and (b) >10 mm (24 h)−1 event, and (c) CRPSSs, all as a function of forecast lead time. Error bars denote confidence intervals, the 5th and 95th percentiles of a paired block bootstrap between ECMWF and NCEP forecasts.
Interestingly, though UKMO forecasts generally appeared more skillful than NCEP forecasts as measured by BSS, they appeared consistently worse as measured by CRPSS. This was a consequence of the CRPSS verification algorithm as implemented here, which weighted the CRPSS equally at all grid points, irrespective of whether the climatological event probability was extremely high or extremely low. The conventionally calculated CRPSS is dominated by the performance of the forecasts in the climatologically wet areas (Hamill and Juras 2006): there is inherently greater climatological variance of precipitation in wet regions, and with it generally much larger CRPS values than in dry regions, so when evaluated over many grid points, the conventionally calculated CRPS, and hence the CRPSS, is dominated by the performance at the wetter points. Figure 3 shows maps of the day +3 CRPSS (see the supplemental material for CRPSS maps at the other lead times). The UKMO forecasts had negative skill in the extremely dry regions of the western United States. The RMS errors of the ensemble-mean forecasts in the dry regions were very small and relatively similar for all the models (Fig. 4a; for other lead times, see the supplemental material). However, the UKMO forecasts exhibited a large moist bias in the climatologically dry regions (Fig. 4b), which resulted in a large overforecast of probabilities and poor skill at those points. This was apparently due to a drizzle overforecast bias in that version of the UKMO forecast model (D. Barker 2011, personal communication).
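For readers unfamiliar with the score, the CRPS for a single ensemble forecast can be computed in its standard kernel form, shown in the sketch below; the larger climatological variability at wet points inflates both terms, which is the domination effect just described. The function is illustrative, not the verification code used here.

```python
import numpy as np

def ensemble_crps(members, obs):
    """CRPS for one ensemble forecast and one verifying value, in the
    standard kernel form CRPS = E|X - y| - 0.5 E|X - X'|, where X and X'
    are independent draws from the ensemble."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```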
Fig. 3. Maps of average CRPSS for day +3 forecasts for (a) ECMWF, (b) NCEP, (c) UKMO, and (d) CMC.
Fig. 4. (a) RMS errors and (b) bias for day +3 forecasts, each as a function of the climatological probability of >1 mm (24 h)−1 precipitation. Light gray bars in (a) denote the relative frequency of each climatological probability.
Figure 4b also illustrates some other interesting characteristics of the ensemble systems. NCEP overforecast rainfall at the grid points and dates where the climatological probability was already quite high. CMC forecasts were also biased, exhibiting a moist bias at the lowest climatological probabilities but dry biases across most of the larger climatological probabilities. ECMWF forecasts were the least biased, with a moderate overforecast bias at the low climatological probabilities.
Figure 5 provides reliability diagrams of day +3 forecasts of the >10 mm (24 h)−1 event; reliability diagrams for other lead times and for the >1 mm (24 h)−1 event are available in the supplemental material. CMC forecasts were generally the most reliable, though they were not as sharp as the ECMWF forecasts and hence had a lower BSS. UKMO and NCEP forecasts were much less reliable, though NCEP forecasts were slightly sharper than the others. The ECMWF 50-member forecasts were slightly more reliable and skillful than their 20-member subset.
Fig. 5. Reliability diagrams for day +3 forecasts for the >10 mm (24 h)−1 event: (a) ECMWF, (b) NCEP, (c) CMC, and (d) UKMO. The dark line on each is the 20-member reliability curve. The lighter gray line in (a) is the reliability for the full 50-member ensemble. The inset histogram bars show the relative frequency of usage for each probability bin. The black lines on the inset are the relative frequency of usage for the climatological distribution across all the sample points. The gray dots on the inset histogram of (a) are the relative frequency of usage for the ECMWF full 50-member ensemble.
In subjective analyses of individual forecasts, it appeared that several of the forecast models had subtle systematic northward biases in the north-central United States. Figure 6 shows the 10-mm analyzed contour and the 0.5 probability contour for the >10 mm (24 h)−1 event from the day +3 ECMWF forecasts. The 25 cases with the largest areal coverage of observed precipitation between 105° and 80°W longitude and 35° and 50°N latitude were chosen. Similar plots for the other forecast models are included in the supplemental material.
Fig. 6. Analyzed >10 mm (24 h)−1 precipitation boundary (black line) and area exceeding 10 mm (gray shading) for the 25 cases with the largest areal coverage of greater than 10 mm in the upper Midwest. Red lines indicate the 0.5 probability contour from the ECMWF ensemble for the day +3 forecasts of >10 mm (24 h)−1 precipitation.
b. Properties of multimodel and statistically postprocessed forecasts
Before considering verification scores, consider first two actual forecast cases, presented in Figs. 7 and 8, showing probabilities from the 20-member ensembles and from the 80-member multimodel ensemble. The first case, covering the 24-h period ending 0000 UTC 21 July 2010, illustrates that sometimes the forecast models could be overly similar to each other. Here all the forecast precipitation areas were significantly north of the observed area. A multimodel forecast would not be expected to provide much benefit in such a situation. Figure 8 shows the same type of plot, but for the 24-h period ending 0000 UTC 7 August 2010. Here the multimodel forecast provided some improvement. On this day the CMC and UKMO areas of high probabilities were too far north and the NCEP area too far south, but the higher probabilities in the multimodel forecasts were more coincident with the analyzed regions exceeding 10 mm. Most of the area with greater than 10 mm in the analysis was covered by nonzero multimodel probabilities. More generally, when there was some diversity of positions in the multimodel forecasts, this often allowed the forecast to avoid being inappropriately sharp.
Fig. 7. (a) Analyzed precipitation for the 24-h period ending 0000 UTC 21 Jul 2010; the 10 mm (24 h)−1 contour is denoted by the thick black line. (b) Probability of >10 mm (24 h)−1 precipitation for the day +3 forecast from the ECMWF ensemble for the same period. The analyzed 10-mm contour from (a) is repeated. (c)–(f) As in (b), but for NCEP, CMC, UKMO, and the multimodel combination, respectively.
Fig. 8. As in Fig. 7, but for the 24-h period ending 0000 UTC 11 Aug 2010.
Figure 9 provides BSSs and CRPSSs for the multimodel and postprocessed forecasts. For the light precipitation event [>1.0 mm (24 h)−1; Fig. 9a], the multimodel forecasts improved the skill by approximately +1 day relative to ECMWF at the earliest lead times; a +2-day multimodel forecast could now be made as skillfully as a +1-day ECMWF forecast. The improvement in skill was a more modest ~+0.3 days at the longer forecast lead times. The calibrated multimodel product improved skill over the basic multimodel forecast by a tiny amount at day +1 but degraded the skill after day +3. This is consistent with previous results; at the longer lead times, the growth of errors makes it more difficult to differentiate the model bias from the chaotically induced errors with short training datasets (Hamill et al. 2004). The improvement of reforecast-based postprocessing over the raw ECMWF system was much smaller than the improvement from the single-model to the multimodel approach and was even slightly negative at the day +5 lead. Reasons for the less impressive performance of reforecast calibration relative to previous studies will be discussed at the end of this section.
Fig. 9. BSSs of various forecasts for the (a) >1 mm (24 h)−1 event and (b) >10 mm (24 h)−1 event, and (c) CRPSSs, all as a function of forecast lead time. "Multi-model/cal" refers to multimodel forecasts calibrated using ELR. "ECMWF/reforecast" refers to ECMWF forecasts calibrated using ELR and the reforecast dataset. Error bars denote confidence intervals, the 5th and 95th percentiles of a paired block bootstrap between ECMWF and NCEP forecasts.
More impressive increases in skill were evident for the >10 mm (24 h)−1 event. Both the reforecast-based calibration and the multimodel approach increased forecast skill by an equivalent of up to +2 days of additional lead time. Again, the calibration of the multimodel forecasts provided modest improvement at the early leads and degradation at the longer leads relative to the unprocessed multimodel approach.
As measured by the CRPSS, the multimodel forecasts were the most skillful, exceeding the skill of the reforecast-calibrated ECMWF forecasts by a small amount. Consider now where the forecasts were improved or degraded by the various approaches. Figure 10 provides maps of the day +3 CRPSS; maps for other lead times are in the supplemental material. The patterns of multimodel skill were rather similar to those of the most skillful individual ensemble system, ECMWF (Fig. 3a). The reforecast-calibrated ECMWF forecasts increased the skill most notably in the driest regions of the western United States.
Fig. 10. Maps of CRPSS for day +3 forecasts for (a) the multimodel, (b) the multimodel with ELR calibration, and (c) ECMWF with ELR calibration using reforecasts.
Figure 11 shows day +3 reliability diagrams for the >10 mm (24 h)−1 event for the multimodel, calibrated multimodel, and reforecast-calibrated ECMWF PQPFs. The raw multimodel PQPFs were slightly more reliable than any of the PQPFs from the individual centers (Fig. 5) and retained a slight overforecast bias at the higher probabilities. The improvements in reliability were more substantial than for the >1 mm (24 h)−1 event (see the diagrams in the supplemental material). The reforecast-calibrated PQPFs exhibited a slight underforecast bias and were not as sharp as those from the multimodel forecasts. Was this due to some inhomogeneity between the 2002–09 training data and the 2010 real-time forecasts? Figure 12 shows that there were fewer large forecast busts in 2010 than in 2002 or 2006. When the regression analysis from the 2002–09 data was applied to correct the 2010 forecasts, the implicit assumption was that the 2010 forecasts would be equally unskillful. In fact they were better, and as a consequence the postprocessed forecasts were less sharp than they could have been. Though it was not attempted here, it might be possible to apply ad hoc corrections to the training data to improve the regression analysis; perhaps a slight adjustment of the training-data ensemble mean toward the analyzed data would make its accuracy more closely resemble that of the 2010 data, making the ELR forecasts sharper, more reliable, and more skillful.
Fig. 11. As in Fig. 5, reliability diagrams for day +3 forecasts for the >10 mm (24 h)−1 event: (a) the multimodel, (b) the multimodel with ELR calibration, and (c) ECMWF with ELR calibration using reforecasts.
Fig. 12. Histogram of the absolute errors of day +3 ensemble-mean precipitation forecasts for the 2002 and 2006 reforecasts and for the 2010 20-member real-time ensemble.
Figure 13 shows the multimodel areal coverage of the 0.5 probability contours for the >10 mm (24 h)−1 event for selected cases; these should be compared with Fig. 6 for the ECMWF-only PQPFs. Figure 14 shows the same, but for reforecast-calibrated ECMWF PQPFs. The areal coverage was only slightly smaller for the multimodel PQPFs than for the ECMWF PQPFs, illustrating that the multimodel forecasts did not lose a great deal of sharpness (coverage of greater than 0.5 probability serving as a proxy for sharpness here). In comparison, the reforecast-calibrated PQPFs in Fig. 14 show a marked decrease in areal coverage; many grid points with probability p > 0.5 in the raw ECMWF PQPF had p < 0.5 after calibration. Figures 15 and 16 provide, for the cases plotted in Figs. 7 and 8, more detail on the behavior of typical multimodel and reforecast-calibrated PQPFs. The multimodel forecasts retained their sharpness, but not always desirably so. For example, in Fig. 15, the multimodel forecasts retained relatively high probabilities in eastern Iowa and northern Illinois, whereas the analyzed area was displaced farther south. The reforecast-calibrated PQPFs decreased the areal coverage of high probabilities, appropriately so in this case, reducing the false alarms. However, as inspection of Figs. 13 and 14 shows, there were many cases when the sharpness retained in the multimodel forecasts was desirable.
Fig. 13. As in Fig. 6, but for multimodel forecasts.
Fig. 14. As in Fig. 6, but for reforecast-calibrated ECMWF forecasts.
Fig. 15. (a) Analyzed precipitation for the 24-h period ending 0000 UTC 21 Jul 2010; the 10-mm contour is denoted by the thick black line. (b) Probability of >10 mm (24 h)−1 precipitation for the day +3 forecast from the ECMWF ensemble for the same period. (c) As in (b), but for the multimodel ensemble. (d) As in (b), but for the reforecast-calibrated ECMWF ensemble.
Fig. 16. As in Fig. 15, but for the 24-h period ending 0000 UTC 8 Aug 2010.
The results exhibited here with reforecast calibration were not as impressive as those in previous studies (e.g., Hamill and Whitaker 2006; Hamill et al. 2008). There are at least four reasons for this. First, the training data were not as accurate as the real-time data in this application (Fig. 12), and this inhomogeneity degraded the regression analysis. This may have been because the reforecast initial conditions were less accurate (ERA-Interim for the reforecasts, operational 4D-Var for the real-time forecasts) and because the reforecast ensemble was initialized with perturbations that were constructed with approximations different from those in the real-time forecasts (section 2b). Second, gratifying improvements have been made to models and EPSs, so that they produce more skillful and reliable forecasts than they did even in the recent past; it is tougher to improve upon ECMWF's 2010 model output than its 2005 model output. Third, even with the use of ECMWF's reforecasts, the training dataset in this study was limited, because precipitation analyses were unavailable prior to 2002 and reforecasts were unavailable more frequently than once per week. Fourth, in prior studies the ensemble forecasts (at coarse resolution) were evaluated against analysis data at finer resolution, so the reforecast calibration process also produced a statistical downscaling. This last point is worth keeping in mind when considering the relative merits of reforecast calibration versus multimodel approaches: if the desired output is forecast data at the grid scale, multimodel ensembles may have substantial appeal; if the desired output is point data or high-resolution gridded data, the statistical downscaling is more straightforward when reforecasts are used.
Overall, the impressive skill improvements provide evidence of the merit of both the multimodel ensemble and reforecast approaches. Should other forecast centers share precipitation ensemble data, large gains in probabilistic precipitation forecast skill are possible for little more than the cost of data transmission and storage. Alternatively, should any one center produce and use reforecasts, it can improve its own forecasts significantly, assuming a comparably long time series of observations or analyses is available. The improvement noted here with reforecasts may also have been modest because the training data were limited by the short time series of analyses, which extends back only to 2002; only around 40% of the available reforecast data were used.
4. Conclusions
This article examined probabilistic multimodel weather forecasts of precipitation over the CONUS and the relative advantages and disadvantages of these forecasts when compared to statistically postprocessed ECMWF forecasts. Twenty-member forecasts were extracted from the ECMWF, NCEP, UKMO, and CMC global ensemble systems at 1° resolution between July and October 2010. Daily 24-h accumulated probabilistic precipitation forecasts were generated from the resulting 80-member multimodel ensemble for lead times from +1 to +5 days and compared to gridded precipitation analyses. Two statistically postprocessed products were also evaluated. The first was multimodel forecasts that were adjusted using extended logistic regression and trained on the previous 30 days of forecasts and analyses. The second was ECMWF forecasts that were statistically adjusted using forecast/analysis data for the period 2002–09, the period when both reforecasts and analyses were available.
Considering first the skill of forecasts from the individual EPSs, ECMWF forecasts generally were the most skillful in terms of Brier skill scores and the continuous ranked probability skill score. CMC forecasts were the most reliable but the least sharp, while NCEP and UKMO forecasts were sharper but less reliable.
Multimodel probabilistic forecast products were substantially more skillful than the best of the individual centers' probabilistic forecasts. The improvement was approximately an extra +0.5 to +1 day of forecast lead time for light precipitation events and as much as +2 days for heavier precipitation events. The reforecast-calibrated ECMWF forecasts exhibited more skill and reliability improvement for the >10 mm (24 h)−1 event than for the >1 mm (24 h)−1 event. Relative to the multimodel forecasts, the reforecast-calibrated skill was similar for the >10 mm (24 h)−1 event, but the reforecast-calibrated forecasts were more reliable while the multimodel forecasts were sharper.
The results exhibited here with reforecast calibration were not as impressive as they have been in previous studies. There were at least four reasons for the lessened improvement of reforecast calibration here. First, the reforecast training data were shown to be less accurate than the real-time data in this application. Second, gratifying improvements have been made to models and EPSs in the last few years; it is tougher to improve upon ECMWF’s 2010 model output than its 2005 model output. Third, a limited training dataset was available for this study. Fourth, prior studies were performed at higher resolution and produced a statistical downscaling that the coarser raw forecasts could not accomplish.
I was pleasantly surprised by the magnitude of the skill improvements demonstrated here from multimodel ensembles, improvements that were larger than those seen with 2-m temperatures (Hagedorn et al. 2012). From my own experience, however, I recommend some caution against broadly generalizing these results to any multimodel ensemble system. This study examined a combination of data from four mature EPSs based on mature models and assimilation systems. Each center's system has been refined through the collective efforts of hundreds if not thousands of person-years of research and development. A combination of less developed EPSs may not provide nearly the same gratifying result.
Nonetheless, these results demonstrate the potential value of multimodel ensembles. THORPEX, organized by the World Meteorological Organization, has promoted the concept of a multimodel-based "Global Interactive Forecast System" (Bougeault et al. 2010), whereby the operational centers share data to facilitate the production of multimodel products for high-impact weather events. This study provides additional evidence for the validity and the potential benefits of such a system. Currently, several centers have restrictive data policies; full access to their data is reserved for paying customers, and those customers cannot thereafter share the data they purchased. In the United States and Canada, by contrast, the data are effectively free, since the research, development, and production were funded by public taxpayer funds. Perhaps this approach will be followed by other centers worldwide, for the mutual benefit of all.
Finally, can we all have “the best of both worlds”? That is, will NWP centers agree to both share their ensemble data freely and internationally in real time, and produce reforecast datasets so that each model can be calibrated to remove systematic errors prior to their combination? There is evidence that such approaches will provide a substantial benefit. The climate community is working on sharing multimodel information and hindcasts to facilitate the error correction for intraseasonal and seasonal forecasts. For weather and weather-to-climate applications, there have also been successful demonstrations of multimodel calibrated forecasts (Vislocky and Fritsch 1995; Whitaker et al. 2006). The National Oceanic and Atmospheric Administration (NOAA) is currently developing a new reforecast dataset for its global ensemble prediction system, and I hope that other centers will be inspired to do so as well.
Acknowledgments
TIGGE data were obtained from ECMWF’s TIGGE data portal. I thank ECMWF for the development of this portal software and for the archival of this immense dataset. Florian Pappenberger of ECMWF was very helpful in extracting and preprocessing the reforecasts. Yan Luo of NCEP/EMC was helpful in obtaining the CCPA data. Tom Galarneau of NCAR/MMM is thanked for providing an informal review. Publication of this study was funded by NOAA THORPEX. Three anonymous reviewers are thanked for their helpful recommendations to improve this article.
APPENDIX A
Description of Modeling Systems Used
Here are additional details on the forecast models and ensemble systems used in this study.
a. NCEP
NCEP used the Global Forecast System (GFS) model in its ensemble system at T190L28 resolution. A lengthier description of the physical parameterization packages used in this model was given in Hamill et al. (2011), and a description of the GFS model, with changes through 2003, is available online from the NCEP Environmental Modeling Center (EMC; www.emc.ncep.noaa.gov/gmb/moorthi/gam.html).
The control initial condition, around which the perturbed initial conditions were centered, was produced by the gridpoint statistical interpolation (GSI) analysis (Kleist et al. 2009) at T382L64 resolution. Perturbed initial conditions were generated with the ensemble transform with rescaling technique of Wei et al. (2008). Stochastic perturbations were included, following Hou et al. (2008). More details on changes to the NCEP ensemble system can be found online (http://www.emc.ncep.noaa.gov/gmb/yzhu/html/ENS_IMP.html).
b. Canadian Meteorological Centre
The CMC EPS used the Global Environmental Multiscale Model (GEM), a primitive equation model with a terrain-following pressure vertical coordinate. Further documentation of the GEM model can be found online (http://collaboration.cmc.ec.gc.ca/science/rpn/gef_html_public/DOCUMENTATION/GENERAL/general.html) and in Charron et al. (2010). The CMC ensemble system used a horizontal computational grid of 400 × 200 grid points, or approximately 0.9° grid spacing, and 28 vertical levels. Ensemble Kalman filter (EnKF) initial conditions were used, following Charron et al. (2010) and Houtekamer et al. (2009, and references therein). The 20 forecast ensemble members used a variety of perturbed physics: varying gravity wave drag parameters, land surface process type, condensation scheme, convection scheme, shallow convection scheme, mixing-length formulation, and turbulent vertical diffusion parameter. More details are provided online (http://www.weatheroffice.gc.ca/ensemble/verifs/model_e.html).
c. ECMWF
The ECMWF EPS used the ECMWF Integrated Forecast System (IFS) model, version Cy36r2, for both the real-time 2010 forecasts and the reforecasts (section 2b). The model resolution was T639L62 in both cases (details on the IFS are provided at www.ecmwf.int/research/ifsdocs/). The changes to the ensemble stochastic treatments in the 8 September 2009 implementation are described in Palmer et al. (2009). The ensemble was initialized with a combination of initial-time and evolved total-energy singular vectors (Buizza and Palmer 1995; Molteni et al. 1996; Barkmeijer et al. 1998, 1999; Leutbecher 2005) and used stochastic perturbations to physical tendencies. An overview of the ensemble system was provided in Buizza et al. (2007, and references therein). For consistency with the analyses of the other EPSs, only the first 20 perturbed members were used here.
d. UKMO
The Met Office ensemble system was the Met Office Global and Regional Ensemble Prediction System (MOGREPS). The forecasts used here came from its global component, which was described in Bowler et al. (2008, 2009). The global system was run at a resolution of 0.83° longitude and 0.55° latitude on a regular latitude–longitude grid, with 70 vertical levels (Tennant et al. 2011). Initial-condition perturbations were generated from an implementation of the ensemble transform Kalman filter (Hunt et al. 2007; Bowler et al. 2009). The mean initial state was generated from the UKMO 4D-Var system (Rawlins et al. 2007). The model included a parameterization of one type of model uncertainty via its stochastic kinetic energy backscatter scheme, following Shutts (2005) and Tennant et al. (2011).
APPENDIX B
Methodology to Increase Training Sample Size
This appendix discusses the method used to augment the training sample size used in the regression analyses. Suppose that, when calibrating the multimodel ensemble using the past 30 days of forecasts and analyses, only data from the grid point of interest were used; this would provide only 30 training samples. Older forecasts could be used, but precipitation biases are often seasonally dependent, so the older data may degrade the results despite augmenting the sample size. Also, with such a multimodel ensemble, the farther back into the past one seeks training data, the more likely it is that at least one of the models will have had a major upgrade and a concomitant change in systematic error characteristics.
Even though ECMWF provides a multidecadal reforecast, in practice the sample sizes were too small here, too. Using the 2002–09 weekly, five-member ECMWF reforecasts (including reforecast dates within ±6 weeks of the week of interest) provided a total of 13 weeks × 8 yr = 104 samples. In both cases these were relatively small samples from which to estimate four regression parameters, and, especially for rare events such as heavy precipitation, experience has shown that larger training samples improve the regression analysis.
Hence, following the general philosophy demonstrated and discussed in Hamill et al. (2008) and inspired by the regionalization used in some model output statistics algorithms (Lowry and Glahn 1976), the training dataset for a particular grid point was augmented by finding 25 other grid points with relatively similar climatological analyzed cumulative distribution functions (CDFs). For a particular location, the difference between its climatological analyzed precipitation CDF and that of every other grid point was measured, the 25 grid points with the most similar climatologies were selected, and their forecast/analysis training pairs were pooled with those from the location itself.
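A minimal sketch of this supplemental-location selection is given below. Measuring CDF similarity as a mean absolute difference at a fixed set of precipitation amounts is an assumption for illustration; the exact similarity measure used in the study is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def most_similar_points(clim_cdfs, target, n_pick=25):
    """Select the grid points whose climatological analyzed precipitation
    CDFs are most similar to the target point's, so that their
    forecast/analysis pairs can supplement the local training data.

    clim_cdfs: (n_points, n_amounts) array of CDF values evaluated at a
    common set of precipitation amounts.
    target: index of the grid point being calibrated."""
    dist = np.mean(np.abs(clim_cdfs - clim_cdfs[target]), axis=1)
    dist[target] = np.inf                    # exclude the point itself
    return np.argsort(dist)[:n_pick]
```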
REFERENCES
Accadia, C., S. Mariani, M. Casaioli, A. Lavagnini, and A. Speranza, 2003: Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids. Wea. Forecasting, 18, 918–932.
Bao, L., T. Gneiting, E. P. Grimit, P. Guttorp, and A. E. Raftery, 2010: Bias correction and Bayesian model averaging for ensemble forecasts of surface wind direction. Mon. Wea. Rev., 138, 1811–1821.
Barkmeijer, J., F. Bouttier, and M. Van Gijzen, 1998: Singular vectors and estimates of the analysis-error covariance metric. Quart. J. Roy. Meteor. Soc., 124, 1695–1713.
Barkmeijer, J., R. Buizza, and T. N. Palmer, 1999: 3D-Var Hessian singular vectors and their potential use in the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 125, 2333–2351.
Berner, J., G. J. Shutts, M. Leutbecher, and T. N. Palmer, 2009: A spectral stochastic kinetic energy backscatter scheme and its impact on flow-dependent predictability in the ECMWF ensemble prediction system. J. Atmos. Sci., 66, 603–626.
Berner, J., S.-Y. Ha, J. P. Hacker, A. Fournier, and C. Snyder, 2011: Model uncertainty in a mesoscale ensemble prediction system: Stochastic versus multiphysics representations. Mon. Wea. Rev., 139, 1972–1995.
Bougeault, P., and Coauthors, 2010: The THORPEX Interactive Grand Global Ensemble. Bull. Amer. Meteor. Soc., 91, 1059–1072.
Bowler, N. E., A. Arribas, K. R. Mylne, K. B. Robertson, and S. E. Beare, 2008: The MOGREPS short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 134, 703–722.
Bowler, N. E., A. Arribas, S. E. Beare, K. R. Mylne, and G. J. Shutts, 2009: The local ETKF and SKEB: Upgrades to the MOGREPS short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 135, 767–776.
Buizza, R., and T. N. Palmer, 1995: The singular-vector structure of the atmospheric global circulation. J. Atmos. Sci., 52, 1434–1456.
Buizza, R., M. Miller, and T. N. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 125, 2887–2908.
Buizza, R., J.-R. Bidlot, N. Wedi, M. Fuentes, M. Hamrud, G. Holt, and F. Vitart, 2007: The new ECMWF VAREPS (Variable Resolution Ensemble Prediction System). Quart. J. Roy. Meteor. Soc., 133, 681–695.
Candille, G., 2009: The multiensemble approach: The NAEFS example. Mon. Wea. Rev., 137, 1655–1665.
Charron, M., G. Pellerin, L. Spacek, P. L. Houtekamer, N. Gagnon, H. L. Mitchell, and L. Michelin, 2010: Toward random sampling of model error in the Canadian ensemble prediction system. Mon. Wea. Rev., 138, 1877–1901.
Dee, D. P., and Coauthors, 2011: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Quart. J. Roy. Meteor. Soc., 137, 553–597.
Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting – II. Calibration and combination. Tellus, 57A, 234–252.
Glahn, B., M. Peroutka, J. Wiedenfeld, J. Wagner, G. Zylstra, B. Schuknecht, and B. Jackson, 2009: MOS uncertainty estimates in an ensemble framework. Mon. Wea. Rev., 137, 246–268.
Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118.
Hagedorn, R., 2008: Using the ECMWF reforecast data set to calibrate EPS reforecasts. ECMWF Newsletter, No. 117, ECMWF, Reading, United Kingdom, 8–13.
Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting – I. Basic concept. Tellus, 57A, 219–233.
Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619.
Hagedorn, R., R. Buizza, T. M. Hamill, M. Leutbecher, and T. N. Palmer, 2012: Comparing TIGGE multi-model forecasts with reforecast-calibrated ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., in press.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? Quart. J. Roy. Meteor. Soc., 132, 2905–2923.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229.
Hamill, T. M., and J. S. Whitaker, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. Mon. Wea. Rev., 135, 3273–3280.
Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Mon. Wea. Rev., 132, 1434–1447.
Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46.
Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632.
Hamill, T. M., J. S. Whitaker, M. Fiorino, and S. G. Benjamin, 2011: Global ensemble predictions of 2009's tropical cyclones initialized with an ensemble Kalman filter. Mon. Wea. Rev., 139, 668–688.
Higgins, R. W., J. E. Janowiak, and Y.-P. Yao, 1996: A gridded hourly precipitation data base for the United States (1963–1993). NCEP/Climate Prediction Center ATLAS 1, U.S. Department of Commerce, NOAA/NWS, 47 pp.
Hou, D., Z. Toth, Y. Zhu, and W. Yang, 2008: Impact of a stochastic perturbation scheme on NCEP Global Ensemble Forecast System. Proc. 19th Conf. on Probability and Statistics, New Orleans, LA, Amer. Meteor. Soc., 1.1. [Available online at http://ams.confex.com/ams/pdfpapers/134165.pdf.]
Houtekamer, P. L., and H. L. Mitchell, 2005: Ensemble Kalman filtering. Quart. J. Roy. Meteor. Soc., 131, 3269–3289.
Houtekamer, P. L., H. L. Mitchell, and X. Deng, 2009: Model error representation in an operational ensemble Kalman filter. Mon. Wea. Rev., 137, 2126–2143.
Hunt, B., E. Kostelich, and I. Szunyogh, 2007: Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D, 230, 112–126.
Isaksen, L., M. Bonavita, R. Buizza, M. Fisher, J. Haseler, M. Leutbecher, and L. Raynaud, 2010: Ensemble of data assimilations at ECMWF. ECMWF Tech. Memo. 636, 46 pp.
Iversen, T., A. Deckmyn, C. Santos, K. A. I. Sattler, J. B. Bremnes, H. Feddersen, and I.-L. Frogner, 2011: Evaluation of 'GLAMEPS'—A proposed multimodel EPS for short range forecasting. Tellus, 63A, 513–530.
Johnson, C., and R. Swinbank, 2009: Medium-range multimodel ensemble combination and calibration. Quart. J. Roy. Meteor. Soc., 135, 777–794.
Kleist, D. T., D. F. Parrish, J. C. Derber, R. Treadon, W.-S. Wu, and S. Lord, 2009: Introduction of the GSI into the NCEP Global Data Assimilation System. Wea. Forecasting, 24, 1691–1705.
Krishnamurti, T. N., C. M. Kishtawal, Z. Zhang, T. LaRow, D. Bachiochi, E. Williford, S. Gadgil, and S. Surendran, 2000: Multimodel ensemble forecasts for weather and seasonal climate. J. Climate, 13, 4196–4216.
Leutbecher, M., 2005: On ensemble prediction using singular vectors started from forecasts. Mon. Wea. Rev., 133, 3038–3046.
Lin, Y., and K. E. Mitchell, 2005: The NCEP stage II/IV hourly precipitation analyses: Development and applications. Preprints, 19th Conf. on Hydrology, San Diego, CA, Amer. Meteor. Soc., 1.2. [Available online at http://ams.confex.com/ams/Annual2005/webprogram/Paper83847.html.]
Lorenz, J., H. Rauhut, F. Schweitzer, and D. Helbing, 2011: How social influence can undermine the wisdom of crowd effect. Proc. Natl. Acad. Sci. USA, 108, 9020–9025.
Lowry, D. A., and H. R. Glahn, 1976: An operational model for forecasting probability of precipitation—PEATMOS PoP. Mon. Wea. Rev., 104, 221–232.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
Mylne, K. R., R. E. Evans, and R. T. Clark, 2002: Multi-model multi-analysis ensembles in quasi-operational medium-range forecasting. Quart. J. Roy. Meteor. Soc., 128, 361–384.
Palmer, T. N., R. Buizza, F. J. Doblas-Reyes, T. Jung, M. Leutbecher, G. J. Shutts, M. Steinheimer, and A. Weisheimer, 2009: Stochastic parametrization and model uncertainty. ECMWF Tech. Memo. 598, 42 pp.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174.
Rawlins, F., S. P. Ballard, K. J. Bovis, A. M. Clayton, D. Li, G. W. Inverarity, A. C. Lorenc, and T. J. Payne, 2007: The Met Office global four-dimensional variational data assimilation scheme. Quart. J. Roy. Meteor. Soc., 133, 347–362.
Roulin, E., and S. Vannitsem, 2012: Postprocessing of ensemble precipitation predictions with extended logistic regression based on hindcasts. Mon. Wea. Rev., 140, 874–888.
Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw ensemble output, (modified) Bayesian model averaging, and extended logistic regression using ECMWF ensemble precipitation reforecasts. Mon. Wea. Rev., 138, 4199–4211.
Shutts, G., 2005: A kinetic energy backscatter algorithm for use in ensemble prediction systems. Quart. J. Roy. Meteor. Soc., 131, 3079–3102.
Sloughter, J. M. L., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. Mon. Wea. Rev., 135, 3209–3220.
Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. Mon. Wea. Rev., 131, 2510–2524.
Tennant, W. J., G. J. Shutts, A. Arribas, and S. A. Thompson, 2011: Using a stochastic kinetic energy backscatter scheme to improve MOGREPS probabilistic forecast skill. Mon. Wea. Rev., 139, 1190–1206.
Vannitsem, S., and C. Nicolis, 2008: Dynamical properties of model output statistics forecasts. Mon. Wea. Rev., 136, 405–419.
Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. Bull. Amer. Meteor. Soc., 76, 1157–1164.
Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.
Wei, M., Z. Toth, R. Wobus, and Y. Zhu, 2008: Initial perturbations based on the ensemble transform (ET) technique in the NCEP global operational forecast system. Tellus, 60A, 62–79.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134, 241–260.
Whitaker, J. S., X. Wei, and F. Vitart, 2006: Improving week-2 forecasts with multimodel reforecast ensembles. Mon. Wea. Rev., 134, 2279–2284.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.
Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368.
Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390.
Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007: Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging. Mon. Wea. Rev., 135, 1364–1385.
Yussouf, N., and D. J. Stensrud, 2007: Bias-corrected short-range ensemble forecasts of near-surface variables during the 2005/06 cool season. Wea. Forecasting, 22, 1274–1286.