## 1. Introduction

Wilson et al. (2007, hereafter W07) recently described the application of the Bayesian model averaging (BMA; Raftery et al. 2005, hereafter R05) calibration technique to surface temperature forecasts using the Canadian ensemble prediction system. The BMA technique as applied in W07 produced an adjusted probabilistic forecast from an ensemble through a two-step procedure. The first step was the correction of biases of individual members through regression analyses. The second step was the fitting of a Gaussian kernel around each bias-corrected member of the ensemble. The amount of weight applied to each member’s kernel and the width of the kernel(s) were set through an expectation maximization (EM) algorithm (Dempster et al. 1977). The final probability density function (pdf) was a sum of the weighted kernels.
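The weighted-kernel construction described above can be sketched directly. This is a minimal illustration of a BMA predictive pdf as a weighted sum of Gaussian kernels, not W07's implementation; the member values, weights, and kernel width below are invented for the example:

```python
import numpy as np
from scipy.stats import norm

def bma_pdf(x, members, weights, sigma):
    """Evaluate the BMA predictive pdf at x: a weighted sum of Gaussian
    kernels, one centered on each bias-corrected ensemble member."""
    members = np.asarray(members, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * norm.pdf(x, loc=members, scale=sigma))

# Example: four bias-corrected members (temperatures in K), equal weights
members = [271.2, 272.0, 272.5, 273.1]
weights = [0.25, 0.25, 0.25, 0.25]
density = bma_pdf(272.0, members, weights, sigma=1.0)
```

Because the weights sum to one and each kernel is a proper density, the resulting pdf also integrates to one.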

W07 reported (their Fig. 2) that at any given instant, a majority of the ensemble members were typically assigned zero weight, while a few select members received the majority of the weight. Which members received large weights varied from one day to the next. These results were counterintuitive. Why effectively discard the information from so many ensemble members? Why should one member have positive weight one day and none the next?

This comment on W07 will show that BMA in which the EM is permitted to adjust the weight of each member individually is not an appropriate application of the technique when the sample size is small;^{1} specifically, the radically unequal weights of W07 exemplify "overfitting" (Wilks 2006a, p. 207) to the training data. A symptom of overfitting is an improved fit to the training data but a worsened fit to independent data. This may happen when the statistician attempts to fit a large number of parameters using a relatively small training sample. In W07, the EM algorithm was required to set the weights of 16 individual ensemble members and a kernel standard deviation with between 25 and 80 days of data.

To illustrate the problem of overfitting in W07's methodology, a reforecast dataset was used, composed of more than two decades of daily ensemble forecasts with perturbed initial conditions, all from a single forecast model. This large dataset permitted a comparison of BMA properties based on small and large training samples. The reforecasts were produced with a T62, circa-1998 version of the National Centers for Environmental Prediction (NCEP) Global Forecast System. A 15-member forecast, consisting of a control and seven bred pairs (Toth and Kalnay 1997), was integrated to a lead of 15 days for every day from 1979 to the present. For more details on this reforecast dataset, see Hamill et al. (2006). The verification data were from the NCEP–National Center for Atmospheric Research reanalysis (Kalnay et al. 1996).

## 2. Overfitting with the BMA–EM algorithm

EM is an iterative algorithm that adjusts the BMA model parameters through a two-step procedure of expectation (estimating each member's responsibility for each training case) and maximization. R05 [Eqs. (5)–(6) and accompanying text] provides more detail. The algorithm iterates to convergence, stopping when the change in the log-likelihood function from one iteration to the next is less than a cutoff *δ*. The magnitude of *δ* may be chosen by the user, but it can be assumed that *δ* ≪ 1.0.
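A minimal sketch of such an EM fit for the BMA weights and a single common kernel standard deviation, following the structure of R05's Eqs. (5)–(6) with the log-likelihood stopping rule described above (the function and variable names are our own, and this is a simplified reading rather than R05's exact code):

```python
import numpy as np
from scipy.stats import norm

def bma_em(fcsts, obs, delta=1e-4, max_iter=500):
    """Fit BMA member weights and a common kernel sigma by EM.
    fcsts: (T, K) bias-corrected member forecasts; obs: (T,) verifying obs.
    Iterates until the log likelihood improves by less than delta."""
    T, K = fcsts.shape
    w = np.full(K, 1.0 / K)                      # initial guess: equal weights
    sigma = np.std(obs[:, None] - fcsts)         # initial kernel width
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of member k's kernel for case t
        dens = w * norm.pdf(obs[:, None], loc=fcsts, scale=sigma)  # (T, K)
        ll = np.sum(np.log(dens.sum(axis=1)))
        if ll - ll_old < delta:                  # convergence criterion delta
            break
        ll_old = ll
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weights are mean responsibilities; sigma^2 is the
        # responsibility-weighted mean-squared error
        w = z.mean(axis=0)
        sigma = np.sqrt(np.sum(z * (obs[:, None] - fcsts) ** 2) / T)
    return w, sigma
```

With a large sample of identically distributed members, the fitted weights stay close to equal; the overfitting discussed below arises when T is small relative to K.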

To illustrate the tendency for the BMA EM to overfit when trained with small sample sizes, consider 4-day 850-hPa temperature ensemble forecasts for a grid point near Montreal, Quebec, Canada. Forecasts were produced and validated for 23 yr × 365 days − 40 days = 8355 cases. Because we would like to assume a priori in this example that the member weights should be equal, the 15-member ensemble was thinned by eliminating the slightly more accurate control member. The remaining 14 bred members can be assumed to have identically distributed (but not independent; see Wang and Bishop 2003) errors and hence should have been assigned equal weights. The BMA algorithm was then trained using these 14 identically distributed bred members and only the prior 40 days of forecasts and analyses, posited in W07 to be an acceptably long training period. We shall refer to this as the "40-day training" dataset. In addition, the BMA algorithm was also trained with a very long training dataset in a cross-validated manner using 22 yr × 91 days of data, with the 91 days centered on the Julian day of the forecast. This will be referred to as the "22-yr training" dataset.

Given a member forecast $x^f_i$, an ensemble mean forecast $\bar{x}^f$, and a regression-corrected ensemble-mean forecast $(a + b\,\bar{x}^f)$, the member forecast was replaced with a forecast that was the sum of the member's perturbation from the ensemble mean and the corrected ensemble mean:

$$x^f_i \leftarrow \left(x^f_i - \bar{x}^f\right) + \left(a + b\,\bar{x}^f\right), \qquad (1)$$

where ← denotes the replacement operation. This modified regression correction was used because, when every member was regressed separately, the members were increasingly regressed toward the training-sample mean of the observations as forecast lead increased and skill decreased. Consequently, the ensemble spread of the adjusted members shrank (Fig. 1; see also Wilks 2006b) and the collinearity of errors among members was accentuated (Fig. 2). These were clearly undesirable properties: the spread should asymptotically approach the climatological spread of the ensemble forecast, and, ideally, member forecasts should have independent errors. Had the regression correction of each member been applied, there might have been some confusion as to whether the highly nonuniform weights subsequently produced by the BMA were a generic property of a short training dataset or whether they were artificially induced by the increased collinearity from the regression analyses.
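The modified correction can be sketched as follows. This is an illustrative reading of the replacement operation with hypothetical function names: the regression is fit to the training ensemble mean only, so each member's perturbation, and hence the ensemble spread, is preserved exactly:

```python
import numpy as np

def correct_members(fcsts, train_fcsts, train_obs):
    """Modified regression correction: regress the training ensemble *mean*
    against the observations, then replace each member by the corrected mean
    plus that member's original perturbation from the ensemble mean.
    fcsts: (K,) current ensemble; train_fcsts: (T, K); train_obs: (T,)."""
    train_mean = train_fcsts.mean(axis=1)
    # Least-squares fit obs = a + b * (ensemble mean) over the training period
    b, a = np.polyfit(train_mean, train_obs, deg=1)
    xbar = fcsts.mean()
    # x_i <- (x_i - xbar) + (a + b * xbar): perturbations are retained,
    # so the spread does not collapse toward the training-sample mean
    return (fcsts - xbar) + (a + b * xbar)
```

Regressing each member separately would instead shrink every member toward the training-sample mean of the observations as skill decreases; this construction avoids that.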

We now consider the properties of the EM algorithm for this application. The initial guess for all member weights was 1/14. For each of the 8355 cases, the ratio of the maximum to the minimum BMA member weight after EM convergence was recorded; the median of these ratios was then plotted as the EM convergence criterion *δ* was varied. For the 40-day training period, when *δ* = 0.01 the largest and smallest weights were much more similar than when *δ* ≪ 0.01 (Fig. 3a). With the 22-yr training data, the weights stayed much more nearly equal as *δ* was decreased (Fig. 3b).

Could the unequal weightings with the 40-day training set and a tight *δ* actually be appropriate? As mentioned in R05, as the EM iterates, the log likelihood *of the fit to the training data* is guaranteed to increase. However, we can also track the fit to the validation data. Figures 4a,b show the average training and validation log likelihoods (per forecast day) for the small and large training data sizes. Notice that for the small sample size, the validation-data log likelihood decreased as the convergence criterion was tightened, a sign that the unequal weights were not realistic. The same effect was barely evident with the large training dataset, where the weights remained nearly equal as the convergence criterion was tightened.^{2} This demonstrates that the highly variable weights with the 40-day training were most likely an artifact of overfitting. Perhaps this was not surprising, given that the EM algorithm was expected to fit 15 parameters here (14 weights plus a standard deviation) with 40 samples. Further, the effective sample size (Wilks 2006a, p. 144) may actually have been smaller than 40; perhaps the assumption of independence of forecast errors in space and time (R05, p. 1159) was badly violated with these ensemble forecasts. Also, we agree with the W07 proposition that the radical differences in weights may in part be a consequence of the collinearity of members' errors in the training data. What is clear here is that this collinearity was not properly estimated from small samples, which led to the inappropriate deweighting and exclusion of information from some members.
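The overfitting diagnostic used here, the average log likelihood per forecast case, is simple to compute for any fitted set of weights and kernel width (a sketch with our own naming; it should be evaluated on both the training sample and independent validation data):

```python
import numpy as np
from scipy.stats import norm

def avg_log_likelihood(fcsts, obs, weights, sigma):
    """Average BMA log likelihood per forecast case.
    fcsts: (T, K) member forecasts; obs: (T,); weights: (K,); sigma: scalar.
    If this rises on the training data but falls on independent data as the
    EM convergence criterion is tightened, the weights are being overfit."""
    dens = weights * norm.pdf(obs[:, None], loc=fcsts, scale=sigma)
    return np.mean(np.log(dens.sum(axis=1)))
```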

When the BMA weights were constrained to be equal and 40-day training was used, the resulting continuous ranked probability skill score [CRPSS; calculated in the manner suggested by Hamill and Juras (2006) to avoid overestimating skill; 0.0 = the skill of climatology, 1.0 = a perfect forecast] was 0.38. When the individual weights were allowed to be estimated by the EM and the convergence criterion was 0.000 03, the resulting CRPSS was smaller, 0.35. When the 22-yr training data were used, the CRPSS was 0.410, regardless of whether the weights were constrained to be equal or allowed to vary.
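For reference, the sample CRPS of a single ensemble forecast can be computed from the energy-score identity E|X − y| − ½E|X − X′| (cf. Hersbach 2000). This sketch, with our own function names, does not implement the climatology stratification of Hamill and Juras (2006) that the skill scores above rely on:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample CRPS for one ensemble forecast and one observation:
    mean |member - obs| minus half the mean pairwise member spread."""
    m = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(m - obs))
    term2 = 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))
    return term1 - term2

def crpss(fcst_crps, ref_crps):
    """Skill score versus a reference (e.g., climatology):
    0.0 = reference skill, 1.0 = perfect forecast."""
    return 1.0 - fcst_crps / ref_crps
```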

Is there a simpler, more robust method for setting the weights? Let the sample variances of the forecast errors of the *n* members, estimated from the training data, be $s_1, \ldots, s_n$. The weights that would have produced the minimum-variance estimate of the mean state [e.g., Daley 1991, p. 36, Eq. (2.2.3)] under assumptions of normality of errors were

$$w_i = \frac{1/s_i}{\sum_{j=1}^{n} 1/s_j}. \qquad (2)$$

An additional advantage of this method for setting the weights was that, if there truly was a strong collinearity of member errors, the BMA pdf should not have been worse as a consequence of using the more equal weights of Eq. (2) rather than the unequal weights from a highly iterated EM. This can be demonstrated simply by considering two highly collinear member forecasts with similar errors and biases, so that $x^f_i \cong x^f_j$. Then the weighted sums are similar regardless of the partitioning of the weights; for example, $w_i x^f_i + w_j x^f_j \cong (w_i + w_j)\,x^f_i$ for any split of the total weight $w_i + w_j$.
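Eq. (2) amounts to normalized inverse-variance weighting, a one-line computation (a sketch with hypothetical naming; the error variances would come from the training sample):

```python
import numpy as np

def min_variance_weights(error_variances):
    """Eq. (2): weights inversely proportional to each member's training
    error variance, normalized to sum to one (the minimum-variance
    combination under assumptions of normality and independence)."""
    inv = 1.0 / np.asarray(error_variances, dtype=float)
    return inv / inv.sum()
```

With identically distributed members the variances are similar, so the weights stay near 1/*n* by construction, in contrast to the highly iterated EM on a small sample.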

## 3. Conclusions

While the BMA technique is theoretically appealing for ensemble forecast calibration, BMA with the EM technique cannot be expected to set realistic weights for each member when a short training dataset is used. Enforcing more nearly equal weights among BMA members [Eq. (2)] may work as well as or better than allowing the EM method to estimate a separate weight for each member.

## REFERENCES

Daley, R., 1991: *Atmospheric Data Analysis.* Cambridge University Press, 457 pp.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977: Maximum likelihood from incomplete data via the EM algorithm. *J. Roy. Stat. Soc.*, **39B**, 1–39.

Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? *Quart. J. Roy. Meteor. Soc.*, **132**, 2905–2923.

Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. *Bull. Amer. Meteor. Soc.*, **87**, 33–46.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. *Bull. Amer. Meteor. Soc.*, **77**, 437–471.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319.

Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158.

Wilks, D. S., 2006a: *Statistical Methods in the Atmospheric Sciences.* 2d ed. Academic Press, 627 pp.

Wilks, D. S., 2006b: Comparison of ensemble-MOS methods in the Lorenz '96 setting. *Meteor. Appl.*, **13**, 243–256.

Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007: Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 1364–1385.

^{1} This is not meant to imply that BMA and the EM method are inappropriate, merely that the methods can be inappropriately applied.