## 1. Introduction

Wilson et al. (2007, hereafter W07) recently described the application of the Bayesian model averaging (BMA; Raftery et al. 2005, hereafter R05) calibration technique to surface temperature forecasts using the Canadian ensemble prediction system. The BMA technique as applied in W07 produced an adjusted probabilistic forecast from an ensemble through a two-step procedure. The first step was the correction of biases of individual members through regression analyses. The second step was the fitting of a Gaussian kernel around each bias-corrected member of the ensemble. The amount of weight applied to each member’s kernel and the width of the kernel(s) were set through an expectation maximization (EM) algorithm (Dempster et al. 1977). The final probability density function (pdf) was a sum of the weighted kernels.
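The weighted-kernel construction described above can be sketched directly. This is a minimal illustration of a BMA predictive pdf as a weighted sum of Gaussian kernels, not W07's implementation; the member values, weights, and kernel width below are invented for the example:

```python
import numpy as np
from scipy.stats import norm

def bma_pdf(x, members, weights, sigma):
    """Evaluate the BMA predictive pdf at x: a weighted sum of Gaussian
    kernels, one centered on each bias-corrected ensemble member."""
    members = np.asarray(members, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * norm.pdf(x, loc=members, scale=sigma))

# Example: four bias-corrected members (temperatures in K), equal weights
members = [271.2, 272.0, 272.5, 273.1]
weights = [0.25, 0.25, 0.25, 0.25]
density = bma_pdf(272.0, members, weights, sigma=1.0)
```

Because the weights sum to one and each kernel is a proper density, the resulting pdf also integrates to one.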

W07 reported (their Fig. 2) that at any given instant, a majority of the ensemble members were typically assigned zero weight, while a few select members received the majority of the weight. Which members received large weights varied from one day to the next. These results were counterintuitive. Why effectively discard the information from so many ensemble members? Why should one member have positive weight one day and none the next?

This comment on W07 will show that BMA in which the EM is permitted to adjust the weight of each member individually is not an appropriate application of the technique when the sample size is small;^{1} specifically, the radically unequal weights of W07 exemplify "overfitting" (Wilks 2006a, p. 207) to the training data. A symptom of overfitting is an improved fit to the training data but a worsened fit to independent data. This may happen when the statistician attempts to fit a large number of parameters using a relatively small training sample. In W07, the EM algorithm was required to set the weights of 16 individual ensemble members and a kernel standard deviation with between 25 and 80 days of data.

To illustrate the problem of overfitting in W07's methodology, a reforecast dataset was used, composed of more than two decades of daily ensemble forecasts with perturbed initial conditions, all from a single forecast model. This large dataset permitted a comparison of BMA properties based on small and large training samples. The reforecasts were produced with a T62, circa-1998 version of the National Centers for Environmental Prediction (NCEP) Global Forecast System. A 15-member forecast, consisting of a control and seven bred pairs (Toth and Kalnay 1997), was integrated to a lead of 15 days for every day from 1979 to the present. For more details on this reforecast dataset, see Hamill et al. (2006). The verification data were from the NCEP–National Center for Atmospheric Research reanalysis (Kalnay et al. 1996).

## 2. Overfitting with the BMA–EM algorithm

EM is an iterative algorithm that adjusts the BMA model parameters through a two-step procedure of expectation (estimating each member's responsibility for each training case) and maximization. R05 [Eqs. (5)–(6) and accompanying text] provides more detail. The algorithm iterates to convergence, stopping when the change in the log-likelihood function from one iteration to the next is less than a cutoff *δ*. The magnitude of *δ* may be chosen by the user, but it can be assumed that *δ* ≪ 1.0.
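A minimal sketch of such an EM fit for the BMA weights and a single common kernel standard deviation, following the structure of R05's Eqs. (5)–(6) with the log-likelihood stopping rule described above (the function and variable names are our own, and this is a simplified reading rather than R05's exact code):

```python
import numpy as np
from scipy.stats import norm

def bma_em(fcsts, obs, delta=1e-4, max_iter=500):
    """Fit BMA member weights and a common kernel sigma by EM.
    fcsts: (T, K) bias-corrected member forecasts; obs: (T,) verifying obs.
    Iterates until the log likelihood improves by less than delta."""
    T, K = fcsts.shape
    w = np.full(K, 1.0 / K)                      # initial guess: equal weights
    sigma = np.std(obs[:, None] - fcsts)         # initial kernel width
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of member k's kernel for case t
        dens = w * norm.pdf(obs[:, None], loc=fcsts, scale=sigma)  # (T, K)
        ll = np.sum(np.log(dens.sum(axis=1)))
        if ll - ll_old < delta:                  # convergence criterion delta
            break
        ll_old = ll
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weights are mean responsibilities; sigma^2 is the
        # responsibility-weighted mean-squared error
        w = z.mean(axis=0)
        sigma = np.sqrt(np.sum(z * (obs[:, None] - fcsts) ** 2) / T)
    return w, sigma
```

With a large sample of identically distributed members, the fitted weights stay close to equal; the overfitting discussed below arises when T is small relative to K.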

To illustrate the tendency for the BMA EM to overfit when trained with small sample sizes, consider 4-day 850-hPa temperature ensemble forecasts for a grid point near Montreal, Quebec, Canada. Forecasts were produced and validated for 23 yr × 365 days − 40 days = 8355 cases. Because we would like to assume a priori in this example that the member weights should be equal, the 15-member ensemble was thinned by eliminating the slightly more accurate control member. The remaining 14 bred members can be assumed to have identically distributed (but not independent; see Wang and Bishop 2003) errors and hence should have been assigned equal weights. The BMA algorithm was then trained using these 14 identically distributed bred members and only the prior 40 days of forecasts and analyses, posited in W07 to be an acceptably long training period. We shall refer to this as the "40-day training" dataset. In addition, the BMA algorithm was also trained with a very long training dataset in a cross-validated manner using 22 yr × 91 days of data, with the 91 days centered on the Julian day of the forecast. This will be referred to as the "22-yr training" dataset.

Given a member forecast $x^f_i$, an ensemble mean forecast $\bar{x}^f$, and a regression-corrected ensemble-mean forecast $(a + b\,\bar{x}^f)$, the member forecast was replaced with a forecast that was the sum of the member's perturbation from the ensemble mean and the corrected ensemble mean:

$$x^f_i \leftarrow \left(x^f_i - \bar{x}^f\right) + \left(a + b\,\bar{x}^f\right), \qquad (1)$$

where ← denotes the replacement operation. This modified regression correction was used because, when every member was regressed separately, the members were increasingly regressed toward the training-sample mean of the observations as forecast lead increased and skill decreased. Consequently, the ensemble spread of the adjusted members shrank (Fig. 1; see also Wilks 2006b) and the collinearity of errors among members was accentuated (Fig. 2). These were clearly undesirable properties: the spread should asymptotically approach the climatological spread of the ensemble forecast, and, ideally, member forecasts should have independent errors. Had the regression correction of each member been applied, there might have been some confusion as to whether the highly nonuniform weights subsequently produced by the BMA were a generic property of a short training dataset or whether they were artificially induced by the increased collinearity from the regression analyses.
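The modified correction can be sketched as follows. This is an illustrative reading of the replacement operation with hypothetical function names: the regression is fit to the training ensemble mean only, so each member's perturbation, and hence the ensemble spread, is preserved exactly:

```python
import numpy as np

def correct_members(fcsts, train_fcsts, train_obs):
    """Modified regression correction: regress the training ensemble *mean*
    against the observations, then replace each member by the corrected mean
    plus that member's original perturbation from the ensemble mean.
    fcsts: (K,) current ensemble; train_fcsts: (T, K); train_obs: (T,)."""
    train_mean = train_fcsts.mean(axis=1)
    # Least-squares fit obs = a + b * (ensemble mean) over the training period
    b, a = np.polyfit(train_mean, train_obs, deg=1)
    xbar = fcsts.mean()
    # x_i <- (x_i - xbar) + (a + b * xbar): perturbations are retained,
    # so the spread does not collapse toward the training-sample mean
    return (fcsts - xbar) + (a + b * xbar)
```

Regressing each member separately would instead shrink every member toward the training-sample mean of the observations as skill decreases; this construction avoids that.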

We now consider the properties of the EM algorithm for this application. The initial guess for all member weights was 1/14. For each of the 8355 cases, the ratio of the maximum to the minimum BMA member weight after EM convergence was recorded; the median of these ratios was then plotted as the EM convergence criterion *δ* was varied. For the 40-day training period, when *δ* = 0.01 the largest and smallest weights were much more similar than when *δ* ≪ 0.01 (Fig. 3a). With the 22-yr training data, the weights stayed much more nearly equal as *δ* was decreased (Fig. 3b).

Could the unequal weightings with the 40-day training set and a tight *δ* actually be appropriate? As mentioned in R05, as the EM iterates, the log likelihood *of the fit to the training data* is guaranteed to increase. However, we can also track the fit to the validation data. Figures 4a,b show the average training and validation log likelihoods (per forecast day) for the small and large training data sizes. Notice that for the small sample size, the validation-data log likelihood decreased as the convergence criterion was tightened, a sign that the unequal weights were not realistic. The same effect was barely evident with the large training dataset, where the weights remained nearly equal as the convergence criterion was tightened.^{2} This demonstrates that the highly variable weights with the 40-day training were most likely an artifact of overfitting. Perhaps this was not surprising, given that the EM algorithm was expected to fit 15 parameters here (14 weights plus a standard deviation) with 40 samples. Further, the effective sample size (Wilks 2006a, p. 144) may actually have been smaller than 40; perhaps the assumption of independence of forecast errors in space and time (R05, p. 1159) was badly violated with these ensemble forecasts. Also, we agree with the W07 proposition that the radical differences in weights may in part be a consequence of the collinearity of members' errors in the training data. What is clear here is that this collinearity was not properly estimated from small samples, which led to the inappropriate deweighting and exclusion of information from some members.
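The overfitting diagnostic used here, the average log likelihood per forecast case, is simple to compute for any fitted set of weights and kernel width (a sketch with our own naming; it should be evaluated on both the training sample and independent validation data):

```python
import numpy as np
from scipy.stats import norm

def avg_log_likelihood(fcsts, obs, weights, sigma):
    """Average BMA log likelihood per forecast case.
    fcsts: (T, K) member forecasts; obs: (T,); weights: (K,); sigma: scalar.
    If this rises on the training data but falls on independent data as the
    EM convergence criterion is tightened, the weights are being overfit."""
    dens = weights * norm.pdf(obs[:, None], loc=fcsts, scale=sigma)
    return np.mean(np.log(dens.sum(axis=1)))
```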

When the BMA weights were constrained to be equal and 40-day training was used, the resulting continuous ranked probability skill score [CRPSS; calculated in the manner suggested by Hamill and Juras (2006) to avoid overestimating skill; 0.0 = the skill of climatology, 1.0 = a perfect forecast] was 0.38. When the individual weights were allowed to be estimated by the EM and the convergence criterion was 0.000 03, the resulting CRPSS was smaller, 0.35. When the 22-yr training data were used, the CRPSS was 0.410, regardless of whether the weights were constrained to be equal or allowed to vary.
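For reference, the sample CRPS of a single ensemble forecast can be computed from the energy-score identity E|X − y| − ½E|X − X′| (cf. Hersbach 2000). This sketch, with our own function names, does not implement the climatology stratification of Hamill and Juras (2006) that the skill scores above rely on:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample CRPS for one ensemble forecast and one observation:
    mean |member - obs| minus half the mean pairwise member spread."""
    m = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(m - obs))
    term2 = 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))
    return term1 - term2

def crpss(fcst_crps, ref_crps):
    """Skill score versus a reference (e.g., climatology):
    0.0 = reference skill, 1.0 = perfect forecast."""
    return 1.0 - fcst_crps / ref_crps
```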

Is there a simpler, more robust method for setting the weights? Let the sample variances of the forecast errors of the *n* members, estimated from the training data, be $s_1, \ldots, s_n$. The weights that would have produced the minimum-variance estimate of the mean state [e.g., Daley 1991, p. 36, Eq. (2.2.3)] under assumptions of normality of errors were

$$w_i = \frac{1/s_i}{\sum_{j=1}^{n} 1/s_j}. \qquad (2)$$

An additional advantage of this method for setting the weights was that, if there truly was a strong collinearity of member errors, the BMA pdf should not have been worse as a consequence of using the more equal weights of Eq. (2) rather than the unequal weights from a highly iterated EM. This can be demonstrated simply by considering two highly collinear member forecasts with similar errors and biases, so that $x^f_i \cong x^f_j$. Then the weighted sums are similar regardless of the partitioning of the weights; for example, $w_i x^f_i + w_j x^f_j \cong (w_i + w_j)\,x^f_i$ for any split of the total weight $w_i + w_j$.
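Eq. (2) amounts to normalized inverse-variance weighting, a one-line computation (a sketch with hypothetical naming; the error variances would come from the training sample):

```python
import numpy as np

def min_variance_weights(error_variances):
    """Eq. (2): weights inversely proportional to each member's training
    error variance, normalized to sum to one (the minimum-variance
    combination under assumptions of normality and independence)."""
    inv = 1.0 / np.asarray(error_variances, dtype=float)
    return inv / inv.sum()
```

With identically distributed members the variances are similar, so the weights stay near 1/*n* by construction, in contrast to the highly iterated EM on a small sample.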

## 3. Conclusions

While the BMA technique is theoretically appealing for ensemble forecast calibration, BMA with the EM technique cannot be expected to set realistic weights for each member when a short training dataset is used. Enforcing more nearly equal weights among BMA members [Eq. (2)] may work as well as or better than allowing the EM method to estimate a separate weight for each member.

## REFERENCES

Daley, R., 1991: *Atmospheric Data Analysis.* Cambridge University Press, 457 pp.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977: Maximum likelihood from incomplete data via the EM algorithm. *J. Roy. Stat. Soc.*, **39B**, 1–39.

Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? *Quart. J. Roy. Meteor. Soc.*, **132**, 2905–2923.

Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. *Bull. Amer. Meteor. Soc.*, **87**, 33–46.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. *Bull. Amer. Meteor. Soc.*, **77**, 437–471.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319.

Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158.

Wilks, D. S., 2006a: *Statistical Methods in the Atmospheric Sciences.* 2d ed. Academic Press, 627 pp.

Wilks, D. S., 2006b: Comparison of ensemble-MOS methods in the Lorenz '96 setting. *Meteor. Appl.*, **13**, 243–256.

Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007: Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 1364–1385.

^{1} This is not meant to imply that BMA and the EM method are inappropriate, merely that the methods can be inappropriately applied.