## 1. Introduction

For several decades, statistical methods have been used to postprocess the output of numerical prediction models into forecasts of sensible weather elements such as temperature and precipitation. The main goal of statistical postprocessing has been to interpret the model output variables to produce forecasts that are more accurate than could be produced directly from the model. The statistical postprocessing improves the accuracy by removing biases, and/or by increasing the correlation between forecast and observation. A prime example is model output statistics (MOS; Glahn and Lowry 1972), which for over 30 yr has been used in many centers to improve the output of deterministic operational forecasts. MOS typically uses linear statistical predictive techniques such as regression to relate a historical set of model forecast variables (“predictors”) to surface observations. If the predictor set includes the model estimate of the predictand variable, then MOS explicitly accounts for the bias in the model forecast, or calibrates it to the training sample.

More recently, as changes in operational models became more frequent, it became desirable to develop adaptive statistical methods to postprocess model output. Adaptive procedures use shorter (smaller) samples of recent realizations of the forecast system to estimate and update the parameters of the statistical model. While small sample sizes are sometimes a problem, adaptive procedures respond quickly to changing statistical properties of the training sample, and bring the benefits of improvements to numerical weather prediction (NWP) models into the postprocessed products more quickly. Two examples are an updateable form of MOS (Wilson and Vallée 2002) and the Kalman filter (Simonsen 1991). The latter technique has been tested not only for the correction of model forecasts of surface variables, but also as a way of combining forecasts from different sources (Vallée et al. 1996).

All of the above techniques have been applied to the output of “deterministic” models, which produce one forecast of the surface variables of interest at each location. Since the early 1990s, ensemble systems have been developed and increasingly used for weather prediction at many centers. Ensemble systems take many forms, but they all provide multiple predictions of each forecast variable at each valid time, and the forecasts are generated using one or more deterministic models, in one or more versions, starting from different analyses. Examples include so-called poor man’s systems involving a combination of existing models and analyses from different centers (Ziehmann 2000), systems based on different models and perturbed initial conditions (Pellerin et al. 2003), and single-model systems using perturbed initial conditions (Molteni et al. 1996; Toth and Kalnay 1997). The set of alternative forecast values obtained is usually interpreted as a sample from a probability density function (pdf), which is intended to represent the uncertainty in the forecast for each valid time and location.

Evaluations of ensemble pdfs often reveal that they are not calibrated. They may be biased in the sense that the pdf is not centered on the observation, and/or they are often found to be underdispersive, in the sense that the observations exhibit greater variance than the ensemble, on average (Toth et al. 2001; Pellerin et al. 2003; Buizza 1997; Hamill and Colucci 1997). Furthermore, gridded forecasts from ensembles are valid over large areas rather than at points. Statistical calibration of ensemble forecasts with respect to point observations can thus add some downscaling information to the forecasts.

Calibration of ensembles is also important if they are to be combined with other forecasts or other ensembles. This is because biases of individual ensembles will lead to artificially large ensemble spread when the ensembles are combined, and systematic errors of each ensemble will decrease the accuracy of the combined ensemble. In the recently planned North American Ensemble Forecast System (NAEFS; Toth et al. 2005), the importance of ensemble calibration has been recognized as part of the joint ensemble project. Both the Meteorological Service of Canada and the U.S. National Centers for Environmental Prediction are pursuing and comparatively evaluating ensemble calibration methods.

One rather promising method for calibrating ensemble forecasts is Bayesian model averaging (BMA). Widely applied to the combination of statistical models in the social and health sciences, BMA has recently been applied to the combination of NWP model forecasts in an ensemble context by Raftery et al. (2005). BMA is adaptive in the sense that recent realizations of the forecast system are used as a training sample to carry out the calibration. BMA is also a method of combining forecasts from different sources into a consensus pdf, an ensemble analog to consensus forecasting methods applied to deterministic forecasts from different sources (Vallée et al. 1996; Vislocky and Fritsch 1995; Krishnamurti et al. 1999). BMA naturally applies to ensemble systems made up of sets of discrete models such as the Canadian ensemble system.

Raftery et al. (2005) applied BMA to the University of Washington short-range ensemble, which is a five-member multianalysis, single-model ensemble (Grimit and Mass 2002). BMA was applied to surface temperature and mean sea level pressure forecasts, and the coefficients were fitted using forecasts and observations over a spatial region. They found that BMA was able to correct the underdispersion in both the temperature and pressure forecasts.

In our study we apply BMA to the Canadian ensemble system (Pellerin et al. 2003). This is a 16-member ensemble using perturbed initial conditions and eight different versions of each of two different NWP models. As in Raftery et al. (2005), we apply the BMA to surface temperature forecasts, but we test the method on single-station data, rather than over a spatial area. In this way, we attempt to calibrate the ensemble forecasts with respect to local effects that may not be resolved by the forecast models. The BMA procedure is described in section 2; section 3 gives a brief summary of the Canadian ensemble system; the available data and experimental setup are discussed in section 4; section 5 describes the results; and we conclude with a discussion in section 6.

## 2. Bayesian model averaging

BMA is a way of combining statistical models and at the same time calibrating them using a training dataset. BMA is not a bias-correction procedure; models that are to be combined using BMA should be individually corrected for any bias errors before the BMA is applied. The BMA pdf is a weighted sum of the bias-corrected pdfs of the component models, where all the pdfs are estimated from the forecast error distribution over the training dataset.

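The combined forecast pdf of a variable $y$ is

$$p(y) = \sum_{k=1}^{K} p(y \mid M_k, y^T)\, p(M_k \mid y^T), \tag{1}$$

where $p(y \mid M_k, y^T)$ is the forecast pdf based on model $M_k$ alone, estimated from the training data; $K$ is the number of models being combined, 16 or 18 in the present study; and $p(M_k \mid y^T)$ is the posterior probability of model $M_k$ being correct given the training data. This term is computed with the aid of Bayes’s theorem:

$$p(M_k \mid y^T) = \frac{p(y^T \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(y^T \mid M_l)\, p(M_l)}. \tag{2}$$

The BMA procedure is applied to bias-corrected forecasts, and two bias-correction procedures are considered here. The first is a simple adjustment of each model’s forecasts for the mean error over the training dataset:

$$f_k = y_k - a_k, \tag{3}$$

where $f_k$ is the bias-corrected forecast for model $k$, $y_k$ is the forecast value of the variable from model $k$, and $a_k$ is the mean error for model $k$ over the training dataset. This is referred to in the results section as “b1.” The other bias-correction procedure considered is a simple linear regression fit of the training data using the corresponding model-predicted variable as the single predictor:

$$f_k = a_k + b_k y_k. \tag{4}$$

This is referred to in the results section as “FR.” For the bias-corrected forecasts from the $K$ models, Eq. (1) can be rewritten as

$$p(y \mid f_1, \ldots, f_K, y^T) = \sum_{k=1}^{K} \omega_k\, p_k[y \mid (f_k, y^T)], \tag{5}$$

where $\omega_k = p(M_k \mid y^T)$ is the BMA weight for model $k$, computed from the training dataset, and reflects the relative performance of model $k$ on the training period. The weights $\omega_k$ add up to 1. The conditional probabilities $p_k[y \mid (f_k, y^T)]$ may be interpreted as the conditional pdf of $y$ given $f_k$, given that model $k$ has been chosen (or is the “best” model or member), based on the training data $y^T$. These conditional pdfs are assumed to be normally distributed,

$$y \mid (f_k, y^T) \sim N(f_k, \sigma^2), \tag{6}$$

where $f_k$ is the bias-corrected forecast, with the coefficients $a_k$ and $b_k$ estimated from the bias-correction procedures described above. This means that the BMA predictive distribution becomes a weighted sum of normal distributions, with equal variances, each one centered on the bias-corrected forecast from an ensemble member. A deterministic forecast can also be obtained from the BMA distribution, using the conditional expectation of $y$ given the forecasts:

$$E(y \mid f_1, \ldots, f_K, y^T) = \sum_{k=1}^{K} \omega_k f_k. \tag{7}$$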

This forecast would be expected to be more skilful than either the ensemble mean or any one member, since it has been determined from an ensemble distribution that has had its first and second moments debiased, using recent verification data for all the ensemble members. It is essentially an “intelligent” consensus forecast, weighted by the recent performance results for the component models.
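As an illustration of the predictive distribution described above, the following is a minimal sketch of a BMA pdf built as a weighted sum of equal-variance normal densities, together with the weighted-mean deterministic forecast. The function names (`normal_pdf`, `bma_pdf`, `bma_mean`) and all numerical values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bma_pdf(y, forecasts, weights, sigma):
    """Weighted sum of equal-variance normals, one per bias-corrected forecast."""
    return sum(w * normal_pdf(y, f, sigma) for f, w in zip(forecasts, weights))

def bma_mean(forecasts, weights):
    """Deterministic forecast: weight-averaged bias-corrected forecasts."""
    return sum(w * f for f, w in zip(forecasts, weights))

# Hypothetical values for three members (illustration only, not from the study).
forecasts = [-4.0, 0.5, 1.0]   # bias-corrected forecasts, deg C
weights = [0.6, 0.3, 0.1]      # BMA weights, assumed to sum to 1
sigma = 2.0                    # common standard deviation, deg C

point_forecast = bma_mean(forecasts, weights)
density_at_zero = bma_pdf(0.0, forecasts, weights, sigma)
```

Because the weights sum to 1 and each component is a proper density, the mixture itself integrates to 1, so probabilities for any temperature interval can be read off it directly.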

The BMA weights and the variance $\sigma^2$ are estimated using maximum likelihood (Fisher 1922). The likelihood function is the probability of the training data given the parameters to be estimated, viewed as a function of the parameters, that is, the term $p(y^T \mid M_k)$ in Eq. (2). The $K$ weights and variance are chosen so as to maximize this function (i.e., the parameter values for which the observed data were most likely to have been observed). The algorithm used to calculate the BMA weights and variance is called the expectation maximization (EM) algorithm (Dempster et al. 1977). The method is iterative, and normally converges to a local maximum of the likelihood. For a summary of the method as applied to BMA, the reader is referred to Raftery et al. (2005), and more complete details of the EM algorithm are given in McLachlan and Krishnan (1997). The value of $\sigma^2$ is related to the total pooled error variance over all the models in the training dataset. (A free software package to estimate the BMA parameters called ensemble BMA, written in the freely available statistical language R, is available online at http://cran.us.r-project.org/src/contrib/Descriptions/ensembleBMA.html.)
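The iterative estimation can be sketched as follows. This is a minimal illustration of EM for a mixture of normals with fixed centers (the bias-corrected forecasts) and a common variance, not the authors' code or the R package's implementation. The function name `fit_bma_em`, the parameters `tol` and `max_iter`, the stopping rule based on the largest parameter change, and the synthetic training data are all assumptions made for this example.

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_bma_em(forecasts, obs, tol=1e-4, max_iter=1000):
    """Estimate BMA weights and a common variance by EM.

    forecasts: list of training cases, each a list of K bias-corrected forecasts.
    obs: list of verifying observations, one per training case.
    Returns (weights, sigma2).
    """
    n, k = len(obs), len(forecasts[0])
    weights = [1.0 / k] * k
    # Start from the pooled mean squared error over all members.
    sigma2 = sum((y - f) ** 2 for fs, y in zip(forecasts, obs) for f in fs) / (n * k)
    for _ in range(max_iter):
        sigma = math.sqrt(sigma2)
        # E step: probability that each member is the "best" one for each case.
        z = []
        for fs, y in zip(forecasts, obs):
            dens = [w * normal_pdf(y, f, sigma) for w, f in zip(weights, fs)]
            total = sum(dens)
            z.append([d / total for d in dens] if total > 0.0 else [1.0 / k] * k)
        # M step: update the weights and the common variance.
        new_weights = [sum(row[j] for row in z) / n for j in range(k)]
        new_sigma2 = sum(row[j] * (y - fs[j]) ** 2
                         for row, fs, y in zip(z, forecasts, obs)
                         for j in range(k)) / n
        change = max(max(abs(a - b) for a, b in zip(new_weights, weights)),
                     abs(new_sigma2 - sigma2))
        # Floor the variance to avoid division by zero (a guard added here).
        weights, sigma2 = new_weights, max(new_sigma2, 1e-12)
        if change < tol:
            break
    return weights, sigma2

# Synthetic training data (illustration only): member 0 has small errors,
# members 1 and 2 have larger ones.
obs = [10.0 * math.sin(t / 5.0) for t in range(40)]
forecasts = [[y + 0.5 * math.cos(3.0 * t),
              y + 3.0 * math.sin(7.0 * t + 1.0),
              y + 3.0 * math.cos(11.0 * t + 2.0)]
             for t, y in enumerate(obs)]
weights, sigma2 = fit_bma_em(forecasts, obs)
```

In this sketch a looser `tol` stops the iteration earlier, before the weights have concentrated on a few members, which is one way to picture why the choice of convergence criterion can affect how the weights are spread.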

It is possible to relax the assumption that the conditional distributions for the component models all have constant variance. We carried out tests where the variance was allowed to vary, and was estimated separately for each component model. This meant that, in addition to the 16 or 18 of each of $a_k$, $b_k$, and $\omega_k$, 16 or 18 different values of $\sigma_k^2$ needed to be estimated, instead of a single $\sigma^2$ value. This significantly increases the number of independent BMA parameters that must be supported by the training sample.
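To make the difference concrete, the variance updates can be written out under the assumption that the standard EM updates for a normal mixture with fixed centers are used (the text does not spell these out). Here $z_{kt}$ denotes the estimated probability that member $k$ is the best member for training case $t$, with $t = 1, \ldots, n$; this notation is introduced only for this sketch.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{n} \sum_{t=1}^{n} \sum_{k=1}^{K} z_{kt}\,(y_t - f_{kt})^2
  && \text{(common variance)} \\
\sigma_k^2 &= \frac{\sum_{t=1}^{n} z_{kt}\,(y_t - f_{kt})^2}{\sum_{t=1}^{n} z_{kt}}
  && \text{(one variance per member)}
\end{align*}
```

In the second form, a member that rarely receives much weight has a small denominator, so its variance is effectively estimated from only a few training cases.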

## 3. The Canadian Ensemble System

Operational ensemble forecasts have been issued from the Canadian Meteorological Centre (CMC) since 1996. The ensemble strategy used at CMC has been described in Houtekamer et al. (1996) and Lefaivre et al. (1997). The version of the system that generated the data used in this study is described in Pellerin et al. (2003). The original ensemble consisted of eight members, which were eight variants of a global spectral model (SEF) that was used for operational deterministic forecasts in the late 1980s and early 1990s (Ritchie 1991). In 1999, eight additional members were added, consisting of eight variants of the Global Environmental Multiscale (GEM) model, a high-resolution version of which provides operational deterministic forecasts. The use of 16 different model versions is intended to account for uncertainties in the forecast due to limitations in model formulations. The 16 model versions are initialized with perturbed analyses; the perturbations are random and are produced through the data assimilation cycle in the model. In this way, errors due to the uncertainty in the initial conditions are accounted for. The SEF models have a horizontal resolution of T149 (equivalent to a grid length of about 140 km) and the GEM models have a resolution of 1.2°, about 130 km. The vertical resolution is 23 or 41 levels for the SEF members and 28 levels for the GEM members. The models differ from each other in several other ways, mostly related to the physical parameterization (details are contained in Pellerin et al. 2003).

One of the advantages of the BMA method is that it does not really matter what models are included. As long as sufficiently large training samples are available that are simultaneous in space and time for all models, the BMA can be applied to calibrate the ensemble of forecasts. The BMA weighting procedure will also ensure that the models containing the most accurate predictive information and unique predictive information in light of the full ensemble of models will be assigned high weights in the combined pdf. To test the behavior of the BMA when models from different sources were included in the ensemble, we conducted some experiments with an 18-member ensemble consisting of the original 16 members plus the control forecast and the operational full-resolution global model. The control forecast model is a version of the SEF model where some of the physical parameters that are varied for the eight SEF members are replaced by their averages. The control run uses an unperturbed analysis.

The full-resolution model is the version of the GEM global model that was operational at the time of the BMA experiments. This model has a uniform horizontal grid at 100-km resolution (0.9°) and 28 levels in the vertical, which means that its resolution is only slightly higher than that of the versions of GEM used in the ensemble. The GEM global is described fully by Côté et al. (1998).

## 4. Data and experiment

Data used in the experiment came from 1 yr (28 September 2003–27 September 2004) of surface temperature forecasts from the 16-member ensemble, supplemented with forecasts from the unperturbed control model and the full-resolution global model. All forecasts were interpolated to 21 Canadian stations for which a set of reliable observations was available. All observations are of 2-m temperature, taken at manned observing sites. The stations used in the study are distributed across Canada, and are representative of a variety of climatological regimes (Fig. 1). Forecasts were available at 24-hourly intervals out to 10 days, initialized at 0000 UTC each day of the 1-yr period. The full sample comprises 366 separate BMA analyses for each of 21 stations, which are used to produce a maximum of 7686 forecasts, based on the 7686 updated BMA analyses. In practice, the total sample on which the results are based is a little smaller than that due to missing forecasts or verifying observations.

The experiments were conducted in a recursive mode: the BMA was retrained each day through the 1-yr period, using a training sample period of the *N* previous days, where *N* = 25, 30, 35, 40, 50, 60, or 80. The training was carried out separately for each station and each 24-h forecast projection. Then the BMA coefficients were applied to the next day’s forecast, as an independent case. In this way, one year of independent test data was accumulated to verify the performance of the technique. This kind of recursive tuning is similar in some ways to the Canadian application of updateable model output statistics (UMOS; Wilson and Vallée 2003), except that BMA is retuned on a daily rather than weekly basis, and the training sample is of fixed size in each application. As with UMOS, there is also no direct dependence of any particular run on the previous runs; BMA coefficients were recalculated separately each time. With the addition of one day of new forecast data and the deletion of the oldest case from the training sample, there is a large overlap in the training dataset from one day to the next, especially for the longer training periods. One would expect, therefore, that the coefficients should not change rapidly from day to day.
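The daily retraining scheme can be pictured with a short sketch of the sliding training window. The function name `rolling_windows` and the values used are hypothetical. As a simplification, the sketch ignores the fact that a forecast for a longer projection verifies several days after it is issued, which in practice delays when a case can enter the training sample.

```python
def rolling_windows(n_days, n_train):
    """Yield (train_indices, test_index) pairs for daily retraining.

    Each test day uses the n_train days immediately before it.
    """
    for test_day in range(n_train, n_days):
        yield list(range(test_day - n_train, test_day)), test_day

# Hypothetical sizes for illustration: 100 days of data, 40-day training window.
windows = list(rolling_windows(n_days=100, n_train=40))
first_train, first_test = windows[0]
second_train, second_test = windows[1]

# Consecutive windows share all but one case.
overlap = len(set(first_train) & set(second_train))
```

With a 40-day window, consecutive training samples share 39 of their 40 cases, which is the overlap the text refers to.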

Four different comparative experiments were carried out to try to optimize the parameters of BMA:

1. The length of the training period, *N*.

2. Constant variance over the ensemble members based on the total pooled error variance versus variable variance, estimated from the error distribution for each member. These experiments are referred to as “BMA” and “BMAvar,” respectively.

3. Using a full regression to remove the bias versus additive bias removal only. The full regression involved application of Eq. (4) to each member of the ensemble over the training period, resulting in a set of ensemble forecasts for which the average error of all members is 0. The simple bias removal was accomplished by subtracting the mean error over the training sample from all the ensemble member forecasts [Eq. (3)]. These experiments are referred to as FR and b1, respectively.

4. The effect of adding the control forecast and the full-resolution forecast to the BMA (16 versus 18 members).
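The two bias-removal options compared in the third experiment can be sketched for a single ensemble member as follows. The function names and the small training sample are hypothetical, and ordinary least squares is assumed for the full regression.

```python
def fit_additive_bias(forecasts, obs):
    """b1: mean error (forecast minus observation) over the training sample."""
    return sum(f - y for f, y in zip(forecasts, obs)) / len(obs)

def apply_additive_bias(forecast, mean_error):
    """Subtract the training-sample mean error from a forecast."""
    return forecast - mean_error

def fit_linear_regression(forecasts, obs):
    """FR: least-squares intercept a and slope b for obs ~ a + b * forecast."""
    n = len(obs)
    mean_f = sum(forecasts) / n
    mean_y = sum(obs) / n
    sxx = sum((f - mean_f) ** 2 for f in forecasts)
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(forecasts, obs))
    b = sxy / sxx
    a = mean_y - b * mean_f
    return a, b

def apply_linear_regression(forecast, a, b):
    """Regression-corrected forecast a + b * forecast."""
    return a + b * forecast

# Hypothetical training sample for one member (illustration only), deg C.
train_fcst = [1.0, 3.0, 5.0, 7.0]
train_obs = [0.0, 1.5, 3.5, 5.0]

mean_error = fit_additive_bias(train_fcst, train_obs)
a, b = fit_linear_regression(train_fcst, train_obs)

b1_corrected = [apply_additive_bias(f, mean_error) for f in train_fcst]
fr_corrected = [apply_linear_regression(f, a, b) for f in train_fcst]
```

Both corrections leave a mean error of zero over the training sample. The regression additionally rescales the forecasts through the slope `b`, which shrinks them toward the training mean when the fit is weak.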

In addition to the original unprocessed set of ensemble forecasts, the various permutations lead to a total of 12 sets of processed ensemble forecasts, 6 for each of the 16- and 18-member ensembles: FR, b1, BMA after FR, BMA after b1, BMAvar after FR, and BMAvar after b1. All runs were tested for training periods ranging from 30 to 80 days. The shortest training period, 25 days, was dropped from consideration as soon as it became clear that it was too short.

The various experiments are assessed using three verification measures: the rank histogram (Anderson 1996; Hamill 2001), the continuous ranked probability score (CRPS; Hersbach 2000), and the continuous ranked probability skill score (CRPSS), a skill score with respect to climatology based on the CRPS. The first is used to check whether the ensemble spread is representative of the spread in the observations, and thus gives a good indication of whether the second moment of the ensemble distribution has been calibrated by BMA. The CRPS measures the distance between the forecast cumulative distribution function (cdf) and the observation expressed as a cdf, and thus measures the accuracy of the ensemble distribution. For deterministic forecasts, the CRPS is equal to the mean absolute error. The skill score measures the percentage improvement of the ensemble distribution compared to a forecast of the long-term climatological distribution for the station and day of the year. These three measures are described more fully in the appendix.
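A minimal sketch of the CRPS for a single forecast case is given below, assuming a discrete, equally weighted ensemble and using the identity CRPS = E|X − y| − ½E|X − X′| for the empirical cdf. The BMA forecasts in the study are continuous mixtures, so this illustrates the measure rather than the exact computation used. The function names and values are hypothetical.

```python
def crps_ensemble(members, obs):
    """CRPS of an equally weighted ensemble for one observation.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'| for the empirical cdf.
    """
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (2.0 * m * m)
    return term1 - term2

def crpss(crps_forecast, crps_reference):
    """Skill score relative to a reference (e.g., climatology); 1 is perfect."""
    return 1.0 - crps_forecast / crps_reference

# A single deterministic forecast: the CRPS reduces to the absolute error.
single = crps_ensemble([2.0], 5.0)

# Hypothetical two-member ensemble bracketing the observation.
pair = crps_ensemble([0.0, 2.0], 1.0)

# Hypothetical forecast and climatological CRPS values, deg C.
skill = crpss(1.5, 3.0)
```

Multiplying the skill score by 100 gives the percentage improvement over the reference forecast.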

## 5. Results

### a. BMA weights

Figure 2 shows the time series of weights for one station (Montreal) for 24-, 120-, and 216-h projections, using a training period of 40 days. All 16 ensemble models are shown. This figure illustrates how the coefficient selection changes from day to day. There are several aspects to note here. First, the day-to-day changes in the coefficients can be quite large, especially for the shortest-range forecasts. This is perhaps surprising, given that the training sample changes by only 1 case in 40 from one day to the next. Second, different models are favored at different projections at the same time. This was true even for adjacent 24-h projections. Third, there is a tendency for the individual models to go in and out of “favor” for periods of as much as a month or more. Fourth, coefficients sometimes but rarely are close to 1, which means essentially that only one model was considered useful by BMA in that case. And finally, especially at the early projections, some models are rarely selected at all, for example SEF1 and 8 at 24 h and SEF7 at 120 h. We found that the length of the training period had relatively little effect on the coefficients. One can also note that sometimes model weights change pairwise. For example, at 216 h, SEF6 takes over from SEF5 for a couple of weeks in the winter. And, at 24 h, GEM 13 hands over to GEM 12 over a period of only a few days in the summer.

Although some models were practically never chosen through the whole year for some stations and projections, it turns out that all models were approximately equally important when the coefficients are averaged over all stations for the full year. The top row of Fig. 3 shows this for a training period of 40 days, for BMA carried out after FR bias removal, for 24-, 120-, and 216-h projections. At the short and perhaps medium forecast range, there is a slight tendency for the coefficients to favor the GEM model components of the ensemble on average, models 9–16. At the longest range, 216 h, average coefficient values are distributed nearly equally across all ensemble members.

Although there is not much variation in the yearly average coefficient values among the ensemble members, stratification by season reveals greater differences. Figure 3 illustrates this also, showing winter (December–February) and summer (June–August) average coefficients in the second and third rows. The eight GEM model members are strongly favored in summer at the shortest ranges, but not so much in winter. At the medium range, there is not a strong preference shown for either model, but individual members are preferred in different seasons. For example, SEF member 8 carries the highest average weight in winter, but GEM member 16 dominates in summer. At the longer ranges, there is not a clear seasonal signal. This is not surprising, since none of the individual forecast models are believed to have significant predictive skill at the longer-range projections, although there may still be useful forecast information in the distribution obtained from the ensemble. As the predictive accuracy decreases, one would expect the coefficients and model selection in each case to approach a random selection from a uniform distribution, and the histogram should therefore tend toward a flat distribution. Noticeable departures from uniformity, for example members 2 and 4 at 216 h in the summer, would warrant further investigation. Although based on averages over relatively large samples (21 stations for 90-day periods, sample sizes in the range of 1800 cases), these results may not be significant; further testing on additional seasons would be of value to see whether there is a tendency for the preferred models to change according to seasonal average flow regimes for example.

These results are potentially useful diagnostics for model performance. It would be interesting to investigate whether there is any relationship between the characteristics of the models favored with relatively high coefficients and specific types of synoptic flow regimes. It might also be informative to relate variations in the seasonal average coefficients of individual models to the strengths and weaknesses of their formulations vis-à-vis seasonal differences in average atmospheric structure. It should be noted, however, that the weights are not directly related to the performance of the individual models; each model is evaluated in light of the predictive information from all models. Thus, if two models perform similarly during the training period, but neither adds much predictive skill with respect to the other, BMA will tend to select the better one and practically ignore the other. This occurs, for example, when two forecasts are very highly correlated during the training period. The magnitude of the coefficients relates both to the accuracy of the model and its ability to bring unique predictive information to the ensemble.

Figure 4 shows an example of how the BMA works on a single forecast. In this case, a full regression has been used prior to the BMA, and the error variance was pooled over all models and applied to each in the averaged pdf. The BMA pdf is shown by the heavy line. This is an example where the forecast predicted cooler temperatures than long-term climatology (dash–dot line), and the regression has increased the temperatures. The component model pdfs, shown by the lighter Gaussian curves, tend to spread the distribution somewhat: one of the models near the cold tail of the distribution carries the highest weight, while two others, with lower weights, predict near the climatological mean. Of the 16 models offered to BMA, only 3 carry any significant weight in this case. This is consistent with the coefficient trends shown above, and was typical of all the results for all stations.

The tendency of the BMA to assign significant weights to only a few of the models might seem surprising. In this way it objectively mimics the “model of the day” approach used by forecasters, where the model judged to give the best solution on a particular day is chosen, perhaps modified, and all other solutions are rejected. It is possible that nonzero coefficients would be spread out over more members with larger sample sizes, especially if the samples were kept seasonally homogeneous. A comparison of the coefficient statistics for 40- and 80-day training samples gave no evidence of this, however, and the much larger sample sizes used in Raftery et al. (2005) also produced negligibly small coefficients for some models. A more likely explanation of this outcome is that there is considerable collinearity in the training samples, which means that relatively few of the models are needed to explain most of the explainable variance in the observations.

To further investigate this issue, histograms of the coefficient values were examined. Of the approximately 120 000 weights for the 16-member, 21-station 1-yr dataset, as many as 75% took on small values, and up to 500 or so took on relatively large values near one. There was a tendency for the number of very small coefficients and the number of very large coefficient values to decrease with longer projections, indicating spreading of the weights over more of the models at these projections. This is consistent with expectation since, at longer projections, the accuracy of all the models is low, and the best model for a particular training period moves closer to a random selection from all the models. Even at 10 days, though, a significant majority of the weights took on small values, reflecting collinearity in the forecasts as discussed above. The tendency toward significant numbers of very large weights is of more concern, especially at longer projections, for it indicates that there could be overfitting in the results. Accordingly, we tested this by rerunning the BMA analysis on the full dataset with a considerably relaxed convergence criterion for the EM algorithm: differences between successive iterations of $10^{-2}$ instead of $10^{-4}$. This did indeed significantly reduce the number of high-valued weights and spread the weights more evenly over more models. These results also slightly improved performance on the independent dataset, which is consistent with overfitting in the original results. However, the effects were not significant, improving the CRPS only by about 0.05° on average, and a significant majority of the weights remained near zero. Furthermore, the relaxed criterion produced some undesirable effects, such as decreasing the spreading of the weights at longer projections, which is counterintuitive. We therefore report the original results in the following sections, and note that further exploration is needed to optimize the convergence criterion in the BMA for smaller samples.

### b. Rank histogram and CRPS results

The rank histogram is a way of assessing the ability of the BMA to correct the tendency of the ensemble distribution toward underdispersion. Rank histograms were computed over all the independent forecasts, for the original ensemble, the bias-corrected ensemble (both FR and b1), BMA, and BMAvar. BMA and BMAvar equations were run on both versions of the bias-corrected forecasts.

Figure 5 shows a grid of rank histograms, with the rows representing the stages of the calibration, from the original ensembles at the top to BMAvar on the bottom. The columns are for the different projections, each 2 days from 1 to 9 days left to right. These results were obtained with a training period of 40 days and bias removal using FR. The original ensembles show the underdispersion that is characteristic of ensemble forecasts, especially for surface weather variables. This underdispersion decreases with increasing projection time, which reflects the tendency for the ensemble spread to approach the spread of the climatological distribution at longer projections. The original ensembles also show a cold bias on average; the observation more often occurs in the highest (warmest) bin than in the lowest (coldest) bin of the ensemble distribution. The second row shows that the FR reduces the bias, especially at the early projections, but also increases the underdispersion, especially at the longer projections. This is characteristic of regression; as the quality of the fit decreases for longer projections, the regression predicts more toward the mean of the predictand, reducing the predicted variance in the forecasts.

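The departure of a rank histogram from flatness can be summarized by its root-mean-square difference (RMSD) from a flat histogram:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N+1} \sum_{k=1}^{N+1} \left( s_k - \frac{M}{N+1} \right)^2},$$

where $M$ is the total sample size on which the rank histogram is computed, $N$ is the number of ensemble members, and $s_k$ is the number of occurrences in the $k$th interval of the histogram. Candille and Talagrand (2005) point out that, due to the finite sample size, the expected value of the sum is $MN/(N+1)$. This means the expected value of the RMSD is $\sqrt{MN/(N+1)^2}$.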
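The rank histogram and its departure from flatness can be sketched as follows. The sketch assumes the RMSD is the root-mean-square difference between the bin counts and the flat-histogram count M/(N + 1), and it ignores ties between members and observations. The function names and the toy ensembles are hypothetical.

```python
import math

def rank_of_observation(members, obs):
    """Bin index (0..N): the number of members below the observation."""
    return sum(1 for x in members if x < obs)

def rank_histogram(ensembles, observations):
    """Counts of observation ranks over all cases; N + 1 bins for N members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        counts[rank_of_observation(members, obs)] += 1
    return counts

def rmsd_from_flatness(counts):
    """Root-mean-square difference of the bin counts from a flat histogram."""
    n_bins = len(counts)        # N + 1
    total = sum(counts)         # M
    flat = total / n_bins
    return math.sqrt(sum((s - flat) ** 2 for s in counts) / n_bins)

def expected_rmsd_flat(total, n_members):
    """Value expected from sampling alone for a calibrated ensemble."""
    return math.sqrt(total * n_members / (n_members + 1) ** 2)

# Toy underdispersive case (illustration only): the observation always
# falls outside a two-member ensemble.
ensembles = [[1.0, 2.0]] * 4
observations = [0.5, 0.5, 2.5, 2.5]
counts = rank_histogram(ensembles, observations)
rmsd = rmsd_from_flatness(counts)

# Sampling-only reference for 7686 cases and 16 members.
reference = expected_rmsd_flat(7686, 16)
```

An RMSD well above the sampling-only reference indicates a histogram that is not flat, whereas values near the reference are consistent with a calibrated ensemble.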

Figure 6 shows these results for the original ensemble, the bias-corrected ensemble, BMA, and BMAvar, using both the full regression (FR) and the additive bias correction only (b1). The figure confirms that BMAvar is not quite as well calibrated as the BMA. This may be due to the additional parameters that must be estimated for the BMAvar, 16 variances instead of one. Training samples of 40 days might not be long enough to provide stable estimates of these additional parameters. BMAvar performance is also slightly worse than that of BMA for the b1 equations. Using only an additive bias removal does improve the BMA results overall in comparison to the FR. This is likely due to seasonality problems in the fitting of the full regression, an issue explored more fully below.

The figure also shows that the departure from flatness of the rank histogram is greater after the FR than it is for the original ensemble after day 2 of the forecast. This is due to the tendency of the full regression to decrease the ensemble spread for the longer projections. The b1 results show a fairly small, but consistent improvement in the rank histogram over all projections compared to the original ensemble. This improvement can be attributed mainly to the tendency of the bias correction to “equalize” the frequency of occurrence of the observation in the two extreme bins of the histogram.

Figure 7 shows a comparison of the RMS departure from flatness of rank histograms for b1 equations, for 40- and 60-day training periods. The curves for the 40-day training period are the same as in Fig. 6, to facilitate comparison. As would be expected, the bias correction itself is not quite as effective for 60-day training periods as for 40 days, especially after day 3. The extra 20 cases included in the 60-day training periods occur earlier in time compared to the independent case on which the forecasts are tested. Thus, the training sample as a whole is likely to be less representative of the independent data than for shorter training periods. This effect is most pronounced for the longer forecast projections. While this problem might be avoided by using cross validation for the development and testing, this is not possible in operations, where the latest updated equations must be used to prepare the next forecast.

Despite a slightly poorer bias correction, the longer training period does result in a small improvement in calibration of the BMA-corrected forecasts. The longer training period presumably provides a more stable estimate of the weights using larger samples, which translates to slightly less noise in the rank histogram distribution. For BMAvar, where more coefficients must be estimated, the difference with the larger sample is slightly more pronounced, but there is still more variation from flatness in the BMAvar than in the BMA results. In numerical terms, BMAvar with a 60-day training sample is about as well calibrated as BMA with a 40-day training sample. In summary, these last two figures suggest that longer training periods with simple bias removal (b1) and BMA with constant variance provide the most reliable correction of the underdispersion in the ensemble temperature forecasts.

We now turn to the CRPS results for further exploration of the different options for BMA analysis. The CRPS summarizes the accuracy of the ensemble distribution, by determining the integrated distance of the cdf from the observation represented as a cdf. The CRPS has the same dimension as the variable in question, temperature, and is understandable as the equivalent of mean absolute error for a single deterministic forecast.

Figure 8 corresponds to Fig. 6, except it shows the CRPS results for a 40-day training period rather than the RMS departure of the rank histogram from flatness. First of all, the FR bias removal improves on the original ensemble forecast only for the first 3 days of the forecast. By reducing the dispersion in the ensemble as forecast projection time increases, the FR, which is essentially model output statistics with one predictor, produces forecast distributions that are too sharp (overconfident) in the face of decreasing accuracy of the individual members. This leads to an increased frequency of situations where the bulk of the probability density is far from the observation location.

Second, the b1 bias correction leads to a reduction (improvement) in the CRPS of about 0.2°, with slightly larger values at the short forecast ranges. The BMA improves the CRPS by another 0.2°. It is interesting that the BMA result for the FR is nearly indistinguishable from the result for the b1 bias correction. Essentially, the BMA has “made up” for the effects of the decrease in dispersion at the longer forecast ranges, which suggests that, at least for a 40-day training period, either FR or b1 could be used to correct bias, as long as a BMA analysis follows. And finally, the BMAvar results are similar to the BMA results.

Figure 9 shows the yearly summary results for a longer training period of 80 days. The levels of error indicated differ very little from the corresponding results for 40 days, though there is a slight tendency for these results to be worse overall. The only noticeable difference now is that, for medium and long ranges, there is a tendency for the b1 bias removal to perform better than the FR version. This tendency was traced to the failure of the FR to fully remove the bias with respect to the independent data, as shown below.

To explore the question of the best training period using the available data, Fig. 10 shows the CRPS for BMA for independent data as a function of the length of the training period, for both FR and b1. In general, a minimum in the CRPS (best performance) occurs around 35–50 days, depending on the projection. However, all the curves are rather flat, suggesting that the dependence of the performance on the length of the training period is rather weak within the range tested here. There is ample evidence, however, that 25 days is too short. For the shortest forecast ranges, there is little difference between FR and b1. At medium and longer ranges, the minimum CRPS occurs for shorter training periods in the FR than for the b1 version. This minimum in the FR results may be artificial in some sense: the relatively rapid rise in the CRPS for longer training periods and longer forecast projections for the FR is caused by its failure to remove the bias in the spring and autumn. Rank histograms for 216 h (Fig. 11) show a cold bias in the spring and warm bias in the fall, while the corresponding results for b1 are relatively bias free. In the figure, the cold bias is indicated by a large number of occurrences of the observation in the 17th bin, which means all the ensemble members are forecasting lower temperatures than observed. Conversely, a warm bias is indicated when the observed temperature most frequently occurs in the lowest bin of the histogram. When the extreme bins of the histogram are approximately equally populated, this indicates an unbiased ensemble. When presented with biased forecasts following the regression step, the BMA improves the spread in the forecasts, as before, but they remain biased (Fig. 12). The FR may fail to remove the bias in spring and fall because the slope coefficient attempts to fit the seasonal trend in the data.
The independent case, which for an 8-, 9-, or 10-day forecast will verify at least 8 days after the end of the training period, may be seen as an outlier with respect to the training sample. This effect would be greatest for longer training periods and longer forecast projections, which is consistent with the results. Another factor may be the tendency for overlap of spring and autumn training periods with the previous seasons, which would lead to underfitting of the seasonal trend. Again, this would be most pronounced for longer training periods and longer forecast projections. Since the biases in the spring and fall results are in the opposite sense, cold in the spring and warm in the fall, they tend to partially cancel each other when the rank histogram is accumulated over the whole year. This is an example of a limitation of the rank histogram, which was pointed out by Hamill (2001).
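The contrast between the two bias removal options can be illustrated with a minimal sketch. It assumes that b1 subtracts the mean forecast error over the training window and that FR fits an intercept and slope by least squares with the raw forecast as the single predictor. The function names and the five-case training sample are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from statistics import mean

def fit_additive_bias(forecasts, observations):
    """b1-style correction: mean forecast error over the training window."""
    return mean(f - o for f, o in zip(forecasts, observations))

def fit_linear_regression(forecasts, observations):
    """FR-style correction: least-squares fit of obs = a + b * forecast."""
    f_bar, o_bar = mean(forecasts), mean(observations)
    sxx = sum((f - f_bar) ** 2 for f in forecasts)
    sxy = sum((f - f_bar) * (o - o_bar) for f, o in zip(forecasts, observations))
    b = sxy / sxx
    a = o_bar - b * f_bar
    return a, b

# Hypothetical training window containing a warming trend (values in degrees).
train_fcst = [2.0, 4.0, 6.0, 8.0, 10.0]
train_obs = [1.5, 3.0, 4.5, 6.0, 7.5]

bias = fit_additive_bias(train_fcst, train_obs)    # 1.5
a, b = fit_linear_regression(train_fcst, train_obs)  # a = 0.0, b = 0.75

# Independent case that lies outside the range of the training forecasts.
new_fcst = 14.0
b1_corrected = new_fcst - bias       # additive shift only
fr_corrected = a + b * new_fcst      # slope extrapolated beyond the training range
```

In this toy sample the fitted slope is below one, so any two member forecasts corrected with the same coefficients end up closer together than they started. That behavior is in line with the spread reduction described for the FR, although the sample here is invented and says nothing about the size of the effect in the real data.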

In summary, these results indicate that a training period of about 40 days is a good choice for general use. BMA does a good job of correcting the underdispersion for either a full-regression bias removal or an additive bias correction with training periods of this length. Shorter training periods are too short for reliable estimates of the weights, while for longer training periods, best results are obtainable only if the simpler b1 bias correction is used, because of difficulties fitting seasonal variations in the training sample. Allowing the variance to vary among the component models does not improve the results.

### c. 16- versus 18-member ensembles

One advantage of BMA is that it is a simple matter to include any available models in the analysis, and the BMA procedure will combine and calibrate forecast distributions based on all available models. In adding the unperturbed control and the full-resolution global model to the analysis, we expected that the additional predictive information would improve the forecast output. In general this was true, but the improvements were relatively modest.

Figure 13 shows the summary CRPS results for the full year independent dataset for the 18-member ensemble, for both the FR and b1 bias corrections, plotted as a function of both the training sample length and the projection. The differences shown with respect to the 16-member ensemble are positive everywhere except for the shortest training period and longest forecast ranges. The impact of the additional members reaches a maximum of nearly 0.1° for 96-h forecasts, then drops off for the longer-range projections. To put that in perspective, the BMA typically improves the CRPS by about 0.2° with respect to the bias removal only (see Fig. 8), which means the maximum achieved improvement is about 50% of the basic BMA improvement. For forecasts of 8 days or more it is safe to say that there is no meaningful advantage to the additional ensemble members for any training period; neither the full-resolution model nor the control forecast adds any predictive information. The patterns for the two forms of bias removal are very similar.

Additional diagnostic information on the relative predictive ability of the different models is provided by looking at the average weights for the 18-member ensemble BMA. Figure 14 shows another grid of histograms with the columns representing the short-, medium-, and longer-range projections. The top row is for the full year, and the other four rows are for the four 3-month seasons. In all cases, the control forecast is the first (leftmost) and the full-resolution global model is the 18th (rightmost) member. This grid of charts makes several points. First, at the shortest forecast range, the full-resolution model is by far the most important member of the ensemble. This is not surprising, indicating the advantage of the higher resolution at short range, even though the model resolution is only slightly finer (100 km) than the 130 km of the GEM members of the ensemble. The relative importance of the global model can be seen in all seasons except summer, where all the GEM members are relatively important. Second, the relative importance of the global model decreases with increasing projection; by 216 h, it is “just one of the members.” This is consistent with the CRPS results in Fig. 13. Third, the control model carries, along with the GEM model, relative importance at the medium range. This is interesting, suggesting that the perturbations may have a tendency to degrade the forecast relative to an unperturbed model of the same resolution. The relative importance of the control model is usually less than the full-resolution model, and it too becomes “one of the pack” by 216 h.

Overall, these results do give the impression that both the control model and the full-resolution model bring predictive information that is unique compared to the ensemble members. This is potentially significant, for it suggests that there may be an advantage to running BMA on ensembles composed of models from different sources.

### d. CRPSS results

All of the results discussed so far relate to the comparison of the accuracy of the independent sample forecasts for different configurations of the calibration. The CRPSS shows the skill of the forecasts against an unskilled standard forecast, that is, the station-specific daily climatological temperature distribution. The climatology is derived from 30 or more years of observations for each station and each day of the year. The skill score expresses the difference between the CRPS achieved by climatology and the CRPS achieved by the forecast, normalized by the seasonal average climatological score. Details of the score computation are given in the appendix.
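As a small illustration of the skill score described above, the following sketch normalizes the difference between the average climatological CRPS and the average forecast CRPS by the average climatological CRPS. The function name and the sample values are hypothetical, and the averaging over a single list of cases is a simplification of the station and season stratification described in the appendix.

```python
from statistics import mean

def crpss(crps_forecast, crps_climatology):
    """Skill of the forecast CRPS relative to the climatological CRPS.

    Both arguments are lists of per-case scores for the same cases.
    Positive values mean the forecast beats climatology; 0 means no skill.
    """
    reference = mean(crps_climatology)
    return (reference - mean(crps_forecast)) / reference

# Hypothetical per-case scores in degrees.
forecast_scores = [1.0, 2.0, 1.5]
climatology_scores = [2.0, 2.5, 1.5]
skill = crpss(forecast_scores, climatology_scores)  # 0.25
```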

Figure 15 summarizes the skill characteristics of the original, regression and BMA-calibrated forecasts, over the 1-yr independent sample. Original and BMA-calibrated results are also shown for the 18-member ensembles.

The skill curves for the original, bias-corrected, and BMA forecasts for 16-member ensembles are essentially mirror images of the corresponding CRPS curves in Fig. 8. This is expected, for the underlying climatology is the same for all these curves; it is only the CRPS values that change. The original forecasts show positive skill to about day 8 while the FR bias correction moves the 0 skill point back to about 156 h. The simple additive bias removal is sufficient to extend the positive skill to the end of the forecast period and the BMA improves on the skill of the bias-corrected forecasts.

Regarding the results for the 16- versus 18-member ensembles, there is essentially no difference in the skill of the original ensemble forecasts. However, the BMA-calibrated forecasts show slightly higher skill for the 18-member ensembles than for the 16-member ensembles, out to day 7. This is consistent with Fig. 14, where the full-resolution model is favored with higher weights for the short- and medium-range forecasts.

### e. Comparison of BMA with a simple Gaussian pdf

Finally, we try to answer the question of whether it is worth going to the trouble of assigning different weights to the different members of the ensemble. Perhaps a simple pdf estimated using the errors in the training sample would perform as well as the BMA. To test this, a Gaussian pdf was created for each forecast, using the bias-corrected ensemble mean as the mean and the root-mean-squared error of the ensemble mean based on the training period as the standard deviation. The Gaussian pdfs were then evaluated on the independent data using the CRPS in exactly the same way as the BMA forecasts were evaluated.
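The Gaussian benchmark described above can be sketched in a few lines. The CRPS of a normal distribution has a well-known closed form, which is used here in place of numerical integration; whether the original evaluation used the closed form or a numerical integral is not stated, so this choice is an assumption. The function names are hypothetical.

```python
import math

def rmse(ensemble_means, observations):
    """Root-mean-squared error of the ensemble mean over the training period."""
    sq = [(m - o) ** 2 for m, o in zip(ensemble_means, observations)]
    return math.sqrt(sum(sq) / len(sq))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for one observation."""
    z = (obs - mu) / sigma
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0)
                    + 2.0 * normal_pdf(z)
                    - 1.0 / math.sqrt(math.pi))

# Hypothetical training sample: bias-corrected ensemble means and observations.
sigma_train = rmse([1.0, 2.0], [2.0, 4.0])
# Score for an independent case whose bias-corrected ensemble mean is 3.0.
score = crps_gaussian(mu=3.0, sigma=sigma_train, obs=4.0)
```

As the standard deviation shrinks toward zero, the score approaches the absolute error of the mean, which matches the statement that the CRPS reduces to the mean absolute error for a deterministic forecast.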

The results of this evaluation are presented in Figs. 16 and 17. Figure 16 shows the overall CRPS results for the 16- and 18-member ensembles, and the corresponding Gaussian pdfs, all based on the independent data. There is little difference in performance between the BMA and the Gaussian overall; the Gaussian proves to be a good competitor. Without the ability to assign higher weights to the better performing ensemble members, the Gaussian results are nearly identical for 16 and 18 members. On the other hand, the BMA is able to take advantage of the better performance of the full-resolution model in the first 5 days to give it a slight performance edge over the Gaussian and the 16-member ensemble BMA. This suggests that, in multimodel situations where there are large differences between the models, the ability to weight the individual members becomes more important.

Figure 16 also shows the mean absolute error (MAE) of the ensemble mean, which is much higher than the CRPS for either the BMA or the Gaussian. Since the MAE is consistent with the CRPS in the sense that the CRPS reduces to the MAE for a deterministic forecast, this effectively shows the error levels that would be obtained if the spread of the ensemble were to be reduced to zero.

The true advantage of the BMA over the Gaussian in the present results is revealed in Fig. 17, which shows the average 90% prediction interval for Gaussian and BMA pdfs, for both 16- and 18-member ensembles. There is little difference between the 16- and 18-member ensemble results, but the BMA has significantly reduced the 90% prediction interval compared to the Gaussian, by 20%–25% over the whole 10-day prediction range. This is significant if the forecast pdfs are to be used for credible interval temperature forecasting, for example. A 25% more precise interval is important when it can be achieved without loss of accuracy (Fig. 16) and with ensembles that are not markedly underdispersed (see Fig. 5).
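A central 90% prediction interval for a BMA pdf can be obtained by inverting the mixture cdf. The sketch below assumes a mixture of normal components with a common standard deviation, and it finds the 5% and 95% quantiles by bisection. The function names, the bracketing rule of ten standard deviations, and the tolerance are illustrative choices rather than details taken from the study.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, means, weights, sigma):
    """cdf of a weighted mixture of normals with common standard deviation."""
    return sum(w * normal_cdf(x, m, sigma) for m, w in zip(means, weights))

def mixture_quantile(p, means, weights, sigma, tol=1e-8):
    """Invert the mixture cdf by bisection (the cdf is monotone increasing)."""
    lo = min(means) - 10.0 * sigma
    hi = max(means) + 10.0 * sigma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, means, weights, sigma) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def central_interval(means, weights, sigma, coverage=0.90):
    """Lower and upper bounds of the central prediction interval."""
    tail = 0.5 * (1.0 - coverage)
    return (mixture_quantile(tail, means, weights, sigma),
            mixture_quantile(1.0 - tail, means, weights, sigma))

# A single-component "mixture" reproduces the Gaussian interval of +/-1.645 sigma.
gauss_lo, gauss_hi = central_interval([0.0], [1.0], 1.0)
# A hypothetical two-member mixture with equal weights.
mix_lo, mix_hi = central_interval([-1.0, 1.0], [0.5, 0.5], 1.0)
```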

## 6. Discussion

This paper describes results of experiments with Bayesian model averaging as a tool for the calibration of ensemble forecasts from the operational Canadian ensemble forecast system. The experiment was set up in such a way that it could be run in real-time operations; the BMA was trained on recent realizations of the forecast, then applied to the next forecast. The BMA was applied to bias-corrected forecasts only, and two methods were tested for bias removal prior to the BMA analysis. Several different training sample sizes were tested, ranging from 25 to 80 days. The benefits of allowing the variance of the component models to vary, and the effect of adding the control forecast and the full-resolution model forecast to the ensemble were also evaluated. And finally, the fundamental question of whether the BMA improves upon a very simple Gaussian pdf with equal weights was examined.

The main results of the study can be summarized by the following statements:

The BMA faithfully removed most, but not all, of the underdispersion exhibited by the original ensemble. It proved to be capable of this even for the longest-range forecasts, where the underdispersion had been increased by the regression-based bias removal procedure. The remaining underdispersion seems to be systematic but small and may be related to differences between the training sample and the independent data on which the BMA was evaluated.

On the basis of the 1 yr of data available, the results suggest that training periods of 40 days give the best results, though we found that forecast accuracy on the independent data did not vary strongly with the training period. However, 25 days seemed to be clearly too short a period on which to fit the BMA weights, and there was some evidence that longer training periods with only an additive bias removal gave a more stable BMA analysis.

Bias removal using simple linear regression seemed to work about as well as bias removal by correcting only for the mean error up until about 7 days, and for training periods up to about 50 days. For longer training periods and longer projections, performance for regression-based bias removal was poorer than for additive bias removal apparently because the bias was not successfully removed by the regression for the spring and fall seasons.

Allowing the variance to vary among the component models in the BMA did not improve the performance; in fact, there was a tendency to slightly degrade the performance on the independent data. This probably means that the training samples were not large enough to obtain reliable estimates of the additional coefficients.

Addition of the control forecast and the full-resolution model to the ensemble improved the results modestly, up until about 7 days. Beyond that, the additional models do not have enough skill to contribute to the accuracy of the calibrated forecasts.

Examination of the BMA weights is extremely useful in a diagnostic sense, to identify models that contribute less to the ensemble than other models, and to identify synoptic situations that are handled particularly well or poorly by the individual model formulations. The fact that the full-resolution model carried higher weight on average for shorter-range forecasts indicates that the extra resolution is beneficial, while the relatively high importance of the control model suggests that, at least for some seasons and projections, the perturbations increase the error levels of the individual models.

The BMA achieved approximately equal accuracy in terms of the CRPS overall, but produced significantly sharper pdfs when compared to a simple Gaussian pdf with standard deviation equal to the RMSE of the ensemble mean based on the same training period. The 90% prediction interval was reduced by more than 20% over the 10-day forecast range in the independent sample. These results also suggest that the ability to weight the component models of an ensemble is more important when there are significant systematic differences in the structure of the ensemble members.

These results have dealt with surface temperature only, for which the assumption of a normal error distribution for each model is reasonable. For other variables such as precipitation and wind, the gamma distribution might be more appropriate, but there are problems to work out such as how to handle calm winds and 0 precipitation events in the combined distribution. This is the next major step of our work in applying BMA to Canadian ensemble prediction.

One might also speculate whether a system that assigns different weights to the events of the training sample would enhance the performance. On the assumption that more recent realizations of the forecast are better indicators of the current performance than earlier realizations, perhaps a weighting scheme could be added to the BMA weight calculation. Such a scheme might also improve the fitting of the seasonal trends.

Another enhancement one might consider is the use of a full screening multipredictor MOS fit to remove the bias. In that case, seasonal variations in the training sample could be accounted for by the addition of specific predictors, and/or a weighting scheme could be used.

Both the bias removal and the BMA steps might also benefit from the use of longer training periods of data from the ensemble models, such as would be available from reforecasting projects (e.g., Hamill et al. 2004). If sufficient data were available, one could stratify the training sample by season, using the corresponding period from more than 1 yr. Considering the evidence presented here that seasonal variations within the training dataset led to poorer performance on independent data, the use of multiyear, seasonally stratified training datasets might produce higher-quality predictions than were possible in this study. The disadvantage of such an approach is that any significant change to the ensemble model would mean that the reforecasting would have to be redone to produce a new statistically representative training sample. Or, perhaps an updating system could be devised that would ensure a smooth transition from calibration with respect to an old system to calibration with respect to a new system.

When BMA is applied to small samples, as in this study, care must be taken to set the parameters of the EM algorithm to avoid overfitting, while at the same time integrating the algorithm far enough to achieve meaningful results. Further work is needed on this issue.

In this study, we applied BMA in a way that is consistent with previous applications, that is, to combine models that are distinct in order to generate a consensus pdf that is calibrated. The question arises how BMA might be applied to other operational ensemble forecasts that do not consist of separate distinct models. BMA could also be used for a single-model ensemble system where only the initial conditions are perturbed. In that case, error distributions over a training sample would be expected to behave as random draws from the same distribution, since it is the same model that is run each day. The weights $\omega_k$ in (5) and (7) would then be constrained to be equal, so that $\omega_k = 1/K$ for each $k$, where $K$ is the number of ensemble members. The EM algorithm can easily be modified for this case.
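One way such a modification might look is sketched below. It assumes normal component pdfs with a single common standard deviation, so that with the weights fixed at $1/K$ the only parameter left for the EM iteration is that standard deviation. The function name, the starting value, the fixed iteration count, and the small floor on the standard deviation are all illustrative assumptions, not settings taken from this study.

```python
import math

def fit_equal_weight_bma(forecasts, observations, sigma=1.0, n_iter=100):
    """EM sketch for BMA with weights fixed at 1/K and one common sigma.

    forecasts[t][k] is the bias-corrected forecast of member k for training
    case t; observations[t] is the verifying observation.
    """
    n_cases = len(observations)
    n_members = len(forecasts[0])
    weight = 1.0 / n_members  # constrained, so it is never updated
    for _ in range(n_iter):
        weighted_sq_error = 0.0
        for members, obs in zip(forecasts, observations):
            # E step: membership probabilities proportional to the normal
            # density of the observation under each member. The equal
            # weights and the normalizing constant cancel in the ratio.
            log_dens = [-0.5 * ((obs - f) / sigma) ** 2 for f in members]
            shift = max(log_dens)  # guards against underflow
            dens = [math.exp(v - shift) for v in log_dens]
            norm = sum(dens)
            weighted_sq_error += sum(d / norm * (obs - f) ** 2
                                     for d, f in zip(dens, members))
        # M step: only the common standard deviation is re-estimated.
        sigma = max(math.sqrt(weighted_sq_error / n_cases), 1e-6)
    return weight, sigma

# Hypothetical two-member training sample with two cases.
w, s = fit_equal_weight_bma([[0.0, 2.0], [1.0, 3.0]], [1.0, 2.0])
```

For a one-member "ensemble" the membership probabilities are all one, and the estimate reduces to the root-mean-squared error of that member, which gives a simple check on the update.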

BMA should be a valuable method for the calibration of any multimodel ensemble system, whether it consists of individual discrete models, as in a “poor-man’s” system, or of combinations of ensembles from different centers. The latter is becoming more important with the initiatives of the North American Ensemble Forecast System (NAEFS) and the global ensemble initiative of The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE; Richardson et al. 2005). In such a system, which will be made up of single-model ensembles along with multimodel ensembles, BMA could be applied to simultaneously combine and calibrate them. If the members of a component ensemble are not distinct, the weights $\omega_k$ for the members of that ensemble would be constrained to be equal. Some very preliminary work has been done to apply BMA to combined Meteorological Service of Canada and the National Centers for Environmental Prediction ensembles, using a total of 26 members all initialized at the same time. Results of this work are promising so far.

Finally, it is the flexibility of BMA that makes it most attractive for use in multimodel ensemble systems: one can add any model for which data are available during the training period and produce a calibrated combination of models. This means it can be used to diagnose the added information that the ensemble brings to the deterministic forecast, which is of importance to all centers that run both an ensemble system and one or more deterministic models.

## Acknowledgments

The authors wish to thank Dr. Tilmann Gneiting for helpful discussions on this paper, and Dr. Eric Grimit, Michael Polakowski, and J. McLean Sloughter for contributing software. Adrian Raftery’s work was supported by the DoD Multidisciplinary University Research Initiative (MURI) program administered by the Office of Naval Research under Grant N00014-01-10745. We also wish to thank Dr. Tom Hamill and an anonymous reviewer for their comments, which have greatly improved the paper.

## REFERENCES

Anderson, J., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. *J. Climate*, **9**, 1518–1530.

Buizza, R., 1997: Potential forecast skill of ensemble prediction and spread and skill distributions of the ECMWF ensemble prediction system. *Mon. Wea. Rev.*, **125**, 99–119.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. *Quart. J. Roy. Meteor. Soc.*, **131**, 231–250.

Coté, J., S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998: The operational CMC/MRB Global Environmental Multiscale (GEM) Model. Part I: Design considerations and formulation. *Mon. Wea. Rev.*, **126**, 1373–1395.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977: Maximum likelihood from incomplete data via the EM algorithm. *J. Roy. Stat. Soc.*, **39B**, 1–39.

Fisher, R. A., 1922: On the mathematical foundations of theoretical statistics. *Philos. Trans. Roy. Soc. London*, **222A**, 309–368.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1202–1211.

Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. *Wea. Forecasting*, **17**, 192–205.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.*, **129**, 550–560.

Hamill, T. M., and S. J. Colucci, 1997: Verification of ETA-RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. *Mon. Wea. Rev.*, **124**, 1225–1242.

Krishnamurti, T. N., T. LaRow, D. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensembles. *Science*, **285**, 1548–1550.

Lefaivre, L., P. L. Houtekamer, A. Bergeron, and R. Verret, 1997: The CMC ensemble prediction system. *Proc. Sixth Workshop on Meteorological Operational Systems*, Reading, United Kingdom, ECMWF, 31–44.

McLachlan, G. J., and T. Krishnan, 1997: *The EM Algorithm and Extensions*. Wiley, 274 pp.

Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble system: Methodology and validation. *Quart. J. Roy. Meteor. Soc.*, **122**, 73–119.

Pellerin, G., L. Lefaivre, P. Houtekamer, and C. Girard, 2003: Increasing the horizontal resolution of ensemble forecasts at CMC. *Nonlinear Processes Geophys.*, **10**, 463–468.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Richardson, D. S., R. Buizza, and R. Hagedorn, 2005: Report of the 1st Workshop on the THORPEX Interactive Grand Global Ensemble (TIGGE). ECMWF, 34 pp. [Available online at http://www.ecmwf.int/newsevents/meetings/workshops/2005/TIGGE/TIGGE_report.pdf.]

Ritchie, H., 1991: Application of the semi-Lagrangian method to a multi-level spectral primitive-equations model. *Quart. J. Roy. Meteor. Soc.*, **117**, 91–106.

Simonsen, C., 1991: Self-adaptive model output statistics based on Kalman filtering. Lectures and papers presented at the WMO training workshop on the interpretation of NWP products in terms of local weather phenomena and their verification, Wageningen, Netherlands, WMO PSMP Research Rep. Series 34, XX-33–XX-37.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319.

Toth, Z., Y. Zhu, and T. Marchok, 2001: The use of ensembles to identify forecasts with small and large uncertainty. *Wea. Forecasting*, **16**, 463–477.

Toth, Z., and Coauthors, 2005: The North American Ensemble Forecast System. Preprints, *21st Conf. on Weather Analysis and Forecasting/17th Conf. on Numerical Weather Prediction*, Washington, DC, Amer. Meteor. Soc., CD-ROM, 11A.1.

Vallée, M., L. Wilson, and P. Bourgouin, 1996: New statistical methods for the interpretation of NWP output at the Canadian Meteorological Center. Preprints, *13th Conf. on Probability and Statistics in the Atmospheric Sciences*, San Francisco, CA, Amer. Meteor. Soc., 37–44.

Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. *Bull. Amer. Meteor. Soc.*, **76**, 1157–1164.

Wilson, L. J., and M. Vallée, 2002: The Canadian updateable model output statistics (UMOS) system: Design and development tests. *Wea. Forecasting*, **17**, 206–222.

Wilson, L. J., and M. Vallée, 2003: The Canadian updateable model output statistics (UMOS) system: Validation against perfect prog. *Wea. Forecasting*, **18**, 288–302.

Ziehmann, C., 2000: Comparison of a single-model EPS with a multi-model ensemble consisting of a few operational models. *Tellus*, **52A**, 280–299.

## APPENDIX

### Verification Measures

#### Rank histogram

The rank histogram (Anderson 1996) is formed by first ranking the values from each ensemble forecast from lowest to highest. These ordered values then are used as thresholds to define *N* + 1 ranges or bins of the predictand, where *N* is the ensemble size. The two extreme bins are open ended. The histogram is formed by tallying the number of occurrences of the observation in each bin over the verification sample.

Under the assumption that the observation is equally likely to fall in each bin, a “perfect” rank histogram is one which is flat, indicating that, on average, the ensemble spread covers the variability in the predictand. U-shaped histograms indicate that ensemble spread is too small on average (the observation too often lies outside the ensemble), and asymmetric histograms indicate biases in the forecasts.

Rank histograms are meaningful only for relatively large verification sample sizes.
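The tallying procedure described above can be sketched as follows. The function name is hypothetical, and the handling of ties (an observation exactly equal to a member value is placed in the lower bin) is an assumption, since the text does not specify it.

```python
def rank_histogram(ensembles, observations):
    """Count how often the observation falls in each of the N + 1 bins.

    ensembles[t] holds the N member forecasts for case t. Bin 0 means the
    observation is below every member; bin N means it is above every member.
    """
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # The bin index equals the number of members below the observation,
        # so explicit sorting of the members is not needed.
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

# Hypothetical 3-member ensemble verified on four cases: one hit per bin.
flat = rank_histogram([[1.0, 2.0, 3.0]] * 4, [0.5, 1.5, 2.5, 3.5])
# Every member forecasts lower than observed: all counts land in the top bin.
cold = rank_histogram([[1.0, 2.0, 3.0]] * 2, [5.0, 6.0])
```

For a 16-member ensemble the top bin in this sketch has index 16, which corresponds to the 17th bin referred to in the discussion of the cold bias.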

#### Continuous rank probability score

Consider a forecast pdf $\rho(x)$ and a verifying observed value $x_a$. The CRPS is defined by

$$\mathrm{CRPS} = \int_{-\infty}^{\infty} \left[ P(x) - P_a(x) \right]^2 \, dx,$$

where $P$ and $P_a$ are cumulative distributions (cdf),

$$P(x) = \int_{-\infty}^{x} \rho(y) \, dy, \qquad P_a(x) = H(x - x_a),$$

and

$$H(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$$

is the Heaviside function. The CRPS is thus the integrated squared difference between the predicted cdf and the observation expressed as a cdf. It is negatively oriented (smaller is better) and the perfect score of 0 is achieved only for a perfect deterministic forecast. The CRPS has dimensions of the variable $x$. The CRPS reduces to the mean absolute error for a deterministic forecast, and therefore can be considered as a mean absolute error for a probability distribution.
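The reduction to the absolute error can be checked numerically. The sketch below treats an ensemble as an equally weighted step-function cdf and uses the standard identity that, for such a cdf, the CRPS equals the mean absolute member error minus half the mean absolute difference between member pairs. This is an illustration only; the scores in this study are computed for BMA and Gaussian pdfs, and the function name is hypothetical.

```python
def crps_ensemble(members, obs):
    """CRPS of the empirical (step-function) cdf of an ensemble.

    Uses CRPS = mean|x_i - obs| - 0.5 * mean|x_i - x_j| over all pairs (i, j).
    """
    n = len(members)
    mean_abs_error = sum(abs(x - obs) for x in members) / n
    mean_pair_diff = sum(abs(xi - xj) for xi in members for xj in members) / (n * n)
    return mean_abs_error - 0.5 * mean_pair_diff

# One member: the score is just the absolute error of that forecast.
deterministic = crps_ensemble([3.0], 5.0)   # 2.0
# Two members at 0 and 2 with the observation at 1.
two_member = crps_ensemble([0.0, 2.0], 1.0)  # 0.5
```

The two-member value can be confirmed by hand from the integral definition: the squared cdf difference is 0.25 on the interval from 0 to 2 and zero elsewhere, giving 0.5.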

#### Continuous rank probability skill score

The continuous rank probability skill score is defined as

$$\mathrm{CRPSS} = \frac{\mathrm{CRPS}_c - \mathrm{CRPS}_f}{\mathrm{CRPS}_c},$$

where $\mathrm{CRPS}_c$ is the standard score, in this case for a climatological forecast, and $\mathrm{CRPS}_f$ is the score for the forecast. To avoid the inclusion of artificial skill, the climatology reference score was computed using the long-term climatological distribution applicable to the day of the year and the station. Scores for each station $i$ and each of four seasons $s$ were then computed as