## Abstract

Reliability is an essential attribute of the quality of probabilistic forecasts. It indicates the correspondence between a given probability and the observed frequency of an event when this event is forecast with this probability. The variability of the reliability of ECMWF ensemble forecasts has been investigated. Probabilistic forecasts of the 850-hPa temperature anomaly have been considered during four consecutive winter seasons. Reliability appears highly variable in space and time. A proper evaluation thus requires a local, seasonwise verification in order to avoid an overestimation of the performance. On the other hand, stratification of the data is likely to lead to an underestimation of the performance due to ill-sampling, so that a compromise has to be found.

Variations of model bias account for a major part of the interannual variability of the reliability. After bias correction, reliability is virtually constant during the considered period of 4 yr, despite two major changes implemented in the ECMWF Ensemble Prediction System (EPS) over that time. Local variations of model bias contribute strongly to the spatial variability of the reliability, but this contribution is only revealed when considering separate winter seasons.

Due to sampling limitations, local calibration and/or local bias correction are unlikely to bring a large improvement of reliability in operational conditions. A good scheme might rather be a combination of domain bias correction and domain calibration. Because calibration and bias correction require large samples of data for computing statistics, and at the same time other, independent samples are needed for validation, the implementation of an operational scheme for improving local reliability might be an arduous challenge.

## 1. Introduction

The validation of Ensemble Prediction Systems (EPS) has become an important field of research during the last few years. Since December 1992, both the National Centers for Environmental Prediction (NCEP) and the European Centre for Medium-Range Weather Forecasts (ECMWF) have produced operational forecasts based on ensemble prediction (Tracton and Kalnay 1993; Palmer et al. 1993). Several other centers worldwide have since implemented operational or quasi-operational ensembles (e.g., Houtekamer et al. 1996; Kobayashi et al. 1996). Different strategies can be followed for setting up an operational ensemble system, for example, model characteristics, number of ensemble members, method for generating perturbations. One of the goals of ensemble validation is to provide an estimation of the performance of the different options (Atger 1999). Another issue is the “return on investments” of an EPS that may be illustrated by quantitative comparisons between the cost of running an ensemble and the economic value of the forecasts for the community of users (Richardson 2001).

An EPS is designed to provide an ensemble of realizations of the probability density function (pdf) of the future meteorological state. EPS verification thus consists of evaluating to what extent ensemble members properly sample this pdf. A large variety of scores has been used to estimate the performance of an EPS in that respect. Among these scores, the most widely used is the Brier score (Brier 1950), designed for quantifying the performance of a probabilistic forecast of a dichotomous event. The Brier score is simply the mean square error of forecast probabilities, under the assumption that a perfect probabilistic forecast is a deterministic forecast that proves correct. The Ranked Probability Score (Epstein 1969) is a generalization of the Brier score to scalar variables. The Continuous Ranked Probability Score (CRPS) allows an evaluation of the performance of a continuous probabilistic forecast (Bouttier 1994).

Two aspects of the performance of the probabilistic forecast of a dichotomous event can be distinguished. The first aspect is the correspondence between a given probability and the observed frequency of an event when this event is forecast with this probability. The second aspect is the variability of the observed frequency of an event, when the forecast probability of this event varies. These two attributes, called, respectively, reliability and resolution, appear in the decomposition of the Brier score proposed by Murphy (1973). Hersbach (2000) and Talagrand and Candille (2001) have proposed two different decompositions of the CRPS in order to separate the contributions of reliability and resolution. In a more general sense, reliability and resolution can be seen as two aspects of the performance of an ensemble. Reliability consists in predicting a distribution that proves close, a posteriori, to the distribution of observations whenever this distribution is predicted. Resolution is the variability of the distribution of observations as the predicted distribution varies.

A necessary condition for good reliability is a flat rank histogram. Each bar of the histogram indicates the proportion of cases when the verification is found between two consecutive ensemble members, once all members have been sorted (Talagrand et al. 1997). Outliers are found in the two extreme categories, corresponding to those cases when the verification is outside the range of ensemble forecasts. Rank histograms computed from operational ensembles are generally U-shaped, which is traditionally interpreted as the consequence of the lack of ensemble spread that is a common characteristic of operational EPS (Anderson 1996). When considering surface or low-level atmospheric parameters, rank histograms are not only U-shaped, they often exhibit an asymmetry due to the systematic biases of numerical models. For example, the rank histogram is strongly L-shaped when considering the temperature at 850 hPa in an area where the model suffers from a warm bias (Palany et al. 1999). Similarly, symmetric U-shaped rank histograms may be interpreted as a consequence of symmetric, conditional model biases, among other plausible explanations (Hamill 2001). A part of the lack of reliability of probabilistic forecasts based on an EPS might thus come from model systematic biases, rather than point to a weakness of the method used for generating the ensemble.
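The rank histogram described above is straightforward to compute. A minimal sketch with synthetic Gaussian data (the toy ensemble and array names are illustrative, not from the paper); drawing verifications from the same distribution as the members yields an approximately flat histogram:

```python
import numpy as np

def rank_histogram(ensemble, verification):
    """ensemble: (cases, members); verification: (cases,).
    Returns counts of the verification's rank among the sorted members,
    i.e. n_members + 1 bins; the two extreme bins hold the outliers."""
    n_cases, n_members = ensemble.shape
    # rank = number of ensemble members below the verifying value
    ranks = np.sum(np.sort(ensemble, axis=1) < verification[:, None], axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(0)
ens = rng.normal(size=(5000, 10))   # a perfectly reliable 10-member ensemble
ver = rng.normal(size=5000)         # verification drawn from the same pdf
hist = rank_histogram(ens, ver)     # approximately flat: ~5000/11 per bin
```

An under-dispersive ensemble (e.g., members drawn with a standard deviation smaller than that of the verification) would instead populate the two extreme bins, producing the U shape discussed above.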

As noticed by Murphy (1993), the lack of reliability can be seen as a probability bias. Therefore, several authors have proposed a calibration of probabilistic forecasts in order to improve their reliability. Calibration consists of a statistical correction of the forecast probability, based on previous verification. For example, in the case of the probabilistic forecast of a dichotomous event, if the observed frequency in the past was 0.3 when the forecast probability was 0.4, the calibrated probability will be 0.3 when the raw probability is 0.4 (Zhu et al. 1996). Alternatively, a statistical correction can be applied to the ensemble distribution, in order to get it closer to the conditional distribution of observations (Hamill and Colucci 1998). This correction generally consists in inflating the distribution, that is, increasing ensemble spread in order to flatten a U-shaped rank histogram. On the other hand, the lack of reliability is often partly due to model biases, in particular when considering surface or low-level parameters. Therefore, applying a bias correction to individual forecasts might be required before attempting to improve reliability by correcting (i.e., generally increasing) the ensemble spread.

Scores that indicate the performance of operational forecasting systems are traditionally computed over large geographical domains, for example, Europe or the United States, in order to accumulate large samples from relatively short periods of time (typically one season). This is generally the case for the evaluation of operational EPS (Buizza et al. 1999a; Mullen and Buizza 2001), although some studies are based on local verification (Ziehmann 2000). This is also generally the case when a method of calibration of probabilistic forecasts is applied to an ensemble in order to improve reliability (Eckel and Walters 1998). Yet reliability may be highly variable in space, especially when considering surface or low-level parameters. As is the case for any bias, the lack of reliability is likely to compensate from one point to another, for example, because the probability is overestimated in some places while it is underestimated in other places. Local variations of model biases, due to dependencies on the model topography, are likely to contribute to this variability. Therefore the estimation of the reliability from a computation over large geographical domains may give a poor indication of the performance of the system. For the same reason, the positive impact of calibration might be meaningless when the correction is based on spatially averaged statistics. The usefulness of such a calibration scheme is rather limited in practice, since reliability may not be improved locally.

Ensemble reliability may strongly vary in time, too. Most variations of the performance are likely due to the variability of meteorological conditions, from one year to another, according to the season, and even at the synoptic timescale. The variability of the reliability is also a consequence of modifications applied both to the model and to the EPS, affecting the quality of the model as well as of the initial generation of ensemble members. Evaluations of operational ensembles generally take into account the interseasonal variability of the performance by computing seasonal scores. On the other hand, one might be tempted to compute operational scores from several years of data in order to increase the size of the verification sample, especially when evaluating the local performance, rather than computing scores from large geographical domains. Similarly, local calibration might require statistics based on several years of data. Because of the variability of the performance from one year to another, reliability estimates might be unrealistic, as well as the positive impact of calibration.

The aim of the work described in this article has been to study the spatial and interannual variability of the reliability of ECMWF ensemble forecasts. Consequences for the calibration of probabilistic forecasts are also discussed. The impact of model biases is considered too. The methodology is described in section 2. Results are presented in section 3, discussed in section 4, and summarized in section 5.

## 2. Methodology

### a. Data

Four consecutive winter seasons (December–February) are considered: from 10 December 1996 to 28 February 2000. During these 4 yr, the operational version of the ECMWF EPS was improved several times, as well as the model on which it is based, but the horizontal resolution (T_{L}159) and the number of ensemble members (control + 50) remained the same.

All forecasts considered in the study are +96 h probabilities of the 850-hPa temperature anomaly being above 1 standard deviation at 48 grid points over Europe (35°N–60°N, 10°W–25°E, 5°/5° grid). The climate reference, for the definition of anomalies, has been computed from the ECMWF 15-yr reanalysis (Gibson et al. 1997). Three-month averages (December, January, February) have been computed at every grid point. The verification reference is the ECMWF high-resolution analysis that was operational during the considered period (T_{L}319), interpolated at every grid point. Given the high density of upper-air observations over the considered area (radiosondes, aircraft observations, etc.), these local interpolations are assumed to reflect the true state of the atmosphere (i.e., analysis error is close to zero). The standard deviation of the temperature anomaly has been computed at every grid point from the sample of analyses. Figure 1 shows that the frequency of occurrence of the considered event, that is, a temperature anomaly above 1 standard deviation, is between 0.148 and 0.194 over the domain.

The four winter seasons are verified either separately or together. Each winter season is randomly divided into three equally populated subsamples. Each subsample is verified separately. Calibration and bias correction, when required, are based on statistics computed from two subsamples and implemented in the remaining subsample. This verification scheme has been preferred to a simple split of the data into two equal halves in order (i) to increase the number of cases considered for computing calibration or bias correction statistics, and (ii) to avoid dependencies on intraseasonal changes, by randomizing the data inside a winter season.
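The three-subsample scheme above can be sketched as follows; a minimal illustration under the assumption of a 90-day winter season (day indices and names are hypothetical):

```python
import numpy as np

def threefold_split(n_days, rng):
    """Randomly partition day indices into three (near-)equal subsamples."""
    idx = rng.permutation(n_days)
    return np.array_split(idx, 3)

rng = np.random.default_rng(1)
for eval_part in range(3):
    parts = threefold_split(90, rng)     # one 90-day winter season
    eval_idx = parts[eval_part]          # the verified subsample (30 days)
    train_idx = np.concatenate(
        [p for i, p in enumerate(parts) if i != eval_part]
    )
    # calibration / bias-correction statistics would be computed on the
    # 60 days of train_idx, then applied to the forecasts on eval_idx
```

Rotating `eval_part` over the three subsamples ensures that every case is verified exactly once while the statistics applied to it are always computed from independent data.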

### b. Decomposition of the Brier score

The Brier score (BS) is defined as the mean square error of the probability forecast:

$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2, \qquad (1)$$

where *N* is the number of forecasts, *p*_{i} is the forecast probability, and *o*_{i} is the verifying observation (1 if the event occurs, 0 if it does not) (Brier 1950). The Brier score can be decomposed into three terms:

$$\mathrm{BS} = \frac{1}{N}\sum_{k=1}^{T} n_k(p_k - \overline{o}_k)^2 - \frac{1}{N}\sum_{k=1}^{T} n_k(\overline{o}_k - \overline{o})^2 + \overline{o}(1 - \overline{o}), \qquad (2)$$

where the sample of *N* forecasts has been divided into *T* categories, each comprising *n*_{k} cases of forecast probability *p*_{k}. Here \(\overline{o}_k\) is the observed frequency when *p*_{k} is forecast and \(\overline{o}\) is the observed frequency in the whole sample (Murphy 1973).

The first term of the decomposition is the reliability, that is, the square difference between the probability and the observed frequency of the event, averaged over the *T* categories. The reliability term indicates to what extent the forecast probabilities are accurate (it is negatively oriented, like the Brier score itself, i.e., the smaller the better). The second term is the resolution, that is, the square difference between the observed frequency when *p*_{k} is forecast and the observed frequency for the whole sample. The resolution term indicates to what extent the different forecast categories indicate different observed frequencies (it is positively oriented, i.e., the larger the better). The last term of the decomposition is the uncertainty, that is, the variance of the observations. It indicates the intrinsic difficulty of forecasting the event and does not depend on the forecasting system.
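The decomposition above can be verified numerically. A hedged sketch (function and sample values are illustrative, not from the paper): when the categories match the forecast probabilities exactly, reliability minus resolution plus uncertainty recovers the Brier score to machine precision.

```python
import numpy as np

def brier_decomposition(p, o, categories):
    """p: forecast probabilities; o: 0/1 outcomes; categories: the T
    probability values that can be forecast. Returns (reliability,
    resolution, uncertainty); BS = reliability - resolution + uncertainty."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    obar = o.mean()                       # observed frequency, whole sample
    rel = res = 0.0
    for pk in categories:
        mask = p == pk
        nk = mask.sum()
        if nk == 0:
            continue
        ok = o[mask].mean()               # observed frequency in category k
        rel += nk * (pk - ok) ** 2
        res += nk * (ok - obar) ** 2
    n = len(p)
    return rel / n, res / n, obar * (1.0 - obar)

# the decomposition is exact when categories match the forecast values
p = np.array([0.0, 0.0, 0.5, 0.5, 1.0, 1.0])
o = np.array([0,   1,   0,   1,   1,   1  ])
rel, res, unc = brier_decomposition(p, o, [0.0, 0.5, 1.0])
bs = np.mean((p - o) ** 2)
assert abs(bs - (rel - res + unc)) < 1e-12
```

With a coarser categorization (e.g., 10% bins applied to 51-member probabilities), the same identity holds only approximately, which is the point made in section 2c below.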

### c. Estimation of the reliability

The computation of the reliability from Eq. (2) requires a categorization of the forecast probabilities. In the case of ensemble-based probabilistic forecasts, the decomposition of the Brier score is exact only when the categorization consists in *n* + 1 categories from *p* = 0/*n* to *p* = *n*/*n,* where *p* is the forecast probability and *n* is the number of ensemble members (Atger 1999). When evaluating large ensembles (e.g., *n* = 51 for the ECMWF EPS) a large number of cases is needed in order to sample the *n* + 1 categories. For this reason, reliability is often evaluated from a smaller number of probability categories, for example, 10% probability intervals, that is, from 0%–10% to 90%–100% (Mullen and Buizza 2001). This categorization leads to an approximate decomposition of the Brier score.

In practice, the number of cases in each probability category varies significantly. Specifically, high and low probabilities are more often forecast than probabilities lying close to the climate frequency. This is in fact a desirable feature of ensembles, which contributes to improving some aspects of the performance (those aspects related to the resolution component of the Brier score). Some probability categories are thus likely to suffer from a lack of representative cases.

A user-oriented evaluation implies that probability categories reflect the variety of probabilistic forecasts that have been delivered to forecast users. This implies the use of *n* + 1 categories, as mentioned above, unless ensemble-based probabilities have been rounded, for example, to 10% units. On the other hand, significant results are more likely obtained when probability categories are carefully designed so as to distribute the cases approximately evenly among them, in order to avoid undersampling as much as possible. This is obviously a serious concern when investigating the variability of the performance in space and time, which implies a strong stratification of the data. Therefore, in the present study, probability categories have been prescribed according to the latter option. Since the prior probability of the event, that is, the frequency of occurrence over the considered sample, is rather small (approximately 0.175), the zero probability is most frequently forecast and defines the first category. Four other categories are defined so that the number of forecast cases is approximately the same in each of them. Probability categories thus differ according to the considered sample.
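The balanced categorization might be sketched as follows, assuming it is built from the number of members forecasting the event: the zero-probability cases form the first category, and quartiles of the remaining cases define four further categories of roughly equal population (the function and boundaries are illustrative, not the paper's code):

```python
import numpy as np

def balanced_categories(n_members_forecasting):
    """n_members_forecasting: per case, the number of ensemble members
    (0..n) forecasting the event. Returns a category index 0..4 per case:
    category 0 holds the zero-probability forecasts; categories 1-4 split
    the nonzero forecasts at the quartiles of their distribution."""
    m = np.asarray(n_members_forecasting)
    cats = np.zeros(m.shape, dtype=int)
    nonzero = m > 0
    # quartile boundaries of the nonzero forecasts define categories 1-4
    edges = np.quantile(m[nonzero], [0.25, 0.5, 0.75])
    cats[nonzero] = 1 + np.digitize(m[nonzero], edges)
    return cats
```

Because the quartile boundaries are recomputed for each sample, the categories differ from one subsample to another, which is exactly why the calibration of section 2d cannot map categories directly and resorts to interpolation.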

The solid curve in Fig. 2 is the so-called reliability curve, which indicates the correspondence between the forecast probability (*x* axis), averaged over all the cases of the category, and the observed frequency (*y* axis). All grid points are considered together, leading to the domain reliability (in contrast to the local reliability, computed at every grid point). The four winter seasons are considered together. The reliability component of the Brier score is 2.5 × 10^{−3}, which can be seen graphically as the weighted, averaged, squared distance between the reliability curve and the 45° line. The distribution of forecast probabilities is also shown (Fig. 2, inset). It indicates a rather balanced distribution of the cases among the five categories: approximately 60% in the first category (zero probability) and approximately 10% in each of categories 2–5.

### d. Calibration

Several methods have been proposed for the calibration of ensemble forecasts (e.g., Hamill and Colucci 1998). The method used in the present study is based on reliability diagram statistics (e.g., Zhu et al. 1996). In its simplest form, this method consists of applying the following rule: when the probability category *i* is forecast in the evaluation subsample, the calibrated probability is the frequency with which the event is observed in the calibration subsample when the same category *i* is forecast. As mentioned in section 2c, the choice of a “balanced” categorization implies that probability categories differ according to the considered subsample. Therefore it is not possible to calibrate the probability directly as indicated above. Three cases can be distinguished:

(i) No member forecasts the event in the evaluation subsample. The calibrated probability is the observed frequency when no member forecasts the event in the calibration subsample.

(ii) The number of members forecasting the event in the evaluation subsample is between *k*_{i} and *k*_{i+1} (*i* = 1 to 4), where *k*_{i} is the number of members forecasting the event, averaged over the cases belonging to the *i*th category of the calibration subsample. The calibrated probability is linearly interpolated between *f*_{i} and *f*_{i+1}, where *f*_{i} is the observed frequency when the *i*th category is forecast in the calibration subsample.

(iii) The number of members forecasting the event in the evaluation subsample is above *k*_{5}. The calibrated probability is linearly interpolated between *f*_{5} and 1.

The following is an example for illustrating the calibration process in cases (ii) and (iii). Assume the fourth (respectively, 5th) category of the calibration subsample consists of the cases when the number of ensemble members forecasting the event is between 22 and 30 members (between 31 and 51 members). The average number of members forecasting the event is *k*_{4} = 25 in the fourth category, *k*_{5} = 38 in the fifth category. Case (ii): if 27 members forecast the event in the evaluation subsample, then the calibrated probability is linearly interpolated between the observed frequencies *f*_{4} and *f*_{5} in the fourth and the fifth categories (since *k*_{4} < 27 < *k*_{5}). Case (iii): if 40 members forecast the event in the evaluation subsample, then the calibrated probability is linearly interpolated between the observed frequency *f*_{5} in the 5th category and 1 (since 40 > *k*_{5}).

Note that it would have been simpler, in case (ii) above, to take *f*_{4} as the calibrated probability (since 22 < 27 < 30). But this method would lead to using the same calibrated probability for a large number of different cases, while interpolation takes the relationship between the forecast probability and the observed frequency into account more accurately.
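The interpolation scheme for cases (ii) and (iii) can be sketched as follows. The *k* values match the worked example above (*k*_{4} = 25, *k*_{5} = 38); the observed frequencies *f* and the zero-category frequency are hypothetical, and the sketch assumes the calibrated probability reaches 1 when all 51 members forecast the event (the paper does not specify that end point):

```python
import numpy as np

def calibrate(n_forecasting, k, f, f0):
    """n_forecasting: members forecasting the event in the evaluation case.
    k, f: per-category mean member counts and observed frequencies
    (categories 1..5); f0: observed frequency when no member forecasts
    the event, i.e. case (i)."""
    if n_forecasting == 0:
        return f0                                    # case (i)
    if n_forecasting >= k[-1]:
        # case (iii): interpolate between f_5 (at k_5) and 1 (at n = 51)
        return f[-1] + (1.0 - f[-1]) * (n_forecasting - k[-1]) / (51 - k[-1])
    # case (ii): interpolate between the two bracketing categories
    return float(np.interp(n_forecasting, k, f))

k = [2, 8, 16, 25, 38]                 # k_4 = 25, k_5 = 38 as in the text
f = [0.05, 0.15, 0.30, 0.55, 0.80]     # hypothetical observed frequencies
p27 = calibrate(27, k, f, f0=0.02)     # case (ii): between f_4 and f_5
p40 = calibrate(40, k, f, f0=0.02)     # case (iii): between f_5 and 1
```

For forecasts with fewer members than *k*_{1}, `np.interp` clamps to *f*_{1}; a fuller implementation might instead interpolate down toward the zero-category frequency.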

Figure 2 shows the effect of calibration on the reliability curve and on the distribution of forecast probabilities. All grid points are considered together for computing the calibration statistics, leading to a domain calibration (in contrast to a local calibration, when calibration statistics are computed separately for every grid point). The four winter seasons are considered together, for calibration statistics as well as for the evaluation. The domain reliability after calibration is almost perfect, while the distribution of forecasts is little affected. The reliability term of the Brier score is reduced from 2.5 × 10^{−3} to 1.7 × 10^{−5}.

### e. Correction of model biases

The correction of systematic model biases, when applied, consists of translating the ensemble distribution according to the average error of the ensemble mean, so that the ensemble spread is not affected by the correction. The procedure is the following:

(i) A linear regression is applied to the correction subsample. More precisely, the parameters *α* and *β* are defined so as to achieve a least squares fit of the quantity *αX* + *β*, where *X* is the ensemble mean forecast, to the verification.

(ii) This linear correction is applied to the evaluation subsample. The difference between the ensemble mean forecast before and after bias correction is then applied to individual ensemble forecasts:

$$x_c = x + (\alpha - 1)X + \beta, \qquad (3)$$

where *x* (*x*_{c}) is the individual ensemble forecast before (after) correction.

This correction scheme has proven more efficient than merely subtracting, from every ensemble member, the mean error averaged over all ensemble forecasts. Although the main biases are corrected, this scheme is still very simple and limited in effect compared with more sophisticated procedures, for example, model output statistics (MOS) methods that would take into account statistical dependencies involving various parameters.
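The two-step correction of Eq. (3) might be sketched as follows; a minimal illustration with synthetic data carrying a constant warm bias (variable names and the toy setup are not from the paper):

```python
import numpy as np

def fit_bias(X_train, y_train):
    """Step (i): least squares fit of y ~ alpha * X + beta,
    X being the ensemble mean and y the verification."""
    alpha, beta = np.polyfit(X_train, y_train, 1)
    return alpha, beta

def correct_members(members, alpha, beta):
    """Step (ii): x_c = x + (alpha - 1) X + beta, with X the ensemble
    mean of each case, so the ensemble spread is left unchanged."""
    X = members.mean(axis=1, keepdims=True)
    return members + (alpha - 1.0) * X + beta

rng = np.random.default_rng(2)
truth = rng.normal(size=500)
# a 20-member ensemble with a +2.0 warm bias
members = truth[:, None] + 2.0 + rng.normal(scale=0.5, size=(500, 20))
a, b = fit_bias(members.mean(axis=1), truth)
corrected = correct_members(members, a, b)
# the mean error of the ensemble mean is now close to zero, and the
# per-case spread is identical before and after correction
```

Shifting every member by the same per-case amount is what distinguishes this scheme from inflation-type corrections: it moves the distribution without reshaping it.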

The effect of model bias correction on the domain reliability is shown in Fig. 3. All grid points are considered together for computing the correction statistics, leading to a domain bias correction (in contrast to a local bias correction, when statistics are computed separately for every grid point). The four winter seasons are considered together, for computing statistics as well as for the evaluation. Figure 3a shows the impact on the reliability curve shown in Fig. 2. The reliability component of the Brier score is reduced from 2.5 × 10^{−3} to 0.4 × 10^{−3}, a substantial reduction but still much less than that obtained through domain calibration (Fig. 2). Figure 3b shows the rank histogram, before and after bias correction. Again, all grid points and the four winter seasons are considered together. A model cold bias is probably the reason for the high proportion of verifications that are warmer than all ensemble members. After bias correction the histogram is almost symmetric, but the U shape indicates that reliability is still far from perfect.

## 3. Results

### a. Interannual variability

In this section all grid points are considered together, but the four winter seasons are considered separately, in order to investigate the interannual variability of the domain reliability. The corresponding reliability curves are shown in Fig. 4. Table 1a shows the reliability component of the Brier score averaged over the four winter seasons, as well as the interannual standard deviation. The latter is larger than the former, indicating a strong variability from one year to another. However, sampling limitations contribute to this variability. A computer-based method of hypothesis testing has been used to evaluate the significance of the differences between the four winter seasons. Differences between winter 1996/97 and the other seasons are significant at the 0.98 level, while the other differences are not significant at the 0.9 level.

The method for testing the significance of the difference between two winter seasons is based on the construction of an empirical distribution of differences that are presumably not significant. These differences are generated by computing the reliability component of the Brier score from two samples obtained by randomly drawing the data from either one or the other winter season. The probability that the actual difference between the two winter seasons belongs to this null distribution is then evaluated (Hamill 1999). The different steps of the procedure are as follows. (i) Winter *A* and winter *B* cases are grouped together into a test sample. (ii) This test sample is randomly halved into two subsamples. (iii) The reliability component of the Brier score is computed separately from each subsample. (iv) The difference between the reliability component of the Brier score for the two subsamples is computed. (v) The procedure is repeated 1000 times from (ii) to (iv). (vi) The probability that the difference between the actual reliability component of the Brier score for winter *A* and the actual reliability component of the Brier score for winter *B* is significant is estimated from the empirical distribution obtained at the end of step (v).
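Steps (i)–(vi) above can be sketched as a generic permutation test; `rel_fn` stands for any function returning the reliability component of the Brier score from a sample of cases, and all names are illustrative rather than the paper's code:

```python
import numpy as np

def permutation_pvalue(rel_fn, cases_a, cases_b, n_iter=1000, seed=0):
    """Fraction of null differences at least as large as the observed
    difference between winters A and B; small values suggest the
    observed difference is significant (after Hamill 1999)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([cases_a, cases_b])     # step (i)
    observed = abs(rel_fn(cases_a) - rel_fn(cases_b))
    half = len(cases_a)
    null = np.empty(n_iter)
    for i in range(n_iter):                         # steps (ii)-(v)
        perm = rng.permutation(len(pooled))
        null[i] = abs(rel_fn(pooled[perm[:half]]) -
                      rel_fn(pooled[perm[half:]]))
    return np.mean(null >= observed)                # step (vi)
```

With `rel_fn = np.mean` and two clearly different samples the returned value is near 0; with two samples from the same population it is large, as expected of a null result.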

Consequences for calibration are shown in Figs. 5–6 and Tables 1b and 1c. The effect of domain calibration is far from perfect when considering the four winter seasons together for computing the calibration statistics (Fig. 5). These results contrast with those shown in section 2d, that is, when the evaluation is done over the four winter seasons together. Figure 5 looks like a mere translation of Fig. 4 and the interannual variability is unchanged (Table 1b). On the contrary, computing calibration statistics from every winter season separately has a strong positive impact on domain reliability, since it reduces the average reliability as well as the relative variance (Fig. 6 and Table 1c). Interannual variability is negligible after calibration. This can be seen as an a posteriori indication that variability before calibration is significant, despite the hypothesis testing results presented above.

The positive impact of domain bias correction is not as large as the improvement obtained through domain calibration, with respect to the average reliability component of the Brier score (Tables 1d and 1e). On the other hand, bias correction leads to a stronger reduction of the interannual variance, especially when computing correction statistics from every winter season separately (Table 1e). Once bias correction has been applied, reliability is virtually the same for the four winter seasons. This indicates that a large part of the interannual variability of domain reliability might be a consequence of bias variations from one year to another.

### b. Spatial variability

In this section the four winter seasons are considered together, but the grid points are considered separately, in order to investigate the spatial variability of the local reliability. Figure 7 shows the 48 local reliability curves corresponding to every grid point. The curves spread widely, indicating a large spatial variability. Figure 8 shows the distribution of the reliability component of the Brier score over the considered domain. Average local reliability is 3.9 × 10^{−3}, that is, about 50% higher (i.e., worse) than the domain reliability (see section 2c). This indicates that domain evaluation may lead to an overestimation of the local performance with respect to reliability.

Although the categories have been defined carefully, as discussed in section 2c, some curves in Fig. 7 appear rather noisy due to the relatively limited number of cases considered at every grid point. With a sample of 351 days, each point of the curves is computed from only 35 forecast cases on average (10%), except the first point, where the forecast probability is zero (60% of the cases). Therefore, the spatial variability shown in Fig. 8 may partly result from ill-sampling rather than indicating true discrepancies in the behavior of the forecasting system. For the same reason, the higher level of local reliability compared to domain reliability might be an artifact due to sampling limitations. Figure 9 shows an estimate of the probability that reliability computed at a given grid point is significantly different from reliability computed at other grid points over the considered area. The method is similar to that described in section 3a. The reliability component of the Brier score is first computed for a theoretical “average” grid point, varying randomly every day from point 0 to point 47. This computation is repeated 1000 times, in order to construct an empirical distribution of the reliability of an “average” grid point. The actual reliability at every grid point is then compared to this distribution. The significance level is higher than 0.9 at 23 of the 48 grid points, and higher than 0.95 at 10 grid points corresponding to the minima and maxima of Fig. 8.
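The “average grid point” procedure above might be sketched as follows; `rel_fn` again stands for any function returning the reliability component of the Brier score from a sample of cases, and the data layout is an assumption for illustration:

```python
import numpy as np

def average_point_null(rel_fn, data, n_iter=1000, seed=0):
    """data: (days, points) array of per-day, per-point forecast cases.
    Draws one random grid point per day, n_iter times, and returns the
    null distribution of rel_fn for an 'average' grid point."""
    rng = np.random.default_rng(seed)
    n_days, n_points = data.shape
    days = np.arange(n_days)
    null = np.empty(n_iter)
    for i in range(n_iter):
        pick = rng.integers(0, n_points, size=n_days)
        null[i] = rel_fn(data[days, pick])
    return null

def significance_level(rel_fn, data, point, **kwargs):
    """Fraction of the null distribution below the actual local value.
    Values near 0 or 1 flag a grid point that differs significantly
    from an 'average' grid point."""
    null = average_point_null(rel_fn, data, **kwargs)
    return float(np.mean(null < rel_fn(data[:, point])))
```

Because every synthetic sample mixes all grid points, this null distribution captures the sampling noise of a single-point estimate while averaging out genuine spatial differences, which is what makes the comparison meaningful.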

Consequences for calibration are shown in Figs. 10 and 11. Figure 10a shows the spatial distribution of the local reliability after domain calibration, that is, when all grid points are considered together for computing calibration statistics. The impact is almost uniformly positive, being only slightly detrimental at a small number of grid points (Fig. 10b). The spatial structure of the field is almost unchanged and most local maxima are still present. Average local reliability is 2.0 × 10^{−3}, that is, reduced by a factor of 2. Spatial standard deviation is reduced in the same proportion (from 2.9 × 10^{−3} to 1.5 × 10^{−3}). By contrast, local calibration allows a stronger reduction of reliability, reflecting more closely the spatial distribution of reliability before calibration (Fig. 11). After local calibration no local maximum exceeds 2.5 × 10^{−3} and the spatial distribution is rather uniform (not shown). Average local reliability is 0.8 × 10^{−3}, that is, reduced by a factor of 5. Spatial standard deviation is reduced in the same proportion (from 2.9 × 10^{−3} to 0.7 × 10^{−3}).

The positive impact of domain bias correction is as large as the impact of domain calibration with respect to the average local reliability (2.0 × 10^{−3}). The reduction of the spatial variability is even larger (standard deviation is 1.3 × 10^{−3}). Surprisingly, local bias correction brings only a slight further improvement to the average local reliability (1.7 × 10^{−3}). The spatial variability is reduced in the same proportion as after domain bias correction. This may indicate that local variations of model bias contribute little to the spatial variability of the reliability, when considering the four winter seasons together. This counterintuitive result might be a consequence of the large interannual variability of model bias. Further investigation has confirmed that the spatial variance of model bias is high, but varies considerably from one year to another (not shown).

### c. Interannual variability of the local reliability

In the previous subsection the strong improvement of local reliability after local calibration was obtained by considering the four winter seasons together. On the other hand, it was shown in section 3a that reliability is highly variable from one year to another. Therefore, a proper estimate of the local reliability, before and after calibration, would require a winterwise verification. The reliability component of the Brier score has been computed separately for the four winter seasons, at every grid point. Table 2 indicates the interannual mean and standard deviation, after spatial averaging over all grid points. Before calibration, the spatial distribution for an average winter season is very similar to the spatial distribution when the four winter seasons are considered together (not shown), but the average reliability is higher (9.8 × 10^{−3} vs 3.9 × 10^{−3}).

The effect of a local, winterwise calibration is disappointing (Table 2c). The impact is seriously detrimental on average at several locations (Fig. 12). Examination of the results for every winter season confirms that calibration is far from optimal (not shown). Surprisingly, the effect of calibration is slightly better on average when the four winter seasons are considered together for computing calibration statistics, the evaluation still being done winterwise (Table 2b). The impact is positive almost everywhere (Fig. 13).

This last result seems to indicate that the interannual variability of the reliability would not affect the efficiency of a local calibration method. On the other hand, it might well be a mere consequence of ill-sampling. Local, winterwise calibration statistics, based on a small number of cases (two-thirds of the season, i.e., 54 or 60 days depending on the winter season), might not be robust enough to represent a systematic behavior of the forecasting system. In order to investigate this issue further, reliability has been examined at grid point 16, located off Ireland, where the impact of local, winterwise calibration is detrimental on average. Figure 14 shows the reliability diagram at grid point 16, before and after calibration, for winter 1999/2000. Calibration based on the four winter seasons considered together has a limited positive impact overall, while the noisy reliability curve obtained after winterwise calibration leads to a large deterioration.
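The calibration discussed here relabels forecast probabilities using past verification statistics; the paper's exact scheme is described in its section 2d. As an illustration of the general idea only, the following is a minimal sketch of a reliability-diagram-based calibration, in which each probability category is replaced by the event frequency observed in a training sample (all function names and arguments are hypothetical):

```python
import numpy as np

def fit_calibration(p_train, o_train, bins):
    """Map each probability category to the observed frequency of the
    event in the training sample. `bins` are the interior category
    boundaries, e.g. [0.5] for a two-category scheme."""
    idx = np.digitize(p_train, bins)        # category index of each forecast
    o_train = np.asarray(o_train, float)
    return {k: o_train[idx == k].mean() for k in np.unique(idx)}

def apply_calibration(p, bins, mapping):
    """Replace each forecast probability by its calibrated value,
    falling back to the raw probability for categories unseen in training."""
    idx = np.digitize(p, bins)
    return np.array([mapping.get(k, pk) for k, pk in zip(idx, p)])
```

With a small training sample each per-category frequency rests on very few cases, which is exactly the ill-sampling effect that makes winterwise local calibration noisy in the experiments above.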

The impact of winterwise local bias correction appears much better than that of winterwise local calibration (Fig. 15). The reduction of the interannual variance of the reliability is also much larger (Table 2e). Sampling limitations are apparently more detrimental for computing calibration statistics than bias correction statistics. This is probably because in the former case the sample has to be further stratified into a number of probability categories, while no further stratification is required in the latter case. The importance of interannual variations of model bias is confirmed by the relatively poor efficiency of a local bias correction considering the four winter seasons together for computing correction statistics (Table 2d). Winterwise local bias correction also leads to better results than a local calibration considering the four winter seasons together for computing statistics, especially with respect to the reduction of the spatial variability of the performance (Table 2b). The comparatively good efficiency of bias correction can be illustrated with the example of grid point 16 (Fig. 14): the last probability category is probably not sampled enough, but the rest of the curve indicates a rather good reliability.
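Bias correction needs no stratification into probability categories: a single mean error is estimated from the whole training sample. The paper's scheme is described in its section 2d; as a hedged sketch, assuming the correction simply removes the mean forecast error before re-deriving event probabilities from the ensemble (function and argument names are hypothetical):

```python
import numpy as np

def bias_corrected_probability(ens, train_fc_mean, train_obs, threshold):
    """Subtract the mean forecast error (ensemble mean minus verifying
    analysis, averaged over a training sample) from the ensemble values,
    then recompute the event probability as the fraction of members
    exceeding the threshold. `ens`: (n_days, n_members) at one grid point."""
    bias = np.mean(np.asarray(train_fc_mean) - np.asarray(train_obs))
    corrected = np.asarray(ens, float) - bias
    return (corrected > threshold).mean(axis=1)
```

Because the bias is one number per location (or per domain), it can be estimated from a 54–60-day winter subsample far more robustly than a per-category calibration curve, which is consistent with the results in Table 2.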

One issue discussed at the end of section 3b is that local variations of model bias may contribute little to the spatial variability of reliability. Indeed a winterwise, domain correction of model bias has a substantial positive impact, similar to that obtained through winterwise, local calibration (Table 2f). On the other hand, local correction has a larger impact on the average reliability (almost double that obtained through domain correction) and allows a much stronger reduction of the interannual variance (Table 2e). The spatial variance is reduced from 7.2 × 10^{−3} to 4.1 × 10^{−3} after local bias correction, for the average winter season. The impact on the spatial variance is almost as high after domain correction (4.9 × 10^{−3}). This seems to indicate that local statistics are not robust enough to allow a further reduction of the spatial variance, although reliability is improved at the local scale.

Further investigation has shown that applying local calibration after local bias correction leads to little further improvement, in terms of average reliability as well as the reduction of spatial and interannual variance. This is likely a consequence of the calibration scheme being much affected by sampling limitations. Applying domain calibration after local bias correction has not much effect either. This last result might indicate that ensemble reliability is intrinsically variable in space, independently of model bias.

## 4. Discussion

### a. Estimation of the reliability

Because the reliability component of the Brier score is basically a squared bias, the noise introduced by undersampling is likely to lead to underestimation of performance. On the other hand, it is common knowledge that biases tend to compensate one another when computed from a heterogeneous sample, for example, when considering different grid points or winter seasons together. Therefore, a compromise has to be found between (i) an overestimation of the performance caused by considering heterogeneous data, and (ii) an underestimation of the performance caused by considering samples that are too small. This compromise is a general condition for evaluating properly the performance of a forecasting system, but it has a special importance in the case of the estimation of the reliability of ensemble forecasts, due to the necessary partitioning of the sample into probability categories.
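The reliability component in question is the first term of the classical decomposition of the Brier score (Murphy 1973): the sample-weighted squared difference, over probability categories, between the mean forecast probability and the observed frequency. As a concrete illustration (not the paper's verification code), a minimal sketch:

```python
import numpy as np

def reliability_component(p, o, bins):
    """Reliability term of the Brier score decomposition: sum over
    probability categories of n_k * (pbar_k - obar_k)^2, divided by the
    sample size. `p`: forecast probabilities, `o`: binary outcomes,
    `bins`: interior category boundaries."""
    p = np.asarray(p, float)
    o = np.asarray(o, float)
    idx = np.digitize(p, bins)              # category index of each case
    rel = 0.0
    for k in np.unique(idx):
        mask = idx == k
        pbar_k = p[mask].mean()             # mean forecast probability
        obar_k = o[mask].mean()             # observed frequency
        rel += mask.sum() * (pbar_k - obar_k) ** 2
    return rel / p.size
```

Each term being a squared difference, any sampling noise in `obar_k` inflates the score, which is why small, deeply stratified samples lead to an underestimation of reliability.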

From the results presented in section 3, one can conclude that the estimation of reliability requires a deep stratification of the data, in order to take into account spatial variability as well as interannual variability. Considering grid points and/or consecutive winter seasons together is likely to lead to estimates of the reliability that do not reflect the actual performance of the forecasts that are made available to the users (i.e., local forecasts whose reliability varies from one year to the next according to interannual climate variations). On the other hand, a local, winterwise evaluation inevitably implies a strong reduction of the available sample that may be detrimental for a proper evaluation. Even if one considers together consecutive winter seasons, data for local verification cannot be accumulated for long periods of time because ensembles and models change frequently, due to progress in the field of numerical weather prediction. A period of 4 yr, as considered in the present study, is probably a maximum (note that the model resolution of the ECMWF EPS was changed again in November 2000).

An ideal, definitive solution would be for operational centers to provide long samples of forecasts performed with the current models and/or ensembles over past periods. Potential users could evaluate a priori the expected performance of operational forecasts. Calibration, as well as any kind of statistical postprocessing, would be achieved under much better conditions than when working with small samples accumulated over short periods of time. However, the limitation of computer resources makes the implementation of such a practice very unlikely for the time being.

A possible strategy, which has not been explored further in the present study, might consist of first estimating the variations of local reliability, from a large sample considering several winter seasons together (or even different seasons). Locations (or stations, in the case of weather parameters) might then be classified according to their similarities with respect to local reliability. A seasonwise evaluation might be conducted in a second step, considering several locations grouped together in order to increase the size of the sample.

### b. Significance of the results

As discussed in section 2c, the estimation of the reliability requires a careful definition of probability categories in order to avoid the effects of ill-sampling. In this study each category groups only 10% of the sample on average, except the first category. This leads to a strong further reduction of the computation sample, already stratified according to the winter season and the location: for a season of 90 days, nine cases on average are considered in each category. On the other hand, these numbers depend obviously on both the number of probability categories (here, five) and the parameter threshold that conditions the proportion of probability zero (here, 1 standard deviation).

In order to increase the number of cases considered in each probability category, only two probability categories have been used for verification in this section. This evaluation scheme can be seen as indicating the performance of a binary forecast: either the event is excluded (zero probability), or the event is possible (with a moderate probability, of the order of 0.3). The aim is to evaluate the potential impact of calibration and bias correction when sampling limitations are relaxed. Table 3 indicates the average reliability component of the Brier score, computed locally and winterwise, before and after calibration and/or bias correction. Most results obtained in the previous sections are confirmed, in particular the impact of local model bias correction. Furthermore, the reduction of the spatial variance is much higher after local correction than after domain correction (not shown). This confirms that the limited impact of local bias correction is probably due to sampling limitations.

The main difference, compared to previous results, is the strong improvement obtained through local winterwise calibration. This is a confirmation that the poor efficiency obtained with five probability categories is the consequence of ill-sampling. On the other hand, the reduction of spatial variance is hardly larger than that obtained after local bias correction, confirming that much of the spatial variability of reliability is due to local bias variations. The reduction of interannual variance is also larger after local bias correction than after any kind of calibration. The largest impact overall is obtained when a local calibration is applied after local bias correction (both winterwise). On the contrary, applying domain calibration after local bias correction leads to no further improvement. This confirms that the spatial variability of ensemble reliability is not entirely due to model bias variations, as mentioned in section 3c. There exist local discrepancies of the EPS statistical behavior, independent of the model bias, that result in local variations of the reliability.

Another modification has been tested, consisting of an increase of the frequency of occurrence of the considered event, so that the proportion of cases in each probability category is larger (due to the reduction of the frequency of the zero probability). The threshold has been reduced to 0.5 standard deviation, leading to an average frequency of occurrence of around 0.3. Results (not shown) confirm those obtained when considering two probability categories instead of five for the evaluation.
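The prior frequencies quoted above can be checked under a Gaussian assumption for the climate distribution of the standardized anomaly: an anomaly above 0.5 standard deviation occurs with frequency ≈ 0.31, and above 1 standard deviation with frequency ≈ 0.16, consistent with the "around 0.3" and "around 0.17" figures reported in the text. A one-function sketch:

```python
from math import erf, sqrt

def exceedance_prob(z):
    """Climatological frequency of a standardized anomaly exceeding z,
    assuming a Gaussian climate distribution: 1 - Phi(z)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))
```

For example, `exceedance_prob(0.5)` is about 0.31 and `exceedance_prob(1.0)` is about 0.16.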

### c. Variations of model bias and reliability

The results presented in this study indicate that model bias contributes highly to the lack of reliability. It was shown that most of the improvement expected through calibration can be obtained after an adequate correction of model bias. Furthermore, the main part of the interannual variability of the reliability can be explained by the interannual variability of model bias. The spatial variability of model bias also has a large impact on local variations of reliability. It should be kept in mind that the present study was conducted with a free atmosphere parameter (temperature at 850 hPa). Local variations of model bias are generally even larger when considering weather parameters (e.g., 2-m temperature), and the impact of these variations on reliability is probably higher.

It was shown that reliability varies according to the considered winter season. This is mainly due to model bias variations, while the part of the reliability that is not related to model bias remains relatively constant during the considered period of 4 yr. Two major changes have been implemented in the ECMWF EPS during the 1996–2000 period: introduction of evolved singular vectors (Barkmeijer et al. 1999) and simulation of random model errors (Buizza et al. 1999b). The amplitude of the reliability improvement related to these changes, if any, appears much lower than the amplitude of interannual variations, most of these variations being due to systematic model biases.

### d. Consequences for the calibration of probabilistic forecasts

Calibration and/or bias correction schemes have been used in this study in order to investigate the variability of reliability. From the results presented in previous sections, one might define a calibration strategy consisting of an optimal combination of bias correction and calibration. On the other hand, the results have been obtained with a partition of the data that has the effect of giving an ideal, unrealistic picture of the performance of any statistical correction. In the real world, winter statistics are not available before the end of the season, and the first month of the season cannot be calibrated and/or corrected from statistics computed from the next two months. Randomization of the data further increases the estimated performance of calibration and bias correction, since correlations between consecutive days emphasize the statistical consistency between the calibration (or correction) subsample and the evaluation subsample.

The conditions for a proper estimation of the reliability have been discussed in section 4a. For defining a calibration strategy, one wants to estimate reliability as accurately as possible from available verification data. On the other hand, validation of the strategy would require an independent sample of data in order to test the efficiency of the method. In the real world, the size of available samples is dramatically limited. In this section the impact of calibration and bias correction has been tested from a different partition of the data, with the aim of giving a more realistic picture of the expected performance in an operational environment. In this new experiment, winterwise calibration or bias correction consists of computing statistics from the first half of the season (45 days), then applying them to the second half. Calibration or bias correction based on the four winter seasons considered together is replaced by computing statistics from one winter season, then applying them to the next winter season (implying that the first winter season is not used for evaluation, while the last one is not used for computing statistics). Results (not shown) indicate a large effect of ill-sampling, for local as well as winterwise calibration. The effect of calibration is very limited overall, although domain statistics based on the previous winter season appear the most robust. Local bias correction also suffers much from ill-sampling, while winterwise domain statistics (i.e., based on the current winter season) lead to a moderate bias correction. The best combination might be a domain calibration based on the previous winter season, applied after domain bias correction based on the current winter season.
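The operational-style winterwise partition described above can be sketched as follows: the bias is estimated from the first 45 days of the season only, so that no future data enter the correction applied to the second half (array arguments and function names are hypothetical, not the paper's code):

```python
import numpy as np

def half_season_bias_correction(fc_mean, obs, ens, n_train=45):
    """Operational-style partition: the (domain or local) bias is computed
    from the first `n_train` days of a 90-day season and applied to the
    remaining days only. `fc_mean`, `obs`: per-day series; `ens`:
    (n_days, n_members) ensemble values."""
    bias = np.mean(np.asarray(fc_mean[:n_train]) - np.asarray(obs[:n_train]))
    # correct only the verification half of the season
    return np.asarray(ens[n_train:], float) - bias
```

The same train/apply separation would hold for calibration statistics, with the added difficulty that the 45 training days must be further split into probability categories.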

These results do not mean that local calibration and/or local bias correction are definitely useless. They still indicate that the positive effect of taking into account the spatial variability of the reliability cannot be demonstrated from the available data. One might consider that the results obtained from an “idealized” partition of the data give sufficient evidence of the positive effect of local calibration and/or local bias correction, so that a calibration and/or correction scheme can be based on the whole available sample, with no need for further validation. Proper evaluation would come later, when new data are available. On the other hand, the fact that no evaluation is possible for a long time is hardly acceptable in an operational environment where the performance of the methods has to be monitored in real time.

Results related to the efficiency of calibration have been obtained by applying a simple, arbitrarily designed calibration method, described in section 2d. Other methods exist that might give better results, in particular methods that consider the whole ensemble distribution instead of focusing on one particular probability threshold (Hamill and Colucci 1998). Correction of model biases has also followed a rather simple scheme, described in section 2d. Statistical techniques traditionally used for postprocessing numerical forecasts, for example, model output statistics methods, might be able to reduce model biases more efficiently, including conditional biases. In particular, the combined effect of statistical adaptation and calibration, along the lines discussed in this article, is likely to improve significantly the reliability of probabilistic forecasts of weather parameters.

## 5. Summary and concluding remarks

Reliability is an attribute of the quality of probabilistic forecasts. It indicates the correspondence between a given probability, and the observed frequency of an event in the case this event is forecast with this probability. The variability of the reliability of ECMWF ensemble forecasts has been investigated. Probabilistic forecasts of 850-hPa temperature anomaly have been considered from December 1996 to February 2000. The four consecutive winter seasons have been verified both together and separately. Local reliability has been estimated at 48 grid points over Europe. Domain reliability, considering all grid points together, has also been estimated. The effect of model bias variations on the spatial and interannual variability of the reliability has been studied. The consequences of this variability for calibrating probabilistic forecasts have been examined too. The main conclusions of the study are the following:

Reliability is highly variable in space and time. A proper evaluation thus requires a local, seasonwise verification in order to avoid an overestimation of the performance. On the other hand, stratification of the data implies a strong reduction of available samples, leading to an underestimation of the performance due to ill-sampling. A compromise must thus be found in order to avoid these two pitfalls.

Variation of model bias from one year to another is the main source of the interannual variability of the reliability. After bias correction, reliability is virtually constant during the considered period of four years (only winter seasons are considered), although two major changes have been implemented in the ECMWF EPS in 1998.

Local variations of model bias contribute highly to the spatial variability of the reliability. However, the impact of these variations is only revealed when considering each winter season separately, due to the large interannual variability of model bias.

Seasonwise, local bias correction leads to a substantial improvement of reliability. Applying a calibration scheme after this bias correction brings little further improvement. However, there exist local discrepancies of ensemble statistical behavior that contribute to the spatial variability of reliability independently of local variations of model bias.

Due to sampling limitations, local calibration and/or local bias correction are unlikely to bring any significant improvement of the reliability in operational conditions. A good practical scheme might be a combination of domain bias correction (based on the current season) and domain calibration (based on the previous year).

Calibration and bias correction require large samples of data for computing statistics. At the same time, validation of the methods implies that other, independent samples are available. This double requirement means that the implementation of an operational scheme, along the lines discussed in this article, is not an easy task.

Most results presented in this work have been obtained from an evaluation of the efficiency of calibration and bias correction schemes applied to the idealized partition of the data described in section 2a. As discussed in section 4d this evaluation gives an unrealistic picture of the performance of statistical corrections. The subject of the study is the variability of ensemble reliability, rather than the performance of statistical correction schemes. However, some credible results have been obtained about the relative efficiency of calibration and bias correction, depending on whether grid points and/or consecutive winter seasons are grouped together or considered separately.

Sampling limitations that lead to pernicious effects would be relaxed if the considered event were more frequent and/or if the number of probability categories were smaller (as discussed in section 4b). The choice of an anomaly above 1 standard deviation (prior probability around 0.17) is motivated by the fact that, in practice, forecasters generally express little interest in probabilities of events that occur more frequently (except forecasters in charge of seasonal forecasts). Forecasting rare events has always been more challenging than indicating a vague deviation from the normal. Most standard ensemble products, as well as most evaluations of the performance of ensembles, focus on relatively infrequent events.

The scope of this article might appear rather limited: a single lead time, a single quantity, a limited region. The choice of the temperature at 850 hPa is a compromise between the interest of end users and the availability of an acceptable reference for verification (i.e., a reliable analysis). This quantity is also subject to model bias characteristics that are close to those affecting surface “weather” quantities, for example, 2-m temperature. The choice of the lead time and the region was arbitrary. Further work will be necessary to determine whether the results presented here can be generalized to other lead times, other meteorological parameters, and other regions of the globe.

## Acknowledgments

O. Talagrand and an anonymous reviewer provided helpful comments on earlier versions of the manuscript. Part of the work described in this article has been supported by ECMWF as part of an expert visit by the author in Summer 2000.


## Footnotes

*Corresponding author address:* Frédéric Atger, Météo-France (DPREVI/COMPAS), 42, Av. G. Coriolis, 31057 Toulouse, France. Email: frederic.atger@meteo.fr