
The Skill of Ensemble Prediction Systems

European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom

Abstract

The performance of ensemble prediction systems (EPSs) is investigated by examining the probability distribution of 500-hPa geopotential height over Europe. The probability score (or half Brier score) is used to evaluate the quality of probabilistic forecasts of a single binary event. The skill of an EPS is assessed by comparing its performance, in terms of the probability score, to the performance of a reference probabilistic forecast. The reference forecast is based on the control forecast of the system under consideration, using model error statistics to estimate a probability distribution. A decomposition of the skill score is applied in order to distinguish between the two main aspects of the forecast performance: reliability and resolution. The contribution of the ensemble mean and the ensemble spread to the performance of an EPS is evaluated by comparing the skill score to the skill score of a probabilistic forecast based on the EPS mean, using model error statistics to estimate a probability distribution.

The performance of the European Centre for Medium-Range Weather Forecasts (ECMWF) EPS is reviewed. The system is skillful (with respect to the reference forecast) from +96 h onward. There is some skill from +48 h in terms of reliability. The performance comes mainly from the contribution of the ensemble mean. The contribution of the ensemble spread is slightly negative, but becomes positive after a calibration of the EPS standard deviation. The calibration improves predominantly the reliability contribution to the skill score. The calibrated EPS is skillful from +72 h onward.

The impact of ensemble size on the performance of an EPS is also investigated. The skill score of the ECMWF EPS decreases steadily as the number of ensemble members is reduced, and the resolution is particularly affected. The impact is mainly due to the ensemble spread contributing negatively to the skill. The contribution of the ensemble mean to the skill decreases only marginally when the ensemble size is reduced down to 11 members.

The performance of the U.S. National Centers for Environmental Prediction (NCEP) EPS is also reviewed. The NCEP EPS has a lower skill score (vs a reference forecast based on its control forecast) than the ECMWF EPS especially in terms of reliability. This is mainly due to the smaller spread of the NCEP EPS contributing negatively to the skill. On the other hand, the NCEP and ECMWF ensemble means contribute similarly to the skill. As a consequence, the performance of the two systems in terms of resolution is comparable.

The performance of a poor man’s EPS, consisting of the forecasts of different NWP centers, is discussed. The poor man’s EPS is more skillful than either the ECMWF EPS or the NCEP EPS up to +144 h, despite a negative contribution of the spread to the skill score. The higher skill of the poor man’s EPS is mainly due to a better resolution.

Corresponding author address: Dr. Frédéric Atger, Météo-France (SCEM/PREVI), 42, av. G. Coriolis, 31057 Toulouse Cedex, France.

Email: frederic.atger@meteo.fr


1. Introduction

Until the advent of ensemble prediction systems (EPSs), the evaluation of the quality of numerical weather forecasts was essentially based on a space–time comparison between forecast and verifying values, with only one forecast value and one verification value occurring at the same time and the same place. Since December 1992, both the U.S. National Centers for Environmental Prediction (NCEP) and the European Centre for Medium-Range Weather Forecasts (ECMWF) have produced operational forecasts based on ensemble prediction (Tracton and Kalnay 1993; Palmer et al. 1993). An EPS is a prediction system designed to provide an ensemble of N forecasts of the meteorological state, considered as N independent realizations of a predicted probability distribution. The quality evaluation of an EPS should thus be based on the verification of a probability distribution. This implies that the forecast error cannot be estimated from a simple comparison between a forecast value and a verifying value: while the forecast is a distribution of values, the basic verification is still a single value. Two different approaches can be followed to solve this dilemma:

  • The quality of a single probability distribution forecast (one time, one location) is estimated from the conditional probability that the actual verification occurs, given the probability distribution (Wilson 1995). In this Bayesian approach, the performance depends on two independent aspects: (i) how close the distribution mode is to the verifying value and (ii) how sharp the distribution is.

  • The quality of a single forecast is not estimated per se, but a set of probability distribution forecasts is compared with the distribution of the corresponding verification values. The performance depends again on two aspects: (i) the correspondence between the predicted probability and the actual frequency of occurrence and (ii) the prominence of lower and higher probabilities, which are more meaningful than probabilities close to the climatological frequency.

The more common second approach has been adopted in this study of the quality of ensemble prediction systems. The performance has been assessed by comparing the distribution of ensemble forecasts to a reference distribution based on the control forecast of the considered EPS. A skill score resulting from this approach is defined in section 2. A decomposition of the skill score, according to the two independent aspects of the performance mentioned above, is proposed in section 3. The performance of the ECMWF EPS is reviewed in section 4. The question of the impact of reducing the ensemble size is addressed in section 5. The ECMWF EPS and the NCEP EPS are compared in section 6. A poor man’s scheme is used in section 7 to investigate the cost efficiency of an EPS. The main results of the study are summarized and discussed in section 8.

2. Skill score

A skill score measures the ability of a forecast to obtain a better score than a reference forecast. Skill scores are usually computed using either the climatology or the persistence of initial conditions as a reference forecast. As far as the performance of an EPS is concerned, the skill versus the climatology indicates the overall performance of the system (e.g., Buizza 1997) but does not allow one to distinguish the relative performance of the model on which it is based from the performance of the EPS itself, that is, the generation of the perturbed forecasts. In this study the skill of an EPS is assessed using the single control forecast of the EPS as a reference, in order to judge the benefit of the extra ensemble members.

The score used in this study is the (half) Brier score (Brier 1950), or probability score (PS), that is, the average square difference between the forecast probability $p_i$ and the observation $o_i$, with $o_i = 1$ if the event is observed and $o_i = 0$ if not. For $N$ individual forecasts,

$$\mathrm{PS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2.$$
The probability skill score (PSS) is conventionally defined as the improvement of the probability score relative to the probability score of the reference forecast, $\mathrm{PS}_{\mathrm{ref}}$. It can be expressed as a percentage of the potential improvement over the reference score:

$$\mathrm{PSS} = \frac{\mathrm{PS}_{\mathrm{ref}} - \mathrm{PS}}{\mathrm{PS}_{\mathrm{ref}}}.$$
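As a concrete illustration of these two definitions, a minimal Python sketch follows; the function names and sample probabilities are illustrative only and are not taken from the paper.

```python
import numpy as np

def probability_score(p, o):
    """Half Brier score: mean squared difference between forecast
    probabilities p and binary outcomes o (1 if observed, 0 if not)."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    return np.mean((p - o) ** 2)

def probability_skill_score(p_fcst, p_ref, o):
    """Skill relative to a reference forecast, as a fraction of the
    potential improvement over the reference probability score."""
    ps, ps_ref = probability_score(p_fcst, o), probability_score(p_ref, o)
    return (ps_ref - ps) / ps_ref

# Hypothetical forecasts of a binary event at a few point/day pairs.
p_eps = [0.1, 0.7, 0.3, 0.9, 0.0]   # forecast under evaluation
p_ref = [0.2, 0.5, 0.2, 0.6, 0.1]   # reference forecast
obs   = [0,   1,   0,   1,   0]     # verifying occurrences
print(probability_skill_score(p_eps, p_ref, obs))
```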

The performance of different EPSs has been assessed regarding the forecast of a simple binary event, the 500-hPa geopotential height anomaly exceeding 50 m (above +50 m or below −50 m), at 140 grid points over Europe (75°N, 20°W; 30°N, 45°E) every 5° of latitude and longitude, weighted according to the latitude. The period of verification is from 10 December 1996 to 28 February 1997 (starting when the improved, higher-resolution ECMWF EPS was implemented). The total number of individual events considered in the study (140 points × 81 days = 11 340), although not independent, is believed to be large enough to get significant results. The sensitivity of the results to the choice of the 50-m threshold is discussed in section 8.

The reference forecast is a Gaussian distribution of the 500-hPa geopotential height, with a mean equal to the EPS control forecast, and a standard deviation equal to the standard deviation of the control forecast error estimated over an independent winter season (10 December 1995–28 February 1996).

In order to get a comparison as fair as possible between the reference forecast and the EPS forecast, the same number of probability categories, depending on the number of ensemble members of the considered EPS, has been used for both systems, that is, N + 1 categories from 0/N to N/N for evaluation of an N-member EPS.
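The construction of these reference probabilities can be sketched as follows; the anomaly value, error standard deviation, and function names are placeholders for illustration only (the paper itself gives no code).

```python
import numpy as np
from scipy.stats import norm

def reference_event_probability(control_anomaly, error_std, threshold=50.0):
    """Probability that the 500-hPa height anomaly exceeds the threshold in
    magnitude, under a Gaussian centred on the control forecast anomaly with
    a standard deviation taken from past control-error statistics."""
    p_above = 1.0 - norm.cdf(threshold, loc=control_anomaly, scale=error_std)
    p_below = norm.cdf(-threshold, loc=control_anomaly, scale=error_std)
    return p_above + p_below

def discretize(p, n_members):
    """Round a probability to the nearest of the N + 1 categories
    0/N, 1/N, ..., N/N used for an N-member EPS."""
    return np.round(np.asarray(p) * n_members) / n_members

# Hypothetical case: +30-m control anomaly, 40-m control-error std, 50-member EPS.
print(discretize(reference_event_probability(30.0, 40.0), n_members=50))
```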

3. Skill score decomposition

The probability score decomposition proposed by Murphy (1973) has been used in order to distinguish the two main aspects of the forecast performance: statistical consistency, or reliability, and variability, or resolution. Murphy’s decomposition consists of three terms:
$$\mathrm{PS} = \frac{1}{N} \sum_{k=1}^{T} n_k \,(p_k - \bar{o}_k)^2 \;-\; \frac{1}{N} \sum_{k=1}^{T} n_k \,(\bar{o}_k - \bar{o})^2 \;+\; \bar{o}\,(1 - \bar{o}),$$

where the sample of $N$ forecasts has been divided into $T$ categories, each comprising $n_k$ forecasts of a probability $p_k$; $\bar{o}_k$ is the observed frequency when the forecast lies in category $k$, and $\bar{o}$ is the observed frequency in the whole sample.
  • The first term is the reliability, that is, the average square difference between the forecast probability and the observed frequency in the different categories. This term indicates the ability of the system to forecast accurate probabilities, so that, for example, an observed frequency of 30% can be expected when a 30% probability is forecast. The reliability is negatively oriented, as is the probability score: the lower the reliability, the better. A perfect reliability (0) is indicated by a curve lying along the diagonal of a reliability diagram, that is, a plot of forecast probability versus observed frequency (Fig. 1).

  • The second term is the resolution, that is, the average square difference between the observed frequency in each category and the mean frequency observed in the whole sample. This term indicates the ability of the forecast to separate the different categories, whatever the forecast probabilities. For a given reliability, the resolution thus indicates the sharpness of the forecast (Fig. 1): the maximum resolution corresponds to a deterministic forecast (only 0% and 100% are forecast), the minimum resolution corresponds to a climatological forecast (only one probability is forecast). The resolution is positively oriented: the higher the resolution, the better.

  • The third term is the uncertainty, that is, the variance of the observations, indicating the intrinsic difficulty in forecasting the event during the period. It is also the probability score of the sample climatology forecast. The uncertainty is obviously independent of the forecast system: being the same for the reference forecast and the forecast under evaluation, it plays no role in the skill score.

The skill score defined above can thus be decomposed into two terms, positively oriented, indicating (i) the skill due to the reliability and (ii) the skill due to the resolution:
$$\mathrm{PSS} \;=\; \frac{\mathrm{Rel}_{\mathrm{ref}} - \mathrm{Rel}}{\mathrm{PS}_{\mathrm{ref}}} \;+\; \frac{\mathrm{Res} - \mathrm{Res}_{\mathrm{ref}}}{\mathrm{PS}_{\mathrm{ref}}},$$

where Rel and Res are the reliability and resolution terms of the forecast under evaluation and the subscript ref denotes the reference forecast; the uncertainty term, common to both forecasts, cancels.

Reliability and resolution are independent. For example, if the observed frequency is 90% in the 10% probability category, and 10% in the 0% probability category, the resolution is high but the reliability is poor. For operational purposes, the resolution term is the most relevant, since the reliability, as any bias, can generally be improved by a calibration. This can be done for instance by replacing forecast probabilities by the actual frequencies observed during a previous season in the same categories (Zhu et al. 1996). The reliability improvement is obtained at the expense of sharpness, as illustrated by the histograms in Fig. 1 showing the distribution of predicted probabilities. The resolution is not modified by the calibration if the number of categories remains the same.
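A minimal sketch of this decomposition, and of the corresponding split of the skill score, is given below; it assumes the forecast probabilities have already been assigned to a fixed set of categories, and all names are illustrative.

```python
import numpy as np

def murphy_decomposition(p, o, categories):
    """Split the probability score into reliability, resolution and
    uncertainty, for probabilities p restricted to the given categories
    and binary outcomes o."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    n, o_bar = len(p), o.mean()
    rel = res = 0.0
    for pk in categories:
        mask = np.isclose(p, pk)
        nk = mask.sum()
        if nk == 0:
            continue
        ok = o[mask].mean()            # observed frequency in this category
        rel += nk * (pk - ok) ** 2     # reliability term (lower is better)
        res += nk * (ok - o_bar) ** 2  # resolution term (higher is better)
    unc = o_bar * (1.0 - o_bar)        # uncertainty (sample climatology PS)
    return rel / n, res / n, unc

def skill_decomposition(p_fcst, p_ref, o, categories):
    """Reliability and resolution contributions to the skill score."""
    rel_f, res_f, unc = murphy_decomposition(p_fcst, o, categories)
    rel_r, res_r, _ = murphy_decomposition(p_ref, o, categories)
    ps_ref = rel_r - res_r + unc
    return (rel_r - rel_f) / ps_ref, (res_f - res_r) / ps_ref
```

In this sketch, calibration in the sense of Zhu et al. (1996) would simply replace each forecast probability pk by the observed frequency ok estimated over a training season, leaving the resolution term unchanged for a fixed set of categories.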

4. The skill of the ECMWF EPS

The ECMWF ensemble prediction system comprises (in its current high-resolution version, implemented in December 1996) 50 perturbed integrations and one unperturbed control integration of a TL159L31 version of the ECMWF model (Simmons et al. 1989; Courtier et al. 1991). The ECMWF EPS methodology is described in Molteni et al. (1996), while the advantages of the more recent, higher-resolution system are discussed in Buizza et al. (1998).

a. The reference forecast

The reference forecast is based on the control forecast as defined in section 2. The TL159 version of the ECMWF model, on which the EPS was based in winter 1996/97, was not operational in winter 1995/96. For this reason, the standard deviation of the control forecast error was estimated from the T213 ECMWF model instead, assuming that the two models are close enough to lead to the same error distribution characteristics over a season. The reference forecast is particularly good in the early medium range. The reliability diagram at +96 h (Fig. 2) exhibits good reliability and a sharp distribution, leading to a low (i.e., good) probability score. This is no longer the case as the forecast range increases: the distribution becomes flatter and the forecast less reliable.

The skill score of the reference forecast has been computed with respect to a climatological forecast based on the ECMWF 15-yr reanalysis (Gibson et al. 1997), then decomposed as indicated in section 3. The reference forecast has some skill versus the climatological forecast up to +240 h. The resolution is the only source of skill, while the reliability component tends to become slightly negative in the late medium range (Fig. 3). The climatological forecast has optimal reliability, since its distribution is expected to match the distribution of the sample observations, except in an anomalous period. A similar behavior is observed for the reference forecast, since it is based on the model error distribution. The loss of reliability in the late medium range may be due to the fact that the model error is not independent of the forecast values, and also to slight differences between the T213 ECMWF model used for the error statistics and the TL159 model actually used by the EPS.

b. The skill score of the ECMWF EPS forecast

The skill score of the ECMWF EPS forecast, with respect to the reference forecast described above, is shown in Fig. 4a (solid line, labeled EPS). The EPS forecast has negative skill in the short range, as expected since the singular vectors have a 48-h optimization time interval (the system is designed for medium-range prediction), and it starts to be skillful around +96 h. A 10% skill score is reached at +168 h, and the maximum of 19% is reached at +240 h. The decomposition of the skill score shows that the reliability component (Fig. 4b) starts to be slightly positive earlier (+48 h) than the resolution component (Fig. 4c). On the other hand, the resolution component grows faster from +96 h onward.

The reference forecast used in this study is based on a certain knowledge of the model error distribution, obtained from past verification statistics. The EPS forecast, unless it is calibrated (as later in this paper), does not assume such knowledge, so its comparison to the reference forecast might be considered disadvantageous. A more neutral, less skillful reference forecast is obtained by replacing the standard deviation of the model error by the climate monthly standard deviation. This forecast is worse than the actual reference forecast only in the short range (not shown). This is due to its lack of reliability, the climate variability being much larger than the expected forecast uncertainty. On the other hand, its resolution is virtually the same as that of the actual reference forecast at all lead times (not shown).

The skill score of a Gaussian distribution based on the ensemble mean and standard deviation is shown in Fig. 4a (dashed line, labeled EM+stdev). In this case, only the first two moments are used and the distribution is assumed to be monomodal and symmetric. Yet the skill score is almost exactly the same as that of the original distribution, especially in terms of resolution (Fig. 4c). This seems to indicate that, as far as the probability score is concerned, the main information that can be extracted from an EPS distribution is already contained in the mean and standard deviation of this distribution.
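For concreteness, the two ways of deriving the event probability from an ensemble at a single grid point might look as follows; the member values and the handling of the 50-m threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def raw_event_probability(members, threshold=50.0):
    """Fraction of ensemble members whose anomaly exceeds the threshold in
    magnitude; takes one of N + 1 values for an N-member ensemble."""
    members = np.asarray(members, float)
    return float(np.mean(np.abs(members) > threshold))

def gaussian_event_probability(members, threshold=50.0):
    """Same event probability from a Gaussian fitted to the ensemble mean
    and standard deviation (the EM+stdev forecast)."""
    mu, sigma = np.mean(members), np.std(members)
    return (1.0 - norm.cdf(threshold, mu, sigma)) + norm.cdf(-threshold, mu, sigma)

# Hypothetical 51-member sample of 500-hPa height anomalies (m) at one grid point.
members = np.random.default_rng(0).normal(35.0, 35.0, size=51)
print(raw_event_probability(members), gaussian_event_probability(members))
```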

c. Ensemble mean and ensemble spread contributions to the skill score

Figure 4a also shows the skill score of a Gaussian distribution based only on the ensemble mean, using an estimate of the standard deviation of the control forecast error (as for the reference forecast) instead of the EPS standard deviation (dotted line, labeled EM+error). Here the ensemble standard deviation is ignored, the distribution still being assumed to be monomodal and symmetric. This forecast is slightly more skillful than the EPS forecast, suggesting that all the information that can be extracted from the EPS distribution is already contained in the mean of the distribution. The spread appears to have no impact in terms of resolution (Fig. 4c) and a negative impact in terms of reliability (Fig. 4b).

The ECMWF EPS standard deviation is well known to be too small on average compared with the control error, although the two tend to be correlated (Buizza 1997). A way to highlight this “hidden” information is to calibrate the EPS standard deviation by taking this correlation into account, as known from past statistics. Figure 5 shows that there is indeed a linear relation between the mean value of different categories of EPS standard deviation and the standard deviation of the control error in these categories during winter 1996/97. Assuming that this relation could have been known from previous months or years of data (this is not actually the case, since the TL159 EPS has only run since December 1996), a forecast distribution based on the ensemble mean and the corrected ensemble standard deviation has been constructed. The main, positive impact of the correction is on the reliability part of the skill score, the resolution being improved only in the short range. Overall, the calibrated forecast has some skill from +72 h onward and is even slightly more skillful than the distribution based on the ensemble mean and the standard deviation of the control error [Fig. 4a, dotted–dashed line, labeled EM+corr(stdev)].
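One way such a calibration could be implemented is sketched below; the choice of 10 equally populated spread categories follows Fig. 5, but the linear fitting procedure itself is an assumption of this illustration rather than a documented detail of the paper.

```python
import numpy as np

def calibrate_spread(spread_train, control_error_train, spread_new, n_bins=10):
    """Fit a linear relation between binned EPS spread and the standard
    deviation of the control error over a training period, then apply it
    to new spread values."""
    s = np.asarray(spread_train, float)
    e = np.asarray(control_error_train, float)
    edges = np.quantile(s, np.linspace(0.0, 1.0, n_bins + 1))    # equally populated bins
    idx = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, n_bins - 1)
    mean_spread = np.array([s[idx == k].mean() for k in range(n_bins)])
    error_std = np.array([e[idx == k].std() for k in range(n_bins)])
    slope, intercept = np.polyfit(mean_spread, error_std, 1)     # linear calibration
    return slope * np.asarray(spread_new, float) + intercept
```

The corrected spread would then replace the raw ensemble standard deviation in the Gaussian forecast based on the ensemble mean.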

Nevertheless, the ensemble mean appears to be the main source of skill. In a recent paper, Hamill and Colucci (1998) found a similar result when assessing the performance of short-range ensemble prediction. With regard to the synoptic fields, the main benefit of the ensemble mean compared with the control forecast is to smooth out small-scale, unpredictable features while retaining the more slowly varying large-scale pattern (Leith 1974). A similar effect is achieved by filtering high-resolution deterministic forecasts, for instance by retaining only the leading components of a spectral decomposition. The skill score of a filtered forecast, obtained by an arbitrary T10 truncation of the control forecast, becomes positive only from +144 h and remains far smaller than the skill score of a forecast based on the ensemble mean and the standard deviation of the control error (Fig. 4a, long dashed line, labeled T10+error).

The comparatively poor contribution of the ensemble spread to the skill, even after calibration, might be related to an insufficient number of ensemble members, causing poorer sampling of the higher moments of the distribution. However, it should be stressed that the good performance of the ensemble mean is obviously due to the ensemble spread being reliable enough to “shift” the mean of the distribution away from the control forecast. The decomposition proposed in this section allows an assessment of the relative importance of reliable information already contained in the ensemble mean versus the residual information that can be found in the spread itself. Therefore, the results presented here are not in contradiction with several studies showing the value of ensemble spread in forecasting the forecast skill (e.g., Buizza 1997; Toth et al. 1998).

5. Smaller ensembles

This section addresses the issue of the impact of ensemble size on the skill score. Smaller ensembles, constructed using the ECMWF EPS control and the first 4, 10, and 32 perturbed members as in Buizza and Palmer (1998), were compared with the whole ensemble (control + 50 members) in terms of skill score. Note that the initial conditions of the first perturbed members are still obtained from all 25 singular vectors (SVs), since the perturbations are combinations of SVs. This comparison thus favors smaller ensembles and only addresses the question of the number of integrations that are needed. As in the previous section, the skill score has been computed for three different distributions: (i) the raw EPS distribution, (ii) a Gaussian distribution based on the ensemble mean and standard deviation, and (iii) a Gaussian distribution based on the ensemble mean and the standard deviation of the control forecast error.

As expected, the skill score of the raw distribution decreases when the EPS population is reduced (Fig. 6a). The performance is almost the same from 51 down to 33 members, then decreases rapidly. These results are similar to those obtained recently by Talagrand et al. (1998) for 850-hPa temperature probabilistic forecasts. Resolution and reliability behave differently in this respect. Reliability is almost unaffected from 51 down to 11 members, especially in the shorter ranges (Fig. 6b). Resolution decreases more steadily from 51 down to 5 members, especially in the shorter ranges (Fig. 6c).

The skill score of the distribution based on the ensemble mean and standard deviation tends to be larger than that of the raw distribution when the ensemble membership is reduced to 11 and 5 members (Fig. 7). The probability score depends on the number of forecast categories: increasing the number of realizations extracted from the same statistical population has the effect of smoothing the noise due to the finite sample size. In the case of the distribution based on the ensemble mean and standard deviation, the number of categories is arbitrarily fixed at 52, as for the reference forecast, whereas the number of categories is obviously limited to the number of members when considering the raw distribution.

The decrease of the skill score when reducing the ensemble size is limited when considering the distribution based on the ensemble mean and the standard deviation of the control forecast error: the skill score is significantly reduced only for a five-member ensemble (Fig. 8). The mean of a smaller ensemble is obviously similar to the mean of the whole ensemble, so the positive impact on skill when increasing the EPS population is due only to the ensemble spread being significantly better.

Increasing ensemble size has a definite impact on the average amplitude of the spread. As shown in Table 1, the spread increases by approximately 10%–16% from 5 to 11 members, and by 6%–9% from 11 to 33 members extracted from the same subset. These figures are larger than expected from theory. The standard deviation, STD, of an ensemble of N elements randomly extracted from a distribution of standard deviation std is given by
$$\mathrm{STD} = \sqrt{\frac{N-1}{N}}\;\mathrm{std},$$
which is an increase of 7% from 5 to 11 members, and 3% from 11 to 33 members. The observed values indicate that extra members do not belong exactly to the same statistical population as the first members. Increasing the size of an EPS leads not only to a better sampling, but also to a significant improvement of the distribution.
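Assuming the formula above, the quoted theoretical increases follow directly:

$$\frac{\sqrt{10/11}}{\sqrt{4/5}} \approx \frac{0.954}{0.894} \approx 1.07 \;(\approx +7\%), \qquad \frac{\sqrt{32/33}}{\sqrt{10/11}} \approx \frac{0.985}{0.954} \approx 1.03 \;(\approx +3\%).$$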

The larger spread explains the better reliability of larger ensembles (Fig. 6b). But the resolution component of the skill score is also improved when increasing ensemble size (Fig. 6c). The resolution does not depend on the average amplitude of the spread but rather on its day-to-day variations (see section 3). This variability seems to be improved by increasing ensemble size.

6. Comparison of ECMWF EPS with NCEP EPS

In this section the skill score has been used to compare two different operational ensemble prediction systems: the ECMWF EPS (50 + 1 members) and the NCEP 0000 UTC EPS (10 + 1 members). The reference forecast has been constructed separately from each EPS's own control forecast, in order to limit the dependence of the performance on the model on which each system is based. As in previous sections, the skill score was computed for three different distributions: (i) the raw EPS distribution, (ii) a Gaussian distribution based on the ensemble mean and standard deviation, and (iii) a Gaussian distribution based on the ensemble mean and the standard deviation of the control forecast error. Where possible [(ii) and (iii)], the number of categories used for the computations was the same (52) for both systems, in order to make the comparison as fair as possible. The same analysis (the ECMWF analysis) was used to verify both systems. Using the NCEP analysis to verify the NCEP EPS might have had a slight positive impact on the NCEP skill, but this was not possible at the time of the study for technical reasons.

The NCEP ensemble prediction system based on 0000 UTC consists of 10 perturbed and 1 unperturbed control integrations of a T62 version of the NCEP model. The main difference from the ECMWF EPS, besides the number of members, lies in the way the initial perturbations are generated. While the singular vector technique is used at ECMWF (Molteni et al. 1996), the NCEP perturbations are obtained through the method of breeding of growing modes (Toth and Kalnay 1997).

The NCEP EPS proves less skillful overall than the ECMWF EPS (cf. Fig. 9a and Fig. 4a). The raw distribution starts to be skillful at +144 h only (+96 h for the ECMWF EPS) and reaches a maximum of 10% at +240 h (19% for the ECMWF EPS). The distribution based on the ensemble mean and standard deviation is even worse.

The performance of the distribution based on the ensemble mean and the standard deviation of the control forecast error is rather similar for the two systems (cf. Fig. 9a and Fig. 4a). The NCEP EPS is slightly better in the early medium range, while the ECMWF EPS is slightly better in the late medium range. A similar behavior was observed by Zhu et al. (1996) in terms of different verification statistics for an independent winter season. The differences between the two systems at earlier lead times might be related to differences in the initial perturbations; one would expect in this case a better performance of the NCEP EPS as early as +24 h, which is not observed. At later lead times the differences in model performance might explain the better performance of the ECMWF EPS; one would expect in this case a similar performance of the NCEP EPS and the ECMWF EPS in winter 1995/96, when the systems were based on models that compared better in terms of model resolution, which is not observed either (Zhu et al. 1996). Further investigation is needed to fully understand the differences between the performance of the two systems and to determine whether they are significant.

The comparison of the curves labeled EM+error and EM+stdev shown in Fig. 9a clearly indicates a strong negative impact of the spread on the overall performance of the NCEP EPS, even worse than noticed for the ECMWF EPS. This is obviously a consequence of the lack of spread of the NCEP EPS (Toth and Kalnay 1997). This lack of spread is not only due to the small ensemble size. Although Figs. 9a (NCEP) and 6a (ECMWF 10 + 1 members) compare rather well, Table 1 shows that the NCEP EPS spread is on average smaller than the spread of 11 members extracted from the ECMWF ensemble. The difference between the two systems in this respect is probably related to the better model climatology of the ECMWF EPS, partly due to a higher horizontal resolution.

It has been mentioned in previous sections that a lack of spread has little impact on the resolution. Figures 9b and 9c (cf. Figs. 4b and 4c) show that the performance of the two systems is more similar in terms of resolution than in terms of reliability. The better reliability of the ECMWF EPS beyond +48 h is obviously a consequence of its larger spread. Again, this is not only due to a larger number of members, as shown by the comparison of Figs. 6b (ECMWF 10 + 1 members) and 9b (NCEP). On the other hand, the larger spread of the NCEP EPS up to +48 h (due to the breeding method) does not lead to a better reliability.

7. Poor man’s ensemble prediction system

In this section the cost efficiency of the ECMWF EPS is addressed by comparison with the performance of a poor man’s scheme using current forecasts from various centers, which can be regarded as a cost-free ensemble prediction system. The poor man’s EPS is defined here as the following collection of forecasts: the ECMWF EPS control forecast (EC), the Deutscher Wetterdienst global model forecast (DW), the United Kingdom Meteorological Office unified model forecast (UK), and the NCEP EPS control forecast (NC). Both NC and EC are evaluated out to +240 h, DW to +168 h, and UK to +144 h.

With such a small ensemble (two to four members, depending on the lead time), the score of the raw distribution is unlikely to be significant (see section 5). The poor man’s skill score has thus been computed for two distributions: (i) a Gaussian distribution based on the ensemble mean and standard deviation and (ii) a Gaussian distribution based on the ensemble mean and the standard deviation of the control forecast error. The reference forecast is the same as in sections 4 and 5, based on the ECMWF EPS control forecast.

The distribution based on the poor man’s ensemble mean and the standard deviation of the control forecast error has an impressively high skill score, much better in the short range and early medium range than the forecasts based on the ECMWF EPS or the NCEP EPS (Fig. 10a). The situation is reversed around +144 h as far as the two-member (EC–NC) poor man’s ensemble is concerned. The four-member poor man’s ensemble (EC–NC–DW–UK) runs only to +144 h, when it is still more skillful than the ECMWF EPS or the NCEP EPS forecasts. The skill difference is only due to a much better resolution of the poor man’s scheme (Fig. 10c). The reliability of the poor man’s EPS is equivalent to the reliability of either the ECMWF EPS or the NCEP EPS (Fig. 10b).

The distribution based on the poor man’s ensemble mean and standard deviation is not as skillful, especially as far as the two-member scheme is concerned (Fig. 11). This indicates that the skill of the poor man’s scheme is exclusively due to its ensemble mean, the spread contributing negatively to the skill. This confirms the results presented in section 5 concerning the impact of reducing ensemble size on the spread contribution to the skill score. Still the distribution based on the four-member poor man’s ensemble mean and standard deviation is more skillful than the distributions based on the ECMWF EPS or the NCEP EPS mean and standard deviation up to +144 h.

The results presented in this section suggest that it would be worth considering a multianalysis and/or multimodel approach alongside the operational ensemble prediction techniques used at ECMWF and NCEP, as proposed by Harrison et al. (1995). The benefit of combining numerical forecasts from different NWP centers was demonstrated by Rousseau and Chapelet (1985) more than a decade before this study. Ziehmann (1998) presented some evidence that in some circumstances one might get better results using a limited number of forecasts from different NWP centers rather than the ECMWF EPS. Comparable results have been obtained by Talagrand et al. (1998) with a poor man’s scheme rather different from the one used in the present study.

8. Summary and concluding remarks

The performance of ensemble prediction systems has been investigated with respect to the probability distribution of 500-hPa geopotential height over Europe. The probability score has been used to assess the performance of probabilistic forecasts. A skill score has been defined by comparison to a reference forecast based on the control forecast of the system under consideration. A decomposition of the skill score has been applied in order to distinguish between the reliability and the resolution aspects of the performance.

The following results have been highlighted:

  • The ECMWF EPS forecast is skillful compared to the reference forecast from +96 h onward.

  • Most of the ECMWF EPS skill is obtained from a Gaussian distribution based on the ensemble mean. The ensemble spread contributes negatively to the skill score.

  • The reliability increases significantly after calibration of the EPS standard deviation, leading to a skillful forecast from +72 h onward.

  • The skill score of the ECMWF EPS decreases when reducing ensemble size. This decrease comes mainly from the contribution of the ensemble spread to the score.

  • The NCEP EPS has a lower skill score than the ECMWF EPS at all time steps, especially in terms of reliability. This seems to be a consequence of the insufficient spread of the NCEP EPS.

  • The NCEP and ECMWF ensemble means contribute similarly to the skill. The performance of the two systems in terms of resolution is comparable.

  • A poor man’s EPS is more skillful than either the ECMWF EPS or the NCEP EPS, up to +144 h. This is due to a better resolution, despite a negative contribution of the spread.

These results have been obtained with regard to a single binary event: the 500-hPa geopotential height anomaly exceeding 50 m in magnitude. This threshold has been chosen so that the occurrence of the event is roughly 20% (Table 2). One could argue that a higher threshold would have been a better choice to assess the performance of ensemble prediction systems in providing information on the most extreme events. Table 2 shows the ECMWF EPS skill score increasing slightly as the threshold varies from 25 to 150 m, while the resolution component remains rather constant. The Gaussian distribution based on the ensemble mean and the standard deviation of the control error is more skillful than the raw distribution for lower thresholds: up to 100 m for the skill score, but only up to 50 m for the resolution component. Beyond these limits the contribution of the ensemble spread to the skill score becomes positive.

One important issue is the validity of the results presented in section 7 (poor man’s EPS leading ECMWF EPS and NCEP EPS) for higher thresholds. Figure 12 shows the same comparison as Fig. 10a for a 150-m threshold, corresponding to a rather extreme event since it occurs in less than 8% of occasions (Table 2). The result is the same as pointed out in section 7: the poor man’s scheme is better than the ECMWF EPS and the NCEP EPS up to +144 h.

A second limitation of the study is the use of a single performance criterion, the skill score based on the probability score. Similar conclusions can be drawn from a different verification approach based on the relative operating characteristic (ROC) curve (Mason 1982). The ROC curve is a plot of the hit rate as a function of the false alarm rate of a series of deterministic forecasts, obtained from the probability distribution by considering several probability thresholds, from p = 0% (event systematically forecast) to p = 100% (event never forecast). As pointed out by Stanski et al. (1989), the ROC curve says nothing about reliability, since it is based on a stratification by observations. Therefore it is not surprising to find that results from this approach are similar to those concerning the resolution component of the skill score used in this paper. An example of this similarity is shown in Fig. 13, which is equivalent to Fig. 10a but for the area below the ROC curve.
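A minimal sketch of the ROC-area computation used as this alternative criterion is given below; the number of probability thresholds and the trapezoidal integration are assumptions of the illustration, not details taken from the paper.

```python
import numpy as np

def roc_area(p, o, n_thresholds=53):
    """Area below the ROC curve: hit rate vs false alarm rate of the yes/no
    forecasts obtained by thresholding the forecast probabilities."""
    p, o = np.asarray(p, float), np.asarray(o, int)
    n_event = max((o == 1).sum(), 1)
    n_nonevent = max((o == 0).sum(), 1)
    hit, far = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        yes = p >= t
        hit.append((yes & (o == 1)).sum() / n_event)      # hit rate
        far.append((yes & (o == 0)).sum() / n_nonevent)   # false alarm rate
    x, y = np.array(far), np.array(hit)
    order = np.argsort(x)
    x, y = x[order], y[order]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))  # trapezoidal area
```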

Acknowledgments

Zoltan Toth, Roberto Buizza, Tim Palmer, and François Lalaurette, as well as three anonymous reviewers, provided helpful comments on earlier versions of this manuscript. Acknowledgment is also made to David Stephenson who contributed to the correctness of the text. Special thanks are expressed to Olivier Talagrand for his constant help and support.

REFERENCES

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

  • Buizza, R., 1997: Potential forecast skill of ensemble prediction, and spread and skill distributions of the ECMWF Ensemble Prediction System. Mon. Wea. Rev., 125, 99–119.

  • ——, and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.

  • ——, T. Petroliagis, T. N. Palmer, J. Barkmeijer, M. Hamrud, A. Hollingsworth, A. Simmons, and N. Wedi, 1998: Impact of model resolution and ensemble size on the performance of an ensemble prediction system. Quart. J. Roy. Meteor. Soc., 124, 1935–1960.

  • Courtier, P., C. Freyder, J. F. Geleyn, F. Rabier, and M. Rochas, 1991: The Arpege project at Meteo-France. Proc. Seminar on Numerical Methods in Atmospheric Models, Vol. 2, Reading, United Kingdom, ECMWF, 192–231.

  • Gibson, J. K., P. Kallberg, S. Uppala, A. Hernandez, A. Nomura, and E. Serrano, 1997: ERA description. ECMWF Re-Analysis Project Report Series, Vol. 1, 72 pp.

  • Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724.

  • Harrison, M. S. J., T. N. Palmer, D. S. Richardson, R. Buizza, and T. Petroliagis, 1995: Joint ensembles from the UKMO and ECMWF models. Proc. Seminar on Predictability, Vol. 2, Reading, United Kingdom, ECMWF, 61–120.

  • Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. Mon. Wea. Rev., 102, 409–418.

  • Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.

  • Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.

  • Murphy, A., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.

  • Palmer, T. N., F. Molteni, R. Mureau, R. Buizza, P. Chapelet, and J. Tribbia, 1993: Ensemble prediction. Proc. Seminar on Validation of Models over Europe, Vol. 1, Reading, United Kingdom, ECMWF, 21–66.

  • Rousseau, D., and P. Chapelet, 1985: A test of the Monte-Carlo method using the WMO/CAS Intercomparison Project data. Report of the second session of the CAS working group on short- and medium-range weather prediction research, WMO/TD 91, PSMP Rep. Series 18, 114 pp.

  • Simmons, A. J., D. M. Burridge, M. Jarraud, C. Girard, and W. Wergen, 1989: The ECMWF medium-range prediction models: Development of the numerical formulations and the impact of increased resolution. Meteor. Atmos. Phys., 40, 28–60.

  • Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. WMO/WWW Tech. Rep. 8, 114 pp.

  • Talagrand, O., R. Vautard, and B. Strauss, 1998: Evaluation of probabilistic prediction systems. Proc. Seminar on Predictability, Reading, United Kingdom, ECMWF, 1–26.

  • Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.

  • ——, Y. Zhu, T. Marchok, S. Tracton, and E. Kalnay, 1998: Verification of the NCEP global ensemble forecasts. Preprints, 12th Conf. on Numerical Weather Prediction, Phoenix, AZ, Amer. Meteor. Soc., 286–289.

  • Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. Wea. Forecasting, 8, 379–398.

  • Wilson, L. J., 1995: Verification of weather element forecasts from an ensemble prediction system. Proc. Fifth Workshop on Meteorological Operational Systems, Reading, United Kingdom, ECMWF, 114–126.

  • Zhu, Y., G. Iyengar, Z. Toth, S. M. Tracton, and T. Marchok, 1996: Objective evaluation of the NCEP global ensemble forecasting system. Preprints, 15th Conf. on Weather Analysis and Forecasting, Norfolk, VA, Amer. Meteor. Soc., J79–J82.

  • Ziehmann, C., 1998: Comparison of the ECMWF ensemble with an ensemble consisting of four operational models. Abstracts, Seventh Int. Meeting on Statistical Climatology, Whistler, BC, Canada, ECMWF, 147.

Fig. 1.

Typical reliability diagrams (showing predicted probability vs observed frequency) and sharpness histograms (showing the distribution of predicted probabilities). (a) Perfect resolution and reliability, perfect sharpness. (b) Perfect reliability but poor sharpness, lower resolution than (a). (c) Perfect sharpness but poor reliability, lower resolution than (a). (d) As in (c) but after calibration, perfect reliability, same resolution.

Fig. 2.

Reliability diagram and sharpness histogram of the reference forecast based on the ECMWF EPS control forecast (see text for details). Winter 1996/97, Europe, 500-hPa geopotential height anomaly exceeding 50 m, +96 h.

Fig. 3.

Decomposition of the skill score of the reference forecast vs a forecast based on the long-term climatology. Winter 1996/97, Europe, 500-hPa geopotential height anomaly exceeding 50 m, +96 h.

Fig. 4.

(a) Skill score of probabilistic forecasts based on the ECMWF EPS. Forecasts based on the raw distribution (EPS), on the distribution based on the ensemble mean and standard deviation (EM+stdev), on the distribution based on the ensemble mean and the standard deviation of the control forecast error (EM+error), and on the distribution based on the ensemble mean and a calibrated standard deviation [EM+corr(stdev)]. Winter 1996/97, Europe, 500-hPa geopotential height anomaly exceeding 50 m, +96 h. (b) Same as Fig. 4a but for reliability term of the skill score. (c) Same as Fig. 4a but for resolution term of the skill score.

Fig. 5.

Relation between the ECMWF EPS standard deviation, stratified in 10 equally populated categories, and the standard deviation of the ECMWF EPS control error. Winter 1996/97, Europe, 500-hPa geopotential height.

Fig. 6.

(a) Skill score of the forecasts based on the raw distribution of smaller ensembles constructed from the ECMWF EPS. Winter 1996/97, Europe, 500-hPa geopotential height anomaly exceeding 50 m, +96 h. (b) Same as Fig. 6a but for reliability term of the skill score. (c) Same as Fig. 6a but for resolution term of the skill score.

Fig. 7.

Same as Fig. 6a but for distribution based on the ensemble mean and standard deviation, instead of the raw ensemble distribution.

Fig. 8.

Same as Fig. 6a but for distribution based on the ensemble mean and the standard deviation of the control forecast error, instead of the raw distribution.

Fig. 9.

(a) Same as Fig. 4a but based on the NCEP EPS. (b) Same as Fig. 9a but for reliability term of the skill score. (c) Same as Fig. 9a but for resolution term of the skill score.

Fig. 10.

(a) Skill score of probabilistic forecasts based on a poor man’s ensemble (EC–NC, EC–NC–DW, and EC–NC–DW–UK), compared with the forecasts based on the ECMWF EPS (EC) and the NCEP EPS (NC). Distribution based on ensemble mean and the standard deviation of the ECMWF EPS control error. (b) Same as Fig. 10a but for reliability term of the skill score. (c) Same as Fig. 10a but for resolution term of the skill score.

Fig. 11.

Same as Fig. 10a but for distribution based on ensemble mean and ensemble standard deviation.

Fig. 12.

Same as Fig. 10a but for 150-m geopotential height anomaly.

Fig. 13.

Same as Fig. 10a but for skill score based on the area below the ROC curve instead of the probability score.


Table 1.

Ensemble standard deviation of 500-hPa geopotential height over Europe, winter 1996/97, ECMWF EPS, and NCEP EPS, for various memberships including the control forecast and a number of perturbed forecasts (m).

Table 2.

Occurrences of 500-hPa geopotential height anomaly exceeding various thresholds, and corresponding skill score and resolution component of the skill score of probabilistic forecasts based on 1) the ECMWF EPS raw distribution and 2) a Gaussian distribution based on the ECMWF EPS mean and the standard deviation of the control forecast error, winter 1996/97, +144 h.
