## 1. Introduction

Appropriate verification tools are essential in understanding the abilities and weaknesses of (probabilistic) forecast systems.

Verification is often focused on specific (weather) events. Such a binary event either occurs or does not occur, and is forecast to occur or not to occur with probabilities *p* and 1 − *p*, respectively. Examples of such events are more than 10-mm precipitation in 24 h or an anomaly (from a climatological mean) of more than 50 m of the geopotential at 500 hPa. Several well-established tools exist that test how accurately the forecast system is able to describe the occurrence and nonoccurrence of the event under consideration, that is, how good the agreement is between the forecast probabilities and the observed states. Examples of scores commonly used by operational centers such as the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction are Brier scores (Brier 1950), the relative operating characteristic (ROC) curves (Mason 1982; Stanski et al. 1989), and economic cost–loss analyses (see, e.g., Katz and Murphy 1997; or Richardson 1998, 2000).

The (half) Brier score is one of the oldest verification tools in use. From its numerical value alone the quality of a forecast system is difficult to assess. An attractive property of the Brier score, however, is that it can be decomposed into a reliability, a resolution, and an uncertainty part (Murphy 1973). The reliability tests whether the forecast system has the correct statistical properties. It can be presented in a graphical way by the so-called reliability diagram. The uncertainty is the Brier score one would obtain when only the climatological frequency for the occurrence of the event is available. The resolution shows the impact obtained by issuing case-dependent probability forecasts (which do not always equal the probability based on climatology). Therefore, the decomposition of the Brier score gives a detailed insight into the performance of the forecast system with respect to the event under consideration.

Binary events highlight only one aspect of the forecast. Such a single aspect may be quite relevant. For instance, certain extreme events can lead to economic losses that could be avoided with the help of an accurate forecast system. This kind of issue is addressed by the ROC curve and economic cost–loss analyses. However, it may be desirable to obtain a broader overall view of performance. Several tools in this direction exist. It should be mentioned, however, that the term overall is often still restricted to the behavior of one forecast parameter only, such as precipitation or the geopotential at 500 hPa.

An example is the Talagrand diagram (Talagrand and Vautard 1997), also known as the rank histogram (Hamill and Colucci 1997) or the binned probability ensemble (Anderson 1996). This tool is tailor-made for an ensemble system, that is, for the case in which the probability density function (PDF) is represented by an ensemble of forecasts. Given such an ensemble, its *N* members divide the permissible range of the parameter of interest into *N* + 1 bins. The verifying analysis will be found to be in one of these bins. If all members are assumed to be equally weighted and representative, it is expected that, on average, each bin should be equally populated by the verifying analyses. Deviations from such a flat rank histogram indicate a violation of the above-made assumptions. For instance, a too high frequency of outliers is an indication that the average spread within the ensemble system is too low.
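As a concrete illustration of how such a diagram is built, the following Python sketch (the function name and the synthetic data are illustrative, not from the paper) bins each verifying analysis by the number of ensemble members falling below it:

```python
import numpy as np

def rank_histogram(ensembles, analyses):
    """Rank histogram (Talagrand diagram): for each case, the bin index
    is the number of ensemble members below the verifying analysis."""
    ens = np.sort(np.asarray(ensembles, dtype=float), axis=1)
    ranks = np.sum(ens < np.asarray(analyses, dtype=float)[:, None], axis=1)
    # N members define N + 1 bins, indexed 0..N
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# synthetic example: members and analyses drawn from the same distribution,
# so all N + 1 bins should be roughly equally populated
rng = np.random.default_rng(0)
counts = rank_histogram(rng.normal(size=(5000, 9)), rng.normal(size=5000))
print(counts)
```

A strong excess in the two outermost bins of this histogram would signal the too-low ensemble spread discussed above.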

Another example is the ranked probability score (RPS) (see Epstein 1969; Murphy 1969, 1971). It is a generalization of the (half) Brier score. Instead of two options (event occurs or does not occur), the range of the parameter of interest is divided into more classes. In addition, the RPS contains a sense of distance, measuring how far the forecast was from reality. For a deterministic forecast, for instance, the RPS is proportional to the number of classes by which the forecast missed the verifying analysis. Although the choice and number of classes may be prescribed by the specific application, the exact value of the RPS will depend on this choice. It is possible to take the limit of an infinite number of classes, each with zero width. This leads to the concept of the continuous ranked probability score (CRPS) (Brown 1974; Matheson and Winkler 1976; Unger 1985; Bouttier 1994). This CRPS has several appealing properties. First of all, it is sensitive to the entire permissible range of the parameter of interest. Second, its definition does not require, as the RPS does, the introduction of a number of predefined classes, on which results may depend. In addition, it can be interpreted as an integral over all possible Brier scores. Finally, for a deterministic forecast, the CRPS is equal to the mean absolute error (MAE) and, therefore, has a clear interpretation.

Despite these advantages, the CRPS is a single quantity, from which it is difficult to disentangle the detailed behavior of a forecast system. It would be desirable to decompose the CRPS in the same way as the Brier score. In this paper it is shown how, for an ensemble prediction system, this indeed can be achieved. In a similar way to the Brier score, the CRPS is shown to be decomposable into a reliability part, an uncertainty part, and a resolution part. The reliability part tests whether, for each bin *i*, the verifying analysis was found, on average, with a fraction *i*/*N* of the cases below this bin. It has a close relation to the rank histogram. The uncertainty part is equal to the CRPS one would obtain if only a PDF based on climatology were available. The resolution, finally, expresses the improvement gained by issuing case-dependent probability forecasts (which do not always equal the probability based on climatology). It is shown that the resolution is sensitive to the average ensemble spread and to the frequency and magnitude of the outliers. Finally, it is illustrated how the various contributions to the CRPS can be presented in a graphical way, like the reliability diagram of the Brier score.

The paper is organized as follows. In section 2 the CRPS is defined, and some characteristics are mentioned. The uncertainty part of the CRPS is highlighted in section 3. In section 4, the full decomposition for an ensemble system is derived. As an example, the decomposition of the CRPS for total precipitation in the ensemble prediction system (EPS) running at ECMWF is presented in section 5. A summary and some concluding remarks are made in section 6.

## 2. The continuous ranked probability score

Let *x* denote the parameter of interest. For instance, *x* could be the 2-m temperature or 10-m wind speed. Suppose that the PDF forecast by an ensemble system is given by *ρ*(*x*) and that *x*_{a} is the value that actually occurred. Then the continuous ranked probability score (Brown 1974; Matheson and Winkler 1976; Unger 1985; Bouttier 1994), expressing some kind of distance between the probabilistic forecast *ρ* and truth *x*_{a}, is defined as

CRPS = CRPS(*P*, *x*_{a}) = ∫_{−∞}^{∞} [*P*(*x*) − *P*_{a}(*x*)]² *dx*,   (1)

where *P* and *P*_{a} are cumulative distributions:

*P*(*x*) = ∫_{−∞}^{*x*} *ρ*(*y*) *dy*   (2)

and

*P*_{a}(*x*) = *H*(*x* − *x*_{a}),   (3)

in which *H* is the Heaviside function,

*H*(*x*) = 0 for *x* < 0, and 1 for *x* ⩾ 0.   (4)

So *P*(*x*) is the forecasted probability that *x*_{a} will be smaller than *x*. Obviously, for any cumulative distribution, *P*(*x*) ∈ [0, 1], *P*(−∞) = 0, and *P*(∞) = 1. This is also true for parameters that are only defined on a subdomain of ℜ. In that case *ρ*(*x*) = 0 and *P* is constant outside the domain of definition. The CRPS measures the difference between the predicted and occurred cumulative distributions. Its minimal value of zero is only achieved for *P* = *P*_{a}, that is, in the case of a perfect deterministic forecast. Note that the CRPS has the dimension of the parameter *x* (which enters via the integration over *dx*).
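The definition can be made concrete with a short numerical sketch. The following Python fragment (function name, grid, and ensemble values are illustrative assumptions, not from the paper) approximates the integral in Eq. (1) for the empirical cumulative distribution of a small ensemble:

```python
import numpy as np

def crps_numeric(members, x_a, lo=-50.0, hi=50.0, n=200000):
    """Approximate Eq. (1): integral of [P(x) - H(x - x_a)]^2 dx,
    where P is the ensemble's empirical cumulative distribution."""
    members = np.sort(np.asarray(members, dtype=float))
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx            # midpoint rule
    P = np.searchsorted(members, x, side="right") / members.size
    Pa = (x >= x_a).astype(float)                 # Heaviside step at x_a
    return float(np.sum((P - Pa) ** 2) * dx)

ens = [1.0, 2.0, 2.5, 3.0, 4.0]                   # a five-member ensemble
print(crps_numeric(ens, 2.2))
```

Because both cumulative distributions are bounded by 0 and 1, the integrand vanishes far from the ensemble and the analysis, so a finite grid suffices.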

In practice, an average over an area and a number of cases is considered:

⟨CRPS⟩ = Σ_{k} *w*_{k} CRPS(*P*^{k}, *x*^{k}_{a}),   (5)

where *k* labels the considered grid points and cases. The weights *w*_{k} may depend on *k* (for instance proportional to the cosine of latitude) and are normalized such that Σ_{k} *w*_{k} = 1.

The CRPS can be seen as the limit of a ranked probability score with an infinite number of classes, each with zero width.

In addition, the CRPS is closely connected to the Brier score. Consider the binary event defined by a threshold *x*_{t}. The event is said to have happened (*O* = 1) if *x*_{a} ⩽ *x*_{t} and not happened (*O* = 0) if *x*_{a} > *x*_{t}. If *p* is the forecast probability that the event will occur, the Brier score is defined as

BS(*x*_{t}) = Σ_{k} *w*_{k} (*p*^{k} − *O*^{k})².   (6)

Note that *p*^{k} = *P*^{k}(*x*_{t}) and *O*^{k} = *P*^{k}_{a}(*x*_{t}), and therefore

⟨CRPS⟩ = ∫_{−∞}^{∞} BS(*x*_{t}) *dx*_{t};   (7)

that is, the CRPS is the integral of the Brier score over all possible threshold values *x*_{t}.

For a deterministic forecast, that is, *x* = *x*_{d} without any specified uncertainty, *P*(*x*) = *H*(*x* − *x*_{d}). In that case, the integrand of Eq. (1) is either zero or one. The nonzero contributions are found in the region where *P*(*x*) and *P*_{a}(*x*) differ, which is the interval between *x*_{d} and *x*_{a}. As a result,

CRPS = |*x*_{d} − *x*_{a}|,   (8)

which is exactly the mean absolute error for this single case.
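This reduction to the mean absolute error is easy to confirm numerically. A minimal Python check (all names and numbers are illustrative):

```python
import numpy as np

def crps_deterministic(x_d, x_a, lo=-50.0, hi=50.0, n=200000):
    """Eq. (1) with P(x) = H(x - x_d): the integrand [H(x - x_d) - H(x - x_a)]^2
    is 1 between x_d and x_a and 0 elsewhere, so the integral is |x_d - x_a|."""
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx      # midpoint grid
    P = (x >= x_d).astype(float)            # deterministic forecast step
    Pa = (x >= x_a).astype(float)           # observed step
    return float(np.sum((P - Pa) ** 2) * dx)

print(crps_deterministic(1.5, 4.0))         # close to |1.5 - 4.0| = 2.5
```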

## 3. The uncertainty of the CRPS

When only climatological information on the parameter *x* is available, the same probability forecast *P*^{k} = *P*_{cli} will be made for each situation. In that case, the average score (5) becomes

⟨CRPS⟩_{cli} = Σ_{k} *w*_{k} ∫_{−∞}^{∞} [*P*_{cli}(*x*) − *H*(*x* − *x*^{k}_{a})]² *dx*.   (10)

Using Σ_{k} *w*_{k} = 1 by definition, and *H*² = *H*, and defining the sample distribution

*P*_{sam}(*x*) = Σ_{k} *w*_{k} *H*(*x* − *x*^{k}_{a}),   (9)

Eq. (10) can be written as the sum of

∫_{−∞}^{∞} [*P*_{cli}(*x*) − *P*_{sam}(*x*)]² *dx*   (11)

and

*U* = ∫_{−∞}^{∞} *P*_{sam}(*x*)[1 − *P*_{sam}(*x*)] *dx*,   (12)

where *P*_{sam} is the cumulative distribution based on the sample used in the verification. If, for instance, all *M* weights would be equal, so *w*_{k} = 1/*M*, then *P*_{sam}(*x*) is just the fraction of cases in which the verifying analysis was found to be smaller than *x*. The value of *P*_{sam}(*x*) also equals the sample frequency of occurrence *o*(*x*_{t}) for the Brier score with threshold *x*_{t} = *x*.

From Eqs. (10)–(12) it is seen that the CRPS based on climatology is minimal when *P*_{cli} is equal to *P*_{sam}. The impact on the CRPS due to a deviation from the sample statistics is expressed by Eq. (11).

The quantity (12) is the integral of the uncertainty *U* (Murphy 1973; or see, e.g., Wilks 1995) of the Brier score over all possible thresholds: the sample frequency *o*(*x*_{t}) = *P*_{sam}(*x*_{t}) is the fraction of cases in which *x* < *x*_{t} occurred, and the Brier-score uncertainty at this threshold is *o*(*x*_{t})[1 − *o*(*x*_{t})]. Therefore, it is very natural to define Eq. (12) as the uncertainty *U* of the CRPS.

The uncertainty *U* is related to the average spread of the sample distribution *ρ*_{sam} = *dP*_{sam}/*dx*, because the main contribution to the integral in Eq. (12) comes from the region in *x* where *P*_{sam} is significantly different from 0 and 1. An illustration is given in Fig. 1. To be more exact, the sample distribution *ρ*_{sam} can always be written as

*ρ*_{sam}(*x*) = (1/*σ*) *ρ*_{0}[(*x* − *x̄*)/*σ*],

where *ρ*_{0} is a distribution with *σ* = 1 (for instance similar to a standardized Gaussian) and *P*_{0} [see Eq. (2)] its cumulative distribution. From Eq. (12) it then follows that the uncertainty is proportional to the spread *σ*:

*U* = *σ* ∫_{−∞}^{∞} *P*_{0}(*x*)[1 − *P*_{0}(*x*)] *dx*.

It should be noted that the term climatology depends on the degree of desired sophistication. The most crude level would be to assume the same climatological distribution at all grid points and cases. The mean climatological value of *x*, however, may be quite location and season dependent. The mean 2-m temperature of Norway in January, for instance, is much lower than that of Spain in March. This would result in a very broad sample distribution and, therefore, in a large uncertainty. In order to correct for this, as a first step, the variable *x* can be redefined as the anomaly with respect to the local climatology. The definition of the CRPS is invariant under such a shift in the variable *x*, as is easily seen from Eq. (1). As a consequence, the distribution *P*_{sam} will change, because for each *k* in Eq. (9) a different shift may have been applied. This should result in a distribution that is much sharper, so the uncertainty *U* will be smaller.

Finally, the entire climatological distribution (so not just its mean) could be chosen to depend on the location and/or season, so *P*^{k} = *P*_{cli,location,season}. For this, the best achievable distribution would be a location/seasonal-dependent sample distribution, also given by Eq. (9) but in which the sum (and the normalization of the weights) is restricted to all points *k* that belong to the same location and/or season. Again, the resulting uncertainty is expected to become lower. For parameters like precipitation this will also lead to a lower uncertainty.

The uncertainty can be evaluated directly from Eqs. (9) and (12). The integrand then involves products of the form *H*(*x* − *x*^{k}_{a})[1 − *H*(*x* − *x*^{l}_{a})], whose integral over *x* equals the distance between *x*^{k}_{a} and *x*^{l}_{a} when *x*^{l}_{a} > *x*^{k}_{a}, and zero when *x*^{l}_{a} < *x*^{k}_{a}. As a result,

*U* = Σ_{k} Σ_{l} *w*_{k}*w*_{l} max(*x*^{l}_{a} − *x*^{k}_{a}, 0) = ½ Σ_{k} Σ_{l} *w*_{k}*w*_{l} |*x*^{k}_{a} − *x*^{l}_{a}|.

Alternatively, use can be made of the fact that *P*_{sam} is piecewise constant (see, e.g., Fig. 1). It is zero for *x* = −∞, and each time an *x*^{k}_{a} is passed, it increases by *w*_{k}. Beyond the largest verifying analysis in the set, *P*_{sam} = 1. Now if the *x*^{k}_{a} are sorted in ascending order, Eq. (12) reduces to

*U* = Σ_{k=1}^{M−1} *p*_{k}(1 − *p*_{k})(*x*^{sort(k+1)}_{a} − *x*^{sort(k)}_{a}),  with  *p*_{k} = *p*_{k−1} + *w*_{sort(k)},  *p*_{0} = 0,   (20)

where sort(*k*) labels the *k*th smallest verifying analysis in the set. The direct evaluation of the double sum given above involves a number of operations of order *M*², where *M* is the size of the sample set. If *M* becomes on the order of a few thousand, this evaluation becomes time consuming. In addition, roundoff errors are expected to become nonnegligible. Method (20) only involves a sum of order *M*. The price to be paid is that the *x*^{k}_{a} have to be sorted, which is an operation of order *M* log(*M*) (see, e.g., Press et al. 1989). Therefore, this latter method is still quite feasible and accurate for very large samples.
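Both evaluation strategies can be sketched in a few lines. The Python fragment below (function names and the synthetic sample are illustrative) implements the order-*M*² double sum and the sorted method (20) and checks that they agree:

```python
import numpy as np

def uncertainty_pairwise(x_a, w):
    """Double-sum evaluation: U = 0.5 * sum_{k,l} w_k w_l |x_a^k - x_a^l|,
    an O(M^2) operation."""
    x = np.asarray(x_a, dtype=float)
    w = np.asarray(w, dtype=float)
    return 0.5 * float(w @ np.abs(x[:, None] - x[None, :]) @ w)

def uncertainty_sorted(x_a, w):
    """Sorted evaluation [cf. Eq. (20)]: P_sam is piecewise constant, so
    U = sum_k p_k (1 - p_k) (x_(k+1) - x_(k)), with p_k the accumulated
    weight below the kth sorted analysis; O(M log M) due to the sort."""
    order = np.argsort(x_a)
    x = np.asarray(x_a, dtype=float)[order]
    p = np.cumsum(np.asarray(w, dtype=float)[order])[:-1]
    return float(np.sum(p * (1.0 - p) * np.diff(x)))

rng = np.random.default_rng(0)
xs = rng.normal(size=500)
ws = np.full(500, 1.0 / 500)
print(uncertainty_pairwise(xs, ws), uncertainty_sorted(xs, ws))
```

For equal weights the double sum is half the mean absolute difference of the sample, which makes the connection between *U* and the sample spread explicit.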

## 4. The CRPS for an ensemble system

### a. The cumulative distribution of an ensemble

Consider an ensemble system of *N* members. For each parameter *x* this means that the cumulative distribution forecasted by the ensemble system is given by

*P*(*x*) = (1/*N*) Σ_{i=1}^{N} *H*(*x* − *x*_{i}),

where *x*_{1}, . . . , *x*_{N} are the outcomes of the *N* ensemble members. From now on it is assumed that the members are ordered, that is, *x*_{i} ⩽ *x*_{j} for *i* < *j*. So *P* is a piecewise constant function. Transitions occur at the values *x*_{i}:

*P*(*x*) = *p*_{i} = *i*/*N*  for  *x*_{i} < *x* < *x*_{i+1},

where *x*_{0} = −∞ and *x*_{N+1} = ∞ are introduced for convenience. An example of the cumulative distribution for an ensemble of five members is given (thick solid curve) in Fig. 2.

### b. Decomposition for a single case

For a given verifying analysis *x*_{a}, the integral (1) can be split into a sum of contributions *c*_{i} from the intervals [*x*_{i}, *x*_{i+1}], on each of which *P*(*x*) = *p*_{i} is constant, while *H*(*x* − *x*_{a}) will be either 0, or 1, or partly 0, partly 1, in the interval [*x*_{i}, *x*_{i+1}]. For each of these three possible situations, *c*_{i} can be written as

*c*_{i} = *α*_{i} *p*²_{i} + *β*_{i}(1 − *p*_{i})²,   (25)

where

*α*_{i} = *x*_{i+1} − *x*_{i},  *β*_{i} = 0,  if *x*_{a} > *x*_{i+1};
*α*_{i} = *x*_{a} − *x*_{i},  *β*_{i} = *x*_{i+1} − *x*_{a},  if *x*_{i+1} > *x*_{a} > *x*_{i};
*α*_{i} = 0,  *β*_{i} = *x*_{i+1} − *x*_{i},  if *x*_{a} < *x*_{i}.   (26)

Both *α*_{i} and *β*_{i} have the dimension of the parameter *x*.

For the example given in Fig. 2, the verifying analysis lies between *x*_{3} and *x*_{4}. Therefore, for this case, *β*_{i} = 0 for *i* = 1 and 2, and *α*_{i} = 0 for *i* = 4. Only for *i* = 3 are both *α*_{i} and *β*_{i} nonzero.

Special attention should be given to the extreme bins *i* = 0 and *i* = *N*. These concern the intervals (−∞, *x*_{1}] and [*x*_{N}, ∞), respectively, for which *p*_{i} = 0 and *p*_{i} = 1, respectively. These two intervals will only contribute to the CRPS in cases when the verifying analysis is an outlier, that is, when it is outside the range of the ensemble. In this situation Eq. (25) can also be used. In the first case, only *β*_{0} is nonzero, being the difference between *x*_{a} and the smallest ensemble member. In the second case, *α*_{N} is nonzero and equal to the distance of *x*_{a} from the largest ensemble member. Outliers can contribute significantly to the CRPS, because nonzero values of *β*_{0} and *α*_{N} are weighted stronger than the other *α*'s and *β*'s [their weights in Eq. (25) are (1 − *p*_{0})² = *p*²_{N} = 1; see, e.g., the shaded areas in Fig. 3].
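The bookkeeping of Eqs. (25) and (26), including the two outlier bins, is compact in code. A Python sketch (the helper name and the ensemble values are illustrative):

```python
import numpy as np

def crps_alpha_beta(members, x_a):
    """Single-case CRPS via Eq. (25): sum of alpha_i p_i^2 + beta_i (1 - p_i)^2
    over all bins, with alpha_i and beta_i assigned according to Eq. (26)
    and the outlier bins i = 0 and i = N entering with unit weight."""
    x = np.sort(np.asarray(members, dtype=float))
    N = x.size
    crps = 0.0
    for i in range(1, N):                 # interior bins [x_i, x_{i+1}]
        p = i / N
        lo, hi = x[i - 1], x[i]
        if x_a >= hi:                     # analysis above the bin
            alpha, beta = hi - lo, 0.0
        elif x_a <= lo:                   # analysis below the bin
            alpha, beta = 0.0, hi - lo
        else:                             # analysis inside the bin
            alpha, beta = x_a - lo, hi - x_a
        crps += alpha * p**2 + beta * (1.0 - p) ** 2
    if x_a < x[0]:                        # low outlier: beta_0, weight (1 - p_0)^2 = 1
        crps += x[0] - x_a
    if x_a > x[-1]:                       # high outlier: alpha_N, weight p_N^2 = 1
        crps += x_a - x[-1]
    return crps

ens = [1.0, 2.0, 2.5, 3.0, 4.0]
for xa in (2.2, 0.2, 5.5):                # inside the range, low and high outlier
    print(crps_alpha_beta(ens, xa))
```

Because Eq. (25) is exact for a piecewise-constant *P*, the same numbers are obtained from a direct numerical integration of Eq. (1).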

### c. The average over a set of cases

When an average is taken over a set of *M* cases and/or grid points, each with a weight *w*_{k}, the average CRPS [Eq. (5)] can be found as

⟨CRPS⟩ = Σ_{i=0}^{N} [*ᾱ*_{i}*p*²_{i} + *β̄*_{i}(1 − *p*_{i})²],

where the bars denote the (weighted) averages over the set of the individual *α*_{i} and *β*_{i}.

The averaged *ᾱ*_{i} and *β̄*_{i} can be expressed in two quantities *g*_{i} and *o*_{i}, which both have a physical interpretation. First the case 0 < *i* < *N* is considered. Let

*g*_{i} = *ᾱ*_{i} + *β̄*_{i},   (30)

*o*_{i} = *β̄*_{i}/(*ᾱ*_{i} + *β̄*_{i}),   (31)

or, equivalently,

*ᾱ*_{i} = *g*_{i}(1 − *o*_{i}),  *β̄*_{i} = *g*_{i}*o*_{i}.   (32)

Here *g*_{i} is the average width of bin number *i*. This width will usually be small compared to the range over which the verifying analysis varies from case to case. Then, for most cases, the verifying analysis will not lie in the interval [*x*_{i}, *x*_{i+1}]. Therefore, usually, either *α*_{i} will be zero and *β*_{i} is equal to the width of bin number *i*, or vice versa. The first case applies to the situation in which the verifying analysis was found to be smaller than the ensemble member *i*, as can be seen from Eq. (26); the second case to that in which it was found to be larger than member *i* + 1. Taking this in mind, *o*_{i} can be seen to be closely related to the average frequency that the verifying analysis was found to be below ½(*x*_{i} + *x*_{i+1}). Ideally these observed frequencies should match the forecasted probability *p*_{i} = *i*/*N* that the verifying analysis is to be found below the *i*th interval. Such a consistency is closely related to the flatness of the rank histogram [also known as Talagrand diagram or binned probability ensemble; see, e.g., Anderson (1996), Talagrand and Vautard (1997), or Hamill and Colucci (1997)].

For the outlier bins, *o*_{0} and *o*_{N} are defined as the (weighted) frequencies that *x*_{a} was found to be smaller than *x*_{1} and *x*_{N}, respectively, and *g*_{0} and *g*_{N} are defined as the average length of the outlier, given that it occurred:

*g*_{0} = *β̄*_{0}/*o*_{0},  *g*_{N} = *ᾱ*_{N}/(1 − *o*_{N}).   (33)

With these definitions for *i* = 0, . . . , *N*, so including the outliers, the average CRPS can be decomposed as

⟨CRPS⟩ = Reli + CRPS_{pot},

with

Reli = Σ_{i=0}^{N} *g*_{i}(*o*_{i} − *p*_{i})²,   (36)

CRPS_{pot} = Σ_{i=0}^{N} *g*_{i}*o*_{i}(1 − *o*_{i}).   (37)

The reliability (36) is the analog of the reliability of the Brier score. There it is tested whether, in the subset of cases in which a probability *p* was forecast, on average, the event occurred with that fraction *p*. Here, it is tested whether, on average, the frequency *o*_{i} that the verifying analysis was found to be below the middle of interval number *i* is proportional to *i*/*N*. Therefore, it is tested here whether the ensemble is capable of generating cumulative distributions that have, on average, this desired statistical property. The reliability (36) is closely connected to the rank histogram, which shows whether the frequency that the verifying analysis was found in bin number *i* is equal for all bins. The rank histogram, however, does not take account of the width of the ensemble. It only counts how often the verifying analysis was located in a bin, regardless of the width of the bins. The reliability (36) does account for these widths, via the definitions of *α*_{i} and *β*_{i} and therefore *o*_{i}. Note that Reli has the dimension of the parameter *x*, while the reliability of the Brier score is dimensionless. The term CRPS_{pot} given in Eq. (37) is called the potential CRPS (in analogy with Murphy and Epstein 1989), because it is the CRPS one would obtain after the probabilities *p*_{i} would have been retuned such that the system would become perfectly reliable, that is, for which *o*_{i} = *p*_{i}. The sharper the ensemble, the smaller the average bin widths *g*_{i} and the smaller Eq. (37). The potential CRPS is also sensitive to outliers. Too many and too large outliers will result in large values of *g*_{0}*o*_{0} and *g*_{N}(1 − *o*_{N}) and therefore affect CRPS_{pot} considerably. Although the small average bin widths *g*_{1}, . . . , *g*_{N} of an ensemble system with a too small spread may have a positive impact on the potential CRPS, the too high frequency of outliers and the large magnitudes of such outliers will have a clear negative impact. Given a certain degree of unpredictability, the optimal value for CRPS_{pot} will be achieved for an ensemble system in which the spread and the statistics of outliers are in balance.
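The full decomposition can be assembled directly from these definitions. The following sketch (synthetic data; all names are illustrative assumptions) computes *g*_{i} and *o*_{i} from the averaged *α*_{i} and *β*_{i} and verifies that Reli + CRPS_{pot} reproduces ⟨CRPS⟩ for an equal-weight set of cases:

```python
import numpy as np

def alpha_beta(ens, x_a):
    """alpha_i and beta_i of Eqs. (25)-(26) for one case; ens must be sorted.
    Bins 0 and N are the outlier intervals below/above the ensemble range."""
    N = ens.size
    a = np.zeros(N + 1)
    b = np.zeros(N + 1)
    for i in range(1, N):                    # interior bins [x_i, x_{i+1}]
        lo, hi = ens[i - 1], ens[i]
        a[i] = max(0.0, min(x_a, hi) - lo)   # part of the bin below x_a
        b[i] = max(0.0, hi - max(x_a, lo))   # part of the bin above x_a
    b[0] = max(0.0, ens[0] - x_a)            # low-outlier length
    a[N] = max(0.0, x_a - ens[-1])           # high-outlier length
    return a, b

def decompose(ensembles, analyses):
    """Equal-weight decomposition <CRPS> = Reli + CRPS_pot."""
    ensembles = np.sort(np.asarray(ensembles, dtype=float), axis=1)
    analyses = np.asarray(analyses, dtype=float)
    N = ensembles.shape[1]
    ab = np.array([alpha_beta(e, x) for e, x in zip(ensembles, analyses)])
    abar, bbar = ab[:, 0].mean(axis=0), ab[:, 1].mean(axis=0)
    p = np.arange(N + 1) / N
    g = np.zeros(N + 1)
    o = np.zeros(N + 1)
    g[1:N] = abar[1:N] + bbar[1:N]              # average bin widths
    o[1:N] = bbar[1:N] / g[1:N]                 # observed frequencies
    o[0] = np.mean(analyses < ensembles[:, 0])  # outlier frequencies
    o[N] = np.mean(analyses < ensembles[:, -1])
    if o[0] > 0:
        g[0] = bbar[0] / o[0]                   # mean outlier length
    if o[N] < 1:
        g[N] = abar[N] / (1.0 - o[N])
    reli = float(np.sum(g * (o - p) ** 2))
    pot = float(np.sum(g * o * (1.0 - o)))
    crps = float(np.sum(abar * p**2 + bbar * (1.0 - p) ** 2))
    return crps, reli, pot

# synthetic, statistically consistent forecasts: members and truth share
# the same distribution around a case-dependent mean mu
rng = np.random.default_rng(1)
M, N = 2000, 10
mu = rng.normal(size=M)
ens = mu[:, None] + rng.normal(size=(M, N))
xa = mu + rng.normal(size=M)
crps, reli, pot = decompose(ens, xa)
print(crps, reli, pot)    # reli + pot equals crps up to roundoff
```

For such a statistically consistent synthetic system the reliability term comes out small compared with the potential CRPS, in line with the discussion above.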

The potential CRPS is related to the uncertainty *U*. Suppose that the integral over *x* in Eq. (12) is approximated by a sum over intervals Δ*x*_{i}, each representing an equal part 1/*N* of integrated probability. The Δ*x*_{i} may be identified with the widths *g*_{i} and the *P*_{sam}(*x*_{i}) with the observed frequencies *o*_{i}. As a result, these approximations lead to Eq. (37), so the uncertainty can be regarded as the potential CRPS of a forecast system based on the climatology of the sample. It may be clear that it is desirable for an ensemble system that CRPS_{pot} is smaller than this potential CRPS based on climatology. Therefore, the potential CRPS may, although perhaps somewhat artificially, be further decomposed into

Resol = *U* − CRPS_{pot},   (38)

which, combined with ⟨CRPS⟩ = Reli + CRPS_{pot}, yields the complete decomposition

⟨CRPS⟩ = Reli − Resol + *U*.   (39)

### d. Relation to the decomposition of the Brier score

In section 2 it was shown that the ⟨CRPS⟩ is the integral of the Brier score over all possible thresholds [Eq. (7)]. It is therefore interesting to compare the decomposition (39) with the integral of the Murphy (1973) decomposition of the Brier score. For an *N*-member ensemble system, the Brier score (6) for a threshold *x* may be stratified with respect to the set of allowable probabilities *p*_{i} = 0, 1/*N*, . . . , 1:

BS(*x*) = Σ_{i=0}^{N} *g*_{i}(*x*)[*p*_{i} − *o*_{i}(*x*)]² − Σ_{i=0}^{N} *g*_{i}(*x*)[*o*_{i}(*x*) − *o*(*x*)]² + *o*(*x*)[1 − *o*(*x*)],   (40)

where *g*_{i}(*x*) is the (weighted) fraction of cases in which a probability *p* = *p*_{i} was issued, while *o*_{i}(*x*) is the fraction of such cases in which indeed the event was observed. Note that both quantities depend on the value of the threshold *x*.

It is shown in the appendix that the integrals over *x* of *g*_{i}(*x*) and of *g*_{i}(*x*)*o*_{i}(*x*) are equal to the quantities *g*_{i} and *g*_{i}*o*_{i}, respectively, defined by Eqs. (30)–(33). When integral (7) is performed, the Brier decomposition (40) integrates to

⟨CRPS⟩ = 〈Reli〉 − 〈Resol〉 + 〈*U*〉,   (41)

and the relation between decompositions (39) and (41) can be established:

〈Reli〉 = Reli + *D*,  〈Resol〉 = Resol + *D*,  〈*U*〉 = *U*,

where *D* collects, for each *i*, the integrated, *g*_{i}(*x*)-weighted squared deviations of *o*_{i}(*x*) from the constants *z*_{i} = *o*_{0}, *o*_{1}, . . . , (1 − *o*_{N}). In general, *D* will be nonzero. Therefore, the integration of the resolution and reliability of the Brier score over all possible thresholds, in general, differs from the resolution and reliability, respectively, of the ⟨CRPS⟩. The integral of the uncertainty *U*(*x*) over all thresholds, on the other hand, is exactly equal to the uncertainty of the ⟨CRPS⟩. For 0 < *i* < *N*, the integrands contributing to *D* are positive definite. Only when *o*_{i}(*x*) does not depend on *x* are they zero. Therefore this part of 〈Reli〉 is stricter than the corresponding part of Reli. For the outlier bins the average width *g*_{i}(*x*) is infinite, and therefore Eq. (49) is not valid for *i* = 0, *N*.

The quantities 〈Reli〉 and 〈Resol〉, as well as *D*, involve integrals over *x* of *g*_{i}(*x*)*o*²_{i}(*x*), *g*_{i}(*x*), and *g*_{i}(*x*)*o*_{i}(*x*) (see the appendix), which are difficult to perform analytically. Therefore, in practice, it is a tedious procedure to evaluate 〈Reli〉 and 〈Resol〉. Besides, 〈Reli〉 does not have the same clear relation to the rank histogram as the reliability (36) does.

## 5. Decomposition for the EPS at ECMWF

The ideas developed in the previous sections will be illustrated by the performance of the ensemble prediction system running at ECMWF. This ensemble forecasting system (see Molteni et al. 1996; Buizza and Palmer 1998; Buizza et al. 1999) consists of 50 perturbed forecasts plus a control forecast integrated with the ECMWF T_{L}159L31 primitive equation (PE) model up to day 10. For seven cases in the summer of 1999, the decomposition of the ⟨CRPS⟩ for total precipitation was evaluated; the weights *w*_{k} [see Eq. (5)] were chosen to be proportional to the cosine of latitude. As verifying analysis the precipitation accumulated within the first 24 h of the ECMWF operational T_{L}319L50 PE model forecasts was taken [for a discussion on this choice, see the appendix of Buizza et al. (1999)].

Table 1 shows the ⟨CRPS⟩ and its decomposition into reliability, resolution, and uncertainty [Eq. (39)] as a function of forecast day. The ⟨CRPS⟩ increases with forecast time, reaching 1.31 mm (24 h)^{−1} at day 10, expressing a decreasing predictability as a function of forecast time. The reliability only forms a small part of the ⟨CRPS⟩. The resolution decreases with forecast time, to 0.086 mm (24 h)^{−1} at day 10. Therefore, during the first days, the EPS significantly outperforms a forecast based on climatology, while for longer forecast periods there is an onset of convergence to climatology.

In order to understand these trends in more detail, Figs. 4, 5, and 6 give a graphical representation of the reliability, uncertainty, and resolution for forecast days 3, 6, and 9, respectively. In the top panels the observed frequencies *o*_{i} as defined in Eqs. (31) and (33) are plotted as a function of the fraction of members *p*_{i}. Any deviation from the diagonal will contribute to the reliability Reli defined in Eq. (36). The lower panels of Figs. 4–6 show (staircase curve) the accumulation of the average bin widths *g*_{i}, as defined in Eq. (30). The leftmost and rightmost bins show the average magnitude *g*_{0} and *g*_{N}, respectively, of the outliers [see Eq. (33)]. The width of this curve determines the potential CRPS, because CRPS_{pot} can be seen as the integral over this curve with the weight function *o*_{i}(1 − *o*_{i}). The narrower the staircase curve, the smaller the region for which the weight function is significantly different from zero and, as a result, the smaller CRPS_{pot} is. In addition, the lower panels show the cumulative distribution ("smooth" curve) of the sample climatology, as defined in Eq. (9). As is illustrated by Fig. 1, for example, the uncertainty *U* is determined by the width of *P*_{sam}. In addition (see the discussion at the end of section 4c), *U* can be seen as the expected CRPS of a forecast system based on the climatology of the sample. The difference in widths between the staircase curve and the cumulative distribution, therefore, is a measure for the resolution (38).

The discrepancy from perfect reliability for the first forecast days is mainly due to the lower bins of the ensembles, as can be seen in Fig. 4 for day 3. The frequency with which the verifying analysis is found below these bins is too high. It occurs too often that all members predict at least some precipitation, while it remained dry (based on climatology, as can be seen from *P*_{sam} in the lower panel of Fig. 4, the probability that it remains dry is about 50%). However, for these cases, the amount of precipitation of the member with the smallest amount of rain is on average quite small (around 0.3 mm; see *g*_{0} in the bottom panel of Fig. 4). Therefore this mild overestimation of precipitation will not contribute very strongly to Reli. Such a detailed analysis would not be possible from the rank histogram alone; it would only show a too high frequency of outliers.

The high resolution of the EPS at day 3 can clearly be seen from the bottom panel of Fig. 4. The average bin widths of the ensemble, including the outliers, are small compared to the width of *P*_{sam}. The climatological distribution has a large tail toward high amounts of precipitation. Apparently, for such cases, the EPS was capable of generating sharp ensembles with fair amounts of precipitation. This is the reason why the size of the outlier *g*_{N} is reasonably small. The reduction of resolution with increasing forecast time is well illustrated by comparing the lower panels of Figs. 4–6. At day 3, the ensemble is much sharper than *P*_{sam}, while at day 9, it is quite similar to the sample distribution, leaving only a low value of the resolution.

## 6. Concluding remarks

In this paper it was shown how for an ensemble prediction system the continuous ranked probability score can be decomposed into three parts. This decomposition is very similar to that of the Brier score. The first part, reliability, is closely related to the rank histogram. An important difference, however, is that the reliability of the CRPS is sensitive to the width of the ensemble bins, while the rank histogram gives each forecast the same weight. The reliability should be zero for an ensemble system with the correct statistical properties. The second part, uncertainty, is the best achievable value of the continuous ranked probability score when only climatological information is available. It was discussed that, in contrast to the uncertainty of the Brier score, its value depends on the degree of sophistication of the climatology used. The third term, the resolution, expresses the superiority of a forecast system with respect to a forecast system based on climatology. The resolution was found to be sensitive both to the average spread within the ensemble and to the behavior of the outliers. It was shown that the proposed decomposition is not equal to the integral over the decomposition of the Brier score.

It was illustrated how the reliability part could be presented in a graphical way. In addition, it was shown how the resolution part of the CRPS can be visualized by looking at the difference between the sample climate distribution and the accumulated average bin widths of the ensemble system. As an example the decomposition for total precipitation for seven summer cases in 1999 of the ECMWF ensemble prediction system was considered.

In this paper attention was focused on ensemble forecasts, for which the allowable set of forecasted probabilities is finite. However, in general, a forecast system could issue any probability between 0 and 1. Such systems could be regarded as the limit of *N* → ∞, of an *N*-member ensemble, in which the *i*th member is positioned at the location where the cumulative distribution has the value *P*(*x*_{i}) = *p*_{i} = *i*/*N.* Therefore, the decomposition of the CRPS, given in section 4, can be extended to any continuous forecast system. As a result, the summations over probabilities *p*_{i} in the definitions of reliability, resolution, and uncertainty will transform into integrals (from 0 to 1) over probabilities. In order to evaluate such integrals for continuous systems, it is more sensible to discretize the allowable set of probabilities, than to discretize the variable *x.* Therefore, in practice, the evaluation of the CRPS and its decomposition for continuous forecast systems exactly reduces to the method proposed in section 4.

The continuous ranked probability score is a verification tool that is sensitive to the overall (with respect to a certain parameter) performance of a forecast system. By using the decomposition proposed in this paper, it was argued how for an ensemble prediction system, a detailed picture of this overall behavior can be obtained.

## Acknowledgments

The author would like to thank François Lalaurette at ECMWF and Kees Kok at KNMI for stimulating discussions.

## REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. *J. Climate,* **9,** 1518–1530.

Bouttier, F., 1994: Sur la prévision de la qualité des prévisions météorologiques. Ph.D. thesis, Université Paul Sabatier, Toulouse, France, 240 pp. [Available from Library, Université Paul Sabatier, route de Narbonne, Toulouse, France.]

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.,* **78,** 1–3.

Brown, T. A., 1974: Admissible scoring systems for continuous distributions. Manuscript P-5235, The Rand Corporation, Santa Monica, CA, 22 pp. [Available from The Rand Corporation, 1700 Main St., Santa Monica, CA 90407-2138.]

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.,* **126,** 2503–2518.

——, A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. *Wea. Forecasting,* **14,** 168–189.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.,* **8,** 985–987.

Hamill, T., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. *Mon. Wea. Rev.,* **125,** 1312–1327.

Katz, R. W., and A. H. Murphy, 1997: *Economic Value of Weather and Climate Forecasts.* Cambridge University Press, 222 pp.

Mason, I., 1982: A model for assessment of weather forecasts. *Aust. Meteor. Mag.,* **30,** 291–303.

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. *Manage. Sci.,* **22,** 1087–1095.

Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. *Quart. J. Roy. Meteor. Soc.,* **122,** 73–119.

Murphy, A. H., 1969: On the "ranked probability score." *J. Appl. Meteor.,* **8,** 988–989.

——, 1971: A note on the ranked probability score. *J. Appl. Meteor.,* **10,** 155–156.

——, 1973: A new vector partition of the probability score. *J. Appl. Meteor.,* **12,** 595–600.

——, and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. *Mon. Wea. Rev.,* **117,** 572–581.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, 1989: *Numerical Recipes: The Art of Scientific Computing.* Cambridge University Press, 818 pp.

Richardson, D., 1998: Obtaining economic value from the EPS. *ECMWF Newsletter,* Vol. 80, 8–12.

——, 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. *Quart. J. Roy. Meteor. Soc.,* **126,** 649–668.

Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. Atmospheric Environment Service Research Rep. 89-5, 114 pp. [Available from Forecast Research Division, 4905 Dufferin St., Downsview, ON M3H 5T4, Canada.]

Talagrand, O., and R. Vautard, 1997: Evaluation of probabilistic prediction systems. *Proc. ECMWF Workshop on Predictability,* Reading, United Kingdom, ECMWF, 1–25.

Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, *Ninth Conf. on Probability and Statistics in Atmospheric Sciences,* Virginia Beach, VA, Amer. Meteor. Soc., 206–213.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences.* Academic Press, 467 pp.

## APPENDIX

### Some Technical Details

In this appendix the relation between the various terms of the Brier score defined in Eq. (40) and the terms of the continuous ranked probability score given in Eq. (39) will be determined.

Let the function *I*(*x*, *a*, *b*) be defined by *I*(*x*, *a*, *b*) = 1 for *a* ⩽ *x* < *b*, and 0 otherwise. In terms of this function, the average widths *g*_{i}(*x*) and frequencies *o*_{i}(*x*) introduced in Eq. (40) can be written as

*g*_{i}(*x*) = Σ_{k} *w*_{k} *I*(*x*, *x*^{k}_{i}, *x*^{k}_{i+1}),

*g*_{i}(*x*)*o*_{i}(*x*) = Σ_{k} *w*_{k} *I*(*x*, *x*^{k}_{i}, *x*^{k}_{i+1}) *H*(*x* − *x*^{k}_{a}),

where *x*^{k}_{i} denotes ensemble member *i* of case *k*. The *g*_{i} are normalized, Σ_{i} *g*_{i}(*x*) = 1, and *o*(*x*) = Σ_{i} *g*_{i}(*x*)*o*_{i}(*x*) is related to the cumulative distribution of the sample, *o*(*x*) = *P*_{sam}(*x*). From the expressions above it follows, for 0 < *i* < *N*, that

∫_{−∞}^{∞} *g*_{i}(*x*) *dx* = *ᾱ*_{i} + *β̄*_{i}  and  ∫_{−∞}^{∞} *g*_{i}(*x*)*o*_{i}(*x*) *dx* = Σ_{k} *w*_{k} *β*^{k}_{i} = *β̄*_{i},

where *β*^{k}_{i} is the value of *β*_{i} for case *k*. These integrals are equal to *g*_{i} and *g*_{i}*o*_{i}, respectively, as defined in Eq. (32).

Table 1. Continuous ranked probability score and its decomposition into reliability, resolution, and uncertainty [see Eq. (39)] of total precipitation accumulated in the 24 h prior to the displayed forecast day, for seven cases in the summer of 1999 of the ECMWF ensemble prediction system. The dimension of these quantities is mm (24 h)^{−1}.