• Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530.
• Bouttier, F., 1994: Sur la prévision de la qualité des prévisions météorologiques. Ph.D. thesis, Université Paul Sabatier, Toulouse, France, 240 pp. [Available from Library, Université Paul Sabatier, route de Narbonne, Toulouse, France.]
• Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
• Brown, T. A., 1974: Admissible scoring systems for continuous distributions. Manuscript P-5235, The Rand Corporation, Santa Monica, CA, 22 pp. [Available from The Rand Corporation, 1700 Main St., Santa Monica, CA 90407-2138.]
• Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.
• Buizza, R., A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. Wea. Forecasting, 14, 168–189.
• Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.
• Hamill, T., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
• Katz, R. W., and A. H. Murphy, 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.
• Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
• Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1095.
• Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
• Murphy, A. H., 1969: On the “ranked probability score.” J. Appl. Meteor., 8, 988–989.
• Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.
• Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
• Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. Mon. Wea. Rev., 117, 572–581.
• Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, 1989: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 818 pp.
• Richardson, D., 1998: Obtaining economic value from the EPS. ECMWF Newsletter, Vol. 80, 8–12.
• Richardson, D., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–668.
• Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. Atmospheric Environment Service Research Rep. 89-5, 114 pp. [Available from Forecast Research Division, 4905 Dufferin St., Downsview, ON M3H 5T4, Canada.]
• Talagrand, O., and R. Vautard, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25.
• Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, Ninth Conf. on Probability and Statistics in Atmospheric Sciences, Virginia Beach, VA, Amer. Meteor. Soc., 206–213.
• Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

• Fig. 1. Sample distribution $P_{\mathrm{sam}}$, as defined in Eq. (9), for two samples of eight cases, all with equal weight. The shaded area represents the corresponding uncertainty U [see Eq. (12)]. It is proportional to the standard deviation σ of the distribution.

• Fig. 2. Cumulative distribution for an ensemble $\{x_1, \ldots, x_5\}$ of five members (thick solid line) and for the verifying analysis $x_a$ (thin solid line). The CRPS is represented by the shaded area. The $\alpha_i$ and $\beta_i$ are defined in Eq. (26).

• Fig. 3. The same as in Fig. 2 but for the case in which the verifying analysis is outside the ensemble (outlier). Only when $x_a$ is below the ensemble (left panel) is $\beta_0$ [see Eq. (27)] nonzero. Only when $x_a$ is above the ensemble (right panel) is $\alpha_N$ nonzero. The CRPS is given by the shaded area. Note that $\beta_0$ and $\alpha_N$ are weighted more strongly than the other α’s and β’s.

• Fig. 4. Decomposition of the continuous ranked probability score for total precipitation accumulated between day 2 and day 3 for seven summer cases in 1999, averaged over the European area.

• Fig. 5. The same as Fig. 4 but for day 6.

• Fig. 6. The same as Fig. 4 but for day 9.


Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems

  • 1 Koninklijk Nederlands Meteorologisch Instituut, De Bilt, Netherlands

Abstract

Some time ago, the continuous ranked probability score (CRPS) was proposed as a new verification tool for (probabilistic) forecast systems. Its focus is on the entire permissible range of a certain (weather) parameter. The CRPS can be seen as a ranked probability score with an infinite number of classes, each of zero width. Alternatively, it can be interpreted as the integral of the Brier score over all possible threshold values for the parameter under consideration. For a deterministic forecast system the CRPS reduces to the mean absolute error.

In this paper it is shown that for an ensemble prediction system the CRPS can be decomposed into a reliability part and a resolution/uncertainty part, in a way that is similar to the decomposition of the Brier score. The reliability part of the CRPS is closely connected to the rank histogram of the ensemble, while the resolution/uncertainty part can be related to the average spread within the ensemble and the behavior of its outliers. The usefulness of such a decomposition is illustrated for the ensemble prediction system running at the European Centre for Medium-Range Weather Forecasts. The evaluation of the CRPS and its decomposition proposed in this paper can be extended to systems issuing continuous probability forecasts, by realizing that these can be interpreted as the limit of ensemble forecasts with an infinite number of members.

Corresponding author address: Dr. Hans Hersbach, KNMI, P.O. Box 201, 3730 AE Utrecht, Netherlands.

Email: hersbach@knmi.nl


1. Introduction

Appropriate verification tools are essential in understanding the abilities and weaknesses of (probabilistic) forecast systems.

Verification is often focused on specific (weather) events. Such a binary event either occurs, or does not occur, and is forecast to occur or not to occur, with certain probabilities p and 1 − p respectively. Examples of such events are more than 10-mm precipitation in 24 h or an anomaly (from a climatological mean) of more than 50 m of the geopotential at 500 hPa. Several well-established tools exist that test how accurately the forecast system is able to describe the occurrence and nonoccurrence of the event under consideration, that is, how good the agreement is between the forecasted probabilities and observed states. Examples of scores, which are commonly used by operational centers such as the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction are Brier scores (Brier 1950), the Relative Operating Characteristics (ROC) curves (Mason 1982; Stanski et al. 1989), and economic cost–loss analyses (see, e.g., Katz and Murphy 1997; or Richardson 1998, 2000).

The (half) Brier score is one of the oldest verification tools in use. From its numerical value alone the quality of a forecast system is difficult to assess. An attractive property of the Brier score, however, is that it can be decomposed into a reliability, a resolution, and an uncertainty part (Murphy 1973). The reliability tests whether the forecast system has the correct statistical properties. It can be presented in a graphical way by the so-called reliability diagram. The uncertainty is the Brier score one would obtain when only the climatological frequency for the occurrence of the event is available. The resolution shows the impact obtained by issuing case-dependent probability forecasts (which do not always equal the probability based on climatology). Therefore, the decomposition of the Brier score gives a detailed insight into the performance of the forecast system with respect to the event under consideration.
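The partition of the Brier score described above can be illustrated with a minimal Python sketch. It assumes equally weighted cases and groups them by the distinct probability values that were issued; the function name `brier_decomposition` and the toy data are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Return (bs, reliability, resolution, uncertainty) for equally weighted cases.

    probs    : issued probabilities (a small set of distinct values)
    outcomes : 1 if the event occurred, 0 otherwise
    """
    n = len(probs)
    bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    base_rate = sum(outcomes) / n  # climatological frequency within the sample

    # Stratify the cases by the probability value that was issued.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)  # observed frequency given forecast p
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1.0 - base_rate)
    return bs, reliability, resolution, uncertainty


# Toy data (assumed): eight cases with three distinct issued probabilities.
probs = [0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
bs, rel, res, unc = brier_decomposition(probs, outcomes)
print(bs, rel - res + unc)  # the two numbers should agree
```

For this toy sample the score is 0.25, and reliability and resolution happen to cancel, so the system is no better than the sample climatology.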

Binary events only highlight one aspect of the forecast. Such a single aspect may be quite relevant. For instance, certain extreme events can lead to economic losses, which could be avoided with the help of an accurate forecast system. This kind of issue is addressed by the ROC curve and economic cost–loss analyses. However, it may be desirable to obtain a broader overall view of performance. Several tools in this direction exist. It should, however, be mentioned that the term overall is often still restricted to the behavior of one forecast parameter only, such as precipitation or the geopotential at 500 hPa.

An example is the Talagrand diagram (Talagrand and Vautard 1997), also known as the rank histogram (Hamill and Colucci 1997) or the binned probability ensemble (Anderson 1996). This tool is tailor-made for an ensemble system, that is, for the case in which the probability density function (PDF) is represented by an ensemble of forecasts. Given such an ensemble, its N members divide the permissible range of the parameter of interest into N + 1 bins. The verifying analysis will be found in one of these bins. If all members are assumed to be equally weighted and representative, it is expected that, on average, each bin should be equally populated by the verifying analyses. Deviations from such a flat rank histogram indicate a violation of these assumptions. For instance, too high a frequency of outliers is an indication that the average spread within the ensemble system is too low.
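Counting the rank of the verifying analysis within each ensemble can be sketched as follows. This is a minimal Python illustration for equally weighted cases; the function name `rank_histogram` and the sample values are assumptions made for the example, and ties between the analysis and a member are simply assigned to the lower bin.

```python
from bisect import bisect_left


def rank_histogram(ensembles, analyses):
    """Count how often the verifying analysis falls in each of the N + 1 bins.

    ensembles : list of ensembles, each a list of N member values
    analyses  : list of verifying analyses, one per ensemble
    """
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, xa in zip(ensembles, analyses):
        # bisect_left gives the number of members strictly below xa,
        # which is the index of the bin that contains xa.
        rank = bisect_left(sorted(members), xa)
        counts[rank] += 1
    return counts


# Four hypothetical cases with three members each.
ensembles = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [2.0, 4.0, 6.0], [1.0, 1.2, 1.4]]
analyses = [2.5, 0.1, 7.0, 1.3]
print(rank_histogram(ensembles, analyses))
```

The first and last entries of the returned list count the outliers below and above the ensemble, respectively.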

Another example is the ranked probability score (RPS) (see Epstein 1969; Murphy 1969, 1971). It is a generalization of the (half) Brier score. Instead of two options (event occurs or does not occur), the range of the parameter of interest is divided into more classes. In addition, the RPS contains a sense of distance, that is, of how far the forecast was from reality. For a deterministic forecast, for instance, the RPS is proportional to the number of classes by which the forecast missed the verifying analysis. Although the choice and number of classes may be prescribed by the specific application, the exact value of the RPS will depend on this choice. It is possible to take the limit of an infinite number of classes, each with zero width. This leads to the concept of the continuous ranked probability score (CRPS) (Brown 1974; Matheson and Winkler 1976; Unger 1985; Bouttier 1994). The CRPS has several appealing properties. First, it is sensitive to the entire permissible range of the parameter of interest. Second, unlike the RPS, its definition does not require the introduction of a number of predefined classes, on which results may depend. In addition, it can be interpreted as an integral over all possible Brier scores. Finally, for a deterministic forecast, the CRPS is equal to the mean absolute error (MAE) and, therefore, has a clear interpretation.

Despite these advantages, the CRPS is a single quantity, from which it is difficult to disentangle the detailed behavior of a forecast system. It would be desirable to be able to decompose the CRPS, as is possible for the Brier score. In this paper it is shown how this can indeed be achieved for an ensemble prediction system. In a similar way to the Brier score, the CRPS is shown to be decomposable into a reliability part, an uncertainty part, and a resolution part. The reliability part tests whether, for each bin i, the verifying analysis was found below this bin in a fraction i/N of the cases on average. It has a close relation to the rank histogram. The uncertainty part is equal to the CRPS one would obtain if only a PDF based on climatology were available. Finally, the resolution expresses the improvement gained by issuing probability forecasts that are case dependent. It is shown that the resolution is sensitive to the average ensemble spread and to the frequency and magnitude of the outliers. Finally, it is illustrated how the various contributions to the CRPS can be presented in a graphical way, like the reliability diagram of the Brier score.

The paper is organized as follows. In section 2 the CRPS is defined, and some characteristics are mentioned. The uncertainty part of the CRPS is highlighted in section 3. In section 4, the full decomposition for an ensemble system is derived. As an example, the decomposition of the CRPS for total precipitation in the ensemble prediction system (EPS) running at ECMWF is presented in section 5. A summary and some concluding remarks are made in section 6.

2. The continuous ranked probability score

Let the parameter of interest be denoted by x. For instance, x could be the 2-m temperature or 10-m wind speed. Suppose that the PDF forecast by an ensemble system is given by ρ(x) and that $x_a$ is the value that actually occurred. Then the continuous ranked probability score (Brown 1974; Matheson and Winkler 1976; Unger 1985; Bouttier 1994), expressing some kind of distance between the probabilistic forecast ρ and truth $x_a$, is defined as
$$\mathrm{CRPS} = \mathrm{CRPS}(P, x_a) = \int_{-\infty}^{\infty} \left[P(x) - P_a(x)\right]^2 dx. \tag{1}$$
Here, P and $P_a$ are cumulative distributions:
$$P(x) = \int_{-\infty}^{x} \rho(y)\,dy, \tag{2}$$
$$P_a(x) = H(x - x_a), \tag{3}$$
where
$$H(x) = \begin{cases} 0 & \text{for } x < 0, \\ 1 & \text{for } x \ge 0 \end{cases} \tag{4}$$
is the well-known Heaviside function. So, P(x) is the forecasted probability that xa will be smaller than x. Obviously, for any cumulative distribution, P(x) ∈ [0, 1], P(−∞) = 0, and P(∞) = 1. This is also true for parameters that are only defined on a subdomain of ℜ. In that case ρ(x) = 0 and P constant outside the domain of definition. The CRPS measures the difference between the predicted and occurred cumulative distributions. Its minimal value of zero is only achieved for P = Pa, that is, in the case of a perfect deterministic forecast. Note that the CRPS has the dimension of the parameter x (which enters via the integration over dx).
In practice the CRPS is averaged over an area and a number of cases:
$$\overline{\mathrm{CRPS}} = \sum_k w_k\,\mathrm{CRPS}(P_k, x_a^k), \tag{5}$$
where k labels the considered grid points and cases. The weights wk may depend on k (for instance proportional to the cosine of latitude).

The CRPS can be seen as the limit of a ranked probability score with an infinite number of classes, each with zero width.

There is a direct relation between the CRPS and the Brier score (Brier 1950). The Brier score (BS) is a verification tool for the prediction of the occurrence of a specific event. Usually, such an event is characterized by a threshold value $x_t$. The event is said to have happened (O = 1) if $x_a \le x_t$ and not happened (O = 0) if $x_a > x_t$. If p is the forecast probability that the event will occur, the Brier score is defined as
$$\mathrm{BS}(x_t) = \sum_k w_k\,(p_k - O_k)^2. \tag{6}$$
It is not difficult to see that $p_k = P_k(x_t)$ and $O_k = P_a^k(x_t)$ and therefore
$$\overline{\mathrm{CRPS}} = \int_{-\infty}^{\infty} \mathrm{BS}(x_t)\,dx_t. \tag{7}$$
For a deterministic forecast, that is, $x = x_d$ without any specified uncertainty, $P(x) = H(x - x_d)$. In that case, the integrand of Eq. (1) is either zero or one. The nonzero contributions are found in the region where $P(x)$ and $P_a(x)$ differ, which is the interval between $x_d$ and $x_a$. As a result,
$$\overline{\mathrm{CRPS}} = \sum_k w_k\,\left|x_d^k - x_a^k\right|, \tag{8}$$
which is the MAE.
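Definition (1) and the deterministic limit can be checked numerically. The following Python sketch approximates the integral with a simple midpoint rule for a forecast given by equally weighted members; the function name `crps_numeric`, the number of steps, and the sample values are assumptions chosen for the illustration, not part of the paper.

```python
def crps_numeric(members, xa, n_steps=20000):
    """Midpoint-rule approximation of Eq. (1) for equally weighted members.

    A single member corresponds to a deterministic forecast.
    """
    lo = min(min(members), xa)
    hi = max(max(members), xa)
    if hi == lo:
        return 0.0  # perfect deterministic forecast
    n = len(members)
    dx = (hi - lo) / n_steps
    total = 0.0
    # Outside [lo, hi] the forecast and observed distributions coincide,
    # so the integrand vanishes there.
    for j in range(n_steps):
        x = lo + (j + 0.5) * dx
        p = sum(1 for m in members if m <= x) / n  # forecast P(x)
        pa = 1.0 if x >= xa else 0.0               # Heaviside H(x - xa)
        total += (p - pa) ** 2 * dx
    return total


# Deterministic forecast: the CRPS reduces to the absolute error |3.0 - 1.2|.
print(crps_numeric([3.0], 1.2))
# Five-member ensemble with the verifying analysis between members 3 and 4.
print(crps_numeric([1.0, 2.0, 3.0, 4.0, 5.0], 3.4))
```

The first call should return approximately 1.8, the absolute error, and the second approximately 0.48.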

3. The uncertainty of the CRPS

For an ensemble prediction system, the forecast PDF will in general be case dependent. If instead, only climatological information about the behavior of the quantity x is available, the same probability forecast Pk = Pcli will be made for each situation. In that case,
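$$\overline{\mathrm{CRPS}}_{\mathrm{cli}} = \sum_k w_k \int_{-\infty}^{\infty} \left[P_{\mathrm{cli}}(x) - H(x - x_a^k)\right]^2 dx.$$
Note that $\sum_k w_k = 1$ by definition, and $H^2 = H$. If one defines
$$P_{\mathrm{sam}}(x) = \sum_k w_k\,H(x - x_a^k), \tag{9}$$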
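the CRPS can be rewritten as
$$\overline{\mathrm{CRPS}}_{\mathrm{cli}} = \Delta_{\mathrm{cli}} + U, \tag{10}$$
where
$$\Delta_{\mathrm{cli}} = \int_{-\infty}^{\infty} \left[P_{\mathrm{cli}}(x) - P_{\mathrm{sam}}(x)\right]^2 dx, \tag{11}$$
$$U = \int_{-\infty}^{\infty} P_{\mathrm{sam}}(x)\left[1 - P_{\mathrm{sam}}(x)\right] dx. \tag{12}$$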
The distribution Psam is the cumulative distribution based on the sample used in the verification. If, for instance, all M weights would be equal, so wk = 1/M, then Psam(x) is just the fraction of cases in which the verifying analysis was found to be smaller than x. The value of Psam(x) also equals the sample frequency of occurrence o(xt) for the Brier score with threshold xt = x.

From Eqs. (10)–(12) it is seen that the CRPS based on climatology is minimal when Pcli is equal to Psam. The impact on the CRPS due to a deviation from the sample statistics is expressed by Eq. (11).
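The algebra behind this rewriting is short and can be spelled out. The following sketch uses only $\sum_k w_k = 1$, $H^2 = H$, and definition (9); all integrals run over the whole real axis.

```latex
\begin{align*}
\overline{\mathrm{CRPS}}_{\mathrm{cli}}
  &= \sum_k w_k \int \bigl[P_{\mathrm{cli}}(x) - H(x - x_a^k)\bigr]^2 dx \\
  &= \int \Bigl[P_{\mathrm{cli}}^2(x)
       - 2\,P_{\mathrm{cli}}(x)\,P_{\mathrm{sam}}(x)
       + P_{\mathrm{sam}}(x)\Bigr] dx \\
  &= \int \bigl[P_{\mathrm{cli}}(x) - P_{\mathrm{sam}}(x)\bigr]^2 dx
     + \int P_{\mathrm{sam}}(x)\bigl[1 - P_{\mathrm{sam}}(x)\bigr] dx .
\end{align*}
```

The first integral in the last line is nonnegative and vanishes only for $P_{\mathrm{cli}} = P_{\mathrm{sam}}$; it is the term of Eq. (11). The second integral does not depend on $P_{\mathrm{cli}}$ and is the term of Eq. (12).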

The lowest possible value of a CRPS based on climatology is given by Eq. (12). It is solely determined by the climatology within the sample and does not depend on the performance of the forecast model. Expression (12) is equal to the integral of the uncertainty U (Murphy 1973; or see, e.g., Wilks 1995) of the Brier score over all possible thresholds:
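$$U = \int_{-\infty}^{\infty} U(x_t)\,dx_t, \qquad U(x_t) = o(x_t)\left[1 - o(x_t)\right]. \tag{13}$$
Here
$$o(x_t) = \sum_k w_k\,H(x_t - x_a^k) = P_{\mathrm{sam}}(x_t) \tag{14}$$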
is the observed frequency that the event x < xt occurred. Therefore, it is very natural to define U as the uncertainty of the CRPS. It is the CRPS based on the sample climatology. It is proportional to the standard deviation of the sample distribution ρsam = dPsam/dx, because the main contribution to the integral in Eq. (12) comes from the region in x where Psam is significantly different from 0 and 1. An illustration is given in Fig. 1. To be more exact, the sample distribution ρsam can always be written as
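$$\rho_{\mathrm{sam}}(x) = \frac{1}{\sigma}\,\rho_0\!\left(\frac{x}{\sigma}\right), \tag{15}$$
where $\rho_0$ is a distribution with σ = 1 (for instance similar to a standardized Gaussian) and $P_0$ [see Eq. (2)] its cumulative distribution. From the uncertainty
$$U_0 = \int_{-\infty}^{\infty} P_0(y)\left[1 - P_0(y)\right] dy \tag{16}$$
of this distribution, it follows that
$$U = \sigma\,U_0, \tag{17}$$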
so indeed proportional to σ.

It should be noted that the meaning of the term climatology depends on the desired degree of sophistication. The crudest level would be to assume the same climatological distribution at all grid points and cases. The mean climatological value of x, however, may depend strongly on location and season. The mean 2-m temperature of Norway in January, for instance, is much lower than that of Spain in March. This would result in a very broad sample distribution and, therefore, in a large uncertainty. In order to correct for this, as a first step, the variable x can be redefined as the anomaly with respect to the local climatology. The definition of the CRPS is invariant under such a shift in the variable x, as is easily seen from Eq. (1). As a consequence, the distribution $P_{\mathrm{sam}}$ will change, because for each k in Eq. (9) a different shift may have been applied. This should result in a distribution that is much sharper, so the uncertainty U in Eq. (12) should be smaller. For a parameter with a limited permissible range, like precipitation or 10-m wind speed, such an approach may not be profitable. The reason is that the sample distribution obtained in this way (based on anomalies) will, for part of the locations, lead to nonvanishing probabilities outside the permissible range.

Finally, the entire climatological distribution (so not just its mean) could be chosen to depend on the location and/or season, so $P_k = P_{\mathrm{cli,location,season}}$. For this, the best achievable distribution would be a location- and/or season-dependent sample distribution, also given by Eq. (9) but in which the sum (and the normalization of the weights) is restricted to all points k that belong to the same location and/or season. Again, the resulting uncertainty is expected to become lower. For parameters like precipitation this will also lead to a lower uncertainty.

This section will be concluded by showing how the uncertainty can be evaluated in practice. The most straightforward method is to substitute definition (9) into Eq. (12):
$$U = \sum_{k,l} w_k w_l \int_{-\infty}^{\infty} H(x - x_a^k)\left[1 - H(x - x_a^l)\right] dx. \tag{18}$$
The integrand will only be nonzero when both $H(x - x_a^k) = 1$ and $H(x - x_a^l) = 0$. This condition can only be met when $x_a^k < x_a^l$, in which case the integral is $x_a^l - x_a^k$. As a result,
$$U = \sum_{k,\,l<k} w_k w_l \left|x_a^k - x_a^l\right|. \tag{19}$$
Another way to calculate U is to realize that Eq. (9) is based on a finite number of verifying analyses. Therefore $P_{\mathrm{sam}}$ will be piecewise constant (see, e.g., Fig. 1). It is zero for x = −∞ and each time an $x_a^k$ is passed, it makes a jump of $w_k$. Beyond the largest verifying analysis in the set, $P_{\mathrm{sam}} = 1$. Now if the $x_a^k$ are ordered from small to large, then
$$U = \sum_{k=1}^{M-1} p_k (1 - p_k)\left(x_a^{k+1} - x_a^k\right), \tag{20}$$
where
$$p_k = p_{k-1} + w_{\mathrm{sort}(k)}, \qquad p_0 = 0.$$
Evaluation (19) is of order $M^2$, where M is the size of the sample set. If M becomes on the order of a few thousand, this evaluation becomes time consuming. In addition, roundoff errors are expected to become nonnegligible. Method (20) only involves a sum of order M. The price to be paid is that the $x_a^k$ should be sorted first. However, efficient sorting algorithms, such as quicksort or heapsort (see Press et al. 1989), are of order M log(M). Therefore, this latter method is still quite feasible and accurate for very large samples.
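The two ways of evaluating the uncertainty can be compared directly. The Python sketch below implements the pairwise sum of Eq. (19) and the sorted sum of Eq. (20); the function names and the four sample values are illustrative assumptions.

```python
def uncertainty_pairwise(analyses, weights):
    """Eq. (19): double sum over all pairs, of order M^2."""
    total = 0.0
    for k in range(len(analyses)):
        for l in range(k):
            total += weights[k] * weights[l] * abs(analyses[k] - analyses[l])
    return total


def uncertainty_sorted(analyses, weights):
    """Eq. (20): sort first (order M log M), then a single pass of order M."""
    pairs = sorted(zip(analyses, weights))
    total = 0.0
    p = 0.0  # cumulative weight, i.e. the value of P_sam between two analyses
    for (x_cur, w_cur), (x_next, _) in zip(pairs, pairs[1:]):
        p += w_cur
        total += p * (1.0 - p) * (x_next - x_cur)
    return total


# Four hypothetical verifying analyses with equal weights.
analyses = [2.0, 0.0, 1.0, 4.0]
weights = [0.25, 0.25, 0.25, 0.25]
print(uncertainty_pairwise(analyses, weights))
print(uncertainty_sorted(analyses, weights))
```

Both calls should give 0.8125 for this sample. Python's built-in `sorted` is used here in place of the quicksort or heapsort routines mentioned above.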

4. The CRPS for an ensemble system

a. The cumulative distribution of an ensemble

For an ensemble system, such as EPS, an equal weight is given to each of its members. Therefore, the probability assigned to the occurrence of a certain event is given by the fraction of members that predict the event. Effectively, for the variable x this means that the cumulative distribution forecasted by the ensemble system is given by
$$P(x) = \frac{1}{N} \sum_{i=1}^{N} H(x - x_i), \tag{21}$$
where $x_1, \ldots, x_N$ are the outcomes of the N ensemble members. From now on it is assumed that the members are ordered, that is,
$$x_i \le x_j \quad \text{for } i < j. \tag{22}$$
The cumulative distribution P is a piecewise constant function. Transitions occur at the values $x_i$:
$$P(x) = p_i = \frac{i}{N} \quad \text{for } x_i < x < x_{i+1}, \tag{23}$$
in which $x_0 = -\infty$ and $x_{N+1} = \infty$ are introduced for convenience. An example of the cumulative distribution for an ensemble of five members is given (thick solid curve) in Fig. 2.

b. Decomposition for a single case

The CRPS, as defined in Eq. (1), can be evaluated as follows:
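$$\mathrm{CRPS} = \sum_{i=0}^{N} c_i, \qquad c_i = \int_{x_i}^{x_{i+1}} \left[p_i - H(x - x_a)\right]^2 dx. \tag{24}$$
Depending on the position of the verifying analysis $x_a$, $H(x - x_a)$ will be either 0, or 1, or partly 0, partly 1, in the interval $[x_i, x_{i+1}]$. For each of these three possible situations, $c_i$ can be written as
$$c_i = \alpha_i\,p_i^2 + \beta_i\,(1 - p_i)^2, \tag{25}$$
where
$$\begin{array}{lll}
0 < i < N & \alpha_i & \beta_i \\
x_a > x_{i+1} & x_{i+1} - x_i & 0 \\
x_{i+1} > x_a > x_i & x_a - x_i & x_{i+1} - x_a \\
x_a < x_i & 0 & x_{i+1} - x_i
\end{array} \tag{26}$$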
Note that the αi and βi have the dimension of the parameter x.

For the example given in Fig. 2, the verifying analysis is in between x3 and x4. Therefore, for this case β = 0 for i = 1 and 2, and α = 0 for i = 4. Only for i = 3 both α and β are nonzero.

Some care should be taken for i = 0 and i = N. These concern the intervals (−∞, x1] and [xN, ∞), respectively, and for which pi = 0 and pi = 1, respectively. These two intervals will only contribute to the CRPS in cases when the verifying analysis is an outlier, that is, when it is outside the range of the ensemble. In this situation Eq. (25) can also be used, but with
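$$\begin{array}{lll}
\text{outlier} & \alpha_i & \beta_i \\
x_a < x_1 & \alpha_0 = 0 & \beta_0 = x_1 - x_a \\
x_N < x_a & \alpha_N = x_a - x_N & \beta_N = 0
\end{array} \tag{27}$$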
In Fig. 3 an example is given in which the verifying analysis is found to be below the ensemble (left panel) and above the ensemble (right panel). For the first case, there will be a contribution from β0, being the difference between xa and the smallest ensemble member. In the second case, αN is nonzero and equal to the distance of xa from the largest ensemble member. Outliers can contribute significantly to the CRPS, because nonzero values of β0 and αN are weighted stronger than other α’s and β’s (see, e.g., the shaded areas in Fig. 3).
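The single-case evaluation described in this subsection can be sketched in Python as follows. The code builds the $\alpha_i$ and $\beta_i$ for i = 0, . . . , N, including the two outlier bins, and sums the contributions with $p_i = i/N$. The function name `crps_single_case` and the sample ensemble are assumptions made for the illustration.

```python
def crps_single_case(members, xa):
    """Return (crps, alpha, beta) for one ensemble and one verifying analysis.

    alpha[i] and beta[i] hold the quantities for bin i = 0, ..., N,
    where bins 0 and N are the outlier bins.
    """
    x = sorted(members)
    n = len(x)
    alpha = [0.0] * (n + 1)
    beta = [0.0] * (n + 1)

    # Outlier bins: only nonzero when xa lies outside the ensemble.
    if xa < x[0]:
        beta[0] = x[0] - xa
    if xa > x[-1]:
        alpha[n] = xa - x[-1]

    # Inner bins: bin i lies between member i and member i + 1 (1-based),
    # that is, between x[i - 1] and x[i] in 0-based list indexing.
    for i in range(1, n):
        lo, hi = x[i - 1], x[i]
        if xa >= hi:        # analysis above the bin
            alpha[i] = hi - lo
        elif xa <= lo:      # analysis below the bin
            beta[i] = hi - lo
        else:               # analysis inside the bin
            alpha[i] = xa - lo
            beta[i] = hi - xa

    crps = sum(alpha[i] * (i / n) ** 2 + beta[i] * (1.0 - i / n) ** 2
               for i in range(n + 1))
    return crps, alpha, beta


# Analysis between members 3 and 4, as in the situation of Fig. 2.
print(crps_single_case([1.0, 2.0, 3.0, 4.0, 5.0], 3.4)[0])
# Analysis below the ensemble (outlier).
print(crps_single_case([1.0, 2.0, 3.0, 4.0, 5.0], 0.0)[0])
```

For these two hypothetical cases the scores should be approximately 0.48 and 2.2. In the outlier case the bin i = 0 alone contributes 1.0, because it enters with the full weight $(1 - p_0)^2 = 1$.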

c. The average over a set of cases

For M cases and/or grid points, each with a weight wk, the average CRPS [Eq. (5)] can be found as
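$$\overline{\mathrm{CRPS}} = \sum_{i=0}^{N} \left[\bar{\alpha}_i\,p_i^2 + \bar{\beta}_i\,(1 - p_i)^2\right], \tag{28}$$
where
$$\bar{\alpha}_i = \sum_k w_k\,\alpha_i^k, \qquad \bar{\beta}_i = \sum_k w_k\,\beta_i^k \tag{29}$$
are the weighted average values of $\alpha_i$ and $\beta_i$.
The quantities $\bar{\alpha}_i$ and $\bar{\beta}_i$ can be expressed in terms of two quantities $\bar{g}_i$ and $\bar{o}_i$, which both have a physical interpretation. First the case 0 < i < N is considered. Let
$$\bar{g}_i = \bar{\alpha}_i + \bar{\beta}_i, \tag{30}$$
$$\bar{o}_i = \frac{\bar{\beta}_i}{\bar{\alpha}_i + \bar{\beta}_i}. \tag{31}$$
It can be seen from Eq. (26) that $\bar{g}_i$ is the average width of bin number i:
$$\bar{g}_i = \overline{x_{i+1} - x_i} = \sum_k w_k \left(x_{i+1}^k - x_i^k\right). \tag{32}$$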
For the moment concentrate on a specific value of i. Then, for most cases, the verifying analysis will not lie in the interval $[x_i, x_{i+1}]$. Therefore, usually, $\alpha_i$ will be zero and $\beta_i$ equal to the width of bin number i, or vice versa. The first case applies when the verifying analysis was found to be smaller than ensemble member i, as can be seen from Eq. (26); the second applies when it was found to be larger than member i + 1. With this in mind, $\bar{o}_i$ can be seen to be closely related to the average frequency that the verifying analysis was found to be below $\tfrac{1}{2}(x_i + x_{i+1})$. Ideally these observed frequencies should match the forecast probability that the verifying analysis is to be found below the ith interval. Such a consistency is closely related to the flatness of the rank histogram [also known as Talagrand diagram or binned probability ensemble; see, e.g., Anderson (1996), Talagrand and Vautard (1997), or Hamill and Colucci (1997)].
For the outliers, $\bar{o}_0$ and $\bar{o}_N$ are defined as the (weighted) frequencies that $x_a$ was found to be smaller than $x_1$ and $x_N$, respectively. Here, $\bar{g}_0$ and $\bar{g}_N$ are defined as the average length of the outlier, given that it occurred:
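$$\bar{g}_0 = \frac{\bar{\beta}_0}{\bar{o}_0}, \qquad \bar{g}_N = \frac{\bar{\alpha}_N}{1 - \bar{o}_N}. \tag{33}$$
The reader may verify that for all i = 0, . . . , N, so including the outliers,
$$\bar{\alpha}_i\,p_i^2 + \bar{\beta}_i\,(1 - p_i)^2 = \bar{g}_i\left[(\bar{o}_i - p_i)^2 + \bar{o}_i (1 - \bar{o}_i)\right]. \tag{34}$$
The average CRPS [see Eq. (5)] can now be decomposed as
$$\overline{\mathrm{CRPS}} = \mathrm{Reli} + \mathrm{CRPS}_{\mathrm{pot}}, \tag{35}$$
where
$$\mathrm{Reli} = \sum_{i=0}^{N} \bar{g}_i\,(\bar{o}_i - p_i)^2, \tag{36}$$
$$\mathrm{CRPS}_{\mathrm{pot}} = \sum_{i=0}^{N} \bar{g}_i\,\bar{o}_i\,(1 - \bar{o}_i). \tag{37}$$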
This decomposition looks similar to the decomposition of the Brier score as it was introduced by Murphy (Murphy 1973; or see, e.g., Wilks 1995). The interpretation, however, is somewhat different.

The quantity Reli is identified as the reliability part of the CRPS. For a Brier score the reliability tests whether, for all cases in which a certain probability p was forecast, the event occurred on average with that fraction p. Here, it is tested whether, on average, the frequency $\bar{o}_i$ that the verifying analysis was found to be below the middle of interval number i is proportional to i/N. Therefore, it is tested here whether the ensemble is capable of generating cumulative distributions that have, on average, this desired statistical property. The reliability (36) is closely connected to the rank histogram, which shows whether the frequency that the verifying analysis was found in bin number i is equal for all bins. The rank histogram does not take the width of the ensemble into account. It only counts how often the verifying analysis was located in a bin, regardless of the width of the bins. The reliability Reli does take this into account, because the larger a bin width (and therefore the larger the spread), the more weight it has in $\bar{\alpha}_i$ and $\bar{\beta}_i$ and therefore in $\bar{o}_i$. Note that Reli has a dimension (of x), while the reliability of the Brier score is dimensionless.

The term $\mathrm{CRPS}_{\mathrm{pot}}$ given in Eq. (37) is called the potential CRPS (in analogy with Murphy and Epstein 1989), because it is the CRPS one would obtain if the probabilities $p_i$ were retuned such that the system became perfectly reliable, that is, such that Reli = 0. It is sensitive to the average spread of the ensemble. The narrower the ensemble system, the smaller the $\bar{g}_i$ and the smaller Eq. (37). The potential CRPS is also sensitive to outliers. Too many and too large outliers will result in large values of $\bar{g}_0 \bar{o}_0$ and $\bar{g}_N (1 - \bar{o}_N)$ and therefore affect $\mathrm{CRPS}_{\mathrm{pot}}$ considerably. Although the small average bin widths $\bar{g}_1, \ldots, \bar{g}_N$ of an ensemble system with too small a spread may have a positive impact on the potential CRPS, the too high frequency of outliers and the large magnitudes of such outliers will have a clear negative impact. Given a certain degree of unpredictability, the optimal value for $\mathrm{CRPS}_{\mathrm{pot}}$ will be achieved for an ensemble system in which the spread and the statistics of outliers are in balance.

The uncertainty U as defined in (12) can be seen as the potential CRPS for a forecast system based on the sample climatology. Such a system is, by definition, perfectly reliable. To see the relation between Eqs. (12) and (37), the integral over x in Eq. (12) is to be approximated by a sum over intervals $\Delta x_i$, each representing an equal part of 1/N of integrated probability. The $\Delta x_i$ may be identified with the widths $\bar{g}_i$ and the $P_{\mathrm{sam}}(x_i)$ with the observed frequencies $\bar{o}_i$. As a result, these approximations lead to Eq. (37). It is clearly desirable for an ensemble system that $\mathrm{CRPS}_{\mathrm{pot}}$ be smaller than the potential CRPS based on climatology. Therefore, the potential CRPS may, although perhaps somewhat artificially, be further decomposed into
$$\mathrm{CRPS}_{\mathrm{pot}} = U - \mathrm{Resol}. \tag{38}$$
This gives the following decomposition:
$$\overline{\mathrm{CRPS}} = \mathrm{Reli} - \mathrm{Resol} + U. \tag{39}$$
The resolution Resol is nothing else than the difference between the potential CRPS and the climatological uncertainty. The ensemble system has positive resolution if it performs better than the climatological probabilistic forecast. In the previous section it was discussed that the uncertainty (12) depends on the level of sophistication. Therefore, the same is true for the resolution. Unlike the resolution of the Brier score, the resolution part of the CRPS need not be positive definite.
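The full decomposition over a set of cases can be sketched in Python as below. The code averages the $\alpha_i$ and $\beta_i$, forms the average bin widths and observed frequencies (including the outlier definitions), and then evaluates the reliability, the potential CRPS, the uncertainty, and the resolution. Function names, the dictionary keys, the guards against empty outlier bins, and the two sample cases are assumptions made for this illustration.

```python
def alpha_beta(members, xa):
    """alpha_i and beta_i for i = 0, ..., N for a single case."""
    x = sorted(members)
    n = len(x)
    alpha = [0.0] * (n + 1)
    beta = [0.0] * (n + 1)
    if xa < x[0]:
        beta[0] = x[0] - xa
    if xa > x[-1]:
        alpha[n] = xa - x[-1]
    for i in range(1, n):
        lo, hi = x[i - 1], x[i]
        alpha[i] = min(max(xa - lo, 0.0), hi - lo)
        beta[i] = min(max(hi - xa, 0.0), hi - lo)
    return alpha, beta


def crps_decomposition(ensembles, analyses, weights=None):
    m = len(analyses)
    n = len(ensembles[0])
    w = weights if weights is not None else [1.0 / m] * m
    a_bar = [0.0] * (n + 1)
    b_bar = [0.0] * (n + 1)
    o_low = 0.0   # weighted frequency of xa below the smallest member
    o_high = 0.0  # weighted frequency of xa below the largest member
    for members, xa, wk in zip(ensembles, analyses, w):
        alpha, beta = alpha_beta(members, xa)
        for i in range(n + 1):
            a_bar[i] += wk * alpha[i]
            b_bar[i] += wk * beta[i]
        if xa < min(members):
            o_low += wk
        if xa < max(members):
            o_high += wk

    g = [0.0] * (n + 1)
    o = [0.0] * (n + 1)
    for i in range(1, n):                    # inner bins
        g[i] = a_bar[i] + b_bar[i]
        o[i] = b_bar[i] / g[i] if g[i] > 0 else 0.0
    o[0], o[n] = o_low, o_high               # outlier bins
    g[0] = b_bar[0] / o_low if o_low > 0 else 0.0
    g[n] = a_bar[n] / (1.0 - o_high) if o_high < 1 else 0.0

    reli = sum(g[i] * (o[i] - i / n) ** 2 for i in range(n + 1))
    crps_pot = sum(g[i] * o[i] * (1.0 - o[i]) for i in range(n + 1))

    # Uncertainty of the sample climatology, using the sorted evaluation.
    pairs = sorted(zip(analyses, w))
    unc, p = 0.0, 0.0
    for (x_cur, w_cur), (x_next, _) in zip(pairs, pairs[1:]):
        p += w_cur
        unc += p * (1.0 - p) * (x_next - x_cur)

    return {"crps": reli + crps_pot, "reli": reli, "crps_pot": crps_pot,
            "resol": unc - crps_pot, "unc": unc}


# Two hypothetical cases sharing the same five-member ensemble.
result = crps_decomposition([[1.0, 2.0, 3.0, 4.0, 5.0]] * 2, [3.4, 0.0])
print(result)
```

With only two cases the numbers carry no statistical meaning; the example merely shows that the reliability and potential CRPS add up to the average of the two single-case scores (about 1.34 here).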

d. Relation to the decomposition of the Brier score

In section 2 it was shown that the CRPS can be seen as an integral of the Brier score over all possible thresholds [see Eq. (7)]. The question may emerge whether the terms in decomposition (39) are also equal to the reliability, resolution, and uncertainty of the Brier score integrated over all possible thresholds.

The Brier score defined by Eq. (6) (with thresholds x) may be stratified with respect to the set of allowable probabilities pi = 0, 1/N, . . . , 1:
$$\mathrm{BS}(x) = \sum_{i=0}^{N} g_i(x)\left\{o_i(x)\,(1 - p_i)^2 + \left[1 - o_i(x)\right] p_i^2\right\}. \tag{40}$$
Here gi is the (weighted) fraction of cases in which a probability p = pi was issued, while oi is the fraction of such cases in which indeed the event was observed. Note that both quantities depend on the value of the threshold x.
After some algebra, it follows that the Brier score can be decomposed into
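$$\mathrm{BS}(x) = \mathrm{Reli}(x) - \mathrm{Resol}(x) + U(x), \quad \mathrm{Reli}(x) = \sum_{i=0}^{N} g_i(x)\,[o_i(x) - p_i]^2, \quad \mathrm{Resol}(x) = \sum_{i=0}^{N} g_i(x)\,[o_i(x) - o(x)]^2, \quad U(x) = o(x)\,[1 - o(x)], \tag{41}$$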
where
$$o(x) = \sum_{i=0}^{N} g_i(x)\,o_i(x) \tag{42}$$
is the (weighted) frequency that the event occurred within the sample. In the appendix, Eq. (42) is shown to be equal to definition (14). There it is also shown [see Eqs. (A8)–(A11)] that the integral of gi(x) and gi(x)oi(x) over x is equal to the gi and gioi, respectively, defined by Eqs. (30)–(33). When integral (7) is performed, the relation between decompositions (39) and (41) can be established:
i1520-0434-15-5-559-e43
where
i1520-0434-15-5-559-e44
Here
i1520-0434-15-5-559-e47
where for zi = o0, o1, . . . , (1 − oN)
i1520-0434-15-5-559-e48
In general, D will be nonzero. Therefore, the integrals of the reliability and resolution of the Brier score over all possible thresholds differ, in general, from the reliability and resolution, respectively, of the CRPS. Only the integral over all uncertainties U(x) is equal to the uncertainty of the CRPS. Using Eqs. (A8) and (A9), it is not difficult to see that for 0 < i < N,
i1520-0434-15-5-559-e49
from which it follows that these terms in D are positive definite. They are zero only when oi(x) does not depend on x. Therefore this part of 〈Reli〉 is stricter than the corresponding part of Reli, because 〈Reli〉 insists on perfect reliability for all possible events, while Reli concentrates on the overall reliability of the system. For the outliers the integral over gi(x) is infinite, and therefore Eq. (49) is not valid for i = 0, N.

The quantities 〈Reli〉 and 〈Resol〉, as well as D, involve integrals over $g_i o_i^2$. These integrals are, in contrast to integrals over gi and gioi (see the appendix), difficult to perform analytically. Therefore, in practice, it is a tedious procedure to evaluate 〈Reli〉 and 〈Resol〉. Besides, 〈Reli〉 does not have the same clear relation to the rank histogram as Reli has. For these reasons, decomposition (39) is to be preferred over decomposition (43).
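For comparison, the threshold-wise decomposition (41) of the Brier score can be evaluated directly at any single threshold. The following Python sketch is a rough illustration under stated assumptions: the event is taken to be "the verifying value lies at or below the threshold x", the issued probability is the fraction of members at or below x, all cases carry equal weight, and the function name `brier_decomposition` is hypothetical.

```python
from collections import defaultdict


def brier_decomposition(ensembles, observations, threshold):
    """Return (bs, reli, resol, unc) of the Brier score at one threshold.

    Cases are stratified by the issued probability p_i = i / N, where i is
    the number of members at or below the threshold (an assumed convention).
    """
    n_cases = len(observations)
    n_mem = len(ensembles[0])
    counts = defaultdict(int)  # number of cases in which p_i was issued
    hits = defaultdict(int)    # of those, cases in which the event occurred
    bs = 0.0
    for ens, xa in zip(ensembles, observations):
        i = sum(1 for x in ens if x <= threshold)
        event = 1 if xa <= threshold else 0
        counts[i] += 1
        hits[i] += event
        bs += (i / n_mem - event) ** 2 / n_cases

    o_bar = sum(hits.values()) / n_cases  # sample frequency of the event
    reli = resol = 0.0
    for i, c in counts.items():
        g_i = c / n_cases   # fraction of cases with probability p_i
        o_i = hits[i] / c   # observed frequency given p_i
        reli += g_i * (o_i - i / n_mem) ** 2
        resol += g_i * (o_i - o_bar) ** 2
    unc = o_bar * (1.0 - o_bar)
    return bs, reli, resol, unc


# Usage with made-up numbers (three cases, four members each).
ens = [[0.0, 1.2, 2.5, 0.4], [3.1, 2.0, 2.2, 4.0], [0.0, 0.0, 0.3, 1.1]]
obs = [0.9, 5.0, 0.0]
print(brier_decomposition(ens, obs, threshold=1.0))
```

Repeating this on a fine grid of thresholds and summing gives numerical estimates of the integrated terms, which is one way to see how laborious the integrated decomposition is compared with decomposition (39).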

5. Decomposition for the EPS at ECMWF

The ideas developed in the previous sections will be illustrated by the performance of the ensemble prediction system running at ECMWF. This ensemble forecasting system (see Molteni et al. 1996; Buizza and Palmer 1998; Buizza et al. 1999) consists of 50 perturbed forecasts plus a control forecast integrated with the ECMWF TL159L31 primitive equation (PE) model up to day 10. For seven cases in the summer of 1999, the CRPS of total precipitation has been evaluated for the European area (30.0°–72.5°N, 22.5°W–42.5°E) using a grid spacing of 2.5° in both the latitudinal and the longitudinal direction (486 grid points). The weights wk [see Eq. (5)] were chosen to be proportional to the cosine of latitude. The verifying analysis was taken to be the precipitation accumulated within the first 24 h of the ECMWF operational TL319L50 PE model forecasts [for a discussion of this choice, see the appendix of Buizza et al. (1999)].
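The verification grid and its weights can be reproduced with a few lines of Python. This is only a sketch: the text states that the weights are proportional to the cosine of latitude, while scaling them to sum to one is an assumption made here, and the function name `cosine_latitude_weights` is hypothetical.

```python
import math


def cosine_latitude_weights(lat_min=30.0, lat_max=72.5,
                            lon_min=-22.5, lon_max=42.5, step=2.5):
    """Return (points, weights) for a regular latitude-longitude grid.

    Weights are proportional to cos(latitude); normalizing them to sum
    to one is an assumption of this sketch.
    """
    n_lat = round((lat_max - lat_min) / step) + 1
    n_lon = round((lon_max - lon_min) / step) + 1
    points = [(lat_min + i * step, lon_min + j * step)
              for i in range(n_lat) for j in range(n_lon)]
    raw = [math.cos(math.radians(lat)) for lat, _ in points]
    total = sum(raw)
    return points, [r / total for r in raw]


points, weights = cosine_latitude_weights()
print(len(points))      # grid points in the European area
print(7 * len(points))  # verifying values for seven cases
```

With the default arguments the grid has 18 latitudes and 27 longitudes, that is, the 486 grid points quoted above.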

Table 1 shows the CRPS and its decomposition (39) for forecast days 2 to 10. The continuous ranked probability score grows gradually (although not monotonically) from 0.98 mm (24 h)−1 at day 2 to 1.31 mm (24 h)−1 at day 10, expressing a decrease in predictability with forecast time. The reliability forms only a small part of the CRPS and tends to decrease with forecast time; apparently, the reliability is less optimal for the first forecast days. The uncertainty shown in Table 1 is based on sample distributions to which no corrections for anomalies or location were applied. It fluctuates somewhat from day to day, expressing differences in the sample distributions (each consisting of 3402 verifying analyses) obtained for the various forecast days. The resolution decreases strongly, from 0.322 mm (24 h)−1 at day 2 to 0.086 mm (24 h)−1 at day 10. Therefore, for the first days the EPS significantly outperforms a forecast based on climatology, while for longer forecast periods there is an onset of convergence to climatology.

To understand these trends in more detail, Figs. 4, 5, and 6 display a graphical representation of the reliability, uncertainty, and resolution for forecast days 3, 6, and 9, respectively. In the top panels the observed frequencies oi as defined in Eqs. (31) and (33) are plotted as a function of the fraction of members pi. Any deviation from the diagonal contributes to the reliability Reli defined in Eq. (36). The lower panels of Figs. 4–6 show (staircase curve) the accumulation of the average bin widths gi, as defined in Eq. (30). The leftmost and rightmost bins show the average magnitude g0 and gN, respectively, of the outliers [see Eq. (33)]. The width of this curve determines the potential CRPS, because CRPSpot can be seen as the integral over this curve with the weight function oi(1 − oi). The narrower the staircase curve, the smaller the region in which the weight function differs significantly from zero and, as a result, the smaller CRPSpot. In addition, the lower panels show the cumulative distribution (“smooth” curve) of the sample climatology, as defined in Eq. (9). As illustrated by Fig. 1, for example, the uncertainty U is proportional to the width of Psam. Moreover (see the discussion at the end of section 4c), it can be seen as the expected CRPS of a forecast system based on the climatology of the sample. The difference in width between the staircase curve and the cumulative distribution is therefore a measure of the resolution (38).

The discrepancy from perfect reliability for the first forecast days is mainly due to the lower bins of the ensembles, as can be seen in Fig. 4 for day 3. The frequency with which the verifying analysis is found below these bins is too high. Too often, all members predict at least some precipitation while it remains dry (based on climatology, as can be seen from Psam in the lower panel of Fig. 4, the probability that it remains dry is about 50%). However, for these cases, the amount of precipitation of the member with the smallest amount of rain is on average quite small (around 0.3 mm; see g0 in the bottom panel of Fig. 4). Therefore this mild overestimation of precipitation does not contribute very strongly to Reli. Such detail would not be visible in the rank histogram, which would only show too high a frequency of outliers.

The high resolution of the EPS for day 3 can clearly be seen in the bottom panel of Fig. 4. The average bin widths of the ensemble, including the outliers, are considerably smaller than the width of Psam. The climatological distribution has a long tail for high amounts of precipitation. Apparently, for such cases, the EPS was capable of generating sharp ensembles with fair amounts of precipitation. This is the reason why the size of the outlier gN is reasonably small. The reduction of resolution with increasing forecast time is well illustrated by comparing the lower panels of Figs. 4–6. At day 3, the ensemble is much sharper than Psam, while at day 9 it is quite similar to the sample distribution, leaving only a low value of resolution.

6. Concluding remarks

In this paper it was shown how, for an ensemble prediction system, the continuous ranked probability score can be decomposed into three parts. This decomposition is very similar to that of the Brier score. The first part, reliability, is closely related to the rank histogram. An important difference, however, is that the reliability of the CRPS is sensitive to the width of the ensemble bins, while the rank histogram gives each forecast the same weight. The reliability should be zero for an ensemble system with the correct statistical properties. The second part, uncertainty, is the best achievable value of the continuous ranked probability score when only climatological information is available. It was discussed that, in contrast to the uncertainty of the Brier score, the value of the uncertainty depends on the degree of sophistication. The third term, the resolution, expresses the superiority of a forecast system with respect to a forecast system based on climatology. The uncertainty/reliability part was found to be sensitive both to the average spread within the ensemble and to the behavior of the outliers. It was shown that the proposed decomposition is not equal to the integral over the decomposition of the Brier score.

It was illustrated how the reliability part could be presented in a graphical way. In addition, it was shown how the resolution part of the CRPS can be visualized by looking at the difference between the sample climate distribution and the accumulated average bin widths of the ensemble system. As an example the decomposition for total precipitation for seven summer cases in 1999 of the ECMWF ensemble prediction system was considered.

In this paper attention was focused on ensemble forecasts, for which the allowable set of forecast probabilities is finite. However, in general, a forecast system could issue any probability between 0 and 1. Such systems could be regarded as the limit N → ∞ of an N-member ensemble, in which the ith member is positioned at the location where the cumulative distribution has the value P(xi) = pi = i/N. Therefore, the decomposition of the CRPS given in section 4 can be extended to any continuous forecast system. As a result, the summations over probabilities pi in the definitions of reliability, resolution, and uncertainty transform into integrals (from 0 to 1) over probabilities. To evaluate such integrals for continuous systems, it is more sensible to discretize the allowable set of probabilities than to discretize the variable x. Therefore, in practice, the evaluation of the CRPS and its decomposition for continuous forecast systems reduces exactly to the method proposed in section 4.
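The idea of discretizing the probabilities rather than the variable x can be illustrated for a single continuous forecast. The Python sketch below does not implement the method of section 4; it swaps in the standard quantile (pinball loss) form of the CRPS, evaluated on a regular grid of probability levels, and compares the result with the well-known closed-form CRPS of a Gaussian forecast. The Gaussian forecast, the midpoint levels, and both function names are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def crps_by_probability_grid(quantile, y, n_levels=1000):
    """Approximate the CRPS by a midpoint sum over probability levels.

    quantile : function returning the forecast quantile for 0 < p < 1
    y        : verifying value
    Uses CRPS = 2 * integral over p of (1[y <= q(p)] - p) * (q(p) - y).
    """
    total = 0.0
    for i in range(n_levels):
        p = (i + 0.5) / n_levels  # midpoint of the ith probability interval
        q = quantile(p)
        indicator = 1.0 if y <= q else 0.0
        total += 2.0 * (indicator - p) * (q - y)
    return total / n_levels


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)."""
    z = (y - mu) / sigma
    std = NormalDist()
    return sigma * (z * (2.0 * std.cdf(z) - 1.0)
                    + 2.0 * std.pdf(z) - 1.0 / math.sqrt(math.pi))


forecast = NormalDist(mu=2.0, sigma=1.5)
approx = crps_by_probability_grid(forecast.inv_cdf, y=3.0)
exact = crps_gaussian(2.0, 1.5, 3.0)
print(approx, exact)
```

Because the probability grid is bounded between 0 and 1, no truncation of the range of x is needed, which is the practical advantage alluded to above.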

The continuous ranked probability score is a verification tool that is sensitive to the overall (with respect to a certain parameter) performance of a forecast system. It was shown how, by using the decomposition proposed in this paper, a detailed picture of this overall behavior can be obtained for an ensemble prediction system.

Acknowledgments

The author would like to thank François Lalaurette at ECMWF and Kees Kok at KNMI for stimulating discussions.

REFERENCES

  • Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530.
  • Bouttier, F., 1994: Sur la prévision de la qualité des prévisions météorologiques. Ph.D. thesis, Université Paul Sabatier, Toulouse, France, 240 pp. [Available from Library, Université Paul Sabatier, route de Narbonne, Toulouse, France.]
  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
  • Brown, T. A., 1974: Admissible scoring systems for continuous distributions. Manuscript P-5235, The Rand Corporation, Santa Monica, CA, 22 pp. [Available from The Rand Corporation, 1700 Main St., Santa Monica, CA 90407-2138.]
  • Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.
  • Buizza, R., A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. Wea. Forecasting, 14, 168–189.
  • Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.
  • Hamill, T., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
  • Katz, R. W., and A. H. Murphy, 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.
  • Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
  • Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1095.
  • Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
  • Murphy, A. H., 1969: On the “ranked probability score.” J. Appl. Meteor., 8, 988–989.
  • Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
  • Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. Mon. Wea. Rev., 117, 572–581.
  • Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, 1989: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 818 pp.
  • Richardson, D., 1998: Obtaining economic value from the EPS. ECMWF Newsletter, Vol. 80, 8–12.
  • Richardson, D., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–668.
  • Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. Atmospheric Environment Service Research Rep. 89-5, 114 pp. [Available from Forecast Research Division, 4905 Dufferin St., Downsview, ON M3H 5T4, Canada.]
  • Talagrand, O., and R. Vautard, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25.
  • Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, Ninth Conf. on Probability and Statistics in Atmospheric Sciences, Virginia Beach, VA, Amer. Meteor. Soc., 206–213.
  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

APPENDIX

Some Technical Details

In this appendix the relation between the various terms of the Brier score defined in Eq. (40) and the terms of the continuous ranked probability score given in Eq. (39) will be determined.

Let the function I(x, a, b) be defined by
i1520-0434-15-5-559-ea1
then the densities gi(x) and frequencies oi(x) introduced in Eq. (40) can be written as
i1520-0434-15-5-559-ea2
By using the property
i1520-0434-15-5-559-ea4
it is evident that
i1520-0434-15-5-559-ea5
So the gi are normalized and o(x) is related to the cumulative distribution of the sample. From the expressions
i1520-0434-15-5-559-ea7
0 < i < N, where βki is defined by Eq. (26), it follows that
i1520-0434-15-5-559-ea8
where gi and gioi are defined in Eq. (32).
For the outliers one has to keep in mind that g0(−∞) = 1 and gN(∞) = 1, and therefore Eq. (A8) would become infinite. With the help of definitions (27) and (33), the reader may verify that
i1520-0434-15-5-559-ea10

Fig. 1.

Sample distribution Psam, as defined in Eq. (9), for two samples of eight cases, all with equal weight. The shaded area represents the corresponding uncertainty U [see Eq. (12)]. It is proportional to the standard deviation σ of the distribution.

Citation: Weather and Forecasting 15, 5; 10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2

Fig. 2.

Cumulative distribution for an ensemble {x1, . . . , x5} of five members (thick solid line) and for the verifying analysis xa (thin solid line). The CRPS is represented by the shaded area. The αi and βi are defined in Eq. (26).

Fig. 3.

The same as in Fig. 2 but for the case in which the verifying analysis is outside the ensemble (outlier). Only when xa is below the ensemble (left panel) is β0 [see Eq. (27)] nonzero, and only when xa is above the ensemble (right panel) is αN nonzero. The CRPS is given by the shaded area. Note that β0 and αN are weighted more strongly than the other α’s and β’s.

Fig. 4.

Decomposition of the continuous ranked probability score for total precipitation accumulated between day 2 and day 3 for seven summer cases in 1999 and averaged over the European area.

Fig. 5.

The same as Fig. 4 but for day 6.

Fig. 6.

The same as Fig. 4 but for day 9.

Table 1.

Continuous ranked probability score and its decomposition into reliability, resolution, and uncertainty [see Eq. (39)] of total precipitation accumulated in the 24 h prior to the displayed forecast day for seven cases in the summer of 1999 for the ECMWF ensemble prediction system. The dimension of these quantities is mm (24 h)−1.