1. Introduction
Useful probabilistic forecasts have long been a goal in operational weather forecasting, as has the idea that, by its very nature, the meteorological problem makes a probabilistic solution “desirable if not inevitable” (Petterssen 1956). Modern telecommunications allow the user of weather model output to construct weather forecasts using simulations from several operational centers. Furthermore, ensembles of simulations under the same model provide flow-dependent uncertainty information, superior to the traditional use of historical errors, for translating simulations into probabilistic forecasts (Palmer 2000; Palmer et al. 2005).
As probability forecasts become more common, the need to select one method from among the plethora of alternatives for constructing and tuning probabilistic forecasts, together with growing interest in quantifying improvements in probabilistic forecasting techniques (Jolliffe and Stephenson 2003), has stimulated the development or adoption of a number of scores (Wilks 1995; Gneiting and Raftery 2004; Roulston and Smith 2002). While the true value of a forecast lies in its utility to the end user, scores are fundamental to the performance analysis of probabilistic forecasts and, ideally, provide a general measure of future forecast quality, independent of any specific end user. We examine several scores in detail in section 3, each of which aims to quantify the quality of a probabilistic forecast system given a series of forecast–verification pairs. The main aim of this paper is to demonstrate the requirements the scores must satisfy to ensure internally consistent forecast evaluation, rather than to show how scores could be employed in connection with forecast archives to evaluate or improve probabilistic forecast systems.
Our main focus is on the importance of using proper scores, as outlined in section 4. After defining this property, we demonstrate that only proper scores are internally consistent in the sense that a forecast probability distribution receives an optimal expected score when the verification is, in fact, drawn from that probability distribution. Using proper scores may have other positive side effects on the behavior of forecasters, as argued by Murphy and Winkler (1987). While discussions of motivating the honesty of forecasters are sometimes wide-ranging, the importance of using proper scores can be motivated solely on the grounds that, mathematically, only proper scores are internally consistent.
Scores used as examples in this paper include the ignorance score, the Brier score, and the naive linear score. Although widely discussed (Wilson et al. 1999), the naive linear score is not a proper score; we derive a variant of the linear score that is proper. In section 5 we consider the issues surrounding the notion of locality and note that uncertain observations may drive us to use generalized scores. Concluding remarks are made in section 6.
2. Probabilistic forecasts
To give a mathematical definition of a probabilistic forecast, consider a variable of interest, say the temperature at London Heathrow Airport on a specific day. We use the symbol X to denote the observed value of that variable; the corresponding lowercase x denotes any possible value in the range of X. In the case of London Heathrow temperature, x could be any real number larger than −273°C. In this paper we focus on probabilistic forecasts in the form of probability density functions (PDFs) p(x), which express, on the basis of the information in hand, uncertainty about the value X will take. By p(x) we denote the entire function, while the notation p(X) always denotes the value of that function at the particular observation X. Different information may well lead to different probability forecasts for X, denoted by p(x), q(x), r(x), etc.
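To make the notation concrete, the following minimal sketch (in Python; the Gaussian forecast and the verification value are hypothetical choices for illustration, not taken from this paper) distinguishes the forecast density p(x), a function over all possible values, from the single number p(X), its value at the verification:

```python
import numpy as np
from scipy.stats import norm

# A hypothetical probabilistic forecast for Heathrow temperature (deg C):
# a Gaussian PDF with mean 12.0 and standard deviation 2.5.
p = norm(loc=12.0, scale=2.5)

# p(x): the forecast density as a function over possible temperatures x.
x = np.linspace(0.0, 25.0, 251)
density = p.pdf(x)

# p(X): the value of that density at the verification X,
# here an assumed observed temperature of 14.3 deg C.
X = 14.3
print("p(X) =", p.pdf(X))
```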
3. Scores
a. The ignorance score
b. The Brier score
c. The naive linear score
d. The proper linear score
e. The mean square error
Note that the proper linear score depends on the entire functional form of p(x) [due to the integral in the first term of Eq. (5)], while both the ignorance and the naive linear score depend on p(x) only via the single number p(X), the value of p(x) at the verification X. That is, the ignorance and the naive linear score depend only on the value of the probabilistic forecast at the verification, not on other features of the functional form of p(x). This property is called locality, which we return to later in section 5.
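The distinction can be made explicit numerically. The sketch below uses generic, negatively oriented textbook forms of the three scores (an assumption made for illustration; the paper's own equations, including Eq. (5), are not reproduced verbatim here), together with a hypothetical Gaussian forecast and verification:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Generic textbook forms (smaller is better); assumed stand-ins for the
# paper's equations, not verbatim copies of them.
def ignorance(p, X):
    # Local: depends on the forecast only through p(X).
    return -np.log2(p.pdf(X))

def naive_linear(p, X):
    # Also local: again only p(X) enters.
    return -p.pdf(X)

def proper_linear(p, X):
    # Nonlocal ingredient: the integral of p(x)^2 over the whole range,
    # so the full functional form of p matters, not just p(X).
    integral, _ = quad(lambda x: p.pdf(x) ** 2, -np.inf, np.inf)
    return integral - 2.0 * p.pdf(X)

p = norm(loc=12.0, scale=2.5)   # hypothetical forecast PDF
X = 14.3                        # hypothetical verification
print(ignorance(p, X), naive_linear(p, X), proper_linear(p, X))
```

Changing p(x) far from X alters the proper linear score but leaves the two local scores unchanged.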
4. Proper scores
At first glance, the various scores presented above possess no distinctive features that would qualify any of them as particularly useful for evaluating probabilistic forecasts. As will be shown in this section, however, some of these scores are proper, while others are not. We first define this property and then explain why improper scores lead to conclusions inconsistent with common sense, thus motivating the importance of being proper.
The nonpropriety of the naive linear score would also emerge as a consequence of a far more general result due to Bernardo (1979; see also section 5), namely that for continuous variables all smooth, proper, and local scores are affine functions of the ignorance. The notion of locality, briefly mentioned at the end of section 3, will be returned to in section 5. Proper scores in general have been characterized by Gneiting and Raftery (2004).
To conclude, we note that proper scores and only proper scores are internally consistent in that the score S[q(x), X] assigns an optimal expected value to q(x) if and only if X is distributed according to q(x). Note that philosophical arguments over the existence of a true probability distribution play no role in the entire discussion of this paper. It is tempting to think of the skill of a forecast p(x) as its distance to a true (in any sense) conditional probability describing the relation between our information and the unknown variable X. Because we do not assume the existence of such a true probability distribution, much less having access to it, we are unable to consider distance measures between probability distributions, gainfully explored in other circumstances by Kleeman (2002). A proper score merely ensures consistency.
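A small Monte Carlo experiment illustrates this consistency, and its absence for the naive linear score. In the sketch below (the "true" Gaussian distribution, the overconfident alternative, and the sample size are assumptions made purely for illustration), verifications are drawn from a known distribution, and the honest forecast is compared with one that is too sharp; under the proper ignorance score the honest forecast obtains the better mean score, whereas the improper naive linear score rewards the overconfident forecast:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Distribution actually generating the verifications (an assumption of the experiment).
truth = norm(loc=0.0, scale=1.0)
X = truth.rvs(size=100_000, random_state=rng)

honest = norm(loc=0.0, scale=1.0)   # forecast equal to the generating distribution
sharp = norm(loc=0.0, scale=0.5)    # overconfident (too sharp) forecast

def mean_ignorance(p, X):
    return np.mean(-np.log2(p.pdf(X)))

def mean_naive_linear(p, X):
    return np.mean(-p.pdf(X))

# Ignorance (proper): the honest forecast obtains the lower (better) mean score.
# Naive linear (improper): the overconfident forecast obtains the lower mean score.
for name, score in [("ignorance", mean_ignorance), ("naive linear", mean_naive_linear)]:
    print(name, score(honest, X), score(sharp, X))
```

The numerical values depend on the assumed distributions and on the sample, but with these choices the ordering is stable across repetitions.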
5. Locality and nonlocality
At first sight, it might seem unreasonable that features of the forecast other than the value it assigned to the verification should matter. Yet it is possible that domain knowledge suggests any appropriate forecast should have, for example, some smoothness properties; one may want to restrict the possible variations in the probability forecast a priori, without having looked at the data.3 This can also be useful when scores are employed for training models that translate numerical simulations into probability density forecasts. A ubiquitous problem here is to limit model complexity, which can be addressed by enforcing certain measures of smoothness upon the PDFs (regularization). Because a finite sample of verifications is never sufficient to either confirm or deny the presence of such properties, smoothness has to be enforced either by restricting the considered class of density functions to smooth functions a priori, or by augmenting the score with a term that penalizes nonsmooth densities, essentially rendering the score nonlocal.4,5
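As an illustration of how such a penalty renders a score nonlocal, the sketch below (the discretization, the squared-second-difference penalty, and the weight lam are illustrative assumptions rather than a recommended recipe) augments the ignorance of a gridded forecast density with a roughness term that depends on the entire density, not only on its value at the verification:

```python
import numpy as np

def regularized_ignorance(x_grid, p_grid, X, lam=0.1):
    # Ignorance plus a roughness penalty on a gridded forecast density (a sketch).
    dx = x_grid[1] - x_grid[0]
    # Local part: -log2 of the forecast density interpolated at the verification.
    local = -np.log2(np.interp(X, x_grid, p_grid))
    # Nonlocal part: penalize curvature of the density over the whole grid.
    roughness = np.sum(np.diff(p_grid, n=2) ** 2) / dx
    return local + lam * roughness

# Hypothetical gridded Gaussian forecast and verification.
x_grid = np.linspace(0.0, 25.0, 251)
p_grid = np.exp(-0.5 * ((x_grid - 12.0) / 2.5) ** 2) / (2.5 * np.sqrt(2.0 * np.pi))
print(regularized_ignorance(x_grid, p_grid, X=14.3))
```

A rougher density with the same value at X would receive a worse score, which is precisely the nonlocal behavior a local score cannot express.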
A separate reason for using nonlocal measures in a particular problem would arise if the ignorance score is not considered suitable. For example, the ignorance score is infinite if the forecast assigns vanishing probability to an event that obtains. If we nonetheless wish to evaluate forecasts that insist on assigning zero probability to events that occur, we must resort to other scores. Inasmuch as ignorance is the only smooth, proper, and local score for continuous variables (Bernardo 1979),6 this implies switching to a nonlocal score.
6. Conclusions
Insightful evaluation and intercomparison of probability forecasts require a careful choice of score to quantify the agreement between historical forecasts and verifications. We focus on a few scores for the case in which each forecast is a PDF and each verification is a real number; this list of scores is not exhaustive, and probabilistic forecasts for discrete events admit further measures of skill not discussed here. Our main point is that only proper scores are internally consistent. By Bernardo's theorem, ignorance is effectively the only proper local score for continuous variables. Locality also appears to be a desirable property of a score, yet the case for local scores is less compelling than that for proper scores. It would be interesting to identify and investigate situations in which nonlocal scores for continuous variables would be highly valued.
When using scores to evaluate probabilistic forecasting systems, it is critical to consider the performance of the system over a period sufficiently long to obtain robust results. Ultimate evaluation of operational probabilistic forecast systems may also require accounting for the fact that the verifying observation is itself uncertain, and thus a move to generalized scores; a possible generalization was motivated and discussed in the context of propriety and locality. Proper scores allow an internally consistent evaluation, making their use an important element in the valuation and further improvement of these forecasts and of the models behind them.
Acknowledgments
This work was supported by the DIME EPSRC/DTI Faraday Partnership under Grant GR/R92363/01, and by ENSEMBLES and the National Oceanic and Atmospheric Administration (NOAA) under Grant 1-RAT-S592-04001. The authors gratefully acknowledge fruitful discussions with Liam Clarke, Devin Kilminster, Antje Weisheimer, and Kevin Judd.
REFERENCES
Bernardo, J. M., 1979: Expected information as expected utility. Ann. Stat., 7, 686–690.
Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. Quart. J. Roy. Meteor. Soc., 131, 2131–2150.
Gneiting, T., and A. Raftery, 2004: Strictly proper scoring rules, prediction, and estimation. Tech. Rep. 436, Department of Statistics, University of Washington.
Good, I. J., 1952: Rational decisions. J. Roy. Stat. Soc., 14, 107–114.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 256 pp.
Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci., 59, 2057–2072.
Kullback, S., and R. A. Leibler, 1951: On information and sufficiency. Ann. Math. Stat., 22, 79–86.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Palmer, T. N., 2000: Predicting uncertainty in forecasts of weather and climate. Rep. Prog. Phys., 63, 71–116.
Palmer, T. N., G. J. Shutts, R. Hagedorn, F. Doblas-Reyes, T. Jung, and M. Leutbecher, 2005: Representing model uncertainty in weather and climate prediction. Annu. Rev. Earth Planet. Sci., 33, 163–193.
Petterssen, S., 1956: Weather Analysis and Forecasting. 2d ed. McGraw-Hill, 505 pp.
Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660.
Selten, R., 1998: Axiomatic characterisation of the quadratic scoring rule. Exp. Econ., 1, 43–62.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International Geophysics Series, Vol. 59, Academic Press, 464 pp.
Wilson, L. J., W. R. Burrows, and A. Lanzinger, 1999: A strategy for verification of weather element forecasts from an ensemble prediction system. Mon. Wea. Rev., 127, 956–970.
The mean and the standard deviation of p(x) are not to be confused with the sample mean and the sample standard deviation of the observations.
Note that a similar definition for nonlocal scores would require a substantially more advanced concept of smoothness, because, in general, nonlocal scores involve functional operations.
The definition of locality as given here must not be confused with issues related to scoring forecasts for spatial fields. There the question arises whether fields should have some smoothness properties over space, rather than over different verifications.
Scores including the derivative at verification points are still nonlocal according to the standard definition of locality, although a certain “pseudolocality” could be ascribed to them.
Requirements of smoothness or parsimony might be desired for reasons not directly connected with skill, and therefore might not be considered as part of the score. We thank D. Kilminster for stressing this point.
The exact statement of this result is that every local, smooth, and proper score for continuous variables is an affine function of the ignorance.