
Scoring Rules for Forecast Verification

Riccardo Benedetti
Laboratory for Environmental Monitoring and Modelling (LaMMA), Toscana, Italy

Abstract

The problem of probabilistic forecast verification is approached from a theoretical point of view starting from three basic desiderata: additivity, exclusive dependence on physical observations (“locality”), and strictly proper behavior. By imposing such requirements and only using elementary mathematics, a univocal measure of forecast goodness is demonstrated to exist. This measure is the logarithmic score, based on the relative entropy between the observed occurrence frequencies and the predicted probabilities for the forecast events. Information theory is then used as a guide to choose the scoring-scale offset for obtaining meaningful and fair skill scores. Finally the Brier score is assessed and, for single-event forecasts, its equivalence to the second-order approximation of the logarithmic score is shown.

Most of the presented results are far from new or original; nevertheless, their use still meets with some resistance in the weather forecasting community. This paper aims at providing a clear presentation of the main arguments for using the logarithmic score.

Corresponding author address: Riccardo Benedetti, Laboratory for Environmental Monitoring and Modelling (LaMMA), Building D, via Madonna del Piano, Sesto Fiorentino 50019, Italy. Email: benedetti@lamma.rete.toscana.it


1. Introduction

The importance of assessing the quality of forecasts is widely recognized by both the numerical weather prediction community and synoptic-empirical meteorologists, who base their predictions on experienced analysis of large-scale atmospheric motions. In the absence of a reliable verification scheme, both the comparison between different forecast methods and the real effectiveness of corrections applied to a given procedure become questionable, especially when no clear evidence is available.

Since the pioneering work by Brier (Brier 1950), Brier and Allen (Brier and Allen 1951), and Good (Good 1952), various kinds of scores have provided an operational tool to quantify forecast goodness. The main drawback of this approach resides in the large number of different scores proposed, such as the Brier score (Brier 1950), the logarithmic score (Good 1952), the ranked probability score (Epstein 1969; Murphy 1971), the relative operating characteristic (ROC) curves (Mason 1982), rank histograms (Talagrand et al. 1999), and so on. It often happens that the same forecast turns out better than another using a certain score, and worse using a different one. This prevents both researchers and users from getting reliable feedback on how good a given series of predictions is, and it raises the question of whether a unique measure of forecast quality exists. It is important to stress that "quality" in this context means "accordance with reality," as derived from the available observations, as opposed to any personal concept of utility arising from specific situations or subjective preferences.

Nowadays a general consensus exists for probabilistic forecasts (Anderson 1996), in part because of the recent techniques of ensemble forecasting that provide output in terms of probability distribution functions instead of single deterministic values. A similar consensus regards the use of strictly proper scores (Brocker and Smith 2007), that is, scores whose expected result is maximized if and only if the announced probabilities for the series of events to be predicted exactly correspond to those believed to be the closest to reality. In other words, proper scores promote honesty, in the sense that the forecaster is forced to declare what he actually thinks to be the best prediction in order to achieve the best score; any different behavior yields poorer expected results. Nevertheless, even restricting the topic to probabilistic forecasts evaluated by strictly proper scores, a unique verification score is not found. As proposed by many authors (Lindley 1985; Leung and North 1990; Roulston and Smith 2002), information theory, founded by Shannon (Shannon 1948) but already present in Gibbs's seminal work (Gibbs 1902), provides the right framework to approach the problem of forecast verification. In this paper, following the pattern used by Shannon to define the information entropy, we try to show by means of elementary mathematics that the logarithmic score is the only one to respect three basic desiderata whose violation can hardly be accepted. A more formal and cumbersome demonstration of an equivalent result may be found, for example, in a paper by Bernardo (Bernardo 1979) on the design of experiments by maximization of the expected information. The mathematical apparatus used by Bernardo, involving probability distributions, functionals, and variational calculus, is more complicated than necessary for the scoring rules as commonly applied to forecast verification. In our opinion this unnecessary complication has overshadowed both the application and the importance of such results as normative procedures for the quality assessment of probabilistic forecasts.

2. Basic desiderata for a scoring rule

In the following we consider sets of exhaustive and mutually exclusive events {E1, E2, … , Em}, meaning that
\[
\sum_{i=1}^{m} p(E_i) = 1, \qquad p(E_i \mid E_j) = \delta_{ij},
\]
where p(Ei) indicates the probability for the ith event, p(Ei|Ej) is the conditional probability for the event Ei given Ej, and δij is the Kronecker’s delta. The term “probability” is here intended in a cognitive sense as “degree of belief that an event will occur or turn out true.”1

A given forecast over the set of m events is univocally determined by assigning the expected probabilities p(Ei) for all the m events, so that it can be usefully represented by the m-dimensional vector p whose components are the probabilities p(Ei). Consistently, a sequence of forecasts issued on n different occasions will be represented by the m × n matrix 𝗣 whose kth column is the vector pk for the kth forecast. Finally, to represent the series of events that actually occurred on the n different occasions we use the n-dimensional vector ϵ with integer components ϵk such that ϵk = i if the ith event occurs on the kth occasion.
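As a concrete illustration of this bookkeeping, the following minimal Python sketch (not part of the original paper; the array names P and eps are ours) builds the forecast matrix and the occurred-event vector for a toy sequence:

```python
import numpy as np

# Toy sequence: n = 4 forecasts over m = 3 mutually exclusive, exhaustive events.
# Column k of P is the probability vector p_k announced on the kth occasion.
P = np.array([[0.6, 0.2, 0.1, 0.5],
              [0.3, 0.5, 0.2, 0.4],
              [0.1, 0.3, 0.7, 0.1]])

# eps[k] = i means that event E_i occurred on the kth occasion (0-based indices).
eps = np.array([0, 1, 2, 1])

# Exhaustiveness: each forecast column must sum to one.
assert np.allclose(P.sum(axis=0), 1.0)
```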

Using this notation we state three basic desiderata that we request to be satisfied by any score for forecast verification: additivity, exclusive dependence on physical observations (“locality”), and strictly proper behavior. In detail these requirements read as follows:

  1. The score Sn for the forecast sequence 𝗣 is a differentiable function of the variables pk(Ei), additive with respect to each single forecast, in the sense that, when a new forecast occasion is added, the score for the extended sequence is given by
     \[ S_{n+1} = S_n + S_1(\mathbf{p}_{n+1}, \epsilon_{n+1}). \tag{1} \]
  2. The score Sn depends only on the probabilities assigned to the events that actually occurred:
     \[ S_n = S_n\big(p_1(E_{\epsilon_1}),\, p_2(E_{\epsilon_2}),\, \ldots,\, p_n(E_{\epsilon_n})\big). \tag{2} \]
  3. If the probabilities assigned to each event are held constant over all the forecast occasions (i.e., pk = p for all k), then the score exhibits an extremant when such probabilities exactly correspond to the observed occurrence frequencies fi of the events, or in vectorial notation when p = f. Mathematically, writing for the sake of brevity p(Ei) ≡ pi:
     \[ \left.\frac{\partial S_n}{\partial p_i}\right|_{\mathbf{p}=\mathbf{f}} = 0, \qquad i = 1, \ldots, m, \tag{3} \]
    with the constraint Σipi = 1.

As demanded by common sense, the first requirement of additivity implies that each single forecast affects the global score for the whole sequence in the same manner as any other, and any variation of its score produces an identical variation in the global one. As in the case of the information entropy, continuity with respect to the variables pk would be sufficient; nevertheless we require differentiability, a slightly stronger constraint, to make the demonstration of uniqueness easier. In practice all the proposed and currently used scores are additive, and most of them are differentiable functions of the probabilities pk.

The second desideratum may appear quite arbitrary, or at least more specific. On the contrary, it is really fundamental in a very general sense, since any sound scientific measure of the accordance between model predictions and natural phenomena cannot depend on what could have been observed but was not. The goodness of the accordance must be determined only by what is actually observed. A little reflection shows that this principle of modern science, probably stated for the first time by Galilei, also avoids situations that common sense can hardly accept. In fact, two sequences of forecasts that have assigned exactly the same probabilities to a series of observed events cannot gain different scores on the basis of probabilities assigned to events that never occurred. Common sense suggests that both deserve the same score until some event treated as different is observed. Surprisingly, many widely used scores violate this basic principle, which corresponds to so-called locality2 (e.g., see Brocker and Smith 2007). As far as we know, the importance of this property has never been fully recognized in the field of forecast verification, and it has been considered on the same level as other, more arbitrary properties, such as the "equitability" defined by Gandin and Murphy (1992). For example, the famous Brier score, whose lasting success is probably due to some of its properties that we explain better in section 6, is "nonlocal," and its "nonlocality" is accepted as a legitimate choice.

Finally, the third desideratum restricts the choice to the class of strictly proper scores since, before the actual series of events takes place, the expected score for announcing the probabilities p̃i is calculated by taking the believed probabilities pi as the true frequencies fi. Thus, the expected score reaches an extremant when the announced probabilities equal those believed true, and any different choice is expected to yield a worse score. Contrary to locality, the importance of being proper is now widely recognized (Brocker and Smith 2007). For example, in a recent paper Jolliffe and Stephenson (Jolliffe and Stephenson 2008) noticed that proper scores can never be equitable, concluding that "propriety" is the more fundamental requirement.

3. Score function determination

The first two basic desiderata reduce the expression of the score for a given sequence of n forecasts to a summation of n terms, each of which is the score for the kth forecast:
\[
S_n(\mathsf{P}, \boldsymbol{\epsilon}) = \sum_{k=1}^{n} S\big(p_k(E_{\epsilon_k})\big), \tag{4}
\]
where S ≡ S1, the index being no longer necessary. Clearly, determining the function S also determines the global score Sn. Let us consider a sequence of forecasts for which the probabilities assigned to each event are kept constant through the n occasions. In this case the expression in (4) can be written as a summation over the m possible events:
\[
S_n = n \sum_{i=1}^{m} f_i\, S(p_i), \tag{5}
\]
where fi = ni/n is the observed frequency of the ith event, which occurs ni times in the n occasions. Now let us focus on two of the m events, for example E1 and E2, holding the probabilities of the remaining events fixed so that the sum r1,2 ≡ p1 + p2 is constant. If p1 is assumed free to vary between 0 and r1,2, then p2 is constrained to r1,2 − p1. Thus (5) becomes
\[
S_n = n\Big[f_1 S(p_1) + f_2 S(r_{1,2} - p_1) + \sum_{i=3}^{m} f_i S(p_i)\Big]. \tag{6}
\]
According to the third basic desideratum, the derivative of this expression with respect to p1 must vanish for p = f:
\[
f_1 S'(f_1) - f_2 S'(f_2) = 0, \tag{7}
\]
where the prime denotes differentiation with respect to the argument and f2 = r1,2 − f1. Because the choice of the two events is completely arbitrary, the frequency f1 can be regarded as a generic variable x ranging from 0 to r, r being its generic upper limit, which can take any value between 0 and 1. Thus (7) corresponds to the functional equation:
\[
x\,S'(x) = (r - x)\,S'(r - x), \qquad x \in (0, r), \tag{8}
\]
or equivalently, setting xS ′(x) ≡ Φ(x):
\[
\Phi(x) = \Phi(r - x). \tag{9}
\]
Evidently, the general solution is given by any function Φ symmetric with respect to the central point x = r/2; since this symmetry must hold for any r ∈ (0, 1), we conclude that the only possibility is Φ(x) = a for any x ∈ (0, 1), a being an arbitrary constant. Thus, at least for a number of events greater than 2, by simple integration we have
\[
S(x) = a \ln x + b, \tag{10}
\]
where b is another arbitrary constant.

This seemingly surprising result tells us that respecting the three basic desiderata is not only a necessary condition for a scoring rule: it is also sufficient to univocally define the form of the score function S and hence the global score Sn for a sequence of n forecasts. Any choice other than (10) necessarily violates at least one of the three basic desiderata. Since they express very basic requirements, suggested by common sense and needed for a sound scientific approach, a deliberate violation of one of them can hardly be accepted and, if present, some inconsistency or nonsense is to be expected.

The constants a and b represent the arbitrary scale factor and offset, respectively, up to which any scoring scale for evaluating a given forecast sequence is defined. Clearly no essential distinction exists between scores that differ only by an offset and a multiplicative factor, since they coincide up to a translation of the zero and a redefinition of the unit.
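The strictly proper behavior of the logarithmic score can also be checked numerically. The following short sketch (ours, not from the paper) verifies that the expected score per forecast, computed with the believed probabilities playing the role of the true frequencies, is never improved by announcing probabilities different from the believed ones:

```python
import numpy as np

def expected_score(announced, believed, a=1.0, b=0.0):
    """Expected score per forecast, sum_i f_i * S(p_i), when `believed`
    plays the role of the true frequencies f and `announced` is declared."""
    return np.sum(believed * (a * np.log(announced) + b))

believed = np.array([0.5, 0.3, 0.2])
honest = expected_score(believed, believed)

# Any alternative announcement yields a worse (or equal) expected score.
rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_score(q, believed) <= honest + 1e-12
```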

4. Single-event forecasts

The logarithmic form in (10) for the score function S(x) has been obtained assuming a number m of events greater than 2. Even if it seems natural to maintain the same form also for m = 2, a sounder argument is needed. When m = 2, the set of mutually exclusive and exhaustive events is composed of only the event E and its negation Ē, for example "rain" and "no rain." From a logical point of view we have a single-event forecast with a single probability p(E) = p to be predicted, p(Ē) = 1 − p being implied. Since in this case r = 1 with no possibility of variation, any function Φ(x) symmetric with respect to x = ½ satisfies the functional Eq. (9), and the necessity of the logarithmic form for S(x) seems to disappear. Consider, however, the situation of a first forecaster who announces the probability p1 for the event E1 (e.g., "rain for tomorrow") and a second forecaster who knows for certain that a second event E2 can be observed only when E1 does not occur (e.g., "rain for the day after tomorrow"), so that he announces the conditional probability p2 ≡ p(E2|Ē1) for the occasions when E1 does not occur. Given the series of n forecast occasions within which E1 occurred n1 times, the score for the first forecaster will be
\[
S^{(1)} = n_1 S(p_1) + (n - n_1)\, S(1 - p_1), \tag{11}
\]
while the score for the second forecaster, whose forecasts are limited to the n − n1 occasions when E1 did not occur, will be
\[
S^{(2)} = n_2 S(p_2) + (n - n_1 - n_2)\, S(1 - p_2), \tag{12}
\]
where n2 is the number of occurrences of E2. Since both forecasters provide single-event forecasts (m = 2), S(x) is at the moment an unknown score function. If a third forecaster announces both the probabilities p1 and p̃2 for E1 and E2, and consequently the probability p3 = 1 − p1 − p̃2 for the third possible event E3 (e.g., "no rain for both tomorrow and the day after"), then he is providing a multiple-event forecast with m = 3. In this case we know that his score must be
\[
S^{(3)} = a\big(n_1 \ln p_1 + n_2 \ln \tilde{p}_2 + n_3 \ln p_3\big) + n b. \tag{13}
\]
But when p̃2 = (1 − p1)p2 the third, multiple forecast is logically equivalent3 to the joint of the first and the second forecasts, so for logical consistency we have to require that the sum of the scores for the two single-event predictions equals the score of the multiple one for at least one choice of the constants a and b:
\[
S^{(1)} + S^{(2)} = S^{(3)}. \tag{14}
\]
After some algebra, and taking into account that (14) must hold for any n1, n2, and n3 ≡ n − n1 − n2, we get three functional equations to be satisfied simultaneously:
\[
\begin{aligned}
S(p_1) &= a \ln p_1 + b, \\
S(1 - p_1) + S(p_2) &= a \ln\big[(1 - p_1)\,p_2\big] + b, \\
S(1 - p_1) + S(1 - p_2) &= a \ln\big[(1 - p_1)(1 - p_2)\big] + b.
\end{aligned} \tag{15}
\]
It is easy to verify that by choosing b = 0 [i.e., S(x) = a ln x] all three equations are satisfied, and that no other choice is possible. Thus, for single-event forecasts no possibility other than a logarithmic score function like (10) exists. But a new, interesting, and unexpected result has also been obtained: the logical consistency requirement of assigning identical scores to logically equivalent forecasts, when added to the three basic desiderata, forces the offset b of the scoring scale to zero. It is something like the temperature scales in thermodynamics: both the offset and the unit are arbitrary until we require the equivalence between temperature and thermal energy, in which case only the absolute temperature scale starting from zero is possible. As suggested by this analogy, the score coming from the choice b = 0 can be named the "absolute score."
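The consistency requirement in (14), and the fact that it fails for any nonzero offset, can be checked with a few lines of arithmetic. A minimal sketch (ours; the counts and probabilities are arbitrary):

```python
import math

a, n, n1, n2 = 1.0, 100, 40, 25
n3 = n - n1 - n2
p1, p2 = 0.4, 0.3            # p2 is conditional on E1 not occurring

def S(x, b=0.0):             # logarithmic score function, Eq. (10)
    return a * math.log(x) + b

# Eqs. (11)-(13) with b = 0: the two single-event scores add up exactly
# to the three-event score of the logically equivalent multiple forecast.
score1 = n1 * S(p1) + (n - n1) * S(1 - p1)
score2 = n2 * S(p2) + n3 * S(1 - p2)
score3 = n1 * S(p1) + n2 * S((1 - p1) * p2) + n3 * S((1 - p1) * (1 - p2))
assert math.isclose(score1 + score2, score3)

# With any b != 0 the identity breaks: the left side picks up (2n - n1)*b
# while the right side picks up only n*b.
```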

5. Skill scores and information

Adopting the scoring function in (10), the score for a given sequence of n forecasts is easily calculated once the actual series ϵ of occurred events is known:
\[
S_n = \sum_{k=1}^{n} \big[a \ln p_k(E_{\epsilon_k}) + b\big]. \tag{16}
\]
As seen in section 4, if an "absolute" score is needed, in the sense that we request the same score for logically equivalent forecasts however obtained, then we must take b = 0 and no different choice is possible. Nevertheless, whatever the offset b is, the score in (16) cannot be directly used for comparing the skillfulness of forecast methods applied to different series of occasions, especially when they differ in length. A natural skill score could be the mean score per forecast, obtained by dividing Sn by n:
\[
\frac{S_n}{n} = \frac{1}{n} \sum_{k=1}^{n} \big[a \ln p_k(E_{\epsilon_k}) + b\big], \tag{17}
\]
and when the probabilities pk are kept constant, as in (5):
\[
\frac{S_n}{n} = a \sum_{i=1}^{m} f_i \ln p_i + b. \tag{18}
\]
As known from information theory, the summation on the right-hand side of (18) is just the negative of the cross entropy Hc between the discrete frequency and probability distributions f and p (Jaynes 1957):
\[
H_c(\mathbf{f}, \mathbf{p}) = -\sum_{i=1}^{m} f_i \ln p_i, \tag{19}
\]
where Hc can be easily written as the sum of the entropy of f and the relative entropy of p with respect to f, nowadays better known as the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951):
\[
H_c(\mathbf{f}, \mathbf{p}) = H(\mathbf{f}) + D_{\mathrm{KL}}(\mathbf{f} \parallel \mathbf{p}) = -\sum_{i=1}^{m} f_i \ln f_i + \sum_{i=1}^{m} f_i \ln\frac{f_i}{p_i}. \tag{20}
\]
The KL divergence represents a measure of how much the distribution p diverges from f in terms of information content and, according to the Gibbs inequality (Gibbs 1902), it is a nonnegative quantity, making the cross entropy always greater than or equal to the entropy H(f) of the observed frequency distribution.
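The decomposition in (20) is easy to verify numerically; a minimal sketch (ours) for an arbitrary pair of discrete distributions:

```python
import numpy as np

def entropy(f):
    f = f[f > 0]                       # 0 * log(0) is taken as 0
    return -np.sum(f * np.log(f))

def cross_entropy(f, p):
    return -np.sum(f[f > 0] * np.log(p[f > 0]))

def kl_divergence(f, p):
    mask = f > 0
    return np.sum(f[mask] * np.log(f[mask] / p[mask]))

f = np.array([0.5, 0.3, 0.2])          # observed frequencies
p = np.array([0.4, 0.4, 0.2])          # forecast probabilities

# Eq. (20): H_c(f, p) = H(f) + D_KL(f || p), with D_KL >= 0 (Gibbs inequality).
assert np.isclose(cross_entropy(f, p), entropy(f) + kl_divergence(f, p))
assert kl_divergence(f, p) >= 0
```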
Looking at the skill score in (18) through the lens of information theory helps to choose the most convenient and meaningful values for the arbitrary constants a and b. First of all, for merely psychological reasons and common practice, we are inclined to associate positive values with better results and negative values with worse ones, so a positively oriented scoring scale (i.e., the higher the score the better the result) is in general more immediate. Besides, the zero value can usefully indicate null results. For example, if only the climatological frequencies f̄i for the events {Ei} are known, then the best and most honest prediction, before the development of more accurate forecast methods, would be to assign the probability pi = f̄i to each event Ei. So this could be taken as the "zero level" reference to evaluate and verify the quality and the effectiveness of any finer forecast exploiting further information and knowledge: if the forecast uses them skillfully, then it should be at least as effective as the zero-level prediction, and more effective when the additional information is truly weighty and the knowledge significant. Since the climatological frequencies are expected to repeat, the expected cross entropy for the zero-level prediction corresponds to the entropy of the climatological frequency distribution f̄:
\[
H_c(\overline{\mathbf{f}}, \overline{\mathbf{f}}) = H(\overline{\mathbf{f}}) = -\sum_{i=1}^{m} \overline{f}_i \ln \overline{f}_i. \tag{21}
\]
Thus, choosing a = 1 and b = H(f̄), the skill score in (18) becomes
\[
\frac{S_n}{n} = -D_{\mathrm{KL}}(\overline{\mathbf{f}} \parallel \mathbf{p}) + \sum_{i=1}^{m} \Delta f_i \ln p_i, \tag{22}
\]
where the difference Δfi ≡ fi − f̄i between the actually observed and the climatologically expected frequencies has been introduced. A rapid look at the skill score in (22) shows that if the observed frequencies equal those expected by climatology (i.e., Δfi = 0 for all i), then the skill score corresponds to the negative of the KL divergence between f̄ and p and is always nonpositive, indicating that no better results than the climatological information can be attained. In particular, p = f produces a null skill score, the highest possible, while for any other forecast the skill score is negative (we remind the reader of the constraint of keeping the predicted probabilities fixed; when they vary at each forecast occasion, positive results become possible). If fi ≠ f̄i for some i (i.e., f ≠ f̄), then there is room for positive results, the highest skill score being given by the difference between the entropy of f̄ and that of f. Such a maximum value H(f̄) − H(f), attainable if and only if p = f, can be positive or negative. Since the larger the indetermination of a distribution, the higher its entropy, it turns out positive when the observed frequency distribution f is less indeterminate than f̄ and negative when it is more so. This is very reasonable behavior, because a distribution that is more indeterminate in terms of information content can be more easily forecasted. In the limit, the most indeterminate distribution (i.e., when all the m events have the same occurrence frequency 1/m) can be forecasted simply by using the Laplace principle of indifference, without any effort to gather information or any need for specific knowledge. Indeed, since the indifference principle is the forecast method of the totally uninformed, it can be taken as an absolutely fair zero level, from which the corresponding "fair skill score" (FS) is obtained by setting b = H(1/m) ≡ ln m:
\[
\mathrm{FS} = \sum_{i=1}^{m} f_i \ln p_i + \ln m. \tag{23}
\]
All the general considerations concerning the information content remain valid for a generic sequence of n forecasts whose probabilities for the m events change at each occasion. In fact, such a sequence can be thought of as the sum of subsequences for which the probabilities are kept constant and, by the additivity property, the total score must be the sum of the subsequences' partial scores. Nevertheless, the additivity property does not transfer to the mean score per forecast, so the skill score for the whole sequence is not the sum of the skill scores for each subsequence. For example, the FS has to be calculated as
\[
\mathrm{FS} = \frac{1}{n} \sum_{k=1}^{n} \ln p_k(E_{\epsilon_k}) + \ln m. \tag{24}
\]
Since the perfect forecaster always assigns a probability 1 (absolute precision) to the event that then actually occurs (absolute accuracy) [i.e., pk(ϵk) = 1 for all k], he gains a skill score FS = ln m, which indeed corresponds to the absolute maximum. Besides, not only honesty, precision, and accuracy, but also resolution power is promoted: the larger the set of m possible events, the higher the perfect-forecast skill score turns out to be.
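In code, (24) is a one-liner; the sketch below (ours) computes the FS for a toy sequence and checks the perfect-forecast limit FS = ln m:

```python
import numpy as np

def fair_skill_score(P, eps):
    """Eq. (24): mean log score per forecast plus ln(m).
    P is the (m, n) forecast matrix; eps holds the occurred-event indices."""
    m, n = P.shape
    return np.mean(np.log(P[eps, np.arange(n)])) + np.log(m)

P = np.array([[0.6, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.7]])
eps = np.array([0, 1, 2])      # the most probable event occurred each time

print(fair_skill_score(P, eps))          # positive: better than indifference
print(fair_skill_score(np.eye(3), eps))  # perfect forecasts: ln(3) ~ 1.0986
```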
To get more insight and to allow graphical representations, let us consider the very simple case of a single-event forecast, in which the set of mutually exclusive and exhaustive events is composed of only the event E and its negation Ē. Setting p(E) = p, necessarily p(Ē) = 1 − p, and analogously for the observed frequencies: if f(E) = f then f(Ē) = 1 − f. Thus, the FS in (23) turns out to be
\[
\mathrm{FS} = f \ln p + (1 - f) \ln(1 - p) + \ln 2. \tag{25}
\]
Since the property in (3) directly transfers to the mean score per forecast, whatever the observed frequency f is, the skill score reaches its maximum, corresponding to the best result, when p = f:
\[
\mathrm{FS}_{\max} = f \ln f + (1 - f) \ln(1 - f) + \ln 2. \tag{26}
\]
Plotting FSmax as a function of f we find the graph drawn in Fig. 1.

The absolute maximum, equal to ln 2 ≅ 0.69, is approached for f → 0 or f → 1 (i.e., for very rare or very frequent events). Although this could at first appear a strange result, a little reflection shows that it is completely reasonable. Because of the arbitrary choice of E and Ē we can limit the reasoning to rare events, for which it is quite apparent that the exact forecast of the occurrence frequency turns out to be very difficult, if nothing else because of the difficulty of gathering enough information by sampling. We also note that the FS discourages the temptation to superficially assign p = 0 to very rare events, as well as p = 1 to very frequent ones. In fact, a single occurrence (or nonoccurrence) is sufficient to collapse the score to −∞, without any possibility of rising again. Thus, unless an event is known for certain to be impossible, the FS compels one to honestly consider it possible. Indeed, this is a general feature of any logarithmic score respecting the three basic desiderata.

On the other hand, FSmax vanishes at its absolute minimum for f = ½, corresponding to a maximally indeterminate event whose occurrence is as frequent as its nonoccurrence. Again a fair result, since no effort apart from the application of the indifference principle is needed to forecast such a frequency exactly. Thus, as suggested by common sense, a null score is deserved by such a trivial forecast.

Finally the three-dimensional surface representing the FS as a function of both f and p is shown in Fig. 2. The upper points along the saddle line form the bidimensional curve represented in Fig. 1. For each given occurrence frequency f, any departure from the condition p = f produces a score decrease. As p tends to 0 or 1 the FS drops to −∞, unless f = 0 or f = 1, too.

6. Brier score as an approximation

Since 1950 the Brier score (BS; Brier 1950) has been commonly used to verify probabilistic forecasts. In the notation adopted here it is defined as
\[
\mathrm{BS} = \frac{1}{n} \sum_{k=1}^{n} \big|\mathbf{p}_k - \hat{\mathbf{e}}_k\big|^2, \tag{27}
\]
where |·|² denotes the squared modulus and, as usual, n is the number of forecast occasions, pk is the forecast vector at the kth occasion, and êk is the corresponding occurred-event versor, whose m components are all null except the ith, set to 1, indicating that the event Ei occurred at the kth occasion. In practice Brier used as a scoring function the square of the Euclidean distance between the vectors p and ê in the probability space of the events.
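A direct transcription of (27) into code (ours):

```python
import numpy as np

def brier_score(P, eps):
    """Eq. (27): mean squared Euclidean distance between each forecast
    vector p_k and the versor of the event that actually occurred."""
    m, n = P.shape
    E = np.zeros((m, n))
    E[eps, np.arange(n)] = 1.0       # occurred-event versors as columns
    return np.mean(np.sum((P - E) ** 2, axis=0))

P = np.array([[0.6, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.7]])
eps = np.array([0, 1, 2])
print(brier_score(P, eps))           # 0 for perfect forecasts, 2 at worst
```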

It can be verified that the BS is a strictly proper score (Wilks 1995) but, as is obvious, it infringes the second basic desideratum because of its dependence on probabilities assigned to events that never occurred. Although such a violation makes it unsatisfactory, some reason to consider it acceptable in several situations should exist if the forecasters' community has been using it for decades. Its useful and well-known decomposition into reliability, resolution, and uncertainty (Wilks 1995) cannot explain by itself the undeniable success of the BS.

Again let us consider the simple case of a single-event forecast, for which the BS turns out to be
\[
\mathrm{BS} = 2\big[f(1 - p)^2 + (1 - f)\,p^2\big]. \tag{28}
\]
As a term of comparison let us take the "absolute" skill score (AS):
\[
\mathrm{AS} = -\frac{f \ln p + (1 - f) \ln(1 - p)}{2 \ln 2}. \tag{29}
\]
Since the scale factor is unessential, to make a more meaningful comparison we choose a = −(2 ln 2)⁻¹, so that AS assumes the same value as BS for f = p = ½ (for f = p = 1 and f = p = 0 they both vanish). As shown in Fig. 3, for the best possible forecasts, when p = f, BS is very similar to AS and in practice provides the same information. Nevertheless, because of its violation of the second basic desideratum, some inconsistencies are expected. For example, Roulston and Smith (Roulston and Smith 2002) noticed that the logarithmic score, or "ignorance" according to their terminology, is a double-valued function of the BS, so that two forecasts with the same Brier score can be judged different if "ignorance" is used instead. Besides, Jewson (Jewson 2008) has recently noticed that the BS leads to conclusions that disagree with our intuition, especially when it is applied to extreme forecasts (p ≅ 1 or p ≅ 0) or to forecasts whose accordance with the observed frequencies is poor.

Plotting the difference AS − BS, we get the three-dimensional surface shown in Fig. 4. As is visible, when the forecasted probability p differs significantly from the observed occurrence frequency f, larger differences become possible and BS is no longer a suitable measure for forecast verification. Also, for very rare (or very frequent) events BS becomes inadequate as a skill score. In fact, assigning by default p = 0 to rare events whose actual occurrence frequency is, say, f = 10⁻³, we get BS = 2 × 10⁻³. If, maybe after great research efforts, we make the correct prediction p = 10⁻³, then BS = 2 × 10⁻³(1 − 10⁻³), a gain of only 0.1% in score. Thus, BS is very unfair for evaluating forecasts of rare (or common) events. On the contrary, logarithmic scores like AS exhibit an automatic safety device against any arbitrary assignment p = 0 or p = 1, as noticed in section 5.
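The rare-event arithmetic above is easy to reproduce; a small sketch (ours) contrasting the two scores for f = 10⁻³:

```python
import math

f = 1e-3                                  # occurrence frequency of a rare event

def BS(freq, p):                          # Eq. (28), single-event Brier score
    return 2 * (freq * (1 - p) ** 2 + (1 - freq) * p ** 2)

print(BS(f, 0.0))    # 0.002       careless default p = 0
print(BS(f, f))      # 0.001998    correct forecast, only ~0.1% better

def mean_log_score(freq, p):              # Eq. (18) with a = 1, b = 0
    return freq * math.log(p) + (1 - freq) * math.log(1 - p)

print(mean_log_score(f, f))       # ~ -0.0079, finite at the honest forecast
print(mean_log_score(f, 1e-12))   # ~ -0.0276, and -> -inf as p -> 0
```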

Since the two-variable function Ψ(x, y) = x ln y + (1 − x) ln(1 − y) exhibits an extremant at x = y = ½, with the second derivative with respect to x clearly vanishing, its second-order expansion about this extremant point is
\[
\Psi(x, y) \simeq -\ln 2 + 4\Big(x - \frac{1}{2}\Big)\Big(y - \frac{1}{2}\Big) - 2\Big(y - \frac{1}{2}\Big)^2. \tag{30}
\]
From this expression, after some tedious but trivial calculations, we find that
\[
\mathrm{BS} \simeq 2 \ln 2\, \mathrm{AS} + \frac{1}{2} - \ln 2. \tag{31}
\]
So, apart from an unessential scale factor and offset, the BS in (28) corresponds to the second-order approximation about the point p = f = ½ of the logarithmic skill score in (29). This explains the wide and lasting use of the BS, but it also indicates and quantifies its limits.
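The approximation in (31), reconstructed here from the expansion in (30), can be probed numerically; agreement is excellent near p = f = ½ and degrades for extreme probabilities (the sketch is ours):

```python
import math

def AS(f, p):        # Eq. (29), with scale a = -1/(2 ln 2)
    return -(f * math.log(p) + (1 - f) * math.log(1 - p)) / (2 * math.log(2))

def BS(f, p):        # Eq. (28), single-event Brier score
    return 2 * (f * (1 - p) ** 2 + (1 - f) * p ** 2)

for f, p in [(0.5, 0.5), (0.6, 0.55), (0.9, 0.95)]:
    approx = 2 * math.log(2) * AS(f, p) + 0.5 - math.log(2)   # Eq. (31)
    print(f"f={f:.2f} p={p:.2f}  BS={BS(f, p):.4f}  approx={approx:.4f}")
# f=0.50 p=0.50  BS=0.5000  approx=0.5000
# f=0.60 p=0.55  BS=0.4850  approx=0.4850
# f=0.90 p=0.95  BS=0.1850  approx=0.1526
```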

7. Conclusions

By requiring the observance of the three basic desiderata (i.e., additivity, exclusive dependence on physical observations, and strictly proper behavior), the scoring rule for probabilistic forecast verification is univocally determined. The proper score function is the logarithm of the probability forecasted for the event that actually occurred, and any different choice necessarily infringes at least one of the three basic desiderata. In addition, the further request of logical consistency, namely assigning identical scores to logically equivalent forecasts however obtained, forces the choice of the offset of the logarithmic scoring scale to zero. Thus, apart from the arbitrary unit fixing the logarithm base, an "absolute" scoring scale can be defined. The mean score per forecast over a sequence of forecast occasions with known occurred events can be taken as a natural skill score. Information theory, with its concepts of entropy and relative entropy of discrete probability distributions, provides useful indications for a meaningful choice of the "reference level" for the skill score. In particular, a fair skill score can be obtained by choosing as the zero-level reference the expected skill score of the totally uninformed forecast, based only on the Laplace principle of indifference, which assigns the same probability to each of the possible events.

The lasting success of the Brier score, one of the most commonly used scores for decades, is explained by its equivalence to the second-order approximation of the logarithmic skill score, which makes it an effective scoring rule in many situations. Nevertheless, it shows its limitations when applied to forecasts of very rare (or very frequent) events, as well as to forecasts of poor quality.

Though most of the presented results have been known for decades in the fields of statistical mechanics and information theory, their use entered the weather forecast community only recently and with some difficulty. We hope that this paper contributes to clarifying the effectiveness and soundness of the approach based on information and probability theory, whose entropy and subjective-probability concepts appear to be the natural framework for probabilistic forecast verification.

Acknowledgments

This work was accomplished with the help of the Foundation for Climate and Sustainability (FCS), a private nonprofit foundation, and despite the drastic reduction of public funding for research pursued by the Italian government (see the article "Cut-Throat Savings," Nature, Vol. 455, October 2008, available online at http://www.nature.com/nature/journal/v455/n7215/full/455835b.html). Thanks are due to Massimiliano Pasqui and Jacopo Primicerio for stimulating discussions and to Andrea Orlandi and Alberto Ortolani for their useful suggestions and encouragement.

REFERENCES

  • Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530.

  • Bernardo, J. M., 1979: Expected information as expected utility. Ann. Stat., 7, 686–690.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

  • Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. Compendium of Meteorology, T. Malone, Ed., Amer. Meteor. Soc., 841–848.

  • Brocker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382–388.

  • Epstein, E., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.

  • Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. Mon. Wea. Rev., 120, 361–370.

  • Gibbs, J. W., 1902: Elementary Principles in Statistical Mechanics. Charles Scribner’s Sons, 207 pp.

  • Good, I. J., 1952: Rational decisions. J. Roy. Stat. Soc., 14A, 107–114.

  • Jaynes, E. T., 1957: Information theory and statistical mechanics. Phys. Rev., 106, 620–630.

  • Jewson, S., 2008: The problem with the Brier score. arXiv:physics/0401046v1 [physics.ao-ph]. [Available online at http://arxiv.org/abs/physics/0401046v1.]

  • Jolliffe, I. T., and D. B. Stephenson, 2008: Proper scores for probability forecasts can never be equitable. Mon. Wea. Rev., 136, 1505–1510.

  • Kullback, S., and R. A. Leibler, 1951: On information and sufficiency. Ann. Math. Stat., 22, 79–86.

  • Leung, L-Y., and G. R. North, 1990: Information theory and climate prediction. J. Climate, 3, 5–14.

  • Lindley, D. V., 1985: Making Decisions. John Wiley & Sons, 207 pp.

  • Mason, I. B., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.

  • Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.

  • Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660.

  • Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.

  • Talagrand, O., R. Vautard, and B. Strauss, 1999: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25.

  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 464 pp.

Fig. 1. Plot of the maximum FS, attained when the forecasted probability p exactly corresponds to the observed frequency f of the event.

Fig. 2. Three-dimensional surface representing the FS as a function of both the forecasted probability p and the observed occurrence frequency f.

Fig. 3. Comparison between AS and BS for p = f.

Fig. 4. Three-dimensional surface representing the difference between AS and BS.

Citation: Monthly Weather Review 138, 1; 10.1175/2009MWR2945.1

1

In this sense only conditional probabilities exist, and the “absolute” probability of a given event E does not make sense. For the sake of brevity we adopt the notation p(E) instead of the more cumbersome p(E|I) to indicate the probability of the event E conditional on some information I whose specification is unnecessary.

2

We think that locality is a very unfortunate term, since it is used in physics for the principle stating that a natural phenomenon is influenced directly only by what happens in its immediate surroundings, because of the finite propagation velocity of physical effects. Even within weather forecasting, the term also refers to spatial smoothness properties of fields such as temperature or pressure.

3

By “logically equivalent” we mean that the first and the second forecasts jointly provide exactly the same information as the third forecast. In other words, they are simply two different ways of expressing exactly the same prediction. The logical inconsistency of assigning different scores to them within the logical system of the scoring rules is clear: both the statement “the score for the two forecasts is the same” and its negation could be proved true, the former by reducing the double prediction to the third, multiple forecast and then calculating its score, the latter by calculating the scores separately, without any application of the logical equivalence.
