## 1. Introduction

The importance of assessing the quality of forecasts is widely recognized both by the numerical weather prediction community and by synoptic-empirical meteorologists, who base their predictions on the experienced analysis of large-scale atmospheric motions. In fact, in the absence of a reliable verification scheme, no clear evidence can be produced, and both the comparison between different forecast methods and the real effectiveness of corrections applied to a given procedure become questionable.

Since the pioneering work of Brier (1950), Brier and Allen (1951), and Good (1952), various kinds of scores have provided operational tools to quantify forecast goodness. The main drawback of this approach resides in the large number of different scores proposed, such as the Brier score (Brier 1950), the logarithmic score (Good 1952), the ranked probability score (Epstein 1969; Murphy 1971), the relative operating characteristic (ROC) curve (Mason 1982), rank histograms (Talagrand et al. 1999), and so on. It often happens that the same forecast turns out better than another using a certain score, and worse using a different one. This prevents both researchers and users from getting reliable feedback on how good a given series of predictions is, and raises the question of whether a unique measure of forecast quality exists. It is important to stress that “quality” in this context means “accordance with reality,” as derived from the available observations, as opposed to any personal concept of utility arising from specific situations or subjective preferences.

Nowadays a general consensus exists in favor of probabilistic forecasts (Anderson 1996), in part because of the recent techniques of ensemble forecasting, which provide output in terms of probability distributions instead of single deterministic values. A similar consensus regards the use of strictly proper scores (Brocker and Smith 2007), that is, scores whose expected result is maximized if and only if the announced probabilities for the series of events to be predicted exactly correspond to those believed to be closest to reality. In other words, proper scores promote honesty, in the sense that the forecaster is forced to declare what he actually believes to be the best prediction in order to achieve the best score; any different behavior yields poorer expected results. Nevertheless, even restricting the topic to probabilistic forecasts evaluated by strictly proper scores, a unique verification score is not found. As proposed by many authors (Lindley 1985; Leung and North 1990; Roulston and Smith 2002), the information theory founded by Shannon (1948), but already present in Gibbs’s seminal work (Gibbs 1902), provides the right framework for the problem of forecast verification. In this paper, following the pattern used by Shannon to define the information entropy, we show by means of elementary mathematics that the logarithmic score is the only one to respect three basic desiderata whose violation can hardly be accepted. A more formal and cumbersome demonstration of an equivalent result may be found, for example, in a paper by Bernardo (1979) on the design of experiments by maximization of the expected information. The mathematical apparatus used by Bernardo, involving probability distributions, functionals, and variational calculus, is more complicated than necessary for scoring rules as commonly applied to forecast verification. In our opinion this unnecessary complication has overshadowed both the application and the importance of such results as normative procedures for the quality assessment of probabilistic forecasts.

## 2. Basic desiderata for scoring rules

Consider a set of *m* mutually exclusive and exhaustive events {*E _{1}*, *E _{2}*, … , *E _{m}*}, meaning that

$$p(E_i|E_j) = \delta_{ij} \quad \text{and} \quad \sum_{i=1}^{m} p(E_i) = 1,$$

where *p*(*E _{i}*) indicates the probability for the *i*th event, *p*(*E _{i}*|*E _{j}*) is the conditional probability for the event *E _{i}* given *E _{j}*, and *δ _{ij}* is the Kronecker delta. The term “probability” is here intended in a cognitive sense as “degree of belief that an event will occur or turn out true.”^{1}

A given forecast over the set of *m* events is univocally determined by assigning the probabilities *p*(*E _{i}*) for all the *m* events, so that it can be usefully represented by the *m*-dimensional vector **p** whose components are the probabilities *p*(*E _{i}*). Consistently, a sequence of forecasts issued on *n* different occasions will be represented by the *m* × *n* matrix 𝗣 whose *k*th column is the vector **p** _{k} for the *k*th forecast. Finally, to represent the series of events that actually occurred at the *n* different occasions we use the *n*-dimensional vector **ϵ** with integer components *ϵ _{k}* such that *ϵ _{k}* = *i* if the *i*th event occurs at the *k*th occasion.

Using this notation we state three basic desiderata that any score for forecast verification is required to satisfy: additivity, exclusive dependence on physical observations (“locality”), and strictly proper behavior. In detail, these requirements read as follows:

- The score *S _{n}* for the forecast sequence 𝗣 is a differentiable function of the variables *p _{k}*(*E _{i}*), additive with respect to each single forecast, in the sense that, adding a new forecast occasion, the score for the extended sequence is given by

  $$S_{n+1} = S_n + S_1(\mathbf{p}_{n+1}). \tag{1}$$

- The score *S _{n}* depends only on the probabilities assigned to the events that actually occurred:

  $$S_n = S_n[p_1(E_{\epsilon_1}), p_2(E_{\epsilon_2}), \ldots, p_n(E_{\epsilon_n})]. \tag{2}$$

- If the probabilities assigned to each event are held constant in all the forecast occasions (i.e., **p** _{k} = **p** for all *k*), then the score exhibits an extremant when such probabilities exactly correspond to the observed occurrence frequencies *f _{i}* of the events, or in vectorial notation when **p** = **f**. Mathematically, writing for the sake of brevity *p*(*E _{i}*) ≡ *p _{i}*,

  $$\left. \frac{\partial S_n}{\partial p_i} \right|_{\mathbf{p}=\mathbf{f}} = 0 \quad \text{for } i = 1, \ldots, m, \tag{3}$$

  with the constraint $\sum_i p_i = 1$.
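
The third desideratum can be illustrated with a small numerical check for the logarithmic score singled out later in the paper: the expected score per forecast, Σ _{i} *f _{i}* ln *p _{i}*, is maximized exactly at **p** = **f**. The following sketch uses our own function and variable names, not code from the paper:

```python
import numpy as np

# Expected score per forecast for the logarithmic rule S(p) = ln(p):
# E[S] = sum_i f_i * ln(p_i). Strict propriety means this is maximized
# over the probability simplex exactly at p = f.
def expected_log_score(f, p):
    return float(np.sum(f * np.log(p)))

f = np.array([0.5, 0.3, 0.2])              # observed frequencies
honest = expected_log_score(f, f)          # announce p = f
hedged = expected_log_score(f, np.array([0.4, 0.4, 0.2]))
print(honest > hedged)  # True: any other announcement scores lower
```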

As demanded by common sense, the first requirement of additivity implies that each single forecast affects the global score for the whole sequence in the same manner as any other, and any variation of its score produces an identical variation in the global one. As for the information entropy, continuity with respect to the variables *p _{k}* would be sufficient; we require differentiability, a slightly stronger constraint, to make the demonstration of uniqueness easier. In practice all the proposed and currently used scores are additive, and most of them are differentiable functions of the probabilities *p _{k}*.

The second desideratum may appear quite arbitrary or at least more specific. On the contrary, it is really fundamental in a very general sense, since any sound scientific measure of the accordance between model predictions and natural phenomena cannot depend on what could have been observed but was not. The goodness of the accordance must be determined only by what is actually observed. A little reflection shows that this principle of modern science, probably stated for the first time by Galilei, also avoids situations that common sense can hardly accept. In fact, two sequences of forecasts that have assigned exactly the same probabilities to a series of observed events cannot gain different scores on the basis of probabilities assigned to events that never occurred. Common sense suggests that both deserve the same score until some event treated as different is observed. Surprisingly, many widely used scores violate this basic principle, which corresponds to the so-called locality^{2} (e.g., see Brocker and Smith 2007). As far as we know, the importance of this property has never been fully recognized in the field of forecast verification, where it has been considered on the same level as other arbitrary properties, such as the “equitability” defined by Gandin and Murphy (1992). For example, the famous Brier score, whose lasting success is probably due to some of its properties that we explain better in section 6, is “nonlocal,” and its “nonlocality” is accepted as a legitimate choice.

Finally, the third desideratum restricts the choice to the class of strictly proper scores since, before the actual series of events takes place, the expected score for announcing the probabilities *p̃ _{i}* is calculated by taking the believed probabilities *p _{i}* as the true frequencies *f _{i}*. Thus, the expected score reaches an extremant when the announced probabilities equal those believed true, and any different choice is expected to yield a worse score. Contrary to locality, the importance of being proper is now widely recognized (Brocker and Smith 2007). For example, in a recent paper Jolliffe and Stephenson (2008) noticed that proper scores can never be equitable, concluding that “propriety” is the more fundamental requirement.

## 3. Score function determination

The first two desiderata reduce the score for a sequence of *n* forecasts to a summation of *n* terms, each of them being the score for the *k*th forecast:

$$S_n = \sum_{k=1}^{n} S[p_k(E_{\epsilon_k})], \tag{4}$$

where *S* ≡ *S _{1}*, with the index no longer necessary. Clearly the determination of the function *S* also determines the global score *S _{n}*. Let us consider a sequence of forecasts for which the probabilities assigned to each event are kept constant through the *n* occasions. In this case the expression in (4) can be written as a summation over the *m* possible events:

$$S_n = n \sum_{i=1}^{m} f_i S(p_i), \tag{5}$$

where *f _{i}* = *n _{i}*/*n* is the observed frequency of the *i*th event in all the *n* occasions, when it occurs *n _{i}* times. Now let us focus on two of the *m* events, for example *E _{1}* and *E _{2}*. Apparently, if *p _{1}* is assumed free to vary between 0 and *r _{1,2}* ≡ *p _{1}* + *p _{2}*, then *p _{2}* is constrained to *r _{1,2}* − *p _{1}*. Thus (5) becomes

$$\frac{S_n}{n} = f_1 S(p_1) + f_2 S(r_{1,2} - p_1) + \sum_{i>2} f_i S(p_i), \tag{6}$$

and, according to the third desideratum, the derivative with respect to *p _{1}* must vanish for **p** = **f**:

$$f_1 S'(f_1) - f_2 S'(r_{1,2} - f_1) = 0. \tag{7}$$

Since at **p** = **f** we have *f _{2}* = *r _{1,2}* − *f _{1}*, the frequency *f _{1}* can be regarded as a generic variable *x* ranging from 0 to *r*, *r* being its generic upper limit, which can take any value between 0 and 1. Thus (7) corresponds to the functional equation

$$x S'(x) = (r - x) S'(r - x) \quad \text{for any } x \in (0, r). \tag{8}$$

Defining *xS*′(*x*) ≡ Φ(*x*), this reads

$$\Phi(x) = \Phi(r - x), \tag{9}$$

that is, Φ is symmetric with respect to *x* = *r*/2; since this symmetry must hold for any *r* ∈ (0, 1), we conclude that the only possibility is Φ(*x*) = *a* for any *x* ∈ (0, 1), *a* being an arbitrary constant. Thus, at least for a number of events greater than 2, by simple integration we have

$$S(x) = a \ln x + b, \tag{10}$$

where *b* is another arbitrary constant.

This seemingly surprising result tells us that the respect of the three basic desiderata, in addition to being a necessary condition for a scoring rule, is also sufficient to univocally define the form of the score function *S* and hence the global score *S _{n}* for a sequence of *n* forecasts. Any choice different from (10) necessarily implies a violation of at least one of the three basic desiderata. Since they express very basic requirements suggested by common sense and needed for a sound scientific approach, a deliberate violation of one of them can hardly be accepted and, if present, some inconsistency or nonsense is to be expected.

The constants *a* and *b* represent the arbitrary scale factor and offset, respectively, apart from which any scoring scale to evaluate a given forecast sequence is defined. Clearly no essential distinction exists between scores that differ only by an offset and a multiplicative factor, since they coincide after a translation of the zero and a redefinition of the unit.
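
The unique score function of Eq. (10), together with the additive global score it induces, can be sketched in a few lines. This is a toy illustration with our own helper names, not code from the paper:

```python
import math

# Eq. (10): the unique score function S(x) = a*ln(x) + b.
def score(x, a=1.0, b=0.0):
    return a * math.log(x) + b

# Additive global score for a sequence: forecasts[k] is the announced
# probability vector p_k, occurred[k] the index eps_k of the event
# observed at the kth occasion.
def global_score(forecasts, occurred, a=1.0, b=0.0):
    return sum(score(p[e], a, b) for p, e in zip(forecasts, occurred))

forecasts = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
occurred = [0, 1]                          # E_1 occurred, then E_2
print(global_score(forecasts, occurred))   # ln(0.7) + ln(0.8)
```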

## 4. Single-event forecasts

The logarithmic form (10) for *S*(*x*) has been obtained assuming a number *m* of events greater than 2. Even if it seems natural to maintain the same form also for *m* = 2, a sounder argument is needed. As is evident, when *m* = 2 the set of mutually exclusive and exhaustive events is composed of only the event *E* and its negation *Ē*, so that there is a single probability *p*(*E*) = *p* to be predicted, being necessarily *p*(*Ē*) = 1 − *p*. Since in this case *r* = 1 with no possibility of variation, any function Φ(*x*) symmetric with respect to *x* = ½ satisfies the functional Eq. (9), and the necessity of the logarithmic form for *S*(*x*) seems to disappear.

On the contrary, consider the situation of a first forecaster who announces the probability *p _{1}* for the event *E _{1}* (e.g., “rain for tomorrow”) and a second forecaster who knows for certain that only when *E _{1}* does not occur can a second event *E _{2}* be observed (e.g., “rain for the day after tomorrow”), so that he announces the conditional probability *p _{2}* ≡ *p*(*E _{2}*|*Ē _{1}*) for the occasions when *E _{1}* does not occur. Given a series of *n* forecast occasions within which *E _{1}* occurred *n _{1}* times, the score for the first forecaster will be

$$S' = n_1 S(p_1) + (n - n_1) S(1 - p_1), \tag{11}$$

while the second forecaster is evaluated only on the *n* − *n _{1}* occasions:

$$S'' = n_2 S(p_2) + (n - n_1 - n_2) S(1 - p_2), \tag{12}$$

where *n _{2}* is the number of occurrences of *E _{2}*. Since both forecasters provide single-event forecasts (*m* = 2), *S*(*x*) is at the moment an unknown score function. If a third forecaster announces both the probabilities *p̃ _{1}* and *p̃ _{2}* for *E _{1}* and *E _{2}*, and consequently the probability *p̃ _{3}* = 1 − *p̃ _{1}* − *p̃ _{2}* for the third possible event *E _{3}* (e.g., “no rain for both tomorrow and the day after”), then he is providing a multiple-event forecast with *m* = 3. In this case we know that his score must be

$$S''' = n_1 S(\tilde{p}_1) + n_2 S(\tilde{p}_2) + n_3 S(\tilde{p}_3). \tag{13}$$

Since *p̃ _{1}* = *p _{1}*, *p̃ _{2}* = (1 − *p _{1}*)*p _{2}*, and *p̃ _{3}* = (1 − *p _{1}*)(1 − *p _{2}*), the third multiple forecast is logically equivalent^{3} to the joint of the first and the second forecast, so for logical consistency we have to require that the sum of the scores for the two single-event predictions equals the score of the multiple one for at least one choice of the constants *a* and *b*:

$$S' + S'' = S'''. \tag{14}$$

Requiring this identity to hold for any *n _{1}*, *n _{2}*, and *n _{3}* ≡ *n* − *n _{1}* − *n _{2}*, we get three functional equations to be satisfied simultaneously:

$$\begin{aligned} S(p_1) &= S(p_1), \\ S(1 - p_1) + S(p_2) &= S[(1 - p_1)p_2], \\ S(1 - p_1) + S(1 - p_2) &= S[(1 - p_1)(1 - p_2)]. \end{aligned} \tag{15}$$

The first is trivially satisfied, while the other two have the form of Cauchy’s logarithmic functional equation: only with *b* = 0 [i.e., *S*(*x*) = *a* ln *x*] are all three equations satisfied, and no other choice is possible. Thus, for single-event forecasts no possibility other than a logarithmic score function like (10) exists. But a new, interesting, and unexpected result has also been obtained: the logical consistency requirement of assigning identical scores to logically equivalent forecasts, when added to the three basic desiderata, forces the offset *b* of the scoring scale to zero. It is something like the temperature scales in thermodynamics: both the offset and the unit are arbitrary until we require the equivalence between temperature and thermal energy, and then only the absolute temperature scale starting from zero is possible. As suggested by this analogy, the score coming from the choice *b* = 0 can be named the “absolute score.”
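
The role of the offset in this logical-consistency argument is easy to check numerically: with *S*(*x*) = *a* ln *x* + *b*, the gap between the sum of the two single-event scores and the corresponding multiple-forecast score equals *b*, so only *b* = 0 works. A minimal sketch (function names are ours):

```python
import math

# S(x) = a*ln(x) + b, the general form allowed by the three desiderata.
def S(x, a, b):
    return a * math.log(x) + b

# Consistency gap for one of the functional equations of section 4:
# S(1-p1) + S(p2) - S((1-p1)*p2). For the logarithmic family the
# logarithms cancel exactly and the gap equals b.
def consistency_gap(p1, p2, a, b):
    return S(1 - p1, a, b) + S(p2, a, b) - S((1 - p1) * p2, a, b)

for b in (0.0, 0.5):
    gap = consistency_gap(0.3, 0.6, a=1.0, b=b)
    print(b, abs(gap) < 1e-12)   # True only for b = 0
```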

## 5. Skill scores and information

The absolute score for a sequence of *n* forecasts is easily calculated once the actual series *ϵ* of occurred events is known:

$$S_n = a \sum_{k=1}^{n} \ln p_k(E_{\epsilon_k}), \tag{16}$$

with *b* = 0 and no different choice possible. Nevertheless, whatever the offset *b* is, the score in (16) cannot be directly used for comparing the skillfulness of forecast methods applied to different series of occasions, especially when they differ in length. A natural skill score is the mean score per forecast, obtained by dividing *S _{n}* by *n*:

$$\mathrm{SS} = \frac{S_n}{n}. \tag{17}$$

If the probabilities *p _{k}* are kept constant, as in (5), this becomes

$$\mathrm{SS} = a \sum_{i=1}^{m} f_i \ln p_i + b, \tag{18}$$

which corresponds, apart from the constants *a* and *b*, to the negative of the cross entropy *H _{c}* between the discrete probability and frequency distributions **p** and **f** (Jaynes 1957):

$$H_c(\mathbf{f}, \mathbf{p}) = -\sum_{i=1}^{m} f_i \ln p_i. \tag{19}$$

*H _{c}* can easily be written as the sum of the entropy of **f** and the relative entropy of **p** with respect to **f**, nowadays better known as the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951):

$$H_c(\mathbf{f}, \mathbf{p}) = H(\mathbf{f}) + D_{\mathrm{KL}}(\mathbf{f} \,\|\, \mathbf{p}), \qquad D_{\mathrm{KL}}(\mathbf{f} \,\|\, \mathbf{p}) = \sum_{i=1}^{m} f_i \ln \frac{f_i}{p_i}. \tag{20}$$

The KL divergence measures how much **p** diverges from **f** in terms of information content and, according to the Gibbs inequality (Gibbs 1902), it is a nonnegative quantity, making the cross entropy always greater than or equal to the entropy *H*(**f**) of the observed frequency distribution.
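
The decomposition of the cross entropy into entropy plus KL divergence, and the Gibbs inequality it rests on, can be verified numerically. A minimal sketch (function names are ours):

```python
import numpy as np

# Entropy of the observed frequency distribution f.
def entropy(f):
    return float(-np.sum(f * np.log(f)))

# Cross entropy between frequencies f and announced probabilities p.
def cross_entropy(f, p):
    return float(-np.sum(f * np.log(p)))

# KL divergence of p from f; nonnegative by the Gibbs inequality.
def kl_divergence(f, p):
    return float(np.sum(f * np.log(f / p)))

f = np.array([0.6, 0.3, 0.1])
p = np.array([0.5, 0.25, 0.25])
# Decomposition: H_c(f, p) = H(f) + D_KL(f || p), hence H_c >= H(f).
print(np.isclose(cross_entropy(f, p), entropy(f) + kl_divergence(f, p)))
print(kl_divergence(f, p) >= 0.0)
```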

Information theory also provides useful indications for a meaningful choice of the constants *a* and *b*. First of all, for merely psychological reasons and by common practice, we are inclined to associate positive values with better results and negative values with worse ones, so a positive-oriented scoring scale (i.e., the higher the score, the better the result) is in general more immediate. Besides, the zero value can usefully indicate null results. For example, if only the climatological frequencies *f _{i}^{c}* of the events {*E _{i}*} are known, then the best and most honest prediction, before the development of more accurate forecast methods, would be to assign the probability *p _{i}* = *f _{i}^{c}* to each event *E _{i}*. This can be taken as the “zero level” reference to evaluate and verify the quality and effectiveness of any finer forecast exploiting further information and knowledge: if the forecast uses them skillfully, then it should be at least as effective as the zero-level prediction, and more effective when the additional information is truly weighty and the knowledge significant. Since the climatological frequencies are expected to repeat, the expected cross entropy for the zero-level prediction corresponds to the entropy of the climatological frequency distribution **f**^{c}. This suggests the choice *a* = 1 and *b* = *H*(**f**^{c}), yielding the skill score

$$\mathrm{SS} = \sum_{i=1}^{m} f_i \ln p_i + H(\mathbf{f}^c), \tag{21}$$

which, introducing the difference Δ*f _{i}* ≡ *f _{i}* − *f _{i}^{c}* between actually observed and climatologically expected frequencies, can be rewritten as

$$\mathrm{SS} = -D_{\mathrm{KL}}(\mathbf{f}^c \,\|\, \mathbf{p}) + \sum_{i=1}^{m} \Delta f_i \ln p_i. \tag{22}$$

A rapid look at the skill score in (22) shows that if the observed frequencies equal those expected by climatology (i.e., Δ*f _{i}* = 0 for all *i*), then the skill score reduces to the negative of the KL divergence between **f**^{c} and **p** and is always nonpositive, indicating that no result better than the climatological information can be attained. In particular, the zero-level prediction **p** = **f**^{c} gains exactly a null skill score. If instead Δ*f _{i}* ≠ 0 for some *i* (i.e., **f** ≠ **f**^{c}), positive values become attainable by forecasts closer to the actually observed frequencies **f** than to **f**^{c}. The maximum value *H*(**f**^{c}) − *H*(**f**), attainable if and only if **p** = **f**, can be positive or negative. Since the larger the indetermination of a distribution, the higher its entropy, the maximum turns out positive when the observed frequency distribution **f** is less indeterminate than **f**^{c}.

On the other hand, the maximally indeterminate climatological distribution (for which all the *m* events have the same occurrence frequency 1/*m*) can be forecasted simply by using the Laplace principle of indifference, without any effort to gather information or any necessity to acquire specific knowledge. Indeed, the indifference principle being the forecast method of the totally uninformed, it can be taken as a sort of absolutely fair zero level, yielding the corresponding “fair skill score” (FS) by setting *b* = *H*(1/*m*) ≡ ln *m*:

$$\mathrm{FS} = \ln m + \sum_{i=1}^{m} f_i \ln p_i. \tag{23}$$

These results may be generalized to any sequence of *n* forecasts whose probabilities for the *m* events change at each occasion. In fact, such a sequence can be thought of as the sum of subsequences for which the probabilities are kept constant and, by the additivity property, the total score must be the sum of the subsequences’ partial scores. Nevertheless, the additivity property does not transfer to the mean score per forecast, so the skill score for the whole sequence is not the sum of the skill scores for each subsequence. For example, the FS has to be calculated as

$$\mathrm{FS} = \ln m + \frac{1}{n} \sum_{k=1}^{n} \ln p_k(E_{\epsilon_k}). \tag{24}$$

If a forecaster always announces as certain the events that actually occur [i.e., *p _{k}*(*E _{ϵ_k}*) = 1 for all *k*], he gains a skill score FS = ln *m*, which indeed corresponds to the absolute maximum. Besides, not only honesty, precision, and accuracy, but also resolution power is promoted: the larger the set of *m* possible events, the higher the perfect-forecast skill score turns out to be.
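
The FS for a whole sequence, ln *m* plus the mean log probability assigned to the occurred events, takes only a few lines to compute. A minimal sketch (helper names are ours):

```python
import math

# Fair skill score for a sequence: FS = ln(m) + (1/n) * sum_k ln(p_k[eps_k]),
# where forecasts[k] is the announced probability vector and occurred[k]
# the index of the event observed at the kth occasion.
def fair_skill_score(forecasts, occurred, m):
    n = len(occurred)
    mean_log = sum(math.log(p[e]) for p, e in zip(forecasts, occurred)) / n
    return math.log(m) + mean_log

# A perfect forecaster (probability 1 on each occurred event) reaches the
# absolute maximum ln(m):
perfect = [[1.0, 0.0], [0.0, 1.0]]
print(fair_skill_score(perfect, [0, 1], m=2) == math.log(2))  # True
```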

For a single-event forecast the set of possible events is composed of the event *E* and its negation *Ē*: if *p*(*E*) = *p*, necessarily *p*(*Ē*) = 1 − *p*, and analogously for the observed frequencies: if *f*(*E*) = *f*, then *f*(*Ē*) = 1 − *f*. Thus, the FS in (23) turns out to be

$$\mathrm{FS} = \ln 2 + f \ln p + (1 - f) \ln(1 - p). \tag{25}$$

Whatever the observed frequency *f* is, the skill score reaches its maximum, corresponding to the best result, when *p* = *f*:

$$\mathrm{FS_{max}} = \ln 2 + f \ln f + (1 - f) \ln(1 - f). \tag{26}$$

Plotting FS_{max} as a function of *f* we find the graph drawn in Fig. 1.

The absolute maximum, equal to ln2 ≅ 0.69, is approached for *f* → 0 or *f* → 1 (i.e., for very rare or very frequent events). Despite the fact that it could at first appear a strange result, a little reflection shows that it is completely reasonable. Because of the arbitrary choice of what to call *E* and what its negation *Ē*, it would be very risky to assign *p* = 0 to very rare events, as well as *p* = 1 to very frequent ones. In fact, a single occurrence (or nonoccurrence) is sufficient to collapse the score to −∞, without any possibility of rising again. Thus, unless an event is known for certain to be impossible, the FS compels one to honestly consider it possible. Indeed, this is a general feature of any logarithmic score respecting the three basic desiderata.

On the other hand, FS_{max} vanishes at its absolute minimum for *f* = ½, corresponding to a maximally indeterminate event whose occurrence is as frequent as its nonoccurrence. Again a fair result, since no effort apart from the application of the indifference principle is needed to forecast exactly such a frequency. Thus, as suggested by common sense, a null score is deserved by such a trivial forecast.

Finally the three-dimensional surface representing the FS as a function of both *f* and *p* is shown in Fig. 2. The upper points along the saddle line form the bidimensional curve represented in Fig. 1. For each given occurrence frequency *f*, any departure from the condition *p* = *f* produces a score decrease. As *p* tends to 0 or 1 the FS drops to −∞, unless *f* = 0 or *f* = 1, too.

## 6. The Brier score as an approximation

The Brier score (BS) is defined as

$$\mathrm{BS} = \frac{1}{n} \sum_{k=1}^{n} |\mathbf{p}_k - \hat{\mathbf{e}}_k|^2, \tag{27}$$

where | |^{2} denotes the square modulus and, as usual, *n* is the number of forecast occasions, **p** _{k} is the forecast vector at the *k*th occasion, and **ê** _{k} is the corresponding occurred-event versor, whose *m* components are all null except the *i*th, set to 1, indicating that the event *E _{i}* occurred at the *k*th occasion. In practice, Brier used as score function *S*(**p**) the square of the Euclidean distance between the vectors **p** and **ê** in the probability space of the events.

It can be verified that the BS is a strictly proper score (Wilks 1995) but, obviously, it infringes the second basic desideratum because of its dependence on probabilities assigned to events that never occurred. Although such a violation makes it unsatisfactory, some reason must exist to consider it acceptable in many situations, if the forecasters’ community has been using it for decades. Its useful and well-known decomposition into reliability, resolution, and uncertainty (Wilks 1995) cannot by itself explain the undeniable success of the BS.

For single-event forecasts (*m* = 2) with constant announced probability *p* and observed occurrence frequency *f*, the BS in (27) becomes

$$\mathrm{BS} = 2[f(1 - p)^2 + (1 - f)p^2]. \tag{28}$$

To compare it with the logarithmic score we set in (10) *b* = 0 and *a* = −(2 ln2)^{−1}, obtaining the absolute score

$$\mathrm{AS} = -\frac{1}{2 \ln 2} [f \ln p + (1 - f) \ln(1 - p)], \tag{29}$$

so that AS assumes the same value as BS for *f* = *p* = ½ (for *f* = *p* = 1 and *f* = *p* = 0 they both vanish). As shown in Fig. 3, for the best possible forecasts, when *p* = *f*, BS is very similar to AS and in practice provides the same information. Anyway, because of its violation of the second basic desideratum, some inconsistencies are expected. For example, Roulston and Smith (2002) noticed that the logarithmic score, or “ignorance” according to their terminology, is a double-valued function of the BS, so that two forecasts with the same Brier score can be judged different if “ignorance” is used instead. Besides, Jewson (2008) has recently noticed that the BS leads to conclusions that disagree with our intuition, especially when it is applied to extreme forecasts (*p* ≅ 1 or *p* ≅ 0) or to forecasts whose accordance with the observed frequencies is poor.

Plotting the difference AS − BS, we get the three-dimensional surface shown in Fig. 4. As is visible, when the forecasted probability *p* differs significantly from the observed occurrence frequency *f*, larger differences become possible and BS is no longer a suitable measure for forecast verification. Also for very rare (or very frequent) events BS becomes inadequate as a skill score. In fact, assigning by default *p* = 0 to rare events whose actual occurrence frequency is, say, *f* = 10^{−3}, we get BS = 2 × 10^{−3}. If, maybe after great research efforts, we make the correct prediction *p* = 10^{−3}, then BS = 2 × 10^{−3}(1 − 10^{−3}), a gain of only 0.1% in score. Thus, BS is very unfair for evaluating forecasts of rare (or common) events. On the contrary, logarithmic scores like AS exhibit an automatic safety device against any arbitrary assignment *p* = 0 or *p* = 1, as noticed in section 5.
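
The rare-event arithmetic above is easy to reproduce with the single-event, constant-probability form of the BS quoted in the text. A minimal sketch (function names are ours):

```python
# Per-forecast Brier score for a binary event with occurrence frequency f
# and announced probability p: BS = 2*[f*(1-p)^2 + (1-f)*p^2].
def brier(f, p):
    return 2.0 * (f * (1 - p) ** 2 + (1 - f) * p ** 2)

f = 1e-3
print(brier(f, 0.0))   # default p = 0 gives BS = 2e-3
print(brier(f, f))     # the correct p = f gives 2e-3 * (1 - 1e-3)
# The relative improvement is only ~0.1%, while a logarithmic score
# assigns -infinity to p = 0 whenever the event occurs even once.
```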

The similarity between AS and BS for *p* ≅ *f* is readily explained. Since the function *g*(*x*, *y*) = *x* ln *y* + (1 − *x*) ln(1 − *y*) exhibits an extremant for *x* = *y* = ½, with its second derivative with respect to *x* clearly vanishing, its second-order expansion about such an extremant point is

$$g(x, y) \simeq -\ln 2 + 4\left(x - \tfrac{1}{2}\right)\left(y - \tfrac{1}{2}\right) - 2\left(y - \tfrac{1}{2}\right)^2, \tag{30}$$

which, setting *x* = *f* and *y* = *p* and comparing with (28), can be rewritten as *g* ≃ ½ − ln2 − BS. Thus, apart from an offset and a scale factor, the BS coincides with the second-order expansion about *p* = *f* = ½ of the logarithmic skill score in (29). This explains the wide and lasting use of the BS, but also indicates and quantifies its limits.
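
The second-order agreement between the Brier score and the logarithmic function *g* near *f* = *p* = ½ can be checked numerically; the sketch below (names ours) compares *g* with ½ − ln2 − BS at a nearby point:

```python
import math

# g(x, y) = x*ln(y) + (1-x)*ln(1-y), the logarithmic expected score.
def g(x, y):
    return x * math.log(y) + (1 - x) * math.log(1 - y)

# Per-forecast Brier score in the same variables.
def brier(x, y):
    return 2.0 * (x * (1 - y) ** 2 + (1 - x) * y ** 2)

x, y = 0.52, 0.47                       # a point close to (1/2, 1/2)
exact = g(x, y)
approx = 0.5 - math.log(2) - brier(x, y)   # second-order approximation
print(abs(exact - approx) < 1e-3)       # True: they agree to second order
```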

## 7. Conclusions

Requiring the observance of the three basic desiderata (i.e., additivity, exclusive dependence on physical observations, and strictly proper behavior), the scoring rule for probabilistic forecast verification is univocally determined. The proper score function is the logarithm of the probability forecasted for the event that actually occurred, and any different choice necessarily infringes at least one of the three basic desiderata. In addition, the further request of logical consistency, which assigns identical scores to logically equivalent forecasts however obtained, forces the offset of the logarithmic scoring scale to zero. Thus, apart from the arbitrary unit fixing the logarithm base, an “absolute” scoring scale can be defined. As a natural skill score, the mean score per forecast over a sequence of forecast occasions with known occurred events can be taken. Information theory, with its concepts of entropy and relative entropy of discrete probability distributions, provides useful indications for a meaningful choice of the “reference level” for the skill score. In particular, a fair skill score can be obtained by choosing as the zero-level reference the expected skill score of the totally uninformed forecast, based only on the Laplace principle of indifference, which assigns the same probability to each of the possible events.

The lasting success of the Brier score, one of the most commonly used for decades, is explained by its equivalence to the second-order approximation of the logarithmic skill score, which makes it an effective scoring rule in many situations. Nevertheless, it shows its limitations when applied to forecasts of very rare (or very frequent) events, as well as to forecasts of poor quality.

Though most of the presented results have been known for decades in the fields of statistical mechanics and information theory, their use entered the weather forecast community only recently and with some difficulty. We hope that this paper contributes to clarifying the effectiveness and soundness of the approach based on information and probability theory, whose entropy and subjective probability concepts appear to be the natural framework for probabilistic forecast verification.

## Acknowledgments

This work was accomplished with the help of the Foundation for Climate and Sustainability (FCS), a private nonprofit foundation, and despite the drastic reduction of public funding for research pursued by the Italian government (see the article “Cut-Throat Savings,” *Nature,* Vol. 455, October 2008, available online at http://www.nature.com/nature/journal/v455/n7215/full/455835b.html). Thanks are due to Massimiliano Pasqui and Jacopo Primicerio for stimulating discussions and to Andrea Orlandi and Alberto Ortolani for their useful suggestions and encouragement.

## REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. *J. Climate,* **9,** 1518–1530.

Bernardo, J. M., 1979: Expected information as expected utility. *Ann. Stat.,* **7,** 686–690.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.,* **78,** 1–3.

Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. *Compendium of Meteorology,* T. Malone, Ed., Amer. Meteor. Soc., 841–848.

Brocker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. *Wea. Forecasting,* **22,** 382–388.

Epstein, E., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.,* **8,** 985–987.

Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. *Mon. Wea. Rev.,* **120,** 361–370.

Gibbs, J. W., 1902: *Elementary Principles in Statistical Mechanics.* Charles Scribner’s Sons, 207 pp.

Good, I. J., 1952: Rational decisions. *J. Roy. Stat. Soc.,* **14A,** 107–114.

Jaynes, E. T., 1957: Information theory and statistical mechanics. *Phys. Rev.,* **106,** 620–630.

Jewson, S., cited 2008: The problem with the Brier score. arXiv:physics/0401046v1 [physics.ao-ph]. [Available online at http://arxiv.org/abs/physics/0401046v1].

Jolliffe, I. T., and D. B. Stephenson, 2008: Proper scores for probability forecasts can never be equitable. *Mon. Wea. Rev.,* **136,** 1505–1510.

Kullback, S., and R. A. Leibler, 1951: On information and sufficiency. *Ann. Math. Stat.,* **22,** 79–86.

Leung, L.-Y., and G. R. North, 1990: Information theory and climate prediction. *J. Climate,* **3,** 5–14.

Lindley, D. V., 1985: *Making Decisions.* John Wiley & Sons, 207 pp.

Mason, I. B., 1982: A model for assessment of weather forecasts. *Aust. Meteor. Mag.,* **30,** 291–303.

Murphy, A. H., 1971: A note on the ranked probability score. *J. Appl. Meteor.,* **10,** 155–156.

Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. *Mon. Wea. Rev.,* **130,** 1653–1660.

Shannon, C. E., 1948: A mathematical theory of communication. *Bell Syst. Tech. J.,* **27,** 379–423, 623–656.

Talagrand, O., R. Vautard, and B. Strauss, 1999: Evaluation of probabilistic prediction systems. *Proc. ECMWF Workshop on Predictability,* Reading, United Kingdom, ECMWF, 1–25.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences: An Introduction.* Academic Press, 464 pp.

Three-dimensional surface representing the FS as a function of both the forecasted probability *p* and the observed occurrence frequency *f*.

Citation: Monthly Weather Review 138, 1; 10.1175/2009MWR2945.1

Comparison between AS and BS for *p* = *f*.

Three-dimensional surface representing the difference between AS and BS.

^{1}

In this sense only conditional probabilities exist, and the “absolute” probability of a given event *E* does not make sense. For the sake of brevity we adopt the notation *p*(*E*) instead of the more cumbersome *p*(*E*|*I*) to indicate the probability of the event *E* conditional on some information *I* whose specification is unnecessary.

^{2}

We think that locality is a very unfortunate term, since in physics it is used for the principle stating that a natural phenomenon is influenced directly only by what happens in its immediate surroundings, because of the finite propagation velocity of physical effects. Even if we restrict the term to weather forecasts, it also evokes spatial smoothness properties of fields such as temperature or pressure.

^{3}

By “logically equivalent” we mean that the first and the second forecast jointly provide exactly the same information as the third forecast. In other words, they are simply two different ways of expressing exactly the same prediction. The logical inconsistency of assigning different scores to them within the logical system of the scoring rules is clear: both the statement “the score for the two forecasts is the same” and its negation can be proved true, the former by reducing the double prediction to the third multiple forecast and then calculating its score, the latter by calculating the scores separately, without any application of the logical equivalence.