1. Introduction
Forecast verification is a crucial aspect of any prediction system: it is important to assess the quality of forecasts if improvements are to be made. A large number of verification measures have been suggested (Jolliffe and Stephenson 2003). To narrow the range of possible measures, a number of desirable properties of measures have been proposed and generally accepted (Murphy 1993; Mason 2003). For probability forecasts of a binary event, the two most frequently cited properties are propriety (Winkler and Murphy 1968) and equitability (Gandin and Murphy 1992). A score is proper if it is optimized only when a forecaster predicts according to his or her true beliefs, and equitable if the same expected score is achieved by all unskillful forecasts of a certain type, such as constant forecasts.
The widely used Brier score (Brier 1950) is proper but not equitable (Mason 2004), and recent research has attempted to adapt the score to circumvent this "nonequitability" (Mason 2004; Müller et al. 2005; Weigel et al. 2007). It does not seem to have been noted in previous studies that propriety and equitability are incompatible. A proof that no score from a wide class of verification measures can simultaneously possess both properties is given in section 4 of this paper. Before that, the Brier score, propriety, and equitability are defined in section 2, and recent attempts to alleviate or remove the nonequitability of the Brier skill score are briefly discussed in section 3.
2. Definitions
a. Probability forecasts of binary events
Suppose that a set of probability forecasts is made of n binary events of interest such as “precipitation tomorrow” or “damaging frost next month.” Denote the n forecasts by {p1, p2, . . . , pn}, where each pi is a probability between 0 and 1. The corresponding observations {x1, x2, . . . , xn} are coded as 1 if the event occurs and 0 if it does not. To assess the quality of the forecasts, various measures or scores can be constructed that quantify the difference between the set of forecasts and the corresponding set of observations.
The most widely used such measure is the Brier score (Brier 1950),

  B = (1/n) Σ_{i=1}^{n} (p_i − x_i)²,   (1)

which is negatively oriented: smaller values indicate better forecasts. A score is often converted to a skill score that measures performance relative to an unskillful reference forecast; for the Brier score the usual form is Bs = 1 − B/B_ref, where B_ref is the Brier score of the reference forecast, conventionally a constant forecast of the climatological probability (base rate) θ of the event.
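For concreteness, the following minimal sketch (our own illustration, not code from any of the studies cited here) computes B and Bs for a small verification sample, using the sample climatology as the reference:

import numpy as np

def brier_score(p, x):
    # Mean squared difference between forecast probabilities and
    # binary outcomes, as in Eq. (1).
    p, x = np.asarray(p, dtype=float), np.asarray(x, dtype=float)
    return np.mean((p - x) ** 2)

def brier_skill_score(p, x):
    # Skill relative to a constant forecast of the sample climatology
    # xbar, whose Brier score is the sample variance xbar * (1 - xbar).
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    return 1.0 - brier_score(p, x) / (xbar * (1.0 - xbar))

forecasts = [0.9, 0.7, 0.2, 0.1, 0.6]
outcomes  = [1,   1,   0,   0,   1]
print(brier_score(forecasts, outcomes))        # 0.062
print(brier_skill_score(forecasts, outcomes))  # about 0.742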
b. Proper scores and hedging
The concept of proper scores dates back at least as far as Winkler and Murphy (1968). Suppose that, as above, a forecaster makes forecasts {p1, p2, . . . , pn}. On these n forecast occasions, the forecaster’s true beliefs of the probability of the event are {q1, q2, . . . , qn}. If any qi is different from pi, then the forecaster is said to be hedging the forecast (Murphy and Epstein 1967). Hedging is beneficial to the forecaster if using it improves the expected (long-term average) value of the score that is being used to assess his or her forecasts. A score is proper if the forecaster’s expected value for the score is optimized by taking pi = qi. In other words, the forecaster cannot benefit from hedging, that is, by making a forecast other than his or her true belief. If the expected value is uniquely optimized by pi = qi, the score is strictly proper. It is generally accepted that propriety (a score being proper) is highly desirable—it is undesirable for a score to allow forecasters to benefit from hedging their forecasts.
The topic of proper scores has been discussed in some detail in both the meteorological and statistical literature (see, e.g., Winkler and Murphy 1968; Savage 1971; Gneiting and Raftery 2007). The latter authors, among others, discuss the characterization of proper scoring rules. Infinitely many such rules exist, but only a few have been described explicitly. The Brier score is by far the best known of these; the logarithmic score in Table 1 is another proper score.
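The effect of hedging can be demonstrated numerically. The sketch below (our own illustration, not code from the cited papers) computes the expected score, for a forecaster whose true belief is q = 0.3, over a grid of issued probabilities p. Under the Brier score the expectation is optimized by honesty (p = q), whereas under a linear score (see Table 1) it is optimized by hedging to an extreme:

import numpy as np

q = 0.3                            # forecaster's true belief
p = np.linspace(0.0, 1.0, 101)     # candidate issued probabilities

# Expected loss of issuing p when the event occurs with probability q:
# the event contributes with weight q, the non-event with weight 1 - q.
expected_brier  = (1 - q) * p**2 + q * (1 - p)**2   # quadratic losses
expected_linear = (1 - q) * p + q * (1 - p)         # absolute losses

print(p[np.argmin(expected_brier)])    # 0.3: honest forecast is optimal
print(p[np.argmin(expected_linear)])   # 0.0: hedging to an extreme pays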


c. Equitable scores
Another widely known desirable property is equitability. This was first defined by Gandin and Murphy (1992), though the concept was used many years earlier in devising a verification measure for forecasts of categorical events (Jolliffe and Foord 1975). It is based on the idea that all unskillful forecasts of a certain type should have the same expected score. Specifically, constant forecasts (pi = p, i = 1, 2, . . . , n) are clearly unskillful, as are forecasts drawn at random from some distribution on the interval [0, 1] (constant forecasts are the special case in which all the probability in the interval is concentrated at a single point). It would seem odd if two different forecasting strategies from this class had different expected scores, as all such strategies are "equally unskillful." Hence equitability is considered desirable.
3. Nonequitability of the Brier skill score
None of the proper scores that have been suggested in the literature is equitable (see section 4). In particular, with climatology as the reference forecast, Bs can be negative for other unskillful forecasting strategies that could themselves have served as the reference. This is undesirable in its own right, but from a practical point of view it also means that forecasting systems with genuine skill may have values of Bs close to zero and hence look distinctly unimpressive. This is referred to in the literature as a negative bias in Bs. Both Mason (2004) and Müller et al. (2005) have suggested modifications of Bs to alleviate or circumvent this negative bias.
Müller et al. (2005) discuss the situation in which an ensemble of m forecasts is generated and the probability forecast for the event of interest is the proportion of ensemble members for which the event occurs. They consider the ranked probability skill score (RPSS), which extends the Brier skill score to more than two categories. To reduce the negative bias in this skill score, Müller et al. (2005) replace the usual reference score by one based on resampling of the climatology. This turns out to be equivalent to using a reference forecasting strategy in which a binomial random variable is generated with m trials and probability of success θ, where m is the number of ensemble members and θ the climatological probability or base rate (Weigel et al. 2007; see also Ferro et al. 2008). This alleviates the problem of negative bias of the RPSS (and, as a special case, the Brier skill score) for ensemble forecasts, although, as will be reported elsewhere, there may be advantages in slightly adjusting the probability of success in the reference forecast away from θ.
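For illustration, a minimal sketch of this reference strategy (our own rendering of the construction described by Weigel et al. 2007, not their code) is:

import numpy as np

rng = np.random.default_rng(0)

def binomial_reference(n_occasions, m, theta):
    # Unskillful reference forecasts: for each occasion, the proportion
    # of successes in m independent trials with probability theta,
    # mimicking an m-member ensemble drawn from climatology.
    return rng.binomial(m, theta, size=n_occasions) / m

# e.g., reference forecasts for 10 occasions, a 9-member ensemble,
# and a base rate of 0.3
print(binomial_reference(10, m=9, theta=0.3))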
Mason (2004) proposes a different approach. One of his objectives is to ensure that all unskillful forecasts have a nonnegative expected score. This addresses the problem of negative bias, though it now has the opposite problem that positive values of the score do not necessarily imply skill. His reference forecast is different from those proposed previously in that it depends on the forecasts themselves, so the reference forecast changes as the set of forecasts being assessed changes. By allowing this dependence on forecasts, Mason (2004) derives a variant of the Brier skill score that he claims is equitable.
4. A no-go theorem for propriety and equitability
Consider scores that can be written in the additive form

  S = (1/n) Σ_{i=1}^{n} s(p_i, x_i),   (2)

where s(p, x) is the loss incurred when probability p is forecast and x is observed. Because x is binary, the loss is determined by two functions of p alone: S0(p) = s(p, 0), the loss when the event does not occur, and S1(p) = s(p, 1), the loss when it does. Suppose, without loss of generality, that the score is negatively oriented, so that smaller values are better.

Most scores are additive, since one generally assumes that the two loss functions at any particular time do not depend on the values of x and p at the other times. However, this is not the case for skill scores in which the reference forecast depends on the other values of either x or p. For example, the Brier skill score based on climatological mean forecasts has a denominator that depends on the sample variance x̄(1 − x̄) of all n observations.

Now suppose that the event occurs with probability q, independently on each occasion, and that the constant forecast p is issued. The expected score is then

  S(p, q) = (1 − q)S0(p) + qS1(p).   (3)

For the score to be equitable, all constant forecasts must achieve the same expected score when q equals the base rate θ; that is, S(p, θ) must not depend on p. Note that if S(p, θ) is the same for all constant forecasts p, then any strategy that draws forecasts at random from a distribution on [0, 1] also attains this common value, so it suffices to consider constant forecasts. Differentiating (3) with respect to p, equitability requires

  (1 − θ)S0′(p) + θS1′(p) = 0, that is, S0′(p)/S1′(p) = −θ/(1 − θ), for all p.   (4)

For the score to be proper, a forecaster whose true belief is q must not benefit from issuing any p ≠ q; that is, S(p, q) must be minimized at p = q for every q, so that ∂S(p, q)/∂p must vanish at p = q:

  (1 − q)S0′(q) + qS1′(q) = 0, that is, S0′(q)/S1′(q) = −q/(1 − q), for all q.   (5)

Conditions (4) and (5) are incompatible: equitability requires the ratio of derivatives S0′/S1′ to take the same constant value −θ/(1 − θ) for every forecast probability, whereas propriety requires it to vary as −p/(1 − p). Hence no score of the additive form (2) can be both proper and equitable.
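The incompatibility can also be checked symbolically. The sketch below (ours, for illustration) computes the ratio of derivatives for the Brier score losses and confirms that it varies with p, so no single constant −θ/(1 − θ) can satisfy the equitability condition (4):

import sympy as sp

p = sp.symbols('p')

# Brier losses: S0 when the event does not occur, S1 when it does.
S0 = p**2
S1 = (1 - p)**2

ratio = sp.simplify(sp.diff(S0, p) / sp.diff(S1, p))
print(ratio)                            # p/(p - 1), i.e., -p/(1 - p)
print(sp.simplify(sp.diff(ratio, p)))   # -1/(p - 1)**2, never zero,
                                        # so the ratio is not constant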
a. Examples
Table 1 and Figs. 1–3 illustrate the ideas of propriety and equitability for three scores: a linear score, a logarithmic score, and the Brier (quadratic) score.
The first column of the table presents the definition of each score, given a set of n forecasts {p1, p2, . . . , pn} and corresponding observations {x1, x2, . . . , xn}. The next three columns give the form of S0(p), S1(p), and the ratio of derivatives S0′/S1′ for each measure. The figures show the form of S(p, q) for the three scores and illustrate geometrically the conditions for equitability [S(p, θ) constant for all p] and propriety [S(p, q) minimized at p = q].
Figure 1 displays S(p, q) for the linear score, for which S0(p) = p and S1(p) = (1 − p). Here the function is constant for all p when q = θ = ½. Hence, for this value of the base rate, but for no other, the linear score is equitable. For any value of q, the minimum value of S(p, q) is achieved when p is zero or unity, depending on whether q is less than or greater than ½. Thus it is always advantageous to hedge forecasts to 0 or 1, and the score is not proper.
Figure 2 shows S(p, q) for the Brier score, for which S0(p) = p² and S1(p) = (1 − p)². Here there is no value of θ for which S(p, θ) is constant for all p, so the score is never equitable. However, for every value of q the minimum of S(p, q) is achieved on the 45° line in Fig. 2, and only on that line, so the score is strictly proper. Although Fig. 3, for the logarithmic score with S0(p) = −log(1 − p) and S1(p) = −log(p), looks less simple than Fig. 2, its underlying structure is the same, illustrating propriety and nonequitability.
For both Figs. 2 and 3, if the figures are regarded as contours on a topographic map, then walking from west to east is downhill until the 45° line and uphill thereafter, so the expected score is minimized on the diagonal p = q, demonstrating propriety. No west–east traverse is flat, so equitability is impossible. In Fig. 1, by contrast, the west–east traverse is flat for q = 0.5, demonstrating equitability for that base rate. For any other value of q, the traverse is either all uphill (for q < 0.5) or all downhill (for q > 0.5), so the minimum is attained at an endpoint rather than at p = q, ruling out propriety.
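These traverses are easy to verify numerically; the sketch below (ours) evaluates Eq. (3) for the linear score and shows a flat west–east traverse at q = 0.5 and a monotone one elsewhere:

import numpy as np

def S_linear(p, q):
    # Eq. (3) with the linear losses S0(p) = p and S1(p) = 1 - p.
    return (1 - q) * p + q * (1 - p)

p = np.linspace(0.0, 1.0, 5)
print(S_linear(p, 0.5))   # [0.5 0.5 0.5 0.5 0.5]: flat, the equitable case
print(S_linear(p, 0.3))   # [0.3 0.4 0.5 0.6 0.7]: all uphill, minimum at p = 0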


Many well-known verification measures fall into this class, as demonstrated by Table 1, but their skill score versions often do not. This is because, as discussed above, the transformation used to convert a measure into a skill score often involves all the data, as when sample climatology is used as a reference forecast.
Our intuition is that propriety and equitability are incompatible for all verification measures, including such skill scores, and this view is reinforced by the work described in section 3. However, demonstration of the wider result awaits further research.
5. Discussion
It would be ideal to have a verification score that is both proper and equitable, but the previous section shows, for a large class of measures, that this is impossible for probability forecasts of a binary event. Both properties have their appeal, but our view is that propriety is the more fundamental requirement. The implication is that equitability is a largely irrelevant property for probabilistic forecasts. However, it should not be forgotten that, unlike propriety, equitability can also be defined for deterministic forecasts and is highly relevant for such forecasts. Indeed, the idea of equitability was first introduced for deterministic categorical forecasts (Gandin and Murphy 1992) and later, perhaps mistakenly, adopted for probabilistic forecasts.
The loss of equitability for probability forecasts has important implications for how one judges whether a forecasting system has skill. Nonequitability means that different unskillful forecasts can give different scores, so there is no unique absolute benchmark against which to measure skill. For example, a forecasting system may have negative skill when compared with a constant climatological probability forecast and yet positive skill when compared with random probability forecasts (Mason 2004; Müller et al. 2005; Weigel et al. 2007). How should one then decide whether the forecasting system really has skill? Perhaps the most rational approach is to demand that its score beats the best score attainable by any unskillful forecast. To do this, one needs to think carefully about how an unskillful forecaster could best optimize the score. For scores whose loss functions S0(p) and S1(p) are convex in p, the best unskillful forecasts are those that issue constant probability values, albeit not necessarily the base rate. Alternatively, one can avoid some of these difficulties by eschewing any mention of the word "skill" and simply presenting scores for different forecasting systems: it is still possible to order forecasting systems by pairwise comparison of their scores.
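As a worked illustration of the convexity argument (our example, using the Brier score): from Eq. (3), the expected score of a constant forecast p when the base rate is θ is

  S(p, θ) = (1 − θ)p² + θ(1 − p)²,

which is convex in p, with derivative 2(1 − θ)p − 2θ(1 − p) vanishing at p = θ. For this score the best unskillful constant forecast is therefore the base rate itself, with expected score S(θ, θ) = θ(1 − θ), and a forecasting system should better this value before any claim of skill against constant forecasts is made.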
Acknowledgments
We are grateful to an anonymous reviewer, whose comments led us to substantially rethink some of our views. Much of the work leading to this paper was done while ITJ was funded by a NERC Environmental Mathematics and Statistics Discipline Bridging Award (NER/T/S/2003/00126).
REFERENCES
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78 , 1–3.
Ferro, C. A. T., D. S. Richardson, and A. P. Weigel, 2008: On the effect of ensemble size on the discrete and continuous ranked probability scores. Meteor. Appl., in press.
Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. Mon. Wea. Rev., 120 , 361–370.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102 , 359–378.
Jolliffe, I. T., and J. F. Foord, 1975: Assessment of long-range forecasts. Weather, 30 , 172–181.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 254 pp.
Mason, I. B., 2003: Binary events. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 37–76.
Mason, S. J., 2004: On using “climatology” as a reference strategy in the Brier and ranked probability skill scores. Mon. Wea. Rev., 132 , 1891–1895.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18 , 1513–1523.
Murphy, A. H., 1973: Hedging and skill scores for probability forecasts. J. Appl. Meteor., 12 , 215–223.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8 , 281–293.
Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. Wea. Forecasting, 11 , 3–20.
Murphy, A. H., and E. S. Epstein, 1967: A note on probability forecasts and “hedging”. J. Appl. Meteor., 6 , 1002–1004.
Savage, L. J., 1971: Elicitation of personal probabilities and expectations. J. Amer. Stat. Assoc., 66 , 783–801.
Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 137–163.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. Mon. Wea. Rev., 135 , 118–124.
Winkler, R. L., and A. H. Murphy, 1968: “Good” probability assessors. J. Appl. Meteor., 7 , 751–758.
Fig. 1. Contour plot of S(p, q) in Eq. (3) for the linear score.

Fig. 2. As in Fig. 1, but for the Brier score.

Fig. 3. As in Fig. 1, but for the logarithmic score.
Table 1. Three scores for probability forecasts and associated functions of the scores.

Score          Definition                                              S0(p)         S1(p)        S0′/S1′
Linear         S = (1/n) Σ |p_i − x_i|                                 p             1 − p        −1
Brier          S = (1/n) Σ (p_i − x_i)²                                p²            (1 − p)²     −p/(1 − p)
Logarithmic    S = −(1/n) Σ [x_i log p_i + (1 − x_i) log(1 − p_i)]     −log(1 − p)   −log p       −p/(1 − p)