The Brier score is a quadratic measure of error in probabilistic forecasts (Brier 1950). Although the Brier score can be used in multievent situations, it is most commonly used in a dichotomous situation in which an event of interest either occurs or does not occur (Toth et al. 2003). The ranked probability score is a closely related measure that generalizes the Brier score to a multievent situation in which the events can be ordered (Epstein 1969; Murphy 1969, 1971). Both scores measure the accuracy of the forecast in terms of the probability (or probabilities, in the case of the ranked probability score) assigned (Murphy 1993). The scores are widely expressed as skill scores, which measure the extent to which a forecast strategy outperforms a (usually simpler) reference forecast strategy. The most widely used reference strategy is “climatology,” in which the climatological probability (or probabilities) of the forecast variable is issued perpetually. Skill scores on both measures are widely reported to be low compared to other performance indicators (Wilks 1995), and so these skill scores are often considered harsh standards. While low scores can partly be attributed to sampling errors in the forecast probabilities, most notably when ensemble sizes are small (Kumar et al. 2001), a more fundamental reason for the often negative skill indicated by the scores is detailed in this note. The following discussion refers only to the Brier score, but the conclusions can easily be generalized to the ranked probability skill score.
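For concreteness, the score and skill score can be sketched in now-standard notation (the symbols $N$, $f_i$, and $o_i$ are assumptions of this sketch, not the note's own equation numbering):

$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2, \qquad \mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}},$$

where $f_i$ is the probability assigned to the event on occasion $i$, $o_i = 1$ if the event occurred and $o_i = 0$ otherwise, and $\mathrm{BS}_{\mathrm{ref}}$ is the score achieved by the reference strategy. Perpetual forecasts of the (in-sample) climatological probability $\bar{o}$ give $\mathrm{BS}_{\mathrm{ref}} = \bar{o}\left(1 - \bar{o}\right)$.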


The components of the Brier score can be illustrated on the attributes diagram (Hsu and Murphy 1986). In Fig. 1 an example is presented in which the thick line represents the empirical curve for an arbitrary set of forecasts for an event with a climatological probability of 0.3. Each point on the curve is defined by the coordinates (f_k, ō_k), where f_k is one of the issued forecast probabilities and ō_k is the relative frequency with which the event occurred when f_k was forecast.
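The geometry discussed below rests on Murphy's (1973) partition of the Brier score into reliability, resolution, and uncertainty terms; a sketch in the same notation (the symbols $K$ for the number of distinct issued probabilities and $N_k$ for the number of times $f_k$ was issued are assumptions of this sketch):

$$\mathrm{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} N_k\left(f_k - \bar{o}_k\right)^2}_{\mathrm{REL}} \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} N_k\left(\bar{o}_k - \bar{o}\right)^2}_{\mathrm{RES}} \;+\; \underbrace{\bar{o}\left(1 - \bar{o}\right)}_{\mathrm{UNC}},$$

so that the climatology-referenced skill score reduces to $\mathrm{BSS}_{\mathrm{clim}} = (\mathrm{RES} - \mathrm{REL})/\mathrm{UNC}$: skill is positive only where resolution exceeds reliability.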







That the Brier skill score with climatology as the reference strategy is a strictly proper scoring rule [Eq. (9)] is not inconsistent with the fact that the expected value of the score can be optimized by repeatedly issuing the climatological probability [Eq. (8)]. Equation (8) applies only in the absence of any reason for expecting the forecast event to be more or less likely than usual. In this instance the forecaster should issue the climatological probability as the forecast in preference to any other strategy (such as assigning all probability to one specific outcome, or randomly assigning probabilities). In contrast, a number of other skill scores have an expected score of 0 for all naïve forecast strategies, and the forecaster is effectively free to choose any of these strategies. These scores have the property of equitability (Gandin and Murphy 1992; Mason 2003), which BSSclim lacks. However, in the specific context of the Brier skill score, the lack of equitability may be a desirable feature: a nonclimatological forecast should imply that the forecaster believes that the probability of the event is different from normal, and where that implied belief is unfounded the forecaster is penalized by the Brier score. Specifically, it can be shown that perpetual forecasts of nonclimatological values give a Brier skill score that is equal to the negative of the squared departure from the climatological probability divided by the uncertainty [Eq. (8)].
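The algebra behind that closing claim is short; a sketch, with $f$ denoting the perpetually issued nonclimatological probability (the symbol $f$ is an assumption of this sketch, not the note's Eq. (8) itself). Using $o_i^2 = o_i$ for binary outcomes,

$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(f - o_i\right)^2 = f^2 - 2f\bar{o} + \bar{o} = \left(f - \bar{o}\right)^2 + \bar{o}\left(1 - \bar{o}\right),$$

so that $\mathrm{BSS}_{\mathrm{clim}} = 1 - \mathrm{BS}/\mathrm{UNC} = -\left(f - \bar{o}\right)^2/\mathrm{UNC}$, which is zero only for $f = \bar{o}$ and negative for every other perpetual strategy.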


Since SHP + UNC > 0, skill is indicated whenever SHP + RES > REL. The area of skill, as defined by Eq. (11), can be indicated on the attributes diagram, but first the distance represented by the sharpness term needs to be identified. From Eq. (10), the sharpness term is represented by the squared horizontal distance between the empirical curve and the climatological probability. For an arbitrary point on the curve, the contribution to the sharpness term is therefore equivalent to the square of the distance between C and D. If the distance (CD)², and hence (AF)², is equal to the sharpness term, and (AC)² to the resolution, then the distance (CF)² equals SHP + RES. With reliability represented by the distance (CE)², skill is indicated wherever CF > CE, which is true for all points below the no-resolution line to the left of the climatological probability and above the no-resolution line to the right (light and dark shaded areas of Fig. 1). In effect, therefore, skill is indicated relative to random guessing whenever the slope of the reliability curve is positive. That a positively sloping reliability curve indicates positive skill is intuitively appealing, since it indicates that the probability of the event occurring does increase (by however small an amount) as the forecast probability increases.
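The same result follows algebraically, without the diagram; a sketch, assuming the sharpness term takes the form implied by the horizontal-distance interpretation above, $\mathrm{SHP} = \frac{1}{N}\sum_k N_k\left(f_k - \bar{o}\right)^2$:

$$\mathrm{SHP} + \mathrm{RES} - \mathrm{REL} = \frac{1}{N}\sum_{k} N_k\left[\left(f_k - \bar{o}\right)^2 + \left(\bar{o}_k - \bar{o}\right)^2 - \left(f_k - \bar{o}_k\right)^2\right] = \frac{2}{N}\sum_{k} N_k\left(f_k - \bar{o}\right)\left(\bar{o}_k - \bar{o}\right).$$

Each contribution is positive exactly when $f_k$ and $\bar{o}_k$ lie on the same side of the climatological probability, that is, when the empirical curve sits below the no-resolution line to the left of $\bar{o}$ and above it to the right, which is the shaded region of Fig. 1.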



In conclusion, it is recommended that the Brier skill score with climatology as the reference forecast strategy [Eq. (5)] not be used as a lone measure of forecast skill, because negative skill scores may hide the fact that the forecast system contains useful information, especially if the sharpness of the forecasts is high. A similar recommendation can be made for the ranked probability skill score because of its simple relationship to the Brier skill score. This weakness of the widely used version of the Brier skill score is a side effect of its lack of equitability. Although from some perspectives equitability is likely to be an undesirable feature of a probabilistic scoring rule, the downside of its absence here is that the information content of nonclimatological forecast probabilities may be discarded under the somewhat arbitrary condition of resolution being less than reliability. As an alternative reference strategy, the possibility was considered of replacing climatological probabilities with randomly assigned probabilities having the same marginal distribution as the forecast probabilities under consideration [Eq. (11)]. By explicitly accounting for the sharpness of the probabilities issued, this skill score has the desirable feature of an expected score of 0 when nonclimatological forecast probabilities are issued, but at the loss of propriety. Regardless of which measures are used, this note has highlighted the need to consider a set of scoring measures because of inherent weaknesses in any single measure of forecast performance.
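As a practical illustration of that recommendation, the following minimal sketch computes the Brier score components and both skill scores from a set of binary forecasts; the helper name and its interface are hypothetical, not from this note, and the two skill scores follow the Eq. (5)- and Eq. (11)-style definitions discussed above.

```python
import numpy as np

def brier_skill_scores(f, o):
    """Murphy (1973) components of the Brier score plus the two skill
    scores discussed above. Hypothetical helper for illustration only.

    f : array of issued forecast probabilities
    o : array of binary outcomes (1 = event occurred, 0 = it did not)
    """
    f = np.asarray(f, dtype=float)
    o = np.asarray(o, dtype=float)
    obar = o.mean()                          # climatological probability
    rel = res = shp = 0.0
    for fk in np.unique(f):                  # bin forecasts by issued probability
        mask = f == fk
        ok = o[mask].mean()                  # conditional relative frequency
        w = mask.mean()                      # weight N_k / N
        rel += w * (fk - ok) ** 2            # reliability (REL)
        res += w * (ok - obar) ** 2          # resolution (RES)
        shp += w * (fk - obar) ** 2          # sharpness (SHP)
    unc = obar * (1.0 - obar)                # uncertainty (UNC); 0 if obar is 0 or 1
    return {
        "BS": rel - res + unc,               # Brier score
        "BSS_clim": (res - rel) / unc,       # climatology as reference [Eq. (5) style]
        "BSS_random": (shp + res - rel) / (shp + unc),  # random reference [Eq. (11) style]
    }
```

Consistent with the discussion above, the sketch returns BSS_clim = 0 for perpetual forecasts of the climatological probability and the negative value derived earlier for any perpetual nonclimatological probability, while BSS_random is 0 for both naïve strategies.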
This note was funded by Cooperative Agreement NA07GP0213 from the National Oceanic and Atmospheric Administration (NOAA). The views expressed herein are those of the author and do not necessarily reflect the views of NOAA or any of its subagencies. The comments of L. Goddard, A. G. Barnston, and anonymous referees are gratefully acknowledged, as are valuable discussions with F. J. Doblas-Reyes, L. Ferranti, A. Ghelli, R. Hagedorn, F. Vitart, and M. S. J. Harrison.
REFERENCES
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.
Gandin, L. S., and A. H. Murphy, 1992: Equitable scores for categorical forecasts. Mon. Wea. Rev., 120, 361–370.
Gerrity, J. P., 1992: A note on Gandin and Murphy's equitable skill score. Mon. Wea. Rev., 120, 2707–2712.
Hsu, W.-R., and A. H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293.
Jolliffe, I. T., and D. B. Stephenson, 2003: Introduction. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 1–12.
Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. J. Climate, 14, 1671–1676.
Livezey, R. E., 2003: Categorical events. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 77–96.
Mason, I. T., 2003: Binary events. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 37–76.
Murphy, A. H., 1966: A note on the use of probabilistic predictions and the probability score in the cost-loss ratio decision situation. J. Appl. Meteor., 5, 534–537.
Murphy, A. H., 1969: On the “ranked probability score.” J. Appl. Meteor., 8, 988–989.
Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.
Murphy, A. H., and E. S. Epstein, 1967: A note on probability forecasts and “hedging.” J. Appl. Meteor., 6, 1002–1004.
Murphy, A. H., and R. L. Winkler, 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, 435–455.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.
Potts, J. M., C. K. Folland, I. T. Jolliffe, and D. Sexton, 1996: Revised “LEPS” scores for assessing climate model simulations and long-range forecasts. J. Climate, 9, 34–53.
Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 137–163.
Ward, N. M., and C. K. Folland, 1991: Prediction of seasonal rainfall in the north Nordeste of Brazil using eigenvectors of sea surface temperatures. Int. J. Climatol., 11, 711–743.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Fig. 1. Attributes diagram showing the areas of skill compared to forecasts of climatology (dark shading) and additional areas of skill compared to random guessing (light shading). The prior probability of the event is 0.3.
