• Baringhaus, L., and C. Franz, 2004: On a new multivariate two-sample test. J. Multivariate Anal., 88, 190–206, https://doi.org/10.1016/S0047-259X(03)00079-4.

• Bernardo, J. M., 1979: Expected information as expected utility. Ann. Stat., 7, 686–690, https://doi.org/10.1214/aos/1176344689.

• Bernardo, J. M., and A. F. M. Smith, 2000: Bayesian Theory. Wiley, 608 pp.

• Boero, G., J. Smith, and K. F. Wallis, 2011: Scoring rules and survey density forecasts. Int. J. Forecasting, 27, 379–393, https://doi.org/10.1016/j.ijforecast.2010.04.003.

• Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

• Bröcker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382–388, https://doi.org/10.1175/WAF966.1.

• Brown, T. A., 1970: Probabilistic forecasts and reproducing scoring systems. Tech. Rep. RM-6299-ARPA, RAND Corporation, 65 pp.

• Brown, T. A., 1974: Admissible scoring systems for continuous distributions. Manuscript P-5235, RAND Corporation, 24 pp.

• Du, H., and L. A. Smith, 2012: Parameter estimation using ignorance. Phys. Rev. E, 86, 016213, https://doi.org/10.1103/PhysRevE.86.016213.

• Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987, https://doi.org/10.1175/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2.

• Fricker, T. E., C. A. T. Ferro, and D. B. Stephenson, 2013: Three recommendations for evaluating climate prediction. Meteor. Appl., 20, 246–255, https://doi.org/10.1002/met.1409.

• Frigg, R., S. Bradley, H. Du, and L. Smith, 2014: Laplace’s demon and the adventures of his apprentices. Philos. Sci., 81, 31–59, https://doi.org/10.1086/674416.

• Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.

• Goddard, L., and Coauthors, 2013: A verification framework for interannual-to-decadal predictions experiments. Climate Dyn., 40, 245–272, https://doi.org/10.1007/s00382-012-1481-2.

• Good, I. J., 1952: Rational decisions. J. Roy. Stat. Soc., 14A, 107–114.

• Good, I. J., 1971: Comment on “Measuring information and uncertainty” by Robert J. Buehler. Foundations of Statistical Inference, Holt, Rinehart and Winston, 337–339.

• Hagedorn, R., and L. Smith, 2009: Communicating the value of probabilistic forecasts with weather roulette. Meteor. Appl., 16, 143–155, https://doi.org/10.1002/met.92.

• Jaynes, E. T., 2003: Probability Theory: The Logic of Science. Cambridge University Press, 727 pp.

• Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 254 pp.

• Jose, V. R. R., R. F. Nau, and R. L. Winkler, 2008: Scoring rules, generalized entropy, and utility maximization. Oper. Res., 56, 1146–1157, https://doi.org/10.1287/opre.1070.0498.

• Kelly, J. L., 1956: A new interpretation of information rate. Bell Syst. Tech. J., 35, 917–926, https://doi.org/10.1002/j.1538-7305.1956.tb03809.x.

• Kohonen, J., and J. Suomela, 2006: Lessons learned in the challenge: Making predictions and scoring them. Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, J. Quiñonero-Candela et al., Eds., Springer, 95–116.

• Kullback, S., 1959: Information Theory and Statistics. Wiley, 395 pp.

• Kullback, S., and R. A. Leibler, 1951: On information and sufficiency. Ann. Math. Stat., 22, 79–86, https://doi.org/10.1214/aoms/1177729694.

• Machete, R., 2013: Contrasting probabilistic scoring rules. J. Stat. Plan. Inference, 143, 1781–1790, https://doi.org/10.1016/j.jspi.2013.05.012.

• Machete, R., and L. Smith, 2016: Demonstrating the value of larger ensembles in forecasting physical systems. Tellus, 68A, 28393, https://doi.org/10.3402/tellusa.v68.28393.

• Mason, S. J., and A. P. Weigel, 2009: A generic forecast verification framework for administrative purposes. Mon. Wea. Rev., 137, 331–349, https://doi.org/10.1175/2008MWR2553.1.

• Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.

• Maynard, T., 2016: Extreme insurance and the dynamics of risk. Ph.D. thesis, The London School of Economics and Political Science, 416 pp.

• McSharry, P., and L. A. Smith, 1999: Better nonlinear models from noisy data: Attractors with maximum likelihood. Phys. Rev. Lett., 83, 4285–4288, https://doi.org/10.1103/PhysRevLett.83.4285.

• Murphy, A. H., 1969: On the “ranked probability score.” J. Appl. Meteor., 8, 988–989, https://doi.org/10.1175/1520-0450(1969)008<0988:OTPS>2.0.CO;2.

• Murphy, A. H., 1970: The ranked probability score and the probability score: A comparison. Mon. Wea. Rev., 98, 917–924, https://doi.org/10.1175/1520-0493(1970)098<0917:TRPSAT>2.3.CO;2.

• Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. Wea. Forecasting, 11, 3–20, https://doi.org/10.1175/1520-0434(1996)011<0003:TFAASE>2.0.CO;2.

• Murphy, A. H., and R. L. Winkler, 1970: Scoring rules in probability assessment and evaluation. Acta Psychol., 34, 273–286, https://doi.org/10.1016/0001-6918(70)90023-5.

• Roby, T. B., 1965: Belief states: A preliminary empirical study. Behav. Sci., 10, 255–270.

• Roulston, M., and L. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660, https://doi.org/10.1175/1520-0493(2002)130<1653:EPFUIT>2.0.CO;2.

• Savenkov, M., 2009: On the truncated Weibull distribution and its usefulness in evaluating the theoretical capacity factor of potential wind (or wave) energy sites. Univ. J. Eng. Technol., 1, 21–25.

• Scheuerer, M., 2014: Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Quart. J. Roy. Meteor. Soc., 140, 1086–1096, https://doi.org/10.1002/qj.2183.

• Selten, R., 1998: Axiomatic characterization of the quadratic scoring rule. Exp. Econ., 1, 43–61, https://doi.org/10.1023/A:1009957816843.

• Shuford, E. H., Jr., A. Albert, and H. E. Massengill, 1966: Admissible probability measurement procedures. Psychometrika, 31, 125–145, https://doi.org/10.1007/BF02289503.

• Smith, L. A., E. B. Suckling, E. L. Thompson, T. Maynard, and H. Du, 2015: Towards improving the framework for probabilistic forecast evaluation. Climatic Change, 132, 31–45, https://doi.org/10.1007/s10584-015-1430-2.

• Staël von Holstein, C.-A. S., 1970a: A family of strictly proper scoring rules which are sensitive to distance. J. Appl. Meteor., 9, 360–364, https://doi.org/10.1175/1520-0450(1970)009<0360:AFOSPS>2.0.CO;2.

• Staël von Holstein, C.-A. S., 1970b: Measurement of subjective probability. Acta Psychol., 34, 146–159, https://doi.org/10.1016/0001-6918(70)90013-2.

• Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221–232, https://doi.org/10.1175/1520-0434(2000)015<0221:UOTORF>2.0.CO;2.

• Székely, G. J., 2003: E-statistics: The energy of statistical samples. Tech. Rep. 2003-16, Bowling Green State University, 20 pp., https://doi.org/10.13140/RG.2.1.5063.9761.

• Székely, G. J., and M. L. Rizzo, 2005: A new test for multivariate normality. J. Multivariate Anal., 93, 58–80, https://doi.org/10.1016/j.jmva.2003.12.002.

• Toda, M., 1963: Measurement of subjective probability distributions. Tech. Doc. ESD-TDR-63-407, Decision Sciences Laboratory, Electronic Systems Division, Air Force Systems Command, Vol. 86, 42 pp.

• Tödter, J., and B. Ahrens, 2012: Generalization of the ignorance score: Continuous ranked version and its decomposition. Mon. Wea. Rev., 140, 2005–2017, https://doi.org/10.1175/MWR-D-11-00266.1.

• Unger, D. A., 1995: A method to estimate the continuous ranked probability score. Proc. Ninth Conf. on Probability and Statistics, Boston, MA, Amer. Meteor. Soc., 206–213.

• Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International Geophysics Series, Vol. 59, Elsevier, 467 pp.

• Winkler, R. L., 1969: Scoring rules and the evaluation of probability assessors. J. Amer. Stat. Assoc., 64, 1073–1078, https://doi.org/10.1080/01621459.1969.10501037.

• Winkler, R. L., 1996: Scoring rules and the evaluation of probabilities. Test, 5, 1–60, https://doi.org/10.1007/BF02562681.

• Winkler, R. L., and A. H. Murphy, 1968: “Good” probability assessors. J. Appl. Meteor., 7, 751–758, https://doi.org/10.1175/1520-0450(1968)007<0751:PA>2.0.CO;2.

• Zhang, Y., J. Wang, and X. Wang, 2014: Review on probabilistic forecasting of wind power generation. Renew. Sustain. Energy Rev., 32, 255–270, https://doi.org/10.1016/j.rser.2014.01.033.

Beyond Strictly Proper Scoring Rules: The Importance of Being Local

  • 1 Department of Mathematical Sciences, Durham University, Durham, United Kingdom
  • 2 Centre for the Analysis of Time Series, London School of Economics, London, United Kingdom

Abstract

The evaluation of probabilistic forecasts plays a central role in the interpretation, use, and development of forecast systems. Probabilistic scores (scoring rules) provide statistical measures of the quality of probabilistic forecasts. Many probabilistic forecast systems are often available, yet evaluations of their performance are not standardized, with different scoring rules used to measure different aspects of forecast performance. Even when the discussion is restricted to strictly proper scoring rules, considerable variability remains between them; indeed, strictly proper scoring rules need not rank competing forecast systems in the same order when none of those systems is perfect. The locality property is explored to further distinguish scoring rules. The nonlocal strictly proper scoring rules considered are shown to have a property that can produce “unfortunate” evaluations; in particular, the continuous ranked probability score favors outcomes close to the median of the forecast distribution regardless of the probability mass assigned at or near the median, which raises concern about its use. The only local strictly proper scoring rule, the logarithmic score, has direct interpretations in terms of probabilities and bits of information; the nonlocal strictly proper scoring rules, in contrast, lack a meaningful direct interpretation for decision support. The logarithmic score is also shown to be invariant under smooth transformations of the forecast variable, whereas the nonlocal strictly proper scoring rules considered may change their preferences under such transformations. It is therefore suggested that the logarithmic score always be included in the evaluation of probabilistic forecasts.
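Two of the abstract’s claims can be checked numerically. The sketch below is illustrative only (the ensemble, forecast parameters, and outcome values are invented, not taken from the paper). It uses the standard energy-form ensemble estimator of the continuous ranked probability score (see Gneiting and Raftery 2007) to show that an outcome falling in a zero-probability gap between two modes scores as well as an outcome at a mode, and then shows that the difference in logarithmic (ignorance) scores between two forecasts (Roulston and Smith 2002) is unchanged by a smooth monotonic transformation of the variable, because the Jacobian term is common to both forecasts and cancels.

```python
import math

def crps_ensemble(members, x):
    # Energy-form CRPS estimator: E|X - x| - 0.5 E|X - X'|.
    n = len(members)
    t1 = sum(abs(m - x) for m in members) / n
    t2 = sum(abs(a - b) for a in members for b in members) / (n * n)
    return t1 - 0.5 * t2

# Bimodal ensemble with no probability mass near its median (hypothetical values).
ens = [0.0] * 4 + [10.0] * 4

# CRPS scores an outcome at the empty midpoint (x = 5) exactly as well as an
# outcome at a mode (x = 0), despite the forecast assigning no mass there.
assert crps_ensemble(ens, 5.0) == crps_ensemble(ens, 0.0)

def log_score(pdf, x):
    # Ignorance score: negative log (base 2) of the forecast density at the outcome.
    return -math.log2(pdf(x))

def gauss_pdf(mu, sigma):
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

f, g = gauss_pdf(0.0, 1.0), gauss_pdf(0.5, 2.0)  # two competing forecasts
x = 0.3                                          # verifying outcome

diff_x = log_score(f, x) - log_score(g, x)

# Smooth monotonic transformation y = exp(x): each density picks up the same
# Jacobian factor 1/y, which cancels when the two forecasts are compared.
fy = lambda y: f(math.log(y)) / y
gy = lambda y: g(math.log(y)) / y
diff_y = log_score(fy, math.exp(x)) - log_score(gy, math.exp(x))

assert abs(diff_x - diff_y) < 1e-9
```

The same cancellation does not occur for the CRPS, whose integral changes shape under the transformation, which is the abstract’s point about nonlocal scores changing their preferences.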

© 2021 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Hailiang Du, hailiang.du@durham.ac.uk
