
Comparing Forecast Skill

  • 1 George Mason University, Fairfax, Virginia, and Center for Ocean–Land–Atmosphere Studies, Calverton, Maryland
  • 2 Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, and Center of Excellence for Climate Change Research, Department of Meteorology, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract

A basic question in forecasting is whether one prediction system is more skillful than another. Some commonly used statistical significance tests cannot answer this question correctly if the skills are computed on a common period or using a common set of observations, because these tests do not account for correlations between sample skill estimates. Furthermore, the results of these tests are biased toward indicating no difference in skill, a fact that has important consequences for forecast improvement. This paper shows that the magnitude of bias is characterized by a few parameters such as sample size and correlation between forecasts and their errors, which, surprisingly, can be estimated from data. The bias is substantial for typical seasonal forecasts, implying that familiar tests may wrongly judge that differences in seasonal forecast skill are insignificant. Four tests that are appropriate for assessing differences in skill over a common period are reviewed. These tests are based on the sign test, the Wilcoxon signed-rank test, the Morgan–Granger–Newbold test, and a permutation test. These techniques are applied to ENSO hindcasts from the North American Multimodel Ensemble and reveal that the Climate Forecast System, version 2, and the Canadian Climate Model, version 3 (CanCM3), outperform other models in the sense that their squared error is less than that of other single models more frequently. It should be recognized that while certain models may be superior in a certain sense for a particular period and variable, combinations of forecasts are often significantly more skillful than a single model alone. In fact, the multimodel mean significantly outperforms all single models.
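The four tests listed above all operate on paired forecast errors computed over a common verification period. The sketch below is a rough illustration of that workflow, not the authors' implementation: it assumes two hindcast series and the verifying observations are available as aligned 1-D NumPy arrays, and the function name, random seed, and permutation count are chosen only for the example.

```python
# Minimal sketch (not the paper's code) of paired tests for equal squared-error
# skill of two forecast systems verified against the same observations.
# Assumes f1, f2, obs are aligned 1-D NumPy arrays; names are illustrative.
import numpy as np
from scipy import stats

def compare_skill(f1, f2, obs, n_perm=5000, seed=0):
    e1, e2 = f1 - obs, f2 - obs      # forecast errors of the two systems
    d = e1**2 - e2**2                # paired difference of squared errors

    # Sign test: under equal skill, system 1 has the larger squared error
    # in about half the cases.
    wins = int(np.sum(d > 0))
    sign_p = stats.binomtest(wins, n=d.size, p=0.5).pvalue

    # Wilcoxon signed-rank test on the paired differences.
    wilcoxon_p = stats.wilcoxon(d).pvalue

    # Morgan-Granger-Newbold test: equal mean squared error implies zero
    # correlation between the sum and the difference of the error series.
    _, mgn_p = stats.pearsonr(e1 + e2, e1 - e2)

    # Permutation (sign-flip) test on the mean difference of squared errors.
    rng = np.random.default_rng(seed)
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((flips * d).mean(axis=1))
    perm_p = float(np.mean(null >= abs(d.mean())))

    return {"sign": sign_p, "wilcoxon": wilcoxon_p,
            "MGN": mgn_p, "permutation": perm_p}
```

Each returned p-value addresses the null hypothesis of equal squared-error skill on the common period; the Morgan–Granger–Newbold entry uses the standard formulation in which equal mean squared errors imply zero correlation between the sum and difference of the two error series.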

Corresponding author address: Timothy DelSole, Center for Ocean–Land–Atmosphere Studies, 4041 Powder Mill Rd., Suite 302, Calverton, MD 20705. E-mail: delsole@cola.iges.org
