Conversations with Professor I. T. Jolliffe and Drs. C. A. S. Coelho, D. B. Stephenson, and G. J. van Oldenborgh (who provided the data), plus comments from the referees, helped to motivate and improve this work.
Bergsma, W. P., 2004: Testing conditional independence for continuous random variables. EURANDOM Tech. Rep. 2004–048, 19 pp.
Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.
DiCiccio, T. J., A. C. Monti, and G. A. Young, 2006: Variance stabilization for a scalar parameter. J. Roy. Stat. Soc., 68B, 281–303.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
Kane, T. L., and B. G. Brown, 2000: Confidence intervals for some verification measures–a survey of several methods. Preprints, 15th Conf. on Probability and Statistics, Asheville, NC, Amer. Meteor. Soc., 46–49.
Lahiri, S. N., 1993: Bootstrapping the Studentized sample mean of lattice variables. J. Mult. Anal., 45, 247–256.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18, 1513–1523.
Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 13–36.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Romano, J. P., 1988: A bootstrap revival of some nonparametric distance tests. J. Amer. Stat. Assoc., 83, 698–708.
Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of yes-no forecasts. Aust. Meteor. Mag., 45, 49–53.
Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.
Thornes, J. E., and D. B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8, 307–314.
Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104, 1209–1214.