1. Introduction
A basic question in weather and climate forecasting is whether one prediction system is more skillful than another. For instance, users want to know how different prediction systems have performed in the past, modelers want to know whether changes to a prediction system have improved its skill, and operational centers want to know where to allocate resources. Regardless of application, a universal concern is whether the observed difference in skill reflects an actual difference or is dominated by random sampling variability that is sensitive to the specific forecast samples being evaluated. This concern grows as the sample size and skill level decrease, because sampling variability increases in these cases. Thus, this concern is greatest for low-skill, infrequent forecasts such as seasonal or longer climate forecasts, or when forecast performance is stratified according to infrequent phenomena (e.g., warm ENSO events).
To address the above question, one needs to perform a significance test that quantifies the probability that the observed difference in skill could have arisen entirely from sampling variability. Standard procedures that could be used to test differences in skill include Fisher’s test for equality of correlations and the F test for equality of variance. An underlying assumption of these tests is that the skill estimates be independent. This requirement could be satisfied, for example, by deriving skill estimates from independent periods or observations.
However, it is more common (and often viewed as desirable) to compare forecast skills on a common period or set of observations. In this case, the skill estimates are not independent and the most commonly used significance tests give incorrect results. In particular, we show that the difference in skill tends to be closer to zero for dependent estimates than for independent estimates. This bias tends to mislead a user into concluding that no difference in skill exists when in fact a difference does. This point has important consequences for the broad endeavor of forecast improvement. The purpose of this paper is to quantify this fundamental difficulty and to highlight alternative tests that are more statistically justified. In fact, we show that the magnitude of the bias can be estimated from data assuming Gaussian distributions, and that this bias is large in typical seasonal forecasts. We also review more statistically justified tests based on the sign test, the Wilcoxon signed-rank test, the Morgan–Granger–Newbold test, and a permutation test, as proposed in the economics literature (Diebold and Mariano 1995) and in the weather prediction literature (Hamill 1999).
To avoid possible confusion, it is worth mentioning that there exist some common circumstances in which the significance of a difference in skill can be tested even when the skill estimates are not independent. In particular, a difference in skill can be tested if the difference is independent of the skill of one of the models. Such a situation arises when regression methods are applied to a simpler model nested within a more complex model (Scheffe 1959, section 2.9). In contrast, comparison of nonnested models generally involves comparing correlated skill estimates whose sampling distributions depend on unknown population parameters. Unfortunately, this nesting strategy is not applicable to dynamical models because such models are neither nested parametrically nor have parameters chosen to satisfy orthogonality constraints.
The next section discusses pitfalls in comparing skill where skill is based on mean-square error or correlation. Analytic formulas for the variance of relevant sampling distributions are presented. Importantly, these formulas depend on parameters that can be estimated from the available data, allowing the magnitude of bias to be estimated. Monte Carlo simulations presented in appendix D demonstrate that the analytic formulas hold very well over a wide range of possible forecast/observation systems. In section 3, these formulas are used to show that the independence assumption is strongly violated in practical ENSO forecasting. Four tests for comparing forecast skill over the same period are reviewed in section 4. These tests are applied in section 5 to ENSO hindcasts to compare hindcasts from one model against those of several other models. The final section summarizes the results and discusses their implications.
2. Pitfalls in skill comparisons
a. Mean-square error
Although the above results were derived assuming large sample size, Monte Carlo simulations discussed in appendix D show that they hold well for N ≥ 30, a sample size typical in seasonal hindcasting. Furthermore, the Monte Carlo simulations confirm that the variance depends on model parameters primarily through the correlation R, although not necessarily in the exact linear dependence in Eq. (10). We propose in appendix D a more accurate estimate by assuming that the variance is proportional to (1 − R²) and choosing the proportionality constant so that the variance matches that of the F distribution when R = 0.
b. Correlation skill
Monte Carlo experiments presented in appendix D confirm that the above theoretical predictions hold very well for small sample size (e.g., N ≈ 30). In particular, they confirm that the variance of the difference in transformed correlation skills depends primarily on Γ. Appendix D also shows that Eq. (12) gives approximately unbiased estimates of Γ with an uncertainty similar to that of a standard correlation coefficient.
3. Assessment for ENSO seasonal forecasting
To assess the seriousness of the above problem, we consider hindcasts of monthly mean Niño-3.4 from the North American Multimodel Ensemble (NMME). The NMME, reviewed by Kirtman et al. (2014), consists of at least 9-month hindcasts by nine state-of-the-art coupled atmosphere–ocean models from the following centers: the National Centers for Environmental Prediction Climate Forecast System versions 1 and 2 (NCEP-CFSv1 and NCEP-CFSv2), the Canadian Centre for Climate Modeling and Analysis Climate Models versions 3 and 4 (CMC1-CanCM3 and CMC2-CanCM4), the Geophysical Fluid Dynamics Laboratory Climate Model (GFDL-CM2p1-aer04), the International Research Institute for Climate and Society Models (IRI-ECHAM4p5-Anomaly and IRI-ECHAM4p5-Direct), the National Aeronautics and Space Administration Global Modeling and Assimilation Office Model (NASA-GMAO-062012), and a joint collaboration between the Center for Ocean–Land–Atmosphere Studies, the University of Miami Rosenstiel School of Marine and Atmospheric Science, and the National Center for Atmospheric Research Community Climate System Model version 3 (COLA-RSMAS-CCSM3).
The hindcasts are initialized in all 12 calendar months during 1982–2009. The hindcast of a model is calculated by averaging 10 ensemble members from that model, except for the CCSM3 model, which only has 6. Although some models have more than 10 ensemble members, it is desirable to use a uniform ensemble size to rule out skill differences due to ensemble size.
The verification data for SST used in this study are the National Climatic Data Center (NCDC) Optimum Interpolation analysis (Reynolds et al. 2007).
The CFSv2 hindcasts have an apparent discontinuity across 1999, presumably due to the introduction of certain satellite data into the assimilation system in October 1998 (Kumar et al. 2012; Barnston and Tippett 2013; Saha et al. 2014). To remove this discontinuity, the climatologies before and after 1998 were removed separately. This adjustment was performed on all models to allow uniform comparisons. The slight change in degrees of freedom due to removing two separate climatologies is ignored for simplicity.
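For concreteness, a minimal sketch of this kind of split-climatology adjustment is given below; it assumes a hypothetical array of monthly values arranged as (year, calendar month), and the function name and data layout are illustrative rather than part of the actual NMME processing.

```python
import numpy as np

def split_climatology_anomalies(x, years, split_year=1998):
    """Remove separate monthly climatologies before and after a split year.

    x     : array of shape (n_years, 12) of monthly values (hypothetical layout)
    years : array of length n_years giving the calendar year of each row
    """
    x = np.asarray(x, dtype=float)
    years = np.asarray(years)
    anom = np.empty_like(x)
    early = years <= split_year
    late = ~early
    # Subtract the monthly mean computed within each sub-period separately.
    anom[early] = x[early] - x[early].mean(axis=0)
    anom[late] = x[late] - x[late].mean(axis=0)
    return anom
```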
Estimates of the correlation R between errors and the correlation Γ between sample correlation skills were computed for hindcasts of Niño-3.4 between 1982 and 2009, for every target month and every lead out to 8 months. The resulting values of R² and Γ are shown in Fig. 1. The median values are 0.32 and 0.64, and the frequencies of falling below 0.2 are 30% and 2.0%, respectively. These results demonstrate that skill estimates tend to be correlated in seasonal forecasting. They also imply that, for this hindcast dataset, the equality-of-correlation test tends to have a stronger bias than the equality-of-MSE test.
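The correlation R between the two models' errors can be computed directly from the hindcasts and verification data; a minimal sketch (with hypothetical variable names) is given below. Estimating Γ additionally requires Eq. (12), which is not reproduced in this sketch.

```python
import numpy as np

def error_correlation(obs, fcst1, fcst2):
    """Sample correlation R between the errors of two forecast systems
    verified against the same observations (inputs are anomaly time series)."""
    e1 = np.asarray(fcst1) - np.asarray(obs)
    e2 = np.asarray(fcst2) - np.asarray(obs)
    return np.corrcoef(e1, e2)[0, 1]
```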
Fig. 1. Histograms of (left) the squared correlation between errors R² and (right) the correlation between correlation skills Γ for hindcasts of monthly mean Niño-3.4 by nine dynamical models in the NMME over the period 1982–2009, for leads 0.5–7.5 months and all 12 initial months.
4. Testing equality of forecast skill
Appropriate tests for comparing forecast skill have been proposed by Hamill (1999) and in numerous economic studies since the seminal paper of Diebold and Mariano (1995) (e.g., see Harvey et al. 1997, 1998; Hansen 2005; Clark and West 2007; Hering and Genton 2011). Some of these tests account for serially correlated forecasts, but require estimating the integral time scale, which is notoriously difficult to estimate for weather and climate variables (Zwiers and von Storch 1995; DelSole and Feng 2013). Since the autocorrelation of most surface variables is indistinguishable from zero after about a year, we assume forecasts separated by a year are independent. Diebold and Mariano (1995) also proposed tests that could be justified in an asymptotic sense, but demonstrated that these tests produced inaccurate results for the small sample sizes typically seen in seasonal forecasting (e.g., N ≤ 30). Therefore, we consider only exact tests that are appropriate for small sample sizes.
a. The sign test
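As an illustration, a minimal sketch of the sign test applied to squared-error loss differentials is given below (assuming scipy is available; variable names are illustrative). Under the null hypothesis of equal skill, positive and negative loss differentials are equally likely, so the number of positive differentials follows a binomial distribution with p = 1/2.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(sq_err1, sq_err2):
    """Two-sided sign test on the loss differential d_t = e1_t**2 - e2_t**2."""
    d = np.asarray(sq_err1) - np.asarray(sq_err2)
    d = d[d != 0]                      # ties carry no sign information
    n_pos = int(np.sum(d > 0))
    return binomtest(n_pos, n=int(d.size), p=0.5, alternative="two-sided").pvalue
```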


b. Wilcoxon signed-rank test
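A corresponding sketch of the Wilcoxon signed-rank test applied to the same loss differentials (again a hedged illustration using scipy) follows; unlike the sign test, it uses the magnitudes of the differentials and assumes the null distribution is symmetric about zero.

```python
import numpy as np
from scipy.stats import wilcoxon

def wilcoxon_test(sq_err1, sq_err2):
    """Two-sided Wilcoxon signed-rank test on the loss differential."""
    d = np.asarray(sq_err1) - np.asarray(sq_err2)
    return wilcoxon(d, alternative="two-sided").pvalue
```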
c. Morgan–Granger–Newbold test
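A common formulation of the Morgan–Granger–Newbold test (e.g., as described in the econometrics literature cited in the introduction to this section) tests whether the correlation between the sum and the difference of the two error series vanishes; a sketch under the assumption of unbiased Gaussian errors is given below.

```python
import numpy as np
from scipy.stats import t as t_dist

def mgn_test(err1, err2):
    """Morgan-Granger-Newbold test for equal mean-square error.

    Under the null hypothesis, corr(e1 + e2, e1 - e2) = 0, and
    r * sqrt((N - 1) / (1 - r**2)) follows a t distribution with
    N - 1 degrees of freedom.
    """
    e1, e2 = np.asarray(err1), np.asarray(err2)
    n = e1.size
    r = np.corrcoef(e1 + e2, e1 - e2)[0, 1]
    stat = r * np.sqrt((n - 1) / (1.0 - r**2))
    pval = 2.0 * t_dist.sf(abs(stat), df=n - 1)
    return stat, pval
```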
d. A permutation test
Hamill (1999) and Goddard et al. (2012), among others, have proposed resampling methods for comparing forecast skill. Resampling methods draw random samples from the available data to estimate an empirical distribution of a statistic; see Efron and Tibshirani (1994), Good (2006), and Gilleland (2010) for accessible introductions, and Davison and Hinkley (1997), Lahiri (2003), and Good (2005) for more advanced treatments. As in previous sections, we assume that forecast errors from different initial conditions are independent. Also, Monte Carlo experiments and analysis of actual forecasts suggest that only the ensemble mean need be resampled; resampling individual ensemble members gives very similar results.
A natural null hypothesis corresponding to equal skill is that the forecasts are exchangeable or, equivalently, that the null distribution is invariant to rearrangements of the model labels. A resampling method consistent with this null hypothesis is to calculate the loss differential after random rearrangements of the model labels. This procedure effectively draws forecasts randomly without replacement and hence is called a permutation test. Note that swapping the identity of the forecasts in each year preserves contemporaneous correlation. Since only two models exist per year, this procedure is equivalent to swapping the model labels in each year with 50% probability, which is tantamount to changing the sign of the loss differential with 50% probability. Thus, a convenient numerical approach is to generate a random sequence of positive and negative ones, each occurring with probability ½, and multiply this sequence by the sequence of loss differentials to construct a permutation sample. This procedure is performed many times, here 10 000, to create 10 000 permuted loss differentials; this number is sufficient to make the results nearly insensitive to the particular choice of permutations. Then, the fraction of permutation samples whose mean (or median) exceeds the observed absolute value, plus the fraction that falls below the negative of the observed absolute value, gives the p value of the test (two terms are involved because the test is two tailed).
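A minimal sketch of the sign-flipping procedure just described is given below (variable names are illustrative; the statistic may be the mean or the median of the loss differentials).

```python
import numpy as np

def sign_flip_permutation_test(sq_err1, sq_err2, stat=np.mean,
                               n_perm=10_000, seed=0):
    """Permutation (sign-flip) test of equal skill.

    Randomly swapping the two model labels in a given year is equivalent
    to flipping the sign of that year's loss differential with
    probability 1/2.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(sq_err1) - np.asarray(sq_err2)
    observed = abs(stat(d))
    # Random +/-1 sequences applied to the loss differentials.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_stats = stat(flips * d, axis=1)
    # Two-tailed p value: fraction of permuted statistics at least as
    # extreme, in either direction, as the observed value.
    return np.mean((perm_stats >= observed) | (perm_stats <= -observed))
```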
e. Differences between the tests
If the four procedures reviewed above give different conclusions, what are we to decide? We argue that inconsistent conclusions imply that the decision is sensitive to differences in the null hypothesis, differences in the measure, or differences in the power of the tests. For instance, the different methods make increasingly restrictive assumptions about the null distribution: the sign test assumes the null distribution has zero median, the Wilcoxon signed-rank test additionally assumes the null distribution is symmetric, and the Morgan–Granger–Newbold test further assumes the errors have a normal distribution. The permutation test assumes the forecasts are exchangeable, which implies at least that the null distribution is symmetric. For symmetric null distributions, Wilcoxon’s test has more power than the sign test, which means that it is more likely to reject the null hypothesis when it is false, thereby contributing to differences in decisions. Also, some tests are based on the median while others are based on the mean. The sampling properties of the mean and median differ, which could contribute to differences in decisions. Determining which of these differences dominates the differences in decisions is difficult given the relatively small sample size typical of seasonal forecasting.
5. Comparison of NMME hindcasts
Fig. 2. Results of using the sign test to test the hypothesis that the median difference between squared errors vanishes, as a function of target month (horizontal axis) and lead month (vertical axis), where “0 lead” corresponds to a forecast in which the initial condition and verification occur in the same month. The test is applied to the squared error of the CFSv2 hindcasts minus the squared error of the model indicated in the title. Blue and red colors indicate that CFSv2 outperforms the comparison model more frequently and less frequently, respectively. White blanks indicate no significant difference in skill. The significance level of the test is indicated in the title.
Fig. 3. Results of testing the hypothesis that the median difference between squared errors vanishes, as in Fig. 2, but based on the Wilcoxon signed-rank test.
Fig. 4. Results of testing the hypothesis that the median difference between squared errors vanishes, as in Fig. 2, but based on a permutation test.
The result of applying the permutation test to test equality of mean-square error is shown in Fig. 5. Comparison with Fig. 4 reveals some differences: the two permutation tests agree with each other in about 91% of the cases. These differences illustrate the sensitivity to using the mean or the median to test equality of skill. Again, assessing whether these differences occur because of a lack of symmetry is difficult for this small sample size. The result of applying the Morgan–Granger–Newbold test to test equality of mean-square error is shown in Fig. 6. The Morgan–Granger–Newbold test agrees with the permutation test shown in Fig. 5 in about 93% of the cases. In a separate analysis, we find the Gaussian assumption to be reasonable for the errors, so the discrepancies are unlikely to be explained by departures from Gaussianity.
Fig. 5. Results of applying the permutation test to test equality of mean-square error between CFSv2 and the model indicated in the title. The format is as in Fig. 2.
Fig. 6. Results of applying the Morgan–Granger–Newbold test to test equality of mean-square error between CFSv2 and the model indicated in the title. The format is as in Fig. 2.
For comparison, we also perform the (invalid) equal-MSE test based on the F distribution (see Fig. 7) and the (invalid) equal-correlation test based on Fisher’s Z transformation (see Fig. 8). As discussed in section 2, both tests tend to be conservative. Thus, the differences detected by these tests ought to be a subset of those detected by the Morgan–Granger–Newbold test (which also makes the Gaussian error assumption). This is in fact the case: 98% of the differences detected by the F test were also detected by the Morgan–Granger–Newbold test. The correlation test is grossly conservative: it detects almost no differences, and the few differences it does detect were not detected by the other tests. The fact that the F test has less bias than the correlation test was also anticipated on the basis of Fig. 1.
Fig. 7. Results of applying the (invalid) F test to test equality of mean-square error between CFSv2 and the model indicated in the title. The format is as in Fig. 2.
Fig. 8. Results of applying the (invalid) equality-of-correlation test to the correlation skill of the CFSv2 hindcasts and the model indicated in the title. The format is as in Fig. 2.
We now change the base model used for comparison. Furthermore, we pool all lead times and target months together. Because errors are serially correlated, the independence assumption is no longer valid after pooling. To compensate, we use a Bonferroni correction to control the family-wise error rate at 1%. A conservative approach is to assume the errors are correlated across 12 months, in which case the Bonferroni correction is 1% divided by the number of target months (12) and lead times (8), which is about 1 in 10 000. Thus, we simply count the number of forecasts in which the squared error exceeds that of the comparison model and compare that count to the 1 in 10 000 critical value from a binomial distribution. The result for all pairwise comparisons is shown in Fig. 9, which also includes the multimodel mean of all models. The figure shows that CanCM3 and CFSv2 tend to outperform the other single models, except for the IRI models (i.e., their squared error is less than that of the other models with a frequency greater than expected by chance). Conversely, the GFDL and CFSv1 models tend to perform worse than the other models by this measure. The multimodel mean significantly outperforms all single models.
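A minimal sketch of this pooled comparison is given below; it uses a two-sided binomial test, which is equivalent to comparing the win count against the binomial critical values, and the variable names and threshold handling are illustrative.

```python
import numpy as np
from scipy.stats import binomtest

def pooled_win_test(sq_err1, sq_err2, alpha=1e-4):
    """Count how often model 1 has smaller squared error than model 2 over
    the pooled forecasts, and test that count against a fair coin at the
    (roughly 1 in 10 000) Bonferroni-corrected level alpha."""
    d = np.asarray(sq_err1) - np.asarray(sq_err2)
    d = d[d != 0]                        # discard exact ties
    wins = int(np.sum(d < 0))            # model 1 has the smaller squared error
    pval = binomtest(wins, n=int(d.size), p=0.5, alternative="two-sided").pvalue
    return wins, int(d.size), pval < alpha
```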
Fig. 9. The percentage of cases (relative to 50%) in which NMME forecasts from the model labeled on the x axis have smaller squared error than the model indicated by color. Positive values indicate that the model on the x axis is more skillful than the model to which it is compared. The forecasts are pooled over 8 lead times and 12 target months. The solid horizontal lines show the 1 in 10 000 significance level, which is a conservative estimate of the significance threshold needed to ensure a family-wise significance level of 1%.
6. Summary and conclusions
This paper showed that certain statistical tests for deciding equality of two skill measures are not valid when skill is computed from the same period or observations. This paper also reviewed alternative tests that are appropriate in such situations even for relatively small sample sizes. The invalid tests include the equality-of-MSE test based on the F distribution and the equality-of-correlation test based on Fisher’s Z transformation. These tests are not valid when skill is computed from the same period or observations because they do not account for correlations between skill estimates. Importantly, the dependence between skill estimates tends to bias these tests toward indicating no difference in skill. We show that the error due to neglecting these correlations depends almost entirely on the correlation R between forecast errors and the correlation Γ between sample correlation skills, both of which can be estimated from data. Monte Carlo simulations demonstrate that these estimates hold very well over a wide range of possible forecast/observation systems. Estimates from actual seasonal hindcasts of Niño-3.4 show that the assumption of contemporaneous independence is violated strongly, implying that the above tests cannot be used reliably to decide if one seasonal prediction system is superior to another over the same hindcast period. These results further imply that care should be taken when constructing confidence intervals on skill since many confidence intervals do not account for dependence between skill estimates.
Four valid statistical procedures for testing equality of skill were reviewed. These tests are based on the difference between two loss functions, called the loss differential, and can account for serial correlation, but only squared error under independent verifications was considered in this paper. The different methods make increasingly restrictive assumptions about the null distribution: the sign test tests the hypothesis that the null distribution has zero median (i.e., the loss differentials are equally likely to be positive or negative), the Wilcoxon signed-rank test further assumes the null distribution is symmetric (i.e., a loss differential of a given magnitude can be positive or negative with equal probability), and the Morgan–Granger–Newbold test further assumes the errors have a normal distribution. The permutation test tests the hypothesis that the null distribution is invariant to permuting model identities. Differences between these tests can be ascribed to differences in the assumed null hypotheses, differences in the statistic used to measure loss differentials, or differences in power. For instance, the sign test uses information only about the direction of skill differences whereas the Wilcoxon signed-rank test takes into account the magnitude of the difference. As a result, the Wilcoxon test has more power than the sign test, which means it is more likely to reject the null hypothesis when it is false, thereby contributing to differences in decisions. Also, the permutation test gives different conclusions for testing zero mean or median, demonstrating sensitivity to the measure used to characterize the null distribution.
The above tests were applied to hindcasts of Niño-3.4 from the NMME project. All (valid) tests give generally consistent results: tests based on the median agree among themselves in over 90% of the cases, and tests based on the mean agree among themselves in over 90% of the cases. Overall, CFSv2 and CanCM3 outperform the other single models in the sense that their squared errors are less than those of the other models more frequently, although the degree of outperformance is not statistically significant for the IRI models. Conversely, the GFDL-CM2p1 and CFSv1 models perform significantly worse than the other models. Although some models outperform others in this sense, it should be recognized that combinations of forecasts are often significantly more skillful than any single model alone. Indeed, the multimodel mean significantly outperforms all single models.
The above results also confirmed that the F test and equal-correlation test are conservative. Specifically, 98% of the differences detected by the F test also were detected by Morgan–Granger–Newbold, whereas the equal-correlation test was so conservative that it detected almost no differences. These results suggest that studies that have used these tests (e.g., Doblas-Reyes et al. 2013; Kirtman et al. 2013) probably underestimated detectable differences between forecasts.
Several research directions appear worthy of further investigation. One direction is to generalize the above analysis beyond univariate measures, especially to spatial fields and multiple variables. Some steps in this direction are discussed in Hering and Genton (2011) and Gilleland (2013) and are being explored as part of the Spatial Forecast Verification Methods InterComparison Project (Gilleland et al. 2010). Another direction is to examine the performance of the above methods for highly non-Gaussian variables such as precipitation. Also, comparison of weather and climate forecasts always involves some type of regression correction (e.g., removal of the mean bias), but this correction was ignored here; McCracken (2004) proposed methods for accounting for errors introduced by parameter estimation. Multimodel forecasts raise numerous other questions, such as whether particular combinations of forecasts are more skillful than others, or whether a small subset of models is just as good as the full set. Generalizing the above analysis to such questions would be desirable. Also, the skill measures investigated here used only ensemble mean information: information about the spread of the ensemble was ignored. Various approaches to comparing probabilistic forecasts were proposed by Hamill (1999) and deserve further development. An important component of forecast verification is understanding the sources of skill differences (e.g., whether differences in skill arise from poor representation of physical processes or from faulty initialization). Indeed, in the NMME data used here, the conclusions would have been very different if the discontinuity across 1999 had not been removed prior to comparison. It would be desirable to identify such discontinuities objectively. Thus, generalizing the above analysis to test conditional skill differences would be desirable (Giacomini and White 2006).
Acknowledgments
We thank an anonymous reviewer for pointing out important references and making insightful comments that led to substantial improvements in the final paper. This research was supported primarily by the National Oceanic and Atmospheric Administration, under the Climate Test Bed program (Grant NA10OAR4310264). Additional support was provided by the National Science Foundation (Grants ATM0332910, ATM0830062, and ATM0830068), the National Aeronautics and Space Administration (Grants NNG04GG46G and NNX09AN50G), the National Oceanic and Atmospheric Administration (Grants NA04OAR4310034, NA09OAR4310058, NA10OAR4310210, NA10OAR4310249, and NA12OAR4310091), and the Office of Naval Research (Award N00014-12-1-091). The NMME project and data dissemination are supported by NOAA, NSF, NASA, and DOE. The help of CPC, IRI, and NCAR personnel in creating, updating, and maintaining the NMME archive is gratefully acknowledged. The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.
APPENDIX A
Variance of the Mean-Square Error Ratio

APPENDIX B
Correlation between Sample Correlations






We have ignored the Z transformation in the above derivation. The Monte Carlo simulations confirm that this result holds well even for Z-transformed variables. This equivalence is probably due to the fact that the Z transformation is approximately linear except when the correlations are near ±1.
APPENDIX C
Generating Random Variables with Prescribed Variances and Correlations
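As an illustration, one generic way to generate such variables, not necessarily the specific parameterization of Eqs. (C1)–(C10), is to draw the triple (o, f1, f2) from a zero-mean multivariate Gaussian with prescribed variances and pairwise correlations.

```python
import numpy as np

def generate_system(n, var_o, var_f1, var_f2,
                    rho_o1, rho_o2, rho_12, seed=None):
    """Draw n independent (o, f1, f2) triples from a zero-mean multivariate
    Gaussian with prescribed variances and pairwise correlations.
    The correlations must form a positive semidefinite correlation matrix.
    """
    sd = np.sqrt([var_o, var_f1, var_f2])
    corr = np.array([[1.0,    rho_o1, rho_o2],
                     [rho_o1, 1.0,    rho_12],
                     [rho_o2, rho_12, 1.0   ]])
    cov = corr * np.outer(sd, sd)
    rng = np.random.default_rng(seed)
    o, f1, f2 = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    return o, f1, f2
```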



APPENDIX D
Monte Carlo Simulations
The validity of the analytic formulas derived in section 2 is assessed here using Monte Carlo simulations. Specifically, we construct an idealized forecast/observation system in which the true properties of the system are known exactly. We assume that forecast/observations are drawn from a multivariate Gaussian distribution, so that the statistics are specified completely by nine parameters: three means, three variances, and three pairwise correlations. We further assume that the observation/forecasts for different target dates are independent. For simplicity, we assume the means vanish. Denote the observation and the two forecasts by o, f1, and f2, and let their variances be
a. Mean-square errors
We first show in Fig. D1 a histogram of the ratio of mean-square errors computed numerically by Monte Carlo methods for a particular choice of model parameters that give R = 0.88, which is not unrealistic for practical seasonal forecasting (see Fig. 1). Consistent with the null hypothesis, the parameters are chosen such that the population mean-square errors are equal. We also show the F distribution with N − 1 and N − 1 degrees of freedom, where one degree of freedom is lost due to subtracting out the climatology. The figure shows that the distribution for the mean-square error is more narrowly concentrated around unity than the corresponding F distribution, as anticipated.
Fig. D1. Histogram of the ratio of mean-square errors of the system in Eqs. (4)–(6) computed by numerical simulation using parameter values indicated in the right-hand side of the figure. The curve shows an F distribution with N − 1 and N − 1 degrees of freedom. The vertical dashed lines indicate the 5th and 95th percentiles of the F distribution. The parameter values have been chosen to emphasize the discrepancy between the two distributions while producing equal expected mean-square errors.
Although there are six adjustable parameters, our theoretical calculations imply that the sampling distribution depends only on sample size and the correlation between errors R. We test this prediction by generating realizations from the multivariate Gaussian for a wide range of parameter choices and plotting the results simply as a function of sample size and R. More precisely, we fix the model parameters and generate N realizations of observations and forecasts. This procedure is repeated 10 000 times to generate 10 000 estimates of MSE1 and MSE2, from which the variance of the ratio of mean-square errors is estimated. Then, this entire procedure is repeated M = 200 times, each time using a new set of randomly chosen population variances and correlations.
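A condensed sketch of this Monte Carlo procedure is given below; the covariance matrix and parameter handling are illustrative and do not reproduce the specific parameter ranges used in the paper.

```python
import numpy as np

def mse_ratio_variance(cov, n=30, n_rep=10_000, seed=0):
    """Monte Carlo estimate of the variance of MSE1/MSE2 for a zero-mean
    Gaussian (o, f1, f2) system with covariance matrix `cov`, along with
    the average squared correlation between the two error series.

    To mimic the null hypothesis, `cov` should be chosen so that the two
    population mean-square errors are equal.
    """
    rng = np.random.default_rng(seed)
    ratios = np.empty(n_rep)
    r2 = np.empty(n_rep)
    for k in range(n_rep):
        o, f1, f2 = rng.multivariate_normal(np.zeros(3), cov, size=n).T
        e1, e2 = f1 - o, f2 - o
        ratios[k] = np.mean(e1**2) / np.mean(e2**2)
        r2[k] = np.corrcoef(e1, e2)[0, 1] ** 2
    return ratios.var(), r2.mean()
```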
Fig. D2. Variance of the ratio of mean-square errors of an idealized forecast/observation system plotted as a function of the squared correlation between forecast errors. Each symbol represents the result from the forecast/observation system in Eqs. (C1)–(C10) for a randomly selected set of model parameters. The black solid line shows the theoretical estimate in Eq. (10), appropriate for asymptotically large sample size N. The dashed line shows the empirical estimate in Eq. (D1) appropriate to the sample size N (the dashed line corresponding to N = 60 is difficult to see because the symbols lie on top of it).
An important implication of Fig. D2 is that the variance of the mean-square error ratio depends on the model parameters only through the parameter R. This result is remarkable given that different model parameters correspond to very different signal and noise variances.
b. Correlation skill
The difference in Z-transformed correlation skills for the forecast/observation system for a specific set of model parameters is shown in Fig. D3. The figure reveals a similar tendency to be more concentrated than the theoretical distribution based on independent correlations. A plot of the (scaled) population variance of z1 − z2 as a function of Γ is shown in Fig. D4 for two different sample sizes. The results are well approximated by Eq. (11), despite very different signal and noise variances in different models. Given this tight relation, we can anticipate that testing equality of correlation skills using the Z-transform method will give misleading results when Γ differs greatly from 0. Although the population value of this parameter generally is unknown, it can be estimated from the available data.
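For reference, a minimal sketch of the quantity being simulated, the difference in Fisher Z-transformed correlation skills, is given below (variable names are illustrative); under independent skill estimates its variance would be 2/(N − 3), whereas correlated estimates concentrate it more tightly.

```python
import numpy as np

def z_difference(obs, fcst1, fcst2):
    """Difference of Fisher Z-transformed correlation skills, z1 - z2."""
    obs, f1, f2 = map(np.asarray, (obs, fcst1, fcst2))
    z1 = np.arctanh(np.corrcoef(f1, obs)[0, 1])
    z2 = np.arctanh(np.corrcoef(f2, obs)[0, 1])
    return z1 - z2
```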
Fig. D3. Histogram of the difference in Z-transformed correlation skill of the system in Eqs. (C1)–(C10) computed by numerical simulation using parameter values indicated in the right-hand side of the figure. The curve shows a normal distribution with zero mean and variance 2/(N − 3). The vertical dashed lines indicate the 5th and 95th percentiles of the normal distribution. The parameter values have been chosen to emphasize the discrepancy between the two distributions while producing equal expected mean-square errors.
Fig. D4. Variance of the difference in Z-transformed correlations of an idealized forecast/observation system plotted as a function of the correlation parameter Γ defined in Eq. (12). Each symbol represents the result from the forecast/observation system in Eqs. (C1)–(C10) for a randomly selected set of model parameters. The line shows Eq. (11).
c. Estimation of R and Γ
As emphasized in section 2, the correlation between errors R and correlation between correlation skill Γ can be estimated from available data. A comparison of sample estimates and exact values is shown in Fig. D5. The shaded region shows the 95% confidence interval computed from the sample, and the dashed lines show the upper and lower limits of the 95% confidence interval computed from a standard correlation test (e.g., as derived from Fisher’s Z transformation). The figure suggests that the formulas provide unbiased estimates of these parameters with an uncertainty that behaves like a correlation coefficient.
Fig. D5. Results of estimating the correlation between errors R and the correlation between correlation skills Γ from a single sample of size N = 30 from the Monte Carlo experiments. The estimates for Γ are obtained from Eq. (12). For each exact (i.e., population) value, there are 1000 sample estimates. The thick solid line shows the mean of the sample values and the shaded region shows two standard deviations on both sides of the mean value. The dashed line shows the 95% confidence interval of a standard correlation coefficient (derived from Fisher’s Z transformation).
REFERENCES
Barnston, A. G., and M. K. Tippett, 2013: Predictions of Nino3.4 SST in CFSv1 and CFSv2: A diagnostic comparison. Climate Dyn., 41, 1615–1633, doi:10.1007/s00382-013-1845-2.
Barnston, A. G., M. K. Tippett, M. L. L’Heureux, S. Li, and D. G. DeWitt, 2012: Skill of real-time seasonal ENSO model predictions during 2002–11: Is our capability increasing? Bull. Amer. Meteor. Soc., 93, 631–651, doi:10.1175/BAMS-D-11-00111.1.
Clark, T. E., and K. D. West, 2007: Approximately normal tests for equal predictive accuracy in nested models. J. Econom., 138, 291–311, doi:10.1016/j.jeconom.2006.05.023.
Conover, W. J., 1980: Practical Nonparametric Statistics. 2nd ed. Wiley-Interscience, 493 pp.
Davison, A. C., and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 582 pp.
DelSole, T., and X. Feng, 2013: The “Shukla–Gutzler” method for estimating potential seasonal predictability. Mon. Wea. Rev., 141, 822–831, doi:10.1175/MWR-D-12-00007.1.
Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.
Doblas-Reyes, F. J., and Coauthors, 2013: Initialized near-term regional climate change prediction. Nat. Commun., 4, 1715, doi:10.1038/ncomms2704.
Efron, B., and R. J. Tibshirani, 1994: An Introduction to the Bootstrap. Chapman and Hall, 456 pp.
Giacomini, R., and H. White, 2006: Tests of conditional predictive ability. Econometrica, 74, 1545–1578, doi:10.1111/j.1468-0262.2006.00718.x.
Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Tech. Rep. NCAR/TN-479+STR, 71 pp.
Gilleland, E., 2013: Testing competing precipitation forecasts accurately and efficiently: The spatial prediction comparison test. Mon. Wea. Rev., 141, 340–355, doi:10.1175/MWR-D-12-00155.1.
Gilleland, E., D. A. Ahijevych, B. G. Brown, and E. E. Ebert, 2010: Verifying forecasts spatially. Bull. Amer. Meteor. Soc., 91, 1365–1373, doi:10.1175/2010BAMS2819.1.
Goddard, L., and Coauthors, 2012: A verification framework for interannual-to-decadal prediction experiments. Climate Dyn., 40, 245–272, doi:10.1007/s00382-012-1481-2.
Good, P. I., 2005: Permutation, Parametric and Bootstrap Tests of Hypotheses. 3rd ed. Springer, 315 pp.
Good, P. I., 2006: Resampling Methods. 3rd ed. Birkhäuser, 218 pp.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, doi:10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
Hansen, P. R., 2005: A test for superior predictive ability. J. Bus. Econ. Stat., 23, 365–380, doi:10.1198/073500105000000063.
Harvey, D., S. Leybourne, and P. Newbold, 1997: Testing the equality of prediction mean squared errors. Int. J. Forecasting, 13, 281–291, doi:10.1016/S0169-2070(96)00719-4.
Harvey, D., S. Leybourne, and P. Newbold, 1998: Tests for forecast encompassing. J. Bus. Econ. Stat., 16, 254–259.
Hering, A. S., and M. G. Genton, 2011: Comparing spatial predictions. Technometrics, 53, 414–425, doi:10.1198/TECH.2011.10136.
Kirtman, B., and Coauthors, 2013: Near-term climate change: Projections and predictability. Climate Change 2013: The Physical Science Basis, T. Stocker et al., Eds., Cambridge University Press, 953–1028.
Kirtman, B., and Coauthors, 2014: The North American Multimodel Ensemble: Phase-1 seasonal-to-interannual prediction, Phase-2 toward developing intraseasonal prediction. Bull. Amer. Meteor. Soc., 95, 585–601, doi:10.1175/BAMS-D-12-00050.1.
Kumar, A., M. Chen, L. Zhang, W. Wang, Y. Xue, C. Wen, L. Marx, and B. Huang, 2012: An analysis of the nonstationarity in the bias of sea surface temperature forecasts for the NCEP Climate Forecast System (CFS) version 2. Mon. Wea. Rev., 140, 3003–3016, doi:10.1175/MWR-D-11-00335.1.
Lahiri, S. N., 2003: Resampling Methods for Dependent Data. Springer, 374 pp.
McCracken, M. W., 2004: Parameter estimation and tests of equal forecast accuracy between non-nested models. Int. J. Forecasting, 20, 503–514, doi:10.1016/S0169-2070(03)00063-3.
Reynolds, R. W., T. M. Smith, C. Liu, D. B. Chelton, K. S. Casey, and M. G. Schlax, 2007: Daily high-resolution-blended analyses for sea surface temperature. J. Climate, 20, 5473–5496, doi:10.1175/2007JCLI1824.1.
Saha, S., and Coauthors, 2014: The NCEP Climate Forecast System version 2. J. Climate, 27, 2185–2208, doi:10.1175/JCLI-D-12-00823.1.
Scheffe, H., 1959: The Analysis of Variance. John Wiley and Sons, 477 pp.
Zwiers, F. W., and H. von Storch, 1995: Taking serial correlation into account in tests of the mean. J. Climate, 8, 336–351, doi:10.1175/1520-0442(1995)008<0336:TSCIAI>2.0.CO;2.