## Abstract

A basic question in forecasting is whether one prediction system is more skillful than another. Some commonly used statistical significance tests cannot answer this question correctly if the skills are computed on a common period or using a common set of observations, because these tests do not account for correlations between sample skill estimates. Furthermore, the results of these tests are biased toward indicating no difference in skill, a fact that has important consequences for forecast improvement. This paper shows that the magnitude of bias is characterized by a few parameters such as sample size and correlation between forecasts and their errors, which, surprisingly, can be estimated from data. The bias is substantial for typical seasonal forecasts, implying that familiar tests may wrongly judge that differences in seasonal forecast skill are insignificant. Four tests that are appropriate for assessing differences in skill over a common period are reviewed. These tests are based on the sign test, the Wilcoxon signed-rank test, the Morgan–Granger–Newbold test, and a permutation test. These techniques are applied to ENSO hindcasts from the North American Multimodel Ensemble and reveal that the Climate Forecast System, version 2, and the Canadian Climate Model, version 3 (CanCM3), outperform other models in the sense that their squared error is less than that of other single models more frequently. It should be recognized that while certain models may be superior in a certain sense for a particular period and variable, combinations of forecasts are often significantly more skillful than a single model alone. In fact, the multimodel mean significantly outperforms all single models.

## 1. Introduction

A basic question in weather and climate forecasting is whether one prediction system is more skillful than another. For instance, users want to know how different prediction systems have performed in the past, modelers want to know if changes in prediction system improved the skill, and operational centers want to know where to allocate resources. Regardless of application, a universal concern is whether the observed difference in skill reflects an actual difference or is dominated by random sampling variability that is sensitive to the specific forecast samples being evaluated. This concern grows as the sample size and skill level decrease, because sampling variability increases in these cases. Thus, this concern is greatest for low-skill, infrequent forecasts like seasonal or longer climate forecasts, or when forecast performance is stratified according to infrequent phenomena (e.g., warm ENSO events).

To address the above question, one needs to perform a significance test that quantifies the probability that the observed difference in skill could have arisen entirely from sampling variability. Standard procedures that could be used to test differences in skill include Fisher's test for equality of correlations and the *F* test for equality of variance. An underlying assumption of these tests is that the skill estimates are independent. This requirement could be satisfied, for example, by deriving skill estimates from independent periods or observations.

However, it is more common (and often viewed as desirable) to compare forecast skills on a common period or set of observations. In this case, the skill estimates are not independent and the most commonly used significance tests give incorrect results. In particular, we show that the difference in skill tends to be closer to zero for dependent estimates than for independent estimates. This bias tends to mislead a user to conclude no difference in skill when in fact a difference exists. This point has important consequences for the broad endeavor of forecast improvement. The purpose of this paper is to quantify this fundamental difficulty and to highlight alternative tests that are more statistically justified. In fact, we show that the magnitude of the bias can be estimated from data assuming Gaussian distributions, and that this bias is large in typical seasonal forecasts. We also review more statistically justified tests based on the sign test, the Wilcoxon signed-rank test, the Morgan–Granger–Newbold test, and a permutation test, as proposed in the economics literature (Diebold and Mariano 1995) and in weather prediction literature (Hamill 1999).

To avoid possible confusion, it is worth mentioning that there exist some common circumstances in which the significance of a difference in skill can be tested even when the skill estimates are not independent. In particular, a difference in skill can be tested if the *difference* is independent of the skill of one of the models. Such a situation arises when regression methods are applied to a simpler model *nested* within a more complex model (Scheffe 1959, section 2.9). In contrast, comparison of nonnested models generally involves comparing correlated skill estimates whose sampling distributions depend on unknown population parameters. Unfortunately, this nesting strategy is not applicable for dynamical models because dynamical models are neither nested parametrically nor have parameters chosen to satisfy orthogonality constraints.

The next section discusses pitfalls in comparing skill where skill is based on mean-square error or correlation. Analytic formulas for the variance of relevant sampling distributions are presented. Importantly, these formulas depend on parameters that can be estimated from the available data, allowing the magnitude of bias to be estimated. Monte Carlo simulations presented in appendix D demonstrate that the analytic formulas hold very well over a wide range of possible forecast/observation systems. In section 3, these formulas are used to show that the independence assumption is strongly violated in practical ENSO forecasting. Four tests for comparing forecast skill over the same period are reviewed in section 4. These tests are applied in section 5 to ENSO hindcasts to compare hindcasts from one model against those of several other models. The final section summarizes the results and discusses their implications.

## 2. Pitfalls in skill comparisons

### a. Mean-square error

To define the problem precisely, let *o* denote an observation and let *f*_{1} and *f*_{2} denote forecasts of *o*. The error of the *i*th forecast is defined to be

$$\epsilon_i = f_i - o, \qquad i = 1, 2. \tag{1}$$

The mean-square error of the *i*th forecast over *N* samples is

$$\mathrm{MSE}_i = \frac{1}{N}\sum_{n=1}^{N} \epsilon_{i,n}^2, \tag{2}$$

where *ϵ*_{i,n} denotes the error of the *i*th forecast for the *n*th verification sample.

We want to test the hypothesis that the population mean-square errors are equal:

$$H_0\!: \ E\left[\epsilon_1^2\right] = E\left[\epsilon_2^2\right]. \tag{3}$$

One might suggest (Doblas-Reyes et al. 2013; Kirtman et al. 2013) that the hypothesis in Eq. (3) can be tested by comparing the ratio of mean-square errors to an *F* distribution with relevant degrees of freedom. Presumably, the rationale behind this suggestion is that testing equality of mean-square errors is equivalent to testing equality of variances. The problem with this rationale, however, is that the proposed test assumes the variances are independent, whereas mean-square errors computed from the same observations are not independent. To understand this point, it is helpful to partition the forecasts and observations into a conditional mean, called the signal, and a deviation about the conditional mean, called the noise. That is, let the forecasts and observations be

$$f_1 = s_1 + n_1, \tag{4}$$

$$f_2 = s_2 + n_2, \tag{5}$$

$$o = s_o + n_o, \tag{6}$$

where *s* and *n* variables represent signal and noise variables, respectively. The signal variables are independent of the noise variables, the noise variables are mutually independent, but the signal variables may be correlated. The decomposition for forecasts can be applied either to an individual ensemble member or to the ensemble mean, in which case the signals would be identical but the noise terms for the ensemble mean would have smaller variance than the noise terms for individual ensemble members. The analysis below applies to either interpretation. Therefore, the covariance between errors under these assumptions is

$$\operatorname{cov}[\epsilon_1, \epsilon_2] = \operatorname{cov}[s_1 - s_o,\; s_2 - s_o] + \operatorname{var}[n_o]. \tag{7}$$

The covariance between errors involves two terms: one that depends on the difference in signals and another that depends on the noise variance of observations. These terms reflect the fact that correlations between mean-square errors may arise due to correlated signal errors, or due to using the same observations to calculate two separate mean-square errors (MSEs). Correlation between prediction errors is called *contemporaneous dependence* (Diebold and Mariano 1995).
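
A quick Monte Carlo check of this covariance decomposition, using an invented signal/noise system in the spirit of Eqs. (4)–(6) (all parameters below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200_000  # large sample so sampling error is small

# Invented signals sharing a common component (hence correlated signals),
# plus mutually independent noise terms for each forecast and the observations.
c = rng.normal(size=M)
s1 = c + 0.5 * rng.normal(size=M)
s2 = c + 0.5 * rng.normal(size=M)
so = c + 0.5 * rng.normal(size=M)
n1 = rng.normal(size=M)
n2 = rng.normal(size=M)
no = rng.normal(size=M)

e1 = (s1 + n1) - (so + no)  # error of forecast 1
e2 = (s2 + n2) - (so + no)  # error of forecast 2

# cov[e1, e2] = cov[s1 - so, s2 - so] + var[n_o]: a signal-error term plus
# a term arising purely from sharing the same observations.
lhs = np.cov(e1, e2)[0, 1]
rhs = np.cov(s1 - so, s2 - so)[0, 1] + no.var(ddof=1)
```

Even with uncorrelated signal errors, the `var[n_o]` term alone keeps the two error series correlated, which is the source of the bias discussed below.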

To gain insight into the impact of correlated errors, consider the ratio of MSEs for the idealized forecast/observation system in Eqs. (4)–(6):

$$\lambda = \frac{\mathrm{MSE}_1}{\mathrm{MSE}_2} = \frac{\sum_{n=1}^{N} \left(s_{1,n} - s_{o,n} + n_{1,n} - n_{o,n}\right)^2}{\sum_{n=1}^{N} \left(s_{2,n} - s_{o,n} + n_{2,n} - n_{o,n}\right)^2}. \tag{8}$$

The numerator and denominator cannot be independent because they each contain the random variable *n*_{o}, which is not canceled by the other terms. Since some of the expanded terms are identical in both numerator and denominator, the ratio of MSEs will be closer to unity than would be the case if all terms were independent. These considerations suggest that the distribution of *λ* will be more concentrated around unity than a true *F* ratio based on independent denominator and numerator. Furthermore, the degree to which the distribution of *λ* concentrates near unity depends on the true signal and noise, which are unobservable.

To put the above statements in a more rigorous statistical context, consider the case in which MSE_{1} and MSE_{2} are independent. This would be true if the MSEs were calculated from independent verification periods. If the sample sizes are equal, say *N*, then the appropriate test is to compare the ratio *λ* to an *F* distribution with *N* − 1 and *N* − 1 degrees of freedom, where one degree is lost due to centering. Using standard properties of the *F* distribution, it follows that the variance of the ratio *λ* for independent errors is

$$\operatorname{var}[\lambda] = \frac{4(N-1)(N-2)}{(N-3)^2(N-5)} \approx \frac{4}{N}, \tag{9}$$

where the approximation holds for large *N*.

In appendix A, we derive the apparently new result that if the errors are normally distributed and the population mean-square errors are equal, then for large sample size *N*

$$\operatorname{var}[\lambda] \approx \frac{4}{N}\left(1 - R^2\right), \tag{10}$$

where *R* is the correlation between forecast errors *ϵ*_{1} and *ϵ*_{2}. The parameter *R* is a measure of contemporaneous dependence. Comparison with Eq. (9) for independent errors shows that correlated errors *reduce the variance* of the ratio of mean-square errors, just as was anticipated earlier. This decrease in variance implies that a test based on the *F* distribution would be biased toward finding no skill. Equation (10) is remarkable because it implies that the sampling distribution depends on (unknown) population quantities only through the parameter *R*, *which can be estimated from the available data*.

Although the above results were derived assuming large sample size, Monte Carlo simulations discussed in appendix D show that the above results hold well for *N* ≥ 30, a sample size typical in seasonal hindcasting. Furthermore, the Monte Carlo simulations confirm that the variance depends on model parameters primarily through the correlation *R*, although not necessarily in the exact linear dependence in Eq. (10). We propose in appendix D a more accurate estimate by assuming that variance is proportional to (1 − *R*^{2}) and choosing the proportionality constant so that the variance matches that of the *F* distribution when *R* = 0.
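
As a concrete illustration of how *R* enters the calculation, the sketch below builds a toy signal-plus-noise system (all parameters are invented), estimates *R* from the two error series, and evaluates the variance of the MSE ratio with and without the correlation correction of Eqs. (9) and (10):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30  # sample size typical of seasonal hindcasts

# Toy system in the spirit of Eqs. (4)-(6): one shared signal plus
# independent unit-variance noise for each forecast and the observations.
s = rng.normal(size=N)
o = s + rng.normal(size=N)
f1 = s + rng.normal(size=N)
f2 = s + rng.normal(size=N)

e1, e2 = f1 - o, f2 - o

# Contemporaneous dependence: sample correlation between the error series,
# which is all that is needed to estimate the bias.
R = np.corrcoef(e1, e2)[0, 1]

# Large-N variance of the MSE ratio under the null hypothesis:
var_indep = 4.0 / N                 # independent errors, Eq. (9)
var_dep = var_indep * (1.0 - R**2)  # correlated errors, Eq. (10)
```

Because `var_dep` can never exceed `var_indep`, comparing the ratio of MSEs to an *F* distribution is systematically conservative whenever *R* ≠ 0.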

### b. Correlation skill

Another common verification measure is the correlation coefficient between forecast and observation. A common way to test the significance of a difference in sample correlations is to transform the correlations using Fisher's *Z* transform, and then perform hypothesis tests on the resulting statistics, say *z*_{1} and *z*_{2} for forecasts 1 and 2. In particular, if the population correlations are equal *and the samples are independent*, then the difference *z*_{1} − *z*_{2} should be approximately normally distributed with zero mean and variance 2/(*N* − 3). On the other hand, if the sample correlation skills are correlated, then

$$\operatorname{var}[z_1 - z_2] = \frac{2}{N-3}\left(1 - \Gamma\right), \tag{11}$$

where Γ = cor[*z*_{1}, *z*_{2}]. For a single dataset, *z*_{1} and *z*_{2} are single numbers, hence the correlation Γ is not directly computable. In appendix B, we show that for large sample size *N*,

$$\Gamma \approx \frac{\rho_{12}\left(1 - \rho_{o1}^2 - \rho_{o2}^2\right) - \tfrac{1}{2}\rho_{o1}\rho_{o2}\left(1 - \rho_{o1}^2 - \rho_{o2}^2 - \rho_{12}^2\right)}{\left(1 - \rho_{o1}^2\right)\left(1 - \rho_{o2}^2\right)}, \tag{12}$$

where *ρ*_{12} is the correlation between *f*_{1} and *f*_{2}, and *ρ*_{o1} and *ρ*_{o2} are the correlation skills for models 1 and 2, respectively. Importantly, this expression for Γ can be estimated from the dataset (e.g., substitute sample correlations for population correlations); hence, the degree to which the sampling distribution is impacted by lack of independence can be estimated.

Monte Carlo experiments presented in appendix D confirm that the above theoretical predictions hold very well for small sample size (e.g., *N* ≈ 30). In particular, they confirm that the variance of the difference in transformed correlation skills depends primarily on Γ. Appendix D also shows that Eq. (12) gives approximately unbiased estimates of Γ with an uncertainty similar to that of a standard correlation coefficient.
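
A sketch of evaluating Γ by plugging sample correlations into a closed-form expression. The particular formula below is the standard large-sample form for the correlation between Fisher-*z* statistics that share a common variable (Steiger 1980); it is assumed here to represent the appendix B expression, and the numerical inputs are invented:

```python
def gamma_overlapping(rho12, rho_o1, rho_o2):
    """Large-sample correlation between Fisher-z transformed skills z1 and z2
    computed from the same observations (overlapping-correlations form)."""
    num = (rho12 * (1 - rho_o1**2 - rho_o2**2)
           - 0.5 * rho_o1 * rho_o2 * (1 - rho_o1**2 - rho_o2**2 - rho12**2))
    den = (1 - rho_o1**2) * (1 - rho_o2**2)
    return num / den

# Sanity checks: identical forecasts give Gamma = 1, and forecasts with zero
# correlation skill reduce Gamma to the forecast-forecast correlation.
g_identical = gamma_overlapping(1.0, 0.6, 0.6)
g_zero_skill = gamma_overlapping(0.5, 0.0, 0.0)
```

In practice the population correlations would be replaced by their sample estimates from the hindcast dataset, as suggested in the text.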

## 3. Assessment for ENSO seasonal forecasting

To assess the seriousness of the above problem, we consider hindcasts of monthly mean Niño-3.4 from the North American Multimodel Ensemble (NMME). The NMME, reviewed by Kirtman et al. (2014), consists of at least 9-month hindcasts by nine state-of-the-art coupled atmosphere–ocean models from the following centers: the National Centers for Environmental Prediction Climate Forecast System versions 1 and 2 (NCEP-CFSv1 and NCEP-CFSv2), the Canadian Centre for Climate Modeling and Analysis Climate Models versions 3 and 4 (CMC1-CanCM3 and CMC2-CanCM4), the Geophysical Fluid Dynamics Laboratory Climate Model (GFDL-CM2p1-aer04), the International Research Institute for Climate and Society Models (IRI-ECHAM4p5-Anomaly and IRI-ECHAM4p5-Direct), the National Aeronautics and Space Administration Global Modeling and Assimilation Office Model (NASA-GMAO-062012), and a joint collaboration between the Center for Ocean–Land–Atmosphere Studies, the University of Miami Rosenstiel School of Marine and Atmospheric Science, and the National Center for Atmospheric Research Community Climate System Model version 3 (COLA-RSMAS-CCSM3).

The hindcasts are initialized in all 12 calendar months during 1982–2009. The hindcast of a model is calculated by averaging 10 ensemble members from that model, except for the CCSM3 model, which only has 6. Although some models have more than 10 ensemble members, it is desirable to use a uniform ensemble size to rule out skill differences due to ensemble size.

The verification data for SST used in this study are the National Climatic Data Center (NCDC) Optimum Interpolation analysis (Reynolds et al. 2007).

The CFSv2 hindcasts have an apparent discontinuity across 1999, presumably due to the introduction of certain satellite data into the assimilation system in October 1998 (Kumar et al. 2012; Barnston and Tippett 2013; Saha et al. 2014). To remove this discontinuity, the climatologies before and after 1998 were removed separately. This adjustment was performed on all models to allow uniform comparisons. The slight change in degrees of freedom due to removing two separate climatologies is ignored for simplicity.

Estimates of the correlation *R* between errors and the correlation Γ between sample correlation skills were computed for hindcasts of Niño-3.4 between 1982 and 2009, for every target month and lead out to 8 months. The resulting values of *R*^{2} and Γ are shown in Fig. 1. The median values are 0.32 and 0.64, and the frequency of falling below 0.2 is 30% and 2.0%, respectively. These results demonstrate that skill estimates tend to be correlated in seasonal forecasting. They also imply that the equality-of-correlation test tends to have a stronger bias than the equality-of-MSE test, for this hindcast dataset.

## 4. Testing equality of forecast skill

Appropriate tests for comparing forecast skill have been proposed by Hamill (1999) and in numerous economic studies since the seminal paper of Diebold and Mariano (1995) (e.g., see Harvey et al. 1997, 1998; Hansen 2005; Clark and West 2007; Hering and Genton 2011). Some of these tests account for serially correlated forecasts, but require estimating the integral time scale, which is notoriously difficult to estimate for weather and climate variables (Zwiers and von Storch 1995; DelSole and Feng 2013). Since the autocorrelation of most surface variables is indistinguishable from zero after about a year, we assume forecasts separated by a year are independent. Diebold and Mariano (1995) also proposed tests that could be justified in an asymptotic sense, but demonstrated that these tests produced inaccurate results for the small sample sizes typically seen in seasonal forecasting (e.g., *N* ≤ 30). Therefore, we consider only exact tests that are appropriate for small sample sizes.

Diebold and Mariano (1995) proposed tests that applied to arbitrary loss functions and avoided restrictive assumptions on the distributions. In this paper, we consider primarily squared error, although the generalization to arbitrary loss functions should be recognized. For squared error, the statistic for comparing forecasts is the *loss differential*

$$d_n = \epsilon_{1,n}^2 - \epsilon_{2,n}^2, \qquad n = 1, \ldots, N.$$

Equality of mean-square error in Eq. (3) corresponds to *E*[*d*] = 0.

### a. The sign test

Perhaps the simplest “equal skill” hypothesis would assert that there is equal probability that one forecast will outperform the other at any given time. If squared error is used to compare skill, then this hypothesis implies that the loss differential has a 50–50 chance of being positive. If the loss differentials also are independent, then this hypothesis implies that the signs of the loss differentials follow a Bernoulli process, so the number of positive loss differentials follows a binomial distribution, *regardless of the true distribution of the errors and regardless of the loss function*. Therefore, we simply count the number of times the loss differential is positive, and compare that number to the appropriate threshold computed from a binomial distribution. This test is called the *sign test*, and its statistic is

$$\mathrm{ST} = \sum_{n=1}^{N} I(d_n > 0),$$

where *I*(·) equals one when its argument is true and zero otherwise.

The probability of obtaining the value ST under the null hypothesis follows a binomial distribution:

$$p(k) = \Pr[\mathrm{ST} = k] = \binom{N}{k}\left(\frac{1}{2}\right)^{N}, \qquad k = 0, 1, \ldots, N.$$

The *p* value of the test is 2*p*(0) + 2*p*(1) + ⋯ + 2*p*(ST_{o}), where ST_{o} is the observed value (taken here to lie in the lower tail; otherwise the upper tail is summed), and the factor of 2 accounts for the test being two tailed, since the loss differential may be either positive or negative.

The above null hypothesis is equivalent to the hypothesis that the *median* of the loss differential vanishes. To see this, note that the above null hypothesis assumes Pr[*d* > 0] = 1/2, which is precisely equivalent to

$$H_m\!: \ \Pr[d > 0] = \Pr[d < 0],$$

which, by definition, implies that *d* = 0 is the median. Diebold and Mariano (1995) proposed testing the hypothesis that the median loss differential vanishes. Although *H*_{m} differs from *H*_{0} for distributions that are not symmetric, we do not view this discrepancy as a problem: the median and mean measure different aspects of the full forecast comparison problem.
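
A minimal implementation of the count-and-threshold procedure just described (the 22-of-28 split below is an invented example, not an NMME result):

```python
from math import comb

def sign_test_pvalue(d):
    """Two-sided sign test on loss differentials (zeros assumed absent)."""
    n = len(d)
    st = sum(1 for x in d if x > 0)  # sign-test statistic ST
    k = min(st, n - st)              # depth into the nearer tail
    # Two-tailed p value: 2 * P(X <= k) for X ~ Binomial(n, 1/2), capped at 1.
    p = 2 * sum(comb(n, j) for j in range(k + 1)) / 2**n
    return min(p, 1.0)

# Invented example: one forecast has the smaller squared error in 22 of 28 years.
p = sign_test_pvalue([1.0] * 22 + [-1.0] * 6)
```

Because only the signs of the loss differentials enter, the result is exact for any error distribution and any loss function, as the text emphasizes.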

### b. Wilcoxon signed-rank test

One limitation of the sign test is that it uses information only about the *direction* of the differences between squared errors. Conceivably, taking account of the *magnitude* of the difference could lead to a more powerful test. The Wilcoxon signed-rank test is another test of the hypothesis *H*_{m} that gives more weight to larger differences than to smaller differences. Of course, this extra power comes with a price: it assumes that the null distribution of the loss differential is *symmetric* (Conover 1980). Symmetry implies that the size of the difference is independent of the sign of the difference. Hamill (1999) also proposed the Wilcoxon signed-rank test for comparing forecasts, although its general applicability beyond probabilistic skill scores may have been little noticed, aside from a few exceptions (e.g., Barnston et al. 2012). In the present context, the statistic for this test is

$$\mathrm{WT} = \sum_{n=1}^{N} \operatorname{rank}\left(|d_n|\right) I(d_n > 0),$$

where ties and *d*_{n} = 0 are assumed to never occur. The exact finite-sample distribution of this statistic is invariant to the distribution of the loss differential, provided the distribution is symmetric about zero. Most mathematical packages can perform this test (e.g., wilcox.test in R and signrank in MATLAB). The upper and lower critical values of WT are computed in the same way as for the sign test discussed in the previous paragraph.
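
For the small samples typical of seasonal hindcasting, the exact null distribution of WT can even be built by brute force, enumerating all 2^{N} equally likely sign assignments. A sketch (the ten loss differentials are invented for illustration):

```python
from itertools import product

def wilcoxon_pvalue(d):
    """Exact two-sided Wilcoxon signed-rank test (no ties or zeros assumed)."""
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r                  # rank of |d_i| among all |d|
    wt = sum(ranks[i] for i in range(n) if d[i] > 0)
    center = n * (n + 1) / 4          # mean of WT under the null
    # Enumerate all 2^n sign assignments (feasible only for small n).
    null = [sum(rank for rank, pos in zip(range(1, n + 1), signs) if pos)
            for signs in product([0, 1], repeat=n)]
    extreme = sum(1 for w in null if abs(w - center) >= abs(wt - center))
    return extreme / 2**n

# Invented example: nine positive differentials and one small negative one.
p = wilcoxon_pvalue([3, 5, 2, 6, 4, 7, 1, -0.5, 8, 9])
```

For realistic *N* one would instead use the tabulated or library distributions mentioned in the text (`wilcox.test` in R, `signrank` in MATLAB); the enumeration above merely makes the exactness of the null distribution explicit.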

### c. Morgan–Granger–Newbold test

Diebold and Mariano (1995) discuss another test called the Morgan–Granger–Newbold test. The idea is to avoid assuming contemporaneous independence through an “orthogonalizing transformation.” Specifically, the errors are first transformed according to

$$x = \epsilon_1 + \epsilon_2, \qquad z = \epsilon_1 - \epsilon_2.$$

The covariance between these random variables is

$$\operatorname{cov}[x, z] = \operatorname{var}[\epsilon_1] - \operatorname{var}[\epsilon_2].$$

Therefore, the hypothesis of equality of mean-square error in Eq. (3) is equivalent to zero correlation between *x* and *z*. Also, the sign of the correlation indicates the superior model. It follows that if errors are normally distributed, the hypothesis of equal mean-square error can be tested using a standard correlation test between *x* and *z*.
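
A sketch of the test with synthetic Gaussian errors (the sample size and error scales are arbitrary assumptions; a deliberately large sample is used so the effect is unambiguous):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000

# Synthetic centered Gaussian errors; model 2 is deliberately noisier.
e1 = rng.normal(scale=1.0, size=N)
e2 = rng.normal(scale=1.5, size=N)

# Orthogonalizing transformation: cov[x, z] = var[e1] - var[e2],
# so equal MSE corresponds to zero correlation between x and z.
x = e1 + e2
z = e1 - e2

r = np.corrcoef(x, z)[0, 1]
# Standard correlation t statistic with N - 2 degrees of freedom.
t = r * np.sqrt((N - 2) / (1.0 - r**2))
# Negative r (and t) indicates var[e1] < var[e2], i.e., model 1 is superior.
```

The resulting *t* statistic would be compared to a Student's *t* distribution with *N* − 2 degrees of freedom, exactly as in any standard test for nonzero correlation.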

### d. A permutation test

Hamill (1999) and Goddard et al. (2012), among others, have proposed resampling methods for comparing forecast skill. Resampling methods draw random samples from the available data to estimate an empirical distribution of a statistic; see Efron and Tibshirani (1994), Good (2006), and Gilleland (2010) for accessible introductions, and Davison and Hinkley (1997), Lahiri (2003), and Good (2005) for more advanced treatments. As in previous sections, we assume that forecast errors from different initial conditions are independent. Also, Monte Carlo experiments and analysis of actual forecasts suggest that only the ensemble mean need be resampled; sampling individual ensemble members gives very similar results.

A natural null hypothesis corresponding to equal skill is that the forecasts are *exchangeable*, or equivalently, the null distribution is invariant to rearrangements of the model labels. A resampling method that is consistent with this null hypothesis is to calculate the loss differential after random rearrangements of the model labels. This procedure effectively draws forecasts randomly *without* replacement, and, hence, is called a *permutation test*. Note that swapping the identity of the forecasts in each year preserves contemporaneous correlation. Since only two models exist per year, this procedure is equivalent to swapping the model labels in each year with a 50% probability, which is tantamount to changing the sign of the loss differential with a 50% probability. Thus, a convenient numerical approach is to generate a sequence of positive and negative ones from a binomial distribution with *p* = ½, and multiply this sequence by the sequence of loss differentials to construct a permutation sample. This procedure is performed many times, here 10 000 times, to create 10 000 permutation samples; this number was needed to ensure near insensitivity of the results to the particular choice of permutations. Then, the fraction of permutation samples whose mean (or median) exceeds the observed absolute value, plus the fraction that falls below the negative of the observed absolute value, gives the *p* value of the test (two terms are involved because the test is two tailed).
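
The sign-flip procedure can be sketched as follows (the loss differentials are invented values standing in for 28 years of squared-error differences, not NMME data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented loss differentials for 28 "years" (mostly positive, i.e.,
# model 2 appears more accurate).
d = np.tile([0.9, 1.1, -0.2, 0.8], 7)  # length 28

n_perm = 10_000
obs = abs(d.mean())

# Swapping model labels within a year flips the sign of that year's loss
# differential, so each permutation sample multiplies d by random +/-1 signs.
signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
perm_means = (signs * d).mean(axis=1)

# Two-tailed p value: fraction of permutation means at least as extreme
# (in absolute value) as the observed mean.
p_value = np.mean(np.abs(perm_means) >= obs)
```

Replacing `mean` with `median` throughout gives the median version of the test used in section 5.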

### e. Differences between the tests

If the four procedures reviewed above give different conclusions, what are we to decide? We argue that inconsistent conclusions imply that the decision is sensitive to differences in the null hypothesis, differences in the measure, or differences in the power of the tests. For instance, the different methods make increasingly restrictive assumptions about the null distribution: the sign test assumes the null distribution has zero median, the Wilcoxon signed-rank test additionally assumes the null distribution is symmetric, and the Morgan–Granger–Newbold test further assumes the errors have a normal distribution. The permutation test assumes the forecasts are exchangeable, which implies at least that the null distribution is symmetric. For symmetric null distributions, Wilcoxon's test has more power than the sign test, which means that it is more likely to reject the null hypothesis when it is false, thereby contributing to differences in decisions. Also, some tests are based on the median while others are based on the mean. The sampling properties of the mean and median differ, which could contribute to differences in decisions. Determining which of these differences dominates the differences in decisions is difficult given the relatively small sample size typical of seasonal forecasting.

## 5. Comparison of NMME hindcasts

We now apply the above tests to the NMME hindcasts. The NCEP-CFSv2 hindcasts are selected for comparison, so the loss differential for model *i* is defined as

$$d_n = \epsilon_{\mathrm{CFSv2},n}^2 - \epsilon_{i,n}^2.$$

Accordingly, a positive loss differential implies that the squared error of CFSv2 exceeds that of model *i*. For each error comparison, there are 28 years of hindcasts for each lead month and target month. The result of using the sign test to test the median difference in squared error is shown in Fig. 2. Cases in which the squared error of CFSv2 is less than that of the comparison model more frequently than expected by chance (*p* = 0.5) are shown in blue. Conversely, red indicates that CFSv2 squared errors are larger than those of the comparison model more frequently than expected, and white blanks indicate no significant median difference in squared errors. CFSv2 shows no significant difference in performance for most (>70%) target months and leads. At the 3.6% significance level, the counts are expected to exceed the significance threshold about 4 times by chance, which is close to what is observed for the IRI models. In the remaining cases, and aside from the first-lead forecasts of CanCM3, CFSv2 generally outperforms other models in the sense that its squared error is less than that of other models more frequently than expected. The result of applying the Wilcoxon signed-rank test is shown in Fig. 3. The significance level is chosen to be 3.6% to be consistent with the sign test. The Wilcoxon test detects more frequent and stronger differences in squared error than the sign test, but agrees with the sign test in 91% of the cases. As discussed in section 4e, the Wilcoxon test has more power than the sign test and hence is more likely to detect a difference in skill when the null hypothesis is false, which may explain why it detects slightly more differences. The result of applying the permutation test to test vanishing median is shown in Fig. 4. The permutation test agrees with the Wilcoxon test in over 95% of the cases, which is perhaps not surprising given that they both assume the null distribution is symmetric.
We have explored whether the remaining differences can be attributed to violations of symmetry, but different tests for symmetry give conflicting results, so the question is ultimately undecidable for this small sample size.

The result of applying the permutation test to test equality of *mean*-squared error is shown in Fig. 5. Comparison with Fig. 4 reveals some differences: the two permutation tests agree with each other in about 91% of the cases. These differences illustrate the sensitivity to using the mean or median to test equality of skill. Again, assessing whether these differences occur because of lack of symmetry is difficult for this small sample size. The result of applying the Morgan–Granger–Newbold test to test equality of mean-square error is shown in Fig. 6. The Morgan–Granger–Newbold test agrees with the permutation test shown in Fig. 5 in about 93% of the cases. In separate analysis, we find the Gaussian assumption to be reasonable for the errors, so the discrepancies are not likely to be explained by departures from Gaussianity.

For comparison, we also perform the (invalid) equal MSE test based on the *F* distribution (see Fig. 7) and the equal correlation test based on Fisher’s *z* transformation (see Fig. 8). As discussed in section 2, both tests tend to be conservative. Thus, the differences detected by these latter tests ought to be a subset of those detected by Morgan–Granger–Newbold (which also makes the Gaussian error assumption). This is in fact the case: 98% of the differences detected by the *F* test were also detected by Morgan–Granger–Newbold. The correlation test is grossly conservative: it detects almost no differences. Those it does detect were not detected by other tests. The fact that the *F* test has less bias than the correlation test also was anticipated on the basis of Fig. 1.

We now change the base model used for comparison. Furthermore, we pool all lead times and target months together. Because errors are serially correlated, the independence assumption is no longer valid after pooling. To compensate, we use a Bonferroni correction to control the familywise error rate at 1%. A conservative approach is to assume the errors are correlated for 12 months, in which case the Bonferroni correction would be 1% divided by the number of target months (12 months) and lead times (8 leads), which is about 1 in 10 000. Thus, we simply count the number of forecasts in which the squared error exceeds that of the comparison model and compare that count to the 1-in-10 000 critical value from a binomial distribution. The result for all pairwise comparisons is shown in Fig. 9. We also have included the multimodel mean of all models. The figure shows that CanCM3 and CFSv2 tend to outperform other single models, aside from the IRI models (i.e., their squared error is less than that of other models more frequently than expected by chance). Conversely, the GFDL and CFSv1 models tend to perform worse than the other models by this measure. The multimodel mean significantly outperforms all models.
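
The count-versus-critical-value step can be sketched as follows (the brute-force binomial tail below is an illustration, not an optimized routine; the pooled sample size follows the dimensions given above):

```python
from math import comb

def critical_count(n, alpha=1e-4):
    """Smallest k with P(X >= k) <= alpha/2 for X ~ Binomial(n, 1/2),
    i.e., the upper critical count for a two-sided test at level alpha."""
    total = 2 ** n
    tail = 0
    for k in range(n, -1, -1):
        tail += comb(n, k)
        if tail / total > alpha / 2:
            return k + 1
    return 0

# Pooled comparisons: 12 target months x 8 leads x 28 years of win/loss
# outcomes per model pair; the Bonferroni-corrected level is about 1e-4.
n_total = 12 * 8 * 28
k_crit = critical_count(n_total)
# A model pair is flagged only if one model wins at least k_crit times.
```

By symmetry, the lower critical count is `n_total - k_crit`, so a count below that threshold flags superiority of the comparison model instead.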

## 6. Summary and conclusions

This paper showed that certain statistical tests for deciding equality of two skill measures are not valid when skill is computed from the same period or observations. This paper also reviewed alternative tests that are appropriate in such situations even for relatively small sample sizes. The invalid tests include the equality-of-MSE test based on the *F* distribution and the equality-of-correlation test based on Fisher’s *Z* transformation. These tests are not valid when skill is computed from the same period or observations because they do not account for correlations between skill estimates. Importantly, the dependence between skill estimates tends to bias these tests toward indicating no difference in skill. We show that the error due to neglecting these correlations depends almost entirely on the correlation *R* between forecast errors and the correlation Γ between sample correlation skills, both of which can be estimated from data. Monte Carlo simulations demonstrate that these estimates hold very well over a wide range of possible forecast/observation systems. Estimates from actual seasonal hindcasts of Niño-3.4 show that the assumption of contemporaneous independence is violated strongly, implying that the above tests cannot be used reliably to decide if one seasonal prediction system is superior to another over the same hindcast period. These results further imply that care should be taken when constructing confidence intervals on skill since many confidence intervals do not account for dependence between skill estimates.

Four valid statistical procedures for testing equality of skill were reviewed. These tests are based on the difference between two loss functions, called the loss differential, and can account for serial correlation, but only squared error under independent verifications was considered in this paper. The different methods make increasingly restrictive assumptions about the null distribution: the sign test tests the hypothesis that the null distribution has zero median (i.e., the loss differentials are equally likely to be positive or negative), the Wilcoxon signed-rank test further assumes the null distribution is symmetric (i.e., a loss differential of a given magnitude can be positive or negative with equal probability), and the Morgan–Granger–Newbold test further assumes the errors have a normal distribution. The permutation test tests the hypothesis that the null distribution is invariant to permuting model identities. Differences between these tests can be ascribed to differences in the assumed null hypotheses, differences in the statistic used to measure loss differentials, or differences in power. For instance, the sign test uses information only about the direction of skill differences whereas the Wilcoxon signed-rank test takes into account the magnitude of the difference. As a result, the Wilcoxon test has more power than the sign test, which means it is more likely to reject the null hypothesis when it is false, thereby contributing to differences in decisions. Also, the permutation test gives different conclusions for testing zero mean or median, demonstrating sensitivity to the measure used to characterize the null distribution.

The above tests were applied to hindcasts of Niño-3.4 from the NMME project. All (valid) tests give generally consistent results: tests based on the median agree among themselves in over 90% of the cases, and tests based on the mean agree among themselves in over 90% of the cases. Overall, CFSv2 and CanCM3 outperform other single models in the sense that their squared errors are less than those of other models more frequently, although the degree of outperformance is not statistically significant for the IRI models. Conversely, the GFDL-CM2p1 and CFSv1 models perform significantly worse than other models. Although some models outperform others in some sense, it should be recognized that combinations of forecasts are still often significantly more skillful than any single model alone. Indeed, the multimodel mean significantly outperforms all single models.

The above results also confirmed that the *F* test and equal-correlation test are conservative. Specifically, 98% of the differences detected by the *F* test also were detected by Morgan–Granger–Newbold, whereas the equal-correlation test was so conservative that it detected almost no differences. These results suggest that studies that have used these tests (e.g., Doblas-Reyes et al. 2013; Kirtman et al. 2013) probably underestimated detectable differences between forecasts.

Several research directions appear worthy of further investigation. One direction is to generalize the above analysis beyond univariate measures, especially to spatial fields and multiple variables. Some steps in this direction are discussed in Hering and Genton (2011) and Gilleland (2013) and are being explored as part of the Spatial Forecast Verification Methods InterComparison Project (Gilleland et al. 2010). Another direction is to examine the performance of the above methods for highly non-Gaussian variables such as precipitation. Also, comparison of weather and climate forecasts always involves some type of regression correction (e.g., removal of the mean bias), but this correction was ignored here. McCracken (2004) proposed methods for accounting for errors introduced by parameter estimation. Multimodel forecasts raise numerous other questions, such as whether particular combinations of forecasts are more skillful than others, or whether a small subset of models is just as good as the full set. Generalizing the above analysis to such questions would be desirable. Also, the skill measures investigated here used only ensemble mean information: information about the spread of the ensemble was ignored. Various approaches to comparing *probabilistic* forecasts were proposed by Hamill (1999) and deserve further development. An important component of forecast verification is to *understand* the skill differences (e.g., to address whether differences in skill arise from poor representation of physical processes or from faulty initialization). Indeed, in the NMME data used here, the conclusions would have been very different if the discontinuity across 1999 had not been removed prior to comparison. It would be desirable to identify such discontinuities objectively. Thus, generalizing the above analysis to test conditional skill differences would be desirable (Giacomini and White 2006).

## Acknowledgments

We thank an anonymous reviewer for pointing out important references and making insightful comments that led to substantial improvements in the final paper. This research was supported primarily by the National Oceanic and Atmospheric Administration, under the Climate Test Bed program (Grant NA10OAR4310264). Additional support was provided by the National Science Foundation (Grants ATM0332910, ATM0830062, and ATM0830068), National Aeronautics and Space Administration (Grants NNG04GG46G and NNX09AN50G), the National Oceanic and Atmospheric Administration (Grants NA04OAR4310034, NA09OAR4310058, NA10OAR4310210, NA10OAR4310249, and NA12OAR4310091), and the Office of Naval Research (Award N00014-12-1-091). The NMME project and data dissemination are supported by NOAA, NSF, NASA, and DOE. The help of CPC, IRI, and NCAR personnel in creating, updating, and maintaining the NMME archive is gratefully acknowledged. The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.

### APPENDIX A

#### Variance of the Mean-Square Error Ratio

To derive Eq. (10), we let *W* and *Z* be MSEs with the same (nonzero) population mean *μ*. Define the perturbation quantities *W*′ = *W* − *μ* and *Z*′ = *Z* − *μ*. If the perturbation quantities are small compared to the mean, and the standard approximation 1/(1 + *x*) ≈ 1 − *x* for small *x* is invoked, then the ratio *W*/*Z* can be approximated as

$$\frac{W}{Z} = \frac{\mu + W'}{\mu + Z'} \approx \left(1 + \frac{W'}{\mu}\right)\left(1 - \frac{Z'}{\mu}\right) \approx 1 + \frac{W' - Z'}{\mu}, \tag{A1}$$
where only first-order perturbation terms are retained because higher-order terms are assumed to be negligible. This expression allows us to estimate the variance of the ratio as

$$\operatorname{var}\left(\frac{W}{Z}\right) \approx \frac{\operatorname{var}(W') + \operatorname{var}(Z') - 2\operatorname{cov}(W', Z')}{\mu^2}. \tag{A2}$$
In terms of the original variables, this expression implies

$$\operatorname{var}\left(\frac{W}{Z}\right) \approx \frac{\operatorname{var}(W) + \operatorname{var}(Z) - 2\operatorname{cov}(W, Z)}{\mu^2}. \tag{A3}$$
We assume that the forecast errors are independently and identically distributed as a Gaussian with zero mean. Therefore, the expected mean-square error is

$$E[\mathrm{MSE}_i] = E\left[\frac{1}{N}\sum_{n=1}^{N} e_{i,n}^2\right] = \sigma_i^2, \tag{A4}$$

where $e_{i,n}$ denotes the error of the *i*th forecast at the *n*th verification time and $\sigma_i^2$ is its variance.
Gaussian variables have the property (Isserlis' theorem) that if *X*_{1}, *X*_{2}, *X*_{3}, and *X*_{4} are multivariate Gaussian with vanishing means, then

$$E[X_1 X_2 X_3 X_4] = E[X_1X_2]\,E[X_3X_4] + E[X_1X_3]\,E[X_2X_4] + E[X_1X_4]\,E[X_2X_3]. \tag{A5}$$
It follows from this that

$$E[e_{i,n}^4] = 3\sigma_i^4 \tag{A6}$$
and

$$E[e_{1,n}^2 e_{2,n}^2] = \sigma_1^2\sigma_2^2 + 2\left(E[e_{1,n} e_{2,n}]\right)^2. \tag{A7}$$
Therefore, the variance of mean-square error is

$$\operatorname{var}(\mathrm{MSE}_i) = \frac{1}{N^2}\sum_{n=1}^{N}\operatorname{var}(e_{i,n}^2) = \frac{2\sigma_i^4}{N}. \tag{A8}$$
Similarly, the covariance between mean-square errors is

$$\operatorname{cov}(\mathrm{MSE}_1, \mathrm{MSE}_2) = \frac{2\left(E[e_{1,n} e_{2,n}]\right)^2}{N}. \tag{A9}$$
Substituting the above equations into Eq. (A3) yields

$$\operatorname{var}\left(\frac{\mathrm{MSE}_1}{\mathrm{MSE}_2}\right) \approx \frac{2\sigma_1^4 + 2\sigma_2^4 - 4\left(E[e_{1,n} e_{2,n}]\right)^2}{N\mu^2}. \tag{A10}$$
Since equality of mean-square error implies $\sigma_1^2 = \sigma_2^2 = \mu$, the term

$$R = \frac{E[e_{1,n} e_{2,n}]}{\sigma_1 \sigma_2} \tag{A11}$$

gives the correlation between forecast errors, and the variance above reduces to $4(1 - R^2)/N$. This completes the derivation of Eq. (10).

### APPENDIX B

#### Correlation between Sample Correlations

This appendix estimates the correlation between two sample correlations. Consider three centered random variables *X*, *Y*, and *Z*. The sample correlation between *X* and *Y* is defined as

$$\hat\rho_{XY} = \frac{\sum_{n} X_n Y_n}{\sqrt{\left(\sum_{n} X_n^2\right)\left(\sum_{n} Y_n^2\right)}}, \tag{B1}$$
where *X*_{n} denotes the *n*th sample value of *X minus its sample mean*, and similarly for *Y*_{n}. The upper limit of the sums is *N*. A similar expression can be written for the sample correlations $\hat\rho_{XZ}$ and $\hat\rho_{YZ}$. Our goal is to estimate

$$\Gamma = \operatorname{cor}\left(\hat\rho_{XY}, \hat\rho_{XZ}\right) = \frac{\operatorname{cov}\left(\hat\rho_{XY}, \hat\rho_{XZ}\right)}{\sqrt{\operatorname{var}\left(\hat\rho_{XY}\right)\operatorname{var}\left(\hat\rho_{XZ}\right)}} \tag{B2}$$
in terms of the population parameters of the random variables.

Recall that the correlation coefficient is invariant to affine transformations of the variables. Thus, without loss of generality, the population mean and variance of each variable can be set to zero and one, respectively. In this case, a unitary transformation can be applied to reduce the number of samples from *N* to *N* − 1 without altering the sample mean, variance, or covariance. We assume that this transformation has been applied to all three random variables, in which case the upper limit of the sums becomes *N* − 1. For large *N*, the sample covariances and variances are close to their population values, implying that the differences between the estimated covariance and variances relative to their population values are small compared to one. Invoking the standard linear approximation for small *x*, the sample correlation in Eq. (B1) can be approximated as

$$\hat\rho_{XY} \approx \hat{c}_{XY} - \frac{\rho_{XY}}{2}\left(\hat{v}_X + \hat{v}_Y - 2\right), \qquad \hat{c}_{XY} = \frac{1}{N-1}\sum_{n} X_n Y_n, \quad \hat{v}_X = \frac{1}{N-1}\sum_{n} X_n^2. \tag{B3}$$
Similar expressions can be written for $\hat\rho_{XZ}$ and $\hat\rho_{YZ}$.

If the joint random variables *X*, *Y*, and *Z* are drawn independently from the same multivariate normal distribution, then the variance of the sample correlation can be estimated as

Equation (B5) follows from the definition of variance and the fact that the summand has vanishing mean, Eq. (B6) follows from independence of samples, Eq. (B7) follows from algebra and independence, and Eq. (B8) follows from Eq. (A5). Similarly, we have

Following similar calculations, the covariance between sample correlations can be estimated as

where

Using Eq. (A5), these quantities can be expressed as

Multiplying these terms out and simplifying yields

Substituting the above expressions into Eq. (B2) yields

We have ignored the *Z* transformation in the above derivation. The Monte Carlo simulations confirm that this result holds well even for *Z*-transformed variables. This equivalence is probably due to the fact that the *Z* transformation is approximately linear except when the correlations are near ±1.
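The dependence between sample correlation skills can also be seen directly in a small simulation. The following sketch (our own, with hypothetical parameter values) draws many independent replicates of three jointly Gaussian series in which *X* plays the role of the shared observation, and estimates Γ as the correlation across replicates between the two sample correlations involving *X*:

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 30, 4000

# Population correlations: X is shared by both sample correlation skills
C = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.8],
              [0.5, 0.8, 1.0]])
L = np.linalg.cholesky(C)  # requires a positive definite correlation matrix

r_xy = np.empty(reps)
r_xz = np.empty(reps)
for i in range(reps):
    x, y, z = L @ rng.normal(size=(3, N))   # one replicate of the three series
    r_xy[i] = np.corrcoef(x, y)[0, 1]
    r_xz[i] = np.corrcoef(x, z)[0, 1]

# Correlation between the two sample correlation skills (the Gamma of the text)
gamma = np.corrcoef(r_xy, r_xz)[0, 1]
```

With the strongly correlated *Y* and *Z* chosen here, the two sample correlations are themselves strongly correlated, which is exactly the dependence that the *Z*-transform test neglects.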

### APPENDIX C

#### Generating Random Variables with Prescribed Variances and Correlations

A general model for simulating three random variables *o*, *f*_{1}, and *f*_{2} with respective variances $\sigma_o^2$, $\sigma_1^2$, $\sigma_2^2$ and pairwise correlations *ρ*_{o1}, *ρ*_{o2}, and *ρ*_{12} is

$$o = \sigma_o w_o, \qquad f_1 = \sigma_1\left(\rho_{o1} w_o + \sqrt{1 - \rho_{o1}^2}\, w_1\right), \qquad f_2 = a\,o + b\,f_1 + c\,w_2,$$
where we define the random variables $w_o$, $w_1$, and $w_2$ to be independent standard Gaussians.
To obtain the coefficients *a*, *b*, and *c*, we use Eq. (C2) to derive the following covariances between *f*_{2} and other variables:

This yields two equations for the two unknowns *a*, *b* that can be solved. The coefficient *c* can be derived by deriving the variance of *f*_{2} from Eq. (C2) and eliminating *a*, *b* based on the previous solution. The final solution is

Note that the correlations cannot be selected independently because the correlation matrix must be positive definite, which requires

$$1 + 2\rho_{o1}\rho_{o2}\rho_{12} - \rho_{o1}^2 - \rho_{o2}^2 - \rho_{12}^2 \ge 0. \tag{C11}$$
This constraint ensures that Eq. (C10) is real.
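For reference, an equivalent way to generate samples with prescribed variances and correlations is through a Cholesky factor of the correlation matrix; the function below is our own sketch of this alternative (its name and arguments are illustrative, not from the paper):

```python
import numpy as np

def simulate(N, sd, rho_o1, rho_o2, rho_12, rng):
    """Draw N joint Gaussian samples of (o, f1, f2) with standard
    deviations sd = (s_o, s_1, s_2) and the given pairwise correlations."""
    C = np.array([[1.0, rho_o1, rho_o2],
                  [rho_o1, 1.0, rho_12],
                  [rho_o2, rho_12, 1.0]])
    # The Cholesky factorization fails unless C is positive definite,
    # which enforces the realizability constraint discussed above
    L = np.linalg.cholesky(C)
    o, f1, f2 = np.diag(sd) @ L @ rng.normal(size=(3, N))
    return o, f1, f2

rng = np.random.default_rng(0)
o, f1, f2 = simulate(100_000, (1.0, 1.2, 0.9), 0.6, 0.5, 0.4, rng)
```

With a large sample, the sample correlations of the returned series closely match the prescribed values.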

We desire to simulate two forecasts with the same (population) mean-square error. The error of the *i*th forecast is

$$e_i = f_i - o.$$
Therefore, the expected mean-square error is

$$E[e_i^2] = \sigma_i^2 - 2\sigma_o\sigma_i\rho_{oi} + \sigma_o^2.$$
If the differential loss is defined as $d = e_1^2 - e_2^2$, then the mean differential loss is

$$E[d] = \sigma_1^2 - \sigma_2^2 - 2\sigma_o\left(\sigma_1\rho_{o1} - \sigma_2\rho_{o2}\right).$$
Thus, specifying the mean differential loss *E*[*d*] is tantamount to imposing a constraint on *ρ*_{o1} and *ρ*_{o2}. Solving this equation for *ρ*_{o2} yields

where

To ensure that Eq. (C11) is satisfied and the MSEs of the two forecasts are equal, we select parameters according to the following sequence:

where the notation *X* ~ [*a*, *b*] means that *X* is drawn randomly from a uniform distribution between *a* and *b*. The choice in Eq. (C22) with *E*[*d*] = 0 in the definition of *α* in Eq. (C17) ensures that the population mean-square errors of the two forecasts are equal, and the choice in Eq. (C25), obtained by solving the quadratic in *ρ*_{12} in Eq. (C11), ensures that the correlations are realizable (i.e., the correlation matrix is positive semidefinite).

Finally, the covariance between errors is

$$E[e_1 e_2] = \sigma_1\sigma_2\rho_{12} - \sigma_o\sigma_1\rho_{o1} - \sigma_o\sigma_2\rho_{o2} + \sigma_o^2.$$
Therefore, the correlation between forecast errors is

$$R = \frac{E[e_1 e_2]}{\sqrt{E[e_1^2]\,E[e_2^2]}}.$$
We also would like to choose parameters such that the correlation skills *ρ*_{o1} and *ρ*_{o2} are equal. This goal is relatively easier because these parameters occur explicitly in the model. The primary constraint is in Eq. (C11), which, after *r* = *ρ*_{o1} = *ρ*_{o2} has been specified, is a quadratic equation for *ρ*_{12}. Solving this quadratic equation gives the following constraint:

$$2r^2 - 1 \le \rho_{12} \le 1.$$
### APPENDIX D

#### Monte Carlo Simulations

The validity of the analytic formulas derived in section 2 is assessed here using Monte Carlo simulations. Specifically, we construct an idealized forecast/observation system in which the true properties of the system are known exactly. We assume that forecast/observations are drawn from a multivariate Gaussian distribution, so that the statistics are specified completely by nine parameters: three means, three variances, and three pairwise correlations. We further assume that the observation/forecasts for different target dates are independent. For simplicity, we assume the means vanish. Denote the observation and the two forecasts by *o*, *f*_{1}, and *f*_{2}, let their variances be $\sigma_o^2$, $\sigma_1^2$, and $\sigma_2^2$, respectively, and let the pairwise correlations be *ρ*_{o1}, *ρ*_{o2}, and *ρ*_{12}, using an obvious notation. Our algorithm for generating samples from the above multivariate Gaussian distribution is discussed in appendix C.

##### a. Mean-square errors

We first show in Fig. D1 a histogram of the ratio of mean-square errors computed numerically by Monte Carlo methods for a particular choice of model parameters that give *R* = 0.88, which is not unrealistic for practical seasonal forecasting (see Fig. 1). Consistent with the null hypothesis, the parameters are chosen such that the population mean-square errors are equal. We also show the *F* distribution with *N* − 1 and *N* − 1 degrees of freedom, where one degree of freedom is lost due to subtracting out the climatology. The figure shows that the distribution for the mean-square error is more narrowly concentrated around unity than the corresponding *F* distribution, as anticipated.

Although there are six adjustable parameters, our theoretical calculations imply that the sampling distribution depends only on sample size and the correlation between errors *R*. We test this prediction by generating realizations from the multivariate Gaussian for a wide range of parameter choices and plotting the results simply as a function of sample size and *R*. More precisely, we fix the model parameters and generate *N* realizations of observations and forecasts. This procedure is repeated 10 000 times to generate 10 000 estimates of MSE_{1} and MSE_{2}, from which the variance of the MSE ratio is estimated. Then, this entire procedure is repeated *M* = 200 times but using a new set of *randomly chosen* population variances and correlations *ρ*_{o1}, *ρ*_{o2}, *ρ*_{12} (within the constraints of having realizable statistics and equal mean-square errors). Appendix C describes our procedure for generating synthetic data.
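This experiment can be compressed considerably when only the error correlation matters. The sketch below is our own simplification: it draws the two error series directly with correlation *R* and equal variance (equal population MSE), and compares the simulated variance of the MSE ratio with the leading-order estimate 4(1 − *R*²)/*N* implied by the appendix A derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps, R = 50, 20_000, 0.8

# Correlated Gaussian errors with unit variance (equal population MSE)
L = np.linalg.cholesky(np.array([[1.0, R], [R, 1.0]]))
e1, e2 = (L @ rng.normal(size=(2, reps * N))).reshape(2, reps, N)

mse1 = np.mean(e1**2, axis=-1)     # one MSE estimate per replicate
mse2 = np.mean(e2**2, axis=-1)
var_ratio = np.var(mse1 / mse2)    # sampling variance of the MSE ratio

theory = 4 * (1 - R**2) / N        # leading-order estimate
```

Consistent with the figure, the simulated variance comes out somewhat larger than the leading-order estimate because of the higher-order terms neglected in the derivation.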

The variances of the mean-square error ratio generated by the above model are plotted as a function of the (population) parameter *R*^{2} in Fig. D2, for two choices of sample size *N*. The black line shows the theoretical estimate in Eq. (10). The figure reveals that the theoretical estimate underestimates the actual variances, presumably due to higher-order effects that were neglected in the derivation of Eq. (10). An improved estimate can be derived heuristically by assuming that the variance is proportional to (1 − *R*^{2}) and choosing the proportionality constant to be consistent with an *F* distribution for the case *R* = 0. This approach gives

$$\operatorname{var}\left(\frac{\mathrm{MSE}_1}{\mathrm{MSE}_2}\right) \approx \operatorname{var}\left[F_{N-1,\,N-1}\right]\left(1 - R^2\right),$$

where $\operatorname{var}[F_{N-1,N-1}]$ denotes the variance of an *F*-distributed variable with *N* − 1 and *N* − 1 degrees of freedom.
This function is plotted in Fig. D2 and gives a better match to the Monte Carlo results, and converges to Eq. (10) for asymptotically large *N*.

An important implication of Fig. D2 is that the variance of the mean-square error ratio depends on the model parameters only through the parameter *R*. This result is remarkable given that different model parameters correspond to very different signal and noise variances.

##### b. Correlation skill

The difference in *Z*-transformed correlation skills for the forecast/observation system for a specific set of model parameters is shown in Fig. D3. The figure reveals a similar tendency to be more concentrated than the theoretical distribution based on independent correlations. A plot of the (scaled) population variance of *z*_{1} − *z*_{2} as a function of Γ is shown in Fig. D4 for two different sample sizes. The results are well approximated by Eq. (11), despite very different signal and noise variances in different models. Given this tight relation, we can anticipate that testing equality of correlation skills using the *Z*-transform method will give misleading results when Γ differs greatly from 0. Although the population value of this parameter generally is unknown, it can be estimated from the available data.

##### c. Estimation of *R* and Γ

As emphasized in section 2, the correlation between errors *R* and the correlation Γ between correlation skills can be estimated from available data. A comparison of sample estimates and exact values is shown in Fig. D5. The shaded region shows the 95% confidence interval computed from the sample, and the dashed lines show the upper and lower limits of the 95% confidence interval computed from a standard correlation test (e.g., as derived from Fisher's *Z* transformation). The figure suggests that the formulas provide unbiased estimates of these parameters, with an uncertainty that behaves like that of a correlation coefficient.
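Concretely, the sample estimate of *R* is just the correlation between the two error series; a minimal helper (ours, not from the paper, with hypothetical array arguments) might look like:

```python
import numpy as np

def error_correlation(obs, f1, f2):
    """Sample estimate of R, the correlation between forecast errors."""
    e1 = np.asarray(f1) - np.asarray(obs)
    e2 = np.asarray(f2) - np.asarray(obs)
    return np.corrcoef(e1, e2)[0, 1]
```

In practice this would be applied to hindcast anomalies after the climatology (and any known discontinuities) have been removed.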

## REFERENCES

Conover, W. J., 1980: *Practical Nonparametric Statistics*. 2nd ed. Wiley-Interscience, 493 pp.

Davison, A. C., and D. V. Hinkley, 1997: *Bootstrap Methods and Their Application*. Cambridge University Press, 582 pp.

Efron, B., and R. J. Tibshirani, 1993: *An Introduction to the Bootstrap*. Chapman and Hall, 456 pp.

*Climate Dyn.,* **40,** 245–272.

Good, P. I., 2005: *Permutation, Parametric and Bootstrap Tests of Hypotheses*. 3rd ed. Springer, 315 pp.

Good, P. I., 2006: *Resampling Methods*. 3rd ed. Birkhäuser, 218 pp.

Kirtman, B. P., and Coauthors, 2013: Near-term climate change: Projections and predictability. *Climate Change 2013: The Physical Science Basis,* T. Stocker et al., Eds., Cambridge University Press, 953–1028.

Kirtman, B. P., and Coauthors, 2014: The North American Multimodel Ensemble: Phase-1 seasonal-to-interannual prediction; phase-2 toward developing intraseasonal prediction. *Bull. Amer. Meteor. Soc.,* **95,** 585–601.

Lahiri, S. N., 2003: *Resampling Methods for Dependent Data*. Springer, 374 pp.

Scheffé, H., 1959: *The Analysis of Variance*. John Wiley and Sons, 477 pp.