## 1. Introduction

Hindcast experiments are routinely generated to detect systematic biases of forecast systems and to assess forecast quality. Hindcast data from a competing forecast system are often available as well, be it a low-resolution version of the same forecast system, the system of a competing forecast institution, or a simple statistical benchmark forecast. It is then of interest to ask whether the forecast system at hand offers an improvement over the competitor. A very common measure of forecast skill is the (Pearson product-moment) correlation coefficient between forecasts and observations. To answer the question of improvement, the difference between the correlation coefficients of the two systems can be considered. Furthermore, in order to assess the robustness of an observed difference in correlation, some measure of its uncertainty must be calculated.

As pointed out by Jolliffe (2007): “The value of a verification measure on its own is of little use; it also needs some quantification of the uncertainty associated with the observed value” (p. 637). Uncertainty quantification is important to distinguish genuine improvements in forecast skill from random sampling variability due to the finite hindcast samples. Jolliffe (2007) presents various statistical methods to quantify uncertainty in forecast skill and differences in forecast skill. DelSole and Tippett (2014) show that commonly used statistical tests for comparing skill of climate forecasts make the questionable assumption that the competing forecasts are independent. They show that this assumption can invalidate the test results, and suggest suitable alternatives.

The present paper complements Jolliffe (2007) and DelSole and Tippett (2014) by reviewing statistical methods that are directly applicable to testing for differences in correlation forecast skill, and by emphasizing the power of statistical tests to detect skill improvements. Section 2 briefly reviews the correlation coefficient, statistical hypothesis testing, and confidence intervals. Section 3 describes the hypothesis test currently most widely used for quantifying uncertainty of a correlation difference. A hypothesis test by Steiger (1980) and an approximate method for calculating confidence intervals by Zou (2007) are suggested as more appropriate methods for comparing correlation coefficients of two forecasts for the same set of observations. In section 4, the different statistical methods are applied to datasets of seasonal near-surface air temperature forecasts. The analyses provide detailed examples of how the different test statistics are calculated in practice. It is shown that the alternative tests indicate significant improvements in forecast skill where the traditional test does not. In section 5, the differences between the tests are assessed by analyzing their type-I error rates (the probability of falsely detecting an improvement) and their power (the probability of correctly detecting an improvement). It is shown that the traditional test can have a type-I error rate that is too low, and that the alternative test has higher power and thus increases the chance of detecting genuine improvements in forecast skill. Section 6 compares predictions of climate indices (ENSO and NAO) using model versions with different resolutions. Section 7 concludes the paper with a discussion and additional remarks.

## 2. Basic concepts

Consider two competing forecast systems, denoted A and B, that issue forecasts of the same observable quantity $Y$. A hindcast dataset generated by A and B for the same observation $Y$ consists of a series of triplets $(a_t, b_t, y_t)$ of the two forecasts and the verifying observation at times $t = 1, \dots, n$.

To quantify uncertainty of the correlation coefficient, it is often assumed that the hindcast dataset is a random sample from an infinite population of forecasts and observations. The sample correlations $r_{AY}$ and $r_{BY}$ are then estimates of the unknown population correlations $\rho_{AY}$ and $\rho_{BY}$. The question of *whether* forecast B offers an improvement over forecast A is a *testing problem* that can be addressed by significance testing. The question of how big an improvement is, is an *estimation problem* that can be addressed by confidence intervals.

Tests for improvements in correlation skill assume null hypotheses such as $H_0{:}\ \rho_{BY} - \rho_{AY} = 0$, that is, zero difference in correlation skill. A test statistic $T$ is calculated, which is a function of the hindcast data, and whose sampling distribution is known if $H_0$ is true. Based on the observed value of $T$, say $t_{\mathrm{obs}}$, a $p$ value is calculated (i.e., the chance of observing a value of $T$ that is more extreme than $t_{\mathrm{obs}}$ if $H_0$ were true). A small $p$ value such as $p < 0.05$ is interpreted as evidence against $H_0$, which is then rejected. Note that the $p$ value is not the probability that $H_0$ is true. By rejecting $H_0$ whenever the $p$ value is smaller than 0.05, we accept a 5% chance of mistakenly rejecting a true null hypothesis.

## 3. Methods

In this section we summarize statistical methods for hypothesis testing and confidence intervals of correlation coefficients. In the atmospheric sciences literature, methods to quantify uncertainty in a single correlation coefficient are well known. However, statistical methods to calculate hypothesis tests and confidence intervals for the difference between correlation coefficients are less well known.

### a. Hypothesis tests and confidence intervals for a single correlation coefficient

To test the null hypothesis that a single population correlation coefficient is zero, the test statistic
$$ t = r \sqrt{\frac{n-2}{1-r^2}} $$
is calculated from the sample correlation coefficient $r$; under the null hypothesis, it follows a $t$ distribution with $n - 2$ degrees of freedom (Von Storch and Zwiers 2001, section 8.2.3). The test assumes that the hindcast data are independently and identically normally distributed. The observed value of the statistic is compared with the quantiles of the $t$ distribution. A one-sided test at significance level $\alpha$ would thus reject the null hypothesis if the statistic exceeds the $(1-\alpha)$ quantile of the $t$ distribution.
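As an illustrative sketch (not part of the original paper), the test statistic can be computed in a few lines of Python; the function name `corr_t_statistic` is our own:

```python
import math

def corr_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, given the sample correlation r
    and sample size n; follows a t distribution with n - 2 degrees of
    freedom under H0 (for iid normal data)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Example: a sample correlation of 0.5 from n = 27 hindcast years
t = corr_t_statistic(0.5, 27)  # about 2.887
```

Comparing `t` with the 0.95 quantile of the $t$ distribution with 25 degrees of freedom (about 1.708) would reject the null hypothesis of zero correlation in a one-sided test at the 5% level.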

Confidence intervals for a single correlation coefficient are commonly based on the Fisher transformation (the $z$ transform), defined by
$$ z(r) = \frac{1}{2} \log\!\left(\frac{1+r}{1-r}\right) = \operatorname{atanh}(r) $$
(Von Storch and Zwiers 2001, section 8.2.3). For data that are identically and independently normally distributed, the Fisher transformation of the sample correlation coefficient is approximately normally distributed with mean $z(\rho)$ and variance $(n-3)^{-1}$. A central $(1-\alpha)$ confidence interval for $\rho$ is obtained by transforming back the normal confidence limits:
$$ \left[\, \tanh\!\left( z(r) - \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \right),\ \tanh\!\left( z(r) + \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \right) \right], $$
where $z_p$ denotes the $p$ quantile of the standard normal distribution (e.g., $z_{0.975} \approx 1.96$).
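The Fisher-transform interval can be sketched as follows (our illustration; `fisher_ci` is a hypothetical helper, and the 0.975 normal quantile is hard-coded to keep the example dependency-free):

```python
import math

Z_975 = 1.959964  # 0.975 quantile of the standard normal distribution

def fisher_ci(r, n, z_crit=Z_975):
    """Central 95% confidence interval for a population correlation,
    based on the Fisher z transform with variance 1/(n - 3)."""
    z = math.atanh(r)                 # Fisher transform of the sample correlation
    half = z_crit / math.sqrt(n - 3)  # half-width on the z scale
    return math.tanh(z - half), math.tanh(z + half)

# Example: r = 0.5 with n = 30 gives roughly the interval (0.17, 0.73)
lo, hi = fisher_ci(0.5, 30)
```

The interval is asymmetric around $r$ on the correlation scale, reflecting the skewness of the sampling distribution near the bounds $\pm 1$.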

### b. Testing and estimating the difference of two overlapping correlations

Since the correlations $r_{AY}$ and $r_{BY}$ share the common variable $Y$, they are called *overlapping* (Zou 2007). The following test presented by Steiger (1980) [based on results from Williams (1959)] tests equality of overlapping correlations (i.e., the null hypothesis $\rho_{AY} = \rho_{BY}$). The test statistic is defined in terms of the determinant $|R|$ of the correlation matrix as
$$ t = (r_{BY} - r_{AY}) \sqrt{ \frac{(n-1)(1 + r_{AB})}{ 2 |R| \frac{n-1}{n-3} + \bar{r}^2 (1 - r_{AB})^3 } }, $$
where $\bar{r} = (r_{AY} + r_{BY})/2$, $r_{AB}$ is the sample correlation between the two forecasts, and $|R|$ is the determinant of the $3 \times 3$ correlation matrix of the two forecasts and the observations,
$$ |R| = 1 - r_{AY}^2 - r_{BY}^2 - r_{AB}^2 + 2\, r_{AY}\, r_{BY}\, r_{AB}. $$
The test statistic follows a $t$ distribution with $n - 3$ degrees of freedom under the null hypothesis of zero correlation difference. If the observed value is unlikely under this distribution, the null hypothesis is rejected. A one-sided test at significance level $\alpha$ would thus compare the observed statistic with the $(1-\alpha)$ quantile of the $t$ distribution with $n - 3$ degrees of freedom, and reject the null hypothesis of zero correlation difference if the statistic exceeds this quantile.
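The Steiger/Williams statistic translates directly into code. The following Python sketch is our illustration of the formula (the function name is our own):

```python
import math

def steiger_t(r_ay, r_by, r_ab, n):
    """Williams/Steiger t statistic for H0: rho_AY = rho_BY (overlapping
    correlations); follows a t distribution with n - 3 degrees of
    freedom under H0."""
    # determinant of the 3x3 correlation matrix of (A, B, Y)
    det_r = (1 - r_ay ** 2 - r_by ** 2 - r_ab ** 2
             + 2 * r_ay * r_by * r_ab)
    r_bar = (r_ay + r_by) / 2
    num = (n - 1) * (1 + r_ab)
    den = 2 * det_r * (n - 1) / (n - 3) + r_bar ** 2 * (1 - r_ab) ** 3
    return (r_by - r_ay) * math.sqrt(num / den)

# Example: n = 23, r_AY = 0.5, r_BY = 0.7, forecasts correlated at r_AB = 0.8
t = steiger_t(0.5, 0.7, 0.8, 23)  # about 1.99
```

For a one-sided test at the 5% level, `t` would be compared with the 0.95 quantile of the $t$ distribution with 20 degrees of freedom (about 1.725). A test that ignored the correlation $r_{AB}$ between the forecasts would typically produce a smaller statistic for the same data.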

## 4. Application to seasonal near-surface air temperature forecasts

### a. Description of the data

A comparison of seasonal climate forecasts serves here as a practical example. Forecasts of average summer (JJA) near-surface air temperatures are initialized on 1 May for the hindcast period 1993–2009.

The hindcast experiment addresses the effect of initializing the land surface conditions. A more realistic initialization of the land surface conditions is expected to have particular impact on the prediction of summer temperatures over landmasses. One forecast, denoted forecast A, was generated by using the same climatological land surface conditions to initialize the forecast in each year. We computed the climatology of surface parameters (soil moisture and temperature at all soil levels, and the albedo, depth, density, and temperature of the snow layer) by taking their 1993–2009 averages in a window of 10 days centered around the initialization date 1 May, using data from the ERA-Interim/Land global reanalysis dataset (Balsamo et al. 2015). The 10-day window ensures a robust estimate of the climatology. The other set of forecasts, denoted forecast B, was initialized with the actual land surface parameters on the initialization date in the respective year, taken from the ERA-Interim/Land dataset. All model hindcasts were carried out with the global climate system model EC-Earth3 (Hazeleger et al. 2012), which has been widely used for studying intraseasonal to multiannual predictability and climate projections (Doblas-Reyes et al. 2013a). Hindcasts are initialized with reanalysis data from Global Ocean Reanalysis and Simulations, version 1 (GLORYS2v1) for the ocean (Ferry et al. 2012), ERA-Interim reanalysis data for the atmosphere (Dee et al. 2011), ERA-Interim/Land data for the land surface (Balsamo et al. 2015), and sea ice initial conditions from Guemas et al. (2014). Each prediction is calculated as the mean over 10 ensemble members initialized by atmospheric singular vectors. Surface temperature data from the ERA-Interim reanalysis were used as verifying observations.

We evaluate differences in correlation skill at each land grid point individually, and also for area averages. The area averages are calculated for four regions defined in the SREX special report of the IPCC (IPCC 2012). The region specifications are given in Table 1. These four regions either lie in semiarid climates, where land surface–atmosphere interactions play an important role for the energy balance, or are regions for which land surface–atmosphere couplings were previously reported in the literature (Koster et al. 2004; Zhang et al. 2011; Bellprat et al. 2013). The time series plots of the area-weighted temperature averages for the four regions are shown in Fig. 1. It can be noted that forecasts and observations have negligible serial correlation.

Region specifications. The regions are also indicated in Fig. 2.

### b. Correlation analysis

We first provide a detailed example of how the various test statistics, *p* values, and confidence limits of section 3 are calculated. We use the time series of the central European (CEU) region (Fig. 1a) for illustration. The sample correlations of the forecasts with the observations are *p* value of 0.11 under the standard normal distribution. That is, if the null hypothesis of zero correlation difference were true (and if the forecasts were uncorrelated), *t* test of Steiger (1980), which accounts for the correlation between the forecasts, we obtain a value of the test statistic of *p* value under the *t* distribution with

Table 2 summarizes the correlation analysis of the four time series of Fig. 1. In all examples, the *t* test based on the test statistic *p* values than the test based on *p* value of 0.37, indicating that the observed value of *p* value is very small, leading to rejection of *p* values are very similar. On the other hand, the correlation differences in regions EAS and northeastern Brazil (NEB) are very similar, but the *p* values are very different. Last, we note that as a result of the soil moisture–temperature feedback (dry/wet conditions lead to warmer/colder temperatures), the variance of the forecast system B is slightly higher than the variance of forecast A in all four regions.
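The confidence intervals for correlation differences reported alongside these tests follow Zou (2007). A Python sketch of that construction (our implementation of the published formulas, not code from the paper):

```python
import math

Z_975 = 1.959964  # 0.975 quantile of the standard normal distribution

def zou_ci(r_ay, r_by, r_ab, n, z_crit=Z_975):
    """Approximate central 95% confidence interval for rho_BY - rho_AY
    for overlapping correlations, following Zou (2007)."""
    def fisher_ci(r):
        z, half = math.atanh(r), z_crit / math.sqrt(n - 3)
        return math.tanh(z - half), math.tanh(z + half)

    l1, u1 = fisher_ci(r_by)
    l2, u2 = fisher_ci(r_ay)
    # correlation between the two sample correlations (Pearson-Filon form)
    c = ((r_ab * (1 - r_ay ** 2 - r_by ** 2)
          - 0.5 * r_ay * r_by * (1 - r_ay ** 2 - r_by ** 2 - r_ab ** 2))
         / ((1 - r_ay ** 2) * (1 - r_by ** 2)))
    d = r_by - r_ay
    lower = d - math.sqrt((r_by - l1) ** 2 + (u2 - r_ay) ** 2
                          - 2 * c * (r_by - l1) * (u2 - r_ay))
    upper = d + math.sqrt((u1 - r_by) ** 2 + (r_ay - l2) ** 2
                          - 2 * c * (u1 - r_by) * (r_ay - l2))
    return lower, upper
```

By construction the interval contains the observed difference and shrinks as $n$ grows; unlike a naive combination of two independent Fisher intervals, it accounts for the correlation between the two forecasts through $r_{AB}$.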

Table summarizing correlation coefficients, hypothesis tests, and confidence intervals for the data shown in Fig. 1.

Figure 2 shows correlation coefficients

The correlation differences at individual grid points are shown in the top panel of Fig. 3. Stippled points indicate grid points where the one-sided test of Steiger (1980) yields a *p* value smaller than 0.05 and, therefore, rejects the null hypothesis of zero correlation difference at the 5% level.

We comment on field significance, following the procedure first proposed by Livezey and Chen (1983). There is a total of *k* would follow a binomial distribution with size *n* and success probability 0.05. The chance of observing a value at least as large as *n*. In order for a fraction *n* would have to be smaller than 725. A detailed estimation of the spatial degrees of freedom is outside the scope of this study, but we can provide a rough estimate based on visual inspection. The decorrelation length of the data is about 10 grid cells, which suggests that the map consists of independent circular regions, each consisting of about 80 grid cells. Our estimate of the effective degrees of freedom is thus about
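The binomial argument for field significance can be sketched as follows; the effective number of independent tests `n_eff` and the rejection count in the example are illustrative placeholders, not the values from this study:

```python
from math import comb

def binom_sf(k, n, p):
    """P(K >= k) for K ~ Binomial(n, p): the chance of at least k local
    rejections among n independent tests at per-test level p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical example: 12 local rejections among n_eff = 100
# effective independent tests at the 5% level
p_field = binom_sf(12, 100, 0.05)
```

A small `p_field` indicates that the observed number of local rejections is unlikely under the global null hypothesis of no skill difference anywhere; a proper treatment would estimate `n_eff` from the spatial decorrelation structure (Livezey and Chen 1983).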

According to Table 2, the *p* values of the two tests based on *p* value of *p* values. It happens that the *p* value is smaller than 0.05, but the

## 5. Type-I error rate and power analysis

The present section addresses two important questions concerning statistical tests of correlation forecast skill:

- If forecasts A and B had equal skill (i.e., $\rho_{AY} = \rho_{BY}$), how frequently does a given statistical test (falsely) reject the null hypothesis of zero correlation difference?
- If forecast B were more skillful than forecast A (i.e., $\rho_{BY} > \rho_{AY}$), how often does a given statistical test (correctly) reject the null hypothesis of zero correlation difference?

In the first question, a rejection of the null hypothesis is a false positive, known as a type-I error. At the 5% significance level, only 5% of the *p* values should be smaller than 0.05 if the null hypothesis is true. Well-calibrated *p* values and confidence intervals should, by definition, have a type-I error rate equal to the nominal significance level (e.g., a rate of 5% at the 5% level).

In the second question, a rejection of the null hypothesis is a correct decision. The rate at which a test correctly rejects a false null hypothesis is called the *power* of the test. Power increases with the magnitude of the true difference and with the sample size *n* (Cohen 1992). Obviously, statistical tests with high power are desirable. An estimate of the power of the different tests will be useful because power characterizes our ability to detect improvements in forecast quality.

*ρ*s are in

To calculate power and type-I error rate of a given test, we use the following protocol:

- Fix values for the population correlations $\rho_{AY}$, $\rho_{BY}$, and $\rho_{AB}$, as well as the sample size $n$.
- Draw $n$ triplets $(a_t, b_t, y_t)$ from the corresponding trivariate normal distribution, and interpret these data as a hindcast dataset of size $n$ of two competing forecast systems A and B for the same observation.
- Perform the given hypothesis test of the null hypothesis $\rho_{AY} = \rho_{BY}$.
- Record whether or not the test rejects the null hypothesis.
- Repeat steps 2–4 a large number of times, each time with a different realization of artificial hindcast data.
- Calculate the fraction of rejected null hypotheses.
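The protocol above can be sketched in Python; this is our illustration, with the Steiger statistic of section 3b reimplemented inline and the required $t$ quantile hard-coded for $n = 30$ to avoid external dependencies:

```python
import math
import numpy as np

def steiger_t(r_ay, r_by, r_ab, n):
    """Williams/Steiger t statistic for H0: rho_AY = rho_BY."""
    det_r = 1 - r_ay**2 - r_by**2 - r_ab**2 + 2 * r_ay * r_by * r_ab
    r_bar = (r_ay + r_by) / 2
    num = (n - 1) * (1 + r_ab)
    den = 2 * det_r * (n - 1) / (n - 3) + r_bar**2 * (1 - r_ab)**3
    return (r_by - r_ay) * math.sqrt(num / den)

def rejection_rate(rho_ay, rho_by, rho_ab, n=30, n_rep=1000, seed=1):
    """Fraction of one-sided rejections of H0: rho_AY = rho_BY over
    n_rep synthetic hindcast datasets drawn from a trivariate normal."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho_ab, rho_ay],
                    [rho_ab, 1.0, rho_by],
                    [rho_ay, rho_by, 1.0]])
    t_crit = 1.703  # 0.95 quantile of t with n - 3 = 27 dof (valid for n = 30)
    rejections = 0
    for _ in range(n_rep):
        sample = rng.multivariate_normal(np.zeros(3), cov, size=n)
        r = np.corrcoef(sample, rowvar=False)  # sample correlations of (A, B, Y)
        if steiger_t(r[0, 2], r[1, 2], r[0, 1], n) > t_crit:
            rejections += 1
    return rejections / n_rep

# Equal skill: the rejection fraction estimates the type-I error rate;
# unequal skill: it estimates the power of the test.
type1 = rejection_rate(0.5, 0.5, 0.8)
power = rejection_rate(0.4, 0.7, 0.8)
```

Varying `n` (and the hard-coded quantile accordingly) and the three population correlations reproduces power curves of the kind discussed in this section.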

If $\rho_{AY} = \rho_{BY}$, the calculated fraction of rejections is an estimate of the type-I error rate of the test; if $\rho_{BY} > \rho_{AY}$, the fraction of rejections is an estimate of the power of the test.

We have analyzed type-I error rates of the hypothesis tests and confidence intervals presented in section 3. We simulate artificial hindcast datasets of sample size *n* are generally different from the population values, and different from each other. For a number of values of *p* value of the two-sided test is smaller than 0.05, and if the central 95% confidence interval does not overlap the value zero. Since we chose

Figure 4 shows that empirical type-I error rates are not always equal to the nominal value of *n* (not shown).

We also compare statistical power of one-sided tests for improvement based on the test statistics *p* value is smaller than 0.05. Since

Power of statistical tests based on test statistic

In medical research, for example, it is common practice to demand that statistical tests at a significance level of 5% should achieve power of at least 80% (Cohen 1992). In three of the settings of Table 3, the power of the test based on *n*. For the correlation structure of region NEB, power greater than 80% is already achieved at sample sizes of

The dependency between correlation structure and power is not straightforward, and it is worth analyzing this dependency further. Figure 6 shows how power of the test based on

## 6. Improved predictions of climate indices by increasing model resolution

In this section we present an additional application of the statistical methodology of this paper. A standard approach to evaluate the ability of forecast systems at predicting regional climate variability is to check their skill to forecast the main modes of climate variability, such as El Niño–Southern Oscillation (ENSO; Trenberth 1997) or the North Atlantic Oscillation (NAO; Hurrell 1995). In this context, a rigorous application of statistical tests is essential to compare different forecast systems. The size of the sample of predictions that is used to compare the skill of different forecast systems should be chosen depending on the initial skill of the forecast system. Additionally, the high similarities between two versions of the same forecast system have to be taken into account when evaluating skill improvement. As an illustration, the EC-Earth model initialized in May with a resolution of *p* value based on the test statistics *p* value based on

Seasonal forecasting of the NAO is more challenging than that of ENSO. Seasonal forecast systems typically obtain correlation skill between 0 and 0.3 at predicting the winter NAO on seasonal time scales (Shi et al. 2015). At commonly used sample sizes of around 20, these values are not statistically significant at the 5% level.

## 7. Conclusions

A commonly used statistical test for detecting improvement in correlation skill was shown to be too conservative and underpowered, because it assumes that the two competing forecasts are uncorrelated with one another. Using an appropriate test that correctly accounts for the (typically high) correlation between forecasts improves the power of detecting genuine increases in forecast skill. We therefore strongly recommend the test by Steiger (1980), which accounts for this correlation, for comparing the correlation skill of two forecast systems verified against the same observations.

The importance of power analysis has been pointed out in the climate literature by Jolliffe (2007) and Wilks (2010). Power analysis is common practice in designing medical studies, in order to determine the sample size necessary to detect a hypothesized effect of a given treatment. To our knowledge, power is not currently considered when designing hindcast experiments for comparing climate forecast systems. But with insufficient sample sizes, significant differences in forecast skill are unlikely to be detected, which limits the usefulness of the computationally expensive hindcast simulations. Clearly, analyzing differences in forecast skill is not the only purpose for which hindcast datasets are simulated; other applications include diagnosing model errors and calculating bias corrections. But these applications are subject to statistical uncertainties as well. There is always a chance of falsely diagnosing a model error, or of failing to diagnose an existing forecast bias, due to insufficient sample size. Given the computational resources required to run hindcast experiments with state-of-the-art global climate forecast systems, statistical power should be taken more seriously whenever significance testing and confidence intervals are used to diagnose improvements. The present study demonstrates a simple simulation-based framework for investigating statistical power, which could be exploited for better design of hindcast experiments. Using the framework for power analysis presented here, more general settings can be analyzed. The present study focused only on differences in correlation skill of univariate data. In actual hindcast datasets, data are high dimensional, spatially and temporally correlated, and possibly nonnormal. These settings should be considered in future studies.

Comparative verification studies are often performed between forecast systems that are not very different from each other. It can be hypothesized that any improvement of one forecast over another is necessarily small (i.e.,

As in the present study, forecasts are often calculated by averaging over a finite number of ensemble forecasts to average out internal model variability, and thus obtain a better estimate of the predictable signal of the model. Depending on the ensemble size, and the signal-to-noise ratio of the ensemble forecasts, there might be an inherent upper bound on the achievable correlation skill. Such an upper bound limits the possible magnitudes of improvement that, in turn, limits the power of detecting any improvements of ensemble forecasts.

Furthermore, power might be different for different evaluation criteria than correlation skill. But for different evaluation criteria, the notion of which forecast is better changes—we might find that forecast B has higher correlation than forecast A, but a worse ROC statistic or Brier score [for definitions, see Jolliffe and Stephenson (2012)] than forecast A. Given that different criteria yield different definitions of “improvement,” we do not generally recommend a comparison of statistical power between different evaluation criteria.

We have shown in the appendix that correlation is closely related to the mean squared error (MSE), so instead of analyzing differences in correlation one might analyze differences in the MSE of the recalibrated forecasts. The MSE has the benefit that it is a scoring rule; that is, it assigns an individual value to each pair of forecast and observation.

This paper presented appropriate statistical tests for analyzing skill improvements, and power analysis as a method to evaluate such tests. The proposed tests were used to analyze seasonal hindcast datasets as practical examples, but can be applied to short-term weather forecasting and climate projections as well. It was shown that realistic land surface representation leads to significantly higher correlation skill in temperature forecasts. It was further shown that increased atmosphere and ocean resolution leads to significantly improved correlation of ENSO forecasts. For NAO predictions, for which most current systems have low skill, it was shown that very large hindcast datasets would be required to detect small increases in skill with sufficiently high power.

The authors acknowledge support by the European Union Program FP7/2007-13 under Grant Agreement 3038378 (SPECS). The work of O. Bellprat was funded by ESA under the Climate Change Initiative (CCI) Living Planet Fellowship VERITAS-CCI. Acknowledgment is made for the use of ECMWF’s computing and archive facilities in this research, and the computer resources, technical expertise, and assistance provided by the Red Española de Supercomputación. The views expressed herein are those of the authors and do not necessarily reflect the views of their funding bodies or any of their subagencies. We wish to thank Timothy DelSole, two anonymous reviewers, and the editor for their comments that helped to improve the quality of the paper.

# APPENDIX

## Derivation: Correlation Squared is a Skill Score
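The body of this appendix did not survive extraction; the following is our sketch of the standard argument [cf. Murphy and Epstein (1989)], not the paper's original text. Consider a forecast $f$ that is linearly recalibrated by least squares regression onto the observations $y$.

```latex
% Recalibrated forecast: \hat{f}_t = \bar{y} + \beta (f_t - \bar{f}),
% with the least-squares coefficient \beta = r \, s_y / s_f, where r is
% the sample correlation and s_f, s_y the sample standard deviations.
\mathrm{MSE}(\hat{f})
  = \frac{1}{n} \sum_{t=1}^{n} \bigl(\hat{f}_t - y_t\bigr)^2
  = s_y^2 \left(1 - r^2\right).
% The climatological reference forecast \bar{y} has
\mathrm{MSE}_{\mathrm{clim}}
  = \frac{1}{n} \sum_{t=1}^{n} \bigl(\bar{y} - y_t\bigr)^2
  = s_y^2 .
% Hence the MSE skill score of the recalibrated forecast is
\mathrm{SS}
  = 1 - \frac{\mathrm{MSE}(\hat{f})}{\mathrm{MSE}_{\mathrm{clim}}}
  = r^2 .
```

That is, the squared correlation equals the MSE skill score of the optimally recalibrated forecast, with climatology as the reference.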

## REFERENCES

Balsamo, G., and Coauthors, 2015: ERA-Interim/Land: A global land surface reanalysis data set. *Hydrol. Earth Syst. Sci.*, **19**, 389–407, doi:10.5194/hess-19-389-2015.

Bellprat, O., S. Kotlarski, D. Lüthi, and C. Schär, 2013: Physical constraints for temperature biases in climate models. *Geophys. Res. Lett.*, **40**, 4042–4047, doi:10.1002/grl.50737.

Cohen, J., 1992: Statistical power analysis. *Curr. Dir. Psychol. Sci.*, **1**, 98–101, doi:10.1111/1467-8721.ep10768783.

Dee, D. P., and Coauthors, 2011: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. *Quart. J. Roy. Meteor. Soc.*, **137**, 553–597, doi:10.1002/qj.828.

DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. *Mon. Wea. Rev.*, **142**, 4658–4678, doi:10.1175/MWR-D-14-00045.1.

Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. *J. Bus. Econ. Stat.*, **13** (3), 134–144, doi:10.1198/073500102753410444.

Doblas-Reyes, F. J., J. García-Serrano, F. Lienert, A. P. Biescas, and L. R. L. Rodrigues, 2013a: Seasonal climate predictability and forecasting: Status and prospects. *Wiley Interdiscip. Rev.: Climate Change*, **4**, 245–268, doi:10.1002/wcc.217.

Doblas-Reyes, F. J., and Coauthors, 2013b: Initialized near-term regional climate change prediction. *Nat. Commun.*, **4**, 1715, doi:10.1038/ncomms2704.

Du, H., F. Doblas-Reyes, J. García-Serrano, V. Guemas, Y. Soufflet, and B. Wouters, 2012: Sensitivity of decadal predictions to the initial atmospheric and oceanic perturbations. *Climate Dyn.*, **39**, 2013–2023, doi:10.1007/s00382-011-1285-9.

Ferry, N., and Coauthors, 2012: GLORYS2V1 global ocean reanalysis of the altimetric era (1993–2009) at meso scale. *Mercator Ocean Quart. Newsl.*, **44**, 28–39. [Available online at http://www.mercator-ocean.fr/wp-content/uploads/2015/05/Mercator-Ocean-newsletter-2012_44.pdf.]

Guemas, V., F. J. Doblas-Reyes, K. Mogensen, S. Keeley, and Y. Tang, 2014: Ensemble of sea ice initial conditions for interannual climate predictions. *Climate Dyn.*, **43**, 2813–2829, doi:10.1007/s00382-014-2095-7.

Hazeleger, W., and Coauthors, 2012: EC-Earth V2.2: Description and validation of a new seamless earth system prediction model. *Climate Dyn.*, **39**, 2611–2629, doi:10.1007/s00382-011-1228-5.

Hurrell, J. W., 1995: Decadal trends in the North Atlantic Oscillation: Regional temperatures and precipitation. *Science*, **269**, 676–679, doi:10.1126/science.269.5224.676.

IPCC, 2012: *Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation*. Cambridge University Press, 582 pp. [Available online at http://ipcc-wg2.gov/SREX/report/full-report/.]

Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. *Wea. Forecasting*, **22**, 637–650, doi:10.1175/WAF989.1.

Jolliffe, I. T., and D. B. Stephenson, Eds., 2012: *Forecast Verification: A Practitioner's Guide in Atmospheric Science*. 2nd ed. John Wiley & Sons, 292 pp.

Keenlyside, N. S., M. Latif, J. Jungclaus, L. Kornblueh, and E. Roeckner, 2008: Advancing decadal-scale climate prediction in the North Atlantic sector. *Nature*, **453**, 84–88, doi:10.1038/nature06921.

Koster, R. D., and Coauthors, 2004: Regions of strong coupling between soil moisture and precipitation. *Science*, **305**, 1138–1140, doi:10.1126/science.1100217.

Livezey, R. E., and W. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. *Mon. Wea. Rev.*, **111**, 46–59, doi:10.1175/1520-0493(1983)111<0046:SFSAID>2.0.CO;2.

Merchant, C. J., and Coauthors, 2014: Sea surface temperature datasets for climate applications from Phase 1 of the European Space Agency Climate Change Initiative (SST CCI). *Geosci. Data J.*, **1**, 179–191, doi:10.1002/gdj3.20.

Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. *Mon. Wea. Rev.*, **117**, 572–582, doi:10.1175/1520-0493(1989)117<0572:SSACCI>2.0.CO;2.

Pepler, A. S., L. B. Díaz, C. Prodhomme, F. J. Doblas-Reyes, and A. Kumar, 2015: The ability of a multi-model seasonal forecasting ensemble to forecast the frequency of warm, cold and wet extremes. *Wea. Climate Extremes*, **9**, 68–77, doi:10.1016/j.wace.2015.06.005.

Rodgers, J. L., and W. A. Nicewander, 1988: Thirteen ways to look at the correlation coefficient. *Amer. Stat.*, **42**, 59–66, doi:10.2307/2685263.

Shi, W., N. Schaller, D. MacLeod, T. N. Palmer, and A. Weisheimer, 2015: Impact of hindcast length on estimates of seasonal climate predictability. *Geophys. Res. Lett.*, **42**, 1554–1559, doi:10.1002/2014GL062829.

Siegert, S., D. B. Stephenson, P. G. Sansom, A. A. Scaife, R. Eade, and A. Arribas, 2016: A Bayesian framework for verification and recalibration of ensemble forecasts: How uncertain is NAO predictability? *J. Climate*, **29**, 995–1012, doi:10.1175/JCLI-D-15-0196.1.

Steiger, J. H., 1980: Tests for comparing elements of a correlation matrix. *Psychol. Bull.*, **87**, 245–251, doi:10.1037/0033-2909.87.2.245.

Trenberth, K. E., 1997: The definition of El Niño. *Bull. Amer. Meteor. Soc.*, **78**, 2771–2777, doi:10.1175/1520-0477(1997)078<2771:TDOENO>2.0.CO;2.

Von Storch, H., and F. W. Zwiers, 2001: *Statistical Analysis in Climate Research*. Cambridge University Press, 496 pp.

Wilks, D., 2010: Sampling distributions of the Brier score and Brier skill score under serial dependence. *Quart. J. Roy. Meteor. Soc.*, **136**, 2109–2118, doi:10.1002/qj.709.

Wilks, D., 2011: *Statistical Methods in the Atmospheric Sciences*. Vol. 100, Academic Press, 704 pp.

Williams, E. J., 1959: The comparison of regression variables. *J. Roy. Stat. Soc.*, **21B** (2), 396–399. [Available online at http://www.jstor.org/stable/2983809.]

Zhang, J., L. Wu, and W. Dong, 2011: Land–atmosphere coupling and summer climate variability over East Asia. *J. Geophys. Res.*, **116**, D05117, doi:10.1029/2010JD014714.

Zou, G. Y., 2007: Toward using confidence intervals to compare correlations. *Psychol. Methods*, **12**, 399–413, doi:10.1037/1082-989X.12.4.399.