1. Introduction
There is an increasing interest in decadal climate predictions. Successful predictions on daily and seasonal time scales depend crucially on correct modeling of internal variability and therefore require that the model system is initialized with realistic conditions. In contrast, climate model scenarios depend on external forcings such as the atmospheric greenhouse gas concentrations and natural (e.g., from volcanic eruptions) and anthropogenic aerosol loadings. Predictions on decadal time scales may depend on both sources of predictability (see, e.g., Kushnir et al. 2019; Merryfield et al. 2020; Meehl et al. 2021; and references therein) and therefore require inclusion of both initializations and external forcings. See DelSole and Tippett (2018) for a more rigorous discussion of the different kinds of predictability.
Both phase 5 and phase 6 of the Coupled Model Intercomparison Project (CMIP5 and CMIP6) contain initialized decadal predictions for the period since 1960 in addition to the uninitialized historical experiments. Skill in decadal forecasts is related to slowly evolving parts of the climate system and is found mainly in the North Atlantic region. It has been reported for surface air temperature, precipitation, and the frequency of some extreme events. For details and references, see the summaries in, for example, Kirtman et al. (2014), Kushnir et al. (2019), Simpson et al. (2019), Meehl et al. (2021), and Hermanson et al. (2022). Recently, it has been reported that the predictable signal might be underestimated in models and that the average of very large ensembles might be needed to isolate this signal (Scaife and Smith 2018; Smith et al. 2019, 2020).
For some atmospheric variables, such as temperature, there is a strong influence from greenhouse gases in the last century. Thus, both the initialized and uninitialized experiments contain this forcing and show strong skill related to the general warming. It is often found (see, e.g., Borchert et al. 2021b; Bilbao et al. 2021) that correlations between observations and the ensemble mean of the historical experiments are large and significant in many geographical regions. The same holds for correlations between observations and the ensemble means of the initialized experiments even for decadal lead times. An example is discussed in section 4 (Fig. 8).
To estimate the effect of the initializations, it is therefore important either to remove the forced response from the initialized experiments or to compare the initialized and noninitialized experiments directly and more carefully. If greenhouse gases were the only forcing, one could try to remove this part from the prediction by expressing it as a low-order polynomial fit or by removing the part linearly congruent with the amount of greenhouse gases. However, the value of this approach is limited by other, faster forcings such as volcanic eruptions, the effect of which is harder to identify and remove (Trenberth and Shea 2006; Borchert et al. 2021a).
Attempts to demonstrate significant added skill from initializations for surface temperature on decadal time scales have shown mixed success outside the North Atlantic polar gyre region (Kirtman et al. 2014; Smith et al. 2010). See, for example, the discussions in Doblas-Reyes et al. (2013), Borchert et al. (2021a), and Sospedra-Alfonso and Boer (2020). However, recently Smith et al. (2019) have reported significant added skill from initializations in near-surface temperature, surface pressure, and precipitation using a new statistical approach to estimate the statistical significance.
In this paper, we compare different methods to estimate the statistical significance of the effect of initializations. An important concept is the power of the test, which here is defined as the probability of correctly rejecting the null hypothesis that initialized and noninitialized predictions have the same skill. The methods are all based on the generation of surrogate time series that preserve the temporal characteristics (power spectrum and autocorrelations) of the original series. The methods differ by the statistic considered, that is, the measure used to represent the difference in skill between initialized and noninitialized predictions. While the paper is cast in the context of decadal predictions, we believe that the results will also be useful for other situations where different forecast systems are compared, for example, for seasonal predictions.
In section 2, we describe the method. In section 2a, we define the different statistics, and in section 2b, we discuss the statistical test. In section 3, we look at data generated from simple, idealized models. In section 3a, we take a first look at the behavior of the statistics, while in section 3b we study the power of the tests. In section 4, we apply the statistical tests, as an example of a realistic situation, to historical and initialized ensembles from a single climate model. The paper closes with the conclusions in section 5.
2. The method
In section 2a, we introduce the measures (the statistics) of the added skill from initializations. These statistics are all based on correlations. To estimate the statistical significance, we need to compare an observed statistic to the value it would have had if there were no added skill. However, the measures are stochastic variables and are not characterized by single numbers but by distributions. The statistical significance is therefore estimated by comparing the observed statistic to its distribution when there is no added skill: if the observed statistic falls in the extreme 5% of the distribution (two-sided test), we reject the null hypothesis of no added skill. Note that it is not guaranteed that the actual risk of rejecting a true null hypothesis is equal to this nominal 5%. In section 2b, we discuss how to obtain these distributions.
a. How to measure the added skill: The statistics
We have three time series: the observations o; the historical, uninitialized simulation h; and the initialized forecast f. We assume that these time series have the same length (N). They could, for example, represent monthly temperatures in a single grid point or a circulation index such as the North Atlantic Oscillation. We want to estimate whether a forecast f contributes skill that is not included in the historical experiment h.
To proceed we need a measure, a statistic, of the added skill from initializations. Such statistics are stochastic variables, and their distributions under the null hypothesis are discussed in the next subsection.
Above, f and h could be individual members or ensemble means. Note that the skill of the ensemble mean often improves with ensemble size. This means that if we consider ensemble means we should include the same numbers of initialized and uninitialized experiments. It also means that even if the initialized forecast f is not better than the uninitialized h, it might still be an advantage to include it in the ensemble.
b. The test of statistical significance
Having chosen a statistic, we need to find its distribution under the null hypothesis. Here, a suitable null hypothesis is that the initialized forecast f has the same skill as the uninitialized experiment h. More precisely, we assume that f and h are exchangeable so that, for example, f has the same relation to o as h has.
Under this null hypothesis the difference statistic cor(f, o) − cor(h, o) and the split statistic cor(f − h, o)σ_{f−h}/σ_{f} will be distributed around zero also for finite values of N, but this is not the case for the residual statistic cor(o_{h}, f_{h}). The residuals are in general not independent, which leads to a bias in cor(o_{h}, f_{h}). The complication of the bias was remedied in Smith et al. (2019) by estimating and subtracting it. Here, we will apply a general Monte Carlo method valid for all the statistics. We will return to the behavior of the statistics when studying simple examples in section 3.
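The three statistics named above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the residuals of o and f are assumed to be obtained by ordinary least squares regression on h with an intercept.

```python
import numpy as np


def cor(x, y):
    """Pearson correlation between two 1-D series of equal length."""
    return float(np.corrcoef(x, y)[0, 1])


def residual_after_regression(y, x):
    """Part of y that is not linearly congruent with x (assumed OLS with intercept)."""
    xc = x - x.mean()
    yc = y - y.mean()
    slope = np.dot(xc, yc) / np.dot(xc, xc)
    return yc - slope * xc


def difference_statistic(o, h, f):
    """cor(f, o) - cor(h, o)."""
    return cor(f, o) - cor(h, o)


def residual_statistic(o, h, f):
    """Correlation of the residuals of o and f after regression on h."""
    return cor(residual_after_regression(o, h), residual_after_regression(f, h))


def split_statistic(o, h, f):
    """cor(f - h, o) scaled by the ratio of standard deviations of f - h and f."""
    d = f - h
    return cor(d, o) * d.std() / f.std()
```

All three functions take the observations, the historical series, and the forecast in the same order, so that any of them can be passed to a common significance test.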
We calculate the significance by comparing the original statistic—calculated, for example, from the climate model output or from simple numerically generated test data—to the distribution of that statistic under the assumption that f is similar to h. Thus, we need to build a distribution of the statistic under the null hypothesis that f and h have the same properties. We can then estimate if the original statistic falls far enough out in the tails of this distribution that we will reject the null hypothesis.
The distribution is built by a Monte Carlo approach where a large number (typically 500–2500) of surrogate versions of f are calculated. For the null hypothesis to be fulfilled, the surrogates
If we knew the generating process of h, we could use this process directly to produce the values of
An alternative way to produce the surrogates is to express h as the sum of a term linearly congruent with o and a residual: h = ao + ξ (cf. footnote 1). We can now make surrogates
Using the Monte Carlo method, we are directly accounting for the problematic assumption of some analytical tests that the skill estimates are independent, as discussed in DelSole and Tippett (2014) and Siegert et al. (2017). For example, cor(f, o) and cor(h, o) are dependent as they refer to the same set of observations o, but this dependence is included in the surrogates. We also avoid estimating the independent temporal degrees of freedom, which are often ill defined. However, the Monte Carlo methods assume that the time series are stationary. In section 3b(2), we will consider how to deal with trends.
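The Monte Carlo test can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: it assumes that each surrogate forecast is built as a·o plus a phase-scrambled copy of the residual ξ from h = ao + ξ, that phase scrambling means randomizing the Fourier phases while keeping the amplitudes, and that the test is two-sided. The function names, the default of 1000 surrogates, and the use of the difference statistic as the default are choices made for the example.

```python
import numpy as np


def phase_scramble(x, rng):
    """Surrogate of x with the same power spectrum but random Fourier phases."""
    n = len(x)
    mean = x.mean()
    spec = np.fft.rfft(x - mean)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    phases[0] = 0.0          # the zero-frequency term carries no phase
    if n % 2 == 0:
        phases[-1] = 0.0     # keep the Nyquist term real for even n
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n) + mean


def difference_statistic(o, h, f):
    return np.corrcoef(f, o)[0, 1] - np.corrcoef(h, o)[0, 1]


def surrogate_test(o, h, f, statistic=difference_statistic,
                   n_surr=1000, alpha=0.05, seed=0):
    """Return (observed statistic, null sample, reject flag) for a two-sided test."""
    rng = np.random.default_rng(seed)
    oc = o - o.mean()
    a = np.dot(oc, h - h.mean()) / np.dot(oc, oc)   # regression slope of h on o
    xi = h - a * o                                   # residual (keeps the mean of h)
    observed = statistic(o, h, f)
    # Surrogate forecasts share the relation of h to o (assumed construction).
    null = np.array([statistic(o, h, a * o + phase_scramble(xi, rng))
                     for _ in range(n_surr)])
    lower, upper = np.quantile(null, [alpha / 2.0, 1.0 - alpha / 2.0])
    return observed, null, bool(observed < lower or observed > upper)
```

Because o and h are held fixed while only the surrogate forecast varies, the dependence between cor(f, o) and cor(h, o) through the common observations is carried into the null distribution.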
We also use a simpler phase-scrambling method to estimate if correlations themselves are significantly different from zero, as described in Christiansen (2001).
3. Comparing the tests on idealized data
Here, we compare the different tests using simple numerically generated data.
a. Simple examples
We first briefly look at some simple examples to illustrate the difference between the statistics.
In the first example, we let the contribution of the observations to f and h be identical,
The distributions of the statistics under the null hypothesis are also shown in Fig. 1. We see that the original values of the statistics fall well inside these distributions, and we can therefore not reject the null hypothesis in any of the three tests.
We now look at another simple example (Fig. 2) where the forecast is superior. We take o as before, but now h = o + ξ_{h} and f = o + 0.5ξ_{f}. For this realization we have the difference statistic = 0.17, the residual statistic = 0.81, and the split statistic = −0.03. Now the values for the difference and the residual statistics fall clearly outside the distributions under the null hypothesis as calculated with the surrogate method. However, here the split statistic disagrees, and we will not reject the null hypothesis according to this statistic.
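As a rough check on these numbers, the large-sample values of the three statistics for this example can be worked out by hand. The calculation below assumes that o, ξ_{h}, and ξ_{f} are mutually independent with unit variance and writes the residual statistic as the partial correlation between o and f given h.

```latex
\begin{align}
\operatorname{var}(o) &= 1, \qquad \operatorname{var}(h) = 2, \qquad \operatorname{var}(f) = 1.25, \\
\operatorname{cor}(h,o) &= \tfrac{1}{\sqrt{2}} \approx 0.707, \qquad
\operatorname{cor}(f,o) = \tfrac{1}{\sqrt{1.25}} \approx 0.894, \qquad
\operatorname{cor}(f,h) = \tfrac{1}{\sqrt{2.5}} \approx 0.632, \\
\text{difference} &= \operatorname{cor}(f,o) - \operatorname{cor}(h,o) \approx 0.19, \\
\text{residual} &= \frac{\operatorname{cor}(f,o) - \operatorname{cor}(h,o)\,\operatorname{cor}(f,h)}
{\sqrt{\left[1-\operatorname{cor}(h,o)^{2}\right]\left[1-\operatorname{cor}(f,h)^{2}\right]}}
= \frac{0.894 - 0.447}{\sqrt{0.5 \times 0.6}} \approx 0.82, \\
\text{split} &= \operatorname{cor}(f-h,o)\,\frac{\sigma_{f-h}}{\sigma_{f}} = 0
\quad \text{because } \operatorname{cov}(0.5\,\xi_{f} - \xi_{h},\, o) = 0 .
\end{align}
```

These values are close to the 0.17, 0.81, and −0.03 quoted for the single realization. They also show why the split statistic is blind to this case: f and h contain the same amount of o and differ only in their noise levels.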
As in Fig. 1, but now h = o + ξ_{h}, and f = o + 0.5ξ_{f}, where ξ_{h} and ξ_{f} are normalized white noise. This example does not fulfill the null hypothesis.
Citation: Journal of Climate 36, 9; 10.1175/JCLI-D-22-0605.1
Equations (5)–(7), which are valid for large ensemble sizes, agree well with the value of the statistics from the realizations as is shown with the red vertical lines in Figs. 1 and 2.
From Eqs. (5)–(7), we see that under the null hypotheses
As a simple example we extend the model in Eq. (4). Now
The residual statistic
b. Power of the tests
The power of a test is its probability of correctly rejecting the null hypothesis. The power depends both on the design of the test and on the length of the time series (Sienz et al. 2016; Totaro et al. 2020) or, more precisely, on the number of independent degrees of freedom. A power curve is the power shown as a function of a parameter. The well-known structure of the power curve and its relation to type I and II errors are shown schematically in Fig. 4. A perfect test would have a power close to one when the null hypothesis does not hold, that is, it would almost always correctly reject the null hypothesis and thereby minimize the risk of type II errors. Likewise, the power curve would have small values when the null hypothesis holds, that is, there would be only a small risk of erroneously rejecting a true null hypothesis, thereby minimizing the risk of type I errors.
Schematic view of a power curve. The power p is shown as function of a parameter (full black curve). The null hypothesis is fulfilled only when the parameter is zero. Away from zero, p is the probability that the false null hypothesis is correctly rejected (indicated by green hatching). The probability that the false null hypothesis is incorrectly not rejected (type II error) is 1 − p (indicated by red hatching). At zero—where the null hypothesis is fulfilled—the length of the green line, 1 − p, indicates the probability to correctly not reject the true null hypothesis, while the length of the red line p indicates the probability to incorrectly reject the true null hypothesis (type I error).
1) The simple idealized model
We now fix
Power curves—giving the power as function of λ_{f}—are shown in Fig. 5 for
The power—the probability of rejecting the null hypothesis—of the statistical tests as function of λ_{f} for different values of the sample size N: 25, 100, and 1000. Here, h = o + ξ_{h} and f = λ_{f}o + ξ_{f}, where o is a normalized AR1 process, and ξ_{h} and ξ_{f} are normalized white noise. This corresponds to Eq. (4) with
More interesting in our context, we find that the largest power is found when using the split statistic. Also, the power of the difference statistic is in general a bit larger than the power of the residual statistic. These differences are largest for small N (sample size) and decrease with N. In particular, the difference and residual statistics show almost the same power for N ≥ 100, while they still differ considerably from the split statistic.
Unfortunately, a high power is here connected to a high risk of falsely rejecting a true null hypothesis (type I error). For λ_{f} = 1, where the null hypothesis holds, the rejection rates for the residual and difference statistics are around 0.2 for N = 25 and decrease toward the nominal value of 0.05 for large N. The situation is worse for the split statistic, for which the rejection rate is still around 0.3 even for N = 1000.
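A rejection rate of this kind can be estimated by repeating the surrogate test over many independent realizations of the idealized model h = o + ξ_{h}, f = λ_{f}o + ξ_{f}. The sketch below does this for the difference statistic only. It is an illustration, not the authors' code: the AR1 coefficient of 0.5, the numbers of realizations and surrogates, and the construction of the surrogates as a·o plus a phase-scrambled residual are all assumptions made for the example.

```python
import numpy as np


def ar1(n, phi, rng):
    """AR(1) series with unit stationary variance."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    innov = np.sqrt(1.0 - phi**2) * rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + innov[t]
    return x


def phase_scramble(x, rng):
    """Surrogate of x with the same power spectrum but random Fourier phases."""
    n = len(x)
    mean = x.mean()
    spec = np.fft.rfft(x - mean)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    phases[0] = 0.0
    if n % 2 == 0:
        phases[-1] = 0.0
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n) + mean


def difference_statistic(o, h, f):
    return np.corrcoef(f, o)[0, 1] - np.corrcoef(h, o)[0, 1]


def rejects(o, h, f, rng, n_surr=200, alpha=0.05):
    """Two-sided surrogate test of the difference statistic for one realization."""
    oc = o - o.mean()
    a = np.dot(oc, h - h.mean()) / np.dot(oc, oc)
    xi = h - a * o
    null = [difference_statistic(o, h, a * o + phase_scramble(xi, rng))
            for _ in range(n_surr)]
    lower, upper = np.quantile(null, [alpha / 2.0, 1.0 - alpha / 2.0])
    obs = difference_statistic(o, h, f)
    return bool(obs < lower or obs > upper)


def rejection_rate(lam_f, n, n_real=200, phi=0.5, seed=0):
    """Fraction of realizations in which the null hypothesis is rejected."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_real):
        o = ar1(n, phi, rng)
        h = o + rng.standard_normal(n)
        f = lam_f * o + rng.standard_normal(n)
        hits += rejects(o, h, f, rng)
    return hits / n_real
```

Evaluating `rejection_rate` over a range of `lam_f` values gives one power curve; the value at `lam_f = 1` is the empirical type I error rate that should be compared with the nominal 0.05.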
Above we calculated the surrogate time series by the phase-scrambling procedure based on given realizations of h and o. This corresponds to realistic situations. In the simple tests we know the process
In Fig. 6, we show some other examples of the power. In Fig. 6a, we vary
(a) Power as function of
2) Including trends
When applying the surrogate method to climate models, and in particular to temperature, we have to consider the trends. Trends are not treated well by the phase-scrambling method (Schreiber and Schmitz 2000) and should not be included in the stochastic part of the process. When calculating the surrogate
The left panels (Figs. 7a,c) show examples where the time series are generated as in Fig. 5b but now s includes a nonlinear trend. Likewise, the right panels (Figs. 7b,d) are generated as in Fig. 6b but now with the trend. The trend has the form of a fourth-order polynomial. Not taking special care of the trends in the phase scrambling gives rather wide power curves (Figs. 7a,b) compared to the situation without trend (Figs. 5b and 6b). Using the detrending procedure described in the previous paragraph improves the results and gives power curves (Figs. 7c,d) closely resembling those of Figs. 5b and 6b. We have confirmed that similar results are obtained with other powers of the true trend.
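The idea of keeping the trend out of the stochastic part can be sketched as follows: estimate a trend, phase-scramble only the detrended series, and add the trend back to the surrogate. This is a minimal illustration under assumptions, not the authors' procedure in detail. The trend is assumed to be estimated by a least squares polynomial fit, the default degree of 4 is chosen only to match the idealized trend, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def phase_scramble(x, rng):
    """Surrogate of x with the same power spectrum but random Fourier phases."""
    n = len(x)
    mean = x.mean()
    spec = np.fft.rfft(x - mean)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    phases[0] = 0.0
    if n % 2 == 0:
        phases[-1] = 0.0
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n) + mean


def surrogate_with_trend(x, t, rng, degree=4):
    """Phase-scramble the detrended part of x and add the fitted trend back."""
    coeffs = P.polyfit(t, x, degree)      # least squares polynomial trend (assumed)
    trend = P.polyval(t, coeffs)
    return trend + phase_scramble(x - trend, rng)
```

The surrogate then has the same fitted trend as the original series, while its detrended part has the same power spectrum as the original detrended part.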
Including trend in s of the form 6t^{4}, with t = (0, 1, … N − 1)/N. Power as function of λ_{f}. N = 100 in all panels. (left) Parameters are as in Fig. 5b, including
4. Climate models
In this section we consider the near-surface temperature from the historical single-model ensemble performed in the Community Earth System Model (CESM) Large Ensemble Simulation (LENS; Kay et al. 2015). We also use the corresponding ensemble of initialized decadal forecasts from the CESM Decadal Prediction Large Ensemble (DPLE; Yeager et al. 2018). Both ensembles have 40 members and use the same code and external forcing datasets. As observations, we use NCEP–NCAR data (Kalnay et al. 1996). Prior to the analysis, model data are interpolated to the horizontal NCEP global grid, 2.5° × 2.5°, using a simple nearest-neighbor procedure. Our analyses are based on annual means for the period 1970–2017. As we use annual means we do not expect the choice of reanalysis to be important. To be sure, we have confirmed that we get similar results using JRA-55 (Kobayashi et al. 2015). The DPLE experiments are initialized in November every year, and a lead time of 1 year will refer to the mean over the first full calendar year. When calculating the statistical significance of the statistics we use the phase-scrambling method described in section 2b amended with the approach described in section 3b(2) dealing with trends.
Figure 8 shows the correlations between the ensemble mean of the historical ensemble and observations and the correlations between the ensemble mean of the initialized decadal prediction ensemble and observations for lead times of 1 and 10 years. There are strong positive and significant correlations between the historical ensemble and observations (top-left panel). The correlations between the forecast ensemble and observations are comparable for a lead time of 10 years and somewhat larger for a lead time of 1 year (bottom panels). The top-right panel in Fig. 8 demonstrates the challenges of isolating the effect of initializations by showing that large and significant correlations between observations and the historical ensemble mean persist even after the linear trends have been removed.
Results for LENS annual-mean near-surface temperature. (top left) Correlations between observations and historical ensemble mean. (top right) As in the top left, but now time series have been detrended. (bottom) Correlations between observations and the initialized ensemble mean for lead times of (left) 1 year and (right) 10 years. The period is 1970–2017. Dots indicate where correlations are significantly different from zero at the 95% level.
In Fig. 9, we show the three statistics for a lead time of 1 year. The difference statistic shows large values only in the eastern Pacific and in the Atlantic subpolar gyre region. In contrast, the residual statistic is large in more extended areas. However, the two statistics reject the null hypothesis (that historical and initialized experiments have the same skill) in many of the same regions: the eastern Pacific [the region of the El Niño–Southern Oscillation (ENSO)] and the subpolar gyre region. The split statistic is statistically significant in even larger areas. Note that there are many areas where the statistics are negative and significant, indicating that the historical ensemble has more skill than the initialized ensemble. Figure 10 shows the situation for a lead time of 10 years. Again the values of the residual statistic are larger than the values of the difference statistic, but the two statistics mainly agree on the regions where the null hypothesis can be rejected. This is basically the subpolar gyre region, as also found in other studies, but also the polar regions, although here the significant correlations may be of different sign. Again, the split statistic is significant in larger areas.
The three statistics measuring the difference in skill between the initialized ensemble-mean forecast and the uninitialized historical ensemble mean for annual-mean near-surface temperature. (top left) The difference statistic, (top right) the residual statistic, and (bottom left) the split statistic. Based on LENS ensemble means for 1970–2017. A lead time of 1 year is used for the initialized forecast. Dots indicate where the null hypothesis (that historical and initialized experiments have the same skill) can be rejected at the 95% level.
As in Fig. 9, but for a lead time of 10 years.
The left panel in Fig. 11 shows the fractional area of significant values for the three statistics as a function of lead time. For a lead time of 1 year the difference statistic is significant over 60% of the globe, and for a lead time of 10 years this fraction is reduced to 15%. The corresponding fractions for the residual statistic are 50% and 20%, and for the split statistic 78% and 33%. While these fractions even for a lead time of 10 years are larger than the nominal 5%, they are not impressive when compared to the results in Fig. 5. There we found that the risk of rejecting a true null hypothesis is around 10%–20% for time series of length 25–100. Furthermore, the area of significant positive values of the statistic is only larger than the area of significant negative values for lead times less than 3–4 years (right panel in Fig. 11). This holds for both the difference and the residual statistics and indicates that the significance for larger lead times is due to chance.
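The fractional areas discussed above could be computed along the following lines. The text does not state how grid points are weighted, so this sketch assumes weights proportional to the cosine of latitude on a regular latitude–longitude grid; the function names and array layout are hypothetical.

```python
import numpy as np


def area_weights(lat, nlon):
    """Weights proportional to cos(latitude) for a regular grid (assumed weighting)."""
    w = np.cos(np.deg2rad(np.asarray(lat, dtype=float)))
    return np.repeat(w[:, None], nlon, axis=1)


def significant_area_fraction(significant, lat):
    """Area-weighted fraction of grid points flagged as significant.

    significant: boolean array of shape (nlat, nlon); lat: latitudes in degrees.
    """
    w = area_weights(lat, significant.shape[1])
    return float((w * significant).sum() / w.sum())


def signed_area_fractions(stat, significant, lat):
    """Fractions of the global area with significant positive and negative values."""
    positive = significant_area_fraction(significant & (stat > 0), lat)
    negative = significant_area_fraction(significant & (stat < 0), lat)
    return positive, negative
```

Splitting the significant area by the sign of the statistic is what allows added skill to be distinguished from deteriorated skill at a given lead time.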
(left) The fraction of the global area where the skills in forecast and historical experiments are estimated to be significantly different. The fraction is shown as a function of lead time for each of the three statistics. (right) The fraction of the global area where the skills are significantly different calculated for grid points with positive and negative correlations, respectively.
5. Conclusions
Attempts to find out if initialization improves forecasts as compared to uninitialized forecasts are hampered by the skill coming from the shared forcings. The skill from the forcings can be large, in particular for the last decades, due to the increased levels of greenhouse gases, and the added skill from initialization is often relatively small. It is therefore important to carefully estimate the statistical significance of the added skill.
Previous literature has used different correlation-based statistics measuring the added skill from initialization. The simplest statistic is the difference between the correlation of observations with the initialized experiment and the correlation of observations with the uninitialized experiment. Another statistic is the partial correlation between observations and the initialized forecast given the uninitialized forecast, that is, the correlation of the residuals after the contributions linearly congruent to the uninitialized forecast have been removed from the observations and the initialized forecast. The third statistic is based on the correlation between the difference of the initialized and uninitialized experiments and the observations.
We have studied the properties of the three statistics and used a Monte Carlo procedure to test the statistical significance. The Monte Carlo procedure is based on surrogate data with the same temporal characteristics as the original time series and the same relation to the observations. We first considered examples based on a simple model to study the power of the tests and then considered decadal forecasts with the LENS climate model. For the simple model, we derived analytical expressions of the statistics in the limit of large ensemble size.
Our main findings from the idealized model are as follows:

The residual statistic is in general, also in the limit of large samples, not zero under the null hypothesis that initialized and uninitialized forecasts are drawn from the same process. Furthermore, the value of the residual statistic does not vary monotonically with the signal-to-noise ratio. When we consider ensemble-mean forecasts, it therefore does not vary monotonically with the ensemble size (Fig. 3).

The split statistic does not recognize differences in the signal-to-noise ratio but only differences in the absolute level of the signal. This statistic therefore fails in some circumstances, but it is worth noting that if h and f are normalized, then the split statistic becomes identical to the difference statistic [Eq. (1)].

The powers of the difference and residual statistics are comparable when observations are noise free. When observations are noisy, the difference statistic is superior.

We also calculated a benchmark for the power based on the full information from the idealized model. The problems for the split statistic still remain. The difference and the residual statistics are still comparable when observations are noise free. However, when observations are noisy the difference between these statistics has been reduced, although the difference statistic is still behaving somewhat better.

The rejection rate when the null hypothesis is true is larger than the nominal value for small lengths of the time series but gets closer to the nominal value for long time series. This holds for all three statistics and the problem is quite severe for realistic sample lengths up to 100.

Secular trends cannot be directly dealt with by the phase-scrambling (or bootstrap) procedure. We find that generating surrogates from detrended data and then adding the trend back to these surrogates gives good results.
Note that the first bullet point means that the residual statistic cannot be compared across different geographical regions with different signal-to-noise ratios.
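The detrend, scramble, and re-add-trend recipe mentioned in the last bullet point can be illustrated with a minimal sketch. It assumes a linear secular trend and a plain Fourier phase-randomization surrogate of a single series; the function names are hypothetical, and the sketch does not reproduce the relation to the observations that the full Monte Carlo procedure preserves.

```python
import numpy as np

def phase_scramble(x, rng):
    """Surrogate with the same power spectrum as x but random Fourier phases."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spec = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spec.size)
    phases[0] = 0.0              # the mean component must stay real
    if n % 2 == 0:
        phases[-1] = 0.0         # the Nyquist component must stay real
    surrogate = np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n)
    return surrogate + x.mean()

def trend_preserving_surrogate(x, rng):
    """Scramble the detrended series, then add the fitted trend back.

    Assumption: the secular trend is linear in time.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size)
    slope, intercept = np.polyfit(t, x, 1)
    trend = slope * t + intercept
    return trend + phase_scramble(x - trend, rng)

# Usage: a synthetic trending series (illustrative values only).
rng = np.random.default_rng(0)
t = np.arange(64)
series = 0.05 * t + rng.standard_normal(t.size)
surrogate = trend_preserving_surrogate(series, rng)
```

Scrambling the raw trending series instead would spread the trend over all phases and remove it from the surrogates, which is why the trend is treated separately in this sketch.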
From the climate model analysis, we find the following:

When comparing the near-surface temperature in initialized decadal forecasts with uninitialized historical simulations, we find that the residual statistic takes larger values than the difference statistic. However, the two statistics are statistically significant in mostly the same regions. These results are not unexpected considering the results from the simple examples.

For the near-surface temperature, the initialized forecasts add significant skill for lead times of a few years, in particular in the regions influenced by ENSO. For decadal lead times, the area of statistical significance is consistent with the null hypothesis (no added skill from the initialization) when the results from the simple examples are taken into consideration. This interpretation is supported by the fact that the statistics are negative in large parts of the significant areas, indicating that initialization has degraded the predictions.
Based on our results, we recommend using the difference statistic, while keeping in mind that the significance can be overestimated because of the considerable risk of falsely rejecting a true null hypothesis.
The discussion in this paper has mainly been in the context of decadal forecasts but is also relevant for other situations where different forecast systems are compared. In particular, seasonal forecasts share many of the same challenges as decadal forecasts as the skill depends both on forcings and initial conditions.
Acknowledgments.
This work is supported by the NordForsk-funded Nordic Centre of Excellence project (Award 76654) Arctic Climate Predictions: Pathways to Resilient, Sustainable Societies (ARCPATH) and by the project European Climate Prediction System (EUCP) funded by the European Union under Horizon 2020 (Grant Agreement 776613).
Data availability statement.
The CESM Large Ensemble Project (http://www.cesm.ucar.edu/projects/community-projects/LENS/) was downloaded via EFGS from http://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html. The CESM Decadal Prediction Large Ensemble Project (http://www.cesm.ucar.edu/projects/community-projects/DPLE/) was downloaded via EFGS from http://www.cesm.ucar.edu/projects/community-projects/DPLE/data-sets.html. The NCEP–NCAR Reanalysis data were provided by the NOAA–CIRES Climate Diagnostics Center, Boulder, Colorado, from their website at http://www.cdc.noaa.gov/.
Footnotes
The part of y linearly congruent with x is found by writing y = ax + ξ, where ξ and x are independent. The linearly congruent part is then ax, and the residual is y_x = ξ.
As we are dealing with correlations, we could also define the null hypothesis by identical signal-to-noise ratios.
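The decomposition described in the first footnote can be sketched as follows. This is a minimal illustration, assuming that a is estimated by least squares on anomalies; the function name is hypothetical. Least squares guarantees only that the sample residual is uncorrelated with x, which is weaker than the independence stated in the footnote.

```python
import numpy as np

def split_congruent(y, x):
    """Split the anomalies of y into a*x and a residual uncorrelated with x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xa = x - x.mean()                      # anomalies of x
    ya = y - y.mean()                      # anomalies of y
    a = np.dot(xa, ya) / np.dot(xa, xa)    # least-squares estimate of a
    congruent = a * xa                     # part of y linearly congruent with x
    residual = ya - congruent              # estimate of xi
    return congruent, residual

# Usage with synthetic data (illustrative values only).
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 0.7 * x + 0.5 * rng.standard_normal(200)
congruent, residual = split_congruent(y, x)
```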
APPENDIX
Derivations of Analytical Expressions for the Statistics for Large Sample Size
The analysis of climate data can often be simplified using the blessings of dimensionality (Christiansen 2018, 2021). When the sample size of the time series is large (or, more precisely, when the number of effective degrees of freedom in the time series is large), we can utilize the nonintuitive properties of high-dimensional spaces. The properties we will consider here are that vectors drawn independently from the same distribution have the same length and that independent vectors are orthogonal. The mathematical background and the extent to which these properties can be applied to climatic fields and time series are discussed in Christiansen (2021). Several analytical results based on the blessings of dimensionality are derived in that paper and in Christiansen (2019, 2020) and Christiansen et al. (2022).
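The two properties can be checked numerically with a small sketch. It assumes independent standard Gaussian components, which is an idealization and not a property of real climate series; the function name and the chosen dimensions are hypothetical.

```python
import numpy as np

def length_and_angle_spread(dim, n_vectors=200, seed=0):
    """Return (relative spread of vector lengths, mean |cosine| between pairs).

    Assumption: components are independent standard Gaussian variables.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_vectors, dim))
    lengths = np.linalg.norm(v, axis=1)
    unit = v / lengths[:, None]
    # Cosines of the angles between all distinct pairs of vectors.
    cosines = (unit @ unit.T)[np.triu_indices(n_vectors, k=1)]
    return lengths.std() / lengths.mean(), np.abs(cosines).mean()

# Both measures shrink as the dimension grows.
for dim in (3, 30, 3000):
    spread, mean_abs_cos = length_and_angle_spread(dim)
    print(f"dim={dim}: length spread={spread:.3f}, mean |cos|={mean_abs_cos:.3f}")
```

For time series, the relevant dimension is the number of effective degrees of freedom, so serial correlation reduces the dimension that enters such a check.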
REFERENCES
Bilbao, R., and Coauthors, 2021: Assessment of a full-field initialized decadal climate prediction system with the CMIP6 version of EC-Earth. Earth Syst. Dyn., 12, 173–196, https://doi.org/10.5194/esd-12-173-2021.
Borchert, L. F., V. Koul, M. B. Menary, D. J. Befort, D. Swingedouw, G. Sgubin, and J. Mignot, 2021a: Skillful decadal prediction of unforced southern European summer temperature variations. Environ. Res. Lett., 16, 104017, https://doi.org/10.1088/1748-9326/ac20f5.
Borchert, L. F., M. B. Menary, D. Swingedouw, G. Sgubin, L. Hermanson, and J. Mignot, 2021b: Improved decadal predictions of North Atlantic subpolar gyre SST in CMIP6. Geophys. Res. Lett., 48, e2020GL091307, https://doi.org/10.1029/2020GL091307.
Christiansen, B., 2001: Downward propagation of zonal mean zonal wind anomalies from the stratosphere to the troposphere: Model and reanalysis. J. Geophys. Res., 106, 27 307–27 322, https://doi.org/10.1029/2000JD000214.
Christiansen, B., 2018: Ensemble averaging and the curse of dimensionality. J. Climate, 31, 1587–1596, https://doi.org/10.1175/JCLI-D-17-0197.1.
Christiansen, B., 2019: Analysis of ensemble mean forecasts: The blessings of high dimensionality. Mon. Wea. Rev., 147, 1699–1712, https://doi.org/10.1175/MWR-D-18-0211.1.
Christiansen, B., 2020: Understanding the distribution of multimodel ensembles. J. Climate, 33, 9447–9465, https://doi.org/10.1175/JCLI-D-20-0186.1.
Christiansen, B., 2021: The blessing of dimensionality for the analysis of climate data. Nonlinear Processes Geophys., 28, 409–422, https://doi.org/10.5194/npg-28-409-2021.
Christiansen, B., S. Yang, and D. Matte, 2022: The forced response and decadal predictability of the North Atlantic Oscillation: Nonstationary and fragile skills. J. Climate, 35, 5869–5882, https://doi.org/10.1175/JCLI-D-21-0807.1.
DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.
DelSole, T., and M. K. Tippett, 2018: Predictability in a changing climate. Climate Dyn., 51, 531–545, https://doi.org/10.1007/s00382-017-3939-8.
Doblas-Reyes, F. J., and Coauthors, 2013: Initialized near-term regional climate change prediction. Nat. Commun., 4, 1715, https://doi.org/10.1038/ncomms2704.
Hermanson, L., and Coauthors, 2022: WMO global annual to decadal climate update: A prediction for 2021–25. Bull. Amer. Meteor. Soc., 103, E1117–E1129, https://doi.org/10.1175/BAMS-D-20-0311.1.
Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year reanalysis project. Bull. Amer. Meteor. Soc., 77, 437–471, https://doi.org/10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.
Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bull. Amer. Meteor. Soc., 96, 1333–1349, https://doi.org/10.1175/BAMS-D-13-00255.1.
Kirtman, B., and Coauthors, 2014: Near-term climate change: Projections and predictability. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 953–1028, https://doi.org/10.1017/CBO9781107415324.023.
Kushnir, Y., and Coauthors, 2019: Towards operational predictions of the near-term climate. Nat. Climate Change, 9, 94–101, https://doi.org/10.1038/s41558-018-0359-7.
Lancaster, G., D. Iatsenko, A. Pidde, V. Ticcinelli, and A. Stefanovska, 2018: Surrogate data for hypothesis testing of physical systems. Phys. Rep., 748, 1–60, https://doi.org/10.1016/j.physrep.2018.06.001.
Meehl, G. A., and Coauthors, 2021: Initialized Earth system prediction from subseasonal to decadal timescales. Nat. Rev. Earth Environ., 2, 340–357, https://doi.org/10.1038/s43017-021-00155-x.
Merryfield, W. J., and Coauthors, 2020: Current and emerging developments in subseasonal to decadal prediction. Bull. Amer. Meteor. Soc., 101, E869–E896, https://doi.org/10.1175/BAMS-D-19-0037.1.
Scaife, A. A., and D. Smith, 2018: A signal-to-noise paradox in climate science. npj Climate Atmos. Sci., 1, 28, https://doi.org/10.1038/s41612-018-0038-4.
Schreiber, T., and A. Schmitz, 1996: Improved surrogate data for nonlinearity tests. Phys. Rev. Lett., 77, 635–638, https://doi.org/10.1103/PhysRevLett.77.635.
Schreiber, T., and A. Schmitz, 2000: Surrogate time series. Physica D, 142, 346–382, https://doi.org/10.1016/S0167-2789(00)00043-9.
Sgubin, G., D. Swingedouw, L. F. Borchert, M. B. Menary, T. Noël, H. Loukos, and J. Mignot, 2021: Systematic investigation of skill opportunities in decadal prediction of air temperature over Europe. Climate Dyn., 57, 3245–3263, https://doi.org/10.1007/s00382-021-05863-0.
Siegert, S., O. Bellprat, M. Ménégoz, D. B. Stephenson, and F. J. Doblas-Reyes, 2017: Detecting improvements in forecast correlation skill: Statistical testing and power analysis. Mon. Wea. Rev., 145, 437–450, https://doi.org/10.1175/MWR-D-16-0037.1.
Sienz, F., W. A. Müller, and H. Pohlmann, 2016: Ensemble size impact on the decadal predictive skill assessment. Meteor. Z., 25, 645–655, https://doi.org/10.1127/metz/2016/0670.
Simpson, I. R., S. G. Yeager, K. A. McKinnon, and C. Deser, 2019: Decadal predictability of late winter precipitation in western Europe through an ocean–jet stream connection. Nat. Geosci., 12, 613–619, https://doi.org/10.1038/s41561-019-0391-x.
Smith, D. M., R. Eade, N. J. Dunstone, D. Fereday, J. M. Murphy, H. Pohlmann, and A. A. Scaife, 2010: Skilful multiyear predictions of Atlantic hurricane frequency. Nat. Geosci., 3, 846–849, https://doi.org/10.1038/ngeo1004.
Smith, D. M., and Coauthors, 2019: Robust skill of decadal climate predictions. npj Climate Atmos. Sci., 2, 13, https://doi.org/10.1038/s41612-019-0071-y.
Smith, D. M., and Coauthors, 2020: North Atlantic climate far more predictable than models imply. Nature, 583, 796–800, https://doi.org/10.1038/s41586-020-2525-0.
Solaraju-Murali, B., L.-P. Caron, N. Gonzalez-Reviriego, and F. J. Doblas-Reyes, 2019: Multiyear prediction of European summer drought conditions for the agricultural sector. Environ. Res. Lett., 14, 124014, https://doi.org/10.1088/1748-9326/ab5043.
Sospedra-Alfonso, R., and G. J. Boer, 2020: Assessing the impact of initialization on decadal prediction skill. Geophys. Res. Lett., 47, e2019GL086361, https://doi.org/10.1029/2019GL086361.
Theiler, J., S. Eubank, A. Longtin, B. Galdrikian, and J. Doyne Farmer, 1992: Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58, 77–94, https://doi.org/10.1016/0167-2789(92)90102-S.
Totaro, V., A. Gioia, and V. Iacobellis, 2020: Numerical investigation on the power of parametric and nonparametric tests for trend detection in annual maximum series. Hydrol. Earth Syst. Sci., 24, 473–488, https://doi.org/10.5194/hess-24-473-2020.
Trenberth, K. E., and D. J. Shea, 2006: Atlantic hurricanes and natural variability in 2005. Geophys. Res. Lett., 33, L12704, https://doi.org/10.1029/2006GL026894.
Wang, Y., F. Counillon, N. Keenlyside, L. Svendsen, S. Gleixner, M. Kimmritz, P. Dai, and Y. Gao, 2019: Seasonal predictions initialised by assimilating sea surface temperature observations with the EnKF. Climate Dyn., 53, 5777–5797, https://doi.org/10.1007/s00382-019-04897-9.
Yeager, S. G., and Coauthors, 2018: Predicting near-term changes in the Earth system: A large ensemble of initialized decadal prediction simulations using the Community Earth System Model. Bull. Amer. Meteor. Soc., 99, 1867–1886, https://doi.org/10.1175/BAMS-D-17-0098.1.