Estimating the Significance of the Added Skill from Initializations: The Case of Decadal Predictions

Bo Christiansen (a), Shuting Yang (a), and Dominic Matte (b, c)

a Danish Meteorological Institute, Copenhagen, Denmark
b Physics of Ice, Department of Climate and Earth, Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
c Ouranos, Montréal, Québec, Canada

ORCID (Bo Christiansen): https://orcid.org/0000-0003-2792-4724

Open access

Abstract

A considerable part of the skill in decadal forecasts often comes from the forcings, which are present in both initialized and uninitialized model experiments. This makes the added value from initialization difficult to assess. We investigate statistical tests to quantify whether initialized forecasts provide skill over the uninitialized experiments. We consider three correlation-based statistics previously used in the literature. The distributions of these statistics under the null hypothesis that initialization has no added value are calculated by a surrogate data method. We present some simple examples and study the statistical power of the tests. We find that there can be large differences in both the values and the power of the different statistics. In general, the simple statistic defined as the difference between the skill of the initialized and uninitialized experiments behaves best. However, for all statistics the risk of rejecting a true null hypothesis is too high compared to the nominal value. We compare the three tests on initialized decadal predictions (hindcasts) of near-surface temperature performed with a climate model and find evidence for a significant effect of initializations for short lead times. In contrast, we find only little evidence for a significant effect of initializations for lead times longer than 3 years when the experience from the simple experiments is included in the estimation.

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Bo Christiansen, boc@dmi.dk

1. Introduction

There is an increasing interest in decadal climate predictions. Successful predictions on daily and seasonal time scales depend crucially on correct modeling of internal variability and therefore require that the model system is initialized with realistic conditions. In contrast, climate model scenarios depend on external forcings such as the atmospheric greenhouse gas concentrations and natural (e.g., from volcanic eruptions) and anthropogenic aerosol loadings. Predictions on decadal time scales may depend on both sources of predictability (see, e.g., Kushnir et al. 2019; Merryfield et al. 2020; Meehl et al. 2021; and references therein) and therefore require inclusion of both initializations and external forcings. See DelSole and Tippett (2018) for a more rigorous discussion of the different kinds of predictability.

Both phase 5 and phase 6 of the Coupled Model Intercomparison Project (CMIP5 and CMIP6) contain initialized decadal predictions for the periods since 1960 in addition to the uninitialized historical experiments. Skill in decadal forecasts is related to slowly evolving parts of the climate system and is mainly found in the North Atlantic region and has been reported for surface air temperature, precipitation, and for the frequency of some extreme events. For details and references, see the summaries in, for example, Kirtman et al. (2014), Kushnir et al. (2019), Simpson et al. (2019), Meehl et al. (2021), and Hermanson et al. (2022). Recently, it has been reported that the predictable signal might be underestimated in models and that the average of very large ensembles might be needed to isolate this signal (Scaife and Smith 2018; Smith et al. 2019, 2020).

For some atmospheric variables, such as temperature, there is a strong influence from greenhouse gases in the last century. Thus, both the initialized and uninitialized experiments contain this forcing and show strong skill related to the general warming. It is often found (see, e.g., Borchert et al. 2021b; Bilbao et al. 2021) that correlations between observations and the ensemble mean of the historical experiments are large and significant in many geographical regions. The same holds for correlations between observations and the ensemble means of the initialized experiments even for decadal lead times. An example is discussed in section 4 (Fig. 8).

To estimate the effect of the initialized experiments, it is therefore important either to remove the forced response from the initialized experiments or to compare the initialized and noninitialized experiments directly and more carefully. If the only forcing were the greenhouse gases, one could try to remove this part from the prediction by expressing it as a low-order polynomial fit or by removing the part linearly congruent with the amount of greenhouse gases. However, the value of this approach is limited by other, faster forcings such as volcanic eruptions, the effect of which is harder to identify and remove (Trenberth and Shea 2006; Borchert et al. 2021a).

Attempts to demonstrate significant added skill from initializations for surface temperature on decadal time scales have shown mixed success outside the North Atlantic polar gyre region (Kirtman et al. 2014; Smith et al. 2010). See, for example, the discussions in Doblas-Reyes et al. (2013), Borchert et al. (2021a), and Sospedra-Alfonso and Boer (2020). However, recently Smith et al. (2019) have reported significant added skill from initializations in near-surface temperature, surface pressure, and precipitation using a new statistical approach to estimate the statistical significance.

In this paper, we compare different methods to estimate the statistical significance of the effect of initializations. An important concept is the power of the test, which here is defined as the probability of correctly rejecting the null hypothesis that initialized and noninitialized predictions have the same skill. The methods are all based on the generation of surrogate time series that preserve the temporal characteristics (power spectrum and autocorrelations) of the original series. The methods differ by the statistic considered, that is, the measure used to represent the difference in skill between initialized and noninitialized predictions. While the paper is cast in the context of decadal predictions, we believe that the results will also be useful for other situations where different forecast systems are compared, for example, for seasonal predictions.

In section 2, we describe the method. In section 2a, we define the different statistics, and in section 2b, we discuss the statistical test. In section 3, we look at data generated from simple, idealized models. In section 3a, we take a first look at the behavior of the statistics, while in section 3b we study the power of the tests. In section 4, as an example of a realistic situation, we apply the statistical tests to historical and initialized ensembles from a single climate model. The paper closes with the conclusions in section 5.

2. The method

In section 2a, we introduce the measures—the statistics—of the added skill from initializations. These statistics are all based on correlations. To estimate the statistical significance, we need to compare an observed statistic to the value it would have had if there were no added skill. However, the measures are stochastic variables and are not characterized by single numbers but by distributions. The statistical significance is therefore estimated by comparing the observed statistic to its distribution when there is no added skill: if the observed statistic falls in the extreme 5% of the distribution (two-sided test) we reject the null hypothesis of no added skill. Note that it is not guaranteed that the actual risk of rejecting a true null hypothesis is equal to this nominal 5%. In section 2b, we discuss how to obtain these distributions.
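The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `rejects_null` and the use of empirical quantiles to delimit the "extreme 5%" are assumptions made here, and the null distribution is simply passed in as an array of values of the statistic.

```python
import numpy as np

def rejects_null(observed, null_samples, alpha=0.05):
    """Two-sided test: reject if `observed` lies outside the central
    (1 - alpha) part of the empirical null distribution.

    The null distribution need not be centered on zero, so the limits are
    taken from its own quantiles rather than from |observed|.
    """
    lower, upper = np.quantile(null_samples, [alpha / 2.0, 1.0 - alpha / 2.0])
    return bool(observed < lower or observed > upper)

# Hypothetical null distribution centered on 0.5 (illustrative numbers only).
rng = np.random.default_rng(0)
null_samples = rng.normal(loc=0.5, scale=0.1, size=2000)
inside = rejects_null(0.55, null_samples)   # well inside the distribution
outside = rejects_null(0.90, null_samples)  # far out in the upper tail
```

Taking the limits from the quantiles of the null distribution itself matters for statistics whose null distribution is not centered on zero.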

a. How to measure the added skill: The statistics

We have three time series: the observations o; the historical, uninitialized simulation h; and the initialized forecast f. We assume that these time series have the same length (N). They could, for example, represent monthly temperatures in a single grid point or a circulation index such as the North Atlantic Oscillation. We want to estimate whether a forecast f contributes skill that is not included in the historical experiment h.

To proceed we need a measure (a statistic) of the added skill from initializations. Such statistics are stochastic variables, and their distributions under the null hypothesis are discussed in the next subsection.

A straightforward statistic that has often been considered (e.g., very recently by Sgubin et al. 2021) is the difference in correlations cor(f, o) − cor(h, o). See also the references in Siegert et al. (2017). Recently, Smith et al. (2019) introduced another statistic based on the residuals o|h and f|h, obtained by removing the influence of h from o and f. More precisely, the residual o|h is calculated by removing from o the part of o that is linearly congruent with h (see footnote 1). The statistic is then cor(o|h, f|h), which is also known as the partial correlation. This statistic was also used by Solaraju-Murali et al. (2019), Smith et al. (2020), and Borchert et al. (2021a). The residual statistic is related to the method of Sospedra-Alfonso and Boer (2020) and will give similar values to that method for large N (Borchert et al. 2021a). Wang et al. (2019) introduced a third statistic based on the straightforward and assumption-free expansion
$$\mathrm{cor}(f,o) = \mathrm{cor}(h,o)\,\frac{\sigma_h}{\sigma_f} + \mathrm{cor}(f-h,o)\,\frac{\sigma_{f-h}}{\sigma_f}, \tag{1}$$
where $\sigma_x^2$ denotes the variance of x. Wang et al. (2019) considered the last term on the right-hand side as a measure of the added contribution from initializations in seasonal forecasts. In the rest of the paper we will call cor(f, o) − cor(h, o) the difference statistic, cor(o|h, f|h) the residual statistic, and $\mathrm{cor}(f-h,o)\,\sigma_{f-h}/\sigma_f$ the split statistic. This does not exhaust the variations of statistics that can be used for testing differences in skill (DelSole and Tippett 2014).
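The three statistics can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and the residual x|h is computed here by ordinary least-squares regression on h after removing the means, which is our reading of "linearly congruent" (the footnote defining it is not reproduced in this section).

```python
import numpy as np

def cor(x, y):
    """Pearson correlation between two equally long series."""
    return float(np.corrcoef(x, y)[0, 1])

def residual(x, h):
    """x|h: x with the part linearly congruent with h removed
    (least-squares regression on h, means removed; an assumption)."""
    xc = x - x.mean()
    hc = h - h.mean()
    return xc - (xc @ hc) / (hc @ hc) * hc

def difference_statistic(f, h, o):
    return cor(f, o) - cor(h, o)

def residual_statistic(f, h, o):
    return cor(residual(o, h), residual(f, h))

def split_statistic(f, h, o):
    # Last term of the expansion of cor(f, o): cor(f - h, o) * sigma_{f-h} / sigma_f
    return cor(f - h, o) * float(np.std(f - h) / np.std(f))

# Illustrative white-noise data (not from the paper).
rng = np.random.default_rng(0)
o = rng.standard_normal(200)
h = o + rng.standard_normal(200)
f = o + 0.5 * rng.standard_normal(200)
stats = (difference_statistic(f, h, o),
         residual_statistic(f, h, o),
         split_statistic(f, h, o))
```

With this definition of the residual, the residual statistic coincides numerically with the textbook partial correlation of f and o given h, and the split statistic plus cor(h, o)σh/σf reproduces cor(f, o) exactly.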

Above, f and h could be individual members or ensemble means. Note that the skill of the ensemble mean often improves with ensemble size. This means that if we consider ensemble means we should include the same numbers of initialized and uninitialized experiments. It also means that even if the initialized forecast f is not better than the uninitialized h, it might still be an advantage to include it in the ensemble.

b. The test of statistical significance

Having chosen a statistic, we need to find its distribution under the null hypothesis. Here, a suitable null hypothesis is that the initialized forecast f has the same skill as the uninitialized experiment h. More precisely, we assume that f and h are exchangeable so that, for example, f has the same relation to o as h.

Under this null hypothesis the difference statistic cor(f, o) − cor(h, o) and the split statistic $\mathrm{cor}(f-h,o)\,\sigma_{f-h}/\sigma_f$ will be distributed around zero also for finite values of N, but this is not the case for the residual statistic cor(o|h, f|h). The residuals are in general not independent, which leads to a bias in cor(o|h, f|h). The complication of the bias was remedied in Smith et al. (2019) by estimating and subtracting it. Here, we will apply a general Monte Carlo method valid for all the statistics. We will return to the behavior of the statistics when studying simple examples in section 3.

We calculate the significance by comparing the original statistic—calculated, for example, from the climate model output or from simple numerically generated test data—to the distribution of that statistic under the assumption that f is similar to h. Thus, we need to build a distribution of the statistic under the null hypothesis that f and h have the same properties. We can then estimate if the original statistic falls far enough out in the tails of this distribution that we will reject the null hypothesis.

The distribution is built by a Monte Carlo approach where a large number (typically 500–2500) of surrogate versions of f are calculated. For the null hypothesis to be fulfilled, the surrogates f* should have the same population correlation to o as the original h. They should also have the same temporal structure as h, that is, the same serial correlations.

If we knew the generating process of h, we could use this process directly to produce the values of f* and from these the distribution of the statistics under the null hypothesis. The power calculated this way provides us with a benchmark for the power. The generating process will be known when we study the simple model in section 3 [middle expression in Eq. (4)] but, of course, not in more realistic situations, as in section 4.

In realistic situations, we only have one observed realization of each of the time series h, f, and o. One way to calculate surrogates f* fulfilling the requirements above is to let
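$$f^* = \frac{o}{\|o\|} + a\,\frac{h^+}{\|h^+\|}, \tag{2}$$
and then rescale f* so that $\|f^*\| = \|h\|$. Here, $\|x\|$ is the norm of x, $h^+$ is a phase-scrambled version of h, and
$$a = \sqrt{\frac{1}{\mathrm{cor}^2(h,o)} - 1}. \tag{3}$$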
The phase-scrambling—randomizing the Fourier phases—produces surrogates retaining the full autocorrelation spectrum of the original time series. For details about the phase-scrambling method, see Theiler et al. (1992) and Lancaster et al. (2018). The phase-scrambling method does not preserve the probability distribution of the time series but produces Gaussian distributed data. As an alternative, we have used the iterative amplitude-adjusted Fourier transform algorithm (IAAFT; Schreiber and Schmitz 1996), which is an extended version of the phase-scrambling method that does preserve the probability distribution. We find that the phase scrambling and IAAFT produce similar results both for the simple model in section 3 and when applied to the climate model in section 4.
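A minimal sketch of the surrogate construction in Eqs. (2) and (3) with plain phase scrambling (not IAAFT) might look as follows. The function names are hypothetical. The sketch assumes that o and h are treated as anomalies (means removed), that cor(h, o) is positive and nonzero, and that the norm is the Euclidean norm of the anomaly series.

```python
import numpy as np

def phase_scramble(x, rng):
    """Surrogate with the same power spectrum as x but random Fourier phases."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0          # the zero-frequency term must stay real
    if n % 2 == 0:
        phases[-1] = 0.0     # so must the Nyquist term when n is even
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n)

def surrogate_forecast(o, h, rng):
    """One surrogate f* following Eqs. (2)-(3), returned as an anomaly series."""
    oc = o - o.mean()
    hc = h - h.mean()
    c = np.corrcoef(oc, hc)[0, 1]
    a = np.sqrt(1.0 / c**2 - 1.0)                      # Eq. (3)
    h_plus = phase_scramble(hc, rng)
    f_star = oc / np.linalg.norm(oc) + a * h_plus / np.linalg.norm(h_plus)  # Eq. (2)
    return f_star * np.linalg.norm(hc) / np.linalg.norm(f_star)            # ||f*|| = ||h||
```

The choice of a follows from requiring cor(f*, o) = cor(h, o) when h+ is uncorrelated with o: then cor(f*, o) = 1/sqrt(1 + a^2). In a finite sample h+ and o are only approximately uncorrelated, so individual surrogates scatter around this target.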

An alternative way to produce the surrogates is to express h as the sum of a term linearly congruent with o and a residual: h = ao + ξ (cf. footnote 1). We can now make surrogates f* by bootstrapping (random sampling with replacement) the residual ξ and adding the bootstrap to ao. Serial correlations can be included by using block bootstraps (or by phase scrambling the residuals). We find that this method gives results very similar to those of the phase-scrambling method described in the previous paragraph. In the rest of the paper, we will report the results obtained with the phase-scrambling method. When we study the simple model in section 3, we will also show results when the significance is obtained using the generating process. This makes it possible to separate the effect of the statistics from the effect of the phase scrambling.

Using the Monte Carlo method, we are directly accounting for the problematic assumption of some analytical tests that the skill estimates are independent, as discussed in DelSole and Tippett (2014) and Siegert et al. (2017). For example, cor(f, o) and cor(h, o) are dependent as they refer to the same set of observations o, but this dependence is included in the surrogates. We also avoid estimating the independent temporal degrees of freedom, which are often ill defined. However, the Monte Carlo methods assume that the time series are stationary. In section 3b(2), we will consider how to deal with trends.

We also use a simpler phase-scrambling method to estimate if correlations themselves are significantly different from zero, as described in Christiansen (2001).

3. Comparing the tests on idealized data

Here, we compare the different tests using simple numerically generated data.

We will consider test data generated by
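$$o = \lambda_o s + \sigma_{\xi_o}\xi_o, \qquad h = \lambda_h s + \sigma_{\xi_h}\xi_h, \qquad \text{and} \qquad f = \lambda_f s + \sigma_{\xi_f}\xi_f. \tag{4}$$
Here, we take the potentially predictable signal s as a simple first-order autoregressive series with length N and coefficient 0.5. We assume without loss of generality that s is normalized so $\sigma_s = 1$. The terms $\xi_o$, $\xi_h$, and $\xi_f$ are independent, white Gaussian noise. The noise terms have zero means and unit variances. So $\sigma_{\xi_o}$, $\sigma_{\xi_h}$, and $\sigma_{\xi_f}$ are measures of the noise amplitudes, while $\lambda_o$, $\lambda_h$, and $\lambda_f$ are measures of the size of the signal. The null hypothesis holds only when $\sigma_{\xi_h} = \sigma_{\xi_f}$ and $\lambda_h = \lambda_f$ (see footnote 2).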
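Test data of the form in Eq. (4) can be generated as follows. This is a sketch under stated assumptions: the function names and the burn-in length are choices made here, and "normalized" is implemented by standardizing each finite realization of s to zero sample mean and unit sample standard deviation.

```python
import numpy as np

def ar1(n, coeff, rng, burn_in=200):
    """Standardized first-order autoregressive series of length n."""
    eps = rng.standard_normal(n + burn_in)
    x = np.empty(n + burn_in)
    x[0] = eps[0]
    for t in range(1, n + burn_in):
        x[t] = coeff * x[t - 1] + eps[t]
    x = x[burn_in:]                      # discard the spin-up
    return (x - x.mean()) / x.std()

def make_test_data(n, lam_o, lam_h, lam_f, sig_o, sig_h, sig_f, rng, coeff=0.5):
    """Return (o, h, f, s) following Eq. (4) with independent white noise terms."""
    s = ar1(n, coeff, rng)
    o = lam_o * s + sig_o * rng.standard_normal(n)
    h = lam_h * s + sig_h * rng.standard_normal(n)
    f = lam_f * s + sig_f * rng.standard_normal(n)
    return o, h, f, s

# Null-hypothesis configuration: noise-free observations, h and f exchangeable.
rng = np.random.default_rng(3)
o, h, f, s = make_test_data(100, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, rng)
```

With `sig_o = 0` and `lam_o = 1` the observations coincide with the signal, so h and f are the observations plus independent unit-variance noise.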
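Using the simplifying properties of high-dimensional spaces (Christiansen 2021), we show in the appendix that for large N we have for the three statistics:
$$\mathrm{cor}(f,o) - \mathrm{cor}(h,o) = \frac{1}{\sqrt{1+\sigma_{\xi_o}^2/\lambda_o^2}}\left(\frac{1}{\sqrt{1+\sigma_{\xi_f}^2/\lambda_f^2}} - \frac{1}{\sqrt{1+\sigma_{\xi_h}^2/\lambda_h^2}}\right), \tag{5}$$
$$\mathrm{cor}(f|h,o|h) = \mathrm{sgn}(\lambda_f\lambda_o)\,\frac{\gamma}{\sqrt{\gamma+\sigma_{\xi_o}^2/\lambda_o^2}\,\sqrt{\gamma+\sigma_{\xi_f}^2/\lambda_f^2}} \quad \text{with} \quad \gamma = \frac{1}{1+\lambda_h^2/\sigma_{\xi_h}^2}, \tag{6}$$
and
$$\mathrm{cor}(f-h,o)\,\frac{\sigma_{f-h}}{\sigma_f} = \frac{1}{\sqrt{1+\sigma_{\xi_o}^2/\lambda_o^2}}\,\frac{1-\lambda_h/\lambda_f}{\sqrt{1+\sigma_{\xi_f}^2/\lambda_f^2}}. \tag{7}$$
Here, sgn(x) denotes the sign of x. Note that the nonzero value of cor(o|h, f|h) under the null hypothesis is not an effect of the finite sample size. So, the value of cor(o|h, f|h) is not directly related to its significance. We note that all three statistics depend on $\lambda_o$ and $\sigma_{\xi_o}$ only through the signal-to-noise ratio $\lambda_o/\sigma_{\xi_o}$. We also note that while the difference and the residual statistics only depend on the signal-to-noise ratios $\lambda_f/\sigma_{\xi_f}$ and $\lambda_h/\sigma_{\xi_h}$, this is not the case for the split statistic, which depends explicitly on $\lambda_f$ and $\lambda_h$ and also has the peculiarity that it does not depend on $\sigma_{\xi_h}$.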
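For quick numerical checks, the large-N expressions for the three statistics can be evaluated directly. The sketch below encodes Eqs. (5) to (7) as we read them; the function name is hypothetical, and the code assumes positive signal amplitudes and a nonzero noise amplitude in h.

```python
import numpy as np

def asymptotic_statistics(lam_o, lam_h, lam_f, sig_o, sig_h, sig_f):
    """Large-N values of the (difference, residual, split) statistics."""
    obs = 1.0 / np.sqrt(1.0 + sig_o**2 / lam_o**2)
    diff = obs * (1.0 / np.sqrt(1.0 + sig_f**2 / lam_f**2)
                  - 1.0 / np.sqrt(1.0 + sig_h**2 / lam_h**2))           # Eq. (5)
    gamma = 1.0 / (1.0 + lam_h**2 / sig_h**2)
    resid = (np.sign(lam_f * lam_o) * gamma
             / (np.sqrt(gamma + sig_o**2 / lam_o**2)
                * np.sqrt(gamma + sig_f**2 / lam_f**2)))                # Eq. (6)
    split = obs * (1.0 - lam_h / lam_f) / np.sqrt(1.0 + sig_f**2 / lam_f**2)  # Eq. (7)
    return float(diff), float(resid), float(split)

# h = o + xi_h, f = o + xi_f (null hypothesis holds)
null_case = asymptotic_statistics(1.0, 1.0, 1.0, 0.0, 1.0, 1.0)
# h = o + xi_h, f = o + 0.5 xi_f (forecast has less noise)
better_forecast = asymptotic_statistics(1.0, 1.0, 1.0, 0.0, 1.0, 0.5)
```

For the first configuration the residual statistic evaluates to 1/sqrt(3), about 0.58, while the other two statistics vanish. For the second, the difference statistic is about 0.19, the residual statistic sqrt(2/3), about 0.82, and the split statistic is again zero.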

a. Simple examples

We first briefly look at some simple examples to illustrate the difference between the statistics.

In the first example, we let the contribution of the observations to f and h be identical, $\sigma_{\xi_h} = \sigma_{\xi_f} = 1$ and $\lambda_h = \lambda_f = 1$, thus fulfilling the null hypothesis. We also choose $\sigma_{\xi_o} = 0$ and N = 100. The model then reduces to $h = o + \xi_h$ and $f = o + \xi_f$, where o is a normalized first-order autoregressive (AR1) series. For the realization shown in Fig. 1, we have cor(h, o) = 0.72 and cor(f, o) = 0.71, so the difference statistic is 0.01. The residual statistic is 0.55, and the split statistic is 0.01 (the statistics are shown with cyan vertical lines in Fig. 1). Thus, while the difference and split statistics are close to zero, this is certainly not the case for the residual statistic. A simple test based on the residual statistic assuming that o|h and f|h are independent will wrongly suggest that the null hypothesis can be rejected.

Fig. 1.

Cyan vertical lines indicate the values of the statistics for an example where o is a normalized AR1 series with coefficient 0.5, h = o + ξh, and f = o + ξf, where ξh and ξf are normalized, Gaussian white noise. This example fulfills the null hypothesis. (left) Difference statistic. (center) Residual statistic. (right) Split statistic. The histograms show the distributions of the statistics under the null hypothesis that h and f are identically distributed as calculated with the surrogate method. The sample size is N = 100. The red vertical lines show the values of the statistics from Eqs. (5) to (7) valid for N → ∞.

Citation: Journal of Climate 36, 9; 10.1175/JCLI-D-22-0605.1

The distributions of the statistics under the null hypothesis are also shown in Fig. 1. We see that the original values of the statistics fall well inside these distributions, and we can therefore not reject the null hypothesis in any of the three tests.

We now look at another simple example (Fig. 2) where the forecast is superior. We take o as before, but now h = o + ξh and f = o + 0.5ξf. For this realization we have the difference statistic = 0.17, the residual statistic = 0.81, and the split statistic = −0.03. Now the values for the difference and the residual statistics fall clearly outside the distributions under the null hypothesis as calculated with the surrogate method. However, here the split statistic disagrees, and we will not reject the null hypothesis according to this statistic.

Fig. 2.

As in Fig. 1, but now h = o + ξh, and f = o + 0.5ξf, where ξh and ξf are normalized white noise. This example does not fulfill the null hypothesis.

Equations (5)–(7), which are valid for large sample sizes N, agree well with the values of the statistics from the realizations, as shown with the red vertical lines in Figs. 1 and 2.

From Eqs. (5)–(7), we see that under the null hypothesis, $\sigma = \sigma_{\xi_f} = \sigma_{\xi_h}$ and $\lambda = \lambda_f = \lambda_h$, the difference and the split statistics are 0 while the residual statistic becomes $\{(2+\sigma^2/\lambda^2)[1+(1+\lambda^2/\sigma^2)\,\sigma_{\xi_o}^2/\lambda_o^2]\}^{-1/2}$. This will depend nonmonotonically on the signal-to-noise ratio $\lambda/\sigma$ when $\sigma_{\xi_o}$ is different from zero. When estimating the skill of forecasts and historical experiments we often consider ensemble means. As the effective noise amplitude scales as $1/\sqrt{K}$, where K is the ensemble size, this nontrivial behavior is also seen as a function of ensemble size.

As a simple example we extend the model in Eq. (4). Now $h_i = s + \sigma\xi_h$ and $f_i = s + \sigma\xi_f$, for i = 1, …, K, where K is the ensemble size, and $o = s + \xi_o$. In this example, the null hypothesis holds. We consider the statistic $\mathrm{cor}(o|\bar{h}, \bar{f}|\bar{h})$, where the overline denotes the ensemble mean. Figure 3 shows the residual statistic as a function of ensemble size and the noise amplitude σ. The nonmonotonic structure means that we cannot use the statistic to meaningfully estimate the effect of ensemble size. Likewise, we cannot directly compare the statistics in different geographical regions, as these regions may have different amounts of noise. These results are robust and hold for other choices of the signal s.
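The nonmonotonic dependence on ensemble size can be made explicit with a short worked calculation. This is a sketch in the large-N limit only, so it need not reproduce the finite-N values of Fig. 3. It assumes that the K members have independent noise of amplitude σ, so that the ensemble means have effective noise amplitude σ/√K, and it uses the null-hypothesis expression for the residual statistic with λ = λ_o = 1 and unit observational noise; the symbols x and r are introduced here for convenience.

```latex
% Residual statistic under the null hypothesis, large-N limit,
% with \lambda = \lambda_o = 1, \sigma_{\xi_o} = 1, and x \equiv \sigma^2/K:
\begin{align}
r(x) &= \left\{\left(2 + x\right)\left[1 + \left(1 + \frac{1}{x}\right)\right]\right\}^{-1/2}
      = \left[\left(2 + x\right)\left(2 + \frac{1}{x}\right)\right]^{-1/2} \\
     &= \left[5 + 2\left(x + \frac{1}{x}\right)\right]^{-1/2}.
\end{align}
% Since x + 1/x \ge 2 with equality at x = 1, r has its maximum r = 1/3
% at K = \sigma^2 and decreases toward 0 both for K \ll \sigma^2 and K \gg \sigma^2.
```

In this limit the statistic peaks where the effective ensemble-mean noise matches the signal amplitude, so a larger ensemble can either raise or lower its value depending on σ.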

Fig. 3.

The residual statistic $\mathrm{cor}(o|\bar{h}, \bar{f}|\bar{h})$ as a function of ensemble size K and noise amplitude σ. The model is $h_i = s + \sigma\xi_h$ and $f_i = s + \sigma\xi_f$, i = 1, …, K, and $o = s + \xi_o$. Here, s is a normalized AR1 series, and N = 100. Note the nonlinear scales on the axes.

b. Power of the tests

The power of a test is its probability of correctly rejecting the null hypothesis. The power depends both on the design of the test and the length of the time series (Sienz et al. 2016; Totaro et al. 2020)—or more precisely on the independent degrees of freedom. A power curve is the power shown as a function of a parameter. The well-known structure of the power curve and its relation to type I and II errors are shown schematically in Fig. 4. A perfect test would have the power close to one when the null hypothesis does not hold, that is, almost always correctly rejecting the null hypothesis and thereby minimizing the risk of type II errors. Likewise, the power curve would have small values when the null hypothesis holds, that is, having only a small risk of erroneously rejecting a true null hypothesis and thereby minimizing the risk of type I errors.

Fig. 4.

Schematic view of a power curve. The power p is shown as function of a parameter (full black curve). The null hypothesis is fulfilled only when the parameter is zero. Away from zero, p is the probability that the false null hypothesis is correctly rejected (indicated by green hatching). The probability that the false null hypothesis is incorrectly not rejected (type II error) is 1 − p (indicated by red hatching). At zero—where the null hypothesis is fulfilled—the length of the green line, 1 − p, indicates the probability to correctly not reject the true null hypothesis, while the length of the red line p indicates the probability to incorrectly reject the true null hypothesis (type I error).

1) The simple idealized model

We now fix $\sigma_{\xi_h}$, $\sigma_{\xi_o}$, $\lambda_h$, and $\lambda_o$, while we let $\lambda_f$ (or $\sigma_{\xi_f}$) vary, and for each $\lambda_f$ (or $\sigma_{\xi_f}$) we consider many realizations (2500) of o, h, and f. For each realization we calculate the statistical significance with the phase-scrambling method and note if the null hypothesis is rejected (at the 95% level). From the 2500 realizations, we then calculate the power as the rate of rejections.
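The procedure can be sketched for the difference statistic as below. This is an illustrative implementation under assumptions, not the authors' code: all function names are hypothetical, the model is h = o + ξh and f = λf o + ξf with o a standardized AR1 series, the surrogates follow Eqs. (2) and (3) with plain phase scrambling, and the default numbers of realizations and surrogates are much smaller than in the text to keep the run time short. The rejection rates it produces are therefore noisy and should not be expected to reproduce the published curves.

```python
import numpy as np

def ar1(n, coeff, rng, burn_in=100):
    """Standardized AR1 series of length n."""
    eps = rng.standard_normal(n + burn_in)
    x = np.empty(n + burn_in)
    x[0] = eps[0]
    for t in range(1, n + burn_in):
        x[t] = coeff * x[t - 1] + eps[t]
    x = x[burn_in:]
    return (x - x.mean()) / x.std()

def phase_scramble(x, rng):
    """Randomize the Fourier phases of x, keeping its power spectrum."""
    n = x.size
    spectrum = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0
    if n % 2 == 0:
        phases[-1] = 0.0
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n)

def cor(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def rejects(o, h, f, rng, n_surr=200, alpha=0.05):
    """Two-sided surrogate test of the difference statistic for one realization."""
    oc, hc = o - o.mean(), h - h.mean()
    c = cor(oc, hc)
    a = np.sqrt(1.0 / c**2 - 1.0)                       # Eq. (3)
    observed = cor(f, o) - c
    null = np.empty(n_surr)
    for k in range(n_surr):
        hp = phase_scramble(hc, rng)
        f_star = oc / np.linalg.norm(oc) + a * hp / np.linalg.norm(hp)  # Eq. (2)
        null[k] = cor(f_star, oc) - c   # rescaling f* would not change a correlation
    lower, upper = np.quantile(null, [alpha / 2.0, 1.0 - alpha / 2.0])
    return bool(observed < lower or observed > upper)

def power(lam_f, n=100, n_real=200, n_surr=200, seed=0):
    """Rejection rate over n_real realizations of o, h = o + xi_h, f = lam_f*o + xi_f."""
    rng = np.random.default_rng(seed)
    n_reject = 0
    for _ in range(n_real):
        o = ar1(n, 0.5, rng)
        h = o + rng.standard_normal(n)
        f = lam_f * o + rng.standard_normal(n)
        n_reject += rejects(o, h, f, rng, n_surr)
    return n_reject / n_real
```

Calling `power` for a range of `lam_f` values traces out a power curve; at `lam_f = 1` the returned value estimates the actual type I error rate of this particular sketch.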

Power curves, giving the power as a function of $\lambda_f$, are shown in Fig. 5 for $\sigma_{\xi_h} = \sigma_{\xi_f} = \lambda_h = 1$, $\sigma_{\xi_o} = 0$, and sample sizes N = 25, 100, and 1000. The power is shown with the full curves for all three statistics: difference (black), residual (blue), and split (orange). As expected, we find for all statistics that the power increases when $\lambda_f$ moves away from 1. We also see that the power increases with N and that in the limit of large N the power converges to 1 when $\lambda_f$ is different from 1.

Fig. 5.

The power (the probability of rejecting the null hypothesis) of the statistical tests as a function of $\lambda_f$ for different values of the sample size N: 25, 100, and 1000. Here, $h = o + \xi_h$ and $f = \lambda_f o + \xi_f$, where o is a normalized AR1 process, and $\xi_h$ and $\xi_f$ are normalized white noise. This corresponds to Eq. (4) with $\sigma_{\xi_h} = \sigma_{\xi_f} = \lambda_h = 1$ and $\sigma_{\xi_o} = 0$. Black, blue, and orange full curves are the power for the difference statistic cor(f, o) − cor(h, o), the residual statistic cor(o|h, f|h), and the split statistic $\mathrm{cor}(f-h,o)\,\sigma_{f-h}/\sigma_f$, respectively. Dashed lines are power calculated with the optimal benchmark method.

More interesting in our context, we find that the largest power is found when using the split statistic. Also, the power of the difference statistic is in general a bit larger than the power of the residual statistic. These differences are largest for small N (sample size) and decrease with N. In particular, the difference and residual statistics show almost the same power for N ≥ 100, while they still differ considerably from the split statistic.

Unfortunately, a high power is here connected to a high risk of falsely rejecting a true null hypothesis (type I error). For $\lambda_f = 1$, where the null hypothesis holds, the rejection rates for the residual and difference statistics are around 0.2 for N = 25 and decrease toward the nominal value of 0.05 for large N. The situation is worse for the split statistic, for which the rejection rate is still around 0.3 even for N = 1000.

Above we calculated the surrogate time series by the phase-scrambling procedure based on given realizations of h and o. This corresponds to realistic situations. In the simple tests we know the process $h = \lambda_h s + \sigma_{\xi_h}\xi_h$ and can produce surrogates directly from this process, as discussed in section 2b. The power calculated this way is shown as dashed curves in Fig. 5. Here, the probability for rejecting a true null hypothesis is the nominal 5% for all N and for all statistics. For small N the difference statistic still has a bit more power than the residual and the split statistics. Compared to this optimal benchmark, the surrogate method based on the phase-scrambling procedure has to estimate the parameter a in Eq. (2) and in general underestimates the power.

In Fig. 6, we show some other examples of the power. In Fig. 6a, we vary $\sigma_{\xi_f}$ and keep the other parameters constant. While the difference and residual statistics behave very much as when we varied $\lambda_f$ (Fig. 5), the split statistic totally fails. This is due to the missing sensitivity of $\sigma_{\xi_h}$, as seen in Eq. (7), and holds also for the benchmark. Figures 6b and 6c show situations when there is also noise in the observations, that is, when $\sigma_{\xi_o} \neq 0$. Here, we see that the difference statistic behaves very sensibly, while the residual statistic does not have the minimum in power when the null hypothesis holds. As this is not seen for the optimal benchmark, although the difference statistic still has a little more power than the residual statistic, it must be related to a combination of the phase-scrambling method and the statistics.

Fig. 6.

(a) Power as a function of $\sigma_{\xi_f}$ with $h = o + \xi_h$ and $f = o + \sigma_{\xi_f}\xi_f$. Here, o is a normalized AR1 process. (b) Power as a function of $\lambda_f$ with $f = \lambda_f s + \xi_f$. (c) Power as a function of $\sigma_{\xi_f}$ with $f = s + \sigma_{\xi_f}\xi_f$. In (b) and (c), $h = s + \xi_h$, $o = s + 0.5\xi_o$, and s is a normalized AR1 process. The terms $\xi_h$, $\xi_f$, and $\xi_o$ are normalized white noise. N = 100 in all panels.

2) Including trends

When applying the surrogate method to climate models—and in particular to temperature—we have to consider the trends. Trends are not treated well by the phase-scrambling method (Schreiber and Schmitz 2000) and should not be included in the stochastic part of the process. When calculating the surrogate f*, we therefore first detrend o and h and use the detrended values in Eqs. (2) and (3). The trend from h is then added to f*. We find that the method in general gives good results when applied to the idealized models. The trend is estimated as a third-order polynomial, but the results are robust to changes in the order. We also note that the results in section 4 are robust to the changes in the order of the polynomial detrending.
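The detrending step can be sketched as follows. This is a minimal illustration with hypothetical function names: the trend is estimated with a least-squares polynomial fit on a time axis scaled to [0, 1), o and h are detrended before the surrogate is built from Eqs. (2) and (3), and the trend of h is added back at the end. It assumes the detrended correlation between o and h is positive and nonzero.

```python
import numpy as np

def poly_trend(x, order=3):
    """Least-squares polynomial trend of x on a time axis scaled to [0, 1)."""
    t = np.arange(x.size) / x.size
    return np.polyval(np.polyfit(t, x, order), t)

def phase_scramble(x, rng):
    """Randomize the Fourier phases of x, keeping its power spectrum."""
    n = x.size
    spectrum = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0
    if n % 2 == 0:
        phases[-1] = 0.0
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n)

def surrogate_with_trend(o, h, rng, order=3):
    """Surrogate f*: built from detrended o and h, then the trend of h is added."""
    trend_h = poly_trend(h, order)
    o_d = o - poly_trend(o, order)      # fit residuals have zero mean
    h_d = h - trend_h
    c = np.corrcoef(o_d, h_d)[0, 1]
    a = np.sqrt(1.0 / c**2 - 1.0)                                   # Eq. (3)
    hp = phase_scramble(h_d, rng)
    f_star = o_d / np.linalg.norm(o_d) + a * hp / np.linalg.norm(hp)  # Eq. (2)
    f_star *= np.linalg.norm(h_d) / np.linalg.norm(f_star)
    return f_star + trend_h
```

Keeping the trend out of the phase-scrambled part means the surrogate inherits the deterministic trend of h exactly, while only the detrended fluctuations are randomized.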

The left panels in Figs. 7a and 7c show examples where the time series are generated as in Fig. 5b but now s includes a nonlinear trend. Likewise, the right panels in Figs. 7b and 7d are generated as in Fig. 6b but now with the trend. The trend has the form of a fourth-order polynomial. Not taking special care of the trends in the phase scrambling gives rather wide power curves (Figs. 7a,b) compared to the situation without trend (Figs. 5b and 6b). Using the detrending procedure described in the previous paragraph improves the results and gives power curves (Figs. 7c,d) closely resembling those of Figs. 5b and 6b. We have confirmed that similar results are obtained with other powers of the true trend.

Fig. 7.

Including a trend in s of the form $6t^4$, with t = (0, 1, …, N − 1)/N. Power as a function of $\lambda_f$. N = 100 in all panels. (left) Parameters are as in Fig. 5b, including $\sigma_{\xi_o} = 0$. (right) As in Fig. 6b, including $\sigma_{\xi_o} = 0.5$. (top) No detrending in significance estimation. (bottom) The third-order polynomial is removed. Legend as in Figs. 5 and 6.

4. Climate models

In this section we consider the near-surface temperature from the historical single-model ensemble performed in the Community Earth System Model (CESM) Large Ensemble Simulation (LENS; Kay et al. 2015). We also use the corresponding ensemble of initialized decadal forecasts from the CESM Decadal Prediction Large Ensemble (DPLE; Yeager et al. 2018). Both ensembles have 40 members and use the same code and external forcing datasets. As observations, we use NCEP–NCAR data (Kalnay et al. 1996). Prior to the analysis, model data are interpolated to the horizontal NCEP global grid, 2.5° × 2.5°, using a simple nearest neighbor procedure. Our analyses are based on annual means for the period 1970–2017. As we use annual means we do not expect the choice of reanalysis to be important. To be sure we have confirmed that we get similar results using the JRA-55 (Kalnay et al. 1996). The DPLE experiments are initialized in November every year and a lead time of 1 year will refer to the mean over the first full calendar year. When calculating the statistical significance of the statistics we use the phase-scrambling method described in section 2b amended with the approach described in section 3b(2) dealing with trends.

Figure 8 shows the correlations between the ensemble mean of the historical ensemble and observations and the correlations between the ensemble mean of the initialized decadal prediction ensemble and observations for lead times of 1 and 10 years. There are strong positive and significant correlations between the historical ensemble and observations (top-left panel). The correlations between the forecast ensemble and observations are comparable for a lead time of 10 years and somewhat larger for a lead time of 1 year (bottom panels). The top-right panel in Fig. 8 demonstrates the challenges of isolating the effect of initializations by showing that large and significant correlations between observations and the historical ensemble mean persist even after the linear trends have been removed.

Fig. 8.

Results for LENS annual-mean near-surface temperature. (top left) Correlations between observations and historical ensemble mean. (top right) As in the top left, but now time series have been detrended. (bottom) Correlations between observations and the initialized ensemble mean for lead times of (left) 1 year and (right) 10 years. The period is 1970–2017. Dots indicate where correlations are significantly different from zero at the 95% level.


In Fig. 9, we show the three statistics for a lead time of 1 year. The difference statistic shows large values only in the eastern Pacific and in the Atlantic subpolar gyre region. In contrast, the residual statistic is large in more extended areas. However, the two statistics reject the null hypothesis (that historical and initialized experiments have the same skill) in many of the same regions: the eastern Pacific [the region of the El Niño–Southern Oscillation (ENSO)] and the subpolar gyre region. The split statistic is statistically significant in even larger areas. Note that there are many areas where the statistics are negative and significant, indicating that the historical ensemble has more skill than the initialized ensemble. Figure 10 shows the situation for a lead time of 10 years. Again, the values of the residual statistic are larger than the values of the difference statistic, but the two statistics mainly agree on the regions where the null hypothesis can be rejected. These are mainly the subpolar gyre region, as also found in other studies, and the polar regions, although here the significant correlations may be of different sign. Again, the split statistic is significant in larger areas.

Fig. 9.

The three statistics measuring the difference in skill between the initialized ensemble-mean forecast and the uninitialized historical ensemble mean for annual-mean near-surface temperature. (top left) The difference statistic, (top right) the residual statistic, and (bottom left) the split statistic. Based on LENS ensemble means for 1970–2017. A lead time of 1 year is used for the initialized forecast. Dots indicate where the null hypothesis—that historical and initialized experiments have the same skill—can be rejected at the 95% level.


Fig. 10.

As in Fig. 9, but for a lead time of 10 years.


The left panel in Fig. 11 shows the fractional area of significant correlations for the three statistics as a function of lead time. For a lead time of 1 year the difference statistic is significant in 60% of the globe, and for a lead time of 10 years this fraction is reduced to 15%. The corresponding fractions for the residual statistic are 50% and 20%, and for the split statistic 78% and 33%. While these fractions are larger than the nominal 5% even for a lead time of 10 years, they are not impressive when compared to the results in Fig. 5. There we found that the risk of rejecting a true null hypothesis is around 10%–20% for time series of length 25–100. Furthermore, the area of significant positive values of the statistic is larger than the area of significant negative values only for lead times less than 3–4 years (right panel in Fig. 11). This holds for both the difference and the residual statistics and indicates that the significance for larger lead times may be due to chance.
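As an illustration of how such area fractions can be computed on a regular latitude-longitude grid, the sketch below weights each grid point by the cosine of its latitude. The weighting, the function name `significant_area_fraction`, and the synthetic input fields are assumptions made for this example; the text does not specify how the areas were calculated.

```python
import numpy as np

def significant_area_fraction(stat, significant, lat):
    """Area-weighted fractions of the globe where a statistic is significant.

    stat:        (nlat, nlon) array with the value of the statistic
    significant: (nlat, nlon) boolean array, True where the null is rejected
    lat:         (nlat,) latitudes in degrees
    Returns (total, positive, negative) fractions of the global area.
    """
    stat = np.asarray(stat, dtype=float)
    significant = np.asarray(significant, dtype=bool)
    # cos(latitude) is proportional to the grid-cell area on a regular grid
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(stat)
    weights = weights / weights.sum()
    total = float(weights[significant].sum())
    positive = float(weights[significant & (stat > 0)].sum())
    negative = float(weights[significant & (stat < 0)].sum())
    return total, positive, negative

# Synthetic 2.5 x 2.5 degree example (73 x 144 grid points), random fields only.
rng = np.random.default_rng(0)
grid_lat = np.linspace(90.0, -90.0, 73)
field = rng.standard_normal((73, 144))
rejected = rng.random((73, 144)) < 0.15
print(significant_area_fraction(field, rejected, grid_lat))
```

Comparing the positive and negative fractions separately, as in the right panel of Fig. 11, helps to distinguish added skill from rejections that occur by chance, since chance rejections are expected to fall on both signs.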

Fig. 11.

(left) The fraction of the global area where the skills in forecast and historical experiments are estimated to be significantly different. The fraction is shown as a function of lead time for each of the three statistics. (right) The fraction of the global area where the skills are significantly different calculated for grid points with positive and negative correlations, respectively.


5. Conclusions

Attempts to determine whether initialization improves forecasts relative to uninitialized forecasts are hampered by the skill coming from the shared forcings. The skill from the forcings can be large, in particular for recent decades, because of the increased levels of greenhouse gases, whereas the added skill from initialization is often relatively small. It is therefore important to carefully estimate the statistical significance of the added skill.

Previous literature has used different correlation-based statistics to measure the added skill from initialization. The simplest statistic is the difference between the correlation of observations with the initialized experiment and the correlation of observations with the uninitialized experiment. Another statistic is the partial correlation between observations and the initialized forecast given the uninitialized forecast, that is, the correlation of the residuals after the contributions linearly congruent to the uninitialized forecast have been removed from the observations and the initialized forecast. The third statistic is based on the correlation between the difference of the initialized and uninitialized experiments and the observations.
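The three statistics can be written compactly in code. The following is a minimal sketch for single time series o (observations), h (uninitialized), and f (initialized); the function names are illustrative, and the scaling of the split statistic by $\sigma_{f-h}/\sigma_f$ follows the expression used in the appendix.

```python
import numpy as np

def corr(x, y):
    """Pearson correlation of two 1D series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

def residual(y, x):
    """y with the part linearly congruent with x removed (both centered)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return y - (x @ y) / (x @ x) * x

def difference_stat(o, h, f):
    """cor(f, o) - cor(h, o)."""
    return corr(f, o) - corr(h, o)

def residual_stat(o, h, f):
    """Partial correlation cor(o|h, f|h)."""
    return corr(residual(o, h), residual(f, h))

def split_stat(o, h, f):
    """cor(f - h, o) * sigma_{f-h} / sigma_f."""
    d = np.asarray(f, dtype=float) - np.asarray(h, dtype=float)
    return corr(d, o) * float(np.std(d) / np.std(f))

# Synthetic example: the initialized series has less noise than the
# uninitialized one, so the null hypothesis is not fulfilled.
rng = np.random.default_rng(0)
signal = rng.standard_normal(100)
obs = signal
hist = signal + rng.standard_normal(100)
fcst = signal + 0.5 * rng.standard_normal(100)
print(difference_stat(obs, hist, fcst),
      residual_stat(obs, hist, fcst),
      split_stat(obs, hist, fcst))
```

The values alone say nothing about significance; they must be compared with the distribution of each statistic under the null hypothesis, which is what the surrogate data method provides.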

We have studied the properties of the three statistics and used a Monte Carlo procedure to test the statistical significance. The Monte Carlo procedure is based on surrogate data with the same temporal characteristics as the original time series and the same relation to the observations. We first considered examples based on a simple model to study the power of the tests and then considered decadal forecasts with the CESM climate model. For the simple model, we derived analytical expressions for the statistics in the limit of large sample size.

Our main findings from the idealized model are as follows:

  • The residual statistic is in general not zero under the null hypothesis that initialized and uninitialized forecasts are drawn from the same process, even in the limit of large samples. Furthermore, the value of the residual statistic does not vary monotonically with the signal-to-noise ratio. For ensemble-mean forecasts, it therefore does not vary monotonically with the ensemble size either (Fig. 3).

  • The split statistic does not recognize differences in the signal-to-noise ratio but only differences in the absolute level of the signal. This statistic therefore fails in some circumstances, but it is worth noting that if h and f are normalized, then the split statistic becomes identical to the difference statistic [Eq. (1)].

  • The power of the difference statistic and that of the residual statistic are comparable when observations are noise free. When observations are noisy, the difference statistic is superior.

  • We also calculated a benchmark for the power based on the full information from the idealized model. The problems for the split statistic still remain. The difference and the residual statistics are still comparable when observations are noise free. However, when observations are noisy the difference between these statistics has been reduced, although the difference statistic is still behaving somewhat better.

  • The rejection rate when the null hypothesis is true is larger than the nominal value for small lengths of the time series but gets closer to the nominal value for long time series. This holds for all three statistics and the problem is quite severe for realistic sample lengths up to 100.

  • Secular trends cannot be directly dealt with by the phase-scrambling (or bootstrap) procedure. We find that generating surrogates from detrended data and then adding the trend back to these surrogates gives good results.

Note that the first bullet point means that the residual statistic cannot be compared across different geographical regions with different signal-to-noise ratios.
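The treatment of trends in the last bullet point can be sketched as follows: fit a low-order polynomial, phase scramble the detrended residual, and add the fitted trend back. This is a simplified, single-series illustration under stated assumptions; the function names and the default polynomial order are choices made for this example, and the sketch does not reproduce the part of the full procedure that preserves the relation between the series and the observations.

```python
import numpy as np

def phase_scramble(x, rng):
    """Surrogate with the same power spectrum as x but random Fourier phases."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spec = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spec.size)
    phases[0] = 0.0               # the zero-frequency component stays real
    if n % 2 == 0:
        phases[-1] = 0.0          # the Nyquist component stays real
    surrogate = np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n)
    return surrogate + x.mean()

def detrended_surrogate(x, rng, order=3):
    """Phase scramble the detrended series, then add the trend back."""
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size) / x.size                 # t = (0, 1, ..., N-1)/N
    trend = np.polyval(np.polyfit(t, x, order), t)
    return trend + phase_scramble(x - trend, rng)

# Synthetic example: white noise on top of a trend of the form 6 t^4.
rng = np.random.default_rng(0)
time = np.arange(100) / 100
series = 6.0 * time**4 + rng.standard_normal(100)
surrogates = np.array([detrended_surrogate(series, rng) for _ in range(200)])
print(surrogates.shape)
```

Each surrogate keeps the fitted trend and the power spectrum of the detrended residual, so the spread among surrogates reflects only the stochastic part of the series.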

From the climate model analysis, we find the following:

  • When comparing the near-surface temperature in initialized decadal forecasts with uninitialized historical simulations, we find larger values of the residual statistic than of the difference statistic. However, the two statistics are statistically significant in mainly the same regions. These results are not unexpected considering the results from the simple examples.

  • For the near-surface temperature, the initialized forecasts add significant skill for lead times of a few years, in particular in the regions influenced by ENSO. For decadal lead times, the area of statistical significance is consistent with the null hypothesis (no added skill from initialization) when the results from the simple examples are taken into consideration. This interpretation is supported by the fact that the statistics are negative in large parts of the significant areas, indicating that initialization has deteriorated the predictions there.

Based on our results, we recommend using the difference statistic while keeping in mind that its significance can be overestimated because of the considerable risk of falsely rejecting a true null hypothesis.

The discussion in this paper has mainly been in the context of decadal forecasts but is also relevant for other situations where different forecast systems are compared. In particular, seasonal forecasts share many of the same challenges as decadal forecasts as the skill depends both on forcings and initial conditions.

Acknowledgments.

This work is supported by the NordForsk-funded Nordic Centre of Excellence project (Award 76654) Arctic Climate Predictions: Pathways to Resilient, Sustainable Societies (ARCPATH) and by the project European Climate Prediction System (EUCP) funded by the European Union under Horizon 2020 (Grant Agreement 776613).

Data availability statement.

The CESM Large Ensemble Project (http://www.cesm.ucar.edu/projects/community-projects/LENS/) was downloaded via ESGF from http://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html. The CESM Decadal Prediction Large Ensemble Project (http://www.cesm.ucar.edu/projects/community-projects/DPLE/) was downloaded via ESGF from http://www.cesm.ucar.edu/projects/community-projects/DPLE/data-sets.html. The NCEP–NCAR Reanalysis data were provided by the NOAA–CIRES Climate Diagnostics Center, Boulder, Colorado, from their website at http://www.cdc.noaa.gov/.

Footnotes

1

The part of y linearly congruent with x is found by writing y = ax + ξ, where ξ and x are independent. The linearly congruent part is then ax and the residual y|x = ξ.

2

As we are dealing with correlations, we could also define the null hypothesis by identical signal-to-noise ratios, $\lambda_h/\sigma_{\xi_h} = \lambda_f/\sigma_{\xi_f}$, but this choice will not have any consequences for the rest of the paper.

APPENDIX

Derivations of Analytical Expressions for the Statistics for Large Sample Size

The analysis of climate data can often be simplified using the blessings of dimensionality (Christiansen 2018, 2021). When the sample size of the time series is large—or more precisely when the number of effective degrees of freedom in the time series is large—we can utilize the nonintuitive properties of high-dimensional spaces. The properties we will consider here are that vectors drawn independently from the same distribution have the same lengths and that independent vectors are orthogonal. The mathematical background and the extent to which they can be applied to climatic fields and time series are discussed in Christiansen (2021). Several analytical results based on the blessings of dimensionality are derived in the paper cited above and in Christiansen (2019, 2020) and Christiansen et al. (2022).

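Let us first consider $\|f\|^2/N$ in some detail. Using Eq. (4) and expanding, we get
$$\|f\|^2/N = f \cdot f/N = (\lambda_f s + \sigma_{\xi_f}\xi_f) \cdot (\lambda_f s + \sigma_{\xi_f}\xi_f)/N = \lambda_f^2 \|s\|^2/N + \sigma_{\xi_f}^2 \|\xi_f\|^2/N + 2\lambda_f \sigma_{\xi_f}\, s \cdot \xi_f/N.$$
As $s$ and $\xi_f$ are independent, the last term disappears for large N. Both $\|s\|^2/N$ and $\|\xi_f\|^2/N$ converge to the constant 1. We thus get the unsurprising result
$$\|f\|^2/N = \lambda_f^2 + \sigma_{\xi_f}^2.$$
Likewise, we have $\|h\|^2/N = \lambda_h^2 + \sigma_{\xi_h}^2$ and $\|o\|^2/N = \lambda_o^2 + \sigma_{\xi_o}^2$.
We now consider the derivation of Eq. (5). For centered series, we have $\mathrm{cor}(f, o) = f \cdot o/(\|f\|\,\|o\|)$. We have
$$f \cdot o = (\lambda_f s + \sigma_{\xi_f}\xi_f) \cdot (\lambda_o s + \sigma_{\xi_o}\xi_o) = \lambda_f \lambda_o \|s\|^2 + \lambda_f \sigma_{\xi_o}\, s \cdot \xi_o + \lambda_o \sigma_{\xi_f}\, s \cdot \xi_f + \sigma_{\xi_f}\sigma_{\xi_o}\, \xi_f \cdot \xi_o = \lambda_f \lambda_o N.$$
In the last step, we have used the orthogonality of independent vectors. Hence,
$$\mathrm{cor}(f, o) = \frac{\lambda_f \lambda_o N}{\|f\|\,\|o\|} = \frac{\lambda_f \lambda_o}{\sqrt{\lambda_o^2 + \sigma_{\xi_o}^2}\,\sqrt{\lambda_f^2 + \sigma_{\xi_f}^2}}.$$
With the similar expression for cor(h, o), we reach Eq. (5).
We then consider how Eq. (6) is derived. We have
$$o|h = o - \lambda_1 h = (\lambda_o - \lambda_1 \lambda_h)s + \sigma_{\xi_o}\xi_o - \lambda_1 \sigma_{\xi_h}\xi_h,$$
where
$$\lambda_1 = o \cdot h/\|h\|^2 = \lambda_o \lambda_h/(\lambda_h^2 + \sigma_{\xi_h}^2).$$
In the last step, we have again used the properties of high-dimensional spaces. Likewise, $f|h = (\lambda_f - \lambda_2 \lambda_h)s + \sigma_{\xi_f}\xi_f - \lambda_2 \sigma_{\xi_h}\xi_h$ with $\lambda_2 = \lambda_f \lambda_h/(\lambda_h^2 + \sigma_{\xi_h}^2)$. For the correlation between $o|h$ and $f|h$ we then arrive at Eq. (6) after again setting dot products between independent vectors to zero and some tedious reductions.
Finally, we consider the origin of Eq. (7). With $\sigma_o$ and $\sigma_f$ denoting the standard deviations of $o$ and $f$, we have
$$\mathrm{cor}(f - h, o)\,\sigma_{f-h}/\sigma_f = \frac{[(\lambda_f - \lambda_h)s + \sigma_{\xi_f}\xi_f - \sigma_{\xi_h}\xi_h] \cdot (\lambda_o s + \sigma_{\xi_o}\xi_o)}{N \sigma_o \sigma_f} = \frac{(\lambda_f - \lambda_h)\lambda_o}{\sqrt{\lambda_o^2 + \sigma_{\xi_o}^2}\,\sqrt{\lambda_f^2 + \sigma_{\xi_f}^2}},$$
from which the wanted expression follows.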
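The large-N expression for cor(f, o) derived above can be checked numerically. The sketch below is an illustration only: it uses Gaussian white noise for the signal and the noise terms (an assumption made for simplicity) and compares the sample correlation for a long series with the analytical limit. The parameter values are arbitrary.

```python
import numpy as np

def cor_large_n(lam_f, sig_f, lam_o, sig_o):
    """Large-N limit of cor(f, o) for f = lam_f*s + sig_f*xi_f, o = lam_o*s + sig_o*xi_o."""
    return lam_f * lam_o / np.sqrt((lam_o**2 + sig_o**2) * (lam_f**2 + sig_f**2))

rng = np.random.default_rng(1)
N = 200_000
lam_f, sig_f, lam_o, sig_o = 1.0, 0.5, 1.0, 0.5

# Independent, normalized signal and noise series
s, xi_f, xi_o = rng.standard_normal((3, N))
f = lam_f * s + sig_f * xi_f
o = lam_o * s + sig_o * xi_o

sample = float(np.corrcoef(f, o)[0, 1])
limit = float(cor_large_n(lam_f, sig_f, lam_o, sig_o))  # 1/1.25 = 0.8 for these values
print(sample, limit)
```

For short series the sample correlation scatters around the limit, which is why the significance of the statistics has to be estimated with a Monte Carlo method rather than read off from the analytical expressions.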

REFERENCES

  • Bilbao, R., and Coauthors, 2021: Assessment of a full-field initialized decadal climate prediction system with the CMIP6 version of EC-Earth. Earth Syst. Dyn., 12, 173–196, https://doi.org/10.5194/esd-12-173-2021.

  • Borchert, L. F., V. Koul, M. B. Menary, D. J. Befort, D. Swingedouw, G. Sgubin, and J. Mignot, 2021a: Skillful decadal prediction of unforced southern European summer temperature variations. Environ. Res. Lett., 16, 104017, https://doi.org/10.1088/1748-9326/ac20f5.

  • Borchert, L. F., M. B. Menary, D. Swingedouw, G. Sgubin, L. Hermanson, and J. Mignot, 2021b: Improved decadal predictions of North Atlantic subpolar gyre SST in CMIP6. Geophys. Res. Lett., 48, e2020GL091307, https://doi.org/10.1029/2020GL091307.

  • Christiansen, B., 2001: Downward propagation of zonal mean zonal wind anomalies from the stratosphere to the troposphere: Model and reanalysis. J. Geophys. Res., 106, 27 307–27 322, https://doi.org/10.1029/2000JD000214.

  • Christiansen, B., 2018: Ensemble averaging and the curse of dimensionality. J. Climate, 31, 1587–1596, https://doi.org/10.1175/JCLI-D-17-0197.1.

  • Christiansen, B., 2019: Analysis of ensemble mean forecasts: The blessings of high dimensionality. Mon. Wea. Rev., 147, 1699–1712, https://doi.org/10.1175/MWR-D-18-0211.1.

  • Christiansen, B., 2020: Understanding the distribution of multimodel ensembles. J. Climate, 33, 9447–9465, https://doi.org/10.1175/JCLI-D-20-0186.1.

  • Christiansen, B., 2021: The blessing of dimensionality for the analysis of climate data. Nonlinear Processes Geophys., 28, 409–422, https://doi.org/10.5194/npg-28-409-2021.

  • Christiansen, B., S. Yang, and D. Matte, 2022: The forced response and decadal predictability of the North Atlantic Oscillation: Nonstationary and fragile skills. J. Climate, 35, 5869–5882, https://doi.org/10.1175/JCLI-D-21-0807.1.

  • DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

  • DelSole, T., and M. K. Tippett, 2018: Predictability in a changing climate. Climate Dyn., 51, 531–545, https://doi.org/10.1007/s00382-017-3939-8.

  • Doblas-Reyes, F. J., and Coauthors, 2013: Initialized near-term regional climate change prediction. Nat. Commun., 4, 1715, https://doi.org/10.1038/ncomms2704.

  • Hermanson, L., and Coauthors, 2022: WMO global annual to decadal climate update: A prediction for 2021–25. Bull. Amer. Meteor. Soc., 103, E1117–E1129, https://doi.org/10.1175/BAMS-D-20-0311.1.

  • Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year reanalysis project. Bull. Amer. Meteor. Soc., 77, 437–471, https://doi.org/10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.

  • Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bull. Amer. Meteor. Soc., 96, 1333–1349, https://doi.org/10.1175/BAMS-D-13-00255.1.

  • Kirtman, B., and Coauthors, 2014: Near-term climate change: Projections and predictability. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 953–1028, https://doi.org/10.1017/CBO9781107415324.023.

  • Kushnir, Y., and Coauthors, 2019: Towards operational predictions of the near-term climate. Nat. Climate Change, 9, 94–101, https://doi.org/10.1038/s41558-018-0359-7.

  • Lancaster, G., D. Iatsenko, A. Pidde, V. Ticcinelli, and A. Stefanovska, 2018: Surrogate data for hypothesis testing of physical systems. Phys. Rep., 748, 1–60, https://doi.org/10.1016/j.physrep.2018.06.001.

  • Meehl, G. A., and Coauthors, 2021: Initialized Earth system prediction from subseasonal to decadal timescales. Nat. Rev. Earth Environ., 2, 340–357, https://doi.org/10.1038/s43017-021-00155-x.

  • Merryfield, W. J., and Coauthors, 2020: Current and emerging developments in subseasonal to decadal prediction. Bull. Amer. Meteor. Soc., 101, E869–E896, https://doi.org/10.1175/BAMS-D-19-0037.1.

  • Scaife, A. A., and D. Smith, 2018: A signal-to-noise paradox in climate science. npj Climate Atmos. Sci., 1, 28, https://doi.org/10.1038/s41612-018-0038-4.

  • Schreiber, T., and A. Schmitz, 1996: Improved surrogate data for nonlinearity tests. Phys. Rev. Lett., 77, 635–638, https://doi.org/10.1103/PhysRevLett.77.635.

  • Schreiber, T., and A. Schmitz, 2000: Surrogate time series. Physica D, 142, 346–382, https://doi.org/10.1016/S0167-2789(00)00043-9.

  • Sgubin, G., D. Swingedouw, L. F. Borchert, M. B. Menary, T. Noël, H. Loukos, and J. Mignot, 2021: Systematic investigation of skill opportunities in decadal prediction of air temperature over Europe. Climate Dyn., 57, 3245–3263, https://doi.org/10.1007/s00382-021-05863-0.

  • Siegert, S., O. Bellprat, M. Ménégoz, D. B. Stephenson, and F. J. Doblas-Reyes, 2017: Detecting improvements in forecast correlation skill: Statistical testing and power analysis. Mon. Wea. Rev., 145, 437–450, https://doi.org/10.1175/MWR-D-16-0037.1.

  • Sienz, F., W. A. Müller, and H. Pohlmann, 2016: Ensemble size impact on the decadal predictive skill assessment. Meteor. Z., 25, 645–655, https://doi.org/10.1127/metz/2016/0670.

  • Simpson, I. R., S. G. Yeager, K. A. McKinnon, and C. Deser, 2019: Decadal predictability of late winter precipitation in western Europe through an ocean–jet stream connection. Nat. Geosci., 12, 613–619, https://doi.org/10.1038/s41561-019-0391-x.

  • Smith, D. M., R. Eade, N. J. Dunstone, D. Fereday, J. M. Murphy, H. Pohlmann, and A. A. Scaife, 2010: Skilful multi-year predictions of Atlantic hurricane frequency. Nat. Geosci., 3, 846–849, https://doi.org/10.1038/ngeo1004.

  • Smith, D. M., and Coauthors, 2019: Robust skill of decadal climate predictions. npj Climate Atmos. Sci., 2, 13, https://doi.org/10.1038/s41612-019-0071-y.

  • Smith, D. M., and Coauthors, 2020: North Atlantic climate far more predictable than models imply. Nature, 583, 796–800, https://doi.org/10.1038/s41586-020-2525-0.

  • Solaraju-Murali, B., L.-P. Caron, N. Gonzalez-Reviriego, and F. J. Doblas-Reyes, 2019: Multi-year prediction of European summer drought conditions for the agricultural sector. Environ. Res. Lett., 14, 124014, https://doi.org/10.1088/1748-9326/ab5043.

  • Sospedra-Alfonso, R., and G. J. Boer, 2020: Assessing the impact of initialization on decadal prediction skill. Geophys. Res. Lett., 47, e2019GL086361, https://doi.org/10.1029/2019GL086361.

  • Theiler, J., S. Eubank, A. Longtin, B. Galdrikian, and J. Doyne Farmer, 1992: Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58, 77–94, https://doi.org/10.1016/0167-2789(92)90102-S.

  • Totaro, V., A. Gioia, and V. Iacobellis, 2020: Numerical investigation on the power of parametric and nonparametric tests for trend detection in annual maximum series. Hydrol. Earth Syst. Sci., 24, 473–488, https://doi.org/10.5194/hess-24-473-2020.

  • Trenberth, K. E., and D. J. Shea, 2006: Atlantic hurricanes and natural variability in 2005. Geophys. Res. Lett., 33, L12704, https://doi.org/10.1029/2006GL026894.

  • Wang, Y., F. Counillon, N. Keenlyside, L. Svendsen, S. Gleixner, M. Kimmritz, P. Dai, and Y. Gao, 2019: Seasonal predictions initialised by assimilating sea surface temperature observations with the EnKF. Climate Dyn., 53, 5777–5797, https://doi.org/10.1007/s00382-019-04897-9.

  • Yeager, S. G., and Coauthors, 2018: Predicting near-term changes in the Earth system: A large ensemble of initialized decadal prediction simulations using the Community Earth System Model. Bull. Amer. Meteor. Soc., 99, 1867–1886, https://doi.org/10.1175/BAMS-D-17-0098.1.
Save
  • Bilbao, R., and Coauthors, 2021: Assessment of a full-field initialized decadal climate prediction system with the CMIP6 version of EC-Earth. Earth Syst. Dyn., 12, 173196, https://doi.org/10.5194/esd-12-173-2021.

    • Search Google Scholar
    • Export Citation
  • Borchert, L. F., V. Koul, M. B. Menary, D. J. Befort, D. Swingedouw, G. Sgubin, and J. Mignot, 2021a: Skillful decadal prediction of unforced southern European summer temperature variations. Environ. Res. Lett., 16, 104017, https://doi.org/10.1088/1748-9326/ac20f5.

    • Search Google Scholar
    • Export Citation
  • Borchert, L. F., M. B. Menary, D. Swingedouw, G. Sgubin, L. Hermanson, and J. Mignot, 2021b: Improved decadal predictions of North Atlantic subpolar gyre SST in CMIP6. Geophys. Res. Lett., 48, e2020GL091307, https://doi.org/10.1029/2020GL091307.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., 2001: Downward propagation of zonal mean zonal wind anomalies from the stratosphere to the troposphere: Model and reanalysis. J. Geophys. Res., 106, 27 30727 322, https://doi.org/10.1029/2000JD000214.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., 2018: Ensemble averaging and the curse of dimensionality. J. Climate, 31, 15871596, https://doi.org/10.1175/JCLI-D-17-0197.1.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., 2019: Analysis of ensemble mean forecasts: The blessings of high dimensionality. Mon. Wea. Rev., 147, 16991712, https://doi.org/10.1175/MWR-D-18-0211.1.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., 2020: Understanding the distribution of multimodel ensembles. J. Climate, 33, 94479465, https://doi.org/10.1175/JCLI-D-20-0186.1.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., 2021: The blessing of dimensionality for the analysis of climate data. Nonlinear Processes Geophys., 28, 409422, https://doi.org/10.5194/npg-28-409-2021.

    • Search Google Scholar
    • Export Citation
  • Christiansen, B., S. Yang, and D. Matte, 2022: The forced response and decadal predictability of the North Atlantic Oscillation: Nonstationary and fragile skills. J. Climate, 35, 58695882, https://doi.org/10.1175/JCLI-D-21-0807.1.

    • Search Google Scholar
    • Export Citation
  • DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 46584678, https://doi.org/10.1175/MWR-D-14-00045.1.

    • Search Google Scholar
    • Export Citation
  • DelSole, T., and M. K. Tippett, 2018: Predictability in a changing climate. Climate Dyn., 51, 531545, https://doi.org/10.1007/s00382-017-3939-8.

    • Search Google Scholar
    • Export Citation
  • Doblas-Reyes, F. J., and Coauthors, 2013: Initialized near-term regional climate change prediction. Nat. Commun., 4, 1715, https://doi.org/10.1038/ncomms2704.

    • Search Google Scholar
    • Export Citation
  • Hermanson, L., and Coauthors, 2022: WMO global annual to decadal climate update: A prediction for 2021–25. Bull. Amer. Meteor. Soc., 103, E1117E1129, https://doi.org/10.1175/BAMS-D-20-0311.1.

    • Search Google Scholar
    • Export Citation
  • Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year reanalysis project. Bull. Amer. Meteor. Soc., 77, 437471, https://doi.org/10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bull. Amer. Meteor. Soc., 96, 13331349, https://doi.org/10.1175/BAMS-D-13-00255.1.

    • Search Google Scholar
    • Export Citation
  • Kirtman, B., and Coauthors, 2014: Near-term climate change: Projections and predictability. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 953–1028, https://doi.org/10.1017/CBO9781107415324.023.

  • Kushnir, Y., and Coauthors, 2019: Towards operational predictions of the near-term climate. Nat. Climate Change, 9, 94101, https://doi.org/10.1038/s41558-018-0359-7.

    • Search Google Scholar
    • Export Citation
  • Lancaster, G., D. Iatsenko, A. Pidde, V. Ticcinelli, and A. Stefanovska, 2018: Surrogate data for hypothesis testing of physical systems. Phys. Rep., 748, 160, https://doi.org/10.1016/j.physrep.2018.06.001.

    • Search Google Scholar
    • Export Citation
  • Meehl, G. A., and Coauthors, 2021: Initialized Earth system prediction from subseasonal to decadal timescales. Nat. Rev. Earth Environ., 2, 340357, https://doi.org/10.1038/s43017-021-00155-x.

    • Search Google Scholar
    • Export Citation
  • Merryfield, W. J., and Coauthors, 2020: Current and emerging developments in subseasonal to decadal prediction. Bull. Amer. Meteor. Soc., 101, E869E896, https://doi.org/10.1175/BAMS-D-19-0037.1.

    • Search Google Scholar
    • Export Citation
  • Scaife, A. A., and D. Smith, 2018: A signal-to-noise paradox in climate science. npj Climate Atmos. Sci., 1, 28, https://doi.org/10.1038/s41612-018-0038-4.

    • Search Google Scholar
    • Export Citation
  • Schreiber, T., and A. Schmitz, 1996: Improved surrogate data for nonlinearity tests. Phys. Rev. Lett., 77, 635638, https://doi.org/10.1103/PhysRevLett.77.635.

    • Search Google Scholar
    • Export Citation
  • Schreiber, T., and A. Schmitz, 2000: Surrogate time series. Physica D, 142, 346382, https://doi.org/10.1016/S0167-2789(00)00043-9.

  • Sgubin, G., D. Swingedouw, L. F. Borchert, M. B. Menary, T. Noël, H. Loukos, and J. Mignot, 2021: Systematic investigation of skill opportunities in decadal prediction of air temperature over Europe. Climate Dyn., 57, 32453263, https://doi.org/10.1007/s00382-021-05863-0.

    • Search Google Scholar
    • Export Citation
  • Siegert, S., O. Bellprat, M. Ménégoz, D. B. Stephenson, and F. J. Doblas-Reyes, 2017: Detecting improvements in forecast correlation skill: Statistical testing and power analysis. Mon. Wea. Rev., 145, 437450, https://doi.org/10.1175/MWR-D-16-0037.1.

    • Search Google Scholar
    • Export Citation
  • Sienz, F., W. A. Müller, and H. Pohlmann, 2016: Ensemble size impact on the decadal predictive skill assessment. Meteor. Z., 25, 645655, https://doi.org/10.1127/metz/2016/0670.

    • Search Google Scholar
    • Export Citation
  • Simpson, I. R., S. G. Yeager, K. A. McKinnon, and C. Deser, 2019: Decadal predictability of late winter precipitation in western Europe through an ocean–jet stream connection. Nat. Geosci., 12, 613619, https://doi.org/10.1038/s41561-019-0391-x.

    • Search Google Scholar
    • Export Citation
  • Smith, D. M., R. Eade, N. J. Dunstone, D. Fereday, J. M. Murphy, H. Pohlmann, and A. A. Scaife, 2010: Skilful multi-year predictions of Atlantic hurricane frequency. Nat. Geosci., 3, 846849, https://doi.org/10.1038/ngeo1004.

    • Search Google Scholar
    • Export Citation
  • Smith, D. M., and Coauthors, 2019: Robust skill of decadal climate predictions. npj Climate Atmos. Sci., 2, 13, https://doi.org/10.1038/s41612-019-0071-y.

    • Search Google Scholar
    • Export Citation
  • Smith, D. M., and Coauthors, 2020: North Atlantic climate far more predictable than models imply. Nature, 583, 796800, https://doi.org/10.1038/s41586-020-2525-0.

    • Search Google Scholar
    • Export Citation
  • Solaraju-Murali, B., L.-P. Caron, N. Gonzalez-Reviriego, and F. J. Doblas-Reyes, 2019: Multi-year prediction of European summer drought conditions for the agricultural sector. Environ. Res. Lett., 14, 124014, https://doi.org/10.1088/1748-9326/ab5043.

    • Search Google Scholar
    • Export Citation
  • Sospedra-Alfonso, R., and G. J. Boer, 2020: Assessing the impact of initialization on decadal prediction skill. Geophys. Res. Lett., 47, e2019GL086361, https://doi.org/10.1029/2019GL086361.

    • Search Google Scholar
    • Export Citation
  • Theiler, J., S. Eubank, A. Longtin, B. Galdrikian, and J. Doyne Farmer, 1992: Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58, 7794, https://doi.org/10.1016/0167-2789(92)90102-S.

    • Search Google Scholar
    • Export Citation
  • Totaro, V., A. Gioia, and V. Iacobellis, 2020: Numerical investigation on the power of parametric and nonparametric tests for trend detection in annual maximum series. Hydrol. Earth Syst. Sci., 24, 473488, https://doi.org/10.5194/hess-24-473-2020.

    • Search Google Scholar
    • Export Citation
  • Trenberth, K. E., and D. J. Shea, 2006: Atlantic hurricanes and natural variability in 2005. Geophys. Res. Lett., 33, L12704, https://doi.org/10.1029/2006GL026894.

    • Search Google Scholar
    • Export Citation
  • Wang, Y., F. Counillon, N. Keenlyside, L. Svendsen, S. Gleixner, M. Kimmritz, P. Dai, and Y. Gao, 2019: Seasonal predictions initialised by assimilating sea surface temperature observations with the EnKF. Climate Dyn., 53, 57775797, https://doi.org/10.1007/s00382-019-04897-9.

  • Yeager, S. G., and Coauthors, 2018: Predicting near-term changes in the Earth system: A large ensemble of initialized decadal prediction simulations using the Community Earth System Model. Bull. Amer. Meteor. Soc., 99, 1867–1886, https://doi.org/10.1175/BAMS-D-17-0098.1.

  • Fig. 1.

    Cyan vertical lines indicate the values of the statistics for an example where o is a normalized AR1 series with coefficient 0.5, h = o + ξh, and f = o + ξf, where ξh and ξf are normalized, Gaussian white noise. This example fulfills the null hypothesis. (left) Difference statistic. (center) Residual statistic. (right) Split statistic. The histograms show the distributions of the statistics under the null hypothesis that h and f are identically distributed as calculated with the surrogate method. The sample size is N = 100. The red vertical lines show the values of the statistics from Eqs. (5) to (7) valid for N → ∞.

  • Fig. 2.

    As in Fig. 1, but now h = o + ξh, and f = o + 0.5ξf, where ξh and ξf are normalized white noise. This example does not fulfill the null hypothesis.

  • Fig. 3.

    The residual statistic cor(o|h̄, f̄|h̄) as a function of ensemble size K and noise amplitude σ. The model is hi = s + σξh and fi = s + σξf, i = 1, …, K, and o = s + ξo. Here, s is a normalized AR1 series, and N = 100. Note the nonlinear scales on the axes.

  • Fig. 4.

    Schematic view of a power curve. The power p is shown as function of a parameter (full black curve). The null hypothesis is fulfilled only when the parameter is zero. Away from zero, p is the probability that the false null hypothesis is correctly rejected (indicated by green hatching). The probability that the false null hypothesis is incorrectly not rejected (type II error) is 1 − p (indicated by red hatching). At zero—where the null hypothesis is fulfilled—the length of the green line, 1 − p, indicates the probability to correctly not reject the true null hypothesis, while the length of the red line p indicates the probability to incorrectly reject the true null hypothesis (type I error).

  • Fig. 5.

    The power (the probability of rejecting the null hypothesis) of the statistical tests as a function of λf for different values of the sample size N: 25, 100, and 1000. Here, h = o + ξh and f = λfo + ξf, where o is a normalized AR1 process, and ξh and ξf are normalized white noise. This corresponds to Eq. (4) with σξh = σξf = λh = 1 and σξo = 0. Black, blue, and orange full curves are the power for the difference statistic cor(f, o) − cor(h, o), the residual statistic cor(o|h, f|h), and the split statistic cor(fh, o) σfh/σh, respectively. Dashed lines are power calculated with the optimal benchmark method.

  • Fig. 6.

    (a) Power as a function of σξf with h = o + ξh and f = o + σξf ξf. Here, o is a normalized AR1 process. (b) Power as a function of λf with f = λfs + ξf. (c) Power as a function of σξf with f = s + σξf ξf. In (b) and (c), h = s + ξh, o = s + 0.5ξo, and s is a normalized AR1 process. The terms ξh, ξf, and ξo are normalized white noise. N = 100 in all panels.

  • Fig. 7.

    Including a trend in s of the form 6t⁴, with t = (0, 1, …, N − 1)/N. Power as a function of λf. N = 100 in all panels. (left) Parameters are as in Fig. 5b, including σξo = 0. (right) As in Fig. 6b, including σξo = 0.5. (top) No detrending in significance estimation. (bottom) The third-order polynomial is removed. Legend as in Figs. 5 and 6.

  • Fig. 8.

    Results for LENS annual-mean near-surface temperature. (top left) Correlations between observations and historical ensemble mean. (top right) As in the top left, but now time series have been detrended. (bottom) Correlations between observations and the initialized ensemble mean for lead times of (left) 1 year and (right) 10 years. The period is 1970–2017. Dots indicate where correlations are significantly different from zero at the 95% level.

  • Fig. 9.

    The three statistics measuring the difference in skill between the initialized ensemble-mean forecast and the uninitialized historical ensemble mean for annual-mean near-surface temperature. (top left) The difference statistic, (top right) the residual statistic, and (bottom left) the split statistic. Based on LENS ensemble means for 1970–2017. A lead time of 1 year is used for the initialized forecast. Dots indicate where the null hypothesis—that historical and initialized experiments have the same skill—can be rejected at the 95% level.

  • Fig. 10.

    As in Fig. 9, but for a lead time of 10 years.

  • Fig. 11.

    (left) The fraction of the global area where the skills in forecast and historical experiments are estimated to be significantly different. The fraction is shown as a function of lead time for each of the three statistics. (right) The fraction of the global area where the skills are significantly different calculated for grid points with positive and negative correlations, respectively.
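
The synthetic example in the caption of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' code. It builds a normalized AR1 series o with coefficient 0.5, sets h = o + ξh and f = o + ξf, and evaluates the difference statistic cor(f, o) − cor(h, o) and the residual statistic cor(o|h, f|h). The function names are hypothetical. Computing the residual statistic as the correlation of linear-regression residuals is an assumption about the notation. The surrogate scheme, which swaps h and f at randomly chosen times, is an assumed stand-in that respects the null hypothesis of identically distributed h and f; it may differ from the surrogate method used in the paper.

```python
import numpy as np

def ar1(n, phi, rng):
    """Return an AR1 series of length n, normalized to zero mean and unit variance."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + np.sqrt(1.0 - phi**2) * rng.standard_normal()
    return (x - x.mean()) / x.std()

def cor(a, b):
    """Pearson correlation between two 1D series."""
    return np.corrcoef(a, b)[0, 1]

def difference_stat(o, h, f):
    """Difference statistic: cor(f, o) - cor(h, o)."""
    return cor(f, o) - cor(h, o)

def residual_stat(o, h, f):
    """Residual statistic cor(o|h, f|h), taken here (assumption) as the
    correlation of the residuals of o and f after linear regression on h."""
    def resid(y, x):
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)
    return cor(resid(o, h), resid(f, h))

def swap_surrogate_pvalue(o, h, f, n_surr, rng):
    """Two-sided p-value for the difference statistic.
    Surrogates swap h and f at random times (assumed scheme, see lead-in)."""
    observed = difference_stat(o, h, f)
    null = np.empty(n_surr)
    for k in range(n_surr):
        swap = rng.random(o.size) < 0.5
        h_s = np.where(swap, f, h)
        f_s = np.where(swap, h, f)
        null[k] = difference_stat(o, h_s, f_s)
    return observed, np.mean(np.abs(null) >= abs(observed))

# Example fulfilling the null hypothesis, as in the caption of Fig. 1.
rng = np.random.default_rng(0)
N = 100
o = ar1(N, 0.5, rng)
h = o + rng.standard_normal(N)
f = o + rng.standard_normal(N)
d, p = swap_surrogate_pvalue(o, h, f, 1000, rng)
print("difference statistic:", d, "surrogate p-value:", p)
print("residual statistic:", residual_stat(o, h, f))
```

Repeating this over many independent realizations and counting how often p falls below 0.05 would give a rough estimate of the rejection rate, which is the quantity shown in the power curves of Figs. 5 to 7 (for the paper's own surrogate method).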
