Two large ensembles (LEs) of historical climate simulations are used to compare how various statistical methods estimate the sea surface temperature (SST) changes due to anthropogenic and other external forcing, and how their removal affects the internally generated Atlantic multidecadal oscillation (AMO), Pacific decadal oscillation (PDO), and the SST footprint of the Atlantic meridional overturning circulation (AMOC). Removing the forced SST signal by subtracting the global mean SST (GM) or a linear regression on it (REGR) leads to large errors in the Pacific. Multidimensional ensemble empirical mode decomposition (MEEMD) and quadratic detrending only efficiently remove the forced SST signal in one LE, and cannot separate the short-term response to volcanic eruptions from natural SST variations. Removing a linear trend works poorly. Two methods based on linear inverse modeling (LIM), one where the leading LIM mode represents the forced signal and another using an optimal perturbation filter (LIMopt), perform consistently well. However, the first two LIM modes are sometimes needed to represent the forced signal, so the more robust LIMopt is recommended. In both LEs, the natural AMO variability seems largely driven by the AMOC in the subpolar North Atlantic, but not in the subtropics and tropics, and the scatter in the AMOC–AMO correlation is large between individual ensemble members. In three observational SST reconstructions for 1900–2015, linear and quadratic detrending, MEEMD, and GM yield somewhat different AMO behavior, and REGR yields smaller PDO amplitudes. Based on LIMopt, only about 30% of the AMO variability is internally generated, as opposed to more than 90% for the PDO. The natural SST variability contribution to global warming hiatus is discussed.
The global mean surface temperature has risen in the last 100 years, but it has not done so uniformly in time, reflecting changes in the emission of greenhouse gases and aerosols, fluctuations in solar activity, and also internally generated variability driven by the interactions among the atmosphere, ocean, cryosphere, and land. The variability driven by Earth interactions, which would occur in the absence of external forcing, will be called natural climate variability. Separating the forced variability from natural variability is needed for detecting and attributing climate changes, estimating the climate response to external forcing, and identifying the unforced climate variability. This is of much interest since the natural climate variability substantially contributes to regional and perhaps global temperature changes on interannual–multidecadal time scales, as during the so-called warming hiatus or pause in surface warming seen in globally averaged surface temperature in the first decade of the twenty-first century. The pause in surface warming has been largely attributed to natural variability in the tropical Pacific Ocean (e.g., Meehl et al. 2013; Kosaka and Xie 2013; Trenberth et al. 2014; England et al. 2014), perhaps in part driven by Atlantic changes (McGregor et al. 2014; Chen and Tung 2014; Li et al. 2015; Ruprich-Robert et al. 2017). However, a recent SST bias correction suggests that there was no global warming slowdown (Karl et al. 2015). The substantial role of natural variability is also illustrated by the large spread in large ensembles of scenario simulations (Deser et al. 2014; Wettstein and Deser 2014).
Because of the short duration of the observational record, a careful removal of the forced signal is key to better understanding of the dynamical and thermodynamical processes that drive the main modes of low-frequency sea surface temperature (SST) variability, such as the Atlantic multidecadal oscillation (AMO) and the Pacific decadal oscillation (PDO). For instance, it is generally considered that the Atlantic meridional overturning circulation (AMOC) in part drives the AMO (e.g., Delworth et al. 1993; Knight et al. 2005), and most climate models indeed show that the AMOC leads the AMO, albeit by a model-dependent time lag (Medhaug and Furevik 2011; Gastineau and Frankignoul 2012). The integral response of the upper ocean to stochastic atmospheric forcing also contributes to the AMO variability, as recently emphasized by Clement et al. (2015). Hence, it is of interest to compare the AMO and the AMOC SST fingerprint and investigate their relationship. Marini and Frankignoul (2014) showed that removing the forced signal in historical simulations with three climate models lead to a lag AMOC–AMO correlation that was more similar to that found in control simulations, and the agreement depended on how the forced SST signal was estimated. Tandon and Kushner (2015) also showed that external forcing interferes with the AMOC–AMO relationship and that removing forced variations leads to closer agreement between models or between ensemble members in a large ensemble. The link between the AMO and the PDO is also of interest. D’Orgeville and Peltier (2007) found observational evidence that the AMO leads the PDO by 13 yr and the PDO leads the (negative) AMO by 17 yr, arguing that they are signature of the same oscillation cycle. However, by removing the global warming signal with linear inverse modeling, Marini and Frankignoul (2014) suggested that the AMO plays the leading role in the interbasin connection since the correlation remains strong when the AMO leads but decreases when the PDO leads. The relation between low-frequency modes is thus sensitive to the method used for separating forced and natural climate variability.
Two different approaches have been used to estimate the space–time pattern of the forced climate response, one that solely relies on observations and one that uses the response patterns of historical or scenario runs with state-of-the art climate models. The latter approach is based on multimodel simulations, sometimes with multiple ensemble members, and it involves statistical methods to better identify the forced component, such as scaling factors (Franckombe et al. 2015), signal-to-noise-maximizing empirical orthogonal functions (EOFs) (Ting et al. 2009), or discriminant analysis and maximization of the average predictability time (Delsole et al. 2011). Although it has been argued that a multimodel approach should strongly reduce the impact of individual model biases, there is no guarantee that it will yield unbiased results, since most climate models share similar components and parameterizations (Knutti et al. 2013) and might have been tuned to reproduce the observed twentieth-century warming (Hourdin et al. 2017). In addition, there remain substantial uncertainties in the anthropogenic and external forcing prescribed in historical runs, and current climate models generally do not capture regional trend patterns well (Shin and Sardeshmukh 2011). A variant to the multimodel approach is to use a large ensemble of a single climate model (e.g., Branstator and Selten 2009), which allows a less ambiguous separation between the externally forced climate change and natural climate variability in the model. Such large ensemble can also be used to estimate the thermodynamically induced trend representative of the forced components (Deser et al. 2016).
Here, we solely focus on methods that only rely on observations. They are not affected by model biases but are affected by the substantial observational uncertainties, in particular before the second half of the twentieth century, and by the method of SST or air temperature reconstructions. A common practice in the observation-based approach is to use the linear trend to characterize secular changes, especially in short records (e.g., Swart et al. 2015; Vincent et al. 2015), and to remove it when analyzing the main modes of natural climate variability or their impact (e.g., Knight et al. 2005; Sutton and Hodson 2005; Alexander et al. 2014; Delworth et al. 2017; Lyu et al. 2017; Osborne et al. 2017). This simply assumes that the forced climate evolution is primarily linear in time, even though observations and climate model simulations show that it is not really appropriate. Trenberth and Shea (2006) stressed that the global SST warming has intensified in recent decades, so that the AMO index (the low-pass-filtered SST averaged over the North Atlantic) is better defined by first subtracting the global mean SST, which turned out to substantially reduce the AMO amplitude. A similar strategy had been used by Zhang et al. (1997) to study ENSO-like interdecadal variability, and by Mantua et al. (1997) to define the PDO. Enfield and Cid-Serrano (2010) argued that removing the globally averaged SST or the regression on it also removes part of the natural climate variability; hence, they subtracted a quadratic trend for investigating the natural climate variability. Several other methods have been proposed to separate external forcing from internal variability. Parker et al. (2007) used the first EOF of low-pass-filtered SST and air temperature to represent the forced SST signal, while Guan and Nigam (2009) used rotated extended EOFs. Ting et al. (2009) used a regression on the global average of the SST or the surface temperature to represent the forced signal. More sophisticated methods have also been used. Compo and Sardeshmukh (2010), Newman (2013), and Marini and Frankignoul (2014) used a dynamical filter based on linear inverse modeling (LIM; e.g., Penland and Matrosova 1994, 2006), which decomposes the SST into nonorthogonal normal modes and can single out the SST signature of ENSO and the global secular trend pattern. Hannachi (2007) introduced trend EOFs obtained using correlations between time positions of the sorted data to capture the different trend patterns of a field, which was generalized by Li et al. (2011) to extract coupled trends between SST and latent heat flux using singular value decomposition (SVD). The empirical mode decomposition (EMD; Huang et al. 1998) and its extensions, the ensemble empirical mode decomposition (EEMD; Wu and Huang 2009), and the multidimensional EEMD (MEEMD; Wu et al. 2009; Ji et al. 2014) have been used to decompose a time series into amplitude–frequency-modulated components of increasing time scales, and applied to temperature datasets.
Here we compare the ability of several of the observation-based methods at estimating and removing the forced climate response, using two large ensembles (LEs) of climate simulations subject to identical natural and anthropogenic forcing, but slightly different initial conditions, as a substitute for the real climate system. The first ensemble was conducted with the IPSL-CM5A-LR, and it contains n = 30 members. The second ensemble was conducted with the Community Earth System Model, version 1 (CESM1), and it contains n = 40 members (Kay et al. 2015). These LE simulations provide a unique opportunity to systematically assess the efficiency of each statistical method at extracting the forced signal in a single realization (an analogy to the observations). Indeed, the forced climate change signal can be estimated by the ensemble mean, as the noise due to the natural climate variability is reduced by , so that it is 5.5 or 6.3 times smaller. Furthermore, by comparing LEs using two distinct models, the robustness of the findings may be better judged.
Seven methods, namely based on linear and quadratic trend, global mean SST, regression on the global mean SST, LIM, optimal perturbation filter, and MEEMD are applied to each simulation to extract the forced signal. The estimated forced signal is then compared to the ensemble mean, which is our target truth for the respective LE. We compare the ability of these methods at representing both the long-term climate changes and the short-term response to large volcanic eruptions. As additional criteria, we compare the natural variability of the AMO and the PDO, and the AMOC SST footprint, after removing the estimated forced signal. Finally, we apply the methods to three observation-based SST reconstructions during 1900–2015.
2. Estimation of the forced climate signal
a. Linear trend
In this simplest case, the forced climate signal is represented by a linear trend (hereafter D1), which is estimated by least squares fit. Here we apply it to the yearly mean SST anomaly at each grid point after slight smoothing by a binomial (¼–½–¼) filter. The spatial pattern of the forced signal evolves with time, as the trend differs between grid points.
b. Quadratic trend
Quadratic detrending (hereafter D2) consists of fitting instead a second-order polynomial. The spatial pattern of the estimated forced signal also evolves with time.
c. Global mean SST
Following Trenberth and Shea (2006) but for a slight smoothing, the global forced signal is approximated at each grid point by the yearly averaged global mean SST anomaly smoothed by a binomial (¼–½–¼) filter (hereafter GM). The estimated forced signal is identical at all grid points.
d. Regression on the global mean SST
The nonuniformity of the global warming signal is accounted for by using a regression of the SST anomaly at each grid point on the yearly global mean SST anomaly time series (i.e., the GM time series) after smoothing by the binomial filter (hereafter REGR method). The estimated forced signal has spatially nonuniform amplitude.
e. Linear inverse model
The LIM assumes that the SST anomalies x are well approximated by a multivariate linear Markov process
where is a linear operator, and F is white noise (e.g., Penland and Sardeshmukh 1995). The most probable forward solution of (1) at time t + τ is x(t + τ) = (τ)x(t), where (τ) = exp(τ). Three-month averages of SST anomalies in January–March (JFM), April–June (AMJ), July–September (JAS), and October–December (OND) are used for x, with τ = 3 months. The dimensionality is reduced by working in a truncated rotated-EOF space, retaining the first 15 rotated near-global (60°S–60°N) EOFs,1 which account for about 75% of the SST variance. Increasing the number of EOFs leads to similar results, but the LIM modes become less stable among ensemble members. The SST field is then decomposed into a sum of nonnormal eigenmodes (e.g., Penland and Matrosova 1994, 2006):
where ui is the eigenvector of estimated by = τ−1 ln[(τ)(0)−1] and associated with the eigenvalue βi, and αi(t) is the time series obtained by projecting x onto the corresponding adjoint eigenvector. Here (τ) is the lag covariance matrix of x at lag τ. Because is a real, negative, but not symmetric matrix, the eigenmodes may be nonorthogonal and all the eigenvalues have negative real parts. There is thus no orthogonality assumption, unlike in EOF analysis. The ith eigenmode of is either a stationary damped mode characterized by one pattern and a decay time or an oscillatory mode with two patterns, a period 2π/Im(βi), and a decay time −1/Re (βi). In all cases, the first eigenmode (that with least damping) is a weakly damped mode whose time evolution has a large trend, as in Fig. 1; it is taken as the forced climate signal, as in Compo and Sardeshmukh (2010), Newman (2013), and Marini and Frankignoul (2014). In several ensemble members, the second damped mode seems to also contribute to the secular trend but, although we tried several selection criteria, taking the second LIM mode into account degraded the overall results. Hence, the forced signal has the fixed spatial pattern of the first mode.
f. Linear inverse model with optimal perturbation filter
Solomon and Newman (2012) showed that an optimal perturbation filter based on LIM was more efficient at removing ENSO than a set of LIM eigenmodes. Here we adapt their method to define a space and time varying estimate of the forced SST signal (hereafter LIMopt). The forced signal (t) is represented by the variability from the data that evolves into the maximum possible anomaly after a time τe, using Φ1(τe), the optimal initial structure, which is the normalized right singular vector of (τe) = exp(τe). Here we use τe = 2.5 yr, much longer than in Solomon and Newman (2012), who used τe = 6–9 months to estimate ENSO signal. As described in Solomon and Newman (2012), the filter iteratively considers the evolution over a longer period τ1, taken here to be τ1 = 20 yr. The forced signal is calculated from the initial condition t = 0 by α(0) = Φ1(τe)x(0):
for t = 1 and α(1) = Φ1(τe)[x(1) − α(0)(1)Φ1(τe)]:
and for each t ≥ τ1 and :
where α(t − τ) designates the projections on Φ1(τe) at that point in the iteration. For t < τ1, we use τ1 = t in (4). The spinup of the iteration is fast, and we only remove the results from the first year to avoid initial instabilities.
The results are not sensitive to the precise value of τe (as long as it is longer than the ENSO time scale) and τ1. An example of the optimal initial condition and resulting global mean SST after 2.5 years is given in Fig. 2. Note that although the first LIM mode strongly contributes to the optimal initial structure, the method is more general than LIM as it makes no assumption as to which modes represent the forced signal and could include a significant contribution of the second LIM mode, as illustrated by the projection of each LIM mode on the optimal initial structure (Fig. 2, bottom).
g. Multidimensional ensemble empirical mode decomposition
EMD and EEMD (e.g., Wu and Huang 2009; Z. Wu et al. 2011) use amplitude–frequency-modulated oscillatory components of increasing time scale obtained adaptively from an anomaly time series x(t), which is decomposed into a finite and often small number of components cj(t), such that
where Rn(t) is the residual after n intrinsic modes have been extracted. In EMD, each oscillatory component is obtained through a sifting process: the local maxima and minima of x(t) are first found, and the upper envelope eu(t) and the lower envelope el(t) obtained by connecting them via a cubic spline. The local mean m(t) of the two envelopes for the riding wave is then calculated. If it is close enough to zero (the upper and lower envelopes are symmetric with respect to the zero line), the sifting is terminated. Otherwise, m(t) is subtracted from x(t) and the sifting process is repeated until m(t) is close enough to zero. This yields the highest-frequency oscillatory component c1(t) [the first riding wave component or intrinsic mode function (IMF), which has the shortest time scale]. Subtracting it from x(t) leads to a residual R1(t), which is treated similarly, yielding a second IMF c2(t) and a residual R2(t). The decomposition process stops when the remaining time series Rn(t) is a monotonic function or a function that has at most one internal extremum. As the IMFs are sensitive to noise, more robust estimates are obtained (EEMD method) by adding multiple white-noise realizations to x(t), decomposing the data in each case, and using the ensemble means of the respective IMFs as the final result. The multidimensional generalization of EEMD (i.e., MEEMD) simply pieces together the EEMD components of similar time scale. This allows for both static and spatially propagating signals of naturally determined time scale, at least if there is a clear correspondence between the IMFs at the different grid points. Here we work with annual mean SST anomalies at each grid point, without binomial smoothing. The noise added to each input time series has a standard deviation 5 times smaller than the original time series, and the ensemble number is 400, as in Ji et al. (2014). It is not a priori obvious which residual best represents the forced climate signal. The last residual Rn(t) was chosen in Z. Wu et al. (2011) and Ji et al. (2014). This seems to hold at some locations, as in Fig. 3, where R5(t), practically a quadratic trend, is highly correlated with the ensemble-mean SST, but not at others. It was found that in the two LEs, MEEMD fairs poorly for R1(t) and R2(t), better for R3(t), and best for R4(t), although R5(t) works nearly as well. Hence, in the following we use R4(t). However, using R4(t) in the SST reconstructions leads to very different AMO evolutions than the other methods, but not R5(t), which is then preferred. Because the EEMD is applied to each grid point separately, the spatial coherence in MEEMD is not always very good.
Two LEs are considered. The first ensemble was conducted at Laboratoire d’Océanographie et du Climat: Expérimentation et Approches Numériques (LOCEAN) with the IPSL-CM5A-LR (IPSL-LE) at about 2° ocean and atmosphere resolution (Dufresne et al. 2013), and it contains 30 members for the period 1940–2020. The members start from random initial oceanic and atmospheric conditions selected between 1920 and 1960 in six CMIP5 historical runs. The second ensemble was conducted at the National Center for Atmospheric Research (NCAR) with the CESM1 (CESM-LE) at about 1° ocean and atmosphere resolution (Kay et al. 2015), and it contains 40 members for the period 1920–2100 starting from different atmospheric initial conditions but the same oceanic ones. All IPSL-LE and CESM-LE ensemble members have the same specified external forcing, following the CMIP5 design protocol: historical forcing until 2005, then representative concentration pathway 8.5 (RCP8.5). Details are given in Kay et al. (2015), who showed that the observed global warming anomaly was within the spread predicted by the CESM-LE. Because the oceanic initial conditions were identical in CESM-LE, we do not consider the period prior to 1940 to better distinguish forced and natural variability. Indeed, the AMOC, which largely contributes to the oceanic meridional heat transport, was only uncorrelated between members after the initial 20 years.
Our analysis focuses on monthly SST anomalies during 1940–2015 between 60°S and 60°N extracted from the surface temperature where the land fraction is smaller than 0.3, the SST larger than −2°C, and, in CESM-LE, the monthly sea ice concentration always smaller than 5%. In IPSL-LE, as there is too much sea ice during the cold season (Dufresne et al. 2013), we consider grid points where the sea ice climatology does not exceed 10%. As illustrated by the time evolution of the ensemble mean of the monthly, area-weighted, quasi-globally averaged SST in Fig. 4, the IPSL model shows a much larger sensitivity to the external forcing than CESM1, but the sensitivity of CESM1 is in better agreement with our best estimate of the forced SST signal in the HadISST (red), COBE-SST (blue), and ERSST (green) observational SST reconstructions (taken from Fig. 15). This is consistent with the CMIP5 model study of Andrews et al. (2012), where IPSL-CM5A-LR shows a rather large climate sensitivity, and CESM1 a rather low one. Nonetheless, both models represent the acceleration of the quasi-global mean SST increase, except for a sudden cooling after the three large volcanic eruptions, namely Mount Agung in 1963, El Chichón in 1982, and Mount Pinatubo in 1991. In the two LEs, there is no clear indication of the global warming slowdown from the late 1990s to 2012 that is seen in the HadISST reconstruction, but not in COBE-SST and ERSST (see section 6). There is much scatter between ensemble members, in particular in CESM-LE (gray lines in Fig. 4). The spatial pattern of the global mean SST changes, as obtained by regression, also varies, with a larger equatorial signal in CESM-LE, but larger midlatitude signals in IPSL-LE (Fig. 4).
For the observations, we consider the 1900–2015 period in three recent SST reconstructions, namely HadISST, version 1.1 (HadISST1.1), on a 1° × 1° grid (Rayner et al. 2003), ERSST, version 4 (ERSST.v4), on a 2° × 2° grid (Huang et al. 2015), and COBE-SST2 on a 1° × 1° grid (Hirahara et al. 2014). The mask using the same criteria as in CESM-LE is used to select open ocean grid points, using the sea ice dataset associated to each SST reconstruction, except for ERSST.v4 where we use the HadISST sea ice mask.
4. Externally forced signal
A synthetic view of method efficiency is provided by comparing the evolution of the area-weighted global mean forced SST during 1940–2015 in each ensemble member against the target truth, i.e., the global mean SST of the ensemble average given in Fig. 4. Because the global mean yearly SST time series show a near monotonous increase over time, except after volcanic eruptions, the correlation r with the ensemble average is very high and basically undistinguishable between methods, except for D1. Indeed, the averaged correlation for the other methods ranges between 0.98 and 0.99 in IPSL-LE, and between 0.94 and 0.97 in CESM-LE, while D1 is clearly inferior (r = 0.93 in IPSL-LE and 0.86 in CESM-LE). A more selective comparison is provided by the differences between the estimated yearly forced global mean SST time series in each ensemble member and the global mean SST of the ensemble average, which eliminates the trend that dominates the correlation. To compare the methods, we average over ensemble members (30 or 40) the root-mean-square (rms) difference for the 76 years. By the central limit theorem, this statistic has to a good approximation a Gaussian distribution. As shown in Table 1, LIM and then LIMopt yield the smallest errors, except in CESM-LE where MEEMD and D2 fare slightly better or comparably, presumably because the global mean SST approximately increases quadratically with time in this model (Fig. 4). GM and REGR lead to larger errors, albeit only significantly so in CESM-LE, while D1 is clearly inadequate.
The geographical distribution of the averaged (over time and ensemble members) rms differences suggests that regions with a large natural variability, such as near the equator and at high latitude, dominate the global errors for all methods, as illustrated for IPSL-LE (Fig. 5). These regions also dominate their area-weighted spatial average in each ensemble member, whose histogram is given in Fig. 6, although similar results are found with normalized SST data. In IPSL-LE (Figs. 6a–g), the averaged rms error is smallest for LIM and LIMopt, then REGR, and largest for GM and D1. Based on the two-sample Student’s t test, the errors are smallest in LIM and LIMopt at the 1% significance level, but they cannot be differentiated. In CESM-LE (Figs. 6h–n), the estimation errors and the differences among methods are smaller, although LIM-opt, D2, MEEMD, and LIM are best (and not significantly different), and REGR and D1 yield larger errors. Hence, LIM and LIMopt are the only methods that lead to small errors in both LEs.
To investigate how each method represents short-term forced climate fluctuations, we separately consider the 7-yr periods that follow the three largest volcanic eruptions occurring in the 1940–2015 period, namely Mount Agung (1963), El Chichòn (1982), and Mount Pinatubo (1991). A global measure of method efficiency could be provided as above. However, the rms errors would have two contributions, one corresponding to the initial year of the volcanic eruption, and the other to the SST decrease in the following six years. Hence, they would not provide an optimal measure of method ability at representing fast changes. Indeed, if a method, say, underestimates the global mean SST in the initial year, it may yield a small error even if the actual volcanic cooling is underestimated. Therefore, we compare instead the SST evolution following the year of each volcanic event in each ensemble member to that of the ensemble mean (Table 2). This is more independent from the statistics given in Table 1, which already include the volcanic periods. In IPSL-LE, the area-weighted mean rms error during the three volcanic periods is smallest in GM, LIMopt, REGR, and LIM, while it is much larger in methods that emphasize the most slowly varying components of the SST evolution (i.e., MEEMD, D1, and D2). In CESM-LE, LIMopt and LIM yield much smaller errors than MEEMD, GM, and REGR, while D1 and D2 again lead to large errors.
5. Natural variability after removing externally forced signals
a. Atlantic multidecadal oscillation
The AMO time series in each ensemble member is defined by the low-pass-filtered, area-weighted SST anomaly average within 0°–60°N, 0°–80°W after subtraction of the forced SST signal, using yearly SST and a 10-yr cutoff (Butterworth filter), with the land and sea ice mask described above. The AMO pattern in each ensemble member is obtained by regressing the yearly mean SST anomalies onto the AMO time series. We assume that the true unforced AMO pattern is given to a good approximation by the average of the patterns obtained in each ensemble member after subtraction of the ensemble-mean SST. Note that there is some scatter in the AMO pattern between ensemble members, reflecting that the record length (76 yr) is too short to well resolve the long AMO time scale. The same strategy is used for each method, except that it is the estimated forced signal that is subtracted in each ensemble member before estimating the AMO.
Method efficiency at representing the natural variability of the AMO is evaluated in two different ways. First, we consider the ensemble average of the AMO time series, which should by construction be, within error bars, equal to zero throughout 1940–2015. The comparison is shown in Figs. 7a,b, where the 95% confidence interval for the “true” AMO, as estimated by removing the ensemble mean in each ensemble member, is indicated. For clarity, no confidence interval is given for each method, as it is likely to be similar to, but not independent of, that of the true mean. The poor performance of D1 is striking in both ensembles. In IPSL-LE and, to a lesser extent CESM-LE, MEEMD and D2 are unable to separate the global cooling that follows the two main volcanic eruptions (in the 1960s and the 1990s) from the natural variability of the AMO, yet they lead to the best averaged correlation with the true AMO in CESM-LE (Figs. 7b,d,f,h), as other methods overestimate the AMO in the first decade. In both LEs, LIM and LIMopt lead to small errors. Although GM fares well in the volcanic periods, it poorly separates the forced SST changes from the AMO during the last decade. Indeed, as climate models (and observations) show a minimum warming in the subpolar North Atlantic (e.g., Drijfhout et al. 2012; see also Fig. 4), the removal of a spatially uniform pattern leads to erroneous cold AMO anomalies after 2000, when the external forcing from greenhouse gases has increased. Nonetheless, GM works well in IPSL-LE, and REGR fares nicely in both LEs.
Another comparison is given in Figs. 8 and 9, where the mean (over all ensemble members) AMO pattern estimated by each method is compared to that derived by subtracting the ensemble mean from each ensemble member. The true AMO patterns are fairly realistic (cf. with observational AMO patterns in Figs. 16a,b), although the tropical part of the AMO is shifted northward in IPSL-LE (Gastineau et al. 2013), and in both models there is a weak ENSO-like warming in the Pacific and, in particular in CESM-LE, a PDO-like signal in the North Pacific. This presumably occurs because ENSO teleconnections contribute to driving both the AMO and the positive phase of the PDO in the two models. While the AMO was also shown to drive a negative PDO in sensitivity experiments with CESM1 (Ruprich-Robert et al. 2017), it would lead to a PDO of the opposite sign if it were the dominant interaction. These AMO patterns are almost identical to those derived from 1500-yr control simulations (not shown). The true AMO pattern is generally well represented by each method in the North Atlantic, except D1, but there are larger differences in the Pacific, as GM and REGR do not reproduce the Pacific extensions. To compare the methods more quantitatively, Table 3 gives the mean of the rms differences between the AMO pattern in each member/method and the corresponding true AMO pattern, with the pattern correlation, which takes into account the scatter between ensemble members. In IPSL-LE, LIMopt provides the best results, but they are not significantly different from LIM, D2, and MEEMD, and only significantly better than REGR, GM, and D1. In CESM-LE, D2 and MEEMD are both significantly better than LIM and LIMopt, while GM and REGR fare poorly for the global pattern and D1 provide large errors both in the North Atlantic and globally.
b. AMOC SST signature
To determine the SST fingerprint of the AMOC, we define the AMOC by the maximum of the yearly AMOC streamfunction between 500 and 2500 m averaged between 35° and 45°N (hereafter AMOC40), and we remove its ensemble mean, since in both models the AMOC has substantially decreased by 2015 (Fig. 4). In IPSL-LE, the mean AMOC decreases almost linearly after the mid-1950s. In CESM-LE, the mean AMOC, which is much stronger, slightly increases in the 1960s and only decreases after the mid-1980s, reflecting a possible influence of anthropogenic aerosols (Tandon and Kushner 2015). There is much scatter between ensemble members, however, reflecting the large natural variability of the AMOC.
Although the SST fingerprint of the natural variability of the AMOC evolves with time lag, as shown by Gastineau et al. (2013) for IPSL-CM5 and Frankignoul et al. (2015) for an earlier version of CESM1 (CCSM4), a typical low-frequency footprint can be obtained by regressing the low-pass-filtered SST onto the low-pass-filtered AMOC40, after removing the global forced signal, at the lag of maximum AMOC40–AMO correlation. This was done by concatenating all ensemble members into a single realization, with the SST lagging the AMOC by 9 yr for IPSL-LE and 1 yr for CESM-LE (see Fig. 12). The best estimate of the AMOC SST fingerprint, obtained by removing the ensemble-mean SST signal (the true fingerprint), is represented in Figs. 10h and 11h. In both models, there is a strong warming in the subpolar gyre and a cooling along the Gulf Stream, which likely reflects the southward shift of the Gulf Stream due to bottom torque at the crossover of the deep western branch of the AMOC with the Gulf Stream, as in CCSM4 (Frankignoul et al. 2015). There is also a weak cooling in the South Atlantic.
The comparison with the true AMO pattern in Figs. 8 and 9 shows that the AMOC SST fingerprint broadly resembles the AMO north of about 40°N in both models, but differs elsewhere. In both models, the AMOC SST fingerprint lacks the subtropical/tropical North Atlantic warming and the equatorial Pacific warming associated with the AMO, while there is no South Atlantic cooling in the AMO. The AMO also lacks the Gulf Stream cooling seen in the AMOC fingerprint. This suggests that in these two models the AMO, south of about 40°N, is primarily driven by local atmospheric variability and ENSO teleconnections, or by oceanic variability unrelated to the AMOC. Note that the equatorial Pacific warming associated with the AMO is not seen in the observations (see Figs. 16a,b), which suggests that the tropical Pacific influence is perhaps too strong in these models. The AMOC SST fingerprint is generally well reproduced when the forced SST signal is estimated in each ensemble member, although MEEMD fares less well in CESM-LE (Figs. 10 and 11).
The best (true) estimate of the lag correlation between the (low-pass filtered) AMO and AMOC40 is obtained by subtracting the ensemble-mean SST before computing the AMO. Figure 12 shows that the maximum mean correlation occurs when the AMOC leads the AMO by about 9 yr in IPSL-LE, consistent with Gastineau et al. (2013) and Marini and Frankignoul (2014), and by 1 or 2 yr in CESM-LE, as in CCSM4 (Danabasoglu et al. 2012). The maximum correlation in IPSL-LE (r = 0.26) is much smaller than in CESM-LE (r = 0.40). There is also a marginally significant hint that a negative AMO precedes an intensification of the AMOC by 10–16 yr in CESM-LE. Yet, there is much scatter among ensembles members, as illustrated by the dispersion of the lag and the value of the largest correlation (blue dots in Fig. 12). A 76-yr record is thus too short to determine with some confidence the relation between the AMOC and the AMO, and the AMOC–AMO correlation is not a powerful criterion to distinguish between methods of forced signal removal. In IPSL-LE, the mean (over ensemble members) AMOC–AMO correlation depends little on the way the forced signal is removed, although D1, D2, and MEEMD yield a smaller maximum; in CESM-LE, the mean AMOC–AMO correlation is generally well reproduced, although the maximum correlation is systematically too small (not shown). Similar results are obtained if AMOC40 is replaced by the averaged meridional heat transport between 35° and 45°N, or if the AMOC variability is represented by its first principal component (not shown).
c. The Pacific decadal oscillation
The PDO is defined as the first EOF of yearly averaged North Pacific (20°–60°N) SST variability computed in each ensemble member after removing the forced SST signal. The best (true) estimate of the PDO pattern (extended globally by regression) is obtained by removing the ensemble average SST from each ensemble member (Figs. 13h and 14h). As the mean PDO patterns estimated by each method look pretty similar, we show for comparison their difference with the true PDO (note the change of scale). In both models, the best estimates of the PDO pattern are obtained with LIMopt and LIM (Table 4). MEEMD works well, but D1 leads to a weak cold bias and D2 to much variability among ensemble members in IPSL-LE, although the mean pattern is well estimated. More importantly, REGR substantially underestimates the PDO amplitude, perhaps because PDO and global mean SST are related (Meehl et al. 2013), and GM yields large errors. The comparison between the rms errors in Tables 3 and 4 shows that the global PDO pattern tends to be better estimated than the AMO pattern when the best methods are selected, while the rms errors are otherwise comparable. In fact, the largest rms errors are found for the PDO when using GM or REGR.
Method efficiency at representing the natural variability of the PDO is also evaluated by considering the ensemble-average of the PDO time series, which should by construction be, within error bars, equal to zero throughout 1940–2015. The comparison is shown in Fig. 7 (bottom two panels), where the 95% confidence interval for the “true” PDO, as estimated by removing the ensemble mean in each ensemble member, is indicated. The PDO time behavior in each ensemble member is best represented by LIM and LIMopt, although differences with D1, D2, and MEEMD are not 5% significant. On the other hand, REGR fares poorly in both LEs, and GM even worse in CESM-LE.
The lag relation between the AMO and the PDO is also considered, using 10-yr low-pass-filtered time series. Even in the true case, the mean lag correlation is not significantly different from zero in both models, and there is much scatter between ensemble members (not shown). Hence, it cannot be used for method validation.
d. Method summary
Although LIM and LIMopt do not always provide the smallest errors for the natural climate variability, they lead to consistently good estimates, while GM and REGR may introduce substantial errors in the PDO and the AMO, and D1 often leads to poor estimates.
6. Application to the observations
Each method has been used to estimate and remove the forced signal in the 1900–2015 SST reconstructions. Each reconstruction yields a slightly different quasi-global (between 60°N and 60°S) mean SST evolution (given by GM) that reflects differences in bias correction, interpolation methods, and smoothing (e.g., Hirahara et al. 2014); the differences are at least comparable to those between estimation methods (Fig. 15). Note that the cooling following volcanic eruptions is smaller—or less visible—than in the LEs, although the cooling following the 1963 Mount Agung eruption is enhanced in the LIM and LIMopt estimates. The lack of warming trend between the mid-1940s and the mid-1970s, also seen in land surface air temperature (Hartmann et al. 2013) and sometimes referred to as big hiatus, likely reflects a compensation between the warming from increasing greenhouse gases and the cooling from anthropogenic sulfate aerosols (e.g., Fyfe et al. 2016), followed by the Mount Agung cooling. Consistent with the CMIP5 simulations and the CESM-LE, the natural variability (the difference between GM and the other methods) seems to play little role in the big hiatus, except that it ends up to a decade earlier when the natural variability is removed (i.e., in LIM and LIMopt). This is consistent with Kosaka and Xie (2016), who showed that the prolongation of the big hiatus was due to the change of phase of the interdecadal Pacific oscillation, which is well correlated with the PDO. The global warming slowdown found in global mean surface temperature from the late 1990s to 2012 is seen in the quasi-global mean SST from HadISST, but not COBE-SST or ERSST values (see also Karl et al. 2015). Our analysis suggests that the SST warming slowdown, if any, cannot be directly attributed to the natural variability of the PDO or the AMO, as there are only small differences between GM and LIM or LIMopt. This does not imply that the PDO did not contribute to the slowdown of the surface temperature warming, as suggested by Kosaka and Xie (2013, 2016) and England et al. (2014), since the interdecadal Pacific oscillation may have affected surface temperature at higher latitudes and over the continents.
However, the first LIM mode may not always provide a consistent estimate of the forced signal when it is applied to short time periods. This is illustrated for the 1940–2015 period considered in the LEs (dashed line in Fig. 15), as it takes the first two LIM modes to represent the forced signal and provide a similar behavior to LIMopt, in particular in HadISST where the two modes have a similar pattern but nearly opposite time variations (not shown). This reflects that over such short periods the time scale of the forced signal becomes comparable to that of the AMO; also, the forced signal represents less variance than the main ENSO mode. Both factors make it more difficult for LIM to distinguish between modes of variability. Hence, a careful examination of the main LIM modes is required when representing the forced signal in short records. We recommend using LIMopt instead, as it does not seem affected by the record length.
There are differences in the AMO and PDO patterns between the three SST reconstructions (not shown) that largely reflect their degree of spatial smoothing, namely limited smoothing in HadISST, more substantial smoothing in COBE-SST, and larger smoothing at low frequency in ERSST. Nonetheless, the AMO and the PDO evolve similarly in the three reconstructions. The estimated AMO patterns are illustrated for COBE-SST in Figs. 16a,b, except for D2, which is nearly undistinguishable from D1. They are broadly similar in the North Atlantic, although in COBE-SST (and to a lesser extent in HadISST) the tropical signal is stronger for D1, D2, and MEEMD. These methods also show the largest Pacific anomalies, and some Southern Hemisphere warming instead of a weak cooling. Note that all AMO estimates lack the equatorial Pacific warming seen in the two LEs. Correspondingly, the AMO evolution in each SST reconstruction is very similar for D1 (or D2) and MEEMD, but substantially different for LIM, LIMopt, and REGR, which behave similarly (bottom panels); GM leads to smaller AMO amplitude during the last few decades, in particular in ERSST.
As illustrated for HadISST but found in each SST reconstruction, the PDO patterns are similar among methods (Figs. 17a,b). However, REGR yields smaller amplitude, consistent with the LE results, while MEEMD yields the largest warming off the west coast of North America. The central North Pacific cooling also varies somewhat between methods and datasets. The PDO time evolution at low frequency is similar for each method, so there are fewer differences than for the AMO. LIMopt and REGR lead to very similar behaviors in each reconstruction, despite differences in SST pattern. Why the PDO is more coherent among methods than the AMO may be in part due to the longer AMO time scale, which makes it more difficult to separate it from the slow variations of the forced signal. In addition, as in the LEs, the AMO pattern projects well on the forced SST signal in the Atlantic, whereas the PDO pattern strongly differs from the forced SST signal in the Pacific. Hence, the PDO is more easily separable from the forced signal than the AMO.
The lag correlation between the AMO and the low-pass-filtered PDO was also calculated. All methods give rather similar results that are broadly consistent with earlier studies, albeit with large differences between reconstructions, but they are not 10% significant.
7. Summary and conclusions
Two large ensembles of historical simulations with the IPSL-CM5A-LR and the CESM1 climate models are used to assess how well various statistical methods estimate the SST changes due to external and anthropogenic forcing during 1940–2015, as well as the main patterns of natural decadal SST variability. The focus was on the AMO, PDO, and SST footprint of the AMOC, which were derived after removing the estimated forced SST signal. Because the amplitude of the natural climate variability is strongly reduced, the ensemble average SST was used as an approximation to the true forced SST signal, thus allowing us to compare the efficiency of each method in estimating the forced SST signal and the natural variability in individual ensemble members.
We have compared seven statistical methods that solely rely on observations to estimate the forced SST signal, thus excluding methods based on multimodel estimates. Two methods based on linear inverse modeling, a standard version (LIM) where the forced signal is taken as the first LIM mode and an optimal perturbation filter (LIMopt), perform consistently well and are often best, leading to small errors in estimating both the forced and the natural SST variability. Removing the forced signal by subtracting the global mean SST (GM) or a linear regression on the global mean SST (REGR) only works well in some cases and leads to large errors in the Pacific and the PDO, perhaps because of the link between global mean SST and PDO (Meehl et al. 2013). Quadratic detrending (D2) efficiently removes the forced SST signal in only one LE, and is unable to represent the SST cooling that follows volcanic eruptions. This is also the case for multidimensional ensemble empirical mode decomposition (MEEMD) because it decomposes signals into components of increasing time scale at each grid point, and hence cannot separate forced SST variations from natural ones with similar time scales but different spatial structures. Removing a linear trend (D1) fares poorly and leads to overestimated AMO fluctuations, so that approach should not be used, even in rather short records.
The two best methods, LIM and LIMopt, provide similar results that are often statistically undistinguishable. LIM is easier to program, but LIMopt is more general as it makes no assumption as to which eigenmode represents the forced SST signal. In fact, when LIM was applied to the ensemble-mean SST, the second LIM mode also contributed to the forced signal, particularly in CESM-LE. This occurred because the natural variability was so strongly reduced that the different SST response patterns expected from different kinds of anthropogenic forcing (Wang et al. 2016) could project on different eigenmodes. The natural variability in individual ensemble members is too strong to allow for such pattern separation, but the first two LIM modes were also needed in HadISST and, to a lesser extent, ERSST when the short period of 1940–2015 was considered. LIMopt is thus more robust since it automatically selected the modes that represent the forced signal.
The LIM methods generally provide a better skill at extracting the forced SST signal probably because they represent the natural climate variability well. LIM decomposes the SST fields into nonorthogonal modes, hence separating the fields into subsets that are not statistically independent. Perhaps more importantly, their success may also arise from the ability of LIM at representing the natural SST variability (the residuals from the global forced signal represented the first LIM mode) by the higher LIM modes. Indeed, midlatitude SST anomalies primarily reflect the response of the upper ocean to stochastic atmospheric forcing and are thus well represented by a first-order Markov process (Frankignoul and Hasselmann 1977; Frankignoul 1985), and tropical SST variability is also well described by a stable linear dynamical system driven by spatially coherent white noise (e.g., Penland and Sardeshmukh 1995; Penland and Matrosova 2006), even on decadal time scales (Newman 2007). Nonetheless, no filter is perfect and the higher LIM modes may retain some forced components, which would explain why LIMopt is more robust than LIM as it makes no a priori assumption as to which modes represent the forced signal. On the other hand, none of the other methods takes advantage of the statistical properties of the natural climate variability.
The comparison between the AMO pattern and the SST fingerprint of the AMOC suggests that in both LEs the AMO is largely driven by the AMOC in the subpolar gyre and the Gulf Stream region, but not in the tropical North Atlantic, which seems influenced by local atmospheric forcing, ENSO teleconnections, or ocean circulation variability unrelated to the AMOC. The maximum mean correlation between the AMOC (or the meridional heat transport) and the AMO occurs when the AMO lags by about 9 yr in IPSL-LE and 1 or 2 yr in CESM-LE, consistent with previous studies. However, even in the “true” case, there is much scatter among ensembles members, indicating that a 76-yr record is much too short to determine with some confidence the relation between the AMOC and the AMO. Finally, there is no significant correlation between the AMO and the PDO in either LE.
The methods were applied to the HadISST, COBE-SST, and ERSST SST reconstructions during 1900–2015. The differences between reconstructions are often comparable to or, in the case of the widely reported global warming slowdown from the late 1990s to 2012 that is only seen for the quasi-global mean SST in HadISST, exceed the differences between methods. Nonetheless, our estimates of the forced SST signal suggest that the natural SST variability slightly masked the quasi-global cooling following the Mount Agung eruption and only contributed to the big hiatus in surface warming from the mid-1940s to the mid-1970s by prolonging it by up to a decade, consistent with Kosaka and Xie (2016). In addition, natural SST variability plays no direct role in the recent warming slowdown from the late 1990s to 2012.
The estimated spatial patterns of natural variability largely reflect the degree of smoothing in the SST reconstructions. Nonetheless, they share common features in the reconstructions. The AMO time series differ somewhat when estimated by D1, D2, MEEMD, and, to a lesser extent, GM from their estimation by LIM, LIMopt, or REGR. Correspondingly, the AMO patterns are slightly different, in particular for D1, D2, and MEEMD, although they all show a similar warming in the subpolar gyre and along the North Atlantic Current. Interestingly, the amount of AMO variance that is attributed to natural variability and the correlation between forced and natural signals varies between methods and reconstructions. The largest percentage of variance attributed to natural variability is for D1 and D2 (52%, 42%, and 48% for COBE-SST, ERSST, and HadISST, respectively), while providing as expected nearly independent estimates of the free and forced signals (correlation |r| ≤ 0.08). MEEMD leads to a rather similar ratio, but highly anticorrelated free and forced signals in COBE-SST (r = −0.29) and ERSST (r = −0.2). The smallest percentage of natural variability is found for REGR (21%, 26%, and 29% for COBE-SST, ERSST, and HadISST, respectively), with nearly independent free and forced AMO time series. The ratios are somewhat larger in GM, but the separation between free and forced signal is poor (r ranging between −0.21 and −0.43), which leads to an overestimation of the forced signal. In LIM, 33%–45% of the AMO variance is attributed to the natural variability, but the separation between free and forced signal is rather poor (r ≈ −0.2). On the other hand, LIMopt separates well the free and forced signals (|r| ≤ 0.09) and only attributes a moderate fraction of the AMO variance to natural variability (25%, 34%, and 38% for COBE-SST, ERSST, and HadISST, respectively). This suggests that a substantial fraction of the signals and climatic impacts attributed to the AMO when it is estimated from linearly detrended data (e.g., Knight et al. 2005; Sutton and Hodson 2005; Alexander et al. 2014; Delworth et al. 2017, and many others) results from more global forcing and should be interpreted with care.
The PDO patterns are more comparable among methods, except for REGR, which provides smaller PDO amplitude, and GM, which yields large errors. However, the PDO time series are largely similar. Most of the low-frequency PDO variance is associated with the natural variability, but free and forced signals can be highly anticorrelated; only LIMopt and, to a lesser extent, REGR (besides D1 and D2) provide nearly independent estimates of free and forced low-frequency PDO variability.
These differences between methods are consistent with those found in the LEs, suggesting that the most reliable estimates of the observed natural decadal variability are found by using LIMopt. That the PDO evolution is more coherent among methods than the AMO may be in part due to the longer AMO time scale, which is more comparable to the slow variations of the forced signal. In addition, the AMO pattern projects well on the forced SST signal, while the PDO pattern strongly differs. Hence, the PDO is more easily separable from the forced signal than the AMO.
As the number of LIM modes needed to represent the forced SST signal may vary and LIMopt better separates free and forced variability, we recommend using LIMopt, which was consistently robust. Solomon and Newman (2012) also noted that the optimal perturbation filter led to more robust representation of ENSO than LIM, and more consistent estimates the ENSO-residual trend in different datasets. On the other hand, using D1, D2, MEEMD, GM, and REGR may lead to biased estimates. Although these differences may not seem striking, they could lead to different interpretations. Consider the tropical transbasin variability, defined by the standardized difference between the yearly mean SST anomaly in the tropical Atlantic–Indian Ocean (15°S–15°N, 40°W–60°E) and the tropical central Pacific (15°S–15°N, 180°–150°W). It approximately increases linearly during 2001–12, but the contributions of the forced signal (Fig. 18, red bar), the AMO (green), and the PDO (gray) differ between estimation methods and, to a lesser extent, datasets. Indeed, the forced signal does not contribute at all to the linear trend in GM and very little in REGR, D1, D2, and MEEMD, but it contributes substantially in LIM and LIMopt; the contribution of the PDO is smallest in GM, and the AMO contribution is small. This could lead to different assessment of forced versus natural climate impacts.
Support from the NOAA Climate Program Office Climate Variability and Predictability program (NA13OAR4310139), NSF EaSM2 (OCE-84298900), the European Community Horizon 2020 Framework under Grant Agreement 727852 (Blue-Action), and the ANR MORDICUS project (ANR-13-SENV-0002-02) is gratefully acknowledged. This work was granted access to the HPC resources of TGCC under allocation 2015-017403 made by GENCI. We thank the reviewers for insightful and constructive comments.
Similar results are obtained by calculating the rotated EOFs separately in the Indo-Pacific and the Atlantic, but more EOFs are required to represent a comparable amount of variance.