## Abstract

Reconstruction of the earth’s surface temperature from proxy data is an important task because of the need to compare recent changes with past variability. However, the statistical properties and robustness of climate reconstruction methods are not well known, which has led to a heated discussion about the quality of published reconstructions. In this paper a systematic study of the properties of reconstruction methods is presented. The methods include both direct hemispheric-mean reconstructions and field reconstructions, including reconstructions based on canonical regression and regularized expectation maximization algorithms. The study will be based on temperature fields where the target of the reconstructions is known. In particular, the focus will be on how well the reconstructions reproduce low-frequency variability, biases, and trends.

A climate simulation from an ocean–atmosphere general circulation model of the period a.d. 1500–1999, including both natural and anthropogenic forcings, is used. However, reconstructions include a large element of stochasticity, and to draw robust statistical interferences, reconstructions of a large ensemble of realistic temperature fields are needed. To this end a novel technique has been developed to generate surrogate fields with the same temporal and spatial characteristics as the original surface temperature field from the climate model. Pseudoproxies are generated by degrading a number of gridbox time series. The number of pseudoproxies and the relation between the pseudoproxies and the underlying temperature field are determined realistically from Mann et al.

It is found that all reconstruction methods contain a large element of stochasticity, and it is not possible to compare the methods and draw conclusions from a single or a few realizations. This means that very different results can be obtained using the same reconstruction method on different surrogate fields. This might explain some of the recently published divergent results.

Also found is that the amplitude of the low-frequency variability in general is underestimated. All methods systematically give large biases and underestimate both trends and the amplitude of the low-frequency variability. The underestimation is typically 20%–50%. The shape of the low-frequency variability, however, is well reconstructed in general.

Some potential in validating the methods on independent data is found. However, to gain information about the reconstructions’ ability to capture the preindustrial level it is necessary to consider the average level in the validation period and not the year-to-year correlations. The influence on the reconstructions of the number of proxies, the type of noise used to generate the proxies, the strength of the variability, as well as the effect of detrending the data prior to the calibration is also reported.

## 1. Introduction

It is important to know the climate of the past in order to assess the significance of present-day climate excursions and to assess sensitivity of the climate to different forcings. Climate has been systematically observed with instruments since the mid-nineteenth century only. Further back in time, a few instrumental series exist (e.g., Manley 1953; Parker and Horton 2005), but mainly one must rely on climate-sensitive proxy data. Thus, there is a rich tradition for exploiting data from documentary archives (e.g., Pfister 1995; Pfister et al. 1998; Ogilvie 1996), tree rings (Fritts 1976; Cook and Kairukstis 1990), ice cores (Dansgaard et al. 1982; Dahl-Jensen et al. 1998), corals (Halpert and Ropelewski 1992), marine sediment cores (Bond et al. 2001), lake sediment core data (Wolfe et al. 2004), terrestrial borehole temperatures (Huang et al. 2000; Pollack and Huang 2000; Harris and Chapman 2001; Beltrami 2002, Pollack and Smerdon 2004), and other data sources, to reconstruct past climate either qualitatively or quantitatively.

One of the first quantitative multiproxy climate reconstructions, that is, using a suite of different proxies, with global scope was done by Groveman and Landsberg (1979) and Groveman (1979). They used a multiple regression analysis for reconstructing mean Northern Hemisphere (NH) temperatures from a range of different proxies. They applied a direct regression approach where the predictand—the NH temperature—is expressed as a linear combination of the proxies. The reconstruction (a.d. 1579–1860) showed a good deal of low-frequency variability. A general warming of recent times compared to preindustrial times was demonstrated and the lack of trends in the precalibration era was noted. A simpler method, the composite plus scale, was used in a multiproxy climate reconstruction of the NH by Bradley and Jones (1993). Composite-plus-scale methods relate a single composite proxy, constructed as an (possibly weighted) average of all proxies, to the global temperature by a scaling.

Recently, Mann et al. (1998) considered a much larger proxy dataset and a method where the spatial temperature field is reconstructed. We will refer to such reconstruction methods as “field reconstructions” and methods, as those cited in the paragraph above, where the global-scale mean is reconstructed directly as “global reconstructions.” Hence, composite-plus-scale methods are a subset of global reconstructions. The approach of Mann et al. (1998) can be considered a rediscovery of field reconstruction methods that had been applied in the three-ring community earlier (e.g., Fritts et al. 1971; Cook and Kairukstis 1990, chapter 4). It is also worth noting that their methodology is based on indirect regression; the proxies are expressed as linear combinations of the predictands. The mean NH temperature obtained from this reconstruction shows a long, slowly decreasing trend followed by a sharp rise in industrial times; the so-called “hockey stick.” The reconstruction was provided with uncertainty estimates, which become broader back in time.

The hockey stick has become the center of much attention at both political and scientific levels. While the contribution by Mann et al. (1998) was notable for pulling together a large and varied proxy dataset it also raised a range of questions about the method choice, the robustness and quality of regression-based reconstructions, as well as issues related to data coverage. The last decade has seen a series of papers addressing these issues, but a consensus has not been reached. Some studies are based on observations and real proxies while other utilize model-generated data. In the latter approach the quality of the reconstructions can be easily obtained because the target is known. However, pseudoproxies have to be constructed, typically by adding noise to local temperatures. In the former approach the main limitation is the brief length of the observational record, which makes assessment of the low-frequency properties of reconstructions difficult.

Mann and Rutherford (2002) was one of the first papers to investigate systematically the effects of method characteristics. They used pseudoproxies constructed from the observed gridded temperature series by adding noise, and they studied the effects of varying the number of proxies, the signal-to-noise ratio of the proxies, and the type of noise used to generate the proxies. In this work the regularized expectation maximization (RegEM) algorithm (Schneider 2001) was introduced as a climate reconstruction method. A conclusion was that the reconstruction itself was not particularly sensitive to the power spectrum of the noise added, but that the significance levels were.

Rutherford et al. (2003), also using RegEM, did a more extensive study focusing on the role of calibrating in periods with and without temporal trends. This was done for both observed data (as in Mann and Rutherford 2002) and with data from forced and unforced climate model runs. The main conclusion was that if the calibration period included a trend it was possible to obtain nearly unbiased reconstructions of periods lacking trends. However, if the calibration period did not include a trend, the reconstructions showed evidence of systematic bias in periods with trends.

Zorita et al. (2003) performed a study based on a 1000-yr-long unforced simulation with the ECHAM and the global Hamburg Ocean Primitive Equation (ECHO-G) climate model. This work used the original reconstruction method of Mann et al. (1998) and used “perfect” pseudoproxies; that is, the pseudoproxies were chosen as local temperatures without any noise added. Zorita et al. (2003) found that the reconstructions underestimated low-frequency variability in the global-mean temperature. However, the coherence between the reconstruction of the global-mean temperature and the target (the modeled global-mean temperature) was larger for low frequencies than for high frequencies.

Jones and Mann (2004) offered a review of the now burgeoning field of climate reconstructions considering the following two alternative approaches: climate-field reconstruction and composite-plus-scale reconstruction. They argued that the reconstruction method is of secondary importance compared to better and larger networks of available proxies, but they recommended field reconstruction over composite-plus-scale reconstruction because of the additional regional information.

Von Storch et al. (2004) tested the climate-field reconstruction method of Mann et al. (1998), generating pseudoproxies from a climate model experiment driven by estimations of historical forcings. The paper concluded that the method seriously underestimated the low-frequency variability. The relevance of this analysis was questioned by Mann et al. (2005) who argued that the model experiment used by von Storch et al. (2004) had an unphysical pattern of variance connected to the unusually large changes in the natural radiative forcing and an early initialization drift, and that the experiment therefore was inappropriate for testing climate reconstruction methods. Applying the RegEM method and a simple composite-plus-scale method to data from a different climate model experiment, they found no evidence that real-world reconstructions are likely to underestimate the low-frequency variability. Furthermore, Wahl et al. (2006) argued that the underestimation of the low-frequency variability reported by von Storch et al. (2004) was mainly a result of a detrending of the data before the method was calibrated. This detrending step was not applied in Mann et al. (1998). Von Storch et al. (2006a) acknowledged this difference between the methodologies of von Storch et al. (2004) and Mann et al. (1998), but they argued that the underestimation of the low-frequency variability remains also without the detrending. Zorita et al. (2007), using the reconstruction method from Mann et al. (1998), found results similar to von Storch et al. (2004) by generating pseudoproxies from a climate simulation with a variability that closely resembles the variability in the simulation used by Mann et al. (2005). In a reply, Mann et al. (2007a) presented reconstructions with the RegEM method that opposed the results of Zorita et al. (2007).

Smerdon and Kaplan (2007) argued that the methodology used in the pseudoproxy study by Mann et al. (2005) included a standardization scheme in the RegEM procedure that used information not only from the calibration period but also from the reconstruction period. They also showed that if the test of Mann et al. (2005) is repeated including only information from the calibration period, then the reconstructions suffer from biases and variance losses similar to those reported by von Storch et al. (2004). Mann et al. (2007c) argued that the source of the problem—that the reconstructions depend on the standardization scheme—originates in a particularity of the ridge regularization approach used in previous versions of RegEM. They introduced a new version of RegEM where the regularization is based on truncated total least squares. Mann et al. (2007b) further investigated this alternative version of RegEM by applying it to data from climate model experiments and concluded that this method can produce reconstructions with no evident systematic bias. The discussion on this point continues in Mann et al. (2007c) and Smerdon et al. (2008).

Bürger and Cubasch (2005) and Bürger et al. (2006) considered a very broad approach, varying as many method-choices as possible in the Mann et al. (1998) method. These choices included, among others, inclusion versus removal of trends, the direction of the regression method, and field reconstruction versus global reconstruction of the NH mean. Bürger and Cubasch (2005) considered real data and proxies while Bürger et al. (2006) used a climate model simulation. Both papers demonstrated a large sensitivity of the reconstructions to these relatively arbitrary choices.

Rutherford et al. (2005) tested the sensitivity to method, predictor network, target seasons, and spatial domain, using two proxy datasets. Three methods for reconstruction were employed: the RegEM method of Schneider (2001) and two modifications of this method. They concluded that the methodology is generally robust and emphasized that data quality was more important than data quantity. They also showed that the RegEM method produces a NH mean reconstruction very similar to that of Mann et al. (1998) when applied to the same set of proxies.

Let us finally mention two additional studies suggesting methodological improvements. Moberg et al. (2005) introduced a reconstruction method based on separation of data into frequency domains. They also introduced the use of some lower-resolution proxies capturing information at multicentennial time scales, as opposed to tree-ring data that can have a low-frequency cutoff resulting from data-treatment issues related to the tree-ageing process (Esper et al. 2002). Their reconstruction showed larger multicentennial variability than most previous multiproxy reconstructions and notably showed evidence of large natural climate variability in the past. Hegerl et al. (2007) introduced a composite-plus-scale method where the relation between the composite proxy and the NH temperature is determined by total least squares. In contradiction to ordinary least squares, where the errors are assumed to be present only in the observations, total least squares takes into consideration errors in both the observations and the proxies. Reconstructing the NH temperature of the climate simulation from von Storch et al. (2004) they found that their method avoids loss of low-frequency variability. Neither the method of Moberg et al. (2005) nor the method of Hegerl et al. (2007) are considered in the present study.

The brief summary above of the published work that followed the remarkable paper by Mann et al. (1998) is far from complete. We have not discussed issues regarding the selection of proxies (e.g., McIntyre and McKitrick 2005), but we have focused on issues directly related to the statistical reconstruction methods. However, the summary demonstrates that the scientific community has not reached a general agreement on the robustness and best methodology of climate reconstructions. For further discussions and references we refer to the report by the National Research Council (2006). This report reviewed the state of climate reconstructions following the sometimes uproarious debate that arose after Mann et al. (1998). The review makes it clear that several issues still require attention; the issue of how well low frequencies are reconstructed is the most important for those investigations that wish to evaluate the present warming trend in terms of past excursions. The problem of how to assign reconstruction skill to a result is also paramount because much of the previous work has focused on the “reduction of error” and the “coefficient of efficiency” diagnostics.

In this paper we present a comprehensive study of the statistical properties of climate reconstruction methods based on an extended pseudoproxy method. We include seven different reconstruction methods, including the method of Mann et al. (1998) and two versions of the RegEM method. We focus mainly on the ability of the methods to reconstruct the longest time scales, such as the centennial variability, the preindustrial level, and trends. By applying the different reconstruction methods to exactly the same data, with the same data preparation with regards to normalization, trends, the choice of calibration period, number, and type of proxies, etc., we avoid some of the problems that have made comparison across publications difficult and probably clouded the issues. Note, that we validate only the ability of the methods to reconstruct the properties of the Northern Hemisphere mean. This limitation holds for all of the methods, and also for those that do reconstruct the field.

From the above it is evident that climate models with historical forcings have been widely used to test reconstruction methods. However, reconstructions include a large element of randomness, and results that are based on a single climate model experiment are not representative, as will be demonstrated. This might explain some of the seemingly contradictory results found in the literature. Using a large ensemble of climate models experiments would be preferable and, for example, enable the study of the statistical distribution of reconstruction performance. Unfortunately, such a large ensemble of climate model experiments does not currently exist. Instead, based on a climate simulation of the period a.d. 1500–1999, we apply a novel technique to generate an ensemble of surrogate fields with the same temporal and spatial characteristics as the original modeled surface temperature field. This technique also allows us to control the strength of the variability in the surrogate fields.

We apply the reconstruction methods to each member of the ensemble of surrogate temperature fields; we calibrate the reconstruction method on a subperiod and reconstruct the remaining years. The ensemble (or multiworld) approach will allow us to make statistically robust estimates both of the mean properties of the different reconstruction methods and their statistical variance. The realism of the pseudoproxies is ensured by obtaining their relation to the local temperatures and their number from Mann et al. (1998).

The rest of the paper is organized as follows. In section 2 we describe the different reconstruction methods. In section 3 we describe the generation of the surrogate temperature fields and the pseudoproxies, and we also provide details about how the data were prepared before they were passed to the reconstruction methods. In sections 4–9 we present the results. Section 4 deals with the stochasticity of the reconstructions, section 5 with the sensitivity of reconstructions to the number of proxies and the type of noise used to generate the proxies, section 6 with the reconstruction methods ability to capture a trend, section 7 with the possibility of using independent validation data to separate good and bad reconstructions, section 8 with the effects of detrending and inflation, and section 9 with the robustness of the reconstructions to some tunable parameters. We close the paper with conclusions in section 10.

## 2. Reconstruction methods

In this section we describe the reconstruction methods. In view of the recent history of the field where at least some of the confusion and disparities have been due to subtle differences in the implementation of the reconstruction methods we describe the methods and the data (section 3) in detail. The material in the present section may be skipped at first reading. The reconstruction methods are summarized in Table 1.

Of the seven methods, the four first are based on standard multiple or multivariate regression methods. The seven methods include two global methods, that is, where the hemispheric-mean temperature is reconstructed directly, and five field methods, where the global temperature field is reconstructed followed by a calculation of the hemispheric-mean temperature. Of the field methods the first two reconstruct the leading principal components (PCs) of the temperature fields. The fifth method is canonical correlation analysis (CCA) regression, which differs from the first four methods by treating temperatures and proxies more symmetrically. The two last methods are the regularized EM algorithms, RegEM Ridge, and RegEM TTLS, where the regularization is performed with ridge regression and truncated total least squares regression, respectively.

In the calibration period we have *n*_{cal} coincident records of the observations and the proxies. The *p*_{obs} observations fill the matrix 𝗢 (*n*_{cal} × *p*_{obs}) and the *p*_{prox} proxies fill the matrix 𝗣 (*n*_{cal} × *p*_{prox}).

In the reconstruction period we have additional *n*_{rec} records of the proxies. We let the matrix (*n*_{rec} × *p*_{prox}) contain the *n*_{rec} records of the proxies, and we let the matrix (*n*_{rec} × *p*_{obs}) represent the unknown values of the observations to be reconstructed.

The complete data matrix is then organized as follows:

We assume that both proxies and observations have been centered over the calibration period. For the proxies this means that the temporal mean of 𝗣_{|,i} has been subtracted from both 𝗣_{|,i} and _{|,i}. Here, and in subsequent notation, 𝗠_{|,i} denotes the *i*th column vector of the matrix 𝗠.

### a. Direct regression

The direct method expresses the observations as linear combinations of the proxies, 𝗢 = 𝗣** α**, where

**(**

*α**p*

_{prox}×

*p*

_{obs}) is a matrix of regression coefficients.

In the calibration step we write for the *i*th of the *p*_{obs} observations,

This gives us *n*_{cal} equations for the *p*_{prox} unknowns *α*_{|,i}. For each *i* the best solution is found by least squares approximation and the formal solution is

Because the data have been centered, 𝗣^{T}𝗣 is recognized as the sample covariance matrix **Σ*** _{PP}* of the proxies.

The reconstruction part is now simple. For each time *j* of the reconstruction period we have _{j,|} = _{j,|}** α**.

For the direct method to give a unique solution we need *n*_{cal} > *p*_{prox} as a necessary condition for **Σ*** _{pp}* to be invertible. We use a backward elimination procedure (see section 2e) to select the most skillful proxies. Thus, above 𝗣 and

**only contain the rows corresponding to proxies accepted by the backward elimination procedure.**

*α*### b. Indirect regression

In the calibration step the indirect method expresses the proxies as linear combinations of the observations 𝗣 = 𝗢** α**, where

**(**

*α**p*

_{obs}×

*p*

_{prox}) is the matrix of regression coefficients.

We proceed as for the direct method and for the *i*th of the *p*_{prox} proxies we have

which gives us *n*_{cal} equations for the *p*_{obs} unknowns *α*_{|,i}. For each proxy series *i*, the best solution is found by the least squares approximation. The formal solution is

where 𝗢^{T}𝗢 = **Σ*** _{OO}* is the sample covariance matrix of the observations.

We need *n*_{cal} > *p*_{obs} as a necessary condition for **Σ*** _{OO}* to be invertible. As for the direct regression, we use a backward elimination procedure (see section 2e) to select the most skillful predictors, and above 𝗢 and

**only contain the rows corresponding to the accepted observations.**

*α*This is all similar to the calibration step for the direct method. However, the indirect method requires another inversion in the reconstruction step. For each time *j* of the reconstruction period we have *p*_{prox} equations for the *p*_{obs} unknowns,

The solution is found by least squares approximation and is formally _{j,|} = _{j,|}*α*^{T}(*αα*^{T})^{−1}.

For *αα*^{T} to be invertible it is necessary that *p*_{prox} > *p*_{obs}. This is fulfilled if one only wants to reconstruct the NH-mean temperature (global reconstruction), but it is not directly fulfilled if one wants to reconstruct the temperature field (field reconstruction). This regularization problem for the field reconstruction is solved by choosing a leading subset of the PCs of the temperature field (see section 2e).

In full matrix notation we have = 𝗕, where 𝗕 = ** α** =

**Σ**

_{PO}for the direct method and 𝗕 =

*α*^{T}(

*αα*^{T})

^{−1}with

**= for the indirect method. Here,**

*α***Σ**

*= 𝗣*

_{PO}^{T}𝗢 is the cross-covariance matrix between proxies and observations.

### c. Canonical correlation analysis regression

Canonical correlation analysis finds pairs of patterns in observations and proxies whose expansion coefficients describing the time development have the largest possible correlations (see, e.g., Preisendorfer and Mobley 1988; Bretherton et al. 1992; Cook et al. 1994). First, a singular value decomposition of 𝗖_{OO}^{−1/2}𝗖_{OP}𝗖_{PP}^{−1/2} (*p*_{obs} × *p*_{prox}) is performed, where 𝗖* _{OO}*, 𝗖

*, and 𝗖*

_{PP}*are the correlation and cross-correlation matrices. The singular value decomposition gives (without any loss of generality, we assume that*

_{OP}*p*

_{obs}>

*p*

_{prox}) 𝗜

_{O}𝗪𝗜

_{P}

^{T}, where 𝗜

*(*

_{O}*p*

_{obs}×

*p*

_{prox}) and 𝗜

*(*

_{P}*p*

_{prox}×

*p*

_{prox}) are the left and right singular vectors, respectively, and 𝗪 (

*p*

_{prox}×

*p*

_{prox}) is diagonal. Then, the expansion coefficients 𝗢′ and 𝗣′ (

*n*

_{cal}×

*p*

_{prox}) are found by projecting the original fields, 𝗢 and 𝗣 on the weight vectors 𝗖

_{OO}

^{−1/2}𝗜

_{O}and 𝗖

_{PP}

^{−1/2}𝗜

_{P}.

The expansion coefficients 𝗢′ and 𝗣′ are orthogonal, so in this set of coordinates both the covariance and cross-covariance matrices become diagonal and so does the matrix 𝗕 of regression coefficients. Direct regression, which we will use, gives ′_{|,i} = *w _{i}*′

_{|,i}, where

*w*, the

_{i}*i*th diagonal element of 𝗪, is the (canonical) correlation coefficient between 𝗢′

_{|,i}and 𝗣′

_{|,i}. The expansion coefficient ′ is found by projecting the original field on the weight vector 𝗖

_{PP}

^{−1/2}𝗜

_{P}.

The CCA requires the calculation of the inverse of the correlation matrices for the observations and proxies and therefore that *n*_{cal} > *p*_{prox} and *n*_{cal} > *p*_{obs}. To avoid singularities we use the version of CCA where the fields are first transformed to PC space (Barnett and Preisendorfer 1987; Bretherton et al. 1992) and only a subset of the leading PCs are retained.

### d. The regularized EM algorithms

The RegEM for imputation of missing values was developed by Schneider (2001) and introduced in the context of climate reconstructions by Mann and Rutherford (2002) and Rutherford et al. (2003).

The expectation maximization method is iterative and begins by filling the missing observations with the time means of known observations, 𝗢: _{|,i} = _{|,i}. The rest of the algorithm operates with covariance and cross covariance obtained from the augmented matrices = 𝗢 | [(*n*_{rec} + *n*_{cal}) × *p*_{obs}] and = 𝗣 | [(*n*_{rec} + *n*_{cal}) × *p*_{prox}]. In each iteration the direct regression coefficients are calculated as 𝗕 = and the missing values are updated with the regressed values ← + ( − )𝗕. The covariance and cross-covariance matrices in each iteration are calculated from the updated values, with an additional term in **Σ**_{ÔÔ} representing the residuals. The iterations continue until convergence. Throughout this paper we use the same stopping criterion as in Rutherford et al. (2003, their appendix) except in section 9, where the robustness to this choice is investigated.

The regularized algorithm RegEM Ridge consists of the same steps but now is replaced by ( + *h*^{2}𝗗)^{−1}, where 𝗗 is the diagonal matrix consisting of the diagonal elements of and *h* is a regularization parameter. Regression with this form of regularization is known as ridge regression. In each iteration *h* is determined by a generalized cross-validation procedure (**Σ**_{ÔÔ} enters here).

The algorithm sketched above differs a little from the original (Schneider 2001). Because of the simple geometry of Eq. (1), the regression coefficients and the regularization parameter need only be calculated once per iteration (see also Smerdon et al. 2008). In practice, the coefficients 𝗕 are calculated by singular value decomposition of the correlation matrix as described in detail in Schneider (2001).

Mann et al. (2007b,c) introduced a version of the RegEM algorithm, RegEM TTLS, where the regularization is performed by truncated total least squares [Fierro et al. (1997), see also van Huffel and Vandewalle (1991) for extension to *p*_{obs} > 1] instead of ridge regression. In each iteration the regression coefficients 𝗕 are found by an eigenvalue decomposition of the total covariance matrix [i.e., the (*p*_{obs} + *p*_{prox}) × (*p*_{obs} + *p*_{prox}) covariance matrix obtained from the full data matrix in Eq. (1)], with an additional term representing the residuals. A truncation parameter *k* is then obtained as the number of eigenvalues that are above the noise level and the regression coefficients calculated from the truncated set of eigenvectors. The version of RegEM TTLS used here includes the additional reduction applied by Mann et al. (2007b) of representing the observations by their leading *m* PCs, where *m* is the number of PCs resolvable from noise. Here we find *m* = 4 and *k* = 13 as the number of components that separate the signals from the noise.

### e. Regularization and good predictors

For the direct method to be well posed we must have *n*_{cal} > *p _{prox}*, and for the indirect method we must have both

*n*

_{cal}>

*p*

_{obs}and

*p*

_{prox}>

*p*

_{obs}, as we have seen. Because a typical calibration period in reality is of the order 100 yr, the number of observations around 1000, and the number of proxies of the order of 100, these criteria are not fulfilled. Such ill-posed problems with a noninvertible covariance matrix are often encountered in regression studies and several strategies exist to deal with them.

A simple approach that works for the indirect regression method is to reconstruct only the global or hemispheric mean (p_{obs} = 1). It is less drastic to use the leading PCs of the observations, which can reduce the *p*_{obs} to a few tens and still retain almost all of the variability. This choice is natural if one is interested in reconstructing regional variability (Briffa et al. 1986). If one is only interested in the global or hemispheric mean it is not obvious a priori which method is better. Although the empirical orthogonal functions (EOFs) do not necessarily have any physical reality, choosing the leading PCs is at least economical in the way that as much variability as possible per variable has been included. Note also that using the PCs the covariance matrix 𝗢^{T}𝗢 becomes diagonal, which simplifies some of the calculations. However, in our implementation the PCA is done on monthly anomalies (following Mann et al. 1998) and annual means of the PCs are calculated. These annual means are not necessarily orthogonal.

The CCA regression method also solves the regularization problem by simultaneously reducing the number of predictors and predictands. The RegEM algorithm solves the problem of regularization either by ridge regression or truncated total least squares regression.

We still need to solve the regularization problem for the direct regression method, but also another problem remains: We need to decide which predictors to include in the regression. It is well known that including predictors that explain little or none of the predictand’s variance will decrease the predictive properties of the regression method, even if the problem is well posed.^{1}

We therefore need a procedure to select a subset of the possible predictors. We have chosen the backward elimination procedure where predictors are removed if they do not explain significantly (most often to the *p* = 10% level, but the influence of this parameter is studied in section 9) more variance than a random predictor. This selection of predictors are not applied to the CCA and RegEM methods.

## 3. Test data

To estimate the skill of the reconstruction methods we apply the methods on an ensemble of different surface temperature fields. The ensemble of surface temperature fields is generated by a Monte Carlo approach based on phase randomizing the original field. The original field comes from a climate simulation with a coupled general circulation model with historical forcings. Pseudoproxies are generated from each member of the ensemble by degrading local temperatures with noise.

### a. The general circulation model experiment

Our basic surface temperature field originates from a forced experiment with the coupled atmosphere–ocean general circulation model ECHAM4/Ocean Isopycnal Model 3 (OPYC3), with a horizontal resolution in the atmosphere of T42 and the surface temperature field obtained on a regular 128 × 64 latitude–longitude grid. The experiment covers the period 1500–1999 and is forced with both natural (solar irradiation, volcanic sulfur) and anthropogenic forcings (well-mixed greenhouse gases, sulfate aerosols, land use changes, tropospheric ozone). The model and the experiment are described in detail in Stendel et al. (2006).

### b. The surrogate temperature fields

Surrogate temperature fields, with the same temporal and spatial coverage as the original climate model surface temperature field, are generated with a phase-randomizing procedure as introduced in Christiansen (2007). First, the original monthly surface temperature field from the climate model is transformed to PCs. Then, the average annual cycle and the trend (obtained as a least squares third-order polynomial fit) are removed from each individual PC. The resulting PC anomalies are then Fourier transformed; the Fourier phases are randomized, but with the same random phases for all PCs; and an inverse Fourier transform is performed to get the scrambled PCs. Then, the average annual cycle and the trend are added to each scrambled PC, and these PCs are finally recombined to get the surrogate fields. Note that the transformation to PC space is only done for efficiency. Here we retain the 30 first EOFs, including 99.99% of the variability.

In this way we are able to generate an ensemble of surrogate temperature fields with the same temporal and spatial structure as the original one. More precisely, the phase-randomizing procedure retains all lagged covariances and cross covariances.^{2} Therefore, the surrogate fields represent the temperature development over the period 1500–1999 as it could have been with other phases of the “noise”; here, the noise is defined as the residuals after removing the trend, that is, the noise includes both forced and internal variability. Optionally, we can enhance the strength of the variability by scaling the PC anomalies with a common factor *a* before the recombination. Note that this method results in a conservative ensemble of surrogate temperature fields: no new physics is introduced. We also note that except for the trend the statistical structure of the surrogates does not depend on time. In particular, the statistical structure is the same in the calibration and reconstruction period. This form of stationarity will probably overestimate the performance of the reconstruction methods as compared to their performance on real data.

Before the generation of the surrogate surface temperature fields the original field was interpolated to a coarser 32 × 32 latitude–longitude grid to increase processing speed. The procedure described above was performed on the area north of 40°S. This area includes 32 × 23 grid points, which is therefore also the resolution of the surrogate fields. The NH-mean surface temperature that we want to reconstruct is the area-weighted average of all grid points (32 × 16) in the Northern Hemisphere.

### c. Construction of pseudoproxies

For each surrogate data field we need to produce a set of pseudoproxies to pass to the reconstruction methods. For increased realism of the study we base the number of the pseudoproxies and their relation to local temperatures on the data in Mann et al. (1998; see also Mann et al. 2004).

For the numbers of pseudoproxies we take 112, 57, and 22, which corresponds to the number of proxies in Mann et al. (1998), reaching back to 1820, 1600, and 1400, respectively. We identify the location of the *i*th proxy as where the correlation between the proxy 𝗣_{|,i}^{Mann} and the local surface temperature 𝗧^{Mann} used in Mann et al. (1998) [surface temperature resulting from Jones and Briffa (1992)] is largest. Choosing this location, and not the location where the proxy was obtained, allows for the representation of some nonlocal effects between proxy and temperature (see also below). The maximum correlations have a lower value near 0.3 and a mean near 0.47 for all three the sets of proxies reaching back to year 1400, 1600, and 1800. The positions of the pseudoproxies are often but not always near the position of the corresponding proxy from Mann et al. (1998). We have tested that similar results are obtained with randomly distributed locations, so the results presented in the rest of the paper are not sensitive to the location of the proxies.

The *i*th pseudoproxy _{|,i} is constructed from the local surrogate surface temperatures _{|,i} in such a way that they have the same correlation *c* as the correlation between the proxies 𝗣_{|,i}^{Mann} and the corresponding local surface temperature 𝗧_{|,i}^{Mann} taken over the period 1902–80. This is fulfilled by setting _{|,i} = _{|,i}*σ*^{T} + *ξ*1/*c*^{2} − 1, where *σ*^{T} is the standard deviation of _{|,i}. The noise *ξ* has unit variance, zero mean, and is either taken as white noise or, more realistically, is constructed to have the same power spectrum as the residuals 𝗣_{|,i}^{Mann} − *r _{i}*𝗧

_{i}

^{Mann}, where

*r*is the regression coefficient between 𝗣

_{i}_{|,i}

^{Mann}and 𝗧

_{i}

^{Mann}. The amplitudes of the noise, 1/

*c*

^{2}− 1, typically fall in the range between 1 and 3.

Some proxies are not necessarily well represented by the local temperature but may, for example, be more connected to the precipitation. Because the precipitation in turn is related to the large-scale patterns, it is important for the realism of our study that the pseudoproxies not only are realistically related to the local temperature but also have teleconnections of realistic strength and extent. Figure 1 shows three typical examples of the correlations between the proxy and the temperature field for both the real proxy and the corresponding pseudoproxy. Note that the teleconnections are stronger for the pseudoproxies than for the real proxies. This increased “stiffness” of the model over the real climate will probably lead to an improved performance of reconstruction methods when applied to model data (Zorita et al. 2003). We conclude that the pseudoproxies do not underestimate the relations between proxies and the temperature field.

We also study reconstructions based on the two extremes: “perfect” proxies and “spurious” proxies. They are defined, respectively, as proxies identical to the local temperatures (no noise added) and as proxies without any connection to the local temperatures (only white noise).

### d. Data preparation

The proxies have been centered and normalized with offset and variance determined over the calibration period. The trend has not been removed except in section 8, where the influence of detrending is studied.

Likewise, the observations have been centered to the mean over the calibration period but their trends have not been removed except in section 8. The preparation of the observations is detailed below.

For the direct and indirect global reconstructions the annual means of the local temperatures have been centered with the temporal mean over the calibration period. The NH (32 × 16 grid points) mean is calculated weighting the grid points with their area. The standard deviation of the detrended NH mean was calculated to be used in the optional inflation.

For the direct and indirect field reconstructions a principal component analysis has been performed on monthly anomalies (following Mann et al. 1998) of the temperatures north of 40S (32 × 23 grid points) in the calibration period. Before the PCA these anomalies were normalized with the standard deviation *s* of the detrended anomalies and weighted with the cosine to their latitudes *w*. Then, the predictands were calculated as the annual means of the PCs. Note that these annual means are not necessarily orthogonal. The standard deviation of the detrended PCs was then calculated to be used in the optional inflation. However, the PCs were not detrended before the reconstruction. After the reconstruction, the reconstructed and optionally inflated PCs are recombined into local temperatures which then are scaled with *s* and 1/*w*. Finally, the weighed NH mean is calculated.

The CCA reconstruction was performed on local annual-mean temperatures that have been centered over the calibration period. In the CCA regression the PCs were calculated from annual values and not from monthly values as in the direct and indirect field reconstructions. The optional inflation was performed using the standard deviation of the detrended local annual temperatures.

For the two RegEM versions the local annual-mean temperatures in the NH have been centered over the calibration period. No inflation has been attempted for these methods because reconstructions in the calibration period are not directly available (however, see Smerdon et al. 2008).

The target to which we compare the reconstructions is found by centering the local temperatures with their temporal mean over the calibration period and calculating the weighted NH mean. This is the same procedure as when preparing the predictands for the global methods.

Inflation is an ad hoc method that attempts to repair the underestimation of variance often encountered by regression methods (Klein et al. 1959). Inflation compares the variance of the reconstructed (detrended) variables in the calibration period with known variance of the original (detrended) variables. The reconstructed variables are then scaled with the square root of the fraction of these variances.

## 4. Stochasticity of the reconstructions

Figure 2 shows the reconstructions of the NH-mean temperature of three surrogate data fields with different strength of variability (top: *a* = 1, middle: *a* = 2, bottom: *a* = 3). The calibration period was the last 100 yr (1900–99) and the reconstructions were performed with 57 white noise pseudoproxies. The black curve is the target, that is, the NH-mean temperature of the surrogate data, and the colored curves are the reconstructions obtained with the seven different reconstruction methods. The thin curves show annual values, while the thick curves are low-pass filtered to remove time scales shorter than 50 yr.

This figure illustrates some of the points that will be elaborated on later. First, we note that the shape of the low-frequency variability is relatively well reconstructed. Second, we note that although differences exist between the seven reconstruction methods there is some agreement about the low-frequency variability. Third, we observe that the amplitude of the low-frequency variability is generally underestimated and that most reconstructions underestimate the cold temperatures before 1900. Finally, we see that this bias seems to increase with increasing strength of the variability in the surrogate field.

While each of the panels in Fig. 2 only shows results from a single surrogate field, Fig. 3 summarizes the results from the reconstructions of 100 surrogates with amplification factor *a* = 2. Each of the seven panels, one for each of the reconstruction methods, shows the histogram of the year-to-year correlations over the reconstruction period 1500–1899 between the target and the reconstructions. This figure demonstrates an important point: The reconstructions include a substantial element of stochasticity and very different results can be obtained using the same reconstruction method on different surrogate fields. The direct global method, for example, gives correlations between 0.4 and 0.7. The largest and smallest spread are found for the RegEM TTLS method and the indirect global method, respectively. Averaged over the 100 surrogates the indirect field method gives the smallest (0.46) and the indirect global method the largest (0.69) correlations.

The stochasticity is not limited to the year-to-year correlations as shown in Fig. 4. Here the results of the reconstructions with the different methods are shown as box plots, also known as box-and-whisker plots. In these plots the large, bold circle marks the mean, the horizontal line marks the median, the box gives the interquartile range, the thin vertical lines are the 5th–95th percentile range, and the small dots are the 100 individual data points. We focus here on four diagnostics of the quality of the reconstructions (see Table 2 for details). The two first are the year-to-year (already presented in Fig. 3) and the low-frequency correlation coefficients between the reconstruction and target. The correlations are calculated over the reconstruction period 1500–1899, and the low-frequency correlation is calculated from 50-yr low-passed-filtered data. The third diagnostic is the relative bias calculated from the averages of the reconstruction and target, respectively, over the preindustrial years of 1500–1799. The last diagnostic is the relative low-frequency amplitude calculated from the variances of the 50-yr low-passed-filtered data. For each method (color coded) results are shown for amplification factors 1, 2, and 3.

We strongly emphasize the stochasticity that is considerable for all methods and all diagnostics and that, for example, for the indirect field method, makes the relative bias of the reconstructions vary from a 100% underestimation to a 20% overestimation (for *a* = 3).

The low-frequency correlations are generally rather large in particular compared to the year-to-year correlations. This means that the form of the low-frequency variability is, in general, well represented in the reconstructions. In contrast, the relative bias and the relative amplitude are in general negative, strongly underestimating the low preindustrial temperatures and the amplitude of the low-frequency variability.

While all diagnostics are infected by a substantial stochasticity and all methods underestimate preindustrial levels and the amplitude of the low-frequency variability, there are some other noticeable patterns. The stochasticity is largest for the indirect field method and is smallest for the indirect global method. As for the year-to-year correlations, the best and worst methods regarding the low-frequency correlations are the indirect global method and the indirect field method. The relative bias and relative amplitude, however, agree on a somewhat different pattern. With these diagnostics the indirect global and indirect field methods give the best results, followed by the CCA method.

We also see that for all of the reconstruction methods the bias and the low-frequency amplitude deteriorate when the strength of the variability in the field to be reconstructed increases while both the year-to-year correlations and the low-frequency correlations are relatively unaffected by this parameter, perhaps except for the indirect field method.

Figure 2 suggests that the skill of the reconstructions obtained with the different methods is related; that is, if one reconstruction method performs well on a specific realization the other reconstruction methods will also perform well on that realization. Figure 5 shows the relative bias and the low-frequency correlation for 40 different realizations, and the correlations between the 100 realizations are given in Tables 3 and 4 for 112 and 22 proxies. There is a rather strong relation between all of the reconstruction methods, except for the indirect field method both when the relative bias and the low-frequency correlation are considered. The relations between the six methods (excluding the indirect field method) seem to be tightest for 22 proxies.

## 5. Sensitivity to the number and types of proxies

The results in the previous section were obtained with 57 pseudoproxies constructed from the local temperatures by adding white noise. The amplitude of the white noise was chosen so that the correlation coefficient between the local temperature and the proxy was the same as in Mann et al. (1998). The 57 proxies correspond to the number of proxies reaching back to year 1600 in Mann et al. (1998). In this section, we will study how the quality of the reconstructions depends on the number of proxies and of the type of noise included in the proxies.

The white noise used in the previous section is obviously a simplification, as demonstrated in Fig. 6. For the 112 proxies in Mann et al. (1998), reaching back to 1820, we have calculated the residuals by removing from the proxies the part linearly explained by the local temperature (see section 3c). In the top panel of Fig. 6 the histogram of the autocorrelations at lag 1 for these 112 residuals is shown. For comparison, the bottom panel in Fig. 6 shows the similar histogram of 112 random white noise series. It is obvious that the residuals obtained from the Mann et al. (1998) data have serial correlations and that they are not very well represented by white noise. To investigate the effect of the serial correlated noise we have repeated the reconstructions with pseudoproxies generated as in the previous section, but now with the noise obtained by phase randomizing the residuals from Mann et al. (1998). Thus, the new pseudoproxies have the same correlation with the local temperature as the proxies in Mann et al. (1998) (as described in the section 3c), but now, in addition, the noise has the same spectral form as the residuals in Mann et al. (1998). We have also repeated the reconstructions with perfect pseudoproxies, that is, pseudoproxies taken as the local temperatures without the addition of noise and with pseudoproxies taken as white noise without any relation to the local temperatures.

Results from reconstructions with 112, 57, and 22 proxies are shown in Fig. 7. The proxies are based on local temperatures with a white noise term. We find that all of the diagnostics of the reconstruction quality depend rather strongly on the number of proxies. For the year-to-year and low-frequency correlations, the indirect field and the RegEM TTLS methods show the largest deterioration, while the RegEM Ridge method shows the smallest. Also, the bias and the low-frequency amplitude in general deteriorate for all methods when the number of proxies decreases. The largest deterioration in these diagnostics is seen in the direct field method where the average bias grows from 20% for 112 proxies to more than 50% for 22 proxies. Least sensitive to the number of proxies considering bias and low-frequency amplitude is the indirect global method, for which the relative bias is 12% for 112 proxies and 18% for 22 proxies.

Results from reconstructions with the different types of proxies (57) are shown in Fig. 8. With white noise pseudoproxies without any relation to the local temperatures the average of all four diagnostics of the reconstruction skills vanishes for all methods. However, note the large spread between the different realizations, in particular regarding the low-frequency correlations. With perfect proxies, that is, proxies that are equal to the local temperature, the reconstructions are very skillful with correlations near one, small biases, and little spread between different realizations. The only methods that stand out are the two RegEM versions, which show an average relative bias near −0.3 and an underestimation of the low-frequency amplitude. This seems to be due to the applied stopping criterion, which will be discussed in more detail in section 9.

Compared to the reconstructions based on white noise proxies, the reconstructions based on proxies with a more realistic noise term perform worse when the low-frequency correlations are considered, but they are almost as good when year-to-year correlations and bias are considered.

## 6. Reconstructing a trend

We have seen that all of the reconstruction methods have difficulties in reproducing the correct preindustrial mean temperature (measured as the relative bias). When the methods are calibrated in the last 100 yr, where the secular trends are considerable, the generally lower temperatures in the first 300 yr are strongly underestimated. In this section we will investigate how the different methods reconstruct a trend. To this end we will use the same sets of surrogate temperature fields as in Figs. 2 –4, but now we will use the first 100 yr as calibration period and reconstruct the last 400 yr.

Figure 9 shows the reconstructions of three different surrogate fields, the same as in Fig. 2, when calibrating over the first 100 yr and using 57 white noise proxies. The reconstructions follow the target well in the first 250 yr of the reconstruction period (1600–1850) but start to diverge from the target at the end of the nineteenth century when the trend sets in. All of the reconstructions in Fig. 9 underestimate the trend.

The results from all 100 realizations are summarized in Fig. 10. The left panel showing the low-frequency correlation over the reconstruction period 1600–1999 can be directly compared to the upper right panel of Fig. 4. We find that the reconstruction of the low-frequency correlations is rather indifferent to the calibration period. The same is the case for the year-to-year correlations (not shown).

The right panel in Fig. 10 shows the skill of the reconstructed trend, calculated as the relative difference (*τ*_{rec} − *τ*_{tar})/*τ*_{tar}, where *τ*_{rec} and *τ*_{tar} are the trends of the reconstruction and target, respectively, over the last 150 yr. We observe that all methods on average seriously underestimate the trend, although there are large differences among the methods. The indirect field and the indirect global methods show the smallest underestimation. It is interesting that the average underestimation is much smaller for the realizations with stronger variability than for realizations with weaker variability. This is the opposite of what was found for the relative bias in section 4. Again, we emphasize the large amount of stochasticity: for all methods we can find realizations whose trends are well reconstructed and realizations whose trends are not reconstructed at all.

## 7. Can validation on an independent dataset assess reconstruction skills?

It is often recommended as sound statistical practice to calibrate and validate the model on different subsets of the data. However, in many applications the data are too scarce to make such a strategy possible. For climate reconstructions the available data for both calibration and validation cover around 100 yr. In Mann et al. (1998), some attempt of validation was performed by utilizing additional 50 yr of data from a geographically sparse subregion. In this section we briefly address the question weather the calibration/validation approach is fruitful under such data limitations.

To this end we consider how the skill of the reconstructions in the first 375 yr of the reconstructed period (1500–1874) relates to the skill in the last 25 yr of the reconstruction period (1875–1899). The calibration period is chosen to be 1900–99, as before. This differs somewhat from the approach in Mann et al. (1998) by validating the NH mean directly (and not temperatures in a subregion) and using a shorter validation period.

The results are demonstrated in Fig. 11 where we have used the year-to-year correlations, low-frequency correlations, and relative bias to gauge the skill in the reconstruction period. In the validation period we have chosen the year-to-year correlations and the relative bias. Results from 100 realizations reconstructed with the indirect field method are shown, but results from the other methods give conclusions.

We find that there is only a weak relation between the year-to-year correlation in the 25-yr-long validation period and the skills obtained over the rest of the 375-yr-long reconstruction period (Fig. 11, left panels). This weak effect seems to be largest when the skill in the reconstruction period is gauged by year-to-year correlation and weakest when it is gauged by the relative bias. We also find that this effect seemingly increases with a decreasing number of proxies.

A larger effect is found when the relative bias is used to gauge the skill in the validation period (Fig. 11, right panels). Although only a weak relation is found between year-to-year correlations and low-frequency correlations in the reconstruction period, a more substantial relation is found in the relative bias in the reconstruction period (lower-right panel).

This result introduces some optimism to the field: Some insight about the ability of a given reconstruction to reproduce previous temperature levels (above measured as relative bias) can be obtained by validation. It is necessary, however, that the level itself in the validation period, and not the year-to-year correlations, is used to gauge the skill in the calibration period.

## 8. Detrending and inflation

In previous sections we have used nondetrended proxies and observations in the calibration of the reconstruction methods. When calibrating in the last 100 yr, trends are surely present. Retaining these trends collides with the usual statistical practice of removing nonstationarities from the data. However, it could be argued (Wahl et al. 2006) that the trends include important information about the low-frequency relations between proxies and observations that are not present in the detrended series. Inflation is a method to repair the variance suppression that often occurs in regression methods (see section 3d). Both the effects of detrending and inflation have been debated in the context of climate reconstructions and we therefore briefly investigate how the two procedures influence the different reconstruction methods.

Figure 12 shows the reconstructions of the same realization with the four different combinations of no detrending/detrending and no inflation/inflation. Figure 13 summarizes the effects on the low-frequency correlation and the relative bias when the reconstruction methods are applied to an ensemble of 100 different temperature fields. Inflation was not implemented for the RegEM methods.

It is obvious from Figs. 12 and 13 that all methods perform substantially worse when the trend is removed before the calibration. The effect is particularly strong when the relative bias is considered. Although this is true for all methods, the CCA method stands out because it looses all ability to capture the preindustrial level, and almost all ability to reproduce the low-frequency correlations. The reason is that implementation of the detrending in the CCA method (see section 2c) not only removes the effects of the trends in the regression coefficients (as is the case for all the methods), but also effectively removes the trend from the predictors.

The effect of including inflation is smaller than the effect of removing the trends. The effect shows no obvious pattern among the methods, although it is more often detrimental than advantageous. By definition the inflation has no effect on the correlations of the global methods where the inflation reduces to a simple scaling of the reconstructed NH-mean temperature.

## 9. Sensitivity to free parameters

All of the reconstruction methods include free parameters that have to be chosen a priori. For the regression methods the significance level *p*, with which the predictors have to differ from random noise to not be rejected in the screening procedure, is such a parameter (section 2e). For the CCA method, the number of coupled patterns included has to be prescribed, which in our implementation is determined by the fraction of explained variance retained in the PC decompositions of the proxies and observations. For the RegEM methods the stopping criterion must be chosen. While this is the only free parameter for the RegEM Ridge (as long as the regularization parameter *h* is determined by generalized cross validation), RegEM TTLS includes two truncation parameters.

In the previous sections results were presented based on a single choice of these parameters. In this section we consider the robustness of the results to other choices.

We first consider the dependence of the direct field reconstructions on the parameter *p* (determining the threshold significance of nonrejected predictors). For smaller values of this parameter, proxies will need a stronger relation to the PCs of the observations, to be included in the regression. Figure 14 shows the relative bias, low-frequency correlation, and number of predictors for the leading PC as function of *p*. In this example we have 122 proxies and 100 yr in the calibration period, which makes the covariance matrix singular unless some screening is done. For *p* close to 1 we select almost 100 proxies and the effect of the ill conditioning is seen as bad reconstruction skill with a large spread among the 10 realizations. The best skills and smallest spread are found for values of *p* around 0.2, where 20–40 predictors are selected. For smaller values of *p* the reconstructions deteriorate as less predictors are included, although this effect is relatively small. We note that in reconstruction efforts like the present some screening of the predictors was already implemented when the selection of proxies was performed. This suggests that the value of *p* should not be too small. From Fig. 14 we conclude that the value *p* = 0.1, chosen in the previous sections, is close to optimal.

For the CCA method the free parameter is the fraction of explained variance retained in the PC decompositions. Previously we have chosen this value to be 0.99. Figure 15 shows the relative bias and the number of patterns included in the reconstructions as a function of the fraction of retained variance. Both the ensemble mean and the spread decrease with increased retained variance, except for the largest values of 0.995 and 0.999, where the spread begins to increase. Again, our original choice is close to optimal. For this choice we include around 40 patterns in the reconstructions. Note, that we have retained the same fraction of variance in both PC decompositions. Allowing different thresholds for the observations and proxies would allow further calibration, which is not attempted here.

For the RegEM methods it is necessary to determine when the iterations have converged. Rutherford et al. (2003, their appendix) use a criterion where the process is stopped when the square root of the ratio of the sum of squared change to the sum of squares (summed over all *n*_{rec} × *p*_{prox} reconstructed values) is less than *ε* = 0.005. This was also the value used previously in the present study. In Figs. 16b,c, the relative bias and the low-frequency correlations are shown as function of *ε* for RegEM Ridge. While the shape of the reconstructions converge reasonably fast and has saturated at *ε* = 0.005, we find a very slow convergence for the relative bias. The relative bias has not saturated at *ε* = 0.005, but continues to improve until *ε* = 0.0001. These results are illustrated in Fig. 16a, where reconstructions are shown for different values of *ε*. Similar slow convergence is observed for RegEM TTLS. The results reported for the RegEM methods in previous sections regarding the bias, may therefore be somewhat improved by decreasing *ε*. However, for *ε* = 0.0001 several thousands iterations are needed in the already computationally expensive RegEM Ridge method, stretching computational resources to the limit. The truncation parameters *k* and *m* in RegEM TTLS were previously set to 13 and 8 according to a criterion determining these numbers as the number of components resolvable from noise. While this criterion is objective, the noise level is often hard to establish because of a near-parabolic decrease of the eigenvalues. We find that lower values of *k* give better results for day-to-day and low-frequency correlations while the bias and low-frequency amplitude are unchanged. Lower values of *m* deteriorate all the diagnostics while larger values of *m* slightly increase day-to-day and low-frequency correlations, again leaving bias and low-frequency amplitude unchanged.

## 10. Conclusions

We have applied seven different reconstruction methods to an ensemble of surface temperature fields. Because the fields are known in both the calibration and reconstruction period the skill of the reconstruction methods can be investigated. Emphasis has been given to how the methods reconstruct trends, preindustrial levels, and low-frequency variability. The diagnostics are summarized in Table 2. The validation focuses explicitly on the Northern Hemisphere mean, even for the methods that do reconstruct the field.

The ensemble of temperature fields was based on a climate model experiment with both natural and anthropogenic forcings (section 3a). The ensemble members were constructed to retain both the temporal and spatial structures, as well as the secular trend of the original climate experiment (section 3b). Loosely speaking, each member represents a possible model world as it could have been with different phases of the noise. This ensemble approach allows us to address the stochasticity of the reconstruction problem. Pseudoproxies are generated by degrading local temperatures (section 3c). For increased realism the numbers of proxies and their relation to the local temperature is obtained from Mann et al. (1998).

Note that our ensemble is a conservative one because it only changes the phase of the residuals (after removing the trend), not the underlying physics. The residuals are also stationary and, in particular, have the same structure in the calibration and the reconstruction periods. Real proxies do not necessarily respond only to local temperature, but we have found that the pseudoproxies do not underestimate the strength or geographical extent of the connections to the temperature field. These conservative choices make us confident that our results do not underestimate the properties of the reconstruction methods when applied to real data.

The seven different reconstruction methods (section 2; see also Table 1 for a summary) include both global methods, where the hemispheric-mean temperature is reconstructed directly, and field methods, where the temperature field is reconstructed and the hemispheric mean is calculated afterward. We also consider both direct methods, where the temperatures are expressed as linear combinations of the proxies, and indirect methods where the proxies are expressed as linear combinations of the temperatures. A CCA regression method and two versions of the RegEM method are also included.

Our main conclusions are as follows:

All reconstruction methods contain a large element of stochasticity, which is revealed as broad distributions of skills when the methods are applied to fields of the same temporal and spatial characteristics (section 4). This means that it is very difficult to draw conclusions from a single or a few realizations, which may help explain some of the recent debate where different authors have arrived at bewildering opposed conclusions (e.g., see the brief review in the introduction). For instance, although the distributions of the relative biases in Fig. 4 obtained with the different methods may be different, they have a large overlap and no inferences about the methods’ relative skill can be made from only one realization.

All methods have substantial skills, in particular, considering the reconstruction of the shape of the low-frequency variability. Nearly all bumps on centennial time scales are present in the reconstructions. Low-frequency correlations are typically around 0.7–0.9, although the exact value depends on the details, especially on the number of proxies. Year-to-year correlations are smaller (around 0.4–0.7) but almost always significantly different from zero, and their empirical distributions (Fig. 4) are well separated from the distribution obtained with totally random pseudoproxies (Fig. 8). Also, all methods correctly give no skill when applied to totally random pseudoproxies (Fig. 8).

All methods systematically underestimate the amplitude of the low-frequency variability (Fig. 4). All three relevant diagnostics—the relative bias, the low-frequency amplitude, and the trend—unanimously demonstrate this considerable underestimation. The low temperatures in the preindustrial period is typically underestimated with 20%–50%. The same or even larger magnitude of underestimation is seen when the recent trend is reconstructed (Fig. 10). This underestimation has been subject to fierce debate (von Storch et al. 2004; Wahl et al. 2006; von Storch et al. 2006a; Rahmstorf 2006; von Storch et al. 2006b; Mann et al. 2007a). We think that at least some of the confusion may be due to the difficulty to draw conclusions from a single or a few realizations. It is interesting that the skill in the reconstruction of the trend increases with the strength of the variability in the surrogate temperatures, while the opposite is seen for the skill in reconstructing the preindustrial temperature level.

One very important question, which to our knowledge has not been addressed in any previous work, is the efficacy of validating the reconstructions on independent data, that is, data not used for calibration. To this end we have compared the diagnostics over the reconstruction period with diagnostics obtained over a validation period of 25 yr (comparable to what would be available for real-life reconstructions). An overall conclusion (section 7, Fig. 11) is that there is no substantial relationship between the skills in the reconstruction period and the year-to-year correlation coefficient estimated from the 25-yr validation period. On the other hand, our results indicate that it is possible to obtain some information about the ability of a reconstruction to capture the preindustrial level from the level of the reconstruction in the 25-yr validation period.

It is not possible to completely rank the climate reconstruction methods. Their relative behavior depends on the diagnostics, the number of proxies, and the strength of the variability in the surrogate temperature field. However, the indirect global method stands out as a best choice if only the NH-mean temperature is needed. Furthermore, it appears that the performance of the different methods is to some extent interdependent: if one method has a good performance for one realization of our ensemble data there is a tendency for the other methods to perform well on the same realization (Fig. 5). A similar conclusion was reached by Rutherford et al. (2005), Mann et al. (2005), and Bürger (2007). Only small differences were found between the two versions of the RegEM method regarding bias and low-frequency amplitude at odds with the claims of Mann et al. (2007b).

Reconstructions are most skillful with trends included in the calibration data (section 8). This is in agreement with Rutherford et al. (2003) and also with Bürger (2007), who found confidence intervals for the verification diagnostics based on a bootstrap technique. However, our results are at odds with his conclusion that this constitutes an overestimation of the skills. While including trends will offend some statisticians, theoretical justifications can be given (Wahl et al. 2006; see also the discussion in von Storch et al. 2006a). No improvement is found by applying variance inflation.

The relative bias of the reconstructions depends both on the number of proxies and on the strength of the variability. However, the shape of the reconstruction depends moderately on the strength of the variability and strongly on the number of proxies (section 5). There is no large impact of using proxies with realistic noise compared to proxies with white noise (Fig. 8).

Comparing the two indirect and the two direct methods shows that in general the indirect methods gives the best results. This can be understood when the characteristics of the two regression methods are considered. The indirect method assumes that the errors in the proxies dominate over the errors in the temperatures, while the direct method assumes the opposite (van Huffel and Vandewalle 1991). It is not unrealistic for real-world reconstructions that the errors in the proxies are larger than the errors in the temperatures (Hegerl et al. 2007). We also find that reconstructing the field does not give any advantage compared to global reconstructions when the interest is on the NH mean.

The underestimation of the amplitude of the low-frequency variability demonstrated for all of the seven methods discourage the use of reconstructions to estimate the rareness of the recent warming. That this underestimation is found for all the reconstruction methods is rather depressing and strongly suggests that this point should be investigated further before any real improvements in the reconstruction methods can be made. Until then, smaller improvements may be possible by obtaining more and better proxies. To end on a positive note we emphasize that the reconstruction methods’ abilities to reconstruct the shape of the centennial variations are promising for studies on the relative impacts of different climate forcings.

## Acknowledgments

This work was supported by the Danish Climate Centre at the Danish Meteorological Institute. The authors thank Martin Stendel for making available the data from the ECHAM-OPYC3 climate simulation. Proxies and temperatures from Mann et al. (1998) were downloaded from http://www.nature.com/nature/journal/v430/n6995/suppinfo/nature02478.html.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

## Footnotes

*Corresponding author address:* B. Christiansen, Danish Meteorological Institute, Danish Climate Centre, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark. Email: boc@dmi.dk

^{1}

The effects of adding more predictors or having an ill-posed problem can be seen as follows (see, e.g., Sen and Srivastava 1990). Let us consider the model *y* = *a*_{1}*x*_{1} + *a*_{2}*x*_{2} … *a _{n}x_{n}* +

*ε*=

**xa**+

*ε*. The least squares estimate of

**a**is

**= (𝗫**

*α*^{T}𝗫)

^{−1}𝗫

^{T}

**y**. For a future observation

**x**

_{new}the true values of

*y*is

*y*

_{tru}=

**x**

_{new}

**a**+

*ε*and the prediction is

*y*

_{pre}=

**x**

_{new}

**. The variance of the prediction error becomes =**

*α**σ*

^{2}[1 +

**x**

_{new}

^{T}(𝗫

^{T}𝗫)

^{−1}

**x**

_{new}]. Here

*σ*

^{2}is the variance of the residuals and the last term on the right-hand side is a positive semidefinite quadratic form. It is obvious that it can cause problems if the covariance matrix 𝗫

^{T}𝗫 becomes (close to) singular and the new observation

**x**

_{new}has a finite projection on the (nearly) singular directions. However, even if the covariance matrix is readily inverted, adding additional predictors may increase the . There are two competing effects. Adding a new predictor will decrease, or at least not increase, the variance of the residuals,

*σ*

^{2}, which is good. However, it can be shown that the value of the quadratic form increases when new variables are added, which is bad. Therefore, if the new predictor does not explain enough of the predictands’ variance, the second effect dominates and the variance of the prediction error grows.

^{2}

It is easily seen that the phase-randomizing procedure retains lagged covariances and cross covariances. Consider two time series: *A*(*t*) and *B*(*t*). Let *A*(*t*) = ∑* _{k} a_{k}* exp(

*i*2

*πkt*/

*T*) and

*B*(

*t*) = ∑

*exp(*

_{k}b_{k}*i*2

*πkt*/

*T*) be their Fourier expansions. New surrogate series are then generated as

*A*

_{new}(

*t*) = ∑

*exp[*

_{k}a_{k}*i*(2

*πkt*/

*T*+ Θ

*)] and*

_{k}*B*

_{new}(

*t*) = ∑

*exp[*

_{k}b_{k}*i*(2

*πkt*/

*T*+ Θ

*)]. Here Θ*

_{k}*= −Θ*

_{k}_{−k}is uniformly distributed in [0, 2

*π*]. It is easy to see that the only terms that contribute to ∫

*A*

_{new}(

*t*)

*B*

_{new}(

*t*+

*τ*)

*dt*= ∑

_{k,j}

*a*exp[

_{k}b_{j}*i*(Θ

_{k}+ Θ

_{j})] exp(

*i*2

*πkτ*/

*T*) ∫ exp[

*i*2

*π*(

*k*+

*j*)

*t*/

*T*]

*dt*are those with

*k*= −

*j*. Therefore, (as Θ

_{k}= −Θ

_{−k}) ∫

*A*

_{new}(

*t*)

*B*

_{new}(

*t*+

*τ*)

*dt*= ∑

_{k}

*a*

_{k}b_{−k}exp(

*i*2

*πkτ*/

*T*) = ∫

*A*(

*t*)

*B*(

*t*+

*τ*)

*dt*.