## Abstract

A new multiproxy reconstruction of the Northern Hemisphere extratropical mean temperature over the last millennium is presented. The reconstruction is performed with a novel method designed to avoid the underestimation of low-frequency variability that has been a general problem for regression-based reconstruction methods. The disadvantage of this method is an exaggerated high-frequency variability. The reconstruction is based on a set of 40 proxies of annual to decadal resolution that have been shown to relate to the local temperature. The new reconstruction shows a very cold Little Ice Age centered around the 17th century, with a cold extremum (for 50-yr smoothing) of about 1.1 K below the temperature of the calibration period, AD 1880–1960. This cooling is about twice as large as the corresponding numbers reported by most other reconstructions. At the beginning of the millennium the new reconstruction shows small anomalies, in agreement with previous studies. However, over the first 600 years of the millennium the new temperature reconstruction decreases faster than previous reconstructions and shows stronger variability. The salient features of the new reconstruction are shown to be robust to changes in the calibration period, the source of the local temperatures, the spatial averaging procedure, and the screening process applied to the proxies. An ensemble pseudoproxy approach is applied to estimate the confidence intervals of the 50-yr smoothed reconstruction, showing that the period AD 1500–1850 is significantly colder than the calibration period.

## 1. Introduction

A number of quantitative multiproxy reconstructions of global or Northern Hemispheric temperatures for the past 1000–2000 years, based on different paleotemperature records, have been published during the last 15 years (e.g., Briffa 2000; Cook et al. 2004; Crowley and Lowery 2000; D’Arrigo et al. 2006; Esper et al. 2002; Hegerl et al. 2007; Jones et al. 1998; Jones and Mann 2004; Juckes et al. 2007; Ljungqvist 2010; Loehle 2007; Mann et al. 1999, 2008, 2009; Mann and Jones 2003; Moberg et al. 2005; Osborn and Briffa 2006). They have received considerable attention in the ongoing global warming debate, as they have been used to place the observed twentieth-century warming in a long-term perspective. All reconstructions agree that the temperatures during the last two decades are likely unprecedented in a 1000–2000-yr context, but they differ considerably in low-frequency variability and in the amplitude of warming and cooling associated with the Medieval Warm Period (Bradley et al. 2003; Broecker 2001; Esper and Frank 2009; Hughes and Diaz 1994; Lamb 1977) and the Little Ice Age (Lamb 1977; Matthews and Briffa 2005; Wanner et al. 2008), respectively. The amplitude of the multidecadal to centennial preindustrial temperature variability ranges from ~0.2 to 1 K in the different reconstructions (Esper et al. 2005a,b; Jansen et al. 2007).

All reconstruction methods rely on variations of linear regression, whereby the relationship between proxies and temperatures is statistically estimated in a calibration period when both datasets are available. The reconstruction methods range from global methods targeting the Northern Hemisphere (NH) mean temperature directly to field methods that reconstruct the temperature field before the spatial average is calculated. Global methods include simple composite-plus-scale methods, where the proxies are first combined into a composite proxy, which is then related to the NH mean temperature (Bradley and Jones 1993; Moberg et al. 2005; Juckes et al. 2007; Hegerl et al. 2007; Ljungqvist 2010). Field methods include methods based on empirical orthogonal functions (Mann et al. 1998) and the iterative regularized expectation maximization (RegEM) method (Dempster et al. 1977; Schneider 2001; Mann and Rutherford 2002; Smerdon and Kaplan 2007). Reconstruction methods often require regularization, such as screening of variables, ridge regression, or truncated total least squares regression, if the number of samples in the calibration period is of the same order as, or smaller than, the number of independent variables (e.g., Christiansen et al. 2009).

There has been growing evidence that these reconstruction methods underestimate the low-frequency variability (von Storch et al. 2004; Bürger and Cubasch 2006; Zorita et al. 2007; Smerdon and Kaplan 2007; Riedwyl et al. 2009; Christiansen et al. 2009; Tingley and Huybers 2010b). This understanding mainly comes from pseudoproxy experiments in which the reconstruction methods are applied to temperature fields from climate models. Christiansen et al. (2009) compared seven different reconstruction methods and found that all underestimated low-frequency variability and trends by 20%–50%. It is important to understand that this underestimation is a systematic bias and not included in the confidence intervals often provided with the reconstructions. The systematic bias will also not be revealed by embedding the reconstruction method in a Bayesian approach (Tingley and Huybers 2010a; Li et al. 2010). It is obvious that such a systematic bias undermines the conclusions drawn from both individual reconstructions and studies of compilations of reconstructions concerning the amplitude of preindustrial temperature variability. Based on theoretical consideration, Christiansen (2011a) suggested a new reconstruction method, called LOC, designed to preserve the low-frequency variability. This new method performed satisfactorily in a series of realistic pseudoproxy experiments where other methods have been shown to perform poorly.

The new method, LOC, reconstructs the local temperatures at the locations of the proxies. The reconstruction is based on one-dimensional linear regression with the local temperature as the independent variable and the proxy as the dependent variable. The local reconstructed temperatures are then averaged to get the NH mean temperature. While this method preserves low-frequency variability, the drawback is an exaggerated high-frequency variability. This adverse effect can be dampened by temporal averaging and decreases with the number of (independent) proxies. As the performance of the LOC method can be negatively affected by proxies with little or no correlation to local temperature, only proxies where this correlation is statistically significant (as estimated, e.g., with a *t* test) should be included. See Tingley and Li (2011, manuscript submitted to *J. Climate*) and Christiansen (2011b, manuscript submitted to *J. Climate*) for a detailed discussion of these points.

Christiansen (2011a) also included a real-world reconstruction of the Northern Hemisphere in the period AD 1505–1960 with the LOC method based on the proxies from Hegerl et al. (2007). This reconstruction showed a much larger variability than previous reconstructions, reaching decadally smoothed temperatures as low as −1.5 K (relative to AD 1880–1960) in the 17th century. However, the reconstructions based on this set of decadally smoothed proxies are not very robust, basically because of the limited number of independent points in the calibration period. In this paper we apply the new LOC method to a new and much larger compilation of proxies to produce a reconstruction of the NH extratropical (>30°N) mean temperature reaching back to AD 1000.

Our focus is primarily set on the magnitude of the multidecadal to centennial temperature variability in the Northern Hemisphere during the last millennium and in particular on the amplitude of the cooling during the Little Ice Age. The degree of cooling during this period, and especially that of the 17th century, has been a key issue in the discussion of the ability of quantitative temperature reconstructions to fully capture low-frequency trends (Ammann and Wahl 2007; Crowley 2000; Datsenko and Sonechkin 2008; Esper et al. 2005a,b; González-Rouco et al. 2009; Jones et al. 2001; Lee et al. 2008; von Storch et al. 2004, 2009).

## 2. Proxy data

Knowledge of temperature variability prior to meteorological measurements of temperature is dependent on proxy data sensitive to temperature variations. Apart from historical records, such information can be extracted from various natural archives of climate variability. The most common forms of paleotemperature proxy data are lake and marine sediments, ice cores, speleothems, tree-ring width and density, corals, and fossil pollen, but many other archives can be used as well (for reviews, see Bradley 1999; Jansen et al. 2007; Jones et al. 2009; National Research Council 2006). A serious obstacle to all attempts at large-scale temperature reconstructions continues to be the limited and unevenly distributed number of quantitative paleotemperature records extending back a millennium or longer, although the number of available records has been increasing in recent years (Ljungqvist 2009). Paleotemperature proxy records included in multiproxy temperature reconstructions should preferably be as accurately dated as possible, have a reasonably high sample resolution, and a good correlation to the local instrumental temperature record in the calibration period (see the discussion in Jones et al. 2009).

We have compiled a set of 40 proxies located in the extratropical NH fulfilling these requirements. The proxy locations, types, sample resolutions, response seasons, and original references are listed in Table 1. Figure 1 shows the geographical distribution of the proxies. The proxies are relatively well distributed, although they are concentrated over land and some clustering is found over eastern Eurasia and Greenland. The sample resolutions vary from annual to decadal; 21 proxies have annual sampling, 17 decadal sampling, and 2 annual-to-decadal sampling. We have linearly interpolated proxies of annual-to-decadal and decadal sampling to annual values for numerical purposes. The only exception is China historic (Number 7 in Table 1), which was originally interpolated smoothly to annual resolution. This strategy relies on the realistic assumption that the sample resolution reflects the temporal support of the proxies. See section 8 for an assessment of the effect of the interpolation of the proxies and the alternative strategy of subsampling the local temperatures to reflect the sampling of the proxies.
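The interpolation step can be sketched as follows (an illustrative snippet, not the authors' code; `np.interp` performs the piecewise-linear interpolation onto an annual grid):

```python
import numpy as np

def to_annual(sample_years, values, start, end):
    """Linearly interpolate an irregularly (e.g., decadally) sampled
    proxy onto an annual grid covering start..end inclusive."""
    annual_years = np.arange(start, end + 1)
    return annual_years, np.interp(annual_years, sample_years, values)

# A decadally sampled proxy interpolated to annual resolution.
years, annual = to_annual(np.array([1000, 1010, 1020, 1030]),
                          np.array([0.2, -0.1, 0.4, 0.0]), 1000, 1030)
```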

Almost all proxies cover the period AD 1000–1975. The exceptions are Crête (9), Devon Ice Cap (10), Lake of the Clouds (25), and Lower Murray Lake (26), which end in AD 1973, 1973, 1965, and 1970, respectively. The annually resolved Iceberg Lake (21) has 10 missing years in the interior of the period. These missing values have been filled by extrapolation or interpolation. The time series of the 40 proxies are shown in Fig. 2. The signs of the proxies are chosen so that positive proxy anomalies should correspond to warm temperature anomalies. This choice is based on the understanding of the interactions between the different proxies and temperature. While a few of the proxies show the famous hockey stick form [e.g., Yamal (39) and Big Round Lake (4)], others have more erratic behavior [e.g., Donard Lake (11) and the Greenland Ice Sheet Project (GISP2) (16)]. However, we find that 12 proxies have their largest 50-yr mean anomalies somewhere in the period AD 1900–75. The corresponding number is 6 for AD 1000–75 and 0 for AD 1600–75. Note, however, that it is very likely that the peak medieval warming occurred as early as the mid-to-late 10th century (Esper et al. 2002; Ljungqvist 2009, 2010). As our proxies begin in AD 1000 we therefore cannot assess the existence or strength of the Medieval Warm Period.

The individual proxies as well as the different proxy types used here exhibit different levels of correlation to the local instrumental temperature record. Paleotemperature proxy records based on lake and marine sediments, fossil pollen, and speleothems primarily have their strength at multidecadal and longer time scales that cannot be well calibrated to the relatively short instrumental record (Moberg et al. 2008). Those records are also the ones with the generally lowest correlation to the instrumental record. Proxies based on historical records, *δ*^{18}O ice-core records, and tree-ring width or maximum density records usually have a much stronger temperature correlation. Greenland and Canadian Arctic Archipelago *δ*^{18}O ice-core records generally have a rather high correlation to regional annual mean temperatures in the calibration period but seem to be dominated by a winter temperature signal (Vinther et al. 2010). They also contain some dating uncertainties increasing farther back in time and their low-frequency variability is possibly underestimated (Johnsen et al. 2001; Kobashi 2007; Masson-Delmotte et al. 2006). The Chinese historical records, based, for example, on observations of ice breakup dates, show a strong correlation to temperature on decadal time scales. They are usually accurately dated but may be based on fewer observations back in time (Ge et al. 2003). Tree-ring records are precisely dated and annually resolved, leading to a generally good correlation to local temperature in the calibration period. Modern dendroclimatology methods enable the preservation of most of both the high- and low-frequency variability in millennium-long records (Linderholm et al. 2010). 
Tree-ring maximum density records have a much stronger correlation to annual temperature than tree-ring width records since they respond to the whole growing season, whereas tree-ring width records primarily reflect the conditions of the warmest month (usually July for continental sites) (Tuovinen et al. 2009).

Some proxy records are not fully independent from each other. The Dulan (12) tree-ring width record (Zhang et al. 2003) and the China historic (7) record (Yang et al. 2002) are among the proxies included in the composite China stack (8) record (Yang et al. 2002). The GISP2 (16) ice-core record (Kobashi et al. 2010) and the Greenland Ice Core Project (GRIP) (17) ice-core record (Johnsen et al. 2001) are both drilled from the Greenland inland ice sheet very close to each other.

The Southern Colorado Plateau (31) tree-ring width record (Salzer and Kipfmueller 2005) comes from a high-elevation site in a generally semiarid region and may be considerably influenced by drought/precipitation as well as temperature. Many tree-ring width records from the North American southwest have been suspected to not always show a linear response to warming in cases when a warmer climate also reduces the availability of water (D’Arrigo et al. 2006; Loehle 2009). One could also argue that the same environmental conditions, to some extent, apply to the tree-ring width record from Mongolia (27; D’Arrigo et al. 2001).

In the present work we relate the proxies to the local annual mean temperature derived from gridded datasets (section 3). In agreement with the previous studies discussed above, we find (see section 5) that tree-ring maximum density records, historical records, and *δ*^{18}O ice-core records usually show a higher correlation to annual mean temperature than other types of proxies. Exceptions to this rule are the Canadian Rockies (5) tree-ring maximum density record and the China historic (7) record. We find that tree-ring width records show a very wide range of correlations, with some records showing a good response to annual mean temperature [e.g., Dulan (12), Indigirka (22), Jämtland (23), Southern Colorado Plateau (31), Taimyr (35), Yamal (39)] and others a basically nonexistent response [e.g., Eastern Carpathians (15), Mongolia (27), Tien Shen (36)]. Lake and marine sediments and stalagmites in most cases exhibit a weak correlation. The exceptions are Big Round Lake (4) and Hallet Lake (18). The two pollen records that we include in the study [Lake of the Clouds (25) and Usvyatskii Mokh (38)], however, show a good correlation to annual mean temperature, although it should be noted that they are calibrated to decadally smoothed temperature data.

## 3. Instrumental temperature data

We calibrate the proxies against annual mean surface temperatures obtained from gridded datasets, either the HadCRUT2v (Jones et al. 1999) or Goddard Institute for Space Studies Surface Temperature Analysis (GISTEMP) (Hansen et al. 1999). The HadCRUT2v dataset is of monthly resolution, extends for the period AD 1870–2005, and is defined on a 72 × 36 global longitude–latitude grid. For grid points where data exist for at least one month, missing monthly data are filled with inverse distance interpolation. The annual means are then obtained to give the annually resolved temperature field that will be used for the calibration. The GISTEMP dataset was obtained for the period AD 1880–2005 as annual means on a 180 × 90 global longitude–latitude grid. For grid points where data exist for at least one year, missing annual data are filled with inverse distance interpolation.
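The gap filling can be illustrated with a simple inverse-distance-weighting routine (a sketch only; great-circle distances and a power-law weight are assumptions here, as the exact weighting used in the preprocessing is not specified in the text):

```python
import numpy as np

def idw_fill(values, has_data, lats, lons, power=2.0):
    """Fill missing grid-cell values by inverse-distance weighting of the
    cells that have data; distances are great-circle central angles."""
    lat, lon = np.radians(lats), np.radians(lons)
    filled = values.astype(float).copy()
    good = np.where(has_data)[0]
    for i in np.where(~has_data)[0]:
        # central angle between cell i and every cell with data
        d = np.arccos(np.clip(
            np.sin(lat[i]) * np.sin(lat[good]) +
            np.cos(lat[i]) * np.cos(lat[good]) *
            np.cos(lon[i] - lon[good]), -1.0, 1.0))
        w = 1.0 / np.maximum(d, 1e-12) ** power
        filled[i] = np.sum(w * values[good]) / np.sum(w)
    return filled

# Three equatorial cells; the middle one is missing and receives the
# distance-weighted (here symmetric) mean of its neighbors.
filled = idw_fill(np.array([1.0, 0.0, 3.0]),
                  np.array([True, False, True]),
                  np.array([0.0, 0.0, 0.0]),
                  np.array([0.0, 10.0, 20.0]))
```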

For the LOC method considered in this paper we need local annual mean temperatures at the locations of the proxies. These local temperatures are then obtained by locating the nearest grid cell in the annually resolved temperature datasets.
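For a regular latitude-longitude grid the nearest-cell lookup reduces to rounding (illustrative only; `lat0`/`lon0` denote the first cell center, and the 5° spacing matches a HadCRUT2v-like grid):

```python
def nearest_cell_index(lat, lon, lat0, lon0, dlat, dlon, nlon):
    """Indices of the nearest cell center on a regular lat-lon grid whose
    first cell center is (lat0, lon0) with spacings dlat, dlon."""
    i = int(round((lat - lat0) / dlat))
    j = int(round(((lon - lon0) % 360.0) / dlon)) % nlon
    return i, j

# HadCRUT2v-like 5 degree grid: 72 x 36 cells, first center (-87.5, -177.5).
cell = nearest_cell_index(61.0, 26.0, -87.5, -177.5, 5.0, 5.0, 72)
```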

As the data availability depends on both space and time, so will the amount of interpolation needed. For HadCRUT2v we have calculated the average number of months per year with data for each grid cell in the extratropical NH. The geographical distribution of this number is shown in the top panels of Fig. 3 for two 40-yr periods. While Europe and North America have good data coverage for both periods, the coverage is sparse for both Greenland and east Eurasia, in particular in the early period. For GISTEMP we have calculated the number of years with data in the period AD 1880–1960 (Fig. 3, lower panel). The GISTEMP dataset has complete coverage except for the high Arctic and east Eurasia. Despite these differences the annual extratropical NH mean temperatures obtained from the two datasets are very similar (Fig. 4). Only in the early period AD 1880–90 are considerable differences seen.

However, the local temperatures required for the LOC method show larger differences, as shown in Fig. 5. Here local temperatures are shown together with the proxies for the period AD 1880–1960. The temporal resolution of the proxies is either annual, annual to decadal, or decadal. To match the proxies the local temperatures have been low-pass filtered with a cutoff at 5 or 10 yr if the corresponding proxy is of annual-to-decadal or decadal resolution. Here and in the rest of the paper low-pass filtering is performed with a simple running average filter. In general the GISTEMP temperatures have a stronger variability than the HadCRUT2v temperatures. The GISTEMP temperatures are also somewhat more stationary with respect to the strength of variability than HadCRUT2v temperatures; at some locations for HadCRUT2v [e.g., Iceberg Lake (21)] the strength of the variability is weaker in the early years. Such suppression of variability is a typical consequence of interpolation.
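The running-average filter used throughout can be sketched as follows (one assumption of this sketch is that the averaging window is simply truncated near the series ends; the paper does not state its end treatment):

```python
import numpy as np

def running_mean(x, window):
    """Simple running-average low-pass filter; the window is truncated
    near the ends of the series (an assumption of this sketch)."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out[i] = x[lo:hi].mean()
    return out

smoothed = running_mean([1.0, 2.0, 3.0, 4.0, 5.0], 3)
```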

Correlations between the local temperatures and the proxies are shown for both datasets (see Fig. 7). Despite the differences in the local temperatures discussed above, the strength of these correlations is to a considerable degree shared between the two datasets. There is a weak tendency for HadCRUT2v to provide the largest correlations—in particular, China stack (8), Dulan (12), and Hallet Lake (18) stand out in this respect.

## 4. The LOC reconstruction method

The LOC reconstruction method was described in detail in Christiansen (2011a). Here we give a brief summary and refer to some older, related studies of calibration theory [see also Tingley and Li (2011, manuscript submitted to *J. Climate*) and Christiansen (2011b, manuscript submitted to *J. Climate*)].

As mentioned in the introduction, the LOC reconstruction method consists of three steps. First, the proxies are screened so that proxies without a significant relation to the local annual mean temperature are excluded. Second, the local temperatures at the locations of the remaining proxies are reconstructed using one-dimensional linear regression. It is important here that the local temperature is taken as the independent variable and the proxy as the dependent variable. Third, the local reconstructed temperatures are averaged to get the extratropical NH mean.
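The three steps can be sketched in a few lines (an illustrative sketch, not the authors' implementation; screening is done here with a simple correlation cutoff, and all series are assumed to be anomalies relative to the calibration period):

```python
import numpy as np

def loc_reconstruction(cal_temps, cal_proxies, full_proxies, r_crit=0.29):
    """Sketch of the three LOC steps. cal_temps and cal_proxies are
    (n_proxies, n_cal_years) arrays of anomalies over the calibration
    period; full_proxies is (n_proxies, n_years). r_crit ~ 0.29
    corresponds to p < 0.01 for roughly 80 annual samples."""
    local_recons = []
    for T, P, P_full in zip(cal_temps, cal_proxies, full_proxies):
        r = np.corrcoef(T, P)[0, 1]
        if r <= r_crit:                    # step 1: screen the proxies
            continue
        lam = T.dot(P) / T.dot(T)          # step 2: indirect regression P = lam*T
        local_recons.append(P_full / lam)  # local reconstruction T = P/lam
    return np.mean(local_recons, axis=0)   # step 3: simple spatial mean

# Synthetic example: two temperature-sensitive proxies and one that is
# anticorrelated with its local temperature (screened out).
rng = np.random.default_rng(0)
n_full, n_cal = 300, 80
truth = np.sin(np.linspace(0.0, 6.0 * np.pi, n_full))
truth -= truth[-n_cal:].mean()
full_proxies = np.array([
    2.0 * truth + 0.5 * rng.standard_normal(n_full),
    1.5 * truth + 0.5 * rng.standard_normal(n_full),
    -1.0 * truth + 0.5 * rng.standard_normal(n_full),
])
cal_temps = np.tile(truth[-n_cal:], (3, 1))
cal_proxies = full_proxies[:, -n_cal:]
recon = loc_reconstruction(cal_temps, cal_proxies, full_proxies)
```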

This method was suggested as a reconstruction method that should preserve the low-frequency variability. Christiansen (2011a) tested the LOC method in a series of realistic pseudoproxy experiments and showed that in these experiments low-frequency variability such as the preindustrial level, trends, and long-lasting warm periods were indeed well reconstructed.

The second step above relates a proxy *P* to the corresponding local temperature *T* by [Eq. (A2) in Christiansen 2011a]

*P* = *λT* + *ξ*,

where *ξ* is noise and both time series are taken as anomalies over the calibration period.

Having determined *λ* as a least squares estimate from the calibration data, temperatures can be reconstructed (or predicted) from a new proxy, *P*_{new}, as *T*_{new} = *P*_{new}/*λ*. This regression model is often known as indirect regression in the climate reconstruction literature, while the model with temperature as the dependent variable [Eq. (A1) in Christiansen 2011a] is known as direct regression.

The merits and differences of the two regression models have been studied in detail in the literature about calibration. Here indirect regression is known as classical calibration and direct regression as inverse calibration. A good review is provided by Osborne (1991).

It is obvious that a small value of *λ* will cause problems for classical calibration (indirect regression). Shukla and Datta (1985) therefore suggested that classical calibration (indirect regression) should be used only when *λ* is significantly different from zero. It is precisely this “truncated” classical calibration that LOC is based on. The screening of the proxies is reminiscent of the regularization used with other reconstruction methods. Shukla (1972) gives expansions of the mean-square error, bias, and variance of *T*_{new} estimated by both classical calibration (indirect regression) and inverse calibration (direct regression).

It follows from the discussion (Shukla 1972) that the inverse estimator (direct regression) is always biased, even for an infinite sample size, if *P* is different from zero. This bias grows fast as the predictor moves away from zero and soon dominates the mean square error. The classical estimator (indirect regression) is unbiased for large sample size, and the bias changes only slowly as the predictor moves away from zero.

Therefore, classical calibration (indirect regression) is often preferred when the predictor is not close to zero. For the reconstruction problem—where the bias is a severe problem as discussed in the introduction—it therefore is sensible to use (truncated) classical calibration (indirect regression). The increased variance can be dampened by spatial and temporal averaging as discussed in Christiansen (2011a,b, manuscript submitted to *J. Climate*) and in the older literature under the heading of repeated measurements.
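The contrast between the two estimators is easy to demonstrate with a synthetic example (a sketch assuming centered data and white proxy noise): direct regression compresses the reconstructed variance, while classical calibration preserves it at the cost of extra noise.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true, noise_sd, n = 0.8, 1.0, 5000

T = 2.0 * rng.standard_normal(n)                      # "true" temperatures, centered
P = lam_true * T + noise_sd * rng.standard_normal(n)  # proxy

# Classical calibration (indirect regression): fit P = lam*T, then invert.
lam_hat = T.dot(P) / T.dot(T)
T_classical = P / lam_hat

# Inverse calibration (direct regression): fit T = a*P directly.
a_hat = P.dot(T) / P.dot(P)
T_direct = a_hat * P

# T_direct has compressed variance (the systematic bias); T_classical
# preserves, indeed slightly inflates, the variance.
```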

We close this section by commenting on the use of nondetrended proxies and temperatures in the calibration process. The choice of using nondetrended time series is standard in most climate reconstruction studies, including the present one, although it conflicts with the usual statistical practice of removing nonstationarities from the data [for an example of modeling nonstationarities see Li et al. (2010)]. However, for the LOC method we find that the effect of detrending is minor (not shown). This is in contrast to what is found with many other reconstruction methods where the proxies are calibrated against a global temperature or the leading principal components of the temperature field. This is substantiated in Fig. 6, where the correlations between the proxies and the extratropical NH mean temperature are shown as a function of the correlations between the proxies and the local temperatures. When the trend is included, local correlations closely match the global correlations (Fig. 6, upper panel). However, removing the trend changes the global correlations drastically while smaller changes are found for the local correlations (Fig. 6, lower panel). This is related to the fact that the extratropical NH mean (Fig. 4) is much more dominated by the trend and the low-frequency variability than the local temperatures (Fig. 5). Thus, calibrating against local temperatures involves more degrees of freedom than calibrating against a global mean. In the former case a significance test therefore provides better protection against spurious correlations.
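This degrees-of-freedom argument can be illustrated with synthetic series (illustrative numbers only, not the paper's data): a proxy that tracks a local temperature with large interannual variability keeps most of its correlation after detrending, whereas its correlation with a trend-dominated hemispheric mean largely disappears.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
trend = np.linspace(0.0, 2.0, n)               # shared low-frequency trend
local = trend + rng.standard_normal(n)         # local T: trend + weather noise
proxy = local + 0.7 * rng.standard_normal(n)   # proxy follows the local T
nh = trend + 0.1 * rng.standard_normal(n)      # hemispheric mean: trend dominated

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def detrend(x):
    t = np.arange(len(x))
    return x - np.polyval(np.polyfit(t, x, 1), t)

r_local, r_nh = corr(proxy, local), corr(proxy, nh)
r_local_d = corr(detrend(proxy), detrend(local))
r_nh_d = corr(detrend(proxy), detrend(nh))
```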

## 5. The LOC reconstruction

In this section we present the reconstruction based on the HadCRUT2v dataset and compare it with previous reconstructions. A LOC reconstruction based on the GISTEMP dataset is included in section 6. As explained in the preceding section local temperatures have been low-pass filtered to match the proxies if these are of annual-to-decadal or decadal resolution.

The proxies have been chosen because they have a connection to the local temperature. However, this connection is often primarily to a specific season, and the proxies are not necessarily very sensitive to the annual mean temperatures. The correlations between the proxies and the local annual mean temperatures in the period AD 1880–1960 are shown in Fig. 7. The correlations obtained with HadCRUT2v are also shown in the titles above the panels in Fig. 5. Five locations [Donard Lake (11), Eastern Carpathians (15), Haukvatn (19), Hesheng (20), and Lower Murray Lake (26)] show weak negative correlations, while Sugan Lake (34) is a clear outlier with a strong negative correlation. The largest correlation is ~0.8, while a typical value is ~0.4.

To avoid including proxies with possibly spurious correlations to the local temperatures we remove proxies with negative correlations and apply a *t* test to each proxy. The resulting *p* values are also included in Fig. 5. For the HadCRUT2v dataset we find that 24 proxies (marked with an asterisk in Table 1) have *p* values less than 0.01, corresponding to a cutoff correlation of 0.29. However, we have assumed that the proxies and local temperatures are free of serial correlation. This is obviously not true—in particular for the proxies with coarser-than-annual sampling. The low threshold of 0.01 is chosen in light of the serial correlations. We will investigate the influence of the screening in detail in section 6. This will include a Monte Carlo test that takes into account the complete autocorrelation spectrum of the proxies and local temperatures.
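The quoted cutoff can be checked from the standard relation between a sample correlation and its *t* statistic (a sketch; the two-sided 1% critical value of roughly t = 2.64 for ~79 degrees of freedom is taken from standard tables):

```python
import numpy as np

def r_to_t(r, n):
    """t statistic of a sample correlation r based on n paired samples."""
    return r * np.sqrt((n - 2) / (1.0 - r**2))

def t_to_r(t, n):
    """Correlation giving t statistic t with n samples (inverse of r_to_t)."""
    return t / np.sqrt(n - 2 + t**2)

# With the 81 annual samples of AD 1880-1960 and a two-sided 1% critical
# value of roughly t = 2.64 for 79 degrees of freedom, the cutoff
# correlation comes out near the 0.29 quoted in the text.
r_crit = t_to_r(2.64, 81)
```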

The local reconstructions based on the HadCRUT2v dataset including the 24 proxies with a *p* value less than 0.01 are shown in Fig. 8. The reconstructions have been low-pass filtered with a 50-yr smoothing to suppress the exaggerated high-frequency variability inherent in the LOC method. The range of the temperature variations is typically within 1–5 K and seems acceptable. The only exception is Polar Urals (29) where temperature anomalies of up to 15 K are observed in the beginning of the millennium. Referring to Fig. 2 we see that this is due to large positive values of the proxy in this period combined with a relatively weak variability in the calibration period. Including this proxy in the LOC reconstruction results in a 1.5-K positive anomaly in the first 100–200 years of the millennium but has negligible influence on the rest of the period. We have therefore chosen to exclude the Polar Urals (29) when calculating the spatial mean.

From the local reconstructions the reconstruction of the extratropical NH mean temperature is obtained by simple averaging. More complicated averaging schemes give similar results for the low-frequency variability as demonstrated in section 6. The resulting LOC reconstruction of the extratropical NH is shown in Fig. 9 (blue curve). For the 50-yr low-pass filtered reconstruction the temperature anomaly generally decreases from around 0 K in AD 1000 to a minimum of −1.1 K in the 17th century. After AD 1700 the temperature anomaly increases, interrupted by an abrupt decrease around AD 1800. Overlaid on this very low frequency variability, the 50-yr low-pass filtered reconstruction shows variability with an amplitude of around 0.2 K. Larger variability is seen for annually resolved values, which have an unrealistic amplitude of around 0.5 K. Again we note that the LOC method is designed to preserve low-frequency variability at the cost of increasing the high-frequency variability.

Our new reconstruction is compared to a compilation of previous NH mean (some including the complete NH, others only the extratropics) reconstructions of the last millennium (Huang et al. 2000; Esper et al. 2002; Mann and Jones 2003; Oerlemans 2005; Moberg et al. 2005; Mann et al. 2008; Ljungqvist 2010); see Fig. 10. Included is also the 1505–1960 LOC reconstruction of Christiansen (2011a) based on the 14 decadally smoothed proxies from Hegerl et al. (2007). The new LOC reconstruction has a much larger variability than previous reconstructions, the exception being the LOC reconstruction of Christiansen (2011a). Regarding the shape of the low-frequency variability, the LOC reconstruction agrees with the majority of the previous reconstructions as was also found for the LOC reconstruction in Christiansen (2011a). Both LOC reconstructions agree on a cold Little Ice Age, about a factor of 2 colder than the coldest of the other reconstructions (Huang et al. 2000; Moberg et al. 2005) (although all reconstructions are hampered by wide confidence intervals). Note that the present LOC reconstruction and the LOC reconstruction based on the proxies from Hegerl et al. (2007) are not completely independent as some of the proxies are included in both studies [e.g., Taimyr (35) and Mongolia (27)]. In the first half of the millennium the new LOC reconstruction converges toward the previous reconstructions with no clear sign of the Medieval Warm Period (however, see the discussion in the concluding section).

## 6. Robustness

In this section we will investigate the robustness of the reconstruction to the method of spatial averaging, the choice of calibration period, the screening of the proxies, and the source of the local temperatures. The spatial averaging might be important because of the partial clustering of the proxies. The screening question involves, in particular, the influence of the decadally resolved proxies. As we saw above, the local temperatures may differ considerably between the two gridded datasets, GISTEMP and HadCRUT2v, which are both heavily processed products.

We first consider the reconstruction based on the GISTEMP dataset. For the GISTEMP dataset 24 proxies pass the *t* test (marked with a plus sign in Table 1; 21 of these also passed the *t* test with HadCRUT2v temperatures). We find that the amplitude of the reconstructed variability is in general larger than for the reconstructions based on HadCRUT2v. This is a simple consequence of the larger variability of the local temperatures in GISTEMP compared to HadCRUT2v. We find that China historic (7), Indigirka (22), ShiHua Cave (33), and Polar Urals (29) have suspiciously large variability and these locations are therefore excluded from the calculation of the spatial mean. The reconstruction of the extratropical NH agrees well with that based on HadCRUT2v (Fig. 9) except for the larger variability with preindustrial values about 0.3 K colder. The amplitude resembles that of the LOC reconstruction of AD 1505–1960 in Christiansen (2011a) with the cold extremum in the 17th century of around −1.5 K.

As mentioned in section 3, missing annual values in the instrumental temperature data are filled with inverse distance interpolation. To test the influence of this step we have repeated the reconstruction based on the HadCRUT2v dataset but now using a kriging procedure based on the exponential variogram model (Isaaks and Srivastava 1989) to fill the missing values. As seen (Fig. 9, green curve), the influence of the interpolation is small (see also section 8).

The last step of the LOC method involves taking the average over the local reconstructions. Above, this spatial average has been calculated as the simple arithmetic mean. In Christiansen (2011a) the local reconstructions were weighted with the cosine of their latitude. Given the partial clustering of the proxies described in section 2, a more rational choice would be to choose a weight for a local reconstruction that represents the area of the corresponding proxy’s polygon of influence (polygonal declustering, see, e.g., Isaaks and Srivastava 1989). A proxy’s polygon of influence is the region consisting of points that are closer to the location of this proxy than to any other proxy.^{1} Another way to calculate the spatial average is to weight the local reconstructions according to the correlation between the proxy and the local temperature. Reconstructions calculated with these four different averaging procedures (Fig. 11) show only minor differences regarding the low-frequency variability.
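The Monte Carlo estimate of the polygon-of-influence weights (detailed in footnote 1) can be sketched as follows; the function name, sample size, and fixed seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def declustering_weights(proxy_lats, proxy_lons, n_samples=200_000):
    """Monte Carlo estimate of polygonal-declustering weights north of 30N.

    Points are drawn uniformly on the sphere; each sample north of 30N is
    assigned to its nearest proxy, and the weight of proxy i is the
    fraction of samples n_i / sum_i n_i.
    """
    # uniform points on the sphere: lon ~ U(-pi, pi), sin(lat) ~ U(-1, 1)
    lon = rng.uniform(-np.pi, np.pi, n_samples)
    lat = np.arcsin(rng.uniform(-1.0, 1.0, n_samples))
    keep = lat > np.radians(30.0)
    lat, lon = lat[keep], lon[keep]
    plat = np.radians(np.asarray(proxy_lats, dtype=float))
    plon = np.radians(np.asarray(proxy_lons, dtype=float))
    # cosine of great-circle distance from every sample to every proxy
    cosd = (np.sin(lat)[:, None] * np.sin(plat)[None, :]
            + np.cos(lat)[:, None] * np.cos(plat)[None, :]
            * np.cos(lon[:, None] - plon[None, :]))
    nearest = np.argmax(cosd, axis=1)      # max cos(d) = min distance
    counts = np.bincount(nearest, minlength=len(plat))
    return counts / counts.sum()
```

Two proxies placed symmetrically about a meridian receive weights of about 0.5 each, as expected.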

We have applied the LOC method to a mixed ensemble of proxies of both annual and decadal resolution. The local temperatures have been low-pass filtered to match the resolution of the corresponding proxies. Of the 24 proxies that passed the *t* test with the HadCRUT2v temperatures (Fig. 8), 11 are of annual resolution while 12 are of decadal and 1 is of annual-to-decadal resolution. It is worth noting that the different time scales are not mixed in the calibration process but only when calculating the NH mean. In a manner of speaking, the LOC method is automatically a hybrid approach (Rutherford et al. 2005; Christiansen et al. 2010). However, as mentioned before, the few degrees of freedom left in the decadally resolved proxies make the *t* test problematic. With 80 degrees of freedom a correlation numerically larger than 0.29 is significant at the 1% level; with only 10 degrees of freedom the corresponding correlation is 0.76. Including only the annually resolved proxies (11 proxies pass the *t* test) gives a reconstruction (Fig. 12, orange curve) that closely resembles the reconstruction based on all proxies (blue curve in Figs. 9 and 12), although it has a somewhat larger variability. Thus, it does not seem that decadally resolved proxies necessarily increase the variability in the reconstruction, as could be expected from Moberg et al. (2005), who found a larger than usual variability in a reconstruction based on proxies with multidecadal-to-centennial resolution. As mentioned, the *t* test favors the decadally resolved proxies as it does not take the reduced number of degrees of freedom into account. We have therefore applied a Monte Carlo method (Christiansen 2000) that compares the correlations with correlations between random time series with the same autocorrelation structure as the original proxies and local temperatures. With this test only seven annually resolved proxies and one decadally resolved proxy are significantly correlated with their local temperatures at the 1% level.
The resulting reconstruction (Fig. 12, purple curve) resembles the original reconstruction rather closely. As expected the amplitude of the spurious high-frequency variability increases when the number of proxies decreases.
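The Monte Carlo screening compares the observed correlation with correlations between random series sharing the autocorrelation structure of the data. The following is a simplified sketch that assumes an AR(1) model for that structure; this is only one possible choice and is not necessarily the model used in Christiansen (2000).

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_surrogates(x, n_surr):
    """AR(1) surrogates with (approximately) the lag-1 autocorrelation,
    mean, and variance of x."""
    x = np.asarray(x, dtype=float)
    xa = x - x.mean()
    a = np.corrcoef(xa[:-1], xa[1:])[0, 1]   # lag-1 autocorrelation
    a = np.clip(a, -0.99, 0.99)              # guard against degenerate AR(1)
    n = len(x)
    s = np.empty((n_surr, n))
    s[:, 0] = rng.standard_normal(n_surr)
    for t in range(1, n):
        s[:, t] = a * s[:, t - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal(n_surr)
    return x.mean() + x.std() * s

def mc_corr_pvalue(proxy, temp, n_surr=1000):
    """Two-sided Monte Carlo p value of corr(proxy, temp) against
    correlations between independent AR(1) surrogate pairs."""
    r = np.corrcoef(proxy, temp)[0, 1]
    sp = ar1_surrogates(proxy, n_surr)
    st = ar1_surrogates(temp, n_surr)
    rs = np.array([np.corrcoef(sp[k], st[k])[0, 1] for k in range(n_surr)])
    return float(np.mean(np.abs(rs) >= abs(r)))
```

Because the surrogate pairs inherit the serial correlation of the originals, strongly autocorrelated series need a larger sample correlation to pass at a given significance level, which is exactly why this test is stricter than the plain *t* test.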

We close this section by looking at the effect of choosing other calibration periods than the previously used period AD 1880–1960. Figure 13 shows reconstructions calibrated in AD 1880–1975, 1900–75, and 1900–60. The reconstructions are very similar and the main difference is found for the reconstruction with the calibration period AD 1900–75. This reconstruction is somewhat colder than the others with an extremum in the Little Ice Age of −1.3 K compared to −1.1 K in the other reconstructions.

## 7. Confidence intervals

We saw in the last section that our LOC reconstruction is fairly robust to changes in, for example, the screening procedure and the choice of the calibration period. However, robustness is a necessary but not a sufficient condition for a trustworthy reconstruction. As shown in Christiansen (2011a), reconstruction methods are often beset by systematic biases inherent in the underlying regression methods. These biases will not be revealed by studies of robustness such as those in section 6.

Climate reconstructions are hampered by different sources of noise. In general we can identify three basic sources: (i) measurement noise on the proxies, (ii) measurement noise on the temperatures, and (iii) the fact that proxies are not fully explained by temperature but include effects such as changes in the local environment (see, e.g., Bradley 1999). The last noise component (iii) enters the regression in the form of equation noise. There are some arguments (Christiansen 2011a) in favor of ignoring (ii).

The influence of the noise on the uncertainty of the reconstructions depends on the length of the calibration period, the number of proxies, and the reconstruction method itself. As shown in Christiansen et al. (2010), the noise will most often not only give rise to a dispersion of the reconstruction around its true value but also to a systematic bias. The LOC method was designed to minimize this bias.

Because of the limited length of the calibration period and our focus on low-frequency variability, cross-validation is not applicable [see also Christiansen et al. (2009) for an evaluation of cross-validation for climate reconstructions]. To estimate how the noise influences the LOC reconstructions we resort to an ensemble pseudoproxy approach as explained in Christiansen (2011a). An ensemble of surrogate temperature fields with the same spatial and temporal structure as in a climate model experiment is constructed. For each ensemble member pseudoproxies are generated by adding noise to the local temperatures after these have been smoothed to reflect the resolution of the original proxies. The positions and number of the pseudoproxies and their correlations with the local temperatures all mimic those of the real proxies. The climate model experiment is the Erik-the-Red simulation with the ECHAM and the global Hamburg Ocean Primitive Equation (ECHO-G) climate model (González-Rouco et al. 2003). This forced experiment covers the period 1000–1990 and has been used in several pseudoproxy studies (Zorita et al. 2003; von Storch et al. 2004; Bürger and Cubasch 2006; Mann et al. 2007; Christiansen 2011a). Christiansen (2011a) found only small differences when comparing pseudoproxy experiments based on two different climate model simulations.
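A minimal sketch of the pseudoproxy generation step: white noise is added to the (possibly block-averaged) local temperature, with the noise variance chosen so that the expected proxy-temperature correlation matches a prescribed target. The white-noise assumption and the block averaging are simplifications; the noise model and smoothing actually used in Christiansen (2011a) may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_pseudoproxy(local_temp, target_r, resolution=1):
    """Pseudoproxy = smoothed local temperature plus white noise whose
    variance is set so that the expected proxy-temperature correlation
    equals target_r.

    resolution > 1 mimics a decadally resolved proxy by block averaging,
    a simple stand-in for low-pass filtering.
    """
    t = np.asarray(local_temp, dtype=float)
    if resolution > 1:
        n = (len(t) // resolution) * resolution
        t = t[:n].reshape(-1, resolution).mean(axis=1).repeat(resolution)
    # white-noise std giving corr r with t: sigma_n = sigma_t * sqrt(1/r^2 - 1)
    sigma_n = t.std() * np.sqrt(1.0 / target_r**2 - 1.0)
    return t + sigma_n * rng.standard_normal(len(t))
```

With `target_r` set to the real proxy's observed correlation with its local temperature, an ensemble of such pseudoproxies mimics the signal-to-noise ratio of the real network.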

The NH mean temperature of the surrogate field is then reconstructed. The diagnostic under consideration—here the 50-yr smoothed temperature but in Christiansen (2011a) also preindustrial level and trends—is calculated from the reconstruction and compared to the known target. The details of the LOC method such as the calibration period, the screening process, etc., are the same as those used for the real-world reconstruction. Thus, the effects of the measurement noise on proxies, the equation noise, the finite length of the calibration interval, the limited number of proxies, the screening procedure, and the spatial and temporal averaging are all included in our estimates of the skill of the reconstruction method. Note also that this approach includes the effect of the temperature field being a stochastic field. As discussed in Christiansen (2011a) this is analogous to the concept of a structural model.

From an ensemble of 100 such pseudoproxy reconstructions we get a distribution of the difference between the reconstructed diagnostic and that of the target. From this distribution all relevant statistics such as the bias, variance, and mean-square error can be calculated. Here we present the bias and the upper and lower 2.5% quantiles.
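The ensemble statistics follow directly from the distribution of reconstruction errors; a sketch (function name is illustrative):

```python
import numpy as np

def ensemble_stats(reconstructions, target):
    """Bias and 2.5%/97.5% quantiles of the reconstruction error.

    reconstructions : (n_ensemble, n_time) array of pseudoproxy
        reconstructions of the diagnostic (e.g., 50-yr smoothed NH mean)
    target : (n_time,) known diagnostic of the surrogate truth
    """
    err = np.asarray(reconstructions) - np.asarray(target)[None, :]
    bias = err.mean(axis=0)                            # systematic error
    lo, hi = np.percentile(err, [2.5, 97.5], axis=0)   # 95% interval
    return bias, lo, hi
```

Adding `bias`, `lo`, and `hi` to the real-world reconstruction gives the curves shown in Fig. 14.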

The approach described above has been applied to the three reconstructions with different screening procedures from Fig. 12. In Fig. 14 these reconstructions are shown together with the curves obtained by adding the bias and the upper and lower 2.5% quantiles. We note that for all three screening procedures the 50-yr smoothed reconstruction shows only a negligible bias but that the 95% significance interval is quite wide. The width of this interval is 0.57, 0.63, and 0.66 K for the three reconstructions, respectively. Temperature excursions significantly different from zero are, according to this analysis, found for all three reconstructions in the period AD 1500–1850.

As expected, the 95% significance intervals are much larger when annual values are considered. Again the bias is close to zero but the significance intervals are now 1.77, 2.55, and 2.62 K, respectively. For both the 50-yr smoothed and annual values the significance intervals decrease when the number of proxies increases. However, proxies of lower than annual resolution will contribute relatively more to the width of the confidence intervals because of the fewer degrees of freedom available.

Our confidence intervals depend on the assumptions that the surface temperature from the underlying climate simulation (Erik-the-Red, González-Rouco et al. 2003) is a good representation of the real temperature and that the proxies respond linearly to the local temperature. In Christiansen (2011a) only relatively small differences were found between ensemble pseudoproxy experiments based on two different climate models. Nonlinear connections between proxies and temperatures are expected to expand the confidence intervals.

## 8. Further considerations

There has been an increasing interest in using Bayesian approaches to estimate the uncertainty of climate reconstructions (Li et al. 2010; Tingley and Huybers 2010a; Tingley et al. 2011, manuscript submitted to *J. Climate*). However, as shown in Christiansen et al. (2009), the main problem of many climate reconstruction methods is systematic biases. These biases are the result of using a wrong model, such as an insufficient regression model or too heavy a truncation. As the Bayesian approach is conditioned on these model choices it will not inform us of the bias but only of the variance. The pseudoproxy method used above provides both the bias and the variance, the former because the true value is known. We therefore consider the pseudoproxy approach superior to the Bayesian approach in the present context of estimating the confidence levels and skill of climate reconstructions. The Bayesian hierarchical approach is systematic and can span the different steps in the reconstruction process; however, so can the pseudoproxy approach. It is also worth noting that the indirect regression that is an important step in the LOC method is the natural data-level model within the Bayesian framework (Tingley et al. 2011, manuscript submitted to *J. Climate*; Tingley and Li 2011, manuscript submitted to *J. Climate*; Christiansen 2011b, manuscript submitted to *J. Climate*).

As an example of including additional steps in the pseudoproxy approach we have focused on the effect of the missing values in the temperature dataset. To this end we recorded the years and geographical positions of the missing data in the HadCRUT2v dataset. For each of the surrogate fields we—after the generation of the pseudoproxies—removed data from the same years and the corresponding geographical positions. These missing values were then filled with the same inverse distance interpolation as used for the real-world reconstruction. We repeated the calculations of the three confidence intervals presented in the previous section with this method. The widths of the confidence intervals for the 50-yr smoothed values decrease slightly and are now 0.51, 0.52, and 0.53 K. The reduction for the annual values is larger; the widths of the confidence intervals are now 1.43, 2.02, and 2.05 K. What happens is that the correlations between proxies and local temperatures decrease somewhat but this is outweighed by increased correlations between the local temperatures and the NH mean temperature. The cost of the reduced widths of the confidence intervals is a warm bias of approximately 0.1 K.

The proxies do not necessarily match the local temperatures regarding temporal and spatial resolution [for a general discussion of misalignment see Tingley et al. (2011, manuscript submitted to *J. Climate*)]. As mentioned in section 2 proxies of annual-to-decadal and decadal sampling were linearly interpolated to annual values. We have repeated the reconstructions based on quadratically interpolated proxies and confirmed that the effect of the type of interpolation is small. We have confirmed that the influence of the interpolation itself is small by subsampling the local temperatures to reflect the sampling of the proxies. Correlations between the subsampled temperatures and proxies are almost identical to those reported in Fig. 5.

Local temperatures at the positions of the proxies were obtained from the temperature datasets by locating the nearest grid cell. This choice, too, has little impact on the reconstructions, as we have confirmed by repeating the reconstructions with local temperatures obtained as averages over neighboring grid cells or by inverse distance interpolation. As explained in section 2 the local temperatures have been low-pass filtered with a cutoff at 5 or 10 yr to match the annual-to-decadal or decadal resolution of the proxies. Using other cutoffs (10 or 20 yr) we have confirmed that the effect of this choice is negligible.
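Locating the nearest grid cell amounts to maximizing the cosine of the great-circle distance between the proxy site and the grid-cell centers; a small sketch (function name is illustrative):

```python
import numpy as np

def nearest_grid_cell(lat, lon, grid_lats, grid_lons):
    """Index of the grid-cell center closest (great-circle) to a site."""
    gl = np.radians(np.asarray(grid_lats, dtype=float))
    gn = np.radians(np.asarray(grid_lons, dtype=float))
    la, lo = np.radians(lat), np.radians(lon)
    # maximizing cos(distance) is equivalent to minimizing distance
    cosd = np.sin(la) * np.sin(gl) + np.cos(la) * np.cos(gl) * np.cos(lo - gn)
    return int(np.argmax(cosd))
```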

As mentioned in the previous section cross-validation is not applicable when the focus is on the low-frequency variability. However, we have performed a leave-one-out cross-validation test of the year-to-year correlation between reconstruction and NH temperature.^{2} Under the same conditions as for the “annual” reconstruction in Fig. 12 the cross-validation gives a correlation of 0.72. This high correlation is partly a result of the trends in the NH temperature and the reconstruction. After detrending the correlation is 0.35, which is comparable to what was obtained with the pseudoproxy experiments in Christiansen (2011a).
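The leave-one-out procedure of footnote 2 can be sketched for a single proxy and an indirect (inverse) regression in the spirit of LOC: the proxy is regressed on temperature, and the held-back year's temperature is recovered by inverting the fit. This single-proxy sketch omits the NH averaging over all local reconstructions.

```python
import numpy as np

def loo_cv_correlation(proxy, temp):
    """Leave-one-out cross-validated correlation for an indirect
    regression of one proxy on local temperature.

    In each step year i is held back, proxy = a*temp + b is fitted on the
    remaining years (temperature as the independent variable), and the
    temperature in year i is reconstructed as (proxy_i - b) / a.
    """
    proxy = np.asarray(proxy, dtype=float)
    temp = np.asarray(temp, dtype=float)
    n = len(proxy)
    t_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        a, b = np.polyfit(temp[keep], proxy[keep], 1)
        t_cv[i] = (proxy[i] - b) / a
    return np.corrcoef(t_cv, temp)[0, 1]
```

Holding back a block of neighboring years instead of a single year, as mentioned in footnote 2, would replace the `keep` mask with one that excludes a window around year *i*.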

## 9. Conclusions

We have presented a new reconstruction of the extratropical NH mean temperature in the last millennium based on a reconstruction method, LOC, that previously (Christiansen 2011a) has been shown to preserve low-frequency variability at the expense of an increase in the high-frequency variability. We applied the new method to a compilation of 40 proxies, each of which previously has been shown to relate to the local temperature.

The LOC method involves one-dimensional regression between proxies and local temperatures with temperature as the independent variable. A proxy with little relation to the local temperature can therefore jeopardize the reconstruction. While the proxies have been chosen because they have a proven relation to local temperature, this relationship is often found for a particular season and does not necessarily hold for annual means. We have therefore applied a significance test to exclude possible spurious correlations. Because of the limited length of the calibration period and the serial correlations in proxies and temperatures (in particular for the decadally resolved proxies) this test has to be somewhat relaxed compared to usual standards. If a strict Monte Carlo test that takes the serial correlations into account is applied, only eight proxies (including one with decadal resolution) are found to be significantly correlated with the local temperature at the 1% level. We have therefore included all proxies that pass the usual *t* test at the 1% level, assuming all years are independent. Additionally, the local reconstructions are investigated and locations with suspiciously large variability are excluded from the final calculation of the NH extratropical mean. The influence of the screening was studied in section 6.

Our main conclusions are as follows.

The new reconstruction shows a much more pronounced low-frequency variability than previous reconstructions. In particular, we find that temperatures during parts of the Little Ice Age (e.g., the 17th century) reach values as cold as −1.1 K below the AD 1880–1960 level. This is about twice as cold as most previous estimates. For example, our reconstruction is about 0.6 K colder than the borehole-based reconstruction (Huang et al. 2000), which is one of the coldest previous reconstructions.

Our reconstruction agrees with previous work in finding only small temperature anomalies in the beginning of the millennium. We would like to stress that the results of this article are not intended for assessing the amplitude of the Medieval Warm Period, as it is very likely that the peak medieval warming occurred already in the mid- to late 10th century and thus before our reconstruction begins (Esper et al. 2002; Ljungqvist 2009, 2010).

The reconstruction is fairly robust with respect to changes in the calibration period, the screening procedure, and the procedure for calculating the spatial average. The local temperatures from HadCRUT2v and GISTEMP differ considerably and the reconstruction based on GISTEMP shows a larger variability although it has the same general shape. The cold extremum in the 17th century is −1.5 K for the GISTEMP-based reconstruction compared to −1.1 K for the HadCRUT2v-based reconstruction.

Confidence intervals have been estimated by an ensemble pseudoproxy approach that includes both measurement noise on the proxies and the fact that proxies are not perfect recorders of the local temperatures (Christiansen 2011a). This approach also includes the effects of the number of proxies, the length of the calibration interval, the screening process, etc. We find that the 95% confidence interval around the 50-yr smoothed reconstruction has a width of approximately 0.6 K. This makes the cooling in the period AD 1500–1850 significant compared to the calibration period. Assuming a confidence interval of similar size for the borehole-based reconstruction (Huang et al. 2000), the difference in the 17th century between the borehole-based reconstruction and the reconstructions in the present paper may not be statistically significant.

All our reconstructions—both the temporally smoothed and the annually resolved—show a decline after AD 1960. This decline is mainly a consequence of the behavior of tree-ring width and density proxies. Of the tree-ring width and density proxies used in this study, 13 out of 15 (87%) have a lower mean in 1960–75 than in 1940–60. For the other types of proxies the corresponding numbers are 13 out of 25 (52%). Part of the decline in the reconstructed temperatures after AD 1960 may therefore be attributed to the divergence problem (Briffa et al. 1998; Wilson et al. 2007; D’Arrigo et al. 2008). Given this behavior, it is reassuring that our reconstruction is not sensitive to the calibration interval (Fig. 5).

## Acknowledgments

This work was supported by the Danish Climate Centre at the Danish Meteorological Institute. The HadCRUT2v was downloaded online (see http://www.cru.uea.ac.uk/cru/data/tem2/). The GISTEMP was downloaded online (see ftp://data.giss.nasa.gov/pub/gistemp/netcdf/). We wish to thank Anders Moberg, Department of Physical Geography and Quaternary Geology at Stockholm University, for valuable discussions.

## Footnotes

^{1}

The areas of these regions are most simply calculated by a Monte Carlo approach where points are generated randomly on the sphere. From a large sample of *N* such points we select those north of 30°N and count the number *n _{i}* of points that are closest to the *i*th proxy location. The weights are then *n _{i}*/Σ_{i} *n _{i}*. Alternatively (and more elegantly) a spherical Voronoi procedure can be used.

^{2}

Holding back the *i*th year from the calibration period, the model is trained on the remaining years. The NH temperature in the *i*th year is then reconstructed using this model. After doing this successively for all years in the calibration period we calculate the correlation between the cross-validated reconstruction *T ^{cv}* and the observed NH temperature. Holding back a neighborhood around the *i*th year (to reduce the influence of serial correlations) only changes the results slightly.