Centennial-Scale Temperature Change in Last Millennium Simulations and Proxy-Based Reconstructions

Systematic comparisons of proxy-based reconstructions and climate model simulations of past millennium temperature variability offer insights into climate sensitivity and feedback mechanisms, besides allowing model evaluation independently from the period covered by instrumental data. Such simulation– reconstruction comparisons can help to distinguish more skillful models from less skillful ones, which may subsequently help to develop more reliable future projections. This study evaluates the low-frequency simulation–reconstruction agreement within the past millennium through assessing the amplitude of temperature change between the Medieval Climate Anomaly (here, 950–1250 CE) and the Little Ice Age (here, 1450–1850 CE) in PMIP3 model simulations compared to proxy-based local and continentalscale reconstructions. The simulations consistently show a smaller temperature change than the reconstructions for most regions in the Northern Hemisphere, but not in the Southern Hemisphere, as well as a partly different spatial pattern. A cost function analysis assesses howwell the various simulations agree with reconstructions. Disregarding spatial correlation, significant differences are seen in the agreement with the local temperature reconstructions between groups of models, but insignificant differences are noted when compared to continental-scale reconstructions. This result points toward a limited possibility to ‘‘rank’’ models by means of their low-frequency temperature variability alone. The systematically lower amplitude of simulated versus reconstructed temperature change indicates either too-small simulated internal variability or that the analyzed models lack some critical forcing or have missing or too-weak feedback mechanisms. We hypothesize that too-cold initial ocean conditions in the models—in combination with too-weak internal variability and slow feedbacks over longer time scales—could account for much of the simulation–reconstruction disagreement.


Introduction
Understanding the variability of temperature at local to global scales, over the past one to two millennia, offers insight into natural decadal to centennial-scale change that cannot be discerned from instrumental data.To this end considerable progress has been made during the past decade using reconstructions from proxy data and climate model simulations (Masson-Delmotte et al. 2013;Smerdon and Pollack 2016;Luterbacher et al. 2016;Christiansen and Ljungqvist 2017;Zhang et al. 2018).Recently made advancements are fourfold: first, the network of temperature-sensitive proxy data has expanded both spatially and temporally (Jones et al. 2009;PAGES 2k Consortium 2013, 2017;Ljungqvist et al. 2012Ljungqvist et al. , 2016;;Steiger et al. 2018;Tardif et al. 2018).
These advances have facilitated an increasing number of simulation-reconstruction comparison studies (e.g., Fernández-Donado et al. 2013;Phipps et al. 2013;Schurer et al. 2013;PAGES 2k-PMIP3 Group 2015;Ljungqvist et al. 2016;Luterbacher et al. 2016;PAGES Hydro2k Consortium 2017).Simulation-reconstruction comparisons provide an opportunity to evaluate model performance using proxy records as ''out of sample'' test beds for the models' ability to reproduce both a warmer and colder climate under various forcings and boundary conditions, independent of the models' ability to realistically simulate modern climate conditions (e.g., Moberg 2013).Detection and attribution methods and other types of simulation-reconstruction comparisons have shown that changes in external forcing-in particular orbital, solar, volcanic, and greenhouse gasesare responsible for a large portion of the observed multidecadal to centennial-scale temperature change (e.g., Phipps et al. 2013;Schurer et al. 2013Schurer et al. , 2014;;Andres and Peltier 2016;Otto-Bliesner et al. 2016).Internal, unforced, climate variability also plays a significant role for multidecadal to centennial-scale temperature variability during the past millennium; it may in particular have been to a large degree responsible for the warmth during portions of medieval times (Goosse et al. 2012;Le et al. 2016;Cheung et al. 2017;Wang et al. 2017Wang et al. , 2018)).Fernández-Donado et al. (2013) compared 26 ''last millennium'' simulations from eight pre-PMIP3 models with global or hemispheric-scale temperature reconstructions.
They found that the simulations can reproduce the general patterns and timing of temperature variability, although not always the amplitude, during cold periods reasonably well; however, less so during warm periods.In particular, Fernández-Donado et al. (2013) as well as many other studies (e.g., Goosse et al. 2005Goosse et al. , 2012;;Zorita et al. 2010;Jungclaus et al. 2010;Phipps et al. 2013;Schurer et al. 2013Schurer et al. , 2014;;Luterbacher et al. 2016) have noted that the simulations fail to reproduce the peak medieval warmth around 900-1100 CE.Instead, they simulate a weaker medieval warming, more than a century after the proxy-reconstructed peak, consistent with high solar and relatively low volcanic forcing at that time.A relatively good agreement between simulated and reconstructed temperatures is found after approximately 1200 CE, when both simulations and proxy-based reconstructions follow the evolution in radiative forcings (Masson-Delmotte et al. 2013;Phipps et al. 2013;Schurer et al. 2013;Goosse 2017;Neukom et al. 2018).Model simulations in general produce smaller centennial-scale variability than proxybased temperature reconstructions over the last millennium (Ault et al. 2013;Bothe et al. 2013a,b;Phipps et al. 2013;Schurer et al. 2014;Yan et al. 2015;Ljungqvist et al. 2016;Luterbacher et al. 2016;Zhang et al. 2018) and possess a different frequency spectrum (Lovejoy and Varotsos 2016;Dee et al. 2017) with less low-frequency and more high-frequency variability (PAGES 2k-PMIP3 Group 2015;Hartl-Meier et al. 2017).
Simulation-reconstruction comparisons can be carried out for a number of purposes including 1) to assess the performance of model simulations, 2) to explore the mechanisms forcing past climate changes as recorded in proxies, 3) to constrain future climate change projections, and 4) to explore the role of external forcings versus internal causes of past climate variability.Systematic comparisons of proxy-based reconstructions and model simulations can address long-term processes within the climate system and offer important insights into climate sensitivity and feedback mechanisms otherwise inaccessible.The models' performance in simulating low-frequency changes, beyond multidecadal time scales, cannot be evaluated using the short period covered by instrumental data.At the same time, model performance over these longer time scales is of significance for the reliability of long-term projections of anthropogenic global warming.The ability of the models to realistically simulate the low-frequency (centennial scale) temperature changes during the past one to two millennia is of particular interest for testing the models' climate sensitivity and feedback mechanisms since the boundary conditions (besides anthropogenic forcing) are rather similar to those of the present day.
The centennial-scale temperature change of last millennium simulations with PMIP3 models has hitherto not been systematically benchmarked against reconstructions, in a way that potentially allows a ''ranking'' of models, as most simulation-reconstruction comparison studies have focused more on interannual to multidecadal scale agreement.Testing the ability to rank models is of interest since the models that better than others capture the amplitude of reconstructed low-frequency temperature change over the past millennium (or other time scales), as well as its spatial signature, are arguably also more likely to provide more reliable long-term climate projections (Schmidt 2010;Braconnot et al. 2012;Schmidt et al. 2014a;Jungclaus et al. 2017).Alternatively, one could argue that those models that more poorly than others simulate reconstructed past climate variations may not be expected to provide the most reliable future projections.However, care must be taken when interpreting results from model ranking exercises because the results are conditional upon the forcing data used to drive paleoclimate simulations.If the forcing data have errors, it will certainly affect the reliability of results from model ranking experiments.If the models systematically underestimate past low-frequency temperature variability, as many previous studies have shown, it may indicate that they have a too-low climate sensitivity, erroneous initial conditions, or wrong forcing configurations or have some missing or too-weak feedback mechanisms over longer time scales.Alternatively, it means that they have a toosmall internal variability over longer time scales.
Here, we compare and analyze the reconstructionmodel agreement of the low-frequency component over the past millennium climate.We do this by assessing the amplitude of temperature change between 950-1250 CE (henceforth the Medieval Climate Anomaly, MCA) and 1450-1850 CE (henceforth the Little Ice Age, LIA) in the PMIP3 last millennium (850-1850 CE) simulations using a set of 128 local temperature reconstructions and six PAGES 2k Consortium (2013) continental-scale temperature reconstructions.The time intervals of 950-1250and 1450-1850CE conform to IPCC definitions (Masson-Delmotte et al. 2013).We use the absolute MCA-LIA difference as a benchmark to assess the agreement between model-simulated and proxy-reconstructed centennial-scale amplitude of temperature change.Thus, we evaluate the models' performance independently of their properties during the industrial period and without the need of a reference period.
Our objectives of the simulation-reconstruction comparison are as follows: 1) To systematically explore how well models simulate the amplitude of the reconstructed MCA-LIA temperature difference and its spatial signature.This may inform us both of their response to external forcing and their ability to simulate internal variability in the low-frequency domain.2) To determine which simulations agree significantly better than others with the reconstructed amplitude of MCA-LIA temperature difference and explore the possibility of ranking the various simulations according to their agreement with the amplitude of this difference.3) As part of our assessment, we investigate to what extent the reconstructed and simulated polar amplification of the MCA-LIA temperature difference agrees.This key aspect of the climate system has been insufficiently explored for the past millennium.
We conclude the study by discussing possible reasons for observed differences in the amplitude of centennialscale temperature change in simulations and proxy-based reconstructions, and attempt to reconcile observed simulation-reconstruction differences.

a. Local temperature-sensitive proxy reconstructions
We assembled a global and millennium-long dataset of 51 annual-mean, 56 summer, and 21 winter temperature-sensitive local (i.e., site-specific) proxybased reconstructions from the scientific literature.All reconstructions were calibrated to temperature by the original authors and have a resolution of at least two observations per century (Table 1; Fig. 1).Reconstructions with less than annual resolution are required to have their dating constrained by at least one age after 1700 CE (D 1 ), one age between 950 and 1700 CE (D 2 ), and one age before 950 CE (D 3 ) [see Ljungqvist et al. (2016) for further details].All nonannually resolved local reconstructions are converted to annual time series by simple linear interpolation between the available dates.For the sake of compatibility with the model simulations we let all boreal summer (austral winter) reconstructions, regardless of the exact monthly interval used in calibration, represent June-August (JJA), and all boreal winter (austral summer) reconstructions represent December-February (DJF).Although they may well calibrate to annual-mean temperature, few proxies published as annual temperature reconstructions are in reality likely to be equally temperature sensitive in all 12 months of the TABLE 1. List of all local reconstructions, sorted from north to south, used in this study with information about site/location, latitude and longitude coordinates, archive type, proxy type, temporal resolution (Res.), seasonality, dating method(s), reference to original publication, temperature difference between 950-1250 and 1450-1850 CE, and the weight assigned to each proxy for the cost function calculations based on their estimated uncertainty interval.The weights are standardized to sum to 1 and may be used directly to compute the cost function for all seasons taken together.Digitized local reconsructions are marked with an asterisk after the reference.year.To ease the comparison with model simulations, we also treat sea surface temperature reconstructions as near-surface temperature indicators.This approach is justifiable given the high correlation between nearsurface temperature and sea surface temperature over long time scales (e.g., Rayner et al. 2003).For the sake of comparability between model data and reconstructions, the simulated values have been made to correspond to the reconstructed ones in both time and space, that is, we use simulated values for the sites where we have proxy data, at the same time points.This means that for sites where the reconstruction has less-than-annual resolution, so will the simulated values used.We then subject these simulated values to the same linear interpolation as the reconstructions, even though actual annually resolved data are of course obtainable for the simulations.

RECONSTRUCTIONS
Most local reconstructions used in this study were published with some measure of uncertainty.The nature of this uncertainty (calibration error, measurement error, etc.) is not always given explicitly, nor is it always clear how it is to be understood in statistical terms, for example, whether it is a 95% confidence interval or a standard deviation.Nevertheless, we attempt to make these measures comparable across studies, which is necessary in order to assign weights (see section 2d and appendix A) by reinterpreting them as standard deviations SD MCA and SD LIA of the statistical error in the reconstructions.Under the assumption of normality, a confidence interval may be converted into a standard deviation as shown in Table 3, and we have taken the expression ''root-meansquare error'' to mean a standard deviation.When it was explicitly stated that the uncertainty is the standard deviation (SD) of the average of m values, the uncertainty is re-expressed as SD/ ffiffiffiffi ffi m p .However, for many reconstructions assigning an SD value to the annual or interpolated values is qualified guesswork.To proxies with no published estimate of uncertainty we have assigned the pooled standard deviation1 of all other reconstructions from the same archive type.The MCA (950-1250 CE) and the LIA (1450-1850 CE) values are represented by 300-and 400-yr averages, so we use SD MCA and SD LIA to calculate the standard deviations of these averages, taking autocorrelation into account (see section 2d and appendix A for details).

b. The PAGES 2k continental-scale temperature reconstructions
The PAGES 2k Consortium ( 2013) produced seven continental-scale reconstructions of various lengths using either a multiproxy network or only tree-ring data: the Arctic (.608N), Europe, Asia, North America, South America, Australasia, and Antarctica (Table 2) Despite agreeing upon a common set of minimum criteria, the various PAGES 2k Consortium (2013) regional working groups used different criteria for their proxy selection and used different methods for calibration; primarily by scaling the proxy composite to an ).In addition, each PAGES2k group calculated their reconstruction's levels of uncertainty differently, all assuming stationarity between temperature and proxy values.We have reinterpreted all these uncertainties so that they should represent SDs (Table 3).

c. PMIP3 last millennium simulations
Ten groups within the Paleoclimate Modeling Intercomparison Project phase 3 (PMIP3; Braconnot et al. 2012) of the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al. 2012) have produced forced-transient simulations covering 850-1850 CE.All simulations follow the PMIP3 protocol regarding orbital, total solar irradiation, volcanic, and land-use forcings (Schmidt et al. 2011(Schmidt et al. , 2012; Table 4).The surface temperature from the simulations covering the period of 850-1850 CE used in this study are shown in Fig. 2 for the global mean, Northern Hemisphere (NH) mean, and Southern Hemisphere (SH) mean, respectively.
Simulations with the following models were downloaded from the CMIP5 database (https://esgf-data.dkrz.de/search/cmip5-dkrz/) and used in this study: BCC_CSM1.1 (BCC hereafter), CCSM4, CSIRO Mk3L.1.2(CSIRO hereafter), FGOALS-s2 (FGOALS hereafter), GISS-E2-R (GISS hereafter), HadCM3, IPSL-CM5A-LR (IPSL hereafter), and MPI-ESM-P (MPI hereafter).The GISS modeling group produced an ensemble with eight members (with different combinations of volcanic, solar and land-use forcings; see Table 4).We selected the three GISS simulations [r1i1p121 (GISS-R21), r1i1p124 (GISS-R24), and r1i1p127 (GISS-R27)] with correctly implemented volcanic forcing and also containing all other forcings as prescribed by Schmidt et al. (2011Schmidt et al. ( , 2012)).Additionally, we include a recently produced simulation with EC-Earth v3.1 (Hazeleger et al. 2010(Hazeleger et al. , 2012; see appendix B).The simulations from the MIROC-ESM and FGOALS-gl models, which are also available in the CMIP5 database, were discarded in the present study.MIROC-ESM shows a strong warming drift (;0.098C century 21 ) caused by a decreasing trend in the Southern Ocean sea  4. Description of CMIP5/PMIP3 climate models used for the last millennium simulations and their forcing configuration.When a certain forcing is not specified, a constant ''preindustrial'' value is used.Orbital forcing in the models follows Berger (1978).Abbreviations in the table for the different forcing datasets are as follows.ice, likely due to an insufficiency of the initialization according to Sueyoshi et al. (2013).FGOALS-gl is excluded because it first starts at 1000 CE.A simple correction is performed for the known drift of the GISS simulations (Schmidt 2012): by subtracting a locally weighted scatterplot smooth (LOWESS) fit (influence of about 600 years) to the GISS preindustrial control run from the annually resolved data of interest (i.e., the gridpoint data for the field evaluation and the relevant time series for the area-averaged assessment).

MODEL DATA PROCESSING
The monthly mean temperature data for 950-1250 CE and 1450-1850 CE were extracted from all the simulations.From the two periods we calculated the annual, the JJA, and DJF means.The difference between the two periods was then calculated for each seasonal target to represent the amplitude of MCA-LIA temperature change in each grid cell.
Since the models have different resolutions, their data were bilinearly regridded to a common grid size of 18 3 18 before calculating the multimodel ensemble mean and medians.The SDs of the annual and seasonal values within the periods representing the MCA and the LIA at each new grid cell were calculated and used for a two-tailed Student's t test of mean climate change.Similarly, the Wilcoxon's signed rank test (Wilcoxon 1945) is used for the median.

d. Statistical evaluation of the simulationreconstruction agreement
The simulation-reconstruction comparison is performed for those areas (e.g., grid cells) for which local or continental-scale reconstructions are available.A weighted distance between simulations and reconstructions is calculated by means of a cost function (CF).The method was first applied by Goosse et al. (2006) in a data assimilation study to select ensemble members from a large number of simulations that best match a target reconstruction.Another use has been to objectively group simulation data from different models (Zhang et al. 2010).The CF can also be applied to evaluate the degree of agreement between model results and the corresponding reconstructions.A good-fit measured by the cost-function largely depends on the number of predictors, their uncertainties, and their spatial distribution.
The CF measures the weighted distance between reconstructed and simulated temperature change, and is defined as follows: where T rec i is the difference between the reconstructed MCA and LIA temperatures at location i; T model_k i is the difference between the MCA and LIA temperatures at location i, simulated by model k; and w i is weights, standardized to sum to 1.
In the case of local reconstructions, the CF values will be calculated for each season (annual mean, JJA, or DJF) separately.
For the definition of the weights, see appendix A. Suffice to say here is that reconstructions with large uncertainties receive smaller weights.We also consider an unweighted CF, in which all the weights are equal, that is, w i 5 1/n: THE COST FUNCTION CALCULATION AS A BLOCK

DESIGN
In principle, the CF values could be used to rank the climate models, by means of the statistical distance between simulated temperature reconstructions, in terms of point-wise performance: the lower the CF value, the better the climate model's fit to the reconstructions.However, such numerical assessments are not necessarily reliable: even if all the models were in fact equally good, it is unlikely that any two of the CF values would turn out exactly the same.Hence, one model would appear to be the ''best'' and another the ''worst,'' when in such a case all the observed differences between models are due to random variability rather than to any consistent difference in model performance.This fact points to the need for a significance test to tell when an observed difference between models is meaningful.To this end, consider measuring the performance of model k at location i.
Introducing the notation Y k for the average of all the Y ki for model k, we then have the following relation to the unweighted cost function: The Y ki may be analyzed as a block design (e.g., Box et al. 2005), with the locations of the reconstructions as blocks.This means that we only compare Y ki values for different k (models) at the same i (reconstruction location); clearly the Y ki for different locations reflect the diverse nature of the locations at least as much as differences in model performance and so should not be allowed to influence the analysis.Since the normality of Y ki is questionable (indeed, they are nonnegative), we perform a nonparametric test, the Friedman test (Friedman 1937(Friedman , 1939)), complemented by a Friedman-Nemenyi test (Nemenyi 1963), to decide whether any significant differences exist between models.Both tests are based on ranks but not on any assumption of normality.For each site i the models are ranked so that the model with the smallest Y ki is assigned rank 1, the model with the second smallest Y ki is assigned rank 2, etc.Then the ranks are summed over locations for each model, giving an overall assessment R k of the performance of model k.The R k may be used to test the null hypothesis H 0 : the assignment of ranks at the various sites are independent random permutations of the numbers 1, 2, . . ., n.This would be the case if the difference between model runs was due to internal variability and random error alone, assuming the time series belonging to the different locations i are independent.In that case all the R k will be approximately equal.The Friedman test tells us how unequal they need to be in order for H 0 to be rejected.The Friedman-Nemenyi test restricts this comparison to pairs of models, so that we can tell exactly which differences are significant.Yet, in reality, we cannot be certain that neighboring locations are independent, meaning that the effective number of independent sites is smaller than the observed number, so that we have less information.Hence, the p values we present in this context may be too small and should be taken only as rough indicators of which differences between models are meaningful.
3. Amplitude of temperature change in simulations and local and continental-scale reconstructions

a. Intermodel comparison
The selected simulations from PMIP3 plus EC-Earth show considerable differences in both the amplitude of the MCA-LIA temperature change and its spatial patterns (Figs. 3, 4).Some models show a much larger amplitude of change than others even though similar forcing datasets are used.This is largely explained by differences in the sensitivity to forcings due to differing feedbacks and in ocean heat uptake efficiency (Atwood et al. 2016;Fernández-Donado 2016), with the structural intermodel differences larger than the uncertainties in external forcings (Lehner et al. 2015), and due to the implementation of the forcings (Zanchettin et al. 2016).
When considering either the ensemble mean or median, larger areas show a significant change (p , 0.05, twotailed Student's t test or Wilcoxon rank test) in annualmean temperature than for JJA or DJF temperatures.A partly similar assessment of MCA-LIA temperature change in the PMIP3 simulations is found in Atwood et al. (2016) though they used 950-1200 CE for the MCA and 1600-1850 CE for the LIA, but the choice of periods has a marginal effect.Our results, and those of Atwood et al. (2016), are consistent with similar analyses of pre-PMIP3 models showing a similar large spread (Fernández-Donado et al. 2013).

b. Model comparison using local reconstructions
We observe the same general latitudinal pattern in the relative amplitude of MCA-LIA temperature change in the local reconstructions as in the simulations (Table 5; Fig. 5): largest in the NH high latitudes (608-908N) and smallest in the low latitudes (308N-308S).However, the local reconstructions in general show a larger amplitude of MCA-LIA temperature change than the simulations, particularly in the NH (Table 5; Fig. 5).The MCA-LIA difference in the local reconstructions is rather similar across seasons with annual-mean, JJAand DJF-mean (median) values of 0.388C (0.258C), 0.418C (0.388C), and 0.388C (0.308C), respectively, compared to model means (medians) from corresponding grid cells of 0.198C (0.188C), 0.218C (0.238C), and 0.278C (0.178C), respectively.
Considering local reconstructions of annual-mean temperature, the MCA-LIA difference is consistently larger in the eastern rim of the Pacific, the North Atlantic and Europe (Figs.3, 4) compared to in the simulations.A larger annual temperature change in the northern North Atlantic is captured both in most simulations and in the local reconstructions (although the amplitude is larger in absolute terms in the latter).The reconstructed JJA temperature change for locations in Eurasia is consistent with the simulations though generally larger, whereas reconstructions for North America show a less consistent and, in general, smaller change than for Eurasia.Local reconstructions representing DJF temperature are too few, and show an inconsistent signal (likely due to proxy noise), to meaningfully assess the simulation-reconstruction agreement.
For certain seasons and regions, there are a sufficient number of local reconstructions to calculate meaningful continental-scale means and medians of MCA-LIA temperature difference and compare these to the mean and median model values from corresponding grid cells (Table 6).All such continental-scale averages should be interpreted cautiously-they are based on relatively few reconstructions that among themselves show considerable differences.Though the means and medians for the reconstructions, except for Antarctica and South America, exceed the amplitude of change in most models, they are not inconsistent with the two models showing the largest amplitude: CCSM4 and FGOALS.
Could the larger reconstructed Northern Hemisphere MCA-LIA temperature difference, compared to the simulated one, simply be a result of internal variability?It is impossible to assess, in detail, the signature of internal variability in the reconstructed MCA-LIA temperature difference, but for simulations the situation is more tractable.Ideally, we would use replicates of simulations with the same forcings and the same model to assess internal variability.Failing that, unforced control simulations may be informative, if we are willing to believe that the internal variability of a forced simulation can be reasonably approximated by the variability of an unforced control simulation.The question then becomes whether the observed difference between reconstructed MCA-LIA differences and simulated ones falls within the range of the MCA-LIA differences in a control simulation.We have used the 3000-yr unforced MPI control simulation (Jungclaus et al. 2010;Zhu et al. 2016) to investigate whether this can happen.Within the 3000 years, we can fit 2101 periods of 900 consecutive years.For each one of these, we calculated the temperature difference between the means of the first 300 years (representing MCA) and the last 400 years (representing LIA) and separated by 200 years.
The large majority of all differences between 300-and 400-yr temperature averages from the control simulation vary by less than 0.18C for global annual-mean temperature (Fig. 6a), thus explaining only a fraction of the proxy reconstruction-simulation difference at the global-average annual-mean scale.Over the PAGES2k Consortium (2013) regions of Europe and North America, the outer tail of the amplitude of the simulated internal variability (Figs.6b-g) reaches the size of the difference between the reconstruction and model mean for the same two regions (Tables 6, 7).The Arctic also shows a large simulated internal variability, but its magnitude is less than half the difference between the reconstruction and model mean.For the regions where the forced model simulations do not underestimate the  reconstructed difference-Antarctica, South America and, to a lesser extent, Asia-the simulated internal variability is also relatively small.This implies that internal variability plays a role in those regions were the forced simulations do underestimate the reconstructed MCA-LIA temperature difference.Thus, we cannot exclude the possibility that the different temporal realizations of internal variability between the real world (as evidenced by the reconstructions) and in the model simulations could account for a substantial fraction of the simulation-reconstruction mismatches.

c. Polar amplification in simulations and the local reconstructions
Polar amplification (or Arctic amplification in the NH) is a commonly used term to denote a larger temperature change in the polar regions compared to other parts of the world.It is often quantified as the ratio FIG. 6.The distribution of temperature differences (8C) within the 3000-yr-long unforced MPI control simulation (Jungclaus et al. 2010;Zhu et al. 2016).These are all available differences between means taken over 300-and 400-yr periods, separated by 200 years.
(amplification factor) between the temperature change in polar to nonpolar regions (Hind et al. 2016).If we only consider sites with local reconstructions reflecting annual-mean temperature, and by using their median values, the amplification factor associated with the temperature changes between the latitude bands of 608-908N and 08-308N is 3.52.Replacing the medians for local reconstructions by grid cells corresponding from the pool of all models in the calculation, we obtain the amplification factor 2.36.For the SH considering the analogous latitudinal bands (608-908S and 08-308S), the amplification factors using local reconstruction and model medians are 1.11 and 2.19, respectively.Thus, both the local reconstructions and model median agree there is a somewhat smaller polar amplification in the SH than in the NH.The lack of JJA reconstructions outside the NH extratropics, and the limited number of DJF reconstructions, unfortunately precludes a calculation of polar/Arctic amplification for these seasons.
The amplification factor is defined as a ratio, so one should not expect sensible results when its denominator is close to zero, that is, for simulations or reconstructions where no significant MCA-LIA difference has been observed at mid-and low latitudes (cf.Hind et al. 2016).In such cases the uncertainty in the amplification factor may become so large that the factor becomes entirely noninformative: attempts to construct a confidence interval for it yield an ''interval'' consisting of all real numbers (appendix C; also Fieller 1954;von Luxburg and Franz 2009).Arguably, the notion of an amplification factor is meaningless in this case.For many local reconstructions and corresponding simulated temperatures, the confidence intervals obtained are noninformative in this sense, in particular for seasonal temperatures.However, for sites with local reconstructions representing annual-mean temperature, the reconstructed amplification factor is significant ( p , 0.05) for 608-908N versus 08-308N, for 608-908N versus 908S-608N (i.e., Arctic versus non-Arctic), and 608-908N versus 608S-608N (i.e., Arctic versus nonpolar), that is, the interval lies entirely above 1.For the corresponding sites in the climate models, the simulated amplification is significant (p , 0.05) for GISS-R24 and GISS-R27 in the NH, and also for the model mean and median.For the SH significant (p , 0.05) amplification is seen for CCSM4 and FGOALS, and again for the multimodel mean and median.We note that models with low variability show little or no temperature change in tropical regions, which could explain the noninformativeness of the intervals obtained in these cases.

d. Model comparison using continental-scale reconstructions
Overall, the MCA-LIA temperature amplitude agreement, between proxy reconstructions and model simulations, is slightly better for the continental-scale reconstructions than for the local reconstructions (Tables 7-9).This is partly because of the somewhat smaller amplitude of MCA-LIA temperature difference of the continental-scale reconstructions compared to the local reconstructions.However, a better agreement is mainly evident when considering the mean of the local reconstructions but not when considering the median (Table 9).For the Arctic reconstruction, the amplitude of annual-mean temperature change is still twice that of the model mean and median (0.778C versus 0.398 and 0.388C).However, CCSM4 and FGOALS show an amplitude almost comparable to the Arctic reconstruction (0.658 and 0.688C, respectively).The reconstructions for Antarctica and Asia display a similar degree of MCA-LIA temperature difference as the model mean or median, although the various models differ substantially (SD of 0.268C) for Antarctica (Table 7).
The models in general show too-little temperature change in Europe (reconstruction 5 0.408C versus model mean/ median 5 0.228C), but CCSM4, GISS21, and GISS24 show simulated values comparable to the reconstructed ones.The models also, but to a lesser extent, underestimate the reconstructed temperature change in North America.In contrast to results for all other regions, all models used for the simulations actually overestimate the amplitude of MCA-LIA temperature change for South America (reconstruction 5 0.098C versus model mean/ median 5 0.188C).

Cost function analysis of the simulationreconstruction agreement
For all seasons, the weighted CF values for local simulation-reconstruction comparisons are smaller than the corresponding unweighted values (Fig. 7, Tables 10,  11).This means that locations with local reconstructions having a higher uncertainty (which will be downweighted in the weighted version) also exhibit a larger difference between model data and reconstructed temperatures than locations of local reconstructions with a lower uncertainty.Our results thus support the recommendation by Schmidt et al. (2014a) that it is important to have well-characterized estimates of proxy uncertainty in a simulation-reconstruction comparison.The simulated temperatures are in better agreement with local reconstructions if uncertainties in the latter are taken into account, after adjustment for unrealistically small error estimates, see appendix A. However, assigning weights to the different continentalscale reconstructions has only a marginal effect, only slightly affecting the ranking among models, whereas the CF values for the local reconstructions decrease substantially when weights are assigned (Fig 8,Table 12).This effect could partly be attributed to the weights being far more variable for local than for continental-scale reconstructions.In other words, the weighted CF obtained when using local reconstructions are less similar to the weights in the unweighted CF, which may be viewed as a weighted CF with all its weights equal to 1/n.The mean and median values for the pool of all simulations produce lower cost function values than the individual simulations.However, mean and median values are not fully comparable to single simulations as much of the internal variability of the individual simulations is averaged out, resulting in a mainly forced temperature component.
The CF value is a measure of how closely a given model reproduces the reconstructed temperatures, and an observed difference between the CF values of two models will suggest a difference in model performance.In section 2d we described a significance test for such observed differences, indicating which of them are meaningful.The most striking differences between models are found when considering all seasonal targets together ( p value 5 4 3 10 26 ).Restricting attention to seasonal targets, we obtain significant ( p , 0.05) differences for annual-mean temperature ( p value 5 0.0128) and for JJA ( p value 5 0.0012).For DJF temperature no significance was found ( p value 5 0.626).This means that at least some models differ significantly in their agreement with the local reconstructions.The comparison with continental-scale reconstructions yields no significant differences between models ( p value 5 0.171), which is perhaps not surprising since this test is based on only six observations (i.e., continents) for each model.Once a significant difference has been detected, we test which models differ significantly using the Friedman-Nemenyi test.Studying the local comparisons, we find significant differences when considering all seasonal targets together, with MPI, IPSL, and BCC differing significantly (by showing a worse agreement) from the mean of all simulations.The results are displayed graphically in Fig. 9.No significant (p , 0.05) differences are seen in the continental-scale comparisons.Out of the 11 simulations, HadCM3 and FGOALS show the lowest CF values, that is, best fit to the weighted local reconstructions when considering all seasonal targets together.The lowest agreement is found with MPI, IPSL, and BCC.It is worth noting that the ranking of simulations using local reconstructions versus continental-scale reconstructions results in limited agreement, although MPI, IPSL, and BCC are among the simulations showing the lowest agreement with both local and continentalscale reconstructions.

Discussion and conclusions
After comparing the MCA-LIA temperature difference in the PMIP3 simulations with local and continentalscale temperature reconstructions, we conclude the following: 1) Both local and the continental-scale reconstructions on average show a larger amplitude of temperature change than the simulations in the NH, particularly in the high latitudes.Results for the ocean-dominated SH are less conclusive, and even the opposite can be seen for South America (DJF temperatures), that is, the models show a larger amplitude of temperature change than the reconstructions there.The best agreement between local reconstructions and model simulations are obtained when comparing the model median (i.e., at the grid cells closest to the proxy data locations) with the medians of the local reconstructions (at a seasonal basis).The medians of the local reconstructions also produce the most similar results to when comparing the model median to the continental-scale reconstructions.
2) When comparing simulations with local reconstructions by means of the cost function analysis we detect differences between groups of models, but sites in close proximity are not necessarily independent and hence p values obtained may be too small.For continental-scale reconstructions no significant differences are seen.3) Weighting the local reconstructions by their estimated uncertainty results in better agreement with the simulations in the cost function analysis, whereas weighting the continental-scale reconstructions by their uncertainty has no obvious effect.4) It is difficult to evaluate the degree of agreement between reconstructed and simulated Arctic/polar amplification as only some models show a significant amplification while others do not.
The generally larger amplitude of MCA-LIA temperature difference in the NH shown by proxy-based reconstructions compared to simulations is not restricted to particular proxy types, regions or seasons.Nor is it an artifact of geographical bias as the comparison with simulations is conducted only for grid cells corresponding to site locations of the reconstructions (see Table 8).It is unlikely that the proxy-based reconstructions overestimate the amplitude of the MCA-LIA temperature difference.First, most state-of-the-art calibration methods result in an underestimation, rather than an overestimation, of the reconstructed low-frequency temperature amplitude, especially from proxies with a low signal-to-noise ratio (e.g., von Storch et al. 2004;Bürger et al. 2006;Christiansen 2011;Moberg and Brattström 2011;Tingley et al. 2012).
Multiproxy reconstructions tend to suffer more from this variance loss than single site reconstructions (Christiansen and Ljungqvist 2017), perhaps explaining the slightly lower amplitude of MCA-LIA temperature difference in the continental-scale reconstructions compared to in the local reconstructions (e.g., Luterbacher et al. 2016).Second, an even larger MCA-LIA amplitude of change is supported by independent estimates of low-frequency temperature change from terrestrial borehole profiles (Huang et al. 2008).However, this result is less conclusive than a large amplitude in temperature difference between the LIA and the present derived from borehole evidence (Pollack and Smerdon 2004).The temperature estimates from boreholes are based on physical modeling, instead of statistical calibration against time series of instrumental temperature measurements.

a. Uncertainties in the model simulations and in their forcing
In this section, we point out some relevant known issues in model simulations used in this study.To begin, IPSL simulates a ;28C colder global-mean temperature than other models (or the actual temperature).CCSM4 overestimates decadal near-global cooling after large volcanic eruptions (Landrum et al. 2013;Hartl-Meier et al. 2017) (Fig. 2).Such cold biases might have consequences for, in particular, sea ice changes and albedo feedbacks (Miller et al. 2012).In this context, any mean biases in the models that have influences on sea ice changes and albedo will likely influence the simulated equator-to-pole temperature gradient and Arctic/polar amplification.Hence, such biases may in principle affect the degree of agreement between simulations and reconstructions.Unfortunately, as pointed out above, we found it difficult to evaluate the simulation-reconstruction agreement of Arctic/polar amplification.Some of the differences between the models may also be related to the differences in the implemented forcings.However, our research design does not allow us to address the influences of these differences.The PMIP3 models have previously been found to show little agreement in their response to solar forcing as a result of the different parameterizations used (Le et al. 2016).A choice with a smaller amplitude of solar forcing is in line with present understanding of solar physics (e.g., Gray et al. 2010) as well as with results from detection and attribution studies (Schurer et al. 2013(Schurer et al. , 2014;;Stern and Kaufmann 2014).Still, studies of whether a larger or smaller amplitude of solar forcing variations agrees best with temperature reconstructions have so far been inconclusive (Fernández-Donado et al. 2013;Hind and Moberg 2013;Moberg et al. 2015;Luterbacher et al. 2016).
The PAGES 2k-PMIP3 Group (2015) found the multidecadal temperature response to changes in solar forcing to be larger in the PAGES 2k Consortium (2013) reconstructions for Europe, the Arctic and, to a lesser extent, Asia than suggested by simulations, and Luterbacher et al. (2016) came to similar conclusions for Europe.This may be due to that the multidecadal solar irradiance change is underestimated by about 40% in the PMIP3 model simulations (Lean 2018).Another possible reason can be the lack of stratospheric chemistry and physics in most of the PMIP3 models that are sensitive to changes in solar irradiance (Mitchell et al. 2015).Chiodo and Polvani (2016) found that the inclusion of an interactive stratospheric chemistry in a climate model can significantly reduce the surface warming resulting from an increase in incoming solar radiation.This behavior can be explained by the so-called stratospheric ozone feedback, which effectively means that stronger solar irradiance will induce more ozone concentration in the stratosphere due to photochemistry (Nowack et al. 2018).Consequently, increased ozone will absorb more solar radiation and result in less solar radiation arriving to the surface, thus reducing the surface temperature response to the increased solar radiation.Unfortunately, none of the PMIP3 models (or EC-Earth) that have been used to perform the ''last millennium'' transient simulations include stratospheric interactive chemistry; not even the two models that do have a high-top stratosphere: GISS (Schmidt et al. 2014b) and MPI (J.H. Jungclaus 2018, personal communication).This implies that the surface temperature MCA-LIA difference, simulated by the PMIP3 models, may be expected to be larger than if those models actually had included the interactive chemistry.The importance of atmospheric chemistry has been recognized and the Aerosol Chemistry Model Intercomparison Project (AerChemMIP) is endorsed by CMIP6 to quantify the climate effect from aerosols and chemically reactive gases (Collins et al. 2017).All the AerChemMIP models are including an interactive representation of tropospheric aerosols and atmospheric chemistry.Because of the very heavy computation (e.g., up to 5 times the expense for the EC-Earth model) of atmospheric chemistry processes, it is unlikely that any PMIP4 model will include this interactive module in their last millennium transient simulations.Besides the PMIP4 experiments, for modeling group with available computation resource, it would be of interest to perform sensitivity experiments with latest solar irradiance estimation (Lean 2018) and enabled interactive atmospheric   Moreover, Ineson et al. (2015) showed substantial regional climate effects through ultraviolet solar irradiance modulation of the Arctic and North Atlantic Oscillations, which may result in feedbacks not captured by the models.Actually, solar forcing may be more important on multicentennial than shorter time scales (e.g., Anchukaitis et al. 2017;Pei et al. 2017).However, it appears unlikely that deficiencies in solar forcing are the major reason for the discrepancy between simulations and reconstructions.A main argument for this is that although the amplitude of solar forcing change is uncertain the temporal evolution is well constrained; in particular, peak medieval temperatures occur more than a century before the medieval solar maximum (see section 1).
One possible explanation for why the simulations show a smaller amplitude of MCA-LIA temperature difference than most reconstructions may be residual deep-ocean warmth (from earlier in the Holocene; Rosenthal et al. 2013Rosenthal et al. , 2017) ) not considered in the initial conditions of simulations starting at 850 CE.The choice of initial conditions is important: the same climate model, with the same set of forcings, produces different climate evolutions depending on the chosen initial conditions (González-Rouco et al. 2009).If the oceans are too cold in the model setup, it would exert a significant cold bias during the first centuries of the last millennium simulations and thus result in a biased cold MCA.In line with the recommendations for the PMIP4 protocol (Jungclaus et al. 2017), it would be preferable for model-data comparison if the simulation starts in the year 0 or runs a transient Holocene simulation, which at the moment is rather challenging for most of the modeling group due to the computation cost.We note that a negative (cooling) drift in the simulations is often corrected, thus making it more difficult to address the possible effect of the oceans being warmer than in equilibrium to forcing.

b. Comparison with existing simulationreconstruction studies
PMIP3 last millennium simulations have, by some investigators, been found to generally show less variability at multidecadal and longer time scales than reconstructions (Zorita et al. 2010;Bothe et al. 2013a,b;Fernández-Donado et al. 2013;Schurer et al. 2013;Yan et al. 2015).Assessing this issue more systematically, the PAGES 2k-PMIP3 group (2015) modified this conclusion.Relatively good agreement between simulations and PAGES2k continental-scale reconstructions was found for the NH, but disagreement in the SH (see also Neukom et al. 2018).The NH agreement was best at 20-50-yr time scales; many simulations showed less lowfrequency but more high-frequency variability than the reconstructions.Conversely, simulations actually showed more low-frequency variability than reconstructions for the SH.
Model simulations typically also underestimate the amplitude of change compared to the paleoclimate evidence during earlier warm periods: for example, for the last interglacial period (;130-116 ka) (Otto-Bliesner et al. 2013), regionally during the mid-Holocene (Zhang et al. 2010;Braconnot et al. 2012;Harrison et al. 2014), and for the Neoglaciation global sea surface cooling trend (Lohmann et al. 2013).Since orbital forcing, mainly responsible for the climate conditions during these periods, is well constrained, the lack of some important feedback mechanisms in climate models is the likely cause for the reconstructionsimulation discrepancies.
Our finding that it is not possible to rank last millennium simulations by their agreement with continentalscale reconstructions-and that it is only possible to do so, to a limited degree, with local reconstructions-is well in line with previous findings.For example, Bothe et al. (2013a,b) concluded that the inconsistency between simulations and reconstructions at multidecadal and longer time scales renders it difficult to draw firm conclusions regarding model performance and the relative influence of different forcings.Likewise, Moberg (2013) suggested that the temperature changes during the last millennium have been too small to constrain climate sensitivity, and Fernández-Donado et al. (2013) and Tingley et al. (2015) demonstrated the uncertainties and limitations in trying to do so.

c. The importance of internal variability and feedback mechanisms
Not just the models used in PMIP3 studies (see above), but also those used in other studies have been found to typically underestimate the amplitude of unforced variability and/or have deficiencies in their feedback mechanisms on multidecadal and longer time scales (Cheung et al. 2017).Except for the cooling after large volcanic eruptions, internal variability dominates temperature variability in last millennium PMIP3 simulations (Lehner et al. 2015).Similarly, Le et al. (2016) found that internal variability plays a larger role, than changes in external forcing, in simulations of temperature variations during the past millennium over the North Atlantic, North Pacific, and Arctic as well as parts of Antarctica and surrounding oceans.
Compared to instrumental data (Laepple and Huybers 2014a;Davey et al. 2002;DelSole 2006) and proxy data (Laepple and Huybers 2014b;Dee et al. 2017), Lovejoy et al. (2013) found models to systematically simulate too-little centennial-scale temperature variability.However, the PAGES2k-PMIP3 Group (2015) found, using spectral analysis, no tendency for a systematic underestimation of low-frequency temperature variability in the simulations, compared to the continental-scale last millennium reconstructions.Rather, they found that results varied among the regions.A plausible explanation for the model-data discrepancies is that the models could have a tooweak internal variability on longer time scales (e.g., Valdes 2011).The models may be too diffusive-in line with model-data discrepancies increasing with time scale (Laepple and Huybers 2014a)-or have a too-weak ocean-atmosphere coupling (Fernández-Donado 2016).Low-frequency internal variability could partly be governed by nonlinear responses from solar and/or volcanic forcings, and include slow feedback mechanisms, deep ocean currents, and/or various biogeochemical processes that are not properly captured by the models.
The role of internal variability is largest on smaller spatial scales (Hawkins and Sutton 2009).Conclusions from the results of experiments using the unforced control simulation show that internal variability is smallest on a global scale and largest in the higher latitudes of the NH (section 3d).The previously reported inability to simulate multidecadal warm periods on regional scales, for instance in Europe during the eighteenth century (Gómez-Navarro et al. 2015), indicates a too-weak internal variability in the models, perhaps in combination with incorrectly specified forcing, model errors, and/or observational errors.
For the long-term cooling trend of the last millennium, orbital forcing may play a larger role than the models suggest.The orbital induced change in insolation (Berger and Loutre 1991) resulted in little change in external forcing globally or annually but regional changes in seasonal forcing, in particular reduced boreal summer insolation, might have caused significant feedbacks including changes in albedo and water vapor (Atwood et al. 2016).The inability of mid-Holocene simulations (Braconnot et al. 2012;Hargreaves et al. 2013;Lohmann et al. 2013;Harrison et al. 2014;Liu et al. 2014) to fully capture the warm mid-Holocene conditions suggests that some feedbacks from orbital forcing can be underestimated or unaccounted for.
We find the largest simulated MCA-LIA temperature difference over the high-latitude North Atlantic section in agreement with local reconstructions (Figs. 3,4).This is primarily a result of changes in sea ice coverage (Landrum et al. 2013), highlighting the importance of this feedback.The larger reconstructed MCA-LIA temperature difference farther south in the NH suggests some of the amplifying real-world mechanisms at work north of ;608N also influence the climate much farther south.We propose that this may be due to a stronger effect of seasonal orbital forcing, changes in ocean currents, sea ice, vegetation, and water vapor and their feedbacks or any combination thereof.Identifying these mechanisms should be a priority in future studies as they may also be informative with regard to amplification mechanisms under ongoing anthropogenic climate change.

d. Concluding remarks
We have demonstrated, within the PMIP3 last millennium simulations, the existence of both large intermodel spread in simulating the amplitude of MCA-LIA temperature change and that reconstructions based on proxy data from the NH systematically show a larger amplitude of this change than simulations, though not in the SH.The larger MCA-LIA temperature difference in the reconstructions from the NH compared to the simulations suggest either that the temperature change in the simulations arose mainly from internal variability, or that the models used for the transient simulations fail to represent some feedback mechanisms correctly or even lack some critical forcing completely.If it is indeed deficits in the models' feedback mechanisms that are responsible, it would have implications for the skill of future climate projections, possibly implying that the models have too-small climate sensitivity [as, e.g., concluded for Europe by Luterbacher et al. (2016)].From the assessment of an unforced control simulation we cannot exclude the possibility that the difference in simulated and reconstructed MCA-LIA temperature amplitude is the result of internal variability at the outer tails of what is possible.As the simulated and real-world (reconstructed) internal variability will never show perfect temporal agreement-even if the models would simulate the correct amplitude of internal variabilitythe models will never simulate the full range of reconstructed MCA-LIA temperature difference if internal variability played an important role for this temperature difference.
As models that more accurately simulate past climate may be expected to provide more reliable future projections, we tested the ability to rank model simulations according to their agreement with the reconstructed MCA-LIA temperature difference.Recall that we used the nonparametric Friedman-Nemenyi test to decide whether any significant differences exist between pairs of models.Also, recall that our calculated p values may be too small because we cannot be certain that neighboring locations are independent.Thus, our significance tests should be taken only as rough indicators of which differences between models are meaningful.Bearing this in mind, we found it, to some degree, possible to detect statistically significant (p , 0.05) differences in how well various model simulations agreed with local reconstructions when data for summer, winter and annual-mean temperatures are weighted together.On the other hand, no statistically significant differences could be found among the various simulations' agreement with continental-scale reconstructions.These results indicate that the differences between the temperature data generated by the different climate models are small in relation to the uncertainties and heterogeneities in the simulations as well as in the local and continental-scale reconstructions.However, if similar reconstruction-model comparisons would also be performed for periods with large amplitudes in temperature change-for example, the Last Glacial Maximum and the mid-Holocene-the results obtained may be more informative for evaluating model performances.
To the knowledge of the authors, it has never been extensively discussed before that too-cold initial ocean conditions in the climate models' settings may to a large extent be responsible for the smaller simulated than reconstructed MCA-LIA temperature difference.We argue that in order to test this hypothesis, and to accurately simulate the climate of the second millennium CE, it is necessary to also simulate the climate of the full first millennium CE as the ocean heat content may not be in equilibrium with the forcing at the start of simulations.This will now be done by some PMIP4 modeling groups (Jungclaus et al. 2017).
In conclusion, we propose three possibilities, or any combination thereof, to explain the smaller amplitude of temperature change in the model simulations compared to reconstructions: 1) too-cold initial ocean conditions in the climate models' settings as discussed above, 2) too-weak dynamics of internal variability in the models over longer time scales and/or simply the mismatch in timing of real-world (reconstructed) and simulated internal variability, and 3) some missing or too-weak feedback mechanisms in the simulated climate systems.While points 1 and 2 would have limited implications for future projections, point 3 would imply that the models might have an erroneous climate sensitivity over longer time scales.It is challenging to address these questions, but we expect two of the experimental designs outlined in Jungclaus et al. (2017) for future PMIP4 last millennium simulations to be particularly informative in this regard: 1) starting simulation in year 0 or, preferably, near the beginning of the Holocene (e.g., about 10 000 years ago) as it would allow testing the hypothesis about too-cold initial ocean conditions, and 2) running simulations with time series of both low and high solar forcing estimates to test uncertainties of this potentially important, but still poorly constrained, forcing.In both cases, a number of different ensembles from the same model, also run with other forcing alternatives, would be needed to robustly assess the influence of individual forcings.Finally, systematic comparisons with comprehensive sets of quality-assessed paleoclimate reconstructions should be made by using appropriate methods that allow the drawing of conclusions in statistical terms.
Acknowledgments.This study was partly funded by the Swedish Research Council (Grant 2012-5924, ''A statistical framework for comparing paleoclimate data and climate model simulations'').Q.Z. was also partly funded by the Swedish Research Council for the Swedish-French project ''Greenland in a warming Arctic'' (GIWA, Grant 2014-06578).F.C.L. was partly supported by the Royal Swedish Academy of Letters, History and Antiquities and the Bank of Sweden Tercentenary Foundation.A.S. was supported by the German Research Foundation (DFG, SE 2802/1-1).The EC-Earth model ''last millennium'' simulation was performed with resources provided by Cray XC30 HPC systems at the European Centre for Medium-Range Weather Forecasts (ECMWF), and model data analysis is performed on the Swedish National Infrastructure for Computing (SNIC) at the National Supercomputer Centre (NSC).Dr. Hanna S. Sundqvist made valuable contributions in the early stages of this project.We thank the three reviewers whose useful comments helped to improve this article.

Uncertainty Estimates for the Cost Function (CF) Calculations
In the calculation of CF it is reasonable to let the weights be inversely proportional to the variances based on the known or estimated uncertainties of the local and continental-scale temperature reconstructions (see ''Uncertainty estimates in local reconstructions'' in section 2a).We calculated averages of 300 years for the MCA and of 400 years for the LIA of reconstructed temperatures except for North America where we consider averages of only 10 years for the MCA and 13 years for the LIA.
Having converted the published uncertainties in temperatures for MCA and LIA into standard deviations SD MCA and SD LIA (see Table 3), we can estimate the variance of the (MCA-LIA) difference.We denote the number of observations in each average by n; thus, for North America n 5 10/13, and for all other regional reconstructions n 5 300/400.If the observations are independent, then the variance of the difference can be estimated simply as It can be shown (e.g., Wei 1989) that the variance adjusted for autocorrelation is Var( where r( f ) is the estimated autocorrelation function (ACF) at lag f for the time series of reconstructed temperatures.The first equality holds because MCA and LIA are sufficiently far apart to be considered independent, even in the presence of autocorrelation.
The standard deviations SD MCA and SD LIA were estimated above using published uncertainty estimates.
For the autocorrelation no such information exists, and we are forced to try to extract it from the data itself.A linear trend in the data will automatically produce large autocorrelations at all lags and will also violate the stationarity assumption implicit in Eq. (A2).Hence, it makes sense to remove the trend and calculate the ACF on the residuals.Another reason for doing this is that what we really want is the autocorrelation in the noise, not in the temperature signal.If we introduce the notation for the correction factor, then the variance corrected for autocorrelation is V 3 A. A1 The weight assigned to each continental-scale temperature reconstruction is proportional to (V 3 A) 21 .Finally, the weights are standardized to sum to 1.The results are summarized in Table 2.For the continental-scale reconstructions, this method of assigning weights is unproblematic, but applying it mechanically to data using local proxies produces unreasonable results: in some cases, a single reconstruction receives well over half the weight, which is clearly not desirable.This is partly due to some records having an unrealistically low uncertainty estimate, which inflates the weight.Hence, we impose a cutoff on the standard deviations of 0.1.In addition, we set the upper 10% of the weights equal to the upper 10% quantile, thus guarding against any one reconstruction dominating.After these adjustments, we again standardize the weights to sum to 1.

APPENDIX B
The EC-Earth Climate Model ''Last Millennium'' (850-1850 CE) Simulation EC-Earth is developed by a consortium of European research institutions (Hazeleger et al. 2010(Hazeleger et al. , 2012)).The atmospheric component of EC-Earth is based on Integrated Forecasting System (IFS), which is developed at the European Centre for Medium-Range Weather Forecasts (ECMWF), and the ocean component is based on Nucleus for European Modeling of the Ocean (NEMO) (Madec 2008), including a sea ice model LIM3 (Bouillon et al. 2009).The goal of EC-Earth is to build a fully coupled atmosphere-ocean-land-biosphere model usable for seasonal to decadal climate prediction and climate projections (Hazeleger et al. 2010).A number of model integrations such as ''historic '' simulations (1850'' simulations ( -2005 CE) CE) and scenarios of future projections have been conducted using version 2.3 (v2.3), to contribute to the CMIP5.However, EC-Earth v2.3 has not previously run any PMIP3 simulations except one 50-yr mid-Holocene simulation.We have implemented several components that are necessary for performing paleoclimate simulations such as orbital forcing and land-ice surface processes in EC-Earth v3.1, and performed several PMIP3 simulations with relatively higher resolution than the other PMIP3 models: T159L62 for the atmosphere model (IFS), ORCA1L46, and 46 vertical A1 For reasons of numerical stability, we do not, in fact, let the sum in the correction for autocorrelation run all the way up to n 2 1, but only to the smallest integer $10log 10 (n), if this number is less than n 2 1. levels for the ocean model (NEMO) (Table 4).This might be one reason that mid-Holocene simulations show good agreement with proxy-based reconstructions (Muschitiello et al. 2015;Pausata et al. 2016).The last millennium experiment with EC-Earth v3.1 has applied the recommended forcings following PMIP3 protocol (https:// wiki.lsce.ipsl.fr/pmip3/doku.php/pmip3:design:lm:final;Table 4), and the initial condition starts from an equilibrium state at 850 CE after 300 years of spinup.The seasonal and latitudinal distribution of the orbital modulation of insolation is calculated following Berger (1978).Variations in total solar irradiance follow Vieira and Solanki (2010) from 850 to 1609 CE, and from 1610 CE Wang et al. (2005).The volcanic aerosol forcing follows Crowley and Unterman (2013).Changes in greenhouse gas forcing derive from Hansen andSato (2004) andMacFarling Meure et al. (2006).Land use changes are taken from Pongratz et al. (2008).The EC-Earth last millennium simulation, as used in this article, is available for download from http://model.natgeo.su.se/past1000/tas_EC-Earth3.1_mon.nc.

APPENDIX C
Confidence Interval Calculation for the Arctic Amplification Ratios Our confidence intervals for the Arctic amplification ratios (Hind et al. 2016) have been constructed using Fieller's method (Fieller 1954;von Luxburg and Franz 2009).We consider two independent normally distributed samples, x 1 , x 2 , . . ., x n and y 1 , y 2 , . . ., y m with arithmetic means x and y; think of them as the mid-and low-latitude temperature differences and polar temperature differences respectively.The Arctic amplification is then estimated as the ratio y/x.In the case of two independent samples, a (1 2 a)-confidence interval takes the following form when the denominator x is significantly 6 ¼ 0: xy 6 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi x 2 y 2 2 x T y T p x T , (C1) where x T 5 x 2 -z 2 a/2 s 2 x and y T 5 y 2 -z 2 a/2 s 2 y , and z a/2 is the upper a/2 normal quantile.Hence, x is significantly 6 ¼ 0 precisely when x T .0. In this case, x 2 y 2 .x T y T , so the expression under the root sign is positive and we obtain an ordinary, bounded confidence interval.On the other hand, if x is not significantly 6 ¼ 0, then the Fieller confidence interval is either the entire real line, or an ''inverted'' confidence interval which contains everything except the numbers between the two limits given by the formula for I Fieller above.
. To cover the full study period of 950-1850 CE, we exclude the Australasian reconstruction (1001-2001 CE) and the tree-ring-based reconstruction (1204-1974 CE) for North America.Representing the Arctic, we use the updated McKay and Kaufman (2014) version, for Europe the updated Luterbacher et al. (2016) version, and for Antarctica the updated Stenni et al. (2017) version.Seasonal targets are annual-mean temperature for the Arctic, North America, and Antarctica, JJA for Europe and Asia, and DJF for South America.The number of proxies contributing to each reconstruction and their temporal changes in coverage as well as the proxies' signal strength among the reconstructions are not homogenous [see Fig. 2 in PAGES 2k Consortium (2013)].

FIG. 2 .
FIG.2.Surface air temperature evolution (8C) over 850-1850 CE in the PMIP3 last millennium simulations for the (top) global mean, (middle) NH, and (bottom) SH.Note that the absolute temperature differs considerably between the individual model simulations and that the large negative temperature ''spikes'' are forced by large volcanic eruptions.

FIG. 3
FIG. 3. (left) Ensemble mean and (right) ensemble median near-surface air temperature difference (8C) between 950-1250 and 1450-1850 CE for (top) annual-mean, (middle) JJA, and (bottom) DJF temperatures, with local reconstructions for the respective season superimposed.White areas show where insignificant ( p .0.05) temperature difference between the two periods in the ensemble mean and medians occur.Note the polar amplification in both the ensemble mean and median for all seasons, which is stronger in winter than in summer.

FIG. 5 .
FIG. 5. Temperature difference (8C) between 950-1250 and 1450-1850 CE for annual-mean temperature in local reconstructions (proxies) and model-simulated data for corresponding grid cells after latitude bands are divided into six 308 segments.The filled circles represent the mean, and the length of the bars represents the quartile range.

FIG. 7 .
FIG. 7. Cost function values of local reconstructions and model simulations for (a) unweighted and (b) weighted data for summer (green), winter (light blue), and annualmean temperatures (orange) and for all seasons together (dark blue).The cost function values decrease noticeably when the weights are assigned.

8.
Cost function values for continental-scale reconstructions and model simulations for (a) unweighted and (b) weighted data.

FIG. 9 .
FIG. 9. Friedman-Nemenyi test for simulated temperatures with all seasons weighted together compared to local reconstructions.The models are ordered by rank sum, and the insignificant ( p .0.05) values are connected by a common green horizontal line.

TABLE 3 .
Connecting uncertainties u and standard deviations s of the continental-scale temperature reconstructions.

TABLE 5 .
Global, NH, and SH temperature differences (8C)between 950-1250 and 1450-1850 CE in local reconstructions and in climate model simulations at locations of local reconstructions by season.Note that reconstruction means and medians are only calculated if at least three local reconstructions for a particular season were available.

TABLE 6 .
Continental-scale means and medians of the MCA-LIA temperature difference (8C) in local reconstructions and model simulations from corresponding grid cells for regions with at least four local reconstructions representing the same season.
a Greenland not included here in North America.

TABLE 7 .
The temperature difference (8C) between 950-1250 and 1450-1850 CE in the continental-scale reconstructions and corresponding simulated temperature change over the same PAGES2k geographical region and season.See Table2for the latitude and longitude coordinates for each region.

TABLE 9 .
The temperature difference (8C) between 950-1250 and 1450-1850 CE for local and continental-scale reconstructions and climate model simulations.The table summarizes data from all localities and for all climate models used in the study.

TABLE 8 .
The simulated temperature difference (8C) between 950-1250 and 1450-1850 CE for continental-scale reconstruction locations and the locations of local reconstructions.

TABLE 10 .
Unweighted cost functions based on the local reconstructions.

TABLE 12 .
Cost functions based on the continental-scale temperature reconstructions.