## Abstract

A detailed analysis is presented of a recently published Antarctic temperature reconstruction that combines satellite and ground information using a regularized expectation–maximization algorithm. Though the general reconstruction concept has merit, it is susceptible to spurious results for both temperature trends and patterns. The deficiencies include the following: (i) improper calibration of satellite data; (ii) improper determination of spatial structure during infilling; and (iii) suboptimal determination of regularization parameters, particularly with respect to satellite principal component retention. This study proposes two methods to resolve these issues. One utilizes temporal relationships between the satellite and ground data; the other combines ground data with only the spatial component of the satellite data. Both improved methods yield similar results that disagree with the previous method in several aspects. Rather than finding warming concentrated in West Antarctica, the authors find warming over the period of 1957–2006 to be concentrated in the peninsula (≈0.35°C decade^{−1}). This study also shows average trends for the continent, East Antarctica, and West Antarctica that are half or less than that found using the unimproved method. Notably, though the authors find warming in West Antarctica to be smaller in magnitude and find that statistically significant warming extends at least as far as Marie Byrd Land. This study also finds differences in the seasonal patterns of temperature change, with winter and fall showing the largest differences and spring and summer showing negligible differences outside of the peninsula.

## 1. Introduction

In a 2009 study published in *Nature*, Steig et al. (2009 hereafter referred to as S09) present a novel reconstruction technique to extend Advanced Very High Resolution Radiometer (AVHRR) infrared satellite observations back to 1957 using manned ground station temperature information as predictors. Previous Antarctic gridded reconstructions (Chapman and Walsh 2007; Monaghan et al. 2008) relied on interpolation or kriging methods to estimate temperatures at noninstrumented points. In Chapman and Walsh (2007), interpolation was guided by correlation length scales calculated using the International Comprehensive Ocean–Atmosphere Dataset (ICOADS) for ocean and coastal areas and station-to-station pairs for the Antarctic interior. In Monaghan et al. (2008), the 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40) data were utilized to provide the kriging field. In contrast, S09 perform multiple linear regression of satellite temporal data against ground data and then *directly* recover gridded estimates using the satellite spatial structure—obviating the need for interpolation.

S09 present three separate reconstructions. The primary reconstruction is the focus of this paper and will be referred to as the *T*_{IR} or S09 reconstruction.^{1} They also present a reconstruction that does not combine AVHRR data with ground data, which will be referred to as the Automatic Weather Station (AWS) reconstruction. This reconstruction is dealt with implicitly, as our proposed modifications likewise separate the estimation of missing AVHRR principal component (PC) and ground station information. The third reconstruction, utilizing standard principal component analysis, appears in S09’s supplementary information and is not accompanied by sufficient information for a quantitative comparison. However, as this version also utilized the same number of retained AVHRR PCs as the *T*_{IR} reconstruction, our criticisms apply to the major parameter choice for this reconstruction as well.

The primary S09 method involves the following major steps: (i) cloud masking and regridding of the raw AVHRR data; (ii) decomposition of the cloud-masked AVHRR data into PCs and spatial eigenvectors; (iii) augmentation of a matrix of station data starting in 1957 with the first three AVHRR PCs; (iv) estimation of missing data in the augmented matrix with an infilling algorithm; (v) extraction of the completely infilled PCs; and (vi) estimation of temperatures at all grid points by reconstituting the PCs with their corresponding spatial eigenvectors (S09; E. J. Steig 2010, personal communication). The last step provides a time series of maps containing the temperature contribution from each PC–spatial eigenvector pair, which are then summed together to provide the gridded temperature estimates for all months.

Our approach to this topic begins with demonstrating replication of the S09 results. We discuss the S09 choice of infilling algorithm and inability of the algorithm to provide the necessary calibration function in section 3. In section 4, we show that the method used by S09 results in a different spatial structure being used for infilling than is present in the satellite data, which distorts the spatial distribution and magnitudes of temperature trends. Section 5 closes out the first half of the article by arguing that the choice of three principal components is suspect.

In the second-half of this article, we present alternate reconstructions that address our concerns with S09. We outline the corrections to the methodology in section 6. In section 7, we discuss the primary features of our result, similarities and differences as compared to S09, and cross-validation statistics. Recommendations and conclusions are contained in section 8. Additional details not covered in the main text are provided in the supporting information (http://www.climateaudit.org/data/odonnell/4 20101208 SI Revised.pdf).

## 2. Replication of S09

We restrict our replication of the S09 process to steps that follow cloud masking of the AVHRR data. We do not attempt to replicate the cloud-masking operations by S09, as these are similar to previously published studies (e.g., Comiso 2000), and instead utilize the archived set provided by Steig on his university Web site. For ground data, we utilized S09’s archived Reference Antarctic Data for Environmental Research (READER) dataset (Turner et al. 2003), also published on the same Web site.

For the period of 1957–2006, our replication yields linear trends in °C decade^{−1} of 0.12 for all grid cells, 0.10 for East Antarctica, 0.13 for the peninsula, and 0.20 for West Antarctica. These values are all within 0.01 of those obtained using the published *T*_{IR} reconstruction, with identical spatial and seasonal patterns of temperature change. The reader should note that to allow broader comparisons, the values listed above were computed using traditional geographic boundaries rather than the ad hoc definitions used by S09 and therefore differ slightly from the trends reported in that study. The minor changes to geographic definitions do not impact our conclusions. The shaded regions in Fig. 1 depict the definitions used for this study.

## 3. Calibration via infilling?

### a. Sources of systematic error in the AVHRR data

The AVHRR instrument is carried aboard the National Oceanic and Atmospheric Administration (NOAA) series of satellites. It is a multichannel sensor designed to provide imaging at both visible (channels 1 and 2) and infrared (channels 3–5) wavelengths as described by Fowler et al. (2009) at the National Snow and Ice Data Center. The AVHRR data used by S09 is cloud masked in similar fashion to Comiso (2000), regridded to 50 km by 50 km resolution and presented as monthly means.

The AVHRR data is not a continuous set of measurements. Like other satellite imaging products, measurements from different satellites must be combined to produce a continuous record, which admits the possibility of splicing errors. Sensor degradation, calibration errors, time-of-observation drifts, atmospheric conditions, and cloud opacity at infrared wavelengths (Comiso 2000; Fowler et al. 2009; Gleason et al. 2002; Jiménez-Muñoz and Sobrino 2006; Jin and Treadon 2003; Sobrino et al. 2008; Trishchenko and Li 2001; Trishchenko 2002; Trishchenko et al. 2002) all contribute nonnegligible measurement error, some of which may change from satellite to satellite and some of which show latitudinal variations. Additionally, the AVHRR instrument measures skin temperature rather than near-surface air temperature. These factors highlight the need to calibrate the AVHRR data to ground data, as the measurements cannot a priori be expected to be interchangeable.^{2} The mathematical description provided by S09 establishes the ground data as the explanatory variables and indicates that the infilling algorithm provides the calibration.

### b. Description of the total least squares algorithm

The infilling method utilized by S09 is an implementation of truncated total least squares (TTLS) in a regularized expectation–maximization algorithm (RegEM) developed by Schneider (2001). The TTLS algorithm provides a solution to the linear model , where both and are assumed to contain errors. S09 define an augmented matrix , where is *said* to represent the ground station data (predictors or explanatory variables), and is said to represent the AVHRR principal components to be estimated (predictands or response variables).^{3} Regularization is accomplished by performing a singular value decomposition (SVD) of the correlation matrix with *k* retained eigenvectors (Mann et al. 2007).^{4} From Schneider (2001), this yields the spatial eigenvectors and squared eigenvalues of the *n* × *p* matrix of observations (where *n* consists of the time steps, *p* consists of the variables, and is a vector of unbiased standard deviation estimators), since

In this notation, represents the temporal eigenvectors, **Λ** the eigenvalues, and the spatial eigenvectors. We may then partition into subspaces, where rows 1 and 2 indicate available observations and missing observations, and columns 1 and 2 indicate eigenvectors (1, … , *k*) and (*k* + 1, … , *n*), respectively:

The TTLS solution yielding the set of statistical weights for prediction of from is given by Fierro et al. (1997), where symbol † indicates the generalized inverse:

We can now estimate the missing values in for any moment in time *j*:

Alternatively, rather than limit the estimation to subspace , we can replace in Eq. (4) with , yielding a full set of statistical weights to provide estimates for both missing and actual values:

As all eigenvectors greater than *k* are discarded, subspace provides an estimate of the covariance matrix of the scaled predictand residuals (Schneider 2001):

RegEM defines a new correlation matrix using the original data , the newly estimated data , and . A new solution is then computed. The algorithm iterates until the rms change in reaches a predefined stagnation tolerance.

### c. Theoretical and practical difficulties with the S09 approach

Fundamentally, calibration places the response variables in terms of the explanatory variables to allow estimation of a response from a given observation of the explanatory variable. If this is not performed, then the relationship between the variables is undefined and subsequent predictions are not valid. In the special case that the response and explanatory variables are equivalent quantities and are interchangeable, no formal calibration is necessary. Formal calibration is required if this does not hold. As already discussed, the ground and AVHRR data are not interchangeable, and the latter consideration applies.

A critical aspect of the S09 methodology is that *both* the satellite PCs (which exist only from 1982–2006) and the station data matrix are temporally *incomplete*. During 1982–2006, the PCs appear in (not ) and are directly used to predict missing ground station values. Missing ground station information, then, is estimated by linear combinations of the ground stations *and* the PCs. Since the correlation matrix is computed using both actual and estimated values, the response variables (the PCs) are partially calibrated to the ground station data and partially calibrated to linear combinations of themselves.

Second, the process of regularization in RegEM destroys the orthogonality of the PCs. From Eq. (4), the missing data will be estimated using correlated—not orthogonal—PCs. This means that the response variables are also partially calibrated to each other. In the case of PC 2 and PC 3, the correlation following regularization for the entire 1957–2006 period is a factor of 2.5 higher than that following the initial regularization (−0.2501 versus −0.1001) after 30 iterations.

Next, RegEM as used by S09 does not extract modeled (calibrated) data. Rather, it extracts the *original data* with estimates in place of the missing values. Because the algorithm implicitly assumes that all variables are already equivalent and interchangeable, the original response variables (the AVHRR PCs) are never expressed in terms of the explanatory variables (the ground data). Even assuming the previous considerations to be of negligible importance, this means, at best, the PCs used by S09 are properly calibrated only in the 1957–81 period and have 25 yr of uncalibrated data spliced on the end.

The final concern is that using a total least squares algorithm (which minimizes the errors in both the available and missing values) presents a theoretical difficulty. The error in a PC (which represents the temporal component of a temperature *field*) does not mean the same thing to the reconstruction as an error in an observation (which represents temperature at a *point*). Since systemic errors are more likely to affect the PCs than the ground data (section 3a), the filtering effect of the truncation parameter *k* will be less effective for the PCs. Random errors in the AVHRR data have already been relegated to the truncated modes, while the systemic errors will be interpreted as signal. Moreover, the assumption that the relative variance of errors in the dataset are homogenous (Schneider 2001) is violated by the AVHRR PCs because the relative variance of errors increases (i.e., signal to noise ratio decreases) as one proceeds from the low- to high-order modes. This translates into additional estimation error for the ground stations when the PCs appear in .

## 4. Spatial structure considerations

Another concern with S09 is the difference in spatial structure used to infill the PCs and the corresponding AVHRR spatial eigenvectors. The assumption that the spatial structure is similar is implicit in the S09 method, which recovers the gridded temperature estimates without altering the AVHRR spatial eigenvectors. From Eqs. (4) and (6), the estimated portion of the PCs (1957–81) is comprised of linear combinations of station data and *themselves*, with coefficients given by matrix . Unless the coefficients in have a spatial distribution identical to the corresponding AVHRR spatial eigenvector weights, recovering gridded estimates using the unaltered AVHRR spatial eigenvectors will result in a geographical translation of information and a change in the magnitude of the estimates.

We can investigate whether the AVHRR eigenvector weights are directly compatible with the TTLS regression coefficients. Since Eq. (6) yields a fully populated matrix of estimates of rank *k*, we may find the vector of coefficients **v*** _{i}* that describe the contribution of all variables to any

*i*

^{th}variable in . If the

*i*

^{th}variable is an AVHRR PC,

**v**

*then yields what the AVHRR spatial eigenvector weights*

_{i}*should be*to best reproduce the ground data used to predict the PC.

From Eq. (1), the SVD will yield the spatial eigenvectors and eigenvalues (imputation EOFs) and the temporal eigenvectors . The *j*^{th} column in contains the coefficients that yield the contribution of the *j*^{th} column in to each series in . Since every series in is comprised of linear combinations of using coefficients given by the corresponding *i*^{th} row in , we can calculate the contribution of series *i* to the rest of the data in by taking the vector sum of the imputation EOFs with series *i* removed , scaled by the appropriate element *v _{i,j}* in , and a normalization constant

*c*:

Figure 2 (with metadata available in the supporting information) compares **v*** _{i}* to the AVHRR spatial eigenvector weights for the three PCs retained by S09. To use the AVHRR spatial weights (the stars in the figure) without geographically relocating temperature information, the TTLS weights (the circles) must be comparable. However, there are some obvious and important differences.

While PC 1 and PC 3 demonstrate similar spatial makeups for the peninsula and East Antarctica, the TTLS algorithm predicts PC 2 using five East Antarctic stations in the opposite orientation. In 9 out of 17 cases for PC 2, the coefficients assigned to East Antarctica differ from the AVHRR weights by a factor of 2 or more. In West Antarctica, the differences are more significant and the overall match between the TTLS weights and the AVHRR eigenvectors is poor. For PCs 1 and 2, TTLS uses two out of five stations in the opposite orientation, and four of the stations (80%) differ from the AVHRR eigenvector weight by a factor of 1.5 or more. For PC 3, all of the West Antarctic stations differ by a factor of at least three, and three of the four most heavily weighted stations in TTLS are used in the opposite orientation. Last, PC 1—which primarily determines the average temperature trend for the continent—displays a noticeably higher set of weights for the peninsula stations and lower set of weights for East Antarctica between TTLS and the AVHRR eigenvectors. This necessarily results in a redistribution of the peninsula trend across the entire continent.

## 5. Significant principal components

A critical aspect of the reconstruction method employed by S09—which is essentially principal components regression—is the choice of the truncation parameter *k*_{sat} for the satellite data and *k*_{RegEM} for the infilling of the augmented ground station–PC matrix. In their study, S09 state that they use the procedure described in Mann et al. (2007) to determine their truncation parameters. This procedure involves inspection of the log eigenvalue spectrum and calculation of eigenvalue sampling error. The modes that correspond to separable eigenvalues (i.e., the error bars on the eigenvalues do not fall into the continuum of overlapping error estimates) are selected for use. The remainder is discarded.

This calculation is identical to that performed by North et al. (1982), where it was shown that overlapping eigenvalue sampling error estimates indicate mixing (degeneracy) of the underlying modes. Importantly, North et al. (1982) note that this procedure provides *no guidance* for determining the truncation parameter. The effect of splitting degenerate modes during truncation depends on the analysis being performed. If one is trying to find a smaller basis for representing a dataset, the effect is limited to the amount of variance the multiplet explains in the original data. If the explained variance of the multiplet is small, the error due to splitting it is correspondingly small. In the case of S09, the latter concern applies. The procedure used by S09, which selects the truncation parameter based solely on sampling error, is incomplete because it omits any investigation of the error caused by early truncation.

Furthermore, the procedure described in Mann et al. (2007) was evaluated during the course of pseudoproxy experiments under different conditions than exist in Antarctica. It was tested using separate noise realizations for each pseudoproxy, which does not admit the possibility of nonlocal correlation between predictor and predictand residuals. As discussed earlier, this assumption is violated in Antarctica. Mann et al. (2007) also note that the procedure is too conservative at high signal-to-noise ratios (SNRs). The average coefficient of determination (*r*^{2}) between the AVHRR and ground data is approximately 0.45, which roughly corresponds to an SNR of 1.0. This was the highest SNR tested by Mann et al. (2007), with the exception of perfect pseudoproxies. Finally, they note that the procedure is heuristic, describing it as a “conservative tool that works well in practice,” and suggest cross validation as a possible alternative and more objective method. We will show via extensive cross-validation testing that this procedure led S09 to select suboptimal truncation parameters.

## 6. Corrections to methodology

### a. Spatial and temporal assumptions

Performing a reconstruction of this type necessarily requires assumptions that, if not met, potentially invalidate the results. A stated assumption of S09 is that the AVHRR data provide a reasonably accurate spatial representation of temperatures. However, by retaining the 1982–2006 portion of the AVHRR PCs unchanged, S09 implicitly make the additional assumption that the AVHRR data provides a reasonably accurate *temporal* representation of temperatures. We find that for reasons discussed in section 3, the latter assumption is not likely to hold. To correct this, our approach shares the spatial assumption of S09 and assumes that the *ground data* provides more accurate temporal information.

This assumption may be mathematically expressed in one of two ways. If an infilling algorithm is used, one may make use of Eq. (6) to extract regression estimates for the AVHRR PCs at all times rather than only times where the original PCs are incomplete. (As noted in section 3c, this modification is *required* for a valid calibration.) The estimates may then be reconstituted with the corresponding spatial eigenvectors to obtain the reconstruction. In this way, the reconstruction contains no direct AVHRR temporal information. An alternative means of expressing the revised temporal assumption is to perform the regression by using *only* spatial information and exclude the PCs altogether (section 6d).

### b. Calibration

As discussed, RegEM is not capable of providing a calibration function if the explanatory and response variables are both incomplete or if multiple response variables are simultaneously included. Additionally, the default output from RegEM does not yield the calibrated PCs. We address these issues first by ensuring the explanatory variables are temporally complete (which we denote Mod 1) and second by taking advantage of Eq. (6) to extract the modeled PCs (which we denote Mod 2). To perform these modifications, we utilize the RegEM algorithm to infill a matrix comprised solely of ground station data (analogous to the AWS reconstruction in S09). The completed matrix is then augmented by the AVHRR PCs and the entire 1957–2006 period for the PCs is predicted via Eq. (6). This prevents the estimation of the PCs from influencing each other via their influence on the estimation of ground data since the ground station estimation is complete prior to the PCs being regressed. This also helps resolve the theoretical difficulty of errors in the PCs meaning something different than errors in the ground stations, as the PCs are never used to estimate ground temperatures.

### c. Spatial structure considerations when regressing principal components

One way to resolve the issue of differing spatial structures between the ground station infilling and the AVHRR eigenvectors is to constrain the ground station regression by the corresponding eigenvector weights appropriately scaled to reduce the influence of the PC on the decomposition. Because each set of weights is unique to a particular AVHRR eigenvector, this requires that each PC be infilled separately and has the added benefit of entirely resolving the issues of mutual reinforcement and cross contamination of the PCs noted in section 3. We denote this modification as Mod 3. Additionally, we denote the combination of Mods 1–3 for direct AVHRR PC regression as the *eigenvector-weighted* (E-W) method.

### d. Eliminating use of principal components

A more elegant means to resolve calibration and spatial structure concerns is to avoid using the AVHRR PCs. Since we assume the AVHRR spatial structure to be accurate, the most efficient way to perform the reconstruction is to directly regress the ground station data against the AVHRR spatial eigenvectors. To do this, we first define our spatial EOFs as , where **Λ**_{AVHRR} contains the AVHRR eigenvalues 1, … , *k*, represents the weights of the spatial eigenvectors at the ground station locations, and represents the effective degrees of freedom. We then define a matrix of RegEM-infilled ground station observations , unknown regression coefficients , and write

The *regularized least squares* (RLS) solution can be found in Lawson and Hanson (1974), where a vector solution is computed separately for each time *j* in matrices and (subscripts hereafter omitted for readability):

As we do not know the proper regularization parameter from any a priori physical arguments, we determine the parameter *h*^{2} by minimizing the rms error between the reconstruction and all stations (including ones not used as predictors) via explicit leave-one-out cross validation. The value of *h*^{2} that minimizes error in withheld data can be interpreted as the maximum likelihood estimation of the true ratio of system measurement error and noise (Fitzpatrick 1991; Sima 2006), where noise refers to the amount of information on each predictor that is not usable for prediction.

We denote this method—which inseparably combines Mods 1–3 without use of the AVHRR PCs—as the RLS method.

### e. Determining regularization parameters

The final modification, which applies to both the E-W and RLS methods, is to determine the optimum regularization parameters through a series of cross-validation experiments. This provides an objective criterion for determining important modes without resorting to heuristic tools (inspecting the log eigenvalue spectrum, bootstrapped eigenvalue/eigenvector, broken stick, scree plots, etc.) that can give vastly different answers for the same set of data. To avoid confusion, we will use *k*_{gnd} to refer to the truncation parameter for the initial ground data infilling, *k*_{sat} to denote the number of retained AVHRR PCs, *k*_{RegEM} to denote the truncation parameters used by the infilling algorithm to regress the PCs against the ground stations for the E-W reconstructions, and *h*^{2} to denote the regularization parameter for RLS.

We use several algorithms for the cross-validation experiments. The first is the TTLS algorithm of Schneider (2001)—which was the algorithm used by S09—reprogrammed in the R Programming Language and benchmarked against the original Matlab version. The second is a Truncated SVD (TSVD) algorithm similar to the DINEOF routine of Beckers and Rixen (2003). The third is the ridge regression (IRidge) algorithm of Schneider (2001), which utilizes Tikhonov regularization rather than truncation, also reprogrammed in R and benchmarked against the Matlab version.

Unlike IRidge, where the regularization parameter is determined via minimization of the generalized cross validation (GCV) function of Golub et al. (1979), no analytical cross-validation function is known for TTLS and TSVD. We therefore determine the optimal value of *k*_{gnd} by explicitly calculating reconstruction rms error to withheld stations. For this procedure, we withhold one station at a time from the ground station set and infill the remaining stations at different values of *k*_{gnd}. We then perform RLS and E-W reconstructions and calculate rms error to the withheld station. The value of *k*_{gnd} that results in the lowest overall rms error for the withheld stations is selected as optimal.

For the E-W reconstructions, we determine *k*_{RegEM} by regressing each AVHRR PC against the infilled ground station set, performing the reconstruction, and choosing the parameter that yields the lowest rms error between the reconstruction and all stations, including those not used as predictors. For RLS reconstructions, we determine the regularization parameter *h*^{2} via explicit leave-one-out cross validation during the regression (section 6d). The number of AVHRR PCs included in the reconstruction *k*_{sat} is likewise determined by minimizing the rms error on all stations.

Last, after the optimal parameters have been chosen, we evaluate how well the regression and reconstruction procedure may estimate temperatures at uninstrumented locations in a similar fashion as that used to determine *k*_{gnd}. One station at a time is entirely withheld from the ground station infilling and is also entirely withheld from the RLS and E-W reconstructions (including the cross-validation steps). We then calculate rms error (*μ*), coefficient of efficiency (CE), and correlation coefficient (*r*) to the withheld station. Because this is repeated for every station, we can obtain a complete set of verification statistics for all predictors. We also perform this procedure using the S09 reconstruction to have a like-to-like comparison of the effectiveness of the reconstruction methods.

We denote this modification—wherein all truncation parameters are determined via cross-validation—as Mod 4.

### f. Summary of modifications

To address the three primary issues noted in our abstract with the S09 method, we propose the following modifications:

Mod 1: Infill ground stations separately from the AVHRR PCs;

Mod 2: Use the calibrated, modeled PCs at all times;

Mod 3: Constrain the prediction of the PCs by the AVHRR spatial eigenvectors; and

Mod 4: Determine all adjustable parameters via cross-validation testing.

These are implemented in two ways. In E-W, the PCs are regressed against the station data. In RLS, the ground station data is directly regressed against the retained AVHRR spatial eigenvectors. Optimal parameters for both are selected via Mod 4.

## 7. Results

### a. Optimal parameters

#### 1) Parameters for ground station infilling

Because we find that including short-record stations results in degraded performance of the reconstruction, we limit our set of predictors to all READER stations that are located within 100 km of an AVHRR grid cell and have at least 96 months of data. This yields 63 stations (35 AWS and 28 manned ground stations), which are tabulated in the supporting information.

We find TTLS and TSVD to yield nearly identical results, with TTLS showing slightly lower (0%–5%) reconstruction trends depending on the regularization parameter used. Because of the increased efficiency of TSVD (the algorithm is approximately 10-times faster than TTLS), we conduct most of the cross validation using TSVD and spot check with TTLS. Both algorithms yield an identical optimal *k*_{gnd} of 7.

However, we find the regularization method in TTLS and TSVD to be somewhat undesirable. While the optimal value of *k*_{gnd} does indeed yield comparable reconstructions to IRidge, small changes in the regularization parameter can cause large changes in the resulting reconstructions. Because the regularization parameter is fixed for the entire dataset, the parameter that is ideal for the dataset as a whole causes overfits during periods with few predictors and underfits during periods with many predictors, yielding lower overall prediction effectiveness. In experiments where portions (5%) of the station data are withheld, we obtain maximum CEs to the withheld data of approximately 0.50 with TTLS and TSVD but in excess of 0.63 with IRidge. This possibility—that the smooth regularization and ability to adapt the regularization to the number of predictors available that is provided by ridge regression would prove more effective in practice than TTLS—is pointed out in Schneider (2001). In addition to the problems resulting from coarse, fixed regularization, TTLS and TSVD lack a known analytical minimization function. This makes the computational cost advantage over ridge regression noted as attractive by Mann et al. (2007) entirely evaporate due to the need for extensive, explicit cross-validation.

Given these issues with TTLS and TSVD, we choose to emphasize the reconstructions using IRidge-infilled ground stations in the main text. We do note, however, that the optimal parameters for TTLS and TSVD provide reconstructions that are quite comparable to IRidge. TTLS-infilled reconstructions are available in the supporting information.

#### 2) E-W reconstruction parameters

Unlike the ground station infilling, we found TTLS and TSVD to provide stable results for the PC regressions in the E-W reconstructions. In this case, the number of predictors is stable and there is only one predictand. This prevents having to select the truncation parameter based on best overall performance for multiple predictands and allows selection of the parameter based on the best performance for the single predictand. As a result, the underfitting–overfitting issues associated with the ground station infilling are avoided. For the E-W reconstructions—which are far more computationally intensive than the RLS reconstructions—we found the speed advantage of TTLS and TSVD to be particularly useful.

For the E-W reconstruction, imposing an upper limit of *k*_{RegEM} ≤ 8 during the ground station–AVHRR PC regression yields the best verification statistics for both TTLS and TSVD, with TSVD showing slightly higher trends (approximately 5%) and slightly better verification statistics. In contrast with the ground station infilling, the results are relatively insensitive to changes in the maximum allowable value for *k*_{RegEM}, with only a 0.01°C decade^{−1} difference in the continental and West Antarctic trends as the upper limit on *k*_{RegEM} is varied from 5 to 11. Below 5, trends in all areas are significantly reduced.

The optimal value for *k*_{sat} also shows little sensitivity to *k*_{RegEM}, yielding an ideal number of 150 retained satellite PCs with *k*_{RegEM} ≤ 8. Overfitting results in a rapid decrease in verification statistics and optimal number of retained satellite PCs when the upper limit on *k*_{RegEM} is allowed to exceed 11.

#### 3) RLS reconstruction parameters

For the RLS reconstructions, an optimal regularization parameter *h*^{2} of 0.34 is obtained, with a resulting optimal value for *k*_{sat} of 126. Reconstruction trends show little sensitivity (decreasing by approximately 0.02°C decade^{−1}) for values of *h*^{2} up to 1.5 and are generally lower if fewer AVHRR eigenvectors are retained. An important observation is that a very small amount of regularization—approximately 99.3% of the predictor variance is retained—for the RLS reconstructions yield the best predictions, indicating that the ground data is largely free from systemic error. This provides tangible evidence that the AVHRR data contains larger errors than the ground data.

### b. Overall and spatial patterns of temperature change

While we do find overall warming of the continent, the continental average is not significant at the 5% level (≈0.06 ± 0.07°C decade^{−1}), nor is the warming in East Antarctica (≈0.03 ± 0.09).^{5} This is similar to S09, wherein the trends for the continent and East Antarctica are less positive than West Antarctica and the peninsula (Table 1). Unlike S09, we find the peninsula regional average to demonstrate the most strongly positive trend, at 0.35 ± 0.11 from 1957–2006. For West Antarctica, the RLS reconstruction demonstrates a statistically significant trend of 0.10 ± 0.09 (approximately half that reported by S09) while the E-W reconstruction shows a trend of 0.06 ± 0.07. Given that the E-W reconstruction displays some variance loss compared to both RLS and the raw ground station data, we consider the RLS result to be the more accurate of the two methods. This results in a total West Antarctic warming of 0.5°C since 1957, which is roughly equivalent to the overall warming of the earth over the same period.

Figure 3 compares the spatial patterns of temperature change using major subperiods that appear in the S09 text. One feature that is similar to S09 is a strong indication of warming in portions of West Antarctica, particularly in the peninsula half of Marie Byrd Land and in the Pine Island and Thwaites glacier locations. However, the pattern of West Antarctic warming is substantially different. In S09, warming is concentrated in the Ross region and the warming over all of West Antarctica is statistically significant. In our reconstructions, statistically significant warming is concentrated in the area adjacent to the peninsula and qualitatively appears to be an extension of the peninsula warming. Additionally, we show an area of mild cooling over the Ross Ice Shelf and adjacent land (not statistically significant) that is distinctly at odds with S09. A similar cooling feature—corroborated by ground station records—extending from the South Pole to the Weddell Sea is also absent from the S09 reconstructions.

Our results—including the strong peninsula warming, insignificant cooling to neutral trend in the Ross region, and generally insignificant trends elsewhere on the continent—compare more favorably to Chapman and Walsh (2007) and Monaghan et al. (2008) than S09. One notable difference between the Monaghan et al. (2008) reconstruction and the present work is in West Antarctica. Unlike the Monaghan et al. (2008), our reconstructions show positive trends at Byrd AWS. For the entire 1957–2006 period, the reconstructed trends at the Byrd location are 0.05 ± 0.13 (RLS) and 0.02 ± 0.13 (E-W). For the satellite coverage period of 1982–2006, those become 0.21 ± 0.36 (RLS) and 0.15 ± 0.26 (E-W). By comparison, the S09 Byrd trends are 0.26 ± 0.12 and 0.47 ± 0.31, respectively.

Also unlike Monaghan et al. (2008), our reconstructions show little sensitivity to the removal of the manned Byrd station data, which decreases the West Antarctic regional trend by only 0.015. We do note, however, that the ground data from the interior of West Antarctica is sparse, with only the manned Byrd station providing any pre-1980 observations. The reconstructed trends in Marie Byrd Land, therefore, are essentially an interpolation between the long-record peninsula and Ross area stations, using the AVHRR spatial data, the manned Byrd station, and infilled values for four West Antarctic stations (Byrd AWS, Erin, Henry, and Mount Siple) as constraints.

### c. Seasonal patterns of temperature change

In comparing seasonal patterns of change, our reconstructions show minor to negligible differences compared to S09 in regional averages on the mainland for spring and summer (Table 2 and Figs. 4, 5 and 6). Winter and fall show more substantial differences, especially in the Ross and Weddell regions. The peninsula trends are substantially different for all seasons and all time periods.

In terms of seasonal maximums, S09 find the most warming in winter and spring for all areas. In our reconstructions, the peninsula shows maximum warming in winter and fall, and the remainder of the continent peaks primarily in spring and summer. We additionally show two separate patterns of change in West Antarctica, with the Ross region showing *cooling* in winter rather than maximum warming, while the area adjacent to the peninsula follows the peninsula pattern. Overall, the seasonal behavior for the peninsula and West Antarctica is again closer to Chapman and Walsh (2007) and Monaghan et al. (2008) than S09. Those studies find maximum peninsula warming during winter and fall (both studies) and maximum Ross region cooling during winter and fall (Monaghan et al. 2008) over slightly different periods (1958–2002 and 1960–2005, respectively). The observation that the seasonal patterns in S09 for all regions are coupled to the peninsula is additional evidence that the S09 method allows the peninsula to unduly influence the reconstruction.

Our reconstructions, like S09, show cooling over large portions of East Antarctica in the fall. In addition to this, however, we also find significant cooling in the region stretching from the South Pole to the Weddell Sea during winter. This corresponds well to winter trends at Amundsen–Scott (from the READER archive) of −0.34 ± 0.07, compared to −0.33 ± 0.07 (RLS) and −0.19 ± 0.06 (E-W). The winter cooling is absent in S09, who show the greatest *warming* occurring at the pole during this time. It differs to a lesser extent from the Monaghan et al. (2008) result of approximately neutral winter trends at the pole from 1960–2005 and matches well with Chapman and Walsh (2007), who find cooling at the pole during all seasons, with a maximum cooling trend during winter in the 1958–2002 timeframe.

Statistical significance of the differences between our study and S09 were calculated via by a one sample *t* test on the residual trend (or, equivalently, a paired *t* test on the monthly estimates). Spatial maps of the differences between S09 and our reconstructions for all seasons and each season individually over the period of 1957–2006 appear in Fig. 6 and results for regional averages appear in Table 2.

### d. Skill statistics

The top half of Table 3 summarizes the skill of the full reconstructions with respect to rms error (*μ*_{rms}), correlation coefficient (*r*), and average explained variance (*R*^{2}), which indicate how well the reconstruction performs at locations where ground information is available. Both the RLS and E-W reconstructions show a substantially better match to ground data than either the S09 reconstruction or the raw AVHRR data. This is not surprising for RLS, as the temporal information comes entirely from the ground stations. For the E-W reconstructions, which regress the AVHRR PCs against the ground data, the improvements validate our concerns with the S09 method.

Verification statistics (bottom-half of Table 3) are calculated by comparing reconstructed temperatures to station data that is withheld from the reconstruction as described in section 6e. Statistics calculated are *μ*_{rms}, *r*, and CE. Reduction-of-error (RE) statistics are undefined as verification targets are completely omitted from the reconstructions; hence, no target calibration period exists. A noteworthy result is that the verification statistics—wherein no station data is used at the locations being tested—*also* substantially exceed the performance of both the full S09 reconstruction and the raw AVHRR data. Full statistics are available in the Supporting Information.

Variance loss in our reconstructions is small, with RLS slightly outperforming E-W. Since the peninsula displays the highest trends, variance loss in that region will be most noticeable. Figure 7 shows annual anomalies for the seven most complete peninsula ground stations versus the corresponding grid cells from the RLS, E-W, and S09 reconstructions. A baseline period of 1970–85 was chosen due to some stations being incomplete outside that period. The ground data trend is 0.46 ± 0.18°C decade^{−1}. The reconstruction trends are 0.44 ± 0.16 (RLS), 0.39 ± 0.14 (E-W), and 0.06 ± 0.05 (S09).

### e. Relative importance of Mods 1–4

Based on performing S09-style replications with and without Mod 1 (which separates the ground station and PC estimation), we find that this modification has a negligible impact on the gridded results. Though the impact becomes greater as additional AVHRR eigenvectors are included, even at *k*_{sat} = 150 the overall pattern and magnitude of trends do not change appreciably. Therefore, while the combined infilling of PCs and ground stations technically invalidates the calibration (and should be avoided), it has no material impact on the S09 results.

Mods 2–4, however, all exhibit a significant impact on the results. Each contribute approximately 33% of the difference in the magnitude of the continental average, which limits the first-order effect of errors in the AVHRR data on the S09 reconstruction to about 0.02°C decade^{−1}. This indicates that although the detrended S09 reconstruction demonstrates similar magnitudes to our reconstructions, the similarity cannot be wholly due to trends in the AVHRR data. Loss of spatial covariance information (also recognized by S09) contributes to the reduction in magnitude by degrading the quality of the regression.

In terms of overall spatial pattern reorganization, Mod 4 (cross-validation) is the primary contributor, though Mods 2 and 3 also result in substantial reorganization, especially when examining reconstruction subperiods (Table 4). Since the patterns in our reconstructions cannot be obtained simply by increasing the number of retained AVHRR eigenvectors, it follows that the calibration and spatial structure concerns with S09 are significant.

While the contribution of Mod 2 may be determined independently, Mod 3 requires both 1 and 2. We therefore estimate the influence of Mod 3 via deduction. For a qualitative examination, we may make use of Eq. (8) to evaluate whether Mod 3 results in a better match between the TTLS regression coefficients and the AVHRR spatial eigenvectors (Fig. 8). There is a noticeable improvement as compared to Fig. 2. The rms error estimates between the normalized spatial weights for this method are 0.91, 0.56, and 0.44 for PCs 1–3, respectively, as compared to 0.93, 0.98, and 1.13 for S09. As expected, the E-W method greatly reduces the percentage of stations used in the opposite orientation. When this does occur, it is with lower-weight stations than in the S09 method, which lessens the error in the reconstruction.

## 8. Conclusions and recommendations

S09 present a novel means of using an infilling algorithm to produce a high-resolution gridded reconstruction of Antarctic temperatures using ground and satellite data. We have shown that the method has three primary areas of concern: (i) improper calibration; (ii) spatial structure differences between the infilling operation and recovery of gridded estimates; and (iii) suboptimal determination of regularization parameters. We propose four modifications to correct these issues.

We demonstrate that our concerns have a material impact on the results. When resolved, the results obtained differ from S09 in several key aspects. While we find some notable agreement with S09 (specifically, overall positive trends, statistically significant warming in West Antarctica, lesser warming in East Antarctica, and overall warming of the continent in summer and spring), we also find substantial differences. Average 1957–2006 temperature trends for the continent, East Antarctica, and West Antarctica are halved. We find an average peninsula trend of approximately 0.35°C decade^{−1}, which is almost three times that of S09. Maximum warming in West Antarctica occurs in the area adjacent the peninsula rather than on Ross. East Antarctica displays a persistent cooling feature extending from the South Pole to the Weddell Sea, and large portions of West and East Antarctica display substantially different seasonal behavior. All of these differences are statistically significant at the 5% level.

Though we find the general concept of regressing satellite principal components against ground information using an infilling algorithm to have merit, care must be taken to ensure a proper calibration. We observe that in cases where the temporal component of a dataset may be suspect, a method using only spatial information from the satellite data may provide more accurate results. The method also presents itself as a diagnostic tool; one could easily compare results between temporal and spatial methods. This diagnostic may be performed for any problem that requires both temporal and spatial analysis of incomplete datasets, where the temporal and spatial information are derived from different sources.

Furthermore, we find the heuristic procedure of Mann et al. (2007) for determining truncation parameters to be suboptimal, at least in the case of Antarctica. This procedure warrants more investigation to establish the conditions under which it might provide optimal or near-optimal results. Additionally, we find strong evidence that regularization via truncation with a fixed parameter (as in TTLS and TSVD) provides near-optimal results only when the number of predictors is stable and the number of predictands is small (or, equivalently, highly spatially coherent).

Finally, we recommend that more study be undertaken to resolve the significant differences between the AVHRR dataset used by S09 and temperatures measured at ground station locations. The small regularization parameters required for optimal verification statistics using only ground station temporal information indicate the error is likely to be with the AVHRR data. Though the scope of this work limits our analysis to a single dataset, the potential sources for error suggest similar problems may exist in other AVHRR temperature products.

## Acknowledgments

We gratefully acknowledge assistance from Hu McCullough, Roman Mureika, Jean-Marie Beckers, and Eric Steig.

## REFERENCES

## Footnotes

^{1}

S09 additionally present a detrended variant that we will not directly address. This variant retains the linearly detrended AVHRR data as is and is justifiably deemphasized in the S09 text. All criticisms apply equally to the detrended reconstruction; however, owing to the detrending, resulting trend magnitudes are not directly comparable.

^{2}

An embedded trend in the satellite data is apparent from 1982–2001 (*NOAA–7* through *NOAA–14*) that exceeds ground station measurements by 0.19° ± 0.16°C decade^{−1}.

^{3}

Since are defined separately for each time step, the S09 definition is not strictly true (section 3c).

^{4}

Schneider (2001) and Mann et al. (2007) describe the algorithm as using the covariance matrix; however, RegEM scales to correlation prior to regularization.

^{5}

All uncertainty intervals in this study are 95% confidence intervals, with degrees of freedom corrected for Assessment Report (AR1) serial correlation of the residuals (Santer et al. 2000).