The impact of including comprehensive estimates of observational uncertainties on a detection and attribution analysis of twentieth-century near-surface temperature variations is investigated. The error model of HadCRUT4, a dataset of land near-surface air temperatures and sea surface temperatures, provides estimates of measurement, sampling, and bias adjustment uncertainties. These uncertainties are incorporated into an optimal detection analysis that regresses simulated large-scale temporal and spatial variations in near-surface temperatures, driven by well-mixed greenhouse gas variations and other anthropogenic and natural factors, against observed changes. The inclusion of bias adjustment uncertainties increases the variance of the regression scaling factors and the range of attributed warming from well-mixed greenhouse gases by less than 20%. Including estimates of measurement and sampling errors has a much smaller impact on the results. The range of attributable greenhouse gas warming is larger across analyses exploring dataset structural uncertainty. The impact of observational uncertainties on the detection analysis is found to be small compared to other sources of uncertainty, such as model variability and methodological choices, but it cannot be ruled out that on different spatial and temporal scales this source of uncertainty may be more important. The results support previous conclusions that there is a dominant anthropogenic greenhouse gas influence on twentieth-century near-surface temperature increases.
By the first decade of the twenty-first century observed global near-surface temperatures had increased by about 0.8 K since the late nineteenth century (Hartmann et al. 2013). Formal detection studies have attributed most of this warming to anthropogenic influences (e.g., Santer et al. 1995; Hegerl et al. 1997; Tett et al. 1999; Stott et al. 2000; Gillett et al. 2002; Braganza et al. 2004; Huntingford et al. 2006; Stott et al. 2006; Zhang et al. 2006; Stone et al. 2007; Gillett et al. 2013; Jones et al. 2013; Ribes and Terray 2013), with well-mixed greenhouse gases contributing as much to the temperature increase as has been observed, if not more. There is also substantial and growing evidence for human influences on regional scales and on other measures of climate change (Mitchell et al. 2001; Hegerl et al. 2007; Bindoff et al. 2013).
There are a number of different techniques for detecting and attributing climate changes, which rely on comparing estimates of expected changes in climate caused by different factors with observed changes (Hegerl et al. 2007; Hegerl and Zwiers 2011; Bindoff et al. 2013). One common type of analysis uses an “optimal fingerprint” methodology, where the fingerprints of patterns of climate change from different factors simulated by climate models are regressed against the observed changes (Mitchell et al. 2001; Barnett et al. 2005; Hegerl et al. 2011). This methodology assumes that the scaling factors can account for gross uncertainties in the magnitude of the forcing and temperature responses and that the uncertainty ranges on the regression scaling factors allow for uncertainty arising from internal climate variability (e.g., Allen et al. 2006).
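In its standard form this regression can be written as follows (following the total least squares formulation of Allen and Stott 2003; the notation here is illustrative):

```latex
y = \sum_{i=1}^{m} \beta_i \left( x_i - \nu_i \right) + \nu_0
```

where $y$ is the observed spatiotemporal pattern, $x_i$ is the model-simulated response to forcing combination $i$, $\beta_i$ are the scaling factors to be estimated, $\nu_i$ is the sampling noise in each modeled pattern, and $\nu_0$ is the internal climate variability in the observations.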
The impact of uncertainties in the spatiotemporal patterns from external forcing factors can be explored by examining ranges of climate models and looking at how the results change with each model (Hegerl et al. 2000; Stott et al. 2006) and/or when the average response across models is used (Gillett et al. 2002; Jones et al. 2013; Gillett et al. 2013) and when intramodel differences are incorporated (Huntingford et al. 2006; Hannart et al. 2014). While the anthropogenic greenhouse gas contribution to observed multidecadal temperature changes has been attributed, its size varies with the choice of climate model, because of the diversity of response patterns, and also depends on the methodological choices made by the researcher (Jones et al. 2016).
In contrast, uncertainties in temperature measurements and in how they are collated into dataset products have seen relatively limited attention in detection studies. The sources of uncertainty in observation-based datasets are errors in measurements, in creating gridded data from individual observation platforms, in adjustments to account for systematic biases, from limited coverage when calculating large area means (Morice et al. 2012), and from differences in the methods researchers apply to collate data and correct for biases and other data quality issues, known as structural uncertainty (Jones 2016).
Estimates of the sampling uncertainty within gridded data and in instrumental error have previously been included in a detection analysis (Hegerl et al. 2001). How results changed with four different observational datasets was explored in another detection study (Jones and Stott 2011). Most recently an estimate of the impact on a detection analysis of bias adjustment uncertainties was included in a study (Gillett et al. 2013). All these studies found that the observational uncertainties examined had only a small impact on the attribution of a dominant anthropogenic influence on observed temperature increases.
In this study we make the most thorough examination of the impact of observational uncertainty on an optimal fingerprinting detection study to date by, as comprehensively as possible, incorporating bias, measurement, and sampling uncertainties from the error model of a near-surface temperature dataset, HadCRUT4 (Morice et al. 2012). We also further explore the impact of dataset structural uncertainty by, for the first time, repeating a detection analysis on five alternative near-surface temperature datasets. We use as our basis the detection analysis of Jones et al. (2016). They applied an optimal fingerprint regression methodology to large spatial and temporal scale variations of observed near-surface temperatures (HadCRUT4; Morice et al. 2012), using climate model simulations to provide temperature response patterns from different climate drivers or forcings. The simulations were produced from climate models drawn from phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012). Jones et al. (2016) showed that detection results were sensitive to the choice of CMIP5 model used to derive the response patterns. They also showed that the detection results were sensitive to what forcing’s response patterns were used as predictor variables in the regression. This was especially the case when forcing responses had low signal-to-noise ratios or had strong correlations with other forcing responses. For instance, all the CMIP5 models had natural-only forced response patterns that had very low signal-to-noise ratios, and for some CMIP5 models the response to well-mixed greenhouse gas forcing was strongly anticorrelated with the response patterns from other anthropogenic forcings. Both of these issues are known to increase uncertainties in the regression scaling factors (Allen and Tett 1999; Allen and Stott 2003). 
In this study we choose to use a two-way regression, effectively using the signal combination of response patterns from well-mixed greenhouse gas forcings and the response patterns to the combined influence from other non–greenhouse gas anthropogenic and natural forcings. Jones et al. (2016) concluded that this two-pattern combination arguably gave more robust results than other combinations that were possible from the available CMIP5 experiments.
This paper is structured as follows. In section 2 we describe the observational and model data we use, in section 3 we summarize the methodology, and in section 4 we present the results of the optimal detection analyses using the HadCRUT4 error model and the alternative observational datasets. We discuss the results and give a summary in sections 5 and 6, respectively.
HadCRUT4 is the latest dataset of observed near-surface temperatures produced by the Met Office Hadley Centre and the Climatic Research Unit (CRU) at the University of East Anglia (Morice et al. 2012) (Table 1). The dataset is blended from sea surface temperatures (SSTs) [Hadley Centre SST dataset, version 3 (HadSST3); Kennedy et al. 2011a,b] and land near-surface air temperatures [CRU Temperature dataset, version 4 (CRUTEM4); Jones et al. 2012]. The dataset comprises monthly mean anomalies relative to 1961–90, on a common 5° × 5° latitude–longitude spatial grid, covering the period from 1850 to the present day. An error model is provided with the dataset that describes residual errors in bias adjustments, measurement errors, and sampling errors (Brohan et al. 2006; Kennedy et al. 2011a,b; Morice et al. 2012).
Bias adjustments are applied to SSTs and to land air temperatures, to minimize the effects of changes in measurement techniques, station moves, instrument changes, reference period mean uncertainties, and biases in urbanization and sensor exposures (Morice et al. 2012). A 100-member ensemble is provided with HadCRUT4 that samples uncertainties in the bias adjustments (Fig. 1a).
Measurement and gridbox sampling uncertainties covering land points in HadCRUT4 are uncorrelated in time and space and are provided in an ancillary dataset as standard deviations for each grid box and month. Within the SSTs there are further systematic errors affecting measurements made by individual ships, buoys, and other observation platforms. These are not incorporated in the bias realization ensemble, which deals with large-scale, fleetwide bias corrections (Kennedy et al. 2011a). To capture these errors, monthly spatial covariance matrices are provided with HadCRUT4, but because of incomplete records it is not currently possible to fully describe the spatial and temporal correlation structure of these uncertainties throughout the entire time period.
The provision of an ensemble of interchangeable members allows complex spatial and temporal correlations in the uncertainties to be incorporated into this study. The analysis can be repeated on each HadCRUT4 ensemble member, with the results then combined statistically. To enable uncorrelated and correlated measurement and sampling errors to be included in this study, we construct an additional 100-member ensemble of realizations by using the standard deviation and covariance ancillary datasets together with an assumption about the temporal correlation structure (Fig. 1b). Details about the assumptions and choices made are described in the appendix. This measurement and sampling realization ensemble is used here to investigate the possible impact of the inclusion of a more comprehensive set of uncertainties than just using the bias realization ensemble and is not a formal addition to the HadCRUT4 error model.
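As a concrete sketch of how one such realization can be constructed (the function and array names here are our illustration, not the HadCRUT4 code; the assumed temporal correlation structure discussed in the appendix is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_realization(sigma_land, cov_sst):
    """Draw one realization of measurement and sampling error.

    sigma_land : (n_boxes,) per-gridbox standard deviations of the
                 uncorrelated land measurement and sampling errors.
    cov_sst    : (n_boxes, n_boxes) spatial covariance matrix of the
                 correlated SST measurement errors for one month.
    """
    # Uncorrelated land component: independent draws per grid box.
    land_err = rng.normal(0.0, sigma_land)
    # Correlated SST component: multiply white noise by the Cholesky
    # factor of the covariance matrix to impose spatial correlation.
    chol = np.linalg.cholesky(cov_sst)
    sst_err = chol @ rng.standard_normal(cov_sst.shape[0])
    return land_err + sst_err
```

Repeating such draws 100 times, with an assumed month-to-month correlation added, yields an ensemble of interchangeable error realizations analogous to the bias adjustment ensemble.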
Calculating the trends for each of the members of the HadCRUT4 uncertainty ensembles and then examining their distribution enables the uncertainty of the global mean trend of HadCRUT4 to be estimated. We use least squares trends as a summary statistic of the long-term behavior of the data, but their use is not meant to indicate that the data are varying in a purely linear fashion. Other approaches are also available to describe the tendency of data (box 2.2 in Hartmann et al. 2013), which would give similar results to those presented here. We find that the HadCRUT4 trend over the 1900–2014 period with bias adjustment uncertainties is 0.76 ± 0.06 K (100 yr)−1 (plus or minus two standard deviations). The uncertainty in the HadCRUT4 trend resulting from the measurement and sampling uncertainties is much smaller [0.76 ± 0.01 K (100 yr)−1], because of the weaker spatial and assumed temporal correlations in the derivation of the ensemble. When both realization sets are added together to form a combined bias, measurement, and sampling uncertainty ensemble (Fig. 1c), the uncertainty in the trend is only very marginally larger than it is for the bias adjustment ensemble, with the variance increasing by less than 2%. This suggests that using just the bias adjustment uncertainty ensemble, as provided with HadCRUT4, may generally be sufficient in many analyses to describe the majority of the large spatial- and temporal-scale observational uncertainties.
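The trend-spread calculation across an uncertainty ensemble can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import numpy as np

def ensemble_trend_spread(years, ensemble):
    """Least squares trend for each ensemble member and the
    two-standard-deviation spread of the trends across members.

    years    : (n_years,) array of years, e.g. 1900..2014.
    ensemble : (n_members, n_years) global annual mean anomalies (K).
    Returns the ensemble mean trend and 2-sigma spread in K/century.
    """
    # np.polyfit with degree 1 returns [slope, intercept];
    # the slope is in K/yr, so multiply by 100 for K per century.
    trends = np.array([np.polyfit(years, m, 1)[0] * 100.0
                       for m in ensemble])
    return trends.mean(), 2.0 * trends.std(ddof=1)
```

Applied to the 100 bias adjustment realizations, this is the calculation behind ranges such as 0.76 ± 0.06 K (100 yr)−1 quoted above.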
Another source of uncertainty in global mean temperature anomalies is coverage uncertainty. With measurements incompletely sampling the globe, the global mean will be an imperfect estimate of the global mean of a hypothetical dataset that had complete spatial coverage. Morice et al. (2012) provided an estimate of the coverage uncertainty of the global and large area means, which can be sampled and added to the bias, measurement, and sampling ensembles (Fig. 2a). As alternative observational datasets have different spatial coverages (see below), this source of uncertainty should be considered when comparing HadCRUT4 to these alternative datasets. As the coverage uncertainty has unquantified temporal correlations, it is not trivial to estimate the impact this has on the global mean trend uncertainties. We do not consider coverage uncertainties within this detection study, as the model data being used is processed to have the same spatial coverage as HadCRUT4.
Hartmann et al. (2013) gave an estimate of the observed global mean trend, based on the average of three near-surface temperature datasets, together with an uncertainty range, 0.85 K warming (0.65–1.05 K; 5%–95%) over the period 1880–2012. This uncertainty estimate is not comparable with what we estimated above as it is based on a first-order autoregressive model of the residual of the best line fit (section 2.SM.3.3 in Hartmann et al. 2013) and not on observational dataset uncertainties. Hartmann et al. (2013) also provided an estimate of the observed warming between different periods [e.g., a warming between the 1850–1900 and 2003–12 periods of 0.78 K (0.72–0.85 K; 5%–95%)]. These uncertainties are based on the observational error model of HadCRUT4 (section 2.SM.4.3 in Hartmann et al. 2013) and so are comparable to the uncertainties on the global mean trend calculated above in this study.
b. Alternative observational datasets
To examine what role structural uncertainty in the creation of temperature datasets may play in the optimal detection analysis, we repeat the analysis using five alternative observational datasets (Fig. 2a and Table 1). The approaches to construct the datasets are technical, so we only give a brief summary below. Summaries can also be found in other studies (e.g., Jones 2016; Kent et al. 2017; Tables 2.SM.4 and 2.SM.5 in Hartmann et al. 2013), but we recommend that for specific details the reader should seek out the dataset description documents.
The GISS Surface Temperature Analysis (GISTEMP) dataset is provided by the NASA Goddard Institute for Space Studies. It uses land air temperatures [Global Historical Climatology Network (GHCN), version 3.3.0 (v3.3.0), and Scientific Committee on Antarctic Research (SCAR) stations over Antarctica] and SST [Extended Reconstructed Sea Surface Temperature, version 4 (ERSST.v4)] observations to produce a blended gridded 2° × 2° monthly mean dataset for the period 1880–present, given as anomalies relative to 1951–80 (Hansen et al. 2010). Land air temperatures are smoothed over 1200-km spatial scales, interpolating over sea ice regions and land regions with no observations. The SST dataset ERSST.v4 was filtered in time and space, as for the NOAA dataset described below, so some ice-free sea areas with missing data are infilled.
The NOAA Global Surface Temperature (NOAAGlobalTemp) dataset combines land air temperatures (GHCN v3.3.0) and SST (ERSST.v4) observations to produce monthly means with respect to the 1971–2000 mean on a 5° × 5° grid, for 1880–present (Smith et al. 2008). The land and sea datasets have, separately, been processed with complex time and spatial filtering, with the use of empirical orthogonal functions to interpolate over regions with no observations. There is no interpolation of temperatures over regions with sea ice.
The Japan Meteorological Agency’s (JMA) dataset blends their own SSTs [Centennial In Situ Observation-Based Estimates (COBE) SST; Ishii et al. 2005] with land air temperatures (GHCN before 2000 and subsequently CLIMAT reports; details of data versions are undocumented) onto a 5° × 5° grid (described in http://ds.data.jma.go.jp/tcc/tcc/products/gwp/temp/explanation.html). The COBE SST dataset was created by applying an optimum interpolation approach, which infilled missing data regions (Ishii et al. 2005). However, it appears that in the blended dataset the SSTs have been spatially subsampled to the original observational coverage. It is not documented whether the land air temperature component has had any infilling techniques applied, but the dataset’s spatial coverage suggests not. Monthly mean anomalies with respect to 1971–2000 are provided for the period 1891–present.
Cowtan and Way (2014) applied an interpolation method, kriging, to HadCRUT4 (version 4.3) to infill regions with no observations, producing a variation of the dataset (CW14). The kriging was applied to the land air (CRUTEM4) and SST (HadSST3) parts separately, with land air temperatures interpolated over sea ice regions. The meaning, reference period, time coverage, and resolution are the same as for HadCRUT4.
The Berkeley Earth project (herein BerkEarth) used empirical statistical techniques to merge together station records, without explicitly accounting for knowledge about the stations (e.g., instrument changes and urbanization; Muller et al. 2013). A spatial filtering technique, kriging, was applied to this land air temperature dataset, which infills areas with no observations. SSTs were adapted from a version of HadSST3 and also spatially filtered with the kriging technique to infill missing data areas. The land air temperatures are then blended with the SSTs, with interpolated air temperatures used over sea ice areas. The details of the processing of the SSTs and blending with the land air temperatures are not fully documented. The dataset produced comprises monthly means on a 1° × 1° grid, for the period 1850–present with anomalies given with respect to 1961–90.
The datasets share many core observations but differ in processing, data corrections, anomaly reference period, and techniques for infilling missing data (Jones 2016). Some of the datasets also share land or SST components with other datasets. Because of this the datasets cannot be treated as completely independent sources of information. The JMA and BerkEarth blended land and sea temperature datasets have not been described in the peer-reviewed literature. As such, some details of the creation of these datasets are not clear. However, as the datasets are publicly available we include them in the following analysis for completeness. Data sources and version numbers are given, where available, in Table 1.
The global mean trends vary across the datasets, while the overall evolution of the datasets is very similar (Fig. 2a) (Jones 2016). We find that the linear trends for the 1900–2014 period are 0.85, 0.82, 0.74, 0.77, and 0.80 K (100 yr)−1 for GISTEMP, NOAAGlobalTemp, JMA, CW14, and BerkEarth datasets, respectively. Recall that the HadCRUT4 trend, incorporating bias, measurement, and sampling uncertainties, is 0.76 ± 0.06 K (100 yr)−1.
The spatial coverages vary considerably between the datasets. For instance around 1910 the coverage varies from approximately 50% (HadCRUT4 and JMA) to 80% (NOAAGlobalTemp) and from 95% (GISTEMP and BerkEarth) to 100% (CW14) of the globe. By 2000 the global spatial coverage increases to 80% for HadCRUT4 and JMA, 90% for NOAAGlobalTemp, and 100% for GISTEMP, BerkEarth, and CW14. The vast majority of the differences in the spatial coverages are due to the differing infilling techniques and not to the spatial availability of observations. When all the datasets have the same common spatial coverage applied, we find that the variations in the global annual means are more similar (Fig. 2b). GISTEMP and NOAAGlobalTemp now have near-identical variations, which is perhaps unsurprising given the common source of the land air and SST datasets they use. With no infilling, CW14 is now near identical to HadCRUT4, with the only differences likely due to CW14 using an older version of HadCRUT4 and a subtly different technique for blending air temperatures and SSTs. Interestingly, without infilling BerkEarth is now very similar to HadCRUT4, perhaps an indication of the dominance that the common SST dataset source has on global mean temperatures. However, differences between the datasets remain, with some of the datasets occasionally deviating outside the range of the bias, measurement, and sampling uncertainties of HadCRUT4, highlighting the role that other structural differences in the creation of the datasets play. Morice et al. (2012) came to the same conclusion when applying a shared common spatial mask to HadCRUT4 and previous versions of GISTEMP, NOAAGlobalTemp, and JMA.
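Applying a common spatial mask of this kind can be sketched as follows (a simplified illustration with names of our choosing; a full analysis would additionally weight grid boxes by area, e.g., by the cosine of latitude):

```python
import numpy as np

def common_mask(*fields):
    """Mask each dataset to the grid boxes observed in all of them.

    fields : arrays of shape (n_time, n_lat, n_lon), with NaN where a
             dataset has no observation.
    Returns the (unweighted) global mean time series of each dataset
    computed over the shared coverage only.
    """
    # A grid box is kept only if no dataset has a missing value there.
    mask = ~np.any([np.isnan(f) for f in fields], axis=0)
    means = []
    for f in fields:
        masked = np.where(mask, f, np.nan)
        means.append(np.nanmean(masked, axis=(1, 2)))
    return means
```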
c. CMIP5 simulation data
We use the same CMIP5 (Taylor et al. 2012) model simulations as used in Jones et al. (2016) (Table 2). Monthly mean near-surface air temperatures were obtained for three experiments, piControl, historical, and historicalGHG (Taylor et al. 2012). Lengths of 500 yr were retrieved from piControl, which have no variations in external radiative forcing, from 23 models. Of these models, 15 also provided historical and historicalGHG simulations for the 1906–2005 period (Table 2). The historicalGHG experiments are driven by historical variations in well-mixed greenhouse gases, although several models also include variations in ozone (Jones et al. 2016). The historical experiments are forced by variations in well-mixed greenhouse gases, tropospheric and stratospheric ozone, sulfate-aerosol direct effects, carbonaceous aerosols, total solar irradiance, and stratospheric volcanic aerosols. Most of the model historical simulations also include variations in aerosol indirect effects (aerosol–cloud interactions) and land-use changes (Table 2 in Jones et al. 2016).
Here we outline the detection methodology, together with the preprocessing steps and other analysis choices. The basis of the analysis follows that of Jones et al. (2016), an optimal detection analysis for the 1906–2005 period using 10-yr means projected onto spherical harmonics [triangular truncation at wavenumber 4 (T4)]. Jones et al. (2016) explored a range of methodological choices, but to examine just the impact of observational uncertainties we follow their core analysis using the CMIP5 multimodel mean for the forced responses and using a two-way signal regression to deduce the scaling factors for the responses to well-mixed greenhouse gas influences and combined other anthropogenic and natural influences.
The optimal detection methodology used here is described extensively elsewhere (Allen and Stott 2003; Stott et al. 2003; Allen et al. 2006), with the specifics of the analysis described in Jones et al. (2016), so we will only briefly describe it here. The basic method we use is linear regression, where spatiotemporal patterns, representing the responses to different combinations of forcing factors, are regressed against observed climate changes. The response patterns are obtained from climate model simulations, which can be driven by a variety of forcing factors. With response patterns from different model experiments included in a multiple signal regression it is possible to deduce the relative contributions from different forcing factors that could explain the observed changes. The simulated and observed temperature data are filtered and projected onto the leading empirical orthogonal functions (EOFs), which represent modes of internal variability as estimated from the piControl simulations. The amplitudes of the EOFs are optimized to downweight those modes with the largest variability in the piControl simulations, such that the focus is on the most important externally forced spatiotemporal fingerprints. The choice of the number of EOFs included in an analysis—the EOF truncation—is a compromise (Allen et al. 2006). Including higher-order EOFs increases the response patterns’ variance explained but at the expense of the signal-to-noise ratios. It is common practice to examine how the scaling factors vary with EOF truncation, which can show how sensitive the results are to different choices. Like other methodological choices, what EOF truncation is used in the following analyses is down to the researchers’ best judgment. Jones et al. (2016) explored the sensitivity of their results to the choice of EOF truncation but used an EOF truncation of 40 to present their main results.
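In the simpler ordinary least squares form, the "optimization" amounts to prewhitening with the inverse noise covariance estimated from the retained EOFs of the piControl simulations (e.g., Allen and Tett 1999); schematically:

```latex
\hat{\beta} = \left( X^{\mathsf{T}} C_N^{-1} X \right)^{-1} X^{\mathsf{T}} C_N^{-1} y
```

where $X$ holds the response patterns $x_i$ as columns, $y$ is the observed pattern, and $C_N$ is the internal-variability covariance truncated to the leading EOFs; modes with large control-run variance therefore receive less weight in the fit.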
The linear regression produces probability distributions for the scaling factors, which inform what patterns are detected (i.e., when their 5%–95% ranges do not overlap with a value of 0). The scaled reconstructed temperatures are then deduced, together with their trends and uncertainties. Part of the evidence for an attribution of observed changes to specific causes is often considered when the responses are not scaled significantly up or down (i.e., scaling factors ranges overlap with a value of one), thus consistent with prior physical understanding of the model response (Hasselmann 1997). However, there is an argument that scaling factors consistent with a value of one are not necessary for attribution, if expert judgment understands the discrepancy (Hegerl and Zwiers 2011; Bindoff et al. 2013). A residual consistency test (Allen and Stott 2003) is often applied to assess whether the residual variability matches the expectation of internal variability.
Alternative explanations for the observed change, such as natural influences only, should normally also be considered and ruled out to move toward a formal attribution assessment (Allen et al. 2006). After examining the different combinations of forced responses, obtained from the available CMIP5 experiments, Jones et al. (2016) deduced that using the historical and historicalGHG experiments as predictors in the regression gave more consistent estimates of the scaled greenhouse gas contribution to the observed trend and appeared to be more skillful in estimating the greenhouse gas contribution in perfect model tests. We use the CMIP5 historical and historicalGHG experiments in this study to deduce the contributions to the observed warming from well-mixed greenhouse gases and the combined influence from other anthropogenic and natural forcing factors. Alternative combinations of patterns in the regression are not considered here, as Jones et al. (2016) examined this in depth.
To incorporate observational uncertainty we repeat the regression analysis on each of the 100 members of the observational uncertainty ensemble, treating each one as the observations. As each observational realization is designed to be interchangeable the scaling factor probability density functions (PDFs) from the analysis on each realization can be added together—following the probability addition rule—and normalized to produce a “total” PDF for the scaling factors for each predictor.
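The pooling of realization PDFs can be sketched as follows (the names are ours; the densities are normalized numerically on a uniform scaling-factor grid):

```python
import numpy as np

def combine_pdfs(pdfs, grid):
    """Pool scaling-factor PDFs from interchangeable realizations.

    pdfs : (n_realizations, n_grid) array of densities evaluated on
           the common scaling-factor axis `grid` (uniform spacing).
    Each realization is equally likely, so the pooled ("total") PDF
    is the average of the individual PDFs, renormalized so that it
    integrates to one.
    """
    dx = grid[1] - grid[0]
    total = pdfs.mean(axis=0)
    return total / (total.sum() * dx)
```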
We preprocess the simulated data in as similar a way as possible to the observations. It is not strictly possible to do a “true like-with-like” comparison of model simulations with observation datasets and process them identically. Models do not simulate individual observation stations or platforms; they have simplified orography (e.g., missing out small islands that have air temperature observations), and the processing steps to create the core gridded datasets will be very different. One common processing choice is to make sure spatial coverages of model fields match those of the observed fields being compared with, which has been the general approach in detection studies for some time (e.g., Hegerl et al. 1997). It has also been common practice to use SSTs as a reasonable surrogate for marine air temperatures (e.g., Brohan et al. 2006) in the construction of observed global surface temperature datasets. Comparing such datasets to simulated near-surface air temperatures has generally not been considered to introduce significant errors compared to other uncertainties (Allen et al. 2006). However, using SSTs in observed datasets potentially introduces small negative biases to warming trends (Cowtan et al. 2015), because of the higher thermal inertia of the oceans than the air. This issue could be considered as an observational uncertainty, additional to those described in this study, and is worthy of further investigation in the future. However, despite these technical and practical limitations, one should attempt to compare data as similarly as is possible, acknowledging that the chosen assumptions and approximations may influence results to a greater or lesser extent.
The preprocessing that we apply to the observed and simulated data follows that described in Jones et al. (2016). All data are projected onto the same 5° × 5° latitude–longitude spatial grid as HadCRUT4. Annual means are calculated from the monthly means (January–December) and nonoverlapping 10-yr means calculated for the 1906–2005 period. For the observations 8 months are required for the annual mean to be calculated at a grid point, and then at least 5 yr are required to make a decadal mean. The simulated decadal mean data are subsampled to have the same spatial coverage as HadCRUT4, and missing data values are given elsewhere. For each grid point, anomalies with respect to the 1906–2005 period are found, where any amount of missing data is allowed in the calculation of the reference period mean (Fig. 3). All the data are projected onto spherical harmonics (T4)—filtering out spatial scales smaller than 5000 km (Stott and Tett 1998).
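The missing-data thresholds for the annual and decadal averaging can be sketched for a single grid point as follows (an illustration with hypothetical names; the paper's processing additionally handles the spatial dimension, anomaly baselines, and the spherical harmonic projection):

```python
import numpy as np

def decadal_means(monthly, min_months=8, min_years=5):
    """Annual then nonoverlapping decadal means with missing-data
    thresholds, for a single grid point.

    monthly : (n_years, 12) anomalies with NaN for missing months.
    An annual mean requires at least `min_months` months present; a
    decadal mean requires at least `min_years` annual means present.
    """
    months_present = np.sum(~np.isnan(monthly), axis=1)
    annual = np.where(months_present >= min_months,
                      np.nanmean(monthly, axis=1), np.nan)
    n_decades = annual.size // 10
    decades = annual[:n_decades * 10].reshape(n_decades, 10)
    years_present = np.sum(~np.isnan(decades), axis=1)
    return np.where(years_present >= min_years,
                    np.nanmean(decades, axis=1), np.nan)
```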
We create a common EOF basis (Gillett et al. 2002) by combining 250 yr from each of the 23 CMIP5 model piControl simulations. A separate 250 yr from each model is combined to produce an independent estimate of internal variability for uncertainty testing. To obtain the multimodel means of the historical and historicalGHG experiments we first calculate the ensemble mean for each model and then calculate the multimodel mean. This gives equal weight to each model, rather than each simulation (Jones et al. 2013). The impact on the noise characteristics is accounted for by scaling the variability deduced from the piControl simulations. The impact of intramodel uncertainties (e.g., Hannart et al. 2014) is not included in this optimal detection analysis (Jones et al. 2016).
We require the scaling factors for well-mixed greenhouse gases (G) and the other anthropogenic and natural combined influences (OAN) (section 3). We achieve this by doing the regression using the patterns from the historical and historicalGHG experiments (Jones et al. 2016). The resulting scaling factors are then transformed to the required G and OAN scaling factors. We use a different naming convention for the transformed (G and OAN) scaling factors to differentiate them from the untransformed (historical and historicalGHG) scaling factors (Jones et al. 2016).
We will first describe the results of the standard analysis using HadCRUT4 and then the results of analyses using the ensembles sampling HadCRUT4’s uncertainties. Finally, we describe the results when we repeat the analysis using each of the five alternative observational datasets.
a. HadCRUT4 median
The optimal fingerprint regression on the HadCRUT4 median produces scaling factors for G and OAN that are robustly detected across all EOF truncations (Fig. 4). The G scaling factors are consistent with 1, indicating no significant scaling up or down. The OAN scaling factors are less robustly consistent with 1 across the EOF truncations. The residual consistency test is passed for all EOF truncations. Following Jones et al. (2016) we chose an EOF truncation of 40 to enable us to focus on the sensitivity of the results to just the factors of interest, in this case observational uncertainty. At this EOF truncation 97% of the variability of HadCRUT4 is captured, and for both the historical and historicalGHG experiments the multimodel means have 98% of the variability captured. We calculate that the signal-to-noise ratios [following Tett et al. (2002)] for HadCRUT4 and the multimodel means of historical and historicalGHG are 2.3, 12.8, and 16.9, respectively. Together with the general consistency of the scaling factors across the range of possible EOF truncations (Fig. 4) we believe that 40 is a reasonable choice for the EOF truncation and representative for G, although less so for OAN.
b. HadCRUT4–uncertainty realization ensemble
For each of the 100 members of the bias adjustment ensemble we repeat the analysis using the ensemble member instead of the HadCRUT4 median as the predictand in the regression. The G and OAN scaling factors are obtained for each analysis (Fig. 5a). For all 100 analyses both signals are detected. The G scaling factors are consistent with a value of 1 for all the bias adjustment realizations. The OAN scaling factors are consistent with 1 for analyses using 73 of the 100 realizations. The spread of scaling factors across the realizations is similar to the range of scaling factors for different EOF truncations (Fig. 4), indicating that this source of observational uncertainty has an impact of similar magnitude to that of one source of methodological uncertainty. We combine the 100 PDFs of the G and OAN scaling factors (Fig. 5b) by adding them together and normalizing the resultant distributions. The PDFs (Fig. 5c) and the 5%–95% ranges (Fig. 5d) of the G and OAN scaling factors for the bias realization analysis are slightly wider than those from the HadCRUT4 median analysis. We find an increase in the variance of the scaling factor uncertainties, over the HadCRUT4 median analysis, of 17% and 6% for G and OAN, respectively.
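The pooling of the per-realization scaling-factor distributions can be sketched as follows, assuming each PDF is evaluated on a common grid of scaling-factor values; the Gaussian shapes and all numbers below are illustrative, not the paper's distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_grid = np.linspace(0.0, 2.0, 401)  # common scaling-factor axis
dbeta = beta_grid[1] - beta_grid[0]

# Hypothetical per-realization Gaussian PDFs for the G scaling factor, with
# best estimates scattered about 1 to mimic the 100 bias realizations.
centres = rng.normal(1.0, 0.05, size=100)
widths = np.full(100, 0.1)
pdfs = np.exp(-0.5 * ((beta_grid[None, :] - centres[:, None]) / widths[:, None]) ** 2)
pdfs /= pdfs.sum(axis=1, keepdims=True) * dbeta  # normalize each PDF

# Combine: add the 100 PDFs together and renormalize the result.
combined = pdfs.sum(axis=0)
combined /= combined.sum() * dbeta
```

The combined distribution is necessarily at least as wide as a typical member, since it folds in the spread of the best estimates across realizations.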
We repeat this process for the additional uncertainty ensembles created for this study (section 2a). The G and OAN scaling factor ranges (5%–95%) are very similar for all the analyses (Fig. 6a). The measurement and sampling uncertainty ensemble analysis is nearly identical to the median analysis, with a variance increase in the scaling factor uncertainties of less than 2%. The scaling factors for the analysis on the combined bias, measurement, and sampling uncertainties have increases in the variance of the uncertainty ranges of just 18% and 8% over the median analysis for G and OAN, respectively.
c. Scaled temperature reconstructions
We reconstruct temperature trends of the filtered 10-yr means using the scaling factors to enable the contributions to the observed trends, as deduced by the regression analysis, to be assessed (Allen and Stott 2003). As expected from the similarity of the scaling factors, the scaled temperature trends are very similar (Fig. 6b). For the HadCRUT4 median analysis the scaled G and OAN trends are 1.04 (ranging from 0.86 to 1.22) and −0.38 (ranging from −0.54 to −0.22) K (100 yr)−1, compared to the observed trend of 0.65 (ranging from 0.52 to 0.77) K (100 yr)−1 (uncertainties given as 5%–95% ranges). As we compare the observed trend with simulated trends, the uncertainty due to internal variability is accounted for in the HadCRUT4 trend (not to be confused with observational dataset uncertainties). The estimate of the uncertainty due to internal variability in the observed trend is deduced from the variability of the piControl trends (following Tett et al. 2002). For the analysis on the bias adjustment realizations, the scaled G and OAN trends are 1.04 (ranging from 0.85 to 1.23) and −0.38 (ranging from −0.55 to −0.21) K (100 yr)−1, compared to the observed trend of 0.65 (ranging from 0.51 to 0.78) K (100 yr)−1 (here the observed trend uncertainty range includes both internal variability and bias adjustment uncertainty, calculated by bootstrapping from the distributions of the two sources of uncertainty). This represents an increase in the variance of the scaled trend uncertainties of less than 17%. Our analysis on the measurement and sampling ensemble gives scaled trends with ranges only slightly larger than those of the HadCRUT4 median analysis (less than 1% in variance). Similarly, our analysis on the combined bias, measurement, and sampling ensemble produces scaled trends nearly identical to the bias ensemble analysis results.
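The bootstrap combination of the two sources of observed-trend uncertainty can be sketched as follows; the two distributions below are illustrative Gaussians standing in for the actual piControl and bias-realization trend spreads:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical samples of trend perturbations (K per 100 yr) from the two
# sources of observed-trend uncertainty; spreads are illustrative only.
internal = rng.normal(0.0, 0.07, size=5000)  # internal variability (piControl trends)
bias_adj = rng.normal(0.0, 0.03, size=100)   # spread across bias adjustment realizations

best_trend = 0.65  # K (100 yr)^-1, the HadCRUT4 median trend from the text

# Bootstrap: draw one perturbation from each distribution and add them, so
# the combined spread reflects both sources of uncertainty together.
n_boot = 20000
combined = best_trend + rng.choice(internal, n_boot) + rng.choice(bias_adj, n_boot)

lo, hi = np.percentile(combined, [5, 95])  # combined 5%-95% range
```

Because the two sources are sampled independently, the combined variance is approximately the sum of the individual variances, which is why the widening over the internal-variability-only range is modest here.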
Another optimal detection study introduced part of the HadCRUT4 error model into its analysis. Gillett et al. (2013) used the same spatial and temporal averaging as this study but over a different period (1861–2010) and with the multimodel mean of nine CMIP5 models, in an approach (Huntingford et al. 2006) that includes an estimate of uncertainty from intramodel variability. The study repeated the regression analysis on each of HadCRUT4’s bias adjustment realizations. To obtain the overall uncertainties on the scaling factors, the variance of the spread in the best estimates of the scaling factors was added in quadrature to the variance of the HadCRUT4 median analysis scaling factor uncertainties. The impact of the bias adjustment uncertainties was small, with an increase in variance of the G scaling factor range of approximately 20% (Fig. 4 in Gillett et al. 2013). That different approaches to including bias adjustment uncertainties in a detection study give similar results adds to our confidence that this source of error has limited influence on overall attribution statements.
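The quadrature combination amounts to adding the two variances; a minimal sketch with purely illustrative numbers (not the values from Gillett et al. 2013):

```python
import math

# Illustrative variances only: the median-analysis scaling-factor uncertainty
# and the spread of best estimates across the bias adjustment realizations.
var_median = 0.010
var_spread = 0.002

# Adding the variances in quadrature widens the overall uncertainty range.
sigma_total = math.sqrt(var_median + var_spread)
increase = var_spread / var_median  # fractional increase in variance (20% here)
```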
d. Alternative observational datasets
To explore the possible impact of temperature dataset structural uncertainties we also undertook analyses on the alternative near-surface temperature datasets (Fig. 2a; Table 1). This follows the approach of Jones and Stott (2011), which was an optimal detection analysis using a single climate model with earlier versions of the HadCRUT4, GISTEMP, NOAAGlobalTemp, and JMA datasets.
We preprocessed all the observational datasets as for the HadCRUT4 analysis, with all data regridded to the same 5° × 5° resolution. The only difference in the data processing is that each observational dataset’s own spatial coverage is retained, which is also applied to the simulated data in each analysis. Ideally the simulated data should be preprocessed in the same way as the alternative datasets, to emulate any filtering and infilling techniques. For instance, infilled regions in the observational datasets may have lower spatial and temporal variability (Jones 2016) than regions with direct observations, so some discrepancies between observational datasets and simulations could be processing artifacts. It is, however, extremely challenging technically to process the simulated data to replicate the smoothing and infilling procedures of some of the observational datasets.
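Applying an observational dataset's coverage to the simulated fields can be sketched as follows; the grid size, missing-data fraction, and weighting here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fields on a 5-degree grid (36 lat x 72 lon): an observational
# anomaly field with missing grid boxes (NaN) and a model-simulated field.
obs = rng.normal(size=(36, 72))
obs[rng.random(obs.shape) < 0.3] = np.nan  # roughly 30% missing coverage
sim = rng.normal(size=(36, 72))

# Impose the observational coverage on the simulation so both fields are
# compared over exactly the same grid boxes.
sim_masked = np.where(np.isnan(obs), np.nan, sim)

# Area-weighted mean over the common coverage (weights ~ cos(latitude)).
lat = np.deg2rad(np.linspace(-87.5, 87.5, 36))
w = np.cos(lat)[:, None] * np.ones((1, 72))
valid = ~np.isnan(sim_masked)
gm = np.nansum(sim_masked * w) / w[valid].sum()
```

Because the mask differs between datasets, the unscaled simulated trends also differ between analyses, which is the coverage effect discussed below.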
The scaling factor values are largely insensitive to the choice of EOF truncation (not shown), so for consistency we use the same EOF truncation of 40 as above. The best estimates of the scaling factors across the alternative dataset analyses (Fig. 7a) cover a wider range than the scaling factors of the individual bias adjustment ensemble (Fig. 5a). In all the analyses, G and OAN are detected, but while βG is consistent with 1 in all analyses, βOAN is not for the analyses using the NOAAGlobalTemp and JMA datasets. The reconstructed scaled trends for the analyses on the alternative datasets (Fig. 7b) also cover a wider range than the spread of scaled trends deduced from the HadCRUT4 bias realization ensemble analyses. It should be borne in mind, however, that the unscaled simulations have trends that vary depending on the spatial coverage being applied. For instance, the twentieth-century trends of the well-mixed greenhouse gas simulations are largest over the high latitudes (e.g., Stott and Jones 2009), so analyses that mask that region out of the simulations will reduce the global mean warming trend (crosses in Fig. 7b).
To further explore the role of the different data infilling techniques on the range of attributed trends, we repeated the analyses on the datasets after the same common spatial coverage had been imposed on them (Fig. 2b). The scaling factors for G and OAN still show some variations across the dataset analyses (Fig. 8a), but now the GISTEMP analysis has scaling factors very similar to those of the NOAAGlobalTemp analysis. The scaling factors from the HadCRUT4, CW14, and BerkEarth analyses are now very similar. The G- and OAN-scaled trends are also now more similar than in the previous analyses (Fig. 8b). This is not surprising given the smaller range in scaling factors and there being no variation in the unscaled G and OAN trends for the different dataset analyses.
The variation of results we found when using the alternative observational datasets is in line with what was seen in a similar analysis in Jones and Stott (2011). While some of the differences in attributed trends are due to how the observational datasets are corrected for biases, processed, and so on, much of the variation is due to the coverage differences between the datasets. Dataset creation approaches that infill missing data areas may lend overconfidence to estimates of climate change in regions with no direct measurements when compared with model simulations that have data in those regions. For instance, the CW14 dataset has a trend for the 1906–2005 period that is larger than the HadCRUT4 trend by 0.05 K (100 yr)−1, predominantly resulting from CW14 inferring changes across data-sparse regions. In contrast, when the historicalGHG simulations have the same full coverage as CW14, the multimodel mean trend is 0.14 K (100 yr)−1 larger than if the simulations had the same spatial coverage as HadCRUT4 (Fig. 7). While CW14 uses no observations beyond those used in the creation of HadCRUT4, its use has a disproportionate impact on how much extra information is included from the simulations, and as such care should be taken in the interpretation of data comparisons when using datasets with infilled data areas.
We found that including bias adjustment uncertainties in HadCRUT4 only slightly increased the uncertainties in the assessed forcing contributions to the observed changes from greenhouse gases and the other anthropogenic and natural influences. Our estimate of the correlated and uncorrelated sampling and measurement errors in HadCRUT4 has an even smaller impact on the results. This suggests that an analysis with the bias adjustment ensemble alone could normally be sufficient in a study attempting to include observational uncertainties. However, it is possible that the ad hoc methods we applied to include sampling and measurement uncertainties in this analysis were too conservative. Further work is being done by the team who developed HadCRUT4 to advance the error model, in particular to better understand the spatial and temporal correlation structure of the measurement and sampling uncertainties, so that users can include them appropriately in analyses in as easy and practical a way as possible.
While we found that the spread of results using the different observational datasets is wider than when using the error model of HadCRUT4, other uncertainties and methodological choices have a larger impact on the estimation of the scaled trends. For instance, the choice of climate model used to deduce the response patterns, and of which combinations of forcings to include in the regression, produces scaled G trends that cover wider ranges than those found here (Figs. 11 and 12 in Jones et al. 2013). Gillett et al. (2013) found that incorporating a measure of model uncertainty contributed more to the uncertainty of the scaled trends than including the bias uncertainties of HadCRUT4 in their detection analysis.
This is the most comprehensive inclusion of observational uncertainties on near-surface temperatures in an optimal detection analysis to date. We found that the inclusion of HadCRUT4 observational uncertainties has a small impact on optimal detection analyses of large temporal- and spatial-scale temperature variations, relative to methodological choices and intramodel uncertainties (Jones et al. 2016). This supports the conclusions of other detection studies investigating the impact of observational uncertainties (Hegerl et al. 2001; Jones and Stott 2011; Gillett et al. 2013). The detection results were more sensitive to the choice of alternative dataset than to the quantified uncertainties of HadCRUT4, underlining the importance of structural uncertainty for fully understanding observational uncertainty and the value of multiple independent efforts to create global datasets.
While we found that the impact of including observational uncertainties may be relatively small, on smaller spatial and/or temporal scales (e.g., Karoly and Stott 2006) there is potential for more substantial influence, so observational uncertainties should be considered whenever possible and practical. Even when error models are not available it might be useful for authors to consider simple estimates to see how results could be influenced by observational error for different diagnostics. We would encourage authors to describe what sources of observational uncertainty they have attempted to include and what they have not.
Overall we found that the inclusion of HadCRUT4 uncertainties and dataset structural uncertainty had no effect on the main conclusion that the response to increases in greenhouse gas concentrations produced from human sources dominates warming over the last 100 years or so.
We wish to thank the editor and reviewers for their comments, which led to an improved manuscript. We thank Philip Brohan and Colin Morice for discussions about the uncertainties of HadCRUT4. We thank Peter Stott, Nikos Christidis, and Nathan Gillett for discussions about the inclusion of observational uncertainties in the optimal detection methodology. We thank the many individuals involved in the data collection and creation of the different observational datasets. We acknowledge the Program for Climate Model Diagnosis and Intercomparison and the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP. We thank the climate modeling groups for their considerable effort producing and making available their model output to the scientific community. The CMIP5 data used in this study were obtained from http://cmip-pcmdi.llnl.gov/cmip5/ and were up to date as of March 2013. All model simulations used were “p1” physics versions, and revision numbers of the data retrieved are available from the author on request. The realizations of HadCRUT4 created for this study, which approximately sample the measurement and sampling uncertainties, can be downloaded from http://www.metoffice.gov.uk/hadobs/supporting_information/jones_kennedy_2017.html. The work of the authors was supported by the Joint UK BEIS/Defra Met Office Hadley Centre Climate Programme (GA01101).
Creation of Ensemble Sampling Measurement and Sampling Errors
Here we describe the method used to create an additional ensemble of realizations that approximately samples the measurement and sampling uncertainties of HadCRUT4. As described in section 2a, the uncorrelated errors over land are supplied as fields of standard deviations and the correlated errors over the sea are supplied as covariance matrices for each month, representing cross covariances between grid boxes (Morice et al. 2012). A realization of an ensemble that samples uncorrelated measurement and sampling errors over land can be created by sampling a normal distribution at each grid box and each month and multiplying by the uncorrelated measurement and sampling error standard deviations provided with HadCRUT4. This is then repeated 100 times to build up an uncorrelated measurement and sampling error ensemble.
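A minimal sketch of this land-error sampling follows, with hypothetical standard-deviation fields standing in for those supplied with HadCRUT4:

```python
import numpy as np

rng = np.random.default_rng(4)

n_time, n_lat, n_lon = 12, 36, 72  # one year of monthly 5-degree fields (illustrative)

# Hypothetical stand-in for the HadCRUT4 uncorrelated measurement and
# sampling error standard deviations over land (NaN over sea).
sigma = rng.uniform(0.05, 0.5, size=(n_time, n_lat, n_lon))
sigma[:, :, :36] = np.nan  # pretend the western half of the grid is sea

def land_error_realization(sigma, rng):
    """One realization: an N(0,1) draw per grid box and month, scaled by sigma."""
    return rng.standard_normal(sigma.shape) * sigma

# Repeat 100 times to build the uncorrelated-error ensemble.
ensemble = np.stack([land_error_realization(sigma, rng) for _ in range(100)])
```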
The correlated measurement and sampling errors over the sea require more imaginative methods to create an ensemble, owing to the limited knowledge of the space–time correlation structure. To calculate the area (e.g., global) mean uncertainties, Morice et al. (2012) used an estimate of the autocorrelation structure between monthly area means [see section 5.2 in Morice et al. (2012) for further details]. However, that technique is not easily applicable to creating realizations of spatial and/or temporal fields sampling the correlated measurement and sampling uncertainties. The method we describe here aims to approximate roughly the statistical behavior of the microbias corrections due to the movement of observation platforms across and between grid boxes. Together with using the provided covariance matrices for each month, which give the spatial correlation structure between nearby grid boxes, the aim is to impose an arbitrary but plausible temporal correlation behavior on a time scale of several months. For each nonland grid box a random number is assigned, sampled from a normal distribution. The covariance matrix for that month is interrogated and convolved with the grid of random numbers to produce a grid of temperature anomalies for that month. The reassignment of a grid box’s random number with a new random number for each subsequent month is Bernoulli distributed with probability p = 0.15. This attempts to account for temporal correlations at the gridbox level, and the value of p = 0.15 is equivalent to half the grid boxes being reassigned a new random number at least once in a 4.2-month period. Our choice of p value was an initial guess, but as we found the resulting ensemble had global annual means with statistical properties similar to the Morice et al. (2012) estimate (see below), we did not make any further attempts to explore other possible choices.
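A sketch of this procedure follows. The covariance matrix here is an illustrative stand-in for those supplied with HadCRUT4, and the "convolution" step is approximated by multiplying a Cholesky factor of the monthly covariance by the vector of random numbers, one standard way to realize fields with a prescribed spatial covariance:

```python
import numpy as np

rng = np.random.default_rng(5)

n_boxes, n_months, p = 50, 240, 0.15

# Hypothetical monthly covariance between sea grid boxes: a simple
# distance-decaying form, standing in for the HadCRUT4 matrices.
idx = np.arange(n_boxes)
cov = 0.04 * np.exp(-np.abs(idx[:, None] - idx[None, :]) / 5.0)
L = np.linalg.cholesky(cov)  # L @ z has covariance cov for z ~ N(0, I)

z = rng.standard_normal(n_boxes)  # initial random number per grid box
fields = np.empty((n_months, n_boxes))
for t in range(n_months):
    # Bernoulli reassignment: each box receives a fresh random number with
    # probability p, imposing temporal correlation over a few months.
    redraw = rng.random(n_boxes) < p
    z[redraw] = rng.standard_normal(redraw.sum())
    fields[t] = L @ z  # spatially correlated error field for this month

# With p = 0.15 the median waiting time for a reassignment is
# log(0.5) / log(1 - p), the ~4.2-month figure quoted in the text.
half_life = np.log(0.5) / np.log(1 - p)
```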
This is repeated for each month over the whole period to create a single realization and then repeated 100 times to create an ensemble. The ensemble is then added to the ensemble of uncorrelated land measurement and sampling realizations to produce an ensemble that samples both correlated and uncorrelated measurement and sampling errors. The measurement and sampling uncertainty ensemble can then be added to the ensemble of bias adjustment realizations to produce an overall ensemble sampling bias, measurement, and sampling uncertainties.
Figure A1 shows the annual mean uncertainty ranges (plus or minus two standard deviations) of the Northern Hemisphere, Southern Hemisphere, global, and tropical area means for the bias realization ensemble, the measurement and sampling ensemble, and the combined ensemble. We examine the validity of the choices and assumptions used by comparing the uncertainty spread of the large-spatial-scale annual means of the measurement and sampling ensemble with those estimated in Morice et al. (2012). The estimates of the measurement and sampling uncertainties we calculate in this study are similar to those reported in Morice et al. (2012), despite the quite different approaches, but are somewhat lower during the earlier part of the period (Fig. A1). Our choices also lead to a global annual mean year-to-year correlation of approximately 0.2, only slightly higher than the correlation of 0.18 estimated by Morice et al. (2012). Although the technique we use may slightly underestimate the uncertainty compared to the estimate of Morice et al. (2012), overall we feel the approach is adequate for investigating the impact of the uncertainties additional to the bias adjustment corrections, until a more formal dataset is available.