## Abstract

In an earlier study, the weaker trend in global mean temperature over the past 15 years, relative to the preceding decades, was characterized as significantly lower than the trends contained within phase 5 of the Coupled Model Intercomparison Project (CMIP5) ensemble. In this study, divergence between model simulations and observations is estimated using a fixed-intercept linear trend whose slope estimator has about one-third the noise variance of simple linear regression. Following the approach of the earlier study, in which intermodel spread is used to assess the distribution of trends, but using the fixed-intercept trend metric, we find that recently observed trends in global mean temperature are consistent ($p > 0.1$) with the CMIP5 ensemble for all 15-yr intervals of observation–model divergence since 1970. Significant clustering of global trends according to modeling center indicates that the spread in CMIP5 trends is better characterized using ensemble members drawn across models as opposed to using ensemble members from a single model. Despite model–observation consistency at the global level, substantial regional discrepancies in surface temperature trends remain.

## 1. Introduction

Much attention has been focused on the fact that recent trends in global warming are slower than those predicted in many climate simulations. One class of explanation for this model–data disagreement is that models do not capture internal variations in the surface energy balance (Kosaka and Xie 2013; England et al. 2014), possibly associated with increased deep ocean heat uptake (Meehl et al. 2011; Trenberth and Fasullo 2013). Changes to external radiative forcing specifications could also reduce model warming trends (Solomon et al. 2010; Santer et al. 2014; Schmidt et al. 2014; Huber and Knutti 2014), as would downward revision of a model’s transient climate sensitivity (Otto et al. 2013). Another class of explanation involves changes to global temperature estimates. Inclusion of Arctic surface temperature estimates (Cowtan and Way 2014), revision of temperature buoy offsets (Karl et al. 2015), and adjusting for air–sea temperature differences (Cowtan et al. 2015) can all bring recent observed warming closer to the models. Other studies reconcile observed trends with either statistical properties of individual models (Thorne et al. 2015) or specific phases of modeled internal variability (Risbey et al. 2014) within phase 5 of the Coupled Model Intercomparison Project (CMIP5) ensemble (Taylor et al. 2012).

Both improved mechanistic understanding of decadal temperature variability and more accurate global temperature estimates are of obvious value. There is also utility in addressing whether differences between recent temperature trends and model projections are statistically significant. Findings of significant differences would give grounds for concluding that models are missing major components of internal variability, that data-based estimates are biased, or that other sources of uncertainties are too narrowly construed.

A variety of approaches have been employed in assessing statistical significance of recent warming trends. Rajaratnam et al. (2015) tested whether recent warming rates were slower than those between 1950 and 1997 and found no evidence for significant slowing. Brown et al. (2015) assessed recent trends in global temperature against a combination of model- and empirically derived variability, finding consistency when using CMIP5 representative concentration pathways (RCPs) 4.5 or 6 (Taylor et al. 2012), but they observed that decadal trends fall below the 5th percentile of distributions when using RCP8.5. Fyfe et al. (2013, hereafter Fyfe2013) also assessed observed trends relative to CMIP5 projections and found them to generally reside below the 5th percentile of simulations when using RCP4.5. A similar analysis is presented in box 9.2 of IPCC AR5 (Flato et al. 2013).

To our knowledge, Fyfe2013’s analysis represents the strongest published claim for the statistical significance of the hiatus, and here we take up two major elements of that analysis in further detail. First, as Fyfe2013 document, their results are sensitive to selection of specific intervals. For example, trends in observed global temperature computed using the Hadley Centre/Climatic Research Unit version 4 (HadCRUT4) gridded compilation (Morice et al. 2012) range from 0° to 0.07°C decade^{−1} when started between 1998 and 2002, all ending in 2014. If individual intervals are then examined in isolation, statistical significance varies, with trends indicated as highly anomalous () or consistent with CMIP5 trends (). This sensitivity to interval selection is not surprising given the shortness of the examined trends (Wunsch 1999), but it introduces an element of arbitrariness insomuch as a basis for choosing between results is lacking. Further, finding an interval falling outside of a 95% confidence interval becomes increasingly likely with the number of distinct intervals examined (e.g., Marotzke and Forster 2015).

A second issue is how the model ensemble ought to be statistically interpreted for purposes of constructing a null distribution. A truth-plus-error approach posits temperature trends as involving a deterministic component plus biases, whereas an exchangeable approach posits that actual climate and individual ensemble members share equivalent statistical properties (e.g., Annan and Hargreaves 2010). Although Rougier et al. (2011) showed that these approaches can be statistically equivalent, the implementation of the truth-plus-error approach by Fyfe2013 generally indicates differences between simulations and observations that are significant, whereas the exchangeable implementation only indicates significant differences for some of the most recent intervals considered. Determining which representation of the null is better suited to the present test would also reduce arbitrariness in the interpretation of the results.

In the following, we introduce a more stable metric of divergence in trend between observations and simulations. This metric differs from the typical least squares linear metric because it is fit with a fixed intercept, reducing the added variance from interval selection. With the exception of this modified trend metric, we replicate the hypothesis testing of Fyfe2013 and identify why the null distributions inferred from the truth-plus-error and exchangeable approaches differ in implementation. On these bases, a consistent interpretation emerges whereby no significant difference is found between observed global trends and the CMIP5 ensemble using RCP4.5.

## 2. Data

Observational temperature estimates are from the 5° × 5° HadCRUT4 gridded compilation of instrumental temperatures (Morice et al. 2012). Missing monthly data are infilled using the annual average if at least 10 months of observations are present in the year. Only grid boxes having at least 90% of monthly data coverage between 1950 and 2015 are included, covering 71% of the global surface area. For included grid boxes, data that are still missing are filled with the total time series average. All values are monthly anomalies with respect to the 1950–2015 average seasonal cycle.

Simulations of surface temperature are from the CMIP5 historical ensemble conjoined with matching members from the RCP4.5 ensemble (Taylor et al. 2012; van Oldenborgh 2015). The ensemble comprises 22 modeling centers, 38 models, and 108 simulations (Table 1). Our ensemble differs from that of Fyfe2013 by inclusion of the EC-EARTH and INM-CM4.0 models, addition of 21 NASA Goddard Institute for Space Studies (GISS) ensemble members, and omission of HadCM3 owing to the lack of complete RCP4.5 runs. Models have varying numbers of representative ensemble members, with, for example, NASA GISS contributing 34 ensemble members, CSIRO–Queensland Climate Change Centre of Excellence (QCCCE) contributing 10, and MRI contributing 1. With the exception of two GISS simulations, different simulations from the same model will differ at least in their initialization time within a control run for a given physics parameterization.

Analyses are performed on annual averages, where the July–June year is used in order to better contain ENSO anomalies within a given year, with the year associated with the January–June portion of the average reported. Monthly averages are weighted according to the number of days in a month, which differs across models. Models variously employ the standard Gregorian calendar with leap years, a fixed 365-day no-leap-year calendar, and a fixed 360-day calendar. For all model–data comparisons, simulations were regridded to the observational grid by taking the area-weighted average of simulation grid boxes contained within each uncensored observational grid box. Global mean temperatures are determined as the area-weighted average across all uncensored grid boxes on the native grid of the simulation. Results are unchanged to two significant figures if simulations are instead regridded using linear interpolation.
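The two averaging steps described above can be illustrated with a minimal sketch. It is not the processing code used for the analysis: the function names are hypothetical, `None` is assumed to mark a censored grid box, and the cosine of the box-center latitude is assumed as an approximation to grid-box area on a regular latitude–longitude grid.

```python
import math

# Day counts for July..June in a fixed 365-day no-leap calendar
# (one of the model calendars mentioned in the text).
NOLEAP_JUL_JUN = [31, 31, 30, 31, 30, 31, 31, 28, 31, 30, 31, 30]


def july_june_mean(monthly, days):
    """Day-weighted mean of 12 monthly values running July through June."""
    if len(monthly) != 12 or len(days) != 12:
        raise ValueError("expected 12 monthly values and 12 day counts")
    return sum(v * d for v, d in zip(monthly, days)) / sum(days)


def area_weighted_mean(values, lats):
    """Mean over grid boxes weighted by cos(latitude).

    Assumption: None marks a censored (excluded) grid box, and
    cos(latitude) of the box center stands in for the box area.
    """
    num = 0.0
    den = 0.0
    for v, lat in zip(values, lats):
        if v is None:
            continue
        w = math.cos(math.radians(lat))
        num += w * v
        den += w
    return num / den
```

For a model using a 360-day calendar, the same function would be called with `days=[30] * 12`, so the calendar choice enters only through the weights.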

## 3. Methods

### a. Measuring the divergence

We test the null hypothesis *H*_{0} that recent global temperature trends are consistent with the CMIP5 multimodel ensemble. It is useful to focus on the trend metric used in evaluating *H*_{0} because of its implications for the stability of the test. Global temperature trends are often quantified using an estimate of slope *s* in the following simple linear regression equation:

$$y_t = s\,t + b + \epsilon_t, \tag{1}$$

in which both *s* and *b* are estimated in the least squares sense, by minimizing $\sum_t \epsilon_t^2$. If $\epsilon_t$ is assumed uncorrelated and normally distributed with standard deviation *σ*, the expected variance of the slope estimator is $12\sigma^2/(L^3 - L)$, where *L* is the number of data points that comprise the trend. This formulation is common (e.g., Thompson et al. 2015) and derived in appendix B.

Many statistical models can be fit to quantify trends (e.g., Visser et al. 2015), and we consider further formulations according to several characteristics: similarity to foregoing approaches [i.e., Eq. (1)], suitability for describing previous agreement but recent divergence between models and data, and low sensitivity to choice of interval. Although not considered by Visser et al. (2015), a piecewise fit to a time series of the difference between two temperature estimates appears apt under these criteria. Specifically, the difference between two time series is fit using a constant offset *c* followed by a linear trend *δ* that is piecewise continuous:

$$y_t = \begin{cases} c + \epsilon_t, & t \le t_0, \\ c + \delta\,(t - t_0) + \epsilon_t, & t > t_0. \end{cases}$$

The constant *c* is estimated as the average of $y_t$ for $t \le t_0$, and *δ* is estimated in a least squares sense, as for *s*. Estimates of *δ* differ from *s* in having an intercept fixed at ($t_0$, *c*), which acts as a hinge point preceded by a constant and followed by a trend diverging from that constant. Different sets of time series used to calculate $y_t$ are defined in the context of various tests that follow.
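As an illustration of the two trend metrics, the following minimal sketch fits the free-intercept slope *s* and the fixed-intercept slope *δ* to a list of annual values with unit time steps. The function names `ols_slope` and `hinge_slope` are hypothetical, and the hinge point is assumed to be the last point of the segment used to compute *c*.

```python
def ols_slope(y):
    """Least squares slope s with a free intercept, for unit time steps."""
    n = len(y)
    k_bar = (n - 1) / 2
    num = sum((k - k_bar) * v for k, v in enumerate(y))
    den = sum((k - k_bar) ** 2 for k in range(n))
    return num / den


def hinge_slope(y, n_trend):
    """Fixed-intercept slope delta over the last n_trend points of y.

    c is the mean of every point up to and including the hinge point;
    delta is the least squares slope of a line forced through (hinge, c).
    Returns the pair (c, delta).
    """
    base = y[:len(y) - n_trend]
    c = sum(base) / len(base)
    tail = y[len(y) - n_trend:]
    num = sum(k * (v - c) for k, v in enumerate(tail, start=1))
    den = sum(k * k for k in range(1, n_trend + 1))
    return c, num / den
```

For a 64-yr series and a 15-yr interval, `hinge_slope(y, 14)` uses the first 50 values for *c* and the final 14 values for *δ*, whereas `ols_slope(y[-15:])` gives the corresponding *s*.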

The expected variance of *δ* is smaller than that of *s* when each is fit to the same time series. Assuming that the residuals of the two fits have equivalent distributions and that the variance of *c* is small, on the basis of its being constrained by a relatively long sequence of $y_t$, permits writing $\mathrm{var}(\delta) \approx 6\sigma^2/[L(L-1)(2L-1)]$. The ratio of variances between *δ* and *s* is then $(L+1)/(4L-2)$. Appendixes A and B give derivations of these variances and further calculations involving variance contributions from *c*. We find that *δ* has 0.35 times the variance of *s* when applied to global temperature trends over 15-yr intervals.

Importantly, *δ* is also more stable than *s* across application to different intervals. The variance of the difference in trends fit to *L*-length intervals with consecutive start years using *δ*, relative to that for *s*, is given by $L(L+1)/[2(2L-1)^2]$. This ratio is 0.143 for an interval length of $L = 15$. Contributions from variance in *c* are neglected in the foregoing expression but are minor as long as the interval over which *c* is calculated ($t \le t_0$) exceeds that over which *δ* is defined ($t > t_0$; see appendix C). The lower variance of *δ* associated with interval selection reduces the potential for false positives that otherwise occur when conducting multiple tests for trend significance over various intervals.

Stability of the *δ* estimator is also empirically indicated by its varying more smoothly as a function of start year (Fig. 1). When $y_t$ is defined as the difference between HadCRUT4 global average temperature and the CMIP5 ensemble average, values of *s* equal , , and C decade^{−1} for 1998–2014, 2000–14, and 2002–14 (Fig. 1b). In contrast, estimates of *δ* fit to the same time series change monotonically when computed over the same intervals, having values of *δ* equal to , , and C decade^{−1} (Fig. 1c). Unless explicitly indicated otherwise, regional and global estimates of CMIP5 ensemble average temperature are always computed as the average across equally weighted modeling centers and include only grid boxes corresponding to observations.

Computing *δ* on a gridbox basis—again using CRU observations relative to the CMIP5 ensemble average—shows a coherent pattern of cooling in the eastern Pacific consistent with the negative phase of the Pacific decadal oscillation (PDO; Zhang et al. 1997; Fig. 2). Further, a cooling trend that is prominent in the midlatitudes of Eurasia when using *s* is suppressed when using *δ*, consistent with findings that these trends result from short-term internal variability (Li et al. 2015; Cohen et al. 2012) and that *δ* is less volatile. That *δ* yields more physically interpretable patterns, to which we return later, also supports its being a more suitable metric for interpreting divergence in recent trends. From both theoretical variance properties and empirical application, we find *δ* to be less volatile than *s* and, therefore, expect it to yield more consistent results when applied in testing for model–observational discrepancy.

### b. Formulating a null hypothesis

We evaluate a null hypothesis *H*_{0} that recent global temperature trends are consistent with the CMIP5 multimodel ensemble using the *δ* metric. Contributions to uncertainty in comparing observations and models are captured through combining different realizations of three time series: *A*, a version of global mean temperature anomalies from observations; *B*, a version of global mean temperature anomalies averaged across the CMIP5 ensemble; and *C*, a time series representing the variability associated with an individual CMIP5 simulation. The null hypothesis *H*_{0} is assessed by calculating the degree to which the distribution of trends derived from realizations of the time series combining *A*, *B*, and *C* contains zero and, therefore, reflects consistency between models and observations. This hypothesis test is modeled exactly upon that of Fyfe2013 in order to allow for direct comparison of results. We evaluate 15-yr trends, as this was the length chosen by Fyfe2013 and Marotzke and Forster (2015), and Fyfe2013 rejected *H*_{0} using *s* for the 15-yr interval 1998–2012.

Realizations of global temperature from observations *A* have uncertainties that include observational noise, issues associated with computing global averages from a limited network, and systematic errors from switching between observing methods. These uncertainties are expressed for the HadCRUT4 observations through an ensemble with 100 members, each perturbed with noise realizations (Morice et al. 2012). The mean and standard deviation of *s* computed between 2000 and 2014 from the HadCRUT4 observational ensemble is 0.076° ± 0.006°C decade^{−1}. Fyfe2013 compute realizations of observational trends by averaging over 100 draws of the HadCRUT4 ensemble taken with replacement. This approach suppresses the standard deviation of *s* by a factor of 10 and, in our view, seems unwarranted since each ensemble member is meant to indicate a plausible realization. Furthermore, note that the bias correction for ocean buoy data suggested by Karl et al. (2015) results in a trend estimate of C decade^{−1} between 2000 and 2014 that exceeds all HadCRUT4 ensemble trends, suggesting that the uncertainty estimates in the ensemble are too narrow or that the correction of Karl et al. (2015) is too large. A question noted, albeit not otherwise addressed here, is whether the observational record of surface temperature is known with sufficient accuracy to provide a stringent test of the climate models. We proceed using Fyfe2013’s estimate of temperature trends and their spread from the uncorrected HadCRUT4 ensemble in order to illustrate that it is difficult to reject *H*_{0} even under Fyfe2013’s representation of low uncertainty.

Realizations of global temperature across the CMIP5 ensemble *B* are obtained by sampling the 38 CMIP5 models with replacement, randomly selecting an ensemble member for each, and taking the average across samples. This approach gives equal weight to each model, as opposed to weighting according to the number of submitted ensemble members. Averaging across models in order to obtain a more stable estimate is more defensible than averaging across HadCRUT4 temperature realizations because models are developed semi-independently from one another and contain independent realizations of internal variability, whereas there is only a single set of temperature observations. There is evidence, however, that the spread across various model ensembles is suppressed (Huybers 2010; Masson and Knutti 2011), possibly because of anchoring effects or suppression of outliers (Cess et al. 1996).

Finally, a realization of the variability associated with a simulation *C* is obtained by sampling ensemble members in a similar manner as for *B* and then computing the difference between a single one of the sampled ensemble members and the average across the sample. In accord with the statistical approach that models and observations are exchangeable realizations of climate (Annan and Hargreaves 2010), the observations, represented by a single sample from the HadCRUT4 ensemble, are afforded the weight of one model and pooled with the 38 CMIP5 models for a total of 39 sampled units. In practice, however, we find that the inclusion of observations in this sampling process does not affect estimates of statistical significance.
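The sampling of *B* and *C* described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: `models` is a hypothetical dictionary mapping each model name to a list of equal-length ensemble-member time series, and all function names are hypothetical. Pooling the observations as one additional unit would amount to adding an entry such as `"obs": [obs_series]` before calling `realization_C`.

```python
import random


def sample_members(models, rng):
    """Draw len(models) models with replacement, then one member from each."""
    names = list(models)
    picks = [rng.choice(names) for _ in names]
    return [rng.choice(models[name]) for name in picks]


def mean_series(series_list):
    """Pointwise average of equal-length time series."""
    n = len(series_list)
    return [sum(vals) / n for vals in zip(*series_list)]


def realization_B(models, rng):
    """Ensemble-average series in which each sampled model has equal weight."""
    return mean_series(sample_members(models, rng))


def realization_C(models, rng):
    """One sampled member minus the average across the same sample."""
    sample = sample_members(models, rng)
    avg = mean_series(sample)
    member = rng.choice(sample)
    return [m - a for m, a in zip(member, avg)]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a set of realizations reproducible.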

Each realization of the combined time series then describes a departure of observed temperatures *A* from a CMIP5 average *B* with variability associated with a single ensemble member *C*. The distribution under *H*_{0} is estimated from trends fit to realizations of the combined time series, and *H*_{0} is rejected if fewer than 5% of realizations are greater than or less than zero; that is, tests are performed as two sided at the $p = 0.1$ level.
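The rejection rule can be written compactly once a sample of trend realizations is in hand. The sketch below assumes the trends have already been fit to each realization; the function name and the `tail` parameter are hypothetical.

```python
def two_sided_reject(trends, tail=0.05):
    """Apply the two-sided rejection rule to a sample of trend realizations.

    H0 is rejected when fewer than `tail` of the realizations lie on
    either side of zero, which corresponds to a level of 2 * tail.
    Returns (reject, fraction_above_zero, fraction_below_zero).
    """
    n = len(trends)
    above = sum(1 for t in trends if t > 0) / n
    below = sum(1 for t in trends if t < 0) / n
    return min(above, below) < tail, above, below
```

With the default `tail=0.05`, a sample in which 3% of trends are positive leads to rejection, whereas one in which 10% are positive does not.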

### c. Clustering and rejection of an alternate test approach

Fyfe2013 also employ a methodology loosely motivated by the truth-plus-error framework, where *B* is meant to represent true climate and a separate quantity represents internal variability. Following Fyfe2013, we realize this quantity by selecting one of the 13 CMIP5 models associated with more than one ensemble member and computing the difference between one of these ensemble members and the average time series of that model. This approach yields a narrower distribution of trends and more frequent rejection of the null hypothesis than that based on the exchangeable hypothesis. The origin of this discrepancy becomes evident when assessing whether simulations cluster according to model.

Clustering between members of a sample tends to narrow the inferred distribution of trends, if not otherwise accounted for (Pennell and Reichler 2011). Several studies document correlations of temperature or precipitation patterns according to model (Masson and Knutti 2011; Knutti and Sedlácek 2013; Sanderson et al. 2015). We use an Ansari–Bradley rank-sum approach (Ansari and Bradley 1960) to specifically test whether the values of *δ* cluster according to model. Values of *δ* are assessed for each ensemble member relative to the CMIP5 ensemble average temperatures for 2000–14. The test is one sided and performed at the level.

Of the 16 modeling centers that contribute multiple ensembles, 7 centers comprising a total of 58 ensemble members each have significantly smaller dispersion than the remainder of the ensemble at the level (Table 2). The 34 ensemble members contributed by NASA GISS show particularly significant clustering at . A bootstrapped version of the Ansari–Bradley rank-sum test that accounts for the correlation between sample medians gives equivalent results. Figure 3 also illustrates this clumping of trends according to modeling center and that the empirical histogram of trends is broader when each modeling center is equally weighted.
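To make the dispersion test concrete, the following sketch computes the Ansari–Bradley statistic for two samples and attaches a one-sided *p* value by random permutation. The permutation step is a stand-in chosen for simplicity and is not the reference distribution or the bootstrap used in the analysis; ties are ignored, equal medians are assumed, and the function names are hypothetical.

```python
import random


def ab_statistic(x, y):
    """Ansari-Bradley statistic: sum over x of min(rank, N + 1 - rank).

    Ranks are taken in the pooled sample; large values indicate that
    x sits near the middle of the pooled sample (smaller dispersion).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    return sum(min(r, n + 1 - r)
               for r, (_, group) in enumerate(pooled, start=1)
               if group == 0)


def ab_permutation_p(x, y, n_perm=2000, seed=0):
    """One-sided p value that x is less dispersed than y.

    Assumption: group labels are exchangeable under the null, so the
    p value is estimated by shuffling the pooled sample.
    """
    rng = random.Random(seed)
    observed = ab_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ab_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `x` would hold the *δ* values of the ensemble members from one modeling center and `y` those of the remainder of the ensemble.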

Smaller dispersion of members from a single model relative to those from across models is not surprising considering that different models may include different physics (e.g., Watanabe et al. 2012), entail different parameterizations (e.g., Collins et al. 2011), and contain different forcing (e.g., Tebaldi and Knutti 2007). Smaller dispersion explains why the null distribution estimated using the truth-plus-error quantity, which depends on intramodel differences, is narrower than that estimated using *C*, which depends on intermodel differences. Simulations may also cluster along axes not entirely described according to model center—inclusive of aspects of physics, parameterization, and forcing—though such dependencies are not directly relevant to the distinction between the *C* and truth-plus-error null models considered here. Our view is that evaluation of whether observations are consistent with simulations should include all relevant sources of uncertainty in model simulations and that the intermodel comparisons associated with *C* are more representative of this uncertainty. In the following sections, we thus rely exclusively on the exchangeable approach in order to gauge consistency between CMIP5 simulations and observations.

## 4. Results

We are systematically unable to reject *H*_{0} for all 15-yr intervals that we examine, with start years ranging between 1970 and 2000 ($p > 0.1$; Fig. 4). Note that this failure to reject is obtained using a *p* value of 0.1, as opposed to 0.05, and using the average across HadCRUT4 ensemble members in estimating *A*, both of which make it easier to reject *H*_{0}. The differences between our results and those of Fyfe2013 relate to our use of *δ* instead of the more volatile metric *s* and our employing only the exchangeable approach *C* as opposed to the more narrowly distributed intramodel variability obtained from a truth-plus-error approach.

Although we do not favor other approaches, for completeness we note that if instead a combination of *C* and *s* were employed, results would be more variable and rejection of *H*_{0} would become almost, but still not quite, possible for the 1998–2012 interval. Using the truth-plus-error variability and *δ* would lead to rejections for 15-yr intervals starting between 1998 and 2000. Finally, using the truth-plus-error variability and *s* would lead to intermittent rejection of four different 15-yr intervals, starting at 1992, 1998, 1999, and 2000 (Fig. 4a), consistent with the combined effects of a null that is more narrowly distributed and a metric that is more variable between different fitted intervals. We note that the variability of *p* values across different 15-yr intervals is smooth in Fig. 2d of Fyfe2013, which we have not been able to reproduce.

Our basic result, that *H*_{0} cannot be rejected using *C* and *δ*, is insensitive to three other relevant variants. First, Fyfe2013 do not include INM-CM4.0 in their ensemble, and the single ensemble member associated with this model shows one of the most negative *δ* trends. Whereas repeating our test excluding this model lowers estimated *p* values, they nowhere become lower than 0.1.

Second, if *C* is interpreted as uncertainty that is equally applicable to the observations, represented by *A*, and to the CMIP5 ensemble average, represented by *B*, it appears arbitrary whether it is added to or subtracted from the quantity . However, *C* is asymmetric with a positive skew toward larger values (Fig. 4) such that *p* values are smaller when *C* is added, though again never lower than 0.1. It is unclear whether this asymmetry is indicative of physical processes that make positive anomalies from the mean more likely than negative ones, as implied by the asymmetric nature of feedback sensitivity (Roe and Baker 2007), or merely results from the small sample population.

Finally, when using the spatially interpolated HadCRUT4 produced by Cowtan and Way (2014) and the spatially complete estimates of global mean temperatures for the CMIP5 ensemble, it becomes even more difficult to reject *H*_{0} using *s* and *δ*. These latter results are expected given the rapid warming in polar regions (Cowtan and Way 2014).

## 5. Discussion and conclusions

Meehl et al. (2013) demonstrated that intervals of slow temperature rise in CCSM4 RCP4.5 projections also generally feature negative PDO patterns along with anomalously positive rates of ocean heat uptake. In our results, regional *δ* between CRU observations and the CMIP5 ensemble average show a clear negative phase of the PDO (Fig. 2b). Furthermore, the simulation having the second-closest global *δ* to the observations (ensemble member 24; see Table 1) shows a regional *δ* pattern resembling the negative phase of the PDO (Fig. 5a), suggesting that at least some ensemble members produce cooling for physically similar reasons. A systematic exploration of the manifestation of the PDO in each ensemble member, however, reveals no clear relationship between global values of *δ* and the pattern or phase of the PDO.

Rather than providing an explanation in terms of the PDO, ensemble members with global *δ* values similar to the observations generally have anomalously low temperature trends at high northern latitudes over continents (e.g., CSIRO–QCCCE model numbers 24 and 32; Figs. 5a,b). This congruence between regional and global *δ* is consistent with findings that cooling trends across northern regions are tied to the slowdown in global temperature trends (Cohen et al. 2012). The presence of such negative regional *δ* values within the ensemble follows from northern continental regions having high variance in *δ* across ensemble members (Fig. 5c). High variance in northern continental regions has been found in other simulations (e.g., Deser et al. 2012) and is presumably associated with positive feedbacks that amplify arctic warming (e.g., Feldl and Roe 2013) and low thermal buffering at high northern regions (e.g., Kim and North 1991). Negative trends in northern regions are, however, inconsistent with the observed neutral or positive *δ* values found at most high latitudes in observations (Fig. 2).

Further discrepancies exist in the eastern equatorial Pacific, where CMIP5 trends are uniformly higher than the observations across all ensemble members (Fyfe and Gillett 2014), a conclusion that holds no matter how the slope or its significance is determined. For example, although ensemble member 24 produces a North Pacific pattern of divergence that is unusually consistent with the observations, there is disagreement in *δ* values in the eastern equatorial Pacific, where the simulation has a more positive trend than in the observations. Kosaka and Xie (2013) show that restoring a model toward observed temperatures in the eastern equatorial Pacific results in global temperature variations very similar to observations, along with regional warming in arctic Eurasia and northeastern Canada similar to observations. Ding et al. (2014) demonstrate in a slab ocean model that a pattern of tropical temperature trends involving cooling in the eastern equatorial Pacific leads to increased warming focused in northeastern Canada as well as Greenland, also consistent with observed trends.

The apparent importance of the eastern equatorial Pacific for governing global temperature and the clear model–observation discrepancies in this region creates some tension with our conclusion that CMIP5 simulations are consistent with observations at the global scale. Indeed, the observed pattern of strong negative divergence in the tropics and positive divergence at high northern latitudes is not found among CMIP5 ensemble members (Fig. 6). Ensemble members with the most negative global *δ* values show negative divergences both in the tropics and at high northern latitudes. Thus, although the CMIP5 ensemble contains recent global temperature trends similar in magnitude to the observations, they are composed of differing regional patterns.

Our finding of global consistency but regional discrepancy between simulations and observations also reflects findings using earlier periods of the observational record. Examination of SST variability showed that, whereas global-scale decadal variability is consistent, decadal SST variability in 5° × 5° gridded observations is significantly larger than that found at comparable spatial scales in CMIP5 (Laepple and Huybers 2014a). Examinations of decadal variability at the scale of the eastern equatorial Pacific, however, show consistency between CMIP5 and observations (Ault et al. 2013; Fyfe and Gillett 2014). Further study of these issues is warranted to include how simulations and observations compare as a function of spatial scale (e.g., Stott and Tett 1998), how irregular sampling and noise influence estimates of SST variability (e.g., Laepple and Huybers 2014b), and how model specification and resolution influence simulated variability (e.g., Stammer 2005).

Our results differ from Fyfe2013 in that we find no significant differences between global-scale temperature trends and those in the CMIP5 ensemble across all 15-yr intervals since 1970. Given that we otherwise replicate the hypothesis testing of Fyfe2013, the stability of our results comes from using a metric of divergence in trend that pivots from a long-term mean as well as exclusive use of a null distribution that accounts for intermodel spread. This reevaluation brings comparison of observed and CMIP5 trends into agreement with other analyses (e.g., Brown et al. 2015). At some level, uncovering flaws in models on the basis of observations would be an important scientific accomplishment, demonstrating the capacity to test climate model predictions and helping to focus future work, but our findings for global mean temperature demonstrate mere consistency. Regional discrepancies, however, highlight the continued utility of improving observational estimates, developing techniques for better comparing observations and models, and continued inquiry into the causes of regional trends.

## Acknowledgments

We thank Geert Jan van Oldenborgh for facilitating access to CMIP5 simulation output through the KNMI climate explorer website and are grateful for comments provided by Lauren Kuntz, Thomas Laepple, Karen McKinnon, Cristian Proistosescu, Andrew Rhines, Daniel Schrag, Eric Stansifer, and Giuseppe Torri. Funding was provided by NSF Award 1304309 and an NSF Graduate Research Fellowship.

### APPENDIX A

#### Derivation of Trend Estimators

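Estimators for the simple linear regression slope *s* and intercept *b* are derived from the linear equation, where $y_k$ and $\epsilon_k$ are random variables:

$$y_k = s\,k + b + \epsilon_k.$$

To simplify the subsequent algebraic expressions, *N* is defined as the number of data points after *t*_{0}, and *L* = *N* + 1. For a trend of length *L* years, *s* is fit to $y_0, y_1, \ldots, y_N$. The sum of the residual variance is defined as follows:

$$\sum_{k=0}^{N} \epsilon_k^2 = \sum_{k=0}^{N} (y_k - s\,k - b)^2.$$

Setting the partial derivative with respect to *s* equal to zero,

$$\frac{\partial}{\partial s} \sum_{k=0}^{N} \epsilon_k^2 = -2 \sum_{k=0}^{N} k\,(y_k - s\,k - b) = 0,$$

which yields the following expression for *s*:

$$s = \frac{\sum_{k=0}^{N} k\,(y_k - b)}{\sum_{k=0}^{N} k^2}. \tag{A4}$$

An expression for *b* follows similarly:

$$b = \frac{1}{N+1} \sum_{k=0}^{N} (y_k - s\,k) = \bar{y} - s\,\bar{k},$$

where $\bar{y}$ and $\bar{k} = N/2$ are the averages of $y_k$ and $k$ over the interval. Eliminating *b* between these two expressions gives $s = \sum_{k=0}^{N} (k - \bar{k})\,y_k \big/ \sum_{k=0}^{N} (k - \bar{k})^2$.

A similar approach is used to find *δ*, where $y_k$ and $\epsilon_k$ are random variables:

$$y_k = \delta\,k + c + \epsilon_k.$$

Because the intercept at *k* = 0 is fixed at *c*, *δ* is computed using $y_1, y_2, \ldots, y_N$. The sum of the residual variance is defined as follows:

$$\sum_{k=1}^{N} \epsilon_k^2 = \sum_{k=1}^{N} (y_k - \delta\,k - c)^2.$$

Analogous to the derivation of *s* [Eq. (A4)], *δ* can be expressed in terms of *c*:

$$\delta = \frac{\sum_{k=1}^{N} k\,(y_k - c)}{\sum_{k=1}^{N} k^2}. \tag{A10}$$

Equation (A10) can be rewritten to include the computation of *c*. In the following, indexing of $y_k$ is shifted by *M* − *N* to include the interval over which *c* is computed, where *M* is the total length of the time series $y_k$. We define $c = \frac{1}{M-N} \sum_{j=1}^{M-N} y_j$, and substituting this into *δ* gives the following:

$$\delta = \frac{\sum_{k=1}^{N} k \left( y_{k+M-N} - \frac{1}{M-N} \sum_{j=1}^{M-N} y_j \right)}{\sum_{k=1}^{N} k^2}.$$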

### APPENDIX B

#### Derivation of Trend Estimator Variances

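The variance of the trend estimator given in Eq. (A4), after eliminating *b*, can be expressed as follows:

$$\mathrm{var}(s) = \mathrm{var}\!\left[ \frac{\sum_{k=0}^{N} (k - \bar{k})\,y_k}{\sum_{k=0}^{N} (k - \bar{k})^2} \right].$$

This equation can be substantially simplified. Defining the variance of $y_k$ as $\sigma^2$ and noting that $\bar{k}$ is the average $N/2$,

$$\mathrm{var}(s) = \frac{\sigma^2}{\sum_{k=0}^{N} (k - N/2)^2}.$$

Further simplification comes from the equality $\sum_{k=0}^{N} (k - N/2)^2 = \sum_{k=0}^{N} k^2 - (N+1)N^2/4$, yielding

$$\mathrm{var}(s) = \frac{\sigma^2}{\sum_{k=0}^{N} k^2 - (N+1)N^2/4}.$$

Finally, the integer addition identities $\sum_{k=1}^{N} k = N(N+1)/2$ and $\sum_{k=1}^{N} k^2 = N(N+1)(2N+1)/6$ allow for writing

$$\mathrm{var}(s) = \frac{12\sigma^2}{N(N+1)(N+2)} = \frac{12\sigma^2}{L^3 - L}.$$

An expression can similarly be derived for the *δ* estimator given in Eq. (A10) if the variance of the intercept *c* is assumed negligible:

$$\mathrm{var}(\delta) = \frac{\sigma^2}{\sum_{k=1}^{N} k^2} = \frac{6\sigma^2}{N(N+1)(2N+1)} = \frac{6\sigma^2}{L(L-1)(2L-1)}.$$

The ratio of variances between the fixed-intercept and standard trend estimators is given by

$$\frac{\mathrm{var}(\delta)}{\mathrm{var}(s)} = \frac{L+1}{4L-2}.$$

If, instead, the variance of the intercept estimator *c* is accounted for, the overall variance associated with *δ* is given by

$$\mathrm{var}(\delta) = \frac{6\sigma^2}{L(L-1)(2L-1)} + \frac{9\sigma^2}{(2L-1)^2 (M-L+1)},$$

and the variance ratio becomes

$$\frac{\mathrm{var}(\delta)}{\mathrm{var}(s)} = \frac{L+1}{4L-2} + \frac{3L(L^2-1)}{4(2L-1)^2 (M-L+1)}.$$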

For the 2000–14 trend examined in the main text, *L* = 15 and *M* = 64, which gives a variance ratio of 0.336. This ratio is only slightly higher in the presence of autocorrelation. For example, numerical simulations wherein the residuals of both fits are represented as a first-order autoregressive process, fit to each of the time series calculated from the CMIP5 ensemble, give an average variance ratio of 0.349.
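The quoted ratio can be checked without simulation. Both estimators are linear in the data, so for uncorrelated noise of equal variance each estimator variance is $\sigma^2$ times the sum of its squared weights. The sketch below follows that reasoning with exact rational arithmetic; the function names are hypothetical, and *c* is assumed to be the mean of the *M* − *N* points up to and including the hinge point.

```python
from fractions import Fraction


def s_weights(L):
    """Weights w_k such that s = sum(w_k * y_k) over an L-point interval."""
    k_bar = Fraction(L - 1, 2)
    den = sum((k - k_bar) ** 2 for k in range(L))
    return [(k - k_bar) / den for k in range(L)]


def delta_weights(L, M):
    """Weights on all M points for delta, including the estimate of c."""
    N = L - 1
    P = M - N                                  # points averaged to obtain c
    den = sum(k * k for k in range(1, N + 1))
    sum_k = sum(range(1, N + 1))
    base = [Fraction(-sum_k, P * den)] * P     # contribution through c
    trend = [Fraction(k, den) for k in range(1, N + 1)]
    return base + trend


def variance_ratio(L, M):
    """var(delta) / var(s) for uncorrelated noise of equal variance."""
    var_delta = sum(w * w for w in delta_weights(L, M))
    var_s = sum(w * w for w in s_weights(L))
    return var_delta / var_s


print(float(variance_ratio(15, 64)))
```

For *L* = 15 and *M* = 64 the printed value should round to 0.336, in agreement with the ratio stated above.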

### APPENDIX C

#### Derivation of Variances of Trend Differences between Intervals with Consecutive Start Years

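Our choice of the *δ* estimator is guided by its greater stability than *s* between trend estimates of same-length intervals with consecutive start years. Stability can be demonstrated analytically by comparing the variances of the differences in trends between overlapping intervals offset by one year. The difference in consecutive trends *s* is first treated, where *s*^{+} corresponds to an interval that increments all years of *s* by one:

$$s^{+} - s = \frac{\sum_{k=0}^{N} (k - N/2)\,(y_{k+1} - y_k)}{\sum_{k=0}^{N} (k - N/2)^2}.$$

Applying the variance operator and foregoing integer addition identities gives the following:

$$\mathrm{var}(s^{+} - s) = \frac{72\sigma^2}{L^2 (L^2 - 1)}.$$

The variance of consecutive *δ* slope estimates can be determined analogously if *c* is assumed the same for both intervals:

$$\mathrm{var}(\delta^{+} - \delta) = \mathrm{var}\!\left[ \frac{\sum_{k=1}^{N} k\,(y_{k+1} - y_k)}{\sum_{k=1}^{N} k^2} \right] = \frac{36\sigma^2}{L(L-1)(2L-1)^2}.$$

The ratio of the two variances is then given by

$$\frac{\mathrm{var}(\delta^{+} - \delta)}{\mathrm{var}(s^{+} - s)} = \frac{L(L+1)}{2(2L-1)^2},$$

having a value of 0.143 for *L* = 15.

Including the variance contributions from estimating *c* over an interval of length *M* − *N* leads to a longer expression, where *δ* and $\delta^{+}$ are estimated over $y_1, \ldots, y_M$ and $y_2, \ldots, y_{M+1}$, respectively:

$$\delta^{+} - \delta = \frac{1}{\sum_{k=1}^{N} k^2} \left[ \sum_{k=1}^{N} k\,(y_{k+M-N+1} - y_{k+M-N}) - \frac{\sum_{k=1}^{N} k}{M-N}\,(y_{M-N+1} - y_1) \right].$$

The variance of this difference is given by

$$\mathrm{var}(\delta^{+} - \delta) = \frac{36\sigma^2}{L(L-1)(2L-1)^2} \left( 1 + \frac{1}{M-N} \right) + \frac{18\sigma^2}{(2L-1)^2 (M-N)^2}.$$

Note that the interval over which *c* is computed for $\delta^{+}$ is equivalent in length but incremented by one year relative to that of *δ*. The variance ratio is then given by

$$\frac{\mathrm{var}(\delta^{+} - \delta)}{\mathrm{var}(s^{+} - s)} = \frac{L(L+1)}{2(2L-1)^2} \left( 1 + \frac{1}{M-N} \right) + \frac{L^2 (L^2 - 1)}{4(2L-1)^2 (M-N)^2},$$

and is only slightly greater at 0.151 than when variance in *c* is neglected, given *M* = 64 and *L* = 15. The greater stability between estimates associated with *δ* reduces the potential for multiple tests involving different intervals to produce false positives.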
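The two stability ratios can also be evaluated directly from the estimator weights. Shifting an interval by one year shifts the weights by one position, so the variance of the difference between consecutive estimates is $\sigma^2$ times the sum of squared differences between the shifted and unshifted weights. The sketch below assumes uncorrelated noise of equal variance; the function names and the `base_len` parameter (the number of points used to compute *c*, with 0 meaning *c* is treated as known) are hypothetical.

```python
from fractions import Fraction


def s_weights(L):
    """Weights w_k such that s = sum(w_k * y_k) over an L-point interval."""
    k_bar = Fraction(L - 1, 2)
    den = sum((k - k_bar) ** 2 for k in range(L))
    return [(k - k_bar) / den for k in range(L)]


def delta_weights(L, base_len):
    """Weights for delta; base_len points define c (0 treats c as known)."""
    N = L - 1
    den = sum(k * k for k in range(1, N + 1))
    trend = [Fraction(k, den) for k in range(1, N + 1)]
    if base_len == 0:
        return trend
    sum_k = sum(range(1, N + 1))
    return [Fraction(-sum_k, base_len * den)] * base_len + trend


def shift_variance(w):
    """Variance, in units of sigma^2, of the change in a linear estimator
    when its window is advanced by one time step."""
    return sum((a - b) ** 2 for a, b in zip([0] + w, w + [0]))


def stability_ratio(L, base_len=0):
    """var(delta+ - delta) / var(s+ - s) for consecutive start years."""
    return (shift_variance(delta_weights(L, base_len))
            / shift_variance(s_weights(L)))


print(float(stability_ratio(15)), float(stability_ratio(15, 50)))
```

With *L* = 15, the first printed value should round to 0.143 and the second, which uses the 50 points available for *c* when *M* = 64, to about 0.15.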

## REFERENCES

Flato, G., and Coauthors, 2013: Evaluation of climate models. *Climate Change 2013: The Physical Science Basis*, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.