## Abstract

Climate time series often contain artificial shifts induced by instrumentation changes, station relocations, observer changes, and the like; they frequently exhibit long-term trends as well. Much of the recent literature has focused on identifying the structural breakpoint time(s) of climate time series—the so-called changepoint problem. Unfortunately, application of rudimentary mean-shift changepoint tests to scenarios with trends often leads to the erroneous conclusion that a mean shift occurred near the series' center. This paper examines this problem in detail, constructing some simple homogeneity tests for series with trends. The asymptotic distribution of the proposed statistic is derived; en route, an attempt is made to unify the asymptotic properties of the changepoint methods used in today's climate literature. The tests presented here are linked to the ubiquitous *t* test. Application is made to two temperature records: 1) the continental United States record and 2) a local record from Jacksonville, Illinois.

## 1. Introduction

Changepoint features can drastically alter inferences made from a climatic time series. For example, Fig. 1 shows how trend estimates change when changepoint information is incorporated or neglected. This example considers annually averaged temperatures recorded at New Bedford, Massachusetts, from 1812 to 1999. The figure reports two statistical regression fits: 1) a line, which has a positive slope of 1.366°C century^{−1}, and 2) a “piecewise line” that allows for four mean shifts at times of known gauge changes and station relocations (these times are taken from the station's metadata and occur in 1888, 1906, 1951, and 1985). The slope in each segment of the piecewise line is constrained to be the same and is estimated at −0.234°C century^{−1}. Obviously, trend inferences are affected by changepoints.

The above data are typical of climate time series in that they have a time trend and multiple mean-shift changepoints that act to influence overall trend inferences. The four changepoints here are viewed as artificial as they are induced by changes in data collection (gauge and location changes). Methods that adjust data for artificial shifts of this type are called homogenization techniques. The homogenization changepoint problem is well known in the climate literature; numerous authors have presented changepoint tests for the case of a single mean shift when the series has no trends. A partial list of references for this task includes Page (1955), Hawkins (1977), Potter (1981), Alexandersson (1986), Easterling and Peterson (1995), Vincent (1998), Ducré-Robitaille et al. (2003), and Beaulieu et al. (2012). Some of this literature is reviewed in Reeves et al. (2007).

Homogenizing series with long-term trends is harder. Some methods use information from one or more nearby data collection stations. A station used in this manner is commonly referred to as a reference station. The most common way of using a reference series is to difference the target record with the corresponding values from the reference. In many cases, the resulting time series still has a trend. In addition to the data homogenization problem, some climatologists have considered detecting abrupt naturally caused changes in a climate time series. References include Tomé and Miranda (2004), Rodionov (2004), Seidel and Lanzante (2004), and Gallagher et al. (2012). The methods here cannot distinguish between natural and artificial changepoints—such attribution is best done with a metadata record (when available). Whether one is interested in detecting artificial or natural changepoints, it is often prudent to allow for the potential of a trend.

Unfortunately, mean-shift tests may not work well when the series has a trend. If in truth a series has a trend but no mean shift, then a changepoint test designed for zero-trend data often tries to compensate for the mean misspecification by erroneously flagging a changepoint near the record's midpoint. If the series is subsegmented about the flagged changepoint times and the subsegments are analyzed, additional changepoints are often erroneously declared. It is therefore desirable to develop homogenization methods that are applicable to data with trends.

At most one changepoint (AMOC) tests that allow for trends were considered in Lund and Reeves (2002), Wang (2003, 2008), Wang et al. (2007), and Beaulieu et al. (2012). None of these references identifies an asymptotic statistical distribution for the crucial changepoint existence test. In fact, the current climate literature seems fractured and is often based on case-by-case simulation, despite the fact (which will become apparent here) that tractable asymptotic distributions can be derived.

Of the above papers, Wang et al. (2007) is the most relevant to the current exposition. These authors use empirical methods to improve testing procedures for changepoints in the presence of trends by imposing a “penalization term” that increases as the changepoint time moves away from the series' midpoint. Their penalty for a changepoint occurring at time *c*, denoted by *P*(*c*), is derived through trial and error simulations with normal (Gaussian) distributions. This results in an ad hoc improvement for the specific models and sample sizes employed, but no general justification is given. The mathematical derivations in this paper explain the necessity for, and provide a specific form for, the penalty term *P*(*c*), which is appropriate for all sample sizes. Justification under minimal assumptions is given—we require independent model errors, but no distributional assumptions (such as normality).

This paper seeks to provide a more unified and mathematically justified procedure for changepoint tests for climate time series with possible trends. The good news is that justifiable test statistics are explicit and can be calculated from regression *t* tests, which are provided by standard statistical software. Furthermore, hypothesis testing decision rules can be based on an asymptotic theory that holds in great generality. Correct quantification of the asymptotic distribution of the AMOC test statistic is needed, which is provided in this paper. For purposes of exposition, attention is restricted to models whose trend slope is constrained to be the same before and after the changepoint time. We provide all technical details so that an interested reader can modify our methods to more complicated models such as those considered in Beaulieu et al. (2012) (which considers a variety of regression structures moving beyond linear trend paradigms). While several of our asymptotic distributions have appeared in the technical statistics literature, they seem unknown (or at least unappreciated) in climate settings. For example, the asymptotic properties of the classical standard normal homogeneity test (SNHT) of Alexandersson (1986) are presented here for the first time in a climate venue. Also, even in the statistical literature, the connection between AMOC changepoint tests and standard regression *t* tests has not been made clear.

The rest of this paper proceeds as follows. The next section reviews the simple case of a mean shift only (no trends) and states asymptotic distributions. Section 3 then presents analogous results for regression structures involving a linear trend. Section 4 presents a simulation study that supports the asymptotic theory presented and illuminates some of the properties of the methods. Section 5 shows how the methods can be used to make inferences on two temperature series. Concluding comments are made in section 6. Comparisons of the different tests are made and some words of caution are presented. Technical derivations are collected into an appendix.

## 2. A mean-shift CUSUM review

First, consider the most basic changepoint problem: detecting a change in the mean of a series at an unknown time. A regression model describing this scenario for data (*X*_{t}) over *t* = 1, … , *n* is

*X*_{t} = *μ* + *δ*_{t} + ε_{t},  *t* = 1, … , *n*.  (2.1)

Here, *μ* is the mean of the series before the unknown changepoint time *c* and (ε_{t}) is zero-mean independent random error with unknown variance *σ*^{2}. The mean shift occurring at time *c* + 1 has magnitude Δ. Hence, the mean-shift factor (*δ*_{t}) obeys *δ*_{t} = 0 for 1 ≤ *t* ≤ *c* and *δ*_{t} = Δ for *c* < *t* ≤ *n*.

Perhaps the first statistic used to detect mean shifts was the cumulative sum (CUSUM) statistic studied in Page (1955):

CUSUM(*c*) = *n*^{−1/2} Σ_{t=1}^{c} (*X*_{t} − X̄),  1 ≤ *c* < *n*,  (2.2)

where X̄ is the sample mean of all *n* observations.

If in truth a mean shift occurred in the data at time *c* (specifically, sometime between times *c* and *c* + 1, which we call a shift at time *c*) with magnitude Δ, then taking expectations in (2.2) gives E[CUSUM(*c*)] = −*n*^{−3/2}*c*(*n* − *c*)Δ. It follows that if |CUSUM(*c*)| is large, then there is statistical evidence that Δ is nonzero and that a changepoint occurred at time *c*. CUSUM changepoint statistics estimate the changepoint time as the value of *c* that maximizes |CUSUM(*c*)|. That is, the changepoint statistic used is

*C* = max_{1≤c<n} |CUSUM(*c*)|/σ̂,

where the denominator has been normalized by an estimate of *σ* to produce a quantifiable asymptotic distribution. The estimate of *σ*^{2} used is typically computed under the null hypothesis of no changepoints and is σ̂^{2} = *n*^{−1} Σ_{t=1}^{n} (*X*_{t} − X̄)^{2}, where X̄ = *n*^{−1} Σ_{t=1}^{n} *X*_{t}.
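As a concrete illustration, the CUSUM statistic and its maximizing changepoint time can be computed in a few lines. This is a sketch in our own notation (function and variable names are ours, not from any particular software package):

```python
import numpy as np

def cusum_changepoint(x):
    """Return (C, c_hat): the normalized statistic C = max_c |CUSUM(c)| / sigma_hat
    and the maximizing time c_hat, where CUSUM(c) = n^{-1/2} sum_{t<=c} (x_t - xbar).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Null-hypothesis (no changepoint) variance estimate.
    sigma_hat = np.sqrt(np.mean((x - x.mean()) ** 2))
    # Partial sums of the mean-centered data for c = 1, ..., n - 1.
    cusum = np.abs(np.cumsum(x - x.mean())[:-1]) / np.sqrt(n)
    c_hat = int(np.argmax(cusum)) + 1  # convert 0-based index to time c
    return cusum.max() / sigma_hat, c_hat
```

Large values of the returned statistic, judged against percentiles such as those in Table 1, indicate a mean shift at the estimated time.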

Our first result is taken from MacNeill (1974) but does not appear to be appreciated in climate settings. It says that under the null hypothesis of no changepoints, *C* distributionally converges (as *n* → ∞) to the supremum (largest value) of a Gaussian process known as the Brownian bridge.

**Result 1:** As *n* → ∞,

*C* →^{d} sup_{0≤z≤1} |*B*(*z*)|.  (2.3)

Here, *B*(*z*) = *W*(*z*) − *zW*(1) for 0 ≤ *z* ≤ 1, where [*W*(*z*)] is a standard Brownian motion process. This means that *W*(*z*) has a normal distribution with zero mean and variance *z* for each *z* ∈ [0, 1], the increments of [*W*(*z*)] are stationary and independent, and *W*(0) = 0. Brownian motion, Brownian bridges, and other Gaussian processes arise as asymptotic distributions in many limit problems. The interested reader can consult Ross (1996) and problems 6.14 and 6.18 of Resnick (1992) for more on their probabilistic properties. We do not prove this convergence here as it is technical; however, an analogous result for the more complicated case with a linear trend will be derived later.

The immediate implication is that critical values of *C* can be calculated. Specifically, a known expression [see Resnick (1992) and Robbins et al. (2011a), among others] provides exact tail probabilities of the distribution of sup_{z∈[0,1]}|*B*(*z*)|. Here, the resulting critical values are provided in Table 1 for various levels of statistical confidence. Use of these percentiles is generally conservative for finite *n*, as will be shown in section 4. Elaborating, should one want 95% confidence and use the 95th quantile in Table 1, then one is slightly more than 95% confident in any declared changepoint.
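The tail expression in question is the familiar Kolmogorov-type alternating series P(sup_{z}|B(z)| > x) = 2 Σ_{k≥1} (−1)^{k−1} exp(−2k²x²), and critical values such as those in Table 1 can be recovered from it by bisection. A sketch (helper names are ours):

```python
import math

def bb_sup_tail(x, terms=100):
    """P(sup_{0<=z<=1} |B(z)| > x) for a Brownian bridge B (alternating series)."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                     for k in range(1, terms + 1))

def bb_critical_value(alpha, lo=0.1, hi=5.0):
    """Solve bb_sup_tail(x) = alpha for x by bisection; the tail is decreasing in x."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bb_sup_tail(mid) > alpha:
            lo = mid  # tail still too heavy; move right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `bb_critical_value(0.05)` returns approximately 1.358, the familiar 95% point of sup|*B*(*z*)|.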

A perhaps more intuitive approach for changepoint detection examines the sample means before and after the changepoint. For a changepoint time *c*, set X̄_{1:c} = *c*^{−1} Σ_{t=1}^{c} *X*_{t} and X̄_{(c+1):n} = (*n* − *c*)^{−1} Σ_{t=c+1}^{n} *X*_{t}. If these two sample means greatly differ, a changepoint is statistically suggested at time *c*. Scaling by a denominator that accounts for the possibly different segment lengths leads to consideration of a statistic *Z*(*c*) of the form

*Z*(*c*) = (X̄_{1:c} − X̄_{(c+1):n})/(σ̂ [*c*^{−1} + (*n* − *c*)^{−1}]^{1/2}).  (2.4)

An easy calculation provides the variance margin Var(X̄_{1:c} − X̄_{(c+1):n}) = *σ*^{2}[*c*^{−1} + (*n* − *c*)^{−1}] = *σ*^{2}*n*/[*c*(*n* − *c*)]. Using this in (2.4) and reworking the numerator into a CUSUM form provides

*Z*(*c*) = CUSUM(*c*)/(σ̂ [(*c*/*n*)(1 − *c*/*n*)]^{1/2}).  (2.5)

Since *c* is not known a priori, our changepoint statistic is taken as the maximum of the squared values of *Z*(*c*):

*C** = max_{1≤c<n} *Z*(*c*)^{2}.

The statistic *C** is also a Gaussian likelihood ratio statistic. This means that if one assumes a Gaussian distribution for (ε_{t}) and performs a statistical likelihood analysis, then *C** arises. In general, likelihood ratio statistics can be difficult to compute, but are very powerful (i.e., they detect changepoints with a relatively high probability).

The asymptotic properties of *C** are harder to quantify than those of *C*. In fact, it can be shown that *C** diverges to infinity as *n* → ∞. Hence, *C** needs to be scaled in some way. A common scaling practice is to “crop boundaries.” One should not expect to accurately detect a changepoint that occurs near the data boundaries as there are too few observations before or after the changepoint to make reliable conclusions. Cropping the first and last *h* × 100% of the observations gives a test statistic defined by

*C**_{h} = max_{⌊hn⌋≤c≤⌊(1−h)n⌋} *Z*(*c*)^{2}.  (2.6)

The asymptotic distribution of *C**_{h} is known (Csörgő and Horváth 1997) and again involves the Brownian bridge.

**Result 2:** As *n* → ∞,

*C**_{h} →^{d} sup_{h≤z≤1−h} *B*(*z*)^{2}/[*z*(1 − *z*)].  (2.7)

Tail probabilities of the above limit distribution can be approximated using an expression that is provided by Csörgő and Horváth (1997) and Robbins et al. (2011a); simulation is required to obtain more precise values. Thereby, Table 2 lists simulated critical values of *C**_{h} for various values of *h* and statistical confidence levels. Up to boundary cropping issues, this result provides the asymptotic distribution of the classical standard normal homogeneity test of Alexandersson (1986). An alternative to using the percentiles in Table 2 would be to simulate critical values for every sample size *n*.
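Both the cropped statistic and its simulated finite-sample critical values are straightforward to compute. The sketch below is our own illustration (the cropping fraction, replication count, and seed are arbitrary choices, not values from the paper):

```python
import numpy as np

def cropped_shift_stat(x, h=0.05):
    """max of Z(c)^2 over hn <= c <= (1-h)n, where Z(c) compares the sample
    means before and after time c, scaled by the null variance estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma2 = np.mean((x - x.mean()) ** 2)
    lo, hi = max(1, int(h * n)), min(n - 1, int((1 - h) * n))
    best = 0.0
    for c in range(lo, hi + 1):
        z2 = (x[:c].mean() - x[c:].mean()) ** 2 / (sigma2 * (1.0 / c + 1.0 / (n - c)))
        best = max(best, z2)
    return best

def simulated_critical_value(n, h=0.05, reps=2000, alpha=0.05, seed=1):
    """Null critical value for series of length n by direct simulation."""
    rng = np.random.default_rng(seed)
    stats = [cropped_shift_stat(rng.standard_normal(n), h) for _ in range(reps)]
    return float(np.quantile(stats, 1.0 - alpha))
```

Simulating per-*n* critical values in this way sidesteps the conservativeness of the asymptotic percentiles, at the cost of extra computing.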

The similarities and differences between the asymptotic distributions in (2.3) and (2.7) are worth discussing. First, *C**_{h}'s limit distribution involves the supremum of the square of *B*(*z*) rather than of |*B*(*z*)|. This is of little concern: taking the largest absolute value or the largest squared value produces the same changepoint time estimator. The *C**_{h} statistic also has a *z*(1 − *z*) factor in the denominator that the *C* statistic lacks (one views *z* as *c*/*n*). This factor serves to place greater weight on *c* values near unity or *n*. As a result, *C**_{h} should have higher power than *C* when a changepoint occurs near the record's endpoints. Wang et al. (2007) attempt to reduce this emphasis by imposing a penalization term that increases as the changepoint time moves away from the series midpoint. Specifically, they propose maximizing |*P*(*c*)*Z*(*c*)|, where *P*(*c*) is a penalty term derived through trial and error simulation. In the end, their form is unwieldy and statistically inconsistent (does not necessarily get the right answer as *n* → ∞), but is markedly similar to the structure in the denominator on the right-hand side of (2.5).

For practical recommendations, one should prefer *C**_{h} if only one test can be considered. It turns out, however, that should the changepoint time lie near the center of the data record (around *n*/2), then the CUSUM statistic *C* is slightly more powerful at detecting the changepoint than the *C**_{h} statistic [Robbins et al. (2011a) discuss this issue in detail]. Should the changepoint occur closer to the data boundaries (e.g., around 15% or 85% into the record), then the *C**_{h} statistic is vastly more powerful. In short, it makes sense to compute both statistics but to rely on the likelihood ratio statistic should conclusions conflict. As to what value of *h* one should use, this is not usually important. Authors have truncated 1% and 5% of the data record on both sides with meaningful conclusions; we have yet to see a case where this truncation level greatly matters but do not doubt that such cases exist, especially should a changepoint lie very close to the data boundaries. Finally, one does not need Gaussian data for good performance. Indeed, many Gaussian-based statistics perform well (say consistently as *n* → ∞) in non-Gaussian settings. In the end, our results require only zero-mean independent errors (ε_{t}). Unfortunately, when the errors are not zero-mean (e.g., when an unmodeled trend is present), the performance of the changepoint tests above can be bad (for the reasons discussed in the introduction). This brings us to our next section.

## 3. Changepoint statistics in data with trends

Now suppose that a linear trend is allowed; that is, (2.1) is modified to

*X*_{t} = *μ* + *βt* + *δ*_{t} + ε_{t},  *t* = 1, … , *n*,  (3.1)

where *β* is a linear trend parameter. In order for the ensuing theory to hold as stated, one should restrict *t* to the discrete set 1, … , *n* (and not let *t* denote the raw year, for example). Under a null hypothesis of no changepoints, the ordinary least squares estimators of the mean and trend parameters are the values of *μ* and *β* that minimize the sum of squares Σ_{t=1}^{n} (*X*_{t} − *μ* − *βt*)^{2}. A calculus minimization shows these estimators to be

β̂ = Σ_{t=1}^{n} (*t* − t̄)(*X*_{t} − X̄)/Σ_{t=1}^{n} (*t* − t̄)^{2},  μ̂ = X̄ − β̂t̄,  (3.2)

where t̄ = (*n* + 1)/2 is the average observation time.

When data are not independent and/or identically distributed, a common approach to changepoint detection is to apply the CUSUM methods of the previous section to a residuals sequence {ê_{t}}; here, the residuals under the null hypothesis are calculated as ê_{t} = *X*_{t} − μ̂ − β̂*t* for *t* = 1, 2, … , *n*. For example, Gallagher et al. (2012) develop a residuals-based statistic for detection of a change in daily precipitation series that incorporates periodicity and that was shown to obey the same convergence as the *C* statistic of the previous section. Unfortunately, when a linear trend is involved, the results of the last section no longer apply. This issue, in fact, confused many early statistics authors (MacNeill 1978; Sen 1982; Kim and Siegmund 1989). Nonetheless (and similar to what was seen in the last section), examining the CUSUM statistic of {ê_{t}} will illuminate better methods.

### a. CUSUM-type statistics

To quantify the asymptotic properties of the CUSUM of the residual sequence {ê_{t}}, a separate analysis is needed. For this, set

*D* = max_{1≤c<n} |*D*(*c*)|,  (3.3)

where

*D*(*c*) = (σ̂*n*^{1/2})^{−1} Σ_{t=1}^{c} ê_{t} = −(σ̂*n*^{1/2})^{−1} Σ_{t=c+1}^{n} ê_{t}.  (3.4)

Here, σ̂^{2} = *n*^{−1} Σ_{t=1}^{n} ê_{t}^{2} is the null hypothesis estimate of *σ*^{2} and the second equality in (3.4) follows from the fact that ordinary least squares residuals sum to zero: Σ_{t=1}^{n} ê_{t} = 0. What can be shown (following MacNeill 1978) is that the trend-adjusted CUSUM statistic in (3.3) obeys the following limit law. Our argument for this is presented in the appendix.
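The trend-adjusted CUSUM construction just described (detrend under the null, then take cumulative sums of the residuals) can be sketched as follows. This is our own illustration of the recipe, not the authors' code:

```python
import numpy as np

def trend_cusum_stat(x):
    """Return (D, c_hat): the maximum absolute residual CUSUM
    D = max_c |D(c)| from a null-hypothesis linear trend fit,
    and the maximizing changepoint time c_hat."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1, dtype=float)
    # Ordinary least squares fit of x on an intercept and linear trend.
    beta = np.sum((t - t.mean()) * (x - x.mean())) / np.sum((t - t.mean()) ** 2)
    mu = x.mean() - beta * t.mean()
    resid = x - mu - beta * t
    sigma_hat = np.sqrt(np.mean(resid ** 2))  # null estimate of sigma
    d = np.abs(np.cumsum(resid)[:-1]) / (sigma_hat * np.sqrt(n))
    c_hat = int(np.argmax(d)) + 1
    return d.max(), c_hat
```

Applied to a trending series with an embedded mean shift, the residual CUSUM peaks at the shift time.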

**Result 3:** In the linear trend case, as *n* → ∞,

*D* →^{d} sup_{0≤z≤1} |*B̃*(*z*)|.  (3.5)

Here, *B̃* is a process related to the Brownian bridge via

*B̃*(*z*) = *B*(*z*) − 6*z*(1 − *z*) ∫_{0}^{1} *B*(*u*) d*u*,  0 ≤ *z* ≤ 1.  (3.6)

From Gaussian process theory [see Ross (2010) for more on Gaussian processes], it is known that [*B̃*(*z*)] is another Gaussian process with E[*B̃*(*z*)] = 0 and variance

Var[*B̃*(*z*)] = *z*(1 − *z*)[1 − 3*z*(1 − *z*)].  (3.7)

The variance in (3.7) is the key nuance between the no-trend and trend cases: the analogous variance for the no-trend model is Var[*B*(*z*)] = *z*(1 − *z*).
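The variance claim can be checked directly from Brownian bridge covariances. Writing *B̃*(*z*) = *B*(*z*) − 6*z*(1 − *z*)∫_{0}^{1} *B*(*u*) d*u* and using Cov[*B*(*s*), *B*(*t*)] = min(*s*, *t*) − *st*, one has the following short computation (our own verification sketch):

```latex
% Covariances of the Brownian bridge ingredients, Cov[B(s),B(t)] = min(s,t) - st:
\operatorname{Cov}\Big[B(z),\int_0^1 B(u)\,du\Big]
   = \int_0^1 \{\min(z,u)-zu\}\,du
   = \frac{z^2}{2} + z(1-z) - \frac{z}{2}
   = \frac{z(1-z)}{2},
\qquad
\operatorname{Var}\Big[\int_0^1 B(u)\,du\Big] = \frac{1}{12}.
% Hence, with \tilde B(z) = B(z) - 6z(1-z)\int_0^1 B(u)\,du:
\operatorname{Var}[\tilde B(z)]
  = z(1-z) - 12z(1-z)\cdot\frac{z(1-z)}{2} + 36z^2(1-z)^2\cdot\frac{1}{12}
  = z(1-z) - 3z^2(1-z)^2
  = z(1-z)\,[1 - 3z(1-z)].
```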

Table 3 provides asymptotic percentiles for *D*. The percentiles are again conservative when applied to scenarios with a finite *n*.
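Percentiles of sup|*B̃*(*z*)|, such as those in Table 3, can be approximated by simulating the limit process on a grid. The sketch below uses the representation *B̃*(*z*) = *B*(*z*) − 6*z*(1 − *z*)∫_{0}^{1} *B*(*u*) d*u*; the grid size, replication count, and seed are illustrative choices:

```python
import numpy as np

def simulate_sup_Btilde(reps=4000, grid=500, alpha=0.05, seed=7):
    """Approximate the upper-alpha percentile of sup_z |Btilde(z)| by
    discretizing Brownian motion, forming the bridge, and adjusting it."""
    rng = np.random.default_rng(seed)
    dz = 1.0 / grid
    z = np.arange(1, grid + 1) * dz
    sups = np.empty(reps)
    for i in range(reps):
        # Brownian motion on the grid, then bridge B(z) = W(z) - z W(1).
        w = np.cumsum(rng.standard_normal(grid)) * np.sqrt(dz)
        b = w - z * w[-1]
        integral = b.sum() * dz  # Riemann approximation of the integral of B
        btilde = b - 6.0 * z * (1.0 - z) * integral
        sups[i] = np.abs(btilde).max()
    return float(np.quantile(sups, 1.0 - alpha))
```

Such direct simulation is also a practical check on tabulated percentiles for any statistic whose limit process can be written down.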

### b. Relationship to t statistics

Here, we connect the *D*(*c*) statistic, regression *t* tests, and likelihood ratio statistics for series with linear trends. Suppose that a changepoint is known to occur at time *c*, so that the fitted model becomes *X*_{t} = *μ* + *βt* + Δ1_{[t>c]} + ε_{t}. An involved calculation identifies the estimators μ̂(*c*), β̂(*c*), and Δ̂(*c*) that minimize the sum of squares

Σ_{t=1}^{n} [*X*_{t} − *μ* − *βt* − Δ1_{[t>c]}]^{2}.  (3.8)

Here, the notation β̂(*c*), for example, signifies that the trend is estimated assuming a changepoint at time *c*. A changepoint is suggested at time *c* when Δ̂(*c*) is significantly nonzero. A *t* statistic for such a test is simply

*T*(*c*) = Δ̂(*c*)/ŝe[Δ̂(*c*)],  (3.9)

where ŝe[Δ̂(*c*)] is the estimated standard error of Δ̂(*c*) from the three-parameter regression fit.

To explicitly identify the numerator and denominator in (3.9), the appendix derives two identities, (3.10) and (3.11), that express Δ̂(*c*) and ŝe[Δ̂(*c*)] in computable forms.
Using (3.10) and (3.11) in (3.9) gives the following. Let *r* = *c*/*n*. When *n* is large, division by *n*^{2} is approximately division by *n*^{2} − 1, *c*(*n* − *c*)/(*n*^{2} − 1) ≈ (*c*/*n*)(1 − *c*/*n*), and, under the null hypothesis, an approximation to *T*(*c*), labeled (3.12), emerges. To link the *T*(*c*) and *D*(*c*) statistics, we show in the appendix a further identity, (3.13), that rewrites the numerator of (3.12) in terms of β̂, the slope estimate under the null hypothesis. Substituting (3.13) for the numerator of (3.12) shows that, for large *n*,

*T*(*c*) ≈ *D*(*c*)/{*r*(1 − *r*)[1 − 3*r*(1 − *r*)]}^{1/2}.  (3.14)

Equation (3.14) has important ramifications. This is because any standard statistical software will calculate *T*(*c*) for each admissible changepoint time *c*. A *t*-based test for an unknown changepoint then simply looks at the maximum of |*T*(*c*)| [or *T*(*c*)^{2}] over all possible *c*. Such a statistic is also the Gaussian likelihood ratio statistic.
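Because *T*(*c*) is an ordinary regression *t* statistic, any least squares routine suffices; the sketch below builds the three-column design (intercept, trend, shift indicator) directly and maximizes *T*(*c*)^{2} over a cropped range. It is our own illustration, with the cropping fraction as an assumed input:

```python
import numpy as np

def tmax_changepoint(x, h=0.05):
    """Return (max T(c)^2, c_hat) over hn <= c <= (1-h)n, where T(c) is the
    t statistic of the shift coefficient in the regression of x on [1, t, 1_{t>c}]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1, dtype=float)
    best, c_hat = -np.inf, None
    for c in range(max(2, int(h * n)), min(n - 2, int((1 - h) * n)) + 1):
        X = np.column_stack([np.ones(n), t, (t > c).astype(float)])
        coef, _, _, _ = np.linalg.lstsq(X, x, rcond=None)
        resid = x - X @ coef
        s2 = resid @ resid / (n - 3)  # unbiased error variance estimate
        cov = s2 * np.linalg.inv(X.T @ X)
        t_stat2 = coef[2] ** 2 / cov[2, 2]
        if t_stat2 > best:
            best, c_hat = t_stat2, c
    return best, c_hat
```

The returned maximum is then compared against critical values such as those in Table 4.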

As in the simple mean-shift case, one cannot extract a meaningful limit law without first cropping. Specifically, it can be shown that max_{1≤c<n}|*T*(*c*)| → ∞ as *n* → ∞ [the analogous maximum based on the right-hand side of (3.14) also diverges as *n* → ∞]. This is a consequence of the fact that *r*(1 − *r*) converges to zero as *r* → 0 or *r* → 1, together with a deeper analysis of the behavior of *B̃*(*z*) near *z* = 0 and *z* = 1.

To quantify a limit law for a cropped statistic, let

*T*_{h} = max_{⌊hn⌋≤c≤⌊(1−h)n⌋} *T*(*c*)^{2}.  (3.15)
Our last result is the following.

**Result 4:** In the linear trend case, as *n* → ∞,

*T*_{h} →^{d} sup_{h≤z≤1−h} *B̃*(*z*)^{2}/{*z*(1 − *z*)[1 − 3*z*(1 − *z*)]}.  (3.16)

The same limit law holds for the computationally simpler statistic

*T̃*_{h} = max_{⌊hn⌋≤c≤⌊(1−h)n⌋} *D*(*c*)^{2}/{*r*(1 − *r*)[1 − 3*r*(1 − *r*)]},  (3.17)

with *r* = *c*/*n*, which replaces *T*(*c*) by its large-sample approximation in (3.14).

Asymptotic percentiles for *T*_{h} are given in Table 4. As the next section shows, the percentiles are again conservative when applied to scenarios with a finite *n* (see also Robbins et al. 2011a).

Notice that *T̃*_{h} requires less computation than *T*_{h} as one does not need to compute parameter estimates for each and every *c*. However, both have the same limiting distribution and hence the same asymptotic detection power. This said, *T*_{h} should be slightly more powerful for finite *n* as it incorporates alternative hypothesis estimators.

It is easy to design a changepoint test statistic that has the computational advantages of *T̃*_{h} (when compared to *T*_{h}) but also has the same asymptotic distribution as *D*. To do this, set

*D** = max_{1≤c<n} {*r*(1 − *r*)[1 − 3*r*(1 − *r*)]}^{1/2} |*T*(*c*)|,  (3.18)

where *r* = *c*/*n*. Then the limit law is

*D** →^{d} sup_{0≤z≤1} |*B̃*(*z*)|,

where *B̃* is as in (3.6).

Extending the rationale of Wang et al. (2007) to models with a linear trend, Wang (2008) empirically adds a penalty term to the above *t*-type tests. Elaborating, Wang (2008) suggests maximizing *P**(*c*)*T*(*c*)^{2}, where *P**(*c*) is a penalty term that is “M-shaped” when graphed as a function of *c*. The author develops piecewise expressions for *P**(*c*) through simulation. The resulting penalty term is unwieldy, but markedly similar to *r*(1 − *r*)[1 − 3*r*(1 − *r*)] with *r* = *c*/*n*, the scaling term in (3.18) here (which is also M-shaped when graphed). In effect, Wang (2008) is trying to introduce a test statistic with the behavior of *D**. Unfortunately, Wang's (2008) statistic cannot be easily calculated or quantified asymptotically, whereas the statistics introduced here are easily calculated and have quantifiable limiting distributions.

For practical recommendations, we suggest using *T̃*_{h} in favor of *D* should conclusions conflict, for reasons illustrated in the next section.

It remains to derive results 3 and 4, which is done in the appendix. This task is important since the asymptotic distribution depends on the presence/absence of a trend. One would have to perform yet another derivation for regression structures employing sinusoidal, quadratic, or exponential terms. Sinusoidal terms, for example, could arise in the analysis of daily or monthly series. This said, the derivations are similar in spirit and we offer the linear case as perhaps useful guidance.

## 4. A simulation study

This section uses simulation to assess the efficacy of the methods in section 3. All series are simulated from (3.1) with *μ* = 1, *β* = 0.5, and *σ*^{2} = 1. The performance of all methods appeared invariant to the regression parameters chosen; hence, we report results for the above choices only. The significance level *α* = 0.05 is used throughout.

To begin, the no-changepoint null hypothesis performance of the methods is studied. For each series length *n*, one million series were generated from (3.1) with *δ*_{t} = 0 for all *t* and Gaussian errors {ε_{t}}. Because of the large number of replications (one million), little error is attributable to simulation. For each generated series, the statistics *D* and *T̃*_{h}, as specified in (3.3) and (3.17), are computed (tests using *T*_{h} and *D** are too computationally cumbersome to be evaluated here, but their asymptotic performance is the same as that of the corresponding statistics considered). For both tests, the empirical proportion of the one million series that falsely declare a changepoint is graphed in Fig. 2. The individual series are changepoint assessed using the critical values in Tables 3 and 4. Figure 2 reveals two features: 1) as the sample size *n* becomes larger, the probability of erroneously declaring that a changepoint exists converges to 0.05 (as the asymptotic theory demands), and 2) the tests are conservative in that *p* values will not overestimate the probability of changepoint presence. This last property is convenient for those wishing to avoid overhomogenizing a record with too many changepoints.

The detection power of the methods in the presence of a changepoint (i.e., *δ*_{t} ≠ 0 for some *t*) is also of interest. While it would be instructive to compare to the climate-realistic benchmarks developed in Venema et al. (2012), those results were developed for multiple changepoint features. It is important to emphasize that our methods are single changepoint tests. While one can turn a single changepoint test into a multiple changepoint procedure via subsegmenting, Li and Lund (2012) show that some undesirable issues arise in so doing. In particular, mean shifts that occur closely in time become difficult to distinguish, and mean-shift magnitudes Δ that alternate in sign can fool segmentation methods.

Here, the series length *n* = 1000 is fixed for convenience (this large value is taken so that the asymptotics hold, but the scenario is not overly different for shorter series lengths). Several values of the changepoint time *c* and three nonzero values of the mean-shift magnitude Δ are studied: 0.2, 0.35, and 0.50. These levels are slightly less than the mean-shift magnitude variations simulated in the Venema et al. (2012) benchmarks. One hundred thousand replicated series are simulated for each *c* and Δ. The detection powers on the vertical axis in Fig. 3 show the proportion of series where the null hypothesis is correctly rejected, while the horizontal axis indicates the location of the changepoint time. The detection powers seem reasonable, with the tests performing relatively poorer when the changepoint time is located near either endpoint.

Figure 3 shows that the detection power of both tests increases as the mean-shift magnitude Δ increases. Both tests show M-shaped powers as a function of the changepoint time *c*: the test using *T̃*_{h} is more powerful than *D* when the changepoint occurs near the center or endpoints of the dataset, and the *D* test is more powerful elsewhere. This said, the power of *D* is never drastically greater than that of *T̃*_{h}; the greatest power discrepancy in favor of *D* is approximately 7%, which is seen when Δ = 0.35 and *c* = 220 or 780. However, *T̃*_{h} may have a markedly higher power than *D* when the changepoint occurs near the endpoints of the dataset. For instance, the discrepancy in power is approximately 25% in favor of *T̃*_{h} when Δ = 0.5 and *c* = 50 or 950. As previously mentioned, a similar phenomenon was noted by Robbins et al. (2011a) for the no-trend case. Hence, we recommend use of *T̃*_{h} in practice. Of course, it is not difficult to compute both *D* and *T̃*_{h}.

## 5. Examples

This section uses the results of the previous sections to study two temperature series. The first series contains annual temperatures for the contiguous United States (CONUS)—the so-called CONUS data discussed in Menne et al. (2010). This is a situation in which a reference series is not available; moreover, since the data are averaged over many stations, no good metadata exist. Our second series moves to a more local scenario: monthly high temperatures at Jacksonville, Illinois. Here, we construct a reference series to help make conclusions.

All tests calculate the following four statistics: 1) *D* in (3.3), 2) *T*_{h} in (3.15), 3) *D** in (3.18), and 4) *T̃*_{h} in (3.17). Tests 1 and 3 use the critical values from Table 3, and tests 2 and 4 use critical values from Table 4. All conclusions are made with 95% confidence; *h* = 0.05 is used in any test requiring cropping. We remind the reader that the methodology presented in this article is designed for the AMOC alternative, and climate series often have multiple changepoints. Here, the possibility of multiple changepoints is handled through subsegmenting. Specifically, the whole series is tested via AMOC methods; if any changepoints are detected, the methods are subsequently applied to the resulting subsegments. This process is repeated until all segments test as homogeneous. While this procedure is simple and has demonstrated utility in climate settings (Robbins et al. 2011b), methods that handle multiple changepoint aspects could be devised with further analyses (see, e.g., Li and Lund 2012).
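The subsegmenting scheme just described is easy to automate around any single changepoint statistic. The sketch below wraps a regression *t*-type test in a recursive splitter; the critical value, cropping fraction, and minimum segment length are user-supplied inputs (the threshold used in the test is purely illustrative, not a Table 4 entry):

```python
import numpy as np

def t2_changepoint(x, h=0.05):
    """Best T(c)^2 and its time over the cropped range, for the model with
    intercept, common linear trend, and one mean shift at time c."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1, dtype=float)
    best, c_hat = -np.inf, None
    for c in range(max(2, int(h * n)), min(n - 2, int((1 - h) * n)) + 1):
        X = np.column_stack([np.ones(n), t, (t > c).astype(float)])
        coef, _, _, _ = np.linalg.lstsq(X, x, rcond=None)
        resid = x - X @ coef
        s2 = resid @ resid / (n - 3)
        cov22 = s2 * np.linalg.inv(X.T @ X)[2, 2]
        if coef[2] ** 2 / cov22 > best:
            best, c_hat = coef[2] ** 2 / cov22, c
    return best, c_hat

def segment(x, crit, offset=0, min_len=30):
    """Recursively test each segment; return sorted changepoint times."""
    x = np.asarray(x, dtype=float)
    if len(x) < min_len:
        return []
    stat, c = t2_changepoint(x)
    if c is None or stat <= crit:
        return []  # segment tests as homogeneous
    return (segment(x[:c], crit, offset) + [offset + c]
            + segment(x[c:], crit, offset + c))
```

The recursion stops once every segment tests as homogeneous, mirroring the procedure applied to the CONUS and Jacksonville records.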

Our first application studies annual temperatures from the CONUS record from 1895 to 2011 (*n* = 117). The *D** and *T̃*_{h} statistics flag a changepoint in 1997, whereas the *D* and *T*_{h} statistics both flag changepoints in 1985. As our preferred methods require some truncation, we support the 1997 changepoint if forced to choose one conclusion. This changepoint shifts temperatures upward. Phrased another way, we conclude that recent temperatures are increasing at a rate that is faster than can be explained by a linear trend alone. All subsegments test as homogeneous with 95% confidence. The *p* values for all tests are reported in Table 5 (definitive conclusions are made at 95% confidence). Figure 4 plots the *T*(*c*)^{2} values for this and all subsegments; the fit of the resulting changepoint model is plotted in Fig. 5. We comment that there are secondary peaks in the *T*(*c*)^{2} statistics for the 1895–1997 subsegment in Fig. 4 circa 1920 and 1957. While these times are “suspect,” they are not significant at the 95% level (the Table 5 *p* values report them as significant with approximately 85%–90% confidence). Note that one cannot take all the times that exceed the 95% quantile in Fig. 4 and declare them as changepoints.

We next examine monthly maximum temperatures from Jacksonville, Illinois, from January 1928 to December 2010 (*n* = 996). A reference series was constructed by averaging the 40 most correlated neighboring stations to Jacksonville. (The Jacksonville minus reference series is plotted in Fig. 6.) Our changepoint methods are applied to the difference of the two series. Differencing acts to reduce or remove the effect of many complicating data characteristics (e.g., autocorrelation, periodicity, trend, etc.), thereby allowing the analyst to isolate nonclimatic changes (such as a gauge change). Nonetheless, a differenced series may still exhibit a linear trend, caused perhaps by external agents (e.g., urbanization) that influence the series under study. We have chosen to make a reference by averaging 40 adjacent stations to illustrate the methods and to avoid some of the complications of pairwise analyses that arise should one consider all differenced series. Menne and Williams (2005, 2009) admirably discuss pairwise comparison issues.

Table 6 displays the results from all changepoint tests. All methods flag a changepoint at January 1957. After subsegmentation, no further changepoints were detected at the 5% significance level. This declaration is again somewhat perilous as the *T*(*c*)^{2} statistics for the subsegments almost exceed the 95th percentile circa the early 1960s and mid-1980s. The 1957 changepoint acts to lower temperatures relative to the reference. Notice that the common trend slope in the two segments is positive, meaning that Jacksonville maximums are warming more rapidly than the reference. Graphical support for these conclusions is provided in Fig. 6, which displays the fitted model (changepoint and linear trend), and Fig. 7, which plots the *T*(*c*)^{2} values for the series and all subsegments.

In concluding these examples, we caution the reader that mean-shift models with a constant slope may not adequately describe all climate time series. The above conclusions, while supplying good rudimentary guidance, are not set in stone. Indeed, climatic forcings are unlikely to be linear in time; moreover, localized considerations, such as moving a station into or out of an urban heat island zone, may merit consideration of more complicated regression structures (perhaps varying slopes for the different segments).

## 6. Comments

In cases for which it is not clear whether the series has a linear trend, it is prudent to apply the methods that allow for a trend. Conclusions will then be conservative and changepoint conclusions will be more justifiable than if the trend were ignored. This said, if it is known concretely that a series does not have a trend, then one should not apply the methods for trends. In general, detection power is sacrificed when one examines models with more model parameters than needed.

As alluded to in the last section, another issue involves whether the trend slope should change at the mean-shift time. Two-segment models that allow for different slopes before and after the changepoint time are called two-phase regressions and are studied in Lund and Reeves (2002). Although the *F*_{max} statistic examined there has similarities to our statistics, quantification of the *F*_{max} limit distribution is too onerous to pursue here. In the case of multiple segments (more than two) with distinct trends, multiple changepoint techniques designed to handle such regression structures will likely be more reliable than segmentation approaches with AMOC tests. Our belief is that series mean misspecification is the primary unresolved issue in multiple changepoint problems. Li and Lund (2012) discuss multiple changepoint problems in more depth. In general, changepoints that occur close together in time or act to move the series in oscillating manners can fool subsegmenting methods.
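A sketch in the spirit of the two-phase regression comparison may help fix ideas: at each candidate break time, a single line (reduced model) is compared against two lines with separate intercepts and slopes (full model) via an *F* ratio. The simulated series, buffer, and degrees of freedom here are illustrative; as noted above, the maximized statistic must be referred to its own null distribution (Lund and Reeves 2002), not an ordinary *F* table:

```python
import numpy as np

def fmax_two_phase(x, buffer=10):
    """F statistics comparing a single line with a two-phase regression
    (separate intercept and slope in each segment) at each break time c."""
    n = len(x)
    t = np.arange(1.0, n + 1)
    D0 = np.column_stack([np.ones(n), t])             # reduced: one line
    b0, *_ = np.linalg.lstsq(D0, x, rcond=None)
    sse0 = np.sum((x - D0 @ b0) ** 2)
    fstats = {}
    for c in range(buffer, n - buffer):
        seg = (t > c).astype(float)
        D1 = np.column_stack([np.ones(n), t, seg, seg * t])  # full: two lines
        b1, *_ = np.linalg.lstsq(D1, x, rcond=None)
        sse1 = np.sum((x - D1 @ b1) ** 2)
        # Two extra parameters in the full model; n - 4 residual df.
        fstats[c] = ((sse0 - sse1) / 2.0) / (sse1 / (n - 4))
    return fstats

# Simulated series whose slope changes at t = 100 (illustration only).
rng = np.random.default_rng(2)
n = 200
t = np.arange(1, n + 1)
x = 0.02 * t - 0.03 * np.maximum(t - 100, 0) + rng.normal(0, 0.5, n)
f = fmax_two_phase(x)
c_hat = max(f, key=f.get)
print(c_hat, round(f[c_hat], 2))
```

The nested-model structure guarantees a nonnegative *F* at every candidate time, and the argmax estimates the break location.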

Two other Medusas of changepoint methods merit mention. First, even moderate levels of error autocorrelation are common in climate series and will influence changepoint conclusions. The analyses here ignore autocorrelation. Indeed, mean shifts attributed to natural causes (Rodionov) can, in part, be explained by autocorrelation. Second, seasonality is often present in nonannualized series. For example, it is not clear that the Jacksonville minus reference series of the last section is free of seasonal features. Both of these issues are discussed in detail in Lund et al. (2007).
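Both issues can be probed cheaply before any testing is done. The sketch below uses a synthetic AR(1)-plus-seasonal series standing in for a real monthly difference series; it removes seasonality by subtracting monthly means and then estimates the lag-1 autocorrelation of the anomalies:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical monthly difference series: seasonal cycle plus AR(1) errors.
n = 996
month = np.arange(n) % 12
seasonal = 0.8 * np.sin(2 * np.pi * month / 12)
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.4 * eps[i - 1] + rng.normal(0, 1)
x = seasonal + eps

# Remove seasonality by subtracting monthly means (monthly anomalies).
monthly_means = np.array([x[month == m].mean() for m in range(12)])
anom = x - monthly_means[month]

# Lag-1 sample autocorrelation of the anomalies; values well away from zero
# signal that changepoint critical values need adjustment (cf. Lund et al. 2007).
a = anom - anom.mean()
r1 = (a[:-1] @ a[1:]) / (a @ a)
print(round(r1, 2))
```

A clearly nonzero *r*₁, as here, is exactly the situation in which ignoring autocorrelation inflates false-alarm rates.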

## Acknowledgments

Robert Lund's work is supported by National Science Foundation Grant DMS 0905570. Comments from three referees improved this manuscript.

### APPENDIX

#### Derivation of Relevant Formulas in Section 3

**Proof of result 3:** We begin by justifying the convergence in result 3. Suppose that the null hypothesis model with *δ _{t}* = 0 is in force. Substituting the right-hand side of (3.1) into (3.2) and simplifying yields

and

[Recall that *t̄* is the average observation time.]
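Since (3.1) and (3.2) lie outside this excerpt, the missing displays above are presumably the familiar ordinary least squares estimators; a reconstruction under the assumption that the null model is *X _{t}* = *μ* + *βt* + ε*_{t}*:

```latex
\hat{\beta} \;=\; \frac{\sum_{t=1}^{n}\,(t-\bar{t})\,(X_t-\bar{X})}{\sum_{t=1}^{n}\,(t-\bar{t})^{2}},
\qquad
\hat{\mu} \;=\; \bar{X} \;-\; \hat{\beta}\,\bar{t},
```

where *X̄* and *t̄* denote the sample averages of the observations and observation times, respectively.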

Using these formulas to calculate the residuals under the null hypothesis gives

where *ε̄* denotes the sample average of the {ε*_{t}*}. Using this expression for the residuals and (A1) proves (3.13):

The advantage of the representation in (A2), which expresses the sum of the residuals as a sum of the (unobservable) independent variables {ε*_{t}*}, is twofold. First, well-known results for weighted sums of independent variables can be used to justify the Gaussian process limit (see Billingsley 1999). Second, the independence of the summands can be used to calculate the covariance of the Gaussian process. Since the {ε*_{t}*} are independent, Cov(ε*_{t}*, ε*_{ℓ}*) = 0 when *t* ≠ *ℓ*. From this and linearity of covariances, when *k* ≥ *c*, one has

and

Using these in (A1) and properties of covariance gives our key equation:

where *s* = *k*/*n* and *r* = *c*/*n*. This equation identifies the limiting covariance structure of the Gaussian process. In particular, when *s* = *r*

The convergence claimed in result 3 now follows from process convergence results found in Billingsley (1999).

**Proof of result 4:** While (3.8) gives the ordinary least squares estimators of *μ*(*c*), *β*(*c*), and Δ(*c*) for each changepoint time *c*, the computations simplify if we work with a linear models orthogonal decomposition on the design matrix. Stapleton (2009) gives a comprehensive discussion of these techniques. For *t* > *c*, denote the predicted values under the alternative model as . Let **v**_{1} be an *n* × 1 vector whose entries are all ones, let the vector **v**_{2} have *t*th component , and let the vector **v**_{3} have *t*th component satisfying

The vector of predicted values has representation

and linear models theory (Stapleton 2009, section 3.7) provides

with **X** = (*X*_{1}, … , *X _{n}*)′. Here, the norm is the sum of squares and **y** = (*y*_{1}, … , *y _{n}*)′. A direct calculation with (A3) shows

which establishes (3.11), and

where the displayed quantity is the residual calculated under the null hypothesis. This establishes (3.10).

The squared *t* statistic becomes

where, as *n* → ∞,

with *r* = *c*/*n*. Under the null hypothesis of no changepoints, as *n* → ∞,

The above derivations yield (3.12), and the convergence in result 4 now follows from (3.5), (3.12), and the continuous mapping theorem of Billingsley (1999).

## REFERENCES

Billingsley, P., 1999: *Convergence of Probability Measures.* 2nd ed. Wiley, 285 pp.

Csörgő, M., and L. Horváth, 1997: *Limit Theorems in Change-Point Analysis.* John Wiley and Sons, 414 pp.

Resnick, S. I., 1992: *Adventures in Stochastic Processes.* Birkhäuser, 626 pp.

Ross, S. M., 2010: *Introduction to Probability Models.* 10th ed. Elsevier, 784 pp.

Stapleton, J. H., 2009: *Linear Statistical Models.* 2nd ed. Wiley, 474 pp.