## Abstract

This review article enumerates, categorizes, and compares many of the methods that have been proposed to detect undocumented changepoints in climate data series. The methods examined include the standard normal homogeneity (SNH) test, Wilcoxon’s nonparametric test, two-phase regression (TPR) procedures, inhomogeneity tests, information criteria procedures, and various variants thereof. All of these methods have been proposed in the climate literature to detect undocumented changepoints, but heretofore there has been little formal comparison of the techniques on either real or simulated climate series. This study seeks to unify the topic, showing clearly the fundamental differences among the assumptions made by each procedure and providing guidelines for which procedures work best in different situations. It is shown that the common trend TPR and Sawa’s Bayes criteria procedures seem optimal for most climate time series, whereas the SNH procedure and its nonparametric variant are probably best when trend and periodic effects can be diminished by using homogeneous reference series. Two applications to annual mean temperature series are given. Directions for future research are discussed.

## 1. Introduction

Changepoints are times of discontinuities in a time series that can be induced from changes in observation locations, equipment, measurement techniques, environmental changes, and so on. Inferences drawn from climatic series frequently depend on “continuity of the measurement process,” that is, the lack of changepoints. For example, Easterling and Peterson (1995), Chen and Gupta (2000), Lu et al. (2005), Hanesiak and Wang (2005), and Wang (2006) note that linear trend estimates are trustworthy only when the series are homogeneous in time. A changepoint-free record is difficult to ensure; moreover, many changepoints occur without documentation. Before one studies trends, the relative homogeneity of the series should be assessed.

The World Meteorological Organization (WMO) Climate Program guidelines on climate metadata and homogenization (Llanso 2003) list at least 14 data homogenization assessment techniques, and many more methods have been suggested. Different homogenization techniques (methods/models) may be required for different climate elements or the same climate element on different time scales. With so many procedures, many of which yield conflicting conclusions when applied to the same series, a need has arisen for a careful discussion and comparison of these methods, along with some recommendations concerning which procedure(s) are best to use in commonly encountered situations. This paper attempts to survey, contrast/compare, and modify eight prominent changepoint detection methods, as displayed in Table 1. Although these eight do not cover all methods proposed in the climate literature, they are very representative.

We are not the first to compare different changepoint procedures for climatologists. Easterling and Peterson (1995) conducted the first major attempt, although their introduced statistic did not always produce a clear conclusion. More recent review attempts have been made by Peterson et al. (1998), Ducré-Robitaille et al. (2003), Rodionov (2004), and DeGaetano (2006). A difficulty in attempting such a review lies in placing the methods on a common footing—they were often devised for different situations. In this article, we start with the simplest assumptions, so that the fundamental characteristics of each procedure can be understood and compared, before introducing various complications. We believe this will enable practitioners to understand better the similarities and differences among the different procedures.

To begin, we impose the following assumptions:

Under the null hypothesis of a homogeneous series (no changepoints), the series of interest

*Y*can be adequately described by a regression equation (also called a linear model) with error terms that are independent and identically distributed (IID) Gaussian (also called normal) random variables._{t}Over the period examined,

*Y*experiences at most one changepoint._{t}Except where noted, the procedures examined are being applied directly to

*Y*. Discussion of reference series is contained in section 6a._{t}

Although these assumptions are commonly made, they might be somewhat unrealistic in some climate applications. Relaxation of these assumptions and their effects on conclusions are discussed in section 6.

The remainder of this article proceeds as follows. Section 2 reviews many of the existing homogeneity assessment procedures, with section 3 constructing some modifications of these. Section 4 compares these procedures under various scenarios, and section 5 presents applications to two climate series. Section 6 discusses the effects (or lack thereof) of the assumptions on conclusions.

## 2. Review of existing methods

### a. Simple changepoint methods

The methods described in this section are designed for cases in which the underlying regression response form is known (e.g., linear, quadratic, sinusoidal), aside from whether a changepoint exists. One should not expect a single method to perform optimally over all regression response forms; indeed, the tests presented here are most powerful when the assumed regression structure and normality of errors hold but may have less power under departures from these assumptions. Regression response form uncertainties are considered in sections 2b and 3.

#### 1) SNH and its variants

The standard normal homogeneity (SNH) test, which has roots in Hawkins (1977), was first applied to climatic data by Alexandersson (1986) and then was used by many others such as DeGaetano (1996) and Rosenbluth et al. (1997). Alexandersson (1986) originally scaled a target series by a reference series to create a series *q _{t}* whose components were assumed to be normal. He standardized

*q*into a series of standardized anomalies

_{t}*Z*through

_{t}*Z*= (

_{t}*q*−

_{t}*q*)/

*s*, where

*q*and

*s*are the sample mean and sample standard deviation of

*q*. The elements in

_{t}*Z*were treated as normally distributed, with the following null (

_{t}*H*

_{0}) and alternative (

*H*) hypotheses:

_{A}where *μ*_{1} ≠ *μ*_{2}, 1 ≤ *c* < *n*, and the parameters *μ*_{1}, *μ*_{2}, and *c* are unknown. The symbol “∼*N*(*μ*, *σ*^{2})” represents a normal distribution with mean *μ* and variance *σ*^{2}. Under these assumptions, Alexandersson (1986) derived the likelihood ratio statistic to assess *H*_{0} versus *H _{A}*, that is, to determine the existence of a changepoint

*c*. His test statistic was

where

denote sample means before and after time *c*.

A major assumption behind the procedure of Alexandersson (1986) is that the standardization [i.e., *Z _{t}* = (

*q*−

_{t}*q*)/

*σ*, with the true standard deviation

_{q}*σ*replaced by its estimate

_{q}*s*] produces normal variables with a unit variance. This assumption is reasonable if the data are homogeneous (in which case

*s*is a consistent estimator of

*σ*). In the presence of a changepoint

_{q}*c*, the (overall) sample standard deviation

*s*is a biased and inconsistent estimator of

*σ*, in which case

_{q}*σ*should be estimated by the pooled sample standard deviation

_{q}where *s*_{1} and *s*_{2} are the sample standard deviations of the two samples *q*_{1}, . . . , *q _{c}* and

*q*

_{c+}_{1}, . . . ,

*q*, respectively). However, because the existence and location of

_{n}*c*are unknown, correctly estimating

*σ*(by

_{q}*s*or

*s*) is not feasible, and therefore there is no guarantee that the variance of

_{p}*Z*is close to unity, which invalidates the assumptions in (2.1) and (2.2).

_{t}To fix this inaccuracy, we propose a more precise variant of the SNH test, which can be applied to *Y _{t}* when a good reference series is not available, provided that

*Y*is IID and Gaussian. In particular, the hypotheses in (2.1) and (2.2) become

_{t}where *c*, *μ*, *μ*_{1}, *μ*_{2}, and *σ*^{2} are all unknown. To assess the existence of a changepoint 1 ≤ *c* < *n*, we derive the likelihood ratio statistic as in Alexandersson (1986) and obtain our test statistic *T*_{max}, defined as

where *Y*_{1} and *Y*_{2} denote the sample means of *Y _{t}* before and after

*c*and

*s*is the pooled estimate of the standard deviation of

_{p}*Y*.

_{t}Both *T*_{0} in Alexandersson (1986) and our *T*_{max} are likelihood ratio statistics, which affords one statistical optimality. If *c* is known, *T̃ _{c}* in (2.3) is the likelihood ratio statistic for testing whether

*μ*

_{1}≠

*μ*

_{2}when the variance of the data is unity and will be a

*χ*

^{2}variable with 2 degrees of freedom under

*H*

_{0}in (2.1). In a similar way, |

*T*in (2.6) is the standard two-sample

_{c}|*t*-test statistic for equality of means when the variance

*σ*

^{2}is unknown; this statistic has a

*t*distribution with

*n*− 2 degrees of freedom under

*H*

_{0}in (2.4). In both cases, the value of

*c*that maximizes

*T̃*or |

_{c}*T*is declared the most probable changepoint position. Because |

_{c}|*T*depends on

_{c}|*c*,

*T*

_{max}is the maximum of

*t*statistics over all “admissible” changepoint positions

*c*, and similarly for

*T*

_{0}. Our version of the SNH procedure avoids the inaccurate standardization used in Alexandersson’s procedure, and hence the statistic

*T*

_{max}will perform better. Henceforth, we consider only the modified version of the SNH procedure when referring to SNH.

Because *T* ^{2}_{c} and |*T _{c}|* provide the same changepoint information,

is equivalent to *T*_{max} in (2.6). For comparability with statistics to be introduced later, we henceforth use *T* ^{2}_{max}. The critical values of *T* ^{2}_{max} for some values of *n* and levels of type-I error have been simulated under *H*_{0} in (2.4) and are reported in Table 2. If *T* ^{2}_{max} exceeds a rejection threshold set to a specified tolerance level (frequently 5%), then the test concludes that a changepoint exists, and the *c* that maximizes *T* ^{2}_{c} (and |*T _{c}*|) is the estimate of the changepoint time.

#### 2) Nonparametric variants of the SNH procedure

The SNH procedure is also a likelihood ratio test when the model errors are IID and Gaussian. Normality of errors is a debatable assumption for many climatological series. To guard against a spurious “false changepoint” on the basis of one or two outliers, especially near the record boundaries (*t* = 1 or *t* = *n*), one might prefer a more robust procedure. Statisticians typically make parametric test procedures more robust (less sensitive to distributional departures from normality) by applying parametric procedures to the relative ranks of the data rather than to the observed values. If the sample size is large, such nonparametric tests may be only slightly less powerful than parametric tests and will typically provide better false-detection rates (type-I errors) and powers when the parametric assumptions are violated.

Because the *T*_{max} statistic is the maximum of *n* − 1 two-sample *t* statistics, the obvious nonparametric analog is the maximum of *n* − 1 Wilcoxon rank-sum statistics (or, equivalent, Mann–Whitney statistics). Thus, a nonparametric SNH procedure (hereinafter referred to as NPW) is one that detects a changepoint at time *c* when *W*_{max} is sufficiently large, where

and *W _{c}* is the square of a normalized Wilcoxon rank-sum statistic for each fixed

*c:*

where *r _{t}* is the rank of the

*t*th element in the series (e.g., if

*X*

_{10}is the 32d largest value, then

*r*

_{10}= 32). Critical values for the

*W*

_{max}statistic under the null hypothesis of no changepoint could be obtained, as with

*T*

_{max}, by simulation. Examples of these critical values are presented in Table 2. The time

*c*at which

*W*attains its maximum is the nonparametric estimator of the changepoint time.

_{c}The *W*_{max} statistic or variants thereof have been proposed by several climate authors, most notably Karl and Williams (1987), with a refinement by Ducré-Robitaille et al. (2003). Lanzante (1996) also bases his procedure on the Wilcoxon statistic, and Yonetani and McCabe (1994) use the Lepage modification of the Wilcoxon statistic. Pettit (1979) uses a Mann–Whitney statistic, which is equivalent to the Wilcoxon rank-sum statistic and should hence perform equivalently. Of these references, only Pettit gives a clear procedure for determining critical values. The others either assume that the location of the changepoint is approximately known from metadata or downplay multiple testing aspects (the many candidate times at which a changepoint could occur). If one ignores multiple testing aspects, any detection method will yield too many false changepoints.

Parameter estimation generally poses more difficulty in nonparametric cases. For example, if the *T*_{max} statistic suggests a changepoint at time *c*, the Gaussian maximum likelihood estimator of Δ would be

The typical nonparametric estimator of the shift is

where the median is taken over all pairs of indices (*t*_{1}, *t*_{2}) that satisfy 1 ≤ *t*_{1} ≤ *c* and *c* + 1 ≤ *t*_{2} ≤ *n*: a total of *c*(*n* − *c*) pairs of differences. Although this testing procedure yields a relatively simple estimator for the shift Δ, the methods break down when trend features are included in the model (see also section 6b).

#### 3) The two-phase regression model and recent revision

Hinkley (1969, 1971) proposed a two-phase regression (TPR) model with a changepoint at time c:

where predictor values *x*_{1} ≤ *x*_{2} ≤ . . . ≤ *x _{n}* are ordered and known; the errors

*ɛ*are zero mean, IID, and Gaussian with variance

_{t}*σ*

^{2}; and

*μ*

_{1},

*μ*

_{2},

*β*

_{1},

*β*

_{2}, and

*c*are unknown. Hinkley assumed continuity of the regression response at the changepoint

*c*, which translates to the constraint

*μ*

_{2}=

*μ*

_{1}+ (

*β*

_{1}−

*β*

_{2})

*x*. Solow (1987) used (2.10) to test the homogeneity of a temperature series. Solow took

_{c}*x*=

_{t}*t*to allow for a time trend [the continuity constraint is

*μ*

_{2}=

*μ*

_{1}+ (

*β*

_{1}−

*β*

_{2})

*c*]. To study changes in the trend, the null and alternative hypotheses are

*H*

_{0}:

*β*

_{1}=

*β*

_{2}and HA:

*β*

_{1}≠

*β*

_{2}.

A drawback of Solow’s (1987) application lies with the continuity constraint. Although slow continuous trends can be caused by increasing urbanization (the so-called urban heat island effect), a deterioration of the instruments, a gradual change in the station environment and/or station instrumentation, or location changes typically induce a mean shift discontinuity into a series. In such cases, a continuity constraint is undesirable.

Instead of imposing continuity constraints, Lund and Reeves (2002) revised the TPR model to

which allows both step-type (*μ*_{1} ≠ *μ*_{2}) and trend-type (*β*_{1} ≠ *β*_{2}) changepoints. The null and alternative hypotheses are

If *c* were fixed and known, then *H*_{0} could be tested by simply using the standard *F* test for model reduction:

where SSE_{0} and SSE* _{A}* are the sum of squared errors computed under

*H*

_{0}and

*H*(with a changepoint at

_{A}*c*), respectively. As suggested by the term ∼

*F*

_{2,n−4}, this statistic follows the

*F*distribution with (2,

*n*− 4) degrees of freedom under

*H*

_{0}. Large

*F*values suggest

_{c}*H*with a changepoint at time

_{A}*c*.

When *c* is unknown, we use

as the test statistic. If *F*_{max} is too large to be attributed to chance variation, one concludes *H _{A}*, and the

*c*maximizing

*F*is taken as the estimate of the changepoint time. The true distribution of

_{c}*F*

_{max}under

*H*

_{0}does not follow any well-known distribution type; this is in part due to serial correlation in the

*F*

_{c}s (across different values of

*c*). Lund and Reeves (2002) presented simulated critical values of

*F*

_{max}that enable one to reach statistically valid conclusions about the existence of a changepoint. A drawback of the original method of Hinkley (1971) (subsequently used by others) is the inaccuracy of the percentiles of the

*F*

_{max}statistic under

*H*

_{0}. In particular, Lund and Reeves show that the

*F*

_{3,n−4}null hypothesis distribution reported by Hinkley gives erroneously low critical values. Use of

*F*

_{3,n−4}critical values in lieu of the correct values in Lund and Reeves (2002) will result in acceptance of many spurious changepoints. Turner et al. (2006) provide an application of the TPR model of Lund and Reeves (2002), hereinafter referred to as the LR method.

#### 4) Two-phase regressions with a common trend

Wang (2003) noted that the LR model, while correcting the existing testing flaws, may be unrealistic in climate settings. This occurs because the most typical changepoint effect would shift mean series levels rather than affecting both the mean and the trend. Thus, a more realistic trend model is

where the terms are as defined previously. The hypotheses of interest are *H*_{0}: *μ*_{1} = *μ*_{2} and *H _{A}*:

*μ*

_{1}≠

*μ*

_{2}.

If *c* were fixed and known, then *H*_{0} could be tested by simply using

where the terms are defined analogous to those above; when *c* is unknown, *F*_{max} in (2.14) is again used as the test statistic. Wang (2003) simulated critical values of this *F*_{max} test statistic for some common values of *n*, as in Table 2. Her method has been applied to time series of several climate variables (e.g., Vincent et al. 2005; Wang 2006).

This model (henceforth XLW) differs from the SNH model in that SNH assumes no trend and differs from the LR model in that it does not allow trend shifts at the changepoint time. Each procedure is based on an *F*_{max} statistic (*T* ^{2}_{max} is also an *F*_{max} statistic because *T* ^{2}_{c} ∼ *F*_{1,n−2} for fixed *c*) and is statistically most powerful if the hypothesized structural form is correct and the errors are Gaussian. All three procedures are very powerful at correctly identifying the changepoint if the relative shift size RSS = Δ/*σ* is large. However, the power to detect a changepoint decreases as RSS decreases, and use of an incorrectly specified model increases variability in the estimates of both *c* and Δ.

A generalization of the XLW method would replace the time factor *t* by a known “covariate series” *x _{t}*, possibly a reference series. The use of a homogeneous reference series with the same climate signal (i.e., trends and periodicity) as the target series has the potential to reduce greatly the model error variance and hence to increase power. Maronna and Yohai (1978) develop a bivariate changepoint detection procedure in this case, and Potter (1981) furthers the work. In cases in which

*x*is deterministic and known, the analysis proceeds as before with

_{t}*t*replaced by

*x*; in cases in which

_{t}*x*is random (such as a reference series for

_{t}*Y*), the critical rejection percentiles are not purely a function of the series length

_{t}*n*(as they are under the XLW procedure) but are also a function of the covariates

*x*. Buishand (1984) discusses various modifications to these statistics that make critical values depend almost only on

_{t}*n*. It is typically easier to incorporate reference series by modifying the response variable [e.g., using

*D*=

_{t}*Y*−

_{t}*x*or

_{t}*Q*= ln(

_{t}*Y*/

_{t}*x*)] than it is to include the

_{t}*x*directly as an explanatory factor (see section 6a).

_{t}### b. Hierarchical changepoint methods

The methods in section 2a involve simple alternatives in that the regression response form of the alternative model is mathematically specified. The tests presented are optimal (most powerful) for detecting changepoints when the underlying regression response structure is parameterized correctly. In some cases, a specific functional regression response form is not known, and it could be worthwhile to consider a hierarchy of possible forms.

Table 3 presents a hierarchy of regression models. The SNH and NPW methods test reduction from model 3 to model 1, and the XLW and LR methods test reduction from model 4 to model 2 and from model 5 to model 2, respectively. Solow’s model (Solow 1987) is a special case of model 5 with the constraint Δ = −*β*_{2}*c* imposed. The modified Vincent’s method, described below, and the penalized likelihood methods of section 2b(2) allow one to choose the best-fitting model in Table 3 while simultaneously assessing whether an undocumented changepoint exists.

#### 1) Modified Vincent’s method

The first climate changepoint detection methods to consider regression response form adequacy and changepoints simultaneously were introduced in Vincent (1998) and refined in Vincent and Gullett (1999). An application to Canadian temperatures can be found in Bonsal et al. (2001). Neglecting covariate terms, Vincent’s procedure is a type of “forward regression” algorithm in that the significance of the nonchangepoint parameters in the regression model is assessed before (and after) a possible changepoint is introduced. In the end, the most parsimonious model is used to describe the data. Vincent includes models 1, 2, 3, and 5 of Table 3 in her hierarchy; we add model 4 to this hierarchy and call the procedure MLV.

Vincent’s method is not easy to interpret or program. Figure 1 presents a flow diagram showing our interpretation of the sequence of tests needed for the MLV method. At each node of Fig. 1, one asks if the residuals from the fitted model are acceptable (white noise), with the scheme stopping if the answer is “yes,” and continuing otherwise. Residual adequacy is assessed with the Durbin–Watson test and, if that is inconclusive, by sample autocorrelation assessment. If the MLV procedure opts for a changepoint model (models 3, 4, or 5), the changepoint location is estimated by the *c* that minimizes the SSE. However, the decision to proceed to the next node is based on residual adequacy tests and not on an *F* statistic. Although Vincent’s procedure is less powerful than likelihood-based procedures, Vincent’s idea of searching for an appropriate model within a regression hierarchy is of considerable merit and is the basis for other more recently developed detection procedures.

#### 2) Penalized likelihood criteria

Choosing the best model within a hierarchy is frequently resolved by minimizing penalized likelihood statistics. Two common penalized likelihoods are Akaike’s information criteria (AIC) and Sawa’s Bayes criteria (SBC; sometimes also called BIC):

where *p* is the number of parameters in the model under consideration, *n* is the series length, and *L* is the likelihood of the model being evaluated at the estimated model parameters. For linear models in the hierarchy of Table 3,

where SSE is the sum of squared errors for the model being fitted.

An AIC or SBC selector chooses the model with the minimum AIC or SBC statistic. The driving idea is to penalize for an excessive number of model parameters. Note that SBC penalizes more heavily than AIC and hence tends to yield simpler models. Both AIC and SBC were originally developed for assessing the significance of regression parameters. There is continuing debate in the statistical community about whether a changepoint parameter should be treated the same as other parameters; whether it should be penalized more heavily is still not clear.

For a known changepoint time *c*, it would be a simple matter to evaluate AIC or SBC for each of the five models in the hierarchy. However, in applying AIC or SBC criteria, −2 ln(*L*) would need to be computed for models 1 and 2, as well as for each *c* ∈ 1, . . . , *n* − 1 for models 3, 4, and 5. In parameter enumeration, models 3–5 are penalized one extra parameter for the maximization over *c*. Thus, if one searches for the maxima over models 1–5, there are *p* = 1, 2, 3, 4, and 5 parameters, respectively, in the AIC and SBC statistics. These criteria are appealing in that they circumvent the simulations required to obtain critical values for the *T*_{max}, *F*_{max}, and *W*_{max} statistics, although they do not provide fixed levels (*α*) of statistical significance.

## 3. Method modifications

In cases in which the true regression response form is not clear, hierarchical alternative methods that consider a variety of regression forms have merit. We first modify the LR and XLW methods to allow for model selection.

### a. Modified LR and XLW procedures

Lund and Reeves (2002) assess changepoint existence by testing the reduction of model 5 to model 2; in a similar way, Wang (2003) tests the reduction of model 4 to model 2. If no changepoint is detected, both the LR and XLW procedures would select the trend model (model 2). However, if model 2 is deemed appropriate, one could further test whether the trend term *β*_{1} should be included (i.e., testing *H*_{0}: *β*_{1} = 0 against *H _{A}*:

*β*

_{1}≠ 0). Thus, the best model might be model 1.

In a similar way, the original LR and XLW procedures can be modified to assess the need of other model parameters after *c* is estimated. For example, if model 4 is selected after applying XLW, one may still test whether to reduce to model 3. This is done with a conventional regression *F* test because *c* is now fixed at the position estimated by model 4. Such parsimony modifications are unrelated to our focus of changepoint detection, although these modifications can affect the power of detecting the “true” model given that the presence of a changepoint is correctly gauged.

### b. A new generalized algorithm

Figure 2 shows a new generalized algorithm (GNL) for detecting a changepoint under regression response form uncertainty. This algorithm builds from the hierarchical methods in section 2b and the modified LR procedure. The modified LR and XLW procedures estimate the changepoint location and assess the changepoint’s significance only on the first model fit. It is conceivable that estimation of the changepoint time could be confounded with model choice. That is, if the null and full models used for testing homogeneity are incorrect, the estimate of the changepoint time may not be accurate. For the modified LR and XLW procedures, one cannot revise the changepoint time estimate in model fits at later stages if the initial models are incorrectly specified. However, the new generalized algorithm allows the changepoint time to be reestimated as the procedure evolves.

In Fig. 2, a “C” beneath an arrow denotes a stage at which a test is performed between a changepoint model (models 5, 4, or 3) and a nonchangepoint model (models 2 or 1) over all possible changepoint times. An “x” beneath an arrow indicates testing a regression structure reduction between two changepoint models or two nonchangepoint models, with the changepoint position fixed. Each node in Fig. 2 asks whether reduction from a higher-level model to a lower-level model is statistically permissible, with the next fit (or termination of the algorithm) dependent on the answer. For example, if the test for reduction from model 5 to model 2 (denoted by “M5 → M2” with subscript C in Fig. 2) rejects model 2 and therefore an estimate, say, *ĉ*_{1} of the changepoint is obtained under model 5, then the next question would be whether model 4, a simpler changepoint model, is more appropriate. This is addressed by testing reduction from model 5 to model 4 (denoted by “M5 → M4” with subscript x) with the changepoint fixed at *ĉ*_{1}. If reduction from model 5 to model 4 is permissible, the next node asks whether there is still a significant changepoint and what is its best estimate under model 4. The second question is then answered by testing reduction from model 4 to model 2 over all possible changepoint positions. Here, a new estimate, say, *ĉ*_{2} of the changepoint time is obtained if the last test also rejects model 2.

Similar to backward-regression methods, each test is conducted at the desired *α* level. There are several opportunities in the GNL algorithm for the changepoint time to be reestimated, and this differs fundamentally from the modified XLW and LR procedures. The GNL and MLV procedures have similar merits, although MLV is not as comprehensive as GNL, and MLV is a “forward regression” procedure, whereas GNL uses “backward regression.” The GNL procedure requires the type-I *α*-level critical values for the SNH, XLW, and LR tests, as well as conventional *F*_{1,n−2}, *F*_{1,n−3}, and *F*_{2,n−4} critical values, for each sample size *n*.

## 4. Comparisons of methods

### a. Simulation setup

Simulations were conducted to compare the eight methods shown in Table 4. All simulations were governed by one of the five models in Table 3. In all cases, without loss of generality, *μ* was taken as zero and the errors were generated as IID *N*(0, 1) noise. When the changepoint models 3–5 were used, the parameters *c*, Δ, *β*_{1}, and *β*_{2} were varied as *c* = 50, 65, or 80; Δ = 0.5, 1, or 2 and *β*_{1} or *β*_{2} = 0.005, 0.01, or 0.02, as explained later. Each table entry is based on *M* = 10 000 runs. For each run, a series *Y _{t}* of length

*n*= 100 was generated and subjected to the eight methods, with a type-I error rate

*α*= 0.05 used for the first six methods. The variation in

*c*represents changepoints near the center, slightly off center, and far from the center, given that

*n*= 100. No attempt was made to simulate “more extreme” values of

*c*, because it is well known that most of these procedures have a higher-than-specified false-alarm rate for detecting changepoints near the boundary. The three choices for each of Δ,

*β*

_{1}, and

*β*

_{2}represent low, moderate, and high variation effects, given the series length (

*n*= 100) and the error standard deviation (

*σ*= 1).

The XLW and LR critical values are taken from Wang (2003) and Lund and Reeves (2002), respectively. The critical values needed for the SNH and NPW methods were estimated from 1 million simulations under the null hypothesis. These four critical values, for *n* = 100 and *α* = 0.05, are displayed in Table 2. Critical values for conventional *F*, Durbin–Watson, and correlogram tests were taken from standard tables. The AIC and SBC procedures have no critical values.

### b. Power and fit statistics

The traditional way to compare changepoint detection algorithms (after ensuring a common type-I error rate) examines their power of detection given that a changepoint actually exists. This study considers four measures of model adequacy—three power measures (CRM, CRC, and CRB) and one measure of fit (RMP). In particular, CRM is the probability of selecting the correct model, irrespective of estimated parameters, CRC is the probability of “closely estimating the changepoint time *c*,” irrespective of models (by closely estimating *c* we mean that model 1 or model 2 is selected for series with no changepoint or that the estimated *c* is within ±3 of the true *c* value when a changepoint actually exists), CRB is the probability of *both* identifying the correct model *and* locating the changepoint (if one exists) within ±3 of the true location, and RMP is the average root-mean-squared-prediction error for procedure *j*:

where *Ŷ _{m,t}*(

*j*) is the predicted series value at time

*t*using procedure

*j*on the

*m*th simulation run,

*E*(

*Y*) is the true expected value of the series at time

_{t}*t*under the simulated conditions, and

*p*(

_{m}*j*) is the number of estimated parameters (including

*c*) in the model selected by procedure

*j*in the

*m*th simulation (here,

*M*= 10 000 and

*n*= 100). The RMP statistic quantifies separation between procedures even when the procedures themselves yield different models. In contrast, the three power measures are all dichotomous success/failure judgments of each simulation’s correctness.

### c. Simulation results for nonchangepoint models

Tables 5 and 6 summarize simulations for models 1 and 2. Here, the integer displayed in row *i* and column *j* of the left-hand part of the table displays the number of simulations for which procedure *i* chose model *j* as the best model over the 10 000 runs. The right-hand portion of the tables shows RMP and empirical counts of CRM, CRC, and CRB.

The Table 5 results are as expected when model 1 (no trend or shift) holds. For the SNH, NPW, and MLV methods, the correct model 1 is chosen in 95% of the simulations. For XLW and LR, there is a 95% chance that a nonchangepoint model (model 1 or model 2) is found and a 90% (≈95%^{2}) chance that model 1 is chosen after the subsequent test of model 2 versus model 1. The GNL method chooses model 1 with an 89% probability. The AIC and SBC procedures both overparameterize relative to the other six procedures. The penalty term used by SBC appears reasonable, with only 6.7% of the simulations yielding an incorrect changepoint model (models 3, 4, or 5). However, AIC is extremely underpenalized, because an incorrect model is selected in about 70% of the simulations. Among all eight methods, the MLV method has the best RMP of 0.087. From a practical viewpoint, all except AIC perform well under this null model (model 1) simulation.

Table 6 summarizes simulations in which model 2 holds, that is, with no shift but with the trend increasing from *β*_{1} = 0.005 to *β*_{1} = 0.01 to *β*_{1} = 0.02, respectively. As the trend increases, the procedures diverge. The XLW, LR, and GNL methods still report a 5% chance of falsely choosing models 3, 4, or 5. However, whether the 95% of non-changepoint-detected simulations are assigned correctly to model 2 depends on the magnitude of *β*_{1}, with power increasing for all three methods as *β*_{1} increases, and little difference between the three. The SNH and NPW methods decide between models 1 and 3. If the true trend is small, they typically choose the null model 1, whereas, as the trend increases, they opt for model 3, making their best step-function approximation to a linear increase. As *β*_{1} increases, MLV eventually selects the correct model 2 but is much less powerful than XLW, LR, or GNL. The two penalized procedures again overparameterize, with AIC being much worse than SBC. As *β*_{1} increases from 0 (*β*_{1} = 0 is model 1), type-I errors of AIC and SBC increase from 70% to about 91% and from 7% to 32%, respectively (when *β*_{1} is slightly larger than 1/*N*; not shown in Table 6), and then slowly drop to 0 as *β*_{1} increases further. Their rates of convergence to the true model are much slower than those of XLW, LR, and GNL, with SBC yielding a power of 80% (and AIC 18%) at *β*_{1} = 0.02 (see column CRM). The RMP statistic reveals more variety. For a small trend (*β*_{1} = 0.005), the MLV method is similar to XLW, LR, and GNL, with SNH, NPW, and SBC all displaying slightly greater RMP, and AIC again being much worse. For a moderate trend (*β*_{1} = 0.01), XLW, LR, and GNL are about equally good, followed by SBC, with SNH, NPW, MLV, and AIC all considerably worse. Last, with the large trend (*β*_{1} = 0.02), XLW, LR, and GNL continue to perform best, followed by SBC, MLV, and AIC, with SNH and NPW bringing up the rear.

### d. Simulation results for changepoint models

#### 1) Simulation results for model 3

Table 7 displays fit and power statistics for the eight procedures for the changepoint locations *c* = 50, 65, and 80 and shift Δ = 0.5, 1.0, and 2.0 under model 3. SNH and NPW uniformly yield the largest power. For any procedure, as Δ increases, the power increases, RMP decreases, and performance differences of the procedures become evident. Grouping procedures with respect to the power statistics gives the ranking

where procedures in parentheses are deemed roughly equivalent. AIC is not listed here, because its relative performance varies considerably, being relatively good for small Δ but worsening as Δ increases. The RMP rankings are similar, with the relative differences better discriminated for moderate (Δ = 1.0) than for small or large shifts. For fixed shifts Δ and procedures, the effect of the location of the changepoint *c* is mixed. For SNH and NPW, the power of detection decreases as the changepoint moves away from the center. The AIC and SBC powers are approximately constant, with a slight drop at *c* = 80. For XLW, LR, GNL, and MLV, the power of detecting the changepoint (for fixed Δ) increases slightly as *c* moves away from the center. For all procedures, RMP generally becomes slightly larger as *c* moves away from the center for a moderate shift Δ = 1.0, but very little variation in RMP over *c* is noted for small or large shifts.

#### 2) Simulation results for model 4

Table 8 displays simulation results under model 4. For these simulations, *c* was fixed at 50; *β*_{1} and Δ were varied as 0.005, 0.01, 0.02 and 0.5, 1.0, 2.0, respectively. The SNH and NPW methods cannot select the correct model (model 4), and so they have CRM = CRB = 0. They do a reasonable job at detecting the correct location of *c*, as seen from the CRC statistic, although this is partially because the changepoint occurs at the midpoint (*c* = 50) in these simulations. For the other six procedures, the CRMs and CRCs increase as the trend *β*_{1} or the shift Δ increases, that is, as one moves from upper left to lower right in Table 8. Some exceptions occur for AIC, LR, and GNL. Overall, there is no unique “best” procedure when model 4 holds. Using RMP or any of the power statistics, the following ordering seems to hold:

Neither SNH nor NPW can model trends and hence are typically worse than the above four procedures. The XLW method is generally the “winner,” as one would expect under model 4. By the RMP criterion, SBC is generally comparable to XLW and better than AIC, whereas AIC is better than either by CRM. By the CRC criteria, SBC is always better than XLW, with the difference in power being fairly dramatic under some conditions (53% vs 24% for *β*_{1} = 0.005 and Δ = 1.0) and less so as the parameters increase (95% vs 94% in the lower-right panel of Table 8).

#### 3) Simulation results for model 5

Table 9 displays simulation results under model 5 (i.e., with trend *β*_{1} before *c* and *β*_{1} + *β*_{2} afterward, along with a mean shift Δ at *c*). For these simulations, *β*_{1} = 0.01 and Δ = 1.0, and *c* and *β*_{2} were varied as 50, 65, 80 and .005, 0.01, 0.02, respectively. The SNH, NPW, and XLW procedures cannot choose the correct model 5. For these three procedures, for fixed *c*, CRC increases as *β*_{2} increases. For a fixed *β*_{2}, as *c* moves further from the center, the XLW method performs better in terms of RMP and CRC, but neither SNH nor NPW improves significantly. Among the five methods that could select model 5, MLV has a very disappointing performance, with its largest CRM being 3%; LR and GNL are very similar, even identical under some scenarios; AIC and SBC behave similarly under model 5, with AIC always better than SBC (which is usually slightly better than GNL/LR) by any criteria. Overall, AIC is the best procedure to use if one “knows” that model 5 is correct. Because AIC tends to overparameterize, it is not surprising that it wins in cases in which the simulations are run at the upper limit of the model hierarchy.

### e. Conclusions

In summary, there is no unique best procedure by any criteria. Under the null hypothesis of no changepoints (models 1 or 2), all six of the *α*-level procedures do yield 5% chance of falsely detecting a changepoint. AIC behaves very poorly with respect to this criterion, with a 70% type-I error under model 1 and up to 91% type-I error under certain model-2 scenarios. SBC performs acceptably under model 1 (6.7% type-I error) but could have type-I error as high as 32% under certain model-2 scenarios.

For model 3, which many consider to be the most realistic when good reference series are available (so that trend and periodic effects can be diminished), the SNH and NPW procedures are best and similar if the errors are truly normally distributed. If gross outliers are present, NPW will be more powerful than SNH. If the changepoint occurs near the boundaries of the series, neither SNH nor NPW will be very powerful at detecting the changepoint. For models of more complexity than model 3, SNH and NPW are reasonable at detecting changepoints but are useless in estimating other parameters.

The XLW, LR, and GNL procedures behave similarly, especially if the true model complexity is model 3 or below. In such cases, XLW, because it is simpler, is less prone to overparameterization and thus performs slightly better. If model 4 truly holds, XLW is better than LR or GNL. If model 5 is correct, LR and GNL will eventually outperform XLW, but not until the trend difference parameter *β*_{2} is very large. However, large trend changes (i.e., large *β*_{2} values) do not appear to be very realistic for most climate series. The MLV procedure does not appear to perform very well, except under models 1 and 2, where it has the correct type-I error. The main deficiency of MLV is that it uses an inefficient test to detect changepoints.

Overall, one would not want to use AIC or MLV because of the high type-I error rate and low power, respectively. It also seems that LR and GNL are more complex than necessary in most situations. Thus, XLW, SNH, NPW, and SBC would be reasonable alternatives. Of these, SBC has a simplicity advantage in that no critical values are needed, although its high type-I error under model 2 is a drawback. Perhaps a modified SBC procedure that penalizes the changepoint parameter more heavily than the other parameters merits further study. If good reference series (see section 6 for definition) are available, the SNH and NPW procedures are probably best, but for the many cases in which such reference series are unavailable or are of uncertain quality, the XLW and SBC procedures appear to be optimal.

## 5. Applications

The eight methods above are now applied to two annual average temperature series of interest. The first example comes from Tuscaloosa, Alabama, for which the complete series record extends over the 100-yr period of 1901–2000. This is a well-behaved series for which documentation is believed to be fairly good. There are three documented changepoints resulting from equipment changes or station relocations: June 1939, November 1956, and June 1987. Here, we will examine the 47-yr segment from 1940 to 1986 (Fig. 3a). Assuming that there are no other undocumented changepoints during this period, the procedures should detect only the known changepoint in 1956.

For the 1940–86 segment, all eight procedures detect the changepoint at *c* = 1957, one year later than actuality. All methods except AIC select model 3, with means of 17.789°C from 1940 to 1957 and 16.960°C from 1958 to 1986; that is, Δ = −0.829°C. AIC also detects the changepoint in 1957 but prefers model 4. In this example, all eight methods give approximately the same answers and jibe with historical records. Of course, this is the best answer in retrospect, given that we knew approximately where all changepoints should occur and were able to segment so as to reduce consideration to at most one changepoint. In practice, if one is not sure that the series being examined has at most one changepoint, as demonstrated next, the methods can easily produce conflicting models, none of which may actually be correct.

Our second example examines a 90-yr (1911–2000) annual series of temperatures from Libby, Montana. Again, the documentation is believed to be very good and indicates only one changepoint, in 1938 (resulting from a change in latitude and elevation). A plot of this series is shown in Fig. 3b. All eight detection methods were applied to the series. The two simplest methods, SNH and NPW, use model 3 to detect a changepoint in 1930, with a shift of Δ = +0.818°C between 1930 and 1931. The SBC, XLW, and MLV methods detect a changepoint in 1947, preferring model 4 with *μ* = 6.008°C, *β*_{1} = +0.032°C yr^{−1}, and a shift of Δ = −1.101°C between 1947 and 1948. Both model fits are superimposed on the original series in Fig. 3b. The AIC, LR, and GNL methods, not shown in Fig. 3b, all detect the changepoint in 1945, using model 5. Thus, all eight procedures detect changepoints in the general neighborhood of the documented changepoint (*c* = 1938), but location estimates and the preferred model vary. Perhaps this discrepancy is attributable to the effects of other undocumented (but small in magnitude) changepoints elsewhere in the series, or perhaps none of the models examined sufficiently describes this series’ behavior. With respect to the last explanation, the estimated lag-one autocorrelation for the series, after correcting for the changepoint, is 0.11, perhaps too seriously violating the IID error assumptions under which all eight procedures are predicated.

## 6. Other considerations and future directions

### a. Reference series

As seen in section 4, the power of detection may be low, even for moderately large shifts, especially when the changepoint occurs near the record boundaries. Incorporating a good reference series, when available, can boost changepoint detection power. A good reference series *R _{t}* should be homogeneous and highly correlated with the target series

*Y*. Gauging

_{t}*Y*relative to

_{t}*R*in some fashion results in a new series [e.g.,

_{t}*D*=

_{t}*Y*−

_{t}*R*or

_{t}*L*= ln(

_{t}*Y*/

_{t}*R*)] with much less variability than

_{t}*Y*itself and with features such as trends and periodicities removed or simplified. Then the previously mentioned procedures should yield markedly improved results when applied to the new series.

_{t}One must be certain that the reference series *R _{t}* is changepoint free over the period of comparison. The use of a reference series that is not homogeneous and/or has different climate signals (trends and periodicities) would complicate the problem of changepoint detection/adjustment. For the Libby series in section 5, an attempt was made to find a reference series. Temperature series from 34 stations in Montana were examined, three of which were very highly correlated with the Libby series after subtracting monthly means: Fortine (

*r*= 0.903), Kalispell (

*r*= 0.906), and Saint Ignatius (

*r*= 0.902). The average of these three series was used as a reference, and the eight detection procedures were applied to the annual differences:

*D*=

_{t}*Y*, for

_{t}− R_{t}*t*= 1911–2000. None of these methods came close to detecting the known changepoint in 1938. Results were not appreciably better when any of the three individual series was used as a reference. This example illustrates a well-known deficiency of reference series; they do not work well unless the reference has no changepoints.

### b. Departures from normality

Section 2a(2) shows how a parametric changepoint detection procedure can be made nonparametric by applying parametric methods to the relative ranks of the series. Nonparametric procedures are less sensitive to outlying and skewed data, making them good for hypothesis testing but sometimes inconvenient for parameter estimation. It is fortunate that, as noted in section 2a(2), the NPW procedure will yield simple parameter estimates for both *c* and Δ when no trend is present; that is, under the conditions of model 3. Nonparametric generalizations to trend-inclusive procedures such as XLW or LR could be easily devised, but estimating model parameters after a changepoint is detected by such methods is problematic. There is discussion of using Theil’s estimator (a nonparametric slope estimator) in Lanzante (1996), but this method has not yet been fully developed. Nevertheless, if one suspects outliers, then it is wise to examine nonparametric and parametric changepoint procedures in tandem.

### c. Periodicities and autocorrelated errors

The models considered here were developed for annual series. Many climate series, such as monthly and daily series, have periodic features. The procedures examined in this article should not be expected to work well for periodic data. In theory, correcting for a periodic mean is not too difficult, because one would effectively replace the mean in the previous models by a periodic mean with a specified (known) period.

Autocorrelation is another concern. The methods developed previously assume independent error terms. This assumption, while tenable for some annual climate series, is not realistic for daily or monthly series, where there is much empirical evidence of autocorrelation. Although the difficulties posed by periodicity and autocorrelation are distinct, it is pragmatic to attempt to account for them simultaneously. An article that does exactly this is Lund et al. (2007). One conclusion therein is that changepoint detection procedures developed for independent error series when positive autocorrelation is truly present will result in the detection of too many false changepoints.

### d. Multiple changepoints

The procedures presented control the type-I error rate and are heavily dependent on the “at most one changepoint” assumption. If the series is long and undocumented changepoints are plausible, then more than one undocumented changepoint may be present. Many practitioners merely search for the most obvious changepoint, correct for that (if one is found), and then reapply the method to the corrected series. This can lead to erroneous adjustments, because the effects of the first changepoint are heavily biased when other unaccounted changepoints are present (Wang and Feng 2004). In the ideal case, all possible changepoint times should be identified jointly before their mean shift magnitudes are estimated.

Multiple changepoints are an active current area of statistical research. In climate settings, Wang and Feng (2004) introduce a semihierarchical splitting algorithm to identify multiple changepoints [Menne and Williams (2005) introduced a similar algorithm]. In this procedure, “prior” changepoint times are reassessed after additional changepoint times are located. Methods that impose a Bayesian prior on the number of possible changepoints and then fit the model conditional on these changepoint times show promise. Among the methods discussed in this article, the method most amenable to this approach is the SBC procedure, because it could easily be evaluated and penalized for any model and number of changepoints. In fact, the multiple-changepoint procedure suggested by Caussinus and Mestre (2004) is very similar to an SBC approach generalized to multiple changepoints.

## Acknowledgments

The Climate Research Division of the Atmospheric Science and Technology Directorate of Environment Canada is acknowledged for supporting Jaxk Reeves and Jien Chen through Contracts KM040-04-5200 and KM040-05-0121. Robert Lund and QiQi Lu were supported in part by National Science Foundation Grant DMS 0304407.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**.**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**.**

**,**

## Footnotes

*Corresponding author address:* Dr. Jaxk Reeves, Department of Statistics, The University of Georgia, Athens, GA 30602. Email: jaxk@stat.uga.edu