## 1. Introduction

Climatic time series that are free of artificial changepoints are indispensable to the study of observed climate variability and change, especially at local and regional scales (Easterling et al. 1996). Unfortunately, few climate series of even modest historic length are characterized only by variations in weather and climate. Even minor changes in a meteorological station’s environment or in observation practices can artificially alter the mean level of measurements and/or introduce a local trend (Conrad and Pollack 1962). In situ observation practice changes include instrument relocations or replacement, sensor drift from calibration, changes in land use/land cover surrounding the observing site, and changes to the daily observation schedule. The challenge of artificial changepoint detection and adjustment in climate series is reflected by the expansive literature on the subject. Peterson et al. (1998a) provide a review of many of the techniques that have been used or proposed in the climate literature. Techniques to evaluate documented risks of changepoints have been used (e.g., Karl and Williams 1987), in addition to those applied in the detection of unknown (undocumented) changepoints (e.g., Solow 1987; Easterling and Peterson 1995; Alexandersson and Moberg 1997; Vincent 1998; Lund and Reeves 2002). Archives or other knowledge of observational practice can be used to test for artificial shifts at the instant of known observation practice changes. Unfortunately, station histories (metadata) are often incomplete, and climate series may contain undocumented changepoints, even when relatively extensive metadata exist.

In the absence of corroborating metadata, however, questions regarding the veracity of apparent undocumented changepoints can arise. This is especially true when the interest lies in the continuity of a single or small number of time series versus a collection of series used *en masse* to calculate, for example, the spatial mean across a large region (Easterling et al. 1996). Some questions are probably inevitable because a certain background rate of type I and type II errors is always present. Nevertheless, determining the appropriate sensitivity of an undocumented changepoint test can be an iterative process, and many “false” changepoints may be revealed if an inappropriate sensitivity level or test statistic is used (Lavielle 1998; Lund and Reeves 2002). Visual inspection of a time series can provide insight into possible changepoints, but such inspection becomes impractical when a large number of time series requires evaluation. Moreover, even with visual inspection, the presence of a nonclimatic changepoint may still be debatable (Lund and Reeves 2002), and the analyst has little recourse other than to speculate on its cause or lack thereof.

Given the necessity of testing for undocumented changepoints and requirements for automated detection in some circumstances (e.g., the reprocessing and/or update of large datasets), a comparison of the characteristics of some commonly used test statistics is described below. Rather than compare, for example, the percentage of simulated changepoints that are identified by various tests (see Ducré-Robitaille et al. 2003 for a recent comparison of eight methods), this comparison was undertaken to ascertain whether multiple tests can be combined to improve overall confidence in undocumented changepoint detection. Specifically, the goal was to evaluate to what degree various test statistics provide independent assessments of the presence of undocumented changepoints and their position in a series. The comparison between tests was likewise motivated by the desire to evaluate undocumented changepoint detection as a function of the method that was used to formulate a composite reference series against which a target (candidate) series is compared. Frequently, a difference or ratio series is formed between the target and reference series in order to differentiate artificial changepoints from those rooted in true climate change and variability. Changepoint detection skill was, therefore, evaluated using different formulations of composite reference series. In addition, because the test statistics that are commonly applied to climate series are strictly relevant to determining the likelihood of a single changepoint, the skill of detection was evaluated for series that contain multiple changepoints, including in the component series that are used to form a composite reference. In practice, multiple changepoints are commonly present in both the target climate series and in series from nearby locations used to estimate the background climate signal. Situations in which multiple undocumented changepoints occur in all series are particularly challenging. 
Therefore, the skill of successive hypothesis testing using multiple tests is compared to an alternative approach, which optimizes a statistic based on an exhaustive comparison of all possible changepoint number and position combinations.

A description of the test statistics used in the comparison is given in section 2. Methods used to detect multiple undocumented changepoints and the framework for evaluating changepoint detection skill are also described in section 2. Three alternative formulations of composite reference series are discussed in section 3, as well as the simulation of groups of cross-correlated climate series. Changepoint detection results are presented in section 4. A discussion and concluding remarks are provided in section 5.

## 2. Changepoint tests and quantification of detection skill

Three test statistics that are commonly applied to climate series were used in the comparison. Ducré-Robitaille et al. (2003) found these statistics to be among the highest performing in terms of the combination of changepoints that are correctly identified and the number “falsely detected” in series with multiple step changes. Thorough descriptions of the test statistics can be found in Alexandersson (1986), Vincent (1998), and Lund and Reeves (2002), so only brief descriptions are provided below. These tests can be used with or without comparison to observations from nearby stations. In practice, however, a reference series is commonly used and the exposed changepoints are relative nonhomogeneities (Conrad and Pollack 1962; Alexandersson and Moberg 1997). While each test statistic may be used to detect a change in slope (trend) as well as a change in mean, here changes in mean level only were considered. Lund and Reeves (2002) note that step- and trend-type changes are difficult to unconfound in general. Wang (2003) discusses the potential of confounding artificial changepoints and those that are associated with true periodic variations in a climate series. That risk should be alleviated with the use of a reference series, provided that it and the target series are characterized by similar true variations. Nevertheless, the automated, skillful detection of local climate trends remains a difficult problem.

Alexandersson’s (1986) standard normal homogeneity test considers a climate series {*Y*_{t}} of length *n* in its standardized form. The series {*Y*_{t}} may be either a raw climate series or a sequence of differences or ratios formed with a reference series. Assuming that {*Y*_{t}} is normally distributed, a single shift in the level of the standardized series {*z*_{t}} is determined using the null hypothesis H_{o} and alternative hypothesis H_{a}, given by

H_{o}: *z*_{t} ∈ N(0, 1) for *t* = 1, . . . , *n*;
H_{a}: *z*_{t} ∈ N(*μ*_{1}, 1) for *t* ≤ *c*, and *z*_{t} ∈ N(*μ*_{2}, 1) for *t* > *c*.

If H_{o} is rejected in favor of H_{a}, the implication is that there has been a shift in the level of the *z* series. With the sample means that are used as the maximum likelihood estimators for the means before (*z̄*_{1}) and after (*z̄*_{2}) all possible instances of shift, the test statistic can be calculated as (Hawkins 1977; Alexandersson 1986)

*T*_{c} = *c z̄*_{1}² + (*n* − *c*) *z̄*_{2}².  (1)

Critical values of *T*_{c} are generated via Monte Carlo simulations of {*z*_{t}} under the null hypothesis, recording the maximum *T*_{c} value for each realization as *T*_{max}. H_{o} is rejected when *T*_{c} in a series exceeds the chosen percentile of *T*_{max} for one or more values of *c*, the instant of the change (defined here as the last value at the former level). Alexandersson and Moberg (1997) discuss how the likelihood of a change in trend can be similarly obtained using a likelihood ratio test. Potter (1981) describes a different version of the likelihood ratio test in which comparison to a reference is implicit to the test statistic (see also Maronna and Yohai 1978).
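As a concrete illustration, a minimal sketch of the statistic in (1) and its Monte Carlo critical value follows; the code and function names are ours, not the paper’s:

```python
import numpy as np

def snht_statistic(z):
    # T_c = c*zbar1^2 + (n - c)*zbar2^2 for every candidate split c,
    # where c counts the values before the change (Hawkins 1977;
    # Alexandersson 1986). Result index c - 1 corresponds to split c.
    z = np.asarray(z, dtype=float)
    n = len(z)
    t = np.empty(n - 1)
    for c in range(1, n):
        t[c - 1] = c * z[:c].mean() ** 2 + (n - c) * z[c:].mean() ** 2
    return t

def tmax_percentile(n, pct=95.0, n_sims=1000, seed=0):
    # Monte Carlo percentile of T_max under H_o (standard normal series).
    rng = np.random.default_rng(seed)
    maxima = [snht_statistic(rng.standard_normal(n)).max() for _ in range(n_sims)]
    return float(np.percentile(maxima, pct))
```

A series with a clear step should yield a maximum well above the simulated percentile, peaking near the last value at the former level.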

In the two-phase regression approach, the model for {*Y*_{t}} is given by (Lund and Reeves 2002)

*Y*_{t} = *μ*_{1} + *α*_{1}*t* + *ε*_{t} for *t* ≤ *c*, and *Y*_{t} = *μ*_{2} + *α*_{2}*t* + *ε*_{t} for *t* > *c*.

Under the null hypothesis of no changepoint, the differences in intercept (*μ*_{1} − *μ*_{2}) and slope (*α*_{1} − *α*_{2}) should be close to zero for each *c* ∈ {1, . . . , *n*}, and a single phase of the regression would be justified because *μ*_{1} ≈ *μ*_{2} ≈ *μ*_{RED} and *α*_{1} ≈ *α*_{2} ≈ *α*_{RED}. The subscript “RED” refers to a single-phase or “reduced” model. To evaluate the null hypothesis of no changepoint versus the alternative hypothesis of an undocumented changepoint, an *F* statistic is calculated at each position *c* in the time series as

*F*_{c} = [(SSE_{RED} − SSE_{FULL})/3] / [SSE_{FULL}/(*n* − 4)],

where SSE_{FULL} refers to the sum of the squared errors about each of the two phases (the “full” model) and SSE_{RED} to that about the reduced model. As with the *T*_{max} test statistic, percentiles of the *F* statistic are obtained via simulations under the null hypothesis (Lund and Reeves 2002), in this case recording the maximum *F*_{c} value in each series of *F*_{c}s as *F*_{max}; H_{o} is rejected when *F*_{max} is greater than the chosen percentile (significance level).

When no trend is present, the slope terms *α* can be eliminated and the two-phase model for {*Y*_{t}} becomes (Lund and Reeves 2002)

*Y*_{t} = *μ*_{1} + *ε*_{t} for *t* ≤ *c*, and *Y*_{t} = *μ*_{2} + *ε*_{t} for *t* > *c*,  (6)

in which case the *F* statistic will have 1 numerator degree of freedom and *n* − 2 denominator degrees of freedom. If (6) is used, the two-phase regression test is equivalent to the likelihood ratio test; however, while a series of *F*_{c}s based on (6) will be similar to a series of *T*_{c}s using (1), critical values depend on which form of test statistic is used, as shown in Fig. 1. Unlike the *F*_{1,*n*−2} and *t* statistics, which can be appropriate for evaluating the likelihood of a shift at the instant of a known risk of changepoint (Lund and Reeves 2002), *F*_{max} for model (6) is not the square of a *T*_{max} statistic, and large differences in critical values exist for smaller sample sizes (*n*).
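A sketch of the *F*_{c} series for the no-trend model (6) follows; this is our simplified illustration, not Lund and Reeves’s code, and the function name is ours:

```python
import numpy as np

def f_no_trend(y):
    # F_c = (SSE_RED - SSE_FULL) / (SSE_FULL / (n - 2)) at each split c,
    # where SSE_RED is about the single (reduced) mean and SSE_FULL is
    # about the two segment means (no-trend two-phase model).
    y = np.asarray(y, dtype=float)
    n = len(y)
    sse_red = ((y - y.mean()) ** 2).sum()
    f = np.empty(n - 1)
    for c in range(1, n):
        sse_full = ((y[:c] - y[:c].mean()) ** 2).sum() \
                 + ((y[c:] - y[c:].mean()) ** 2).sum()
        f[c - 1] = (sse_red - sse_full) / (sse_full / (n - 2))
    return f
```

As with *T*_{max}, the maximum of this series is compared against percentiles simulated under the null hypothesis.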

### a. Detection of multiple changepoints

To fit *K* segments (or *K* − 1 changepoints) to a series, the time series may be treated as *piecewise* stationary, and the changepoint positions are chosen to minimize the pooled sum of squared deviations of the observations from their segment means,

Σ_{k=1}^{K} Σ_{t∈T_k} (*Y*_{t} − *Ȳ*_{k})²,  (7)

where T_{k} denotes the *k*th segment bounded by *c*_{k−1} and *c*_{k}, with *c*_{0} = 1 and *c*_{K} = *n*. The solution to (7) frequently has been based on successive hypothesis testing using a hierarchic binary segmentation of the series (Hawkins 2001). In this approach, a series is split at the location where the hypothesis test statistic reaches a maximum, provided that its critical value is exceeded. Subsequences on either side of the split are likewise evaluated, and the process is repeated recursively until either the magnitude of the statistic does not exceed the chosen significance level in the remaining subsequences or the sample size in a segment is too small to test. This kind of solution is called “greedy” because changepoints are selected to maximize the separation between segments at each split, as opposed to evaluating all possible changepoint combinations iteratively to identify the optimal multiway split. The solution is hierarchic because it will reliably converge to the optimal solution only when the true changepoints are hierarchic, which may not be the case (Hawkins 2001).
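The hierarchic binary splitting just described can be sketched as follows. This is a simplified illustration with a fixed critical value rather than per-segment simulated percentiles, and the names are ours:

```python
import numpy as np

def mean_shift_stat(y):
    # T_c-style mean-shift statistic on a segment standardized in place.
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()
    n = len(z)
    return np.array([c * z[:c].mean() ** 2 + (n - c) * z[c:].mean() ** 2
                     for c in range(1, n)])

def binary_segmentation(y, crit, min_len=5):
    # Greedy hierarchic splitting (Hawkins 2001): split at the statistic's
    # maximum wherever it exceeds crit, then recurse on each side.
    found = []
    def recurse(lo, hi):
        if hi - lo < 2 * min_len:
            return
        stats = mean_shift_stat(y[lo:hi])
        c = int(np.argmax(stats)) + 1   # values before the split
        if stats[c - 1] > crit:
            found.append(lo + c)
            recurse(lo, lo + c)
            recurse(lo + c, hi)
    recurse(0, len(y))
    return sorted(found)
```

Because each split is locally optimal, the recursion can miss configurations that only a joint evaluation of all splits would recover.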

Lavielle (1998) instead frames segmentation as the minimization of a penalized contrast, with *Ȳ*_{k} denoting the mean of the *k*th segment. The penalized contrast function in Lavielle’s (1998) approach takes the form

*U* = Σ_{k=1}^{K} Σ_{t∈T_k} (*Y*_{t} − *Ȳ*_{k})² + *βK*,  (8)

with *β* = 2*ασ*_{ε}². The first term on the right-hand side of (8) measures the fidelity of the model to the observations {*Y*_{t}}, while the second term, the penalty function, is proportional to the number of changepoints. The estimated number of segments will be, in this case, the greatest *K* with a *p* value that is larger than *α*, the configuration of which is determined by minimizing *U*. An optimal global solution, therefore, requires evaluation of the large number of possible changepoint number and position combinations (a total of 2^{*n*−1}), for which dynamic programming can be used to reduce computational complexity (Lavielle 1998; Hawkins 2001).

Because the number of artificial shifts in a climate series is generally unknown, an optimal global solution will likely require calibration in order to avoid revealing too many “unimportant” or “false” changepoints (Lavielle 1998). The nature of the jumps that are identified in a series is calibrated via a penalty function like *β* (see also Akaike 1974; Schwarz 1978; Caussinus and Mestre 2004). Ideally, the penalty function should set the desired balance between the probability (power) of detection and probability of false detection. A solution using a relatively large penalty function will expose only the more important “jumps,” but will overlook others. On the other hand, a small penalty function may reveal too many changes that are caused only by chance variation in the time series. Consequently, the best choice of the penalty function may not be obvious, but could be selected by a specialist with experience using the data. We used a very small *p* value (0.000 01) to solve (8) because too many changes are detected with a larger value (M. Lavielle 2005, personal communication). Nevertheless, the choice will likely require some level of intervention, ideally for each series tested (Lavielle 1998; Caussinus and Mestre 2004).
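A minimal dynamic-programming sketch of minimizing a penalized contrast of this form (within-segment squared error plus a per-segment penalty *β*) is shown below. It is an illustration under our own naming, not Lavielle’s code:

```python
import numpy as np

def penalized_segmentation(y, beta):
    # best[j]: minimum over all segmentations of y[:j] of
    # (sum of segment SSEs + beta per segment); prev[j] backtracks cuts.
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def sse(i, j):  # squared error of segment y[i:j] about its mean
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + beta
            if cost < best[j]:
                best[j], prev[j] = cost, i
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = prev[j]
    return sorted(cuts)[:-1]   # interior changepoints (segment boundaries)
```

Although 2^{n−1} segmentations are implicitly in play, the recursion evaluates them in O(n²) time; the choice of `beta` plays exactly the calibration role discussed above.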

For successive hypothesis testing, we used a semihierarchic splitting algorithm to compare hypothesis testing to optimal solutions. In the semihierarchic algorithm, each splitting step is followed by a merging step to test whether a split chosen at an earlier stage has lost its importance after subsequent break points are identified (Hawkins 1976). At each splitting step, H_{o} is evaluated separately for all subsequences that occur between the apparent changepoints identified up to that stage. The subsequences are defined as 1 to *c*_{1}, *c*_{1} + 1 to *c*_{2}, etc., up to *c*_{K−1} + 1 to *n*. If H_{o} is rejected in any subsequence, that segment is split and *K* is incremented. In the merging step that follows each splitting step, H_{o} is evaluated for all subsequences that include only one of the *K* − 1 apparent changepoints. In this case, the segments are defined from 1 to *c*_{2}, from *c*_{1} + 1 to *c*_{3}, up to *c*_{K−2} + 1 to *n*. If H_{o} is not rejected in one of these subsequences, the apparent changepoint that is contained therein is removed and *K* is decremented. The process ends when no subsequence is split and no subsequences are merged on a pass through the full sequence. Although an improvement over strictly hierarchic solutions, this algorithm may not always converge to an optimal solution when *K* is greater than two and the changepoints are not hierarchic (Hawkins 2001) and/or occur close in time. In such circumstances, an optimal approach should have a higher power of detection.

### b. Quantification of detection skill

Measures that are used to verify forecasts *f* against observations *o* (e.g., Murphy and Winkler 1987) also may be applied to hypothesis testing (Stephenson 2000). In this case, “forecast” refers to the rejection or acceptance of the null hypothesis of homogeneity at each position in a series. “Observation” refers to the true, known occurrence or nonoccurrence of a simulated changepoint. The possible joint outcomes of changepoint detection (*f*) and occurrence (*o*) are represented with the subscript 1 for occurrence and 0 for nonoccurrence. The joint distribution of *f* and *o* then can be calculated using a 2 × 2 contingency table, containing counts of the four possible outcomes as shown in Table 1. The rate of type I (reject null hypothesis when it is true: a “false alarm” or “false positive”) and type II (fail to reject null hypothesis when it is false: a “miss” or “false negative”) errors can be calculated for each test statistic individually and for the “consensus” of multiple tests. The hit rate *H* measures the ratio of correctly classified changepoints to the total number of changepoints and is known as the *sensitivity*. Here, *H* and its counterpart, the false alarm rate *F*, are calculated as

*H* = *a*/(*a* + *c*) and *F* = *b*/(*b* + *d*),

where *a* = *f*_{1}, *o*_{1} (hit); *b* = *f*_{1}, *o*_{0} (false alarm); *c* = *f*_{0}, *o*_{1} (miss); and *d* = *f*_{0}, *o*_{0} (correct acceptance of H_{o}). The term hit rate is sometimes used to refer to the quantity (*a* + *d*)/*n* (the “percent correct”), while the false alarm rate or ratio (FAR), or false positive rate, will sometimes (e.g., Wilks 1995) refer to the quantity *b*/(*a* + *b*). In addition, the bias *B* is calculated here as the ratio of the number of H_{o} rejections to the number of simulated changepoints, that is, *B* = (*a* + *b*)/(*a* + *c*). When the base rate of event occurrence is much lower than the rate of nonoccurrence, skill scores like the Heidke Skill Score (HSS) are commonly used to adjust for the large number of correctly predicted nonevents. Because changepoints do not occur in a majority of years (or months), that is, the quantity *d* in Table 1 is much larger than *a* + *c*, a changepoint reasonably can be treated as a rare event. The HSS compares the proportion correct to a random no-skill forecast with the same base rate of event occurrence (Doswell et al. 1990; Stephenson 2000), and can be calculated as

HSS = 2(*ad* − *bc*)/[(*a* + *c*)(*c* + *d*) + (*a* + *b*)(*b* + *d*)].
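These contingency-table scores reduce to a few lines; the following sketch (function name ours) computes *H*, *F*, and the HSS from the Table 1 counts:

```python
def detection_scores(a, b, c, d):
    # a: hits, b: false alarms, c: misses, d: correct acceptances of Ho.
    H = a / (a + c)                       # hit rate (sensitivity)
    F = b / (b + d)                       # false alarm rate
    hss = 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return H, F, hss
```

A perfect classifier gives H = 1, F = 0, and HSS = 1; a no-skill classifier with the same base rate gives HSS = 0.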

The most likely position of a changepoint is where the test statistic reaches a maximum (or minimum in the case of the simple sum of squares). A hit is tallied when this maximum (or minimum) coincides with the true position of a simulated changepoint. If the test statistic exceeds the critical value, but the maximum (minimum) is not coincident with a simulated changepoint, it is counted as a false alarm. When the null hypothesis is not rejected in a sequence that contains a changepoint, a miss is recorded. As shown in Fig. 2, the time series of a test statistic may exceed the critical value across a range of locations around the true position of the changepoint, and the highest value is subject to some chance variation. Consequently, it may be desirable to qualify a rejection of the null hypothesis as a “hit” when the maximum in the test statistic occurs within one to a few time steps of its true position. Here, coincidence between tests was defined as ±2 time steps.
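The tallying rule just described might look like the following sketch (the ±2 window matches the text; names are ours):

```python
import numpy as np

def classify(stat, crit, true_pos=None, tol=2):
    # One outcome per series: the most likely changepoint is where the
    # statistic peaks; a hit requires the peak to fall within +/- tol
    # steps of the true (simulated) position.
    stat = np.asarray(stat, dtype=float)
    peak = int(np.argmax(stat)) + 1   # values before the split
    if stat.max() <= crit:
        return "miss" if true_pos is not None else "correct"
    if true_pos is not None and abs(peak - true_pos) <= tol:
        return "hit"
    return "false alarm"
```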

## 3. Reference series formulation and simulation of climate series

A good choice of reference series should capture the background climate signal that is common to the target and surrounding station series. Under the assumption that the composite reference series is at least approximately homogeneous, when a changepoint is revealed in a difference or ratio series formed between the target and its reference, the conclusion is that an artificial shift is present in the target series. However, an artificial shift in one or more of the *m* nearby station series that are used to form the reference may carry through to the series of differences or ratios, and the assumption of reference series homogeneity can be invalid. In that case, a changepoint in the reference may be erroneously attributed to the target series (see also Szentimrey 1999).

The first composite reference series formulation is based on a difference series {*Y*_{t}}, formed between observations at the target station and the average from nearby stations, calculated according to Alexandersson and Moberg (1997) as

*Y*_{t} = *y*_{t} − Σ_{j=1}^{m} *ρ*_{j}²(*x*_{jt} − *x̄*_{j} + *ȳ*) / Σ_{j=1}^{m} *ρ*_{j}²,  (14)

where *y*_{t} and *x*_{jt} are monthly or annual temperatures for the candidate and each of *m* neighboring stations, respectively, and *ρ*_{j} represents the correlation coefficient between observations at the candidate station and the *j*th of the *m* surrounding stations. The quantities with an overbar may be calendar monthly (e.g., Menne and Duchon 2002) or annual means over a series of length *n*. We refer to the term to the right of the minus sign in (14) as the anomaly-weighted average (ANWA) composite reference series.
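A sketch of the ANWA difference series, assuming the squared-correlation weighting of Alexandersson and Moberg (1997); the array layout and names are ours:

```python
import numpy as np

def anwa_difference(y, X, rho):
    # y: target series (length n); X: m x n neighbor series; rho:
    # correlations between the target and each neighbor. The reference
    # is the rho^2-weighted average of neighbor anomalies, offset by the
    # target mean; the function returns the target-minus-reference series.
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = np.asarray(rho, dtype=float) ** 2
    anom = X - X.mean(axis=1, keepdims=True) + y.mean()
    ref = (w[:, None] * anom).sum(axis=0) / w.sum()
    return y - ref
```

With neighbors identical to the target, the difference series is zero; a step in one neighbor leaks into the difference in proportion to that neighbor’s weight.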

Peterson and Easterling (1994) suggest using first-difference-filtered values of each series (i.e., where *Y*′_{t} = *y*_{t} − *y*_{t−1}) to calculate each *ρ*_{j}, to reduce the chance of making poor estimates of the magnitude of correlation between the candidate and neighboring series when one or both series contain a shift or trend. In their method, the first-difference correlation coefficients are used as weights to form an *average* first-difference series from the *m* nearby station series. We refer to this formulation as the first-difference-weighted average (FDWA) composite reference series. Using common weights and serially complete (i.e., no missing values) reference series components, the ANWA and FDWA composite reference series are exactly correlated and differ only by the offset that is used to convert the FDWA series back to raw averages.
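A sketch of the FDWA construction (correlation weights computed on first differences, re-accumulation, and an offset back to the raw level); the names and the choice of offset are ours:

```python
import numpy as np

def fdwa_reference(y, X):
    # Weights come from correlations of first-differenced series
    # (Peterson and Easterling 1994); the weighted-average difference
    # series is re-accumulated and offset back to the raw-average level.
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    dy = np.diff(y)
    dX = np.diff(X, axis=1)
    w = np.array([np.corrcoef(dy, dx)[0, 1] for dx in dX])
    d_ref = (w[:, None] * dX).sum(axis=0) / w.sum()
    ref = np.concatenate(([0.0], np.cumsum(d_ref)))
    return ref - ref.mean() + X.mean()
```

Note that the level of the result depends entirely on the accumulated differences plus a single offset, which is why any bias in an average first difference propagates to all later values.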

Vincent (1998) does not use a reference series per se. Rather, the residuals *e*_{t} (*e*_{t} = *Y*_{t} − *Ŷ*_{t}) from a multiple linear regression (MLR) equation, using observations from neighboring stations to estimate values at the candidate station, are examined for evidence of changepoints using either the Durbin–Watson or lag-1 test for serial correlation. In the case of identifying a step change or artificial trend, the null hypothesis of serial independence in the residuals is evaluated against the alternative hypothesis that they are consistent with a first-order autoregressive process (Wilks 1995; Durbin and Watson 1950, 1951, 1971). A step or trend in the target series will tend to cause serial correlation in the regression residuals. When the value of the test statistic is sufficient to reject the null hypothesis of uncorrelated residuals, a binary variable is introduced iteratively at each series position to separate the multiple linear regression estimates into all combinations of two phases. The changepoint position that minimizes the pooled residual sum of squares (SSE_{FULL}) about the two phases is considered to be the most likely break point. The relative performance of these three formulations of reference series was evaluated by controlling for the test statistic, whereby each reference was paired with each test as shown in Table 2.
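The residual diagnostic can be sketched as a least-squares fit followed by a lag-1 autocorrelation check (the Durbin–Watson variant is omitted here; names are ours):

```python
import numpy as np

def mlr_residuals(y, X):
    # e_t = Y_t - Yhat_t from a least-squares regression of the target
    # on the m neighbor series (rows of X) plus an intercept.
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float).T])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def lag1_autocorrelation(e):
    # r1 well above zero suggests an unmodeled step or trend in the
    # target series (Vincent 1998).
    e = np.asarray(e, dtype=float) - np.mean(e)
    return float((e[:-1] * e[1:]).sum() / (e ** 2).sum())
```

A homogeneous target yields residuals with near-zero lag-1 autocorrelation; an unmodeled step pushes it toward one.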

### a. Simulation of climate series

Serially complete climate series were simulated as first-order autoregressive [AR(1)] processes of the form

*x*_{t} − *μ* = *ϕ*(*x*_{t−1} − *μ*) + *ε*_{t},

where *μ* is the mean of the time series (in this case 0), *ϕ* is the autoregressive parameter, and *ε*_{t} is a random error component (Wilks 1995). For each realization (*n* = 100), the autoregressive parameter *ϕ* was randomly selected from a sample distribution of observed lag-1 (1 yr) autocorrelation coefficients that are calculated using the time series of mean annual temperatures from stations in the United States Historical Climatology Network (USHCN; Karl et al. 1990). The values in each AR(1) series, though approximately standard normal, then were restandardized. To create groups of cross-correlated series, a constant of 2.0 times a random cross-correlation coefficient, also drawn from observed values, was added a total of (*m* + 1) times to each of the original 1000 series (*m* = 5 “neighboring” series plus the target). Each of the (*m* + 1) series in a group is formed, therefore, from the same “parent” series, which is not used itself. Because the target (candidate) and reference component (neighbor) series are all “sibling” series, each has approximately the same degree of cross correlation, on average, with every other series in its group.
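A sketch of the simulation scheme as we read it: independent AR(1) “siblings” share a common scaled “parent” component. The 2.0 scaling follows the text, while other choices (fixed *ϕ* and cross-correlation values, the function names) are illustrative assumptions of ours:

```python
import numpy as np

def simulate_group(n=100, m=5, phi=0.3, rho=0.6, seed=0):
    rng = np.random.default_rng(seed)

    def ar1():
        # AR(1): x_t - mu = phi*(x_{t-1} - mu) + eps_t with mu = 0 and
        # innovations scaled for approximately unit marginal variance.
        x = np.empty(n)
        x[0] = rng.standard_normal()
        eps = rng.standard_normal(n) * np.sqrt(1.0 - phi ** 2)
        for t in range(1, n):
            x[t] = phi * x[t - 1] + eps[t]
        return x

    parent = ar1()                        # shared component, not used itself
    group = []
    for _ in range(m + 1):                # target plus m neighbors
        s = ar1() + 2.0 * rho * parent    # common part -> cross correlation
        group.append((s - s.mean()) / s.std())
    return group[0], np.array(group[1:])  # target, m x n neighbor array
```

Because every series mixes its own AR(1) noise with the same parent, the pairwise correlations within a group are roughly equal by construction.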

### b. Addition of random changepoints

Detection results are based on time series that contain zero, one, two, or a variable number of changepoints in the combinations shown in Table 3. The amplitude of each simulated changepoint was selected at random from the standard normal distribution with no restrictions. As shown in Fig. 3, the standard normal distribution is a reasonable proxy for the distribution of known changepoints in the USHCN (expressed in standardized form). The simulated changepoint position was allowed to vary randomly. It should be noted, however, that when two changepoints are separated in time by no more than a few time steps, a changepoint detection algorithm may identify only one changepoint that is, in effect, an amalgam of the two nearby changepoints. If the two are of comparable amplitude but opposite in sign, neither changepoint may be detected. On the other hand, if the changepoints are of disparate amplitudes, the larger shift may eclipse the smaller. To avoid sorting out the impact of these confounding scenarios on measures of detection skill, the results presented below are based on simulated shifts separated in time by no fewer than five positions in a sequence. In practice, however, nearby changepoints are a distinct possibility, especially in the analysis of annual values. Figure 2 shows a realization of a target/neighbor series from case 2a, including the values of each reference series formulation and test statistic at the first splitting step.
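The changepoint injection described above, including the five-step minimum separation, can be sketched as follows (names are ours):

```python
import numpy as np

def add_changepoints(y, k, min_sep=5, seed=0):
    # Add k step changes with standard-normal amplitudes at random
    # positions at least min_sep steps apart (and away from the ends).
    rng = np.random.default_rng(seed)
    n = len(y)
    while True:
        pos = np.sort(rng.integers(min_sep, n - min_sep, size=k))
        if k < 2 or np.all(np.diff(pos) >= min_sep):
            break
    out = np.asarray(y, dtype=float).copy()
    for p in pos:
        out[p:] += rng.standard_normal()   # unrestricted amplitude
    return out, pos
```

Rejection sampling on the positions is a simple way to enforce the separation constraint; each accepted position marks the first value at the new level.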

## 4. Results

In practice, the number of true changepoints in a climate series is unknown. Moreover, the presence of multiple shifts can sometimes suppress the magnitude of the test statistics near each true break point to such a degree that none exceeds its critical value. In these situations, the first split can be made at the position where the test statistic reaches a maximum without regard to the significance level (0.05 is used here) to avoid the possibility that the series will be overlooked completely when there is a complicated multi-break-point configuration. Of course, it is generally not known in advance that such a situation exists so the first split must be made in all series, which necessitates the merging steps in the semihierarchic method. It is worth pointing out, however, that the number of H_{o} rejections using hierarchic binary splitting in series where the null hypothesis is true is greater relative to the expected value in a test for a single changepoint (no splitting). This is because when the critical value is exceeded somewhere in a sequence, a split is made at the point where the test statistic reaches a maximum. At that point, the subsequences on either side of the split are evaluated separately, each having some probability of type I error as in the case of testing the full series. Inclusion of the merging steps will rejoin some of the “false splits” and reduce the number of type I errors, but the number may, nevertheless, be larger than the expected value.

The consequence of the larger number of hypothesis tests can be seen in Table 4, where changepoint detection is summarized for case 1 (null hypothesis always true). For most of the nine test statistic/reference pairs, the number of false alarms is higher than the expected value of 5% (not all combinations are shown). On the other hand, when H_{o} is true, the null hypothesis is rarely rejected at the same position in a series (±2 time steps) by more than one test so the false alarm rate when agreement between tests is required is less than the expected value for one test, suggesting some independence between tests.

The detection results summarized for cases 2 through 6 (Tables 5–9) indicate that, apart from the optimal algorithm, the likelihood ratio statistic is generally the most sensitive of the three calculated statistics (cf. the single-test hit rates *H* for cases 2, 4, 5b, and 6). The superior sensitivity, however, comes at the price of a larger number of false alarms, especially when changepoints are present in the reference series components (cases 3, 4, 5b, and 6), but even when the reference series is truly homogeneous (case 2). In the case 2 simulations, as in case 1, a large reduction in type I errors occurs when a consensus is required that includes at least two different test statistics. Likewise, in case 2 changepoint detection is essentially the same whichever composite reference series formulation is paired with the likelihood ratio test. In fact, results from pairing all three reference formulations with a common statistic (all combinations not shown) suggest that the choice of test statistic is more important than choice of reference series formulation when the reference series components are serially complete and homogeneous. In that case, the ANWA and the FDWA reference series are identical because they differ only by an offset.

The impact of changepoints in reference component series is illustrated in Table 6 (case 3). Because in these realizations each reference component contains either 1 (case 3a) or 2 (case 3b) changepoints, a composite reference will incorporate 5 (case 3a) or 10 (case 3b) changepoints of various amplitudes and locations. The number of false alarms using the likelihood ratio test and the ANWA or FDWA composite reference increases from just over 100 in the null case (case 1) to over 700 in case 3a and over 1000 in case 3b. Similarly, the ANWA and FDWA composite references paired with the two-phase regression test statistic show a fourfold or better increase in the number of false alarms. In contrast, paired with the MLR reference series, the likelihood ratio and two-phase regression tests have less than half the number of false alarms (not shown), and the increase over the sample of “ideal” reference series (case 1, 2, or 5a) is, therefore, much smaller. The MLR–lag 1 combination produced the smallest number of false alarms for a single reference series–test statistic pair.

It appears that a step change in a reference component series of anomalies or raw values will reduce the magnitude of the series coefficient in the MLR equation, and, therefore, its weight, effectively filtering the impact of the step changes in the composite. On the other hand, using first-difference-filtered series to calculate truer correlation-based weights when artificial break points may be present helps to ensure that step changes in the component series will carry through to the composite by minimizing the impact of a step change on the correlation coefficients. Nevertheless, the value of a consensus result is especially evident in case 3 from the large reduction in false alarms linked to nonhomogeneities in the composite reference series.

In case 4, when all series contain one or two changepoints, the number of false alarms can approach, or, in the case of ANWA–*T*_{max}, even exceed the number of hits. As in other cases, the advantage to using a consensus result is apparent from the large reduction in false alarms relative to most single tests. Unfortunately, no consensus combination of reference series–test statistic pairs clearly stands out as the more skillful because many appear to optimize test sensitivity while others produce the fewest false alarms. Nevertheless, the pairing of MLR–*T*_{max} forms a good combination with many other reference series–test statistic pairs because this pairing filters the impact of changepoints in the reference series components while retaining much of the test sensitivity. However, skillful consensus detection with this reference series–test statistic pair is possible only when none of the other pairings includes the MLR reference series because its use has a large impact on test sensitivity with all of the test statistics.

Based on the HSS, a consensus of any two of three tests is generally more skillful than agreement between a single pair of reference series–test statistic combinations. In fact, a consensus of any two or three reference series–test statistic pairs is more skillful than the use of either any two of four, or three of five pairs, etc. This is because a consensus of a large number of reference–test combinations will maximize both the number of coincident hits and the number of coincident false alarms. Because there is probably more independence between tests in terms of the position of false alarms, the small gain in hits using an agreement between, say, any two of four over any two of three tests is more than offset by the gain in the number of consensus false alarms. In case 2, the highest skill scores for any 2 of 3 of the 84 reference series–test statistic pairings are those paired combinations that include all three test statistics. In case 3, it is for pairings that include only the MLR reference.

In case 5a and 5b, which are comprised of simulations with randomly censored values, results are similar to the analogous cases 1 and 4a, respectively, with one exception: all test statistics that are paired with the FDWA show a large increase in false alarm numbers relative to the serially complete counterpart scenarios (cf. e.g., the false alarm column in Tables 4 and 8). This large increase in pairings that include the FDWA appears to be caused by random walks introduced into the FDWA series that are a consequence of biased estimates of the average first difference when one or more of a component’s series values are censored (missing) at various positions. Such biased estimates are unavoidable when values are missing, and they also impact the ANWA reference series, but in that case cause only a small increase in false alarms. In the case of the FDWA, however, a biased estimate at one position in a series will cause all subsequent composite averages (or, in this case, working backward, all earlier averages) to exhibit the same bias. If there are missing values in the various reference series at different positions scattered throughout the summary period, the combination can cause a random walk, rather than a simple step change, the range of which may be large (e.g., one standard deviation), as shown in Fig. 4. When the FDWA composite reference series is used to form a difference (or ratio) with the target series, the difference series will incorporate the characteristics of a step change or random walk in the reference and lead to a large increase in false alarms relative to that based on serially complete data or other reference series formulation. Thus, the averaging of first difference series should be avoided when serially incomplete values or a changing station mix must be used. 
In addition, the potential for a biased estimate using the ANWA or FDWA formulation will differ according to the relative magnitude of the field variance of anomalies versus the field variance of first differences (interannual variability).
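
The random-walk mechanism can be reproduced with a toy calculation. The sketch below is illustrative only — two unweighted hypothetical component series rather than the correlation-weighted five-station composites used in the simulations. A single censored value biases the mean first difference at the two steps that touch it, so every later value of the cumulated FDWA-style composite is offset, whereas a direct ANWA-style average of the available raw values confines the error to the gap itself:

```python
import numpy as np

n = 20
comp_a = np.zeros(n)                     # flat component series
comp_b = 0.1 * np.arange(n)              # trending component series
truth = (comp_a + comp_b) / 2.0          # true unweighted composite

obs = np.vstack([comp_a, comp_b])
obs[1, 10] = np.nan                      # censor one value in component B

fd = np.diff(obs, axis=1)                # NaN at steps 9->10 and 10->11 for B
mean_fd = np.nanmean(fd, axis=0)         # uses only component A at those steps
# Convert the averaged first differences back to a raw-value composite
fdwa = np.concatenate(([truth[0]], truth[0] + np.cumsum(mean_fd)))
err = fdwa - truth                       # zero before the gap, -0.10 after it

# ANWA-style composite: average the available raw values directly
anwa = np.nanmean(obs, axis=0)           # error confined to the censored point
```

With several gaps scattered across different components and positions, these persistent offsets accumulate into the random walk described above.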

Not surprisingly, a comparison of detection results for simulations containing one shift with those containing two suggests a general reduction in the hit rate when more than one undocumented changepoint occurs in a series. However, in the simulated scenarios with a maximum of two changepoints, the number of false alarms increases proportionately less than the number of hits when there are two shifts rather than one, so the skill of detection (the HSS) is not necessarily greatly reduced. The disproportionate change in the number of false alarms relative to hits is reflected by the reduction in bias (*B*) when there are two changepoints instead of one; compare, for example, case 2a to 2b (Table 5) or case 4a to 4b (Table 7). In some reference series–test statistic pairings, the HSS is essentially equivalent in scenarios with one and two breaks, especially in pairings that include the likelihood ratio test.

In general, the optimal solution using the method defined in (8) is more sensitive than successive hypothesis tests, especially in case 7. The case 7 scenario is precisely the kind of situation in which the optimal solution should be superior because the imposed changepoints are not hierarchic and have equal amplitudes but opposite signs at positions 70 and 75. Nevertheless, while the optimal solution has a higher hit rate than any single hypothesis test, the best any-two-of-three consensus of test statistic–reference series pairings also has a very high hit rate. Given that a consensus of successive hypothesis tests produces many fewer false alarms than the optimal solution, the skill of the consensus is nearly identical to that of the optimal solution in case 7 and higher in the other cases.

Because the number of false alarms in the optimal solution, expressed in Tables 4–9 as the total number of false alarms over the number of target series, is high, a different penalty function might be used to reduce this total. Caussinus and Mestre (2004), for example, specified a penalty function for the same type of optimal solution that, in contrast to the methods of Akaike (1974) and Schwarz (1978), did not produce an excessive number of changepoints. Figure 5 shows a histogram of the number of detected changepoints by position using the solution provided by the best two of three test statistic–reference series pairings. A comparison of Fig. 5 with the similar histogram in Table 1 of Caussinus and Mestre (2004) indicates that the consensus result of successive hypothesis tests is more sensitive than their penalized approach, while at the same time limiting the number of false alarms. Thus, successive hypothesis testing using multiple tests may be a reasonable alternative to optimal solutions, even in complicated multiple-changepoint scenarios. Moreover, in the most realistic of the changepoint scenarios, case 6, the optimal hit rate is not as high as that of the ANWA–*T*_{max} combination when a semihierarchic splitting algorithm is used for the hypothesis test.
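
As a generic illustration of the optimal-solution approach, a penalized-contrast segmentation of the kind introduced by Lavielle (1998) can be written as a short dynamic program. The sketch below minimizes the within-segment sum of squares plus a fixed penalty `beta` per additional segment; it is not the specific penalty function of (8) or of Caussinus and Mestre (2004):

```python
import numpy as np

def optimal_segmentation(x, beta):
    """Optimal changepoint positions minimizing total within-segment
    sum of squares plus beta per additional segment (dynamic program)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    p1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums
    p2 = np.concatenate(([0.0], np.cumsum(x * x)))    # prefix sums of squares

    def sse(i, j):  # SSE of a constant-mean fit to x[i:j+1]
        s = p1[j + 1] - p1[i]
        return (p2[j + 1] - p2[i]) - s * s / (j - i + 1)

    best = np.full(n + 1, np.inf)   # best[j]: optimal cost of x[0:j]
    back = np.zeros(n + 1, dtype=int)
    best[0] = -beta                 # so a k-segment fit costs SSE + (k - 1) * beta
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + sse(i, j - 1) + beta
            if c < best[j]:
                best[j] = c
                back[j] = i
    cps, j = [], n                  # trace back the segment boundaries
    while j > 0:
        i = int(back[j])
        if i > 0:
            cps.append(i)
        j = i
    return sorted(cps)
```

Larger values of `beta` act like a stricter penalty function: they suppress marginal segments and therefore reduce the number of detected (and falsely detected) changepoints.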

## 5. Discussion and conclusions

The quantification of detection skill using the Monte Carlo case studies indicates that the likelihood ratio test is the most sensitive of the three successive hypothesis test statistics in all but one of the simulated scenarios. As a result, it is also the most sensitive to changepoints in the reference series components and, thus, has both a higher probability of detection and a higher probability of false detection. The higher sensitivity of the likelihood ratio test is not surprising given that the assumption of no slope in the form of the test used here was met perfectly by the Monte Carlo simulations. The assumption that there is no local trend may be reasonable in many situations but should nevertheless be evaluated in practice. Wang (2003), arguing from the standpoint of sampling variability, noted that the sensitivity of the two-phase regression test can be increased, especially in short segments, by using a common slope parameter for the two phases. By eliminating the slope parameter altogether, the sensitivity of the two-phase regression test becomes equivalent to that of the zero-slope version of the likelihood ratio test, and there is no benefit to including both zero-slope test models in multiple testing.
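
As a concrete example of a zero-slope step-change statistic, the sketch below scans a series (in practice, a target-minus-reference difference series) for the most likely single changepoint using the maximal two-sample *t* statistic, the quantity at the heart of likelihood-ratio-type tests for a shift in mean. The exact formulation and critical values used here are generic, not necessarily those of the paper:

```python
import math

def max_t_step_test(x):
    """Return (changepoint, statistic): the split position maximizing the
    two-sample t statistic for a difference in segment means (zero slope)."""
    n = len(x)
    best_c, best_t = None, 0.0
    for c in range(2, n - 1):              # at least two points per segment
        left, right = x[:c], x[c:]
        m1 = sum(left) / len(left)
        m2 = sum(right) / len(right)
        sse = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
        if sse == 0.0:                     # both segments constant: perfect split
            return c, float("inf")
        s = math.sqrt(sse / (n - 2))       # pooled standard deviation
        t = abs(m1 - m2) / (s * math.sqrt(1.0 / len(left) + 1.0 / len(right)))
        if t > best_t:
            best_c, best_t = c, t
    return best_c, best_t
```

The position maximizing the statistic is the candidate changepoint; in a successive (splitting) application, the test is then reapplied to each resulting segment.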

Even when no trends or trend changes are anticipated, there are step-change configurations in which allowing for trend changes may vastly increase step-change detection sensitivity. This was illustrated by the case 7 detection results and in a simulation by Easterling and Peterson (1995), who imposed simulated changepoints with equal amplitudes but opposite signs 10 positions apart. Inclusion of the two-phase regression model with a slope parameter greatly increases the likelihood of finding such "temporary" jumps relative to a zero-slope test model because the apparent change in trend near the step is frequently sufficient to reject the null hypothesis of a one-phase segment. In case 7, where the step changes are not hierarchic, the semihierarchic splitting algorithm that seeks to resolve only changes in mean often failed to converge on the optimal solution. Even in case 7, however, detection using successive hypothesis testing is improved by the use of multiple tests, and the consensus skill is comparable to that of an optimal solution. In arguably the most realistic of the simulations, case 6, the consensus of successive hypothesis tests can be more skillful at undocumented changepoint detection than an optimal solution because it limits the number of false alarms without reducing sensitivity too much. Thus, successive hypothesis testing may be preferable in situations where intervention in the result of an optimal algorithm is impractical.
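
The gain from allowing a slope can be seen in a standard two-phase regression statistic of the general type revised by Lund and Reeves (2002). The sketch below fits separate straight lines to the two phases, compares the fit against a single-line null, and maximizes the resulting F statistic over candidate changepoints; it is a generic textbook form, not the paper's exact test or critical-value procedure:

```python
import numpy as np

def two_phase_f(x):
    """Maximal F statistic for a two-phase linear regression (separate
    intercept and slope in each phase) against a single-line null fit."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(n, dtype=float)

    def line_sse(tt, xx):
        coef = np.polyfit(tt, xx, 1)               # degree-1 least squares fit
        return float(np.sum((xx - np.polyval(coef, tt)) ** 2))

    sse_red = line_sse(t, x)                       # one-phase (null) model
    best_c, best_f = None, 0.0
    for c in range(3, n - 2):                      # at least three points per phase
        sse_full = line_sse(t[:c], x[:c]) + line_sse(t[c:], x[c:])
        # 2 extra parameters in the full model; n - 4 residual df
        f = ((sse_red - sse_full) / 2.0) / (sse_full / (n - 4))
        if f > best_f:
            best_c, best_f = c, f
    return best_c, best_f
```

Applied to a step that soon reverses, as in case 7, the sloped phases can absorb the apparent local trend and still reject the one-phase null near the step, which a zero-slope model may miss.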

The comparison of various combinations of test statistics and composite reference series formulations suggests that, for reasonably well correlated time series, the choice of reference series formulation has relatively little impact on target series changepoint detection skill, provided that the reference component series are homogeneous. Under such circumstances, which are probably rare in practice, the choice of test statistic has the greater impact. When reference series components contain changepoints, and/or values are missing, the choice of reference series formulation has more important implications for changepoint detection. Step changes in the various component series are more readily transferred to the composite reference when first-difference-filtered climate series are used to calculate truer correlation-based weights, which increases the likelihood that heterogeneities in the composite reference will be erroneously identified as changepoints in the target series. On the other hand, a multiple linear regression or a non-first-difference correlation-weighted reference series will tend to reduce the impact of step changes on the composite reference. To confound the problem, an analyst risks weighting most heavily those station series that contain similar artificial breaks when anomaly or raw value correlation weights are used, which reduces changepoint detection sensitivity. This is a pervasive problem when coincident or nearly coincident network-wide practice changes are imposed.
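
The weighting distinction can be made concrete with a small sketch. Here the composite reference is a squared-correlation-weighted average of component series, with the correlations computed either on first-difference-filtered series or on the raw values. The weighting scheme is illustrative, in the general spirit of Peterson and Easterling (1994), rather than the paper's exact formulation:

```python
import numpy as np

def composite_reference(target, components, first_difference_weights=True):
    """Correlation-weighted average of neighboring component series.

    Weights are squared correlations between the target and each component,
    computed on first differences (less affected by shared steps and trends
    in the raw values) or on the raw values themselves.
    """
    t = np.asarray(target, dtype=float)
    comps = np.asarray(components, dtype=float)
    if first_difference_weights:
        r = np.array([np.corrcoef(np.diff(t), np.diff(c))[0, 1] for c in comps])
    else:
        r = np.array([np.corrcoef(t, c)[0, 1] for c in comps])
    w = r ** 2
    return (w / w.sum()) @ comps   # normalized weighted average
```

Because correlation is unaffected by a constant offset, a uniformly biased neighbor still receives full weight, and its bias simply shifts the composite.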

The first-difference composite reference has been advocated to facilitate changepoint detection in shorter, incomplete series for which anomaly calculation using a common base period is problematic (Peterson et al. 1998b). Moreover, the removal of spurious first differences in reference series components, presumably caused by step changes, has also been recommended prior to computing the average (Peterson and Easterling 1994). The results of this analysis suggest that composite first differences should be avoided if values are missing or removed from one or more component series or, more generally, when the composition of the component series changes through time. In such circumstances, the averaging of first differences introduces step changes or random walks when the series is converted back to a raw value average. Random walks and spurious steps increase the number of false alarms if this form of composite reference is subtracted from a target series and may lead to erroneous conclusions about the nature of the background climate signal in a region when only a small number of reference component series is available. To avoid such artifacts in first-difference-based reference calculations, only serially complete segments should be used. In that case, however, the average first-difference series converted back to raw observations is exactly correlated with a similarly weighted average anomaly or raw value series, and there is no advantage to using first differences. If missing values are estimated, the estimation error will still cascade through the average first-difference series and can lead to the same type of random walks.

A principal benefit of a multitest consensus, in addition to improved detection when changepoints are not hierarchic, occurs when the composite reference series is not homogeneous, because there appears to be greater independence between tests in the occurrence of false alarms than in detected changepoints. A consensus of a large number of test statistic–reference series pairs, however, maximizes the number of false alarms, while the consensus of only two test statistic–reference series pairs limits detection to the least sensitive pairing. Consequently, the most skillful consensus appears to be any two of three test statistic–composite reference series pairs. Using the agreement between any two of three tests, detection skill in simulations where reference series components contain changepoints (case 4) is comparable to that in the perfectly homogeneous reference series case with a single test (case 2). The reference series–test statistic combinations that are most appropriate for a particular evaluation of nonclimatic changepoints may depend, in large part, on the relative priority of reducing false alarms versus avoiding misses. If a climate series is to be adjusted for undocumented changepoints, then the reduction of false alarms may be critical, and multiple linear regression or non-first-difference-based correlation weights should be used in at least one of the three reference series formulations. If sensitivity is critical, then multiple linear regression or non-first-difference-based weighting should be used in, at most, one of the reference series formulations.
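
A minimal sketch of the any-two-of-three vote is given below. The windowed matching of nearby hits (here, within two positions) is our assumption for illustration; the simulations' exact coincidence criterion may differ:

```python
def consensus_changepoints(detections, min_votes=2, window=2):
    """Positions flagged by at least `min_votes` of the test statistic-
    reference series pairs; hits within `window` positions of one another
    are counted as agreeing on the same changepoint."""
    votes = {}
    for pair_hits in detections:
        for p in pair_hits:
            # credit the vote to an already-seen nearby position, if any
            key = next((q for q in votes if abs(q - p) <= window), p)
            votes[key] = votes.get(key, 0) + 1
    return sorted(p for p, v in votes.items() if v >= min_votes)
```

For example, three pairings reporting hits at [30, 70], [31], and [69, 90] yield consensus changepoints near positions 30 and 70, while the unseconded hit at 90 is discarded.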

Even with a consensus approach, however, the proportion of detected changepoints that are false (the FAR) remains substantial: over 25%, for example, in the most realistic simulations (case 6). Alternative strategies, therefore, are likely required to reduce the false alarm rate in real-world applications. Such strategies may include increasing the number of reference series components used to form the average (Li et al. 2005, manuscript submitted to *J. Geophys. Res.*) or an iterative recalculation of the composite reference in which the target series is the only series common to all calculated difference series (Szentimrey 1999). False alarms that are linked to changepoints in the composite reference series may also be avoided through a pairwise comparison of climate series (Jones et al. 1986; Menne and Duchon 2001; Caussinus and Mestre 2004). In a pairwise approach, the concepts of target and reference lose their meaning, and the offending series may be more readily identified. Detection skill based on this approach to undocumented changepoint analysis will be discussed in a forthcoming paper.

## Acknowledgments

The authors wish to thank Dr. Thomas C. Peterson for bringing recent climate changepoint detection work, summarized in the WMO publication “Guidance on metadata and homogenization” (WMO/TD 1186), to our attention. Special thanks also to Dr. Xiaolan Wang, Tressa Fowler, and two anonymous reviewers whose thoughtful and constructive comments greatly improved this manuscript. Algorithms for solving the penalized contrast function were provided by Dr. Marc Lavielle (http://www.math.u-psud.fr/~lavielle/programs/index.html). Partial support for this work was provided by the Office of Biological and Environmental Research, U.S. Department of Energy.

## REFERENCES

Akaike, H., 1974: A new look at the statistical model identification. *IEEE Trans. Autom. Control*, **19**, 716–723.

Alexandersson, H., 1986: A homogeneity test applied to precipitation data. *J. Climatol.*, **6**, 661–675.

Alexandersson, H., and A. Moberg, 1997: Homogenization of Swedish temperature data. Part I: Homogeneity test for linear trends. *Int. J. Climatol.*, **17**, 25–34.

Caussinus, H., and O. Mestre, 2004: Detection and correction of artificial shifts in climate series. *J. Roy. Stat. Soc. Ser. C*, **53**, 405–425.

Conrad, V., and C. Pollack, 1962: *Methods in Climatology*. Harvard University Press, 459 pp.

Doswell III, C. A., R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. *Wea. Forecasting*, **5**, 576–585.

Ducré-Robitaille, J.-F., L. A. Vincent, and G. Boulet, 2003: Comparison of techniques for detection of discontinuities in temperature series. *Int. J. Climatol.*, **23**, 1087–1101.

Durbin, J., and G. S. Watson, 1950: Testing for serial correlation in least squares regression. I. *Biometrika*, **37**, 409–428.

Durbin, J., and G. S. Watson, 1951: Testing for serial correlation in least squares regression. II. *Biometrika*, **38**, 159–178.

Durbin, J., and G. S. Watson, 1971: Testing for serial correlation in least squares regression. III. *Biometrika*, **58**, 1–19.

Easterling, D. R., and T. C. Peterson, 1995: A new method for detecting undocumented discontinuities in climatological time series. *Int. J. Climatol.*, **15**, 369–377.

Easterling, D. R., T. C. Peterson, and T. R. Karl, 1996: On the development and use of homogenized climate datasets. *J. Climate*, **9**, 1429–1434.

Hawkins, D. M., 1976: Point estimation of the parameters of a piecewise regression model. *Appl. Stat.*, **25**, 51–57.

Hawkins, D. M., 1977: Testing a sequence of observations for a shift in location. *J. Amer. Stat. Assoc.*, **72**, 180–186.

Hawkins, D. M., 2001: Fitting multiple change-points to data. *Comput. Stat. Data Anal.*, **37**, 323–341.

Jones, P. D., S. C. B. Raper, R. S. Bradley, H. F. Diaz, P. M. Kelly, and T. M. L. Wigley, 1986: Northern Hemisphere surface air temperature variations: 1851–1984. *J. Climate Appl. Meteor.*, **25**, 161–179.

Karl, T. R., and C. N. Williams Jr., 1987: An approach to adjusting climatological time series for discontinuous inhomogeneities. *J. Climate Appl. Meteor.*, **26**, 1744–1763.

Karl, T. R., C. N. Williams Jr., F. T. Quinlan, and T. A. Boden, 1990: United States Historical Climatology Network (HCN) serial temperature and precipitation data. Oak Ridge National Laboratory, Carbon Dioxide Information and Analysis Center, Environmental Science Division Publication No. 3404, 389 pp.

Lavielle, M., 1998: Optimal segmentation of random processes. *IEEE Trans. Signal Process.*, **46**, 1365–1373.

Lund, R., and J. Reeves, 2002: Detection of undocumented changepoints: A revision of the two-phase regression model. *J. Climate*, **15**, 2547–2554.

Maronna, R., and V. J. Yohai, 1978: A bivariate test for the detection of a systematic change in mean. *J. Amer. Stat. Assoc.*, **73**, 640–645.

Menne, M. J., and C. E. Duchon, 2001: A method for monthly detection of inhomogeneities and errors in daily maximum and minimum temperatures. *J. Atmos. Oceanic Technol.*, **18**, 1136–1149.

Menne, M. J., and C. E. Duchon, 2002: Quality assurance of monthly temperature data at the National Climatic Data Center. Preprints, *13th Conf. on Applied Climatology*, Portland, OR, Amer. Meteor. Soc., 18–21.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338.

Peterson, T. C., and D. R. Easterling, 1994: Creation of homogeneous composite climatological reference series. *Int. J. Climatol.*, **14**, 671–679.

Peterson, T. C., and Coauthors, 1998a: Homogeneity adjustments of in situ atmospheric climate data: A review. *Int. J. Climatol.*, **18**, 1493–1517.

Peterson, T. C., T. R. Karl, P. F. Jamason, R. Knight, and D. R. Easterling, 1998b: First difference method: Maximizing station density for the calculation of the long-term global temperature change. *J. Geophys. Res.*, **103** (D20), 25967–25974.

Potter, K. W., 1981: Illustration of a new test for detecting a shift in mean in precipitation series. *Mon. Wea. Rev.*, **109**, 2040–2045.

Schwarz, G., 1978: Estimating the dimension of a model. *Ann. Stat.*, **6**, 461–464.

Solow, A. R., 1987: Testing for climate change: An application of the two-phase regression model. *J. Climate Appl. Meteor.*, **26**, 1401–1405.

Stephenson, D. B., 2000: Use of the "odds ratio" for diagnosing forecast skill. *Wea. Forecasting*, **15**, 221–232.

Szentimrey, T., 1999: Multiple analyses of series for homogenization (MASH). *Proc. Second Seminar for Homogenization of Surface Climatological Data*, Budapest, Hungary, WMO, WMO-TD-962, 27–46.

Vincent, L. A., 1998: A technique for the identification of inhomogeneities in Canadian temperature series. *J. Climate*, **11**, 1094–1104.

Wang, X. L., 2003: Comments on "Detection of undocumented changepoints: A revision of the two-phase model." *J. Climate*, **16**, 3383–3385.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. Academic Press, 467 pp.

Example of Monte Carlo simulation with one changepoint in the target series (in bold) and none in the composite reference series components (case 2a). (a) Target (candidate) series and five correlated composite reference component ("neighboring") series; (b) target and ANWA composite reference series (correlation-weighted average of the five reference component series); (c) difference between candidate and composite reference series and residuals from the multiple linear regression prediction of the target series; (d) series of *T*_{c} and *F*_{c} statistics and residual sum of squares (SSE_{FULL}) for the two-phase model.

Citation: Journal of Climate 18, 20; 10.1175/JCLI3524.1

Distribution of the estimated amplitude of step changes at known risks of artificial changepoints in the USHCN. Step-change amplitude is expressed in standardized form.

Citation: Journal of Climate 18, 20; 10.1175/JCLI3524.1

Fig. 4. (a) Example of FDWA and ANWA composite reference series with randomly censored values in the five component series; (b) difference between the true value of the ANWA and FDWA composite reference series and its estimate using component series with censored values (from case 5a).

Citation: Journal of Climate 18, 20; 10.1175/JCLI3524.1

Fig. 5. Histogram of the number of detected changepoints by position for case 7. Detection is from the best consensus result using three test statistic–reference series pairs (agreement between any two of three) as indicated in Table 9.

Citation: Journal of Climate 18, 20; 10.1175/JCLI3524.1

Table 1. Contingency table for the detection of undocumented changepoints. The null hypothesis for each test is series homogeneity (no changepoint).

Table 2. Matrix of possible composite reference series formulation and test statistic pairings.

Table 3. Number of added changepoints in each target–reference component series group used in five Monte Carlo case studies. Each case comprises 1000 simulated series groups.

Table 4. Skill scores for simulations with no changepoints in the candidate or in the reference component series (case 1).

Table 5. Skill scores for simulations with one (case 2a) or two (case 2b) changepoints in the candidate and none in the reference series components.

Table 6. Skill scores for simulations with no changepoints in the candidate and one (case 3a) or two (case 3b) changepoints in each reference series component.

Table 7. Skill scores for simulations with one (case 4a) or two (case 4b) changepoints in the candidate and one (case 4a) or two (case 4b) changepoints in each reference series component.

Table 8. Skill scores for simulations with one, two, or five consecutive missing values at random positions but no changepoints in the candidate or reference series components (case 5a), or with missing values and one changepoint in the candidate and one changepoint in each reference series component (case 5b).

Table 9. Skill scores for simulations with zero to six changepoints of random amplitude and position (case 6) and with six changepoints of fixed amplitude and position (case 7). In case 7, changepoints with an amplitude of 2.0 were added or subtracted as in Caussinus and Mestre (2004), i.e., +2.0 at position 20, +2.0 at position 40, −2.0 at position 50, −2.0 at position 70, +2.0 at position 75, and +2.0 at position 85.