## 1. Introduction

The accuracy and homogeneity of climate data are indispensable for many aspects of climate research. In particular, a realistic and reliable assessment of historical climate trends and variability is hardly possible without a long-term, homogeneous time series of climate data. Therefore, it is unfortunate that many kinds of changes (instrument/observer changes, station location/exposure changes, observing practices/procedures changes, etc.) that took place during the period of a data record can cause nonclimatic sudden changes (artificial shifts) in the time series. The fast-developing climate-monitoring science and technology (e.g., human observations are being replaced with automated observations or remote sensing) also make discontinuities inevitable in long-term climate data records. Such artificial shifts could have huge impacts on the results of climate analysis, especially those of climate trend analysis, as shown in Wang (2006) and Hanesiak and Wang (2005), among others. The reliability of the results of historical climate trend assessment is often hampered by the existence of discontinuities in instrumental records of climate and is often the subject of the ongoing debate on climate change. Accurate and homogeneous climate data are also indispensable for the calculation of related statistics that are needed and used to define the state of climate and climate extremes. Therefore, artificial shifts should be eliminated, to the extent possible, from time series prior to their application, especially in climate trend assessment.

There exist two types of artificial shifts: documented and undocumented shifts. Documented shifts refer to those with known position/time of shift (i.e., the time and cause of shift were recorded in the related metadata). Documented shifts are much easier to test/assess, because we do not need to identify the position of the shift statistically (we can find it from metadata); thus, the regular tests of means or variances are applicable. However, metadata often lack accuracy and completeness or in some cases are not available at all. One needs to rely on an appropriate statistical test to detect and assess undocumented shifts (i.e., those that have no metadata support).

In recent decades, several methods have been developed for detection of undocumented changepoints. In the literature of statistics, most of the existing detection methods can be classified into three categories: likelihood-based methods, linear regression–based methods, and nonparametric methods (Csörgő and Horváth 1997). In the climate literature, the most commonly used methods for changepoint detection include the standard normal homogeneity test (SNHT; Alexandersson 1986), the cumulative deviation test (Buishand 1982), two-phase regression-based methods (Solow 1987; Easterling and Peterson 1995; Lund and Reeves 2002; Wang 2003), Bayesian-based methods (Perreault et al. 2000; Chu and Zhao 2004), multiple linear regression (Vincent 1998), and some nonparametric methods such as the Mann–Whitney *U* test and the Wilcoxon rank test. Readers are referred to Reeves et al. (2007) for a recent, comprehensive review and comparison of these methods and to Peterson et al. (1998) for a review of homogeneity adjustments of in situ atmospheric climate data. DeGaetano (2006) also compared several methods for detecting discontinuities in mean temperature series.

This study attempts to improve a test for detecting undocumented shifts, proposing a new test statistic that treats each candidate changepoint in the time series being tested more equally (such equality is not achieved by the existing methods; see details in sections 2 and 5). Although undocumented shifts may take the form of a change in mean, variance, or both, this study aims only at detection of an undocumented shift in the mean. A mean shift at time *t* = *k* refers to the case in which the mean of the data series before this point (i.e., the average over all *t* ≤ *k*) is significantly different from that of the data series after this point (i.e., the average over all *t* > *k*). Here, we consider time series with zero trend and identically and independently distributed (IID) Gaussian errors, and we focus on the case in which the time series being tested contains at most one changepoint (AMOC). However, note that one can implement statistical tests that are developed for the AMOC case with an appropriate recursive testing algorithm to detect multiple changepoints in a time series, as in Wang (2007), Menne and Williams (2005), and Wang and Feng (2004). Meanwhile, Davis et al. (2006) and Caussinus and Mestre (2004) are two prominent examples of recent studies that address directly in their statistical model the issue of detecting multiple changepoints. An algorithm for detecting undocumented and documented artificial mean shifts in tandem has also been proposed and implemented (Wang 2007; Wang and Feng 2004).

The rest of the paper is arranged as follows. We describe the proposed new test, a penalized maximal *t* test, in section 2. Then, we compare this method with SNHT using Monte Carlo simulations in section 3 and using atmospheric pressure data series from a Canadian station in section 4. We give some concluding remarks in section 5.

## 2. Penalized maximal *t* test

Let *X*_{t} (*t* = 1, . . . , *N*) denote an IID Gaussian series. To detect a changepoint in the time series *X*_{t} is to test the null hypothesis

$$H_{0}\!:\; X_t \sim \operatorname{IID} N(\mu, \sigma^2), \quad t = 1, \ldots, N, \tag{1}$$

against the alternative

$$H_{a}\!:\;
\begin{cases}
X_t \sim \operatorname{IID} N(\mu_1, \sigma^2), & t = 1, \ldots, k,\\
X_t \sim \operatorname{IID} N(\mu_2, \sigma^2), & t = k+1, \ldots, N,
\end{cases} \tag{2}$$

where *μ*_{1} ≠ *μ*_{2} and "*X*_{t} ∼ IID *N*(*μ*, *σ*^{2})" stands for "*X*_{t} follows an IID Gaussian (i.e., normal) distribution of mean *μ* and variance *σ*^{2}." When *H*_{a} is true, the point/time *t* = *k* is called a changepoint, and Δ = |*μ*_{1} − *μ*_{2}| is called the magnitude of the mean shift (or, alternatively, the step size). In other words, if there is such a point *k*, the time series can be viewed as two independent samples from two normal distributions of the same unknown variance *σ*^{2}, one for all *t* ≤ *k* and another for all *t* > *k*. The task of undocumented changepoint detection is to find the most probable value of *k* and to test whether the means of these two samples are statistically significantly different from each other. The traditional test for this kind of problem is the likelihood ratio test. The most probable point to be the changepoint is the one associated with the maximal value of the following log likelihood ratio (Csörgő and Horváth 1997):

$$l(k) = \frac{N}{2}\,\ln\frac{\hat{\sigma}_0^2}{\tilde{\sigma}_k^2}, \tag{3}$$

where $\hat{\sigma}_0^2 = N^{-1}\sum_{t=1}^{N}(X_t - \overline{X})^2$ and $\tilde{\sigma}_k^2 = N^{-1}\bigl[\sum_{t=1}^{k}(X_t - \overline{X}_1)^2 + \sum_{t=k+1}^{N}(X_t - \overline{X}_2)^2\bigr]$ are the maximum likelihood estimates of the variance under $H_0$ and $H_a$, respectively, and $\overline{X}$ is the mean of the whole series. According to Csörgő and Horváth (1997), maximizing *l*(*k*) is equivalent to maximizing

$$T(k) = \sqrt{\frac{k(N-k)}{N}}\;\frac{\left|\overline{X}_1 - \overline{X}_2\right|}{\hat{\sigma}_k}, \tag{4}$$

where

$$\overline{X}_1 = \frac{1}{k}\sum_{t=1}^{k} X_t, \qquad
\overline{X}_2 = \frac{1}{N-k}\sum_{t=k+1}^{N} X_t, \qquad
\hat{\sigma}_k^2 = \frac{1}{N-2}\left[\sum_{t=1}^{k}(X_t - \overline{X}_1)^2 + \sum_{t=k+1}^{N}(X_t - \overline{X}_2)^2\right],$$

and the test statistic for detecting an undocumented mean shift is

$$T_{\max} = \max_{1 \le k \le N-1} T(k), \tag{5}$$

because of the necessary search over all candidate changepoints *k* ∈ 1, 2, . . . , *N* − 1 for the most probable position of an undocumented mean shift. This test is hereinafter called the maximal two-sample *t* test ("two sample" may be suppressed). In comparison with *l*(*k*), *T*(*k*) is more intuitive in form. Note that without taking the absolute value of $(\overline{X}_1 - \overline{X}_2)$ in *T*(*k*), this statistic becomes

$$t = \sqrt{\frac{k(N-k)}{N}}\;\frac{\overline{X}_1 - \overline{X}_2}{\hat{\sigma}_k},$$

which is just the test statistic of the well-known two-sided two-sample *t* test (for the equal but unknown variance case) that follows the Student's *t* distribution with (*N* − 2) degrees of freedom under the null hypothesis (von Storch and Zwiers 1999). Although it is difficult to derive the theoretical distribution of *T*_{max}, its empirical critical values can be generated by Monte Carlo simulations, as is commonly practiced. The popular SNHT is a special case of the above maximal *t* test [with the test statistic defined in (5)], though SNHT is formulated differently: Alexandersson (1986) assumes a known, unit variance for the standardized ratio series. Nevertheless, the above maximal *t* test and SNHT are equivalent.
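
As a concrete illustration (a minimal sketch, not the authors' code), the two-sample statistic *T*(*k*) and its maximum *T*_{max} can be computed as follows; the series with an inserted 5*σ* shift is synthetic:

```python
import numpy as np

def maximal_t(x):
    """Compute T(k) for k = 1..N-1 and T_max for a 1D series x.

    T(k) = sqrt(k*(N - k)/N) * |mean(x[:k]) - mean(x[k:])| / sigma_k,
    with sigma_k the pooled standard deviation on N - 2 degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    T = np.empty(N - 1)
    for k in range(1, N):
        x1, x2 = x[:k], x[k:]
        ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
        sigma_k = np.sqrt(ss / (N - 2))
        T[k - 1] = np.sqrt(k * (N - k) / N) * abs(x1.mean() - x2.mean()) / sigma_k
    k_hat = int(np.argmax(T)) + 1  # most probable changepoint position
    return T, k_hat, float(T[k_hat - 1])

# synthetic series: mean shift of 5 standard deviations after t = 50
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])
T_curve, k_hat, t_max = maximal_t(x)
```

With a shift this large, the maximization recovers the inserted changepoint position almost exactly.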

Gardner (1975) showed that the power of the *t* test may decrease considerably when the two samples are of unequal size (relative to the equal-size case), because the estimate from the shorter segment tends to be less accurate. As a consequence, the maximal *t* test and SNHT suffer from the disadvantage that points in a homogeneous time series have different probabilities of being mistakenly identified as changepoints. This is illustrated through Monte Carlo simulations below.

For each series length *N* and changepoint position *k* (*k* = 1, 2, . . . , *N* − 1), for various choices of *N* (see Fig. 1), we simulate *M*_{N} = (*N* − 1) × 100 000 homogeneous IID Gaussian time series, denoted as *X*_{j}(*t*) (*j* = 1, 2, . . . , *M*_{N}; *t* = 1, 2, . . . , *N*). Then, for each time series *X*_{j}(*t*), we calculate the statistic *T*(*k*) for each *k* ∈ 1, 2, . . . , *N* − 1, find the maximal value *T*_{j}(*k*_{j}) = max_{1≤*k*≤*N*−1} *T*(*k*), and record its corresponding position *k*_{j}. The (1 − *α*)th percentile of the *T*_{max} statistic [*T*_{max}(*α*)] is then estimated from the *T*_{j}(*k*_{j}) values (*j* = 1, 2, . . . , *M*_{N}). Further, let *M*_{α}(*k*) denote the count of cases (out of the *M*_{N} cases) for which point *k* is associated with *T*_{j}(*k*_{j}) > *T*_{max}(*α*), that is, for which point *k* is mistakenly identified as a changepoint at the *α* significance level [which is also often referred to as "at the *p* = (1 − *α*) level of confidence," although the word "confidence" is somewhat misleading here]. Here we use *α* = 0.05, and thus *p* = 0.95. Then, the false-alarm rate for point *k* is estimated as

$$\mathrm{FAR}_{\alpha}(k) = \frac{M_{\alpha}(k)}{M}, \tag{6}$$

where *M* = *M*_{N}/(*N* − 1) = 100 000; *p*_{e}(*k*) = [1 − *α*_{e}(*k*)] = [1 − FAR_{α}(*k*)] is called the effective level of confidence of the test, and *α*_{e}(*k*) = FAR_{α}(*k*) is the effective level of significance (*k* = 1, 2, . . . , *N* − 1). We repeat the above calculations for each of the 18 selected values of *N* (ranging from 6 to 500); the resulting false-alarm rate as a function of *k*, FAR_{α}(*k*), is shown in Fig. 1 [the curve for *N* = 6, not shown, is almost flat; note that these curves of FAR_{α}(*k*) values plotted against the corresponding *k* values are referred to as the FAR_{α}(*k*) ∼ *k* curves in this study]. Note that exactly the same FAR_{α}(*k*) ∼ *k* curves are obtained when the SNHT test statistic is used instead of the *T*_{max} above, as expected, because the maximal *t* test and SNHT are equivalent.
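
The Monte Carlo procedure above can be sketched as follows; this is an illustrative reimplementation with a far smaller replicate count than the 100 000 per candidate position used in the paper, and the function names are ours:

```python
import numpy as np

def t_curve(x):
    """T(k) curve (k = 1..N-1) of the maximal two-sample t test."""
    N = len(x)
    T = np.empty(N - 1)
    for k in range(1, N):
        x1, x2 = x[:k], x[k:]
        ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
        T[k - 1] = np.sqrt(k * (N - k) / N) * abs(x1.mean() - x2.mean()) / np.sqrt(ss / (N - 2))
    return T

def far_curve(N, n_series=4000, alpha=0.05, seed=1):
    """Estimate the per-position false-alarm rate: simulate homogeneous IID
    N(0,1) series, record T_max and its position, and count how often each
    position k exceeds the empirical (1 - alpha) critical value of T_max."""
    rng = np.random.default_rng(seed)
    tmax = np.empty(n_series)
    kpos = np.empty(n_series, dtype=int)
    for j in range(n_series):
        T = t_curve(rng.normal(size=N))
        kpos[j] = np.argmax(T) + 1
        tmax[j] = T[kpos[j] - 1]
    crit = np.quantile(tmax, 1.0 - alpha)  # empirical T_max critical value
    counts = np.bincount(kpos[tmax > crit], minlength=N)[1:]  # false alarms per k
    return counts / (n_series / (N - 1))  # denominator plays the role of M

far = far_curve(N=20)
```

Even with this modest replicate count, the estimated false-alarm rates are visibly larger near the two ends of the series than near the middle, i.e., the U shape discussed below.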

It is clear that the FAR_{α}(*k*) ∼ *k* curves are U shaped and that the larger the series length *N* is, the flatter is the bottom of the U-shaped curve (Fig. 1). These U-shaped curves indicate that the chance for points near the ends of a homogeneous series to be mistakenly identified as a changepoint is much larger than for those near the middle of the series. As a consequence, the effective level of confidence of the above maximal *t* test (and its equivalent SNHT) is much lower than the specified level (*p* = 0.95) if the detected changepoints are near the ends of the series (i.e., *p*_{e} < 0.95 or *α*_{e} > 0.05) but is much higher if they are near the middle of the series (i.e., *p*_{e} > 0.95 or *α*_{e} < 0.05). That is, for a changepoint of a certain magnitude, the test would detect it with a lower-than-specified level of confidence, and hence more easily, when it occurs near the ends of the series than when it occurs around the middle, and the test would mistakenly declare many more changepoints near the ends of a homogeneous series than around the middle. This problem arises when the two samples (before and after point *k*) are substantially unequal in size. The effect of unequal sample sizes on the statistic *T*(*k*) is not very obvious, but it is enough to make an important difference in the result of searching for the maximum value of *T*(*k*) across *k* ∈ 1, 2, . . . , *N* − 1 (it gets "exaggerated" by this inevitable "maximizing" process).

Let *R*_{k} denote the ratio

$$R_k = \frac{T_{\max}[\alpha_e(k)]}{T_{\max}(0.05)},$$

where *T*_{max}[*α*_{e}(*k*)] and *T*_{max}(0.05) are the critical values of the *T*_{max} in (5) that correspond to the *α*_{e}(*k*) = FAR_{α}(*k*) and 5% levels of significance, respectively. Similar to the way in which the *α*_{e}(*k*) = FAR_{α}(*k*) values are estimated, *T*_{max}[*α*_{e}(*k*)] and *T*_{max}(0.05), and hence *R*_{k}, are also estimated through Monte Carlo simulations. The estimated *R*_{k} values are shown as crosses, squares, circles, or asterisks in Fig. 2. By conducting these simulations, we convert the effect of unequal sample sizes on the FAR_{α}(*k*) values (shown in Fig. 1) into the reverse of its effect on the *T*_{max} statistic, so that a penalty factor on *T*(*k*) that would even out the U shape of the FAR_{α}(*k*) ∼ *k* curves should fit the ratios *R*_{k} well. Therefore, to construct such a penalty factor, we obtain least squares fits to the simulated *R*_{k} values, for each of the 18 selected values of *N* in turn. Bearing in mind the characteristics of the effect of unequal sample sizes (they are functions of *k* and *N*), we find through trials that the penalty factor can be constructed from simple functions of *k* and *N*. By trial and error, we find the penalty function *P*_{o}(*k*) given in (8); its parameter *υ* takes one form for all time series of length *N* ≤ 100 and another form for time series of length *N* > 100. Figure 2 shows the fits of this penalty function (thin dashed curves; most of them are the same as the thick curves that are explained below) to the *R*_{k} values for each of the 17 selected values of *N* (the values for *N* = 6 have little dependence on *k* and hence are not shown here), which are all very good. Note that for time series of length *N* > 100 the penalty function with *υ* = (2*C*^{3} + 2*C*^{2} − 1)/(100*C*) fits the *R*_{k} values even better than does the one above, but further checks on the resulting FAR_{α}(*k*) ∼ *k* curves indicate that it overpenalizes the test statistic.

In general, our experiments suggest that the penalty function in (8) tends to overpenalize the test statistic for the points that are near the ends of the series [the resulting FAR_{α}(*k*) ∼ *k* curves become M shaped, like the case for *N* = 500 in Fig. 3 but of much bigger amplitude]. Say, for a specific series length *N*, there are *K*_{1} points at each end that are associated with penalty *P*_{o}(*k*) ≤ 1 [i.e., *P*_{o}(1) < *P*_{o}(2) < · · · < *P*_{o}(*K*_{1}) ≤ 1 (the first *K*_{1} points) and *P*_{o}(*N* − 1) < *P*_{o}(*N* − 2) < · · · < *P*_{o}(*N* − *K*_{1}) ≤ 1 (the last *K*_{1} points)], while *P*_{o}(*k*) > 1 is true for all *k* ∈ [*K*_{1} + 1, *N* − *K*_{1} − 1]. The above penalty term overpenalizes approximately the first and last *L* points, where *L* = (⌊*K*_{1}/2⌋ + 3) for series of length 10 < *N* < 50 and *L* = (⌊*K*_{1}/2⌋ + 2) otherwise (here ⌊*K*_{1}/2⌋ means taking the floor value of the integer division *K*_{1}/2; *L* ranges between 1 and 26 for *N* ∈ [6, 500]). We speculate that this is because the estimates of the false-alarm rate and other statistics for these end points are much more unstable than for all the other points. Also, for very large *N* (e.g., *N* = 500), the above penalty function is a little too narrow (not quite visible in Fig. 2); the curve is almost vertical at the very ends of the series, and so the penalty function is very sensitive to any error in the estimate of the false-alarm rate for these points. Besides, we speculate that a penalty factor makes the statistic more sensitive to any estimation error than does an additive penalty term.

To avoid this overpenalization, we modify the penalty function for the first and last *L* points, with one modified form for time series of length *N* ≤ 100 and another for *N* > 100; the modified penalty, denoted *P*(*k*), is again obtained empirically. Figure 2 also shows this modified penalty function in comparison with *P*_{o}(*k*) and their fits to the *R*_{k} values for each of the 17 selected values of *N*.

Replacing *T*(*k*) with the penalized statistic *P*(*k*)*T*(*k*) in the maximization defines the penalized maximal *t* test (PMT), with test statistic

$$PT_{\max} = \max_{1 \le k \le N-1} P(k)\,T(k).$$

Figure 3 shows the FAR_{α}(*k*) ∼ *k* curves of PMT in comparison with the corresponding curves of SNHT for 12 selected values of *N* (the curves look similar for other *N* values and thus are not shown). It is clear that PMT has a much more evenly distributed false-alarm rate across all points in the series than does SNHT, although it tends to slightly overpenalize the test statistic for the end points of very long time series (*N* ≥ 500). The penalty factor does even out, to a great extent, the U shape of the FAR_{α}(*k*) ∼ *k* curves of the unpenalized maximal *t* test (and SNHT). The new test statistic takes the relative position of each candidate changepoint into account to reduce the distortion of the test statistic that is due to unequal sample sizes. Observations are treated more equally during the search for the most probable changepoint position/time. This is of great practical importance, because it results in a basically uniform level of confidence in the identified changepoints regardless of their position in the time series, or a uniform false-alarm rate across a homogeneous series.
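
The penalized statistic is the product of a position-dependent penalty and *T*(*k*). Because the fitted penalty function itself is empirical, the sketch below substitutes a deliberately simple, hypothetical penalty shape (below 1 near the ends, above 1 near the middle) purely to illustrate the mechanics; it is NOT the paper's fitted *P*(*k*):

```python
import numpy as np

def pmt_stat(x, penalty):
    """Maximize penalty(k, N) * T(k) over k: a penalized maximal t statistic."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    best_k, best_pt = 1, -np.inf
    for k in range(1, N):
        x1, x2 = x[:k], x[k:]
        ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
        t_k = np.sqrt(k * (N - k) / N) * abs(x1.mean() - x2.mean()) / np.sqrt(ss / (N - 2))
        pt = penalty(k, N) * t_k
        if pt > best_pt:
            best_k, best_pt = k, pt
    return best_k, float(best_pt)

def toy_penalty(k, N):
    # Hypothetical illustrative shape, NOT the empirically fitted P(k):
    # it only mimics the qualitative behavior (penalize candidate
    # changepoints near the series ends, reward those near the middle).
    c = min(k, N - k) / N  # 0 near the ends, 0.5 at the middle
    return 0.5 + 2.0 * c

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 40), rng.normal(2.0, 1.0, 60)])
k_hat, pt_max = pmt_stat(x, toy_penalty)
```

Any real use would substitute the empirically fitted penalty function for `toy_penalty`, together with the matching critical values.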

We acknowledge that our approach here is purely empirical [i.e., both *P*_{o}(*k*) and *P*(*k*) are obtained empirically] and that a theoretically based penalty term is yet to be found. This empirical exercise basically achieves the desired results; it also clearly reveals the characteristics of the effect of unequal sample sizes, which may help in developing a theoretically based penalty term.

Empirical critical values of PT_{max} for some selected values of *N* are obtained by simulating 10 million PT_{max} values under the null hypothesis of no changepoint and are presented in Table 1 (for other *N* values not listed here, linear interpolation between the two neighboring tabulated values is of sufficiently good accuracy). The empirical critical values of *T*_{max} in SNHT are simulated similarly and are also listed in Table 1; they are used for the comparison in the next section.
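
Empirical critical values of this kind can be generated along the following lines (a sketch using far fewer than the 10 million replicates of the paper, and the unpenalized *T*_{max} for brevity):

```python
import numpy as np

def t_max(x):
    """T_max of the (unpenalized) maximal two-sample t test for series x."""
    N = len(x)
    best = 0.0
    for k in range(1, N):
        x1, x2 = x[:k], x[k:]
        ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
        t_k = np.sqrt(k * (N - k) / N) * abs(x1.mean() - x2.mean()) / np.sqrt(ss / (N - 2))
        best = max(best, t_k)
    return best

def empirical_critical_value(N, n_sim=3000, alpha=0.05, seed=7):
    """Simulate the statistic under H0 (homogeneous IID N(0,1) series) and
    return its empirical (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    sims = np.array([t_max(rng.normal(size=N)) for _ in range(n_sim)])
    return float(np.quantile(sims, 1.0 - alpha))

crit_20 = empirical_critical_value(20)
```

Note that the maximization over all candidate positions inflates the critical value well beyond the pointwise two-sample *t* critical value (about 2.1 for 18 degrees of freedom at the 5% level).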

The effect of unequal sample sizes may not be a big problem for detecting a large mean shift, but it cannot be ignored when the magnitude of mean shift is small or medium relative to the noise variance, especially in short time series (such as most annual climate data series). This will be illustrated through the comparison in the next section.

## 3. Comparison of PMT with SNHT

In this section, the performance of PMT is compared with that of SNHT through Monte Carlo simulations. To evaluate the performance of each method, we use three different measures of detection power. The first measure is the position rate (PR), which is the rate of detecting a changepoint position "correctly" (i.e., the detected changepoint position is within the interval [*K* − 2, *K* + 2] of the true position *K*), regardless of its significance (see the discussion below for the choice of this interval). The second measure is the significance rate (SR), which is the rate of detecting a statistically significant (at the 5% level) changepoint regardless of whether or not its position is identified correctly. When each of the time series being tested contains a mean shift (i.e., *H*_{a} is true), *β̂* = (1 − SR) is a rough estimate of the type-II error rate *β* (i.e., the rate of mistakenly accepting the null hypothesis of no changepoint when a mean shift does exist in the time series), and SR = (1 − *β̂*) in this case. Note that (1 − *β*) is statistically referred to as the power of the test. Thus, SR can also be called the "power" of the test when it is equal to (1 − *β̂*) (such as in the setting described in the next paragraph). The third measure is the hit rate (HR), which is the rate of detecting a changepoint with both statistical significance (at the 5% level) and correct position (as defined for the position rate). For any method, the PR or SR is only a rough measure of its detection ability, whereas HR is a strict measure of its "accurate" detection power, and (1 − HR) is a more accurate estimate of the type-II error rate *β*.

The comparison is carried out as follows. First, for each of the 18 selected values of series length *N* (ranging from 6 to 500), we generate 1000 homogeneous time series from an IID Gaussian distribution with zero mean and variance *σ*^{2} (without loss of generality, we set *σ* = 1 here). We insert a mean shift Δ at point *k* (between *k* and *k* + 1) for each *k* ∈ 2, 3, . . . , *N* − 2 and each mean-shift magnitude Δ ∈ 0.25*σ*, 0.5*σ*, *σ*, 1.5*σ*, 2*σ*. The ratio Δ/*σ* is called the relative step/shift size (RSS), that is, RSS = Δ/*σ*; thus, we tried RSS ∈ 0.25, 0.5, 1, 1.5, 2 in this study. Then, we apply the two tests (PMT and SNHT), at the same 5% level of significance, to each of the 1000 time series for each combination of *k* and Δ [there are 5 × (*N* − 3) combinations for each *N* value with five choices of Δ] to detect the mean shift. We calculate the three detection rates for each combination of *N*, *k*, and Δ values. As an example, we show the complete results for the case of *N* = 20 in Tables 2–4. For the other selected values of *N*, only the mean hit rates, mean position rates, and mean significance rates (i.e., those averaged over all the changepoint positions *k* ∈ 2, 3, . . . , *N* − 2) are listed in Tables 5–7, respectively [note that we have inserted a changepoint in almost all possible positions (except *k* = 1 and *k* = *N* − 1, i.e., the cases with the first or last datum being an outlier), one at a time, for each selected value of *N*; thus, it would be too much to show the complete results for all 18 selected values of *N*].
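
The simulation design above can be sketched as follows; replicate counts are reduced for illustration, the unpenalized maximal *t* test stands in for both methods, and the critical value is simulated on the fly rather than taken from Table 1:

```python
import numpy as np

def t_curve(x):
    """T(k) curve of the maximal two-sample t test."""
    N = len(x)
    T = np.empty(N - 1)
    for k in range(1, N):
        x1, x2 = x[:k], x[k:]
        ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
        T[k - 1] = np.sqrt(k * (N - k) / N) * abs(x1.mean() - x2.mean()) / np.sqrt(ss / (N - 2))
    return T

def detection_rates(N=20, K=10, delta=2.0, n_series=500, delta_pos=2, seed=3):
    """Estimate position rate (PR), significance rate (SR), and hit rate (HR)
    for series of length N with a mean shift of size delta inserted at K."""
    rng = np.random.default_rng(seed)
    # empirical 5% critical value of T_max under H0, simulated on the fly
    crit = np.quantile([t_curve(rng.normal(size=N)).max() for _ in range(2000)], 0.95)
    pr = sr = hr = 0
    for _ in range(n_series):
        x = rng.normal(size=N)
        x[K:] += delta  # insert the shift between positions K and K + 1
        T = t_curve(x)
        k_hat = int(np.argmax(T)) + 1
        pos_ok = abs(k_hat - K) <= delta_pos  # position "correct" within +/- delta_pos
        sig_ok = T[k_hat - 1] > crit          # statistically significant at the 5% level
        pr += pos_ok
        sr += sig_ok
        hr += pos_ok and sig_ok
    return pr / n_series, sr / n_series, hr / n_series

pr, sr, hr = detection_rates()
```

By construction HR can never exceed PR or SR, since a hit requires both a correct position and statistical significance.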

To check the effect of the width of the interval [*K* − *δ*, *K* + *δ*] for defining a correct identification of changepoint position, for the case of *N* = 100 we carry out the above simulations using *δ* = 1, 2, . . . , 10 in turn. The resulting mean position rates are shown in Table 8. It is clear that the results of the comparison of PMT with SNHT have little dependence on the choice of *δ* value (see the ratios in parentheses in Table 8). We see a small dependence only for the cases of identifying very small shifts (Δ = 0.25*σ*), in which the narrower the interval used, the greater the advantage of PMT over SNHT. However, there is no notable difference for *δ* = 2–7. In this study, as described above, we chose *δ* = 2, so that the interval is also suitable for very short time series (such as *N* = 6, the shortest among the 18 selected values of *N* investigated in this study).

As would be expected, PMT has higher power of detection (in terms of all three detection rates) than does SNHT for changepoints that occur not too close to the ends of the series while it has lower power of detection for changepoints that are near the ends of the series (Tables 2 –4), because it tries to obtain an evenly distributed effective level of significance/confidence (and hence power) of the test across all possible changepoint positions *k*, that is, to make the effective level of significance/confidence of the test close to the specified level for every point.

Most important is that, when averaged over all possible changepoint positions *k* ∈ 2, 3, . . . , *N* − 2, PMT significantly outperforms SNHT in detecting small shifts, whether in short or long time series, and in detecting medium–large shifts in short time series. For large shifts in very long time series, however, SNHT is only slightly better than PMT (Fig. 4 and Tables 5–7).

In terms of the mean hit rate (see Fig. 4a), the improvement of PMT over SNHT can be as much as 14%–25% for detecting small shifts (RSS ≤ 0.5) and up to 5% for detecting medium shifts (e.g., RSS = 1.0 and 1.5) in time series of length *N* < 100, whereas SNHT only has about 1%–2% higher hit rates in detecting medium–large shifts in long time series (of length *N* > 100). Note that the smaller the RSS is, the greater is the improvement of PMT over SNHT (Fig. 4a). Also, the largest improvement occurs when the time series length *N* is less than 100 (which is true for most annual climate data series), and the larger the RSS is, the shorter is the length of time series in which the largest improvement is obtained (i.e., the peaks of the curves shown in Fig. 4a occur at smaller *N* for larger RSS).

As shown in Fig. 4b, PMT has up to about 9% higher power in identifying correctly the position of small shifts in all time series (short or long), whereas PMT and SNHT perform very similarly in detecting the correct position of medium–large shifts (with PMT being slightly better when the time series is short and SNHT being slightly better when the time series is very long).

Because each of the 1000 time series tested here contains a mean shift (i.e., *H*_{a} is always true here), the type-II error rate can be estimated as (1 − significance rate). Thus, as mentioned before, the higher the significance rate is, the lower is the type-II error rate and hence the better is the test. [Meanwhile, a more accurate/strict measure of the type-II error rate can be derived from the hit rate, i.e., the "miss rate," defined as (1 − hit rate), which, however, contains the same information as the hit rate.] In terms of significance rate, as shown in Fig. 4c, the improvement of PMT over SNHT is similar to that in terms of the mean hit rate, with up to 16% improvement for small shifts and up to 5% improvement for medium shifts in time series of length *N* < 100, and the two methods perform very similarly in detecting large shifts.

As shown in Fig. 4, the hit rate is much more sensitive to change in the length of time series being tested or in the magnitude of shift than is the position or significance rate. This is because a hit is counted only when the test identifies the changepoint with both statistical significance and correct position.

## 4. An application of PMT and SNHT to climate data series

In this section, we present an application of PMT and SNHT to detect undocumented mean shifts in climate data series. The purpose is to see which of the two methods performs better in practical use; thus, we need to be able to verify the results. That is, we need to apply the methods to a time series that contains a documented mean shift (i.e., we know exactly when the shift occurred) to see whether the methods can detect the mean shift when we treat it as undocumented for the purpose of this application. To this end, we apply PMT and SNHT to time series of monthly mean and annual mean station pressure recorded at Burgeo (Newfoundland, Canada) for the 28-yr period from January 1967 to December 1994 (no data exist outside this period), because we know that the pressure series contains an artificial mean shift (see Fig. 5a) caused by neglecting the station elevation of 10.6 m in the calculation of station pressures from barometer readings in the period prior to January 1977 (a problem of the so-called 50-foot rule, which is to use zero elevation in the calculation of station pressures from barometer readings if the station elevation is less than 50 ft, i.e., 15 m). According to a physically based estimate using a hydrostatic model and hourly pressure and temperature data (Wan et al. 2007), neglecting such an elevation introduces a bias of 1.32 hPa in the pressure values.

We also know that the station pressure recorded at Yarmouth Airport (Nova Scotia, Canada) for the same 28-yr period is basically homogeneous and is highly correlated with the pressure data recorded at Burgeo. It is the best available reference series for the Burgeo series. Thus, we use the corresponding pressure series from Yarmouth Airport as reference series in this application.

Because both SNHT and PMT assume that the time series being tested is IID Gaussian, we apply the tests to the annual mean and monthly mean pressure series for each calendar month (13 time series in total) separately, to minimize the effect of autocorrelation. As a result, both PMT and SNHT identify a changepoint that is statistically significant at the 5% level in the annual mean series and in the monthly mean series for July, September, and November (see Table 9 and Figs. 5b–d), as well as a changepoint (1976) that is significant at the 10% but not the 5% level in the December mean pressure series (not shown or listed in Table 9). Both methods are consistently correct in identifying the mean shift in 1976 in the annual mean series and in the monthly mean series for July and September (Table 9). However, PMT is more accurate in identifying the mean shift in the November mean pressure series: it placed the mean shift in 1974 (two intervals/years earlier than the true position), whereas SNHT placed it in 1972 (four intervals/years earlier; see Table 9 and Fig. 5c).

Both methods found no significant changepoint in the monthly mean pressure series for the other calendar months. One of the reasons for the detection failure is the presence of autocorrelation in the time series. Although all of the base-minus-reference pressure series tested here are annual-interval series (i.e., the interval between two consecutive data points is 1 yr), their lag-1 autocorrelation (after taking into account the mean shift in 1976) ranges from −0.303 to 0.117, with more months having a negative autocorrelation. According to Lund et al. (2007), ignoring a positive autocorrelation will increase the false-alarm rate of the test, whereas ignoring a negative one will let real changepoints go undetected, and the larger the autocorrelation is in absolute value, the greater is the effect. The large negative autocorrelations in the base-minus-reference pressure series thus definitely contribute to the detection failure. In addition, sampling variability is larger for short time series than for long series; the short series length (here *N* = 28) also makes the detection harder.
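
The lag-1 autocorrelation diagnostic described above can be sketched as follows; the demonstration series is synthetic (not the Burgeo data), and the known shift position is supplied when removing segment means:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[1:] * x[:-1]).sum() / (x * x).sum())

def shift_adjusted_lag1(series, change_idx):
    """Remove each segment's mean (before/after the known shift) before
    computing the lag-1 autocorrelation, so that the mean shift itself
    does not inflate the estimate."""
    s = np.asarray(series, dtype=float).copy()
    s[:change_idx] = s[:change_idx] - s[:change_idx].mean()
    s[change_idx:] = s[change_idx:] - s[change_idx:].mean()
    return lag1_autocorr(s)

# hypothetical 28-yr base-minus-reference series with a 1.32-hPa shift
rng = np.random.default_rng(5)
demo = rng.normal(0.0, 0.5, 28)
demo[10:] += 1.32
r_raw = lag1_autocorr(demo)            # inflated by the unremoved mean shift
r_adj = shift_adjusted_lag1(demo, 10)  # shift-adjusted estimate
```

Taking the known shift into account matters: an unremoved mean shift masquerades as strong positive autocorrelation, which is why the values quoted above are computed after accounting for the 1976 shift.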

## 5. Concluding remarks

Based on empirical methods, a penalized maximal *t* test is proposed for detecting undocumented mean shifts in climate data series. PMT takes the relative position of each candidate changepoint into account to diminish the effect of unequal sample sizes on the false-alarm rate and hence on the power of detection. It has been shown, for time series of selected lengths *N* ∈ [6, 500], that the false-alarm rate of PMT is evenly distributed across all candidate changepoint positions *k* ∈ 1, 2, . . . , *N* − 1; it is very close to the specified level of significance at all candidate changepoints. This feature is highly desirable in practice, because each point of data should be treated equally: each should have an equal chance of being mistakenly picked as a changepoint when the time series is homogeneous (i.e., the same false-alarm rate), and a shift of a certain magnitude should have the same probability of being detected no matter where it occurs (near the ends or the middle of the series). Without this feature, the resulting level of confidence in identified changepoints that are near the ends of the series is much lower than the specified level, and much higher in those that are near the middle. In contrast, the false-alarm rate of SNHT can be up to 10 times the specified level for points near the ends of the series and much lower for the middle points (Fig. 3); that is, the points of data in the time series are not treated equally.

PMT consequently has higher power in detecting changepoints that are not too close to the ends of series and has lower power for detecting changepoints that are near the ends of series, in comparison with SNHT. However, note that the higher power of SNHT for detecting changepoints near the ends of series arises from the fact that the effect of unequal sample sizes makes the effective level of confidence of SNHT much lower than the specified level for the end points, which is not desirable. What is highly desirable is for a test to perform effectively at the specified level of significance/confidence no matter where the shift occurs, that is, to have the same probability of detecting a shift of certain magnitude regardless of the position of shift (near the ends or the middle of the series). PMT has this highly desirable feature, although it is constructed empirically.

Most important, when averaged over all possible changepoint positions, PMT has higher power of detection. In terms of hit rate, the improvement of PMT over SNHT can be as much as 14%–25% for detecting small shifts (Δ < *σ*) regardless of time series length and up to 5% for detecting medium shifts (Δ = *σ*–1.5*σ*) in time series of length *N* < 100. The smaller the relative shift size RSS = Δ/*σ*, the greater the improvement. The largest improvement is obtained for time series of length *N* < 100, which is of great practical importance because most annual climate data series are shorter than this (or the period that contains only one changepoint is shorter than this).
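Hit rates of this kind can be estimated by the same Monte Carlo approach: insert a mean shift of known size at a known position, apply the test with an empirically calibrated critical value, and count how often the shift is both detected and located correctly. The sketch below is illustrative only (helper names are hypothetical, the critical value is calibrated within the script rather than taken from Table 1, and the unpenalized maximal *t* statistic stands in for both tests); it shows that smaller relative shift sizes are detected less often.

```python
import numpy as np

rng = np.random.default_rng(2)

def max_t(x):
    """Maximal two-sample t statistic over all split points and its argmax."""
    n = len(x)
    best_k, best_t = 1, -np.inf
    for k in range(1, n):
        a, b = x[:k], x[k:]
        sp2 = (np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)) / (n - 2)
        t = abs(a.mean() - b.mean()) / np.sqrt(sp2 * (1 / k + 1 / (n - k)))
        if t > best_t:
            best_k, best_t = k, t
    return best_k, best_t

N, trials, K = 50, 1500, 25  # series length, replicates, true shift position
# Empirical 5% critical value, calibrated on homogeneous Gaussian series.
crit = np.quantile([max_t(rng.standard_normal(N))[1] for _ in range(trials)], 0.95)

def hit_rate(delta):
    """Fraction of replicates in which a shift of size delta (in units of
    sigma = 1) at position K is detected and located within +/- 2 points."""
    hits = 0
    for _ in range(trials):
        x = rng.standard_normal(N)
        x[K:] += delta  # impose a mean shift of size delta at k = K
        k, t = max_t(x)
        hits += (t > crit) and (abs(k - K) <= 2)
    return hits / trials

r_small, r_large = hit_rate(0.5), hit_rate(1.0)
print("hit rate, RSS=0.5:", r_small, " RSS=1.0:", r_large)
```

Repeating such an experiment over all positions *k* and averaging gives the mean hit rates that the comparison between PMT and SNHT is based on.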

Note that the effect of unequal sample sizes also exists in the two-phase regression model–based tests for undocumented changepoints (Lund and Reeves 2002; Wang 2003), which we are also tackling and will report in a separate paper.

Also, both PMT and SNHT assume that the errors in the time series being tested are IID Gaussian, which is hardly true for climate data series (even for annual series, as discussed in section 4). Autocorrelation and periodicity are typically inherent in climate data series. Periodicity and trend can be greatly diminished by using a good reference series that has the same trend and periodicity as the base series, but the use of a reference series cannot diminish autocorrelation in the series being tested. Thus, it is of crucial importance for a test of undocumented changepoints to take into account the effects of autocorrelation and periodicity in the time series. Lund et al. (2007) recently proposed a new method for changepoint detection in periodic and autocorrelated time series, although the effect of unequal sample sizes is yet to be taken into account in this method; that should be the subject of our next study.
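The cost of violating the IID assumption is easy to demonstrate by simulation. The hedged sketch below (illustrative helper names, not from the paper) calibrates a nominal 5% critical value for an unpenalized maximal *t* statistic on white noise and then applies it to homogeneous but lag-1 autocorrelated AR(1) series: the rejection rate far exceeds 5%, so unaccounted autocorrelation produces spurious "changepoints."

```python
import numpy as np

rng = np.random.default_rng(1)

def max_t(x):
    """Unpenalized maximal two-sample t statistic over all split points."""
    n = len(x)
    best = 0.0
    for k in range(1, n):
        a, b = x[:k], x[k:]
        sp2 = (np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)) / (n - 2)
        t = abs(a.mean() - b.mean()) / np.sqrt(sp2 * (1 / k + 1 / (n - k)))
        best = max(best, t)
    return best

def ar1(n, phi):
    """Stationary AR(1) series with lag-1 autocorrelation phi."""
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - phi ** 2)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.standard_normal()
    return x

N, trials = 50, 2000
# Calibrate the 5% critical value on IID Gaussian noise ...
crit = np.quantile([max_t(rng.standard_normal(N)) for _ in range(trials)], 0.95)
# ... then apply it to homogeneous (changepoint-free) AR(1) series.
reject = np.mean([max_t(ar1(N, 0.5)) > crit for _ in range(trials)])
print(f"nominal 5% test rejects {100 * reject:.0f}% of homogeneous AR(1) series")
```

A positively autocorrelated series drifts on multiannual time scales, so segment means differ by more than the pooled variance suggests, and a test calibrated under independence rejects far too often.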

## ACKNOWLEDGMENTS

The Climate Research Division of the Atmospheric Science and Technology Directorate of Environment Canada is acknowledged for supporting this research through grants and contributions (KM040–05–0016–IP). The three anonymous reviewers and the editor, Dr. Arthur DeGaetano, are thanked for their helpful review comments.

## REFERENCES

Alexandersson, H., 1986: A homogeneity test applied to precipitation data. *J. Climatol.*, **6**, 661–675.

Buishand, T. A., 1982: Some methods for testing the homogeneity of rainfall records. *J. Hydrol.*, **58**, 11–27.

Caussinus, H., and O. Mestre, 2004: Detection and correction of artificial shifts in climate. *Appl. Stat.*, **53**, 405–425.

Chu, P. S., and X. Zhao, 2004: Bayesian changepoint analysis of tropical cyclone activity: The central North Pacific case. *J. Climate*, **17**, 4893–4902.

Csörgő, M., and L. Horváth, 1997: *Limit Theorems in Change-Point Analysis*. John Wiley and Sons, 414 pp.

Davis, R. A., T. C. M. Lee, and G. A. Rodriguez-Yam, 2006: Structural breaks estimation for non-stationary time series models. *J. Amer. Stat. Assoc.*, **101**, 223–239.

DeGaetano, A. T., 2006: Attributes of several methods for detecting discontinuities in mean temperature series. *J. Climate*, **19**, 838–853.

Easterling, D. R., and T. C. Peterson, 1995: A new method for detecting undocumented discontinuities in climatological time series. *Int. J. Climatol.*, **15**, 369–377.

Gardner, P. L., 1975: Scales and statistics. *Rev. Educ. Res.*, **45**, 43–57.

Hanesiak, J. M., and X. L. Wang, 2005: Adverse weather trends in the Canadian Arctic. *J. Climate*, **18**, 3140–3156.

Lund, R., and J. Reeves, 2002: Detection of undocumented changepoints: A revision of the two-phase regression model. *J. Climate*, **15**, 2547–2554.

Lund, R., X. L. Wang, Q. Lu, J. Reeves, C. Gallagher, and Y. Feng, 2007: Changepoint detection in periodic and autocorrelated time series. *J. Climate*, in press.

Menne, M. J., and C. N. Williams Jr., 2005: Detection of undocumented changepoints using multiple test statistics and composite reference series. *J. Climate*, **18**, 4271–4286.

Perreault, L., J. Bernier, and E. Parent, 2000: Bayesian changepoint analysis in hydrometeorological time series. Part 1. The normal model revisited. *J. Hydrol.*, **235**, 221–241.

Peterson, T. C., and Coauthors, 1998: Homogeneity adjustments of in situ atmospheric climate data: A review. *Int. J. Climatol.*, **18**, 1493–1517.

Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Lu, 2007: A review and comparison of changepoint detection techniques for climate data. *J. Appl. Meteor. Climatol.*, **46**, 900–915.

Solow, A., 1987: Testing for climatic change: An application of the two-phase regression model. *J. Climate Appl. Meteor.*, **26**, 1401–1405.

Vincent, L., 1998: A technique for the identification of inhomogeneities in Canadian temperature series. *J. Climate*, **11**, 1094–1104.

von Storch, H., and F. W. Zwiers, 1999: *Statistical Analysis in Climate Research*. Cambridge University Press, 484 pp.

Wan, H., X. L. Wang, and V. R. Swail, 2007: A quality assurance system for Canadian hourly pressure data. *J. Appl. Meteor. Climatol.*, in press.

Wang, X. L., 2003: Comments on "Detection of undocumented changepoints: A revision of the two-phase regression model." *J. Climate*, **16**, 3383–3385.

Wang, X. L., 2006: Climatology and trends in some adverse and fair weather conditions in Canada, 1953–2004. *J. Geophys. Res.*, **111**, D09105, doi:10.1029/2005JD006155.

Wang, X. L., 2007: A recursive testing algorithm for detecting and adjusting for multiple artificial changepoints in a time series. *Report of the Fifth Seminar for Homogenization and Quality Control in Climatological Databases*, Budapest, Hungary, World Climate Data and Monitoring Programme, WMO, in press.

Wang, X. L., and Y. Feng, cited 2004: RHTest user manual. [Available online at http://cccma.seos.uvic.ca/ETCCDMI/RHTest/RHTestUserManual.doc.]

Table 1. Empirical critical values of the SNHT_{max} (i.e., the *T*_{max} in Alexandersson 1986) and PT_{max} statistics, obtained through 10 million simulations of each of the two statistics for each *N* value.

Table 2. Hit rates of PMT and SNHT (PMT/SNHT, in counts per 1000) for time series of length *N* = 20. The numbers in parentheses are the ratios of the PMT rate to the SNHT rate. The "mean" row is the average of the column (i.e., over changepoint positions *k* ∈ 2, 3, . . . , *N* − 2).

Table 3. As in Table 2 but for the position rates of PMT and SNHT (PMT/SNHT, in counts per 1000) for time series of length *N* = 20.

Table 4. As in Table 2 but for the significance rates of PMT and SNHT (PMT/SNHT, in counts per 1000) for time series of length *N* = 20.

Table 5. As in Table 2 but for the mean hit rates of PMT and SNHT for the indicated different series lengths *N* (i.e., a summary of the results similar to the last row of Table 2 for different *N* values).

Table 6. As in Table 3 but for the mean position rates of PMT and SNHT for the indicated different series lengths *N* (i.e., a summary of the results similar to the last row of Table 3 for different *N* values).

Table 7. As in Table 4 but for the mean significance rates of PMT and SNHT for the indicated different series lengths *N* (i.e., a summary of the results similar to the last row of Table 4 for different *N* values).

Table 8. As in Table 6 but for the mean position rates of PMT and SNHT (for time series of *N* = 100) as a function of the width of the interval [*K* − *δ*, *K* + *δ*] used to define correct identification of changepoint position.

Table 9. Significant (at the 5% level) undocumented changepoints detected by applying SNHT and PMT to annual and monthly mean station pressure data recorded at Burgeo during the 28-yr period from January 1967 to December 1994, using as reference series the station pressure data series recorded at Yarmouth Airport for the same period (which is found to be homogeneous).