## 1. Introduction

In recent decades, climate change has been a subject of ongoing debate, partly because of the inevitable discontinuities in long-term climate data records, which hamper the reliability of historical climate trend assessment, climate change detection, and attribution. The effects of artificial shifts (nonclimatic sudden changes) on the results of climate analysis, especially of historical climate trend assessment, have been illustrated in many studies, for example, Wang (2006), Hanesiak and Wang (2005), Vincent et al. (2002), and Easterling and Peterson (1995), among many others. It is now common knowledge that a realistic and reliable assessment of historical climate trends and variability is hardly possible without long-term, homogeneous time series of climate data. Thus, increasing effort has been devoted to detecting and adjusting for artificial shifts in climate data series, which typically arise from changes in instrumentation, in observing location and/or environment, and in observing practices or procedures. Some of these changes are documented in the related metadata database, while others are not (owing to inaccurate and/or incomplete metadata); hence there exist documented and undocumented mean shifts. Documented shifts are much easier to assess: with the time of shift recorded in metadata (so that it need not be identified statistically), they can be tested using regular tests of means or variances (e.g., Student's *t* test or the *F* test). Undocumented shifts (i.e., those that have no metadata support), however, can only be detected and assessed with an appropriate statistical test.

Several statistical methods have been proposed for detecting undocumented shifts. Also, there have been several studies that review and compare different methods for testing artificial shifts (e.g., Easterling and Peterson 1995; Peterson et al. 1998; DeGaetano 2006). Most comprehensively, Reeves et al. (2007) review and compare eight of them, including (but not limited to) the standard normal homogeneity test (SNHT) for mean shifts in time series of zero trend (Alexandersson 1986), the common-trend two-phase regression model–based maximal *F* test (TPR3) for mean shifts in time series with a constant linear trend throughout the series (Wang 2003), and the two-phase regression model–based maximal *F* test for a sudden change that occurs simultaneously in both the mean and the linear trend of the series (TPR4; Lund and Reeves 2002). They show that TPR3 seems optimal for most climate data time series, while SNHT is probably best when trend and periodic effects can be diminished by using homogeneous reference series.

However, as noted in Wang et al. (2007, hereafter WW07), the effect of unequal sample sizes results in uneven distributions of the false alarm rate (FAR) and detection power of the SNHT-type or maximal *F* tests. (Why the problem of unequal sample sizes arises in this type of test is explained in section 2 below.) WW07 show that the SNHT-type tests have a U-shaped distribution of FAR over the admissible changepoint positions: a shift is much easier to detect if it occurs near the ends of the series than near the middle, because near the ends the test effectively applies a lower-than-specified level of confidence (so that not all shifts thus identified are actually significant at the nominal level). In other words, for a shift of a certain magnitude, the power of detection of the SNHT-type tests depends on the location of the shift in the time series (near the ends or the middle), which is undesirable: one wishes to detect a shift of detectable magnitude at the nominal level of confidence/significance regardless of its location. To diminish such undesirable effects, WW07 propose a penalized maximal *t* test (PMT) for detecting undocumented mean shifts in time series of zero trend, using an empirically constructed penalty function. PMT is shown to have a significantly higher power of detection than the SNHT-type tests; its FAR is close to the nominal level for all admissible changepoint positions.

Similarly, this study aims to improve TPR3, the common-trend two-phase regression model–based maximal *F* test for undocumented mean shifts that are not accompanied by any sudden change in the linear trend of the time series (the most commonly encountered type of artificial shift in climate data series; Wang 2003). For this purpose, a penalty function is constructed empirically and imposed on the test statistic of TPR3 to diminish the effect of unequal sample sizes on the FAR and detection power of the test. The focus here is on time series with independent and identically distributed (IID) Gaussian errors that have at most one changepoint (AMOC). However, it is common practice to implement statistical tests developed for the AMOC case within an appropriate stepwise testing algorithm to detect multiple changepoints in a climate data time series, as in Wang (2008), Menne and Williams (2005), and Wang and Feng (2007). Readers are referred to Caussinus and Mestre (2004) and Davis et al. (2006) for statistical methods that directly address the detection of multiple changepoints in different settings, and to Wang and Feng (2007) and Wang (2008) for algorithms that test undocumented and documented artificial mean shifts in tandem.

This article proceeds as follows. The penalized maximal *F* test (PMFT) and how it is constructed are detailed in section 2 below. Then, the performance of PMFT is compared with that of TPR3 using Monte Carlo simulations in section 3, and using three climate data series from Canadian stations in section 4. Section 5 completes this study with some concluding remarks.

## 2. Penalized maximal *F* test

Let ɛ*_{t}* denote an IID Gaussian variable of zero mean and unknown variance *σ*^{2}. As proposed by Wang (2003), to test whether or not there exists a mean shift at time *t* = *k* in a time series {*X _{t}*} with linear trend *β* is to test the null hypothesis

*H*_{0}: *X _{t}* = *μ* + *βt* + ɛ*_{t}*, *t* = 1, 2, . . . , *N*,

against the alternative

*H _{a}*: *X _{t}* = *μ*_{1} + *βt* + ɛ*_{t}* for *t* = 1, 2, . . . , *k*; *X _{t}* = *μ*_{2} + *βt* + ɛ*_{t}* for *t* = *k* + 1, . . . , *N*,

where *μ*_{1} ≠ *μ*_{2}. When *H _{a}* is true, the point/time *t* = *k* is called a changepoint, and Δ = |*μ*_{1} − *μ*_{2}| is called the magnitude of the mean shift (or, alternatively, the step size). As shown in Wang (2003), the most probable point to be the changepoint is the one associated with

*F*_{max} = max_{1≤*k*≤*N*−1} *F _{c}*(*k*), where *F _{c}*(*k*) = (SSE_{0} − SSE_{A})/[SSE_{A}/(*N* − 3)],

with SSE_{A} = Σ_{*t*=1}^{*k*} (*X _{t}* − *μ̂*_{1} − *β̂t*)^{2} + Σ_{*t*=*k*+1}^{*N*} (*X _{t}* − *μ̂*_{2} − *β̂t*)^{2} and SSE_{0} = Σ_{*t*=1}^{*N*} (*X _{t}* − *μ̂*_{0} − *β̂*_{0}*t*)^{2}, where *μ̂*_{0} and *β̂*_{0} are estimated under the constraint *μ*_{1} = *μ*_{2} = *μ* (i.e., *X _{t}* = *μ*_{0} + *β*_{0}*t* + ɛ*_{t}* for *t* = 1, 2, . . . , *N*). Note that under the null hypothesis of no changepoint and assuming IID Gaussian errors ɛ*_{t}*, the *F _{c}*(*k*) statistic above has an *F* distribution with one numerator degree of freedom and (*N* − 3) denominator degrees of freedom (Wang 2003). Thus, the test using the *F*_{max} statistic above is called the maximal *F* test (also referred to as TPR3 hereafter). Such hypothesis testing works with a preset or nominal level of significance *α* (or confidence *p* = 1 − *α*). A mean shift or changepoint is declared if the *F*_{max} statistic calculated for the time series being tested is greater than the critical value corresponding to the nominal level of significance; otherwise, the time series is declared homogeneous at the nominal level of significance/confidence (Wang 2003). The nominal significance level *α* represents the rate at which the test mistakenly declares a changepoint for a homogeneous time series, which is also called the FAR or type-I error rate; that is, *α* = FAR. The most common choice is *α* = 0.05 (i.e., *p* = 0.95), which is also used in this study.
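As a minimal sketch of how the statistic can be computed (assuming the SSE-ratio form above; function and variable names are hypothetical), both the null and alternative fits reduce to ordinary least squares:

```python
import numpy as np

def f_c(x, k):
    """TPR3 statistic at candidate changepoint k.

    Null model:        X_t = mu0 + beta0*t + eps_t.
    Alternative model: two intercepts (mu1 for t <= k, mu2 for t > k)
                       with one common slope beta.
    F_c(k) = (SSE0 - SSEA) / (SSEA / (N - 3)).
    """
    x = np.asarray(x, float)
    N = len(x)
    t = np.arange(1, N + 1, dtype=float)
    # Null fit: single intercept and slope.
    A0 = np.column_stack([np.ones(N), t])
    r0 = x - A0 @ np.linalg.lstsq(A0, x, rcond=None)[0]
    sse0 = r0 @ r0
    # Alternative fit: two intercepts, one common slope.
    A1 = np.column_stack([(t <= k).astype(float), (t > k).astype(float), t])
    r1 = x - A1 @ np.linalg.lstsq(A1, x, rcond=None)[0]
    ssea = r1 @ r1
    return (sse0 - ssea) / (ssea / (N - 3))

def f_max(x):
    """Maximal F statistic and its argmax over k = 1, ..., N-1."""
    N = len(x)
    vals = [f_c(x, k) for k in range(1, N)]
    k_hat = int(np.argmax(vals)) + 1
    return vals[k_hat - 1], k_hat
```

For a series with a clear shift, `f_max` returns a large statistic with the argmax near the true changepoint position.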

When *H _{a}* is true, the data series can be viewed as two samples: one of size *N*_{1} = *k* (for *t* = 1, 2, . . . , *k*), which has mean response (*μ*_{1} + *βt*); another of size *N*_{2} = (*N* − *k*) (for *t* = *k* + 1, *k* + 2, . . . , *N*), which has mean response (*μ*_{2} + *βt*). The sizes of the two samples are generally not equal; that is, *N*_{1} ≠ *N*_{2} except when *k* = *N*/2 (i.e., when the point being tested is the midpoint of the data series). Unequal sample sizes can have a huge impact on the results of tests for undocumented mean shifts. For cases in which *β* = 0 (i.e., time series of zero trend), the effect of unequal sample sizes makes the FAR of the SNHT-type tests much higher than the nominal level for points near the ends of the series and much lower for the middle points; the FAR as a function of the location *k* of the point being tested, FAR(*k*), is U-shaped due to this effect (WW07). Consequently, the resulting level of confidence on identified changepoints that are near the ends (middle) of the series is much lower (higher) than the nominal level, which is not desirable. A similar effect of unequal sample sizes also exists in the above maximal *F* test (and in other maximal *F* tests, such as TPR4).

To diminish this effect, a penalized maximal *F* statistic is proposed here, namely,

PF_{max} = max_{1≤*k*≤*N*−1} [*P*(*k*) *F _{c}*(*k*)],

where *P*(*k*) is a penalty factor that will be constructed empirically, as described later in this section, and *F _{c}*(*k*) is as defined above.
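In code, the penalized search differs from the unpenalized one only by a position-dependent weight applied before maximizing. A sketch (the multiplicative form and the names `f_c` and `penalty` are assumptions here; the actual *P*(*k*) is fitted later in this section):

```python
import numpy as np

def pf_max(x, f_c, penalty):
    """Penalized maximal F: PF_max = max over k of penalty(k, N) * F_c(k).

    f_c(x, k)     -- changepoint statistic at candidate position k
    penalty(k, N) -- penalty factor standing in for the fitted P(k)
    """
    N = len(x)
    scores = [penalty(k, N) * f_c(x, k) for k in range(1, N)]
    k_hat = int(np.argmax(scores)) + 1   # argmax over k = 1, ..., N-1
    return scores[k_hat - 1], k_hat
```

A penalty below 1 near the series ends downweights those positions in the search, which is exactly how the W-shaped false alarm rate is evened out.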

To estimate the FAR*_{α}*(*k*) values of TPR3, *M _{N}* = (*N* − 1) × 100 000 homogeneous IID Gaussian time series of length *N*, denoted as *Z _{j}*(*t*) (*j* = 1, 2, . . . , *M _{N}*; *t* = 1, 2, . . . , *N*), are simulated, and a linear trend is then added to these series; that is, *X _{j}*(*t*) = *βt* + *Z _{j}*(*t*) with *β* = 0.01. Then, for each series *X _{j}*(*t*), the statistic *F _{c}*(*k*) is calculated for each *k* ∈ {1, 2, . . . , *N* − 1}, and the maximal value *F _{j}*(*k _{j}*) = max_{1≤*k*≤*N*−1} *F _{c}*(*k*) and its corresponding position *k _{j}* are identified and recorded. The (1 − *α*)th percentile of the *F*_{max} statistic, *F*_{max}(*α*), is then estimated from the *F _{j}*(*k _{j}*) values (*j* = 1, 2, . . . , *M _{N}*). Further, let *M _{α}*(*k*) denote the count of cases (out of the *M _{N}* cases) in which point *k* is associated with *F _{j}*(*k _{j}*) > *F*_{max}(*α*); that is, point *k* is mistakenly declared a changepoint at the *α* significance level (here *α* = 0.05). Then, the false alarm rate for point *k* is estimated as

FAR*_{α}*(*k*) = *M _{α}*(*k*)/*M*,

where *M* = *M _{N}*/(*N* − 1) = 100 000; *p _{e}*(*k*) = [1 − *α _{e}*(*k*)] = [1 − FAR*_{α}*(*k*)] is called the effective level of confidence of the test, and *α _{e}*(*k*) = FAR*_{α}*(*k*) the effective level of significance (*k* = 1, 2, . . . , *N* − 1). The above calculations are repeated for each of 20 selected values of *N* (ranging from 10 to 900); the resulting FAR*_{α}*(*k*) curves are shown in Fig. 1. Note that *M* = 10 000 was used for the curve of *N* = 900, while *M* = 100 000 was used for each of the other 19 values of *N* (i.e., 8 990 000 simulations were done for *N* = 900, instead of 89 900 000). This is why the curve for *N* = 900 is not as smooth as the others (it is also the reason the crosses in Fig. 2 scatter more widely for *N* = 900).
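The simulation design above can be sketched as follows, scaled down for illustration: `m` replicates per admissible position rather than 100 000, and a simple loop-based TPR3 statistic reconstructed from its SSE-ratio definition (all names are hypothetical):

```python
import numpy as np

def tpr3_fmax(x):
    """F_max of the common-trend two-phase regression and its argmax.
    F_c(k) = (SSE0 - SSEA) / (SSEA / (N - 3))."""
    N = len(x)
    t = np.arange(1.0, N + 1)
    A0 = np.column_stack([np.ones(N), t])
    sse0 = np.sum((x - A0 @ np.linalg.lstsq(A0, x, rcond=None)[0]) ** 2)
    best_f, best_k = -np.inf, 1
    for k in range(1, N):
        A = np.column_stack([(t <= k) * 1.0, (t > k) * 1.0, t])
        ssea = np.sum((x - A @ np.linalg.lstsq(A, x, rcond=None)[0]) ** 2)
        f = (sse0 - ssea) / (ssea / (N - 3))
        if f > best_f:
            best_f, best_k = f, k
    return best_f, best_k

def estimate_far(N, m=100, alpha=0.05, beta=0.01, seed=1):
    """Monte Carlo FAR(k): simulate (N-1)*m homogeneous trended series,
    record (F_max, k_hat), threshold at the empirical (1-alpha) quantile,
    and normalize per-position counts by m so that a position-unbiased test
    would give FAR(k) = alpha for every k."""
    rng = np.random.default_rng(seed)
    M = (N - 1) * m                              # total number of series, M_N
    t = np.arange(1.0, N + 1)
    fmax = np.empty(M)
    khat = np.empty(M, dtype=int)
    for j in range(M):
        x = beta * t + rng.standard_normal(N)    # homogeneous: no shift
        fmax[j], khat[j] = tpr3_fmax(x)
    crit = np.quantile(fmax, 1.0 - alpha)        # empirical F_max(alpha)
    counts = np.array([np.sum((khat == k) & (fmax > crit)) for k in range(1, N)])
    return counts / m                            # FAR_alpha(k), k = 1..N-1
```

By construction, the estimated FAR(*k*) values average to the nominal level across positions; their W-shaped variation over *k* is the effect the penalty is designed to remove.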

Clearly, the FAR*_{α}*(*k*) functions are W-shaped, with the highest values near the ends of the series and the lowest values occurring approximately at the point *W* = 11*N*/50 [or *W* = 21*N*/100 for *N* > 500] from either end of the series, while a secondary peak occurs at the midpoint of the series (see Fig. 1). As the series length increases, the false alarm rate for the midpoint of the series converges to the nominal level (0.05); it approximately equals the nominal level for series of length *N* ≥ 300, as shown in Fig. 1. These W-shaped curves indicate that the chance for points near the ends of a homogeneous series to be mistakenly declared a changepoint is much larger than for those around point *W* from either end of the series; it is also slightly larger for points around the middle of the series when the series length *N* < 300 (see Fig. 1). Consequently, the effective level of confidence on the results of the above maximal *F* test is much lower than the nominal level (*p* = 0.95) if the detected changepoints are near the ends of the series (i.e., *p _{e}* < 0.95 or *α _{e}* > 0.05), but is much higher if they are near point *W* from either end of the series (i.e., *p _{e}* > 0.95 or *α _{e}* < 0.05). In other words, the above maximal *F* test (i.e., TPR3) would detect a changepoint of a certain magnitude with a lower-than-specified level of confidence, and hence more easily, when it occurs near the ends of the series than around point *W* from either end; it would mistakenly declare many more changepoints near the ends of a homogeneous series than around point *W*. As mentioned before, this problem arises when the two samples (before and after point *k*) are substantially unequal in size (when one sample is very small compared to the other). As in the cases shown in WW07, the effect of unequal sample sizes on the statistic *F _{c}*(*k*) itself is not very obvious, but it is enough to make a significant difference in the result of searching for the maximum value of *F _{c}*(*k*) across *k* ∈ {1, 2, . . . , *N* − 1} (it gets "exaggerated" by this inevitable "maximizing" process).

However, note that the effect of unequal sample sizes on the results of the above TPR3 (shown in Fig. 1) is notably smaller than that on the results of the SNHT-type tests (shown in Fig. 1 of WW07). For example, for points near the ends of a series of length *N* = 40, the FAR values of the SNHT-type tests can reach 0.105 (about 2 times the nominal level 0.05; see WW07), while the FAR values of the TPR3 can only reach about 0.071 (only about 1.4 times the specified level). In general, for a fixed value of *N*, the height of the W-shaped curve is much smaller than that of the U-shaped curve. That is to say, TPR3 is much less severely affected by the inequality of sample sizes than are the SNHT-type tests.

To construct a maximal *F* test with such highly desirable features, a penalty factor is empirically constructed in this study using the ratios

*R _{k}* = *F*_{max}[*α _{e}*(*k*)]/*F*_{max}(0.05),

where *F*_{max}[*α _{e}*(*k*)] and *F*_{max}(0.05) are the critical values of the *F*_{max} in (3) corresponding to the *α _{e}*(*k*) and 5% levels of significance, respectively; they are also estimated through Monte Carlo simulations. The estimated *R _{k}* values are shown in Fig. 2 (the crosses, squares, circles, or stars). Through these simulations, the estimated effect of unequal sample sizes on the FAR*_{α}*(*k*) values (shown in Fig. 1) is converted into its reverse effect on the *F*_{max} statistic; thus, a penalty factor on *F _{c}*(*k*) that would even out the W shape of the FAR*_{α}*(*k*) function should fit the ratios *R _{k}* well.
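Given a simulated null sample of *F*_{max} values and the estimated effective significance levels, the ratios can be read directly off the empirical distribution. A sketch (function and variable names are hypothetical):

```python
import numpy as np

def penalty_ratios(fmax_null, alpha_e, alpha=0.05):
    """R_k = F_max[alpha_e(k)] / F_max(alpha).

    fmax_null -- simulated null sample of the F_max statistic
    alpha_e   -- effective significance levels alpha_e(k) = FAR(k)
    Both critical values are empirical quantiles of the same null sample.
    """
    fmax_null = np.sort(np.asarray(fmax_null, float))
    crit = lambda a: np.quantile(fmax_null, 1.0 - a)
    return np.array([crit(a) / crit(alpha) for a in alpha_e])
```

Where *α _{e}*(*k*) exceeds the nominal level (near the ends), the critical value is lower, so *R _{k}* < 1; a penalty fitting these ratios therefore downweights the statistic exactly where false alarms are too frequent.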

An initial penalty function *P _{o}*(*k*) is then obtained by least squares fits to the *R _{k}* values for all 19 selected values of *N* (the case of *N* = 10 is excluded hereafter because the effect is negligible for *N* < 15). Through trial-and-error Monte Carlo simulation experiments, a set of functions of *k* and/or *N* was found useful for obtaining these least squares fits; the resulting function *P _{o}*(*k*) fits the *R _{k}* values reasonably well for all 19 selected values of *N*.

The thin dashed curves in Fig. 2 show the above penalty function *P _{o}*(*k*) and its fit to the *R _{k}* values for the 19 selected values of *N* (in each case, most of the thin dashed curve overlaps a thick curve, which will be explained below).

Although these fits are reasonably good, it is noticed in this study that, for all time series of *N* ≥ 40, the above penalty *P _{o}*(*k*) tends to overpenalize the test statistic for a few extreme end points but to underpenalize the remaining roughly *L* points at each end of the series, where *L* = integer[1 + (317*N*^{3/4} − 2867)/1000] (taking the integer part; *L* ranges from 3 to 50 for 40 ≤ *N* ≤ 900; these values are also results of the trial-and-error experiments of the current study). In other words, the fit to the *L* points at each end of the series can be improved.
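The formula for *L* is easy to check numerically; a one-liner, assuming "integer" means the integer part:

```python
def n_end_points(N):
    """L = integer[1 + (317*N**(3/4) - 2867)/1000]: the number of points at
    each end of the series where the fit of the initial penalty needs
    improvement."""
    return int(1 + (317 * N ** 0.75 - 2867) / 1000)
```

Evaluating at the two ends of the stated range reproduces the quoted bounds: `n_end_points(40)` gives 3 and `n_end_points(900)` gives 50.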

Therefore, the penalty *P _{o}*(*k*) is modified empirically to improve its fit to the *R _{k}* values at the *L* points at each end of the series, for time series of length *N* ∈ [40, 100] and for all time series of length *N* > 100. Figure 2 also shows this modified penalty function *P*(*k*) (thick curves), in comparison with the corresponding unmodified penalty *P _{o}*(*k*) (thin dashed curves), and its fit to the *R _{k}* values for each of the 19 selected values of *N*. The modified penalty function *P*(*k*) is then used in the test statistic in (7); thereby, PMFT is constructed.

Figure 3 shows the FAR*_{α}*(*k*) curves of PMFT in comparison with the corresponding curves of TPR3 for 12 selected values of *N* (the curves look similar for other values of *N* and are thus not shown). Clearly, PMFT has a much more evenly distributed false alarm rate across all points in the series than does TPR3, although it tends to slightly overpenalize the test statistic for the end points of very long time series (*N* > 500; see Fig. 3 for *N* = 720). The penalty does even out, to a great extent, the W shape of the FAR*_{α}*(*k*) curves of TPR3. This is because the new test statistic takes the relative position of each point being tested into account to reduce the distortion of the test statistic due to unequal sample sizes; observations are treated more equally during the search for the most probable changepoint position/time. Consequently, PMFT has highly desirable features of great practical importance, that is, a basically uniform false alarm rate across a homogeneous series and, hence, a basically uniform level of confidence on the identified changepoints regardless of their position in the time series.

As in WW07, the approach here is purely empirical [i.e., both *P _{o}*(*k*) and *P*(*k*) are obtained empirically]; a theoretically based penalty term is yet to emerge. This empirical exercise satisfactorily achieves the desired results, and it clearly reveals the characteristics of the effect of unequal sample sizes, which may help in developing a theoretically based penalty term.

Although it is difficult to derive the theoretical distribution of the test statistic PF_{max}, its empirical critical values for some selected values of *N* can be and are obtained by simulating 10 million PF_{max} values under the null hypothesis of no changepoint; these values are presented in Table 1 (for other *N* values not listed here, a linear interpolation between its two neighboring values would be of sufficiently good accuracy). The empirical critical values of *F*_{max} in TPR3 are also simulated similarly and listed in Table 1, which will be used for the comparison in section 3.
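For series lengths between tabulated values, the suggested linear interpolation is a one-liner with `np.interp`. The table entries below are placeholders only, not the published critical values from Table 1:

```python
import numpy as np

# Hypothetical (N, critical value) pairs standing in for Table 1 entries;
# the actual PF_max critical values must be taken from the published table.
TABLE_N = np.array([100.0, 150.0, 200.0])
TABLE_CRIT = np.array([10.0, 10.5, 10.9])   # placeholder numbers

def critical_value(N):
    """Linearly interpolate the critical value between the two tabulated
    series lengths neighboring N, as the text suggests."""
    return float(np.interp(N, TABLE_N, TABLE_CRIT))
```

At a tabulated *N* the function returns the table entry itself; in between, it returns the straight-line value between the two neighboring entries.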

It will be illustrated through the comparison in the next section that, although the effect of unequal sample sizes may not be a big problem for detecting a large mean shift in a very long time series, it cannot be ignored when the magnitude of mean shift is small or medium relative to the noise variance.

## 3. Comparison of detection power of PMFT with TPR3

Similar to the comparison of PMT with SNHT in WW07, the detection power of PMFT is compared with that of TPR3 through Monte Carlo simulations in this section. The following rates are also used here to measure the power of detection of each method:

1) Position rate (PR), defined as the rate of detecting a changepoint position “correctly” (i.e., within the interval [*K* − 3, *K* + 3] of the true position *K*), regardless of its significance.

2) Significance rate (SR), defined as the rate of declaring a statistically significant (at the 5% level) changepoint, regardless of whether or not its position is identified "correctly." (This is usually called detection power in statistics.)

3) Hit rate (HR), defined as the rate of detecting a changepoint with both statistical significance (at the 5% level) and “correct” position (as defined for the position rate). For any method, the PR or SR is only a rough measure of its detection ability, while the hit rate is a strict measure of its “accurate” detection power and is of the most practical importance. In practice, identifying a changepoint at a wrong position could be as bad as failing to detect it.

The comparison is carried out as follows. First, for each of the 18 selected values of series length *N* (ranging from 15 to 720), 10 000 homogeneous time series are generated from an IID Gaussian distribution with zero mean and variance *σ*^{2} (without loss of generality, *σ* is set to 1 here), and a linear trend component (*β* = 0.01) is added to each of these time series. Then a mean shift Δ is inserted at point *k* (between *k* and *k* + 1) in each of these series, for each *k* ∈ {2, 3, . . . , *N* − 2} and each relative shift size *δ* = Δ/*σ* ∈ {0.25, 0.5, 1, 1.5, 2}. Then, PMFT and TPR3 are each applied to the 10 000 time series for each combination of *k* and *δ* to detect the mean shift, using the same nominal level of significance (5%). The three detection rates are calculated for each combination of *N*, *k*, and *δ* values and are analyzed below.
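Given per-series detection results, the three rates reduce to simple Boolean summaries. A sketch (names hypothetical):

```python
import numpy as np

def detection_rates(k_hat, stat, K, crit, tol=3):
    """Position rate, significance rate, and hit rate from per-series results.

    k_hat -- detected changepoint position for each simulated series
    stat  -- test statistic for each series
    K     -- true changepoint position
    crit  -- critical value at the nominal (5%) level
    tol   -- 'correct' means within [K - tol, K + tol]
    """
    k_hat = np.asarray(k_hat)
    stat = np.asarray(stat)
    correct = np.abs(k_hat - K) <= tol        # position identified "correctly"
    signif = stat > crit                      # significant at the nominal level
    return correct.mean(), signif.mean(), (correct & signif).mean()
```

The hit rate is the mean of the conjunction, so it is never larger than either the position rate or the significance rate.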

As an example, Fig. 4 shows the comparison of the three rates as a function of changepoint position *k* ∈ {2, 3, . . . , *N* − 2}, where *N* = 100, with three different sizes of shift (*δ* = 0.25, 1.0, 2.0) at each position. As would be expected, PMFT has higher power of detection (in terms of all three measures) than does TPR3 for changepoints that do not occur too close to the ends of the series, while it has lower power in detecting changepoints that occur within the first or last *N*/10 points of the series. However, the seemingly higher power of TPR3 in detecting changepoints near these end points comes from the fact that it actually applies a lower-than-specified level of confidence there, so that not all the changepoints it identifies are actually significant at the nominal level; PMFT, by contrast, identifies every changepoint with approximately the specified level of confidence, regardless of the changepoint location. That is, PMFT obtains a basically evenly distributed effective level of significance/confidence (and hence power) of the test across all possible changepoint positions *k*. For other selected values of *N*, the situation is similar and hence not shown.

Most importantly, however, when averaged over the possible changepoint positions *k* ∈ {2, 3, . . . , *N* − 2}, PMFT has significantly higher power of detection than does TPR3. Tables 2–4 list, respectively, the mean hit rates, mean position rates, and mean significance rates (i.e., those averaged over *k* = 2, 3, . . . , *N* − 2) for the 18 selected values of series length *N*. Figure 5 shows these mean rates as a function of series length *N* and relative shift size *δ*.

Clearly, in terms of the mean hit rate, the improvement of PMFT over TPR3 is seen for all five shift sizes regardless of the series length. For small shifts (*δ* ≤ 1), the improvement increases steadily as the series length increases and reaches its maximum of around 10% for series of length *N* = 400 ∼ 500 (Fig. 5a). For medium–large shifts (*δ* > 1), the improvement is maximal for series of length 100 ≤ *N* ≤ 300. In general, the smaller the *δ*, the greater the improvement of PMFT over TPR3; and for time series of length *N* < 100, the longer the series, the greater the improvement of PMFT (regardless of shift size; Fig. 5a).

In terms of the mean position rate, PMFT is also superior to TPR3. As shown in Fig. 5b, PMFT has up to about 5% higher power in identifying “correctly” the position of small shifts in long time series, while PMFT and TPR3 perform very similarly in detecting the “correct” position of large shifts in very long series (*N* > 400). For series of length *N* < 100, the improvement of PMFT over TPR3 also increases as the series length increases (Fig. 5b).

Note that each of the 10 000 time series tested in each setting here contains a mean shift (i.e., *H _{a}* is always true here). Thus, the type-II error rate can be estimated roughly as (1 − SR) [or more accurately as (1 − HR), which, however, contains the same information as HR], and the higher the significance rate SR, the lower the type-II error rate (and hence the better the test). In this regard, as shown in Fig. 5c, PMFT has a 2%–7% improvement over TPR3 in detecting medium–large shifts in medium–long time series, or in detecting small shifts (e.g., *δ* = 0.5) in long time series (*N* ≥ 300); PMFT and TPR3 have comparable significance (or type-II error) rates in detecting small shifts (*δ* < 1), especially in time series of short–medium length.

Similarly, as noted in WW07, the hit rate here is also much more sensitive to change in the length of the time series being tested, or in shift size, than is the position or significance rate (see Fig. 5), because a hit is counted only when the test identifies the changepoint with both statistical significance and “correct” position.

## 4. Examples of application to climate data series

In general, PMFT can be applied to any time series with IID Gaussian errors and a common nonzero trend throughout the series (note that PMT of WW07 would be better for time series with zero trend). Thus, it can be applied to any climate element as long as the data series is properly pretreated so that the above assumptions are not significantly violated. For example, seasonality in the mean is often present in climate variables; it can be diminished by roughly deseasonalizing the series (e.g., by subtracting the 12 monthly sample means, one for each calendar month) or by using a good reference series that has the same seasonality. A climatic shift in the trend component can also be accounted for by using a homogeneous reference series from the same climate region/regime. When such a reference series is available, PMFT (or PMT) can be applied to the base-minus-reference series. Generally, the four-parameter two-phase regression model (TPR4; Lund and Reeves 2002) should be used to detect trend changes. However, as mentioned before, TPR4 also suffers from the effect of unequal sample sizes. Thus, a penalized version of TPR4 is currently under development and will be reported in a separate paper.
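A rough deseasonalization of the kind described (subtracting the 12 monthly sample means, one for each calendar month) can be sketched as:

```python
import numpy as np

def deseasonalize(x, months):
    """Subtract each calendar month's sample mean from a monthly series.

    x      -- monthly values
    months -- calendar month (1-12) of each value
    This is the rough deseasonalization described in the text; it removes
    the seasonal cycle in the mean but not serial correlation.
    """
    x = np.asarray(x, float)
    months = np.asarray(months)
    out = x.copy()
    for m in range(1, 13):
        sel = (months == m)
        out[sel] -= x[sel].mean()   # remove this calendar month's mean
    return out
```

Applied to a series whose only structure is a fixed seasonal cycle, the result is identically zero; for real data, it leaves the trend and shift signal (relative to monthly means) in place for testing.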

Serial correlation is another inherent feature of many climate data series, which cannot be diminished by using reference series. Also, long-term climate data series could contain multiple changepoints. Recently, two stepwise algorithms have been developed to empirically account for serial correlation in detecting multiple changepoints using PMT or PMFT (Wang 2008).

In the remainder of this section, examples of applying PMFT and TPR3 to detect undocumented mean shifts in climate data series are presented. The goal is to show which of the two methods performs better in practical use, so the results must be verifiable. That is, the methods are applied to time series that contain a documented mean shift (i.e., the time and cause of the shift are known) to see whether they can detect the shift when it is treated as undocumented. For this purpose, the methods were applied to three data series: two monthly mean station pressure series and a monthly mean daily minimum air temperature series.

The first example series is the time series of monthly mean station pressures recorded at Greenwood Airport (Nova Scotia, Canada) for the 52-yr period from 1953 to 2004. The related metadata indicate that this series contains an artificial mean shift (see the thick solid line in Fig. 6a) caused by using the so-called established elevation (instead of the station elevation) in the calculation of station pressures from barometer readings in the period prior to January 1977. The shift magnitude is estimated to be 0.35 hPa (about 0.12*σ̂*) by the hydrostatic model, using hourly pressure and temperature data and the correct station elevation (see Wan et al. 2007).

It is also noticed that the roughly deseasonalized pressure time series has a negligible lag-1 autocorrelation of 0.045 (here *N* = 624). Thus, PMFT and TPR3 can be, and were, applied to the deseasonalized pressure series (the thin solid curve in Fig. 6). At the 5% level of significance, PMFT identified a changepoint in October 1976 (*k* = 286, PF_{max} = 17.391; see the dashed line in Fig. 6), while TPR3 found the time series homogeneous (it finds *k* = 349 with *F*_{max} = 8.699). In other words, TPR3 fails to detect the artificial shift between December 1976 and January 1977, while PMFT detects it with good accuracy (only two months earlier than the actual time of change). This is an example showing that PMFT outperforms TPR3 when the changepoint occurs near the middle of the series.

Similarly, the deseasonalized series of station pressures recorded at Daniels Harbor (Newfoundland, Canada) for the 50-yr period from 1953 to 2002 (see Fig. 6b) was also estimated to have a negligible lag-1 autocorrelation (of 0.038). Thus, PMFT and TPR3 were also applied to this series, separately, to see whether they can detect the mean shift due to station relocation at the end of July 1955 (as documented in the metadata record). At the 5% level of significance, both tests identified a significant changepoint in August 1955, which is just one month later than the actual time of change (Fig. 6b). This is an example showing that PMFT and TPR3 can perform similarly when the changepoint is located near the end of the series.

The third example series is the series of monthly means of daily minimum air temperature recorded at AMOS (Quebec, Canada) from July 1913 to December 1997 (*N* = 1014). The station inspection reports indicate that 1) the thermometer screen and stand were replaced on 24 October 1958, and 2) the station elevation changed sometime between August 1948 and 24 October 1958 (from 990 to 1002 ft or, equivalently, from 297 to 300.6 m). The deseasonalized temperature series is estimated to have a lag-1 autocorrelation of 0.173 [95% uncertainty range: (0.112, 0.233)], which is highly significant. Taking the effect of this autocorrelation into account in the way proposed in Wang (2008), PMFT identified a changepoint in August 1958 that is significant even without metadata support [*k* = 548, PF_{max} = 26.82 > 15.85 (14.06–17.88); the numbers in parentheses are the lower and upper bounds of the 95th percentile of PF_{max}, corresponding to the uncertainty range of the autocorrelation estimate above]. This changepoint is obviously due to the replacement of the thermometer screen and stand on 24 October 1958. In addition, PMFT identified a changepoint in October 1953 that would be significant if it had metadata support (*k* = 490, *p* = 0.998); thus, the station elevation change most likely happened around October 1953. The fit with these two changepoints is shown in Fig. 7. TPR3 was not applied to this series because there is no version of TPR3 that can account for the highly significant autocorrelation.

To verify whether it is reasonable to assume that the time series being tested have a common trend throughout, the sum of squared residuals of the common-trend model fit was compared with that of the different-trend model fit for each of the three deseasonalized series above (via a regular *F* test). The results show that the different-trend model (which has one or more additional parameters) provides no significant improvement over the common-trend model in terms of fit (*p* = 0.744, 0.062, and 0.275 for the Greenwood, Daniels Harbor, and AMOS series, respectively). This indicates that there is no significant trend change at the time(s) of the mean shift; the common-trend assumption of PMFT or TPR3 is valid for these series.

Also, the QQ plots (not shown) of the residuals of the PMFT model fit to each of the three series reveal that the error term (prewhitened in the AMOS case) in each of these time series approximates a Gaussian distribution very well. Thus, the normality assumption of PMFT or TPR3 is also valid for these series.

## 5. Concluding remarks

In this study, the common-trend two-phase regression model–based maximal *F* test (TPR3) for detecting undocumented mean shifts that are not accompanied by any sudden change in the linear trend of the time series (Wang 2003) is modified, using an empirical penalty function (also constructed in this study), to diminish the effect of unequal sample sizes on the false alarm rate and hence on the power of detection.

For time series of length in the wide range *N* ∈ [15, 720], the new penalized maximal *F* test (PMFT) has been shown to have an essentially evenly distributed false alarm (type-I error) rate over all points in the time series. That is, PMFT performs at the nominal level of significance/confidence regardless of which point in the series is being tested (or where the changepoint occurs). This feature is highly desirable in practice: each data point should be treated equally, with an equal chance of being mistakenly declared a changepoint when the series is homogeneous, and a shift of a given magnitude should have the same probability of being detected no matter where it occurs. Without this feature, as in the case of TPR3, not all changepoints identified by the test are actually significant at the nominal level, while some changepoints that are actually significant might go undetected. This is why TPR3, which has W-shaped FAR(*k*) curves, seemingly has higher power than PMFT in detecting changepoints near the ends of the series (within the first or last *N*/10 points), while it has lower power in detecting changepoints that occur near point *W* = (0.21–0.22)*N* from either end of the series.
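For concreteness, the unpenalized maximal *F* statistic of the common-trend two-phase regression (the TPR3 statistic that PMFT builds on) can be sketched as below. PMFT multiplies each *F*(*k*) by the empirical penalty factor constructed in this study before taking the maximum; that penalty function is not reproduced in this sketch.

```python
import numpy as np

def fmax_common_trend(y):
    """Maximal F statistic of the common-trend two-phase regression.
    Scans all candidate changepoints k and returns (F_max, k_max)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(1, n + 1, dtype=float)
    # null model: a single linear trend over the whole series
    Xn = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(Xn, y, rcond=None)
    sse_null = float(np.sum((y - Xn @ beta) ** 2))
    best_F, best_k = -np.inf, None
    for k in range(2, n - 1):            # candidate changepoint positions
        step = (t > k).astype(float)
        # alternative: common trend plus a mean shift after point k
        X = np.column_stack([np.ones(n), t, step])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ b) ** 2))
        F = (sse_null - sse) / (sse / (n - 3))
        if F > best_F:
            best_F, best_k = F, k
    return best_F, best_k
```

Because the same residual variance estimate appears in every *F*(*k*), the scan is what produces the uneven (W-shaped) false alarm rate of the unpenalized statistic that the penalty is designed to flatten.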

Most importantly, however, when averaged over all possible changepoint positions, PMFT has a higher power of detection than TPR3. The improvement in hit rate can exceed 10% for detecting small shifts (Δ ≤ *σ*). PMFT also has up to about 5% higher power in correctly identifying the position of small shifts in long time series. In terms of the type-II error rate, it shows a 2%–7% improvement over TPR3 in detecting medium to large shifts in medium to long time series, or small shifts (e.g., Δ = 0.5*σ*) in long time series (*N* ≥ 300), while PMFT and TPR3 have comparable type-II error rates in detecting small shifts (Δ < *σ*), especially in time series of short to medium length.

It is also shown that the results of TPR3 are notably less severely affected by the problem of unequal sample sizes than those of the SNHT-type tests. This is because the estimate of the common trend in TPR3 is obtained from the whole series and is therefore not affected by unequal sample sizes; in other words, the whole series is used to estimate one of the model parameters and hence to calculate the TPR3 test statistic. In contrast, the estimates of all parameters in the SNHT-type tests are affected by the problem of unequal sample sizes. This is also the case for the TPR4 approach (Lund and Reeves 2002), which likewise has a U-shaped distribution of false alarm rate (similar to those shown in WW07). An empirically penalized version of TPR4 is also being constructed and will be reported in a separate paper.

Note that PMFT and TPR3, as well as PMT and the SNHT-type tests, all assume that the errors in the time series being tested are IID Gaussian, which is hardly ever true for climate data series (even annual series): most climate data series exhibit autocorrelation and periodicity. As noted in WW07, the use of a good reference series that has the same trend and periodicity as the base series can greatly diminish the periodicity and trend in the series being tested, but it cannot diminish its autocorrelation. Therefore, a test for undocumented changepoints that takes the effect of autocorrelation into account needs to be developed. Recently, Lund et al. (2007) addressed this need, proposing a new method for changepoint detection in periodic and autocorrelated time series; however, the effect of unequal sample sizes has yet to be diminished in this method, which should be the subject of future study. Wang (2008) has developed an algorithm to account for autocorrelation in detecting mean shifts in climate data series using the PMFT.
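One common way to handle lag-1 autocorrelation, sketched below, is to prewhiten the series before testing. This is only an illustration of the general idea, not the specific algorithm of Wang (2008); the scaling by 1/(1 − φ) is one convention for keeping the magnitude of a mean shift approximately unchanged, and it assumes φ is well below 1.

```python
import numpy as np

def prewhiten_ar1(x):
    """Estimate the lag-1 autocorrelation phi and prewhiten:
    w_t = (x_t - phi * x_{t-1}) / (1 - phi), for t = 2..N."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    phi = float(np.sum(xm[1:] * xm[:-1]) / np.sum(xm * xm))
    w = (x[1:] - phi * x[:-1]) / (1.0 - phi)
    return w, phi
```

The prewhitened series `w` is approximately serially independent and one point shorter than the input, so a test assuming IID errors can then be applied to it.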

Finally, a software package RHtestV2 (written in R and FORTRAN languages) for implementing PMFT (or PMT) with a stepwise testing algorithm for detecting multiple mean shifts in a time series has been developed and made freely available (see online at http://cccma.seos.uvic.ca/ETCCDMI/software.shtml).

## Acknowledgments

The author wishes to thank Lucie Vincent and Hui Wan (CRD, Environment Canada) for their helpful comments on an earlier version of this paper. The two anonymous reviewers are also acknowledged.

## REFERENCES

Alexandersson, H., 1986: A homogeneity test applied to precipitation data. *Int. J. Climatol.*, **6**, 661–675.

Caussinus, H., and O. Mestre, 2004: Detection and correction of artificial shifts in climate series. *Appl. Stat.*, **53**, 405–425.

Davis, R. A., T. C. M. Lee, and G. A. Rodriguez-Yam, 2006: Structural breaks estimation for non-stationary time series models. *J. Amer. Stat. Assoc.*, **101**, 223–239.

DeGaetano, A. T., 2006: Attributes of several methods for detecting discontinuities in mean temperature series. *J. Climate*, **19**, 838–853.

Easterling, D. R., and T. C. Peterson, 1995: A new method for detecting undocumented discontinuities in climatological time series. *Int. J. Climatol.*, **15**, 369–377.

Hanesiak, J. M., and X. L. Wang, 2005: Adverse-weather trends in the Canadian Arctic. *J. Climate*, **18**, 3140–3156.

Lund, R., and J. Reeves, 2002: Detection of undocumented changepoints: A revision of the two-phase regression model. *J. Climate*, **15**, 2547–2554.

Lund, R., X. L. Wang, Q. Lu, J. Reeves, C. Gallagher, and Y. Feng, 2007: Changepoint detection in periodic and autocorrelated time series. *J. Climate*, **20**, 5178–5190.

Menne, M. J., and C. N. Williams Jr., 2005: Detection of undocumented changepoints using multiple test statistics and composite reference series. *J. Climate*, **18**, 4271–4286.

Peterson, T. C., and Coauthors, 1998: Homogeneity adjustments of *in situ* atmospheric climate data: A review. *Int. J. Climatol.*, **18**, 1493–1517.

Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Lu, 2007: A review and comparison of changepoint detection techniques for climate data. *J. Appl. Meteor. Climatol.*, **46**, 900–915.

Vincent, L. A., X. Zhang, B. R. Bonsal, and W. D. Hogg, 2002: Homogenization of daily temperatures over Canada. *J. Climate*, **15**, 1322–1334.

Wan, H., X. L. Wang, and V. R. Swail, 2007: A quality assurance system for Canadian hourly pressure data. *J. Appl. Meteor. Climatol.*, **46**, 1804–1817.

Wang, X. L., 2003: Comments on “Detection of undocumented changepoints: A revision of the two-phase regression model.” *J. Climate*, **16**, 3383–3385.

Wang, X. L., 2006: Climatology and trends in some adverse and fair weather conditions in Canada, 1953–2004. *J. Geophys. Res.*, **111**, D09105, doi:10.1029/2005JD006155.

Wang, X. L., 2008: Accounting for autocorrelation in detecting mean-shifts in climate data series using the penalized maximal *t* or *F* test. *J. Appl. Meteor. Climatol.*, in press.

Wang, X. L., and Y. Feng, 2007: RHtestV2 user manual. Climate Research Division, Atmospheric Science and Technology Directorate, Science and Technology Branch, Environment Canada, 19 pp. [Available online at http://ccsma.seos.uvic.ca/ETCCDMI/software.shtml.]

Wang, X. L., Q. H. Wen, and Y. Wu, 2007: Penalized maximal *t* test for detecting undocumented mean change in climate data series. *J. Appl. Meteor. Climatol.*, **46**, 916–931.

Table 1. Empirical critical values of the PF_{max} statistic and of the *F*_{max} statistic of TPR3 in Wang (2003), obtained through 10 million simulations of each of the two statistics for each *N* value.

Table 2. Mean hit rates (averaged over *k* = 2, 3, . . . , *N* − 2) of PMFT and TPR3 for the indicated series lengths *N* and relative shift sizes. The numbers in parentheses are the ratios of the PMFT rate to the TPR3 rate.

Table 3. Same as Table 2, but for the mean position rates of PMFT and TPR3.

Table 4. Same as Table 2, but for the mean significance rates of PMFT and TPR3.