## 1. Introduction

It is well known that temperature series may contain unknown artificial discontinuities (Peterson et al. 1998). Such discontinuities are typically caused by station moves, instrument changes, and/or microclimate changes surrounding a station. If left undetected, these artificial signals can bias attempts to estimate true climate signals (Menne et al. 2009). Consequently, many algorithms have been developed to detect the discontinuities. A list of representative publications includes Alexandersson (1986), Vincent (1998), Lund and Reeves (2002), Caussinus and Mestre (2004), Della-Marta and Wanner (2006), Lund et al. (2007), Reeves et al. (2007), Wang et al. (2007), Wang (2008a,b), Menne and Williams (2009), Hannart and Naveau (2009), Beaulieu et al. (2010), and Lu et al. (2010).

Caussinus and Mestre (2004) and Menne and Williams (2009) adopt a pairwise comparison approach that is argued to have advantages in terms of avoiding the detection of true climate signals and utilizing difference series to increase signal-to-noise ratios (SNR) and improve hit rates (HRs). In Menne and Williams (2009), a semihierarchical splitting algorithm is applied to each difference series to identify all potential discontinuities and a rule-based algorithm is used to automatically assign discontinuities to the corresponding stations. However, for a specific target discontinuity, the estimated locations based on different target–neighbor difference series may not agree with each other. Menne and Williams (2009) solve the location uncertainty issue empirically. Hannart and Naveau (2009) and Beaulieu et al. (2010) address the location uncertainty issue from a Bayesian perspective. Hannart and Naveau (2009) propose a method based on Bayesian decision theory. Their method identifies subsequences containing a unique discontinuity by minimizing average posterior cost functions recursively. Beaulieu et al. (2010) develop a framework based on Bayesian normal homogeneity test (BNHT) and apply BNHT recursively on the series to detect multiple discontinuities. However, such uncertainty can also be addressed from a Bayesian model selection perspective.

Here, we describe a Bayes factor model selection procedure for the automatic detection of temperature series changepoints using pairwise comparisons. In the procedure, the Bayes factor or the evidence of discontinuities at each time step is first computed via a sliding sample window. Then, after the Bayes factors are obtained, we identify potential discontinuities by comparing the Bayes factors with an appropriate threshold and calculate the posterior probabilities of the discontinuities for each time step. Finally, we obtain the estimated locations for the discontinuities by computing the posterior mean for each location. In section 2, we describe the details of the Bayes factor model. In section 3, we discuss how to select model parameters. Some results based on simulations and real observations are discussed in section 4. Also, sensitivity analyses with respect to different model parameters are presented in section 4. The conclusions are in section 5.

## 2. Description of the Bayes factor model

### a. Difference series

Suppose *T*(*t*) is the monthly temperature anomaly^{1} at a station, where *t* is the monthly index. We write

*T*(*t*) = *C*_{T}(*t*) + *J*_{T}(*t*) + ε_{T}(*t*),

where *C*_{T}(*t*) is the climate signal, *J*_{T}(*t*) is the artificial changepoint signal, and ε_{T}(*t*) is the noise. Here, *T*(*t*) has a series of correlated neighbors *N*_{j}(*t*), where *j* = 1, … , *n*. The difference series can be expressed as

Δ*T*_{j}(*t*) = *T*(*t*) − *N*_{j}(*t*) = [*C*_{T}(*t*) − *C*_{j}(*t*)] + [*J*_{T}(*t*) − *J*_{j}(*t*)] + [ε_{T}(*t*) − ε_{j}(*t*)].

Because of the high correlation between *T*(*t*) and *N*_{j}(*t*), ε_{T}(*t*) − ε_{j}(*t*) typically has a smaller variance than does ε_{T}(*t*) or ε_{j}(*t*). Here, *J*_{T}(*t*) − *J*_{j}(*t*) contains the changepoint signals from either *T*(*t*) or its neighbor *N*_{j}(*t*). Because of multidecadal variations and trends, *C*_{T} and *C*_{j} are not stationary in time. Rather, it is assumed that the high spatial correlation inherent in temperature fields means that *C*_{T} and *C*_{j} are approximately equal. As discussed below, a rather narrow moving time window is used to identify local discontinuities; therefore, low-frequency "creeping" inhomogeneities are not likely to be efficiently identified by the Bayes factor approach.

### b. Bayes factors

Suppose we have two competing models, *M*_{0} and *M*_{1}, where *M*_{0} means that there are no changepoints in Δ*T*_{j}(*t*) and *M*_{1} means that there is a changepoint at month *t* in Δ*T*_{j}(*t*). We can compute the posterior probability of each model using Bayes' theorem,

*P*(*M*_{i} | **Y**) = *P*(**Y** | *M*_{i})*P*(*M*_{i}) / [*P*(**Y** | *M*_{0})*P*(*M*_{0}) + *P*(**Y** | *M*_{1})*P*(*M*_{1})],

where *i* = 0, 1 and **Y** is the observation. We obtain the posterior odds by

*P*(*M*_{1} | **Y**) / *P*(*M*_{0} | **Y**) = [*P*(**Y** | *M*_{1}) / *P*(**Y** | *M*_{0})] × [*P*(*M*_{1}) / *P*(*M*_{0})],

where the first factor on the right-hand side is the Bayes factor BF_{10} and the second factor is the prior odds. To compute the marginal likelihood *P*(**Y** | *M*_{i}), we integrate out the parameters,

*P*(**Y** | *M*_{i}) = ∫ *P*(**Y** | *ω*_{i}, *M*_{i}) *ψ*(*ω*_{i} | *M*_{i}) d*ω*_{i},

where *P*(**Y** | *ω*_{i}, *M*_{i}) is the probability density function with parameter *ω*_{i} under *M*_{i} and *ψ*(*ω*_{i} | *M*_{i}) is the prior density for *ω*_{i} under *M*_{i}. After obtaining the Bayes factor, and assuming prior odds equal to 1, we often use the value of 2 log_{e}(BF_{10}) to evaluate the evidence against *M*_{0} (Kass and Raftery 1995). For example, when the value of 2 log_{e}(BF_{10}) is between 2 and 6, there is positive evidence against *M*_{0} (Kass and Raftery 1995). The Bayes factor definition can be extended to cases with more than two models. More details on Bayes factors and Bayesian model selection can be found in Kass and Raftery (1995) and MacKay (2003).

### c. Bayes factors for one difference series

Suppose for a short time window *τ* = {*s*, *s* + 1, … , *s* + *l*}, where *s*, *s* + 1, … , *s* + *l* are consecutive time indexes in a monthly temperature anomaly series, we have a set of hypotheses *H* = {*M*_{t}: *t* = 0, *s*, *s* + 1, … , *s* + *l*}, where *M*_{t} = {there is a discontinuity at month *t* in Δ*T*_{j}(*t*)} and *M*_{0} = {there are no discontinuities}. To limit the potential hypotheses, we assume that there is at most one discontinuity in *τ*. According to Menne et al. (2009), the average distance between two detected discontinuities for the U.S. Historical Climatology Network (USHCN) monthly temperature data version 2 is about 180–240 months. Although the actual frequency of the discontinuities must be higher, we expect that most time windows will contain at most one discontinuity, especially when *l* ≪ 180.

Under *M*_{t}, we write the competing models as

*M*_{0}: Δ*T*_{j}(*i*) ~ N(*μ*, *σ*²) for *i* ∈ *τ*;
*M*_{t}: Δ*T*_{j}(*i*) ~ N(*μ*_{1,t}, *σ*²) for *i* < *t*, and Δ*T*_{j}(*i*) ~ N(*μ*_{2,t}, *σ*²) for *i* ≥ *t*,

where *μ*_{1,t} is the mean before month *t* and *μ*_{2,t} is the mean after month *t*. With the above assumptions, we can compare *M*_{s}, … , *M*_{s+l} with *M*_{0} and produce the Bayes factors BF_{s0}, … , BF_{(s+l)0}. Then, after obtaining BF_{s0}, … , BF_{(s+l)0}, we can compute the posterior probabilities for all models.

To compute the Bayes factors, we use the Bayesian two-sample *t*-test framework proposed by Gönen et al. (2005). Suppose the *x*_{1,j}'s are observations before *t* and the *x*_{2,j}'s are observations after *t*, for *M*_{t} with *t* = *s*, … , *s* + *l*. Under *M*_{t}, we assume the observations are from two normal distributions with means *μ*_{1} and *μ*_{2} and common variance *σ*²; under *M*_{0}, we assume the observations are from one normal distribution. Given the observations **Y** for *τ*, we compute the Bayes factor for *M*_{t} by integrating out all parameters. Gönen et al. (2005) obtain a closed form for the Bayes factor,

BF_{t0} = ϒ_{ν}(*ζ* | *n*_{p}^{1/2}*μ*_{0}, 1 + *n*_{p}*σ*_{0}²) / ϒ_{ν}(*ζ* | 0, 1),

where *ζ* is the usual two-sample *t* statistic; *μ*_{0} and *σ*_{0}² are the prior mean and prior variance of (*μ*_{1} − *μ*_{2})/*σ*; *n*_{p} = (1/*n*_{1} + 1/*n*_{2})^{−1} is the pooled sample size; *ν* = *n*_{1} + *n*_{2} − 2; and ϒ_{ν}(· | *ξ*, *κ*) is the noncentral *t* distribution with location parameter *ξ*, scale parameter *κ*^{1/2}, and *ν* degrees of freedom. It is possible to use different priors, such as a Cauchy prior (Rouder et al. 2009). However, we have found via simulations that, while we can achieve similar results with normal and Cauchy priors, numerical integration is required with the latter, which increases computational cost. So we choose normal priors in our calculations.
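To make the closed form concrete, here is a sketch in Python (not the authors' code; the function name `gonen_bf10` and its defaults are ours). It evaluates the Gönen et al. (2005) Bayes factor as the ratio of a rescaled noncentral *t* density to the central *t* density at the observed two-sample *t* statistic, using SciPy.

```python
import numpy as np
from scipy import stats

def gonen_bf10(x1, x2, mu0=0.0, sigma0_sq=0.3696):
    """Closed-form Bayes factor (shift model vs. no shift), Gonen et al. (2005).

    x1, x2: samples before/after the candidate changepoint.
    mu0, sigma0_sq: prior mean and variance of the standardized shift
    (mu1 - mu2)/sigma.
    """
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    nu = n1 + n2 - 2                         # degrees of freedom
    n_p = 1.0 / (1.0 / n1 + 1.0 / n2)        # pooled (effective) sample size
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / nu
    zeta = (x1.mean() - x2.mean()) / np.sqrt(sp2 / n_p)  # two-sample t statistic
    scale = np.sqrt(1.0 + n_p * sigma0_sq)   # kappa^{1/2}
    nc = np.sqrt(n_p) * mu0 / scale          # noncentrality after rescaling
    if nc == 0.0:
        # With mu0 = 0 the marginal under M_t is a rescaled central t
        num = stats.t.pdf(zeta / scale, df=nu) / scale
    else:
        num = stats.nct.pdf(zeta, df=nu, nc=nc, scale=scale)
    den = stats.t.pdf(zeta, df=nu)           # density under M_0
    return num / den
```

With a clear mean shift, 2 log_{e} of this ratio comfortably exceeds the Kass and Raftery (1995) evidence levels; with no shift the ratio tends to favor *M*_{0}.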

### d. Bayes factors for K difference series

Given a set {**Y**_{1}, **Y**_{2}, … , **Y**_{K}} of observations from *K* difference series for a time window *τ*, we want to compute the Bayes factor for the set of target–neighbor differences. Because the **Y**_{i}'s are not independent, we cannot combine them in one *t* statistic. Nevertheless, theoretically the Bayes factor for a model *M*_{t} can be obtained with

BF_{t0} = *P*(**Y**_{1}, **Y**_{2}, … , **Y**_{K} | *M*_{t}) / *P*(**Y**_{1}, **Y**_{2}, … , **Y**_{K} | *M*_{0}).

It is difficult to specify *P*(**Y**_{1}, **Y**_{2}, … , **Y**_{K} | *M*_{t}) and to integrate out the parameters directly, so in this case we apply the single-series Bayes factor model to each difference series and use the following formula to approximate the Bayes factor for *K* difference series:

log_{e}(BF_{t0}) ≈ (1/*K*) Σ_{i=1}^{K} log_{e}(BF_{t0,i}),

where the BF_{t0,i}'s are computed with the closed form for a single difference series (section 2c). The rationale behind this approximation is that log_{e}(BF_{t0,i}) can be viewed as the weight of evidence from each dataset **Y**_{i}, and the mean of the log Bayes factors can be viewed as the average weight of evidence from the dataset {**Y**_{1}, **Y**_{2}, … , **Y**_{K}} (Good 1985). For real applications, the median instead of the mean is recommended to mitigate the impact of outliers.
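The combination rule above is a one-liner in practice; this small helper (name and structure ours, not from the paper) expresses both the mean and the recommended median variant.

```python
import numpy as np

def combined_log_bf(log_bfs, use_median=True):
    """Combine per-neighbor log Bayes factors for one candidate month.

    log_bfs: iterable of log_e(BF_t0,i), one per target-neighbor difference
    series. The median is recommended over the mean to damp the influence
    of outlying neighbors.
    """
    log_bfs = np.asarray(log_bfs, dtype=float)
    return float(np.median(log_bfs)) if use_median else float(log_bfs.mean())
```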

### e. Estimate break locations

Time windows *τ*_{m} = {*s*_{m}, *s*_{m} + 1, … , *s*_{m} + *l*_{m}} with 2 log_{e}(BF) above the threshold are identified to form the model set *H*_{m}.^{2} Depending on the magnitude of the threshold and the size of the discontinuities, we may have several *τ*_{m}'s for a station with multiple discontinuities. Each *τ*_{m} contains a potential discontinuity, and the number of *τ*_{m}'s corresponds to the number of potential discontinuities. Since we have the approximate Bayes factors BF_{i0} for each model *M*_{i} in *H*_{m}, we can compute the posterior probability of each model,

*P*(*M*_{i} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}) = *c*_{i}BF_{i0} / Σ_{j} *c*_{j}BF_{j0},

where *c*_{i} = *P*(*M*_{i})/*P*(*M*_{0}) is the prior odds for model *M*_{i} and BF_{00} = *c*_{0} = 1. If we define *A* = {there is a discontinuity in *τ*_{m}}, then *P*(*A* | **Y**_{1}, **Y**_{2}, … , **Y**_{K}) > 0.5 indicates that there is a discontinuity in the time window. For time windows with *P*(*A* | **Y**_{1}, **Y**_{2}, … , **Y**_{K}) > 0.5, we estimate the expected location *E*(*L*_{m} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*) of the discontinuity. We compute the probability of each location when there is a discontinuity by

*P*(*M*_{t} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*) = *c*_{t}BF_{t0} / Σ_{j≠0} *c*_{j}BF_{j0},

and, for each *τ*_{m}, the posterior mean of the location of the discontinuity is

*E*(*L*_{m} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*) = Σ_{t} *t* · *P*(*M*_{t} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*).

We round *E*(*L*_{m} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*) to the closest integer to obtain the final estimate of the location of the discontinuity. We can compute the variance of the location of the discontinuity with

Var(*L*_{m} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*) = Σ_{t} [*t* − *E*(*L*_{m} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*)]² · *P*(*M*_{t} | **Y**_{1}, **Y**_{2}, … , **Y**_{K}, *A*).

An example of the 2 log_{e}(BF) series for a station based on simulated data is shown in Fig. 1 (details of the simulations are provided in section 4). In Fig. 1, there are three time windows with 2 log_{e}(BF) above the threshold. The threshold value is 4, and the indexing refers to months beginning with January 1900 (*i* = 1) and going through December 1999 (*i* = 1200). The break around month 850 in the difference series comes from the neighbor series, and we note that 2 log_{e}(BF) for this break is not above the threshold. The estimated locations for the three detected discontinuities are 562, 933, and 1007, and the estimated variances are 47.6, 7.5, and 14.7. For the first detected break, the true location is 570; for the second, the true location is 932; for the third, two true breaks are located at 1002 and 1011. We note that all true breaks lie within two estimated standard deviations of the estimated location and that the distance between the estimated and true locations is approximately proportional to the estimated standard deviation. Thus, the estimated variance can be used to measure the relative accuracy of the estimated location and to select temperature series when the accuracy of the time location is important.
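The location estimate for one flagged window can be sketched as follows (a minimal illustration under the uniform-prior-odds assumption; the function name and return convention are ours).

```python
import numpy as np

def break_location_posterior(months, log_bfs, prior_odds=1.0):
    """Posterior summary of a break location inside one flagged window tau_m.

    months:     candidate break months t in the window
    log_bfs:    combined log_e(BF_t0) for each candidate month
    prior_odds: c = P(M_t)/P(M_0), assumed uniform across t
    Returns (P(A|Y), rounded posterior-mean location, posterior variance).
    """
    months = np.asarray(months, float)
    bf = prior_odds * np.exp(np.asarray(log_bfs, float))
    p_break = bf.sum() / (1.0 + bf.sum())         # P(A | Y): any break present
    w = bf / bf.sum()                             # P(M_t | Y, A)
    loc = float(np.sum(w * months))               # posterior mean location
    var = float(np.sum(w * (months - loc) ** 2))  # posterior variance
    return p_break, int(round(loc)), var
```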

## 3. Selection of parameters

For the prior distribution of the standardized discontinuity size (*μ*_{1} − *μ*_{2})/*σ*, we need to specify the prior mean *μ*_{0} and prior variance *σ*_{0}². We set the prior mean *μ*_{0} to zero because we do not know whether the discontinuities will be positive or negative. A reasonable guess about the prior variance is that 90% of the discontinuities have a standardized size less than 1 (the accuracy of this guess will not significantly impact the results, as we will discuss in the sensitivity analysis with respect to the prior variance in the results section). Setting *P*(|(*μ*_{1} − *μ*_{2})/*σ*| < 1) = 0.9 under the normal prior gives the prior variance *σ*_{0}² = (1/1.645)² ≈ 0.3696.

The next parameter is the sample window size *n*: that is, how many observations from each side of the potential break point will be included in the *t* statistic. Including too many observations will increase undesired biases (i.e., the window may encompass more than one break), and including too few will lead to large uncertainty. We select the sample window size through a series of sensitivity analyses. As discussed further in the results section, letting the window size *n* equal 30 months (on each side of the potential break) achieves good results on the simulated datasets. For the prior odds *P*(*M*_{t})/*P*(*M*_{0}), the noninformative prior odds *P*(*M*_{t})/*P*(*M*_{0}) = 1 is typically used in the calculation, although other choices are possible, as we later will see in the results section.
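The prior-variance value of 0.3696 (used in Fig. 5) follows directly from the 90% assumption; as a quick check:

```python
from scipy.stats import norm

# Solve P(|X| < 1) = 0.9 for X ~ N(0, sigma0^2):
# 1/sigma0 must equal the 95th percentile of the standard normal (~1.645).
sigma0 = 1.0 / norm.ppf(0.95)
sigma0_sq = sigma0 ** 2
print(round(sigma0_sq, 4))  # 0.3696
```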

To find the potential time windows *τ*_{m}, a threshold for 2 log_{e}(BF) needs to be specified. We follow the recommendation in Kass and Raftery (1995) and use the moderately positive evidence level of 4 as the threshold for 2 log_{e}(BF). This threshold seems to work well for various simulations. Finally, since we could potentially have hundreds of neighbors, we need to set an upper limit on the number of neighbors included in the computation. To avoid including neighbors with very low correlation,^{3} we must also set a lowest correlation limit. Based on our simulations, the performance of the algorithm is not very sensitive to these two parameters. Therefore, we use 40 as the upper limit for the number of neighbors and 0.5 as the lowest correlation limit for the neighbors, as in Menne and Williams (2009).
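The neighbor-screening step can be sketched as below (our own illustration; the correlation is computed on first-difference series, as in footnote 3, and the cutoffs match the values just described).

```python
import numpy as np

def select_neighbors(target, candidates, max_neighbors=40, min_corr=0.5):
    """Pick neighbor series for pairwise differencing.

    target:     1-D anomaly series for the target station
    candidates: dict mapping station id -> anomaly series aligned with target
    Correlations are estimated from first-difference series so that shared
    low-frequency signals do not inflate them.
    """
    dt = np.diff(target)
    corrs = {}
    for sid, series in candidates.items():
        corrs[sid] = np.corrcoef(dt, np.diff(series))[0, 1]
    keep = [sid for sid, c in sorted(corrs.items(), key=lambda kv: -kv[1])
            if c >= min_corr]
    return keep[:max_neighbors]
```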

## 4. Results

### a. Results using simulated and real-world observations

We used two large simulated datasets to evaluate the effectiveness of our algorithm. Simulated datasets have been used to benchmark algorithms in the homogenization of radiosonde records (e.g., Titchner et al. 2009); paleoclimate reconstructions (e.g., Mann et al. 2005); and, more recently, surface temperature records (Venema et al. 2012; Williams et al. 2012). Here we used two of the simulated datasets described in Williams et al. (2012).

Each of these synthetic datasets contained about 7700 station series, and each station has a maximum record length of 100 yr; however, many of the stations have much shorter records and are characterized by missing periods of varying length. The simulated temperature series are based on climate model output and contain correlated errors. As described in detail in Williams et al. (2012), the missing-data patterns mimic the data record and geographic distribution of stations in the U.S. Cooperative Observer Network. The simulations contain only step-function discontinuities, which are arguably the most prevalent type of artificial discontinuity. We have two simulated datasets that we call simulations 1 and 2. In Williams et al. (2012), simulation 1 is referred to as "clustering and sign bias C20C1," and simulation 2 is referred to as "many small breaks with sign bias." For simulation 1, there are on average 7 breaks per series. The breaks are not randomly spaced through time but rather are "clustered," with most stations having a break within 30 yr of 1945 and during the 1980s to reflect the changes that occurred in the real-world network (Menne et al. 2009). For simulation 2, there are on average 10 breaks per series. Also, for both simulations, there is a sign bias to reflect what is known about the errors in the USHCN (Menne et al. 2009), which means that the imposed errors do not have a frequency distribution that is symmetric about zero. Rather, there is a preference for positive errors in simulation 1 and negative errors in simulation 2. Overall, simulation 1 contains a mixture of large, medium, and small discontinuities and resembles the type of errors thought to be present in USHCN temperature data. Simulation 2 has predominantly small breaks and represents a very challenging situation. In addition, for simulation 2, breaks are very close to each other, which makes detection even more challenging. More details of the two simulations can be found in Table 1.

We used the parameter settings described in section 3 to identify breaks in the two datasets. The result for simulation 1 is listed in Table 2. Our algorithm detects 84.50% of true large discontinuities.^{4} For detected large discontinuities, the false detection rate (FDR) is only 1.11%. The overall hit rate is 47.11%, and the false detection rate is 11.82%. The result for simulation 2 is listed in Table 3. We detected 11.17% of the total breaks. The false detection rate is 9.55%. The medians of the estimated SNR for the three break-size categories are also reported.

Results obtained in simulation 1, “clustering and sign bias C20C1.”

Results obtained in simulation 2, "many small breaks with sign bias."

Median of the estimated SNR. The sizes are as follows: large is *δ* ≥ 1.0°C, medium is 0.5°C ≤ *δ* < 1.0°C, and small is *δ* < 0.5°C, where *δ* is the size of a discontinuity (1°C = 1.8°F).

To further evaluate the efficiency of the Bayes algorithm, a simple adjustment factor was calculated for each of the break dates identified. For each detected break, multiple adjustments were first calculated using a 30-month window (on each side of the break) on each difference series, and the median of these adjustments was used as the final adjustment factor. These adjustments were then applied to the 1218 simulated series that are corollaries to the real USHCN station temperature series (Menne et al. 2009). A conterminous U.S. (CONUS) average time series was then computed as in Williams et al. (2012) using the 1218 adjusted series as well as the raw input unadjusted series (i.e., with errors) and the underlying series with no seeded errors. As shown in Fig. 2, applying the Bayes factor adjustments moves the CONUS average trends closer to their true "homogeneous" values. In the case of simulation 1, the adjusted trends are smaller than the raw input, indicating that the adjustments are accounting for the input data errors, which have a positive sign bias. In simulation 2, the errors have a negative bias, and the adjusted trends are therefore larger than the raw input. Not surprisingly, the adjustments do not move the CONUS average trend too far; rather, they do not move it far enough. This is an indication that the adjustments are incomplete rather than overly aggressive, especially in the case of simulation 2, where the detection rate is relatively low. As discussed in Williams et al. (2012), PHA-based adjustments behave similarly (results using the operational configuration of the PHA, version 52i, are also shown in Fig. 2a). Notably, the Bayes factor algorithm moves the trend nearly as far as the operational PHA algorithm in simulation 2 but not in simulation 1.
Because the detection rates are comparable between the two algorithms, the reason for the differential adjustments may be related to the way in which the Bayes factor adjustments are calculated (i.e., using a very limited time window) compared to the PHA and/or the fact that the Bayes factor algorithm, unlike the PHA, does not currently use metadata as a prior. As mentioned in the conclusions, the adjustment method and exploiting available metadata are both logical options for future Bayes factor algorithm improvement.
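The simple adjustment just described (shift in the mean across a 30-month window on each difference series, then the median across neighbors) can be sketched as follows; the function name and signature are our own.

```python
import numpy as np

def break_adjustment(diff_series_list, break_idx, half_window=30):
    """Step adjustment at a detected break, estimated from difference series.

    For each target-neighbor difference series, the shift is the mean over
    `half_window` months after the break minus the mean over `half_window`
    months before it; the median across neighbors is the adjustment factor.
    """
    shifts = []
    for d in diff_series_list:
        before = d[max(break_idx - half_window, 0):break_idx]
        after = d[break_idx:break_idx + half_window]
        shifts.append(np.mean(after) - np.mean(before))
    return float(np.median(shifts))
```

The median across neighbors mirrors the combination rule used for the Bayes factors: it limits the influence of any single inhomogeneous neighbor.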

(top) Annual average CONUS temperature series calculated using the USHCN monthly temperature series from the simulation-1 dataset. Spatial averages are based on adjustments calculated from the Bayes factor algorithm (in black) and the Menne and Williams (2009) PHA (in orange). CONUS averages for the nonhomogenized (raw) input values with the seeded errors are shown in red. Averages based on the true data series without errors are shown in green. (bottom) As in (top), but for simulation 2.

Citation: Journal of Climate 25, 24; 10.1175/JCLI-D-12-00052.1


The Bayes factor algorithm was also applied to mean monthly maximum and mean monthly minimum temperature series from the full 7000+ stations in the U.S. Cooperative Observer Program network with the parameters described in section 3. As above, a CONUS-wide average was computed from the 1218-station USHCN subset of stations using both the raw input series and the adjusted series. The time series and trend values for maximum and minimum temperatures are shown in Fig. 3. As in the case of the PHA adjustments (also shown), adjusted maximum temperature trends based on the Bayes factor algorithm are larger than the raw, unadjusted trends. This is consistent with the present understanding that maximum temperatures in the United States contain pervasive negative biases, especially since 1950. These biases are primarily related to changes in the time of observation and a widespread change from liquid-in-glass thermometers to electronic thermistors (see Menne et al. 2009; Williams et al. 2012). For minimum temperatures, there are conflicting biases in the USHCN temperature measurements, with a negative time-of-observation bias dominating since 1950 and a positive bias associated with the change to electronic thermistors that occurred largely in the mid-1980s. The Bayes factor adjustments on minimum temperature trends are also broadly consistent with this understanding.

As in Fig. 2 but for real-world monthly-mean (top) maximum and (bottom) minimum temperatures.

Citation: Journal of Climate 25, 24; 10.1175/JCLI-D-12-00052.1


### b. Evaluation of parameter sensitivity

For simulation 1, we randomly selected 5% of the stations and performed sensitivity analyses of the HRs and FDRs with respect to prior variances, prior odds, threshold values of 2 log_{e}(BF), and sample window sizes. Figure 4a shows the sensitivity of the HRs and FDRs to the prior variance. We observe that the HRs and FDRs are not overly sensitive to the choice of the prior variance unless the prior variance is unreasonably small. This means that the selection of the prior variance is not a concern for the model. Figure 4b contains the sensitivity of the HRs and FDRs to the log_{10}(prior odds) of *P*(*M*_{t})/*P*(*M*_{0}). The HRs and FDRs are not very sensitive to the choice of prior odds when the log_{10}(prior odds) is greater than −2. The sensitivity of the HRs and FDRs to the sample window size is shown in Fig. 4c. Very large or very small sample window sizes lower the HRs. Also, very large sample window sizes cause the FDRs to increase; this is perhaps caused by including nearby discontinuities in the sample window. Figure 4d shows the sensitivity of the HRs and FDRs to the threshold value of 2 log_{e}(BF). The HRs and FDRs are sensitive to the threshold value. Fortunately, the FDRs decrease more rapidly than the HRs as the threshold value increases.

The sensitivity analysis of HRs and FDRs with respect to (a) the prior variance, (b) the log_{10}(prior odds), (c) the sample window size, and (d) the threshold value of 2 log_{e}(BF).

Citation: Journal of Climate 25, 24; 10.1175/JCLI-D-12-00052.1


From the closed form for the Bayes factor, we know that 2 log_{e}(BF) is a function of the SNR, the sample window size *n*, and the prior variance *σ*_{0}² when the prior mean *μ*_{0} is equal to 0.^{5} From Fig. 5, we see that 2 log_{e}(BF) is not sensitive to the change of the prior variance, whereas 2 log_{e}(BF) is sensitive to the change of the sample window size *n*. However, a large sample window may include nearby discontinuities. The choice of the sample window size *n* depends on the prior information about the density of the discontinuities and the level of SNR. To apply the model to real observations, we could start from a relatively small window and gradually increase the size of the window until the HRs decrease. From Fig. 5, we also notice that increasing the threshold for 2 log_{e}(BF) will effectively eliminate false detections with small SNR values and lower the FDRs. Choosing different prior odds will also affect the FDRs. In the next paragraph, we will discuss the choice of prior odds.

(a) The 2 log_{e}(BF) as a function of the sample window size and SNR value when the prior variance equals 0.3696 (each curve has the same SNR value), (b) as in (a) but with the prior variance equal to 5.3696, (c) as in (a) but with the prior variance equal to 10.3696, and (d) the box plot of

Citation: Journal of Climate 25, 24; 10.1175/JCLI-D-12-00052.1


Finally, consider the choice of the prior odds *c*_{j} = *P*(*M*_{j})/*P*(*M*_{0}). If *c*_{j} = *c* for ∀ *j* ∈ (*s*_{m}, … , *s*_{m} + *l*_{m}), then from the posterior probabilities above we have

*P*(*A* | **Y**_{1}, **Y**_{2}, … , **Y**_{K}) = *c* Σ_{j} BF_{j0} / (1 + *c* Σ_{j} BF_{j0}),

where *A* = {there is a discontinuity in *τ*_{m}}. Because the condition of having a break is *P*(*A* | **Y**_{1}, **Y**_{2}, … , **Y**_{K}) > 0.5, this is equivalent to

*c* Σ_{j} BF_{j0} > 1.

When *c* = 1, this condition is always satisfied for windows that pass the 2 log_{e}(BF) threshold. If we want to achieve the maximum hit rate for a certain threshold of 2 log_{e}(BF), then we should use the noninformative prior. If we want to lower the FDR and the threshold for 2 log_{e}(BF) is *T*, we can choose a smaller *c*, for example *c* = exp[−(*T* + *d*)/2] with *d* > 0, so that a break is declared only when 2 log_{e}(Σ_{j} BF_{j0}) > *T* + *d*.
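The break-declaration condition is easy to express directly; this tiny helper (ours, not the paper's code) shows how shrinking the uniform prior odds *c* raises the effective evidence bar.

```python
import numpy as np

def declares_break(log_bfs, log_prior_odds=0.0):
    """Check P(A | Y) > 0.5 for one window, given per-month log_e(BF_j0).

    With uniform prior odds c = exp(log_prior_odds), the posterior break
    probability is P(A|Y) = c*S / (1 + c*S) with S the sum of the Bayes
    factors, so the condition P(A|Y) > 0.5 reduces to c*S > 1.
    """
    s = np.exp(np.asarray(log_bfs, float)).sum()
    c = np.exp(log_prior_odds)
    p_a = c * s / (1.0 + c * s)
    return bool(p_a > 0.5)
```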

## 5. Conclusions

Detecting artificial discontinuities in real temperature series usually carries uncertainty. For example, a large fraction of the breaks in most surface temperature networks are probably undocumented, and we may have more than one plausible location for a discontinuity. Because the "true" climate signal in these series is unknown, the best that we can hope to do is to quantify the uncertainty, and one way to do this is to approach the changepoint detection problem in multiple different ways (Thorne et al. 2011). With the model in this paper, we can quantify the uncertainty of the location and estimate the most likely location within a probabilistic framework. We have shown in the examples that the proposed model achieves reasonable results on large-scale, realistically noisy simulated temperature series. The results of the sensitivity analyses also provide evidence that the model is useful for real applications. In the future, we plan to use available metadata as a prior in the Bayes factor algorithm as well as test alternative ways to calculate adjustments for the identified breaks. These future algorithm enhancements will allow for a more comprehensive comparison to other homogenization algorithms and help quantify the structural uncertainty associated with surface temperature homogenization.

## Acknowledgments

We are grateful to Dr. Peter Thorne for his assistance with the data preparation and to Dr. Murray Clayton for his comments on our model. The comments by three anonymous reviewers also greatly improved the manuscript.

## REFERENCES

Alexandersson, H., 1986: A homogeneity test applied to precipitation data. *J. Climatol.*, **6**, 661–675.

Beaulieu, C., T. Ouarda, and O. Seidou, 2010: A Bayesian normal homogeneity test for the detection of artificial discontinuities in climatic series. *Int. J. Climatol.*, **30**, 2342–2357.

Caussinus, H., and O. Mestre, 2004: Detection and correction of artificial shifts in climate series. *J. Royal Stat. Soc.*, **53C**, 405–425.

Della-Marta, P., and H. Wanner, 2006: A method of homogenizing the extremes and mean of daily temperature measurements. *J. Climate*, **19**, 4179–4197.

Gönen, M., W. Johnson, Y. Lu, and P. Westfall, 2005: The Bayesian two-sample *t* test. *Amer. Stat.*, **59**, 252–257.

Good, I., 1985: Weight of evidence: A brief survey. *Bayesian Statistics*, J. Bernardo et al., Eds., Elsevier, 249–269.

Hannart, A., and P. Naveau, 2009: Bayesian multiple change points and segmentation: Application to homogenization of climatic series. *Water Resour. Res.*, **45**, W10444, doi:10.1029/2008WR007689.

K-1 Model Developers, 2004: K-1 coupled GCM (MIROC) description. K-1 Tech. Rep. 1, 39 pp.

Kass, R., and A. Raftery, 1995: Bayes factors. *J. Amer. Stat. Assoc.*, **90**, 773–795.

Lu, Q., R. Lund, and T. Lee, 2010: An MDL approach to the climate segmentation problem. *Ann. Appl. Stat.*, **4**, 299–319.

Lund, R., and J. Reeves, 2002: Detection of undocumented changepoints: A revision of the two-phase regression model. *J. Climate*, **15**, 2547–2554.

Lund, R., X. Wang, Q. Lu, J. Reeves, C. Gallagher, and Y. Feng, 2007: Changepoint detection in periodic and autocorrelated time series. *J. Climate*, **20**, 5178–5190.

MacKay, D. J. C., 2003: *Information Theory, Inference, and Learning Algorithms.* Cambridge University Press, 628 pp.

Mann, M., S. Rutherford, E. Wahl, and C. Ammann, 2005: Testing the fidelity of methods used in proxy-based reconstructions of past climate. *J. Climate*, **18**, 4097–4107.

Menne, M. J., and C. N. Williams, 2009: Homogenization of temperature series via pairwise comparisons. *J. Climate*, **22**, 1700–1717.

Menne, M. J., C. N. Williams, and R. S. Vose, 2009: The U.S. Historical Climatology Network monthly temperature data, version 2. *Bull. Amer. Meteor. Soc.*, **90**, 993–1007.

Peterson, T. C., and Coauthors, 1998: Homogeneity adjustment of in situ atmospheric climate data: A review. *Int. J. Climatol.*, **18**, 1493–1517.

Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Q. Lu, 2007: Comparison of techniques for detection of discontinuities in temperature series. *J. Appl. Meteor. Climatol.*, **46**, 900–914.

Rouder, J. N., P. L. Speckman, D. Sun, R. D. Morey, and G. Iverson, 2009: Bayesian *t* tests for accepting and rejecting the null hypothesis. *Psychon. Bull. Rev.*, **16**, 225–237.

Thorne, P. W., and Coauthors, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first-century climate science. *Bull. Amer. Meteor. Soc.*, **92**, ES40–ES47.

Titchner, H., P. W. Thorne, M. P. McCarthy, S. F. B. Tett, L. Haimberger, and D. E. Parker, 2009: Critically assessing tropospheric temperature trends from radiosondes using realistic validation experiments. *J. Climate*, **22**, 465–485.

Venema, V. K. C., and Coauthors, 2012: Benchmarking monthly homogenization algorithms. *Clim. Past*, **8**, 89–115, doi:10.5194/cp-8-89-2012.

Vincent, L., 1998: A technique for the identification of inhomogeneities in Canadian temperature series. *J. Climate*, **11**, 1094–1104.

Wang, X. L., 2008a: Accounting for autocorrelation in detecting mean shifts in climate data series using the penalized maximal *t* or *F* test. *J. Appl. Meteor. Climatol.*, **47**, 2423–2444.

Wang, X. L., 2008b: Penalized maximal *F* test for detecting undocumented mean shift without trend change. *J. Atmos. Oceanic Technol.*, **25**, 368–384.

Wang, X. L., Q. H. Wen, and Y. Wu, 2007: Penalized maximal *t* test for detecting undocumented mean change in climate data series. *J. Appl. Meteor. Climatol.*, **46**, 916–931.

Washington, W., and Coauthors, 2000: Parallel climate model (PCM) control and transient simulations. *Climate Dyn.*, **16**, 755–774.

Williams, C. N., M. J. Menne, and P. W. Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. *J. Geophys. Res.*, **117**, D05116, doi:10.1029/2011JD016761.

^{1} We calculate the mean monthly temperature for each month and obtain the monthly temperature anomaly by subtracting the corresponding monthly-mean temperature from the actual monthly temperature.

^{2} We focus on time windows with 2 log_{e}(BF) greater than zero, since such windows contain evidence that favors discontinuities.

^{3} The interstation correlation is estimated from the first-difference series.

^{4} We classify the discontinuities into three categories in terms of their actual sizes to help readers understand the performance of the model. The three categories are defined as follows: large, *δ* ≥ 1.0°C; medium, 0.5°C ≤ *δ* < 1.0°C; and small, *δ* < 0.5°C, where *δ* is the size of a discontinuity (1°C = 1.8°F).