## Abstract

This paper examines multiple changepoint detection procedures that use station history (metadata) information. Metadata records are available for some climate time series; however, these records are notoriously incomplete and many station moves and gauge changes are unlisted (undocumented). The shift in a series must be comparatively larger to declare a changepoint at an undocumented time. Also, the statistical methods for the documented and undocumented scenarios radically differ: a simple *t* test adequately detects a single mean shift at a documented changepoint time, while a *t*_{max} distribution is appropriate for a single undocumented changepoint analysis. Here, the multiple changepoint detection problem is considered via a Bayesian approach, with the metadata record being used to formulate a prior distribution of the changepoint numbers and their location times. This prior distribution is combined with the data to obtain a posterior distribution of changepoint numbers and location times. Estimates of the most likely number of changepoints and times are obtained from the posterior distribution. Simulation studies demonstrate the efficacy of this approach. The methods, which are applicable with or without a reference series, are applied in the analysis of an annual precipitation series from New Bedford, Massachusetts.

## 1. Introduction

Climate series often contain abrupt changes, often called changepoints, induced by gauge changes, station relocations, observer changes, and so on. It is critical to remove the effects of these artificial “inhomogeneities” before assessing trends in the series (Jones et al. 1986; Lu and Lund 2007). Many stations experience multiple changes in measurement procedures during their operation, with Mitchell (1953) estimating that U.S. temperature series average roughly six changepoints per century. While changepoints can affect series means, variances, and extremes, we are most interested in the case of shifts in average series values (mean shifts).

Often, a station metadata record that logs changes of data collection methods, device adjustments, recording time or frequency changes, station moves, etc., is available. Metadata records are notoriously incomplete in that many changepoint times are absent from the logs. Moreover, not every change listed in the metadata necessarily induces a true shift in the series. This said, change times reported in the metadata record are more likely to be true changepoint times than unreported times.

The typical objective of a changepoint analysis is to identify how many changepoints a series has and where they occur. The appropriate statistical methods for documented and undocumented changepoint times radically differ (Lund and Reeves 2002). To appreciate the issue, consider a record of *n* = 100 annualized temperatures *X*_{1}, …, *X*_{n} and testing whether or not there is a single mean shift in the record. If the time *c* ∈ {2, …, *n*} is known to be the changepoint time a priori, then the *t* statistic

$$t_c = \frac{\bar{X}_1 - \bar{X}_2}{\widehat{\mathrm{se}}(\bar{X}_1 - \bar{X}_2)}$$

is a good way to detect a possible shift (Reeves et al. 2007). Here, $\bar{X}_1$ is the sample series mean up to time *c*, $\bar{X}_2$ is the sample mean from time *c* onward, and $\widehat{\mathrm{se}}(\bar{X}_1 - \bar{X}_2)$ is the estimated standard error of their difference (its explicit form is not given here). Homogeneity of the two segments is rejected when |*t*_{c}| is too large to be statistically explained by chance variation. This is a likelihood ratio test, which affords one statistical optimality (Casella and Berger 1990). When the changepoint time *c* is not known a priori, one can examine $t_{\max} = \max_{2 \le c \le n} |t_c|$, rejecting homogeneity when *t*_{max} is too large to be explained by chance variation. Here, the location of the largest *t*_{c} statistic (in absolute value) serves as the estimated changepoint time. The distributional percentiles for *t*_{c} and *t*_{max} are very different. In independent Gaussian (normally distributed) data with *n* = 100, |*t*_{c}| values need only exceed 1.984 to declare a mean shift at time *c* with 95% confidence, while *t*_{max} must exceed 17.663 to declare an undocumented changepoint with the same confidence. Past authors have confused these cases (Lund and Reeves 2002); the fundamental issue is that there are *n* − 1 admissible changepoint times to consider in the undocumented case, and this multiplicity must be taken into account in the distributional percentiles.
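To make the documented/undocumented contrast concrete, here is a minimal numerical sketch (ours, not from the paper) of the *t*_{c} and *t*_{max} statistics on a simulated series. A pooled-variance standard error is assumed here, which is one common choice for independent Gaussian errors:

```python
import numpy as np

def t_stat(x, c):
    """Two-sample t statistic for a mean shift at a known changepoint time c.

    Times c, ..., n form the new regime, so segment 1 is x[0:c-1] and
    segment 2 is x[c-1:]. A pooled variance estimate is used (this assumes
    independent Gaussian errors; the paper's exact form may differ).
    """
    x1, x2 = x[: c - 1], x[c - 1 :]
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (x1.mean() - x2.mean()) / se

def t_max(x):
    """Maximize |t_c| over admissible changepoint times (>= 2 points per segment)."""
    n = len(x)
    stats = {c: abs(t_stat(x, c)) for c in range(3, n - 1)}
    c_hat = max(stats, key=stats.get)
    return stats[c_hat], c_hat

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100)
x[49:] += 2.0  # a two-standard-deviation mean shift beginning at time 50
tmax, c_hat = t_max(x)
```

The documented test compares |*t*_{50}| to an ordinary Student's *t* percentile, while the undocumented test compares the maximum over all candidate times to a far larger threshold, reflecting the *n* − 1 admissible changepoint times.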

As more evidence (a larger data shift) is needed to declare an undocumented changepoint than a documented one, some authors require that a changepoint be significant at undocumented thresholds, consulting the metadata only afterward to confirm conclusions in hindsight. Such procedures lack detection power in that many true changepoints risk going unflagged. In section 5, examples are given where documentation of the changepoint time makes it six times more likely to be flagged.

Our goal here is to develop a method capable of merging documented and undocumented paradigms. This is done via Bayesian statistical methods, with the metadata record being used to construct a prior distribution for the number of changepoints and their location times. The observed data are also important in the analysis. A time not listed in the metadata could be deemed a changepoint should the data shift enough near this time. Evidence for this is quantified via a data likelihood, as explained in section 3. The prior distribution and data likelihood are combined to form a posterior distribution. The configuration is estimated as the most likely changepoint number and times from the posterior distribution.

This study considers the multiple changepoint setting, the frequent case for climate series. Elaborating, the at most one changepoint (AMOC) setting is not a priori posited, nor is any fixed number of changepoints assumed. The methods are applicable when documented and undocumented changepoints occur in close proximity, or sparsely in time. The procedure does not rely on subsegmentation methods with an AMOC technique. Any AMOC test can be turned into a multiple changepoint segmenter: simply divide the series into two segments about the most prominent changepoint time and analyze the resulting segments as separate series. The process is repeated about all found changepoint times until no segments are deemed to have changepoints. While simple to implement, AMOC subsegmentation procedures are often fooled by nonmonotonic shifts. For example, an up, up, up configuration shift in a series is more readily detected by AMOC segmenters than an up, down, up configuration shift; Li and Lund (2012) and section 5 herein discuss this issue.

Bayesian climate AMOC methods are studied in Perreault et al. (2000), Western and Kleykamp (2004), Seidou et al. (2007), Hannart and Naveau (2009), and Beaulieu et al. (2010). These references use subsegmentation techniques, or identification of subsequences that contain at most one changepoint, to deal with multiple changepoint issues. Of particular note is Beaulieu et al. (2010), who use metadata in an AMOC setting. There, a prior distribution of the changepoint location is posited to vary with a distribution whose mode occurs at the single metadata time. Essentially, this manuscript constructs an analogous technique for multiple changepoints. The methods can handle undocumented and documented changepoint times that exist in close proximity.

The Bayesian statistics literature on multiple changepoint problems is also developed (Chernoff and Zacks 1964; Yao 1984; Barry and Hartigan 1992, 1993; Green 1995; Chib 1998; Wang and Zivot 2000; Fearnhead 2006; Ruggieri 2013). Much of the focus lies with efficient computation for the various models constructed. Computation will also be a prominent issue here. Metadata issues do not seem to arise in statistical studies; the problem appears unique to climatology. Autocorrelation (persistence) is taken into account in our methods. Autocorrelation can degrade changepoint tests (Lund et al. 2007); however, allowing for the feature often complicates the analysis.

The rest of this paper proceeds as follows. Section 2 introduces our model and discusses its features. Section 3 discusses the Bayesian statistical techniques used. Section 4 moves to computational issues. Here, a Metropolis–Hastings algorithm, a Markov chain Monte Carlo (MCMC) method, is advocated for accurate estimation of the number of changepoints and their locations. Section 5 presents simulation studies showing how the methods work on benchmarked time series where changepoint configurations are known. The simulations show the power of the technique: a changepoint is more likely to be detected when the station change is documented. Section 6 applies the methods to a long annual precipitation series from New Bedford, Massachusetts, with and without a reference series. A reference series usually comes from a geographically close station record(s) and serves as a control record to avoid flagging regional changes (e.g., climate trends or natural variability) as artificial shifts. The series under study is often called the target series. The New Bedford record has been studied without metadata, which affords us an opportunity to compare findings. Comments and conclusions are made in section 7.

## 2. Model

Any multiple changepoint configuration can be expressed as **γ** = (*m*; *τ*_{1}, …, *τ*_{m}), where *m* is the number of changepoints and 2 ≤ *τ*_{1} < ⋯ < *τ*_{m} ≤ *n* are the changepoint location times. With an objective function that quantifies how good each changepoint configuration is, an optimal changepoint configuration can be estimated. This is a model selection problem, with each changepoint configuration **γ** representing a model. The model space contains 2^{n−1} distinct changepoint configurations, since each time in {2, …, *n*} may or may not be a changepoint. Alternatively, each changepoint configuration can be expressed as **η** = (*η*_{2}, …, *η*_{n})′, where *η*_{t} is the indicator

$$\eta_t = \begin{cases} 1, & \text{if time } t \text{ is a changepoint}, \\ 0, & \text{otherwise}. \end{cases}$$

The primes denote the transpose of vectors and matrices.

Our time series model for the data *X*_{1}, …, *X*_{n} obeys

$$X_t = \mu_{r(t)} + \alpha t + \epsilon_t, \qquad t = 1, \ldots, n. \qquad (1)$$

Here, *X*_{t} is the observation at time *t*, *α* is a linear trend coefficient (we allow a trend because long climate records often exhibit a secular climate change trend, and changepoint detection differs in trended and nontrended cases; Gallagher et al. 2013), and *r*(*t*) is the regime number in use at time *t*. For example, if *n* = 100 and there is a single changepoint at time 50, then *r*(*t*) = 1 for *t* = 1, …, 49 and *r*(*t*) = 2 for *t* = 50, …, 100. Finally, {*ϵ*_{t}} is posited to be zero mean stationary noise. The model errors {*ϵ*_{t}} need not be independent in time; in fact, most climate series display some autocorrelation. Autocorrelated series often have sample paths that mimic series with mean shifts (Lund et al. 2007), and it is important to allow for this feature. For simplicity, the autocorrelation in the series is quantified via a first-order autoregression [AR(1)], which satisfies

$$\epsilon_t = \phi\,\epsilon_{t-1} + Z_t.$$

Here, {*Z*_{t}} is zero mean independent and identically distributed noise with variance *σ*^{2} and *ϕ* ∈ (−1, 1) is the autocorrelation between consecutive *ϵ*_{t}. AR(1) models are common in climate time series, where slow components of the climate system gradually respond to random forcing from weather events.
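As a concrete illustration (our own sketch, not part of the paper's software), a series following the model in (1) with AR(1) errors can be simulated as follows; the function name and interface are ours:

```python
import numpy as np

def simulate_series(n, tau, mus, alpha=0.0, phi=0.2, sigma2=0.025, seed=None):
    """Simulate X_t = mu_{r(t)} + alpha * t + eps_t, with eps_t a Gaussian
    AR(1) process started from its stationary distribution.

    tau: sorted changepoint times (each listed time starts a new regime);
    mus: regime means mu_1, ..., mu_{m+1}.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    r = np.searchsorted(np.asarray(tau), t, side="right")  # regime index r(t) - 1
    eps = np.empty(n)
    eps[0] = rng.normal(0.0, np.sqrt(sigma2 / (1.0 - phi**2)))
    z = rng.normal(0.0, np.sqrt(sigma2), n)
    for i in range(1, n):
        eps[i] = phi * eps[i - 1] + z[i]
    return np.asarray(mus)[r] + alpha * t + eps

# An "up, up, up" configuration akin to scenario 2 of section 5.
x = simulate_series(200, tau=[50, 100, 150], mus=[0.0, 0.2, 0.4, 0.6], seed=1)
```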

Using the piecewise notation in Li and Lund (2012), one has

$$E[X_t] = \mu_\ell + \alpha t, \qquad \tau_{\ell-1} \le t < \tau_{\ell}, \qquad (2)$$

with the conventions *τ*_{0} = 1 and *τ*_{m+1} = *n* + 1. The mean of the series during the *ℓ*th segment (regime) is *μ*_{ℓ} for *ℓ* = 1, …, *m* + 1; this mean shifts at each and every changepoint time.

The model in (1) does not allow for seasonality in its present form (it could be modified) but is appropriate for annual data. The model allows for a linear trend to characterize long-term changes. Applying changepoint methods that do not allow a trend in cases where a trend exists can lead to the detection of spurious changepoints (Gallagher et al. 2013). A no-trend changepoint method would try to compensate for the lack of a trend by erroneously flagging multiple mean shifts that “follow the trend.” While most climate trends are nonlinear, our focus here is not on the trend parameter *α*, but rather on how many changepoints exist and where they lie.

## 3. Statistical methods

Our primary goal is to estimate the “best” changepoint configuration **η**, which is denoted by $\hat{\boldsymbol{\eta}} = (\hat{\eta}_2, \ldots, \hat{\eta}_n)'$, each $\hat{\eta}_t$ being zero or unity. The estimated number of changepoints is $\hat{m} = \sum_{t=2}^{n} \hat{\eta}_t$. A specific changepoint configuration **η** will be judged by its posterior probability, conditional on the observed data **X** = (*X*_{1}, …, *X*_{n})′, which is denoted by $p(\boldsymbol{\eta} \mid \mathbf{X})$. The best configuration is the one that maximizes the posterior probability (the mode of the posterior distribution):

$$\hat{\boldsymbol{\eta}} = \operatorname*{arg\,max}_{\boldsymbol{\eta}}\; p(\boldsymbol{\eta} \mid \mathbf{X}). \qquad (3)$$

The posterior distribution can be calculated using Bayes’s theorem:

$$p(\boldsymbol{\eta} \mid \mathbf{X}) = \frac{f(\mathbf{X} \mid \boldsymbol{\eta})\,\pi(\boldsymbol{\eta})}{\sum_{\boldsymbol{\eta}} f(\mathbf{X} \mid \boldsymbol{\eta})\,\pi(\boldsymbol{\eta})}. \qquad (4)$$

Here, *π*(**η**) is the prior probability of the changepoint configuration **η** and $f(\mathbf{X} \mid \boldsymbol{\eta})$ denotes the likelihood (probability density function) of the data when the changepoint configuration is known to be **η**. It will become evident below that $f(\mathbf{X} \mid \boldsymbol{\eta})$ is computable (tractable). Since each point in {2, …, *n*} could be a changepoint, the summation in (4) is over the 2^{n−1} distinct changepoint configurations. As the denominator in (4) does not depend on **η**, the optimal $\hat{\boldsymbol{\eta}}$ in (3) satisfies

$$\hat{\boldsymbol{\eta}} = \operatorname*{arg\,max}_{\boldsymbol{\eta}}\; f(\mathbf{X} \mid \boldsymbol{\eta})\,\pi(\boldsymbol{\eta}). \qquad (5)$$

To find $\hat{\boldsymbol{\eta}}$, the product $f(\mathbf{X} \mid \boldsymbol{\eta})\,\pi(\boldsymbol{\eta})$ is ideally evaluated over all 2^{n−1} admissible multiple changepoint configurations **η**. However, this is impractical even for *n* as small as 100, since 2^{n−1} rapidly increases with increasing *n*. In the next section, an MCMC algorithm is developed to efficiently search the possibly huge number of admissible changepoint configurations.

### a. The prior distribution *π*(**η**)

Our prior distribution represents our beliefs before observing data. Here, we assume that every time not listed in the metadata is equally likely to be a changepoint, say with probability *ρ*_{1}, and that each time listed in the metadata has an equally likely chance to be a changepoint, say with probability *ρ*_{2}, but that *ρ*_{2} > *ρ*_{1}. We assume mutual independence among changepoint existence at all distinct times. These assumptions allow times reported in the metadata to be more likely changepoints than times not reported, but weigh all times equally within each respective category. When metadata are unavailable, all *n* − 1 points in {2, …, *n*} are equally likely to be changepoints.

Empirical simulations show that changepoint detection is sensitive to the choices of *ρ*_{1} and *ρ*_{2}. Hence, these parameters are allowed to vary in a random way, and our prior beliefs about them are updated from the data. This is done by placing a hierarchical prior distribution on *ρ*_{1} and *ρ*_{2}. As suggested by Scott and Berger (2010), a natural prior involves the beta distribution class, which is flexible and is supported on [0, 1] (in which probabilities must lie). Beta priors are common in Bayesian methods and are usually termed conjugate beta-binomial priors. Mathematically, the beta formulation entails that *ρ*_{1} has a beta distribution with parameters *a*_{1} and *b*_{1} and that *ρ*_{2}, independent of *ρ*_{1}, has a beta distribution with parameters *a*_{2} and *b*_{2}. The beta distribution with parameters *a* > 0 and *b* > 0 has a probability density function of the form

$$f(\rho) = \frac{\rho^{a-1}(1-\rho)^{b-1}}{B(a, b)}, \qquad 0 \le \rho \le 1,$$

where *B*(*a*, *b*) is the beta function, and a mean of *a*(*a* + *b*)^{−1}.

Simulations suggest that changepoint conclusions are less sensitive to the choices of *a*_{1}, *a*_{2}, *b*_{1}, and *b*_{2} than to choices of *ρ*_{1} and *ρ*_{2}. As is common with beta-binomial priors, the *a*_{i} parameters are set to unity: *a*_{1} = *a*_{2} = 1. Hence, choices of *b*_{1} and *b*_{2} reflect our prior beliefs. As times in the metadata are more likely to be changepoint times than nonmetadata times, we take *E*[*ρ*_{1}] ≤ *E*[*ρ*_{2}], or equivalently, *a*_{1}(*a*_{1} + *b*_{1})^{−1} ≤ *a*_{2}(*a*_{2} + *b*_{2})^{−1}. For example, our default choices take *b*_{1} = 19 and *b*_{2} = 3. Hence, a priori, *E*[*ρ*_{1}] = 0.05 and 5% of undocumented times are changepoints; likewise, *E*[*ρ*_{2}] = 0.25 and 25% of documented times are true changepoint times. Under this default, times reported in the metadata are 5 times more likely to be changepoints than times not reported in the metadata. One can change prior beliefs about the undocumented and documented changepoint time frequencies by using different values of *b*_{1} (undocumented) and *b*_{2} (documented). Section 5 provides a sensitivity analysis of these choices.
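The default prior means quoted above are easy to verify numerically; a quick sketch:

```python
def beta_mean(a, b):
    """Mean of a beta(a, b) distribution: a / (a + b)."""
    return a / (a + b)

# Default hyperparameters from the text: a1 = a2 = 1, b1 = 19, b2 = 3.
rho1_mean = beta_mean(1, 19)  # prior changepoint rate at undocumented times
rho2_mean = beta_mean(1, 3)   # prior changepoint rate at documented times
print(rho1_mean, rho2_mean)   # 0.05 0.25
```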

In appendix A, the prior density of the changepoint configuration **η** = (*η*_{2}, …, *η*_{n})′ is derived as

$$\pi(\boldsymbol{\eta}) = \frac{B(a_1 + m_1,\; b_1 + n_1 - m_1)}{B(a_1, b_1)} \times \frac{B(a_2 + m_2,\; b_2 + n_2 - m_2)}{B(a_2, b_2)}. \qquad (6)$$

The notation here has *m*_{1} as the number of nonmetadata changepoint times in the changepoint configuration **η**, *m*_{2} as the number of metadata changepoint times in **η**, *n*_{1} as the total number of nonmetadata points, and *n*_{2} as the total number of metadata points (so that *n*_{1} + *n*_{2} = *n* − 1 and *m*_{1} + *m*_{2} = *m*). The beta function for *a* > 0 and *b* > 0 is

$$B(a, b) = \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx.$$

The beta function is readily computed by most software packages. The posterior distribution of *ρ*_{i} given **η** and the data **X** is also a beta distribution, with parameters *a*_{i} + *m*_{i} and *b*_{i} + *n*_{i} − *m*_{i} for *i* = 1, 2. Inferences about *ρ*_{1} and *ρ*_{2} that take the data into account can be made from this.
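Assuming the standard beta-binomial form for the prior (each factor integrates the binomial changepoint count against its beta prior), the log prior of a configuration can be computed from its counts alone. The sketch below, with hypothetical function names, also checks numerically that the prior probabilities sum to one over all configurations:

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    """log B(a, b) computed via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_prior(m1, n1, m2, n2, a1=1.0, b1=19.0, a2=1.0, b2=3.0):
    """Log prior of a configuration flagging m1 of n1 nonmetadata times and
    m2 of n2 metadata times as changepoints (beta-binomial form)."""
    return (log_beta(a1 + m1, b1 + n1 - m1) - log_beta(a1, b1)
            + log_beta(a2 + m2, b2 + n2 - m2) - log_beta(a2, b2))

# Sanity check: weighting each (m1, m2) count pair by the number of
# configurations with those counts, the prior probabilities sum to one.
n1, n2 = 193, 6  # e.g., n = 200 with six metadata times
total = sum(comb(n1, m1) * comb(n2, m2) * exp(log_prior(m1, n1, m2, n2))
            for m1 in range(n1 + 1) for m2 in range(n2 + 1))
```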

### b. The marginal likelihood

We now derive the data likelihood given a fixed changepoint configuration **η**. The law of total probability provides

$$f(\mathbf{X} \mid \boldsymbol{\eta}) = \int f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi})\,\pi(\boldsymbol{\psi} \mid \boldsymbol{\eta})\,d\boldsymbol{\psi},$$

where $f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi})$ is the probability density function (likelihood) of the data when both the changepoint configuration **η** and the nonchangepoint model parameters **ψ** are known (specified), and $\pi(\boldsymbol{\psi} \mid \boldsymbol{\eta})$ is the prior probability density function of the model parameters **ψ** when the changepoint configuration is specified as **η**. Our general notation uses *f*(⋅) to denote a density (regardless of dimension) and *f*(⋅|⋅) to denote conditional densities. The symbol *π*(⋅) is used as a generic prior density and *p*(⋅) as a generic posterior density. In our Gaussian AR(1) model in (1), **ψ** = (**μ**′, *α*, *ϕ*, *σ*^{2})′, where the mean regime parameters are **μ** = (*μ*_{1}, …, *μ*_{m+1})′ in (2).

Our next objective is to obtain a tractable expression for $f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi})$. Given **η** and **ψ**, the data **X** have a multivariate normal distribution with mean $E[\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}]$ and covariance $\boldsymbol{\Gamma}$. To obtain the mean, we define an *n* × (*m* + 1) configuration-dependent design matrix $\mathbf{D}_{\boldsymbol{\eta}}$, whose entries are only zeros and ones. Specifically, the (*t*, *j*)th entry in $\mathbf{D}_{\boldsymbol{\eta}}$ is unity only when time *t* lies in the *j*th regime. In addition, the covariate associated with the linear trend is **t** = (1, 2, …, *n*)′. Then the mean is

$$E[\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}] = \mathbf{D}_{\boldsymbol{\eta}}\boldsymbol{\mu} + \alpha\mathbf{t}.$$

For the covariance matrix, let *γ*(*h*) = Cov(*ϵ*_{t}, *ϵ*_{t+h}) be the lag-*h* covariance of the error process {*ϵ*_{t}} and $\boldsymbol{\Gamma}$ an *n* × *n* matrix whose (*i*, *j*)th entry is *γ*(|*i* − *j*|) for 1 ≤ *i*, *j* ≤ *n*. In our AR(1) setup, *γ*(*h*) = *σ*^{2}*ϕ*^{|h|}/(1 − *ϕ*^{2}); this covariance structure could be modified if desired. The conditional covariance is then Cov(**X** | **η**, **ψ**) = $\boldsymbol{\Gamma}$ and does not depend on **η**.
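A sketch (ours, with hypothetical names and interfaces) of the design matrix and the AR(1) covariance matrix just described:

```python
import numpy as np

def design_matrix(n, tau):
    """n x (m+1) zero-one matrix: row t-1 has a single 1 in the column of
    the regime containing time t (each time in tau starts a new regime)."""
    r = np.searchsorted(np.asarray(tau), np.arange(1, n + 1), side="right")
    D = np.zeros((n, len(tau) + 1))
    D[np.arange(n), r] = 1.0
    return D

def ar1_covariance(n, phi, sigma2):
    """n x n matrix with (i, j) entry gamma(|i-j|) = sigma2 * phi^|i-j| / (1 - phi^2)."""
    h = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma2 * phi**h / (1.0 - phi**2)

n = 100
D = design_matrix(n, tau=[50])  # one changepoint at time 50
mean = D @ np.array([0.0, 0.5]) + 0.001 * np.arange(1, n + 1)  # D_eta mu + alpha * t
Gamma = ar1_covariance(n, phi=0.2, sigma2=0.025)
```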

The multivariate normal density has the form

$$f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}) = (2\pi)^{-n/2}\,|\boldsymbol{\Gamma}|^{-1/2} \exp\!\left[-\frac{1}{2}\bigl(\mathbf{X} - E[\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}]\bigr)'\,\boldsymbol{\Gamma}^{-1}\,\bigl(\mathbf{X} - E[\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}]\bigr)\right]. \qquad (7)$$

This can be written in the innovations form (Li and Lund 2012)

$$f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi}) = \prod_{t=1}^{n} (2\pi e_t)^{-1/2} \exp\!\left[-\frac{(X_t - \hat{X}_t)^2}{2 e_t}\right]. \qquad (8)$$

Here, $\hat{X}_t$ for *t* ≥ 2 is a linear prediction of *X*_{t} from the past data *X*_{1}, …, *X*_{t−1} (with the startup $\hat{X}_1 = E[X_1]$) given **η** and **ψ**, and *e*_{t} is its unconditional mean squared prediction error. The AR(1) model gives the prediction $\hat{X}_t = E[X_t] + \phi(X_{t-1} - E[X_{t-1}])$; this also gives *e*_{t} = *σ*^{2} for *t* ≥ 2 and *e*_{1} = *σ*^{2}(1 − *ϕ*^{2})^{−1}. Equation (8) is useful should one desire time series models “more complicated” than an AR(1).
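The innovations form in (8) is straightforward to code for the AR(1) case; the following sketch (ours) computes the logarithm of (8) without forming or inverting the *n* × *n* covariance matrix:

```python
import numpy as np

def ar1_loglik(x, mean, phi, sigma2):
    """Log of the innovations-form likelihood (8) for the Gaussian AR(1) model:
    Xhat_1 = E[X_1] with e_1 = sigma2 / (1 - phi^2), and for t >= 2
    Xhat_t = E[X_t] + phi * (X_{t-1} - E[X_{t-1}]) with e_t = sigma2."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    e = np.full(len(x), sigma2)
    e[0] = sigma2 / (1.0 - phi**2)
    xhat = mean.copy()
    xhat[1:] += phi * (x[:-1] - mean[:-1])
    resid = x - xhat
    return -0.5 * np.sum(np.log(2.0 * np.pi * e) + resid**2 / e)
```

Because each one-step prediction depends only on the previous observation, this agrees with the direct multivariate normal evaluation in (7) while avoiding the matrix inversion.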

In obtaining $f(\mathbf{X} \mid \boldsymbol{\eta})$, it is tempting to plug a point estimate of **ψ** into the expression in (7). A natural candidate for this is the argument that maximizes the likelihood $f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi})$. Such a procedure, while intuitive, will favor a configuration that contains all times as changepoints. A better approach adopts a fully Bayesian setup, with **ψ** random and more diffusely distributed (than a point mass) over a high-dimensional space.

To proceed further, the prior distribution of **ψ** needs to be specified. Convenient choices take the regime means *μ*_{j} to be independent, identically distributed Gaussian variables with mean *μ*_{0} and variance *σ*^{2}*ν*_{0}. Here, the parameters *μ*_{0} and *ν*_{0} are prespecified. Therefore, the prior density of **μ** has the form

$$\pi(\boldsymbol{\mu}) = \prod_{j=1}^{m+1} (2\pi\sigma^{2}\nu_{0})^{-1/2} \exp\!\left[-\frac{(\mu_{j} - \mu_{0})^{2}}{2\sigma^{2}\nu_{0}}\right].$$

As little is assumed about the regime means **μ** before data collection, our prior distribution needs to be largely noninformative. We do this by taking *ν*_{0} as constant and relatively large, say 5. As this prior is less informative than the data, the choice has little impact on the posterior distribution. For *μ*_{0}, we choose the sample mean $\bar{X} = n^{-1}\sum_{t=1}^{n} X_{t}$.

No assumptions are made about the linear trend parameter *α*. Elaborating, the completely noninformative uniform prior distribution *π*(*α*) ∝ 1 is taken. This prior density does not integrate to unity; in fact, it has infinite mass (there is no true uniform distribution over all real numbers). However, it is a valid choice because it yields a proper posterior distribution. Readers are referred to Berger (1985) for more on improper Bayesian priors.

With these prior choices, appendix B derives an expression for the conditional marginal likelihood

$$f(\mathbf{X} \mid \boldsymbol{\eta}, \phi, \sigma^{2}) = \iint f(\mathbf{X} \mid \boldsymbol{\eta}, \boldsymbol{\psi})\,\pi(\boldsymbol{\mu})\,\pi(\alpha)\,d\boldsymbol{\mu}\,d\alpha \qquad (9)$$

up to a normalizing constant in a tractable form [see (B1)].

Finally, the AR(1) parameters *σ*^{2} and *ϕ* in (9) must be specified. As these are second-order parameters (they only arise in the variance structure of the data, not the mean), an empirical Bayes approach is used. Specifically, to obtain the empirical Bayes estimate of the marginal likelihood , point values of *σ*^{2} and *ϕ* that maximize the conditional marginal likelihood are derived and plugged back into the expression (see appendix C).

## 4. Computational issues

The number of changepoint configurations in a series of length *n* is 2^{n−1}. When *n* = 200, there are about 8 × 10^{59} distinct changepoint configurations. Recall that the optimal changepoint configuration has the largest posterior probability given the data. A naive way to identify the optimal configuration(s) is to evaluate the posterior probability of each configuration **η** and take the maximum. Such an exhaustive search is not possible in model spaces of this dimension; hence, some optimization techniques are necessary. It is also desirable in course to obtain the posterior distribution of the number of changepoints *m* and the marginal posterior probability that any specific time *t* is a changepoint, which is $P(\eta_t = 1 \mid \mathbf{X})$. The law of total probability gives

$$P(m = k \mid \mathbf{X}) = \sum_{\boldsymbol{\eta}} 1_{[m(\boldsymbol{\eta}) = k]}\; p(\boldsymbol{\eta} \mid \mathbf{X}) \qquad (10)$$

and

$$P(\eta_t = 1 \mid \mathbf{X}) = \sum_{\boldsymbol{\eta}} 1_{[\eta_t = 1]}\; p(\boldsymbol{\eta} \mid \mathbf{X}). \qquad (11)$$

Here, *m*(**η**) is the number of changepoints in the configuration **η** and $1_{[A]}$ is the indicator of the event *A*, taking the value of unity when *A* happens and zero otherwise.

Again, the summations above cannot be realistically conducted because the model space is of size 2^{n−1}. As an alternative, global stochastic search algorithms will be used to estimate the optimal changepoint configuration and the probabilities in (10) and (11). These algorithms are capable of identifying changepoint configurations with high posterior probabilities without considering all 2^{n−1} possible changepoint configurations.

Our basic computational method is an MCMC technique; in particular, a Metropolis–Hastings algorithm is used. MCMC techniques are ubiquitous stochastic search methods. In them, a Markov chain is constructed whose stationary distribution is the distribution under study, which here is the posterior distribution $p(\boldsymbol{\eta} \mid \mathbf{X})$. From the (*r* − 1)th to the *r*th iteration of the chain, a changepoint configuration **η**^{(r)} is simulated based on the previous configuration **η**^{(r−1)} and a strategically chosen Markov transition operation. As the iteration number *r* tends to infinity, the chain is guaranteed to converge in distribution to the posterior $p(\boldsymbol{\eta} \mid \mathbf{X})$. This allows us to study $p(\boldsymbol{\eta} \mid \mathbf{X})$ indirectly via simulation. The higher the posterior probability of a configuration **η**, the more likely it is to be visited by the chain at some time.

In short, a random draw from the posterior distribution can be taken as the Markov chain configuration **η**^{(r)} after a large number of iterations *r*. The methods allow us to approximate features of the posterior distribution in (10) and (11) via simulation. Elaborating, we generate the chain and use the empirical averages

$$\hat{P}(m = k \mid \mathbf{X}) = \frac{1}{R - B} \sum_{r=B+1}^{R} 1_{[m(\boldsymbol{\eta}^{(r)}) = k]} \qquad (12)$$

and

$$\hat{P}(\eta_t = 1 \mid \mathbf{X}) = \frac{1}{R - B} \sum_{r=B+1}^{R} 1_{[\eta_t^{(r)} = 1]} \qquad (13)$$

as estimates. Here, *R* is the number of chain iterations simulated. As a large *R* guarantees convergence from any starting point of the chain **η**^{(0)}, selection of **η**^{(0)} is not important. However, the simulation requires a burn-in starting period to allow the chain to converge and make the approximations in (12) and (13) accurate. Specifically, the first *B* + 1 iterations (**η**^{(0)}, …, **η**^{(B)}) are treated as the burn-in period and are not included in the averages in (12) and (13).

To select the optimal changepoint model(s), a Metropolis–Hastings algorithm is used to draw samples from the posterior distribution $p(\boldsymbol{\eta} \mid \mathbf{X})$. From the *r*th Markov chain configuration **η**^{(r)}, a new candidate configuration **η*** is simulated according to a given Markovian transition density $q(\boldsymbol{\eta}^{*} \mid \boldsymbol{\eta}^{(r)})$. This candidate is either “accepted” and moved to, or **η**^{(r)} is retained as the value of **η**^{(r+1)}. More precisely, the generated **η*** is taken (accepted) as **η**^{(r+1)} with probability

$$\min\!\left\{1,\; \frac{p(\boldsymbol{\eta}^{*} \mid \mathbf{X})\; q(\boldsymbol{\eta}^{(r)} \mid \boldsymbol{\eta}^{*})}{p(\boldsymbol{\eta}^{(r)} \mid \mathbf{X})\; q(\boldsymbol{\eta}^{*} \mid \boldsymbol{\eta}^{(r)})}\right\}. \qquad (14)$$

If **η*** is not accepted, then the new iteration **η**^{(r+1)} is set to **η**^{(r)}.

To rapidly search the space of multiple changepoint configurations, our methods intertwine two different types of transitions *q*(⋅|⋅): a component-wise updating similar to the integrated stochastic search variable selection method of George and McCulloch (1997), and a simple random swapping. Quantifying further, the initial chain value **η**^{(0)} is generated so that for every time *t* not in the metadata, *η*_{t} = 1 with a small probability, say 1/20 (as governed by the beta-binomial prior with parameters *a*_{1} = 1 and *b*_{1} = 19). For every documented changepoint time *t*, *η*_{t} = 1 with a relatively large probability, say ¼ (as governed by the beta-binomial prior with parameters *a*_{2} = 1 and *b*_{2} = 3). Suppose that the *r*th chain iteration is **η**^{(r)}. Then the update for the (*r* + 1)th iteration is as follows.

First, if *r* is odd, we “flip” one randomly selected changepoint component. Specifically, a time *t* is selected uniformly over {2, …, *n*} and this coordinate in the proposal **η*** is flipped: $\eta_t^{*} = 1 - \eta_t^{(r)}$. The rest of the components in **η*** remain unaltered. Probabilistically, this mechanism is a symmetric random walk, with $q(\boldsymbol{\eta}^{*} \mid \boldsymbol{\eta}^{(r)}) = q(\boldsymbol{\eta}^{(r)} \mid \boldsymbol{\eta}^{*}) = (n - 1)^{-1}$. It follows that the proposal **η*** is accepted with probability

$$\min\!\left\{1,\; \frac{p(\boldsymbol{\eta}^{*} \mid \mathbf{X})}{p(\boldsymbol{\eta}^{(r)} \mid \mathbf{X})}\right\}.$$

Second, if *r* is even, we swap one changepoint time (if one exists) with a nonchangepoint time (if one exists). This swapping keeps the chain from converging to local modes. We select a time *t* uniformly from the set of changepoint times and a time *k* uniformly from the set of nonchangepoint times. If either of these two sets is empty, then we default to **η**^{(r+1)} = **η**^{(r)}. Otherwise, the configurations at times *t* and *k* are swapped: $\eta_t^{*} = \eta_k^{(r)}$ and $\eta_k^{*} = \eta_t^{(r)}$. Here, the Markov transition is symmetric: $q(\boldsymbol{\eta}^{*} \mid \boldsymbol{\eta}^{(r)}) = q(\boldsymbol{\eta}^{(r)} \mid \boldsymbol{\eta}^{*}) = \{m(\boldsymbol{\eta}^{(r)})[n - 1 - m(\boldsymbol{\eta}^{(r)})]\}^{-1}$. After this swapping, the number of changepoints in the proposed configuration remains unchanged. Hence, **η*** is accepted as the new state with the same probability in (14).
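The flip and swap transitions can be sketched as follows. This toy implementation is ours (not the bcpmeta package); it takes an arbitrary unnormalized log posterior `log_target` and exploits the symmetry of both proposals:

```python
import numpy as np

def mh_step(eta, r, log_target, rng):
    """One Metropolis-Hastings update of the 0/1 configuration eta (times
    2..n). Odd iterations flip one random coordinate; even iterations swap
    a changepoint with a nonchangepoint. `log_target` returns
    log[f(X | eta) * pi(eta)] up to an additive constant."""
    prop = eta.copy()
    if r % 2 == 1:
        j = rng.integers(len(eta))       # flip move: symmetric proposal
        prop[j] = 1 - prop[j]
    else:
        ones = np.flatnonzero(eta == 1)  # swap move: keeps the changepoint count
        zeros = np.flatnonzero(eta == 0)
        if len(ones) == 0 or len(zeros) == 0:
            return eta                   # nothing to swap; retain eta
        prop[rng.choice(ones)] = 0
        prop[rng.choice(zeros)] = 1
    # Both proposals are symmetric, so (14) reduces to the posterior ratio.
    if np.log(rng.uniform()) < log_target(prop) - log_target(eta):
        return prop
    return eta
```

A full sampler simply iterates this step, discarding a burn-in period before forming the empirical averages in (12) and (13).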

The R package bcpmeta is developed to implement our methods, and is posted online (http://cran.r-project.org/web/packages/bcpmeta/index.html).

## 5. Simulation studies

To see how metadata influence detection power, our method is first studied via simulation. Here, a total of 1000 Gaussian AR(1) time series of length 100 are generated; each series has a mean shift at time *t* = 50 of varying size. A sequence of shift magnitudes from 0 to 0.6, incremented by 0.05, is examined. The AR(1) parameters are set to *ϕ* = 0.2 and *σ*^{2} = 0.025 here. There is no trend in any of the generated series and no trend is allowed in the changepoint detection approach. Each series is analyzed twice, once assuming absence of metadata and once assuming metadata that document a single change at *t* = 50. The top panel of Fig. 1 displays detection rates at the true changepoint time against the signal-to-noise ratio, measured as the ratio of the absolute shift magnitude to the time series standard deviation sd(*X*_{t}) = [*σ*^{2}/(1 − *ϕ*^{2})]^{1/2} ≈ 0.161. Each point in the graphic is aggregated from 1000 independent series. As expected, one is more likely to flag changepoint times with metadata. When the shift magnitude is 0.15 (a signal-to-noise ratio of 0.93), incorporating the metadata increases the detection rate over sevenfold, from 11.4% to 81.5%. The same detection rate cannot be achieved without metadata until the shift magnitude reaches 0.5 (a signal-to-noise ratio of about 3.10). As the mean shift becomes large, the changepoint is accurately flagged with or without metadata. Using metadata does not appear to substantially increase false alarm rates: when the mean shift is absent (i.e., of size zero), the false detection rate with metadata is 0.8%, reasonably close to the 0.1% rate achieved without metadata.
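The quoted standard deviation and signal-to-noise ratios follow directly from the AR(1) parameters:

```python
from math import sqrt

phi, sigma2 = 0.2, 0.025
sd = sqrt(sigma2 / (1 - phi**2))  # stationary AR(1) standard deviation
snr_small = 0.15 / sd             # for a shift of magnitude 0.15
snr_large = 0.5 / sd              # for a shift of magnitude 0.5
print(round(sd, 3), round(snr_small, 2), round(snr_large, 2))  # 0.161 0.93 3.1
```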

To see how a trend affects changepoint detection, a linear trend with *α* = 0.001 was added to each generated series, and the model was refitted allowing a linear trend component. The bottom panel of Fig. 1 again shows a significant increase in the detection powers with metadata incorporation. When the signal-to-noise ratio is large—say, above 2 with metadata or 3 without metadata—detection power is virtually unchanged from the no-trend case. However, when the signal-to-noise ratio is relatively small, the shift seems to get confounded with the linear trend, which drops detection rates. For example, when the shift magnitude is 0.15 (a signal-to-noise ratio of 0.93) and a trend is allowed, detection rates drop to 31.7% and 2.1% with and without metadata, respectively. These rates are about two-fifths and one-fifth of those in the no trend case, respectively. Here, incorporating metadata lessens the drop in detection power.

We now move to the multiple changepoint setting. Here, the efficacy of the methods is demonstrated in cases where the true number of changepoints and their locations are known. Thousands of time series with known changepoint configurations are simulated and results from our methods are reported. Our simulation parameters are selected to mimic the New Bedford annual precipitation series analyzed in section 6 and can be compared to those in Li and Lund (2012). Specifically, each simulated series below is of length *n* = 200. As annual temperatures or precipitation totals (the latter after taking a logarithm) are often approximately Gaussian distributed with some slight autocorrelation, each series is generated from a Gaussian AR(1) process with autocorrelation coefficient *ϕ* = 0.2. The white noise variance is taken as *σ*^{2} = 0.025. For simplicity, the trend component is ignored in both data generation and model fitting. For each simulated series, an MCMC chain with *R* = 10 000 iterations is generated, and the first *B* = 2000 iterations are discarded as burn-in. These choices of *R* and *B* ensure MCMC convergence in this setting.

To compare how our methods perform with and without metadata, each simulated series is analyzed twice. Our first analysis assumes that metadata do not exist. Our second analysis posits six metadata changepoints at the times *t* = 25, 50, 85, 125, 170, and 185. Three metadata times per century corresponds to a 50/50 mix of documented and undocumented changepoints under the six-changepoints-per-century frequency quoted in Mitchell (1953). Three distinct changepoint/mean shift configurations will be studied, representing various scenarios. For each scenario, our simulations generate 1000 independent series. The data are generated with a random number generation seed set to unity (code details are contained in the R package bcpmeta). To assess sensitivity to parameter choices, our methods are implemented under a variety of *b*_{2} values; *b*_{1} is set to 19 throughout.

### a. Scenario 1: No changepoints

Our first scenario is a baseline case with no true changepoints—the series mean is zero at all times. In this case, the correct *m* is zero and, ideally, no changepoints should be present in our estimated changepoint configurations. The first panel in Fig. 2 shows a sample realization of such a series. The first block in Table 1 shows the empirical percentages of estimated *m* values aggregated from analysis of the 1000 series. Without metadata, our methods correctly estimate zero changepoints 99.4% of the time. In the other six series, the methods flagged one spurious changepoint.

When the same 1000 series are reanalyzed with the six times listed above declared as metadata times (under the default value *b*_{2} = 3), our methods correctly estimate zero changepoints in 98.5% of the runs. In the other 1.5% of the series, most of the falsely detected changepoint times occur at metadata times. Even though the six documented metadata times have higher a priori changepoint probabilities, the data do not support detection of a changepoint in the posterior. Clearly, our methods have a low false alarm rate. The performance is stable under different choices of *b*_{2}. For example, if prior beliefs are that half of metadata times are true changepoints, then *b*_{2} = 1 and the percentage of correct detection (*m* = 0 or no changepoints) stays high at 97.5%.

### b. Scenario 2: A monotone increasing pattern

Scenario 2 deals with changepoints, considering three mean shifts acting in an “up, up, up” pattern; the shifts move the series mean successively higher. The second panel of Fig. 2 shows a sample generated series. This shift configuration is thought to be easier for segmentation algorithms to detect than shifts that move in different directions, such as up, down, up (Li and Lund 2012). The three shifts are evenly placed in the record at times *t* = 50, 100, and 150. The four regime means are taken as 0, 0.2, 0.4, and 0.6, respectively—all mean shifts are equal to 0.2 in magnitude. The top panel in Fig. 3 shows that with no metadata, our methods detect the three true changepoints in their exact locations with rates 26.6%, 26.3%, and 30.4%, respectively. Also, in many simulations, a changepoint time is only slightly misestimated; for example, the optimal estimated configuration may place the time 50 changepoint at time 48. There do not appear to be any biases in the changepoint time estimates. These detection rates seem promising and are comparable with the simulation results in Li and Lund (2012), which were obtained by other methods.
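The scenario's mean structure is a piecewise-constant function added to the AR(1) noise; a small helper makes the construction explicit (the function name and 0-based indexing convention are ours):

```python
import numpy as np

def step_means(n, tau, mu):
    """Piecewise-constant mean function of length n: the series has
    regime means mu[0], mu[1], ... with regime i+1 starting at index tau[i]."""
    means = np.full(n, mu[0])
    for t, m in zip(tau, mu[1:]):
        means[t:] = m  # shift the mean from index t onward
    return means

# scenario 2: shifts at times 50, 100, 150; regime means 0, 0.2, 0.4, 0.6
mu_t = step_means(200, [50, 100, 150], [0.0, 0.2, 0.4, 0.6])
```

A simulated scenario 2 series is then the AR(1) noise plus `mu_t`.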

Next, a reanalysis of the same 1000 scenario 2 series is made positing that a metadata record indicates six changes at times 25, 50, 85, 125, 170, and 185. In truth, the only metadata time that is a changepoint is *t* = 50. The bottom panel in Fig. 3 shows that the *t* = 50 changepoint time is now detected in its exact location in 82.5% of the runs. Essentially, the metadata documentation has increased the *t* = 50 changepoint time detection probability threefold. Also, the metadata have increased the precision of the estimated changepoint time: times adjacent to *t* = 50 are not likely to be flagged as changepoints. The two other true changepoint times at *t* = 100 and 150 are respectively detected with probabilities 24.2% and 27.5%, slight decreases from the values in the upper panel.

For the metadata times that are not true mean shift times, the chances of erroneously flagging them as changepoint times increase (compared to the case with no metadata), especially if they lie close to true changepoint times. This is the case at *t* = 85 and 125, which is expected since documented changepoint times have higher prior probabilities than undocumented ones. Fortunately, the increase in this false alarm rate is only slight. The time *t* = 185 listed in the metadata, which is far away from an actual changepoint time, is only erroneously declared a changepoint at rate 2.2%. Overall, the algorithm seems to yield a satisfactory performance.

The second block in Table 1 reports how many changepoints in total are detected; the correct number is *m* = 3. Without metadata, this is estimated correctly 65.5% of the time. As the table shows, the methods tend to underestimate the number of changepoints, with two (one less than the true number) being about 7 times more likely to be estimated than four (one more than the true number). When metadata are used, the estimated number of changepoints increases. In particular, the probability of obtaining the correct *m* = 3 increases to about 68.1%. The analyses are repeated under values of *b*_{2} = 2, 3 (default), and 4, which respectively correspond to prior beliefs that about one-third, one-fourth, and one-fifth of metadata times induce true mean shifts. As *b*_{2} increases, the number of changepoints detected drops slightly. The methods appear effective and robust for a wide range of parameter choices.
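The stated interpretations of *b*_{2} are consistent with a Beta(1, *b*_{2}) prior on the probability that a metadata time is a true changepoint, whose mean is 1/(1 + *b*_{2}) (the Beta(1, *b*) form is our reading of the stated fractions, not a quotation of the paper's prior):

```python
from fractions import Fraction

def beta_mean(a, b):
    # mean of a Beta(a, b) distribution is a / (a + b)
    return Fraction(a, a + b)

# with a = 1 fixed, the prior mean is 1 / (1 + b2)
prior_means = [beta_mean(1, b2) for b2 in (1, 2, 3, 4)]
```

So *b*_{2} = 1, 2, 3, 4 encode prior beliefs that about one-half, one-third, one-fourth, and one-fifth of metadata times are true changepoints, matching the interpretations quoted in the text.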

### c. Scenario 3: Nonmonotone shifts

Simulation scenario 3 again considers three true mean shifts, but alters the mean shift locations and pattern from scenario 2. Specifically, the mean shifts act in a “down, up, down” manner and are placed at times *t* = 25, 75, and 100. The regime means are 0, −0.2, 0.2, and 0, respectively. The third panel in Fig. 2 shows a sample realization of this series. Figure 4 summarizes the results of the simulations. Without metadata, the mean shift at time *t* = 75 is flagged at its exact time with probability 70.6%, which is around 3 times more likely than the other two changepoints. This is expected since the shift at time *t* = 75 is twice as large as the other two shifts. The bottom block in Table 1 displays frequencies of the estimated number of changepoints. The methods estimate the correct number of changepoints 62.3% of the time and again tend to underestimate the number of changepoints.

When metadata are included, detection power for the *t* = 25 changepoint increases roughly threefold. The detection power at the two other true changepoint times *t* = 75 and 100 (both undocumented) remains virtually the same as in the case without metadata. Indeed, with metadata, the *t* = 25 changepoint, despite having half the mean shift magnitude, is detected more often than the undocumented *t* = 75 changepoint. The bottom block in Table 1 shows an increase in the estimated number of changepoints when metadata are used. The frequency of correctly estimating *m* increases to 70.6%. In addition, results for various *b*_{2} are roughly equivalent; the default value *b*_{2} = 3 seems reasonable.

Compared with scenario 2, our methods have an overall lower false alarm rate among the metadata times that are not true changepoints. In particular, for the two documented times *t* = 170 and 185 that lie far away from the true changepoint times, their false alarm rates are only 1.0% and 1.2%, respectively.

So far, the methods seem to work well. The detection probabilities in undocumented settings are comparable to those in Li and Lund (2012), which are obtained from different methods. Using metadata substantially increases detection power.

### d. Adding a background linear trend

In the above three changepoint scenarios, our detection method was studied without a trend in the data generation and model fitting. In temperature and precipitation series, this is often the case when reference series are employed, because the regional trend (if it exists) is often eliminated in the target minus reference difference (or log-difference for precipitation). Should no suitable reference series exist, allowing a linear trend component in the detection approach may be desirable.

As in the single changepoint case with a trend (see Fig. 1), some confounding between shifts and trend in the multiple changepoint case may occur. Indeed, multiple monotone mean shifts (as in scenario 2 above) are a stepwise approximation to a linear trend. To investigate the detection power in the presence of a trend, a linear trend of *α* = 0.001 is added to each of the simulated series in all three scenarios, and our methods are again fitted, this time allowing for a trend. The bottom three panels in Fig. 2 display sample series with trends from each scenario.

When a trend is allowed and estimated in the changepoint detection procedure, the estimated values of *m* are shown in Table 2; trend parameter estimation performance is summarized in Table 3. Our overall finding is that allowing a trend makes changepoint detection less accurate. Furthermore, the detection powers decrease in a manner that depends on the scenario. In scenario 1, where there are no changepoints, the linear trend only slightly increases the false alarm rates. Our method correctly estimates zero changepoints in 98.9% and 97.7% of the runs, without and with metadata, respectively. It also yields an accurate estimate of the trend. Over 1000 independent series, the average posterior mean estimate of the trend based on the optimal configuration is *α̂* = 0.0010, virtually equal to the true trend. In addition, 98.6% (without metadata) and 98.5% (with metadata) of the time, the 95% credible intervals (Bayesian confidence intervals) for *α* do not contain zero, which suggests the linear trend is significantly nonzero.

For scenario 2, performance degrades. The middle block in Table 2 shows that it is difficult to distinguish between monotonic, evenly spaced shifts and a linear trend. Without metadata, zero changepoints are estimated in 94.1% of the runs; in contrast, the true number of changepoints *m* = 3 is detected in only 3.5% of the runs. Table 3 shows that the trend is typically overestimated: the average trend estimate is *α̂* = 0.0045, more than 4 times the true trend. The trend estimator is trying to compensate for the monotone shifts. Incorporating metadata slightly increases the accuracy of all quantities. Since the majority of runs detect no shifts, the detection rates at all three true changepoint times are very low, with or without metadata (not presented in the figure).
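The size of the overestimate is easy to reproduce arithmetically: the least-squares slope of the noise-free scenario 2 step signal is about 0.00375, and adding the true trend of 0.001 gives roughly 0.0047, close to the reported average of 0.0045. A quick check (our own computation, illustrating the confounding rather than reproducing the paper's fits):

```python
import numpy as np

n = 200
t = np.arange(1, n + 1)
# scenario 2 step means: 0, 0.2, 0.4, 0.6 over four 50-year regimes
mu = np.repeat([0.0, 0.2, 0.4, 0.6], 50)

# ordinary least-squares slope of the noise-free step signal on time
slope = np.sum((t - t.mean()) * (mu - mu.mean())) / np.sum((t - t.mean()) ** 2)
```

Here `slope` comes out near 0.00375, so a trend-only model absorbs most of the monotone shifts.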

In scenario 3, since the three true shifts move the series in a nonmonotonic down, up, down pattern, they are less likely (than in scenario 2) to be confounded with the linear trend. The bottom block of Table 2 shows that without metadata, *m* = 3 changepoints are detected in 29.1% of the runs. This accuracy is roughly half that of the no-trend case; hence, a trend again degrades detection. In 51.3% of the runs, a single changepoint (*m* = 1) is detected, and the detected changepoint is frequently located near time *t* = 75 (a true changepoint time). From Fig. 5, the detection rate at *t* = 75 is 71.0%, even slightly better than in the no-trend case, while the detection rates at the other two shifts at times 25 and 100 drop to 15.0% and 12.4%, respectively. Incorporating metadata increases the detection power of the documented change at *t* = 25 about threefold to 52.4%. The percentage of runs estimating *m* = 3 changepoints also increases to 35.5%. On the other hand, estimation of the linear trend is again unreliable: Table 3 reports an average trend estimate of *α̂* = −0.0001 (without metadata) or 0.0001 (with metadata); moreover, less than half of the time is the linear trend significantly nonzero at a credible level of 95%.

To conclude, adding a linear trend makes changepoint detection more difficult. This is expected: while a trend makes the model more flexible, it also induces more uncertainty in estimation and detection. In the absence of a good reference series, one needs to be careful about trend issues in homogenization; Gallagher et al. (2013) discuss this issue further.

## 6. The New Bedford precipitation series

### a. Changepoint detection on the target series

This section studies an annual precipitation series from New Bedford, Massachusetts. This series was also analyzed in Li and Lund (2012) using a minimum description length (MDL) penalized likelihood. The data are observed from 1814 to 2001. Observations for 1913 and 1995 were missing and were infilled (estimated); while our methods could in principle be modified to account for missing data, the task is somewhat unwieldy and little precision is lost by infilling when only a few observations are missing. Metadata for this station show six possible changepoint times: 1854, 1861, 1906, 1951, 1974, and 1985. The top panel of Fig. 6 shows the series after taking logarithms. The lognormal distribution is frequently used to model annual precipitation (Wilks 2006); taking a logarithm converts such data to a Gaussian time series.

Two Markov chains of length *R* = 1 000 000 (with burn in of length *B* = 200 000) are generated, one accounting for metadata and one ignoring it. A linear trend component is allowed in both computations. The top half of Table 4 lists the five changepoint configurations with the highest posterior probabilities as estimated by the chains. When metadata are used, the configuration with the highest posterior probability has three changepoints at years 1861, 1906, and 1967. The first two of these times are documented in the metadata. This configuration is our optimal estimate and is graphed against the series in the top panel of Fig. 6. It seems to follow the data well, with two changepoints acting to increase precipitation and one acting to decrease it. When metadata are ignored, the configuration with the highest posterior probability has changepoints at 1917 and 1967. Four other high posterior probability configurations are also listed in Table 4. As a whole, these best configurations listed in Table 4 have between two and four changepoints. The year 1967 is in every configuration.

For the optimal changepoint configuration with metadata, the estimated white noise variance is *σ̂*^{2} = 0.023. The estimated autocorrelation *ϕ̂* = 0.097 suggests a weak but nonzero correlation. As for the linear trend, its posterior mean estimate is *α̂* = −0.0007, which is slightly negative. However, a 95% posterior credible interval for the trend is (−0.0020, 0.0009). Since this interval contains zero, *α* does not appear to be significantly nonzero. Therefore, we also apply the model without a linear trend.

The bottom half of Table 4 lists the five best changepoint configurations for the New Bedford series without a linear trend. With metadata, the best changepoint configuration remains 1861, 1906, and 1967. However, without metadata, the best configuration places changepoints at years 1867, 1910, 1965, and 1967. This is also the optimal configuration found in Li and Lund (2012) using MDL techniques.

The optimal configuration in Li and Lund (2012) and the optimal configuration found here with metadata both have 1967 in common. The difference between the two optimal configurations demonstrates the impact of metadata. The first changepoint times in these two configurations are 1861 and 1867, just several years apart. As 1861 is a documented change, including metadata information through the prior distribution allows us to detect this year as a changepoint, compared with a less accurate detection in 1867 if metadata are ignored. Similarly, the second changepoint times detected in the two configurations are 1906 (when using metadata) and 1910 (without metadata), also more precise when metadata are utilized.

Posterior distributions for the number of changepoints and the probability that each point is a changepoint can be obtained from (12) and (13). With metadata and a linear trend, our posterior distribution for *m* is right skewed, with a mode of 4, a mean of 6.649, and a 95% credible interval between 2 and 12. With metadata and no linear trend, the posterior distribution for *m* is also right skewed, with a mode of 4, a mean of 6.940, and a 95% credible interval between 3 and 13. Here, removing the trend helps to detect more changepoints. The configurations in Table 4 also support this conclusion.

The bottom panel of Fig. 6 shows the marginal posterior probability (without a trend) that any specific time is a changepoint, aggregated from all changepoint configurations visited by the MCMC simulation after burn in. Marginally, 1967 is flagged as the most likely changepoint with a probability of 86.39%. Observe that the years appearing in the Table 4 optimal configurations (such as 1861, 1867, 1906, and 1910) also have relatively high probabilities of being changepoints. If one were seeking to add a fourth changepoint time, 1965 and 1832 would be good candidates.
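Given the retained MCMC draws of the binary changepoint indicator vector, both the marginal inclusion probabilities and the posterior distribution of *m* reduce to simple averages over draws; a generic sketch (variable and function names are ours):

```python
import numpy as np

def inclusion_probs(draws):
    """Marginal posterior probability that each time is a changepoint,
    estimated from post burn-in MCMC draws of the 0/1 indicator vector."""
    draws = np.asarray(draws)      # shape (num_draws, n)
    return draws.mean(axis=0)      # column averages

def posterior_m(draws):
    """Posterior draws of the number of changepoints m (row sums)."""
    return np.asarray(draws).sum(axis=1)

# toy example: three retained draws over a length-5 series
draws = [[0, 1, 0, 0, 1],
         [0, 1, 0, 0, 0],
         [0, 1, 1, 0, 1]]
probs = inclusion_probs(draws)  # time index 1 is flagged in every draw
```

A histogram of `posterior_m(draws)` gives the posterior of *m*; ranking unique rows of `draws` by frequency recovers the most visited configurations.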

### b. Changepoint detection on the target/reference series

Our multiple changepoint method cannot differentiate between artificial shifts (mean shifts induced by human causes such as gauge changes, station relocations, observer changes, etc.) and naturally caused climate shifts. For the New Bedford precipitation series, the mean shift detected in 1967 is plausibly explained by natural climate change involving a drought. According to the latest U.S. National Climate Assessment, 1967 coincides with a recovery from a sequence of historic dry years in New England; since then, precipitation has been increasing there.

To decide whether the 1967 change is due to natural or artificial reasons, our method is applied to the ratio between the target series (the New Bedford record) and a reference series. We examine ln(target/reference), which is ln(target) − ln(reference). A good reference series should experience similar climate to the target series and should be free of artificial changepoints (in which case, any shifts can be attributed to the target series). Target/reference comparisons often remove changes experienced by both stations, which tend to be natural changes, thereby emphasizing artificial shifts. In Li and Lund (2012), a target/reference changepoint analysis was conducted with a Boston, Massachusetts, reference. To avoid a changepoint analysis of the Boston series, we use the Massachusetts state average precipitation as a reference, which is available from 1895 to the present (the precipitation data are available online at http://www.ncdc.noaa.gov/cag/).

Our method was applied to the logarithm of the target/reference series. A Markov chain of length *R* = 1 000 000 (with burn in of length *B* = 200 000) was generated. When a linear trend component is allowed, the best changepoint configuration contains the two years 1917 and 1967. Here, the posterior mean estimate of the linear trend is *α̂* = −0.0003, with a 95% credible interval of (−0.0018, 0.0012). Since this interval contains zero, the trend does not appear significantly different from zero.

Thus, another Markov chain with the same lengths as above was generated without a trend. The optimal changepoint configuration again contains changepoints at years 1917 and 1967. For this configuration, the estimated parameters are *ϕ̂* = −0.0211 and *σ̂*^{2} = 0.0108. The top panel of Fig. 7 suggests that the model fits the data well. The two changepoints act to decrease precipitation in 1917 and increase it in 1967. Table 5 lists the five best configurations for the target/reference analysis with and without a linear trend. As the overlap in the target and reference records is only 1895–2001, it is not surprising that the target/reference analysis finds fewer changepoints than the target analysis alone. Interestingly, even though metadata have been taken into account, only one reported year, 1951, appears in the most likely configurations listed in Table 5. In the bottom panel of Fig. 7, the marginal posterior inclusion probability for 1967 is 67.79%, the highest of all time points. Even if real climate changes are indeed eliminated through reference comparisons (this is always debatable), 1967 is still concluded to be an artificial changepoint time.

It is not always clear whether to include trends in the model. For this reason, we ran analyses with and without a trend. Target/reference comparisons often eliminate a regional trend. Should one elect to include trends, one could also allow the trend slope *α* to change with changing segments. It is not clear how often such a scenario is encountered in practice.

## 7. Summary and comments

This paper developed methods to incorporate metadata into multiple changepoint detection procedures for climatic time series. Bayesian procedures were used to construct a prior distribution of changepoint numbers and location times. This prior reflects the belief that documented metadata times are more likely to be true changepoint times than times not listed in the metadata. Our methods take the optimal configuration as the one with the highest posterior probability given the data. The methods allow for the presence of autocorrelation and a linear trend. When trends are present, they can severely degrade multiple changepoint detection methods. A Metropolis–Hastings algorithm was implemented to efficiently search a huge changepoint configuration space and quickly identify configurations with high posterior probabilities. Inferences about the total number of changepoints and marginal inclusion probabilities can be made. Application to both simulated series and a series of annual precipitation measurements from New Bedford demonstrates the extra detection power gained by using metadata.

Additional features can be built into our current approach with some effort. First, to handle monthly series, seasonal location parameters need to be added into (1). Second, our method can be generalized to non-Gaussian series such as count or extreme value series. This would change the Gaussian assumptions, say, to Poisson for tropical cyclone counts (Robbins et al. 2011) or to generalized extreme values (Lee et al. 2014) for temperature extremes. A third improvement lies with upgrading the prior. For example, if a particular gauge change type (e.g., a Stevenson screen to an automated sensor) is known to shift temperatures in a particular direction, this could be incorporated into the prior.

## Acknowledgments

Robert Lund’s research was supported by National Science Foundation Grant DMS 1407480.

### APPENDIX A

#### Calculation of *π*(*η*) as in (6)

From the probabilistic prior assumptions made about the changepoint configuration, the conditional prior density has the form

Combining this with the beta prior densities *π*(*ρ*_{1}) and *π*(*ρ*_{2}) in (5) and integrating the parameters *ρ*_{1} and *ρ*_{2} out gives the marginal prior density *π*(*η*).
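The displayed equations did not survive extraction; the following is a sketch of the standard beta-Bernoulli forms they take under the assumptions stated in the text: each documented time is a changepoint independently with probability *ρ*_{1}, each undocumented time with probability *ρ*_{2}, and each probability carries a beta prior. Writing *m*_{1} and *m*_{2} for the number of flagged changepoints among the *n*_{1} documented and *n*_{2} undocumented candidate times, and Beta(*a*_{i}, *b*_{i}) for the two priors (this notation is ours; the paper's (5) and (6) may differ in detail):

```latex
\pi(\eta \mid \rho_1, \rho_2)
  = \rho_1^{m_1} (1-\rho_1)^{\,n_1 - m_1}\,
    \rho_2^{m_2} (1-\rho_2)^{\,n_2 - m_2},
\qquad
\pi(\eta)
  = \prod_{i=1}^{2}
      \frac{B(a_i + m_i,\; b_i + n_i - m_i)}{B(a_i,\; b_i)} ,
```

where *B*(·, ·) denotes the beta function; the second display follows from the first by recognizing each integral over *ρ*_{i} as a beta integral.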

### APPENDIX B

#### Calculation of in the Gaussian AR(1) Case

Our parameterization specifies the normality

When the posterior distribution of parameters is of interest, this is equivalent to

Here, **1**_{n} = (1, …, 1)′ is an *n*-dimensional vector whose components are all unity. The latter representation is used here since it simplifies calculations. For simplicity, we write .

Integrating ** μ** out gives the density

Let . Then we can integrate *α* out (up to proportionality, which identifies the distribution since the density must integrate to unity) via

### APPENDIX C

#### Derivation of

This optimization problem is handled in two steps. First, for any value of *ϕ*, the optimal *σ*^{2} is found as

Fortunately, this optimum can be expressed in closed form. When treated as a function of *σ*^{2}, the conditional marginal likelihood in (B1) is proportional to the density function of an inverse gamma distribution:

It now follows that the optimum *σ*^{2} is the mode of this distribution:

Jensen’s inequality confirms that this solution is a true variance in that .

The optimal *σ*^{2} depends on the value of *ϕ* because the closed-form mode above involves quantities that change with *ϕ*. Hence, in the second step, the optimal *σ*^{2} is plugged back into (B1) to find the best *ϕ*. As this is a one-dimensional optimization problem, one can use built-in algorithms in computing software (e.g., the function optimize in R) to approximate it very precisely. In this manner, an approximation to the marginal likelihood (up to a normalizing constant) is obtained.
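The two-step scheme—profile out *σ*^{2} in closed form, then optimize the resulting one-dimensional criterion in *ϕ* numerically—can be sketched as follows. This Python analogue uses the plain maximum-likelihood profile for a zero-mean Gaussian AR(1) series, not the paper's exact marginal likelihood (B1); function names are ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def profile_sigma2(x, phi):
    """Closed-form maximizer of sigma^2 for fixed phi in a
    zero-mean Gaussian AR(1) likelihood."""
    n = len(x)
    rss = (1 - phi**2) * x[0]**2 + np.sum((x[1:] - phi * x[:-1])**2)
    return rss / n

def neg_profile_loglik(phi, x):
    """Negative log-likelihood with sigma^2 profiled out."""
    n = len(x)
    s2 = profile_sigma2(x, phi)
    ll = 0.5 * np.log(1 - phi**2) - 0.5 * n * np.log(2 * np.pi * s2) - 0.5 * n
    return -ll

# simulate a long AR(1) series with the section 5 parameters
rng = np.random.default_rng(1)
n, phi_true, s2_true = 2000, 0.2, 0.025
x = np.empty(n)
x[0] = rng.normal(scale=np.sqrt(s2_true / (1 - phi_true**2)))
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=np.sqrt(s2_true))

# step 2: one-dimensional search over phi (the R analogue is optimize())
res = minimize_scalar(neg_profile_loglik, bounds=(-0.99, 0.99),
                      args=(x,), method="bounded")
phi_hat = res.x
sigma2_hat = profile_sigma2(x, phi_hat)
```

The same structure applies with (B1): replace `profile_sigma2` by the inverse gamma mode and `neg_profile_loglik` by the negative log of the conditional marginal likelihood.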