Abstract

Identification and analysis of temporal trends and low-frequency variability in discrete time series is an important practical topic in the understanding and prediction of many atmospheric processes, for example, in the analysis of climate change. Widely used numerical techniques of trend identification (like local Gaussian kernel smoothing) impose some strong mathematical assumptions on the analyzed data and are not robust to model sensitivity. The latter issue becomes crucial when analyzing historical observation data with a short record. Two global robust numerical methods for trend estimation in discrete nonstationary Markovian data, based on different sets of implicit mathematical assumptions, are introduced and compared here. The methods are first compared on a simple model example; the importance of the mathematical assumptions on the data is explained, and the numerical problems of local Gaussian kernel smoothing are demonstrated. The presented methods are then applied to the analysis of the historical sequence of atmospheric circulation patterns over the United Kingdom between 1946 and 2007. It is demonstrated that the influence of the seasonal pattern variability on the transition processes is dominated by the long-term effects revealed by the introduced methods. Despite the differences in the mathematical assumptions implied by the two presented methods, almost identical symmetric changes of the cyclonic and anticyclonic pattern probabilities are identified in the analyzed data, with confidence intervals smaller than those of the local Gaussian kernel smoothing algorithm. The analysis results are investigated with respect to model sensitivity. Finally, the implications of the discussed strategies for the long-range predictability of the data-fitted Markovian models are discussed.

1. Introduction

Many real-life processes can be described as simplified discrete models switching between a finite number of states or regimes. Such processes can be found in biophysics (transitions between different conformations in biomolecules; Schütte and Huisinga 2003), in computational finance (transitions between different market phases; Hamilton 1989), and in weather/climate research (transitions between different atmospheric circulation regimes; e.g., Majda et al. 2006; Horenko 2008; Horenko et al. 2008a,b).

However, most of the available time series data from such processes share the following two properties: (i) the data have only a short record (because observations are usually available only over relatively short time intervals), and (ii) the underlying dynamics usually has a temporal trend and cannot be assumed to be stationary (e.g., in the Markovian case, the transition matrix cannot a priori be assumed to be time independent). Moreover, important practical problems frequently arise when analyzing nonstationary data, particularly with regard to the comparison of different trend models, trend hypothesis tests, and the influence of the mathematical assumptions associated with the choice of the model on the analysis results (robustness).

One of the most frequently used techniques for the analysis of nonstationary observation time series is the local Gaussian kernel smoothing approach (Loader 1996, 1999; Zhao and Wei 2003). The main idea of this approach is a locally stationary approximation of estimated quantities (like observable averages) or of some estimated model parameters inside a Gaussian window of a certain width, typically characterized by the respective variance parameter σ2 [for geophysical applications of the method see, e.g., Anderson (2003), Xiong et al. (2006), and Dakos et al. (2008)]. The window width parameter σ2 is typically chosen a priori to characterize some slow intrinsic time scale of interest. However, the main difficulty of this approach lies in its locality; that is, the information used to estimate the quantities of interest at a certain time point is acquired only locally, from the inner part of the Gaussian window around this point.

This paper considers the problem of trend estimation for dynamical processes with a discrete finite state space based on given observation data. In the context of atmospheric applications, such processes can describe the transitions between certain predefined circulation regimes or patterns [like the Lamb circulation index (Lamb 1972) or blocked and unblocked situations in the atmosphere (Majda et al. 2006)]. The considered processes are assumed to fulfill the Markov property; that is, the probability of transition to any other available state at any time depends only on the current state and is independent of the previous transition history.

This paper introduces two global approaches to trend identification in discrete Markov time series and compares them to the local Gaussian kernel smoothing technique adapted here for Markov chain series estimation. One of the presented methods is a parametric approach, allowing us to test simple trend models of different types and compare them with each other in a Bayesian sense. The second technique is of a nonparametric form and allows us to estimate low-frequency trends by solving a metastable clustering problem with a fixed number of cluster states. It is explained how the optimal number of clusters can be identified dependent on the single external scalar regularization parameter ε2. The connection among the regularization parameter, the persistence of the resulting decomposition, and the Gaussian window width parameter σ2 is discussed. It is shown that in the context of Markovian processes, the presented nonparametric method can be viewed as an adaptive extension of the local Gaussian kernel smoothing technique. Numerical examples demonstrate that in contrast to the nonparametric local Gaussian kernel smoothing technique (Loader 1996, 1999; Zhao and Wei 2003), both of the presented methods allow for a more robust estimation and comparison of different nonstationary Markov models.

It is shown that the introduced methods can be helpful in practical applications; for example, the presented algorithms are applied to the analysis of the historical sequence of atmospheric circulation patterns over the United Kingdom between 1946 and 2007. The long-term effects in the transitions between different patterns are systematically investigated and compared.

The paper begins with a short description of the nonparametric local Gaussian kernel smoothing in the context of Markov chains, followed by a description of both global methods. Special emphasis is placed on the intrinsic mathematical assumptions in each of the methods. The final sections deal with application of the presented techniques to the analysis of illustrative model data and atmospheric data series, interpretation of the obtained results, and comparison of different methods with respect to their robustness.

2. Methods and approach

In the following we will consider a data series that is discrete in time and space; that is, {Xt}t=1, … ,T takes values from some fixed set of m distinct quantities s1, … , sm. These quantities can denote, for example, the states or configurations of the observed system along the dynamics. The process underlying the observations is called Markovian if the probability P of any current state of the process at time t depends only on the previous state at time t − 1 and does not depend on any other previous state. Mathematically, this property can be expressed as P[Xt = sj|X1, X2, … , Xt−1 = si] = P[Xt = sj|Xt−1 = si] = Pij(t), where Pij(t) denotes the probability of transition from state si to state sj in one step at time t. These probabilities can be put together into an m × m stochastic transition matrix 𝗣(t); that is, Σj=1,…,m Pij(t) = 1 for any t and i. To be able to estimate the a priori unknown transition probabilities Pij(t) based on the observation data Xt, we first need to introduce the concept of the observation probability or likelihood. For example, consider a simple discrete Markov process with three possible states s1 = 1, s2 = 2, and s3 = 3, available in the form of the following observed time series:

 
formula

Applying the Markov property, it is easy to demonstrate that the probability of observing this time series can be expressed as a probability of starting in state 1, then staying in state 1 at t = 2, then jumping from state 1 to state 2 at t = 3, etc. That is,

 
formula

For any given Markovian series {X1, … , XT}, the corresponding observation probability (or likelihood) can be compactly written as Πi,j Πt∈{tij} Pij(t), where {tij} is the set of all time instances at which transitions between si and sj are observed. This means that if the transition matrix is unknown, it can be found by maximization of the above likelihood function for a fixed observation sequence {X1, … , XT}, for instance, by solving the following maximization problem:

 
max𝗣(t) Πi,j Πt∈{tij} Pij(t).   (3)

From the numerical point of view, instead of solving the above maximization problem, it is much more convenient to maximize the logarithm of the above expression, i.e., the log-likelihood of the Markovian process:

 
L[𝗣(t)] = Σi,j Σt∈{tij} log Pij(t) = Σi=1,…,m Li[𝗣(t)] → max𝗣(t),   (4)

where

 
Li[𝗣(t)] = Σj=1,…,m Σt∈{tij} log Pij(t)   (5)

is the partial log-likelihood of state i. To preserve the stochasticity of the resulting matrix 𝗣(t), the maximization problem (4) is subject to the following constraints:

 
Σj=1,…,m Pij(t) = 1,  Pij(t) ≥ 0,  for all i, j, and t.   (6)

The main difficulty arising from the formulation (4)–(6) is that it is not a well-posed problem. If the transition matrix is allowed to be completely time dependent, then the number of unknowns to be estimated is equal to m2T, whereas the number of observed transitions and equality constraints is m(T + 1) − 1. The optimizer of the problem (4)–(6) at time t in such a case will be 𝗣XtXt+1(t) = 1 and 𝗣Xtk(t) = 0 for all other elements k = 1, … , m from the same row Xt of the transition matrix 𝗣(t). All other elements of the matrix can be set to arbitrary values (satisfying the constraints), resulting in a meaningless estimation. Because the problem (4)–(6) is ill posed, one needs to incorporate some additional information into its formulation (or, in mathematical language, to regularize the problem) to make it well posed. The simplest form of regularization is an additional assumption of global stationarity of the Markov process [i.e., the transition matrix 𝗣(t) is assumed to be time independent].
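In the globally stationary case, the maximum likelihood estimator reduces to normalized transition counts. A minimal numpy sketch of this regularized estimator (the function name and the example sequence are illustrative, not from the paper):

```python
import numpy as np

def stationary_mle(seq, m):
    """Maximum likelihood estimate of a time-independent transition
    matrix for a sequence with states 0..m-1: observed transition
    counts N_ij, normalized row-wise by the number of visits to i."""
    counts = np.zeros((m, m))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # Rows of states never visited are left uniform so P stays stochastic.
    return np.where(rows > 0, counts / np.maximum(rows, 1), 1.0 / m)

seq = [0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 0]
P = stationary_mle(seq, 3)
```

The estimator uses all T − 1 observed transitions at once, which is exactly the globality that the nonstationary methods below try to retain while relaxing the stationarity assumption.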

One straightforward possibility to relax the aforementioned global stationarity assumption is based on the application of the local Gaussian kernel smoothing idea (Loader 1996, 1999; Zhao and Wei 2003) in the context of the optimization problem (4)–(6). Assuming local stationarity of the underlying transition process at time t0 inside a window of a certain width σ2, one can introduce a normalized Gaussian weight function γ(t, t0) = (1/c) exp[−(t − t0)2/σ2] (where c is the normalization constant). Approximating L[𝗣(t)] as

 
L[𝗣(t0)] ≈ Σi,j Σt∈{tij} γ(t, t0) log Pij(t0)   (7)

and applying the method of Lagrange multipliers, we get

 
Pij(t0) = Σt∈{tij} γ(t, t0) / Σt∈{ti} γ(t, t0),   (8)

where {ti} is the set of all time instances when the state si was visited.
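The locally stationary estimator can be sketched as Gaussian-weighted transition counts; a minimal numpy illustration under the assumption that the weights enter exactly as in the estimator above (function name, sequence, and parameter values are invented here):

```python
import numpy as np

def local_transition_matrix(seq, m, t0, sigma2):
    """Locally stationary estimate of P(t0): transition counts weighted
    by a normalized Gaussian window of variance sigma2 centered at t0."""
    seq = np.asarray(seq)
    t = np.arange(len(seq) - 1)          # times of observed transitions
    w = np.exp(-(t - t0) ** 2 / sigma2)
    w /= w.sum()                         # the normalization constant 1/c
    P = np.zeros((m, m))
    np.add.at(P, (seq[:-1], seq[1:]), w)   # weighted transition counts
    rows = P.sum(axis=1, keepdims=True)
    return np.where(rows > 0, P / np.maximum(rows, 1e-300), 1.0 / m)

seq = [0, 1, 0, 1, 2, 2, 1, 0, 1, 2, 0, 0]
P5 = local_transition_matrix(seq, 3, t0=5, sigma2=4.0)
```

Repeating the call for every t0 traces out the full time-dependent estimate 𝗣(t); note how only transitions near t0 carry appreciable weight, which is the locality discussed above.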

a. Trend models of discrete Markov chains

The two main disadvantages of the local Gaussian kernel smoothing procedure described above are (i) the assumption of local stationarity of the transition process and (ii) the arbitrariness in the choice of the window width parameter σ. Furthermore, this procedure does not directly provide a constructive functional form of the trend in the analyzed data. This means that to make predictions based on the available time series information, one has to extrapolate the identified transition process 𝗣(t) into the future. To do that, the identified process 𝗣(t) has to be approximated with a time-dependent function of a certain class ϕ(t). Alternatively, one can impose the functional form ϕ(t) a priori and incorporate it into the maximization of the log-likelihood in the form of a trend model.

Single trend models have a general form

 
𝗣(t) = 𝗣(0) + 𝗣(1)ϕ(t),   (9)

where ϕ = ϕ(t) is some predefined bounded trend function. Inserting (9) into (4) and (6), after some obvious transformations for any i = 1, … , m, we get

 
Σj=1,…,m Σt∈{tij} log[Pij(0) + Pij(1)ϕ(t)] → max,   (10)
 
Σj=1,…,m Pij(0) = 1,   (11)
 
Σj=1,…,m Pij(1) = 0,   (12)
 
Pij(0) + Pij(1)ϕ(t) ≥ 0  for all j and t,   (13)
 
Pij(0) + Pij(1)ϕ(t) ≤ 1  for all j and t.   (14)

Problem (10) for a fixed observation sequence {X1, … , XT} and some given trend function ϕ is a 2m-dimensional concave maximization problem with respect to the elements of the ith row of the matrices 𝗣(0) and 𝗣(1). The maximization is performed on a convex domain defined by the linear equality and inequality constraints (11)–(14). Therefore, the problem (10)–(14) has a solution that can be found numerically by applying any available nonlinear maximization algorithm with linear constraints [e.g., the Nelder–Mead optimization algorithm (Nelder and Mead 1965)]. Note that for the multiple trend model 𝗣(t) = 𝗣(0) + Σi𝗣(i)ϕi(t), the analogs of the inequality constraints (13) and (14) become nonlinear. This makes the numerical procedure much more costly; moreover, the convexity of the domain defined by the constraints is no longer guaranteed. For these numerical reasons, in the following we stick to the single trend model defined by (9), despite the fact that multiple trend models have more skill in describing nontrivial scenarios.
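As a sketch, the partial log-likelihood of one row under the single trend model can be written down and maximized by brute force; a coarse grid search stands in here for the constrained nonlinear solver, and the 2-state sequence, the linear trend ϕ(t) = t/T, and the parameter ranges are all illustrative assumptions (the parameterization p1 = (b, −b) encodes the requirement that the rows of 𝗣(1) sum to zero):

```python
import numpy as np

def row_loglik(p0, p1, phi, seq, i):
    """Partial log-likelihood of row i under P_ij(t) = p0_j + p1_j*phi(t);
    returns -inf whenever the stochasticity constraints are violated."""
    ll = 0.0
    for t in range(len(seq) - 1):
        if seq[t] == i:
            p = p0[seq[t + 1]] + p1[seq[t + 1]] * phi(t)
            if p <= 0.0 or p > 1.0:      # outside the feasible set
                return -np.inf
            ll += np.log(p)
    return ll

# Illustrative 2-state sequence and a linear trend phi(t) = t / T.
seq = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
T = len(seq)
phi = lambda t: t / T

# Coarse grid search for row 0: p0 = (a, 1 - a), p1 = (b, -b).
grid = [(a, b) for a in np.linspace(0.05, 0.95, 19)
               for b in np.linspace(-0.5, 0.5, 21)]
a, b = max(grid, key=lambda ab: row_loglik(
    np.array([ab[0], 1 - ab[0]]), np.array([ab[1], -ab[1]]), phi, seq, 0))
```

The feasibility check inside `row_loglik` plays the role of the inequality constraints: parameter pairs that drive any transition probability outside [0, 1] are simply assigned likelihood −∞.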

b. Trend identification with hidden states: Adaptive FEM clustering

Instead of looking for the trend in a certain class of parametric functions [e.g., in log-likelihood maximization with a single trend model; see Eq. (9)] or assuming local stationarity of the transition process (as in the case of the nonparametric Gaussian kernel smoothing), one can assume that the element-wise logarithm of the Markovian transition matrix in (4) at any time t can be represented as a convex linear combination of K time-independent logarithms log 𝗣i with some unknown time-dependent coefficients γi(t) (where K is some predefined number). In other words, in the space of the observed transitions xt, one can look for K clusters (or hidden states, because they are a priori unknown) characterized by K distinct sets of a priori unknown constant transition probability matrices 𝗣i ∈ ℝm×m, with the cluster distance functional of the form

 
g(xt, 𝗣i) = −log Pi,XtXt+1.   (15)

This cluster distance functional describes the quality of explaining the observed transition xt: Xt → Xt+1 at time t with the help of the transition matrix 𝗣i. For a given cluster distance functional (15), by data clustering we will understand the problem

 
minγ,𝗣1,…,𝗣K Σt=1,…,T−1 Σi=1,…,K γi(t)g(xt, 𝗣i),   (16)

subject to the constraints on γi(t)

 
Σi=1,…,K γi(t) = 1  for all t,   (17)
 
γi(t) ≥ 0  for all i and t,   (18)

and on 𝗣1, … , 𝗣K (6). As was demonstrated in Horenko (2009a,b), in order to obtain persistent (or metastable) clusters from this minimization procedure, it is enough to impose some metastability assumptions on the space of functions γi(·) and then apply a finite Galerkin projection of this infinite-dimensional Hilbert space. For example, one can restrict the weak discrete differentiability of the functions γi; that is,

 
‖∂tγi‖2L2 = Σt=1,…,T−2 [γi(t + 1) − γi(t)]2 ≤ C,  i = 1, … , K.   (19)

For a given observation time series, the above constraint limits the total number of transitions between the clusters and is connected to the metastability of the hidden process γi(t), i = 1, … , K (Horenko 2009b).

One possibility to incorporate the constraint (19) into the minimization of (16) is to introduce the Lagrange multiplier ε2:

 
Lε[γ, 𝗣1, … , 𝗣K] = Σt,i γi(t)g(xt, 𝗣i) + ε2 Σi=1,…,K ‖∂tγi‖2L2 → min.   (20)

This form of penalized regularization was first introduced by A. Tikhonov for the solution of ill-posed linear least squares problems (Tikhonov 1943) and has frequently been used for nonlinear regression analysis in the context of statistics (Hoerl 1962) and multivariate spline interpolation (Wahba 1990). In contrast to the aforementioned applications of Tikhonov-type regularization to interpolation problems (where the regularization controls the smoothness of some nonlinear functional approximation of the given data), the presented form of the regularized averaged clustering functional (20) has a completely different mathematical structure due to the form of the functional (16). This specific formulation of the optimization problem with constraints allows one to control the metastability of the assignment Γ(t) of the analyzed data to K distinct a priori unknown clusters (Horenko 2009b).

Let {1 = t1, t2, … , tN = T − 1} be a finite subdivision of the time interval [1, T − 1] with uniform time step δt. We can define a set of continuous functions {υ1(t), υ2(t), … , υN(t)}, called hat functions or linear finite elements, with compact support [i.e., such that each function υi(t) takes positive values in the time interval (ti−1, ti+1) and is zero outside] (Braess 2007). Assuming that γi ∈ h1(1, T − 1) (i.e., the space of functions whose first discrete derivative is square integrable in the discrete sense; cf. Braess 2007), we can write

 
γi(t) = Σk=1,…,N γ̃ik υk(t) + χN(t),   (21)

where γ̃ik are the discretized expansion coefficients and χN is a discretization error (it becomes zero if δt = Δt). The optimal N can be chosen adaptively to guarantee that χN does not exceed a certain discretization error threshold. Inserting (21) into the functional (20) and the constraints (17) and (18), we get

 
Σi=1,…,K [γ̃iᵀg(𝗣i) + ε2 γ̃iᵀ𝗛γ̃i] → minγ̃,𝗣1,…,𝗣K,   (22)
 
Σi=1,…,K γ̃ik = 1,  k = 1, … , N,   (23)
 
γ̃ik ≥ 0  for all i and k,   (24)
 
Σl=1,…,m (𝗣i)jl = 1  for all i and j,   (25)
 
(𝗣i)jl ≥ 0  for all i, j, and l,   (26)

where γ̃i = (γ̃i1, … , γ̃iN) is the vector of discretized affiliations to cluster i;

 
[g(𝗣i)]k = Σt=1,…,T−1 υk(t)g(xt, 𝗣i),  k = 1, … , N,   (27)

is a vector of discretized cluster distances; and 𝗛 is the symmetric tridiagonal stiffness matrix of the linear finite element set, with 2/δt on the main diagonal, −1/δt on both secondary diagonals, and zero elsewhere. The system of equations (22)–(26) is a nonlinear minimization problem with linear equality and inequality constraints, imposed on both γ̃i and 𝗣i, i = 1, … , K.
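The stiffness matrix 𝗛 can be assembled directly; a small numpy sketch (function name invented here), together with the quadratic smoothness penalty it induces on a discretized affiliation vector:

```python
import numpy as np

def fem_stiffness(N, dt):
    """Tridiagonal stiffness matrix of N linear finite elements on a
    uniform grid: 2/dt on the main diagonal, -1/dt on both secondary
    diagonals, zero elsewhere."""
    return (2.0 / dt) * np.eye(N) \
        + (-1.0 / dt) * (np.eye(N, k=1) + np.eye(N, k=-1))

H = fem_stiffness(5, 0.5)
gamma = np.linspace(0.0, 1.0, 5)      # a discretized affiliation vector
penalty = float(gamma @ H @ gamma)    # quadratic form entering the
                                      # eps^2-weighted regularization term
```

The quadratic form γ̃ᵀ𝗛γ̃ grows with the roughness of γ̃, so increasing ε2 pushes the minimization toward smoother, more persistent affiliations.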

If ε2 = 0, then the above minimization problem [(22)–(24)] can be solved analytically with regard to γ̃i(L) for a fixed set of transition matrices 𝗣1, … , 𝗣K (where L denotes the index of the current iteration), resulting in

 
γ̃ik(L) = 1 if i = arg minj [g(𝗣j(L))]k, and γ̃ik(L) = 0 otherwise.   (28)

If ε2 > 0, for a fixed set of transition matrices 𝗣1, … , 𝗣K, the minimization problem (22)–(24) reduces to a sparse quadratic optimization problem with linear constraints, which can be solved by standard tools of sparse quadratic programming (sQP) with computational cost scaling as O[N log(N)] (Gill et al. 1987).

The minimization problem (22) with transition matrix constraints (25) and (26) can be solved analytically with regard to the parameters 𝗣1, … , 𝗣K for a fixed set of discretized cluster affiliations γ̃i. Applying the method of Lagrangian multipliers results in the following expression for the transition matrices:

 
(𝗣k)ij = Σt∈{tij} γk(t) / Σt∈{ti} γk(t),  k = 1, … , K,   (29)

where {tij} ⊂ [1,T − 1] are the time instances when the transitions between the states si and sj are observed and {ti} is a set of all time instances when the state si is visited.

Note the obvious similarity between the expression (29) and the local transition matrix estimator (8) of Gaussian kernel smoothing. Despite this similarity, and although both methods rely on some form of stationarity assumption, the statistical weights γ(t, t0) are fixed and localized by the choice of a Gaussian curve of a certain width, whereas the hidden probabilities γi(t) are determined adaptively, dependent on the whole set of available data and not just on the part inside the window. In other words, expression (29) makes use of the global information contained in the cluster affiliation functions γi(t) (it is explained later in the text how the predefined discretization error threshold and the regularization parameter ε2 influence these values). The globality of the adaptive finite element method (FEM) clustering procedure is an advantage when analyzing data with a short time record because it tightens the confidence intervals around the estimated parameters and can help to obtain more robust numerical results.

Similarly to the algorithms described in Horenko (2009a,b), minimization of the functional (22) can be implemented as an iterative numerical scheme, alternating the two substeps described above. The iterations are repeated until the change of the functional value (22) falls below some predefined threshold.
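The iterative scheme can be sketched in the unregularized limit ε2 = 0, where both substeps have closed forms (hard assignment of each transition to its best cluster, then row-normalized weighted counts per cluster); this is a simplified illustration with random initialization and a fixed iteration count, not the paper's full FEM implementation, and all names are invented here:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_clusters(seq, m, K, iters=30):
    """Unregularized (eps^2 = 0) subspace iteration: alternate between
    (i) assigning each observed transition to the cluster whose matrix
    explains it best and (ii) re-estimating each cluster transition
    matrix from the transitions assigned to it."""
    seq = np.asarray(seq)
    P = rng.uniform(0.1, 1.0, size=(K, m, m))
    P /= P.sum(axis=2, keepdims=True)          # random stochastic start
    gamma = np.zeros(len(seq) - 1, dtype=int)
    for _ in range(iters):
        # substep (i): cluster distances -log P_i[x_t -> x_{t+1}]
        g = -np.log(np.maximum(P[:, seq[:-1], seq[1:]], 1e-12))
        gamma = np.argmin(g, axis=0)           # hard affiliations
        # substep (ii): per-cluster weighted transition counts
        for i in range(K):
            C = np.zeros((m, m))
            idx = gamma == i
            np.add.at(C, (seq[:-1][idx], seq[1:][idx]), 1.0)
            rows = C.sum(axis=1, keepdims=True)
            P[i] = np.where(rows > 0, C / np.maximum(rows, 1), 1.0 / m)
    return gamma, P

seq = [0, 0, 0, 1, 0, 0, 2, 2, 1, 2, 2, 2, 1, 2, 0, 0, 0, 1, 0]
gamma, P = fit_clusters(seq, 3, K=2)
```

With ε2 > 0 the first substep would instead be the sparse quadratic program described above, which smooths the hard affiliations into persistent functions γi(t).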

c. Estimation of optimal K dependent on ε2

The upper bound for the number of statistically distinguishable cluster states for each value of ε2 can be algorithmically estimated in the following way: starting with some a priori chosen (large) K, one solves the optimization problem (22)–(26) for a fixed value of ε2 and calculates the confidence intervals of the resulting local transition matrices 𝗣i, i = 1, … , K (this can be done by applying standard sampling procedures; see Noe 2008). If two of the estimated matrices have confidence intervals that overlap in all components, the respective clusters are statistically indistinguishable and the whole procedure must be repeated with K reduced by one. If at a certain point all of the matrices are statistically distinguishable, the procedure is stopped and Kmax(ε2) = K.
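The distinguishability check can be sketched with a parametric bootstrap: resample each row of the estimated matrices, form component-wise confidence intervals, and test for overlap in all components. The multinomial resampling below is a simple stand-in for the sampling procedure of Noe (2008), and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def ci_overlap(P1, P2, n_obs, n_boot=500, alpha=0.05):
    """True if the two transition matrices are statistically
    indistinguishable: bootstrap each row as n_obs multinomial draws,
    then check component-wise confidence-interval overlap."""
    lo, hi = [], []
    for P in (P1, P2):
        samples = np.stack([
            rng.multinomial(n_obs, row, size=n_boot) / n_obs for row in P
        ])                                # shape (m, n_boot, m)
        lo.append(np.quantile(samples, alpha / 2, axis=1))
        hi.append(np.quantile(samples, 1 - alpha / 2, axis=1))
    # intervals overlap in ALL components -> indistinguishable clusters
    return bool(np.all((lo[0] <= hi[1]) & (lo[1] <= hi[0])))

A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.2, 0.8], [0.7, 0.3]])
same = ci_overlap(A, A, n_obs=200)   # identical matrices: overlap
diff = ci_overlap(A, B, n_obs=200)   # well-separated matrices: no overlap
```

In the Kmax(ε2) procedure, a `True` result for any pair of cluster matrices triggers a rerun with K reduced by one.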

d. Robustness estimation

It is intuitively clear that the quality of the resulting reduced model is very much dependent on the original data, especially on the length of the available time series. The shorter the observation sequence, the larger the uncertainty of the resulting parameters. Therefore, the global methods presented above (the single trend model and FEM clustering) should be more robust than local methods (e.g., Gaussian kernel smoothing), increasingly so as the analyzed time series becomes longer. That is, for a short time series everything is local; but as the time series becomes long (in the sense that T ≫ σ, with σ defining the width of the Gaussian moving window), the global methods have the advantage of assimilating more information.

Let the Markovian transition matrix be estimated with one of the methods M described above from a given observation sequence {Xt}t=1, … ,T, and let ω index a single realization of the Markov process 𝗣(t) generated with the model M. To compare the robustness of different models for given data, we introduce the following estimate deviation process for a given function f:

 
formula

The Gaussian assumption for the stochastic process makes it possible to estimate its confidence intervals for some given function f straightforwardly. This can be done in a standard way using multivariate statistical analysis, that is, by Monte Carlo sampling from the respective distribution and calculating expectation values with respect to ω (Mardia et al. 1979). The confidence intervals calculated for this variable make it possible to estimate the intrinsic robustness of the chosen model M for a given observation {Xt}t=1, … ,T.

3. Application I: Instructive model example

To illustrate an application of the methods described above, it is instructive to investigate the following simplified model scenario: consider a process Xt with three directly observed possible states {1, 2, 3} and transition probabilities taking values from one of the two following transition matrices:

 
formula

The choice of the transition matrix, taken to calculate a probability of the next transition at time t, depends on another discrete hidden process Yt with two possible states {1, 2}. If the hidden process Yt takes the value 1, then matrix 𝗔1 is chosen to calculate transition probabilities of the observed process X at time t; otherwise, transition matrix 𝗔2 is taken. An example of the observation series Xt resulting from this computation procedure is demonstrated in the right panel of Fig. 1. The hidden process Yt used in this calculation is shown in the left panel of Fig. 1. The log-likelihood of the generated time series in terms of the switching Markov model (31) is Lreal = −708.67.
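A chain of this switching type is easy to generate; since the matrices in (31) are not reproduced here, the 3 × 3 stochastic matrices below are hypothetical stand-ins (states are indexed 0..2 instead of 1..3, and the hidden process is a simple two-switch example):

```python
import numpy as np

rng = np.random.default_rng(2)

def switching_chain(A1, A2, Y, x0=0):
    """Generate X_t: at each step the transition row is taken from A1
    when the hidden process Y_t = 0 and from A2 when Y_t = 1."""
    X = [x0]
    for y in Y:
        A = A1 if y == 0 else A2
        X.append(rng.choice(len(A), p=A[X[-1]]))
    return np.array(X)

# Hypothetical stand-ins for the matrices in (31): any two distinct
# 3x3 stochastic matrices will do for illustration.
A1 = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
A2 = np.array([[0.1, 0.45, 0.45], [0.45, 0.1, 0.45], [0.45, 0.45, 0.1]])
Y = np.repeat([0, 1, 0], 100)      # hidden process with two switches
X = switching_chain(A1, A2, Y)
```

Feeding such a synthetic sequence to the three estimation methods (with the hidden Y withheld) reproduces the kind of comparison carried out in this section.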

Fig. 1.

(left) Hidden process Yt switching between two different transition matrices (31) and (right) an example of the resulting discrete Markovian series Xt.

Next, the generated Markov time series from Fig. 1 is taken to compare the three presented methods with respect to the identification of the nonstationary trend of the Markov transition probabilities (driven by the hidden variable Yt). We start with the local Gaussian kernel smoothing procedure (8), repeat the estimation with different values of the Gaussian window width parameter σ2, and keep the parameter value with the highest log-likelihood (corresponding in this case to an effective window width of approximately 70 time units). The log-likelihood of the analyzed time series in terms of the local Gaussian kernel smoothing model with this optimal σ2 turns out to be LGauss = −717.34 (just 0.58% less than the log-likelihood of the original switching Markov model). Estimation of the single trend model (9) with ϕ(t) = tα results, as expected, in almost stationary transition matrix estimates. This is simply explained by the fact that models with a polynomial trend are a bad choice for approximating nonstationary processes in which the change of the transition probability matrices is described by a hidden process like the one in Fig. 1. Finally, we apply the adaptive FEM clustering procedure with different values of the regularization parameter ε2. The tolerance threshold χN (21) is chosen to be 0.0001 in all cases.

Figure 2 demonstrates the influence of the regularization parameter ε2 on the identified hidden process [described by the variables γi(t) resulting from the numerical solution of the optimization problem (22)–(26)]. Whereas for ε2 = 0 (corresponding to the unregularized optimization) we get an arbitrary hidden process, for ε2 = 0.2 the identified hidden process almost exactly reproduces the original process Yt used in the generation of the time series. Hidden processes identified with larger values of ε2 become smoother and smoother, up to the point where the regularization part of the functional (20) starts to dominate the optimization, producing a constant hidden process with no switches.

Fig. 2.

Comparison of the original hidden process from the left panel of Fig. 1 (solid) with the hidden processes identified from the respective observed sequence (see the right panel of Fig. 1) by adaptive FEM clustering (K = 2, with tolerance threshold χN = 0.0001). The minimization is performed with different values of ε2.

Figure 3 shows a comparison of the original transition probability trends as functions of time with the results obtained by the local Gaussian kernel smoothing model with the optimal σ2 (dotted; the robustness intervals are in light gray) and the trends obtained by the adaptive FEM clustering procedure with ε2 = 0.2 and K = 2 (dashed). The log-likelihood of the analyzed time series in terms of the adaptive FEM clustering procedure turns out to be LFEM = −703.21 [this is 0.67% bigger than the log-likelihood of the original switching Markov model (31)]. As can be seen from Fig. 3, the trends resulting from the adaptive FEM clustering procedure reproduce the original trends much better than the local Gaussian kernel smoothing trends do. This is not a big surprise because the internal structure of the analyzed time series is much better captured by the adaptive FEM clustering procedure. Figure 3 also demonstrates the effect of the locality of the Gaussian smoothing procedure: because the optimal Gaussian window turns out to be relatively narrow, only a relatively small amount of information is incorporated in the trend estimation at any point. This results in large confidence intervals for the estimated matrix elements. Comparison of the log-likelihood values Lreal = −708.67, LGauss = −717.34, and LFEM = −703.21 and inspection of the respective probability trends in Fig. 3 reveal that very small relative changes in the value of the log-likelihood can be induced by very significant changes of trend. This means that the relative value of the log-likelihood function alone cannot be considered a measure of model quality.

Fig. 3.

Comparison of Markov transition probabilities as functions of time: original transition probabilities (solid) used in the generation of the time series; probabilities estimated by the local Gaussian kernel smoothing with the value of σ2 defined by the variation of the log-likelihood maximization procedure (dark gray, dotted), together with their robustness intervals (light gray, dotted); and probabilities estimated for ε2 = 0.2 (K = 2, with tolerance threshold χN = 0.0001; dark gray, dashed). The robustness intervals for the FEM clustering method estimates are of the order of ±0.06.

Finally, the procedure of Kmax(ε2) identification described above is exemplified for this model data example. Figure 4 demonstrates that the maximal number of statistically distinguishable clusters decreases with increasing ε2. Moreover, the described procedure allows a correct identification of the hidden states Y over a wide range of regularization parameters (0.2 ≤ ε2 ≤ 0.8). Note that because of the form of the regularized functional (20), the absolute value of the regularization parameter ε2 necessary to achieve a desired persistence of the hidden process γi(t), i = 1, … , K, will be different for different applications. This value is influenced by the norm and the properties of the original cluster distance functional (16) evaluated for the given time series; that is, the appropriate magnitude of ε2 is problem dependent, and numerical investigations analogous to the one presented in Fig. 4 should be performed in each data analysis case.

Fig. 4.

Maximal number of the statistically distinguishable cluster states Kmax as a function of ε2.

4. Application II: Analysis of Lamb circulation index series for the United Kingdom

a. Data sources

The analyzed data, provided by the Climatic Research Unit of the University of East Anglia (available online at http://www.cru.uea.ac.uk/ftpdata/lwtjenk.dat), represent a sequence of atmospheric circulation patterns over the United Kingdom. In the original time series, each day between 1 January 1880 and 31 July 2007 is assigned to one of 28 basic patterns (cyclonic, anticyclonic, western, etc.) based on the daily gridpoint mean sea level pressure data. The assignment of the circulation regimes in the considered time series is done according to the modified Lamb classification scheme proposed by Jenkinson and Collison (Lamb 1972; Jenkinson and Collison 1977; Jones et al. 1993). In the following, the analysis is restricted to the part of the series after World War II (i.e., between 1 January 1946 and 31 July 2007). To simplify the analysis and to make the graphical representation of the results more comprehensible, the total number of analyzed circulation patterns is reduced to three: the anticyclonic pattern is denoted as state 1, the cyclonic pattern is denoted as state 3, and all other patterns are assigned to the collective state 2.

b. Trend discrimination with maximum likelihood approach

Before demonstrating the applicability of the strategies described in this paper to the resulting time series (switching between the discrete states 1, 2, and 3), it is first necessary to verify the Markovian property of the respective transition process. Unfortunately, there is no standard strategy for testing the Markov assumption in the nonstationary case. In the following we therefore use a test strategy applicable to the stationary case and assume that the deviations from stationarity are not so significant for the analyzed data that the stationary tests lose their validity. Development of nonstationary Markov tests is a matter of future research. For the stationary case (where the underlying transition matrix is assumed to be time independent), the Markovian property can be verified with standard tests; for example, one can check the generator of the process (see Crommelin and Vanden-Eijnden 2006; Metzner et al. 2007). Application of this test confirms the Markov hypothesis for the analyzed data in the stationary case and reveals that the underlying process can be represented by a rapidly mixing Markov chain (i.e., the probabilities of staying in the same state are comparable to the probabilities of changing the state, so the process does not get stuck in one state for a long time).
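The stationary count-based estimate (cf. footnote 2), together with a quick inspection of the rapid-mixing property, can be sketched as follows; the three-state chain below is synthetic and stands in for the Lamb series:

```python
import numpy as np

def transition_matrix(seq, n_states=3):
    """Count-based maximum likelihood estimate of a stationary
    transition matrix, P_ij = |{t_ij}| / sum_k |{t_ik}| (footnote 2)."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Synthetic three-state chain standing in for the observed sequence
rng = np.random.default_rng(0)
P_true = np.array([[0.5, 0.3, 0.2],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.2, 0.5]])
seq = [0]
for _ in range(20000):
    seq.append(int(rng.choice(3, p=P_true[seq[-1]])))

P_hat = transition_matrix(seq)
print(np.round(P_hat, 2))
# Rapid mixing: diagonal (staying) probabilities are comparable to the
# off-diagonal (switching) probabilities, so the chain does not get
# stuck in one state for long.
```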

As was mentioned above, identification of multiple trend models is numerically unfeasible because of the nonconvexity of the respective inequality constraints. To choose the most probable single trend model for a given time series of atmospheric circulation patterns, different trend function classes ϕ(t) can be parameterized by numerical solution of the problem (10)–(14) and then compared with the help of the standard Bayesian hypothesis test approach (Gelman et al. 2004).

One of the most intuitive single trend models in meteorological context is the seasonal trend model of the form

 
𝗣(t) = 𝗣(0) + 𝗣(1) sin[(2π/T)t + ϕ],

where T = 365.24 days is the seasonal period and ϕ ∈ [0, T] is a phase factor. The maximization problem (10)–(14) is solved independently for each row of the transition matrix. The optimization is repeated for various values of ϕ, and for each matrix row the parameters p(0), p(1), and ϕ with the highest value of the partial log-likelihood are kept (see the dashed lines in Fig. 5). The same kind of procedure is performed for the single trend model of the polynomial form

 
𝗣(t) = 𝗣(0) + 𝗣(1)t^α.

Statistical hypothesis tests can then be performed to decide which of the single trend models best explains the observed data. The log-likelihoods at the respective partial log-likelihood maxima (see Fig. 5) can be calculated, and the a posteriori model probabilities can then be obtained from the Bayes formula. It turns out that the polynomial trend model 𝗣(t) = 𝗣(0) + 𝗣(1)t^α and the nonstationary hidden-states model (estimated with the adaptive FEM clustering algorithm for three hidden states) have the highest probability of describing the analyzed data. Moreover, different values of the exponent α in the polynomial trend model are optimal for different atmospheric circulation patterns (α = 0.8 for anticyclonic, α = 0.3 for cyclonic, and α = 0.5 for the combination of all other patterns). Inspection of the resulting trend derivatives Ṗ(t) = αP(1)t^(α−1) shows that, according to the analyzed data, the speed of climate change visible in the transition probability trends was higher at the beginning of the analyzed period than in the current period (because α < 1 for all of the identified trends). This finding may represent a local effect and must be verified against global data; this is a matter of further research.
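A minimal sketch of this row-wise likelihood maximization, simplified to a single two-outcome row (stay vs. leave a given state), with a grid search standing in for the constrained optimization of (10)–(14) and entirely synthetic data, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
tn = np.arange(1, T + 1) / T                  # rescaled time in (0, 1]
q_true = 0.2 + 0.3 * tn**0.5                  # assumed "true" trend
y = rng.random(T) < q_true                    # 1 = state is left at time t

def loglik(q):
    """Partial log-likelihood of the two-outcome transition record."""
    q = np.clip(q, 1e-9, 1 - 1e-9)
    return np.log(q[y]).sum() + np.log(1 - q[~y]).sum()

# Grid search over the exponent alpha and the trend coefficients
best = (-np.inf, None, None, None)
for alpha in np.arange(0.1, 1.01, 0.1):
    ta = tn**alpha
    for q0 in np.arange(0.0, 0.51, 0.05):
        for q1 in np.arange(0.0, 0.61, 0.05):
            ll = loglik(q0 + q1 * ta)
            if ll > best[0]:
                best = (ll, alpha, q0, q1)

print(f"alpha={best[1]:.1f}, q0={best[2]:.2f}, q1={best[3]:.2f}")
```

The recovered exponent and coefficients should land near the generating values; in the paper's setting the same maximization is repeated per matrix row and per candidate α, and the α with the highest partial log-likelihood is kept.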

Fig. 5.

Maximal partial log-likelihoods (5) of (left) anticyclonic, (right) cyclonic, and (middle) all other circulation patterns. Dotted lines represent the partial log-likelihoods estimated for nonstationary Markovian models 𝗣(t) = 𝗣(0) + 𝗣(1)t^α as functions of the exponent α. Dashed lines mark the maximal values of partial log-likelihoods for the seasonal Markovian trends of the form 𝗣(t) = 𝗣(0) + 𝗣(1) sin[(2π/T)t + ϕ] (the maxima are calculated over all possible ϕ ∈ [0, T], where T = 365.24 days). Also shown are the partial log-likelihoods derived with the help of adaptive metastable FEM clustering with K = 3, N = 50, ε = 50 000 (bars) and local kernel smoothing with a Gaussian window width of 10.0 yr (dashed–dotted lines). The log-likelihood maxima for the polynomial trend are achieved at (left) α1 = 0.8, (middle) α2 = 0.5, and (right) α3 = 0.3.


Figure 5 demonstrates that the maximal log-likelihood values for the optimal polynomial trend and hidden-state models are higher than the optimal values for the seasonal trend model (dashed lines) and the local Gaussian kernel smoothing result (dashed–dotted); note that the effective Gaussian window width of 10 yr corresponds to the window width used in the original analysis of the U.K. Lamb index series (Jones et al. 1993). This finding means that the influence of seasonal pattern variability on the transition processes is dominated by the long-term effects induced by the single polynomial trend and hidden-state models. Moreover, the optimal hidden-state model with K = 3 has the highest probability in the case of states 1 and 2. In the case of state 3 (describing the cyclonic pattern), the single polynomial trend model explains the observed data better than any other tested model. However, this finding should be handled with care because the number of parameters used in the hidden-state model for K = 3 is higher than the number of parameters involved in the single polynomial trend model.

c. Comparison of robustness and predictability

To interpret the obtained results, the estimated transition matrices 𝗣(t) and instantaneous statistical weights πi(t), i = 1, 2, 3, defined by

π(t)𝗣(t) = π(t),   π1(t) + π2(t) + π3(t) = 1,   (34)

can be compared with respect to both their qualitative temporal behavior and their robustness. The instantaneous statistical weights are the components of the steady-state PDF of a stationary Markov process; in the nonstationary case, these quantities describe the time evolution of the equilibrium probability density function (PDF). Figure 6 demonstrates a good agreement of the resulting parameters for the single polynomial trend model and the hidden-states model. It can clearly be seen that in both cases the probability of staying in the anticyclonic pattern decreases, the probability of moving from the anticyclonic to the cyclonic pattern increases, and the probability of moving from the anticyclonic to any other circulation pattern stays almost constant. For the cyclonic pattern transitions, the situation is reversed: the probability of staying in the cyclonic pattern increases, the probability of transitioning from the cyclonic to the anticyclonic pattern decreases, and the probability of transitioning from the cyclonic to any other circulation pattern stays almost constant. As can be seen from Fig. 7, this tendency results in an increasing instantaneous statistical weight of the cyclonic pattern and a decreasing weight of the symmetric anticyclonic pattern. Moreover, Figs. 6 and 7 demonstrate the higher robustness of the results obtained by the global methods compared to the local Gaussian kernel smoothing method.
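For a fixed time t, the weights π(t) can be computed as the normalized left eigenvector of 𝗣(t) for eigenvalue 1. A small self-contained sketch, using an illustrative made-up transition matrix:

```python
import numpy as np

def stationary_weights(P):
    """Normalized left eigenvector of P for eigenvalue 1; applied to
    P(t) this gives the instantaneous statistical weights pi_i(t)."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))     # eigenvalue closest to 1
    pi = np.real(vecs[:, k])
    return pi / pi.sum()                  # normalize to sum to 1

# Illustrative (made-up) row-stochastic transition matrix
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
pi = stationary_weights(P)
print(np.round(pi, 3))
```

Evaluating this at each t for an estimated 𝗣(t) yields the time evolution of the weights shown in Fig. 7.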

Fig. 6.

Comparison of Markovian transition probabilities Pij(t) estimated with log-likelihood maximization with polynomial trend for α1 = 0.8, α2 = 0.5, α3 = 0.3 (black solid lines), local Gaussian kernel smoothing for the window width of 10.0 yr (gray solid lines), and FEM clustering for K = 3, N = 50, ε = 50 000 (dashed lines). Black dotted lines mark the robustness intervals for the parameters estimated by the log-likelihood maximization with polynomial trend, and gray dotted lines are the robustness intervals of the local Gaussian kernel smoothing. The robustness intervals of the FEM clustering method have almost the same size as the confidence intervals of the single trend model (because both methods are global approaches).


Fig. 7.

Instantaneous statistical weights (34) and their robustness intervals (dotted) calculated with (left) the local Gaussian kernel smoothing transition matrix and (right) the polynomial trend transition matrix from Fig. 6 (right). Dashed lines denote the respective statistical weights calculated with the FEM clustering procedure. The robustness intervals of the FEM clustering method have almost the same size as the confidence intervals of the single trend model (both methods are global approaches).


To make predictions based on the available time series information, one has to extrapolate the identified transition process 𝗣(t) to the future. As was mentioned above, single trend models offer the direct possibility of generating long-term predictions because the exact functional form (in some predefined class of functions, e.g., polynomials) of the transition process 𝗣(t) is estimated explicitly from the observation data. For any given time 1 < τ < T, 𝗣[1,τ](t) will denote the Markovian transition matrix estimated only from part of the available time series X1, X2, … , Xτ. To quantify the prediction quality based on this estimate, the mean log-likelihood of prediction can be used:

 
L(τ, Δt) = (1/Δt) Σi,j Σt∈{t̃ij} ln P[1,τ]ij(t),   (35)

where {t̃ij} are the time instances between τ + 1 and τ + Δt at which transitions between si and sj are observed. Figure 8 illustrates that predictions based on the single polynomial trend model are more probable than predictions based on a stationary Markovian model without trend. This means that the long-term effects explained by the polynomial trend are significant and can be reliably identified even from a part of the original data.
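As a rough sketch of this prediction-quality measure under simplifying assumptions — a stationary count-based estimate standing in for the fitted trend model, and synthetic data in place of the Lamb series — one might write:

```python
import numpy as np

def transition_matrix(seq, n=3):
    """Count-based transition matrix estimate from seq."""
    C = np.zeros((n, n))
    for a, b in zip(seq[:-1], seq[1:]):
        C[a, b] += 1
    return C / C.sum(axis=1, keepdims=True)

def mean_pred_loglik(seq, tau, dt):
    """Average log-probability the model fitted on seq[:tau] assigns
    to the transitions observed in (tau, tau + dt]."""
    P = transition_matrix(seq[:tau])
    return np.mean([np.log(P[a, b])
                    for a, b in zip(seq[tau:tau + dt],
                                    seq[tau + 1:tau + dt + 1])])

rng = np.random.default_rng(2)
P_true = np.array([[0.5, 0.3, 0.2],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.2, 0.5]])
seq = [0]
for _ in range(12000):
    seq.append(int(rng.choice(3, p=P_true[seq[-1]])))

print(round(mean_pred_loglik(seq, tau=6000, dt=3000), 3))
```

In the paper's setting, the estimate on seq[:tau] would instead be the nonstationary polynomial-trend model extrapolated beyond τ, which is what Fig. 8 compares against the trend-free stationary fit.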

Fig. 8.

Mean log-likelihood of predictions (35) as a function of τ (in years; Δt = 6000 days) for P[1,τ](t) estimated as a stationary Markovian process without trend (dotted line) and a nonstationary Markovian process with polynomial trend (solid line).


5. Concluding discussion

Three numerical methods for the analysis of nonstationary Markovian data have been presented and compared here. In contrast to the standard approach (a local nonparametric technique making use only of the information inside a moving Gaussian window), the two presented methods acquire information globally and therefore, under predefined mathematical assumptions, allow a more reliable estimate of the underlying model parameters. This feature is very important for the analysis of short record series. Both presented global methods demonstrated more reliable trend discrimination, with narrower robustness intervals, than the local Gaussian kernel smoothing. It was exemplified how the new methods can help perform a robust identification of transition probabilities and stationary weights in the analyzed circulation data. Despite the difference in the mathematical assumptions imposed on the data (one method being parametric and the other nonparametric), both methods revealed the same robust weight increase of the cyclonic circulation pattern (absolute increase of 6.3 ± 0.5%) and a symmetrical decrease of the anticyclonic pattern weight (absolute decrease of 5.5 ± 0.5%) over the United Kingdom between 1946 and 2007. Moreover, the results of the single trend model analysis indicated that the speed of climate change identified from the transition probability trends was higher at the beginning of the analyzed period than in the current period.

One of the most important practical issues in time series analysis is the construction of dynamical models able to predict the future. Single trend models have obvious shortcomings in predicting situations in which the functional and parametric form of the trend function ϕ(t) changes (i.e., a regime change). The adaptive FEM clustering method, on the other hand, has the potential to cope with this problem. However, in the current setting of the algorithmic procedure, the hidden state probabilities γi(t) are identified without any assumptions about their dynamics (in contrast to the widely used HMM strategy, in which these quantities are taken a priori to be Markov processes). This means that, to be able to predict phase transitions in realistic applications, the dynamics of the hidden state probabilities γi(t) have to be further investigated a posteriori. Moreover, robust statistical techniques of change point analysis have to be developed. These problems are matters of further research.

For the analyzed practical application, it was demonstrated that the polynomial parameterization of the nonstationary Markov model enables a better quality of predictions as compared to the stationary case. Further development of the presented data analysis techniques can help to acquire a better understanding of the low-frequency variability and dynamics of processes switching between metastable weather and climate regimes.

Acknowledgments

The author would like to thank Andrew Majda (Courant Institute, NYU), Eric Vanden-Eijnden (Courant Institute, NYU), Rupert Klein (FU Berlin), and Dirk Becherer (HU Berlin) for intensive discussions and valuable comments.

The author thanks the unknown referees whose valuable comments helped to make the paper more readable and the Climatic Research Unit of the University of East Anglia for the time series data. The work was partially supported by the DFG SPP 1276 MetStroem “Meteorology and Turbulence Mechanics” and DFG Research Center “Matheon.”

REFERENCES

Anderson, J., 2003: A local least squares framework for ensemble fitting. Mon. Wea. Rev., 131, 634–642.

Braess, D., 2007: Finite Elements: Theory, Fast Solvers, and Applications to Solid Mechanics. Cambridge University Press, 370 pp.

Crommelin, D., and E. Vanden-Eijnden, 2006: Fitting time series by continuous-time Markov chains: A quadratic programming approach. J. Comput. Phys., 217, 782–805.

Dakos, V., M. Scheffer, E. van Nes, V. Brovkin, V. Petukhov, and H. Held, 2008: Slowing down as an early warning signal for abrupt climate change. Proc. Natl. Acad. Sci. USA, 105, 266–282.

Gelman, A., J. Carlin, H. Stern, and D. Rubin, 2004: Bayesian Data Analysis. Chapman and Hall, 668 pp.

Gill, P., W. Murray, M. Saunders, and M. Wright, 1987: A Schur-complement method for sparse quadratic programming. Stanford University Systems Optimization Lab Tech. Rep. SOL 82-12, 1–19.

Hamilton, J., 1989: A new approach to the econometric analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.

Hoerl, A., 1962: Applications of ridge analysis to regression problems. Chem. Eng. Prog., 58, 54–59.

Horenko, I., 2008: On simultaneous data-based dimension reduction and hidden phase identification. J. Atmos. Sci., 65, 1941–1954.

Horenko, I., 2009a: On clustering of non-stationary meteorological time series. Dyn. Atmos. Oceans, in press.

Horenko, I., 2009b: Finite element approach to clustering of multidimensional time series. SIAM J. Sci. Comput., in press.

Horenko, I., S. Dolaptchiev, A. Eliseev, I. Mokhov, and R. Klein, 2008a: Metastable decomposition of high-dimensional meteorological data with gaps. J. Atmos. Sci., 65, 3479–3496.

Horenko, I., R. Klein, S. Dolaptchiev, and C. Schuette, 2008b: Automated generation of reduced stochastic weather models. I: Simultaneous dimension and model reduction for time series analysis. SIAM Multiscale Model. Simul., 6, 1125–1145.

Jenkinson, A., and F. Collison, 1977: An initial climatology of gales over the North Sea. Met Office, Synoptic Climatology Branch, Tech. Memo. 62, 18 pp.

Jones, P., M. Hulme, and K. Briffa, 1993: A comparison of Lamb circulation types with an objective classification scheme. Int. J. Climatol., 13, 655–663.

Lamb, H., 1972: British Isles Weather Types and a Register of Daily Sequence of Circulation Patterns 1861–1971. Vol. 116, Her Majesty's Stationery Office, 85 pp.

Loader, C., 1996: Local likelihood density estimation. Ann. Stat., 24, 1602–1618.

Loader, C., 1999: Local Regressions and Likelihood. Springer, 308 pp.

Majda, A., C. Franzke, A. Fischer, and D. Crommelin, 2006: Distinct metastable atmospheric regimes despite nearly Gaussian statistics: A paradigm model. Proc. Natl. Acad. Sci. USA, 103, 8309–8314.

Mardia, K., J. Kent, and J. Bibby, 1979: Multivariate Analysis. Academic Press, 450 pp.

Metzner, P., I. Horenko, and C. Schuette, 2007: Generator estimation of Markov jump processes based on incomplete observations nonequidistant in time. Phys. Rev., 227E, 353–375.

Nelder, J., and R. Mead, 1964: A simplex method for function minimization. Comput. J., 7, 308–313.

Noe, F., 2008: Probability distributions of molecular observables computed from Markov models. J. Chem. Phys., 128, 244103, doi:10.1063/1.2916718.

Schütte, C., and W. Huisinga, 2003: Biomolecular conformations can be identified as metastable sets of molecular dynamics. Handbook of Numerical Analysis, Vol. X, P. G. Ciaret and J.-L. Lions, Eds., Elsevier, 699–744.

Tikhonov, A., 1943: On the stability of inverse problems. Dokl. Akad. Nauk SSSR, 39, 195–198.

Wahba, G., 1990: Spline Models for Observational Data. SIAM, 180 pp.

Xiong, L., S. Guo, and K. O'Connor, 2006: Smoothing the seasonal means of rainfall and runoff in the linear perturbation model (LPM) using the kernel estimator. J. Hydrol., 324, 266–282.

Zhao, S., and G. Wei, 2003: Jump process for the trend estimation of time series. Comput. Stat. Data Anal., 42, 219–241.

Footnotes

Corresponding author address: Illia Horenko, Institute of Mathematics, Free University of Berlin, Arn-imallee 2–6, 14195 Berlin, Germany. Email: horenko@math.fu-berlin.de

1

It is assumed that {tij} is not empty ∀i,j. If it is not true for a certain pair (i, j), it means that no direct transitions between the states si and sj were observed in the time series, and the respective matrix elements Pij(t) can be assumed to be equal to zero for all t.

2

If 𝗣(t) is assumed to be time independent, the optimization problem (4)–(6) can be solved analytically by applying the method of Lagrange multipliers, yielding Pij = |{tij}|/Σk|{tik}|, where |{tij}| denotes the number of observed transitions between the states si and sj and Σk|{tik}| ≠ 0.