## Abstract

A numerical framework for data-based identification of nonstationary linear factor models is presented. The approach is based on the extension of the recently developed method for identification of persistent dynamical phases in multidimensional time series, permitting the identification of discontinuous temporal changes in underlying model parameters. The finite element method (FEM) discretization of the resulting variational functional is applied to reduce the dimensionality of the resulting problem and to construct the numerical iterative algorithm. The presented method results in the sparse sequential linear minimization problem with linear constrains. The performance of the framework is demonstrated for the following two application examples: (i) in the context of subgrid-scale parameterization for the Lorenz model with external forcing and (ii) in an analysis of climate impact factors acting on the blocking events in the upper troposphere. The importance of accounting for the nonstationarity issue is demonstrated in the second application example: modeling the 40-yr ECMWF Re-Analysis (ERA-40) geopotential time series via a single best stochastic model with time-independent coefficients leads to the conclusion that all of the considered external factors are found to be statistically insignificant, whereas considering the nonstationary model (which is demonstrated to be more appropriate in the sense of information theory) identified by the methodology presented in the paper results in identification of statistically significant external impact factor influences.

## 1. Introduction

The parameterization of reduced dynamical models describing the behavior of the observed (measured) multiscale data has been a field of intensive research in the last several years. Approaches introduced in atmospheric science include stochastic differential equations (Majda et al. 1999, 2003; Wilks 2005), multiscale schemes (Majda et al. 2003; Fatkullin and Vanden-Eijnden 2004), regression models (Orrell 2003), discrete Markov chain models (Khouider et al. 2003), hidden Markov models (Majda et al. 2006; Horenko et al. 2008a,b), and conditional Markov models (Crommelin and Vanden-Eijnden 2008). In the present paper a purely data-driven approach for parameterization by means of nonstationary multivariate autoregressive factor (VARX) models is introduced, based on the combination of the stationary VARX models widely used in econometrics (Tsay 2005) with the recently introduced finite element method (FEM)-clustering procedure (Horenko 2009, 2010b). The resulting numerical strategy is demonstrated to allow the multiscale approximation of the nonstationary dynamical processes via optimal sequences of locally stationary fast VARX processes and some slow (or persistent) hidden process switching between them. In an atmospheric context it was recently demonstrated that the FEM-clustering framework can be successfully applied to identify large-scale dynamical circulation patterns in realistic atmospheric general circulation models (AGCMs; Franzke et al. 2009). In the current manuscript the FEM-clustering framework is extended to allow for discontinuous hidden processes via the formulation of the respective variational problems in the space of the functions with bounded variation. Applications of the proposed method are demonstrated in two different scenarios and compared with standard purely data-driven methods (Orrell 2003; Wilks 2005): (i) in the context of the subgrid-scale parameterization for the Lorenz ’96 model (Lorenz 1996) with nonstationary forcing in the right-hand side and (ii) in the context of climate impact factor analysis for 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40) historical geopotential data in Europe between 1958 and 2003 (Simmons and Gibson 2000).

The outline of the remainder of this paper is as follows. In section 1 the inverse problem for nonstationary dynamical systems is formulated as a clustering problem and it is demonstrated how the multiscale assumption can be incorporated into the resulting variational formulation via the persistency condition in the space of functions with bounded variation. The finite element method is deployed to reduce the dimensionality of the resulting optimization problem, and the iterative numerical scheme is introduced. In section 2 the numerical details of the resulting FEM–VARX clustering method are explained. In section 3 different strategies of postprocessing the clustering results are discussed with regard to their insight into the analyzed data. In section 4 the performance of the presented framework is demonstrated for two practical applications and the discussion is presented in section 5.

## 2. Constrained clustering method

### a. Model distance functional

Let *x*_{0}, … , *x _{T}* ∈ Ψ ⊂

**R**

*be the observed*

^{n}*n*-dimensional time series with

*T*+ 1 snapshots in time interval [0,

*T*]. In the following we will assume that in this time interval the considered time series

**x**

*is approximated by a time-discrete output of the certain direct mathematical model*

_{t}where **F**(·) is the model operator, *τ* is the model time step, *mτ* is the memory depth (*m* = 1 for Markov models), and

is a (time-dependent) set of the model parameters (including, if necessary in some model contexts, some initial and/or boundary values), where *d* is the dimension of a model parameter space. Let

be a functional (called the model distance functional) describing the distance between some given **x*** _{t}* at time

*t*and the output of the model (1) calculated for a fixed set of parameters

**(**

*θ**t*). In this case, for a given observation series

*x*

_{0}, … ,

*x*and some fixed functional form

_{T}*g*(·), the inverse problem (or the parameter identification problem) can be approached via the solution of the following variational problem:

subjected to the constraints given in (2). Problem (4) is clearly ill posed if no special assumptions about the temporal dependence of the unknown parameters ** θ**(

*t*) can be made. In the following we will assume that for any

*t*∈ [0,

*T*] model distance, functional (3) can be represented as a convex linear combination of

*K*≥ 1 stationary model distance functionals—that is, model functionals dependent on some constant (time-independent) model parameters

*θ*∈

_{i}**Ω**,

*i*= 1, … ,

*K*:

with some time-dependent model affiliations *γ _{i}*(

*t*) fulfilling the convexity condition

In another words, we assume here that at any time *t* the global time-dependent (or nonstationary) model distance functional (3) can be approximated by one of *K* local time-independent (or stationary) model distance functionals chosen according to some time-dependent probabilities (or model affiliations) **Γ**(*t*) = [*γ*_{1}(*t*), … , *γ _{K}*(

*t*)]. This idea for the inverse numerical problems is widely used in the context of data clustering (Höppner et al. 1999); in the presented general form it was introduced in Horenko (2010a) and stems from the classical spline interpolation approach for direct numerical problems (see, e.g., Deuflhard 2004). Inserting the ansatz (5) into (4) results in the minimization of the average clustering functional

To comprehend the above concepts it is instructive to consider a case where the direct model (1) has a following simple form:

where * ε_{t}* is some independent identically distributed (i.i.d.) stochastic variable with zero expectation and

**(**

*θ**t*): [0,

*T*] →

**R**

*is a time-dependent parameter describing the evolution of the expectation value of the process*

^{n}**x**

*. The model distance functional (3) in such a case has the form*

_{t}and the corresponding average clustering functional can be numerically minimized applying the standard *K*-means clustering algorithm (Bezdek 1981; Höppner et al. 1999). This means that with the help of the ansatz (5), the solution of the nonstationary inverse problem for dynamical system (9) can be approached via the iterative clustering algorithm based on the minimization of the average clustering functional (8).

In the following it will be explained in detail how these concepts can be interpreted and applied in context of more general nonstationary multivariate models with external forcing (VARX models).

### b. Nonstationary VARX models and VARX model distance functional

The stationary VARX model is a widely used dynamical multivariate tool to investigate the time series subject to external forcing (Brockwell and Davis 2002; Tsay 2005). If, in addition to the time series *x*_{0}, … , *x _{T}* ∈

**R**

*(describing the internal degrees of freedom of the considered dynamical system), the time series*

^{n}*u*

_{0}, … ,

*u*∈

_{T}**R**

*of the external influences (or forcing) is available, the nonstationary nonlinear VARX model has the following form:*

^{l}where *ϕ*_{1}(*x*_{t−τ}, … , *x*_{t−mτ}): **R**^{nm} → **R**^{d} is some (in general nonlinear) function connecting the previous observations *x*_{t−τ}, … , *x*_{t−mτ}; *ϕ*_{2}(*u*) = {*ϕ*_{2}^{1}[*u*(*t*)], … , *ϕ*_{2}^{k}[*u*(*t*)]} : **R**^{l} → **R**^{k} is some fixed (nonlinear) function of the external factors; * ε_{t}* : [0,

*T*] →

**R**

^{h}(

*h*≪

*n*) is a Gaussian process with zero expectation; ;

**(**

*μ**t*) : [0,

*T*] →

**R**

^{n}; 𝗔(

*t*) : [0,

*T*] →

**R**

^{n×d}; 𝗕(

*t*) : [0,

*T*] →

**R**

^{n×k}; and 𝗖(

*t*) : [0,

*T*] →

**R**

^{n×h}.

The most simple and straightforward form of the VARX model used in the literature is the linear autoregressive factor model with **ϕ**_{1}(*x*_{t−τ}, … , *x*_{t−mτ}) = [*x*_{t−τ}, … , *x*_{t−mτ}] and a VARX model equation (Brockwell and Davis 2002):

We will further assume that the noise matrix 𝗖(*t*) (describing the coupling between the *h*-dimensional Gaussian noise process to the analyzed time series **x**_{t}) for any *t* ∈ [0, *T*] can be represented as

where 𝗣(*t*) : [0, *T*] → **R**^{n×h} is an orthogonal matrix function and **Λ**(*t*) : [0, *T*] → **R**^{h×h} is a diagonal matrix function with nonnegative diagonal elements. Defining ** θ**(

*t*) = [

**(**

*μ**t*), 𝗔(

*t*), 𝗕(

*t*), 𝗖(

*t*)] and under the assumption (13), the VARX model distance functional (3) of dynamical system (11) can be written as

where the 𝗣(*t*)-weighted norm ‖·‖_{𝗣(t)} = [·𝗣(*t*), 𝗣^{†}(*t*)·]_{2} is used and † denotes the matrix transposition. The main advantage of the above definition of the least squares residual norm (14)—as compared to the standard Gaussian norm based on the 𝗖(*t*)-weighted norm ‖·‖_{𝗖(t)} = [·, 𝗖(*t*)·]_{2}—is that it preserves the norm of the original residuals of the least-squared problem in the essential noise dimension. If the aforementioned assumptions are fulfilled, then it is easy to demonstrate that the time series of model distances *g*[**x*** _{t}*,

**(**

*θ**t*)],

*t*= 0, … ,

*T*is a

*χ*

^{2}process. In the application examples below it will be demonstrated how this property of the process can be used to estimate the confidence intervals of the model parameters. In the context of stationary VARX models (e.g., in the case of the time-independent parameter matrices 𝗔, 𝗕, and 𝗖) this property can be deployed to a posteriori check the model assumption (11) and to demonstrate the asymptotical normality of the resulting parameter estimates (i.e., to demonstrate that for

*T*→ ∞ the parameter estimates are distributed according to the multivariate Gaussian; Reinsel 1993). Applying the ansatz (5), the nonstationary parameter identification problem can be approached via the solution of the minimization problem (8) [with

*= (*

**θ**_{i}*, 𝗔*

**μ**^{i}*, 𝗕*

^{i}*, 𝗖*

^{i}*)]. This can be done by means of the iterative clustering algorithm (just as described for the*

^{i}*K*-means clustering; Höppner et al. 1999; Horenko 2010b).

### c. Incorporation of additional information

Direct numerical treatment of the problem (8) is hampered by the following issues: (i) the problem is ill posed since the number of unknowns can be higher then the number of known parameters and (ii) because of the nonlinearity of *g*, the problem is in general nonconvex and the numerical solution gained with some sort of local minimization algorithm depends on the initial parameter values (Deuflhard 2004). Perhaps it is even more important that the solution **Γ** of the above constrained minimization task might be an irregular function: to see this, let us assume that we already know the minimizing values **Θ**_{*} for **Θ**. Then, the minimizer **Γ**_{*} for the affiliation vector **Γ** has the following form:

Thus, the datum **x*** _{t}* has perfect affiliation with state

*i*if the model distance functional for

**x**

*is minimal in state*

_{t}*i*. That is, if the process exhibits strong variability, then the affiliations are rather nonsmooth functions. Whenever the affiliation functions just take values 0 or 1, we will call them

*deterministic*, in the sense that then for every datum in the time series it is certain to which cluster it belongs.

As demonstrated in the literature (Horenko 2009, 2010b), one of the possibilities for approaching the two aforementioned problems simultaneously is first to incorporate some additional information about the regularity of the observed process [e.g., in the form of smoothness assumptions in the space of time-continuous functions **Γ**(·)] and then to apply a finite Galerkin discretization (e.g., a FEM discretization) of this infinite-dimensional Hilbert space. In context of Tykhonov-based FEM-clustering methods, this was done assuming the weak differentiability of the time-continuous functions *γ _{i}*; that is,

where *W*_{1,2}(0, *T*) is the Sobolev space of weakly differentiable functions [e.g., functions with _{2}(0, *T*) integrable first derivatives].

As demonstrated in Horenko (2010a), one possibility to incorporate this a priori information from (16) into the optimization is to modify the functional (8) and to write it in the Tykhonov-regularized form

However, introduction of the *ε*^{2}-dependent penalty term in the formulation of the Tykhonov-regularized problem (17) changes the functional form of the original clustering problem (8) and biases the position of the solution of the respective minimization problem with growing *ε*^{2} (e.g., the solution of the regularized problem may have a significant deviation from the global minimum of the original problem). As was demonstrated in Horenko (2009, 2010b), increasing the value of *ε*^{2} leads to the “smoothing out” of the sharp transitions between the cluster states.^{1} Moreover, the formulation (16) of the persistency condition in *W*_{1,2} sense relies on the differentiability and continuity of the underlying cluster affiliation functions *γ _{i}*(

*t*). This cannot in general be assumed to be granted in the cases where the transitions between the cluster states are sharp [e.g., when the function

*γ*(

_{i}*t*) has jumps and is discontinuous].

In the following, an alternative way to incorporate the persistency assumption into the original clustering problem (8) (avoiding the two abovementioned problems) will be presented, based on the formulation of the persistency condition in a functional space allowing for a discontinuity of its elements [functions with bounded variation, BV(0, *T*)].

### d. Persistence in the BV(0, T) sense: The constrained BV-clustering method and FEM discretization

Instead of limiting ∂* _{t}γ_{i}* in the

^{2}sense [which relies on the differentiability and continuity assumption for the cluster affiliation function

*γ*(

_{i}*t*)], we will consider the time-discrete functions

*γ*(

_{i}*t*) defined only at the time instances where the observations

**x**

*are available. We will formulate the persistency condition in the time-discrete BV(0,*

_{t}*T*) sense

where the persistency parameter *C* defines the maximal number of transition between the cluster state *i* and all other states in time interval (0, *T*). Note that since at least in the time-continuous case *W*_{1,2}(0, *T*) ⊂ BV(0, *T*) (Moreau 1988), this kind of persistency condition will also allow us to preserve the “smooth” *W*_{1,2} transitions between the cluster states in the continuous limit by an appropriate choice of the persistency parameter *C*.

Let 𝗗 be the discrete difference operator (the right-hand derivatives) with regard to *t*:

For a given **Θ** = (*θ*_{1}, … , *θ _{K}*), let

Then problem (8) transforms to

where † denotes the transposition operation. The above problem (21) is subject to the constraints

The direct numerical optimization of the problem (21)–(24) is hampered by the fact that the persistency constraint (22) makes the overall problem nondifferentiable. In the following, an adequate transformation of the above problem to the differentiable formulation will be introduced. This will allow us to use the standard numerical optimization methods in the context of (21)–(24). We define *υ _{i}* :=

**D**

*γ*

_{i}^{†}and split this into nonpositive and nonnegative parts (following the theorem about the unique representation of the BV functions; cf. Moreau 1988):

Moreover, let 𝗗^{−1} be the discrete integration operator; for example,

for some variable (defining the initial value for the cluster affiliation *i* at time 0). Now we can express *γ _{i}*

^{†}as

Defining

we can express the original nondifferentiable optimization problem [(21)–(24)] as the following differentiable optimization problem (for more details, see the appendix):

in the vector space of the higher dimensionality [since by construction the dimension of variable *x̃ _{i}* defined in (28) is almost twice as high as the dimension of the original variable

*γ*]. The solution of the above minimization problem can be approached via the subspace iteration procedure (e.g., via the solution of the restrained optimization problems in parameter subspaces

_{i}**x̃**and

**subsequently). Completely analogously to the Tykhonov-regularized FEM-clustering case (Horenko 2009, 2010b), it can be demonstrated that this subspace iteration procedure converges toward the (local) minimum of the problem (29) if some appropriate assumptions (convexity and differentiability) of the model distance functional (3) are fulfilled.**

*θ*However, since the dimensionality of the variable **x̃** is growing as *K*(2*T* + 1) with the length of the analyzed time series, the numerical solution of the restrained problem (29) for a fixed value of **Θ** can become increasingly expensive for long time series. In the following the finite element method will be deployed to reduce the dimensionality of the above problem.

### e. FEM discretization

Let {0 = *τ*_{0}, *τ*_{1}, *τ*_{2}, … , *τ*_{N−1}, *τ _{N}*,

*τ*

_{N+1}=

*T*} be a finite subset of the interval [0,

*T*] with uniform time steps

*δ*. We can define a set of

_{t}*N*≪ (

*T*+ 1) time-discrete functions with bounded variation {

*f*

_{1}(

*t*),

*f*

_{2}(

*t*), … ,

*f*(

_{N}*t*)} defined at

*T*+ 1 time points of the observation series

*x*, where each function

*f*(

_{i}*t*) shall take positive values at the observation time instances of

*x*in the time interval (

*τ*

_{i−1},

*τ*

_{i+1}) and be zero at the time instances outside this interval.

^{2}Now we represent the

*υ*’s from (25) by these functions; thus,

where *χ _{N}*

^{+}and

*χ*

_{N}^{−}are discrete discretization errors. As

*δ*goes to 1, the discretization errors become zero. This develops into a reduced discrete representation of the discrete problem (29) by using

_{t}where 𝗪* _{N}* ∈

**R**

^{(T+1)×N}is a FEM-basis matrix and

**ṽ**

_{i}

^{−}is defined analogously. Then the previously defined

*x̃*can be approximated as

_{i}Defining

results in the FEM-discretized problem in the BV sense:

Note that **x** has a dimensionality of *K*(2*N* + 1), which is much less then the dimension *K*(2*T* + 1) of the original variable **x̃** if *N* ≪ *T*. Analogously to the Tykhonov-regularized FEM-clustering problem described in Horenko (2009), adaptive FEM techniques can be deployed to find the optimal set of time intervals *δ _{t}* for a given total discretization error

**.**

*δ̃*^{3}

For a fixed **Θ** the above minimization problem is a linear minimization with regard to **x** with linear equality and inequality constraints. This problem can be solved by means of some standard numerical methods of linear programming like simplex or interior point methods. For the fixed parameter **x**, the above problem is an unconstrained minimization problem that can be solved analytically if the model distance functional *g* is convex with regard to *θ* and the problem ∂*g*(*x _{t}*,

*θ*)/

*θ*= 0 has a unique analytical solution with regard to

*θ*(as will be demonstrated in the following, this is the case for the VARX models). The iterations can be repeated until the change of the functional value does not exceed some predefined threshold for the change of the functional value. The algorithm is:

Compared to the Tykhonov-regularized FEM-clustering problem as described in Horenko (2009, 2010b), the above problem [(34)] has two major numerical advantages. First, while for the Tykhonov-regularized FEM-clustering problem the persistence could be influenced only indirectly through the choice of the regularization parameter *ε*^{2}, in the context of (34) it is directly controllable via the persistency threshold *C*. Second, a sparse linear programming problem is solved in each iteration of the above algorithm (compared to the more numerically expensive quadratic programming in the case of the Tykhonov-regularized FEM-clustering problem).

In the next section it will be demonstrated how the above numerical procedure can be formulated for the parameter identification of nonstationary VARX models (11).

## 3. FEM–VARX clustering: Parameter estimation and determination of optimal values for number of states and persistency threshold

### a. Derivation of the estimator formulas

In every iteration *s* of the subspace iteration algorithm described above, the unconstrained minimization problem

is solved for a fixed value of *x̃*^{(s−1)} (step 2 of the above algorithm). If the dynamics of the analyzed time series is assumed to be governed by the VARX process (11) with memory *m*, the model distance functional in the clustering problem formulation will take the form (14). Let 𝗔𝗕* ^{i}* = (

*μ*, 𝗔

^{i}*, 𝗕*

^{i}*) ∈*

^{i}**R**

^{n×(1+d+k)}[where

*d*is the dimensionality of the output of function

*ϕ*

_{1}(·)] and Γ

^{s−1}(

*t*) be the cluster affiliations reconstructed from the current values

*x̃*

^{s−1}via the subsequent application of Eqs. (30) and (32). After inserting Γ

^{s−1}(

*t*) into the function

*g*(

_{i}*θ*),

_{i}*i*= 1, … ,

*K*and taking the derivative of the functional

**L**

_{0 }(34) w.r.t. the new variables 𝗔𝗕

*, we get the optimal estimators for 𝗔𝗕*

^{i}*in the iteration*

_{s}^{i}*s*as the solution of the following system of linear equations:

^{4}

where *ϕ*_{1}^{t} = *ϕ*_{1}(*x*_{t−τ}, … , *x*_{t−mτ}) and *ϕ*_{2}^{t} = *ϕ*_{2}[*u*(*t*)]. If the matrix (𝗫*_{s}𝗫_{s}) is invertible, then the solution of the above system exists, is unique, and can be expressed as

where * denotes the matrix conjugate (Golub and Loan 1989). The optimal estimates of the noise parameters 𝗖* ^{i}* are straightforwardly calculated from the covariance matrices of the model residuals in the cluster state

*i*:

Equations (37) and (38) give explicit estimator expressions that are used in step 2 of the subspace iteration algorithm.

As can be seen from the above estimator formulas, from the view point of the inverse numerical problem there is no difference between linear (12) and nonlinear (11) factor models: both result in the solution of a linear least squares problem (36). This is explained by the fact that in both cases the right-hand sides of the models are linear functions of model parameters. However, as will be demonstrated in the following, linear autoregressive models can in some situations provide more insight, allowing us to use the available tools of linear data analysis, and can be successfully applied to analyze realistic (nonlinear) data.

### b. FEM–VARX model order selection for fixedK and C

To select a proper model order *m* and the optimal functional form *ϕ*_{2}(*u*) = {*ϕ*_{2}^{1}[*u*(*t*)], … , *ϕ*_{2}^{k}[*u*(*t*)]} for the external factors, standard tools of information theory such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) McQuarrie and Tsai (1998) developed for the linear stationary VARX models can be applied a posteriori to the locally stationary VARX models identified via the FEM–VARX procedure. In terms of the information criteria, various models are compared wrt the special functional consistent of the model log likelihood with an added regularization term penalizing the total number of parameters involved in the model (to avoid overfitting). For example, the BIC functional (being in general more robust than AIC; cf. McQuarrie and Tsai 1998) for the cluster state *i* will have the form

where _{i} = *γ _{i}g_{i}*

^{†}(

*θ*) and

_{i}*N*is the number of the model parameters in the cluster state

_{i}*i*. Given any two estimated cluster models [e.g., models with different memory depth and/or different functions

*ϕ*_{2}(

*u*)], the model with the lower value of BIC(

*i*) is the one to be preferred. In the following it will be demonstrated how the criterion (39) can be applied in the praxis to make a decision about the optimal functional form of the local VARX models and to test the significance of different external factors

*u*for the dynamical process explaining the analyzed time series data.

### c. Choosing the optimal number of local models K

The upper bound for the number of statistically distinguishable cluster states for each value of the persistency threshold *C* can be algorithmically estimated in the following way: starting with some a priori chosen (big) *K*, one solves the optimization problem (34) for different fixed value of *C* and calculates the confidence intervals of the resulting local parameter matrices **Θ*** ^{i}*,

*i*= 1, … ,

*K*(this can be done applying standard bootstrap sampling procedures; see Chernik 2007). If two of the estimated parameter sets for two of the identified cluster states have confidence intervals that overlap in all components, this means that respective clusters are statistically indistinguishable and the whole procedure must be repeated for

*K*=

*K*− 1. If at a certain point all of the matrices are statistically distinguishable, the procedure is stopped and

*K*

_{max}(

*C*) =

*K*.

Another possible means of estimating the optimal number of clusters can be used if the identified transition process Γ(*t*) is shown to be Markovian for given *K*, *C*. Markovianity can be verified applying some standard tests (e.g., one can check the generator structure of the hidden process; see Metzner et al. 2007).^{5} In such a case the hidden transition matrix can be calculated and its spectrum can be examined for a presence of the spectral gap (a gap separating the fast and the slow time scales in the Markov dynamics). If the spectral gap is present, then the number of the dominant eigenvalues (i.e., eigenvalues between the spectral gap and 1.0) gives the number of the metastable clusters in the system (Schütte and Huisinga 2003). Positive verification of the Markovianity of the hidden process has an additional advantage: it allows us to construct a reduced dynamical model of the analyzed process and to estimate some dynamical characteristics of the analyzed process (e.g., one can calculate relative statistical weights, mean exit times, and mean first passage times for the identified clusters) (Horenko et al. 2008a).

Verification of the Markov assumption also allows us to construct predictive Markovian models of the persistent dynamical process Γ(*t*) switching between the cluster states (Horenko et al. 2008a; Horenko 2009). As will be demonstrated for the numerical examples in the following, the respective *K* × *K* Markovian transition matrix, together with the parameter estimates (37) and (38) of the locally stationary cluster states, can help to construct the dynamical ensemble prediction models. It is important to mention that without this a posteriori Markov verification, the minimum of the functional (34) cannot be directly used to generate the data-based predictions since the identified cluster affiliation function Γ is just an abstract BV function with some predefined persistency *C* and has no intrinsic representation in terms of the underlying dynamical process.

### d. Selection of the optimal persistency threshold C for a given number of states K

Linearity of the functional (34) wrt **x̃** for *fixed* values of model distance parameters *θ _{i}* can help to apply the standard instruments from the theory of ill-posed linear problems, such as the

*L*-curve approach (Calvetti et al. 2000), to identify the optimal value of the persistency threshold

*C*for a fixed number of clusters

*K*. For example, the optimal value of

*C*can be determined as the edge point (or the point of maximal curvature) on a two-dimensional plot. This plot depicts a dependence of the residuum-norm of the solution from

*C*(Horenko 2009). Alternatively, as would be demonstrated in the numerical examples, decreasing

*C*up to some point [meaning the increasing regularity of the optimal solutions of (34) in the BV sense] results in decreasing the respective values

**L**

_{0}of the clustering functional in the solution point (meaning the increasing quality of the resulting clustering). The point with the minimal value of

**L**

_{0}in such a case indicates the optimal persistency threshold guaranteeing the best clustering quality.

## 4. Postprocessing of the FEM–VARX clustering results

Application of the FEM–VARX numerical scheme with fixed values of *K* and *C* results in the identification of *K* optimal locally stationary VARX models and the persistent BV(0, *T*) function Γ(*t*) switching between them. As explained above, the a posteriori verification of the Markovian hypothesis for Γ(*t*) allows us to construct the reduced representation of the overall dynamics in the multiscale sense (e.g., as some slow persistent Markovian process switching between *K* locally stationary VARX parameter sets). Postprocessing of the derived local VARX models can give some additional insight into the analyzed time series. For example, expectation values of mean dynamical equilibrium positions of the analyzed dynamical process in the cluster state *i* can be calculated as functions of the external forcing *u _{t}*. In the case of the linear autoregressive VARX models (12), this is done via the solution of the following system of linear equations:

^{6}

Note that if 𝗔* _{q}^{i}* = 0, ∀

*q*, then the above result is equivalent to the multivariate trend estimate in context of the recently introduced FEM–

*K*-trends clustering algorithm (Horenko 2010b).

The local linearity of the identified VARX models also allows us to apply various techniques of model reduction known in the literature, such as proper orthogonal decomposition or balanced truncation (Moore 1981) allowing for construction of energy-preserving low-rank approximation*s* to the identified VARX models.

Another kind of insight into the underlying multidimensional dynamics can be gained via the analysis of the Fourier transforms of the identified models (e.g., via the analysis of the transfer function matrices and directed transfer functions; Pereda et al. 2005). This kind of analysis can help to quantify the causal influence of different data dimensions on each other.

## 5. Numerical examples

In the following we will illustrate the proposed FEM–VARX clustering strategy on two practical examples: first, on data from the Lorenz ’96 model (Lorenz 1996) with external periodical forcing switching between the deterministic and the chaotic regime behavior and second, on a set of averaged daily ERA-40 500-hPa geopotential data between 1958 and 2003 on a 16 × 9 spatial grid covering Europe and part of the North Atlantic.

### a. Subgrid-scale modeling for Lorenz ’96 system with forcing

The Lorenz ’96 model of type II (Lorenz 1996; Orrell 2003) is a two-scale simplified ordinary differential equation (ODE) describing advection, damping, and forcing of some (slow) resolved atmospheric variable *x̃ _{i}* being coupled to some (fast) subscale variables

*y*

_{i,j}:

with all of the model parameters ( = 8, = 4, *h* = 1, *b* = *c* = 10) set to be the same as in the paper of Orrell (2003). In contrast to the paper of Orrell (2003), we choose the external forcing to be an explicitly time-dependent function of the form *F*(*t*) = 10 sin(2*π*/80*t*), resulting in switching between “deterministic” and “chaotic” regime behavior (Orrell and Smith 2003) on the time scale of the forcing (e.g., producing a system with three different time scales).

We generate one realization of (41) in the time interval [0, *T*] with an adaptive Runge–Kutta method of the fourth order (MATLAB command ode45), resulting in a 40-dimensional time series with 4000 instances (time step *τ* = 0.1). We further aim at parameterization of the subgrid-scale influences *x _{t}^{i}* =

**F**

*(*

_{t}^{y}*t*) in the nonstationary VARX form (11) with external factors defined by the resolved variables {

*ϕ*_{2}[

*u*(

*t*)] = [

*x̃*

_{1}(

*t*), … ,

*x̃*

_{8}(

*t*)]}. Figure 1 demonstrates the application of the FEM–VARX clustering to the time series of

**x**

*and*

_{t}

*ϕ*_{2}[

*u*(

*t*)] defined in such a way for

*K*= 2 (number of clusters),

*q*= 1 (memory depth), and with

*N*= 200 (number of finite elements) for various numbers of persistency threshold parameter. The minimum is achieved for

*C*= 6, which corresponds to the number of transitions through the bifurcation point of (41). As can be seen from Fig. 2, the FEM–VARX clustering procedure for

*K*= 2 and

*C*= 6 identifies both dynamical regimes of the model (41), whereas the minimization of (34) without the persistency constraint (or, equivalently,

*C*= 2

*N*= 400) results in a much worse clustering result (in terms of the optimal value of

**L**

_{0}) and is due to the trapping in the local minimum of the clustering functional. Inspection of the confidence intervals of the estimated local stationary VARX model parameters in Fig. 3 reveals that whereas the matrices 𝗔 are statistically indistinguishable (e.g., the “own dynamics” of the subgrid-scale process is basically the same in both states), there are statistically significant differences in 𝗕 (in this case it governs the coupling from the resolved degrees of freedom

**x̃**to subgrid-scale degrees of freedom). Also, the noise intensities 𝗖 are different in both cases (in the identified chaotic regime the noise intensity is significantly higher).

Finally, we verify the Markov assumption for the identified affiliation function Γ(*t*), generate the ensemble predictions (with 10 000 ensemble members) based on the estimated eight-dimensional FEM–VARX parameterization (based on the first 90% of the data), calculate the expectation values of the relative prediction errors (in 2-norm, based on the last 10% of the data; black solid line in Fig. 4), and compare them with predictions obtained by other data-driven models (see Fig. 4): (i) stationary constant (gray dotted line) and regressive (black dashed–dotted line) subgrid-scale models (Orrell 2003); (ii) stationary linear stochastic model (black dashed line) (Wilks 2005); and (iii) one-dimensional FEM–VARX model, estimated under the assumption that different subgrid-scale processes in different dimensions do not interact (e.g., the matrices 𝗔 are diagonal) (gray solid line). Figure 4 shows that the fully interactive eight-dimensional persistent FEM–VARX model has a much better predictive skill for the considered nonstationary model series.

### b. Analysis of ERA-40 geopotential data in Europe (1958–2003)

Using the FEM–VARX method introduced above, we analyze daily mean values of the 500-hPa air temperature field from the ERA-40 reanalysis data (Simmons and Gibson 2000).^{7} We consider a region with the coordinates 32.5°–75.0°N, 27.5°W–47.5°E, which includes Europe and a part of the eastern North Atlantic. The resolution of the data is 5°, which implies a grid with 16 points in the zonal and 9 in the meridional direction. For the analysis we have considered temperature values only for the period 1 December 1958 to 31 July 2003; thus, we end with a 144-dimensional time series of 16 314 days. To remove the seasonal trend we apply a standard procedure, where from each value in the time series we subtract a mean build over all values corresponding to the same day and month (e.g., from the data on 1 January 1959 we subtract the mean value over all days that are the first of January, and so on). The resulting deviations Δ*H*(*t*) of the geopotential heights from their seasonal mean values are the subject of the data analysis in the following.

Since the overall dimensionality of 144 of the analyzed data series will induce uncertainties too high for the given time span of 16 314 days, the scope of the data analysis in the following is reduced to the projections on 20 dominant EOFs (describing approximately 99% of the total data variance).

Three different external influence factors **u*** _{t}* are tested on their significance for the change of the EOF projection data

**x**

*in the setting of the VARX models (11): (i) atmospheric CO*

_{t}_{2}values

*u*

_{t}^{1}from the Mauna Loa observatory (data are available online at http://cdiac.ornl.gov/ftp/trends/co2/maunaloa.co2), (ii) the seasonal trend factor in the form

*u*

_{t}^{2}= sin(2

*π*/365.4

*t*), and (iii) the sunspot cycle data

*u*

_{t}^{3}(data are available online at http://solarscience.msfc.nasa.gov/SunspotCycle.shtml).

We start the data analysis parameterizing the globally stationary VARX model for 20 dominant EOFs with different sets of external factors described above for various values of the memory parameter *q*, applying the BIC test to verify the statistical significance of the factors and testing the independent identically distributed (i.i.d.) assumptions for the model residuals with the AR(1) test (Brockwell and Davis 2002) (to check the validity of the underlying estimator assumptions). It is found that the optimal globally stationary VARX model for the analyzed data is the one with *q* = 2 and no external factors; that is, in the globally stationary framework all of the external factors are found to be statistically nonsignificant.

Application of the nonstationary FEM–VARX framework according to the *K* selection procedure described above results in determining *K* = 4 as the maximal number of statistically distinguishable local VARX states (valid for a wide range of parameters *C*, *N*, *q*). For a fixed value of *K* = 4 we further determine the optimal value of the persistency threshold *C* (in the same manner demonstrated in Fig. 1), determine the optimal value of memory *q* = 2 via the BIC test (e.g., the same as for the globally stationary VARX model), and test the i.i.d. assumptions for the model residuals with the AR(1)-test (Brockwell and Davis 2002). Next, we repeat the verification of external factors in the same way as described above for the globally stationary VARX model. As shown in Fig. 5, the BIC test confirms the hypothesis that the optimal FEM–VARX model is the one with the two external factors *u _{t}*

^{1}(atmospheric CO

_{2}values) and

*u*

_{t}^{2}(seasonal factor), whereas the influence of the sunspot cycle data

*u*

_{t}^{3}is found to be statistically negligible. Therefore, all of the further tests are conducted only with the factors

*u*

_{t}^{1}and

*u*

_{t}^{2}.

Inspection of the respective cluster affiliation functions *γ _{i}*(

*t*) (see Fig. 6) together with the mean equilibrium positions in the cluster states (40) reveals (see Figs. 7 and 8) that two of the identified cluster states, namely states 3 and 4, are describing the blocking situation in the upper troposphere. Figure 6 shows the comparison of the sum of cluster affiliations

*γ*

_{3}(

*t*) +

*γ*

_{4}(

*t*) for the par of the time series with the scaled zonally averaged Lejenas–Okland blocking index. It indicates the appearance of a blocking anticyclone and the duration of the event. We have a blocking if the geopotential height difference at 500 hP between 40° and 60°N is negative over a region with 20° zonal extent. The exact formula is given in Lupo et al. (1997); for the purpose of representation, we have computed a zonally averaged value of the index, rescaled it, and reversed its sign.

Next we aim at demonstrating how the postprocessing methods described in section 3 can help us gain additional insight into the data to understand the impact of the external factors. Figures 7 and 8 show the EOF back-projection to the original 144-dimensional space of the mean dynamical equilibrium positions in clusters 3 and 4 calculated at different times with the respective values of the factors *u _{t}*

^{1}and

*u*

_{t}^{2}As can be seen from the graphics, the impact of the seasonal factor

*u*

_{t}^{2}results in significant weakening of the blocking states in summer, a finding consistent with the observations reported in the literature. The impact of the increasing CO

_{2}concentration also results in a weakening of the blocking situation in both states 3 and 4 that is less pronounced but still statistically significant (see the respective confidence intervals in Figs. 7 and 8).

Finally, in the same manner as in the previous nonstationary Lorenz ’96 example, we construct an ensemble prediction model (with 10 000 ensemble members) based on the FEM–VARX clustering results and compare the resulting predictions with the ones obtained by other methods. First, we verify the Markov assumption in the time-homogenous approximation for the identified affiliation function Γ(*t*), generate the ensemble predictions based on the estimated 20-dimensional FEM–VARX parameterization (based on the first 90% of the data), and calculate the expectation values of the relative prediction errors (in 2-norm, based on the last 10% of the data). As can be seen from Fig. 9, FEM–VARX ensemble predictions produce much better predictions than the constant model (where the prediction is always the expectation value over the whole history) and the “same as today model” (where the prediction for the next day is just the state of the system now). Compared with the globally stationary VARX model (where the ensemble prediction is calculated from 10 000 ensemble members of the global VARX model), FEM–VARX produces only slightly better predictions (approximately 2% better). This observation could be explained by the low overall persistence of the process Γ(*t*) compared, for example, to the previous Lorenz ’96 example where the hidden process was very persistent. This issue needs a deeper understanding and is a matter of future research.

## 6. Conclusions and discussion

A numerical FEM–VARX scheme for a data-driven parameterization of nonstationary multiscale dynamical processes was introduced. As demonstrated in the present paper, the application of the FEM–VARX method allows for the good description of the analyzed nonlinear and nonstationary data (in terms of recovering the nonlinear effects associated with the regime switching and in terms of the good prediction quality of the resulting reduced representation). Furthermore, a wide range of data-analysis techniques—for example, from information theory (like the Bayesian information criterion; cf. McQuarrie and Tsai 1998), model reduction approaches (like balanced truncation; cf. Moore 1981) and adaptive finite element methods—has become available and can be applied in the FEM–VARX context to postprocess the obtained results and to get additional insight into the analyzed data. In contrast to other multiscale approaches known from the literature (Majda et al. 2003; Fatkullin and Vanden-Eijnden 2004), the presented numerical scheme is not a systematic strategy based on the knowledge of some first principles but is purely data driven and results in an approximation of the analyzed process by means of a sequence of “simple” linear factor models. From this perspective, it is important to put further efforts into development of mixed numerical schemes, combining some features of systematic “first principles” approaches (and with this some a priori knowledge about the analyzed dynamics) with some aspects of purely data-driven approaches (like the one presented in this paper).

One of the main emphases in the present paper was on implicit mathematical assumptions being made at different stages of the derivation of the numerical method and postprocessing of the obtained results. In context of the meteorological application it was shown how big the impact of the implicit stationarity assumption is on the analysis of climate factors’ influence: whereas in the case of the globally stationary VARX model with constant coefficients, the impacts of the seasonal trend and CO_{2} concentration were found to be statistically insignificant, the nonstationary FEM–VARX clearly reveals the statistical significance of the two factors (see Fig. 5). It is important to emphasize that the applicability of the presented method, as well as subsequent interpretation and postprocessing of the obtained results, is dependent on the fulfillment of the mathematical assumptions involved. Conclusions that are drawn in every specific application case are reliable only if those assumptions are fulfilled; this is one of the most important issues to be kept in mind when applying methods such as the one presented in this paper.

## Acknowledgments

The work was partially supported by the DFG SPP 1276 MetStroem “Meteorology and Turbulence Mechanics”, the DFG Research Center “Matheon” (Project E8), and the Center of Scientific Simulations (CSS Berlin).

## REFERENCES

_{2}concentration atmospheres.

### APPENDIX

#### Making the BV Problem Differentiable

subject to

## Footnotes

*Corresponding author address:* Illia Horenko, Institute of Mathematics, Free University of Berlin, Arnimallee 6, 14195 Berlin, Germany. Email: horenko@math.fu-berlin.de

^{1}

The influence of the Tykhonov regularization in the *W*_{1,2} Sobolev norm is equivalent to the action of the diffusion operator *ε*^{2}Δ* _{t}* on

*γ*(

_{i}*t*) [e.g., the increase of the regularization parameter

*ε*

^{2}results in the stronger diffusion of the affiliation function

*γ*(

_{i}*t*) through the interface separating the clusters in time]. For a detailed discussion of the

*W*

_{1,2}regularization effects and their influence on different clustering scenarios, see Horenko (2009, 2010b).

^{2}

For practical examples of standard finite element function sets *f _{i}*(

*t*) (such as linear finite elements) in discrete time, see Horenko (2010b) and Braess (2007).

^{3}

This can be done in a standard way, controlling the norm of the discretization errors *χ _{N}*

^{+},

*χ*

_{N}^{−}locally and applying the multigrid approach to guarantee that |

*χ*

_{N}^{±}| ≤

**(Braess 2007).**

*δ̃*^{4}

We get use of the convexity of the functional *g*[**x**_{t}, * θ_{i}*] = ‖

**x**

_{t}−

*− 𝗔*

**μ**^{i}

^{i}**ϕ**_{1}(

*x*

_{t−τ}, … ,

*x*

_{t−mτ}) − 𝗕

^{i}**ϕ**_{2}[

*u*(

*t*)]‖

_{Pi}with respect to parameters 𝗔𝗕

*and the necessary minimum condition for convex functionals.*

^{i}^{5}

Note that all of the numerical criteria for verification of the Markov assumption known in the literature imply the stationarity (or homogeneity) of the underlying transition matrix and are in general not applicable if the analyzed process is known to be strongly nonstationary.

^{7}

The author thanks the Potsdam Institute for Climate Impact Research (PIK) for access to the ERA-40 data.

^{8}

One could take the condition as an additional explicit constraint, ending up with two equality constrains instead of one. This will obviously increase the overall numerical cost of the method.