## Abstract

The authors consider statistical ensemble data assimilation for a one-layer shallow-water equation in a twin experiment: data are generated by an *N* × *N* enstrophy-conserving grid integration scheme, with an Ekman vertical velocity at the bottom of an Ekman layer driving the flow and Rayleigh and eddy viscosity dissipation damping it. Data are generated for *N* = 16, and the resulting chaotic flow is analyzed. This analysis is performed in a path-integral formulation of the data assimilation problem. These path integrals are estimated by a Monte Carlo method using a Metropolis–Hastings algorithm. The authors' concentration is on the number of measurements *L _{c}* that must be assimilated by the model to allow accurate estimation of the full state of the model at the end of an observation window and accurate prediction beyond it. It is found that for this shallow-water flow approximately 70% of the full set of state variables must be observed to accomplish either goal. The number of required observations is determined by examining the number needed to synchronize the observed data and the model output when *L* data streams are assimilated by the model. Synchronization occurs when *L* ≥ *L _{c}* and the correct selection of which data are observed is made. If the number of observations is too small, so synchronization does not occur, or the selection of observations does not lead to synchronization of the data with the model output, state estimates during and at the end of the observation window and predictions beyond the observation window are inaccurate.

## 1. Introduction

Transferring information from observations to models of geophysical flows allows the estimation of unknown parameters in the dynamical equations and the estimation of unobserved states of the model. This permits one to examine a representation of the state of the flow at the end of the observation period and enables the use of the model to predict beyond the observation period. Data assimilation has been explored widely as the tool through which one can effect this transfer of information, and various methods have been used to accomplish the task.

We describe here results on the use of the path-integral formulation (Abarbanel 2009; Quinn and Abarbanel 2010; Apte et al. 2007; Kim et al. 2003) of the statistical data assimilation problem (Evensen 2008; Kalnay 2003; Cox 1964; Friedland and Bernstein 1966; Jazwinski 1970) in its application to shallow-water flows in which the interior shallow layer is driven by the vertical Ekman velocity at the base of the Ekman layer (Pedlosky 1987; Gill 1982). This is a well-studied problem from the point of view of using one and many shallow layers to predict the geophysical flows of interest. We do not add to the investigation of these flows from the point of view of their producing realistic flow regimes or from the point of view of exploring the bifurcation structures of the flows delineating different stable regimes of the flow.

In this paper we study a single shallow-water layer driven by the vertical velocity at the base of an Ekman layer above it. This forcing drives the shallow interior flow into a chaotic regime. We describe an exact framework for utilizing a set of *m* + 1 *L*-dimensional measurements *y _{l}*(*t _{n}*); *l* = 1, 2, …, *L*; *n* = 0, 1, 2, …, *m* in the observation window {*t*_{0}, *t*_{1}, *t*_{2}, …, *t _{m}* = *T*} to estimate the unobserved states of a model for the flow with *D* degrees of freedom; *D* ≥ *L*. We then develop a useful set of approximations to this framework.

The model of the flow is described by a *D*-dimensional discrete time map of the state **x**(*t _{k}*) = **x**(*k*), after discretization of the fluid equations in space and time. The map takes the state **x**(*k*) from time *t _{k}* to time *t _{k}*_{+1} via

$$\mathbf{x}(k+1) = \mathbf{f}[\mathbf{x}(k), \mathbf{p}], \quad (1)$$

where **p** is a set of time-independent parameters arising from the details of the model equations and the selected discretization in space and time. Equation (1) is the deterministic version of the dynamical equations, and we account for some form of model error below.

We address the following questions:

- How many measurements *L _{c}* at each observation time *t _{n}* must one make to assure that the estimation procedure can be accurate?
- Which *L _{c}* measurements are required to allow accurate estimation and prediction? Different measurements carry more or less information from the data to the model.
- If *L* ≥ *L _{c}*, how well can we predict forward after the observation window, that is, *t* > *T*? This prediction is limited by the intrinsic instabilities of the chaotic flow, and we connect those instabilities with the limit on the accuracy of the prediction.
- If *L* < *L _{c}*, how can we detect that situation? What limitation does that present on estimating the state of the model at *t* = *T* and predictions beyond that time?

We perform a set of twin experiments wherein we generate our *D*-dimensional “data” for the shallow-water flows and select a subset of *L* ≤ *D* observations available for the data assimilation. The observations are taken to be noisy, errors in the model are permitted, and we address uncertainty in the model state **x**(*t*_{0}) = **x**(0) when observations commence.

Our ensemble data assimilation tool is a path integral over the state through the observation window. In this paper we take as observations measurements of the height *h*(*x*, *y*, *t*), zonal (east–west) velocity *u*(*x*, *y*, *t*), and meridional (north–south) velocity *υ*(*x*, *y*, *t*) fields at the spatial grid points used in the discretization of space and at the observation times *t _{n}*. The observations **y**(*t _{n}*) will be compared directly with the model outputs. If one observed functions *h _{l}*[**x**(*t _{n}*)] of the flow fields, then comparison of the observations *y _{l}*(*t _{n}*) would be made with the *h _{l}*[**x**(*t _{n}*)].

The use of the path integral permits direct estimation of the expected value of each dynamical variable **x**(*t _{n}*) and each fixed parameter, as well as the estimation of RMS errors and higher moments about this expected value. Further, from the Monte Carlo evaluation of the path integral (Quinn and Abarbanel 2010) one may also extract the marginal distribution of any component *x _{a}*(*t _{k}*) of the state and of any of the fixed parameters. The outcome of our data assimilation will directly address the quantification of uncertainty in geophysical flows.

_{k}We use a Metropolis–Hastings (Metropolis et al. 1953; Hastings 1970) method for the evaluation of the relevant path integrals. Other methods such as ensemble Kalman filters (Evensen 2008) or sequential Monte Carlo methods (Doucet and Johansen 2011) may provide equally good approaches to the estimation of the relevant means and RMS variations about these means. Our focus here is on the number of measurements one requires to assure that any methods of estimation may succeed. We, therefore, do not attempt to compare various methods in this work.

To evaluate the number of measurements required for accurate estimation of states, we proceed in two ways:

- First, we take data generated by the solution of the shallow-water equations as described below and ask when the output of a model of the flow, as *L* = 1, 2, …, *L _{c}* data streams from the data are presented to the model, synchronizes with the data.
- Second, we use the path-integral method, outlined in the next section, to provide estimates of the unobserved states of the shallow-water model and ask how accurate the predictions of the shallow-water flow are as the number of data streams assimilated by the model is increased.

These two methods of identifying the number of required measurements are essentially the same. The quantity *L _{c}* is the number of conditional Lyapunov exponents (Pecora and Carroll 1990; Abarbanel et al. 2010) associated with the manifold where the data and the model output match. More precisely, we have an *L*-dimensional submanifold where the *L*-dimensional observations *y _{l}*(*n*) are located. If these observations are physically taken to correspond to the *L* measurement functions with known dependence on the *D*-dimensional model output **x**(*n*) as *h _{l}*(**x**), then the synchronization manifold is the *L*-dimensional subspace of the *D*-dimensional model state space where *y _{l}*(*n*) = *h _{l}*[**x**(*n*)]. If *L* < *L _{c}*, synchronization does not occur and the surfaces in a search space are riddled with local minima. Accurate prediction rests on having the same number of data streams presented to the model as does accurate synchronization of the model output with the measurements.

In principle, one can evaluate the conditional Lyapunov exponents directly from the differential equations of the model (Abarbanel 1996; Kantz and Schreiber 2004), but diagonalizing the very ill-conditioned *L* × *L* matrices involved is numerically very challenging. Even determining the largest conditional Lyapunov exponent, while less demanding, remains a substantial numerical task. Examining the synchronization between the model outputs associated with two solutions starting from different initial conditions conveys the same information and is much less difficult. *L _{c}* is the number of measurements required to make all conditional Lyapunov exponents negative.

We show from simulations of the model that, with our selection of parameters, the time series of the shallow-water flow is chaotic, so accurate prediction places strong demands on accurate estimation of the full state of the shallow-water system. In our calculation we solve the shallow-water equations on a 16 × 16 grid, requiring 3 × 256 = 768 differential equations, that is, a 768-dimensional state-space estimation problem. In the cases examined, *L _{c}* ≈ 70% of the number of degrees of freedom (height and horizontal velocities) in the model.

In twin experiments we can also examine how well the model output approximates the data for unobserved data streams. This too shows that when the number of measurements is smaller than the number *L _{c}* needed for good prediction or synchronization, the unobserved state variables are not well estimated. In working with real data we have no knowledge of the full dataset, so one cannot make this comparison. Prediction then becomes the key indicator of the quality of the model and of the accuracy of the estimation of the states, observed and unobserved.

## 2. Path-integral data assimilation: Uncertainty quantification

The quantity of interest to us is the conditional probability distribution of the state variables **x**(*n*) at time *t _{n}* when measurements have been made at times {*t _{n}*, *t _{n}*_{−1}, *t _{n}*_{−2}, …, *t*_{0}} before that time. Designate the *L*-dimensional measurements at time *t _{n}* and earlier as **Y**(*n*) = {**y**(*n*), **y**(*n* − 1), …, **y**(0)}. (If a measurement is not made at some *t _{k}*; *t*_{0} ≤ *t _{k}* ≤ *t _{n}*, it is omitted from the list.) We would like to evaluate *P*[**x**(*n*) | **Y**(*n*)] and from this provide a formula for the expectation value of any function *G*(**X**) along the path **X** = {**x**(*m*), **x**(*m* − 1), …, **x**(1), **x**(0)} of the *D*-dimensional model state through the observation window {*t*_{0}, *t*_{1}, *t*_{2}, …, *t _{m}* = *T*}.

Because the measurements are noisy, the model has errors, and the initial state of the model **x**(0) is uncertain, we deal with probability distributions. First note

$$P[\mathbf{x}(n+1)\,|\,\mathbf{Y}(n+1)] = \frac{P[\mathbf{x}(n+1), \mathbf{y}(n+1)\,|\,\mathbf{Y}(n)]}{P[\mathbf{y}(n+1)\,|\,\mathbf{Y}(n)]} = \frac{P[\mathbf{x}(n+1), \mathbf{y}(n+1)\,|\,\mathbf{Y}(n)]}{P[\mathbf{x}(n+1)\,|\,\mathbf{Y}(n)]\, P[\mathbf{y}(n+1)\,|\,\mathbf{Y}(n)]} \int d\mathbf{x}(n)\, P[\mathbf{x}(n+1)\,|\,\mathbf{x}(n)]\, P[\mathbf{x}(n)\,|\,\mathbf{Y}(n)],$$

where we have used Bayes's rule, which gives identities on conditional probabilities (MacKay 2003). In the final step we use the Markov property of the dynamics Eq. (1), stating that the state **x**(*n* + 1) depends only on the previous state **x**(*n*).

The conditional mutual information (Fano 1961)

$$\mathrm{CMI}[\mathbf{x}(n), \mathbf{y}(n)\,|\,\mathbf{Y}(n-1)] = \log\left\{\frac{P[\mathbf{x}(n), \mathbf{y}(n)\,|\,\mathbf{Y}(n-1)]}{P[\mathbf{x}(n)\,|\,\mathbf{Y}(n-1)]\, P[\mathbf{y}(n)\,|\,\mathbf{Y}(n-1)]}\right\}$$

tells us how much information the *L*-dimensional measurement **y**(*n*) carries about the *D*-dimensional state **x**(*n*), conditioned on all earlier measurements.

This fundamental recursion relation allows us to give a path-integral representation to the probability distribution at the end of a measurement window conditioned on the measurements that have taken place within the window:

$$P[\mathbf{x}(m)\,|\,\mathbf{Y}(m)] \propto \int \prod_{n=0}^{m-1} d\mathbf{x}(n)\; \exp\{-A_0[\mathbf{X}, \mathbf{Y}(m)]\},$$

with the action

$$A_0[\mathbf{X}, \mathbf{Y}(m)] = -\sum_{n=0}^{m} \mathrm{CMI}[\mathbf{x}(n), \mathbf{y}(n)\,|\,\mathbf{Y}(n-1)] - \sum_{n=0}^{m-1} \log P[\mathbf{x}(n+1)\,|\,\mathbf{x}(n)] - \log P[\mathbf{x}(0)].$$

The expectation value of any function *G*(**X**) along the path is given as

$$E[G(\mathbf{X})\,|\,\mathbf{Y}(m)] = \frac{\int d\mathbf{X}\; G(\mathbf{X}) \exp\{-A_0[\mathbf{X}, \mathbf{Y}(m)]\}}{\int d\mathbf{X}\; \exp\{-A_0[\mathbf{X}, \mathbf{Y}(m)]\}}. \quad (5)$$

All the sources of uncertainty in the model dynamics are embodied in the transition probabilities *P*[**x**(*n* + 1) | **x**(*n*)]. All the sources of noise or interference in the measurements are embodied in the conditional mutual information terms CMI[**x**(*n*), **y**(*n*) | **Y**(*n* − 1)] and all the uncertainty in the state of the model at the time we start measurements is embodied in *P*[**x**(0)].

This means that by choosing *G*(**X**) appropriately we may answer various questions about the properties of the outcome of informing our model about measurements while allowing the state **x**(0) to be moved through the observation window by the dynamics. Suppose we want to know the expected value of the model state along the path; then we select *G*(**X**) = **X**. If we wish the RMS errors about these estimates of the expected value, we select the components

$$G_a(\mathbf{X}) = X_a^2,$$

and from these expected values we may evaluate the RMS error $\sqrt{E[X_a^2\,|\,\mathbf{Y}(m)] - E[X_a\,|\,\mathbf{Y}(m)]^2}$ at each point within the observation window for each component of the state vector **x**.

These moments of the state of the model, along with any other moments one considers significant, quantify the uncertainty about the outcome of the model, given information about the measurements, accounting for all sources of uncertainty in the processes of acquiring the data and propagating the dynamics through the data acquisition window. Of course, the skill with which we can represent the sources of uncertainty in the ingredients of *A*_{0}[**X**, **Y**(*m*)] is not addressed by this construction.

The accuracy with which we are able to quantify the uncertainty in the model state now rests on how well we are able to estimate the required (*m* + 1)*D*-dimensional integrals.

We will estimate the integrals of the form Eq. (5) using a Monte Carlo method (Metropolis et al. 1953; Hastings 1970; MacKay 2003; Neal 1993; Gelman et al. 2003) that provides a way to sample the probability distribution exp{−*A*_{0}[**X**, **Y**(*m*)]}. The Monte Carlo method identifies paths **X** that are distributed according to this probability distribution, and when we have completed the Monte Carlo calculation we will have a set of accepted paths **X**^{(j)} according to the criteria we discuss below. If we have *N*_{path} accepted paths, then

$$E[G(\mathbf{X})\,|\,\mathbf{Y}(m)] \approx \frac{1}{N_{\mathrm{path}}} \sum_{j=1}^{N_{\mathrm{path}}} G(\mathbf{X}^{(j)}).$$
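As a concrete illustration of this path-average estimator, the following sketch computes the expected path and the RMS spread about it from a set of accepted paths. The array layout (one accepted path per row, one time step per column, one state component per slice) is our assumption for illustration, not the paper's implementation:

```python
import numpy as np

def path_statistics(accepted_paths):
    """Estimate E[x_a(t_n) | Y(m)] and the RMS spread from accepted paths.

    accepted_paths: array-like of shape (N_path, m+1, D), a hypothetical
    layout with one accepted Monte Carlo path X^(j) per row.
    """
    paths = np.asarray(accepted_paths, dtype=float)
    mean_path = paths.mean(axis=0)              # E[x_a(t_n) | Y(m)]
    second_moment = (paths ** 2).mean(axis=0)   # E[x_a(t_n)^2 | Y(m)]
    # Guard against tiny negative values from floating-point cancellation.
    rms_error = np.sqrt(np.maximum(second_moment - mean_path ** 2, 0.0))
    return mean_path, rms_error

# Toy check: three "paths" for a single scalar state over two time steps.
paths = [[[1.0], [2.0]], [[3.0], [2.0]], [[2.0], [2.0]]]
mean_path, rms = path_statistics(paths)
```

With more components this same call yields the mean and RMS error for every state variable at every time in the window.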

A more detailed discussion of the Monte Carlo procedure and its parallel implementation will be given below.

We predict beyond the observation window (*t*_{0}, *t _{m}* = *T*) by taking some fraction of the accepted paths, here 1%, identifying the component of each path representing the state of the model at *t* = *T*, and using **x**^{(j)}(*T*) as initial conditions along with the estimated parameters associated with **X**^{(j)} to integrate the equations of motion forward into *t* > *T*. Averaging over these paths gives the expected value of the predicted state. The RMS error about this expected value is also computed using the accepted paths. An alternative, equivalent approach would be to extend the path integral into the time region *t* > *T* and use the same Monte Carlo estimation procedure to estimate properties of the state for *t* > *T*.
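A minimal sketch of this prediction step follows; the array layout for accepted paths and parameters and the one-step map `step` (standing in for the model dynamics of Eq. (1)) are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_ensemble(accepted_paths, accepted_params, step, n_steps, fraction=0.01):
    """Integrate forward from x^(j)(T) for a random fraction of the accepted
    paths; return the mean predicted state and the RMS spread about it."""
    accepted_paths = np.asarray(accepted_paths, dtype=float)
    accepted_params = np.asarray(accepted_params, dtype=float)
    n_keep = max(1, int(fraction * len(accepted_paths)))
    idx = rng.choice(len(accepted_paths), size=n_keep, replace=False)
    forecasts = []
    for j in idx:
        x = accepted_paths[j, -1].copy()        # state estimate x^(j)(T)
        traj = [x.copy()]
        for _ in range(n_steps):
            x = step(x, accepted_params[j])     # parameters estimated on path j
            traj.append(x.copy())
        forecasts.append(traj)
    forecasts = np.asarray(forecasts)           # (n_keep, n_steps + 1, D)
    return forecasts.mean(axis=0), forecasts.std(axis=0)

# Toy usage: 200 accepted paths of a 1-D state, all ending at x(T) = 1,
# with a linear one-step map x -> p * x and p estimated as 0.5 on each path.
paths = np.ones((200, 3, 1))
params = np.full((200, 1), 0.5)
mean_pred, rms_pred = predict_ensemble(paths, params, lambda x, p: p[0] * x, n_steps=2)
```

The spread `rms_pred` grows with lead time for a chaotic model, which is how the intrinsic instabilities limit prediction accuracy.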

It is useful to note what the Monte Carlo method achieves. We have a probability density *P*(**X**) in which **X** is an (*m* + 1)*D*-dimensional vector, and we write this density as

$$P(\mathbf{X}) = \frac{\exp[-A_0(\mathbf{X})]}{\int d\mathbf{X}'\; \exp[-A_0(\mathbf{X}')]}.$$

If we think of the Monte Carlo method as searching about in **X** space to sample this distribution, we can attach a label associated with the number of searches (Monte Carlo iterations) to each path, so we have a series of vectors **X**(*s*) with the label *s*. If the movement in *s* of **X**(*s*) satisfies a Langevin equation of the form

$$\frac{d\mathbf{X}(s)}{ds} = -\frac{1}{2}\nabla_{\mathbf{X}} A_0[\mathbf{X}(s)] + \boldsymbol{\eta}(s),$$

with *N*(0, 1) Gaussian noise **η**(*s*) of zero mean and variance unity, then one can derive (Lindenberg and West 1990) a Fokker–Planck equation for *P*[**X**(*s*)]:

$$\frac{\partial P[\mathbf{X}, s]}{\partial s} = \nabla_{\mathbf{X}} \cdot \left\{\frac{1}{2}\nabla_{\mathbf{X}} A_0(\mathbf{X})\, P[\mathbf{X}, s]\right\} + \frac{1}{2}\nabla_{\mathbf{X}}^2 P[\mathbf{X}, s].$$

As *s* → ∞ the stationary solution to this is *P*(**X**) ∝ exp[−*A*_{0}(**X**)]. The Monte Carlo method can then be seen as seeking points in **X** space that are near the extrema of the action *A*_{0}(**X**), accounting for fluctuations about the extrema to enable statistical measures of the distribution to be evaluated. This applies equally in the time region where observations are located as well as in the time region where predictions are made.
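The discretized form of such a Langevin update is easy to sketch. Here we apply it, purely for illustration, to a scalar quadratic action *A*_{0}(*x*) = *x*²/2, for which exp(−*A*_{0}) is the standard normal density; the long-run sample variance should then be near unity (up to discretization error in the step size):

```python
import numpy as np

rng = np.random.default_rng(1)

def langevin_step(x, grad_action, eps):
    """One discretized Langevin update in the artificial 'time' s:
    dX/ds = -(1/2) grad A0(X) + noise, whose stationary density
    is exp(-A0(X)) as eps -> 0."""
    return x - 0.5 * eps * grad_action(x) + np.sqrt(eps) * rng.standard_normal(np.shape(x))

# Scalar quadratic action A0(x) = x**2 / 2, so grad A0(x) = x.
grad = lambda x: x
x, samples = 0.0, []
for s in range(20000):
    x = langevin_step(x, grad, eps=0.1)
    if s >= 2000:                  # discard a burn-in transient
        samples.append(x)
var = float(np.var(samples))       # should be near the target variance of 1
```

The same update applied to the full path **X** is the basis of Langevin and hybrid Monte Carlo samplers for the action.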

The path-integral formulation is exact, and to be useful it requires approximations we discuss below. Here we note it is the natural framework for quantifying the uncertainty in data assimilation problems: with and without data. “Without data” the path integral provides an integral representation for the initial value problem of a “master equation” for the probability density functions (Zinn-Justin 2002; Lindenberg and West 1990). With data it adds to the free propagation of the dynamical system the transfer of information from the data to the model.

Once one specifies the distribution of the errors in the measurements and adopts a statement of the distribution of the errors in the model, then the path integral directly provides the statistics on the expected path and the RMS errors about that path. Though we do not address it here, this applies equally well to the estimation of the fixed parameters, indeed to the estimate of the distribution of the parameters as indicated by the model, the data, and the distribution of errors (Quinn and Abarbanel 2010).

One could proceed directly with the stochastic differential equations of the model as a method to generate paths allowed by the structure of the dynamics and the nature of the selected stochastic terms in the equations. The use of closure approximations such as "polynomial chaos" (Najm 2009), which yields a finite number of coupled ordinary differential equations among the moments of the state variables, leads to inconsistencies expressed in the appearance of negative probability distributions (Marcinkiewicz 1939; Pawula 1967a,b).

We turn next to a very brief review of the shallow-water equations to set the stage for data assimilation investigations using them as the generators of our data in a twin experiment.

## 3. One-layer shallow-water flow

Shallow-water flow is a very well-established model for geophysical flows (Pedlosky 1987; Gill 1982), and we repeat the dynamical equations here to establish notation. We wish to describe the interior flow of a single shallow layer of fluid of constant density *ρ* on a *β* plane rotating at *f*(**r**)/2 about the *z* axis, where *f*(**r**) = *f*_{0} + *βy*. It lies between a flat bottom at *z* = 0 and a height *h*(**r**, *t*) = *H*_{0} + *η*(**r**, *t*). The quantity *H*_{0} is a constant taken much larger than the Ekman-layer depth; *η*(**r**, *t*) is the deviation of the interior surface from *H*_{0}. Horizontal coordinates are indicated as **r** = (*x*, *y*). Above the height *h*(**r**, *t*) lies a wind-driven Ekman layer that produces a vertical velocity at its base. The Ekman layer is assumed to be thin compared to the depth of the interior fluid *h*(**r**, *t*).

The base of the Ekman layer, where it drives the interior flow, is taken to lie several Ekman-layer depths below the surface, so that the horizontal velocity of the Ekman flow is negligible with respect to the interior flow **u**(**r**, *t*) = {*u*(**r**, *t*), *υ*(**r**, *t*)}, which is taken to be independent of *z*. In essence, the Ekman layer is taken to act as a more or less passive fluid medium transferring information about surface wind stresses to the interior flow via a vertical velocity boundary condition. Its motion is tied to the ocean surface.

Further, we take the wind stress at the top of the Ekman layer, the fluid surface itself, to be time independent. In this setting, as described in some detail in Gill's section 9.9 (Gill 1982), the forcing is moved from the momentum equations, which are now driven by pressure gradients and damped by friction, to the vertically integrated continuity equation.

This leads us to the three dynamical equations for **u**(**r**, *t*) and *h*(**r**, *t*):

$$\frac{\partial u_a}{\partial t} + (f + \zeta)\,(\hat{\mathbf{z}} \times \mathbf{u})_a = -\frac{\partial}{\partial x_a}\!\left(g\eta + \frac{\mathbf{u}\cdot\mathbf{u}}{2}\right) - \epsilon\, u_a + A\, \nabla^2 u_a,$$

$$\frac{\partial h}{\partial t} + \nabla \cdot (h\,\mathbf{u}) = w_E(\mathbf{r}) = \hat{\mathbf{z}} \cdot \nabla \times \left[\frac{\boldsymbol{\tau}(\mathbf{r})}{\rho\, f(\mathbf{r})}\right], \quad (11)$$

with *a* = 1, 2; here *ε* is the Rayleigh friction coefficient and *A* the eddy viscosity. This is flow on a *β* plane with coordinates **r** = (*x*, *y*) and rotating at *f*(**r**)/2 about the *z* axis; *f*(**r**) = *f*_{0} + *βy*. In our calculations we take the time-independent wind stress to have the form *τ*(**r**) = [*F* cos(2*πy*), 0], and again we note *h*(**r**, *t*) = *H*_{0} + *η*(**r**, *t*). Here **ẑ** is a unit vector in the *z* direction. The *z* component of the vorticity *ζ*(*x*, *y*, *t*) is defined in the usual way.

This is a simplification of the full dynamical equations for a wind-driven flow that would separate the horizontal velocity into an Ekman velocity **u**_{E}(**r**, *z*, *t*) and the interior pressure-driven velocity **u**_{p}(**r**, *t*). This decomposition then couples the Ekman flow (Gill 1982) into the interior flow both through **u**_{E}(**r**, *z*, *t*) in the nonlinear terms of the momentum equation and in the manner we have chosen in Eq. (11) through the boundary condition on the vertically integrated conservation equations at the base of the Ekman layer.

One could, of course, bring the considerations in this paper to the more complicated wind-driven flow, and there should be some interest in that. We have simplified the flow with a number of assumptions, as our focus is on methods, generalizable to more complex flow regimes, to determine how many and which measurements of the flow fields are required to permit synchronization of data and the model in a twin experiment where the data are generated by the model. We show in the following discussion that this number is also critical for enabling accurate estimation of the state of the flow at the end of an observation window, and thus for the ability to predict the flow beyond the observation times.

## 4. Generating the data

Using the parameters in Table 1 we generated data on a 16 × 16 grid using the enstrophy-conserving discretization scheme of Sadourny (1975). Periodic boundary conditions were imposed on the flow. Figure 1 presents a snapshot of the flow at *t* = 40 h following a spinup period of 2000 h. A fourth-order Runge–Kutta method was used to move the solutions forward in time. On a 16 × 16 grid, large values of the surface elevation accompany the chaotic flow.
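The fourth-order Runge–Kutta step used to advance the spatially discretized equations has the standard form; a generic sketch (with an arbitrary right-hand side, not the authors' Sadourny-scheme right-hand side) is:

```python
def rk4_step(f, x, t, dt):
    """One fourth-order Runge-Kutta step for dx/dt = f(x, t)."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(x + dt * k3, t + dt)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Sanity check on dx/dt = x over one unit of time: the exact answer is e.
x = 1.0
for _ in range(100):
    x = rk4_step(lambda x, t: x, x, 0.0, 0.01)
```

For the shallow-water system, `x` would be the 768-dimensional vector of the gridded *u*, *υ*, and *h* fields.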

The flow generated with these parameters is chaotic. We demonstrate this feature by evaluating the largest Lyapunov exponent for the 768 dimensional system of ordinary differential equations (Abarbanel 1996; Kantz and Schreiber 2004). To accomplish this we selected two sets of very close initial conditions and integrated the two solutions of the equations forward in blocks of 4000 time steps of 0.01 h each. During each block of 40 h, we tracked the RMS distance between the two slightly perturbed solutions to the shallow-water equations and evaluated the largest Lyapunov exponents (LE) by approximating the growth of this error as *e*^{LE×t}. Following the standard procedure in studies of nonlinear dynamics, when this error grew large, we rescaled the values of the two orbits to have a very small value of their distance and then tracked growth of the distance between the same two orbits for another block of 40 h in time. We did this for 80 blocks of time.
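The block-and-rescale procedure just described can be sketched as follows. For verifiability the illustration uses the logistic map, whose largest Lyapunov exponent is known to be ln 2 per iteration, rather than the 768-dimensional shallow-water system; the block sizes here are chosen for the toy problem, not the values used in the paper:

```python
import numpy as np

def largest_le_two_orbits(step, x0, d0=1e-9, n_blocks=80, block_len=100):
    """Estimate the largest Lyapunov exponent by tracking the separation of
    two nearby orbits and rescaling it back to d0 after each block."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    y = x + d0 / np.sqrt(x.size)          # slightly perturbed companion orbit
    log_growth = 0.0
    for _ in range(n_blocks):
        for _ in range(block_len):
            x, y = step(x), step(y)
        d = np.linalg.norm(y - x)
        log_growth += np.log(d / d0)      # growth of the separation this block
        y = x + (d0 / d) * (y - x)        # rescale back to separation d0
    return log_growth / (n_blocks * block_len)

# Sanity check on the logistic map x -> 4 x (1 - x): largest exponent ln 2.
le = largest_le_two_orbits(lambda x: 4 * x * (1 - x), [0.3],
                           n_blocks=160, block_len=10)
```

Short blocks keep the separation in the linear regime, mirroring the 40-h blocks used for the shallow-water orbits.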

In Fig. 2 we present the estimated largest Lyapunov exponents from each of the horizontal velocity and height fields. These estimates vary a small amount from each other because the time window over which nearby orbits are tracked is only 40 h, and also because each rescaling starts an orbit from a different location on the system attractor, and the local Lyapunov exponents are inhomogeneous on the attractor. The variation of the largest Lyapunov exponents is seen to be relatively small, and the average over the variation shown in Fig. 2 for each field is LE ≈ 0.122 h^{−1}, or LE ≈ (8.2 h)^{−1}. From this we estimate that predictions should be accurate, assuming we have accurately estimated the states at the beginning of predictions, up to the order of 10 or 20 h after the start of predictions.

## 5. Synchronization of the data with the model output

If the estimations, using the path integral or other methods of data assimilation, are to be accurate, the phase space orbits in the data and those from the model output must synchronize. If this fails to happen, the surfaces over which one searches for values of the fixed parameters or for values of the unobserved state variables will be very ragged with numerous local minima (Abarbanel et al. 2010). The estimates of the desired quantities will be inaccurate. If synchronization occurs, then these same surfaces become smooth and the search, accomplished with one of many available numerical optimization packages, will be accurate.

In any problem such as the one we address in the context of the shallow-water equations, we strongly suggest checking the synchronization of the data and the model output. Below we will show that in this problem, when synchronization occurs, good estimates and good predictions are possible. When synchronization is absent, inaccuracies in both estimation and prediction will occur. This check on synchronization does not require actual observed data, but is a consistency check on the model that indicates how many data streams and which data streams are expected to allow accurate estimations and predictions. The synchronization of the two data streams is a property that can be examined in twin experiments, as is done here.

We now use our model to generate the synchronization error for each of the fields *φ _{i}*_{,j}(*t*) = {*u*(*i*, *j*, *t*), *υ*(*i*, *j*, *t*), *h*(*i*, *j*, *t*)} by evaluating the RMS synchronization error

$$\mathrm{SE}(T_{\max}) = \left\{\frac{1}{N_g\, T_{\max}} \sum_{i,j} \int_0^{T_{\max}} dt\, [\varphi_{i,j\text{-data}}(t) - \varphi_{i,j\text{-model}}(t)]^2\right\}^{1/2},$$

where the sum runs over the grid and *N _{g}* is the number of grid points, as a function of *T*_{max} for various configurations of the observed set of grid locations (*i*, *j*). As a notational remark, we use the integer labels (*i*, *j*) to indicate locations on the grid, while the independent variables **r** = (*x*, *y*) indicate continuous spatial field labels. Recall that the fields, as well as (*x*, *y*, *t*), are made dimensionless using the values in Table 1 to scale them.

To couple the information from the observations into the model, we add a term *k*[*φ _{i}*_{,j−data}(*t*) − *φ _{i}*_{,j−model}(*t*)]; *k* ≥ 0 to the right-hand side of each of the model differential equations in Eq. (11) that corresponds to an observed field. With initial conditions *φ _{i}*_{,j−model}(0) different from those used in generating the data, we evaluate the RMS synchronization error for each field as a function of *T*_{max}. For the first 40 h of integration we set all *k* = 0.4, and for the second 40 h of integration we reset *k* to 0. For *k* = 0 the two time series, data and model output, do not synchronize. As *k* is increased, there is a critical *k* where the largest conditional Lyapunov exponent (Pecora and Carroll 1990; Abarbanel 1996; Kantz and Schreiber 2004) becomes negative. For *k* greater than this value, the orbits can synchronize, and our choice of *k* = 0.4 is in that synchronization regime.
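The coupling term *k*(data − model) has the same form for any pair of identical chaotic systems. The sketch below uses the Lorenz (1963) equations rather than the shallow-water model, purely for compactness, with the coupling applied to a single "observed" component; the coupling strength and step size are chosen for this toy system, not taken from the paper:

```python
import numpy as np

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def coupled_step(data, model, k, dt):
    """Advance data and model one Euler step; the model receives the observed
    data stream through the added term k * (data - model) on its first
    component only."""
    d_new = data + dt * lorenz(data)
    rhs = lorenz(model)
    rhs[0] += k * (data[0] - model[0])   # coupling on the observed field
    m_new = model + dt * rhs
    return d_new, m_new

data, model = np.array([1.0, 1.0, 1.0]), np.array([-5.0, 4.0, 20.0])
for _ in range(8000):                    # 40 time units at dt = 0.005
    data, model = coupled_step(data, model, k=40.0, dt=0.005)
sync_error = float(np.linalg.norm(data - model))
```

With the coupling above threshold, the unobserved components of the model lock onto the data as well, which is the diagnostic exploited in this section.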

A *k* = 0 means no direct information about the data is available to the model, while during the first 40 h, the data are directly coupled to the model. Since the data and the model output are chaotic, we expect the two trajectories arising from different initial conditions to depart from each other, to become decorrelated, and to move separately around the system attractor in the second time window. This is, indeed, the case when *k* = 0. The decorrelation they acquire increases the synchronization error until it saturates at approximately the size of the attractor.

In the top panel of Fig. 3, we display the RMS synchronization error for each of the fields {*h*(*i*, *j*, *t*), *u*(*i*, *j*, *t*), *υ*(*i*, *j*, *t*)} when all *h*(*i*, *j*, *t*), all *υ*(*i*, *j*, *t*), and the westernmost grid column of *u*(*i*, *j*, *t*) are observed. For the first 40 h of integration *k* = 0.4, coupling the data into the differential equations, and we see that the synchronization error for each of the fields *u*, *υ*, and *h* decreases to order 10^{−4} or smaller, indicating the data and the model output are synchronized. When we turn off the coupling at *t* = 40 h, so *k* = 0, the model evolves without information from the data, and the chaotic model output and the chaotic data become unsynchronized. This happens because at *t* = 40 h the values of the model state variables, though possibly very close to the data values, have small residual errors compared to the data, and these errors grow because each field is chaotic. This absence of synchronization is associated with positive conditional Lyapunov exponents on the synchronization manifold (Abarbanel 1996; Kantz and Schreiber 2004) where data = model output. The surfaces over which one must search for unobserved state variables or unknown parameters are replete with local minima when *k* = 0 and are smooth and easy to search when *k* = 0.4. In further examples below, we will see that it is not only the number of observed state variables that matters: which ones are observed matters as well. The bottom panels of Fig. 3 display the synchronization errors for two other choices of observed fields. The configuration on the bottom left does not synchronize, while the configuration on the bottom right does exhibit synchronization.

As shown in Quinn and Abarbanel (2010), there is an association with the structure of the action *A*_{0}[**X**, **Y**(*m*)]: numerous local minima in the synchronization error translate to numerous local minima in the action. As the number of measured quantities reaches *L _{c}*, one or a distinct few minima dominate the action, suggesting that estimation and prediction using the path-integral formulation will be successful. This turns out to be the case.

## 6. Specifying the action

To proceed with our estimation of the path integral, we must specify the action in more detail. For the information transfer term, we make the assumption that the noise in the measurements is independent at each time when measurements are made and at each spatial point where measurements are made. If this is *not* true, then the general formulation includes that case. Further, we assume the noise is Gaussian at each time step and each spatial location where observations occur. In that case we may write

$$-\sum_{n=0}^{m} \mathrm{CMI}[\mathbf{x}(n), \mathbf{y}(n)\,|\,\mathbf{Y}(n-1)] = \sum_{n=0}^{m} \sum_{(i,j)\,\mathrm{observed}} \frac{R_m(i,j)}{2}\, [\varphi_{i,j\text{-data}}(t_n) - \varphi_{i,j\text{-model}}(t_n)]^2,$$

up to additive terms independent of the path **X**. This expresses the idea that different spatial components of the measurements may have different noise levels [proportional to 1/*R _{m}*(*i*, *j*)] and that these noise variances are independent of time.

The model error term log{*P*[**x**(*n* + 1) | **x**(*n*)]} represents the effect of noise or finite resolution on the deterministic dynamical equations. In this instance we represent those errors as a Gaussian spread of the deterministic transition probability, which is a delta function on the state space **x**. With a Gaussian error we may write

$$-\log P[\mathbf{x}(n+1)\,|\,\mathbf{x}(n)] = \sum_{(i,j)} \frac{R_f(i,j)}{2}\left\{\varphi_{i,j}(t_{n+1}) - \varphi_{i,j}(t_n) - \frac{\Delta t}{2}\left[F_{i,j}(\mathbf{x}(n+1)) + F_{i,j}(\mathbf{x}(n))\right]\right\}^2,$$

up to an additive normalization constant, where the differential equation arising from spatial discretization is

$$\frac{d\varphi_{i,j}(t)}{dt} = F_{i,j}[\mathbf{x}(t)],$$

and we have used the trapezoidal rule for the discrete time representation of this with a time step Δ*t*. The error variance [proportional to 1/*R _{f}*(*i*, *j*)] may vary with the spatial location (*i*, *j*), and this is reflected in this expression for the transition probability. It does not vary in time.

If we have a priori information on the distribution *P*[**x**(0)] when the observations begin, we could use it. In the calculations we present, we assume no information about the distribution of the initial state, so we assume maximal ignorance, which means *P*[**x**(0)] is uniformly distributed over the domain in state space where our Monte Carlo procedure searches. It is then just a constant and thus drops out of any calculation of *E*[*G*(**X**) | **Y**(*m*)]. We ignore it henceforth.

The action *A*_{0}[**X** | **Y**(*m*)] is now given as the sum of the measurement and model error terms:

$$A_0[\mathbf{X}\,|\,\mathbf{Y}(m)] = \sum_{n=0}^{m} \sum_{(i,j)\,\mathrm{observed}} \frac{R_m(i,j)}{2}\, [\varphi_{i,j\text{-data}}(t_n) - \varphi_{i,j\text{-model}}(t_n)]^2 + \sum_{n=0}^{m-1} \sum_{(i,j)} \frac{R_f(i,j)}{2}\left\{\varphi_{i,j}(t_{n+1}) - \varphi_{i,j}(t_n) - \frac{\Delta t}{2}\left[F_{i,j}(\mathbf{x}(n+1)) + F_{i,j}(\mathbf{x}(n))\right]\right\}^2.$$

In our use of the approximate action *A*_{0}(**X**) we did not use spatial correlations among the fields. The reader will recognize this approximation to the action as the weak four-dimensional variational data assimilation (4DVar; Evensen 2008) cost function, though we do not minimize it in the evaluation of the path integral. Its minimization in variational data assimilation is the first step in a systematic estimation of the path integral using a stationary path approximation.
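A sketch of evaluating a discrete Gaussian measurement-plus-model-error action of this kind for a candidate path is below. The names `Rm` and `Rf` stand in for the inverse variances of the measurement and model errors (our labels for illustration, taken constant here), and the observation operator is assumed to be a simple selection of observed components:

```python
import numpy as np

def action(X, Y, obs_idx, F, dt, Rm=1.0, Rf=1.0):
    """Evaluate a weak-4DVar-style action for a candidate path.

    X: (m+1, D) candidate path; Y: (m+1, L) observations of the components
    listed in obs_idx; F: vector field of the model ODEs, applied row-wise.
    """
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    # Measurement term: Gaussian errors, independent in space and time.
    meas = 0.5 * Rm * np.sum((Y - X[:, obs_idx]) ** 2)
    # Model-error term: trapezoidal-rule discretization of dx/dt = F(x).
    trap = X[1:] - X[:-1] - 0.5 * dt * (F(X[1:]) + F(X[:-1]))
    model = 0.5 * Rf * np.sum(trap ** 2)
    return meas + model

# Toy check: for linear dynamics F(x) = -x, a path built exactly by the
# trapezoidal update, observed without noise, should have zero action.
F = lambda x: -x
dt = 0.1
x = [1.0]
for _ in range(5):
    x.append(x[-1] * (1 - dt / 2) / (1 + dt / 2))   # exact trapezoidal update
X = np.array(x)[:, None]
A = action(X, X[:, [0]], [0], F, dt)
```

Any deviation of the path from the data or from the trapezoidal dynamics raises the action and lowers the path's weight exp(−*A*₀).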

The discussion of the Monte Carlo estimation of the path integrals that now follows does not at all depend on this form of the action. We do use this form, however, in the discussion.

## 7. Monte Carlo estimation of the path integral

Now we have a formulation of *P*[**X** | **Y**(*m*)], and we wish to evaluate quantities such as means and covariances about that mean of functions along the path **X**. These can be used to make estimates and predictions. These quantities are path integrals of the form given in Eq. (5), where *G*(**X**) is chosen to be some function of the path that is of interest.

The challenge is to evaluate these path integrals. One way to do this is to generate a series of paths that are distributed in path space according to *P*[**X** | **Y**(*m*)] ∝ exp{−*A*_{0}[**X** | **Y**(*m*)]}; then we can use those paths to approximate *E*[*G*(**X**) | **Y**(*m*)] as

$$E[G(\mathbf{X})\,|\,\mathbf{Y}(m)] \approx \frac{1}{N_{\text{paths}}}\sum_{q=1}^{N_{\text{paths}}} G(\mathbf{X}^{(q)}).$$
The sample paths can be thought of as representing the many possible time evolutions of the system state that could have produced the observed data. The paths that are more likely to have produced the observed data will be generated more times than the paths that are less likely to have produced the observed data. Indeed, the higher or lower frequency of the appearance of an acceptable path is according to the distribution exp{−*A*_{0}[**X**, **Y**(*m*)]}.
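The sample-average estimator described above can be sketched in a few lines of Python. This is a toy illustration only: the `paths` below are random stand-ins, not samples actually drawn from exp{−*A*_{0}}.

```python
import numpy as np

def expectation(paths, G):
    """Approximate E[G(X) | Y(m)] as an average of G over sample paths
    assumed drawn from P[X | Y(m)] proportional to exp(-A0[X | Y(m)])."""
    return np.mean([G(X) for X in paths], axis=0)

# Toy check with stand-in "paths": for G(X) = X the estimate is simply
# the sample-mean path; for G(X) = X**2 it is the second moment.
rng = np.random.default_rng(0)
paths = rng.normal(loc=1.0, scale=0.1, size=(5000, 3))
mean_path = expectation(paths, lambda X: X)           # close to [1, 1, 1]
second_moment = expectation(paths, lambda X: X ** 2)  # close to mean**2 + var
```

Any statistic of interest (covariances, histograms) follows the same pattern with a different choice of *G*.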

Now we need a method that will produce a series of paths that are distributed according to exp{−*A*_{0}[**X**, **Y**(*m*)]}. There are several path-integral Monte Carlo methods, such as Metropolis or Hybrid Monte Carlo, that do exactly this (Metropolis et al. 1953; MacKay 2003; Neal 1993).

One of the simplest and oldest approaches is the Metropolis Monte Carlo method (Metropolis et al. 1953). This method works by generating a sequence of paths that we call Monte Carlo paths by a biased or guided random walk through path space. The method is an example of a Markov chain Monte Carlo method, because the paths are generated in sequence and the next path is generated only from the current path in a stochastic way. The random walk is biased in a particular way as described below so that the sequence of paths that are generated will be distributed according to exp{−*A*_{0}[**X**, **Y**(*m*)]}.

The method works by generating a new path **X**^{(q+1)} from the current path **X**^{(q)} with a two-step procedure; the integer *q* = 0, 1, … labels the paths in the sequence. First, a candidate path **X**^{cand} is proposed by adding an unbiased random displacement to the current path **X** = **X**^{(q)}. The displacement may be applied to any collection of the components of the path **X** = {**x**(0), **x**(1), …, **x**(*m*)}, and these displacements may be drawn from any type of distribution, as long as it is unbiased. This assures that the transition **X** → **X**^{cand} is as likely to occur as the transition **X**^{cand} → **X**.

The proposal distribution actually does not need to be unbiased if the acceptance probability is modified in the way shown by Hastings (1970). There are several methods that make a smarter choice of the proposal distribution, such as the force bias method (Pangali et al. 1979). Such methods may converge in fewer iterations, but they have the added complexity of requiring derivatives of the action to be computed. We have not found one or another method of selecting proposed paths to be preferable in our work, but in each data assimilation problem one should examine this choice of methods.

Next the proposed path is either accepted, **X**^{(q+1)} = **X**^{cand}, or it is rejected, **X**^{(q+1)} = **X**, and the original path is retained. The probability for acceptance is

$$P_{\text{accept}} = \min\{1, \exp[-\Delta A_0(\mathbf{X}^{\text{cand}}, \mathbf{X})]\},$$
where Δ*A*_{0}(**X**^{cand}, **X**) = *A*_{0}(**X**^{cand}) − *A*_{0}(**X**) is the change in the action that would be caused by changing **X** to **X**^{cand}. This says that if a proposed change will lower the action, it should be accepted, and if the proposed change increases the action it should only be accepted with probability exp[−Δ*A*_{0}(**X**^{cand}, **X**)]. Only the change in the action is required, so the full action never needs to be computed. This means we do not need to keep track of additive constants to the action, and we can make the computation much more efficient by only evaluating the terms in the action that will be changed by the update.
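As a minimal sketch in Python, with a toy quadratic action standing in for *A*_{0}, the propose/accept/reject step looks like this:

```python
import numpy as np

def metropolis_step(X, action, delta, rng):
    """One Metropolis update: propose an unbiased (symmetric) random
    displacement and accept it with probability min(1, exp(-dA))."""
    X_cand = X + rng.uniform(-delta, delta, size=X.shape)
    dA = action(X_cand) - action(X)    # only the change in action is needed
    if dA <= 0.0 or rng.random() < np.exp(-dA):
        return X_cand                  # accept the candidate
    return X                           # reject: keep the current path

# Toy "action": a quadratic well, so the chain should settle around zero.
action = lambda X: 0.5 * np.sum(X ** 2)
rng = np.random.default_rng(1)
X = np.full(4, 5.0)                    # deliberately poor initial path
samples = []
for q in range(20000):
    X = metropolis_step(X, action, delta=1.0, rng=rng)
    if q >= 10000:                     # discard the burn-in phase
        samples.append(X.copy())
sample_mean = float(np.mean(samples))  # near 0 for this symmetric action
```

In a real implementation one would evaluate only the terms of Δ*A*_{0} touched by the proposal rather than calling `action` on the whole path, exactly as the text notes.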

## 8. Parallel implementation for GPUs

The Metropolis Monte Carlo method is simple and powerful, but it has the drawback of requiring very many path updates to get accurate statistics. Since so many path updates are required, the computation becomes very expensive in terms of computer time. One way to deal with this problem is to take advantage of parallel computing technology, using a graphics processing unit (GPU). With GPU technology, and using the Compute Unified Device Architecture (CUDA) instruction set, it is possible to execute hundreds of threads of instructions simultaneously (Quinn and Abarbanel 2011). Typically each thread will perform the same operations, but on different pieces of the data. Since the paths are updated sequentially, the path update process cannot be run in parallel. However, the many computations needed on each iteration can be done in parallel by having different threads work on different time steps.

The use of GPUs for Monte Carlo calculations did not originate with us in Quinn and Abarbanel (2011); earlier references include Hanweck (2011) and Tomov et al. (2005).

First the current path **X** is set to an arbitrary initial guess **X**^{(0)}, and the observation time series **Y**(*m*) is loaded from a file. The path **X**^{(0)} consists of the state vector **x**_{n} at every time step *n* = 0, 1, …, *m*. The current path, the observation time series, and a running sum of moments of the path components are allocated in GPU memory and initialized with the appropriate data.

Then the path update loop begins. First the even time *n* states are updated, and then the odd time *n* states (see Table 2 and Fig. 4). The reason for doing the state update in two steps is to uncouple the state vectors: to calculate the change in action due to perturbing **x**(*n*), we need to know **x**(*n* − 1) and **x**(*n* + 1), but none of the other state vectors. This way each even *n*, and then each odd *n*, can be updated independently, in any order or simultaneously.
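A toy version of this even/odd sweep might look as follows. For simplicity the sketch assumes a scalar state per time step and a one-step map *f* in place of the paper's trapezoidal-rule discretization; only the structure of the sweep is the point.

```python
import numpy as np

def local_action(x, n, y, Rm, Rf, f):
    """The terms of A0 that involve x[n]: its measurement term plus the
    model-error terms linking x[n] to x[n-1] and x[n+1]."""
    a = 0.5 * Rm * (y[n] - x[n]) ** 2
    if n > 0:
        a += 0.5 * Rf * (x[n] - f(x[n - 1])) ** 2
    if n < len(x) - 1:
        a += 0.5 * Rf * (x[n + 1] - f(x[n])) ** 2
    return a

def sweep(x, y, Rm, Rf, f, delta, rng):
    """One even/odd Metropolis sweep over all time steps."""
    for parity in (0, 1):                   # even n first, then odd n
        for n in range(parity, len(x), 2):  # independent: each site could
            old = x[n]                      # get its own GPU thread
            a_old = local_action(x, n, y, Rm, Rf, f)
            x[n] = old + rng.uniform(-delta, delta)
            dA = local_action(x, n, y, Rm, Rf, f) - a_old
            if not (dA <= 0.0 or rng.random() < np.exp(-dA)):
                x[n] = old                  # reject: restore the old value
    return x

# Toy run: trivial dynamics f(x) = x, data fixed at 2.0, strong measurement
# term, so the path should settle near the data.
rng = np.random.default_rng(2)
x, y = np.zeros(10), np.full(10, 2.0)
for _ in range(2000):
    x = sweep(x, y, Rm=100.0, Rf=1.0, f=lambda z: z, delta=0.2, rng=rng)
```

Because `local_action` touches only nearest neighbors in time, all even-*n* updates (and then all odd-*n* updates) commute, which is what licenses the parallel execution described above.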

After all the states have had an opportunity to update, the current path can be used to update the path statistics. Typically this will be the mean and variance of each component of the path, but covariances or higher moments could be of interest too. Another possibility is to record bin counts of some components of interest to make a histogram. This step is skipped for the first *N*_{initial} path updates, and after that only done every *N*_{skip}th path update. The statistics collection happens on the GPU, also in parallel, and so the individual paths are not recorded. This avoids costly data transfers between the GPU and CPU controlling the overall calculation.
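The running-sum bookkeeping can be sketched as follows; this is a plain-Python stand-in for the on-GPU accumulation, showing why no individual paths need be stored.

```python
import numpy as np

class RunningMoments:
    """Accumulate sums of x and x**2 so the mean and variance of each path
    component can be formed at the end without storing any paths."""

    def __init__(self, path_dim):
        self.n = 0
        self.s1 = np.zeros(path_dim)   # running sum of components
        self.s2 = np.zeros(path_dim)   # running sum of squared components

    def update(self, path):
        self.n += 1
        self.s1 += path
        self.s2 += path ** 2

    def mean(self):
        return self.s1 / self.n

    def variance(self):
        return self.s2 / self.n - self.mean() ** 2

stats = RunningMoments(3)
for path in ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]):
    stats.update(np.array(path))
# mean -> [2, 2, 2]; variance -> [1, 0, 1]
```

Higher moments or histogram bin counts extend the same pattern with additional accumulators.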

There are several more details that will now be discussed. The first *N*_{initial} iterations comprise an initialization phase, during which no path statistics are recorded. This initial phase is necessary to remove the influence of the initial guess path. In our implementation two other things happen during this phase: simulated annealing and automatic adjustment of the Monte Carlo step sizes Δ(*i*).

The simulated annealing is accomplished by multiplying the dynamics term of the action by *β* ≤ 1 and gradually increasing *β* during the initialization phase from some initial value up to its final value of *β* = 1. This way the action will initially be dominated by the observation term *A*_{O}, which is smooth and has a single well-defined minimum. Specifically, *β* is initially set to *β* = *β*_{0} < 1 and multiplied by the constant factor *f*_{β} after each iteration, until *β* = 1. The constant multiplying factor is

$$f_\beta = \left(\frac{1}{\beta_0}\right)^{1/N_{\text{cool}}},$$

so that *β* reaches unity after *N*_{cool} iterations.
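In code, and assuming the geometric cooling schedule implied by this description (with *β*_{0} and *N*_{cool} as quoted in section 9), the annealing loop is simply:

```python
# Geometric annealing sketch: beta climbs from beta0 to 1 over N_cool
# iterations when multiplied by f_beta each step.
beta0, N_cool = 1e-4, 80_000                 # values used in section 9
f_beta = (1.0 / beta0) ** (1.0 / N_cool)     # constant multiplying factor

beta = beta0
for _ in range(N_cool):
    beta = min(1.0, beta * f_beta)           # clamp at the final value 1
# beta is now 1 up to floating-point rounding
```
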

The *β* plays a role similar to that of the inverse temperature 1/*kT* in a system with a Boltzmann distribution *P*(**x**) ∝ exp[−*E*(**x**)/*kT*]. In our case the conditional path distribution is *P*[**X** | **Y**(*m*)] ∝ exp{−*A*_{O}[**X** | **Y**(*m*)]} exp[−*βA*_{d}(**X**)]. The two situations are not quite the same, because in the second case there is a path-dependent and observation-dependent term *A*_{O}[**X** | **Y**(*m*)] that is not multiplied by *β*.

An automatic procedure for adjusting the Monte Carlo step sizes Δ(*i*) is useful because there are typically many different step sizes that would be hard to tune by hand. It is important to tune the sizes: if they are too small, the proposed path changes will be very small; they will be likely to be accepted, but very many iterations will be required for the path to change significantly. On the other hand, if the step size is too large, most proposed changes will be rejected, and in this case too, many iterations will be required for the path to change significantly. Here there is a different step size for each component of the path, indexed by *i*. The step-size adjustment rule used was

$$\Delta(i) \rightarrow \Delta(i)\left[1 + \alpha\left(\frac{N_{\text{acc},i}}{N_{\text{skip}}} - f_{\text{acc}}\right)\right],$$

where *N*_{acc,i} is the number of accepted changes for component *i* over the past *N*_{skip} iterations, *f*_{acc} is the target acceptance rate, and *α* is a constant that controls the adjustment rate. This rule is applied every *N*_{skip}th iteration and only during the initialization phase. It is important to note that adjusting the step sizes biases the distribution, so samples generated during the adjustment phase should not be used when calculating statistics. The adjustment is run in parallel on the GPU, with one thread assigned to each path component. It is important that each parameter and state variable be tuned independently, because the scales may be very different.
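A sketch of one plausible multiplicative form of such a rule follows; the constants are those quoted in section 9, and the `n_acc` values are made up for illustration.

```python
import numpy as np

def adjust_deltas(delta, n_acc, N_skip, f_acc=0.4, alpha=0.1):
    """Grow delta[i] when component i accepts more often than the target
    rate f_acc, and shrink it when it accepts less often. On the GPU one
    thread would handle each component i."""
    rate = n_acc / N_skip                          # empirical acceptance rate
    return delta * (1.0 + alpha * (rate - f_acc))  # multiplicative update

delta = np.array([0.1, 0.1, 0.875])
n_acc = np.array([400, 100, 200])   # accepted moves in the last N_skip tries
delta = adjust_deltas(delta, n_acc, N_skip=500)
# rates 0.8, 0.2, 0.4 scale the deltas by 1.04, 0.98, and 1.00
```

A component accepting at exactly the target rate is left untouched, so the rule is a fixed point at *f*_{acc}, as desired.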

## 9. Application to the shallow-water equations

In the calculations we present here, we used 200 000 updates of the path according to the general scheme presented above, with *N*_{skip} = 500. The first 100 000 updates were excluded from the statistics on the expectation values; *N*_{initial} = 100 000. Simulated annealing was done over the first *N*_{cool} = 80 000 iterations, with various values for *β*_{0}. The target acceptance rate was *f*_{acc} = 0.4, and the step-size adjustment rate was *α* = 0.1. The initial settings for the Monte Carlo step sizes were Δ(*i*) = 0.1 for the *U*(*i*, *j*), *V*(*i*, *j*) state variables and Δ(*i*) = 0.875 for the *P*(*i*, *j*) state variables. The scaling factor of 8.75 was used because the dynamical range of *P*(*i*, *j*, *t*) was about 8.75 times larger than the dynamical range of *U*(*i*, *j*, *t*) and *V*(*i*, *j*, *t*).

The number of data points used was *m* = 4000, or 40 h of data with a time step of 0.01 h. The matrix **R**_{f} was taken to be a diagonal matrix with 12.8 on the diagonal for the *U*(*i*, *j*), *V*(*i*, *j*) entries and 0.17 on the diagonal for the *P*(*i*, *j*) entries. The matrix **R**_{m} was also set to be a diagonal matrix with 100.0 on the diagonal for the observed *U*(*i*, *j*), *V*(*i*, *j*) entries and 1.0 for the observed *P*(*i*, *j*) entries. The *R*_{m}(*i*, *j*) is 0 for the unobserved locations.

Each run of 200 000 iterations took 12.3 h on an NVIDIA GTX 580, which has 512 CUDA cores. The whole process was repeated 5 times, for a total of 1 000 000 iterations, each time starting from the final path of the previous run. The five runs differ by the initial cooling value *β*_{0}, which was set to 0.0001, 0.1, 0.1, 0.1, and 1.0 in sequence. Only the last 100 000 iterations of the final run were used in the statistics. The runs were repeated until the total action stopped decreasing significantly.

The 10 different observation configurations were all run simultaneously on a CPU–GPU cluster running the Rocks Cluster Distribution software and the Oracle Grid Engine queuing system. The cluster is made up of three compute nodes, each with four NVIDIA GTX 580 GPUs installed, and a head node with two additional GTX 580s, an interface to the user, and a high-bandwidth InfiniBand interconnect. Each of the observation configurations was assigned to a single GPU and was run independently.

We selected different subsets of the 768 fields *φ*_{i,j}(*t*) to evaluate the ability of the path-integral method to accurately estimate the unobserved states during and at the termination of the observation window at *t* = *T* = 40 h and to evaluate the predictions from *T* forward. The predictions were made by using 1% (1000) of the accepted paths during the observation window as initial conditions for the prediction from *t* = 40 to *t* = 100 h. The mean prediction and the RMS variation about that mean are shown in each case. In the left panels of Figs. 5–9, for each selection of observed quantities, we show the expected values along with the RMS errors about the mean, and in the right panels we remove the error bars for clarity of the display. The fixed parameters in the shallow-water equations were not estimated along with the state variables in these calculations.

Our first example uses observations at each value of *υ*(*i*, *j*, *t*) and *h*(*i*, *j*, *t*), and the westernmost grid column of *u*(*i*, *j*, *t*) during the observation period. For this set of observations the data and the model output synchronize. Removing any of the observations from this set leads to a failure of synchronization. The results are displayed in Fig. 5. We show the time series for an observed value of *u*(*i*, *j*, *t*) both in the period 0 ≤ *t* ≤ 40 h and for the prediction window 40 ≤ *t* ≤ 100 h. The estimates are quite accurate, and we conclude from the accuracy of the predictions that all unobserved state variables were accurately estimated using the path-integral method. The predictions are accurate for about twice the inverse largest Lyapunov exponent, which is consistent with the chaotic behavior of the dynamics and the estimation errors at the end of the observation window. An error at time *T* of |Δ**x**(*T*)| grows to the size of the attractor, which we call one here, as |Δ**x**(*t* + *T*)| ≈ |Δ**x**(*T*)|exp[LE × *t*]. We reported above that LE ≈ 1/8.2 h and our estimates at *T* are accurate to a few percent of the known value.
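The quoted prediction horizon can be checked with a one-line estimate, taking the attractor size as 1 (as in the text) and assuming, for illustration, a 3% state-estimation error at the end of the observation window:

```python
import numpy as np

# Back-of-envelope predictability horizon: an initial error eps grows as
# eps * exp(LE * t) and reaches the attractor size (~1) at
# t = ln(1/eps) / LE.
LE = 1.0 / 8.2          # largest Lyapunov exponent, per hour (from the text)
eps = 0.03              # assumed few-percent estimation error at t = T
t_horizon = np.log(1.0 / eps) / LE   # hours after the observation window
```

With these numbers the horizon lands in the 25–30 h range, consistent with the divergence times reported for the predictions below.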

Our second example uses observations of all values of *u*(*i*, *j*, *t*) and all values of *h*(*i*, *j*, *t*) along with the topmost row of values of *υ*(*i*, *j*, *t*). The results are shown in Fig. 6. This is a situation where we have synchronization between the data and the model output. Indeed, as we expect by now, we see that the estimation in this case is quite good, both for the displayed *u*(*i*, *j*, *t*) and, as evidenced by the good prediction, for all the unobserved values of *υ*(*i*, *j*, *t*).

The third example we present, Fig. 7, is an unobserved value of *u*(*i*, *j*, *t*) in the case where we have observed all values of *υ*(*i*, *j*, *t*) and all values of *h*(*i*, *j*, *t*), but no values of *u*(*i*, *j*, *t*). This is a situation where we do not have synchronization of the data with the model output, and we are able to see this both in the observation period, because this is a twin experiment, and in the prediction period, where we can infer from the inadequate prediction that the unobserved states were not well estimated. In a realistic situation where we do not have knowledge of the "unobserved" state variables, that is, not a twin experiment, we could not know that the unobserved variable we selected was badly estimated in the prediction window *t* ≥ 40 h. However, as we know there is no synchronization between the model and itself for this observation scenario, we would expect bad predictions as well.

Our penultimate example is displayed in Fig. 8. It is similar to our second example. Here we display an unobserved value of *υ*(*i*, *j*, *t*) when the observations, all *u*(*i*, *j*, *t*), all *h*(*i*, *j*, *t*), and the topmost row of *υ*(*i*, *j*, *t*), are sufficient to allow synchronization between the data and the model output. As we have synchronization, we expect accurate estimates of the unobserved states at the end of observations *T* = 40 h here, and from that knowledge of the full set of state variables at *T*, we see that good predictions ensue. The chaotic behavior of the orbits exhibits itself again by the divergence of the known (data) and the predicted values of this variable after about 25–30 h after the end of observations.

The final example we display has another message. Here we have observations of all values of *h*(*i*, *j*, *t*) and various values of *u*(*i*, *j*, *t*) and *υ*(*i*, *j*, *t*) but not sufficient or appropriate observations to yield synchronization between the data and the model output. Nonetheless, as we see in Fig. 9 we have excellent estimates of the observed *u*(*i*, *j*, *t*) variable in the observation window; however, we see quite inadequate predictions after the observations cease. This is the result of the inability of the data and the model output to synchronize and efficiently pass information from the data to the model. Estimations of the unobserved state variables apparently fail in this circumstance.

## 10. Discussion

The shallow-water equations (Pedlosky 1987; Gill 1982) are part of many larger models of atmospheric and oceanic flows. We have investigated a single-layer shallow-water equation on a *β* plane, selecting parameters for the model equations that give rise to chaotic flows. In this context we explored a "twin experiment" wherein we generate the data with the shallow-water model on *N* × *N* grid points and then present *L* = 1, 2, …, *D* = 3(*N* × *N*) time series of data to the model. There are three fields for the model: the horizontal velocity [*u*(*x*, *y*, *t*), *υ*(*x*, *y*, *t*)] and the height *h*(*x*, *y*, *t*).

We ask within this twin experiment how well we can estimate the *D* − *L* *unobserved* state variables for a given *L* ≤ *D* observations. The metric for the quality of the estimation is the ability to accurately predict in time beyond the window of observation. We showed that when the *L* observations are sufficient, both in number and in the choice of observables, the data and the model output synchronize, and estimations of the unobserved states at the end of an observation window are accurate; when these time series do not synchronize, the predictions are not accurate. In the latter case, we infer that the representation of the observed and unobserved state variables in the full-state shallow-water model is not accurate at the termination of observations.

Synchronization is examined by the direct use of data from the model itself: we evaluate the RMS deviation between sets of time series produced by the model started from different initial conditions. In our calculations, where *N* = 16, we find for this example that about 70% of the 768 model state variables must be observed. We note that not just any 70% of the variables need be observed: we must observe columns of the *x* velocity and rows of the *y* velocity. This configuration of observations passes information about shears and vorticity in the flow.

If a set of observations does not permit synchronization of data and model output, inaccurate predictions are the result. This statement is important even in settings where one wants an accurate representation of the model state following data assimilation (Wunsch and Heimbach 2007) and is not interested in prediction. If the model does not synchronize with itself with the observations available, then the state of the fluid will not be well represented.

The determination of the number of observations and which observations are required for synchronization of the data and model output can be performed in simulation before any data are available. We suggest based on our work here with the one-layer shallow-water equations that it is important to carry out this inquiry before any confidence in any data assimilation approach can be achieved.

The estimation of *L*_{c} requires the solution of the underlying model equations, here a 16 × 16 grid for a one-layer shallow-water flow, from two different initial conditions. One then compares these solutions as a function of time for a variety of choices of the selected *L* measurements. This computation might be quite challenging for large models, but our analysis of the Lorenz 96 model (Abarbanel et al. 2010) suggests that a pattern may appear that can be extrapolated to finer and finer grids. In that example we found that *L*_{c} ≈ 0.4*D* for the dimension of the Lorenz 96 system ranging from *D* = 5 to *D* = 100. Estimating *L*_{c} on coarse grids at first, then on finer grids, seeking such a pattern in large models may prove much easier and quite interesting. This is not to suggest that the numerical challenges in utilizing these methods for the large models used in numerical weather prediction or complex turbulent flows will be a small extension of either what we have done in this paper or the Lorenz 96 model calculations just cited.

In our work we used Monte Carlo evaluations of the data assimilation path integrals. These are often called smoothers in the literature (Doucet and Johansen 2011), and there are a variety of methods to do the evaluation. The issue of which approach one takes is not central to our argument, and many methods may do (Evensen 2008; Doucet and Johansen 2011). The determination of *L*_{c} is a property of the communication of information from the data to the model, in particular the stability of that channel.

The numerical challenges of using Monte Carlo methods to estimate the state for substantially larger fluid flow problems are likely to be even greater than those of using synchronization to estimate the number of required measurements. One need only consider that operational numerical weather prediction models have on the order of 10^{6} times as many state variables as our small shallow-water example to see that more than porting the ideas to a bigger computer will be required. One can be pleased with the example we have studied in this paper, but much remains to be done.

Attending to the number of required observations before carrying out a data assimilation procedure raises the difficult question of what one does when the number of observations is fewer than the number required for synchronization and accuracy in estimating the unobserved model state variables. We do not have an adequate answer to this question now.

The work presented here on the shallow-water equations includes both slow (geostrophic) and fast (gravity wave) modes within those equations. It would be of substantial interest to carry out the same analysis as presented here for one or more of the quasi- or near-geostrophic equations (Pedlosky 1987; Gill 1982) where the rapid motions have been filtered out from the outset.

## Acknowledgments

Support from the U.S. Department of Energy (Grant DE-SC0002349) and the National Science Foundation (Grants IOS-0905076, IOS-0905030, and PHY-0961153) is gratefully acknowledged. Discussions with Brian Hoskins and Carl Wunsch are also appreciated.

## REFERENCES

*Analysis of Observed Chaotic Data*. Springer-Verlag, 272 pp.

*Oxford Handbook of Nonlinear Filtering,* Oxford University Press, 656–704.

*Data Assimilation: The Ensemble Kalman Filter*. 2nd ed. Springer, 307 pp.

*Transmission of Information: A Statistical Theory of Communications.* The MIT Press, 389 pp.

*Bayesian Data Analysis*, 2nd ed. Chapman and Hall, 290–318.

*Atmosphere–Ocean Dynamics*. International Geophysics Series, Vol. 30, Academic Press, 662 pp.

*Stochastic Processes and Filtering Theory.* Academic Press, 376 pp.

*Atmospheric Modeling, Data Assimilation and Predictability.* Cambridge University Press, 341 pp.

*Nonlinear Time Series Analysis*. 2nd ed. Cambridge University Press, 369 pp.

*The Nonequilibrium Statistical Mechanics of Open and Closed Systems*. John Wiley & Sons, 448 pp.

*Information Theory, Inference, and Learning Algorithms.* Cambridge University Press, 628 pp.

*Geophysical Fluid Dynamics*. 2nd ed. Springer-Verlag, 710 pp.

*Quantum Field Theory and Critical Phenomena*. 4th ed. Oxford University Press, 1054 pp.