## 1. Introduction

In Kalman filters, the accuracy of the estimated error covariances closely depends on the quality of the assumptions about model error and observation error statistics. Inaccurate parameterization may even lead to filter divergence, with error estimates becoming grossly inconsistent with the real error (Maybeck 1979; Daley 1991). To avoid this divergence, one recognized solution (adaptive filtering) is to determine the list of uncertain parameters in the model or observation error statistics, and try to adjust them using the actual differences between forecasts and observations (Daley 1992; Dee 1995; Wahba et al. 1995; Blanchet et al. 1997; Hoang et al. 1997; Wang and Bishop 2003; Lermusiaux 2007; Li et al. 2009). In particular, if the forecast and observation error probability distributions can be assumed Gaussian, a possible solution (proposed by Dee 1995) is to compute the maximum likelihood estimate of the adaptive parameters given the current innovation vector. This strategy is used and further developed in Mitchell and Houtekamer (2000) or Anderson (2007, 2009) in the more specific context of the ensemble Kalman filter. It is also this line of thought that is followed in this study to compute optimal estimates of adaptive statistical parameters.

A major difficulty with this kind of method is that, in general, the computational complexity of the parameter estimation is several times larger than that of the estimation of the system state vector (i.e., than the classic observational update of the Kalman filter). The reason is that, in Kalman filters, the optimal state estimate is linear in the observation vector (of size *y*), whereas the optimal parameter estimate is intrinsically nonlinear in the observation vector, so that the optimal solution must be computed iteratively (for instance using a downhill simplex method to find the maximum of the likelihood function, as in Mitchell and Houtekamer 2000). A first objective of this paper is to show that there exist nonetheless a few important types of parameters for which a maximum likelihood optimal estimate can be computed at a numerical cost that is asymptotically negligible (for large *y*) with respect to that of the standard Kalman filter observational update. Second, taking advantage of this small additional computational complexity, the method is extended to condition the current parameter estimation on the full sequence of past innovations, which amounts to solving an additional (nonlinear) filtering problem for the unknown statistical parameters.

Furthermore, in square root or ensemble Kalman filters, the forecast error covariance matrix is always available in square root form, making it possible to use a modified observational update algorithm [proposed by Pham et al. (1998), as one of the essential elements defining the singular evolutive extended Kalman (SEEK) filter algorithm], whose computational complexity is linear in the number of observations, instead of being cubic as in the standard formula. Originally, this modified algorithm requires that the observation error covariance matrix be diagonal, but solutions exist to preserve its numerical efficiency (linearity in *y*) in the presence of observation error correlations, as shown by Brankart et al. (2009), who also give a detailed comparison of the modified versus the original algorithms. In the present paper, we first show in section 2 how the optimal adaptive filtering problem described above can be formulated in the framework of this modified square root algorithm. It is indeed in this framework that optimal parameter estimates can be computed at negligible additional numerical cost. This is shown in section 3, where the discussion focuses on the few types of parameters for which such computational efficiency is possible. These important parameters are (i) scaling factors for the forecast error covariance matrix, (ii) scaling factors for the observation error covariance matrix, and (iii) scaling factors for the observation error correlation length scale.

In section 4, this adaptive filter is applied to the problem of estimating the evolution of an ocean mesoscale signal using observations of the ocean dynamic topography. To demonstrate the behavior of the adaptive mechanism, idealized experiments are performed, in which the reference signal (the truth of the problem) is generated by a primitive equation ocean model and sampled to produce synthetic observations with known error statistics. In that way, it is possible to check that the method is able to produce accurate parameter estimates and to explore the ill-conditioned situations (inappropriate prior assumptions or uncontrollability of the parameters) in which adaptivity can be misleading.

## 2. Formulation of the problem

### a. Nonadaptive statistics

We consider the problem of estimating the evolution of a state vector **x**(*t*), between times *t*_{0} and *t*_{N+1}, given a set of observation vectors **y**_{k} at times *t*_{k}, *k* = 1, … , *N* (*t*_{k} < *t*_{k+1}):

**y**_{k} = 𝗛_{k}**x**_{k} + *ϵ*_{k},  (1)

where **x**_{k} = **x**(*t*_{k}), 𝗛_{k}, and *ϵ*_{k} are the state vector, the observation operator, and the observational error at time *t*_{k}, respectively. We also assume that we have information on the initial condition **x**_{0} = **x**(*t*_{0}) and optionally on dynamical laws governing the time evolution of **x**(*t*). In many situations, it is useful to solve the filtering problem, in which the estimation at time *t*_{k} is computed using only past information. This means that the information only needs to be propagated forward in time from *t*_{0} to *t*_{N+1}, with discrete updates each time an observation vector is available. In a probabilistic framework, this amounts to computing sequentially the following probability distributions:

*p*_{0}(**x**_{0}) → *p*_{1}^{f}(**x**_{1}) → *p*_{1}^{a}(**x**_{1}) → ⋯ → *p*_{N}^{f}(**x**_{N}) → *p*_{N}^{a}(**x**_{N}) → *p*_{N+1}(**x**_{N+1}),  (2)

where *p*_{0}(**x**_{0}) and *p*_{N+1}(**x**_{N+1}) are the initial and final probability distributions, and where *p*_{k}^{f}(**x**_{k}) and *p*_{k}^{a}(**x**_{k}) are the probability distributions at time *t*_{k} before and after the observation vector **y**_{k} is taken into account (superscripts "*f*" and "*a*" stand for "forecast" and "analysis"). Since we solve a filtering problem, it is implicit that every probability distribution is conditioned on the past observations [i.e., **y**_{k′} with *k*′ < *k* for *p*_{k}^{f}(**x**_{k}), and **y**_{k′} with *k*′ ≤ *k* for *p*_{k}^{a}(**x**_{k})].

If all these probability distributions are assumed Gaussian,

*p*_{k}^{f}(**x**_{k}) = 𝒩(**x**_{k}^{f}, 𝗣_{k}^{f})  and  *p*_{k}^{a}(**x**_{k}) = 𝒩(**x**_{k}^{a}, 𝗣_{k}^{a}),  (3)

where **x**_{k}^{f} and **x**_{k}^{a} are the expected forecast and analysis state vectors at time *t*_{k}, respectively, and where 𝗣_{k}^{f} and 𝗣_{k}^{a} are the corresponding forecast and analysis error covariance matrices, respectively. They are computed by repeating the following two steps in sequence from *k* = 1 to *k* = *N*. The forecast step computes **x**_{k}^{f} and 𝗣_{k}^{f} from **x**_{k−1}^{a} and 𝗣_{k−1}^{a}, using the available information on the time dependence between **x**_{k} and **x**_{k−1} (as expressed, e.g., by approximate dynamical laws or by a time decorrelation model). The analysis step (or observational update) computes **x**_{k}^{a} and 𝗣_{k}^{a} from **x**_{k}^{f} and 𝗣_{k}^{f} by conditioning the prior distribution *p*_{k}^{f}(**x**_{k}) on the observation vector **y**_{k} using Bayes' theorem: *p*_{k}^{a}(**x**_{k}) ∼ *p*_{k}^{f}(**x**_{k}) *p*(**y**_{k}|**x**_{k}). It is well known that the observational update preserves Gaussianity as soon as *p*(**y**_{k}|**x**_{k}) is Gaussian: *p*(**y**_{k}|**x**_{k}) = 𝒩(𝗛_{k}**x**_{k}, 𝗥_{k}), where 𝗥_{k} = 〈*ϵ*_{k}*ϵ*_{k}^{T}〉 is the observation error covariance matrix, and that **x**_{k}^{a} and 𝗣_{k}^{a} can then be computed by the classic linear observational update formulas:

**x**_{k}^{a} = **x**_{k}^{f} + 𝗞_{k}**d**_{k},  with  **d**_{k} = **y**_{k} − 𝗛_{k}**x**_{k}^{f},  (4)

𝗣_{k}^{a} = (𝗜 − 𝗞_{k}𝗛_{k})𝗣_{k}^{f},  with  𝗞_{k} = 𝗣_{k}^{f}𝗛_{k}^{T}(𝗛_{k}𝗣_{k}^{f}𝗛_{k}^{T} + 𝗥_{k})^{−1},  (5)

where **d**_{k} is the innovation vector and 𝗞_{k} is the Kalman gain. It is useful to remark that the Gaussian parameters **x**_{k}^{f} and **x**_{k}^{a}, *k* = 1, … , *N* (and the innovations **d**_{k} as well) functionally depend on the sequence of observation vectors **y**_{k}, *k* = 1, … , *N* (even if, in practice, **x**_{k}^{f} and **x**_{k}^{a} are usually directly determined from the actual value of the observations: **y**_{k} = **y**_{k}^{o}). Conversely, the Gaussian parameters 𝗣_{k}^{f} and 𝗣_{k}^{a} do not depend on the observations **y**_{k} (but on the observation operator 𝗛_{k} and on the observation error covariance matrix 𝗥_{k}), and as long as all input covariance matrices are assumed to be error free, the resulting 𝗣_{k}^{f} and 𝗣_{k}^{a} are also known matrices, which do not need to be inferred from the observations. This is precisely what is going to be changed in section 2b.

In square root or ensemble Kalman filters, the forecast error covariance matrices 𝗣_{k}^{f} are available in square root form:

𝗣_{k}^{f} = 𝗦_{k}^{f}(𝗦_{k}^{f})^{T},  (6)

which makes it possible to rewrite the observational update by transforming the vectors **x**_{k} and **y**_{k} into new vectors *ξ*_{k} and *η*_{k} [the transformation in (7)]. In this transformed space, *δ*_{k} is the projection of the innovation vector **d**_{k} on the square root of the forecast error covariance matrix 𝗛_{k}𝗦_{k}^{f} (with metric 𝗥_{k}^{−1}):

*δ*_{k} = (𝗛_{k}𝗦_{k}^{f})^{T}𝗥_{k}^{−1}**d**_{k},  (8)

and 𝗨_{k} (unitary matrix) and **Λ**_{k} (diagonal matrix) are the matrices with the normalized eigenvectors and inverse eigenvalues of the matrix **Γ**_{k} defined by Eq. (9). With this transformation, the distributions *p*_{k}^{f}(**x**_{k}) and *p*(**y**_{k}|**x**_{k}) transform to Gaussian distributions with diagonal covariance matrices [Eqs. (10) and (11)], and the analysis estimates **x**_{k}^{a} and 𝗣_{k}^{a} are recovered by inverting the transformation in (7). In the transformed space, the observational updates of the components of the *ξ*_{k} vector are thus independent [all covariance matrices in (10) and (11) are diagonal].

The computational complexity of the observational update is also structurally modified by the transformation. In the linear observational update algorithm, the computational cost mainly results from the dependence between the vector components. In the presence of correlation, weighting the forecast and observational information optimally indeed requires the inversion of a full matrix (with computational complexity proportional to the cube of the size of the matrix). The main difference introduced by the transformation is thus that, in Eq. (5), the inversion is performed in the observation space, while in Eq. (9), it is performed in the state space [the cost of the inversion of 𝗥_{k} is here assumed negligible, as shown by Brankart et al. (2009), for large classes of observation error correlation models]. Moreover, if 𝗣_{k}^{f} is rank deficient, the size of the matrix **Γ**_{k} in (9) is not the size of the state vector, but the rank *r* of 𝗣_{k}^{f}, which is given by the number of independent columns in 𝗦_{k}^{f} (i.e., the error modes or the ensemble members). For such problems, the transformation in (7) is here introduced as a simple way of obtaining directly the reduced observational update formulas, which are otherwise deduced from (4) and (5) using the Sherman–Morrison–Woodbury formula (Pham et al. 1998). This transformed algorithm is particularly efficient for reduced-rank problems involving many observations, which are quite common in atmospheric and oceanic applications of square root or ensemble Kalman filters. The key property of the algorithm is indeed that its computational complexity is *linear* in the number *y* of observations, making it possible to deal efficiently with large observation vectors [see Brankart et al. (2009) for a detailed comparison of the computational complexity of the transformed versus the original algorithm].
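The structure of this reduced-rank update can be sketched in a few lines of linear algebra. The following is an illustrative sketch (not the authors' code), assuming the SEEK-type form in which the only matrix to invert is the *r* × *r* matrix **Γ** = 𝗜 + (𝗛𝗦)^{T}𝗥^{−1}(𝗛𝗦), with a diagonal 𝗥; all names are chosen for illustration:

```python
import numpy as np

def reduced_update(xf, S, H, R_diag, y):
    """Reduced-rank (SEEK-type) observational update sketch.

    xf: forecast state (n,); S: square root of Pf (n, r);
    H: observation operator (p, n); R_diag: diagonal of R (p,);
    y: observations (p,).  The cost is linear in p = len(y):
    the only matrix factorized is the r x r matrix Gamma.
    """
    d = y - H @ xf                       # innovation vector d_k
    HS = H @ S                           # (p, r): error modes seen by the observations
    delta = HS.T @ (d / R_diag)          # projection of d_k with metric R^{-1}, Eq. (8)
    Gamma = np.eye(S.shape[1]) + HS.T @ (HS / R_diag[:, None])
    w = np.linalg.solve(Gamma, delta)    # r x r solve, independent of p
    xa = xf + S @ w                      # analysis state
    Pa = S @ np.linalg.solve(Gamma, S.T)  # analysis covariance Pa = S Gamma^{-1} S^T
    return xa, Pa, d, delta, Gamma
```

By the Sherman–Morrison–Woodbury formula, this gives exactly the same analysis as the classic formulas (4) and (5) applied with 𝗣^{f} = 𝗦𝗦^{T}, while never forming a *y* × *y* matrix.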

### b. Adaptive statistics

In the filtering problem described in section 2a, optimal observational updates can only be obtained if the forecast and observation error covariance matrices 𝗣_{k}^{f} and 𝗥_{k} are accurate. However, in realistic atmospheric or oceanic applications, both matrices depend on inaccurate parameters. On the one hand, in the forecast step, modeling the time dependence of the errors usually requires questionable parameterizations (e.g., to account for errors in the dynamical laws governing the time evolution of the system). On the other hand, accurate observation error statistics are not always available. This is especially true for representation errors [differences between the truth of the problem and the real world; see Cohn (1997)], which can occur for instance if the state vector of the problem only contains a subrange of the scales that are present in the real system. In addition, the computation of the covariance matrices 𝗣_{k}^{f} and 𝗣_{k}^{a} never involves the observations **y**_{k} themselves, so that no feedback using differences with respect to the real world is possible; consequently, any inaccuracy in the parameterization of the observation or system noises can lead to instability of the error statistics produced by the filter. This is a well-known effect in Kalman filters, which is usually circumvented (in atmospheric and oceanic applications) by adjusting uncertain parameters in the forecast or observation error covariance matrices using statistics (variance or covariance) of the innovation sequence (adaptive Kalman filters). See for instance Dee (1995) for a more precise justification.

In order to account for these uncertainties, adaptive parameters *α*_{k} and *β*_{k} are introduced in the description of the forecast and observation error covariance matrices 𝗣_{k}^{f}(*α*_{k}) and 𝗥_{k}(*β*_{k}), with probability distributions *p*_{k}^{f}(*α*_{k}, *β*_{k}) and *p*_{k}^{a}(*α*_{k}, *β*_{k}) before and after their update using observations **y**_{k}. Thus, *α*_{k} and *β*_{k} are additional random vectors that must be estimated from the observational information: they are additional degrees of freedom that are introduced in the estimation problem to account for possible uncertainty in the Gaussian error covariances. In principle, assuming that the parameters *α*_{k} are uncertain transforms the probability distribution *p*_{k}^{f}(**x**_{k}) into the integral

*p*_{k}^{f}(**x**_{k}) = ∫ *p*_{k}^{f}(**x**_{k}|*α*_{k}) *p*_{k}^{a}(*α*_{k}, *β*_{k}) d*α*_{k} d*β*_{k},  (13)

where *p*_{k}^{f}(**x**_{k}|*α*_{k}) is the Gaussian probability distribution of **x**_{k} that is assumed in section 2a if the vector of parameters *α*_{k} is known. In this equation, we use the updated parameter probability distribution *p*_{k}^{a}(*α*_{k}, *β*_{k}) for the parameters to express that all available information before and including time *t*_{k} is taken into account to estimate the uncertain parameters in the prior state probability distribution *p*_{k}^{f}(**x**_{k}); that is, the observational update of the parameter probability distribution is performed before the observational update of the state probability distribution. However, in (13), which explicitly takes into account uncertainties in the parameters, *p*_{k}^{f}(**x**_{k}) is no longer Gaussian in general, so that the filter equations described in section 2a do not apply anymore. An approximation is needed. To close the problem and be able to compute an explicit solution both for the state of the system and for the parameters, the central assumption is that the forecast probability distribution *p*_{k}^{f}(**x**_{k}) is still Gaussian (as in section 2a), but with covariances corresponding to the current best estimate *α**_{k} of the parameters:

*p*_{k}^{f}(**x**_{k}) ≈ *p*_{k}^{f}(**x**_{k}|*α**_{k}),  (14)

together with the analogous assumption for the observation error distribution, with covariance matrix 𝗥_{k}(*β**_{k}):

*p*(**y**_{k}|**x**_{k}) ≈ *p*(**y**_{k}|**x**_{k}, *β**_{k}).  (15)

The problem is thus closed, provided that the best parameter estimates *α**_{k} and *β**_{k} can be obtained.

The best estimates *α**_{k} and *β**_{k} are computed by solving an additional filtering problem for the parameters, which amounts to sequentially computing the following probability distributions (which are not Gaussian in general):

*p*_{0}(*α*, *β*) → *p*_{1}^{f}(*α*_{1}, *β*_{1}) → *p*_{1}^{a}(*α*_{1}, *β*_{1}) → ⋯ → *p*_{N}^{f}(*α*_{N}, *β*_{N}) → *p*_{N}^{a}(*α*_{N}, *β*_{N}).  (16)

The forecast step computes *p*_{k}^{f}(*α*_{k}, *β*_{k}) from *p*_{k−1}^{a}(*α*_{k−1}, *β*_{k−1}) by exploiting a prior knowledge *p*(*α*_{k}, *β*_{k}|*α*_{k−1}, *β*_{k−1}) about the time dependence of the parameter values (which must be specified; see section 3):

*p*_{k}^{f}(*α*_{k}, *β*_{k}) = ∫ *p*(*α*_{k}, *β*_{k}|*α*_{k−1}, *β*_{k−1}) *p*_{k−1}^{a}(*α*_{k−1}, *β*_{k−1}) d*α*_{k−1} d*β*_{k−1},  (17)

and the analysis step computes *p*_{k}^{a}(*α*_{k}, *β*_{k}) from *p*_{k}^{f}(*α*_{k}, *β*_{k}) by conditioning the prior distribution *p*_{k}^{f}(*α*_{k}, *β*_{k}) on the innovation vector **d**_{k} using the Bayes theorem:

*p*_{k}^{a}(*α*_{k}, *β*_{k}) ∼ *p*_{k}^{f}(*α*_{k}, *β*_{k}) *p*(**d**_{k}|*α*_{k}, *β*_{k}).  (18)

In this update, it is assumed that the current innovation **d**_{k} contains information about the parameters that is independent from that contained in the previous innovations, which is already included in *p*_{k}^{f}(*α*_{k}, *β*_{k}). As in the state filtering problem, this can be justified by the hypothesis that model errors and observation errors (which the parameters *α*_{k} and *β*_{k} are meant to model) are independent for times *t*_{k} and *t*_{k′} (*k* ≠ *k*′). Furthermore, since *p*_{k}^{f}(**x**_{k}|*α*_{k}) and *p*(**y**_{k}|**x**_{k}, *β*_{k}) are the Gaussian distributions given in section 2a, the probability distribution *p*(**d**_{k}|*α*_{k}, *β*_{k}) of the innovation vector **d**_{k} is also a Gaussian distribution:

*p*(**d**_{k}|*α*_{k}, *β*_{k}) = 𝒩[0, 𝗖_{k}(*α*_{k}, *β*_{k})],  with  𝗖_{k}(*α*_{k}, *β*_{k}) = 𝗛_{k}𝗣_{k}^{f}(*α*_{k})𝗛_{k}^{T} + 𝗥_{k}(*β*_{k}),  (19)

so that the best estimates *α**_{k} and *β**_{k} can be deduced from the (*α*_{k}, *β*_{k}) distribution defined by Eq. (18). It is then the purpose of section 3 to describe how the probability distributions sequence in (16) can be computed efficiently from (17) and (18) in practical applications and how to deduce from them the best estimates *α**_{k} and *β**_{k} to use for the state observational update at time *t*_{k}.
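The two-step parameter filter described above can be made concrete on a discrete grid of parameter values. The sketch below is illustrative only: it uses a Gaussian innovation likelihood with covariance *α*𝗛𝗣^{f}𝗛^{T} + *β*𝗥 (the form of the innovation covariance in (19) for simple scaling parameters), and a forecast step that flattens the density by exponentiation, a stand-in consistent with the diffusion model later introduced in section 3b; all function names are hypothetical:

```python
import numpy as np

def innovation_loglik(d, HPfHt, R, alpha, beta):
    """ln p(d | alpha, beta): zero-mean Gaussian with C = alpha*HPfHt + beta*R."""
    C = alpha * HPfHt + beta * R
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + d @ np.linalg.solve(C, d))

def parameter_filter_step(prior, grid_a, grid_b, d, HPfHt, R, f=0.9):
    """One cycle of the parameter filter (17)-(18) on a (alpha, beta) grid:
    forecast by raising the density to the power f (diffusion toward flatness),
    then Bayes update with the innovation likelihood p(d | alpha, beta)."""
    pf = prior ** f
    pf /= pf.sum()                       # forecast density on the grid
    loglik = np.array([[innovation_loglik(d, HPfHt, R, a, b) for b in grid_b]
                       for a in grid_a])
    post = pf * np.exp(loglik - loglik.max())   # Bayes theorem, Eq. (18)
    return post / post.sum()
```

The mode or mean of the returned grid density then plays the role of the best estimates *α**_{k} and *β**_{k}.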

This joint state and parameter estimation problem can even be solved without the assumptions in (14) and (15) as soon as the state observational update is performed by a separate application of Eq. (4) for every member of the ensemble [as proposed by Evensen and van Leeuwen (1996) for the ensemble Kalman filter]. In that case, the prior non-Gaussian forecast probability distribution given by the integral in (13) can be simulated by randomly drawing a different parameter vector from *p*_{k}^{a}(*α*_{k}, *β*_{k}) to perform the update of each member. With that scheme, uncertainty in the parameters of the prior distribution can thus be explicitly taken into account, with asymptotic convergence of the prior distribution to the integral in (13) for large ensemble size. However, using (13) instead of (14) may not be appropriate if *p*_{k}^{a}(*α*_{k}, *β*_{k}) is not very accurate, for instance if the dispersion of the parameters (second-order moment) is not correctly simulated. Inaccuracy of the second-order moment is indeed the very reason why adaptivity is needed in the filter in (2) for the system state vector. Since this cannot be repeated for the parameter filter in (16), the use of the best parameter estimates *α**_{k} and *β**_{k} with Eqs. (14) and (15) can also be viewed as a closure that prevents inaccuracies in the second-order moments of *p*_{k}^{a}(*α*_{k}, *β*_{k}) from directly affecting the observational update of the state vector.

## 3. Efficient adaptive parameter estimates

### a. Constant parameters

If the parameters *α*_{k} and *β*_{k} are assumed constant in time, then *p*(*α*_{k}, *β*_{k}|*α*_{k−1}, *β*_{k−1}) = *δ*(*α*_{k} − *α*_{k−1}, *β*_{k} − *β*_{k−1}), so that the forecast step in (17) reduces to the identity *p*_{k}^{f}(*α*_{k}, *β*_{k}) = *p*_{k−1}^{a}(*α*_{k−1}, *β*_{k−1}). It is then unnecessary to attach the time index *k* to the parameter vectors, and all probability distributions of the sequence in (16) can be obtained by a recursive application of (18):

*p*_{k}^{a}(**α**, **β**) ∼ *p*_{0}(**α**, **β**) *L*_{k}(**α**, **β**),  with  *L*_{k}(**α**, **β**) = ∏_{k′=1}^{k} *p*(**d**_{k′}|**α**, **β**),  (21)

where *L*_{k}(**α**, **β**) is the likelihood function (as defined, e.g., in Von Mises 1964). From this expression, the best estimates **α***_{k} and **β***_{k} of the parameters **α** and **β** at time *t*_{k} can be obtained as the mean (minimum variance estimator) or the mode (maximum probability estimator) of *p*_{k}^{a}(**α**, **β**). In this study, we only consider the mode of (21), which minimizes the cost function:

*J*_{k}(**α**, **β**) = −ln *p*_{0}(**α**, **β**) + ½ ∑_{k′=1}^{k} [ln det 𝗖_{k′}(**α**, **β**) + **d**_{k′}^{T}𝗖_{k′}^{−1}(**α**, **β**)**d**_{k′}] + Cst,  (22)

where the Gaussian expression for *p*(**d**_{k′}|**α**, **β**) in (19) is explicitly introduced, and Cst is an arbitrary constant.
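The cost function in (22) can be evaluated directly, before any of the reductions of sections 3c–f are introduced. The following is an illustrative sketch, assuming simple scaling parameters so that the innovation covariance takes the form 𝗖_{k} = *α*𝗛𝗣_{k}^{f}𝗛^{T} + *β*𝗥_{k}; the function names and the `neglog_prior` callable are hypothetical:

```python
import numpy as np

def cost_J(alpha, beta, innovations, HPfHts, Rs, neglog_prior):
    """Cost function of Eq. (22) for constant parameters (illustrative form):
    J = -ln p0(alpha, beta) + 0.5 * sum_k' [ln det C_k' + d_k'^T C_k'^{-1} d_k'],
    with C_k' = alpha * H P_k'^f H^T + beta * R_k'."""
    J = neglog_prior(alpha, beta)
    for d, HPfHt, R in zip(innovations, HPfHts, Rs):
        C = alpha * HPfHt + beta * R
        sign, logdet = np.linalg.slogdet(C)
        J += 0.5 * (logdet + d @ np.linalg.solve(C, d))
    return J
```

This brute-force evaluation is exactly what costs *pKy*^{3} operations in the analysis of section 3c, since each term requires the determinant and a solve with the full *y* × *y* matrix 𝗖_{k′}.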

### b. Nonconstant parameters

If the parameters *α*_{k} and *β*_{k} are not assumed constant in time, it is necessary to introduce a prior knowledge *p*(*α*_{k}, *β*_{k}|*α*_{k−1}, *β*_{k−1}) about the time dependence of the parameters in order to perform the forecast step in (17). In this study, it is assumed that the effect of time is only to diffuse the parameter probability densities from high-probability regions to low-probability regions, according to the simple model in (23), in which the previous analysis density is raised to the power *f* (with 0 < *f* ≤ 1) and renormalized. For a Gaussian distribution, the exponentiation by *f* simply corresponds to multiplying the covariance by a factor 1/*f*, which means that the effect of time is simply to increase the error variance on the previous estimate by a factor 1/*f*. We call *f* the "forgetting exponent" because it corresponds to the exponential rate at which old information must be forgotten in the estimation of the parameters. The cost function in (22) indeed transforms to (24), in which the contribution of the innovation at time *t*_{k′} is weighted by the exponentially decreasing factor *f*^{k−k′}. The *e*-folding forgetting time scale is thus *k*^{e} = (ln 1/*f*)^{−1}. Using a very small *f* exponent (*f* → 0) means that only the last innovation is used to estimate the current parameters. This particular case exactly corresponds to the solution proposed by Dee (1995).

In addition, the solution of Dee (1995) is written without the first term in the cost function in (22) or (24), which corresponds to computing the maximum likelihood estimator of the parameters (as defined, e.g., in Von Mises 1964, chapter 10). The parameters **α***_{k} and **β***_{k} are then said to maximize the likelihood of the observed innovation sequence (i.e., the conditional probability of the innovation sequence for given parameters). This is useful in the absence of reliable prior information on the parameters. With the parameterization in (23), this initial information is anyway progressively forgotten with time, as shown by the exponentially decreasing factor *f*^{k} in the cost function in (24).
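The exponentially weighted sum that appears in the cost function in (24) never needs to be recomputed from scratch: a simple recursion reproduces the forgetting behavior. A minimal sketch (illustrative names, terms standing for the per-innovation contributions to the cost):

```python
def forgetful_sum(terms, f):
    """Recursive accumulation A_k = f * A_{k-1} + l_k, which equals the
    exponentially weighted sum  sum_{k'<=k} f^(k-k') * l_k'  used in Eq. (24).
    With f -> 0 only the last term survives (the Dee 1995 limit); with f = 1
    all terms are weighted equally (constant parameters, Eq. (22))."""
    A = 0.0
    for term in terms:
        A = f * A + term
    return A
```

The *e*-folding forgetting time scale of this recursion is *k*^{e} = (ln 1/*f*)^{−1}, as stated above.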

### c. Evaluation of the cost function

Minimizing the cost function (22) or (24) with respect to the parameters requires the application of an iterative method, and thus the possibility of evaluating *J*_{k} for the successive iterates of the parameters **α**, **β** (for the sake of simplicity, subscript *k* is removed for the parameters; it is now implicit that they are computed for the current cycle *k*). The main difficulty with the expressions in (22) or (24) is that the evaluation of *J*_{k} requires the computation of the inverse and determinant of the covariance matrix 𝗖_{k′}(**α**, **β**) for the full sequence *k*′ ≤ *k* of previous innovation vectors, and this is needed for all successive iterates of **α**, **β** (let *p* be the number of iterates needed to reach the minimum with sufficient accuracy). Even if we truncate the innovation sequence to the *K* last innovations, this corresponds to a computational complexity proportional to *pKy*^{3} (leading behavior for large *y*), just to compute the optimal parameters **α*** and **β*** at time *t*_{k} [i.e., a factor *pK* with respect to the computational complexity of the observational update (4) and (5), or more precisely, with respect to the leading component (proportional to *y*^{3}) of this computational complexity for large observation vectors]. This large computational cost explains why using *K* = 1 (as in Dee 1995) is the only affordable solution (with this classic observational update algorithm) to compute optimal adaptive parameters in realistic atmospheric or oceanic assimilation systems. But, even in this special case, the computational complexity of the parameter estimation is still a factor *p* larger than that of the estimation of the state vector with Eqs. (4) and (5).

The solution proposed here is to evaluate the cost function using the reduced innovation vectors *δ*_{k} (size *r*) defined by Eq. (8); that is, by exploiting the transformation in (7) that transports the inversion problem from the observation space to the reduced-dimension space defined by the square root or ensemble representation of the forecast error covariance matrix 𝗣_{k}^{f}(**α**). For that purpose, assume first that we know the reduced innovation vectors *δ*_{k′}(**α**, **β**) and the matrices **Γ**_{k′}(**α**, **β**), for *k*′ ≤ *k*, defined by Eqs. (8) and (9) as a function of the parameters **α**, **β**, together with the corresponding eigenvalues **Λ**_{k′}(**α**, **β**) and eigenvectors 𝗨_{k′}(**α**, **β**) as defined by Eq. (9). From this, we need to compute two kinds of terms in the cost function: the determinant terms ln det 𝗖_{k′}(**α**, **β**) and the quadratic terms **d**_{k′}^{T}𝗖_{k′}^{−1}(**α**, **β**)**d**_{k′}. Using the determinants of 𝗥_{k′}(**β**) and **Γ**_{k′}, we obtain the determinant of 𝗖_{k′} as in (28), from the product of the eigenvalues in **Λ**_{k′}; and the quadratic term can be expressed as in (31), from **d**_{k′}^{T}𝗥_{k′}^{−1}**d**_{k′} and the projections of *δ*_{k′} on the eigenvectors in 𝗨_{k′}, so that the explicit inversion of 𝗖_{k′} is never required. Computing the inverse and determinant of the large matrix 𝗖_{k′} is a classic difficulty in many problems involving the estimation of Gaussian parameters, which is here circumvented by introducing the diagonal covariance matrix **Λ**_{k} instead of 𝗖_{k} using the transformation in (7).

Yet, the transformed Eqs. (28) and (31) are still not a solution by themselves to efficiently compute the cost function as a function of the parameters, since the computation of *δ*_{k′}(**α**, **β**), **Γ**_{k′}(**α**, **β**), and then **Λ**_{k′}(**α**, **β**) and 𝗨_{k′}(**α**, **β**) is required for the full sequence of previous innovation vectors *k*′ ≤ *k*, and for all successive iterates of (**α**, **β**) (needed to minimize the cost function); that is, there is still a factor *pK* with respect to the computational complexity of the state observational update (8) and (9), or more precisely, with respect to the leading component (proportional to *y*) of this computational complexity for large observation vectors. However, with the transformed equations in (28) and (31), there are classes of parameters **α**, **β** for which the additional computational complexity of the cost function minimization is either negligible with respect to the state observational update, or at least independent of the number *y* of observations. These classes of parameters are presented in sections 3d–f. Such simplifications are not possible with the original expressions in (22) or (24), because the matrix 𝗖_{k}(**α**, **β**) is the sum of two matrices: one depending on **α** and the other on **β**. The inverse and determinant must therefore be computed explicitly for every iteration of **α** and **β**.
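The two reductions that make (28) and (31) cheap can be checked numerically. Assuming the SEEK-type form **Γ** = 𝗜 + (𝗛𝗦)^{T}𝗥^{−1}(𝗛𝗦) with diagonal 𝗥 (a sketch, not necessarily the paper's exact conventions), one has ln det 𝗖 = ln det 𝗥 + ln det **Γ** and **d**^{T}𝗖^{−1}**d** = **d**^{T}𝗥^{−1}**d** − *δ*^{T}**Γ**^{−1}*δ*, with *δ* = (𝗛𝗦)^{T}𝗥^{−1}**d**:

```python
import numpy as np

def reduced_cost_terms(d, HS, R_diag):
    """Evaluate ln det C and d^T C^{-1} d for C = HS HS^T + diag(R_diag)
    without forming or inverting C: only the r x r matrix Gamma is decomposed
    (Sylvester's determinant identity and the Woodbury formula)."""
    delta = HS.T @ (d / R_diag)                               # Eq. (8)
    Gamma = np.eye(HS.shape[1]) + HS.T @ (HS / R_diag[:, None])
    evals, U = np.linalg.eigh(Gamma)
    proj = U.T @ delta                   # projections on the eigenvectors
    logdet_C = np.log(R_diag).sum() + np.log(evals).sum()
    quad = d @ (d / R_diag) - (proj ** 2 / evals).sum()
    return logdet_C, quad
```

Both terms are obtained with a cost independent of the number of observations once *δ* and **Γ** are known, which is the key to the efficient parameter classes of sections 3d–f.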

### d. Scaling of the forecast error covariance matrix

In a first class of adaptive schemes, a scaling factor *α* > 0 is introduced at each time *t*_{k} to rescale the forecast error covariance matrix, 𝗣_{k}^{f} = *α*𝗣̃_{k}^{f}, where 𝗣̃_{k}^{f} is the matrix that is normally produced by the filter. [An efficient method to compute optimal estimates of this scaling factor is also proposed by Anderson (2007, 2009).] This can be done for instance to compensate for a possible collapse of the ensemble forecast that can result from an insufficient system noise parameterization or the inadequacy of the Gaussian approximation. Then, if *α* is the only parameter to estimate, the cost functions in (28) and (31) reduce to (32), where *δ̃*_{k′}, **Λ̃**_{k′}, and 𝗨̃_{k′} (computed with *α* = 1) are already available from the state observational update performed at time *t*_{k′}, so that the computational complexity of (32) is negligible (it is proportional to *r*, because the necessary products are already available), and the optimal *α** can always be obtained without significant additional cost. Moreover, this can be directly generalized to a vector of parameters, scaling separately the eigenmodes *λ*_{l,k′}, *l* = 1, … , *r*. This can be done for instance by parameterizing a function *α*(*λ*) giving the scaling factor *α* as a function of the relative importance of each mode in the square root or ensemble representation of 𝗣_{k}^{f}.

However, when estimating the inflation parameter *α*, it is important to keep in mind that nothing prevents the maximum likelihood estimator of *α* from being negative, which means that the innovations are too small to be explained by the observation error covariance 𝗥_{k′} alone, so that the adaptive scheme is attempting to reduce 𝗖_{k′}(*α*) by subtracting a multiple of 𝗛_{k′}𝗣̃_{k′}^{f}𝗛_{k′}^{T}. This usually results from an incorrect parameterization of the observation error covariance matrix, which underestimates the accuracy of the observations. This problem cannot occur with the maximum probability estimator, since the prior probability distribution *p*_{0}(*α*) is equal to zero for *α* ≤ 0, so that the optimization always provides a positive value for *α**. In this case, however, the optimal value is closely related to the behavior of *p*_{0}(*α*) near zero and no longer to the statistics of the innovation sequence. This means that the adaptive scheme ascribes undeserved accuracy to the forecast (i.e., a small scaling factor *α**) to compensate for an incorrect parameterization of the observation errors. This phenomenon always results in further underestimating the weight of the observations in the assimilation system, and it can only be avoided either by improving the parameterization of the observation errors or by including observation error parameters **β** in the list of adaptive parameters (see sections 3e,f).
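The reduction that makes the estimation of *α* nearly free can be made concrete. With 𝗣^{f} = *α*𝗣̃^{f}, the matrix **Γ**(*α*) = 𝗜 + *α*(**Γ̃** − 𝗜) shares its eigenvectors with **Γ̃** computed at *α* = 1, so *J*(*α*) is a cheap scalar function of quantities already available from the update. The sketch below (illustrative names; SEEK-type conventions as assumed earlier, with diagonal 𝗥) evaluates the cost and minimizes it by a simple grid search:

```python
import numpy as np

def cost_alpha(alpha, d, HS, R_diag):
    """J(alpha) (up to a constant) for C(alpha) = alpha * HS HS^T + R, using
    only the eigenvalues g_l of Gamma~ = I + HS^T R^{-1} HS and the projections
    of delta~ on its eigenvectors (in the filter these would be stored from
    the update at alpha = 1, so the per-alpha cost is proportional to r)."""
    delta = HS.T @ (d / R_diag)
    Gamma = np.eye(HS.shape[1]) + HS.T @ (HS / R_diag[:, None])
    g, U = np.linalg.eigh(Gamma)
    proj2 = (U.T @ delta) ** 2
    ev = 1.0 + alpha * (g - 1.0)             # eigenvalues of Gamma(alpha)
    logdet_C = np.log(R_diag).sum() + np.log(ev).sum()
    quad = d @ (d / R_diag) - alpha * (proj2 / ev).sum()
    return 0.5 * (logdet_C + quad)

def best_alpha(d, HS, R_diag, grid=np.linspace(0.1, 5.0, 200)):
    """Mode of the likelihood over a positive grid (a stand-in for the
    iterative minimization of the text)."""
    return grid[np.argmin([cost_alpha(a, d, HS, R_diag) for a in grid])]
```

Restricting the grid to positive values plays the role of the positive prior *p*_{0}(*α*) discussed above.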

### e. Scaling of the observation error covariance matrix

In a second class of adaptive schemes, a scaling factor *β* is introduced at each time *t*_{k} to rescale the observation error covariance matrix, 𝗥_{k} = *β*𝗥̃_{k}, where 𝗥̃_{k} is the default prior estimate for this matrix. This can be useful to adjust inaccurate observation error statistics resulting in particular from the unknown amplitude of representation errors. Thus, if we also keep the parameter *α* (introduced in section 3d) in the control vector, Eqs. (28) and (31) reduce to (33) and (34), where *y*_{k′} is the number of observations at time *t*_{k′}. Again, the computational complexity of (33) and (34) is negligible. The term **d**_{k′}^{T}𝗥̃_{k′}^{−1}**d**_{k′} can be computed once for all at time *t*_{k′}, with a computational complexity 2*y*_{k′} (for a diagonal 𝗥̃_{k′}) that is negligible with respect to that of the observational update. We can then conclude that, if the observational update is performed in the reduced space defined by the transformation in (7), optimal adaptive estimates of both scaling factors *α**_{k} and *β**_{k} can always be obtained without significant additional cost.
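The joint reduction for (*α*, *β*) follows the same pattern: with 𝗖 = *α*𝗛𝗦𝗦^{T}𝗛^{T} + *β*𝗥̃, only the ratio *α*/*β* enters the reduced eigenvalue factors, and *β* contributes through a *y* ln *β* term and a 1/*β* scaling of the quadratic term. The following sketch mirrors the structure of (33) and (34) under the same SEEK-type assumptions as before (illustrative, not the paper's exact expressions):

```python
import numpy as np

def cost_alpha_beta(alpha, beta, d, HS, R_diag):
    """J(alpha, beta) (up to a constant) for C = alpha * HS HS^T + beta * R,
    evaluated from scalars precomputable at (alpha, beta) = (1, 1): the
    eigenvalues g_l of Gamma~, the projections of delta~, and d^T R^{-1} d."""
    y = d.size
    delta = HS.T @ (d / R_diag)
    Gamma = np.eye(HS.shape[1]) + HS.T @ (HS / R_diag[:, None])
    g, U = np.linalg.eigh(Gamma)
    proj2 = (U.T @ delta) ** 2
    s = alpha / beta                         # only the ratio rescales Gamma
    ev = 1.0 + s * (g - 1.0)
    logdet_C = y * np.log(beta) + np.log(R_diag).sum() + np.log(ev).sum()
    quad = (d @ (d / R_diag) - s * (proj2 / ev).sum()) / beta
    return 0.5 * (logdet_C + quad)
```

Once the per-cycle scalars are stored, each (*α*, *β*) iterate costs a handful of operations proportional to *r*, independent of *y*.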

It is also possible to generalize from a single scaling factor *β* for the whole observation error covariance matrix to the more general problem of estimating separate scaling factors for several segments of the observation vector. This may be useful for instance if the observation vector results from the merging of several datasets (*i* = 1, … , *N*), originating from different instruments, whose errors need to be scaled separately. However, this generalized problem requires additional matrix operations that may significantly increase the computational complexity, because changing the structure of the matrix 𝗥_{k} modifies the eigenvalue problem defined by Eq. (9), which must then be solved again for every new iterate of the vector of parameters **β**. The first step is to express the matrix 𝗥_{k} in terms of individual contributions 𝗥_{k,i}(*β*_{i}), which are nonzero only for observations belonging to the corresponding dataset. This leads to individual reduced vectors *δ*_{k,i} and matrices **Γ**_{k,i} given by Eqs. (8) and (9). This computation does not involve any additional operations, since it can be done once for all, and since every scalar product only needs to be extended to the part of the observation vector belonging to the corresponding segment. Then, for every new iterate of **β** and for every *k*′ ≤ *k*: (i) compute the overall *δ*_{k′}(**β**) and **Γ**_{k′}(**β**) as weighted sums of the individual contributions, and (ii) solve the eigenvalue problem for **Γ**_{k′}(**β**), from which the second term of Eqs. (28) and (31) can be easily computed [with factor *α* as in (32) if required]. Concerning the first term of (28) or (31), it can be computed as the sum of the individual elements, where *y*_{k′,i} is the number of observations in segment number *i*. This computation is straightforward, since the individual terms can be computed once for all at each *k*′. The dominant cost of these additional operations (for *N* ≪ *r*) comes from the necessity to recompute **Λ**_{k′} and 𝗨_{k′} (computational complexity proportional to *r*^{3}) for every iterate of **β** and every *k*′ ≤ *k*, so that the overall additional complexity is proportional to *pKr*^{3}, still independent of the number of observations, but no longer linear in *r*.

Finally, it is important to keep in mind that, in order to attempt the joint control of observation and forecast error covariance scaling factors *α* and *β*, we must be certain that both parameters can be simultaneously identified through the innovation sequence. [See Li et al. (2009), who also propose a method to control these two parameters.] This is for instance clearly impossible if observation and forecast errors have identical covariance structures (in the observation space), because then parameters *α* and *β* play exactly the same role in the innovation covariance matrix in (25). In this ill-conditioned situation, parameters *α* and *β* are not jointly controllable whatever the number of observations or the length of the innovation sequence.
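The ill-conditioned case mentioned above is easy to exhibit: if the forecast and observation error covariances have identical structures in the observation space, 𝗛𝗣^{f}𝗛^{T} = *c*𝗥, then only the combination *αc* + *β* enters the innovation covariance, and no innovation sequence can separate the two parameters. A small illustrative check (names hypothetical):

```python
import numpy as np

# If H Pf H^T = c * R, then C(alpha, beta) = (alpha * c + beta) * R: all
# parameter pairs with the same value of alpha * c + beta produce the same
# innovation covariance, hence the same likelihood, so alpha and beta are
# not jointly identifiable whatever the length of the innovation sequence.
c = 2.0
R = np.diag(np.array([0.5, 1.0, 1.5]))
HPfHt = c * R
C1 = 1.0 * HPfHt + 3.0 * R    # alpha = 1, beta = 3  ->  alpha*c + beta = 5
C2 = 2.0 * HPfHt + 1.0 * R    # alpha = 2, beta = 1  ->  alpha*c + beta = 5
assert np.allclose(C1, C2)
```

In this degenerate situation, the likelihood is flat along the lines *αc* + *β* = const, and any preference between *α* and *β* comes entirely from the prior.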

### f. Observation error correlation length scale

The efficiency of the algorithms described above relies on the assumption that the observation error covariance matrix 𝗥_{k} can be inverted at low cost (e.g., if it can be assumed diagonal). It is nevertheless possible to preserve the efficiency of the transformed observational update algorithm in the presence of observation error correlations, as shown by Brankart et al. (2009). Their method consists in augmenting the observation vector **y**_{k} with new observations that are linear combinations of the original observations, and assuming a diagonal observation error covariance matrix in the augmented observation space. Since the computational complexity of the algorithm is linear in the number of observations, the size of the observation vector can indeed be increased without prohibitive consequence on the numerical cost. If the augmented observation vector **y**^{+}, the augmented observation operator 𝗛^{+}, and the associated diagonal observation error covariance matrix 𝗥^{+} are defined as in Eq. (38), where 𝗥_{0} and 𝗥_{1} are diagonal matrices, then the observational update that is performed using **y**^{+} and 𝗥^{+} is equivalent to an observational update performed using only the observations **y** with a nondiagonal observation error covariance matrix 𝗥 given by Eq. (39). The resulting observation error correlation length scale ℓ is a function of the ratio *σ*_{0}/*σ*_{1} (for homogeneous 𝗥_{0} = *σ*_{0}^{2}𝗜 and 𝗥_{1} = *σ*_{1}^{2}𝗜). The purpose of this section is to show how the optimal adaptive algorithm described above must be modified if this kind of parameterization of the observation error correlations is applied.
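As an illustration of this equivalence, the following sketch assumes (in the spirit of Brankart et al. 2009; the exact form of Eq. (39) is not reproduced here) that a diagonal error on the augmented vector [**y**; 𝗧**y**] implies 𝗥^{−1} = 𝗥_{0}^{−1} + 𝗧^{T}𝗥_{1}^{−1}𝗧 in the original observation space. With 𝗧 a one-dimensional gradient operator, the implied 𝗥 is nondiagonal, with correlations decaying with distance:

```python
import numpy as np

n = 50                        # 1-D chain of observation points
sigma0, sigma1 = 0.2, 0.1     # error std of original and of gradient observations

# Discrete gradient operator T (forward differences)
T = np.zeros((n - 1, n))
for i in range(n - 1):
    T[i, i], T[i, i + 1] = -1.0, 1.0

# Assumed equivalence: diagonal errors on the augmented vector [y; Ty] imply,
# in the original observation space, R^{-1} = R0^{-1} + T^T R1^{-1} T
Rinv = np.eye(n) / sigma0**2 + T.T @ T / sigma1**2
R = np.linalg.inv(Rinv)

# The implied R is nondiagonal, with positive correlations decaying with distance
corr = R / np.sqrt(np.outer(np.diag(R), np.diag(R)))
print(corr[25, 26] > corr[25, 30] > corr[25, 45] > 0)  # True
```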

A first remark is that the components of the cost function can be computed equivalently from **y** and 𝗥 or from **y**^{+} and 𝗥^{+}. The vector *δ*_{k} and the matrix **Γ**_{k} are indeed left unchanged by this change of observation vector, provided that the observation error covariance matrices 𝗥 and 𝗥^{+} are related by Eq. (39). (It is the purpose of this transformation to keep the observational update unchanged.) This is also true for the term **d**^{T}𝗥^{−1}**d** in Eq. (27), so that the cost function can still be evaluated from the quantities *δ*_{k′}, **Γ**_{k′}, 𝗨_{k′}, and **Λ**_{k′} that are already available from the observational update performed at time *t*_{k′}.

The only remaining difference is that the determinant |𝗥| is not equal to |𝗥^{+}|. The evaluation of the cost function thus requires an explicit computation of the determinant |𝗥| using Eq. (39). Fortunately, this computation can be simplified a great deal using the following transformation. First, the determinant of (39) can be rewritten as a product over the eigenvalues *μ*_{l} of the matrix built from 𝗧 and the diagonal matrices 𝗥_{0} and 𝗥_{1} (of the form 𝗧𝗧^{T} weighted by 𝗥_{0} and 𝗥_{1}). In general, computing the *μ*_{l} eigenvalues is a problem with computational complexity proportional to *y*^{3}, but in many practical situations, a sufficient knowledge of this spectrum may not be very difficult to acquire if 𝗧 is kept simple enough [as is required for an efficient observational update; see Brankart et al. (2009)]. For instance, if 𝗧 is a discrete gradient operator and 𝗥_{0} = *σ*_{0}^{2}𝗜 and 𝗥_{1} = *σ*_{1}^{2}𝗜 are homogeneous, the *μ*_{l} are simply the eigenvalues of a discrete Laplacian operator.
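For this homogeneous gradient case, the shortcut can be verified numerically: the log-determinant of 𝗥^{−1} computed from the Laplacian spectrum (obtained once) matches the direct *O*(*y*^{3}) computation. The sketch below assumes the parameterization 𝗥^{−1} = 𝗜/*σ*_{0}^{2} + 𝗧^{T}𝗧/*σ*_{1}^{2}:

```python
import numpy as np

n = 40
sigma0, sigma1 = 0.2, 0.1

# Discrete gradient operator on a 1-D chain
T = np.zeros((n - 1, n))
for i in range(n - 1):
    T[i, i], T[i, i + 1] = -1.0, 1.0

# Assumed parameterization: R^{-1} = I/sigma0^2 + T^T T/sigma1^2
Rinv = np.eye(n) / sigma0**2 + T.T @ T / sigma1**2

# Direct log-determinant: O(n^3) for every new (sigma0, sigma1)
sign, logdet_direct = np.linalg.slogdet(Rinv)

# Spectral shortcut: the mu_l are the eigenvalues of the discrete Laplacian
# T^T T, computed once; then log|R^{-1}| = sum_l log(1/sigma0^2 + mu_l/sigma1^2)
mu = np.linalg.eigvalsh(T.T @ T)
logdet_spectral = np.sum(np.log(1.0 / sigma0**2 + mu / sigma1**2))

print(np.isclose(logdet_direct, logdet_spectral))  # True
```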

Second, with this parameterization of the observation error correlations, there are two scaling parameters *β*_{0} and *β*_{1}, which must be controlled: 𝗥_{0} = *β*_{0}𝗥̃_{0} and 𝗥_{1} = *β*_{1}𝗥̃_{1}, where 𝗥̃_{0} and 𝗥̃_{1} are the default prior estimates of the diagonal matrices 𝗥_{0} and 𝗥_{1} defined by (39); *β*_{0} and *β*_{1} are thus scaling factors for these two matrices. If 𝗧 is a combination of derivative operators of successive orders (and 𝗥_{0}, 𝗥_{1} are spatially homogeneous: 𝗥_{0} = *σ*_{0}^{2}𝗜, 𝗥_{1} = *σ*_{1}^{2}𝗜), then the scaling factor for the correlation length scale ℓ is a function of the ratio *β*_{0}/*β*_{1}. Together with the forecast error scaling factor *α*, Eqs. (28) and (31) for the components of the cost function reduce to Eqs. (44) and (45), in which the quantities *δ*_{k′}(*α*, *β*_{0}, *β*_{1}) and **Γ**_{k′}(*α*, *β*_{0}, *β*_{1}) defined in (36) can be written explicitly in terms of the ratio *β*_{0}/*β*_{1}, which scales the observation error correlation length. The computational complexity of the minimization is thus proportional to *pKr*^{3} (leading behavior for large *y* and *r*), where *p* is here only the number of iterations requiring the update of the correlation length scale. In summary, by minimizing the cost function in (22) or (24) with components (44) and (45), it is possible to adapt simultaneously scaling factors for the forecast error covariance matrix, for the observation error covariance matrix, and for the observation error correlation length scale (given by *α*, *β*_{0}, and *β*_{0}/*β*_{1}, respectively).

An interesting particular case is *α* = 0, which corresponds to estimating the parameters *β*_{0} and *β*_{1} of the covariance 𝗥 of a zero-mean random vector from a finite sample of independent events **d**_{k′}, *k*′ = 1, … , *k*. In this case, the last term disappears in (44) and (45), and we obtain the likelihood function *L*(*β*_{0}, *β*_{1}) of Eq. (47), in which a constant size *y* of the observation vector, a constant operator 𝗧, and constant diagonal matrices 𝗥̃_{0} and 𝗥̃_{1} are assumed. If a prior probability distribution *p*_{0}(*β*_{0}, *β*_{1}) is available, the posterior distribution is then *p*_{k}^{a}(*β*_{0}, *β*_{1}) ∼ *p*_{0}(*β*_{0}, *β*_{1})*L*(*β*_{0}, *β*_{1}). Parameters *β*_{0} and *β*_{1} jointly govern the observation error variance *σ*^{2} and the observation error correlation length scale ℓ (while the shape of the correlation function is governed by the operator 𝗧, which is assumed to be known). The likelihood function in (47) is thus also a likelihood function for *σ*^{2} and ℓ as soon as their relation to *β*_{0} and *β*_{1} is known. This relation can be deduced from the functions *f*(ℓ) = *σ*_{0}^{2}/*σ*_{1}^{2} and *g*(ℓ) = *σ*_{0}^{2}/*σ*^{2}, which can be obtained from (43) with 𝗥̃_{0} = *σ*_{0}^{2}𝗜 and 𝗥̃_{1} = *σ*_{1}^{2}𝗜 (see Brankart et al. 2009). Using these functions, the likelihood function in (47) can be rewritten as a function of *σ*^{2} and ℓ in Eq. (48), where the *μ*_{l} are here directly the eigenvalues of 𝗧𝗧^{T} (and not of the 𝗥-weighted matrix as before). If 𝗧 is the gradient operator, the *μ*_{l} are simply the eigenvalues of the Laplacian operator, *f*(ℓ) = ℓ^{2}, and *g*(ℓ) is a decreasing function of ℓ, behaving proportionally to ℓ^{n} (in *n* dimensions) if ℓ is large with respect to the distance between observations. If the correlation length is known, we retrieve the classic likelihood function for the variance *σ*^{2} of a Gaussian random vector. Conversely, if the variance is known, Eq. (48) is simply the likelihood function *L*(ℓ) for the correlation length of a random vector with known variance and known correlation shape. The equation shows that it can be computed explicitly at low cost as soon as 𝗧, *μ*_{l}, *f*(ℓ), and *g*(ℓ) are known (see section 4d).
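A minimal numerical sketch of this low-cost likelihood evaluation (assuming the gradient parameterization with *σ*_{1} = *σ*_{0}/ℓ on a 1-D domain; variable names are illustrative) recovers the correlation length of a synthetic sample by scanning the log-likelihood on a grid of ℓ values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 200                # observation vector size, sample size
sigma0, ell_true = 0.2, 4.0   # noise level and true correlation length (grid points)

# Discrete gradient operator T on a 1-D chain of n points
T = np.zeros((n - 1, n))
for i in range(n - 1):
    T[i, i], T[i, i + 1] = -1.0, 1.0
mu = np.linalg.eigvalsh(T.T @ T)       # Laplacian spectrum, computed once

def precision(ell):
    # Assumed parameterization: R^{-1} = I/sigma0^2 + T^T T/sigma1^2, sigma1 = sigma0/ell
    return np.eye(n) / sigma0**2 + (ell / sigma0) ** 2 * (T.T @ T)

# k independent draws d_k' ~ N(0, R(ell_true))
d = rng.multivariate_normal(np.zeros(n), np.linalg.inv(precision(ell_true)), size=k)

def loglik(ell):
    # log L(ell) = (k/2) log|R^{-1}(ell)| - (1/2) sum_k' d^T R^{-1}(ell) d + const
    logdet = np.sum(np.log(1.0 / sigma0**2 + (ell / sigma0) ** 2 * mu))
    quad = np.einsum('ki,ij,kj->', d, precision(ell), d)
    return 0.5 * (k * logdet - quad)

grid = np.arange(1.0, 8.1, 0.5)
ell_hat = grid[np.argmax([loglik(e) for e in grid])]
print(abs(ell_hat - ell_true) <= 1.0)  # the likelihood peaks near ell_true
```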

## 4. Demonstration experiments

The purpose of this section is to demonstrate how the optimal adaptive algorithm described in the previous sections can be used in practice. As an example application, we consider the problem of estimating the long-term evolution of a model-simulated ocean mesoscale signal from synthetic observations of the ocean dynamic topography. To concentrate on the behavior of the adaptive algorithm, it is assumed that the only source of information to solve this estimation problem comes from these synthetic observations. The dynamical laws governing the ocean flow are not exploited to constrain the solution. Only simplified assimilation experiments are thus performed in which the ocean model operator 𝗠 is set to zero (total ignorance) or identity (persistence). In that way, a clear diagnostic of the adaptive mechanism can be obtained, without being blinded by unverifiable interferences with a complex nonlinear ocean model. Several examples are shown to illustrate the control of the forecast error covariance scaling (section 4b), the observation error covariance scaling (section 4c), and the correlation length scale (section 4d), before attempting the joint control of all these parameters (section 4e). But before that, we describe how the reference mesoscale signal and the synthetic observations are generated (section 4a).

### a. Description of the experiments

The reference mesoscale signal is simulated using a primitive equation model of an idealized square and 5000-m-deep flat bottom ocean at midlatitudes (between 25° and 45°N). In this square basin a double-gyre circulation is created by a constant zonal wind forcing blowing westward in the northern and southern parts of the basin and eastward in the middle part of the basin (with sinusoidal latitude dependence: *τ* = −*τ*_{0} cos[2*π*(*λ* − *λ*_{min})/(*λ*_{max} − *λ*_{min})] and *τ*_{0} = 0.1 N m^{−2}). The western intensification of these two gyres produces western boundary currents that feed an eastward jet in the middle of the square basin (see the resulting mean dynamic height in Fig. 1). This jet is unstable (Le Provost and Verron 1987) so that the flow is dominated by chaotic mesoscale dynamics, with largest eddies that are ∼100 km wide, and to which correspond velocities of ∼1 m s^{−1} and dynamic height differences of ∼1 m (see the resulting dynamic height standard deviation in Fig. 1). All this is very similar in shape and magnitude to what is observed in the Gulf Stream (North Atlantic) or in the Kuroshio (North Pacific).

The time evolution of this chaotic system is computed using the Nucleus for European Modelling of the Ocean (NEMO) numerical ocean model, with a horizontal resolution of ¼° × ¼° cos*λ* and 11 levels in the vertical (see Cosme et al. 2010 for more detail about the model configuration). The main three physical parameters governing the dominant characteristics of the flow are the stratification, the bottom friction, and the horizontal viscosity. The model is started from rest with uniform stratification and can be considered to reach equilibrium statistics after 20 yr of simulation. In this paper, we thus concentrate on the estimation of the 100-yr signal from years 21 to 120. Moreover, we focus our study on a limited subdomain in the middle of the jet (about 650 × 650 km, as shown by the black square in Fig. 1) with intense and quite homogeneous mesoscale activity. It is also assumed that the reference simulation is known with a time resolution of 1 snapshot every 10 days. Figure 2 shows for instance the simulated dynamic height in October of year 21, with a time resolution of 10 days, which is clearly sufficient to observe the slow westward (upstream) motion of the main eddies.

To estimate the time evolution of this mesoscale flow, we assume that the ocean altimetry is observed every 10 days at model resolution. Without a dynamical model to constrain the estimation problem, it is indeed important to have observations with a sufficient horizontal coverage. However, in order to generate the synthetic observations, an artificial observational noise is added to the reference simulation. This noise is meant to simulate the measurement and representation errors that always exist in a real observation dataset. In our system, the representation error mainly includes the subgrid-scale eddies that are not resolved by the model discretization. This component of the observation error is thus often dominant and poorly known, which justifies adjusting its main statistics (variance and correlation length scale) using the available observations. In this study, three kinds of correlation model are used to simulate the observational noise: (A) uncorrelated errors, (B) an exponential decorrelation function: *ρ*(*r*) = exp(−*r*/ℓ), and (C) a smooth noise correlation model: *ρ*(*r*) = (*r*/ℓ)*K*_{1}(*r*/ℓ), where *K*_{1} is the modified Bessel function of the second kind of order 1. The last two models have the property that they can be efficiently parameterized using a diagonal observation error covariance matrix in an augmented observation space [as in Eq. (38)]. Model B just requires including gradient observations (with error standard deviation *σ*_{1} = *σ*_{0}/ℓ), while model C requires including both gradient and curvature observations (with error standard deviations *σ*_{1} = *σ*_{0}/ℓ and *σ*_{2} = *σ*_{0}/ℓ^{2}). See Brankart et al. (2009) for more detail about this correspondence. Figure 3 shows an example of the simulated observation noise corresponding to the three correlation models. These noise fields are randomly sampled from a Gaussian distribution with zero mean, standard deviation *σ* = 0.2 m, and one of the correlation models A, B, or C, with a correlation length scale ℓ equal to 5 grid points (for the last two models).

### b. Control of the forecast error covariance scaling

The first set of experiments is dedicated to the estimation of a single global scaling factor *α* for the forecast error covariance matrix, by applying the method described in section 3d. In these first experiments, observation errors are uncorrelated, with standard deviation *σ* = 0.2 m, and the corresponding observation error covariance matrix is assumed perfectly known: 𝗥 = *σ*^{2}𝗜. Problems with a constant and with a nonconstant parameter *α* are successively considered.

#### 1) Gaussian random signal

As a first illustration of the adaptive mechanism, let us consider the problem of estimating a sequence of random and independent draws from the Gaussian probability distribution N[0, *α*_{ref}(*k*)*P̃*^{f}], where *α*_{ref}(*k*) is not accurately known, and *P̃*^{f} is the covariance of the 100-yr model sequence described in section 4a (and illustrated in Fig. 2). In this very simple (but unfavorable) situation, the best model is obviously 𝗠 = 0, and the past observations are only useful to improve the knowledge of the parameter *α*. In the couple of filtering problems described by the sequences in (2) and (16), only the second remains, and the problem reduces to estimating one parameter of a Gaussian distribution N(0, *α*𝗛*P̃*^{f}𝗛^{T} + 𝗥) from a random sample whose size *k* is growing with time.
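This growing-sample estimation can be sketched as follows (with an illustrative diagonal stand-in for 𝗛*P̃*^{f}𝗛^{T}, since the true matrices are not reproduced here): the likelihood of *α* narrows as the number *k* of innovation vectors increases.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
A = np.diag(np.linspace(0.5, 2.0, n))   # illustrative stand-in for H P~f H^T
R = 0.04 * np.eye(n)                    # uncorrelated errors, sigma = 0.2
alpha_ref = 1.0

def neg_loglik(alpha, d):
    # -log L_k(alpha) for innovations d ~ N(0, alpha*A + R), up to a constant
    C = alpha * A + R
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum('ki,ij,kj->', d, np.linalg.inv(C), d)
    return 0.5 * (len(d) * logdet + quad)

def width(k):
    # Width of the (scaled) likelihood in alpha after k innovation vectors
    d = rng.multivariate_normal(np.zeros(n), alpha_ref * A + R, size=k)
    grid = np.linspace(0.5, 1.5, 201)
    nll = np.array([neg_loglik(a, d) for a in grid])
    lik = np.exp(nll.min() - nll)       # scaled so that the maximum is 1
    kept = grid[lik > 0.1]
    return kept[-1] - kept[0]

w100, w3 = width(100), width(3)
print(w100 < w3)                        # the likelihood narrows as k grows
```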

As a first case study, the reference sequence of draws **x**_{k} is sampled from a *constant* Gaussian probability distribution: *α*_{ref}(*k*) = 1. From this, we can simulate a sequence of observation vectors **y**_{k} as explained above, and then compute the terms of the cost function in (22) using Eq. (32). This can be done efficiently by computing once for all *δ̃*_{k′}, **Λ̃**, and 𝗨̃ (only *δ̃*_{k′} depends on *k*′ in this simple problem) from the square root decomposition of *P̃*^{f} using Eqs. (8) and (9). Figure 4 (left panel) shows the resulting likelihood function *L*_{k}(*α*), given by Eq. (21), that is obtained for several time indices *k* = 1, 3, 10, and 100. (The function is scaled so that the maximum is always equal to 1.) The narrowing of *L*_{k}(*α*) around the reference value *α* = 1 for increasing *k* shows that the knowledge of the parameter is increasing with time as more observations become available. With a prior probability distribution for the parameter *α*, for instance *p*_{0}(*α*) = exp(−*α*), the posterior probability at time *t*_{k} can be computed as *p*_{k}(*α*) ∼ *p*_{0}(*α*)*L*_{k}(*α*). The mode of this distribution together with percentiles 0.1 and 0.9 are drawn in Fig. 4 (right panel) as a function of *k*, showing that the probability density is progressively concentrating toward the reference value *α* = 1.

If the reference parameter is no longer constant, and if little is known about the time dependence of the parameter values, we can use the simple model described in section 3b to forget the oldest innovations in the computation of the current parameter best estimate. To test the method for several parameter fluctuation time scales in one single experiment, we set *α*_{ref}(*k*) = *a* + *b* sin[*ω*(*k*)*k*] with *ω*(*k*) = *k*/*k*_{1} and *k*_{1} = 10 000. It is shown in Fig. 5 (thick dotted line) for *a* = 1 and *b* = ½. The figure also shows the estimate *α**(*k*) that is obtained using an *e*-folding forgetting time scale *k*^{e} = 100 (i.e., a forgetting exponent *f* ≃ 0.99). This forgetting time scale is well adapted only if it is comparable to the reference parameter fluctuation time scale: *k*^{e} ∼ 1/*ω*(*k*) (i.e., for *k* ∼ *k*_{1}/*k*^{e} = 100). Before this time index, the accuracy of the estimate could be improved by keeping a longer sequence of observations, and after this time index, it becomes better to forget the observations faster. The estimation thus suffers from the poor prior knowledge (summarized in the single parameter *f* ≃ 0.99) of the time dependence between parameter values. As the fluctuation frequency increases with time, the representation of the successive minima and maxima of the reference parameter time series becomes less and less accurate.
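The forgetting mechanism is easy to sketch in the scalar case (illustrative values only; this is not the full matrix computation of section 3b): with weights *f*^{k−k′}, the weighted maximum likelihood estimate of the scaling factor tracks slow fluctuations of *α*_{ref} at the price of a lag and a residual noise set by the memory 1/(1 − *f*):

```python
import numpy as np

rng = np.random.default_rng(3)
K, sigma2, f = 3000, 1.0, 0.99        # f gives an e-folding memory of ~100 steps

# Slowly varying reference scaling (illustrative, period of 1000 steps)
k = np.arange(K)
alpha_ref = 1.0 + 0.5 * np.sin(2 * np.pi * k / 1000.0)

# Scalar innovations d_k ~ N(0, alpha_ref(k) * sigma2)
d = rng.standard_normal(K) * np.sqrt(alpha_ref * sigma2)

# Exponentially forgotten ML estimate of the scaling factor:
#   alpha_hat_k = sum_{k'<=k} f^(k-k') d_{k'}^2 / (sigma2 * sum_{k'<=k} f^(k-k'))
num = den = 0.0
alpha_hat = np.empty(K)
for i in range(K):
    num = f * num + d[i] ** 2
    den = f * den + 1.0
    alpha_hat[i] = num / (sigma2 * den)

# After spinup, the estimate follows the slow fluctuations of alpha_ref
c = np.corrcoef(alpha_hat[500:], alpha_ref[500:])[0, 1]
print(c > 0.5)
```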

#### 2) Model simulated signal

Let us now consider the more realistic problem of estimating the 100-yr model-simulated mesoscale signal described in section 4a (and illustrated in Fig. 2). The initial condition is assumed perfectly known, and the observation errors are still uncorrelated (with *σ* = 0.2 m), but here, a better model is to assume persistence: 𝗠 = 𝗜. Furthermore, assuming stationary statistics, the corresponding model error covariance matrix can be consistently parameterized using the time covariance 𝗤 of the differences between successive model snapshots in the reference simulation: 𝗣_{k}^{f} = *α*𝗤, where *α* is an unknown scaling parameter. (In other situations, it may be better to make a different assumption, for instance by propagating the analysis error covariance through the model before adding *α*𝗤.)

With this parameterization, we can solve the joint filtering problem for the state of the system and for the parameter *α*, as explained in sections 2 and 3. In the parameter filter, we prescribe the prior probability distribution for the parameter *p*_{0}(*α*) = exp(−*α*) and the forgetting exponent *f* = 0.9. Figure 6 (top panel) shows the resulting estimate *α**(*t*_{k}) that is obtained as the mode of the posterior probability distribution at time *t*_{k}. As expected, the covariance of the error of the persistence model is close to our prior estimate 𝗤, so that the estimated *α* remains close to 1. On the other hand, as explained in the previous example [section 4b(1)], the estimated accuracy of the parameter is very sensitive to the forgetting exponent (i.e., to the number of innovation vectors that is assumed relevant to include in the current parameter estimate). This sensitivity to a subjective assumption is the very reason why using the closure in (14) and (15) is thought to be better than using the integral in (13) to simulate the forecast error probability distribution (see the explanation at the end of section 2b). In this example with very steady statistics, a better parameter accuracy can be obtained by increasing the forgetting exponent *f* (using for instance *f* = 0.99 instead of *f* = 0.9), but this is done at the expense of a much larger numerical cost (10 times larger for *f* = 0.99), since a longer innovation sequence must be used for each evaluation of the cost function.

On the other hand, Fig. 6 illustrates the corresponding error on the state estimate, as obtained for altimetry (middle panel) and velocity (bottom panel). The figure shows the root-mean-square difference between the state estimate (after the observational update at time *t*_{k}) and the reference simulation (solid thick line), as compared to the corresponding error estimate produced by the filter (the square root of the trace of the analysis error covariance matrix, dashed thick line). In the adaptive experiment (thick lines), these two curves remain quite consistent in the long term, indicating that the adaptive mechanism is sufficiently constraining the error statistics to produce consistent estimates of the total error variance. This result is compared with another simulation that is performed without the adaptive mechanism (thin lines), using a fixed, inaccurate value of the parameter *α*.

### c. Control of the observation error covariance scaling

The second set of experiments is dedicated to the estimation of a single global scaling factor *β* for the observation error covariance matrix, by applying the method described in section 3e. In these experiments, observation errors are still uncorrelated, with a standard deviation of *σ* = 0.2 m, but the scaling of the observation error covariance matrix is assumed unknown: 𝗥 = *βσ*^{2}𝗜. We first consider the same example as in section 4b(2), but with a known forecast error covariance matrix 𝗣_{k}^{f} = 𝗤. The only parameter to control is thus the scaling factor *β*. This is done by solving the joint filtering problem for the state of the system and for the parameter *β* [as in section 4b(2) for parameter *α*, but this time using the expression of the cost function given in section 3e], using again the prior probability distribution *p*_{0}(*β*) = exp(−*β*), but with the forgetting exponent *f* = 1, since *β* can be assumed constant.

Figure 7 (top panel, thick lines) shows the resulting estimate *β**(*t*_{k}) that is obtained as the mode of the posterior probability distribution at time *t*_{k}, together with percentiles 0.1 and 0.9. As expected, the optimal estimate *β**(*t*_{k}) quickly converges toward the correct value *β* = 1, with an associated error decreasing to zero as more observations become available. Old observations are indeed never forgotten, since the parameter is assumed constant. The figure also illustrates the corresponding error on the state estimate (as in Fig. 6), showing that we obtain the same result as in Fig. 6 as soon as the parameter *β* becomes sufficiently accurate. This result is compared with another simulation that is performed without the adaptive mechanism (thin lines) using the inaccurate value *β* = 0.5. We observe again that, without adaptive statistics, the error estimate remains inconsistent, so that the resulting nonoptimal scheme can only produce larger errors, together with a badly estimated error variance (underestimated in this example).

On the other hand, it is also interesting to investigate what happens if we control the scaling factor *β* alone, in the presence of an inaccurate forecast error covariance scaling. For that purpose, we just redo the same experiment with 𝗣_{k}^{f} = *α*𝗤, with constant *α* = ¼. The resulting estimate *β**(*t*_{k}) is also shown in Fig. 7 (top panel, thin line). As can be observed, *β**(*t*_{k}) quickly diverges from the correct value *β* = 1, because the adaptive mechanism is trying to compensate for the underestimation of the forecast error variance by overestimating the observation error, in order to account for the total innovation signal. It is thus very important to stress again the assumption that all statistical parameters that are not included in the control vector must be accurately known. Inconsistency in the prior statistical parameterization can easily lead to grossly incorrect adaptive parameters.
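This compensation effect can be reproduced in a scalar sketch (illustrative values only): if the innovation variance is *s* + *r* but the assumed model is *αs* + *βr* with *α* fixed too small, the maximum likelihood *β* absorbs the missing variance:

```python
import numpy as np

rng = np.random.default_rng(4)
k, y = 200, 100
s, r = 1.0, 1.0         # true forecast and observation error variances (scalars)
alpha_wrong = 0.25      # underestimated forecast error scaling, kept fixed

# Innovations actually distributed with total variance s + r = 2
d = rng.standard_normal((k, y)) * np.sqrt(s + r)

# Maximizing the Gaussian likelihood of the misspecified model
# var = alpha_wrong*s + beta*r over beta gives
#   beta_hat = (mean(d^2) - alpha_wrong*s) / r
beta_hat = (d.var() - alpha_wrong * s) / r
print(beta_hat > 1.2)   # True: beta overshoots to absorb the missing forecast variance
```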

### d. Control of the correlation length scale

The third set of experiments is dedicated to the estimation of the correlation length scale from a random sample of a Gaussian distribution, by applying the method described in section 3f. In these experiments, it is assumed that this Gaussian distribution is characterized by a zero mean and a covariance 𝗥. This application thus corresponds to the problem described in section 3f with the simplification that *α* = 0 (i.e., 𝗥 is the only remaining term in the covariance 𝗖 of the random vector), so that we can directly apply Eqs. (47) and (48) to estimate the signal variance and/or correlation length scale.

As an illustration, let us consider the problem of estimating the correlation length scale of the observational noise described in section 4a (and illustrated in Fig. 3) from samples of various sizes. In this experiment, it is assumed that the noise standard deviation *σ* = 0.2 m and the correlation structure of model B [*ρ*(*r*) = exp(−*r*/ℓ)] or model C [*ρ*(*r*) = (*r*/ℓ)*K*_{1}(*r*/ℓ)] are known, so that the correlation length scale ℓ is the only parameter that must be estimated. Moreover, we have already said that these two correlation structures can be consistently parameterized using the parameterization in (39), with a transformation operator 𝗧 that includes the gradient for model B or the gradient and curvature for model C. From this operator, it is easy to evaluate the eigenvalues *μ*_{l} of 𝗧𝗧^{T}, and the functions *f*(ℓ) and *g*(ℓ) using Eq. (43). Figure 8 illustrates the resulting likelihood function *L*(ℓ) for the correlation length scale ℓ, as computed using Eq. (48), for several sizes *k* of the sample. (The figure is scaled so that the maximum is always equal to 1.) The figure shows that, for the correlation model B, the likelihood function narrows close to the correct value ℓ = 2 grid points (left panel) or ℓ = 5 grid points (middle panel) as more observations become available, which indicates that the adaptive mechanism presented in this paper is able to control the correlation length scale of a random signal, as soon as a correct assumption about the correlation structure can be formulated. Concerning the correlation model C, also illustrated in Fig. 8, there is a discrepancy between the true correlation length scale (ℓ = 5 grid points) and the estimated value (ℓ ∼ 3.2 grid points). This problem results from an approximate evaluation of the function *g*(ℓ) in Eq. (48), for which we assumed an infinite domain (whereas our real domain is not very large with respect to the correlation length scale). This difficulty does not arise if the adaptive parameters are *β*_{0} and *β*_{1} (instead of *σ*^{2} and ℓ), but it points out the sensitivity of the estimation to inaccuracies in the representation of the correlation structure.

### e. Joint control of forecast and observation error parameterizations

Last, in a fourth set of experiments, we solve a more general assimilation problem in which two of the adaptive parameters considered in the previous sections are controlled together: a parameter *α* to adjust the scaling of the forecast error covariance matrix, and a parameter *β* to adjust the scaling of the observation error covariance matrix. For that purpose, we use observations with uncorrelated or correlated errors (*σ* = 0.2 m and ℓ = 5 grid points) using correlation model A, B, or C, as illustrated in Fig. 3. In addition, in the experiments, the structure of the observation error covariance matrix is assumed to be known: it is given by Eq. (39) with an operator 𝗧 consistent with the real correlation model (A, B, or C), and with diagonal matrices 𝗥_{0} = *σ*_{0}^{2}𝗜 and 𝗥_{1} = *σ*_{1}^{2}𝗜 parameterized in such a way that the true value of the global scaling parameter is *β* = 1. As the prior probability distribution for the parameters, we use *p*_{0}(*α*, *β*) = exp[−(*α* + *β*)], that is, independent prior distributions with exponential probability density for each parameter. As in section 4b(2), the parameters are not assumed constant in time, and the same forgetting exponent *f* = 0.9 is used. With these assumptions, we can solve the joint filtering problem for the state of the system and for the parameters *α* and *β*. In particular, the optimal parameter estimates are obtained at each time *t*_{k} by minimizing the cost function whose components are given by Eqs. (33) and (34) as a function of the parameters *α* and *β* (with *y*_{k′} equal to the number of real observations, i.e., excluding the fictitious gradient or curvature observations if any).

Figure 9 (top two panels, thick lines) shows the resulting estimates *α**(*t*_{k}) and *β**(*t*_{k}) that are obtained as the modes of the posterior probability distribution at time *t*_{k}. As expected, for all correlation models A, B, and C, the optimal forecast error covariance scaling *α**(*t*_{k}) converges toward the same solution as in section 4b(2), and the optimal observation error covariance scaling *β**(*t*_{k}) converges approximately toward the correct value *β* = 1. Figure 9 (bottom panel) also illustrates the corresponding error on altimetry. For correlation model A, the result is similar to the solution shown in Fig. 6 (obtained with known *β*), whereas for the two other correlation models B and C, the error is larger, as a result of the lower quality of the observations that are assimilated (correlated instead of uncorrelated observation error). But in any case, the error estimate produced by the adaptive filter (dotted thick line) is always a consistent estimate of the real error standard deviation. Without adaptivity (not shown in the figure), the error estimate can become grossly inconsistent (as explicitly shown in the examples of sections 4b and 4c), with the same negative consequences on filter optimality.

Finally, in order to characterize more completely the knowledge that is acquired about parameters *α* and *β*, Fig. 10 represents their joint likelihood function (based on the first two innovation vectors), as obtained in the three experiments (i.e., with correlation model A, B, or C for the observation error). The first thing that can be observed is that *β* is diagnosed with a better accuracy than *α*, as a consequence of the larger number of degrees of freedom in the observation error as compared to the forecast error (see Figs. 2 and 3). Second, in these experiments, the slope of the first principal axis of the sensitivity ellipse happens to be only slightly negative, which indicates that, in this case study, the adaptive scheme is able to make a clear distinction between forecast and observation errors. This absence of an unstable direction with large inaccuracy (which would correspond to a correct representation of the total covariance whatever *α* and *β* along that line) explains why parameters *α* and *β* can be jointly identified. Again, this is linked to the very distinct structures of the forecast and observation errors (cf. Figs. 2 and 3) so that their respective variances can be easily diagnosed as soon as their respective covariance structures are known.

## 5. Conclusions

It is a common practice in Kalman filter applications to adjust uncertain statistical parameters by exploiting the information contained in the innovation sequence. Yet, optimal estimates can only be obtained if it is possible to evaluate the posterior probability distribution for the parameters given the innovation sequence. With the classic formulation of the Kalman filter observational update, the computational complexity of this optimization is *C*_{0} ∼ *pKy*^{3} (leading behavior for large *y*), where *y* is the size of the innovation vector, *K* is the number of innovation vectors, and *p* is the number of iterates to reach the optimum (i.e., a factor *pK* with respect to the observational update itself). This cost is obviously prohibitive for the large systems that are usually considered in atmospheric and oceanic applications, so that practitioners are compelled to develop complex nonoptimal schemes that are usually based on a clever fitting of filter statistics to innovation statistics. In this paper, it has been demonstrated that optimal parameter estimates can be computed efficiently (with a computational complexity *C*_{1} that is asymptotically negligible with respect to that of the observational update) as soon as the observational update is performed using a transformed formulation working in the reduced control space defined by the square root or ensemble representation of the forecast error covariance matrix. (This transformed formulation is also more efficient for the state observational update as soon as the dimension *r* of the reduced space is small with respect to the size of the observation vector: *r* ≪ *y*.) 
However, this level of efficiency can only be achieved for the following important parameters: scaling of the forecast error covariance matrix (*C*_{1} ∼ *pKr*), scaling of the observation error covariance matrix (*C*_{1} ∼ *pKr*), or parameters modifying the shape of the observation error covariance matrix, such as the correlation length scale (*C*_{1} ∼ *pKr*^{3}). In addition, the method is based on the fundamental assumption that the probability distribution of the innovation given the parameters is a Gaussian distribution. A direct generalization to non-Gaussian distributions is possible as soon as a nonlinear change of variables (anamorphosis) can be found to transform the non-Gaussian distributions into Gaussian distributions.

Idealized experiments have been performed to demonstrate the ability of this optimal adaptive algorithm to effectively control unknown statistical parameters in addition to the state of the system. These experiments have been designed to estimate a synthetic mesoscale signal over a limited region of the ocean basin, so that the rank of the covariance matrices is always small enough to preserve the efficiency of the adaptive algorithm. This means that, in more realistic applications involving large dynamical systems, the adaptive algorithm can remain efficient only if it is used in conjunction with a covariance localization method, from which local low-rank covariance matrices can be obtained. In that way, it would become possible to compute the adaptive parameters locally, and thus to introduce even more degrees of freedom in the filter statistics. Showing how this can be done with sufficient numerical efficiency is the subject of future work.

In addition, this local region of the mesoscale flow is estimated with a simplified assimilation scheme (i.e., zero or identity model and full observation coverage), devised in such a way that the optimality of the scheme can always be easily diagnosed. The results show, first, that the adaptive mechanism is able to control the above-cited unknown statistical parameters, separately and even jointly. Second, the associated measure of the accuracy of the estimated parameters (given, for instance, by the difference between percentiles 0.9 and 0.1 of their marginal posterior distribution) is closely related to the number of innovation vectors that contribute significantly to the estimation. It is thus quite sensitive to the prior assumptions about the time dependence of the parameters (i.e., the choice of the forgetting exponent in our parameterization). Third, the experiments demonstrate that adaptivity can significantly improve the estimation of the state of the system. Comparisons with nonadaptive examples show that statistical parameters left inaccurate make the system depart from optimality, with dramatic consequences for the accuracy of the estimation. Moreover, in this case study, the application of the adaptive algorithm always produces error estimates that are more consistent with the real error. This is a direct consequence of the improved accuracy of the statistical parameters, to which the error estimates are particularly sensitive (much more than the state estimate itself).
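The percentile-based accuracy measure and the role of the forgetting exponent can be sketched as a grid posterior for a scalar parameter. The discounting form `rho ** (K - 1 - k)` and all names below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def posterior_summary(log_lik, rho=0.9):
    """Grid posterior for a scalar parameter from K innovation likelihoods.

    log_lik: (K, n) array of log p(d_k | theta_i) on a parameter grid of
    n points. Older innovations are discounted by a forgetting exponent
    rho (an assumed form), so roughly 1/(1 - rho) innovations contribute
    significantly to the estimation.
    Returns grid indices of the mode and of percentiles 0.1 and 0.9.
    """
    K = log_lik.shape[0]
    w = rho ** np.arange(K - 1, -1, -1)         # most recent innovation: weight 1
    log_post = (w[:, None] * log_lik).sum(axis=0)
    post = np.exp(log_post - log_post.max())    # flat prior; shift for stability
    post /= post.sum()
    cdf = np.cumsum(post)
    return (int(np.argmax(post)),
            int(np.searchsorted(cdf, 0.1)),
            int(np.searchsorted(cdf, 0.9)))

# Example: 10 identical Gaussian-shaped log-likelihoods centered on index 50.
theta = np.arange(101.0)
log_lik = np.tile(-(theta - 50.0) ** 2 / 200.0, (10, 1))
mode, p10, p90 = posterior_summary(log_lik)
print(mode, p90 - p10)   # mode at the center; p90 - p10 is the accuracy measure
```

A smaller `rho` shortens the effective memory, which widens the `p90 - p10` spread: this is the sensitivity to the prior time-dependence assumptions noted above.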

However, the experiments also show that the estimation of the statistical parameters can be distorted by the presence of inaccurate parameters that are not included in the control vector. The system then tries to compensate for these errors and can produce parameter estimates that are even further away from their real values. Conversely, if too many parameters are included in the control vector, they may not be simultaneously controllable using the available observations. For instance, forecast and observation error scaling factors cannot be controlled together if their correlation structures are too similar. Consequently, defining the list of control parameters is still a subjective choice that needs to be considered carefully in order to find the best compromise between adjusting the largest number of uncertain parameters (to remove any possible inaccuracy in the error parameterization) and retaining the ability to control them effectively through the innovation sequence. Nevertheless, the optimal adaptive algorithm presented in this paper potentially introduces useful supplementary degrees of freedom in the estimation problem, and if these are judiciously chosen, the direct control of the statistical parameters by the observations increases the robustness of the error estimates produced by the filter. Being closer to statistical optimality, the adaptive filter can thus make better use of the observational information, and be of direct benefit to atmosphere or ocean data assimilation systems.
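The identifiability limit for the two scaling factors can be seen in a tiny sketch (hypothetical matrices): when the forecast and observation error structures coincide exactly, the innovation covariance *a* **S**_{f} + *b* **R** depends only on the sum of the two scalings, so the likelihood has a flat ridge along *a* + *b* = const and cannot separate them:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 20
R = np.eye(y)                       # observation error covariance
Sf = R.copy()                       # worst case: same structure as R
d = rng.standard_normal(y)          # one innovation vector

def nll(a, b):
    """Negative log-likelihood of d for forecast scaling a and observation scaling b."""
    S = a * Sf + b * R
    L = np.linalg.cholesky(S)
    z = np.linalg.solve(L, d)
    return np.log(np.diag(L)).sum() + 0.5 * z @ z

# a*Sf + b*R = (a + b)*R here, so only the sum a + b is identifiable:
print(np.isclose(nll(1.0, 2.0), nll(2.0, 1.0)))   # True
```

As the structures of **S**_{f} and **R** become more distinct (e.g., different correlation length scales), the ridge curves and the two factors become jointly controllable, as in the experiments with correlation models B and C.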

## Acknowledgments

This work was conducted as part of the MERSEA and MyOcean projects funded by the EU (Grants AIP3-CT-2003-502885 and FP7-SPACE-2007-1-CT-218812-MYOCEAN), with additional support from CNES. The calculations were performed using HPC resources from GENCI-IDRIS (Grant 2009-011279).

## REFERENCES

Anderson, J. L., 2007: An adaptive covariance inflation error correction algorithm for ensemble filters. *Tellus*, **59A**, 210–224.

Anderson, J. L., 2009: Spatially and temporally varying adaptive covariance inflation for ensemble filters. *Tellus*, **61A**, 72–83.

Blanchet, I., C. Frankignoul, and M. Cane, 1997: A comparison of adaptive Kalman filters for a tropical Pacific Ocean model. *Mon. Wea. Rev.*, **125**, 40–58.

Brankart, J-M., C. Ubelmann, C-E. Testut, E. Cosme, P. Brasseur, and J. Verron, 2009: Efficient parameterization of the observation error covariance matrix for square root or ensemble Kalman filters: Application to ocean altimetry. *Mon. Wea. Rev.*, **137**, 1908–1927.

Cohn, S. E., 1997: An introduction to estimation theory. *J. Meteor. Soc. Japan*, **75**, 257–288.

Cosme, E., J-M. Brankart, J. Verron, P. Brasseur, and M. Krysta, 2010: Implementation of a reduced rank, square-root smoother for high resolution ocean data assimilation. *Ocean Modelling*, in press.

Daley, R., 1991: *Atmospheric Data Analysis*. Cambridge University Press, 457 pp.

Daley, R., 1992: Estimating model-error covariances for application to atmospheric data assimilation. *Mon. Wea. Rev.*, **120**, 1735–1746.

Dee, D., 1995: Online estimation of error covariance parameters for atmospheric data assimilation. *Mon. Wea. Rev.*, **123**, 1128–1145.

Evensen, G., and P. J. van Leeuwen, 1996: Assimilation of Geosat altimeter data for the Agulhas Current using the ensemble Kalman filter with a quasi-geostrophic model. *Mon. Wea. Rev.*, **124**, 85–96.

Hoang, S., R. Baraille, O. Talagrand, X. Carton, and P. De Mey, 1997: Adaptive filtering: Application to satellite data assimilation in oceanography. *Dyn. Atmos. Oceans*, **27**, 257–281.

Le Provost, C., and J. Verron, 1987: Wind-driven mid-latitude circulation—Transition to barotropic instability. *Dyn. Atmos. Oceans*, **11**, 175–201.

Lermusiaux, P., 2007: Adaptive modelling, adaptive data assimilation and adaptive sampling. *Physica D*, **230**, 172–196.

Li, H., E. Kalnay, and T. Miyoshi, 2009: Simultaneous estimation of covariance inflation and observation errors within an ensemble Kalman filter. *Quart. J. Roy. Meteor. Soc.*, **135**, 523–533.

Maybeck, P. S., 1979: *Stochastic Models, Estimation and Control*. Vol. 1. Academic Press, 423 pp.

Mitchell, H. L., and P. L. Houtekamer, 2000: An adaptive ensemble Kalman filter. *Mon. Wea. Rev.*, **128**, 416–433.

Pham, D. T., J. Verron, and M. C. Roubaud, 1998: Singular evolutive extended Kalman filter with EOF initialization for data assimilation in oceanography. *J. Mar. Syst.*, **16**, 323–340.

Von Mises, R., 1964: *Mathematical Theory of Probability and Statistics*. Academic Press, 694 pp.

Wahba, G., D. R. Johnson, F. Gao, and J. Gong, 1995: Adaptive tuning of numerical weather prediction models: Randomized GCV in three- and four-dimensional data assimilation. *Mon. Wea. Rev.*, **123**, 3358–3369.

Wang, X., and C. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158.

## Figure captions

Fig. 1. Dynamic height (m) snapshots corresponding to year 21 on 10, 20, and 30 Oct.

Fig. 2. Simulated observation noise (m) corresponding (left to right) to correlation models A, B, and C with *σ* = 0.2 m and ℓ = 5 grid points (for the last two models).

Fig. 3. (left) Likelihood function for the scaling factor of the forecast error covariance matrix, as a function of the number of input innovation vectors (1, 3, 10, and 100). (right) Mode of the posterior probability, together with percentiles 0.1 and 0.9 (thin dashed lines), as a function of the number of innovations (in abscissa).

Fig. 4. Mode (solid line) and percentiles 0.1 and 0.9 (thin dashed lines) of the posterior probability distribution for the forecast error covariance scaling factor, as estimated from the last 50 observations of a signal with nonconstant statistics. The dotted line represents the true scaling factor.

Fig. 5. (top) Estimated scaling factor for the forecast error covariance matrix of the model-simulated signal (thick lines). Associated root-mean-square error (only shown for the first 20 yr) for (middle) altimetry and (bottom) velocity, as compared to another simulation (thin lines) that is performed without adaptivity (i.e., with fixed

Fig. 6. (top) Estimated scaling factor for the observation error covariance matrix; the dashed lines represent percentiles 0.1 and 0.9, and the thin line represents the solution obtained with an incorrect forecast error covariance scaling. Associated rms error for (middle) altimetry and (bottom) velocity; the dashed line represents the error standard deviation as estimated by the filter, and the thin lines represent the solution obtained without the adaptive mechanism.

Fig. 7. Likelihood function for the correlation length scale (in grid points), as a function of the number of input innovation vectors (1, 3, 10, and 100). The result is shown for correlation model B with (left) ℓ = 2 and (middle) 5 grid points and (right) correlation model C with ℓ = 5 grid points.

Fig. 8. Estimated scaling factors for (top) the forecast error covariance matrix and (middle) the observation error covariance matrix. The three lines correspond to experiments performed with correlation models A (solid lines), B (dashed lines), or C (dotted lines) for the observation error. (bottom) Associated rms error for altimetry (solid line) and error standard deviation as estimated by the filter (dotted line). The smallest error (about 4 cm, similar to Fig. 6) is obtained using correlation model A and the largest error (about 8 cm) using correlation model C.

Fig. 9. Likelihood function for parameters *α* (*x* axis) and *β* (*y* axis), based on the first two innovation vectors. Experiments performed using correlation model (left) A, (middle) B, or (right) C for the observation error.

Citation: Monthly Weather Review 138, 3; 10.1175/2009MWR3085.1