1. Introduction
In Kalman filters, the accuracy of the estimated error covariances depends closely on the quality of the assumptions about model error and observation error statistics. Inaccurate parameterization may even lead to filter divergence, with error estimates becoming grossly inconsistent with the real error (Maybeck 1979; Daley 1991). To avoid this divergence, one recognized solution (adaptive filtering) is to identify the uncertain parameters in the model or observation error statistics and to adjust them using the actual differences between forecasts and observations (Daley 1992; Dee 1995; Wahba et al. 1995; Blanchet et al. 1997; Hoang et al. 1997; Wang and Bishop 2003; Lermusiaux 2007; Li et al. 2009). In particular, if the forecast and observation error probability distributions can be assumed Gaussian, a possible solution (proposed by Dee 1995) is to compute the maximum likelihood estimate of the adaptive parameters given the current innovation vector. This strategy is used and further developed by Mitchell and Houtekamer (2000) and Anderson (2007, 2009) in the more specific context of the ensemble Kalman filter. This line of thought is also followed in the present study to compute optimal estimates of adaptive statistical parameters.
A major difficulty with this kind of method is that, in general, the computational complexity of the parameter estimation is several times larger than that of the estimation of the system state vector (i.e., than the classic observational update of the Kalman filter). The reason is that, in Kalman filters, the optimal state estimate is linear in the observation vector (of size y), whereas the optimal parameter estimate is intrinsically nonlinear in the observation vector, so that the optimal solution must be computed iteratively (for instance using a downhill simplex method to find the maximum of the likelihood function, as in Mitchell and Houtekamer 2000). A first objective of this paper is to show that there exist nonetheless a few important types of parameters for which a maximum likelihood optimal estimate can be computed at a numerical cost that is asymptotically negligible (for large y) with respect to that of the standard Kalman filter observational update. Second, taking advantage of this small additional computational complexity, the method is extended to condition the current parameter estimation on the full sequence of past innovations, which amounts to solving an additional (nonlinear) filtering problem for the unknown statistical parameters.
Furthermore, in square root or ensemble Kalman filters, the forecast error covariance matrix is always available in square root form, making it possible to use a modified observational update algorithm [proposed by Pham et al. (1998) as one of the essential elements defining the singular evolutive extended Kalman (SEEK) filter algorithm] whose computational complexity is linear in the number of observations, instead of cubic as in the standard formula. As originally formulated, this modified algorithm requires that the observation error covariance matrix be diagonal, but solutions exist to preserve its numerical efficiency (linearity in y) in the presence of observation error correlations, as shown by Brankart et al. (2009), who also give a detailed comparison of the modified and original algorithms. In the present paper, we first show in section 2 how the optimal adaptive filtering problem described above can be formulated in the framework of this modified square root algorithm. It is indeed in this framework that optimal parameter estimates can be computed at negligible additional numerical cost. This is shown in section 3, where the discussion focuses on the few types of parameters for which such computational efficiency is possible. These important parameters are (i) scaling factors for the forecast error covariance matrix, (ii) scaling factors for the observation error covariance matrix, and (iii) scaling factors for the observation error correlation length scale.
In section 4, this adaptive filter is applied to the problem of estimating the evolution of an ocean mesoscale signal using observations of the ocean dynamic topography. To demonstrate the behavior of the adaptive mechanism, idealized experiments are performed, in which the reference signal (the truth of the problem) is generated by a primitive equation ocean model and sampled to produce synthetic observations with known error statistics. In that way, it is possible to check that the method is able to produce accurate parameter estimates and to explore the ill-conditioned situations (inappropriate prior assumptions or uncontrollability of the parameters) in which adaptivity can be misleading.
2. Formulation of the problem
a. Nonadaptive statistics



























The computational complexity of the observational update is also structurally modified by the transformation. In the linear observational update algorithm, the computational cost mainly results from the dependence between the vector components. In the presence of correlation, optimally weighting the forecast and observational information indeed requires the inversion of a full matrix (with computational complexity proportional to the cube of the size of the matrix). The main difference introduced by the transformation is thus that, in Eq. (5), the inversion is performed in the observation space, while in Eq. (9), it is performed in the state space [the cost of the inversion of 𝗥 is here assumed negligible, as shown by Brankart et al. (2009), for large classes of observation error correlation models]. Moreover, if 𝗣kf is rank deficient, the size of the matrix Γk in (9) is not the size of the state vector, but the rank of 𝗣kf, which is given by the number of independent columns in 𝗦kf (i.e., the error modes or the ensemble members). For such problems, the transformation in (7) is here introduced as a simple way of obtaining directly the reduced observational update formulas, which are otherwise deduced from (4) and (5) using the Sherman–Morrison–Woodbury formula (Pham et al. 1998). This transformed algorithm is particularly efficient for reduced-rank problems involving many observations, which are quite common in atmospheric and oceanic applications of square root or ensemble Kalman filters. The key property of the algorithm is indeed that its computational complexity is linear in the number y of observations, making it possible to deal efficiently with large observation vectors [see Brankart et al. (2009) for a detailed comparison of the computational complexity of the transformed versus the original algorithm].
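To make this cost structure concrete, here is a minimal NumPy sketch of a reduced-rank observational update of this type. It is a hedged illustration, not the paper's exact algorithm: it assumes a diagonal observation error covariance R = diag(r_diag) and a forecast covariance given in square root form Pf = S Sᵀ, and all function and variable names are illustrative.

```python
import numpy as np

def reduced_rank_update(xf, S, H, r_diag, y):
    """SEEK-type observational update with cost linear in the number of observations.

    A minimal sketch, assuming R = diag(r_diag) and Pf = S @ S.T,
    where the columns of S are the error modes (or ensemble perturbations).
    """
    delta = y - H @ xf                     # innovation vector (size y)
    Y = H @ S                              # modes mapped to observation space (y x r)
    Yw = Y / r_diag[:, None]               # R^{-1} Y, cheap because R is diagonal
    r = S.shape[1]
    gamma = np.linalg.inv(np.eye(r) + Y.T @ Yw)   # r x r: the inversion is in mode space
    xa = xf + S @ (gamma @ (Yw.T @ delta))        # analysis state
    return xa, gamma
```

With y observations and r modes, the dominant term Yᵀ(R⁻¹Y) costs O(yr²), linear in y, in contrast with the O(y³) inversion of the full innovation covariance in the classic formulation.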
b. Adaptive statistics
In the filtering problem described in section 2a, optimal observational updates can only be obtained if the forecast and observation error covariance matrices 𝗣kf and 𝗥k are accurate. However, in realistic atmospheric or oceanic applications, both matrices depend on inaccurate parameters. On the one hand, in the forecast step, modeling the time dependence of the errors usually requires questionable parameterizations (e.g., to account for errors in the dynamical laws governing the time evolution of the system). On the other hand, accurate observation error statistics are not always available. This is especially true for representation errors [difference between the truth of the problem and the real world, see Cohn (1997)], which can occur for instance if the state vector of the problem only contains a subrange of the scales that are present in the real system. In addition, the computation of the covariance matrices 𝗣kf and 𝗣ka never involves the observations yk themselves, so that no feedback using differences with respect to the real world is possible; consequently, any inaccuracy in the parameterization of the observation or system noises can lead to instability of the error statistics produced by the filter. This is a well-known effect in Kalman filters, which is usually circumvented (in atmospheric and oceanic applications) by adjusting uncertain parameters in the forecast or observation error covariance matrices using statistics (variance or covariance) of the innovation sequence (adaptive Kalman filters). See for instance Dee (1995) for a more precise justification.
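For reference, the textbook identity underlying all such innovation-based tuning (a standard Kalman filter result, stated here for orientation rather than taken from the omitted equations) is that, when the filter statistics are correct, the innovation has zero mean and covariance

$$\mathbf{d}_k=\mathbf{y}_k-\mathbf{H}_k\mathbf{x}_k^f,\qquad E\!\left[\mathbf{d}_k\mathbf{d}_k^{T}\right]=\mathbf{H}_k\mathbf{P}_k^f\mathbf{H}_k^{T}+\mathbf{R}_k,$$

so that systematic discrepancies between observed and predicted innovation statistics can be attributed to misspecified parameters in 𝗣kf or 𝗥k.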














This joint state and parameter estimation problem can even be solved without the assumptions in (14) and (15) as soon as the state observational update is performed by a separate application of Eq. (4) for every member of the ensemble [as proposed by Evensen and van Leeuwen (1996) for the ensemble Kalman filter]. In that case, the prior non-Gaussian forecast probability distribution given by the integral in (13) can be simulated by randomly drawing a different parameter vector from pka(αk, βk) to perform the update of each member. With that scheme, uncertainty in the parameters of the prior distribution can thus be explicitly taken into account, with asymptotic convergence of the prior distribution to the integral in (13) for large ensemble size. However, using (13) instead of (14) may not be appropriate if pka(αk, βk) is not very accurate, for instance if the dispersion of the parameters (second-order moment) is not correctly simulated. Inaccuracy of the second-order moment is indeed the very reason why adaptivity is needed in the filter in (2) for the system state vector. Since this cannot be repeated for the parameter filter in (16), the use of the best parameter estimates α*k and β*k with Eqs. (14) and (15) can also be viewed as a closure that prevents inaccuracies in the second-order moments of pka(αk, βk) from directly affecting the observational update of the state vector.
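As an illustration of the per-member scheme described above, the following sketch draws a different parameter vector for each ensemble member before its update. It is a hedged outline: draw_parameters and kalman_update are hypothetical callbacks standing in for a sampler of pka(αk, βk) and for the single-member update of Eq. (4).

```python
def per_member_update(members, y_obs, draw_parameters, kalman_update, rng):
    """Per-member update with parameter draws, sketching the scheme above.

    draw_parameters(rng) and kalman_update(x, y_obs, alpha, beta) are
    hypothetical callbacks: a sampler of the parameter posterior
    p_k^a(alpha, beta) and the single-member observational update.
    """
    updated = []
    for x in members:
        alpha, beta = draw_parameters(rng)  # a different draw for each member
        updated.append(kalman_update(x, y_obs, alpha, beta))
    return updated
```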
3. Efficient adaptive parameter estimates
a. Constant parameters






b. Nonconstant parameters




In addition, the solution of Dee (1995) is written without the first term in the cost function in (22) or (24), which corresponds to computing the maximum likelihood estimator of the parameters (as defined, e.g., in Von Mises 1964, chapter 10). The parameters α*k and β*k are then said to maximize the likelihood of the observed innovation sequence (i.e., the conditional probability of the innovation sequence for given parameters). This is useful in the absence of reliable prior information on the parameters. With the parameterization in (23), this initial information is in any case progressively forgotten with time, as shown by the exponentially decreasing factor fᵏ in the cost function in (24).
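For orientation, since Eqs. (22) and (24) are not reproduced here, a cost function consistent with the above description would take (up to constants, and as a hedged reconstruction rather than the paper's exact expression) the form

$$J_k(\alpha,\beta)\;=\;-\ln p_0(\alpha,\beta)\;+\;\tfrac{1}{2}\sum_{k'\le k} f^{\,k-k'}\Big[\ln\det\mathbf{C}_{k'}(\alpha,\beta)+\boldsymbol{\delta}_{k'}^{T}\,\mathbf{C}_{k'}^{-1}(\alpha,\beta)\,\boldsymbol{\delta}_{k'}\Big],$$

where δk′ is the innovation vector, 𝗖k′ its covariance, and 0 < f ≤ 1 the forgetting exponent; dropping the first (prior) term yields the maximum likelihood estimator discussed above.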
c. Evaluation of the cost function
Minimizing the cost function in (22) or (24) with respect to the parameters requires an iterative method, and thus the possibility of evaluating Jk for the successive iterates of the parameters α, β (for the sake of simplicity, subscript k is removed for the parameters; it is now implicit that they are computed for the current cycle k). The main difficulty with the expressions in (22) or (24) is that the evaluation of Jk requires the computation of the inverse and determinant of the covariance matrix 𝗖k′(α, β) for the full sequence k′ ≤ k of previous innovation vectors, and this is needed for all successive iterates of α, β (let p be the number of iterates needed to reach the minimum with sufficient accuracy). Even if we truncate the innovation sequence to the last K innovations, this corresponds to a computational complexity proportional to pKy³ (leading behavior for large y), just to compute the optimal parameters α* and β* at time tk [i.e., a factor pK with respect to the computational complexity of the observational update (4) and (5), or more precisely, with respect to the leading component (proportional to y³) of this computational complexity for large observation vectors]. This large computational cost explains why using K = 1 (as in Dee 1995) is the only affordable solution (with this classic observational update algorithm) to compute optimal adaptive parameters in realistic atmospheric or oceanic assimilation systems. But, even in this special case, the computational complexity of the parameter estimation is still a factor p larger than that of the estimation of the state vector with Eqs. (4) and (5).
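The scaling argument can be read directly off a brute-force implementation: each iterate of the parameters must redo an O(y³) factorization per innovation vector. A hedged sketch, with C_of a hypothetical constructor of 𝗖k′(α, β):

```python
import numpy as np

def cost_naive(innovations, C_of, alpha, beta):
    """Brute-force evaluation of the innovation terms of the cost function.

    innovations: list of K innovation vectors of size y.
    C_of(k, alpha, beta): hypothetical constructor of the y x y matrix C_k'(alpha, beta).
    Each term needs a log-determinant and a solve, both O(y^3), hence the
    overall O(p*K*y^3) cost when the minimization takes p iterates.
    """
    J = 0.0
    for k, d in enumerate(innovations):
        C = C_of(k, alpha, beta)
        _, logdet = np.linalg.slogdet(C)          # O(y^3)
        J += logdet + d @ np.linalg.solve(C, d)   # O(y^3)
    return J
```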



















Yet, the transformed Eqs. (28) and (31) do not by themselves provide an efficient way to compute the cost function as a function of the parameters, since the computation of δk′(α, β), Γk′(α, β), and then Λk′(α, β) and 𝗨k′(α, β) is required for the full sequence of previous innovation vectors k′ ≤ k, and for all successive iterates of (α, β) (needed to minimize the cost function); that is, there is still a factor pK with respect to the computational complexity of the state observational update (8) and (9), or more precisely, with respect to the leading component (proportional to y) of this computational complexity for large observation vectors. However, with the transformed equations in (28) and (31), there are classes of parameters α, β for which the additional computational complexity of the cost function minimization is either negligible with respect to the state observational update, or at least independent of the number y of observations. These classes of parameters are presented in sections 3d–f. Such simplifications are not possible with the original expressions in (22) or (24), because the matrix 𝗖k(α, β) is the sum of two matrices, one depending on α and the other on β, whose inverse and determinant must therefore be computed explicitly for every iterate of α and β.
d. Scaling of the forecast error covariance matrix







However, when estimating the inflation parameter α, it is important to keep in mind that nothing prevents the maximum likelihood estimator of α from being negative, which means that innovations are too small to explain the observation error covariance 𝗥k′ alone, so that the adaptive scheme is attempting to reduce 𝗖k′(α) by subtracting
e. Scaling of the observation error covariance matrix


















Finally, it is important to keep in mind that, in order to attempt the joint control of observation and forecast error covariance scaling factors α and β, we must be certain that both parameters can be simultaneously identified through the innovation sequence. [See Li et al. (2009), who also propose a method to control these two parameters.] This is for instance clearly impossible if observation and forecast errors have identical covariance structures (in the observation space), because then parameters α and β play exactly the same role in the innovation covariance matrix in (25). In this ill-conditioned situation, parameters α and β are not jointly controllable whatever the number of observations or the length of the innovation sequence.
f. Observation error correlation length scale





































4. Demonstration experiments
The purpose of this section is to demonstrate how the optimal adaptive algorithm described in the previous sections can be used in practice. As an example application, we consider the problem of estimating the long-term evolution of a model-simulated ocean mesoscale signal from synthetic observations of the ocean dynamic topography. To concentrate on the behavior of the adaptive algorithm, it is assumed that the only source of information to solve this estimation problem comes from these synthetic observations. The dynamical laws governing the ocean flow are not exploited to constrain the solution. Only simplified assimilation experiments are thus performed, in which the ocean model operator 𝗠 is set to zero (total ignorance) or to the identity (persistence). In that way, a clear diagnostic of the adaptive mechanism can be obtained, without the results being obscured by unverifiable interactions with a complex nonlinear ocean model. Several examples are shown to illustrate the control of the forecast error covariance scaling (section 4b), the observation error covariance scaling (section 4c), and the correlation length scale (section 4d), before attempting the joint control of all these parameters (section 4e). But before that, we describe how the reference mesoscale signal and the synthetic observations are generated (section 4a).
a. Description of the experiments
The reference mesoscale signal is simulated using a primitive equation model of an idealized square, flat-bottomed, 5000-m-deep ocean at midlatitudes (between 25° and 45°N). In this square basin, a double-gyre circulation is created by a constant zonal wind forcing blowing westward in the northern and southern parts of the basin and eastward in the middle part of the basin, with sinusoidal latitude dependence: τ = −τ0 cos[2π(λ − λmin)/(λmax − λmin)], with τ0 = 0.1 N m⁻². The western intensification of these two gyres produces western boundary currents that feed an eastward jet in the middle of the square basin (see the resulting mean dynamic height in Fig. 1). This jet is unstable (Le Provost and Verron 1987), so that the flow is dominated by chaotic mesoscale dynamics, with the largest eddies ∼100 km wide, corresponding to velocities of ∼1 m s⁻¹ and dynamic height differences of ∼1 m (see the resulting dynamic height standard deviation in Fig. 1). All this is very similar in shape and magnitude to what is observed in the Gulf Stream (North Atlantic) or the Kuroshio (North Pacific).
The time evolution of this chaotic system is computed using the Nucleus for European Modelling of the Ocean (NEMO) numerical ocean model, with a horizontal resolution of ¼° × ¼° cosλ and 11 levels in the vertical (see Cosme et al. 2010 for more detail about the model configuration). The three main physical parameters governing the dominant characteristics of the flow are the stratification, the bottom friction, and the horizontal viscosity. The model is started from rest with uniform stratification and can be considered to reach equilibrium statistics after 20 yr of simulation. In this paper, we thus concentrate on the estimation of the 100-yr signal from years 21 to 120. Moreover, we focus our study on a limited subdomain in the middle of the jet (about 650 × 650 km, as shown by the black square in Fig. 1) with intense and quite homogeneous mesoscale activity. It is also assumed that the reference simulation is known with a time resolution of one snapshot every 10 days. Figure 2 shows, for instance, the simulated dynamic height in October of year 21; this 10-day resolution is clearly sufficient to observe the slow westward (upstream) motion of the main eddies.
To estimate the time evolution of this mesoscale flow, we assume that ocean altimetry is observed every 10 days at model resolution. Without a dynamical model to constrain the estimation problem, it is indeed important to have observations with sufficient horizontal coverage. However, in order to generate the synthetic observations, an artificial observational noise is added to the reference simulation. This noise is meant to simulate the measurement and representation errors that always exist in a real observation dataset. In our system, the representation error mainly includes the subgrid-scale eddies that are not resolved by the model discretization. This component of the observation error is thus often dominant and poorly known, which justifies adjusting its main statistics (variance and correlation length scale) using the available observations. In this study, three kinds of correlation model are used to simulate the observational noise: (A) uncorrelated errors, (B) an exponential decorrelation function, ρ(r) = exp(−r/ℓ), and (C) a smooth noise correlation model, ρ(r) = (r/ℓ)K1(r/ℓ), where K1 is the modified Bessel function of the second kind of order 1. The last two models have the property that they can be efficiently parameterized using a diagonal observation error covariance matrix in an augmented observation space [as in Eq. (38)]. Model B just requires including gradient observations (with error standard deviation σ1 = σ0/ℓ), while model C requires including both gradient and curvature (with error standard deviations σ1 = σ0/ℓ
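For concreteness, the three correlation models can be written as follows. This is a sketch, not part of the paper's implementation: scipy's kv is the modified Bessel function of the second kind, and the r = 0 case uses the limit (r/ℓ)K1(r/ℓ) → 1.

```python
import numpy as np
from scipy.special import kv   # modified Bessel function of the second kind

def rho_A(r):
    """Model A: uncorrelated errors."""
    return np.where(np.asarray(r) == 0, 1.0, 0.0)

def rho_B(r, l):
    """Model B: exponential decorrelation, rho(r) = exp(-r/l)."""
    return np.exp(-np.asarray(r, dtype=float) / l)

def rho_C(r, l):
    """Model C: smooth correlation, rho(r) = (r/l) K1(r/l), with rho(0) = 1."""
    x = np.asarray(r, dtype=float) / l
    x_safe = np.where(x == 0, 1.0, x)          # avoid K1(0) = inf; value replaced below
    return np.where(x == 0, 1.0, x_safe * kv(1, x_safe))
```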
b. Control of the forecast error covariance scaling
The first set of experiments is dedicated to the estimation of a single global scaling factor α for the forecast error covariance matrix, by applying the method described in section 3d. In these first experiments, observation errors are uncorrelated, with standard deviation σ = 0.2 m, and the corresponding observation error covariance matrix is assumed perfectly known: 𝗥 = σ²𝗜. Problems with a constant and nonconstant parameter α are successively considered.
1) Gaussian random signal
As a first illustration of the adaptive mechanism, let us consider the problem of estimating a sequence of random and independent draws from the Gaussian probability distribution
As a first case study, the reference sequence of draws xk is sampled from a constant Gaussian probability distribution: αref(k) = 1. From this, we can simulate a sequence of observation vectors yk as explained above, and then compute the terms of the cost function in (22) using Eq. (32). This can be done efficiently by computing once and for all δ̃k′, Λ̃, and
If the reference parameter is no longer constant, and if little is known about the time dependence of the parameter values, we can use the simple model described in section 3b to forget the oldest innovations in the computation of the current parameter best estimate. To test the method for several parameter fluctuation time scales in a single experiment, we set αref(k) = a + b sin[ω(k)k] with ω(k) = k/k1 and k1 = 10 000. This is shown in Fig. 5 (thick dotted line) for a = 1 and b = ½. The figure also shows the estimate α*(k) that is obtained using an e-folding forgetting time scale ke = 100 (i.e., a forgetting exponent f ≃ 0.99). This forgetting time scale is well adapted only when it is comparable to the reference parameter fluctuation time scale, ke ∼ 1/ω(k) (i.e., for k ∼ k1/ke = 100). Before this time index, the accuracy of the estimate could be improved by keeping a longer sequence of observations, and after this time index, it becomes better to forget the observations faster. The estimation thus suffers from the poor prior knowledge (summarized in the single parameter f ≃ 0.99) of the time dependence between parameter values. As the fluctuation frequency increases with time, the representation of the successive minima and maxima of the reference parameter time series becomes less and less accurate.
2) Model-simulated signal
Let us now consider the more realistic problem of estimating the 100-yr model-simulated mesoscale signal described in section 4a (and illustrated in Fig. 2). The initial condition is assumed perfectly known, the observation errors are still uncorrelated (with σ = 0.2 m), but here, a better model is to assume persistence: 𝗠 = 𝗜. Furthermore, assuming stationary statistics, the corresponding model error covariance matrix can be consistently parameterized using the time covariance of differences between successive model snapshots in the reference simulation:
With this parameterization, we can solve the joint filtering problem for the state of the system and for the parameter α, as explained in sections 2 and 3. In the parameter filter, we prescribe the prior probability distribution for the parameter, p0(α) = exp(−α), and the forgetting exponent f = 0.9. Figure 6 (top panel) shows the resulting estimate α*(tk), obtained as the mode of the posterior probability distribution at time tk. As expected, the covariance of the error of the persistence model is close to our prior estimate 𝗤, so that the estimated α remains close to 1. On the other hand, as explained in the previous example [section 4b(1)], the estimated accuracy of the parameter is very sensitive to the forgetting exponent (i.e., the number of innovation vectors assumed relevant to include in the current parameter estimate). This sensitivity to a subjective assumption is the very reason why using the closure in (14) and (15) is thought to be better than using the integral in (13) to simulate the forecast error probability distribution (see the explanation at the end of section 2b). In this example with very steady statistics, better parameter accuracy can be obtained by increasing the forgetting exponent f (using for instance f = 0.99 instead of f = 0.9), but this is done at the expense of a much larger numerical cost (10 times larger for f = 0.99), since a longer innovation sequence must be used for each evaluation of the cost function.
Figure 6 also illustrates the corresponding error on the state estimate, as obtained for altimetry (middle panel) and velocity (bottom panel). The figure shows the root-mean-square difference between the state estimate (after the observational update at time tk) and the reference simulation (solid thick line), as compared to the corresponding error estimate produced by the filter (the square root of the trace of the analysis error covariance matrix, dashed thick line). In the adaptive experiment (thick lines), these two curves remain quite consistent in the long term, indicating that the adaptive mechanism constrains the error statistics sufficiently to produce consistent estimates of the total error variance. This result is compared with another simulation that is performed without the adaptive mechanism (thin lines) using the fixed value
c. Control of the observation error covariance scaling
The second set of experiments is dedicated to the estimation of a single global scaling factor β for the observation error covariance matrix, by applying the method described in section 3e. In these experiments, observation errors are still uncorrelated, with a standard deviation of σ = 0.2 m, but the scaling of the observation error covariance matrix is assumed unknown: 𝗥 = β
Figure 7 (top panel, thick lines) shows the resulting estimate β*(tk) that is obtained as the mode of the posterior probability distribution at time tk, together with percentiles 0.1 and 0.9. As expected, the optimal estimate β*(tk) quickly converges toward the correct value β = 1, with an associated error decreasing to zero as more observations become available. Old observations are indeed never forgotten since the parameter is assumed constant. The figure also illustrates the corresponding error on the state estimate (as in Fig. 6), showing that we obtain the same result as in Fig. 6 as soon as parameter β becomes sufficiently accurate. This result is compared with another simulation that is performed without the adaptive mechanism (thin lines) using the inaccurate value β = 0.5. We observe again that, without adaptive statistics, the error estimate remains inconsistent, so that the resulting nonoptimal scheme can only produce larger errors, together with badly estimated error variance (underestimated in this example).
On the other hand, it is also interesting to investigate what happens if we control the scaling factor β alone, in the presence of an inaccurate forecast error covariance scaling. For that purpose, we simply redo the same experiment with 𝗣kf = α𝗤, with constant α = ¼. The resulting estimate β*(tk) is also shown in Fig. 7 (top panel, thin line). As can be observed, β*(tk) quickly diverges from the correct value β = 1, because the adaptive mechanism tries to compensate for the underestimation of the forecast error variance by overestimating the observation error, in order to account for the total innovation signal. It is thus important to stress again that the method assumes that all statistical parameters not included in the control vector are accurately known. Inconsistency in the prior statistical parameterization can easily lead to grossly incorrect adaptive parameters.
d. Control of the correlation length scale
The third set of experiments is dedicated to the estimation of the correlation length scale from a random sample of a Gaussian distribution, by applying the method described in section 3f. In these experiments, it is assumed that this Gaussian distribution is characterized by a zero mean and a covariance 𝗥. This application thus corresponds to the problem described in section 3f with the simplification that α = 0 (i.e., 𝗥 is the only remaining term in the covariance 𝗖 of the random vector), so that we can directly apply Eqs. (47) and (48) to estimate the signal variance and/or correlation length scale.
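As a cross-check of what the evaluation through Eqs. (47) and (48) computes, a brute-force likelihood scan over the length scale can be sketched as follows. This is a hedged illustration, not the paper's efficient algorithm: build_R is a hypothetical constructor of the covariance σ²ρ(r/ℓ) on the observation grid.

```python
import numpy as np

def scan_length_scale(samples, l_grid, build_R):
    """Gaussian log-likelihood of zero-mean samples for each candidate length
    scale, returned as a likelihood curve scaled so its maximum equals 1
    (the same normalization as in Fig. 8).

    samples: list of observed vectors; build_R(l): hypothetical covariance builder.
    """
    logL = np.empty(len(l_grid))
    for i, l in enumerate(l_grid):
        R = build_R(l)
        _, logdet = np.linalg.slogdet(R)
        quad = sum(d @ np.linalg.solve(R, d) for d in samples)
        logL[i] = -0.5 * (len(samples) * logdet + quad)
    return np.exp(logL - logL.max())
```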
As an illustration, let us consider the problem of estimating the correlation length scale of the observational noise described in section 4a (and illustrated in Fig. 3) from samples of various sizes. In this experiment, it is assumed that the noise standard deviation σ = 0.2 m and the correlation structure of model B [ρ(r) = exp(−r/ℓ)] or model C [ρ(r) = (r/ℓ)K1(r/ℓ)] are known, so that the correlation length scale ℓ is the only parameter that must be estimated. Moreover, we have already said that these two correlation structures can be consistently parameterized using the parameterization in (39) with a transformation operator 𝗧 that includes the gradient for model B, or the gradient and curvature for model C. From this operator, it is easy to evaluate the eigenvalues μl of 𝗧𝗧T, and the functions f(ℓ) and g(ℓ) using Eq. (43). Figure 8 illustrates the resulting likelihood function L(ℓ) for the correlation length scale ℓ, as computed using Eq. (48), for several sizes k of the sample. (The figure is scaled so that the maximum is always equal to 1.) The figure shows that, for correlation model B, the likelihood function narrows around the correct value ℓ = 2 grid points (left panel) or ℓ = 5 grid points (middle panel) as more observations become available, which indicates that the adaptive mechanism presented in this paper is able to control the correlation length scale of a random signal, as soon as a correct assumption about the correlation structure can be formulated. Concerning correlation model C, also illustrated in Fig. 8, there is a discrepancy between the true correlation length scale (ℓ = 5 grid points) and the estimated value (ℓ ∼ 3.2 grid points). This problem results from an approximate evaluation of the function g(ℓ) in Eq. (48), for which we assumed an infinite domain (whereas our real domain is not very large with respect to the correlation length scale). This difficulty does not arise if the adaptive parameters are β0 and β1 (instead of σ² and ℓ), but it points out the sensitivity of the estimation to inaccuracies in the representation of the correlation structure.
e. Joint control of forecast and observation error parameterizations
Finally, in a fourth set of experiments, we solve a more general assimilation problem in which two of the adaptive parameters considered in the previous sections are controlled together: a parameter α to adjust the scaling of the forecast error covariance matrix, and a parameter β to adjust the scaling of the observation error covariance matrix. For that purpose, we use observations with uncorrelated or correlated errors (σ = 0.2 m and ℓ = 5 grid points) using correlation model A, B, or C, as illustrated in Fig. 3. In addition, in these experiments, the structure of the observation error covariance matrix is assumed to be known: it is given by Eq. (39) with an operator 𝗧 consistent with the real correlation model (A, B, or C), and with diagonal matrices 𝗥0 = σ0²𝗜 and 𝗥1 = σ1²𝗜 parameterized in such a way that the true value of the global scaling parameter is β = 1. As prior probability distribution for the parameters, we use p0(α, β) = exp[−(α + β)], that is, independent prior distributions with exponential probability density for each parameter. As in section 4b(2), the parameters are not assumed constant in time, and the same forgetting factor f = 0.9 is used. With these assumptions, we can solve the joint filtering problem for the state of the system and for the parameters α and β. In particular, the optimal parameter estimates are obtained at each time tk by minimizing the cost function whose components are given by Eqs. (33) and (34) as a function of the parameters α and β (with yk′ equal to the number of real observations, i.e., excluding the fictitious gradient or curvature observations if any).
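The two-parameter minimization itself can be carried out with any derivative-free method, for instance the downhill simplex mentioned in the introduction. A sketch using scipy, where cost_fn is a hypothetical callable evaluating the joint cost of (α, β):

```python
import numpy as np
from scipy.optimize import minimize

def estimate_alpha_beta(cost_fn, alpha0=1.0, beta0=1.0):
    """Minimize the joint cost over (alpha, beta) with a downhill simplex
    (Nelder-Mead), the method cited from Mitchell and Houtekamer (2000);
    cost_fn(alpha, beta) is a hypothetical callable returning J_k."""
    res = minimize(lambda p: cost_fn(p[0], p[1]),
                   x0=np.array([alpha0, beta0]),
                   method="Nelder-Mead")
    return res.x  # (alpha*, beta*)
```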
Figure 9 (top two panels, thick lines) shows the resulting estimates α*(tk) and β*(tk), obtained as the mode of the posterior probability distribution at time tk. As expected, for all correlation models A, B, and C, the optimal forecast error covariance scaling α*(tk) converges toward the same solution as in section 4b(2), and the optimal observation error covariance scaling β*(tk) converges approximately toward the correct value β = 1. Figure 9 (bottom panel) also illustrates the corresponding error on altimetry. For correlation model A, the result is similar to the solution shown in Fig. 6 (obtained with known β), whereas for the two other correlation models, B and C, the error is larger, as a result of the lower quality of the observations that are assimilated (correlated instead of uncorrelated observation error). But in any case, the error estimate produced by the adaptive filter (dotted thick line) is always a consistent estimate of the real error standard deviation. Without adaptivity (not shown in the figure), the error estimate can become grossly inconsistent (as explicitly shown in the examples of sections 4b and 4c), with the same negative consequences on filter optimality.
Finally, in order to characterize more completely the knowledge that is acquired about parameters α and β, Fig. 10 represents their joint likelihood function (based on the first two innovation vectors), as obtained in the three experiments (i.e., with correlation model A, B, or C for the observation error). The first thing that can be observed is that β is diagnosed with better accuracy than α, as a consequence of the larger number of degrees of freedom in the observation error as compared to the forecast error (see Figs. 2 and 3). Second, in these experiments, the slope of the first principal axis of the sensitivity ellipse happens to be only slightly negative, which indicates that, in this case study, the adaptive scheme is able to make a clear distinction between forecast and observation errors. This absence of an unstable direction with large inaccuracy (which would correspond to a representation of the total covariance that remains correct whatever the values of α and β along that line) explains why parameters α and β can be jointly identified. Again, this is linked to the very distinct structures of the forecast and observation errors (cf. Figs. 2 and 3), so that their respective variances can be easily diagnosed as soon as their respective covariance structures are known.
5. Conclusions
It is a common practice in Kalman filter applications to adjust uncertain statistical parameters by exploiting the information contained in the innovation sequence. Yet, optimal estimates can only be obtained if it is possible to evaluate the posterior probability distribution for the parameters given the innovation sequence. With the classic formulation of the Kalman filter observational update, the computational complexity of this optimization is C0 ∼ pKy³ (leading behavior for large y), where y is the size of the innovation vector, K is the number of innovation vectors, and p is the number of iterates needed to reach the optimum (i.e., a factor pK with respect to the observational update itself). This cost is obviously prohibitive for the large systems that are usually considered in atmospheric and oceanic applications, so that practitioners are compelled to develop complex nonoptimal schemes that are usually based on a clever fitting of filter statistics to innovation statistics. In this paper, it has been demonstrated that optimal parameter estimates can be computed efficiently (with a computational complexity C1 that is asymptotically negligible with respect to that of the observational update) as soon as the observational update is performed using a transformed formulation working in the reduced control space defined by the square root or ensemble representation of the forecast error covariance matrix. (This transformed formulation is also more efficient for the state observational update as soon as the dimension r of the reduced space is small with respect to the size of the observation vector: r ≪ y.) However, this level of efficiency can only be achieved for the following important parameters: scaling of the forecast error covariance matrix (C1 ∼ pKr), scaling of the observation error covariance matrix (C1 ∼ pKr), or parameters modifying the shape of the observation error covariance matrix, such as the correlation length scale (C1 ∼ pKr³). In addition, the method is based on the fundamental assumption that the probability distribution of the innovation given the parameters is Gaussian. A direct generalization to non-Gaussian distributions is possible as soon as a nonlinear change of variables (anamorphosis) can be found to transform the non-Gaussian distributions into Gaussian distributions.
Idealized experiments have been performed to demonstrate the ability of this optimal adaptive algorithm to effectively control unknown statistical parameters in addition to the state of the system. These experiments have been designed to estimate a synthetic mesoscale signal over a limited region of the ocean basin, so that the rank of the covariance matrices is always small enough to preserve the efficiency of the adaptive algorithm. This means that, in more realistic applications involving a large dynamical system, the adaptive algorithm could only remain efficient if it is used in conjunction with a covariance localization method, from which local low-rank covariance matrices can be obtained. In that way, it would become possible to compute the adaptive parameters locally, and thus to introduce even more degrees of freedom in the filter statistics. Showing how this can be done with sufficient numerical efficiency is the subject of future work.
In addition, this local region of the mesoscale flow is estimated with a simplified assimilation scheme (i.e., zero or identity model and full observation coverage), devised in such a way that the optimality of the scheme can always be easily diagnosed. The results show, first, that the adaptive mechanism is able to control the above-cited unknown statistical parameters, separately and even jointly. Second, the associated measure of the accuracy of the estimated parameters (given for instance by the difference between percentiles 0.9 and 0.1 of their marginal posterior distribution) is closely related to the number of innovation vectors that play a significant part in the estimation. It is thus quite sensitive to the prior assumptions about the time dependence of the parameters (i.e., the choice of the forgetting exponent in our parameterization). Third, the experiments demonstrate that adaptivity can significantly improve the estimation of the state of the system. Comparisons with nonadaptive examples show that statistical parameters that are left inaccurate make the system depart from optimality, with dramatic consequences for the accuracy of the estimation. Furthermore, in this case study, the application of the adaptive algorithm always produces error estimates that are more consistent with the real error. This is a direct consequence of the improved accuracy of the statistical parameters, to which the error estimates are particularly sensitive (much more than the state estimate itself).
However, the experiments also show that the estimation of the statistical parameters can be distorted by the presence of inaccurate parameters that are not included in the control vector. The system then tries to compensate for these errors and can produce parameter estimates that are even further from their real values. Conversely, if too many parameters are included in the control vector, they may not be simultaneously controllable using the available observations. For instance, forecast and observation error scaling factors cannot be controlled together if their correlation structures are too similar. Consequently, defining the list of control parameters is still a subjective choice that needs to be carefully considered in order to find the best compromise between adjusting the largest number of uncertain parameters (to remove any possible inaccuracy in the error parameterization) and the possibility of controlling them effectively through the innovation sequence. Nevertheless, the optimal adaptive algorithm presented in this paper potentially introduces useful supplementary degrees of freedom in the estimation problem, and if these are judiciously chosen, the direct control of the statistical parameters by the observations increases the robustness of the error estimates produced by the filter. Being closer to statistical optimality, the adaptive filter can thus make better use of the observational information, and be of direct benefit to atmosphere or ocean data assimilation systems.
Acknowledgments
This work was conducted as part of the MERSEA and MyOcean projects funded by the EU (Grants AIP3-CT-2003-502885 and FP7-SPACE-2007-1-CT-218812-MYOCEAN), with additional support from CNES. The calculations were performed using HPC resources from GENCI-IDRIS (Grant 2009-011279).
REFERENCES
Anderson, J. L., 2007: An adaptive covariance inflation error correction algorithm for ensemble filters. Tellus, 59A, 210–224.
Anderson, J. L., 2009: Spatially and temporally varying adaptive covariance inflation for ensemble filters. Tellus, 61A, 72–83.
Blanchet, I., C. Frankignoul, and M. Cane, 1997: A comparison of adaptive Kalman filters for a tropical Pacific Ocean model. Mon. Wea. Rev., 125, 40–58.
Brankart, J-M., C. Ubelmann, C-E. Testut, E. Cosme, P. Brasseur, and J. Verron, 2009: Efficient parameterization of the observation error covariance matrix for square root or ensemble Kalman filters: Application to ocean altimetry. Mon. Wea. Rev., 137, 1908–1927.
Cohn, S. E., 1997: An introduction to estimation theory. J. Meteor. Soc. Japan, 75, 257–288.
Cosme, E., J-M. Brankart, J. Verron, P. Brasseur, and M. Krysta, 2010: Implementation of a reduced rank, square-root smoother for high resolution ocean data assimilation. Ocean Modelling, in press.
Daley, R., 1991: Atmospheric Data Analysis. Cambridge University Press, 457 pp.
Daley, R., 1992: Estimating model-error covariances for application to atmospheric data assimilation. Mon. Wea. Rev., 120, 1735–1746.
Dee, D., 1995: Online estimation of error covariance parameters for atmospheric data assimilation. Mon. Wea. Rev., 123, 1128–1145.
Evensen, G., and P. J. van Leeuwen, 1996: Assimilation of Geosat altimeter data for the Agulhas Current using the ensemble Kalman filter with a quasi-geostrophic model. Mon. Wea. Rev., 124, 85–96.
Hoang, S., R. Baraille, O. Talagrand, X. Carton, and P. De Mey, 1997: Adaptive filtering: Application to satellite data assimilation in oceanography. Dyn. Atmos. Oceans, 27, 257–281.
Le Provost, C., and J. Verron, 1987: Wind-driven mid-latitude circulation—Transition to barotropic instability. Dyn. Atmos. Oceans, 11, 175–201.
Lermusiaux, P., 2007: Adaptive modelling, adaptive data assimilation and adaptive sampling. Physica D, 230, 172–196.
Li, H., E. Kalnay, and T. Miyoshi, 2009: Simultaneous estimation of covariance inflation and observation errors within an ensemble Kalman filter. Quart. J. Roy. Meteor. Soc., 135, 523–533.
Maybeck, P. S., 1979: Stochastic Models, Estimation and Control. Vol. 1. Academic Press, 423 pp.
Mitchell, H. L., and P. L. Houtekamer, 2000: An adaptive ensemble Kalman filter. Mon. Wea. Rev., 128, 416–433.
Pham, D. T., J. Verron, and M. C. Roubaud, 1998: Singular evolutive extended Kalman filter with EOF initialization for data assimilation in oceanography. J. Mar. Syst., 16, 323–340.
Von Mises, R., 1964: Mathematical Theory of Probability and Statistics. Academic Press, 694 pp.
Wahba, G., D. R. Johnson, F. Gao, and J. Gong, 1995: Adaptive tuning of numerical weather prediction models: Randomized GCV in three- and four-dimensional data assimilation. Mon. Wea. Rev., 123, 3358–3369.
Wang, X., and C. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. J. Atmos. Sci., 60, 1140–1158.

Fig. 1. (left) Mean and (right) standard deviation of the dynamic height (m) over the full square basin. Our region of interest is represented by the black square in the middle of the basin.

Fig. 2. Dynamic height (m) snapshots corresponding to year 21 on 10, 20, and 30 Oct.

Fig. 3. Simulated observation noise (m) corresponding (left to right) to correlation models A, B, and C with σ = 0.2 m and ℓ = 5 grid points (for the last two models).

Fig. 4. (left) Likelihood function for the scaling factor of the forecast error covariance matrix, as a function of the number of input innovation vectors (1, 3, 10, and 100). (right) Mode of the posterior probability, together with percentiles 0.1 and 0.9 (thin dashed lines), as a function of the number of innovations (in abscissa).

Fig. 5. Mode (solid line) and percentiles 0.1 and 0.9 (thin dashed lines) of the posterior probability distribution for the forecast error covariance scaling factor, as estimated from the last 50 observations of a signal with nonconstant statistics. The dotted line represents the true scaling factor.

Fig. 6. (top) Estimated scaling factor for the forecast error covariance matrix of the model-simulated signal (thick lines). Associated root-mean-square error (only shown for the first 20 yr) for (middle) altimetry and (bottom) velocity, as compared to another simulation (thin lines) that is performed without adaptivity (i.e., with fixed

Fig. 7. (top) Estimated scaling factor for the observation error covariance matrix; the dashed lines represent percentiles 0.1 and 0.9, and the thin line represents the solution that is obtained with an incorrect forecast error covariance scaling. Associated rms error for (middle) altimetry and (bottom) velocity; the dashed line represents the error standard deviation as estimated by the filter, and the thin lines represent the solution that is obtained without the adaptive mechanism.

Fig. 8. Likelihood function for the correlation length scale (in grid points), as a function of the number of input innovation vectors (1, 3, 10, and 100). The result is shown for correlation model B with (left) ℓ = 2 and (middle) ℓ = 5 grid points, and (right) for correlation model C with ℓ = 5 grid points.

Fig. 9. Estimated scaling factors for (top) the forecast error covariance matrix and (middle) the observation error covariance matrix. The three lines correspond to experiments performed with correlation models A (solid lines), B (dashed lines), or C (dotted lines) for the observation error. (bottom) Associated rms error for altimetry (solid line) and error standard deviation as estimated by the filter (dotted line). The smallest error (about 4 cm, similar to Fig. 6) is obtained using correlation model A and the largest error (about 8 cm) using correlation model C.

Fig. 10. Likelihood function for parameters α (x axis) and β (y axis), based on the first two innovation vectors. Experiments performed using correlation model (left) A, (middle) B, or (right) C for the observation error.