## Abstract

In geophysical data assimilation, observations shed light on a control parameter space through a model, a statistical prior, and an optimal combination of these sources of information. This control space can be a set of discrete parameters, or, more often in geophysics, part of the state space, which is distributed in space and time. When the control space is continuous, it must be discretized for numerical modeling. This discretization, in this paper called a representation of this distributed parameter space, is always fixed a priori.

In this paper, the representation of the control space is considered a degree of freedom on its own. The goal of the paper is to demonstrate that one could optimize it to perform data assimilation in optimal conditions. The optimal representation is then chosen over a large dictionary of adaptive grid representations involving several space and time scales.

First, to motivate the importance of the representation choice, this paper discusses the impact of a change of representation on the posterior analysis of data assimilation and its connection to the reduction of uncertainty. It is stressed that in some circumstances (atmospheric chemistry, in particular) the choice of a proper representation of the control space is essential to set the data assimilation statistical framework properly.

A possible mathematical framework is then proposed for multiscale data assimilation. To keep the developments simple, a measure of the reduction of uncertainty is chosen as a very simple optimality criterion. Using this criterion, a cost function is built to select the optimal representation. It is a function of the control space representation itself. A regularization of this cost function, based on a statistical mechanical analogy, guarantees the existence of a solution.

This allows numerical optimization to be performed on the representation of control space. The formalism is then applied to the inverse modeling of an accidental release of an atmospheric contaminant at European scale, using real data.

## 1. Introduction

Data assimilation is a set of mathematical techniques that aims at optimally combining several sources of information: data of an experimental nature that come from observation of the system, statistical information that comes from a prior knowledge of the system, and a numerical model that relates the space of observation to the space of the system state. Modern data assimilation has been carried out in meteorological operational centers or in oceanographic research centers over the last 15 yr with success and a significant improvement in the forecast skills (Le Dimet and Talagrand 1986; Lorenc 1986; Courtier and Talagrand 1990; Ghil and Malanotte-Rizzoli 1991; Courtier et al. 1994—to mention just a few of the seminal works on the topic). The ideas and methods have also percolated in atmospheric chemistry over the last 10 yr on a research basis (the example of the methodological development of this paper pertains to this field).

One characteristic of geophysical data assimilation is the huge domain in which it is meant to be applied and the very inhomogeneous distribution of fields. For numerical purposes, the physical model and its equations need to be discretized on this domain. The number of spatial variables is then close to a billion (currently). Thus, compromises must always be made in terms of the sophistication of the mathematical methods, the amount of data needed, and the control parameters (which shall also be interchangeably called variables) that can be handled.

At the confluence of these data assimilation ingredients is the choice of the resolution of the control variables space. The resolution choice has a direct impact on the number of control variables. For instance, in the context of inverse modeling, a finer resolution may lead to underdetermination. In geophysics, changing the resolution (or the scale at which the phenomena are examined) also has physical modeling consequences. Indeed, the physics that are modeled are usually scale dependent; the same process may be represented differently at different scales. Setting the resolution of a geophysical model is an intricate choice, which is full of implications. As a consequence, this choice is usually made very early in the modeling study. The implementation of (complex enough) data assimilation methods usually follows this early decision. From time to time, operational centers would change the resolution of their main model and a fresh study would be conducted on the basis of the fixed targeted resolution. For variational data assimilation purposes, two grids can be used but are fixed nevertheless; for instance, one for the inner loop and another for the outer loop in four-dimensional variational data assimilation (4DVAR).

In this paper, we would like to reverse the point of view and consider the resolution a free parameter in the implementation of data assimilation techniques.

### a. Data assimilation and the representation of control space

In this paper, we are not directly interested in the nature of the control fields or in a careful choice of uncorrelated control fields. This choice is assumed to already have been made. Because the physics are modeled through numerics, they have to be discretized and the choice of a resolution should be made. This resolution problem generalizes to the question of the representation of the control fields.

Note that in numerical simulation (no assimilated observations) the issue of choosing a proper representation (e.g., an adaptive grid) is well studied. It allows speeding up the simulations and refines the computations in which the numerical error could be large (see, e.g., Saad 2003). An example in geophysics, and in particular air quality, can be found in Constantinescu et al. (2008).

However, in data assimilation, control variables and observations are equally important components. Because they do not share the same representation space but are nevertheless combined in the data assimilation process, the optimal representation problem is much less simple there.

Note that the control space representation can be different from the forward model (state) space discretization, even if the control space is a subspace of the state space. For instance, their resolutions may differ resulting from use of regular meshes of the same underlying domain.

### b. Can the representation be chosen?

A fundamental issue is to decide how control fields should be discretized. In particular, can we choose, in a reasonably nonarbitrary fashion, the resolution of the control space in data assimilation? Hints could come from information theory. Provided adequate tools exist, one would compare the information content of the observations and the capacity of the control space grid to contain this information filtered out by the data assimilation analysis. In addition to the Bayesian inference principle used in data assimilation, a (typically variational) side principle would help construct an optimal representation. Such a cost function could establish how fine the grid resolution should be if it has to accommodate for the set of available data. Such a choice may stem from the balance found between the information content of the observation (but also the background) and the amount of information that the control variables are able to hold.

In this paper, a simple criterion of optimality is chosen based on the trace of the inverse analysis error covariance matrix, with known links to an information content analysis to be discussed and recalled.

### c. Regularization, continuum limit, and scaling divergence

In a data assimilation system, what if the control space is discretized in too fine a manner? This question is actually central to inverse problems. Without regularization, the problem is underdetermined and therefore ill-posed. Regularization has been devised to circumvent this difficulty. In the data assimilation literature, regularization is performed through background implementation with a clear physical interpretation. In many cases, common regularization [such as Tikhonov; i.e., a least square constraint between the system state and a first guess (Rodgers 2000)] is sufficient to accommodate a finer resolution of control space (Bocquet 2005a). As the resolution increases, no new information coming from the observation is strengthening the inference or data assimilation analysis. Obviously, the computation load is heavier. However, the solution (the analysis) is only marginally changed.

Yet, it has been shown recently that, in some circumstances (e.g., the identification of an atmospheric pollutant accidental-release source), this regularization may not be enough to support the increase in the resolution (Bocquet 2005a; Furbish et al. 2008). As the resolution increases, the solution ignores the benefit of the observations more and more. With a Tikhonov regularization scheme, it may converge to the first guess. The analysis is mathematically sound, and the degeneracy is understood on physical grounds (Bocquet 2005a). However, the solution is no longer impacted by the observations except in the vicinity of the observation sites. In this context, it is therefore mandatory to make an educated choice on a (if not the) proper scale of the representation.

Building on Bocquet (2005a), general results will be obtained on this issue in section 2.

### d. Greenhouse gas fluxes estimation: An optimal representation problem

The driving force of global warming is very likely to be greenhouse gases and their relative balance (Pachauri and Reisinger 2007). The first one, in terms of impact, is carbon dioxide. From climate theory, it is therefore crucial to evaluate precisely the fluxes and exchange of carbon on earth. Therefore, biogeochemists have built up inventories of fluxes that are still uncertain, especially biogenic emissions and uptakes. A complementary approach is to use partial flux data and carbon dioxide concentration measurements and, through inverse modeling and because of prior hypotheses and a global dispersion model, estimate carbon fluxes on the globe. This is the so-called top-down approach. However, until the arrival of carbon satellite data, the pointwise concentration and flux measurements are sparse on the globe, albeit very reliable. Their information content is insufficient to accommodate the global-scale high-resolution map of fluxes that the carbon community is hoping for. So, it was decided very early to split the world into a few areas, leading to clearly identified aggregation errors (Fan et al. 1998; Bousquet et al. 2000). With more data available, the community was then inclined toward high-resolution reconstruction of the fields (Peylin et al. 2005). But then the information content of the observations does not match this too-fine control space. So here, too, the choice of the representation of the control space is thought to be crucial, and quantitative prescription of the right scale would be decisive. The formalism developed in this paper should help in this respect.

### e. Objectives and outline

The aim of this paper is to demonstrate that the representation of the control parameter space can be chosen optimally to better match the available sources of information introduced in the data assimilation system.

In section 2, we reinterpret old data assimilation results by studying, in general terms, how the data assimilation analysis, as a measure of the inference quality, evolves through changes in the representation of control space defined on the same domain. This justifies how important an optimal choice of representation can be. In section 3, a simple optimality criterion is chosen to select a representation. A mathematical framework is then developed to make the idea explicit and implementable. In section 4, the formalism is applied in the field of atmospheric dispersion, in which an optimal representation is obtained given a fixed number of grid cells. Using this optimal representation, inverse modeling of a tracer source is then performed and a comparison is made with earlier results on regular grids. A summary and perspectives are given in section 5.

## 2. Posterior analysis and change of representation

In this section, basic results of data assimilation will be scrutinized, keeping in mind a possible change of the control space grid structure. This will justify the importance of an optimal choice of representation.

### a. Reduction of error and analysis error covariance matrix

In the framework of the best linear unbiased estimator (BLUE) analysis, the posterior error covariance matrix reads as follows:

where 𝗕 is the background covariance matrix (which will be assumed full rank in the following), 𝗥 is the observation covariance matrix, and 𝗛 is, broadly speaking, the observation operator. Here, T denotes the usual vector or matrix transposition. Alternatively, the inverse of the error covariance matrix, or the confidence matrix, is the following:

This matrix is also called the Fisher information matrix (for a general definition in a geophysical context, see Rodgers 2000, p. 32), which is a measure of the information conveyed by the observations (typically a vector of measurements ** μ** ∈ ℝ

*, where*

^{d}*d*is the number of observations) to the control variables (a vector

**∈ ℝ**

*σ**, where*

^{N}*N*is the number of variables) through the data assimilation optimality system, when

**is seen as a random vector.**

*μ*It is assumed that the control space has an underlying spatial and temporal structure, with a discretization parameter *r* and a spacing that could be spatial, temporal, or both. This structure has a continuum limit (*r* → 0), which is considered as the “truth” in geophysics. For instance, one may be attempting to retrieve a ground emission field (two spatial dimensions and time; 2D + T). Then 𝗛, as a Jacobian matrix, relates the concentration measurements to the space–time emissions. It is computed because of numerical schemes solving consistently, for some discretization *r*′(possibly different from *r*), a set of partial differential equations.

As is well known, the analysis systematically reduces the uncertainty because 𝗣* _{a}* ≺ B, where ≺ denotes partial ordering between semidefinite positive matrices. Alternatively, the confidence is systematically increased: 𝗣

_{a}

^{−1}≻ 𝗕

^{−1}. A similar analysis based on the same dataset is contemplated but the resolution, or more generally the representation of the control space, is now changed. When both observations and model are perfect (i.e., 𝗥 = 0), then 𝗣

*= 𝗕 − 𝗕𝗛*

_{a}^{T}(𝗛𝗕𝗛

^{T})

^{−1}𝗛𝗕. A limiting case is reached when a representation of the control space has

*d*cells, which is the number of observations. As a consequence, 𝗛 may become invertible, at least in an infinite precision numerical context, so that in this specific case 𝗣

*= 0. Still assuming 𝗥 = 0, suppose that the number of variables is greater than the number of observations*

_{a}*N*>

*d*. This obviously happens when

*r*→ 0. Then the uncertainty is not null anymore but its reduction is controlled by , where and 𝗜

*is the identity matrix in ℝ*

_{N}*. Therefore, the posterior uncertainty is essentially controlled by the ability of the averaging kernel to reproduce the identity. [The averaging kernel is the sensitivity matrix of the analysis state with the true state; see Rodgers (2000) for a clear definition and properties.]*

^{N}One invariant measure of the total variance of the error is the trace of 𝗣* _{a}*, which serves as the optimality criterion for establishing BLUE. The expression 𝗕

^{−1/2}𝗣

_{a}𝗕

^{−1/2}reads the posterior covariance matrix in the basis where all parameters, viewed as random variables, are independent with variance unity. Therefore, it actually measures the reduction of uncertainty with respect to the uncertainty attached to the independent degrees of freedom of the prior. Then Tr(𝗜

*− 𝗣*

_{N}*𝗕*

_{a}^{−1}) =

*d*measures the number of degrees of freedom that are elucidated in the analysis. In an errorless context (𝗥 = 0), this equals the number of (independent) observations, as noted by Rodgers (2000). Similarly, Tr(𝗣

*𝗕*

_{a}^{−1}) =

*N*−

*d*rigorously states the obvious: whereas the degrees of freedom are increasing with increasing mesh resolution, the reduction of uncertainty from observation remains the same.

Note that 𝗣* _{a}*𝗕

^{−1}also appears in the measure of the average gain of information (average reduction of entropy) in the BLUE analysis (Fisher 2003):

where |𝗠| is the determinant of matrix 𝗠 and 𝗔 = 𝗜* _{N}* − 𝗣

*𝗕*

_{a}^{−1}is the averaging kernel matrix. Here, 𝒦

_{σ}is the relative entropy or Kullback–Leibler information measure (Cover and Thomas 1991), using Gaussian hypotheses. The reduction of uncertainty and the reduction of entropy are closely related and coincide in the limit where the reduction of entropy (gain of information) is small with respect to the prior, up to a conventional factor measuring the information unit: . In the same limit, a closely related expansion of the information measure can also be performed:

This expression will be used in section 3 as an optimality criterion.

These elementary data assimilation statements show that increasing the sharpness of the representation (e.g., the resolution by decreasing *r*), and hence increasing the number of unknown variables *N* beyond the available observational information (*d* observations), could be detrimental to the quality of the analysis. The background acts as a mathematical regularization in the data assimilation system when the information content of the *d* observations cannot account for the *N* unknown variables. However, the background is usually inducing extra confidence that is often criticized, especially within the atmospheric chemistry applications where emission inventories are very uncertain (Elbern et al. 2007). So, it may often come as a surprise that increasing the resolution in data assimilation decreases the performance of the analysis.

This may be considered less surprising when data assimilation is regarded as an inverse problem. Then, this quest for information (Tarantola and Valette 1982) amounts to matching just enough unknown variables for the available information content. This is obviously a trivial task when inverting an algebraic linear system because there is only one flux of information (the individual equations of the system). In data assimilation, the difficulty comes from quantifying the fluxes of information because the sources are multiple (e.g., background, model, and observations) and of distinct nature.

Note that the background modeling is also dependent on representation (or scale) because variances of control parameters usually vary with the scale. For instance, considering intensive variables, statistically independent subelements have a stronger variance than the variance attached to their aggregated mother element. However, taking into account the proper scaling does not necessarily prevent the degradation of the analysis. If we do not assume that the variables are statistically independent, then a correlated proper background might remove the degrading effect. But this popular trick actually amounts to reducing the effective number of independent degrees of freedom, though in a smoother manner.

### b. Continuum limit for the posterior analysis

We have proved in the previous section that increasing the resolution beyond some threshold determined by the information content of the observations is always detrimental to the analysis. Let us now show that it may even lead to a severe failure of the analysis.

As mentioned earlier, 𝗛 could be a Jacobian matrix that directly relates the observations to the control space. Again, let us change the representation of the control space (e.g., the resolution) and consider its consequence on the analysis. In the continuum limit, at least two kinds of behavior can actually occur, which have been described in Bocquet (2005a). The actual regime is decided by the behavior of the innovation statistics matrix 𝗥 + 𝗛𝗕𝗛^{T} in the continuum limit. A possible divergence was first pointed out by Issartel and Baverel (2003). To stress the significance of a proper representation of control space, let us first exhibit a simplified case relevant to physical applications in which such divergence may occur.

#### 1) Anomalous regime: example of divergence

Essentially following Bocquet (2005a), let us give a simple example of the singular regime. A Fickian diffusion equation is considered. A field of pollutant *c* obeys the following diffusion equation on the space–time domain Ω:

where *σ* is the source field of the pollutant and ** κ** is a diagonal diffusivity tensor. The choice of a Fickian operator may look overly simplified, but it does capture qualitative features of atmospheric transport, such as convective diffusion and mixing for a sufficiently long time, that are relevant to data assimilation and inverse modeling applications. To directly relate a pointwise concentration measurement

*μ*to the source field, the adjoint equations can be introduced, one for each observation:

_{i}where *π _{i}* is a retro source, which, in the case of a pointwise and instantaneous concentration measurement at (

**x**

*,*

_{i}*t*), reads

_{i}*π*(

_{i}**x**,

*t*) =

*δ*(

**x**−

**x**

*,*

_{i}*t*−

*t*) so that

_{i}where errors *ε _{i}* can also be accounted for.

In the discrete case, the vector of measurements ** μ** ∈ ℝ

*is related to the source or flux components*

^{d}**∈ ℝ**

*σ**through the Jacobian matrix 𝗛 ∈ ℝ*

^{N}^{d×N}, whose lines are approximated by the discretization

**c***

_{i}of the adjoint solutions

*c**

_{i}.

Assume that the source or emission field has a prior model formalized by a simple background, a Tikhonov regularization, given by

where 𝗕 = *m*^{2}𝗜* _{N}* and

*m*is a scalar, homogeneous to a root-mean-square error. Then, in this example, the innovation statistics diverge in the continuum limit. In particular, one has 𝗥 + 𝗛𝗕𝗛

^{T}≃

*m*

^{2}𝗛𝗛

^{T}. Indeed the elements of this Gram matrix are given by scalar products of adjoint solutions. For noncoinciding observations, the scalar product of two elementary diffusion solutions has a proper continuum limit: the two singularities of the integral coincide with the two observations that are integrable because the species concentrations are locally integrable by mass conservation. However, this is not the case for coinciding observations, so that the diagonal of 𝗛𝗛

^{T}diverges. In particular, it was shown (Bocquet 2005a) that in the continuum limit one typical element of the diagonal either reaches a finite limit or behaves like , where Ω

*is the full space and time domain, punctured by a ball with radius*

_{r}*r*around the related observation site. The problem-dependent exponent

*α*> 0 qualifies the rate of divergence. Note that this divergence can also characterize a wide range of non-Fickian diffusion operators and maybe other types of evolution operators that are relevant to many geophysical situations.

Keeping in mind this possible divergence in the continuum limit, we now look at the consequences for the analysis and the space–time spread of the observational information in quite general terms.

#### 2) Consequences for the analysis

Following the previous example, 𝗛 is assumed to relate the vector of fluxes or source elements ** σ** in ℝ

*to the vector of observations*

^{N}**in ℝ**

*μ**so that*

^{d}**= 𝗛**

*μ***+**

*σ**ε*, where

*ε*is the errors vector. However, this model equation is more general, even though it assumes a linear model. It was shown that the innovation statistics may diverge in some well-founded geophysical applications when

*r*vanishes. Let us analyze the implications on the BLUE analysis.

Here, 𝗛 has a singular value decomposition 𝗛 = 𝗨𝗗𝗩^{T}, where 𝗨 is an orthogonal matrix of size *d* × *d*, 𝗗 is a diagonal *d* × *d* matrix made of the singular values *λ _{i}* (which implicitly depend on the spacing

*r*), and 𝗩 is a

*N*×

*d*matrix satisfying 𝗩

^{T}𝗩 = 𝗜

*.*

_{d}In the following, for the sake of simplicity, the background and observation error covariance matrices will be assumed diagonal of the form 𝗕 = *m*^{2}𝗜* _{N}* and 𝗥 =

*χ*𝗜

*. As long as physical correlations are short range, the conclusions of the following developments will remain general. In particular, it is possible to perform similar calculations in the so-called standardized form, in which the singular value decomposition is applied to 𝗥*

_{d}^{−1/2}𝗛𝗕

^{1/2}instead of 𝗛 (for details on the standardized form, see Rodgers 2000). Depending on the exact definition of the coarse-graining operation,

*m*may or may not be scale dependent. Also,

*χ*may depend on the scale because of the representativeness error, but this effect is neglected here.

If 𝗩 = (*υ*_{1}, … , * υ_{d}*), then the averaging kernel 𝗔 = (𝗕

^{−1}+ 𝗛

^{T}𝗥

^{−1}𝗛)

^{−1}𝗛

^{T}𝗥

^{−1}𝗛 can be written as follows:

where 𝗔 is a projector on the {* υ_{i}*}

_{i=1, … ,d}followed by a contraction when errors are taken into account. In the singular regime, the vectors

*converge to*

**υ**_{i}**c***

_{i}/‖

**c***

_{i}‖, whereas the singular values

*λ*diverge. Therefore, in the continuum limit

_{i}*r*→ 0,

provided *m*^{−2}/*χ* does not diverge faster than *λ _{i}*

^{2}(this issue is settled by the exponent

*α*mentioned above and is therefore case dependent). In this limit, the averaging kernel is a mere projector onto the vector space generated by the adjoint solutions. If

*m*

^{−2}/

*χ*competes with one

*λ*(or more) in this limit, then this projection is followed by a contraction.

_{i}As for the gain matrix 𝗞 = (𝗕^{−1} + 𝗛^{T}𝗥^{−1}𝗛)^{−1}𝗛^{T}𝗥^{−1}, if 𝗨 = (**u**_{1}, … , **u*** _{d}*), then

In the singular regime, the vectors **u*** _{i}* converge to

**e**

*, where (*

_{i}**e**

_{1}, … ,

**e**

*) is the canonical basis of the physical space. Therefore, in the continuum limit,*

_{d}so that the gain 𝗞 goes to zero in the singular limit case. To estimate the performance of data assimilation within a single event context (only one set of measurements ** μ**), an objective root-mean-square-like score (Bocquet 2005a; Krysta and Bocquet 2007) is given by

where * σ_{t}* is the true control state,

*is the first guess, and*

**σ**_{b}*ε*is the true errors set. It can be shown that 0 ≤

_{t}*ρ*≤ 1 (Bocquet 2005a). The more resolved the control state is, the weaker the image of

*−*

**σ**_{t}*through the averaging kernel, unless all the mass of the state vector concentrates on the observation sites. In the singular limit case, for a randomly chosen*

**σ**_{b}*, one almost surely has the following:*

**σ**_{t}In the control space, the information matrix 𝗕^{−1} + 𝗛^{T}𝗥^{−1}𝗛 should also hint at this singular behavior. The question is whether the confidence that is attached to 𝗥^{−1} is or is not propagating through the action of 𝗛. This is clearly the case in the absence of singularity as follows:

The singular regime leads to a clear divergence in the confidence. As *r* goes to zero, the adjoint solution is more and more peaked near the observation sites so that the propagation is prominent in an increasingly close neighborhood. The average gain of information in the analysis is supporting this:

Thus, paradoxically, the gain of information diverges in the singular regime. A likely reason is that when the sources are localized in the vicinity of the observation sites, the inference becomes very informative compared to the prior assumption of an omnipotent emission field. The averaging kernel Eq. (10) tells what one can really hope for in this limit: it converges to a projector on the observation site support. This projector has no spread in the space–time domain. The normalized information attached to an observation will not propagate to the distributed control space. It will only resolve the vicinity of the observation sites.

Although it depends on scale, the analysis is not flawed. However, it does depend on the magnification chosen to investigate the control space, which, depending on the analysis purpose, may be inappropriate. If the objective is to tell anything about the whole distributed control space, then this situation is not desirable and the high-resolution limit should be avoided. To a lesser extent, this conclusion also applies to data assimilation systems of the nonsingular regime.

#### 3) BLUE estimators in the two regimes

The generic BLUE results depend on two typical behaviors. First, suppose that the limit of the innovation statistics matrix 𝗛𝗕𝗛^{T} is well defined. Then both the analysis error covariance matrix and the estimator tend to a proper nontrivial limit. The continuum limit of the analysis covariance matrix is still given by Eq. (1), and the estimator has the following usual form:

Even though there is a finite continuum limit for the data assimilation problem, the analysis performance may degrade when the resolution is decreased, down to a lower nonzero threshold. This has been checked in Bocquet (2005a) and Furbish et al. (2008). In the later paper model, error is also taken into account. On the contrary, when 𝗛𝗕𝗛^{T} is divergent, the posterior matrix then merely tends to the background covariance; that is, 𝗕^{1/2}𝗛^{T}(𝗥 + 𝗛𝗕𝗛^{T})^{−1}𝗛𝗕^{1/2} tends to a less and less relevant operator in ℝ^{N×N}, along with the BLUE gain matrix. Thus, almost surely,

where * σ_{t}* is a randomly chosen true state of the geophysical system.

## 3. Choosing a representation for the control space

The results of section 2 demonstrate that the choice of the representation of the control space can potentially lead to more efficient data assimilation. In this section, a mathematical formalism is proposed to determine a good representation. Beyond the simpler choice of the resolution of a regular grid on control space Ω, this may involve choosing an adaptive and multiscale grid for later data assimilation purposes. To achieve this, one must first adopt a criterion that defines optimality in a dictionary of possible representations.

### a. Simple criterion, simpler problem

We shall first drastically simplify the matter by fixing the number of parameters to be controlled. It is chosen on the order of the number of observations: a few control variables for one observation at most. Matching the resolution of a regular grid on Ω to the fixed number of cells is too rigid an approach, especially for distributed geophysical systems with very inhomogeneous control parameters. If the control space is multidimensional (three-dimensional for this paper application), a few hundred control parameters to match a few hundred observations is likely to be insufficient to describe the multidimensional distributed system through a regular grid.

Hence, the number of control parameters is fixed but not the partition of the control space that defines them. Call *ω* this representation, which is a partition of the space–time domain Ω. Here, *ω* is a set of (unnecessarily regular) cells that fully covers the space–time control domain (which is a 2D + T manifold for the typical application here). Besides, a point in the control domain Ω belongs to one and only one cell of the representation.

The criterion that we could choose to select a particular representation would consist in maximizing the total posterior confidence [i.e., the trace of the Fisher information matrix Tr(𝗣_{a}^{−1})] at a fixed number of control parameters. To make the criterion dimensionless and coordinate free and following the discussion of section 2, we prefer the closely related 𝒥 = Tr(𝗕𝗣_{a}^{−1} − 𝗜_{N}). It measures the gain in confidence obtained from observation relatively to the initial confidence in the background. It also reads 𝒥 = Tr(𝗕𝗛^{T}𝗥^{−1}𝗛). The physical interpretation is clear; confidence in the measurements 𝗥^{−1} is propagated by the model 𝗛 and measured in the basis for which the background confidence 𝗕^{−1} is standardized. It is also connected to the related information gain via Eq. (4).

### b. Space–time multiscale grid and Jacobian matrix

To perform multiscale data assimilation, the Jacobian matrix 𝗛 needs to be defined at several scales. The reference scale will be the scale that characterizes the finest available regular grid of size (*N _{x}*,

*N*,

_{y}*N*), whose total number of space and time cells is

_{t}*N*

_{fg}≡

*N*=

*N*. In this paper, physics at other scales will be derived from this reference scale by simple arithmetic averaging. However, the formalism developed in this paper could also accommodate physics that is based on more complex coarse-graining approaches with dedicated subgrid parameterizations that depend on the scale.

_{x}N_{y}N_{t}Recall that the physics of the problem is encoded in ** μ** = 𝗛

**+**

*σ**ε*, which is written in the reference regular grid. This equation could be written at a different scale or more generally with components of

**that could be defined at different scales. In our context, one can think about coarse-graining components of**

*σ***in ℝ**

*σ*^{Nfg}at the reference scale into a coarser

*in ℝ*

**σ**_{ω}*(*

^{N}*N*≤

*N*

_{fg}), whose components are unions of size-varying blocks of components of

**.**

*σ*We have chosen to precalculate coarse-grained versions of the finest Jacobian obtained from numerical models on the reference regular grid of the control domain Ω (other choices are possible, such as computing them on the fly). As a result, a multiscale Jacobian made of coarse-grained versions of the original Jacobian on the finest regular grid is stored in computer memory. This precomputation saves time in the optimizations. However, the size of this multiscale Jacobian matrix is a requirement to consider in practice, so that an estimation of its maximum occupancy will be given below.

The multiscale extension consists in recursive dyadic coarse grainings of the original finest regular grid. By a basic coarse graining in one direction, two adjacent grid cells are gathered into one and the physical quantities (Jacobian components) defined on them are averaged accordingly. For each direction, one considers a predefined number of scales: *n _{x}*,

*n*, and

_{y}*n*for the application ahead. For each vector of scales

_{t}**l**= (

*l*,

_{x}*l*,

_{y}*l*) such that 0 ≤

_{t}*l*<

_{x}*n*, 0 ≤

_{x}*l*<

_{y}*n*, and 0 ≤

_{y}*l*<

_{t}*n*, there is a corresponding coarse-grained version of the finest grid, with

_{t}*N*

_{fg}2

^{−lx−ly−lt}coarse cells (also called tiles) and a related coarse-grained Jacobian matrix (we assume

*N*,

_{x}*N*, and

_{y}*N*are multiples of 2

_{t}^{nx}, 2

^{ny}, and 2

^{nt}, respectively). The full multiscale grid on Ω is the union of all these coarser grids. The total number of tiles of the multiscale grid is 8

*N*

_{fg}(1 − 2

^{−nx})(1 − 2

^{−ny})(1 − 2

^{−nt}). Obviously, that is also the dimension of the space that the multiscale Jacobian matrix acts on. At worst, the multiscale grid is 8 times larger than the finest regular grid, which could be costly in terms of memory. A graphical representation of the memory occupancy is pictured in Fig. 1a.

### c. Adaptive tiling

Optimizations on the dictionary ℛ(Ω) of representations require that this set be handled mathematically and computationally. The following formalism is based on the multiscale grid detailed in the previous section.

In the representations ahead, a distinction is being made between space and time dimensions. Following the dyadic structure of the multiscale grid, the partition of time line is implemented by a binary tree 𝒯 with *n _{t}* levels.

In a first attempt and following a 2D + T example, the partition of the 2D space domain 𝒮_{xy} was built up because of a “quaternary tree” with *n _{s}* levels. It means that each 2D space cell could be divided into four subcells, down to a depth of

*n*levels (see Knuth 1997; Fig. 1b). A tensorial product of the time and space structures was then peformed. Two natural distinct sets of representations can be obtained: or . As far as computer memory saving is concerned, these sets are rather economical. A (2D + T) emission field would be represented with 8/3

_{s}*N*

_{fg}(1 − 4

^{−ns})(1 − 2

^{−nt}) scalars, in both multiscale structures. In the worst case (full downscaling), this involves a factor 8/3 as compared to the finest regular grid representation. However these representations are quite demanding in terms of coding. In particular, parsing efficiently all tree configurations in an optimization process can be very time-consuming.

That is why, in this paper, the set of representations that encompass all space–time “tilings” of a domain, based on the multiscale grid and Jacobian matrix defined earlier, is chosen instead. This set is the tensorial product of dyadic (binary) trees on time direction, *O _{t}*, and the spatial directions,

*O*and

_{x}*O*. It is a richer space than the two previous double-tree structures, but it is also much more memory consuming because the total number of cells to be run through is 8

_{y}*N*

_{fg}(1 − 2

^{−nx})(1 − 2

^{−ny})(1 − 2

^{−nt}). It corresponds to all grid cells, or tiles in this context, of the multiscale grid. At worst, it is 8 times larger than the finest regular grid. Again, this could be avoided if the Jacobian 𝗛 is not stored in memory, or as a compromise a quaternary tree representation is chosen instead.

A schematic of a double-tree partition is displayed in Fig. 1b, and a schematic of a tiling partition (triple tree) is displayed in Fig. 1c. Note that tensorial products of space and time structures are preferred to Cartesian products because the resulting dictionary of representations is much richer. Cartesian products would have implied separation of the space and time degrees of freedom, which is too poor an assumption.

### d. Vector space mathematical representation

The dictionary ℛ(Ω) of representations has been described geometrically. We shall now describe it algebraically.

Let us consider the vector space ℝ^{Nfg} attached to the finest grid of dimension *N*_{fg}. To each tile is attached a vector *υ*_{l,k} in this space, in which **l** = (*l _{x}, l_{y}, l_{t}*) is the set of scales of the tile and

*k*is the index of this tile in the set of all tiles of the same scale

**l**(one coarse version of the original grid). For the finest grid, one has

**l**= (0, 0, 0) and

*k*= 1, … ,

*N*

_{fg}. The cells of the finest grid (smallest tiles) are represented by the canonical basis of this vector space. A coarser tile, made of these finest tiles, has vector

*υ*_{l,k}∈ ℝ

^{Nfg}, defined as the sum of the vectors of the canonical vectors {

**e**

_{i,j,h}} in ℝ

^{Nfg}with 1 ≤

*i*≤

*N*, 1 ≤

_{x}*j*≤

*N*, and 1 ≤

_{y}*h*≤

*N*that are in one-to-one correspondence with the finest tiles that make up this tile: , where (

_{t}*i*,

_{k}*j*,

_{k}*h*) are the coordinates on the finest grid of the tile indexed by

_{k}*k*in the coarse-grained representation of Ω at level

**l**.

A tiling *ω* of Ω is a partition, in the strict mathematical sense, of the space and time control space with tiles. For each tiling *ω* in ℛ(Ω), we define an operator **Π**_{ω} : ℝ^{Nfg} → ℝ^{Nfg} built with the vectors attached to the tiles as follows:

where *n*** _{l}** is the number of tiles in the set of tiles with scale vector

**l**;

**l**runs on all predefined scales 0 ≤

*l*<

_{x}*n*, 0 ≤

_{x}*l*<

_{y}*n*, and 0 ≤

_{y}*l*<

_{t}*n*; and are coefficients that define the representation

_{t}*ω*. From now on, the superscript

*ω*on will be dropped to simplify the notation.

In a multigrid approach, a location in space and time is represented several times. On the contrary, in a single representation of Ω, such a point is accounted for once, according to the strict definition of a partition. In this paper context, a point is a cell of the finest reference grid. A representation corresponding to a strict partition will be qualified as admissible. To obtain an admissible representation of the control space, one requires that *α*_{l,k} be 0 or 1 for all (**l**, *k*) and that any point in the domain be represented by one and only one tile:

Furthermore, we impose that the number *N* of variables (hence tiles) be set as follows:

Obviously, for an admissible representation, *N*_{cg} ≤ *N* ≤ *N*_{fg}, where *N*_{cg} = *N*_{fg}2^{−nx−ny−nt}. On these conditions, it is not difficult to check that **Π**_{ω} = **Π**_{ω}^{T} and (**Π*** _{ω}*)

^{2}=

**Π**

*so that*

_{ω}**Π**

*is an orthogonal projector.*

_{ω}### e. Restriction and prolongation

For a given representation *ω*, *σ* is coarse-grained into *σ** _{ω}* = 𝗜

_{ω}

**, where 𝗜**

*σ*_{ω}: ℝ

^{Nfg}→ ℝ

^{N}is a restriction map that defines the coarse-graining operation. Although its precise definition depends on the context, it is always unambiguously defined. The prolongation operator 𝗜*

_{ω}: ℝ

^{N}→ ℝ

^{Nfg}relates

*back to*

**σ**_{ω}**. Contrary to the restriction operator, several consistent definitions can be used for the prolongation operator. However, it should consistently satisfy 𝗜**

*σ*_{ω}𝗜*

_{ω}= 𝗜

_{N}(for a discussion of these operators, see Rodgers 2000). In addition, we impose that 𝗜

_{ω}*𝗜

_{ω}=

**Π**

_{ω}. This identity means that the net effect of coarse graining followed by an interpolation back in the finest grid is a local averaging on the area covered by each tile. Note that what follows does not depend on the precise definition of 𝗜*

_{ω}. Then 𝗛 is coarse-grained into 𝗛

_{ω}= 𝗛𝗜*

_{ω}, and the coarse-grained observation equation

**= 𝗛**

*μ*_{ω}

*σ*_{ω}+

*ε*is made explicit in the finest grid vector space, because of the projection operator

**Π**

*:*

_{ω}In principle, *ε* should be made dependent on *ω* through the representativeness and model errors. Besides, whereas 𝗛* _{ω}* could be obtained from scale-dependent models at several scales, it is a mere coarse graining of a finescale Jacobian here. These important issues are beyond the scope of this paper, but the formalism developed ahead should accommodate them.

This allows to define a mathematically explicit multiscale criterion that depends on the choice of the representation *ω* ∈ ℛ(Ω):

The background error covariance matrix 𝗕 has been coarse-grained into 𝗕_{ω} = 𝗜*ω*𝗕𝗜_{ω}^{T}. In the following, 𝗕 will be again assumed of the form *m*^{2}𝗜_{Nfg} so that it commutes with **Π*** _{ω}*. As a consequence, . This formula can be generalized to any positive definite 𝗕, provided that

**Π**

*is generalized to a projector which is 𝗕*

_{ω}^{−1}orthogonal. This is not reported here because the application of section 4 does not require it.

The objective is to maximize the average reduction of uncertainty in the analysis 𝒥* _{ω}* on all admissible tilings (i.e., on the

*α*

_{l,k}that satisfy the above conditions).

### f. Lagrangian approach

To optimize 𝒥_{ω} on *ω*, a Lagrangian is written with the above conditions implemented because of Lagrange multipliers. A fixed number of tiles is imposed from a single multiplier *ζ*. The one point–one tile requirement is imposed because of a vector ** λ** of

*N*

_{fg}multipliers. The Lagrangian reads as follows:

Here, 𝗪 stands for 𝗕𝗛^{T}𝗥^{−1}𝗛 or any other appropriate criterion. The sum on *k* = 1, … , *n*_{0} = N_{fg} runs on all cells of the finest grid. In this sum, *α*_{l, k̃} is the coefficient attached to the tile at scale **l** that covers cell *k* from the finest grid. This tile has rank *k̃* among the *n*** _{l}** tiles related to scale

**l**. The Lagrangian can also be written as follows:

Then, the maximum can formally be taken on all representations, admissible or not, with any number of tiles in [*N*_{cg}, *N*_{fg}], on the set of coefficients *α*_{l,k} that have been freed from the constraints through the multipliers. This is made easier (to a limited extent) by the fact that *α*_{l,k} can only be 0 or 1. Then, if ,

Because this cost function is dual to ℒ(*ω*), it needs to be minimized, not maximized. Because it depends on ** λ** and

*ζ*only, its evaluation is not too time consuming. Its complexity scales like an optimization on the finest gridcell variables. However, the cost function is not smooth because it is nonderivable on edges of a polytope. Hence, optimization on the Lagrange parameters cannot make direct use of gradient-based minimization techniques. Besides, it is unlikely that this function is convex; nor is it guaranteed that it has a single minimum. To overcome these potential problems, a regularization of this effective cost function is proposed.

### g. Statistical physics interpretation

From now on, is denoted by *ε*_{l,k}. If 𝗪 = 𝗕𝗣_{a}^{−1} − 𝗜_{Nfg}, then *ε*_{l,k} is a measure of the extra confidence in the posterior estimate of the scalar defined on tile (**l**, *k*) resulting from the observations. Then the criterion can be written as , where Σ_{l,k} is shorthand for . Building on a statistical mechanics analogy introduced by Jaynes (1957a,b), we interpret the *ε*_{l,k} as individual energies, one for each tile. The system is thermalized at inverse temperature *β*, which is the regularization parameter. The average number of cells of the system is set to *N*. Also, any point must be covered by an average of one, and only one, tile. From statistical mechanics, the partition function of this system is

where ** λ** and

*ζ*are conjugate parameters to the occupation constraint and to the tiles number. Because the number of tiles is fixed and because a point is covered only once, the right statistical potential is the free energy functional that derives from the partition function:

By construction, this free energy function is strictly convex, which guarantees the existence of a single minimum, provided the constraints can be satisfied by at least one configuration *ω* in ℛ(Ω). In particular, *N* needs to satisfy *N*_{cg} ≤ *N* ≤ *N*_{fg}. This allows for the use of classical optimization tools.

The minimization of the free energy yields ** λ** and . Then, it is possible to compute the average filling factors:

which is essentially the main result of the optimization. The total reduction of uncertainty,

is given by the minimum of the free energy with an average reduction of *ε*_{l,k}*α*_{l,k} per tile.

This approach has been checked to be a consistent regularization of our main optimization problem. In the low temperature limit, *β* → +∞, the variables can be rescaled by *β* and the cost function becomes

and the Lagrangian Eq. (26) is recovered, up to *β*. Hence, this statistical approach is a way to regularize our problem at a small but finite temperature. Any other regularization is possible but this one offers a statistical interpretation and a guarantee of convexity.

A similar Lagrangian to Eq. (26) has been obtained in the case of binary–quaternary trees representations, as well as an analog formalism for this set of representations. This will be reported elsewhere because no application is given for it next.

The optimal structure obtained from this technique is also the least committed representation given that all the constraints are satisfied on average. At finite temperature, the optimal representation is a compromise between the criterion and the relative entropy of the representation as compared to the (nonadmissible) geometry in which all tiles are equiprobable. The proof is not essential here and will be reported elsewhere.

### h. Properties

A cost function Eq. (28) has been introduced to qualify representations. Several aspects of its minimization will be discussed now.

#### 1) Representation solution and optimal admissible representation

The main drawback of a regularization, including the one above, is that it is by definition a continuous deformation of the nonsmooth optimization problem. A consequence is that, for a finite regularization parameter, the average *α*_{l,k} is a scalar in the interval [0, 1] and not necessarily 0 or 1. So the representation that derives from the average *α*_{l,k} is not usually an admissible one. As *β* goes to ∞, the values of *α*_{l,k} gradually approach either 0 or 1. From the statistical point of view, the solution is still an average of admissible partitions of the domain for which the values of *α*_{l,k} are strictly 0 or 1, most of them being near optimal.

However, an admissible partition is needed to perform explicit data assimilation in the optimal representation. One solution is to optimize at very high *β* and, if possible, with high numerical precision. Unfortunately, this leads to unavoidable numerical instabilities (the regularization was used to overcome them in the first place). Besides, there is no guarantee that the solution does have a limit as *β* diverges, even though is consistently deformed.

Therefore, a suboptimal admissible representation is constructed. It is based on the average filling factors computed for each tile by Eq. (29). Then, a representation is built by choosing for each point the tile with the largest average filling factor. This process yields a complete representation (every point is covered), but it may not be admissible because a point can be covered by several tiles. Therefore, this constraint is respected by somehow arbitrarily removing excess tiles. Note that the final number of tiles of the resulting admissible representation may be different from *N*, though close to it.

#### 2) Links with the concept of resolution

In data assimilation, a concept of resolution quantifies the ability of the observations to locally resolve the control parameter space. It is notably applied in the assimilation of radiances for the reconstruction of profiles in the atmosphere (Purser and Huang 1993; Rodgers 2000). However, its quantification is empirical. As seen in section 2, the trace of the averaging kernel 𝗔 is a measure of the gain of information because of the observations. The diagonal elements of the averaging kernel are a local gain of information within the control space. The inverse of each diagonal element is then an empirical measure of the local resolution, or precision, in control space.

A similar but nonempirical concept can be derived from our formalism. For each point *k*, 1 ≤ *k* ≤ *N*_{fg} (cell in the finest reference grid), a mean size of the local tile *S*_{k} can be obtained, along with a measure of the average resolution _{k}:

where 𝒜 is the volume of the tile, which would scale as 2^{lx+ly+lt} in our dyadic representation in the units of cells of the finest grid. (For the sake of simplicity, the volumes are not converted in distance and time units.) The definition of *S*_{k} may be unpractical because the size of the tile grows geometrically in our multiscale coarse-graining picture. Therefore, either ln(*S*_{k}) or another average, a “mean logarithmic size,” can be used: *s*_{k} = Σ_{l}*α*_{l,k̃}(*l _{x}* +

*l*+

_{y}*l*). This definition of the resolution implies that 0 ≤

_{t}_{k}≤ 1. The best achievable resolution

_{k}= 1 means that the observations allow us to determine the parameter attached to the local cell in the finest reference grid. Similarly, one has 0 ≤

*s*

_{k}≤

*n*+

_{x}*n*+

_{y}*n*− 3 and 𝒜(0,0,0) ≤

_{t}*S*

_{k}≤ 𝒜(

*n*− 1,

_{x}*n*− 1,

_{y}*n*− 1), with the lower bounds corresponding to the finer scale.

_{z}#### 3) Numerics and parallelization

The optimization on tilings has been parallelized. Indeed, one practical advantage of the statistical approach through the partition function is that it enumerates all configurations. This sum within the computation of the regularized cost function Eq. (28) can easily be parallelized. Only a summation of the partial sums needs to be performed after all threads are finished. The search of the optimal representations in the example of section 4 requires a few seconds to a few hours, using an eight-core computer (double quad-core Intel Xeon). Because of the dual approach, the complexity of the algorithm depends not only on the number of grid cells of the finest level *N*_{fg} but also on the regularization parameter *β* (the higher the slower) and the targeted number of tiles *N*. The optimization algorithm is performed by the quasi-Newton limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B) minimizer (Byrd et al. 1995).

## 4. Application to atmospheric dispersion

In this section, the methodology of section 3 is applied to an experiment drawn from the air pollution context, using real data. It is taken from Bocquet (2005a), in which changes of regular inversion grids were tested on synthetic data. After determining an optimal representation for a fixed *N*, a data assimilation experiment will be performed on this representation.

### a. Finding an optimal representation

This application is taken from (Bocquet 2005a,b,c, 2007) and is focused on source inversion of the European Tracer Experiment (ETEX). This European Joint Researcher Centre 1994 campaign (Girardi et al. 1998) consisted of a pointwise 12-h-long release into the atmosphere of an inert tracer at continental scale. In the papers mentioned earlier, several three-dimensional inversions of the source were attempted successfully on either the synthetic or the real concentration measurements from the campaign. However, these inversions were based on a regular discretization of the 2D + T source space. These meshes comprised from a few thousand to one million control variables (from 2.25° × 2.25° to 0.5625° × 0.5625°).

The aim is to construct an optimal adaptive representation (a tiling) for inversions based on a much more limited number of cells, using the mathematical formalism developed in section 3.

As proposed in section 3, the gain in confidence 𝒥_{ω} = Tr(𝗕_{ω}𝗛_{ω}^{T}𝗥^{−1}𝗛*ω*) (or reduction in uncertainty) will be optimized on a dictionary of tilings, with the following multiscale structure: The finest grid has dimensions *N _{x}* = 64,

*N*= 32, and

_{y}*N*= 160 for a spacing of Δ

_{t}*= Δ*

_{x}*= 0.5625° and Δ*

_{y}*= 1 h, whereas*

_{t}*n*= 5 scales are used for

_{x}*O*,

_{x}*n*= 5 scales are used for

_{y}*O*, and

_{y}*n*= 5 scales are used for

_{t}*O*. The inverse temperature is set to

_{t}*β*= 10

^{5}.

In the following, a small set of 201 observations is used. It was diagnosed to have a high information content for the inversion (Bocquet 2008). The number of required tiles (402) is twice the number of observations, which is very limited when compared to the number of cells of regular grids.

The resulting mean logarithmic tile size *s*_{k} is displayed in Fig. 2. A suboptimal admissible tiling of 403 tiles is displayed in Fig. 3. Its reduction of uncertainty is 𝒥_{ω} = 802.98, whereas the optimal reduction of uncertainty is 𝒥 = 803.89. For the coarsest grid in this multiscale structure, 𝒥_{ω} = 23.42, whereas for the finest grid, 𝒥_{ω} = 1324.74. The latter is the maximum over all configurations for all *N*, which is due to the scaling divergence described in section 2.

Note how small the number of tiles *N* = 402 is compared to *N*_{fg} = 327 680. Nonetheless, 60% of the average gain of confidence would potentially be captured using this optimal representation compared to the choice of the finest grid. The average reduction of uncertainty for optimal representations with a tile number in the range [80, 327 680] is plotted on Fig. 4. This graph emphasizes how well performing an adaptive representation of a very limited number of cells can be for the purpose of data assimilation, especially in comparison to regular grid representations.

### b. Inverse modeling using the optimal representation

Inverse modeling will now be performed on the basis of the optimal representation (adaptive grid) previously obtained (Fig. 3). We follow the methodology of Bocquet (2007) and Krysta et al. (2008), which was based on a nonoptimal regular grid. The difference is that the inversion is now based on the adaptive optimal grid for *N* = 402.

A standard Gaussian prior for the source is chosen from the generic form, Eq. (8). However, it is now defined on an adaptive grid . As mentioned earlier, 𝗕_{ω} = *E _{νω}*[

*σ*_{ω}

*σ*_{ω}

^{T}] is related to 𝗕 through 𝗕

_{ω}= 𝗜

_{ω}𝗕𝗜

_{ω}

^{T}. We assume that 𝗕 =

*m*

^{2}𝗜

_{Nfg}, where

*m*sets the release rate magnitude. Here,

*m*= 0.025

*M*/12 is chosen arbitrarily, where

*M*= 340 kg is the true released mass of tracer in ETEX-I and 12 is the number of time steps of the true release in the finest grid. The errors are also chosen from the Gaussian a priori: , where the error covariance matrix is diagonal and homoscedastic 𝗥 =

*χ*𝗜

*, where*

_{d}*χ*is the only degree of freedom left to weight the departure from the observations and the departure from the prior.

The cost function, the by-product of the Bayesian inference, is then as follows:

which leads to the usual normal equations and estimators of the BLUE analysis. Here, *ω* is the conditional representation that has been obtained earlier. The value of *χ* is selected from an L-curve analysis following Davoine and Bocquet (2007) and Krysta et al. (2008): *χ* = 0.3 ng m^{−3}, which is equal to the value obtained from the regular grid (2.25° × 2.25° × 1*h*).

The integrated map of the reconstructed source is plotted on Fig. 5, as well as the retrieved profile in the vicinity of the true released site. The total retrieved mass of tracer is 879 kg, as compared to 684 kg obtained on a regular grid (Krysta et al. 2008). It has been diagnosed that 242 kg of tracer have been released in the vicinity of the true release site, as compared to 241 kg in Krysta et al. (2008). Therefore, the inversion result is very similar to the inversion on the regular grid but at a much lower cost. A difference is the excess in the total mass estimate, which can be attributed to observation–unconstrained aggregation error resulting from large tiles over the Atlantic and Ireland. Note that in Krysta et al. (2008), 81 920 and 20 480 control variables are solved for, as compared to 403 tiles here. On the practical side, optimizations are significantly accelerated.

We have also tested this representation within a non-Gaussian prior framework. A Bernoulli prior, which ensures the positivity of the source, has been selected. The two parameters that enter the inversion have been selected using an iterative L-curve approach. As reported in Krysta et al. (2008), on a regular grid, this prior leads to a better estimate of the source than the Gaussian prior. However the profile is less satisfying, with significant fluctuations within the true release period.

The results obtained on the optimal adaptive grid are displayed in Fig. 5. The profile is smoother than with a regular grid. This is because the optimal representation determines the proper balance between the space and time scales. Given *N* = 402, the optimal time step is about 3–4 h.

## 5. Summary and perspectives

In geophysical data assimilation, the choice that is made in the discretization of the spatially distributed control space is not innocuous. It has a direct impact on the quality of the assimilation as measured by the reduction of uncertainty or by the gain in information through, for instance, the Fisher information matrix. When the resolution of the control space discretization goes to zero, the input observational information remains the same while the uncertainty dramatically increases. Inverse modeling of atmospheric compounds is highly sensitive to this effect. Even though the information gain diverges, it is not spreading to the whole parameter space. Thus, in this limit, the estimator will not depart from the prior (outside the observation sites support). Therefore, there must be an optimal resolution for a regular space grid whose knowledge is even more needed in the latter context.

Hence, an idea is to consider the discretization of control space a degree of freedom. Instead of several regular grids at different scales, a richer set of representations to optimize on is given by adaptive grids. They have been defined as partitions (in the mathematical sense) of control space. Considering a three-dimensional control space 2D + T (typical of an atmospheric chemistry emission inverse problem), a product of binary trees was chosen, representing a tiling of the domain with nonoverlapping boxes.

To keep mathematical developments as simple as possible, the number of tiles of the partitions was fixed for all potential representations. The selected optimality criterion is the gain in information or confidence 𝒥 = Tr(𝗕𝗣_{a}^{−1} − 𝗜_{N}) in the basis where control variables are independent with unit variance for the background error covariance matrix 𝗕. This can also be understood as a measure of the average reduction of uncertainly for data assimilation analysis. The question addressed was how to spatially arrange a fixed number of degrees of freedom in the control space, so as to maximize the quality of the subsequent data assimilation.

Once the dictionary of all representations has been described mathematically, a Lagrangian has been proposed to optimize the cost function while momentarily getting rid of the constraints (fixed number of tiles; one point–one tile). Because the cost function that is obtained appeared as a nonsmooth, possibly nonconvex, programming problem, it was regularized using a statistical mechanics analogy. The regularized cost function is convex, which guarantees the existence of a minimum.

This leads to rigorous concepts of local resolution, or local size of grid cells. The result from the regularized optimization consists of a superposition of admissible representations, which may not be admissible. Although it cannot be used straightforwardly in a data assimilation setup, it can be used to construct a suboptimal admissible representation.

In the last section, this mathematical formalism is applied on a realistic atmospheric tracer dispersion event: the European Tracer Experiment. The representation is optimized. A tiling of 403 cells that captures 60% of the average reduction of uncertainty is obtained. An inverse modeling experiment is then carried out on this conditional tiling. It leads to similar results as those obtained with a regular grid but with a much smaller number of grid cells. Depending on the target number of tiles, the computation may be much quicker. These tiles are supposed to better match the information load that can be delivered from the observations through the model and data assimilation analysis.

Preserving the continuity of this paper and confident in the ability of this formalism, we believe many developments need to be undertaken. We would like to mention five of them.

The formalism of this paper allowed for a fixed number of observations. Therefore, the optimization of the representation was focused on adaptive grids that would optimally dispatch the information from the observations onto the control space. From the mathematical side, this constraint can easily be relieved by getting rid of the *ζ* multiplier. However, the optimality criterion should be able to accommodate a free number of tiles. This is not the case for the criterion retained here. Indeed, in the singular regime case, the information from Fisher and also Kullback–Leibler has been shown to diverge in the high-resolution limit. This is a fundamental issue: how many degrees of freedom can be resolved by only the priors and the observations transported by the model? This aspect of the problem was kept low by choosing a fixed number of tiles. Further progress should clarify this question.

The effort of bringing data assimilation to a multiscale formalism along the proposed route is not complete. The data assimilation process was decomposed into the search of an optimal representation followed by the data assimilation inference on the parameters. Yet, what could be desired is the joint optimization of both the partition of control space and the actual values of the variables defined on this partition. In addition, the scale dependence of the model and representativeness errors were neglected, whereas we believe they should be accounted for in a fully consistent framework.

We would like to emphasize that this optimization could be useful not only in the singular regime, such as in air quality, but also in most data assimilation applications. Accordingly, the formalism should be extended to systems where the Jacobian is computed on the fly, which covers the majority of current data assimilation systems (sequential or variational).

In geophysical data assimilation, the description of the optimality system in terms of information exchanges remains conceptual. The information flow is not discriminated by individual components in the system (except for a distinction between the flows to the background and observation departures). This paper should also be understood as an effort to design individual components of the control space to match the flow of information coming from the observations and the priors. It shares conceptual similarities with the tracking of information exchanges between components of dynamical systems (Liang and Kleeman 2005).

Eventually, as far as applications are concerned, we believe that this analysis may be relevant to the inversion of greenhouse gases from pointwise concentrations or flux measurements. This representation issue is often put forward in the carbon inversion community, even though not necessarily in these terms. The hope is that such method would allow the determination of exactly how many and which degrees of freedom (fluxes) can be resolved unambiguously by the observations through the transport model.

## Acknowledgments

The author is grateful to the organizers of the Banff International Research Station workshop “Mathematical Advancement in Geophysical Data Assimilation,” February 2008, which has fostered many exchanges and this special issue: K. Ide, P. Gauthier, C. K. R. T. Jones, and K. R. Thompson. The author would like to thank L. Wu for a careful reading of the manuscript and three anonymous reviewers for their suggestions. He acknowledges interesting discussions on related topics with P. Rayner and A. S. Denning. This paper is a contribution to the MSDAG project supported by the Agence Nationale de la Recherche, Grant ANR-08-SYSC-014.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

_{2}data and models imply a large terrestrial carbon sink in North America.

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

_{2}flux estimates over Europe from continuous atmospheric measurements: 1, Inverse methodology.

**,**

**,**

**,**

## Footnotes

*Corresponding author address:* Marc Bocquet, CEREA, école des Ponts ParisTech, 6-8 avenue Blaise Pascal, Cité Descartes, Champs-sur-Marne, 77455 Marne la Vallée, CEDEX 2, France. Email: bocquet@cerea.enpc.fr

This article included in the Mathematical Advances in Data Assimilation (MADA) special collection.