## Abstract

While the formulation of most data assimilation schemes assumes an unbiased observation model error, in real applications model error with nontrivial biases is unavoidable. A practical example is errors in the radiative transfer model (which is used to assimilate satellite measurements) in the presence of clouds. Together with the dynamical model error, the result is that many (in fact 99%) of the cloudy observed measurements are not being used although they may contain useful information. This paper presents a novel nonparametric Bayesian scheme that is able to learn the observation model error distribution and correct the bias in incoming observations. This scheme can be used in tandem with any data assimilation forecasting system. The proposed model error estimator uses nonparametric likelihood functions constructed with data-driven basis functions based on the theory of kernel embeddings of conditional distributions developed in the machine learning community. Numerically, positive results are shown with two examples. The first example is designed to produce a bimodality in the observation model error (typical of “cloudy” observations) by introducing obstructions to the observations that occur randomly in space and time. The second example, which is physically more realistic, is to assimilate cloudy satellite brightness temperature–like quantities, generated from a stochastic multicloud model for tropical convection and a simple radiative transfer model.

## 1. Introduction

Data assimilation (see, e.g., Kalnay 2003; Majda and Harlim 2012) is a sequential method of estimating the conditional distribution of hidden state variables $\mathbf{x}_i$ given noisy observations $\mathbf{y}_i$ through Bayes's formula:

$$p(\mathbf{x}_i \mid \mathbf{y}_i) \propto p(\mathbf{y}_i \mid \mathbf{x}_i)\, p(\mathbf{x}_i), \qquad (1)$$

where $p(\mathbf{x}_i)$ is the prior distribution of the state variables of interest and $p(\mathbf{y}_i \mid \mathbf{x}_i)$ is the likelihood function corresponding to the observation model, $\mathbf{y}_i = h(\mathbf{x}_i) + \boldsymbol{\eta}_i$, at time *i*. Here, *h* denotes an observation function and $\boldsymbol{\eta}_i$ represents the measurement noise. In most data assimilation implementations in numerical weather prediction (NWP), one typically assumes that the observation model *h* is explicitly known. Moreover, the tacit assumption is that the noise variables $\boldsymbol{\eta}_i$ are independent Gaussian random variables with mean zero and a specified covariance matrix $\mathbf{R}$. With these assumptions, the likelihood function is parametrically defined as $p(\mathbf{y}_i \mid \mathbf{x}_i) \sim \mathcal{N}(h(\mathbf{x}_i), \mathbf{R})$. In NWP applications, the prior density $p(\mathbf{x}_i)$ at time *i* in (1) is usually represented by an ensemble of forecast solutions (Epstein 1969) of a general circulation model.

For satellite data assimilation the observations can be radiances or brightness temperatures measured by satellite instruments. In this particular application, the typical observation model *h* is the radiative transfer model [see, e.g., Liou (2002) for the basic formulation]. Various software options for advanced radiative transfer models are available, for example, the Community Radiative Transfer Model (CRTM; Chen et al. 2008) and RTTOV (Saunders et al. 1999), and many satellite data assimilation experiments/studies (e.g., Heilliette et al. 2013; Otkin 2010) are based on these software packages. While high-resolution infrared spectral radiances contain detailed information about the temperature and humidity profiles (McNally et al. 2006), there are practical issues in assimilating the AIRS satellite measurements, for example, the presence of clouds (Reale et al. 2008). Assimilating satellite measurements under cloudy conditions is challenging since the presence of clouds in the atmosphere induces significantly cooler radiances (which can be viewed as large biases) relative to those measured under clear-sky conditions. The main challenge with this problem is in detecting the observation model error that can occur intermittently as a result of the misspecification of the cloud-top height, the number of clouds in a column of the atmosphere, and/or the cloud fraction in the radiative transfer model of cloudy measurements (McNally 2009). Mathematically, this suggests that when an incorrect observation model, $\tilde{h} \neq h$, is used in place of the true observation function, *h*, the observations can be written as

$$\mathbf{y}_i = h(\mathbf{x}_i) + \boldsymbol{\eta}_i = \tilde{h}(\mathbf{x}_i) + \mathbf{b}_i + \boldsymbol{\eta}_i, \qquad (2)$$

where we introduce a biased model error, $\mathbf{b}_i := h(\mathbf{x}_i) - \tilde{h}(\mathbf{x}_i)$, in addition to the measurement error $\boldsymbol{\eta}_i$.

In this paper, we introduce a model error estimator to approximate the distribution of the error $\mathbf{b}_i$ at time *i*, assuming that the underlying observation function *h* is unknown. Formally, the model error estimator is a Bayesian nonparametric filter that estimates the time-dependent posterior density:

$$p(\mathbf{b}_i \mid \mathbf{y}_i) \propto p(\mathbf{y}_i \mid \mathbf{b}_i)\, p(\mathbf{b}_i). \qquad (3)$$

Here, the prior density, $p(\mathbf{b}_i)$, will be constructed based on the predicted observation error. The likelihood function, $p(\mathbf{y}_i \mid \mathbf{b}_i)$, will be constructed nonparametrically using a historical time series of **y** and **b**, and no knowledge of the true observation function is assumed. Our construction is based on a machine learning tool known as the *kernel embedding of conditional distributions* formulation introduced in Song et al. (2009, 2013). An additional novelty of our approach is that we generalize the formula in Song et al. (2009, 2013) to data-driven Hilbert spaces with basis functions obtained from the diffusion maps algorithm (Coifman and Lafon 2006; Berry and Harlim 2016b). This is in contrast to their original approach, where they specify the Hilbert space using a specific choice of basis functions, such as the radial basis type (Song et al. 2013). Once the prior and likelihood are specified, they are combined with (3) to form the posterior density of the model error. Finally, we use the statistics of the posterior, such as the mean and variance, to compensate for the error in the observation function and thereby improve the estimation of the state $\mathbf{x}_i$. It is important to stress that this model error estimator can be used as a subroutine in tandem with any data assimilation forecasting system based on (1).

We will demonstrate this idea with two synthetic numerical examples. The first example is with the Lorenz-96 (L96) model (Lorenz 1996), where the observations are corrupted by severe biases at random instances and locations to mimic the bimodality of the observation error distribution in real applications. The second example is the stochastic multicloud model (Biello et al. 2010) for tropical convection, where the observed brightness temperature–like quantities are constructed with a simple radiative transfer model (Liou 2002) with severe biases in the presence of clouds.

The remainder of this paper is organized as follows. In section 2, we describe our framework for estimating the model error, using the ensemble Kalman filter (EnKF) as an archetypal example. In section 3, we discuss the construction of the nonparametric filter in (3), which can be combined with any primary filter (1). In particular, we describe how to specify the likelihood function and the prior density in (3) by combining historical training data with the current observations and primary filter estimates. In sections 4 and 5 we demonstrate our method on the two numerical examples described above, and then we briefly conclude this paper.

## 2. An observation model error estimator

The key issue, as stated in (2), is to overcome the observation model error **b** when we have no access to the true observation function *h*. Let us outline a general framework for mitigating this issue with Fig. 1. In this diagram, we refer to the data assimilation system that is used to estimate **x** at time *i* in (1) as the primary filter. Since different operational NWP centers have their preferred methods (4DVAR, EnKF, hybrid, etc.), we will design our approach in such a way that it is applicable to any primary filtering scheme.

Our strategy is to apply a secondary filter in (3) and use the resulting posterior conditional statistics to correct the observation model error in the primary filter. In particular, we will use the posterior mean statistics to correct the biases and we will use the variances to correct the additional uncertainties beyond the measurement error. We should stress that the implementation of the secondary filter within this framework offers no additional changes to the infrastructure in the primary filter except for a simple two-way communication at each assimilation step. Namely, one needs to feed the predicted observation error from the primary filter into the secondary filter and then feed the model error statistics from the secondary filter back into the primary filter (to correct the likelihood functions in the primary filter). The third row in the diagram in Fig. 1 clarifies that we use the current observations to construct the prior density and the likelihood function used in the secondary filter. Now, let us illustrate this heuristic discussion with a concrete algorithm in the case where the primary filter is the ensemble Kalman filter (EnKF).

With the EnKF, both the prior and posterior distributions of the primary filter are assumed to be Gaussian and only the first two empirical moments are corrected using the Kalman filter formulas. The Kalman filter implicitly assumes that the observation error is unbiased: $\mathbb{E}[\mathbf{y}_i - h(\mathbf{x}_i)] = 0$. In the *i*th analysis step, an ensemble of *K* prior estimates $\{\mathbf{x}_i^{f,k}\}_{k=1}^{K}$, where *k* denotes the ensemble member, is transformed into an ensemble of analysis estimates $\{\mathbf{x}_i^{a,k}\}_{k=1}^{K}$. Here, the superscripts *f* and *a* are used to denote estimates before (prior/forecast) and after (analysis) the observation has been assimilated, respectively. In each analysis step, the EnKF procedure can be summarized in three steps. First, it approximates the prior mean and covariance statistics with empirically estimated statistics defined as

$$\bar{\mathbf{x}}_i^f = \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}_i^{f,k}, \qquad \mathbf{P}_i^f = \frac{1}{K-1}\sum_{k=1}^{K}\left(\mathbf{x}_i^{f,k} - \bar{\mathbf{x}}_i^f\right)\left(\mathbf{x}_i^{f,k} - \bar{\mathbf{x}}_i^f\right)^{\top}, \qquad (4)$$

where the forecast ensemble $\{\mathbf{x}_i^{f,k}\}$ is obtained by integrating the analysis ensemble from the previous step with the dynamical model. Second, it applies the Kalman filter formula to obtain the posterior mean and covariance:

$$\bar{\mathbf{x}}_i^a = \bar{\mathbf{x}}_i^f + \mathbf{K}_i\left(\mathbf{y}_i - \bar{\mathbf{y}}_i^f\right), \qquad \mathbf{P}_i^a = \mathbf{P}_i^f - \mathbf{K}_i \mathbf{P}_i^{yx}, \qquad (5)$$

with the Kalman gain $\mathbf{K}_i = \mathbf{P}_i^{xy}\left(\mathbf{P}_i^{yy} + \mathbf{R}\right)^{-1}$. Third, it draws an ensemble of analysis estimates $\{\mathbf{x}_i^{a,k}\}_{k=1}^{K}$ whose empirical statistics agree with (5). In (5), $\bar{\mathbf{y}}_i^f$, $\mathbf{P}_i^{xy}$, and $\mathbf{P}_i^{yy}$ are the empirically estimated statistics, where $\mathbf{y}_i^{f,k} = h(\mathbf{x}_i^{f,k})$ are the ensemble of the predicted observations and $\bar{\mathbf{y}}_i^f$ is their mean. Several methods for executing (5) and for drawing the analysis ensembles have been introduced, for example, with perturbed observations (Evensen 1994) or with square root filters (Bishop et al. 2001; Anderson 2001). Notice that this method depends on two parameters, $\mathbf{Q}$ and $\mathbf{R}$, that represent the covariance matrices of the dynamical and observation noise. In particular, when the observation model is perfect, we set $\mathbf{R} = \mathbf{R}^o$, meaning that the only observation noise is the measurement error: $\boldsymbol{\eta}_i \sim \mathcal{N}(0, \mathbf{R}^o)$.
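The three steps above can be condensed into a short NumPy sketch; we use the perturbed-observations variant (Evensen 1994) for concreteness, and all names below are ours rather than those of any particular EnKF code:

```python
import numpy as np

def enkf_analysis(X_f, y, h, R, rng=None):
    """One EnKF analysis step with perturbed observations.

    X_f : (K, n) forecast ensemble; y : (m,) observation;
    h   : observation operator mapping (n,) -> (m,);
    R   : (m, m) observation noise covariance.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    K = X_f.shape[0]
    Y_f = np.array([h(x) for x in X_f])        # predicted observations
    x_bar, y_bar = X_f.mean(0), Y_f.mean(0)
    Xp, Yp = X_f - x_bar, Y_f - y_bar          # ensemble perturbations
    P_xy = Xp.T @ Yp / (K - 1)                 # cross covariance
    P_yy = Yp.T @ Yp / (K - 1)                 # predicted-observation covariance
    gain = P_xy @ np.linalg.inv(P_yy + R)      # Kalman gain
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=K)
    return X_f + (y + eps - Y_f) @ gain.T      # analysis ensemble
```

Square root filters (Bishop et al. 2001; Anderson 2001) would replace the perturbed-observation draw with a deterministic transform of the ensemble perturbations.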

We assume that the true observation model *h* is unknown and that we are given an imperfect observation model $\tilde{h} \neq h$. We define $\mathbf{b}_i := h(\mathbf{x}_i) - \tilde{h}(\mathbf{x}_i)$ as the observation model error at the *i*th time step and assume that the observation model error is biased, that is, $\boldsymbol{\mu}_i := \mathbb{E}[\mathbf{b}_i] \neq 0$, and that $\mathbf{b}_i$ is independent of the measurement noise $\boldsymbol{\eta}_i$. With this configuration, it is clear that if we consider filtering with the imperfect observation model in (2), the observation error is biased: $\mathbb{E}[\mathbf{y}_i - \tilde{h}(\mathbf{x}_i)] = \mathbb{E}[\mathbf{b}_i + \boldsymbol{\eta}_i] = \boldsymbol{\mu}_i \neq 0$.

One way to mitigate this issue is by adjusting the observation model as follows:

$$\mathbf{y}_i = \tilde{h}(\mathbf{x}_i) + \boldsymbol{\mu}_i + \tilde{\mathbf{b}}_i + \boldsymbol{\eta}_i, \qquad (6)$$

where we define $\tilde{\mathbf{b}}_i := \mathbf{b}_i - \boldsymbol{\mu}_i$ as the unbiased observation model error with the covariance matrix $\mathbf{B}_i$. Assuming that $\tilde{\mathbf{b}}_i$ and $\boldsymbol{\eta}_i$ are independent, we can implement the EnKF with the unbiased observation model in (6) as follows:

$$\bar{\mathbf{x}}_i^a = \bar{\mathbf{x}}_i^f + \mathbf{K}_i\left(\mathbf{y}_i - \bar{\mathbf{y}}_i^f\right), \qquad \mathbf{P}_i^a = \mathbf{P}_i^f - \mathbf{K}_i \mathbf{P}_i^{yx}. \qquad (7)$$

Here, $\mathbf{K}_i = \mathbf{P}_i^{xy}\left(\mathbf{P}_i^{yy} + \mathbf{R}^o + \mathbf{B}_i\right)^{-1}$ and $\bar{\mathbf{y}}_i^f = \frac{1}{K}\sum_{k=1}^{K}\mathbf{y}_i^{f,k}$, where $\mathbf{y}_i^{f,k} = \tilde{h}(\mathbf{x}_i^{f,k}) + \boldsymbol{\mu}_i$ and $\mathbf{P}_i^{xy}$, $\mathbf{P}_i^{yy}$ are the empirical covariances of the forecast ensemble and the predicted observations. When $\mathbf{R}^o$ is given, we specify it directly; otherwise, we will estimate it together with the system covariance $\mathbf{Q}$ using the adaptive covariance estimation scheme of Berry and Sauer (2013). To implement the unbiased filter in (7), we need to estimate $\boldsymbol{\mu}_i$ and $\mathbf{B}_i$. We propose to use the conditional statistics of the secondary filter in (3), $\hat{\mathbf{b}}_i := \mathbb{E}[\mathbf{b}_i \mid \mathbf{y}_i]$ and $\hat{\mathbf{B}}_i := \mathrm{Var}(\mathbf{b}_i \mid \mathbf{y}_i)$, as the estimators for $\boldsymbol{\mu}_i$ and $\mathbf{B}_i$, respectively. With this goal in mind, we now explain the secondary filter.
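As a minimal illustration of how the secondary-filter statistics enter the primary filter, the scalar sketch below shifts the predicted observations by an estimated bias and inflates the observation noise variance by the estimated model error variance. The names `mu_hat` and `B_hat` stand for the estimators discussed above, and the setup is a hypothetical sketch, not the operational implementation:

```python
import numpy as np

def corrected_scalar_update(x_f, y, h_tilde, R_o, mu_hat, B_hat):
    """Scalar EnKF-style analysis with the bias-corrected observation
    model: predicted observations are shifted by the estimated bias
    mu_hat and the noise variance is inflated by B_hat.
    x_f : (K,) forecast ensemble of a scalar state."""
    y_f = h_tilde(x_f) + mu_hat                 # bias-corrected predictions
    P_xy = np.cov(x_f, y_f)[0, 1]               # state/observation covariance
    P_yy = np.var(y_f, ddof=1) + R_o + B_hat    # inflated innovation variance
    gain = P_xy / P_yy
    return x_f + gain * (y - y_f)               # analysis ensemble
```

For instance, if the hidden truth is near 1 and the observation carries a bias of −8, supplying `mu_hat = -8` lets the update recover the unbiased target.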

## 3. The secondary filter

The goal of this section is to explain the construction of the nonparametric filter in (3). Since this requires various technical tools from the machine learning community, we accompany the discussion with several appendixes that provide the details.

We assume that we are given only pairs of data points $\{\mathbf{x}_i, \mathbf{y}_i\}$ for training (so no knowledge of the true observation function *h* is assumed). The basic idea of our method is that one must gather simultaneous data of the state and observed variables (an admittedly nontrivial task), and our method will then use this dataset to correct the observation model error. We should also note that spatial inhomogeneity in the observation model would require applying our method separately at different locations using separate training datasets. The key feature of our approach is that no explicit modeling is involved; the observation model correction will be learned directly from the data.

From the data points $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{N}$, we compute the implied model error $\mathbf{b}_i = \mathbf{y}_i - \tilde{h}(\mathbf{x}_i)$, where $\tilde{h}$ is the given imperfect observation model. First, we will discuss how to train the likelihood function using this dataset. Second, we will discuss how to extend the likelihood function to new observations, that is, to obtain the conditional probability of **b** given new observations that are not part of the training dataset. Third, we will discuss how to construct a prior. Subsequently, we discuss how to extract the estimators $\hat{\mathbf{b}}_i$ and $\hat{\mathbf{B}}_i$ that are needed in (7).

### a. Training the nonparametric likelihood function

Our goal here is to learn the likelihood function by using a training dataset consisting of pairs $\{b_i, y_i\}_{i=1}^{N}$. The outcome of this training is a matrix of size $N \times N$, whose $(i,j)$th entry is an estimate of the conditional density, $p(b_i \mid y_j)$. This matrix is a discrete representation of the likelihood function evaluated at each pair of training data points $b_i$ and $y_j$, and it is a nonparametric representation of $p(b \mid y)$. However, since the dataset could be quite large, we will never explicitly construct this matrix; instead, it will be represented by its projection onto a lower-dimensional set of basis elements.

For clarity of presentation, we assume in this section that the observation is one-dimensional. To correct a multidimensional observation, the following bias correction steps are applied independently to each coordinate of the observation. With this scalar formulation, the computational cost of constructing the likelihood function becomes manageable and the secondary filter is easily parallelizable for high-dimensional observations.

Our main idea is to represent the conditional density with a set of basis functions that can be learned from the training dataset using the diffusion maps algorithm (Coifman and Lafon 2006; Berry and Harlim 2016b). First, let us discuss how to construct the basis functions. Subsequently, we discuss a nonparametric representation of the conditional density functions.

#### 1) Learning the data-driven basis functions

In a nutshell, the diffusion maps algorithm can be described as follows. Given a dataset $\{y_i\}_{i=1}^{N}$ with sampling density *q*, defined with respect to the volume form that the data manifold inherits from the ambient space, the diffusion maps algorithm is a kernel-based method that constructs an $N \times N$ matrix $\mathbf{L}$ approximating a weighted Laplacian operator, $\mathcal{L}$. The eigenvectors of the matrix $\mathbf{L}$ are discrete estimates of the eigenfunctions $\varphi_j$ of the operator $\mathcal{L}$, which form an orthonormal basis of a weighted Hilbert space $L^2(q)$.

For example, if the data are uniformly distributed on the unit circle $S^1$, then the matrix $\mathbf{L}$ approximates the Laplacian on this periodic domain. In this case, eigenvectors of $\mathbf{L}$ approximate eigenfunctions of the Laplacian operator on the unit circle, which are the Fourier functions that form an orthonormal basis of $L^2(S^1)$. Thus, one can think of the diffusion maps algorithm as a method for specifying a generalized Fourier basis adapted to the dataset and the sampling measure *q*. More generally, it has been shown by Berry and Sauer (2016) that the $\varphi_j$ are eigenfunctions of a Laplace–Beltrami operator on the data manifold with respect to a Riemannian metric whose volume form is weighted by the sampling density. The basis functions are represented nonparametrically by the vectors $\boldsymbol{\varphi}_j \in \mathbb{R}^N$, whose $i$th component is a discrete estimate of the eigenfunction $\varphi_j$ evaluated at the training data point $y_i$.

In our application, we apply the diffusion maps algorithm separately on the datasets $\{b_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{N}$. Let $q_B$ and $q_Y$ be the sampling densities of $\{b_i\}$ and $\{y_i\}$, respectively. Implementing the diffusion maps algorithm, we obtain vectors $\boldsymbol{\psi}_j$, which approximate the eigenfunctions $\psi_j(b)$, and $\boldsymbol{\varphi}_k$, which approximate the eigenfunctions $\varphi_k(y)$. In our implementation, we use the variable-bandwidth kernels introduced by Berry and Harlim (2016b). We refer to the appendix in Berry and Harlim (2016a) for the pseudocode of the algorithm.
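For intuition, a bare-bones fixed-bandwidth diffusion maps construction on a one-dimensional dataset is sketched below. This is a simplified stand-in: the actual implementation uses the variable-bandwidth kernels of Berry and Harlim (2016b), and the function name and bandwidth are our own choices:

```python
import numpy as np

def diffusion_maps_basis(data, eps, M):
    """Fixed-bandwidth diffusion maps on a 1D dataset: returns an (N, M)
    matrix whose columns are discrete estimates of the leading basis
    functions (the first column is the constant function)."""
    d2 = (data[:, None] - data[None, :]) ** 2     # pairwise squared distances
    W = np.exp(-d2 / (4.0 * eps))                 # Gaussian kernel
    q = W.sum(1)
    W1 = W / np.outer(q, q)                       # remove sampling density bias
    d = W1.sum(1)
    S = W1 / np.sqrt(np.outer(d, d))              # symmetric normalization
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(-vals)[:M]                   # leading eigenvectors
    phi = vecs[:, idx] / np.sqrt(d)[:, None]      # right eigenvectors of the Markov matrix
    phi /= np.linalg.norm(phi, axis=0) / np.sqrt(len(data))
    return phi
```

On uniformly spaced data, the resulting columns approximate the constant function and cosine-like modes, consistent with the generalized Fourier interpretation above.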

#### 2) A nonparametric representation of conditional density functions

In this section we develop a method of representing conditional density functions based on reproducing kernel Hilbert spaces (RKHS). For further details see appendixes A and B, where we generalize previous methods to weighted RKHS. Let $\{\psi_j(b)\}$ and $\{\varphi_k(y)\}$ be the basis functions approximated by the diffusion maps algorithm. For a finite number of modes, *M*, we consider a nonparametric representation of the conditional density as follows:

$$p(b \mid y) \approx q_B(b) \sum_{j=1}^{M} c_j(y)\, \psi_j(b), \qquad (8)$$

where the expansion coefficients are defined as

$$c_j(y) = \sum_{k=1}^{M}\left[\mathbf{C}_{\psi\varphi}\, \mathbf{C}_{\varphi\varphi}^{-1}\right]_{jk} \varphi_k(y). \qquad (9)$$

Here, the matrix $\mathbf{C}_{\psi\varphi}$ is $M \times M$ and the matrix $\mathbf{C}_{\varphi\varphi}$ is $M \times M$, whose components can be approximated by Monte Carlo averages as follows:

$$\left[\mathbf{C}_{\psi\varphi}\right]_{jk} \approx \frac{1}{N}\sum_{i=1}^{N} \psi_j(b_i)\, \varphi_k(y_i), \qquad (10)$$

$$\left[\mathbf{C}_{\varphi\varphi}\right]_{kl} \approx \frac{1}{N}\sum_{i=1}^{N} \varphi_k(y_i)\, \varphi_l(y_i). \qquad (11)$$

The equation for the expansion coefficients in (9) is based on the theory of kernel embedding of conditional distributions introduced in Song et al. (2009, 2013), which we review in appendix A below. See appendix B for the detailed proof of (9)–(11).

From the expression in (9), one can see that the conditional density in (8) is represented as a regression in infinite-dimensional spaces with basis functions $\psi_j$ and $\varphi_k$. This representation is nonparametric in the sense that we do not assume that the density function is of a particular distribution. Numerically, the nonparametric representation of $p(b \mid y)$ is given by an $N \times N$ matrix whose $(i,j)$th component is

$$\hat{p}(b_i \mid y_j) = q_B(b_i) \sum_{l=1}^{M} c_l(y_j)\, \psi_l(b_i), \qquad (12)$$

where the coefficients $c_l(y_j)$ are given by (9) evaluated at the training data. From (12), notice that all we need are the function values $\psi_l(b_i)$, $\varphi_k(y_j)$, and $q_B(b_i)$, which are obtained via the diffusion maps algorithm. We should note that the sampling density $q_B$ is estimated using a kernel density estimation method in our implementation of the diffusion maps algorithm [see the appendix in Berry and Harlim (2016a) for the algorithmic details]. In our numerical implementation, we will truncate the expansion at a fixed number of modes, *M*.
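The training computation in (8)–(12) reduces to a few matrix products. The sketch below illustrates it on synthetic data of our own construction, substituting an analytic cosine basis (orthonormal for a uniform sampling density, so $q_B = 1$) for the diffusion-maps eigenfunctions:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 5000, 15

# synthetic training pairs: y ~ U(0, 1) and b = y + small Gaussian noise
y = rng.uniform(0.0, 1.0, N)
b = np.clip(y + 0.05 * rng.standard_normal(N), 0.0, 1.0)

def cosine_basis(u, M):
    """Orthonormal cosine basis on [0, 1]; a stand-in for the diffusion
    maps eigenfunctions when the sampling density is uniform (q = 1)."""
    k = np.arange(M)
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(u, k))
    B[:, 0] = 1.0
    return B                                   # shape (len(u), M)

Psi, Phi = cosine_basis(b, M), cosine_basis(y, M)
C_psiphi = Psi.T @ Phi / N                     # Monte Carlo average, Eq. (10)
C_phiphi = Phi.T @ Phi / N                     # Monte Carlo average, Eq. (11)
coef = C_psiphi @ np.linalg.inv(C_phiphi)      # bracketed matrix in Eq. (9)

def cond_density(b_grid, y_star):
    """Evaluate the estimate of p(b | y_star) on a grid via Eq. (8)."""
    c = coef @ cosine_basis(np.array([y_star]), M)[0]   # coefficients c_j(y_star)
    return cosine_basis(b_grid, M) @ c         # q_B = 1 for uniform samples

grid = np.linspace(0.0, 1.0, 401)
p_est = cond_density(grid, 0.5)                # concentrates near b = 0.5
```

Because `b` is constructed to track `y` with small noise, the estimated conditional density peaks near the conditioning value and integrates to approximately one.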

### b. Extension of the likelihood function to new observations

To construct $p(b \mid y_i)$ for an observation $y_i$ that is not in the training dataset, (8) suggests that we need to extend the eigenfunctions $\varphi_k$ to this new data point, $y_i$. One approach would be to use the Nyström extension (Nyström 1930); however, since the observation contains the additive noise $\eta_i$, a more robust method is to compute the weights

$$w_s(y_i) = \frac{\exp\left[-(y_i - y_s)^2/(2R^o)\right]}{\sum_{s'=1}^{N}\exp\left[-(y_i - y_{s'})^2/(2R^o)\right]}, \qquad s = 1, \ldots, N,$$

which are the probabilities of the training data points given the observation $y_i$. We then estimate the expected value of $\varphi_k$:

$$\mathbb{E}[\varphi_k](y_i) = \sum_{s=1}^{N} w_s(y_i)\, \varphi_k(y_s),$$

where the expectation is taken with respect to the measurement noise $\eta_i$. We then compute the expectation of the conditional density in (8) with respect to this noise:

$$\mathbb{E}[p(b \mid \cdot)](y_i) = q_B(b) \sum_{j,k=1}^{M} \left[\mathbf{C}_{\psi\varphi}\,\mathbf{C}_{\varphi\varphi}^{-1}\right]_{jk} \mathbb{E}[\varphi_k](y_i)\, \psi_j(b), \qquad (13)$$

where the coefficients are defined in (9) and computed in the training phase. We use the expectation in (13) as an estimate of the likelihood function at the new data point. To conclude, we represent the likelihood function nonparametrically by an *N*-dimensional vector whose $s$th component is $\mathbb{E}[p(b_s \mid \cdot)](y_i)$, as defined in (13). In the remainder of this paper, we denote this nonparametric likelihood function as the *RKHS likelihood function*.
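The noise-weighted extension can be sketched as follows, assuming scalar observations and Gaussian measurement noise with variance `R_o` (function and variable names are ours):

```python
import numpy as np

def noisy_extension(y_new, y_train, Phi, R_o):
    """Extend basis functions to a new noisy observation by averaging the
    training values against the measurement-noise density, a noise-robust
    alternative to the Nystrom extension. Phi is the (N, M) matrix of
    basis values at the training points."""
    w = np.exp(-0.5 * (y_train - y_new) ** 2 / R_o)   # Gaussian noise weights
    w /= w.sum()                                      # normalize to probabilities
    return w @ Phi                                    # expected basis values, shape (M,)
```

In the limit of small `R_o`, the weights concentrate on the nearest training points, so the result reduces to an interpolation of the basis values, while larger `R_o` averages over a wider neighborhood.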

### c. Prior density functions

An appropriate construction of the prior distribution $p(\mathbf{b}_i)$ at each data assimilation time *i* is essential for accurate filter estimates. This is especially true when the likelihood function is bimodal as a function of **b**, as we will see in the numerical applications below. In this article, we consider a Gaussian prior distribution whose mean and variance are given by the statistics of the predicted observation error from the primary filter:

$$\bar{b}_i = y_i - \frac{1}{K}\sum_{k=1}^{K}\tilde{h}\left(x_i^{f,k}\right), \qquad \sigma_i^2 = \mathrm{Var}_k\left[y_i - \tilde{h}\left(x_i^{f,k}\right)\right]. \qquad (14)$$

Given a training dataset $\{x_s, y_s\}_{s=1}^{N}$, we can compute the training biases $b_s = y_s - \tilde{h}(x_s)$ such that the prior density evaluated at the training data is given by

$$p(b_s) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left[-\frac{(b_s - \bar{b}_i)^2}{2\sigma_i^2}\right], \qquad s = 1, \ldots, N. \qquad (15)$$

So, $p(\mathbf{b}_i)$ is represented by an *N*-dimensional vector whose $s$th component is $p(b_s)$, as defined in (15). As an alternative to the time-dependent variance defined in (14), one can also use the climatological variance obtained from the historical training data (this is analogous to a 3DVAR prior). While more complicated priors are always possible, we will run our numerical experiments below using these Gaussian prior densities.

### d. Statistics of the posterior distribution

Now that we have defined the likelihood in (13) and the prior in (15), we can multiply these terms to form the posterior density:

$$p(b_s \mid y_i) = \frac{1}{Z}\, p(y_i \mid b_s)\, p(b_s),$$

where $Z = \frac{1}{N}\sum_{s=1}^{N} p(y_i \mid b_s)\, p(b_s)/q_B(b_s)$ is the normalization constant estimated via Monte Carlo summation. With this posterior density estimate, we can compute the posterior mean and variance:

$$\hat{b}_i = \frac{1}{NZ}\sum_{s=1}^{N} b_s\, \frac{p(y_i \mid b_s)\, p(b_s)}{q_B(b_s)}, \qquad \hat{B}_i = \frac{1}{NZ}\sum_{s=1}^{N} \left(b_s - \hat{b}_i\right)^2 \frac{p(y_i \mid b_s)\, p(b_s)}{q_B(b_s)}, \qquad (16)$$

as estimators for $\mu_i$ and $B_i$ that can be used in the primary filtering step in (7). [Notice that the data points $b_s$ have distribution $q_B$, which means that the summation $\frac{1}{N}\sum_s b_s\, p(b_s \mid y_i)$ would be a Monte Carlo estimator of $\int b\, p(b \mid y_i)\, q_B(b)\, db$. To estimate the mean of the conditional density, we need to remove the factor $q_B$ from this integral, so we divide each term by $q_B(b_s)$.] This completes our description of the secondary filter. In the remainder of this paper, we denote the EnKF in (7) with the observation error corrected using the statistics in (16) as the *RKHS filter*.
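The posterior statistics in (16) amount to a weighted Monte Carlo sum over the training points; a minimal sketch (names are ours) follows:

```python
import numpy as np

def posterior_stats(b_train, like, prior, q_B):
    """Posterior mean and variance as in (16): weight each training point
    b_s by likelihood * prior / sampling density, so the sums over the
    training data become Monte Carlo integrals against db rather than
    against q_B(b) db."""
    w = like * prior / q_B      # unnormalized posterior weights
    w = w / w.sum()             # self-normalization (absorbs Z)
    mean = np.sum(w * b_train)
    var = np.sum(w * (b_train - mean) ** 2)
    return mean, var
```

This is standard self-normalized importance sampling with the training density $q_B$ as the proposal, which is why dividing by `q_B` is essential.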

## 4. Example 1: Assimilating random “cloudy” observations

In this example we will introduce a severe error into the observations of the 40-dimensional chaotic Lorenz (1996) system:

$$\frac{dx_j}{dt} = x_{j-1}\left(x_{j+1} - x_{j-2}\right) - x_j + F, \qquad F = 8, \qquad (17)$$
where the indices *j* are taken modulo 40. The underlying truth is generated by the RK4 integration scheme, with a fixed integration time step and randomly chosen initial conditions, and has a climatological standard deviation of approximately 3.6. Direct observations are taken at every discrete time step $\Delta t$, which we will vary between 0.1 and 0.5. We will show results for observing every other grid point (20 observations). At each observation time $t_i$, we randomly choose up to seven of the 20 observed locations and obstruct the corresponding observations with a "cloud" as follows. We draw the number of obstructed locations, *c*, uniformly from $\{0, 1, \ldots, 7\}$ and let the observations at these *c* locations be

$$y_{i,j} = \begin{cases} a\, x_{i,j} - 8 & \text{with probability } 0.8, \\ x_{i,j} & \text{with probability } 0.2, \end{cases} \qquad (18)$$

where *a* denotes a random scale factor.

Effectively, up to 7 random locations out of 20 direct measurements have an 80% chance of being randomly scaled and shifted down by eight units. In our experiment, the observed dataset is generated with this hidden observation model *h* and the filter observation model is $\tilde{h}(x) = x$. Therefore, when $h(x_{i,j}) = x_{i,j}$, we directly observe the state and the observation model is correct; we refer to this case as an *unobstructed* ("clear sky") observation. On the other hand, when $h(x_{i,j}) \neq x_{i,j}$, the filter observation model is incorrect, $\tilde{h} \neq h$, and we refer to this case as an *obstructed* ("cloudy") observation. Notice that at each time there are up to seven randomly chosen cloudy locations. To represent the measurement error, we also include additive Gaussian noise, $\eta_{i,j} \sim \mathcal{N}(0, R^o)$, in the observations: $y_{i,j} = h(x_{i,j}) + \eta_{i,j}$. Since the obstructions in the observation *h* appear randomly, they are impossible to predict, and we assume that the modeler only has the incorrect observation model $\tilde{h}$. For comparison, we also generate a set of clear-sky observations, $y_{i,j} = x_{i,j} + \eta_{i,j}$, so that we can test the standard EnKF in the case when there is no model error.
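A sketch of this observation process in NumPy follows; the function name is ours, and since the text leaves the random scale factor unspecified, the value `scale=0.5` below is a placeholder of our choosing:

```python
import numpy as np

def cloudy_observations(x, rng, n_obs=20, max_clouds=7, p_cloud=0.8, scale=0.5):
    """Generate obstructed ("cloudy") observations of the L96 state:
    observe every other grid point, pick up to max_clouds random sites,
    and with probability p_cloud scale the observed value and shift it
    down by eight units. The value scale=0.5 is a placeholder; the text
    does not specify the random scale factor."""
    y = x[::2].copy()                          # observe every other grid point
    c = rng.integers(0, max_clouds + 1)        # number of candidate cloudy sites
    sites = rng.choice(n_obs, size=c, replace=False)
    cloudy = sites[rng.random(c) < p_cloud]    # each site obstructed w.p. p_cloud
    y[cloudy] = scale * y[cloudy] - 8.0        # "cloud": scaled and shifted down
    return y
```

Additive measurement noise $\eta_{i,j}$ would then be added to the returned vector to produce the assimilated observations.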

In Fig. 2 we show a scatterplot that clearly illustrates the bimodal distribution of the observations as a function of the model error. Notice that the clear-sky and cloudy observations are both permitted in an overlapping range of values, making it impossible to tell from the observations alone whether the observation is obstructed or not. The conditional density, $p(b \mid y)$, is trained from a short training dataset $\{b_i, y_i\}_{i=1}^{N}$, where *N* = 10 000. In this example, since the model error is spatially homogeneous, we fit a single model and use it for each of the 20 observations. This assumption also allows us to use only 500 time steps of training data (since each time step contains 20 observations, this short time series gives us *N* = 10 000 training pairs). Below, we will apply each filter to 8000 time steps of observations and we compute the RMS error using the last 5000 time steps of the filter estimates. In this numerical experiment, the likelihood function is constructed using a truncated set of *M* eigenfunctions.

In the bottom panels of Fig. 2, we show examples of the RKHS likelihood functions (red, bottom left and right), both of which result from nearly identical observed values, but in one case the observation is obstructed (bottom left) and in the other the observation is unobstructed (bottom right). In these panels, we also show the Gaussian prior distribution as proposed in (15) with a time-independent variance obtained by taking the variance of the training data $\{b_i\}$. Notice that the prior helps the secondary filter to identify whether the observation error is small (Fig. 2, bottom left) or large (Fig. 2, bottom right). In this case, the climatological prior variance is sufficient to give a relatively informative prior. In the next example, we will need a time-dependent prior since the climatological variance is too broad.

In Fig. 3, we show the spatiotemporal structure of the observations, the true state, and the EnKF estimates without and with the observation model error correction. Notice that the observations are severely corrupted (see the deep blue dots) and these obstructions occur completely randomly in space and time. Note also that the cloudy observations cause observation model error because we assume no knowledge of the true observation function *h*, and we filter using $\tilde{h}(x) = x$ at each observed location. Here, the primary filter is implemented with 80 ensemble members and an adaptive covariance estimation of the system and observation noise covariances, $\mathbf{Q}$ and $\mathbf{R}$ [see Berry and Sauer (2013) for details]. Below, we will also include results without the adaptive covariance estimation for comparison. We use the abbreviation EnKF to refer to the primary filter without the observation model error correction, that is, $\hat{\mathbf{b}}_i = 0$ and $\hat{\mathbf{B}}_i = 0$ in (7), and we use the abbreviation RKHS to refer to the EnKF with the observation model error corrected by (16).

In Fig. 4, we show the root-mean-square errors (RMSEs) between the truth and the filtered estimates as functions of the measurement error standard deviation $\sqrt{R^o}$ and the observation time $\Delta t$. Each RMSE is averaged over 5000 filter steps after removing an initial transient to allow each filter to converge. In each panel in Fig. 4, we include the results of applying the EnKF to unobstructed (clear sky) observations (black, solid line), which should be considered the best possible filtering since the observation model is correct when the observations are all unobstructed. On the other hand, when the observations are obstructed (cloudy sky), the observation model is incorrect and the results (red, dotted line) degrade significantly [notice that gaps in the curve indicate catastrophic filter divergence; Gottwald and Majda (2013)].

The results shown in the top row of Fig. 4 correspond to implementing the filter using the true value $\mathbf{R}^o$ and a fixed small additive inflation. In this case, the classical EnKF estimates diverge in the presence of this severe observation model error. This result suggests that the empirical additive inflation (which is commonly used to compensate for model error in applications) is not able to overcome the ill-posedness induced by the bimodality of the model error distribution in this example. In the bottom row of Fig. 4, we include results that use $\mathbf{Q}$ and $\mathbf{R}$ parameters determined by the adaptive covariance estimation method (Berry and Sauer 2013). Notice that the EnKF applied to the cloudy observations with the adaptive estimation of $\mathbf{Q}$ and $\mathbf{R}$ is able to prevent catastrophic filter divergence except at the largest observation times. For both the fixed-parameter and adaptive filters, it is clear that the RKHS filter given the cloudy observations approaches the performance of the EnKF given the clear-sky observations in the limit of small measurement error and small observation time.

In Fig. 5 we plot the average of the diagonal entries of the $\mathbf{Q}$ and $\mathbf{R}$ parameters estimated by the adaptive covariance estimation as functions of both the measurement error covariance and the observation time. First, the EnKF with clear-sky observations recovers the true value of $\mathbf{R}^o$ with high accuracy when the observation time is relatively small. Notice that when the EnKF is given the cloudy observations, the estimates for $\mathbf{Q}$ and $\mathbf{R}$ are much larger than the estimates when given the clear-sky observations. Moreover, the estimates from the RKHS filter are closer to those of the EnKF given the clear-sky observations. Intuitively, this means that the RKHS model error correction is effective and that the adaptive method only needs to compensate for the remaining measurement error, which is present even in the clear-sky observations. In other words, the RKHS correction effectively removes the clouds from the cloudy observations and achieves results close to those of the EnKF given the clear-sky observations.

In this example, we considered only a single type of model error, and the spatial homogeneity of the model allowed us to correct the model error based on a single training dataset. Since our method is pointwise, we can always learn the nature of the error at each location separately, if it differs, so long as training data at these locations are available. In the next section, we will demonstrate a situation where at a single location we have 16 observations, each with a distinct error distribution. In such cases, we correct the error by training on each observation separately.

## 5. Example 2: Assimilating cloudy "satellite-like" measurements

Here, we consider assimilating satellite measurements of an idealized stochastic multicloud model for a single column of organized tropical convection (Biello et al. 2010). The model is a four-dimensional system of ODEs that represents the time evolution of the first and second baroclinic potential temperature anomalies, $\theta_1$ and $\theta_2$, respectively; an equivalent boundary layer potential temperature anomaly, $\theta_{eb}$; and a vertically averaged water vapor content, *q*, of a single-column tropical atmosphere (where *q* is rescaled to temperature units). These ODEs are coupled to a stochastic birth–death process that governs the area fractions of three cloud types—congestus, $f_c$; deep, $f_d$; and stratiform clouds, $f_s$—over an unresolved horizontal domain corresponding to the column model. From $\theta_1$ and $\theta_2$, we can extrapolate the potential temperature anomaly at height *z* (in km) using the following interpolation function (Biello et al. 2010):

$$\theta(z) = \theta_1 \sin\left(\frac{\pi z}{H}\right) + 2\,\theta_2 \sin\left(\frac{2\pi z}{H}\right), \qquad (19)$$

where *H* denotes the modeled height of the troposphere in the tropics (in km). In our numerical experiment, we simulate the truth with the parameters of case 1 of Table 5.2 in Biello et al. (2010), except that we set the reference value of the convective available potential energy (CAPE) to 200 J kg^{−1} to increase the convective activity and the area fractions. We should note that Deng et al. (2015) recently showed that when this stochastic multicloud model is coupled with an NCAR GCM, it is able to reproduce many realistic features of the Madden–Julian oscillation even in a coarse-resolution simulation. This positive result, plus the fact that we can simulate cloudy satellite-like measurements (as we discuss below and in appendix C), motivates us to test our approach on this simple model.

Below, we denote the state variables as $(\theta_1, \theta_2, \theta_{eb}, q)$ and the cloud fractions as $(f_c, f_d, f_s)$. We will assume that the cloud-top heights for both the deep and stratiform clouds are the same, $z_d$ (in km), whereas that for the congestus cloud is $z_c < z_d$. Based on these heights, we consider modeling the brightness temperature–like quantity $T_\nu$ at wavenumber *ν* with the following two-cloud-type radiative transfer model (Liou 2002):

where $\Gamma_\nu(z)$ denotes the transmission from height *z* to the satellite, which is assumed to be a decaying function of height that also depends on *q*. We specify the wavenumbers *ν* such that the weighting functions, $d\Gamma_\nu/dz$, are maximized at prescribed heights. In Fig. 6, we show the weighting functions (black solid) computed with the minimum humidity value (black dashed). We also include the weighting functions corresponding to the maximum humidity value (red solid). This is to show that, depending on the value of *q*, the shape of the weighting functions will vary between these two extremes. (See appendix C for the detailed construction of these weighting functions.)

Essentially, the terms inside the square brackets in (20) denote the contribution below the deep and stratiform clouds; the first term on the third row denotes the contribution from the deep cloud-top height; and the last term denotes the contribution of the free atmosphere above the deep cloud. By setting $f_c = f_d = f_s = 0$ in (20), we obtain the clear-sky observation model. We consider implementing the primary filter with an incorrect observation model $\tilde{h}$ that assumes the observed brightness temperatures are from clear-sky measurements. In Fig. 7, we show the scatterplot of the observations, which are defined as $y_\nu = T_\nu + \eta$, as functions of the model error, $b_\nu = T_\nu - \tilde{T}_\nu$, where $\tilde{T}_\nu$ denotes the clear-sky brightness temperature. In this example the measurement errors are specified as a percentage of the variance of the measured variables; that is, $\eta \sim \mathcal{N}(0, R^o)$ with $R^o$ equal to 0.1% of the climatological variance of the observation. Below, we will also check the robustness of our method for larger measurement errors.

We observe at each observation time and use an ensemble size of . In Fig. 8, we show examples of the secondary filter at two instances for observations at the wavenumber *ν* corresponding to a weighting function with km. The two observations are relatively close to each other, and , but the former has a very small observation error while the latter has a relatively large one (see the blue dots in these two figures). Here, the nonparametric likelihood functions are trained with and . To correct the observation model error, we run the secondary filter in (3) with the Gaussian prior with time-dependent variances as defined in (14) (gray curves). In this example, we could not use the time-independent variance, , from the training data, since the resulting prior is too broad and therefore uninformative. Moreover, since is very small (see, e.g., Fig. 7), the support of the prior can fall almost entirely between the two modes of a bimodal likelihood function, which makes the posterior normalization factor *Z* very small. Thus, when *Z* is found to be less than a threshold , we do not apply the secondary filter, since in these cases the likelihood function does not provide enough information to inform it. When the prior is on the tail of the likelihood function, we do not apply this thresholding step. In our numerical implementation, we found our results to be robust across a wide range of thresholds .
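The thresholding logic described above can be sketched as follows; the grid discretization, densities, and threshold value `z_min` are illustrative stand-ins for the quantities in (3), not the paper's settings:

```python
import numpy as np

def secondary_update(prior_pdf, likelihood, grid, z_min=1e-6):
    """Bayes update on a uniform grid; the correction is skipped when the
    posterior normalization factor Z falls below the threshold z_min."""
    dz = grid[1] - grid[0]
    unnorm = prior_pdf * likelihood
    z = np.sum(unnorm) * dz                # posterior normalization factor Z
    if z < z_min:
        return prior_pdf, False            # likelihood uninformative: keep the prior
    return unnorm / z, True

# toy case: a narrow Gaussian prior sitting between the two modes of a
# bimodal likelihood, so Z is tiny and the update is skipped
grid = np.linspace(-5.0, 5.0, 2001)
prior = np.exp(-0.5 * (grid / 0.1) ** 2) / (0.1 * np.sqrt(2.0 * np.pi))
bimodal = 0.5 * np.exp(-0.5 * ((grid - 3.0) / 0.3) ** 2) \
        + 0.5 * np.exp(-0.5 * ((grid + 3.0) / 0.3) ** 2)
post, applied = secondary_update(prior, bimodal, grid)
```

When the likelihood has mass where the prior does, `secondary_update` returns the normalized posterior and the correction is applied.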

In Figs. 9 and 10, we show the filter estimates as time series for all seven model variables, , when the measurement error is or 0.1% of the observation variance. As a diagnostic, we also include an EnKF assimilating the same cloudy observations with the true observation function. Notice that the RKHS correction is as accurate as the EnKF implemented with the true observation function. On the other hand, the EnKF based on the incorrect observation model does not produce accurate estimates, despite using an inflated observation covariance matrix in the filter (a standard method for overcoming model error). In this example we do not use any adaptive covariance estimation; all filters use an additive covariance inflation and an inflated .
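The covariance inflation used by all filters here can be sketched generically; the additive and multiplicative values below are placeholders, not the settings used in the experiments:

```python
import numpy as np

def inflate_covariance(P, additive=0.0, multiplicative=1.0):
    """Covariance inflation P <- m * P + a * I. The default values are
    placeholders; the paper's actual inflation parameters are not shown."""
    return multiplicative * P + additive * np.eye(P.shape[0])

P = np.diag([0.5, 0.2, 0.1])               # toy forecast covariance
P_infl = inflate_covariance(P, additive=0.05)
```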

In Figs. 11 and 12, we compare the relative MSE of the filters as a function of the measurement error variance, , ranging from 0.1% to 2% of the observation variance. Here, the relative MSE is defined as the ratio between the MSE and the variance of the corresponding state variable, where the MSE is averaged over 2500 filter steps. As in the L96 example above, when the measurement error is small the RKHS performance approaches that of the EnKF given the true observation function. For large measurement error, the performance of all of the filters degrades significantly, even that of the EnKF given the true observation function. In Figs. 13 and 14, we show the filter estimates as time series for all seven model variables at the largest measurement error, 2% of the observation variance. In this case, notice that even the EnKF given the true observation function produces inaccurate filter estimates for some variables (with large errors in ). This illustrates the difficulty of filtering these highly nonlinear observations given by the RTM in the presence of large measurement error. For small measurement errors, the RKHS correction is again effective in overcoming the severe bias introduced by the observation model error.
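The relative MSE diagnostic defined above amounts to the following computation; the synthetic time series is illustrative only:

```python
import numpy as np

def relative_mse(estimate, truth):
    """Time-averaged squared error divided by the variance of the
    corresponding true state variable."""
    return np.mean((estimate - truth) ** 2) / np.var(truth)

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0.0, 20.0, 2500))       # stand-in state time series
estimate = truth + 0.1 * rng.standard_normal(2500)  # stand-in filter estimate
r = relative_mse(estimate, truth)                   # small relative MSE
```

A relative MSE well below one indicates that the filter estimate is substantially more informative than the climatological mean.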

## 6. Summary

We introduced a data-driven observation model error estimator. The estimator is a recursive Bayesian method that uses a nonparametric likelihood function, constructed by representing the kernel embedding of conditional distributions with data-driven basis functions obtained via the diffusion maps algorithm. The method is scalable to high-dimensional problems since it can be applied in parallel to correct the error in each component of a high-dimensional observation. While our main motivation is solving the cloudy satellite inference problem, the proposed framework can be used in general applications as long as a training dataset of the states and the corresponding observations is available, as in any statistical learning technique (Hastie et al. 2001). Specifically, for numerical weather prediction applications, if the goal is to improve the estimation of prognostic variables (such as wind speed, temperature, or humidity) by assimilating radiances, then the training phase of the proposed method requires a historical proxy for these prognostic variables [such as reanalysis; e.g., Kalnay et al. (1996)] together with the corresponding satellite radiance measurements, which may or may not have been used to approximate the true prognostic variables. If such data are not available, then we hope the basic results in this paper give observationalists a more compelling reason to collect them. Also, while the proposed observation model error correction technique is shown in tandem with an ensemble Kalman filter, the fact that we estimate the conditional distribution (and not just the statistics) of the error, , implies that this framework can be used with any filtering method.

In the examples above, we showed that the RKHS correction is able to overcome complex multimodal observation model error. When there is sufficient observability (e.g., a short observation time) and small measurement error, the RKHS correction approaches the filtering skill of the EnKF given the perfect observation function. In the presence of large measurement error or a long observation time, the RKHS correction degrades relative to the EnKF with the perfect observation function; however, it is still a significant improvement in both skill and stability over simply inflating the EnKF noise parameters, and . The key to the RKHS correction is using historical data to learn the conditional distribution of the model error. The RKHS then combines the current filter-predicted observation with the actual observation to determine the appropriate correction to the current observation model. This process depends on choosing an appropriate prior for the current model error; in particular, we presented two natural choices for the covariance of this prior. Future improvements may be possible by designing better or adaptive priors.

## Acknowledgments

The research of JH is partially supported by Office of Naval Research Grants N00014-16-1-2888 and MURI N00014-12-1-0912 and National Science Foundation Grant DMS-1317919.

### APPENDIX A

#### Kernel Embedding of Distributions

In this appendix we discuss the theory of kernel embeddings of distributions for constructing likelihood functions. We should point out that the review in this appendix follows the derivation in Song et al. (2009, 2013), except that we adapt their derivation to weighted Hilbert spaces, which will be required in appendix B.

Given a symmetric positive definite kernel, , the Moore–Aronszajn theorem states that there exists a reproducing kernel Hilbert space (RKHS): . This is a space of functions with a *reproducing property*; that is, for each , there exists such that . Moreover, , which implies that . The map from to given by is called the feature map.

Let **X** be a random variable with distribution defined on a domain . The *kernel embedding of a distribution * maps *P* to a function given by

which is the expectation of the feature map. For any ,

So is the Riesz representer of the expectation over .
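An empirical version of the mean embedding and its expectation property can be illustrated numerically; the Gaussian kernel, sample size, and test distribution below are illustrative choices, not part of the theory above:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y)."""
    return np.exp(-0.5 * (x - y) ** 2 / sigma ** 2)

rng = np.random.default_rng(1)
samples = rng.standard_normal(5000)        # draws from P = N(0, 1)

# empirical kernel mean embedding: mu(y) = (1/n) * sum_i k(x_i, y)
def mu(y):
    return np.mean(rbf(samples, y))

# by the reproducing property, <k(.,0), mu> = mu(0) estimates E[k(X, 0)];
# for X ~ N(0, 1) and sigma = 1 the exact value is 1/sqrt(2)
estimate = mu(0.0)
```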

Let **Y** be another random variable defined on another space , with another kernel , which maps to a feature space . We assume the following equality:

Then, the joint distribution can be mapped into the product feature space with the following definition:

Then, for any functions , we have

using the equality in (A2) and the definition of in (A3). Again, the cross-covariance operator is the Riesz representer of the expectation over .

The kernel embedding of the conditional distribution is defined as

and one can use the same argument to show that the embedding operator is the Riesz representer of the conditional expectation:

The main result from Song et al. (2013) that we will exploit is as follows:

Theorem 1. The kernel embedding of satisfies

where the operator is defined as in (A3) and .
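In finite samples, the embedding in (A5) is estimated by Gram-matrix algebra, with a ridge term added so that the empirical covariance can be inverted. The sketch below, with illustrative data, an illustrative Gaussian kernel bandwidth, and a regularization nλI that is an implementation detail rather than part of the theorem, recovers a conditional expectation:

```python
import numpy as np

def rbf(a, b, sigma=0.3):
    """Gaussian kernel Gram block between sample vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(-2.0, 2.0, n)
y = np.sin(x) + 0.05 * rng.standard_normal(n)   # Y | X = x ~ N(sin x, 0.05^2)

# empirical (regularized) conditional embedding:
# E[Y | X = x*] is approximated by y^T (K + n*lam*I)^{-1} k_{x*}
K = rbf(x, x)
lam = 1e-4
k_star = rbf(x, np.array([1.0]))
alpha = np.linalg.solve(K + n * lam * np.eye(n), k_star)
est = (y @ alpha).item()                         # close to sin(1)
```

As λ shrinks (with enough samples), the estimate approaches the true conditional expectation, here sin(1).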

Proof. This result depends on the identity derived in Song et al. (2009), which states that for ,

To see this, let and notice that

where we have used the fact that is the Riesz representer of the expectation with respect to and also the equality in (A4). For each , by the reproducing property of , one can write

### APPENDIX B

#### Proof of (9)–(11)

In this appendix, we will prove (9)–(11). For the discussion below, we define two random variables. The first, , with distribution and kernel such that is the RKHS with orthonormal basis functions that can be estimated from the data with sampling density . The second, , with distribution and kernel such that is the RKHS with orthonormal basis functions that can be estimated from the data with sampling density .

Taking the inner product of both sides of (8) with and using the orthogonality of the eigenfunctions , we obtain

Applying (A6) and the equality in (A5) from theorem 1 with the random variable **X** replaced by **B**, we have

In the second equality above, we used the reproducing property, , such that .

We define the projections of the operators and onto the space spanned by the basis functions , respectively, with matrices and whose components are

### APPENDIX C

#### A Simplified Radiative Transfer Model

In this appendix, we discuss the simplified radiative transfer model used in section 5. Given the physical variables in the stochastic multicloud model of Biello et al. (2010): the first and second baroclinic potential temperatures, and , respectively; the equivalent boundary layer potential temperature ; and the vertically averaged water vapor content *q*, we consider an idealized radiative transfer model for the brightness temperature at wavenumber *ν* in clear-sky conditions, following the standard formulation in Liou (2002):

where and the temperature at height *z* is defined as in (19). The first term on the right-hand side of (C1) denotes the contribution from the surface and the second term denotes the contribution from the whole column of the atmosphere, which is a weighted average of the temperature with the weighting function: .

In (C1), we define the transmission from height *z* to , assuming the height of the satellite instrument is much larger than that of the troposphere, as follows:

Here, denotes the absorption rate, which is assumed to decrease exponentially as a function of height:

where denotes a reference absorption rate that is to be determined. In our simulation, we set km. With this assumption, we have

and . Also, the weighting functions become

We determine the wavenumber *ν* by specifying corresponding to the height at which the weighting function is maximized. That is, setting , we have

where *q* needs to be fixed to a specific reference value. To increase the sensitivity of the weighting function to , where *a* and *b* denote the lower and upper bounds, respectively, of *q*, which can be obtained from climatological data, we rescale *q* to fluctuate in between by defining

and define the absorption rate to be

This means that when , the corresponding will produce a weighting function, , that has a maximum weight at height . In Fig. 6, we show the weighting functions (black solid) with the reference humidity associated with km (black dashes). We also include the weighting functions corresponding to the humidity value (red solid), which shows that, depending on the value of *q*, the shape of the weighting functions varies between these two extremes.
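The construction above can be checked numerically. In the sketch below, the scale height H = 16 km and the target peak height are hypothetical values (not the paper's settings); the key identity is that choosing the reference absorption so that the transmission argument equals one at the target height places the maximum of the weighting function exactly there:

```python
import numpy as np

H = 16.0            # scale height (km); hypothetical value
z_star = 8.0        # target peak height (km) of the weighting function

# a(z) = a0 * exp(-z/H); choosing a0 * H * exp(-z_star/H) = 1 puts the peak at z_star
a0 = np.exp(z_star / H) / H

z = np.linspace(0.0, 30.0, 3001)
tau = np.exp(-a0 * H * np.exp(-z / H))     # transmission from height z to space
w = a0 * np.exp(-z / H) * tau              # weighting function dtau/dz
peak_height = z[np.argmax(w)]              # recovers z_star
```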

For a single type of cloud, the basic equation for the RTM of the cloudy sky measurement with cloud fraction *c* and cloud-top height is given as follows:

So, the first term denotes the contribution below the cloud, the second term denotes the contribution from the cloud-top height, and the last term denotes the contribution from the clear sky above the cloud-top height.
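A minimal numerical sketch of this decomposition, assuming an opaque cloud (so the below-cloud emission is blocked in the cloudy column), an idealized linear temperature profile, and the exponential transmission of (C2); all profile values below are hypothetical:

```python
import numpy as np

# hypothetical profile: scale height H, reference absorption a0,
# linear temperature profile; transmission as in (C2)
H, a0 = 16.0, 0.2
z = np.linspace(0.0, 30.0, 3001)
dz = z[1] - z[0]
T = 300.0 - 7.0 * z                        # idealized temperature profile (K)
tau = np.exp(-a0 * H * np.exp(-z / H))     # transmission from z to space
w = np.gradient(tau, z)                    # weighting function dtau/dz

def clear_sky(T, tau, w, z):
    # surface emission + weighted column emission (truncated at the grid top)
    return tau[0] * T[0] + np.sum(w * T) * dz

def cloudy(T, tau, w, z, c, z_top):
    # opaque cloud with fraction c: the below-cloud terms are replaced by
    # emission from the cloud top, plus the clear column above the cloud
    k = np.searchsorted(z, z_top)
    above = tau[k] * T[k] + np.sum(w[k:] * T[k:]) * dz
    return (1.0 - c) * clear_sky(T, tau, w, z) + c * above

tb_clear = clear_sky(T, tau, w, z)
tb_cloud = cloudy(T, tau, w, z, 1.0, 12.0)   # fully cloudy, 12-km cloud top
```

Two consistency checks mirror the text: setting the cloud fraction to zero recovers the clear-sky value, and a cold high cloud lowers the brightness temperature.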

The stochastic component of the multicloud model in Biello et al. (2010) is a birth–death process that accounts for the evolution of the cloud fractions, , of three cloud types—cumulus, deep, and stratiform clouds, respectively. Assuming that the deep and stratiform cloud-top heights are similar, namely km, and the cumulus cloud-top height is km, we consider a two-cloud type formulation:

Here, the second row denotes the contribution below the deep and stratiform clouds, the first term in the last row denotes the contribution from the deep cloud-top height, and the last term denotes the contribution from the atmospheric column above the cloud. One can check that if and/or , then this formulation is consistent with (C1) and (C2).

In our numerical simulations, we simulate the observations as follows (ignoring the time index ):

for 16 wavenumbers *ν* chosen such that the weighting functions, , are maximized at heights km. However, assuming that we know neither the cloud-top heights, , nor the cloud fractions, , we apply the filter with the clear-sky observation model, :

where denotes the error at wavenumber *ν*. In our implementation, we will either fix the parameter or estimate it adaptively using the method from Berry and Sauer (2013).

## REFERENCES

Hastie, T., R. Tibshirani, and J. Friedman, 2001: *The Elements of Statistical Learning*. Springer Series in Statistics, Vol. 1, Springer, 533 pp.

Kalnay, E., 2003: *Atmospheric Modeling, Data Assimilation, and Predictability*. Cambridge University Press, 341 pp.

Liou, K. N., 2002: *An Introduction to Atmospheric Radiation*. International Geophysics Series, Vol. 84, Academic Press, 583 pp.

Lorenz, E. N., 1996: Predictability: A problem partly solved. *Proc. Workshop on Predictability*, Reading, United Kingdom, ECMWF, 1–18.

Majda, A. J., and J. Harlim, 2012: *Filtering Complex Turbulent Systems*. Cambridge University Press, 357 pp.

Song, L., J. Huang, A. Smola, and K. Fukumizu, 2009: Hilbert space embeddings of conditional distributions with applications to dynamical systems. *Proc. 26th Annual Int. Conf. on Machine Learning*, Montreal, QC, Canada, ACM, 961–968.

## Footnotes

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).