EOF Calculations and Data Filling from Incomplete Oceanographic Datasets

J. M. Beckers, University of Liège, Liège, Belgium

and
M. Rixen, University of Liège, Liège, Belgium


Abstract

The paper presents a new self-consistent method to infer missing data from oceanographic data series and to extract the relevant empirical orthogonal functions. As a by-product, the new method allows for the detection of the number of statistically significant EOFs by a cross-validation procedure for a complete or incomplete dataset, as well as the noise level and interpolation error. Since the proposed filling and analysis method does not need a priori information about the error covariance structure, the method is self-consistent and parameter free.

Additional affiliation: National Fund for Scientific Research, Brussels, Belgium

Current affiliation: Southampton Oceanographic Centre, Southampton, United Kingdom

Corresponding author address: Dr. J. M. Beckers, University of Liège, MARE-GHER-AGO, Sart-Tilman B5, B-4000 Liège, Belgium. Email: JM.Beckers@ulg.ac.be


1. Introduction

Oceanographic and meteorological applications of empirical orthogonal decompositions have been commonly used for several decades and are well documented in textbooks (e.g., Preisendorfer 1988). Empirical orthogonal functions (EOFs) are however not a tool specific to the ocean sciences, but are widely used in a number of scientific domains under different names: proper orthogonal decomposition (POD) or Karhunen–Loève decompositions are mostly used in a functional framework [working on mathematical continuous functions; e.g., Lumley (1970); Holmes et al. (1996)], whereas the matrix versions (working on discrete datasets) are called proper orthogonal modes (POM), principal component analysis (PCA), factor analysis (e.g., Hair et al. 1998), or empirical orthogonal function analysis.

In oceanography, EOF analysis has been used for several kinds of applications, among them objective analysis (or optimal interpolation) of in situ data; for time-dependent 3D datasets, classic objective analysis methods are generally intractable and the use of vertical EOFs and horizontal objective analysis of their amplitudes drastically reduces the amount of data during the costly optimal interpolation, but takes into account the strong vertical correlations contained in the EOFs (Holbrook and Bindoff 2000). Other applications of EOFs include statistical comparison between data and model results (e.g., Beckers and MEDMEX 2002), data compression (e.g., Pedder and Gomis 1998), analysis of variability (e.g., Hendricks et al. 1996), and filtering (e.g., Vautard et al. 1992). Reducing the size of a problem by projecting (complex) reference model equations onto a small number of principal components (D'Andrea and Vautard 2001) is also a powerful application of EOF analysis.

Each of these applications basically exploits the fact that EOFs allow a decomposition of a signal into a series of data-based orthogonal functions, which are designed so that only a few of these functions are needed to optimally reconstruct the initial signal.

Contrary to most other domains, in oceanographic data and in particular in satellite data analysis, both noise in the data and missing data points are a concern and must be dealt with: in the datasets, there is a combination of missing data due to malfunctions of instruments, inaccessible data (e.g., due to cloud coverage), instrumental error bars (generally small, except for altimetric data, for which the precision of a few centimeters can be in the range of the dynamical signal), and noise related to small-scale features not resolved by the observational network (at least all features above the Nyquist frequency).

The aim of the present paper is to analyze the problem of missing or unreliable data in oceanographic datasets and in their EOF analysis. We will present the method for classic EOFs (allowing for complex data), although generalizations of the present method are possible, for example to extended EOFs, delayed EOFs, etc. (e.g., Emery and Thomson 1998); singular spectrum analysis (e.g., Vautard et al. 1992); or other transformations (e.g., Elmore and Richman 2001). For the sake of clarity and without loss of generality, we also focus on data given on a uniform spatiotemporal grid; for nonuniform grids, the approach of North et al. (1982) or Shen et al. (1998) (basically replacing the discrete problem by a continuous integral problem, allowing weighting of the data according to their associated spatial or temporal coverage) can be used to work numerically in a discrete uniform space.

We assume thus that we have a matrix 𝗫 containing the observations, which is arranged such that the element i, j of the matrix is called (𝗫)ij and is given by the value of the field f(r, t) at location ri and moment tj:
$$(\mathbf{X})_{ij} = f(\mathbf{r}_i, t_j) \qquad (1)$$
The field f is an observational field and thus contains all errors (instrumental, unresolved structures, etc.). We can then write the matrix as a succession of n column vectors,
$$\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) \qquad (2)$$
each of the column vectors xj being the discrete state vector of size m at moment tj. The state vector can be a 1D field, an ordered set of values from a 2D satellite image, etc. Values of the state vector do not need to be real but can accommodate complex values, allowing complex EOFs on 2D vectors, Fourier transforms, Hilbert transforms, etc. (e.g., Emery and Thomson 1998). EOFs can then be calculated easily by standard singular value decomposition (SVD) techniques (e.g., Lawson and Hanson 1974), which are based on methods solving the following eigenvalue problem:
$$\mathbf{X}\,\mathbf{v} = \rho\,\mathbf{u} \qquad (3)$$
$$\mathbf{X}^*\,\mathbf{u} = \rho\,\mathbf{v} \qquad (4)$$
the first system being of dimension m (as the eigenvector u and the discrete state variable x) and the second one of dimension n (as the eigenvector v). Here, u is often referred to as the spatial EOF and v as the temporal EOF, while ρ is the so-called singular value; 𝗫* is the adjoint (conjugate transpose) of the matrix 𝗫.
As an alternative to solving Eqs. (3) and (4), we can solve instead
$$\mathbf{X}\mathbf{X}^*\,\mathbf{u} = \rho^2\,\mathbf{u} \qquad (5)$$
$$\mathbf{X}^*\mathbf{X}\,\mathbf{v} = \rho^2\,\mathbf{v} \qquad (6)$$
showing that the spatial modes u are the eigenvectors of the time-averaged (or temporal) covariance matrix 𝗫𝗫*,¹ while v are the eigenvectors of the spatially averaged covariance matrix 𝗫*𝗫.
Eigenvectors are normalized, and by virtue of Eqs. (5) and (6), orthogonal:
$$\mathbf{u}_i^*\,\mathbf{u}_j = \delta_{ij}, \qquad \mathbf{v}_i^*\,\mathbf{v}_j = \delta_{ij} \qquad (7)$$
where δij is the Kronecker symbol (with a value of 1 if i = j and 0 otherwise). For the spatial patterns ui we retrieve orthogonality, while for the temporal patterns vi the orthogonality is sometimes expressed by saying that the time series of the ith EOF is uncorrelated with those of the other EOFs.
Once the eigenvectors and eigenvalues are known [the number of eigenvalues different from 0 is the rank q ≤ min(m, n)], we can decompose the initial matrix as follows:
$$\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^* \qquad (8)$$
where we have, for matrix 𝗫, defined matrices 𝗨 and 𝗩 so that their columns i are the eigenvectors ui and vi, respectively, corresponding to the singular value ρi. Matrix 𝗗 is then a rectangular matrix whose only nonzero values are on its diagonal, with (𝗗)ij = ρiδij. For the development, we use the classic convention of ordering the eigenvalues in descending order (from the largest to the smallest). The interest of the SVD compared to any other orthogonal decomposition lies in the fact that if, instead of using all q eigenvalues and eigenvectors, we use only the first k of them, it can be shown (e.g., Preisendorfer 1988) that, for a given number of modes k, no other choice of basis vectors ui leads to a better approximation of the matrix 𝗫 than the truncated series obtained with the first k EOFs. This means that when retaining only one mode, the corresponding EOF (with its evolution) is the one that on average allows for the best approximation of the matrix 𝗫 by a given spatial structure (the spatial EOF) whose amplitude evolves in time. A measure of how well an EOF mode represents the signal is contained in the singular values. Indeed, by construction,
$$\|\mathbf{X}\|^2 = \sum_{i=1}^{q} \rho_i^2 \qquad (9)$$
is a measure of the total variance (also sometimes called energy) in the system. The ratio $f_k = \rho_k^2 / \sum_{i=1}^{q} \rho_i^2$ is thus a measure of the variance contained in mode k compared to the overall energy, and one often says that mode k explains $100 f_k$% of the variance and that the first K modes explain $100 \sum_{k=1}^{K} \rho_k^2 / \sum_{k=1}^{q} \rho_k^2$% of the total variance. This ratio is often the basis for deciding the number of EOFs to retain in a given decomposition. A typical choice is to retain those modes that, when summed up, explain 95% of the signal. Another choice for optimal truncation is the so-called Preisendorfer and Overland criterion (Preisendorfer and Overland 1982): in this case, a Monte Carlo method generating random matrices of the same size decides when pure noise is likely to be interpreted as a signal if the series is not truncated.

To finish the general description of EOF decompositions, we can mention that when working on EOFs or covariances, some a priori background field or trend is subtracted from the data series (e.g., Kantha and Clayson 2000), notably to allow for an easy interpretation of the eigenvalues in terms of the energy of the variability [see Eq. (9)]. For the sake of symmetry in terms of space or time coordinates, we assume here that the a priori average we subtract from the field (in order to work with anomalies) is the spatiotemporal average of all available data.
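To make the decomposition concrete, the following minimal Python/NumPy sketch (an illustration with arbitrary random data; the array sizes, variable names, and the 95% threshold are our own choices, not values from the paper) subtracts the spatiotemporal mean, computes the SVD, the explained-variance fractions, and the best rank-k reconstruction:

```python
import numpy as np

# Minimal sketch of Eqs. (1)-(9): anomalies, SVD, explained variance,
# and the best truncated reconstruction. Data and threshold are arbitrary.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 30))       # m = 50 locations, n = 30 times
X = X - X.mean()                        # anomalies wrt the spatiotemporal mean

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V*, Eq. (8)
f = s**2 / np.sum(s**2)                 # variance fraction of each mode, Eq. (9)

k = int(np.searchsorted(np.cumsum(f), 0.95)) + 1   # modes explaining 95%
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of X
print(k, np.linalg.norm(X - Xk) ** 2 / np.linalg.norm(X) ** 2)
```

Note that np.linalg.svd returns the singular values in descending order, matching the ordering convention adopted above.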

2. Missing data

The SVD method presented above assumes that the matrix 𝗫 is perfectly and completely known. Currently, when dealing with missing or unreliable data, one fills in the missing data by objective analysis methods or optimal interpolation (e.g., Houseago-Stokes 2000): of course the problem with such an approach is that one needs information about the correlation function and the signal/noise ratio of the data for the interpolation (e.g., Rixen et al. 2000; Gomis et al. 2001). Since the purpose of using EOFs is often for the data analysis itself, another preliminary data analysis is not very natural and introduces a priori information into the subsequent EOF analysis. This filling method (and its variants that use simpler interpolation methods) is the most commonly used option to estimate the missing data values, because it consists of applying a classic analysis tool (e.g., Wunsch 1996) as preprocessing, before using any SVD or EOF analysis package.

Another method for dealing with missing data consists of calculating the eigenvalues and eigenvectors of the correlation or covariance matrix 𝗖, where the covariance is calculated using only the available data (Boyd et al. 1994; Kaplan et al. 1997; von Storch and Zwiers 1999). This, however, does not necessarily lead to a positive semidefinite covariance matrix (since it is no longer possible to write 𝗖 = 𝗫*𝗫, which would guarantee that the quadratic form x*𝗖x, with x a column vector, is never negative). Therefore its eigenvalues can be negative. Since the energy of the system is related to the trace of the covariance matrix, which is equal to the sum of its eigenvalues, having negative eigenvalues means that other eigenvectors will have higher eigenvalues than in reality, thus overestimating the corresponding amplitude of the EOF. In addition, once the eigenvalues and eigenvectors are calculated, one still faces the problem of deciding which EOFs are relevant. Due to the presence of negative eigenvalues, the selection criterion can no longer rely on the relative importance of the eigenvalues, but should also take into account the effect of the negative eigenvalues on the distribution of the larger ones. Moreover, using a temporal instead of a spatial covariance matrix (based on averaging the available data only) will generally lead to different eigenvalues (and thus singular values), whereas for a complete matrix the covariance matrices based on 𝗫𝗫* or 𝗫*𝗫 give identical results. Another way of interpreting this is based on the observation that the EOF coefficients (the product of the singular value and the temporal EOF) of a gappy data vector cannot simply be obtained by a dot product, but must be estimated by a least squares approach (von Storch and Zwiers 1999).
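The possible loss of positive semidefiniteness is easy to exhibit numerically. The toy sketch below (our own construction; the sizes and the 50% availability rate are arbitrary) assembles each covariance entry from the jointly available data pairs only; depending on the random draw, the smallest eigenvalue of the resulting matrix can indeed come out negative:

```python
import numpy as np

# Covariance assembled from jointly available pairs only: the result is
# symmetric but not guaranteed positive semidefinite.
rng = np.random.default_rng(3)
X = rng.standard_normal((6, 25))        # 6 locations, 25 times (anomalies)
avail = rng.random(X.shape) > 0.5       # True where a datum exists

C = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        both = avail[i] & avail[j]      # times where both series have data
        C[i, j] = np.mean(X[i, both] * X[j, both]) if both.any() else 0.0

print(np.linalg.eigvalsh(C))            # the smallest eigenvalue may be < 0
```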

Since both of the most commonly used approaches (filling, or covariances calculated from the available data only) contain some subjective components (choice of interpolation methods and parameters, level of significant EOFs, …), any method that does not need such additional choices could be of interest. The development of such a method is the purpose of the present work, which is organized as follows. First we present the interpolation method and analyze how it modifies the data at the missing data points in section 3a. Then we tackle the problem of deciding on the optimal number of EOFs to be retained (section 3b). We validate the method on two datasets: a synthetic one for objective quantification of the errors of the method (section 4a), and then a real Advanced Very High Resolution Radiometer (AVHRR) dataset in section 4b. Finally, conclusions are drawn in section 5.

3. EOF-based interpolation

We now present a method aimed at filling in the gaps in the datasets and determining the structure and number of significant EOFs.

Basically, since EOFs are supposed to summarize in a very efficient way the signal and are also supposed to detect coherent features, it is tempting to use the EOFs themselves for interpolations. If we knew the dominant EOFs of a given field and their amplitudes, their combination would also give us the value of the field at the missing data points.

a. Interpolation and adaptation of the EOFs

Large-scale EOFs should not be influenced by local changes in the values of a few points, and so one could try to calculate the EOFs based on a matrix in which a first rough estimate of the variable is used in the missing data points. Then, once the larger-scale EOFs and their amplitudes are estimated, they can serve to calculate the value of the field at the missing points. Then, of course, the EOFs themselves can be reevaluated and the process can be repeated until convergence. One of the problems, which will of course arise, is the decision on the number N of relevant EOFs to be retained to recompose the signal at the missing data points. We will come to this question later. Formally the method can then be described as follows:

  • let I be the ensemble of discrete points where data are missing; that is, when (i, j) ∈ I, we do not have data or they are unreliable; I is of course a subset of [1, m] × [1, n], and the number of missing data points is $n_o$;

  • 𝗫t is the true field we would like to recover;

  • 𝗥 is a noise field (unresolved features, instrumental noise, etc.);

  • 𝗫p = 𝗫t + 𝗥 is the true field with superimposed noise;

  • 𝗫o is the observed field with gappy data (typically cloud coverage for satellite images) and zero values at those locations; and

  • 𝗫a is the observed field where missing data were filled in (also called the analyzed field in analogy with objective analysis methods).

For any of these matrices, 𝗫e is the reconstruction of the corresponding matrix by the first N EOFs, thus filtering out the contribution of the higher EOFs (it should be noted that for each of the different matrices, the EOFs are different).

We will use the same sub- and superscripts for other variables (e.g., singular values) when we need to distinguish between those that are relevant to the real signal data, the noisy signal, signals with missing information, etc.

Our proposed algorithm performs as follows. First for all missing data points [(i, j) ∈ I] we put a value of 0 in the matrix (unbiased guess by assuming that the average value we subtracted was unbiased) to get the matrix 𝗫o.

Then we use the SVD decomposition of this matrix to get access to a first estimation of the spatial and temporal EOFs 𝗨 and 𝗩 as well as their amplitudes (the diagonal elements of the matrix 𝗗 containing the singular values):
$$\mathbf{X}_o = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^* \qquad (10)$$
The interpolated value at a missing data point is then easily obtained by the truncated EOF series:
$$(\mathbf{X}_a)_{ij} = \left(\mathbf{U}_N\,\mathbf{D}_N\,\mathbf{V}_N^*\right)_{ij}, \qquad (i, j) \in I \qquad (11)$$
The matrix 𝗨N consists of the first N spatial EOFs only (similarly, 𝗗N and 𝗩N retain the first N singular values and temporal EOFs). We thus get the truncated reconstruction of the matrix at the missing data points. We now have an analyzed field 𝗫a given by
$$\mathbf{X}_a = \mathbf{X}_o + \delta\mathbf{X} \qquad (12)$$
where the matrix δ𝗫 is zero everywhere, except at the missing data points.
Once the missing data points are replaced by their new values, we can apply the same procedure again, since the EOFs as well as their amplitudes will be modified by the changes at the missing data points. The following set of operations is thus performed until convergence:
$$\mathbf{X}_a = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^* \qquad (13)$$
$$(\mathbf{X}_a)_{ij} = \left(\mathbf{U}_N\,\mathbf{D}_N\,\mathbf{V}_N^*\right)_{ij}, \qquad (i, j) \in I \qquad (14)$$
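A minimal sketch of the iteration (10)–(14) in Python/NumPy follows; the function name eof_fill, the convergence test, and the tolerances are our own choices and are not prescribed by the paper:

```python
import numpy as np

def eof_fill(X, missing, N, tol=1e-6, max_iter=500):
    """Sketch of Eqs. (10)-(14): zero first guess at the missing points,
    SVD, truncated reconstruction, update of the gaps, repeat.
    X       : (m, n) anomaly matrix (values at missing points are ignored)
    missing : boolean (m, n) mask, True on the set I of missing points
    N       : number of EOFs retained
    """
    Xa = np.where(missing, 0.0, X)              # unbiased zero first guess
    prev = Xa[missing].copy()
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Xa, full_matrices=False)       # Eq. (13)
        Xe = U[:, :N] @ (s[:N, None] * Vt[:N, :])   # truncated reconstruction
        Xa[missing] = Xe[missing]               # Eq. (14): touch the gaps only
        if np.linalg.norm(Xa[missing] - prev) <= tol * (1.0 + np.linalg.norm(prev)):
            break                               # filled values stopped changing
        prev = Xa[missing].copy()
    return Xa, U[:, :N], s[:N], Vt[:N, :]
```

Since the update only ever touches the entries in I, the observed values are preserved exactly throughout the iterations.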

It is instructive to analyze how the changes from one iteration to the next are structured. To do so, we assume that the changes are very small so that we can apply the perturbation theory of appendixes A and B.

If we start from the initial guess of zero anomalies at the missing data point, it can be shown (appendix A) that the variance of the dominant modes will generally be increased, whereas the less dominant modes will see their amplitude decreased. This is coherent with the interpretation that the initial guess overestimates “high frequency” signals and underestimates “large scale” signals by disregarding any local information around the missing data point.

We can also look at how the EOF structures are modified. In fact, it appears that the EOFs are slightly rotated so as to diminish the contribution to the dominant EOFs from the “noisy” EOFs partly introduced by the initial filling by zero values. We indeed observe in the examples (and show the mechanism in appendix A) that not only do we have a tendency to decrease the energy of higher modes, but that we subtract part of the corresponding higher EOF structure from the dominant EOF. This will, as will be seen in the examples, lead to a better representation of the real EOF structure.

In appendixes A and B it is also shown that if two EOFs have similar singular values, the iterations will strongly affect their representation, which could lead to large variations and convergence problems during the iterations. In this case, however, the EOFs do not have a proper meaning anyway, since any combination of two eigenvectors associated with the same eigenvalue is still an eigenvector. In this case (which was not encountered in the examples used later) the iterations could be limited to a prescribed number.

b. Determination of the optimal number of EOFs

We now have an algorithm that iteratively modifies EOFs and their amplitudes so as to achieve a self-consistent set of EOFs and interpolated data. However, the number N of relevant EOFs is yet undetermined. Subjective or objective criteria based on Monte Carlo methods could be applied (Preisendorfer and Overland 1982), but this would (a) introduce some subjective parameters or (b) introduce a criterion for truncation based purely on the matrix dimensions and not at all on the power spectrum of the singular values of the given data. Furthermore, we would neglect a new possibility introduced by the method: the possibility to interpolate. Indeed, once we can interpolate missing data, we can turn toward cross-validation techniques (e.g., Brankart and Brasseur 1996). We can discard some additional points for validation by using the algorithm to interpolate these values too. Then we can compare the interpolated values to the data we put aside and have a measure of the interpolation error! Finding the optimal number of EOFs is then straightforward if we accept that the optimal number of EOFs is the one that minimizes the interpolation error. The cross-validation procedure for the determination of the number of relevant EOFs is then the following. We set aside a random set of valid data, typically at least 30 data points, in order to have a robust statistical estimate of the interpolation error; even better, a small percentage of the data points (but no fewer than 30) should be used. The interpolation error estimate is then simply the rms distance between the interpolated field at these points and the data there.

The algorithm to determine the optimal number of EOFs and the corresponding interpolated data and EOFs is thus the following. We first apply the interpolation method with a single EOF retained until convergence. We can then calculate the error estimate. A second EOF is now taken into account in the interpolation method (starting of course from the interpolated data based on the single converged EOF), and the iterations proceed until convergence with two EOFs. Then we can again calculate the error estimate. If the error starts to increase, the procedure stops; otherwise, we continue with more and more EOFs. To make sure that the process does not stop at a local minimum, we could also continue the procedure until reaching a prescribed number of EOFs, never more than q, and retain the optimal set. Once the number of EOFs that minimizes the error estimate is found, we can fine-tune the calculation of the EOFs and the data interpolation by finally injecting into the interpolation procedure the data that were set aside for the cross validation. In this case, one should of course start the final iterative procedure from the interpolated data obtained with the optimal number of EOFs during the cross validation. In the case where there are no missing data, the proposed cross-validation technique reduces to finding the optimal number of EOFs and then, in a last step, performing the normal EOF decomposition on the data, knowing the optimal truncation.
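The following sketch implements the cross-validation loop just described, reusing the hypothetical eof_fill function from the sketch above. For simplicity it restarts each fill from the observations, whereas the text warm-starts each N from the previous converged interpolation:

```python
import numpy as np

def select_N(X, missing, n_cv=30, N_max=20, seed=0):
    """Sketch of the cross validation of section 3b: hide n_cv valid
    points, fill with N = 1, 2, ... EOFs, and keep the N minimizing the
    rms misfit at the hidden points."""
    rng = np.random.default_rng(seed)
    valid = np.argwhere(~missing)
    pick = valid[rng.choice(len(valid), size=n_cv, replace=False)]
    hidden = missing.copy()
    hidden[pick[:, 0], pick[:, 1]] = True       # hide the validation set too
    truth = X[pick[:, 0], pick[:, 1]]

    best = (np.inf, 1)                          # (error estimate, N)
    for N in range(1, N_max + 1):
        Xa, *_ = eof_fill(X, hidden, N)
        rms = np.sqrt(np.mean((Xa[pick[:, 0], pick[:, 1]] - truth) ** 2))
        best = min(best, (rms, N))
        if rms > best[0]:                       # the error started to increase
            break
    return best[1], best[0]                     # optimal N, error estimate
```

Once the optimal N is found, the validation points would be reinjected for a final converged interpolation, as described above.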

With the cross-validation technique, we can thus find the optimal number of EOFs and an error estimate of the interpolation procedure (the rms distance between interpolated data and the validation data).

We now will show how the algorithm works on two datasets: a synthetic one and an AVHRR SST dataset in the Adriatic Sea.

4. Test cases

a. Synthetic data

A first exercise will be done on a synthetic dataset of a spatial 1D problem with time evolution. The field is shown in Fig. 1 in (x, t) space. The signal is the sum of a deterministic field and a completely random one, normally distributed with known intensity (standard deviation) and zero mean. The deterministic part is given as follows:
[Eq. (15), the analytic expression of the deterministic field, is not recoverable from the extraction]
with $x_i = 2\pi i/m$ and $t_j = 2\pi j/n$.

The objective is to retrieve from 𝗫o the best estimates of 𝗫t and its relevant empirical orthogonal functions. In this synthetic dataset, we thus know the true solution and the noise, so that we can make all necessary statistical comparisons and error analyses. Here, we decide to normalize the error, choosing either $\|\mathbf{X}_t\|$ (when known, as in our synthetic example) or $\|\mathbf{X}_a\|$ if only observations are available. Thus, a normalized error of 0.01 means an error with less than 1% of the energy of the signal. In the present example we normalize by $\|\mathbf{X}_t\|$.
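In code, the normalization just described is a one-liner (a sketch; the function name is ours):

```python
import numpy as np

def normalized_error(Xe, Xref, Xnorm):
    """Squared misfit normalized by the energy of a reference field; a
    value of 0.01 corresponds to an error of 1% of the signal energy."""
    return np.linalg.norm(Xe - Xref) ** 2 / np.linalg.norm(Xnorm) ** 2
```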

We first apply our method to a case with a 20% noise level and 10% cloud coverage as illustrated in Fig. 1.

From the analysis of the singular values in this case (Table 1) the following points are clear:

  • The noise added to the data increases all singular values ρp compared to the unperturbed field (ρt), most significantly those from the fifth singular value on (cf. lines 1 and 2 in Table 1). This means that the first four singular values are slightly modified by the noise, or equivalently, that some of the noise structure was interpreted as EOFs signals. In view of this, there will be no chance to exactly recover the first four singular values (ρt).

  • The singular values obtained from the first guess underestimate the energy of the first modes (because we impose zero values, not taking into account the correlation between neighbors) and overestimate the energy of the higher EOFs.

  • The reconstructed data (ρa) have energy levels for the EOFs much in line with the energy of the EOFs including the noise interpreted as signal (ρp), but the cross validation was able to detect that the optimal number of EOFs was four. Truncating after EOF 4 eliminates the noise associated with the higher EOFs, but it is impossible to eliminate the noise interpreted as signal in the first EOFs. Therefore the reconstructed data have an energy level of the first EOFs comparable to the noisy field.

We can also ask ourselves if the EOF structures are better represented after the interpolation method was applied. For mode 1, we have the following measures of errors on the EOF structure: $\|\mathbf{u}_0 - \mathbf{u}_p\|^2 = 0.0027$, $\|\mathbf{u}_t - \mathbf{u}_p\|^2 = 0.0035$, $\|\mathbf{u}_a - \mathbf{u}_p\|^2 = 0.0005$, $\|\mathbf{u}_a - \mathbf{u}_t\|^2 = 0.0042$, and $\|\mathbf{u}_0 - \mathbf{u}_t\|^2 = 0.0059$, which shows that the reconstructed EOF $\mathbf{u}_a$ is a better representation of the filtered $\mathbf{u}_t$ or unfiltered $\mathbf{u}_p$ EOF than the initial guess $\mathbf{u}_0$, and that it is closer to the EOF including the interpreted noise than to the filtered EOF. Therefore we can conclude that the interpolation clearly helped to improve the EOF representation.

Then we modify the importance of the cloud coverage and noise level, in order to assess if the method systematically improves the estimates of the EOFs. Different errors are shown in Table 2 for different combinations of (uniform but random) cloud coverage and noise/signal ratios.

From Table 2 we can see the following.

  • The truncated EOF reconstructions from the analyzed field $\mathbf{X}^e_a$ are always a better representation of the real truncated EOF representation $\mathbf{X}^e_t$ than the initial guess EOF representation $\mathbf{X}^e_0$ (comparing the last two columns), except when there are no clouds, in which case both are identical.

  • The untruncated analyzed field $\mathbf{X}_a$ is always a better representation of the true untruncated field $\mathbf{X}_t$ than the initial guess $\mathbf{X}_0$ (columns 4 and 5), except again in the case when there are no clouds, in which case they are identical.

  • The truncated EOF series are always a better representation of the true field than the untruncated series, be it for the initial guess solution (cf. columns 5 and 6) or the analyzed field (cf. columns 3 and 4).

The effect of repeating the iterations can be analyzed by comparing a calculation iterated to convergence with one in which only a single application of the iteration was performed (for the same number of EOFs). The errors $\|\mathbf{X}^e_a - \mathbf{X}_t\|^2$ and $\|\mathbf{X}^e_a - \mathbf{X}^e_t\|^2$ (which measure how well we really reproduce the significant part of the real data) were typically reduced by 30% when allowing the iterations to proceed.

Looking now qualitatively at the way the method re-creates the fields from the incomplete information, we observe (Fig. 1) that for a moderate noise/signal level and a cloud coverage of 10%, the reconstruction performs quite well, except for some higher-frequency signals imposed by the noise (in this case, truncating the EOF series earlier gives worse results because too much information is then eliminated, which shows that the cross-validation technique was able to pick up the best truncation). Increasing the noise (Fig. 2) shows that the noise is indeed responsible for the higher-frequency signals in the reconstruction. For a noise/signal ratio of O(1) (Fig. 3), even without cloud cover, the optimal EOF truncation filters only part of the noise. For lower noise but a 50% cloud cover (Fig. 4), the reconstruction performs surprisingly well in view of the very fuzzy and incomplete original image. When the noise/signal ratio is decreased (Fig. 5), the reconstruction is further enhanced. For very low noise levels, even cloud coverages of more than 75% (Fig. 6) still leave enough of the most relevant information for the EOF-based analysis method to extract it. For even higher cloud coverage (Fig. 7) we see the problem that at some locations or moments there are no data at all. This clearly shows up in the EOF reconstruction and is due to the fact that no correlation information at all is available at these moments or points. Since this is the only information the EOF method uses (and no other a priori information in the correlation functions), it is impossible to infer the missing data when the whole time series is missing at some point (or one spatial pattern at one moment).

Concerning the detection of the optimal number of EOFs to retain, Fig. 8 shows the real error associated with the EOF truncation and the error estimation from cross validation. In order to verify the stability of the truncation with respect to the random choice of the data put aside for cross validation, we repeated the experiment five times, with different random points and even a different number of data points set aside. In all cases the optimal number of EOFs detected was four, which corresponds to the minimum error. The difference in the absolute error estimation simply stems from the fact that, as already mentioned several times, the significant EOFs contain some interpreted noise, as confirmed by the analysis of the singular values in the presence and absence of noise. Fig. 9 shows that the noise has added some energy to the significant EOFs. We can also observe that, on a subjective basis, the truncation with four EOFs would be the most natural, since only from EOF 5 on are the differences in the singular values between the noisy data and the real data important. We also note that changing the random dataset put aside for the cross validation slightly modified the error estimate as well as the real error. For practical purposes, changing the random set of data put aside is thus an easy way to verify the robustness of the truncation. If for some reason there remains a doubt about the robustness, the cross validation presented here (simultaneously putting aside a number of data points) could be replaced by classical cross validation. The latter sets aside only a single data point (which makes the analysis less sensitive to the data elimination) and then repeats the experiment a sufficient number of times until a statistically significant measure of the error is obtained. This approach is even more robust, at the price of a higher number of analyses.

Concerning the present synthetic dataset, we have to mention that in reality cloud coverage is of course not random and that when larger patches of points are missing some degradations could appear. On the other hand, we did not implement any algorithm here to discard spatial points with too few data (or moments with full cloud coverage), so that some lines or columns in the reconstruction clearly show up as being underdetermined.

For comparison with other methods that deal with missing data, we use a simple method that is classically used and that linearly interpolates the missing data in space. No interpolation in time is done since, most of the time, satellite images are treated individually during the filling procedure. We can then compare our EOF-based interpolation and EOF truncation (Fig. 10) with the linearly interpolated field and its associated truncated EOF representation (Fig. 11, where the same number of EOFs is used). In the latter case, the systematic effects of linear interpolation are still visible in the EOF truncated (and hopefully filtered) reconstruction. One could argue that the improvement made by using the new method is not obvious, but besides the fact that the method allows detection of the optimal number of EOFs, a quantitative analysis of the error reveals its advantage. In the case presented here, the error $\|\mathbf{X}^e_a - \mathbf{X}_t\|^2$ decreases from 0.067 to 0.024, while the error $\|\mathbf{X}^e_a - \mathbf{X}^e_t\|^2$ decreases from 0.064 to 0.020 when using the EOF interpolation rather than a spatial linear interpolation.
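The baseline of Fig. 11 can be sketched as follows (our reading of "linear interpolation in space, image by image"; note that np.interp also clamps gaps outside the observed range to the nearest valid datum, a detail the paper does not specify):

```python
import numpy as np

def fill_linear_space(X, missing):
    """Baseline comparison: linear interpolation of the gaps along the
    spatial dimension only, each time column treated independently."""
    Xf = X.copy()
    xs = np.arange(X.shape[0], dtype=float)
    for j in range(X.shape[1]):
        gap = missing[:, j]
        if gap.any() and (~gap).any():
            Xf[gap, j] = np.interp(xs[gap], xs[~gap], X[~gap, j])
    return Xf
```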

b. Real satellite data

As an illustrative example to demonstrate how the method behaves on real data and nonuniformly distributed cloud coverage, we use an AVHRR dataset from Ifremer (available online at http://satftp.soest.hawaii.edu/adriatic/Adriatic/html/tecdoc.htm) over the Adriatic Sea.

Seventy-one images of 400 pixels each, centered in the middle of the Adriatic Sea (Fig. 12), from Julian day 130 to 294 were retained for the present testing purposes,² with images taken at night (around midnight or 0200 LT) in order to avoid problems related to skin temperature.

As seen from the cloudy sequence (Fig. 13), the cloud coverage is not at all uniform, some pictures being more than half covered by clouds, and noisy patterns in the data are obvious. We therefore expect the present sequence to be a difficult test for the method. After application, the method selects an optimal truncation with nine EOFs. When reconstructing the data with these nine EOFs (Fig. 14), a clearly filtered sequence is visible, while the regions where clouds were filled in are almost impossible to detect (except maybe for one or two pictures in the middle of the sequence, where a persistent large cloud coverage can be seen). The benefit of updating the EOFs is clearly seen by looking at Fig. 15. Indeed, the original EOFs are able to eliminate neither the cloud signature nor the noise in the data. As in the synthetic data example, the choice of the optimal truncation by the cross-validation error estimator is quite robust (Fig. 16); however, there is an indication that the error sequence contains some local minima. Our interpretation of the EOF modifications is also confirmed: Fig. 17 shows that the most significant EOF has increased energy compared to the initial guess, while the higher modes have a decreased energy after the readjustments. Looking at the evolution of the temperature at a given location for the initial data and the reconstructed version in Fig. 18, the effects of both filtering and interpolation by the EOFs are obvious.

5. Discussion

Two improvements to classic EOF analysis were presented herein:

  1. An automatic, parameter-free, and self-consistent filling mechanism was proposed. Compared to a classic interpolation method (such as optimal interpolation), the use of the EOFs generated by the data themselves has the advantage of not needing any a priori knowledge of a correlation function or correlation length. Moreover, since it is based solely on the data, inhomogeneities or nonisotropic behaviors are automatically taken into account (as long as they are in the available data) during the interpolation.

  2. A cross validation that allows for establishing the optimal cutoff in the EOF development (the number of relevant EOFs to be retained) and the prediction error in the filled data gaps (from the cross-validation estimator) was introduced.

Computational efficiency considerations were not dealt with here, since efficient eigenvalue solvers exist if only a few eigenvalues are needed. Compared to objective analysis used to fill in the missing data [cost proportional to $nm^3$ if done in space], the present interpolation/EOF decomposition is less time consuming [roughly $(m + n)\,[\min(m, n)]^{2.5}$ for the analysis of the full range of possible N, without optimization]. Moreover, the number of iterations could probably be reduced by using a better first guess. One could, for example, use any other classical method (be it filling methods or estimations of EOFs from the data correlation matrix) before actually starting the iterations.

We can also recall that with an enhanced and self-consistent filling mechanism, we are also improving the representation of the EOFs, meaning that any method relying on EOFs (e.g., some Kalman filters or prediction tools based on satellite images) will likely benefit from this better representation. For the method to have a chance to work, one needs, for each moment, a sufficiently large number of data points (otherwise one should drop the whole image) and, for each spatial point, a sufficient amount of data in time (otherwise one should discard the point from the analysis). In both cases, disregarding a complete image or a pixel can easily be dealt with. The case where an image at a given moment is not available simply amounts to using nonuniform time steps between the data, and the methods that deal with nonuniform data can then be used.

Acknowledgments

European projects MEDAR (MAS3-CT98-0174) and SOFT (EVK3-CT-2000-00028) supported this work. M. Rixen presently benefits from a Marie Curie fellowship at the Southampton Oceanographic Centre (SOC). The National Fund for Scientific Research, Belgium, is acknowledged for financing the supercomputer and the position of the first author. Two anonymous reviewers made constructive comments and are acknowledged. This is MARE publication MARE003. (Color images can be downloaded from http://modb.oce.be/personal/beckers/publications/cv.html.)

REFERENCES

  • Beckers, J.-M., and Coauthors, 2002: Model intercomparison in the Mediterranean: MEDMEX simulations of the seasonal cycle. J. Mar. Syst., 33–34, 215–251.

  • Boyd, J., E. Kennelly, and P. Pistek, 1994: Estimation of EOF expansion coefficients from incomplete data. Deep-Sea Res., 41, 1479–1488.

  • Brankart, J.-M., and P. Brasseur, 1996: Optimal analysis of in situ data in the western Mediterranean using statistics and cross validation. J. Atmos. Oceanic Technol., 13, 477–491.

  • D'Andrea, F., and R. Vautard, 2001: Extratropical low-frequency variability as a low-dimensional problem. Part I: A simplified model. Quart. J. Roy. Meteor. Soc., 127, 1357–1375.

  • Elmore, K., and M. Richman, 2001: Euclidean distance as a similarity metric for principal component analysis. Mon. Wea. Rev., 129, 540–549.

  • Emery, W., and R. Thomson, 1998: Data Analysis Methods in Physical Oceanography. Pergamon, 634 pp.

  • Gomis, D., S. Ruiz, and M. Pedder, 2001: Diagnostic analysis of the 3D ageostrophic circulation from a multivariate spatial interpolation of CTD and ADCP data. Deep-Sea Res., 48, 269–295.

  • Hair, J., R. Anderson, R. Tatham, and W. Black, 1998: Multivariate Data Analysis. Prentice Hall, 730 pp.

  • Hendricks, J., R. Leben, G. Born, and C. Koblinsky, 1996: Empirical orthogonal function analysis of global TOPEX/POSEIDON altimeter data and implications for detection of global sea level rise. J. Geophys. Res., 101, 14131–14145.

  • Holbrook, N., and N. Bindoff, 2000: A statistically efficient mapping technique for four-dimensional ocean temperature data. J. Atmos. Oceanic Technol., 17, 831–846.

  • Holmes, P., J. Lumley, and G. Berkooz, 1996: Turbulence, Coherent Structures, Dynamical Systems and Symmetry. Cambridge University Press, 420 pp.

  • Houseago-Stokes, R., 2000: Using optimal interpolation and EOF analysis on North Atlantic satellite data. International WOCE Newsletter, No. 38, WOCE International Project Office, Southampton, United Kingdom, 26–28.

  • Kantha, L., and C. Clayson, 2000: Numerical Models of Oceans and Oceanic Processes. International Geophysics Series, Vol. 66, Academic Press, 940 pp.

  • Kaplan, A., Y. Kushnir, M. Cane, and B. Blumenthal, 1997: Reduced space optimal analysis for historical data sets: 136 years of Atlantic sea surface temperature. J. Geophys. Res., 102, 27853–27860.

  • Lawson, C., and R. Hanson, 1974: Solving Least Squares Problems. Prentice Hall, 340 pp.

  • Lumley, J., 1970: Stochastic Tools in Turbulence. Academic Press, 194 pp.

  • North, G., T. Bell, R. Cahalan, and F. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev., 110, 699–706.

  • Pedder, M., and D. Gomis, 1998: Applications of EOF analysis to the spatial estimation of circulation features in the ocean sampled by high-resolution CTD soundings. J. Atmos. Oceanic Technol., 15, 959–978.

  • Preisendorfer, R., 1988: Principal Component Analysis in Meteorology and Oceanography. Elsevier, 425 pp.

  • Preisendorfer, R., and J. Overland, 1982: A significance test for principal components applied to a cyclone climatology. Mon. Wea. Rev., 110, 1–4.

  • Rixen, M., J.-M. Beckers, J.-M. Brankart, and P. Brasseur, 2000: A numerically efficient data analysis method with error map generation. Ocean Modell., 2, 45–60.

  • Shen, S., T. Smith, C. Ropelewski, and R. Livezey, 1998: An optimal regional averaging method with error estimates and a test using tropical Pacific SST data. J. Climate, 11, 2340–2350.

  • Vautard, R., P. Yiou, and M. Ghil, 1992: Singular spectrum analysis: A toolkit for short, noisy chaotic signals. Physica D, 58, 95–126.

  • von Storch, H., and F. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge University Press, 484 pp.

  • Wunsch, C., 1996: The Ocean Circulation Inverse Problem. Cambridge University Press, 442 pp.

APPENDIX A

Transfer of Variance between Different EOFs and Changes of Structure during Iterations

At the first iteration with 0 values at the missing data points, we get eigenvectors and singular values ρ that satisfy
$$\mathbf{X}_o\,\mathbf{v}_i = \rho_i\,\mathbf{u}_i \qquad (A1)$$
$$\mathbf{X}_o^*\,\mathbf{u}_i = \rho_i\,\mathbf{v}_i \qquad (A2)$$
When the matrix on which the SVD is performed is perturbed according to Eq. (12), the eigenvectors and singular values change according to appendix B, and each singular value ρi is modified by an amount ri:
$$r_i = \tfrac{1}{2}\left(\mathbf{v}_i^*\,\delta\mathbf{X}^*\,\mathbf{u}_i + \mathbf{u}_i^*\,\delta\mathbf{X}\,\mathbf{v}_i\right) \qquad (A3)$$
but since, at the first iteration (zero first guess), the perturbation δ𝗫 vanishes outside the missing data points, where Eq. (11) gives
$$(\delta\mathbf{X})_{kl} = \sum_{j=1}^{N} \rho_j\,(\mathbf{u}_j)_k\,(\mathbf{v}_j^*)_l, \qquad (k, l) \in I, \qquad (A4)$$
we can rewrite
$$r_i = \sum_{(k,l) \in I}\; \sum_{j=1}^{N} \rho_j\, \Re\!\left[(\mathbf{u}_i^*)_k\,(\mathbf{u}_j)_k\,(\mathbf{v}_j^*)_l\,(\mathbf{v}_i)_l\right]. \qquad (A5)$$
The change in the EOF fields can also be assessed in this way and we find the following after using the development of appendix B and some matrix manipulation:
$$\mathbf{p}_i = \sum_{l \neq i} \alpha_{il}\,\mathbf{u}_l, \qquad \alpha_{il} = \frac{\rho_i\,\mathbf{u}_l^*\,\delta\mathbf{X}\,\mathbf{v}_i + \rho_l\,\mathbf{v}_l^*\,\delta\mathbf{X}^*\,\mathbf{u}_i}{\rho_i^2 - \rho_l^2} \qquad (A6)$$
where αil is basically the transfer of information from mode l into mode i. Mode i is therefore mostly influenced by the modes with similar singular values (ρi ≈ ρl) or strong interactions between modes at the missing data points.
To further understand how these modifications act, suppose that we have a single missing data point (g, h), that we use only one EOF (N = 1), and that we are at the first iteration of our interpolation process (we made the EOF decomposition of 𝗫o), so that at the missing data point before updating we have (𝗫a)gh = 0. The change of the first singular value according to Eq. (A5) is then
$$r_1 = \rho_1\,\bigl|(\mathbf{u}_1)_g\bigr|^2\,\bigl|(\mathbf{v}_1)_h\bigr|^2 \qquad (A7)$$

This shows that the interpolation will increase the energy of mode 1, except in the case when the corresponding EOF is zero at the missing data point. Otherwise the term is always positive and is simply the product of the (squared) temporal and spatial EOF amplitudes at the missing data point. This interpretation shows that by replacing the missing data value by the first EOF guess, we add energy to the system, which is simply the energy of the first EOF at that location.

For the other singular values, we can exploit the fact that
$$(\mathbf{X}_o)_{gh} = \sum_{j=1}^{q} \rho_j\,(\mathbf{u}_j)_g\,(\mathbf{v}_j^*)_h = 0 \qquad (A8)$$
or equivalently
$$\rho_1\,(\mathbf{u}_1)_g\,(\mathbf{v}_1^*)_h = -\sum_{j=2}^{q} \rho_j\,(\mathbf{u}_j)_g\,(\mathbf{v}_j^*)_h \qquad (A9)$$
so that for i > 1
$$r_i = -\sum_{k=2}^{q} \rho_k\, \Re\!\left[(\mathbf{u}_i^*)_g\,(\mathbf{u}_k)_g\,(\mathbf{v}_k^*)_h\,(\mathbf{v}_i)_h\right] \qquad (A10)$$
The contributions to this sum can have any sign, except the contribution k = i, which is included here; the latter contribution is always negative (except again if the corresponding mode has a zero amplitude at the missing data point) and corresponds to the fact that initially, to force the value to zero, higher EOFs had seen their amplitude increased in order to be able to satisfy the initial guess (which is inconsistent with the large-scale EOFs present). Now the correction tries to reduce the energy of the corresponding mode. For the other contributions ki, their amplitudes decrease with k since the singular values are ordered with a decreasing amplitude. Furthermore, since on average, EOFs are decorrelated, the different contributions will compensate somehow, except the one for k = i, which is always negative. For a single mode, we thus expect mode 1 to see its energy increased and the energy of the others decreased after the first replacement.
When we use N modes, instead of using a single EOF for replacing the missing data values, the same analysis can be done. For a mode iN we have
$$r_i = \sum_{\substack{k \le N \\ k \neq i}} \rho_k\, \Re\!\left[(\mathbf{u}_i^*)_g\,(\mathbf{u}_k)_g\,(\mathbf{v}_k^*)_h\,(\mathbf{v}_i)_h\right] + \rho_i\,\bigl|(\mathbf{u}_i)_g\bigr|^2\,\bigl|(\mathbf{v}_i)_h\bigr|^2 \qquad (A11)$$
The last term is always positive and the others have varying signs; we therefore expect the modes i ≤ N to see their energy increased.
For i > N, we can again use
$$\sum_{k=1}^{N} \rho_k\,(\mathbf{u}_k)_g\,(\mathbf{v}_k^*)_h = -\sum_{k=N+1}^{q} \rho_k\,(\mathbf{u}_k)_g\,(\mathbf{v}_k^*)_h \qquad (A12)$$
to show that the contribution to the ri (i > N) is always decreased by the component k = i, while the other ones are of unspecified sign. Also statistically, when more missing data points are present, we expect that
$$\sum_{(k,l) \in I} (\mathbf{u}_i^*)_k\,(\mathbf{u}_j)_k \approx 0, \qquad j \neq i \qquad (A13)$$
because of the orthogonality of the eigenvectors and the probability that when summing up all data points, the sum will cover the full range of i values. The same applies for the temporal sum
$$\sum_{(k,l) \in I} (\mathbf{v}_j^*)_l\,(\mathbf{v}_i)_l \approx 0, \qquad j \neq i \qquad (A14)$$
We can also look at how the EOF structures are modified. In fact, it appears that the EOFs are slightly rotated so as to diminish the contribution to the dominant EOFs from the “noisy” EOFs partly introduced by the initial filling by zero values. Indeed for a single missing data point we have, according to Eq. (A6),
$$\alpha_{1l} = \rho_1\,\frac{\rho_1\,(\mathbf{u}_1)_g\,(\mathbf{u}_l)_g\,(\mathbf{v}_1)_h^2 + \rho_l\,(\mathbf{u}_1)_g^2\,(\mathbf{v}_1)_h\,(\mathbf{v}_l)_h}{\rho_1^2 - \rho_l^2} \qquad \text{(A15, real data)}$$

APPENDIX B

Perturbation Theory on Singular Value Decomposition

A question that arises when dealing with matrices constructed from real data concerns the effects of noise in the data. Therefore we can analyze the problem of singular value decomposition (SVD) of a matrix subjected to perturbations. Suppose that we have found the solution of the SVD eigenvalue problem for an unperturbed matrix 𝗔 so that we have
$$\mathbf{A}\,\mathbf{v}_i = \rho_i\,\mathbf{u}_i \qquad (B1)$$
$$\mathbf{A}^*\,\mathbf{u}_i = \rho_i\,\mathbf{v}_i \qquad (B2)$$
or equivalently
$$\mathbf{A}\mathbf{A}^*\,\mathbf{u}_i = \rho_i^2\,\mathbf{u}_i \qquad (B3)$$
$$\mathbf{A}^*\mathbf{A}\,\mathbf{v}_i = \rho_i^2\,\mathbf{v}_i \qquad (B4)$$
Suppose we now perturb the matrix so that a new matrix 𝗔 + δ𝗕 is obtained, with |δ| ≪ 1. We can then search for eigenvectors and singular values of the form
$$\tilde{\mathbf{u}}_i = \mathbf{u}_i + \delta\,\mathbf{p}_i \qquad (B5)$$
$$\tilde{\mathbf{v}}_i = \mathbf{v}_i + \delta\,\mathbf{q}_i \qquad (B6)$$
$$\tilde{\rho}_i = \rho_i + \delta\,r_i \qquad (B7)$$
Neglecting higher-order terms in δ (see footnote B1) and using the fact that ui, vi, and ρi are solutions of the unperturbed problem, we obtain the following set of equations for the perturbations:
$$\mathbf{A}\,\mathbf{q}_i + \mathbf{B}\,\mathbf{v}_i = \rho_i\,\mathbf{p}_i + r_i\,\mathbf{u}_i \qquad (B8)$$
$$\mathbf{A}^*\,\mathbf{p}_i + \mathbf{B}^*\,\mathbf{u}_i = \rho_i\,\mathbf{q}_i + r_i\,\mathbf{v}_i \qquad (B9)$$
In addition, since the eigenvectors are orthogonal and normalized, $\mathbf{u}_i^*\,\mathbf{u}_j = \delta_{ij}$, we have (by perturbing this orthonormality condition)
$$\mathbf{u}_i^*\,\mathbf{p}_i + \mathbf{p}_i^*\,\mathbf{u}_i = 0 \qquad (B10)$$
$$\mathbf{v}_i^*\,\mathbf{q}_i + \mathbf{q}_i^*\,\mathbf{v}_i = 0 \qquad (B11)$$
Multiplying Eq. (B8) by u*i leads, by using Eqs. (B2), (B1), (B10), and (B11), to
$$r_i = \tfrac{1}{2}\left(\mathbf{v}_i^*\,\mathbf{B}^*\,\mathbf{u}_i + \mathbf{u}_i^*\,\mathbf{B}\,\mathbf{v}_i\right) \qquad (B12)$$
which takes real values as we can easily verify. Since pi is orthogonal to ui by virtue of Eq. (B10), we can develop it as follows:
$$\mathbf{p}_i = \sum_{l \neq i} a_{il}\,\mathbf{u}_l \qquad (B13)$$
Injecting this development into Eq. (B8) and applying $\mathbf{u}_l^*$ to it leads, by again using Eqs. (B1), (B2), (B10), and (B11), to
$$a_{il} = \frac{\rho_i\,\mathbf{u}_l^*\,\mathbf{B}\,\mathbf{v}_i + \rho_l\,\mathbf{v}_l^*\,\mathbf{B}^*\,\mathbf{u}_i}{\rho_i^2 - \rho_l^2} \qquad (B14)$$
which gives the contribution of mode l to the change of mode i. If ρi ≈ ρl, the change in the EOFs is likely to be large, reflecting the fact that for two identical eigenvalues, any combination of the corresponding eigenvectors is also an eigenvector. When the eigenvalues are close, a perturbation can transfer the contribution between these two EOFs very easily. Similarly, we can calculate the perturbation on vi:
$$\mathbf{q}_i = \sum_{l \neq i} b_{il}\,\mathbf{v}_l, \qquad b_{il} = \frac{\rho_i\,\mathbf{v}_l^*\,\mathbf{B}^*\,\mathbf{u}_i + \rho_l\,\mathbf{u}_l^*\,\mathbf{B}\,\mathbf{v}_i}{\rho_i^2 - \rho_l^2} \qquad (B15)$$
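The first-order result (B12) is easy to check numerically. The sketch below (our own verification on an arbitrary random real matrix, not part of the paper) compares it with a finite-difference estimate of the perturbed singular value:

```python
import numpy as np

# Finite-difference check of Eq. (B12) for a random real matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
B = rng.standard_normal((30, 20))
eps = 1e-6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_pert = np.linalg.svd(A + eps * B, compute_uv=False)

i = 0                                  # mode to check
u, v = U[:, i], Vt[i, :]
r = 0.5 * (v @ B.T @ u + u @ B @ v)    # Eq. (B12), real case
print((s_pert[i] - s[i]) / eps, r)     # the two values agree to O(eps)
```

For a real matrix the two terms of Eq. (B12) are equal, so r reduces to $\mathbf{u}_i^{\mathrm{T}}\,\mathbf{B}\,\mathbf{v}_i$.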

Fig. 1. Synthetic dataset of a spatial 1D problem with time evolution. N/S = 0.197, n = 0.1 (10%) cloud cover. (upper left) True field, (upper right) perturbed field, (lower left) clouded field, and (lower right) reconstructed field. Shading scale varies linearly from −4 at left to 4 at right.

Fig. 2. Effect of increasing noise: N/S = 0.258, n = 0.1 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 3. Effect of increasing noise: N/S = 0.744. (left) No cloud cover and (right) reconstructed field.

Fig. 4. Effect of increasing noise: N/S = 0.118, n = 0.5 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 5. Effect of increasing noise: N/S = 0.023, n = 0.5 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 6. Effect of increasing noise: N/S = 0.0015, n = 0.77 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 7. Effect of increasing noise: N/S = 0.00037, n = 0.83 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 8. (left) Real error (compared to the filtered field) as a function of the number of EOFs N retained. (right) Estimated error based on cross validation; the latter includes the error due to noise. Five different sets of random data points set aside for the cross validation were used to verify robustness. For the real error, using another random set of data put aside does not change the optimal truncation and only slightly changes the error itself. For the error estimate through cross validation, the optimal truncation was not sensitive to a change in the random set, but the error estimation itself is slightly changed.

Fig. 9. Noise 4.5 (dots) vs no noise (solid line) for the first 10 eigenvalues: no cloud, logarithmic scale.

Fig. 10. EOF-based interpolation and EOF truncation: N/S = 0.15, n = 0.50 cloud cover. (left) Clouded field and (right) reconstructed field.

Fig. 11. Linear interpolation and EOF truncation: N/S = 0.15, n = 0.50 cloud cover. (left) Clouded field interpolated linearly in space (but not in time) and (right) EOF truncated field.

Fig. 12. Location of the rectangle of AVHRR images in the Adriatic Sea used in this study.

Fig. 13. Sequence of AVHRR images in the Adriatic Sea with clouds. The sequence is presented in left to right, top to bottom order.

Fig. 14. Reconstructed sequence of the images in Fig. 13 with the optimal number of EOFs and the proposed algorithm, including the changed EOFs.

Fig. 15. Reconstructed sequence of the images in Fig. 13 with the optimal number of EOFs, but using unmodified EOFs from the first guess x = 0.

Fig. 16. Normalized error estimation from cross validation as a function of the number of EOFs N retained. Normalization is done by using the data standard deviation of 2.6°C; the rms error in the prediction is 0.36°C.

Fig. 17. Energy vs EOF number, with singular values from the initial dataset filled by zeros (dots) and the reconstructed dataset (solid line).

Fig. 18. Evolution of temperature at a randomly chosen position. Initial values (dots); filled values with zero anomalies are shown as ticks instead of dots, corresponding to the average value 21.5°C. The line corresponds to the reconstructed time series.

Table 1. First seven singular values for the exact field (ρt), noisy field (ρp), clouded field filled with zeros (ρ0), and analyzed field (ρa).

Table 2. Errors for different combinations of noise/signal ratios N/S and cloud coverage n.

* MARE Publication Number MARE003.

¹ Strictly speaking, the covariance matrix should be $(1/n)\,\mathbf{X}\mathbf{X}^*$, since it is obtained by time averaging.

² Optimized versions with larger datasets are surely possible in view of the computational cost being roughly $(m + n)\,[\min(m, n)]^{2.5}$.

B1 This requires that 𝗔𝗕* is essentially different from 0, meaning that the perturbation matrix must not be orthogonal to the initial matrix.
