  • Alvera-Azcárate, A., A. Barth, M. Rixen, and J. Beckers, 2005: Reconstruction of incomplete oceanographic data sets using empirical orthogonal functions: Application to the Adriatic sea surface temperature. Ocean Modell., 9, 325–346.

  • Baglama, J., and L. Reichel, cited 2012: irlba: Fast partial SVD by implicitly-restarted Lanczos bidiagonalization. [Available online at http://CRAN.R-project.org/package=irlba.]

  • Barnett, T. P., and R. Preisendorfer, 1987: Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon. Wea. Rev., 115, 1825–1850.

  • Beckers, J. M., and M. Rixen, 2003: EOF calculations and data filling from incomplete oceanographic datasets. J. Atmos. Oceanic Technol., 20, 1839–1856.

  • Bien, J., and R. J. Tibshirani, 2011: Sparse estimation of a covariance matrix. Biometrika, 98, 807–820.

  • Björnsson, H., and S. Venegas, 1997: A manual for EOF and SVD analyses of climate data. McGill University, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global Change Research Tech. Rep., 52 pp.

  • Boyd, J. D., E. P. Kennelly, and P. Pistek, 1994: Estimation of EOF expansion coefficients from incomplete data. Deep-Sea Res. I, 41, 1479–1488.

  • Bretherton, C. S., C. Smith, and J. M. Wallace, 1992: An intercomparison of methods for finding coupled patterns in climate data. J. Climate, 5, 541–560.

  • Chavez, F. P., and R. Brusca, 1991: The Galapagos Islands and Their Relation to Oceanographic Processes in the Tropical Pacific. Plenum Press, 933 pp.

  • Globcolour Project, 2007: GlobCOLOUR: An EO based service supporting global ocean carbon cycle research. ACRI-ST/LOV Full Validation Rep., 76 pp. [Available online at http://www.globcolour.info/validation/report/GlobCOLOUR_FVR_v1.1.pdf.]

  • Hasselmann, K., 1988: PIPs and POPs: The reduction of complex dynamical systems using principal interaction and oscillation patterns. J. Geophys. Res., 93 (D9), 11 015–11 021.

  • Hu, C., K. Carder, and F. Muller-Karger, 2001: How precise are SeaWiFS ocean color estimates? Implications of digitization-noise errors. Remote Sens. Environ., 76, 239–249.

  • Kaplan, A., Y. Kushnir, M. A. Cane, and M. B. Blumenthal, 1997: Reduced space optimal analysis for historical data sets: 136 years of Atlantic sea surface temperatures. J. Geophys. Res., 102 (C13), 27 835–27 860.

  • Kaplan, A., Y. Kushnir, and M. A. Cane, 2000: Reduced space optimal interpolation of historical marine sea level pressure: 1854–1992. J. Climate, 13, 2987–3002.

  • Marshall, J., A. Adcroft, C. Hill, L. Perelman, and C. Heisey, 1997: A finite-volume, incompressible Navier Stokes model for studies of the ocean on parallel computers. J. Geophys. Res., 102 (C3), 5753–5766.

  • MITgcm Group, cited 2012: MITgcm user manual. MIT/EAPS. [Available online at http://mitgcm.org/public/r2_manual/latest/online_documents.]

  • North, G., T. Bell, R. Cahalan, and F. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev., 110, 699–706.

  • Pennington, J., K. Mahoney, V. Kuwahara, D. Kolber, R. Calienes, and F. Chavez, 2006: Primary production in the eastern tropical Pacific: A review. Prog. Oceanogr., 69, 285–317.

  • R Core Team, cited 2012: R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing. [Available online at http://www.R-project.org/.]

  • Schartau, M., A. Engel, J. Schroter, S. Thoms, C. Volker, and D. Wolf-Gladrow, 2007: Modelling carbon overconsumption and the formation of extracellular particulate organic carbon. Biogeosciences, 4, 433–454.

  • Taylor, M. H., M. Losch, and A. Bracher, 2013: On the drivers of phytoplankton blooms in the Antarctic marginal ice zone: A modeling approach. J. Geophys. Res., 118, 63–75, doi:10.1029/2012JC008418.

  • von Storch, H., and F. W. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge University Press, 484 pp.

  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Academic Press, 627 pp.

  • Willmott, C. J., and K. Matsuura, 2005: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res., 30, 79–82.
Fig. 3.

The top three EOF modes derived from (top)–(bottom) true SST anomaly, true Chla anomaly, and observed (i.e., gappy) Chla anomaly fields. Observed Chla anomaly fields were subjected to the three gappy EOF approaches [(middle)–(bottom)]. Relative explained variance of each EOF mode as compared to the variance of the observed Chla anomaly field is displayed in the top-right corner of each map. Time axis ticks indicate the beginning of each year (1 Jan).

Fig. 4.

Correlation of top 20 EOF coefficients from the observed (i.e., gappy) Chla anomaly field as derived from the three EOF approaches.

Fig. 5.

Examples of reconstructed Chla anomalies for several dates using the top 20 EOFs derived from the three gappy EOF approaches. Maps of (top) the true data and (top middle) the observed (i.e., gappy) data. Grids with missing values are white in color. (middle)–(bottom) Reconstructions using the gappy approaches. The MAE of each day's reconstruction, as compared to the true nongappy data, is displayed in the top-right corner of the maps.

Fig. 6.

MAE of (left) EOF reconstructed and (right) CCA predicted fields of Chla anomalies. EOFs were derived from either the true or observed (i.e., gappy) Chla anomaly fields, and error was gauged against the true Chla anomaly field. The CCA model uses normalized EOF coefficients from true SST anomaly (n = 6) and observed Chla anomaly (variable n) fields as predictor and predictand, respectively. The MAE of the true Chla field (gray line) is provided as a reference for a perfect reconstruction/prediction.

Fig. 7.

MAE of EOF reconstructions for the observed (i.e., gappy) Chla anomaly field with variable error (i.e., noise) added to the true signal. Error levels are given as standard deviation of log-transformed Chla, with corresponding median percent error given in parentheses. Open circle symbols designate the truncation level of lowest MAE.

Fig. 8.

Linear regressions of daily spatial gappiness vs log-transformed MAE of the EOF reconstructed Chla anomaly fields (using the top 20 EOFs) for each gappy EOF approach. MAE is calculated against the true field. Shaded areas show the 25% and 75% quartiles for gappiness intervals by approach. Fitted regressions are shown as solid lines. Regression coefficients and R² values are displayed at the top of the plot area. All regressions are based on n = 3269 data points and are significantly different from each other at the level p < 0.001 (F test).

On the Sensitivity of Field Reconstruction and Prediction Using Empirical Orthogonal Functions Derived from Gappy Data

  • 1 Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany

Abstract

Empirical orthogonal function (EOF) analysis is commonly used in the climate sciences and elsewhere to describe, reconstruct, and predict highly dimensional data fields. When data contain a high percentage of missing values (i.e., gappy), alternate approaches must be used in order to correctly derive EOFs. The aims of this paper are to assess the accuracy of several EOF approaches in the reconstruction and prediction of gappy data fields, using the Galapagos Archipelago as a case study example. EOF approaches included least squares estimation via a covariance matrix decomposition [least squares EOF (LSEOF)], data interpolating empirical orthogonal functions (DINEOF), and a novel approach called recursively subtracted empirical orthogonal functions (RSEOF). Model-derived data of historical surface chlorophyll-a concentrations and sea surface temperature, combined with a mask of gaps from historical remote sensing estimates, allowed for the creation of true and observed fields by which to gauge the performance of EOF approaches. Only DINEOF and RSEOF were found to be appropriate for gappy data reconstruction and prediction. DINEOF proved to be the superior approach in terms of accuracy, especially for noisy data with a high estimation error, although RSEOF may be preferred for larger data fields because of its relatively faster computation time.

Current affiliation: Leibniz Center for Tropical Marine Ecology, Bremen, Germany.

Corresponding author address: Marc H. Taylor, Leibniz Center for Tropical Marine Ecology, Fahrenheitstrasse 6, D-28359 Bremen, Germany. E-mail: marchtaylor@yahoo.com

1. Introduction

Empirical orthogonal function (EOF) analysis, called principal component analysis (PCA) in other disciplines, is commonly used in climate research as a tool to analyze meteorological fields with high spatiotemporal dimensionality. The leading EOF modes will typically describe large-scale dynamical features in the field, and reconstruction of the field using a truncated subset of EOFs can filter out small-scale features or noise. Furthermore, EOF truncation may be useful for further statistical analysis by reducing the dimensionality of the data. For example, EOF coefficients have been used in canonical correlation analysis (CCA) for the identification of patterns in coupled fields (Barnett and Preisendorfer 1987). Other techniques, like principal oscillation pattern (POP) or principal interaction pattern (PIP) analysis, aim to approximate complex dynamical systems with a simple dynamical model, and EOF techniques are usually applied in this reduction (Hasselmann 1988). The approach by Kaplan et al. (2000) has goals similar to those of the present work. We augment their work by comparing a suite of numerical techniques designed for this task.

a. Basic EOF approaches

EOF analysis is typically conducted via one of two main approaches: direct singular value decomposition (SVD) of the observed data matrix or eigenvalue decomposition of a covariance matrix. When fields are complete (i.e., no gaps with missing values), EOFs can be calculated in either way to achieve the same outcome.

For all presented approaches, we will consider a data matrix $\mathbf{X} = x_{ij}$, where i is the time index (length M) and j is the space index (length N). Each sample time series (i.e., each column) is centered (mean subtracted) so that the EOFs describe patterns of temporal covariance.

1) Direct data matrix decomposition

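The direct approach via SVD is as follows:
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}, \tag{1}$$
where $\mathbf{X}$ is an M × N data matrix, $\mathbf{V}$ is an N × N matrix containing the EOF patterns, $\mathbf{U}$ is an M × N matrix of the EOF coefficients, $\boldsymbol{\Sigma}$ is an N × N matrix containing the singular values $\sigma_k$ on the diagonal, and k is the EOF mode index (length N). Only EOFs ≤ min(M, N) will carry information. The explained variance of each mode is calculated as the square of each $\sigma_k$, which is typically presented as a percent,
$$\text{explained variance}_k\,(\%) = \frac{\sigma_k^2}{\sum_{k'=1}^{N}\sigma_{k'}^2}\times 100. \tag{2}$$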
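As a minimal sketch of the direct approach, the following NumPy snippet decomposes a complete field with a thin SVD and computes the percent explained variance from the squared singular values. The function name `eof_svd` and the random test field are illustrative assumptions, not part of the original analysis (which was done in R).

```python
import numpy as np

def eof_svd(X):
    """EOF analysis of a gap-free M x N field by direct SVD."""
    Xc = X - X.mean(axis=0)                 # center each column (time series)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    expl_var = 100.0 * s**2 / np.sum(s**2)  # percent variance per mode
    return Xc, U, s, Vt.T, expl_var

rng = np.random.default_rng(0)              # synthetic test field (assumption)
X = rng.standard_normal((200, 100))         # M = 200 times, N = 100 grid points
Xc, U, s, V, expl_var = eof_svd(X)
print(U.shape, V.shape, expl_var[:3])
```

With M ≥ N, the thin SVD returns coefficients of shape M × N and EOF patterns of shape N × N, matching the dimensions given in the text.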

2) Covariance matrix decomposition

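The covariance matrix decomposition approach requires a square matrix. One first constructs a covariance matrix $\mathbf{C}$,
$$\mathbf{C} = \mathbf{X}^{\mathrm{T}}\mathbf{X}, \tag{3}$$
where $\mathbf{C}$ is an N × N matrix containing the covariance values between columns $\mathbf{x}_j$ of $\mathbf{X}$. This is subsequently decomposed via eigenvalue decomposition,
$$\mathbf{C} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{\mathrm{T}}, \tag{4}$$
where $\mathbf{E}$ is an N × N matrix of the EOF patterns and $\boldsymbol{\Lambda}$ is an N × N matrix containing the eigenvalues $\lambda_k$ on the diagonal. Again, only EOFs ≤ min(M, N) will carry information. Then $\mathbf{X}$ is projected onto $\mathbf{E}$ to derive the EOF coefficients (sometimes referred to as the principal components),
$$\mathbf{A} = \mathbf{X}\mathbf{E}, \tag{5}$$
where $\mathbf{A}$ is an M × N matrix of the EOF coefficients. Because of the projection, $\mathbf{A}$ carries the magnitude of $\boldsymbol{\Lambda}$. To create a normalized version of the EOF coefficients $\mathbf{A}^{+}$, each EOF coefficient $\mathbf{a}_k$ must be divided by the square root of its corresponding eigenvalue $\lambda_k$,
$$\mathbf{a}_k^{+} = \frac{\mathbf{a}_k}{\sqrt{\lambda_k}}. \tag{6}$$
Explained variance of each EOF mode k is calculated as follows:
$$\text{explained variance}_k\,(\%) = \frac{\lambda_k}{\sum_{k'=1}^{N}\lambda_{k'}}\times 100. \tag{7}$$
Following normalization, the two basic approaches are related as follows: $\mathbf{E} = \mathbf{V}$, $\mathbf{A}^{+} = \mathbf{U}$, and $\boldsymbol{\Sigma}^{2} = \boldsymbol{\Lambda}$.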
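The stated equivalence of the two basic approaches can be checked numerically. The sketch below assumes the covariance matrix is the unscaled product XᵀX (so that the eigenvalues equal the squared singular values) and uses a random, gap-free test field; the name `eof_cov` is hypothetical. EOFs are only defined up to sign, so absolute values are compared.

```python
import numpy as np

def eof_cov(Xc):
    """EOFs of a centered, gap-free field via eigendecomposition of X^T X."""
    C = Xc.T @ Xc                          # Eq. (3), unscaled (assumption)
    lam, E = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]       # sort modes by decreasing eigenvalue
    A = Xc @ E                             # Eq. (5): project data onto the EOFs
    A_norm = A / np.sqrt(lam)              # Eq. (6): valid only for lam > 0
    return E, lam, A, A_norm

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 100))        # synthetic test field (assumption)
Xc = X - X.mean(axis=0)
E, lam, A, A_norm = eof_cov(Xc)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

same_lambda = np.allclose(lam, s**2)
same_eofs = np.allclose(np.abs(E), np.abs(Vt.T), atol=1e-6)
same_coefs = np.allclose(np.abs(A_norm), np.abs(U), atol=1e-6)
print(same_lambda, same_eofs, same_coefs)
```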

b. Gappy data EOF approaches

Gappiness in data fields can be due to instrument limitations (coverage) or errors in measurement. When gappiness is extreme, interpolation becomes impractical and EOF reconstruction can provide a more accurate alternative.

1) Covariance matrix decomposition/least squares estimation of coefficients

Because of the inability to decompose a matrix containing missing values, a direct data matrix decomposition via SVD is not possible. The approach via covariance matrix decomposition is possible; however, because of the missing values, one must adopt a least squares approach that takes into account the number of paired observations between samples. In this work, we will refer to this approach as least squares empirical orthogonal functions (LSEOF). In LSEOF, the above covariance matrix calculation [Eq. (3)] must be scaled by the number of shared, nonmissing values between samples (von Storch and Zwiers 1999; Kaplan et al. 1997; Boyd et al. 1994),
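$$c_{jl} = \frac{1}{|I|}\sum_{i \in I} x_{ij}\,x_{il}, \tag{8}$$
where $c_{jl}$ is the covariance between columns j and l and $I$ is the set of valid pairs, i.e., the times i at which both columns hold nonmissing values ($|I| = M$ when there are no gaps).
Following the decomposition of $\mathbf{C}$ to obtain the EOFs [Eq. (4)], the EOF coefficients can be estimated via a least squares approximation,
$$\varepsilon_{ij} = x_{ij} - a_{ik}e_{jk}, \qquad \varphi = \sum_{j \in J_i}\varepsilon_{ij}^{2}, \tag{9}$$
where $\varepsilon$ is the error and φ is the objective function with the solution
$$a_{ik} = \frac{\sum_{j \in J_i} x_{ij}\,e_{jk}}{\sum_{j \in J_i} e_{jk}^{2}}, \tag{10}$$
where $J_i$ is the set of nonmissing values at time i. Note that the denominator reduces to 1 when there are no missing values; thus, it equals the scalar product for $\mathbf{A}$ shown above [Eq. (5)].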
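A rough sketch of the LSEOF steps follows: a pairwise covariance scaled by the number of shared nonmissing values, and least squares coefficients computed one mode at a time over the nonmissing locations. Gaps are represented as NaN; the function names and the random test field are assumptions made for illustration. The last line counts negative eigenvalues, which can occur because the pairwise covariance matrix is not guaranteed to be positive definite.

```python
import numpy as np

def cov_gappy(X):
    """Pairwise covariance of a centered M x N field containing NaN gaps."""
    valid = ~np.isnan(X)
    X0 = np.where(valid, X, 0.0)           # gaps contribute zero to the sums
    V = valid.astype(float)
    n_pairs = V.T @ V                      # shared nonmissing times per column pair
    return np.divide(X0.T @ X0, n_pairs,
                     out=np.zeros_like(n_pairs), where=n_pairs > 0)

def ls_coefficients(X, E):
    """Least squares EOF coefficients over nonmissing values (Eq. 10 form)."""
    valid = ~np.isnan(X)
    X0 = np.where(valid, X, 0.0)
    num = X0 @ E                           # sum over valid j of x_ij * e_jk
    den = valid.astype(float) @ E**2       # sum over valid j of e_jk^2
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 50))         # synthetic test field (assumption)
X -= X.mean(axis=0)
Xg = X.copy()
Xg[rng.random(X.shape) < 0.5] = np.nan     # 50% random gaps
lam, E = np.linalg.eigh(cov_gappy(Xg))
lam, E = lam[::-1], E[:, ::-1]             # decreasing eigenvalue order
A = ls_coefficients(Xg, E)
print("negative eigenvalues:", int(np.sum(lam < 0)))
```

Without gaps, `ls_coefficients(X, E)` reduces to the plain projection `X @ E`, because each EOF has unit length.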

Several issues have been identified with the use of this approach. First and foremost, a covariance matrix derived from gappy data is not necessarily positive definite, and its decomposition via LSEOF can yield negative λ values. Since the variance of the dataset is contained in the trace of the covariance matrix, which equals the sum of Λ, negative values mean that other EOFs ek will have higher λk than in reality, thus overestimating their amplitude and the amount of explained variance contained therein (Beckers and Rixen 2003; Björnsson and Venegas 1997).

The λ amplification also has consequences for the assessment of EOF “significance”: that is, the differentiation between EOFs that describe large-scale patterns and those associated with small-scale features and noise. This is likely to affect equally both subjective methods, such as truncation based on visual inspection (e.g., scree plots), and objective methods [e.g., North's rule of thumb (North et al. 1982)].

A second problem with the decomposition of a nonpositive definite covariance matrix is a loss of orthogonality between EOFs (Björnsson and Venegas 1997), which makes their use in predictive models less attractive. For example, Barnett and Preisendorfer (1987) describe a method of CCA based on EOF coefficients, which is useful in determining the correlation between coupled fields. When the EOF coefficients used in such a model are highly correlated with one another, issues associated with multicollinearity can affect the predictive ability of the model.

2) DINEOF

An alternate approach, data interpolating empirical orthogonal functions (DINEOF; Beckers and Rixen 2003; Alvera-Azcárate et al. 2005), interpolates missing values via an iterative SVD algorithm. DINEOF has similarities with approaches aimed at iterative estimation of the covariance matrix (e.g., Bien and Tibshirani 2011), although DINEOF directly iterates values in the data matrix itself.

Missing values are initially filled by an unbiased guess (zero in the typical case of mean-subtracted data). In addition, some nonmissing values (the authors recommend a small percentage of the data points or at least 30 points) are also treated as gaps (e.g., zero substituted), while their original values are retained separately for assessing the root-mean-square error (RMS) of the interpolated values.

The DINEOF algorithm subsequently decomposes the data matrix via SVD and a reconstruction is calculated using a single, leading EOF mode. The interpolated values for the missing locations are then substituted in the original matrix. Subsequent SVD iterations and their resulting EOF reconstructions will continually modify the values in the gaps until convergence of the RMS. Following convergence, a second EOF is then added to the reconstruction and again interpolated until convergence using two EOFs. This procedure continues with an increasing number of EOFs until the RMS converges [see Beckers and Rixen (2003) and Alvera-Azcárate et al. (2005) for further description of the algorithm]. The resulting interpolated matrix will no longer contain gaps, thus overcoming the drawbacks of the previous approach.
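The iteration described above can be sketched as follows. This is a simplified illustration rather than the reference DINEOF implementation: it uses a full SVD at every step, stops adding EOFs as soon as the validation RMS no longer improves, and all names (`dineof_sketch`, `frac_cv`, `tol`) and the rank-one test field are assumptions.

```python
import numpy as np

def dineof_sketch(X, n_max=10, frac_cv=0.01, tol=1e-5, max_iter=100, seed=0):
    """Simplified DINEOF-style gap filling of a centered field with NaN gaps."""
    rng = np.random.default_rng(seed)
    gaps = np.isnan(X)
    known = np.flatnonzero(~gaps)
    n_cv = min(known.size, max(30, int(frac_cv * known.size)))
    cv = rng.choice(known, size=n_cv, replace=False)   # held-out validation points
    cv_true = X.flat[cv]
    fill = gaps.copy()
    fill.flat[cv] = True                   # validation points are treated as gaps
    Xf = np.where(fill, 0.0, X)            # first guess: zero (the mean)
    best_rms, best_X = np.inf, Xf.copy()
    for n in range(1, n_max + 1):          # add one EOF at a time
        rms_old = np.inf
        for _ in range(max_iter):
            U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
            recon = (U[:, :n] * s[:n]) @ Vt[:n]
            Xf[fill] = recon[fill]         # update only the gap values
            rms = np.sqrt(np.mean((recon.flat[cv] - cv_true) ** 2))
            if abs(rms_old - rms) < tol:   # RMS has converged for n EOFs
                break
            rms_old = rms
        if rms >= best_rms:                # validation error stopped improving
            break
        best_rms, best_X = rms, Xf.copy()
    best_X.flat[cv] = cv_true              # restore the held-out observations
    return best_X, best_rms

rng = np.random.default_rng(3)
t = 2 * np.pi * np.arange(1, 201) / 200
truth = np.outer(np.sin(t), np.linspace(0.5, 1.5, 100))  # illustrative rank-one field
truth -= truth.mean(axis=0)
mask = rng.random(truth.shape) < 0.4       # 40% random gaps
obs = np.where(mask, np.nan, truth)
filled, rms = dineof_sketch(obs)
gap_mae = np.mean(np.abs(filled[mask] - truth[mask]))
print("MAE in gaps:", gap_mae)
```

For large fields, a partial SVD (such as the Lanczos-based routine in the irlba package listed in the references) would avoid computing all modes at each iteration.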

3) RSEOF

A third approach, recursively subtracted empirical orthogonal functions (RSEOF), is proposed in this work. It is an adaptation of LSEOF [section 1b(1)] in that it uses the same basic methodology of decomposition of a covariance matrix with least squares expansion of EOF coefficients [Eqs. (8) and (10)]; however, the procedure is done in a recursive fashion by solving for one EOF at a time. In each iteration, the leading EOF mode is used to reconstruct a truncated approximation of the data field, which is subsequently subtracted from the remaining data in the field. In principle, the procedure should better preserve orthogonality among EOFs and prevent λ amplification.

The approach is as follows:

  1. The observed data matrix $\mathbf{X}_O$ is (optionally) centered and/or scaled prior to the decomposition and is renamed $\mathbf{X}_i$ for the first iteration, i = 1.

  2. A covariance matrix $\mathbf{C}_i$ is calculated from $\mathbf{X}_i$ [Eq. (8)].

  3. $\mathbf{C}_i$ is subjected to eigenvalue decomposition, giving $\mathbf{E}_i$ and $\boldsymbol{\Lambda}_i$ [Eq. (4)].

  4. $\mathbf{A}_i$ is computed using the least squares approach [Eq. (10)].

  5. A truncated version of the data is reconstructed using the leading EOF mode (the first column of $\mathbf{E}_i$ and of $\mathbf{A}_i$), resulting in $\mathbf{X}_{\mathrm{recon},i}$.

  6. This field is then subtracted from the data to give a new field for iteration i + 1: $\mathbf{X}_{i+1} = \mathbf{X}_i - \mathbf{X}_{\mathrm{recon},i}$.

  7. Steps 2–6 are then iterated until a given criterion is met {e.g., a fixed number of iterations, $i \le N$; remaining percent variance level, as calculated from the trace of $\mathbf{C}_i$; minimization of reconstruction error [e.g., mean absolute error (MAE), RMS]}.
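The steps above can be sketched in a few lines of NumPy. This is an illustrative reading of the procedure with a fixed number of modes as the stopping criterion; the function name `rseof_sketch`, the NaN representation of gaps, and the rank-one test field are assumptions.

```python
import numpy as np

def rseof_sketch(X, n_modes):
    """Recursively subtracted EOFs for a centered M x N field with NaN gaps."""
    valid = ~np.isnan(X)
    V = valid.astype(float)
    n_pairs = V.T @ V                      # shared nonmissing times per column pair
    R = np.where(valid, X, 0.0)            # residual field; gaps are held at zero
    E, A, lam = [], [], []
    for _ in range(n_modes):
        C = np.divide(R.T @ R, n_pairs,
                      out=np.zeros_like(n_pairs), where=n_pairs > 0)   # step 2
        w, vecs = np.linalg.eigh(C)                                    # step 3
        e = vecs[:, -1]                    # leading EOF of the current residual
        den = V @ e**2
        a = np.divide(R @ e, den, out=np.zeros_like(den), where=den > 0)  # step 4
        R = R - np.outer(a, e) * V         # steps 5-6: subtract one-mode field
        E.append(e); A.append(a); lam.append(w[-1])
    return np.column_stack(E), np.column_stack(A), np.array(lam)

rng = np.random.default_rng(8)
t = 2 * np.pi * np.arange(1, 201) / 200
truth = np.outer(np.sin(t), np.linspace(0.5, 1.5, 100))  # illustrative rank-one field
truth -= truth.mean(axis=0)
obs = np.where(rng.random(truth.shape) < 0.4, np.nan, truth)  # 40% random gaps
E, A, lam = rseof_sketch(obs, n_modes=3)
recon_mae = np.mean(np.abs(A @ E.T - truth))
print("MAE of 3-mode reconstruction:", recon_mae)
```

Because only the leading mode is kept in each pass, a partial eigensolver that returns a single eigenpair could replace the full `eigh` call for large N.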

c. Data reconstruction

Reconstruction of the data field can simply be calculated as the scalar product of the EOFs and their coefficients. For the approaches involving an eigenvalue decomposition of a covariance matrix (e.g., LSEOF, RSEOF), this operation is as follows:
$$\hat{x}_{ij} = \sum_{k} a_{ik}\,e_{jk}, \tag{11}$$
where $\hat{x}_{ij}$ is the reconstructed data field and the sum runs over the retained EOF modes k. For nongappy data, when the full set of N EOFs is used, the reconstruction is complete and exact. If fewer than N EOFs are used (e.g., truncated to include only the leading EOFs with the largest λ values), then the reconstruction is approximate (Wilks 2006). Reconstruction from EOFs derived via SVD [Eq. (1)] or DINEOF requires that $\boldsymbol{\Sigma}$ be included in the scalar product, since neither the EOFs nor the EOF coefficients carry the units of the field in the way that $\mathbf{A}$ does.
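A short sketch of truncated reconstruction, showing both forms: from unnormalized coefficients (which carry the units of the field) and from SVD output (where the singular values must be included). Function names and the random test field are illustrative assumptions.

```python
import numpy as np

def reconstruct(A, E, k):
    """Reconstruction from the k leading EOFs and unnormalized coefficients."""
    return A[:, :k] @ E[:, :k].T

def reconstruct_svd(U, s, V, k):
    """Same reconstruction from SVD output; singular values must be included."""
    return (U[:, :k] * s[:k]) @ V[:, :k].T

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 20))          # synthetic gap-free field (assumption)
X -= X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
E = Vt.T
A = X @ E                                  # projection, Eq. (5)

err_full = np.abs(reconstruct(A, E, 20) - X).max()    # all N modes: exact
err_trunc = np.abs(reconstruct(A, E, 5) - X).max()    # 5 modes: approximate
print(err_full, err_trunc)
```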

d. Summary of gappy approaches and aims of the present work

We have outlined three main approaches for calculating EOFs with gappy data, including 1) decomposition of a covariance matrix followed by a least squares estimate of EOF coefficients (LSEOF); 2) filling of gaps via iterative SVD interpolation (DINEOF); and 3) recursive subtraction of EOFs from the data field (RSEOF). The first approach is known to have drawbacks associated with λ amplification, while the latter two approaches attempt to remedy this issue by either attempting to better preserve orthogonality of trailing EOFs (RSEOF) or by eliminating the problems associated with the decomposition of a nonpositive definite matrix via an optimal interpolation algorithm (DINEOF).

To illustrate these issues in a simple example, we can observe the performance of each approach in reconstructing a gappy field containing a single temporal sine-wave signal,
e12
where $t_i = 2\pi i/M$, $s_j = j$, M = 200, and N = 100. Differing levels of gappiness (20%, 40%, 60%, and 80%) are randomly distributed throughout the field. The leading λ values are nearly identical for all approaches, although trailing λ values are amplified substantially in the LSEOF approach. This amplification increases with the degree of gappiness in the observed field (Fig. 1, top panels). Statistics relating to field reconstruction can be seen in the middle and bottom panels of Fig. 1. The effect of λ amplification in the LSEOF approach is evident in the variance of the reconstructed field relative to the true nongappy field. Reconstructions using EOFs derived from RSEOF and DINEOF do not exceed a relative variance of 100%. Another statistic describing the fit of the reconstruction is that of the MAE, which is calculated as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{k=1}^{n}\left|\mathrm{pred}_k - \mathrm{obs}_k\right|, \tag{13}$$
where $(\mathrm{pred}_k, \mathrm{obs}_k)$ is the kth of n pairs of predictions and observations. The MAE is the arithmetic average of the absolute error (Wilks 2006) and is of practical use for intercomparisons given that it presents the magnitude of average model-performance error in the same units as the field (Willmott and Matsuura 2005). Again, LSEOF amplifies the error of the reconstruction using trailing EOFs, while the MAE of RSEOF and DINEOF continues to decrease before it flattens out. In this example, DINEOF outperforms RSEOF in terms of MAE under all degrees of gappiness.
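For reference, the MAE and the random masking used in this kind of experiment might be written as below. The helper names are hypothetical, and the random placeholder field stands in for the sine-wave field of Eq. (12); a zero-filled field is used only as a trivial baseline "reconstruction".

```python
import numpy as np

def mae(pred, obs):
    """Mean absolute error over all pairs where both values are present."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    ok = ~(np.isnan(pred) | np.isnan(obs))   # skip pairs with a missing value
    return float(np.mean(np.abs(pred[ok] - obs[ok])))

def add_gaps(X, frac, rng):
    """Copy of X with a fraction `frac` of entries set to NaN at random."""
    Xg = np.array(X, dtype=float)
    idx = rng.choice(Xg.size, size=int(round(frac * Xg.size)), replace=False)
    Xg.flat[idx] = np.nan
    return Xg

print(mae([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))   # 1.0

rng = np.random.default_rng(5)
truth = rng.standard_normal((200, 100))      # placeholder field, not Eq. (12)
for frac in (0.2, 0.4, 0.6, 0.8):
    gappy = add_gaps(truth, frac, rng)
    zero_fill = np.where(np.isnan(gappy), 0.0, gappy)
    print(frac, mae(zero_fill, truth))
```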
Fig. 1.

Comparison of gappy EOF approaches in the accuracy of field reconstruction under variable levels of EOF truncation. The gappy field contains a single signal with differing levels of gappiness. The value of λ is determined directly from the EOF analysis. Relative variance compares the reconstructed field's variance to that of the observed gappy field. MAE is calculated between the reconstructed field and the true nongappy field. The amplified λ values calculated by LSEOF result in EOFs that carry a higher degree of variance and thus increased error (MAE) in the reconstruction. Plots for DINEOF are nearly identical for all levels of gappiness, preventing the visualization of all lines.

Citation: Journal of Climate 26, 22; 10.1175/JCLI-D-13-00089.1

The aims of the present work are to further evaluate the performance of these EOF approaches in the reconstruction and prediction of gappy data fields. Toward this aim, we consider a more realistic example using modeled surface chlorophyll-a (Chla) concentrations that have been masked by historical cloud cover.

2. Experiments

a. Case study description

To examine the performance of the EOF approaches on a more realistic data field, we use the example of remotely sensed surface Chla concentration. Estimates of Chla have become a valuable source of information regarding the biological productivity and variability of aquatic systems ever since data became regularly available with the start of operation of the Sea-Viewing Wide Field-of-View Sensor (SeaWiFS) in 1997. Since then, additional satellite sensors [e.g., Moderate Resolution Imaging Spectroradiometer (MODIS), Medium Resolution Imaging Spectrometer (MERIS)] have been deployed to complement and improve the estimation of Chla from ocean color. Despite improvements in coverage and the availability of merged products (e.g., Globcolour Project: http://www.globcolour.info), cloud coverage continues to make the use of daily-resolution data impractical for many analyses because of the high degree of missing values.

We have chosen to use the example of the Galapagos Archipelago as an interesting test case due to the known variability in the ecosystem at both seasonal and interannual scales via the El Niño–Southern Oscillation (ENSO). The Galapagos lie in the heart of the equatorial upwelling (EU) region of the eastern tropical Pacific. Nutrients are supplied to the photic zone by equatorial upwelling and mixing, and by topographic upwelling of the Equatorial Undercurrent (EUC) on the western side of the archipelago (Chavez and Brusca 1991). In particular, cold, nutrient-rich waters of the EUC are brought to the surface following contact with the western side of the archipelago. As a result, the Galapagos are able to support at least twice the phytoplankton biomass and primary production as the remainder of the EU or any of the open-ocean regions of the eastern tropical Pacific (Pennington et al. 2006).

Under ENSO-neutral or ENSO-negative (La Niña) conditions, trade winds drive surface waters to the western tropical Pacific and create a basinwide slope, where the sea surface is about 0.5 m higher at Indonesia than at Ecuador, effectively pushing down the thermocline in the west. In the eastern tropical Pacific, the thermocline is closer to the surface, which facilitates the availability of nutrients to primary producers via upwelling. By contrast, ENSO-positive (El Niño) conditions are a result of weakened trade winds, causing surface waters to relax back to the east, which lowers the thermocline and the EUC. As a result, the availability of cool, nutrient-rich waters to upwelling is decreased and primary production is dramatically reduced.

Remote sensing Chla data [Globcolour Garver–Siegel–Maritorena (GSM) merged product, 4.63-km resolution] for the region show that missing values follow a distinct spatiotemporal pattern related to cloud coverage. Highest gappiness is observed in the warmer oceanic waters north of the archipelago and during the austral winter months, while lowest gappiness is associated with the colder upwelling centers west of the archipelago (Fig. 2).

Fig. 2.

Gappiness of remote sensing Globcolour Project (http://www.globcolour.info) chlorophyll data for the Galapagos Archipelago. For the period of 1997–2007, average daily mean gappiness is shown in the map, while the time series of monthly mean gappiness for the mapped area is shown below. Time axis ticks indicate the beginning of each year (1 Jan).

b. Synthetic dataset

To obtain full, nongappy data fields, we use model-derived data. The model consisted of a biogeochemical model, the Regulated Ecosystem Model (REcoM) (Schartau et al. 2007), coupled to a global general circulation model, the Massachusetts Institute of Technology General Circulation Model (MITgcm) (Marshall et al. 1997; MITgcm Group 2012). The model had a mean horizontal resolution of 18 km and a vertical resolution of 10 m near the surface. The simulation spanned the years 1992 through 2007 (for additional details, see Taylor et al. 2013).

Daily 4.63-km-resolution Globcolour chlorophyll data were used to create a cloud mask for the modeled data fields. When no valid data values were recorded within each larger grid of the model, the matrix location was classified as a missing value. In this way, we were able to obtain both the true nongappy field and an observed gappy data field masked primarily by clouds. We examined the region between 93° and 88°W and between 1°N and 2°S for the period coinciding with remote sensing estimates (1 September 1997–31 December 2007). Additionally, modeled sea surface temperature (SST) fields were used for the construction of a predictive CCA model. Both Chla and SST data were transformed to anomalies by subtracting the long-term monthly means from the time series of each grid. The resulting dimensions of the data matrices were 3774 × 608 (day × grid).
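The two preprocessing steps in this paragraph (deriving a coarse gap mask from finer-resolution observations, and converting to anomalies by removing long-term monthly means) could look roughly like this. The sketch assumes, purely for illustration, that each model cell contains an integer block of `factor` × `factor` satellite pixels; the real grids need not nest this neatly. All names and the toy data are hypothetical.

```python
import numpy as np

def coarse_gap_mask(valid_fine, factor):
    """True where a coarse cell contains no valid fine-resolution pixel."""
    ny, nx = valid_fine.shape
    ny_c, nx_c = ny // factor, nx // factor
    blocks = valid_fine[:ny_c * factor, :nx_c * factor].reshape(
        ny_c, factor, nx_c, factor)
    return ~blocks.any(axis=(1, 3))

def monthly_anomalies(X, months):
    """Subtract the long-term mean of each calendar month from each column."""
    X = np.asarray(X, dtype=float)
    months = np.asarray(months)
    anom = np.full_like(X, np.nan)
    for m in np.unique(months):
        rows = months == m
        anom[rows] = X[rows] - np.nanmean(X[rows], axis=0)
    return anom

valid_fine = np.zeros((4, 4), dtype=bool)
valid_fine[0, 0] = True                      # a single cloud-free pixel
print(coarse_gap_mask(valid_fine, 2))        # only the top-left coarse cell is observed

rng = np.random.default_rng(7)
months = np.tile(np.arange(1, 13), 3)        # 36 time steps labeled by calendar month
X = rng.standard_normal((36, 5)) + np.sin(2 * np.pi * months / 12)[:, None]
anom = monthly_anomalies(X, months)
```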

c. Analyses of performance

EOF analysis was used to decompose the true (i.e., nongappy) and observed (i.e., gappy) Chla fields and the true SST field. All three gappy approaches (LSEOF, RSEOF, and DINEOF) were used on the observed Chla field. For the DINEOF approach, we interpolated the missing values according to the methodology described earlier in section 1b(2). A total of 10 000 observed Chla values (approximately 1% of the known values) were used as the independent measure of RMS fit. The threshold for convergence was set at δRMS ≤ $1 \times 10^{-5}$ mg Chla m$^{-3}$. Following convergence, these values were restored to their original values in the interpolated matrix and a final EOF decomposition was performed on the interpolated data field. All calculations and figures were done with R (R Core Team 2012).

1) EOF reconstruction

The Chla fields were reconstructed using variable degrees of EOF truncation (k = 1 → 20). Error of the reconstructed field was measured against the true Chla field via MAE.

2) EOF/CCA prediction

Significant SST EOF modes were identified via North's rule of thumb (North et al. 1982). A CCA was performed using these SST EOF coefficients as the predictor and a variable number of Chla EOF coefficients as the predictand (k = 1 → 20). The use of a truncated number of EOF coefficients in a CCA model was demonstrated by Barnett and Preisendorfer (1987) and has been shown to be an effective way of identifying coupled patterns between fields (Bretherton et al. 1992). The resulting model was used to predict Chla EOF coefficients, which were subsequently used to reconstruct the Chla field. Error of the reconstructed field was measured against the true Chla field via MAE.
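As an illustration of the first step, one common reading of North's rule of thumb takes the sampling error of an eigenvalue as δλ ≈ λ√(2/n) and retains leading modes whose error bars do not overlap with those of the next mode. The function below is a sketch under that reading; the name `north_rule`, the use of the raw sample count as n, and the example eigenvalues are assumptions.

```python
import numpy as np

def north_rule(lam, n_samples):
    """Number of leading modes separated from their successor by North's rule."""
    lam = np.asarray(lam, dtype=float)
    err = lam * np.sqrt(2.0 / n_samples)     # approximate sampling error
    n_sig = 0
    for k in range(len(lam) - 1):
        if lam[k] - err[k] > lam[k + 1] + err[k + 1]:
            n_sig += 1                       # error bars of modes k and k+1 are distinct
        else:
            break                            # first overlap ends the leading set
    return n_sig

print(north_rule([10.0, 5.0, 1.0, 0.95, 0.9], n_samples=100))   # 2
```

For autocorrelated daily data, an effective sample size smaller than the number of time steps would widen the error bars and may reduce the number of retained modes.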

3) Influence of noise

The influence of noise in a given gappy dataset on the accuracy of EOF reconstruction was explored for each of the approaches. In the case of remote sensing estimates of chlorophyll, estimation error is typically given as percent difference, implying that error increases proportionally with concentration. Error from SeaWiFS is usually within ±35% for case I waters but can reach ±60% (Hu et al. 2001). Estimated error from Globcolour is of a similar magnitude (Globcolour Project 2007). To simulate estimation error, normally distributed random numbers of mean = 0 and variable standard deviation (~0.1–0.5) were added to the log-transformed true Chla field, which translated to a median percent error of ~10%–30%. EOFs derived from these noisy data fields were used to reconstruct the field using variable degrees of truncation (k = 1 → 50). Error of the reconstructed field was measured against the true Chla field via MAE.
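The noise simulation can be sketched as follows. The example assumes a natural-log transform (the base of the logarithm is not stated above) and uses hypothetical function names. Adding zero-mean normal noise in log space gives a multiplicative error, so the error grows in proportion to concentration.

```python
import math
import random
import statistics

def add_lognormal_noise(chla, sd, rng):
    """Add N(0, sd) noise to log(chla) and transform back (natural log assumed)."""
    return [math.exp(math.log(c) + rng.gauss(0.0, sd)) for c in chla]

def median_percent_error(noisy, truth):
    """Median of |noisy - truth| / truth, expressed in percent."""
    return 100.0 * statistics.median(abs(n - t) / t for n, t in zip(noisy, truth))
```

Calling median_percent_error on fields produced with standard deviations between 0.1 and 0.5 gives a quick check of how a chosen noise level translates to percent error.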

3. Results

a. EOF modes

The top three EOF modes for SST anomaly and Chla anomaly fields are presented in Fig. 3. All fields show a signal resembling interannual ENSO variability in the leading EOF mode. The strong El Niño event of 1997/98 is seen in the corresponding EOF coefficients of the leading mode, with opposing signs for SST and Chla. Such a relationship is to be expected; warm El Niño conditions are a result of a relaxation of trade winds and subsequent lowering of the thermocline, which in turn prevents upwelling of nutrient-rich, cold waters to the euphotic zone where they are used by primary producers. The second EOF mode relates to variations in the main upwelling center west of the archipelago, while the third EOF mode appears related to the shifting intertropical convergence zone. All three gappy approaches produced similar spatial EOF patterns as compared to the true Chla field; however, the LSEOF approach resulted in noisier EOF coefficients as well as much higher λ values, which amplified the variance of the reconstruction relative to the true field. RSEOF and DINEOF produced similar EOF coefficients, both in magnitude and pattern, as compared to those of the true field.

Fig. 3. The top three EOF modes derived from (top)–(bottom) true SST anomaly, true Chla anomaly, and observed (i.e., gappy) Chla anomaly fields. Observed Chla anomaly fields were subjected to the three gappy EOF approaches [(middle)–(bottom)]. Relative explained variance of each EOF mode as compared to the variance of the observed Chla anomaly field is displayed in the top-right corner of each map. Time axis ticks indicate the beginning of each year (1 Jan).

Citation: Journal of Climate 26, 22; 10.1175/JCLI-D-13-00089.1

Figure 4 shows the correlation between EOF coefficients produced by the three approaches. A high loss of orthogonality is evident in the LSEOF approach. Some loss of orthogonality occurs in the RSEOF approach, although all off-diagonal correlations were low (|R| < 0.2). There was no loss in orthogonality with DINEOF as the EOFs are ultimately derived from an interpolated, nongappy matrix.
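A loss of orthogonality of this kind can be checked by correlating the EOF coefficient time series of different modes with one another. The sketch below is one hypothetical way to do so: coeffs[m] is assumed to hold the coefficient series of mode m, and the function returns the largest absolute off-diagonal correlation, which is zero for strictly orthogonal, zero-mean coefficients.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_offdiag_abs_r(coeffs):
    """Largest |R| between the coefficient series of any two different modes."""
    n_modes = len(coeffs)
    return max(
        abs(pearson(coeffs[a], coeffs[b]))
        for a in range(n_modes)
        for b in range(a + 1, n_modes)
    )
```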

Fig. 4. Correlation of top 20 EOF coefficients from the observed (i.e., gappy) Chla anomaly field as derived from the three EOF approaches.

b. EOF reconstruction

Examples of daily field reconstructions using the top 20 EOFs are presented in Fig. 5. RSEOF and DINEOF generally result in lower daily MAE, but this is not consistent for all days presented. The degree of gappiness and the location of gaps appear to have an effect on how well the EOFs are able to predict the missing values. LSEOF overestimates negative anomalies in the upwelling zone to the west of the archipelago in the July and October maps.

Fig. 5. Examples of reconstructed Chla anomalies for several dates using the top 20 EOFs derived from the three gappy EOF approaches. Maps of (top) the true data and (top middle) the observed (i.e., gappy) data. Grids with missing values are white in color. (middle)–(bottom) Reconstructions using the gappy approaches. The MAE of each day's reconstruction, as compared to the true nongappy data, is displayed in the top-right corner of the maps.

The effect of truncation level on MAE in the reconstruction can be seen in Fig. 6 (left plot). The MAE of the reconstruction using the EOFs of the true field is provided as reference. MAE increases with truncation level when using EOFs derived by LSEOF, while those derived with RSEOF and DINEOF progressively decrease MAE. EOFs derived by the DINEOF approach provided the best fit as evaluated against the true Chla field.

Fig. 6. MAE of (left) EOF reconstructed and (right) CCA predicted fields of Chla anomalies. EOFs were derived from either the true or observed (i.e., gappy) Chla anomaly fields, and error was gauged against the true Chla anomaly field. The CCA model uses normalized EOF coefficients from true SST anomaly (n = 6) and observed Chla anomaly (variable n) fields as predictor and predictand, respectively. The MAE of the true Chla field (gray line) is provided as a reference for a perfect reconstruction/prediction.

c. EOF/CCA prediction

Figure 6 (right plot) shows the MAE of the predicted Chla field using the CCA model of SST and Chla EOF coefficients as predictor and predictand. All models show similar trends in that increasing EOF truncation does not greatly improve MAE. This is due to the fact that the leading EOF coefficients received the highest CCA loadings and carry the highest amount of variance (i.e., λ values) of the observed Chla field. Subsequent EOF coefficients are downweighted by the CCA model and contribute little to the prediction. EOF coefficients derived by the DINEOF approach provided the best prediction as evaluated against the true Chla field.

d. Influence of noise

The accuracy of reconstruction with LSEOF-derived EOFs was even poorer with noisy fields, and thus only results for RSEOF and DINEOF are shown. The addition of noise to the data affected the optimal level of truncation and accuracy of the reconstruction of both the RSEOF and DINEOF approaches (Fig. 7). As expected, MAE increases with increasing observation error, while the optimal truncation level decreases. For all levels of error, DINEOF outperformed RSEOF in terms of the MAE of the reconstruction and was able to incorporate a higher number of EOFs before MAE increased.

Fig. 7. MAE of EOF reconstructions for the observed (i.e., gappy) Chla anomaly field with variable error (i.e., noise) added to the true signal. Error levels are given as standard deviation of log-transformed Chla, with corresponding median percent error given in parentheses. Open circle symbols designate the truncation level of lowest MAE.

4. Discussion

a. EOF reconstruction and prediction

Of the gappy EOF approaches evaluated, DINEOF is shown to be superior as indicated by its accuracy in the reconstruction and prediction of data fields. The RSEOF approach was also successful in providing reliable results yet with a slightly lower accuracy, while the more traditional LSEOF approach was not appropriate for reconstruction. The LSEOF approach provided similar output in terms of spatial EOF patterns, but corresponding EOF coefficients showed increased noise and amplified λ values leading to increased variance (Fig. 3) and subsequently error in the reconstruction (Fig. 6). This approach should be discouraged, as it has been shown here to be deficient in cases where gappiness is high.

We find that the error of the reconstruction (MAE) is positively related to the degree of gappiness in the data. Figure 8 shows the relationship of increasing MAE with gappiness for daily maps using each of the approaches. RSEOF and DINEOF both dramatically reduce the MAE over that of LSEOF. A slightly lower slope is found for DINEOF as compared to RSEOF, again showing it to be the superior approach.

Fig. 8. Linear regressions of daily spatial gappiness vs log-transformed MAE of the EOF reconstructed Chla anomaly fields (using the top 20 EOFs) for each gappy EOF approach. MAE is calculated against the true field. Shaded areas show the 25th and 75th percentiles for gappiness intervals by approach. Fitted regressions are shown as solid lines. Regression coefficients and R² values are displayed at the top of the plot area. All regressions are based on n = 3269 data points and are significantly different from each other at the level p < 0.001 (F test).

Field prediction based on the EOF/CCA model also shows the best accuracy for the DINEOF approach. The issue of increasing MAE with truncation level was not found with the predictive CCA model using the LSEOF-derived EOF coefficients. This is in part because the main link between the SST and Chla anomaly fields is through the leading EOF, whereas later truncation only provides small improvements. Furthermore, the leading EOF is less affected by the problems associated with subsequent EOFs mentioned in section 1b(1). Even when these higher EOF modes are included, the CCA model is able to filter out this noise and prevent a rise in MAE with increasing truncation. Thus, the use of LSEOF-derived EOFs in CCA predictive models appears to be less problematic than in field reconstruction, especially in cases where the strongest correlation is via a dominant leading EOF mode.

DINEOF is also shown to deal better with data fields containing a high degree of noise. In addition to producing more accurate leading EOFs, a larger number of trailing EOFs can be used in the truncated reconstruction (as compared to RSEOF) before error begins to increase (Fig. 7). Thus, DINEOF is better able to determine both leading, large-scale EOFs, as well as higher EOFs, which correspond to small-scale features.

b. Computational considerations

This work has focused on the accuracy of gappy EOF approaches rather than their computational speed because we believe that, in most cases, missing data are more likely to be the limiting factor of an analysis. Nevertheless, the differences between the RSEOF and DINEOF approaches deserve mention, as they may be of interest for larger analyses. Users will need to evaluate whether improvements in EOF accuracy merit the additional computational costs of the DINEOF approach.

The DINEOF approach required ~400 iterations (i.e., individual SVD operations) to converge on an optimized interpolation using 70 EOFs, while RSEOF provided nearly as good a fit at a fraction of the computational time. As suggested by one of the reviewers, the speed of DINEOF can be increased through the adoption of less strict RMS convergence criteria for earlier EOF modes while maintaining more strict convergence criteria in later iterations. Furthermore, RSEOF may be used in combination with DINEOF to provide a better first-guess estimate of missing values and thereby help reduce the number of iterations needed for convergence.

For very large matrices, the computational speed of both DINEOF and RSEOF can be increased through combination with a Lanczos bidiagonalization, which derives a smaller subset of EOF patterns through partial SVD. The Lanczos solver is included in the UNIX distribution of DINEOF but will need to be implemented for use in other programming languages (e.g., R package irlba; Baglama and Reichel 2012).

5. Conclusions

The derivation of EOFs from gappy data by means of a covariance matrix decomposition and subsequent least squares estimate of EOF coefficients (LSEOF) is demonstrated to be deficient for use in data field reconstruction and prediction. At the heart of this deficiency is the decomposition of a covariance matrix that is not positive definite, which results in amplified λ values and EOF coefficients that are not strictly orthogonal. As a consequence, the variance of the reconstructed field is also amplified.

The DINEOF and RSEOF approaches are able to successfully remedy these shortcomings through optimal EOF interpolation of missing values and preservation of EOF orthogonality by recursive EOF subtraction, respectively. The DINEOF approach is shown to be the superior approach and is especially useful in deriving smaller-scale features in noisy fields. The RSEOF approach, introduced here, provides a reliable alternative, which may be attractive in exploratory analyses of large data fields or as a means of providing an initial estimate of missing values preceding a more refined interpolation with DINEOF.

Acknowledgments

We thank the three anonymous reviewers for their helpful critique of this work. The authors are grateful to the German Research Foundation for funding (LO-1143/6).

REFERENCES

  • Alvera-Azcárate, A., A. Barth, M. Rixen, and J. Beckers, 2005: Reconstruction of incomplete oceanographic data sets using empirical orthogonal functions: Application to the Adriatic sea surface temperature. Ocean Modell., 9, 325–346.

  • Baglama, J., and L. Reichel, cited 2012: irlba: Fast partial SVD by implicitly-restarted Lanczos bidiagonalization. [Available online at http://CRAN.R-project.org/package=irlba.]

  • Barnett, T. P., and R. Preisendorfer, 1987: Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon. Wea. Rev., 115, 1825–1850.

  • Beckers, J. M., and M. Rixen, 2003: EOF calculations and data filling from incomplete oceanographic datasets. J. Atmos. Oceanic Technol., 20, 1839–1856.

  • Bien, J., and R. J. Tibshirani, 2011: Sparse estimation of a covariance matrix. Biometrika, 98, 807–820.

  • Björnsson, H., and S. Venegas, 1997: A manual for EOF and SVD analyses of climate data. McGill University, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global Change Research Tech. Rep., 52 pp.

  • Boyd, J. D., E. P. Kennelly, and P. Pistek, 1994: Estimation of EOF expansion coefficients from incomplete data. Deep-Sea Res. I, 41, 1479–1488.

  • Bretherton, C. S., C. Smith, and J. M. Wallace, 1992: An intercomparison of methods for finding coupled patterns in climate data. J. Climate, 5, 541–560.

  • Chavez, F. P., and R. Brusca, 1991: The Galapagos Islands and Their Relation to Oceanographic Processes in the Tropical Pacific. Plenum Press, 933 pp.

  • Globcolour Project, 2007: GlobCOLOUR: An EO based service supporting global ocean carbon cycle research. ACRI-ST/LOV Full Validation Rep., 76 pp. [Available online at http://www.globcolour.info/validation/report/GlobCOLOUR_FVR_v1.1.pdf.]

  • Hasselmann, K., 1988: PIPs and POPs: The reduction of complex dynamical systems using principal interaction and oscillation patterns. J. Geophys. Res., 93 (D9), 11 015–11 021.

  • Hu, C., K. Carder, and F. Muller-Karger, 2001: How precise are SeaWiFS ocean color estimates? Implications of digitization-noise errors. Remote Sens. Environ., 76, 239–249.

  • Kaplan, A., Y. Kushnir, M. A. Cane, and M. B. Blumenthal, 1997: Reduced space optimal analysis for historical data sets: 136 years of Atlantic sea surface temperatures. J. Geophys. Res., 102 (C13), 27 835–27 860.

  • Kaplan, A., Y. Kushnir, and M. A. Cane, 2000: Reduced space optimal interpolation of historical marine sea level pressure: 1854–1992. J. Climate, 13, 2987–3002.

  • Marshall, J., A. Adcroft, C. Hill, L. Perelman, and C. Heisey, 1997: A finite-volume, incompressible Navier Stokes model for studies of the ocean on parallel computers. J. Geophys. Res., 102 (C3), 5753–5766.

  • MITgcm Group, cited 2012: MITgcm user manual. MIT/EAPS. [Available online at http://mitgcm.org/public/r2_manual/latest/online_documents.]

  • North, G., T. Bell, R. Cahalan, and F. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev., 110, 699–706.

  • Pennington, J., K. Mahoney, V. Kuwahara, D. Kolber, R. Calienes, and F. Chavez, 2006: Primary production in the eastern tropical Pacific: A review. Prog. Oceanogr., 69, 285–317.

  • R Core Team, cited 2012: R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing. [Available online at http://www.R-project.org/.]

  • Schartau, M., A. Engel, J. Schroter, S. Thoms, C. Volker, and D. Wolf-Gladrow, 2007: Modelling carbon overconsumption and the formation of extracellular particulate organic carbon. Biogeosciences, 4, 433–454.

  • Taylor, M. H., M. Losch, and A. Bracher, 2013: On the drivers of phytoplankton blooms in the Antarctic marginal ice zone: A modeling approach. J. Geophys. Res., 118, 63–75, doi:10.1029/2012JC008418.

  • von Storch, H., and F. W. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge University Press, 484 pp.

  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Academic Press, 627 pp.

  • Willmott, C. J., and K. Matsuura, 2005: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res., 30, 79–82.
