## 1. Introduction

Data connect scientific research to real-world phenomena. In this modern era of big data, we often face two challenges: 1) devising an effective way to store data when storage capacity is limited and 2) designing an efficient data analysis method to extract useful information from data. These two challenges are often intertwined: efficient storage can facilitate data analysis, and an efficient analysis algorithm can reduce the need for storage.

The demand for large storage capacity is increasing at an unprecedented pace. As of 2012, the typical size of data that are generated daily and need to be saved is about 2.5 exabytes (2.5 × 10^{18} bytes) (http://en.wikipedia.org/wiki/Big_data). In climate science, for instance, the National Aeronautics and Space Administration (NASA) Center for Climate Simulation (NCCS) stores 32 petabytes (3.2 × 10^{16} bytes) of climate observations and simulations on the Discover supercomputing cluster (http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission). The high volume and noise of the data present a huge storage challenge. Devising effective methods to compress data as well as to reduce noise in the data would significantly help to meet this challenge.

Another challenge posed by big data is the difficulty of distilling useful information. Extracting physically meaningful signals from data and understanding the relevant phenomena in the real world require data analysis to be carried out objectively. To achieve objectivity, any adopted analysis method should include as few subjective assumptions as possible to avoid biased results. However, many traditional analysis methods, especially those used for time–frequency analysis, follow rigorous mathematical rules and start with prescribed basis functions. In these methods, the signal is convolved with basis functions to obtain amplitude and frequency. Although the mathematical foundations of the methods are sound, they are not adaptive to the data because the a priori–specified basis functions do not necessarily fit the data well at every temporal location. Moreover, these methods are often subject to various assumptions about the data (such as linearity and stationarity). Because of the intertwined physical processes in most natural phenomena, the collected data are often inconsistent with the assumptions associated with various methods. Traditional methods, such as principal component analysis (PCA), which is also called empirical orthogonal function (EOF) analysis (Lorenz 1956; Wallace 1972; Wallace and Dickinson 1972; Dommenget and Latif 2002; Hannachi et al. 2007; Monahan et al. 2009), focus on extracting static spatial patterns. These methods have provided many opportunities to distill key climate variability information, especially when the particular climate phenomenon is stationary. However, they are not as effective in studying nonstationary climate phenomena.

To address the problems of data storage and traditional analyses discussed above, methods that can capture the essence of climate evolution and effectively extract key information on climate variability and change are necessary. To satisfy this need, the multidimensional ensemble empirical mode decomposition method (MEEMD) was developed (Wu et al. 2009). The MEEMD, devised for analyzing multidimensional temporal–spatial climate data, is based on an adaptive and local method, the ensemble empirical mode decomposition (EEMD) (Huang and Wu 2008; Wu and Huang 2009). In MEEMD, time series of any climate variable at individual grid points are decomposed into oscillatory components on different time scales. By piecing together the individual components on similar time scales from the time series of different spatial locations, one obtains the spatial evolution of climate variables on different time scales. When the number of grid points of a climate field is large, such as in the reanalysis of historical climate over the globe or in climate model outputs, the computational costs are quite large. In this case, accelerating the algorithm of MEEMD becomes an important goal.

In this study, we present and validate a method for improving data storage techniques and accelerating MEEMD analysis of climate data. This goal is achieved through using the lossy data compression^{1} technique and modifying the MEEMD algorithm. During this process, we strictly enforce the principle that the lossy compression does not lose key information about the climate data but reduces temporally and spatially incoherent noise. This requirement is met by using the PCA, which is highly efficient in extracting temporally and spatially coherent structures. An advantage of using PCA is that a small number of principal components (PCs) and their corresponding empirical orthogonal functions (EOFs) may be sufficient to retain the key data information. By decomposing PCs using EEMD and taking into account the corresponding spatial structures of EOFs, we obtain the spatial evolutions of a climate variable on different time scales.

In the following sections, we present the details of this effort. The validities of lossy data compression and the accelerated (hereinafter, fast MEEMD) algorithm will be demonstrated by applying these newly developed techniques to global sea surface temperature anomalies (SSTAs). The remainder of the paper is organized as follows: Section 2 provides a brief description of the data. Section 3 reviews the original MEEMD method, introduces the concept of using PCA/EOF analysis as a tool of lossy data compression, and presents the fast MEEMD algorithm. Section 4 describes the validation of the method, and section 5 summarizes and discusses the results.

## 2. Data

In this study, we demonstrate the fast MEEMD by analyzing the monthly extended reconstructed sea surface temperature, version 3b (ERSST.v3b), dataset (Smith et al. 2008) over the global oceans. ERSST.v3b is generated using in situ SST data based on the International Comprehensive Ocean Atmosphere Data Set (ICOADS) release 2.4 and combined with statistical methods to ensure stable reconstruction with sparse data before 1880. Note that satellite data are not included in ERSST.v3b to avoid residual cold biases. The dataset, which has 2° × 2° spatial resolution, dates back to 1854. However, to avoid a damped signal (prior to 1880), only the segment from January 1880 through December 2009 (130 yr) of the data is used for this study.

Since PCA/EOF analysis requires mean-removed input data, we use the SSTA instead of the original total SST. In this paper, SSTA is defined as the departure from the mean SST over the entire temporal span of data for every grid point. The annual cycle of SST at every grid point is not removed. Another reason for defining the anomaly in this manner is that the SST may contain an amplitude–phase-modulated annual cycle (Gu and Philander 1995; Gu et al. 1997; Pezzulli et al. 2005; Wu et al. 2008; Qian et al. 2011), which should be included to fully understand the variability and change of the original total SST.
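The anomaly definition above can be illustrated with a minimal sketch. The array name `sst`, its layout (time by grid point), and the synthetic values are assumptions made for illustration only; missing values over land are not handled here.

```python
import numpy as np

def sst_anomaly(sst):
    """Remove the full-record temporal mean at each grid point.

    sst: array of shape (n_time, n_grid). The annual cycle is NOT removed.
    """
    sst = np.asarray(sst, dtype=float)
    return sst - sst.mean(axis=0, keepdims=True)

# Synthetic example: 10 years of monthly data at 2 grid points,
# each with its own mean and its own annual-cycle amplitude.
months = np.arange(120)
sst = np.column_stack([
    25.0 + 2.0 * np.sin(2 * np.pi * months / 12),
    18.0 + 4.0 * np.sin(2 * np.pi * months / 12),
])
ssta = sst_anomaly(sst)
```

The resulting `ssta` has zero temporal mean at every grid point while the annual cycle remains in the series, which is the property the definition in the text relies on.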

## 3. Compression and fast MEEMD

### a. MEEMD

The method we focus on in this study is an adaptive and temporal local analysis method, MEEMD (Wu et al. 2009). The development of MEEMD was based on EMD (Huang et al. 1998) and its more robust version EEMD (Huang and Wu 2008; Wu and Huang 2009). The EMD-based methods have already been widely applied in numerous climate studies (Franzke 2009; Qian et al. 2009; Ruzmaikin and Feynman 2009; Franzke 2010; Qian et al. 2010; Vecchio and Carbone 2010; Franzke and Woollings 2011; Fu et al. 2011; Wu et al. 2011; Hu et al. 2012; Huang et al. 2012a,b; Zhu et al. 2012; Misra et al. 2013).

In EMD, a time series *x*(*t*) is decomposed into a finite and often small number of oscillatory components *c*_{j}, such that

$$x(t) = \sum_{j=1}^{n} c_{j}(t) + r_{n}(t),$$

where *c*_{j} represents simple oscillatory modes of certain frequencies, and *r*_{n} is the residual of the data. Each oscillatory component is obtained through a sifting process. In that process, for any time series *r*_{j−1} [when *j* = 1, *r*_{0} becomes *x*(*t*)], the local mean for the riding wave is obtained by taking the average of the upper and lower envelopes, which connect all the local maxima (upper envelope) and minima (lower envelope), respectively. The oscillatory component is obtained by subtracting the obtained local mean from *r*_{j−1} at every temporal location. This sifting process is repeated until the upper and lower envelopes of the oscillatory component are symmetric with respect to the zero line (ideally). In this way, we obtain the highest-frequency oscillatory component of the time series *r*_{j−1}. The decomposition process stops when the remaining *r*_{n} becomes a monotonic function or a function that has at most one internal extremum. Since the oscillatory components that contain amplitude and frequency modulation are obtained without specifying any a priori oscillatory basis, the oscillations in these components are natural. Thus, EMD uses natural wave forms, and it can work particularly well with nonlinear and nonstationary time series.
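A single pass of the sifting process described above can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function names are hypothetical, and the envelopes are built by linear interpolation between extrema, whereas EMD implementations commonly use cubic splines and treat the end points more carefully.

```python
import numpy as np

def local_extrema(x):
    """Return indices of interior local maxima and minima of a 1-D array."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    return maxima, minima

def sift_once(x):
    """One sifting pass: subtract the mean of the upper and lower envelopes."""
    t = np.arange(len(x))
    maxima, minima = local_extrema(x)
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema: treat x as the residual
    # Assumption: linear interpolation between extrema; np.interp holds the
    # first/last extremum value constant beyond the outermost extrema.
    upper = np.interp(t, maxima, x[maxima])
    lower = np.interp(t, minima, x[minima])
    local_mean = 0.5 * (upper + lower)
    return x - local_mean

# A fast riding wave superimposed on a slow linear trend.
t = np.linspace(0.0, 1.0, 1000)
x = np.sin(2 * np.pi * 20 * t) + 3.0 * t
candidate = sift_once(x)
```

In this synthetic case one pass already removes most of the trend, leaving a candidate close to the fast sine wave. In the full procedure the pass is repeated until the envelopes are nearly symmetric about zero, the resulting component is subtracted, and sifting continues on what remains.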

In spite of the advantages of EMD over traditional data analysis methods, it has a serious scale-mixing problem, and its derived components are sensitive to minor noise contained in the data. To address these deficiencies, Wu and Huang (2009) developed EEMD, which resolves the problems of EMD to a large degree and maintains the aforementioned advantages of EMD. Specifically, white noise is used in EEMD to provide a relatively uniform reference-scale distribution and to extract scale-consistent signals. Instead of altering the true signals, the added white noise series simply cancel each other after ensemble siftings and facilitate extraction of meaningful signals. In this sense, EEMD qualifies as a noise-assisted data analysis (NADA) method.

Most recently, MEEMD, which is based on the applications of EEMD, was proposed (Wu et al. 2009). The main idea behind MEEMD is to view a multidimensional dataset as a combination of data in each dimension and to apply EEMD repeatedly for data in each dimension; by combining the components of similar scale among each dimension, one can obtain multidimensional evolving patterns on different scales. More specifically, for multidimensional temporal–spatial climate data, the first step of MEEMD is to apply EEMD to time series at each grid point to obtain oscillatory components on different time scales. We then combine all the oscillatory components on a particular time scale at different grid points to give the temporally and spatially evolving pattern on this specific time scale. By repeating this process for other time scales, we obtain the temporal–spatial evolving patterns on all naturally separated time scales. Figure 1 presents the results of MEEMD analysis for SSTAs along the equator from 160°E to 160°W. The original zonal–temporal evolution of SSTAs is plotted in Fig. 1a; the counterparts of the higher-frequency component (*c*_{1} + *c*_{2} + *c*_{3}) and the *c*_{4} are plotted in Figs. 1b–c, respectively; the zonal–temporal evolution of lower-frequency SSTAs (sum of *c*_{5} through *c*_{10}) is shown in Fig. 1d. Since the differences of the SSTA series of neighboring grids are usually small, the temporal locality of EEMD makes the resulting components insensitive to small perturbations while capturing the minor extrema shift caused by systematic signal propagation. Therefore, the resulting components of the same rank (third, fourth, etc.) can be pieced together to obtain the spatial evolution of the data on the time scale represented by a particular component. This argument is well supported by Fig. 1c, in which the eastward propagation of the warm anomalies on a quasi-biennial time scale is captured for 2002/03. 
The propagation is further illustrated by individually plotted SSTA time series of grid points, as shown in Fig. 1e.

The MEEMD method allows us to investigate the evolution on naturally separated time scales of temporally and spatially multidimensional data. This is one of the unique advantages of MEEMD over other traditional analysis methods. However, MEEMD is computationally expensive for data that have large domain(s), which is often the case with climate datasets. This study is motivated by the need to accelerate the MEEMD algorithm for multidimensional big data.

### b. PCA/EOF analysis as a data compression tool

The pioneering works of Obukhov (1947) and Lorenz (1956) introduced PCA/EOF analysis for weather/climate analysis and forecasting. PCA/EOF analysis takes advantage of the hidden coherence in multidimensional (e.g., spatial–temporal) data and uses eigenvectors to effectively represent the data. It decomposes the data into certain structures (EOFs) in a subdimension (space) and their corresponding PCs in the remaining subdimension (time). Often, a small fraction of the EOFs and their corresponding PCs can capture almost all of the coherent information hidden in data; the remaining EOFs/PCs are incoherent noise. This characteristic makes PCA/EOF analysis an effective tool for compressing multidimensional spatial–temporal coherent data.

To elaborate upon the compression capacity of PCA/EOF analysis, here we illustrate the potential compressibility of three different types of spatial–temporal data (Fig. 2). In Fig. 2a, the selected SSTA time series are from three neighboring grid points and they tend to behave identically over time. Thus, a dominant direction of variability and change in phase space spanned by SSTA time series can be immediately identified (Fig. 2a). When viewed from other angles (Figs. 2b,c) variability and change in the remaining directions are very small, almost negligible. On the other hand, it is expected that variations are evenly distributed across all directions for random time series (white noise) in phase space (Figs. 2g–i) because noise is incoherent. Since the time–space coherence between SSTAs at widely separated grid points is small, the variations in the three time series can be quite different, and they resemble white noise when they are displayed in phase space, as shown in Figs. 2d–f. When PCA/EOF analysis is applied to these three types of data, the dominant variability and change of the first type of data can essentially be represented by one eigenvector (EOF) and its corresponding PC, whereas the latter two types have no preferred direction of variability and change, and all EOFs and PCs need to be retained to capture the nature of the variability and change of the original data.
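The contrast between coherent and incoherent data can be reproduced with synthetic series. The sketch below is only an illustration under stated assumptions: the series are generated rather than taken from ERSST.v3b, and the helper name `leading_variance_fraction` is hypothetical.

```python
import numpy as np

def leading_variance_fraction(data):
    """Fraction of total variance captured by the first EOF/PC pair.

    data: array of shape (n_time, n_series); the temporal mean is removed here.
    """
    anomalies = data - data.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance of each mode.
    s = np.linalg.svd(anomalies, compute_uv=False)
    var = s ** 2
    return var[0] / var.sum()

rng = np.random.default_rng(0)
n_time = 1000
common = np.sin(2 * np.pi * np.arange(n_time) / 48.0)

# Three "neighboring" series: one shared signal plus weak independent noise.
coherent = common[:, None] + 0.1 * rng.standard_normal((n_time, 3))
# Three independent white-noise series.
noise = rng.standard_normal((n_time, 3))

frac_coherent = leading_variance_fraction(coherent)
frac_noise = leading_variance_fraction(noise)
```

For the coherent series a single EOF/PC pair carries nearly all of the variance, whereas for white noise the leading pair carries only slightly more than one-third, so no truncation is possible without losing a comparable share of the variance.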

Fig. 2. Variations of SSTAs in phase space. The position of each point in the three-dimensional space is determined by (a)–(c) SSTAs of three neighboring grid points (0°, 178°E), (0°, 180°), and (0°, 178°W) at one specific time step; (d)–(f) SSTAs of three grid points (0°, 88°E), (0°, 180°), and (0°, 92°W) at one specific time step; and (g)–(i) the values of three generated random series at one specific time step. All points at different time steps are then chronologically connected with straight lines. The subplots in the same row are identical but viewed from different angles.

Citation: Journal of Climate 27, 10; 10.1175/JCLI-D-13-00746.1

The preceding illustration (Fig. 2) also provides a guideline for detailed compression using PCA/EOF analysis: the compression should consider the domain size that affects the spatial–temporal coherence. To ensure spatial–temporal coherence, a large spatial domain is divided into smaller subdomains. This division has a disadvantage and an advantage: on one hand, a large number of subdomains leads to a large total number of PCs/EOFs for the whole domain because each subdomain has to have enough pairs of PCs/EOFs to retain the information on variability and change hidden in the data, potentially compromising the compression rate. On the other hand, the compression quality might be largely improved for a smaller spatial domain, in which the spatial–temporal coherence is stronger. Therefore, a compromise is needed; in this study, the global oceans are first arbitrarily divided into eight regions as shown in Fig. 3.

Fig. 3. Division of global oceans. Black open boxes denote the main eight basins: northern Atlantic Ocean (30°–60°N, 74°–2°W), northern Pacific Ocean (30°–60°N, 120°E–74°W), southern Indian Ocean (60°–30°S, 44°–120°E), southern Atlantic Ocean (60°–30°S, 74°–2°W), southern Pacific Ocean (60°–30°S, 120°E–74°W), tropical Indian Ocean (30°S–30°N, 44°–120°E), tropical Atlantic Ocean (30°S–30°N, 74°–2°W), and tropical Pacific Ocean (30°S–30°N, 120°E–74°W). The tropical Pacific Ocean is further divided into ETP (30°S–30°N, 180°–74°W; red dashed box) and WTP (30°S–30°N, 120°E–180°; blue dashed box).

The compression rate that can be achieved by using PCA/EOF analysis is dependent on the spatial–temporal coherence of the data. Here we use a hypothetical case to illustrate the compression rate. Suppose that we have a dataset with 10 000 grid points and 1000 time steps that has good spatial–temporal coherence, that is, the sum of the first 40 leading pairs of PCs/EOFs can capture accurately the spatial–temporal coherence of the original data. The original data have 10^{7} values. The compressed data have 40 EOFs, which corresponds to 4 × 10^{5} values. The corresponding 40 PCs contain 4 × 10^{4} values. Therefore, the total number of values in the compressed data is 4.4 × 10^{5}. This number indicates the compression used in this example can reduce the size of the data to about 1/23rd of the original size. When the number of time steps increases, the compression rate will be even larger, for the addition of data will only increase the length of PCs for the compressed data.
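The arithmetic of this hypothetical case can be checked with a few lines of code. The function name `compressed_size` is an illustrative assumption; the numbers are those given in the text.

```python
def compressed_size(n_grid, n_time, n_modes):
    """Number of stored values after keeping n_modes EOF/PC pairs."""
    eof_values = n_modes * n_grid   # each EOF has one value per grid point
    pc_values = n_modes * n_time    # each PC has one value per time step
    return eof_values + pc_values

n_grid, n_time, n_modes = 10_000, 1_000, 40
original = n_grid * n_time
stored = compressed_size(n_grid, n_time, n_modes)
ratio = original / stored
print(original, stored, round(ratio, 1))   # 10000000 440000 22.7

# Doubling the record length only lengthens the PCs, so the ratio grows.
longer_ratio = (n_grid * 2 * n_time) / compressed_size(n_grid, 2 * n_time, n_modes)
print(round(longer_ratio, 1))              # 41.7
```

The second calculation illustrates the closing remark of the paragraph: with the same 40 retained pairs, doubling the number of time steps raises the ratio from about 23 to about 42.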

### c. The fast MEEMD

The fast MEEMD algorithm consists of the following steps. First, the EOFs **V**_{i}, and their corresponding PCs **Y**_{i}, of SSTA over a subdomain are computed. Second, the number of pairs of EOFs/PCs that are retained in the compressed data is determined by calculating the accumulated total variance of the leading EOF/PC pairs. In practice, we find that the first 40 pairs of EOFs/PCs explain more than 99.5% of the variance for each subdomain (Table 1). Therefore, we retain only 40 pairs of EOFs/PCs in compression for each subdomain. Next, each PC **Y**_{i} is decomposed using EEMD, that is,

$$\mathbf{Y}_{i} = \sum_{j=1}^{n} c_{i,j} + r_{i,n},$$

where *c*_{i,j} represents simple oscillatory modes of certain frequencies, and *r*_{i,n} is the residual of the data **Y**_{i}. The final result of the *j*th MEEMD component *C*_{j} is obtained as

$$C_{j} = \sum_{i=1}^{40} c_{i,j}\,\mathbf{V}_{i}.$$

Fig. 4. Flowchart for the fast MEEMD algorithm. The terms *i* (*i* = 1, 2, 3, …, 40) denote the *i*th mode, and the subscripts *j* (*j* = 1, 2, 3, …, 10) denote the *j*th component from EEMD of a certain frequency.

Table 1. Relevant statistical quantities during data compression and reconstruction. The second column lists the accumulative variance of the first 40 modes in each subdomain; the third and fourth columns give the temporal mean and standard deviation of the differences between the reconstructed and original SSTAs at selected grid points in all subdomains, respectively.

From the above description of fast MEEMD, it can be seen that the EEMD decomposition applies only to a small number of PCs. For the hypothetical example described in section 3b, EEMD must be applied to 10 000 time series in the original MEEMD, whereas it is applied to only 40 PCs in the fast MEEMD, potentially speeding up the calculation by a factor of 250. Since the PCA/EOF analysis is computationally economical, it does not add significantly to the total computational time. Note that the computational speed of the fast MEEMD depends on both the number of grid points and the temporal–spatial coherence of the data.
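The steps of the fast MEEMD described in this section can be sketched as below. This is a structural illustration under loud assumptions: `toy_decompose` is a hypothetical moving-average split that merely stands in for EEMD (it is not EEMD), the field is synthetic, and only two modes are retained instead of 40.

```python
import numpy as np

def toy_decompose(y, window=13):
    """Stand-in for EEMD (NOT EEMD): split y into a fast and a slow part.

    Returns an array of shape (2, n_time) whose rows sum to y.
    """
    kernel = np.ones(window) / window
    slow = np.convolve(y, kernel, mode="same")
    return np.stack([y - slow, slow])

def fast_meemd(ssta, n_modes, decompose=toy_decompose):
    """PCA/EOF compression followed by decomposition of the retained PCs.

    ssta: anomalies of shape (n_time, n_grid), temporal mean already removed.
    Returns components of shape (n_comp, n_time, n_grid) and the truncated
    reconstruction of shape (n_time, n_grid).
    """
    # SVD gives ssta = U @ diag(s) @ Vt; PCs are U * s, EOFs are rows of Vt.
    u, s, vt = np.linalg.svd(ssta, full_matrices=False)
    pcs = u[:, :n_modes] * s[:n_modes]   # (n_time, n_modes)
    eofs = vt[:n_modes]                  # (n_modes, n_grid)

    # Decompose each retained PC: shape (n_modes, n_comp, n_time).
    parts = np.stack([decompose(pcs[:, i]) for i in range(n_modes)])

    # C_j(t, s) = sum_i c_{i,j}(t) V_i(s)
    components = np.einsum("ijt,is->jts", parts, eofs)
    return components, pcs @ eofs

rng = np.random.default_rng(1)
n_time, n_grid = 240, 50
t = np.arange(n_time)
pattern_a, pattern_b = rng.standard_normal((2, n_grid))
field = (np.outer(np.sin(2 * np.pi * t / 12), pattern_a)
         + np.outer(np.sin(2 * np.pi * t / 60), pattern_b)
         + 0.05 * rng.standard_normal((n_time, n_grid)))
field -= field.mean(axis=0)

components, reconstructed = fast_meemd(field, n_modes=2)
```

Because the stand-in decomposition is complete (its parts sum back to each PC), the components sum to the truncated reconstruction. The decomposition routine is called once per retained PC rather than once per grid point, which is where the acceleration comes from.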

It should be noted that to what extent a particular EOF spatial structure projects onto an evolving MEEMD mode depends on the temporal characteristics of the corresponding PC. When the PC is dominated by the variability of a particular time scale, the projection is large; otherwise, the projection may be quite small. This projection may be quantified by calculating the variance fraction of an EEMD component of the PC. Since the final evolving MEEMD component of a naturally separated time scale is the sum of the projections of all PCs/EOFs onto that particular time scale, the overall contribution of an EOF spatial structure can be quantified by the product of the variance fraction explained by a PCA/EOF mode and the variance fraction of an EEMD component of the PC.
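The product rule stated above can be written out as a small helper. The function name and the numerical values are hypothetical and chosen only for illustration; they are not results from this study.

```python
def eof_contribution(mode_variance_fraction, component_variance_fraction):
    """Overall contribution of one EOF to one MEEMD time scale.

    Product of (a) the variance fraction explained by the PCA/EOF mode and
    (b) the variance fraction of one EEMD component within that mode's PC.
    """
    return mode_variance_fraction * component_variance_fraction

# Hypothetical numbers, for illustration only (not results from the paper):
# a mode explaining 30% of the variance whose PC has 80% of its variance
# in a single EEMD component.
dominant = eof_contribution(0.30, 0.80)
# The same mode contributes far less to a time scale holding 5% of the PC.
minor = eof_contribution(0.30, 0.05)
```

With these illustrative values the same EOF contributes 24% of the total variance on its dominant time scale but only 1.5% on the minor one.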

## 4. Validation

### a. Accuracy of compression

As mentioned in section 3c, the sum of the first 40 EOFs/PCs explains more than 99.5% of the variance for each subdomain (Table 1). This is one reason for retaining only 40 EOFs/PCs. As numerous climate studies have shown (Weare et al. 1976; Servain and Legler 1986; Thompson and Wallace 1998; Church and White 2011), EOFs/PCs that do not explain much of the variance tend to display highly incoherent structures both in space and time, implying that these EOFs/PCs have characteristics of incoherent noise. In Fig. 5, we display the 41st EOFs from different subdomains. In general, our results show spatial incoherence. The only exception is the tropical Pacific domain, where the 41st EOF still has a relatively larger-scale structure. However, this structure can hardly be linked to the known dynamics of tropical air–sea interaction, and its explained variance is only 0.02%.

Fig. 5. EOFs of the 41st mode of PCA/EOF analysis in the (a) northern Pacific, (b) southern Pacific, (c) tropical Pacific, (d) WTP, and (e) ETP. The values are unitless coefficients.

To avoid loss of useful information on climate variability and change that could result from dropping EOFs/PCs for the tropical Pacific, we further divide the tropical Pacific into two new subdomains: the eastern tropical Pacific (ETP) and the western tropical Pacific (WTP) (Fig. 3). The 41st EOFs of the two new subdomains display incoherent spatial structures of much smaller scales (Figs. 5d,e). This new division echoes our earlier compromise between the selection of the PCs/EOFs truncation number and the accuracy of the compressed data.

To further evaluate the quality of the reconstruction from compressed data, the reconstructed SSTAs are compared with their original counterparts at randomly chosen grid points in each subdomain. The results in the tropical oceans are shown in Fig. 6. Note that for better visual display, only a randomly selected period of time (i.e., January 1970–December 1999) is shown in the left panels of Fig. 6. In the right panels, we display the corresponding time-lagged autocorrelation of the difference between the original data and the reconstructed data. For all these randomly selected grid points, the differences between the original data and the reconstructed data are negligibly small. In addition, the time-lagged autocorrelations of these differences all show a quick drop from 1 (zero-lagged autocorrelation) to less than 0.4 (3-month-lagged autocorrelation) and a tailed-off structure, implying that these differences are mainly a combination of white noise and red noise. The mean of the difference at each grid point is less than 0.02°C, and the corresponding standard deviation is less than 0.1°C (Table 1), within the range of random errors contained in observational data. Further analysis shows that these differences are almost all white noise outside the tropical zone (not shown).
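The time-lagged autocorrelation diagnostic used above can be sketched as follows. The series here are synthetic white noise and red noise, generated only to show how the two behave; the AR(1) coefficient of 0.7 is an arbitrary illustrative choice, not a value estimated from the SSTA differences.

```python
import numpy as np

def lagged_autocorrelation(x, max_lag):
    """Autocorrelation coefficients of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(2)
n = 5000

# White noise: correlation is near zero at every nonzero lag.
white = rng.standard_normal(n)

# Red noise (AR(1) with coefficient 0.7): correlation decays roughly as 0.7**k.
red = np.zeros(n)
for i in range(1, n):
    red[i] = 0.7 * red[i - 1] + rng.standard_normal()

acf_white = lagged_autocorrelation(white, 12)
acf_red = lagged_autocorrelation(red, 12)
```

A difference series made of white noise drops immediately from 1 at zero lag, whereas a red-noise contribution produces the gradual tail described in the text.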

Fig. 6. (left) Comparison between original (blue) and reconstructed (green) SSTAs (°C) at randomly selected grid points in the (a) tropical Atlantic, (c) tropical Indian, and (e) WTP. Red denotes the differences between the original and reconstructed SSTAs. (right) Autocorrelation coefficients of the difference at the corresponding grid point in the left panel with time lags of 1 to 100 months.

To check the overall quality of the compression, we plot the SSTAs over the tropical Pacific for a randomly selected year, 1990 (Fig. 7). The differences between the compressed SSTAs and the original SSTAs are almost unnoticeable; the spatial correlations between them are all higher than 0.99, and the standard deviations of the differences are all smaller than 0.01 for all months. It is evident that the compressed SSTAs accurately retain the temporal–spatial evolution of their original counterparts.

Fig. 7. (left) Original and (right) reconstructed SSTAs (°C) of the tropical Pacific Ocean during 1990 at a 3-month interval.

### b. Equivalence of the fast MEEMD to the original MEEMD

One purpose of this study is to devise a fast MEEMD algorithm. Because entirely different approaches are used in the original MEEMD algorithm and the newly devised fast MEEMD, whether the two algorithms produce the same decomposition results needs to be evaluated. Here we again randomly select a grid point (10°N, 180°) to show that fast MEEMD can accurately recover the decomposition of SSTAs achieved by the original MEEMD. The correlation coefficients between the corresponding components are all greater than 0.9, except for the second pair of components (Fig. 8). To illustrate the closeness of the results from the two methods, we also define the root-mean-square error (RMSE) as the variance of the difference between an EEMD component obtained from fast MEEMD and the corresponding one from the original MEEMD, normalized by the variance of the corresponding component obtained using the original MEEMD. As displayed in Fig. 8, the RMSE values for all corresponding components are around 5%.
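The two closeness measures defined above can be computed as in the following sketch. The function names are hypothetical, and the pair of series is synthetic, constructed so that the expected values can be worked out by hand.

```python
import numpy as np

def normalized_error(fast, original):
    """Variance of (fast - original) divided by the variance of original."""
    fast = np.asarray(fast, dtype=float)
    original = np.asarray(original, dtype=float)
    return np.var(fast - original) / np.var(original)

def correlation(fast, original):
    """Pearson correlation coefficient between two components."""
    return np.corrcoef(fast, original)[0, 1]

# Synthetic pair: the "fast" component differs by a small out-of-phase term.
t = np.linspace(0.0, 2 * np.pi * 10, 1200, endpoint=False)
original = np.sin(t)
fast = np.sin(t) + 0.2 * np.cos(t)

err = normalized_error(fast, original)   # about 0.04, i.e., 4%
corr = correlation(fast, original)       # about 0.98
```

In this example a normalized error of 4% coexists with a correlation of about 0.98, which shows that the two measures are complementary: the first is sensitive to amplitude differences, the second mainly to phase agreement.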

Fig. 8. Stacked plot of SSTA components (°C) at grid point (180°, 10°N) by the original MEEMD (red) and the fast MEEMD (blue). Numbers on the left are the RMSE values of each pair of the components (see text for details); numbers on the right are the correlation coefficients of each pair of the components.

### c. Actual compression rate and computational acceleration

In section 3c, we use a hypothetical dataset to illustrate how the computational costs of the original MEEMD can be reduced by fast MEEMD. Because of the different sizes of the subdomains, compression rates and computational acceleration for different subdomains vary. Figure 9a shows the actual sizes of compressed data compared to their original counterparts for all subdomains. On average, the compressed data file is 91% smaller than the original SSTAs file. The time consumed by the two MEEMD algorithms for all subdomains is presented in Fig. 9b. The total computational time is obtained by running the two MEEMD algorithms in a MATLAB program installed on a personal computer and tracking the time taken to finish the calculations with the MATLAB tic and toc functions. The computational time for PCA/EOF analysis-based compression does not vary much with the size of the subdomain (the averaged computation time is about 6 s). The average computational time for EEMD applied to one time series is about 90 s. If the original MEEMD is applied to decompose the ERSST data of an individual subdomain, it takes from 10 to 60 h (second to ninth bars in Fig. 9b) to complete the calculation, depending on the size of the domain. Alternatively, when fast MEEMD is applied to the same data, it takes only 1 h on average (first bar in Fig. 9b) to complete the equivalent calculation for each subdomain, almost independent of domain sizes. The overall computational speed of fast MEEMD is about 30 times faster than that of original MEEMD for the decomposition of ERSST.
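The timing figures quoted above are mutually consistent, as a short calculation shows. The per-series and PCA times are taken from the text; the two subdomain sizes (400 and 2400 grid points) are hypothetical values back-calculated from the 10 h and 60 h bounds, not grid counts reported in the paper.

```python
EEMD_SECONDS_PER_SERIES = 90.0   # average EEMD time per series (from the text)
PCA_SECONDS = 6.0                # average PCA/EOF compression time (from the text)
N_MODES = 40                     # retained EOF/PC pairs per subdomain

def original_meemd_hours(n_grid_points):
    """EEMD applied to every grid-point time series."""
    return n_grid_points * EEMD_SECONDS_PER_SERIES / 3600.0

def fast_meemd_hours(n_modes=N_MODES):
    """One PCA/EOF compression plus EEMD applied to the retained PCs only."""
    return (PCA_SECONDS + n_modes * EEMD_SECONDS_PER_SERIES) / 3600.0

# Hypothetical subdomain sizes, back-calculated so that the original MEEMD
# takes 10 h and 60 h; they are not grid counts reported in the paper.
small, large = 400, 2400
print(original_meemd_hours(small), original_meemd_hours(large))   # 10.0 60.0
print(round(fast_meemd_hours(), 2))                                # 1.0
```

Forty PCs at about 90 s each account for the roughly 1 h per subdomain reported for fast MEEMD, and the cost is independent of the number of grid points because only the retained PCs are decomposed.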

(a) File size of the original uncompressed (yellow) and compressed (blue) SSTA data for each basin. Note that the northern Pacific (Atlantic) and the southern Pacific (Atlantic) have the same file size because their domain sizes are identical. (b) Estimated computational time for fast MEEMD (first bar) and original MEEMD in each subdomain (second to ninth bars). For the compressed SSTA data, the total computational time is almost independent of domain size, and the averaged computational time of all subdomains is given in the first bar. The estimated computational time for each subdomain using original MEEMD is calculated by multiplying the average computational time for EEMD (about 90 s) by the number of grid points with available data (i.e., excluding grids with missing data over land).

Citation: Journal of Climate 27, 10; 10.1175/JCLI-D-13-00746.1


## 5. Summary and discussion

In this big data era, efficiently storing data, effectively extracting useful information from data, and significantly reducing computational expenses have been popular research topics in the data analysis and processing community. In this study, we attempt to advance our capability in these areas, especially for spatially and temporally coherent climate data. We devise a climate data compression technique, that is, a fast MEEMD method, which is based on PCA/EOF analysis, to isolate variability and change on naturally separated time scales.

The lossy compression method using PCA/EOF analysis is demonstrated to be effective. For large-scale climate data, such as ERSST, storage size can be reduced by more than one order of magnitude. Almost all the key information on variability and change in the original data is retained in the compressed data. The removed EOFs/PCs are almost all spatially and temporally incoherent. When applied to decompose a typical climate dataset such as ERSST, the fast MEEMD can consistently recover the results of the original MEEMD, but at a computational speed that is one to two orders of magnitude higher. Both the compression rate and the acceleration of computational speed depend on the total number of grid points of the compressed domain. It is expected that the compression rate and computational speed can be increased by two orders of magnitude for higher-resolution data, such as modern high-resolution climate model outputs.
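The dependence of the compression rate on the number of grid points can be illustrated with a simple storage count. The sketch below assumes that the compressed dataset consists of the retained EOFs (one value per grid point each) and the corresponding PCs (one value per time step each), and it ignores file-format overhead; the function name and all sizes are hypothetical, and the 91% figure reported in section 4c comes from actual file sizes, not from this formula.

```python
def compressed_fraction(n_space, n_time, n_keep):
    """Fraction of the original storage needed after PCA/EOF truncation.

    Assumes the original field stores n_space * n_time values and the
    compressed form stores n_keep EOFs (n_space values each) plus
    n_keep PCs (n_time values each).
    """
    original = n_space * n_time
    compressed = n_keep * (n_space + n_time)
    return compressed / original


if __name__ == "__main__":
    # Hypothetical sizes: same record length and number of retained modes,
    # increasing spatial resolution.
    for n_space in (1000, 4000, 16000):
        print(n_space, compressed_fraction(n_space, n_time=1000, n_keep=50))
```

For a fixed record length and a fixed number of retained modes, the fraction decreases toward `n_keep / n_time` as the number of grid points grows, which is why larger gains are expected for higher-resolution data.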

It should be noted that the improved storage efficiency is achieved solely by the compression using PCA/EOF analysis. This compression also allows EEMD to decompose only a significantly reduced number of time series (often fewer by two orders of magnitude) instead of the time series of every grid point. As discussed in previous sections, the portion of the data removed by lossy compression is the spatially and temporally incoherent variability, which is often called noise. However, this spatially and temporally incoherent variability may be quite important for determining extreme events on a regional scale. Caution is therefore needed when the compressed data are used to study regional-scale extreme events.
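The workflow summarized here (compress with PCA/EOF analysis, decompose only the retained PCs, and recombine with the EOFs) can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used in this study: it relies on NumPy, the function names are hypothetical, and a simple moving-average split stands in for EEMD, which is not reproduced here.

```python
import numpy as np


def moving_average_split(series, window=13):
    """Stand-in decomposer (NOT EEMD): fast and slow parts that sum to the input."""
    kernel = np.ones(window) / window
    slow = np.convolve(series, kernel, mode="same")
    return np.stack([series - slow, slow])  # shape (2, n_time)


def fast_multidim_decompose(field, n_keep, decompose=moving_average_split):
    """Decompose a (n_time, n_space) anomaly field via its leading PCs.

    Returns (components, compressed), where components has shape
    (n_components, n_time, n_space) and compressed is the rank-n_keep field.
    """
    # PCA/EOF analysis through the SVD: field = U @ diag(s) @ Vt.
    u, s, vt = np.linalg.svd(field, full_matrices=False)
    pcs = u[:, :n_keep] * s[:n_keep]  # (n_time, n_keep) retained PCs
    eofs = vt[:n_keep]                # (n_keep, n_space) retained EOFs
    compressed = pcs @ eofs           # lossy reconstruction of the field

    # Only n_keep time series are decomposed instead of n_space.
    parts = np.stack([decompose(pcs[:, k]) for k in range(n_keep)])

    # Component j of the field: sum over modes of (part j of PC k) x (EOF k).
    components = np.einsum("kjt,ks->jts", parts, eofs)
    return components, compressed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    field = rng.standard_normal((120, 50))  # hypothetical 120 months x 50 points
    components, compressed = fast_multidim_decompose(field, n_keep=5)
    print(components.shape, np.allclose(components.sum(axis=0), compressed))
```

In this sketch the decomposer is called 5 times instead of 50. Because the stand-in components of each PC sum to that PC, the field components sum to the compressed field; any other decomposer whose components sum to its input could be passed through the `decompose` argument.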

These new techniques have significant implications for climate study. In this big data era, the use of large data repositories is common. For example, the repository at the Lawrence Livermore National Laboratory for the Program for Climate Model Diagnosis and Intercomparison (PCMDI) stores model outputs from all the models participating in the Coupled Model Intercomparison Project (CMIP). These data are widely used by climate scientists worldwide. However, downloading such large datasets is often difficult, especially for climate scientists in less developed countries, where Internet capacity is limited. The compressed data can help to alleviate data transfer difficulties and to make better use of available storage capacity. Fast MEEMD helps to reduce the computational costs of isolating evolution structures of climate variability and change on different time scales in large-scale climate data. Climate scientists can focus more on scientific investigation and less on the distractions resulting from insufficient computational power.

## Acknowledgments

We thank the two anonymous reviewers for their constructive suggestions. This work was supported by the U.S. National Science Foundation Grant AGS-1139479 (Feng and Wu), the NASA Program AIST-11-0012 (Feng and Wu), and NASA Grant NIVX13AG34G (Liu).

## REFERENCES

Candès, E. J., L. Demanet, D. L. Donoho, and L. Ying, 2006: Fast discrete curvelet transforms. *Multiscale Model. Simul.*, **5**, 861–899, doi:10.1137/05064182X.

Church, J. A., and N. J. White, 2011: Sea-level rise from the late 19th to the early 21st century. *Surv. Geophys.*, **32**, 585–602, doi:10.1007/s10712-011-9119-1.

Daubechies, I., 1992: *Ten Lectures on Wavelets.* SIAM, 357 pp.

Dommenget, D., and M. Latif, 2002: A cautionary note on the interpretation of EOFs. *J. Climate*, **15**, 216–225, doi:10.1175/1520-0442(2002)015<0216:ACNOTI>2.0.CO;2.

Fourier, J. B. J., 1822: *Théorie Analytique de la Chaleur.* Firmin Didot, 638 pp.

Franzke, C., 2009: Multi-scale analysis of teleconnection indices: Climate noise and nonlinear trends analysis. *Nonlinear Processes Geophys.*, **16**, 65–76, doi:10.5194/npg-16-65-2009.

Franzke, C., 2010: Long-range dependence and climate noise characteristics of Antarctic temperature data. *J. Climate*, **23**, 6074–6081, doi:10.1175/2010JCLI3654.1.

Franzke, C., and T. Woollings, 2011: On the persistence and predictability properties of North Atlantic climate variability. *J. Climate*, **24**, 466–472, doi:10.1175/2010JCLI3739.1.

Fu, C., C. Qian, and Z. Wu, 2011: Projection of global mean surface air temperature changes in next 40 years: Uncertainties of climate models and an alternative approach. *Sci. China Earth Sci.*, **54**, 1400–1406, doi:10.1007/s11430-011-4235-9.

Gu, D., and S. G. H. Philander, 1995: Secular changes of annual and interannual variability in the tropics during the past century. *J. Climate*, **8**, 864–876, doi:10.1175/1520-0442(1995)008<0864:SCOAAI>2.0.CO;2.

Gu, D., S. G. H. Philander, and M. J. McPhaden, 1997: The seasonal cycle and its modulation in the eastern tropical Pacific Ocean. *J. Phys. Oceanogr.*, **27**, 2209–2218, doi:10.1175/1520-0485(1997)027<2209:TSCAIM>2.0.CO;2.

Hannachi, A., I. T. Jolliffe, and D. B. Stephenson, 2007: Empirical orthogonal functions and related techniques in atmospheric science: A review. *Int. J. Climatol.*, **27**, 1119–1152, doi:10.1002/joc.1499.

Hu, Z. Z., B. Huang, J. L. Kinter III, Z. Wu, and A. Kumar, 2012: Connection of the stratospheric QBO with global atmospheric general circulation and tropical SST. Part II: Interdecadal variations. *Climate Dyn.*, **38**, 25–43, doi:10.1007/s00382-011-1073-6.

Huang, B., Z. Z. Hu, J. L. Kinter III, Z. Wu, and A. Kumar, 2012a: Connection of stratospheric QBO with global atmospheric general circulation and tropical SST. Part I: Methodology and composite life cycle. *Climate Dyn.*, **38**, 1–23, doi:10.1007/s00382-011-1250-7.

Huang, B., Z. Z. Hu, E. K. Schneider, Z. Wu, Y. Xue, and B. Klinger, 2012b: Influences of subtropical air–sea interaction on the multidecadal AMOC variability in the NCEP climate forecast system. *Climate Dyn.*, **39**, 531–555, doi:10.1007/s00382-011-1258-z.

Huang, N. E., and Z. Wu, 2008: A review on Hilbert-Huang transform: Method and its applications on geophysical studies. *Rev. Geophys.*, **46**, RG2006, doi:10.1029/2007RG000228.

Huang, N. E., Z. Shen, S. R. Long, M. C. Wu, E. H. Shih, Q. Zheng, C. C. Tung, and H. H. Liu, 1998: The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. *Proc. Roy. Soc. London*, **454**, 903–995, doi:10.1098/rspa.1998.0193.

Lorenz, E. N., 1956: Empirical orthogonal functions and statistical weather prediction. Department of Meteorology, Massachusetts Institute of Technology, Statistical Forecasting Project Sci. Rep. 1, 49 pp.

Misra, V., H. Li, Z. Wu, and S. DiNapoli, 2013: Global seasonal climate predictability in a two tiered forecast system: Part I: Boreal summer and fall seasons. *Climate Dyn.*, **42**, 1425–1448, doi:10.1007/s00382-013-1812-y.

Monahan, A. H., J. C. Fyfe, M. H. P. Ambaum, D. B. Stephenson, and G. R. North, 2009: Empirical orthogonal functions: The medium is the message. *J. Climate*, **22**, 6501–6514, doi:10.1175/2009JCLI3062.1.

Obukhov, A. M., 1947: Statistically homogeneous fields on a sphere. *Usp. Mat. Nauk*, **2**, 196–198.

Pezzulli, S., D. B. Stephenson, and A. Hannachi, 2005: The variability of seasonality. *J. Climate*, **18**, 71–88, doi:10.1175/JCLI-3256.1.

Qian, C., C. Fu, Z. Wu, and Z. Yan, 2009: On the secular change of spring onset in Stockholm. *Geophys. Res. Lett.*, **36**, L12706, doi:10.1029/2009GL038617.

Qian, C., Z. Wu, C. Fu, and T. J. Zhou, 2010: On multi-timescale variability of temperature in China in modulated annual cycle reference frame. *Adv. Atmos. Sci.*, **27**, 1169–1182, doi:10.1007/s00376-009-9121-4.

Qian, C., Z. Wu, C. Fu, and D. Wang, 2011: On changing El Niño: A view from time-varying annual cycle, interannual variability, and mean state. *J. Climate*, **24**, 6486–6500, doi:10.1175/JCLI-D-10-05012.1.

Ruzmaikin, A., and J. Feynman, 2009: Search for climate trends in satellite data. *Adv. Adapt. Data Anal.*, **1**, 667–679, doi:10.1142/S1793536909000266.

Servain, J., and D. M. Legler, 1986: Empirical orthogonal function analyses of tropical Atlantic sea surface temperature and wind stress: 1964–1979. *J. Geophys. Res.*, **91**, 14 181–14 191, doi:10.1029/JC091iC12p14181.

Smith, T. M., R. W. Reynolds, T. C. Peterson, and J. Lawrimore, 2008: Improvements to NOAA’s historical merged land–ocean surface temperature analysis. *J. Climate*, **21**, 2283–2296, doi:10.1175/2007JCLI2100.1.

Thompson, D. W. J., and J. M. Wallace, 1998: The Arctic Oscillation signature in the wintertime geopotential height and temperature fields. *Geophys. Res. Lett.*, **25**, 1297–1300, doi:10.1029/98GL00950.

Vecchio, A., and V. Carbone, 2010: Amplitude–frequency fluctuations of the seasonal cycle, temperature anomalies, and long-range persistence of climate records. *Phys. Rev.*, **82E**, 066101, doi:10.1103/PhysRevE.82.066101.

Wallace, J. M., 1972: Empirical orthogonal representation of time series in the frequency domain. Part II: Application to the study of tropical wave disturbances. *J. Appl. Meteor.*, **11**, 893–900, doi:10.1175/1520-0450(1972)011<0893:EOROTS>2.0.CO;2.

Wallace, J. M., and R. E. Dickinson, 1972: Empirical orthogonal representation of time series in the frequency domain. Part I: Theoretical considerations. *J. Appl. Meteor.*, **11**, 887–892, doi:10.1175/1520-0450(1972)011<0887:EOROTS>2.0.CO;2.

Weare, B. C., A. R. Navato, and R. E. Newell, 1976: Empirical orthogonal analysis of Pacific sea surface temperatures. *J. Phys. Oceanogr.*, **6**, 671–678, doi:10.1175/1520-0485(1976)006<0671:EOAOPS>2.0.CO;2.

Wu, Z., and N. E. Huang, 2009: Ensemble empirical mode decomposition: A noise-assisted data analysis method. *Adv. Adapt. Data Anal.*, **1**, 1–41, doi:10.1142/S1793536909000047.

Wu, Z., E. K. Schneider, B. P. Kirtman, E. S. Sarachik, N. E. Huang, and C. J. Tucker, 2008: The modulated annual cycle: An alternative reference frame for climate anomalies. *Climate Dyn.*, **31**, 823–841, doi:10.1007/s00382-008-0437-z.

Wu, Z., N. E. Huang, and X. Chen, 2009: The multi-dimensional ensemble empirical mode decomposition method. *Adv. Adapt. Data Anal.*, **1**, 339–372, doi:10.1142/S1793536909000187.

Wu, Z., N. E. Huang, J. M. Wallace, B. V. Smoliak, and X. Chen, 2011: On the time-varying trend in global-mean surface temperature. *Climate Dyn.*, **37**, 759–773, doi:10.1007/s00382-011-1128-8.

Zhu, J., B. Huang, and Z. Wu, 2012: The role of ocean dynamics in the interaction between the Atlantic meridional and equatorial modes. *J. Climate*, **25**, 3583–3598, doi:10.1175/JCLI-D-11-00364.1.

^{1} In information technology, “lossy” compression is a technique that compresses data by discarding some “unimportant” information (based on prescribed criteria or prior understanding). In this study, the discarded unimportant information is the spatially and temporally incoherent variability that is often called noise.