1. Introduction
Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, has been widely used in oceanography and meteorology since its introduction to these fields by Lorenz (1956). PCA is an objective technique for detecting and characterizing optimal lower-dimensional linear structure in a multivariate dataset, and it is one of the most important methods in the geostatistician’s multivariate statistics toolbox. Consequently, it has been well studied, and standard references exist describing the method and its implementation (Preisendorfer 1988; Wilks 1995). Its applications include reduction of data dimensionality for data interpretation (e.g., Barnston and Livezey 1987; Miller et al. 1997) and for forecasting (e.g., Barnston and Ropelewski 1992; Tangang et al. 1998). Furthermore, the connection between the results of PCA, which are statistical in nature, and the underlying dynamics of the system under consideration is understood in some detail (North 1984; Mo and Ghil 1987).
PCA can be regarded as a single example of a general class of feature extraction methods, which attempt to characterize lower-dimensional structure in large multivariate datasets. By construction, PCA finds the lower-dimensional hyperplane that optimally characterizes the data, such that the sum of squares of orthogonal deviations of the data points from the hyperplane is minimized. If the structure of the data is inherently linear (e.g., if the underlying distribution is Gaussian), then PCA is an optimal feature extraction algorithm; however, if the data contain nonlinear lower-dimensional structure, it will not be detectable by PCA.
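To make this variational picture concrete, the following minimal Python sketch (the function name and interface are ours, not from the paper) computes the best-fit P-dimensional hyperplane by a singular value decomposition of the centered data and reports the fraction of variance it explains; minimizing the sum of squared orthogonal deviations and maximizing explained variance are equivalent here.

```python
import numpy as np

def pca_approximation(X, P):
    """Rank-P PCA approximation of the (N, M) data matrix X: the hyperplane
    through the data mean that minimizes the sum of squared orthogonal
    deviations of the N points from the hyperplane."""
    Xm = X.mean(axis=0)                     # center the data
    A = X - Xm
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    E = Vt[:P].T                            # leading P EOFs, shape (M, P)
    a = A @ E                               # expansion coefficients, shape (N, P)
    X_hat = Xm + a @ E.T                    # orthogonal projection onto the hyperplane
    explained = 1.0 - np.sum((A - a @ E.T) ** 2) / np.sum(A ** 2)
    return X_hat, explained
```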
In the early 1990s, a neural-network-based generalization of PCA to the nonlinear feature extraction problem was introduced in the chemical engineering literature by Kramer (1991), who referred to the resulting technique as nonlinear principal component analysis (NLPCA). Another solution to this problem, coming from the statistics community, was put forward independently by Hastie and Stuetzle (1989), who named their method principal curves and surfaces (PCS). Recently, Malthouse (1998) demonstrated that NLPCA and PCS are closely related, and are, for a broad class of situations, essentially the same. Kramer’s NLPCA has been applied to problems in chemical engineering (Kramer 1991; Dong and McAvoy 1996), psychology (Fotheringhame and Baddeley 1997), and image compression (De Mers and Cottrell 1993), but apart from a single unpublished report by Sengupta and Boyle (1995), we have not yet seen it applied to the large multivariate datasets common in oceanic and atmospheric sciences.
This paper is the first in a series of investigations of the application of NLPCA to the analysis of climate data. In section 2, we introduce the general feature extraction problem and the PCA approach. We then introduce the NLPCA algorithm and show how it is in fact a natural nonlinear generalization of PCA. In section 3, we apply NLPCA to a synthetic dataset sampled from the Lorenz attractor (Lorenz 1963), with subsequent addition of noise to simulate measurement error, and show that NLPCA is better able to characterize the lower-dimensional nonlinear structure of the data than is PCA. Finally, conclusions and directions for future research are presented in section 4.
2. Feature extraction problems
The method of feature extraction most common in the atmospheric and oceanic sciences is PCA, which optimally extracts linear features from the data. However, if the underlying structure of the data is nonlinear, traditional PCA is suboptimal in characterizing this lower-dimensional structure. This motivates the definition of NLPCA. In this section, we provide a brief review of PCA and demonstrate how it generalizes naturally to a nonlinear method. We then discuss the NLPCA method in detail.
a. Principal component analysis
We note that the PCA approximation X̂(tn) to X(tn) is the composition of two functions:
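The expression this sentence introduces does not survive in the present excerpt. A plausible reconstruction, consistent with the projection map sf and expansion map f referred to elsewhere in the text, is

```latex
\hat{\mathbf{X}}(t_n) \;=\; \mathbf{f}\bigl(s_f(\mathbf{X}(t_n))\bigr),
\qquad s_f:\ \Re^{M} \to \Re^{P}, \qquad \mathbf{f}:\ \Re^{P} \to \Re^{M},
```

where, in the PCA case, both sf and f are constrained to be linear.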
b. Nonlinear principal component analysis
Another solution to the general feature extraction problem was introduced independently by Hastie and Stuetzle (1989). Their method, termed PCS, is based on a rather different set of ideas from those underlying NLPCA. In practice, however, because both minimize the sum of squared errors (10), the two methods boil down to iterative solutions of a nonlinear variational problem. In fact, Malthouse (1998) argued that NLPCA and PCS are quite similar for a broad class of feature extraction problems. A primary difference between NLPCA and PCS is that in the former the projection function sf is constrained to be continuous, while in the latter it may have a finite number of discontinuities. In this paper we investigate Kramer’s NLPCA because its implementation is straightforward; PCS, however, rests on a stronger theoretical underpinning, and we will use the connection between the two methods noted by Malthouse to make hypotheses about certain properties of NLPCA.
We now consider the manner in which this network solves the feature extraction problem for P = 1. The first three layers, considered alone, form a map from ℜM to ℜ, and the last three layers alone form a map from ℜ to ℜM. All five layers together are simply the composition of these two maps. Because the bottleneck layer contains only a single neuron, the network must compress the input down to a one-dimensional time series before it produces its final M-dimensional output. Once the network has been trained, the output of the bottleneck neuron provides this one-dimensional time series, and the output layer provides the corresponding approximation to X(tn) in ℜM.
The network illustrated in Fig. 1 will extract the optimal one-dimensional curve characterizing X(tn). To uncover higher-dimensional structure, the number of neurons in the bottleneck layer can be increased. For example, if two neurons are used, the network will determine the optimal two-dimensional characterization (by continuous functions) of X(tn). In general, a P-dimensional NLPCA approximation to X(tn) can be obtained by setting to P the number of neurons in the bottleneck layer.
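As an illustration only, the following Python sketch sets up the forward pass of such a five-layer autoassociative network with M inputs and outputs, L neurons in the encoding and decoding layers, and P bottleneck neurons. The choice of sigmoidal transfer functions in the encoding and decoding layers with linear bottleneck and output layers is an assumption in the spirit of Kramer (1991); the excerpt does not fix these details, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def autoassociative_net(M, L, P):
    """Parameters for an M -> L -> P -> L -> M network: encoding layer,
    bottleneck, decoding layer, output layer."""
    return [init_layer(M, L), init_layer(L, P), init_layer(P, L), init_layer(L, M)]

def forward(params, X):
    """Return the bottleneck values u (the nonlinear principal components)
    and the network output X_hat for data X of shape (N, M)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
    h1 = sigmoid(X @ W1 + b1)     # encoding layer
    u = h1 @ W2 + b2              # bottleneck: s_f(X), shape (N, P)
    h2 = sigmoid(u @ W3 + b3)     # decoding layer
    X_hat = h2 @ W4 + b4          # output: f(s_f(X)), shape (N, M)
    return u, X_hat
```

Training then consists of adjusting the weights so as to minimize the sum of squared differences between the network output and the data.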
c. Implementation
Developing methods for the avoidance of overfitting is a major area of research in the field of neural networks (Finnoff et al. 1993). An important subclass of these is made up of early stopping techniques, in which network training is stopped before the error function is minimized, according to some well-defined stopping criterion. The essential idea behind early stopping is that the network is trained for a number of iterations sufficient to model any true underlying structure, but insufficient to fit the noise exactly. While the benefit of stopping network training before the error function is minimized may seem counterintuitive at first, experiments have demonstrated that early stopping algorithms improve the resulting neural network’s ability to generalize (Finnoff et al. 1993). The strategy we used was to set aside a fraction α (typically about 30%) of the data points, selected randomly, into a validation set not used to train the network. As training proceeded, the network error on the validation set was monitored, and training was stopped when this validation error started to increase, or after 500 iterations, whichever occurred first. The validation set thus played no role in training; it served only to monitor the network’s ability to generalize, and the degradation of this ability was the stopping criterion employed.
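A minimal sketch of the early-stopping strategy just described, assuming a caller-supplied optimizer step and mean-square error function (train_step and mse are placeholders, not routines from the paper); the 30% validation fraction and the 500-iteration cap are the values quoted above.

```python
import numpy as np

def train_with_early_stopping(params, X, train_step, mse,
                              alpha=0.3, max_iter=500, seed=1):
    """Hold out a fraction alpha of the points as a validation set, train on
    the remainder, and stop when the validation error starts to increase or
    after max_iter iterations, whichever occurs first."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_val = int(alpha * X.shape[0])
    X_val, X_train = X[idx[:n_val]], X[idx[n_val:]]

    best_params, best_val = params, np.inf
    for _ in range(max_iter):
        params = train_step(params, X_train)   # one optimization iteration
        val_err = mse(params, X_val)           # monitors generalization only
        if val_err >= best_val:
            break                              # validation error rising: stop
        best_params, best_val = params, val_err
    return best_params
```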
The number of neurons L in the encoding and decoding layers determines the class of functions that sf and f can represent. As L increases, there is more flexibility in the forms of sf and f, but the model also has more parameters, implying both that the error surface becomes more complicated and that the parameters are less constrained by the data. In the end, the number of neurons L in the encoding and decoding layers was set to the largest value such that the resulting NLPCA approximations conformed to the smoothness and robustness criteria discussed in the previous paragraph.
Clearly, a P-dimensional NLPCA model will contain more parameters than a P-dimensional PCA model. In fact, this is what allows NLPCA to solve more general feature extraction problems than PCA. Generally, the number of parameters in a statistical model is minimized to reduce overfitting of the data. However, this issue is addressed with the early stopping technique described above. The use of such a technique allows the number of parameters in the statistical model to be increased without substantially increasing the risk of the model overfitting the data.
Thus, the NLPCA approximation to a dataset X(tn) is defined operationally as the output of the five-layer autoassociative neural network that minimizes the error (12) subject to the constraint that the network generalizes, and that the output must be smooth, reproducible, and reflect symmetries manifest in X(tn). This operational definition will undoubtedly be refined as experience is gained in the implementation of NLPCA.
3. The Lorenz attractor
The Lorenz system was chosen as a test case for several reasons: in particular, by adding random noise to the signal x(t), the sensitivity of the method to noise level can be tested, and the structure of the Lorenz attractor is well known and of sufficiently low dimension that visualization of results is straightforward. A sketch of how such a synthetic dataset can be generated follows.
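Purely as an illustration of how such a dataset might be constructed (the Lorenz parameter values, the integration and sampling scheme, and the additive noise model x = s + ηε are our assumptions; the paper's Eq. (25) and sampling details are not reproduced in this excerpt), a Python sketch is:

```python
import numpy as np

def lorenz_dataset(n_points, eta=0.0, dt=0.01, stride=10,
                   sigma=10.0, r=28.0, b=8.0 / 3.0, seed=2):
    """Integrate the Lorenz (1963) equations with a fourth-order Runge-Kutta
    step, subsample the trajectory, and add isotropic Gaussian noise of
    strength eta to each component."""
    rng = np.random.default_rng(seed)

    def rhs(s):
        x1, x2, x3 = s
        return np.array([sigma * (x2 - x1),
                         r * x1 - x2 - x1 * x3,
                         x1 * x2 - b * x3])

    def rk4_step(s):
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * dt * k1)
        k3 = rhs(s + 0.5 * dt * k2)
        k4 = rhs(s + dt * k3)
        return s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

    s = np.array([1.0, 1.0, 1.0])
    for _ in range(1000):                      # discard an initial transient
        s = rk4_step(s)

    samples = []
    for _ in range(n_points):
        for _ in range(stride):                # thin the trajectory
            s = rk4_step(s)
        samples.append(s.copy())

    signal = np.array(samples)
    return signal + eta * rng.standard_normal(signal.shape)
```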
The 1D PCA approximation to x(tn) when η = 0 is displayed in Fig. 4; it is a straight line passing through the centers of the two lobes of the attractor and explains 60% of the variance of x(tn). Figure 5 displays the 1D NLPCA approximation to x(tn). As anticipated, it is a U-shaped curve passing through the middle of the data. This curve explains 76% of the variance, an improvement of 16 percentage points over the PCA result. Clearly, the 1D NLPCA approximation is substantially closer to x(tn) than is the 1D PCA approximation. The network used to perform the NLPCA had three input and output neurons for x1, x2, and x3; one bottleneck neuron; and L = 3 neurons in the encoding and decoding layers. Experimentation indicated that the NLPCA results improved in going from L = 2 to L = 3 (i.e., the latter had a smaller FUV than the former), but that for L > 3, the results did not improve further. Turning now to the issue of robustness of results, the NMSD between six different 1D NLPCA curves (not shown) varies between 0.5% and 2%. These curves differ only in small details and agree in their essential structure with the curve shown in Fig. 5. Thus, the 1D NLPCA approximation to x(tn) displayed in Fig. 5 is a robust result that improves substantially over the 1D PCA approximation.
Figure 6 displays the 2D PCA approximation of the data x(tn) when η = 0; this surface explains 95% of the variance. The 2D PCA approximation is a flat sheet that characterizes well the structure of the data as projected in the x1x3 and x2x3 planes but fails to reproduce the structure seen in the projection on the x1x2 plane. In Fig. 7, the result of a 2D NLPCA of x(tn) is shown. This surface explains 99.5% of the variance, implying an order of magnitude reduction in FUV as compared to the PCA result. The network used to perform the NLPCA had two neurons in the bottleneck layer and L = 6 neurons in the encoding and decoding layers. It was found that decreasing L below 6 also decreased the fraction of variance explained, and increasing it above L = 6 had little effect upon the results. The 2D NLPCA result is highly robust: a sample of four NLPCA models (not shown) had NMSD between surfaces of at most 0.1%. As with the 1D example considered above, the NLPCA approximation is a substantially better approximation to the original dataset than is the PCA approximation.
We consider now a dataset x(tn) obtained from (25) with η = 2.0. The 1D PCA approximation to x(tn) (not shown) explains 59% of the variance; the 1D NLPCA approximation (Fig. 8) explains 74%. The curve in Fig. 8 is very similar to that shown in Fig. 5 for the η = 0 case; the two-lobed structure of the data is still manifest at a noise level of η = 2.0, and the NLPCA is able to recover it. Addressing again the issue of robustness of results, six different NLPCA approximations to x(tn) were found to have NMSD varying from 0.5% to 3%. These six curves agree in their essential details, although the set displays more variability between members than did the corresponding set for η = 0. Figure 9 shows the results of a 2D NLPCA performed on this dataset. This surface explains 97.4% of the variance, in contrast to the 2D PCA approximation, which explains 94.2%. Thus, the FUV of the 2D NLPCA approximation is about half that of the 2D PCA approximation. These results too are robust; the NMSD between different 2D NLPCA models was about 0.2%. The 2D NLPCA approximation is again an improvement over the 2D PCA approximation, but not by as substantial a margin as when η = 0. This is because the noise-free Lorenz attractor is very nearly two-dimensional, so the 2D NLPCA was able to account for almost all of its variance; the addition of noise smeared out this fine fractal structure and made the data cloud more fully three-dimensional, so the 2D NLPCA applied to this quasi-three-dimensional structure could not approximate the data as closely as in the η = 0 case.
At a noise strength of η = 5.0, the dataset x(tn) still has a discernible two-lobed structure, but it is substantially obscured. The 1D PCA approximation (not shown) explains 54% of the variance, whereas the 1D NLPCA approximation (shown in Fig. 10) explains 65%. Again, the 1D NLPCA approximation to x(tn) is qualitatively similar to that obtained in the noise-free case (Fig. 5), especially in the projection onto the x1x3 plane. The η = 0 and η = 5.0 1D NLPCA approximations differ at the ends of the curves. Presumably, the structure represented in the former is somewhat washed out by noise in the latter. Four different NLPCA curves for the data obtained with η = 5.0 share their gross features but differ fairly substantially in detail. In this case, the NMSD between curves varies between 5% and 10%. The 2D NLPCA approximation to this data (not shown) explains 90% of the variance, only slightly more than the 2D PCA approximation, which explains 88% of the variance.
Finally, at a noise level of η = 10.0 the two-lobed structure of x(tn) is no longer obvious, and the data cloud appears as a fairly homogeneous, vaguely ellipsoidal blob. The results of NLPCA at this noise level are no longer robust, tending to be asymmetric, convoluted curves. At this noise level, then, NLPCA may no longer be a useful technique for characterizing low-dimensional nonlinear structure in the dataset, precisely because the addition of noise has destroyed this structure. This illustrates the point mentioned in section 2c that NLPCA is a useful, robust tool for the reduction of data dimensionality when nonlinearities in the dataset are manifest, but it is perhaps untrustworthy when the data cloud possesses no such obvious lower-dimensional structure.
4. Discussion and conclusions
In this paper, we have considered a nonlinear generalization of traditional principal component analysis (PCA), known as nonlinear principal component analysis (NLPCA). The generalization was carried out by casting PCA in a variational formulation as the optimal solution to the feature extraction problem posed in section 2 for which the functions sf and f are constrained to be linear. When PCA is regarded in this way, natural generalizations of the method to the nonlinear case are evident. One possible approach is to allow the map sf to be nonlinear while f remains linear. Such a method was considered by Oja and Karhunen (1993) and by Oja (1997), who showed that such a generalization could be carried out by a two-layer recursive neural network. Another approach is to allow both functions sf and f to be nonlinear, as was suggested by Kramer (1991), who also provided an implementation using a five-layer neural network. Kramer denoted his method nonlinear PCA, as did Oja and Karhunen. It is important to note that these two methods are in fact quite different, as the expansion map used by Oja and Karhunen is still a linear function.
Another approach to the nonlinear feature extraction problem was considered by Hastie and Stuetzle (1989), who introduced principal curves and surfaces (PCS) to find lower-dimensional structures that pass optimally through the middle of data. Like Kramer’s NLPCA, PCS models are obtained as iterative solutions to nonlinear variational problems. Kramer’s NLPCA has the advantage that it is implemented using feed-forward neural networks, for which computer packages are widely available. The advantage of PCS is that it rests on a stronger theoretical base than does NLPCA. However, as was argued by Malthouse (1998), NLPCA and PCS are in fact essentially the same for a broad class of feature extraction problems. We can thus use theoretical results proven rigorously for PCS and apply them by analogy to NLPCA. In particular, PCS can be proven to partition variance as described in Eq. (13), and we find experimentally that NLPCA does as well. It is important to remember that the two methods are not exactly identical and to be on guard for those cases in which the two methods diverge. Malthouse (1998) pointed out that because Kramer’s NLPCA is unable to model discontinuous projection and expansion functions, PCS and NLPCA will differ in those cases for which such maps are necessary. We hypothesize that the use of a seven-layer neural network to perform NLPCA, such that the projection and expansion networks can well approximate discontinuous maps, will yield an NLPCA that is in closer correspondence to PCS. This is a current subject of research and will be discussed in a future publication.
As a first experiment with NLPCA, we analyzed a 584-point dataset sampled from the Lorenz attractor (Lorenz 1963). We found that the 1D and 2D NLPCA approximations to this dataset explain 74% and 99.5% of the variance, respectively, substantially improving over the 1D and 2D PCA approximations, which explain respectively 60% and 95% of the variance. Adding noise to the Lorenz data, it was found that as long as the noise was sufficiently weak that the two-lobed structure of the Lorenz attractor was not obscured, both the 1D and 2D NLPCA approximations to the data were superior to the corresponding PCA approximations. However, when the noise was of sufficient strength that the nonlinear structure underlying the dataset was no longer manifest, the results of the NLPCA were not robust, and NLPCA ceased to be a useful method for characterizing the low-dimensional structure of the data.
In contrast to the PCA approximation, which is a sum of fixed spatial patterns modulated by time series and is thus both separable and additive, the general NLPCA model (1) is neither separable nor additive. How, then, do we visualize the results? For each time tn, the NLPCA provides (i) P values from the bottleneck layer and (ii) a point in ℜM corresponding to a spatial map. One way of turning these two products into a picture is sketched below.
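Assuming the illustrative network sketched in section 2b (the helper names are hypothetical, and forward refers to that earlier sketch), one rendering for P = 1 is to sweep the bottleneck value over the range it takes on the data and decode each value back into ℜM, then view the resulting sequence of maps as frames of a movie; this is one version of the cinematographic interpretation referred to later in this section.

```python
import numpy as np

def decode(params, u):
    """Map bottleneck values u, shape (K, P), through the decoding half of
    the illustrative network defined earlier, returning K maps in R^M."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    _, _, (W3, b3), (W4, b4) = params
    return sigmoid(u @ W3 + b3) @ W4 + b4

def nlpca_frames(params, X, n_frames=50):
    """For P = 1: sweep the nonlinear principal component over its observed
    range and return the corresponding sequence of spatial maps."""
    u, _ = forward(params, X)                  # bottleneck time series, (N, 1)
    sweep = np.linspace(u.min(), u.max(), n_frames).reshape(-1, 1)
    return decode(params, sweep)               # (n_frames, M) "movie frames"
```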
PCA has also been widely used as a data dimensionality reduction tool in statistical forecasting. When building a statistical forecast model employing canonical correlation analysis (Barnston and Ropelewski 1992) or a neural network (Tangang et al. 1998) over a large number of grid sites, the time series of which are composed of signal plus noise, it is better to carry out the forecasts in a subspace of smaller dimension within which a substantial fraction of the variance is concentrated. In such cases, the forecast models predict the expansion coefficients ak(tn) of several of the leading PCA modes of the data (the precise number of modes used is, of course, problem dependent). The forecast field is then obtained through the expansion map. It is typical in these models that only the time series of the first few modes can be forecast with appreciable skill. In general, a P-dimensional NLPCA approximation to a dataset explains a higher fraction of variance than does the P-dimensional PCA approximation. If the NLPCA time series can be forecast as well as the PCA time series, then the reconstructed field obtained through the NLPCA expansion map will be a better forecast than the corresponding PCA reconstructed field. Thus, the use of NLPCA rather than PCA as a data dimensionality reduction tool may improve the field forecast skill of statistical forecast models. This is being investigated at present.
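A schematic of the reduced-space forecasting procedure just described, with the forecast model left as a caller-supplied function (forecast_pc is a placeholder for, e.g., the CCA or neural network models cited in the text); replacing the PCA projection and expansion maps with their NLPCA counterparts is the modification being proposed.

```python
import numpy as np

def forecast_field_pca(X, P, forecast_pc):
    """Project the (N, M) field X onto its leading P PCA modes, forecast each
    expansion-coefficient time series with forecast_pc (a function taking a
    1D series and returning its predicted next value), and reconstruct the
    forecast field through the linear expansion map."""
    Xm = X.mean(axis=0)
    A = X - Xm
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    E = Vt[:P].T                                  # leading EOFs, shape (M, P)
    a = A @ E                                     # PC time series, shape (N, P)
    a_next = np.array([forecast_pc(a[:, k]) for k in range(P)])
    return Xm + E @ a_next                        # forecast field in R^M
```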
While NLPCA and PCS have been successfully applied to problems in a number of disciplines (Hastie and Stuetzle 1989; Kramer 1991; Banfield and Raftery 1992; De Mers and Cottrell 1993; Dong and McAvoy 1996; Fotheringhame and Baddeley 1997), the only application of NLPCA to climate data that we have been able to find is an unpublished report by Sengupta and Boyle (1995), who analyzed monthly averaged precipitation fields over the contiguous United States for the years 1979 to 1988. Their results failed to demonstrate that NLPCA improved substantially over PCA. In particular, they used a visualization technique, based on correlating the data with the output of their single bottleneck neuron, that suboptimally characterized the information yielded by the NLPC analysis. As discussed above, the results of an NLPC analysis of gridded data have a natural cinematographic interpretation not captured by such a simple correlation. The results presented above for the Lorenz data, and the preliminary results of an NLPC analysis of tropical Pacific sea surface temperature (to be presented in a future publication), indicate that NLPCA is a more powerful tool for the reduction of data dimensionality than PCA.
Acknowledgments
The author would like to acknowledge William Hsieh, Benyang Tang, and Lionel Pandolfo for helpful advice during the course of this work, and an anonymous referee whose comments brought to the author’s attention one or two points that were unclear in the original draft. The author would like to acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada via grants to W. Hsieh, and the Peter Wall Institute for Advanced Studies.
REFERENCES
Banfield, J. D., and A. E. Raftery, 1992: Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. J. Amer. Stat. Assoc., 87, 7–16.
Barnston, A. G., and R. E. Livezey, 1987: Classification, seasonality, and persistence of low-frequency atmospheric circulation patterns. Mon. Wea. Rev., 115, 1083–1126.
——, and C. F. Ropelewski, 1992: Prediction of ENSO episodes using canonical correlation analysis. J. Climate, 5, 1316–1345.
Berliner, L. M., 1992: Statistics, probability, and chaos. Stat. Sci., 7, 69–122.
Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Clarendon Press, 482 pp.
Cybenko, G., 1989: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2, 303–314.
De Mers, D., and G. Cottrell, 1993: Nonlinear dimensionality reduction. Neural Inform. Proc. Syst., 5, 580–587.
Dong, D., and T. J. McAvoy, 1996: Nonlinear principal component analysis—Based on principal curves and neural networks. Comp. Chem. Eng., 20, 65–78.
Finnoff, W., F. Hergert, and H. G. Zimmermann, 1993: Improving model selection by nonconvergent methods. Neural Networks, 6, 771–783.
Fotheringhame, D., and R. Baddeley, 1997: Nonlinear principal components analysis of neuronal spike train data. Biol. Cybernetics, 77, 282–288.
Hastie, T., and W. Stuetzle, 1989: Principal curves. J. Amer. Stat. Assoc., 84, 502–516.
——, and R. J. Tibshirani, 1990: Generalized Additive Models. Chapman and Hall, 335 pp.
Hsieh, W. W., and B. Tang, 1998: Applying neural network models to prediction and data analysis in meteorology and oceanography. Bull. Amer. Meteor. Soc., 79, 1855–1870.
Kramer, M. A., 1991: Nonlinear principal component analysis using autoassociative neural networks. AIChE J., 37, 233–243.
Le Blanc, M., and R. Tibshirani, 1994: Adaptive principal surfaces. J. Amer. Stat. Assoc., 89, 53–64.
Lorenz, E. N., 1956: Empirical orthogonal functions and statistical weather prediction. MIT Department of Meteorology, Statistical Forecast Project Rep. 1, 49 pp. [Available from Dept. of Meteorology, MIT, Massachusetts Ave., Cambridge, MA 02139.]
——, 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 131–141.
Malthouse, E. C., 1998: Limitations of nonlinear PCA as performed with generic neural networks. IEEE Trans. Neural Networks, 9, 165–173.
Miller, A. J., W. B. White, and D. R. Cayan, 1997: North Pacific thermocline variations on ENSO timescales. J. Phys. Oceanogr., 27, 2023–2039.
Mo, K. C., and M. Ghil, 1987: Statistics and dynamics of persistent anomalies. J. Atmos. Sci., 44, 877–901.
North, G. R., 1984: Empirical orthogonal functions and normal modes. J. Atmos. Sci., 41, 879–887.
Oja, E., 1997: The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17, 25–45.
——, and J. Karhunen, 1993: Nonlinear PCA: Algorithms and applications. Helsinki University of Technology Tech. Rep. A18, 25 pp. [Available from Laboratory of Computer and Information Sciences, Helsinki University of Technology, Rakentajanaukio 2C, SF-02150, Espoo, Finland.]
Preisendorfer, R. W., 1988: Principal Component Analysis in Meteorology and Oceanography. Elsevier, 425 pp.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes in C. Cambridge University Press, 994 pp.
Sanger, T., 1989: Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Sengupta, S. K., and J. S. Boyle, 1995: Nonlinear principal component analysis of climate data. PCMDI Tech. Rep. 29, 21 pp. [Available from Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory, University of California, Livermore, CA 94550.]
Tangang, F. T., B. Tang, A. H. Monahan, and W. W. Hsieh, 1998: Forecasting ENSO events: A neural network—Extended EOF approach. J. Climate, 11, 29–41.
von Storch, H., and F. W. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge University Press, 494 pp.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.
Fig. 1. The five-layer feed-forward autoassociative neural network used to perform NLPCA.
Fig. 2. The Lorenz attractor, projected on the x1x3, x3x2, and x2x1 planes.
Fig. 3. As in Fig. 2, for the subsample of 586 points.
Fig. 4. Noise-free Lorenz data and its 1D PCA approximation, projected as in Fig. 2 (note that axes have been rescaled). The dots represent the original data points; the open circles represent points on the approximation.
Fig. 5. As in Fig. 4, for the 1D NLPCA approximation.
Fig. 6. As in Fig. 4, for the 2D PCA approximation.
Fig. 7. As in Fig. 4, for the 2D NLPCA approximation.
Fig. 8. As in Fig. 5, for Lorenz data with noise level η = 2.0 [see Eq. (25)].
Fig. 9. As in Fig. 7, for Lorenz data with noise level η = 2.0.
Fig. 10. As in Fig. 5, for Lorenz data with noise level η = 5.0.