## 1. Introduction

Linear multivariate statistical analysis methods are often used in atmospheric sciences to extract leading patterns in high-dimensional datasets. A popular method is principal component analysis (PCA), which by a rotation in phase space finds the directions that maximize the variance of the data. For highly nonlinear data the limitations of such linear methods are obvious, and recently nonlinear extensions to the linear multivariate methods have become popular. In particular, an extension to PCA known as nonlinear principal component analysis (NLPCA) was introduced to the atmospheric sciences through a study of a low-dimensional chaotic system (Monahan 2000). The method has subsequently been applied to the El Niño–Southern Oscillation (Hsieh 2001), the quasi-biennial oscillation (Hamilton and Hsieh 2002), and the Northern Hemisphere extratropical circulation (Monahan et al. 2000, 2001, 2003; Teng et al. 2004).

The existence of regimes in the extratropical low-frequency variability may have important consequences for our understanding of climate and climate change (Palmer 1999). However, the existence of such regimes is still debated (Corti et al. 1999; Christiansen 2002; Stephenson et al. 2004). Regimes are often inferred from multimodality in a probability density estimate. As the probability density can be reliably estimated only in one and two dimensions, such studies are often based on heavily truncated data where only the directions spanned by the leading principal components are retained. Monahan et al. (2000, 2001, 2003) extended this method and studied multimodality in the leading component of NLPCA.

The purpose of this article is to show that NLPCA often leads to spurious bimodality and that the use of NLPCA to detect atmospheric regime behavior is therefore error prone. We will see that NLPCA produces spurious bimodality if the input data are sufficiently isotropic and that the spurious bimodality is a robust feature of the NLPCA.

In section 2 we give a brief introduction to NLPCA. In section 3 we discuss some theoretical limitations of the NLPCA, and in section 4 we present numerical experiments showing that NLPCA very often reports strong bimodality for unimodal distributions. The numerical experiments include both idealized data and observations of the Northern Hemisphere stratospheric circulation.

## 2. Nonlinear principal component analysis

Here we briefly describe the method of nonlinear principal component analysis. More details can be found in the recent review by Hsieh (2004).

Nonlinear principal component analysis was proposed by Kramer (1991). Given is the dataset $\mathbf{x}_j$, $j = 1, 2, \ldots, n$, with $\mathbf{x} = (x_1, x_2, \ldots, x_l)$. The dataset can be seen as $n$ (often temporal) samples in an $l$-dimensional phase space. As for linear PCA, one searches for a function $f$ from $\mathbb{R}^l$ to $\mathbb{R}^l$, where $\mathbb{R}$ is the set of real numbers, such that the mean square error $\sum_j \|\mathbf{x}_j - f(\mathbf{x}_j)\|^2$ is minimized. In linear PCA, $f$ is restricted to be linear. In NLPCA, this restriction is lifted and the challenge lies in obtaining a balance between the smoothness of $f$ and the size of the error. Kramer (1991) proposed that $f$ is defined by a feed-forward neural network with three hidden layers. The layout of the neural network is shown in Fig. 1. The input and output layers contain $l$ neurons, the second and fourth layers contain $m$ neurons, and the central “bottleneck” layer contains a single neuron. Letting $N_j$ be the number of neurons in the $j$th layer and $\mathbf{u}^j = (u^j_1, u^j_2, \ldots, u^j_{N_j})$ be the state of that layer, then $\mathbf{u}^j = s_j(\mathbf{w}^j \mathbf{u}^{j-1} + \mathbf{b}^j)$. The weight, $\mathbf{w}^j$, is an $N_j \times N_{j-1}$ matrix and the bias, $\mathbf{b}^j$, is a vector of length $N_j$. For the transfer functions we choose the hyperbolic tangent for $s_1$ and $s_3$ and the identity function for $s_2$ and $s_4$. The dataset $\mathbf{x}_j$ is fed into the input layer and the weights and biases are adjusted by numerical methods to minimize the error.

The neural network is then a composition of two nonlinear maps $f = f_2 \circ f_1$, $f_1$ from $\mathbb{R}^l$ to $\mathbb{R}$, and $f_2$ from $\mathbb{R}$ to $\mathbb{R}^l$. The state of the bottleneck neuron, $u_3$, is called the score or the nonlinear principal component and often denoted by $\lambda$. The map $f_2$ defines a one-dimensional curve in $\mathbb{R}^l$ called the NLPCA mode. The projection of the original data on the NLPCA mode is known as the NLPCA approximation and is described by the map $f_1$.
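The network described above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the function names (`init_params`, `f1`, `f2`, `reconstruction_error`), the random initialization scale, and the toy input are hypothetical and are not taken from Kramer (1991) or from the implementation used in this article.

```python
import numpy as np

def init_params(l, m, rng):
    """Random weights and zero biases for an l-m-1-m-l network (assumed init)."""
    sizes = [l, m, 1, m, l]
    return [(rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def f1(params, x):
    """Map samples x of shape (n, l) to the bottleneck state lambda of shape (n,)."""
    (w1, b1), (w2, b2) = params[0], params[1]
    hidden = np.tanh(x @ w1.T + b1)      # hyperbolic tangent transfer function
    return (hidden @ w2.T + b2)[:, 0]    # identity transfer function

def f2(params, lam):
    """Map lambda of shape (n,) to points on the NLPCA mode, shape (n, l)."""
    (w3, b3), (w4, b4) = params[2], params[3]
    hidden = np.tanh(lam[:, None] @ w3.T + b3)  # hyperbolic tangent
    return hidden @ w4.T + b4                   # identity

def reconstruction_error(params, x):
    """Sum over samples of ||x_j - f(x_j)||^2 with f = f2 o f1."""
    return np.sum((x - f2(params, f1(params, x))) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))            # toy input, not atmospheric data
params = init_params(l=2, m=2, rng=rng)
lam = f1(params, x)                      # nonlinear principal component
x_hat = f2(params, lam)                  # reconstruction from lambda alone
```

In an actual fit, the weights and biases in `params` would be adjusted to minimize `reconstruction_error`; here they are left at their random initial values.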

## 3. Theoretical limitations and considerations

The way chosen to parameterize the functions $f_1$ and $f_2$, that is, the architecture of the neural network, has important consequences for the characteristics of the nonlinear principal component, $\lambda$, and for the characteristics of the possible NLPCA modes. Here we first describe how the architecture of the first layers favors multimodality in the distribution of $\lambda$ even for normally distributed inputs. We then describe some restrictions on the possible curves in $\mathbb{R}^l$ when there are two neurons in the fourth layer. This choice of $m$ was often made in recent literature. We conclude the section with a very general consideration on the ambiguity of the nonlinear principal component $\lambda$ (Malthouse 1998).

The state of the bottleneck layer, the nonlinear principal component $\lambda$, is a linear combination of hyperbolic tangents $\sum_{i=1}^{m} w^2_i \tanh \upsilon_i + b^2$ (the superscript 2 is the layer index), where the $\upsilon_i$s are linear combinations of the state of the input neurons. If the inputs are normally distributed the $\upsilon_i$s will also be normally distributed. However, the distributions of the terms $\tanh \upsilon_i$ can be bimodal due to the nonlinearity of the hyperbolic tangent. The distributions are only unimodal if the $\upsilon_i$s are small so that they fall on the linear part of the hyperbolic tangent. In fact, if $\upsilon$ is normally distributed with zero mean and variance $\sigma^2$, then the distribution of $\tanh \upsilon$ is proportional to $\exp(-\upsilon^2/2\sigma^2)/(1 - \tanh^2 \upsilon)$, which is bimodal for $\sigma > 1/\sqrt{2}$. For $m = 2$, up to four peaks can be present in the distribution of $\lambda$ if the inputs are normally distributed.
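The bimodality of $\tanh \upsilon$ can be checked numerically. The sketch below evaluates the density of $y = \tanh \upsilon$ for zero-mean normal $\upsilon$ and counts its local maxima on a grid. Expanding the logarithm of the density around $\upsilon = 0$ gives a curvature of $2 - 1/\sigma^2$, so the center turns from a maximum into a minimum at $\sigma = 1/\sqrt{2}$; the two test values of $\sigma$ are chosen on either side of that threshold. The function names and grid settings are illustrative assumptions.

```python
import numpy as np

def tanh_normal_density(y, sigma):
    """Density of y = tanh(v) when v ~ N(0, sigma^2), for -1 < y < 1."""
    v = np.arctanh(y)
    gauss = np.exp(-v**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    return gauss / (1.0 - y**2)  # Jacobian dv/dy = 1 / (1 - tanh^2 v)

def count_modes(sigma, n_grid=2001):
    """Count strict interior local maxima of the density on a regular grid."""
    y = np.linspace(-0.999, 0.999, n_grid)
    p = tanh_normal_density(y, sigma)
    is_peak = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])
    return int(np.sum(is_peak))

modes_narrow = count_modes(sigma=0.5)  # sigma below 1/sqrt(2)
modes_wide = count_modes(sigma=1.0)    # sigma above 1/sqrt(2)
```

With a single peak for the narrow input and two peaks for the wide input, a sum of two such terms can show up to four peaks, which is the $m = 2$ case mentioned above.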

The class of NLPCA modes that can be described by the neural network is determined by the number, $m$, of hidden neurons in the fourth layer. When $m = 1$, only straight lines can be described, while any continuous curve can be described for $m \to \infty$ (Malthouse 1998). For $m = 2$ the curves are described by $x_i = \tanh t$ and $x_j = d_j \tanh t + \tanh(at + b)$, in every two-dimensional projection $(x_i, x_j)$ of $\mathbb{R}^l$. This holds up to a translation, scaling, and rotation because the map from the fourth layer to the output layer is a linear affinity. Here $t$ is linearly related to $\lambda$. For every two-dimensional projection these curves will have either zero, one, or two turning points in each direction, that is, points where $\partial x_i/\partial \lambda$ is zero. It can also be seen that the curves will have the same asymptotic slope for $t \to \infty$ as for $t \to -\infty$. Figure 2 shows some examples of possible curves. Note that the most complex curve possible for $m = 2$ is the Z-shaped curve in Fig. 2d with two parallel outer branches.
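The restriction to zero, one, or two turning points can be illustrated by evaluating the $m = 2$ curve family directly. The sketch below counts sign changes of $dx_j/dt$ for three parameter choices; the specific values of `d`, `a`, and `b` are hypothetical examples picked for illustration and are not the curves of Fig. 2.

```python
import numpy as np

def mode_curve(t, d, a, b):
    """Two-dimensional projection (x_i, x_j) of an m = 2 NLPCA mode."""
    x_i = np.tanh(t)
    x_j = d * np.tanh(t) + np.tanh(a * t + b)
    return x_i, x_j

def count_turning_points(t, d, a, b):
    """Number of sign changes of the analytic derivative dx_j/dt on the grid t."""
    dxj_dt = d / np.cosh(t) ** 2 + a / np.cosh(a * t + b) ** 2
    signs = np.sign(dxj_dt)
    return int(np.sum(signs[1:] != signs[:-1]))

t = np.linspace(-6.0, 6.0, 4001)
n_monotone = count_turning_points(t, d=1.0, a=1.0, b=0.0)   # no fold
n_single = count_turning_points(t, d=-1.0, a=1.0, b=2.0)    # one fold
n_folded = count_turning_points(t, d=-1.0, a=3.0, b=0.0)    # two folds

x_i, x_j = mode_curve(t, d=-1.0, a=3.0, b=0.0)  # coordinates of the two-fold curve
```

Because $x_i = \tanh t$ is monotonic, the folds appear only in $x_j$ in this projection; a rotation of the plane distributes them over both directions.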

A general consideration of direct importance for the subject of this paper was pointed out by Malthouse (1998). The neural network implementation of the NLPCA determines a parameterization $\lambda$, $\lambda = f_1(\mathbf{x}_{\mathrm{in}})$, $\mathbf{x}_{\mathrm{out}} = f_2(\lambda)$, which minimizes the error. However, any other parameterization $s = g(\lambda)$, where $g$ is invertible, will have the same value of the error, although the maps $s = g \circ f_1(\mathbf{x}_{\mathrm{in}})$ and $\mathbf{x}_{\mathrm{out}} = f_2 \circ g^{-1}(s)$ may not be realizable for the neural network. The distribution of $s$ will not necessarily have the same characteristics as the distribution of $\lambda$. As an example, choosing $g(\lambda) = \int_{-\infty}^{\lambda} P(x)\,dx$, where $P$ is the probability density of $\lambda$, will make $s$ uniformly distributed even if $P$ is bimodal. Therefore, only the order of $\lambda$ is interpretable but not its magnitude. Often $\lambda$ is not used directly but a transformation to the arc length is performed. This choice of parameterization may improve the NLPCA in some applications (Newbigging et al. 2003) but does not remove the basic ambiguity of the NLPCA.
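The reparameterization argument can be made concrete with an empirical stand-in for the cumulative distribution: replacing each value of $\lambda$ by its rescaled rank. The sketch below uses a synthetic two-cluster sample as a hypothetical bimodal $\lambda$ (it is not output from an NLPCA fit) and shows that the rank transform keeps the ordering while flattening the histogram.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical bimodal "nonlinear principal component": two Gaussian clusters.
lam = np.concatenate([rng.normal(-2.0, 0.5, n // 2),
                      rng.normal(2.0, 0.5, n // 2)])

# Empirical version of g(lambda): the rank of each value, rescaled to (0, 1).
ranks = np.argsort(np.argsort(lam))
s = (ranks + 0.5) / n

counts_lam, _ = np.histogram(lam, bins=10)                  # two clusters
counts_s, _ = np.histogram(s, bins=10, range=(0.0, 1.0))    # flat by construction
```

The two histograms describe the same ordering of the samples, which is why the number of peaks in the distribution of $\lambda$ is not by itself an invariant property of the fitted curve.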

## 4. Numerical examples

In this section we show that NLPCA regularly produces strongly bimodal parameterizations even if the underlying distribution is unimodal. We focus on two examples. The first is a two-dimensional Gaussian distribution and the second is an analysis of the low-frequency variability of the Northern Hemisphere stratospheric extratropical geopotential height. The latter example is motivated by the NLPCA study of Monahan et al. (2003), who reported three circulation regimes in the stratospheric circulation.

Minimization is carried out by the Broyden–Fletcher–Goldfarb–Shanno algorithm. To avoid the problem of getting stuck in a local minimum, the minimization is repeated 1000 times starting from different initial conditions and the best fit chosen. The resulting nonlinear principal component is transformed into the arc length with values between 0 and 1. However, this transformation is not important for our conclusions as the bimodality reported with the arc length parameterization is also present with the original nonlinear principal component.
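The restart strategy can be sketched with SciPy's BFGS implementation. To keep the example short, a one-dimensional toy cost with two minima stands in for the NLPCA error; the cost function, the helper name `multistart_bfgs`, and the nine evenly spaced starting points are assumptions made for illustration, not the setup used for the results in this section.

```python
import numpy as np
from scipy.optimize import minimize

def toy_cost(x):
    """Stand-in cost: a local minimum near x = 1 and a lower one near x = -1."""
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]

def multistart_bfgs(cost, starts):
    """Run BFGS from every row of `starts` and keep the lowest-cost result."""
    results = [minimize(cost, x0, method="BFGS") for x0 in starts]
    best = min(results, key=lambda res: res.fun)
    return best, results

starts = np.linspace(-2.0, 2.0, 9)[:, None]   # nine deterministic starting points
best, all_results = multistart_bfgs(toy_cost, starts)
```

Runs that start in the right-hand basin stop at the higher local minimum; selecting the result with the smallest cost over all starts is what guards against reporting such a solution.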

Two-dimensional probability density estimates are calculated with the kernel density estimate procedure with a Gaussian kernel using the algorithm based on the fast Fourier transform (Silverman 1986) with a smoothing parameter of 0.2. Here the probability density estimates are only shown for illustrative purposes, and the precise value of the smoothing parameter is not important. One-dimensional histograms are calculated with a bin width of 1/*n*, where *n* is the number of samples.
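A Gaussian kernel density estimate of this kind can be sketched with a direct sum over samples. Note that this is a swapped-in technique: it is not the FFT-based algorithm of Silverman (1986), only a slower equivalent for small samples. Interpreting the smoothing parameter as the kernel standard deviation, and the grid limits used below, are assumptions.

```python
import numpy as np

def gaussian_kde_2d(samples, grid_x, grid_y, h):
    """Direct-sum Gaussian kernel density estimate on a rectangular grid.

    samples: array of shape (n, 2); h: kernel standard deviation (assumed
    meaning of the smoothing parameter). Returns shape (len(grid_x), len(grid_y)).
    """
    kx = np.exp(-0.5 * ((grid_x[:, None] - samples[None, :, 0]) / h) ** 2)
    ky = np.exp(-0.5 * ((grid_y[:, None] - samples[None, :, 1]) / h) ** 2)
    # Sum over samples of the product kernel, normalized to integrate to one.
    return kx @ ky.T / (len(samples) * 2.0 * np.pi * h ** 2)

rng = np.random.default_rng(2)
samples = rng.normal(size=(500, 2))        # toy data, not the PCs of section 4b
grid = np.linspace(-6.0, 6.0, 121)
density = gaussian_kde_2d(samples, grid, grid, h=0.2)
total_mass = density.sum() * (grid[1] - grid[0]) ** 2
```

The quantity `total_mass` is a simple sanity check that the estimate integrates to approximately one over the grid.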

### a. Two-dimensional Gaussian distributions

We first consider the two-dimensional Gaussian distribution. We want to study how the degree of isotropy in the distribution influences the NLPCA. We focus on Gaussian distributions with centers at (0, 0) and widths (1, *c*), where *c* is a constant between 0 and 1. For different values of *c*, we now randomly draw 1000 independent pairs of numbers from this distribution and calculate the NLPCA. Figures 3 and 4 show results for *c* = 0.2 and *c* = 0.8, respectively. The upper panels show the two-dimensional probability distributions and the lower panels show the histograms of the nonlinear principal components. For *c* = 0.2, the two-dimensional probability distribution estimate is unimodal, while for *c* = 0.8 some deviations from unimodality are visible as a result of sampling variability. Overlaid on the probability density functions are the NLPCA mode and the NLPCA approximation to the data. For *c* = 0.2, the NLPCA mode is a straight horizontal line. The data are almost perfectly projected vertically onto this curve, and the NLPCA basically simulates a linear least squares fit. Accordingly, the nonlinear principal component is almost normally distributed. For *c* = 0.8, the situation is quite different. Now, the NLPCA mode is Z shaped and the histogram of the nonlinear principal component is strongly bimodal. The two peaks correspond to the two outer branches of the NLPCA mode, while the middle branch is very sparsely populated. Points to the lower right (upper left) of a curve centered in between and parallel with the two outer branches are projected onto the lower right (upper left) branch of the NLPCA mode. The projections are almost perpendicular, and the NLPCA basically simulates a least squares fit to two parallel lines.

We have repeated the calculations for many sets of 1000 randomly drawn pairs, and the results described above are typical for the two values of *c.* For *c* = 0.2, the NLPCA mode is always an almost straight horizontal line, and the distribution of the nonlinear principal component is always unimodal. For *c* = 0.8, the NLPCA mode is always Z shaped, and the distribution of the nonlinear principal component is always bimodal. The orientation of the Z shape varies with a seeming affinity for the orientation in Fig. 4 and its reflections about the symmetry axes of the two-dimensional Gaussian distribution. We will discuss the reproducibility of the results in more detail in section 4c. The results do not depend on the number of random pairs and can be reproduced with 500 or 5000 pairs instead of 1000.

For *c* = 0.5, the results of the NLPCA are less robust. Now some sets of 1000 randomly drawn pairs result in a straight horizontal NLPCA mode with a unimodal nonlinear principal component, while other sets result in a Z-shaped NLPCA mode with a bimodal nonlinear principal component.

In this subsection we have shown that the NLPCA can produce strongly bimodal nonlinear principal components even if the input data are Gaussian. If the input data are sufficiently isotropic, NLPCA will always find bimodality.
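The input data for these experiments are simple to generate. The sketch below draws pairs from a Gaussian centered at (0, 0) with widths (1, *c*) for the three values of *c* discussed above and reports the sample width ratio; the helper name `draw_gaussian_pairs` and the random seed are illustrative choices, and the NLPCA fit itself is not included.

```python
import numpy as np

def draw_gaussian_pairs(n, c, rng):
    """Draw n independent pairs from a Gaussian at (0, 0) with widths (1, c)."""
    return rng.normal(size=(n, 2)) * np.array([1.0, c])

rng = np.random.default_rng(3)
samples = {c: draw_gaussian_pairs(1000, c, rng) for c in (0.2, 0.5, 0.8)}

# Ratio of the sample standard deviations, a simple measure of isotropy.
width_ratio = {c: x[:, 1].std() / x[:, 0].std() for c, x in samples.items()}
```

A ratio close to one corresponds to a nearly isotropic cloud, which is the situation in which the Z-shaped mode was found.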

### b. The extratropical circulation

Nonlinear principal component analysis was used by Monahan et al. (2001, 2003) to study the dynamical structure of the Northern Hemisphere extratropical variability. They subsampled the original daily geopotential heights from the National Centers for Environmental Prediction–National Center for Atmospheric Research reanalysis (Kalnay et al. 1996) to a coarser grid of 36 latitudes and 72 longitudes, removed the annual cycle, low-pass filtered, and selected the December–February seasons. Then they calculated the leading linear PCs north of 20°N and used those as inputs to the NLPCA. With this approach, Monahan et al. (2001, 2003) found three regimes in both the stratosphere and the troposphere. We will here discuss the stratospheric results, although we have obtained similar results for the troposphere. We use the same procedure as Monahan et al. except that we have used a 30-day low-pass filter instead of a 10-day filter in order to reproduce their results in detail. The same choice was made in Christiansen (2002).

We calculate the two leading linear PCs, *a*_{1} and *a*_{2}, from the 20-hPa geopotential height and normalize them both with the standard deviation of *a*_{1}. By definition, the difference between the joint probability density *P*(*a*_{1}, *a*_{2}) and the product *P*(*a*_{1})*P*(*a*_{2}) of the marginal probabilities is zero if *a*_{1} and *a*_{2} are statistically independent. This difference is shown in the upper panel of Fig. 5, and it is almost identical to Fig. 10 of Monahan et al. (2003). The distributions of the two leading PCs are both unimodal. The distribution of the first PC, *a*_{1}, is skewed toward larger values, while the distribution of the second PC cannot be distinguished from a Gaussian distribution.

The NLPCA mode, which is overlaid on the probability density, is Z shaped and the probability distribution of the nonlinear principal component (lower panel of Fig. 5) is highly bimodal. Our NLPCA approximation does not have the same orientation as the NLPCA approximation of Monahan et al. (2003). However, the orientations of the two NLPCA approximations are almost identical up to a reflection about the horizontal axis, about which the distribution is approximately symmetric. As the relative width of the distributions of the two linear PCs is 0.64, this ambiguity should be expected from the experience of the previous subsection.

Monahan et al. (2003) argue that the orientation of and the peaks in the difference between the joint probability density *P*(*a*_{1}, *a*_{2}) and the product *P*(*a*_{1})*P*(*a*_{2}) of the marginal probabilities support the results of their NLPCA. However, as shown in Christiansen (2002), this difference does not have solid physical meaning as the difference is not preserved under an orthonormal rotation. As an additional test, we have constructed surrogate data (*x*_{1}, *x*_{2}), where *x*_{1} is randomly drawn from *P*(*a*_{1}) and *x*_{2} is randomly drawn from *P*(*a*_{2}). Consequently *x*_{1} and *x*_{2} are statistically independent, and (*x*_{1}, *x*_{2}) follows the distribution *P*(*a*_{1})*P*(*a*_{2}), where both *P*(*a*_{1}) and *P*(*a*_{2}) are unimodal. The resulting NLPCA approximation and the distribution of the nonlinear principal component (Fig. 6) resemble those of the original data (Fig. 5), although the only possible deviation from unimodality in the surrogate data is due to chance. This result is not sensitive to the sample size. While Fig. 6 is based on a sample of 1000 points, similar results are found with samples of 300 points.
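One way to build surrogate pairs that follow the product of the marginal distributions is to resample each component independently from its own observed values. This is an assumed reading of "randomly drawn from" the marginals; the article does not specify the sampling procedure. The series `a1` and `a2` below are synthetic placeholders, not the PCs of the 20-hPa geopotential height.

```python
import numpy as np

def independent_surrogate(a1, a2, n, rng):
    """Draw n pairs whose components are sampled independently (with
    replacement) from the empirical marginals of a1 and a2."""
    x1 = rng.choice(a1, size=n, replace=True)
    x2 = rng.choice(a2, size=n, replace=True)
    return np.column_stack([x1, x2])

rng = np.random.default_rng(4)
# Placeholder series: a skewed first component and a dependent second one.
a1 = rng.gamma(shape=4.0, size=2000)
a2 = 0.1 * (a1 - 4.0) ** 2 + 0.64 * rng.normal(size=2000)
surrogate = independent_surrogate(a1, a2, n=1000, rng=rng)
```

Because the two columns are drawn separately, any dependence between the original components is removed while each marginal distribution is retained.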

### c. Reproducibility and overfitting

We saw in section 4a that NLPCA produces bimodality and Z-shaped NLPCA modes even for Gaussian-distributed data if the isotropy is large enough. We also saw that the orientation of the Z shape for *c* = 0.8 was not uniquely determined because of the symmetry of the data. For *c* = 1, the orientation of the Z-shaped NLPCA mode is completely random, as should be expected. If the symmetry of the data is lifted, for example, by drawing the datasets (with a sample size of 300 or 1000) from an asymmetric distribution, the NLPCA mode is again Z shaped with a bimodal nonlinear principal component, but now its orientation does not vary among the different realizations (not shown).

There is still hope that a procedure that carefully tests the reproducibility of the NLPCA would reject the spurious bimodality, as such a test could certainly reject the bimodality in our experiments with Gaussian-distributed data. To address the possibility of such a method, we need to compare the NLPCA of the atmospheric data with an NLPCA of carefully constructed surrogate data. These surrogate data should resemble the atmospheric data but be drawn from a unimodal distribution. In particular, and in contrast to the data analyzed in Fig. 6, the surrogate data should now have the same serial correlations as the atmospheric data. We have constructed a two-dimensional Gaussian dataset with the same number of samples, the same serial correlations, and the same widths as the original PCs from the 20-hPa geopotential height. To this end, we have used the linear method described in Winkler et al. (2001). We proceed by performing the NLPCA on 25 different subsets of both the original and the surrogate data. The subsets each contain 80% of the data (as in Monahan 2000). If the NLPCA of the surrogate data is less reproducible than the NLPCA of the original data, it would suggest that the Z-shaped NLPCA mode actually reflects some structure in the data. However, as illustrated by the four realizations in Fig. 7, there is no difference between the reproducibility of the NLPCA of the original data and the NLPCA of the surrogate data. The same result is found if the atmospheric data are compared to the skewed surrogate data described at the end of section 4b with sample sizes of either 300 or 1000 points. Therefore, statistical tests like those described in Monahan (2000) will not be able to reject the spurious bimodality detected by the NLPCA.
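The subset step of the reproducibility test can be sketched as follows. Whether the subsets were chosen at random or as contiguous blocks is not stated in the text, so random selection without replacement is an assumption here, as are the helper name `random_subsets` and the sample count of 1000.

```python
import numpy as np

def random_subsets(n_samples, n_subsets=25, fraction=0.8, rng=None):
    """Index arrays for n_subsets random subsets, each holding `fraction` of the data."""
    rng = np.random.default_rng() if rng is None else rng
    size = int(round(fraction * n_samples))
    return [np.sort(rng.choice(n_samples, size=size, replace=False))
            for _ in range(n_subsets)]

subsets = random_subsets(n_samples=1000, rng=np.random.default_rng(5))
```

Each index array would then select the rows of the original or surrogate data on which one NLPCA fit is performed, and the resulting modes are compared across subsets.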

One could argue that the bimodality and the Z-shaped NLPCA modes are due to overfitting, which is a well-known problem with neural network methods (Hsieh 2004). One approach to avoid overfitting is to add a penalty term to the mean square error and to minimize the sum $\sum_j \|\mathbf{x}_j - f(\mathbf{x}_j)\|^2/(nl) + p \sum_{j=1}^{4} \sum_{i=1}^{N_j} (\mathbf{w}^j_i)^2$. The penalty term is proportional to the sum of the squared weights, and the inclusion of this term will reduce the nonlinearity of the neural network. We have repeated the NLPCA in section 4b of the 20-hPa geopotential height and the surrogate data with different values of the penalty parameter $p$. With strong penalty, the NLPCA mode is a straight line. With decreasing penalty, the NLPCA mode first takes the form of a curved line before it adopts the Z shape. The same pattern is found for both the real data and the surrogate data (Fig. 8). Only for the strongest penalty where the NLPCA is a straight line will the distribution of the nonlinear principal component be unimodal.
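The penalized cost can be written compactly once the network output and the weight matrices are available. The sketch below takes them as plain arrays so that it stays independent of any particular network code; the function name `penalized_cost` and the small hand-made inputs are hypothetical, and biases are left out of the penalty because the text refers to the squared weights only.

```python
import numpy as np

def penalized_cost(x, x_hat, weights, p):
    """Squared error divided by n and l, plus p times the sum of squared weights."""
    n, l = x.shape
    mse = np.sum((x - x_hat) ** 2) / n / l
    penalty = p * sum(np.sum(w ** 2) for w in weights)
    return mse + penalty

# Tiny hand-made example: unit residuals and three weights (1, 2, 3).
x = np.zeros((2, 2))
x_hat = np.ones((2, 2))
weights = [np.array([[1.0, 2.0]]), np.array([[3.0]])]
cost = penalized_cost(x, x_hat, weights, p=0.1)
```

Increasing `p` makes large weights expensive, which keeps the arguments of the hyperbolic tangents small and hence the fitted map closer to linear.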

## 5. Discussion

Recently, nonlinear principal component analysis has been introduced to the atmospheric sciences, and the observed multimodality in the nonlinear principal component has been taken as evidence for regimes in both the tropospheric and stratospheric circulation. We have reviewed the usefulness of NLPCA as a tool for detection of multimodality and found severe problems and limitations.

Our review included both theoretical arguments and numerical simulations based on both idealized datasets and observations of the Northern Hemisphere stratospheric circulation.

Theoretically we have shown that the nonlinear principal component easily becomes multimodal even if the input is normally distributed. The multimodality is a consequence of the nonlinearity of the NLPCA. We also saw that with two neurons in the fourth layer the NLPCA approximation can be only a straight line, a simple curve, or a Z-shaped structure. Finally, we reiterated a limitation of the method first reported by Malthouse (1998). This limitation states that only the order of the nonlinear principal component is interpretable and not its magnitude.

Numerical simulations based on Gaussian-distributed datasets confirmed that the NLPCA produces multimodality when fed with unimodal data. We saw that multimodality resulted when the distribution of the input data was sufficiently isotropic. The multimodality of the nonlinear principal component was accompanied by a Z-shaped NLPCA mode with a sparsely populated middle branch so that the NLPCA approximation effectively consisted of two parallel line pieces.

We repeated the NLPCA analysis of the Northern Hemisphere stratospheric circulation reported by Monahan et al. (2003). Like Monahan et al. (2003), we found bimodality in the distribution of the nonlinear principal component and a Z-shaped NLPCA approximation in the space spanned by the two leading PCs. We also found similar bimodality and Z-shaped NLPCA approximation for data drawn randomly from the product of the marginal distributions of the original PCs.

The two-dimensional, uncorrelated Gaussian data studied in section 4a were constructed to offer a simple and clean test case. We saw that the Z-shaped NLPCA mode appeared in all realizations but that its orientation varied with an affinity for certain orientations determined by the symmetry of the data. If the symmetry of the data is lifted, for example, by drawing the numbers from an asymmetric distribution, the NLPCA mode is again Z shaped with bimodal nonlinear principal component, but now its orientation does not vary among the different realizations. We elaborated on this point in section 4c, where we saw that the atmospheric data and appropriate unimodal surrogate data have the same reproducibility so that attempts to validate the method by training it on one part of the data and testing it on the remaining part will fail to reject spurious bimodality.

In the present paper we have focused on the bivariate case. However, additional experiments with datasets of higher dimensions show that here the NLPCA will also produce spurious bimodality when the data are sufficiently isotropic and the penalty factor is small enough. We also note that the spurious multimodality is robust to changes in the architecture of the neural network such as the transfer functions and the number of neurons, *m*, in the second and fourth layer.

We conclude that the NLPCA abundantly produces spurious multimodality and that it should not be used for the detection of multimodality and regime behavior.

The author would like to acknowledge insightful comments provided by Matt Newman during the review process, which substantially improved the manuscript. This work was supported by the Danish Climate Centre. The NCEP reanalysis data were provided by the NOAA–CIRES Climate Diagnostics Center, Boulder, Colorado, from their Web site (http://www.cdc.noaa.gov/).

## REFERENCES

Christiansen, B., 2002: On the physical nature of the Arctic Oscillation. *Geophys. Res. Lett.*, **29**, 1805, doi:10.1029/2002GL015208.

Corti, S., F. Molteni, and T. N. Palmer, 1999: Signature of recent climate change in frequencies of natural atmospheric circulation regimes. *Nature*, **398**, 799–802.

Hamilton, K., and W. W. Hsieh, 2002: Representation of the QBO in the tropical stratospheric wind by nonlinear principal component analysis. *J. Geophys. Res.*, **107**, 4232, doi:10.1029/2001JD001250.

Hsieh, W. W., 2001: Nonlinear principal component analysis by neural networks. *Tellus*, **A53**, 599–615.

Hsieh, W. W., 2004: Nonlinear multivariate and time series analysis by neural network methods. *Rev. Geophys.*, **42**, RG1003, doi:10.1029/2002RG000112.

Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. *Bull. Amer. Meteor. Soc.*, **77**, 437–471.

Kramer, M. A., 1991: Nonlinear principal component analysis using autoassociative neural networks. *AIChE J.*, **37**, 233–243.

Malthouse, E. C., 1998: Limitations of nonlinear PCA as performed with generic neural networks. *IEEE Trans. Neural Networks*, **9**, 165–173.

Monahan, A. H., 2000: Nonlinear principal component analysis by neural networks: Theory and application to the Lorenz system. *J. Climate*, **13**, 821–835.

Monahan, A. H., J. Fyfe, and G. M. Flato, 2000: A regime view of northern hemisphere atmospheric variability and change under global warming. *Geophys. Res. Lett.*, **27**, 1139–1142.

Monahan, A. H., L. Pandolfo, and J. Fyfe, 2001: The preferred structure of variability of the northern hemisphere atmospheric circulation. *Geophys. Res. Lett.*, **28**, 1019–1022.

Monahan, A. H., L. Pandolfo, and J. Fyfe, 2003: The vertical structure of wintertime climate regimes of the Northern Hemisphere extratropical atmosphere. *J. Climate*, **16**, 2005–2020.

Newbigging, S. C., L. A. Mysak, and W. W. Hsieh, 2003: Improvements to the Non-linear Principal Component Analysis method, with applications to ENSO and QBO. *Atmos.–Ocean*, **41**, 291–299.

Palmer, T. N., 1999: A nonlinear dynamical perspective on climate change. *J. Climate*, **12**, 575–591.

Silverman, B. W., 1986: *Density Estimation for Statistics and Data Analysis*. Chapman and Hall, 175 pp.

Stephenson, D. B., A. Hannachi, and A. O’Neill, 2004: On the existence of multiple climate regimes. *Quart. J. Roy. Meteor. Soc.*, **130**, 583–605.

Teng, Q., A. H. Monahan, and J. C. Fyfe, 2004: Effects of time averaging on climate regimes. *Geophys. Res. Lett.*, **31**, L22203, doi:10.1029/2004GL020840.

Winkler, C. R., M. Newman, and P. D. Sardeshmukh, 2001: A linear model of wintertime low-frequency variability. Part I: Formulation and forecast skill. *J. Climate*, **14**, 4474–4494.