1. Introduction
Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, has a long history of use in meteorology, climatology, and oceanography since its introduction into this literature by Obukhov (1947), Lorenz (1956), and Davis (1976). The method often is very effective at compressing high-dimensional datasets by producing a relatively small number of uncorrelated derived variables, while preserving much of the original variance. The computations involve extracting eigenvalue–eigenvector pairs from the covariance matrix of the data under analysis, and projecting those data onto the eigenvectors corresponding to the largest eigenvalues.
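As a purely illustrative restatement of this procedure (not drawn from the paper itself; the function and variable names below are hypothetical), the eigendecomposition and projection can be sketched in a few lines of Python:

import numpy as np

def pca_truncate(X, n_keep):
    """Project the (n samples x M variables) data matrix X onto its
    n_keep leading covariance eigenvectors (principal components)."""
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # M x M sample covariance matrix
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # re-sort to descending order
    evals, evecs = evals[order], evecs[:, order]
    scores = Xc @ evecs[:, :n_keep]          # principal-component scores
    return evals, evecs, scores

The choice of n_keep, the truncation point, is the subject of the remainder of the paper.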
A key issue in the use of PCA is the choice of the number of these derived variables to retain in the compressed dataset; this truncation point is generally chosen on the basis of the magnitudes of the sample eigenvalues. A wide variety of methods have been proposed to guide this choice (Preisendorfer and Mobley 1988; Jolliffe 2002; Wilks 2011), although these methods often disagree and no consensus has developed regarding the most appropriate approach. Some authors recommend against the use of such selection rules altogether (von Storch and Zwiers 1999).
Typically the choice of the truncation point is framed in terms of separating hypothesized “signal” and “noise” subspaces of the data space. It is usually assumed that the noise is relatively weak and uncorrelated across the original variables, and so will be represented by a sequence of higher-order generating-process eigenvalues of equal magnitude. Although this is a reasonable viewpoint for many applications, and will be assumed in this paper, the definition of what is considered as signal may depend on the scientific context. For example, if the scientific interest focuses on one or a few large-scale processes, some portion of the noise may be spatially correlated, and represented by intermediate eigenvalues. Alternatively, real physical processes occurring at unresolved scales may be aliased as noise onto the resolved scales, and make contributions across the low-variance eigenvalues.
Rule N (Preisendorfer et al. 1981; Overland and Preisendorfer 1982) is a popular method for PCA truncation that assumes the noise subspace is represented by a sequence of trailing eigenvalues of equal magnitude. It is based on statistical hypothesis testing ideas, and it continues to be used widely in climatology and oceanography (e.g., Pérez-Hernández and Joyce 2014; Ortega et al. 2015; Wang et al. 2015; Feng et al. 2016). However, Rule N is only strictly valid for the hypothesis test involving the leading eigenvalue, which amounts to testing the null hypothesis that the signal subspace is null. The result is that Rule N is usually conservative, meaning that it retains too few components.
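For orientation, Rule N can be stated schematically as follows. The Python sketch below is a generic reading of the rule, not a reproduction of the original authors' procedure; Gaussian noise, a 95% critical level, and eigenvalues expressed as fractions of total variance are assumptions of this illustration.

import numpy as np

def rule_n(sample_evals, n, M, n_trials=1000, alpha=0.05, seed=None):
    """Number of leading components retained by a generic Rule N.

    sample_evals : observed covariance eigenvalues, sorted descending."""
    rng = np.random.default_rng(seed)
    obs = np.sort(np.asarray(sample_evals))[::-1]
    obs = obs / obs.sum()                          # fractions of total variance
    noise = np.empty((n_trials, M))
    for t in range(n_trials):
        Z = rng.standard_normal((n, M))            # uncorrelated pure-noise data
        ev = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]
        noise[t] = ev / ev.sum()
    crit = np.quantile(noise, 1.0 - alpha, axis=0)  # rank-wise critical values
    keep = 0
    while keep < min(len(obs), M) and obs[keep] > crit[keep]:
        keep += 1                                  # retain while the eigenvalue exceeds its noise percentile
    return keep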
This paper proposes a modification of Rule N for PCA truncation that is based on several relatively recent results from the statistics literature. Section 2 reviews Rule N and describes the proposed modification, section 3 compares the performance of the two methods in synthetic-data settings where the correct truncation points are known, section 4 illustrates the methods in a small real-data setting, and section 5 concludes the paper.
2. Testing a sequence of sample eigenvalues
Although it is conceptually attractive, the primary problem with Rule N is that, having rejected the first null hypothesis H1 that λ1 corresponds to noise (because the leading sample eigenvalue is too large to be consistent with its pure-noise reference distribution), the reference distributions for the subsequent tests of H2, H3, . . . are no longer appropriate, because they were constructed under the assumption that all M eigenvalues reflect noise only. The usual consequence, as noted in section 1, is that too few principal components are retained.
An improved hypothesis-testing-based approach to estimating the dimension of the signal subspace in PCA can be constructed by combining several relatively recent statistical results. Johnstone (2001) has shown that the sampling distribution of the leading eigenvalue of a sample covariance matrix derived from uncorrelated random variates, when appropriately scaled, follows a distribution proposed by Tracy and Widom (1996), which will be defined later. For example, over many Monte Carlo simulations, a histogram of the appropriately scaled random realizations of the leading sample eigenvalue closely matches the Tracy–Widom density.
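One common statement of Johnstone's (2001) result centers and scales the largest eigenvalue of X'X, where X is an n × M matrix of independent standard Gaussian values, using constants that depend only on n and M. The Python sketch below records that scaling as background; it should not be read as the paper's Eq. (2), which is not reproduced in this excerpt.

import numpy as np

def johnstone_scaling(n, M):
    """Centering (mu) and scaling (sigma) constants such that
    (l1 - mu) / sigma is approximately Tracy-Widom (beta = 1), where l1 is
    the largest eigenvalue of X'X for an n x M matrix X of i.i.d. N(0,1) values."""
    mu = (np.sqrt(n - 1) + np.sqrt(M)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(M)) * (
        1.0 / np.sqrt(n - 1) + 1.0 / np.sqrt(M)) ** (1.0 / 3.0)
    return mu, sigma

Note that if eigenvalues are taken from a covariance matrix that includes a 1/n or 1/(n − 1) divisor, they must be rescaled accordingly before these constants are applied.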
Kritchman and Nadler (2008) show that if κ = 1, the (appropriately scaled) sampling distribution of the second sample eigenvalue, which in that case is the largest noise eigenvalue, is also approximately Tracy–Widom, and they conjecture that the same result holds for the largest noise eigenvalue when κ > 1.
Under a null hypothesis Hk, the kth sample eigenvalue is treated as the leading eigenvalue of a pure-noise covariance matrix of dimension Mk = M − k + 1 estimated from a sample of size nk = n − k + 1, and its appropriately scaled value [Eq. (2)] is referred to the Tracy–Widom distribution. The null hypothesis Hk is rejected when the associated p value is smaller than the chosen test level αcrit, and the estimated signal dimension is the number of leading eigenvalues whose null hypotheses are rejected before the first nonrejection [Eq. (4)].
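Because Eqs. (2)–(4) are not reproduced in this excerpt, the following Python sketch should be read as only one plausible rendering of such a sequential procedure, under explicit assumptions: a Johnstone-type centering and scaling, the reduced sample sizes and dimensions nk = n − k + 1 and Mk = M − k + 1 given in the appendix, a known (unit) noise variance, a stop-at-first-nonrejection rule, and a user-supplied Tracy–Widom p-value function (one candidate is sketched after the next paragraph).

import numpy as np

def sequential_eigenvalue_test(sample_evals, n, M, tw_pvalue, alpha=0.05,
                               noise_var=1.0):
    """Estimate the signal dimension by testing H1, H2, ... in sequence.

    sample_evals : covariance-matrix eigenvalues, sorted descending
    tw_pvalue    : callable giving the Tracy-Widom upper-tail probability
    Testing stops at the first null hypothesis that is not rejected."""
    evals = np.sort(np.asarray(sample_evals))[::-1]
    kappa_hat = 0
    for k in range(1, min(n, M)):
        nk, Mk = n - k + 1, M - k + 1            # reduced sample size and dimension
        mu = (np.sqrt(nk - 1) + np.sqrt(Mk)) ** 2
        sigma = (np.sqrt(nk - 1) + np.sqrt(Mk)) * (
            1.0 / np.sqrt(nk - 1) + 1.0 / np.sqrt(Mk)) ** (1.0 / 3.0)
        stat = (nk * evals[k - 1] / noise_var - mu) / sigma
        if tw_pvalue(stat) < alpha:              # reject Hk: component k is signal
            kappa_hat = k
        else:
            break
    return kappa_hat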
The Pearson-III approximation to the Tracy–Widom distribution is quite good, especially for the right-tail quantiles that are of primary interest in the present setting (Chiani 2014). The correspondence is less close in the left tail, due in part to the fact that a Tracy–Widom variable can in principle be any real number, whereas Eq. (6) requires values no smaller than its finite lower bound.
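Chiani (2014) approximates the Tracy–Widom (β = 1) distribution by a gamma distribution shifted to the left. The sketch below uses parameter values matched to the Tracy–Widom distribution's known mean, variance, and skewness (consistent with the fit reported by Chiani 2014); a function of this form could serve as the tw_pvalue argument in the preceding sketch, and the zero probability it assigns below its lower bound is the left-tail limitation just noted.

from scipy.stats import gamma

# Shifted-gamma (Pearson III) approximation to the Tracy-Widom (beta = 1)
# distribution: TW1 + SHIFT_TW is treated as gamma with shape K_TW and scale THETA_TW.
K_TW, THETA_TW, SHIFT_TW = 46.446, 0.186054, 9.84801

def tw1_quantile(p):
    """Approximate Tracy-Widom (beta = 1) quantile, e.g. p = 0.95."""
    return gamma.ppf(p, a=K_TW, scale=THETA_TW) - SHIFT_TW

def tw1_pvalue(x):
    """Approximate upper-tail probability (p value) of the Tracy-Widom (beta = 1) distribution."""
    return gamma.sf(x + SHIFT_TW, a=K_TW, scale=THETA_TW)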
Since the applicability of the Tracy–Widom distribution assumes that both n and M are large, it is of interest to examine its accuracy for the small to moderate sample sizes and/or data dimensions that are typically encountered. Table 1 shows relative specification errors (%) for 95th percentiles computed using the Pearson-III approximation to Tracy–Widom distributions, relative to the corresponding empirical distributions obtained through Monte Carlo simulation using the method described in the appendix. The positive tabulated values indicate that Tracy–Widom distributions overspecify this quantile, which leads to conservatism in the hypothesis tests (i.e., when κ = k − 1, the null hypothesis Hk is rejected less frequently than the nominal αcrit = 0.05). However, this conservatism is slight unless either the sample size or the data dimension is fairly small, and the errors will be acceptable for many climatological applications. Similar results (not shown) are exhibited for other high quantiles. Fortunately, when either the sample size or the data dimension is relatively small, the relevant reference distributions can be derived fairly quickly through Monte Carlo simulations, because (as outlined in the appendix) only the leading sample eigenvalue of each synthetic covariance matrix needs to be computed.
Table 1. Percent error of 95th percentiles of Pearson-III approximations to Tracy–Widom distributions, relative to 10 000-member Monte Carlo counterparts.
3. Performance in synthetic-data settings
Figure 1 shows frequencies of signal dimensions estimated according to Eq. (4), for Rule N (dashed histograms) and the Tracy–Widom modification (solid histograms), over 1000 trials for each parameter combination. The vertical red lines in each panel locate the correct specification, κ.
The Tracy–Widom modification of Rule N exhibits good performance even though no adjustments have been made to account for the effect of computing the sequence of multiple hypothesis tests specified by Eq. (4). Figure 2 illustrates that this result is a fortuitous consequence of the nonapplicability of the Tracy–Widom distribution for describing the sampling variation of the k = (κ + 2)th (second noise) eigenvalue. This figure pertains to the parameter combination n = 50, κ = 9, and S/N = 1, which is indicated by the heavy box outline in Fig. 1, but is representative of the other parameter combinations also. Figure 2a shows the relative frequencies of p10 [i.e., the test p value for the first noise eigenvalue in Eq. (4)] over 10 000 trials, when the p values are computed using the empirical distribution of the test statistic (red dashed histograms) and using the Tracy–Widom approximation to it (black solid histograms). The thin horizontal line at αcrit = 0.05 indicates ideal behavior for these histograms, and the close correspondence of the dashed histogram to this ideal supports the conjecture of Kritchman and Nadler (2008) noted in section 2. The height of the leftmost bar in Fig. 2a reflects the achieved test size when αcrit = 0.05, and shows that the actual test level is very close to αcrit = 0.05 when the empirical distribution of the test statistic is used as the reference.
Figure 2b shows the corresponding relative frequencies for p11, which pertain to the test statistic for the second noise eigenvalue, k = κ + 2 = 11. Here the Tracy–Widom distribution describes the actual sampling variation poorly, so that H11 is rejected much less frequently than the nominal αcrit; as a result, the sequence of tests rarely proceeds far into the noise subspace, and the signal dimension is rarely overestimated even without an explicit correction for test multiplicity.
4. Real-data example
Overland and Preisendorfer (1982) illustrated the use of Rule N with a PCA relating to numbers of cyclones transiting an M = 56-member network of grid cells in the Bering Sea during October–February, for a period of n = 23 winters. Their results pertaining to the covariance matrix of cyclone counts are displayed graphically in Fig. 3a. Here the five leading sample eigenvalues, scaled according to Eq. (2), are plotted together with the 95th percentile of the Monte Carlo distribution for these statistics. Accordingly, Rule N identifies as signal those components whose scaled eigenvalues exceed this critical level.
Figure 3b shows the corresponding result when Eq. (4) is evaluated using Tracy–Widom distributions (dashed curve) and 10 000-member Monte Carlo distributions (solid curve) as the reference distributions for the successive test statistics.
5. Conclusions
This paper has proposed a modification of the popular Rule N for PCA truncation, based on the Tracy–Widom distribution for the largest noise eigenvalue of a sample covariance matrix. This new method improves upon Rule N because it accounts at each stage of the sequence of hypothesis tests for the results of previous tests in the sequence.
Both Rule N and the proposed modification assume the particular separation of signal and noise subspaces expressed in Eq. (1). Experiments using synthetic data conforming to this assumption show excellent results unless the dimension of the signal subspace is a large fraction of the sample size and the signal-to-noise ratio is relatively small, and even under these circumstances the proposed modification strongly outperforms the original Rule N. However, both of these methods may perform poorly in settings where the signal and noise subspaces are defined differently.
Even though the proposed procedure is based on a sequence of hypothesis tests, it is not necessary to account for this test multiplicity because the bias of Tracy–Widom distributions for other than the largest noise eigenvalues prevents the procedure from greatly overestimating the dimension of the signal subspace.
The Tracy–Widom distribution provides a good representation for the random variations of the leading noise eigenvalue when the sample size and data dimension are both moderate to large. When either or both of these parameters are relatively small, the Tracy–Widom distribution yields conservative tests and truncation points; that is, too few of the leading eigenvalues are regarded as representing signal on average, so that some signal modes are misidentified as noise. In this circumstance the appropriate reference distributions can be easily and quickly generated by Monte Carlo methods, as described in the appendix.
The results presented here were based on underlying Gaussian-distributed data, but the sampling distributions of sample covariance-matrix eigenvalues are very similar for uniform or random-sign (±1 with equal probability) data (Faber et al. 1994). However, if sensitivity to non-Gaussian data is suspected, the appropriate reference distributions can again be derived using Monte Carlo methods as described in the appendix.
Acknowledgments
I thank the anonymous reviewers, whose comments led to an improved presentation. This research was supported by the National Science Foundation under Grant AGS-1112200.
APPENDIX
Monte Carlo Computation of the Reference Distributions
A different sampling distribution must be computed for each null hypothesis Hk [Eq. (3)], k = 1, 2, 3, . . . , pertaining to the largest sample eigenvalue of a pure noise covariance matrix formed from random data of dimension Mk, with sample size nk, where Mk = M − k + 1 and nk = n − k + 1. If either the sample size n or the data dimension M is relatively small, the Tracy–Widom distribution may represent inadequately the sampling distribution of the largest noise eigenvalue, as indicated in Table 1. In this case discrete approximations to the appropriate distributions can be computed relatively cheaply using the power method (e.g., Golub and Van Loan 1983), because only the leading eigenvalue rather than all of the nonzero eigenvalues of each sample covariance matrix needs to be computed. This approach can also be used if the effects of non-Gaussian data are suspected to be significant, but with possibly large computational expense if the smaller of nk and Mk is not relatively small.
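A minimal rendering of this Monte Carlo scheme in Python is sketched below (an illustration of the general approach rather than the author's code; Gaussian noise, a fixed number of power iterations, and 10 000 trials are assumptions of the sketch):

import numpy as np

def leading_eigenvalue(S, n_iter=200, seed=None):
    """Leading eigenvalue of a symmetric matrix S by the power method."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(S.shape[0])
    for _ in range(n_iter):
        v = S @ v
        v /= np.linalg.norm(v)
    return float(v @ S @ v)                    # Rayleigh quotient at convergence

def noise_eigenvalue_distribution(nk, Mk, n_trials=10000, seed=None):
    """Monte Carlo distribution of the largest eigenvalue of a pure-noise
    covariance matrix (nk samples of Mk uncorrelated Gaussian variables)."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_trials)
    for t in range(n_trials):
        Z = rng.standard_normal((nk, Mk))      # replace with a non-Gaussian generator if desired
        out[t] = leading_eigenvalue(np.cov(Z, rowvar=False))
    return out                                 # e.g., np.quantile(out, 0.95) gives the critical value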
REFERENCES
Barnston, A. G., 1994: Linear statistical short-term climate predictive skill in the Northern Hemisphere. J. Climate, 7, 1513–1564, doi:10.1175/1520-0442(1994)007<1513:LSSTCP>2.0.CO;2.
Chiani, M., 2014: Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy–Widom distribution. J. Multivar. Anal., 129, 69–81, doi:10.1016/j.jmva.2014.04.002.
Compagnucci, R. H., and M. B. Richman, 2008: Can principal component analysis provide atmospheric circulation or teleconnection patterns? Int. J. Climatol., 28, 703–726, doi:10.1002/joc.1574.
Davis, R. E., 1976: Predictability of sea surface temperature and sea level pressure anomalies over the North Pacific Ocean. J. Phys. Oceanogr., 6, 249–266, doi:10.1175/1520-0485(1976)006<0249:POSSTA>2.0.CO;2.
Faber, N. M., L. M. C. Buydens, and G. Kateman, 1994: Aspects of pseudorank estimation methods based on the eigenvalues of principal component analysis of random matrices. Chemom. Intell. Lab. Syst., 25, 203–226, doi:10.1016/0169-7439(94)85043-7.
Feng, J., Q. Wang, S. Hu, and D. Hu, 2016: Intraseasonal variability of the tropical Pacific subsurface temperature in the two flavours of El Niño. Int. J. Climatol., 36, 867–884, doi:10.1002/joc.4389.
Golub, G. H., and C. F. Van Loan, 1983: Matrix Computations. Johns Hopkins University Press, 476 pp.
Johnstone, I. M., 2001: On the distribution of the largest eigenvalue in principal component analysis. Ann. Stat., 29, 295–327, doi:10.1214/aos/1009210544.
Jolliffe, I. T., 2002: Principal Component Analysis. Springer, 487 pp.
Kritchman, S., and B. Nadler, 2008: Determining the number of components in a factor model from limited noisy data. Chemom. Intell. Lab. Syst., 94, 19–32, doi:10.1016/j.chemolab.2008.06.002.
Lorenz, E. N., 1956: Empirical orthogonal functions and statistical weather prediction. Science Rep. 1, Statistical Forecasting Project, Department of Meteorology, MIT (NTIS AD 110268), 49 pp.
Obukhov, A. M., 1947: Statistically homogeneous fields on a sphere. Uspekhi Mat. Nauk, 2, 196–198.
Ortega, P., F. Lehner, D. Swingedouw, V. Masson-Delmotte, C. C. Raible, M. Casado, and P. Yiou, 2015: A model-tested North Atlantic Oscillation reconstruction for the past millennium. Nature, 523, 71–77, doi:10.1038/nature14518.
Overland, J. E., and R. W. Preisendorfer, 1982: A significance test for principal components applied to a cyclone climatology. Mon. Wea. Rev., 110, 1–4, doi:10.1175/1520-0493(1982)110<0001:ASTFPC>2.0.CO;2.
Pérez-Hernández, M., and T. M. Joyce, 2014: Two modes of Gulf Stream variability revealed in the last two decades of satellite altimeter data. J. Phys. Oceanogr., 44, 149–163, doi:10.1175/JPO-D-13-0136.1.
Preisendorfer, R. W., and C. D. Mobley, 1988: Principal Component Analysis in Meteorology and Oceanography. Elsevier, 425 pp.
Preisendorfer, R. W., F. W. Zwiers, and T. P. Barnett, 1981: Foundations of Principal Component Selection Rules. SIO Reference Series 81-4, Scripps Institution of Oceanography, 192 pp.
Tracy, C. A., and H. Widom, 1996: On orthogonal and symplectic matrix ensembles. Commun. Math. Phys., 177, 727–754, doi:10.1007/BF02099545.
von Storch, H., and F. W. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge University Press, 484 pp.
Wallace, J. M., and D. S. Gutzler, 1981: Teleconnections in the geopotential height field during the Northern Hemisphere winter. Mon. Wea. Rev., 109, 784–812, doi:10.1175/1520-0493(1981)109<0784:TITGHF>2.0.CO;2.
Wang, Y., R. M. Castelao, and Y. Yuan, 2015: Seasonal variability of alongshore winds and sea surface temperature fronts in eastern boundary current systems. J. Geophys. Res. Oceans, 120, 2385–2400, doi:10.1002/2014JC010379.
Wilks, D. S., 2008: Improved statistical seasonal forecasts using extended training data. Int. J. Climatol., 28, 1589–1598, doi:10.1002/joc.1661.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. Academic Press, 676 pp.