Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, is widely used for compression of high-dimensional datasets in such applications as climate diagnostics and seasonal forecasting. A critical question when using this method is the number of modes, representing meaningful signal, to retain. The resampling-based “Rule N” method attempts to address the question of PCA truncation in a statistically principled manner. However, it is only valid for the leading (largest) eigenvalue, because it fails to condition the hypothesis tests for subsequent (smaller) eigenvalues on the results of previous tests. This paper draws on several relatively recent statistical results to construct a hypothesis-test-based truncation rule that accounts at each stage for the magnitudes of the larger eigenvalues. The performance of the method is demonstrated in an artificial data setting and illustrated with a real-data example.
Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, has a long history of use in meteorology, climatology, and oceanography since its introduction into this literature by Obukhov (1947), Lorenz (1956), and Davis (1976). The method often is very effective at compressing high-dimensional datasets by producing a relatively small number of uncorrelated derived variables, while preserving much of the original variance. The computations involve extracting eigenvalue–eigenvector pairs from the covariance matrix of the data under analysis, and projecting those data onto the eigenvectors corresponding to the largest eigenvalues.
A key issue in the use of PCA is choice of the number of these derived variables to retain in the compressed dataset, with this truncation point generally chosen on the basis of the magnitudes of the sample eigenvalues. A wide variety of methods have been proposed to guide this process (Preisendorfer and Mobley 1988; Jolliffe 2002; Wilks 2011), although these methods often disagree and no consensus has developed regarding the most appropriate approach. Some authors recommend against use of these selection rules at all (von Storch and Zwiers 1999).
Typically the choice of the truncation point is framed in terms of separating hypothesized “signal” and “noise” subspaces of the data space. It is usually assumed that the noise is relatively weak and uncorrelated across the original variables, and so will be represented by a sequence of higher-order generating-process eigenvalues of equal magnitude. Although this is a reasonable viewpoint for many applications, and will be assumed in this paper, the definition of what is considered as signal may depend on the scientific context. For example, if the scientific interest focuses on one or a few large-scale processes, some portion of the noise may be spatially correlated, and represented by intermediate eigenvalues. Alternatively, real physical processes occurring at unresolved scales may be aliased as noise onto the resolved scales, and make contributions across the low-variance eigenvalues.
Rule N (Preisendorfer et al. 1981; Overland and Preisendorfer 1982) is a popular method for PCA truncation that assumes the noise subspace is represented by a sequence of trailing eigenvalues of equal magnitude. It is based on statistical hypothesis testing ideas, and it continues to be used widely in climatology and oceanography (e.g., Pérez-Hernández and Joyce 2014; Ortega et al. 2015; Wang et al. 2015; Feng et al. 2016). However, Rule N is only strictly valid for the hypothesis test involving the leading eigenvalue, which amounts to testing the null hypothesis that the signal subspace is null. The result is that Rule N is usually conservative, meaning that it retains too few components.
This paper proposes a modification of Rule N for PCA truncation that is based on several relatively recent results from the statistics literature. Section 2 reviews Rule N and describes the proposed modification, section 3 compares the performance of the two methods in synthetic-data settings where the correct truncation points are known, section 4 illustrates the methods in a small real-data setting, and section 5 concludes the paper.
2. Testing a sequence of sample eigenvalues
The underlying assumption in the following is that the multivariate data space can be partitioned into a κ-dimensional signal subspace and a remaining noise subspace. The eigenvalues corresponding to eigenvectors spanning these subspaces are
where M is the dimension of the data space. These data-generating-process (“population”) eigenvalues are estimated in PCA by extracting the sample eigenvalues of a covariance matrix, which has been calculated from a sample of n data vectors of dimension M. In a typical climatological application, M is the number of spatial locations (often grid points), and n is the number of temporal observations.
The goal of Rule N (Preisendorfer et al. 1981, Overland and Preisendorfer 1982) is to estimate the dimension of the signal subspace, κ, by comparing the members of an observed sample eigenvalue sequence , k = 1, 2, 3, . . . , to a high quantile (e.g., the 95th percentile) of their respective sampling distributions. These distributions are estimated from a large number of PCAs for covariance matrices that have been computed from uncorrelated (usually Gaussian) random data vectors having the same sample size and dimension. In order that the test and synthetic sample eigenvalues are scaled comparably, both are normalized by the average of the nonzero sample eigenvalues:
where Nrank = min(n − 1, M) is the number of nonzero sample eigenvalues. The name “Rule N” refers to the normalization of the sample eigenvalue estimates by the average eigenvalue in the denominator of Eq. (2) (Preisendorfer et al. 1981).
Using Rule N involves examining a sequence of hypothesis tests with null hypotheses Hk, specifying the propositions
that is, that following the assumption specified by Eq. (1) the trailing Nrank − k + 1 eigenvalues represent noise. The Rule N test statistic for each Hk is , which is compared to the collection of Monte Carlo realizations for the corresponding normalized eigenvalue. The null hypothesis Hk represents the proposition that κ < k, so that rejecting Hk indicates κ ≥ k. The estimated truncation point is then taken to be the eigenvalue index k that is one less than that of the first sample eigenvalue not regarded as statistically significant:
where αcrit is the chosen test level (often 0.05). For Rule N, pk is estimated as the fraction of the scaled [according to Eq. (2)] synthetic noise eigenvalues with index k that are larger than the current test statistic . Thus, for example, if the first null hypothesis H1 is not rejected because p1 > αcrit, Eq. (4) specifies that = 0, with the corresponding inference being that the underlying dataset is comprised entirely of noise.
Although it is conceptually attractive, the primary problem with Rule N is that, having rejected the first null hypothesis H1 that λ1 corresponds to noise (because is unusually large compared to its Monte Carlo critical value), the remaining tests (for H2, H3 . . .) are incorrect. The reason is that the total remaining variance in the observed sample eigenvalues for k >1 will be too small relative to the corresponding Monte Carlo distributions to which they are being compared. That is, the Rule N hypotheses tests beyond the first null hypothesis H1 do not account for previous significant results, because those results will imply a greater variance fraction represented by the leading eigenvalues than is assumed by the subsequent Monte Carlo distributions. The consequence is that the Rule N test is overly conservative, tending to conclude that < κ (Preisendorfer and Mobley 1988).
An improved hypothesis-testing-based approach to estimating the dimension of the signal subspace in PCA can be constructed by combining several relatively recent statistical results. Johnstone (2001) has shown that the sampling distribution of the leading eigenvalue of a sample covariance matrix derived from uncorrelated random variates, when appropriately scaled, follows a distribution proposed by Tracy and Widom (1996), which will be defined later. For example, over many Monte Carlo simulations, a histogram of appropriately scaled multiple random realizations for [Eq. (2)] from uncorrelated noise data will estimate the Tracy–Widom distribution.
Kritchman and Nadler (2008) show that if κ = 1, the (appropriately scaled) sampling distribution of also follows the Tracy–Widom distribution, if both n and M are large. Following Faber et al. (1994), this sampling distribution for will be the same as the sampling distribution for calculated from covariance matrices formed from vectors of uncorrelated noise data having dimension M2 = M − 1, with sample size n2 = n − 1. Kritchman and Nadler (2008) conjecture that their result generalizes for the sampling distributions of , k = 2, 3, . . . , when κ = k, in which case the relevant Tracy–Widom data dimension and sample size would be Mk = M − k + 1 and nk = n − k + 1.
Because a sequence of tests based on these results will relate only to those eigenvalues regarded as reflecting noise according to the current null hypothesis [Eq. (3)], the appropriate normalization of a raw sample eigenvalue is with respect to the average over these trailing eigenvalues:
Thus , but for k > 1 the normalization in Eq. (5) is unaffected by the magnitudes of the leading sample eigenvalues , m < k, so that for k > 1.
The Tracy–Widom distribution does not have a closed-form analytic representation, and the numerics required for working with it are elaborate. However, Chiani (2014) shows that the Tracy–Widom distribution can be well approximated by a Pearson-III (i.e., shifted gamma) distribution, the probability density function for which is
where Γ(•) denotes the gamma function. Correspondence with the Tracy–Widom distribution is achieved when the shape, scale, and shift parameters of the Pearson-III distribution are, respectively,
Here the Tracy–Widom location and scale parameters are
As before, Mk = M − k + 1 and nk = n − k + 1 for the null hypothesis Hk.
Under a null hypothesis Hk the quantity − ζk is thus a random draw from an ordinary (two parameter) gamma distribution with shape parameter α [Eq. (7a)] and scale parameter βk [Eq. (7b)]. The estimated dimension of the signal subspace is then specified by Eq. (4), where the quantity pk corresponds to probability in this distribution (i.e., the right-tail area) above − ζk.
The Pearson-III approximation to the Tracy–Widom distribution is quite good, especially for the right-tail quantiles that are of primary interest in the present setting (Chiani 2014). The correspondence in the left tail is less good, due in part to the fact that a Tracy–Widom variable can in principle be any real number, whereas Eq. (6) requires > ζk. As a consequence, occasionally a sample value of − ζk may be negative, in which case the interpretation should be that the corresponding pk ≈ 1.
Since the applicability of the Tracy–Widom distribution assumes that both n and M are large, it is of interest to examine its accuracy for the small to moderate sample size and/or data dimensions that are typically encountered. Table 1 shows relative specification error (%) for 95th percentiles computed using the Pearson-III approximation to Tracy–Widom distributions, relative to the corresponding empirical distributions obtained through Monte Carlo simulation using the method described in the appendix. The positive tabulated values indicate that Tracy–Widom distributions overspecify this quantile, which leads to conservatism in the hypothesis tests (i.e., when κ = k − 1, the null hypothesis Hk is rejected less frequently that the nominal αcrit = 0.05). However, this conservatism is slight unless either the sample size or the data dimension is fairly small, and the errors will be acceptable for many climatological applications. Similar results (not shown) are exhibited for other high quantiles. Fortunately, when either the sample size or data dimension is relatively small, the relevant reference distributions can be derived fairly quickly through Monte Carlo simulations, because (as outlined in the appendix) only the first sample eigenvalues need to be computed.
3. Performance in synthetic-data settings
In this section the performance of the Tracy–Widom-based modification to Rule N is investigated in synthetic-data settings, for which the correct κ can be known in advance. The data dimension is taken to be M = 1500, which is typical of the numbers of grid points in climate studies (e.g., Wallace and Gutzler 1981; Barnston 1994; Wilks 2008), and results will be shown for sample sizes n of 20, 50, and 100. Since n << M these simulations are challenging settings for estimating PC truncations. Results will be shown for signal subspace dimensions, κ, of 5, 9, 17, 33, and 65 (provided κ < n); signal-to-noise (S/N) ratios
of 1 (half of total variance in the signal eigenvalues) and 9 (90% of total variance in the signal eigenvalues); and ratio of largest to smallest signal eigenvalues λ1/λκ = 8, with intermediate signal eigenvalues varying linearly between the two extremes (results change little for values of this ratio in the range 2 through 32).
A large number of trials are computed for each combination of parameter values. In each trial, n synthetic M-dimensional data vectors xi have been generated using
where = diag (√λ1, √λ2, . . . √λκ, σnoise, σnoise, . . . σnoise), from which synthetic sample covariance matrices and their n − 1 nonzero eigenvalues were computed. Each of the M columns of the (M × M) matrix represents a generating-process eigenvector (i.e., the basis vectors estimated by the EOFs), the M elements of the diagonal matrix are the square roots of the signal and noise eigenvalues in Eq. (1), and each zi is an M-dimensional realization of independent standard Gaussian random variates. Although in a physically real setting the κ signal eigenvectors would usually exhibit larger spatial scales than the M–κ noise eigenvectors, here all M of these vectors have been generated through a Gram–Schmidt orthonormalization of random directions in M space, which does not affect the results for estimation of the signal dimensions.
Figure 1 shows frequencies of estimated signal dimensions for the various parameter combinations according to Eq. (4), for Rule N (dashed) and the Tracy–Widom modification (solid histograms), over 1000 trials for each parameter combination. The vertical red lines in each panel locate the correct specification, = κ. For κ = 5 (leftmost column) both methods perform well, although for n = 20 the undue conservatism of Rule N begins to be evident. For the four larger values of κ, Rule N chooses values that are much too small, whereas the Tracy–Widom modification yields excellent results except when κ is a large fraction of n and the signal-to-noise ratio is low. In these cases (n = 20 and κ = 17, n = 50 and κ = 33, n = 100 and κ = 65) the performance of Rule N is even worse. Only a small portion of the conservatism of the Tracy–Widom estimates in these low n/κ trials is due to use of the parametric distribution: the average is larger by approximately 0.5 when the Monte Carlo empirical distributions for the k = (κ + 1)th (i.e., first noise) eigenvalue (used as the reference “truth” in Table 1) are used to construct each hypothesis test.
The Tracy–Widom modification of Rule N exhibits good performance even though no adjustments have been made to account for the effect of computing the sequence of multiple hypothesis tests specified by Eq. (4). Figure 2 illustrates that this result is a fortuitous consequence of the nonapplicability of the Tracy–Widom distribution for describing the sampling variation of the k = (κ + 2)th (second noise) eigenvalue. This figure pertains to the parameter combination n = 50, κ = 9, and S/N = 1, which is indicated by the heavy box outline in Fig. 1, but is representative of the other parameter combinations also. Figure 2a shows the relative frequencies of p10 [i.e., the test p value for the first noise eigenvalue in Eq. (4)] over 10 000 trials, when using the empirical distribution of p10 (red dashed histograms) and the Tracy–Widom approximation to it (black solid histograms). The thin horizontal line located at αcrit = 0.05 locates ideal behavior for this histogram, and the close approximation of the dashed histogram to this ideal behavior supports the conjecture of Kritchman and Nadler (2008) noted in section 2. The height of the leftmost bar in Fig. 2a reflects the achieved test size when αcrit = 0.05, and shows that the actual test level is very close to αcrit = 0.05 when the empirical distribution of is used, and that only about 3% of these true null hypotheses are incorrectly rejected when the tests are computed using the Tracy–Widom distribution. This slight test conservatism is consistent with the results shown in Table 1.
Figure 2b shows the corresponding relative frequencies for p11, which pertain to the test statistic , when κ = 9. For both the empirical and Tracy–Widom distributions, the probability that the true null hypothesis H11 is rejected is vanishingly small. Similar plots for p12, p13, p14, etc. (not shown) exhibit even more extreme negative skewness. Accordingly, even though Eq. (4) specifies a sequence of multiple tests, if the first true null hypothesis Hκ+1 is erroneously rejected the probabilities are vanishingly small that Hκ+2, Hκ+3, etc., will also be rejected, obviating the multiple-testing problem in this setting.
4. Real-data example
Overland and Preisendorfer (1982) illustrated use of Rule N with a PCA relating to numbers of cyclones transiting an M = 56-member network of grid cells in the Bering Sea during October–February, during a period of n = 23 winters. Their results pertaining to the covariance matrix of cyclone counts are displayed graphically in Fig. 3a. Here the five leading sample eigenvalues, scaled according to Eq. (2) are plotted, together with the 95th percentile of the Monte Carlo distribution for these statistics. Accordingly points above the curve indicate pk ≤ αcrit = 0.05, so that Eq. (4) yields = 2. Overland and Preisendorfer (1982) concluded that only the two leading eigenvalues are statistically distinguishable from noise.
Figure 3b shows the corresponding result when Eq. (4) is evaluated using Tracy–Widom distributions (dashed curve), and 10 000-member Monte Carlo distributions (solid curve) for [Eq. (5)]. Table 1 indicates that for this small sample size and data dimension, the Tracy–Widom distribution overestimates the 95th percentile for the pure-noise statistics by approximately 4%, and so yields a test that is more conservative than the nominal αcrit = 0.05. In Fig. 3b, Eq. (4) yields = 2 when computing the pk values from Tracy–Widom distributions, but = 4 when using the direct Monte Carlo distributions. Because n and M are relatively small these Monte Carlo distributions can be computed very quickly, as outlined in the appendix.
This paper has proposed a modification of the popular Rule N for PCA truncation, based on the Tracy–Widom distribution for the largest noise eigenvalue of a sample covariance matrix. This new method improves upon Rule N because it accounts at each stage of the sequence of hypothesis tests for the results of previous tests in the sequence.
Both Rule N and the proposed modification assume the particular separation of signal and noise subspaces expressed in Eq. (1). Experiments using synthetic data conforming to this assumption show excellent results unless the dimension of the signal subspace is a large fraction of the sample size and the signal-to-noise ratio is relatively small, and even under these circumstances the proposed modification strongly outperforms the original Rule N. However, both of these methods may perform poorly in settings where the signal and noise subspaces are defined differently.
Even though the proposed procedure is based on a sequence of hypothesis tests, it is not necessary to account for this test multiplicity because the bias of Tracy–Widom distributions for other than the largest noise eigenvalues prevents the procedure from greatly overestimating the dimension of the signal subspace.
The Tracy–Widom distribution provides a good representation for the random variations of the leading noise eigenvalue when sample size and data dimension are both moderate to large. When either or both of these parameters are relatively small the Tracy–Widom distribution yields conservative tests and truncation points; that is, too few of the leading eigenvalues are regarded as representing signal, on average, which will lead to concluding that some signal modes are noise. In this circumstance the appropriate reference distributions can be easily and quickly generated by Monte Carlo methods, as described in the appendix.
The results presented here were based on underlying Gaussian-distributed data, but the sampling distributions of sample covariance-matrix eigenvalues are very similar for uniform, or random sign (±1 with equal probability) data (Faber et al. 1994). However, if sensitivity to non-Gaussian data is suspected, the appropriate reference distributions can again be derived using Monte Carlo methods as described in the appendix.
I thank the anonymous reviewers, whose comments lead to an improved presentation. This research was supported by the National Science Foundation under Grant AGS-1112200.
Monte Carlo Computation of the Reference Distributions
A different sampling distribution must be computed for each null hypothesis Hk [Eq. (3)], k = 1, 2, 3, . . . , pertaining to the largest sample eigenvalue of a pure noise covariance matrix formed from random data of dimension Mk, with sample size nk, where Mk = M − k + 1 and nk = n − k + 1. If either the sample size n or the data dimension M is relatively small, the Tracy–Widom distribution may represent inadequately the sampling distribution of the largest noise eigenvalue, as indicated in Table 1. In this case discrete approximations to the appropriate distributions can be computed relatively cheaply using the power method (e.g., Golub and Van Loan 1983), because only the leading eigenvalue rather than all of the nonzero eigenvalues of each sample covariance matrix needs to be computed. This approach can also be used if the effects of non-Gaussian data are suspected to be significant, but with possibly large computational expense if the smaller of nk and Mk is not relatively small.
Each Monte Carlo iteration (of perhaps 1000) begins with a (nk × Mk) matrix ′ of uncorrelated data anomalies (sample means of each of the Mk columns have been subtracted from the nk values in that column). These underlying data may be random numbers from a Gaussian or other distribution, or bootstrapped samples from an actual dataset. A sample noise covariance matrix is then formed as
Because the nonzero eigenvalues of Eqs. (A1a) and (A1b) are the same (e.g., von Storch and Zwiers 1999, p. 300; Wilks 2011, p. 554), the computations can be greatly reduced using Eq. (A1b) in the common circumstance that n << M. Because of the normalization in Eq. (5), the division by nk − 1 is not actually necessary. Note that Eq. (A1b) does not describe a “T-mode” analysis (Compagnucci and Richman 2008), because the anomalies in ′ are computed by subtracting the M column means of , and not the n row means.
The power method (e.g., Golub and Van Loan 1983) is then used to extract the leading eigenvalue λ1 of for the current iteration. For a given value of k, these leading noise eigenvalues collectively compose the discrete Monte Carlo approximation to the desired distribution. Beginning with an arbitrary initial guess for the leading eigenvector e, with ||e|| = 1, the algorithm proceeds by iterating until convergence:
Here v is an intermediate storage vector, and ||v|| indicates its Euclidean length. After the first Monte Carlo evaluation it is convenient to begin by taking e as the final value from the previous cycle.