1. Introduction
An issue that arises frequently in statistical testing of atmospheric data is the simultaneous evaluation of multiple hypothesis tests. Often these multiple tests pertain to an array of geographical locations, yielding a spatial “field” of tests that is to be evaluated jointly. This is the familiar problem of test multiplicity, or evaluation of “field significance.” The fields of atmospheric data used to compute these tests typically exhibit strong spatial correlations, and this characteristic further complicates the multiple testing problem.
Conventionally each individual, or local, test is conducted at a level αlocal, which is often taken to be 0.05. If an individual local null hypothesis is true, the probability of its being falsely rejected is equal to αlocal. However, joint evaluation of the results of multiple tests is more complicated because, even if each of the K local null hypotheses is true, on average Kαlocal of them will be erroneously rejected. Therefore, finding field significance (rejecting a global null hypothesis that all K local null hypotheses are true) according to this method requires that substantially more than the Kαlocal local rejections expected by chance be observed.
In addition to its sensitivity to correlation of the local test results, this traditional procedure has two other properties that are less than ideal. First, because the number of local test rejections m can take on only integer values, αglobal is only an upper bound on the actual test level that can be achieved when the local tests are independent: such tests will be inaccurate, in general, in the sense that the actual probability of rejecting a true global null hypothesis will be smaller than αglobal. An unfortunate consequence is that the sensitivity of the test to possible violations of the global null hypothesis is reduced. The second shortcoming is that the binary view of the local test results can also reduce the global test sensitivity, because local null hypotheses that are very strongly rejected (local p values that are very much smaller than αlocal) carry no greater weight in the global test than do local tests for which the p values are only slightly smaller than αlocal. That is, no credit is given for rejecting one or more local null hypotheses with near certainty when evaluating the plausibility of the global null hypothesis that all local null hypotheses are true.
These shortcomings of the conventional approach to field significance can in general be improved upon through the use of global test statistics that depend on the magnitudes of the individual p values of the K local tests. Section 2 describes two such global test statistics, assuming initially that the underlying data are spatially uncorrelated. First, Walker’s test is based on the smallest of the K p values, that is, the global test statistic is the p value for the single most significant local result. Section 2 also relates Walker’s test to a relatively recent idea in multiple testing, known as the false discovery rate (FDR, which involves examination of all K of the local p values) and demonstrates that the FDR is directly applicable to the field significance problem. Section 3 compares the properties of these two methods with those of the more conventional counting approach, using both independent and correlated tests, in a synthetic data setting. Section 4 concludes with a summary and recommendation.
2. Walker’s test of minimum p value, and the false discovery rate
a. Walker’s test
If all K of the local null hypotheses are true, then each of the respective test statistics represents a random draw from its null distribution, whatever that distribution may be (i.e., regardless of the specific forms of the local tests). If those local null distributions are continuous, and if the results of the local tests are independent of each other, the resulting K p values will be a random sample from the uniform distribution, f(u) = 1, 0 ≤ u ≤ 1 (e.g., Folland and Anderson 2002; Lindgren 1976). If some of the local null hypotheses are false, their p values will tend to be smaller than would be expected from this uniform distribution. How small must the smallest of the K local p values be in order to reject the global null hypothesis that all of the local null hypotheses are true?
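For illustration, the answer given by Walker's criterion for independent tests can be sketched in a few lines (an illustrative implementation, not code from this paper; the function name is mine). Under the global null hypothesis the smallest of K independent uniform p values falls below a value c with probability 1 − (1 − c)^K, so setting this probability equal to αglobal and solving for c gives the critical value:

```python
def walker_critical(K, alpha_global=0.05):
    """Critical value for the smallest of K independent p values:
    under the global null, Pr[p_(1) <= c] = 1 - (1 - c)**K; setting
    this equal to alpha_global and solving for c gives the value below."""
    return 1.0 - (1.0 - alpha_global) ** (1.0 / K)

c = walker_critical(20)
# c ~= 0.00256, slightly larger than the Bonferroni value 0.05/20 = 0.0025
```

For large K this criterion is closely approximated by the Bonferroni value αglobal/K, consistent with the near equivalence of the two tests noted in section 2b.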
b. Global test based on the FDR
A relatively recent development in the simultaneous evaluation of multiple hypothesis tests is based on identifying locally significant tests by controlling the “false discovery rate,” which is the expected proportion of rejected local null hypotheses that are actually true (Benjamini and Hochberg 1995; Ventura et al. 2004). In the present context, global (field) significance would be declared by this method if at least one local null hypothesis is rejected, and so the FDR approach to local testing has the inherent side effect of also testing for field significance.
If only the smallest p value, p(1), satisfies the criterion in Eq. (7), then this is exactly equivalent to use of the Bonferroni critical value pBonf [Eq. (5)], with q = αglobal, as indicated in Eq. (7b), and is therefore nearly equivalent to Walker’s test also. It can happen that p(1) > αglobal/K but that Eq. (7) is nevertheless satisfied for one or more of the larger p values. In such cases the FDR test will reject the global null hypothesis [including the imputation of a significant result for the local test that produced p(1)], whereas Walker’s test almost always [cf. Eq. (6)] will not. The FDR test accordingly should be somewhat more sensitive to violations of global null hypotheses.
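For concreteness, the FDR criterion of Eq. (7) can be sketched as the Benjamini–Hochberg step-up procedure (an illustrative sketch, with names of my own choosing): sort the K p values, find the largest rank i for which p(i) ≤ (i/K)q, and reject every local null hypothesis whose p value does not exceed that p(i); field significance follows if at least one local test is rejected.

```python
def fdr_test(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: the largest rank i with
    p_(i) <= (i / K) * q sets the rejection threshold, and all p values
    at or below it are rejected.  Returns a global (field significance)
    decision and the list of local decisions."""
    p_sorted = sorted(pvals)
    K = len(p_sorted)
    k_max = 0
    for i, p in enumerate(p_sorted, start=1):
        if p <= (i / K) * q:
            k_max = i          # sliding-scale criterion satisfied at rank i
    if k_max == 0:
        return False, [False] * K
    threshold = p_sorted[k_max - 1]
    rejected = [p <= threshold for p in pvals]
    return True, rejected

field, rejected = fdr_test([0.004, 0.015, 0.2, 0.4, 0.9])
# the second-smallest p value (0.015) exceeds the Bonferroni value
# 0.05/5 = 0.01 but is still rejected, because 0.015 <= (2/5)(0.05)
```

The example illustrates the point made above: a p value larger than αglobal/K can still contribute a local rejection, and hence a global one.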
As indicated in the comparison of Eqs. (7a) and (7b), evaluating global significance at the αglobal level using the FDR criterion carries an additional benefit—namely, improved interpretability of the pattern of rejected local null hypotheses because of the tight constraint q = αglobal on the expected fraction of these that may actually be true. This is an important advantage of the FDR approach relative to the conventional counting procedure, which typically misidentifies a large number of local null hypotheses as being false (Ventura et al. 2004). That is, the counting procedure usually yields a large number of “false discoveries.” Walker’s test rejects local null hypotheses only to the extent that their p values are smaller than the critical value in Eq. (4b), and so it may identify fewer locally significant tests than are warranted by the data.
3. Comparison of the counting test, Walker’s test, and the false discovery rate
a. Synthetic data and test setting
Two key points of comparison among hypothesis tests are accuracy of the achieved test level and test power. The achieved level refers to the proportion of true (global) null hypotheses rejected, when these tests are calculated at the nominal αglobal level. In the ideal case, the achieved level should match the nominal level. When the fraction of rejected tests is larger than αglobal, the test is permissive; when the fraction of rejected tests is smaller than αglobal, the test is conservative.
Test power refers to the sensitivity of a test to violations of the (global) null hypothesis. It is usually evaluated graphically using the “power function,” which plots the probability of rejecting a (global) null hypothesis as a function of the degree of error of that null hypothesis. In the context of this paper, errors in global null hypotheses can have two dimensions: the number of local null hypotheses that are false, and the magnitudes of the differences between those false null hypotheses and the corresponding true alternatives. Power functions ideally rise rapidly from αglobal at the origin, where all local null hypotheses are true, to unity as the global null hypothesis becomes farther from the truth.
Global test performance, in terms of both correctness of level (when all local null hypotheses are true) and test power (sensitivity to violations of the global null hypothesis), will be examined here using K simultaneous local two-sample t tests for differences of mean, each with n1 = n2 = 50. The global tests described in section 2 are applicable regardless of the specific forms of the local tests, but this choice for the local tests allows comparison, for relatively small K, with the well-known Hotelling T² test (e.g., Johnson and Wichern 2002; Wilks 2006) for differences of vector means. The two-sample t setting also allows straightforward construction of a permutation test to evaluate global null hypotheses when the local tests are correlated.
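As a concrete (and purely illustrative) version of one such local test, a two-sided two-sample t test can be computed with only the standard library; the normal approximation to the t distribution used below is my simplification, and is adequate for n1 = n2 = 50 (98 degrees of freedom):

```python
import math, random

def local_p_value(x1, x2):
    """Two-sided p value for a pooled-variance two-sample t test,
    approximating the t distribution (df = 98) by a standard normal."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return math.erfc(abs(t) / math.sqrt(2))                  # two-sided tail

random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(50)]
x2 = [random.gauss(0, 1) for _ in range(50)]
p = local_p_value(x1, x2)      # one local p value under a true null
```

K such p values, one per grid point, would feed any of the global tests of section 2.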
b. Independent local tests
Figure 2 shows power curves for global tests calculated at the nominal αglobal = 0.05 level, based on K = 20 independent [i.e., ρ = 0 in Eq. (8b)] local tests. Each group of curves corresponds to a different number mA of tests for which the local alternative hypotheses are true. The degree to which each of the mA corresponding null hypotheses is in error, in units of standard deviations of the underlying data x, is indicated on the horizontal axis. That is, for each synthetic test realization, the (population) mean of one of the two samples in mA of the K = 20 tests has been offset by Δμ/σ, in units of standard deviations of the underlying data. The result is that, for these mA tests, the corresponding p values tend to be reduced, and they are reduced more as Δμ/σ increases. For each testing procedure, and for each combination of mA and Δμ/σ, the relative frequency of test rejections (i.e., the test power) has been calculated using 10⁵ replications.
Regardless of the value of mA, all local null hypotheses are true when Δμ/σ = 0. Therefore, the vertical intercepts of the power curves indicate the achieved global test levels. With the exception of the traditional counting test, all of these levels are 0.05 (i.e., the test levels are accurate). Accuracy of the FDR test in this sense confirms the equivalence of Eqs. (7a) and (7b). That is, in the field significance context, the FDR q is numerically equal to the global test level αglobal. Figure 2 also shows that the conventional field significance test based on counting local test rejections is conservative. This inaccuracy occurs because of the discreteness of the test statistic m and because of its probability distribution function [Eq. (2)]. In particular, for αglobal = 0.05 and K = 20, it is necessary to reject at least M ≥ 4 local tests to satisfy Eq. (1). Equation (2) yields Pr(M ≥ 4) = 0.016, which is the intercept for the counting tests in Fig. 2, where αlocal = 0.05. Because the power curves for the traditional counting test start at too low a level, this test is less powerful (i.e., is less sensitive to violations of global null hypotheses) than it otherwise would be.
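The M ≥ 4 threshold and the 0.016 intercept can be verified directly from the binomial null distribution of the rejection count (a quick standard-library check; the function name is mine):

```python
from math import comb

def counting_threshold(K, alpha_local=0.05, alpha_global=0.05):
    """Smallest M with Pr(m >= M) <= alpha_global, where the number of
    local rejections m is Binomial(K, alpha_local) when all K independent
    local null hypotheses are true; also returns the achieved global
    test level Pr(m >= M)."""
    def tail(M):
        return sum(comb(K, m) * alpha_local ** m * (1 - alpha_local) ** (K - m)
                   for m in range(M, K + 1))
    M = 0
    while tail(M) > alpha_global:
        M += 1
    return M, tail(M)

M, level = counting_threshold(20)
# M == 4 and level ~= 0.016, reproducing the conservative intercept
```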
Not surprising is that global test power increases as the number of false local null hypotheses mA increases, because the number of opportunities for a global test to discern an overall difference is increased. For mA = 10, all of the tests are virtually certain to reject the global null hypothesis when the magnitudes of the mA alternative hypotheses are larger than about Δμ/σ = 0.5, and for mA = 5 this point occurs near Δμ/σ = 0.7. For mA = 10, the traditional counting test is generally most powerful, and for mA = 5 the FDR test is most powerful. Regardless of the value of mA, the FDR and Walker tests perform very similarly, with (as expected) the FDR test being slightly more powerful because the FDR test sometimes rejects a global null hypothesis even though p(1) > 0.05/K. For mA = 2 and mA = 5, both of these tests also behave very similarly to the multivariate Hotelling T² test for differences of vector means, which is computationally feasible for this relatively small number of local tests.
The most striking differences in Fig. 2 are for the relatively small number of false local null hypotheses, mA = 2. Here the ability of the Walker and FDR tests to use information regarding the confidence with which local null hypotheses are rejected yields good power, even for “needle in haystack” situations such as this. In contrast, the conventional counting test for field significance performs very badly indeed, because it is blind to the magnitudes of local test rejections. Even for Δμ/σ larger than about 0.8, for which both of the mA = 2 false local null hypotheses are rejected with near certainty, the counting test rejects the global null hypothesis in only 22.6% of cases. The reason is that the global null hypothesis is rejected only for M ≥ 4 local test rejections, and two of those rejections must be from the remaining K − mA = 18 local tests for which the local null hypotheses are actually true. Using Eq. (2), the probability of at least 2 of 18 such spurious local rejections is indeed 0.226.
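The 0.226 figure quoted above can be checked directly (a quick standard-library calculation):

```python
from math import comb

# Probability that at least 2 of the K - mA = 18 true local null
# hypotheses are spuriously rejected at alpha_local = 0.05:
prob = sum(comb(18, m) * 0.05 ** m * 0.95 ** (18 - m) for m in range(2, 19))
# prob ~= 0.226, matching the counting test's power ceiling in the text
```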
c. Correlated local tests
Figure 3 shows example power functions for global tests applied to correlated data, with K = 100, and ρ = 0.75 in Eq. (8). Because n1 = n2 = 50 as before, the Hotelling T² test is computationally infeasible for K = 100, although if it could be calculated the Hotelling test would have no difficulty operating on the (spatially) correlated data. The results are qualitatively similar to those in Fig. 2, although not surprising is that the power of the tests to discern differences among correlated data is somewhat lower, because there is effectively less independent information. For reference, the power of the Walker tests based on K = 100 and ρ = 0 (i.e., independent data) is shown with the light gray lines.
Because the local tests are mutually correlated, it may be problematic to use Eqs. (2), (4b), or (7b) (which were derived under the assumption of independence among the local tests) to obtain critical values for rejecting global null hypotheses. The usual approach to obtaining the sampling distribution of the global test statistic in this case is to construct a permutation test. In the setting here, the n1 values for x1 and the n2 values for x2 are pooled into a single population, which subsequently is repeatedly resampled (without replacement) to produce realizations of the test statistic consistent with the global null hypothesis (e.g., Livezey and Chen 1983; Wilks 2006).
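The pooling-and-resampling scheme just described might be sketched as follows (names are mine; for a spatial field, each sample member would be an entire K-vector, so reshuffling whole members preserves the spatial correlation structure):

```python
import random

def permutation_null(stat_fn, x1, x2, n_resamples=1000, seed=0):
    """Approximate the null distribution of a global test statistic:
    pool the two samples and repeatedly reshuffle the pooled members
    into pseudo-samples of the original sizes (resampling without
    replacement), recomputing the statistic each time."""
    rng = random.Random(seed)
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    null_stats = []
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        null_stats.append(stat_fn(pooled[:n1], pooled[n1:]))
    return null_stats

# Example with scalar data: the null distribution of a difference in means.
random.seed(2)
a = [random.gauss(0, 1) for _ in range(12)]
b = [random.gauss(0, 1) for _ in range(12)]
null = permutation_null(lambda u, v: sum(u) / len(u) - sum(v) / len(v), a, b,
                        n_resamples=200)
```

For the Walker test, stat_fn would return the smallest local p value; a global p value is then the fraction of null_stats at least as extreme as the observed statistic.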
The black curves in Fig. 3 show power functions for the counting and Walker tests, in which the critical values have been estimated from 10³ such resamples in each of 10⁴ replications. That is, these dark curves indicate power for the resampling tests, which have explicitly accounted for the spatial correlations in the data. As before, the counting test is generally more powerful when there are many false null hypotheses (e.g., mA = 50 in Fig. 3c) but exhibits extremely poor power when only a small number of the local null hypotheses are false (e.g., mA = 2 in Fig. 3a). There is no resampling result for the FDR test in Fig. 3, because it is not immediately clear how the sliding scale in Eq. (7) might be used in conjunction with a permutation test.
It is a remarkable feature of the Walker and FDR tests that they exhibit very little sensitivity to the assumption of independence among the K local tests. The heavier gray curves in Fig. 3 show results for these two tests operating on correlated data, but using critical values from Eqs. (4b) and (7b), respectively, which have assumed independent local tests. The actual test levels (vertical intercepts in Fig. 3) are only slightly smaller than the nominal αglobal = 0.05. The results are similar overall to the Walker test computed with the permutation procedure, with a slight loss of power for the Walker test relative to its resampling counterpart and, as before, somewhat better power for the FDR test as compared with the Walker test.
Figure 4 compares the actual test levels for the counting test [Eqs. (1) and (2)], Walker’s test [Eq. (4b)], and the FDR test [Eq. (7b)] as functions of the data correlation parameter ρ for K = 100 (Fig. 4a) and K = 1000 (Fig. 4b). That is, these figures show probabilities (estimated using 10⁵ replications each) of rejecting true global null hypotheses as functions of the strength of the data correlation, under the (erroneous, except at ρ = 0) assumption that the underlying K local test results are independent. As before, the counting test is usually conservative for zero or small levels of data correlation but, as expected, becomes very permissive for strongly correlated tests. This latter attribute is the reason why resampling procedures, rather than Eq. (2), are generally used to define critical values for this test, when Eq. (2) indicates the possibility of a significant result.
By contrast, the curves in Fig. 4 for the Walker and FDR tests are nearly flat until large values of ρ, even though Eqs. (4b) and (7b) have assumed independence among the K tests, reflecting their robustness to correlation among the local tests. Katz and Brown (1991) and Ventura et al. (2004) also found the Walker and FDR procedures, respectively, to be robust to correlation among the local tests. Remarkable is that the effect of correlation of the local tests on the Walker and FDR tests is opposite to that for the counting test: these tests become somewhat conservative, and for large values of ρ they typically perform similarly to the counting test operating on independent local tests. This conservative behavior can be understood qualitatively by adopting the view that K correlated local tests behave collectively as if there were some smaller number of “effectively independent” tests. The effect of reducing the number of tests in Eq. (3) is to shift probability mass for p(1) away from the origin: an extremely small value is less likely within a sample of reduced size from the uniform distribution. Therefore, the integral in Eq. (4a) will be smaller than the nominal level αglobal.
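This conservatism can be demonstrated with a small Monte Carlo experiment (an illustrative sketch; the equicorrelated Gaussian field below is my simple stand-in for the paper's spatially correlated data): K correlated test statistics are simulated under the global null hypothesis, converted to two-sided p values, and Walker's test is applied with the independence-based critical value.

```python
import math, random

def walker_level_under_correlation(K=100, rho=0.75, n_reps=2000, seed=0):
    """Estimate the achieved level of Walker's test when the K local test
    statistics are equicorrelated standard Gaussians (pairwise correlation
    rho) but the critical value assumes independence."""
    rng = random.Random(seed)
    crit = 1.0 - 0.95 ** (1.0 / K)               # independence-based critical value
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)  # z_k = a*z0 + b*e_k has corr rho
    rejections = 0
    for _ in range(n_reps):
        z0 = rng.gauss(0.0, 1.0)                 # component shared by all K tests
        p_min = 1.0
        for _ in range(K):
            z = a * z0 + b * rng.gauss(0.0, 1.0)
            p_min = min(p_min, math.erfc(abs(z) / math.sqrt(2.0)))  # two-sided p
        if p_min <= crit:
            rejections += 1
    return rejections / n_reps

rate = walker_level_under_correlation()
# rate falls below the nominal 0.05 for this strongly correlated case
```

Because the strong shared component leaves fewer effectively independent tests, an extreme p(1) becomes less likely and the achieved level drops below αglobal, as argued above.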
In comparing Figs. 4a and 4b, it is evident that the robustness of the Walker and FDR tests increases as the number of local tests increases, so that the performance of these tests even for correlated data is very good for K = 1000 (Fig. 4b), which is representative of the order of magnitude of many atmospheric multiple testing settings. By contrast, Fig. 4b also shows that the performance of the counting procedure under the assumption of independent local tests is extremely poor as the number of local tests increases. For example, for large values of the correlation parameter, counting tests at a nominal level of αglobal = 0.05 actually reject true global null hypotheses more than 25% of the time, and even tests at the nominal αglobal = 0.01 level reject true global null hypotheses approximately 20% of the time.
One practical result of the robustness of the Walker and FDR tests to spatial correlation is that they can be computed under the assumption of independence, with the assurances that the results are approximately correct and that the actual test level is smaller than αglobal. That is, unlike the counting test, observing a significant result from a Walker or FDR global test under the assumption of local test independence implies that an even more significant result would be obtained if local test correlation were to be accounted for. This conservatism of the Walker and FDR tests under the incorrect assumption of local test independence also accounts for their slightly reduced power (heavier gray curves in Fig. 3).
4. Summary and conclusions
The purpose of this paper has been to demonstrate that the false discovery rate approach to multiple hypothesis testing, as described by Benjamini and Hochberg (1995) and Ventura et al. (2004), carries the additional benefit of also providing a powerful field significance test. Furthermore, the control level q on the FDR is numerically equal to the global test level αglobal. This connection has been demonstrated through exposition of FDR-based tests as extensions of Walker’s approach for the joint evaluation of multiple hypothesis tests. Walker’s test, in turn, can be seen as being closely related to the conventional (Livezey and Chen 1983; von Storch 1982) field significance test based on counting significant local results, except that Walker’s test statistic is the smallest of the K local p values, rather than the number of K local tests that are significant at some level αlocal.
Both Walker’s test and the FDR procedure are often more powerful than the traditional counting approach, because they use information on the strengths with which local null hypotheses are rejected. The increase in power is especially marked when only a small fraction of the K local tests correspond to false null hypotheses. The power function results presented here suggest that the traditional counting test would be preferred only when the global alternative hypothesis of interest is that a large fraction of the local null hypotheses are false, but only weakly so. It might be possible to further increase the power of the FDR test if a good estimate for the number of false local null hypotheses could be devised, although the algorithm proposed for this purpose by Ventura et al. (2004) yielded test levels that were very much larger than the nominal αglobal in the experimental setting used here (results not shown).
Another advantage of Walker’s test and of the FDR approach is that both are relatively insensitive to nonindependence of the local test results. The conventional counting test is very sensitive to this common situation and becomes markedly permissive when the local test results are strongly correlated. Therefore, a preliminary global test rejection according to the counting test generally needs to be verified using a resampling procedure (Livezey and Chen 1983). By contrast, the effect of local test correlations on the Walker and FDR tests is to render them modestly conservative, so that a resampling procedure might be necessary only if an initial calculation under the assumption of local test independence were to yield a global p value slightly larger than the nominal αglobal.
The relative insensitivity of the Walker and FDR tests to correlations among the local tests substantially broadens their applicability relative to the counting test. In particular, global tests based on data exhibiting both spatial and temporal correlation can be conducted with relative ease. Conventional resampling tests are seriously permissive when calculated using serially correlated data (e.g., Zwiers 1987), and simultaneously respecting temporal and spatial correlation in such tests requires an elaborate and cumbersome procedure (Wilks 1997). These difficulties are obviated when the global test is insensitive to spatial correlations, so long as local tests that operate correctly for the temporally correlated data are available (e.g., Katz and Brown 1991).
The FDR approach brings the added advantage that only a small fraction of the local tests it identifies as significant represent false rejections of their null hypotheses. In particular, the ceiling on this false discovery rate is numerically equal to the global test level αglobal and so can be controlled in the context of the global test. In contrast, the conventional counting test typically includes many false rejections of local null hypotheses among the nominally significant local tests, and the Walker test tends to identify only the most significant local tests. Because the FDR approach exhibits generally better power, can usually be used with good results even when the local test results are correlated with each other, and allows identification of significant local tests while controlling the proportion of false rejections, it should be used in preference to the counting test in most field significance calculations.
Acknowledgments
I thank Rick Katz for introducing me to the false discovery rate. This work was supported in part by the Northeast Regional Climate Center at Cornell University.
REFERENCES
Benjamini, Y., and Y. Hochberg, 1995: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc., B57, 289–300.
Fisher, R. A., 1929: Tests of significance in harmonic analysis. Proc. Roy. Soc. London, A125, 54–59.
Folland, C., and C. Anderson, 2002: Estimating changing extremes using empirical ranking methods. J. Climate, 15, 2954–2960.
Gumbel, E. J., 1958: Statistics of Extremes. Columbia University Press, 375 pp.
Johnson, R. A., and D. W. Wichern, 2002: Applied Multivariate Statistical Analysis. 5th ed. Prentice Hall, 767 pp.
Katz, R. W., 2002: Sir Gilbert Walker and a connection between El Niño and statistics. Stat. Sci., 17, 97–112.
Katz, R. W., and B. G. Brown, 1991: The problem of multiplicity in research on teleconnections. Int. J. Climatol., 11, 505–513.
Lindgren, B. W., 1976: Statistical Theory. Macmillan, 614 pp.
Livezey, R. E., and W. Y. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. Mon. Wea. Rev., 111, 46–59.
Ventura, V., C. J. Paciorek, and J. S. Risbey, 2004: Controlling the proportion of falsely rejected hypotheses when conducting multiple tests with climatological data. J. Climate, 17, 4343–4356.
von Storch, H., 1982: A remark on Chervin-Schneider’s algorithm to test significance of climate experiments with GCM’s. J. Atmos. Sci., 39, 187–189.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. International Geophysics Series, Vol. 91, Academic Press, 627 pp.
Zwiers, F. W., 1987: Statistical considerations for climate experiments. Part II: Multivariate tests. J. Climate Appl. Meteor., 26, 477–487.