The conventional approach to evaluating the joint statistical significance of multiple hypothesis tests (i.e., “field,” or “global,” significance) in meteorology and climatology is to count the number of individual (or “local”) tests yielding nominally significant results and then to judge the unusualness of this integer value in the context of the distribution of such counts that would occur if all local null hypotheses were true. The sensitivity (i.e., statistical power) of this approach is potentially compromised both by the discrete nature of the test statistic and by the fact that the approach ignores the confidence with which locally significant tests reject their null hypotheses. An alternative global test statistic that has neither of these problems is the minimum p value among all of the local tests. Evaluation of field significance using the minimum local p value as the global test statistic, which is also known as the Walker test, has strong connections to the joint evaluation of multiple tests in a way that controls the “false discovery rate” (FDR, or the expected fraction of local null hypothesis rejections that are incorrect). In particular, using the minimum local p value to evaluate field significance at a level αglobal is nearly equivalent to the slightly more powerful global test based on the FDR criterion. An additional advantage shared by Walker’s test and the FDR approach is that both are robust to spatial dependence within the field of tests. The FDR method not only provides a more broadly applicable and generally more powerful field significance test than the conventional counting procedure but also allows better identification of locations with significant differences, because fewer than αglobal × 100% (on average) of apparently significant local tests will have resulted from local null hypotheses that are true.
An issue that arises frequently in statistical testing of atmospheric data is the simultaneous evaluation of multiple hypothesis tests. Often these multiple tests pertain to an array of geographical locations, yielding a spatial “field” of tests that is to be evaluated jointly. This is the familiar problem of test multiplicity, or evaluation of “field significance.” The fields of atmospheric data used to compute these tests typically exhibit strong spatial correlations, and this characteristic further complicates the multiple testing problem.
Conventionally each individual, or local, test is conducted at a level αlocal, which is often taken to be 0.05. If an individual local null hypothesis is true, the probability of its being falsely rejected is equal to αlocal. However, joint evaluation of the results of multiple tests is more complicated because, even if all K local null hypotheses are true, on average Kαlocal of them will be erroneously rejected. Therefore, finding field significance (rejecting a global null hypothesis that all K local null hypotheses are true) according to this method requires that substantially more than a small number of the local null hypotheses be rejected.
The traditional approach to field significance (Livezey and Chen 1983; von Storch 1982) has been to perform a “metatest” using the statistic m = the number of local tests that are nominally significant at the αlocal level. The global null hypothesis that all of the local null hypotheses are true is then rejected if
$$\Pr(M \ge m) \le \alpha_{\mathrm{global}}, \tag{1}$$
where αglobal is the level of the global, or field, significance. That is, the global null hypothesis is rejected if the probability of having obtained the observed number m of local test rejections, or any other outcome at least as unfavorable to the global null hypothesis, is no larger than the chosen global test level αglobal.
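In the (usually unrealistic) case in which results from the K local tests are mutually independent, the probability (according to the global null hypothesis) on the left-hand side of Eq. (1) can be easily evaluated using the binomial distribution,
$$\Pr(M \ge m) = \sum_{j=m}^{K} \binom{K}{j}\, \alpha_{\mathrm{local}}^{\,j}\, (1 - \alpha_{\mathrm{local}})^{K-j}. \tag{2}$$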
If, as is usually the case, the results of the local tests are positively correlated, Eq. (2) will underestimate this probability, with the result that the global test will be too permissive (i.e., will reject true global null hypotheses too frequently). Livezey and Chen (1983) interpret this result in terms of an “equivalent number of independent tests” that is smaller than K, so that a larger fraction of the local tests must yield significant results in order to reject the global null hypothesis. In accord with this, failure to satisfy Eq. (1) using Eq. (2) is sufficient reason not to reject the global null hypothesis (Livezey and Chen 1983), because correctly accounting for correlations among the local test results in evaluating Eq. (1) would yield an even less significant result. However, in cases in which an accurate estimate of that probability is required [i.e., cases for which Eqs. (1) and (2) result in a global test rejection], computationally intensive resampling tests are generally needed (Livezey and Chen 1983; Wilks 2006).
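As an illustration of Eqs. (1) and (2), the binomial tail probability for independent local tests can be evaluated directly. The following Python sketch is not part of the original analysis: the function names `binomial_tail` and `counting_test` are hypothetical, and mutual independence of the K local tests is assumed throughout.

```python
from math import comb


def binomial_tail(m, K, alpha_local):
    """Pr(M >= m) for M ~ Binomial(K, alpha_local), i.e., independent local tests."""
    return sum(
        comb(K, j) * alpha_local**j * (1.0 - alpha_local) ** (K - j)
        for j in range(m, K + 1)
    )


def counting_test(m, K, alpha_local=0.05, alpha_global=0.05):
    """Counting test under independence: return (tail probability, reject?)."""
    tail = binomial_tail(m, K, alpha_local)
    return tail, tail <= alpha_global


if __name__ == "__main__":
    # Hypothetical example: m = 4 nominally significant results among K = 20 tests.
    tail, reject = counting_test(4, 20)
    print(f"Pr(M >= 4) = {tail:.4f}, reject global null: {reject}")
```

When the local tests are positively correlated, this tail probability is too small, so a rejection obtained this way would still need to be checked by resampling.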
In addition to its sensitivity to correlation of the local test results, this traditional procedure has two other properties that are less than ideal. First, because the number of local test rejections m can take on only integer values, αglobal is only an upper bound on the actual test level that can be achieved when the local tests are independent: such tests will be inaccurate, in general, in the sense that the actual probability of rejecting a true global null hypothesis will be smaller than αglobal. An unfortunate consequence is that the sensitivity of the test to possible violations of the global null hypothesis is reduced. The second shortcoming is that the binary view of the local test results can also reduce the global test sensitivity, because local null hypotheses that are very strongly rejected (local p values that are very much smaller than αlocal) carry no greater weight in the global test than do local tests for which the p values are only slightly smaller than αlocal. That is, no credit is given for rejecting one or more local null hypotheses with near certainty when evaluating the plausibility of the global null hypothesis that all local null hypotheses are true.
These shortcomings of the conventional approach to field significance can in general be improved upon through the use of global test statistics that depend on the magnitudes of the individual p values of the K local tests. Section 2 describes two such global test statistics, assuming initially that the underlying data are spatially uncorrelated. First, Walker’s test is based on the smallest of the K p values, that is, the global test statistic is the p value for the single most significant local result. Section 2 also relates Walker’s test to a relatively recent idea in multiple testing, known as the false discovery rate (FDR, which involves examination of all K of the local p values) and demonstrates that the FDR is directly applicable to the field significance problem. Section 3 compares the properties of these two methods with those of the more conventional counting approach, using both independent and correlated tests, in a synthetic data setting. Section 4 concludes with a summary and recommendation.
2. Walker’s test of minimum p value, and the false discovery rate
a. Walker’s test
If all K of the local null hypotheses are true, then each of the respective test statistics represents a random draw from its null distribution, whatever that distribution may be (i.e., regardless of the specific forms of the local tests). If those local null distributions are continuous, and if the results of the local tests are independent of each other, the resulting K p values will be a random sample from the uniform distribution, f (u) = 1, 0 ≤ u ≤ 1 (e.g., Folland and Anderson 2002; Lindgren 1976). If some of the local null hypotheses are false, their p values will tend to be smaller than would be expected from this uniform distribution. How small must the smallest of the K local p values be in order to reject the global null hypothesis that all of the local null hypotheses are true?
Let p(1) be the smallest of the K local p values. It is known (Gumbel 1958) that the smallest of a sample of K uniform variates (e.g., p values from K independent hypothesis tests for which all null hypotheses are true) follows a beta distribution (e.g., Wilks 2006) whose first parameter is 1 and whose second parameter is K. Folland and Anderson (2002) show examples of these beta distributions, using a somewhat different notation than that adopted here. Thus, the sampling distribution for the smallest p value from K independent tests, all of whose null hypotheses are true, has probability density function
$$f[p_{(1)}] = K\,[1 - p_{(1)}]^{K-1}, \quad 0 \le p_{(1)} \le 1. \tag{3}$$
Although it is possible for the smallest of K independent p values to be relatively large, Eq. (3) indicates that p(1) will be close to zero with high probability.
To reject the global null hypothesis that all K local null hypotheses are true (i.e., to declare field significance), p(1) must be no larger than some critical value pWalker, corresponding to the global test level αglobal. That is, if the smallest p value is small enough, it can be concluded with high confidence that the collection of K local p values did not result from independent draws from a uniform distribution. The critical value for this global test can be obtained by integrating the left tail of the probability density function in Eq. (3), that is,
$$\alpha_{\mathrm{global}} = \int_{0}^{p_{\mathrm{Walker}}} K\,(1 - p)^{K-1}\, dp = 1 - (1 - p_{\mathrm{Walker}})^{K}, \tag{4a}$$
which yields
$$p_{\mathrm{Walker}} = 1 - (1 - \alpha_{\mathrm{global}})^{1/K}. \tag{4b}$$
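Equation (4b) is the basis of what is known as the Walker test (Fisher 1929; Katz 2002; Katz and Brown 1991). It indicates that a global null hypothesis may be rejected at the αglobal level if the smallest of K independent local p values is less than or equal to pWalker. The closely related Bonferroni critical value for the same global test is
$$p_{\mathrm{Bonf}} = \alpha_{\mathrm{global}} / K. \tag{5}$$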
Figure 1 shows percentage differences between pWalker and pBonf, as a function of K, for two global test levels. The two are identical for K = 1, and even for large K the relative differences are very small. Global tests based on comparison of p(1) with pWalker [Eq. (4b)] or pBonf [Eq. (5)] will perform extremely similarly, because different results will be obtained only when pBonf < p(1) ≤ pWalker. The probability of such a result for independent local tests, when the global null hypothesis is true, can be evaluated using Eq. (4a):
$$\Pr[p_{\mathrm{Bonf}} < p_{(1)} \le p_{\mathrm{Walker}}] = \int_{p_{\mathrm{Bonf}}}^{p_{\mathrm{Walker}}} K\,(1 - p)^{K-1}\, dp = (1 - \alpha_{\mathrm{global}}/K)^{K} - (1 - \alpha_{\mathrm{global}}), \tag{6}$$
which depends somewhat on the number of local tests K, but is approximately 0.001 for αglobal = 0.05.
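The closeness of the Walker and Bonferroni critical values can be checked numerically. The sketch below assumes independent local tests and uses pWalker = 1 − (1 − αglobal)^(1/K) and pBonf = αglobal/K; the function names are hypothetical and the values of K are chosen only for illustration.

```python
def p_walker(alpha_global, K):
    """Walker critical value for the smallest of K independent p values."""
    return 1.0 - (1.0 - alpha_global) ** (1.0 / K)


def p_bonferroni(alpha_global, K):
    """Bonferroni critical value for the smallest of K p values."""
    return alpha_global / K


def prob_disagree(alpha_global, K):
    """Pr(pBonf < p(1) <= pWalker) when all K independent local nulls are true.

    Uses Pr(p(1) <= x) = 1 - (1 - x)**K for the minimum of K uniform variates.
    """
    return (1.0 - alpha_global / K) ** K - (1.0 - alpha_global)


if __name__ == "__main__":
    for K in (1, 10, 100, 1000):
        pw, pb = p_walker(0.05, K), p_bonferroni(0.05, K)
        print(K, pw, pb, 100.0 * (pw - pb) / pb, prob_disagree(0.05, K))
```

For large K the disagreement probability approaches exp(−αglobal) − (1 − αglobal), which is about 0.0012 for αglobal = 0.05.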
b. Global test based on the FDR
A relatively recent development in the simultaneous evaluation of multiple hypothesis tests is based on identifying locally significant tests by controlling the “false discovery rate,” which is the expected proportion of rejected local null hypotheses that are actually true (Benjamini and Hochberg 1995; Ventura et al. 2004). In the present context, global (field) significance would be declared by this method if at least one local null hypothesis is rejected, and so the FDR approach to local testing has the inherent side effect of also testing for field significance.
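Extending the notation introduced above for the smallest of K local p values, let p(i) denote the ith smallest of these p values. Assuming that the local tests are independent, the FDR can be controlled at a level q (typically, q = 0.05) by rejecting those local tests for which p(i) is no greater than
$$p_{\mathrm{FDR}} = \max_{j=1,\ldots,K} \left[\, p_{(j)} : p_{(j)} \le q\,(j/K) \,\right] \tag{7a}$$
$$\phantom{p_{\mathrm{FDR}}} = \max_{j=1,\ldots,K} \left[\, p_{(j)} : p_{(j)} \le \alpha_{\mathrm{global}}\,(j/K) \,\right]. \tag{7b}$$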
That is, the FDR level q is numerically equal to the global test level αglobal. Equation (7) establishes a sliding scale for smallness of the local p values that depends on their placement among the full sorted collection p(j) j = 1, . . . , K. All local tests yielding p values smaller than or equal to the largest p value satisfying the condition on the right-hand side of Eq. (7) are deemed to be significant. The expected fraction of the corresponding local null hypotheses that are actually true but are wrongly rejected is less than or equal to the FDR, q. If none of the local p values satisfy the bracketed conditions in Eq. (7), then none of them are deemed to be significant, and so also the global null hypothesis is not rejected.
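The sliding-scale rule described above can be sketched in a few lines of Python. This is a minimal illustration of the Benjamini and Hochberg (1995) step-up rule, assuming the criterion p(j) ≤ q(j/K); the function names and the example p values are hypothetical.

```python
def fdr_threshold(p_values, q=0.05):
    """Largest sorted p value p(j) satisfying p(j) <= q * j / K, or None if none does."""
    K = len(p_values)
    threshold = None
    for j, p in enumerate(sorted(p_values), start=1):
        if p <= q * j / K:
            threshold = p  # keep the largest p(j) that meets its own criterion
    return threshold


def fdr_test(p_values, q=0.05):
    """Return (field significance declared?, list of local rejection flags)."""
    threshold = fdr_threshold(p_values, q)
    if threshold is None:
        return False, [False] * len(p_values)
    # Every local test with p value at or below the threshold is deemed significant.
    return True, [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Illustrative p values: p(1) = 0.015 exceeds q/K = 0.01,
    # yet p(2) = 0.018 <= 2q/K = 0.02, so both tests are rejected.
    print(fdr_test([0.5, 0.015, 0.7, 0.018, 0.6]))
```

In the example, a test based only on the smallest p value and the Bonferroni critical value would not reject, whereas the FDR rule does.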
If only the smallest p value, p(1), satisfies the criterion in Eq. (7), then this is exactly equivalent to use of the Bonferroni critical value pBonf [Eq. (5)], with q = αglobal, as indicated in Eq. (7b), and is therefore nearly equivalent to Walker’s test also. It can happen that p(1) > αglobal/K but that Eq. (7) is nevertheless satisfied for one or more of the larger p values. In such cases the FDR test will reject the global null hypothesis [including the imputation of a significant result for the local test that produced p(1)], whereas Walker’s test almost always [cf. Eq. (6)] will not. The FDR test accordingly should be somewhat more sensitive to violations of global null hypotheses.
As indicated in the comparison of Eqs. (7a) and (7b), evaluating global significance at the αglobal level using the FDR criterion carries an additional benefit—namely, improved interpretability of the pattern of rejected local null hypotheses because of the tight constraint q = αglobal on the expected fraction of these that may actually be true. This is an important advantage of the FDR approach relative to the conventional counting procedure, which typically misidentifies a large number of local null hypotheses as being false (Ventura et al. 2004). That is, the counting procedure usually yields a large number of “false discoveries.” Walker’s test rejects local null hypotheses only to the extent that their p values are smaller than Eq. (4b), and so it may identify fewer locally significant tests than are warranted by the data.
3. Comparison of the counting test, Walker’s test, and the false discovery rate
a. Synthetic data and test setting
Two key points of comparison among hypothesis tests are accuracy of the achieved test level and test power. The achieved level refers to the proportion of true (global) null hypotheses rejected, when these tests are calculated at the nominal αglobal level. In the ideal case, the achieved level should match the nominal level. When the fraction of rejected tests is larger than αglobal, the test is permissive; when the fraction of rejected tests is smaller than αglobal, the test is conservative.
Test power refers to the sensitivity of a test to violations of the (global) null hypothesis. It is usually evaluated graphically using the “power function,” which plots the probability of rejecting a (global) null hypothesis as a function of the degree of error of that null hypothesis. In the context of this paper, errors in global null hypotheses can have two dimensions: the numbers of local null hypotheses that are false, and the magnitudes of those differences with the corresponding true alternatives. Power functions ideally rise rapidly from αglobal at the origin, where all local null hypotheses are true, to unity as the global null hypothesis becomes farther from the truth.
Global test performance, in terms of both correctness of level (when all local null hypotheses are true) and test power (sensitivity to violations of the global null hypothesis), will be examined here using K simultaneous local two-sample t tests for differences of mean, each with n1 = n2 = 50. The global tests described in section 2 are applicable regardless of the specific forms of the local tests, but this choice for the local tests allows comparison, for relatively small K, with the well-known Hotelling T² test (e.g., Johnson and Wichern 2002; Wilks 2006) for differences of vector means. The two-sample t setting also allows straightforward construction of a permutation test to evaluate global null hypotheses when the local tests are correlated.
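To allow nonindependence of the local tests, the (K × 1) data vectors x1 and x2 underlying the tests are generated (Wilks 2006) to have zero mean [E(x) = 0] and covariance matrix
$$[\Sigma] = \begin{bmatrix} 1 & \rho & \rho^{2} & \cdots & \rho^{K-1} \\ \rho & 1 & \rho & \cdots & \rho^{K-2} \\ \rho^{2} & \rho & 1 & \cdots & \rho^{K-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{K-1} & \rho^{K-2} & \rho^{K-3} & \cdots & 1 \end{bmatrix}, \quad \text{i.e.,} \quad \Sigma_{i,j} = \rho^{|i-j|}. \tag{8}$$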
Each of the K elements of the underlying data vectors is drawn from a population with unit variance and has correlation ρ with the adjacent vector elements, correlation ρ² with elements two positions away, and so on. That is, the parameter ρ controls the degree of simulated “spatial” correlation within the data vectors x.
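Each local test is computed under the assumption of equal variances for the two samples, leading to the local test statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\left\{ \left[ \dfrac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2} \right] \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right) \right\}^{1/2}}, \tag{9}$$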
where the two sample variances s² are computed using the data making up the respective sample means in the numerator. Equation (9) follows the t distribution with n1 + n2 − 2 = 98 degrees of freedom when the respective local null hypothesis is true.
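A synthetic setting of this kind can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the correlated vectors are generated with a first-order autoregressive recursion (which gives unit variances and correlation ρ^|i−j| between elements i and j), the mean offset is applied to the first `m_A` elements of the first sample, and all function and parameter names are hypothetical. Only the pooled-variance two-sample t statistics are computed here.

```python
import random
from math import sqrt
from statistics import mean, variance


def correlated_vector(K, rho, rng):
    """One K-vector with unit variances and corr(x_i, x_j) = rho**|i - j|."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(K - 1):
        # AR(1) recursion; the sqrt factor keeps every element at unit variance.
        x.append(rho * x[-1] + sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0))
    return x


def pooled_t(a, b):
    """Two-sample t statistic under the equal-variance assumption."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1.0 / n1 + 1.0 / n2))


def local_t_statistics(K, rho, n1=50, n2=50, shift=0.0, m_A=0, seed=0):
    """K local t statistics; the first m_A elements of sample 1 are offset by `shift`."""
    rng = random.Random(seed)
    s1 = [correlated_vector(K, rho, rng) for _ in range(n1)]
    s2 = [correlated_vector(K, rho, rng) for _ in range(n2)]
    stats = []
    for k in range(K):
        a = [row[k] + (shift if k < m_A else 0.0) for row in s1]
        b = [row[k] for row in s2]
        stats.append(pooled_t(a, b))
    return stats


if __name__ == "__main__":
    print(local_t_statistics(K=20, rho=0.75, shift=0.5, m_A=2)[:4])
```

Local p values would then follow from the t distribution with n1 + n2 − 2 degrees of freedom, which requires a t-distribution routine not included in this standard-library sketch.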
b. Independent local tests
Figure 2 shows power curves for global tests calculated at the nominal αglobal = 0.05 level, based on K = 20 independent [i.e., ρ = 0 in Eq. (8)] local tests. Each group of curves corresponds to a different number mA of tests for which the local alternative hypotheses are true. The degree to which each of the mA corresponding null hypotheses is in error, in units of standard deviations of the underlying data x, is indicated on the horizontal axis. That is, for each synthetic test realization, the (population) mean of one of the two samples in mA of the K = 20 tests has been offset by Δμ/σ, in units of standard deviations of the underlying data. The result is that, for these mA tests, the corresponding p values tend to be reduced, and they are reduced more as Δμ/σ increases. For each testing procedure, and for each combination of mA and Δμ/σ, the relative frequency of test rejections (i.e., the test power) has been calculated using 10⁵ replications.
Regardless of the value of mA, all local null hypotheses are true when Δμ/σ = 0. Therefore, the vertical intercepts of the power curves indicate the achieved global test levels. With the exception of the traditional counting test, all of these levels are 0.05 (i.e., the test levels are accurate). Accuracy of the FDR test in this sense confirms the equivalence of Eqs. (7a) and (7b). That is, in the field significance context, the FDR q is numerically equal to the global test level αglobal. Figure 2 also shows that the conventional field significance test based on counting local test rejections is conservative. This inaccuracy occurs because of the discreteness of the test statistic m and because of its probability distribution function [Eq. (2)]. In particular, for αglobal = 0.05 and K = 20, it is necessary to reject at least M ≥ 4 local tests to satisfy Eq. (1). Equation (2) yields Pr(M ≥ 4) = 0.016, which is the intercept for the counting tests in Fig. 2, where αlocal = 0.05. Because the power curves for the traditional counting test start at too low a level, this test is less powerful (i.e., is less sensitive to violations of global null hypotheses) than it otherwise would be.
Not surprising is that global test power increases as the number of false local null hypotheses mA increases, because the number of opportunities for a global test to discern an overall difference is increased. For mA = 10, all of the tests are virtually certain to reject the global null hypothesis when the magnitudes of the mA alternative hypotheses are larger than about Δμ/σ = 0.5, and for mA = 5 this point occurs near Δμ/σ = 0.7. For mA = 10, the traditional counting test is generally most powerful, and for mA = 5 the FDR test is most powerful. Regardless of the value of mA, the FDR and Walker tests perform very similarly, with (as expected) the FDR test being slightly more powerful because the FDR test sometimes rejects a global null hypothesis even though p(1) > 0.05/K. For mA = 2 and mA = 5, both of these tests also behave very similarly to the multivariate Hotelling T² test for differences of vector means, which is computationally feasible for this relatively small number of local tests.
The most striking differences in Fig. 2 are for the relatively small number of false local null hypotheses, mA = 2. Here the ability of the Walker and FDR tests to use information regarding the confidence with which local null hypotheses are rejected yields good power, even for “needle in haystack” situations such as this. In contrast, the conventional counting test for field significance performs very badly indeed, because it is blind to the magnitudes of local test rejections. Even for Δμ/σ larger than about 0.8, for which both of the mA = 2 false local null hypotheses are rejected with near certainty, the counting test rejects the global null hypothesis in only 22.6% of cases. The reason is that the global null hypothesis is rejected only for M ≥ 4 local test rejections, and two of those rejections must be from the remaining K − mA = 18 local tests for which the local null hypotheses are actually true. Using Eq. (2), the probability of at least 2 of 18 such spurious local rejections is indeed 0.226.
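The two binomial figures quoted for the K = 20 counting test can be reproduced with a short calculation. The sketch below assumes independent local tests with αlocal = 0.05; the function names are hypothetical.

```python
from math import comb


def binomial_tail(m, K, alpha_local=0.05):
    """Pr(M >= m) for M ~ Binomial(K, alpha_local)."""
    return sum(
        comb(K, j) * alpha_local**j * (1.0 - alpha_local) ** (K - j)
        for j in range(m, K + 1)
    )


def critical_count(K, alpha_local=0.05, alpha_global=0.05):
    """Smallest count m for which Pr(M >= m) <= alpha_global."""
    for m in range(K + 1):
        if binomial_tail(m, K, alpha_local) <= alpha_global:
            return m
    return None  # no count of rejections can reach the requested global level


if __name__ == "__main__":
    m_crit = critical_count(20)
    # Achieved level of the counting test for K = 20 (approximately 0.016).
    print(m_crit, round(binomial_tail(m_crit, 20), 3))
    # With 2 certain rejections, 2 more are needed among the 18 true nulls
    # (approximately 0.226).
    print(round(binomial_tail(m_crit - 2, 18), 3))
```

The second value is the ceiling on counting-test power for mA = 2 when both false local null hypotheses are rejected with certainty.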
c. Correlated local tests
Figure 3 shows example power functions for global tests applied to correlated data, with K = 100, and ρ = 0.75 in Eq. (8). Because n1 = n2 = 50 as before, the Hotelling T² test is computationally infeasible for K = 100, although if it could be calculated the Hotelling test would have no difficulty operating on the (spatially) correlated data. The results are qualitatively similar to those in Fig. 2, although not surprising is that the power of the tests to discern differences among correlated data is somewhat lower, because there is effectively less independent information. For reference, the power of the Walker tests based on K = 100 and ρ = 0 (i.e., independent data) is shown with the light gray lines.
Because the local tests are mutually correlated, it may be problematic to use Eqs. (2), (4b), or (7b) (which were derived under the assumption of independence among the local tests) to obtain critical values for rejecting global null hypotheses. The usual approach to obtaining the sampling distribution of the global test statistic in this case is to construct a permutation test. In the setting here, the n1 values for x1 and the n2 values for x2 are pooled into a single population, which subsequently is repeatedly resampled (without replacement) to produce realizations of the test statistic consistent with the global null hypothesis (e.g., Livezey and Chen 1983; Wilks 2006).
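A permutation test of this kind might be sketched as follows for the minimum-p-value statistic. Two simplifications are assumptions of this sketch rather than details from the paper: because every local test is a two-sided t test with the same degrees of freedom, the smallest local p value corresponds to the largest |t|, so max |t| is used as an equivalent global statistic; and whole K-element data vectors are reshuffled between the two samples so that the spatial correlation is preserved in each resample. Function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def max_abs_t(s1, s2):
    """Largest |t| over the K local pooled-variance two-sample t tests."""
    n1, n2 = len(s1), len(s2)
    best = 0.0
    for k in range(len(s1[0])):
        a = [row[k] for row in s1]
        b = [row[k] for row in s2]
        sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
        se = sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
        if se > 0.0:  # skip degenerate locations with no variability
            best = max(best, abs(mean(a) - mean(b)) / se)
    return best


def permutation_walker_pvalue(s1, s2, n_resamples=1000, seed=0):
    """Global p value: share of resamples whose max |t| is at least the observed one."""
    rng = random.Random(seed)
    observed = max_abs_t(s1, s2)
    pooled = list(s1) + list(s2)  # rows are whole K-vectors, kept intact
    n1 = len(s1)
    exceed = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # resampling without replacement
        if max_abs_t(pooled[:n1], pooled[n1:]) >= observed:
            exceed += 1
    return exceed / n_resamples
```

Here `s1` and `s2` are lists of n1 and n2 data vectors (each a list of K values); the global null hypothesis would be rejected when the returned value is no larger than αglobal.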
The black curves in Fig. 3 show power functions for the counting and Walker tests, in which the critical values have been estimated from 10³ such resamples in each of 10⁴ replications. That is, these dark curves indicate power for the resampling tests, which have explicitly accounted for the spatial correlations in the data. As before, the counting test is generally more powerful when there are many false null hypotheses (e.g., mA = 50 in Fig. 3c) but exhibits extremely poor power when only a small number of the local null hypotheses are false (e.g., mA = 2 in Fig. 3a). There is no resampling result for the FDR test in Fig. 3, because it is not immediately clear how the sliding scale in Eq. (7) might be used in conjunction with a permutation test.
It is a remarkable feature of the Walker and FDR tests that they exhibit very little sensitivity to the assumption of independence among the K local tests. The heavier gray curves in Fig. 3 show results for these two tests operating on correlated data, but using critical values from Eqs. (4b) and (7b), respectively, which have assumed independent local tests. The actual test levels (vertical intercepts in Fig. 3) are only slightly smaller than the nominal αglobal = 0.05. The results are similar overall to the Walker test computed with the permutation procedure, with a slight loss of power for the Walker test relative to its resampling counterpart and, as before, somewhat better power for the FDR test as compared with the Walker test.
Figure 4 compares the actual test levels for the counting test [Eqs. (1) and (2)], Walker’s test [Eq. (4b)], and the FDR test [Eq. (7b)] as functions of the data correlation parameter ρ for K = 100 (Fig. 4a) and K = 1000 (Fig. 4b). That is, these figures show probabilities (estimated using 10⁵ replications each) of rejecting true global null hypotheses as functions of the strength of the data correlation, under the (erroneous, except at ρ = 0) assumption that the underlying K local test results are independent. As before, the counting test is usually conservative for zero or small levels of data correlation but, as expected, becomes very permissive for strongly correlated tests. This latter attribute is the reason why resampling procedures, rather than Eq. (2), are generally used to define critical values for this test, when Eq. (2) indicates the possibility of a significant result.
By contrast, the curves in Fig. 4 for the Walker and FDR tests are nearly flat until large values of ρ, even though Eqs. (4b) and (7b) have assumed independence among the K tests, reflecting their robustness to correlation among the local tests. Katz and Brown (1991) and Ventura et al. (2004) also found the Walker and FDR procedures, respectively, to be robust to correlation among the local tests. Remarkable is that the effect of correlation of the local tests on the Walker and FDR tests is opposite to that for the counting test: these tests become somewhat conservative, and for large values of ρ they typically perform similarly to the counting test operating on independent local tests. This conservative behavior can be understood qualitatively by adopting the view that K correlated local tests behave collectively as if there were some smaller number of “effectively independent” tests. The effect of reducing the number of tests in Eq. (3) is to shift probability mass for p(1) away from the origin: an extremely small value is less likely within a sample of reduced size from the uniform distribution. Therefore, the integral in Eq. (4a) will be smaller than the nominal level αglobal.
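The qualitative argument above can be made concrete with a short calculation. The sketch below treats the K correlated tests as if they were some smaller number of independent tests; this "effective number" `K_eff` is a heuristic device with a hypothetical name, not a quantity estimated in the paper. If pWalker is computed for K tests but only K_eff behave independently, the probability that the smallest p value falls at or below pWalker is 1 − (1 − pWalker)^K_eff = 1 − (1 − αglobal)^(K_eff/K).

```python
def walker_level_with_effective_tests(alpha_global, K, K_eff):
    """Achieved level when pWalker assumes K tests but only K_eff are independent."""
    p_walker = 1.0 - (1.0 - alpha_global) ** (1.0 / K)
    # Pr(min of K_eff independent uniform p values <= p_walker)
    return 1.0 - (1.0 - p_walker) ** K_eff


if __name__ == "__main__":
    for K_eff in (100, 75, 50, 25):
        print(K_eff, round(walker_level_with_effective_tests(0.05, 100, K_eff), 4))
```

For any K_eff smaller than K the result is below αglobal, consistent with the conservative behavior described for the Walker test under correlation.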
In comparing Figs. 4a and 4b, it is evident that the robustness of the Walker and FDR tests increases as the number of local tests increases, so that the performance of these tests even for correlated data is very good for K = 1000 (Fig. 4b), which is representative of the order of magnitude of many atmospheric multiple testing settings. By contrast, Fig. 4b also shows that the performance of the counting procedure under the assumption of independent local tests is extremely poor as the number of local tests increases. For example, for large values of the correlation parameter, counting tests at a nominal level of αglobal = 0.05 actually reject true global null hypotheses more than 25% of the time, and even tests at the nominal αglobal = 0.01 level reject true global null hypotheses approximately 20% of the time.
One practical result of the robustness of the Walker and FDR tests to spatial correlation is that they can be computed under the assumption of independence, with the assurances that the results are approximately correct and that the actual test level is smaller than αglobal. That is, unlike the counting test, observing a significant result from a Walker or FDR global test under the assumption of local test independence implies that an even more significant result would be obtained if local test correlation were to be accounted for. This conservatism of the Walker and FDR tests under the incorrect assumption of local test independence also accounts for their slightly reduced power (heavier gray curves in Fig. 3).
4. Summary and conclusions
The purpose of this paper has been to demonstrate that the false discovery rate approach to multiple hypothesis testing, as described by Benjamini and Hochberg (1995) and Ventura et al. (2004), carries the additional benefit of also providing a powerful field significance test. Furthermore, the control level q on the FDR is numerically equal to the global test level αglobal. This connection has been demonstrated through exposition of FDR-based tests as extensions of Walker’s approach for the joint evaluation of multiple hypothesis tests. Walker’s test, in turn, can be seen as being closely related to the conventional (Livezey and Chen 1983; von Storch 1982) field significance test based on counting significant local results, except that Walker’s test statistic is the smallest of the K local p values, rather than the number of K local tests that are significant at some level αlocal.
Both Walker’s test and the FDR procedure are often more powerful than the traditional counting approach, because they use information on the strengths with which local null hypotheses are rejected. The increase in power is especially marked when only a small fraction of the K local tests correspond to false null hypotheses. The power function results presented here suggest that the traditional counting test would be preferred only when the global alternative hypothesis of interest is that a large fraction of the local null hypotheses are false, but only weakly so. It might be possible to further increase the power of the FDR test if a good estimate for the number of false local null hypotheses could be devised, although the algorithm proposed for this purpose by Ventura et al. (2004) yielded test levels that were very much larger than the nominal αglobal in the experimental setting used here (results not shown).
Another advantage of Walker’s test and of the FDR approach is that both are relatively insensitive to nonindependence of the local test results. The conventional counting test is very sensitive to this common situation and becomes markedly permissive when the local test results are strongly correlated. Therefore, a preliminary global test rejection according to the counting test generally needs to be verified using a resampling procedure (Livezey and Chen 1983). By contrast, the effect of local test correlations on the Walker and FDR tests is to render them modestly conservative, so that a resampling procedure might be necessary only if an initial calculation under the assumption of local test independence were to yield a global p value slightly larger than the nominal αglobal.
The relative insensitivity of the Walker and FDR tests to correlations among the local tests substantially broadens their applicability relative to the counting test. In particular, global tests based on data exhibiting both spatial and temporal correlation can be conducted with relative ease. Conventional resampling tests are seriously permissive when calculated using serially correlated data (e.g., Zwiers 1987), and simultaneously respecting temporal and spatial correlation in such tests requires an elaborate and cumbersome procedure (Wilks 1997). These difficulties are obviated when the global test is insensitive to spatial correlations, so long as local tests that operate correctly for the temporally correlated data are available (e.g., Katz and Brown 1991).
The FDR approach brings the added advantage that only a small fraction of the local tests it identifies as significant represent false rejections of their null hypotheses. In particular, the ceiling on this false discovery rate is numerically equal to the global test level αglobal and so can be controlled in the context of the global test. In contrast, the conventional counting test typically includes many false rejections of local null hypotheses among the nominally significant local tests, and the Walker test tends to identify only the most significant local tests. Because the FDR approach exhibits generally better power, can usually be used with good results even when the local test results are correlated with each other, and allows identification of significant local tests while controlling the proportion of false rejections, it should be used in preference to the counting test in most field significance calculations.
I thank Rick Katz for introducing me to the false discovery rate. This work was supported in part by the Northeast Regional Climate Center at Cornell University.
Corresponding author address: D. S. Wilks, Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, NY 14853. Email: email@example.com