Alternatives to the Chi-Square Test for Evaluating Rank Histograms from Ensemble Forecasts

Kimberly L. Elmore Cooperative Institute for Mesoscale Meteorological Studies, University of Oklahoma, Norman, Oklahoma


Abstract

Rank histograms are a commonly used tool for evaluating an ensemble forecasting system’s performance. Because the sample size is finite, the rank histogram is subject to statistical fluctuations, so a goodness-of-fit (GOF) test is employed to determine if the rank histogram is uniform to within some statistical certainty. Most often, the χ2 test is used to test whether the rank histogram is indistinguishable from a discrete uniform distribution. However, the χ2 test is insensitive to order and so suffers from troubling deficiencies that may render it unsuitable for rank histogram evaluation. As shown by examples in this paper, more powerful tests, suitable for small sample sizes, and very sensitive to the particular deficiencies that appear in rank histograms are available from the order-dependent Cramér–von Mises family of statistics, in particular, the Watson and Anderson–Darling statistics.

* Additional affiliation: NOAA/National Severe Storms Laboratory, Norman, Oklahoma

Corresponding author address: Dr. Kimberly L. Elmore, NSSL, 1313 Halley Circle, Norman, OK 73069. Email: kim.elmore@noaa.gov


1. Introduction

Rank histograms are used extensively to evaluate ensemble forecast system performance (e.g., Hamill and Colucci 1997, 1998; Hou et al. 2001; Stensrud and Yussouf 2003). Rank histograms were introduced into the climate field by Anderson (1996), and Anderson and Stern (1996) used the Kolmogorov–Smirnov test along with the Anderson–Darling test for comparing samples in seasonal simulation cases. Hamill (2001) shows how to use rank histograms appropriately for evaluating ensemble forecasts. Once biases in individual members are removed and observational error is accounted for (Hamill 2001), the ideal ensemble produces flat, or uniform, rank histograms. Certain deviations from the uniform distribution are bellwether indicators that the ensemble forecasting system is deficient, and the shape of the deviation often identifies the nature of the deficiency. For example, a U-shaped distribution indicates that the ensemble is underdispersive, a peaked distribution suggests that the ensemble is overdispersive, and a sloped rank histogram indicates that the ensemble remains biased in some way.

Due to random variations, even ideal ensembles will not produce perfectly uniform rank histograms. Hence, one wishes to test the assumption that, within sampling error, the rank histogram is derived from a discrete uniform distribution. Such tests are derived from the general family of goodness-of-fit (GOF) tests, which test the null hypothesis, H0: the rank histogram is indistinguishable from a discrete uniform distribution. A common test for evaluating whether the resulting rank histograms come from a discrete uniform distribution is the χ2 test. But the χ2 test is far from ideal and lacks power for small sample sizes. More powerful tests come from the Cramér–von Mises (CvM) family of statistics, specifically the Watson test and the Anderson–Darling test, which are described in section 2. Section 3 discusses the results of applying the different tests to both large and small datasets generated by sampling at random from a uniform distribution. Section 4 provides conclusions and recommendations.

2. GOF tests

The most common GOF test is the χ2 test. This is a natural test for rank histograms, which represent data binned by rank. The χ2 test is defined by the test statistic, T:

T = Σ_{i=1}^{k} (O_i − E_i)^2 / E_i,   (1)

where O_i is the observed frequency in bin i, and E_i is the expected frequency in bin i under the null distribution with k cells. Under the null hypothesis, T is approximately distributed as χ2 with k − 1 degrees of freedom.
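As a concrete illustration, the T statistic and its p value for a rank histogram can be computed as follows. This is a minimal sketch assuming NumPy and SciPy are available; the function name is my own, not from the paper:

```python
import numpy as np
from scipy import stats

def chi2_rank_histogram(counts):
    """Chi-square GOF test of a rank histogram against a discrete
    uniform null. `counts` holds the observed count in each rank bin."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), counts.size
    expected = np.full(k, n / k)            # E_i = N/k under uniformity
    T = np.sum((counts - expected) ** 2 / expected)
    p_value = stats.chi2.sf(T, df=k - 1)    # upper-tail p value, k - 1 dof
    return T, p_value

# A perfectly flat 16-bin histogram gives T = 0 and p = 1.
T, p = chi2_rank_histogram([5] * 16)
```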

Sample size is always an issue with GOF tests. In practice, GOF tests have limited value for both very large and very small sample sizes, though what constitutes “very large” and “very small” is not usually clear and differs from test to test. If the sample size is large enough, almost any GOF test will reject the null hypothesis because real data are never distributed according to any theoretical distribution (Millard 2002). As the sample size decreases, the power of any test, that is, its ability to detect a difference between the sample distribution and the hypothesized or null distribution (uniform, in this case), suffers, though certain tests are more sensitive than others against particular alternative hypotheses for any given sample size.

For the χ2 test, the conservatively defined required sample size is that which would provide an expected count of at least 5 for each bin. Thus, for a 15-member ensemble, the resulting rank histogram has 16 bins and 90 cases are required. However, the χ2 approximation for the T statistic remains valid for surprisingly small samples. If N is the number of samples, c is the number of bins, and E_i is the expected frequency in bin i under the null hypothesis, the T statistic is still well approximated by the χ2 distribution with c − 1 degrees of freedom if N ≥ 10, c ≥ 3, N^2/c ≥ 10, and E_i ≥ 0.25, which means that for a 15-member ensemble the minimum number of cases can be as small as 13. Conover (1999) gives a good treatment of how to compute the required sample size for a χ2 test.
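These relaxed conditions are easy to check mechanically. A small helper (the function name is my own) might look like the following, where a uniform null is assumed so that every expected count is E_i = N/c and the condition E_i ≥ 0.25 becomes N/c ≥ 0.25:

```python
def chi2_approx_valid(N, c):
    """Check the relaxed validity conditions for the chi-square
    approximation (after Conover 1999), assuming a uniform null
    so that every expected count is E_i = N / c."""
    return (N >= 10 and c >= 3
            and N ** 2 / c >= 10
            and N / c >= 0.25)

# For a 15-member ensemble (16 bins), 13 cases satisfy the conditions
# but 12 do not, since 12**2 / 16 = 9 < 10.
```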

Other useful GOF tests exist. A notable example is the Cramér–von Mises (CvM) family of tests, which has forms for the discrete uniform distribution (Choulakian et al. 1994). This family consists of the Cramér–von Mises (Cramér 1928; von Mises 1931; Smirnov 1936), the Watson (Watson 1961), and the Anderson–Darling (Anderson and Darling 1952) tests. In general, the CvM family of GOF tests has more power than the χ2 test for small sample sizes. Unlike the χ2 test statistic, the CvM test statistics are nonparametric. The CvM test gives results nearly identical to those of the Kolmogorov–Smirnov (KS) test, though some find its formulation more appealing because the CvM tests use an integrated departure of the empirical distribution function (EDF) from the null distribution, rather than the largest single departure (Conover 1999).

Consider a discrete distribution with k cells, with probability p_j that an observation lands in cell j. Let o_j be the observed count in bin j, and let e_j = N p_j be the expected count in bin j under the null distribution. Define the cumulative sums S_j = Σ_{i=1}^{j} o_i and T_j = Σ_{i=1}^{j} e_i, so that S_j/N corresponds to the EDF F_N(x) and H_j = T_j/N to the hypothesized distribution function. Finally, let Z_j = S_j − T_j, j = 1, 2, . . . , k. Then the discrete form of the CvM statistic is given by

W^2 = N^{-1} Σ_{j=1}^{k} Z_j^2 p_j,   (2)

the discrete form of the Watson statistic is given by

U^2 = N^{-1} Σ_{j=1}^{k} (Z_j − Z̄)^2 p_j,   (3)

and the discrete form of the Anderson–Darling statistic is given by

A^2 = N^{-1} Σ_{j=1}^{k} Z_j^2 p_j / [H_j (1 − H_j)],   (4)

where Z̄ = Σ_{j=1}^{k} Z_j p_j.

By definition, Z_k = 0, so the last term in (2) vanishes, and the last term in (4) is 0/0, which is set to 0. Equivalently, the sums in (2) and (4) may be extended only to k − 1 (Choulakian et al. 1994).

Because the CvM family uses an integrated departure of the EDF, it is order dependent: the way the bins are indexed affects the value of the computed statistic. For example, the discrete CvM test statistic expands as W^2 ∼ (O_1 − E_1)^2 + (O_1 + O_2 − E_1 − E_2)^2 + (O_1 + O_2 + O_3 − E_1 − E_2 − E_3)^2 + . . . , so the order in which the binned values appear affects the statistic’s value. This is also true for the Anderson–Darling statistic, but only partially true for the Watson statistic, which has a circular dependence. The Watson statistic differs from the other two in that it is invariant with regard to the “starting” cell: regardless of the start index in (3), as long as all indices are addressed in order thereafter, the Watson statistic is invariant for any given dataset. Hence, the Watson statistic is particularly useful for testing the uniformity of counts around a circle (Choulakian et al. 1994), such as calendar months or wind direction. All three, however, apply to linear data. The distribution theory, and thus the methods for constructing p values for the CvM, Watson, and Anderson–Darling statistics, are given in Choulakian et al. (1994) and will not be elaborated upon here.
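Under a uniform null with equiprobable cells, the three statistics in (2)–(4) can be computed directly from the binned counts. The sketch below (the function name is my own) follows the definitions above, dropping the final 0/0 term of the Anderson–Darling sum:

```python
import numpy as np

def cvm_family(counts):
    """Discrete Cramer-von Mises family statistics for a uniform null
    (after Choulakian et al. 1994). Returns (W2, U2, A2)."""
    o = np.asarray(counts, dtype=float)
    k = o.size
    N = o.sum()
    p = np.full(k, 1.0 / k)          # equiprobable null cell probabilities
    e = N * p                        # expected counts e_j
    S = np.cumsum(o)                 # S_j: cumulative observed counts
    T = np.cumsum(e)                 # T_j: cumulative expected counts
    Z = S - T                        # Z_j; note Z_k = 0 by construction
    H = T / N                        # hypothesized CDF H_j
    Zbar = np.sum(Z * p)
    W2 = np.sum(Z ** 2 * p) / N
    U2 = np.sum((Z - Zbar) ** 2 * p) / N
    # Last term of A2 is 0/0, so sum only over j = 1..k-1.
    A2 = np.sum(Z[:-1] ** 2 * p[:-1] / (H[:-1] * (1.0 - H[:-1]))) / N
    return W2, U2, A2
```

A quick sanity check: a perfectly flat histogram gives all three statistics equal to zero, and rotating the bins (changing the “starting” cell) leaves the Watson statistic unchanged, as the text describes.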

In contrast, the χ2 test is insensitive to the nature of the departure from the null distribution because it uses only the sum of the individual deviations at each bin over all bins. Hence, the χ2 test cannot distinguish between noisy departures from the null distribution and U-shaped, peaked, or sloped departures, which are ordered departures. The CvM family is relatively insensitive to random departures but more sensitive to ordered departures from the null distribution, and retains more power against these departures than does the χ2 test. This can lead to profound differences between GOF test results based on the χ2 test and results based on the CvM statistics. Table 1 shows critical values for the various significance levels for the three CvM statistics (from Choulakian et al. 1994).

3. Examples

Differences between the χ2 and CvM test behavior are illustrated using a Monte Carlo simulation that draws random samples from a uniform distribution. Assume an ensemble forecasting system with 15 members, which yields a rank histogram containing 16 bins. Define a small-sample rank histogram as consisting of 60 cases. Thus, for the small-sample simulation, each of 1000 Monte Carlo trials draws 60 numbers uniformly distributed between 1 and 16 (Fig. 1). Define a large-sample rank histogram as consisting of 540 cases. So, the large-sample Monte Carlo simulation uses 1000 sets of 540 numbers uniformly distributed between 1 and 16 (Fig. 2). In the small-sample case, the expected number of counts, E_i, in each cell is 3.75, large enough for both the χ2 test and the CvM family of tests to be valid. For the large sample, E_i = 33.75. Each sample is then reordered pathologically to produce a U-shaped, a peaked, and a sloped (biased) rank histogram. Hence, each sample yields four distributions: random, U shaped, peaked, and sloped. In each case, the χ2 test p value remains invariant, but the CvM-family p values vary widely depending on the nature of the reordering.
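A single small-sample trial of this kind could be sketched as follows. The seed and the reordering constructions are my own illustrative choices, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, an arbitrary choice
k, n_cases = 16, 60

ranks = rng.integers(1, k + 1, size=n_cases)       # 60 uniform ranks in 1..16
counts = np.bincount(ranks, minlength=k + 1)[1:]   # 16-bin rank histogram

s = np.sort(counts)                                # ascending counts
sloped = s                                         # monotone "bias" slope
peaked = np.concatenate([s[0::2], s[1::2][::-1]])  # largest counts in middle
d = s[::-1]                                        # descending counts
u_shaped = np.concatenate([d[0::2], d[1::2][::-1]])  # largest counts at ends

# Every reordering uses the same multiset of counts, so the chi-square
# statistic (an order-blind sum over bins) is identical for all four.
E = n_cases / k
chi2_T = lambda c: float(np.sum((c - E) ** 2 / E))
```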

For a test at p = 0.05, the expectation is that close to 5% of the samples will result in a p value less than 0.05 for all of these tests simply by random chance before the samples are reordered. By definition, the χ2 test p value is independent of order. However, the CvM family yields different p values, depending on how the data are ordered. Table 2 shows the proportion of cases that are associated with p values ≤ 0.05 for each ordering for the small sample size, while Table 3 shows the same results for the large sample size (boldface values are associated with the most sensitive tests).

Clearly, for both sample sizes, the expectation for the random rank histograms is met by all tests. However, for the pathologically reordered rank histograms, the statistics show marked differences. While the χ2 statistic is unaffected, the Watson statistic is clearly very sensitive to U-shaped and peaked rank histograms (yielding identical values for both due to circular symmetry), while both the CvM and Anderson–Darling tests are most sensitive to the sloped (biased) rank histograms. These results hold for both the small and large samples.

More insight may be gained by examining how the CvM tests behave relative to the χ2 test (Fig. 3). For the random data, the two GOF tests are uncorrelated for 16-bin rank histograms constructed from both 60 cases (Fig. 3a) and 540 cases (Fig. 3c), meaning that the individual samples with p values <0.05 differ between the two tests in nearly random ways. However, the χ2 test is by definition insensitive to the reordering of the data into a U shape, while the Watson statistic is clearly very sensitive to this reordering for both the 60- (Fig. 3b) and 540-case samples (Fig. 3d). The χ2 test p values are clearly discrete for the small-sample (60 value) cases (Figs. 3a and 3b), a result of both the small sample size and the insensitivity to order: with only 60 values, the number of distinct T-statistic values that can be generated is quite limited. The Watson test p value rises as the χ2 test p value approaches 1 (Figs. 3b and 3d). This is true for all of the CvM family of tests, and indicates that as the χ2 test p value approaches 1, the rank histogram approaches exact uniformity. Because the sample size used in the simulations is not a multiple of the number of bins, such a “perfect” rank histogram cannot occur within these data. However, were all bins to contain the same count, the rank histogram would be invariant under bin-order permutations. The sensitivity of the Watson statistic to U-shaped distributions is demonstrated by how high the χ2 test p value may become before the Watson test p value begins to increase (Figs. 3b and 3d): the higher the χ2 test p value at which the Watson p value starts to rise, the more sensitive the Watson test statistic is to that particular deviation from the null distribution. Simulations using sample sizes as large as 5000 and as small as 30 show similar results.

4. Conclusions

The most common problems associated with ensemble forecasting systems are underdispersion, overdispersion, and bias. Rank histograms depict these problems with either a U shape, a peaked center, or one end of the rank histogram being higher than the other (slope), respectively. The ubiquitous χ2 test possesses certain characteristics that may make it unsuitable for assessing the quality of rank histograms derived from ensemble forecasting systems. Specifically, the χ2 test statistic is order invariant, which means rank histograms that clearly display a problem within the ensemble forecasting system may go undetected using the χ2 test.

Better tests for the specific problems encountered with ensemble forecasting systems are available from the discrete form of the Cramér–von Mises family of GOF tests. These tests are based on the Cramér–von Mises, Watson, and Anderson–Darling statistics. Of these three, the Watson test statistic is considerably more sensitive to either U-shaped or peaked rank histograms than are the other two. The other two tests are almost equally sensitive to a slope/bias within the rank histograms. All of the CvM tests retain considerable power for relatively small samples.

Like any statistical test, the CvM tests are not infallible. Yet, better results will be obtained by using a combination of either the Watson and CvM tests, or the Watson and Anderson–Darling tests, for evaluating rank histograms, instead of the χ2 test. If the rank histogram in question passes either combination of CvM tests at an appropriate p value, then that rank histogram may be considered statistically free of either U-shaped/peaked or biased/sloped deficiencies.

Acknowledgments

The author is very grateful to Dr. Richard A. Lockhart, of Simon Fraser University, Burnaby, British Columbia, Canada, without whose generous guidance this work would not have been possible. This work is supported by the National Severe Storms Laboratory.

REFERENCES

  • Anderson, J. S., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integration. J. Climate, 9, 1518–1530.

  • Anderson, J. S., and W. F. Stern, 1996: Evaluating the potential predictive utility of ensemble forecasts. J. Climate, 9, 260–269.

  • Anderson, T. W., and D. A. Darling, 1952: Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Stat., 23, 193–212.

  • Choulakian, V., R. A. Lockhart, and M. A. Stephens, 1994: Cramér–von Mises statistics for discrete distributions. Can. J. Stat., 22, 125–137.

  • Conover, W. J., 1999: Practical Nonparametric Statistics. 3d ed. John Wiley and Sons, 584 pp.

  • Cramér, H., 1928: On the composition of elementary errors. Skand. Aktuarietidskr., 11, 13–74 and 141–180.

  • Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.

  • Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.

  • Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724.

  • Hou, D., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX ’98 ensemble forecasts. Mon. Wea. Rev., 129, 73–91.

  • Millard, S. P., 2002: Environmental Stats for S-Plus. 2d ed. Springer, 264 pp.

  • Smirnov, N. V., 1936: Sur la distribution de w² (critérium de M. R. v. Mises). Compt. Rend., 202, 449–452.

  • Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. Mon. Wea. Rev., 131, 2510–2524.

  • von Mises, R., 1931: Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik. Vol. 1. F. Deuticke, 574 pp.

  • Watson, G. S., 1961: Goodness-of-fit tests on a circle. I. Biometrika, 48, 109–114.

Fig. 1. Examples of χ2 and Cramér–von Mises family test results for a single, 60-element dataset with different bin orderings. The y axis gives the number of elements in each bin, and the x axis is the rank, with rank 1 on the left and rank 16 on the right. Tests that are particularly sensitive to certain deviations from the uniform null distribution are in boldface. (a) Rank histogram of uniformly distributed data with noise, (b) the same data as in (a) but reordered to generate a U-shaped rank histogram, (c) the same data as in (a) but reordered to create a peaked rank histogram, and (d) the same data as in (a) but reordered to generate a sloping rank histogram.

Citation: Weather and Forecasting 20, 5; 10.1175/WAF884.1

Fig. 2. Same as in Fig. 1 but for a sample size of 540.

Fig. 3. Examples of the χ2 test behavior compared to the Watson test behavior: (a) the association between the χ2 p value and the Watson p value for the random data consisting of 60 values, (b) the χ2 p value and the Watson p value for the 60-sample data reordered into a U shape, (c) same as in (a) but for the 540-sample data, and (d) same as in (b) but for the 540-sample reordered data.

Table 1. Critical values of the Cramér–von Mises statistics for tests of the discrete uniform distribution with k cells; α is the upper-tail significance level.

Table 2. Small-sample results. Bold values are associated with the most sensitive tests.

Table 3. Large-sample results. Bold values are associated with the most sensitive tests.