Indices of Rank Histogram Flatness and Their Sampling Properties

D. S. Wilks
Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, New York


Abstract

Quantitative evaluation of the flatness of the verification rank histogram can be approached through formal hypothesis testing. Traditionally, the familiar χ² test has been used for this purpose. Recently, two alternatives—the reliability index (RI) and an entropy statistic (Ω)—have been suggested in the literature. This paper presents approximations to the sampling distributions of these latter two rank histogram flatness metrics, and compares the statistical power of tests based on the three statistics, in a controlled setting. The χ² test is generally most powerful (i.e., most sensitive to violations of the null hypothesis of rank uniformity), although for overdispersed ensembles and small sample sizes, the test based on the entropy statistic Ω is more powerful. The RI-based test is preferred only for unbiased forecasts with small ensembles and very small sample sizes.

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: D. S. Wilks, dsw5@cornell.edu


1. Introduction

The probabilistic calibration (or reliability) of a collection of ensemble forecasts is typically examined using the verification rank histogram, often called simply the rank histogram, which is a graphical device developed independently by Anderson (1996), Hamill and Colucci (1997), and Talagrand et al. (1997). The underlying idea is to examine whether the members of a forecast ensemble and the verifying observation that they are predicting can be regarded as independent realizations from the same probability distribution (although this distribution may change from forecast to forecast).

A rank histogram is constructed by considering, for each of n ensemble forecasts consisting of m ensemble members, the collection of m + 1 values composed of the ensemble members and the corresponding verifying observation. These m + 1 values are sorted in ascending order, and the rank of the verifying observation within the group is tabulated. For example, if the observation is the smallest of the m + 1 values, its rank is 1, and if it is the largest, its rank is m + 1. The rank histogram is then constructed from the n ensemble forecasts being evaluated, as a histogram of the resulting n ranks, with m + 1 bars.
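For concreteness, the following minimal Python sketch constructs a rank histogram along these lines, using synthetic calibrated forecasts; the array names and sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 8                          # number of forecasts and ensemble size
ensembles = rng.normal(size=(n, m))     # synthetic, calibrated ensemble forecasts
obs = rng.normal(size=n)                # verifications from the same distribution

# Rank of each verification among its m + 1 pooled values (1, ..., m + 1):
ranks = 1 + np.sum(ensembles < obs[:, None], axis=1)

# Tally the n ranks into the m + 1 histogram bins n_k:
counts = np.bincount(ranks, minlength=m + 2)[1:]
print(counts)                           # roughly flat: each bin near n/(m + 1)
```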

If the collection of n ensemble forecasts is probabilistically calibrated, so the verifying observations are statistically indistinguishable from the forecast ensembles to which they belong, each verification is equally likely to take on any of the m + 1 ranks, and the resulting rank histogram will be flat, except for deviations due to sampling variability. Hamill (2001) described characteristic deviations from rank histogram flatness that are diagnostic for different sorts of miscalibration: U-shaped histograms indicate ensemble underdispersion, inverted U shapes indicate ensemble overdispersion, and asymmetric rank histograms are diagnostic for unconditional biases.

Visual inspection of the rank histogram is sufficient to diagnose strong ensemble miscalibration. Weak miscalibration may be difficult to distinguish subjectively from mere sampling variations, and in such cases, computation of a formal statistical hypothesis test is indicated. Both Anderson (1996) and Hamill and Colucci (1997) suggested use of the well-known chi-square test for this purpose. More recently, two alternatives to the chi-square statistic for evaluating rank histogram flatness have appeared in the literature, although the nature of their sampling distributions, necessary for computing hypothesis tests based on them, has not been previously investigated. Section 2 reviews these two alternative statistics in relation to the more conventional chi-square statistic and presents serviceable empirical approximations to their sampling distributions. Section 3 compares the statistical power of (i.e., the sensitivity of the hypothesis tests based on) the three alternatives for both unbiased and biased forecasts exhibiting incorrect dispersion, and section 4 concludes.

2. Flatness metrics and their sampling distributions

a. Chi-square

Define n_k, k = 1, …, m + 1, as the number of ensembles for which the corresponding observation attained rank k, with Σ_k n_k = n. The familiar chi-square statistic is computed as the sum over the m + 1 rank histogram bins of (observed − expected)²/expected. When each of the m + 1 outcomes is equally likely, the corresponding expected (in the statistical sense of long-run average) number is n/(m + 1). The result is

$$\chi^2 = \sum_{k=1}^{m+1} \frac{[n_k - n/(m+1)]^2}{n/(m+1)} = n(m+1) \sum_{k=1}^{m+1} \left[\frac{n_k}{n} - \frac{1}{m+1}\right]^2. \qquad (1)$$
A hypothetical perfectly flat rank histogram would exhibit n_k = n/(m + 1) for each histogram bin k, yielding χ² = 0. Actual rank histograms include the effects of sampling variability, so that in practice χ² > 0. If the null hypothesis of rank uniformity is true, then the effects of sampling variability on the sample statistic in Eq. (1) make it behave like a random realization from the chi-square distribution with ν = m degrees of freedom. Because small values of χ² support the null hypothesis of rank histogram flatness, that null hypothesis is rejected at the α level if χ² is as large as or larger than the (1 − α) quantile of the appropriate chi-square distribution, χ²_{1−α}(ν), which is the upper limit of integration in the equation

$$1 - \alpha = \int_0^{\chi^2_{1-\alpha}} \frac{x^{(\nu-2)/2} \exp(-x/2)}{2^{\nu/2}\,\Gamma(\nu/2)}\, dx, \qquad (2)$$

where Γ(⋅) denotes the gamma function. That is, if the test statistic in Eq. (1) is larger than χ²_{1−α}(ν), it is regarded as too unlikely to have occurred as a result of deviations from rank histogram flatness that are due only to statistical sampling variations, leading to rejection of the null hypothesis of rank uniformity. Equation (2) is analytically integrable only for the special case ν = 2, which is not relevant for evaluating rank histogram flatness. However, values of χ²_{1−α}(ν) for commonly used test levels α, such as 0.05 and 0.01, as functions of ν can be obtained from many printed statistics references and software packages.
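For example, the test might be implemented as in the following sketch, which computes Eq. (1) from the bin counts and obtains the critical value of Eq. (2) from scipy's chi-square quantile function; the wrapper function, its interface, and the example counts are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def chi_square_flatness_test(counts, alpha=0.05):
    counts = np.asarray(counts, dtype=float)
    n, bins = counts.sum(), counts.size                  # bins = m + 1
    expected = n / bins                                  # n/(m + 1) under the null
    chi2 = np.sum((counts - expected) ** 2 / expected)   # Eq. (1)
    crit = stats.chi2.ppf(1.0 - alpha, df=bins - 1)      # nu = m degrees of freedom
    return chi2, crit, chi2 >= crit                      # reject flatness if True

print(chi_square_flatness_test([18, 25, 22, 20, 19]))    # hypothetical counts, n = 104
```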

b. Reliability index

Delle Monache et al. (2006) proposed an alternative measure of rank histogram flatness that they called the reliability index (RI):
$$\mathrm{RI} = \sum_{k=1}^{m+1} \left| \frac{n_k}{n} - \frac{1}{m+1} \right|. \qquad (3)$$
The reliability index is similar in form to the second equality in Eq. (1), except that absolute rather than squared deviations of the relative frequencies in each bin from their statistically expected (if the ensembles are calibrated) values are summed, and the scaling factor n(m + 1) is absent.
Sampling distributions for the square root of RI are generally well approximated by Gaussian distributions, with mean μ_√RI given by Eq. (4a) and standard deviation σ_√RI given by Eq. (4b). These are empirically derived relationships that were obtained through analysis of a large number of Monte Carlo simulations consistent with the null hypothesis of rank histogram flatness, as described in the appendix: for m = 4, 8, 16, …, 256 and n = 16, 32, 64, …, 8192, with the restriction that n ≥ 4m. Figure 1 illustrates the procedure for the mean function in Eq. (4a). The plotted points in Fig. 1a show averages of RI over 10^6 synthetic rank histograms for which the null hypothesis of flatness is true, for illustrative values of sample size n, as functions of ensemble size m. The curves in Fig. 1a are plots of (the square of) Eq. (4a), using the indicated sample sizes. These curves are part of a family that, when parameterized as a function of m/n, collapses to the single function shown in Fig. 1b. The plotted points in Fig. 1b show the empirically simulated average RI values over 10^6 realizations each, for all combinations of m and n listed above, and illustrate that Eq. (4a) represents their behavior almost exactly for the larger sample sizes (small m/n) but is somewhat less accurate for the smaller sample sizes (larger m/n).
Fig. 1. (a) Averages of RI over 10^6 synthetic rank histograms for which the null hypothesis of flatness is true (points), for illustrative values of sample size n, as functions of ensemble size m. The curves in (a) are (the square of) Eq. (4a), using the indicated sample sizes. (b) The plotted points show the empirically simulated average RI for all combinations of m and n considered and illustrate that Eq. (4a) represents their behavior almost exactly for the larger sample sizes (small m/n) but is somewhat less accurate for the smaller sample sizes (larger m/n).
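As an illustration of how an empirical relationship like Eq. (4a) can be derived, the following Python sketch simulates the null-hypothesis mean of √RI for several (m, n) combinations and fits a power law in m/n; the power-law form, the fitted constants it produces, and all names here are illustrative assumptions, not the paper's actual equation.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_sqrt_ri(m, n, reps=2000):
    """Monte Carlo mean of sqrt(RI) under the null hypothesis of flatness."""
    ranks = rng.integers(0, m + 1, size=(reps, n))           # uniform ranks
    counts = np.stack([np.bincount(r, minlength=m + 1) for r in ranks])
    ri = np.abs(counts / n - 1.0 / (m + 1)).sum(axis=1)      # Eq. (3), per replicate
    return np.sqrt(ri).mean()

pairs = [(m, n) for m in (4, 8, 16, 32) for n in (64, 256, 1024) if n >= 4 * m]
x = np.log([m / n for m, n in pairs])
y = np.log([mean_sqrt_ri(m, n) for m, n in pairs])
slope, intercept = np.polyfit(x, y, 1)   # fit log(mu) = intercept + slope * log(m/n)
print(f"mu_sqrt(RI) ~ {np.exp(intercept):.3f} * (m/n)^{slope:.3f}")
```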

Because small values of RI are consistent with rank histogram flatness, the null hypothesis can be rejected at the α level when RI is sufficiently large, specifically when
$$\sqrt{\mathrm{RI}} \geq \mu_{\sqrt{\mathrm{RI}}} + z_{1-\alpha}\, \sigma_{\sqrt{\mathrm{RI}}}. \qquad (5)$$

Here, for example, z_{0.95} = 1.645 for tests at the α = 0.05 level.
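A sketch of how the RI-based test might be implemented follows; in place of the fitted relationships in Eqs. (4a) and (4b), the Gaussian parameters of √RI are estimated here directly by Monte Carlo simulation under the null hypothesis, and the function names and example counts are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def reliability_index(counts):
    counts = np.asarray(counts, dtype=float)
    return np.abs(counts / counts.sum() - 1.0 / counts.size).sum()   # Eq. (3)

def ri_test(counts, alpha=0.05, reps=10_000):
    counts = np.asarray(counts)
    n, bins = int(counts.sum()), counts.size
    ranks = rng.integers(0, bins, size=(reps, n))              # uniform null ranks
    null_sqrt_ri = np.sqrt([reliability_index(np.bincount(r, minlength=bins))
                            for r in ranks])
    mu, sigma = null_sqrt_ri.mean(), null_sqrt_ri.std(ddof=1)  # stand-ins for Eq. (4)
    crit = (mu + stats.norm.ppf(1.0 - alpha) * sigma) ** 2     # Eq. (5), squared to RI
    ri = reliability_index(counts)
    return ri, crit, ri >= crit                                # reject flatness if True

print(ri_test([18, 25, 22, 20, 19]))   # hypothetical bin counts, m + 1 = 5, n = 104
```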

c. Entropy

Taillardat et al. (2016) proposed evaluating rank histogram flatness using the entropy statistic
$$\Omega = -\frac{1}{\ln(m+1)} \sum_{k=1}^{m+1} \frac{n_k}{n} \ln\!\left(\frac{n_k}{n}\right). \qquad (6)$$

Perfectly flat rank histograms, which would exhibit n_k/n = 1/(m + 1) for all k, yield Ω = 1, whereas a histogram with all counts in a single bin yields Ω = 0 [understanding that 0 ln(0) = 0]. Accordingly, sufficiently small values of Ω will lead to rejection of the null hypothesis of a uniform rank histogram.
Because Eq. (6) is bounded by 0 ≤ Ω ≤ 1, it is natural to consider beta distributions for representing its sampling variations under the null hypothesis of rank histogram flatness. Reasonably good approximations to the sampling properties of Ω are provided by beta distributions with parameters p and q given by the empirical relationships in Eqs. (7a) and (7b). In common with the results in Eq. (4), these relationships were obtained through analysis of a large number of Monte Carlo simulations, analogously to the method illustrated in Fig. 1; again for m = 4, 8, 16, …, 256 and n = 16, 32, 64, …, 8192, with the restriction that n ≥ 4m. Because sufficiently small values of Ω are associated with deviations from rank uniformity, the null hypothesis of rank histogram flatness will be rejected at the α level when Ω ≤ Ω_crit, where this critical value is the left-tail quantile satisfying
$$\alpha = \int_0^{\Omega_{\mathrm{crit}}} \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\, x^{p-1} (1-x)^{q-1}\, dx. \qquad (8)$$
Equation (8) is not analytically integrable, except for a small number of special cases, and tabulated values are not available for the large values of the parameter p in Eq. (7a) that will be relevant for evaluating rank histogram flatness. Accordingly, Eq. (8) must in general be evaluated through numerical integration or use of appropriate software packages.
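Analogously, the entropy-based test might be sketched as follows: Ω is computed from Eq. (6), and in place of the fitted parameters of Eq. (7), a beta distribution is moment-matched here to a Monte Carlo null sample before evaluating the left-tail quantile of Eq. (8) with scipy; the function names, defaults, and example counts are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def omega(counts):
    counts = np.asarray(counts, dtype=float)
    f = counts / counts.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(f > 0, f * np.log(f), 0.0)   # convention: 0 ln(0) = 0
    return -terms.sum() / np.log(f.size)              # Eq. (6), normalized by ln(m + 1)

def omega_test(counts, alpha=0.05, reps=10_000):
    counts = np.asarray(counts)
    n, bins = int(counts.sum()), counts.size
    ranks = rng.integers(0, bins, size=(reps, n))     # uniform null ranks
    null_omega = np.array([omega(np.bincount(r, minlength=bins)) for r in ranks])
    mean, var = null_omega.mean(), null_omega.var(ddof=1)
    c = mean * (1.0 - mean) / var - 1.0               # beta moment matching
    p, q = mean * c, (1.0 - mean) * c                 # stand-ins for Eq. (7)
    crit = stats.beta.ppf(alpha, p, q)                # left-tail quantile, Eq. (8)
    w = omega(counts)
    return w, crit, w <= crit                         # reject flatness if True

print(omega_test([18, 25, 22, 20, 19]))               # hypothetical bin counts
```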

3. Comparisons of the resulting statistical tests

Hypothesis tests based on the three flatness metrics described in section 2 will be compared with respect to both their "type I" and "type II" error characteristics. A properly constructed hypothesis test will reject valid null hypotheses (type I error) with a small, specified probability that is near or equal to the test level α. When competing test formulations are being evaluated, it is also of interest to compare their statistical power, or sensitivity for detecting null hypothesis violations (failing to reject an invalid null hypothesis is a type II error). Results of such comparisons are generally summarized as power functions, which give the probability of rejecting the null hypothesis as a function of the degree to which it is wrong. Ideally, a power function takes on a minimum value of α where the null hypothesis is true and rises quickly to near 1 as the true condition diverges from that implied by the null hypothesis. In general, the most powerful available test, whose power function rises most quickly from α, will be preferred.

The three rank histogram flatness metrics described in section 2 will be compared here using synthetic rank histograms derived by discretizing random samples from beta distributions. The procedure follows the Monte Carlo algorithm described in the appendix, except that the random numbers generated in step 4 are drawn from beta distributions more general than the uniform distribution. (The uniform distribution is the special case of the beta distribution with p = q = 1.) U-shaped beta distributions, corresponding to rank histograms for underdispersed ensembles, are produced when p < 1 and q < 1. Hump-shaped beta distributions, corresponding to rank histograms for overdispersed ensembles, result when p > 1 and q > 1.

The mean and standard deviation of a beta distribution, in terms of its two distribution parameters, are, respectively,

$$\mu_B = \frac{p}{p+q} \qquad (9a)$$

and

$$\sigma_B = \left[ \frac{pq}{(p+q)^2 (p+q+1)} \right]^{1/2}. \qquad (9b)$$

Rank histograms representing unbiased forecasts will result from discretizing beta distributions with p = q, so that μ_B = 0.5. Correctly calibrated forecasts (p = q = 1) will exhibit σ_B = 1/√12 ≈ 0.2887. Artificial rank histograms representing unbiased but underdispersed ensembles are generated by randomly sampling beta distributions with p = q < 1, yielding σ_B > 1/√12; unbiased but overdispersed ensembles are generated from beta distributions with p = q > 1, yielding σ_B < 1/√12.
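A minimal sketch of this synthetic-histogram generator, assuming numpy's beta sampler and hypothetical function names, follows; p = q = 1 reproduces the flat (null) case.

```python
import numpy as np

rng = np.random.default_rng(4)

def synthetic_rank_histogram(m, n, p=1.0, q=1.0):
    u = rng.beta(p, q, size=n)                     # step 4, generalized to beta draws
    k = np.minimum((u * (m + 1)).astype(int), m)   # step 5: integer truncation to bins
    return np.bincount(k, minlength=m + 1)         # bin counts n_1, ..., n_{m+1}

print(synthetic_rank_histogram(8, 512, p=0.8, q=0.8))  # U-shaped: underdispersion
print(synthetic_rank_histogram(8, 512, p=1.5, q=1.5))  # hump-shaped: overdispersion
```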

Figure 2 shows the resulting power functions for tests computed at the α = 0.05 level using the χ² (solid), RI (dashed), and Ω (dotted) test statistics. The thumbnail insets indicate shapes of the underlying beta distributions for the values of σ_B at which they are plotted. In most circumstances, the χ² tests are most powerful, although for the large sample sizes the tests based on the Ω statistic are nearly equivalent. An exception occurs for overdispersed ensembles with the small sample sizes, where the tests based on Ω are generally most powerful and the χ² tests are least powerful. A shortcoming in Eq. (7) is evident from the minima of the power functions for Ω being smaller than 0.05 for the small and medium sample sizes in Fig. 2c, indicating that the tests based on Eqs. (7) and (8) are conservative in these instances, rejecting valid null hypotheses too rarely (for approximately 3% of the tests). In such cases, the algorithm presented in the appendix can be used to obtain more accurate critical values. Results for the larger ensemble sizes (m ≥ 64) are qualitatively similar to those shown in Fig. 2c (not shown).

Fig. 2. Comparison of power functions for unbiased forecasts exhibiting dispersion errors, for tests computed at the α = 0.05 level using the χ² (solid), RI (dashed), and Ω (dotted) rank histogram flatness metrics, for ensemble sizes (a) m = 4, (b) m = 8, and (c) m = 32. In each panel, the three groups of curves represent small (n = 8m), medium (n = 32m), and large (n = 128m) sample sizes. Thumbnail insets indicate shapes of the beta distributions underlying generation of the synthetic rank histograms and are located at corresponding values of σ_B on the horizontal axes. Note that the three panels have different horizontal scales.

Usually, forecast ensembles will exhibit both bias and dispersion errors. Figure 3 shows power functions for rank histograms constructed to have both types of errors. Here, the biases (= 1/2 − μ_B) are all positive (overforecasting biases), as indicated by both the numerical scales on the horizontal axes and the underlying beta distributions in the thumbnail insets, and they increase linearly from zero, for rank histograms with no dispersion error, according to Eq. (10). In this more realistic setting, tests based on the χ² statistic are most powerful, or nearly so, in all cases except for small samples of overdispersed ensembles, where tests based on Ω are more sensitive. As was the case for the unbiased forecasts in Fig. 2, the power of tests based on χ² and Ω is essentially equivalent for large sample sizes, and the entropy-based tests for larger ensembles and small sample sizes are slightly conservative. Results for larger ensemble sizes (not shown) are qualitatively similar to those in Fig. 3c.
Fig. 3. As in Fig. 2, but for rank histograms characterizing forecast ensembles exhibiting bias errors that increase linearly as the dispersion errors increase.

Tests based on RI are not most powerful for any of the cases shown in Figs. 2 and 3. However, Fig. 4a shows that for unbiased ensembles with small (m = 4) ensemble size and very small (n = 4m = 16) sample size, the RI-based tests are most powerful. On the other hand, for larger ensemble sizes, and for forecasts also exhibiting bias (Fig. 4b), the results for very small sample sizes are consistent with those in Figs. 2 and 3, with the χ² tests being most powerful for underdispersed ensembles and the Ω-based tests being more powerful for overdispersed ensembles.

Fig. 4. Power functions for tests with very small (n = 4m) sample sizes. Critical values for the tests based on RI and Ω have been computed using the method described in the appendix, rather than using Eqs. (4) and (7).

4. Conclusions

This study has compared hypothesis tests for rank histogram flatness based on the conventional χ² statistic [Eq. (1)], the reliability index [Eq. (3)], and an entropy measure [Eq. (6)] in a controlled setting. In addition, empirical approximations to the sampling distributions of the latter two statistics have been presented. Assessing rank histogram flatness is important because it is an indicator of the probabilistic calibration (reliability) of a collection of ensemble forecasts. However, it is important to realize that calibration is only a necessary rather than sufficient condition for forecast skill and value (e.g., Gneiting et al. 2007; Murphy and Winkler 1987), and rank histogram flatness is a necessary but not sufficient condition for concluding that a collection of ensemble forecasts is calibrated (e.g., Hamill 2001). It is unrealistic to expect raw dynamical ensembles to be calibrated, because of the unknown and undersampled nature of initial-condition distributions and unavoidable simplifications and errors in the dynamical formulation. However, calibration can reasonably be expected after appropriate postprocessing (Vannitsem et al. 2018), particularly if it is enforced in the postprocessing algorithm (Wilks 2018).

In most instances, the traditional χ² test was found here to be most powerful (i.e., most sensitive for detecting deviations from rank uniformity), particularly for the usual situation of underdispersed ensembles. For overdispersed ensembles and small sample sizes (n ≤ 8m), tests based on the entropy statistic Ω are most powerful. The RI-based tests are preferred only for unbiased forecasts with small ensemble sizes and very small (n ≈ 4m) sample sizes, although in such settings all three tests exhibit rather weak sensitivity.

Overall, use of the traditional χ² test is recommended as a consequence of its generally superior power, particularly for the underdispersed ensembles that are most commonly encountered, and because of the relative ease of obtaining the necessary critical values. Other advantages of using the χ² test to evaluate rank histogram flatness include the availability of formulations allowing more focused alternative hypotheses (Elmore 2005; Jolliffe and Primo 2008) and of adjustments that compensate for the effects of temporal (serial) dependence in the underlying data and the resulting verification ranks (Bröcker 2018; Wilks 2004). However, the validity of these adjustments when evaluating calibration of (spatially autocorrelated) gridded ensemble forecasts is unclear; in that case, an appropriate approach may be to consider only grid points that are sufficiently well separated, so that the corresponding ensembles are nearly independent.

Although the presentation here has been oriented toward examining probabilistic calibration of ensemble forecasts through the rank histogram, the results are equally applicable to evaluating flatness of the probability integral transform (PIT) histogram (Dawid 1984; Diebold et al. 1998; Gneiting et al. 2005), which can be regarded as the analog of the rank histogram for continuous (effectively, infinite ensemble size) predictive distributions, and for evaluating flatness of the various multivariate extensions of the rank histogram that have been proposed (Thorarinsdottir et al. 2016; Wilks 2017).

Acknowledgments

I thank Tom Hamill and an anonymous reviewer for suggestions that led to improvements in this paper.

APPENDIX

Algorithm for Computing Empirical Approximations to the Sampling Distributions

  1. Define the statistic of interest S. In the present study, S is either RI [Eq. (3)] or Ω [Eq. (6)].

  2. Define the ensemble size m, the sample size n, and the number (perhaps 10^4 or 10^5) of Monte Carlo replications J.

  3. Initialize the bin counts n_k, k = 1, …, m + 1, to zero.

  4. Generate a standard uniform random number u_i, having probability density f(u) = 1, 0 ≤ u < 1.

  5. Compute k = int[u_i(m + 1) + 1], where int[⋅] indicates integer truncation of fractions. Increment the count n_k by 1.

  6. Repeat steps 4 and 5 n times, using distinct realizations u_i, i = 1, …, n, in step 4, and compute S_j from the resulting values of n_k.

  7. Repeat steps 3–6 J times. The resulting collection of S_j, j = 1, …, J, is a discrete approximation to the sampling distribution of the statistic S under the null hypothesis of rank uniformity. (A code sketch implementing these steps follows.)
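A minimal Python sketch of this algorithm, with hypothetical names and an illustrative RI example, is given below; any statistic S expressed as a function of the bin counts can be supplied.

```python
import numpy as np

rng = np.random.default_rng(5)

def null_sampling_distribution(stat, m, n, J=10_000):
    """Steps 2-7: Monte Carlo null distribution of the statistic `stat`."""
    samples = np.empty(J)
    for j in range(J):
        u = rng.random(n)                         # step 4: standard uniform draws
        k = (u * (m + 1)).astype(int)             # step 5 (0-based bin index)
        counts = np.bincount(k, minlength=m + 1)  # steps 3 and 6: bin counts n_k
        samples[j] = stat(counts)                 # S_j computed from the counts
    return samples

# Example: a Monte Carlo critical value for RI [Eq. (3)] at the alpha = 0.05 level.
ri = lambda c: np.abs(c / c.sum() - 1.0 / c.size).sum()
print(np.quantile(null_sampling_distribution(ri, m=8, n=256), 0.95))
```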

REFERENCES

  • Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530, https://doi.org/10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.
  • Bröcker, J., 2018: Assessing the reliability of ensemble forecasting systems under serial dependence. Quart. J. Roy. Meteor. Soc., 144, 2666–2675, https://doi.org/10.1002/qj.3379.
  • Dawid, A. P., 1984: Present position and potential developments: Some personal views: Statistical theory: The prequential approach. J. Roy. Stat. Soc., 147A, 278–292, https://doi.org/10.2307/2981683.
  • Delle Monache, L., J. P. Hacker, Y. Zhou, X. Deng, and R. B. Stull, 2006: Probabilistic aspects of meteorological and ozone regional ensemble forecasts. J. Geophys. Res., 111, D24307, https://doi.org/10.1029/2005JD006917.
  • Diebold, F. X., T. A. Gunther, and A. S. Tay, 1998: Evaluating density forecasts with applications to financial risk management. Int. Econ. Rev., 39, 863–883, https://doi.org/10.2307/2527342.
  • Elmore, K. L., 2005: Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. Wea. Forecasting, 20, 789–795, https://doi.org/10.1175/WAF884.1.
  • Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
  • Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.
  • Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, https://doi.org/10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
  • Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327, https://doi.org/10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.
  • Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. Mon. Wea. Rev., 136, 2133–2139, https://doi.org/10.1175/2007MWR2219.1.
  • Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, https://doi.org/10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
  • Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
  • Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25, https://www.ecmwf.int/en/elibrary/12555-evaluation-probabilistic-prediction-systems.
  • Thorarinsdottir, T. L., M. Scheuerer, and C. Heinz, 2016: Assessing the calibration of high-dimensional ensemble forecasts using rank histograms. J. Comput. Graph. Stat., 25, 105–122, https://doi.org/10.1080/10618600.2014.977447.
  • Vannitsem, S., D. S. Wilks, and J. W. Messner, Eds., 2018: Statistical Postprocessing of Ensemble Forecasts. Elsevier, 347 pp.
  • Wilks, D. S., 2004: The minimum spanning tree histogram as a verification tool for multidimensional ensemble forecasts. Mon. Wea. Rev., 132, 1329–1340, https://doi.org/10.1175/1520-0493(2004)132<1329:TMSTHA>2.0.CO;2.
  • Wilks, D. S., 2017: On assessing calibration of multivariate ensemble forecasts. Quart. J. Roy. Meteor. Soc., 143, 164–172, https://doi.org/10.1002/qj.2906.
  • Wilks, D. S., 2018: Enforcing calibration in ensemble postprocessing. Quart. J. Roy. Meteor. Soc., 144, 76–84, https://doi.org/10.1002/qj.3185.