1. Introduction
The probabilistic calibration (or reliability) of a collection of ensemble forecasts is typically examined using the verification rank histogram, often called simply the rank histogram, which is a graphical device developed independently by Anderson (1996), Hamill and Colucci (1997), and Talagrand et al. (1997). The underlying idea is to examine whether the members of a forecast ensemble and the verifying observation that they are predicting can be regarded as independent realizations from the same probability distribution (although this distribution may change from forecast to forecast).
A rank histogram is constructed by considering, for each of n ensemble forecasts consisting of m ensemble members, the collection of m + 1 values composed of the ensemble members and the corresponding verifying observation. These m + 1 values are sorted in ascending order, and the rank of the verifying observation within the group is tabulated. For example, if the observation is the smallest of the m + 1 values, its rank is 1, and if it is the largest, its rank is m + 1. The rank histogram is then constructed from the n ensemble forecasts being evaluated, as a histogram of the resulting n ranks, with m + 1 bars.
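As a concrete illustration of this tabulation, a minimal Python sketch (function and variable names are illustrative, not from the original) might look like:

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """Tabulate verification ranks for n ensemble forecasts of m members each.

    ensembles: array of shape (n, m); observations: array of shape (n,).
    Returns the m + 1 bin counts of the rank histogram.
    """
    n, m = ensembles.shape
    # Rank of each observation among the m + 1 values:
    # 1 if it is the smallest, m + 1 if it is the largest.
    ranks = 1 + np.sum(ensembles < observations[:, None], axis=1)
    # Histogram of the n ranks, with m + 1 bars (ranks 1, ..., m + 1).
    return np.bincount(ranks, minlength=m + 2)[1:]

# Calibrated case: members and observations drawn from the same distribution,
# so the counts should be approximately equal across the m + 1 bins.
rng = np.random.default_rng(0)
ens = rng.normal(size=(1000, 8))
obs = rng.normal(size=1000)
print(rank_histogram(ens, obs))
```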
If the collection of n ensemble forecasts is probabilistically calibrated, so the verifying observations are statistically indistinguishable from the forecast ensembles to which they belong, each verification is equally likely to take on any of the m + 1 ranks, and the resulting rank histogram will be flat, except for deviations due to sampling variability. Hamill (2001) described characteristic deviations from rank histogram flatness that are diagnostic for different sorts of miscalibration: U-shaped histograms indicate ensemble underdispersion, inverted U shapes indicate ensemble overdispersion, and asymmetric rank histograms are diagnostic for unconditional biases.
Visual inspection of the rank histogram is sufficient to diagnose strong ensemble miscalibration. Weak miscalibration may be difficult to distinguish subjectively from mere sampling variations, and in such cases, computation of a formal statistical hypothesis test is indicated. Both Anderson (1996) and Hamill and Colucci (1997) suggested use of the well-known chi-square test for this purpose. More recently, two alternatives to the chi-square statistic for evaluating rank histogram flatness have appeared in the literature, although the nature of their sampling distributions, necessary for computing hypothesis tests based on them, has not been previously investigated. Section 2 reviews these two alternative statistics in relation to the more conventional chi-square statistic and presents serviceable empirical approximations to their sampling distributions. Section 3 compares the statistical power of (i.e., the sensitivity of the hypothesis tests based on) the three alternatives for both unbiased and biased forecasts exhibiting incorrect dispersion, and section 4 concludes.
2. Flatness metrics and their sampling distributions
a. Chi-square
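As an illustration, a minimal Python sketch of the χ2 flatness test is given below, assuming the conventional goodness-of-fit form of the statistic (equal expected bin counts n/(m + 1), compared against a chi-square distribution with m degrees of freedom); names are illustrative only.

```python
import numpy as np
from scipy import stats

def chi_square_flatness(counts, alpha=0.05):
    """Chi-square test of rank-histogram flatness.

    counts: the m + 1 rank-histogram bin counts, summing to n.
    Returns the statistic, its p-value, and whether flatness is rejected.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    expected = n / counts.size                 # n / (m + 1) in each bin
    chi2 = np.sum((counts - expected) ** 2 / expected)
    dof = counts.size - 1                      # m degrees of freedom
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, p_value, p_value < alpha
```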



b. Reliability index
Fig. 1. (a) Averages of RI over 10⁶ synthetic rank histograms for which the null hypothesis of flatness is true (points) for illustrative values of sample size n as functions of ensemble size m. The curves in (a) are (the square of) Eq. (4a), using the indicated sample sizes. (b) The plotted points show the empirically simulated average RI for all combinations of m and n considered and illustrate that Eq. (4a) represents their behavior almost exactly for the larger sample sizes (small m/n), but is somewhat less accurate for the smaller sample sizes (larger m/n).
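For context, the reliability index RI referenced here as Eq. (3) (Delle Monache et al. 2006) can be taken as the sum of absolute deviations of the observed relative bin frequencies from the uniform frequency 1/(m + 1); a minimal sketch under that assumption:

```python
import numpy as np

def reliability_index(counts):
    """Reliability index RI: sum of absolute deviations of the observed
    relative bin frequencies from the flat-histogram value 1/(m + 1).
    RI = 0 for a perfectly flat rank histogram."""
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum()
    return np.sum(np.abs(freqs - 1.0 / counts.size))
```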
c. Entropy
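The entropy statistic Ω referenced here as Eq. (6) is assumed in the sketch below to be the rank-histogram entropy normalized by its maximum possible value ln(m + 1), so that Ω = 1 for a perfectly flat histogram and smaller values indicate departures from flatness; the details are illustrative only.

```python
import numpy as np

def entropy_omega(counts):
    """Normalized rank-histogram entropy: Omega = 1 for a flat histogram."""
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum()
    freqs = freqs[freqs > 0]                   # convention: 0 * ln(0) = 0
    return -np.sum(freqs * np.log(freqs)) / np.log(counts.size)
```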
3. Comparisons of the resulting statistical tests
Hypothesis tests based on the three flatness metrics described in section 2 will be compared with respect to both their “type I” and “type II” error characteristics. A properly constructed hypothesis test will reject valid null hypotheses (type I error) with a small, specified probability that is near or equal to the test level α. When competing test formulations are being evaluated, it is also of interest to compare their statistical power, or sensitivity for detecting null hypothesis violations (failing to reject an invalid null hypothesis is a type II error). Results of such comparisons are generally expressed as power functions, which give the probability of rejecting the null hypothesis as a function of the degree to which it is false. Ideally, a power function takes on a minimum value of α where the null hypothesis is true and rises quickly to near 1 as the true condition diverges from that implied by the null hypothesis. In general, the most powerful available test, whose power function rises most quickly from α, will be preferred.
The three rank histogram flatness metrics described in section 2 will be compared here using synthetic rank histograms derived by discretizing random samples from beta distributions. The procedure follows the Monte Carlo algorithm described in the appendix, except that the random numbers generated in step 4 are drawn from beta distributions rather than only from the uniform distribution. (The uniform distribution is the special case of the beta distribution with p = q = 1.) U-shaped beta distributions, corresponding to rank histograms for underdispersed ensembles, are produced when p < 1 and q < 1. Hump-shaped beta distributions, corresponding to rank histograms for overdispersed ensembles, result when p > 1 and q > 1.
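A hedged Python sketch of this synthetic-data procedure, together with the resulting Monte Carlo power estimate for the χ2 test (names, the specific sampler, and the replication count are assumptions for illustration, not the paper's exact setup):

```python
import numpy as np
from scipy import stats

def synthetic_rank_histogram(m, n, p, q, rng):
    """Synthetic rank histogram: n draws from a beta(p, q) distribution,
    discretized into m + 1 bins that are equally probable under uniformity."""
    u = rng.beta(p, q, size=n)
    k = np.minimum((u * (m + 1)).astype(int), m)   # bins 0, ..., m
    return np.bincount(k, minlength=m + 1)

def chi2_power(m, n, p, q, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo estimate of the chi-square test's rejection rate."""
    rng = np.random.default_rng(seed)
    crit = stats.chi2.ppf(1 - alpha, df=m)         # critical value, m dof
    expected = n / (m + 1)
    rejections = 0
    for _ in range(reps):
        counts = synthetic_rank_histogram(m, n, p, q, rng)
        chi2 = np.sum((counts - expected) ** 2 / expected)
        rejections += chi2 > crit
    return rejections / reps

# p = q = 1 recovers the uniform (flat) case, so power should be near alpha;
# p = q = 0.8 gives a U-shaped (underdispersed) case with higher power.
print(chi2_power(m=8, n=256, p=1.0, q=1.0))
print(chi2_power(m=8, n=256, p=0.8, q=0.8))
```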
Figure 2 shows the resulting power functions for tests computed at the α = 0.05 level using the χ2 (solid), RI (dashed), and Ω (dotted) test statistics. The thumbnail insets indicate shapes of the underlying beta distributions for the values of σB at which they are plotted. In most circumstances, the χ2 tests are most powerful, although for the large sample sizes, the tests based on the Ω statistic are nearly equivalent. An exception occurs for overdispersed ensembles with the small sample sizes, where the tests based on Ω are generally most powerful, and the χ2 tests are least powerful. A shortcoming in Eq. (7) is evident from the minima of the power functions for Ω being smaller than 0.05 for the small and medium sample sizes in Fig. 2c, indicating that the tests based on Eqs. (7) and (8) are conservative in these instances, rejecting valid null hypotheses too rarely (for approximately 3% of the tests). (In such cases, the algorithm presented in the appendix can be used to obtain more accurate critical values.) Results for the larger ensemble sizes (m ≥ 64) are qualitatively similar to those shown in Fig. 2c (not shown).
Fig. 2. Comparison of power functions for unbiased forecasts exhibiting dispersion errors, for tests computed at the α = 0.05 level using the χ2 (solid), RI (dashed), and Ω (dotted) rank histogram flatness metrics, for ensemble sizes (a) m = 4, (b) m = 8, and (c) m = 32. In each panel, the three groups of curves represent small (n = 8m), medium (n = 32m), and large (n = 128m) sample sizes. Thumbnail insets indicate shapes of beta distributions underlying generation of the synthetic rank histograms and are located at corresponding values of σB on the horizontal axes. Note that the three panels have different horizontal scales.
Fig. 3. As in Fig. 2, but for rank histograms characterizing forecast ensembles exhibiting bias errors that increase linearly as the dispersion errors increase.
Tests based on RI are not most powerful for any of the cases shown in Fig. 2 or 3. However, Fig. 4a shows that for unbiased ensembles with small (m = 4) ensemble size and very small (n = 4m = 16) sample size, the RI-based tests are most powerful. Otherwise, for larger ensemble sizes and for forecasts also exhibiting bias (Fig. 4b), results for these very small sample sizes are consistent with those in Figs. 2 and 3, with the χ2 tests being most powerful for underdispersed ensembles and the Ω-based tests being more powerful for overdispersed ensembles.
Fig. 4. Power functions for tests with very small (n = 4m) sample sizes. Critical values for the tests based on RI and Ω have been computed using the method described in the appendix rather than using Eqs. (4) and (7).
4. Conclusions
This study has compared hypothesis tests for rank histogram flatness based on the conventional χ2 statistic [Eq. (1)], the reliability index [Eq. (3)], and an entropy measure [Eq. (6)] in a controlled setting. In addition, empirical approximations to the sampling distributions of the latter two statistics have been presented. Assessing rank histogram flatness is important because it is an indicator of the probabilistic calibration (reliability) of a collection of ensemble forecasts. However, it is important to realize that calibration is only a necessary rather than sufficient condition for forecast skill and value (e.g., Gneiting et al. 2007; Murphy and Winkler 1987), and rank histogram flatness is a necessary but not sufficient condition to conclude that a collection of ensemble forecasts is calibrated (e.g., Hamill 2001). It is unrealistic to expect raw dynamical ensembles to be calibrated because of the unknown and undersampled nature of initial-condition distributions and unavoidable simplifications and errors in the dynamical formulation. However, achieving this condition can reasonably be expected after appropriate postprocessing (Vannitsem et al. 2018), particularly if calibration is enforced in the postprocessing algorithm (Wilks 2018).
In most instances, the traditional χ2 test was found here to be most powerful (i.e., most sensitive for detecting deviations from rank uniformity), particularly for the usual situation of underdispersed ensembles. For overdispersed ensembles and small sample sizes (n ≤ 8m), tests based on the entropy statistic Ω are most powerful. The RI-based tests are preferred only for unbiased forecasts with small ensemble sizes and very small (n ≈ 4m) sample sizes, although in such settings, all three tests exhibit rather weak sensitivity.
Overall, use of the traditional χ2 test is recommended as a consequence of its generally superior power, particularly for the underdispersed ensembles that are most commonly encountered, and the relative ease of obtaining the necessary critical values. Other advantages of using the χ2 test to evaluate rank histogram flatness include the availability of formulations allowing more focused alternative hypotheses (Elmore 2005; Jolliffe and Primo 2008) and adjustments that compensate for the effects of temporal (serial) dependence in the underlying data and resulting verification ranks (Bröcker 2018; Wilks 2004). However, the validity of these adjustments when evaluating the calibration of (spatially autocorrelated) gridded ensemble forecasts is unclear, in which case an appropriate approach may be to consider only grid points that are sufficiently well separated to yield nearly independent verification ranks.
Although the presentation here has been oriented toward examining probabilistic calibration of ensemble forecasts through the rank histogram, the results are equally applicable to evaluating flatness of the probability integral transform (PIT) histogram (Dawid 1984; Diebold et al. 1998; Gneiting et al. 2005), which can be regarded as the analog of the rank histogram for continuous (effectively, infinite ensemble size) predictive distributions, and for evaluating flatness of the various multivariate extensions of the rank histogram that have been proposed (Thorarinsdottir et al. 2016; Wilks 2017).
Acknowledgments
I thank Tom Hamill and an anonymous reviewer for suggestions that led to improvements in this paper.
APPENDIX
Algorithm for Computing Empirical Approximations to the Sampling Distributions
1. Define the statistic of interest S. In the present study, S is either RI [Eq. (3)] or Ω [Eq. (6)].
2. Define the ensemble size m, the sample size n, and the number (perhaps 10⁴ or 10⁵) of Monte Carlo replications J.
3. Initialize bin counts nk, k = 1, …, m + 1, to zero.
4. Generate a standard uniform random number ui, having probability density f(u) = 1, 0 ≤ u < 1.
5. Compute k = int[ui(m + 1) + 1], where int[⋅] indicates integer truncation of fractions. Increment the count nk by 1.
6. Repeat steps 4 and 5 n times, using distinct realizations ui, i = 1, …, n, in step 4, and compute Sj from the resulting values of nk.
7. Repeat steps 3–6 J times. The resulting collection of Sj, j = 1, …, J, is a discrete approximation to the sampling distribution of the statistic S under the null hypothesis of rank uniformity.
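A minimal Python implementation of this algorithm (illustrative; the function names are assumptions, and the statistic supplied here is the reliability index sketched in section 2b):

```python
import numpy as np

def null_sampling_distribution(stat, m, n, J=10_000, seed=0):
    """Monte Carlo approximation to the null sampling distribution of a
    rank-histogram flatness statistic, following steps 1-7 above."""
    rng = np.random.default_rng(seed)
    samples = np.empty(J)
    for j in range(J):                                   # step 7
        u = rng.uniform(size=n)                          # steps 4 and 6
        k = np.minimum((u * (m + 1)).astype(int), m)     # step 5, 0-based bins
        counts = np.bincount(k, minlength=m + 1)         # steps 3 and 5
        samples[j] = stat(counts)                        # step 6
    return samples

def reliability_index(counts):
    freqs = counts / counts.sum()
    return np.sum(np.abs(freqs - 1.0 / freqs.size))

# Example: empirical one-sided critical value of RI at the 0.05 level
# (large RI indicates departure from flatness) for m = 8, n = 256.
dist = null_sampling_distribution(reliability_index, m=8, n=256)
print(np.quantile(dist, 0.95))
```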
REFERENCES
Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530, https://doi.org/10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.
Bröcker, J., 2018: Assessing the reliability of ensemble forecasting systems under serial dependence. Quart. J. Roy. Meteor. Soc., 144, 2666–2675, https://doi.org/10.1002/qj.3379.
Dawid, A. P., 1984: Present position and potential developments: Some personal views: Statistical theory: The prequential approach. J. Roy. Stat. Soc., 147A, 278–292, https://doi.org/10.2307/2981683.
Delle Monache, L., J. P. Hacker, Y. Zhou, X. Deng, and R. B. Stull, 2006: Probabilistic aspects of meteorological and ozone regional ensemble forecasts. J. Geophys. Res., 111, D24307, https://doi.org/10.1029/2005JD006917.
Diebold, F. X., T. A. Gunther, and A. S. Tay, 1998: Evaluating density forecasts with applications to financial risk management. Int. Econ. Rev., 39, 863–883, https://doi.org/10.2307/2527342.
Elmore, K. L., 2005: Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. Wea. Forecasting, 20, 789–795, https://doi.org/10.1175/WAF884.1.
Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.
Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, https://doi.org/10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327, https://doi.org/10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.
Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. Mon. Wea. Rev., 136, 2133–2139, https://doi.org/10.1175/2007MWR2219.1.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, https://doi.org/10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25, https://www.ecmwf.int/en/elibrary/12555-evaluation-probabilistic-prediction-systems.
Thorarinsdottir, T. L., M. Scheuerer, and C. Heinz, 2016: Assessing the calibration of high-dimensional ensemble forecasts using rank histograms. J. Comput. Graph. Stat., 25, 105–122, https://doi.org/10.1080/10618600.2014.977447.
Vannitsem, S., D. S. Wilks, and J. W. Messner, Eds., 2018: Statistical Postprocessing of Ensemble Forecasts. Elsevier, 347 pp.
Wilks, D. S., 2004: The minimum spanning tree histogram as a verification tool for multidimensional ensemble forecasts. Mon. Wea. Rev., 132, 1329–1340, https://doi.org/10.1175/1520-0493(2004)132<1329:TMSTHA>2.0.CO;2.
Wilks, D. S., 2017: On assessing calibration of multivariate ensemble forecasts. Quart. J. Roy. Meteor. Soc., 143, 164–172, https://doi.org/10.1002/qj.2906.
Wilks, D. S., 2018: Enforcing calibration in ensemble postprocessing. Quart. J. Roy. Meteor. Soc., 144, 76–84, https://doi.org/10.1002/qj.3185.