*k*th ensemble member is the *k*/(*K* + 1) quantile of the distribution of the verification, conditioned on the ensemble. Here, *K* is the total number of ensemble members. To be more specific, let *X*_{1} … *X*_{K} denote the ensemble members, which are random variables with values in ℝ, with *X*_{1} < ··· < *X*_{K} with probability 1. The verification is a random variable *Y*, also with values in ℝ. This setup rests on the *quantile* interpretation of ensembles. This states that there is an underlying random distribution function Γ, the *probability forecast*, so that the *k*th ensemble member is the *k*/(*K* + 1) quantile of Γ, that is, the unique solution of the equation Γ(*x*) = *k*/(*K* + 1) for *x*. It needs to be assumed that Γ is strictly monotonically increasing for the ensemble members to be well defined. The probability forecast Γ is *reliable* if

P(*Y* ≤ *y* | Γ) = Γ(*y*) for all *y*. (2)

Together with a distribution for Γ (which is unimportant in the present context), these determinations completely specify a probabilistic model for (*Y*, *X*_{1}, … , *X*_{K}, Γ). The reliability condition (2) together with the quantile interpretation imply the null hypothesis (1), as we now show. Since the ensemble members are functions of Γ, we have

P(*Y* > *X*_{k} | *X*_{1}, … , *X*_{K}, Γ) = P(*Y* > *X*_{k} | Γ). (3)

If reliability holds though, we can write

P(*Y* > *X*_{k} | Γ) = 1 − Γ(*X*_{k}) = 1 − *k*/(*K* + 1).

Substituting this in (3) and taking the expectation conditioned on *X*_{k}, we obtain

P(*Y* > *X*_{k} | *X*_{k}) = 1 − *k*/(*K* + 1),

which is the null hypothesis (1).

The interpretation of ensembles as quantiles is not the only possible one. An alternative to the quantile interpretation is what we will call the Monte Carlo interpretation, which states that the ensemble members are independent draws (i.e., a sample) from some underlying distribution. The quantile and Monte Carlo interpretations are not dissimilar, and for many applications the difference between the two is unimportant. (We are considering the case of verifications in one dimension only; in higher dimensions, the quantile interpretation of course ceases to apply.) For a reliability test through CEPs, though, the difference causes problems. Although the Monte Carlo interpretation was mentioned in Mason et al., we feel that the ensuing problems were not sufficiently highlighted.

Mason et al. write:

> There are two sources of sampling errors in estimating the regression parameters: sampling errors arising from an insufficient number of forecasts, and inaccuracies in estimating the quantiles of the ensemble distribution. The first source of error is common to all verification methods, but better estimates of the quantiles could be obtained by increasing the ensemble size or by fitting a distribution to the ensemble members and calculating the quantiles of the fitted distribution (provided that a distribution can be found that estimates the quantiles well). The CEPs are therefore likely to be estimated most accurately given large ensemble sizes, and for those ensemble members close to the median.

The problem with this approach results from the fact that the quantile estimates appear both in the conditioning of the CEPs and in the definition of the exceedance event itself. The null hypothesis (1) is equivalent to stating that the exceedance events *E*_{k} := (*Y* > *X*_{k}) and the exact quantiles *X*_{k} are independent. [Two random variables *Y*_{1}, *Y*_{2} are independent if and only if the conditional distribution of *Y*_{1}, given *Y*_{2}, does not depend on *Y*_{2}.] As we will show presently, *estimates* of the quantiles will in general *not* be independent of the exceedance events, even if the reliability condition (2) holds. Therefore, variations in the quantile estimates will influence the CEP estimates in a systematic way. Thus, the CEP curves are not expected to be flat even if the forecast is reliable.

Let *ξ*_{k}, *k* = 1 … *K*, denote estimates of the quantiles. If the ensemble is interpreted as Monte Carlo, we can assume the *ξ*_{k} to be functions of the order statistics, since these are sufficient for any continuous distribution function. Therefore, assuming^{1} that the reliability condition (2) holds, we have

P(*Y* > *ξ*_{k} | Γ) = 1 − Γ(*ξ*_{k}).

Taking the expectation conditioned on *ξ*_{k}, this gives

P(*Y* > *ξ*_{k} | *ξ*_{k}) = 1 − E[Γ(*ξ*_{k}) | *ξ*_{k}].

The right-hand side is constant only if *ξ*_{k} is exactly equal to a quantile of Γ or, more precisely, only if Γ(*ξ*_{k}) does not depend on the value of *ξ*_{k}.

We illustrate our arguments by constructing an artificial, perfectly reliable, Monte Carlo ensemble as follows: at each time instance, we draw a number *υ* from a uniform distribution on [0, 1]. We construct the ensemble *ξ*_{1} … *ξ*_{K} and the verification *Y* at each instance by adding independent and identically distributed (iid) random variables to *υ*, drawn from a Gaussian distribution with zero mean and unit variance. Our dataset comprised 10 000 instances, and we used ensembles of size *K* = 24. This ensemble is reliable per construction.
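This construction is straightforward to reproduce. The following sketch (a hypothetical re-implementation in plain Python, not the original code) builds the ensemble and checks reliability directly: since the verification and the ensemble members are exchangeable given *υ*, the verification should fall below the *k*th order statistic with relative frequency *k*/(*K* + 1).

```python
import random

random.seed(1)
K, N = 24, 10_000               # ensemble size, number of instances

ensembles, verifications = [], []
for _ in range(N):
    v = random.random()                                    # hidden signal, uniform on [0, 1]
    members = sorted(v + random.gauss(0.0, 1.0) for _ in range(K))
    y = v + random.gauss(0.0, 1.0)                         # verification, same conditional law
    ensembles.append(members)
    verifications.append(y)

# Exchangeability check: Y falls below the kth order statistic
# with relative frequency k/(K + 1)
k = 12
freq = sum(y < ens[k - 1] for ens, y in zip(ensembles, verifications)) / N
print(round(freq, 2))            # close to 12/25 = 0.48
```

The rank of the verification among the 25 exchangeable values is uniform, which is exactly what a flat rank histogram, and hence reliability, means here.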

A reasonable estimator for the *k*th quantile is in fact the *k*th order statistic itself. (The quotation above suggests that Mason et al. are primarily thinking of this case.) For more quantitative information about the order statistics as quantile estimators, see Theorem 14 of chapter VI in Mood et al. (1974). For each individual ensemble member *ξ*_{k}, we obtain exceedance events *E*_{k} = (*Y* > *ξ*_{k}). Using the same methodology as Mason et al., we use logistic models with two parameters (intercept and slope) to fit the CEPs using maximum likelihood. Finally, we evaluate the resulting model on all instances in the dataset.

The results are presented in Fig. 1. Visually, these CEP curves are far from flat. As suggested in Mason et al., the logistic model can be tested for zero slope using a *χ*^{2} test. These tests produced *p* values very close to zero for all curves displayed in Fig. 1. The ensemble would therefore be interpreted as unreliable according to Mason et al.
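The non-flat CEPs can be reproduced without any statistics package. The sketch below (again hypothetical code; the two-parameter logistic model is fitted by a hand-rolled Newton iteration rather than the software used by Mason et al.) fits the CEP of the median member and prints the fitted slope, which comes out markedly negative even though the ensemble is reliable by construction.

```python
import math
import random

random.seed(2)
K, N, k = 24, 10_000, 12        # ensemble size, number of instances, member index

xs, es = [], []                 # order statistic value, exceedance indicator
for _ in range(N):
    v = random.random()                                    # hidden signal
    members = sorted(v + random.gauss(0.0, 1.0) for _ in range(K))
    y = v + random.gauss(0.0, 1.0)                         # verification
    xi = members[k - 1]          # kth order statistic, estimating the k/(K+1) quantile
    xs.append(xi)
    es.append(1.0 if y > xi else 0.0)

def fit_logistic(xs, es, iters=25):
    """Maximum-likelihood fit of P(E = 1 | x) = 1/(1 + exp(-(a + b*x))) via Newton."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, e in zip(xs, es):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += e - p                  # gradient of the log-likelihood
            g1 += (e - p) * x
            h00 += w                     # entries of the (negative) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_logistic(xs, es)
print(round(b, 2))               # markedly negative: the fitted CEP is far from flat
```

At this sample size, a *χ*^{2} test for zero slope, as proposed by Mason et al., would reject decisively, even though the ensemble is perfectly reliable.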

In Fig. 2 we present the CEP curves for the median of ensembles of different sizes. Again, there is significant dependence on the value of the median for all *K*. We confirm that the CEP depends less and less on the value of the median as the ensemble gets larger. For the ensemble sizes shown in Fig. 2 though, the *p* values (testing for zero slope) are still very close to zero. Beyond ensemble sizes of approximately 200, the *p* value for zero slope becomes greater than 0.1. For comparison, we also analyzed an ensemble comprising exact quantiles. The resulting CEPs (estimated with logistic regression) are shown in Fig. 3. These CEPs are indeed flat, as expected from the analysis. The *p* values for zero slope are essentially uniformly distributed in this situation. This demonstrates that the general conclusion of this paper is not a mere artifact of the numerical example.
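The exact-quantile comparison is equally easy to emulate: here Γ is a Gaussian with unit variance centered on *υ*, so its exact *k*/(*K* + 1) quantile is *υ* shifted by the corresponding standard normal quantile. The sketch below (hypothetical code; a simple two-bin frequency comparison stands in for the logistic regression) confirms that the exceedance frequency then does not depend on the value of the quantile.

```python
import random
from statistics import NormalDist

random.seed(3)
K, N, k = 24, 10_000, 12
q_k = NormalDist().inv_cdf(k / (K + 1))      # exact k/(K+1) quantile of N(0, 1)

pairs = []                                    # (exact quantile, exceedance indicator)
for _ in range(N):
    v = random.random()                       # hidden signal
    xi = v + q_k                              # exact quantile of Gamma = N(v, 1)
    y = v + random.gauss(0.0, 1.0)            # verification
    pairs.append((xi, 1.0 if y > xi else 0.0))

# Split instances into low and high quantile values; under the null hypothesis
# the exceedance frequency is the same in both halves, namely 1 - k/(K+1)
pairs.sort()
low  = sum(e for _, e in pairs[: N // 2]) / (N // 2)
high = sum(e for _, e in pairs[N // 2 :]) / (N // 2)
print(round(low, 2), round(high, 2))          # both close to 1 - 12/25 = 0.52
```

The two frequencies agree because the exact quantiles are independent of the exceedance events; substituting the *k*th order statistic of a simulated ensemble for `q_k` makes them differ markedly.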

The conclusion of this work is that the CEP test for reliability cannot be applied if the quantiles are not known or, more precisely, if the ensembles are not interpreted as quantiles, so that the quantiles have to be estimated. The reason is that errors in the quantile estimates are not independent of the exceedance events; the CEP curves are therefore not constant even for a reliable forecast, although constancy is a fundamental premise of the test. In a simple numerical example, the test would reject the null hypothesis even though the ensemble was reliable by construction.

## REFERENCES

Mason, S. J., J. S. Galpin, L. Goddard, N. E. Graham, and B. Rajaratnam, 2007: Conditional exceedance probabilities. *Mon. Wea. Rev.*, **135**, 363–372.

Mood, A. M., F. A. Graybill, and D. C. Boes, 1974: *Introduction to the Theory of Statistics*. 3rd ed. McGraw-Hill, 480 pp.

^{1} A further assumption here is that Γ is sufficient for *ξ*_{k}, that is, the *ξ*_{k} are given by Γ and possibly some further randomization independent of *Y*.