
Comparing Probabilistic Forecasting Systems with the Brier Score

  • 1 School of Engineering, Computing and Mathematics, University of Exeter, Exeter, United Kingdom

Abstract

This article considers the Brier score for verifying ensemble-based probabilistic forecasts of binary events. New estimators for the effect of ensemble size on the expected Brier score, and associated confidence intervals, are proposed. An example with precipitation forecasts illustrates how these estimates support comparisons of the performances of competing forecasting systems with possibly different ensemble sizes.

Corresponding author address: C. Ferro, School of Engineering, Computing and Mathematics, University of Exeter, Harrison Bldg., North Park Rd., Exeter EX4 4QF, United Kingdom. Email: c.a.t.ferro@exeter.ac.uk


1. Introduction

Verification scores are commonly used as numerical summaries for the quality of weather forecasts. General introductions to forecast verification are given by Jolliffe and Stephenson (2003) and Wilks (2006, chapter 7). There are many situations in which we may wish to compare the values of a verification score computed for two sets of forecasts: to compare the quality of forecasts from a single forecasting system at different times or locations, or in different meteorological conditions, or to compare the quality of forecasts from two forecasting systems. Several authors have recommended that measures of uncertainty for the scores, such as standard errors or confidence intervals, should be computed to aid such comparisons. Woodcock (1976), Seaman et al. (1996), Kane and Brown (2000), Stephenson (2000), Thornes and Stephenson (2001), and Wilks (2006, section 7.9) propose confidence intervals for scores of deterministic binary-event forecasts. Bradley et al. (2003) use simulation to compare the biases and standard errors of different estimators for several scores of probabilistic binary-event forecasts, but do not discuss estimators for the standard errors. Hamill (1999) takes a different approach and proposes hypothesis tests for comparing the scores of two sets of deterministic or probabilistic forecasts; see also Briggs and Ruppert (2005). Jolliffe (2007) reviews this work and related ideas, and also presents confidence intervals for the correlation coefficient.

Woodcock (1976) explains the motivation for these attempts to quantify uncertainty. The value of a score depends on the choice of target observations, so the superiority of one forecasting system over another as gauged by their verification scores computed for only finite samples of forecasts and observations cannot be definitive: the superiority may be reduced or even reversed were the systems applied to new target observations. The methods listed previously estimate the variation that would arise in the value of a verification score were forecasts made for different sets of potential target observations, and thereby quantify the uncertainty about some “true” value that would be known were forecasts available for all potential observations. We shall consider uncertainty in expected values of the Brier verification score (Brier 1950) in the case of ensemble-based probabilistic binary-event forecasts. Unbiased estimators for the expected Brier score that would be obtained for any ensemble size are defined in section 2. Standard errors and confidence intervals are developed in section 3, and their performance is assessed with a simulation study in section 4. Methods for comparing the Brier scores of two forecasting systems are presented in section 5. Confidence intervals appropriate for comparing Brier scores of two systems simultaneously for multiple events and sites are described in section 6.

The methods are illustrated throughout the paper with seasonal precipitation forecasts from the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) project (Palmer et al. 2004). In particular, 3-month-ahead, nine-member ensemble forecasts of total October precipitation produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) and Météo-France (MF) global atmosphere–ocean coupled models are compared with observations recorded at Jakarta (6.17°S, 106.82°E) for the years 1958–95. Data are missing for 1983, leaving 37 yr. The ensembles were generated by sampling independent perturbations of the initial ocean state. The Jakarta observations and the forecasts from the corresponding grid box are shown in Fig. 1.

2. Expected Brier scores

a. Definitions

We define the Brier score together with notation that will be used throughout the rest of the paper. Let {Xt : t = 1, . . . , n} be a set of n observations, and let {(Yt,1, . . . , Yt,m) : t = 1, . . . , n} be a corresponding set of m-member ensemble forecasts. For each time t, suppose that we issue a probabilistic forecast, p̂t, for the event that observation Xt exceeds a threshold u. The Brier score for this set of forecasts, equal to one-half of the score originally proposed by Brier (1950), is the mean squared difference between the forecasts and the indicator variables for the event; that is,
$$\hat{B}_{m,n} = \frac{1}{n}\sum_t (\hat{p}_t - I_t)^2,$$
where It = I(Xt > u), I(A) = 1 if A is true, and I(A) = 0 if A is false. All summations will be over t = 1, . . . , n unless otherwise specified.
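
For concreteness, a minimal Python sketch of this computation is given below. It assumes the forecasts are supplied as a vector of probabilities p̂t and the observations as a vector of values xt; the array and function names are illustrative only, not the author's released R code.

```python
import numpy as np

def brier_score(p_hat, x, u):
    """Brier score: mean squared difference between forecast probabilities
    p_hat and the binary event indicators I(x > u)."""
    p_hat = np.asarray(p_hat, dtype=float)
    indicator = (np.asarray(x) > u).astype(float)
    return np.mean((p_hat - indicator) ** 2)

# Illustrative use with proportion forecasts from an n x m ensemble array y:
# p_hat = np.mean(y > v, axis=1)
# score = brier_score(p_hat, x, u)
```
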
We mentioned in the previous section that the variation in verification scores caused by the choice of target observations leads to uncertainty about the quality of the forecasting system. One quantity of interest that we may be uncertain about is the expected Brier score, denoted
$$B_m = \mathrm{E}(\hat{B}_{m,n}), \qquad (1)$$
and defined as the average Brier score over repeated samples from a population of observations and forecasts. This population can be defined implicitly by assuming that the available sample of observations and forecasts is in some sense representative of the larger population. We assume that the population is a stationary sequence of which our data form a partial realization. This is likely to be a good approximation in a stable climate and could be generalized for a changing climate by assuming, for example, that the data are a partial realization of a nonstationary, multivariate time series model chosen to represent climatic change. We shall concentrate on Bm, but other quantities, such as the conditional expected Brier score discussed in appendix A, may also be of interest.

b. The effects of ensemble size

We investigate how Bm depends on the ensemble size m. By stationarity, the expectation of (p̂t − It)² is the same for all t, so we can write
$$B_m = \mathrm{E}\{(\hat{p} - I)^2\},$$
where p̂ and I are the forecast and the event indicator for an arbitrary time. This expectation is an average over all possible values of the observation variable X and the ensemble Y = (Y1, . . . , Ym). Now let Z denote sufficient information about the forecasting model to determine a probability distribution for Y, given which Y is independent of X. This information might be the model specification plus a probability distribution for its initial conditions, for example. The law of total expectation (e.g., Severini 2005, p. 55) says that we can obtain Bm in two stages: first by taking the expectation with respect to X and Y when Z is held fixed, and then averaging over Z. This is written
$$B_m = \mathrm{E}[\mathrm{E}\{(\hat{p} - I)^2 \mid Z\}].$$
The interior, conditional expectation is
$$\mathrm{E}\{(\hat{p} - I)^2 \mid Z\} = \mathrm{E}(\hat{p}^2 \mid Z) - 2P\,\mathrm{E}(\hat{p} \mid Z) + P, \qquad (2)$$
where P = E(I|Z) = Pr(X > u|Z) is the probability with which the event occurs.
We must specify how p̂ is formed from the ensemble members in order to reveal the effects of ensemble size. We choose forecasts equal to the proportion of members that exceed a threshold υ, possibly different from u; that is,
$$\hat{p} = \frac{K}{m}, \qquad (3)$$
where K is the number of members exceeding υ. Alternative forecasts could be considered, for example, (K + a)/(m + b) with b ≥ a ≥ 0, although these lead to more complicated formulas later.

For simplicity, we assume that the members within any particular ensemble are exchangeable. Exchangeability means that the members are indistinguishable by their statistical properties: their joint distribution function is invariant to relabeling the members. This admits homogeneous dependence between members and includes the special case of independent and identically distributed members.

Exchangeability implies that all members of an ensemble exceed υ with the same probability,
$$Q = \Pr(Y_1 > \upsilon \mid Z),$$
and all pairs of members jointly exceed υ with the same probability,
$$R = \Pr(Y_1 > \upsilon,\ Y_2 > \upsilon \mid Z).$$
Taken together with the forecast definition [(3)], we have
$$\mathrm{E}(\hat{p} \mid Z) = Q \qquad (4)$$
$$\text{and}\qquad \mathrm{E}(\hat{p}^2 \mid Z) = \frac{Q + (m - 1)R}{m}, \qquad (5)$$
and the conditional expectation [(2)] equals
$$\mathrm{E}\{(\hat{p} - I)^2 \mid Z\} = \frac{Q + (m - 1)R}{m} - 2PQ + P.$$
Finally, we take the expectation with respect to Z to obtain
$$B_m = \mathrm{E}(P - 2PQ + R) + \frac{1}{m}\,\mathrm{E}(Q - R). \qquad (6)$$
Because P, Q, and R are independent of m, this expression describes completely the effects of ensemble size. Moreover, the final term is non-negative because R ≤ Q. As the ensemble size increases, Bm therefore decreases monotonically to the expected Brier score, B∞, that would be obtained for an infinite ensemble size, and we can write
$$B_M = B_\infty + \frac{m}{M}\,(B_m - B_\infty), \qquad (7)$$
where BM is the expected Brier score that would be obtained for an ensemble of size M. This generalizes the relationship found by Richardson [2001, Eq. (9)] for independent ensemble members, in which case R = Q².

c. Unbiased estimators

The Brier score B̂m,n is an unbiased estimator for Bm by definition [(1)] but is biased for BM when M ≠ m. Estimating BM from ensembles of size m is useful for comparing forecasting systems with different ensemble sizes or for assessing the potential benefit of larger ensembles (cf. Buizza and Palmer 1998). Equations (4) and (5) can be used to show that an unbiased estimator for BM is
$$\hat{B}_{M,n} = \hat{B}_{m,n} - \left(\frac{1}{m} - \frac{1}{M}\right)\frac{m}{m - 1}\cdot\frac{1}{n}\sum_t \hat{p}_t(1 - \hat{p}_t), \qquad (8)$$
and letting M → ∞ yields an unbiased estimator for B∞.
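
A sketch of this estimator in Python is given below, assuming the proportion forecasts of (3) and the adjustment term in the form of (8) above; setting M to infinity gives the estimator of B∞. The function and variable names are illustrative, not the author's released R code.

```python
import numpy as np

def adjusted_brier(y, x, u, v, M=np.inf):
    """Estimate of the expected Brier score B_M for ensembles of size M,
    computed from ensembles of size m > 1 (rows of y) and observations x,
    following the adjustment in Eq. (8)."""
    n, m = y.shape
    if m < 2:
        raise ValueError("the adjustment requires m > 1")
    p_hat = np.mean(y > v, axis=1)                  # proportion forecasts, Eq. (3)
    indicator = (x > u).astype(float)
    raw = np.mean((p_hat - indicator) ** 2)         # Brier score for ensemble size m
    inv_M = 0.0 if np.isinf(M) else 1.0 / M
    adjustment = (1.0 / m - inv_M) * m / (m - 1) * np.mean(p_hat * (1.0 - p_hat))
    return raw - adjustment
```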

1) Remark 1

The new estimator [(8)] is undefined if m = 1, in which case an unbiased estimator for BM (M ≠ 1) does not exist because the forecasts contain no information about the effects of ensemble size. Mathematically, for any function h(K, I) independent of R,
$$\mathrm{E}\{h(K, I) \mid Z\} = \sum_{k=0}^{1}\sum_{i=0}^{1} h(k, i)\,Q^{k}(1 - Q)^{1 - k}\,P^{i}(1 - P)^{1 - i}$$
cannot contain the required R term. Richardson (2001) does, however, develop a method for estimating a skill score based on BM given an ensemble of any size, even m = 1. He achieves this by assuming independent ensemble members (R = Q²) and perfect reliability (Q = P), in which case the expression [(6)] for Bm becomes
$$B_m = \left(1 + \frac{1}{m}\right)\mathrm{E}\{P(1 - P)\},$$
and so
$$B_M = \frac{m(M + 1)}{M(m + 1)}\,B_m. \qquad (9)$$
In this special case, an unbiased estimator can therefore be obtained for BM even when m = 1 by substituting B̂m,n for Bm in the right-hand side of (9).

2) Remark 2

The adjustment term in the definition [(8)] of B̂M,n depends only on the forecasts and is a measure of sharpness (e.g., Potts 2003). Let
$$S = \frac{1}{n}\sum_t \left(\hat{p}_t - \frac{1}{2}\right)^2$$
be the sample variance of the forecasts around one-half: as S decreases from its maximum value (1/4) to its minimum value (0), forecasts become more concentrated around one-half and the sharpness decreases. Now,
$$\hat{B}_{M,n} = \hat{B}_{m,n} - \left(\frac{1}{m} - \frac{1}{M}\right)\frac{m}{m - 1}\left(\frac{1}{4} - S\right),$$
so B̂M,n reduces the Brier score by amounts that depend on the estimated sharpness and the ensemble size, m. For fixed sharpness, the improvement in forecast quality from increasing the ensemble size decreases as m increases: the law of diminishing returns. For fixed m, the improvement decreases as the sharpness increases, suggesting that the improvement may be attributed to the opportunity to shift forecasts slightly farther away from one-half.

3) Remark 3

The Brier score B̂m,n is proper (e.g., Wilks 2006, p. 298) because, if our belief in the occurrence of the event {It = 1} equals p ∈ [0, 1], then the expected contribution to the Brier score with respect to this belief from issuing forecast p̂t, that is,
$$p(\hat{p}_t - 1)^2 + (1 - p)\hat{p}_t^2,$$
is minimized by choosing p̂t = p. Similar calculations show that B̂M,n is improper when M ≠ m because the optimum forecast is then
$$\hat{p}_t = \frac{p + c/2}{1 + c}, \qquad \text{where}\ c = \left(\frac{1}{m} - \frac{1}{M}\right)\frac{m}{m - 1}.$$
Therefore, B̂M,n should not be used in situations where it could be hedged.

d. Exchangeability

We assumed that ensemble members were exchangeable and independent of observations given suitable information, Z. The latter assumption is hard to contest because Z can include the full specification of the forecasting model and its inputs. Exchangeability is more restrictive and would be violated were one member biased relative to the other members, for example, or were one pair of members more strongly correlated than other pairs.

Exchangeability might be justified by the process generating the ensemble. For example, exchangeability will hold if the initial conditions for the members are randomly sampled from a probability distribution. Exchangeability is also likely to hold if the forecast lead time is long enough for any initial ordering or dependence between the members to be lost. This latter argument seems appropriate for our 3-month-ahead rainfall forecasts.

Exchangeability might also be justified by empirical assessment. Romano (1988) describes a bootstrap test for exchangeability based on the maximum distance between the empirical distribution functions of the members and permuted members. Applying this test for the ECMWF and MF ensemble forecasts of Jakarta rainfall gave p values of 0.26 and 0.24, which is only weak evidence for rejecting exchangeability.

The effect of ensemble size on the expected Brier score can still be estimated even when exchangeability is unjustifiable. If we wish to estimate BM for a subset of M < m members, then an unbiased estimator is simply the Brier score evaluated for the forecasts constructed from those M members. This approach is straightforward to implement for any verification score, but is inapplicable when M > m.

e. Data example

We estimate the expected Brier scores Bm and B∞ for the ECMWF and MF rainfall forecasts at a range of event thresholds u and υ. These are shown in Fig. 2, where we set υ = u and let u range from the 10% to the 90% quantiles of the observed rainfall. The MF forecasts appear to have significantly lower Brier scores than do those of the ECMWF for thresholds below 90 mm (about the median observed rainfall), and the two systems have similar Brier scores at higher thresholds. The estimated difference between Bm for MF and B∞ for ECMWF is also large below 90 mm, suggesting that increasing the ECMWF ensemble size would not be sufficient to match the MF Brier score.

3. Sampling variation

a. Standard errors

Point estimates of expected Brier scores were presented in the previous section. In this section, we estimate the uncertainty associated with these estimates due to sampling variation. In particular, we shall estimate standard errors and construct confidence intervals for the expected scores. We assume only that the data are stationary; we no longer need to assume exchangeability or a particular form [(3)] for the forecasts. We shall consider only B̂M,n, the estimator [(8)] for BM, because other estimators can be obtained as special cases by changing M.

We can write B̂M,n as the sample mean of the summands
$$W_t = (\hat{p}_t - I_t)^2 - \left(\frac{1}{m} - \frac{1}{M}\right)\frac{m}{m - 1}\,\hat{p}_t(1 - \hat{p}_t).$$
If the interval between successive times t is large, then we may be justified in making the further assumption that these summands are independent, in which case the standard error of B̂M,n is estimated by
$$\hat{\sigma}_{M,n} = \left\{\frac{1}{n^2}\sum_t \left(W_t - \hat{B}_{M,n}\right)^2\right\}^{1/2}.$$
If the summands are dependent, then estimates of serial correlation may be incorporated into the standard error (e.g., Wilks 1997).
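
Under the independence assumption, the summands and standard error defined above can be computed as in this sketch; the array names are illustrative, and the summand follows the form of the adjustment in (8).

```python
import numpy as np

def brier_summands(y, x, u, v, M=np.inf):
    """Per-time summands W_t whose sample mean is the adjusted Brier score."""
    n, m = y.shape
    p_hat = np.mean(y > v, axis=1)
    indicator = (x > u).astype(float)
    inv_M = 0.0 if np.isinf(M) else 1.0 / M
    c = (1.0 / m - inv_M) * m / (m - 1)
    return (p_hat - indicator) ** 2 - c * p_hat * (1.0 - p_hat)

def standard_error(w):
    """Standard error of the mean of independent summands w."""
    w = np.asarray(w, dtype=float)
    return np.sqrt(np.sum((w - np.mean(w)) ** 2)) / len(w)
```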

There is little evidence for serial dependence in the summands of the Brier scores for our ECMWF and MF rainfall forecasts. For example, a two-sided test for the lag-one autocorrelation (e.g., Chatfield 2004, p. 56) was conducted for both the ECMWF and MF data at each of nine thresholds υ = u ranging from the 10% to the 90% quantiles of the observed rainfall, and only one p value was smaller than 0.1. We assume serial independence hereafter. Standard errors for the estimates of Bm and B∞ are shown in Fig. 2 and are large enough to call into question the statistical significance of the differences noted previously in the quality of the ECMWF and MF forecasts. These differences are assessed more formally in section 5.

b. Confidence intervals

More informative descriptions of uncertainty are afforded by confidence intervals, which we now construct. Unless the summands of B̂M,n exhibit long-range dependence, we can expect a central limit theorem to hold and imply that B̂M,n is approximately normally distributed when n is large. An approximate (1 − 2α) confidence interval for BM would then be
$$\left(\hat{B}_{M,n} + z_\alpha\,\hat{\sigma}_{M,n},\ \hat{B}_{M,n} + z_{1-\alpha}\,\hat{\sigma}_{M,n}\right),$$
where zα is the α quantile of the standard normal distribution.
An alternative approximation to the distribution of B̂M,n is available via the bootstrap method (e.g., Davison and Hinkley 1997). To obtain confidence intervals for BM, the distribution of the studentized statistic
$$T_n = \frac{\hat{B}_{M,n} - B_M}{\hat{\sigma}_{M,n}}$$
is approximated by a bootstrap sample {Tn*i : i = 1, . . . , r}. If summands are independent, then this sample can be formed by repeating the following steps for each i = 1, . . . , r:
  1. Resample W*s uniformly and with replacement from {Wt : t = 1, . . . , n} for each s = 1, . . . , n.
  2. Set B̂*M,n = ΣW*t/n and
    $$\hat{\sigma}^{*}_{M,n} = \left\{\frac{1}{n^2}\sum_t \left(W^{*}_t - \hat{B}^{*}_{M,n}\right)^2\right\}^{1/2}.$$
  3. Set Tn*i = (B̂*M,n − B̂M,n)/σ̂*M,n.
Block bootstrapping (e.g., Wilks 1997) can be employed if the summands are serially dependent. Bootstrap (1 − 2α) confidence intervals are then defined by the limits
$$\hat{B}_{M,n} - \hat{\sigma}_{M,n}T^{*}_{(r+1-k)} \qquad\text{and}\qquad \hat{B}_{M,n} - \hat{\sigma}_{M,n}T^{*}_{(k)},$$
where k = ⌊αr⌋ and T*(1) ≤ . . . ≤ T*(r) are the order statistics of the bootstrap sample. Neither the normal nor the bootstrap confidence limits are guaranteed to fall in the interval [0, 1], so they will always hereafter be truncated at the end points.
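
The three resampling steps and the resulting interval can be sketched as follows, assuming independent summands and resampling indices drawn uniformly with replacement; the truncation to [0, 1] mentioned above is applied at the end, and a small guard keeps k at least 1. Function and variable names are illustrative.

```python
import numpy as np

def bootstrap_interval(w, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) studentized bootstrap interval for the
    expected Brier score, from independent summands w."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    n = len(w)
    b_hat = np.mean(w)
    se = np.sqrt(np.sum((w - b_hat) ** 2)) / n
    t_star = np.empty(r)
    for i in range(r):
        w_star = rng.choice(w, size=n, replace=True)
        b_star = np.mean(w_star)
        se_star = np.sqrt(np.sum((w_star - b_star) ** 2)) / n
        # se_star is zero if all resampled summands are equal; the statistic
        # is set to zero in that degenerate case
        t_star[i] = 0.0 if se_star == 0.0 else (b_star - b_hat) / se_star
    t_star.sort()
    k = max(int(np.floor(alpha * r)), 1)
    lower = b_hat - se * t_star[r - k]        # uses order statistic T*_(r+1-k)
    upper = b_hat - se * t_star[k - 1]        # uses order statistic T*_(k)
    return max(lower, 0.0), min(upper, 1.0)   # truncate to [0, 1]
```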

These confidence intervals can be used to test hypotheses of the form BM = b, for some reference value b that represents minimal forecast quality. If a two-sided (1 − α) confidence interval for BM does not contain b, then the hypothesis is rejected in favor of the two-sided alternative hypothesis BM ≠ b at the 100α% level. One common reference value for Bm is the Brier score, q² + (1 − 2q)ΣIt/n, obtained if the same probability q is forecast at every time t. Another is the expected Brier score, (2m + 1)/(6m) or 1/3, obtained if the forecast at each time t is selected independently from a uniform distribution on either the set {i/m : i = 0, . . . , m} or the interval [0, 1].

The dark gray bands in the top two panels of Fig. 3 are bootstrapped 90% confidence intervals (using r = 5000) for Bm for the ECMWF and MF rainfall forecasts. The ECMWF forecasts are significantly worse than climatology (the constant forecast q = ΣIt/n) at the 10% level for a few thresholds, but are significantly better than random forecasts except between the 30% and 70% quantiles (50–130 mm). The MF forecasts are not significantly different from climatology at any threshold, but are significantly better than random forecasts except between the 45% and 65% quantiles (70–110 mm).

4. Simulation study

a. Serial independence

We compare the performances of the proposed normal and bootstrap confidence intervals for Bm with a simulation study. The performance of an equitailed (1 − 2α) confidence interval is commonly assessed by its achieved coverage and average length in repeated simulated datasets for which the true value of Bm is known. Let i be the point estimate and let Li and Ui be the lower and upper confidence limits computed from the ith of N datasets. The achieved lower and upper coverages are the proportions of times that Bm falls above and below the lower and upper limits; that is,
i1520-0434-22-5-1076-eq19
which should both equal 1 − α. The average length is the mean distance between the lower and upper limits; that is,
$$\frac{1}{N}\sum_{i=1}^{N} (U_i - L_i),$$
which should be as small as possible.

The performance of the confidence intervals depends on the ensemble size m, the sample size n, the thresholds u and υ, the target coverage defined by α, and the joint distribution of the observations and forecasts. We examine the effects of all of these factors in this simulation study, although a complete investigation is impossible. Serially independent observations are simulated from a standard normal distribution. Ensemble members are also normally distributed, and each has a correlation ρ with its contemporary observation but is otherwise independent of the other members. Forecasts are simple proportions [(3)] and we use thresholds u = υ equal to p quantiles of the standard normal distribution. We consider the following values for the various factors: m = 2, 4, 8; n = 10, 20, 40; p = 0.5, 0.7, 0.9; ρ = 0, 0.4, 0.8; and α between 0.005 and 0.05. Results for p = 0.1 and 0.3 would be the same as for p = 0.9 and 0.7, respectively, because the former could be obtained by redefining events as deficits below thresholds, which does not alter the Brier score. We use N = 10 000 datasets and r = 1000 bootstrap samples throughout.
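
One way to generate datasets satisfying this design is sketched below: each member is written as ρ times the observation plus independent noise, which gives marginal standard normal members, correlation ρ with the contemporary observation, and conditional independence between members. This construction is an assumption consistent with the stated design, not necessarily identical to the author's implementation.

```python
import numpy as np

def simulate_dataset(n, m, rho, rng):
    """Serially independent standard normal observations x and an (n, m)
    array of ensemble members, each with correlation rho to its
    contemporary observation and conditionally independent given it."""
    x = rng.standard_normal(n)
    eps = rng.standard_normal((n, m))
    y = rho * x[:, None] + np.sqrt(1.0 - rho ** 2) * eps
    return x, y

# Illustrative use, thresholding at the p quantile of the observations' distribution:
# from scipy.stats import norm
# rng = np.random.default_rng(1)
# x, y = simulate_dataset(n=40, m=8, rho=0.4, rng=rng)
# u = v = norm.ppf(0.7)
```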

We show results for m = 8 and n = 40 only; results are qualitatively similar for different values. Figure 4 shows the lower and upper coverage errors for the normal and bootstrap confidence intervals as α varies and with different values for ρ and p. Figure 5 shows the corresponding lengths. The coverage errors of the lower limits are usually positive (the lower limits are too low and overcover) while the upper errors are often negative (the upper limits are too low and undercover). The errors are always smaller for the bootstrap limits than for the normal limits. The bootstrap achieves this by shrinking the lower limits and extending the upper limits compared to the normal limits (not shown) to capture asymmetry in the sampling distribution of the Brier score, producing wider intervals for the bootstrap, as revealed by Fig. 5. The interval lengths decrease as both α and ρ increase.

b. Serial dependence

To investigate the sensitivity of the results to the presence of serial dependence, the simulations were repeated with observations generated from a first-order moving-average process with correlation 0.5 at lag one. The standard errors were adjusted for the lag-one correlation and the block bootstrap was employed with blocks of size two. Results (not shown) were qualitatively similar to those for serial independence, except that both positive and negative lower coverage errors were found. Errors were larger and intervals wider because of the smaller effective sample sizes.

c. Modified bootstrap intervals

The bootstrap coverage errors in Fig. 4 are typically less than α/2. Errors decrease as n increases, so these intervals may be acceptable for many applications. Improvements are desirable, however, particularly for small sample sizes and rare events. Several modifications have been explored by the author, specifically basic and percentile bootstrap intervals and bootstrap calibration (Davison and Hinkley 1997, chapter 5) and a continuity correction (Hall 1987) to account for the discrete nature of the summands of the Brier score. None of these methods improved significantly on the studentized intervals presented above. A variance-stabilizing transformation proposed by DiCiccio et al. (2006) was also applied and found to reduce the coverage error in the lower limit, especially for rare events for which errors were approximately halved. The effect on the upper limits was small. A large part of the coverage error in small samples arises from the fact that the Brier score can take only a small number of distinct values. One way to reduce these errors is to smooth the Brier score by adding a small amount of random noise (Lahiri 1993). Investigations unreported here show that this can indeed reduce coverage errors significantly at the expense of widening the confidence intervals. However, results depend strongly on the amount of smoothing employed and making general recommendations is difficult. An alternative solution could be to fit joint probability distributions to the observations and forecasts before determining the forecast probabilities (Bradley et al. 2003). This would allow the forecasts, and hence the Brier score, to take any values on the interval [0, 1], and so avoid discretization errors. Another advantage would be the avoidance of intervals with zero length, which occurs for both the normal and bootstrap intervals described above when all summands of the Brier score are equal.

5. Comparing Brier scores

Consider the task of comparing the Brier scores of two forecasting systems, the first with ensemble size m and the second with ensemble size m′, verified against the same set of observations. Quantities pertaining to the second system will be distinguished with primes. We can compare the two systems by constructing a confidence interval for the difference, BM − B′M, between the Brier scores that would be expected were both ensemble sizes equal to M. Such a comparison may help to identify whether or not a perceived superiority of one system is due only to its larger ensemble size. If the (1 − α) confidence interval does not contain zero, then the null hypothesis of equal scores is rejected in favor of the two-sided alternative at the 100α% level. We estimate the difference between the Brier scores using unbiased estimators [(8)], though the subsampling approach described at the end of section 2d could also be used.

Normal confidence intervals are defined by
$$\left(\hat{B}_{M,n} - \hat{B}'_{M,n} + z_\alpha\,\hat{\sigma}_{M,n},\ \hat{B}_{M,n} - \hat{B}'_{M,n} + z_{1-\alpha}\,\hat{\sigma}_{M,n}\right),$$
where, if there is no serial dependence,
$$\hat{\sigma}^2_{M,n} = \frac{1}{n^2}\sum_t \left\{(W_t - W'_t) - (\hat{B}_{M,n} - \hat{B}'_{M,n})\right\}^2.$$
As in section 3, this can be adjusted to account for serial dependence. Bootstrap intervals approximate the distribution of
$$D_n = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_M - B'_M)}{\hat{\sigma}_{M,n}}$$
with a bootstrap sample {Dn*i : i = 1, . . . , r}. If the summands are serially independent, then this sample can be formed by repeating the following steps for each i = 1, . . . , r.
  1. Resample pairs (W*s, W ′*s) uniformly and with replacement from {(Wt, W ′t) : t = 1, . . . , n} for each s = 1, . . . , n.
  2. Compute B̂*M,n, B̂′*M,n, and σ̂*M,n for the resampled data.
  3. Set Dn*i = [(B̂*M,n − B̂′*M,n) − (B̂M,n − B̂′M,n)]/σ̂*M,n.
The first step preserves dependence between contemporary summands of the two scores. Block bootstrapping may again be employed if the summands are serially dependent, and confidence limits take the form
$$\hat{B}_{M,n} - \hat{B}'_{M,n} - \hat{\sigma}_{M,n}D^{*}_{(r+1-k)}$$
and
$$\hat{B}_{M,n} - \hat{B}'_{M,n} - \hat{\sigma}_{M,n}D^{*}_{(k)}.$$
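
A sketch of this paired resampling scheme, operating directly on the two systems' summand vectors (illustrative names w1 and w2), is given below; resampling whole pairs preserves the dependence between contemporary summands, as noted above.

```python
import numpy as np

def bootstrap_difference_interval(w1, w2, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) bootstrap interval for the difference between
    two expected Brier scores, from paired summand vectors w1 and w2."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(w1, dtype=float) - np.asarray(w2, dtype=float)
    n = len(d)
    d_hat = np.mean(d)
    se = np.sqrt(np.sum((d - d_hat) ** 2)) / n
    d_star = np.empty(r)
    for i in range(r):
        idx = rng.integers(0, n, size=n)      # resample whole pairs (times)
        d_res = d[idx]
        m_res = np.mean(d_res)
        se_res = np.sqrt(np.sum((d_res - m_res) ** 2)) / n
        d_star[i] = 0.0 if se_res == 0.0 else (m_res - d_hat) / se_res
    d_star.sort()
    k = max(int(np.floor(alpha * r)), 1)
    return d_hat - se * d_star[r - k], d_hat - se * d_star[k - 1]
```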

Bootstrapped 90% confidence intervals for the difference between Bm for the ECMWF and MF forecasts are illustrated by the dark gray bands in Fig. 3 (bottom panel). The scores are significantly different at the 10% level between only the 0.3- and 0.4-quantile thresholds (50–70 mm).

The statistical significance of the differences between Brier scores can also be quantified using hypothesis tests. The powers of four such tests are investigated in appendix B, where the permutation test is found to be an attractive alternative to the bootstrap test presented above. The permutation test yields similar results for our data, however, with p values less than 0.1 between only the 0.3- and 0.4-quantile thresholds.

6. Multiple comparisons

We have so far constructed confidence intervals separately for each threshold u. These intervals are designed to contain the quantity of interest, such as an expected score or the difference between two expected scores, with a certain probability at each individual threshold. We may wish, however, to construct confidence intervals such that the quantity of interest is contained simultaneously within the intervals at all thresholds with a certain probability. We describe how to construct such confidence intervals in this section.

Denote by B(u) the quantity of interest at threshold u and suppose that we want to consider a collection S of thresholds. We aim to find confidence limits L(u) and U(u) for each u ∈ S such that
$$\Pr\{L(u) \le B(u) \le U(u)\ \text{for all}\ u \in S\} = 1 - 2\alpha. \qquad (10)$$
If we used the (1 − 2α) confidence limits proposed in previous sections for L(u) and U(u), then this probability would be less than 1 − 2α. For example, if scores were independent between thresholds, then the probability would be (1 − 2α)^|S|, where |S| is the number of thresholds.
Simultaneous confidence limits can be obtained using a bootstrap method described by Davison and Hinkley (1997, section 4.2.4). Suppose that equitailed confidence intervals at each threshold u are based on a bootstrap sample {T*i(u) : i = 1, . . . , r} and have the form
$$\left(\hat{B}(u) - \hat{\sigma}(u)\,T^{*}_{(r+1-k)}(u),\ \hat{B}(u) - \hat{\sigma}(u)\,T^{*}_{(k)}(u)\right)$$
for some 1 ≤ k ≤ r/2, where B̂(u) and σ̂(u) denote the estimate and its standard error at threshold u. If we use these limits to form simultaneous intervals, then the bootstrap estimate of the coverage probability [(10)] is
$$\frac{1}{r}\sum_{i=1}^{r} I\left\{T^{*}_{(k)}(u) \le T^{*i}(u) \le T^{*}_{(r+1-k)}(u)\ \text{for all}\ u \in S\right\}.$$
It is sufficient, therefore, to choose k such that this estimate is as close as possible to 1 − 2α.

The resampling must preserve dependence between thresholds: the statistics {T*i(u) : uS} should be computed from the same data for each i. So, resampling schemes take the following form.

  1. Resample (X*s, Y*s,1, . . . , Y*s,m) from {(Xt, Yt,1, . . . , Yt,m) : t = 1, . . . , n} for each s = 1, . . . , n.
  2. Compute T*i(u) for all u ∈ S.

The size of the resample may also need to be larger for simultaneous intervals. If scores were independent across thresholds, the worst case, then the bootstrap estimate of the coverage would be approximately (1 − 2k/r)^|S|. If this is to equal 1 − 2α, then we require r = 2k/[1 − (1 − 2α)^(1/|S|)] ≈ k|S|/α for large |S|. If α = 0.05 and we want k = 5 for example, then r ≈ 100|S|.
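
The choice of k can be automated as in the sketch below, where t_star is an r × |S| array holding the studentized bootstrap statistics for all thresholds computed from common resamples; the function returns the largest k whose estimated simultaneous coverage is still at least 1 − 2α. The array layout and names are assumptions made for illustration.

```python
import numpy as np

def choose_k(t_star, alpha=0.05):
    """Largest k whose bootstrap estimate of simultaneous coverage is at
    least 1 - 2*alpha.  t_star: (r, S) array; column s holds the r
    studentized statistics for threshold u_s."""
    r = t_star.shape[0]
    t_sorted = np.sort(t_star, axis=0)
    best_k = 1
    for k in range(1, r // 2 + 1):
        low = t_sorted[k - 1, :]              # T*_(k)(u) at each threshold
        high = t_sorted[r - k, :]             # T*_(r+1-k)(u) at each threshold
        inside = np.all((t_star >= low) & (t_star <= high), axis=1)
        if np.mean(inside) >= 1.0 - 2.0 * alpha:
            best_k = k                        # coverage shrinks as k grows
        else:
            break
    return best_k
```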

The dark gray bands in Fig. 6 are bootstrapped, simultaneous 90% confidence intervals for Bm for the ECMWF and MF rainfall forecasts. Considering all thresholds together, then, we find that, at the 10% level, neither the ECMWF nor MF forecasts differ significantly from climatology. The evidence for a difference between the ECMWF and MF forecasts is marginal at the 10% level.

7. Discussion

This article identified the effect of ensemble size on the expected Brier score [(7)] and, given ensembles of size m, an unbiased estimator [(8)] for the expected Brier score that would be obtained for any other ensemble size. We assumed that ensemble members were exchangeable, an acceptable assumption when the forecast lead time is long enough for systematic differences between members to be lost. We proposed standard errors and confidence intervals for the expected Brier scores and found that bootstrap intervals performed well in a simulation study. When comparing the Brier scores from two forecasting systems, we proposed comparing estimates of expected Brier scores that would be obtained were the ensemble sizes equal, and described confidence intervals for their difference. We showed that if the Brier scores for several event definitions are of interest, then it is possible to construct confidence intervals that simultaneously contain with a specified probability the expected scores for all events.

We applied our methods to two sets of rainfall forecasts. For forecasting low rainfall totals, MF forecasts had lower Brier scores than ECMWF forecasts, even after estimating the effect of increasing the ECMWF ensemble size to infinity. Standard errors and confidence intervals suggested that the scores were only marginally significantly different at the 10% level for a few thresholds, and neither set of forecasts performed better than forecasting climatology.

Müller et al. (2005) have aims similar to ours but for the more general quadratic ranked probability score (RPS). They note that the expected RPS for perfectly calibrated but random ensemble forecasts exceeds the RPS obtained by forecasting climatology, which is equivalent to a perfectly calibrated random ensemble forecast with infinite ensemble size. This is analogous to Bm exceeding B. Instead of using climatology as the reference forecast in RPS skill scores, they therefore propose using a perfectly calibrated random ensemble forecast with an ensemble size equal to that of the forecasts being assessed. This is equivalent to our proposal of comparing m,n with Bm instead of B.

Müller et al. (2005) also produce confidence bands representing the sampling variation in the RPS skill score for random forecasts that arises among different observation–forecast datasets. Comparing a forecast system’s skill score with these bands provides a guide to its statistical significance relative to a random forecast, but does not provide a formal statistical test because the sampling variation in the system’s skill score is ignored. Our confidence intervals differ substantially: they are confidence intervals for the expected score of the forecast system being employed and can, therefore, be used to make comparisons with the expected score of any reference forecast, not only random forecasts, and can also be used to compare the expected scores of two forecasting systems.

The methods presented in this article can be extended in several ways. We have defined events as exceedances of thresholds for simplicity, but the same methods could be applied for events defined by membership of general sets. We have also considered scalar observations and forecasts for simplicity, but multivariate data can be handled with the same methods; for example, events could be defined by membership of multidimensional sets. The methods presented here can also be extended to multiple-category Brier scores (Brier 1950) and to the RPS. Computer code for the procedures presented in this article and written in the statistical programming language R is available from the author.

Acknowledgments

Conversations with Professor I. T. Jolliffe and Drs. C. A. S. Coelho, D. B. Stephenson, and G. J. van Oldenborgh (who provided the data), plus comments from the referees, helped to motivate and improve this work.

REFERENCES

  • Bergsma, W. P., 2004: Testing conditional independence for continuous random variables. EURANDOM Tech. Rep. 2004-048, 19 pp.

  • Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

  • Briggs, W., and D. Ruppert, 2005: Assessing the skill of yes/no predictions. Biometrics, 61, 799–807.

  • Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.

  • Chatfield, C., 2004: The Analysis of Time Series: An Introduction. Chapman and Hall, 333 pp.

  • Davison, A. C., and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 592 pp.

  • DiCiccio, T. J., A. C. Monti, and G. A. Young, 2006: Variance stabilization for a scalar parameter. J. Roy. Stat. Soc., 68B, 281–303.

  • Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.

  • Hall, P., 1987: On the bootstrap and continuity correction. J. Roy. Stat. Soc., 49B, 82–89.

  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.

  • Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650.

  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.

  • Kane, T. L., and B. G. Brown, 2000: Confidence intervals for some verification measures–a survey of several methods. Preprints, 15th Conf. on Probability and Statistics, Asheville, NC, Amer. Meteor. Soc., 46–49.

  • Lahiri, S. N., 1993: Bootstrapping the Studentized sample mean of lattice variables. J. Mult. Anal., 45, 247–256.

  • Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18, 1513–1523.

  • Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.

  • Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 13–36.

  • Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.

  • Romano, J. P., 1988: A bootstrap revival of some nonparametric distance tests. J. Amer. Stat. Assoc., 83, 698–708.

  • Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of yes-no forecasts. Aust. Meteor. Mag., 45, 49–53.

  • Severini, T. A., 2005: Elements of Distribution Theory. Cambridge University Press, 515 pp.

  • Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.

  • Thornes, J. E., and D. B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8, 307–314.

  • Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82.

  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. Academic Press, 627 pp.

  • Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104, 1209–1214.

APPENDIX A

Conditional Brier Scores

Unbiased estimators

We discussed the expected Brier score [(1)] in the main part of the paper, where the expectation was taken over repeated sampling of forecasts and observations. We investigate the conditional expected Brier score in this appendix, where the expectation is taken over repeated sampling of forecasts, but where the observations remain fixed. This quantity would be of interest if we wanted to know how a forecasting system would have performed on a particular set of target observations for different ensemble sizes. As before, we shall find the effects of ensemble size and construct unbiased estimators, standard errors, and confidence intervals.

The only source of variation in the conditional case is the generation of ensemble members: each observation Xt, and the corresponding model details Zt, are fixed. Consequently, we no longer need to assume stationarity, and the conditional expected Brier score is
$$B_{m,n} = \frac{1}{n}\sum_t \mathrm{E}\{(\hat{p}_t - I_t)^2 \mid X_t, Z_t\},$$
since Xt determines It. To see the effects of ensemble size, we assume again that the forecasts p̂t are simple proportions [(3)] and that ensemble members are exchangeable. Then, for each t,
$$\mathrm{E}(\hat{p}_t \mid X_t, Z_t) = Q_t$$
and
$$\mathrm{E}(\hat{p}_t^2 \mid X_t, Z_t) = \frac{Q_t + (m - 1)R_t}{m}$$
for some Qt and Rt independent of m, and
$$B_{m,n} = B_{\infty,n} + \frac{1}{mn}\sum_t (Q_t - R_t),$$
where B∞,n is the conditional counterpart of B∞. The adjusted Brier score [(8)] is again unbiased for BM,n.

Standard errors

Estimating the uncertainty about BM,n due to sampling variation is harder than for BM because we no longer assume stationarity. The contribution to the sampling variation must therefore be quantifiable for each ensemble separately. This is easier if we strengthen our assumption of exchangeability to one of independent and identically distributed members. This assumption is difficult to test empirically for ensemble forecasts with a distribution that changes through time and requires further investigation (cf. Bergsma 2004), so we appeal to the long lead time of our rainfall forecasts for justification. In this case, the number Kt of members that forecast the event in the ensemble at time t has a binomial distribution with mean mQt and variance mQt(1 − Qt). After some lengthy algebra, the conditional variance, γ²M,n, of B̂M,n can be shown to satisfy
i1520-0434-22-5-1076-eqa5
This variance decreases as m⁻¹ for large m, so B̂M,n is consistent for BM,n as m → ∞. An unbiased estimator for γ²M,n can be constructed if m > 3 by replacing each Qt^s in the previous equation with
$$\frac{K_t(K_t - 1)\cdots(K_t - s + 1)}{m(m - 1)\cdots(m - s + 1)}$$
for positive integers s. The square root, γ̃M,n, of this unbiased estimator is then an estimator for the conditional standard error of B̂M,n. If m ≤ 3, then we can replace Qt^s with (Kt/m)^s instead, but note that if m = 1, then γ̃M,n is always zero.
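
A small sketch of this replacement, assuming the falling-factorial form given above, which is unbiased for Qt^s when the members are independent and identically distributed; the function name is illustrative.

```python
import numpy as np

def q_power_estimate(k_t, m, s):
    """Unbiased estimate of Q_t**s from K_t = k_t members exceeding the
    threshold out of m i.i.d. members: a ratio of falling factorials."""
    numerator = np.prod([k_t - j for j in range(s)], dtype=float)
    denominator = np.prod([m - j for j in range(s)], dtype=float)
    return numerator / denominator
```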

Estimates of these conditional standard errors are shown in Fig. 2 for the ECMWF and MF rainfall forecasts. As expected, they are smaller than their unconditional counterparts, which reflect the additional variation from sampling observations. In fact, the conditional standard errors are small enough to suggest that the superiority of the MF forecasts at thresholds below 90 mm is statistically significant and would remain for these particular observations even if different ensemble members were sampled.

Confidence intervals

A normal (1 − 2α) confidence interval for BM,n is
$$\left(\hat{B}_{M,n} + z_\alpha\,\tilde{\gamma}_{M,n},\ \hat{B}_{M,n} + z_{1-\alpha}\,\tilde{\gamma}_{M,n}\right).$$
Bootstrap intervals approximate the distribution of
$$T_{M,n} = \frac{\hat{B}_{M,n} - B_{M,n}}{\tilde{\gamma}_{M,n}}$$
by a bootstrap sample {T*iM,n : i = 1, . . . , r}. This sample can be formed by repeating the following steps for each i = 1, . . . , r.
  1. Resample Y*t,j from {Yt,i : i = 1, . . . , m} for each j = 1, . . . , m, and repeat for each t = 1, . . . , n.
  2. Form B̂*M,n and γ̃*M,n from these resampled ensembles in the same way that the original ensembles were used to form B̂M,n and γ̃M,n.
  3. Set TM,n*i = (B̂*M,n − B*M,n)/γ̃*M,n, where
$$B^{*}_{M,n} = \frac{1}{n}\sum_t \left\{\frac{\hat{p}_t(1 - \hat{p}_t)}{M} + (\hat{p}_t - I_t)^2\right\}.$$
Bootstrap (1 − 2α) confidence limits then take the form
$$\hat{B}_{M,n} - \tilde{\gamma}_{M,n}T^{*}_{M,n,(r+1-k)} \qquad\text{and}\qquad \hat{B}_{M,n} - \tilde{\gamma}_{M,n}T^{*}_{M,n,(k)}.$$
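
Step 1 of this scheme, resampling members with replacement within each ensemble, can be coded as below; the remaining steps reuse the score and standard-error calculations on the resampled array. Names are illustrative.

```python
import numpy as np

def resample_members(y, rng):
    """Step 1: members resampled with replacement within each original
    ensemble (each row of y), independently for every time t."""
    n, m = y.shape
    idx = rng.integers(0, m, size=(n, m))
    return np.take_along_axis(y, idx, axis=1)
```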

Bootstrapped 90% confidence intervals for BM,n are illustrated for the ECMWF and MF forecasts in Fig. 3. Again, the intervals are narrower than those for Bm. The ECMWF forecasts are now significantly worse than climatology for many thresholds, that is, they are unlikely to do as well as climatology for these observations were new ensemble members to be sampled, but are significantly better than random forecasts except between the 30% and 50% quantiles (50–90 mm). The MF forecasts are not significantly different from climatology at most thresholds, and are significantly better than random forecasts at all thresholds.

Simulation study

The simulation study of section 4 was repeated for BM,n. Results are not shown but were qualitatively similar to those reported in section 4 for Bm except for rare events (p = 0.9). In that case, bootstrap intervals remain preferable to normal intervals except when ρ = 0, for which both intervals have large coverage errors.

Comparing Brier scores

Confidence intervals for the difference, BM,n − B′M,n, between the conditional expected Brier scores of two systems are easy to construct if the forecasts from the two systems at any time t can be considered independent once the model details Zt and Z′t are fixed. This assumption might be violated if the ensemble generation process causes pairing of members between the two systems, though any such dependence is likely to diminish with lead time. The distribution of
$$D_{M,n} = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_{M,n} - B'_{M,n})}{\bar{\gamma}_{M,n}},$$
where γ̄²M,n = γ̃²M,n + γ̃′²M,n, can be approximated by a bootstrap sample of the quantity
$$D^{*}_{M,n} = \frac{(\hat{B}^{*}_{M,n} - \hat{B}'^{*}_{M,n}) - (B^{*}_{M,n} - B'^{*}_{M,n})}{\bar{\gamma}^{*}_{M,n}}$$
to obtain confidence limits
$$(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \bar{\gamma}_{M,n}D^{*}_{M,n,(r+1-k)}$$
and
$$(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \bar{\gamma}_{M,n}D^{*}_{M,n,(k)}.$$
Resampling follows the scheme described earlier in the section independently for each system.

Figure 3 (bottom panel) shows bootstrapped 90% confidence intervals for the difference between BM,n for the ECMWF and MF forecasts. The MF score is significantly lower than the ECMWF score for most thresholds below 90 mm.

APPENDIX B

Hypothesis Tests

We used confidence intervals in section 5 to test null hypotheses of equal expected Brier scores. Using the normal confidence interval is equivalent to a z test [e.g., the test labeled S1 by Diebold and Mariano (1995)] and using the bootstrap interval is equivalent to a bootstrap test (e.g., Davison and Hinkley 1997, p. 171). Confidence intervals are useful for quantifying uncertainty even when no comparative test is attempted, but if comparison is the goal, then other test procedures might also be employed. Hypothesis tests such as the sign and signed-rank tests (Hamill 1999) test for differences between medians and are inappropriate for testing differences between Brier scores, which are sample means. Instead, we compare the powers of the z and bootstrap tests with those of a t test and a permutation test (Hamill 1999) in a simulation study.

The study design is similar to that in section 4, except that two sets of forecasts are simulated, one uncorrelated with the observations (ρ1 = 0) while the correlation for the other set is varied from ρ2 = 0 to ρ2 = 1. The sets have the same expected Brier score when ρ2 = 0 and the scores diverge as ρ2 increases. For each value of ρ2, 10 000 datasets are generated and subjected to the four tests at the 10% significance level. Figure B1 (left panel) shows Monte Carlo estimates of power for the four tests when m = 8, n = 20, and p = 0.5. All four tests have similar powers, although the z test is slightly oversized and the bootstrap test has slightly lower power far from the null hypothesis.

The z test in Diebold and Mariano (1995) is adapted to handle serial dependence, and block resampling can be used for the permutation and bootstrap tests. The power study is repeated with observations simulated from a first-order moving-average process with correlation 0.5 at lag one. Powers for these three tests are plotted in Fig. B1 (right panel) and show that the z test and, to a lesser extent, the bootstrap tests are oversized, while the permutation test has remained well sized and its power has reduced only slightly from the independent case. From this limited study, the permutation test appears to be an attractive alternative to the bootstrap test for differences between Brier scores.
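
A sketch of a paired permutation test of this kind is given below: under the null hypothesis of equal expected scores the two systems' summands are exchangeable at each time, so their labels are swapped at random and the mean difference recomputed, and the two-sided p value is the proportion of permuted differences at least as extreme as the observed one. This is one standard implementation of the test referred to above (Hamill 1999), not necessarily the exact variant used in the study.

```python
import numpy as np

def permutation_test(w1, w2, n_perm=10000, rng=None):
    """Two-sided paired permutation test of equal expected Brier scores,
    applied to the two systems' summand vectors."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(w1, dtype=float) - np.asarray(w2, dtype=float)
    observed = abs(np.mean(d))
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=len(d))   # swap labels at random times
        if abs(np.mean(signs * d)) >= observed:
            count += 1
    return count / n_perm
```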

Fig. 1. Observed (vertical lines) October rainfall (mm) in Jakarta from 1958 to 1995 plotted between both the ECMWF (filled circles) and MF (open circles) nine-member ensemble forecasts.

Fig. 2. Estimates B̂m,n (upper thick) and B̂∞,n (lower thick) for the ECMWF (solid) and MF (dashed) forecasts of October rainfall at Jakarta exceeding different thresholds during the period 1958–95. Thresholds are marked as quantiles (lower axis) and absolute values (mm, upper axis) of the observed rainfall. Standard errors (upper thin) and their conditional versions (lower thin; see appendix A) are shown for the ECMWF (solid) and MF (dashed) forecasts, and are indistinguishable for B̂m,n and B̂∞,n.

Fig. 3. (top) Brier scores B̂m,n (solid) for the ECMWF forecasts, with bootstrapped 90% confidence intervals for Bm (dark gray) and BM,n (light gray; see appendix A) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) The same as in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂m,n between the ECMWF and MF forecasts, with bootstrapped 90% confidence intervals for the differences between Bm (dark gray) and BM,n (light gray).

Fig. 4. Monte Carlo estimates of normal (solid) and bootstrap (dashed) lower (left) and upper (right) coverage errors plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5. Solid horizontal lines mark zero error.

Fig. 5. Monte Carlo estimates of normal (solid) and bootstrap (dashed) interval lengths plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5.

Fig. 6. (top) Brier scores B̂m,n (solid) for the ECMWF forecasts, with bootstrapped simultaneous 90% confidence intervals for Bm (dark gray) and BM,n (light gray) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) As in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂m,n between the ECMWF and MF forecasts, with bootstrapped simultaneous 90% confidence intervals for the differences between Bm (dark gray) and BM,n (light gray).

Fig. B1. Monte Carlo estimates of the powers of the bootstrap (solid), permutation (dashed), and z (dotted) and t tests (dotted–dashed) against correlation ρ2 (see text) for (left) serially independent and (right) dependent observations.
