Abstract

For probability forecasts, the Brier score and Brier skill score are commonly used verification measures of forecast accuracy and skill. Using sampling theory, analytical expressions are derived to estimate their sampling uncertainties. The Brier score is an unbiased estimator of the accuracy, and an exact expression defines its sampling variance. The Brier skill score (with climatology as a reference forecast) is a biased estimator, and approximations are needed to estimate its bias and sampling variance. The uncertainty estimators depend only on the moments of the forecasts and observations, so it is easy to routinely compute them at the same time as the Brier score and skill score. The resulting uncertainty estimates can be used to construct error bars or confidence intervals for the verification measures, or perform hypothesis testing.

Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. The examples also illustrate that with infrequently occurring events, verification sample sizes of a few hundred forecast–observation pairs are needed to establish that a forecast is skillful because of the large uncertainties that exist.

1. Introduction

Forecast verification provides critical information on the quality of forecasts and can help guide forecasters in their efforts to improve forecasting systems (Jolliffe and Stephenson 2003; Welles et al. 2007). Although forecasters routinely report verification results, they often do not consider the inherent uncertainty of the verification measures (Jolliffe 2007). However, computing verification measures requires a sample of forecasts and their corresponding observations. The resulting verification measures are therefore sample estimates. The sampling uncertainty of verification measures depends on both the sample size and the statistical characteristics of the forecasts and observations. In situations where the sample size is relatively small, the sampling uncertainty can be significant (Bradley et al. 2003).

One approach to assessing the sampling uncertainty of verification measures is a resampling method (Efron 1981; Wilks 2006). Forecast–observation pairs are randomly selected from the verification dataset to construct synthetic data samples, and the verification measures are computed; the process is repeated many times to derive an empirical probability distribution for the sample measures. Examples of a resampling approach include Mason and Mimmack (1992), Wilks (1996), Hamill (1999), Zhang and Casey (2000), Doblas-Reyes et al. (2003), Accadia et al. (2003), Ebert et al. (2004), Accadia et al. (2005), Jolliffe (2007), and Ferro (2007), among others. Although the approach is straightforward and robust, it adds significantly to the computational effort associated with verification.

Another approach is to employ sampling theory to derive analytical expressions for sampling uncertainty. Most examples in the literature are for categorical forecast verification measures, such as the probability of detection or hit rate (Kane and Brown 2000; Jolliffe 2007), odds ratio (Stephenson 2000), false alarm rate and skill scores (Thornes and Stephenson 2001), the correlation coefficient (Jolliffe 2007), and relative operating characteristics (Mason and Graham 2002). For probability forecasts, sampling theory has been applied to evaluate the sampling uncertainty of the bias (Schwartz 1992; Carpenter and Georgakakos 2001) and accuracy measures (Ferro 2007). Sampling theory has also been used to develop confidence intervals (Seaman et al. 1996) and hypothesis tests (Hamill 1999; Briggs and Ruppert 2005; Briggs 2005). Although the assumptions associated with deriving analytical expressions can be limiting, the effort required to evaluate the expressions is minimal.

In this paper, we use sampling theory to develop analytical expressions for the sampling uncertainty of the Brier score and Brier skill score. For probability forecasts, these are the most commonly used measures of accuracy and skill. In the following sections, we derive exact or approximate analytical expressions for the sampling uncertainty of these two measures, and then use Monte Carlo simulation for synthetic forecasting examples to evaluate the expressions and estimated confidence intervals.

2. Forecast verification framework

A probability forecast is one that assigns a probability to the occurrence of a discrete (dichotomous) event. In this case, there are only two possible outcomes: either the forecast event occurs or it does not. A familiar example is a probability-of-precipitation forecast. Probability forecasts can also be constructed from ensemble forecasts (Toth et al. 2003; Zhu 2005); an event is defined as the exceedance (or nonexceedance) of a particular threshold value (e.g., precipitation in excess of a given depth), then the probability forecast for an event occurrence can be made using the ensemble members.

Using a sample of probability forecasts and the resulting (binary) event outcomes, the Brier score and Brier skill score (Brier 1950) are often computed to characterize the accuracy and skill of the forecast. To evaluate the uncertainty of these verification measures from sampling theory, assumptions must be made about the statistical properties of forecasts and observations. The diagnostic framework for forecast verification introduced by Murphy and Winkler (1992), known as the distributions-oriented (DO) approach, explicitly defines a stochastic process commonly assumed for verification. Specifically, forecasts and observations are treated as random variables. Each forecast–observation pair is assumed to be independent of all other pairs, and identically distributed. The relationship between forecasts and observations is defined by their joint distribution.

Using the assumptions of the DO approach and the notation described by Murphy (1997), we will evaluate the sampling uncertainty of the Brier score and Brier skill score. In particular, consider a forecasting system that produces a probability forecast for a dichotomous event. Let x be a Bernoulli random variable that takes on a value of 1 if the event occurs and 0 if it does not. Let f be a probability forecast of the occurrence of the event. That is, f is the forecast probability that the event occurs (i.e., that x = 1). The joint distribution p( f, x) describes the relationship between the forecasts and observations.

a. Distribution moments

The moments of the random variables characterize aspects of the joint distribution. The first moment (mean) of the observations is

 
formula

where μx is interpreted as the climatological probability of the event occurrence. Because x is a Bernoulli random variable, xm = x for all positive integers m, so the higher-order noncentral moments are all

 
formula

Using (1) and (2), the variance of the observations σ2x is

 
formula

The forecast f may be either a discrete or continuous random variable. Let μf be the first moment (mean) of the forecasts. The higher-order noncentral moments are denoted as

 
formula

One can also characterize the moments of f conditioned on the observation. For instance, the first conditional moments are denoted as

 
formula
 
formula

Higher-order noncentral conditional moments can also be defined, and are denoted for the mth moment as μ(m) f |x=0 and μ(m) f |x=1.

The moments of the products of forecasts and observations may also be defined. For positive integers m and n,

 
formula

Appendix A summarizes how to compute estimates of these moments using a verification data sample.

b. Accuracy measure

The accuracy is an attribute describing the closeness of the forecasts f to the observations x (Murphy 1997). A commonly used measure of the accuracy of the forecasts is the mean squared error:

 
formula

For probability forecasts of a dichotomous event, MSE can be defined in terms of moments of f and x (Bradley et al. 2003). By expanding the terms in Eq. (8),

 
formula

c. Skill measure

The accuracy of the forecasts relative to a reference forecasting system is known as the skill. Let MSE(r, x) donate the accuracy for the reference forecasts r. The mean squared error skill score SSMSE is then

 
formula

Using a climatology forecast μx as the reference forecast,

 
formula

Therefore, SSMSE using climatology as a reference forecast is

 
formula

3. Sampling uncertainty

Verification measures of accuracy and skill are computed for a forecasting system from a sample of forecasts and observations. Let xi be the observation at time i. Let fi be the probability forecast of the event at time i. Assume that the forecast–observation pairs {fi, xi, i = 1, . . . , N} are a random sample drawn from the joint distribution p( f, x). The verification data sample will be used to estimate forecast accuracy and skill. In the following sections, we define sample estimators for the measures of forecast accuracy and skill, and derive expressions for their sampling uncertainty.

a. Brier score

The sample estimator for MSE is

 
formula

where the “hat” is used to indicate a sample estimate. For probability forecasts, this estimator is referred to as the Brier (or half-Brier) score (Brier 1950).

The expected value of is

 
formula

Because x2i = xi for a Bernoulli random variable, Eq. (14) reduces to

 
formula

Because E[] is equal to MSE (see section 2b), the Brier score is an unbiased estimator of MSE.

b. Sampling variance of the Brier score

The variance of a sample estimator is a measure of its sampling uncertainty. The variance of the Brier score is

 
formula

The variance term on the right-hand side can be expanded as

 
formula

Using the binomial expansion, the first term is

 
formula

Combining all of the terms,

 
formula

The standard error of a sample estimator is also used to describe its sampling uncertainty. The standard error is equivalent to the standard deviation of the estimator. Therefore, the standard error for the Brier score is defined as

 
formula

c. Brier skill score

The sample estimator for SSMSE, where climatology is used as a reference forecast, is

 
formula

For probability forecasts, this estimator is known as the Brier skill score.

Because it involves a ratio of sample estimators, the Brier skill score is a biased estimator of SSMSE (Mason 2004). Using a Taylor series expansion, a second-order approximation of the bias B(MSE) is (Benjamin and Cornell 1970)

 
formula

where

 
formula
 
formula
 
formula

As shown in section 3a, is an unbiased estimator of MSE. However, for a Bernoulli random variable with parameter μx, the sample estimator of σ2x is biased (see appendix B):

 
formula

By substitution and simplification,

 
formula
 
formula

For a Bernoulli random variable, the variance of the sample estimator σ̂2x is (Kenney and Keeping 1951)

 
formula

Using the result shown in appendix B, the covariance term is

 
formula

d. Sampling variance of the Brier skill score

The variance of the Brier skill score MSE is

 
formula

Here, we employ a first-order approximation (Benjamin and Cornell 1970) to estimate the variance of SSMSE:

 
formula

where

 
formula
 
formula

and

 
formula

The standard error for the Brier skill score MSE is approximated by

 
formula

4. Forecasting examples

To evaluate the analytical expressions for the sampling uncertainties, we used Monte Carlo simulations to generate verification datasets for several synthetic forecasting examples. For each example, the joint distribution p( f, x) and the true values of the accuracy and skill measures are known. Ten thousand verification data samples are generated for each example, for sample sizes ranging from 50 to 1000 pairs. For each verification data sample, sample estimates of the accuracy and skill (the Brier score and Brier skill score) are computed. To determine the “true” standard error of the measures for comparison with the analytical expressions, we compute the standard deviation of the measures for the 10 000 verification data samples. Note that with a Monte Carlo sample of this size, the uncertainty of the true value is less than about 1.5% (at a 95% confidence level).

A stochastic model based on the calibration–refinement factorization of the joint distribution p( f, x) (Murphy 1997) was used to construct the synthetic forecasting examples for the Monte Carlo simulations. Forecast–observation pairs were created from the stochastic model as follows. First, the probability forecast fi for the discrete event is randomly generated. The marginal distribution of the forecasts s( f ) is modeled by a beta distribution (Krzysztofowicz and Long 1991):

 
formula

where υ and ω are parameters, and B(υ, ω) is the beta function. Next, the conditional expected value of the observation μx|f , given the forecast f, is evaluated. A linear model (Murphy and Wilks 1998) is used:

 
formula

where a and b are parameters. Finally, a corresponding observation xi is generated for the given forecast fi. This is accomplished by generating a uniform random variable over the range from 0 to 1; if the value is less than μx| fi, then observation xi is set to 1 (event occurrence). Otherwise, the observation is set to 0 (event nonoccurrence).

We use the results for selected examples to illustrate the properties of the uncertainty estimators. In particular, we show results for forecasts of a rarely occurring event (μx = 0.05) and for a more commonly occurring event (μx = 0.25). For both cases, we generated forecasts with low (SSMSE = 0.2), medium (SSMSE = 0.4), and high skill (SSMSE = 0.6). In all the cases, the forecast is assumed to be unconditionally unbiased (μf = μx), which imposes the following constraint on the linear model parameters:

 
formula

Note that if b = 1, the forecast is also conditionally unbiased; if b ≠ 1, the forecast has conditional biases. We examined both of these cases. For the conditionally unbiased forecast case,

 
formula

which implies that the forecast is perfectly reliable. For cases with a conditionally biased forecast, we selected b = 0.8 and used Eq. (39) to find a.

Table 1 shows the model parameters for the cases presented in the following section. Also shown are relative measures of forecast quality (Murphy 1997) for each case. By design, the relative measures are the same for a given skill level within the unbiased or the conditionally biased forecast cases. By definition, the reliability (REL) of the forecast is 0 for a perfectly reliable forecast; the reliability is nonzero for the cases with conditional biases. Also, the resolution (RES), discrimination (DIS), and sharpness (SHP) of the forecast are all higher for a more skillful forecast. Finally, in the case of a conditionally biased forecast with b = 0.8, the highest achievable skill score is 0.6. For this high-skill case, the forecast must be perfectly sharp (the probability forecast f is either 0 or 1). Hence, the relative sharpness (SHP/σx2) for the high-skill case with conditionally biased forecasts achieves its maximum value of 1. Note that for this special case, the forecast f is a discrete random variable and is generated from a Bernoulli distribution with parameter μf .

Table 1.

Relative measures of forecast quality for Monte Carlo simulations of forecast–observation pairs. The parameters of the stochastic model (defined in section 4) are υ, ω, and b. The forecast quality measures are skill (SSMSE), resolution (RES), reliability (REL), discrimination (DIS), type-2 conditional bias (B2), and sharpness (SHP) (Murphy 1997).

Relative measures of forecast quality for Monte Carlo simulations of forecast–observation pairs. The parameters of the stochastic model (defined in section 4) are υ, ω, and b. The forecast quality measures are skill (SSMSE), resolution (RES), reliability (REL), discrimination (DIS), type-2 conditional bias (B2), and sharpness (SHP) (Murphy 1997).
Relative measures of forecast quality for Monte Carlo simulations of forecast–observation pairs. The parameters of the stochastic model (defined in section 4) are υ, ω, and b. The forecast quality measures are skill (SSMSE), resolution (RES), reliability (REL), discrimination (DIS), type-2 conditional bias (B2), and sharpness (SHP) (Murphy 1997).

5. Results

a. Standard errors

First we evaluate the analytical expressions for the standard errors of the Brier score and Brier skill score. True values for the sampling uncertainty are represented by the 10 000-sample estimates from the Monte Carlo simulation. These are compared to the analytical expression estimates made using the known (true) values for the moments.

Figure 1 shows the standard error estimates for the Brier score for medium-skill (SSMSE = 0.4) forecasts. Results are shown for an unbiased and a conditionally biased forecast, for both rare- (μx = 0.05) and common-event (μx = 0.25) occurrences. Note that the analytical expression is exact for the sampling variance; not surprisingly, the standard error estimates are virtually identical to the true values in all cases. For a forecast with conditional biases, the standard errors are slightly higher than those for a forecast with no conditional biases. Furthermore, the standard errors for for forecasts of common-event occurrences are higher than those for rare-event occurrences. This occurs because the magnitude of MSE tends to increase as the inherent uncertainty σ2x increases; for probability forecasts, σ2x is small as μx approaches 0 (or 1), and reaches its maximum for μx = 0.5 [see Eq. (3)]. Therefore, the absolute magnitudes of the standard error for are not directly comparable for rare- and common-event cases, because they depend on the event occurrence frequency.

Fig. 1.

Standard error of the Brier score for a medium-skill forecast (SSMSE = 0.4). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (20)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for both an unbiased and a conditionally biased forecast.

Fig. 1.

Standard error of the Brier score for a medium-skill forecast (SSMSE = 0.4). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (20)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for both an unbiased and a conditionally biased forecast.

For the Brier skill score MSE, the analytical expression for its sampling variance is a first-order approximation. The quality of the approximation for the standard error is illustrated in Fig. 2 for unbiased forecasts with low skill (SSMSE = 0.2) and high skill (SSMSE = 0.6). For large sample sizes, the standard error estimates are very good in all cases. However, significant underestimation of the standard error occurs for forecasts of the rare event for small sample sizes. Unlike MSE, SSMSE is directly comparable for rare- and common-event cases because it is nondimensionalized by the variance of the observations. For the forecasts of the rare event, there are fewer occurrences of the forecast event, resulting in relatively high sampling variance. Still, the deviations of the standard error approximation for these cases are evident only with small sample sizes (200 or less), when there are very few rare-event occurrences in the sample (e.g., 10 or less). Note too that the standard errors for both the rare- and common-event cases increase slightly with increasing skill (see Fig. 2), but the increase is proportionally smaller than the increase in the skill itself. For the high-skill forecast, SSMSE is larger than the standard error for all sample sizes, whereas for the low-skill forecast, the standard error for the rare-event case exceeds SSMSE for very small sample sizes.

Fig. 2.

Standard error of the Brier skill score for (a) a low-skill forecast (SSMSE = 0.2) and (b) a high-skill forecast (SSMSE = 0.6). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for an unbiased forecast.

Fig. 2.

Standard error of the Brier skill score for (a) a low-skill forecast (SSMSE = 0.2) and (b) a high-skill forecast (SSMSE = 0.6). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for an unbiased forecast.

The impact of conditional forecast biases is illustrated in Fig. 3 for the case of medium-skill (SSMSE = 0.4) forecasts. As is the case for , the standard errors for MSE are higher when the forecast is conditionally biased. Approximate standard errors are nearly exact for common-event forecasts, even with conditional biases (see Fig. 3b). However, the same deviation in the approximation occurs for the rare-event case when conditional forecast biases are present (see Fig. 3a).

Fig. 3.

Standard error of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for both an unbiased and a conditionally biased forecast.

Fig. 3.

Standard error of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for both an unbiased and a conditionally biased forecast.

b. Bias in the Brier skill score

Although the sample estimator of MSE is unbiased, the sample estimator of SSMSE is biased. Figure 4 shows the expected skill score E[MSE] for medium-skill forecasts based on the Monte Carlo simulation, and that evaluated with the second-order approximation of the bias shown in Eq. (22). Clearly, biases are largest for the forecasts of the rare event; the biases are only slightly larger when the forecast is conditionally biased. The magnitude of the bias can be quite large. For rare-event occurrences with a true SSMSE of 0.4, the expected skill score is 0.34 or less for a sample of fewer than a hundred forecast–observation pairs. Although not shown, the magnitude of the bias does not vary greatly with the magnitude of the forecast skill in these forecasting examples.

Fig. 4.

Bias of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true expected values of the Brier skill score derived from the Monte Carlo simulations. The curves are the estimates made using the analytical expression for bias [Eq. (22)]. The expected values are shown for both an unbiased and a conditionally biased forecast.

Fig. 4.

Bias of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true expected values of the Brier skill score derived from the Monte Carlo simulations. The curves are the estimates made using the analytical expression for bias [Eq. (22)]. The expected values are shown for both an unbiased and a conditionally biased forecast.

The second-order approximation of the biases produces a reasonable prediction of the bias for rare-event occurrences (see Fig. 4a). The curves follow the overall trend found in the Monte Carlo experiments, but the magnitude of the bias is underestimated. When the actual bias is less, as is the case for common-event occurrences (see Fig. 4b), the approximation is poor. In particular, the magnitudes of the observed biases are much larger for small sample sizes than predicted for the common-event occurrences.

c. Approximate confidence intervals

In practice, the analytical expressions would be used to estimate the uncertainty of the Brier score and Brier skill score computed with a verification dataset. Uncertainty estimates are obtained by replacing the moments in the standard error and bias equations in section 3 with moment estimates from a sample (see appendix A for traditional sample estimators of the moments). Note that for the standard error of the Brier score, this approach is mathematically similar to the sample estimator proposed by Ferro (2007) for ensemble forecasts. Using the resulting sample estimates of uncertainty, one can construct error bars, perform hypothesis testing, or construct confidence intervals for individual sample estimates. To evaluate the performance of the uncertainty expressions when sample moments are used, we examine confidence interval estimates for both MSE and SSMSE for verification data samples.

In principle, confidence interval estimation requires knowledge of the sampling distributions for and MSE. However, for sufficiently large sample sizes, the central limit theorem states that the sampling distribution of a sum of independent and identically distributed random variables will converge to a normal distribution. Therefore, a normal distribution approximation is often used to construct confidence intervals. For , which involves the sum of the forecast errors squared, it is reasonable to assume that its sampling distribution will converge to a normal distribution for large sample sizes. However, the situation is not as straightforward for MSE, which essentially involves the ratio of two sums.

Figure 5 shows the empirical sampling distribution for from the Monte Carlo simulation for medium-skill forecasts with no bias. For rare-event occurrences (see Fig. 5a), the lower bound at 0 causes deviations from a normal distribution for small sample sizes. Otherwise, the normal distribution approximation is reasonable for the sampling distribution. For common-event occurrences (see Fig. 5b), the normal distribution approximation is reasonable, except perhaps in the tails of the distribution.

Fig. 5.

Sampling distribution of the Brier score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.

Fig. 5.

Sampling distribution of the Brier score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.

Figure 6 shows the sampling distribution for MSE for the same medium-skill forecasts. For rare-event occurrences (see Fig. 6a), the sampling distributions are negatively skewed due to the upper bound at 1, and the skewness is significant for most sample sizes shown. Therefore, a normal distribution is a poor approximation for the observed sampling distribution for MSE. For common-event occurrences (see Fig. 6b), the skewness of the sampling distribution is not as severe. For sample sizes of 400 or greater, a normal distribution approximation is not unreasonable.

Fig. 6.

Sampling distribution of the Brier skill score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.

Fig. 6.

Sampling distribution of the Brier skill score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.

Despite the deviations observed, we examined the characteristics of confidence intervals constructed for individual samples using a normal approximation. For the case where a sample statistic Ŷ is normally distributed with unknown variance, the 100(1 – α)% confidence interval for the true parameter μY is

 
formula

where tα/2,N−1 is the critical value of the t distribution for a two-sided test with a significance level of α and N − 1 degrees of freedom, and σ̂Y is the sample estimate of the standard error of the test statistic. For all forecast cases, we constructed confidence limits for individual verification datasets. In particular, we computed the sample moments (see appendix A) for each dataset, and we used the sample estimates in the standard error equations shown in section 3 to obtain standard error estimates for and MSE and confidence intervals. Then, the fractions of cases where the true values of MSE and SSMSE fell within the confidence limits were computed. The results are shown for 95% confidence intervals in Table 2 for unbiased forecasts.

Table 2.

Percentage of samples where the true MSE or SSMSE falls within the 95% confidence interval constructed for the sample. Results are shown for forecasting examples without conditional forecast biases.

Percentage of samples where the true MSE or SSMSE falls within the 95% confidence interval constructed for the sample. Results are shown for forecasting examples without conditional forecast biases.
Percentage of samples where the true MSE or SSMSE falls within the 95% confidence interval constructed for the sample. Results are shown for forecasting examples without conditional forecast biases.

For the forecasts of the rare event, the 95% confidence intervals constructed using sample estimates contain the true values much less than 95% of the time for small sample sizes. Given the underestimation of the standard error for MSE for the forecasts of the rare event, underestimation of the interval is not surprising. However, the variance estimator is exact, and the underestimation is of a similar magnitude for a given skill level and sample size. Although deviations from the normal distribution approximation may contribute to the underestimation, the dominate cause is the use of sample (rather than the true) moments to estimate the sampling variances of and MSE. Note that the expressions involve higher-order moments (up to fourth order), so sample estimates of these moments have large sampling uncertainty for a small sample size. For the forecasts of the common event, the 95% confidence intervals performed much better, even though slight deviations occur for small sample sizes. For all cases, the confidence interval results are best (closer to 95%) for low skill forecasts. The intervals for are better for the rare-event occurrences, but those for MSE are generally better for the common-event occurrences. So, despite the approximation used for the variance of MSE, estimated confidence intervals for both and MSE perform similarly using the normal distribution approximation.

6. Discussion

As the results of the Monte Carlo experiments show, forecasters can reliably use the standard error and confidence limit estimators presented here to characterize uncertainty in many forecasting situations. Still, significant deviations can occur with forecasts made for rarely occurring events for small sample sizes (roughly less than a few hundred forecast–observation pairs). Unfortunately, the desire to characterize sampling uncertainty for rare (high impact) events is usually the greatest when the forecast–observation data sample is small. In these situations, standard error estimates for the Brier skill score, and confidence limits for both the Brier score and Brier skill score, underestimate the actual uncertainty.

On the other hand, the estimated standard errors and confidence limits, although deficient, do indicate the large uncertainties that exists for such cases. As an example, Fig. 7 shows the 95% confidence limits for SSMSE as a function of the sample size for a low-skill forecast of a rare event (μx = 0.05), computed using the approximate standard error estimators and the true model parameters. Note that for sample sizes up to about 300, one cannot reject the hypothesis that the true skill score is 0 for this case. For a sample size of 50, the estimated 95% confidence interval for SSMSE covers a range of [−0.26, 0.57]. Even though these estimates understate the true uncertainty range, the interval is still so large that one cannot conclude that the forecast of the rare event is skillful from such a small sample. Although it may be possible to devise corrections that improve uncertainty estimates for small sample sizes and rare events, the conclusions drawn in such cases would not be substantially different.

Fig. 7.

The 95% confidence intervals and expected Brier skill score for an unbiased forecast with low skill (SSMSE = 0.2) for a rare event (μx = 0.05). Lines show the results based on analytical expressions evaluated with the true parameters. Symbols show the empirical confidence intervals from the Monte Carlo simulations.

Fig. 7.

The 95% confidence intervals and expected Brier skill score for an unbiased forecast with low skill (SSMSE = 0.2) for a rare event (μx = 0.05). Lines show the results based on analytical expressions evaluated with the true parameters. Symbols show the empirical confidence intervals from the Monte Carlo simulations.

A second-order approximation of the bias for the SSMSE estimator was also derived, which could be applied to correct biases for sample estimates. However, the utility of such a correction is debatable. As suggested by Fig. 7, the magnitude of the bias is quite small compared to the sampling variability, especially for large sample sizes. As shown in Fig. 4, the approximation performed best for rare-event occurrences for small sample sizes, where the sampling uncertainty is too large to make statistical inferences. For common-event occurrences, the bias is much smaller, and the approximation underestimates the bias to such an extent that its use for bias correction would not substantially change the results.

Another important consideration in the use of the uncertainty estimators is their limit of applicability. Recall that the assumptions of the DO approach to forecast verification (Murphy 1997) were used to derive the estimators. One assumption is that forecast–observation pairs are identically distributed; a verification dataset is a random sample drawn from this unchanging (or stationary) joint distribution. Yet in many applications, a verification dataset pools together forecasts issued at different times of the year, and for different locations. In this case, the assumption of identically distributed forecast–observation pairs may be violated, because the joint distribution of forecasts and observations may depend on the season and location.

Hamill and Juras (2006) provide some insightful examples of how nonstationarity in the form of temporal and spatial variations in climatology affects verification measures. In particular, if the joint distribution varies with time or location, verification measures will be unreliable (biased) and misleading. Whenever such variations are a concern, Hamill and Juras (2006) recommend computing verification measures with subsamples where stationarity of the joint distribution can be reasonably assumed. Of course, the drawback to subdividing the verification data sample into smaller subsamples (to avoid such biases) is the inherent increase in sampling variability.

If one adopts this approach when dealing with pooled samples, the uncertainty estimators derived here still have applications. Consider the case where a pooled sample of size N is drawn from M distinct groups, each with its own (stationary) joint distribution. Mathematically, the sample Brier score computed with the pooled sample p is equivalent to

 
formula

where MSEi is the sample Brier score for the ith group and Ni is the sample size for the ith group. Using probability theory, and assuming independence of the subsamples, the variance of the pooled Brier score is

 
formula

where V[i] is defined by Eq. (19) for each subsample.

In cases where a single skill measure is desired for a pooled sample, Hamill and Juras (2006) recommend using a weighted average of the skill score for (stationary) subsamples, rather than the skill score computed directly from the pooled sample. One suggested weighted-average skill measure is

 
formula

Again, assuming independence of the subsamples, the standard error of this weighted-average skill score is simply

 
formula

where V[i] is defined by Eq. (32) for the stationary subsample.

Another assumption of the DO approach is that each forecast–observation pair is independent of all others. The independence assumption is violated whenever forecasts or observations are significantly correlated in time or space. Temporal correlation for probability forecasts may be a concern for frequently updated event forecasts (e.g., daily probability of precipitation), but sample sizes also tend to be relatively large in this case. On the other hand, the spatial correlation for probability forecasts is often a concern for both meteorological and climate forecasts. If the forecasts or observations are stationary, but significantly correlated, the sample size is effectively reduced, and the estimators presented here would underestimate the true sampling uncertainty.

One clear advantage of the analytical estimators of sampling uncertainty is that they are easy to compute at the same time as verification measures. In contrast, with resampling approaches, a few hundred or thousand resampled datasets must be constructed, and verification measures computed for each case. Clearly, the added computational burden is of no consequence for any one verification dataset. However, the computational effort can be a significant obstacle when evaluating the suite of products for an ensemble forecasting system, where the ensemble forecasts are transformed into probability forecasts for multiple event thresholds (e.g., Bradley et al. 2004), and evaluated at many different locations and forecast times (e.g., Kruger et al. 2007). Still, if the underlying stochastic process for forecasts and observations assumed in the derivation of these estimators is not valid, resampling remains an attractive nonparametric alternative.

7. Summary and conclusions

The Brier score and Brier skill score are commonly used verification measures for probability forecasts. However, their sampling characteristics have not been explored in great detail. Using the joint distribution of forecasts and observations as defined in the distributions-oriented approach to forecast verification (Murphy and Winkler 1992), we derive analytical expressions for the sampling uncertainties of the Brier score and Brier skill score. The expressions depend only on the moments of the forecasts and observations, making sampling uncertainties easy to estimate at the same time as the scores themselves. The expressions can be used with verification datasets to estimate standard errors and biases, construct confidence intervals, or perform hypothesis testing (e.g., Hamill 1999) on the Brier score and Brier skill score.

Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. For the Brier skill score, the approximation underestimates the standard error at small sample sizes (a few hundred forecast–observation pairs or less) for infrequently occurring events. In contrast, the bias estimator for the Brier skill score only provides reasonable estimates when the bias is large, a situation that exists only for infrequently occurring events. Confidence interval estimates constructed for individual samples using a normal distribution approximation perform well except at small sample sizes, where the sampling distributions are skewed because of large sampling variances and the lower (upper) bound on the Brier score (Brier skill score).

Although this paper focuses on uncertainty estimation for measures of accuracy and skill, there are additional measures that describe other aspects of the forecast quality for probability forecasts (Murphy 1997; Wilks 2006). Bradley et al. (2003) showed that other distributions-oriented measures for probability forecasts can also be expressed as a function of the moments of the joint distribution. We are currently exploring the use of sampling theory to develop approximate uncertainty estimates for these measures as well.

Acknowledgments

This work was supported in part by National Oceanic and Atmospheric Administration (NOAA) Grant NA04NWS4620015, from the National Weather Service (NWS) Office of Hydrologic Development, and Grant NA16GP1569, from the Office of Global Programs as part of the GEWEX Americas Prediction Project (GAPP). We gratefully acknowledge this support. We also thank the two anonymous reviewers for their thoughtful comments and suggestions.

REFERENCES

REFERENCES
Accadia
,
C.
,
S.
Mariani
,
M.
Casaioli
,
A.
Lavagnini
, and
A.
Speranza
,
2003
:
Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids.
Wea. Forecasting
,
18
,
918
932
.
Accadia
,
C.
,
S.
Mariani
,
M.
Casaioli
,
A.
Lavagnini
, and
A.
Speranza
,
2005
:
Verification of precipitation forecasts from two limited-area models over Italy and comparison with ECMWF forecasts using a resampling technique.
Wea. Forecasting
,
20
,
276
300
.
Benjamin
,
J.
, and
C.
Cornell
,
1970
:
Probability, Statistics, and Decision for Civil Engineers.
McGraw-Hill, 684 pp
.
Bradley
,
A. A.
,
T.
Hashino
, and
S. S.
Schwartz
,
2003
:
Distributions-oriented verification of probability forecasts for small data samples.
Wea. Forecasting
,
18
,
903
917
.
Bradley
,
A. A.
,
S. S.
Schwartz
, and
T.
Hashino
,
2004
:
Distributions-oriented verification of ensemble streamflow predictions.
J. Hydrometeor.
,
5
,
532
545
.
Brier
,
G. W.
,
1950
:
Verification of forecasts expressed in terms of probability.
Mon. Wea. Rev.
,
78
,
1
3
.
Briggs
,
W.
,
2005
:
A general method of incorporating forecast cost and loss in value scores.
Mon. Wea. Rev.
,
133
,
3393
3397
.
Briggs
,
W.
, and
D.
Ruppert
,
2005
:
Assessing the skill of yes/no forecasts.
Biometrics
,
61
,
799
807
.
Carpenter
,
T.
, and
K.
Georgakakos
,
2001
:
Assessment of Folsom Lake response to historical and potential future climate scenarios: 1. forecasting.
J. Hydrol.
,
249
,
148
175
.
Doblas-Reyes
,
F.
,
V.
Pavan
, and
D.
Stephenson
,
2003
:
The skill of multi-model seasonal forecasts of the wintertime North Atlantic oscillation.
Climate Dyn.
,
21
,
501
514
.
Ebert
,
E. E.
,
L. J.
Wilson
,
B. G.
Brown
,
P.
Nurmi
,
H. E.
Brooks
,
J.
Bally
, and
M.
Jaeneke
,
2004
:
Verification of nowcasts from the WWRP Sydney 2000 Forecast Demonstration Project.
Wea. Forecasting
,
19
,
73
96
.
Efron
,
B.
,
1981
:
Nonparametric estimates of standard error: The jacknife, the bootstrap and other methods.
Biometrika
,
68
,
589
599
.
Ferro
,
C. A.
,
2007
:
Comparing probabilistic forecasting systems with the Brier score.
Wea. Forecasting
,
22
,
1076
1088
.
Hamill
,
T. M.
,
1999
:
Hypothesis tests for evaluating numerical precipitation forecasts.
Wea. Forecasting
,
14
,
155
167
.
Hamill
,
T. M.
, and
J.
Juras
,
2006
:
Measuring forecast skill: Is it real skill or is it the varying climatology?
Quart. J. Roy. Meteor. Soc.
,
132
,
2905
2923
.
Jolliffe
,
I.
,
2007
:
Uncertainty and inference for verification measures.
Wea. Forecasting
,
22
,
637
650
.
Jolliffe
,
I.
, and
D.
Stephenson
,
2003
:
Introduction.
Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., John Wiley, 1–12
.
Kane
,
T. L.
, and
B. G.
Brown
,
2000
:
Confidence intervals for some verification measures—A survey of several methods. Preprints, 15th Conf. on Probability and Statistics in the Atmospheric Sciences, Asheville, NC, Amer. Meteor. Soc., 46–49
.
Kenney
,
J. F.
, and
E. S.
Keeping
,
1951
:
Mathematics of Statistics, Part 2.
2nd ed. Van Nostrand, 429 pp
.
Kruger
,
A.
,
S.
Khandelwal
, and
A. A.
Bradley
,
2007
:
Ahpsver: A Web-based system for hydrologic forecast verification.
Comput. Geosci.
,
33
,
739
748
.
Krzysztofowicz
,
R.
, and
D.
Long
,
1991
:
Beta likelihood models of probabilistic forecasts.
Int. J. Forecasting
,
7
,
47
55
.
Mason
,
S. J.
,
2004
:
On using climatology as a reference strategy in the Brier and ranked probability skill scores.
Mon. Wea. Rev.
,
132
,
1891
1895
.
Mason
,
S. J.
, and
G. M.
Mimmack
,
1992
:
The use of bootstrap confidence intervals for the correlation coefficient in climatology.
Theor. Appl. Climatol.
,
45
,
229
233
.
Mason
,
S. J.
, and
N. E.
Graham
,
2002
:
Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation.
Quart. J. Roy. Meteor. Soc.
,
128
,
2145
2166
.
Murphy
,
A. H.
,
1997
:
Forecast verification.
Economic Value of Weather and Climate Forecasts, R. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74
.
Murphy
,
A. H.
, and
R. L.
Winkler
,
1992
:
Diagnostic verification of probability forecasts.
Int. J. Forecasting
,
7
,
435
455
.
Murphy
,
A. H.
, and
D. S.
Wilks
,
1998
:
A case study of the use of statistical models in forecast verification: Precipitation probability forecasts.
Wea. Forecasting
,
13
,
795
810
.
Schwartz
,
S. S.
,
1992
:
Verifying probabilistic water supply outlooks for the Potomac River basin. Preprints, 28th Conf. and Symp. on Managing Water Resources during Global Change, Reno, NV, American Water Resources Association, 153–161
.
Seaman
,
R.
,
I.
Mason
, and
F.
Woodcook
,
1996
:
Confidence intervals for some performance measures of yes–no forecasts.
Aust. Meteor. Mag.
,
45
,
49
53
.
Stephenson
,
D. B.
,
2000
:
Use of the “odds ratio” for diagnosing forecast skill.
Wea. Forecasting
,
15
,
221
232
.
Thornes
,
J. E.
, and
D. B.
Stephenson
,
2001
:
How to judge the quality and value of weather forecast products.
Meteor. Appl.
,
8
,
307
314
.
Toth
,
Z.
,
O.
Talagrand
,
G.
Candille
, and
Y.
Zhu
,
2003
:
Probability and ensemble forecasts.
Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., John Wiley, 137–163
.
Welles
,
E.
,
S.
Sorooshian
,
G.
Carter
, and
B.
Olsen
,
2007
:
Hydrologic verification: A call for action and collaboration.
Bull. Amer. Meteor. Soc.
,
88
,
503
511
.
Wilks
,
D. S.
,
1996
:
Statistical significance of long-range “optimal climate normal” temperature and precipitation forecasts.
J. Climate
,
9
,
827
839
.
Wilks
,
D. S.
,
2006
:
Statistical Methods in the Atmospheric Sciences.
2nd ed. International Geophysics Series, Vol. 91, Academic Press, 648 pp
.
Zhang
,
H.
, and
T.
Casey
,
2000
:
Verification of categorical probability forecasts.
Wea. Forecasting
,
15
,
80
89
.
Zhu
,
Y. J.
,
2005
:
Ensemble forecast: A new approach to uncertainty and predictability.
Adv. Atmos. Sci.
,
22
,
781
788
.

APPENDIX A

Sample Moment Estimators

Let xi be the observation at time i. Let fi be the probability forecast of the event at time i. The verification data sample is then {fi, xi, i = 1, . . . , N}. The traditional sample moment estimator for the mean of the observations is

 
formula

Because x is a Bernoulli random variable, the sample estimator for the variance of the observations σ2x is simply

 
formula

The sample estimators for the mean and higher-order noncentral moment estimators for the forecasts are

 
formula
 
formula

To estimate the conditional mean and higher-order noncentral moments of the forecasts, the verification data sample is partitioned into two sets. Let {f0j, j = 1, . . . , N0} be the subsample of forecasts for the case where the event does not occur (x = 0). Let {f1k, k = 1, . . . , N1} be the subsample of forecasts for the case where the event occurs (x = 1). The conditional means μ̂f |x=0 and μ̂f |x=1, and higher-order noncentral moments μ̂(m) f |x=0 and μ̂(m) f |x=1 are then estimated using the subsamples in a similar fashion as above.

Note that sample estimates of the accuracy and skill measures shown in sections 2b and 2c, or the bias and standard errors shown in section 3, can be obtained by replacing the moments with the sample moment estimates shown above.

APPENDIX B

Covariance Term

The covariance of the sample estimators of the mean squared error (Brier score) and the variance of the observations σ̂2x is

 
formula

By substituting sample moments into the expressions for the MSE and σ2x in Eqs. (10) and (3), the first term on the right-hand side is

 
formula

Expanding the terms,

 
formula

We will evaluate expectations by conditioning on the number of occurrences of x = 1, denoted N1. For a known sample size N, N1 is a random variable. Because x is a Bernoulli random variable, N1 has a binomial distribution with the parameter μx. Note that the number of occurrences of x = 0, denoted N0, is related to N1 by the relationship N = N0 + N1. Conditioning on N1, the first term is

 
formula

Using the conditional expectations for the noncentral moments of f,

 
formula

The unconditional expectation is

 
formula

Because N1 is a binomial random variable, the expected values for the noncentral moments are

 
formula
 
formula

By substitution,

 
formula

By definition,

 
formula

Therefore, the expectation simplifies to

 
formula

Using the same approach with other terms in Eq. (B3), the second term is

 
formula

the third term is

 
formula

and the fourth term is

 
formula

Because the sample estimator of MSE is unbiased, the expected value of is

 
formula

The expected value of σ̂2x is

 
formula

which implies that the sample estimator is biased.

Combining all of the terms, the covariance term simplifies to

 
formula

Footnotes

Corresponding author address: A. Allen Bradley, IIHR-Hydroscience and Engineering, The University of Iowa, 107 C. Maxwell Stanley Hydraulics Laboratory, Iowa City, IA 52242. Email: allen-bradley@uiowa.edu