## Abstract

For probability forecasts, the Brier score and Brier skill score are commonly used verification measures of forecast accuracy and skill. Using sampling theory, analytical expressions are derived to estimate their sampling uncertainties. The Brier score is an unbiased estimator of the accuracy, and an exact expression defines its sampling variance. The Brier skill score (with climatology as a reference forecast) is a biased estimator, and approximations are needed to estimate its bias and sampling variance. The uncertainty estimators depend only on the moments of the forecasts and observations, so it is easy to routinely compute them at the same time as the Brier score and skill score. The resulting uncertainty estimates can be used to construct error bars or confidence intervals for the verification measures, or perform hypothesis testing.

Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. The examples also illustrate that with infrequently occurring events, verification sample sizes of a few hundred forecast–observation pairs are needed to establish that a forecast is skillful because of the large uncertainties that exist.

## 1. Introduction

Forecast verification provides critical information on the quality of forecasts and can help guide forecasters in their efforts to improve forecasting systems (Jolliffe and Stephenson 2003; Welles et al. 2007). Although forecasters routinely report verification results, they often do not consider the inherent uncertainty of the verification measures (Jolliffe 2007). However, computing verification measures requires a *sample* of forecasts and their corresponding observations. The resulting verification measures are therefore *sample estimates*. The sampling uncertainty of verification measures depends on both the sample size and the statistical characteristics of the forecasts and observations. In situations where the sample size is relatively small, the sampling uncertainty can be significant (Bradley et al. 2003).

One approach to assessing the sampling uncertainty of verification measures is a resampling method (Efron 1981; Wilks 2006). Forecast–observation pairs are randomly selected from the verification dataset to construct synthetic data samples, and the verification measures are computed; the process is repeated many times to derive an empirical probability distribution for the sample measures. Examples of a resampling approach include Mason and Mimmack (1992), Wilks (1996), Hamill (1999), Zhang and Casey (2000), Doblas-Reyes et al. (2003), Accadia et al. (2003), Ebert et al. (2004), Accadia et al. (2005), Jolliffe (2007), and Ferro (2007), among others. Although the approach is straightforward and robust, it adds significantly to the computational effort associated with verification.

Another approach is to employ sampling theory to derive analytical expressions for sampling uncertainty. Most examples in the literature are for categorical forecast verification measures, such as the probability of detection or hit rate (Kane and Brown 2000; Jolliffe 2007), odds ratio (Stephenson 2000), false alarm rate and skill scores (Thornes and Stephenson 2001), the correlation coefficient (Jolliffe 2007), and relative operating characteristics (Mason and Graham 2002). For probability forecasts, sampling theory has been applied to evaluate the sampling uncertainty of the bias (Schwartz 1992; Carpenter and Georgakakos 2001) and accuracy measures (Ferro 2007). Sampling theory has also been used to develop confidence intervals (Seaman et al. 1996) and hypothesis tests (Hamill 1999; Briggs and Ruppert 2005; Briggs 2005). Although the assumptions associated with deriving analytical expressions can be limiting, the effort required to evaluate the expressions is minimal.

In this paper, we use sampling theory to develop analytical expressions for the sampling uncertainty of the Brier score and Brier skill score. For probability forecasts, these are the most commonly used measures of accuracy and skill. In the following sections, we derive exact or approximate analytical expressions for the sampling uncertainty of these two measures, and then use Monte Carlo simulation for synthetic forecasting examples to evaluate the expressions and estimated confidence intervals.

## 2. Forecast verification framework

A probability forecast is one that assigns a probability to the occurrence of a discrete (dichotomous) event. In this case, there are only two possible outcomes: either the forecast event occurs or it does not. A familiar example is a probability-of-precipitation forecast. Probability forecasts can also be constructed from ensemble forecasts (Toth et al. 2003; Zhu 2005); an event is defined as the exceedance (or nonexceedance) of a particular threshold value (e.g., precipitation in excess of a given depth), then the probability forecast for an event occurrence can be made using the ensemble members.

Using a sample of probability forecasts and the resulting (binary) event outcomes, the Brier score and Brier skill score (Brier 1950) are often computed to characterize the accuracy and skill of the forecast. To evaluate the uncertainty of these verification measures from sampling theory, assumptions must be made about the statistical properties of forecasts and observations. The diagnostic framework for forecast verification introduced by Murphy and Winkler (1992), known as the distributions-oriented (DO) approach, explicitly defines a stochastic process commonly assumed for verification. Specifically, forecasts and observations are treated as random variables. Each forecast–observation pair is assumed to be independent of all other pairs, and identically distributed. The relationship between forecasts and observations is defined by their joint distribution.

Using the assumptions of the DO approach and the notation described by Murphy (1997), we will evaluate the sampling uncertainty of the Brier score and Brier skill score. In particular, consider a forecasting system that produces a probability forecast for a dichotomous event. Let *x* be a Bernoulli random variable that takes on a value of 1 if the event occurs and 0 if it does not. Let *f* be a probability forecast of the occurrence of the event. That is, *f* is the forecast probability that the event occurs (i.e., that *x* = 1). The joint distribution *p*( *f*, *x*) describes the relationship between the forecasts and observations.

### a. Distribution moments

The moments of the random variables characterize aspects of the joint distribution. The first moment (mean) of the observations is

$$\mu_x = E[x],$$

where $\mu_x$ is interpreted as the climatological probability of the event occurrence. Because $x$ is a Bernoulli random variable, $x^m = x$ for all positive integers $m$, so the higher-order noncentral moments are all

$$\mu'_{(m)x} = E[x^m] = \mu_x.$$

The forecast $f$ may be either a discrete or continuous random variable. Let $\mu_f$ be the first moment (mean) of the forecasts. The higher-order noncentral moments are denoted as

$$\mu'_{(m)f} = E[f^m].$$

One can also characterize the moments of $f$ conditioned on the observation. For instance, the first conditional moments are denoted as

$$\mu_{f|x=0} = E[f \mid x = 0] \quad \text{and} \quad \mu_{f|x=1} = E[f \mid x = 1].$$

Higher-order noncentral conditional moments can also be defined, and are denoted for the $m$th moment as $\mu'_{(m)f|x=0}$ and $\mu'_{(m)f|x=1}$.

The moments of the products of forecasts and observations may also be defined. For positive integers $m$ and $n$,

$$\mu'_{(m,n)fx} = E[f^m x^n].$$

Appendix A summarizes how to compute estimates of these moments using a verification data sample.
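As a concrete illustration, the moment estimates can be computed directly from a verification dataset. The sketch below is our own minimal implementation (the function and dictionary key names are ours, not from appendix A), assuming the forecasts and observations are stored as NumPy arrays:

```python
import numpy as np

def sample_moments(f, x):
    """Sample estimates of the moments used in the verification
    expressions (names are illustrative, not from the paper)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    return {
        "mu_x": x.mean(),                   # climatological event frequency
        "mu_f": f.mean(),                   # mean forecast
        "mu2_f": np.mean(f**2),             # second noncentral moment of f
        "mu_fx": np.mean(f * x),            # product moment E[fx]
        "mu_f_given_x1": f[x == 1].mean(),  # conditional mean, event occurred
        "mu_f_given_x0": f[x == 0].mean(),  # conditional mean, event did not occur
    }

# Tiny illustrative verification sample
f = np.array([0.8, 0.2, 0.6, 0.1])
x = np.array([1, 0, 1, 0])
m = sample_moments(f, x)
```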

### b. Accuracy measure

The *accuracy* is an attribute describing the closeness of the forecasts $f$ to the observations $x$ (Murphy 1997). A commonly used measure of the accuracy of the forecasts is the mean squared error:

$$\mathrm{MSE}(f, x) = E[(f - x)^2].$$

For probability forecasts of a dichotomous event, MSE can be defined in terms of moments of $f$ and $x$ (Bradley et al. 2003). By expanding the terms in Eq. (8),

$$\mathrm{MSE}(f, x) = \mu'_{(2)f} - 2\mu'_{(1,1)fx} + \mu_x,$$

since $E[x^2] = \mu_x$ for a Bernoulli random variable.

### c. Skill measure

The accuracy of the forecasts relative to a reference forecasting system is known as the *skill*. Let $\mathrm{MSE}(r, x)$ denote the accuracy for the reference forecasts $r$. The mean squared error skill score $\mathrm{SS}_{\mathrm{MSE}}$ is then

$$\mathrm{SS}_{\mathrm{MSE}} = 1 - \frac{\mathrm{MSE}(f, x)}{\mathrm{MSE}(r, x)}.$$

Using a climatology forecast $\mu_x$ as the reference forecast,

$$\mathrm{MSE}(\mu_x, x) = E[(\mu_x - x)^2] = \mu_x(1 - \mu_x) = \sigma_x^2.$$

Therefore, $\mathrm{SS}_{\mathrm{MSE}}$ using climatology as a reference forecast is

$$\mathrm{SS}_{\mathrm{MSE}} = 1 - \frac{\mathrm{MSE}(f, x)}{\mu_x(1 - \mu_x)}.$$

## 3. Sampling uncertainty

Verification measures of accuracy and skill are computed for a forecasting system from a *sample* of forecasts and observations. Let $x_i$ be the observation at time $i$, and let $f_i$ be the probability forecast of the event at time $i$. Assume that the forecast–observation pairs $\{(f_i, x_i),\ i = 1, \ldots, N\}$ are a random sample drawn from the joint distribution $p(f, x)$. The verification data sample will be used to *estimate* forecast accuracy and skill. In the following sections, we define sample estimators for the measures of forecast accuracy and skill, and derive expressions for their sampling uncertainty.

### a. Brier score

The sample estimator for MSE is

$$\widehat{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (f_i - x_i)^2,$$

where the "hat" is used to indicate a sample estimate. For probability forecasts, this estimator is referred to as the Brier (or half-Brier) score (Brier 1950).

The expected value of $\widehat{\mathrm{MSE}}$ is

$$E[\widehat{\mathrm{MSE}}] = \frac{1}{N} \sum_{i=1}^{N} E[(f_i - x_i)^2] = \mu'_{(2)f} - 2\mu'_{(1,1)fx} + E[x^2].$$

Because $x_i^2 = x_i$ for a Bernoulli random variable, Eq. (14) reduces to

$$E[\widehat{\mathrm{MSE}}] = \mu'_{(2)f} - 2\mu'_{(1,1)fx} + \mu_x.$$

Because $E[\widehat{\mathrm{MSE}}]$ is equal to MSE (see section 2b), the Brier score is an unbiased estimator of MSE.
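The estimator itself is a one-line computation. A minimal sketch (our own implementation, with illustrative data):

```python
import numpy as np

def brier_score(f, x):
    """Sample Brier (half-Brier) score: the mean squared error of the
    probability forecasts against the binary event outcomes."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.mean((f - x) ** 2)

# Illustrative verification sample
f = np.array([0.9, 0.1, 0.7, 0.3])
x = np.array([1, 0, 0, 1])
bs = brier_score(f, x)  # (0.01 + 0.01 + 0.49 + 0.49) / 4 = 0.25
```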

### b. Sampling variance of the Brier score

The variance of a sample estimator is a measure of its sampling uncertainty. The variance of the Brier score is

$$V[\widehat{\mathrm{MSE}}] = \frac{1}{N} V[(f - x)^2].$$

The variance term on the right-hand side can be expanded as

$$V[(f - x)^2] = E[(f - x)^4] - \left(E[(f - x)^2]\right)^2.$$

Using the binomial expansion, the first term is

$$E[(f - x)^4] = \mu'_{(4)f} - 4\mu'_{(3,1)fx} + 6\mu'_{(2,1)fx} - 4\mu'_{(1,1)fx} + \mu_x,$$

where the simplifications again use $x^m = x$. Combining all of the terms,

$$V[\widehat{\mathrm{MSE}}] = \frac{1}{N} \left[ \mu'_{(4)f} - 4\mu'_{(3,1)fx} + 6\mu'_{(2,1)fx} - 4\mu'_{(1,1)fx} + \mu_x - \mathrm{MSE}^2 \right].$$

The standard error of a sample estimator is also used to describe its sampling uncertainty. The standard error is equivalent to the standard deviation of the estimator. Therefore, the standard error for the Brier score is defined as

$$\sigma_{\widehat{\mathrm{MSE}}} = \sqrt{V[\widehat{\mathrm{MSE}}]}.$$
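Because the Brier score is the mean of $N$ independent and identically distributed squared errors, its sampling variance is the variance of a single squared error divided by $N$. The sketch below (our own arrangement, equivalent in spirit to the moment expression but computed directly from the squared errors) estimates the standard error from a sample:

```python
import numpy as np

def brier_score_stderr(f, x):
    """Estimated standard error of the Brier score. The score is the
    mean of N i.i.d. squared errors, so its sampling variance is
    Var[(f - x)^2] / N; the population variance is replaced here by
    its (1/N) sample counterpart built from sample moments."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    se2 = (f - x) ** 2                                # squared errors
    n = se2.size
    var_bs = (np.mean(se2**2) - np.mean(se2) ** 2) / n
    return np.sqrt(var_bs)
```

When every squared error is identical, the estimated standard error is zero, as expected for a mean of constant terms.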

### c. Brier skill score

The sample estimator for $\mathrm{SS}_{\mathrm{MSE}}$, where climatology is used as a reference forecast, is

$$\widehat{\mathrm{SS}}_{\mathrm{MSE}} = 1 - \frac{\widehat{\mathrm{MSE}}}{\hat{\sigma}_x^2} = 1 - \frac{\widehat{\mathrm{MSE}}}{\hat{\mu}_x(1 - \hat{\mu}_x)},$$

where $\hat{\mu}_x$ is the sample mean of the observations. For probability forecasts, this estimator is known as the Brier skill score.

Because it involves a ratio of sample estimators, the Brier skill score is a biased estimator of $\mathrm{SS}_{\mathrm{MSE}}$ (Mason 2004). Using a Taylor series expansion, a second-order approximation of the bias $B(\widehat{\mathrm{SS}}_{\mathrm{MSE}})$ is (Benjamin and Cornell 1970)

where

As shown in section 3a, $\widehat{\mathrm{MSE}}$ is an unbiased estimator of MSE. However, for a Bernoulli random variable with parameter $\mu_x$, the sample estimator of $\sigma_x^2$ is biased (see appendix B):

By substitution and simplification,

For a Bernoulli random variable, the variance of the sample estimator $\hat{\sigma}_x^2$ is (Kenney and Keeping 1951)

Using the result shown in appendix B, the covariance term is

### d. Sampling variance of the Brier skill score

The variance of the Brier skill score $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ is

Here, we employ a first-order approximation (Benjamin and Cornell 1970) to estimate the variance of SS_{MSE}:

where

and

The standard error for the Brier skill score $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ is approximated by
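As an illustration, a first-order (delta-method) calculation can be carried out directly from a verification sample. The sketch below is our own arrangement of such an approximation — writing the skill score as a function of the means of the per-pair quantities $(f_i - x_i)^2$ and $x_i$ and propagating their sampling covariance through the gradient — and is not necessarily the exact grouping of terms derived in the text:

```python
import numpy as np

def bss_stderr_delta(f, x):
    """First-order (delta-method) sketch of the Brier skill score
    standard error. Write SS = g(u_bar, w_bar), with u_i = (f_i - x_i)^2,
    w_i = x_i, and g(u, w) = 1 - u / (w * (1 - w)); then propagate the
    sampling covariance of the two sample means through the gradient.
    This is an illustrative arrangement, not the paper's exact formula."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    u = (f - x) ** 2
    n = u.size
    ub, wb = u.mean(), x.mean()
    v = wb * (1.0 - wb)                              # sample variance of x
    grad = np.array([-1.0 / v,                       # dg/du_bar
                     ub * (1.0 - 2.0 * wb) / v**2])  # dg/dw_bar
    cov = np.cov(np.vstack([u, x]), bias=True) / n   # covariance of the means
    return float(np.sqrt(grad @ cov @ grad))
```

Because the covariance of the means shrinks like $1/N$, the approximate standard error falls like $1/\sqrt{N}$ as the sample grows.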

## 4. Forecasting examples

To evaluate the analytical expressions for the sampling uncertainties, we used Monte Carlo simulations to generate verification datasets for several synthetic forecasting examples. For each example, the joint distribution *p*( *f*, *x*) and the true values of the accuracy and skill measures are known. Ten thousand verification data samples are generated for each example, for sample sizes ranging from 50 to 1000 pairs. For each verification data sample, sample estimates of the accuracy and skill (the Brier score and Brier skill score) are computed. To determine the “true” standard error of the measures for comparison with the analytical expressions, we compute the standard deviation of the measures for the 10 000 verification data samples. Note that with a Monte Carlo sample of this size, the uncertainty of the true value is less than about 1.5% (at a 95% confidence level).

A stochastic model based on the calibration–refinement factorization of the joint distribution $p(f, x)$ (Murphy 1997) was used to construct the synthetic forecasting examples for the Monte Carlo simulations. Forecast–observation pairs were created from the stochastic model as follows. First, the probability forecast $f_i$ for the discrete event is randomly generated. The marginal distribution of the forecasts $s(f)$ is modeled by a beta distribution (Krzysztofowicz and Long 1991):

$$s(f) = \frac{f^{\upsilon - 1}(1 - f)^{\omega - 1}}{B(\upsilon, \omega)}, \qquad 0 \le f \le 1,$$

where $\upsilon$ and $\omega$ are parameters, and $B(\upsilon, \omega)$ is the beta function. Next, the conditional expected value of the observation $\mu_{x|f}$, given the forecast $f$, is evaluated. A linear model (Murphy and Wilks 1998) is used:

$$\mu_{x|f} = a + bf,$$

where $a$ and $b$ are parameters. Finally, a corresponding observation $x_i$ is generated for the given forecast $f_i$. This is accomplished by generating a uniform random variable over the range from 0 to 1; if the value is less than $\mu_{x|f_i}$, then observation $x_i$ is set to 1 (event occurrence). Otherwise, the observation is set to 0 (event nonoccurrence).
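The generation procedure above can be sketched in a few lines. The implementation below is our own illustration of the described model (the function name and seed handling are assumptions); the parameter names follow the text:

```python
import numpy as np

def generate_verification_sample(n, v, w, a, b, rng=None):
    """Generate n forecast-observation pairs from the
    calibration-refinement model described above:
    f ~ Beta(v, w), then x ~ Bernoulli(a + b * f).
    The clip is a safety guard; with valid (a, b) the
    conditional mean already lies in [0, 1]."""
    rng = np.random.default_rng(rng)
    f = rng.beta(v, w, size=n)           # marginal distribution s(f)
    p = np.clip(a + b * f, 0.0, 1.0)     # conditional mean mu_{x|f}
    x = (rng.uniform(size=n) < p).astype(int)
    return f, x

# Conditionally unbiased case (a = 0, b = 1): the event frequency
# should match the mean forecast, here E[f] = v / (v + w) = 0.25.
f, x = generate_verification_sample(20000, 1.0, 3.0, 0.0, 1.0, rng=42)
```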

We use the results for selected examples to illustrate the properties of the uncertainty estimators. In particular, we show results for forecasts of a rarely occurring event ($\mu_x = 0.05$) and for a more commonly occurring event ($\mu_x = 0.25$). For both cases, we generated forecasts with low ($\mathrm{SS}_{\mathrm{MSE}} = 0.2$), medium ($\mathrm{SS}_{\mathrm{MSE}} = 0.4$), and high ($\mathrm{SS}_{\mathrm{MSE}} = 0.6$) skill. In all the cases, the forecast is assumed to be unconditionally unbiased ($\mu_f = \mu_x$), which imposes the following constraint on the linear model parameters:

$$a = (1 - b)\mu_x.$$

Note that if $b = 1$, the forecast is also conditionally unbiased; if $b \ne 1$, the forecast has conditional biases. We examined both of these cases. For the conditionally unbiased forecast case,

$$\mu_{x|f} = f,$$

which implies that the forecast is perfectly reliable. For cases with a conditionally biased forecast, we selected $b = 0.8$ and used Eq. (39) to find $a$.

Table 1 shows the model parameters for the cases presented in the following section. Also shown are relative measures of forecast quality (Murphy 1997) for each case. By design, the relative measures are the same for a given skill level within the unbiased or the conditionally biased forecast cases. By definition, the reliability (REL) of the forecast is 0 for a perfectly reliable forecast; the reliability is nonzero for the cases with conditional biases. Also, the resolution (RES), discrimination (DIS), and sharpness (SHP) of the forecast are all higher for a more skillful forecast. Finally, in the case of a conditionally biased forecast with $b = 0.8$, the highest achievable skill score is 0.6. For this high-skill case, the forecast must be perfectly sharp (the probability forecast $f$ is either 0 or 1). Hence, the relative sharpness ($\mathrm{SHP}/\sigma_x^2$) for the high-skill case with conditionally biased forecasts achieves its maximum value of 1. Note that for this special case, the forecast $f$ is a discrete random variable and is generated from a Bernoulli distribution with parameter $\mu_f$.

## 5. Results

### a. Standard errors

First we evaluate the analytical expressions for the standard errors of the Brier score and Brier skill score. True values for the sampling uncertainty are represented by the 10 000-sample estimates from the Monte Carlo simulation. These are compared to the analytical expression estimates made using the known (true) values for the moments.

Figure 1 shows the standard error estimates for the Brier score for medium-skill ($\mathrm{SS}_{\mathrm{MSE}} = 0.4$) forecasts. Results are shown for an unbiased and a conditionally biased forecast, for both rare-event ($\mu_x = 0.05$) and common-event ($\mu_x = 0.25$) occurrences. Note that the analytical expression is exact for the sampling variance; not surprisingly, the standard error estimates are virtually identical to the true values in all cases. For a forecast with conditional biases, the standard errors are slightly higher than those for a forecast with no conditional biases. Furthermore, the standard errors for $\widehat{\mathrm{MSE}}$ for forecasts of common-event occurrences are higher than those for rare-event occurrences. This occurs because the magnitude of MSE tends to increase as the inherent uncertainty $\sigma_x^2$ increases; for probability forecasts, $\sigma_x^2$ is small as $\mu_x$ approaches 0 (or 1), and reaches its maximum for $\mu_x = 0.5$ [see Eq. (3)]. Therefore, the absolute magnitudes of the standard error for $\widehat{\mathrm{MSE}}$ are not directly comparable for rare- and common-event cases, because they depend on the event occurrence frequency.

For the Brier skill score $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$, the analytical expression for its sampling variance is a first-order approximation. The quality of the approximation for the standard error is illustrated in Fig. 2 for unbiased forecasts with low skill ($\mathrm{SS}_{\mathrm{MSE}} = 0.2$) and high skill ($\mathrm{SS}_{\mathrm{MSE}} = 0.6$). For large sample sizes, the standard error estimates are very good in all cases. However, significant underestimation of the standard error occurs for forecasts of the rare event for small sample sizes. Unlike MSE, $\mathrm{SS}_{\mathrm{MSE}}$ is directly comparable for rare- and common-event cases because it is nondimensionalized by the variance of the observations. For the forecasts of the rare event, there are fewer occurrences of the forecast event, resulting in relatively high sampling variance. Still, the deviations of the standard error approximation for these cases are evident only with small sample sizes (200 or less), when there are very few rare-event occurrences in the sample (e.g., 10 or less). Note too that the standard errors for both the rare- and common-event cases increase slightly with increasing skill (see Fig. 2), but the increase is proportionally smaller than the increase in the skill itself. For the high-skill forecast, $\mathrm{SS}_{\mathrm{MSE}}$ is larger than the standard error for all sample sizes, whereas for the low-skill forecast, the standard error for the rare-event case exceeds $\mathrm{SS}_{\mathrm{MSE}}$ for very small sample sizes.

The impact of conditional forecast biases is illustrated in Fig. 3 for the case of medium-skill ($\mathrm{SS}_{\mathrm{MSE}} = 0.4$) forecasts. As is the case for $\widehat{\mathrm{MSE}}$, the standard errors for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ are higher when the forecast is conditionally biased. Approximate standard errors are nearly exact for common-event forecasts, even with conditional biases (see Fig. 3b). However, the same deviation in the approximation occurs for the rare-event case when conditional forecast biases are present (see Fig. 3a).

### b. Bias in the Brier skill score

Although the sample estimator of MSE is unbiased, the sample estimator of $\mathrm{SS}_{\mathrm{MSE}}$ is biased. Figure 4 shows the expected skill score $E[\widehat{\mathrm{SS}}_{\mathrm{MSE}}]$ for medium-skill forecasts based on the Monte Carlo simulation, and that evaluated with the second-order approximation of the bias shown in Eq. (22). Clearly, biases are largest for the forecasts of the rare event; the biases are only slightly larger when the forecast is conditionally biased. The magnitude of the bias can be quite large. For rare-event occurrences with a true $\mathrm{SS}_{\mathrm{MSE}}$ of 0.4, the expected skill score is 0.34 or less for a sample of fewer than a hundred forecast–observation pairs. Although not shown, the magnitude of the bias does not vary greatly with the magnitude of the forecast skill in these forecasting examples.

The second-order approximation of the biases produces a reasonable prediction of the bias for rare-event occurrences (see Fig. 4a). The curves follow the overall trend found in the Monte Carlo experiments, but the magnitude of the bias is underestimated. When the actual bias is less, as is the case for common-event occurrences (see Fig. 4b), the approximation is poor. In particular, the magnitudes of the observed biases are much larger for small sample sizes than predicted for the common-event occurrences.

### c. Approximate confidence intervals

In practice, the analytical expressions would be used to estimate the uncertainty of the Brier score and Brier skill score computed with a verification dataset. Uncertainty estimates are obtained by replacing the moments in the standard error and bias equations in section 3 with moment estimates from a sample (see appendix A for traditional sample estimators of the moments). Note that for the standard error of the Brier score, this approach is mathematically similar to the sample estimator proposed by Ferro (2007) for ensemble forecasts. Using the resulting sample estimates of uncertainty, one can construct error bars, perform hypothesis testing, or construct confidence intervals for individual sample estimates. To evaluate the performance of the uncertainty expressions when sample moments are used, we examine confidence interval estimates for both MSE and SS_{MSE} for verification data samples.

In principle, confidence interval estimation requires knowledge of the sampling distributions for $\widehat{\mathrm{MSE}}$ and $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$. However, for sufficiently large sample sizes, the central limit theorem states that the sampling distribution of a sum of independent and identically distributed random variables will converge to a normal distribution. Therefore, a normal distribution approximation is often used to construct confidence intervals. For $\widehat{\mathrm{MSE}}$, which involves the sum of the squared forecast errors, it is reasonable to assume that its sampling distribution will converge to a normal distribution for large sample sizes. However, the situation is not as straightforward for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$, which essentially involves the ratio of two sums.

Figure 5 shows the empirical sampling distribution for $\widehat{\mathrm{MSE}}$ from the Monte Carlo simulation for medium-skill forecasts with no bias. For rare-event occurrences (see Fig. 5a), the lower bound at 0 causes deviations from a normal distribution for small sample sizes. Otherwise, the normal distribution approximation is reasonable for the sampling distribution. For common-event occurrences (see Fig. 5b), the normal distribution approximation is reasonable, except perhaps in the tails of the distribution.

Figure 6 shows the sampling distribution for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ for the same medium-skill forecasts. For rare-event occurrences (see Fig. 6a), the sampling distributions are negatively skewed due to the upper bound at 1, and the skewness is significant for most sample sizes shown. Therefore, a normal distribution is a poor approximation for the observed sampling distribution for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$. For common-event occurrences (see Fig. 6b), the skewness of the sampling distribution is not as severe. For sample sizes of 400 or greater, a normal distribution approximation is not unreasonable.

Despite the deviations observed, we examined the characteristics of confidence intervals constructed for individual samples using a normal approximation. For the case where a sample statistic $\hat{Y}$ is normally distributed with unknown variance, the 100(1 − $\alpha$)% confidence interval for the true parameter $\mu_Y$ is

$$\hat{Y} \pm t_{\alpha/2,\,N-1}\,\hat{\sigma}_Y,$$

where $t_{\alpha/2,N-1}$ is the critical value of the $t$ distribution for a two-sided test with a significance level of $\alpha$ and $N - 1$ degrees of freedom, and $\hat{\sigma}_Y$ is the sample estimate of the standard error of the test statistic. For all forecast cases, we constructed confidence limits for individual verification datasets. In particular, we computed the sample moments (see appendix A) for each dataset, and we used the sample estimates in the standard error equations shown in section 3 to obtain standard error estimates for $\widehat{\mathrm{MSE}}$ and $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ and confidence intervals. Then, the fractions of cases where the true values of MSE and $\mathrm{SS}_{\mathrm{MSE}}$ fell within the confidence limits were computed. The results are shown for 95% confidence intervals in Table 2 for unbiased forecasts.
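A sketch of the interval construction for the Brier score (our own implementation; for simplicity it substitutes the normal quantile for the $t$ critical value, which is nearly identical at the sample sizes considered here):

```python
import numpy as np
from statistics import NormalDist

def brier_score_ci(f, x, alpha=0.05):
    """Approximate 100*(1 - alpha)% confidence interval for MSE based
    on the Brier score and its estimated standard error. Uses the
    normal quantile in place of the t critical value (a large-sample
    simplification of the interval described in the text)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    se2 = (f - x) ** 2                 # squared errors
    n = se2.size
    bs = se2.mean()
    stderr = np.sqrt((np.mean(se2**2) - bs**2) / n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return bs - z * stderr, bs + z * stderr
```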

For the forecasts of the rare event, the 95% confidence intervals constructed using sample estimates contain the true values much less than 95% of the time for small sample sizes. Given the underestimation of the standard error for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ for the forecasts of the rare event, underestimation of the interval is not surprising. However, the $\widehat{\mathrm{MSE}}$ variance estimator is exact, and the underestimation is of a similar magnitude for a given skill level and sample size. Although deviations from the normal distribution approximation may contribute to the underestimation, the dominant cause is the use of sample (rather than the true) moments to estimate the sampling variances of $\widehat{\mathrm{MSE}}$ and $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$. Note that the expressions involve higher-order moments (up to fourth order), so sample estimates of these moments have large sampling uncertainty for a small sample size. For the forecasts of the common event, the 95% confidence intervals performed much better, even though slight deviations occur for small sample sizes. For all cases, the confidence interval results are best (closest to 95%) for low-skill forecasts. The intervals for $\widehat{\mathrm{MSE}}$ are better for the rare-event occurrences, but those for $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ are generally better for the common-event occurrences. So, despite the approximation used for the variance of $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$, estimated confidence intervals for both $\widehat{\mathrm{MSE}}$ and $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ perform similarly using the normal distribution approximation.

## 6. Discussion

As the results of the Monte Carlo experiments show, forecasters can reliably use the standard error and confidence limit estimators presented here to characterize uncertainty in many forecasting situations. Still, significant deviations can occur with forecasts made for rarely occurring events for small sample sizes (roughly less than a few hundred forecast–observation pairs). Unfortunately, the desire to characterize sampling uncertainty for rare (high impact) events is usually the greatest when the forecast–observation data sample is small. In these situations, standard error estimates for the Brier skill score, and confidence limits for both the Brier score and Brier skill score, underestimate the actual uncertainty.

On the other hand, the estimated standard errors and confidence limits, although deficient, do indicate the large uncertainties that exist for such cases. As an example, Fig. 7 shows the 95% confidence limits for $\mathrm{SS}_{\mathrm{MSE}}$ as a function of the sample size for a low-skill forecast of a rare event ($\mu_x = 0.05$), computed using the approximate standard error estimators and the true model parameters. Note that for sample sizes up to about 300, one cannot reject the hypothesis that the true skill score is 0 for this case. For a sample size of 50, the estimated 95% confidence interval for $\mathrm{SS}_{\mathrm{MSE}}$ covers a range of [−0.26, 0.57]. Even though these estimates understate the true uncertainty range, the interval is still so large that one cannot conclude that the forecast of the rare event is skillful from such a small sample. Although it may be possible to devise corrections that improve uncertainty estimates for small sample sizes and rare events, the conclusions drawn in such cases would not be substantially different.

A second-order approximation of the bias for the SS_{MSE} estimator was also derived, which could be applied to correct biases for sample estimates. However, the utility of such a correction is debatable. As suggested by Fig. 7, the magnitude of the bias is quite small compared to the sampling variability, especially for large sample sizes. As shown in Fig. 4, the approximation performed best for rare-event occurrences for small sample sizes, where the sampling uncertainty is too large to make statistical inferences. For common-event occurrences, the bias is much smaller, and the approximation underestimates the bias to such an extent that its use for bias correction would not substantially change the results.

Another important consideration in the use of the uncertainty estimators is their limit of applicability. Recall that the assumptions of the DO approach to forecast verification (Murphy 1997) were used to derive the estimators. One assumption is that forecast–observation pairs are identically distributed; a verification dataset is a random sample drawn from this unchanging (or stationary) joint distribution. Yet in many applications, a verification dataset pools together forecasts issued at different times of the year, and for different locations. In this case, the assumption of identically distributed forecast–observation pairs may be violated, because the joint distribution of forecasts and observations may depend on the season and location.

Hamill and Juras (2006) provide some insightful examples of how nonstationarity in the form of temporal and spatial variations in climatology affects verification measures. In particular, if the joint distribution varies with time or location, verification measures will be unreliable (biased) and misleading. Whenever such variations are a concern, Hamill and Juras (2006) recommend computing verification measures with subsamples where stationarity of the joint distribution can be reasonably assumed. Of course, the drawback to subdividing the verification data sample into smaller subsamples (to avoid such biases) is the inherent increase in sampling variability.

If one adopts this approach when dealing with pooled samples, the uncertainty estimators derived here still have applications. Consider the case where a pooled sample of size $N$ is drawn from $M$ distinct groups, each with its own (stationary) joint distribution. Mathematically, the sample Brier score computed with the pooled sample, $\widehat{\mathrm{MSE}}_p$, is equivalent to

$$\widehat{\mathrm{MSE}}_p = \sum_{i=1}^{M} \frac{N_i}{N}\,\widehat{\mathrm{MSE}}_i,$$

where $\widehat{\mathrm{MSE}}_i$ is the sample Brier score for the $i$th group and $N_i$ is the sample size for the $i$th group. Using probability theory, and assuming independence of the subsamples, the variance of the pooled Brier score is

$$V[\widehat{\mathrm{MSE}}_p] = \sum_{i=1}^{M} \left(\frac{N_i}{N}\right)^2 V[\widehat{\mathrm{MSE}}_i],$$

where $V[\widehat{\mathrm{MSE}}_i]$ is defined by Eq. (19) for each subsample.

In cases where a single skill measure is desired for a pooled sample, Hamill and Juras (2006) recommend using a weighted average of the skill score for (stationary) subsamples, rather than the skill score computed directly from the pooled sample. One suggested weighted-average skill measure is

$$\widehat{\mathrm{SS}}_w = \sum_{i=1}^{M} w_i\,\widehat{\mathrm{SS}}_{\mathrm{MSE},i}, \qquad \sum_{i=1}^{M} w_i = 1,$$

where, for example, the weights may be taken as $w_i = N_i/N$. Again, assuming independence of the subsamples, the standard error of this weighted-average skill score is simply

$$\sigma_{\widehat{\mathrm{SS}}_w} = \left[ \sum_{i=1}^{M} w_i^2\, V[\widehat{\mathrm{SS}}_{\mathrm{MSE},i}] \right]^{1/2},$$

where $V[\widehat{\mathrm{SS}}_{\mathrm{MSE},i}]$ is defined by Eq. (32) for the stationary subsample.
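The pooled-sample formulas need only per-subsample summaries. A sketch (our own helper; each tuple holds a hypothetical subsample's size, score, and sampling variance of that score):

```python
import numpy as np

def pooled_score(groups):
    """Pooled (sample-size-weighted) score and its variance from
    per-group results, assuming independent stationary subsamples.
    Each group supplies (n_i, score_i, var_i): subsample size, the
    score (e.g., Brier score or skill score), and the sampling
    variance of that score. The pooled score is the weighted average
    with weights N_i / N; its variance follows from independence."""
    n = np.array([g[0] for g in groups], dtype=float)
    s = np.array([g[1] for g in groups], dtype=float)
    var = np.array([g[2] for g in groups], dtype=float)
    w = n / n.sum()                      # weights N_i / N
    return float(w @ s), float(w**2 @ var)

# Two hypothetical subsamples: (size, Brier score, variance of score)
bs_p, var_p = pooled_score([(100, 0.04, 1e-4), (300, 0.08, 2e-4)])
```

Here the pooled score is 0.25 × 0.04 + 0.75 × 0.08 = 0.07, and the pooled variance weights each subsample variance by the squared weight.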

Another assumption of the DO approach is that each forecast–observation pair is independent of all others. The independence assumption is violated whenever forecasts or observations are significantly correlated in time or space. Temporal correlation for probability forecasts may be a concern for frequently updated event forecasts (e.g., daily probability of precipitation), but sample sizes also tend to be relatively large in this case. On the other hand, the spatial correlation for probability forecasts is often a concern for both meteorological and climate forecasts. If the forecasts or observations are stationary, but significantly correlated, the sample size is effectively reduced, and the estimators presented here would underestimate the true sampling uncertainty.

One clear advantage of the analytical estimators of sampling uncertainty is that they are easy to compute at the same time as verification measures. In contrast, with resampling approaches, a few hundred or thousand resampled datasets must be constructed, and verification measures computed for each case. Clearly, the added computational burden is of no consequence for any one verification dataset. However, the computational effort can be a significant obstacle when evaluating the suite of products for an ensemble forecasting system, where the ensemble forecasts are transformed into probability forecasts for multiple event thresholds (e.g., Bradley et al. 2004), and evaluated at many different locations and forecast times (e.g., Kruger et al. 2007). Still, if the underlying stochastic process for forecasts and observations assumed in the derivation of these estimators is not valid, resampling remains an attractive nonparametric alternative.

## 7. Summary and conclusions

The Brier score and Brier skill score are commonly used verification measures for probability forecasts. However, their sampling characteristics have not been explored in great detail. Using the joint distribution of forecasts and observations as defined in the distributions-oriented approach to forecast verification (Murphy and Winkler 1992), we derive analytical expressions for the sampling uncertainties of the Brier score and Brier skill score. The expressions depend only on the moments of the forecasts and observations, making sampling uncertainties easy to estimate at the same time as the scores themselves. The expressions can be used with verification datasets to estimate standard errors and biases, construct confidence intervals, or perform hypothesis testing (e.g., Hamill 1999) on the Brier score and Brier skill score.

Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. For the Brier skill score, the approximation underestimates the standard error at small sample sizes (a few hundred forecast–observation pairs or less) for infrequently occurring events. In contrast, the bias estimator for the Brier skill score only provides reasonable estimates when the bias is large, a situation that exists only for infrequently occurring events. Confidence interval estimates constructed for individual samples using a normal distribution approximation perform well except at small sample sizes, where the sampling distributions are skewed because of large sampling variances and the lower (upper) bound on the Brier score (Brier skill score).

Although this paper focuses on uncertainty estimation for measures of accuracy and skill, there are additional measures that describe other aspects of the forecast quality for probability forecasts (Murphy 1997; Wilks 2006). Bradley et al. (2003) showed that other distributions-oriented measures for probability forecasts can also be expressed as a function of the moments of the joint distribution. We are currently exploring the use of sampling theory to develop approximate uncertainty estimates for these measures as well.

## Acknowledgments

This work was supported in part by National Oceanic and Atmospheric Administration (NOAA) Grant NA04NWS4620015, from the National Weather Service (NWS) Office of Hydrologic Development, and Grant NA16GP1569, from the Office of Global Programs as part of the GEWEX Americas Prediction Project (GAPP). We gratefully acknowledge this support. We also thank the two anonymous reviewers for their thoughtful comments and suggestions.

## REFERENCES


### APPENDIX A

#### Sample Moment Estimators

Let *x*_{i} be the observation at time *i*, and let *f*_{i} be the probability forecast of the event at time *i*. The verification data sample is then {*f*_{i}, *x*_{i}, *i* = 1, . . . , *N*}. The traditional sample moment estimator for the mean of the observations is

$$\hat{\mu}_x = \frac{1}{N} \sum_{i=1}^{N} x_i.$$
Because *x* is a Bernoulli random variable, the sample estimator for the variance of the observations *σ*^{2}_{x} is simply

$$\hat{\sigma}^{2}_{x} = \hat{\mu}_x (1 - \hat{\mu}_x).$$
The sample estimators for the mean and higher-order noncentral moments of the forecasts are

$$\hat{\mu}_f = \frac{1}{N} \sum_{i=1}^{N} f_i, \qquad \hat{\mu}'_{(m)f} = \frac{1}{N} \sum_{i=1}^{N} f_i^{\,m}.$$
To estimate the conditional mean and higher-order noncentral moments of the forecasts, the verification data sample is partitioned into two sets. Let {*f*^{0}_{j}, *j* = 1, . . . , *N*_{0}} be the subsample of forecasts for the case where the event does not occur (*x* = 0). Let {*f*^{1}_{k}, *k* = 1, . . . , *N*_{1}} be the subsample of forecasts for the case where the event occurs (*x* = 1). The conditional means *μ̂*_{f |x=0} and *μ̂*_{f |x=1}, and higher-order noncentral moments *μ̂*′_{(m) f |x=0} and *μ̂*′_{(m) f |x=1} are then estimated using the subsamples in a similar fashion as above.

Note that sample estimates of the accuracy and skill measures shown in sections 2b and 2c, or the bias and standard errors shown in section 3, can be obtained by replacing the moments with the sample moment estimates shown above.
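The estimators of this appendix can be sketched in a few lines; the function name is illustrative, and the variance estimator uses the Bernoulli form given above.

```python
import numpy as np

def sample_moments(f, x, m_max=2):
    """Sample moment estimators from appendix A.

    Returns the observation mean, the Bernoulli variance estimate
    mu_x*(1 - mu_x), the unconditional noncentral forecast moments
    E[f^m], and the conditional moments given x = 0 and x = 1."""
    f, x = np.asarray(f, float), np.asarray(x, int)
    mu_x = x.mean()
    var_x = mu_x * (1.0 - mu_x)                  # Bernoulli variance estimator
    moments = {m: np.mean(f**m) for m in range(1, m_max + 1)}
    cond = {v: {m: np.mean(f[x == v]**m) for m in range(1, m_max + 1)}
            for v in (0, 1)}                     # partition on the outcome
    return mu_x, var_x, moments, cond

# Example verification sample of four forecast-observation pairs
mu_x, var_x, moments, cond = sample_moments([0.2, 0.8, 0.4, 0.6], [0, 1, 0, 1])
```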

### APPENDIX B

#### Covariance Term

The covariance of the sample estimators of the mean squared error (Brier score) and the variance of the observations *σ̂*^{2}_{x} is

$$\mathrm{Cov}[\widehat{MSE}, \hat{\sigma}^{2}_{x}] = E[\widehat{MSE}\,\hat{\sigma}^{2}_{x}] - E[\widehat{MSE}]\,E[\hat{\sigma}^{2}_{x}].$$

By substituting sample moments into the expressions for the *MSE* and *σ*^{2}_{x} in Eqs. (10) and (3), the first term on the right-hand side is

Expanding the terms,

We will evaluate expectations by conditioning on the number of occurrences of *x* = 1, denoted *N*_{1}. For a known sample size *N*, *N*_{1} is a random variable. Because *x* is a Bernoulli random variable, *N*_{1} has a binomial distribution with the parameter *μ*_{x}. Note that the number of occurrences of *x* = 0, denoted *N*_{0}, is related to *N*_{1} by the relationship *N* = *N*_{0} + *N*_{1}. Conditioning on *N*_{1}, the first term is

Using the conditional expectations for the noncentral moments of *f*,

The unconditional expectation is

Because *N*_{1} is a binomial random variable, the expected values for the noncentral moments are

By substitution,

By definition,

Therefore, the expectation simplifies to

Using the same approach with other terms in Eq. (B3), the second term is

the third term is

and the fourth term is

Because the sample estimator of the *MSE* is unbiased, its expected value is

$$E[\widehat{MSE}] = MSE.$$

The expected value of *σ̂*^{2}_{x} is

$$E[\hat{\sigma}^{2}_{x}] = \left( \frac{N-1}{N} \right) \sigma^{2}_{x},$$

which implies that the sample estimator is biased.
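As a quick numerical sanity check (a minimal sketch, assuming the Bernoulli variance estimator *μ̂*_{x}(1 − *μ̂*_{x}) from appendix A), a short Monte Carlo simulation shows that the estimator's expectation falls below *σ*^{2}_{x} by the factor (*N* − 1)/*N*.

```python
import numpy as np

# Monte Carlo check that E[sigma_hat^2] = (N-1)/N * sigma^2 for the
# Bernoulli variance estimator mu_hat*(1 - mu_hat).
rng = np.random.default_rng(1)
N, mu, trials = 10, 0.3, 200_000
x = rng.random((trials, N)) < mu               # Bernoulli(mu) samples
mu_hat = x.mean(axis=1)                        # sample mean per trial
sigma2_hat = mu_hat * (1.0 - mu_hat)           # variance estimate per trial
expected = (N - 1) / N * mu * (1.0 - mu)       # theoretical expectation
mc_mean = sigma2_hat.mean()
```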

Combining all of the terms, the covariance term simplifies to

## Footnotes

*Corresponding author address:* A. Allen Bradley, IIHR-Hydroscience and Engineering, The University of Iowa, 107 C. Maxwell Stanley Hydraulics Laboratory, Iowa City, IA 52242. Email: allen-bradley@uiowa.edu