## 1. Introduction

Forecast verification provides critical information on the quality of forecasts and can help guide forecasters in their efforts to improve forecasting systems (Jolliffe and Stephenson 2003; Welles et al. 2007). Although forecasters routinely report verification results, they often do not consider the inherent uncertainty of the verification measures (Jolliffe 2007). However, computing verification measures requires a *sample* of forecasts and their corresponding observations. The resulting verification measures are therefore *sample estimates*. The sampling uncertainty of verification measures depends on both the sample size and the statistical characteristics of the forecasts and observations. In situations where the sample size is relatively small, the sampling uncertainty can be significant (Bradley et al. 2003).

One approach to assessing the sampling uncertainty of verification measures is a resampling method (Efron 1981; Wilks 2006). Forecast–observation pairs are randomly selected from the verification dataset to construct synthetic data samples, and the verification measures are computed; the process is repeated many times to derive an empirical probability distribution for the sample measures. Examples of a resampling approach include Mason and Mimmack (1992), Wilks (1996), Hamill (1999), Zhang and Casey (2000), Doblas-Reyes et al. (2003), Accadia et al. (2003), Ebert et al. (2004), Accadia et al. (2005), Jolliffe (2007), and Ferro (2007), among others. Although the approach is straightforward and robust, it adds significantly to the computational effort associated with verification.

Another approach is to employ sampling theory to derive analytical expressions for sampling uncertainty. Most examples in the literature are for categorical forecast verification measures, such as the probability of detection or hit rate (Kane and Brown 2000; Jolliffe 2007), odds ratio (Stephenson 2000), false alarm rate and skill scores (Thornes and Stephenson 2001), the correlation coefficient (Jolliffe 2007), and relative operating characteristics (Mason and Graham 2002). For probability forecasts, sampling theory has been applied to evaluate the sampling uncertainty of the bias (Schwartz 1992; Carpenter and Georgakakos 2001) and accuracy measures (Ferro 2007). Sampling theory has also been used to develop confidence intervals (Seaman et al. 1996) and hypothesis tests (Hamill 1999; Briggs and Ruppert 2005; Briggs 2005). Although the assumptions associated with deriving analytical expressions can be limiting, the effort required to evaluate the expressions is minimal.

In this paper, we use sampling theory to develop analytical expressions for the sampling uncertainty of the Brier score and Brier skill score. For probability forecasts, these are the most commonly used measures of accuracy and skill. In the following sections, we derive exact or approximate analytical expressions for the sampling uncertainty of these two measures, and then use Monte Carlo simulation for synthetic forecasting examples to evaluate the expressions and estimated confidence intervals.

## 2. Forecast verification framework

A probability forecast is one that assigns a probability to the occurrence of a discrete (dichotomous) event. In this case, there are only two possible outcomes: either the forecast event occurs or it does not. A familiar example is a probability-of-precipitation forecast. Probability forecasts can also be constructed from ensemble forecasts (Toth et al. 2003; Zhu 2005): if an event is defined as the exceedance (or nonexceedance) of a particular threshold value (e.g., precipitation in excess of a given depth), then a probability forecast for the event occurrence can be made from the ensemble members.

Using a sample of probability forecasts and the resulting (binary) event outcomes, the Brier score and Brier skill score (Brier 1950) are often computed to characterize the accuracy and skill of the forecast. To evaluate the uncertainty of these verification measures from sampling theory, assumptions must be made about the statistical properties of forecasts and observations. The diagnostic framework for forecast verification introduced by Murphy and Winkler (1992), known as the distributions-oriented (DO) approach, explicitly defines a stochastic process commonly assumed for verification. Specifically, forecasts and observations are treated as random variables. Each forecast–observation pair is assumed to be independent of all other pairs, and identically distributed. The relationship between forecasts and observations is defined by their joint distribution.

Using the assumptions of the DO approach and the notation described by Murphy (1997), we will evaluate the sampling uncertainty of the Brier score and Brier skill score. In particular, consider a forecasting system that produces a probability forecast for a dichotomous event. Let *x* be a Bernoulli random variable that takes on a value of 1 if the event occurs and 0 if it does not. Let *f* be a probability forecast of the occurrence of the event. That is, *f* is the forecast probability that the event occurs (i.e., that *x* = 1). The joint distribution *p*( *f*, *x*) describes the relationship between the forecasts and observations.

### a. Distribution moments

Let $\mu_x = E[x]$ denote the mean of the observations; $\mu_x$ is interpreted as the climatological probability of the event occurrence. Because $x$ is a Bernoulli random variable, $x^m = x$ for all positive integers $m$, so the higher-order noncentral moments are all equal to the mean:

$$E[x^m] = \mu_x, \qquad m = 1, 2, \ldots$$

The variance $\sigma_x^2$ is

$$\sigma_x^2 = \mu_x (1 - \mu_x). \tag{3}$$

In contrast, the forecast $f$ may be either a discrete or continuous random variable. Let $\mu_f$ be the first moment (mean) of the forecasts. The higher-order noncentral moments are denoted as

$$\mu'_{(m)f} = E[f^m].$$

The derivations also require moments of $f$ conditioned on the observation. For instance, the first conditional moments are denoted as $\mu_{f|x=0}$ and $\mu_{f|x=1}$, and the $m$th conditional moments as $\mu'_{(m)f|x=0}$ and $\mu'_{(m)f|x=1}$. Finally, the joint noncentral moments of $f$ and $x$ of order $m$ and $n$ are denoted as

$$\mu'_{(m,n)fx} = E[f^m x^n].$$
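As a concrete illustration, the moments defined above can be estimated directly from a verification sample. The sketch below uses an illustrative toy sample (not data from this paper), and `do_moments` is a hypothetical helper name:

```python
import numpy as np

def do_moments(f, x):
    """Sample estimates of the DO-framework moments for probability
    forecasts f and binary observations x."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    mu_x = x.mean()                      # climatological event probability
    var_x = mu_x * (1.0 - mu_x)          # Bernoulli variance mu_x(1 - mu_x)
    mu_f = f.mean()                      # mean forecast
    mu_f_x0 = f[x == 0].mean()           # E[f | x = 0]
    mu_f_x1 = f[x == 1].mean()           # E[f | x = 1]
    return {"mu_x": mu_x, "var_x": var_x, "mu_f": mu_f,
            "mu_f_x0": mu_f_x0, "mu_f_x1": mu_f_x1}

# toy verification sample (illustrative values only)
f = [0.9, 0.7, 0.2, 0.1, 0.6, 0.3]
x = [1, 1, 0, 0, 1, 0]
m = do_moments(f, x)
```

For a well-calibrated forecast, the conditional mean `mu_f_x1` should exceed `mu_f_x0`, reflecting discrimination between occurrences and nonoccurrences.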

### b. Accuracy measure

Forecast *accuracy* is an attribute describing the closeness of the forecasts $f$ to the observations $x$ (Murphy 1997). A commonly used measure of the accuracy of the forecasts is the mean squared error:

$$\mathrm{MSE}(f, x) = E[(f - x)^2]. \tag{8}$$

For probability forecasts of a dichotomous event, the MSE is identical to the Brier score (Brier 1950) and can be written in terms of the moments of the joint distribution of $f$ and $x$ (Bradley et al. 2003). By expanding the terms in Eq. (8),

$$\mathrm{MSE}(f, x) = \mu'_{(2)f} - 2\,\mu'_{(1,1)fx} + \mu_x.$$
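The moment form of the MSE can be checked numerically against the direct definition; the two agree term by term since $E[x^2] = \mu_x$ for a Bernoulli variable. A small sketch with illustrative data:

```python
import numpy as np

f = np.array([0.9, 0.7, 0.2, 0.1, 0.6, 0.3])
x = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])

# direct sample Brier score: mean of (f - x)^2
mse_direct = np.mean((f - x) ** 2)

# moment form: mu'_(2)f - 2 mu'_(1,1)fx + mu_x
mu2_f = np.mean(f ** 2)   # second noncentral moment of f
mu11 = np.mean(f * x)     # joint moment E[f x]
mu_x = np.mean(x)
mse_moments = mu2_f - 2.0 * mu11 + mu_x
```

Both computations give the same sample Brier score, which is why the verification measures can be expressed entirely in terms of sample moments.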

### c. Skill measure

*skill*. Let MSE(

*r*,

*x*) donate the accuracy for the reference forecasts

*r*. The mean squared error skill score SS

_{MSE}is then

*μ*as the reference forecast,

_{x}_{MSE}using climatology as a reference forecast is
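With climatology as the reference, the skill score follows directly from the sample Brier score and the observation variance. A minimal sketch (illustrative sample values; `brier_skill_score` is a hypothetical helper name):

```python
import numpy as np

def brier_skill_score(f, x):
    """SS_MSE = 1 - MSE(f, x) / sigma_x^2, with the climatological
    probability of the event as the reference forecast."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    mse = np.mean((f - x) ** 2)
    mu_x = x.mean()
    var_x = mu_x * (1.0 - mu_x)   # MSE of the climatology forecast
    return 1.0 - mse / var_x

f = [0.9, 0.7, 0.2, 0.1, 0.6, 0.3]
x = [1, 1, 0, 0, 1, 0]
ss = brier_skill_score(f, x)   # positive: f beats climatology
```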

## 3. Sampling uncertainty

Verification measures of accuracy and skill are computed for a forecasting system from a *sample* of forecasts and observations. Let $x_i$ be the observation at time $i$, and let $f_i$ be the probability forecast of the event at time $i$. Assume that the forecast–observation pairs $\{f_i, x_i,\ i = 1, \ldots, N\}$ are a random sample drawn from the joint distribution $p(f, x)$. The verification data sample will be used to *estimate* forecast accuracy and skill. In the following sections, we define sample estimators for the measures of forecast accuracy and skill, and derive expressions for their sampling uncertainty.

### a. Brier score

The sample estimator of the MSE (the Brier score) is

$$\widehat{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (f_i - x_i)^2. \tag{14}$$

Because $x_i^2 = x_i$ for a Bernoulli random variable, Eq. (14) reduces to

$$\widehat{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} f_i^2 - \frac{2}{N} \sum_{i=1}^{N} f_i x_i + \frac{1}{N} \sum_{i=1}^{N} x_i.$$

Taking expectations term by term gives $E[\widehat{\mathrm{MSE}}] = \mathrm{MSE}$; that is, the sample Brier score is an unbiased estimator.

### b. Sampling variance of the Brier score

Because $\widehat{\mathrm{MSE}}$ is the average of $N$ independent and identically distributed terms $(f_i - x_i)^2$, its sampling variance is exactly

$$V[\widehat{\mathrm{MSE}}] = \frac{1}{N}\left\{ E[(f - x)^4] - \mathrm{MSE}^2 \right\}. \tag{19}$$

Expanding $E[(f - x)^4]$ and again using $x^m = x$ expresses the variance in terms of the moments defined in section 2:

$$E[(f - x)^4] = \mu'_{(4)f} - 4\mu'_{(3,1)fx} + 6\mu'_{(2,1)fx} - 4\mu'_{(1,1)fx} + \mu_x.$$

No approximation is involved, so this expression for the sampling variance of the Brier score is exact.
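Because the Brier score is a mean of iid terms, its sampling variance is $\mathrm{Var}[(f - x)^2]/N$, a standard variance-of-the-mean result. The sketch below checks this against Monte Carlo, using an assumed toy joint distribution ($f \sim \mathrm{Beta}(2, 5)$, $P(x = 1 \mid f) = f$; not one of the paper's cases):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pairs(n):
    # assumed toy joint distribution: f ~ Beta(2, 5), P(x=1 | f) = f
    f = rng.beta(2.0, 5.0, size=n)
    x = (rng.random(n) < f).astype(float)
    return f, x

n = 50
# population variance of d = (f - x)^2 from one very large sample
f, x = sample_pairs(2_000_000)
d = (f - x) ** 2
analytic_var = d.var() / n     # Var[MSE_hat] = Var[(f - x)^2] / n

# Monte Carlo: spread of the Brier score across many samples of size n
reps = 5000
scores = np.empty(reps)
for i in range(reps):
    fi, xi = sample_pairs(n)
    scores[i] = np.mean((fi - xi) ** 2)
mc_var = scores.var()          # agrees closely with analytic_var
```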

### c. Brier skill score

The sample estimator of $\mathrm{SS}_{\mathrm{MSE}}$, where climatology is used as a reference forecast, is

$$\widehat{\mathrm{SS}}_{\mathrm{MSE}} = 1 - \frac{\widehat{\mathrm{MSE}}}{\hat{\sigma}_x^2},$$

where $\hat{\sigma}_x^2$ is the sample variance of the observations. Unlike the Brier score, the sample skill score is a biased estimator of $\mathrm{SS}_{\mathrm{MSE}}$ (Mason 2004). Using a Taylor series expansion, a second-order approximation of the bias $B(\widehat{\mathrm{SS}}_{\mathrm{MSE}})$ is (Benjamin and Cornell 1970)

$$B(\widehat{\mathrm{SS}}_{\mathrm{MSE}}) = E[\widehat{\mathrm{SS}}_{\mathrm{MSE}}] - \mathrm{SS}_{\mathrm{MSE}} \approx \frac{\mathrm{Cov}[\widehat{\mathrm{MSE}}, \hat{\sigma}_x^2]}{\sigma_x^4} - \frac{\mathrm{MSE}\; V[\hat{\sigma}_x^2]}{\sigma_x^6}. \tag{22}$$

In addition, because it is computed with the sample estimate of $\mu_x$, the sample estimator of $\sigma_x^2$ is biased (see appendix B):

$$E[\hat{\sigma}_x^2] = \left(1 - \frac{1}{N}\right) \sigma_x^2.$$

The sampling variance of $\hat{\sigma}_x^2$ is (Kenney and Keeping 1951)

$$V[\hat{\sigma}_x^2] = \frac{\mu_{(4)x} - \sigma_x^4}{N} - \frac{2\left(\mu_{(4)x} - 2\sigma_x^4\right)}{N^2} + \frac{\mu_{(4)x} - 3\sigma_x^4}{N^3},$$

where $\mu_{(4)x}$ is the fourth central moment of the observations.
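The negative bias of the sample skill score is easy to see by simulation. The sketch below again assumes a toy reliable-forecast distribution ($f \sim \mathrm{Beta}(2, 5)$, $P(x = 1 \mid f) = f$), for which the true skill score is exactly 0.125:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy joint distribution (not one of the paper's cases):
# true values: mu_x = 2/7, sigma_x^2 = (2/7)(5/7), MSE = E[f(1-f)] = 10/56
mu_x = 2.0 / 7.0
true_mse = 10.0 / 56.0
true_ss = 1.0 - true_mse / (mu_x * (1.0 - mu_x))   # = 0.125

def ss_sample(n):
    """Sample Brier skill score for one verification dataset of size n."""
    f = rng.beta(2.0, 5.0, size=n)
    x = (rng.random(n) < f).astype(float)
    p = x.mean()
    if p in (0.0, 1.0):     # degenerate sample (zero variance); skip
        return None
    return 1.0 - np.mean((f - x) ** 2) / (p * (1.0 - p))

n, reps = 60, 20000
vals = [v for v in (ss_sample(n) for _ in range(reps)) if v is not None]
bias = np.mean(vals) - true_ss   # negative: skill is systematically underestimated
```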

### d. Sampling variance of the Brier skill score

A first-order approximation of the sampling variance of $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ is obtained from a Taylor series expansion of the ratio $\widehat{\mathrm{MSE}}/\hat{\sigma}_x^2$ about the true values of MSE and $\sigma_x^2$. The resulting sampling variance of $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ is approximated by

$$V[\widehat{\mathrm{SS}}_{\mathrm{MSE}}] \approx \frac{V[\widehat{\mathrm{MSE}}]}{\sigma_x^4} - \frac{2\,\mathrm{MSE}\;\mathrm{Cov}[\widehat{\mathrm{MSE}}, \hat{\sigma}_x^2]}{\sigma_x^6} + \frac{\mathrm{MSE}^2\; V[\hat{\sigma}_x^2]}{\sigma_x^8}. \tag{32}$$
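The first-order (delta method) variance of a ratio of two correlated estimators can be checked by simulation: estimate the variances and covariance of the numerator and denominator across many samples, combine them with the delta-method formula, and compare with the directly observed variance of the skill score. A sketch under an assumed toy distribution ($f \sim \mathrm{Beta}(2, 5)$, $P(x = 1 \mid f) = f$):

```python
import numpy as np

rng = np.random.default_rng(3)

mu_x = 2.0 / 7.0
sigma2_x = mu_x * (1.0 - mu_x)
true_mse = 10.0 / 56.0        # E[f(1 - f)] for this toy model

n, reps = 200, 20000
mse_hat = np.empty(reps)
var_hat = np.empty(reps)
ss_hat = np.empty(reps)
for i in range(reps):
    f = rng.beta(2.0, 5.0, size=n)
    x = (rng.random(n) < f).astype(float)
    mse_hat[i] = np.mean((f - x) ** 2)
    p = x.mean()
    var_hat[i] = p * (1.0 - p)
    ss_hat[i] = 1.0 - mse_hat[i] / var_hat[i]

# delta method: V[SS] ~ V[MSE]/s^4 - 2 MSE Cov/s^6 + MSE^2 V[s2]/s^8
v_mse = mse_hat.var()
v_s2 = var_hat.var()
cov = np.cov(mse_hat, var_hat)[0, 1]
approx = (v_mse / sigma2_x**2
          - 2.0 * true_mse * cov / sigma2_x**3
          + true_mse**2 * v_s2 / sigma2_x**4)
direct = ss_hat.var()          # approx and direct agree at this sample size
```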

## 4. Forecasting examples

To evaluate the analytical expressions for the sampling uncertainties, we used Monte Carlo simulations to generate verification datasets for several synthetic forecasting examples. For each example, the joint distribution *p*( *f*, *x*) and the true values of the accuracy and skill measures are known. Ten thousand verification data samples are generated for each example, for sample sizes ranging from 50 to 1000 pairs. For each verification data sample, sample estimates of the accuracy and skill (the Brier score and Brier skill score) are computed. To determine the “true” standard error of the measures for comparison with the analytical expressions, we compute the standard deviation of the measures for the 10 000 verification data samples. Note that with a Monte Carlo sample of this size, the uncertainty of the true value is less than about 1.5% (at a 95% confidence level).

A stochastic model of the joint distribution $p(f, x)$ (Murphy 1997) was used to construct the synthetic forecasting examples for the Monte Carlo simulations. Forecast–observation pairs were created from the stochastic model as follows. First, the probability forecast $f_i$ for the discrete event is randomly generated. The marginal distribution of the forecasts $s(f)$ is modeled by a beta distribution (Krzysztofowicz and Long 1991):

$$s(f) = \frac{f^{\upsilon - 1}(1 - f)^{\omega - 1}}{B(\upsilon, \omega)}, \qquad 0 \le f \le 1,$$

where $\upsilon$ and $\omega$ are parameters, and $B(\upsilon, \omega)$ is the beta function. Next, the conditional expected value of the observation $\mu_{x|f}$, given the forecast $f$, is evaluated. A linear model (Murphy and Wilks 1998) is used:

$$\mu_{x|f} = E[x \mid f] = a + b f,$$

where $a$ and $b$ are parameters. Finally, a corresponding observation $x_i$ is generated for the given forecast $f_i$. This is accomplished by generating a uniform random variable over the range from 0 to 1; if the value is less than $\mu_{x|f_i}$, the observation $x_i$ is set to 1 (event occurrence). Otherwise, the observation is set to 0 (event nonoccurrence).
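The generation procedure can be sketched as follows. The parameter values here are illustrative, not the Table 1 cases, and `generate_pairs` is a hypothetical helper name; the beta parameters are chosen so that the forecast mean matches the event frequency:

```python
import numpy as np

def generate_pairs(n, upsilon, omega, a, b, rng):
    """Generate forecast-observation pairs from the stochastic model:
    f ~ Beta(upsilon, omega); P(x = 1 | f) = a + b f (linear model)."""
    f = rng.beta(upsilon, omega, size=n)
    mu_x_given_f = a + b * f
    # uniform draw: event occurs when u < conditional probability
    x = (rng.random(n) < mu_x_given_f).astype(int)
    return f, x

rng = np.random.default_rng(4)
# illustrative parameters: an unconditionally unbiased forecast
# requires a = (1 - b) * mu_x
mu_x, b = 0.25, 0.8
a = (1.0 - b) * mu_x
# beta parameters with mean mu_x: upsilon / (upsilon + omega) = mu_x
upsilon, omega = 1.0, 3.0
f, x = generate_pairs(100_000, upsilon, omega, a, b, rng)
# both f.mean() and x.mean() are near mu_x = 0.25
```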

Forecasting examples were constructed for a rare event ($\mu_x$ = 0.05) and for a more commonly occurring event ($\mu_x$ = 0.25). For both cases, we generated forecasts with low (SS_{MSE} = 0.2), medium (SS_{MSE} = 0.4), and high skill (SS_{MSE} = 0.6). In all the cases, the forecast is assumed to be unconditionally unbiased ($\mu_f = \mu_x$), which imposes the following constraint on the linear model parameters:

$$a = (1 - b)\,\mu_x. \tag{39}$$

If $b$ = 1, the forecast is also conditionally unbiased; if $b \ne 1$, the forecast has conditional biases. We examined both of these cases. For the conditionally unbiased forecast case, we set $b$ = 1; for the conditionally biased case, we set $b$ = 0.8 and used Eq. (39) to find $a$.

Table 1 shows the model parameters for the cases presented in the following section. Also shown are relative measures of forecast quality (Murphy 1997) for each case. By design, the relative measures are the same for a given skill level within the unbiased or the conditionally biased forecast cases. By definition, the reliability (REL) of the forecast is 0 for a perfectly reliable forecast; the reliability is nonzero for the cases with conditional biases. Also, the resolution (RES), discrimination (DIS), and sharpness (SHP) of the forecast are all higher for a more skillful forecast. Finally, in the case of a conditionally biased forecast with $b$ = 0.8, the highest achievable skill score is 0.6. For this high-skill case, the forecast must be perfectly sharp (the probability forecast $f$ is either 0 or 1). Hence, the relative sharpness (SHP/$\sigma_x^2$) for the high-skill case with conditionally biased forecasts achieves its maximum value of 1. Note that for this special case, the forecast $f$ is a discrete random variable and is generated from a Bernoulli distribution with parameter $\mu_f$.

## 5. Results

### a. Standard errors

First we evaluate the analytical expressions for the standard errors of the Brier score and Brier skill score. True values for the sampling uncertainty are represented by the 10 000-sample estimates from the Monte Carlo simulation. These are compared to the analytical expression estimates made using the known (true) values for the moments.

Figure 1 shows the standard error estimates for the Brier score $\widehat{\mathrm{MSE}}$ for medium-skill (SS_{MSE} = 0.4) forecasts. Results are shown for an unbiased and a conditionally biased forecast, for both rare- ($\mu_x$ = 0.05) and common-event ($\mu_x$ = 0.25) occurrences. Note that the analytical expression is exact for the sampling variance; not surprisingly, the standard error estimates are virtually identical to the true values in all cases. For a forecast with conditional biases, the standard errors are slightly higher than those for a forecast with no conditional biases. Furthermore, the standard errors depend strongly on the climatological frequency of the event. MSE tends to increase as the inherent uncertainty $\sigma_x^2$ increases; for probability forecasts, $\sigma_x^2$ is small as $\mu_x$ approaches 0 (or 1), and reaches its maximum for $\mu_x$ = 0.5 [see Eq. (3)]. Therefore, the absolute magnitudes of the standard error for $\widehat{\mathrm{MSE}}$ are much smaller for the rare event than for the common event.

_{x}For the Brier skill score _{MSE}, the analytical expression for its sampling variance is a first-order approximation. The quality of the approximation for the standard error is illustrated in Fig. 2 for unbiased forecasts with low skill (SS_{MSE} = 0.2) and high skill (SS_{MSE} = 0.6). For large sample sizes, the standard error estimates are very good in all cases. However, significant underestimation of the standard error occurs for forecasts of the rare event for small sample sizes. Unlike MSE, SS_{MSE} is directly comparable for rare- and common-event cases because it is nondimensionalized by the variance of the observations. For the forecasts of the rare event, there are fewer occurrences of the forecast event, resulting in relatively high sampling variance. Still, the deviations of the standard error approximation for these cases are evident only with small sample sizes (200 or less), when there are very few rare-event occurrences in the sample (e.g., 10 or less). Note too that the standard errors for both the rare- and common-event cases increase slightly with increasing skill (see Fig. 2), but the increase is proportionally smaller than the increase in the skill itself. For the high-skill forecast, SS_{MSE} is larger than the standard error for all sample sizes, whereas for the low-skill forecast, the standard error for the rare-event case exceeds SS_{MSE} for very small sample sizes.

The impact of conditional forecast biases is illustrated in Fig. 3 for the case of medium-skill (SS_{MSE} = 0.4) forecasts. As is the case for the Brier score, the standard errors for SS_{MSE} are higher when the forecast is conditionally biased. Approximate standard errors are nearly exact for common-event forecasts, even with conditional biases (see Fig. 3b). However, the same deviation in the approximation occurs for the rare-event case when conditional forecast biases are present (see Fig. 3a).

### b. Bias in the Brier skill score

Although the sample estimator of MSE is unbiased, the sample estimator of SS_{MSE} is biased. Figure 4 shows the expected skill score *E*[SS_{MSE}] for medium-skill forecasts based on the Monte Carlo simulation, and that evaluated with the second-order approximation of the bias shown in Eq. (22). Clearly, biases are largest for the forecasts of the rare event; the biases are only slightly larger when the forecast is conditionally biased. The magnitude of the bias can be quite large. For rare-event occurrences with a true SS_{MSE} of 0.4, the expected skill score is 0.34 or less for a sample of fewer than a hundred forecast–observation pairs. Although not shown, the magnitude of the bias does not vary greatly with the magnitude of the forecast skill in these forecasting examples.

The second-order approximation of the biases produces a reasonable prediction of the bias for rare-event occurrences (see Fig. 4a). The curves follow the overall trend found in the Monte Carlo experiments, but the magnitude of the bias is underestimated. When the actual bias is less, as is the case for common-event occurrences (see Fig. 4b), the approximation is poor. In particular, the magnitudes of the observed biases are much larger for small sample sizes than predicted for the common-event occurrences.

### c. Approximate confidence intervals

In practice, the analytical expressions would be used to estimate the uncertainty of the Brier score and Brier skill score computed with a verification dataset. Uncertainty estimates are obtained by replacing the moments in the standard error and bias equations in section 3 with moment estimates from a sample (see appendix A for traditional sample estimators of the moments). Note that for the standard error of the Brier score, this approach is mathematically similar to the sample estimator proposed by Ferro (2007) for ensemble forecasts. Using the resulting sample estimates of uncertainty, one can construct error bars, perform hypothesis testing, or construct confidence intervals for individual sample estimates. To evaluate the performance of the uncertainty expressions when sample moments are used, we examine confidence interval estimates for both MSE and SS_{MSE} for verification data samples.

In principle, confidence interval estimation requires knowledge of the sampling distributions for the Brier score and Brier skill score. However, for sufficiently large sample sizes, the central limit theorem states that the sampling distribution of a sum of independent and identically distributed random variables will converge to a normal distribution. Therefore, a normal distribution approximation is often used to construct confidence intervals. The approximation is best justified for the Brier score, which is a simple average of independent and identically distributed squared errors; it is more heuristic for SS_{MSE}, which essentially involves the ratio of two sums.

Figure 5 shows the empirical sampling distribution for the Brier score for the same medium-skill forecasts. For small sample sizes, the sampling distribution for the rare-event case is noticeably skewed, reflecting the lower bound of 0 on the Brier score; as the sample size increases, the distribution approaches a normal distribution.

Figure 6 shows the sampling distribution for SS_{MSE} for the same medium-skill forecasts. For rare-event occurrences (see Fig. 6a), the sampling distributions are negatively skewed due to the upper bound at 1, and the skewness is significant for most sample sizes shown. Therefore, a normal distribution is a poor approximation for the observed sampling distribution for SS_{MSE}. For common-event occurrences (see Fig. 6b), the skewness of the sampling distribution is not as severe. For sample sizes of 400 or greater, a normal distribution approximation is not unreasonable.

If a test statistic $\hat{Y}$ is normally distributed with unknown variance, the 100(1 − $\alpha$)% confidence interval for the true parameter $\mu_Y$ is

$$\hat{Y} \pm t_{\alpha/2, N-1}\, \hat{\sigma}_Y,$$

where $t_{\alpha/2, N-1}$ is the critical value of the $t$ distribution for a two-sided test with a significance level of $\alpha$ and $N - 1$ degrees of freedom, and $\hat{\sigma}_Y$ is the sample estimate of the standard error of the test statistic. For all forecast cases, we constructed confidence limits for individual verification datasets. In particular, we computed the sample moments (see appendix A) for each dataset, and we used the sample estimates in the standard error equations shown in section 3 to obtain standard error estimates for $\widehat{\mathrm{MSE}}$ and $\widehat{\mathrm{SS}}_{\mathrm{MSE}}$ and their confidence intervals. Then, the fractions of cases where the true values of MSE and SS_{MSE} fell within the confidence limits were computed. The results are shown for 95% confidence intervals in Table 2 for unbiased forecasts.
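A confidence interval of this form for the Brier score can be sketched as follows. Because the score is a sample mean, its standard error is estimated directly from the squared errors; for large $N$ the $t$ critical value is close to the normal one, so the sketch uses the normal quantile (the toy data-generating distribution is an assumption, not one of the paper's cases):

```python
import numpy as np
from statistics import NormalDist

def brier_ci(f, x, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for the true
    Brier score, using the normal approximation to the t interval."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    n = len(f)
    d = (f - x) ** 2                  # squared errors
    mse = d.mean()
    se = d.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return mse - z * se, mse + z * se

rng = np.random.default_rng(5)
# assumed toy joint distribution: f ~ Beta(2, 5), P(x=1 | f) = f
f = rng.beta(2.0, 5.0, size=500)
x = (rng.random(500) < f).astype(float)
lo, hi = brier_ci(f, x)   # interval around the sample Brier score
```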

For the forecasts of the rare event, the 95% confidence intervals constructed using sample estimates contain the true values much less than 95% of the time for small sample sizes. Given the underestimation of the standard error for SS_{MSE} for the forecasts of the rare event, underestimation of the interval is not surprising. However, the intervals are also too narrow for the Brier score, even though its variance estimator is exact. Note that the expressions involve higher-order moments (up to fourth order), so sample estimates of these moments have large sampling uncertainty for a small sample size. For the forecasts of the common event, the 95% confidence intervals performed much better, even though slight deviations occur for small sample sizes. For all cases, the confidence interval results are best (closer to 95%) for low-skill forecasts. The intervals for SS_{MSE} are generally better for the common-event occurrences. So, despite the approximation used for the variance of SS_{MSE}, estimated confidence intervals for both the Brier score and Brier skill score perform similarly using the normal distribution approximation.

## 6. Discussion

As the results of the Monte Carlo experiments show, forecasters can reliably use the standard error and confidence limit estimators presented here to characterize uncertainty in many forecasting situations. Still, significant deviations can occur with forecasts made for rarely occurring events for small sample sizes (roughly less than a few hundred forecast–observation pairs). Unfortunately, the desire to characterize sampling uncertainty for rare (high impact) events is usually the greatest when the forecast–observation data sample is small. In these situations, standard error estimates for the Brier skill score, and confidence limits for both the Brier score and Brier skill score, underestimate the actual uncertainty.

On the other hand, the estimated standard errors and confidence limits, although deficient, do indicate the large uncertainties that exist for such cases. As an example, Fig. 7 shows the 95% confidence limits for SS_{MSE} as a function of the sample size for a low-skill forecast of a rare event ($\mu_x$ = 0.05), computed using the approximate standard error estimators and the true model parameters. Note that for sample sizes up to about 300, one cannot reject the hypothesis that the true skill score is 0 for this case. For a sample size of 50, the estimated 95% confidence interval for SS_{MSE} covers a range of [−0.26, 0.57]. Even though these estimates understate the true uncertainty range, the interval is still so large that one cannot conclude that the forecast of the rare event is skillful from such a small sample. Although it may be possible to devise corrections that improve uncertainty estimates for small sample sizes and rare events, the conclusions drawn in such cases would not be substantially different.

A second-order approximation of the bias for the SS_{MSE} estimator was also derived, which could be applied to correct biases for sample estimates. However, the utility of such a correction is debatable. As suggested by Fig. 7, the magnitude of the bias is quite small compared to the sampling variability, especially for large sample sizes. As shown in Fig. 4, the approximation performed best for rare-event occurrences for small sample sizes, where the sampling uncertainty is too large to make statistical inferences. For common-event occurrences, the bias is much smaller, and the approximation underestimates the bias to such an extent that its use for bias correction would not substantially change the results.

Another important consideration in the use of the uncertainty estimators is their limit of applicability. Recall that the assumptions of the DO approach to forecast verification (Murphy 1997) were used to derive the estimators. One assumption is that forecast–observation pairs are identically distributed; a verification dataset is a random sample drawn from this unchanging (or stationary) joint distribution. Yet in many applications, a verification dataset pools together forecasts issued at different times of the year, and for different locations. In this case, the assumption of identically distributed forecast–observation pairs may be violated, because the joint distribution of forecasts and observations may depend on the season and location.

Hamill and Juras (2006) provide some insightful examples of how nonstationarity in the form of temporal and spatial variations in climatology affects verification measures. In particular, if the joint distribution varies with time or location, verification measures will be unreliable (biased) and misleading. Whenever such variations are a concern, Hamill and Juras (2006) recommend computing verification measures with subsamples where stationarity of the joint distribution can be reasonably assumed. Of course, the drawback to subdividing the verification data sample into smaller subsamples (to avoid such biases) is the inherent increase in sampling variability.

To see the effect of pooling on sampling uncertainty, consider a pooled verification dataset of size $N$ drawn from $M$ distinct groups, each with its own (stationary) joint distribution. Mathematically, the sample Brier score computed with the pooled sample $\widehat{\mathrm{MSE}}_p$ is equivalent to

$$\widehat{\mathrm{MSE}}_p = \sum_{i=1}^{M} \frac{N_i}{N}\, \widehat{\mathrm{MSE}}_i,$$

where $\widehat{\mathrm{MSE}}_i$ is the sample Brier score for the $i$th group and $N_i$ is the sample size for the $i$th group. Using probability theory, and assuming independence of the subsamples, the variance of the pooled Brier score is

$$V[\widehat{\mathrm{MSE}}_p] = \sum_{i=1}^{M} \left(\frac{N_i}{N}\right)^2 V[\widehat{\mathrm{MSE}}_i],$$

where $V[\widehat{\mathrm{MSE}}_i]$ is defined by Eq. (19) for each subsample. An analogous expression can be written for the sampling variance of a pooled Brier skill score, where $V[\widehat{\mathrm{SS}}_i]$ is defined by Eq. (32) for the stationary subsample.
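The pooled computation is a weighted average, and under independence its variance combines with squared weights. A minimal sketch (`pooled_brier` and `pooled_brier_var` are hypothetical helper names, and the subsample numbers are illustrative, not results from the paper):

```python
import numpy as np

def pooled_brier(scores, sizes):
    """Pooled Brier score across M stationary subsamples:
    a sample-size-weighted average of the subsample scores."""
    scores, sizes = np.asarray(scores, float), np.asarray(sizes, float)
    n = sizes.sum()
    return np.sum((sizes / n) * scores)

def pooled_brier_var(variances, sizes):
    """Variance of the pooled score, assuming independent subsamples:
    V[MSE_p] = sum_i (N_i / N)^2 V[MSE_i]."""
    variances, sizes = np.asarray(variances, float), np.asarray(sizes, float)
    n = sizes.sum()
    return np.sum((sizes / n) ** 2 * variances)

# hypothetical subsample results (scores, sizes, and score variances)
scores = [0.10, 0.18, 0.14]
sizes = [200, 300, 500]
vars_ = [4e-4, 6e-4, 3e-4]
mse_p = pooled_brier(scores, sizes)
var_p = pooled_brier_var(vars_, sizes)
```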

Another assumption of the DO approach is that each forecast–observation pair is independent of all others. The independence assumption is violated whenever forecasts or observations are significantly correlated in time or space. Temporal correlation for probability forecasts may be a concern for frequently updated event forecasts (e.g., daily probability of precipitation), but sample sizes also tend to be relatively large in this case. On the other hand, the spatial correlation for probability forecasts is often a concern for both meteorological and climate forecasts. If the forecasts or observations are stationary, but significantly correlated, the sample size is effectively reduced, and the estimators presented here would underestimate the true sampling uncertainty.

One clear advantage of the analytical estimators of sampling uncertainty is that they are easy to compute at the same time as verification measures. In contrast, with resampling approaches, a few hundred or thousand resampled datasets must be constructed, and verification measures computed for each case. Clearly, the added computational burden is of no consequence for any one verification dataset. However, the computational effort can be a significant obstacle when evaluating the suite of products for an ensemble forecasting system, where the ensemble forecasts are transformed into probability forecasts for multiple event thresholds (e.g., Bradley et al. 2004), and evaluated at many different locations and forecast times (e.g., Kruger et al. 2007). Still, if the underlying stochastic process for forecasts and observations assumed in the derivation of these estimators is not valid, resampling remains an attractive nonparametric alternative.

## 7. Summary and conclusions

The Brier score and Brier skill score are commonly used verification measures for probability forecasts. However, their sampling characteristics have not been explored in great detail. Using the joint distribution of forecasts and observations as defined in the distributions-oriented approach to forecast verification (Murphy and Winkler 1992), we derive analytical expressions for the sampling uncertainties of the Brier score and Brier skill score. The expressions depend only on the moments of the forecasts and observations, making sampling uncertainties easy to estimate at the same time as the scores themselves. The expressions can be used with verification datasets to estimate standard errors and biases, construct confidence intervals, or perform hypothesis testing (e.g., Hamill 1999) on the Brier score and Brier skill score.

Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. For the Brier skill score, the approximation underestimates the standard error at small sample sizes (a few hundred forecast–observation pairs or less) for infrequently occurring events. In contrast, the bias estimator for the Brier skill score only provides reasonable estimates when the bias is large, a situation that exists only for infrequently occurring events. Confidence interval estimates constructed for individual samples using a normal distribution approximation perform well except at small sample sizes, where the sampling distributions are skewed because of large sampling variances and the lower (upper) bound on the Brier score (Brier skill score).

Although this paper focuses on uncertainty estimation for measures of accuracy and skill, there are additional measures that describe other aspects of the forecast quality for probability forecasts (Murphy 1997; Wilks 2006). Bradley et al. (2003) showed that other distributions-oriented measures for probability forecasts can also be expressed as a function of the moments of the joint distribution. We are currently exploring the use of sampling theory to develop approximate uncertainty estimates for these measures as well.

## Acknowledgments

This work was supported in part by National Oceanic and Atmospheric Administration (NOAA) Grant NA04NWS4620015, from the National Weather Service (NWS) Office of Hydrologic Development, and Grant NA16GP1569, from the Office of Global Programs as part of the GEWEX Americas Prediction Project (GAPP). We gratefully acknowledge this support. We also thank the two anonymous reviewers for their thoughtful comments and suggestions.

## REFERENCES

Accadia, C., Mariani, S., Casaioli, M., Lavagnini, A., and Speranza, A., 2003: Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids. *Wea. Forecasting*, **18**, 918–932.

Accadia, C., Mariani, S., Casaioli, M., Lavagnini, A., and Speranza, A., 2005: Verification of precipitation forecasts from two limited-area models over Italy and comparison with ECMWF forecasts using a resampling technique. *Wea. Forecasting*, **20**, 276–300.

Benjamin, J., and Cornell, C., 1970: *Probability, Statistics, and Decision for Civil Engineers*. McGraw-Hill, 684 pp.

Bradley, A. A., Hashino, T., and Schwartz, S. S., 2003: Distributions-oriented verification of probability forecasts for small data samples. *Wea. Forecasting*, **18**, 903–917.

Bradley, A. A., Schwartz, S. S., and Hashino, T., 2004: Distributions-oriented verification of ensemble streamflow predictions. *J. Hydrometeor.*, **5**, 532–545.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Briggs, W., 2005: A general method of incorporating forecast cost and loss in value scores. *Mon. Wea. Rev.*, **133**, 3393–3397.

Briggs, W., and Ruppert, D., 2005: Assessing the skill of yes/no forecasts. *Biometrics*, **61**, 799–807.

Carpenter, T., and Georgakakos, K., 2001: Assessment of Folsom Lake response to historical and potential future climate scenarios: 1. Forecasting. *J. Hydrol.*, **249**, 148–175.

Doblas-Reyes, F., Pavan, V., and Stephenson, D., 2003: The skill of multi-model seasonal forecasts of the wintertime North Atlantic Oscillation. *Climate Dyn.*, **21**, 501–514.

Ebert, E. E., Wilson, L. J., Brown, B. G., Nurmi, P., Brooks, H. E., Bally, J., and Jaeneke, M., 2004: Verification of nowcasts from the WWRP Sydney 2000 Forecast Demonstration Project. *Wea. Forecasting*, **19**, 73–96.

Efron, B., 1981: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. *Biometrika*, **68**, 589–599.

Ferro, C. A., 2007: Comparing probabilistic forecasting systems with the Brier score. *Wea. Forecasting*, **22**, 1076–1088.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hamill, T. M., and Juras, J., 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? *Quart. J. Roy. Meteor. Soc.*, **132**, 2905–2923.

Jolliffe, I., 2007: Uncertainty and inference for verification measures. *Wea. Forecasting*, **22**, 637–650.

Jolliffe, I., and Stephenson, D., 2003: Introduction. *Forecast Verification: A Practitioner's Guide in Atmospheric Science*, I. Jolliffe and D. Stephenson, Eds., John Wiley, 1–12.

Kane, T. L., and Brown, B. G., 2000: Confidence intervals for some verification measures—a survey of several methods. Preprints, *15th Conf. on Probability and Statistics in the Atmospheric Sciences*, Asheville, NC, Amer. Meteor. Soc., 46–49.

Kenney, J. F., and Keeping, E. S., 1951: *Mathematics of Statistics, Part 2*. 2nd ed. Van Nostrand, 429 pp.

Kruger, A., Khandelwal, S., and Bradley, A. A., 2007: AHPSVER: A web-based system for hydrologic forecast verification. *Comput. Geosci.*, **33**, 739–748.

Krzysztofowicz, R., and Long, D., 1991: Beta likelihood models of probabilistic forecasts. *Int. J. Forecasting*, **7**, 47–55.

Mason, S. J., 2004: On using climatology as a reference strategy in the Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **132**, 1891–1895.

Mason, S. J., and Mimmack, G. M., 1992: The use of bootstrap confidence intervals for the correlation coefficient in climatology. *Theor. Appl. Climatol.*, **45**, 229–233.

Mason, S. J., and Graham, N. E., 2002: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. *Quart. J. Roy. Meteor. Soc.*, **128**, 2145–2166.

Murphy, A. H., 1997: Forecast verification. *Economic Value of Weather and Climate Forecasts*, R. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

Murphy, A. H., and Winkler, R. L., 1992: Diagnostic verification of probability forecasts. *Int. J. Forecasting*, **7**, 435–455.

Murphy, A. H., and Wilks, D. S., 1998: A case study of the use of statistical models in forecast verification: Precipitation probability forecasts. *Wea. Forecasting*, **13**, 795–810.

Schwartz, S. S., 1992: Verifying probabilistic water supply outlooks for the Potomac River basin. Preprints, *28th Conf. and Symp. on Managing Water Resources during Global Change*, Reno, NV, American Water Resources Association, 153–161.

Seaman, R., Mason, I., and Woodcock, F., 1996: Confidence intervals for some performance measures of yes–no forecasts. *Aust. Meteor. Mag.*, **45**, 49–53.

Stephenson, D. B., 2000: Use of the "odds ratio" for diagnosing forecast skill. *Wea. Forecasting*, **15**, 221–232.

Thornes, J. E., and Stephenson, D. B., 2001: How to judge the quality and value of weather forecast products. *Meteor. Appl.*, **8**, 307–314.

Toth, Z., Talagrand, O., Candille, G., and Zhu, Y., 2003: Probability and ensemble forecasts. *Forecast Verification: A Practitioner's Guide in Atmospheric Science*, I. Jolliffe and D. Stephenson, Eds., John Wiley, 137–163.

Welles, E., Sorooshian, S., Carter, G., and Olsen, B., 2007: Hydrologic verification: A call for action and collaboration.

,*Bull. Amer. Meteor. Soc.***88****,**503–511.Wilks, D. S., 1996: Statistical significance of long-range “optimal climate normal” temperature and precipitation forecasts.

,*J. Climate***9****,**827–839.Wilks, D. S., 2006:

*Statistical Methods in the Atmospheric Sciences*. 2nd ed. International Geophysics Series, Vol. 91, Academic Press, 648 pp.Zhang, H., and Casey T. , 2000: Verification of categorical probability forecasts.

,*Wea. Forecasting***15****,**80–89.Zhu, Y. J., 2005: Ensemble forecast: A new approach to uncertainty and predictability.

,*Adv. Atmos. Sci.***22****,**781–788.

## APPENDIX A

### Sample Moment Estimators

Let *x*_{i} be the observation at time *i*, and let *f*_{i} be the probability forecast of the event at time *i*. The verification data sample is then {*f*_{i}, *x*_{i}, *i* = 1, . . . , *N*}. The traditional sample moment estimator for the mean of the observations is

$$\hat{\mu}_{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i} .$$

Since *x* is a Bernoulli random variable, the sample estimator for the variance of the observations *σ̂*^{2}_{x} is simply

$$\hat{\sigma}^{2}_{x} = \hat{\mu}_{x} \left( 1 - \hat{\mu}_{x} \right) .$$

To estimate the conditional mean and higher-order noncentral moments of the forecasts, the verification data sample is partitioned into two sets. Let {*f*^{0}_{j}, *j* = 1, . . . , *N*_{0}} be the subsample of forecasts for the case where the event does not occur (*x* = 0). Let {*f*^{1}_{k}, *k* = 1, . . . , *N*_{1}} be the subsample of forecasts for the case where the event occurs (*x* = 1). The conditional means *μ̂*_{f |x=0} and *μ̂*_{f |x=1}, and higher-order noncentral moments *μ̂*′_{(m) f |x=0} and *μ̂*′_{(m) f |x=1} are then estimated using the subsamples in a similar fashion as above.

Note that sample estimates of the accuracy and skill measures shown in sections 2b and 2c, or the bias and standard errors shown in section 3, can be obtained by replacing the moments with the sample moment estimates shown above.
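As a concrete illustration, the estimators above can be sketched in a few lines of code (the function names are mine, not the paper's): the observations *x*_{i} are 0/1 event indicators, the *f*_{i} are probability forecasts, and the conditional moments are computed from the two subsamples.

```python
# Sketch of the Appendix A sample moment estimators (names are illustrative).

def sample_moments(f, x, m_max=2):
    """Mean and variance of x, plus conditional noncentral moments of f."""
    N = len(x)
    mu_x = sum(x) / N                    # sample mean of the observations
    var_x = mu_x * (1.0 - mu_x)          # Bernoulli variance estimator
    f0 = [fi for fi, xi in zip(f, x) if xi == 0]  # forecasts when x = 0
    f1 = [fi for fi, xi in zip(f, x) if xi == 1]  # forecasts when x = 1
    # Noncentral moments of order m = 1..m_max for each subsample
    mom0 = [sum(fi ** m for fi in f0) / len(f0) for m in range(1, m_max + 1)]
    mom1 = [sum(fi ** m for fi in f1) / len(f1) for m in range(1, m_max + 1)]
    return mu_x, var_x, mom0, mom1

def brier_score(f, x):
    """Sample mean squared error (Brier score) of probability forecasts."""
    return sum((fi - xi) ** 2 for fi, xi in zip(f, x)) / len(f)
```

A sample skill score then follows as 1 − brier_score(f, x)/var_x, with the caveat that numerator and denominator are both sample estimates, which is the source of the bias discussed in section 3.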

## APPENDIX B

### Covariance Term

The covariance between the sample estimators $\widehat{\mathrm{MSE}}$ and *σ̂*^{2}_{x} is

$$\mathrm{Cov}\left( \widehat{\mathrm{MSE}}, \hat{\sigma}^{2}_{x} \right) = E\left[ \widehat{\mathrm{MSE}}\, \hat{\sigma}^{2}_{x} \right] - E\left[ \widehat{\mathrm{MSE}} \right] E\left[ \hat{\sigma}^{2}_{x} \right] .$$

Using the definitions of MSE and *σ*^{2}_{x} in Eqs. (10) and (3), the first term on the right-hand side, $E[\widehat{\mathrm{MSE}}\, \hat{\sigma}^{2}_{x}]$, depends on the number of occurrences of *x* = 1, denoted *N*_{1}. For a known sample size *N*, *N*_{1} is a random variable. Because *x* is a Bernoulli random variable, *N*_{1} has a binomial distribution with the parameter *μ*_{x}. Note that the number of occurrences of *x* = 0, denoted *N*_{0}, is related to *N*_{1} by the relationship *N* = *N*_{0} + *N*_{1}. Conditioning on *N*_{1}, the first term can be written as an expectation, over *N*_{1}, of the conditional moments of *f*. Since *N*_{1} is a binomial random variable, the expected values of its noncentral moments are known in closed form; for example,

$$E\left[ N_{1} \right] = N \mu_{x}, \qquad E\left[ N_{1}^{2} \right] = N \mu_{x} \left( 1 - \mu_{x} + N \mu_{x} \right) .$$

Substituting these expected values into the conditioned expression yields the covariance between $\widehat{\mathrm{MSE}}$ and *σ̂*^{2}_{x}.
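The binomial moments of *N*_{1} used in this appendix can be checked numerically. The sketch below (simulation parameters are my own choices) draws many Bernoulli samples and compares the empirical first and second noncentral moments of *N*_{1} with *Nμ*_{x} and *Nμ*_{x}(1 − *μ*_{x} + *Nμ*_{x}).

```python
# Monte Carlo check that N1 (count of x = 1 in a sample of size N) behaves
# as a binomial variable with parameter mu_x.
import random

random.seed(1)
N, mu_x, trials = 50, 0.25, 20000
s1 = s2 = 0
for _ in range(trials):
    n1 = sum(1 for _ in range(N) if random.random() < mu_x)
    s1 += n1
    s2 += n1 * n1
m1 = s1 / trials                         # empirical E[N1]
m2 = s2 / trials                         # empirical E[N1^2]
# Theory: E[N1] = N*mu_x = 12.5; E[N1^2] = N*mu_x*(1 - mu_x + N*mu_x) = 165.625
```

With 20 000 trials, the empirical moments agree with the closed-form values to well within sampling error.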

Standard error of the Brier skill score for (a) a low-skill forecast (SS_{MSE} = 0.2) and (b) a high-skill forecast (SS_{MSE} = 0.6). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for rare-event (*μ*_{x} = 0.05) and common-event (*μ*_{x} = 0.25) occurrences, for an unbiased forecast.

Citation: Weather and Forecasting 23, 5; 10.1175/2007WAF2007049.1


Standard error of the Brier skill score for a medium-skill forecast (SS_{MSE} = 0.4) for (a) a rare event (*μ*_{x} = 0.05) and (b) a common event (*μ*_{x} = 0.25). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for both an unbiased and a conditionally biased forecast.


Bias of the Brier skill score for a medium-skill forecast (SS_{MSE} = 0.4) for (a) a rare event (*μ*_{x} = 0.05) and (b) a common event (*μ*_{x} = 0.25). The symbols are the true expected values of the Brier skill score derived from the Monte Carlo simulations. The curves are the estimates made using the analytical expression for bias [Eq. (22)]. The expected values are shown for both an unbiased and a conditionally biased forecast.


Sampling distribution of the Brier score for an unbiased forecast with medium skill (SS_{MSE} = 0.4) for (a) a rare event (*μ*_{x} = 0.05) and (b) a common event (*μ*_{x} = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.


Sampling distribution of the Brier skill score for an unbiased forecast with medium skill (SS_{MSE} = 0.4) for (a) a rare event (*μ*_{x} = 0.05) and (b) a common event (*μ*_{x} = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.


The 95% confidence intervals and expected Brier skill score for an unbiased forecast with low skill (SS_{MSE} = 0.2) for a rare event (*μ _{x}* = 0.05). Lines show the results based on analytical expressions evaluated with the true parameters. Symbols show the empirical confidence intervals from the Monte Carlo simulations.



Relative measures of forecast quality for Monte Carlo simulations of forecast–observation pairs. The parameters of the stochastic model (defined in section 4) are *υ*, *ω*, and *b*. The forecast quality measures are skill (SS_{MSE}), resolution (RES), reliability (REL), discrimination (DIS), type-2 conditional bias (*B*_{2}), and sharpness (SHP) (Murphy 1997).

Percentage of samples where the true MSE or SS_{MSE} falls within the 95% confidence interval constructed for the sample. Results are shown for forecasting examples without conditional forecast biases.
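The coverage percentages summarized in this table can be illustrated with a simplified numerical experiment. The sketch below is a stand-in, not the paper's procedure: it uses a constant climatological forecast (*f* = *μ*_{x}, so the true MSE is *σ*^{2}_{x}) and a generic normal-approximation interval rather than the analytical interval of Eq. (36), and all parameter choices are my own.

```python
# Empirical coverage of a nominal 95% confidence interval for the MSE
# (Brier score), using a climatological forecast f = mu_x for illustration.
import random

random.seed(7)
mu_x, N, trials, z = 0.25, 200, 2000, 1.96
true_mse = mu_x * (1 - mu_x)   # for f = mu_x, the true MSE equals sigma_x^2
covered = 0
for _ in range(trials):
    # Squared errors of the climatological forecast for one verification sample
    errs = [(mu_x - (1 if random.random() < mu_x else 0)) ** 2
            for _ in range(N)]
    mse = sum(errs) / N
    var = sum((e - mse) ** 2 for e in errs) / (N - 1)
    half = z * (var / N) ** 0.5          # normal-approximation half-width
    covered += (mse - half) <= true_mse <= (mse + half)
coverage = covered / trials              # empirical coverage of the interval
```

For a sample size this large the empirical coverage sits close to the nominal 95%; the table shows how coverage degrades for smaller samples and rarer events.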