1. Introduction
Forecast verification provides critical information on the quality of forecasts and can help guide forecasters in their efforts to improve forecasting systems (Jolliffe and Stephenson 2003; Welles et al. 2007). Although forecasters routinely report verification results, they often do not consider the inherent uncertainty of the verification measures (Jolliffe 2007). However, computing verification measures requires a sample of forecasts and their corresponding observations. The resulting verification measures are therefore sample estimates. The sampling uncertainty of verification measures depends on both the sample size and the statistical characteristics of the forecasts and observations. In situations where the sample size is relatively small, the sampling uncertainty can be significant (Bradley et al. 2003).
One approach to assessing the sampling uncertainty of verification measures is a resampling method (Efron 1981; Wilks 2006). Forecast–observation pairs are randomly selected from the verification dataset to construct synthetic data samples, and the verification measures are computed; the process is repeated many times to derive an empirical probability distribution for the sample measures. Examples of a resampling approach include Mason and Mimmack (1992), Wilks (1996), Hamill (1999), Zhang and Casey (2000), Doblas-Reyes et al. (2003), Accadia et al. (2003), Ebert et al. (2004), Accadia et al. (2005), Jolliffe (2007), and Ferro (2007), among others. Although the approach is straightforward and robust, it adds significantly to the computational effort associated with verification.
Another approach is to employ sampling theory to derive analytical expressions for sampling uncertainty. Most examples in the literature are for categorical forecast verification measures, such as the probability of detection or hit rate (Kane and Brown 2000; Jolliffe 2007), odds ratio (Stephenson 2000), false alarm rate and skill scores (Thornes and Stephenson 2001), the correlation coefficient (Jolliffe 2007), and relative operating characteristics (Mason and Graham 2002). For probability forecasts, sampling theory has been applied to evaluate the sampling uncertainty of the bias (Schwartz 1992; Carpenter and Georgakakos 2001) and accuracy measures (Ferro 2007). Sampling theory has also been used to develop confidence intervals (Seaman et al. 1996) and hypothesis tests (Hamill 1999; Briggs and Ruppert 2005; Briggs 2005). Although the assumptions associated with deriving analytical expressions can be limiting, the effort required to evaluate the expressions is minimal.
In this paper, we use sampling theory to develop analytical expressions for the sampling uncertainty of the Brier score and Brier skill score. For probability forecasts, these are the most commonly used measures of accuracy and skill. In the following sections, we derive exact or approximate analytical expressions for the sampling uncertainty of these two measures, and then use Monte Carlo simulation with synthetic forecasting examples to evaluate the expressions and the resulting confidence interval estimates.
2. Forecast verification framework
A probability forecast is one that assigns a probability to the occurrence of a discrete (dichotomous) event. In this case, there are only two possible outcomes: either the forecast event occurs or it does not. A familiar example is a probability-of-precipitation forecast. Probability forecasts can also be constructed from ensemble forecasts (Toth et al. 2003; Zhu 2005): if an event is defined as the exceedance (or nonexceedance) of a particular threshold value (e.g., precipitation in excess of a given depth), then the probability forecast of the event's occurrence can be derived from the ensemble members.
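As a concrete illustration of this construction (a minimal sketch, not taken from the paper; the function and array names are hypothetical), the probability forecast can be taken as the fraction of ensemble members that exceed the event threshold:

```python
import numpy as np

def ensemble_to_probability(ensemble, threshold):
    """Convert ensemble forecasts to probability forecasts of threshold exceedance.

    ensemble  : array of shape (n_forecasts, n_members) of ensemble values
    threshold : event threshold (the event occurs when the value exceeds it)
    Returns an array of forecast probabilities f in [0, 1].
    """
    ensemble = np.asarray(ensemble, dtype=float)
    # Fraction of members exceeding the threshold serves as the probability forecast
    return (ensemble > threshold).mean(axis=1)

# Example: three forecasts from a five-member ensemble of precipitation depth (mm)
ens = np.array([[0.0, 1.2, 0.0, 3.4, 0.1],
                [5.0, 7.1, 6.3, 2.2, 4.8],
                [0.0, 0.0, 0.0, 0.0, 0.2]])
print(ensemble_to_probability(ens, threshold=1.0))  # [0.4, 1.0, 0.0]
```

Other choices (e.g., calibrated member fractions) are possible; the simple member fraction is used here only to make the construction explicit.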
Using a sample of probability forecasts and the resulting (binary) event outcomes, the Brier score and Brier skill score (Brier 1950) are often computed to characterize the accuracy and skill of the forecast. To evaluate the uncertainty of these verification measures from sampling theory, assumptions must be made about the statistical properties of forecasts and observations. The diagnostic framework for forecast verification introduced by Murphy and Winkler (1992), known as the distributions-oriented (DO) approach, explicitly defines a stochastic process commonly assumed for verification. Specifically, forecasts and observations are treated as random variables. Each forecast–observation pair is assumed to be independent of all other pairs, and identically distributed. The relationship between forecasts and observations is defined by their joint distribution.
Using the assumptions of the DO approach and the notation described by Murphy (1997), we will evaluate the sampling uncertainty of the Brier score and Brier skill score. In particular, consider a forecasting system that produces a probability forecast for a dichotomous event. Let x be a Bernoulli random variable that takes on a value of 1 if the event occurs and 0 if it does not. Let f be a probability forecast of the occurrence of the event. That is, f is the forecast probability that the event occurs (i.e., that x = 1). The joint distribution p( f, x) describes the relationship between the forecasts and observations.
a. Distribution moments
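For reference in later sections (a brief restatement of the standard DO notation of Murphy 1997), the moments used below are the climatological event frequency and the mean forecast probability,

\[
\mu_x = E[x], \qquad \sigma_x^2 = \operatorname{var}(x) = \mu_x (1 - \mu_x), \qquad \mu_f = E[f],
\]

together with the conditional (noncentral) moments of the forecasts given the event outcome, whose sample estimators are given in appendix A.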
b. Accuracy measure
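Stated compactly in this notation (a brief restatement of the standard distributions-oriented definition, following Murphy 1997), the accuracy measure used throughout is the mean square error of the probability forecasts, that is, the expected Brier score:

\[
\mathrm{MSE}(f,x) = E\left[(f - x)^2\right].
\]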
c. Skill measure
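The corresponding skill measure is the mean square error skill score, which compares the accuracy of the forecasts with that of the climatological reference forecast μx, whose MSE equals the variance of the observations (a brief restatement in the notation above):

\[
\mathrm{SS}_{\mathrm{MSE}} = 1 - \frac{\mathrm{MSE}(f,x)}{\mathrm{MSE}(\mu_x,x)} = 1 - \frac{\mathrm{MSE}(f,x)}{\sigma_x^2}.
\]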
3. Sampling uncertainty
Verification measures of accuracy and skill are computed for a forecasting system from a sample of forecasts and observations. Let xi be the observation at time i. Let fi be the probability forecast of the event at time i. Assume that the forecast–observation pairs {fi, xi, i = 1, . . . , N} are a random sample drawn from the joint distribution p( f, x). The verification data sample will be used to estimate forecast accuracy and skill. In the following sections, we define sample estimators for the measures of forecast accuracy and skill, and derive expressions for their sampling uncertainty.
a. Brier score
Given the verification data sample, the sample estimator of the Brier score is the mean square error of the probability forecasts,

\[
\widehat{\mathrm{MSE}}(f,x) = \frac{1}{N} \sum_{i=1}^{N} (f_i - x_i)^2 .
\]

Because the forecast–observation pairs are assumed to be independent and identically distributed, each term (fi − xi)² has expectation MSE(f, x), and the sample estimator is unbiased.
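A minimal computational sketch of this estimator (array names are illustrative; f holds the forecast probabilities and x the binary outcomes):

```python
import numpy as np

def brier_score(f, x):
    """Sample Brier score (mean square error) of probability forecasts.

    f : array of forecast probabilities in [0, 1]
    x : array of binary outcomes (1 if the event occurred, 0 otherwise)
    """
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.mean((f - x) ** 2)

# Example: five forecast-observation pairs
f = np.array([0.9, 0.1, 0.7, 0.3, 0.5])
x = np.array([1, 0, 0, 0, 1])
print(brier_score(f, x))  # 0.17
```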
b. Sampling variance of the Brier score
Under the assumptions of the DO framework, the sample Brier score is the mean of N independent, identically distributed terms (fi − xi)². Its sampling variance is therefore

\[
\operatorname{var}\!\left[\widehat{\mathrm{MSE}}\right]
= \frac{1}{N}\operatorname{var}\!\left[(f-x)^2\right]
= \frac{1}{N}\left\{ E\!\left[(f-x)^4\right] - \mathrm{MSE}(f,x)^2 \right\},
\]

and the standard error of the Brier score is its square root. Expanding the expectations in terms of the moments of the joint distribution p(f, x) yields the analytical expression evaluated in section 5 [Eq. (20)].
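The following is a plug-in sketch of this result, with the population moments replaced by sample moments (section 5c notes that this approach is mathematically similar to the sample estimator of Ferro 2007); the function name is illustrative:

```python
import numpy as np

def brier_score_standard_error(f, x):
    """Plug-in standard error of the sample Brier score.

    Treats the squared errors d_i = (f_i - x_i)^2 as an iid sample,
    so var(mean(d)) is estimated by var(d) / N.
    """
    d = (np.asarray(f, dtype=float) - np.asarray(x, dtype=float)) ** 2
    n = d.size
    return np.sqrt(d.var(ddof=1) / n)

f = np.array([0.9, 0.1, 0.7, 0.3, 0.5])
x = np.array([1, 0, 0, 0, 1])
print(brier_score_standard_error(f, x))  # ~0.091
```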
c. Brier skill score
The sample estimator of the Brier skill score uses the sample climatology as the reference forecast,

\[
\widehat{\mathrm{SS}}_{\mathrm{MSE}} = 1 - \frac{\widehat{\mathrm{MSE}}(f,x)}{\hat{\sigma}_x^2},
\qquad
\hat{\sigma}_x^2 = \hat{\mu}_x\left(1 - \hat{\mu}_x\right),
\]

where μ̂x is the relative frequency of event occurrences in the verification data sample.
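A computational sketch (illustrative names; the reference is the sample climatology, as above):

```python
import numpy as np

def brier_skill_score(f, x):
    """Sample Brier skill score with the sample climatology as the reference.

    Assumes both outcomes occur in the sample (so the reference variance is nonzero).
    """
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    mse = np.mean((f - x) ** 2)        # sample Brier score
    mu_x = x.mean()                    # observed event frequency
    sigma2_x = mu_x * (1.0 - mu_x)     # Brier score of the climatological forecast
    return 1.0 - mse / sigma2_x

f = np.array([0.9, 0.1, 0.7, 0.3, 0.5])
x = np.array([1, 0, 0, 0, 1])
print(brier_skill_score(f, x))  # ~0.29
```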
d. Sampling variance of the Brier skill score
Because the sample skill score is a ratio of sample quantities, an exact expression for its sampling variance is not available. An approximate expression is obtained from a Taylor series expansion of the estimator about the true moments: the first-order terms yield the approximate sampling variance of the skill score [Eq. (36)], which involves the sampling variances of the sample Brier score and the sample climatological variance together with their covariance (appendix B gives the covariance term). Carrying the expansion to second order yields the approximate bias of the skill score estimator [Eq. (22)], which is examined in section 5b.
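The sketch below illustrates the propagation-of-error idea with a first-order (delta method) approximation evaluated from sample moments. It is offered as an illustration only, under the iid assumption, and is not a reproduction of Eq. (36):

```python
import numpy as np

def brier_skill_score_standard_error(f, x):
    """Approximate (first-order delta method) standard error of the sample
    Brier skill score SS = 1 - MSE / V, with V = mu_x (1 - mu_x).

    Illustrative sketch only; assumes iid forecast-observation pairs and
    that both outcomes occur in the sample.
    """
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    n = x.size
    d = (f - x) ** 2                      # squared errors
    mse = d.mean()
    mu_x = x.mean()
    v = mu_x * (1.0 - mu_x)               # sample climatological variance

    var_mse = d.var(ddof=1) / n           # variance of the sample Brier score
    var_mu = x.var(ddof=1) / n            # variance of the sample event frequency
    cov_d_mu = np.cov(d, x)[0, 1] / n     # covariance of the two sample means

    # V = g(mu_x) with g(p) = p(1 - p), so g'(mu_x) = 1 - 2 mu_x
    gprime = 1.0 - 2.0 * mu_x
    var_v = gprime**2 * var_mu
    cov_mse_v = gprime * cov_d_mu

    # First-order variance of SS = 1 - MSE / V
    var_ss = (var_mse / v**2
              + (mse**2 / v**4) * var_v
              - 2.0 * (mse / v**3) * cov_mse_v)
    return np.sqrt(var_ss)
```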
4. Forecasting examples
To evaluate the analytical expressions for the sampling uncertainties, we used Monte Carlo simulations to generate verification datasets for several synthetic forecasting examples. For each example, the joint distribution p( f, x) and the true values of the accuracy and skill measures are known. Ten thousand verification data samples are generated for each example, for sample sizes ranging from 50 to 1000 pairs. For each verification data sample, sample estimates of the accuracy and skill (the Brier score and Brier skill score) are computed. To determine the “true” standard error of the measures for comparison with the analytical expressions, we compute the standard deviation of the measures for the 10 000 verification data samples. Note that with a Monte Carlo sample of this size, the uncertainty of the true value is less than about 1.5% (at a 95% confidence level).
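A sketch of this Monte Carlo procedure is shown below. The joint distribution used here, with f drawn from a beta distribution and x | f Bernoulli(f) (a perfectly reliable forecast), is an assumption made purely for illustration; it is not the stochastic model of section 4, whose parameters υ, ω, and b are listed in Table 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_sample(n, a=1.0, b=9.0):
    """Draw one verification data sample of size n.

    Illustrative joint distribution p(f, x): f ~ Beta(a, b), x | f ~ Bernoulli(f).
    With a=1 and b=9, the event frequency is mu_x = a / (a + b) = 0.1.
    This is NOT the stochastic model used in section 4.
    """
    f = rng.beta(a, b, size=n)
    x = rng.binomial(1, f)
    return f, x

def brier_score(f, x):
    return np.mean((f - x) ** 2)

# "True" standard error of the Brier score from 10 000 synthetic samples
n_samples, n_pairs = 10_000, 100
scores = np.array([brier_score(*generate_sample(n_pairs)) for _ in range(n_samples)])
print(scores.mean(), scores.std(ddof=1))  # Monte Carlo mean and standard error
```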
Table 1 shows the model parameters for the cases presented in the following section. Also shown are relative measures of forecast quality (Murphy 1997) for each case. By design, the relative measures are the same for a given skill level within the unbiased or the conditionally biased forecast cases. By definition, the reliability (REL) of the forecast is 0 for a perfectly reliable forecast; the reliability is nonzero for the cases with conditional biases. Also, the resolution (RES), discrimination (DIS), and sharpness (SHP) of the forecast are all higher for a more skillful forecast. Finally, in the case of a conditionally biased forecast with b = 0.8, the highest achievable skill score is 0.6. For this high-skill case, the forecast must be perfectly sharp (the probability forecast f is either 0 or 1). Hence, the relative sharpness (SHP/σx2) for the high-skill case with conditionally biased forecasts achieves its maximum value of 1. Note that for this special case, the forecast f is a discrete random variable and is generated from a Bernoulli distribution with parameter μf .
5. Results
a. Standard errors
First we evaluate the analytical expressions for the standard errors of the Brier score and Brier skill score. True values for the sampling uncertainty are represented by the 10 000-sample estimates from the Monte Carlo simulation. These are compared to the analytical expression estimates made using the known (true) values for the moments.
Figure 1 shows the standard error estimates for the Brier score for a medium-skill forecast (SSMSE = 0.4), for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences and for both unbiased and conditionally biased forecasts.
For the Brier skill score, Fig. 2 shows the standard error estimates for a low-skill (SSMSE = 0.2) and a high-skill (SSMSE = 0.6) unbiased forecast, again for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences.
The impact of conditional forecast biases is illustrated in Fig. 3 for the case of medium-skill (SSMSE = 0.4) forecasts. As in Figs. 1 and 2, results are shown separately for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences.
b. Bias in the Brier skill score
Although the sample estimator of MSE is unbiased, the sample estimator of SSMSE is biased. Figure 4 shows the expected value of the sample skill score as a function of sample size for a medium-skill forecast (SSMSE = 0.4), for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences and for both unbiased and conditionally biased forecasts.
The second-order approximation produces a reasonable prediction of the bias for rare-event occurrences (see Fig. 4a): the curves follow the overall trend found in the Monte Carlo experiments, although the magnitude of the bias is underestimated. When the actual bias is smaller, as is the case for common-event occurrences (see Fig. 4b), the approximation is poor. In particular, the observed biases at small sample sizes are much larger in magnitude than predicted for the common-event occurrences.
c. Approximate confidence intervals
In practice, the analytical expressions would be used to estimate the uncertainty of the Brier score and Brier skill score computed with a verification dataset. Uncertainty estimates are obtained by replacing the moments in the standard error and bias equations in section 3 with moment estimates from a sample (see appendix A for traditional sample estimators of the moments). Note that for the standard error of the Brier score, this approach is mathematically similar to the sample estimator proposed by Ferro (2007) for ensemble forecasts. Using the resulting sample estimates of uncertainty, one can construct error bars, perform hypothesis testing, or construct confidence intervals for individual sample estimates. To evaluate the performance of the uncertainty expressions when sample moments are used, we examine confidence interval estimates for both MSE and SSMSE for verification data samples.
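A sketch of the interval construction (assuming, as in the results below, a normal approximation to the sampling distribution; the plug-in standard error is the one sketched in section 3b):

```python
import numpy as np
from scipy.stats import norm

def brier_score_confidence_interval(f, x, level=0.95):
    """Approximate normal-theory confidence interval for the Brier score."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    d = (f - x) ** 2
    mse = d.mean()
    se = np.sqrt(d.var(ddof=1) / d.size)   # plug-in standard error
    z = norm.ppf(0.5 * (1.0 + level))      # 1.96 for a 95% interval
    return mse - z * se, mse + z * se
```

An interval for the Brier skill score follows the same pattern, with the standard error taken from the approximation of section 3d (and, if desired, the estimate shifted by the approximate bias).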
In principle, confidence interval estimation requires knowledge of the sampling distributions for the sample Brier score and Brier skill score. In practice, approximate confidence intervals can be constructed by assuming that these sampling distributions are normal, with the standard errors estimated from the expressions in section 3.
Figure 5 shows the empirical sampling distribution for the sample Brier score for an unbiased forecast with medium skill (SSMSE = 0.4), for both a rare event (μx = 0.05) and a common event (μx = 0.25).
Figure 6 shows the sampling distribution for the sample Brier skill score for the same forecasting cases.


For the forecasts of the rare event, the 95% confidence intervals constructed using sample estimates contain the true values much less than 95% of the time for small sample sizes (Table 2). Given the underestimation of the standard error for the Brier skill score in these situations, and the skewness of the sampling distributions at small sample sizes, this degradation in coverage is to be expected.
6. Discussion
As the results of the Monte Carlo experiments show, forecasters can reliably use the standard error and confidence limit estimators presented here to characterize uncertainty in many forecasting situations. Still, significant deviations can occur for forecasts of rarely occurring events when sample sizes are small (roughly less than a few hundred forecast–observation pairs). Unfortunately, the need to characterize sampling uncertainty for rare (high impact) events is usually greatest precisely when the forecast–observation data sample is small. In these situations, standard error estimates for the Brier skill score, and confidence limits for both the Brier score and Brier skill score, underestimate the actual uncertainty.
On the other hand, the estimated standard errors and confidence limits, although deficient, do indicate the large uncertainties that exist for such cases. As an example, Fig. 7 shows the 95% confidence limits for SSMSE as a function of the sample size for a low-skill forecast of a rare event (μx = 0.05), computed using the approximate standard error estimators and the true model parameters. Note that for sample sizes up to about 300, one cannot reject the hypothesis that the true skill score is 0 for this case. For a sample size of 50, the estimated 95% confidence interval for SSMSE covers a range of [−0.26, 0.57]. Even though these estimates understate the true uncertainty range, the interval is still so large that one cannot conclude from such a small sample that the forecast of the rare event is skillful. Although it may be possible to devise corrections that improve uncertainty estimates for small sample sizes and rare events, the conclusions drawn in such cases would not be substantially different.
A second-order approximation of the bias for the SSMSE estimator was also derived, which could be applied to correct biases for sample estimates. However, the utility of such a correction is debatable. As suggested by Fig. 7, the magnitude of the bias is quite small compared to the sampling variability, especially for large sample sizes. As shown in Fig. 4, the approximation performed best for rare-event occurrences for small sample sizes, where the sampling uncertainty is too large to make statistical inferences. For common-event occurrences, the bias is much smaller, and the approximation underestimates the bias to such an extent that its use for bias correction would not substantially change the results.
Another important consideration in the use of the uncertainty estimators is their limit of applicability. Recall that the assumptions of the DO approach to forecast verification (Murphy 1997) were used to derive the estimators. One assumption is that forecast–observation pairs are identically distributed; a verification dataset is a random sample drawn from this unchanging (or stationary) joint distribution. Yet in many applications, a verification dataset pools together forecasts issued at different times of the year, and for different locations. In this case, the assumption of identically distributed forecast–observation pairs may be violated, because the joint distribution of forecasts and observations may depend on the season and location.
Hamill and Juras (2006) provide some insightful examples of how nonstationarity in the form of temporal and spatial variations in climatology affects verification measures. In particular, if the joint distribution varies with time or location, verification measures will be unreliable (biased) and misleading. Whenever such variations are a concern, Hamill and Juras (2006) recommend computing verification measures with subsamples where stationarity of the joint distribution can be reasonably assumed. Of course, the drawback to subdividing the verification data sample into smaller subsamples (to avoid such biases) is the inherent increase in sampling variability.



Another assumption of the DO approach is that each forecast–observation pair is independent of all others. The independence assumption is violated whenever forecasts or observations are significantly correlated in time or space. Temporal correlation for probability forecasts may be a concern for frequently updated event forecasts (e.g., daily probability of precipitation), but sample sizes also tend to be relatively large in this case. On the other hand, the spatial correlation for probability forecasts is often a concern for both meteorological and climate forecasts. If the forecasts or observations are stationary, but significantly correlated, the sample size is effectively reduced, and the estimators presented here would underestimate the true sampling uncertainty.
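As a rough illustration of this effect (not from the paper): if the squared forecast errors behave like a first-order autoregressive series with lag-1 autocorrelation ρ, the variance of their mean is inflated by approximately (1 + ρ)/(1 − ρ), so the effective sample size is roughly N(1 − ρ)/(1 + ρ). For example, with N = 500 pairs and ρ = 0.5, the effective sample size is about 167, and an iid-based standard error would understate the true value by a factor of roughly √3 ≈ 1.7.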
One clear advantage of the analytical estimators of sampling uncertainty is that they are easy to compute at the same time as verification measures. In contrast, with resampling approaches, a few hundred or thousand resampled datasets must be constructed, and verification measures computed for each case. Clearly, the added computational burden is of no consequence for any one verification dataset. However, the computational effort can be a significant obstacle when evaluating the suite of products for an ensemble forecasting system, where the ensemble forecasts are transformed into probability forecasts for multiple event thresholds (e.g., Bradley et al. 2004), and evaluated at many different locations and forecast times (e.g., Kruger et al. 2007). Still, if the underlying stochastic process for forecasts and observations assumed in the derivation of these estimators is not valid, resampling remains an attractive nonparametric alternative.
7. Summary and conclusions
The Brier score and Brier skill score are commonly used verification measures for probability forecasts. However, their sampling characteristics have not been explored in great detail. Using the joint distribution of forecasts and observations as defined in the distributions-oriented approach to forecast verification (Murphy and Winkler 1992), we derive analytical expressions for the sampling uncertainties of the Brier score and Brier skill score. The expressions depend only on the moments of the forecasts and observations, making sampling uncertainties easy to estimate at the same time as the scores themselves. The expressions can be used with verification datasets to estimate standard errors and biases, construct confidence intervals, or perform hypothesis testing (e.g., Hamill 1999) on the Brier score and Brier skill score.
Monte Carlo experiments using synthetic forecasting examples illustrate the performance of the expressions. In general, the estimates provide very reliable information on uncertainty. However, the quality of an estimate depends on both the sample size and the occurrence frequency of the forecast event. For the Brier skill score, the approximation underestimates the standard error at small sample sizes (a few hundred forecast–observation pairs or less) for infrequently occurring events. In contrast, the bias estimator for the Brier skill score only provides reasonable estimates when the bias is large, a situation that exists only for infrequently occurring events. Confidence interval estimates constructed for individual samples using a normal distribution approximation perform well except at small sample sizes, where the sampling distributions are skewed because of large sampling variances and the lower (upper) bound on the Brier score (Brier skill score).
Although this paper focuses on uncertainty estimation for measures of accuracy and skill, there are additional measures that describe other aspects of the forecast quality for probability forecasts (Murphy 1997; Wilks 2006). Bradley et al. (2003) showed that other distributions-oriented measures for probability forecasts can also be expressed as a function of the moments of the joint distribution. We are currently exploring the use of sampling theory to develop approximate uncertainty estimates for these measures as well.
Acknowledgments
This work was supported in part by National Oceanic and Atmospheric Administration (NOAA) Grant NA04NWS4620015, from the National Weather Service (NWS) Office of Hydrologic Development, and Grant NA16GP1569, from the Office of Global Programs as part of the GEWEX Americas Prediction Project (GAPP). We gratefully acknowledge this support. We also thank the two anonymous reviewers for their thoughtful comments and suggestions.
REFERENCES
Accadia, C., Mariani S. , Casaioli M. , Lavagnini A. , and Speranza A. , 2003: Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids. Wea. Forecasting, 18 , 918–932.
Accadia, C., Mariani S. , Casaioli M. , Lavagnini A. , and Speranza A. , 2005: Verification of precipitation forecasts from two limited-area models over Italy and comparison with ECMWF forecasts using a resampling technique. Wea. Forecasting, 20 , 276–300.
Benjamin, J., and Cornell C. , 1970: Probability, Statistics, and Decision for Civil Engineers. McGraw-Hill, 684 pp.
Bradley, A. A., Hashino T. , and Schwartz S. S. , 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18 , 903–917.
Bradley, A. A., Schwartz S. S. , and Hashino T. , 2004: Distributions-oriented verification of ensemble streamflow predictions. J. Hydrometeor., 5 , 532–545.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78 , 1–3.
Briggs, W., 2005: A general method of incorporating forecast cost and loss in value scores. Mon. Wea. Rev., 133 , 3393–3397.
Briggs, W., and Ruppert D. , 2005: Assessing the skill of yes/no forecasts. Biometrics, 61 , 799–807.
Carpenter, T., and Georgakakos K. , 2001: Assessment of Folsom Lake response to historical and potential future climate scenarios: 1. forecasting. J. Hydrol., 249 , 148–175.
Doblas-Reyes, F., Pavan V. , and Stephenson D. , 2003: The skill of multi-model seasonal forecasts of the wintertime North Atlantic oscillation. Climate Dyn., 21 , 501–514.
Ebert, E. E., Wilson L. J. , Brown B. G. , Nurmi P. , Brooks H. E. , Bally J. , and Jaeneke M. , 2004: Verification of nowcasts from the WWRP Sydney 2000 Forecast Demonstration Project. Wea. Forecasting, 19 , 73–96.
Efron, B., 1981: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika, 68 , 589–599.
Ferro, C. A., 2007: Comparing probabilistic forecasting systems with the Brier score. Wea. Forecasting, 22 , 1076–1088.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14 , 155–167.
Hamill, T. M., and Juras J. , 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? Quart. J. Roy. Meteor. Soc., 132 , 2905–2923.
Jolliffe, I., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22 , 637–650.
Jolliffe, I., and Stephenson D. , 2003: Introduction. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., John Wiley, 1–12.
Kane, T. L., and Brown B. G. , 2000: Confidence intervals for some verification measures—A survey of several methods. Preprints, 15th Conf. on Probability and Statistics in the Atmospheric Sciences, Asheville, NC, Amer. Meteor. Soc., 46–49.
Kenney, J. F., and Keeping E. S. , 1951: Mathematics of Statistics, Part 2. 2nd ed. Van Nostrand, 429 pp.
Kruger, A., Khandelwal S. , and Bradley A. A. , 2007: AHPSVER: A Web-based system for hydrologic forecast verification. Comput. Geosci., 33 , 739–748.
Krzysztofowicz, R., and Long D. , 1991: Beta likelihood models of probabilistic forecasts. Int. J. Forecasting, 7 , 47–55.
Mason, S. J., 2004: On using climatology as a reference strategy in the Brier and ranked probability skill scores. Mon. Wea. Rev., 132 , 1891–1895.
Mason, S. J., and Mimmack G. M. , 1992: The use of bootstrap confidence intervals for the correlation coefficient in climatology. Theor. Appl. Climatol., 45 , 229–233.
Mason, S. J., and Graham N. E. , 2002: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quart. J. Roy. Meteor. Soc., 128 , 2145–2166.
Murphy, A. H., 1997: Forecast verification. Economic Value of Weather and Climate Forecasts, R. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.
Murphy, A. H., and Winkler R. L. , 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7 , 435–455.
Murphy, A. H., and Wilks D. S. , 1998: A case study of the use of statistical models in forecast verification: Precipitation probability forecasts. Wea. Forecasting, 13 , 795–810.
Schwartz, S. S., 1992: Verifying probabilistic water supply outlooks for the Potomac River basin. Preprints, 28th Conf. and Symp. on Managing Water Resources during Global Change, Reno, NV, American Water Resources Association, 153–161.
Seaman, R., Mason I. , and Woodcock F. , 1996: Confidence intervals for some performance measures of yes–no forecasts. Aust. Meteor. Mag., 45 , 49–53.
Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15 , 221–232.
Thornes, J. E., and Stephenson D. B. , 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8 , 307–314.
Toth, Z., Talagrand O. , Candille G. , and Zhu Y. , 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., John Wiley, 137–163.
Welles, E., Sorooshian S. , Carter G. , and Olsen B. , 2007: Hydrologic verification: A call for action and collaboration. Bull. Amer. Meteor. Soc., 88 , 503–511.
Wilks, D. S., 1996: Statistical significance of long-range “optimal climate normal” temperature and precipitation forecasts. J. Climate, 9 , 827–839.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Vol. 91, Academic Press, 648 pp.
Zhang, H., and Casey T. , 2000: Verification of categorical probability forecasts. Wea. Forecasting, 15 , 80–89.
Zhu, Y. J., 2005: Ensemble forecast: A new approach to uncertainty and predictability. Adv. Atmos. Sci., 22 , 781–788.
APPENDIX A
Sample Moment Estimators
To estimate the conditional means and higher-order noncentral moments of the forecasts, the verification data sample is partitioned into two subsamples. Let {f0j, j = 1, . . . , N0} be the subsample of forecasts for which the event does not occur (x = 0), and let {f1k, k = 1, . . . , N1} be the subsample of forecasts for which the event occurs (x = 1). The conditional means μ̂f|x=0 and μ̂f|x=1, and the higher-order noncentral moments μ̂′(m)f|x=0 and μ̂′(m)f|x=1, are then estimated from these subsamples in the same fashion as the unconditional moments above.
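A computational sketch of this partitioning (names are illustrative; m is the moment order):

```python
import numpy as np

def conditional_noncentral_moment(f, x, outcome, m=1):
    """Sample noncentral moment of order m of the forecasts, conditional on x.

    f, x    : forecast probabilities and binary outcomes
    outcome : 0 or 1, selecting the subsample where x == outcome
    m       : moment order (m = 1 gives the conditional mean)
    """
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=int)
    subsample = f[x == outcome]
    return np.mean(subsample ** m)

f = np.array([0.9, 0.1, 0.7, 0.3, 0.5])
x = np.array([1, 0, 0, 0, 1])
print(conditional_noncentral_moment(f, x, outcome=1, m=1))  # mean of f given x = 1: 0.7
print(conditional_noncentral_moment(f, x, outcome=0, m=2))  # second moment of f given x = 0: ~0.197
```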
Note that sample estimates of the accuracy and skill measures shown in sections 2b and 2c, or the bias and standard errors shown in section 3, can be obtained by replacing the moments with the sample moment estimates shown above.
APPENDIX B
Covariance Term


Fig. 1. Standard error of the Brier score for a medium-skill forecast (SSMSE = 0.4). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (20)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for both an unbiased and a conditionally biased forecast.
Fig. 2. Standard error of the Brier skill score for (a) a low-skill forecast (SSMSE = 0.2) and (b) a high-skill forecast (SSMSE = 0.6). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for rare-event (μx = 0.05) and common-event (μx = 0.25) occurrences, for an unbiased forecast.
Fig. 3. Standard error of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true standard errors derived from the Monte Carlo simulations. The curves are the estimates based on the analytical expression [Eq. (36)]. Results are shown for both an unbiased and a conditionally biased forecast.
Fig. 4. Bias of the Brier skill score for a medium-skill forecast (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The symbols are the true expected values of the Brier skill score derived from the Monte Carlo simulations. The curves are the estimates made using the analytical expression for bias [Eq. (22)]. The expected values are shown for both an unbiased and a conditionally biased forecast.
Fig. 5. Sampling distribution of the Brier score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.
Fig. 6. Sampling distribution of the Brier skill score for an unbiased forecast with medium skill (SSMSE = 0.4) for (a) a rare event (μx = 0.05) and (b) a common event (μx = 0.25). The empirical distribution is based on the 10 000-sample estimates from the Monte Carlo simulation. A normal distribution would plot as a straight line.
Fig. 7. The 95% confidence intervals and expected Brier skill score for an unbiased forecast with low skill (SSMSE = 0.2) for a rare event (μx = 0.05). Lines show the results based on analytical expressions evaluated with the true parameters. Symbols show the empirical confidence intervals from the Monte Carlo simulations.
Table 1. Relative measures of forecast quality for Monte Carlo simulations of forecast–observation pairs. The parameters of the stochastic model (defined in section 4) are υ, ω, and b. The forecast quality measures are skill (SSMSE), resolution (RES), reliability (REL), discrimination (DIS), type-2 conditional bias (B2), and sharpness (SHP) (Murphy 1997).
Table 2. Percentage of samples where the true MSE or SSMSE falls within the 95% confidence interval constructed for the sample. Results are shown for forecasting examples without conditional forecast biases.