## 1. Introduction

Ensemble prediction systems are widely used to make weather, climate, and hydrologic forecasts. Often forecasters need to make comparative assessments of ensemble forecast quality. For instance, they may wish to determine whether system enhancements improve the forecasts, or compare the performance of alternative systems. Even with a single forecasting system, forecasters may need to assess how the quality of ensemble forecasts depends on the time of year (or time of day) when the forecasts are issued, or how the quality varies from one location to another. To accomplish these tasks, forecasters need summary verification measures suitable for comparison. Suitable measures are ones where differences can be attributed to differences in the forecast system performance, and not differences in the nature of the forecast variable itself (e.g., differences in its climatology at different times or locations).

Finding suitable summary measures for ensemble prediction systems is challenging, as the nature of the forecast itself is complex. Transforming ensemble forecasts into single value (deterministic) forecasts (e.g., ensemble mean), or into probability forecasts for a discrete event (e.g., outcome above some threshold), are common approaches used in ensemble forecast verification (Hamill and Colucci 1997; Markus et al. 1997; Buizza and Palmer 1998; Hamill and Colucci 1998; Atger 1999; Carpenter and Georgakakos 2001; Ebert 2001; Hou et al. 2001; Kumar et al. 2001; Mullen and Buizza 2001; Grimit and Mass 2002; Cong et al. 2003; Franz et al. 2003; Kirtman 2003; Shamir et al. 2006, among others). To examine the performance of ensemble forecasts as probabilistic forecasts of a continuous variable, measures like the continuous ranked probability score (CRPS) are used (Matheson and Winkler 1976; Unger 1985). Decomposition of scores into measures of reliability, resolution, and discrimination (Murphy and Winkler 1992; Murphy 1997; Wilks 2000; Hersbach 2000; Candille and Talagrand 2005, among others) is helpful for diagnosing how these aspects affect forecast accuracy.

In this paper, we examine summary verification measures for assessing the overall performance of ensemble forecasts as probabilistic forecasts. We begin by introducing a theoretical framework that generalizes common verification approaches, describing forecast quality as a continuous function of the forecast variable (or its climatological probability). This description naturally leads to a set of measures—some based on traditional measures and others introduced here—that summarize aspects of ensemble forecast quality and can be interpreted as measures of the “geometric shape” of forecast quality functions.

## 2. Forecast verification framework

Forecast verification is often viewed in terms of data samples and the sample statistics (verification measures) computed from them. But the verification process can also be described using basic probability theory. As an example, probability theory is central to the distribution-oriented (DO) approach to verification (Murphy and Winkler 1987; Murphy 1997). Forecasts and observations are treated as random variables. Verification measures are defined based on the joint distribution of the forecasts and observations. We will employ a similar approach throughout this paper to define aspects of forecast quality.

Consider a forecast of a continuous variable *Y*. The forecast issued represents the probability distribution of *Y*, conditioned on the state of the system *ξ* (e.g., initial conditions) at the time of the forecast. This *probability distribution forecast* can be defined by a conditional cumulative distribution function *F*(*y*|*ξ*) as

$$F(y\,|\,\xi) = P\{Y \le y \,|\, \xi\}, \qquad (1)$$

where *P*{*Y* ≤ *y*|*ξ*} is the probability that the forecast variable *Y* is less than or equal to some value *y*. Figure 1 illustrates such a probability distribution forecast from an ensemble prediction system.

Because of the complex nature of a probability distribution forecast, it is commonly transformed into a simpler *probability forecast* of a discrete event for verification. Consider the discrete event {*Y* ≤ *y*_{p}}, defined by the threshold value *y*_{p}. Then *F*(*y*_{p}|*ξ*) is a random variable that represents the forecast probability that the event occurs. Figure 1 illustrates how *F*(*y*_{p}|*ξ*) is defined by the ensemble forecast for the specific threshold value *y*_{p}.

The observed outcome of the event is also defined in terms of the forecast variable *Y*. Let *X*(*y*_{p}) be a random variable that denotes the observation of the discrete event, defined as

$$X(y_p) = \begin{cases} 1, & Y \le y_p \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

In other words, *X*(*y*_{p}) is 1 if the event {*Y* ≤ *y*_{p}} occurs and 0 if it does not.

We use the notation *y*_{p} for the threshold because later it will become convenient to replace the threshold with the climatological probability of the event it defines. The climatological probability *p* is defined by the *unconditional probability* of the occurrence of the event {*Y* ≤ *y*_{p}}:

$$p = P\{Y \le y_p\}. \qquad (3)$$

By definition, the climatological probability *p* is equivalent to the expected value of the binary observation *X*(*y*_{p}):

$$p = E[X(y_p)]. \qquad (4)$$

Since *X*(*y*_{p}) is a Bernoulli random variable, its variance is simply

$$\sigma_x^2(y_p) = p(1 - p). \qquad (5)$$

Although the probability forecasts and observations are defined above for a nonexceedance event, *F*(*y*_{p}|*ξ*) and *X*(*y*_{p}) could have been defined instead for an exceedance event {*Y* > *y*_{p}}. The climatological probability *p* would then represent the unconditional probability of the exceedance event.

### a. Forecast quality measures

A traditional measure of the accuracy of probability forecasts is the mean square error (MSE), also known as the Brier score. The MSE of the probability forecasts is defined for the threshold *y*_{p} as

$$\mathrm{MSE}(y_p) = E\{[F(y_p\,|\,\xi) - X(y_p)]^2\}. \qquad (6)$$

A common measure of skill, or accuracy relative to a reference forecast, is the MSE (or Brier) skill score (SS). Using climatology as a reference forecast, SS for the threshold *y*_{p} is defined as

$$\mathrm{SS}(y_p) = 1 - \frac{\mathrm{MSE}(y_p)}{\sigma_x^2(y_p)}, \qquad (7)$$

since the MSE of a constant climatological probability forecast *p* is the variance *σ*_{x}^{2}(*y*_{p}). In practice, these and other verification measures are *estimated* based on a random sample of probability forecasts and observations. The appendix illustrates how a verification dataset of ensemble forecasts and continuous observations may be used to create a sample of probability forecasts and observations for a discrete event.
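Estimating these two measures from a verification sample is straightforward; a short sketch with made-up forecast-observation pairs (not the paper's data), using the sample climatology as the reference:

```python
import numpy as np

def brier_skill(prob_forecasts, binary_obs):
    """Estimate the MSE (Brier score) and the skill score relative to a
    climatology reference from a sample of probability forecasts and
    binary event observations."""
    f = np.asarray(prob_forecasts, dtype=float)
    x = np.asarray(binary_obs, dtype=float)
    mse = np.mean((f - x) ** 2)        # Brier score
    p = np.mean(x)                     # sample climatological probability
    clim_var = p * (1.0 - p)           # MSE of the climatology forecast
    ss = 1.0 - mse / clim_var          # skill score
    return mse, ss

# Ten forecast occasions: forecast probabilities of the event and outcomes
f = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.3, 0.9, 0.2, 0.8]
x = [1,   1,   0,   1,   0,   0,   0,   1,   1,   0]
mse, ss = brier_skill(f, x)
print(round(mse, 3), round(ss, 3))
```

A perfect categorical forecast gives `mse = 0` and `ss = 1`; a forecast no better than climatology gives `ss = 0`.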

### b. Generalization for probability distribution forecasts

From the definition of *F*(*y*|*ξ*) in Eq. (1), it follows that a probability distribution forecast contains a probability forecast for the event {*Y* ≤ *y*}, for any value of the threshold *y*. A generalization of the above verification approach is to define and evaluate the forecast quality of probability distribution forecasts as a *continuous function* of threshold value. Let *Q*(*y*_{p}) denote a forecast quality measure (e.g., skill) for a specific threshold *y*_{p}. Then the function *Q*(*y*) characterizes the forecast quality measure as a continuous function of the threshold *y*. If instead we index the threshold by the climatological probability *p* of the event defined by *y*_{p}, then the function *Q*(*p*) characterizes the forecast quality measure as a function of the event occurrence frequency *p*. Such an approach has been used to describe ensemble forecast quality by Bradley et al. (2004) and Gneiting et al. (2007).

Figure 2 illustrates the concept of a *forecast quality function* for a set of ensemble drought forecasts for the Des Moines River. The figure shows the skill functions SS(*y*) and SS(*p*) for the ensemble forecasts. This generalization provides an important insight on the nature of ensemble forecasts. A single value of skill cannot completely characterize the relative accuracy of ensemble forecasts; forecast skill varies depending on the threshold value. In this example, probability forecasts for low flow volume thresholds are quite skillful; however, for the same set of ensemble forecasts, the probability forecasts for high flow volume thresholds have virtually no skill.

Even though Fig. 2 shows the forecast skill for nonexceedance events, the forecast skill for exceedance events is readily deduced. For a given threshold *y*_{p}, MSE and SS are the same for exceedance and nonexceedance events. Therefore, a plot of SS(*y*) for exceedance events is the same as shown in Fig. 2a. However, a plot of SS(*p*) for exceedance events is the mirror image of that shown in Fig. 2b, since the corresponding probability index for an exceedance event is 1 − *p*.

Forecast quality functions can also be derived for components of a skill score decomposition. The MSE skill score can be decomposed as

$$\mathrm{SS} = \rho_{fx}^2 - \left(\rho_{fx} - \frac{\sigma_f}{\sigma_x}\right)^2 - \left(\frac{\mu_f - \mu_x}{\sigma_x}\right)^2, \qquad (8)$$

where *ρ*_{fx} is the correlation of the forecasts and observations; *σ*_{f} and *σ*_{x} are the standard deviations of the forecasts and the observations, respectively; and *μ*_{f} and *μ*_{x} are the means of the forecasts and the observations, respectively. The first term of the decomposition (right-hand side) is known as the potential skill, and is a measure of the resolution of the forecasts (i.e., the skill if there were no biases). The second term is known as the slope reliability, and is a measure of the conditional bias. The third term is the standardized mean error, a measure of the unconditional bias.

Applying this decomposition to the skill function SS(*p*), we can also create functions that describe each component of the decomposition:

$$\mathrm{SS}(p) = \mathrm{PS}(p) - \mathrm{CB}(p) - \mathrm{UB}(p), \qquad (9)$$

where PS(*p*) represents the potential skill function [first term in Eq. (8)], CB(*p*) represents the conditional bias function (second term), and UB(*p*) represents the unconditional bias function (third term). Likewise, the decomposition of the skill function can also be carried out for SS(*y*). Although we will use this particular skill score decomposition as an example in the remainder of the paper, the same approach can be used with other decompositions (see Murphy 1997).
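As a numerical check, the decomposition terms can be estimated from sample moments; with population (biased) estimators, the identity between the skill score and the three terms holds exactly. A sketch with hypothetical data (not the paper's forecasts):

```python
import numpy as np

def skill_decomposition(f, x):
    """Estimate the three decomposition terms of the MSE skill score:
    potential skill, slope reliability, and standardized mean error."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    rho = np.corrcoef(f, x)[0, 1]      # forecast-observation correlation
    sf, sx = f.std(), x.std()          # population standard deviations
    mf, mx = f.mean(), x.mean()        # means
    ps = rho ** 2                      # potential skill (resolution)
    cb = (rho - sf / sx) ** 2          # slope reliability (conditional bias)
    ub = ((mf - mx) / sx) ** 2         # standardized mean error (unconditional bias)
    return ps, cb, ub

f = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.3, 0.9, 0.2, 0.8]
x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
ps, cb, ub = skill_decomposition(f, x)
mse = np.mean((np.asarray(f) - np.asarray(x)) ** 2)
ss = 1.0 - mse / np.var(x)             # skill score with sample climatology
print(np.isclose(ss, ps - cb - ub))    # the decomposition identity holds
```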

In principle, any probability forecast verification measure could be represented as a continuous function of the threshold *y* or its climatological probability *p*. However, some measures are not as well suited for graphical comparison (as in Fig. 2). For instance, the magnitude of MSE depends on the event occurrence frequency *p*. For the most part, a plot of the function MSE(*y*) or MSE(*p*) illustrates this dependency (see Bradley et al. 2004), and not the differences in forecast quality at different thresholds. A better way to assess the forecast quality as a continuous function is with a *relative measure*. Skill measures, and their decompositions, make better candidates for graphical comparison. We will focus exclusively on these relative measures for the remainder of this paper.

## 3. Summary measures of forecast quality functions

Forecast quality functions provide a very detailed description of the quality of probability distribution forecasts for a single forecast variable. But there is still a need for simpler measures that can summarize information contained in these functions, to allow forecasters to compare system performance for different locations or lead times, or even for different forecasting systems. In this section, we look at some common measures used in ensemble forecast verification to see what information they summarize, and derive additional measures that can summarize other properties of forecast quality functions.

### a. Average forecast quality

Expressed in terms of expectations, the ranked probability score (RPS) for multicategory probability forecasts is equivalent to the sum of the MSEs of the probability forecasts at the category thresholds:

$$\mathrm{RPS} = \sum_{i=1}^{k} \mathrm{MSE}(y_i), \qquad (10)$$

where *y*_{i} is a threshold, and *k* is the number of thresholds defining the *k* + 1 categories. Note that the ranked probability skill score (RPSS), using climatology as the reference forecast, is then

$$\mathrm{RPSS} = 1 - \frac{\sum_{i=1}^{k} \mathrm{MSE}(y_i)}{\sum_{i=1}^{k} \sigma_x^2(y_i)}. \qquad (11)$$

To write the RPSS in terms of the skill function SS(*y*), we solve for the MSE in Eq. (7), and by substitution into the expression above:

$$\mathrm{RPSS} = \frac{\sum_{i=1}^{k} \sigma_x^2(y_i)\,\mathrm{SS}(y_i)}{\sum_{i=1}^{k} \sigma_x^2(y_i)}. \qquad (12)$$

Gathering the terms involving *σ*_{x}^{2}(*y*_{i}), which do not depend on forecasts, we define a function *w* as

$$w(y_i) = \frac{\sigma_x^2(y_i)}{\sum_{j=1}^{k} \sigma_x^2(y_j)}. \qquad (13)$$

The RPSS then simplifies to

$$\mathrm{RPSS} = \sum_{i=1}^{k} w(y_i)\,\mathrm{SS}(y_i). \qquad (14)$$

The interpretation of the RPSS is very intuitive when viewed in this form; it is a measure of the *weighted-average skill score* for probability forecasts defined by the discrete thresholds *y*_{i}. The weight applied to each threshold is related to the variance *σ*_{x}^{2}(*y*_{i}), which depends only on the climatological probability of the event defined by *y*_{i} [see Eq. (5)].

The continuous ranked probability score (CRPS) is analogous, with the sum replaced by an integral over the continuous range of thresholds:

$$\mathrm{CRPS} = \int_{-\infty}^{\infty} \mathrm{MSE}(y)\,dy, \qquad (15)$$

and the continuous ranked probability skill score (CRPSS), with climatology as the reference, is

$$\mathrm{CRPSS} = 1 - \frac{\int_{-\infty}^{\infty} \mathrm{MSE}(y)\,dy}{\int_{-\infty}^{\infty} \sigma_x^2(y)\,dy}. \qquad (16)$$

To write the CRPSS in terms of the skill function SS(*y*), we follow a derivation similar to that shown above for the RPSS. The result is that the CRPSS is mathematically equivalent to

$$\mathrm{CRPSS} = \int_{-\infty}^{\infty} w(y)\,\mathrm{SS}(y)\,dy, \qquad (17)$$

where the weight function is

$$w(y) = \frac{\sigma_x^2(y)}{\int_{-\infty}^{\infty} \sigma_x^2(y')\,dy'}. \qquad (18)$$

Therefore, CRPSS is a summary measure representing the *weighted-average skill score* over the continuous range of outcomes *y*.

Figure 2a illustrates this interpretation of summary measures of ensemble forecast skill for drought forecasts for the Des Moines River. The dashed horizontal lines indicate the computed RPSS and CRPSS for the set of ensemble forecasts. In this example, the RPSS is based on four categories defined by the flow quartiles, which are shown as dashed vertical lines. Hence, the RPSS is simply the weighted average of the skill function values at these three thresholds. In contrast, the CRPSS is the weighted-average skill over the entire skill function.
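This weighted-average interpretation is easy to confirm numerically: the RPSS computed from its usual definition (one minus the ratio of summed MSEs to summed climatological variances) matches the variance-weighted average of the per-threshold skill scores. A sketch with synthetic ensemble data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 200, 25                               # forecast occasions, ensemble size

# Synthetic verification data: ensembles share a predictable signal with obs
signal = rng.normal(size=N)
ens = signal[:, None] + rng.normal(size=(N, M))
obs = signal + rng.normal(size=N)
thresholds = np.quantile(obs, [0.25, 0.5, 0.75])   # quartiles -> 4 categories

F = np.stack([(ens <= y).mean(axis=1) for y in thresholds], axis=1)   # N x 3
X = np.stack([(obs <= y).astype(float) for y in thresholds], axis=1)  # N x 3

mse = ((F - X) ** 2).mean(axis=0)            # MSE at each threshold
p = X.mean(axis=0)                           # climatological probabilities
var = p * (1 - p)                            # variance of the binary observation

rpss_direct = 1.0 - mse.sum() / var.sum()    # RPSS from its definition
ss = 1.0 - mse / var                         # skill score at each threshold
w = var / var.sum()                          # variance-based weights
rpss_weighted = (w * ss).sum()               # weighted-average skill
print(np.isclose(rpss_direct, rpss_weighted))
```

The two quantities agree by construction for any dataset; the identity is algebraic, not statistical.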

This weighted-average interpretation extends to the skill score decomposition. If *Q*(*y*) denotes the MSE skill function or a component of a skill score decomposition, then its weighted average over the thresholds is

$$\overline{Q} = \int_{-\infty}^{\infty} w(y)\,Q(y)\,dy, \qquad (19)$$

which yields weighted-average measures of potential skill, conditional bias, and unconditional bias ($\overline{\mathrm{PS}}$, $\overline{\mathrm{CB}}$, and $\overline{\mathrm{UB}}$) that combine to give the weighted-average skill: $\overline{\mathrm{SS}} = \overline{\mathrm{PS}} - \overline{\mathrm{CB}} - \overline{\mathrm{UB}}$.

### b. Summary measures using probability thresholds

As seen in section 3a, the CRPSS can be interpreted as a summary measure of a skill function indexed using the threshold value *y*. Here we derive an analogous measure for a skill function indexed using probability *p*.

For the skill function indexed by probability, SS(*p*), an integrated measure analogous to the CRPSS can be defined (assuming a continuous climatological distribution for the forecast variable *Y*). Integrating over the probability thresholds, the weighted-average skill is

$$\overline{\mathrm{SS}} = 1 - \frac{\int_0^1 \mathrm{MSE}(p)\,dp}{\int_0^1 \sigma_x^2(p)\,dp}, \qquad (20)$$

where *σ*_{x}^{2}(*p*) = *p*(1 − *p*) is the climatological variance for the threshold with nonexceedance probability *p*. Substituting for the MSE using Eq. (7) yields

$$\overline{\mathrm{SS}} = \frac{\int_0^1 \sigma_x^2(p)\,\mathrm{SS}(p)\,dp}{\int_0^1 \sigma_x^2(p)\,dp} \qquad (21)$$

or, equivalently,

$$\overline{\mathrm{SS}} = \int_0^1 w(p)\,\mathrm{SS}(p)\,dp, \qquad (22)$$

where the weight function is

$$w(p) = \frac{\sigma_x^2(p)}{\int_0^1 \sigma_x^2(p')\,dp'}. \qquad (23)$$

Hence $\overline{\mathrm{SS}}$ is a measure of the *weighted-average skill score* over the continuous range of probability thresholds *p*. Figure 2b illustrates this interpretation for the drought forecasts for the Des Moines River. Note that $\overline{\mathrm{SS}}$ is not in general equal to the CRPSS; changing the index from *y* to *p* alters the relative weighting of the thresholds. More generally, if *Q*(*p*) denotes the MSE skill function or a component from a skill score decomposition, then its weighted average is

$$\overline{Q} = \int_0^1 w(p)\,Q(p)\,dp. \qquad (24)$$

Using *p* as an index, the denominator of *w*(*p*) [in Eq. (23)] is a constant ⅙, so *w*(*p*) simplifies for this case to

$$w(p) = 6p(1 - p). \qquad (25)$$

Figure 3 shows this weight function. Probability thresholds near 0.5 have the largest weight. The weight approaches 0 as the probability threshold approaches 0 or 1. The weight functions *w*(*y*) in Eqs. (13) and (18) have similar properties; their numerator is equal to *p*(1 − *p*) for the threshold *y*_{p} [see Eq. (5)].

Clearly, this weighting of forecast quality functions is not arbitrary; it arises from the mathematical definitions of the RPSS, the CRPSS, and their generalizations. The same weighting is implicit in the traditional scores: the magnitude of the MSE scales with the climatological variance *p*(1 − *p*), so just like the weight function in Fig. 3, rare events (where *p* approaches 0 or 1) naturally contribute less to the sum or integral than commonly occurring events (where *p* is near 0.5). Our formulation replaces MSE with SS, a relative measure that eliminates this scale dependency, so the weighting now appears explicitly in the form of the weight function.
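The probability-indexed weight function is easy to verify numerically: the quadratic 6*p*(1 − *p*) integrates to one over [0, 1] and peaks at *p* = 0.5 (a sketch using a hand-rolled trapezoidal rule):

```python
import numpy as np

def w(p):
    """Probability-threshold weight function: w(p) = 6 p (1 - p)."""
    return 6.0 * p * (1.0 - p)

p = np.linspace(0.0, 1.0, 100001)
dp = p[1] - p[0]
# trapezoidal rule: average adjacent values, sum, multiply by spacing
total = ((w(p)[:-1] + w(p)[1:]) / 2.0).sum() * dp
print(round(total, 6))                        # -> 1.0 (normalized weight)
print(w(0.5), round(w(0.05), 3), round(w(0.95), 3))
```

Thresholds near the median carry the most weight (*w* = 1.5 at *p* = 0.5), while rare events near *p* = 0 or 1 carry almost none.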

### c. Higher-order summary measures

The weighted-average skill summarizes an extremely important property of the skill functions shown in Fig. 2. But other properties of the functions are also of interest. For instance, the skill function SS(*p*) is not constant; it exhibits systematic departures from the weighted-average skill. Probability forecasts for low flow volume thresholds have high skill, whereas those for high flow volume thresholds have very little skill. This information is significant in this drought forecasting example, as a forecaster would prefer high skill probability forecasts of unusually low flow conditions (thresholds corresponding to low flow volumes). Although the weighted average describes the overall magnitude of a forecast quality function, it provides no information on how the quality is distributed over the range of thresholds.

One way to derive summary measures of the variation of *Q*(*p*) is to use a geometric analogy. If one considers the weighted forecast quality *w*(*p*)*Q*(*p*) to be like a “mass distribution,” denoted here as *M*(*p*), then the weighted-average forecast quality [Eq. (24)] is equivalent to

$$\overline{Q} = \int_0^1 M(p)\,dp. \qquad (26)$$

In other words, a *geometric interpretation* of $\overline{Q}$ is that it is the total mass of the distribution *M*(*p*). Two higher-order summary measures then follow from the analogy. The first is the center of mass (centroid) of the distribution:

$$p_{\mathrm{cm}} = \frac{\int_0^1 p\,M(p)\,dp}{\int_0^1 M(p)\,dp}, \qquad (27)$$

which indicates where along the probability axis the forecast quality is concentrated. The second is based on the moment of inertia of *M*(*p*) about its centroid, expressed as a radius of gyration:

$$r_g = \left[\frac{\int_0^1 (p - p_{\mathrm{cm}})^2\,M(p)\,dp}{\int_0^1 M(p)\,dp}\right]^{1/2}, \qquad (28)$$

which indicates how widely the mass is distributed about the center of mass.

The geometric analogy requires a nonnegative mass distribution, but a skill function can be negative at thresholds where the probability forecasts are less accurate than climatology. To handle such cases, we define a transformed skill function SS_{0}(*p*) as

$$\mathrm{SS}_0(p) = \max[\mathrm{SS}(p),\,0]. \qquad (29)$$

In essence, SS_{0}(*p*) shows the quality of the ensemble forecasts only at thresholds where the probability forecasts are skillful (more accurate than climatology forecasts). One can then use the transformed skill function SS_{0}(*p*) as the function *Q*(*p*) to find the average skill $\overline{\mathrm{SS}}_0$, its center of mass, and its radius of gyration. One limitation of using SS_{0}(*p*) is that the weighted averages of the MSE decomposition terms [like those shown in Eq. (19)] combine to give the weighted-average skill $\overline{\mathrm{SS}}$, not $\overline{\mathrm{SS}}_0$.
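The geometric measures can be approximated by simple numerical integration; a sketch with a made-up quality function (not the paper's forecasts), using the midpoint rule on a fine probability grid:

```python
import numpy as np

def geometric_summary(Q, p):
    """Total mass, center of mass, and radius of gyration of the mass
    distribution M(p) = w(p) Q(p), with w(p) = 6 p (1 - p)."""
    w = 6.0 * p * (1.0 - p)                  # probability-threshold weight
    M = w * Q                                # mass distribution
    dp = p[1] - p[0]
    mass = np.sum(M) * dp                    # weighted-average quality
    p_cm = np.sum(p * M) * dp / mass         # center of mass
    r_g = np.sqrt(np.sum((p - p_cm) ** 2 * M) * dp / mass)  # radius of gyration
    return mass, p_cm, r_g

p = np.linspace(0.0005, 0.9995, 1000)        # midpoints of 1000 probability bins
Q_const = np.ones_like(p)                    # benchmark: constant quality
Q_low = 1.0 - p                              # quality concentrated at low thresholds

for Q in (Q_const, Q_low):
    mass, p_cm, r_g = geometric_summary(Q, p)
    print(round(mass, 3), round(p_cm, 3), round(r_g, 3))
```

For the constant benchmark the centroid sits at 0.5; for the declining function the centroid shifts toward the low-probability thresholds where the quality is concentrated.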

## 4. Examples

The three derived measures (the weighted average, the center of mass, and the radius of gyration) are illustrated below with two examples: one using hypothetical forecasts, and one using an operational ensemble forecast system.

### a. Hypothetical forecasts with the same average skill

The first example compares the skill functions for three hypothetical probabilistic forecasting systems. Figure 5 shows the skill functions for the three cases. By design, all three have the same weighted-average skill, but the skill is distributed differently over the range of probability thresholds.

Using the three geometric summary measures, we can produce graphical depictions of key features of the three skill functions. The centroids (symbols) and shape (bars) show the differences in where skill is concentrated. For the constant skill function, the center of mass is at 0.5 without a bar, since its spread about the centroid matches the benchmark case; for the other two cases, the bars show how the mass is distributed relative to that benchmark.

The conclusion is that even when alternative forecast systems are equivalent in terms of their average skill (e.g., CRPSS), the concentration of skill as shown by their skill functions can be very different. The additional summary measures provide a simple way of quantifying these differences in the ensemble forecasts. In this hypothetical example the measures are readily visualized as differences in a function’s shape, since the shapes chosen are very simple. As will be seen in the next example, where function shapes are much more complex, the geometric meaning of the measures can still allow one to visualize (in an approximate way) the overall magnitude of a function, and the ways it differs from the benchmark case (a constant function).

### b. Forecast system verification problem

The second example illustrates how a forecast system manager might utilize the summary measures for diagnostic verification of an ensemble prediction system. The example utilizes retrospective ensemble streamflow forecasts generated from the operational system at the National Weather Service (NWS) North Central River Forecast Center (NCRFC) for three locations along the Des Moines River: in the headwaters at Jackson (JCKM5), upstream of a flood control reservoir at Stratford (STRI4), and downstream of the reservoir at Des Moines (DESI4). The forecast variable is the minimum 7-day flow volume over a 90-day time horizon, used to anticipate the probability of drought (low flow volumes) during the upcoming season. Verification data samples are constructed using the 50 ensemble forecasts issued on the same calendar day (i.e., 1 forecast per year) over a 50-yr historical period (Kruger et al. 2007).

We begin by evaluating forecast quality functions for the ensemble forecasts at the upstream location (Jackson) issued at a single time of the year (April; see Fig. 6). The drought forecasts have skill [SS(*p*) > 0], except for extreme low and high threshold probabilities. However, the potential skill PS(*p*) of the ensemble forecasts is much higher, indicating that forecast biases are reducing forecast accuracy. Still, the conditional bias CB(*p*) is near zero, except at the extremes. Instead, the unconditional bias UB(*p*) is the primary source of bias (especially at intermediate threshold levels), and the main reason why the potential skill is not realized.

Figure 6 also plots the geometric summary measures—the centroid (symbols) and shape (bars) for the forecast quality functions. Note that we summarize the function for SS_{0}(*p*), as the skill function SS(*p*) contains some negative values. Despite the complex shape of the functions, collectively the three summary measures indicate how the functions differ from the benchmark (constant function) case. For both SS_{0} and PS, the center of mass is less than 0.5 since SS_{0}(*p*) and PS(*p*) values are generally larger at lower probability thresholds; the vertical bars indicate that the mass is more concentrated near the centroid than a constant function. Considering the two bias terms, the weighted average of CB(*p*) is near zero, but the long horizontal bars indicate that the largest CB(*p*) values occur at the extremes. In contrast, UB(*p*) is much larger, with its mass more concentrated near the centroid than the other functions (the longest vertical bar), and a center of mass nearer to 0.5.

To quickly determine whether the forecasts at downstream locations are better or worse, and how the quality of forecasts varies through the spring and early summer (April–July), we compare the summary measures of the forecast quality functions for all three forecast locations. All three sites have skillful ensemble forecasts in spring (see Fig. 7a), but their characteristics consistently change from April to May; skill decreases and the mass of the skill function shifts to higher flow thresholds (an unwelcome property for drought forecasting). Overall, the skill is higher downstream and lower upstream for the same forecast issuance date. By June and July, all three sites have little or no forecast skill.

The trends in the skill are explained by the summary measures for the skill decomposition. The potential skill PS is consistently high at all three sites for April and May forecasts, and a mass shift toward higher flow thresholds in May is apparent (see Fig. 7b). Potential skill drops precipitously in the summer months. The conditional biases (CBs) are generally small in April, but by summer they are significantly larger and concentrated at higher flow thresholds (see Fig. 7c). It is the unconditional biases (UBs) that have the most significant effect on skill (see Fig. 7d). At all sites, UB increases from April through July.

Another way that summary measures can be compared is illustrated by the time series plots for the unconditional bias in Fig. 8. The bias measure is negative (*γ*_{UB} < 0), except in May at Stratford and Des Moines, where the biases are relatively small (see Fig. 8c).

Overall, the conclusion is that even though the ensemble forecasts have lower potential skill in summer (June and July), large unconditional biases (or poor reliability) degrade whatever skill they might have. Implementing bias correction (Smith et al. 1992; Leung et al. 1999; Wood et al. 2002; Seo et al. 2006; Hashino et al. 2007) or forecast calibration (Eckel and Walters 1998; Atger 2003; Gneiting et al. 2005; Hamill and Whitaker 2007; Wilks and Hamill 2007; Hagedorn et al. 2008; Primo et al. 2009, among others) could eliminate unconditional biases and help the system realize the potential skill that exists.

Clearly, the utility of the summary measures for forecast system verification is that they permit a meaningful approximate comparison of the properties of forecast quality functions. Although the climatological distribution of the forecast variable is different at each site (and in each month), the nondimensional nature of the probability threshold summary measures facilitates a side-by-side comparison, and provides a consistent diagnostic framework for forecast system assessment.

## 5. Discussion

Although the previous example compares only three forecast points, it illustrates the utility of the summary measures for forecast system verification. In practice, verification is often done with a single metric like the RPSS or CRPSS—a measure of the weighted-average skill of the forecast skill function. Using the two additional summary measures can provide additional information on the shape of the skill function (and its decompositions) to characterize the performance of the forecast system. Still, it is important to recognize that the three summary measures are insufficient to completely define the shape of a forecast quality function. For example, it is possible to construct shapes that would have summary measures that are identical to those for a constant skill function. The constructed shapes must be perfectly symmetrical about a probability threshold of 0.5 (to produce a center of mass of 0.5), and any deviations from the constant function must be perfectly offset by deviations elsewhere of opposite sign (so that the radius of gyration is the same as for a constant function). Such shapes are unlikely in verification applications, but illustrate that there are certain limits in distinguishing between forecast quality function shapes with just three summary measures. In this way they are analogous to the moments of a probability distribution: the mean, variance, and skewness summarize information about the distribution, but cannot completely define its shape (unless it has a simple parametric form that is known a priori). But just as three moments say more about a probability distribution than the mean alone, so, too, do the three summary measures say more about a forecast quality function than a weighted average (RPSS or CRPSS) alone.

Using either the forecast quality function indexed by the forecast variable *Q*(*y*), or by its climatological nonexceedance probability *Q*(*p*), the geometric interpretation of summary measures is valid. Although higher-order summary measures are derived only for *Q*(*p*) in section 3, similar measures can be obtained for *Q*(*y*). Still, there are compelling reasons for preferring summary measures based on *Q*(*p*) for ensemble forecast system comparison. First and foremost, the higher-order measures are essentially nondimensional (units of probability), allowing a meaningful comparison between forecast locations or forecasting systems. Also, their measures are readily interpretable; for example, a shift in the center of mass toward lower or higher probability thresholds translates directly into event occurrence frequencies. In contrast, summary measures based on *Q*(*y*) have dimensions of the forecast variable. For the benchmark case of a constant forecast quality function, the center of mass and radius of gyration then depend on the climatological distribution of the forecast variable; they are not constant (as are those for measures based on *Q*(*p*)). One advantage of measures based on *Q*(*y*) is that the weighted-average skill is equal to the CRPSS, which has been used for verification (Candille et al. 2007; Hamill and Whitaker 2007; Ferro et al. 2008; Hagedorn et al. 2008; McCollor and Stull 2009; Candille 2009). Still, the nondimensional probability-indexed measures are better suited for comparative verification.

Since the new summary measures we derived in this paper are based on the RPSS and CRPSS, the forecast quality functions are all weighted by *w*(*p*) or *w*(*y*). As seen in section 3b, the weights are smaller for rare events, and larger for events near the median. Of course, it is possible to define a new class of summary measures, unrelated to RPSS or CRPSS, where equal weighting at all thresholds is used. One advantage of such an approach would be that the summary measures would describe the shape of the forecast quality functions *Q*(*p*) or *Q*(*y*), rather than their weighted functions *w*(*p*)*Q*(*p*) or *w*(*y*)*Q*(*y*). But in the case of forecast quality functions indexed by the threshold value *y*, which can be unbounded at the upper and/or lower end, finite values might not exist for equal-weighted measures (i.e., their integral values may be infinite). For functions indexed by *p*, which is bounded at 0 and 1, such a problem does not occur. Therefore, it would be worthwhile to explore such summary measures for *Q*(*p*).

An important insight of the verification framework outlined in section 2 is the interpretation it offers on summary measures of ensemble forecasts. An ensemble probability distribution forecast of a continuous variable (a function) is more complex than a traditional probability forecast of a discrete event (a single number). Visualizing ensemble forecast quality as a continuous function of the forecast variable is a natural extension of the discrete case. This also leads to the realization that scalar measures like the RPSS or CRPSS summarize features of the forecast quality function. For instance, when viewing a skill score function, the sensitivity of the RPSS to the selection of multicategory events is plain to see. But beyond the numerical result of a given summary measure, viewing forecast quality as varying by outcome also illustrates the fundamental nature of such forecasts. Whereas a single-value probability forecast is *either* skillful or not, an ensemble forecast can be *both* skillful and not, depending on the outcome considered. Using the proposed verification framework to evaluate forecast quality over the continuous range of outcomes can help in assessing the strengths, weaknesses, and potential applications of an ensemble prediction system.

## 6. Conclusions

Ensemble prediction systems produce forecasts that can represent the probability distribution of a continuous forecast variable. As such, ensemble forecasts are more complex than traditional probability forecasts of a discrete event. In essence, the ensemble forecast contains a probability forecast for any outcome. As a result, one can define and evaluate the forecast quality of a set of ensemble forecasts as a continuous function of the forecast variable (or its climatological nonexceedance probability). Rather than using a single value to describe some aspect of forecast quality, *a forecast quality function* is needed to completely describe that aspect over the range of continuous outcomes for ensemble forecasts. Indeed, the examples presented demonstrate that ensemble forecasts may contain probability forecast statements that are of high quality (skillful) for predicting some outcomes, but be of low quality (no skill) for predicting others.

Still, forecasters need summary measures of forecast quality to enable comparisons between ensemble forecasts made by different forecasting systems, or between forecasts made at multiple locations, issuance times, and lead times. We explore summary measures that describe properties of a forecast quality function. In particular, we show how traditional summary measures of ensemble forecast skill (the ranked probability skill score and the continuous ranked probability skill score) mathematically represent the weighted-average skill of a skill function. This concept can be extended to skill score decompositions to derive weighted-average measures of other aspects of forecast quality, like resolution and reliability (conditional and unconditional biases). Using a geometric analogy, we derive other summary measures that describe the center of mass of a forecast quality function, and the distribution of mass (related to the moment of inertia). Together, the three summary measures can be interpreted as descriptions of the geometric shape of the forecast quality function.

Several examples illustrate the advantages and applications of the summary measures. In particular, even when ensemble forecasts have the same average skill based on traditional summary measures, the differences in their distribution of skill over the range of outcomes have significant implications (to forecasters and forecast users). The additional summary measures showing where the skill is concentrated further differentiate the quality of the ensemble forecasts. Extending the geometric summary measures to skill score decompositions allows one to efficiently compare multiple sets of forecasts, characterize their differences, and diagnose attributes that contribute to (or detract from) their skill. Although the geometric summary measures cannot completely replace the information contained in the forecast quality functions, for a forecast system manager evaluating the quality of hundreds of forecast elements, the geometric summary measures provide a way to concisely summarize important verification attributes for ensemble forecast system assessment.

## Acknowledgments

This work was supported in part by National Oceanic and Atmospheric Administration (NOAA) Grant NA04NWS4620015, from the National Weather Service (NWS) Office of Hydrologic Development, and Grant NA09OAR4310196, from the Climate Program Office as part of the Climate Prediction Program for the Americas (CPPA). We gratefully acknowledge this support. We would also like to thank the two anonymous reviewers for their thoughtful comments and suggestions.

## APPENDIX

### Discrete Approximation of Functions and Summary Measures

This section describes how a verification dataset containing ensemble forecasts and outcomes (observations) is used to evaluate forecast quality functions and the geometric summary measures. Let {*z*_{t}(*j*), *j* = 1, … , *M*_{t}} denote the ensemble forecast at time *t*, where *z*_{t}(*j*) is the *j*th ensemble member, and *M*_{t} is the number of ensemble members in the forecast. Corresponding to each ensemble forecast, let *y*_{t} be the outcome (the observation of the forecast variable).

_{t}*k*discrete thresholds, selected at constant intervals in probability space. Let {

*p*,

_{i}*i*= 1, … ,

*k*} be the probability thresholds, defined asIn some applications, the climatology of the forecast variable is known (e.g., from a longer historical record). The threshold values corresponding to the probability thresholds, denoted

*y*

_{(i)},

*i*= 1, … ,

*N*} be the ranked observations for the

*N*ensemble forecasts in the verification dataset. One could then use the midpoints between successive pairs

*y*

_{(i)}and

*y*

_{(i+1)}to define

*k*discrete thresholds (

*k*=

*N*− 1):
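As a concrete illustration, the threshold construction described above can be sketched in a few lines of Python. The function names are ours, chosen for illustration only:

```python
def probability_thresholds(k):
    """Probability thresholds at constant intervals in probability space:
    p_i = i / (k + 1), i = 1, ..., k."""
    return [i / (k + 1) for i in range(1, k + 1)]

def thresholds_from_observations(y):
    """Threshold values as midpoints between successive ranked observations,
    giving k = N - 1 thresholds for N observations in the verification dataset."""
    y_sorted = sorted(y)
    return [(a + b) / 2 for a, b in zip(y_sorted, y_sorted[1:])]

p = probability_thresholds(4)                                 # [0.2, 0.4, 0.6, 0.8]
th = thresholds_from_observations([3.0, 1.0, 2.0, 4.0, 5.0])  # [1.5, 2.5, 3.5, 4.5]
```

When climatology is known instead, `thresholds_from_observations` would be replaced by the climatological quantiles at the probabilities *p*_{i}.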

#### a. Forecast quality functions

For each threshold *y**, the ensemble forecast is converted into a probability forecast, estimated as the fraction of ensemble members at or below the threshold:

*f*_{t} = (1/*M*) Σ_{j=1}^{M} *I*[*z*_{t}(*j*) ≤ *y**],

where *I*( ) is the indicator function defined as *I*(*A*) = 1 if the event *A* occurs and *I*(*A*) = 0 otherwise. The observation is then

*x*_{t} = *I*(*y*_{t} ≤ *y**).

Note that other approaches may be used to estimate probability forecasts. For example, as illustrated in Fig. 1, one could assign a nonexceedance probability to each ensemble member *z*_{t}(*j*) using a plotting position formula, then interpolate to find the probability forecast.

The set of forecasts and observations for a threshold defines a verification dataset for a binary predictand; constructing one for each of the *k* threshold values yields the discrete approximation of the forecast quality function *Q*(*p*): {*Q*(*p*_{i}), *i* = 1, … , *k*}.
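The indicator-average estimate of the probability forecast and the binary observation can be sketched as follows. This is a minimal illustration in Python, assuming nonexceedance (≤) events; the function names are ours:

```python
def probability_forecast(ensemble, threshold):
    """Nonexceedance probability forecast: the fraction of the M ensemble
    members at or below the threshold (an average of indicator values)."""
    return sum(1 for z in ensemble if z <= threshold) / len(ensemble)

def binary_observation(y, threshold):
    """Binary outcome: 1 if the observation is at or below the threshold,
    0 otherwise."""
    return 1 if y <= threshold else 0

# One forecast-observation pair, evaluated at a single threshold:
f = probability_forecast([2.1, 3.5, 1.8, 4.0, 2.9], 3.0)  # 3 of 5 members -> 0.6
x = binary_observation(2.4, 3.0)                          # observation occurs -> 1
```

Repeating this for every forecast time *t* and every threshold *y**_{i} produces the binary verification datasets from which {*Q*(*p*_{i})} is computed.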

#### b. Discrete approximation of summary measures

The geometric summary measures are defined as integrals over the probability threshold *p*. These integrals are approximated by numerical integration. Since the probability interval *p*_{i+1} − *p*_{i} is equal to 1/(*k* + 1) for each pair of thresholds [see Eq. (A1)], numerical integration using the trapezoidal rule reduces to the simple expressions shown below.

For a weight function *w*(*p*), the approximation of the weighted-average forecast quality [Eq. (24)] is

*Q̄* ≈ Σ_{i=1}^{k} *w*(*p*_{i})*Q*(*p*_{i}) / Σ_{i=1}^{k} *w*(*p*_{i}).

The approximation of the center of mass [Eq. (27)] is

*p*_{cm} ≈ Σ_{i=1}^{k} *p*_{i}*Q*(*p*_{i}) / Σ_{i=1}^{k} *Q*(*p*_{i}),

and the approximation of the moment of inertia is

*I* ≈ Σ_{i=1}^{k} (*p*_{i} − *p*_{cm})²*Q*(*p*_{i}) / Σ_{i=1}^{k} *Q*(*p*_{i}).

Note that in the case of the skill function SS(*p*), the approximation yields an average skill over the *k* + 1 categories defined by the thresholds {*y**_{i}, *i* = 1, … , *k*}.
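The discrete approximations of the summary measures can be sketched in Python. This is an illustrative implementation under our assumptions: uniform threshold spacing of 1/(*k* + 1), so the trapezoidal rule reduces to ratio-of-sums expressions, with the center of mass and moment of inertia normalized by the total "mass" Σ *Q*(*p*_{i}); the function name is ours:

```python
def summary_measures(p, q, w=None):
    """Discrete approximations of the geometric summary measures from
    quality-function values q[i] = Q(p_i) at uniformly spaced probability
    thresholds p[i]. With uniform spacing, the interval width cancels in
    each ratio, leaving simple sums."""
    if w is None:
        w = [1.0] * len(p)  # unweighted case
    # Weighted-average forecast quality.
    q_bar = sum(wi * qi for wi, qi in zip(w, q)) / sum(w)
    # Center of mass of Q(p), treating Q as a mass distribution over p.
    mass = sum(q)
    p_cm = sum(pi * qi for pi, qi in zip(p, q)) / mass
    # Moment of inertia of Q(p) about the center of mass (normalized).
    inertia = sum((pi - p_cm) ** 2 * qi for pi, qi in zip(p, q)) / mass
    return q_bar, p_cm, inertia

# A quality function symmetric about p = 0.5 puts the center of mass at 0.5:
q_bar, p_cm, inertia = summary_measures([0.25, 0.5, 0.75], [0.2, 0.4, 0.2])
```

A quality function concentrated at high thresholds would instead yield *p*_{cm} near 1, flagging forecasts whose skill comes mostly from extreme outcomes.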

## REFERENCES

Atger, F., 1999: The skill of ensemble prediction systems. *Mon. Wea. Rev.*, **127**, 1941–1953.

Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. *Mon. Wea. Rev.*, **131**, 1509–1523.

Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. *Wea. Forecasting*, **18**, 903–917.

Bradley, A. A., S. S. Schwartz, and T. Hashino, 2004: Distributions-oriented verification of ensemble streamflow predictions. *J. Hydrometeor.*, **5**, 532–545.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.*, **126**, 2503–2518.

Candille, G., 2009: The multiensemble approach: The NAEFS example. *Mon. Wea. Rev.*, **137**, 1655–1665.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. *Quart. J. Roy. Meteor. Soc.*, **131**, 2131–2150.

Candille, G., C. Cote, P. L. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. *Mon. Wea. Rev.*, **135**, 2688–2699.

Carpenter, T. M., and K. P. Georgakakos, 2001: Assessment of Folsom Lake response to historical and potential future climate scenarios: 1. Forecasting. *J. Hydrol.*, **249** (1–4), 148–175.

Cong, S., J. C. Schaake, and E. Welles, 2003: Retrospective verification of ensemble stream predictions (ESP): A case study. Preprints, *17th Conf. on Hydrology*, Long Beach, CA, Amer. Meteor. Soc., JP3.8.

Ebert, E. E., 2001: Ability of a poor man's ensemble to predict the probability and distribution of precipitation. *Mon. Wea. Rev.*, **129**, 2461–2480.

Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based on the MRF ensemble. *Wea. Forecasting*, **13**, 1132–1147.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Ferro, C. A. T., D. S. Richardson, and A. P. Weigel, 2008: On the effect of ensemble size on the discrete and continuous ranked probability scores. *Meteor. Appl.*, **15**, 19–24.

Franz, K. J., H. C. Hartmann, S. Sorooshian, and R. Bales, 2003: Verification of National Weather Service ensemble streamflow predictions for water supply forecasting in the Colorado River basin. *J. Hydrometeor.*, **4**, 1105–1118.

Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc. Ser. B*, **69**, 243–268.

Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. *Wea. Forecasting*, **17**, 192–205.

Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. *Mon. Wea. Rev.*, **136**, 2608–2619.

Hamill, T. M., and S. J. Colucci, 1997: Verification of ETA-RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327.

Hamill, T. M., and S. J. Colucci, 1998: Evaluation of ETA-RSM ensemble probabilistic precipitation forecasts. *Mon. Wea. Rev.*, **126**, 711–724.

Hamill, T. M., and J. S. Whitaker, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. *Mon. Wea. Rev.*, **135**, 3273–3280.

Hashino, T., A. A. Bradley, and S. S. Schwartz, 2007: Evaluation of bias-correction methods for ensemble streamflow volume forecasts. *Hydrol. Earth Syst. Sci.*, **11**, 939–950.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Hou, D. C., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX '98 ensemble forecasts. *Mon. Wea. Rev.*, **129**, 73–91.

Kirtman, B. P., 2003: The COLA anomaly coupled model: Ensemble ENSO prediction. *Mon. Wea. Rev.*, **131**, 2324–2341.

Kruger, A., S. Khandelwal, and A. A. Bradley, 2007: AHPSVER: A web-based system for hydrologic forecast verification. *Comput. Geosci.*, **33**, 739–748.

Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. *J. Climate*, **14**, 1671–1676.

Leung, L. R., A. F. Hamlet, D. P. Lettenmaier, and A. Kumar, 1999: Simulations of the ENSO hydroclimate signals in the Pacific Northwest Columbia River basin. *Bull. Amer. Meteor. Soc.*, **80**, 2313–2328.

Markus, M., E. Welles, and G. N. Day, 1997: A new method for ensemble hydrograph forecast verification. Preprints, *13th Int. Conf. on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology*, Long Beach, CA, Amer. Meteor. Soc., J106–J108.

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. *Manage. Sci.*, **22** (10), 1087–1096.

McCollor, D., and R. Stull, 2009: Evaluation of probabilistic medium-range temperature forecasts from the North American Ensemble Forecast System. *Wea. Forecasting*, **24**, 3–17.

Mullen, S. L., and R. Buizza, 2001: Quantitative precipitation forecasts over the United States by the ECMWF ensemble prediction system. *Mon. Wea. Rev.*, **129**, 638–663.

Murphy, A. H., 1971: A note on the ranked probability score. *J. Appl. Meteor.*, **10**, 155–156.

Murphy, A. H., 1997: Forecast verification. *Economic Value of Weather and Climate Forecasts*, R. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338.

Murphy, A. H., and R. L. Winkler, 1992: Diagnostic verification of probability forecasts. *Int. J. Forecasting*, **7** (4), 435–455.

Primo, C., C. A. T. Ferro, I. T. Jolliffe, and D. B. Stephenson, 2009: Calibration of probabilistic forecasts of binary events. *Mon. Wea. Rev.*, **137**, 1142–1149.

Seo, D.-J., H. D. Herr, and J. C. Schaake, 2006: A statistical post-processor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. *Hydrol. Earth Syst. Sci. Discuss.*, **3**, 1987–2035.

Shamir, E., T. M. Carpenter, P. Fickenscher, and K. P. Georgakakos, 2006: Evaluation of the National Weather Service operational hydrologic model and forecasts for the American River basin. *J. Hydrol. Eng.*, **11** (5), 392–407.

Smith, J. A., G. N. Day, and M. D. Kane, 1992: Nonparametric framework for long-range streamflow forecasting. *J. Water Resour. Plan. Manage.*, **118**, 82–91.

Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. *Forecast Verification: A Practitioner's Guide in Atmospheric Science*, I. Jolliffe and D. Stephenson, Eds., John Wiley & Sons, 137–163.

Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, *Ninth Conf. on Probability and Statistics in Atmospheric Sciences*, Virginia Beach, VA, Amer. Meteor. Soc., 206–213.

Werner, K., D. Brandon, M. Clark, and S. Gangopadhyay, 2004: Climate index weighting schemes for NWS ESP-based seasonal volume forecasts. *J. Hydrometeor.*, **5**, 1076–1090.

Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. *J. Climate*, **13**, 2389–2403.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. International Geophysics Series, Vol. 59, Academic Press, 629 pp.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390.

Wood, A. W., E. P. Maurer, A. Kumar, and D. P. Lettenmaier, 2002: Long-range experimental hydrologic forecasting for the eastern United States. *J. Geophys. Res.*, **107**, 4429, doi:10.1029/2001JD000659.