## Abstract

Ensemble streamflow prediction systems produce forecasts in the form of a conditional probability distribution for a continuous forecast variable. A distributions-oriented approach is presented for verification of these probability distribution forecasts. First, a flow threshold is used to transform the ensemble forecast into a probability forecast for a dichotomous event. The event is said to occur if the observed flow is less than or equal to the threshold; the probability forecast is the probability that the event occurs. The distributions-oriented approach, which has been developed for meteorological forecast verification, is then applied to estimate forecast quality measures for a verification dataset. The results are summarized for thresholds chosen to cover the range of possible flow outcomes. To aid in the comparison for different thresholds, relative measures are used to assess forecast quality. An application with experimental forecasts for the Des Moines River basin illustrates the approach. The application demonstrates the added insights on forecast quality gained through this approach, as compared to more traditional ensemble verification approaches. By examining aspects of forecast quality over the range of possible flow outcomes, the distributions-oriented approach facilitates a diagnostic evaluation of ensemble forecasting systems.

## 1. Introduction

Ensemble streamflow prediction systems make forecasts by using a hydrologic model to simulate streamflow time series (traces) for alternate input weather sequences. The weather inputs are often from historical records, although inputs from ensemble weather and climate forecasts might be used as well. Each hydrologic simulation is initialized to reflect the current moisture state of the watershed. The result is an ensemble of conditional streamflow traces. Increasingly, ensemble prediction techniques are being used for streamflow forecasting and water resources decision making (Day 1985; Smith et al. 1992; Croley 1993; Georgakakos et al. 1995; Georgakakos et al. 1998; Hamlet and Lettenmaier 1999; Carpenter and Georgakakos 2001; Duce 2001; Faber and Stedinger 2001; Krzysztofowicz 2001; Krzysztofowicz and Herr 2001; Yao and Georgakakos 2001; Hamlet et al. 2002; Hay et al. 2002; Wood et al. 2002). At the National Weather Service (NWS), ensemble prediction techniques are being implemented nationwide for long-range streamflow forecasting (out to 90 days), as a part of Advanced Hydrologic Prediction Services (AHPS) (Braatz et al. 1997; Connelly et al. 1999).

The nature of a forecast from an ensemble prediction system is very different from traditional hydrologic and meteorologic forecasts. A short-range flood forecast, for example, is a *deterministic* forecast (Jolliffe and Stephenson 2003). That is, the forecast is a single outcome, and no statement about its uncertainty is provided. In contrast, a probability of precipitation (PoP) forecast is a *probabilistic* forecast for a discrete event (Jolliffe and Stephenson 2003). That is, the forecast is the probability that the event (i.e., rain) will occur. Ensemble streamflow predictions are usually used to make a probabilistic forecast. Rather than a forecast probability for a discrete event, however, the ensemble defines an empirical probability distribution for an outcome (e.g., streamflow volume) that is a continuous variable; for a discussion on interpreting ensembles as probabilistic forecasts, see Krzysztofowicz (2001).

Verification of forecasts is an essential step for their operational use in decision making (Livezey 1990; Mjelde et al. 1993; Murphy 1993; Roebber and Bosart 1996; Doswell and Brooks 1998; Stern 2001; Wilks 2001; Hartmann et al. 2002; Palmer 2002; Pagano et al. 2002; among others). Forecast verification is carried out using a verification dataset, which contains a record of forecasts and subsequent observations. A comparison of the forecasts with the observations is then made to assess forecast quality. To generate a verification dataset, it may be necessary to reconstruct forecasts for the past, where a record of streamflow observations is available (a technique also known as hindcasting).

Numerous techniques have been used for verification of ensemble predictions for meteorologic (Anderson 1996; Hamill and Colucci 1997; Wilson et al. 1999; Hersbach 2000; Hou et al. 2001; among others) and hydrologic forecasts (Day et al. 1992; Schwartz 1992; Markus et al. 1997; Hamlet and Lettenmaier 1999; Carpenter and Georgakakos 2001; Duce 2001; Cong et al. 2003; among others). In this paper, we present a *distributions-oriented* approach for ensemble forecast verification. The distributions-oriented approach has been developed and used extensively for meteorologic forecast verification of discrete forecasts and observations (both probabilistic and deterministic) (Murphy and Winkler 1992; Brooks and Doswell 1996; Brooks et al. 1997; Murphy and Wilks 1998; Wilks 2000; among others). Here, we describe a framework for applying distributions-oriented concepts to probability distribution forecasts of continuous outcomes. The approach is presented through an application to ensemble streamflow predictions for the Des Moines River basin. The application illustrates how the distributions-oriented approach adds insight on the quality of forecasts over the information available with more traditional approaches.

## 2. Problem statement

Let *Y*_{i} be a continuous random variable representing a forecast variable at time *i.* Examples might include the discharge on a specified date, the cumulative discharge volume for some period, or the minimum discharge during the forecast interval. Assume that the forecasting system produces forecasts of the probability distribution of *Y*_{i} conditioned on the current state of the system. The *probability distribution forecast* *G*_{i}(*y*) is defined as

$$G_i(y) = P\{Y_i \le y \mid \xi_i\}, \tag{1}$$
where *P*{*Y*_{i} ≤ *y*|*ξ*_{i}} is the probability that the forecast variable *Y*_{i} is less than or equal to *y,* conditioned on the state of the hydroclimatic system *ξ*_{i}. Our goal for verification is to characterize the quality of probability distribution forecasts *G*_{i}(*y*) for a specific forecasting system.

### a. Verification methodology

Forecast verification requires comparison of the forecasts with observations. However, the form of the forecast *G*_{i}(*y*) as a probability distribution function, and of the outcome *Y*_{i} as a continuous variable, makes a direct comparison difficult for ensemble forecasts. In practice, probability distribution forecasts are usually reformatted (or transformed), and verification is carried out with a simpler form of the forecast.

One common approach is to compare the ensemble mean with the observations (Day et al. 1992; Hou et al. 2001; Duce 2001; Cong et al. 2003; among others). This approach essentially transforms *G*_{i}(*y*) into a deterministic forecast for verification. A second approach is to define discrete outcomes, and transform *G*_{i}(*y*) into a probabilistic forecast for each category (e.g., Carpenter and Georgakakos 2001; Cong et al. 2003). An example would be the use of flow terciles to define above-, below-, and near-normal categories. Although both approaches provide valuable information on aspects of forecast quality, neither completely evaluates the quality of forecasts over the continuous range of outcomes. Here we address this problem through a generalization of the second approach.

The meteorological literature defines a *probability forecast* as one that assigns a probability to the occurrence of a discrete event. From the definition of *G*_{i}(*y*) in Eq. (1), it follows that the probability distribution forecast is a probability forecast for the event {*Y*_{i} ≤ *y*}, for any and all values of *y.* For a specific value *y**, let *f*_{i}(*y**) be the probability forecast that the nonexceedance event {*Y*_{i} ≤ *y**} occurs. By definition,

$$f_i(y^*) = G_i(y^*) = P\{Y_i \le y^* \mid \xi_i\}. \tag{2}$$
Let *x*_{i}(*y**) be a discrete random variable defined as

$$x_i(y^*) = \begin{cases} 1, & Y_i \le y^* \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
In other words, *x*_{i}(*y**) is 1 when the event {*Y*_{i} ≤ *y**} occurs and 0 when it does not. Using a set of probability forecasts *f*_{i}(*y**) and discrete observations *x*_{i}(*y**), forecast verification methods can be applied to evaluate forecast quality for the event {*Y*_{i} ≤ *y**}.
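For an ensemble forecast, the probability forecast *f*_{i}(*y**) can be estimated as the fraction of traces at or below the threshold, and the binary observation follows directly from the observed flow. A minimal sketch (the function names and the six-member ensemble are illustrative, not from the study):

```python
import numpy as np

def probability_forecast(traces, threshold):
    """Empirical probability of the nonexceedance event {Y <= y*}:
    the fraction of ensemble traces at or below the threshold."""
    return float(np.mean(np.asarray(traces, dtype=float) <= threshold))

def binary_observation(observed, threshold):
    """Discrete outcome x: 1 if the event occurred, 0 otherwise."""
    return int(observed <= threshold)

# Hypothetical 6-member ensemble of monthly flow volumes (arbitrary units)
traces = [120.0, 95.0, 210.0, 80.0, 150.0, 60.0]
f = probability_forecast(traces, threshold=100.0)  # 3 of 6 traces <= 100
x = binary_observation(130.0, threshold=100.0)     # observed flow above threshold
```

Repeating this for every year in the verification dataset yields the forecast-observation pairs used in the sections that follow.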

Let *Q*(*y**) denote a specific forecast quality measure (e.g., skill) for a specific threshold *y**. Then the function *Q*(*y*) characterizes the forecast quality measure for the probability distribution forecast *G*_{i}(*y*). In principle, *Q*(*y*) must be determined for the continuous range of *y.* In practice, however, we will evaluate *Q*(*y*) at a set of threshold values spanning the range of possible outcomes.

### b. Distributions-oriented approach

Verification methods for probability forecasts of discrete variables are well established in the meteorological literature (Brier 1950; Wilks 1995; Murphy 1997; Zhang and Casey 2000). However, recent applications have demonstrated the appeal of the distributions-oriented (DO) approach to forecast verification (Murphy and Winkler 1992; Brooks and Doswell 1996; Brooks et al. 1997; Murphy and Wilks 1998; Wilks 2000; among others). Forecast verification using a DO approach centers on the joint distribution of forecasts and observations *h*(*f*, *x*) (Murphy and Winkler 1987; Murphy 1997). In the case of probability forecasts of a dichotomous event, *f* denotes the forecast probability, and *x* is the binary outcome (0 or 1). If forecast-observation pairs (*f*, *x*) at different times are independent and identically distributed, the relationship between *f* and *x* is completely described by *h*(*f*, *x*). Hence, all aspects of the forecast quality can be determined directly from the joint distribution (Murphy 1997).

Murphy (1997) provides a comprehensive presentation of forecast verification based on the DO approach, and describes summary measures for aspects of forecast quality. In the approach proposed here for probability distribution forecasts, forecast quality measures derived from the joint distribution *h*(*f*, *x*) are used for verification. These measures will be estimated for forecasts *f*_{i}(*y*) and observations *x*_{i}(*y*) for the nonexceedance event threshold *y,* and the results will be summarized over the entire range of possible flow thresholds.

## 3. Study area

An application of the DO approach for verification of probability distribution forecasts is presented for an experimental streamflow forecasting system for the Des Moines River basin. This forecast system makes ensemble streamflow predictions using a hydrological model driven with historical weather observations for the period from 1948 to 1997. The hydrological model used is the Hydrological Simulation Program-FORTRAN (HSPF), a lumped-parameter conceptual model developed for the U.S. Environmental Protection Agency (Donigian et al. 1995).

Verification is carried out for monthly flow volume forecasts for September to illustrate the proposed approach. Forecasts are made on 1 September of each year for the cumulative streamflow volume through the end of the month. The location selected for forecast verification is the Des Moines River at Stratford, Iowa (U.S. Geological Survey stream gauge 5481300). This location drains a 5452 mi^{2} area, and is directly upstream of Saylorville Reservoir. Discharge volume at this location is of interest because it represents the majority of the inflow to the reservoir. Saylorville Reservoir is operated for flood control, drought management, and other uses by the U.S. Army Corps of Engineers.

A verification dataset, made up of forecasts and observations for many events, is needed to assess the quality of the forecasting system. In this application, September flow volume forecasts for 1949–96 (*N* = 48) were reconstructed for verification. Streamflow traces were generated using the initial conditions valid on the forecast date and the weather data for each year in the historical record (see Fig. 1). Therefore, for each forecast date, there are 48 traces available. To correct for biases, the simulated volumes were adjusted based on simulation results and observations from the historical record, using a technique similar to that used by Wood et al. (2002) for weather inputs. Probability distribution forecasts of monthly flow volume were then determined using the extended (or ensemble) streamflow prediction (ESP) approach (Day 1985; Smith et al. 1992).

Two alternative ESP forecasts were generated from the traces for comparison. Scenario 1 assumes that no skillful climate forecast is available at the time the ESP forecast is made. Thus, all 48 traces are considered to be “equally likely” outcomes, and all are used to define the probability distribution forecast. Scenario 2 assumes that a skillful climate forecast is available at the time of the ESP forecast. Specifically, we assume the climate forecast predicts whether the September precipitation is above, below, or near normal, as defined by the terciles of its unconditional (climatological) distribution. Furthermore, we assume that this hypothetical climate forecast is “perfect” in that it always predicts the correct precipitation category for the month. The probability distribution forecasts for streamflow volume are then derived using only those traces generated with precipitation in the forecast precipitation category. Figure 2 shows an example ESP probability distribution forecast for 1 September 1965 for the two scenarios. For comparison, the empirical distribution of the observed September flow volume for the 48-yr historical record is also shown. Figure 3 shows a time series of ESP forecasts and observed volumes for the period from 1960 to 1970.
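Scenario 2's conditioning step amounts to keeping only the traces whose driving weather year falls in the forecast precipitation category. A hedged sketch (all names and numbers are hypothetical; tercile edges use NumPy's default quantile interpolation):

```python
import numpy as np

def tercile_category(value, climatology):
    """Classify a value as below-, near-, or above-normal using the
    terciles of its climatological sample."""
    lo, hi = np.quantile(np.asarray(climatology, dtype=float), [1/3, 2/3])
    if value <= lo:
        return "below"
    return "above" if value > hi else "near"

def conditioned_traces(traces, trace_precip, climatology, category):
    """Scenario 2: retain only traces driven by precipitation in the
    forecast precipitation category."""
    return [q for q, p in zip(traces, trace_precip)
            if tercile_category(p, climatology) == category]

# Toy example: three traces, each tagged with its driving precipitation
precip_climatology = list(range(1, 10))
kept = conditioned_traces([10.0, 20.0, 30.0], [2, 5, 8],
                          precip_climatology, "near")
```

Scenario 1 simply skips the filtering and keeps all traces.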

Clearly, scenario 2 is not selected for its operational realism. Indeed, it is unrealistic to expect that perfect forecasting of future precipitation categories is achievable. Note too that the approach for making a probability distribution forecast based on the hypothetical climate forecasts is but one way (possibly not the best) to use this information for streamflow forecasting (e.g., see Croley 2000 and Perica 1998). Instead, this scenario is selected because one might reasonably expect that perfect insight into next month's precipitation category will result in higher-quality streamflow forecasts. The verification in the following sections will show whether this is so.

## 4. Application

The first step in the application is to select flow thresholds where we will apply the DO verification approach. Let {*y*_{(i)}, *i* = 1, … , *N*} be the ranked observed flow volumes for the *N* years of the verification period. We will use the midpoints between successive pairs *y*_{(i)} and *y*_{(i+1)} as thresholds. The thresholds will be indexed by their nonexceedance probability *p.* For the midpoint between *y*_{(i)} and *y*_{(i+1)}, *p* is estimated by *i*/*N.* For this application (*N* = 48), this approach results in 47 separate thresholds, with nonexceedance probabilities *p* ranging from 1/48 to 47/48.
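The threshold construction can be sketched directly from the ranked observations (a minimal illustration; the four observed volumes are hypothetical):

```python
import numpy as np

def midpoint_thresholds(observed):
    """Thresholds at midpoints between successive ranked observations.
    The threshold between y_(i) and y_(i+1) is indexed by p = i/N."""
    y = np.sort(np.asarray(observed, dtype=float))
    n = len(y)
    thresholds = (y[:-1] + y[1:]) / 2.0
    p = np.arange(1, n) / n
    return p, thresholds

# N = 4 observations give N - 1 = 3 thresholds
p, thresholds = midpoint_thresholds([10.0, 40.0, 20.0, 30.0])
```

With *N* = 48 observations, this yields the 47 thresholds used in the application.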

The next step is to construct the set of forecast-observation pairs for each flow threshold. Figure 2 illustrates the process for the September 1965 forecast. The horizontal line shows the *y*_{0.25} quantile threshold in relation to the probability distribution forecasts for the two scenarios and the empirical distribution of observed flows. Note that the empirical distribution can be thought of as the climatological (or unconditional) forecast of September streamflow volume. By definition then, the climatological probability forecast *f* associated with the *y*_{0.25} threshold is 0.25. The conditional forecasts for the two scenarios predict a much higher probability of flow below the threshold. The forecast probability is 0.737 for scenario 1, and 0.432 for scenario 2.

For each year in the verification dataset, the forecast probability *f*_{i} is determined for each scenario. Table 1 shows the results for the *y*_{0.25} threshold. Also shown are the observed flow volumes *y*_{i} and the observations transformed into discrete outcomes *x*_{i} using Eq. (3). For the 1965 forecasts shown in Fig. 2, Table 1 shows that observed September flow was greater than *y*_{0.25}, and hence, *x*_{i} is 0. Therefore, at least for this threshold, the 1965 forecasts for the two scenarios were worse than the climatology forecast; that is, the forecasts indicated a greater chance of observing flows below *y*_{0.25} than climatology, but the actual outcome was above the threshold. Viewed in isolation, this result might tempt some to conclude that the quality of the ensemble forecasts is poor. While the result is clearly undesirable, a larger verification dataset is needed to make a meaningful assessment of forecast quality. As will be seen, the quality of the forecasts for both scenarios is better than climatology forecasts when viewed in this proper context.

With the forecast-observation pairs {(*f*_{i}, *x*_{i}), *i* = 1, … , *N*} for each threshold, the last step is to estimate the DO measures of forecast quality. As the DO approach was originally presented (Murphy and Winkler 1987; Murphy 1997), the forecasts *f* must take on a set of discrete forecast values. However, forecast probabilities produced by the ensemble prediction system are essentially continuous variables. Recently, Bradley et al. (2003) developed techniques for estimating DO summary measures when probability forecasts *f* are continuous variables. The appendix summarizes the sample estimators of the forecast quality measures based on these techniques. In the following sections, we describe the forecast quality measures, and present sample results for the Des Moines example.

### a. Bias

The first moments of the joint distribution *h*(*f*, *x*) are used to characterize the *bias.* The first moments are the expected value of the forecasts *μ*_{f} and the expected value of the observations *μ*_{x}. Note that for the case where the forecast-observation pairs are defined by the quantile *y*_{p}, the expected value of the forecasts is

$$\mu_f = E[f_i(y_p)] \tag{4}$$

and the expected value of the observations is

$$\mu_x = E[x_i(y_p)] = P\{Y_i \le y_p\} = p. \tag{5}$$
Hence, the forecasts are unbiased if the mean forecast probability *μ*_{f} equals the unconditional nonexceedance probability *p. *Figure 4 compares sample estimates of *μ*_{f} and *p* for the range of quantile thresholds *y*_{p} for the September volume forecasts. A 1:1 line is shown to help assess whether the sample average is close to *p.* Both forecast scenarios produce probability distribution forecasts that closely follow the 1:1 line; the results for scenario 2 are only slightly closer to the 1:1 line. Hence, the probability forecasts have very low bias for both forecast scenarios.
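The first moments are estimated by simple averaging over the verification pairs; the four pairs below are hypothetical:

```python
import numpy as np

def first_moments(f, x):
    """Sample estimates of mu_f and mu_x from forecast-observation pairs."""
    return float(np.mean(f)), float(np.mean(x))

# Hypothetical pairs for one threshold y_p with p = 0.25
f = [0.1, 0.3, 0.2, 0.4]
x = [0, 1, 0, 0]
mu_f, mu_x = first_moments(f, x)
bias = mu_f - 0.25  # unbiased if the mean forecast probability equals p
```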

### b. Accuracy

A measure of the *accuracy* of the forecasts is the mean-square error (MSE), which is defined using the joint distribution *h*(*f*, *x*) as

$$\mathrm{MSE} = E[(f - x)^2]. \tag{6}$$
For probability forecasts, the MSE is also known as the Brier score. Figure 5 shows the MSE for the two forecasting scenarios plotted over the range of thresholds *y*_{p}. As was the case for the bias, we use the nonexceedance probability *p* as an index to the results for the *y*_{p} threshold. The shape of the plot is concave, with low MSE for probability forecasts for extreme thresholds (low and high). Also, the MSE is consistently lower for scenario 2, except perhaps at extreme thresholds. Hence, the skillful climate forecasts improve the accuracy of the probability forecasts for nonextreme flow thresholds, but do not improve the forecast accuracy for extreme events.
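The Brier score is a one-line calculation over the verification pairs (toy values, not the Des Moines results):

```python
import numpy as np

def brier_score(f, x):
    """Mean-square error of probability forecasts (the Brier score)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((f - x) ** 2))

bs = brier_score([1.0, 0.0, 0.5], [1, 0, 1])  # only the third forecast errs
```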

The MSE results shown in Fig. 5 can be somewhat misleading in that they seem to suggest that the forecasts for extreme thresholds are more accurate (because the MSE is lower). However, the magnitude of the MSE depends on the variability of the forecasts *f* and the observations *x.* For binary observations, the variance is simply

$$\sigma_x^2 = \mu_x(1 - \mu_x) = p(1 - p). \tag{7}$$
Note the variance *σ*^{2}_{x} is a measure of the inherent *uncertainty* of the observations, and is independent of the forecasts that are made (Murphy 1997). Figure 5 also shows this uncertainty measure, which has a symmetrical concave shape with a maximum at *p* = 0.5. Hence, even though the MSE is much higher for thresholds near the median, it is still much lower than its inherent uncertainty *σ*^{2}_{x}.

Because the magnitude of the MSE depends on *p,* a better way to assess forecast quality over the range of flow thresholds is with a relative measure. A nondimensional measure of the accuracy is the forecast *skill.* The skill of the forecast is the accuracy relative to a reference forecast methodology. The MSE skill score using climatology as a reference (i.e., the forecast probability *f* is always the climatological mean *μ*_{x}, or in this case, *p*) is

$$\mathrm{SS}_{\mathrm{MSE}} = 1 - \frac{\mathrm{MSE}}{\sigma_x^2} = 1 - \frac{\mathrm{MSE}}{p(1 - p)}. \tag{8}$$
For probability forecasts, the SS_{MSE} is also known as the Brier skill score. Figure 6a shows the estimated MSE skill scores.
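A sketch of the skill-score computation, using the climatological MSE *p*(1 − *p*) as the reference (illustrative values only):

```python
import numpy as np

def brier_skill_score(f, x, p):
    """MSE (Brier) skill score relative to the climatology forecast f = p,
    whose MSE equals the uncertainty p * (1 - p)."""
    mse = np.mean((np.asarray(f, dtype=float) - np.asarray(x, dtype=float)) ** 2)
    return float(1.0 - mse / (p * (1.0 - p)))

ss_perfect = brier_skill_score([1.0, 0.0], [1, 0], p=0.5)  # perfect forecasts
ss_clim = brier_skill_score([0.5, 0.5], [1, 0], p=0.5)     # climatology itself
```

Positive values indicate forecasts better than climatology; zero means no improvement.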

For both scenarios, the forecasts have skill (i.e., are better than climatology forecasts) for all but the most extreme flow thresholds. In particular, the 0.25 quantile threshold forecasts have considerable skill, despite the bad forecasts for September 1965. Note also that the use of a skillful climate forecast in scenario 2 produces only modest improvements in streamflow forecasting skill for the intermediate flow thresholds. Forecasts made without a climate forecast (scenario 1) still have considerable skill, indicating that knowledge of the initial moisture state contributes more to skill than knowledge of the future precipitation category. The relative measure of accuracy also reveals a slight asymmetry in the forecast accuracy; forecasts for lower-flow thresholds tend to have higher (relative) accuracy than for higher-flow thresholds.

The obvious exception is the spike in the skill score at the 0.938 quantile. The rapid change in skill is a result of random sampling variability. At the spike, there are only 3 instances (out of 48) where the observed flow exceeds the threshold, and the probability forecasts for these 3 cases are remarkably good. Still, by examining the probability forecast skill over the range of thresholds, one can deduce that this result is an artifact of the particular verification data sample. This result emphasizes the fact that forecast quality measures are sample estimates based on a relatively small sample. The implications of sampling variability will be discussed in detail in a later section.

It is important to recognize that even for thresholds where the estimated skill for the two scenarios is similar, the probability forecasts and their attributes are still quite different. The remaining panels in Fig. 6 show attributes contributing to the forecast skill. In the following sections, we describe these attributes with an example for a single threshold, then return to see how these combine to produce forecast skill.

### c. Calibration-refinement measures

Using the calibration-refinement (CR) factorization of the joint distribution *h*(*f*, *x*), one can investigate the statistical properties of event occurrences (or nonoccurrences) for specific forecast probabilities. The CR factorization is

$$h(f, x) = q(x \mid f)\, s(f), \tag{9}$$
where *q*(*x*|*f*) is the conditional distribution of the observations given the forecast, and *s*(*f*) is the marginal distribution of the forecasts.

Figure 7 shows a reliability diagram, which graphically presents components of the CR decomposition (Wilks 1995). The figure shows the results for the 0.25 quantile threshold for the two forecasting scenarios. The top panel shows the marginal cumulative distribution of the forecasts *S*(*f*). In reliability diagrams, the marginal density *s*(*f*) is usually plotted as a histogram. However, because *f* has a mixed distribution, with mass points at 0 and 1, the cumulative distribution provides a clearer graphical display.

The *sharpness* is defined as the degree to which probability forecasts approach 0 and 1. Ideally, forecasts will be perfectly sharp, consisting only of 0 and 1. The marginal distribution *S*(*f*) in Fig. 7 shows that the forecasts are sharp for both scenarios; over half of the forecasts are 0. However, the forecasts for scenario 2 are sharper, as a forecast of 1 is also issued about 15% of the time. This result is not unexpected. Note that the hypothetical precipitation-category forecasts used in scenario 2 are perfectly sharp; utilizing only those traces corresponding to the forecast category results in slightly sharper streamflow forecasts.

The bottom panel of Fig. 7 indicates the relative frequency of event occurrence *μ*_{x|f} for a given forecast probability *f.* The *reliability* describes the conditional bias of the forecasts. For conditionally unbiased forecasts (i.e., perfect reliability), the relative frequency *μ*_{x|f} equals *f.* For both scenarios, events (i.e., volumes less than the 0.25 quantile) occur less frequently than their forecast probability of occurrence. In other words, the scenarios tend to overforecast event occurrences, resulting in a small conditional bias. An integrated measure of the reliability is

$$\mathrm{REL} = E_f[(f - \mu_{x|f})^2]. \tag{10}$$
For perfectly reliable forecasts, REL is zero. A climatology forecast (*f* = *μ*_{x} for every forecast) is also perfectly reliable, so its REL is likewise zero.

The *resolution* describes the degree to which the observations for a specific forecast *f* differ from the unconditional mean (or climatological probability). If the relative frequency of event occurrences *μ*_{x|f} is the same, regardless of the forecast *f,* the forecasts have no resolution (see line in Fig. 7). For both scenarios, the forecasts have good resolution. When probability forecasts *f* of the event {*Y*_{i} ≤ *y*_{0.25}} are near 0, the event rarely occurs, and when forecasts are near 1, the event usually occurs. An integrated measure of the resolution is

$$\mathrm{RES} = E_f[(\mu_{x|f} - \mu_x)^2]. \tag{11}$$

For forecasts with no resolution, RES is zero. The maximum resolution, RES = *σ*^{2}_{x}, is obtained for perfect forecasts, which are both perfectly reliable (*μ*_{x|f} = *f*) and perfectly sharp.

The connection between the reliability and resolution of the forecasts, and the MSE, can be seen through a decomposition of the MSE. Conditioning on the forecast leads to the so-called CR decomposition (Murphy 1996):

$$\mathrm{MSE} = \sigma_x^2 + \mathrm{REL} - \mathrm{RES}. \tag{12}$$

Dividing by the uncertainty *σ*^{2}_{x}, the skill score can then be written as

$$\mathrm{SS}_{\mathrm{MSE}} = \frac{\mathrm{RES}}{\sigma_x^2} - \frac{\mathrm{REL}}{\sigma_x^2}, \tag{13}$$

where the first term on the right-hand side is a relative measure of the resolution, and the second is a relative measure of the reliability. In other words, through the CR decomposition, skill can be thought of as the difference between the resolution of the forecasts and their conditional bias (the reliability).
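For forecasts taking a finite set of values, REL and RES can be estimated by grouping the pairs on each distinct forecast value, and the decomposition then holds exactly. A minimal sketch (continuous forecasts would first be binned, in the spirit of the estimators of Bradley et al. 2003; the six pairs are hypothetical):

```python
import numpy as np

def cr_measures(f, x):
    """Reliability (REL) and resolution (RES) estimated by grouping
    the verification pairs on distinct forecast values."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    mu_x = x.mean()
    rel = res = 0.0
    for v in np.unique(f):
        m = f == v
        w = m.mean()            # s(f): relative frequency of this value
        mu_xf = x[m].mean()     # conditional event frequency given f = v
        rel += w * (v - mu_xf) ** 2
        res += w * (mu_xf - mu_x) ** 2
    return rel, res

f = np.array([0.0, 0.0, 0.5, 0.5, 1.0, 1.0])
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
rel, res = cr_measures(f, x)
mse = float(np.mean((f - x) ** 2))
var_x = x.mean() * (1.0 - x.mean())
assert abs(mse - (var_x + rel - res)) < 1e-12  # CR decomposition holds
```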

The relative resolution and reliability of the September volume forecasts are summarized over the entire range of flow quantile thresholds in Figs. 6b,c. Clearly, the spike in skill is related to the spike in the resolution of the forecasts. With the exception of the spike, the resolution of the forecasts tends to be asymmetric; forecasts for lower-flow thresholds have good resolution, but those for higher thresholds have poorer resolution. The asymmetry of the resolution (and skill) is not too surprising for streamflow forecasts. The persistence of dry conditions produces low-flow volumes, so the occurrence of low flow is highly predictable based on the moisture state at the time the forecast is made. The occurrence of high-flow volumes depends on the precipitation over the month. Hence, knowledge of the September precipitation category in scenario 2 leads to better resolution (at least for quantile thresholds between about 0.2 and 0.9).

Note, however, that the high resolution of the probability forecasts does not translate into significantly better skill for very low flows (see Figs. 6a–c for *p* of less than 0.1). Conditional biases (poor reliability) offset the high resolution. The reason for conditional biases is the poor simulation of low flows by the forecast model (e.g., see low flows in Fig. 2). Hence, this analysis reveals that there is room for improvement for low flows (and to a lesser extent, very high flows). For example, conditional biases could be reduced or eliminated by adjustment (calibration) of the probability forecasts (e.g., Stewart and Reagan-Cirincione 1991; Krzysztofowicz and Sigrest 1999). In contrast, the forecasts for nearly all other thresholds are already reliable, so improvements through calibration are not possible.

### d. Likelihood-base-rate measures

Using the likelihood-base-rate (LBR) factorization of the joint distribution *h*(*f*, *x*), one can evaluate the statistical properties of the forecasts that are made when events occur (or do not occur). The LBR factorization is

$$h(f, x) = r(f \mid x)\, t(x), \tag{14}$$
where *r*(*f*|*x*) is the conditional distribution of the forecasts given the observation, and *t*(*x*) is the marginal distribution of the observations.

Figure 8 shows a discrimination diagram, which graphically presents components of the LBR decomposition (Wilks 1995). The figure shows the results for the 0.25 quantile for the two forecasting scenarios. For each scenario, the conditional cumulative distributions *R*(*f*|*x* = 0) and *R*(*f*|*x* = 1) are shown. Although discrimination diagrams are usually plotted with the conditional densities *r*(*f*|*x*), we again use the cumulative distributions to more clearly portray the mixed distribution for *f.*

The *discrimination* describes the degree to which the forecasts differ for a specific observation *x.* Note that the cumulative distributions for *x* = 0 and *x* = 1 are very different, suggesting the forecasts have good discrimination. When the event does not occur (*x* = 0), a forecast of 0 was issued over 70% of the time. When the event occurs (*x* = 1), the forecasts are much closer to 1. Scenario 2 forecasts have better discrimination. Note that for *x* = 1, a forecast of 1 was issued about 55% of the time for scenario 2, whereas forecast probabilities as high as 1 were not issued for scenario 1. An integrated measure of the discrimination is

$$\mathrm{DIS} = E_x[(\mu_{f|x} - \mu_f)^2], \tag{15}$$
where *μ*_{f|x} is the expected value of the forecasts conditioned on the observation. If forecasts have no discrimination, then *μ*_{f|x} is the same (and equal to *μ*_{f}) regardless of whether the event occurs or not, and DIS is zero.

Similar to the reliability for the CR factorization, the *type-2 conditional bias* describes the bias conditioned on the observation. An integrated measure of the type-2 conditional bias is

$$B_2 = E_x[(\mu_{f|x} - x)^2]. \tag{16}$$
Note that *B*_{2} is zero (no conditional bias) only in the case of perfect forecasts.

The connection between the discrimination and type-2 conditional bias of the forecasts, and the MSE, can be seen through another decomposition of the MSE. Conditioning on the observations leads to the LBR decomposition (Murphy 1996):

$$\mathrm{MSE} = \sigma_f^2 - \mathrm{DIS} + B_2, \tag{17}$$

where the first term in the decomposition is a measure of the sharpness of the forecasts. Substituting (17) into Eq. (8) for the skill score shows that

$$\mathrm{SS}_{\mathrm{MSE}} = 1 - \frac{\sigma_f^2}{\sigma_x^2} + \frac{\mathrm{DIS}}{\sigma_x^2} - \frac{B_2}{\sigma_x^2}, \tag{18}$$
where the second term on the right-hand side is the relative sharpness, the third is the relative discrimination, and the fourth is the relative type-2 conditional bias.
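The LBR measures condition on the two outcomes instead; grouping on *x* = 0 and *x* = 1 makes the decomposition exact for the sample moments. A sketch with hypothetical pairs:

```python
import numpy as np

def lbr_measures(f, x):
    """Sharpness (variance of f), discrimination (DIS), and type-2
    conditional bias (B2) from conditioning on the observations."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    mu_f = f.mean()
    var_f = float(np.mean((f - mu_f) ** 2))
    dis = b2 = 0.0
    for v in (0.0, 1.0):
        m = x == v
        if not m.any():
            continue
        w = m.mean()            # t(x): base rate of this outcome
        mu_fx = f[m].mean()     # conditional mean forecast given x = v
        dis += w * (mu_fx - mu_f) ** 2
        b2 += w * (v - mu_fx) ** 2
    return var_f, dis, b2

f = np.array([0.2, 0.8, 0.6, 0.1])
x = np.array([0.0, 1.0, 1.0, 0.0])
var_f, dis, b2 = lbr_measures(f, x)
mse = float(np.mean((f - x) ** 2))
assert abs(mse - (var_f - dis + b2)) < 1e-12  # LBR decomposition holds
```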

There is a complex interrelation between sharpness, discrimination, and type-2 conditional bias. As an example, for the case of a climatology forecast (i.e., *f* = *μ*_{x} in all cases), the sharpness of the forecasts is zero. However, the discrimination is also zero, and the type-2 conditional bias is maximized at *σ*^{2}_{x} (resulting in a skill of zero). Hence, for forecasts to possess skill, they must be sharp and have discrimination.

The relative sharpness, discrimination, and type-2 conditional bias of the September volume forecasts are summarized over the entire range of flow quantile thresholds in Figs. 6d–f. The forecasts for scenario 2 are generally sharper, have higher discrimination, and have lower type-2 conditional biases. This combination results in the higher skill for the intermediate-flow thresholds. Still, the results for nonexceedance probabilities between 0.1 and 0.2 are noteworthy. The forecasts for scenario 2 are much sharper than those for scenario 1. That is, the forecasts made based on a skillful climate forecast (scenario 2) go “out on a limb” more, with forecast probabilities tending to be closer to 0 or 1 for the event occurrence. However, these sharper forecasts do not have significantly better discrimination. Hence, for the given increase in sharpness, the improvement in discrimination is insufficient to improve the overall accuracy (skill) of the forecasts. This helps explain why the skill for scenario 1 is better in this range.

The results also highlight that skill is an incomplete measure of the quality of the forecasting system. Note that near *p* = 0.22, the skill for the two scenarios is similar (Fig. 6a). However, the nature of the forecasts is clearly very different; the forecasts made using a skillful climate forecast (scenario 2) are sharper with slightly more discrimination than those made without a climate forecast (scenario 1). Such differences can affect the value of the forecasts in decision making, even though the overall accuracy (skill) of the forecasts is similar.

## 5. Discussion

In this section, we discuss several issues related to the proposed methodology for ensemble forecast verification. First, we compare the DO approach results with those from traditional measures used in ensemble verification. We then discuss how the intended use of the forecasts affects the choice of verification methodology. Finally, we examine the effect of sampling uncertainty on hydrologic forecast verification.

### a. Traditional measures

Traditional approaches to forecast verification often involve calculation of various summary measures meant to assess specific attributes of the forecasts (Wilks 1995). To assess whether ensemble forecasts have accuracy (or skill), a very common approach is to compare the ensemble mean with the observation using the root-mean-square error (rmse), or its related MSE skill score (Day et al. 1992; Hou et al. 2001; Duce 2001; among others). Table 2 summarizes these measures for the two scenarios. Although both measures are better for ensemble forecasts made using a skillful climate forecast (scenario 2), the MSE skill scores are much less than zero. This would seem to suggest that the ensemble forecasts are quite poor. However, a significant problem with these measures is that they heavily penalize large errors for individual forecasts. In this case, the errors for just two forecasts (1962 and 1993) account for almost 90% of the MSE. These errors arise from a similar situation. Both years experienced very wet summers, followed by a drier than average September; with high moisture conditions at the start of September, most ensemble traces were significantly higher than the observed flow (see Fig. 3 for 1962). Table 2 shows that excluding these two years results in a much lower rmse; the MSE skill score is still below zero for scenario 1, but the results for scenario 2 now suggest that the ensemble mean has significant skill.

To assess probabilistic forecasts from ensemble predictions for multiple ordered categories, a common measure of skill is the ranked probability skill score (RPSS; Wilks 1995). One advantage of this measure is that forecast probabilities in categories farther away from the observed category are penalized more heavily. Note that the RPSS is mathematically related to the MSE in Eq. (6). If {*y*_{j}, *j* = 1, … , *M*} are thresholds defining the boundaries between the (*M* + 1) flow categories, then

RPSS = 1 − (Σ^{M}_{j=1} MSE_{j}) / (Σ^{M}_{j=1} *p*_{j}(1 − *p*_{j})),

where MSE_{j} is the MSE for the *j*th threshold, and *p*_{j} is the nonexceedance probability for the *j*th threshold. Therefore, the RPSS can be computed from the results shown in Fig. 5 for any set of categories. Table 2 summarizes the RPSS of probability forecasts for three flow categories, defined by the terciles of the distribution of observed September flow volumes. The RPSS shows that forecasts for tercile categories are skillful for both scenarios. Still, the forecasts made with a skillful climate forecast (scenario 2) are clearly much better.
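The relation between the RPSS and the per-threshold MSEs can be sketched as follows. The ensembles, observations, and thresholds are invented, and `prob_forecast()` is a hypothetical helper that converts an ensemble of traces into a probability forecast for the event "flow ≤ threshold":

```python
# Sketch: RPSS for (M+1) flow categories computed from the MSEs of the
# dichotomous probability forecasts at the M category-boundary thresholds.
# All data are illustrative.

def prob_forecast(traces, threshold):
    """Forecast probability of flow <= threshold: fraction of traces."""
    return sum(t <= threshold for t in traces) / len(traces)

def threshold_mse(ensembles, observations, threshold):
    """MSE_j and base rate p_j for one threshold y_j."""
    n = len(observations)
    f = [prob_forecast(tr, threshold) for tr in ensembles]
    x = [1.0 if obs <= threshold else 0.0 for obs in observations]
    mse_j = sum((fi - xi) ** 2 for fi, xi in zip(f, x)) / n
    p_j = sum(x) / n
    return mse_j, p_j

def rpss(ensembles, observations, thresholds):
    terms = [threshold_mse(ensembles, observations, y) for y in thresholds]
    rps = sum(m for m, _ in terms)                 # forecast RPS = sum of MSE_j
    rps_clim = sum(p * (1 - p) for _, p in terms)  # climatology RPS
    return 1.0 - rps / rps_clim

# Tercile-style example: two thresholds define three ordered flow categories.
ensembles = [[2, 3, 4, 5, 6], [1, 2, 2, 3, 4], [5, 6, 7, 8, 9], [3, 4, 5, 6, 7]]
observations = [4, 2, 8, 6]
thresholds = [3, 6]
score = rpss(ensembles, observations, thresholds)
```

Because summing MSE_{j} over the thresholds reproduces the average ranked probability score exactly (the order of the two summations is simply swapped), the RPSS for any choice of category boundaries can indeed be read off per-threshold verification results.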

Although these traditional measures conclude that streamflow forecasts made using a skillful climate forecast (scenario 2) are better, it is important to contrast this result with the insights gained by using the DO approach. Specifically, the traditional measures do not reveal that forecast probabilities for extremes have almost no skill, or that forecasts for lower-flow events are generally better than those for higher-flow events. They also do not reveal why the forecasts for scenario 2 are more skillful. Perhaps more importantly, the traditional measures do not indicate that the improvement for scenario 2 occurs only for intermediate-flow thresholds; probability forecasts for flow extremes are not improved by the skillful climate forecasts. Because extreme events are usually of great concern in water management, insights on the quality of forecasts for extreme events can be very valuable to water managers.

### b. Verification methodologies

The verification methodology presented for ensemble streamflow predictions involves transforming the probability distribution forecasts into probability forecasts for discrete events, then evaluating the forecast quality. The approach implicitly assumes that one is interested in assessing the quality of the probability forecasts over the entire range of possible events. Another objective is to diagnose the attributes contributing to the forecast skill. Therefore, the verification results might be very useful to a forecaster, who wishes to evaluate the overall quality of the forecast system, and find ways to improve it.

However, from a forecast user's perspective, the verification approach should be tailored to the decision problem where the forecast will be used. Different users will find different verification approaches more helpful. For example, for a water manager who considers only a single drought or flood threshold for decision making, the DO verification approach could be applied to the probability forecasts for this one event (threshold). Still, other verification measures, such as those based on the relative operating characteristics (ROC), can also provide valuable information for decision making in this situation (Mason 1982; Centor 1991; Harvey et al. 1992; Mason and Graham 1999, 2002). For some users, assessing the quality of probability forecasts for specific flow ranges is important (e.g., terciles, or critical flow ranges defined by system constraints or regulations). Here, the ensemble predictions would be used to assign forecast probabilities to each discrete category. A DO verification approach is still applicable in this situation, but other verification measures may also be useful.

### c. Sampling uncertainty

The single most significant obstacle to verification of ensemble streamflow predictions is their inherently small sample sizes. In the example application for September volumes, there is only one forecast made each year (for a given lead time). Most flow records have about 50 or fewer years of data, so the size of the verification dataset is severely limited. In contrast, sample sizes for most meteorological forecasts are an order of magnitude or more larger. Because measures computed using the verification dataset are sample estimates of the forecast quality attributes, small sample sizes lead to high sampling variability (Schwartz 1992). Furthermore, the sampling variability strongly depends on the threshold. The uncertainty for intermediate thresholds, where exceedances (or nonexceedances) occur relatively frequently, is much less than that for extreme thresholds, where exceedances (or nonexceedances) rarely occur.

Using Monte Carlo simulation, Bradley et al. (2003) estimated standard errors of DO measures for the Des Moines example presented here, using a stochastic model developed to represent the September volume forecasts (Hashino et al. 2002). For a threshold with a nonexceedance probability of 0.25, the standard errors in estimates of MSE, REL, and RES, for a sample size of 50, ranged from ±5% to 20%. However, for a threshold with a nonexceedance probability of 0.05, the standard errors ranged from ±60% to 110%. Clearly, for the extreme thresholds shown in Fig. 6, the uncertainty of the forecast quality measures is quite large. One benefit of evaluating the forecast quality measures at so many thresholds is that sampling variability and artifacts can be seen; Fig. 6 shows higher variations at the extremes (e.g., the spike), and smoother variations at intermediate thresholds. However, due to sampling uncertainty, quantitative inferences on forecast quality at the extremes should not be made. Instead, the results for the extremes should be interpreted only in a comparative sense; the results document differences in how well (or poorly) the two forecast scenarios performed for this particular verification sample.
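One simple way to quantify such sampling uncertainty is to bootstrap the verification pairs. The sketch below is not the Monte Carlo procedure of Bradley et al. (2003); the synthetic forecast generator and all parameters are invented for illustration, and the bootstrap is shown only as a generic alternative:

```python
# Bootstrap sketch of sampling uncertainty in a forecast quality measure
# (here the MSE of dichotomous probability forecasts). The synthetic
# forecast generator stands in for an archive of verification pairs and is
# purely illustrative.
import random

random.seed(1)

def sample_mse(pairs):
    return sum((f - x) ** 2 for f, x in pairs) / len(pairs)

def bootstrap_se(pairs, n_boot=2000):
    """Standard error of the sample MSE, by resampling verification pairs."""
    stats = []
    for _ in range(n_boot):
        resample = [random.choice(pairs) for _ in pairs]
        stats.append(sample_mse(resample))
    mean = sum(stats) / len(stats)
    return (sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)) ** 0.5

def synthetic_pairs(p_event, n=50):
    """Crude generator for a threshold with base rate p_event: forecasts
    lean toward the outcome, plus noise, clipped to [0, 1]."""
    pairs = []
    for _ in range(n):
        x = 1.0 if random.random() < p_event else 0.0
        f = min(1.0, max(0.0, 0.6 * x + 0.4 * p_event + random.uniform(-0.2, 0.2)))
        pairs.append((f, x))
    return pairs

# Intermediate threshold (p = 0.25) versus extreme threshold (p = 0.05),
# both with the ~50-forecast sample sizes typical of streamflow archives.
pairs_mid = synthetic_pairs(0.25)
pairs_ext = synthetic_pairs(0.05)
se_mid = bootstrap_se(pairs_mid)
se_ext = bootstrap_se(pairs_ext)
rel_se_mid = se_mid / sample_mse(pairs_mid)  # relative standard error
rel_se_ext = se_ext / sample_mse(pairs_ext)
```

For an extreme threshold the resampled event counts fluctuate strongly from one bootstrap replicate to the next, which is the mechanism behind the much larger relative standard errors reported for the 0.05 threshold.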

In some cases, it may be possible to increase the sample size, and reduce sampling uncertainties, by pooling forecasts made at different times into a single verification dataset. However, this approach assumes that the joint distributions of the forecasts and observations are the same for each time period, and the forecasts issued at different times are (statistically) independent. If these assumptions do not hold, the reduced sampling variability from pooling the forecasts may be offset by biases in the forecast quality estimates.

Regardless, additional work is needed to assess the uncertainty of forecast quality estimates. Ideally, estimates of the uncertainty of each measure could be made, and the standard errors or confidence limits plotted along with the sample values. Exact or approximate analytical expressions for the standard errors of the forecast quality measures based on sampling theory (e.g., Schwartz 1992; Carpenter and Georgakakos 2001) are being explored for this purpose. Due to the small sample sizes, assessment of uncertainty in the forecast quality measures needs to be an integral part of ensemble forecast verification.

## 6. Summary and conclusions

Forecasts from ensemble prediction systems can be issued in several ways. For ensemble streamflow predictions, they are usually issued as probability distribution forecasts for a continuous flow variable. A distributions-oriented (DO) approach is presented to verify ensemble forecasts in this form. Flow thresholds are used to transform the ensemble forecasts into a probability forecast for a dichotomous event. The event is defined as the occurrence of flow below the threshold; the probability forecast is the probability of the event occurrence. The DO approach is then applied to a verification dataset constructed for the threshold, which is based on an archive of ensemble forecasts and observations. The process is repeated for thresholds covering the range of possible outcomes; relative measures are then used to examine aspects of forecast quality over this range.

The application of the DO approach illustrates the insights gained from this verification approach. For September volume forecasts for the Des Moines experimental system, the skill of ensemble forecasts was seen to vary over the range of flow outcomes. The skill was highest for intermediate flows; the forecasts had very little skill for low- and high-flow extremes. Forecasts were compared for two scenarios. One uses all the ensemble traces to make the forecast; the other uses a hypothetical skillful climate forecast to select traces for making the forecast. The verification results showed that a skillful climate forecast improved the skill of the streamflow forecasts for intermediate-flow thresholds. Significantly though, the forecasts for low- and high-flow extremes were not improved. This kind of information can be extremely important for water managers, because extreme events are often the most significant for decision making.

The DO approach also provides information to help diagnose the reason for the skill in probability distribution forecasts. For the Des Moines forecasts, the reason that ensemble forecasts made using the hypothetical climate forecast had more skill for intermediate flows is that they had better resolution. The forecasts were also sharper, even for low flows, and had better discrimination (but still not enough to improve the accuracy for low flows). More traditional approaches to verification also suggested that ensemble forecasts made using the hypothetical climate forecast had higher skill. However, such measures would not inform the user that the improvement is only for intermediate flows, and not flow extremes, or the reason for the improvement in skill.

The DO approach to ensemble streamflow forecast verification is designed to provide a diagnostic evaluation of the quality of probability forecasts over the range of possible flow outcomes. Although the methodology is presented for streamflow forecasts, the approach would also be useful for verification of meteorological ensemble forecasts. However, there are unique challenges for hydrologic applications. Verification datasets for ensemble streamflow predictions typically have small sample sizes. Because forecast quality measures are sample estimates, small sample sizes result in large sampling uncertainty. Additional work is needed to assess uncertainty of forecast quality measures as part of the verification process.

## Acknowledgments

This work was supported in part by the National Oceanic and Atmospheric Administration (NOAA) Office of Global Programs under Grants NA86GP0365 and NA16GP1569, as part of the GEWEX Continental-Scale International Project (GCIP) and the GEWEX Americas Prediction Project (GAPP). We gratefully acknowledge this support. We would also like to thank the three anonymous reviewers for their thoughtful comments and suggestions.

## REFERENCES

### APPENDIX

#### Estimating Forecast Quality Measures

For probability forecasts of a dichotomous event, Bradley et al. (2003) derived simplified expressions for the distributions-oriented forecast quality measures. Most of the forecast quality measures can be estimated analytically using sample moments of forecasts and observations from the verification data sample. Other measures require a statistical model for estimation. The sample estimators for these measures are briefly outlined below.

Let *x*_{i} be the observation at time *i.* Let *f*_{i} be the probability forecast of the event at time *i.* The verification data sample is then {(*f*_{i}, *x*_{i}), *i* = 1, … , *N*}. The sample estimators used to evaluate bias are the traditional sample means *μ̂*_{f} and *μ̂*_{x}. A sample estimator for the MSE is

M̂SE = (1/*N*) Σ^{N}_{i=1} (*f*_{i} − *x*_{i})^{2}.
The likelihood-base-rate measures depend on just the conditional and unconditional means. The sample estimator of the discrimination (DIS) is

D̂IS = (1 − *μ̂*_{x})(*μ̂*_{f|x=0} − *μ̂*_{f})^{2} + *μ̂*_{x}(*μ̂*_{f|x=1} − *μ̂*_{f})^{2},

and the sample estimator for the type-2 conditional bias (*B*_{2}) is

*B̂*_{2} = (1 − *μ̂*_{x})*μ̂*^{2}_{f|x=0} + *μ̂*_{x}(1 − *μ̂*_{f|x=1})^{2},

where *μ̂*_{f|x=0} and *μ̂*_{f|x=1} are the sample conditional means. To estimate the conditional means, the verification data sample is partitioned into two sets, one for the case where the event does not occur (*x* = 0), and the other for the case where the event occurs (*x* = 1). The conditional means are then estimated by the subsample means.

The calibration-refinement measures depend on higher-order moments. The sample estimator for the reliability (REL) is

R̂EL = *σ̂*^{2}_{f} + *μ̂*^{2}_{f} − 2*μ̂*_{x}*μ̂*_{f|x=1} + *Ê*_{f}(*μ̂*^{2}_{x|f}),

and the sample estimator for the resolution (RES) is

R̂ES = *Ê*_{f}(*μ̂*^{2}_{x|f}) − *μ̂*^{2}_{x},

where *σ̂*^{2}_{f} is the sample variance of the forecasts, and is estimated by

*σ̂*^{2}_{f} = (1/*N*) Σ^{N}_{i=1} (*f*_{i} − *μ̂*_{f})^{2}.
The term *Ê*_{f}(*μ̂*^{2}_{x|f}) cannot be simplified if the form of the marginal distribution for *f* is unknown. Bradley et al. (2003) show two approaches for estimating the term. One approach is to use logistic regression to estimate *μ*_{x|f} from the verification data sample {(*f*_{i}, *x*_{i}), *i* = 1, … , *N*}. With the logistic regression, the sample estimate for the term is

*Ê*_{f}(*μ̂*^{2}_{x|f}) = (1/*N*) Σ^{N}_{i=1} *μ̂*^{2}_{x|f_{i}},

where *μ̂*_{x|f_{i}} is the logistic regression evaluated at forecast *f*_{i}.
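A minimal sketch of this estimation step follows, assuming a one-predictor logistic fit by Newton's method; the verification pairs and fitting details are invented for illustration and are not from the Des Moines dataset:

```python
# Sketch of the logistic-regression approach for E_f[mu_{x|f}^2]: fit
# mu_{x|f} = 1/(1 + exp(-(a + b f))) by maximum likelihood, evaluate it at
# each forecast f_i, and average the squares. All data are illustrative.
import math

def fit_logistic(f, x, iters=25):
    """Return intercept a and slope b maximizing the Bernoulli likelihood."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for fi, xi in zip(f, x):
            p = 1.0 / (1.0 + math.exp(-(a + b * fi)))
            w = p * (1.0 - p)
            g0 += xi - p          # gradient wrt a
            g1 += (xi - p) * fi   # gradient wrt b
            h00 += w
            h01 += w * fi
            h11 += w * fi * fi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step (2x2 Hessian solve)
        b += (h00 * g1 - h01 * g0) / det
    return a, b

f = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # forecast probabilities
x = [0, 0, 0, 1, 0, 1, 1, 1, 1]                    # event occurrences
a, b = fit_logistic(f, x)
mu_x_given_f = [1.0 / (1.0 + math.exp(-(a + b * fi))) for fi in f]

n = len(f)
E_mu2 = sum(m * m for m in mu_x_given_f) / n  # estimate of E_f[mu_{x|f}^2]
mu_x = sum(x) / n
res_hat = E_mu2 - mu_x ** 2                   # resolution (RES) estimate

# Reliability (REL) from the same sample moments, and a consistency check
# against the calibration-refinement identity MSE = var_x + REL - RES.
mu_f = sum(f) / n
var_f = sum((fi - mu_f) ** 2 for fi in f) / n
e_fx = sum(fi * xi for fi, xi in zip(f, x)) / n
rel_hat = var_f + mu_f ** 2 - 2.0 * e_fx + E_mu2
mse = sum((fi - xi) ** 2 for fi, xi in zip(f, x)) / n
var_x = mu_x * (1.0 - mu_x)
assert abs(mse - (var_x + rel_hat - res_hat)) < 1e-12
```

Note that the calibration-refinement identity holds exactly for these plug-in estimators regardless of how well the logistic fit approximates the true *μ*_{x|f}; only the split of MSE into REL and RES depends on the fit.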

Finally, the forecast quality measures are nondimensionalized using the uncertainty measure *σ*^{2}_{x}. Because *x* is a Bernoulli random variable, the sample estimator for the variance is simply

*σ̂*^{2}_{x} = *μ̂*_{x}(1 − *μ̂*_{x}).
## Footnotes

*Corresponding author address:* Dr. A. Allen Bradley, IIHR-Hydroscience and Engineering, 523A C. Maxwell Stanley Hydraulics Laboratory, University of Iowa, Iowa City, IA 52242-1585. Email: allen-bradley@uiowa.edu