## 1. Introduction

The ability to forecast large river floods globally could have substantial economic and humanitarian benefits. Sophisticated flood forecast systems in the developed world have been credited with reducing the loss of life over recent decades [notwithstanding increases in flood losses, which are largely attributable to increased development in flood plains; Pielke et al. (2002)]. However, the benefits of these systems do not extend to the developing world, where in situ hydrometric networks are often sparse (Hossain and Katiyar 2006).

Improvements in global weather prediction, and in precipitation observations and nowcasts, both from satellite and numerical weather prediction systems, are beginning to be implemented in undergauged areas. Asante et al. (2007) describe a flood monitoring system that uses satellite precipitation data and a semidistributed hydrologic model that they apply over the poorly gauged Limpopo River basin in Africa. Hopson and Webster (2010) describe a method of producing flood forecasts at two locations in the Ganges and Brahmaputra basins in Bangladesh that the Flood Forecasting and Warning System of Bangladesh integrates into its automated system. The streamflow forecasts are derived from European Centre for Medium-Range Weather Forecasts (ECMWF) Ensemble Prediction System (EPS) forecasts, Tropical Rainfall Measuring Mission (TRMM) 3B42 precipitation estimates, the Climate Prediction Center (CPC) morphing technique (CMORPH; Joyce et al. 2004) precipitation nowcasts, and National Oceanic and Atmospheric Administration (NOAA) Global Telecommunication System (GTS) precipitation gauge data. An alternative global approach using TRMM Multisatellite Precipitation Analysis (TMPA; Huffman et al. 2007) data in conjunction with a rainfall–runoff model has been developed for near-real-time global flood monitoring by Hong et al. (2007, 2009) and evaluated by Yilmaz et al. (2010).

In this paper, we analyze aspects of the downscaling of ensemble precipitation forecasts, including precipitation forecast skill and reliability, at the scale of large river basins, typically 10^{5}–10^{6} km^{2}. The goal of a global approach for downscaling precipitation forecasts is to provide better information than presently exists for flood forecasting in underinstrumented areas and over large domains where no other flood forecast system presently exists, recognizing that such a system may not perform as well as regional systems where higher-quality data are available. Use of the probabilistic quantitative precipitation forecasts to produce ensemble flood forecasts is the topic of a future research project (N. Voisin et al. 2010, unpublished manuscript). Arguably, there is a potential benefit to calibrating the weather forecasts in addition to calibrating the resulting ensemble streamflow forecasts, particularly in ungauged basins where streamflow observations are either missing or irregular. We investigate here methods that utilize global forecast models and remotely sensed precipitation. Our objective is to produce precipitation ensemble forecasts, downscaled to the spatial resolution of a grid-based hydrology model (in our case, at 0.25° grid resolution in latitude and longitude), in locations where in situ observations are lacking.

Among the downscaling techniques that have been used (or are intended) for precipitation forecasts are model output statistics (MOS; Glahn and Lowry 1972), precipitation field estimation from probabilistic quantitative precipitation forecasts (Seo et al. 2000), and probabilistic quantitative precipitation forecasting (Sloughter et al. 2007; Berrocal et al. 2008)—using methods such as Bayesian model averaging (BMA; Raftery et al. 2005); the National Weather Service (NWS) Ensemble Precipitation Processor (EPP), which uses the Schaake shuffle (Clark et al. 2004); the analog approach (e.g., Hamill and Whitaker 2006, hereafter HW2006); and bias correction with the statistical disaggregation approach of Wood et al. (2002, hereafter W2002).

All of these methods use high-spatial-resolution local observations that are not generally available globally. While some methods may be adaptable to data-sparse conditions, all suffer from some limitations. For instance, in MOS, a regression is performed between (retrospective) observations and forecasts, and that regression is applied to correct forecasts in near-real time. This requires a relatively lengthy period of contemporaneous forecasts and observations (changes in forecast models and methods present a practical challenge). In our case, the relatively short period of overlap between the observations and model output (5 yr) is an issue in this respect. Record length issues with MOS can be especially severe in arid regions; Clark and Hay (2004) report potential difficulties in developing MOS equations in dry regions because of the small number of days with precipitation. They also found that, over the United States, the downscaled precipitation forecasts showed no general improvement over the raw forecasts in either accuracy or correlation with observed precipitation.

Sloughter et al. (2007) adapted the BMA for probabilistic quantitative precipitation forecasts. It is a combination of a logistic regression model and a fitted gamma distribution that outputs the “probability of precipitation and full probability density function for the precipitation accumulation given that it is greater than zero.” As for the MOS method, it is applied independently to observation stations. Berrocal et al. (2008) included a representation of the spatial correlation between stations, but for a single ensemble member only. Application of a full BMA to a multiensemble forecast including representation of correlations between all downscaled forecast fields (precipitation, temperatures, wind) is beyond the capability of current methods.

Seo et al. (2000) produced daily ensembles of future basin mean areal precipitation forecasts. They first downscale in space the probability of precipitation, the conditional mean, and the variance (similar to kriging); then, they perform a conditional simulation of precipitation based on the probabilistic quantitative precipitation forecast (PQPF; Krzysztofowicz 1998) and the double optimal estimation (Seo 1998a,b). The computational requirements of the procedure are a challenge (Schaake et al. 2007).

The EPP (Schaake et al. 2007) is intended for use with probabilistic quantitative forecasts. It addresses some of the issues noted above via a two-step method. First a single value forecast (deterministic or ensemble mean) is used to derive parameters for conditional distributions of the forecast and observation pairs at each subbasin location (or finer-resolution grid cell, called the observed mean areal precipitation). Marginal distributions are derived for both daily events and n-day aggregates. These are then used to transform both the forecast and the observation to new variables using the normal quantile transform (they are assumed to follow a bivariate normal distribution). The second step of the method uses this joint probability distribution and the Schaake shuffle (Clark et al. 2004) to create ensemble forecasts (mean areal precipitation) for each subbasin. The use of the Schaake shuffle and the possibility of aggregation facilitate the conservation of observed spatial and temporal structures within the subbasin and between downscaled forecast fields. The method is intended to provide forcings for a hydrologic model that can forecast the flow at a specific location (outlet of the subbasin). It is not clear how small the subbasins should be, and the application of the method to the larger domains we consider here (e.g., river basins with drainage areas hundreds of thousands of km^{2}) would require additional development.

The bias correction with statistical disaggregation (BCSD) approach of W2002 is a two-step method for seasonal forecasting that produces ensembles of spatially and temporally distributed quantitative precipitation forecasts over relatively large domains, with the direct application of forcing a distributed hydrologic model. The disaggregation step of this approach was applied one basin at a time in order to preserve an observed spatiotemporal structure in the forecasts. We pursue here adaptations that allow it to handle large domains in a coherent manner, and forecast periods much shorter than the minimum of 1 month to which it has previously been applied.

Analog methods such as HW2006 were originally intended for spatially distributed probabilistic forecasts. They require an observation dataset at the desired fine spatial resolution and a reasonable period of retrospective forecasts. In brief, a daily forecast for a defined spatial window at day of year *n* and lead time *d* is compared to a set of retrospective daily forecasts over the same spatial window and for the same lead time as the valid time of the given daily forecast. The closest retrospective forecast becomes the analog and the finer spatial resolution observations corresponding to that retrospective forecast become the downscaled forecast. In addition, HW2006 introduced a spatial moving window for finding the analog that makes the methods applicable for large river basins.

In this paper, we focus on applying the BCSD method of W2002 and the analog approach of HW2006. The adaptations required to use these two methods for our application (forecast lead times of days to a week or more, large ungauged spatial domains, and quantitative precipitation forecasts) seemed minimal. The methods are computationally realistic; that is, the system should run in much less than 1 day with modest computer resources in order to be feasible for real-time implementation. We compare the two methods for application to medium-range (maximum 10 days) ensemble precipitation forecasts over a large river basin (∼500 000 km^{2}), where we are able to evaluate the forecast skill and accuracy for a range of spatial and temporal scales.

The remainder of the paper is organized as follows. Section 2 describes the forecast and observation datasets used throughout the paper. Section 3 describes adaptations of the BCSD and the analog methods. Section 4 evaluates the downscaling approaches over the Ohio River basin using observed precipitation.

## 2. Forecast and observation datasets

ECMWF EPS 6-h precipitation forecasts (up to 10-day lead times) were aggregated to a daily (0000–2400 UTC) format. Their spatial resolution is 1° latitude × 1° longitude. They span the period 2002–06. There are 51 ensemble members for each forecast. To reduce the computational burden, we formed reduced ensemble forecasts of 15 members, which included the control run (forecast without initial condition perturbation), and 14 members randomly selected from the 50 perturbed forecast members.
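The reduction from 51 to 15 members can be sketched as below; `all_members` is a hypothetical stand-in for one forecast's member indices, with member 0 as the control run.

```python
import numpy as np

rng = np.random.default_rng(2002)  # fixed seed for reproducibility

# Hypothetical member indices for one forecast date:
# index 0 is the control run; indices 1-50 are the perturbed members.
all_members = np.arange(51)

# Keep the control and draw 14 of the 50 perturbed members at random,
# without replacement, to form the reduced 15-member ensemble.
perturbed_subset = rng.choice(all_members[1:], size=14, replace=False)
reduced_ensemble = np.concatenate(([all_members[0]], perturbed_subset))
```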

Both the BCSD and analog methods require in situ observations for calibration and downscaling. We used the daily TRMM 3B42 V6 Research Product (TMPA; Huffman et al. 2007), which is a near-global dataset at 0.25° latitude–longitude spatial resolution. During the period 2002–06, both the EPS forecasts and the TMPA observations are available. We performed forecast evaluations using both the 0.25° native resolution of the TMPA data, and an aggregation to 1° (Fig. 1).

Although the methods we develop are intended for application to data-sparse areas, our test site is the Ohio River basin. This allows evaluation of the methods with a relatively dense gridded station precipitation dataset, in addition to the TMPA data. We used an updated version of the Maurer et al. (2002) dataset, aggregated from its native ⅛° latitude–longitude spatial resolution to ¼°. Although the TMPA (Research Product) is adjusted on a monthly basis to match the gridded observations, the TMPA daily precipitation statistics are known to differ considerably from the gridded daily in situ observations (Su et al. 2008). Thus, we are interested in knowing how the methods perform for the generally higher quality station data as compared with the TMPA remote sensing product.

## 3. Downscaling methods

### a. BCSD

The BCSD approach of W2002 was originally intended for ensemble seasonal precipitation (and temperature) forecast applications, and for applications to relatively large areas or river basins. The original BCSD method bias-corrects monthly precipitation forecasts with respect to monthly observed climatologies at the coarse spatial resolution of the precipitation forecast (in the case of W2002, the forecast spatial resolution was 2.5° latitude × longitude). The daily sequencing of precipitation within the month is determined by resampling from the historical record for the entire basin at once, and rescaling to the bias-corrected forecast values. The resampling is performed randomly, with the historical months within the sampling pool(s) partitioned into wet and dry months (months above and below the median). No further adjustment is made to preserve the climatological relationship between the number of wet days and the monthly precipitation amount.

In the W2002 BCSD method, spatial anomalies from the domain-average precipitation are used to disaggregate the forecast spatially and temporally. This is feasible because the same month from the historic record is resampled for the entire domain. This assures plausible spatial and temporal levels of coherence within the domain. For larger (e.g., continental) domains however, this last step of selecting the same month from the historic record for the entire domain is problematic, as the daily spatial variability in the disaggregated forecast could be much different than that in the raw forecast.

To be useful for our application, the BCSD method needs to handle shorter-time-scale forecasts (daily) and larger domains. Thus, the following changes were made. First, the probability-mapping bias correction was applied to daily EPS forecast members on a daily time scale (which eliminates the need for the temporal disaggregation step in W2002). Then, we used three variations of spatial disaggregation. The first variation, termed hereafter BC_INT, is an inverse squared distance interpolation (Shepard 1984) of the daily 1° bias-corrected ensemble forecasts. This variation smoothes the bias-corrected forecasts without recognizing features of the spatial precipitation distribution associated with, for instance, orography in one part of the 1° cell, or weather patterns like fronts. It is however a way to isolate the improvement when using a simple bias correction. In a second variation, termed INT_BC, the daily 1° ensemble forecasts were first interpolated to 0.25° and then bias corrected at the finer spatial scale. This variation recognizes finer spatial-scale physical features, but as for BC_INT, the ensemble forecast spatiotemporal rank structure is entirely controlled by the raw ensemble weather forecasts because both the interpolation and the bias correction conserve the ranks of each member within the ensemble. In the third variation, termed BC_SD, the bias correction is applied to the daily 1° ensemble forecasts and then the spatial disaggregation, a scaling approach, is performed in two steps. First it is applied on a grid-cell by grid-cell basis, rather than for the entire domain at once as in BCSD. This allows the method to be applied to much larger domains. Then, a Schaake shuffle (Clark et al. 2004) is applied to the entire domain at once in order to reconstruct a spatiotemporal rank structure for the ensemble forecasts.

These adaptations (BC and SD) recognize fundamental differences between the seasonal time scales for which the method was originally intended, and the shorter forecast lead times of interest here. In particular, the timing and localization of storms become much more important at shorter lead times. Furthermore, there is an inevitable trade-off as to the size of the spatial domain in application of BCSD, given that hydrological data needed for model calibration and verification often are not available for smaller tributaries in areas that are data deficient, whereas increasing the domain can lead to problems with spatial consistency, especially at short lead times where storm characteristics are important. For this reason, we decided to use a two-step approach to implement the revised BCSD method. The BC step of the three variations and the SD step of the BC_SD variation are described below.

#### 1) Bias correction

The 15 members of the daily 10-day 1° (0.25° for INT_BC) ensemble forecasts from ECMWF EPS are bias corrected with respect to the gridded precipitation observations aggregated to the 1° (0.25°) spatial resolution of the forecasts. Systematic bias is inevitably inherent in the precipitation forecasts, and in general is a function of spatial location and forecast lead. We apply the quantile-based mapping method of Panofsky and Brier (1968) with the implementation of W2002 with the following adaptations (further detailed in appendix A):

- Cumulative distribution functions are derived from daily (not monthly) values of the ensemble precipitation forecast members.
- A correction for daily precipitation intermittency is added and is performed concurrently with the quantile-based mapping technique.

The bias correction was applied to each forecast member for each 1° grid cell (BC_INT and BC_SD) or 0.25° interpolated member (INT_BC) and to each forecast lead time (1–10 days in our application) independently. The advantage of bias correcting each day independently rather than bias correcting the 10-day accumulated values is that corrections can be made for daily precipitation intermittency, and different biases at different lead times can be corrected. For example, if the forecasts tend to be wetter for long lead times and drier for earlier ones, the bias correction will resolve both issues.
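The per-cell, per-lead-time quantile mapping can be sketched as follows. This is a minimal stand-in, not the full procedure of appendix A: the intermittency correction here simply maps forecast quantiles that fall within the observed dry-day frequency to zero, and `fcst_climo` and `obs_climo` are hypothetical daily climatologies for one grid cell and one lead time.

```python
import numpy as np

def quantile_map(fcst_climo, obs_climo, fcst_value):
    """Map one forecast value onto the observed distribution by matching its
    empirical quantile in the forecast climatology (one grid cell, one lead
    time). Simplified stand-in for the appendix A procedure."""
    fcst_sorted = np.sort(fcst_climo)
    obs_sorted = np.sort(obs_climo)
    # Empirical non-exceedance probability of the forecast value
    q = np.searchsorted(fcst_sorted, fcst_value, side="right") / len(fcst_sorted)
    q = min(q, 1.0 - 1e-9)
    # Intermittency: forecast quantiles within the observed dry-day
    # frequency map to zero precipitation.
    dry_freq = np.mean(obs_sorted == 0.0)
    if q <= dry_freq:
        return 0.0
    return float(np.quantile(obs_sorted, q))
```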

#### 2) Spatial disaggregation

The EPS forecast native spatial resolution for the 2002–06 period is 1° latitude–longitude, which is coarser than the resolution of the hydrological model (0.25° in our example). An inverse squared distance interpolation (Shepard 1984) was used after (before) bias correction in the BC_INT (INT_BC) variation. In the BC_SD variation, an alternative technique for spatial disaggregation of precipitation was developed based on the procedure applied in Wood and Lettenmaier (2006, hereafter WL2006). We rely on the 0.25° gridded observations only to specify the spatial patterns (but not amounts) of the precipitation within a 1° cell because the 1° cell values have already been bias corrected. Essentially, our strategy is to impose the high spatial resolution variability of the observations onto the coarse-resolution bias-corrected forecast values as follows.
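The interpolation step can be sketched as a plain inverse-squared-distance weighting; the operational SYMAP algorithm adds neighbor selection and directional weighting not reproduced here, and the coordinate and value arrays are hypothetical.

```python
import numpy as np

def inv_sq_interp(coarse_lats, coarse_lons, coarse_vals, lat, lon, eps=1e-12):
    """Inverse-squared-distance interpolation of coarse-grid (e.g., 1-degree)
    values to one fine-grid (e.g., 0.25-degree) target point. Simple form;
    the full SYMAP algorithm (Shepard 1984) is more elaborate."""
    d2 = (coarse_lats - lat) ** 2 + (coarse_lons - lon) ** 2
    if np.any(d2 < eps):                  # target coincides with a grid node
        return float(coarse_vals[np.argmin(d2)])
    w = 1.0 / d2
    return float(np.sum(w * coarse_vals) / np.sum(w))
```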

For each 1° grid cell and each lead time, 15 days were selected at random, with the historical observed days within the sampling pool partitioned into wet and dry days. Each observed 1° grid cell is made of up to sixteen 0.25° observations that define the observed subgrid precipitation pattern at the 0.25° spatial resolution. Each of the fifteen 0.25° precipitation patterns was transposed onto the fifteen bias-corrected forecast values using the ratios of the 0.25°–1° values as described in section b of appendix A. The patterns corresponding to the largest observed 1° aggregate precipitation amounts were used to disaggregate the largest forecast amounts to 0.25°.
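The scaling step for one 1° cell and one member can be sketched as follows (a simplified reading of section b of appendix A; the function and variable names are hypothetical):

```python
import numpy as np

def disaggregate(bc_forecast_1deg, obs_pattern_025):
    """Impose an observed 0.25-degree subgrid pattern (up to a 4x4 block
    within one 1-degree cell) on a bias-corrected 1-degree forecast value,
    conserving the 1-degree mean. Simplified sketch of appendix A, section b."""
    obs_mean = obs_pattern_025.mean()
    if obs_mean == 0.0:
        # Observed pattern is all dry: spread the forecast uniformly.
        return np.full_like(obs_pattern_025, bc_forecast_1deg)
    ratios = obs_pattern_025 / obs_mean          # 0.25-deg / 1-deg ratios
    return bc_forecast_1deg * ratios
```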

The resulting 0.25° spatial variability within the 1° cell is consistent with a pattern that has occurred in the observations, while the 1° bias-corrected forecast value is conserved. A problem with this method (downscaling performed independently for each day of the forecast period, and for each 1° cell rather than over a larger area such as a river basin) is the loss of spatial and temporal coherence between adjacent 1° cells and lead times. This issue is addressed by adding a Schaake shuffle (Clark et al. 2004) that reconstructs a spatiotemporal rank structure over the 10-day forecast period and over the entire domain (the Ohio River basin in this case). This Schaake shuffle is also necessary to create the correlation between the different downscaled forecast fields (precipitation, temperature, and wind) that are required to force a hydrology model (N. Voisin et al. 2010, unpublished manuscript). The Schaake shuffle and its modifications for our application are described in section c of appendix A. The spatial coherence of the bias-corrected and downscaled forecasts over the domain is evaluated in section 4.
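For one grid cell and one lead time, the reordering at the heart of the Schaake shuffle (Clark et al. 2004) can be sketched as below; `historical_vals` stands for the matched set of historical observations whose rank structure, shared across cells and lead times, the shuffle transfers to the ensemble.

```python
import numpy as np

def schaake_shuffle(ensemble_vals, historical_vals):
    """Reorder ensemble member values (one grid cell, one lead time) so their
    ranks match the ranks of a matched set of historical observations
    (Clark et al. 2004). Values are unchanged; only member ordering changes."""
    ensemble_sorted = np.sort(ensemble_vals)
    # Rank of each historical value (0 = smallest)
    hist_ranks = np.argsort(np.argsort(historical_vals))
    return ensemble_sorted[hist_ranks]
```

Applying the same `historical_vals` dates at every grid cell and lead time is what reconstructs a coherent spatiotemporal rank structure across the domain.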

### b. Analog approach

The analog approach of HW2006 was developed to give occurrence probabilities for certain specific events (e.g., probabilities of exceeding a specific threshold). The idea behind the analog approach is to compare precipitation forecasts for a particular day over a given domain at the forecast native spatial resolution with a set of archived forecasts for the same domain and previous years. Archived forecasts that are similar to the current forecasts are called analogs. The observed precipitation patterns corresponding to the analog forecasts can be used to make probabilistic statements about the precipitation that may occur, given the current forecast. In the HW2006 implementation, the archived forecasts are for a ±45-day window centered on the particular day (say, day *n*). The closest analog is denoted day *m*. The *observation* for day *m*, which in general has a finer spatial resolution than the raw forecast, becomes the downscaled forecast. In our case, the observations come from the 0.25° gridded observations (TMPA or gridded station data).

In our adaptation of HW2006, there are three ways to choose the analog for ensemble forecasts. These are described in more detail in appendix B. The first is to match each of the 15 members individually with a retrospective forecast member, with the closest analog defined as the one with the lowest root-mean-square difference (RMSD). The RMSD is defined as the sum of the root-mean-square differences between the forecast and the potential analog values at each grid point in a spatial window (25 points in a 5° × 5° window). We term this approach analog RMSD. A second way is to match the ensemble mean forecast with retrospective ensemble mean forecasts, and choose the 15 closest analogs based on the lowest RMSD (as defined above). We term this analog RMSDmean. For the analog RMSD and RMSDmean methods, when the sum of the RMSDs over the forecast points is larger than the sum, over the grid points in the spatial window, of one-third of the ensemble (mean) forecast value or 10 mm, whichever is greater at each grid point, then interpolation is substituted for the analog method. This implies that for extreme forecast events where no “close enough” analog could be found, interpolation is used. Table 1 presents the frequencies of the substitutions for both methods (RMSD and RMSDmean), for different lead times, and for different forecast categories for a 5° × 5° spatial window. The third way is to match each of the 15 ensemble members individually based on ranks of forecast values in the forecast climatology rather than the raw forecast values. We term this approach analog rank.
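The core of the RMSD analog search, for one forecast (or ensemble-mean) field over one 5° × 5° window, can be sketched as follows; the archive array is a hypothetical stand-in, and the interpolation fallback test described above is omitted.

```python
import numpy as np

def closest_analog(fcst_window, archive_windows):
    """Return the index of the archived forecast window (same 5x5 grid of
    1-degree points, same lead time, drawn from a +/-45-day seasonal window
    in HW2006) with the lowest root-mean-square difference from the current
    forecast, along with that RMSD."""
    rmsds = np.sqrt(np.mean((archive_windows - fcst_window) ** 2, axis=(1, 2)))
    best = int(np.argmin(rmsds))
    return best, float(rmsds[best])
```

The observed 0.25° precipitation field on the date of the returned archive index would then serve as the downscaled forecast for the window's center cell.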

The domain over which the closest analog is identified would ideally be the entire forecast domain so as to ensure spatial coherence in the rank structure. However, it is unlikely that any dataset can reproduce the spatial variability of a particular forecast over a domain as large as, say, the Mississippi River basin. As a compromise, HW2006 suggested using a moving spatial window. We follow their approach, and use a 5° × 5° spatial window, which gives twenty-five 1° grid points to which the RMSDs or ranks can be compared, as described further in appendix B. As in HW2006, we initially saved the analog dates for the 25 or so spatial moving windows in which each forecast point was included and then made a weighted averaging of those analogs at this forecast point based on how far it was from the center of the moving spatial window. We found, however, that this approach leads to too much smoothing of the forecast values. (This issue did not arise in HW2006 because they smoothed the probabilities that a certain threshold would be reached, not the actual forecast values.) As a consequence, unlike HW2006, we assign the analog that matches the spatial window to the center grid cell only. Our spatial window is large enough (5° × 5°), and there is sufficient spatial overlap with adjacent spatial windows (16 of 25 grid cells are in common), so as to ensure that analogs chosen at two adjacent grid cells are not much different. This in turn ensures some spatial consistency across the entire domain, although as for BCSD there do remain some practical issues in terms of temporal and spatial coherence among the ensemble members. Therefore, a similar Schaake shuffle as applied in the BCSD is performed, as described in section c of appendix A. As in its incorporation in BCSD, this step does not change the calibrated and downscaled values but only changes the ordering of the members within each grid cell. 
This implies that the spatially distributed results in this paper would be identical without the Schaake shuffle, but the spatial coherence would not be. The spatial coherence is evaluated further in section 4.

## 4. Evaluation of downscaling approaches

In this section, we evaluate the biases and skill levels of the ensemble precipitation forecasts calibrated and downscaled using the three variations of the BCSD approach described in section 3a, and the three variations of the analog approach described in section 3b. The benchmark for our evaluation is a simple inverse squared distance interpolation of raw forecasts [the Synagraphic Mapping System (SYMAP); Shepard (1984)] that produces downscaled forecasts for each ensemble member at 0.25° spatial resolution. Our goal is to evaluate the developed calibration and downscaling methods with respect to downscaled raw forecasts. The benchmark does not include a bias-correction process (like INT_BC or BC_INT) because the evaluations of the methods would then depend on the appropriateness of the bias correction applied to the benchmark.

Our evaluations were performed over the Ohio basin during the period 2002–06 at both 0.25° (for purposes of evaluating the effects of spatial disaggregation and calibration) and 1° spatial resolutions (to isolate the skill of the forecast calibration only). The evaluations were for daily forecasts at lead times of 1–10 days, and for 5-day accumulation forecasts at 1-day lead time (accumulated precipitation for days 1–5 and 6–10 made on day 1, the accumulated precipitation forecasts being an important indicator for basins with concentration times longer than 1 day; see Fig. 1). In our first evaluation, the TMPA data were used as the observation dataset. For this analysis, none of the analog techniques used a resampling of the same day as the forecast; that is, that day was excluded from the dataset used for resampling.

It is common practice to evaluate different categories of forecasts. For instance, Schaake et al. (2007) evaluated precipitation terciles, HW2006 used specific precipitation thresholds, and Franz et al. (2003) evaluated streamflow forecasts for low and high flows. In our case, we defined a forecast as wet when the ensemble mean forecast was larger than 0.254 mm for daily forecasts, and 2.54 mm for 5-day accumulation forecasts. We then evaluated forecast performance for all events (dry and wet forecasts), wet events, and for wet events when the ensemble mean forecast was in the upper tercile of wet forecasts. Section a of appendix C includes details of the forecast categories.
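The categorization can be sketched as follows for daily forecasts (the 0.254-mm wet threshold is from the text; computing the upper tercile from the wet-forecast population is a simplified reading of section a of appendix C):

```python
import numpy as np

def categorize(ens_mean_daily, wet_threshold=0.254):
    """Label each daily ensemble-mean forecast (mm) as dry, wet, or upper
    wet tercile. Simplified sketch of appendix C, section a."""
    wet = ens_mean_daily > wet_threshold
    labels = np.where(wet, "wet", "dry").astype(object)
    if wet.any():
        # Upper tercile boundary of the wet-forecast population
        upper = np.quantile(ens_mean_daily[wet], 2.0 / 3.0)
        labels[wet & (ens_mean_daily > upper)] = "upper_wet_tercile"
    return labels
```

For 5-day accumulations the same logic would apply with the 2.54-mm threshold.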

It is important to note that the categories used in the evaluations are conditioned on the benchmark (interpolated) forecasts and not the observations. Conditioning on observations is common when comparing, for instance, forecasts from different sources (Demargne et al. 2009). In this case, both the skill levels of the calibration and of the original forecasts are evaluated. Here, however, we compare different approaches to calibrating and downscaling a particular set of forecasts; therefore, conditioning on the benchmark forecasts rather than observations allows isolation of the attributes of the calibration and downscaling approaches. Commonly used skill measures like biases (mean errors), RMSEs (accuracy), and Spearman rank correlations (predictability) were used to verify the forecasts. Spearman rank correlations arguably are more appropriate than Pearson correlations when normal distributions cannot be assumed (as is the case for daily precipitation). Other skill measures specific to probabilistic forecasts were also evaluated: reliability, in the sense that the forecast ensemble spread was representative of the observed climatology, and predictability, in the sense of reproduction of the observed cumulative distribution function by the ensemble. The ensemble spread is defined here as the range of the ensemble member values (i.e., the difference between the maximum and minimum ensemble member values). A small or narrow (large) ensemble spread indicates low (large) uncertainty in the forecasts, as all member values are close to (far from) each other.

Forecast discrimination was also evaluated, that is, whether the forecast probability density functions differ among different categories of observed events; unlike the metrics above, this evaluation is conditioned on the observed event categories. This is an important distinction, as the forecast performance metrics noted above were evaluated with conditioning on interpolated forecast precipitation amounts. Figure 2 shows statistics as a function of lead time for the downscaled daily forecasts at 0.25° spatial resolution for all calibration methods. Figure 3 shows the same statistics but for the daily downscaled forecasts at 0.25° resolution aggregated to 1°. Figure 4 shows equivalent statistics for the 5-day accumulation of daily 0.25° downscaled forecasts (1–5- and 6–10-day accumulations made on day 1). The 5-day accumulation precipitation forecast is important for large basins that have a hydrological response time longer than a few days. In Fig. 3 and subsequent figures, the analog RMSD and BC_SD methods are not shown for clarity because they did not perform as well as the others and were therefore excluded from further analysis.

### a. Biases and RMSEs

The first row in Figs. 2–4 shows the mean of the forecasts conditioned on benchmark forecast categories (all forecasts, wet forecasts, and upper tercile of wet forecasts) for each downscaling method. On average over all days and all grid cells (all forecasts category), the raw (interpolated) forecasts tend to be too wet for all categories. All six calibration methods tend to have drier forecasts than the interpolation technique for all categories and all lead times, but less so for the analog RMSD method. As shown in Table 1, the analog RMSD method has a higher rate of default to the interpolation method (which occurs when no satisfactory analog is found) than the RMSDmean method, with a rate that increases with lead time. When needed, the interpolation default is used mostly (>95% for all lead times) for wet forecasts. It artificially increases the analog RMSD means, which become closer to those for the interpolation method. For this reason and the implied remaining bias, the analog RMSD method was not further analyzed.

Row 2 in Figs. 2–4 shows the biases for the different lead times and for the different methods. See section b of appendix C for their definitions. All six calibration methods reduce the biases of the interpolated forecasts for all forecast categories and spatial scales, with the exception of the INT_BC method at 1° for the upper wet tercile forecast category. By design, INT_BC is expected to be less competitive at the 1° scale. Biases are very similar for different spatial scales but are slightly lower for the upper wet tercile category for BC_SD when aggregated to 1°. The bias correction in general did not perform as well as the analog method for the upper wet tercile forecast category. The analog rank method performed the poorest of the analog methods. Based on the mean errors, the most competitive calibration methods (Fig. 3, 1° scale) were the analog RMSDmean and BC_INT methods. The best-performing calibration and downscaling methods for different time resolutions (Figs. 2 and 4) were the BC_INT and analog RMSDmean methods.

Row 3 in Figs. 2–4 shows the RMSEs (section c in appendix C). For wet forecast categories, and in particular the upper wet tercile forecast category, the RMSEs decrease with increasing lead time because the means (first row) decrease with lead time; that is, relative RMSEs increase with lead time. The BC_SD method has higher RMSEs than the interpolation technique for daily and 5-day accumulated forecasts at 0.25° spatial resolution (but not at 1°). Our interpretation is that the spatial disaggregation (resampling and scaling) at a daily time scale can lead to very disparate precipitation amounts among the 0.25° cells within each 1° cell (i.e., very large amounts in some 0.25° cells adjacent to near-zero 0.25° cells). This also leads to a drop in correlation (row 5, Fig. 2). For these reasons, we did not pursue the analysis of the BC_SD method. In terms of calibration (1° scale; see Fig. 3), no method clearly reduced the interpolated forecast RMSE, although for wet forecast categories the BC_INT, analog rank, and RMSDmean methods tend to maintain or reduce it. Note that the analog RMSDmean results for shorter lead times are close to the interpolation method for the upper tercile wet forecast category because the interpolation was sometimes substituted when no satisfactory analog was found (Table 1). When combining calibration and downscaling, both the BC_INT and analog RMSDmean methods reduced the RMSEs for the different temporal resolutions.
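For concreteness, the conditional bias and RMSE measures defined in appendix C can be sketched as follows. This is an illustrative helper with hypothetical names and toy numbers, not the code used in this study; the category mask stands in for the benchmark forecast categories (e.g., wet forecasts).

```python
import numpy as np

def bias_and_rmse(fcst_mean, obs, mask=None):
    """Mean error (bias) and RMSE of ensemble-mean forecasts vs. observations,
    optionally restricted to a forecast category via a boolean mask."""
    if mask is not None:
        fcst_mean, obs = fcst_mean[mask], obs[mask]
    err = fcst_mean - obs
    return err.mean(), np.sqrt((err ** 2).mean())

# Pooled daily values over all grid cells (toy numbers).
fcst = np.array([0.0, 1.2, 3.5, 0.3, 7.1])
obs = np.array([0.0, 1.0, 2.5, 0.0, 8.0])
wet = fcst > 0.254          # a "wet forecast" category threshold (mm)
bias_all, rmse_all = bias_and_rmse(fcst, obs)
bias_wet, rmse_wet = bias_and_rmse(fcst, obs, mask=wet)
```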

### b. Reliability and predictability

Row 6 in Figs. 2–4 shows Talagrand diagrams, or rank histograms (Hamill and Colucci 1997; Talagrand and Vautard 1997; see also section c of appendix C). They show the rank of the observation relative to the forecast ensemble members. The rank here is divided by the number of ensemble members plus one to express it as a percentile. A high frequency of observations at rank 0% (100%) means that the ensemble forecast in this category systematically overpredicts (underpredicts) the observation. Assuming that each ensemble member is equally likely to represent the observation, perfect reliability is reached when the observation falls with equal frequency at each rank; that is, the rank histogram is flat. All of the calibration methods improve on the interpolation in this respect, but not to the same extent. Both BC_INT and INT_BC barely improve the reliability and still have U-shaped Talagrand diagrams, implying that the ensemble spread of the forecasts is too small [Wood and Schaake (2008) also illustrate this problem in the BC approach]. All analog techniques improve on the interpolation and are close to perfect reliability for all lead times and for all spatial and temporal resolutions. As shown below, this is achieved in part by increasing the ensemble spread, which in turn decreases the resolution and the continuous rank probability skill score (CRPSS), described below.
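As an illustration of how a rank histogram is assembled, the following minimal sketch (hypothetical names, not the verification code used here) bins each observation by the number of ensemble members it exceeds; ties and the percentile scaling described above are omitted.

```python
import numpy as np

def rank_histogram(ens, obs):
    """Talagrand diagram: rank of each observation within its ensemble.

    ens: (n_forecasts, n_members) array of ensemble forecasts.
    obs: (n_forecasts,) array of verifying observations.
    Returns counts of observation ranks 0..n_members (n_members + 1 bins);
    a flat histogram indicates a reliable ensemble spread.
    """
    # Rank = number of members strictly below the observation.
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

ens = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.0, 3.0]])
obs = np.array([0.5, 2.5])   # below all members; between 2nd and 3rd
counts = rank_histogram(ens, obs)
```

A large count in `counts[0]` (observation below every member) corresponds to the systematic overprediction described above.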

The continuous rank probability score (CRPS; Hersbach 2000; Wilks 2006; see also section f in appendix C) measures “the difference between the predicted and the occurred cumulative distribution function” (Hersbach 2000). The CRPS is related to the rank histogram (reliability), but unlike the histogram, it is sensitive to the width of the ensemble spread and to the number of outliers (cases in which the ensemble spread does not include the observation) (Hersbach 2000). In this sense, it is also a measure of reliability, resolution, and predictability, because the CRPS is penalized (increases) when the ensemble spread is consistently too large, when the forecasts differ little across different sets of observed events, and when the correlation decreases. The CRPSS is equal to one minus the ratio of the forecast CRPS to the climatology CRPS [section f in appendix C, Eq. (C11)]. A CRPSS near 1 indicates a very large improvement relative to climatology, a score of 0 means no improvement, and a negative score means worse than climatology. Row 4 in Figs. 2–4 shows the CRPSS values for the various forecasts. The CRPSS results are linked to the bias and reliability results. Only the bias correction step (i.e., BC_INT at 1° and INT_BC at 0.25°) improves the raw CRPSS, because it reduces the bias without widening the (narrow) ensemble spread. The CRPSS of the analog methods is lower than that of the interpolated forecasts, but for 5-day accumulated 0.25° forecasts the improvement in bias tends to balance the increased ensemble spread. Note that the CRPSS values are high for all methods; these high scores are attributable to the fact that the ensemble spread of the reference climatology is quite large. Figure 5 shows the dependence of the CRPS on the ensemble spread (range) for the interpolation, analog RMSDmean, and INT_BC methods and for the climatology; lower CRPS values indicate better skill.
Figure 5 shows that both the analog and INT_BC techniques increase the ensemble spread and reduce the number of outliers, and that there is a direct relationship between the CRPS and the ensemble spread: the larger the ensemble spread, the larger the CRPS, although the slope varies with the calibration and downscaling method. Even though fewer outliers are a desirable characteristic, the lower CRPSS values of the analog methods relative to the interpolation method might suggest a loss of resolution; as seen in the next paragraph, however, this is due to a loss of correlation.
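For reference, the ensemble CRPS can be written in the kernel form CRPS = E|X − y| − ½E|X − X′| over ensemble members X, X′, which is equivalent to the integral over squared CDF differences decomposed by Hersbach (2000). The sketch below (hypothetical helper names, not the paper's implementation) evaluates this form directly:

```python
import numpy as np

def crps_ensemble(ens, y):
    """CRPS of a single ensemble forecast against observation y,
    via the kernel identity CRPS = E|X - y| - 0.5 E|X - X'|."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - y).mean()
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean()
    return term1 - term2

def crpss(crps_fcst, crps_clim):
    """Skill vs. climatology: 1 = perfect, 0 = no improvement, < 0 = worse."""
    return 1.0 - crps_fcst / crps_clim
```

For a single-member "ensemble" the CRPS reduces to the absolute error, and widening the spread around the observation raises the score, which mirrors the spread dependence seen in Fig. 5.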

Predictability can also be evaluated using the Spearman rank correlation between predictions and observations (section d in appendix C). Row 5 in Figs. 2–4 shows the correlations. Overall, the correlation for all methods is highest for the all-forecasts category, and it decreases with lead time and for the wet and upper tercile categories. As expected, correlations are slightly larger at 1° resolution than at 0.25°. By design, the bias correction does not affect the Spearman (rank) correlation. The minimal differences between the interpolated forecasts and the BC_INT and INT_BC techniques are due to the reduction in the number of ensemble members; the correlation is otherwise maintained. For daily forecasts at 0.25° and 1° resolutions, the analog techniques have lower correlations with observations than does the interpolation technique. For 5-day accumulation forecasts, the analog RMSDmean correlation is very close to that of the interpolation method. The analog RMSDmean method generally has the higher correlation of the two analog methods.
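The insensitivity of the rank correlation to quantile-based bias correction follows from the fact that any strictly increasing transform preserves ranks. A small sketch (hypothetical helper; no ties assumed, which suffices for the illustration):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))   # 0-based ranks (no ties assumed)
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

fcst_mean = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
obs = np.array([2.0, 4.0, 1.0, 8.0, 6.0])
rho = spearman(fcst_mean, obs)

# A quantile-mapping bias correction is monotone, so it preserves ranks
# and leaves the Spearman correlation unchanged:
rho_bc = spearman(fcst_mean ** 2 + 1.0, obs)
```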

### c. Discrimination

The relative operating characteristic (ROC) diagram is a measure of discrimination. Discrimination is the ability of a forecast (e.g., conditional average forecast) to discriminate among different observed events (e.g., observation categories). It is therefore conditioned on the observations (as opposed to conditioning on the forecasts for the other skill measures). The ROC diagram plots the hit rate (or detection rate: the number of times the forecast expects an event and it happens, divided by the number of times the event occurs) as a function of the false alarm rate (the number of times the forecast expects an event and it does not occur, divided by the number of times the event does not occur). ROC plots are not shown: there was very little difference among the methods, and in particular, none was much better than the interpolated forecasts. Thus, the discrimination skill of the forecasts was not improved by any downscaling method but, on the other hand, was not degraded relative to forecast interpolation.
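The hit and false alarm rates defining one point of the ROC curve can be sketched as follows (hypothetical helper names and toy data; an event is "warned" when the forecast probability exceeds a chosen threshold, and sweeping the threshold traces out the curve):

```python
import numpy as np

def hit_and_false_alarm_rates(fcst_prob, obs_event, threshold):
    """One ROC point: hit rate = hits / observed events,
    false alarm rate = false alarms / observed non-events."""
    warn = fcst_prob >= threshold
    hit_rate = (warn & obs_event).sum() / obs_event.sum()
    false_alarm_rate = (warn & ~obs_event).sum() / (~obs_event).sum()
    return hit_rate, false_alarm_rate

# Toy example: ensemble-derived event probabilities and observed occurrences.
prob = np.array([0.9, 0.8, 0.2, 0.6, 0.1])
occurred = np.array([True, False, False, True, False])
hr, far = hit_and_false_alarm_rates(prob, occurred, 0.5)
```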

### d. Spatial coherence

Figure 6 shows snapshots of forecasts for 17 September 2004, the date with the largest observed mean areal precipitation during the 2002–06 period, and 5 January 2004, which corresponds to the 90th percentile of the wet 2002–06 daily Ohio River basin precipitation events (at least 0.254 mm over the basin). The second snapshot is intended to show the reconstructed precipitation patterns on a day with significant (but not extreme) precipitation (upper tercile of the wet forecast) so as to evaluate the analog methods without the effects of interpolation (section 3b). For both dates, the first row shows the observed TMPA precipitation. The second row shows the daily ensemble mean forecast for 1-day lead time downscaled by using different methods. The third row shows one daily ensemble forecast member. The figures show that the precipitation patterns for the interpolation and BC_INT methods are unrealistically smooth, although there is spatial coherence (all cells are spatially correlated and the precipitation pattern appears to be realistic). INT_BC precipitation patterns are more realistic than the interpolation method, that is, smoothed but with more variability in the magnitudes. The analog techniques generally produce more realistic precipitation patterns than the benchmark for both ensemble mean and individual ensemble forecast members. For the most extreme events, the analog RMSDmean precipitation patterns look very much like the interpolation method for reasons discussed above.

### e. Sensitivity to the size of the spatial window for the analog techniques

A sensitivity analysis was performed for the analog methods, in which the change in biases and forecast accuracy was evaluated for different spatial windows: 3° × 3° and 7° × 7° windows (9 and 49 forecast points, respectively) were tested in place of the 5° × 5° window (25 points) (see section g in appendix C). A similar sensitivity analysis was performed for the interpolation method, using different search radii (equivalent to the spatial moving windows used for the analog methods) to select the 1° cells from which to interpolate to 0.25°. For the interpolation method, means in the wet forecast categories decrease with increasing search radius (number of neighbors). Most of the skill improvements due to calibration and downscaling remained essentially unchanged, as did the overall conclusions based on the 5° × 5° window results.

To this point, our results suggest that, based on the biases and RMSE results, the analog RMSDmean and the BC_INT methods improved the skill of the interpolated forecasts the most across temporal and spatial scales. The BC_INT method maintained predictability (correlation) whereas the analog RMSDmean technique maintained it only for a 5-day accumulation forecast and decreased it slightly at a daily time scale. The BC_INT approach did not improve on the ensemble reliability whereas the analog methods improved it to be nearly ideal (uniform rank histogram). Finally, the analog RMSDmean precipitation patterns are more realistic than the smoothed patterns generated by the BC_INT method.

### f. Seasonality

The equivalent forecast verification for daily 0.25° forecasts with respect to TMPA was evaluated for the December–February (DJF) and June–August (JJA) periods. As expected, correlations for all methods were higher in winter (dominantly stratiform precipitation systems) than in summer (convective storms). CRPSSs were also higher for all methods. However, comparison results among the methods were consistent with previous findings.

### g. Comparison using gridded station data

All approaches implicitly assume that the forecasts are in some sense related to (e.g., correlated with) the observations. Clearly, this is determined in part by forecast skill; however, observation error can also potentially affect the performance of the methods. In this section, we evaluate the performance of the INT_BC and RMSDmean methods using the extended Maurer et al. (2002) gridded station dataset rather than the TMPA data. Figure 7 shows the same analysis of the daily forecasts for different lead times at the 0.25° spatial resolution as in Fig. 2, using either TMPA or the Maurer et al. (2002) data for both resampling and evaluation purposes. The INT_BC method was favored (rather than BC_INT) because the INT_BC technique was expected to outperform BC_INT at the finer spatial scale, which it did when using the Maurer et al. dataset in place of TMPA. Our interpretation is that the BC at 0.25° performs better because of the higher predicted–observed correlation at 0.25° with the Maurer et al. observations than with TMPA (Fig. 7, last row). Overall, when using the gridded station dataset, calibrated biases and RMSEs were smaller and correlations were higher than in the previous analysis using TMPA; that is, the levels of accuracy and predictability skill were higher when using the gridded station dataset. The Talagrand diagrams (not shown) showed improved reliability for the interpolated forecast (benchmark); the benchmark CRPSS values were accordingly lower. As with TMPA, the INT_BC method did not improve the interpolated forecast reliability but the analog RMSDmean approach did. The analog RMSDmean method improved the benchmark CRPSS, which it had not done with TMPA; the INT_BC technique improved it further. When using the gridded station dataset, the overall skill of the INT_BC method was higher, as was that of the analog RMSDmean approach.
When using the gridded station data, the analog RMSDmean method further reduced biases and RMSEs, and even improved the predictability of the interpolated forecast in the upper wet tercile forecast category. The analog techniques still had better skill than the BCSD approaches.

## 5. Discussion

The performance of the adaptation of BC_SD described in section 3a and implemented independently for each 1° cell was somewhat disappointing. The BC_SD scheme showed no consistent improvement in RMSEs or predictability relative to simple spatial interpolation for the all-forecasts category, at any lead time. The reliability was slightly improved relative to simple interpolation but remained poor, as shown by the U-shaped rank histograms. The unsatisfactory RMSE results are mostly due to the spatial disaggregation step, not the bias correction, as shown by the BC_INT and INT_BC results. Furthermore, on a visual basis, it was clear that the adaptation of the spatial disaggregation resulted in spotty and unrealistic precipitation patterns for the most extreme events and tended to amplify RMSEs (not shown). The spatial disaggregation was an experiment intended to create more complex precipitation patterns than those obtained with spatial interpolation, as a complement to the BC step. It showed how difficult it can be to calibrate different skills (spatial coherence and reliability versus accuracy) independently in successive steps.

The issue of intermittency was handled within the bias correction step for simplicity (global approach), whereas other calibration and downscaling methods (e.g., Sloughter et al. 2007; Hamill et al. 2004) deal with intermittency in a separate first step (wet forecasts are calibrated independently). As a result, the forecast skill and biases of BC_INT and INT_BC for the all-forecasts category were competitive with the analog methods, if not better. Their performance for wet categories was competitive, but it was worse for the upper wet tercile forecast category. The INT_BC performance was worse for wet forecasts because the derived interpolated forecast cumulative distribution function (CDF) has nonzero precipitation at higher probabilities of exceedance than the observed CDF. Because of this one-step BC, the calibrated and downscaled forecasts, as evaluated in the next paragraph, were also less competitive for wet forecasts in general. Furthermore, the BC component was not able to improve the reliability of the probabilistic forecasts.

The three variations of the analog technique performed similarly overall. Biases and RMSEs were improved for all of them relative to spatially interpolated forecasts. The largest improvement for the analog methods relative to the interpolated forecasts was in forecast ensemble reliability. However, this came at the expense of predictability: correlations and CRPSSs were generally not improved relative to interpolated forecasts, although the analog RMSDmean method showed an improved CRPSS and maintained its correlation for the 5-day accumulation forecast in the upper wet tercile category. Downscaled precipitation patterns were generally more realistic than interpolated forecasts except when interpolation was substituted because no satisfactory analog could be found (extreme events); in those cases, the patterns for the RMSDmean method were equivalent to the interpolation method.

Simultaneously improving the ensemble spread reliability and the CRPS when calibrating forecasts seems difficult; there is a relationship between the CRPS (predictability–resolution–reliability) and the ensemble spread (linked to reliability), and these relationships were different for different methods (section 4b).

An experiment using a gridded station dataset showed better skill in general (accuracy, reliability, and predictability), in particular, the INT_BC approach, probably because the observation errors were smaller than for the satellite (TMPA) data. However, most differences among methods were consistent with the analysis based on the TMPA satellite data.

The downscaling approaches presented here rely on a good correlation between the raw forecasts and the observations. They are unlikely to improve the (downscaled) predicted–observed correlations; however, the downscaling errors are smaller when the observation–raw forecast correlation is higher, as confirmed by using gridded station data in place of the satellite data over the Ohio River basin, which has a relatively dense gauge network.

We note that in the methods we tested, the raw ensemble forecast values are used to downscale and calibrate individual forecast members (an exception is the analog RMSDmean method). This implies that the raw forecast controls the ensemble uncertainties. Other methods such as the EPP or MOS used in Clark and Hay (2004) downscale and calibrate a single-value forecast and reconstruct the ensemble in a subsequent step. Those methods allow independent control of the ensemble spread, and as such, they may have better CRPS values. On the other hand, we expressed earlier some concerns on how those methods may be adapted for global application (see section 1).

Our analysis evaluated the skill of the calibration and downscaling methods by comparing the performance of the calibrated forecasts for forecast categories based on the interpolated forecasts. Figure 8 shows the same forecast verification for daily forecasts at 0.25° as in Fig. 2, but for forecast categories based on each individual set of calibrated forecasts. This implies that potentially different events are compared. This perspective allows us to evaluate the calibrated forecasts (not only the calibration) from an end-user point of view. Our conclusions are consistent with the previous analysis in the sense that the analog RMSDmean method is preferred. The issue of precipitation intermittency within the bias correction step is more pronounced here for the INT_BC and BC_INT techniques. HW2006 favored the analog rank method in terms of reliability. Consistent with this, we found smaller biases and RMSEs and slightly better reliability for wet forecast categories at the finest spatial scale (especially when using a smaller spatial window; not shown), but at the cost of the largest loss in predictability. Moreover, the analog RMSDmean relative biases and RMSEs were better than those of the analog rank method for the wet forecast categories (not shown), and in particular its predictability (correlation) was better.

## 6. Conclusions

The objective of this paper was to adapt existing calibration and downscaling methods for medium-range (up to 10 days) probabilistic quantitative precipitation forecasts, for application to large river basins in parts of the world where in situ observations are sparse (hence the attraction of satellite precipitation datasets). Two approaches were tested: three variations of the bias correction and statistical downscaling (BCSD) method of Wood et al. (2002) and three variations of the analog method of Hamill and Whitaker (2006). The BCSD approach was adapted to shorter temporal scales and larger domains than those to which it had been applied previously. The three variations of the analog technique (analog RMSD, RMSDmean, and rank) were adapted for quantitative probabilistic forecasts. Based on the results of this analysis, the analog RMSDmean method is preferred for flood forecasting applications. In particular, (a) downscaled precipitation patterns for the analog methods were more realistic in their spatial characteristics than interpolated forecasts (which had an overly smooth spatial character); (b) the analog RMSDmean method reduced the RMSEs and biases more than the other calibration methods (equaled only by BC_INT) in all forecast categories, and the accuracy of the forecasts calibrated with the RMSDmean method was better than that of the forecasts calibrated with the BC_INT method; (c) the reliability of the analog methods was considerably higher than that of the interpolated forecasts and the BCSD approaches; and (d) the analog methods decreased the predictability while the BC_INT and INT_BC techniques maintained it. The analog RMSDmean method had the best predictability of the analog methods when using the TMPA data, and even improved the predictability of the forecasts when the gridded station dataset was used.

## Acknowledgments

The research reported herein was supported in part by Grants NA070AR4310210 and NA08OAR4320899 from NOAA’s Climate Prediction Program for the Americas. The authors wish to thank ECMWF for providing access to their precipitation analyses and EPS forecast datasets, and Philippe Bougeault, Roberto Buizza, and Florian Pappenberger of ECMWF for their assistance. Thanks are also due to Qiuhong Tang of the University of Washington for extending the Maurer et al. (2002) dataset. We thank an anonymous reviewer, whose comments led to additional analyses that we believe strengthened the paper.

## REFERENCES

Asante, K. O., R. M. Dezanove, G. Artan, R. Lietzow, and J. Verdin, 2007: Developing a flood monitoring system from remotely sensed data for the Limpopo Basin. *IEEE Trans. Geosci. Remote Sens.*, **45**, 1709–1714.

Berrocal, V. J., A. E. Raftery, and T. Gneiting, 2008: Probabilistic quantitative precipitation field forecasting using a two-stage spatial model. *Ann. Appl. Stat.*, **2**, 1170–1193.

Clark, M. P., and L. E. Hay, 2004: Use of medium-range numerical weather prediction model output to produce forecasts of streamflow. *J. Hydrometeor.*, **5**, 15–32.

Clark, M. P., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. *J. Hydrometeor.*, **5**, 243–262.

Demargne, J., and Coauthors, 2009: Towards standard verification strategies for operational hydrologic forecasts. National Weather Service Hydrologic Forecast Verification Team Final Rep., 65 pp. [Available online at http://www.nws.noaa.gov/oh/rfcdev/docs/NWS-Hydrologic-Forecast-Verification-Team_Final-report_Sep09.pdf].

Franz, K. J., H. C. Hartmann, S. Sorooshian, and R. Bales, 2003: Verification of National Weather Service ensemble streamflow predictions for water supply forecasting in the Colorado River basin. *J. Hydrometeor.*, **4**, 1105–1118.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1203–1211.

Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327.

Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. *Mon. Wea. Rev.*, **134**, 3209–3229.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble re-forecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Hong, Y., R. F. Adler, F. Hossain, S. Curtis, and G. J. Huffman, 2007: A first approach to global runoff simulation using satellite rainfall estimation. *Water Resour. Res.*, **43**, W08502, doi:10.1029/2006WR005739.

Hong, Y., R. Adler, and G. J. Huffman, 2009: Applications of TRMM-based multi-satellite precipitation estimation for global runoff simulation: Prototyping a global flood monitoring system. *Satellite Rainfall Applications to Surface Hydrology*, M. Gebremichael and F. Hossain, Eds., 245–265.

Hopson, T. M., and P. J. Webster, 2010: A 1–10-day ensemble forecasting scheme for the major river basins of Bangladesh: Forecasting severe floods of 2003–07. *J. Hydrometeor.*, **11**, 618–641.

Hossain, F., and N. Katiyar, 2006: Improving flood forecasting in international river basins. *Eos, Trans. Amer. Geophys. Union*, **87**, 49.

Huffman, G. J., and Coauthors, 2007: The TRMM Multisatellite Precipitation Analysis (TMPA): Quasi-global, multiyear, combined-sensor precipitation estimates at fine scales. *J. Hydrometeor.*, **8**, 38–55.

Joyce, R. J., J. E. Janowiak, P. A. Arkin, and P. Xie, 2004: CMORPH: A method that produces global precipitation estimates from passive microwave and infrared data at high spatial and temporal resolution. *J. Hydrometeor.*, **5**, 487–503.

Krzysztofowicz, R., 1998: Probabilistic hydrometeorological forecasts: Toward a new era in operational forecasting. *Bull. Amer. Meteor. Soc.*, **79**, 243–251.

Maurer, E. P., A. W. Wood, J. C. Adam, and D. P. Lettenmaier, 2002: A long-term hydrologically based dataset of land surface fluxes and states for the conterminous United States. *J. Climate*, **15**, 3237–3251.

Panofsky, H. A., and G. W. Brier, 1968: *Some Applications of Statistics to Meteorology*. The Pennsylvania State University, 224 pp.

Pielke, R. A., Jr., M. W. Downton, and J. Z. Barnard Miller, 2002: Flood damage in the United States, 1926–2000: A reanalysis of National Weather Service estimates. UCAR, 86 pp. [Available online at www.flooddamagedata.org].

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Schaake, J., and Coauthors, 2007: Precipitation and temperature ensemble forecasts from single-value forecasts. *Hydrol. Earth Syst. Sci. Discuss.*, **4**, 655–717.

Seo, D.-J., 1998a: Real-time estimation of rainfall fields using rain gage data under fractional coverage conditions. *J. Hydrol.*, **208**, 25–36.

Seo, D.-J., 1998b: Real-time estimation of rainfall fields using radar rainfall and rain gage data. *J. Hydrol.*, **208**, 37–52.

Seo, D.-J., S. Perica, E. Welles, and J. C. Schaake, 2000: Simulation of precipitation fields from probabilistic quantitative forecast. *J. Hydrol.*, **239**, 203–229.

Shepard, D. S., 1984: Computer mapping: The SYMAP interpolation algorithm. *Spatial Statistics and Models*, G. L. Gaile and C. J. Willmott, Eds., D. Reidel, 133–145.

Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 3209–3220.

Su, F., Y. Hong, and D. P. Lettenmaier, 2008: Evaluation of TRMM Multisatellite Precipitation Analysis (TMPA) and its utility in hydrologic prediction in La Plata Basin. *J. Hydrometeor.*, **9**, 622–640.

Talagrand, O., and R. Vautard, 1997: Evaluation of probabilistic prediction systems. *Proc. ECMWF Workshop on Predictability*, Reading, United Kingdom, ECMWF, 1–25.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. Academic Press, 627 pp.

Wood, A. W., and D. P. Lettenmaier, 2006: A testbed for new seasonal hydrologic forecasting approaches in the western United States. *Bull. Amer. Meteor. Soc.*, **87**, 1699–1712.

Wood, A. W., and J. C. Schaake, 2008: Correcting errors in streamflow forecast ensemble mean and spread. *J. Hydrometeor.*, **9**, 132–148.

Wood, A. W., E. P. Maurer, A. Kumar, and D. P. Lettenmaier, 2002: Long-range experimental hydrologic forecasting for the eastern United States. *J. Geophys. Res.*, **107**, 4429, doi:10.1029/2001JD000659.

Yilmaz, K., R. F. Adler, Y. Tian, Y. Hong, and H. F. Pierce, 2010: Evaluation of a satellite-based global flood monitoring system. *Int. J. Remote Sens.*, **31**, 3763–3782.

## APPENDIX A

### Bias Correction and Statistical Downscaling (BCSD) Approach

#### a. Bias correction (quantile-based mapping technique)

Bias correction is achieved by replacing each ensemble member value of daily precipitation with observed values having the same percentiles (nonexceedance probabilities) with respect to the observed climatology (TMPA), for a given lead time. Bias correction is applied either at the ECMWF EPS forecast spatial scale (1°, BC_INT and BCSD) or at the scale of the hydrology model (0.25°, INT_BC). Each grid cell (fifty-one 1°, or eight hundred forty-eight 0.25° within the Ohio River basin) is treated individually.

For each grid cell and each lead time in the set of 10-day forecasts (1826 in total) in the 2002–06 period, the procedure is as follows.

##### 1) Derive the forecast daily cumulative distribution function (CDF, nonexceedance probability)

To save computational time, the CDF was derived for monthly windows, resulting in 12 rather than 366 CDFs per grid cell per lead time. For each monthly window, the CDF was derived using all 51 ensemble members of the ECMWF EPS forecasts over the 2002–06 period, that is, (51 members × ∼30 days × 5 yr) values to be ranked.

##### 2) Derive the TMPA daily CDF

The 0.25° 2002–06 daily TMPA precipitation datasets were aggregated to a 1° spatial resolution (BC_INT and BCSD). The 0.25° and 1° TMPA daily CDFs were derived for each day (366 CDFs per grid cell) using a centered 61-day moving window, that is, (61 days × 5 yr) values in each CDF. The 61-day moving window approximately balances the larger number of values in the forecast CDFs.

##### 3) Quantile mapping

We computed the quantile (*Q*) of the daily precipitation forecast member in the corresponding forecast CDF (appropriate monthly window and lead time for that grid cell) and substituted for it the observed TMPA value with the same quantile in the corresponding TMPA daily CDF (CDF for that day, lead time, and grid cell):

BCFcst = CDFobs^{−1}[CDFfcst(Fcst)],

where BCFcst is the bias-corrected forecast value, Fcst is the forecast value, CDFobs is the TMPA CDF, CDFfcst is the forecast CDF, and *Q* = CDFfcst(Fcst) is the quantile of the forecast value in the forecast CDF.
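A minimal empirical sketch of this quantile mapping follows (hypothetical helper, not the operational code). It uses Weibull plotting positions i/(n + 1) for both climatologies and simply clamps the tails; the distribution fits of step 4 handle the tails in the actual procedure.

```python
import numpy as np

def quantile_map(fcst_value, fcst_climo, obs_climo):
    """BCFcst = CDFobs^{-1}[CDFfcst(Fcst)] with empirical CDFs built from
    Weibull plotting positions i/(n + 1); tails are clamped here."""
    fcst_climo = np.sort(np.asarray(fcst_climo, dtype=float))
    obs_climo = np.sort(np.asarray(obs_climo, dtype=float))
    # Nonexceedance probability of the forecast value in the forecast climatology.
    q = np.searchsorted(fcst_climo, fcst_value, side="right") / (fcst_climo.size + 1.0)
    # Invert the observed CDF at the same quantile.
    probs = np.arange(1, obs_climo.size + 1) / (obs_climo.size + 1.0)
    return float(np.interp(q, probs, obs_climo))

# Observed climatology exactly twice the forecast climatology,
# so a forecast of 5.0 maps to 10.0.
bc = quantile_map(5.0, np.arange(1.0, 10.0), 2.0 * np.arange(1.0, 10.0))
```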

##### 4) Distribution fitting for extreme daily precipitation forecasts

When the forecast quantiles were smaller than the minimum or larger than the maximum TMPA empirical Weibull quantiles (equal to 1/(*n* + 1) and *n*/(*n* + 1), where *n* is the number of events from which the probability distribution is estimated, i.e., 61 days × 5 yr), the values were obtained by fitting the TMPA daily climatology with an extreme value type I (Gumbel) distribution for high precipitation values, and a Weibull distribution with a zero lower boundary for small nonzero precipitation values, following Wood et al. (2002).
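As one illustration of the tail extrapolation, the sketch below fits a Gumbel (EV type I) distribution by the method of moments and inverts its CDF beyond the largest empirical quantile. The fitting procedure is an assumption for illustration only; the text above does not specify how the distributions were fit.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def gumbel_fit_moments(sample):
    """Method-of-moments fit of an EV type I (Gumbel) distribution."""
    beta = np.sqrt(6.0) * np.std(sample) / np.pi
    mu = np.mean(sample) - EULER_GAMMA * beta
    return mu, beta

def gumbel_ppf(q, mu, beta):
    """Inverse Gumbel CDF: the value whose nonexceedance probability is q,
    used to extrapolate beyond the largest empirical quantile."""
    return mu - beta * np.log(-np.log(q))
```

A forecast quantile larger than *n*/(*n* + 1) would then be passed to `gumbel_ppf` with parameters fit to the TMPA daily climatology; small nonzero values would analogously use a fitted Weibull distribution.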

##### 5) Bias correction: Daily precipitation intermittency correction

#### b. Spatial disaggregation in BCSD

The spatial disaggregation aims to give a realistic 0.25° spatial pattern while preserving the 1° bias-corrected forecast value in the BCSD variation. The spatial distribution is taken from the 0.25° TMPA climatology.

For each 1° grid cell and each lead time of the 1826 daily forecasts during 2002–06, the following steps are taken:

- (i) A total of 15 independent days were selected from the daily 1° aggregations of the 2002–06 TMPA precipitation dataset, with the following criteria: the selected dates are in the same month as the forecast, and there must be as many selected days with positive observed precipitation (at least 1 mm over the 1° cell) as there are bias-corrected forecast members with positive precipitation (zero threshold). Let Pobs_{1°} be the 1° TMPA precipitation values on those selected dates. Those values are ranked by magnitude from 1 to 15.
- (ii) The 15 bias-corrected ensemble forecast members BCFcst_{1°} are similarly ranked from 1 to 15.
- (iii) Each ranked BCFcst_{1°} is associated with the Pobs_{1°} of the same rank, so that the ensemble member with the highest bias-corrected precipitation forecast value has a disaggregated precipitation pattern corresponding to the observation day with the highest precipitation [Schaake shuffle; Clark et al. (2004)].
- (iv) Finally, the precipitation pattern at 0.25° resolution is transposed onto the bias-corrected forecast value using the ratio of the 0.25° to the 1° value for each of the 16 0.25° cells in the 1° cell: $$\mathrm{BCFcst}_{0.25^{\circ}} = \mathrm{BCFcst}_{1^{\circ}} \times \frac{\mathrm{Pobs}_{0.25^{\circ}}}{\mathrm{Pobs}_{1^{\circ}}}.$$ Ensemble members with zero precipitation forecasts at 1° spatial resolution have zero precipitation for all constituent 0.25° grid cells.

To avoid the possibility of unrealistic values resulting from this rescaling, we constrained the results to be no larger than a threshold value (in this work, 350 mm day^{−1}). This case happens very rarely (less than 1% of the time).
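Steps (i)–(iv) and the capping of unrealistic values can be sketched for one 1° cell; the array shapes, the cap value, and the function name are illustrative.

```python
import numpy as np

def disaggregate(bc_fcst_1deg, obs_1deg, obs_patterns, cap=350.0):
    """Rank-match 1-deg members to selected observed days, then rescale the
    0.25-deg TMPA patterns by the member/observation ratio (steps i-iv)."""
    member_rank = np.argsort(np.argsort(bc_fcst_1deg))  # rank of each member
    date_by_rank = np.argsort(obs_1deg)                 # date index for each rank
    out = np.zeros_like(obs_patterns)
    for m in range(len(bc_fcst_1deg)):
        d = date_by_rank[member_rank[m]]  # observed day with the same rank
        if bc_fcst_1deg[m] <= 0.0 or obs_1deg[d] <= 0.0:
            continue  # zero-precipitation members stay zero at 0.25 deg
        # transpose the 0.25-deg pattern via the 0.25-deg to 1-deg ratio, capped
        out[m] = np.minimum(obs_patterns[d] * (bc_fcst_1deg[m] / obs_1deg[d]), cap)
    return out

# toy: 15 members, one 1-deg cell containing 4 x 4 quarter-degree cells
rng = np.random.default_rng(1)
obs_patterns = rng.gamma(2.0, 1.0, size=(15, 4, 4))   # 0.25-deg TMPA fields
obs_1deg = obs_patterns.mean(axis=(1, 2))             # their 1-deg aggregates
bc = rng.gamma(2.0, 1.5, size=15)                     # bias-corrected members
out = disaggregate(bc, obs_1deg, obs_patterns)
```

When the cap is not triggered, each member's 0.25° field averages back to its 1° bias-corrected value, which is the conservation property the method requires.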

#### c. Schaake shuffle

The Schaake shuffle reorders the 0.25° calibrated ensemble members to give the ensemble forecasts a spatiotemporal rank structure similar to that of the observed climatology. It does not change the BCSD values; it only reorders the ensemble members in space and time.

For the entire basin and the entire 10-day period of each of the 10-day forecasts in the dataset (1826 in total) during the 2002–06 period, the following steps are taken:

- (i) A total of 15 dates (and the subsequent 9 days) are selected in the ±45-day window around the day of the forecast in the 0.25° daily 2002–06 TMPA dataset.
- (ii) For each day in the 10-day period and each 0.25° grid cell, the 15 TMPA values (15 dates) are ranked by magnitude. Let rank_TMPA(*i*, *t*, *d*) be the rank (1–15) of the TMPA value for date *t*, for the 0.25° grid cell *i* (1–848 in the Ohio River basin) and lead time *d* (1–10 days; i.e., it corresponds to the date + *d* day in the TMPA climatology).
- (iii) For each lead time in the 10-day forecast and each 0.25° grid cell, each (calibrated and downscaled) ensemble member is ranked in magnitude. Let Fcst_member(*i*, rank, *d*) be the forecast member for grid cell *i* and lead time *d* with a specific rank (1–15).
- (iv) For each lead time in the 10-day forecast and each 0.25° grid cell, the 15 ensemble members are reordered (shuffled) in space and time so that the ranking in space and time of the ensemble members matches that of the 15 resampled observed 10-day TMPA fields: $$\mathrm{NewFcst}(i, t, d) = \mathrm{Fcst\_member}\big(i,\ \mathrm{rank\_TMPA}(i, t, d),\ d\big).$$ The new ensemble member *t* = 1 has the value of the ensemble forecast member that has the same rank as the observed value for date 1 (*t* = 1).
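For a single grid cell and lead time, the reordering in step (iv) can be sketched as follows; the names are illustrative, and the real procedure applies this jointly across all cells and lead times.

```python
import numpy as np

def schaake_shuffle(members, obs_traj):
    """Reorder ensemble members so their rank structure matches the observed
    dates: position t receives the member whose rank equals the rank of the
    observation on date t (no ties assumed in this sketch)."""
    sorted_members = np.sort(members)
    obs_ranks = np.argsort(np.argsort(obs_traj))  # rank of each date's value
    return sorted_members[obs_ranks]

# toy 5-member example (the paper uses 15 members and 15 dates)
members_demo = np.array([5.0, 1.0, 3.0, 2.0, 4.0])   # calibrated members
obs_demo = np.array([10.0, 30.0, 20.0, 50.0, 40.0])  # TMPA on selected dates
shuffled = schaake_shuffle(members_demo, obs_demo)
```

The member values are unchanged; only their assignment to dates is permuted, so the shuffled sequence rises and falls exactly where the observed trajectory does.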

Note that the Schaake shuffle cannot be applied at the 1° spatial resolution; it is therefore not possible to simplify the BCSD method's spatial disaggregation by reusing the same selected observed dates (section b of appendix A). A shuffling of members derived from 1° ensemble forecasts would not reproduce an observed 0.25° spatial structure, because the spatial distribution of ranks is not preserved across spatial scales.

## APPENDIX B

### Analog Technique

Let (*i*, *j*) be the forecast point for which the analogs are searched (where *i* and *j* are the longitude and latitude of one of the fifty-one 1° grid cells over the Ohio River basin). The RMSD over the moving window centered on (*i*, *j*) is the sum of the root-mean-square differences over the 25 forecast points:

$$\mathrm{RMSD}(i, j) = \sum_{(x,\,y)\,\in\,\mathrm{window}(i,\,j)} \sqrt{\left[\mathrm{Fcst}(x, y) - \mathrm{RetroFcst}(x, y)\right]^{2}}, \qquad (\mathrm{B1})$$

where Fcst(*x*, *y*) is the forecast at a point of the window and RetroFcst(*x*, *y*) is a forecast for the same forecast point but for a date within ±45 days of the day of the forecast in the 2002–06 forecast climatology (the archived forecasts when in near-real-time forecast mode), for the same lead time as Fcst. Moving 3° × 3° and 9° × 9° windows are made up of up to 9 (*i* ± 1; *j* ± 1) and 49 (*i* ± 3; *j* ± 3) forecast points, respectively.

For the RMSD analog method, one RMSD is computed for each of the 15 ensemble members, and the date with the smallest RMSD for a given member becomes the analog for that member at the centered point (*i*, *j*).

For the RMSDmean analog method, the ensemble mean forecast is first computed [Fcst(*x*, *y*) in Eq. (B1) is the ensemble mean forecast for that forecast point]. The 15 dates with the smallest RMSDs over the spatial window become the 15 analogs for the 15-member ensemble forecast at the forecast point (*i*, *j*).
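The RMSDmean selection can be sketched as follows, assuming Eq. (B1) behaves as a root-mean-square difference over the window points; the array shapes and function names are illustrative.

```python
import numpy as np

def rmsd(fcst_win, retro_win):
    """Root-mean-square difference between two windows of forecast points."""
    return float(np.sqrt(np.mean((fcst_win - retro_win) ** 2)))

def find_analogs(fcst_win, retro_wins, n_analogs=15):
    """Dates (indices) of the retrospective forecasts with the smallest RMSD,
    as in the RMSDmean selection (fcst_win is the ensemble-mean forecast)."""
    scores = np.array([rmsd(fcst_win, r) for r in retro_wins])
    return np.argsort(scores)[:n_analogs]

# toy: 90 candidate dates (+/-45 days x 1 yr) over a 3x3 window of 1-deg points
rng = np.random.default_rng(2)
retro = rng.gamma(1.0, 3.0, size=(90, 3, 3))
target = retro[7].copy()                 # make date 7 a perfect analog
best = find_analogs(target, retro)
```

The 15 retained dates then provide the 15 analog members for the forecast point.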

Each forecast value at the point (*i*, *j*) is associated with a rank in the forecast CDF for that point; this rank then replaces the forecast value in Eq. (B1).

## APPENDIX C

### Forecast Verification Measures

The forecast verification measures are first computed independently for each 0.25° grid cell (848 in the Ohio River basin) and are then averaged over the basin.

#### a. Definition of the forecast categories

In the results section, the forecast categories (defined below) are conditioned on the interpolated forecasts. Skill is measured with respect to the observations, but for each forecast category the improvement of the calibration and downscaling methods over the interpolated forecasts is evaluated on the same set of events (i.e., the events for which the interpolated forecast falls in a given category).

In section 5, Fig. 8 shows an ensemble forecast verification in which the forecast categories are based on each individual set of downscaled forecasts (i.e., not exactly the same set of events). From this perspective, the end-user value of the downscaled forecasts is evaluated for different downscaling methods. This approach complements the evaluation conditioned on the interpolated forecast categories and is necessary in order to evaluate the value of downscaled precipitation forecasts. It can be used in subsequent work for estimating the value of ensemble hydrological forecasts for potential real-time decisions.

Forecast categories (all forecasts, wet forecasts, and the upper tercile of the wet forecasts) are defined individually for each 0.25° grid cell and lead time. Let *N*(*d*, *i*) be the number of ensemble forecasts in each forecast category for lead time *d* and 0.25° grid cell *i* (*i* varies from 1 to 848). The forecast categories are defined as follows.

##### 1) All forecasts

##### 2) Wet forecasts

A forecast day *t* is counted as wet when the ensemble forecasts nonzero precipitation:

$$N(d, i) = \sum_{t} H\left[\sum_{m=1}^{15} \mathrm{Fcst}(d, m, i, t)\right],$$

where *H*(*x*) is the Heaviside function and *m* is one of the 15 ensemble members.

##### 3) Upper tercile of the wet forecasts

The wet ensemble mean forecasts are ranked by magnitude for each lead time *d* and grid cell *i* (CDF of the wet ensemble mean forecasts). Only the dates in the tercile with the largest magnitudes are considered.
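The three categories can be sketched as boolean masks over the forecast days of one grid cell and lead time; the helper below is illustrative (in particular, taking the tercile cut with `np.quantile` is an assumption about tie handling).

```python
import numpy as np

def category_masks(fcst_mean):
    """Boolean masks selecting the forecast days of each category for one
    0.25-deg grid cell and lead time."""
    all_mask = np.ones(fcst_mean.shape, dtype=bool)      # 1) all forecasts
    wet_mask = fcst_mean > 0.0                           # 2) wet forecasts
    wet_vals = fcst_mean[wet_mask]
    if wet_vals.size:
        cut = np.quantile(wet_vals, 2.0 / 3.0)           # 3) upper wet tercile
        upper_mask = fcst_mean > cut
    else:
        upper_mask = np.zeros(fcst_mean.shape, dtype=bool)
    return all_mask, wet_mask, upper_mask

# toy ensemble-mean series: 2 dry days, 6 wet days
fcst_mean = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
all_m, wet_m, upper_m = category_masks(fcst_mean)
```

The counts of `True` values in each mask correspond to the *N*(*d*, *i*) of the respective categories.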

#### b. Bias

The bias is the ratio of the mean ensemble mean forecast to the mean of the corresponding observations:

$$\mathrm{Bias}(d, i) = \frac{\displaystyle \sum_{t=1}^{N(d,i)} \frac{1}{15} \sum_{m=1}^{15} \mathrm{Fcst}(d, m, i, t)}{\displaystyle \sum_{t=1}^{N(d,i)} \mathrm{TMPA}(i, t + d)},$$

where *d* is the lead time (days 1–10), *i* is the 0.25° grid cell, *m* is the ensemble member (1–15), *t* is the forecast day whose ensemble mean value falls in a given forecast category [1 to *N*(*d*, *i*)], and (*t* + *d*) is the corresponding observation date for the forecast on day *t* at lead time *d*. There are 848 grid cells (0.25° resolution) in our representation of the Ohio River basin.

#### c. RMSE
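A standard definition of the RMSE of the ensemble mean, consistent with the notation of section b, is given below; the exact expression used in the paper may differ in detail.

```latex
\mathrm{RMSE}(d, i) = \sqrt{\frac{1}{N(d,i)} \sum_{t=1}^{N(d,i)}
  \left[\frac{1}{15}\sum_{m=1}^{15} \mathrm{Fcst}(d, m, i, t)
        - \mathrm{TMPA}(i, t + d)\right]^{2}}
```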

#### d. Spearman rank correlation

The Spearman rank correlation is computed only for days when both Fcst_{mean}(*d*, *i*, *t*) and TMPA(*i*, *d* + *t*) are nonzero, in order to avoid high correlations due to matching zero-precipitation days. The *N*(*d*, *i*) are therefore updated [*N*^{+}(*d*, *i*)] but must include at least 50 occurrences; otherwise, the value is not taken into account in the basin-averaging process. For a particular 0.25° grid cell *i*, lead time *d*, and forecast category, all *N*^{+}(*d*, *i*) ensemble mean forecasts Fcst_{mean}(*d*, *i*, *t*) are ranked in magnitude. Their corresponding observations TMPA(*i*, *d* + *t*) are also ranked from 1 to *N*^{+}(*d*, *i*). Identical values are assigned the same rank (i.e., the average of their positions in the ascending order of the values). Let Fcst_{mean}_rank(*d*, *i*, *t*) be the rank of Fcst_{mean}(*d*, *i*, *t*) and TMPA_rank(*i*, *d* + *t*) be the rank of TMPA(*i*, *d* + *t*) in their respective climatologies:

$$\rho_{s}(d, i) = 1 - \frac{6 \displaystyle\sum_{t=1}^{N^{+}(d,i)} \left[\mathrm{Fcst_{mean}\_rank}(d, i, t) - \mathrm{TMPA\_rank}(i, d + t)\right]^{2}}{N^{+}(d, i)\left[N^{+}(d, i)^{2} - 1\right]}.$$
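The rank computation with averaged ties and the nonzero-pair screening can be sketched in pure NumPy as follows (the 50-occurrence minimum is noted in a comment but not enforced in this sketch):

```python
import numpy as np

def average_ranks(x):
    """Ranks 1..n, with tied values assigned the average of their positions."""
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):           # average the ranks over each tie group
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(fcst_mean, tmpa):
    """Spearman rank correlation between ensemble-mean forecasts and TMPA,
    restricted to nonzero pairs (the paper also requires >= 50 such pairs)."""
    keep = (fcst_mean > 0) & (tmpa > 0)
    rf, ro = average_ranks(fcst_mean[keep]), average_ranks(tmpa[keep])
    rf, ro = rf - rf.mean(), ro - ro.mean()
    return float(np.sum(rf * ro) / np.sqrt(np.sum(rf**2) * np.sum(ro**2)))

# toy series: one matching dry day (excluded) plus five monotone wet days
demo_f = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
demo_o = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
rho = spearman(demo_f, demo_o)
```

Computing the correlation from centered ranks (rather than the rank-difference formula) handles ties gracefully and gives the same result when there are none.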

#### e. Talagrand histogram

Let Rank_TMPA(*d*, *i*, *t*) be the normalized rank [divided by the number of ensemble members Nmember (15) plus one] of the observation TMPA(*i*, *d* + *t*) with respect to the Nmember (15) ranked members of the ensemble spread Fcst(*d*, *m*, *i*, *t*). Here, Rank_TMPA(*d*, *i*, *t*) is 0 (1) when the TMPA value is smaller (larger) than all of the member values in the ensemble. The frequency of each rank position *r* is

$$\mathrm{freq}(r) = \frac{1}{N(d, i)} \sum_{t=1}^{N(d,i)} \delta\left[\mathrm{Rank\_TMPA}(d, i, t),\ \frac{r}{\mathrm{Nmember} + 1}\right], \qquad r = 0, 1, \ldots, \mathrm{Nmember} + 1,$$

where *δ*(·, ·) is the Kronecker delta function. The Talagrand histogram plots the frequency with which TMPA has a certain rank (position in the ensemble spread) as a function of the rank (*x* axis).
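The histogram counts can be sketched as follows; the rank convention (0 when the observation lies below all members) follows the text, while the function name and inputs are illustrative.

```python
import numpy as np

def talagrand_counts(ensembles, obs, n_members=15):
    """Count how often the observation falls at each position (0..n_members)
    within the sorted ensemble spread."""
    counts = np.zeros(n_members + 1, dtype=int)
    for ens, o in zip(ensembles, obs):
        rank = int(np.sum(np.sort(ens) < o))  # 0: below all; n_members: above all
        counts[rank] += 1
    return counts

# toy: two cases with a 15-member ensemble spanning 1..15 mm
ens = np.arange(1.0, 16.0)
counts = talagrand_counts(np.array([ens, ens]), np.array([0.5, 100.0]))
```

A flat histogram indicates a reliable ensemble; peaks at the extreme ranks indicate underdispersion.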

#### f. CRPS

The CRPS measures the squared distance between the predicted and observed cumulative probability distributions:

$$\mathrm{CRPS} = \int_{-\infty}^{\infty} \left[P(x) - P_{\mathrm{obs}}(x)\right]^{2} dx,$$

where *P*(*x*) and *P*_{obs}(*x*) are the predicted and observed cumulative probabilities; that is, *P*(*x*) is the probability that the forecast value will not exceed *x*, *ρ*(*m*) is the predicted probability distribution function made up of the 15 ensemble member values, and *P*_{obs}(*x*) is the Heaviside step function, which jumps from 0 to 1 at the observed value obs.

For each lead time (*d*) and 15-member daily ensemble forecast *t*, with TMPA(*i*, *t* + *d*) as the observed value, the CRPS is computed as follows for each 0.25° grid cell *i*:

$$\mathrm{CRPS}(d, i, t) = \int_{-\infty}^{\infty} \left[\frac{1}{15} \sum_{m=1}^{15} H\big(x - \mathrm{Fcst}(d, m, i, t)\big) - H\big(x - \mathrm{TMPA}(i, t + d)\big)\right]^{2} dx,$$

where *H* is the Heaviside function and Fcst(*d*, *m*, *i*, *t*) is the ranked forecast with rank *m* in the 15-member ensemble forecast; that is, Fcst(*d*, 1, *i*, *t*) is the smallest forecast value in the ensemble and Fcst(*d*, 15, *i*, *t*) is the largest.

The climatology CRPS is computed using the climatology forecast: a forecast with (1826 − *d*) members, where each member represents a day in the 2002–06 daily observations [TMPA(1:1826 − *d*, *i*)]. The climatology forecast is identical for each of the (1826 − *d*) forecast days and spans the entire 2002–06 period.
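For a single forecast–observation pair, the ensemble CRPS can be sketched via the standard identity CRPS = E|X − obs| − ½ E|X − X′| over the empirical member distribution, which equals the integral of the squared CDF difference above; the function name is illustrative.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast against one observation, using the
    empirical-distribution identity CRPS = E|X - obs| - 0.5 * E|X - X'|."""
    x = np.sort(np.asarray(members, dtype=float))
    n = len(x)
    term1 = np.mean(np.abs(x - obs))                          # E|X - obs|
    term2 = np.sum(np.abs(x[:, None] - x[None, :])) / (2.0 * n * n)  # 0.5*E|X-X'|
    return float(term1 - term2)
```

For a deterministic (all-members-equal) forecast the CRPS reduces to the absolute error, and averaging over all forecast days of a category gives the scores reported in the paper.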

#### g. Forecast verification for different spatial and temporal scales or seasons

For the forecast verification at the 1° spatial resolution, the downscaled forecasts were aggregated to the 1° spatial resolution and the same equations were used, with *i* varying from 1 to 51 and the forecast categories computed individually for each 1° cell.

For the forecast verification at the 0.25° spatial resolution and 5-day accumulation, each forecast was aggregated in time over days 1–5 and 6–10. Forecast categories were updated, and the same equations were used with *d* varying from 1 (days 1–5) to 2 (days 6–10).

For seasonal analysis, forecast categories were adjusted [*N*(*d*, *i*)] and the climatology CRPS was updated so as to include only the days in the analyzed season.

Statistics when interpolation of the raw forecast is used in lieu of the analog technique where no satisfactory analog is found. These statistics are the basin averages of the individual grid cell statistics during the 2002–06 period.