## 1. Introduction

A number of authors have explored multimodel methods in an attempt to reduce errors in hydrologic simulations and forecasts (e.g., Wang et al. 2009; Viney et al. 2009; Vrugt and Robinson 2007; Duan et al. 2007; Guo et al. 2007; Gao and Dirmeyer 2006; Ajami et al. 2006; Georgakakos et al. 2004). Furthermore, the spread among model results can provide an estimate of prediction uncertainty, while the mean of the model results is often regarded as a more skillful prediction than the results from individual models. Notwithstanding the usefulness of information about prediction uncertainty, here we focus on the multimodel average as an indicator of forecast skill. The general multimodel ensemble approach relies on the combination of results from multiple models in such a way that simulation or forecast errors from the individual models cancel each other to some extent. The approach has been applied in many fields (e.g., Hoeting et al. 1999). Its application in meteorological and climate modeling dates at least to Krishnamurti et al. (1999), who found that the simple average of weather forecasts produced by several meteorological models resulted in greater forecast skill than any individual model. A variety of techniques have been developed for combining models, which amounts to selecting weights. Among the methods that have been explored are the equal-weight average, multiple linear regression (Krishnamurti et al. 1999, 2000), maximum likelihood, and Bayesian techniques. In the Bayesian case, each model’s weight represents the probability that the model is the most skillful (e.g., Gneiting and Raftery 2005).

Multimodel averaging has been studied in the context of probabilistic hydrologic forecasts, but not for large basins or long lead times. Vrugt and Robinson (2007) compared Bayesian multimodel averaging with ensemble Kalman filtering for daily streamflow forecasts from eight watershed models over a small watershed (1950 km^{2}), with a lead time of 1 day. In terms of the RMSE, correlation, and bias of the ensemble mean, multimodel averaging did not in general outperform the best individual model. Duan et al. (2007) evaluated Bayesian multimodel averages with flow-dependent weights in ensemble simulations (not true forecasts, since meteorological uncertainty was not taken into account) of daily flows in small watersheds (700–2500 km^{2}), and found that Bayesian multimodel averaging yielded modest reductions in the daily RMSE of the ensemble mean and improved forecast reliability.

Multimodel averaging can yield improvements in various metrics of forecast skill including the RMSE of the ensemble mean, which in turn depends on forecast bias and error variance (Murphy 1988; Gupta et al. 2009). However, in many applications, including large-scale hydrologic forecasting, various forms of bias-correction techniques are or can be employed to address these same errors. Bias correction has the potential to reduce forecast RMSE by adjusting model statistics to match those of observations. The motivation for so doing is that bias can comprise a substantial part of overall model prediction or forecast RMSE, but can be removed where lengthy observed and model simulation records exist. Various schemes have been developed to do so, ranging from linear transformations to mapping between quantiles of the distributions of simulated and observed results (e.g., Snover et al. 2003; Wood et al. 2002). Bias corrections may be constant in time or vary according to season or month. Bias correction can be an effective means of improving model skill; Shi et al. (2008) found that a monthly bias correction could compensate for poor model calibration when assessed in terms of seasonal forecast error. Removal of model bias can, however, also reduce the model skill, depending on the bias-removal method (e.g., quantile mapping, which not only reduces bias but also changes variability) and skill measure.

Multimodel averaging has been applied in conjunction with bias correction in some hydrologic studies. Ajami et al. (2006) applied several multimodel averaging schemes, with and without bias correction, to simulate hourly discharge in relatively small watersheds (800–2500 km^{2}). They found that multimodel averaging generally yielded improvements in RMSE but little or no improvement in correlation of simulations with observations, implying that these improvements in RMSE arose from a reduction in bias. Applying a constant bias correction to the models before forming the multimodel average led to a further improvement in results. However, they did not compare the performance of individual bias-corrected models to the multimodel average. Gao and Dirmeyer (2006) and Guo et al. (2007) applied both a monthly varying bias correction and monthly varying multimodel average to global soil moisture products, which were compared with observations. They found that a multimodel average of bias-corrected models did not yield substantial improvements over individual bias-corrected models.

It is clear that both bias correction and multimodel averaging can yield improvements in forecast skill. However, in large-scale seasonal hydrologic forecasting, aggregation of model results (e.g., from daily to monthly time scales) may affect (usually increase) model cross correlations and therefore reduce the performance of multimodel ensembles. Under these conditions, the question arises: what are the relative effects of multimodel averaging and bias correction on forecast skill? In this study, we address this question in the context of large-scale seasonal streamflow forecasting. We compare the performance of three macroscale hydrologic models [Variable Infiltration Capacity (VIC), Sacramento Moisture Accounting Model/Snow-17 (SAC), and the Noah land surface model] and two multimodel averaging methods (simple model average and multiple linear regression with monthly varying model weights), with and without monthly bias correction, in simulating monthly discharge in three snowmelt-dominated river basins in the western United States ranging in size from 9000 to 35 000 km^{2}. We examine performance not only in retrospective simulations, but also in probabilistic forecasts produced using the Ensemble Streamflow Prediction (ESP) method (Day 1985), as a function of forecast month and lead time, for lead times of 1–12 months. While our multimodel suite is relatively small, the diversity of model bias and skill among the three models provides information that is applicable to larger ensembles, as we discuss in more detail in section 4.

## 2. Methods

### a. Models

We applied the following three large-scale hydrologic models: VIC (version 4.0.6), SAC, and Noah (version 2.7.1). VIC (Liang et al. 1994; Cherkauer and Lettenmaier 1999) is a physically based, semidistributed model with a three-layer soil column overlain by multiple vegetation tiles and a two-layer snowpack. It solves both water and energy balances at the soil and snow surfaces on daily and subdaily time steps. Subgrid heterogeneity of soil moisture infiltration is represented using a statistical parameterization. Noah (Chen et al. 1996; Koren et al. 1999) is also a physically based model, with a four-layer soil column, a single vegetation layer, and a one-layer snowpack. Noah solves the water balance and a linearized energy balance at the soil and snow surfaces at a subdaily time step. SAC uses the conceptually based Sacramento Moisture Accounting Model (Burnash 1995; Burnash et al. 1973), originally designed as a lumped model but adapted to gridded parameters for the North American Land Data Assimilation System (NLDAS) project (Mitchell et al. 2004). It does not perform an energy balance, and does not explicitly represent vegetation, but rather bases its evapotranspiration calculations on a prescribed potential evapotranspiration (PET) modulated by modeled soil moisture relative to the contents of multiple soil zones. Following Mitchell et al. (2004) we used the PET computed by Noah as an input for SAC. SAC’s snow model, Snow-17 (Anderson 1973), uses a degree-day formulation to simulate snow accumulation and melt. The surface and subsurface runoff simulated by each model was routed by a common external routing program, described by Lohmann et al. (1996, 1998).

### b. Study basins and observations

We applied the individual models and multimodel methods to three river basins shown in Fig. 1: the Feather River at the Oroville Dam, California; the Salmon River at White Bird, Idaho; and the Colorado River above Grand Junction, Colorado. The Feather River basin has a drainage area of 9390 km^{2}. Naturalized monthly streamflow (upstream storage and diversion effects removed) at Oroville Dam (OROVI) was obtained for the period 1951–2005 from the California Data Exchange Center (CDEC) of the California Department of Water Resources. The Salmon River basin at White Bird has a drainage area of 35 100 km^{2}. Streamflow observations were taken from U.S. Geological Survey records for station 13317000 for the period 1951–2005. The Colorado River basin at Cameo (CAMEO) has a drainage area of 20 800 km^{2}. Discharge for the period 1951–2003, adjusted for the effects of upstream storage and diversion, was obtained from the U.S. Bureau of Reclamation (USBR).

Monthly average water budget terms are plotted in Fig. 2 for all basins. Row 1 shows basin-average observed precipitation; rows 2–4 show basin-average snow water equivalent (SWE), snowmelt, and evaporation, respectively, as simulated by each of the three models; and row 5 shows observed and simulated discharge at the gauge. The three models exhibit substantial variation in their simulated SWE, snowmelt, and evaporation (Fig. 2, panels 2a–4c), but because larger snowpacks tend to be accompanied by larger evaporative fluxes, the resulting calibrated flows (Fig. 2, panels 5a–c) are much more similar among the models, with the notable exception of Noah in the Feather basin (Fig. 2, panel 5b). For all the basins, the snowmelt peak occurs in April–May and plays a large role in the seasonal cycle of discharge (Fig. 2, panels 3a–c). However, the Colorado and Salmon River basins are more snowmelt-dominated than the Feather basin, due primarily to their higher average elevations, combined with the fact that precipitation in the Feather basin has a much more distinct seasonal cycle than in the other two basins, with wet winters and dry summers (Fig. 2, panels 1a–c). Thus, in the Colorado and Salmon basins, the months of highest flow are May–July, corresponding to the snowmelt peak (Fig. 2, panels 5a and 5c), while for the Feather basin high flows are distributed over December–April and consist of a mix of rainfall and snowmelt (Fig. 2, panel 5b).

### c. Meteorological inputs and model parameters

All models were applied at ⅛° latitude–longitude spatial resolution. The models were driven by a common set of meteorological forcings consisting of 56 yr (1949–2005) of daily maximum temperature, minimum temperature, precipitation, and wind speed for the Feather and Salmon basins and 54 yr (1949–2003) of the same inputs for the Colorado basin. The daily meteorological inputs were taken from Maurer et al. (2002) for the period 1949–2000 and extended to 2003 for the Colorado and to 2005 for the Salmon and Feather basins. To account for topographic effects, each grid cell was subdivided into up to five separate elevation bands, depending on the elevation range within the grid cell. Within each band, the air temperature was lapsed to the band’s average elevation using a lapse rate of 6.5°C km^{−1}. The same elevation bands were used in all models.
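
The elevation-band temperature adjustment described above can be sketched as follows (a minimal Python illustration; the function and variable names are ours, not from any of the models' source code):

```python
# Sketch of the elevation-band temperature adjustment: a constant lapse rate
# of 6.5 degC per km moves the grid-cell temperature to each band's mean
# elevation. Names are illustrative.

LAPSE_RATE_C_PER_KM = 6.5

def lapse_temperature(t_cell_c, z_cell_m, z_band_m):
    """Return air temperature (degC) lapsed from the grid-cell mean
    elevation z_cell_m to the band mean elevation z_band_m (both in m)."""
    return t_cell_c - LAPSE_RATE_C_PER_KM * (z_band_m - z_cell_m) / 1000.0

# Example: a band 800 m above the cell mean elevation is 5.2 degC cooler.
t_band = lapse_temperature(10.0, 1500.0, 2300.0)  # -> 4.8
```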

Initial model parameters were taken from the NLDAS experiment (Mitchell et al. 2004). All models were calibrated using the multiobjective complex evolution (MOCOM) algorithm (Yapo et al. 1998) to maximize simultaneously the Nash–Sutcliffe efficiency (NSE) and the correlation between model simulations and observations. While these calibrations yielded nontrivial reductions in RMSE relative to the NLDAS parameters, the objective function arguably is not ideally formulated (NSE contains a component that is sensitive to correlation, thereby rendering the dual objective functions somewhat redundant; see e.g., Murphy 1988; Gupta et al. 2009). For this reason, we performed an exploratory analysis, in which the multiple criteria in MOCOM were the three components of MSE (see section 4a and Gupta et al. 2009), over one of our test basins (the Feather River basin). We found that the results were qualitatively similar to those obtained using NSE and the model-observed correlations. We address some implications of calibration methods for our results in section 4c.

Calibrations were performed for the years 1976–90. This period was chosen to be representative of the range of temperature and precipitation found over the last half-century in the three basins. The VIC parameters calibrated were the baseflow parameters Ds, Ds_{max}, and Ws; the infiltration parameter *b*_{inf}; and the bottom-layer soil depth D3. For SAC, LZTWM, UZTWM, LZFPM, PFREE, LZSK, and ADIMP were calibrated based on the parameter sensitivity study by Tang et al. (2007). Additionally, the snow parameters MFMIN, MFMAX, and SCF were calibrated. For Noah, calibration was performed on the four soil layer depths and maximum snow albedo.

### d. Bias correction

The bias-corrected flows were obtained by rescaling each model's flows to match the mean and standard deviation of the observed flows:

$$X'_{k} = \bar{Y} + \frac{s_{Y}}{s_{k}}\left(X_{k} - \bar{X}_{k}\right) = a_{k} + b_{k}X_{k} \qquad (1)$$

where *X*′_{k} is the *k*th model flow after bias correction, *X*_{k} is the *k*th model flow, *X̄*_{k} and *Ȳ* are the means of the *k*th model flows and the observed flows *Y*, respectively, *s*_{k} is the standard deviation of flows of the *k*th model, *s*_{Y} is the standard deviation of observed flows, and *a*_{k}, *b*_{k} are the equivalent linear regression parameters (*b*_{k} = *s*_{Y}/*s*_{k} and *a*_{k} = *Ȳ* − *b*_{k}*X̄*_{k}). We applied the bias corrections on a monthly basis.
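
A minimal sketch of this monthly rescaling, assuming the fit is performed independently for each calendar month over a training period (Python; names are ours, not from the study's code):

```python
# Sketch of the linear bias correction of Eq. (1): model flows are shifted
# and rescaled so that their mean and standard deviation over the training
# period match those of the observations. In the study the parameters are
# fitted separately for each calendar month; names are illustrative.
from statistics import mean, pstdev

def fit_bias_correction(model_flows, obs_flows):
    """Return (a, b) such that corrected flow = a + b * model flow."""
    b = pstdev(obs_flows) / pstdev(model_flows)
    a = mean(obs_flows) - b * mean(model_flows)
    return a, b

def apply_bias_correction(a, b, flows):
    return [a + b * x for x in flows]

# Hypothetical May flows over four training years:
model = [120.0, 150.0, 90.0, 140.0]
obs = [100.0, 130.0, 80.0, 110.0]
a, b = fit_bias_correction(model, obs)
corrected = apply_bias_correction(a, b, model)
# corrected now has the same mean and standard deviation as obs
```

In practice one (*a*, *b*) pair is estimated per model per calendar month, then applied to that month's flows in the validation period.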

### e. Multimodel ensemble techniques

The multimodel ensemble average *E* can be defined as

$$E = \sum_{k=1}^{M} w_{k} X_{k} \qquad (2)$$

where *E* is the ensemble average flow, *X*_{k} is the *k*th model flow, *w*_{k} is the weight of the *k*th model, and *M* is the number of models in the ensemble. Several strategies can be used for computing the model weights *w*_{k}. We examined the performance of two strategies in particular:

- Simple model average (SMA): all *w*_{k} = 1/*M*.
- Unconstrained multiple linear regression (MLR): *w*_{k} determined by multiple linear regression of observations against all models over a historical “training” period; *w*_{k} can be negative or >1.
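
The two weighting schemes can be sketched as follows (Python; the tiny normal-equations solver is our own illustrative implementation, and, matching Eq. (2), no intercept term is included):

```python
# Sketch of the two weighting schemes: the simple model average (SMA) and
# unconstrained multiple linear regression (MLR) of observations on the
# member-model flows over a training period. Names are illustrative.

def sma_weights(n_models):
    return [1.0 / n_models] * n_models

def mlr_weights(model_flows, obs):
    """model_flows: list of M series (each length T); obs: length-T series.
    Solves the normal equations (X^T X) w = X^T y by Gaussian elimination."""
    m = len(model_flows)
    A = [[sum(xi * xj for xi, xj in zip(model_flows[i], model_flows[j]))
          for j in range(m)] for i in range(m)]
    rhs = [sum(xi * y for xi, y in zip(model_flows[i], obs)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    w = [0.0] * m
    for r in range(m - 1, -1, -1):
        w[r] = (rhs[r] - sum(A[r][c] * w[c] for c in range(r + 1, m))) / A[r][r]
    return w

# Two hypothetical model series and observations built from them:
m1 = [1.0, 2.0, 3.0, 4.0]
m2 = [2.0, 1.0, 4.0, 3.0]
obs = [0.7 * a + 0.5 * b for a, b in zip(m1, m2)]
weights = mlr_weights([m1, m2], obs)  # recovers approximately [0.7, 0.5]
```

As in the study, the MLR weights are unconstrained (they may be negative or exceed 1) and would be fitted separately for each calendar month.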

### f. Training and validation periods

We evaluated the performance of the bias correction and multimodel ensemble-based estimates during a validation period separate from the training period used to estimate the bias correction parameters (*a*_{k}, *b*_{k}) and ensemble model weights *w*_{k}. To minimize the differences in flow statistics between training and validation periods, we chose a training period consisting of the even years 1952–2004 and a validation period consisting of the odd years 1951–2005 (or 1952–2002 and 1951–2003, respectively, in the case of the Colorado River).

### g. Probabilistic forecasts: ESP technique

To investigate relative performance of the methods for seasonal streamflow forecasting, we used the ESP method (Day 1985). This method generates an ensemble of hydrological simulations by resampling from sequences of historical meteorological inputs to the hydrological model, under the assumption that these meteorological inputs represent the distribution of possible conditions during the forecast period. In this study, forecasts were made for each month in the period 1951–2004 (1951–2002 in the Colorado basin), with lead times of 1–12 months. For each forecast/lead time combination, simulations started from the model state on the first day of the forecast start month, as estimated by the retrospective run using observed meteorology. An ensemble of simulations spanning from the start month to the forecast month was then generated from an ensemble of historical meteorological inputs composed of the same set of years, 1951–2004 (1951–2002 in the Colorado basin), excluding the year being forecast.
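
The ESP resampling logic can be sketched as follows (Python); here `run_model` is a hypothetical stand-in for a full hydrologic model run over the forecast window, and all names are ours:

```python
# Sketch of the ESP scheme described above: from the model state at the
# forecast start, one ensemble member is generated per historical year,
# excluding the year being forecast.

def esp_ensemble(initial_state, forcings_by_year, forecast_year, run_model):
    """forcings_by_year maps year -> meteorological forcing sequence for the
    forecast window; returns one simulated trace per non-forecast year."""
    return {
        year: run_model(initial_state, forcing)
        for year, forcing in sorted(forcings_by_year.items())
        if year != forecast_year
    }

# Toy stand-in "model": total flow = initial storage + total precipitation.
def toy_model(state, precip):
    return state + sum(precip)

ens = esp_ensemble(10.0, {1951: [1, 2], 1952: [3, 4], 1953: [5, 6]},
                   1952, toy_model)
# 1952, the year being forecast, is excluded from the ensemble
```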

## 3. Results

### a. Retrospective simulations

We first examine the relative performance of the two methods in retrospective simulations. Monthly fractional RMSE (RMSE divided by observed mean flow) is plotted in Fig. 3 for individual models and multimodel averages, for each of the three study basins. The fractional RMSE values of “raw” results are shown in the first column, the fractional RMSE values of bias-corrected results are shown in the second column, and the differences between fractional RMSE values of bias-corrected and raw results are shown in the third column. The fractional RMSE values of raw results tend to be smaller in high-flow months (April–September in the Colorado and Salmon River basins; January–June in the Feather River basin).
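
For reference, the fractional RMSE metric of Fig. 3 can be computed as follows (Python sketch; names are ours):

```python
# Sketch of the fractional RMSE metric: RMSE divided by the mean observed
# flow, computed here for one month's flows across years.
from math import sqrt
from statistics import mean

def fractional_rmse(simulated, observed):
    rmse = sqrt(mean((s - o) ** 2 for s, o in zip(simulated, observed)))
    return rmse / mean(observed)
```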

Considering first the RMSE of the raw results (column a), we see considerable spread in the individual model results, with no single model consistently outperforming the others (although VIC and SAC tend to perform better than Noah). It is clear that in general, the best-case multimodel average (MLR) performs at least as well as, and sometimes substantially better than (particularly in the Salmon basin), the best raw model output, with the exception of month 11 in the Feather basin. The SMA multimodel average, on the other hand, never substantially outperforms the best raw model, and often performs much worse. In contrast, comparison of the fractional RMSE values after applying the monthly bias correction (columns b and c) shows that in general, the bias correction reduces RMSE values of not only individual models but also multimodel averages, with some exceptions in the Feather basin in months 1, 2, 4, and 11. These reductions in RMSE are mostly at least as large as, if not larger than, those yielded by the MLR multimodel average of raw model results (column a). In the Colorado and Salmon basins, these improvements range from 0% to approximately 50% of the original RMSE in high-flow months (April–September), and approximately 50%–100% in low-flow months (October–March). In addition, after bias correction, the spread of RMSE values (column b) is much narrower than for raw values (with the exception of month 11 in the Feather basin, which may be the result of nonstationarity, discussed further in section 4b). Not only is the bias-corrected MLR multimodel average rarely better than the best bias-corrected model (and only marginally better in those cases), but all bias-corrected models and multimodel averages (with the exception of Noah in the Feather River basin) become very similar after bias correction.

### b. ESP forecasts

We also evaluated the performance of bias correction and multimodel averaging with respect to the behavior of the ESP ensemble mean [the ensemble spread is also a useful metric worth investigating (e.g., Weigel et al. 2008), but we thought it important first to understand the behavior of the ensemble mean, which influences the bias of the forecasts]. We evaluated forecast RMSEs for each start month/forecast month pair over all 27 (Salmon and Feather basins) or 26 (Colorado basin) validation year forecasts. The parameters used in the bias correction and multimodel averages were the same as those used in section 3a for the simulation comparisons. Figure 4 shows the fractional RMSE of ESP ensemble means as a function of simulation start month and forecast month, for the three study basins, for VIC (chosen to represent the behavior of a typical individual model), with and without bias correction and the two monthly multimodel averages without bias correction. For each basin, the fractional RMSE of each model or multimodel average is displayed in the first row and differences between these fractional RMSEs and those of the raw VIC results are displayed in the second row. In each plot, the *y* axis corresponds to forecast month and the *x* axis corresponds to start month. For reference, the 1:1 line, along which the start month is equal to the forecast month and the lead time is equal to 0, is drawn in each plot. Forecasts along this diagonal incorporate the smallest uncertainty in model forcings, and therefore forecast errors along this diagonal approach those of the retrospective simulations (section 3a). Just to the right of this diagonal, the lead time is 12 months, for which model forcing uncertainty is greatest.

Two main features are evident in the raw model results, represented by “VIC_{RAW}” in the first column of Fig. 4 (all three raw models exhibit the same general patterns). For forecast months that are not dominated by snowmelt (October–March in all basins), fractional RMSE values are relatively high and vary little with lead time, except for small variations (decreasing or increasing) at lead times of 1–2 months. For snowmelt-dominated forecast months (April–September in all basins), fractional RMSE is relatively high and constant with lead time before the start month of December or January (when snowfall begins to accumulate), at which point it begins a steady decline until the start month equals the forecast month (lead time of 0). This is an inherent quality of ESP forecasts in snowmelt-dominated basins, as noted by Pagano et al. (2004) and others. At long lead times, the influence of the forecast’s initial conditions is small, and climatology dominates the forecast. In nonsnowmelt-dominated forecast months, forecast skill changes appreciably only at short (1–2 months) lead times (this can be an increase or decrease, depending on whether the short-lead-time forecast is better than climatology). In contrast, in those forecast months dominated by snowmelt, forecast skill begins to increase for those start months when the snowpack has begun to form, and increases steadily with decreasing lead time as the snowpack matures. This can lead to appreciable forecast skill with lead times up to 6 months in some cases.

Columns 2–4 of Fig. 4 illustrate the differences in performance between bias correction of the VIC model (denoted by “VIC_{BC}”) and multimodel averaging, in terms of fractional RMSEs of ESP means. In the Colorado and Salmon basins, both bias-corrected VIC and raw MLR tend to reduce RMSE substantially (50%–75%) in nonsnowmelt-dominated forecast months at all lead times. In snowmelt-dominated forecast months (which correspond to the highest-flow months in these basins), the reduction in RMSE is smaller and depends on lead time, with short-lead-time reductions of 0%–25% generally decreasing to near zero for start months before December. In the Feather basin, a similar small lead-time-dependent reduction is evident in snowmelt-dominated forecast months (which form a subset of the highest-flow months in this basin), but in nonsnowmelt-dominated forecast months, fractional RMSE is relatively high to begin with and bias-corrected VIC and raw MLR have little effect on RMSE (this may also be a result of nonstationarity in the Feather basin, discussed further in section 4b). The behaviors displayed by bias correction and multimodel averaging in snowmelt-dominated forecast months come about because errors in the raw model results are the combination of two factors: lead-time-independent climatological bias and lead-time-dependent errors from snowpack state. While bias correction and MLR show similar dependence on forecast month and lead time, bias-corrected VIC tends to perform as well as or better than MLR in most forecast month–start month combinations; for snowmelt-dominated forecast months, the difference between bias-corrected VIC and MLR tends to increase as lead time decreases. In all basins, raw SMA rarely outperforms raw VIC, and sometimes performs worse. As with the retrospective case, after bias correction, both MLR and SMA, as well as the other two individual models, all look very similar to bias-corrected VIC (and therefore have been omitted from the plot). 
Thus, applying bias correction to a single model tends to yield benefits that are similar to or greater than those produced by multimodel averaging.

## 4. Discussion

The most evident pattern in our results is that applying a monthly bias correction to a single model yields similar or better performance improvements than those resulting from the best monthly multimodel ensemble (MLR) of raw model results, and is also competitive with multimodel averages of bias-corrected models. This is true both for the retrospective simulations and for the ensemble means of ESP forecasts. Given the benefits of multimodel averaging observed by others (e.g., Krishnamurti et al. 1999, 2000; Ajami et al. 2006), understanding the reasons for these benefits is of great interest.

### a. Components of MSE

The mean-square error (MSE) of an individual model, normalized by the observed variance, can be decomposed into three components (Murphy 1988; Gupta et al. 2009):

$$\frac{\mathrm{MSE}_{k}}{s_{Y}^{2}} = \left(\frac{B_{kY}}{s_{Y}}\right)^{2} + \left(1 - r_{kY}^{2}\right) + \left(\frac{s_{k}}{s_{Y}} - r_{kY}\right)^{2} \qquad (3)$$

where MSE_{k} is the mean-square error of model *k* with respect to observations *Y*, *B*_{kY} is the bias of model *k* with respect to observations (= *X̄*_{k} − *Ȳ*), *r*_{kY} is the correlation of model *k* and observations *Y*, *s*_{Y} is the standard deviation of observations *Y*, and *s*_{k} is the standard deviation of model *k*. In other words, a model's normalized MSE consists of a bias term, (*B*_{kY}/*s*_{Y})^{2}; a correlation term, 1 − *r*_{kY}^{2}; and an amplitude term, (*s*_{k}/*s*_{Y} − *r*_{kY})^{2}. All of these terms have minimum values of 0; the correlation term has a maximum of 1 while the other terms have no upper limit. Thus, a model's MSE can be minimized by 1) reducing its bias, 2) increasing its correlation with observations, and/or 3) adjusting its amplitude *s*_{k} to match *s*_{Y}*r*_{kY} as closely as possible. As noted by Gupta et al. (2009), minimizing a model's MSE is therefore equivalent to minimizing the sum of these three terms, two of which (the correlation and amplitude terms) are linked. The relative impacts of bias correction and multimodel averaging will therefore depend on how each method handles the different components of MSE.
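
The decomposition can be verified numerically; with population (1/*N*) moments the three terms sum exactly to the normalized MSE (Python sketch; names are ours):

```python
# Numerical check of the bias/correlation/amplitude decomposition of
# normalized MSE. Population (1/N) moments are used so the identity is exact.
from statistics import mean, pstdev

def mse_components(model, obs):
    """Return (normalized MSE, bias term, correlation term, amplitude term)."""
    n = len(obs)
    mx, my = mean(model), mean(obs)
    s_k, s_y = pstdev(model), pstdev(obs)
    mse = sum((x - y) ** 2 for x, y in zip(model, obs)) / n
    r = sum((x - mx) * (y - my) for x, y in zip(model, obs)) / (n * s_k * s_y)
    bias_term = ((mx - my) / s_y) ** 2
    corr_term = 1.0 - r ** 2
    ampl_term = (s_k / s_y - r) ** 2
    return mse / s_y ** 2, bias_term, corr_term, ampl_term
```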

Figure 5 displays the fractions of total monthly normalized MSE contributed by the bias, correlation, and amplitude components, for the retrospective simulations in the three basins used in this study, for VIC (a “representative” individual model; row 1) and the two multimodel averages (rows 2 and 3), with and without monthly bias correction. To facilitate comparisons among the different processing techniques, fractions are relative to the total monthly normalized MSE of raw VIC; a horizontal dashed line drawn at 1.0 across all panels indicates this reference value. Here again the Feather basin exhibits different behavior from the other two basins, particularly in months 9 and 11, possibly due to nonstationarity (discussed further in section 4b).

For the individual model (VIC), before bias correction (row 1, columns a, c, and e), the bias and amplitude terms often form the majority of normalized MSE. In the Colorado and Salmon basins (panels 1a and 1e), bias and amplitude components dominate in the majority of months, although in the Feather they dominate in only three of the months (panel 1c). The partitioning of normalized MSE into its components exhibits some dependence on flow volume; as a result of inverse dependence on observed variance [Eq. (3)], which can be quite small in lower-flow months, the bias and amplitude terms are more likely to dominate total normalized MSE in those months. After bias correction (row 1, columns b, d, and f), the bias term is essentially eliminated and the amplitude term is substantially reduced (although the bias correction actually increased the amplitude term in some months in the Feather basin), with no change in the correlation component. One notable exception is month 11 in the Feather River basin, in which the bias correction worsened performance in all three components of normalized MSE (possible nonstationarity in the Feather basin is discussed in section 4b). Other individual models exhibit the same general behavior as VIC and are omitted here for brevity.

In contrast, the multimodel averages of raw model results (rows 2–3, columns a, c, and e) generally do not reduce the bias term as much as the bias correction does, and give mixed results for the amplitude term. Indeed, the simple model average (SMA) often increases the bias term and both SMA and MLR often increase the amplitude term. Unlike bias correction, multimodel averages can effect some change in the correlation term, but the change is usually small and not always a reduction. Applying multimodel averages to bias-corrected models (rows 2–3, columns b, d, and f) yields results similar to applying bias correction to the best individual model (row 1, columns b, d, and f), that is, little or no bias and a greatly reduced amplitude term, but with a slightly changed correlation term.

Thus, it appears that bias correction owes its success, relative to multimodel averaging, to its superior performance in reducing the bias and amplitude components of error. This should not come as a surprise; bias correction methods are specifically designed to remove these types of error from an individual model, independently of any other models. In each month, our bias correction adjusts the mean and rescales the standard deviation [Eq. (1)] and therefore directly adjusts both the bias and amplitude terms of normalized MSE [Eq. (3)], but has no effect on the correlation term [as an aside, we also tested a quantile-mapping bias correction as used by Snover et al. (2003) rather than the simple scaling approach we used, and found that it too had minimal impact on model–observation correlations]. It should be noted that our bias correction, in adjusting the model standard deviation *s*_{k} to match the observed standard deviation *s*_{Y}, will not in general minimize the amplitude term of Eq. (3); ideally the bias correction should adjust *s*_{k} to match *s*_{Y}*r*_{kY}. Still, the bias correction we have employed tends to reduce the amplitude component of error substantially.

We see the same behavior in the means of ESP forecasts. At long lead times, the model results are dominated by model climatology. Because model climatology is the same for all forecasts of a given month, long-lead-time forecasts have small variances and essentially no correlation with observations. Because the amplitude and correlation components depend on these two quantities, they are relatively insensitive to postprocessing when dominated by climatology. Thus, for long-lead-time forecasts, the only error component that can be appreciably reduced by postprocessing techniques is the bias arising from the model climatology. The method that reduces this bias most effectively will yield the most accurate forecasts at long lead times. Our results indicate that bias correction is more effective than multimodel averaging in this respect. Similarly, as lead time diminishes, the forecast errors begin to resemble those of the retrospective simulations, which, as we have already seen, are dominated by the bias and amplitude terms, which are reduced most effectively by bias correction.

The benefits of multimodel averaging shown in the retrospective simulations are similar to those suggested by other hydrologic studies. For example, Ajami et al. (2006) found that multimodel averages of uncalibrated models, based on multiple linear regression, exhibited substantial improvements in RMSE (8%–16%), but with little or no improvements in correlation relative to the best individual model. Incorporating some form of bias correction into the multimodel averages further improved RMSE to some degree (0%–10%), but again had little effect on correlation. These results imply that the types of errors being addressed by the multimodel averages and bias corrections were bias or amplitude components, rather than the correlation component. While Ajami et al. (2006) did not examine the performance of individual bias-corrected models, it seems plausible that applying bias corrections to the individual models would have produced results that were competitive with the bias-corrected multimodel averages, given that the multimodel averages did not have much impact on correlations.

### b. Monthly data and model cross correlations

It should not be a surprise that bias correction outperforms multimodel averaging in the reduction of bias. But why did the multimodel averages perform so poorly in reducing the correlation and amplitude components [i.e., the second and third terms on the right-hand side of Eq. (4)]? While the number and choice of models in the ensemble likely played a role, a primary reason was our analysis of monthly data.
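Equation (4) itself does not appear in this excerpt; the decomposition it refers to is of the standard form given by Murphy (1988) and Gupta et al. (2009), which splits the mean squared error between a simulation $Y$ and observations $O$ into bias, amplitude, and correlation components (the ordering of terms here may differ from that of Eq. (4) in the full text):

```latex
\mathrm{MSE}(Y,O)
  = \underbrace{\left(\mu_Y-\mu_O\right)^2}_{\text{bias}}
  + \underbrace{\left(\sigma_Y-\sigma_O\right)^2}_{\text{amplitude}}
  + \underbrace{2\,\sigma_Y\sigma_O\left(1-r_{YO}\right)}_{\text{correlation}}
```

Here $\mu$ and $\sigma$ denote the mean and standard deviation of each series, and $r_{YO}$ is their linear correlation; bias correction can remove the first two terms, but only a method that changes the temporal pattern of the simulation can reduce the third.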

The correlation and amplitude terms depend on the models’ correlations with each other and with the observations. Winter and Nychka (2009) provide a useful geometric analysis of the dependence of multimodel performance on the relationships among the member models. If we represent model errors as vectors in T dimensions (where T is the number of points in the time series), the multimodel average will outperform the best individual model to the degree that the set of member models samples the error space evenly (i.e., the models’ error magnitudes must be somewhat similar, but their directions in error space should be as diverse as possible, so that their average is as close to the origin as possible). The larger the region of error space these error vectors span (i.e., the less collinear the models are), the greater the possible gains. This is independent of the number of models in the ensemble, although sampling of a larger span of error space is more likely with a larger number of models.
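This geometric argument can be illustrated numerically. The sketch below (an illustrative setup, not taken from the paper) builds two three-member ensembles of error vectors with similar magnitudes, one nearly collinear and one with diverse directions, and compares the RMSE of each ensemble-average error:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000  # length of each error time series (dimensionality of error space)

# Ensemble 1: nearly collinear errors (all models share the same dominant error)
shared = rng.standard_normal(T)
collinear = [shared + 0.1 * rng.standard_normal(T) for _ in range(3)]

# Ensemble 2: diverse errors (independent directions, same typical magnitude)
diverse = [rng.standard_normal(T) for _ in range(3)]

def rmse(e):
    return float(np.sqrt(np.mean(np.asarray(e) ** 2)))

# Averaging near-collinear errors barely helps; averaging independent
# unit-magnitude errors shrinks RMSE by roughly 1/sqrt(3)
avg_collinear = rmse(np.mean(collinear, axis=0))
avg_diverse = rmse(np.mean(diverse, axis=0))
```

Averaging N independent errors of equal magnitude reduces RMSE by about 1/√N, whereas averaging near-collinear errors leaves it essentially unchanged no matter how many models are added, which is the point of the geometric analysis.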

Techniques commonly applied in seasonal hydrologic forecasting, and which we have applied here, can increase the collinearity of the models and reduce the effectiveness of a multimodel average, relative to other hydrologic applications. Most prominent among these techniques is our aggregation of the model results to a monthly time scale. Other important factors are our use of methods with monthly varying parameters and our consideration of relatively large snowmelt-dominated river basins. For example, Ajami et al. (2006) used hourly discharge from river basins that were somewhat smaller than those that we studied (all <10 000 km^{2}); the multimodel averages and bias corrections they applied involved time-invariant parameters. Under such conditions, intermodel differences can be expected to include significant shifts in the timing of runoff peaks and the overall shapes of the hydrographs on a daily and subdaily scale; time-invariant bias correction would preserve these differences, and time-invariant multimodel averaging would presumably have the benefit of relatively small correlations among the model errors. In contrast, we considered monthly discharge from larger (mostly >10 000 km^{2}) basins, and used time-varying (monthly) bias corrections. Aggregating observed and simulated discharge to a monthly interval reduces intermodel differences mostly to differences in the monthly bias and the amplitude and correlation of interannual variability. To the extent that the interannual variability of simulated streamflow is sensitive to interannual variability in the meteorological forcings common to all models (this sensitivity should be strong in snowmelt-dominated basins, where the previous few months’ accumulation of snow dominates the peak seasonal flows), any technique that emphasizes the models’ interannual variability will increase model collinearity. 
Additionally, because monthly bias is the most important component of simulation and forecast errors, any technique (e.g., bias correction with monthly varying parameters) that reduces the bias of each month separately should yield substantial improvements, which appears to be a major factor in our results.
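The effect of temporal aggregation on collinearity can be seen with two idealized daily hydrographs that differ only by a timing shift (a toy example; the peak date, width, and 10-day offset are arbitrary choices):

```python
import numpy as np

days = np.arange(365)

# Two models producing the same idealized snowmelt peak, offset by 10 days
flow_a = np.exp(-0.5 * ((days - 155) / 5.0) ** 2)
flow_b = np.exp(-0.5 * ((days - 165) / 5.0) ** 2)

# At the daily time scale, the timing error keeps the series weakly correlated
daily_r = np.corrcoef(flow_a, flow_b)[0, 1]

# Monthly aggregation (12 blocks of ~30 days) absorbs the timing shift,
# leaving two nearly identical, highly collinear series
monthly_a = flow_a[:360].reshape(12, 30).mean(axis=1)
monthly_b = flow_b[:360].reshape(12, 30).mean(axis=1)
monthly_r = np.corrcoef(monthly_a, monthly_b)[0, 1]
```

A daily-scale timing difference that multimodel averaging could exploit largely disappears once both series are reduced to monthly means, leaving mainly bias and amplitude differences that bias correction handles directly.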

To illustrate the impact of aggregation from daily to monthly time scales, we plot in Fig. 6 monthly ‖𝗿‖ computed from both the original daily discharge and the monthly aggregate discharge, for the three study basins. It is readily apparent that aggregating discharge from daily to monthly time scales almost always increased model collinearity, with especially dramatic results in the Feather basin. The only cases in which the aggregation decreased model collinearity were month 4 in the Colorado basin, months 8 and 9 in the Feather basin, and months 3, 4, and 12 in the Salmon basin. All of these are low-flow months, in which the bias and amplitude components were the dominant sources of error anyway (so that bias correction could still remove the bulk of the errors).

The use of monthly varying parameters had an impact on performance as well. This is explored further in Fig. 7, which plots grand (as opposed to monthly) fractional RMSE of individual models and multimodel averages, with and without bias correction. Both the multimodel averages and the bias correction were applied in two ways: constant parameters (fitted to minimize the MSE of the whole time series) and monthly varying parameters (fitted to minimize the MSE of each month). Figure 7 shows that for the Colorado and Salmon River basins, both the MLR multimodel average with monthly varying weights (column c) and the monthly varying bias correction (column d) yielded substantially larger reductions in errors than their time-invariant counterparts (columns a and b). Surprisingly, in the Feather basin, monthly varying parameters performed worse than constant parameters; in fact, the constant MLR average yielded the best results of any method. This was not the case in the training period. One explanation for this may be a lower degree of stationarity in the Feather basin, resulting from either nonstationary climate or an artifact of naturalizing flows to account for the large reservoir at the basin outlet. Under such conditions, one must be careful to avoid overfitting the parameters to the training data. Certainly, the monthly varying bias correction and multimodel averages have 12 times the number of parameters of their time-invariant counterparts. In this respect, the constant-parameter bias correction and multimodel averages may be more robust, statistically, than monthly varying methods. This would be consistent with the findings of Ajami et al. (2006), Gao and Dirmeyer (2006), and Duan et al. (2007). 
If nonstationarity is a significant concern and we restrict ourselves to time-invariant parameters, a multimodel average may be superior to a bias correction alone because it still can address the relative values of adjacent monthly data via the month-to-month differences between individual models.
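As a sketch of the two fitting strategies (synthetic data; the seasonal cycle, model biases, and noise levels are invented for illustration), constant weights fit one regression to the whole series, while monthly varying weights fit 12 separate regressions, and in the training sample the monthly fit can never do worse, which is precisely why it is more prone to overfitting:

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_models = 30, 4
months = np.tile(np.arange(12), n_years)

# Synthetic observations with a seasonal cycle, and models that track them
# with month-dependent biases plus noise (all values illustrative)
obs = np.sin(2 * np.pi * months / 12) + 0.3 * rng.standard_normal(months.size)
bias = rng.normal(0.5, 0.3, (12, n_models))  # per-month, per-model bias
sims = obs[:, None] + bias[months] \
       + 0.2 * rng.standard_normal((months.size, n_models))

def fit_weights(X, y):
    """Least-squares MLR weights (intercept plus one weight per model)."""
    A = np.column_stack([np.ones(y.size), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def combine(X, w):
    return w[0] + X @ w[1:]

# Constant parameters: one regression over the whole time series
pred_const = combine(sims, fit_weights(sims, obs))

# Monthly varying parameters: 12 regressions, 12x as many parameters
pred_monthly = np.empty_like(obs)
for m in range(12):
    idx = months == m
    pred_monthly[idx] = combine(sims[idx], fit_weights(sims[idx], obs[idx]))

rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))
```

In sample, each monthly regression could always fall back to the constant solution, so the monthly fit's training RMSE is guaranteed to be at most that of the constant fit; the question raised by the Feather basin results is whether that in-sample advantage survives out of sample under nonstationarity.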

### c. Model calibration

Calibration can also impact model collinearity. Ajami et al. (2006) used uncalibrated models to form their multimodel averages. Presumably uncalibrated models would tend to be less similar than calibrated models (which we have used in this study) and therefore multimodel averages formed from them would yield more benefit than multimodel averages of calibrated models. Indeed, had we specifically designed our calibration scheme to maximize the diversity of monthly model errors, or to minimize monthly bias (note that this is not the same as the “grand” bias of the entire time series), the relative performances of the bias correction and multimodel averaging might have been affected.

Nevertheless, regardless of the calibration scheme, our use of monthly data and monthly varying parameters would have had the same relative effect that we report above: the models become more collinear and the effectiveness of the multimodel average, relative to a bias correction, is diminished by temporal aggregation. Indeed, when we used the uncalibrated NLDAS parameters for the models in this study (not shown), the relative performance of bias correction and multimodel averaging was the same (indicating that the main component of error in the uncalibrated models was also monthly bias). Similarly, Shi et al. (2008) examined monthly streamflow forecasts based on the VIC model in several western U.S. river basins and found that bias correction was nearly as effective as model calibration in improving seasonal forecast skill. Furthermore, the multimodel techniques employed by Ajami et al. (2006) on hourly data did not have a large effect on correlations, even for uncalibrated models. In practice it may be difficult to find a sufficiently diverse set of models for which the multimodel average substantially outperforms a bias correction, at least in the context of seasonal streamflow forecasts.

### d. Differences between hydrologic and atmospheric multimodel ensembles

There are important differences between the hydrologic multimodel ensembles evaluated in our study and atmospheric multimodel ensembles such as those studied by Krishnamurti et al. (1999, 2000), and we think these differences explain why the multimodel ensemble approach failed to improve simulation and forecast accuracy in the hydrologic case. The most important is that, while atmospheric models simulate chaotic systems, and therefore are susceptible to large differences in model state even when initial states are similar, hydrologic models simulate damped systems, in which large perturbations in initial state exert diminishing influence over time. In particular, the interannual variability of simulated streamflows in snowmelt-dominated months should have a high correlation with that of observed flows because the snowpack state is essentially reset every summer. Each model’s snowpack therefore has little memory of the previous year’s state, but rather depends strongly on the current winter’s precipitation, which is a common input to all models. Thus, simulated discharge from multiple hydrologic models tends to be more collinear, at least in snowmelt-dominated months (as seen in Fig. 6), than output from multiple atmospheric simulations, reducing the potential forecast skill gains of multimodel averages of hydrologic models.

## 5. Conclusions

Multimodel methods have proven useful in many contexts for assessing model uncertainty (model spread) and reducing forecast error (multimodel average). Our results show, however, that in the context of seasonal hydrologic forecasting, multimodel averaging may be no more effective in reducing forecast errors than applying a monthly bias correction to a single model. For the basins we studied, the best individual bias-corrected model tended to outperform multimodel averages of raw models, in both retrospective simulations and ESP forecasts. While the Multiple Linear Regression (MLR) multimodel average of bias-corrected models performed slightly better than the best individual bias-corrected model, the differences in performance among all bias-corrected models and multimodel averages were relatively small. The Simple Model Average (SMA) yielded much smaller error reductions than MLR, and often performed worse than the best individual model. In the context of probabilistic ESP forecasts, these patterns showed some dependence on lead time in snowmelt-dominated months. Both bias correction and multimodel averaging reduced the RMSE of the ESP ensemble means at lead times of up to 6 months in snowmelt-dominated forecast months, with the reduction increasing as lead time decreased. Bias correction yielded greater improvements than multimodel averaging, but this difference diminished to zero as lead time increased from 1 to 6 months. In the nonsnowmelt-dominated forecast months, both methods reduced RMSE for all start months, with little dependence on lead time, with the exception of the Feather basin, where nonstationarity may have played a role in the poor performance of both methods.

The main reason for the success of bias correction appears to be that aggregating the data from daily to monthly time scales increases model collinearity, limiting the ability of multimodel averaging to reduce those components of model error that bias correction cannot address. This effect may be stronger in snowmelt-dominated basins than elsewhere, because the interannual variability of winter precipitation is a common input to all models. We also found that both bias corrections and multimodel averages using monthly varying parameters yielded much greater error reductions than methods using time-invariant parameters. The poorer performance of the monthly bias correction in the Feather basin, compared with a time-invariant multimodel average, indicates that monthly varying bias correction may be less robust in the face of nonstationarity than a constant-parameter multimodel average.

## Acknowledgments

This research was supported by the National Oceanic and Atmospheric Administration (NOAA) under Grant NA08OAR4320899 to the University of Washington. The authors also gratefully acknowledge insightful comments by Tara Troy of Princeton University, H. Gupta of the University of Arizona, and two anonymous reviewers.

## REFERENCES

Ajami, N. K., Duan, Q., Gao, X., and Sorooshian, S., 2006: Multimodel combination techniques for analysis of hydrological simulations: Application to Distributed Model Intercomparison Project results. *J. Hydrometeor.*, **7**, 755–768.

Anderson, E. A., 1973: National Weather Service River Forecast System—Snow accumulation and ablation model. NOAA Tech. Memo. NWS HYDRO-17, 217 pp.

Burnash, R. J. C., 1995: The NWS River Forecast System—Catchment modeling. *Computer Models of Watershed Hydrology*, V. P. Singh, Ed., Water Resources Publications, 311–366.

Burnash, R. J. C., Ferral, R. L., and McGuire, R. A., 1973: A generalized streamflow simulation system—Conceptual modeling for digital computers. Joint Federal and State River Forecast Center Tech. Rep., U.S. Department of Commerce, National Weather Service, and State of California, Department of Water Resources, 204 pp.

Chen, F., and Coauthors, 1996: Modeling of land-surface evaporation by four schemes and comparison with FIFE observations. *J. Geophys. Res.*, **101** (D3), 7251–7268.

Cherkauer, K. A., and Lettenmaier, D. P., 1999: Hydrologic effects of frozen soils in the upper Mississippi River basin. *J. Geophys. Res.*, **104** (D16), 19599–19610.

Day, G. N., 1985: Extended streamflow forecasting using NWSRFS. *J. Water Resour. Plan. Manage.*, **3** (2), 157–170.

Duan, Q., Ajami, N. K., Gao, X., and Sorooshian, S., 2007: Multi-model ensemble hydrologic prediction using Bayesian model averaging. *Adv. Water Resour.*, **30**, 1371–1386, doi:10.1016/j.advwatres.2006.11.014.

Gao, X., and Dirmeyer, P. A., 2006: A multimodel analysis, validation, and transferability study of global soil wetness products. *J. Hydrometeor.*, **7**, 1218–1236.

Georgakakos, K. P., Seo, D.-J., Gupta, H., Schaake, J., and Butts, M. B., 2004: Towards the characterization of streamflow simulation uncertainty through multimodel ensembles. *J. Hydrol.*, **298**, 222–241, doi:10.1016/j.jhydrol.2004.03.037.

Gneiting, T., and Raftery, A. E., 2005: Weather forecasting with ensemble methods. *Science*, **310**, 248–249, doi:10.1126/science.1115255.

Guo, Z., Dirmeyer, P. A., Gao, X., and Zhao, M., 2007: Improving the quality of simulated soil moisture with a multi-model ensemble approach. *Quart. J. Roy. Meteor. Soc.*, **133**, 731–747.

Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F., 2009: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modeling. *J. Hydrol.*, **377**, 80–91, doi:10.1016/j.jhydrol.2009.08.003.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T., 1999: Bayesian model averaging: A tutorial. *Stat. Sci.*, **14** (4), 382–417.

Koren, V., Schaake, J., Mitchell, K., Duan, Q. Y., Chen, F., and Baker, J. M., 1999: A parameterization of snowpack and frozen ground intended for NCEP weather and climate models. *J. Geophys. Res.*, **104**, 19569–19585.

Krishnamurti, T. N., Kishtawal, C. M., LaRow, T. E., Bachiochi, D. R., Zhang, Z., Williford, C. E., Gadgil, S., and Surendran, S., 1999: Improved weather and seasonal climate forecasts from multimodel superensembles. *Science*, **285**, 1548–1550.

Krishnamurti, T. N., Kishtawal, C. M., Zhang, Z., LaRow, T. E., Bachiochi, D. R., and Williford, C. E., 2000: Multimodel ensemble forecasts for weather and seasonal climate. *J. Climate*, **13**, 4196–4216.

Liang, X., Lettenmaier, D. P., Wood, E. F., and Burges, S. J., 1994: A simple hydrologically based model of land surface water and energy fluxes for general circulation models. *J. Geophys. Res.*, **99** (D7), 14415–14428.

Lohmann, D., Nolte-Holube, R., and Raschke, E., 1996: A large scale horizontal routing model to be coupled to land surface parameterization schemes. *Tellus*, **48A**, 708–721.

Lohmann, D., Raschke, E., Nijssen, B., and Lettenmaier, D. P., 1998: Regional scale hydrology: 1. Formulation of the VIC-2L model coupled to a routing model. *Hydrol. Sci. J.*, **43**, 131–141.

Maurer, E. P., Wood, A. W., Adam, J. C., Lettenmaier, D. P., and Nijssen, B., 2002: A long-term hydrologically based dataset of land surface fluxes and states for the conterminous United States. *J. Climate*, **15**, 3237–3251.

Mitchell, K. E., and Coauthors, 2004: The multi-institution North American Land Data Assimilation System (NLDAS): Utilizing multiple GCIP products and partners in a continental distributed hydrological modeling system. *J. Geophys. Res.*, **109**, D07S90, doi:10.1029/2003JD003823.

Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. *Mon. Wea. Rev.*, **116**, 2417–2424.

Pagano, T., Garen, D., and Sorooshian, S., 2004: Evaluation of official western U.S. seasonal water supply outlooks, 1922–2002. *J. Hydrometeor.*, **5**, 896–909.

Shi, X., Wood, A. W., and Lettenmaier, D. P., 2008: How essential is hydrologic model calibration to seasonal streamflow forecasting? *J. Hydrometeor.*, **9**, 1350–1363.

Snover, A. K., Hamlet, A. F., and Lettenmaier, D. P., 2003: Climate-change scenarios for water planning studies. *Bull. Amer. Meteor. Soc.*, **84**, 1513–1518.

Tang, Y., Reed, P., Wagener, T., and van Werkhoven, K., 2007: Comparing sensitivity analysis methods to advance lumped watershed model identification and evaluation. *Hydrol. Earth Syst. Sci.*, **11**, 793–817.

Viney, N. R., and Coauthors, 2009: Assessing the impact of Land Use Change on Hydrology by Ensemble Modeling (LUCHEM) II: Ensemble combinations and predictions. *Adv. Water Resour.*, **32**, 147–158, doi:10.1016/j.advwatres.2008.05.006.

Vrugt, J. A., and Robinson, B. A., 2007: Treatment of uncertainty using ensemble methods: Comparison of sequential data assimilation and Bayesian model averaging. *Water Resour. Res.*, **43**, W01411, doi:10.1029/2005WR004838.

Wang, A., Bohn, T. J., Mahanama, S. P., Koster, R. D., and Lettenmaier, D. P., 2009: Multimodel ensemble reconstruction of drought over the continental United States. *J. Climate*, **22**, 2694–2712.

Weigel, A. P., Liniger, M. A., and Appenzeller, C., 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? *Quart. J. Roy. Meteor. Soc.*, **134**, 241–260.

Winter, C. L., and Nychka, D., 2009: Forecasting skill of model averages. *Stochastic Environ. Res. Risk Assess.*, **24**, 633–638, doi:10.1007/s00477-009-0350-y.

Wood, A. W., Maurer, E. P., Kumar, A., and Lettenmaier, D. P., 2002: Long-range experimental hydrologic forecasting for the eastern United States. *J. Geophys. Res.*, **107**, 4429, doi:10.1029/2001JD000659.

Yapo, P. O., Gupta, H. V., and Sorooshian, S., 1998: Multi-objective global optimization for hydrologic models. *J. Hydrol.*, **204**, 83–97.