Assessing uncertainties in hydrologic models can improve accuracy in predicting future streamflow. Here, simulated streamflows using the Variable Infiltration Capacity (VIC) model at coarse (1/16°) and fine (1/120°) spatial resolutions were evaluated against observed streamflows from 217 watersheds. In particular, the adequacy of VIC simulations in groundwater- versus runoff-dominated watersheds was examined using a range of flow metrics relevant for water supply and aquatic habitat. These flow metrics were 1) total annual streamflow; 2) total fall, winter, spring, and summer season streamflows; and 3) the 5th, 25th, 50th, 75th, and 95th flow percentiles. The effect of climate on model performance was also evaluated by comparing the observed and simulated streamflow sensitivities to temperature and precipitation. Model performance was evaluated using four quantitative statistics: nonparametric rank correlation ρ, normalized Nash–Sutcliffe efficiency NNSE, root-mean-square error RMSE, and percent bias PBIAS. The VIC model captured the sensitivity of streamflow to temperature better than to precipitation, but the simulated sensitivities were in poor agreement with the corresponding temperature and precipitation sensitivities derived from observed streamflow. The model was able to capture the hydrologic behavior of the study watersheds with reasonable accuracy. Both total streamflow and flow percentiles, however, were subject to strong systematic model bias. For example, summer streamflows were underpredicted (PBIAS = −13%) in groundwater-dominated watersheds and overpredicted (PBIAS = 48%) in runoff-dominated watersheds. Similarly, the 5th flow percentile was underpredicted (PBIAS = −51%) in groundwater-dominated watersheds and overpredicted (PBIAS = 19%) in runoff-dominated watersheds. These results provide a foundation for improving model parameterization and calibration in ungauged basins.
Climate changes anticipated over the next few decades pose challenges to resource managers seeking the most effective strategies to adapt, maintain, and restore rivers, watersheds, and aquatic ecosystems. Because water resources are particularly sensitive to changes in climate, managers benefit from accurate analyses of historical streamflows and predictions of future hydrologic behavior. Accurate estimation of runoff, especially during dry seasons, is critical for planning hydroelectric power generation (Hamlet et al. 2010), agriculture and municipal water supply (Roy et al. 2012), aquatic habitat (Battin et al. 2007), and water-based recreation (Farley et al. 2011). Both empirical and numerical models have been routinely used for predicting future streamflows and improving understanding of hydrological functioning at varying spatial and temporal scales. In large watershed– and regional-scale studies, land surface models (LSMs) such as the catchment model (Koster et al. 2000), Community Land Model (Oleson et al. 2010), Noah model (Ek et al. 2003), Sacramento Soil Moisture Accounting Model (Burnash et al. 1973), Unified Land Model (Livneh et al. 2011), and Variable Infiltration Capacity (VIC) model (Liang et al. 1994) are commonly used (Koster et al. 2010; Nijssen et al. 2014; Vano et al. 2012; Wang et al. 2009; Xia et al. 2012, 2014). In the U.S. Pacific Northwest (PNW), the large-scale VIC model has been widely employed to study regional-scale changes in snowpack (Hamlet et al. 2005), water resources (Hamlet et al. 2007; Liu et al. 2013), droughts (Shukla and Wood 2008), and energy (Hamlet et al. 2010).
The VIC model, typically implemented at a spatial resolution of 1/8° or 1/16°, is calibrated and validated using observed or naturalized streamflow from large rivers (Hamlet et al. 2013; Matheussen et al. 2000). The number of watersheds used for calibration and validation has been quite variable but is generally limited to available gauged watersheds. One of the assumptions in calibrating the model against observed streamflows in large watersheds is that calibrated parameters are applicable to subwatersheds within the larger basins. This assumption may not be valid, however, in regions and basins with strong hydrogeological differences, thereby introducing some degree of uncertainty into model predictions at finer spatial scales. Conversely, improving the topographic representation by increasing the model spatial resolution or calibrating the model over small watersheds may be adversely affected by errors in the meteorological driving data, resulting in a calibrated model with compensating errors, that is, getting the right answer for the wrong reasons. Examining the sources and magnitudes of uncertainty at the small watershed scale can help interpret and constrain predictions of direction, magnitude, and timing of future streamflow changes and thereby improve decision making.
Sources of hydrologic modeling uncertainty can be classified as parametric (Beven and Binley 1992; Duan et al. 2006) or structural (Butts et al. 2004; Refsgaard and Knudsen 1996). Parametric uncertainties are associated with the model input data and parameter values, whereas structural uncertainties are associated with the model formulation. Both parametric and structural uncertainties can be minimized, but this may require different strategies for each type in terms of model selection, forcing, calibration, and parameterization. Identifying the major sources of uncertainties and distinguishing which of these are due to model forcing, parameter estimation, and/or model structure is fundamental to minimizing uncertainties (Beven and Freer 2001; McMichael et al. 2006). Various techniques [e.g., Generalized Likelihood Uncertainty Estimation (GLUE), bootstrapping, Monte Carlo–based methods, Bayesian methods, and machine learning] have been developed and utilized for model uncertainty analysis (Beven 2011; Shrestha 2010). These techniques can be implemented within many different parameter spaces and model structures (Butts et al. 2004; Clark and Vrugt 2006; Gupta et al. 1998; Jin et al. 2010; Shen et al. 2012). Despite the scientific merits of exploring parameter space and tradeoffs between various model structures, such an approach is computationally intensive at a regional scale such as the PNW. In fact, for an LSM such as VIC, full hydrologic calibration at a small scale is extremely resource intensive (Oubeidillah et al. 2014), and model calibration is often restricted to a subset of large basins (Hamlet et al. 2013) or grid cells (Troy et al. 2008). Even when high-performance supercomputing is available, exhaustive calibrations and validations of LSMs have to rely on assimilated (Oubeidillah et al. 2014) or naturalized (Hamlet et al. 2013; Vano et al. 2012) streamflow time series because of the lack of unregulated stream gauges.
Given all of these limitations, it becomes important to evaluate and assess whether any model inherits systematic biases, whether these are more prevalent in some landscapes than others, and whether these biases can be reduced to improve model performance. Any evaluation of bias should also address how the choice of model (Vano et al. 2012), meteorological data (Elsner et al. 2014), or even parameterization scheme (Tague et al. 2013) affects model behavior.
Here, we examine the source of a range of parametric and structural uncertainties associated with the VIC model. We focus on parametric uncertainties associated with the scale of model resolution, potential biases in meteorological forcing variables, and structural uncertainties introduced by how the model handles watersheds that are dominated by either runoff or groundwater flow paths. We emphasize the latter because runoff- and groundwater-dominated watersheds have been shown to respond quite differently to climate change, and ensuring adequate representation of watersheds with different runoff dynamics is vital for accurate streamflow forecasting (Safeeq et al. 2013; Tague and Grant 2009; Tague et al. 2008, 2013; Waibel et al. 2013). Thus, we have two overarching questions: 1) Can we improve model accuracy by increasing topographic representation and hence theoretically better capturing hillslope-scale processes? and 2) Are model uncertainties consistent across watersheds in a geologically heterogeneous landscape such as the PNW? In this study, we focus on the VIC model because of its increasing use in water resource assessment and planning in the Pacific Northwest. However, the issue of deep-groundwater representation is not limited to VIC alone. Explicit representation of deep groundwater is not a part of any LSM and is approximated instead by extended soil profiles (Vano et al. 2012).
The VIC model conceptualizes infiltration, surface, and subsurface flow processes as occurring within a soil column made up of two or more (typically three) layers. The top soil layer controls soil infiltration and surface runoff via the variable infiltration curve, whereas base flow processes are controlled primarily by the lowest soil layer. The VIC model does not explicitly mimic the movement of water into and out of deep groundwater. Rather, the formulation of base flow in VIC follows the conceptual ARNO rainfall–runoff model (Todini 1996), which relates base flow to soil moisture in the lowest soil layer. The ARNO base flow curve is linear under low soil moisture and becomes nonlinear toward saturation. This results in a rapid base flow response in wet conditions and a relatively slower response under dry conditions. Movement of water into and out of shallow groundwater can be reasonably represented by increasing the bottom soil layer depth, and hence the residence time, or by coupling VIC with a groundwater flow model. The first approach can reproduce groundwater dynamics but requires site-specific calibration of bottom soil layer depth and related drainage parameters, a difficult task at the regional scale because of data limitations. On the other hand, the use of coupled surface water–groundwater models is computationally inefficient and has shown limited success (Jin and Sridhar 2010; Rosenberg et al. 2013). As a result, capturing groundwater dynamics in areas with deep-groundwater-fed streams, such as the Oregon High Cascades, remains a challenge. Indeed, simulating climate change scenarios without accounting for deep-groundwater influences may lead to predictions of greater relative decline in summer streamflow (Tague et al. 2008). Such biased predictions of streamflow can potentially affect decisions and adaptation plans for future water scenarios.
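As an illustration of the base flow formulation described above, the ARNO-type curve can be sketched in a few lines. This is an illustrative sketch, not VIC source code; the parameter names (Dsmax, Ds, Ws) follow common VIC usage, and the exact functional form in any given VIC release may differ.

```python
def arno_baseflow(w, w_max, d_max, d_s, w_s):
    """ARNO-type base flow curve (after Liang et al. 1994; Todini 1996):
    linear drainage below a soil moisture threshold, quadratic above it.

    w     : soil moisture in the lowest soil layer (mm)
    w_max : maximum soil moisture of the lowest layer (mm)
    d_max : maximum base flow rate, Dsmax (mm/day)
    d_s   : fraction of d_max where nonlinear base flow begins (Ds)
    w_s   : fraction of w_max where nonlinear base flow begins (Ws)
    """
    linear = (d_s * d_max / (w_s * w_max)) * w  # slow, linear response when dry
    if w <= w_s * w_max:
        return linear
    # rapid, nonlinear response as the layer approaches saturation
    excess = (w - w_s * w_max) / (w_max - w_s * w_max)
    return linear + (d_max - d_s * d_max / w_s) * excess ** 2
```

The curve is continuous at the threshold w = w_s * w_max and reaches d_max at saturation, which produces the slow-when-dry, rapid-when-wet response described in the text.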
The role of streamflow contributions from deep groundwater and the sensitivity of streamflow to precipitation and temperature under different geological regimes have not yet been tested for the VIC model. Some of the limitations of a regional-scale application of the VIC model surfaced in an earlier study using 55 streamflow gauges across the PNW (Wenger et al. 2010). However, the focus of this previous study was to evaluate model performance in terms of ecologically relevant flow metrics. In the present study, the objective is to quantify modeling uncertainties in hydrologic predictions due to both geological and climatic factors, with the goal of improving predictions of streamflow in the PNW and elsewhere. Specifically, we examined hydrological predictions using the VIC model in 217 watersheds located across Oregon (OR) and Washington (WA) in the PNW region of the United States. We explored uncertainties in 1) predicted total streamflow at annual and seasonal time scales, 2) five percentiles calculated based on predicted daily streamflows, and 3) predicted annual and seasonal streamflow sensitivities to a change in temperature and precipitation. Model performance evaluations at annual and seasonal time scales are useful for water resource assessment under climate change, whereas model evaluations using daily flow metrics (Olden and Poff 2003; Wenger et al. 2010) are useful for characterizing the entire hydrograph and assessing uncertainties in future ecological and in-stream flow requirements. Additionally, temperature- and precipitation-based hydrologic sensitivity metrics are useful for forecasting water resource vulnerability under climate change (Vano and Lettenmaier 2014; Vano et al. 2012). Our findings over a range of temporal scales help demonstrate under which circumstances the VIC model can be applied with confidence and point to future improvements for model predictions at the local/regional scale.
We obtained daily streamflow time series from 217 unregulated watersheds from the U.S. Geological Survey (USGS; U.S. Geological Survey 2013) and the Oregon Water Resources Department (Water Resources Department 2013; Fig. 1). These watersheds are part of the USGS Hydro-Climatic Data Network (HCDN; Slack et al. 1993) and the recently updated Geospatial Attributes of Gages for Evaluating Streamflow (GAGES) network (Falcone et al. 2010). Mean watershed elevation ranged from 106 to 2273 m MSL. Drainage areas for most (~75%) of the 217 watersheds were less than 500 km2 (Fig. 2a). All selected watersheds had a minimum record length of 20 years of complete daily streamflow during the span of water years (wy) from 1950 to 2006. Among the 217 watersheds used to evaluate the model performance, 21% (n = 45) of the stream gauges have daily streamflow that spans the entire 57-yr period (1950–2006) and 68% (n = 148) have more than 30 years of streamflow record (Fig. 2b). The total number of stream gauges during any given year varied between 121 and 189; their spatial distribution was defined by data availability, with most of them located on the western side of the Cascade Mountains. A majority (~70%) of the watersheds are located between 500 and 1500 m mean elevation (Fig. 2c). The average precipitation ranges from <1 to as much as 4.5 m (Fig. 2d). The numbers of watersheds classified as early (flow timing <150, or before 17 February), intermediate (150 ≤ flow timing ≤ 200), and late (flow timing >200, or after 18 April) hydrologic regimes were 92, 100, and 25, respectively (Fig. 2e). Watersheds classified as early, intermediate, and late hydrologic regimes represent rain, mixed rain and snow, and snow-dominated streams, respectively (Wenger et al. 2010).
2) VIC modeling
We used simulated streamflow from the VIC model applied at 1/16° (~6 km × 6 km) spatial resolution over the period 1950–2006 from Hamlet et al. (2013). Daily minimum and maximum temperatures, precipitation data from the National Climatic Data Center and Environment Canada, and wind speed data from reanalysis products (Hamlet and Lettenmaier 2005; Kalnay et al. 1996) were gridded at 1/16° spatial resolution as described by Elsner et al. (2010), based on the techniques developed by Maurer et al. (2002) and Hamlet and Lettenmaier (2005). The model was calibrated and validated on a monthly time step utilizing streamflow data from 11 major watersheds within the Columbia River basin, following the approach of Yapo et al. (1998). The Nash–Sutcliffe efficiency NSE for the 11 major watersheds ranged between 0.74 and 0.89 during calibration periods and between 0.68 and 0.93 during validation periods. The calibrated model parameters were further validated at 80 streamflow gauging stations in the Columbia River basin, where NSE ranged from <0 to over 0.9. Further description of the model calibration and validation procedure can be found in Hamlet et al. (2013). In addition to VIC modeling at 1/16°, we also utilized simulated streamflow from the most recent VIC implementation in the Columbia River basin at a fine spatial resolution (1/120°, or about 800 m × 800 m) over the period 1950–2006. This 1/120°-resolution model was built primarily to better capture finescale snow dynamics by providing more realistic radiative forcing at local scales. However, climate, soil, and vegetation forcing variables for the 1/120° implementation were resampled from those developed at 1/16°.
The VIC model simulates infiltration, runoff, and base flow processes based on empirically derived relationships that characterize the average gridcell condition (Liang et al. 1994). To contrast simulated streamflow with observed values, we estimated simulated watershed streamflow by summing the daily runoff and base flow values over all VIC grid cells within each watershed boundary, weighting whole and partial cells by the fraction of their area inside the boundary. No channel routing algorithm was employed in this analysis, and we assumed that all the runoff exits the watershed on the same day. This assumption is not an issue for comparisons of annual and seasonal streamflow but could be problematic for flow percentiles based on daily flows. However, since the majority (n = 158) of the selected watersheds are small (<500 km2; Fig. 2a), the influence of channel routing on model performance is likely to be small compared with other modeling uncertainties.
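The cell aggregation described above amounts to an area-weighted sum of runoff and base flow; a minimal sketch follows (the function name and array layout are our own assumptions):

```python
import numpy as np

def watershed_streamflow(runoff, baseflow, cell_areas):
    """Aggregate daily VIC gridcell runoff and base flow (mm/day, arrays of
    shape n_days x n_cells) to a watershed-average streamflow depth.

    cell_areas holds the area of each cell lying inside the watershed
    boundary, so partial cells contribute in proportion to their overlap.
    """
    w = np.asarray(cell_areas, dtype=float)
    w = w / w.sum()  # area weights, summing to 1 over the watershed
    total = np.asarray(runoff, float) + np.asarray(baseflow, float)
    return total @ w  # daily watershed streamflow depth (mm/day)
```

No routing is applied, matching the same-day-exit assumption stated in the text.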
b. Model evaluation metrics
To explore seasonal and annual biases in model performance, observed and simulated daily streamflow data were aggregated into seasonal and annual time series on a water-year (October–September) basis. Seasons were defined as fall [October–December (OND)], winter [January–March (JFM)], spring [April–June (AMJ)], and summer [July–September (JAS)]. Hereafter, the total streamflows are referred to as Qwy for the water year and QOND, QJFM, QAMJ, and QJAS for the fall, winter, spring, and summer seasons, respectively. In addition, five streamflow percentiles were used to characterize the modeling uncertainty in matching the overall hydrologic regime of the watersheds. Low and moderately low streamflows were characterized by the annual 5th (Q5) and 25th (Q25) percentiles, respectively. Similarly, moderately high and high streamflows were characterized by the 75th (Q75) and 95th (Q95) percentiles, respectively. The annual 50th percentile (Q50) was used to characterize median streamflow.
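As a sketch, the water-year totals, seasonal totals, and annual percentiles defined above could be computed from a daily streamflow series as follows (a pandas sketch; the function name and return layout are illustrative):

```python
import pandas as pd

# Month -> season mapping from the text (the water year starts in October)
SEASON = {10: "OND", 11: "OND", 12: "OND", 1: "JFM", 2: "JFM", 3: "JFM",
          4: "AMJ", 5: "AMJ", 6: "AMJ", 7: "JAS", 8: "JAS", 9: "JAS"}

def flow_metrics(q):
    """Water-year totals (Q_wy), seasonal totals (Q_OND ... Q_JAS), and
    annual flow percentiles (Q5 ... Q95) from a date-indexed daily Series."""
    # October-December days belong to the *next* calendar year's water year
    wy = q.index.year + (q.index.month >= 10).astype(int)
    season = q.index.month.map(SEASON)
    annual = q.groupby(wy).sum()                       # Q_wy
    seasonal = q.groupby([wy, season]).sum()           # Q_OND ... Q_JAS
    pctl = q.groupby(wy).quantile([0.05, 0.25, 0.50, 0.75, 0.95])
    return annual, seasonal, pctl
```

Here Q5 for a given water year is the percentile series cross-sectioned at quantile 0.05, and likewise for the other percentiles.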
Uncertainties associated with the VIC-simulated streamflow were assessed by comparing the concurrent observed and simulated streamflows using four quantitative statistics for model performance: the nonparametric rank correlation coefficient ρ, the NSE, the root-mean-square error RMSE, and the percent bias PBIAS. The rank correlation is a nonparametric measure that shows the model's ability to reproduce the observed temporal patterns of interannual variability in streamflow. We used rank correlation instead of the Pearson product-moment correlation to specifically focus on evaluating the model performance in capturing interannual variability rather than the strength of the linear relationship between observed and simulated streamflow. Typically, values of Pearson correlation greater than 0.7 or a coefficient of determination greater than 0.5 are considered acceptable (Moriasi et al. 2007). Following this, we used a threshold of ρ greater than 0.7 for acceptable model performance. The NSE is a measure of overall goodness of fit between observed and simulated data, with NSE = 1 being the optimal value (1:1 relationship). Since the value of NSE ranges between −∞ and 1.0, we rescaled it between 0 and 1 as NNSE = 1/(2 − NSE) and refer to it hereafter as the normalized NSE or NNSE (Nossent and Bauwens 2012). While the optimal value of NNSE remains 1, a value of 0.5 corresponds to a value of 0 for the NSE. Model performance is considered satisfactory when NSE (NNSE) is greater than 0.5 (0.67) (Moriasi et al. 2007). We also used RMSE and PBIAS to quantify the magnitude of model error. RMSE provides the overall error, and PBIAS measures the average tendency of the simulated data to be larger (positive PBIAS) or smaller (negative PBIAS) than their observed counterparts (Gupta et al. 1999). The RMSE can be decomposed into its systematic component RMSEs and unsystematic component RMSEu using a linear regression (Willmott et al. 1985).
Also known as the linear bias, RMSEs is a measure of discrepancy between simulated and observed data caused by poor calibration, forcing errors, and/or unaccounted for processes in the model. The discrepancy between simulated and observed data caused by random processes is measured by RMSEu. When the ratio of RMSEs to RMSEu is greater than one, the RMSE is largely composed of systematic bias, which can potentially be removed through calibration. However, a ratio of RMSEs to RMSEu less than one indicates that the RMSE is largely composed of unsystematic or random bias, and further improvement in model performance will require model and forcing refinement. Optimal values of RMSE and PBIAS are zero, indicating accurate model prediction. Model performance is considered satisfactory when PBIAS is within ±25% (Moriasi et al. 2007).
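The four statistics and the RMSE decomposition can be sketched as follows, with sign and normalization conventions taken from the definitions above (a sketch; ties in the rank correlation are ignored for brevity, which is usually harmless for continuous streamflow data):

```python
import numpy as np

def _ranks(x):
    # simple 1..n ranks; ties are not averaged in this sketch
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def eval_metrics(obs, sim):
    """Rank correlation, NNSE, RMSE, PBIAS, and the Willmott et al. (1985)
    systematic/unsystematic RMSE decomposition for paired series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    rho = np.corrcoef(_ranks(obs), _ranks(sim))[0, 1]   # rank correlation
    nse = 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    nnse = 1.0 / (2.0 - nse)            # rescales (-inf, 1] to (0, 1]
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    pbias = 100.0 * np.sum(sim - obs) / np.sum(obs)  # positive = overprediction
    # regress simulated on observed values; the fitted part of the error is
    # systematic (removable by calibration), the residual part is random
    slope, intercept = np.polyfit(obs, sim, 1)
    sim_hat = intercept + slope * obs
    rmse_s = np.sqrt(np.mean((sim_hat - obs) ** 2))  # systematic component
    rmse_u = np.sqrt(np.mean((sim - sim_hat) ** 2))  # unsystematic component
    return {"rho": rho, "nnse": nnse, "rmse": rmse, "pbias": pbias,
            "rmse_s": rmse_s, "rmse_u": rmse_u}
```

A rmse_s/rmse_u ratio greater than one then flags dominantly systematic bias, as discussed in the text.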
c. Impact of climate variability on model performance
We assess the impact of climate variability on model performance by comparing the simulated and observed streamflow sensitivities to precipitation and temperature. Following Sankarasubramanian et al. (2001), we defined the precipitation sensitivity of streamflow SP as the percent change in total streamflow Qt over annual and seasonal time t divided by the percentage change in annual precipitation P:
$$S_P = \mathrm{median}\left[\frac{(Q_t - \bar{Q}_t)/\bar{Q}_t}{(P - \bar{P})/\bar{P}}\right],$$

where $\bar{Q}_t$ is the long-term sample mean of the streamflow total over a period t and $\bar{P}$ is the long-term sample mean of average annual precipitation. Similarly, the temperature sensitivity of streamflow ST (% °C−1) was defined as the percent change in total Qt over a period t per unit change in average mean daily temperature Tavg between October and June (°C). We omit the summer months from the temperature sensitivity analysis to specifically focus on evaluating the model performance during the snow accumulation and melt period, a particularly critical set of processes influencing model performance in this region. Much of the snowfall (>90%) in this region occurs between October and May (Knowles et al. 2006). Although over 50% of the snowpack melts by the end of April, the total snowmelt period can extend until late spring at some locations (results not shown). The temperature sensitivity of streamflow ST is given by
$$S_T = \mathrm{median}\left[\frac{(Q_t - \bar{Q}_t)/\bar{Q}_t}{T_{\mathrm{avg}} - \bar{T}_{\mathrm{avg}}}\right] \times 100,$$

where $\bar{T}_{\mathrm{avg}}$ is the long-term sample mean of Tavg. Both SP and ST were calculated using observed and simulated Qwy, QOND, QJFM, QAMJ, and QJAS. In the past, this sensitivity approach has been used at the annual time scale (Patil and Stieglitz 2012; Safeeq and Fares 2012; Vano et al. 2012). Comparing the additional seasonal-scale sensitivities with respect to change in annual precipitation and October–June temperature provides insight into how accurately the model represents the seasonal carryover of above (snow) and below (groundwater) land surface storage. Additionally, the metrics SP and ST can be used as a measure of model performance, irrespective of structural and parametric uncertainties, in characterizing streamflow under climate change.
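A minimal sketch of these sensitivity metrics, assuming the nonparametric median-of-ratios estimator in the spirit of Sankarasubramanian et al. (2001); the exact estimator used in the study may differ in detail, and years with precipitation or temperature exactly at the long-term mean would need screening to avoid division by zero:

```python
import numpy as np

def precip_sensitivity(q, p):
    """S_P: median over years of percent flow anomaly divided by percent
    precipitation anomaly (dimensionless elasticity)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.median(((q - q.mean()) / q.mean()) /
                           ((p - p.mean()) / p.mean())))

def temp_sensitivity(q, t):
    """S_T (% per deg C): median over years of percent flow anomaly per
    unit anomaly in October-June mean temperature."""
    q, t = np.asarray(q, float), np.asarray(t, float)
    return float(np.median(100.0 * ((q - q.mean()) / q.mean()) /
                           (t - t.mean())))
```

For example, a watershed whose flow anomalies track precipitation anomalies one-for-one yields S_P = 1, matching the interpretation in the text that a 1% precipitation increase produces a 1% flow increase.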
d. Impact of deep groundwater on model performance
We used the hydrograph recession constant K as a metric for evaluating the relative contribution of deep groundwater to streamflow. This metric effectively distinguishes the relative contribution of shallow versus deep groundwater (Safeeq et al. 2013). Following Vogel and Kroll (1992), an automated recession algorithm was employed to search all recession segments of 10 days or longer in the historical record of daily streamflow. The peak and end of each recession segment were defined as the points when the 3-day moving average of streamflow began to recede and rise, respectively. The beginning of recession (inflection point) was identified following the method of Arnold et al. (1995). To minimize the effects of snowmelt, recession segments between the onset of the snowmelt-derived streamflow pulse and 15 August were excluded. Days of snowmelt pulse onset were determined following the method of Cayan et al. (2001), with mean flow calculated for calendar days 9–248. Similar to Vogel and Kroll (1992), spurious observations were avoided by only accepting pairs of receding streamflow (Qt, Qt−1) when Qt > 0.7Qt−1. The recession constant K is given by
$$K = \exp\left[\frac{1}{m}\sum_{t=1}^{m}\ln\left(\frac{Q_t}{Q_{t-1}}\right)\right],$$

where m is the total number of pairs of consecutive daily streamflows Qt−1 and Qt at each site. The K parameter ranges between 0 and 1, representing the lowest and highest possible groundwater contribution, respectively. Watersheds were divided into four groups (K1–K4) based on quartiles of K: 0.810 ≤ K < 0.941 (K1), 0.941 ≤ K < 0.950 (K2), 0.950 ≤ K < 0.963 (K3), and 0.963 ≤ K ≤ 0.989 (K4). The quartile technique was chosen over other commonly used grouping techniques (e.g., k-means clustering) for simplicity and to keep the sample size consistent between K groups. The distribution of K among the 217 watersheds is shown in Fig. 2f.
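The Vogel and Kroll (1992) estimator above can be sketched as follows. Only the receding-pair filters are applied in this sketch; the full procedure in the text also restricts the search to recession segments of 10 days or longer and screens out the snowmelt period:

```python
import numpy as np

def recession_constant(q):
    """Estimate the recession constant K from daily streamflow as
    K = exp(mean(ln(Q_t / Q_{t-1}))) over accepted receding pairs."""
    q = np.asarray(q, float)
    q1, q0 = q[1:], q[:-1]  # Q_t and Q_{t-1}
    # accept only receding pairs, rejecting spuriously steep drops
    ok = (q1 < q0) & (q1 > 0.7 * q0) & (q1 > 0)
    return float(np.exp(np.mean(np.log(q1[ok] / q0[ok]))))
```

For a purely exponential recession Q_t = K * Q_{t-1}, the estimator recovers K exactly; K near 1 indicates slow recession and hence a strong deep-groundwater contribution.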
A summary of watershed characteristics and hydrologic conditions across all study watersheds and under different K regimes is presented in Table 1. The K4 watersheds have higher drainage areas, base flow indices, and mean watershed elevations; lower annual precipitation; and colder temperatures (Tavg) as compared to K1, K2, and K3. As a result of slower hydrograph recession (i.e., higher groundwater contribution), low flows (Q5 and Q25) increase and high flows (Q75 and Q95) diminish between watershed groups K1 and K4. The centroid of flow timing is nearly one month earlier in K1 and K2 watersheds as compared to K4. The model evaluation metrics under different K regimes were compared using the Kruskal–Wallis one-way analysis of variance (ANOVA) on ranks. When the ANOVA revealed statistically significant differences (p < 0.05), a post hoc Dunn's multiple comparison test was used to determine which K regimes differed at a significance level of 0.05.
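The group comparison described above can be sketched with SciPy's Kruskal–Wallis test on hypothetical metric values for the four K groups (illustrative data only; Dunn's post hoc test is not in SciPy, though third-party packages such as scikit-posthocs provide it):

```python
from scipy.stats import kruskal

# Hypothetical NNSE values for the four K groups (illustrative data only)
k1 = [0.41, 0.45, 0.38, 0.50, 0.44]
k2 = [0.52, 0.55, 0.49, 0.58, 0.53]
k3 = [0.61, 0.66, 0.59, 0.64, 0.62]
k4 = [0.72, 0.75, 0.70, 0.78, 0.74]

# Rank-based one-way ANOVA: does at least one K group differ?
h_stat, p_value = kruskal(k1, k2, k3, k4)
significant = p_value < 0.05  # threshold used in the text
```

Only when `significant` is true would the pairwise post hoc comparisons be run, mirroring the two-stage procedure described above.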
e. Impact of meteorological forcing
Since meteorological data for the VIC model are generated through spatial interpolation of irregularly spaced point measurements, this interpolation adds a potential source of uncertainty, especially at higher elevations where point measurements are scarce. To attempt to quantify this uncertainty, we compared the VIC meteorological forcing data (i.e., precipitation and temperature) and simulated snow water equivalent (SWE) against independent observations from the Natural Resources Conservation Service (NRCS) Snow Telemetry (SNOTEL) sites. Although comparing gridded data with point measurements can be somewhat misleading, it nonetheless provides useful information about the potential errors in meteorological driving data (and particularly precipitation data) at small scales. Daily precipitation, maximum and minimum temperatures, and SWE data were downloaded from 148 sites (70 in OR and 78 in WA). We also added data from three [Climatic Station at Watershed 2 (CS2MET), Primary Meteorological Station (PRIMET), and Hi-15 Meteorological Station (H15MET)] additional meteorological sites at the H.J. Andrews Experimental Forest. Many of these sites only extend from 1978 to present and do not have concurrent meteorological records. Hence, we have only included the sites with at least 10 years of data. This criterion resulted in 109 stations with daily precipitation, 34 stations with maximum temperature, 31 stations with minimum temperature, and 106 stations with daily SWE data.
a. Total streamflow and flow percentiles
The VIC model performed well in capturing the interannual variability in observed annual and seasonal total streamflow, as measured by rank correlation coefficients (i.e., ρ; Fig. 3a). The median value of ρ from the 217 individual watersheds was highest for the annual time scale and diminished as seasons progressed from fall to summer (Table 2). Additionally, the interquartile range of ρ from individual watersheds was smallest for Qwy and relatively large for QJFM and QJAS, showing greater variability in model performance between watersheds in the latter two cases. The percentage of watersheds with ρ > 0.7 decreased from 96% for Qwy to 73% for QJAS. This indicates that in 27% of the watersheds, interannual variability in QJAS was not satisfactorily captured by the model. The model performed better for high-flow percentiles as compared to low flows (Fig. 3b): the median value of ρ ranged from 0.92 for Q50 to 0.67 for Q5 (Table 2). The interquartile range was small for Q50 and increased toward the more extreme flows, increasing more for low- than high-flow percentiles (Fig. 3b). For example, the model performed poorly (ρ ≤ 0.7) in 29% of watersheds for Q25 and in 54% of watersheds for Q5, whereas only 4% of watersheds had similarly low correlations for Q50, 7% for Q75, and 14% for Q95. This indicates that the model consistently performed better in predicting the interannual variability in mean and high flows across all selected watersheds but underperformed in predicting low flows.
Rank correlations are useful for evaluating model sensitivities in terms of interannual variability but do not provide information on absolute model biases. In contrast, NNSE along with RMSE and PBIAS provide overall goodness of fit between observed and simulated hydrographs. The percentage of total watersheds with NNSE below 0.67 ranged from 50% for Qwy to 90% for QJAS, indicating strong disagreement between modeled and observed flows in a large number of watersheds. Similarly, strong disagreements between modeled and observed low-flow percentiles were also found (Table 2). The NNSE was below the 0.67 threshold in 93% of watersheds for Q25 and in 95% of watersheds for Q5, whereas only 51% of watersheds had similarly low NNSE for Q50, 55% for Q75, and 65% for Q95. As compared to ρ, the NNSE values were lower for both total flow and flow percentiles, indicating a systematic absolute model bias. This was confirmed by the proportionally higher RMSEs values as compared to RMSEu (Table 2). Although the median RMSE was large for Qwy (206.7 mm) and Q95 (2.5 mm), the corresponding PBIAS was small. The median absolute PBIAS was very good (<6%) for all total streamflows and flow percentiles except for Q5 (PBIAS = −26%). Although median PBIAS was satisfactory in the majority of cases, the range was quite variable and, in some cases, absolute PBIAS was larger than 25% (Fig. 3b). The percentage of watersheds with absolute PBIAS > 25% in predicting total flow increased from 23% for Qwy to 66% for QJAS. Similarly, the percentage of watersheds with absolute PBIAS > 25% in predicting flow percentiles increased from 23% for Q50 to 82% for Q5.
Increasing model spatial resolution from 1/16° to 1/120° alone resulted in no improvement in model performance (Fig. S1 in the supplemental material). This can be attributed primarily to the fact that both simulations were driven by the same vegetation and soil parameterization. Additionally, the effects of improved small-scale snow dynamics and evaporation resulting from more realistic radiative forcing at the 1/120° scale, as compared to 1/16°, may not be apparent at the watershed scale. Based on the Wilcoxon rank-sum test, a nonparametric test of whether two population distributions are the same, the effect of model resolution was only significant (p < 0.05) for PBIAS in Qwy, QAMJ, Q5, Q50, and Q75. As compared to the 1/16° representation, the 1/120°-resolution interquartile range in flow percentiles shifted toward more positive PBIAS. In the case of Q5, there was a 50% reduction in the magnitude of median PBIAS under the 1/120° simulations (median PBIAS = −13%) as compared to 1/16° (median PBIAS = −25%). A similar change was also observed for Qwy and QAMJ. For example, the median PBIAS in QAMJ went from an underestimation (−3%) at 1/16° to an overestimation (11%) at 1/120° (Table 2). Although some differences in model performance were statistically significant, there was no clear or substantial improvement from the finer-resolution simulation. Therefore, we only present the 1/16° VIC results compared with observed values unless otherwise noted.
b. Precipitation and temperature sensitivities
The comparisons of average observed and simulated precipitation sensitivity (i.e., SP) across the 217 watersheds showed consistently low correspondence, with only 18% of the variance in SP explained by the model (Fig. 4a). The median sensitivity of observed streamflow to precipitation (i.e., SP) across all 217 watersheds ranged from 0.87 in spring to 1.42 in fall (Table 1). In other words, an increase of annual precipitation by 1% resulted in a 1.42% increase in QOND. In comparison to observed SP, the median SP derived from simulated streamflow ranged from 0.90 during spring to 1.44 in fall. On a regional scale, the simulated median SP across the 217 watersheds were comparable to the observed SP in all seasons and water years (Table 1). However, at the individual watershed level, the model largely underpredicts SP, particularly in watersheds with observed SP > 2.0 (Fig. 4a). This disagreement between observed and simulated SP was within ±25% in the majority (50%–76%) of watersheds for annual, fall, and winter streamflows. In the spring and summer, only 43% and 26% of the watersheds had results within a ±25% error between simulated and observed SP. These results indicate that model performance in capturing SP was even lower for summer as compared to other seasons. Although the model performed less well for precipitation (Fig. 4), uncertainties about the magnitude and direction of future precipitation changes (Mote and Salathé 2010) make this aspect less critical in the PNW, but it may be a factor when modeling other regions.
Future changes in temperature are more certain in this region, and model performance in capturing ST was better than for SP across all time scales, as inferred by the higher coefficient of determination between observed and simulated ST (Fig. 4b). Overall model performance in simulating ST was nonetheless unsatisfactory, with only 33%–45% of the variance in observed ST explained by the VIC model. The model underpredicted ST by at least 25% in 50%–60% of watersheds and overpredicted it by at least 25% in 25%–28% of watersheds across all seasons. The median sensitivity of observed streamflow to temperature ST across all watersheds ranged from −8.19 (% °C−1) in winter to −0.34 (% °C−1) in spring (Table 1). The negative ST values indicate a decline in streamflow with increasing Tavg: an increase in Tavg of 1°C results in nearly an 8% decline in QJFM but only a 0.34% decline in QAMJ. On the annual time scale, a 1°C increase in Tavg results in a 2.6% decline in observed Qwy and a 2.2% decline in simulated Qwy. Although there was no major difference in model performance across the seasons and water year in simulating ST based on NNSE and RMSE values (Fig. 4b), the median simulated ST for QOND was significantly higher (Table 1).
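The exact estimator behind SP and ST is not reproduced in this section. A common formulation, assumed here purely for illustration, treats SP as a precipitation elasticity (the slope of ln Q against ln P across years) and ST as the percent change in streamflow per 1°C warming (the slope of ln Q against Tavg, times 100):

```python
import math

def _ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
            sum((a - mx) ** 2 for a in x))

def precip_sensitivity(p, q):
    """S_P: percent change in streamflow per 1% change in precipitation,
    estimated as the log-log regression slope (an elasticity).
    Illustrative formulation; may differ from the study's estimator."""
    return _ols_slope([math.log(v) for v in p], [math.log(v) for v in q])

def temp_sensitivity(t, q):
    """S_T: percent change in streamflow per 1 degree C rise in Tavg,
    estimated as 100 x the slope of ln Q against T."""
    return 100.0 * _ols_slope(t, [math.log(v) for v in q])
```

Under this formulation, SP = 1.42 for fall means a 1% precipitation increase maps to a 1.42% increase in QOND, and ST = −8.19% °C−1 for winter means a 1°C warming maps to roughly an 8% decline in QJFM.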
c. Impact of deep groundwater on model performance
1) Total streamflow and flow percentiles
Performance metrics among the four watershed groups revealed differences in model performance in different geological terrains (Fig. 5). In 80% of possible cases, the Kruskal–Wallis and post hoc Dunn’s multiple comparison tests showed statistically significant differences in model performance metrics for total streamflow and flow percentiles between two or more watershed groups (Table S1 in the supplemental material). The model performance in capturing interannual variability in QOND and QJFM based on ρ was significantly lower (p < 0.05) in as compared to watersheds. However, the opposite was true for QAMJ, where the model performed significantly (p < 0.05) better in as compared to watersheds. In terms of flow percentiles, the difference in ρ was significant only between and watersheds for Q25, Q50, and Q75, where the model performed better in the latter case. For Q5 the model performed significantly better in as compared to , , and . The NNSE in was significantly lower (p < 0.05) for QOND and Qwy as compared to and , respectively. Similarly, the NNSE in was significantly lower for QOND and QJFM as compared to and , respectively. The model performed poorly in as compared to watersheds for Q25, Q50, and Q95. The RMSE was significantly lower for as compared to or during Qwy, QOND, and QJFM, which is not surprising given the overall lower flow during these seasons in watersheds (Table 1). Similarly, the RMSE values for Q50, Q75, and Q95 in were significantly higher as compared to watersheds. However, despite overall higher QAMJ and QJAS in watersheds (Table 1), there was no statistical difference in RMSE values during these seasons among the different groups of watersheds. This was unexpected given that deep groundwater, which was not explicitly modeled by VIC, exerts a greater influence on QAMJ and QJAS than on streamflow in other seasons. Given this model limitation, the RMSE values in QAMJ and QJAS for were expected to be higher than in the watersheds.
However, although not statistically significant, the corresponding RMSE values in the two extreme groups were slightly higher than in the intermediate watersheds. This pattern was consistent for ρ and NNSE as well, with the model performing better overall in the intermediate as compared to the extreme groups.
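The group comparisons above rest on the Kruskal–Wallis test. A minimal sketch of its H statistic follows (omitting the tie correction and the post hoc Dunn's test, both of which the study would have applied via standard statistical tools):

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for k independent samples.
    Assumes all pooled values are distinct (no tie correction).
    groups: list of lists, e.g., a performance metric per watershed
    for each of the watershed groups."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    s = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * s - 3.0 * (n + 1)
```

H is compared against a chi-square distribution with k − 1 degrees of freedom; a significant H then justifies pairwise post hoc comparisons such as Dunn's test.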
The effect of deep groundwater in simulating QJAS and extreme flow percentiles (i.e., Q5, Q25, and Q95) was most evident in terms of PBIAS. The model significantly (p < 0.05) overpredicted QJAS (median PBIAS = 48%) in runoff-dominated watersheds and underpredicted it (median PBIAS = −13%) in groundwater-dominated watersheds. Similarly, Q5 was significantly overpredicted (median PBIAS = 19%) in runoff-dominated and underpredicted (median PBIAS = −51%) in groundwater-dominated watersheds (Fig. 5). In contrast, Q95 was significantly underpredicted (median PBIAS = −17%) in runoff-dominated and overpredicted (median PBIAS = 11%) in groundwater-dominated watersheds. These results indicate that model performance was significantly influenced by the presence or absence of deep groundwater: simulated base flow recedes too quickly in groundwater-dominated watersheds and too slowly in runoff-dominated watersheds. Although the magnitude of error (both RMSE and PBIAS) was comparable among the different groups of watersheds, a systematic shift in the direction of error (over- or underestimation) based on groundwater influence could be problematic.
The spatial pattern among the watersheds with PBIAS < −25% in QJAS and Q5 was consistent with geological terrains in the PNW (Fig. 6). Most of the watersheds with PBIAS < −25% had high K values and were located along the Cascades. This was not surprising given that most of these watersheds are sourced from the High Cascades. However, there were also watersheds on the Olympic Peninsula in Washington and around the Wallowa Mountains in Oregon with PBIAS < −25%. These places do not have deep-groundwater systems like those in the Cascades, but they sustain relatively high summer base flows, presumably because of late-melting snowpacks.
2) Precipitation and temperature sensitivities
Comparisons of observed and simulated SP showed sharp differences in model performance between groups of watersheds classified by K (Fig. 4). The error in SP, calculated as the difference between simulated and observed SP, was significantly different (p < 0.05) for QOND and QJAS in groundwater-dominated as compared to runoff-dominated watersheds, with the model significantly overpredicting SP for QOND and QJAS in the former. In contrast, we did not see any significant influence of deep groundwater on model performance in terms of ST (Fig. 4b). Thus, while the temperature sensitivity of streamflow was insensitive to the presence or absence of deep groundwater, SP for fall and summer streamflow was strongly influenced by it.
d. Uncertainty in model meteorological forcing
Comparing the gridded (~6-km resolution) meteorological forcing and simulated SWE to point measurements at SNOTEL sites revealed that gridded precipitation compared reasonably well with measurements at seasonal and annual time scales, with average ρ and NNSE values larger than 0.67, except during spring (Table 3). Although average RMSE values ranged between 39 mm during summer and 209 mm on the annual time scale, average PBIAS remained <2%. The gridded precipitation values were slightly higher than measured values during fall, spring, and summer and lower during winter. The average RMSE for annual and seasonal precipitation (Table 3) was comparable to the corresponding values for total streamflow (Table 2). However, the PBIAS in precipitation and total streamflow at the seasonal time scale did not agree. For example, the model overpredicted winter flows despite a negative PBIAS in winter precipitation, while the opposite (underprediction) was true for the spring season. This was not surprising given the seasonal carryover of precipitation in the form of groundwater and SWE. As opposed to RMSE in streamflow (Table 2), the RMSE in precipitation was equally composed of systematic and unsystematic components (Table 3). The comparisons of gridded and measured maximum and minimum temperatures showed strong bias, with NNSE values less than 0.67 and RMSE ranging between 1.7° and 2.7°C (Table 3). As opposed to precipitation, the RMSE in temperature was mostly systematic.
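The split of RMSE into systematic and unsystematic components referenced here (and later as the ratio RMSEs/RMSEu) is conventionally computed from the ordinary least squares fit of simulated on observed values (after Willmott 1981); a sketch:

```python
import math

def rmse_decomposition(obs, sim):
    """Split RMSE into systematic (RMSEs) and unsystematic (RMSEu)
    parts using the OLS fit of simulated on observed values.
    The parts satisfy RMSE^2 = RMSEs^2 + RMSEu^2 exactly."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    b = (sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) /
         sum((o - mo) ** 2 for o in obs))
    a = ms - b * mo
    fit = [a + b * o for o in obs]  # simulated values "explained" by obs
    rmse_s = math.sqrt(sum((f - o) ** 2 for f, o in zip(fit, obs)) / n)
    rmse_u = math.sqrt(sum((s - f) ** 2 for s, f in zip(sim, fit)) / n)
    return rmse_s, rmse_u
```

A ratio RMSEs/RMSEu > 1 indicates that most of the error is systematic, and therefore potentially removable by recalibration or bias correction, rather than random scatter.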
The average ρ between observed and simulated SWE was higher than the 0.7 threshold for good model performance in terms of ρ (Table 3). However, average NNSE values were all below the 0.6 threshold for good performance at all four time scales, indicating large absolute bias. The model underpredicted average SWE during fall and winter and overpredicted it during spring. Since most of the SNOTEL sites used in this study are located at elevations above the average elevation of (90%) and (66%), it was not possible to disentangle the contribution of SWE bias to streamflow bias between runoff- and groundwater-dominated watersheds. However, the large underprediction of QJAS in groundwater-dominated watersheds (Fig. 5), despite an overprediction of spring SWE and higher gridded spring and summer precipitation as compared to those measured at SNOTEL sites (Table 3), confirms the importance of considering groundwater contributions when simulating streamflow.
Our analysis of VIC model performance, comparing time series of total streamflow and flow percentiles of observed and simulated streamflow, revealed both strengths and weaknesses in the model that are important to understand for successful model applications. The model performed reasonably well in capturing interannual variability in observed streamflow (as reflected by relatively high ρ), which gives confidence in using the model to estimate and simulate hydrologic trends for climate change assessments (Hamlet et al. 2007; Hidalgo et al. 2009). However, performance was poorer in predicting the magnitude and interannual variability of observed low flows (i.e., Q5 and Q25) and total summer streamflow (i.e., QJAS). This poorer performance could be problematic given the importance of summer flows for aquatic organisms (Arismendi et al. 2013; Beer and Anderson 2013) and municipal water supply (Barnett et al. 2005). Although routing was not explicitly included in the streamflow simulation, we explored its possible effect on model performance (Fig. 3). We found that in large watersheds (drainage area >500 km2), where the lack of routing was expected to affect model performance the most, both low flow (Q5) and peak flow (Q95) were better predicted than in small watersheds (drainage area <500 km2). The effect of routing was evident for Q25, where model performance deteriorated in large watersheds (median PBIAS = 12%) as compared to small watersheds (median PBIAS = −7%). However, since the majority of our studied watersheds were <500 km2 (Fig. 2a), the effect of routing on model performance for our study sites is expected to be minimal. We also found no improvement in model performance for low-flow metrics (Q5 and Q25) as a result of site-specific calibration and validation; peak flows (Q95), however, were better predicted in calibrated watersheds.
This indicates that reducing model uncertainty in low flows will require a different calibration strategy. Not only was VIC calibrated at the monthly time scale, but the NSE used as the calibration objective is also biased toward the peak-flow portion of the hydrograph rather than the low-flow portion.
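The NSE's squared-error form is dominated by the largest residuals, which on a hydrograph occur at high flows. One common low-flow-oriented alternative (an illustrative option, not one this study tested) is to evaluate the same statistic on log-transformed flows:

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency; squared errors emphasize high flows."""
    mo = sum(obs) / len(obs)
    return 1.0 - (sum((o - s) ** 2 for o, s in zip(obs, sim)) /
                  sum((o - mo) ** 2 for o in obs))

def log_nse(obs, sim):
    """NSE on log-transformed flows, emphasizing the low-flow limb.
    Requires strictly positive flows."""
    return nse([math.log(v) for v in obs], [math.log(v) for v in sim])
```

For a series where the two smallest flows are doubled by the model but the peak is matched exactly, NSE stays near 1 while the log-flow NSE drops noticeably, which is why calibrating against the log variant (or a weighted combination of both) shifts the objective toward the low-flow limb.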
The strong disagreement between observed and simulated sensitivities to changes in both precipitation and temperature does not necessarily imply poor model performance. As shown by Elsner et al. (2014), the choice of gridded meteorological dataset can influence the streamflow sensitivity to changes in climate. In this study, observed sensitivities were calculated using gridded precipitation and temperature data. However, small-scale variability in precipitation and temperature may not be accurately captured in these datasets, leading to biased estimates of the observed sensitivities SP and ST. This alone could cause substantial disagreement between observed and simulated sensitivities, more so for precipitation than for temperature; thus, the higher uncertainties in precipitation as compared to temperature sensitivities are not surprising (Fig. 4). Although quantifying the error associated with observed sensitivities is challenging in the absence of an independent, landscape-level record of precipitation and temperature variability, it would be worth examining how the choice of gridded meteorological data affects observed SP and ST.
The model showed strong systematic bias in both runoff- and groundwater-dominated watersheds, especially for total summer streamflow and the low-flow percentiles. Although the magnitudes of RMSE in runoff- and groundwater-dominated watersheds were comparable for both Q5 and QJAS, the difference in PBIAS was statistically significant. Most importantly, the contrasting (positive and negative) PBIAS in the prediction of Q5 and QJAS highlights the varying level of uncertainty in model predictions between runoff- and groundwater-dominated watersheds: PBIAS in Q5 and QJAS shows an overall underestimation in groundwater-dominated watersheds as opposed to an overestimation in runoff-dominated watersheds. In general, the model tends to perform better in the intermediate watersheds across all model performance metrics. This is somewhat contrary to the findings of Wenger et al. (2010), who showed diminishing model performance with increasing groundwater contribution. However, considering that the model neither accurately captures groundwater dynamics nor is capable of producing zero (or near-zero) flows, the corresponding under- and overestimation of low-flow percentiles and total summer streamflow is not surprising. In either extreme case, where streams either gain water from groundwater inflow or lose water by outflow to groundwater, base flow recession was not accurately captured by the model. Under these circumstances, increasing the model resolution alone may provide better topographic representation of the landscape but will not necessarily improve model performance. On the other hand, coupling a groundwater model to VIC is not a feasible option because of the large computational and time requirements and because of the extensive amount of data needed for groundwater parameterization; most of these data are typically not available at a regional scale. Also, as pointed out by Wenger et al. (2010), these finescale biases become less important at larger scales, where extreme hydrogeological systems mix with intermediate ones.
Although comparisons of the meteorological forcing data and simulated SWE at SNOTEL sites appear to show large discrepancies, these results should be interpreted with caution. The comparison sets point measurements of precipitation, temperature, and SWE from SNOTEL sites against corresponding mean values generated by the model over a ~6-km grid. Point measurements are typically made in openings in the tree canopy, whereas average SWE over a ~6-km grid not only ignores local topographic effects but also includes the effect of vegetation over the entire grid cell. SWE values derived from SNOTEL sites have been shown to be as much as 200% higher than mean SWE over 1-, 4-, and 16-km2 grids (Molotch and Bales 2005). Hence, a large underestimation of SWE by the model was not surprising and could be entirely due to the sampling nature of the snow datasets. Comparing VIC-simulated SWE with estimates derived from remote sensing [e.g., spatially distributed SWE derived from Moderate Resolution Imaging Spectroradiometer (MODIS) imagery] could provide a better measure of model performance in capturing snow dynamics.
These findings on model performance have broad implications for using large-scale LSMs in landscape-level planning as well as for future model improvement. As mentioned earlier, geological differences among the watersheds, as inferred from K, along with errors in meteorological forcing can significantly affect model performance. However, the relative contribution of structural (lack of groundwater) and parametric (meteorological forcing) error is still unclear. Because geology, snow, and elevation are all geographically correlated in this landscape, it was difficult to disentangle their individual effects on model performance. For example, summer flows and low-flow percentiles were underpredicted in groundwater-dominated watersheds and overpredicted in runoff-dominated watersheds. But in this region, groundwater-dominated watersheds are typically located at high elevations, where meteorological forcings are also uncertain and presumably underreported. The meteorological forcing data for VIC were interpolated from Cooperative Observer Program (COOP) stations that are predominantly located at lower elevations. Under these circumstances, it would be unfair to attribute the error in groundwater-dominated watersheds entirely to model structure. Most likely it results from both structural and parametric error, and more research is needed to quantify their individual impacts.
Finally, since most of the model error is systematic (RMSEs/RMSEu > 1), a more rigorous site-specific calibration may help improve model performance. At a regional scale, where climate and geology vary significantly, a geologically based (Tague et al. 2013) or landscape-based (Patil et al. 2013) model parameterization, as opposed to transferring calibrated model parameters based on spatial proximity and physical similarity (Oudin et al. 2008), could help reduce the uncertainty due to the presence or absence of groundwater. Additionally, relying on streamflow alone for model calibration may be problematic in regions such as the PNW, where the shapes of hydrographs largely depend on water stored as SWE and groundwater. The availability of remotely sensed datasets such as MODIS-based evapotranspiration and snow cover and Gravity Recovery and Climate Experiment (GRACE)-based terrestrial water storage provides opportunities for multicriteria parameter estimation (Livneh and Lettenmaier 2012). Site-specific calibration can also compensate for uncertainty in the meteorological forcing (Elsner et al. 2014), although, as pointed out by Elsner et al. (2014), such uncertainties could still be challenging when forecasting the effects of climate change. Uncertainties in gridded meteorological data (Elsner et al. 2014) could be reduced by incorporating measurements from SNOTEL sites, which are typically located at higher elevations than the COOP stations currently used to generate the gridded VIC forcing data (Hamlet and Lettenmaier 2005).
5. Summary and conclusions
This study provides an assessment of the large-scale Variable Infiltration Capacity (VIC) model for predicting hydrologic regimes of small watersheds in the Pacific Northwest. Since large-scale hydrologic models of this type are typically not calibrated for small watersheds, knowing the uncertainties and their relationship to topographic, geologic, and climatic controls is quite valuable. Model performance and associated uncertainties were assessed by comparing VIC-simulated and observed streamflows from 217 watersheds in terms of total flow at annual and seasonal time scales and flow percentiles. In addition to streamflow, we also compared the model meteorological forcing with independent observations from 109 stations with daily and monthly precipitation, maximum and minimum temperature, and SWE data. The effect of deep groundwater on model performance was assessed by grouping watersheds based on the streamflow recession constant K, following Safeeq et al. (2013).
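Safeeq et al. (2013) derive K from base flow recession analysis; their full procedure is not reproduced in this summary. As an illustrative sketch only, assuming a simple exponential (linear reservoir) recession, a daily recession coefficient (one common definition of K) can be recovered from the slope of ln Q over a rain-free recession limb:

```python
import math

def recession_constant(q, dt=1.0):
    """Estimate the recession coefficient K in Q(t + 1) = K * Q(t),
    i.e., K = exp(OLS slope of ln Q versus t, times dt).
    q: flows sampled every dt days during an uninterrupted, rain-free
    recession. In this framing, K near 1 indicates slowly receding,
    groundwater-sustained base flow; smaller K indicates faster-draining
    behavior. (Illustrative definition; the study's exact estimator of
    K follows Safeeq et al. 2013.)"""
    t = [i * dt for i in range(len(q))]
    y = [math.log(v) for v in q]
    n = len(q)
    mt, my = sum(t) / n, sum(y) / n
    slope = (sum((a - mt) * (b - my) for a, b in zip(t, y)) /
             sum((a - mt) ** 2 for a in t))
    return math.exp(slope * dt)
```

Fitting the slope by regression over the whole limb, rather than from a single pair of days, damps the effect of measurement noise in the daily flows.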
Overall, the model was able to capture the hydrologic behavior of these watersheds with reasonable accuracy, as measured by the Spearman rank correlation. Both total streamflow and flow percentiles, however, are subject to strong systematic model error. Although the magnitude of relative bias (i.e., PBIAS) is comparable between groundwater- and runoff-dominated watersheds, summer streamflow and lower-percentile flows are predominantly overestimated in runoff-dominated watersheds and consistently underestimated in groundwater-dominated watersheds. The model performed poorly in capturing the sensitivity of streamflow to changes in both temperature and precipitation across all seasons. Our findings also suggest strong disagreements between gridded and observed meteorological forcing and between simulated and observed SWE, which could be contributing to model bias. Since groundwater- and snow-dominated watersheds overlap geographically, disentangling their individual impacts on model bias was challenging. However, since most of the model bias was systematic, careful site-specific or geologically driven model calibration using not only streamflow but also SWE would be expected to improve model performance.
Predicting changes in future streamflow under climate change at the regional scale is essential for planning and developing mitigation strategies. The VIC model and other LSMs help scientists and resource managers answer “what if” questions in a quantitative manner based on future climate and land use changes projected by global climate models. This study highlights some of the uncertainties in model-simulated streamflow and how they vary across hydrogeological terrains and time scales. Our results also provide a basis for developing model calibration and parameterization strategies for future modeling work in this region that better account for landscape differences in groundwater contribution.
The authors gratefully acknowledge funding support from the Oregon Watershed Enhancement Board, Bureau of Land Management (Oregon), and the U.S. Forest Service Region 6 and Pacific Northwest Research Station. The manuscript benefitted from the thoughtful comments of Sarah Lewis and three anonymous reviewers.