Streamflow Q estimation in ungauged catchments is one of the greatest challenges facing hydrologists. Observed Q from 3000 to 4000 small-to-medium-sized catchments (10–10 000 km2) around the globe were used to train neural network ensembles to estimate Q characteristics based on climate and physiographic characteristics of the catchments. In total, 17 Q characteristics were selected, including mean annual Q, baseflow index, and a number of flow percentiles. Testing coefficients of determination for the estimation of the Q characteristics ranged from 0.55 for the baseflow recession constant to 0.93 for the Q timing. Overall, climate indices dominated among the predictors. Predictors related to soils and geology were relatively unimportant, perhaps because of their data quality. The trained neural network ensembles were subsequently applied spatially over the entire ice-free land surface, resulting in global maps of the Q characteristics (at 0.125° resolution). These maps possess several unique features: they represent observation-driven estimates, they are based on an unprecedentedly large set of catchments, and they have associated uncertainty estimates. The maps can be used for various hydrological applications, including the diagnosis of macroscale hydrological models. To demonstrate this, the produced maps were compared to equivalent maps derived from the simulated daily Q of four macroscale hydrological models, highlighting various opportunities for improvement in model Q behavior. The produced dataset is available online (http://water.jrc.ec.europa.eu/GSCD).
Quantitative knowledge of streamflow Q and its spatial and temporal distribution is important for hydropower production, drinking water supply, industrial uses, irrigation systems, aquatic habitats, recreation, and sediment and contaminant transport (Poff et al. 1997; Nuttle 2002; Brauman et al. 2007). Perhaps the greatest obstacle to advancing current understanding is that the majority of Earth’s land surface is ungauged or poorly gauged (Fekete and Vörösmarty 2007). The estimation of Q in ungauged and poorly gauged catchments has therefore been highlighted as one of the major challenges facing the hydrologic sciences today (Sivapalan 2003).
One way to estimate Q characteristics in ungauged catchments is to use physically based continuous rainfall–runoff models. These models are expected to provide accurate Q estimates for ungauged catchments owing to their physically based representation of the chief processes governing the water cycle, provided good quality data on precipitation and other governing factors are available. However, model parameters are typically difficult or impossible to measure at the model application scale (Beven 1989; Duan et al. 2001), and hence models still require calibration (Duan et al. 2006; Nasonova et al. 2009; Rosero et al. 2011). Approaches exist that transfer model parameters from gauged to ungauged catchments based on regression, physical similarity, or spatial proximity (Kim and Kaluarachchi 2008; He et al. 2011), but so far these approaches have met with limited success (Wagener and Montanari 2011).
An alternative approach to gain knowledge about Q behavior in ungauged catchments is to use empirical models linking catchment attributes (related to climate, topography, land cover, soils, and/or geology) to indices that quantify various characteristics of the Q regime [e.g., Mazvimavi et al. 2005; Brandes et al. 2005; Detenbeck et al. 2005; Longobardi and Villani 2008; Van Dijk 2010; Peña-Arancibia et al. 2010; Krakauer and Temimi 2011; see Blöschl et al. (2013), Salinas et al. (2013), Hrachowitz et al. (2013), and Beck et al. (2013b) for reviews]. Based on observed Q, these models capture the Q behavior in a lumped way by integrating the effects of spatial heterogeneity in climate and physiography. However, such models have typically been based on a relatively small number of catchments (22–183 for the cited studies) and regional datasets to characterize climate and physiography and hence may have limited applicability outside the study region. Moreover, most regression models have not been applied to an independent set of catchments to test their generalization capability. In an effort to overcome these limitations, Beck et al. (2013b) produced global maps for two baseflow-related Q characteristics based on data from 3394 catchments around the globe.
The present study aims to produce global maps for a more comprehensive selection of Q characteristics. The subsidiary aims are threefold: 1) to investigate the relationships between Q characteristics and climate and physiographic characteristics at the global scale, 2) to produce global maps of Q characteristics, and 3) to compare the produced maps to equivalent maps derived from macroscale hydrological models to test the usefulness of observation-based maps in model evaluation and improvement.
2. Streamflow characteristics
Indices quantifying various characteristics of the Q regime have been widely used in ecological studies (Clausen and Biggs 2000; Olden and Poff 2003). Table 1 lists the 17 Q characteristics selected for the present analysis, including their computation from Q time series. Some of the selected Q characteristics are related to base flow (BFI1–4, Q95, and Q99), and others are related to peak flow (Q1), the shape of the flow duration curve (Q1–Q99), the mean water balance (RC and QMEAN), or flow timing (T50). Figure 1 gives an example of base flow computed using the four techniques considered here (BFI1–4). Note that base flow is defined in this study as the slowly varying portion of Q and thus includes all slow runoff components, including snow and ice melt and the outflow from surface-water bodies (cf. Hall 1968; Smakhtin 2001).
a. Climate and physiographic characteristics
Table 2 lists the climate and physiographic characteristics used as predictors of the Q characteristics. Among the 20 selected predictors, eight were related to climate, three to topography, five to land cover, one to geology, and three to soils. A similar selection of predictors was used by Beck et al. (2013b). However, we used regional data on precipitation P and air temperature TA for the United States, regional P data for New Zealand, and improved datasets for surface elevation ELEV, surface slope SLO, fraction covered by lakes and reservoirs fW, fraction of forest cover fTC, normalized difference vegetation index NDVI, and the soil-related predictors (content of sand SAND, content of silt SILT, and content of clay CLAY). Additionally, we added topographic wetness index TWI, permafrost abundance PF, and fraction covered by glaciers GLAC as predictors. On the other hand, we removed gravel soil content GRAV, as it is not included in the SoilGrids1km dataset. Furthermore, to give drier catchments more weight in training datasets, we replaced mean annual precipitation by square-root-transformed mean annual precipitation Ptrans and used the reciprocal of the humidity index, the aridity index AI. Last, snow water equivalent depth SNOW was replaced with fraction of snow cover fS, which can be mapped more accurately and at higher spatial resolution (Gao et al. 2010). Most data used for the predictors have a resolution ≤1 km. For the catchment-scale estimation of the Q characteristics the full-resolution data were used, while for the computation of global maps the data were resampled to 0.125° using simple averaging with gaps filled by nearest-neighbor interpolation.
b. Observed streamflow
The observed Q data in this study were compiled from three sources. First, daily Q and catchment boundary data for 9169 U.S. stations that were part of the Geospatial Attributes of Gages for Evaluating Streamflow, version II (GAUGES II; Falcone et al. 2010), were downloaded from the U.S. Geological Survey (USGS) website (http://water.usgs.gov). Daily and monthly Q data for 1961 and 3357 stations, respectively, around the globe and associated catchment boundaries (Lehner 2012) were provided by the Global Runoff Data Centre (GRDC; http://grdc.bafg.de). Daily Q and associated catchment boundary data for 321 Australian stations compiled by Peel et al. (2000) were used to complete the dataset. This resulted in an initial dataset consisting of 14 808 catchments, of which the 3357 GRDC catchments with only monthly Q data were only considered for the Q characteristics insensitive to the shape of individual flow events (T50, RC, and QMEAN). Prior to the computation of T50, the monthly Q data were linearly interpolated to daily values. The analysis of T50 was restricted to northern catchments and regions with some snowfall (defined by fS > 0.1), which tend to have distinct seasonal flow patterns.
The following criteria were used to exclude unsuitable catchments from our analysis. First, in order to reduce anthropogenic influences, catchments were required to have <2% classified as urban (using the “artificial areas” class of the map from GlobCover, version 2.3; Bontemps et al. 2011) or subject to irrigation (using the Global Irrigated Area Map; http://www.iwmigiam.org). Second, a minimum Q record length of 10 years (but not necessarily consecutive) was chosen to provide a reasonable compromise between data availability and robustness of the computed Q characteristics (cf. Kauffeldt et al. 2013). Third, to minimize the effects of channel routing (evaporation, leakage losses, and travel time delays), the catchment area was limited to <10 000 km2 (cf. Peña-Arancibia et al. 2010; Van Dijk et al. 2013). Fourth, to reduce sampling errors in the catchment-mean predictor values, the catchment area had to be >10 km2. Finally, to reduce the number of disinformative catchments, all Q records were screened for artifacts and anthropogenic influences (diversions and/or impoundments), and U.S. catchments flagged as “non-reference” in the GAUGES-II database were discarded. Table 3 lists the number of catchments that fulfilled the selection criteria for each Q characteristic. Figure 2 shows the locations of the catchments. All Q data were converted to millimeters per day using the provided catchment areas.
c. Modeled streamflow
We used modeled Q from four macroscale hydrological models for two purposes: first, to compare the performance in estimating the Q characteristics obtained by the neural networks against the performance obtained by hydrological models, and second, to test whether the produced global maps of Q characteristics can be used for model evaluation and improvement.
The first model, HTESSEL–ERAIR, uses the Hydrology-Tiled ECMWF Scheme for Surface Exchanges over Land (HTESSEL) land surface model (LSM; Balsamo et al. 2009) driven by the ERA-Interim forcing (Dee et al. 2011) corrected using Global Precipitation Climatology Project (GPCP), version 2.1 (Adler et al. 2003; Balsamo et al. 2015). The data have a daily temporal and ~0.7° spatial resolution and were available for the period 1979–2010.
The second model, Noah–GLDAS, is based on the Noah LSM, version 2.7.1 (Schaake et al. 1996; Ek et al. 2003), driven by the Global Land Data Assimilation System (GLDAS; Rodell et al. 2004). The data were obtained from the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC; http://disc.sci.gsfc.nasa.gov) and cover the period from 2000 to present. The data have a 3-hourly temporal and 0.25° spatial resolution and were aggregated to daily means for this study.
The third model, PCR-GLOBWB–ERAI, uses the PCRaster Global Water Balance (PCR-GLOBWB) global hydrological model (GHM; Van Beek and Bierkens 2009) driven by ERA-Interim (Dee et al. 2011). The data have a daily temporal and 0.5° spatial resolution and were available for the period 2003–10.
The fourth and final model, W3RA–Princeton, is based on the World-Wide Water Resources Assessment (W3RA) GHM, version 1 (Van Dijk et al. 2013), driven by the Sheffield et al. (2006) meteorological dataset developed at Princeton University. The data have a daily temporal and 1° spatial resolution and were available for the period 1980–2010.
a. Climate and physiographic controls of streamflow characteristics
Regression analysis was used to evaluate the strength and shape of the relationships between catchment values of the climate and physiographic characteristics and the transformed Q characteristics, providing clues to the controls on Q characteristics. Linear, exponential, and power functions were fitted by least squares and the function with the highest coefficient of determination R2 was reported. Significance levels (or probability values) were not calculated as they were strongly correlated to the R2 values, making it redundant to list both. In addition, most relationships were found to be highly significant, despite visual interpretation indicating no apparent relationship. Indeed, several studies have found that significance levels may be misleading when working with such large sample sizes (Royall 1986; Johnson 1999; Nicholls 2001).
b. Catchment- and global-scale estimation of streamflow characteristics
The procedure described in Beck et al. (2013b) with one modification was used to estimate the Q characteristics at the catchment and global scale. Briefly summarized, Beck et al. (2013b) trained neural network ensembles to estimate the Q characteristics BFI1 and k based on climate and physiographic characteristics of the catchments. Each ensemble member consisted of a multilayer perceptron feed-forward neural network with one hidden layer comprising 30 neurons. Although Beck et al. (2013b) used ensembles comprising 10 members obtained using 10-fold cross validation, here we used ensembles comprising 20 members obtained using 20-fold cross validation in order to achieve more accurate estimation results. For each cross-validation iteration, the catchment set was partitioned into subsets of 80% for training, 15% for validation, and 5% for testing. The subdivision was random, although each catchment was used only once for testing. The validation subset of catchments was used to prevent overfitting, while the testing subset was used to independently evaluate the performance of the trained neural network. For training, the Levenberg–Marquardt algorithm was used in combination with the mean-square error performance function. Prior to the estimation, the values of the Q characteristics were transformed to make the catchment distribution of values resemble a normal distribution and to stretch important parts of the scale of measurement. After the estimation, the estimates were back-transformed. Table 3 lists the transformation equations used for the 17 Q characteristics considered here.
Global maps of the Q characteristics (0.125°) were obtained by spatially applying the trained neural network ensemble over the entire ice-free land surface, computing the median of the 20 estimates for each grid cell and back-transforming the result. Associated uncertainty maps were computed from the standard deviation of the 20 transformed estimates for each grid cell [see Beck et al. (2013b) for further details]. The ice-free land surface was defined using the following datasets. Ocean boundaries were taken from the Global Self-Consistent, Hierarchical, High-Resolution Shoreline (GSHHS) dataset (http://www.ngdc.noaa.gov/mgg/shorelines/gshhs.html). The World Wildlife Fund (WWF) Global Lakes and Wetlands Database (GLWD) Level 1 (Lehner and Döll 2004) was used to exclude large inland water bodies (>20 000 km2). The ice-covered portion of the land surface was defined by climate type EF (polar, frost) of the Köppen–Geiger climate type map (Peel et al. 2007; see Fig. S2.1 in the supplemental material).
To evaluate the value of the neural network ensembles, the obtained performance in estimating the Q characteristics was compared against the performance obtained by two LSMs (HTESSEL–ERAIR and Noah–GLDAS) and two GHMs (PCR-GLOBWB–ERAI and W3RA–Princeton). For each model, total Q was computed by summing the surface and subsurface runoff contributions of the model. Thus, BFI values derived from the models do not necessarily reflect the fraction of total Q originating from the subsurface compartments of the models. For each Q characteristic, only the catchments that had nonmissing values of the Q characteristics for all four models were considered. For k, catchments were rejected if one of the models produced flow events without sufficiently long recession periods, as this prevented the computation of k. For BFI1–4, k, Q1–Q99, and T50, catchments for which one of the models produced QMEAN < 5 mm yr−1 were rejected. The analysis of T50 was restricted to catchments with some snowfall (defined by fS > 0.1). Values for RC were not computed for the models since the P forcing data were not available for all models.
c. Example application
As an example application of the new global maps of Q characteristics, a comparison was made to equivalent maps derived from the daily Q estimates of four macroscale hydrological models (HTESSEL–ERAIR, Noah–GLDAS, PCR-GLOBWB–ERAI, and W3RA–Princeton) in order to reveal deficiencies in the model or the forcing. Total Q was obtained by summing the surface and subsurface runoff contributions of the models, after which the Q characteristics were computed per grid cell. The model-based maps of the Q characteristics were resampled, together with the observation-based maps produced in this study, to a common 1° resolution using simple averaging. For each model and Q characteristic, spatial correlation coefficients R (indicative of the agreement in spatial variability), median differences D (indicative of the overall bias), and median absolute differences (indicative of the overall difference) were computed between the simulation- and observation-based maps. Significance levels were not reported as nearly all combinations yielded extremely significant probability values.
a. Climate and physiographic controls of the streamflow characteristics
Table 4 lists R2 values computed between catchment values of the Q characteristics and the catchment attributes (for associated scatterplots, see Figs. S1.1–S1.17 in the supplemental material). In terms of mean R2, the five most important predictors (ordered from highest to lowest mean R2) were AI, Ptrans, SLO, fS, and TWI, while the five least important predictors (again ordered from highest to lowest mean R2) were GLAC, fW, Psi, SAND, and PERM. Overall, the predictors related to climate proved most informative, followed by the ones related to topography and land cover, while the predictors related to geology and soils were least informative. The topography-related predictors TWI and SLO achieved almost identical R2 values because of the strong correlation between the two predictors. Among the soil-related predictors, CLAY was the most informative. The most important predictors for BFI4 were PET, TA, ELEV, fS, and CLAY; and for k were AI, PET, and fS. Predictors identified as important for Q1–Q99 were AI, Ptrans, and SLO. Important predictors for T50 were Ptrans, TA, NDVI, and fS; for RC were AI, PET, and fS; and for QMEAN were AI, Ptrans, SLO, and fS. Variables BFI1–4 achieved comparable R2 results (see Figs. S1.1–S1.4 in the supplemental material) because of strong correlations between the BFI estimates.
b. Catchment-scale estimation of streamflow characteristics
Table 5 shows mean R2 and RMSE values obtained by the 20 neural networks (one for each cross-validation iteration) for the training and testing subsets of catchments. The mean training R2 values were excellent (≥0.90) for Q5, Q10, Q20, Q50, T50, and QMEAN; good (0.80–0.89) for Q1, Q80, Q90, Q95, and RC; acceptable (0.70–0.79) for BFI1–4 and Q99; and modest (<0.70) for k (Table 5; for associated scatterplots, see Fig. S2.2 in the supplemental material). Variables BFI1–4 achieved very similar training and testing R2 values because of high correlations among the BFI estimates. The testing R2 values were 0.02–0.14 lower than the training R2 values (Table 5), suggesting a good generalization capability of the trained models. Table 6 lists the performance of the four macroscale hydrological models in estimating the Q characteristics. The testing R2 values obtained by the neural networks (Table 5) substantially exceed the R2 values obtained by the models (Table 6), confirming the usefulness of the neural networks.
c. Global-scale estimation of streamflow characteristics
Figures 3–5 show global maps of estimated and observed BFI4, T50, and QMEAN, respectively (for global maps of the other Q characteristics, see Figs. S2.3–S2.19 in the supplemental material). Visual comparison of the estimates and the observations revealed good agreement between estimated and observed BFI4, despite a slight tendency for the neural networks to underestimate high BFI4 and overestimate low BFI4 (Fig. 3). Particularly high BFI4 values were found for central Africa, the Canadian Shield in North America, the mountain ranges of the western United States, the Baltic Shield in northern Europe, and the eastern part of the Amazon basin (Fig. 3a). The T50 estimates and observations exhibited excellent agreement overall (Fig. 4). A pronounced north–south gradient from late to early Q timing can be observed (Fig. 4a). Similarly, the QMEAN estimates did not show any clear regional biases (Fig. 5). Particularly high QMEAN was obtained for the Amazon, maritime Southeast Asia, the western flank of the Coast Mountains in North America, and the southern section of the Andes in Patagonia (Fig. 5a).
Table 7 lists mean values of the Q characteristics for the entire ice-free land surface and the major Köppen–Geiger climate types (for a map of the major Köppen–Geiger climate types, see Fig. S2.1 in the supplemental material). The lowest BFI values were found in arid regions (broad climate type B) and the highest were found in tropical and cold regions (broad climate types A and D, respectively; Table 7, Fig. 3a). The values of the different BFI estimates varied consistently among climate types—BFI1 always produced the highest values and BFI2 or BFI4 always produced the lowest values, while BFI3 produced values slightly lower than BFI1 (Table 7). The values of k showed little variability among climate types—only for arid regions (broad climate type B) was the baseflow recession markedly faster (Table 7). Values of Q1–Q99 and QMEAN also varied rather consistently among climate types, with higher values found for tropical climate types Af and Am (rain forest and monsoon, respectively) and polar climate type EF (frost), while lower values were found for arid regions (broad climate type B; Table 7). The means for Q99, Q95, and Q90 were (near) zero for many climate types (Table 7), which is indicative of intermittent and ephemeral streams. The earliest T50 values were found for climate type BS (arid, steppe) and the latest for climate type ET (polar, tundra; Table 7). The lowest RC values were found in arid regions (broad climate type B) and the highest for polar climate type ET (Table 7). The mean RC value obtained for climate type ET is in fact higher than one (Table 7), probably because of P underestimation.
d. Example application
As an example application of the new global maps of Q characteristics, a comparison was made to equivalent maps derived from the daily Q estimates of four macroscale hydrological models. Figures 6–8 show global maps of the difference for transformed values of the Q characteristics BFI4, T50, and QMEAN, respectively (for the other Q characteristics, see Figs. S3.1–S3.13 in the supplemental material). Table 8 lists correlation coefficients, median differences, and median absolute differences for BFI4, k, Q1–Q99, T50, and QMEAN. HTESSEL–ERAIR and W3RA–Princeton yielded the highest R values for BFI4 (Table 8), meaning that these models most accurately simulate the spatial variability in the ratio of base flow to total Q. However, HTESSEL–ERAIR produces excess base flow over taiga and temperate regions (Fig. 6a), while W3RA–Princeton does so over tropical regions (Fig. 6d). Noah–GLDAS yielded the lowest R value (Table 8), producing insufficient base flow over tundra and taiga regions (Fig. 6b). PCR-GLOBWB–ERAI performed moderately well in terms of R but appears to overestimate BFI4 over most of the land surface (Table 8, Fig. 6c). All models produce insufficient base flow over the Rocky Mountains in North America (Fig. 6).
In terms of k, W3RA–Princeton achieved the highest R value (Table 8), meaning it most accurately simulates the spatial variability in baseflow recession rate. Nevertheless, it exhibits slightly too slow baseflow recessions overall, as indicated by the negative D value (Table 8). In contrast, Noah–GLDAS performed poorly, as indicated by the negative R value (Table 8). W3RA–Princeton performed best for the median and low flow percentiles (Q50–Q99), while Noah–GLDAS performed best for the high flow percentiles (Q1–Q10; Table 8). In terms of T50, all models exhibit a too early production of Q over most of the land surface considered here (Table 8, Fig. 7). However, PCR-GLOBWB–ERAI performed markedly better than the other models, obtaining a D value of −3 days whereas the other models yielded D values from −11 to −14 days (Table 8).
PCR-GLOBWB–ERAI demonstrated the best R value for QMEAN and thus best captures the spatial variability in mean annual Q (Table 8, Fig. 8c). Noah–GLDAS produces insufficient Q over most of the land surface (Table 8, Fig. 8b), while W3RA–Princeton does so over most of the Northern Hemisphere with the exception of India (Table 8, Fig. 8d). HTESSEL–ERAIR performed moderately well in terms of QMEAN (Table 8, Fig. 8a). All models produce insufficient Q over the Rocky Mountains, the Andes, and the Himalayas (Fig. 8). The two GHMs (PCR-GLOBWB–ERAI and W3RA–Princeton), which have Q simulation as one of their primary aims, did not outperform the two LSMs (HTESSEL–ERAIR and Noah–GLDAS; Table 8, Fig. 8).
a. Climate and physiographic controls of the streamflow characteristics
The high R2 values found for PET versus BFI4, TA versus BFI4, fS versus BFI4, AI versus k, PET versus k, and fS versus k (Table 4) reflect the dominant control of climate conditions on baseflow volume and recession rate (Price 2011). The positive relationship obtained between fS and BFI4 (Table 4) is due to the gradual release of snowmelt to Q, which results in slowly varying Q exhibiting high BFI4. The relationships obtained between SAND or CLAY and BFI4 (Table 4) correspond with the influence of soil composition on the infiltration-excess mechanism of runoff generation (Horton 1933; Wolock and McCabe 1999). Values of BFI1–4 showed very similar relationships with the predictors (Figs. S1.1–S1.4 in the supplemental material) because of high correlations among the BFI estimates. Using a different but partly overlapping set of 3394 catchments around the globe, Beck et al. (2013b) obtained results similar to those presented here (Figs. S1.1, S1.5 in the supplemental material). We refer to Beck et al. (2013b) for a review of multicatchment observational studies investigating the links between catchment attributes and BFI and k.
The Q characteristics Q1–Q99, RC, and QMEAN were found to be strongly related to AI (Table 4), emphasizing the dominant influence of climate conditions on the overall water balance (cf. Schreiber 1904; Ol’dekop 1911; Pike 1964; Budyko 1974; Zhang et al. 2004). The RC exhibited a particularly strong (positive) relationship with fS (Table 4), presumably because of reductions in evaporation when the surface is snow covered (Berghuijs et al. 2014) and/or because of common P underestimations in mountainous environments caused by wind-induced undercatch and the elevation bias in gauge placement (Groisman and Legates 1994; Hijmans et al. 2005; Sevruk et al. 2009). The weakly negative relationships obtained here between QMEAN and the phase difference between precipitation and potential evaporation seasonalities CORR (Table 4) is in accord with several studies investigating the influence of CORR on the overall water balance (Milly 1994; Wolock and McCabe 1999; Potter et al. 2005). Values of fTC did not exhibit clear relationships with the Q characteristics (Table 4), which agrees with some multicatchment studies examining the spatial relationship between fTC and Q (e.g., Oudin et al. 2008; Peel et al. 2010; Van Dijk et al. 2012) as well as some meso- and large-scale catchment (>1 km2) studies examining the impact of temporal changes in fTC on Q [see review in Beck et al. (2013a)]. The lack of a relationship between NDVI and Q5–Q99 or QMEAN (Table 4) agrees with Hope and Bart (2012), who used observed Q from 30 South African catchments (1–254 km2) to examine whether QMEAN and 11 flow percentiles between Q5 and Q95 are linked to NDVI derived from the Moderate Resolution Imaging Spectroradiometer (MODIS).
Values of TA were found to be strongly negatively related to T50 (Table 4) because temperature represents the dominant control on snow accumulation and melt (Aguado et al. 1992; Stewart et al. 2004). The almost equally strong negative relationship obtained between NDVI and T50 (Table 4) is probably not due to the effect of vegetation on Q, but rather because of the correlation between TA and NDVI owing to the strong control exerted by temperature on vegetation productivity in northern regions (e.g., Jia et al. 2003). The negative relationship found between Ptrans and T50 (Table 4) is also unlikely to be causal but reflects the strong positive correlation between Ptrans and TA in northern regions. Similarly, the positive relationship obtained between ELEV and T50 (Table 4) is presumably not causal but reflects the strong correlations between ELEV, air temperature, snowfall, and snow accumulation and melt (Daly et al. 2002; Leibowitz et al. 2012). A noncausal relationship does not necessarily mean that the predictor should be excluded from the estimation exercise, as it is possible that noncontrolling predictors compensate for the deficiencies of controlling predictors.
Overall, the climate indices showed the strongest relationships with the Q characteristics (Table 4). Among the predictors related to land cover, only fS was clearly related to multiple Q characteristics (Table 4). Among the topography-related predictors, SLO and TWI were related to several Q characteristics, while ELEV was found to be relatively unimportant (Table 4). Our results suggest that topography is not the dominant control on Q behavior at the global scale (cf. Devito et al. 2005). In general, low R2 values were found for single predictors (Table 4), suggesting that multiple predictors have to be used to satisfactorily estimate the Q characteristics. It bears mentioning that a low R2 obtained in Table 4 could mean that the predictor under consideration is unimportant but also that the quality of the dataset is poor. For example, it is generally recognized that geology and soils exert strong controls on Q characteristics (e.g., Farvolden 1963; Boorman et al. 1995; Bruijnzeel 2004), but weak or nonexistent relationships were found here (Table 4), probably because of the limited quality and consistency of the global datasets (cf. Table 2). Despite the transformations applied to some of the Q characteristics, the relationships were generally highly nonlinear (see Figs. S1.1–S1.17 in the supplemental material), confirming the need for neural networks instead of linear models for the estimation of the Q characteristics.
b. Catchment-scale estimation of streamflow characteristics
The obtained R2 values were from good to excellent for most Q characteristics and exceeded the R2 values obtained by the macroscale hydrological models by a wide margin for all Q characteristics (Tables 5, 6). We compare our results to four recent reviews of studies estimating Q characteristics in ungauged catchments (Blöschl et al. 2013; Salinas et al. 2013; Hrachowitz et al. 2013; Beck et al. 2013b). The mean training R2 values obtained for BFI1–4 (0.76–0.78; Table 5) are in the range of those obtained in other BFI estimation studies [e.g., Lacey and Grayson 1998; Neff et al. 2005; Van Dijk 2010; see review by Beck et al. (2013b)]. Although the mean training R2 of 0.66 obtained for k was the lowest among the Q characteristics examined (Table 5), it is still in the upper range of training R2 values obtained in other k estimation studies [e.g., Hughes 1997; Peña-Arancibia et al. 2010; Krakauer and Temimi 2011; see review by Beck et al. (2013b)].
The best way to quantitatively summarize the shape of the flow duration curve is not commonly agreed upon and depends on the application requirements. Some studies used normalized flow percentiles (e.g., Smakhtin et al. 1997), while others used parameters of fitted probability distributions (e.g., Fennessey and Vogel 1990) or the slope of the flow duration curve (e.g., Yadav et al. 2007); we used a number of flow percentiles (Q1–Q99; Table 1). The mean training and testing R2 values obtained here for Q95 (0.80 and 0.70, respectively; Table 5) were higher than those obtained in two regional studies (Gustard and Irving 1994; Laaha and Blöschl 2006). Using observed Q from 1530 European catchments, Gustard and Irving (1994) constructed a multivariate linear model incorporating the catchment proportions of nine soil classes to estimate Q95, yielding a training R2 of 0.29. Using observed Q from 325 small Austrian catchments (7–963 km2), Laaha and Blöschl (2006) estimated transformed Q95 based on seven catchment attributes, achieving training and testing R2 values of 0.62 and 0.57, respectively. In contrast to the previous two studies, Mazvimavi et al. (2005) and Hope and Bart (2012) examined multiple flow percentiles. Using observed Q from 52 Zimbabwean catchments (4–2630 km2), Mazvimavi et al. (2005) trained neural networks to estimate nine flow percentiles between Q10 and Q90 from two slope-related indices, drainage density, PET, and P. They obtained a training R2 range of 0.64–0.92, which is lower than the range of training R2 values obtained here (Table 5). Using observed Q from 30 South African catchments (1–254 km2), Hope and Bart (2012) estimated 11 flow percentiles between Q5 and Q95 from P, SLO, CLAY, SILT, and drainage density, yielding a training R2 range of 0.45–0.82, which is lower than the range of training R2 values found here (Table 5).
Mean training and testing R2 values of 0.91 and 0.88, respectively, were obtained for QMEAN (Table 5). These numbers are considerably higher than the testing R2 range of 0.55–0.70 obtained in the validation of QMEAN from 14 LSMs and six Budyko-type models against observed QMEAN from 150 large catchments (>10 000 km2) around the globe (Zhou et al. 2012). They also exceed the training R2 range of 0.29–0.58 obtained by Arnell (1995), who used observed Q from ~3500 European catchments (29 000–160 000 km2) to train a multivariate model incorporating P and PET; and the training R2 of 0.85 obtained by Hope and Bart (2012), who used observed Q from 30 South African catchments (1–254 km2) to estimate QMEAN from P, satellite-derived enhanced vegetation index, and SILT. However, the training R2 values obtained here are similar to those obtained in three regional studies (Mazvimavi et al. 2005; Viglione et al. 2007; Duan et al. 2010). Based on observed Q from 52 Zimbabwean catchments (4–2630 km2), Mazvimavi et al. (2005) yielded a training R2 of 0.89 in the estimation of QMEAN based on P, two slope-related indices, and a geologic index. Viglione et al. (2007) used observed Q from 47 catchments in northwestern Italy to estimate natural log–transformed QMEAN based on catchment orientation, ELEV, and a radiational aridity index, yielding a training R2 of 0.90. Duan et al. (2010) used observed Q from 11 Chinese catchments (3322–53 829 km2) to estimate QMEAN from P, PET, latitude and longitude, ELEV, wetland area, catchment shape, and catchment area, yielding a training R2 of 0.87. However, most such studies did not report testing R2 values, making it impossible to verify the generalization capability of the derived models. Studies estimating mean annual volumetric Q instead of QMEAN (e.g., Smakhtin et al. 1997; Vogel et al. 1999; Sanborn and Bledsoe 2006; Viglione et al. 2007) are not compared against here as the strong relationship with catchment area results in inflated R2 values.
c. Global-scale estimation of streamflow characteristics
The global maps of Q characteristics developed here (Figs. 3–5) have some unique features. First, they were derived using a data-driven (top down) approach based on Q observations, rather than using a physically based (bottom up) process model (cf. Sivapalan et al. 2003; Babovic 2005). Being based on Q observations, the maps capture the spatially aggregated behavior without being affected by the limitations and assumptions of physically based models (Beven 1989; Duan et al. 2006). Second, an unprecedentedly large global catchment set spanning a broad range of physiographic and climate conditions was used to produce the maps, leading to more generally applicable results (cf. Andréassian et al. 2007; Gupta et al. 2014). Third, uncertainty estimates have been computed (Figs. S2.20–S2.36 in the supplemental material), which can be used to evaluate the reliability of the new maps.
The global QMEAN and BFI maps developed in this study can be compared to QMEAN and BFI maps for the conterminous United States produced by Gebert et al. (1987) and Wolock (2003), respectively. The QMEAN map for the conterminous United States was based on inverse distance weighting (IDW) interpolation of observed QMEAN from 5951 catchments (Gebert et al. 1987). The map was available as a vector file and was rasterized to the 0.125° geographical grid using IDW interpolation to allow spatial comparison to the newly derived QMEAN map (Fig. 5a). Figure 9 shows the QMEAN maps from Gebert et al. (1987) and this study side by side. By and large, there is good large-scale agreement between the maps, although the map of Gebert et al. (1987) has a smoother appearance because of the use of IDW interpolation (Fig. 9). The spatial comparison yielded an R2 of 0.89 based on transformed values. Considering only the regions where both maps have data, very similar mean QMEAN values of 243 mm yr−1 for the map developed here and 238 mm yr−1 for the map of Gebert et al. (1987) were obtained. The good correspondence between the maps probably reflects the relative ease with which QMEAN can be estimated based on climate and physiographic data.
The BFI map for the conterminous United States (1-km resolution) was produced by IDW interpolation of BFI values from 8249 catchments (Wolock 2003). The BFI was computed using the so-called BFI program (Wahl and Wahl 1995), which implements the baseflow separation procedure described by Gustard et al. (1992) used here for BFI4 (Table 1). After reprojecting the map of Wolock (2003) to the 0.125° geographical grid of the newly derived BFI4 map (Fig. 3a), the maps could be spatially compared. This comparison yielded an R2 of 0.29, indicating somewhat poor agreement in terms of spatial patterns. There was, however, moderate agreement in terms of mean BFI4—considering only the regions where both maps have data, mean BFI4 values of 0.38 for the newly derived map and 0.43 for the map from Wolock (2003) were obtained. The low agreement between the maps emphasizes the difficulty in estimating BFI (cf. Tables 5 and 6).
Differences among the BFI1–4 values (Table 7) are caused by differences in the baseflow separation techniques used (Table 1). Figure 1 shows that the base flow derived using BFI1 rises the day after the Q starts rising, resulting in relatively high BFI values. BFI3 also produces relatively high BFI values as a short (7 days) moving minimum is used to compute base flow. On the other hand, BFI2 produces relatively low BFI values because of the use of a long (11 days) moving minimum. Relatively low BFI values are also produced by BFI4 as it computes minima at 5-day nonoverlapping intervals and subsequently connects the valleys in this series of minima. Base flow is defined in this study as the slowly varying portion of Q (Hall 1968; Smakhtin 2001), and in this sense the technique of choice is somewhat subjective and should be guided by the application requirements.
Haddeland et al. (2011) provide estimates of mean QMEAN and RC for the land surface derived from six LSMs and five GHMs. The mean QMEAN based on the map produced here (Fig. 5a) was 365 mm yr−1 (Table 7), which is within the range of 290–457 mm yr−1 obtained by Haddeland et al. (2011). The mean RC based on the map produced here (see Fig. S2.18 in the supplemental material) was 0.34 (Table 7), which is in the low end of the range of 0.33–0.52 obtained by Haddeland et al. (2011). The relatively low mean RC found here may be due to the tendency of LSMs and GHMs to overestimate QMEAN in arid and semiarid regions (Haddeland et al. 2011; Trambauer et al. 2013).
The global maps of Q characteristics produced here should be useful for a wide range of large-scale hydrological applications, such as water resource assessments, catchment classification, groundwater recharge estimation, stream ecosystem assessments, and diagnosis and parameterization of hydrological models. Additionally, the produced uncertainty maps may be useful in the design of Q monitoring networks by guiding the placement of gauging stations. However, there are some caveats worth mentioning. First, the maps reflect unregulated flow conditions unaffected by urbanization and irrigation. Second, the maps are representative of flows in small-to-medium-sized catchments (10–10 000 km2) and should not be used for larger catchments because of the increasing influence of channel routing effects at larger scales (i.e., evaporation and leakage losses and travel time delays). Third, the baseflow index estimates are not necessarily indicative of the portion of total Q originating from the groundwater—other potential sources include snow and ice melt and the outflow from surface-water bodies. Fourth, prior to computing the Q characteristics from hydrological model output, the surface and subsurface contributions of Q should be summed. Fifth, for many grid cells the runoff coefficient exceeds unity and/or the mean annual Q per unit area exceeds the WorldClim P. Such grid cells are especially common in mountainous and northern regions and may be indicative of P underestimation. Sixth, for regions characterized by multimodal seasonal flow distributions, such as the Pacific coast of northern North America, the timing of the center of mass of Q estimates may be less reliable and require more careful interpretation. Finally, the maps are most accurate in regions with dense Q monitoring networks, such as Europe and the United States, and less accurate in ungauged regions (Fig. 2). If a particular analysis requires a higher level of accuracy, the provided uncertainty maps may be used to mask out uncertain results, or one may restrict the analysis to gauged regions only using the gridded observational maps.
d. Example application
Daily Q estimates from four macroscale hydrological models were diagnosed as an example application of the global maps of Q characteristics produced here. By calculating the difference between the simulation- and observation-based maps of the (transformed) Q characteristics, a number of deficiencies in the Q behavior of the models have been brought to light (Figs. 6–8, Table 8, and Figs. S3.1–S3.13 in the supplemental material). Although there are a host of studies using observed Q to diagnose models in a regional context (e.g., Lohmann et al. 1998, 2004; Robock et al. 2003; Slater et al. 2007; Gusev et al. 2008; Feng and Houser 2008; Choi and Liang 2010; Rodell et al. 2011; Vano et al. 2012; Xia et al. 2012) and at a global scale (e.g., Oki et al. 1999; Döll et al. 2003; Decharme and Douville 2007; Decharme 2007; Zaitchik et al. 2010; Materia et al. 2010; Gosling and Arnell 2011; Haddeland et al. 2011; Zhou et al. 2012), these studies typically used observed Q from a small number of large catchments. The use of a small number of basins limits confidence in the results, while the large size of the basins prevents a spatially detailed analysis of the results and makes it more difficult to distinguish between deficiencies in the forcing, the land surface component, or the routing component. The global maps of Q characteristics produced here enable a spatially explicit diagnosis of models, and since the maps from this study were based on small-to-medium-sized catchments, the confounding influence of the routing component is minimized.
The results for the baseflow-related Q characteristics (BFI1–4, k, and Q99) depended mostly on the runoff schemes of the models and less on the meteorological forcing used. Our findings are in agreement with Slater et al. (2007), who used observed Q from four large (sub)Arctic catchments (1 680 000–2 440 000 km2) and found that Noah–GLDAS produces too much quickflow under (sub)Arctic conditions, likely because of excessive reduction of the soil infiltration capacity when the surface is frozen (Pan et al. 2003; Niu et al. 2011). The k performance was rather poor for all models (Table 8) because of the baseflow-related parameterizations of the models. The k values of HTESSEL and Noah depend on soil texture [Balsamo et al. (2009) and Schaake et al. (1996), respectively]. On the other hand, PCR-GLOBWB parameterizes k based on geology (Van Beek and Bierkens 2009), while W3RA parameterizes k using an AI-based empirical relationship taken from Peña-Arancibia et al. (2010). However, this study found weak or nonexistent relationships between k and soil texture (SAND, SILT, and CLAY), geology (PERM), or AI (Table 8) using the most comprehensive global datasets available (Table 2). The poor k performance obtained by the models confirms the difficulty of estimating k (cf. Table 5; Van Dijk 2010; Peña-Arancibia et al. 2010; Beck et al. 2013b). It should be noted that erroneous BFI and/or k estimates by the models may become less noticeable downstream because of the effects of routing.
Our analysis of T50 was restricted to areas with at least some snowfall. The T50 performance depended on the snow and runoff schemes of the models but also on the meteorological forcing used. Our results for T50 (Table 8, Fig. 7) were confirmed in four previous studies. Balsamo et al. (2009) found an early bias in Q timing for HTESSEL using observed Q from the (sub)Arctic Yenisei catchment (2 580 000 km2). Lohmann et al. (2004) found a similar difference pattern in Q timing for Noah using observed Q from 1145 small-to-medium-sized U.S. catchments (23–10 000 km2). Slater et al. (2007) found a too early Q timing for Noah–GLDAS using observed Q from four large (sub)Arctic catchments (1 680 000–2 440 000 km2). Zaitchik et al. (2010) also found a too early timing of Q for Noah–GLDAS but for the globe based on observed Q from 66 catchments (19 000–4 600 000 km2) around the world. The early biases in Q timing suggest that the models poorly represent the processes affecting meltwater refreezing and/or that there are biases in the meteorological forcing.
The QMEAN performance of the models depended on both the model and the meteorological forcing used. Our results for PCR-GLOBWB are in line with Van Beek and Bierkens (2009), who found that the model accurately simulates the spatial variability in Q based on observed Q from 19 large catchments (66 000–6 915 000 km2) around the world. The pronounced QMEAN underestimation by W3RA–Princeton at northern latitudes suggests that improvements in the forcing or model may be required. All models with the exception of PCR-GLOBWB–ERAI appeared to underestimate QMEAN at the global scale (Table 8, Fig. 8), while all models underestimated QMEAN over mountainous regions (Fig. 8). These findings point to P underestimation despite the gauge undercatch and orographic corrections applied to the forcings of HTESSEL–ERAIR and W3RA–Princeton (Sheffield et al. 2006; Adler et al. 2003; Balsamo et al. 2015). Such P underestimation appears to be common in the high-resolution P data used here (cf. Table 2) as well, as the RC exceeds unity over many northern and mountainous regions for both the observations and the estimates (Table 7 and Fig. S2.18 in the supplemental material).
Overall, the results suggest that the global maps of Q characteristics derived in this study can be employed to refine the parameterization or structure of the examined models in order to obtain better Q behavior (cf. Yadav et al. 2007; Zhang et al. 2008; Castiglioni et al. 2010; Wagener and Montanari 2011; Lombardi et al. 2012; Pinheiro and Naghettini 2013).
In this study, observed Q from 3000 to 4000 small-to-medium-sized catchments (10–10 000 km2) around the globe were used to examine the relationships between catchment attributes (related to climate, topography, land cover, soils, and geology) and Q characteristics. We examined 17 Q characteristics, each quantifying a different aspect of the hydrograph. Additionally, neural network ensembles were used to produce global observation-based maps of Q characteristics, which may be useful for numerous large-scale hydrological applications. The main findings are as follows.
Climate indices exhibited the strongest relationships with the Q characteristics. In contrast, predictors related to soils and geology showed weak or nonexistent relationships with the Q characteristics, possibly because of data quality. The individual relationships between catchment attributes and Q characteristics were generally weak, suggesting the need for models incorporating multiple predictors to estimate Q characteristics. The four base flow indices (BFI1–4) were highly intercorrelated but showed systematic differences.
Neural networks were trained to estimate the Q characteristics from catchment attributes. The estimation performance was good to excellent for most Q characteristics and generally exceeded the performance achieved in previous (regional) studies. The trained neural network ensembles were subsequently applied spatially over the entire ice-free land surface to produce global maps of the Q characteristics, providing a unique observation-based global picture of the Q behavior.
The produced global maps of Q characteristics should be of particular value for the diagnosis of macroscale hydrological models. To test this, the newly produced maps were compared to equivalent maps derived from four macroscale hydrological models. This comparison revealed that, among other things, the models fail to simulate the baseflow recession rate, exhibit early bias in Q timing, and underestimate Q over mountain ranges.
The maps, including uncertainty estimates and observations, have been released as the Global Streamflow Characteristics Dataset (GSCD; freely available via http://water.jrc.ec.europa.eu).
We thank the Global Runoff Data Centre (GRDC) and the U.S. Geological Survey (USGS) for providing most of the observed Q data. We gratefully acknowledge the WorldClim developers for making the global climate maps available. We would also like to thank Tom Gleeson (McGill University, Montreal, Canada) for providing the global permeability map; Toby Marthews (University of Oxford, Oxford, United Kingdom) for providing the global map of topographic wetness index; Frederiek Sperna Weiland (Deltares, Delft, the Netherlands) for providing the PCR-GLOBWB–ERAI modeled Q data; and finally, the European Centre for Medium-Range Weather Forecasts (ECMWF; Reading, United Kingdom) for providing the HTESSEL–ERAIR modeled Q data. The views expressed herein are those of the authors and do not necessarily reflect those of the European Commission.
Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/JHM-D-14-0155.s1.