Accurately predicting regional-scale water fluxes and states remains a challenging task in contemporary hydrology. Coping with this grand challenge requires, among other things, a model that makes reliable predictions across scales, locations, and variables other than those used for parameter estimation. In this study, the mesoscale hydrologic model (mHM) parameterized with the multiscale regionalization technique is comprehensively tested across 400 European river basins. The model fluxes and states, constrained using the observed streamflow, are evaluated against gridded evapotranspiration, soil moisture, and total water storage anomalies, as well as local-scale eddy covariance observations. This multiscale verification is carried out in a seamless manner at the native resolutions of available datasets, varying from 0.5 to 100 km. Results of cross-validation tests show that mHM is able to capture the streamflow dynamics adequately well across a wide range of climate and physiographical characteristics. The model yields generally better results (with lower spread of model statistics) in basins with higher rain gauge density. Model performance for other fluxes and states is strongly driven by the degree of seasonality that each variable exhibits, with the best match being observed for evapotranspiration, followed by total water storage anomaly, and the least for soil moisture. Results show that constraining the model against streamflow only may be necessary but not sufficient to warrant the model fidelity for other complementary variables. The study emphasizes the need to account for other complementary datasets besides streamflow during parameter estimation to improve model skill with respect to “hidden” variables.
Since the pioneering work of Crawford and Linsley (1966), the efficiency of computational hydrologic models has been evaluated against streamflow observations that are available at determined locations within a river basin (Dawdy and Lichty 1968; Sorooshian and Dracup 1980; Duan et al. 1992; Bergström 1995; Seibert 2000; Hundecha and Bárdossy 2004; Troy et al. 2008; Yilmaz et al. 2008; Samaniego et al. 2010; Kumar et al. 2013a). This kind of continuous in situ measurement is essential for understanding the governing relationships between rainfall and runoff in a particular drainage basin. The information content of this time series fundamentally differs from other point measurements such as soil moisture and latent heat in the sense that it represents the integral basin response to a sequence of hydrometeorologic events under particular physiographic and climatic conditions that uniquely characterizes a river basin. Because of this fundamental characteristic, streamflow gauging has been and will be part of the core of national hydrometeorologic monitoring programs and the basis for sound water resources management. It is therefore not surprising that streamflow time series has been the focus for seminal hydrologic work in the past (Kuichling 1889; Sherman 1932; Horton 1935; Hurst 1951; Nash 1958; SCS 1973; Rodriguez-Iturbe and Valdes 1979).
In recent years, however, a tendency toward a more comprehensive assessment of model structural adequacy has taken shape, with an overall aim to improve the representation of different hydrological processes incorporated within a model (Clark et al. 2011; Gupta et al. 2012; Shuttleworth 2012). The rationale behind this assessment is the need to get the right answers for the right reasons (Blöschl 2001; Kirchner 2006), which goes beyond just assessing the model performance against observed streamflow or associated signature measures (Samaniego and Bárdossy 2007; Yilmaz et al. 2008; Kumar et al. 2010; Pokhrel et al. 2012; Euser et al. 2013). Additional motivation for such assessment is driven by the growing need to simulate spatially distributed land surface fluxes controlled by local soil moisture availability and land surface hydrology. Consequently, complementary datasets representing internal hydrologic states and fluxes, such as soil moisture and evapotranspiration, are required to achieve this goal. New kinds of observations and/or proxy data obtained from remote sensing and/or in situ measurement are becoming increasingly available, although at different spatial and temporal resolutions, for example, monthly total water storage (TWS) anomaly from the Gravity Recovery and Climate Experiment (GRACE) at 1° × 1°, near-surface soil moisture from the European Space Agency (ESA) Climate Change Initiative (CCI) at 0.25° × 0.25°, and 30-min eddy flux measurements of latent heat with a footprint of hundreds of hectares.
Several recent studies have evaluated the capability of hydrologic and/or land surface models to represent internal model fluxes and/or states (Li et al. 2012; Livneh and Lettenmaier 2012; Sutanudjaja et al. 2014; Cai et al. 2014; Xia et al. 2014, 2015). A common shortcoming in these studies has been the incompatibility of the scales at which simulated state variables and fluxes are compared with the observations (i.e., data are measured at different spatial scales from those at which models usually operate). The scaling issue poses a major obstacle in performing a comprehensive model evaluation (Blöschl 2001; Samaniego et al. 2010; Tetzlaff et al. 2010; Gentine et al. 2012). Often, the in situ measurements of soil moisture or the evapotranspiration inferred at eddy covariance sites are compared with much coarser gridded model outputs (Xia et al. 2014, 2015). In situ measurements often exhibit much finer support than the smallest representative elemental volume of hydrologic models (Blöschl et al. 1995; Wood 1995; Blöschl 1999).
This scale discrepancy problem is exaggerated when a model is evaluated simultaneously against multiple datasets available at different spatial resolutions. In such a case, different up- and downscaling rules have to be employed to enable comparison between simulations and observations. Alternatively, a quasi-scale-independent model parameterization scheme that allows us to reliably represent processes at different spatial resolutions is required to tackle this scaling problem. The latter has the advantage that a process-based modeling approach can be used to estimate hydrologic fluxes and states across multiple scales (Samaniego et al. 2010; Gentine et al. 2012; Kumar et al. 2013b). Most of the existing modeling approaches, however, exhibit scale-dependent performance, which means that the model parameterization obtained at a given spatial resolution induces large bias in hydrologic fluxes and states when applied to other resolutions (Haddeland et al. 2002; Boone et al. 2004; Stöckli et al. 2007; Troy et al. 2008; Samaniego et al. 2010; Kumar et al. 2013b).
Recently, Samaniego et al. (2010) proposed a multiscale parameter regionalization (MPR) method that allows one to make hydrologic predictions at different scales using a same set of model (transfer) parameters but without losing much of the model performance. The method explicitly accounts for the subgrid variability of the essential aspects of the physical processes that are embedded within model parameters (e.g., soil porosity) and ensures that water fluxes simulated at different scales are comparable. The MPR method incorporated within the mesoscale hydrologic model (mHM; Samaniego et al. 2010) has been tested across a variety of climate and land surface conditions at different spatial resolutions ranging from 4 to 100 km (Kumar et al. 2010, 2013a,b; Samaniego et al. 2013). To date, these scaling studies have mainly focused on evaluating model performance against streamflow and conducting flux-matching experiments using modeled variables at multiple scales and locations.
In this study, we specifically evaluate the ability of the MPR method to reproduce the spatiotemporal dynamics of various water fluxes and states observed at multiple resolutions. The model parameterization constrained using streamflow observations across 400 European river basins is evaluated against complementary datasets that include gridded upscaled in situ evapotranspiration (ET) data, satellite-based soil moisture (SM), and TWS anomalies, as well as local-scale eddy covariance data and their native resolutions. Alternative data fusion possibilities (such as data assimilation) to mitigate the limitations of models are beyond the scope of the present study and require future investigation. The multiscale evaluation approach followed here differs from previous hydrological model assessment studies that have covered the European domain using, for example, the LISFLOOD model (Wanders et al. 2014), the PCRaster Global Water Balance (PCR-GLOBWB) model (Wada et al. 2010; Sutanudjaja et al. 2014), or the Water–Global Assessment and Prognosis (WaterGAP) model (Werth and Güntner 2010). Although these studies focused on evaluating model skill on multiple variables, they have been operated on a limited number of basins and/or with little consideration of the scaling discrepancy problem while verifying model outputs against observations.
We hypothesize that parameter estimation based only on streamflow-related metrics is a necessary but not sufficient condition to warrant the proper partitioning of incoming precipitation P into various spatially distributed water storage components (e.g., SM) and fluxes (e.g., ET). In the presented study, the multiscale and multivariate verification of water fluxes and states is carried out by executing mHM in a “seamless manner” (i.e., multiscale model simulation, in which each scale realization can be run simultaneously using a same set of model transfer parameters) at the native resolutions of available datasets varying from 0.5 to 100 km.
2. Data and methods
a. Study area and datasets
The study is carried out in 400 European river basins (Fig. 1a) with drainage areas varying from 102 to 106 km2. These basins span across distinct climate conditions ranging from the dry summer subtropical (in the Mediterranean, southern Europe) to maritime temperate (in western Europe) and warm summer continental (in eastern Europe) climate types, according to the Köppen–Geiger classification (Rubel and Kottek 2010). Figure 1a shows the span of runoff ratio that represents the long-term average partitioning of the precipitation into runoff and actual ET . The runoff ratio is a comprehensive measure of physiographic basin and regional climate descriptors (Berger and Entekhabi 2001; Sankarasubramanian and Vogel 2002) that ranges between 0 and 1. Basins with smaller values represent relatively drier conditions with higher evaporative rates (e.g., southern Spain), while larger represents humid or mountainous basins with lower evaporation rates (e.g., alpine regions).
The physiographical datasets used to set up the model mainly include digital elevation models, soil textural properties, and land-cover states. An overview of these datasets is provided in Table 1. Since these datasets are available at different spatial resolutions, they are mapped on a common spatial resolution of 500 m × 500 m. These fine-resolution datasets then allow one to account for subgrid variability of basin physical characteristics in parameter regionalization, as described further in section 2b.
The meteorological forcings for the mHM consist of the daily gridded fields of precipitation and average, maximum, and minimum air temperatures at 0.25° × 0.25° resolution for the period 1950–2010. These datasets are acquired from the European daily high-resolution gridded dataset (E-OBS, version 8.0; Haylock et al. 2008). These fields were created using the external drift kriging interpolation technique from ground-based observation networks. The potential evapotranspiration is derived using the temperature-based method of Hargreaves and Samani (1982) at the same spatial resolution (0.25° × 0.25°).
Streamflow records are commonly used to constrain the model parameterization and to evaluate its performance. Daily streamflow data between 1950 and 2010 were obtained from the Global Runoff Data Centre (GRDC) for this purpose. The data availability varies from station to station, with the median record length of 43 years. All basins used in this study have undertaken a first-order data quality check so that they do not violate the physical constraints imposed by the Budyko relationship (Budyko 1974) and do not exhibit any obvious unnatural behavior in the discharge time series. More detailed analysis, particularly on the degree of regulation of European river basins, is deemed beyond the scope of the study because of the lack of support information. Besides streamflow, model performance is evaluated against complementary datasets, namely, the TWS anomaly, actual evapotranspiration, and soil moisture. A brief overview of these datasets is given below.
1) TWS anomaly
The TWS anomaly represents an important measure of seasonal and interannual variability of the terrestrial water storage and is of critical interest for water resource management. The state of TWS affects infiltration rates, subsurface flows, groundwater recharge, and runoff generation (Li et al. 2012). The remotely sensed anomalies of Earth’s gravity field retrieved by GRACE (release 05; Landerer and Swenson 2012) are used in this study to evaluate the simulated TWS of mHM. The global GRACE gridded dataset has 1° × 1° spatial and monthly temporal resolution. Although the GRACE product is available at coarse spatial and temporal resolutions, its application in hydrologic studies is increasing (Andersen et al. 2005; Zaitchik et al. 2008; Su et al. 2010; Li et al. 2012; Forman et al. 2012; Livneh and Lettenmaier 2012; Cai et al. 2014; Orth and Seneviratne 2015). The TWS anomaly is analyzed using a combined product composed of different solutions obtained from three processing centers: the GeoForschungsZentrum (GFZ; Potsdam, Germany), the Center for Space Research at the University of Texas at Austin (United States), and the Jet Propulsion Laboratory (United States). The TWS anomaly is calculated via removing their corresponding long-term mean estimates, which cover the baseline period from January 2004 to December 2009 (NASA 2015). The arithmetic mean of these three products used here is the most effective way to reduce noise in the gravity field within the available scatter of the three solutions (Sakumura et al. 2014). The evaluation period for the TWS anomaly ranges between 2004 and 2012.
2) Actual ET
Actual ET (latent heat flux) includes evaporation of water from soil, surface water bodies, canopy interception, and transpiration from plants leaves. It represents the second-largest flux of the hydrologic cycle; on average, 60% of terrestrial precipitation is returned back to the atmosphere via ET (Oki and Kanae 2006). In this study, the modeled ET is evaluated against data at two distinct resolutions from 1) finescale eddy covariance observations at 27 CarboEurope sites (Göckede et al. 2008; Mauder et al. 2008) and 2) the 0.5° gridded ET dataset derived from the Flux Network (FLUXNET) observations (Jung et al. 2011).
Basic information for the eddy covariance stations is provided in Table A1 (in the appendix). The footprint of the observations covers approximately 700 m. Only stations with an almost complete record for the years 2004–07 are chosen from the CarboEurope database. Data are processed after Papale et al. (2006), and unit imputation (gap filling) is done by marginal distribution sampling (Reichstein et al. 2005). Observed latent heat fluxes were corrected by authors for missing energy balance closures with a Bowen ratio approach similar to Kessomkiat et al. (2013).
The gridded FLUXNET ET product is acquired from the Department of Biogeochemical Integration at the Max Planck Institute for Biogeochemistry, Jena, Germany. The FLUXNET ET product is obtained by upscaling observations of biosphere–atmosphere fluxes of carbon and energy from eddy covariance flux tower sites using model tree ensembles (MTE; Jung et al. 2011). The global monthly ET product is available at 0.5° × 0.5° for the period 1982–2011. We refer to Jung et al. (2011) for a detailed description of the processing algorithm used to generate this dataset.
3) ESA CCI surface SM
SM acts as a switch and integrator of various energy and water fluxes between the land surface and the atmosphere and is the life-giving substance for vegetation. Correctly estimating the degree of soil saturation is the key point in hydrological modeling because it influences the partitioning of precipitation into ET and runoff. It also has a direct effect on society in terms of agriculture management as well as flood and drought predictions. Moreover, it integrates precipitation and evaporation over periods of days to weeks, thus introducing memory in the hydrological cycle.
The ESA CCI provides a global SM product based on the retrievals from four passive (SMMR, SSM/I, TMI, and AMSR-E) and two active [the ERS Active Microwave Instrument (AMI) and Advanced Scatterometer (ASCAT)] coarse-resolution microwave sensors. The interested reader may refer to Liu et al. (2011) and Dorigo et al. (2014) for detailed descriptions of this dataset. The ESA CCI dataset represents near-surface SM (0.5–2 cm) at 0.25° × 0.25° spatial resolution for the period 1978–2013. The recent study by Dorigo et al. (2014) shows that the skill of the merged product compared to the skill of the individual input products of the passive–active sensors with respect to in situ observation has a comparable and/or better performance than the individual input products in terms of the Spearman rank correlations. We emphasize that the ESA CCI SM product is rescaled to the dynamic range of the GLDAS Noah surface soil moisture fields, and therefore it could not be considered as an independent dataset representing absolute true soil moisture (ESA 2015).
b. The mHM and the MPR
The mHM used in this study is a grid-based distributed model that is grounded on numerical approximations of dominant hydrologic processes applied in known HMs such as the Hydrologiska Byråns Vattenbalansavdelning (HBV; Bergström 1995) model and the Variable Infiltration Capacity model (VIC; Liang et al. 1994). Specifically, the model accounts for the following processes: canopy interception, snow accumulation and melting, SM dynamics, infiltration and surface runoff, ET, subsurface storage and discharge generation, deep percolation and base flow, and flood routing. The snow accumulation and melting processes are modeled using a modified degree-day method that accounts for the enhanced snowmelt during the intense precipitation events (Hundecha and Bárdossy 2004). The incoming precipitation and snowmelt is partitioned into root-zone soil moisture and runoff components, depending on the degree of soil saturation, using a power function similar to the HBV model. The model uses three soil layers to describe the root-zone soil moisture dynamics: the depth of the first soil layer is 5 cm, the second one is 25 cm, and the third layer is up to 100 cm. The soil moisture processes in the first two soil layers account for the variation in soil organic matter over time with changes according to land-cover type. Actual ET from soil layers is estimated as a fraction of potential evapotranspiration depending on the soil moisture stress and fraction of roots in each soil layer. The runoff generation process in mHM accounts for surface, fast- and slow-interflow, and baseflow components. The interflow component represents the fast reaction to weather signals while the base flow represents the slow and permanent groundwater flow. Finally, the total runoff produced at each grid cell is routed to the neighboring downstream grid cell via the Muskingum–Cunge flow routing algorithm. For a complete model description, interested readers may refer to Samaniego et al. (2010; the model code can be downloaded from www.ufz.de/mhm). To date, mHM has been applied over a large number of river basins across Germany, Europe, and the United States (Samaniego et al. 2010, 2013; Kumar et al. 2010, 2013a,b; Livneh et al. 2015; Thober et al. 2015).
The model uses three distinct levels of information to better account for spatial heterogeneities of input data, hydrological processes, and meteorological forcings. The lowest level describes information on input data related to physiographical and morphological characteristics of a basin. The intermediate level is used to model the governing hydrological processes, while the highest level contains information on meteorological datasets. Typically, the spatial resolution of and is the same, on the order of kilometers depending on the availability of the forcing dataset (24 km ≈ 0.25° in this study). The resolution of input data is much finer than the other two, on the order of hectometers (500 m ≈ 0.004° in this study).
The MPR technique (Samaniego et al. 2010; Kumar et al. 2013b) is used to efficiently incorporate the subgrid-level information within the modeling level using a two-step parameterization technique (see Fig. 2). In the first step, model parameters (e.g., porosity) are linked to available basin physical characteristics (e.g., terrain slope or sand and clay contents) using a set of pedotransfer functions f and global parameters . This linkage is established at the spatial resolution to account for the subgrid variability of input data and . In the subsequent step, the fields of model parameters are aggregated to generate the effective regional parameter fields at the modeling level . The aggregation is performed using upscaling operators such as the harmonic, geometric, or arithmetic means, which satisfy flux-matching conditions, that is, minimal discrepancy between aggregated water fluxes simulated across multiple resolutions. A set of global parameters is usually inferred via a suitable parameter estimation technique. This two-step parameter regionalization technique allows the model to run efficiently in a seamless manner at multiple resolutions using the same set of global parameters (see Fig. 2).
According to Gupta et al. (2014, p. 471), the benefit of a regionalization method such as MPR stems from the fact that it “regularize[s] the optimization problem, limiting the degrees of freedom to a small number of regional transfer function coefficients.” The MPR technique, in addition to regularizing the optimization problem, takes into account the subgrid variability of the essential aspects of the physical process that represents a given model parameter (e.g., soil porosity, wilting point, or hydraulic conductivity). Previous studies have demonstrated the effectiveness of the MPR approach over other existing parameterization techniques based on hydrological response units, lumped parameterizations, and standard regionalizations that do not account for the subgrid variability of model parameters (Kumar et al. 2010, 2013b; Samaniego et al. 2010, 2011).
c. Experimental design and model setup
The goal here is to comprehensively evaluate the skill of mHM to represent the spatiotemporal variability of modeled fluxes and states at multiple scales. The experimental design of model parameter estimation and seamless verification is schematically shown in Fig. 2. In the initial phase, the model parameters are constrained against observed streamflow to obtain an ensemble of plausible model parameters with the following procedure.
For each basin, the following steps must be taken:
Prior to the parameter estimation of the model parameters, a subset of “informative” parameters is identified using the sequential screening method developed by Cuntz et al. (2015). This screening method is an adaption of the Morris method (Morris 1991). In the first iteration, the model is evaluated at several points along trajectories of the parameter space and elementary effects are determined. Parameters with an elementary effect above a certain threshold are considered to be informative. The next iteration of the method only takes into account “noninformative” parameters to test whether they are sensitive at other regions of the parameter space. This iterative procedure is repeated until no additional parameters are marked to be informative.
- Find an optimal set of identified in the first step by maximizing the Kling–Gupta efficiency (KGE; Gupta et al. 2009). The shuffled complex evolution (SCE) algorithm (Duan et al. 1993) is used to maximize and simulated streamflow; α denotes the measure of relative variability in the simulated and observed values (ratio of the standard deviations); and β is the ratio between the mean simulations and mean observations, that is, bias. The parameter estimation period varies from basin to basin and ranges between 4 and 16 years of data, depending on the availability of the observed streamflow. Prior to the model parameter estimation, a default run in the period from 1951 to 2010 is conducted to ensure appropriate initializations of internal model states and fluxes. Additionally, 5 years of data prior to the parameter estimation period are used to spinup the model. The model is executed at a daily time step and spatial resolution of 0.25° × 0.25°. The optimal set of inferred by this procedure corresponds to the complete basin and not to individual grid cells. This step is repeated 10 times with random initialization of the parameter space to partially account for the uncertainty of the SCE algorithm, and the best-performing parameter set is selected.
Transfer to all remaining basins for streamflow cross validation. Estimate as a measure of transfer efficiency.
Select the best parameter sets from the pool of based on a cross-validation median value larger than a threshold κ.
Following this procedure, an ensemble of the best 36 parameter sets satisfying the threshold κ of 0.55 is selected to represent the cross-validation uncertainty of model output. The κ criterion of 0.55 is not directly related to performance in an individual basin, but rather, it represents the median KGE value in a cross validation over 400 basins. The location of the respective 36 basins spans over the entire study domain (Fig. 1b), which indicates the representativeness of the donor sample.
In the second phase (Fig. 2, right), the ensemble of 36 parameter sets is used to conduct model simulations at the native scale of the complementary datasets using the following procedure.
For each parameter set , take the following steps:
Estimate effective model parameters at using transfer functions (i.e., f) applied to the basin’s physical characteristics (Fig. 2, left).
Aggregate to at multiple modeling levels using upscaling operators (indicated by angle . The spatial resolution of varies from 0.004° to 1° depending on the variable of interest (Fig. 2, right).
Run mHM using the fields at multiple scales and evaluate its performance for selected water fluxes and states.
Following the aforementioned algorithm, the model is executed 36 times at the spatial resolutions of 1° × 1°, 0.5° × 0.5°, and 0.25° × 0.25° over the entire domain and at 0.004° × 0.004° across 27 eddy grid points (see Fig. 2).
The streamflow evaluation is conducted on the entire record of available streamflow observations within the simulation period 1951–2010. The model skill for complementary gridded datasets is evaluated during the period of 2004–10 for the ESA CCI SM product, 1982–2011 for the gridded LandFlux ET data, and 2004–12 for the GRACE TWS anomaly. The ET evaluation at eddy flux stations is limited to the period 2004–07 because of data availability issues. Three Coordination of Information on the Environment (CORINE) land-cover scenes corresponding to the years 1990, 2000, and 2006 are taken into consideration. Model simulations prior to 1990 use the CORINE 1990 land-cover map.
The model performance is evaluated using multiple statistical criteria that includes r, α, and β [Eq. (2)], which quantifies a mismatch between model and observations with respect to the temporal dynamics, variability, and biases, respectively. Additionally, the model evaluation results are presented using a Taylor diagram (Taylor 2001). This two-dimensional diagram quantifies concisely how well model simulations match observations in terms of r, α, and the root-mean-square difference.
Finally, a two-dimensional histogram of the marginal distributions of the observed and simulated values (also called empirical copula) is used to describe the statistical dependency between the marginal distributions of two random variables (Nelsen 2006). The copula can be generally written as
where x is the observed quantity with distribution function F, y is the simulated value with distribution function G, the left-hand side of the equation is the joint probability P of x and y, and C is the copula between F and G. The copula-based model evaluation allows us to quantify the stochastic dependence of simulated variables with respect to observations along the entire range of the variable.
3. Results and discussion
a. Model evaluation using observed streamflow
The mHM performance for the median KGE values between observed and simulated Q is depicted in Fig. 3a. The results indicate that mHM is capable of simulating daily discharge well over the pan-European domain considering that around 70% of the total area exhibits a median KGE value exceeding 0.5. It should be noted that the KGE values at a given basin are not obtained by means of onsite parameter estimation, but rather by transferring global parameters from other basins. This guarantees consistent representation of hydrological processes at different locations, because the goal here is to obtain parameter sets that represent hydrological fluxes and states across the entire domain applicable for making predictions in ungauged basins. Onsite parameter estimations yield even better performances but increase the dimensionality of the parameter space for the entire domain since each basin in this case requires a set of global parameters. Additionally, Fig. 3b shows the ensemble of the cumulative frequency distributions of model performance based on 36 sets of global parameters . This figure illustrates a considerably narrow variation of model performance due to equally well-performing parameter sets with KGE values higher than 0.55 for 50% of the basins.
The model performance in terms of KGE tends to be homogeneously distributed over space (Fig. 3a). Hot spots of poorer model performance occur in notably human-influenced river basins such as those in southern Spain, where the pressure on water resources is high and observed Q is far from natural conditions because of irrigation diversions, hydroelectric power generation, and flood control (Batalla et al. 2004; Lorenzo-Lacruz et al. 2012). The mHM does not include human-influenced processes, as the majority of other rainfall–runoff models; thus, below-normal performance is expected in those areas. We note that the basins that violate the physical constraints of the Budyko curve are removed prior to the analysis, as discussed earlier; however, this first-order quality check may not be sufficient to filter basins with significant anthropogenic activities. The mHM performance to simulate naturalized streamflow dynamics in other heavily human-influenced U.S. basins is adequate and comparable to other existing models (Kumar et al. 2013a; Livneh et al. 2015). However, the lack of naturalized streamflow dataset in the present study domain limits such types of model evaluation. Another hot spot can be found in eastern Europe (Romanian Carpathian Mountains) where the model systematically underestimates snowmelt-driven floods (in spring). The same behavior is observed by onsite parameter estimation (not shown). In addition to the model conceptual error, the poor performance in these areas can be related to observation errors, such as the precipitation undercatch discussed in the following section.
b. Factors influencing Q predictability
The model performance of KGE and its three components [see r, α, and β in Eq. (2)] is further evaluated in Fig. 4 for basic basin characteristics such as area, rain gauge density, and runoff ratio . In general, the spread in uncertainty decreases with increasing basin area and the model performance also tends to be improved: KGE and r increase, while α and β converge toward their ideal value of one. This type of model performance dependency indicates that smaller basins are more susceptible to errors in model inputs than the larger ones, which also stems from averaging and the central limit theorem. The closer to the representative elementary area (REA), the more difficult it is to model because of the increased effect of small processes considered neither in the model nor in the data. Such kinds of model dependency are also reported in previous studies (Reed et al. 2004; Merz et al. 2009; Kumar et al. 2013b).
Additionally, Fig. 4 depicts the relation between model performance and rain gauge density (number of rain gauges per 1000 km2) to investigate the effect of forcing uncertainty. The model exhibits systematically better performance in regions with relatively higher rain gauge density, particularly in terms of variability. This promotes the importance of having a dense observation network for meaningful hydrological simulations, which is in particular important for capturing small-scale features such as convective cells (Alfieri et al. 2014). The median rain gauge density is 0.4 gauges per 1000 km2 and does not meet the standard of the World Meteorological Organization (WMO) in which the tolerable rain gauge density in flat regions is around 1–2 gauges per 1000 km2, while it increases to 4–10 gauges per 1000 km2 for mountainous regions (Dingman 2004). Therefore, a precipitation product based on sparsely distributed rain gauge data can lead to higher modeling errors arising from imperfect precipitation estimates. However, we would like to emphasize that E-OBS used here is the best possible freely available dataset that exists at the moment with a relatively long temporal coverage, large spatial extent, and fine spatial resolution (Hofstra et al. 2009).
Moreover, Fig. 4 illustrates that the model performance with respect to is usually superior in intermediate physiographic and climatic regimes (0.3 < < 0.7), whereas the performance deteriorates toward both extremes. Generally, the model tends to overestimate the observed mean and variability in relatively moisture-limited (dry) basins. Note that these basins contain areas with human-influenced activities where the model performs poorly (as discussed before). On the other hand, the extreme energy-limited basins exhibit a large bias and a systematically underestimated variance. These shortcomings can be related to several factors: precipitation underestimation due to lower rain gauge density, insufficient evaporation rates, and/or model deficiency in capturing subgrid snow processes. In general, the spread of model statistics (r, α, and β) is considerably lower and the model yields better results in basins with higher rain gauge density, which is observed regardless of the selected basin characteristics (cf. gray filled circles with black crosses in Fig. 4 and their median values).
c. Spatial evaluation using complementary data
The model is further evaluated against the following complementary data (not being used to constrain the model) at monthly temporal resolution and native spatial resolution, namely, fields of the TWS anomaly from GRACE (1° × 1°), ET from FLUXNET (0.5° × 0.5°), and SM from ESA CCI (0.25° × 0.25°). Figure 5 shows the model performance in terms of median r of the original data (Fig. 5, top), median r of standardized anomalies (Fig. 5, middle), and corresponding copulas of the latter one (Fig. 5, bottom).
Overall, the model represents the TWS anomaly and the actual ET adequately well, while the performance for SM is not satisfactory in terms of correlation for the original time series (Figs. 5a–c). Presented hydrological variables exhibit strong seasonality and the performance criteria based on, for example, correlation coefficient for such variables is not adequate to show the actual model skill. Therefore, standardized anomalies of both observations and simulations are estimated by removing their respective monthly means and standard deviations. The correlation for the standardized anomalies shows, in general, deterioration for TWS and ET, but slight improvement in SM when compared to the data with retained seasonality (Figs. 5d–f). The mHM results are consistent in findings of the recent study by Orth and Seneviratne (2015).
A reasonably good agreement is achieved for TWS anomaly in a large part of the study domain. A relatively larger error can be observed in the Alps and coastal areas. This can be attributed to the fact that mHM lacks the capability to explicitly represent the glacial and tidal processes in these areas. Also, the GRACE data are not suitable to accurately quantify ice mass changes in glaciers (Jacob et al. 2012) and there are relatively higher measurement and leakage errors provided in the GRACE dataset along the coastal line. The leakage error of GRACE stands for the “residual errors after filtering and rescaling” from the raw original product to estimate the TWS anomaly (NASA 2015).
The temporal dynamics of modeled ET resembles quite well the FLUXNET-derived ET product with the majority (75%) of cells exceeding correlation coefficients (i.e., r) larger than 0.9 for the original time series, and larger than 0.59 for the standardized anomalies. Poor model performance is noticed on the Iberian Peninsula. Notably, in these areas the model performance for observed Q is also poor (Fig. 3).
In comparison with the two aforementioned variables, r for soil moisture exhibits the poorest performance; however, overall correlation is quite comparable to those obtained in other recent studies (Dorigo et al. 2014; Lievens et al. 2015). Stripes in the model performance may correspond to an artifact of the retrieval algorithm, which is a typical characteristic in the observation through satellite microwave instruments (ESA 2015). Notably, the comparison between satellite-derived soil moisture and distributed models is challenging because of data and modeling reasons. First, there is limited information on the exact depth of the soil layer that is used for the ESA CCI satellite product besides other potential retrieval problems such as vegetation coverage, snow, and ice content. Second, the biases across the statistical moments are very typical for surface soil moisture data derived from satellite retrievals, ground measurements, and models so that they need to be quantified and corrected (Reichle and Koster 2004). Third, the top thin first soil layer of most distributed models is not a good representation of actual soil moisture. Water evaporates as vapor from the soil surface. Soil water models that calculate only liquid water flow have to compensate for the missing process of vapor transport. Most of the models therefore include a very thin upper layer that counteracts the liquid water flow. Soil moisture products such as from cosmic-ray sensors (Zreda et al. 2012) represent larger soil volumes (Köhli et al. 2015) and could become more feasible for model–data comparison in future studies.
The overall model evaluation using empirical copula densities of the standardized anomalies shows a strong statistical dependency between observations and simulations in particular for high and low quantiles (Figs. 5g–i). This has strong implications for drought and flood monitoring using satellite products alone. The statistical dependency for values in between the extremes is close to the diagonal but with considerable spread for the three variables analyzed. This is related to grid cells exhibiting low correlation coefficients between their simulated and observed anomalies (Figs. 5d–f). Among the three variables, the relationship between the observed and simulated TWS anomaly exhibits larger scatter because of the reduced sample size due to a coarser spatial resolution and limited temporal availability of datasets (Fig. 5g). Overall, the copula densities indicate that the matching between observations and simulations needs to be improved for normal conditions as compared to the extremes. The Spearman rank correlations estimated based on these copulas vary from 0.49 to 0.61.
d. Basin-scale evaluation of modeled fluxes and states
The quantitative evaluation of model performance at basin scale is presented with Taylor diagrams (Fig. 6) for monthly estimates of Q, TWS anomaly, ET, and SM. The basins are further classified into three categories based on runoff coefficients representing wet and/or mountainous, intermediate, and dry climatic regimes. In general, the model is able to represent the temporal dynamics of observed Q adequately well with correlations varying between 0.75 and 0.95 in the majority of the analyzed basins. The observed variability is also well captured by the model regardless of the characteristics with a median α value of around 1. On the other hand, the variability in ET is systematically underestimated ( < 1), while temporal dynamics are well represented with r exceeding 0.8 in the majority of basins. Furthermore, the performance of the model is relatively low for the total water storage anomaly and SM. This is observed independently of the runoff ratio. The correlation for the TWS anomaly ranges mostly between 0.6 and 0.9 with higher values being noticed for basins lying in the water-limited regime, which is also seen in Fig. 5. The poorest performance among all analyzed variables is observed for simulating the soil moisture dynamics with r less than 0.6 in the majority of basins.
The model performance in terms of correlation for the original time series (shown in Fig. 6) is contrasted against their corresponding standardized values for different variables in Fig. 7a. All variables exhibit lower performance for their standardized estimates with the exception of soil moisture. The largest deterioration is noticed for the ET followed by the TWS anomaly and the least for Q. The sequence of this deterioration corresponds to the degree of seasonality among the analyzed variables. The best model performance for Q can be partly attributed to the fact that the model is constrained against this variable. Despite the lower model performance for standardized variables, the majority of basins have r values above 0.4, which is well beyond the threshold limit of 0.2 to be statistically significant at the 95% confidence interval.
Based on the results shown in Fig. 7a, the test of differences in mean skill scores between the standardized distribution of Q and other modeled variables have p values lower than 10−5. This indicates that the null hypothesis that Q alone can sufficiently constrain model components responsible for internal fluxes and states in a cross-validation mode can be safely rejected. To further support the aforementioned hypothesis, Fig. 7b differentiates the results of the standardized anomalies into two equally sized groups based on the median KGE cutoff value of 0.55 (as discussed in section 3a). On average, a significant deterioration in model skill score (p value <0.01) is observed for other complementary variables in comparison to discharge, with the exception of TWS anomaly, for which no conclusive deterioration in model skill can be noticed for the worse-performing basins. The deterioration is more pronounced for the group of basins yielding on average better model performance in terms of discharge, which to some degree reflects the overfitting of model parameters during the parameter estimation against the observed discharge. This also indicates that other complementary data are required to appropriately constrain the model in well-performing basins. Such datasets are also of great usage in data-scarce regions, where streamflow observations are not available to constrain the model.
Finally, Fig. 8 shows the monthly dynamics of observed and simulated fluxes and states for three randomly selected basins in dry, intermediate, and wet climatic conditions based on the runoff coefficients. The magnitude and timing of Q are well matched and observations are mostly covered within the uncertainty bounds. Contrary to Q, ET and TWS anomaly exhibit very regular interannual variability. The model is able to follow this behavior quite well, although it tends to underestimate the gridded FLUXNET ET in the intermediate and wet basins. The largest discrepancy between the model and observation occurs for SM consistent with our previously discussed results.
e. Model errors in relation to water balance closure
The aforementioned results of basin-scale model performance for different variables illustrate the existence of potential errors between observations and simulations (Figs. 5–8). They can be attributed to a number of factors, which can be mainly related to model and input data errors. The former constitutes errors due to the improper model structure and/or parameterizations, whereas the latter represents errors due to imperfect forcings and/or response variables. An analysis is carried out to understand the relationship between the errors in input data and errors in individual variables. The residuals in water balance closure based on 20 water years (1989–2008) of observed P, Q, and the gridded FLUXNET ET, are taken as a proxy for the input data error. This error is analyzed across basins with different climatic characteristics and contrasted against errors in simulated Q and ET.
Overall, the errors in water balance closure are rather independent from the physiographical characteristics with a majority of the basins having average annual values between −200 and 100 mm yr−1 (Fig. 9a). The negative water balance errors are caused by either an underestimated source term (P) or an overestimated sink term (ET + Q). Note that ET is not directly measured but is rather estimated by upscaling observations of biosphere–atmosphere fluxes of carbon and energy from eddy covariance flux tower sites with its own uncertainties (Jung et al. 2011). As noted by Velpuri et al. (2013), the errors in ET can yield up to 50% of the mean annual ET values in certain regions. The underestimation of is likely due to inadequate representation of rain gauge coverage failing to capture the small-scale convective events.
The model overestimates the observed discharge in basins where the positive water balance closure errors occur ( > 1.05; Fig. 9a), while underestimations are observed in basins with negative closure errors ( < 0.95; Fig. 9a). This is further supported in Fig. 9b, which indicates that the water balance closure error follows a close relation with the errors between observed and simulated discharge, with a correlation coefficient of around 0.96. The ET errors do not exhibit any dependency to observed water balance errors, since the correlation coefficient is 0.02. The simulated ET estimates averaged across the investigated basins are consistently underestimated by approximately 70 mm yr−1 with respect to observations. Furthermore, the slope of the best-fitted line between water balance closure error and Q error is nearly one, meaning that, on average, a 100-mm water balance closure error would translate to around 100 mm of simulated streamflow error. The slope can also be interpreted as the elasticity of the fitted line here illustrating the sensitivity of proportional changes in modeling error (in case Q) to the changes in water balance closure error. Results of this analysis indicate that a substantial part of the error in modeled variables can be safely attributed to the erroneous observational datasets. These results highlight the need for better quantification of model errors together with erroneous observational data.
f. Local-scale evaluation of ET
The multiscale evaluation of mHM is further carried out against daily ET estimated at eddy covariance stations with distinct vegetation cover. Results of this analysis, summarized in Fig. 10a, indicate that the model is able to capture the temporal dynamics of ET with correlations ranging between 0.6 and 0.9 across 27 eddy covariance stations. In analogy to the gridded-scale ET simulation results (see Fig. 6), the model systematically underestimates the observed variability, indicating the lack of the model to represent the observed range of ET dynamics (i.e., from dry to wet phases). Figures 10b–10d show the time series of observed and modeled ET at three distinct locations. The temporal dynamics of observations is well represented by the model at the forest and grassland sites with correlations of more than 0.88. Relatively poor performance is observed at the semiarid savanna site (r = 0.74). However, the model is able to capture the observed variability, including the sudden jumps in ET values observed at this savanna site. Furthermore, the observed magnitude and variability of ET at the grassland site is strongly underestimated by the model, particularly during summer. This is because the potential ET that is used to force the model at a local scale is generally lower than the observed actual ET.
Results of this analysis provide a first-order confidence that the parameter estimates obtained at much coarser scales can be transferred to finer ones. There is, however, a number of factors that influence the modeling results at the local scale, mainly related to the representation of hydrological processes as well as input data. For example, constraining the model against the local forcings instead of large-scale E-OBS meteorological forcings may improve its performance. Another limiting factor could be due to the estimation of temperature-based potential ET estimates (Hargreaves and Samani 1982), which do not account for other environmental factors, such as wind speed or humidity (Cristea et al. 2012). Finally, the current model version does not account for lateral flows, particularly relevant at the small scales.
The performance of the mesoscale hydrologic model (mHM) parameterized with the multiscale parameter regionalization (MPR) technique is comprehensively evaluated against various in situ and satellite-based observations over 400 European river basins. The multiscale evaluation of internal model fluxes and states is carried out at the native resolution of available data varying from 0.5 to 100 km using an ensemble of cross-validated model parameters constrained only against observed streamflow. Results show that the model is able to perform well for simulating daily discharge over a wide range of climatic and physiographic conditions with KGE greater than 0.55 in more than 50% of the basins. The streamflow predictability deteriorates in basins with a poor rainfall gauge network and in heavily regulated river basins (e.g., southern Spain). Besides the improvement needed in observational networks, further efforts are needed to incorporate large-scale reservoir operations, irrigation, and other human-induced water withdrawal and storage activities.
The multiscale evaluation for the complementary datasets generally shows reasonable but lower performance in comparison to streamflow, which is used to constrain the model parameters. The model shows the best agreement with the gridded FLUXNET evapotranspiration (r > 0.8), followed by the GRACE total water storage anomaly (0.6 < r < 0.9) and the least for the ESA CCI merged soil moisture (r < 0.6). This performance is strongly related to the degree of seasonality that the selected variable exhibits. The skill of the model deteriorates when the annual cycle is removed from each variable except for the soil moisture, with a majority of the basins exhibiting r > 0.4 for the deseasonalized complementary datasets. The analysis of water balance closure errors indicates that a part of the error in modeled variables is due to erroneous observational datasets. While the error between the observed and simulated discharge is closely related to the errors in the water balance closure estimates, modeled ET is consistently underestimated with respect to observations on average by a constant error of 70 mm yr−1.
The local-scale evaluation of evapotranspiration at several eddy covariance sites further supports the functionality of multiscale parameterization of mHM. While the model is able to capture the temporal dynamics of observed evapotranspiration at most of the sites, it consistently underestimates the observed variability regardless of the locations. Besides improvement in the model parameterization to account for the local-scale processes in detail (e.g., subgrid variability of snow and runoff generation processes), future studies may focus on further enhancement in model performance by constraining the model with site-specific information.
This study provides first-order confidence on the ability of the mHM to simulate fluxes and states across a range of spatial scales and varying climatic and physiographic conditions. Because of the implemented MPR technique, it has been possible to run the model at disparate scales native to the observational data, without recalibrating the model. Although the model yields good performance while conditioned on observed discharge, further improvements are expected by optimally exploiting other reliable complementary datasets together with the streamflow. Results of this study indicate that the null hypothesis, that streamflow alone can sufficiently constrain model components responsible for internal fluxes and states in a cross-validation mode, can be safely rejected. Therefore, further research should focus on multivariate parameter estimation or assimilation schemes for improving the ability to predict the regional water fluxes and states over large domains.
This work is partially funded by the Helmholtz Alliance EDA–Remote Sensing and Earth System Dynamics, through the Initiative and Networking Fund of the Helmholtz Association, Germany. It is also part of the Helmholtz Alliance Climate Initiative REKLIM (www.reklim.de). We acknowledge E-OBS from the EU-FP6 project ENSEMBLES (http://ensembles-eu.metoffice.com) and the data providers in the ECA&D project (http://www.ecad.eu); the U.S. Geological Survey’s Earth Resources Observation and Science Center and NASA’s Land Processes Distributed Active Archive Center; the European Environment Agency; the Harmonized World Soil Database; the Global Runoff Data Centre; the Gravity Recovery and Climate Experiment (available at http://grace.jpl.nasa.gov with processing algorithms provided by Sean Swenson and supported by the NASA MEaSUREs Program); the Department Biogeochemical Integration at the Max Planck Institute for Biogeochemistry, Jena, Germany; and the European Space Agency Climate Change Initiative. The French river runoff data were kindly provided by Jean-Philippe Vidal. Additionally, we acknowledge the PIs at the CarboEurope eddy covariance sites and the corresponding projects: CarboEuropeIP (EU-FP6), CarboItaly (IT-FISR), CarboMont (EU-FP5), NitroEuropeIP (EU-FP7), GHG-Europe (EU-FP7), and GreenGrass (EU-FP5). These data were downloaded from the European Fluxes Database Cluster at http://gaia.agraria.unitus.it. We thank the three anonymous reviewers and the Editor for their comments, which improved the quality of the manuscript.
Details on the Eddy Covariance Stations
Table A1 provides detailed information on the eddy covariance stations.