  • Abramowitz, G., 2005: Towards a benchmark for land surface models. Geophys. Res. Lett., 32, L22702, doi:10.1029/2005GL024419.

  • Abramowitz, G., 2012: Towards a public, standardized, diagnostic benchmarking system for land surface models. Geosci. Model Dev., 5, 819–827, doi:10.5194/gmd-5-819-2012.

  • Abramowitz, G., R. Leuning, M. Clark, and A. Pitman, 2008: Evaluating the performance of land surface models. J. Climate, 21, 5468–5481, doi:10.1175/2008JCLI2378.1.

  • Baldocchi, D., and Coauthors, 2001: FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities. Bull. Amer. Meteor. Soc., 82, 2415–2434, doi:10.1175/1520-0477(2001)082<2415:FANTTS>2.3.CO;2.

  • Beck, M. B., and Coauthors, 2009: Grand challenges of the future for environmental modeling. White Paper, National Science Foundation, Arlington, VA, 135 pp. [Available online at http://www.ewp.rpi.edu/hartford/~ernesto/S2013/MMEES/Papers/ENVIRONMENT/1EnvironmentalSystemsModeling/Beck2009-nsfwhitepaper.pdf.]

  • Best, M. J., and Coauthors, 2011: The Joint UK Land Environment Simulator (JULES), model description – Part 1: Energy and water fluxes. Geosci. Model Dev., 4, 677–699, doi:10.5194/gmd-4-677-2011.

  • Best, M. J., and Coauthors, 2015: The plumbing of land surface models: Benchmarking model performance. J. Hydrometeor., 16, 1425–1442, doi:10.1175/JHM-D-14-0158.1.

  • Beven, K. J., and P. Young, 2013: A guide to good practice in modeling semantics for authors and referees. Water Resour. Res., 49, 5092–5098, doi:10.1002/wrcr.20393.

  • Blöschl, G., and M. Sivapalan, 1995: Scale issues in hydrological modelling: A review. Hydrol. Processes, 9, 251–290, doi:10.1002/hyp.3360090305.

  • Clark, M. P., D. Kavetski, and F. Fenicia, 2011: Pursuing the method of multiple working hypotheses for hydrological modeling. Water Resour. Res., 47, W09301, doi:10.1029/2010WR009827.

  • Clark, M. P., and Coauthors, 2015: A unified approach for process-based hydrologic modeling: 1. Modeling concept. Water Resour. Res., 51, 2498–2514, doi:10.1002/2015WR017198.

  • Cover, T. M., and J. A. Thomas, 1991: Elements of Information Theory. Wiley-Interscience, 726 pp.

  • Cybenko, G., 1989: Approximation by superpositions of a sigmoidal function. Math. Control Signal, 2, 303–314, doi:10.1007/BF02551274.

  • Draper, D., 1995: Assessment and propagation of model uncertainty. J. Roy. Stat. Soc., 57B, 45–97.

  • Edwards, A. F. W., 1984: Likelihood. Cambridge University Press, 243 pp.

  • Gong, W., H. V. Gupta, D. Yang, K. Sricharan, and A. O. Hero, 2013: Estimating epistemic and aleatory uncertainties during hydrologic modeling: An information theoretic approach. Water Resour. Res., 49, 2253–2273, doi:10.1002/wrcr.20161.

  • Gupta, H. V., and G. S. Nearing, 2014: Using models and data to learn: A systems theoretic perspective on the future of hydrological science. Water Resour. Res., 50, 5351–5359, doi:10.1002/2013WR015096.

  • Gupta, H. V., M. P. Clark, J. A. Vrugt, G. Abramowitz, and M. Ye, 2012: Towards a comprehensive assessment of model structural adequacy. Water Resour. Res., 48, W08301, doi:10.1029/2011WR011044.

  • Gupta, H. V., C. Perrin, R. Kumar, G. Blöschl, M. Clark, A. Montanari, and V. Andréassian, 2014: Large-sample hydrology: A need to balance depth with breadth. Hydrol. Earth Syst. Sci., 18, 463–477, doi:10.5194/hess-18-463-2014.

  • Hansen, M. C., R. S. DeFries, J. R. G. Townshend, and R. Sohlberg, 2000: Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens., 21, 1331–1364, doi:10.1080/014311600210209.

  • Jaynes, E. T., 2003: Probability Theory: The Logic of Science. Cambridge University Press, 727 pp.

  • Jung, M., M. Reichstein, and A. Bondeau, 2009: Towards global empirical upscaling of FLUXNET eddy covariance observations: Validation of a model tree ensemble approach using a biosphere model. Biogeosciences, 6, 2001–2013, doi:10.5194/bg-6-2001-2009.

  • Kavetski, D., G. Kuczera, and S. W. Franks, 2006: Bayesian analysis of input uncertainty in hydrological modeling: 2. Application. Water Resour. Res., 42, W03408, doi:10.1029/2005WR004376.

  • Keenan, T. F., E. Davidson, A. M. Moffat, W. Munger, and A. D. Richardson, 2012: Using model–data fusion to interpret past trends, and quantify uncertainties in future projections, of terrestrial ecosystem carbon cycling. Global Change Biol., 18, 2555–2569, doi:10.1111/j.1365-2486.2012.02684.x.

  • Kumar, S. V., and Coauthors, 2014: Assimilation of remotely sensed soil moisture and snow depth retrievals for drought estimation. J. Hydrometeor., 15, 2446–2469, doi:10.1175/JHM-D-13-0132.1.

  • Liu, Y. Q., and H. V. Gupta, 2007: Uncertainty in hydrologic modeling: Toward an integrated data assimilation framework. Water Resour. Res., 43, W07401, doi:10.1029/2006WR005756.

  • Liu, Y. Q., and Coauthors, 2011: The contributions of precipitation and soil moisture observations to the skill of soil moisture estimates in a land data assimilation system. J. Hydrometeor., 12, 750–765, doi:10.1175/JHM-D-10-05000.1.

  • Luo, Y. Q., and Coauthors, 2012: A framework for benchmarking land models. Biogeosciences, 9, 3857–3874, doi:10.5194/bg-9-3857-2012.

  • Mo, K. C., L. N. Long, Y. Xia, S. K. Yang, J. E. Schemm, and M. Ek, 2011: Drought indices based on the climate forecast system reanalysis and ensemble NLDAS. J. Hydrometeor., 12, 181–205, doi:10.1175/2010JHM1310.1.

  • Montanari, A., and D. Koutsoyiannis, 2012: A blueprint for process-based modeling of uncertain hydrological systems. Water Resour. Res., 48, W09555, doi:10.1029/2011WR011412.

  • Neal, R. M., 1993: Probabilistic inference using Markov chain Monte Carlo methods. Tech. Rep. CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 144 pp. [Available online at http://www.cs.toronto.edu/~radford/ftp/review.pdf.]

  • Nearing, G. S., and H. V. Gupta, 2015: The quantity and quality of information in hydrologic models. Water Resour. Res., 51, 524–538, doi:10.1002/2014WR015895.

  • Nearing, G. S., H. V. Gupta, and W. T. Crow, 2013: Information loss in approximately Bayesian estimation techniques: A comparison of generative and discriminative approaches to estimating agricultural productivity. J. Hydrol., 507, 163–173, doi:10.1016/j.jhydrol.2013.10.029.

  • Oberkampf, W. L., S. M. DeLand, B. M. Rutherford, K. V. Diegert, and K. F. Alvin, 2002: Error and uncertainty in modeling and simulation. Reliab. Eng. Syst. Saf., 75, 333–357, doi:10.1016/S0951-8320(01)00120-X.

  • Paninski, L., 2003: Estimation of entropy and mutual information. Neural Comput., 15, 1191–1253, doi:10.1162/089976603321780272.

  • Peters-Lidard, C. D., S. V. Kumar, D. M. Mocko, and Y. Tian, 2011: Estimating evapotranspiration with land data assimilation systems. Hydrol. Processes, 25, 3979–3992, doi:10.1002/hyp.8387.

  • Poulin, A., F. Brissette, R. Leconte, R. Arsenault, and J.-S. Malo, 2011: Uncertainty of hydrological modelling in climate change impact studies in a Canadian, snow-dominated river basin. J. Hydrol., 409, 626–636, doi:10.1016/j.jhydrol.2011.08.057.

  • Rasmussen, C., and C. Williams, 2006: Gaussian Processes for Machine Learning. MIT Press, 248 pp.

  • Schöniger, A., T. Wöhling, and W. Nowak, 2015: A statistical concept to assess the uncertainty in Bayesian model weights and its impact on model ranking. Water Resour. Res., 51, 7524–7546, doi:10.1002/2015WR016918.

  • Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, doi:10.1002/j.1538-7305.1948.tb01338.x.

  • Snelson, E., and Z. Ghahramani, 2006: Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds., Neural Information Processing Systems, 1257–1264. [Available online at http://papers.nips.cc/paper/2857-sparse-gaussian-processes-using-pseudo-inputs.]

  • van den Hurk, B., M. Best, P. Dirmeyer, A. Pitman, J. Polcher, and J. Santanello, 2011: Acceleration of land surface model development over a decade of GLASS. Bull. Amer. Meteor. Soc., 92, 1593–1600, doi:10.1175/BAMS-D-11-00007.1.

  • Wand, M. P., and M. C. Jones, 1994: Kernel Smoothing. CRC Press, 212 pp.

  • Weijs, S. V., G. Schoups, and N. Giesen, 2010: Why hydrological predictions should be evaluated using information theory. Hydrol. Earth Syst. Sci., 14, 2545–2558, doi:10.5194/hess-14-2545-2010.

  • Wilby, R. L., and I. Harris, 2006: A framework for assessing uncertainties in climate change impacts: Low-flow scenarios for the River Thames, UK. Water Resour. Res., 42, W02419, doi:10.1029/2005WR004065.

  • Xia, Y., and Coauthors, 2012a: Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 (NLDAS-2): 1. Intercomparison and application of model products. J. Geophys. Res., 117, D03109, doi:10.1029/2011JD016051.

  • Xia, Y., and Coauthors, 2012b: Continental-scale water and energy flux analysis and validation for North American Land Data Assimilation System project phase 2 (NLDAS-2): 2. Validation of model-simulated streamflow. J. Geophys. Res., 117, D03110, doi:10.1029/2011JD016051.

  • Xia, Y., J. Sheffield, M. B. Ek, J. Dong, N. Chaney, H. Wei, J. Meng, and E. F. Wood, 2014: Evaluation of multi-model simulated soil moisture in NLDAS-2. J. Hydrol., 512, 107–125, doi:10.1016/j.jhydrol.2014.02.027.

  • Xia, Y., M. T. Hobbins, Q. Mu, and M. B. Ek, 2015: Evaluation of NLDAS-2 evapotranspiration against tower flux site observations. Hydrol. Processes, 29, 1757–1771, doi:10.1002/hyp.10299.

  • Ziv, J., and M. Zakai, 1973: On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory, 19, 275–283, doi:10.1109/TIT.1973.1055015.
    Median ARD inverse correlation lengths from soil moisture SPGPs trained at each site using only lagged precipitation data. Inverse correlation lengths indicate a posteriori sensitivity to each dimension of the input data. The hourly inputs approach a minimum value around 15 lag periods at the 100-cm depth and the daily inputs approach a minimum at around 25 lag periods at the 10-cm depth. This indicates that these lag periods are generally sufficient to capture the information from forcing data that is available to the SPGPs. All benchmark SPGPs were trained with these lag periods.

    Scatterplots of soil moisture observations and estimates made by the NLDAS-2 models (black) and by the benchmarks (gray) in both soil layers [(a),(b) for surface soil moisture; (c),(d) for top 100-cm soil moisture]. The regressions [(a),(c)] act on the forcing data only and the regressions [(b),(d)] act on forcing data plus parameters. The mean anomaly correlations over all sites are listed.

    The fraction of total uncertainty in soil moisture estimates contributed by each model component. These plots are conceptually identical to Fig. 2, except that these use real data.

    Scatterplots of ET observations and estimates made by the NLDAS-2 models (black) and by the benchmarks (gray). (top) The regressions act on the forcing data only and (bottom) the regressions act on forcing data plus parameters. The mean anomaly correlations over all sites are listed.

Benchmarking NLDAS-2 Soil Moisture and Evapotranspiration to Separate Uncertainty Contributions

  • 1 Hydrological Sciences Laboratory, NASA GSFC, Greenbelt, Maryland, and Science Applications International Corporation, McLean, Virginia
  • 2 Hydrological Sciences Laboratory, NASA GSFC, Greenbelt, Maryland
  • 3 Hydrological Sciences Laboratory, NASA GSFC, Greenbelt, Maryland, and Science Applications International Corporation, McLean, Virginia
  • 4 NOAA/NCEP/Environmental Modeling Center, College Park, and I. M. Systems Group, Rockville, Maryland

Abstract

Model benchmarking allows us to separate uncertainty in model predictions caused by model inputs from uncertainty due to model structural error. We extend this method with a “large sample” approach (using data from multiple field sites) to measure prediction uncertainty caused by errors in 1) forcing data, 2) model parameters, and 3) model structure, and use it to compare the efficiency of soil moisture state and evapotranspiration flux predictions made by the four land surface models in phase 2 of the North American Land Data Assimilation System (NLDAS-2). Parameters dominated uncertainty in soil moisture estimates and forcing data dominated uncertainty in evapotranspiration estimates; however, the models themselves used only a fraction of the information available to them. This means that there is significant potential to improve all three components of NLDAS-2. In particular, continued work toward refining the parameter maps and lookup tables, the forcing data measurement and processing, and the land surface models themselves has the potential to improve estimates of surface mass and energy balances.

Corresponding author address: Grey S. Nearing, Hydrological Sciences Laboratory, NASA GSFC, 8800 Greenbelt Rd., Code 617, Bldg. 33, Rm. G205, Greenbelt, MD 20771. E-mail: grey.s.nearing@nasa.gov

1. Introduction

Abramowitz et al. (2008) found that statistical models outperform physics-based models at estimating land surface states and fluxes and concluded that land surface models are not able to fully utilize information in forcing data. Gong et al. (2013) provided a theoretical explanation for this result and also showed how to measure both the underutilization of available information by a particular model and the extent to which the information available from forcing data was unable to resolve the total uncertainty about the predicted phenomena. That is, they separated uncertainty due to forcing data from uncertainty due to imperfect models.

Dynamical systems models, however, are composed of three primary components (Gupta and Nearing 2014): model structures are descriptions of and solvers for hypotheses about the governing behavior of a certain class of dynamical systems, model parameters describe details of individual members of that class of systems, and forcing data are measurements of the time-dependent boundary conditions of each prediction scenario. The analysis by Gong et al. (2013) did not distinguish uncertainties that are due to a misparameterized model from those that are due to a misspecified model structure, and we propose that this distinction is important for directing model development and efforts to both quantify and reduce uncertainty.

The problem of segregating these three sources of uncertainty has been studied extensively (e.g., Keenan et al. 2012; Montanari and Koutsoyiannis 2012; Schöniger et al. 2015; Liu and Gupta 2007; Kavetski et al. 2006; Draper 1995; Oberkampf et al. 2002; Wilby and Harris 2006; Poulin et al. 2011; Clark et al. 2011). Almost ubiquitously, the methods that have been applied to this problem are based on the chain rule of probability theory (Liu and Gupta 2007). These methods ignore model structural error completely (e.g., Keenan et al. 2012), require sampling a priori distributions over model structures (e.g., Clark et al. 2011), or rely on distributions derived from model residuals (e.g., Montanari and Koutsoyiannis 2012). In all cases, results are conditional on the proposed model structure(s). Multimodel ensembles allow us to assess the sensitivity of predictions to a choice between different model structures, but they do not facilitate true uncertainty attribution or partitioning. Specifically, any distribution (prior or posterior) over potential model parameters and/or structures is necessarily degenerate (Nearing et al. 2015, manuscript submitted to Hydrol. Sci. J.), and sampling from or integrating over such distributions does not facilitate uncertainty estimates that approach any true value.

The theoretical development by Gong et al. (2013) fundamentally solved this problem. They first measured the amount of information contained in the forcing data (that is, the total amount of information available for the model to translate into predictions)¹ and then showed that this represents an upper bound on the performance of any model (not just the model being evaluated). Deviation between a given model’s actual performance and this upper bound represents uncertainty due to errors in that model. The upper bound can, in theory, be estimated using an asymptotically accurate empirical regression (e.g., Cybenko 1989; Wand and Jones 1994). That is, estimates and attributions of uncertainty produced by this method approach correct values as the amount of evaluation data increases, which is not true for any method that relies on sampling from degenerate distributions over models.

In this paper, we extend the analysis of information use efficiency by Gong et al. (2013) to consider model parameters. We do this by using a “large sample” approach (Gupta et al. 2014) that requires field data from a number of sites. Formally, this is an example of model benchmarking (Abramowitz 2005). A benchmark consists of 1) a specific reference value for 2) a particular performance metric that is computed against 3) a specific dataset. Benchmarks have been used extensively to test land surface models (e.g., van den Hurk et al. 2011; Best et al. 2011; Abramowitz 2012; Best et al. 2015). They allow for direct and consistent comparisons between different models, and although it has been argued that they can be developed to highlight potential model deficiencies (Luo et al. 2012), there is no systematic method for doing so [see discussion by Beck et al. (2009)]. What we propose is a systematic benchmarking strategy that at least lets us evaluate whether the problems with land surface model predictions are due primarily to forcings, parameters, or structures.

We applied the proposed strategy to benchmark the four land surface models that constitute phase 2 of the North American Land Data Assimilation System (NLDAS-2; Xia et al. 2012a,b), which is a continental-scale ensemble land modeling and data assimilation system. The structure of the paper is as follows. The main text describes the application of this theory to NLDAS-2. Methods are given in section 2 and results in section 3. Section 4 discusses both the strengths and limitations of information-theoretic benchmarking in general and how the results can be interpreted in the context of our application to NLDAS-2. A brief and general theory of model performance metrics is given in the appendix, along with an explanation of the basic concept of information-theoretic benchmarking. The strategy is general enough to be applicable to any dynamical systems model.

2. Methods

a. NLDAS-2

NLDAS-2 produces distributed hydrometeorological products over the contiguous United States that are used primarily for drought assessment and numerical weather prediction (NWP) initialization. NLDAS-2 is the second generation of the NLDAS, which became operational at the National Centers for Environmental Prediction in 2014. Xia et al. (2012a) provided extensive details about the NLDAS-2 models, forcing data, and parameters, and so we present only a brief summary here.

NLDAS-2 runs four land surface models over a North American domain (25°–53°N, 125°–67°W) at ⅛° resolution: 1) Noah, 2) Mosaic, 3) the Sacramento Soil Moisture Accounting (SAC-SMA) model, and 4) the Variable Infiltration Capacity model (VIC). Noah and Mosaic run at a 15-min time step whereas SAC-SMA and VIC run at an hourly time step; however, all produce hourly time-averaged output of soil moisture in various soil layers and evapotranspiration at the surface. Mosaic has three soil layers with depths of 10, 30, and 160 cm. Noah uses four soil layers with depths of 10, 30, 60, and 100 cm. SAC-SMA uses conceptual water storage zones that are postprocessed to produce soil moisture values at the depths of the Noah soil layers. VIC uses a 10-cm surface soil layer and two deeper layers with variable soil depths. Here we are concerned with estimating surface and root-zone (top 100 cm) soil moistures. The former is taken to be the moisture content of the top 10 cm (top layer of each model), and the latter as the depth-weighted average over the top 100 cm of the soil column.
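The depth-weighted root-zone average described above can be sketched as follows. This is a minimal illustration, not the NLDAS-2 postprocessing code: it assumes that the quoted layer values (e.g., 10, 30, 60, and 100 cm for Noah) are layer thicknesses measured downward from the surface, and the function name `depth_weighted_mean` and the moisture values are hypothetical.

```python
import numpy as np

def depth_weighted_mean(layer_moisture, layer_thickness_cm, max_depth_cm=100.0):
    """Depth-weighted mean soil moisture over the top `max_depth_cm` of the column.

    Assumes layers are ordered from the surface downward and that
    `layer_thickness_cm` holds thicknesses, not bottom depths.
    """
    moisture = np.asarray(layer_moisture, dtype=float)
    thickness = np.asarray(layer_thickness_cm, dtype=float)
    bottoms = np.cumsum(thickness)
    tops = bottoms - thickness
    # Portion of each layer that lies above max_depth_cm (zero for deeper layers).
    overlap = np.clip(np.minimum(bottoms, max_depth_cm) - tops, 0.0, None)
    return float(np.sum(moisture * overlap) / np.sum(overlap))

# Layer thicknesses as quoted in the text; moisture values are made up.
noah_thickness = [10, 30, 60, 100]
noah_moisture = [0.20, 0.25, 0.30, 0.35]

surface = noah_moisture[0]                                     # top 10 cm
root_zone = depth_weighted_mean(noah_moisture, noah_thickness)  # top 100 cm
```

Clipping the overlap lets the same function handle a layer that straddles the 100-cm cutoff, such as the 160-cm third Mosaic layer, without special cases.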

Atmospheric data from the North American Regional Reanalysis (NARR), which are natively at 32-km spatial resolution and 3-h temporal resolution, are interpolated to the 15-min and ⅛° resolution required by NLDAS-2. NLDAS-2 forcing also includes several observational datasets, including daily gauge-based precipitation, which is temporally disaggregated to hourly using a number of different data sources, as well as satellite-derived shortwave radiation used for bias correction. A lapse-rate correction between the NARR grid elevation and the NLDAS grid elevation was also applied to several NLDAS-2 surface meteorological forcing variables. NLDAS forcings consist of eight variables: 2-m air temperature (K), 2-m specific humidity (kg kg⁻¹), 10-m zonal and meridional wind speed (m s⁻¹), surface pressure (kPa), hourly integrated precipitation (kg m⁻²), and incoming longwave and shortwave radiation (W m⁻²). All models act only on the total wind speed, and in this study we also used only the net radiation (sum of shortwave and longwave), so that a total of six forcing variables were considered at each time step.
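The reduction from eight forcing variables to six can be illustrated with a short sketch. The function and argument names are hypothetical, and the sketch assumes that "total wind speed" means the magnitude of the horizontal wind vector; it is not taken from the NLDAS-2 code.

```python
import numpy as np

def reduce_forcings(t2m, q2m, u10, v10, psurf, precip, lw_down, sw_down):
    """Collapse the eight NLDAS-2 forcing variables into six regressors.

    Returns an array of shape (n_time, 6) with columns:
    temperature, humidity, wind speed, pressure, precipitation, radiation.
    """
    # Total wind speed from zonal and meridional components (assumed vector magnitude).
    wind_speed = np.hypot(np.asarray(u10, dtype=float), np.asarray(v10, dtype=float))
    # Sum of incoming longwave and shortwave radiation.
    radiation = np.asarray(lw_down, dtype=float) + np.asarray(sw_down, dtype=float)
    return np.column_stack([t2m, q2m, wind_speed, psurf, precip, radiation])

# Two made-up hourly time steps.
forcing = reduce_forcings(
    t2m=[290.0, 291.5], q2m=[0.008, 0.009],
    u10=[3.0, 0.0], v10=[4.0, 2.0],
    psurf=[101.3, 101.1], precip=[0.0, 1.2],
    lw_down=[350.0, 360.0], sw_down=[200.0, 0.0],
)
```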

Parameters used by each model are listed in Table 1. The vegetation and soil classes are categorical variables and are therefore unsuitable for use as regressors in our benchmarks. The vegetation classification indices were therefore mapped onto a five-dimensional real-valued parameter set using the University of Maryland (UMD) classification system (Hansen et al. 2000). These real-valued vegetation parameters included optimum transpiration air temperature (called topt in the Noah model and literature), a radiation stress parameter (rgl), maximum and minimum stomatal resistances (rsmax and rsmin), and a parameter used in the calculation of vapor pressure deficit (hs). Similarly, the soil classification indices were mapped, for use in NLDAS-2 models, to soil hydraulic parameters: porosity, field capacity, wilting point, a Clapp–Hornberger-type exponent, saturated matric potential, and saturated conductivity. These mappings from class indices to real-valued parameters ensured that similar parameter values generally indicated similar phenomenological behavior. In addition, certain models use one or two time-dependent parameters: monthly climatology of greenness fraction, quarterly albedo climatology, and monthly leaf area index (LAI). These were each interpolated to the model time step and so had different values at each time step.
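The mapping from categorical class indices to real-valued regressors amounts to a table lookup. The sketch below shows the idea only: the table `VEG_TABLE`, its three classes, and every number in it are placeholders invented for illustration, not values from the UMD classification or the NLDAS-2 lookup tables.

```python
import numpy as np

# Hypothetical lookup table, one row per vegetation class.
# Columns: topt, rgl, rsmax, rsmin, hs. All values are placeholders.
VEG_TABLE = np.array([
    [298.0,  30.0, 5000.0, 150.0, 47.0],
    [298.0, 100.0, 5000.0,  40.0, 36.0],
    [298.0,  65.0, 5000.0, 100.0, 42.0],
])

def class_to_parameters(class_index, table=VEG_TABLE):
    """Replace integer class indices with rows of real-valued parameters."""
    idx = np.asarray(class_index, dtype=int)
    return table[idx]

# Four sites, two of which share a vegetation class.
site_classes = [0, 2, 2, 1]
site_params = class_to_parameters(site_classes)  # shape (4, 5)
```

Sites with the same class receive identical parameter vectors, so a regression on these columns can treat classes with similar values as similar, which an arbitrary integer label would not allow.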

Table 1.

Parameters used by the NLDAS-2 models.

b. Benchmarks

As mentioned in the introduction, a model benchmark consists of three components: a particular dataset, a particular performance metric, and a particular reference value for that metric. The following subsections describe these three components of our benchmark analysis of NLDAS-2.

1) Benchmark dataset

As was done by Kumar et al. (2014) and Xia et al. (2014), we evaluated the NLDAS-2 models against quality-controlled hourly soil moisture observations from the Soil Climate Analysis Network (SCAN). Although there are over 100 operational SCAN sites, we used only those 49 sites with at least 2 years of complete hourly data during the period of 2001–11. These sites are distributed throughout the NLDAS-2 domain (Fig. 1). The SCAN data have measurement depths of 5, 10, 20.3, 51, and 101.6 cm (2, 4, 8, 20, and 40 in.) and were quality controlled (Liu et al. 2011) and depth averaged to 10 and 100 cm to match the surface and root-zone depth-weighted model estimates.

Fig. 1.

Location of the SCAN and AmeriFlux stations used in this study. Each SCAN station contributed 2 years of hourly measurements (N = 17 520) and each AmeriFlux station contributed 4000 hourly measurements to the training of the model regressions.

Citation: Journal of Hydrometeorology 17, 3; 10.1175/JHM-D-15-0063.1

For evapotranspiration (ET), we used level 3 station data from the AmeriFlux network (Baldocchi et al. 2001). We used only those 50 sites that had at least 4000 time steps of hourly data during the period 2001–11. The AmeriFlux network was also used by Mo et al. (2011) and by Xia et al. (2015) for evaluation of the NLDAS-2 models, and a gridded flux dataset from Jung et al. (2009), based on the same station data, was used by Peters-Lidard et al. (2011) to assess the impact on ET estimates of soil moisture data assimilation in the NLDAS framework.

2) Benchmark metrics and reference values

Nearing and Gupta (2015) provide a brief overview of the theory of model performance metrics, and the general formula for a performance metric is given in the appendix. All performance metrics measure some aspect (either quantity or quality) of the information content of model predictions, and the metric that we propose here uses this fact explicitly.

The basic strategy for measuring uncertainty due to model errors is to first measure the amount of information available in model inputs (forcing data and parameters) and then to subtract the information that is contained in model predictions. The latter is always less than the former since the model is never perfect, and this difference measures uncertainty (i.e., lack of complete information) that is due to model error (Nearing and Gupta 2015). This requires that we measure information (and uncertainty) using a metric that behaves so that the total quantity of information available from two independent sources is the sum of the information available from either source. The only metrics that meet this requirement are those based on Shannon-type entropy (Shannon 1948), so we used this standard definition of information and accordingly measure uncertainty as (conditional) entropy (the appendix contains further explanation).
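The additivity requirement can be made concrete with a standard identity. As an illustration only, with $x_1$ and $x_2$ as generic symbols (not notation from this paper) for two statistically independent discrete sources, the joint probability factorizes and the Shannon entropy of the pair is the sum of the individual entropies:

```latex
\begin{align}
H(x_1, x_2) &= -\sum_{x_1, x_2} p(x_1)\,p(x_2)\,\log\bigl[p(x_1)\,p(x_2)\bigr] \\
            &= -\sum_{x_1} p(x_1)\log p(x_1) \;-\; \sum_{x_2} p(x_2)\log p(x_2) \\
            &= H(x_1) + H(x_2).
\end{align}
```

The second line follows because the logarithm turns the product into a sum and each marginal distribution sums to one.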

To segregate the three sources of uncertainty (forcings, parameters, and structures), we require three reference values. The first is the total entropy of the benchmark observations, which is notated as $H(z)$, where $z$ represents observations. Strictly speaking, $H(z)$ is the amount of uncertainty that one has when drawing randomly from the available historical record, and this is equivalent, at least in the context of the benchmark dataset, to the amount of information necessary to make accurate and precise predictions of the benchmark observations. Note that $H(z)$ is calculated using all benchmark observations at all sites simultaneously, since the total uncertainty prior to adding any information from forcing data, parameters, or models includes no distinction between sites.

The second reference value measures information about the benchmark observations contained in model forcing data. This is notated as $I(z;u)$, where $I$ is the mutual information function (Cover and Thomas 1991, chapter 2), $z$ represents the benchmark observations, and $u$ represents the forcing data. Mutual information is the amount of entropy of either variable that is resolvable given knowledge of the other variable. For example, $H(z \mid u)$ is the entropy (uncertainty) in the benchmark observations conditional on the forcing data and is equal to the total prior uncertainty $H(z)$ less the information content of the forcing data: $H(z \mid u) = H(z) - I(z;u)$. This difference, $H(z \mid u)$, measures uncertainty that is due to errors or incompleteness in the forcing data.
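The relationship between entropy, mutual information, and conditional entropy can be checked numerically. The sketch below uses a simple histogram (plug-in) estimator on a single synthetic forcing variable; this is a swapped-in technique chosen for brevity, not the regression-based estimation used in this study, and the symbols `z` and `u`, the function names, and the bin count are assumptions of the example.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a histogram of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def information_terms(z, u, bins=10):
    """Return H(z), I(z;u), and H(z|u) from a joint histogram of z and u."""
    joint, _, _ = np.histogram2d(z, u, bins=bins)
    h_z = entropy(joint.sum(axis=1))   # marginal over u
    h_u = entropy(joint.sum(axis=0))   # marginal over z
    h_zu = entropy(joint)              # joint entropy
    mutual_info = h_z + h_u - h_zu
    return h_z, mutual_info, h_z - mutual_info

# Synthetic data: observations partly determined by the forcing variable.
rng = np.random.default_rng(0)
u = rng.normal(size=5000)
z = u + 0.5 * rng.normal(size=5000)

h_z, i_zu, h_z_given_u = information_terms(z, u)
```

Plug-in estimates of this kind are biased for small samples and depend on the binning (see Paninski 2003), so the sketch is suitable only for illustrating the identities, not for reproducing the reported values.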

Our third reference value is the total amount of information about the benchmark observations that is contained in the forcing data plus model parameters. This is notated as $I(z;u,\theta)$, where $\theta$ represents model parameters. As discussed in the introduction, $\theta$ is what differentiates between applications of a particular model to different dynamical systems (in this case, as applied at different SCAN or AmeriFlux sites), and it is important to understand that $I(z;u)$ describes the relationship between forcing data and observations at a particular site, whereas $I(z;u,\theta)$ considers how the relationship between model forcings and benchmark observations varies between sites, and how much the model parameters can tell us about this intersite variation. Section 2b(3) describes how to deal with this subtlety when calculating these reference values; however, for now the somewhat counterintuitive result is that $I(z;u)$ is always greater than $I(z;u,\theta)$, since no set of model parameters can ever be expected to fully and accurately describe differences between field sites.

Finally, the actual benchmark performance metric is the total information available in model predictions and is notated $I(z;y^M)$, where $y^M$ represents the model predictions. Because of the data processing inequality [see appendix, as well as Gong et al. (2013)], these four quantities will always obey the following hierarchy:
$$H(z) \geq I(z;u) \geq I(z;u,\theta) \geq I(z;y^M). \tag{1}$$
Furthermore, since Shannon information is additive, the differences between each of these ordered quantities represent the contribution to total uncertainty due to each model component. This is illustrated in Fig. 2, which is adapted from Gong et al. (2013) to include parameters. The total uncertainty in the model predictions is $H(z) - I(z;y^M)$, and the portions of this total uncertainty that are due to forcing data, parameters, and model structure are $H(z) - I(z;u)$, $I(z;u) - I(z;u,\theta)$, and $I(z;u,\theta) - I(z;y^M)$, respectively.
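The additivity can be checked by writing the decomposition out as a telescoping sum. In the sketch below, $H(z)$ denotes the entropy of the observations and $I(z;u)$, $I(z;u,\theta)$, and $I(z;y^M)$ denote the information about the observations contained in the forcing data, in the forcing data plus parameters, and in the model predictions; this symbol choice is an assumption made for illustration.

```latex
\begin{aligned}
\underbrace{H(z) - I(z;y^M)}_{\text{total uncertainty}}
  &= \underbrace{\bigl[H(z) - I(z;u)\bigr]}_{\text{forcing data}}
   + \underbrace{\bigl[I(z;u) - I(z;u,\theta)\bigr]}_{\text{parameters}}
   + \underbrace{\bigl[I(z;u,\theta) - I(z;y^M)\bigr]}_{\text{model structure}} .
\end{aligned}
```

The intermediate terms cancel in pairs, so the three contributions always sum to the total, and each is nonnegative whenever the four quantities are ordered from largest to smallest.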
Fig. 2.

A conceptual diagram of uncertainty decomposition using Shannon information. The term $H(z)$ represents the total uncertainty (entropy) in the benchmark observations, and $I(z;u)$ represents the amount of information about the benchmark observations that is available from the forcing data. Uncertainty due to forcing data is the difference between the total entropy and the information available in the forcing data. The information in the parameters plus forcing data is $I(z;u,\theta)$, and $I(z;u,\theta) \leq I(z;u)$ because of errors in the parameters. The term $I(z;y^M)$ is the total information available from the model, and $I(z;y^M) \leq I(z;u,\theta)$ because of model structural error. This figure is adapted from Gong et al. (2013).

Citation: Journal of Hydrometeorology 17, 3; 10.1175/JHM-D-15-0063.1

The above differences that measure uncertainty contributions can be reformulated as efficiency metrics. The efficiency of the forcing data is simply the fraction of resolvable entropy:
$$E_u = \frac{I(z;u)}{H(z)}. \tag{2.1}$$
The efficiency of the model parameters to interpret information in forcing data independent of any particular model structure is
$$E_\theta = \frac{I(z;u,\theta)}{I(z;u)}, \tag{2.2}$$
and the efficiency of any particular model structure at interpreting all of the available information (in forcing data and parameters) is
$$E_M = \frac{I(z;y^M)}{I(z;u,\theta)}. \tag{2.3}$$
In summary, the benchmark performance metric that we use is Shannon’s mutual information function, $I(z;y^M)$, which measures the decrease in entropy (uncertainty) due to running the model. To decompose prediction uncertainty into its constituent components due to forcing data, parameters, and the model structure, we require three benchmark reference values: $H(z)$, $I(z;u)$, and $I(z;u,\theta)$. These reference values represent a series of decreasing upper bounds on model performance, and appropriate differences between the performance metric and these reference values partition uncertainties. Similarly, appropriate ratios, given in Eqs. (2.1)–(2.3), measure the efficiency of each model component at utilizing available information.
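As a minimal sketch of the bookkeeping summarized above, the following Python snippet takes the four quantities (entropy of the observations, information in the forcings, information in the forcings plus parameters, and information in the model predictions) and returns the three uncertainty contributions and the three efficiencies. The function name, field names, and numerical values are hypothetical and chosen only for illustration; they are not results from this study.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    total: float           # total uncertainty in model predictions (nats)
    forcing: float         # portion due to forcing data
    parameters: float      # portion due to parameters
    structure: float       # portion due to model structure
    eff_forcing: float     # fraction of entropy resolvable from forcings
    eff_parameters: float  # fraction of forcing information kept with parameters
    eff_structure: float   # fraction of available information used by the model


def partition_uncertainty(h_z, i_u, i_u_theta, i_model):
    """Partition uncertainty from four ordered information quantities (nats)."""
    if not (h_z >= i_u >= i_u_theta >= i_model >= 0.0):
        raise ValueError("expected h_z >= i_u >= i_u_theta >= i_model >= 0")
    if i_u_theta <= 0.0:
        raise ValueError("efficiencies are undefined when i_u_theta is zero")
    return Partition(
        total=h_z - i_model,
        forcing=h_z - i_u,
        parameters=i_u - i_u_theta,
        structure=i_u_theta - i_model,
        eff_forcing=i_u / h_z,
        eff_parameters=i_u_theta / i_u,
        eff_structure=i_model / i_u_theta,
    )


# Illustrative values only (not taken from the paper's tables).
p = partition_uncertainty(h_z=4.0, i_u=3.0, i_u_theta=1.5, i_model=0.5)
print(p)
```

The three contributions add up to the total by construction, which is the additivity property that motivates the use of Shannon information.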

3) Calculating information metrics

Calculating the first reference value, $H(z)$, is relatively straightforward. There are many ways to numerically estimate entropy and mutual information (Paninski 2003), and here we used maximum likelihood estimators. A histogram was constructed using all $N$ observations of a particular quantity (10-cm soil moisture, 100-cm soil moisture, or ET from all sites), and the first reference value was
$$H(z) = -\sum_{i=1}^{B} \frac{n_i}{N} \ln\!\left(\frac{n_i}{N}\right), \tag{3.1}$$
where $n_i$ is the histogram count for the $i$th of $B$ bins. The histogram bin width determines the effective precision of the benchmark measurements, and we used a bin width of 0.01 m$^3$ m$^{-3}$ (1% volumetric water content) for soil moisture and 5 W m$^{-2}$ for ET.
Similarly, the benchmark performance metric $I(z;y^M)$ is also straightforward to calculate. In this case, a joint histogram was estimated using all observations and model predictions at all sites, and the joint entropy was calculated as
$$H(z, y^M) = -\sum_{i=1}^{B} \sum_{j=1}^{B} \frac{n_{ij}}{N} \ln\!\left(\frac{n_{ij}}{N}\right), \tag{3.2}$$
where $n_{ij}$ is the count in the joint histogram bin $(i, j)$. We used square histogram bins so that the effective precision of the benchmark measurements and model predictions was the same, and for convenience we notate the same number of bins $B$ in both dimensions. The entropy of the model predictions $H(y^M)$ was calculated in a way identical to Eq. (3.1), and mutual information was
$$I(z;y^M) = H(z) + H(y^M) - H(z, y^M). \tag{3.3}$$
The other two intermediate reference values, $I(z;u)$ and $I(z;u,\theta)$, are more complicated. The forcing data were very high dimensional because the system effectively acts on all past forcing data; therefore, it is impossible to estimate mutual information using a histogram as above. To reduce the dimensionality of the problem we trained a separate regression of the form $\mathcal{R}^u_i: \{u_{1:t,i}\} \rightarrow \{z_{t,i}\}$ (where the curly brackets indicate set notation) for each individual site, where the site is indexed by $i$. That is, we used the benchmark observations from a particular site to train an empirical regression that mapped a (necessarily truncated) time history of forcing data onto predictions $y^u_{t,i}$. The reference value was then estimated as $I(z;u) \approx I(z;y^u)$, where $I(z;y^u)$ was calculated according to Eqs. (3.1)–(3.3) using all data from all sites simultaneously. Even though a separate regression was trained at each site, we did not calculate site-specific reference values.

As described in the appendix, the regressions are actually kernel density estimators of the conditional probability density $p(z \mid u)$, and to the extent that these estimators are asymptotically complete (i.e., they approach the true functional relationships between $u$ and $z$ at individual sites in the limit of infinite training data), $I(z;y^u)$ approaches the true benchmark reference value.

The value $I(z;u,\theta)$ was estimated in a similar way; however, to account for the role of parameters in representing differences between sites, a single regression $\mathcal{R}^{u,\theta}: \{u_{1:t}, \theta\} \rightarrow \{z_t\}$ (where the curly brackets indicate set notation) was trained using data from all sites simultaneously. This regression was used to produce estimates $y^{u,\theta}$ at all sites, and these data were then used to estimate $I(z;y^{u,\theta})$ according to Eqs. (3.1)–(3.3).

It is important to point out that we did not use a split-record training/prediction for either the regressions at each site or for the regressions trained with data from all sites simultaneously. This is because our goal was to measure the amount of information in the regressors (forcing data and parameters), rather than to develop a model that could be used to make future predictions. The amount of information in each set of regressors is determined completely by the injectivity of the regression mapping. That is, if the functional mapping from a particular set of regressors onto benchmark observations preserves distinctness, then those regressors provide complete information about the diagnostics; they are able to completely resolve $H(z)$. If there is error or incompleteness in the forcing data or parameter data, or if these data are otherwise insufficient to distinguish between distinct system behavior (i.e., the system is truly stochastic or it is random up to the limit of the information in regressors), then the regressors lack complete information and therefore contribute to prediction uncertainty. For this method to work, we must have sufficient data to identify this type of redundancy, and like all model evaluation exercises, the results are only as representative as the evaluation data.
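A toy discrete example may help make the role of distinctness concrete. In the sketch below (hypothetical variable names, made-up data), one regressor distinguishes every observed state, whereas a degraded regressor cannot tell two of the states apart. The first resolves all of the entropy in the observations; the second leaves a residual that would count as uncertainty due to the inputs.

```python
from collections import Counter

import numpy as np


def entropy(labels):
    """Plug-in entropy in nats of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def mutual_information(x, z):
    """I(x; z) = H(x) + H(z) - H(x, z) for paired discrete sequences."""
    return entropy(x) + entropy(z) - entropy(list(zip(x, z)))


z = [0, 1, 2, 3] * 25                 # four equally likely observed states
x_complete = [10 * v for v in z]      # every state maps to a distinct input
x_degraded = [min(v, 2) for v in z]   # states 2 and 3 look identical

h_z = entropy(z)
i_complete = mutual_information(x_complete, z)
i_degraded = mutual_information(x_degraded, z)
print(h_z, i_complete, i_degraded)
```

Here the complete regressor recovers the full entropy of ln 4 nats, while the degraded regressor recovers 1.5 ln 2 nats and leaves 0.5 ln 2 nats unresolved.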

4) Training the regressions

A separate $\mathcal{R}^u_i$ regression was trained at each site, so that in the soil moisture case there were 98 (49 × 2) separate $\mathcal{R}^u_i$ regressions, and in the ET case there were 50 separate $\mathcal{R}^u_i$ regressions. In contrast, a single $\mathcal{R}^{u,\theta}$ regression was trained separately for each observation type and for each LSM (because the LSMs used different parameter sets) on data from all sites so that there were a total of 12 separate $\mathcal{R}^{u,\theta}$ regressions (10-cm soil moisture, 100-cm soil moisture, and ET for each of Noah, Mosaic, SAC-SMA, and VIC).

We used sparse pseudo-input Gaussian processes (SPGPs; Snelson and Ghahramani 2006), which are kernel density emulators of differentiable functions. SPGPs are computationally efficient and very general in the class of functions that they can emulate. SPGPs use a stationary anisotropic squared exponential kernel (see Rasmussen and Williams 2006, chapter 4) that we call an automatic relevance determination (ARD) kernel for reasons that are described presently. Because the land surface responds differently during rain events than it does during dry-down, we trained two separate SPGPs for each observation variable to act on time steps 1) during and 2) between rain events. Thus, each $\mathcal{R}^u_i$ and $\mathcal{R}^{u,\theta}$ regression consisted of two separate SPGPs.

Because the NLDAS-2 models effectively act on all past forcing data, it was necessary for the regressions to act on lagged forcings. We used hourly lagged forcings from the 15 h previous to time $t$ plus daily averaged (or aggregated in the case of precipitation) forcings for the 25 days prior to that. These lag periods were chosen based on an analysis of the sensitivity of the SPGPs. The anisotropic ARD kernel assigns a separate correlation length to each input dimension in the set of regressors (Neal 1993), and the correlation lengths of the ARD kernel were chosen as the maximum likelihood estimates conditional on the training data. Higher a posteriori correlation lengths (lower inverse correlation lengths) correspond to input dimensions to which the SPGP is less sensitive, which is why this type of kernel is sometimes called an ARD kernel: it provides native estimates of the relative (nonlinear and nonparametric) sensitivity to each regressor. We chose lag periods for the forcing data that reflect the memory of the soil moisture at these sites. To do this, we trained rainy and dry SPGPs at all sites using only precipitation data over a lag period of 24 h plus 120 days. We then truncated the hourly and daily lag periods where the mean a posteriori correlation lengths stabilized at a constant value: 15 hourly lags and 25 daily lags. This is illustrated in Fig. 3. Since soil moisture is the unique long-term control on ET, we used the same lag period for ET as for soil moisture.
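The link between correlation length and sensitivity can be illustrated with a plain anisotropic squared exponential kernel. The sketch below is not an SPGP and does not fit any hyperparameters; it only evaluates the kernel for hand-picked, hypothetical correlation lengths to show that a unit change along a dimension with a long correlation length barely changes the covariance.

```python
import numpy as np


def ard_squared_exponential(xa, xb, length_scales, signal_variance=1.0):
    """Anisotropic squared exponential kernel, one length scale per input."""
    length_scales = np.asarray(length_scales, dtype=float)
    a = np.atleast_2d(np.asarray(xa, dtype=float)) / length_scales
    b = np.atleast_2d(np.asarray(xb, dtype=float)) / length_scales
    sq_dist = (
        (a**2).sum(axis=1)[:, None]
        + (b**2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating point round-off.
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


# Hypothetical correlation lengths: short in dimension 0, long in dimension 1.
length_scales = [0.5, 50.0]
origin = [0.0, 0.0]

k_dim0 = ard_squared_exponential(origin, [1.0, 0.0], length_scales)[0, 0]
k_dim1 = ard_squared_exponential(origin, [0.0, 1.0], length_scales)[0, 0]
print(k_dim0, k_dim1)
```

A unit step along dimension 0 drops the covariance to exp(-2), whereas the same step along dimension 1 leaves it close to 1, so the emulated function is effectively insensitive to that input.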

Fig. 3.

Median ARD inverse correlation lengths from soil moisture SPGPs trained at each site using only lagged precipitation data. Inverse correlation lengths indicate a posteriori sensitivity to each dimension of the input data. The hourly inputs approach a minimum value around 15 lag periods at the 100-cm depth and the daily inputs approach a minimum at around 25 lag periods at the 10-cm depth. This indicates that these lag periods are generally sufficient to capture the information from forcing data that is available to the SPGPs. All benchmark SPGPs were trained with these lag periods.


Because of the time-lagged regressors, each SPGP for rainy time steps in the $\mathcal{R}^u_i$ regressions acted on 240 forcing inputs, and each SPGP for dry time steps acted on 239 forcing data inputs (the latter did not consider the zero rain condition at the current time $t$). Similarly, the wet and dry SPGPs that constituted the $\mathcal{R}^{u,\theta}$ regressions acted on the same forcing data, plus the number of parameter inputs necessary for each model (a separate $\mathcal{R}^{u,\theta}$ regression was trained for each of the four NLDAS-2 land surface models). Each $\mathcal{R}^u_i$ regression for SCAN soil moisture was trained using 2 years of data (17 520 data points), and each SCAN $\mathcal{R}^{u,\theta}$ regression was trained on 100 000 data points selected randomly from the 49 × 17 520 = 858 480 available. The ET $\mathcal{R}^u_i$ regressions were trained on 4000 data points, and the ET $\mathcal{R}^{u,\theta}$ regressions were trained on 100 000 of the 50 × 4000 = 200 000 available. All $\mathcal{R}^u_i$ SPGPs used 1000 pseudoinputs [see Snelson and Ghahramani (2006) for an explanation of pseudoinputs], and all $\mathcal{R}^{u,\theta}$ SPGPs used 2000 pseudoinputs.
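The construction of the lagged regressor vector can be sketched as follows. Several details here are assumptions made for illustration and are not stated in the text: that there are six forcing variables (inferred from 240 inputs over 15 + 25 lags), that the 15 hourly values include the current time step, that daily aggregates are taken over consecutive 24-h blocks immediately before the hourly window, and that precipitation is stored in column 0. The function names are hypothetical.

```python
import numpy as np

N_HOURLY = 15       # hourly lags, assumed to include the current time step
N_DAILY = 25        # daily aggregates preceding the hourly window
HOURS_PER_DAY = 24


def lagged_features(series, t, aggregate=np.mean):
    """Return 40 lagged features of one hourly forcing series at index t."""
    start_hourly = t - N_HOURLY + 1
    start_daily = start_hourly - N_DAILY * HOURS_PER_DAY
    if start_daily < 0:
        raise ValueError("not enough history before index t")
    hourly = series[start_hourly:t + 1][::-1]  # most recent first
    blocks = series[start_daily:start_hourly].reshape(N_DAILY, HOURS_PER_DAY)
    daily = aggregate(blocks, axis=1)[::-1]    # most recent day first
    return np.concatenate([hourly, daily])


def regressor_vector(forcings, t, precip_column=0):
    """Stack lagged features of every forcing variable (columns of forcings)."""
    parts = []
    for j in range(forcings.shape[1]):
        agg = np.sum if j == precip_column else np.mean  # precipitation is summed
        parts.append(lagged_features(forcings[:, j], t, aggregate=agg))
    return np.concatenate(parts)


rng = np.random.default_rng(0)
forcings = rng.random((1000, 6))      # synthetic hourly data, six variables
x_rainy = regressor_vector(forcings, t=700)
x_dry = np.delete(x_rainy, 0)         # drop current-time precipitation

ramp = lagged_features(np.arange(1000.0), t=700)  # easy-to-check test series
print(x_rainy.shape, x_dry.shape, ramp[0], ramp[15])
```

Under these assumptions the rainy and dry vectors have 240 and 239 entries, matching the input counts quoted above.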

3. Results

a. Soil moisture

Figure 4 compares the model and benchmark estimates of soil moisture with SCAN observations and also provides anomaly correlations for the model estimates, which for Noah were very similar to those presented by Kumar et al. (2014). The spread of the benchmark estimates around the 1:1 line represents uncertainty that was unresolvable given the input data; this occurred when we were unable to construct an injective mapping from inputs to observations. This happened, for example, near the high range of the soil moisture observations, which indicates that the forcing data were not representative of the largest rainfall events at these measurement sites. This might be due to localized precipitation events that are not always captured by the ⅛° forcing data and is an example of the type of lack of representativeness that is captured by this information analysis: the forcing data simply lack this type of information.

Fig. 4.

Scatterplots of soil moisture observations and estimates made by the NLDAS-2 models (black) and by the benchmarks (gray) in both soil layers [(a),(b) for surface soil moisture; (c),(d) for top 100-cm soil moisture]. The $\mathcal{R}^u_i$ regressions [(a),(c)] act on the forcing data only and the $\mathcal{R}^{u,\theta}$ regressions [(b),(d)] act on forcing data plus parameters. The mean anomaly correlations over all sites are listed.


It is clear from these scatterplots that the models did not use all available information in the forcing data. In concordance with the empirical results of Abramowitz et al. (2008) and the theory of Gong et al. (2013), the statistical models here outperformed the physics-based models. This is not at all surprising considering that the regressions were trained on the benchmark dataset, which, to reemphasize, is necessary for this particular type of analysis. Figure 5 reproduces the conceptual diagram from Fig. 2 using the data from this study and directly compares the three benchmark reference values with the values of the benchmark performance metric. Table 2 lists the fractions of total uncertainty, that is, $H(z) - I(z;y^M)$, that were due to each model component, and Table 3 lists the efficiency metrics calculated according to Eqs. (2.1)–(2.3).

Fig. 5.

The fraction of total uncertainty in soil moisture estimates contributed by each model component. These plots are conceptually identical to Fig. 2, except that these use real data.


Table 2. Fractions of total uncertainty due to forcings, parameters, and structures.

Table 3. Efficiency of forcings, parameters, and structures according to Eqs. (2.1)–(2.3).

The total uncertainty in each set of model predictions was generally about 90% of the total entropy of the benchmark observations (this was similar for all four land surface models and can be inferred from Fig. 5). Forcing data accounted for about a quarter of this total uncertainty related to soil moisture near the surface (10 cm), and about one-sixth of total uncertainty in the 100-cm observations (Table 2). The difference is expected since the surface soil moisture responds more dynamically to the system boundary conditions, and so errors in measurements of those boundary conditions will have a larger effect in predicting the near-surface response.

In all cases except SAC-SMA, parameters accounted for about half of total uncertainty in both soil layers, but for SAC-SMA this percentage was higher, at 60% and 70% for the two soil depths, respectively (Table 2). Similarly, the efficiencies of the different parameter sets were relatively low—below 45% in all cases and below 30% for SAC-SMA (Table 3). SAC-SMA parameters are a strict subset of the others, so it is not surprising that this set contained less information. In general, these results indicate that the greatest potential for improvement to NLDAS-2 simulations of soil moisture would come from improving the parameter sets.

Although the total uncertainty in all model predictions was similar, the model structures themselves performed very differently. Overall, VIC performed the worst and was able to use less than a quarter of the information available to it, while SAC-SMA was able to use almost half (Table 3). SAC-SMA had less information to work with (from parameters; Fig. 5), but it was better at using what it had. The obvious extension of this analysis would measure which of the parameters that were not used by SAC-SMA are the most important, and then determine how SAC-SMA might consider the processes represented by these missing parameters. It is interesting to notice that the model structure that performed the best, SAC-SMA, was an uncalibrated conceptual model, whereas Noah, Mosaic, and VIC are ostensibly physics based (and VIC parameters were calibrated).

The primary takeaway from these results is that there is significant room to improve both the NLDAS-2 models and parameter sets, but that the highest return on investment, in terms of predicting soil moisture, will likely come from looking at the parameters. This type of information-based analysis could easily be extended to look at the relative value of individual parameters.

b. Evapotranspiration

Figure 6 compares the model and benchmark estimates of ET with AmeriFlux observations. Again, the spread in the benchmark estimates is indicative of substantial unresolvable uncertainty given the various input data. Figure 5 again plots the ET reference values and values of the ET performance metrics. Related to ET, forcing data accounted for about two-thirds of total uncertainty in the predictions from all four models (Table 2). Parameters accounted for about one-fifth of total uncertainty, and model structures only accounted for about 10%. In all three cases, the fractions of ET uncertainty due to different components were essentially the same between the four models. Related to efficiency, the forcing data were able to resolve less than half of total uncertainty in the benchmark observations, and the parameters and structures generally had efficiencies between 50% and 60%, with the efficiencies of the models being slightly higher (Table 3). Again, the ET efficiencies were similar among all four models and their respective parameter sets.

Fig. 6.

Scatterplots of ET observations and estimates made by the NLDAS-2 models (black) and by the benchmarks (gray). (top) The $\mathcal{R}^u_i$ regressions act on the forcing data only and (bottom) the $\mathcal{R}^{u,\theta}$ regressions act on forcing data plus parameters. The mean anomaly correlations over all sites are listed.


4. Discussion

The purpose of this paper is twofold. First, we demonstrate (and expand) information-theoretic benchmarking as a way to quantify contributions to uncertainty in dynamical model predictions without relying on degenerate priors or on specific model structures. Second, we use this strategy to measure the potential for improving various aspects of the continental-scale hydrologic modeling system, NLDAS-2.

Related to NLDAS-2 specifically, we found significant potential to improve all parts of the modeling system. Parameters contributed the most uncertainty to soil moisture estimates, and forcing data contributed the majority of uncertainty to evapotranspiration estimates; however, the models themselves used only a fraction of the information that was available to them. Differences between the ET results and those from the soil moisture experiments highlight that model adequacy (Gupta et al. 2012) depends very much on the specific purpose of the model (in this case, the “purpose” indicates what variable we are particularly interested in predicting with the model). As mentioned above, an information use efficiency analysis like this one could easily be extended not only to look at the information content of individual parameters, but also of individual process components of a model by using a modular modeling system (e.g., Clark et al. 2011). We therefore expect that this study will serve as a foundation for a diagnostic approach to both assessing and improving model performance, again in a way that does not rely on simply comparing a priori models. The ideas presented here also will guide the development and evaluation of the next phase of NLDAS, which will be at a finer spatial scale and will include updated physics in the land surface models, data assimilation of remotely sensed water states, improved model parameters, and higher-quality forcing data.

Related to benchmarking theory in general, there have recently been a number of large-scale initiatives to compare, benchmark, and evaluate the land surface models used for hydrological, ecological, and weather and climate prediction (e.g., van den Hurk et al. 2011; Best et al. 2015); however, we argue that those efforts have not exploited the full power of model benchmarking. The most exciting aspect of the benchmarking concept seems to be its ability to help us understand and measure factors that limit model performance—specifically, benchmarking’s ability to assign (approximating) upper bounds on the potential to improve various components of the modeling system. As we mentioned earlier, essentially all existing methods for quantifying uncertainty rely on a priori distributions over model structures, and because such distributions are necessarily incomplete, there is no way for such analyses to give approximating estimates of uncertainty. What we outline here can provide such estimates. It is often at least theoretically possible to use regressions that asymptotically approximate the true relationship between model inputs and outputs (Cybenko 1989).

The caveat here is that although this type of benchmarking-based uncertainty analysis solves the problem of degenerate priors, the problem of finite evaluation data remains. We can argue that information-theoretic benchmarking allows us to produce asymptotic estimates of uncertainty, but since we will only ever have access to a finite number of benchmark observations, the best we can ever hope to do in terms of uncertainty partitioning (using any available method) is to estimate uncertainty in the context of whatever data we have available. We can certainly extrapolate any uncertainty estimates into the future (e.g., Montanari and Koutsoyiannis 2012), but there is no guarantee that such extrapolations will be correct. Information-theoretic benchmarking does not solve this problem. All model evaluation exercises necessarily ask the question “What information does the model provide about the available observations?” Such is the nature of inductive reasoning.

Similarly, although it is possible to explicitly consider error in the benchmark observations during uncertainty partitioning (Nearing and Gupta 2015), any estimate of this observation error ultimately and necessarily constitutes part of the model that we are evaluating (Nearing et al. 2015, manuscript submitted to Hydrol. Sci. J.). The only thing that we can ever assess during any type of model evaluation (in fact, during any application of the scientific method) is whether a given model (including all probabilistic components) is able to reproduce various instrument readings with certain accuracy and precision. Like any other type of uncertainty analysis, benchmarking is fully capable of testing models that do include models of instrument error and representativeness.

The obvious open question is about how to use this to fix our models. It seems that the method proposed here might, at least theoretically, help to address the question in certain respects. To better understand the relationship between individual model parameters and model structures, we could use an $\mathcal{R}^{u,\theta}$-type regression that acts only on a single model parameter to measure the amount of information contained in that parameter, and then measure the ability of a given model structure to extract information from that parameter by running the model many times at all sites using random samples of the other parameters and calculating something like an efficiency ratio analogous to Eq. (2.3). This would tell us whether a model is making efficient use of a single parameter, but not whether that parameter itself is a good representation of differences between any real dynamical systems. It would also be interesting to know whether the model is most sensitive (in a traditional sense) to the same parameters that contain the most information. Additionally, if we had sufficient and appropriate evaluation data, we could use a deconstructed model or set of models, like what was proposed by Clark et al. (2015), to measure the ability of any individual model process representation to use the information made available to it via other model processes, parameters, and boundary conditions.

To summarize, Earth scientists are collecting ever-increasing amounts of data from a growing number of field sites and remote sensing platforms. These data are typically not cheap, and we expect that it will be valuable to understand the extent to which we are able to fully utilize this investment—that is, by using it to characterize and model biogeophysical relationships. Hydrologic prediction in particular seems to be a data-limited endeavor. Our ability to apply our knowledge of watershed physics is limited by unresolved heterogeneity in the systems at different scales (Blöschl and Sivapalan 1995), and we see here that this difficulty manifests in our data and parameters. Our ability to resolve prediction problems will, to a large extent, be dependent on our ability to collect and make use of observational data, and one part of this puzzle involves understanding the extents to which 1) our current data are insufficient and 2) our current data are underutilized. Model benchmarking has the potential to help distinguish these two issues.

Acknowledgments

Thank you to Martyn Clark (NCAR) for his help with organizing the presentation. The NLDAS-2 data used in this study were acquired as part of NASA’s Earth–Sun System Division and archived and distributed by the Goddard Earth Sciences (GES) Data and Information Services Center (DISC) Distributed Active Archive Center (DAAC). Funding for AmeriFlux data resources was provided by the U.S. Department of Energy’s Office of Science.

APPENDIX

A General Description of Model Performance Metrics

We begin with five things: 1) a (probabilistic) model $M$ with 2) parameter values $\theta$ acts on 3) measurements of time-dependent boundary conditions $u$ to produce 4) time-dependent estimates or predictions $y^M$ of phenomena that are observed by 5) $z$. A deterministic model is simply a delta distribution; however, even when we use a deterministic model, we always treat the answer as a statistic of some distribution that is typically implied by some performance metric (Weijs et al. 2010). Invariably, during model evaluation, the model implies a distribution over the observation that we notate $p(z \mid y^M)$.

Further, we use the word “information” to refer to the change in a probability distribution due to conditioning on a model or data [see discussion by Jaynes (2003), and also, but somewhat less importantly, by Edwards (1984)]. Since probabilities are multiplicative, the effect that new information has on our current state of knowledge about what we expect to observe is given by the ratio
$$\frac{p(z \mid y^M)}{p(z)}, \tag{A1}$$
where $p(z)$ is our prior knowledge about the observations before running the model. In most cases, $p(z)$ will be an empirical distribution derived from past observations of the same phenomenon [see Nearing and Gupta (2015) for a discussion].
Information is defined by Eq. (A1), and measuring this information (i.e., collapsing the ratio to a scalar) requires integrating. The information contributed by a model to any set of predictions is measured by integrating this ratio, so that the most general expression for any measure of the information contained in model predictions about observations is
$$\mathcal{I}_f(z; y^M) = E_z\!\left[ f\!\left( \frac{p(z \mid y^M)}{p(z)} \right) \right]. \tag{A2}$$
The integration in the expected value operator $E_z$ is over the range of possibilities for the value of the observation. Most standard performance metrics (e.g., bias, mean-squared error, and correlation coefficient) take this form [see appendix A of Nearing and Gupta (2015)]. The function $f$ is essentially a utility function and can be thought of, in a very informal way, as defining the question that we want to answer about the observations.
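Since $y^M$ is a transformation of $u$ and $\theta$ (via model $M$), any information measure $\mathcal{I}_f$ of the form of Eq. (A2) where $f$ is monotone and convex is bounded by (Ziv and Zakai 1973):
$$\mathcal{I}_f(z; y^M) \leq \mathcal{I}_f(z; u, \theta). \tag{A3}$$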
Equation (A3) is called the data processing inequality, and it represents the reference value for our benchmark.
Shannon (1948) showed that the only function $f$ that results in an additive measure of information that takes the form of Eq. (A2) is $f(\cdot) = \log_b(\cdot)$, where $b$ is any base. As described presently, we require an additive measure, so the performance metric for our benchmark takes the form of Eq. (A2) and uses the natural log as the integrating function. We therefore measure entropy and mutual information in units of nats in the usual way, as
$$H(z) = -\int p(z) \ln p(z)\, dz \tag{A4}$$
and
$$I(z;\xi) = \iint p(z,\xi) \ln \frac{p(z \mid \xi)}{p(z)}\, dz\, d\xi, \tag{A5}$$
respectively, where $\xi$ is a placeholder for any variable that informs us about the observations (e.g., $u$, $\theta$, and $y^M$).
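The standard definitions of entropy and mutual information connect directly to the conditional entropy used in the main text. The short derivation below assumes the usual continuous forms, with $z$ the observations and $\xi$ a placeholder symbol (an assumption made here) for any informing variable, and uses only the marginalization $\int p(z,\xi)\, d\xi = p(z)$.

```latex
\begin{aligned}
I(z;\xi)
  &= \iint p(z,\xi) \ln \frac{p(z \mid \xi)}{p(z)} \, dz \, d\xi \\
  &= -\iint p(z,\xi) \ln p(z) \, dz \, d\xi
     + \iint p(z,\xi) \ln p(z \mid \xi) \, dz \, d\xi \\
  &= -\int p(z) \ln p(z) \, dz - H(z \mid \xi) \\
  &= H(z) - H(z \mid \xi).
\end{aligned}
```

Rearranged, $H(z \mid \xi) = H(z) - I(z;\xi)$: the uncertainty that remains after conditioning equals the prior entropy less the information supplied, which is the form of every difference used in the uncertainty partition.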

Because it is necessary to have a model to translate the information contained in $u$ and $\theta$ into information about the observations $z$, the challenge in applying this benchmark is to estimate $p(z \mid u, \theta)$. This conditional probability distribution can be estimated using some form of kernel density function (Cybenko 1989; Rasmussen and Williams 2006; Wand and Jones 1994), which creates a mapping function $\mathcal{R}^{u,\theta}: \{u, \theta\} \rightarrow \{z\}$, where the $\mathcal{R}$ stands for regression to indicate that this is fundamentally a generative approach to estimating probability distributions [see Nearing et al. (2013) for a discussion] and where the curly brackets indicate set notation. The regression estimates are $y^{u,\theta}$. To the extent that this regression is asymptotically complete (i.e., it approaches the true functional relationship between the set $\{u, \theta\}$ and $z$), an approximation of the right-hand side of Eq. (A3) approaches the benchmark reference value.

REFERENCES

  • Abramowitz, G., 2005: Towards a benchmark for land surface models. Geophys. Res. Lett., 32, L22702, doi:10.1029/2005GL024419.

  • Abramowitz, G., 2012: Towards a public, standardized, diagnostic benchmarking system for land surface models. Geosci. Model Dev., 5, 819–827, doi:10.5194/gmd-5-819-2012.

  • Abramowitz, G., R. Leuning, M. Clark, and A. Pitman, 2008: Evaluating the performance of land surface models. J. Climate, 21, 5468–5481, doi:10.1175/2008JCLI2378.1.

  • Baldocchi, D., and Coauthors, 2001: FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities. Bull. Amer. Meteor. Soc., 82, 2415–2434, doi:10.1175/1520-0477(2001)082<2415:FANTTS>2.3.CO;2.

  • Beck, M. B., and Coauthors, 2009: Grand challenges of the future for environmental modeling. White Paper, National Science Foundation, Arlington, VA, 135 pp. [Available online at http://www.ewp.rpi.edu/hartford/~ernesto/S2013/MMEES/Papers/ENVIRONMENT/1EnvironmentalSystemsModeling/Beck2009-nsfwhitepaper.pdf.]

  • Best, M. J., and Coauthors, 2011: The Joint UK Land Environment Simulator (JULES), model description. Part 1: Energy and water fluxes. Geosci. Model Dev., 4, 677–699, doi:10.5194/gmd-4-677-2011.

  • Best, M. J., and Coauthors, 2015: The plumbing of land surface models: Benchmarking model performance. J. Hydrometeor., 16, 1425–1442, doi:10.1175/JHM-D-14-0158.1.

  • Beven, K. J., and P. Young, 2013: A guide to good practice in modeling semantics for authors and referees. Water Resour. Res., 49, 5092–5098, doi:10.1002/wrcr.20393.

  • Blöschl, G., and M. Sivapalan, 1995: Scale issues in hydrological modelling: A review. Hydrol. Processes, 9, 251–290, doi:10.1002/hyp.3360090305.

  • Clark, M. P., D. Kavetski, and F. Fenicia, 2011: Pursuing the method of multiple working hypotheses for hydrological modeling. Water Resour. Res., 47, W09301, doi:10.1029/2010WR009827.

  • Clark, M. P., and Coauthors, 2015: A unified approach for process-based hydrologic modeling: 1. Modeling concept. Water Resour. Res., 51, 2498–2514, doi:10.1002/2015WR017198.

  • Cover, T. M., and J. A. Thomas, 1991: Elements of Information Theory. Wiley-Interscience, 726 pp.

  • Cybenko, G., 1989: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2, 303–314, doi:10.1007/BF02551274.

  • Draper, D., 1995: Assessment and propagation of model uncertainty. J. Roy. Stat. Soc., 57B, 45–97.

  • Edwards, A. F. W., 1984: Likelihood. Cambridge University Press, 243 pp.

  • Gong, W., H. V. Gupta, D. Yang, K. Sricharan, and A. O. Hero, 2013: Estimating epistemic and aleatory uncertainties during hydrologic modeling: An information theoretic approach. Water Resour. Res., 49, 2253–2273, doi:10.1002/wrcr.20161.

  • Gupta, H. V., and G. S. Nearing, 2014: Using models and data to learn: A systems theoretic perspective on the future of hydrological science. Water Resour. Res., 50, 5351–5359, doi:10.1002/2013WR015096.

  • Gupta, H. V., M. P. Clark, J. A. Vrugt, G. Abramowitz, and M. Ye, 2012: Towards a comprehensive assessment of model structural adequacy. Water Resour. Res., 48, W08301, doi:10.1029/2011WR011044.

  • Gupta, H. V., C. Perrin, R. Kumar, G. Blöschl, M. Clark, A. Montanari, and V. Andréassian, 2014: Large-sample hydrology: A need to balance depth with breadth. Hydrol. Earth Syst. Sci., 18, 463–477, doi:10.5194/hess-18-463-2014.

  • Hansen, M. C., R. S. DeFries, J. R. G. Townshend, and R. Sohlberg, 2000: Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens., 21, 1331–1364, doi:10.1080/014311600210209.

  • Jaynes, E. T., 2003: Probability Theory: The Logic of Science. Cambridge University Press, 727 pp.

  • Jung, M., M. Reichstein, and A. Bondeau, 2009: Towards global empirical upscaling of FLUXNET eddy covariance observations: Validation of a model tree ensemble approach using a biosphere model. Biogeosciences, 6, 2001–2013, doi:10.5194/bg-6-2001-2009.

  • Kavetski, D., G. Kuczera, and S. W. Franks, 2006: Bayesian analysis of input uncertainty in hydrological modeling: 2. Application. Water Resour. Res., 42, W03408, doi:10.1029/2005WR004376.

  • Keenan, T. F., E. Davidson, A. M. Moffat, W. Munger, and A. D. Richardson, 2012: Using model–data fusion to interpret past trends, and quantify uncertainties in future projections, of terrestrial ecosystem carbon cycling. Global Change Biol., 18, 2555–2569, doi:10.1111/j.1365-2486.2012.02684.x.

  • Kumar, S. V., and Coauthors, 2014: Assimilation of remotely sensed soil moisture and snow depth retrievals for drought estimation. J. Hydrometeor., 15, 2446–2469, doi:10.1175/JHM-D-13-0132.1.

  • Liu, Y. Q., and H. V. Gupta, 2007: Uncertainty in hydrologic modeling: Toward an integrated data assimilation framework. Water Resour. Res., 43, W07401, doi:10.1029/2006WR005756.

  • Liu, Y. Q., and Coauthors, 2011: The contributions of precipitation and soil moisture observations to the skill of soil moisture estimates in a land data assimilation system. J. Hydrometeor., 12, 750–765, doi:10.1175/JHM-D-10-05000.1.

  • Luo, Y. Q., and Coauthors, 2012: A framework for benchmarking land models. Biogeosciences, 9, 3857–3874, doi:10.5194/bg-9-3857-2012.

  • Mo, K. C., L. N. Long, Y. Xia, S. K. Yang, J. E. Schemm, and M. Ek, 2011: Drought indices based on the climate forecast system reanalysis and ensemble NLDAS. J. Hydrometeor., 12, 181–205, doi:10.1175/2010JHM1310.1.

  • Montanari, A., and D. Koutsoyiannis, 2012: A blueprint for process-based modeling of uncertain hydrological systems. Water Resour. Res., 48, W09555, doi:10.1029/2011WR011412.

  • Neal, R. M., 1993: Probabilistic inference using Markov chain Monte Carlo methods. Tech. Rep. CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 144 pp. [Available online at http://www.cs.toronto.edu/~radford/ftp/review.pdf.]

  • Nearing, G. S., and H. V. Gupta, 2015: The quantity and quality of information in hydrologic models. Water Resour. Res., 51, 524–538, doi:10.1002/2014WR015895.

  • Nearing, G. S., H. V. Gupta, and W. T. Crow, 2013: Information loss in approximately Bayesian estimation techniques: A comparison of generative and discriminative approaches to estimating agricultural productivity. J. Hydrol., 507, 163–173, doi:10.1016/j.jhydrol.2013.10.029.

  • Oberkampf, W. L., S. M. DeLand, B. M. Rutherford, K. V. Diegert, and K. F. Alvin, 2002: Error and uncertainty in modeling and simulation. Reliab. Eng. Syst. Saf., 75, 333–357, doi:10.1016/S0951-8320(01)00120-X.

  • Paninski, L., 2003: Estimation of entropy and mutual information. Neural Comput., 15, 1191–1253, doi:10.1162/089976603321780272.

  • Peters-Lidard, C. D., S. V. Kumar, D. M. Mocko, and Y. Tian, 2011: Estimating evapotranspiration with land data assimilation systems. Hydrol. Processes, 25, 3979–3992, doi:10.1002/hyp.8387.

  • Poulin, A., F. Brissette, R. Leconte, R. Arsenault, and J.-S. Malo, 2011: Uncertainty of hydrological modelling in climate change impact studies in a Canadian, snow-dominated river basin. J. Hydrol., 409, 626–636, doi:10.1016/j.jhydrol.2011.08.057.

  • Rasmussen, C., and C. Williams, 2006: Gaussian Processes for Machine Learning. MIT Press, 248 pp.

  • Schöniger, A., T. Wöhling, and W. Nowak, 2015: A statistical concept to assess the uncertainty in Bayesian model weights and its impact on model ranking. Water Resour. Res., 51, 7524–7546, doi:10.1002/2015WR016918.

  • Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, doi:10.1002/j.1538-7305.1948.tb01338.x.

  • Snelson, E., and Z. Ghahramani, 2006: Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds., Neural Information Processing Systems, 1257–1264. [Available online at http://papers.nips.cc/paper/2857-sparse-gaussian-processes-using-pseudo-inputs.]

  • van den Hurk, B., M. Best, P. Dirmeyer, A. Pitman, J. Polcher, and J. Santanello, 2011: Acceleration of land surface model development over a decade of GLASS. Bull. Amer. Meteor. Soc., 92, 1593–1600, doi:10.1175/BAMS-D-11-00007.1.

  • Wand, M. P., and M. C. Jones, 1994: Kernel Smoothing. CRC Press, 212 pp.

  • Weijs, S. V., G. Schoups, and N. van de Giesen, 2010: Why hydrological predictions should be evaluated using information theory. Hydrol. Earth Syst. Sci., 14, 2545–2558, doi:10.5194/hess-14-2545-2010.

  • Wilby, R. L., and I. Harris, 2006: A framework for assessing uncertainties in climate change impacts: Low-flow scenarios for the River Thames, UK. Water Resour. Res., 42, W02419, doi:10.1029/2005WR004065.

  • Xia, Y., and Coauthors, 2012a: Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 (NLDAS-2): 1. Intercomparison and application of model products. J. Geophys. Res., 117, D03109, doi:10.1029/2011JD016051.

  • Xia, Y., and Coauthors, 2012b: Continental-scale water and energy flux analysis and validation for North American Land Data Assimilation System project phase 2 (NLDAS-2): 2. Validation of model-simulated streamflow. J. Geophys. Res., 117, D03110, doi:10.1029/2011JD016051.

  • Xia, Y., J. Sheffield, M. B. Ek, J. Dong, N. Chaney, H. Wei, J. Meng, and E. F. Wood, 2014: Evaluation of multi-model simulated soil moisture in NLDAS-2. J. Hydrol., 512, 107–125, doi:10.1016/j.jhydrol.2014.02.027.

  • Xia, Y., M. T. Hobbins, Q. Mu, and M. B. Ek, 2015: Evaluation of NLDAS-2 evapotranspiration against tower flux site observations. Hydrol. Processes, 29, 1757–1771, doi:10.1002/hyp.10299.

  • Ziv, J., and M. Zakai, 1973: On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory, 19, 275–283, doi:10.1109/TIT.1973.1055015.

1 Contrary to the suggestion by Beven and Young (2013), we use the term “prediction” to mean a model estimate before it is compared with observation data for some form of hypothesis testing or model evaluation. This definition is consistent with the etymology of the word and is meaningful in the context of the scientific method.
