1. Introduction
The Protocol for the Analysis of Land Surface Models (PALS) Land Surface Model Benchmarking Evaluation Project (PLUMBER) benchmarking experiments by Best et al. (2015) showed that some of the world’s most sophisticated operational land models (CABLE, CH-TESSEL, COLA-SSiB, ISBA-SURFEX, JULES, Mosaic, Noah, ORCHIDEE) were outperformed in their ability to simulate short-term surface energy fluxes by simple regressions. Specifically, the PLUMBER experiments used piecewise-linear regressions trained on data from 20 globally distributed FLUXNET meteorological and eddy covariance towers to predict half-hourly latent and sensible heat fluxes given inputs of half-hourly near-surface air temperature, relative humidity, and shortwave radiation. The regressions were trained and evaluated using a leave-one-out approach, so that training data came from different FLUXNET sites than where the regression was evaluated. These out-of-sample, instantaneous regressions, extrapolated over globally distributed FLUXNET sites, generally yielded improved performance compared to predictions made by the land surface models according to several different metrics (mean bias error, normalized mean error, standard deviation, correlation coefficient), but the land models generally yielded better performance according to metrics that considered full probability distributions (5th and 95th percentiles, third and fourth statistical moments).
This type of result is not unique. Abramowitz et al. (2008) reached similar conclusions after benchmarking CLM and ORCHIDEE against linear and nonlinear regressions (again with no state memory), and Nearing et al. (2016) used benchmarking to show that Noah, SAC-SMA, Mosaic, and VIC each generally use less than half of the information available to them from parameters and boundary conditions. Results like these indicate that there is substantial potential to improve even our most sophisticated land models.
What these benchmarking studies lacked was any indication of where exactly information is lost among the different hydrometeorological processes simulated by the land models. There is certainly some intuition that model benchmarking should help us understand how to improve models; however, to our knowledge, no methodology has been proposed that allows us to do this systematically. For example, Luo et al. (2012) outlined a set of desiderata to bridge the gap between benchmarking and diagnostics, but did not offer any methods to implement those desiderata.
The primary purpose of this paper is to suggest how model intercomparison studies might be designed to facilitate more diagnostic results. We do this by first formalizing a quantitative theory of model benchmarking grounded in information theory and then extending the conceptual underpinnings of that theory to the problem of model diagnostics. We demonstrate the proposed theory and methodology using data from the aforementioned PLUMBER study, and use this demonstration to suggest some specific improvements that might be adopted in the experimental design of future benchmarking and intercomparison studies. In particular, the PLUMBER study did not distinguish between parameter error and model structural error. This could be remedied by simply reporting the parameter values used for each model and then applying an extension of the information-theoretic method proposed here to disaggregate these two types of error (e.g., Nearing et al. 2016).
We want to be clear that the purpose of this paper is not to argue for using statistical, data-driven, or regression models in place of physically based or process-based models for operational forecasting of terrestrial hydrological systems. We do not advocate this because of the potential for nonstationarity: some type of mechanistic understanding of the system is necessary to predict under changing conditions (Milly et al. 2008). That being said, we cannot ignore the fact that regression models regularly outperform our best biogeophysical models (e.g., the benchmarking references above). The pertinent question we address is how benchmarking can guide the improvement of complex dynamical systems models that contain explicit representations of interacting biogeophysical processes.
We also want to be clear that we are not advocating rejecting land models if they fail to pass a particular benchmark or set of benchmarks. Just as the PLUMBER land models should not be falsified or rejected because they were outperformed by the linear benchmarks, those same land models should not be rejected because they do not outperform a benchmark here. Simplistic model rejection or falsification, whether by traditional statistical hypothesis testing or by other methods (e.g., Beven 2016; Liu et al. 2009), does not seem particularly helpful when dealing with large and complicated models. Instead, we view benchmarking as a tool to help guide a model's conceptual development and performance improvement, and our goal in this paper is to advance that objective.
The paper proceeds as follows. Section 2 briefly revisits the PLUMBER experiments, which will serve as an example for the rest of this paper. Section 3 proposes a more formal framework for model benchmarking using empirical models such that the benchmarking strategy used by Best et al. (2015) is a special case. This section also outlines a theory for process-level diagnostics that extrapolates that formalization. Section 4 applies several approximations of both the benchmarking and diagnostics theories to the same models and field data used by Best et al. (2015), first to clarify and extend those previous results, and then to look at specific process-level deficiencies in the models. The concluding discussion in section 5 speaks to how we see model benchmarking practice developing in the future, and lists opportunities to improve both models and model intercomparison projects.
2. The PLUMBER experiments
The PLUMBER observational dataset consists of half-hourly measurements of surface boundary conditions and eddy covariance surface energy fluxes from 20 globally distributed FLUXNET towers. In particular, model inputs consisted of half-hourly downwelling longwave and shortwave radiation, near-surface air temperature, humidity, wind speed, surface pressure, and precipitation, and the evaluation data consisted of eddy covariance estimates of the latent and sensible heat fluxes at the same towers.
The PLUMBER performance metrics were (i) bias, (ii) standard deviation ratio (model over observations), (iii) linear correlation coefficient, and (iv) normalized mean squared error. These were all calculated on time-step (half-hourly) sensible and latent heat fluxes (Qh and Qe, respectively).
The PLUMBER reference values for these metrics were determined by two simple conceptual models and three statistical models. The conceptual benchmarks were (i) Penman–Monteith (i.e., the assumption that evapotranspiration occurs at the potential rate) and (ii) a Manabe-style bucket model (Manabe 1969). The statistical benchmarks were the out-of-sample regressions described in section 1: a linear regression on downwelling shortwave radiation, a two-variable linear regression on shortwave radiation and air temperature, and a piecewise-linear regression on shortwave radiation, air temperature, and relative humidity.
Best et al. (2015) applied the above benchmarks to test the land models that contribute to most of the major weather and climate forecasting centers in the United States, Europe, and Australia. For a complete list and description of these models, the reader is referred to Table 2 in Best et al. (2015). They found that, on average across the 20 sites, the half-hourly latent and sensible heat flux simulations from the land models were outperformed by the out-of-sample statistical benchmarks according to most of the metrics listed above.
3. Theory
a. Benchmarking
1) What is benchmarking?
Best et al. (2015) recognized three approaches that the land modeling community has generally used to understand model consistency, accuracy, and precision. They defined model evaluation as a process in which model outputs are compared against observations, model intercomparison as a process in which multiple models are applied to specific test cases, and model benchmarking as a process in which model outputs are compared against a priori expectations of model performance. Our first purpose here is to formalize the concept of model benchmarking.
A benchmark consists of three distinct elements: (i) a specific observational dataset, (ii) a particular performance metric defined over that dataset, and (iii) a specific criterion or reference value for that performance metric. The point of benchmarking is to define a priori objectives and/or adequacy criteria for model performance (Gupta et al. 2012).
The main challenge when evaluating a complex systems model is to understand (and quantify) how well a model can be expected to perform given that all experiments contain some amount of randomness and uncertainty. That is, even a hypothetically perfect model will not exactly reproduce any given set of observations, simply because of uncertainty in both the measured input data used to drive the model and the measured response data (Beven 2016). Our goal is therefore to develop a strategy for testing models that leads reliably toward minimizing the component of predictive error associated with the model itself, and to do this in the presence of predictive error due to uncertainty in experimental input and response data. Ideally, the benchmark criteria should be related to data uncertainty in a way that allows us to separate data uncertainty from model uncertainty. Our approach is to have the benchmark reference value estimate the information content of the (imperfect) model input data relative to the (imperfect) measured response data. Gong et al. (2013) laid out a general theory for dealing with this problem that we adapt here.
2) Benchmark criteria
We must choose all three components of a benchmark (observation data, performance metric, and reference value), and again, the major challenge is to quantify model error in the presence of data uncertainty. The idea is that if we knew how accurate a “perfect” model would be over a given set of experimental data, then we could measure how far away any given hypothesis-driven model is from this ideal. We want to quantify information loss due to inevitable model error.
We will rephrase the question slightly: instead of asking how far away we are from a perfect model, we will ask whether there is any measurable information in the experimental input/response data about the relationship between inputs and outputs that is not captured by the model. This is the question that the PLUMBER experiments addressed.
As Nearing and Gupta (2015) discussed, this approach requires that our benchmark criteria come from a purely data-driven model, simply because our goal is to isolate data uncertainty from model uncertainty. We always have many options for data-driven models, and this choice will affect benchmarking results. For example, the standard Student's t test can be viewed as benchmarking against a parametric (Gaussian) distribution with a sample variance parameter. Along with the chosen observation data and performance metric, the data-driven model determines the benchmark reference value, and so we want a data-driven model that imposes as few a priori assumptions as possible about the functional form of the system; that is, we want the benchmark to be as close to nonparametric as we can make it.
One general strategy for developing such data-driven models is to rely on superposition theorems. For example, Hornik (1991) showed that an appropriate superposition of sigmoidal basis functions can approximate any smooth function arbitrarily well. This means that, in principle, a typical single-layer, feed-forward neural network will converge toward extracting all information in experimental observations that is about smooth functional relationships between model inputs and outputs. Convergence is difficult to demonstrate in practice, and while this is a very large class of potential relationships, it does not include all possible (e.g., discontinuous or nondifferentiable) relationships. It is, however, a very good start toward our nonparametric objective, and more sophisticated, perhaps layered, superpositions (e.g., deep learning) can be expected to improve on the first-order approach outlined here.
Importantly, the question we want to ask is: how much information about this particular type of systematic relationship is contained in our experimental data? Once we know the answer to this question, we can ask what portion of that information a particular model is able to reproduce. The basic strategy is to ensure that any process model we might build contains at least as much information about the systematic (nonrandom) relationships between controls and responses as we can extract directly from our experimental data.
3) Benchmark metrics
So far, we have outlined a general strategy for setting benchmark reference values when the goal is to separate data uncertainty from model uncertainty, which we claim is at the core of the challenge of testing models. Benchmarking against nonparametric regressions allows us to test a hypothesis-driven model in the absence of any a priori assumptions about the functional dynamics or process behavior of the system, and to do this in the presence of (arbitrary) experimental randomness.
Nearing and Gupta (2015) showed that certain standard evaluation measures (additive bias, linear correlation, and mean squared error) are special cases of metrics that measure a statistical divergence between a distribution of model predictions and a distribution of evaluation observations.
There is one class of divergence metrics that is especially relevant to benchmarking: the class of measures introduced by Shannon (1948). In particular, Shannon outlined a set of very basic requirements for a theory of information and then demonstrated that entropy-based measures, and divergences built on them such as mutual information, are the only measures that satisfy those requirements.
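Concretely, writing the observed response variable as Y and a corresponding prediction as X, the discrete Shannon entropy and mutual information take the standard forms (the notation here is generic rather than that of the numbered equations):

H(Y) = -\sum_{y} p(y) \log p(y), \qquad I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = H(Y) - H(Y \mid X).

The normalized information metrics used in section 4 and appendix B are ratios of a mutual information of this form to the entropy of the evaluation data, so that they are bounded by zero and one.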
To reiterate, the predicted variable is evaluated according to the information that model simulations provide about the corresponding evaluation observations, and the benchmark reference value is the amount of such information that a data-driven model can extract directly from the experimental data.
The above discussion assumes that our objectives for evaluating a model can be expressed in terms of its ability to predict particular observed response variables; what remains is to choose the observation data over which the benchmark is defined.
4) Benchmark data
It is, of course, necessary to choose benchmark data that are as representative as possible of the various situations where we expect our model to be applicable. Again, with reference to the nonstationarity argument above, the purpose of a model is to produce emergent behavior from process descriptions that are themselves stationary, and there may be both differences and similarities in the functional dynamics of a particular set of biogeophysical processes at different locations (different climates and biomes). This means that we may ask two distinct benchmarking questions. The first is about the extent to which a particular hypothesis-driven model is able to represent the behavior of any particular dynamical system, and the second is about the extent to which a particular model is able to represent those aspects of behavior that are consistent across many different dynamical systems. Thus, we advocate a data collection strategy that emphasizes both breadth and depth (Gupta et al. 2014).
In the above discussion, we said that we ultimately want models that are able to reproduce all of the information contained in the experimental data about patterns between observed control (input) and response (output) variables. The way we advocated approaching this is to create nonparametric, data-driven models that extract as much of that information as possible and to use the performance of those models as benchmark reference values.
The PLUMBER benchmarking experiments used a split record approach. They took data from 20 FLUXNET sites and used a leave-one-out approach on the site locations to train their regressions. By comparing their models against benchmarks trained on data from different sites, they effectively asked whether their models could reproduce those portions of the patterns in the FLUXNET input/output data (input data were meteorological forcings, and output data were surface energy fluxes) that were consistent between sites. This means that, effectively, they assessed the ability of their models to reproduce the globally consistent components of surface energy balance, and they found that land models often could not reproduce the linearized component of this globally stationary portion of the signal. This is only one of the two benchmarking questions mentioned in the first paragraph of this subsection. We might also ask about the site-specific aspects of the surface energy balance (as measured by FLUXNET). To do the latter we would train separate data-driven models at each site. We will apply and compare both of these benchmarking approaches in section 4a.
b. Process diagnostics
Our next objective is to obtain some type of insight about any deficiencies in our model(s). To state this another way, we not only want to understand predictive accuracy (or information content of model predictions), but we also want to make sure that our models “get the right answers for the right reasons” (Kirchner 2006). Here we suggest doing this by looking at the dynamics of information transfer within the model, in addition to looking at predictive skill. We outline a general conceptual framework for quantifying interactions and feedbacks in complex systems.
To do this, we will treat dynamical systems and their models as information flow dynamical process networks (DPNs or iDPNs). DPNs have been applied to infer process structure, feedback, and nonstationarity in ecohydrological (Kumar and Ruddell 2010) and ecoclimatological (Ruddell et al. 2016) systems based on field observations. An example illustration of a DPN (for a FLUXNET tower) is given in Fig. 1. Each node in a DPN represents a particular modeled or observed variable, and each edge represents a directed influence between variables. Even if all equations in the model are deterministic, each variable is still probabilistic when conditioned on only a subset of the other variables that determine its value at any point during the simulation.
A dynamical process network representing variables measured at FLUXNET sites. Information is transferred from meteorological boundary conditions to modeled variables (Qe, Qh, and NEE) at some characteristic time scale (lag).
Our diagnostic approach will be to quantify the influence that each variable has on all others in the observed system and then to see whether our hypothesis-driven model reliably simulates these partially informative relationships. That is, we will measure information transfers within the model and within the ecohydrological system to identify particular relationships in the land surface model that are either overconstrained or underconstrained. If a model simulates a smaller-than-observed transfer of information from one variable to another, then that modeled pathway or relationship is underconstrained, and vice versa.
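Our measure of directed information transfer is the transfer entropy (Schreiber 2000). In a common single-lag form (consistent with the dynamical process network applications of Ruddell and Kumar 2009), the transfer entropy from a source variable X to a target variable Y at lag \tau is the mutual information between the current value of Y and the lagged value of X, conditional on the lagged value of Y:

T_{X \to Y}(\tau) = I\left(Y_t ; X_{t-\tau} \mid Y_{t-\tau}\right) = \sum p(y_t, x_{t-\tau}, y_{t-\tau}) \log \frac{p(y_t \mid x_{t-\tau}, y_{t-\tau})}{p(y_t \mid y_{t-\tau})}.

Conditioning on the history of the target variable removes information that is shared only because of the target's own persistence, so that the transfer entropy isolates the directed influence of X on Y.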
One major challenge with any type of process-level diagnostics is that there are typically insufficient observation data about each individual simulated variable. It is possible that this could be dealt with by using data assimilation (Bulygina and Gupta 2011), but this is beyond the scope of the current paper. Our PLUMBER reanalysis in section 4 follows Ruddell and Kumar (2009) in assuming that the system is adequately identified by readily available FLUXNET variables, rather than attempting to identify the correct system boundary and resolution.
To calculate information flow diagnostic benchmarks, a custom computer code was written in MATLAB. This code is available at https://github.com/greyNearing/plumber_diagnostics.git. The code was validated by comparison with a community code, ProcessNetwork version 1.5 (Ruddell 2016), and found to be in good agreement.
4. Reanalysis of the PLUMBER experiments
We will now return to the PLUMBER experiments and apply the theory outlined in section 3. Before presenting the details of these experiments in sections 4a–c, we will briefly outline the primary results.
Figure 2 reports out-of-sample benchmarks that measure the overall ability of the PLUMBER land models to use the information in meteorological inputs for estimating half-hourly surface energy balance components. Figure 3 applies a highly simplified version of the theory outlined in section 3 to benchmark long-term water balances rather than the short-term energy balances shown in Fig. 2; this Budyko-type analysis illustrates a form of benchmarking that is useful when insufficient data are available to implement the full information-theoretic procedure. Figure 4 summarizes the primary process-diagnostics results and compares these against the benchmarking results. Figure 5 looks specifically at process-level differences between land models in partitioning net radiation. Figure 6 shows all of the process-level diagnostic results for all of the land models.
Comparison of half-hourly surface flux predictions made by the PLUMBER models (colored lines) against out-of-sample benchmarks derived from theoretically convergent (sigmoidal) regressions with time-lagged input at the 20 sites (bars). The two bars represent local and global normalized mutual information benchmark metrics according to Eq. (3d). Both benchmarks are calculated using a histogram bin resolution of 1% of the range of observed data across all FLUXNET sites.
MAEs of models and benchmarks for simulating multiyear evaporative fractions.
Relationships between information missing from land models, as measured by information-theoretic benchmarking (section 3a), and differences in directed information transfers between pairs of observed variables vs between pairs of modeled variables [Eq. (6)]. The directed relationship over which differences in transfer entropy are calculated (x axis) is indicated in the title of each subplot, and the missing information (y axis) is about the target variable of that information transfer (Qe, Qh, or NEE).
This figure shows—for each transfer entropy pathway illustrated in Fig. 4—the mean distance to center of mass due to clustering by model vs by site. The models show clear site-by-site clustering but do not show clear model-by-model clustering, meaning that the models all exhibit generally similar error structures at each individual site. All variables are defined in the legend in Fig. 1.
Model-specific differences between modeled and observed transfer entropies along each of the directed pathways illustrated in Fig. 1, grouped by model.
a. Local versus global PLUMBER benchmarks
The goal of benchmarking, as outlined in section 3a(1), is to set a priori expectations on model performance, given error in experimental input and response data, that are free of any conceptual understanding about the modeled system. So we first trained a set of nonparametric regressions that estimate the information content of the FLUXNET meteorological input data about the FLUXNET response data, which here are half-hourly surface latent and sensible heat fluxes. Then we asked whether the hypothesis-driven land models (CABLE, CH-TESSEL, COLA-SSiB, ISBA-SURFEX, JULES, Mosaic, Noah, ORCHIDEE) each provide more or less information about this relationship than we were able to extract directly from observation data [Eq. (5)].
The exact method that we used for constructing our benchmarks is given in appendix A. It is, of course, an approximation of the philosophy outlined in section 3, but it is coherent in the presence of arbitrary uncertainties and relies on explicit, well-understood approximations. Each sigmoidal regression used lagged meteorological data to account for the conservation-related memory of the land surface, and we used an empirical analysis to ensure that our benchmark regressions were not overfitted.
The first set of benchmarks that we considered were analogous to the PLUMBER benchmarks in that they were trained at out-of-sample FLUXNET sites. We trained single-layer, feed-forward artificial neural networks on each permutation of 19 out of 20 FLUXNET sites using time-lagged input data, and then used each of these 20 out-of-sample regressions to make surface energy flux predictions at the single excluded site. These benchmarks tell us something about the ability of the land models to represent only the globally stationary components of the forcing-response relationships at the land surface.
We also want process-based land models to represent site-specific components of the forcing-response relationships. In practice, spatial heterogeneity is represented in these models by parameter maps that reflect characteristics of soils, vegetation, topography, etc. PLUMBER did not report the parameter values used by each model, so we cannot directly assess information loss due specifically to model parameters, as was done by Nearing et al. (2016). However, we can assess the ability of the parameterized model (Gupta and Nearing 2014) to capture spatial heterogeneity in system processes by performing site-specific benchmarking. In this case, benchmark regressions are trained at each site individually using a split-record approach: we used k-fold cross validation at each site, so that each benchmark prediction was made by a regression trained on data from the same site but from different time periods.
We will refer to the first set of benchmarks (trained using a leave-one-out procedure over the 20 FLUXNET sites) as global benchmarks and the second set (trained using k-fold validation at each site) as local benchmarks. Even if our regression models were perfect at extracting information from the observation data, the global benchmarks could still be beaten by an imperfect but locally parameterized hypothesis-driven model, whereas perfect local benchmarks could not be beaten by even a hypothetically perfect hypothesis-driven model, because of the data processing inequality. Appendix B gives the technical details of how the normalized information metrics [Eqs. (3c) and (3d)] were calculated, and Fig. 2 compares the information content of the PLUMBER models against these benchmarks.
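As a concrete illustration of how the two sets of benchmarks differ, the following Python sketch assembles global (leave-one-site-out) and local (within-site k-fold) training sets. It is illustrative only: the function train_benchmark stands in for the regression fitting described in appendix A, the data structures are hypothetical, and the number of folds is arbitrary rather than the value used in our analysis.

# Illustrative sketch: global vs. local benchmark training (not the analysis code).
# site_data maps each FLUXNET site name to (X, y) arrays of lagged inputs and observed fluxes;
# train_benchmark is a hypothetical stand-in for the regression fitting in appendix A.
import numpy as np
from sklearn.model_selection import KFold

def global_benchmarks(site_data, train_benchmark):
    """Leave one site out: train on the other 19 sites, predict at the withheld site."""
    predictions = {}
    for target_site, (X_test, _) in site_data.items():
        X_train = np.vstack([X for site, (X, y) in site_data.items() if site != target_site])
        y_train = np.concatenate([y for site, (X, y) in site_data.items() if site != target_site])
        predictions[target_site] = train_benchmark(X_train, y_train).predict(X_test)
    return predictions

def local_benchmarks(site_data, train_benchmark, n_folds=5):  # number of folds is illustrative
    """Split record within each site: train and predict on disjoint periods of the same record."""
    predictions = {}
    for site, (X, y) in site_data.items():
        y_hat = np.full(y.shape, np.nan)
        for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
            y_hat[test_idx] = train_benchmark(X[train_idx], y[train_idx]).predict(X[test_idx])
        predictions[site] = y_hat
    return predictions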
To understand the primary message of the results in Fig. 2, it is important to understand two things. First, the metrics we report are the ratio of discrete information in the model or benchmark to the total discrete entropy of the FLUXNET surface flux data (either Qe or Qh), so that a value of one indicates a model or benchmark that reproduces all of the information in the observed fluxes and a value of zero indicates one that reproduces none of it. Second, as explained above, the local benchmarks approximate an upper bound on the performance that any model could achieve given these input data, whereas the global benchmarks measure only the globally consistent portion of the forcing-response relationship and could in principle be beaten by a well-parameterized model.
Figure 2 shows that the land models evaluated by the PLUMBER experiments provide less than about half of the total information in FLUXNET observation data about the relationship between near-surface atmospheric forcings and surface energy fluxes. This benchmark fully accounts for all observation error, unless there is systematic measurement error that is stationary across all sites. These results also show that the land models generally produce about as much information as the global benchmarks. This means that the models are not capturing spatial process heterogeneity that exists between different ecoclimates. Again, it is impossible to tell whether this is because of parameter error or model structural error because the PLUMBER parameter values were not reported, but these two effects could be segregated using information-theoretic benchmarking by training a third set of regressions that act directly on model parameters (Nearing et al. 2016).
b. Benchmarking the PLUMBER water balance
In addition to simulating the surface energy balance, we are interested in understanding the ability of land models to simulate long-term water balances. We will use this benchmarking objective to motivate an example that demonstrates how to approximate the theory outlined in section 3a in cases where we may only have limited data.
Hydrological theory can generally explain at least some departures from the Turc–Pike curve in terms of the dominant controls on the water balance, including intraseasonal-to-interannual variability in precipitation and the controls of vegetation, topography, and soils on runoff generation and transpiration (Eagleson 1978; Milly 1994; Zhang et al. 1999). It is therefore reasonable to expect that sophisticated simulation models should also be able to predict departures from the Turc–Pike curve.
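For orientation, the Turc–Pike curve expresses a long-term evaporative ratio as a function of a dryness index \phi, the ratio of long-term potential evaporation E_p to precipitation P; in its standard form (written here in generic notation),

E / P = \left[ 1 + (P / E_p)^{2} \right]^{-1/2} = \frac{\phi}{\sqrt{1 + \phi^{2}}}, \qquad \phi = E_p / P,

where E_p may be estimated from long-term net radiation. Departures of the observed evaporative fractions from a curve of this form are what the regression described next is trained to predict.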
To assess this, we trained a regression model to act on model inputs, which here were taken to be long-term averages of net radiation (sum of short- and longwave radiation), air temperature, and cumulative precipitation, to predict departures from the climatological evaporative fraction calculated according to Eq. (9). In this case we only have observation data from 20 FLUXNET sites, each of which yields a single value for the dryness index and a single value for the evaporative fraction. Similarly, calculating the evaporative fraction from each set of land model output transforms the simulated data into a new random variable with a single value per site.
Figure 3 illustrates the mean absolute error (MAE) of evaporative fraction across FLUXNET sites for the various models and benchmarks. The spreads in this figure were obtained by computing error statistics over 1000 separate random samples drawn with replacement from the 20 FLUXNET sites. Both the Turc–Pike curve itself and the regression on departures from this curve outperform all of the physically based land models.
There are not enough data to develop reliable empirical probability distributions for Eqs. (3), and so we cannot estimate the fraction of information missing from the models relative to this benchmark criterion. Instead, we calculated the MAE over all models and benchmarks relative to the observed evaporative fractions (Fig. 3) and used a nonparametric bootstrap hypothesis test on the MAEs across the 20 sites. The out-of-sample regression benchmark had lower MAE on the evaporative fraction than any of the physically based land models.
Again, the point of this example is to illustrate that it is feasible to approximate the theory outlined above, even when data are limited. Interpretation of results depends, of course, on the exact method of approximation, but at least we have laid out an explicit theory to approximate (section 3a). What we would not want to do here, for example, is to use the regression models that were used to set the energy balance benchmark criteria in section 4a to assess questions about the ability of models to simulate long-term water balance. This is because those regressions do not measure the amount of information in model inputs related to the water balance.
c. Process diagnostics of PLUMBER models
Finally, we applied the process diagnostics approach outlined in section 3b to the FLUXNET data and PLUMBER models. In particular, we used Eq. (6) to calculate the transfer entropy from air temperature, shortwave radiation, net ecosystem exchange, and surface-layer soil moisture to the latent and sensible heat fluxes and to net ecosystem exchange. Transfer entropy was calculated on the time-step (half-hourly) data using a fixed lag between the source and target variables.
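The transfer entropy calculations for this paper were performed with the MATLAB code described in section 3b; purely for illustration, a minimal histogram-based estimator can be sketched in Python as follows (the lag and bin count below are arbitrary choices, not the values used in our analysis).

# Illustrative sketch of a histogram-based transfer entropy estimator (not the analysis code).
import numpy as np

def _entropy(counts):
    """Discrete (maximum likelihood) entropy, in bits, from a vector of histogram counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def _joint_entropy(*series):
    """Entropy of the joint distribution of one or more discretized series."""
    symbols = np.stack(series, axis=1)
    _, counts = np.unique(symbols, axis=0, return_counts=True)
    return _entropy(counts)

def transfer_entropy(x, y, lag=1, n_bins=100):
    """T_{X->Y}(lag) = H(Y_t, Y_lagged) + H(X_lagged, Y_lagged) - H(Y_lagged) - H(Y_t, X_lagged, Y_lagged)."""
    xd = np.digitize(x, np.linspace(x.min(), x.max(), n_bins))
    yd = np.digitize(y, np.linspace(y.min(), y.max(), n_bins))
    y_t, y_lagged, x_lagged = yd[lag:], yd[:-lag], xd[:-lag]
    return (_joint_entropy(y_t, y_lagged) + _joint_entropy(x_lagged, y_lagged)
            - _joint_entropy(y_lagged) - _joint_entropy(y_t, x_lagged, y_lagged))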
Figure 4 plots the differences between transfer entropies calculated over FLUXNET data and over PLUMBER model output at each FLUXNET site that reported soil moisture values (13 of the 20 sites) and for each land model. These differences are plotted against missing information about the target variable of each transfer (Qe, Qh, or NEE). The results appear to cluster by site rather than by model: at a given site, the different land models tend to make similar process-level information transfer errors.
To formalize this, the strength of site-related groupings versus model-related groupings was quantified by calculating the fractional reduction in the mean distance to cluster center obtained by clustering the results in each subplot of Fig. 4 by model versus clustering by site. Cluster centers for each of the model-specific and site-specific clusters illustrated in Fig. 4 were calculated using a Euclidean distance in the x–y space defined by (i) the difference between modeled and measured transfer entropies and (ii) the missing information of each model for each prognostic variable (Qe, Qh, or NEE). The resulting distances, summarized in Fig. 5, confirm that the results cluster by site but not by model.
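The grouping statistic itself is simple; an illustrative Python sketch (with hypothetical variable names, not the analysis code) is given below.

# Illustrative sketch of the cluster-compactness comparison (not the analysis code).
import numpy as np

def mean_distance_to_centers(points, labels):
    """Mean Euclidean distance from each point to the centroid of its labeled group.
    points: (N, 2) array of (transfer entropy difference, missing information) pairs;
    labels: length-N array of group labels (model names or site names)."""
    distances = []
    for label in np.unique(labels):
        group = points[labels == label]
        distances.append(np.linalg.norm(group - group.mean(axis=0), axis=1))
    return np.concatenate(distances).mean()

# One way to express the fractional reduction in spread from clustering by site vs. by model:
# reduction = 1.0 - (mean_distance_to_centers(points, site_labels) /
#                    mean_distance_to_centers(points, model_labels))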
The main exception involves the partitioning of incoming radiation into turbulent heat fluxes: Fig. 6 shows that the Noah models, and to a lesser extent JULES, exhibit a bias in the information transferred from shortwave radiation to the surface heat fluxes that is consistent across sites, whereas the other models do not show this kind of model-specific bias.
Understanding the causes of this model-specific bias will require more data than were collected as part of the PLUMBER study. In particular, it will be necessary in the future for each of the modeling groups that contributes to a model intercomparison project to report its parameter values. It is impossible from the data provided to the PLUMBER study to know whether the energy partitioning biases in Noah and JULES are due to poor parameter values or to specific problems with model physics. All of these models are based on Penman-style evapotranspiration equations, and both Noah and Mosaic used climatological vegetation. Stomatal conductance parameterizations in the Noah, Mosaic, and CH-TESSEL models use a Jarvis-type stomatal resistance, while most other models, including JULES, CABLE, and ISBA-SURFEX, use a Ball–Berry–Leuning model for stomatal conductance. Given the similarity in parameterization schemes between process-biased and process-unbiased models, and the dissimilarity between Noah and JULES, it is at least possible that the biases in the Noah and JULES models are due to specific parameter values rather than to model physics.
d. Summary of results
The results outlined above related to PLUMBER models and data can be summarized as follows:
Figure 2 shows that the PLUMBER land models use less than half of the information about half-hourly surface energy fluxes that is available to them from the meteorological forcing data as measured by FLUXNET. This figure also shows that about one-half of the information in FLUXNET forcing-response data is due to locally specific patterns of behavior that are not shared across sites.
Figure 3 shows that the PLUMBER models underutilize information in meteorological forcing data about long-term (multiyear) water balances. However, due to a lack of data at this time scale, we can only estimate missing information using a parametric error statistic (here, the MAE).
Figures 4 and 5 show that all of the PLUMBER models have quantitatively similar process-level information transfer biases, but that these biases differ between different ecoclimates.
Figure 6 shows that some of the PLUMBER models (especially the Noah models) have consistent bias across all sites in the shortwave energy partitioning relationships, while the other models do not.
5. Conclusions and outlook
Benchmarking is often used to assess model performance against baseline criteria—for example, Best et al. (2015) used linearized benchmark criteria under the perspective that complex land simulation models should capture at least some of the nonlinearity of hydrometeorological systems. Similarly, benchmarks related to persistence and climatology are often used to assess numerical weather prediction models and forecasts, because these are simple criteria that a competent model should be able to beat.
We take a different perspective here, one that is developed around a more general philosophy. Our proposal is that model evaluation, and hypothesis testing in general, should consider model performance over a given set of experimental data in the context of the inherent information content of those data. Even a perfect model cannot predict more accurately than noisy input and response data allow. We are also not interested in using benchmarking to reject complex systems models, but rather to guide their improvement.
Our benchmarking approach is therefore designed to do two things: 1) help assess model performance in the context of experimental observations with unknown and arbitrary data uncertainty and 2) connect model evaluation and benchmarking with process-level model diagnostics. Under the proposed theory, benchmarking allows us to partition uncertainty between data error and model error, and separating these two things allows us to judge model performance independent of any error in the observation data. This is probably the intuition behind the benchmarking approach used by Best et al. (2015) and Abramowitz (2005), and our contribution is to take steps toward formalizing this intuition. We further propose that an information-theoretic perspective on model benchmarking suggests at least an informal link between benchmarking and process diagnostics, and we suspect that it will be possible to formalize this relationship by deriving an aggregation relationship that relates process-specific transfer entropy metrics within the model directly to the holistic mutual information metrics used in section 3.
Empirical results from the application of our benchmarking theory to the PLUMBER models and data support the primary conclusions by Best et al. (2015) that modern land models do not take full advantage of the information content of input data. Our specific experimental conclusions about the PLUMBER models are summarized in section 4d. We agree with the Best et al. (2015) conclusions and expand on their findings by showing how results like theirs can be exploited to guide model development.
Our methods and results also provide some insight into how we might improve the design of future model intercomparison experiments. The Best et al. (2015) PLUMBER experiments did not collect sufficient data about the various models and simulations to enable highly detailed process-level model diagnostics. The first suggestion from our analysis would be that modelers should report their parameter values, so that benchmarking can be used to segregate the effects of model parameters from model structure (e.g., Nearing et al. 2016). However, if the objective is to use model benchmarking or intercomparison projects to inform continued model development, then we would ideally want full posterior distributions over model parameters, so that parameter effects could be marginalized out in the integrations used to calculate information metrics. This would allow us to make quantitative statements directly about model processes in the presence of parameter error, and would require that future benchmarking studies adopt (for example) a Markov chain Monte Carlo approach to dealing with model parameters.
A second suggestion for future model intercomparison studies is that it would be very helpful to have a succinct description of the process-level differences between the participating land surface models. This would be an invaluable resource for qualitatively tracing diagnostic signatures to differences between the process parameterizations used in modern land models. One way to approach this might be to host a collaborative workshop where developers of the various land models work together to produce a systematized report of the process-level differences between their models. Tools like the Structure for Unifying Multiple Modeling Alternatives (SUMMA; Clark et al. 2015) are process flexible in the sense that they can implement several of the more common process-specific parameterizations used in land models, and it might be worthwhile to focus a concerted effort on collecting, in SUMMA, functional equivalents of the major process descriptions used in the various PLUMBER models, so that individual processes can be tested in a common computational framework. Both of these efforts would require significant buy-in from the developers of at least several of the modern land models. To derive significant scientific value from model intercomparison experiments, it will be necessary to start by outlining a formal theory and methodology of model benchmarking and diagnostics and then to design an intercomparison protocol around that theory.
Acknowledgments
This work was partially supported by the National Science Foundation under Grant EF-1241960 through Northern Arizona University, and by the NASA Earth Science Technology Office Advanced Information Systems Technology program through NCAR, NASA, and the University of Washington. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. All code and data used for this project are publicly available at https://github.com/greyNearing/plumber_diagnostics.git.
APPENDIX A
Constructing PLUMBER Benchmark Criteria
Our PLUMBER benchmark criteria were set by single-layer, feed-forward neural networks that projected time-lagged meteorological inputs onto FLUXNET eddy covariance estimates of half-hourly surface latent and sensible heat fluxes. The regression inputs were hour of day, wind speed, and current and time-lagged values of the remaining meteorological forcing variables described in section 2.
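A minimal sketch of this type of benchmark regression, written in Python purely for illustration (the number of hidden nodes, the number of lags, and the optimizer settings below are arbitrary choices, and the hour-of-day and wind speed inputs are omitted for brevity), is given below.

# Illustrative sketch of a single-hidden-layer sigmoidal benchmark regression (not the analysis code).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def add_lags(forcing, n_lags):
    """Stack current and time-lagged copies of the forcing matrix (n_times, n_variables)."""
    lagged = [forcing[n_lags - k : forcing.shape[0] - k] for k in range(n_lags + 1)]
    return np.hstack(lagged)

def train_benchmark(forcing, fluxes, n_lags=4, n_hidden=30):  # lag count and network size are illustrative
    """Fit a feed-forward network with one sigmoidal hidden layer to predict fluxes from lagged forcings."""
    X = add_lags(forcing, n_lags)
    y = fluxes[n_lags:]
    model = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="logistic", max_iter=2000))
    model.fit(X, y)
    return model

A regression of this form can then be used in the leave-one-site-out and within-site cross-validation procedures sketched in section 4a.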
To ensure that the regressions were not overfitted, we pulled 600 000 data points—30 000 from each of the 20 FLUXNET sites—and then randomly sampled a fraction of those data points to train various neural networks. Performance statistics were calculated over the training data and also over the remaining portion of the 600 000 samples that were not used for training. We repeated this procedure 10 times for a number of different training sample sizes, and Fig. A1 reports the mean and two standard deviations of the normalized mutual information [Eq. (3d)] over these resampling tests as a function of the number of training data. The important takeaway from this figure is that the performance of these regressions over training and test data converge to within a few percent of the same information content at around 11 000 samples. This does not guarantee that such regressions account for all of the information about functional relationships contained in these data, but it does mean that our models are not overestimating information in data due to overfitting.
Convergence of neural network training and test performance statistics [normalized mutual information from Eq. (3d)] as a function of the number of training samples. Training sets were chosen randomly from a total of 600 000 data points sampled randomly from 20 FLUXNET sites, and test sample statistics were calculated on the remainder that were not used for training. Error bars show two standard deviations calculated from 10 random samples of training data at each sample size. Both the mutual information ratio and the more traditional correlation coefficient are stable at around 10 000–15 000 training samples, indicating that this number of training data is likely sufficient to avoid overfitting the benchmark regressions.
APPENDIX B
Calculating Information Metrics
The information ratios that we report are the ratio of maximum likelihood estimators (Paninski 2003) of the discrete mutual information between the model predictions and the evaluation data over the discrete entropy of the evaluation data. We use discrete (i.e., histogram) estimators to ensure that these ratios are bounded both below (by zero) and above (by one). This means that we have to discretize the model outputs and the evaluation data, and the precision of this discretization will affect the mutual information and entropy statistics, as well as their ratios. For the remainder of this paper we report information ratios calculated using discretizations with precision equal to 1% of the total observed range of each type of observation data (Qe, Qh, and NEE) over all 20 FLUXNET sites. We use the same discretization for model outputs as for observation data.
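For illustration, this calculation can be sketched in Python as follows (the published results were produced with the code linked in the acknowledgments; the logarithm base cancels in the ratio and so does not affect the result).

# Illustrative sketch of the normalized mutual information calculation (not the analysis code).
import numpy as np

def information_ratio(y_model, y_obs, bin_fraction=0.01, obs_range=None):
    """Ratio of discrete mutual information I(y_model; y_obs) to discrete entropy H(y_obs).
    Bin width is a fixed fraction (1% by default) of the observed range, and the same
    discretization is applied to model output and observations."""
    lo, hi = obs_range if obs_range is not None else (y_obs.min(), y_obs.max())
    width = bin_fraction * (hi - lo)
    edges = np.arange(lo, hi + width, width)
    joint, _, _ = np.histogram2d(np.clip(y_model, lo, hi), np.clip(y_obs, lo, hi), bins=[edges, edges])
    p_joint = joint / joint.sum()
    p_model, p_obs = p_joint.sum(axis=1), p_joint.sum(axis=0)
    nonzero = p_joint > 0
    mutual_info = np.sum(p_joint[nonzero] * np.log(p_joint[nonzero] / np.outer(p_model, p_obs)[nonzero]))
    entropy_obs = -np.sum(p_obs[p_obs > 0] * np.log(p_obs[p_obs > 0]))
    return mutual_info / entropy_obs

When obs_range is set to the range of a given flux across all 20 sites, the bin width matches the 1% discretization described above.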
REFERENCES
Abramowitz, G., 2005: Towards a benchmark for land surface models. Geophys. Res. Lett., 32, L22702, https://doi.org/10.1029/2005GL024419.
Abramowitz, G., 2012: Towards a public, standardized, diagnostic benchmarking system for land surface models. Geosci. Model Dev., 5, 819–827, https://doi.org/10.5194/gmd-5-819-2012.
Abramowitz, G., A. Pitman, H. Gupta, E. Kowalczyk, and Y. Wang, 2007: Systematic bias in land surface models. J. Hydrometeor., 8, 989–1001, https://doi.org/10.1175/JHM628.1.
Abramowitz, G., R. Leuning, M. Clark, and A. Pitman, 2008: Evaluating the performance of land surface models. J. Climate, 21, 5468–5481, https://doi.org/10.1175/2008JCLI2378.1.
Best, M. J., and Coauthors, 2015: The plumbing of land surface models: benchmarking model performance. J. Hydrometeor., 16, 1425–1442, https://doi.org/10.1175/JHM-D-14-0158.1.
Beven, K. J., 2016: Facets of uncertainty: Epistemic error, non-stationarity, likelihood, hypothesis testing, and communication. Hydrol. Sci. J., 61, 1652–1665, https://doi.org/10.1080/02626667.2015.1031761.
Bulygina, N., and H. Gupta, 2011: Correcting the mathematical structure of a hydrological model via Bayesian data assimilation. Water Resour. Res., 47, W05514, https://doi.org/10.1029/2010WR009614.
Clark, M. P., and Coauthors, 2015: The structure for unifying multiple modeling alternatives (SUMMA), version 1.0: Technical description. NCAR Tech. Note NCAR/TN-514+STR, 50 pp., https://doi.org/10.5065/D6WQ01TD.
Eagleson, P. S., 1978: Climate, soil, and vegetation. 1. Introduction to water-balance dynamics. Water Resour. Res., 14, 705–712, https://doi.org/10.1029/WR014i005p00705.
Gerrits, A. M. J., H. H. G. Savenije, E. J. M. Veling, and L. Pfister, 2009: Analytical derivation of the Budyko curve based on rainfall characteristics and a simple evaporation model. Water Resour. Res., 45, W04403, https://doi.org/10.1029/2008WR007308.
Gong, W., H. V. Gupta, D. Yang, K. Sricharan, and A. O. Hero, 2013: Estimating epistemic and aleatory uncertainties during hydrologic modeling: An information theoretic approach. Water Resour. Res., 49, 2253–2273, https://doi.org/10.1002/wrcr.20161.
Gupta, H. V., and G. S. Nearing, 2014: Using models and data to learn: A systems theoretic perspective on the future of hydrological science. Water Resour. Res., 50, 5351–5359, https://doi.org/10.1002/2013WR015096.
Gupta, H. V., M. P. Clark, J. A. Vrugt, G. Abramowitz, and M. Ye, 2012: Towards a comprehensive assessment of model structural adequacy. Water Resour. Res., 48, W08301, https://doi.org/10.1029/2011WR011044.
Gupta, H. V., C. Perrin, G. Blöschl, A. Montanari, R. Kumar, M. Clark, and V. Andréassian, 2014: Large-sample hydrology: A need to balance depth with breadth. Hydrol. Earth Syst. Sci., 18, 463–477, https://doi.org/10.5194/hess-18-463-2014.
Hornik, K., 1991: Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251–257, https://doi.org/10.1016/0893-6080(91)90009-T.
Kinney, J. B., and G. S. Atwal, 2014: Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA, 111, 3354–3359, https://doi.org/10.1073/pnas.1309933111.
Kirchner, J. W., 2006: Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science of hydrology. Water Resour. Res., 42, W03S04, https://doi.org/10.1029/2005WR004362.
Knuth, K. H., 2005: Lattice duality: The origin of probability and entropy. Neurocomputing, 67, 245–274, https://doi.org/10.1016/j.neucom.2004.11.039.
Kumar, P., and B. L. Ruddell, 2010: Information driven ecohydrologic self-organization. Entropy, 12, 2085–2096, https://doi.org/10.3390/e12102085.
Liu, Y., J. Freer, K. Beven, and P. Matgen, 2009: Towards a limits of acceptability approach to the calibration of hydrological models: Extending observation error. J. Hydrol., 367, 93–103, https://doi.org/10.1016/j.jhydrol.2009.01.016.
Luo, Y. Q., and Coauthors, 2012: A framework for benchmarking land models. Biogeosciences, 9, 3857–3874, https://doi.org/10.5194/bg-9-3857-2012.
Manabe, S., 1969: Climate and ocean circulation: I. The atmospheric circulation and hydrology of the Earth’s surface. Mon. Wea. Rev., 97, 739–774, https://doi.org/10.1175/1520-0493(1969)097<0739:CATOC>2.3.CO;2.
Milly, P. C. D., 1994: Climate, soil-water storage, and the average annual water-balance. Water Resour. Res., 30, 2143–2156, https://doi.org/10.1029/94WR00586.
Milly, P. C. D., J. Betancourt, M. Falkenmark, R. M. Hirsch, Z. W. Kundzewicz, D. P. Lettenmaier, and R. J. Stouffer, 2008: Stationarity is dead: Whither water management? Science, 319, 573–574, https://doi.org/10.1126/science.1151915.
Nearing, G. S., and H. V. Gupta, 2015: The quantity and quality of information in hydrologic models. Water Resour. Res., 51, 524–538, https://doi.org/10.1002/2014WR015895.
Nearing, G. S., and H. V. Gupta, 2018: Ensembles vs. information theory: Supporting science under uncertainty. Front. Earth Sci., https://doi.org/10.1007/s11707-018-0709-9, in press.
Nearing, G. S., D. M. Mocko, C. D. Peters-Lidard, S. V. Kumar, and Y. Xia, 2016: Benchmarking NLDAS-2 soil moisture and evapotranspiration to separate uncertainty contributions. J. Hydrometeor., 17, 745–759, https://doi.org/10.1175/JHM-D-15-0063.1.
Paluš, M., 2014: Cross-scale interactions and information transfer. Entropy, 16, 5263–5289, https://doi.org/10.3390/e16105263.
Paninski, L., 2003: Estimation of entropy and mutual information. Neural Comput., 15, 1191–1253, https://doi.org/10.1162/089976603321780272.
Ruddell, B., 2016: ProcessNetwork, version 1.5. GitHub, https://github.com/ProcessNetwork/ProcessNetwork_Software.
Ruddell, B., and P. Kumar, 2009: Ecohydrologic process networks: 1. Identification. Water Resour. Res., 45, W03419, https://doi.org/10.1029/2008WR007279.
Ruddell, B., R. Yu, M. Kang, and D. L. Childers, 2016: Seasonally varied controls of climate and phenophase on terrestrial carbon dynamics: Modeling eco-climate system state using dynamical process networks. Landscape Ecol., 31, 165–180, https://doi.org/10.1007/s10980-015-0253-x.
Schreiber, T., 2000: Measuring information transfer. Phys. Rev. Lett., 85, 461, https://doi.org/10.1103/PhysRevLett.85.461.
Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Tian, Y., G. S. Nearing, C. D. Peters-Lidard, K. W. Harrison, and L. Tang, 2016: Performance metrics, error modeling, and uncertainty quantification. Mon. Wea. Rev., 144, 607–613, https://doi.org/10.1175/MWR-D-15-0087.1.
Weijs, S. V., G. Schoups, and N. Giesen, 2010: Why hydrological predictions should be evaluated using information theory. Hydrol. Earth Syst. Sci., 14, 2545–2558, https://doi.org/10.5194/hess-14-2545-2010.
Zhang, L., W. R. Dawes, and G. R. Walker, 1999: Predicting the effect of vegetation changes on catchment average water balance. Cooperative Research Center for Catchment Hydrology Tech. Rep. 99/12, 35 pp., https://ewater.org.au/archive/crcch/archive/pubs/pdfs/technical199912.pdf.