1. Introduction
Mathematical models of natural systems, primarily built to make predictions of systems' behavior, are usually tested using measurements of the variables predicted by the model. Any systematic procedure that further uses physical measurements to actually improve model simulation may be termed “data assimilation.” Classically in the field of land surface modeling, data assimilation has meant model state estimation (e.g., soil moisture or soil temperature). Measurements of state variables are used, for example, to update a model's predicted values for a period leading up to the present before running the model forward in time in order to make a weather prediction.
Two vital assumptions are made in this type of configuration. The first is that the model's parameters, the time-independent variables that describe the conditions under which the model is operating, are correctly chosen. The second is that the model itself, the chosen representation and coupling of physical processes, is actually capable of making a prediction with the accuracy and precision that is required. It is now well recognized that when equations representing physical processes within a model are developed at different spatial and temporal scales to those at which the model is applied, many model parameters are not directly measurable. This leaves the modeler with little choice other than to choose “behavioral” parameter values: those whose resulting model output matches observed data well. This procedure, commonly called parameter calibration, is also in some sense a data assimilation technique. It too, however, makes the assumption that the model in question is capable of reproducing the natural system's behavior; all that is required are the “correct” model parameters.
In this paper we use observed data to critique the model itself. We will demonstrate, using a single land surface model, that systematic problems in model simulation resulting from model limitations (rather than parameter misprescription) are of far greater significance than the limitations in the accuracy and precision of the observational data used to validate the model. Indeed, in the cases presented here, systematic errors resulting from model parameterization problems play a greater role in the model's inability to match observational data than the choice of parameter values. We make it clear that by “model parameterization” we refer concurrently to what others may refer to as “model structure” and “model physics.”
We use the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Biosphere Model (CBM) (Wang and Leuning 1998; Leuning et al. 1998), a land surface model developed at CSIRO Atmospheric Research, and examine the existence of systematic trends in the output error. If we can isolate, quantify, and predict such trends, then not only should we be able to correct them, but we should also gain insight into which parts of the model parameterization are ripe for improvement.
We do this by using an artificial neural network (ANN) to simulate model output error as a function of the model's inputs (meteorological forcing) and some outputs on a per-time-step basis (Fig. 1). As an example, the ANN may learn to simulate latent heat flux error (the ANN output) as a function of observed downward shortwave radiation, observed humidity, and modeled soil moisture (the ANN inputs). This involves a training phase (Fig. 1a) and a testing/simulation phase (Fig. 1b). The training phase involves providing the ANN with a time series of input–output pairs from which it establishes the functional dependence (hence both ANN inputs and ANN output are directed toward the ANN in Fig. 1a). The end result is a set of ANN parameters or weights. During the testing or simulation phase (Fig. 1b), these weights are used by the ANN to make a prediction of the model's error (based on the errors made by the model under similar conditions during the training phase). This prediction is used to make a correction to the model's output.
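As a minimal sketch of this workflow, the following uses a generic linear regressor as a stand-in for the SOLO ANN described in section 5, and synthetic arrays in place of CBM output and site observations; all names, sizes, and the train/test split are purely illustrative (the splits actually used are described in section 6).

```python
# Sketch of the NERD workflow in Fig. 1: learn model error from inputs,
# then predict and subtract it. Illustrative stand-ins throughout.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 17520                                        # one year of 30-min steps
forcing = rng.normal(size=(n, 2))                # e.g., SW down, humidity
model_state = rng.normal(size=(n, 1))            # e.g., modeled soil moisture
mod_flux = 100 + 50 * forcing[:, 0]              # stand-in "model" latent heat flux
obs_flux = mod_flux + 20 * forcing[:, 1] + rng.normal(scale=5, size=n)

X = np.hstack([forcing, model_state])            # ANN inputs
y = mod_flux - obs_flux                          # ANN output: model error

train, test = slice(0, n // 2), slice(n // 2, n)  # illustrative split only

# Training phase (Fig. 1a): learn error as a function of the inputs.
error_model = LinearRegression().fit(X[train], y[train])

# Testing phase (Fig. 1b): predict the error and correct the model output.
corrected = mod_flux[test] - error_model.predict(X[test])
rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
print(rmse(mod_flux[test], obs_flux[test]), rmse(corrected, obs_flux[test]))
```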
This approach differs from state-constraint techniques such as Kalman filtering in a number of ways. First, corrections to model states compensate only for those parameterizations within the model that affect the specific state in question, whereas the process described above corrects for all model parameterizations affecting model output. Second, while implementations of the Kalman filter usually assume zero model bias, this technique specifically attempts to capture the bias. Third, the use of a neural network means the bias relationships learned by the ANN from the training set can be used to make prognostic corrections. That is, the technique has predictive capability. While the use of ANNs in the natural sciences is not new (see Maier and Dandy 2000), applications to model bias have been very limited (e.g., Martínez and Velázquez 2001; Tetko 2002).
To capture the systematic component of model output error, we need to make a careful choice of ANN. Here we use the regression-based Self-Organizing Linear Output (SOLO) ANN (Hsu et al. 2002), precisely because it simulates only the systematic part of the training data with which it is provided. If it is simply trained with noise, it will make a zero-value simulation.
In this paper, we show the prevalence of systematic error in CBM's output as well as the ability of the regression-based SOLO ANN to capture this error by making a (statistically based) correction to the model at every time step. We use an ensemble of model runs, derived from multiple-criteria parameter estimation, to show that this systematic error is not a result of poor parameter choices. The combination of these two processes defines the Neural Error-modeling Regression-based Diagnosis (NERD) tool.
To begin, we discuss the attribution of error in model output and how we minimize contributions to this error from all sources other than the model itself. We will then detail the datasets, land surface model, and neural network used for the experiment before outlining how they are used together.
2. Defining error
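In schematic form (a reconstruction, with symbols following their usage in the remainder of this section), the model M maps its meteorological inputs I_t, its states from the previous time step ζ_{t−1}, and its parameters ϕ to an output Ô_t and updated states ζ_t, and its output error E_t is the departure of that output from the corresponding observation O_t:

(Ô_t, ζ_t) = M(I_t, ζ_{t−1}, ϕ),    E_t = Ô_t − O_t.

The error E_t can therefore draw contributions from five sources: the model parameterization M itself, the parameter values ϕ, the forcing data I_t, the validation observations O_t, and the states ζ_{t−1} carried over from earlier time steps.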




We now outline how we attempt to ensure that the systematic component of model output error, E, is due only to M, and not the other four sources. We additionally try to characterize systematic error in a way that is relatively insensitive to our choice of parameter set. We deal with each of the five possible error sources in Eq. (2) in turn.
a. Choosing model parameter values
We set about identifying parameter sets that are as close as possible to being “correct.” While ideally this means we want the values that are those of the natural system, not all parameters are physically observable. This is often because the parameterization of physical processes included in the model has been developed at spatial and temporal scales different to those at which the model is applied. This leaves us with little choice when choosing parameter values other than to use intuition and physical reasoning, and/or to choose the values that make the model perform best. We will refer to parameter sets that make the model best match observations as behavioral. By definition the process of choosing values based on a model's posterior adherence to observations (commonly known as calibration) decreases the error in simulations. It does not, however, guarantee that parameter values so obtained are physically meaningful, nor that they would be successful in any other model (Franks et al. 1997). However, since models may include unmeasurable parameters, it seems we can do nothing better than to estimate them in this way. This is essentially what we do here with CBM.








In this paper, the multiple-criteria technique we employ to obtain a pareto set for CBM is the Multi-Objective Shuffled Complex Evolution Metropolis algorithm (MOSCEM-UA) (Vrugt et al. 2003a), which essentially combines the Multi-Objective Complex Evolution (MOCOM-UA) (Yapo et al. 1998) and Shuffled Complex Evolution Metropolis (SCEM-UA) (Vrugt et al. 2003b) methods. Details of the nature of the specific search algorithms employed can be found in Vrugt et al. (2003a) and Vrugt et al. (2003b), and a more general discussion of the benefits of the multiple-criteria approach can be found in Gupta et al. (1999).
For a given model, the calibration process provides us with a collection of parameter sets, the pareto set, each member of which contains parameter values that are both realistic and behavioral with respect to at least one of the model outputs for which we have measurements. Our stated goal was to ensure that errors in output from a given model could not be attributed to parameter misprescription. Which point should we then choose from the pareto set to run the model with? Ideally the answer is all of them, since we have no grounds to declare any one point universally “better” than another. That is, if we wish to characterize the nature of the systematic component of model output error at a particular site in a way that is independent of a given parameter set, we must include analysis of model runs using all realistic and behavioral parameter values. In practice we are limited by finite computing power, so that “all” needs to become a reasonably small, manageable number while still adequately representing the range of parameter sets within the pareto set. We will discuss how we have selected such subsets in our experimental setup in section 6.
b. Error in state initialization and observations
We now look at the other two sources of error described in Eq. (2), error in observed data and model initialization, which may cause systematic error in model output not originating from model parameterization weaknesses.
State initialization issues are commonly dealt with by what is referred to as model spinup. This involves running the model on the simulation dataset repeatedly until the model states reach equilibrium, at which point we begin recording model output. Using a spinup period usually ensures that model performance is insensitive to initial state values and this was indeed the case for the experiments conducted here (see section 6). There is, however, another way that we might interpret “initial state error.”
Equations (1) and (2) represent model behavior for a particular time step during a simulation. If for a moment we ignore error arising from observational (I_t and O_t) and parameter (ϕ) uncertainty, then model output error not arising from model inability comes from the states of the previous time step, ζ_{t−1}. Even though we have employed a model spinup period, and hence the value of ζ_{t−1} is insensitive to the first time step's state values, ζ_1, there is no reason to believe that ζ_{t−1} will be as measured on site. It is commonly accepted that model states, such as soil moisture and temperature, may “drift” from observed values. The passing of state values from time step to time step, therefore, represents an internal feedback mechanism, since ζ_{t−1} is a function not only of initial state values, but also of the model inability, parameter value, and input data errors from every time step since the first. This may influence the nature of any systematic error in model output.
This problem will be dealt with in part by ensuring that model states are used as ANN inputs. That is, state values will partially form the set of conditions from which the ANN will be trained to recognize model error. We will discuss this in more detail, together with other possible approaches to dealing with the problem in section 8.
Issues of accuracy in model input (meteorological) and validation (flux) data are not dealt with explicitly in this paper. As we will see after discussing the structure of the SOLO ANN in section 5, this is unlikely to influence the results presented here unless they are of a systematic nature. Systematic problems in observational data, where known, need to be dealt with individually and are outside the scope of this paper.
In the following sections we outline the datasets, land surface model, calibration algorithm, and type of neural network used in this paper. Quite some time will be spent discussing the workings of the neural network, as its structure is vital to the success of the NERD process. We then detail how these elements are combined during the training and testing phases of the neural network.
3. Datasets
To illustrate the technique we use two datasets. The first was collected at Cabauw in the Netherlands (51°58′N, 4°56′E) and is described in detail by Beljaars and Bosveld (1997). The site consists mainly of short grass divided by narrow ditches, with no obstacle or perturbation of any importance within a distance of about 200 m from the measurement site. Climate in the area is characterized as moderate maritime with prevailing westerly winds. Variables available, at 20-m height in 30-min intervals for the year 1987, include downward shortwave radiation, downward longwave radiation, air temperature, wind, specific humidity, sensible heat flux, latent heat flux, ground temperature, net radiation, and ground heat flux. These data were used by the Project for the Intercomparison of Land surface Parameterization Schemes (PILPS) (Henderson-Sellers et al. 1995) as both atmospheric forcing and observed flux data, in an evaluation of the performance of a suite of land surface schemes (Chen et al. 1997). As part of this experiment a default parameter set was provided, which we will use here to help quantify the gains made by parameter estimation.
The second was collected at the Harvard Forest site in Massachusetts (42°32′N, 72°10′W). This cool moist temperate deciduous forest site consists of a mixture of hardwoods and conifers, with vegetation height around 25 m near the 30-m measurement tower. The measurement site has an elevation of around 300 m, with mainly sandy loam soils. Hourly averages for the years 1992–99 of the following variables were used: air temperature, downward shortwave radiation, wind speed, relative humidity, rainfall, surface soil temperature, CO2 flux, latent heat flux, and sensible heat flux. Downward longwave radiation was synthesized using the Swinbank approximation (Swinbank 1963). Unlike Cabauw, for simulations at Harvard Forest, we used time-dependent leaf area index, derived from on-site measurements. The carbon flux measurements used here are discussed in Barford et al. (2001). (For a list of publications and details of site instrumentation see http://www-as.harvard.edu/chemistry/hf/.)
4. The CSIRO Biosphere Model
The CBM was developed by CSIRO (Australia). It uses a single-layer, two-leaf canopy model that consists of two parts: 1) a radiation submodel that calculates the photosynthetically active radiation, near-infrared radiation, and thermal radiation absorbed by sunlit and shaded leaves and 2) a coupled model of stomatal conductance, photosynthesis, and partitioning of absorbed net radiation into sensible and latent heat (Leuning 1995; Leuning et al. 1995, 1998; Wang and Leuning 1998). The soil component uses a six-layer structure to compute heat conduction and Richards' equation to calculate moisture transport, and includes soil freeze and thaw cycles. The snow model computes the temperature, snow density and thickness of three snowpack layers.
CBM took part in the PILPS C1 experiment (www.pilpsc1.cnrs-gif.fr), which compared land surface model performance using data collected at the Loobos, Netherlands, pine forest site. CBM's demonstrated competence in this experiment (results at www.pilpsc1.cnrs-gif.fr) suggests that the results presented here should be applicable to other models. When we speak of “the model” in the experiments considered here, we mean CBM.
5. The SOLO neural network
An ANN may be thought of as a mathematical function that, through an iterative process, adjusts its own constants or parameters to fit a given set of data. Most commonly, ANN operation is split into two phases, one to train the ANN and one to test or use it for prediction. This process, providing the ANN with a fixed set of input/output pairs from which it establishes the desired functional relationships, is known as “supervised training.” Figure 1a represents the supervised training phase, and Fig. 1b represents the testing phase.
For our purpose of modeling systematic trends in model output error we chose the SOLO neural network (Hsu et al. 2002). Our primary reason for doing so is that the structure of the SOLO map ensures that only systematic trends in training data are captured; there is little risk of modeling noise in data, often an issue with overtraining in feed-forward ANNs. We will discuss this and other reasons for our choice in more detail after we have outlined the SOLO map's structure and operation. The description below follows from Hsu et al. (2002).
The SOLO map consists of three layers, shown in Fig. 3: an input layer, an input classification layer, and a regression or output layer. The input layer, given n0 input variables (such as air temperature or wind speed), consists of n0 + 1 nodes. The unit input that forms the extra node is used only in the regression stage of operation. Both the classification layer and output layer are square matrices of n1 × n1 nodes. Joining the ith nonunit input node to the jth classification layer node is the weight wji, for all i = 1, . . . , n0 and j = 1, . . . , n1 × n1. These weights, wji, together with the input and classification layers form a Self Organizing Feature Map (SOFM) (Kohonen 1989), which operates in the following way.
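In the standard Kohonen (1989) formulation on which this SOFM is based, each (normalized) input vector x is assigned to the classification node whose weight vector lies closest to it, and that node and its neighbors are drawn toward x:

j* = argmin_j ‖x − w_j‖,    w_j ← w_j + α h(j, j*) (x − w_j),

where α is a learning rate and h(j, j*) is a neighborhood function centered on the winning node j*; both shrink as training proceeds, so that the classification layer self-organizes into a map of the input space. This is a schematic statement of the standard procedure rather than a reproduction of the specific update rules of Hsu et al. (2002).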




At this point, with SOFM training complete and all input vectors associated with nodes of the classification layer, a direct link is made between the jth node of the classification layer and jth node of the output/regression layer (see Fig. 3). By this, we mean each node in the regression layer is associated with the subset of the input data belonging to the jth classification layer node. A linear regression is then performed between this subset of the input data and its associated output data (remembering for this training period we provided the SOLO map with a set of input–output pairs). The weights between each input layer node (including the unit input node) and the jth regression layer node, {υji|i = 0, . . . n0}, are the parameters of this regression (see Fig. 3).
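A minimal sketch of this two-stage structure follows, assuming the standard Kohonen update for the SOFM and ordinary least squares for each node's regression; it is illustrative only and omits refinements of the published SOLO implementation. Inputs are assumed to be normalized, and all names are ours.

```python
# Minimal SOLO-like sketch: a SOFM classifies the (normalized) inputs and
# each classification node then gets its own linear regression with intercept.
import numpy as np

class SoloSketch:
    def __init__(self, n_inputs, edge_length=8, seed=0):
        self.n0, self.n1 = n_inputs, edge_length
        rng = np.random.default_rng(seed)
        self.w = rng.uniform(size=(edge_length ** 2, n_inputs))   # SOFM weights (w_ji)
        self.v = np.zeros((edge_length ** 2, n_inputs + 1))       # regression weights (v_ji)
        self.grid = np.array([(j // edge_length, j % edge_length)
                              for j in range(edge_length ** 2)], dtype=float)

    def _winner(self, x):
        # Index of the classification node whose weight vector is closest to x.
        return int(np.argmin(np.sum((self.w - x) ** 2, axis=1)))

    def fit(self, X, y, epochs=20):
        # 1) SOFM training: winner-take-most updates with shrinking
        #    learning rate and neighborhood radius.
        for epoch in range(epochs):
            alpha = 0.5 * (1.0 - epoch / epochs)
            radius = max(1.0, 0.5 * self.n1 * (1.0 - epoch / epochs))
            for x in X:
                j_star = self._winner(x)
                d2 = np.sum((self.grid - self.grid[j_star]) ** 2, axis=1)
                h = np.exp(-d2 / (2.0 * radius ** 2))
                self.w += alpha * h[:, None] * (x - self.w)
        # 2) Per-node least squares on the data assigned to each node.
        nodes = np.array([self._winner(x) for x in X])
        Xa = np.hstack([np.ones((len(X), 1)), X])       # unit input node + inputs
        for j in range(len(self.w)):
            members = nodes == j
            if members.sum() > self.n0:                 # enough points to regress
                self.v[j] = np.linalg.lstsq(Xa[members], y[members], rcond=None)[0]
        return self

    def predict(self, X):
        Xa = np.hstack([np.ones((len(X), 1)), X])
        return np.array([Xa[i] @ self.v[self._winner(x)] for i, x in enumerate(X)])

# Usage (illustrative names): ann = SoloSketch(n_inputs=6, edge_length=12)
#                             ann.fit(X_train, err_train)
#                             correction = ann.predict(X_test)
```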










The regression structure of the SOLO map ensures no correction will be made if there is no systematic trend in the model's output error. In this case, the gradient and intercept regression parameters for each node will be zero. This makes it ideal for use as a bias correction model. For the same reason, noise in observational data should not affect the input–output relationships established, provided we have enough data for training. Additionally, the regression structure eliminates the many potential problems encountered with other ANNs that use error space search algorithms to find optimal network parameters (e.g., problems with local minima and overtraining). It is also computationally more efficient than either multilayer feed-forward or recurrent neural networks (Hsu et al. 2002). In the sections to follow, when we speak of “the ANN” we mean the SOLO map, and by SOFM “resolution” we mean the number of nodes, n1 × n1, in the SOFM.
6. Experimental setup
To demonstrate the NERD process, we make a correction to CBM's simulation output at the two observational sites described in section 3. In both cases we correct only a single model output flux, although it should be clear that extending the SOLO map architecture to deal with several outputs is relatively simple. The processes of training and testing the ANN, described below, are shown schematically in Figs. 1a and 1b, respectively.
a. Case 1: Latent heat correction at Cabauw


In addition to these five parameter sets, we use two default parameter sets for reference purposes. One of these was provided by Beljaars and Bosveld (1997) for the PILPS phase 2a experiment (Chen et al. 1997); the other was our choice of default parameters for Cabauw, using generic vegetation and soil type.
To demonstrate the processes involved in using these seven parameter sets we first consider the simplest configuration. CBM is run with a single default parameter set for the entire year of Cabauw forcing data. We then have 17 520 time steps of model output, observations of latent heat flux (provided with the Cabauw meteorological forcing), and meteorological forcing. Alternate time steps are allocated to training and testing sets, giving two 8760 time step sets.
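As a sketch of this allocation (array names and the placeholder contents are illustrative):

```python
import numpy as np

n_steps = 17520                               # one year of 30-min time steps
inputs = np.zeros((n_steps, 6))               # placeholder ANN input series
error = np.zeros(n_steps)                     # placeholder model error (mod - obs)

train, test = slice(0, None, 2), slice(1, None, 2)   # alternate time steps
X_train, y_train = inputs[train], error[train]       # 8760 training pairs
X_test, y_test = inputs[test], error[test]           # 8760 testing pairs
```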


This configuration is used for CBM runs with each of the five parameter sets chosen from the pareto set and the two default parameter sets, leaving us with seven trained ANNs and seven respective testing sets.
The second configuration examines the systematic trends in CBM's error in a parameter independent way by using the same ANN inputs and output as mentioned above but utilizing all five pareto point runs. That is, 5 × 17 520/2 = 43 800 input–output pairs are provided for ANN training, and 43 800 provided for testing. In terms of the matrix in Eq. (13) above, this simply involves increasing the length of each column by a factor of 5.
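In a sketch, this is simply a concatenation of the five runs' training arrays along the time dimension (placeholder arrays shown):

```python
import numpy as np

# One (inputs, error) pair of training arrays per pareto-point run, 8760 steps each.
runs = [(np.zeros((8760, 6)), np.zeros(8760)) for _ in range(5)]

X_train = np.vstack([X for X, _ in runs])        # 5 x 8760 = 43 800 input vectors
y_train = np.concatenate([e for _, e in runs])   # 43 800 target errors
```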
For each of the seven model runs used in these eight experiments, a 5-yr spinup period was used to remove sensitivity to initial state values. To be certain of its success, five initial state value sets were used for all model runs with each of the seven parameter sets. These state sets had soil moisture values ranging from wilting point to above soil saturation as well as soil temperature values ranging from 0° to 20°C. In every case, after the 5-yr spinup period, variation in rmse in latent heat amongst runs with a fixed parameter set but different initial state values was three orders of magnitude less than the variation between runs with different parameter sets.
b. Case 2: Carbon correction at Harvard Forest in dynamic conditions
We demonstrate the broad applicability of the NERD process by making a correction to carbon fluxes at a different site. The experimental setup is similar to the first case, but is additionally designed to explore the validity, in a dynamic environment, of the statistical correction. We stress that we are primarily using the NERD methodology as a tool to identify systematic model weakness, but the question of how fundamental this weakness is, relative to changes in climate system behavior, remains open. To investigate this question, the 8 yr of Harvard Forest data described in section 3 are used to make a correction to net ecosystem exchange (NEE) predictions. The first 4 yr of data are used both to select parameter values and train the ANN. The second 4 yr are used to test the relationships so established.


Inputs to the ANN in this case are (observed) downward shortwave radiation, surface air temperature, and leaf area index together with (modeled) latent heat flux, top layer soil temperature, and net ecosystem exchange. Output is error in NEE. We again make a correction on a per-time-step basis.
7. Results
We first address the existence of systematic model error. (The dotted lines in Figs. 7 and 8 show the average daily and monthly values of the two fluxes predicted by CBM at the two sites; the solid lines represent observations.) March, July, and November were chosen as evenly separated months that include the middle of the Northern Hemisphere summer. Results in both figures are an average of the ensemble of all pareto point runs (and testing years in the Harvard Forest case). We can see that CBM consistently underrepresents latent heat flux at Cabauw. Harvard Forest NEE was also underpredicted during the winter months, with CBM predicting virtually no net emission of CO2. Had we any doubt, it should now be clear that the model has a detectable systematic bias.
Each of the eight experiments outlined in case 1 of section 6 using Cabauw data is represented by a line in Fig. 4. This plot shows the per-time-step rmse of model simulations corrected by the ANN for a range of self-organizing feature map (classification layer) resolutions. The x axis represents the SOFM edge length, n1, as shown in Fig. 3, and the rmse value at zero resolution is simply the rmse of the uncorrected model run. No attempt has been made here to distinguish between runs generated by the two default parameter sets (dashed) and runs generated by the five individual pareto set runs (dotted); rather, we consider them as two behavioral groups. The solid line represents the performance of the ANN trained on the ensemble pareto set runs.
From this figure we see that a correction made by the ANN using a SOFM with edge length 32 (implying a 32 × 32 = 1024 node SOFM) that has been trained and tested on all five pareto set runs can decrease the simulation rmse for latent heat from 27.5 to 18.6 W m−2 (32%). If we look at the y axis or zero-resolution line (enlargement in Fig. 4) we see how effective parameter calibration is in this case. The difference between the worst-performing default parameter set and the best-performing noninferior parameter set is around 1.74 W m−2, or about 6%. From the best-performing default to the worst-performing noninferior point is 0.44 W m−2, or around 1.5%. Even a single unit SOFM (“1” on the x axis in Fig. 4) gives a 7% improvement in the rmse. That is, making a linear correction to the latent heat flux of CBM at Cabauw based on the six ANN inputs gives a correction of a similar size to parameter calibration.
The analogous plot for CO2 flux correction at Harvard Forest shows a similar trend (Fig. 5). It represents the per-time-step rmse performance of NEE prediction by CBM for the 4-yr testing period (1996–99) for a range of ANN complexities. The best-performing correction came from the ANN trained and tested on the default parameter set, with a 12 × 12 = 144 node SOFM reducing the per-time-step rmse from 3.71 to 2.70 μmol m−2 s−1 (27%). The all-pareto point experiment resulted in a drop from 3.55 to 2.88 μmol m−2 s−1 (18%).
Parameter calibration in this case reduced NEE per-time-step rmse from 3.71 to 3.36 μmol m−2 s−1 (9%) for the NEE minimum in the pareto set. The sensible heat minimum in the pareto set resulted in a marginally higher per-time-step rmse than that resulting from the default parameter set.
It appears that the nature of model systematic error in latent heat flux at Cabauw is not wholly parameter dependent. In Fig. 4, the all-pareto point experiment (which used five model runs) performed better than any of the single pareto point experiments, suggesting that information about model weakness gained using one parameter set is applicable when running the model with another. The nature of the model's systematic error was therefore best generalized by the ANN trained using multiple parameter sets. This was not the case for the carbon correction at Harvard Forest, however. The reason for this is most likely that the calibration process that generated the Harvard Forest pareto set used three criteria, instead of the two used at Cabauw. The result was a larger pareto set and consequently a larger range of model behavior for the ANN to capture.
For further analysis, unless otherwise stated, we use results from all-pareto trained ANNs. In the Cabauw case, we use a 32 × 32 node SOFM, trained and tested on the five pareto point model runs, and for Harvard Forest, a 12 × 12 node SOFM trained and tested on the four pareto point model runs.
We now consider rmse on longer time scales. Figures 6a and 6b show rmse for a range of averaging window sizes. Half-day averages to 20-day averages are plotted for Cabauw and up to 40-day averages for Harvard Forest. Results are shown for all-pareto-point model (solid) and corrected model (dashed) runs as well as default model (dash–dot) and corrected model (dotted) runs. This gives us an indication of the relative effectiveness of parameter estimation and the NERD correction. If we dispense with parameter estimation altogether and simply implement NERD using default parameter sets only, results in the Cabauw case are only marginally worse and in the Harvard Forest case significantly better. A summary of the improvements is shown in Table 1, which suggests reductions in rmse are achieved both by increasing averaging time and applying the NERD correction. Note that the relative size of the NERD correction increases with increasing averaging time. While daily carbon flux rmse is reduced by 53% in the default simulations, dropping from 2.66 to 1.25 μmol m−2 s−1, the annual reduction is 95%, a drop from 2.24 to 0.11 μmol m−2 s−1.
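The window-averaged rmse can be computed by block-averaging both the modeled and observed series before taking the rmse. A minimal sketch, assuming non-overlapping windows (the paper does not specify the exact windowing) and illustrative names:

```python
import numpy as np

def windowed_rmse(model, obs, window):
    """Rmse of non-overlapping window averages of two equal-length series."""
    n = (len(model) // window) * window          # drop any ragged tail
    m = np.asarray(model)[:n].reshape(-1, window).mean(axis=1)
    o = np.asarray(obs)[:n].reshape(-1, window).mean(axis=1)
    return float(np.sqrt(np.mean((m - o) ** 2)))

# e.g., half-day (24 steps) to 20-day (960 steps) windows of 30-min Cabauw data:
# rmses = [windowed_rmse(mod_lh, obs_lh, w) for w in range(24, 961, 24)]
```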
The remainder of Fig. 6 represents the results of scatterplots of modeled versus observed values for both fluxes. Ideally, we want a unit gradient and zero offset for the least squares linear regression lines for such plots, regardless of whether we consider a scatter based on per-time-step, daily, weekly, or monthly averages. The gradient of such regression lines (Figs. 6c,d) as well as the square of the correlation coefficient, r² (Figs. 6e,f), are shown for a range of averaging window sizes at both sites. The solid line represents the gradient of model versus observed, and the dashed line represents corrected model versus observed. The shaded gray regions surrounding each line represent the 95% confidence intervals on the gradient estimates, which naturally broaden as we consider longer-term averages and the sample size shrinks.
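Such gradients and confidence bands follow from an ordinary least-squares fit of (corrected or uncorrected) model values against observations; a sketch using scipy, with illustrative names (the paper does not specify the routine actually used):

```python
import numpy as np
from scipy import stats

def gradient_with_ci(obs, mod, alpha=0.05):
    """Least-squares gradient of mod vs obs, r^2, and a (1 - alpha) CI on the gradient."""
    res = stats.linregress(obs, mod)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, len(obs) - 2)
    ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
    return res.slope, res.rvalue ** 2, ci

# e.g., slope, r2, (lo, hi) = gradient_with_ci(obs_10day, corrected_10day)
```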
The most striking result here is the correction of simulated CO2 at Harvard Forest. While the gradient of model simulation versus observation converges to a value around 0.7 (with increasing size of averaging window), the corrected simulation is unbiased at 10-day or greater averages (where the 95% confidence interval includes the unit gradient). Correlation between observed and modeled CO2 at Harvard Forest was also significantly improved by the correction. The Cabauw case was not so dramatic. While observed–modeled correlation was clearly bettered at all time scales by the correction, the corrected model versus observed regression gradient was only better for 8-day averages or less.
We now briefly look at the impact of the corrections on the diurnal and annual cycles of the two fluxes. Figure 7 shows the ensemble average day for three separate months during the simulation at both sites. Figure 8 shows average monthly flux values at both sites. The model underestimated the latent heat flux at Cabauw during each month and significantly so in March, while the application of the NERD process removed almost all of this bias (Fig. 7). A similar result was obtained for NEE at Harvard Forest, with NERD able to remove both positive and negative biases in model predictions. Systematic errors in modeled monthly mean latent heat fluxes were largely eliminated by NERD at Cabauw (Fig. 8), but the correction led to a systematic positive bias in NEE at Harvard Forest, in contrast to the negative biases in winter and autumn from the model alone.
8. Discussion
The results in section 7 demonstrated that the NERD process led to significant improvements in model performance at all time scales for most of the measures we considered. The ANN successfully identified and corrected systematic bias in model output for calibrated and default parameter sets.
Reasons for choosing one parameter set over another when making a NERD correction are not yet clear. In the Cabauw case, gains made by parameter calibration were preserved by the NERD correction; the separation of model performance using default parameter values versus pareto parameter values remained intact regardless of SOFM resolution in the correcting ANN (Fig. 4). At Harvard Forest, however, the default parameter set, which had considerably higher rmse for uncorrected model runs, consistently outperformed any of the pareto point model runs once the ANN correction was applied (Fig. 5). The use of multiple pareto parameter sets effectively gave several times as much training data with which to generalize the model's systematic error, which at Cabauw resulted in the superior performance of the all-pareto correction. This again was not true at Harvard Forest. A possible resolution of this issue could be the use of an all-pareto ANN that additionally includes selected model parameters as inputs.
We now consider possible improvements to the technique. Table 2 shows the Pearson correlation coefficient (r) between ANN inputs and model error before and after ANN correction. It also shows the “P value,” a measure designed to gauge the significance of the correlation: the probability of obtaining the given correlation by random chance under the null hypothesis of no correlation. Traditionally a value of 5% (0.05) or less is deemed significant, implying the null hypothesis should be rejected. A zero P value here implies a value less than 10⁻¹⁰⁰. Table 2 suggests that although the ANN has significantly reduced systematic error, it has by no means done a comprehensive job. At Cabauw, the ANN largely removed the significant correlations between latent heat flux error and air temperature, humidity, and modeled latent heat flux, but the relatively minor decrease in the correlation between each of the other three inputs and latent heat error shows that the ANN did not adequately capture this dependence. This problem is even clearer at Harvard Forest, where even after correction all ANN input–output P values were less than 10⁻¹⁰. This suggests that the improvements made by the NERD correction could be greater still.
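The r and P values in Table 2 correspond to the standard two-sided test of the no-correlation hypothesis and can be computed, for example, with scipy; the series below are placeholders and the input names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residual = rng.normal(size=8760)                        # placeholder model error series
ann_inputs = {"SW down": rng.normal(size=8760),         # placeholder ANN input series
              "air temperature": rng.normal(size=8760)}

for name, series in ann_inputs.items():
    r, p = stats.pearsonr(series, residual)
    print(f"{name}: r = {r:+.3f}, P = {p:.2e}")         # P < 0.05 -> significant
```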


One issue mentioned in section 2b that could complicate results was model state drift. Ideally we would like the ANN to learn only first-order model error, without the complication of internal feedback mechanisms. That is, the state values of the previous time step in Eq. (1) would ideally be observed states so that (for the moment ignoring observational errors) the error term defined in Eq. (2) would have no dependence on the model's behavior in previous time steps. If the ANN were to be trained this way, however, during the testing period it would have to make a correction to the model states (which had been replaced by observations during training). If it did not, we would expect model states to again drift to equilibrium values, potentially a very different environment from the one in which the ANN was trained. The main limiting factor for such an approach is the relatively limited amount of observed state data. The issue was mitigated to some extent by including model states as ANN inputs. Also, in both the Cabauw and Harvard Forest cases described above, we have some evidence to indicate that model states were reasonably realistic. Top-layer soil temperature, the only state variable available for both of the datasets, was easily within 1 K of the observed value after spinup in both cases.
Perhaps the most serious criticism of the NERD process is that it is a statistical correction. One might well ask, if we believe that an ANN is capable of appropriately correcting the model, why not just use an ANN to model the land surface and dispense with the physically based model altogether? The answer is that we are modeling a dynamic climate system (assuming that our interest is long-term prognostic simulation). It is the case, whether we make a correction or not, that the model must incorporate enough of the natural system's physical processes that the mechanisms of climate change are captured. Additionally, there is not yet enough data to support a global statistical land surface model. Deciding whether a statistical correction to a physically based model is appropriate under dynamic conditions is very difficult, since we do not know exactly how dynamic the natural system actually is. We must decide whether the anticipated modes of climate behavior are significantly different from those that were used to develop the statistical technique. That is, whether we have data today that includes the physical processes of climate in the future. Resolving these issues will take time and a great deal of high quality observational data, and the answer will probably be temporally and spatially dependent. They apply both to the NERD process and to parameter estimation, which has been performed using a short period of single-site observations to choose parameters for entire regions for long-term simulations (Sen et al. 2001). In the Harvard Forest case presented here however, since both parameter estimation and ANN training were performed using 1992–95 data and the results used 1996–99 data, both processes seem appropriate.
This paper is intended as a simple demonstration of the ability of the NERD technique to capture (but not yet reveal) the nature of model error emanating from parameterization problems in the model. Future work will use NERD to identify weaker areas of model parameterization. Additionally, the statistical correction presented here will be extended to regional or global scales by including model parameters as inputs to the ANN as a mechanism for distinguishing between sites.
9. Conclusions
In this paper we have demonstrated the ability of the NERD process to remove a significant proportion of model error. That is, we have shown that an appropriately chosen artificial neural network can successfully identify and correct systematic trends in model output at different sites, for different variables, across a broad range of time scales. The magnitude of the correction in all cases presented here was considerably larger than that afforded by parameter calibration. For latent heat flux at the Cabauw site, the NERD process reduced per-time-step rmse from 27.5 to 18.6 W m−2 (32%) and monthly rmse from 9.91 to 3.08 W m−2 (68%). Net ecosystem carbon exchange (NEE) rmse at the Harvard Forest site was reduced from 3.71 to 2.70 μmol m−2 s−1 (27%) on a per-time-step basis and 2.24 to 0.11 μmol m−2 s−1 (95%) on annual time scales. This clearly shows that systematic error in model output does indeed exist.
We have also ensured that the gains made by the NERD correction compensate for inadequacies in model parameterization rather than problems resulting from inappropriate parameter values. The NERD tool was applied using model parameter sets that minimized error in latent heat, sensible heat, and net ecosystem carbon exchange both independently and simultaneously, as well as with default parameter sets.
This suggests that data quality is not a major limitation on the validation and development of land surface models. Indeed, the use of observational data purely for parameter estimation, at least in this case, appears to be an underutilization of the important information on model misbehavior that these data contain. The NERD technique also dramatically enhances the breadth of data available for testing and improving land surface models, since it does not require continuous observational data: even single measurements of appropriate variables can contribute to neural network training or testing. It should be noted, however, that the work presented here represents a small sample size, making use of only a single model and two observational sites.
Acknowledgments
The authors thank Yuqiong for providing the MOSCEM code and Steve Wofsy for helpful comments on the manuscript, together with the U.S. Dept. of Energy, Office of Science, and the U.S. National Science Foundation for Harvard Forest data and the Royal Netherlands Meteorological Institute for Cabauw data. Work here was supported by a CSIRO Postgraduate Scholarship and an Australian Postgraduate Award.
REFERENCES
Barford, C. C., and Coauthors, 2001: Factors controlling long- and short-term sequestration of atmospheric CO2 in a mid-latitude forest. Science, 294, 1688–1691.
Beljaars, A. C. M., and Bosveld, F. C., 1997: Cabauw data for the validation of land surface parameterization schemes. J. Climate, 10, 1172–1193.
Chen, T. H., and Coauthors, 1997: Cabauw experimental results from the Project for Intercomparison of Land-Surface Parameterization Schemes. J. Climate, 10, 1194–1215.
Franks, S. W., Beven, K. J., Quinn, P. F., and Wright, I. R., 1997: On the sensitivity of soil–vegetation–atmosphere transfer (SVAT) schemes: Equifinality and the problem of robust calibration. Agric. For. Meteor., 86, 63–75.
Gan, T. Y., and Biftu, G. F., 1996: Automatic calibration of conceptual rainfall–runoff models: Optimization algorithms, catchment conditions, and model structure. Water Resour. Res., 32, 3513–3524.
Gupta, H. V., Bastidas, L. A., Sorooshian, S., Shuttleworth, W. J., and Yang, Z. L., 1999: Parameter estimation of a land surface scheme using multicriteria methods. J. Geophys. Res., 104, 19491–19503.
Gupta, H. V., Sorooshian, S., Hogue, T., and Boyle, D., 2002: Advances in automatic calibration of watershed models. Calibration of Watershed Models, Q. Duan et al., Eds., Water Science and Application Series, Vol. 6, Amer. Geophys. Union, 9–28.
Henderson-Sellers, A., Pitman, A. J., Love, P. K., Irannejad, P., and Chen, T. H., 1995: The Project for Intercomparison of Land Surface Parameterization Schemes (PILPS): Phases 2 and 3. Bull. Amer. Meteor. Soc., 76, 489–503.
Hsu, K.-L., Gupta, H. V., Gao, X., Sorooshian, S., and Imam, B., 2002: Self-Organizing Linear Output map (SOLO): An artificial neural network suitable for hydrologic modeling and analysis. Water Resour. Res., 38, 1302, doi:10.1029/2001WR000795.
Kohonen, T., 1989: Self-Organization and Associative Memory. Springer-Verlag, 312 pp.
Leuning, R., 1995: A critical appraisal of a combined stomatal-photosynthesis model for C3 plants. Plant Cell Environ., 18, 339–355.
Leuning, R., Kelliher, F. M., de Pury, D. G. G., and Schulze, E.-D., 1995: Leaf nitrogen, photosynthesis, conductance and transpiration: Scaling from leaves to canopies. Plant Cell Environ., 18, 1183–1200.
Leuning, R., Dunin, F. X., and Wang, Y. P., 1998: A two-leaf model for canopy conductance, photosynthesis and partitioning of available energy. II. Comparison with measurements. Agric. For. Meteor., 91, 113–125.
Maier, H. R., and Dandy, G. C., 2000: Neural networks for the prediction and forecasting of water resources variables: A review of modelling issues and applications. Environ. Modell. Software, 15, 101–124.
Martínez, M. A., and Velázquez, M., 2001: A new method to correct dependence of MSG IR radiances on satellite zenith angle, using a neural network. Proc. 2001 EUMETSAT Meteorological Satellite Data Users' Conf., Antalya, Turkey.
Sen, O. L., Bastidas, L., Shuttleworth, W., Yang, Z., and Sorooshian, S., 2001: Impact of field calibrated vegetation parameters on GCM climate simulations. Quart. J. Roy. Meteor. Soc., 127, 1199–1224.
Swinbank, W. C., 1963: Longwave radiation from clear skies. Quart. J. Roy. Meteor. Soc., 89, 339–348.
Tetko, I. V., 2002: Associative neural network. Neural Process. Lett., 16, 187–199.
Vrugt, J., Gupta, H. V., Bastidas, L. A., Bouten, W., and Sorooshian, S., 2003a: Effective and efficient algorithm for multi-objective optimization of hydrologic models. Water Resour. Res., 39, 1214, doi:10.1029/2002WR001746.
Vrugt, J., Gupta, H. V., Bouten, W., and Sorooshian, S., 2003b: A Shuffled Complex Evolution Metropolis algorithm for optimization and uncertainty assessment of hydrologic model parameters. Water Resour. Res., 39, 1201, doi:10.1029/2002WR001642.
Wang, Y. P., and Leuning, R., 1998: A two-leaf model for canopy conductance, photosynthesis and partitioning of available energy. I. Model description. Agric. For. Meteor., 91, 89–111.
Yapo, P. O., Gupta, H. V., Bouten, W., and Sorooshian, S., 1998: Multi-objective global optimization for hydrologic models. J. Hydrol., 204, 83–97.
APPENDIX A
Finding the Regression Parameters
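The quantity sought here is the set of regression weights υ_j for node j. In standard least-squares form (a sketch consistent with the description in section 5, with X_j the matrix of unit-augmented input vectors assigned to node j and y_j the corresponding outputs):

υ_j = argmin_υ ‖X_j υ − y_j‖²,    equivalently    (X_jᵀ X_j) υ_j = X_jᵀ y_j,

the normal equations whose solution gives the output-layer weights for that node.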










APPENDIX B
Principal Components
We wish to find a solution to















Fig. 1. Configuration for (a) training the ANN and (b) testing the ANN. Shaded boxes represent the goal of each phase. During the training period, the ANN is provided with a set of input–output pairs from which it will establish a functional relationship, recorded in the ANN weights. This relationship is tested by using the ANN to make a correction to model output using “unseen” data.

Fig. 2. The parameter and criteria space in a two-criteria, two-dimensional parameter calibration setup. The dark line between the two criteria's minima, α and β, represents the noninferior or pareto set, while the concentric circles in the parameter space represent level curves for the two criteria objective functions. The shaded region represents the projection of the parameter space into the criterion space, and γ is the “compromise point” defined in Eq. (12) (modified after Gupta et al. 2002).

Fig. 3. The three-layer structure of the SOLO neural network. The input layer, classification layer, and weights wji together form a SOFM, while the output layer performs a node-by-node multiple linear regression with parameters υji. The input variables {x1, . . . , xn0} (e.g., air temperature and wind speed) are normalized, in this case using their maximum possible ranges. After Hsu et al. (2002).

Fig. 4. Rmse of corrected model simulation vs the size of the self-organizing feature map used to make the correction for latent heat correction at Cabauw. Dotted lines represent ANNs trained and tested on model output generated using a single pareto point, dashed lines use a single default parameter set, and the solid line uses an ensemble of model runs using all five pareto points. Zero edge length (the y axis) is the uncorrected model simulation.

Fig. 5. Rmse of corrected model simulation vs the size of the self-organizing feature map used to make the correction for NEE correction at Harvard Forest. Dotted lines represent ANNs trained and tested on model output generated using a single pareto point, dashed lines use a single default parameter set, and the solid line uses an ensemble of model runs using all five pareto points. Zero edge length (the y axis) is the uncorrected model simulation.

Fig. 6. (a), (b) Root-mean-square error; (c), (d) gradient of least squares regression of model and corrected model vs observed; and (e), (f) Pearson correlation coefficient, r², for latent heat at Cabauw and net ecosystem carbon exchange at Harvard Forest. The x axis represents the window averaging size; the shaded area around the lines represents the 95% confidence interval on the gradient estimates. Results are shown for ensemble model runs using pareto parameter sets; (a) and (b) additionally include default parameter set results.

Fig. 7. Average day fluxes for March, July, and November for (top) Cabauw latent heat flux and (bottom) Harvard Forest net ecosystem carbon flux. Corrections were made using a 32 × 32 node SOFM for Cabauw and a 12 × 12 node SOFM for Harvard Forest. Results are an average of an ensemble of pareto set model runs.

Fig. 8. Average monthly values for Harvard Forest net ecosystem carbon flux and Cabauw latent heat flux. Corrections were made using a 32 × 32 node SOFM for Cabauw and a 12 × 12 node SOFM for Harvard Forest. Results are an average of an ensemble of pareto set model runs.

Fig. A1. The decomposition
Table 1. Daily, weekly, monthly, and annual decrease in rmse for latent heat flux at Cabauw and NEE carbon flux at Harvard Forest due to the NERD correction. Results are shown for the correction utilizing all pareto-point runs (all) and default model parameter set runs (def). Annual values for Cabauw are omitted due to the brevity of the dataset.


Table 2. Correlation between model residual and each of the variables selected as ANN inputs. Values of the square root of the Pearson correlation coefficient, r, as well as P values (significance of the correlation) are shown for model runs before (“model”) and after (“corr”) NERD correction. Here P values less than 0.05 suggest significant correlation. Zero implies a value less than 10⁻¹⁰⁰. All pareto-point trained and tested ANNs were used with 32 × 32 and 12 × 12 node SOFMs for Cabauw latent heat flux and Harvard Forest CO2 flux, respectively.

