How Well Can Land-Surface Models Represent the Diurnal Cycle of Turbulent Heat Fluxes?

: The diurnal cycleof solarradiation represents the strongestenergeticforcingand dominatestheexchangeof heat and mass of the land surface with the atmosphere. This diurnal heat redistribution represents a core of land– atmosphere coupling that should be accurately represented in land surface models (LSMs), which are critical parts of weather and climate models. We employ a diagnostic model evaluation approach using a signature-based metric that describes the diurnal variation of heat ﬂuxes. The metric is obtained by decomposing the diurnal variation of surface heat ﬂuxes into their direct response and the phase lag to incoming solar radiation. We employ the output of 13 different LSMs driven with meteorological forcing of 20 FLUXNET sites (PLUMBER dataset). All LSMs show a poor representation of the evaporative fraction and thus the diurnal magnitude of the sensible and latent heat ﬂux under cloud-free conditions. In addition, we ﬁnd that the diurnal phase of both heat ﬂuxes is poorly represented. The best performing model only reproduces 33% of the evaluated evaporative conditions across the sites. The poor performance of the diurnal cycle of turbulent heat exchange appears to be linked to how models solve for the surface energy balance and redistribute heat into the subsurface. We conclude that a systematic evaluation of diurnal signatures is likely to help to improve the simulated diurnal cycle, better represent land–atmosphere interactions, and therefore improve simulations of the near-surface climate.


Introduction a. Background and motivation
Land surface models simulate distinct diurnal cycles of turbulent heat fluxes, but they also show systematic deviations from observations, which were reported in early (Henderson-Sellers et al. 1995;Chen et al. 1997) and more recent model intercomparison studies (Holtslag et al. 2013;Best et al. 2015).Best et al. (2015) used observational meteorological forcing to drive and evaluate state-of-the-art models at 20 different flux towers.A striking finding of Best et al. (2015) was that simple linear regression models with solar radiation as a predictor variable outcompeted all land surface models when evaluated with standard statistical metrics.The simple linear response seen in observations can be regarded as a signature of complex land-atmosphere interactions, which are known to simplify the response of turbulent fluxes at certain scales (Jarvis and McNaughton 1986;De Bruin and Holtslag 1982).Hence, these land surface models, which include many different process parameterizations, may not represent diurnal land-atmosphere interaction well.This is important because land-atmosphere feedbacks propagate to larger scales and may ultimately affect model sensitivity to global change (Miralles et al. 2019).Here, we employ a diagnostic model evaluation based on diurnal signatures of surface heat fluxes to quantify the performance of LSMs specifically for diurnal heat redistribution processes and point toward parameterizations that need to be improved.

b. Signatures of diurnal land-atmosphere exchange
The response of turbulent heat fluxes to solar radiation at the subdaily time scale represents a signature of heat redistribution Denotes content that is immediately available upon publication as open access.
processes and is at the core of local land-atmosphere interactions.The diurnal imbalance of solar heating of the surface is redistributed into the lower atmosphere and into the subsurface, which is well constrained by the energy balance of the whole system (Kleidon and Renner 2018).Heat redistribution into the lower atmosphere is achieved by turbulent heat transfer driving the atmospheric boundary layer development (Oke 1987), while subsurface heat redistribution includes the conduction and storage of heat in the soil and canopy.These processes are linked through the surface energy balance [in simplified form following Ohmura (2014)] where R sd is the incoming and R su the reflected solar radiation at the surface, R ld is downwelling longwave radiation, and H and lE are the turbulent heat fluxes of sensible and latent heat, respectively.The term dU s /dt summarizes heat storage fluxes at and into the subsurface such as the soil heat flux G or the change in biomass heat storage.The emission of longwave radiation from the surface s sT 4 on the right-hand side of Eq. ( 1) responds passively to the fluxes on the left-hand side.
Here, T refers to the surface temperature, s is the Stefan-Boltzmann constant, and s is the emissivity of the surface.While downward solar radiation can be regarded as an independent driver (at subdaily time scale), all other terms are influenced by surface properties and the exchange with the atmosphere.Hence, the resulting diurnal course of turbulent heat fluxes is an outcome of the response to the distinct diurnal course of solar radiation, influenced by landatmosphere interaction and biophysical properties of the surface.Although many studies focus on turbulent heat fluxes as key heat fluxes in terms of local land-atmosphere interaction (Best et al. 2015;Haughton et al. 2016;Zhou and Wang 2016;Dirmeyer et al. 2018), there has been relatively little focus on the diurnal signal itself.Exceptions are Wilson et al. (2003) and Nelson et al. (2018), who used the diurnal centroid as a measure of timing of turbulent heat fluxes, and Renner et al. (2019b), who analyzed the diurnal magnitude and phase lag of turbulent fluxes to solar radiation in observations and evapotranspiration schemes.
The diurnal magnitude of turbulent heat fluxes is related to the partitioning into sensible and latent fluxes, which is controlled by many processes, especially the redistribution of water (Dirmeyer et al. 2018).The phase lag of the turbulent heat fluxes to solar radiation is influenced by diurnal heat storage changes in the land-atmosphere system (Renner et al. 2019b).Although the absorption of solar radiation at the surface comprises a dominant diurnal forcing, the landatmosphere system appears to be well buffered (Taylor 2012) through the exchange of turbulent heat and radiation within the lower atmosphere and heat conduction with the subsurface (Kleidon and Renner 2018).Heat storage changes at some depth in the soil, within vegetation and the lower atmosphere are relatively slow and typically show large lags to solar radiation (Renner et al. 2019b).The heat fluxes that arise from the instantaneous surface heating by absorption of solar radiation redistribute heat into and interact with the larger heat reservoirs of the soil and the atmosphere.Land surface model parameterizations must resolve these interactions through the (i) surface energy balance solution (Viterbo and Beljaars 1995), and the subgrid-scale parameterizations of the (ii) surface or canopy conductance (Monteith 1965;Medlyn et al. 2011), (iii) the turbulent exchange in the surface layer (Monin and Obukhov 1954;Businger et al. 1971;Maronga and Reuder 2017), and (iv) the planetary boundary layer (Baklanov et al. 2011).These subgrid-scale parameterizations are critical for accurate simulation of the surface climate (Baklanov et al. 2011;Heidkamp et al. 2018), including heat and mass exchange with atmosphere.Therefore, an evaluation of the diurnal phase lag of simulated turbulent heat fluxes can identify important model deficiencies and point toward improvements.

c. Research questions and design
Here we aim to assess the performance of LSMs at the subdaily time scale when driven with meteorological observations.In particular, we address the open question of why the complex land surface models can be outcompeted by simple The variables plotted as a function of R sd .The variable Y 1 shows a linear relationship, but the lagged variable reveals a counterclockwise hysteresis loop.This variation can be decomposed into a direct response dY/dR sd and a second-order response of dY/(›R sd /›t) that can be expressed as a phase lag in time units.linear regression models and point toward key parameterizations of heat redistribution processes that can be improved given the information on observed diurnal signatures.
To assess the performance of current state of the art LSMs, we use the signature based metric of the phase lag of heat fluxes to solar radiation following Renner et al. (2019b).This method has important advantages to standard model observation evaluations.First, as a signature approach it does not evaluate the variability over time, but the response to a common reference variable such as solar radiation.Second, the approach effectively decomposes diurnal variability into two metrics whose statistical significance can be tested and which can be directly interpreted.We will employ the PLUMBER dataset, which was established by the Land Surface Model Benchmarking Evaluation Project published by Best et al. (2015).The PLUMBER dataset includes 13 different land surface models at 20 different FLUXNET sites.This dataset thus represents a wide range of observations and includes key land surface models used in climate and weather models.
In the following we will describe the metric and the sampling design.The results will first focus on the observed variations and controls of the phase lags and then report on the LSM performance.In the discussion we focus on the suitability of the phase lags as a signature and then diagnose which of the LSM components may cause the observed discrepancies.

a. Diurnal signatures
The diurnal course of solar radiation is a dominant forcing of the land-atmosphere exchange fluxes and has already been identified as the dominant forcing of turbulent heat fluxes over land (Kleidon and Renner 2013;Renner et al. 2016;Kleidon and Renner 2018).This motivated the development of two related metrics to evaluate the diurnal variation of surface energy balance components and near-surface atmospheric variables in reference to the variation in solar radiation (Renner et al. 2019b).This method is briefly explained next.
Timing differences between variables that exhibit large diurnal magnitudes can be difficult to discern in time series plots.However, when plotted against each other systematic deviations can be seen directly.This is illustrated for solar radiation R sd and two variables Y 1 and Y 2 in Fig. 1 as a time series plot in Fig. 1a and hysteresis loop plot in Fig. 1b.The example shows a variable Y 1 that has a different magnitude than R sd but is perfectly in phase with R sd .This results in a linear relationship as illustrated in Fig. 1b.In contrast, variable Y 2 has a time lag with respect to R sd .This time lag results in a counterclockwise hysteresis loop when plotted against R sd , see Fig. 1b.Such a time lag is difficult to identify from a time series plot.
The diurnal variation of variable Y(t) can be decomposed into a direct response to solar radiation and a response to its time derivative ›R sd (t)/›t: This form of regression is known as the Bernardi-Camuffo formula (Camuffo and Bernardi 1982).The regression coefficient b 1 corresponds to the direct response of variable Y to R sd .
The second coefficient b 2 reflects the width of the hysteresis loop, b 0 is the intercept, and (t) represents the residuals.The regression has the benefit that the significance of the coefficients can be statistically determined and tested.Assuming that both variables, Y and R sd , can be represented by harmonic functions, one can link the two coefficients b 1 and b 2 to obtain a measure of the phase lag in time units (Renner et al. 2019b): with the angular frequency v 5 (2p)/n d , where n d is the number of time steps per day.Equation (3) eliminates the units of the two variables Y and R sd and allows easy transformation into time units.This way one can compare the phase lag of different variables such as temperatures or heat fluxes (Renner et al. 2019b).

b. Data stratification
Conditional sampling of the data was applied to isolate diurnal heat exchange processes and to meaningfully compare the phase lag of models and observations.First, data were filtered for cloudfree days to avoid the influence of clouds on the diurnal cycle of turbulent fluxes.Second, the daily evaporative fraction was used to select days of similar evaporative conditions to avoid influences of biases from modeling of water redistribution processes.
Cloud-free days were determined by days that have more than 80% of the daily sum of surface clear-sky solar radiation R sd,cs .Half-hourly values of R sd,cs were determined using the method by Renner et al. (2019a).This method only requires measurements of observed total incoming shortwave radiation R sd and shortwave incoming shortwave radiation at the top of atmosphere R s,in as input.The latter can be calculated with information of location, date and time.The method estimates the fractional transmission of solar radiation under cloud-free conditions from monthly subsets of half-hourly data by quantile regression of observed R sd against the theoretical maximum that is set by R s,in .Using a quantile of 80% for the regression was found to yield the best agreement to the standard method of Long and Ackerman (2000), which requires additional data of diffuse radiation at a temporal resolution of minutes.Since such data are not commonly available, even at FLUXNET sites, we used the quantile regression approach of Renner et al. (2019a).Renner et al. (2019b) found that evaporative conditions can alter the phase lag of turbulent heat fluxes at a grassland site.Therefore, we avoid a comparison of the phase lag on a day-byday basis.Instead we compare the distribution of daily phase lag estimates for a given site and a given range of evaporative fraction (EF).Daily EF was estimated by linear regression of half-hourly values of lE(t) and H(t) for each day: The slope of this regression represents the daily evaporative fraction.This method is more robust than a simple ratio of daily average values of turbulent heat fluxes [EF 5 lE/(H 1 lE)] because it can handle a few outliers and its significance can be tested.
We then use the daily EF estimates and stratify the data into 10% bins of EF for each site.Since models and observations differ in the estimated EF (see section 4) there are different sample sizes between models and observations.We choose a minimum of five samples for further statistical analysis of the phase lag per EF bin and site.The agreement of the models was assessed with a test for the center of the distributions.Since the models should be able to reproduce the phase lag of both the sensible and the latent heat fluxes, we use the two-sample Hotelling T 2 test (Hotelling 1931).This is a two-dimensional Student's t test, which evaluates differences in the center of two linked distributions.We use the implementation ''rrcov::T2.test'' in the R programming language (R Core Team 2015) within the package ''rrcov'' (Todorov and Filzmoser 2009).We choose a minimum requirement of five samples, and we then count the frequency of nonrejections at a conservative significance level a 5 0.01 per site and EF bin to assess model agreement.

Data
The analysis is based on the PLUMBER dataset published by Best et al. (2015).It contains site observations from the FLUXNET La Thuile dataset (https://fluxnet.fluxdata.org/data/la-thuile-dataset/) and LSM simulations using the meteorological forcing of these sites.A brief overview is presented below, but the reader is referred to the PLUMBER publications for further detail (Best et al. 2015;Haughton et al. 2016).

a. Observations
PLUMBER contains 20 sites distributed across continents with different climate and land cover conditions, see the map in Fig. 2 and details in Table 1.For the analysis of the observations we removed data that did not pass quality control provided by the PLUMBER dataset (see Best et al. 2015).Further data screening for the present analysis revealed unrealistic values for R sd in some years and at some sites.In particular, daily maximum R sd showed drifts at sites Amplero, UniMich, Mopane, and Kruger, and these years where excluded from further analysis.Single outliers at some sites were set to missing.
The data have been filtered for a set of criteria.First, we only sampled cloud-free days that range between 93 days per year in Loobos and 246 days per year in Blodgett.Since the number of cloud-free days varies across sites this leads to a slightly imbalanced dataset.Second, we restricted the analysis to days with high potential solar radiation of more than 300 W m 22 .Further days were excluded that have fewer than 30 observations per day, show nonsignificant slopes in determining EF [Eq.( 4)] or nonsignificant slopes of sensible and latent heat fluxes to R sd [Eq.( 2)].We also exclude days where the Bernardi-Camuffo regression explains less than 50% of the diurnal variation in H and lE.These filter criteria help to exclude atypical conditions.After applying the filter criteria, the dataset contained between 83 days at Amplero (with one year of data) and 973 days at Blodgett (spanning 7 years of data), see Table 3.
The energy balance requires that the observed heat fluxes should be balanced by the net radiation budget.However, most eddy TABLE 2. Overview of the land surface models and reference models of the PLUMBER dataset, taken from Best et al. (2015) and Haughton et al. (2016).The column ''surface layer'' indicates if a skin layer approach or the top soil layer is used to solve for the surface energy balance.covariance measurement sites show a systematic underestimation of the turbulent heat fluxes (Foken 2008;Stoy et al. 2013) and an energy balance correction is often applied to the observed turbulent heat fluxes.The energy balance correction is strongly debated since (i) the magnitude of heat storage changes in the canopy and soil is difficult to assess (Moderow et al. 2009;Eshonkulov et al. 2019) and (ii) it is unclear how the gap should be partitioned between the sensible and the latent heat flux (e.g., Roo et al. 2018).For a broad assessment of the closure gap we used the Bowen ratio correction (Twine et al. 2000), where the energy balance closure gap Q gap 5 R n 2 (H 1 lE 1 dU s /dt) is partitioned to sensible and latent heat fluxes by the Bowen ratio B r .We used the daily EF estimates [Eq.( 4)] to correct the latent and sensible heat fluxes.However, note that the energy balance closure gap may itself show a distinct phase shift to solar radiation.This could easily be caused by errors in estimated heat storage.In this case the Bowen ratio correction will add this phase shift to the corrected turbulent heat fluxes.Therefore, the main part of the analysis of this paper focuses on the uncorrected values, which is also true for the previous publications using the PLUMBER dataset (Best et al. 2015;Haughton et al. 2016;Nearing et al. 2018).

b. Land surface models
The PLUMBER dataset contains output from eight different core models with multiple model versions for some such that 13 different LSMs are finally compared (see Table 2).All models were run with meteorological forcing data from each site including R sd , incoming longwave radiation R ld , near surface air temperature T a , wind speed u, relative humidity r H , and precipitation P, all at half-hourly resolution (Best et al. 2015;Nearing et al. 2018).The vegetation type, vegetation height and reference height of each site were used.Otherwise, default parameter values were used with soil parameters from internal routines.The models differ substantially in their structure and have thus different simulated output variables.Sensible H and latent heat lE fluxes where reported for all models, which is why we focus on these surface heat fluxes.For further interpretation in terms of surface energy balance, we also included net radiation balance R n and the soil heat flux of these models.In TABLE 3. Statistics of observed data for each site, with the number of years, the number of cloud-free days per year, and the number of days after filtering was applied.Also reported are the 10th, 50th, and 90th percentiles of the daily EF estimates.The median of the daily energy balance closure from a linear regression of the total turbulent heat fluxes against available energy is reported in column EB closure.The phase lags f are reported in minutes relative to incoming solar radiation.We report the median for the phase lag of net radiation and the energy balance closure gap Q Gap and the 10th, 50th, and 90th percentiles for the latent and the sensible heat fluxes.addition to the LSMs, Best et al. (2015) included benchmark approaches consisting of the physical benchmarks, the Manabe bucket model and the Penman-Monteith potential evapotranspiration (no water limitation), as well as three empirical benchmarks that include a simple linear regression to R sd , a multiple-linear regression with R sd and T a (2lin), and a cluster-based regression using relative humidity as a third variable (3km27).These regressions were established out-of-sample and not for each site specifically.

a. Evaporative conditions
The PLUMBER dataset provides a wide range of evaporative conditions, expressed by the daily EF [Eq.( 4)].The dataset contains rather dry sites like Mopane and Kruger with a median EF # 0.23, and rather wet sites like Palang and Elsaler with median EF 5 0.7.The wetter sites show less variation in EF on the order of 20%, while drier sites show variations on the order of 50%, see Table 3.
The fractional energy balance closure of the sites was obtained by linear regression of the sum of the sensible and latent heat fluxes against the available energy for each day.Available energy was derived from the difference of net radiation and soil heat flux data, when available.The median energy balance closure is on average 80%.However, it ranges between 55% at the site Elsaler2 (irrigated cropland) and 93% at the neighboring site Elsaler, which is classified as a wetland, see Table 3.
Next, we evaluate the LSM ability to predict the evaporative fraction for a given day, cut into 10% bins.We find very low agreement of about 20% for the LSMs, see Fig. 3.While the LSMs show seasonal variation in EF, the empirical benchmarks show rather constant EF across season and sites, since these have been obtained from an out-of-sample fit using the remaining 19 sites.The Penman-Monteith physical benchmark, which predicts potential evapotranspiration, clearly overestimates EF agreeing in only about 5% of all days in the filtered dataset.The poor agreement of simulated EF is directly related to the response of the sensible and latent heat fluxes to solar radiation, since the sum of the turbulent heat fluxes is strongly and almost linearly related to solar radiation (Kleidon and Renner 2018;Conte et al. 2019).Here, daily EF explains 79% of the slope of the latent heat flux to solar radiation in the observations.The response is strongly linked to the diurnal magnitude of the turbulent heat fluxes.The model error in EF and the error in the diurnal magnitudes of the latent heat flux (derived from the diurnal range) is strongly linearly related with an explained variance of 69% across all models.Therefore, the poor performance in EF is in line with the poor simulation of the diurnal magnitude of the sensible as well as the latent heat fluxes by the LSMs and the benchmarks.

b. Phase lag of observations
Phase lags of the sensible and latent heat fluxes (f H and f lE , respectively) were computed for each day and are evaluated for the filtered days, see Table 3.The Bernardi-Camuffo regression is able to explain the dominant variation of the diurnal cycle of the latent heat flux (average across all sites R 2 5 0.91, lowest median R 2 5 0.75 at Kruger) and the sensible heat flux (average across all sites R 2 5 0.95, lowest median R 2 5 0.91 at Howard and Hyytiala).This emphasizes the strong control of incoming solar radiation on turbulent heat fluxes and the statistical robustness of our signature approach.On average, the phase lag of H is about 15 min, while lE has a lag of 111 min relative to R sd .There is considerable variation of site median values and within each site.Generally, f H and f lE show different patterns whereby f lE is larger than f H for 14 out of 20 sites.Systematically larger phase lags of the sensible heat flux are found at Blodgett, Espirra, El-saler, Kruger, Tumba, and Mopane, all being rather dry ecosystems.
To investigate the effects of water availability on the diurnal turbulent heat exchange we separated the days by bins of 0.1 of the daily evaporative fraction.We also grouped the sites by the land-cover characteristics following the IGBP standards, see Fig. 4 for f lE and Fig. 5 for f H .We find that f lE shows a significant effect of wetness at open vegetation sites (grasslands, cropland, wetlands, and savannas; Fig. 4).Under dry conditions a smaller phase lag (around 0) exists, which increases with EF to about 20 min.For the site Espirra, which is classified as woody savanna, this effect of water availability is further enhanced.Interestingly, forest ecosystems do not show such a reduction of the phase lag f lE when dry (see Fig. 4) and remain rather stable with positive phase lags.The pattern for the phase lag of the sensible heat flux appears to be rather different, see Fig. 5.Under dry conditions (EF , 0.5) we find phase lags around 0 for most sites, except at Espirra and Blodgett, which show rather large positive phase lags at around 30 min.Under wetter conditions with EF .0.6, f H becomes negative at the grassland, cropland, wetland and broadleaf forest sites.At needleleaf forest sites f H remains relatively stable across different EF.
As a reference we also computed the phase lag of net radiation f Rn (Table 3).Net radiation comprises incoming and outgoing longwave as well as shortwave radiation and balances the turbulent heat fluxes and heat storage fluxes such as the ground heat flux.Usually, incoming shortwave radiation dominates the diurnal variation of R n , since longwave radiation has much smaller diurnal amplitude but typically with positive phase lags with respect to shortwave radiation (Renner et al. 2019b).Therefore, we expect f Rn ' 0. It appears that this is violated at sites Palang and Tumba, which show a lag of 30 min, most likely a timing offset in the radiation data.The timing of the turbulent heat fluxes, however, appears to be correct.

c. model performance with phase lags
The Bernardi-Camuffo regression also explains a large part of the diurnal cycle of heat fluxes simulated by the LSMs.To quantify the performance of the models to reproduce the observed phase lags of the turbulent fluxes, we evaluated the distribution of phase lags of both the sensible and the latent heat flux for a given site and bin of EF.Model agreement was assessed with the two-sample Hotelling's T 2 test at a significance level a 5 0.01. Figure 6 shows the frequency of agreement for each model aggregated across all sites and bins of EF.The LSMs are ordered by agreement, with the highest performance of the Mosaic model with 33% agreement.Models with more than 15% agreement are the versions of the JULES models, CABLE and CHTESSEL.The empirical benchmark with three input variables (3km27), including humidity, shows relatively good performance, while the other two empirical benchmarks show no agreement since these have no lag by definition (1lin) or very small lags and hence a very small variation.A range of LSMs show rather poor agreement and fall between that of the physical model benchmarks, namely, the Penman-Monteith and Manabe bucket models.This includes all the Noah3.2LSM, ORCHIDEE, and COLA-SSiB models as well as the two versions of the ISBA LSM.
The evaluation of the models was done with the observed turbulent heat fluxes without the energy balance correction.Applying a correction relies on an accurate estimation of the available energy and heat storage fluxes.As a preliminary assessment of whether energy balance played a role in our results, we reevaluated the models with the Bowen ratio correction of the turbulent heat fluxes, using site net radiation and soil heat flux estimates.Other heat storage terms could not be included, since the observational dataset does not contain sufficient data.It appears that the closure gap itself showed large phase lags at many sites (see Table 3).Such a phase lag in the closure gap changes the diurnal course of the corrected fluxes.In some cases this resulted in significantly different phase lags of both fluxes.However, we find that for most models the agreement with corrected observations is much lower (see Fig. 6).Hence we will focus on the uncorrected fluxes in the following.
The ability to reproduce phase lags at the site level is rather mixed.We ordered the performance per model and site in a model performance matrix, shown in Fig. 7. Some sites are easier to predict like Amplero and Kruger, while only the Mosaic model achieves some agreement at the sites Tumba and Blodgett.
Next, we evaluated whether the models tend to over or underestimate the observed phase lags.Therefore, we assessed the mean difference in phase lags for each site and evaporative conditions.The boxplot in Fig. 8 shows that all LSMs tend to overestimate f lE .The median bias ranges between 110 and 130 min for the poorer models.The three empirical benchmarks underestimate both f lE and f H , because solar radiation is used as the key predictor variable and there is no energy balance constraint.The best models (Mosaic, CABLE, and the two JULES versions), show a large variation, but no overall bias in f H .The models with a medium rank in the T 2 statistics (CHTESSEL, CABLE-SLI, the Noah LSMs) show systematic underestimation of f H . COLA-SSiB overestimates both f lE and f H .The LSMs with poorest ranks show systematically large f lE on the order of FIG. 6. Ranking of model performance for phase lags of sensible and latent heat by frequency of agreement across all sites and evaporative conditions.The agreement is tested by a Hotelling's T 2 test based on the distribution of phase lags of both f H and f lE for a given site and 10% bin of EF.Dark blue bars show the statistics without correction for energy balance in observations and light blue shows with Bowen ratio correction.30 min, whereby the ISBA models also show a systematic overestimation of the sensible heat flux.
We also evaluated the phase lag of the soil heat flux, which shows large discrepancies with the observed values (see Fig. 8).All models show a negative phase lag, i.e., higher values during the morning than in the afternoon.The differences are on the order of 90 min and can be as large as 3 h.A bias in the phase lag of the modeled soil heat flux will also influence the turbulent heat fluxes in a LSM because of the surface energy balance constraint.The effect on the turbulent fluxes, however, is only large when the magnitude of the soil heat flux is also large.Therefore, we assessed the diurnal magnitude of the soil heat fluxes, through the daily range between the minimum and maximum soil heat flux.The results are shown in aggregated form for short savanna, and forest in Fig. 9.This shows that the observed soil heat fluxes are much smaller than the modeled fluxes at forest and short vegetation sites.For savannas there is no clear systematic bias across all the models.Hence, the modeled soil heat fluxes have rather large magnitudes and a different sign for the phase lags, which reveals a potential misrepresentation of heat storage changes.

d. Diurnal composites
To illustrate the performance of models at the diurnal cycle we classified the sites into short vegetation (three sites) and forest vegetation (four sites).Then hourly composites of the data were computed using the median for a given bin of EF (bin width 0.1).In Fig. 10 we show examples of these composites for two poor performing models and two relatively well performing models in short vegetation and forests.The figure shows the latent, sensible, and soil heat fluxes against the observed solar radiation.Under dry conditions at sites with short vegetation, the observations show a rather linear response of the turbulent heat fluxes without a distinct hysteresis (Figs.10a,c).Only the soil heat flux shows an anticlockwise hysteresis that is smaller in the morning than in the afternoon, which corresponds to a positive phase lag.
Figure 10a shows the corresponding results for CABLE, which reveals a much too large anticlockwise hysteresis of the sensible heat flux while the almost linear response of the latent heat flux corresponds well to the observations.The large hysteresis of the sensible heat flux is linked to clockwise hysteresis of the soil heat flux, which is higher in the morning than in the afternoon.The discrepancies of the phase lags of H are linked to the high diurnal magnitude of the soil heat flux of CABLE (Fig. 9) and its negative phase lag (Fig. 8).Results for the JULES model are shown in Fig. 10c for short vegetation under dry conditions.The JULES model exhibits a very small hysteresis of the heat fluxes and corresponds much better to observations.Only the phase lag of the latent heat flux is slightly overestimated.
As a further example, we chose forests sites under ample moisture conditions (Figs. 10b,d).Observations show rather large hysteresis loops for the turbulent heat fluxes, where the latent heat flux shows a pronounced anticlockwise hysteresis, which is compensated by a smaller clockwise hysteresis of the sensible heat flux.Observations of the soil heat flux in forests show very small values both in magnitude and in terms of phase lag.However, there is no assessment of the heat storage terms FIG. 7. Model performance matrix of phase lags of sensible and latent heat per model and site.Models are ordered by their overall performance across sites.Sites are ordered by their relative agreement from high at the left to low at the right side.The color represents the relative agreement between nonrejected Hotelling's T 2 tests and the number of tests applied per site.The total number of tests differs between models and sites.This is because the EF of a model may have no agreement with the EF in observations, e.g., Penman-Monteith at the dry site Kruger.These missing conditions are shown with white color.
in the biomass, which can be quite large (Meier et al. 2019).Figure 10b shows the ISBA-dif model that exhibits a large clockwise hysteresis and magnitude of the soil heat flux, a common issue of this model (Figs. 8 and 9).For these reasons the ISBA-dif model produces very large hysteresis loops for the turbulent heat fluxes.Their direction, however, corresponds well with the observations.Figure 10d shows the ORCHIDEE LSM, which better reproduces the diurnal patterns.This is related to a smaller hysteresis loop of the soil heat flux, see also Fig. 8. ORCHIDEE, however, consistently overestimates the hysteresis of the latent heat flux, which results in poor agreement when evaluated at the site level, see Fig. 7.

Discussion
We employed the approach of diagnostic signatures (Gupta et al. 2008) to evaluate complex models by a typical behavior that can be quantified both from observations and models.As diagnostic signature we employed the metric of a phase lag of heat fluxes to solar radiation at the subdaily time scale.In the following we discuss the typical signatures obtained from observations, and discuss the LSM performance to diagnose which process representations need to improved.

a. Phase lag in observations
We focused on the diurnal variation of the turbulent heat fluxes that were obtained from eddy covariance observations from the FLUXNET La-Thuile synthesis dataset.The diurnal variation of turbulent heat fluxes can be rather well captured by its direct response to solar radiation and its time derivative, explaining more than 90% of subdaily variation under the cloud-free, summertime conditions evaluated here.This emphasizes the strong control of solar radiation for landatmosphere exchange but also highlights the presence of heat storage effects that are captured by the phase lags.
Both can be well identified on a daily basis.Our results further show significant temporal variation in the phase lags as well as differences between sites.Still we find convergence of phase lag behavior within broad vegetation categories, consisting of forests, savannas and short vegetation.Short vegetated sites show a significant influence of water limitation on the diurnal course of the turbulent fluxes.Under ample moisture supply and high EF there is a positive phase lag in the latent heat flux and a tendency for negative phase lags of the sensible heat flux.Under drier conditions the positive phase lag of the latent heat flux is reduced toward zero.This behavior is consistent with a recent analysis of turbulent heat fluxes at a grassland site by Renner et al. (2019b), who showed a reduction of phase lags during a dry-down.
The savanna type ecosystems show an enhanced response of f lE to EF with negative phase lags under very dry conditions (EF , 0.2).This behavior is in agreement with Nelson et al. (2018), who used the method of diurnal centroids (Wilson et al. 2003) on the FLUXNET2015 dataset.The forest ecosystems evaluated here, however, show no response of f lE to EF. Needleleaf forests also show no response of f H to EF, while broadleaf forests show a decline toward negative f H under high EF.
The positive f lE can be regarded as a response to the diurnal cycle of the vapor pressure deficit (VPD) of the air as a driver of evapotranspiration.The diurnal cycle of VPD is strongly related to air temperature and both variables show large positive phase lags with respect to solar radiation.This is due to the heat storage changes in the lower, well mixed atmosphere.Hence we argue that VPD, which is highest during early afternoon, enhances lE leading to a positive f lE (Renner et al. 2019b).This explanation is in line with the phase lag observations in forests and under wet conditions in short vegetation and savannas.The significant reduction of f lE in short vegetation and savannas toward water limited conditions can be related to the enhanced surface resistance to evapotranspiration leading to a decoupling of the surface from the atmosphere (McNaughton and Jarvis 1983).As a result, surface skin temperature increases more strongly than air temperature (Mildrexler et al. 2011;Panwar et al. 2019), which also enhances subsurface heat storage changes.This decoupling is typical for sites with short vegetation or an open canopy where the ground is not sufficiently shaded from solar radiation.Negative phase lags of lE observed under very dry conditions could also be related to water saving strategies of plants that control gas exchange by stomata and close under very high VPD conditions leading to a morning shift of lE (Nelson et al. 2018).
The pattern of the phase lag of the sensible heat flux is strongly coupled to the phase lag of the latent heat flux.For most sites there is only a small phase lag of the sensible heat flux and this explains the good performance of the simple linear regression benchmark (1lin) used by Best et al. (2015).The shift to negative phase lags under wet conditions can be explained by a compensation due to the positive phase lags of the latent heat flux to keep the surface energy balance.A deviation of this explanation is found for Blodgett and Espirra that show relatively large positive phase lags of the sensible heat flux throughout.Both sites have a relatively open canopy (Blodgett was reforested at that time, Misson et al. 2005) and we speculate that these patterns are induced by relatively large diurnal soil heat fluxes.

b. Energy balance closure
One of the key assumptions of this work is that the turbulence measurements correctly capture the diurnal cycle of the sensible and latent heat fluxes.The sites studied in this analysis achieve an energy balance closure between 55% and 93% of available energy (R n 2 dU s /dt), assuming that reported soil heat fluxes capture heat storage components.The estimated energy balance closure gap revealed significant phase shifts for many sites (Table 3), with the implication that the phase lag of the corrected fluxes also changed.Comparison with the modeled phase lags revealed an even poorer comparison than without correction (Fig. 6).We believe that the unaccounted heat storage in the subsurface and canopy are a crucial limitation of the energy balance correction.These heat storage components, such as in the upper soil layer, can have significant diurnal magnitudes and phase shifts.Therefore, the diurnal cycle and especially the phase lag of the corrected heat fluxes may not be reliable.On the other hand, we have no reason to believe that there are systematic measurement errors with a time lag in eddy covariance measurements, since these are based on the instantaneous turbulent fluctuations of wind speed, temperature and moisture.
Given the lack of further data to calculate these heat storage components, we cannot further assess this issue within the scope of this paper.However, we strongly recommend a detailed measurement of subsurface and canopy heat storage components, which are important for the diurnal heat redistribution (Moderow et al. 2009;Gao et al. 2017;Heidkamp et al. 2018;Meier et al. 2019;Eshonkulov et al. 2019).Furthermore, alternative measurement technologies should be used in parallel to assess the validity of the diurnal cycle of eddy covariance turbulence measurements.

c. Diurnal cycle of simulated heat fluxes
All LSMs and benchmarks have been driven with meteorological forcing data specific to each site.Prescribing near surface meteorological states such as air temperature and humidity strongly constrains the model response and thus focusses on the partitioning of the surface energy balance terms.All LSMs and benchmarks show large discrepancies in the simulated magnitude of the diurnal cycle of the sensible and latent heat fluxes.This is apparent from the poor simulation of evaporative fraction (Fig. 3), which is strongly influenced by water redistribution processes and the large memory of soil moisture.The poor performance of the PLUMBER models in terms of EF was noted by Ukkola et al. (2016) and is also a key issue in coupled model simulations and reanalysis (Dirmeyer et al. 2018).Therefore, we focused on the phase lag of turbulent heat fluxes, which is influenced by the LSM parameterizations that control heat redistribution.
Common to all models is a tendency to overestimate the phase lag of f lE , while there is no common bias for the sensible heat flux, see Fig. 8.When we evaluated both phase lags, the model performance was rather low (Figs.6 and 7).Hence, although the models evaluated in PLUMBER show distinct diurnal cycles, these are shifted in time and thus do not capture heat redistribution processes adequately.
Although there is incomplete observational data on the soil heat storage for the sites in the PLUMBER dataset, we find very large discrepancies for diurnal magnitude and the phase lag of the soil heat In particular, the models show a negative phase lag of the soil heat flux, with higher values during the morning than in the afternoon.This does not correspond with the observations, which show positive phase lags.Consequently, the phase lag in the soil heat flux is compensated by the turbulent heat fluxes since net radiation does not show large phase lags, see Table 3.In addition, the models overestimate the diurnal magnitude of the soil heat flux compared to the observations.A higher magnitude of the soil heat flux then enhances this effect on the turbulent fluxes.This is readily seen, for example, for the CABLE model in Fig. 10a.

d. Model diagnostics with phase lags
In general the diurnal phase lag signature will be sensitive to the following parameterizations of LSMs: (i) the method to solve the surface energy balance, (ii) the dynamic parameterization of surface/canopy conductance, and (iii) the parameterization of surface layer exchange of heat, moisture and momentum.Since the soil heat flux seems to be the main cause for the observed biases we believe that the way in which the surface energy balance is solved is of greatest importance to improve the modeled diurnal cycles.

1) SURFACE ENERGY BALANCE SOLUTION
The surface energy balance is the central component of any LSM.It links all different parameterizations particularly the transfer of heat into the subsurface and into the atmosphere.This partitioning is critical for the phase lags as it relates the strong diurnal variation of solar radiation to storage changes below and above the surface.
The key state variable is surface (skin) temperature, which is often part of the other parameterizations such as for soil heat transfer but also for canopy and surface layer exchange.There are two different approaches to solve this nonlinear problem.One way is to use the temperature of the upper soil layer.This layer has a large heat capacity and dampens the temporal variation of the surface temperature depending on the depth of the layer.The second approach is to use a skin layer that has no heat capacity but requires iteration to get a stable solution (Viterbo and Beljaars 1995;Best et al. 2005;Heidkamp et al. 2018).

2) TOP SOIL VERSUS SKIN LAYER APPROACH
Next we illustrate the impact of surface energy balance schemes on the phase lags of the heat fluxes using output of two different versions of the JSBACH model, which is the LSM scheme of the ECHAM climate model.Heidkamp et al. (2018) used the JSBACH model in an offline mode and simulated the diurnal cycle with the reference version of the JSBACH model (i.e., with upper soil layer to solve the surface energy balance) and the new version with a skin layer (JSBACH-Skin).For illustration Heidkamp et al. (2018) used observational data from the Diurnal Land/Atmosphere Coupling Experiment (DICE, http://appconv.metoffice.com/dice/dice.html)project.This is based on the well-known CASES99 campaign at the Southern Great Plains sites in the United States over crop land.These simulations are ideal to illustrate the impact of these schemes on the phase lags of the heat fluxes.We obtained the simulation output data from Heidkamp et al. (2018) and compared the hysteresis patterns.Figure 11a shows the hysteresis on a single day for the observations and the reference model.The observations reveal the linear response of the turbulent heat fluxes under dry conditions (EF 5 0.17) as seen in the PLUMBER dataset (Fig. 10a).The JSBACH-Reference model, however, shows a large phase lag in the simulated sensible heat flux as well as in the soil heat flux.In contrast, the new model version with a skin layer does not show such large hysteresis of any fluxes and corresponds much better to observations, see Fig. 11b.This highlights the impact of how the surface energy balance is solved on the phase lag of heat fluxes.
Although Heidkamp et al. (2018) show that the skin layer approach is superior for the JSBACH model, we do not find that PLUMBER models that use a skin layer approach (CHTESSEL, COLA, Noah LSMs) are generally superior to models that use the top soil layer (see Table 2).Figure 10 highlights that the JULES model performs well on average even though it employs a top soil layer approach.However, JULES uses a soil heat transfer scheme in which the vertical layers are optimized to account for the diurnal to annual frequencies of the atmospheric forcings (Best et al. 2005(Best et al. , 2011)).The optimized vertical discretization scheme includes four layers with thinner layers on top and is based on solutions of the soil heat diffusion equation (Best et al. 2005).

DYNAMICS
To illustrate how biases in the diurnal cycle of turbulent heat fluxes propagate when coupled to the atmosphere we used the two cases of Heidkamp et al. (2018) and looked at what these imply for the diurnal development of the convective planetary boundary layer.Using a simple bulk mixed layer scheme we simulated the growth of the boundary layer given the sensible heat flux as an input.Example results are shown for the two days of the DICE dataset, for which the height of the boundary layer was estimated by radio soundings (Steeneveld et al. 2008).Figure 12 shows the sensible heat fluxes whereby the JSBACH-Reference model has a damped diurnal cycle and lags the simulated growth of the PBL compared to JSBACH-Skin.In contrast the simulated PBL growth for the observed and the simulated sensible heat flux from JSBACH-Skin agree well with the observed height.
This case study highlights the importance of how and where the diurnal forcing of solar radiation is buffered in the landatmosphere system.Misrepresentation of heat storage changes, such as in the subsurface, directly affect the exchange of heat and water with the atmosphere.Systematic offsets cannot only influence the diurnal cycle of heat fluxes, but also affect temperature and will propagate into the lower atmosphere in coupled models (Heidkamp et al. 2018).Offsets in boundary layer height affect the mixing of heat, momentum and moisture and potentially the timing of the development of boundary layer clouds (de Arellano et al. 2014).While modeling these interactive processes is a challenge, the strong local coupling of the land-atmosphere system also leads to simple emergent relationships such as the almost linear response of the sensible heat flux to solar radiation, which can serve as a valuable signature to evaluate landatmosphere feedbacks represented in complex models.

Conclusions
We performed a diagnostic signature based analysis of stateof-the-art land surface models.We focused on the simulated diurnal cycle of the turbulent heat fluxes and used the phase lag to solar radiation as a metric to quantify the model's ability to resolve a signature that is characteristic of the diurnal cycle.The evaluation is based on a set of 20 FLUXNET sites with diverse climatic and land surface conditions.As known from previous assessments the models show a poor ability to resolve the partitioning of sensible and latent heat fluxes, which directly affects the diurnal magnitude of these fluxes.This issue is related to difficulties in simulating water redistribution processes.Hence we focus on the phase lag of the heat fluxes, which is influenced by diurnal heat redistribution processes and found to be influenced by vegetation type and water limitation.We find that all models tend to overestimate the phase lag of the latent heat flux, combined with varying biases in the sensible heat flux.These biases appear to be systematic and thus provide a further reason for the poor performance of the PLUMBER LSMs found in earlier studies (Best et al. 2015;Haughton et al. 2016;Nearing et al. 2018).
Modeled soil heat fluxes have much larger diurnal magnitudes than the available observations and their phase lags to solar radiation are strongly biased.These larger diurnal fluxes and large phase lags propagate to the phase lag of turbulent heat fluxes.We argue that this is related to issues in jointly solving the surface energy balance and soil heat transfer.
Our findings have important implications: 1) the models are not able to accurately capture diurnal heat exchange processes that are at the core of land-atmosphere interactions.The biases will propagate into the lower atmosphere when coupled and to longer time scales when aggregated. 2) The phase lags of the turbulent heat fluxes appear to be quite sensitive to biases FIG.12. Sensible heat fluxes (continuous lines) and derived planetary boundary layer height (dashed) for two days of the DICE experiment.Sensible heat flux data are obtained from Heidkamp et al. (2018), (http://appconv.metoffice.com/dice/dice.html),and radio-sounding data of PBL height is taken from Steeneveld et al. (2008).
induced by the surface energy balance scheme including subsurface heat transfer.3) After resolving these issues, we expect that the diurnal signature of the phase lag can help to identify and calibrate important subgrid-scale parameterizations of the canopy and surface layer, enabling the use of the large set of available data at the diurnal time scale to better constrain complex model parameterizations.

FIG. 1 .
FIG. 1. Illustration of the signature-based metric of the phase lag.(a) A diurnal cycle of solar radiation R sd and two variables; Y 1 is in phase with R sd while Y 2 is lagging R sd .(b) The variables plotted as a function of R sd .The variable Y 1 shows a linear relationship, but the lagged variable reveals a counterclockwise hysteresis loop.This variation can be decomposed into a direct response dY/dR sd and a second-order response of dY/(›R sd /›t) that can be expressed as a phase lag in time units.

FIG. 4 .
FIG. 4. Boxplot of observed daily phase lags of the latent heat flux lE for different EF bins and aggregated by land-cover classes.Small numbers at the bottom of each panel show the number of samples per bin.

FIG. 5 .
FIG. 5. Boxplot of daily phase lags of sensible heat fluxes H for different EF and aggregated by IGBP land-use classes.

FIG. 8 .
FIG. 8. Average difference of the simulated minus observed phase lags for the latent, sensible, and soil heat fluxes.Data are based on the average per EF bin and site.Note that not all sites have soil heat flux observations.

FIG. 9 .
FIG. 9. Distribution of observed and modeled diurnal range of the soil heat fluxes of the filtered data, aggregated by sites belonging to broad land-cover groups.

FIG. 10 .
FIG. 10.Model to observation comparison based on diurnal composites of median diurnal cycles.Composites are derived for a specific bin of EF aggregated over a set of sites (a),(c) with short vegetation under dry conditions and (b),(d) for forested sites under moist conditions.Panels (a) and (b) show typical LSMs, which strongly overestimate the phase lag of turbulent heat fluxes.Panels (c) and (d) show better performing models.Arrows show the rising flux during morning hours.

FIG. 11 .
FIG. 11.Simulations and observations of turbulent heat fluxes during DICE.(a) Simulations with the reference JSBACH land surface model using the upper soil layer to solve for the surface energy balance.(b) The new version of the JSBACH model with a Skin layer.Data are obtained from Heidkamp et al. (2018) and appconv.metoffice.com/dice/dice.html).

TABLE 1 .
FLUXNET sites used in this study.Column names are the site codes used throughout the manuscript, the respective FLUXNET ID, duration of data used in this study, longitude and latitude, elevation above sea level, the Köppen climate classification, and vegetation cover following IGBP.Annual average air temperature and precipitation are also reported.
Site code n yr n yr,cf n filter EF q10 EF q50 EF q90 EB closure f Rn f lE,q10 f lE,q50 f lE,q90 f H,q10 f H,q50 f H,q90 f Qgap,q50 FIG.3.Bar plot of the number of days matching the observed evaporative fraction within the filtered dataset.Green bars show the number of days with correct evaporative fraction, light blue refers to too dry (low EF), and dark blue too wet model simulation.A bin width of 0.1 was used to evaluate if a model is correct, too low, or too high.