Constraining a System of Interacting Parameterizations through Multiple-Parameter Evaluation: Tracing a Compensating Error between Cloud Vertical Structure and Cloud Overlap

This study explores the opportunities created by subjecting a system of interacting fast-acting parameterizations to long-term single-column model evaluation against multiple independent measurements at a permanent meteorological site. It is argued that constraining the system at multiple key points facilitates the tracing and identification of compensating errors between individual parametric components. The extended time range of the evaluation helps to enhance the statistical significance and representativeness of the single-column model result, which facilitates the attribution of model behavior as diagnosed in a general circulation model to its subgrid parameterizations. At the same time, the high model transparency and computational efficiency typical of single-column modeling are preserved. The method is illustrated by investigating the impact of a model change in the Regional Atmospheric Climate Model (RACMO) on the representation of the coupled boundary layer–soil system at the Cabauw meteorological site in the Netherlands. A set of 12 relevant variables is defined that covers all involved processes, including cloud structure and amplitude, radiative transfer, the surface energy budget, and the thermodynamic state of the soil and of the lower atmosphere at various heights. These variables are either routinely measured at the Cabauw site or are obtained from continuous large-eddy simulation at that site. This 12-point check proves effective in revealing a compensating error between cloud structure and radiative transfer, residing in the cloud overlap assumption. In this exercise, conditional sampling proves a valuable tool for establishing which cloud regime is most strongly affected.


Introduction
Clouds significantly affect the earth's climate, in large part through their impact on the transfer of solar and thermal radiation (Ramanathan 1987). Clouds are often generated by processes that act on spatial and temporal scales much smaller than the discretization scales of general circulation models (GCMs); as a consequence, their impact has to be represented through parametrization. Great variety exists among the suites of sub-grid parameterizations in the various present-day operational GCMs. On the one hand, this reflects the decades-long history of the scientific research behind their formulation. On the other hand, it reflects the considerable complexity of such parameterization schemes, which typically consist of many individual parametric functions, each representing an observed statistical relation between one quantity and another. This complexity brings some considerable risks. The first is a lack of transparency: the interaction between the many parametric components is often not fully understood, which might result in unexpected behavior or instability in the model. Another risk is that of introducing so-called compensating errors between parametric components. These are situations in which a structural error in one component is erroneously compensated by another. In a changing future climate, in which each process might act differently, there is no guarantee that such an artificial correction will still hold. A further side-effect of compensating errors is that improving one parametric component does not guarantee an improvement in overall GCM performance.
The complexity of parametrization schemes has hampered progress in recent decades (e.g. Randall and Co-Authors 2003b). To address this issue, the evaluation of parametrization schemes against relevant measurements has been an active field of research (Browning 1993; Randall and Co-Authors 2003a), as testified by the numerous model inter-comparison studies at process level (e.g. Stevens and Co-authors 2001; Brown and Co-authors 2002; Siebesma and Co-authors 2003; Zanten and Co-Authors 2011). These initiatives have succeeded in providing the modeling community with benchmark results for certain controlled situations or regimes that any model has to reproduce, thus acting as a "testing ground" for new or existing parameterizations. However, some problems remain. It can be argued that the available idealized cases are too few to guarantee representativeness. Also, to properly constrain a complex interacting system with measurements (and thus identify any existing compensating errors), the representation of each process should be individually confronted with relevant, independent measurements. In practice this has proven hard to realize: either the required multitude of independent measurements was not available, or the measurements did not cover sufficiently long time periods to ensure statistical significance of the evaluation result.
How to move forward? In the evaluation of GCMs at process level there has been a recent drive towards a more comprehensive approach, attempting to involve many more observational datasets and to ensure better guidance by the GCM (e.g. Jakob 2010). In a companion paper to this study (Neggers et al. 2012, hereafter referred to as NSH12) a new strategy is proposed that consists of the continuous, long-term simulation and evaluation of Single-Column Models (SCMs) against a multitude of independently measured parameters at meteorological super-sites. The purpose of this strategy is twofold:
• To constrain the system of interacting parameterizations at multiple key points with independent measurements;
• To make a statistically significant assessment of model performance.
By adopting this strategy we hope to improve the detection of compensating errors, and to make the SCM result more representative of its native GCM, while at the same time maintaining the proven benefits of single-case studies (such as model transparency). This study is dedicated to illustrating the approach adopted by NSH12. As an example we use the implementation of a new, integrated scheme for boundary-layer transport and clouds in the Regional Atmospheric Climate Model (RACMO). Both the new and the control version will be evaluated at process level against multiple measurements at the Cabauw meteorological site in the Netherlands, to investigate whether the new scheme improves the representation of the local cloud-radiative model climate. Results of continuous Large-Eddy Simulation (LES) at Cabauw will be used to supplement the observational datasets with information on boundary-layer cloud structure for which measurements are unavailable. We will demonstrate how multiple-parameter evaluation at twelve key points allows tracing the impact of a change in cloud representation throughout the coupled boundary-layer/soil system, and how this technique ultimately proves effective in revealing a compensating error in the interaction between clouds and radiation.
Section 2 contains a brief introduction to the Cabauw site and its measurements, as well as detailed descriptions of the model setup and the applied research method. The results of the multi-year evaluation are presented in Section 3, including a summary of model performance using simple statistical metrics, a conditional sampling exercise to determine the cloud regime of interest, an evaluation against LES results, and the impact of a model improvement inspired by these results. Finally, in Section 4 the main conclusions are summarized and their implications briefly discussed.

Method

a. The site
The Cabauw Experimental Site for Atmospheric Research (CESAR) is situated in a flat grassland area near the small village of Cabauw in the Netherlands. The site has been operated by the Royal Netherlands Meteorological Institute (KNMI) since 1973. Its main asset is the 213 m tower, equipped at regular intervals with sensors for atmospheric boundary-layer research, air pollution studies and climate monitoring (e.g. Driedonks et al. 1978; Ulden and Wieringa 1996). In addition, an array of continuously operating instruments is installed at the site, including both in-situ and remote-sensing equipment (described in detail by Russchenberg and Co-authors 2005). The Cabauw site participates in the CloudNet project (Illingworth and Co-authors 2007). All Cabauw data used in this study are publicly accessible online at http://www.cesar-database.nl/.

b. Model setup
Two versions of the RACMO SCM are evaluated in this study, each with a different boundary-layer scheme. The first is the scheme as implemented in the operational RACMO, which is identical to that of model cycle 31R1 of the Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF). This scheme will be referred to as the "control" version. The second version is the Eddy-Diffusivity Mass-Flux (EDMF) scheme including the Dual Mass Flux framework (DualM; Neggers et al. 2009a; Neggers 2009b).
The two model versions will be simulated and evaluated at Cabauw for the period covering 2007-2010.
The setup of the RACMO SCM is described in detail by NSH12, but is briefly summarized here. For each day within the selected time period, a simulation of 72 hours duration is performed, initialized from the ECMWF analysis at 12 UTC at Cabauw. The simulations are performed on the L91 vertical grid, using an integration time step of 900 s. As usual in SCM experiments, the tendencies due to advection by the larger-scale circulation are prescribed. These are obtained from short-term forecasts with the RACMO 3D model covering the same time window. In addition, continuous relaxation is applied towards the state variables {U, V, T, q} as sampled from the same RACMO 3D forecasts, using a synoptic time-scale of 6 hours. This time-scale is chosen such that the relaxation prevents excessive model drift, but still allows the faster-acting parametrized physics to create their own unique state or "fingerprint". The soil scheme is initialized from the RACMO 3D state but evolves freely during the simulation, interacting with the atmosphere through freely evolving surface fluxes (including radiation, sensible and latent heat, and precipitation). The evaluation is limited to the first complete day in the staggered simulations, which corresponds to hours 12-36 after initialization. This time window gives the models a 12-hour spin-up time, but is still sufficiently close to the initialization time (ECMWF analysis) to minimize uncertainties in the applied forcing and relaxation.
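The two time-related ingredients of this setup, the relaxation towards the 3D forecast and the selection of hours 12-36 from each staggered run, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RACMO code: the relaxation is written in the generic Newtonian form -(φ - φ_ref)/τ, φ_ref stands for the RACMO 3D forecast value, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Relaxation time-scale quoted in the text (6 hours), in seconds.
TAU_RELAX = 6 * 3600.0


def relaxation_tendency(phi, phi_ref, tau=TAU_RELAX):
    """Generic Newtonian relaxation tendency (units of phi per second).

    Pulls the SCM state phi towards a reference value phi_ref, here assumed
    to be the RACMO 3D forecast at the same time and level.
    """
    return -(phi - phi_ref) / tau


def evaluation_window(init_time):
    """Evaluated part of one staggered run: hours 12-36 after initialization."""
    return init_time + timedelta(hours=12), init_time + timedelta(hours=36)


# A run initialized at 12 UTC on 1 June 2008 is evaluated over 2 June (00-24 UTC).
start, end = evaluation_window(datetime(2008, 6, 1, 12))
print(start, end)  # 2008-06-02 00:00:00 2008-06-03 00:00:00

# An SCM temperature 1 K above the reference relaxes at -1/6 K per hour.
print(round(relaxation_tendency(291.0, 290.0) * 3600.0, 3))  # -0.167
```

Because a new run is launched every day and each contributes only its hours 12-36, the evaluated windows tile the calendar without gaps or overlap.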
The LES results used in this study are generated with the Dutch Atmospheric Large-Eddy Simulation model (DALES; Heus and Co-authors 2010). The LES is initialized and forced in exactly the same way as the SCMs, so that their inter-comparison remains meaningful. The spatial resolution of the LES is 100 m in the horizontal and 40 m in the vertical, in a domain 6.4 × 6.4 km wide and 6 km high, with a time step of 2 s. In this domain and at these resolutions the turbulence, convection and associated clouds in the atmospheric boundary layer can be expected to be reasonably well resolved. As part of the KNMI Parametrization Testbed (KPT, NSH12) the DALES model is run daily at the Cabauw site, which is made possible by the large gain in simulation speed obtained by running it on a Graphics Processing Unit (the GPU-based Atmospheric LES, or GALES; see Schalkwijk et al. 2011). As a result, the LES-derived datasets available at Cabauw by now cover multiple years.
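A back-of-envelope count conveys the cost of the daily LES relative to the SCM. The sketch below assumes a uniform 40 m vertical grid over the full 6 km depth (an assumption for illustration; the actual DALES grid may be stretched, which would change the level count).

```python
# LES domain and resolution quoted in the text.
LX = LY = 6400.0   # horizontal domain size [m]
LZ = 6000.0        # domain height [m]
DX = DY = 100.0    # horizontal grid spacing [m]
DZ = 40.0          # vertical grid spacing, assumed uniform here [m]
DT_LES = 2.0       # LES time step [s]
DT_SCM = 900.0     # SCM time step [s]

nx, ny, nz = round(LX / DX), round(LY / DY), round(LZ / DZ)
n_cells = nx * ny * nz
les_steps_per_day = round(86400.0 / DT_LES)
scm_steps_per_day = round(86400.0 / DT_SCM)

print(nx, ny, nz)          # 64 64 150
print(n_cells)             # 614400
print(les_steps_per_day)   # 43200
print(scm_steps_per_day)   # 96
```

Roughly six hundred thousand grid cells advanced 43,200 times per simulated day, against a single column advanced 96 times, illustrates why GPU acceleration matters for running the LES continuously.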

c. A 12-point check
The set of parameters used in this study for model evaluation is chosen to reflect the cloud, radiative and thermodynamic state of the atmospheric boundary-layer, as well as the surface heat budget and the soil temperature.The 12 data-streams as listed in Table 1 consist of two types, namely i) measurements by instrumentation at Cabauw, and ii) LES results at Cabauw.The observational parameters are routinely measured at Cabauw and are therefore available for long and continuous periods of time.The LES data supplement this set with information on key aspects of cloud structure for which no measurements are available.
The chosen set of 12 parameters is designed to constrain the following impact mechanism in the coupled boundary-layer/soil system, which mainly involves fast physics and thus acts on very short time-scales. This mechanism is schematically illustrated in Fig. 1. Preliminary tests for various idealized cases lead us to suspect that the representation of boundary-layer clouds by the new scheme will differ considerably from the control model, both in amount and in vertical structure. Suppose such a difference also materializes in multi-year SCM simulations at Cabauw. This difference in clouds will affect the radiative transfer through the atmosphere, and thus the downward radiative fluxes at the surface. These fluxes are part of the surface energy budget, so they will affect both the surface temperature and the surface sensible heat flux. This will finally impact the low-level temperature in the atmospheric boundary layer. All main processes in this chain of fast interactions thus react to a change in cloud representation; at the same time, the state of these processes is either routinely measured at Cabauw or can be estimated from LES. This allows the representation of this interactive chain to be constrained at multiple points (as indicated in Fig. 1), and thus possible compensating errors between them to be identified.
One could argue that many more parameterizations are involved in this chain of interactions than there are measurements. Accordingly, some compensating errors might still remain undetected. However, using 12 independent measurements is already an improvement over evaluating against a single measurement such as total cloud cover, a type of model evaluation of which many examples already exist in the literature. As will be demonstrated, evaluations against a limited number of parameters can already be successful at revealing compensating errors at the first-order level, i.e. between the main components of the interacting system. More generally, this study should be interpreted as a first attempt at constraining a system of interacting parameterizations more comprehensively, somewhat limited (but not complicated) by the measurements that are currently available.
The model performance for each parameter will be assessed through simple statistical metrics such as the bias and the root-mean-square error. The basic time unit of these statistical analyses is the monthly mean, which in our experience yields clear signals. It should be noted that some errors will be shared by both model versions, for which various reasons can exist. One is the question of how representative a point measurement is of an area mean as represented by a model grid box. A second cause can be errors in the prescribed large-scale forcings, for example due to small-scale effects that are not resolved by the 3D model. However, the added value of a multiple-parameter approach over a single-parameter approach is that it makes the evaluation more robust, by reducing its sensitivity to the possible peculiarities of a single measurement. Accordingly, model differences will only be taken seriously when they are significant and when they materialize for multiple parameters.

Results

a. Quantifying performance
We commence by evaluating the cloud-radiative climate over multiple years (2007-2010) as generated by SCM simulations with the two versions of the CY31R1 physics package described in the previous section. Figure 2 shows the monthly means of the total cloud cover (TCC) and the surface downward shortwave radiation (SWd) at 12 UTC, plotting the observed (abscissa) against the modeled (ordinate) values. While the control SCM reproduces the cloud-radiative climate of its host model to a degree, the new SCM differs considerably. This shows that the boundary-layer scheme (the only physics component that was changed) significantly affects the cloud-radiative climate at Cabauw. In general, the new scheme underestimates TCC, while it overestimates SWd.
Scatter-plots of model results against measurements like Fig. 2 can be interpreted in terms of the bias and the centered root-mean-square (CRMS) error, defined as

$$\mathrm{bias} = \overline{\phi^m} - \overline{\phi^o}$$

and

$$\mathrm{CRMS} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( \left(\phi_i^m - \overline{\phi^m}\right) - \left(\phi_i^o - \overline{\phi^o}\right) \right)^2 \right]^{1/2},$$

respectively. Here φ is the monthly mean of a time-dependent variable, N the number of samples in the time series, the subscript i denotes the i-th sample, the superscripts m and o indicate the modeled and observed data, respectively, and the overbar indicates the time mean over the whole evaluation period. The bias and the CRMS error are metrics often used to measure model performance; the latter here reflects how well a model reproduces the observed variability among the monthly means. Using these metrics, we now expand the model evaluation to multiple parameters by including six more independently measured variables, as listed in Table 1. A convenient way of visualizing model performance for multiple variables is the so-called Taylor (2001) diagram, shown in Fig. 3. This two-dimensional diagram plots, in polar coordinates, the normalized standard deviation against the correlation coefficient. In this diagram the distance from the point labeled "REF" represents the normalized CRMS difference, and is a measure of how well a model reproduces the observed pattern. The results show that the control model best reproduces the observed pattern of monthly-mean variability for all selected variables, confirming the results obtained for only two variables.

b. Tracing impacts
Differences in biases between models for multiple variables can reflect how a change in the representation of one parameter (say, cloud cover) impacts another, and so works its way through a chain of interacting fast processes (such as the boundary-layer system). For SWd and TCC, Fig. 2 already shows that the new model has a larger bias in both variables. The biases have opposite signs, reflecting the known impact of one parameter on the other (i.e. more cloud cover enhances the reflection of solar radiation). The largest biases in shortwave radiation occur at large values, indicating that the error is largest during summer. For total cloud cover the largest biases occur at medium values, which are typical of the summertime climate at Cabauw.
The evaluation of biases is expanded to multiple parameters in Fig. 4, which summarizes and quantifies the impact of the model change on all components of the feedback mechanism illustrated in Fig. 1. A 12 % reduction in total cloud cover (a multi-year average at 12 UTC) enhances the downward shortwave radiative flux by about 36 W m−2, but reduces the downward longwave radiative flux by 4.5 W m−2. The change in the shortwave dominates, so that the net downward radiative flux at the surface also increases. This leads to a warmer soil and, through the surface energy budget, also boosts the sensible and latent heat fluxes into the atmosphere. In turn, this heats the atmosphere near the surface by about 0.2 K.
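As a rough consistency check on these multi-year mean numbers, the change in total downward radiation at the surface follows from summing the two quoted changes. Note that this counts only the downward components; the net surface flux also includes reflected shortwave and emitted longwave radiation (the latter increasing over a warmer surface), so the change in the full net flux need not equal this value.

```latex
\Delta\left(SW_d + LW_d\right) \approx (+36) + (-4.5) = +31.5~\mathrm{W\,m^{-2}}
```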
The nature of this impact mechanism is investigated in more detail in Fig. 5, using a plotting method that differs from Fig. 2 in two ways. First, the vertical axis now shows the modeled minus observed value. This highlights the difference between the models, while retaining information about the measurement. Second, on the horizontal axis the monthly means are now sorted by the associated difference in SWd between the two models. As a result, each data point (representing the mean over a specific month) has the same position on the horizontal axis in all panels. This reveals that the difference in bias between the two models increases with rank for all parameters in this set, with the largest differences on the right-hand side. This suggests that the impacts on all parameters are related to the change in SWd, and that clear correlations should exist between the model differences in the various variables.
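The ranking used for this plotting method can be sketched as follows. This is a minimal illustration, assuming monthly means stored as per-variable lists aligned with a list of month labels; the function name and the numbers are hypothetical.

```python
def rank_months(months, obs, new, control, ref="SWd"):
    """Sort months by the new-minus-control difference in `ref`, and return,
    for every variable, the model-minus-observation values in that order."""
    order = sorted(range(len(months)), key=lambda i: new[ref][i] - control[ref][i])
    ranked = {
        var: {
            "new": [new[var][i] - obs[var][i] for i in order],
            "control": [control[var][i] - obs[var][i] for i in order],
        }
        for var in obs
    }
    return [months[i] for i in order], ranked


# Hypothetical monthly means, for illustration only.
months = ["2008-05", "2008-06", "2008-07"]
obs = {"SWd": [400.0, 500.0, 450.0], "TCC": [0.60, 0.50, 0.55]}
control = {"SWd": [410.0, 505.0, 460.0], "TCC": [0.58, 0.50, 0.52]}
new = {"SWd": [420.0, 560.0, 480.0], "TCC": [0.52, 0.35, 0.45]}

order, ranked = rank_months(months, obs, new, control)
print(order)                 # ['2008-05', '2008-07', '2008-06']
print(ranked["SWd"]["new"])  # [20.0, 30.0, 60.0]
```

Because the same `order` is applied to every variable, a given month lands at the same horizontal position in all panels, which is what makes the panels directly comparable.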
Figure 6 lists the correlation coefficients between the model difference in SWd and the model differences in all other variables. Comparing the degree of correlation among variables provides further insight into where this impact mechanism starts, and how deeply it works its way into the coupled boundary-layer/soil system. For example, the model difference in the surface downward shortwave radiation is highly correlated with the model difference in cloud cover, as can be expected.
The correlations between SWd and the surface heat fluxes are similarly high, reflecting a substantial impact on the surface energy budget. However, the correlation between SWd and the soil temperature, as well as the air temperature at 2 m, is already somewhat weaker, and it weakens further with height above the surface. At 1 km height the correlation coefficient has dropped to only 0.13. In general, the correlation weakens for processes further down the chain of interacting processes, reflecting that other processes also start to play a role; in the case of air temperature this probably reflects a difference in the vertical structure of the thermodynamic state, resulting directly from the use of a different model for boundary-layer transport and clouds.
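The correlations in Fig. 6 are between model differences, not between model and observation. A minimal sketch of that computation, assuming monthly means stored per variable; the variable names (including `T1km`) and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def difference_correlations(new, control, ref="SWd"):
    """Correlate the new-minus-control difference in `ref` with the
    new-minus-control difference of every other variable."""
    diff = {v: [a - b for a, b in zip(new[v], control[v])] for v in new}
    return {v: pearson(diff[ref], diff[v]) for v in diff if v != ref}


# Hypothetical monthly means for three months, for illustration only.
new = {"SWd": [410.0, 420.0, 430.0], "TCC": [0.50, 0.40, 0.30], "T1km": [285.0, 284.0, 286.0]}
control = {"SWd": [400.0, 400.0, 400.0], "TCC": [0.60, 0.60, 0.60], "T1km": [285.0, 285.0, 285.0]}
print({v: round(r, 2) for v, r in difference_correlations(new, control).items()})
# {'TCC': -1.0, 'T1km': 0.5}
```

A variable tightly coupled to the radiative change (here the made-up TCC) correlates almost perfectly, while one influenced by other processes (here the made-up temperature aloft) correlates only weakly, mirroring the weakening down the chain described above.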

c. Regime sampling
The results presented so far suggest that the change in cloud cover drives the changes in the other parameters; we now adopt this as our working hypothesis. The next step is to gain more insight into the nature of this change in cloud representation by studying model behavior at process level. The aim of this exercise is to establish whether the change in cloud cover can be attributed to fast-acting parametrized processes at all. Such parameterizations can act on very short time-scales, sometimes close to the integration time step, and this should be reflected in the evaluation strategy.
The method adopted here follows the strategy proposed by NSH12. First, the day or days are identified that contribute most to the long-term mean model difference; this choice ensures that the most relevant cases are selected for further study. These cases are then investigated in detail; for example, we ask whether the selected days have anything in common, or define a specific regime. If so, a criterion is defined that reflects this regime. This criterion is subsequently used to calculate conditionally sampled long-term means. If the signal of interest (i.e. the model difference) amplifies in these conditional means, then the corresponding regime is responsible for most of the bias. The benefit of such conditional sampling over long time periods is that it can improve the statistical significance and representativeness of the choice of the 'most relevant case' or 'golden day'.
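The first step, identifying the days that contribute most to a monthly-mean model difference, can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up values; it ranks days in the direction of the monthly-mean signal, one reasonable reading of "contribute most".

```python
def top_contributing_days(days, new, control, k=3):
    """Rank days by their contribution to the monthly-mean new-minus-control
    difference. Day i contributes (new[i] - control[i]) / N, so the
    contributions sum to the monthly-mean difference itself."""
    n = len(days)
    contrib = [(a - b) / n for a, b in zip(new, control)]
    total = sum(contrib)
    sign = 1.0 if total >= 0.0 else -1.0  # rank in the direction of the mean signal
    order = sorted(range(n), key=lambda i: sign * contrib[i], reverse=True)
    return total, [(days[i], contrib[i]) for i in order[:k]]


# Hypothetical 12 UTC SWd values [W m-2] for four days, for illustration only.
days = ["06-01", "06-02", "06-03", "06-04"]
new = [505.0, 580.0, 490.0, 560.0]
control = [500.0, 500.0, 500.0, 500.0]
total, top = top_contributing_days(days, new, control, k=2)
print(total, top)  # 33.75 [('06-02', 20.0), ('06-04', 15.0)]
```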
First, a single month is selected for which the difference in TCC and SWd between the models is large. Figure 7 shows the composite-mean diurnal evolution of these parameters for June 2008. A well-defined diurnal cycle exists in the model difference for both parameters, with the maximum difference occurring at about 12 UTC. The breakdown of this monthly-mean signal into individual days is shown in Fig. 8, in which the days that contribute most to the monthly-mean model difference are highlighted. To give the reader an idea of the actual cloud field on these days, snapshots taken by the Cabauw web camera at 12 UTC on each of these days are shown in Fig. 9. All these days featured fair-weather cumulus, which suggests that this boundary-layer regime is responsible for most of the bias.
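A composite-mean diurnal cycle such as that of Fig. 7 averages all samples that share the same hour of day across the month. A minimal sketch, assuming time-stamped samples of the new-minus-control difference; the function name and values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def composite_diurnal_cycle(times, values):
    """Composite (multi-day) mean per UTC hour of day."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, v in zip(times, values):
        sums[t.hour] += v
        counts[t.hour] += 1
    return {h: sums[h] / counts[h] for h in sorted(sums)}


# Hypothetical new-minus-control TCC differences on two days, for illustration only.
times = [datetime(2008, 6, 1, 11), datetime(2008, 6, 1, 12),
         datetime(2008, 6, 2, 11), datetime(2008, 6, 2, 12)]
diffs = [-0.10, -0.20, -0.05, -0.30]
cycle = composite_diurnal_cycle(times, diffs)
print(min(cycle, key=cycle.get))  # 12  (hour of the largest negative difference)
```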
To improve our confidence in this conclusion, we now enhance its statistical significance by assessing the model difference for the subset of days within the period 2007-2010 on which fair-weather cumulus occurred. A day is labeled as "fair-weather cumulus" if all of the following three criteria apply at 12 UTC: (i) a positive surface buoyancy flux; (ii) a lifting condensation level below the boundary-layer top; and (iii) a total cloud cover smaller than 50%.
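The three criteria and the resulting conditional mean translate directly into code. This is a minimal sketch with hypothetical function names and made-up values; the criteria are applied to 12 UTC values, which in the text are taken from SCM output of the new scheme.

```python
def is_fair_weather_cumulus(buoyancy_flux, lcl, bl_top, tcc):
    """The three 12 UTC criteria from the text.

    buoyancy_flux: surface buoyancy flux (only its sign is used)
    lcl, bl_top:   lifting condensation level and boundary-layer top [m]
    tcc:           total cloud cover as a fraction between 0 and 1
    """
    return buoyancy_flux > 0.0 and lcl < bl_top and tcc < 0.5


def full_and_conditional_mean(diffs, flags):
    """Mean model difference over all days, and over flagged days only."""
    full = sum(diffs) / len(diffs)
    selected = [d for d, f in zip(diffs, flags) if f]
    cond = sum(selected) / len(selected) if selected else float("nan")
    return full, cond


# Hypothetical 12 UTC values for four days, for illustration only.
flux = [0.05, 0.04, -0.01, 0.06]
lcl = [800.0, 1200.0, 600.0, 900.0]
bl_top = [1000.0, 1000.0, 800.0, 1200.0]
tcc = [0.30, 0.20, 0.40, 0.45]
swd_diff = [60.0, 5.0, 0.0, 40.0]  # new minus control [W m-2]

flags = [is_fair_weather_cumulus(*day) for day in zip(flux, lcl, bl_top, tcc)]
print(flags)                                        # [True, False, False, True]
print(full_and_conditional_mean(swd_diff, flags))   # (26.25, 50.0)
```

In this made-up example the conditional mean is almost twice the full mean, which is the kind of amplification that identifies the responsible regime.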
Figure 10a shows the percentage of days per month labeled as fair-weather cumulus, using the SCM output of the new scheme. The percentage peaks at about 50 % in the Northern Hemisphere summer. Panels b) and c) show the corresponding model differences in TCC and SWd, respectively, for both the full monthly mean and the conditionally sampled monthly mean. For both variables the model difference is clearly largest in the conditional mean. This confirms that the fair-weather cumulus-topped boundary layer is the regime in which the impact of the model change is strongest.

d. Process-level evaluation
We next investigate in detail the representation of boundary-layer clouds on the shallow cumulus days selected in the previous section. Figure 11 shows the corresponding time-height cross-sections of the cloud fraction in the boundary layer. For reference the LES results are also shown. One difference in cloud structure that immediately stands out is the tendency of the control scheme to create a single-model-level "anvil" at the top of the cloud layer. In contrast, with the new scheme the cloud fraction is maximal at cloud base and decreases monotonically with height above. In addition, the maximum cloud fraction is larger in the control scheme. In these two aspects (i.e. the vertical structure and the amplitude) the new scheme more closely resembles the LES results.
To judge which vertical structure is most realistic, the model results are now confronted with two relevant observational data products available at the Cabauw site (Fig. 12). The CLOUDNET dataset (Illingworth and Co-authors 2007) includes profiles of cloud fraction based on independent measurements by multiple instruments (cloud radar, microwave radiometer and ceilometer). A complicating factor is that its vertical discretization of 300 m is too coarse to properly resolve the vertical structure of most boundary-layer cloud cases at Cabauw, which are typically less than 1 km deep. The few remaining cases with a sufficiently deep shallow cumulus layer therefore only allow anecdotal (case-by-case) evaluation. For example, Fig. 12a shows that on 16 June 2008 at 12 UTC the shallow cumulus cloud layer was about 2 km deep. In this case the "bottom-heavy" profile produced by the LES as well as by the new scheme is supported by the CLOUDNET observation.
The locations of the cloud-layer boundaries are a first-order aspect of the vertical structure of the cloudy boundary layer. To assess representativeness, the LES results are now confronted with high-frequency measurements of the lowest cloud base height by the LD40 ceilometer at Cabauw. To this end, the lowest cloud base height observed by the LD40 within each 1-hour time bin is used, in order to minimize the chance of including samples that actually reflect the sides of cumulus clouds. The dataset thus obtained should reflect the lifting condensation level (LCL), which is generally used as the definition of the base of the cumulus cloud layer. The evaluation covers the whole month of June 2008, using a daily time window of 4 hours around 12 UTC in order to capture some of the diurnal variation in cloud base height visible in Fig. 11. This period also roughly corresponds to the period of largest inter-model difference visible in Fig. 8. The results are plotted as a probability density function in Fig. 12b, suggesting that the LES satisfactorily reproduces the observed LCL, with a relatively small bias of 105.6 m, a CRMS of 311.7 m, and a correlation coefficient of 0.65. This correlation is reasonable, given that many factors still prevent a perfect correlation between LES and measurements, such as i) uncertainties in the prescribed forcing, ii) low-level fog due to local terrain features that are not captured by the model setup, and iii) the sampling method still failing to exclude all samples above the LCL.
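The hourly-minimum sampling can be sketched as follows. This is a minimal illustration, assuming time-stamped ceilometer cloud-base samples with `None` for no detection; the function name is hypothetical, and the 10-14 UTC window is our reading of "4 hours around 12 UTC".

```python
from datetime import datetime


def hourly_lowest_cloud_base(times, cloud_base, first_hour=10, last_hour=14):
    """Lowest detected cloud-base height per 1-hour bin, as an LCL proxy.

    Samples without a detection (None) are skipped, and only bins whose
    UTC hour lies in [first_hour, last_hour) are kept.
    """
    lowest = {}
    for t, z in zip(times, cloud_base):
        if z is None or not (first_hour <= t.hour < last_hour):
            continue
        key = t.replace(minute=0, second=0, microsecond=0)
        lowest[key] = min(z, lowest.get(key, z))
    return dict(sorted(lowest.items()))


# Hypothetical ceilometer samples [m], for illustration only.
times = [datetime(2008, 6, 16, 12, 5), datetime(2008, 6, 16, 12, 20),
         datetime(2008, 6, 16, 12, 40), datetime(2008, 6, 16, 13, 10),
         datetime(2008, 6, 16, 15, 0)]
bases = [950.0, 720.0, None, 700.0, 500.0]
for hour, z in hourly_lowest_cloud_base(times, bases).items():
    print(hour.strftime("%H UTC"), z)
```

Taking the minimum works because a beam that grazes the side of a cumulus cloud reports a height above the true base, so the lowest value in a bin is the most likely to sample the base itself.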
Having obtained some confidence in the LES results, the final step is to apply the same method of multiple-parameter analysis as used in Section 3a to the evaluation of the vertical structure of the boundary-layer cloud field in the SCMs against the LES. To this end, an additional set of four relevant variables is defined that reflects the key aspects of the cloud vertical structure in which the two SCM versions differ, as already established in the discussion of Fig. 11:
• The cloud base height;
• The maximum cloud fraction in the boundary layer;
• The height of maximum cloud fraction in the boundary layer; and
• The boundary-layer cloud-overlap ratio.
The latter is defined as the maximum cloud fraction divided by the total cloud cover, both diagnosed over the lowest 4 km. The overlap ratio is included in this set because of its potentially important impact on radiative transfer. The set of four parameters is evaluated using the same setup, time coverage and visualization method as applied in Fig. 12b. The results are shown in Fig. 13 and are statistically summarized in Fig. 14. The new model performs significantly better for the cloud base height, the maximum cloud fraction and the height of maximum cloud fraction. The maximum cloud fraction in particular seems to be over-predicted by the control model, which is unable to reproduce the small amplitudes typical of fair-weather cumulus as diagnosed in the LES. Interestingly, both models perform poorly for the cloud overlap ratio, as expressed by the large bias for this parameter shared by both. On average, the SCMs give r_overlap = 1 while the LES gives r_overlap = 0.5; in other words, the SCMs in effect apply the maximum-overlap limit (i.e. total cover equals maximum fraction), while in the LES the overlap is much less efficient.
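The four diagnostics can be computed from a single cloud-fraction profile plus the total cloud cover. This is a minimal sketch: the function name is hypothetical, and defining cloud base as the lowest level whose cloud fraction exceeds a threshold is our assumption (the observational comparison in the text uses the LCL).

```python
def cloud_structure_diagnostics(heights, cloud_fraction, tcc, z_top=4000.0, threshold=0.0):
    """Four cloud-structure diagnostics over levels at or below z_top [m].

    cloud base:     lowest level with cloud fraction above `threshold` (assumed definition)
    max_cf, z_max:  maximum cloud fraction and the height where it occurs
    r_overlap:      maximum cloud fraction divided by total cloud cover `tcc`
    Returns None when there is no cloud.
    """
    levels = [(z, c) for z, c in zip(heights, cloud_fraction) if z <= z_top]
    cloudy = [z for z, c in levels if c > threshold]
    if not cloudy or tcc <= 0.0:
        return None
    z_max, max_cf = max(levels, key=lambda zc: zc[1])
    return {"cloud_base": min(cloudy), "max_cf": max_cf, "z_max_cf": z_max,
            "r_overlap": max_cf / tcc}


# Hypothetical bottom-heavy cumulus profile, for illustration only.
heights = [200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0]
cf = [0.0, 0.08, 0.06, 0.04, 0.02, 0.0]
print(cloud_structure_diagnostics(heights, cf, tcc=0.16))  # r_overlap = 0.5
```

An r_overlap of 1 means the projected cover equals the largest layer fraction (maximum overlap); smaller values mean the cloudy layers are partly offset from one another, so the projected cover exceeds any single layer.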
The better performance of the new model for boundary-layer cloud structure, combined with its worse performance for all other variables, might seem paradoxical at first. However, this apparent contradiction is explained by the shared error in cloud overlap found in Fig. 13d. This is schematically illustrated in Fig. 15. As in most operational GCMs, the maximum-random overlap function (e.g. Geleyn and Hollingsworth 1979; Räisänen 1998) is applied in RACMO, where it resides in the radiation scheme. This overlap function was not affected by the implementation of the new boundary-layer scheme, so both model versions use the same overlap function. In the case of the control scheme, the overestimation of the maximum cloud fraction in the boundary layer is compensated by the assumption of too efficient vertical overlap, resulting in a still reasonable estimate of the projected cloud cover (but for the wrong reason). In contrast, the new scheme better reproduces the smaller cloud fractions seen in the LES, but in this setting fails to combine them with the correct, inefficient overlap. This results in a structural underestimation of the projected cloud cover, which in turn leads to an overestimated net downward radiative flux at the surface, with all its further impacts on the state of the coupled boundary-layer/soil system as documented in the previous sections.
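The compensating error can be reproduced with a few numbers. The sketch below uses the textbook Geleyn and Hollingsworth (1979) form of maximum-random overlap; it is not taken from the RACMO radiation code, and the two profiles are hypothetical caricatures of the control ("anvil") and new (bottom-heavy) schemes.

```python
def max_random_cover(cf, eps=1e-6):
    """Total cloud cover under maximum-random overlap.

    cf holds cloud fractions per model level, ordered vertically. Adjacent
    cloudy levels overlap maximally; cloud layers separated by a cloud-free
    level overlap randomly (Geleyn and Hollingsworth 1979 form).
    """
    clear = 1.0 - cf[0]
    for k in range(1, len(cf)):
        clear *= (1.0 - max(cf[k], cf[k - 1])) / (1.0 - min(cf[k - 1], 1.0 - eps))
    return 1.0 - clear


# Hypothetical top-to-bottom profiles, for illustration only.
control = [0.30, 0.10, 0.10]   # "anvil" at cloud top, too large a maximum fraction
new = [0.05, 0.10, 0.15]       # bottom-heavy, smaller fractions as in the LES
print(round(max_random_cover(control), 2))  # 0.3
print(round(max_random_cover(new), 2))      # 0.15
# With the LES-like overlap ratio r = 0.5, the new profile would instead project
# to about 0.15 / 0.5 = 0.30.
```

Both contiguous profiles collapse to their maximum fraction under this overlap rule (r = 1). The control profile therefore lands on a plausible cover only because its maximum fraction is too large, whereas the more realistic new profile loses half of its projected cover.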

e. Model improvement
The insight that errors in the representation of vertical overlap in cumuliform boundary-layer cloud fields are responsible for the degraded overall performance of the new scheme motivated the authors to investigate this phenomenon more closely. The results of that study were published by Neggers et al. (2011), who reported that much of the inefficiency in vertical overlap already takes place at depth scales that are sub-grid scale (SGS) at the vertical discretizations typical of most present-day GCMs. As a consequence, most GCMs do not account for this important phenomenon. They also proposed an inverse linear function to describe the observed behavior of overlap in cumuliform cloud fields as a function of layer depth.
The new scheme was rerun for the whole period 2007-2010, this time accounting for overlap on sub-grid scales. To this end the overlap function proposed by Neggers et al. (2011) was implemented, which effectively results in an increase of the cloud fraction per level within the boundary layer (as illustrated in Fig. 16a). Figure 17 gives a statistical summary of its performance in terms of the main characteristics of Fig. 4 (the multi-year bias) and Fig. 3 (the centered RMS difference). Compared to the simulation without the SGS overlap function, the mean bias is significantly reduced for all variables, almost to the level of the control model. The same is true for the centered RMS difference. Note that both errors remain somewhat larger than those of the control model; this is probably because only the overlap on sub-grid scales was altered. A further reduction in the biases can be expected when the super-grid scale overlap (i.e. the overlap between model levels, see Fig. 16b) is also improved.

Summary and conclusions
The results obtained in this study illustrate that the multiple-parameter evaluation of continuous, long-term SCM simulations can be an efficient method to i) constrain a system of interacting parameterizations, ii) trace impacts of model changes throughout this system, and iii) reveal the existence of compensating errors between parametric components. In the example documented in this study, we applied this method to evaluate the impact of implementing a new boundary-layer scheme in RACMO on the cloud-radiative model climate at Cabauw. The "12-point check" revealed the existence of a compensating error in the interaction between boundary-layer clouds and radiative transfer, residing in the cloud overlap function. Equipping the new scheme with an improved cloud overlap function resulted in much improved model performance for all parameters.
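Tracing the impact of a model change through the system (item ii above) relies on correlating inter-model differences between variables, as in Fig. 6, which shows both Pearson's and Spearman's rank correlation coefficients. A minimal NumPy sketch of both coefficients is given below; the input series are hypothetical monthly-mean differences (new minus control), and tied values are assumed to receive their average rank.

```python
import numpy as np


def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def average_ranks(x):
    """Ranks 1..n, with tied values given their mean rank."""
    x = np.asarray(x, dtype=float)
    ranks = np.empty(x.size)
    ranks[np.argsort(x, kind="mergesort")] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monthly differences (new minus control) in two variables.
d_sw = [1.0, 2.0, 3.0, 4.0, 5.0]
d_other = [v**3 for v in d_sw]  # monotonic but nonlinear response
print(pearson(d_sw, d_other), spearman(d_sw, d_other))
```

Spearman's coefficient equals 1 for any monotonic relation, while Pearson's measures only its linear part, so comparing the two indicates whether a link between, for example, differences in surface shortwave downward radiation and another variable is monotonic but nonlinear.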
The results obtained in this study suggest that the important phenomenon of vertical overlap in boundary-layer cloud fields is still poorly represented in GCMs and deserves more research. The functional relation found by Neggers et al. (2011) captures, to first order, the inefficient overlap diagnosed in LES; this finding is used here to justify its application in a simple experiment intended to illustrate sensitivity. However, Neggers et al. (2011) also documented considerable case dependence in the associated constant of proportionality, which they speculated to be linked to the size statistics of the cumuliform cloud ensemble. More research is needed to fully understand this dependence.
This study focuses on the interaction between boundary-layer clouds and radiative transfer, and their impact on the coupled boundary-layer/soil system. To this end, a specific set of measurements was defined for model evaluation.
The selection of such a set of multiple parameters depends on the problem of interest. First, a hypothesis should be formed about which parameterized processes are involved in the problem and should therefore be constrained by measurements. The choice of site should also depend on the problem of interest; at some locations the associated weather regime occurs more frequently than at others, making these sites a more logical choice.
The method followed in this study can also be applied to other topics, provided that certain conditions are met. The process of interest should act on time and length scales small enough that i) the phenomenon acts much faster than the atmospheric circulation in which it is embedded, and ii) it is "locally forced" enough to be studied in the absence of interaction with the larger scales. Only then can the problem be addressed with single-column modeling using prescribed large-scale forcings. Examples of topics that could be studied are i) the representation of momentum transport in the boundary layer, and ii) the impact of soil moisture on evaporation. An example of a process that is less appropriate to study is mature deep convection, as it often involves mesoscale effects that might be partially resolved in the associated GCM.
In practice, another factor that often limits multiple-parameter evaluation at the process level is the availability of instrumentation at a site, or the insufficient time coverage of the relevant measurements. The approach therefore advocates the long-term, continuous measurement of a range of relevant variables at super-sites, and promotes their availability to the scientific community. In this study, LES-generated data sets were used to supplement the observational data sets with parameters required to address the problem. However, one should realize that LES is still a model. It should itself be evaluated against measurements to increase confidence in its use as a virtual laboratory. This is an ongoing activity at the Cabauw site.
The evaluation of a system of interacting fast-acting parameterizations, in isolation from the larger-scale circulation, against long-term measurements at permanent meteorological sites can facilitate the attribution of GCM behavior to specific parameterizations. For example, it could help reduce the uncertainty in numerical predictions of future climate that is known to be caused by the representation of subtropical marine boundary-layer clouds. This opportunity is explored in the ongoing European Cloud Inter-comparison, Process-Study and Evaluation project (EUCLIPSE, see http://www.euclipse.eu/).

Fig. 1. Schematic illustration of the chain of interacting processes in the coupled soil-boundary layer system that is investigated in this study. The numbers refer to the twelve measurements and LES diagnostics listed in Table 1 that are used to constrain this system.

Fig. 2. Scatter plot of the observed (abscissa) versus simulated (ordinate) monthly-mean a) total cloud cover and b) surface shortwave downward radiation at Cabauw at 12 UTC for the period 2007-2010. The host-model state (RACMO) is shown in grey, its native SCM in red, and the SCM including the new BL scheme in blue.

Fig. 3. A Taylor diagram quantifying the monthly-mean model performance at Cabauw at 12 UTC for the period 2007-2010 for the eight observed variables in Table 1. The legend and interpretation are explained in the text.

Fig. 4. Bar chart showing the biases of the control SCM (red) and the new SCM (blue) for the eight observed variables in Table 1, as calculated from the monthly means in the period 2007-2010.

Fig. 6. Correlation coefficients between the monthly-mean model differences in SW_d and various other variables, for the period 2007-2010. The black line represents Pearson's correlation coefficient, while the grey line represents Spearman's rank correlation coefficient.

Fig. 8. Break-down of the monthly-mean TCC at 12 UTC for June 2008, as shown in Fig. 7a, into individual days. The eight days with the largest inter-model difference are highlighted by light-blue shading.

Fig. 14. Bar chart showing the biases of the control SCM (red) and the new SCM (blue) against LES, as calculated from the PDFs shown in Fig. 13.

Table 1. The set of observed and LES-diagnosed parameters at Cabauw used in the SCM evaluation.