## Abstract

A large number of urban surface energy balance models now exist with different assumptions about the important features of the surface and exchange processes that need to be incorporated. To date, no comparison of these models has been conducted; in contrast, models for natural surfaces have been compared extensively as part of the Project for Intercomparison of Land-surface Parameterization Schemes. Here, the methods and first results from an extensive international comparison of 33 models are presented. The aim of the comparison overall is to understand the complexity required to model energy and water exchanges in urban areas. The degree of complexity included in the models is outlined and impacts on model performance are discussed. During the comparison there have been significant developments in the models with resulting improvements in performance (root-mean-square error falling by up to two-thirds). Evaluation is based on a dataset containing net all-wave radiation, sensible heat, and latent heat flux observations for an industrial area in Vancouver, British Columbia, Canada. The aim of the comparison is twofold: to identify those modeling approaches that minimize the errors in the simulated fluxes of the urban energy balance and to determine the degree of model complexity required for accurate simulations. There is evidence that some classes of models perform better for individual fluxes but no model performs best or worst for all fluxes. In general, the simpler models perform as well as the more complex models based on all statistical measures. Generally the schemes have best overall capability to model net all-wave radiation and least capability to model latent heat flux.

## 1. Introduction

The world’s population has become increasingly urbanized: around 29% of the global population were urban dwellers in 1950, rising to 47% by 2000, and this proportion is predicted to rise to 69% by 2050 (UN 2009). Thus increasing numbers of people are impacted by weather and climate in urban areas. There is a growing requirement for accurate weather forecasts and climate change information within cities, and concurrent increases in computer capabilities allow greater spatial resolution within models. In combination, there is a greater proportion of the earth’s surface being categorized as “urban” and there are a larger number of smaller grid boxes in atmospheric models in which urban areas need to be resolved.

The surface morphology (i.e., urban form) and presence of impervious building materials, sparseness of vegetation, and anthropogenic heat, water, and pollutant contributions each have a significant effect on the climate of urban regions, which lead to phenomena such as the urban heat island. Thus, effects of the urban surface on the fluxes of heat, moisture, and momentum need to be accounted for in the land-surface schemes used within numerical models, although the complexity of these schemes has to be balanced with their computational requirements. A fundamental aim of urban energy balance models is to accurately predict fluxes at the local scale (10^{2}–10^{4} m). Some calculate additional terms including within-canyon air temperatures and wind speed, and facet surface temperature. A facet is a surface of the urban geometry that can be characterized by a single temperature and surface energy balance, and that can interact thermodynamically with other facets (e.g., a wall facet exchanging longwave radiation with the road facet; Fig. 1). The outputs from the model may be hourly or higher temporal resolution for the whole surface, or be facet/orientation-specific.

Models have been developed to incorporate urban features for different applications including global climate modeling (e.g., Oleson et al. 2008a,b); numerical weather prediction (e.g., Best 1998, 2005; Masson 2000; Chen et al. 2004; Harman and Belcher 2006; Liu et al. 2006); air quality forecasting (e.g., Martilli et al. 2003) and dispersion modeling (e.g., Hanna and Chang 1992, 1993); characterization of measurements (e.g., Krayenhoff and Voogt 2007); and water balance modeling (e.g., Grimmond et al. 1986; Grimmond and Oke 1991). Across these schemes a wide range of urban features are incorporated; the models have varying levels of complexity, and different fluxes modeled (Table 1; Figs. 1, 2).

In this paper, the methodology and initial results from the first international comparison of a broad range of urban land-surface schemes are presented. The requirements of a land-surface model from the perspective of an atmospheric model are considered; that is, surface fluxes of heat, moisture, and momentum. Thus, the fundamental requirement for the models to be included is that they simulate urban energy balance fluxes. The forcing data for the surface models are the same as that which would be provided by an atmospheric model; that is, the incoming shortwave (*K*↓) and longwave fluxes (*L*↓), air temperature, specific humidity, and the wind components. From these the outgoing radiative fluxes (*K*↑, *L*↑), net all-wave radiation (*Q**), turbulent sensible heat flux (*Q _{H}*), turbulent latent heat flux (

*Q*), and net heat storage flux (Δ

_{E}*Q*) are modeled. In this context, the net heat storage includes the energy storage within the buildings, the road and underlying soil, and for some models the air space within the street canyon (Grimmond and Oke 1999a). In the urban environment it is also useful to consider the anthropogenic heat flux (

_{S}*Q*) in the surface energy balance (Oke 1988):

_{F}Features such as additional sources of energy (*Q _{F}*), presence of built and natural surfaces, the bluff body nature of the buildings, and existence of urban canyons, combine to change energy partitioning in urban areas. Thus significant modification to rural land parameterization schemes is needed. While many urban models have been evaluated against observational datasets (e.g., Grimmond and Oke 2002; Masson et al. 2002; Dupont and Mestayer 2006; Hamdi and Schayes 2007; Krayenhoff and Voogt 2007; Kawai et al. 2009), with some models even using the same observations, these comparisons have not been conducted in a controlled manner that allows robust model intercomparison. The objective here is to do just that: to undertake a staged and carefully controlled classification and comparison of urban energy balance models and their performance. An important objective also is to determine which approaches minimize the errors in the simulated fluxes.

## 2. The characteristics of urban energy balance models

Urban energy balance models can be classified in a number of ways (see also Grimmond et al. 2009a); for example, they vary in terms of the fluxes they calculate (“F” in Figs. 1 and 2). While all the models examined here calculate *K*↑, *L*↑, *Q**, and *Q _{H}*, some do not model either

*Q*or the

_{E}*Q*, and some model neither. Here, a series of features are used to classify the approaches taken. Figures 1 and 2 illustrate these, and the latter provides the numbers of models in each category. The illustrations also give each model class a reference in order to identify the category and its classification.

_{F}### a. Vegetation and latent heat flux (“V” in Figs. 1 and 2)

A key decision in modeling an urban surface is whether or not vegetation (V) is simulated. A threefold classification is used here, where vegetation is

Vn: not considered,

Vs: modeled using a “tile” scheme to represent the surface heterogeneity (e.g., Essery et al. 2003) that does not interact with other surface types until the first atmospheric level of a mesoscale model (e.g., Best et al. 2006), and

Vi: “integrated” into the modeled urban surface.

The implication of not including vegetation is that there can be no latent heat except for periods immediately following rainfall. Some, even after rainfall, calculate no *Q _{E}*, whereas some account for dewfall and its later evaporation (Fig. 2b). For central business districts in many cities it may be reasonable to assume a negligible amount of vegetation and, hence, an absence of

*Q*associated with vegetated surfaces. However, in residential areas (e.g., suburban North America) extensive fractions of the surface are vegetated, so the assumption of no urban

_{E}*Q*is unrealistic. Moreover, in many locations, extensive street cleaning can result in water being available for evaporation despite the lack of vegetation (e.g., Mexico City, Mexico, Oke et al. 1999; Marseille, France, Grimmond et al. 2004).

_{E}The two classes of model that do incorporate vegetation differ in terms of the interactions which occur between the “built” and “vegetated” fractions (Figs. 2a–c). In the first case, “tiles” (Vs; Fig. 1), models typically take advantage of traditional land-surface schemes that have a wide variety of vegetation categories (e.g., Noilhan and Mahfouf 1996; Chen and Dudhia 2001; Essery and Clark 2003). Many have been extensively evaluated in the Project for Intercomparison of Land-Surface Parameterization Schemes (PILPS) (Henderson-Sellers et al. 1993, 2003; Irranejad et al. 2003) and other studies. Urban vegetation typically is more diverse than an individual vegetation class so a number of classes may be required (e.g., needleleaf and evergreen broadleaf trees) to ensure adequate representation. In the tile approach, the built and vegetated fluxes are typically weighted by their respective plan area fractions to contribute to total fluxes (e.g., Lemonsu et al. 2004). The integrated case (Vi) is the most physically realistic as it allows for direct interaction of built and vegetated surfaces. This additional complexity may require increased computing resources and parameter values.

### b. Anthropogenic heat fluxes (“AN” in Figs. 1 and 2)

The magnitude of *Q _{F}* varies across a city. Typically it will be greatest in the densest part of the city (Oke 1988; Grimmond 1992; Ichinose et al. 1999). But even low absolute

*Q*values may be important where they exceed the radiative forcing (e.g., cloudy, cold winters with low solar input).

_{F}Similar to *Q _{E}*, not all models consider

*Q*. The four general approaches are

_{F}ANn:

*Q*is assumed to be zero, negligible, or ignored;_{F}ANp:

*Q*is assumed to be a fixed amount that is required as specified input to the model, or is directly coded into the program;_{F}ANi:

*Q*is calculated based on assumed internal building temperature; and_{F}ANm:

*Q*is calculated and incorporates internal heat sources from buildings, and/or mobile sources associated with traffic, and/or metabolism._{F}

Models that calculate *Q _{F}* typically include the heat related to internal heating of buildings as a minimum. A fixed temperature is assigned internally and this may, or may not, be allowed to vary seasonally or diurnally. Alternatively a fixed minimum (maximum) temperature is used so the internal temperature of the building may vary but within limits. The heat flux from traffic typically is based on assumptions about traffic flow from vehicle counts. The models that calculate

*Q*in more detail, using a building energy model, mostly use the method of Kikegawa et al. (2003).

_{F}Beyond the internal temperature, the introduction of *Q _{F}* requires consideration of where the heat is released or added to the atmosphere; for example, whether heat is added within or above the canyon.

### c. Anthropogenic heat fluxes: Temporal variation (“T” in Figs. 1 and 2)

Anthropogenic heat flux *Q _{F}* varies both diurnally and seasonally (e.g., Sailor and Lu 2004; Offerle et al. 2005; Pigeon et al. 2007; Lee et al. 2009), although only some models consider this. Models that prescribe a fixed value (Tf) are likely to provide too much

*Q*at night and insufficient quantities in the day; they will also not capture peak values normally associated with commuting (seasonally this peak may be associated with low solar radiation forcing). The inclusion of a diurnal and/or seasonal cycle (Tv) is more significant for certain applications when the modeled fluxes must be correct for specific short time periods. It is less significant when applications are not concerned with diurnal patterns.

_{F}### d. Urban morphology (“L” in Figs. 1 and 2)

Urban morphology affects radiative and turbulent heat exchanges. A number of approaches are used to capture these features, including

L1: Slab or bulk surface;

L2: single-layer approaches, which separate the surface into a roof and canyon (wall plus road) or

L3: where the three facets (roof, wall, and road) are treated separately; and

L4–L7: multiple-layer approaches, which divide one or more of the facets into layers or patches.

Slab models represent the urban form as a flat horizontal surface with appropriate “bulk” radiative, aerodynamic, and thermal characteristics. This has the advantage of simplicity and reduced computational time and parameter requirements. Single-layer models simplify the urban form to an urban canyon with a roof, wall, and a road. This allows for more realistic representations of radiative trapping and turbulent exchange (Masson 2000; Kusaka et al. 2001; Harman et al. 2004a,b; Lee and Park 2008). Parameter values are assigned for each facet and one set of energy exchanges per facet is modeled. Multilayer schemes divide the walls into a number of vertical and/or the roof and road into a number of horizontal patches; each with their own parameter values and energy exchanges modeled. For some models this allows for variable building height, and for others even differing roof, wall, and road characteristics. Note that the range of multilayer models L4–L7 is not exhaustive; rather it covers the range compared here.

### e. Urban morphology: Facets and orientation (“FO” in Figs. 1 and 2)

Models can be further subdivided by how urban canyon morphology, specifically the number of facets and orientations, are dealt with. Models include those that assume no facets (or orientation) and hence a bulk (or slab) surface (FO1), those that assume one infinitely long canyon (FOn), and those that have infinitely long canyons that run in two cardinal directions (FOo). The canyons may be fixed in orientation and neglect shading or assume a random distribution of street canyons within the domain. Alternatively, the canyon may be modeled assuming two walls that have sunlit and shaded fractions that vary through the day and year. More realistic models also include an intersection between canyons (FOi), significantly increasing the number of the interactions with other facets that need to be computed.

### f. Radiative fluxes: Reflections (“R” in Figs. 1 and 2)

As *K*↓ and *L*↓ are provided, it is the *K*↑ and *L*↑ that are modeled. Beyond the morphology, and therefore the degree of detail needed for the surface parameters, the major differences relate to the number of reflections assumed—R1: single; Rm: multiple; Ri: infinite.

The single reflection model is the least computationally intensive and used in both slab and single-layer models. Models that simulate multiple reflections include both single-layer and multiple-layer models. Infinite reflections may be accounted for by slab, single-layer, and multilayer models.

For longwave radiation, slab models determine one surface temperature, whereas for facet-specific models, multiple surface temperatures are calculated (Figs. 2b,c). The surface temperatures then provide the forcing for *Q _{H}* and Δ

*Q*.

_{S}### g. Radiative fluxes: Albedo and emissivity (“AE” in Figs. 1 and 2)

The albedo and emissivity values that determine the radiative fluxes may either be defined as a single value (bulk, AE1); as two facets, similarly to the L2 category (AE2); or may consist of combinations of various facets, analogous to the L3 (or greater) category of models (AEf).

### h. Storage heat flux (ΔQ_{S}; “S” in Figs. 1 and 2)

The Δ*Q _{S}* is significant in urban areas given the materials and morphology of the urban surface (Grimmond and Oke 1999a). In urban models, it is determined in the following ways:

Sr: difference or residual of the energy balance,

Sc: solution of the heat conduction equation by dividing the facets into a number of thickness layers, and

Sn: function of

*Q** and surface characteristics.

All three methods are used by slab or bulk models (Fig. 3). For all three, the ability to model the outgoing longwave radiation will impact the values obtained given the common need to model surface temperature.

For those models in which heat storage is calculated as the residual of the surface energy balance, assumptions as to which fluxes are included (specifically *Q _{F}* and

*Q*) are important. The second method, the solution of the heat conduction equation, is used extensively by slab, single-, and multiple-layer models. It requires various parameters for each (sub) facet, including: number of layers, layer thickness, thermal conductivity, and volumetric heat capacity of the various layer materials (Table 2). The number of layers resolved varies between 1 and 48, and may be of fixed or variable thickness. Currently, none account for changing water content of built materials associated with rain, so the material parameters are static. Some solve the heat conduction equation using the force-restore method, while others solve the one-dimensional heat conduction equation.

_{E}The third approach is to use a fraction of *Q**(Sn). Some models take into account the diurnal pattern of the flux through the objective hysteresis model (Grimmond et al. 1991).

### i. Other features

The following characteristics are not explicitly used to classify the models in this evaluation but are presented here because of differences between models. They do not necessarily result in the models being grouped differently to the classifications above; that is, models fall into some common groupings across model classes (Figs. 2, 3).

#### 1) Turbulent sensible (*Q*_{H}) and latent heat (*Q*_{E}) flux

_{H}

_{E}

Typically surface resistance (or its inverse, conductance) schemes are used to model *Q _{H}* and

*Q*(“G” in Fig. 1, Fig. 2b). Depending on urban morphology, these consist of either single (G3) or multiple (G4) resistance networks, which account for the number of facets and layers that are resolved. Bulk models (G1) have the simplest resistance network (Figs. 1, 2b). A wide range of resistance schemes is used (e.g., Rowley et al. 1930; Clarke 1985; Zilitinkevich 1995; Guilloteau 1998; Harman et al. 2004b). To determine the resistance the wind profile within/above the canyon, roughness length, and displacement length or drag coefficients and atmospheric stability may be taken into account. Drag is either not considered or is calculated using roughness length, exponential wind profile, or distributed drag. Exchange between the canopy air and building surfaces may be parameterized by a roughness length approach or distributed sources of heat (generally in conjunction with a distributed drag approach).

_{E}The number of temperatures resolved, which drive the gradients, varies both for the surface and the air (within the canyon; “Z” and “A” respectively in Figs. 1 and 2b), and these are related to morphology and the number of vertical layers in the model. Many assume Monin–Obukhov similarity holds, which may not be applicable within the urban canyon or within the roughness sublayer (Roth 2000). However, given the lack of well-tested alternatives, currently, this may be the most appropriate approach.

As *Q _{H}* is calculated typically using surface temperature to force the gradient, a balance is inherent in the solution of surface temperature between the

*L*↑,

*Q*, and heat conduction. Depending on the model objective, performance may be improved for one flux at the expense of another. Models that use a combination method (P, Penman–Monteith, or combination-type approach) do not need to determine the surface temperature to calculate

_{H}*Q*, but still need to allow for the transport of heat away from the surface.

_{H}The approaches taken to model resistances (*G*), surface temperature (*Z*), and air temperature (*A*) result in a large number of combinations (Fig. 2c, expressed in GZA order). Here they are shown relative to the urban morphology classes (L1–L7; Fig. 1) and the vegetation class (Vn, Vs, Vi). The approach taken for each turbulent flux (*Q _{H}*,

*Q*) for the built (B) and vegetated (V) part of the surface are shown. It is clear that the earlier classifications (Fig. 2a) do not produce common characteristics for these fluxes. Given the wide range of approaches, these are not investigated in further detail in this paper. Subsequent analysis of a larger dataset will investigate this. For the calculation of

_{E}*Q*for the built (B) fraction of the surface, the two most common classes of the nine different combinations are

_{H}333: single-layer resistance (G3), surface temperature (Z3), and air temperature (A3); and

113: bulk resistance per facet (G1), surface temperature (Z1), and a single air temperature (A3).

For the vegetated surfaces the two most common classes for *Q _{H}* are N:

*Q*is not calculated; and 113. For

_{H}*Q*from built surfaces the predominant classes are N, 113, and 333; but also of note are those models that account for the evaporative loss of water in one time step immediately following precipitation with a fixed rate of evaporation (

_{E}*E*). For

*Q*from vegetated surfaces, the predominant classes are also N and 113. Two models, which do simulate

_{E}*Q*and

_{H}*Q*for vegetated area, account for evaporation from soil moisture only and not the loss of water through vegetation. In these cases the soil temperature and moisture profile are calculated using the approach of Tremback and Kessler (1985). In urban areas bare soil is rare, some sort of vegetation is most likely to be present.

_{E}#### 2) Energy balance closure

Not all models explicitly force or check that they have energy balance closure [i.e., that Eq. (1) holds; Fig. 2d]. Lack of closure may result from numerical instabilities or lack of precision in the code, from a lack of evaluation, or from inconsistent assumptions. Closure may be forced in a number of ways: through the calculation of Δ*Q _{S}* at the end of each time step as a residual, by updating the surface temperature of the facets, or by restricting the turbulent heat fluxes to the available energy (

*Q** − Δ

*Q*). Closure is an important issue when the land-surface scheme is part of a long-term climate model simulation; without it, there may be long-term bias in the model.

_{S}#### 3) Anthropogenic water flux and other capabilities

Water can be added to the urban environment by human activity. Water is released by combustion processes, cooling towers, and by people, which is equivalent to the *Q _{F}* release (anthropogenic latent heat flux). One model takes into account the loss of water through perspiration (a source of

*Q*). Given there are very few estimates of this term (Heiple and Sailor 2008; Moriwaki et al. 2008), and it is likely to be small in many settings, it is not surprising that it has not been included in most models. The term may be important in very dry areas (e.g., high-latitude cities in winter, hot dry cities) and in areas with excessive air conditioning. The second significant source of water comes via the pipe network, most typically as irrigation (e.g., garden sprinkling) or broken water pipes. In many suburban areas, if gardening is a common residential activity, this can be a large additional source of water relative to precipitation, especially during the summertime (e.g., Grimmond and Oke 1986; Grimmond et al. 1996). Estimating this component requires assumptions in the algorithms and/or the input data to define: 1) how much, and when, water is applied to the area and 2) where in the area it is released (e.g., to all vegetated surfaces or just to irrigated grass). The representation of this source is important (e.g., Mitchell et al. 2001) but has been considered in few models (e.g., Grimmond and Oke 1991).

_{E}The presence of snow cover will influence the energy balance of urban regions, affecting the albedo and, during periods of snowmelt, acting as a significant sink for latent energy (Lemonsu et al. 2008). For models with facets, the energy budgets of horizontal surfaces (roof and road) will be the most significantly affected, with additional energy budgets for these surfaces being necessary (Masson 2000).

### j. Model uniqueness

Using the 31 individual characteristics to classify the models compared (Fig. 2a), 26 unique combinations occur (Fig. 3). This varies between model capability and actual use (demonstrated here for a dataset termed VL92; see section 4). For example, 21 models have the capability to account for *Q _{F}* but only 7 utilize this capability for the VL92 application. Although there are preferred approaches (e.g., for

*Q*Tv over Tf), there is a notable diversity; models that have a similar approach for one aspect frequently use quite different approaches for other model components.

_{F}## 3. Model inputs

Inputs of three general types are required to model urban areas: 1) site parameters to describe the surface morphology and materials; 2) time series of atmospheric or forcing variables as boundary conditions; and 3) initial thermodynamic and moisture state conditions. The complexity of urban areas and diversity of surface description methods in the 33 models results in more than 145 (or >200 if individual layer values are considered) different parameters and state variables being needed for all of the models. The parameters fall into nine broad classes (Table 2). Some parameters, which are unique to individual models, can be derived from more basic parameters (Table 3; Fig. 4). Given the large effort needed to collect these data over the wide range of urban areas globally, or even within individual countries (e.g., Feddema et al. 2005; Ching et al. 2009), we encourage model developers to use common parameters. Also, it is important that the parameters are clearly defined and not open to misinterpretation (Loridan et al. 2010).

Morphometric parameters vary greatly, using either basic information (e.g., height, width) from which required parameters are calculated (e.g., canyon aspect ratio, sky view factor), or the “higher” level parameters as the inputs. Table 2 lists basic parameters from which higher-level parameters can be calculated.

Urban material related parameters are required to account for radiative transfer (e.g., albedo, emissivity) and thermal characteristics. Because of the different ways to describe the surface (Figs. 1, 2), there are varying numbers of models that use particular parameters (Table 2). All models use some form of albedo but this may be a single bulk albedo (*α _{b}*), or albedos for the roof (

*α*), wall (

_{f}*α*), road (

_{w}*α*) and so on. Thermal properties are specified explicitly either relative to mass (specific heat capacity) or volume (volumetric heat capacity), or implicitly from model lookup tables.

_{e}As noted in Fig. 2, *Q _{F}* is dealt with in a variety of ways. For those using a fixed value, a model parameter has to be specified or alternatively, internal building temperature may need to be specified (in Table 2, it is included under temperatures but could be specified under

*Q*).

_{F}Temperatures are required for many models. These may be initial state conditions (e.g., facet temperatures), which will evolve during the run, or require model spinup of sufficient time, or may be fixed for the duration of the run (e.g., deep soil temperature). In many applications, it is likely to be difficult to have realistic or observed values to meet the need for the temperature profile within a building or the soil to be prescribed. This may mean that some models require a long initialization period (spinup) to ensure that the temperature profiles are stable and representative of expected conditions.

For the models that use a vegetation tile, all the parameters required are not summarized in Table 2. Parameter values, based on class selection, have been determined for extensive nonurban vegetated areas, and are assigned through model lookup tables. Model users have selected the vegetation class (e.g., grassland, deciduous or evergreen woodland, and/or bare soil) that they think is most appropriate in relation to the urban region they are modeling.

Soil moisture characteristics require both initial values and fixed parameters. These state variables have similar constraints and implications to that of the temperature. As urban areas often have disturbed soils and additional materials mixed into the media, it may mean that adoption of rural soil physical properties for parameters is not appropriate.

## 4. The International Urban Energy Balance Model Comparison Project

The methodology adopted here follows that of PILPS (Henderson-Sellers et al. 1993), which provided insight into both the models and real world processes. This allows the relative importance of key parameters to be determined and an assessment of the level of complexity required to produce reliable results. The International Urban Surface Energy Balance Model Comparison Project has been endorsed by the Global Energy and Water Cycle Experiment (GEWEX) Global Land–Atmosphere System Study (GLASS) and World Meteorological Organization Expert Team on Urban and Building Climatology.

The procedure for the comparison requires individual modeling groups (users and/or developers) to run their model(s) offline using forcing data provided for the top of the model, as would be provided by an atmospheric model (Fig. 5). This implies that parameter values should be representative of the observational footprint (see discussion in Masson et al. 2002). There is no feedback to larger-scale conditions within the modeling domain, so no larger-scale advection can occur, as would be present in a mesoscale or larger-scale model. The temporal resolution of analysis is typically 30 or 60 min, but individual models may be run at higher temporal resolution (1.5–300 s) and then average or sample their data back to the specified time interval of analysis (60 min). The spatial scale for both the measurements and models is the local or neighborhood scale (10^{2}–10^{4} m). However there is no actual grid size because the models are run in single column mode. The observed fluxes and the forcing data are taken from tall towers that have the sensors located above the roughness sublayer (Grimmond and Oke 1999b; Roth 2000; Masson et al. 2002; Grimmond et al. 2004; Grimmond 2006). This height is equivalent to being above or at the blending height and is typically taken as the first atmospheric layer in mesoscale or larger-scale models (Fig. 5). The rationale for offline simulation is that although larger-scale circulation models may be accurate at the macroscale, their outputs will often be incompatible with those required as inputs to mesoscale urban surface models (Pitman et al. 1990). Equally, running such models offline prevents feedbacks between climate and land surface, meaning that the sensitivity of the land-surface schemes themselves can be examined while the overlying atmospheric conditions are effectively held fixed (Wilson et al. 1987; Henderson-Sellers and Dickinson 1992).

The forcing data provided to participants were collected from a light industrial site in Vancouver, British Columbia, Canada (termed here VL92; Voogt and Grimmond 2000; Grimmond and Oke 2002). All observational data have measurement errors. These are associated with instrumental errors, instrument siting, fetch, flux corrections, lack of energy balance closure, and neglected terms etc. (e.g., Offerle et al. 2005; Grimmond 2006). This dataset was chosen as it has been used previously by a number of groups to evaluate their models (Grimmond and Oke 2002; Masson et al. 2002; Best et al. 2006; Krayenhoff and Voogt 2007; Oleson et al. 2008a). This meant that parameter values were reasonably well known. Also, the observed fluxes were provided so no model/group had an advantage from previous knowledge of this data.

The observations used in the evaluation consist of *Q**, *Q _{H}*, and

*Q*plus Δ

_{E}*Q*determined as a residual (Grimmond and Oke 1999a). During the observations (14 days in August 1992), the area was in drought and there was an irrigation ban in the city that was adhered to (Grimmond and Oke 1999c). The area is characterized by little vegetation (<5% plan area cover) and the soil moisture was very low at the time of data collection (Grimmond and Oke 1999a,c), making

_{S}*Q*at this site small relative to the other fluxes (Table 4). The summertime conditions are expected to be associated with low

_{E}*Q*as the area did not have extensive use of air conditioning or other significant sources of

_{F}*Q*. This would be expected to be more significant in the winter but is not considered here as no observational data were available.

_{F}The purpose of this comparison is not to identify the best model, but to understand model errors related to the type of approach taken (Figs. 1, 2). Each model was assigned a random identifier number, which is used in the subsequent analysis of the results to ensure anonymity. The returned simulation data from each of these models were used to perform a series of statistical analyses to evaluate model performance (Table 5).

## 5. Results from VL92

Using the VL92 dataset, 33 different models/versions of models were analyzed (Table 1). Modeling groups assigned parameter values and initial state conditions they thought appropriate. Of the 33 participants, 20 chose to rerun their models subsequent to their initial submission and based on developments of their models during the period of the model comparison, thereby improving performance. Of those who did, 16, three, and one groups reran their models once, twice, and three times, respectively, with a decrease in the root-mean-square error (RMSE) in all cases except for the minimum values for *Q _{E}* and Δ

*Q*, which remain the same (Table 4). The remainder of this paper evaluates the performance based on the final run results only.

_{S}As noted, this site has been used to evaluate model performance in previous studies (Table 4). These evaluations are not directly comparable to the current data as the same forcing data were not used in all the studies, and the time periods are not consistent, unlike the current comparison where all models have followed an identical protocol. However, comparing those results with the “final” runs presented here we can see that the results are similar. As with the overall cohort of models participating in the International Urban Model Comparison, there is some suggestion that model performance may have improved in the current (final) runs.

For *Q** the models, on average, have a smaller systematic RMSE (RMSE* _{S}*) than unsystematic RMSE (RMSE

*; Table 4). However, the maximum RMSE*

_{U}*(81.9 W m*

_{S}^{−2}) is the same order of magnitude as the maximum RMSE

*(80.7 W m*

_{U}^{−2}), suggesting there are problems that could be fixed, for example, by changing parameter values. For

*Q*the mean and maximum RMSE

_{H}*are larger than the RMSE*

_{S}*, also suggesting that model results might be improved.*

_{U}The ranked performance of the individual models, based on RMSE calculated for the 312-h dataset, for the four fluxes is shown in Fig. 6. No individual model performs best or poorest for all fluxes. For each flux, when models are ordered from best to poorer performance, in the better performing models there are small differences in RMSE. However, there is a point of step drop in performance: for *Q** five models performing less well; for *Q _{H}*, 15 models show distinctly poorer performance.

The encouraging performance for *Q _{E}*, with small RMSE values and only two models performing noticeably poorer, is a function of its small flux (Table 4). When a normalized Taylor (2001) plot is considered (Figs. 6e–h), where the ideal model would fall at the square (the observations),

*Q*is the least well modeled (Fig. 6h). For

_{E}*Q**, the models cluster most closely to the observed value, except for the five outliers already identified. Again for

*Q**, all models have a correlation coefficient (

*r*) of greater than 0.95 except for one, which has an

*r*value of over 0.9. It is interesting that there is less of a step drop in Δ

*Q*model performance but an almost constant correlation coefficient for all models (∼0.9). Also

_{S}*Q*has an almost constant correlation coefficient for all models (∼0.9). Based on the index of agreement

_{H}*d*, on average model performance is best for

*Q**, followed by Δ

*Q*,

_{S}*Q*, and

_{H}*Q*(Table 4). This ranking is retained when the best overall performance (maximum

_{E}*d*) of any model for each flux is considered.

Models need to respond to changes in exchange processes through the course of the day. Of interest, for example, is whether they resolve peak radiant and turbulent heat fluxes during the day as well as fluxes at night when shortwave radiation does not need to be considered. When the data are analyzed by time of day, RMSE is larger during the day (Fig. 7) as expected because of its larger absolute magnitude. Figure 7 shows results for three time periods: 1) day (1 h after *Q** ≥ 0 W m^{−2}), 2) night (1 h after *Q** ≤ 0 W m^{−2}), and 3) transition (remaining hours when *Q** is going through 0 W m^{−2}). The five models with the largest RMSE for daytime *Q** (Fig. 7), are the same as those for all hours (Fig. 6), although the ranked order differs slightly. The transition hours are particularly problematic for these models. The two poorest performing in the daytime are among the six-poorest performing at night.

The observed fluxes of *Q _{H}* may be underestimated on some occasions because of advection caused by sea breezes (Masson et al. 2002). For

*Q*the daytime errors are largest. At night the models generally do well almost across the board but the absolute values of the fluxes are smaller (Fig. 7). The daytime RMSE for

_{H}*Q*is larger than for

_{H}*Q** for all models. The RMSE

_{S}tends to be greater for

*Q*than for

_{H}*Q**. For the most poorly performing models, RMSE

_{s}is generally larger than RMSE

_{U}(Fig. 7, circles plot above triangles).

Using the model classifications (Figs. 1, 2) we can evaluate whether particular approaches result in clear improvements in performance. It should be noted that the options used by groups were not always their most complex (cf. capability with VL92 options used in Figs. 2, 3). Two sets of statistics are used: RMSE and the mean bias error (MBE) for day and night (Figs. 8, 9) with results for each model shown as a point for each class and category (Fig. 2). The range, interquartile range (IQR), and mean and median performance of the category within the class can be compared. Perfect performance would have an RMSE and MBE of 0 W m^{−2}. Given the relative magnitude of the MBE for nighttime *Q _{E}* (<|12| W m

^{−2}), these results are not considered further here.

First, the method to represent vegetation (V class 1) is considered. Of the 18 models that have the ability to include vegetation as a separate tile (Vs; Fig. 2), five did not. Six additional models have integrated vegetation (Vi) within their urban surface. For the VL92 runs, a total of 14 models do not consider vegetation (Vn). The IQR of RMSE (bars on Fig. 8) is smaller in the daytime for *Q**, *Q _{H}*, and Δ

*Q*when vegetation is included as a separate tile (Vs). In the daytime, not including vegetation (Vn) results in the largest RMSE medians (

_{S}*Q*= 181, Δ

_{H}*Q*= 136,

_{S}*Q** = 59, and

*Q*= 36 W m

_{E}^{−2}) and MBE medians (

*Q*= 158 W m

_{H}^{−2}, Δ

*Q*= −107,

_{S}*Q** = 42, and

*Q*= −28). For daytime

_{E}*Q*and

_{H}*Q**, the tiled approach (Vs) has the smallest RMSE (median = 71 and 46 W m

^{−2}, respectively) and MBE (median = 18 and −14 W m

^{−2}, respectively), whereas the integrated vegetation (Vi) has the lowest individual RMSE values for

*Q** and Δ

*Q*. For daytime

_{S}*Q*, the RMSE and MBE are best for Vi (median = 27 and 3 W m

_{E}^{−2}, respectively). At night for

*Q*, the performance is poorest for those models that assume a separate tile (Vs; median RMSE = 19 W m

_{H}^{−2}, MBE = 17 W m

^{−2}) and best for Vi models (median RMSE = 14 W m

^{−2}, MBE = −2 W m

^{−2}). However, for

*Q**, Vi and Vn have similar performance (median RMSE Vi = 16, Vn = 19; MBE: Vi = −11, Vn = −8 W m

^{−2}).

Examining the combination of model characteristics (Fig. 3) shows that for those that do not take into account vegetation, Vn, share only one common characteristic: their calculation of Δ*Q _{S}* via conduction or net radiation (class 8, Sc, Sn). However, many models that do include vegetation (Vs) also use this approach to heat conduction (Sc), so this is not likely to be a primary coexplanation. Not including vegetation even in this area where there is very little, and where the measured

*Q*is small relative to the other fluxes, appears to impact the ability to model

_{E}*Q** and

*Q*, with a resulting poor performance also in Δ

_{H}*Q*.

_{S}The VL92 site also has low *Q _{F}*. Most groups assumed it is negligible (ANn) with only seven groups explicitly including the flux (Fig. 2). Those that have considered it have taken a wide range of approaches but because of the small numbers they are grouped together into one class for analysis (ANm). Similarly, different temporal approaches to modeling

*Q*(Tf, Tv) are used but the small number of models per class means analysis is the same and so is not shown separately. In the daytime, median RMSE and MBE are smallest for all fluxes when

_{F}*Q*is ignored (ANn). This differs for nighttime fluxes however, where ANm models have the smallest RMSE and absolute MBE for all fluxes except

_{F}*Q*.

_{E}The model combinations in Fig. 3 show that those models that use an internal temperature (ANi) tend to have a fixed or variable temporal variation in *Q _{F}* (class 3, Tf, Tv), an urban morphology that is multilayered (L4–L7), and a surface albedo–emissivity that has three or more facets (class 7, AEf).

The urban morphology (class 4, L) has a relatively large within-class difference (range of median RMSE: 98 W m^{−2} and MBE: 130 W m^{−2}) for the daytime *Q _{H}*. For both RMSE and MBE, there is no clear best performer of models across all fluxes (best for RMSE median:

*Q** = L3,

*Q*= Lm and L1,

_{H}*Q*= L3, Δ

_{E}*Q*= L1; best for MBE median: Δ

_{S}*Q*= L1,

_{S}*Q*= Lm,

_{H}*Q** = L1,

*Q*= L1; Figs. 8, 9). At night, multilayer models (Lm = L4–L7) perform best for

_{E}*Q**,

*Q*, and Δ

_{H}*Q*based on MBE (median

_{S}*Q** = −1,

*Q*= −1, Δ

_{H}*Q*= 6 W m

_{S}^{−2}). The urban morphology classes have few common characteristics, although all L1 models use a single reflection and a bulk albedo and emissivity. In addition, and by definition, L3 and all Lm models have three facets for albedo and emissivity.

With respect to the categorization based on facets and orientation (FO class 5), the largest difference is for the simulation of daytime *Q _{H}* (difference between category medians ΔRMSE of 96 W m

^{−2}, ΔMBE = 129 W m

^{−2}). Those that treat the surface as a whole (FO1) have the lowest daytime RMSE for

*Q*and Δ

_{H}*Q*(although for

_{S}*Q*, median RMSE for FO1, FOo, and FOi differ by <8 W m

_{H}^{−2}while it is lowest for FOo and

*Q** and for FOi and

*Q*). At night, the lowest median RMSE is:

_{E}*Q** = FOo,

*Q*= FO1, and FOn,

_{H}*Q*= all groups equal, Δ

_{E}*Q*= FOn. There is no consistency in groupings with the smallest daytime MBEs (

_{S}*Q** = FOn,

*Q*= FOi,

_{H}*Q*= FO1, Δ

_{E}*Q*= FOo). Except for

_{S}*Q**, during the daytime, models that simulate a canyon but have no associated orientation (FOn), have the largest biases (

*Q*: positive bias,

_{H}*Q*and Δ

_{E}*Q*: negative bias) and these are likely to be complementary. At night, models that incorporate orientation and intersections (FOi) have the smallest bias, again except for

_{S}*Q**, where it is FOo models (although differing by just 1 W m

^{−2}when compared with FOi). In the daytime, for

*Q*, the median RMSE performance improves from FOn, FOo, FOi, and FO1 (165, 77, 74, and 69 W m

_{H}^{−2}, respectively); and for

*Q**, improves from FOi, FOn, FO1, and FOo (67, 52, 46, and 43 W m

^{−2}). The unique combinations that these categories of models have in common include those that treat the surface as a whole (FO1), have no anthropogenic heat fluxes calculated (ANn) and, obviously, have just a slab surface morphology, single reflections, and a bulk albedo and emissivity. Models that include orientation (FOo, FOi) all assume three or more facets for albedo and emissivity (AEf; as would be expected). Models without orientation (FOn) largely utilize conduction methods to calculate the storage heat flux (Sc).

When the models are classified based on the number of reflections used, there are large within class differences (ΔRMSE = 89 W m^{−2} for daytime *Q _{H}*; Fig. 8). This is also the largest difference for the MBE (ΔMBE 109 W m

^{−2}; Fig. 9). During the day, models with a single reflection scheme (class 6, R1) perform best for all fluxes except

*Q*(median RMSE Δ

_{E}*Q*= 98,

_{S}*Q*= 73,

_{H}*Q** = 46 W m

^{−2}). The daytime MBE is smallest for

*Q** models that calculate single reflections (Rs; median MBE

*Q** = −14). Generally, during the daytime the models that have infinite reflections (Ri) perform least well for

*Q*and

_{H}*Q*, (median MBE

_{E}*Q*= 147,

_{H}*Q*= −27 W m

_{E}^{−2}; median RMSE

*Q*= 162,

_{H}*Q*= 35 W m

_{E}^{−2}); there are also negative median MBEs for all classes for

*Q*and Δ

_{E}*Q*, while

_{s}*Q*and

_{H}*Q** have a positive bias, with the exception of the single reflection class and

*Q**. This suggests that the single reflection models may not allow enough radiation to be absorbed in comparison with observations. For Δ

*Q*and

_{S}*Q*, RMSE increases with the number of reflections modeled.

_{H}At night, models using increasing numbers of reflections have smaller RMSE for *Q** (*Q**: Ri = 13, Rm = 20, R1 = 28 W m^{−2}); whereas, the situation reverses for *Q _{H}* and

*Q*, with those modeling fewer reflections yielding better results (

_{E}*Q*: Ri = 27, Rm = 18, R1 = 17 W m

_{H}^{−2}). For the calculation of

*Q** at night, the Ri type models perform best with the lowest median RMSE and MBE (RMSE = 13, MBE = 4 W m

^{−2}). However, as for daytime, superior performance for one flux is accompanied by poorer performance in another. All approaches have a similar sized negative MBE for nocturnal Δ

*Q*(median from −21 to −22 W m

_{S}^{−2}). The MBE for single reflections suggests that the surface temperature is too high, but correcting the bias during the daytime is likely to increase the nocturnal surface temperature, so there may be other issues with the models that use this method. Compensation also occurs between

*Q** and

*Q*most particularly at night. All schemes with infinite reflections (Ri) have three facets for albedo and emissivity (AEf).

_{H}The differences within groups of models are amongst the greatest when stratified based on specification of albedo/emissivity (class 7, AE). In general, using a bulk albedo–emissivity (AE1) results in better performance for all fluxes during the day based on median RMSE and MBE (median MBE Δ*Q _{S}* = −23,

*Q** = 3,

*Q*= 10,

_{E}*Q*= 28 W m

_{H}^{−2}). Models using two facets (AE2) tend to have the poorest daytime performance (except for

*Q** where median MBE for all groups is similar). At night, the differences in median MBE are smaller (

*Q*: AE2 = 14, AEf = 16, AE1 = 9;

_{H}*Q**: AE1 = −14, AEf = −13, AE2 = −17 W m

^{−2}). In this evaluation, where buildings are small and widely spaced, the ability to distinguish different facet characteristics of albedo and emissivity is not important. However, where buildings are taller, more tightly spaced, and/or constructed with very different materials, this result may not necessarily be the same. It is also important to remember that depending on the intended application, the ability to change facet material characteristics may be very important; for example, for scenario testing (e.g., for urban heat island mitigation).

Classifying models based on method used to calculate Δ*Q _{S}* (S class 8) has a relatively small difference in the median RMSE and MBE for all fluxes. Again the biggest difference in performance is associated with daytime

*Q*(52 and 74 W m

_{H}^{−2}for RMSE and MBE). The daytime

*Q** differences are 6 and 12 W m

^{−2}for RMSE and MBE, respectively; these are the smallest within-group differences in median for

*Q** across the classes. The residual method (Sr) performs better during the daytime for all fluxes except

*Q** daytime (median MBE:

*Q*= 27,

_{H}*Q** = 6,

*Q*= −11, Δ

_{E}*Q*= −30 W m

_{S}^{−2}) and for all nighttime fluxes (median MBE:

*Q*= 11,

_{H}*Q*= 4, Δ

_{E}*Q*= −5,

_{S}*Q** = −5 W m

^{−2}). The Sc models often assume three facets (AEf) without orientation (FOn).

If the 31 different classes are considered, the best performance during the daytime for *Q** is from the FOo class (median RMSE of 43 W m^{−2}). There are two classes with an absolute median MBE of ≤3 W m^{−2} (L1, AE1). There are six models with both these characteristics (Fig. 3). For daytime *Q _{H}*, there are four classes with an MBE of <20 W m

^{−2}(Vs, Lm, FO1, FOi). There is only one model with all of these (viz., Vs, Lm, FOi). The best overall performance for daytime

*Q*, based on median RMSE, has a value of 69 W m

_{H}^{−2}(FO1), but there are seven other classes within 4 W m

^{−2}of this (Vs, Vi, Ls, Lm, R1, AE1, Sr) and three additional classes within 7 W m

^{−2}(ANm, FOo, FOi), thereby accounting for all seven major classes (Fig. 2). No models have all of these characteristics, while two have five of them but do not generally fall within the group of best-performing models.

At night, best performance for *Q** is associated with ANm, Lm, Ri (median RMSE 11–13 W m^{−2} and/or median MBE < |4| W m^{−2}) and for *Q _{H}* with Vi, Lm, and FOi (median MBE = −2, −1, and −6, median RMSE = 14, 27, and 27 W m

^{−2}, respectively). The Sr and Sc models have a similarly good RMSE (17–18 W m

^{−2}).

For daytime *Q _{E}*, best overall performance is from Ls, FO1, AE1, Vi, FOi, and R1 (median MBE < |10| W m

^{−2}). For daytime Δ

*Q*, models with median RMSE < 96 W m

_{S}^{−2}are Vi, Ls, FO1, AE1, and Sr but based on the absolute MBE, the best-performing models are FOo (median MBE = ≤|10| W m

^{−2}), and FO1, AE1, Sr, Vi, Lm, and Ls (median MBE < |30| W m

^{−2}). At night, Lm, ANm, FOi, Sr, and Vi models perform well based on median MBE and RMSE (<|4| and/or <22 W m

^{−2}).

## 6. Conclusions

Urban surface–atmospheric exchanges are modeled for a wide variety of applications. The large set of models, examined here, have a range of approaches, complexities, and parameter requirements. Through the first stage of the first international model comparison reported here, significant model developments have taken place and improvements in model performance have resulted.

Evaluation of 33 models, with Vancouver (VL92) data, shows that generally models have best overall capability to model *Q** and least capability to model *Q _{E}* (order

*Q**, Δ

*Q*,

_{S}*Q*, and

_{H}*Q*; Table 4). No model performs best or worst for all fluxes. In particular, it seems to be difficult to minimize both

_{E}*Q** and

*Q*errors. There is evidence that some classes of models perform better for individual fluxes but not overall. Typically, those that perform best during daytime do not perform best at night.

_{H}The daytime RMSE for *Q _{H}* is larger than for

*Q** for all but four models. These four are characterized as having amongst the four largest

*Q** RMSE values. For RMSE

*, there is the tendency for*

_{S}*Q*errors to be greater than for

_{H}*Q**, although there are more cases where the errors are similar. The unsystematic errors are generally smaller than systematic errors, particularly for the most poorly performing models. For most models,

*Q*has a positive MBE, which observational errors may contribute to.

_{H}Seven characteristics (relating to vegetation, *Q _{F}*, morphology, facets and orientations, reflection, albedo and emissivity, and Δ

*Q*) are used to classify each model. Some of the greatest differences in model performance are found between classes of model that treat vegetation and reflections differently. Some of the smallest differences relate to approaches used to calculate the Δ

_{S}*Q*followed by urban morphology. Not including vegetation, even at a site with limited vegetation, yields the poorest performance for all fluxes during the day (in terms of RMSE) and for

_{S}*Q*at night. During the day, median RMSE for models that do not include

_{E}*Q*is similar (or better) than for those that do. However, at night, median RMSE for models, which include

_{F}*Q*shows better performance for

_{F}*Q**,

*Q*, and Δ

_{H}*Q*. Models that account for urban morphology orientation, and also intersections, often have slightly better performance than schemes that do not (e.g.,

_{S}*Q*in the daytime). The addition of intersections, however, does not always improve performance appreciably and in some cases has a negative impact on model performance.

_{H}The results for reflection schemes vary between day and night and with statistical measure (RMSE or MBE). In general, using a bulk albedo–emissivity results in better performance for all fluxes during the day. Classifying based on method used to calculate Δ*Q _{S}* has the smallest difference in the median of the RMSE and MBE of all classes. The residual method performs better during the day for all fluxes, while at night, differences are less significant. Class combinations show no models display all characteristics associated with strongest performance, although two display a large proportion of these. In general, the simpler models perform as well as the more complex models based on all statistical measures.

These results are based on a short time series for one urban location. In phase 2, the same models will be evaluated using a second dataset (Grimmond et al. 2009b). These results raise a number of questions that will be considered, with different flux partitioning, a wider range of conditions, and a longer time series. Of particular interest is whether the same models and classes perform well; whether the relative ability to model the individual fluxes remain the same; and whether it is possible for any class of models to minimize errors in both *Q** and *Q _{H}*.

## Acknowledgments

Funded (Grimmond) by the Met Office (P001550), European Union Framework 7 (7 FP7-ENV-2007-1): MEGAPOLI (212520), BRIDGE (211345). This is a contribution to COST728 (Enhancing mesoscale meteorological modeling capabilities for air pollution and dispersion applications). We thank all involved in the original dataset collection (especially James Voogt and Tim Oke) and their funding agencies. Funding from BSIK-COM29 (Steeneveld), CATER 2006-2202 (Baik), and other agencies that support all the time contributed by participating groups is acknowledged. Color versions of the figures can be obtained by e-mailing corresponding author Sue Grimmond.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

*CTTC*model for predicting the air temperature in small urban wooded sites.

**,**

**,**

**,**(

**,**

**,**

**,**

## Footnotes

*Corresponding author address:* Sue Grimmond, Environmental Monitoring and Modelling Group, Department of Geography, King’s College London, London, WC2R 2LS, United Kingdom. Email: sue.grimmond@kcl.ac.uk