## Abstract

In this article, the authors examine the effect of a high-resolution grid (grid resolution lower than 3 km) in the context of a realistic climate simulation. For this purpose global simulation results of the German Weather Service were dynamically downscaled in a one-way nesting approach to 2.8 km using the regional forecast model of the Consortium for Small-Scale Modeling (COSMO) of the German Weather Service. The simulations were performed for the region of central Europe in 2003. In particular, the authors investigate whether COSMO adequately handles extreme events such as the persistent drought and heat of the summer of 2003. Comparisons of the simulated atmospheric conditions in terms of 2-m temperature, mean sea level pressure, and precipitation demonstrated a good correspondence to their associated observational data. Simulation results and observed data are both given as time series at locations such as grid cells or station locations. By cluster analysis a representation of the spatial structure for observation data and simulation results is found. The kappa statistic evaluates how well the two spatial structures correspond to each other. The different kappa variants are helpful to diagnose where shortcomings of the simulation results are located.

## 1. Introduction

With ever-increasing computing power it is possible for both regional weather and climate models to use spatial grids with a resolution of 5 km or lower and to obtain a longer simulation period. This is important for questions in agriculture meteorology (Mearns et al. 2001; Easterling et al. 2001), town and traffic planning (Grossman-Clarke et al. 2010), and reinsurance industry risk assessment (Mills 2009), which can only be satisfactorily answered by using spatially and temporally highly resolved models over a longer simulation time. The spatial and temporal representation of extreme events is especially important in these fields.

The simulations in this study were executed using the model of the Consortium for Small-Scale Modeling (COSMO) of the German National Meteorological Service (DWD) (Doms and Schättler 2002; Doms et al. 2007). The COSMO model is the nonhydrostatic operational weather prediction model applied and further developed by the national weather services joined in the consortium for small-scale modeling. Parallel to the development of COSMO for operational weather forecasts it was evaluated to what degree COSMO is suitable for climatological long time simulations (up to 30 years) and which steps have to be performed for reaching this goal (Böhm et al. 2006). These studies were performed by a joint effort of researchers at the Potsdam Institute for Climate Impact Research, the University of Cottbus, the Helmholtz-Zentrum Geesthacht, the Forschungszentrum Karlsruhe, and the DWD. First, regional climate projections have been performed for Europe with this climate version of the local model of the German Weather Service with a gridcell size of about 18 km (Hollweg et al. 2008). The analysis of these simulation results confirms that all investigated meteorological fields are physically consistent.

As preparation for highly resolved long-term climate simulations (25–30 yr) with a grid resolution of 2.8 km, a 1-yr simulation across Germany (see Fig. 1) for the whole year 2003 was performed to gain insight regarding the forecast quality and the ability of the model to correctly represent climatic extreme situations. For this we used simulation results of the global forecast model (GME) of the DWD as driving data. The global forecast model GME was developed by the German Weather Service. GME is the first operational weather forecast model that uses an icosahedral–hexagonal grid covering the globe and computes global weather forecasts for up to 7 days of forecast time (see Majewski et al. 2002). GME is a hydrostatical model with a mean horizontal resolution of 60 km (2003) and 31 layers. For this reason GME cannot resolve many local details of the topography that may have an important influence on the local weather. To forecast such local effects is the task of the COSMO model with a grid spacing of 7 or 2.8 km, respectively. The COSMO model is a nonhydrostatical regional model, which is driven by GME simulation results. For the simulations of this study version 4.15 of COSMO was used, which contains the developed adaptations for the climate mode as an optional feature (Schättler et al. 2009).

The simulation was executed in two steps. First, based on the GME analysis results a simulation (SIM_7) for Europe with a grid resolution of 7 km (see Table 1) was executed. The simulation result obtained was then used for a simulation for Germany with a grid resolution of 2.8 km (SIM_2.8; see Table 1). The focus of the result analysis were the months from June to August 2003. These months were the time frame for the well-known European heat wave of 2003 and were characterized by a long period of much-higher-than-normal temperatures (see Schönwiese et al. 2004; Stott et al. 2004). This event is an example for a weather pattern that on statistical average does not occur more often than once every century and is therefore called a once-in-a-hundred-years event.

The validation of simulation results was accomplished using daily measurements of the meteorological and climatological observing network of the DWD. The validation variable T2m indicates the daily mean air temperature 2 m above ground corrected to sea level. The validation variable PREC indicates the accumulated precipitation per day. The validation variable PMSL indicates the daily mean pressure. Here parameter PREC is highly variable in space (a few hundred meters to several hundred kilometers) and time (a few minutes to several days). In contrast T2M is a parameter with a smaller spatial variability. Both parameters are of major importance for an analysis of climate and weather patterns. Temperature and precipitation patterns are closely associated with circulation patterns like PMSL. The mean sea level pressure analysis is one of the most familiar methods for measuring atmospheric model quality. For this reason the validation data used include additionally PMSL.

The focus of the validation of the simulation results is the analysis and evaluation of the differences between simulation results and observation data by a multivariate method. For this a combination of cluster analysis and the kappa statistic (Cohen 1960) is used. Clustering algorithms have been applied to a wide variety of research problems in meteorology–climate science, that is, to define periods of similar meteorology (Fernau and Samson 1990) or to analyze model structures versus observation structures (Gordon et al. 2005).

Using cluster analysis, spatial patterns of observational data and simulation results are created that can consequently be analyzed for differences by variants of the kappa statistic. By cluster analysis grid points with similar meteorological conditions can be determined and ordered into clusters. This way a spatial structure for the chosen observation area in a certain time period is found that is characteristic for that region. This spatial structure will in general be somewhat different for the observation data and the simulation results. The kappa statistic gives us a measure of how well the spatial structures of the observation data and the simulation results coincide. This statistic is relatively rarely used for the evaluation of meteorological data because these data are usually numerical and the kappa statistic cannot be used to evaluate numerical data. However, the kappa statistic is applicable if meteorological data are assessed on a nominal scale, for example by cluster analysis and subsequent comparisons of the spatial cluster structures (Kücken et al. 2009).

The clustering provides a qualitative comparison between observation data and simulation results that is easy to interpret visually, has a clear meteorological meaning, gives information on the spatial structure of both observation data and simulation results, and can be quantified using the kappa statistic. The kappa statistic can also be used to objectively compare the performance of models for different variables and thus give valuable clues into model shortcomings. For instance, if the kappa statistic for precipitation is much larger than the kappa statistic for temperature, we can conclude that the model represents precipitation much better than temperature. Such comparisons for the model performance of different variables cannot as easily be reached using standard error measures. To summarize, cluster analysis and the kappa statistic are not meant to replace standard error measures but to augment them for a more profound analysis of discrepancies between observation data and simulation results and to gain more specific insight into model shortcomings.

In section 2, both observation data and simulation results are described that are compared in this study. Section 3 presents the simulations with their corresponding model parameterizations. In section 4 the validation results of the different parameters and the application of the validation technique is demonstrated. Section 5 contains conclusions and an outlook to future work.

## 2. Observation and simulation datasets

### a. Observation data

The validation of the simulation results is accomplished by verified and homogenized observational data stored at the Potsdam Institute for Climate Impact Research, which are based on measurements of the meteorological and climatological observing network of the DWD (Österle et al. 2006). The stations of the observing network form an irregular grid across Germany. The validation period starts on 1 January 2003 and ends on 31 December 2003. For T2M and PMSL a daily average value is used and for PREC the daily sum is used. Other important model parameters, for example, heat fluxes, cannot be compared with observation data because the necessary observation data are not available. For the validation T2M is corrected to sea level by the normal lapse rate [0.65 K (100 m)^{−1}] following the standard definition of the World Meteorological Organization (WMO 1957) in order that all grid points refer to a common altitude. PMSL is also corrected to sea level by extrapolating the measured pressure hydrostatically wherever the surface elevation is greater than 0.01 m; otherwise the mean sea level pressure is simply set to the measured pressure. PREC is used for the validation as measured in millimeters: 1 mm corresponds to exactly 1 L m^{−2}. In 2003 the network of weather stations in Germany consisted of 270 main meteorological watch offices and 2072 secondary weather and precipitation stations. For the validation of T2M and PMSL the data of the 270 main stations were used. Because of the high spatial variability of PREC it was necessary to choose a much larger number of stations. The validation of this quantity was performed using the 2342 main and secondary precipitation stations.

### b. Simulation results

Two simulation runs with the limited-area model COSMO are necessary to obtain simulation results with a grid resolution of 2.8 km. Based on the GME data as initial/boundary data, a SIM_7 simulation run was performed. The results of the simulation SIM_7 were used as initial/boundary data for the following simulation SIM_2.8. Both simulations were executed on a horizontal Arakawa-C grid (Arakawa and Lamb 1977). Here scalars like PREC, T2M, and PMSL are defined at the center of a grid box. The simulations were executed on a rotated grid to limit changes in the horizontal resolution. COSMO uses a rotated spherical coordinate system. In this system, the pole is tilted and can be positioned such that the equator runs through the center of the model domain. Thus, problems resulting from the convergence of the meridians can be minimized for any limited area model domain on the globe (Doms and Schättler 2002; Dutton 1976).

Both simulations use the Lorenz grid for vertical discretization (Herzog et al. 2002). A description of the simulation domains for SIM_7 and SIM_2.8 is given in Table 1.

## 3. Model and experiment descriptions

### a. General considerations

To study climate properties on a simulation grid with a resolution of 2.8 km, two preprocessing steps and two simulation steps have to be executed. In the following we will explain these steps in detail.

### b. The preprocessing program int2lm

The simulation results of GME are given on an icosahedral–hexagonal grid with 60-km resolution. For the SIM_7 simulation GME delivers the initial and boundary fields necessary to perform the simulation. Then, the results of the SIM_7 simulations are used as initial and boundary fields for the SIM_2.8 simulation.

In both cases we use the “interpolate to local model” (int2lm) program, version 1.14, which performs the interpolation from coarse grid input fields to initial and/or boundary fields for the COSMO model (Schättler 2009). The int2lm program uses a linear or a bilinear interpolation technique.

### c. The regional model COSMO

The nonhydrostatic fully compressible COSMO model has been developed to meet high-resolution regional forecast requirements of weather services. The COSMO model is based on the primitive thermo-hydrodynamical equations for momentum, mass, and heat describing compressible flow in a moist atmosphere. The solution of these equations is contained in the dynamical core of the model. An excellent model description (Doms and Schättler 2002) and its application (Schättler et al. 2009) are accessible on the Internet. The finite-difference model uses a centered scheme in space, a Runge–Kutta scheme in time, and an energy–enstrophy conserving Arakawa-C grid. The model equations are formulated in rotated geographical coordinates and a generalized terrain-following height coordinate. A variety of physical processes are taken into account by parameterization schemes (Doms et al. 2007).

### d. Parameterization of the simulation runs

Mathematically, the solution of the differential equations of the dynamic core of the model is the solution of an initial value–boundary value problem. For a model run of a few days the initialization problem is dominant. But for simulation times of several months or years the treatment of the boundary problem becomes increasingly more important. For that reason it is possible in COSMO to use different parameterization modes. In the climate mode, which is used for the simulation runs, no data assimilation takes place but vegetation parameters are continuously updated. The COSMO model provides a flexible simulation framework allowing detailed parameterizations of a simulation, so the user can customize the simulation runs on his exploratory focus (Schättler et al. 2009; see Tables 1 and 2). Here, Table 1 contains information about the simulation domain and the grid specification. Table 2 provides information on the particular numerical schemes to solve the model equations and the basic subgrid-scale physical processes both for the SIM_7 simulation and for the SIM_2.8 simulation. Cumulus convection has a major impact on the vertical structure of the temperature and the moisture fields of the atmosphere. However, convective processes operate on horizontal scales that are much smaller than those resolved by large-scale and mesoscale numerical weather prediction models. Thus, the only way to represent the overall effect of moist convection in these type of models is by means of parameterization. Some basic effects of moist convection have to be considered by cumulus parameterization schemes. These convection schemes are not independent from the spatial grid resolution. For that reason we use different schemes for the 2.8-km-resolution grid and the 7-km-resolution grid (see Table 2).

All information on the variables at the lateral boundaries and their time evolution must be specified by an external dataset. In this paper, boundary data from the operational hydrostatic global model GME are supported. A simple and effective solution to avoid the generation of numerical noise is to apply a sponge to the model variables within a relaxation zone close to the boundaries. In this zone, the variables of the high-resolution model are gradually modified to blend them with the driving model variables. In this way, the information transfer problem is solved, since information near the lateral boundaries is no longer generated by the high-resolution model but determined by the values of the low-resolution driving models (Davis 1976, 1983; Doms et al. 2007).

## 4. Validation methods und results

### a. General considerations

The validation is based on the comparison between observation data and simulation results. Meteorological observing stations are local, that is, the measured results represent the conditions at the measuring location only. Usually the stations are chosen such that they are representative for the surrounding region. The simulation results are defined at the center of an Arakawa-C grid box. Because of the different validation grid structures (regular versus irregular), the spatial location of simulation results and corresponding observations do not match exactly. To reach a compromise between both spatial representations, for the comparison of simulation results and observation data a nearest neighbor approach was used. The value in the center of the Arakawa-C simulation cell was compared to the value at the observation point inside the simulation cell that is closest to the center of the simulation cell. Considering that the grid spacing is only 2.8 km, it follows that the distance between the simulation cell center and the observation point is always less than 2 km. A tolerance of less than 2 km will be permissible for this validation approach. This tolerance should be taken into consideration when evaluating data, especially for locations with steep terrain within a scale of 2 km.

This paper evaluates the description of weather phenomena for only one specific year spatially and temporally. For this reason the daily means of the observation data and the simulation results were used. Higher-aggregated values such as monthly or yearly averages as they are commonly used in climate research would have been too coarse to resolve the summer heat wave of 2003. Further, a larger number of data ensure better statistics. Finally, it is obvious that a model that validates well for daily values will also validate well for monthly values. As the validation period we consider at first the entire year 2003 from 1 January to 31 December 2003 (*P*_{jan–dec}). Additionally, for the evaluation of seasonal factors, we investigate three time periods from 1 January to 31 May 2003 (*P*_{jan–may}), from 1 June to 31 August 2003 (*P*_{jun–aug}), and from 1 September to 31 December 2003 (*P*_{sep–dec}).

### b. The validation method

For the evaluation of the COSMO we use the observation data *O _{ij}* and the simulation results

*S*. Here

_{ij}*j*is the index for the location (range from 1 to 270 for T2M and PMSL or 2342 for PREC), and

*i*is the index for the time (range from 1 to 365). For the differences between

*O*and

_{ij}*S*we use several error measures that describe the differences between time series. Such measures are the mean absolute error (MAE) [see Eq. (A1)], the root-mean-square error (RMSE) [see Eq. (A2)], and the systematic error (bias) [see Eq. (A3)]. In contrast to the bias the MAE measures the differences of corresponding validation points but without considering their direction. The MAE means that all the individual differences are weighted equally in the average but the RMSE gives an increasingly larger weight to growing errors. They can both be used together with the bias to include the error variations in the reference analysis of a set of time series.

_{ij}An extension to these comparisons at a certain location point are evaluations of the spatial patterns of the time series *O _{ij}* and

*S*. As a first step to reach this goal we apply a hierarchical cluster algorithm to cluster the time series data (either PREC or T2M or PMSL) associated with grid points of

_{ij}*O*. The result is a dendogram consisting of a tree that tells how time series are agglomerated at every link in the tree. The major limbs of this tree identify the major object groupings. In contrast to climatological time series over many years, we cannot assign well-known climatological cluster definitions to the observation data of just one year. Therefore it is necessary to identify the number of clusters that allows a representative characterization of these data. A suitable number of clusters can be found from a plot of the cluster number versus clustering distance (see appendix B). It is the number of clusters where the sharp decrease in clustering distance that is observed for small cluster numbers flattens out as the number of clusters is increased. This point is also called “elbow” (see Backhaus et al. 2003 for details). In the next step both

_{ij}*O*and

_{ij}*S*are classified according to these clusters using a

_{ij}*K*-means clustering. A detailed description of the clustering steps is given in appendix B.

As the result of the *K*-means clustering a spatial pattern of clusters is obtained for the simulation results and for the observation data. For a physical characterization of the cluster centroids Tables 3–5 show the clusters with their associated time series characteristics like mean value (Mean), median (50% Per), standard deviation (Std dev), minimum (Min), 5th percentile (5% Per), maximum (Max), and 95th percentile (95% Per).

Using the kappa statistic the difference between these two spatial patterns is quantified by using the kappa coefficient *κ*_{cohen} [see Eq. (C1)]. A more detailed evaluation of the differences between the two patterns is accomplished by the derived values *κ*_{histo} [see Eq. (C3)] and *κ*_{location} [see Eq. (C2)]. A full description of these kappa derivatives can be found in Hagen (2002) and Pontius (2000). The term *κ*_{histo} represents the correspondence in the cluster histograms of both patterns, and *κ*_{location} is a measure for agreement between two corresponding clusters. These two variables give further insight into how the cluster patterns differ from each other. For example, a complete agreement in the cluster sizes can be combined with a complete disagreement in the spatial relations of the patterns. In this case the *κ*_{cohen} value would only detect a disagreement of the pattern but would not point toward the reason for the disagreement. Here *κ*_{histo} can be understood as a measure for correspondence for the entire validation area similar to the error measure BIAS. The term *κ*_{location} is a measure for local correspondence similar to the error measures RMSE or MAE. More details can be found in appendix C.

### c. Validation results of parameter PMSL

In Table 6 a summary of the statistical quantities relevant for the evaluation of the correspondence between simulation and observation for PMSL is presented. Such measures are the domain average of the observation time series averages, the corresponding error measures, and finally the kappa statistic. Figures 2 and 3 show the clustering of the evaluation domain and differences in the cluster distribution spatially (Fig. 2) and for each cluster (Fig. 3).

The *κ*_{cohen} index for the whole year 2003 (*P*_{jan–dec}) indicates a good correspondence between simulations and observations. The differences that are detected by *κ*_{location} and *κ*_{histo} are mainly due to the fact that the observation dataset is represented by nine clusters of which only four (1, 2, 8, and 9) can be found in the simulation results. There are four large clusters (1, 2, 8, and 9) each larger than 10% of all points and five small clusters (3, 4, 5, 6, and 7) each less than 10% of all points. In the simulation results all the small clusters vanish and become part of the adjoining large clusters.

Because the remaining five clusters are small no large impact on *κ*_{location} is generated. Apparently, the circulation processes are structured more subtly in the observations than their representation in the model. The magnitude of the error measures confirms a good correspondence of the circulation structures between simulation results and observation data for the whole year. The correspondence is not as good for the heat wave in the summer of 2003 (*P*_{jun–aug}) where all kappa values are lower.

### d. Validation results of parameter T2M

Table 7 shows a summary for the statistical analysis of T2M. The *κ*_{cohen} value for the whole year suggests that the agreement between observations and simulations is generally too low, which is mainly due to the low *κ*_{histo} value. Figures 4 and 5 show the clustering of the evaluation domain and differences in the cluster distribution spatially (Fig. 4) and for each cluster (Fig. 5). As seen in Fig. 5 the cluster sizes coincide for clusters 1 and 2 only, all other clusters differ significantly in size, in the case of cluster 6 there is even a drastic change. The error measures MAE and RMSE are also larger than an acceptable upper limit of 1 K. In contrast to the MAE, the bias shows a better agreement both for the whole year and the other validation periods. The reason for the unsatisfying result is the summer period, for which the kappa values are especially low and the error measures especially large. In the *P*_{jan–may} and *P*_{sep–dec} periods errors mostly even out as is shown by a low bias. However, the bias of the summer period is large indicating that a shift toward exaggerated temperatures has taken place in the simulations.

For large parts in southern Germany (mainly between 47° and 49°N; see Fig. 4) a drastic shift up to 3 clusters (shift from 3 to 6) is observed. This corresponds to an increase of the temperature in the simulation results in this area on average (Mean) by 2.4 K (see Table 4). This effect mainly concerns regions with a steadily increasing elevation profile until the Alps region at the southern border of Germany. This overestimation of the simulated temperature does not occur for cluster 6 in the observation data (containing for example the Upper Rhine valley). Here realistic simulation temperature values are obtained and the cluster number does not change in the simulation results.

In contrast, the north German plain and Germany’ s coastal regions show an extensive agreement between the clusters of the observation data and the simulation results. As a result of this, the warm temperature bias of southern Germany is mostly compensated by the zero or cold bias of the temperature simulation for northern Germany.

For *P*_{jun–aug} the bias (absolute value) between the observations and simulations for the temperature (1.4 K) is much larger than the corresponding bias for the other time periods. This implies significant lower *κ*_{histo} and *κ*_{location} values for the summer months. However, we also have to take into account that the observation average for *P*_{jun–aug} was already 3.4 K above the mean related to the 1960–90 period (Schönwiese et al. 2004). The character of the summer period as an extreme event is still adequately described in the simulation results.

### e. Validation results of parameter PREC

Table 8 contains a summary of the statistical quantities describing how well observation and simulation patterns of PREC correspond to each other. All validation measures evaluate the simulation precipitation values as good or even very good. However, if we focus on different sections of the year the evaluation is more differentiated. In the first part of the year (*P*_{jan–may}) the precipitation in the simulation results is significantly too large (by 15.9%). In contrast, in the summer period (*P*_{jun–aug}) the precipitation is much too small (by 23.4%). Only in the last section of the year (*P*_{sep–dec}) the precipitation is modeled satisfactorily. Therefore, the good overall result for the year is at least partly due to averaging of periods with too large and too small precipitation values.

Figures 6 and 7 show the clustering of the evaluation domain and differences in the cluster distribution spatially (Fig. 6) and for each cluster (Fig. 7). Figure 7 shows large differences in the sizes of some clusters. In clusters 1, 6, and 7 the simulation underestimates the precipitation, whereas in clusters 3, 4, and 5 the simulation overestimates precipitation. The *κ*_{location} value is also lower for the summer 2003 when compared with the time periods before and after it. Obviously the convective precipitation that is dominating for the summer could not be localized accurately since these precipitation events are highly localized. Figure 6 shows the spatial distribution of the observation clusters and the differences to the simulation. Areas with a larger orographic gradient and with a larger precipitation are more affected by a cluster change than the low plains in northern Germany. An exception to this is cluster 8, which covers the Black Forest. This cluster is not shifted, but the precipitation in the simulations at the cluster borders is too large. This may be caused by the large orographic gradient between the Upper Rhine valley and the Black Forest.

## 5. Summary and conclusions

The goal of this work was answering the question how accurately meteorological phenomena over a time period of one year can be reproduced by a regional model. The results reached in this work suggest that the model is suitable for very highly resolved long-time climate simulations. This work is a case study only because sufficient simulation time for multirun experiments was not available. For the estimation of statistical effects in the simulation such studies are necessary. Using a single climate scenario for impact studies, no quantified probability can be attached to the simulated output. Multirun methods are used to quantify uncertainties associated with model parameters (i.e., see Jackson et al. 2004). Candidates for such an investigation in COSMO are all numerical model parameters.

As the result of a hierarchical cluster analysis the dendogram is used to investigate how the grid points of a climate pattern are assigned to clusters for an increasing number of clusters. The elbow criterion described in appendix B defines the number of clusters that is used for the nonhierarchical cluster analysis. It represents a standardized technical criterion identified for each validation parameter to be assessed.

Tables that categorize kappa values solely on their numerical value in absolute evaluation categories (like good, fair, or bad) are not used. Only a difference of more than 0.1 in the magnitude of kappa values is evaluated as significant. Whether the kappa index of a comparison between simulations and observations is sufficient or not does not only depend on its value but should always be evaluated in context with the error measures mentioned in appendix A. The value *κ*_{histo} combined with BIAS is most suited for the evaluation of differences in the entire observation domain. If the focus of the evaluation is more on the differences for single points in the validation domain, the *κ*_{location} value combined with RMSE is a better measure. This means *κ*_{location} is more appropriate for the evaluation of weather forecasts, whereas *κ*_{histo} is more meaningful for climate simulations over a longer time period and for a larger area. Furthermore, the different kappa values allow the possibility of evaluating different model parameters by a single value. In this way contradictions in the simulation results can be uncovered.

PMSL is represented sufficiently well by the model for the whole year. The lower kappa values during the summer period, especially in southern Germany, indicate deviations in the circulation pattern. However, if we additionally take the error measures into account these deviations are only minor.

T2M in the simulations for the year 2003 is too large, especially in southern Germany. The main reason is an overestimation of the temperature in the summer period. From Fig. 5 we can see that a shift from cluster 3 to cluster 4 and from cluster 5 to cluster 6 has taken place. Cluster 6, which has a percentage of only 4% in the observation data, is much larger in the simulation results (28%). These shifts in the histogram are reflected in the *κ*_{histo} value for the whole year. For the time periods before and after the summer a better simulation result was achieved, but even here the error measures RMSE and MAE with a value greater than 1 K are too high. The bias value for the whole year is acceptable, but the corresponding *κ*_{histo} value indicates that the temperature simulations should be improved. In the periods *P*_{jan–may} and *P*_{sep–dec} T2M is simulated realistically, but the simulation for the extreme summer 2003 is overall not satisfying, especially in southern Germany. The extent of this extreme event as represented by the surface temperature mean is overestimated by 1.4 K. Nevertheless, both observation data and simulation results represent this period as an extreme temperature event.

The simulation of PREC for the entire year is fair. However, this is not the case for both the *P*_{jan–may} period and the *P*_{sep–dec} period. In the period *P*_{jan–may} the precipitation is overestimated by the simulation by 16%; for the *P*_{jun–aug} period we have an underestimation of 23%. In both cases lower kappa values are found. For the *P*_{sep–dec} period good results have been reached. The precipitation is only underestimated by 3% and large kappa values were obtained. As in the case of T2M, we also note in this case that larger deviations in certain time periods are averaged out in the whole year.

## Acknowledgments

The authors thank three anonymous reviewers for their helpful comments. Furthermore, the authors are thankful to Michael Kücken (Max-Planck-Institute for the Physics of Complex Systems, Dresden, Germany) for valuable pieces of advice and editorial help.

### APPENDIX A

#### Definitions of Error Measures

The following notation is used for the definitions of error measures that follow: *n* is the time series dimension per grid point, *k* is the number of grid points, *O _{ij}* is the observation value at time point

*i*of the time series at grid point

*j*, and

*S*is the simulation value at time point

_{ij}*i*of the time series at grid point

*j*. These definitions of error measures include the mean absolute error,

the root-mean-square error,

and the systematic error,

### APPENDIX B

#### Cluster Analysis

##### a. Step 1: Scaling

This is the transformation of observation data and simulation results from the interval [xmin, xmax] into the interval [0, 1], where xmin is assigned to 0 and xmax to 1. Here, xmin is the smallest data value from all-time series of both observation data and simulation results, and xmax is the largest data value from all-time series of both observation data and simulation results.

##### b. Step 2: Hierarchical cluster analysis

The purpose of the hierarchic cluster algorithm is the determination of an optimal cluster number that is later used in step 4, the nonhierarchical cluster analysis. Given the elements of the matrix of observations *O _{nk}* (

*n*time series dimension per grid point and

*k*number of grid points) a symmetric

*k*×

*k*matrix (distance matrix) is computed such that each matrix element

*X*represents how dissimilar the

_{ij}*i*th and

*j*th grid points are. For the distance matrix Euclidean squared distance is used as a distance measure. The computation of the distance matrix is performed using the module g03eac of the Numerical Algorithms Group (NAG) C Library (NAGLIB 2002). Given the distance matrix a hierarchical tree is produced using agglomerative clustering and starting with

*k*clusters (each containing a single object). At each step of the analysis it is checked which two clusters are closest to each other, which are then united into one cluster. This procedure concludes when only one cluster remains. The distance between cluster agglomerations is computed by the minimum variance method. The hierarchical cluster analysis is performed by using the module g03ecc of the NAG C Library (NAGLIB 2002), which yields a dendogram.

##### c. Step 3: Determination of a suitable number of clusters

By analyzing the results of hierarchical clustering it is possible to determine the distance in the hierarchical tree at a certain cluster number. For this purpose we use the module g03ejc of the NAG C Library (NAGLIB 2002). This module analyzes information from the dendogram and produces the corresponding distance for a given number of clusters. The distance is evaluated for cluster numbers between 2 and 50 and the result is plotted as a function of the cluster number. The plot is smoothened and scaled to the interval [0, 1]. An analysis of the curve reveals that the cluster distance strongly decreases as the number of clusters is increased for a small number of clusters. However, at some point the curve becomes much flatter and there is only a slight decrease in cluster distance as the number of clusters continues to increase. The point where the large decrease changes into a small decrease we call “elbow,” and it is considered to be the optimal cluster number (see Fig. B1). Every cluster is represented by its individual cluster center (centroid).

##### d. Step 4: Nonhierarchical cluster analysis

After having determined the optimal number of clusters a nonhierarchical algorithm [*K*-means clustering, module g03efc of the NAG C Library (NAGLIB 2002)] is performed. In contrast to the hierarchic algorithm the *K*-means clustering realizes the assignment of both observation data and simulation results to the initial cluster centers in a uniform manner. The nonhierarchical algorithm is initialized with the cluster centers calculated in step 3. The *K*-means method uses the variance criterion to ensure compatibility with the hierarchical cluster analysis. It is important that both observation data and simulation results refer to the same cluster centroids to enable meaningful comparison of both cluster patterns.

##### e. Step 5: Numerical labeling of clusters

For further applications the clusters from the nonhierarchical cluster analysis are numerically ordered according to the median values of the cluster centroid time series. That means cluster number 1 represents the time series which is driest (PREC), coldest (T2M), or the one with the lowest pressure (PMSL), and cluster *c* (highest cluster number) contains the time series with the wettest (PREC), warmest (T2M), or the one with the largest pressure (PMSL). The shift between corresponding clusters of simulation results and observation data can be compared numerically (Figs. 2, 4, and 6).

### APPENDIX C

#### The Kappa Statistic

The kappa statistic is used to assess and quantify the agreement between observation data and simulation results (Kücken et al. 2009). Cohen’ s index of agreement (Cohen 1960) does not make distinctions among various types and sources of disagreement. Therefore derived kappa coefficients are introduced, which describe the differences in the histogram (*κ*_{histo}) and the differences in the spatial structure of the clusters (*κ*_{location}) [see, e.g., Pontius (2000) and Hagen (2002)]. This enables a more detailed analysis of the reasons for differences between observation data and simulation results. Some authors applied ordinal categories derived from the value of the kappa index. For example, Altman (1991) suggests a scale as shown in Table C1.

In this work the authors mainly use the continuous scale of kappa coefficients combined with the error measures that were introduced in appendix A.

The starting point for the computation of the kappa coefficients is the construction of a contingency table. The structure of this matrix is described in Table C2 according to Monserud and Leemans (1992). The kappa statistic requires a balanced contingency table where the cluster categories used by the observation clusters (represented in the columns) are directly comparable to the cluster categories used by the simulation clusters (represented in the rows). Unbalanced tables (see Fig. 3) are corrected according to a method described by Crewson (2001). From the contingency table the following statistical parameters are computed, where *c* is the number of clusters and *k* is the number of grid points. The agreement between observations and simulations is

the expected agreement between observations and simulations is

the maximal agreement between observations and simulations is

Cohen’s kappa coefficient is

the kappa derivative with respect to the location is

and the kappa derivative with respect to the histogram is

Note that κ_{cohen} = κ_{location} × κ_{histo}.

## REFERENCES

_{2}direct effects