## 1. Introduction

Over two dozen different climate models contribute to the ongoing mission of the Intergovernmental Panel on Climate Change (IPCC), whose aim is to provide reliable estimates of future climate change to the public. The projections in the Fourth Assessment Report (AR4), the IPCC’s most recent, were mostly based on simple multimodel averages over the different participating models (Meehl et al. 2007a). The underlying assumption here and in similar studies is that models are more or less statistically independent from each other (e.g., Abramowitz and Gupta 2008) and that averaging over the individual models will cancel out nonsystematic errors. Probabilistic approaches of combining multimodel projections also require the assumption of model independence in order to produce results analytically (Furrer et al. 2007; Tebaldi et al. 2005). However, if the assumption of model independence in the above examples is not met, the resulting predictions are likely to be biased toward some artificial consensus. In this case, the effective number of models, as defined by the amount of statistically independent information in the simulations, is less than what is suggested by the actual number of models.

Another undesirable result of having a relatively low effective number of models is that the uncertainty in multimodel-derived climate projections could be significantly underestimated. Considering too many similar models in the calculation of the standard error of the ensemble mean leads to unrealistically narrow confidence intervals. Underestimating uncertainty has important consequences, for example on climate impact studies, which rely on a realistic understanding of the range of potential climate outcomes (Knutti 2008).
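The effect on confidence intervals can be illustrated with a short worked expression. This is a sketch under simplifying assumptions that are not stated in the text: every model error has the same standard deviation σ, and the effective number of models simply replaces the actual number in the standard error formula. The value *M*_{eff} = 8 is purely illustrative.

```latex
\mathrm{SE}_{\mathrm{naive}} = \frac{\sigma}{\sqrt{M}}, \qquad
\mathrm{SE}_{\mathrm{eff}} = \frac{\sigma}{\sqrt{M_{\mathrm{eff}}}}, \qquad
\frac{\mathrm{SE}_{\mathrm{eff}}}{\mathrm{SE}_{\mathrm{naive}}}
  = \sqrt{\frac{M}{M_{\mathrm{eff}}}}
  = \sqrt{\frac{24}{8}} \approx 1.73
```

Under these assumptions, a confidence interval computed with the nominal ensemble size would be too narrow by a factor of roughly 1.7.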

Previous research has shown clear evidence that the current generation of models considered by the IPCC has common biases and thus violates the assumption of independence (Reichler and Kim 2008a; Jun et al. 2008b). A plausible explanation for this lack of independence is the fact that models are designed in similar ways, for example by utilizing similar resolutions, numerical schemes, and parameterizations. In actuality, models often even share large parts of the same code (Pincus et al. 2008), and some institutions contribute simulations from more than one version of the same model. This is certainly true of the third Coupled Model Intercomparison Project (CMIP3; Meehl et al. 2007b) dataset, where model output is generally accepted from any center willing to participate. This dataset has been described as an “ensemble of opportunity,” which is likely to include shared model biases (Tebaldi and Knutti 2007).

In the present study, we seek to determine how statistically independent the models in the CMIP3 ensemble actually are from one another. To proceed, we develop the concept of an effective number of models. Although this concept may appear intuitive, some complications arise. First, one must clearly state what is meant by “model independence.” It does not mean that models produce different solutions, but rather that models arrive at their respective solutions in unique ways. Next, one must develop a suitable metric. In other words, under what criteria is such a metric constructed? Possible measures could consider a model’s error or the magnitude of its simulated variability, but there are countless defensible choices. Although they are all perhaps sensible, they may not necessarily lead to similar conclusions. A further difficulty lies in how robust a particular estimate of ensemble similarity is given the limited amount of available data.

In what follows, we explore similarity in the CMIP3 ensemble by establishing a measure of error that relates to how well these models simulate present-day mean climate (sections 2 and 3). Then, we describe two distinct methods that aid in quantifying the degree of similarity within the ensemble (section 4). Next, we present our results based on these two methods and we explore their sensitivity with respect to particular models and quantities (section 5). Finally, we summarize our results and discuss the potential impacts on current strategies for ensemble prediction (section 6).

## 2. Data

In this study, we examine climatological mean (1979–99) data based on the output of *M* = 24 climate models from the twentieth-century experiments (20C3M) of the CMIP3. For each different model, we consider the simulation of *Q* = 35 climate quantities during each of the four seasons [December–February (DJF), March–May (MAM), June–August (JJA), and September–November (SON)]. The different quantities, which are shown in Table 1, are chosen based on the availability of suitable observations as well as standard practices in climate model evaluation.

Table 1. Climate quantities used in this study. Acronyms listed in the fifth column are commonly used in the literature to denote specific observational datasets. The average of all available observations for one quantity is taken for model evaluation.

We first calculate the differences between simulated (*f*) and observed (*o*) fields on a uniform grid, which are expressed for the model, *m*; grid point, *n*; season, *s*; and quantity, *q*, as (*f*_{m,n,q,s} − *o*_{n,q,s}). We normalize these differences based on the observed standard deviation of the interannual variability at grid point *n*, *σ*_{n,q,s}, written as

$$e_{n,m} = \frac{f_{m,n} - o_{n}}{\sigma_{n}}, \tag{1}$$

dropping the *s* and *q* subscripts for clarity here and hereafter. These differences are then nondimensional and comparable across quantities. We emphasize that, although all subsequent analyses are performed separately for the four seasons, we will primarily focus on annual means of the results, given by the mean over all seasons. In addition, we examine results individually for three regions of interest: the northern extratropics (30°–90°N), the tropics (30°S–30°N), and the southern extratropics (90°–30°S).
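As an illustration, the normalization described above can be sketched in Python. The function name, argument names, and toy values are hypothetical and are not taken from the study; the sketch assumes that the simulated field, the observed field, and the observed interannual standard deviation are already on the same grid.

```python
import numpy as np

def normalized_error(sim, obs, sigma_obs):
    """Nondimensional error (sim - obs) / sigma_obs at every grid point.

    sim, obs, sigma_obs: arrays of identical shape for one model, quantity,
    and season. sigma_obs is the observed interannual standard deviation.
    """
    sim, obs, sigma_obs = (np.asarray(a, dtype=float) for a in (sim, obs, sigma_obs))
    if np.any(sigma_obs <= 0):
        raise ValueError("sigma_obs must be positive everywhere")
    return (sim - obs) / sigma_obs

# Toy example with three grid points (made-up values)
e = normalized_error([288.0, 275.0, 301.0],
                     [287.0, 276.0, 300.0],
                     [0.5, 1.0, 2.0])
# e is [2.0, -1.0, 0.5]: errors measured in units of interannual variability
```

Dividing by the observed variability is what makes errors in different quantities, such as temperature and precipitation, directly comparable.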

For model *m*, the errors in (1) form spatial patterns expressed as vector **e**_{m} = (*e*_{1,m}, *e*_{2,m}, … , *e*_{N−1,m}, *e*_{N,m}), where *N* is the number of grid points in the regional domain. It is well established that climate models have similar biases and that these biases result in correlated error patterns (e.g., Reichler and Kim 2008a; Jun et al. 2008b; Knutti et al. 2010). Similar biases are typically characterized by the multimodel error pattern (MME), which can be written as

$$\bar{\mathbf{e}} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{e}_{m}.$$

Removing the relevant portion of the MME from each model error field yields modified error fields, **d**_{m}. These fields are constructed by standardizing both the model error fields and the MME, and then subtracting the scaled standardized MME pattern from the standardized model error field; that is,

$$\mathbf{d}_{m} = \mathbf{e}_{m}^{*} - r\,\bar{\mathbf{e}}^{*},$$

where $\mathbf{e}_{m}^{*}$ and $\bar{\mathbf{e}}^{*}$ denote the standardized model error field and the standardized MME, respectively, and *r* is the correlation between the *m*th model’s error field and the MME. For simplicity, we refer to the construction of **d**_{m} fields as “removing the MME” or “controlling for the MME.” The correlation between the MME and each individual error pattern is now zero by construction. The result of removing the MME on the GFD21 and MRICM model errors can be seen in the bottom row of Fig. 1. Removing the MME tends to make model errors more dissimilar from each other, as exemplified by the correlations between the two models before (36%) and after (−13%) this procedure. The resulting collection of such *M* × *Q* error patterns will hereafter be referred to as SPATIAL-data.
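The procedure of removing the MME can be sketched as follows. This is a minimal illustration, assuming unweighted statistics over grid points (no area weighting) and synthetic error fields; all function and variable names are hypothetical.

```python
import numpy as np

def standardize(x):
    """Return x with zero mean and unit standard deviation over grid points."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def remove_mme(errors):
    """Remove the multimodel error pattern from each model error field.

    errors: array of shape (M, N), one row per model.
    Returns an (M, N) array whose rows are uncorrelated with the MME.
    """
    errors = np.asarray(errors, dtype=float)
    mme_std = standardize(errors.mean(axis=0))
    d = np.empty_like(errors)
    for m, e in enumerate(errors):
        e_std = standardize(e)
        r = np.mean(e_std * mme_std)      # correlation of the standardized fields
        d[m] = e_std - r * mme_std        # subtract the scaled standardized MME
    return d

# Synthetic example: six "models" sharing a common bias plus independent noise
rng = np.random.default_rng(0)
shared = rng.standard_normal(500)
errors = shared + rng.standard_normal((6, 500))
d = remove_mme(errors)
mme = errors.mean(axis=0)
residual_corr = [np.corrcoef(d_m, mme)[0, 1] for d_m in d]  # zero up to round-off
```

Because both fields are standardized, scaling the MME by the correlation coefficient removes exactly the part of each error field that is linearly related to the MME.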

## 3. Calculating model similarity

Figure 2 demonstrates the effects of removing the MME in SPATIAL-data for all quantities and models. The top curve (Model vs. MME) shows correlations between individual model errors and the MME [i.e., the correlation between **e**_{m} and the MME pattern].

Fig. 2. Error pattern correlations by quantity for the northern extratropics; results for the other regions are similar (not shown). Top curve (Model vs. MME) shows annually averaged correlation between model error and multimodel error. Bottom two curves depict annually averaged correlation in errors among different models when the multimodel error is retained (With MME) and when the multimodel error has been removed (Without MME). Quantity labels are ordered by increasing correlations (With MME).

Citation: Journal of Climate 24, 9; 10.1175/2010JCLI3814.1

The middle curve (With MME) in Fig. 2 shows the mean over all correlations between model pairs when the MME is retained [i.e., corr(**e**_{i}, **e**_{j})]. The correlations are now smaller (~20% to 60%) but still quite positive. The largest correlations are found in precipitation quantities (pr, prw), cloudiness (clt), and surface temperature (ts). One can show that these correlations are roughly the square of the ones seen in the top curve, suggesting that similarities between model and MME imply strong correlations between model pairs.

The bottom curve (Without MME) shows the mean over all correlations between model pairs in SPATIAL-data when the MME is removed [i.e., corr(**d**_{i}, **d**_{j})]. In this instance, controlling for the MME produces correlations that are near zero, indicating that the effects of the MME have largely been mitigated. A slight negative bias results because the MME is an average of model error and, therefore, bears some likeness to individual model error fields. As the number of models in the ensemble increases, this bias tends toward zero because the resemblance between the MME and individual model errors decreases. We note that removing the MME greatly reduces the regional variation across quantity and that there is little seasonal variation among correlations (not shown).

For SPATIAL-data, model error patterns are used directly to formulate an estimate of model similarity. For example, if the strength of the linear relationship between two models’ error fields is large, then we will interpret the two models as being similar.

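In the analyses that follow, we work with Fisher’s *z* transformed correlation coefficients (i.e., corresponding *z* values). The Fisher’s *z* transformation (Wilks 2006) as a function of correlation coefficient, *r*, is defined as

$$z(r) = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right). \tag{2}$$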
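For reference, the standard Fisher *z* transformation, one half of the natural logarithm of (1 + *r*)/(1 − *r*), can be written in a few lines of Python. The function name is hypothetical; the transformation is equivalent to the inverse hyperbolic tangent.

```python
import numpy as np

def fisher_z(r):
    """Fisher z transformation of a correlation coefficient, valid for |r| < 1."""
    r = np.asarray(r, dtype=float)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

z_half = float(fisher_z(0.5))   # approximately 0.549
```

The transformation stretches correlations near ±1, which makes the sampling distribution of the transformed values closer to Gaussian than that of the raw correlations.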

## 4. Defining the effective number of models

The effective number of models (*M*_{eff}) is defined in the following way: *M*_{eff} equals one if an ensemble consists of completely correlated error structures since the model members’ error fields have identical features; alternatively, if all error fields are uncorrelated, then *M*_{eff} equals the actual number of models (*M*) in the ensemble. To measure *M*_{eff}, we utilize two methods that are both based on determining the effective degrees of freedom or effective sample size from a given dataset. We apply these methods to our SPATIAL-data, providing us with two different estimates of *M*_{eff}. We tested these two methods, along with others, on artificial datasets and found both to be particularly reliable in consistently meeting expectations (not shown). The two methods are outlined below.

### a. Z method

Our first method (*Z* method) has been proposed previously in the literature (van den Dool and Chervin 1986; Wang and Shen 1999). This method employs an inverse procedure based on analytical properties of the Fisher’s *z* transformed correlation coefficient distribution (van den Dool 2007). As explained in the literature, if two independent variables are Gaussian distributed, then their Fisher’s *z* transformed correlation coefficient, or *z* value, is approximately Gaussian distributed with variance 1/(*M* − 3).

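For the *Z* method, we first calculate the sample variance of the *z* values as

$$s_{z}^{2} = \frac{1}{K-1}\sum_{(u,\upsilon)}\left(z_{u,\upsilon} - \bar{z}\right)^{2},$$

where $\bar{z}$ is the mean of the *z* values and *K* denotes the total number of unique (*u*, *υ*) pairs. Then, *M*_{eff} is estimated by equating $s_{z}^{2}$ with the theoretical variance 1/(*M*_{eff} − 3), which gives

$$M_{\mathrm{eff}} = \frac{1}{s_{z}^{2}} + 3.$$

Because the *z* values are symmetric about some mean value, larger similarities among model errors will subsequently result in larger sample variability, thereby reducing *M*_{eff}.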
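The inversion at the heart of the *Z* method can be checked on synthetic data. The sketch below assumes the estimator takes the form 1/var(*z*) + 3, which follows from the variance 1/(*M* − 3) quoted above, and uses the unbiased sample variance; the names and the random test data are hypothetical.

```python
import numpy as np

def meff_z_method(z_values):
    """Effective sample size from the variance of Fisher z values.

    Inverts var(z) = 1 / (M_eff - 3).
    """
    z = np.asarray(z_values, dtype=float)
    return 1.0 / z.var(ddof=1) + 3.0

# Synthetic check: K pairs of independent Gaussian vectors of length M
rng = np.random.default_rng(1)
M, K = 24, 20000
a = rng.standard_normal((K, M))
b = rng.standard_normal((K, M))
a -= a.mean(axis=1, keepdims=True)
b -= b.mean(axis=1, keepdims=True)
r = (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
z = np.arctanh(r)               # Fisher z transformation
meff = meff_z_method(z)         # close to M = 24 for independent data
```

For independent data the estimate recovers the true sample size approximately; any dependence among the elements widens the distribution of *z* values and lowers the estimate.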

Using SPATIAL-data, we correlate error vectors located at two different grid points. Defining a vector at grid point *n*, consisting of *M* different model error elements as **g**_{n} = (*d*_{n,1}, *d*_{n,2}, … , *d*_{n,M−1}, *d*_{n,M}), the correlation between two vectors located at unique grid points *u* and *υ* can be expressed as *r*_{u,υ} = corr(**g**_{u}, **g**_{υ}). The corresponding *z* value, *z*_{u,υ} = *z*(*r*_{u,υ}), is then one sample from the distribution considered by the *Z* method. When calculating these correlation coefficients from the error patterns, we exercise caution. Two nearby grid points will produce a spuriously high correlation since there is significant spatial dependency in these data. Some quantities, such as precipitation (pr), exhibit smaller spatial dependency than others, such as upper-air temperature (t200). To produce more accurate *M*_{eff} estimates, it is necessary to only consider gridpoint pairs that are located a large enough distance apart as to minimize the effects of spatial autocorrelation. Examining spatial error patterns, we calculate decorrelation length scales (DCLSs) based on the distance for which autocorrelation extends below a threshold of 1/*e*. A representative DCLS is constructed by averaging the DCLS for all available models. We only correlate two grid points when they are located at a distance greater than one DCLS unit apart from each other.
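The selection of sufficiently distant gridpoint pairs can be sketched as follows. This is a simplified illustration: it assumes the autocorrelation is already available as a function of separation distance, uses linear interpolation to locate the 1/*e* crossing, and places the grid points on a line. None of these choices is specified in the text, and all names are hypothetical.

```python
import numpy as np

def decorrelation_length(distances, autocorr):
    """First distance at which the autocorrelation drops below 1/e."""
    distances = np.asarray(distances, dtype=float)
    autocorr = np.asarray(autocorr, dtype=float)
    threshold = 1.0 / np.e
    below = np.nonzero(autocorr < threshold)[0]
    if below.size == 0:
        return np.nan                      # never decorrelates in this range
    i = below[0]
    if i == 0:
        return distances[0]
    # Linear interpolation between the two samples bracketing the threshold
    x0, x1 = distances[i - 1], distances[i]
    y0, y1 = autocorr[i - 1], autocorr[i]
    return x0 + (y0 - threshold) * (x1 - x0) / (y0 - y1)

def admissible_pairs(dist, dcls):
    """Index pairs (u, v), u < v, separated by more than one DCLS."""
    u, v = np.triu_indices(dist.shape[0], k=1)
    keep = dist[u, v] > dcls
    return list(zip(u[keep].tolist(), v[keep].tolist()))

# Toy autocorrelation that decays exponentially with a 1000 km scale
x = np.arange(0.0, 3001.0, 300.0)
dcls = decorrelation_length(x, np.exp(-x / 1000.0))

# Four grid points on a line (positions in km) and a 1000 km exclusion radius
coords = np.array([0.0, 500.0, 1500.0, 3000.0])
dist = np.abs(coords[:, None] - coords[None, :])
pairs = admissible_pairs(dist, 1000.0)
```

Only the pairs returned by `admissible_pairs` would contribute *z* values to the variance estimate, which limits the inflation caused by spatial autocorrelation.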

### b. EIGEN method

For our second method (EIGEN method), we consider the eigenvalues that result from an eigenanalysis of the correlation matrix **R**, an *M* × *M* matrix composed of correlation coefficients. The element of **R** at (*i*, *j*) is defined as *r*_{i,j} = corr(**d**_{i}, **d**_{j}).

Collecting the *M* eigenvalues that result from the eigenanalysis of **R**, *M*_{eff} can then be calculated as

$$M_{\mathrm{eff}} = \frac{\left(\sum_{i=1}^{M}\lambda_{i}\right)^{2}}{\sum_{i=1}^{M}\lambda_{i}^{2}},$$

where *λ*_{i} represents the *i*th eigenvalue. If model error structures are collectively independent, then all eigenvalues will have the same value and *M*_{eff} = *M*. However, if all error structures are identical, then there will exist only one nonzero eigenvalue and *M*_{eff} = 1. Here, *M*_{eff} is bounded inclusively between one and the number of models *M*.
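The two limiting cases described above can be verified numerically. The sketch assumes an estimator equal to the squared sum of the eigenvalues divided by the sum of their squares, a common eigenvalue-based form that reproduces both limits; the function name and the test fields are hypothetical.

```python
import numpy as np

def meff_eigen(fields):
    """Effective number of fields from the eigenvalues of their correlation matrix.

    fields: array of shape (M, N), one row per model error pattern.
    """
    R = np.corrcoef(np.asarray(fields, dtype=float))
    lam = np.linalg.eigvalsh(R)            # R is symmetric, so eigvalsh applies
    return lam.sum() ** 2 / np.sum(lam ** 2)

n = np.arange(8)
# Three mutually uncorrelated patterns (cosines of different wavenumbers)
uncorrelated = np.array([np.cos(2 * np.pi * k * n / 8) for k in (1, 2, 3)])
# Three patterns that differ only by a positive scale factor
identical = np.outer([1.0, 2.0, 0.5], np.sin(n))

meff_uncorrelated = meff_eigen(uncorrelated)   # equals the number of fields, 3
meff_identical = meff_eigen(identical)         # equals 1
```

Intermediate cases fall between these limits: the more the variance concentrates in a few eigenvalues, the smaller the estimate.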

## 5. Results

In the first part of this section we present the results for the effective number of models. In the second part, we provide a breakdown of model similarities as a function of model and quantity in order to shed additional light on our findings.

### a. Effective number of models

We now derive two different estimates for the effective number of models, *M*_{eff}, in the CMIP3 ensemble by applying our two methods (section 4) to our SPATIAL-data (section 3). More precisely, we calculate *M*_{eff} for an increasing number of *M* models, ranging between 3 and 24, which allows us to quantify the amount of statistical independence for ensembles of various sizes.

Figure 3 presents the outcomes for the (a) *Z* and (b) EIGEN methods. We estimate *M*_{eff} by creating 100 ensembles of randomly selected models (i.e., bootstrap without replacement) for an increasing number of *M*. We repeat this procedure for each quantity and season. The gray lines in Fig. 3 indicate the outcomes for each of 35 individual quantities, averaged over all trials and seasons. Sampling only 100 ensembles is sufficiently precise because we fit the empirical results to semilogarithmic functions. The *M*_{eff} estimates for individual quantities differ widely (ranging between ~3 and ~15 for *M* = 24), indicating that results derived from single quantities are not representative for overall model similarity. Averaging these fitted curves over all quantities produces the thick solid black curve shown in both panels of Fig. 3. The 95% confidence bounds displayed as gray shadings around the solid black curve are generated by averaging over the 35 × 4 (quantities × seasons) individual confidence intervals.
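The resampling and curve-fitting steps can be sketched as follows. The semilogarithmic form with coefficients *a*, *b*, and *c* is taken from the caption of Fig. 3; the least-squares fit, the helper names, and the synthetic curve are assumptions made for illustration only.

```python
import numpy as np

def subsample_mean(estimator, fields, size, n_trials=100, seed=0):
    """Average an M_eff estimator over random sub-ensembles drawn without replacement."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        idx = rng.choice(fields.shape[0], size=size, replace=False)
        values.append(estimator(fields[idx]))
    return float(np.mean(values))

def fit_semilog(M, meff):
    """Least-squares fit of M_eff = a + b*ln(M) + c*M; returns (a, b, c)."""
    M = np.asarray(M, dtype=float)
    design = np.column_stack([np.ones_like(M), np.log(M), M])
    coef, *_ = np.linalg.lstsq(design, np.asarray(meff, dtype=float), rcond=None)
    return coef

# Recover known coefficients from a noise-free synthetic curve
sizes = np.arange(3, 25)
synthetic = 1.0 + 2.0 * np.log(sizes) + 0.1 * sizes
a, b, c = fit_semilog(sizes, synthetic)
```

In practice, `estimator` would be one of the two *M*_{eff} methods, and the fit would be applied to the trial-averaged estimates for each ensemble size.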

Fig. 3. (a),(b) Effective number of models as a function of actual models (northern extratropics). Thick solid lines are averages over all quantities and models, and gray shading indicates 95% confidence intervals. Thick dotted line is for same-center models only, and dashed–dotted line is for excluding six same-center models. Thin gray lines are quantity-specific estimates. Dashed curve in (b) indicates including the multimodel error. Straight diagonal line shows *M*_{eff} = *M*. All curves are fits of the actual data to semilogarithmic functions [*M*_{eff} = *a* + *b*ln(*M*) + *cM*], and extrapolating addresses the occasional problem of missing data.

The two methods reveal that the effective number of models is substantially lower than the number of actual models. Averaged over all quantities, *M*_{eff} is only about 9 (Z method) and 7.5 (EIGEN method) for all 24 models. Though these results only apply to the northern extratropics, we arrive at similar estimates for the other two regions (not shown). Both methods produce slightly higher estimates (by roughly two models for *M* = 24) in the tropics. In other words, the models are more dissimilar over the tropics than over the extratropics.

The curves in both panels in Fig. 3 depict a characteristic concavity over *M*. This concavity indicates that procedurally adding new models provides less and less new information as the ensemble grows. To some degree, diminished returns from adding new models are to be expected since the chance that an added model will have commonalities with the preceding models increases with ensemble size. In other words, this behavior is a consequence of biases shared across models and is qualitatively consistent with results found in Knutti et al. (2010).

Extrapolating from the semilog equation defining the average effective number of models, we can speculate as to how much unique information could be gained by adding another hypothetical CMIP3 model. For the *Z* and EIGEN methods, adding one model yields information increases of ~1% and ~1.5%, respectively. If the model simulations were independent, we would expect a value of 4.2% (i.e., 1/24).
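The extrapolation can be expressed as a relative increase in the fitted curve when one model is added. The sketch below assumes that "information increase" means the relative change in *M*_{eff} between 24 and 25 models; the function name is hypothetical, and the fitted coefficients of the study are not reproduced here.

```python
import math

def relative_gain(a, b, c, M):
    """Relative increase in M_eff = a + b*ln(M) + c*M when going from M to M + 1."""
    def fitted(m):
        return a + b * math.log(m) + c * m
    return (fitted(M + 1) - fitted(M)) / fitted(M)

# Independent models correspond to M_eff = M, i.e. a = 0, b = 0, c = 1
independent_gain = relative_gain(0.0, 0.0, 1.0, 24)   # 1/24, about 4.2%
```

A concave fit, with a positive logarithmic term and a small linear term, yields a gain well below this independent reference value.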

Contrasting the outcomes from the two methods (Figs. 3a and 3b), we find that they arrive at quite similar results despite the differences in their respective approaches. Only for small ensembles does the *Z* method regularly produce unrealistically high estimates, which are likely due to the uncertainties inherent in utilizing small sample sizes.

We also applied our methods to the error fields that were not controlled for the MME, denoted by **e**_{m} above. For the EIGEN method, the dashed curve in Fig. 3b shows the results when the MME is retained. In this instance, the *M*_{eff} estimates are now far lower and range between three and four in all three regions. This result implies that model similarities are largely encapsulated by the MME. For the *Z* method, retaining the MME only slightly impacts the results and is therefore not shown. The reason for the slight difference is that the MME is subtracted by construction via the correlation operation.

### b. Sensitivity to specific models and quantities

The results from the previous section raise a number of important questions. For example: What causes the considerable reduction of the effective number of models? What is the contribution of individual models and quantities to this reduction? And, what role do models from the same institution play? In the following, we try to shed some light on these questions.

Going back to Fig. 3, the effective number of models determined from individual quantities differs widely. Upon closer inspection, we find that quantities associated with “smooth,” large-scale error fields (e.g., t200, chi, psi) tend to produce small estimates while the opposite holds for quantities associated with “noisy,” small-scale error fields (e.g., pr, va, v850). In other words, there exists an inverse relationship between the characteristic spatial scale for a quantity and its associated *M*_{eff} estimate. Perhaps, this is not too surprising since small-scale features imply a much larger range of possible simulation outcomes and the correlated model biases are more likely to be masked by the noisy, uncorrelated features.

We now investigate in more detail how error structures from individual models are related to each other. To this end, Fig. 4 shows correlations between error fields for all possible combinations of model pairs. Each circle represents the average correlation over all quantities and seasons for a given model pair. Each correlation appears twice in Fig. 4, once in the column of each of the two models.

Fig. 4. Correlation between model errors (northern extratropics) for averages over all 35 quantities and the four seasons. Larger filled circles indicate significantly positive values that are outside the one-tailed 95% confidence limit (*r* = 28%, assuming a Gaussian distribution with empirically estimated moments).

Figure 4 shows that most correlations are quite evenly distributed around zero, which is primarily a consequence of removing the MME (see Fig. 2). However, Fig. 4 also illustrates nine model pairs that are significantly correlated at the 95% level (*r* > 0.28), as indicated by the larger outlined circles. According to the CMIP3 model documentation (information online at www-pcmdi.llnl.gov), pairings at this level indicate models that are developed at the same center or share parts of the same code (“same center” models): the Community Climate System Model and the Parallel Climate Model (CCSM3–PCM11), ECHAM5 and the Istituto Nazionale di Geofisica e Vulcanologia model (ECHAM5–INGV4), the Goddard Institute for Space Studies A and R models (GISSA–GISSR), the University of Bergen Climate Model and the Centre National de Recherches Météorologiques Coupled Global Climate Model, version 3 (BCM20–CNRM3), the Coupled General Circulation Model version 3.1 at T47 and T63 resolution (C3T47–C3T63), the Commonwealth Scientific and Industrial Research Organisation Mark versions 3.0 and 3.5 (CSR30–CSR35), GFD20–GFD21, GISSH–GISSR, and the Model for Interdisciplinary Research on Climate 3.2 medium- and high-resolution versions (MIROM–MIROH). Some of the less obvious pairings are CCSM3 and PCM11 (developed at the National Center for Atmospheric Research), ECHAM5 and INGV4 (based on the same atmospheric model), and BCM20 and CNRM3 (which share the same atmospheric component). Most of these nine pairings are also found over the other two regions (not shown). One notable exception is the Geophysical Fluid Dynamics Laboratory (GFDL) model pair, which, consistent with earlier findings (Gnanadesikan et al. 2006), is quite dissimilar over the southern extratropics. Finally, Fig. 4 contains a few large negative correlations (e.g., GFD20 and INGV4), but these do not appear across all regions and are therefore not robust.
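A significance limit of the kind used for Fig. 4 can be sketched under the assumption that it equals the empirical mean of the pairwise correlations plus the 95th percentile of a standard Gaussian times their empirical standard deviation. The function name and the toy correlations are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def one_tailed_limit(correlations, level=0.95):
    """Upper one-tailed limit from a Gaussian with empirical mean and std."""
    r = np.asarray(correlations, dtype=float)
    quantile = NormalDist().inv_cdf(level)      # about 1.645 for level = 0.95
    return r.mean() + quantile * r.std(ddof=1)

# Toy set of pairwise correlations centered on zero
toy = [-0.2, -0.1, 0.0, 0.1, 0.2]
limit = one_tailed_limit(toy)

# Number of unique model pairs in a 24-member ensemble
n_pairs = 24 * 23 // 2    # 276
```

Correlations above the limit are flagged as significantly positive; with 276 pairs, some exceedances would be expected by chance, which is why the agreement with known same-center relationships matters.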

We now explore the influence of the same-center models, as identified above, on *M*_{eff}. We accomplish this in two ways. First, we remove CNRM3, C3T47, CSR30, GFD20, GISSR, and MIROM from our ensemble and repeat the *M*_{eff} analysis explained in section 5a. These particular models are removed because they are associated with the largest same-center correlations seen in Fig. 4. The outcome is identified by the thick dashed–dotted curves in Fig. 3. As shown, the remaining *M* = 18 models lead to an increase in *M*_{eff}, but this increase is quite small. Next, we repeat this analysis, retaining only the six models and their same-center counterparts (dotted curves in Fig. 3). As expected, there is now a decrease in *M*_{eff}. The relative size of this decrease is larger than the relative increase when removing same-center models. This makes sense given the uneven distribution of the two groups of models. We also examined if same-center relationships are connected to specific groups of quantities. We find (not shown) that it is generally impossible to identify similarities that belong to a specific quantity or groups of quantities. Instead, it appears that each given model pair is well correlated across most quantities.

We now provide a summary view of the similarities among different models. To this end, we convert the correlations (*r*) seen in Fig. 4 into a distance metric and enter them into a hierarchical clustering scheme. The clustering scheme groups models at different levels based on the distances between the models. The outcome of this analysis is graphically depicted by the “dendrogram” shown in Fig. 5.
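The conversion of correlations into distances can be sketched as follows, using the form given in the caption of Fig. 5, *z*(0.95) − *z*(*r*). Capping correlations at 0.95 so that distances remain nonnegative is an assumption, as are the function names and the toy correlation matrix.

```python
import numpy as np

def correlation_to_distance(R, r_max=0.95):
    """Distance z(r_max) - z(r), with correlations capped at r_max."""
    R = np.minimum(np.asarray(R, dtype=float), r_max)
    return np.arctanh(r_max) - np.arctanh(R)    # arctanh is the Fisher z transform

def closest_pair(D):
    """Indices (u, v), u < v, of the two models separated by the smallest distance."""
    u, v = np.triu_indices(D.shape[0], k=1)
    k = np.argmin(D[u, v])
    return int(u[k]), int(v[k])

# Toy correlation matrix for three models; models 0 and 1 are very similar
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.0],
              [0.1, 0.0, 1.0]])
D = correlation_to_distance(R)
first_merge = closest_pair(D)    # (0, 1)
```

A hierarchical clustering routine would merge the closest pair first and then continue with the remaining groups. In SciPy, for example, `scipy.cluster.hierarchy.linkage` with `method="weighted"` implements a weighted pair-group average scheme; whether it matches the IDL routine used in the study exactly is not confirmed here.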

Fig. 5. Hierarchical clustering based on model error correlation (northern extratropics). Similar models merge closer to the right. The clustering scheme is based on the weighted pairwise average distance algorithm developed for the Interactive Data Language. The distance between two models is given by *z*(0.95) − *z*(*r*), with *z* being defined in (2) and with a value of 0.95 being an upper bound on the correlation. Other distance metrics and methodologies produce similar intermodel relationships (not shown). Scale bar units indicate equivalent correlation.

From Fig. 5, one can see that the same-center model pairs, identified earlier, merge at relatively short distances. Clusters containing more than two models tend to merge at insignificant correlations (*r* < 0.28), denoted by gray shading, and therefore appear to arise by construction. The only exceptions are GISSA, GISSR, and GISSH, which are shown at the top of the dendrogram. These three models were all developed at the same center, suggesting that this merger is meaningful. Also, it is interesting that Flexible Global Ocean–Atmosphere–Land System Model (FGOAL) has the largest merging distance with any other model, which is consistent with findings of Jun et al. (2008a) that this model is most independent from the CMIP3 ensemble. Additionally, although the two Hadley Centre models [the Hadley Centre Global Environmental Model (HadGEM) and the third climate configuration of the Met Office Unified Model (HadCM3)] merge at a rather large distance, it is clear that the two have more in common with each other than with any other model.

All of our results, thus far, are based on single-member simulations from each model (usually “run1”). Looking at multiple members from the same model, we find that the respective outcomes are very similar. This is exemplified in Fig. 5 for two members of GFD21 (GFD21-A and GFD21-B). They exhibit a correlation of ~93%, which is higher than between any two different models.

## 6. Conclusions

To our knowledge, this study represents the first attempt at explicitly determining the effective number of models from an ensemble. This is accomplished by calculating spatial errors in simulating present-day climatological mean fields for 35 different quantities. We then construct a dataset based on these errors and utilize two distinct methods to quantify the amount of statistical independence in the ensemble. Using both methods, we find that the effective number of models (*M*_{eff}) is considerably smaller than the actual number (*M*), and as the number of models increases, the disparity between the two widens considerably. For the full 24-member ensemble, this leads to an *M*_{eff} that, depending on method, lies only between 7.5 and 9 when we control for the multimodel error (MME). These results are in good quantitative agreement with those of Jun et al. (2008a,b) and Knutti et al. (2010), who also found that CMIP3 cannot be treated as a collection of independent models.

As explained before, we consider the effective number of models to be a useful measure of model independence. The demonstrably low effective number of models suggests that CMIP3 is not a very diverse ensemble. Because of this lack of diversity, adding models to a growing ensemble yields diminishing returns. In the northern extratropics, for example, 12 models on average account for about 75% of the total information (Fig. 3). In other words, the CMIP3 ensemble gives the false impression of containing more independent models than it effectively does. As discussed in the introduction, a possible consequence is that the CMIP3 ensemble underestimates the real range of climate prediction uncertainty. However, the extent to which this statement holds for CMIP3 is unclear at this point. For example, a recent paper by Annan and Hargreaves (2010) suggests that under the paradigm of a “statistically indistinguishable ensemble” CMIP3 appears to have statistical properties similar to the observations.

Previous studies have provided convincing evidence that averaging over the outcomes from many models (the multimodel mean) generally outperforms any individual model (e.g., Reichler and Kim 2008a; Gleckler et al. 2008). It has further been argued that the superiority of the multimodel mean is due to the inclusion of a large number of diverse models, which tends to reduce the effects of natural climate variability and to cancel offsetting errors (Pierce et al. 2009). However, we do not find CMIP3 to be as diverse as its ensemble size suggests, which limits its potential usefulness for multimodel projections.

Common model biases are an obvious explanation for this lack of diversity. However, it is important to emphasize that we removed the MME in our calculations and that the effective number of models is still surprisingly small. Another possible explanation for the small *M*_{eff} is that some CMIP3 models were developed at the same centers, and such models tend to differ little in their implementation (e.g., Delworth et al. 2006; Schmidt et al. 2005; Hasumi and Emori 2004). However, we find that same-center models have only a modest impact; eliminating them from the ensemble increases *M*_{eff} by less than 10%. This suggests that considerable overarching commonalities remain among the models even after the MME is removed.

One cautionary note to this study concerns the potential influences of unknown errors in the observations on our results. Previous studies (Pincus et al. 2008; Gleckler et al. 2008; Reichler and Kim 2008b) indicate that this component of uncertainty is relatively small, but the real extent of this problem is not clear to us. Another potential caveat to this study is that model similarity is determined from present-day mean climate. Some studies indicate that there may be little relationship between the ability of models to simulate mean climate and their simulation of trends (Jun et al. 2008b; Pierce et al. 2009; Knutti et al. 2010; Reifen and Toumi 2009). However, in the present study we are merely interested in similarities in error patterns and not in the magnitudes of the errors. The strong similarities in model error structures found in our study indicate a considerable lack of model diversity. It is reasonable to suspect that such model similarities translate into a limited range of climate change projections.

## Acknowledgments

We acknowledge Huug van den Dool for useful discussions, Junsu Kim for providing data and code, the modeling groups for providing the CMIP3 data for analysis, the Program for Climate Model Diagnosis and Intercomparison for collecting and archiving the model output, and the JSC/CLIVAR Working Group on Coupled Modelling for organizing the model data analysis activity. The multimodel data archive is supported by the Office of Science, U.S. Department of Energy. We also thank the three anonymous reviewers for their comments and suggestions. This work was supported by NSF Grant ATM0532280 and NOAA Grant OAR-OGP-2006-2000116.

## REFERENCES

Abramowitz, G., and H. Gupta, 2008: Towards a model space and independence metric. *Geophys. Res. Lett.*, **35**, L05705, doi:10.1029/2007GL032834.

Annan, J. D., and J. C. Hargreaves, 2010: Reliability of the CMIP3 ensemble. *Geophys. Res. Lett.*, **37**, L02703, doi:10.1029/2009GL041994.

Bretherton, C. S., M. Widmann, V. P. Dymnikov, J. M. Wallace, and I. Bladé, 1999: The effective number of spatial degrees of freedom of a time-varying field. *J. Climate*, **12**, 1990–2009.

Delworth, T. L., and Coauthors, 2006: GFDL’s CM2 global coupled climate models. Part I: Formulation and simulation characteristics. *J. Climate*, **19**, 643–674.

Furrer, R., R. Knutti, S. R. Sain, D. W. Nychka, and G. A. Meehl, 2007: Spatial patterns of probabilistic temperature change projections from a multivariate Bayesian analysis. *Geophys. Res. Lett.*, **34**, L06711, doi:10.1029/2006GL027754.

Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. *J. Geophys. Res.*, **113**, D06104, doi:10.1029/2007JD008972.

Gnanadesikan, A., and Coauthors, 2006: GFDL’s CM2 Global Coupled Climate Models. Part II: The baseline ocean simulation. *J. Climate*, **19**, 675–697.

Hasumi, H., and S. Emori, 2004: K-1 coupled GCM (MIROC) description. K-1 Tech. Rep. 1, Center for Climate System Research, University of Tokyo, 34 pp. [Available online at http://www.ccsr.u-tokyo.ac.jp/kyosei/hasumi/MIROC/tech-repo.pdf.]

Jun, M., R. Knutti, and D. W. Nychka, 2008a: Local eigenvalue analysis of CMIP3 climate model errors. *Tellus*, **60A**, 992–1000.

Jun, M., R. Knutti, and D. W. Nychka, 2008b: Spatial analysis to quantify numerical model bias and dependence: How many climate models are there? *J. Amer. Stat. Assoc.*, **103**, 934–947.

Knutti, R., 2008: Should we believe model predictions of future climate change? *Philos. Trans. Roy. Soc.*, **366**, 4647–4664.

Knutti, R., R. Furrer, C. Tebaldi, J. Cermak, and G. A. Meehl, 2010: Challenges in combining projections from multiple climate models. *J. Climate*, **23**, 2739–2758.

Meehl, G. A., and Coauthors, 2007a: Global climate projections. *Climate Change 2007: The Physical Science Basis*, S. Solomon et al., Eds., Cambridge University Press, 747–846.

Meehl, G. A., C. Covey, K. E. Taylor, T. Delworth, R. J. Stouffer, M. Latif, B. McAvaney, and J. F. B. Mitchell, 2007b: The WCRP CMIP3 multimodel dataset: A new era in climate change research. *Bull. Amer. Meteor. Soc.*, **88**, 1383–1394.

Pierce, D. W., T. P. Barnett, B. D. Santer, and P. J. Gleckler, 2009: Selecting global climate models for regional climate change studies. *Proc. Natl. Acad. Sci. USA*, **106**, 8441–8446.

Pincus, R., C. P. Batstone, R. J. P. Hofmann, K. E. Taylor, and P. J. Gleckler, 2008: Evaluating the present-day simulation of clouds, precipitation, and radiation in climate models. *J. Geophys. Res.*, **113**, D14209, doi:10.1029/2007JD009334.

Reichler, T., and J. Kim, 2008a: How well do coupled models simulate today’s climate? *Bull. Amer. Meteor. Soc.*, **89**, 303–311.

Reichler, T., and J. Kim, 2008b: Uncertainties in the climate mean state of global observations, reanalyses, and the GFDL climate model. *J. Geophys. Res.*, **113**, D05106, doi:10.1029/2007JD009278.

Reifen, C., and R. Toumi, 2009: Climate projections: Past performance no guarantee of future skill? *Geophys. Res. Lett.*, **36**, L13704, doi:10.1029/2009GL038082.

Schmidt, G. A., and Coauthors, 2005: Present-day atmospheric simulations using GISS ModelE: Comparison to in situ, satellite, and reanalysis data. *J. Climate*, **19**, 153–192.

Tebaldi, C., and R. Knutti, 2007: The use of the multi-model ensemble in probabilistic climate projections. *Philos. Trans. Roy. Soc.*, **365A**, 2053–2075.

Tebaldi, C., R. L. Smith, D. Nychka, and L. O. Mearns, 2005: Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. *J. Climate*, **18**, 1524–1540.

Van den Dool, H., 2007: *Empirical Methods in Short-Term Climate Prediction*. Oxford University Press, 215 pp.

Van den Dool, H., and R. M. Chervin, 1986: A comparison of month-to-month persistence of anomalies in a general circulation model and in the earth’s atmosphere. *J. Atmos. Sci.*, **43**, 1454–1466.

Wang, X., and S. Shen, 1999: Estimation of spatial degrees of freedom of a climate field. *J. Climate*, **12**, 1280–1291.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. Elsevier, 627 pp.