## Abstract

When analyzing multimodel climate ensembles it is often assumed that the ensemble is either truth centered or that models and observations are drawn from the same distribution. Here we analyze CMIP5 ensembles focusing on three measures that separate the two interpretations: the error of the ensemble mean relative to the error of individual models, the decay of the ensemble mean error for increasing ensemble size, and the correlations of the model errors. The measures are analyzed using a simple statistical model that includes the two interpretations as different limits and for which analytical results for the three measures can be obtained in high dimensions. We find that the simple statistical model describes the behavior of the three measures in the CMIP5 ensembles remarkably well. Except for the large-scale means we find that the indistinguishable interpretation is a better assumption than the truth centered interpretation. Furthermore, the indistinguishable interpretation becomes an increasingly better assumption when the errors are based on smaller temporal and spatial scales. Building on this, we present a simple conceptual mechanism for the indistinguishable interpretation based on the assumption that the climate models are calibrated on large-scale features such as, e.g., annual means or global averages.

## 1. Introduction

The analysis of ensembles of climate models has become a standard tool to estimate the effect of emission scenarios or other perturbations on the climate. Community efforts to create and store multimodel ensembles of climate model experiments with standardized forcings and outputs have been undertaken in the CMIP projects (Taylor et al. 2012). When analyzing such ensembles, the mean of the ensemble is often interpreted as the best estimate of the climate and the spread of the ensemble as an estimate of the uncertainty. Increasing the size of the ensemble would therefore be a natural way to obtain more precise estimates. Naively we might argue that the uncertainty on the ensemble mean would decrease as the inverse square root of the ensemble size. This argument is, however, based on two assumptions. The first assumption is that the ensemble is centered around the truth, i.e., the models can be seen as the truth polluted by noise. This is equivalent to the concept of measurement noise and is called the truth centered or the truth-plus-error interpretation. The second assumption is that the models can be treated as independent. If they are not fulfilled, these assumptions will lead to an underestimation of the spread and therefore to overconfident projections of the climate [see Steinschneider et al. (2015) for a concrete example].

The interest in the potential independence of climate models has increased over the last decade. This includes attempts to estimate the effective number of independent models (Jun et al. 2008; Pennell and Reichler 2011; Leduc et al. 2016), to weight the models according to their amount of independence (Sansom et al. 2013; Knutti et al. 2017; Sanderson et al. 2017), and to select independent subsets (Evans et al. 2013; Herger et al. 2018). See also the review by Abramowitz et al. (2019). There are strong a priori reasons to believe that climate models, such as those compiled in the CMIP archives, are not all independent. Some models from the same institute are closely related variants that only differ in, e.g., resolution. Models from different institutions share code and/or parameterizations and even when such components are developed independently they are often built on the same basic physical understanding and the same simplifications (Tebaldi and Knutti 2007). There is, in fact, a close connection between model genealogy and the proximity of the model simulations (Masson and Knutti 2011; Knutti et al. 2013; Boé 2018). The effective number of independent models are often found to be less than half of the nominal ensemble size (Pennell and Reichler 2011; Leduc et al. 2016; Sanderson et al. 2015).

Several methods have been applied to estimate the independence (Pirtle et al. 2010; Abramowitz et al. 2019) but a rigorous definition of independence is often not provided (Annan and Hargreaves 2017). As pointed out in, e.g., Stephenson et al. (2012) we need simplifying, transparent, and defensible statistical frameworks for ensembles of climate models. There are two simple null hypotheses for an ensemble of independent models that have been applied—sometimes tacitly—in previous work. The first is the truth centered interpretation mentioned above. The second is the indistinguishable interpretation in which the observations and the models all are assumed to be drawn from the same distribution. The choice of null hypothesis is important as the two interpretations give different values for typical test statistics used to test if ensemble members are independent. This will be discussed below and in more details in section 2. Note also that the ensemble mean only has a sound physical interpretation in the truth centered interpretation (Christiansen 2018a).

There is some support for the truth centered interpretation (Knutti et al. 2010; Sanderson and Knutti 2012, and references therein) and it has often been assumed more or less explicitly (see, e.g., the discussion in Jun et al. 2008). However, there is an increasing amount of evidence pointing toward the indistinguishable interpretation. The distance between the ensemble mean and the observations do not converge toward zero as the number of models increases (van Loon et al. 2007; Potempski and Galmarini 2009; Knutti et al. 2010; Bishop and Abramowitz 2013). The correlation between model errors are often found to be positive and near 0.5 (Jun et al. 2008; Annan and Hargreaves 2010; Pennell and Reichler 2011; Bishop and Abramowitz 2013; Herger et al. 2018; Abramowitz et al. 2019), which is the value predicted from the indistinguishable interpretation. Additionally, the error of the ensemble mean is often 30% smaller than the typical error of individual models, which again is the value predicted by the indistinguishable interpretation. For details see, e.g., Palmer et al. (2006) and the discussions in Christiansen (2018a, 2019).

The rationale for the truth centered interpretation is that it is the situation that would be expected after calibration of statistical models. This is a basic idea behind ensemble regression methods (Bishop 2007, chapter 14). Climate models are based on physical knowledge and are obviously not calibrated or trained like simple statistical models. However, in the process of developing the models, the observations over the last 100 years have served as a guideline. If, e.g., a model’s global mean temperature has deviated significantly from the observed, then efforts have been made to correct the problem, perhaps by improving relevant processes or perhaps by changing the forcings. Before a new version of a model with updated or new parameterizations is used it is often tuned to ensure, e.g., that the mean outgoing long wave radiation at the top of the atmosphere matches that of the observations (Mauritsen et al. 2012; Hourdin et al. 2017). Other often used targets for the tuning are the global mean surface temperature, the global mean outgoing radiation, El Niño–Southern Oscillation, the twentieth century warming, and the meridional overturning circulation (Hourdin et al. 2017). Thus, the calibration of the climate models is often focused on the largest scales (Flato et al. 2013), although different model centers may consider different aspects of the climate system. Perfect tuning is not possible due to structural limitations in the climate models.

A main difficulty for the indistinguishable interpretation is the lack of an obvious mechanism to produce such a distribution. In ensemble weather forecasting the model is initialized from a number of conditions representing the uncertainty in the observed initial state. When the model is integrated forward in time, the width of the distribution broadens because of the chaotic divergence of the trajectories. This model distribution should be compared to the distribution of the truth, i.e., a perfect model initialized from the same conditions. Thus, in weather forecasts the indistinguishable interpretation appears naturally as the target for a good model (Ehrendorfer 1994; Palmer 2000). However, initial condition ensembles with a single climate model are in general much narrower than the multimodel ensembles (Haughton et al. 2014), although this to some extent depends on the variable and scale under consideration (Kumar et al. 2016; Swart et al. 2015) (see also the analysis in section 4). The multimodel ensemble is therefore not mainly determined by initial conditions but more by the model deficiencies, which might be different in different models. Thus, the multimodel ensemble is basically a representation of our incomplete knowledge of the climate system.

We note that the real climate and the climate models are systems of very high dimensionality. When comparing different models or when comparing models with observations, the differences are often calculated as the root-mean-square error or the correlation over a large number of grid cells or times. We have previously used the properties of high dimensional spaces to analyze the errors of the ensemble mean (Christiansen 2018a, 2019). Here, we will argue that the indistinguishable interpretation arises for high dimensional systems when the calibration is performed on large-scale quantities. Note that we see this as a conceptual mechanism and that we do not expect climate models to follow the indistinguishable interpretation perfectly.

The truth centered and the indistinguishable interpretation are both very idealized viewpoints. To get more freedom in our analysis we use a simple statistical model where observations and models are drawn from distributions with different variances and which include a bias. The two interpretations can be found as limits of this more comprehensive model. We use the simplifying properties of high dimensional space (Bishop 2007; Christiansen 2018a) to obtain analytical results for the error correlations and for the error of the ensemble mean under this statistical model. We then analyze the near-surface temperature and the 500-hPa geopotential height fields from a multimodel CMIP5 ensemble of historical experiments. We use the results from the simple statistical model as a tool for interpreting the climate model ensemble. We show that there is a transition from an approximately truth centered situation when global annual means are considered to an indistinguishable interpretation when monthly gridpoint anomalies are considered. Based on this observation we present a simple conceptual mechanism for the indistinguishable interpretation.

In section 2 we introduce the simple statistical model that allows the models and observations to be drawn from different distributions. We use the properties of high dimensional space to obtain analytical results for the ensemble mean and the error correlations. In section 3 we analyze the CMIP5 ensemble using the simple statistical model for interpretation. In section 4 we consider how the indistinguishable interpretation may arise from calibration of the large-scale quantities.

## 2. The simple statistical model and the blessings of dimensionality

In section 2a we introduce the simple statistical model that was also used in Christiansen (2018a, 2019). We then describe the relevant properties of high dimensional space in section 2b. Then, in section 2c we discuss analytical results for the ensemble mean and the error correlations based on these properties. The derivation of the analytical results is given in appendix A.

### a. The simple statistical model

We use the same notation as in Christiansen (2018a, 2019). The *K* ensemble members are denoted by **x**^{k}, *k* = 1, …, *K*, and the observations are denoted by **o**. Each **x**^{k} is a vector in *N*-dimensional space, $xk=(x1k,x2k,\u2026,xNk)$, and so are the observations, **o** = (*o*_{1}, *o*_{2}, …,*o*_{N}). We denote the ensemble mean with an overbar: $x\xaf=\u2211k=1Kxk/K$. The *N*-dimensional space can represent both spatial and temporal fields.

We now consider a model where the ensemble members are drawn from $N\u2061(b,\sigma mod2I)$ and the observations from $N\u2061(0,\sigma obs2I)$. Here **b** is an *N* vector of length $\Vert b\Vert =BN$ and $I$ is the identity matrix of size *N*. Thus, the ensemble members and the observations are drawn from Gaussian distributions with different variances, and the ensemble members may be biased relatively to observations. The center, representing the expectation value of the observations, has been set to **0** without lack of generality as we only consider differences in our analysis. Note, that while $\sigma obs2$ represents internal variability in the observations, $\sigma mod2$ includes both internal variability and structural differences between models. The Gaussians are spherical so that the coordinates of all the ensemble members are independently drawn from the same univariate distribution. The same holds for the coordinates of the observations. This is not a serious limitation as will be explained in the next subsection. The situation *B* = 0 and $\sigma obs2=0$ corresponds to the truth centered interpretation and the situation *B* = 0 and *σ*_{obs} = *σ*_{mod} corresponds to the indistinguishable interpretation (Sanderson and Knutti 2012). We note that the analytical results in the next subsection include *B* and $\sigma obs2$ only in the combination $B2+\sigma obs2$.

### b. The blessings of high dimension

The properties of high dimensional spaces often defy our intuition based on two and three dimensions (Bishop 2007; Cherkassky and Mulier 2007; Blum et al. 2018). As a simple example, a unit cube in *N* dimensions will have 2^{N} vertices each with a length of $N/2$. The volume of the cube within a distance *ε* to the edge is 1 − (1 − *ε*)^{N}. For *N* = 100 there are more than 10^{30} vertices and more that 99% of the volume is within a distance 0.05 to the edge. Thus, the volume increasingly concentrates near the surface when the dimension increases.

The properties of high dimensional spaces are sometimes called the “curse of dimensionality” and sometimes the “blessing of dimensionality” depending on the considered problem. The relevant properties of high dimensional spaces for our purpose are in mathematics known as “waist concentration” and “concentration of measure” (Talagrand 1995; Donoho 2000; Hall et al. 2005; Gorban et al. 2016, 2020). Here “concentration of measure” means that random vectors drawn from the same distribution in high dimensions have almost always the same lengths. More precisely, the variance of the lengths relative to the expectation of the lengths converges toward zero. “Waist concentration” denotes the fact that independent vectors in high dimensions are almost always orthogonal. More precisely, when the dimension increases the variance of the angle between independent vectors converges toward *π*/2. Thus, in high dimensions we can set dot products to zero and substitute the length of a random vector with its expectation value.

The simple statistical model is obviously unrealistic. It assumes that $\sigma obs2$ and $\sigma mod2$ are the same for all coordinates. This is of course not the case when climate fields are considered, e.g., the near-surface temperature has larger variability near the poles than over the equator. Also, the assumption of Gaussianity of the individual components $xnk$ may be unrealistic. Perhaps even more importantly, the simple model assumes independent components while in many cases nearby grid points are strongly dependent. However, the concentration of measure—just as the central limit theorem (e.g., Clusel and Bertin 2008)—does not only hold for independent and identically distributed variables but has been extended to various classes of nonidentical and dependent variables (Kontorovich and Ramanan 2008; Chazottes 2015). So generally, quoting Chazottes (2015), “A random variable that smoothly depends on the influence of many weakly dependent random variables is, on an appropriate scale, very close to a constant.” It is only this constant—which in our case is an average variance $\sigma mod2$—that enters the simple statistical model and $\Vert xk\Vert 2/N$ can therefore be substituted with $\sigma mod2$.

In the present paper high dimensional space enters when we consider a spatial field and calculate, e.g., the mean-square error over the grid points. The number of grid points may be very large but in many cases nearby grid points are not independent. The “many” in the quotation above is related to the amount of dependence among the grid points: with strong dependence we would need more grid points. A relevant measure here is the effective number of independent points (or degrees of freedom) as it is related to how fast the correlations decay with distance. Some algorithms directly use the average decorrelation length to estimate the effective number of degrees of freedom. In general this number is rather difficult to estimate (Christiansen 2015) and depends both on the field, the time scale, and the region under consideration (Jones and Briffa 1996; North et al. 2011). However, it is still large and on the order of 50–100 for monthly surface temperatures in the Northern Hemisphere (Wang and Shen 1999; Bretherton et al. 1999).

In appendix B we illustrate the concentration of measure with numerical experiments including dependent variables. In particular, we demonstrate the convergence with the increasing number of effective degrees of freedom.

### c. Some analytical results

Using the simplifying properties of high dimensional spaces described in the previous subsection we can obtain analytical results for the error correlation and the ensemble mean. The results are discussed here but the detailed derivations are given in appendix A.

Considering the mean-square error between observations and ensemble mean we get

Here we have used that the cross terms vanish (the orthogonality) and that $\Vert o\Vert 2/N=\sigma obs2$ and $\Vert xk\u2212b\Vert 2/N=\sigma mod2$ for all *k* (same lengths). Thus, the mean-square error converges for large *K* toward $B2+\sigma obs2$ as 1/*K*. Only in the truth centered interpretation (*B* = 0 and $\sigma obs2=0$) will the mean-square error go to zero for large *N* and in this situation the root-mean-square error is *σ*_{mod}/*K*^{1/2}. In the indistinguishable interpretation (*B* = 0 and *σ*_{obs} = *σ*_{mod}) we get $\sigma mod1+1/K$. This result was also found in van Loon et al. (2007), Potempski and Galmarini (2009), and Bishop and Abramowitz (2013) based on expectation values that are valid for all *N* but only as the mean over many realizations. The finite error in the limit of large *K* has been observed in multimodel ensembles by Knutti et al. (2010) and Bishop and Abramowitz (2013). Dividing Eq. (1) by $\sigma obs2$ we get the dimensionless expression for the mean-square error:

Defining the error of the *k*th model as **e**^{k} = **x**^{k} − **o** we get for the correlation between two model errors:

Here we have again used the properties of high dimensional spaces. Similar analytical results for *B* = 0 were reported in Annan and Hargreaves (2011). In the truth centered interpretation the correlation becomes zero. However, in the indistinguishable interpretation it becomes 0.5 as also noted by Bishop and Abramowitz (2013). Such error correlations around 0.5 have been reported by Jun et al. (2008), Annan and Hargreaves (2010), Pennell and Reichler (2011), Herger et al. (2018), Bishop and Abramowitz (2013), and Abramowitz et al. (2019).

A measure that is often used when validating ensembles of climate models is the error relative to the median error of all ensemble members (Gleckler et al. 2008; Sillmann et al. 2013; Flato et al. 2013). If a specific ensemble member has a relative error of, e.g., 0.1 it therefore indicates that this ensemble member has an error 10% larger than the median error of the model ensemble. For the relative error of the ensemble mean we have

This result, which requires that both *K* and *N* are large, was found in Christiansen (2018a) and discussed further in Christiansen (2019). In the indistinguishable interpretation the relative error reduces to $\u2061(1\u22122)/2=\u22120.29$, which is close to the value found in many validations of climate models (Gleckler et al. 2008; Sillmann et al. 2013; Flato et al. 2013). In the truth centered interpretation the relative error is −1.

The three measures from Eqs. (3), (4), and (2) are shown as functions of *σ*_{obs}/*σ*_{mod} and *B*/*σ*_{mod} in Fig. 1. The latter measure is shown for large *K* so that the term 1/*K* disappears. The truth centered and the indistinguishable interpretation correspond to the points (0, 0) and (1, 0), respectively. The measures all only depend on (*B*/*σ*_{mod})^{2} + (*σ*_{obs}/*σ*_{mod})^{2}. This means, e.g., that a value of 0.5 for the error correlations can be obtained both for the indistinguishable interpretation and also for values of *σ*_{obs}/*σ*_{mod} smaller than one if a bias is present. Note that as the analytical equations are found as a consequence of operating in a high dimensional space, they hold even for a single realization in contrast to results found as the expectation over many realizations.

The decay with *K* of the ensemble mean error, Eq. (2), is illustrated in Fig. 2 for different values of the parameters. Figure 3 shows the rank histograms for the same parameters. The rank histograms are calculated numerically for *K* = 45 and *N* = 1000 and the bias is chosen equal for all components, $b=B/N\u2061(1,1,\u2026)$. A rank of zero means that the observation is smaller than all models and a rank of *K* means the observation is larger than all models. Such rank histograms measure the degree to which the model distributions are consistent with the distribution of observations (Talagrand et al. 1998; Hamill 2001) and has previously been used for the analysis of multimodel ensembles by Annan and Hargreaves (2010) and Haughton et al. (2014).

As expected, the ensemble mean error converges toward zero for large *K* when we are close to the truth centered interpretation (Fig. 2). In this situation the decay of the ensemble mean error is also most pronounced as measured by the difference between small and large *K*. Furthermore, the rank histogram (Fig. 3) is peaked around the center (*K*/2). When we are closer to the indistinguishable interpretation the decay of the ensemble mean error is slower and saturates at a finite value. Now the rank histogram is almost flat. Note that the rank histograms do depend independently on both *σ*_{obs} and *B*, as a finite *B* will show up as an asymmetry in the rank histograms.

## 3. The CMIP5 multimodel ensemble

Here we investigate the errors in a subset of 45 climate models from the CMIP5 (Taylor et al. 2012) multimodel ensemble. The models that are chosen from the major modeling centers are briefly identified in Table 1 and details can be found in Table 9.A.1 of Flato et al. (2013). We use monthly means from the historical experiments and from each model we use only one ensemble member (r1i1p1). As observations we use NCEP–NCAR data (Kalnay et al. 1996). We focus on the near-surface temperature (TAS) and the geopotential height at 500 hPa. The model data, which come in different horizontal resolutions, are interpolated to the horizontal NCEP resolution (144 longitudes, 73 latitudes) using a simple nearest neighbor procedure. The monthly and annual mean climatologies are calculated for all grid points using the period 1980–2005. Monthly and annual anomalies are calculated by subtracting the climatology. The errors are then calculated as the root-mean-square error over the included space and time as in Gleckler et al. (2008). Thus, for monthly climatologies the root-mean-square errors are calculated over the 144 × 73 grid points and the 12 calendar months and for annual means the errors are calculated over the 144 × 73 grid points and the 26 years. The results in this section are all robust to changes in the included period and to, e.g., detrending of the data. We also note that we get similar results using the MERRA-2 (Gelaro et al. 2017) or ERA5 (Hersbach et al. 2019) reanalyses as observations instead of NCEP–NCAR.

In Christiansen (2018b) we presented a direct test of the validity of the simple statistical model for a smaller version of the multimodel ensemble than used in the present paper. A similar test was presented in Christiansen (2019) for an ensemble of weather forecasts. These tests demonstrated that the lengths $\Vert xk\u2212x\xaf\Vert $ of the different models were narrowly distributed (concentration of measure) and that the angles between models were close to *π*/2 (waist concentration). We have repeated the test for the present ensemble. Calculating the $\Vert xk\u2212x\xaf\Vert /N$ for all models we find a mean length of 2.57 K with a standard deviation of 0.46 K for the monthly climatology of TAS. For the angles *ϕ* between models [from $(xi\u2212x\xaf)\u22c5(xj\u2212x\xaf)=\Vert xj\u2212x\xaf\Vert \Vert xi\u2212x\xaf\Vert cos\u2061(\varphi )$] we find a mean of 1.59 with a standard deviation of 0.28. Similar results are obtained for the other fields.

We now consider the rank histograms of the observations among the *K* = 45 models. Then we will use the rank histograms to fit the simple statistical model. The rank histograms for TAS are shown in Fig. 4. For the global annual means the distribution of these 26 ranks (one for each year, Fig. 4a) is rather closely centered around *K*/2 as should be expected for a truth centered ensemble. For the monthly climatology the distribution—calculated over all grid points and calendar months—of the ranks is much flatter indicating a less truth centered ensemble (Fig. 4b). The deviation from flatness shows that the model ensemble is overdispersed. For monthly means (Fig. 4c) the distribution—calculated over all grid points and all months—is even flatter. For the monthly anomalies—again calculated over all grid points and all months—the rank histogram is basically flat indicating that here the distributions follow the indistinguishable interpretation (Fig. 4d). All distributions are relatively symmetric indicating only a small bias.

We want to make sure that the differences noted above are not just a consequence of comparing different number of samples. To this end we define a measure of truth centeredness as the root-mean-square difference between the ranks and *K*/2. For an ensemble where all members are identical to observation this measure would be zero and the larger it is the further the ensemble is from being truth centered. For a perfectly uniform rank histogram this measure would be $K/12$.^{1} For the global annual mean we find this measure to be 4.5. The distributions over all grid points and temporal values of this measure are shown in Fig. 5 for the monthly climatology, monthly means, and monthly anomalies. The measure for the global mean is indicated by the vertical bar. For the monthly climatology about 96% of the grid points are less truth centered than the global mean. This fraction is even larger for monthly means and monthly anomalies. The distributions also become more narrowly peaked around $K/12~13.0$.

We now continue by fitting the simple statistical model to the rank histograms. Let *P*_{m} and *P*_{o} be the cumulative distribution functions of models and observations and *p*_{m} and *p*_{o} the corresponding probability densities. We then have for the distribution of the ranks, *k* = 0, 1,..., *K*, of the observations among the models:

This result is general and does not require any assumptions. For the simple statistical model we have $po=N\u2061(0,\sigma obs2)$ and $pm=N\u2061(B,\sigma mod2)$. Ranks from Eq. (5) are shown in Fig. 3 (cyan curves) where the parameters of the simple statistical model are known. If the parameters are not known they can be estimated from a given rank histogram.

The resulting fits to the histograms are very good as shown by the red curves in Fig. 4. The fitted parameters are shown in Table 2. In addition to the fields discussed above and shown in Fig. 4 we have also in the table included annual climatology, means, and anomalies as well as monthly global means. In general we only find a small bias. It is clear that the indistinguishable interpretation becomes an increasingly better assumption when decreasing the temporal and spatial scales: *σ*_{obs}/*σ*_{mod} increases from 0.21 for annual global means to 1.03 for monthly anomalies. The fact that we in general find *σ*_{mod} > *σ*_{obs} is an indication of the overdispersion of the model ensemble. Note, that for the anomalies the bias disappears by definition.

We can now calculate the three measures for the error correlation and the ensemble mean both directly from the climate model output and using the analytical expressions, Eqs. (1), (3), and (4) with the fitted parameters. The results are shown in Table 2. We see that the measures calculated from the analytical expressions provide excellent estimates for all fields.

Figure 6 shows the error correlations of all model pairs [*K*(*K* − 1)/2 = 990 for *K* = 45] for the monthly climatology, monthly means, and monthly anomalies. The mean of the distributions—shown as the black vertical lines and also given in Table 2—are very close to the value expected from the simple statistical model (red vertical line). In all three cases the correlations are closer to 0.5 (indistinguishable interpretation) than to 0 (truth centered interpretation). When we go from the monthly climatology over the monthly means to the monthly anomalies the mean of the correlations converges toward 0.5 and the distributions become increasingly narrowly centered around this value.

Figure 7 shows the decay of the ensemble mean error [normalized as in Eq. (2)] as function of the ensemble size for the same fields as for the error correlations. The figure is constructed by randomly drawing a subensemble of *k* members (*k* = 1, …, *K*) from the ensemble of *K* = 45 models. The root-mean-square error of the subensemble mean is calculated. For each *k* this is done 100 times and the average of the errors plotted (black curve). These curves strongly resemble the analytical curves calculated from Eq. (2) (red curves). The saturation level for large *k* increases when going from the monthly climatology over the monthly means to the monthly anomalies. For the monthly anomalies the saturation level is close to 1.

We have repeated the analysis with the 500-hPa geopotential height. The rank histograms are shown in Fig. 8 and the fitted parameters and the measures are given in Table 3. The results are rather similar to those of the TAS except that the model ensemble now has a negative bias. In particular for the monthly climatology the simple statistical model does not provide a good fit to the rank histogram. However, the simple statistical model still gives values of the three measures very close to those calculated directly from the model output. The size of the bias relative to *σ*_{mod} decreases with decreasing temporal and spatial scales and vanishes for monthly anomalies.

In general we see for all three measures that the indistinguishable interpretation becomes more accurate when the temporal and spatial scales decrease. For the monthly anomalies the results are in full agreement with the indistinguishable interpretation. This transition toward the indistinguishable interpretation is illustrated in Fig. 1 where the position of the different fields is plotted in the space spanned by *σ*_{obs}/*σ*_{mod} and *B*/*σ*_{mod}. Fields of smaller scales are indicated by smaller symbols and the convergence toward the indistinguishable interpretation (red filled circle) is clearly seen.

## 4. A conceptual mechanism for the indistinguishable interpretation

As was the case in the previous sections the *N*-dimensional space can be both spatial, temporal, or a combination. Let us for simplicity consider time series—this could be of monthly or annual means—so that for, e.g., observations we have **o** = (*o*_{1}, *o*_{2}, …, *o*_{N}), where the index *n* refers to time. Assume now that the observations *o*_{n} are drawn from a distribution $N\u2061(co,\sigma o2)$. We implement a separation of scales by letting the large-scale variability be given by *c*_{o} and smaller-scale variability modeled by *σ*_{o}. In the simplest case *c*_{o} is just the temporal mean although more generally it could include short range dependence and depend slowly on the index *n*. For the *k*th model we likewise have—before the model is calibrated—that $xnk$ is drawn from $N\u2061(ck,\sigma k2)$, where *c*_{k} only depends slowly on time. If we interpret the index *n* as a spatial pointer instead of time, then *c*_{o} would be a large-scale spatial mean—such as a global mean perhaps including large-scale spatial variations—and *σ*_{o} would be the smaller-scale spatial variability.

Assume first that a simple calibration is performed on each temporal value *n*—as we could assume to be the case if the time series represented, e.g., global means of annual or even longer temporal averages. Then we would for the calibrated values of the *k*th model $ynk$ have $ynk=on+\xi nk$, *n* = 1, …, *N*, where $\xi nk$ is random noise. This situation corresponds to small values of *σ*_{o} and *σ*_{k} and is equivalent to calibrating *c*_{k} against *c*_{o}. This corresponds to the truth centered interpretation, and if the residual noise is independent of the model then the ensemble mean $\u2211kynk/K$ would converge toward *o*_{n} for large *K* with a root-mean-square error disappearing as the inverse square root of *K*.

Consider now the situation where the time series represent, e.g., monthly gridpoint values and *σ*_{o} and *σ*_{k} are large. In this case we would not expect the individual values $xnk$ to be calibrated against *o*_{n}. Rather, we would expect the mean and variances to be calibrated. We then get $ck=co+\xi ck$ and $\sigma k=\sigma o+\xi \sigma k$ corresponding to the truth centered interpretation for the mean and the variance. However, the individual temporal values $ynk$ are now after the calibration drawn from a distribution with random values of the mean and variance. Such a distribution is called a compound distribution and generally has an inflated variance compared to the distribution with fixed values of the parameters. If we assume that the noise $\xi ck$ is distributed as $N\u2061(0,\sigma c2)$ and that $\xi \sigma k$ likewise is distributed as $N\u2061(0,\sigma \sigma 2)$ we can obtain analytical results. In this case we get that $ynk$ is distributed as $N\u2061(co,\Sigma 2)$, where $\Sigma 2=\sigma o2+\sigma c2+\sigma \sigma 2$.^{2}

Therefore, the individual calibrated values $ynk$ will be drawn from a distribution with the same mean as the observations *o*_{n} but with an inflated variance. The situation is schematically illustrated in Fig. 9. With perfect tuning we have *σ*_{c} = *σ*_{σ} = 0 and models and observations will be drawn from the same distribution corresponding to the perfect indistinguishable interpretation. This is similar to a rescaling of the model distribution not unlike the transformation suggested by Bishop and Abramowitz (2013). If the calibrations of the mean and variances are biased this will turn up as a bias in the distribution of $ynk$. Note that the blessings of dimensionality are not used in this conceptual mechanism.

There is a potential alternative explanation of the transition from the truth centered to indistinguishable interpretation based on internal variability. In the simple statistical model of section 2, $\sigma mod2$ is a sum of a contribution from the structural difference between models $\sigma struct2$ and a contribution from the internal variability of each model $\sigma var2$. One could imagine that the internal variability would dominate with decreasing averaging and would result in a situation corresponding to the indistinguishable interpretation. To estimate the internal variability $\sigma var2$, we have analyzed the Max Planck Institute Grand Ensemble (MPI-GE) (Maher et al. 2019). This is a 100-member initial condition ensemble performed with a single climate model. We find that $\sigma var2$ is always considerable less than $\sigma mod2$ (from the multimodel ensemble) except for the anomalies. For example, for TAS we find $\sigma var2/\sigma mod2$ is 0.02 and 0.01 for monthly and annual climatologies, respectively. For monthly and annual means the numbers are 0.25 and 0.09. Only for anomalies are these comparable as should be expected as in this case the structural difference has been removed. Also Haughton et al. (2014) found a much smaller spread in initial condition ensembles compared to the CMIP5 multimodel ensemble considering near-surface temperature.

Yokohata et al. (2012) described how tuning may lead to a truth centered interpretation through a series of updates toward observations. Sanderson and Knutti (2012) argued that an ensemble can drift away from the near truth centered situation due to a common model bias in periods not constrained by observations. In contrast, the simple argument above indicates a mechanism for how a calibration process that leads to truth centered distributions for the large-scale variability *c*_{k} can lead to the indistinguishable interpretation for the individual calibrated values $ynk$.

We note that the calibration of climate models is much more complicated than suggested by the simple scheme above. We therefore do not expect the explanation to be perfect or the indistinguishable interpretation to be fully satisfied.

## 5. Conclusions

We have presented a quantitative study of the structure of multimodel ensembles and their relation to observations. Results were interpreted using a simple statistical model describing the observations and models as drawn from two different distributions differing in variance and including a bias. The indistinguishable interpretation and the truth centered interpretation which are often used as null hypotheses when studying the independence of climate models—appear as two different limits of this model.

We considered three measures that are all based on the model errors and have been addressed in the previous literature. These measures are the limit of the ensemble mean error for large ensemble size, the correlation between individual model errors, and the relative error of the ensemble mean relative to the typical error of the individual models. In high dimensions—i.e., when the error is calculated as the root-mean-square or correlation over an extended spatial region or a long period—analytical expressions for the three measures can be found under the simple statistical model. These expressions only depend on the size of the bias and the ratio of the variances.

Using these three measures and the rank histograms we studied TAS and 500-hPa geopotential height in a 45-member multimodel CMIP5 ensemble of historical experiments using the NCEP analysis as observations. We studied different averages of the fields stretching from annual global means to monthly gridpoint anomalies. We found that the less averaging we applied the better the indistinguishable interpretation was fulfilled. In fact, the monthly gridpoint anomalies followed the indistinguishable interpretation almost perfectly. This conclusion follows both from the behavior of the three measures and the rank histograms.

The change toward a more perfect fulfillment of the indistinguishable interpretation when the spatial and temporal scales become smaller prompted us to formulate a simple conceptual mechanism explaining the occurrence of indistinguishable ensembles. While the origin of a truth centered ensemble is easy to imagine, as it is basically a measurement noise model, the origin of an indistinguishable ensemble is somewhat harder to comprehend. Our mechanism assumes that the calibration of climate models is focused on large-scale quantities such a means and variances. We emphasize that this is only meant as a conceptual mechanism. The calibration of climate models is much more complicated than the calibration of simple statistical models and other explanations might be possible.

The structure of the multimodel ensemble does have important consequences. In the indistinguishable interpretation (and even more when *σ*_{obs} > *σ*_{mod}) the ensemble mean error saturates quickly with increasing *K*, suggesting that a smaller number of models may be sufficient. However, also note that in high dimensions under the indistinguishable interpretation the ensemble mean is radically different from the individual models (Christiansen 2018a). When studying the independence of models, the choice of null hypotheses is often between the indistinguishable interpretation and the truth centered interpretation. The error correlations are often used as statistics in such studies. However, under the truth centered interpretation the error correlations of independent models are distributed around zero while they are distributed around 0.5 under the indistinguishable interpretation. Therefore, the error correlations should be used with care under the indistinguishable interpretation as the positive mean value is intrinsic and not a reflection of a common bias. In this situation correlations between models are a more suitable statistic.

We also note that the analytical results from the simple statistical model is remarkable successful. This is surprising considering that the model basically only includes two parameters and considering that all components (e.g., different grid cells) of all the individual models are drawn from the same distribution with a single variance (likewise all components of the observations are drawn from the same distribution). The analytical results rely on the simplifications of the high-dimensional limit and at least a part of the success hinges on the power of the blessing of dimensionality and the central limit theorem.

In our analysis we used 45 different climate models but did not consider the fact that they might not all be independent. For example, the different CESM1 models are close in the genealogy tree of models and might not be considered independent. The expression Eq. (3) for the correlations does not depend on *K*. For the relative error of the ensemble mean the analytical expression Eq. (4) is valid in the limit of large *K*. However, as shown in Christiansen (2018a) this convergence is fast and 10–20 models should be enough. The mean-square error between observations and ensemble mean, Eq. (2), depends on the ensemble size through the factor 1/*K*. As seen in Fig. 7 we get a very good agreement between the simple statistical model and the values calculated directly from the model ensemble using the nominal value of the size of the subensemble *k*. At least for small *k* this should be expected from the bootstrap method used to produce Fig. 7. If the number of models *k* is less than the independent number of models in the full ensemble, then the number of independent models in the subensemble will be close to *k*. For values of *k* near *K* the decay has saturated and the effect of the exact value of *K* is small [Eq. (2)]. Thus, there is not necessarily any conflict between the results of Fig. 7 and the result that the number of independent models is often found to be less than half or the nominal ensemble size (Pennell and Reichler 2011; Leduc et al. 2016; Sanderson et al. 2015). However, at least some methods to estimate the number of independent models is very sensitive to the effective degrees of freedom in the spatial field itself. It can also be shown that the spread of the bootstrapped values around the mean (Fig. 7) is determined by the effective degrees of freedom in the monthly climatology and goes to zero as this number increases. We will discuss these issues in a later paper.

## Acknowledgments

This work is supported by the NordForsk-funded Nordic Centre of Excellence project (Award 76654) Arctic Climate Predictions: Pathways to Resilient, Sustainable Societies (ARCPATH) and by the project European Climate Prediction System (EUCP) funded by the European Union under Horizon 2020 (Grant Agreement 776613). The NCEP–NCAR Reanalysis data were provided by the NOAA–CIRES Climate Diagnostics Center, Boulder, Colorado, from their website (http://www.cdc.noaa.gov/). We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1 of this paper) for producing and making available their model output. For CMIP the U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. The author thanks the editor and the anonymous reviewers whose comments/suggestions helped improve and clarify this manuscript.

### APPENDIX A

#### Derivations of Eqs. (1), (3), and (4)

In this appendix we present derivations of the analytical expressions, Eqs. (1), (3), and (4), discussed in section 2c. Recall from the discussion of high dimensional spaces in section 2b that we can set the dot product of independent vectors to zero and substitute the length of a random vector with its expectation value. See also Christiansen (2019) and references therein.

We first consider the mean-square error. We have

In the last two steps we have used the orthogonality. Using now the property of the constant lengths, $\Vert o\Vert 2/N=\sigma obs2$ and $\Vert xk\u2212b\Vert 2/N=\sigma mod2$ we get Eq. (1).

To derive Eq. (3) we let ⟨⟩ denote averaging over *N*. The error correlation is

With **z**^{k} = **x**^{k} − **b** we get $ek\u2212\u2329ek\u232a=zk+b\u2212o$ as ⟨**b**⟩ is zero for large *N* and ⟨**o**⟩ is zero by definition (no lack of generality). Using that **z**^{k}, **z**^{m}, **o**, and **b** are orthogonal and that $\Vert o\Vert 2=\sigma obs2N$, we have $\u2061(zk+b\u2212o)\u22c5(zm+b\u2212o)=B2N+\sigma obs2N$. Likewise we obtain $\Vert zk+b\u2212o\Vert 2=\sigma mod2N+\sigma mod2N+B2N$ by using the orthogonality and that $\Vert xk\Vert 2=\sigma mod2N$. Therefore,

For the relative error, Eq. (4), we proceed as follows. From Eq. (1) we have $\Vert o\u2212x\xaf\Vert 2/N=B2+\sigma obs2$ for large *K*. We also see, most easily seen by setting *K* = 1 in Eq. (1), that $\Vert o\u2212xk\Vert 2/N=B2+\sigma obs2+\sigma mod2$. This expression is independent of *k* and is therefore also the median error. Finally,

from which Eq. (4) follows by reducing by $\sigma mod2$.

### APPENDIX B

#### Demonstration of Concentration of Measure for Dependent Data

Here we demonstrate the concentration of measure and the relation to the effective degrees of freedom. In Christiansen (2018a) we considered univariate Gaussian and uniform distributions of different dimensions *N* but only for isotropic and independent data. To illustrate the convergence for dependent data with different effective degrees of freedom, we generate multivariate Gaussian data with the exponential covariance function *σ*_{i}*σ*_{j}*θ*^{|i−j|}, *i*, *j* = 1, …, *N*, and zero mean. Here, the diagonal elements $\sigma i2$ give the variance of the different components. The amount of correlation between the components is given by *θ*: *θ* = 0 corresponds to uncorrelated components while *θ* = 1 corresponds to perfectly correlated components. To include anisotropy in the variances we let $\sigma i2=1+5i/N$, so that the variances vary with a factor of 5.

For a given *θ* and *N* we draw Gaussian distributed **x**^{k}, *k* = 1, …, *K*, with the specified covariance matrix and calculate $\Vert xk\Vert 2/N$. In Fig. B1 the distributions of this quantity (over *K* = 4000 realizations) are shown for *θ* = 0.6 and *N* = 50, 200, 400, and 800. As expected, the distributions are centered around $\sigma a2=\u22111N\sigma i2/N=3.5$ (the average variance). Likewise, the distributions of the angles *ϕ*—from $xi\u22c5xj=\Vert xj\Vert \Vert xi\Vert cos\u2061(\varphi )$—are centered around *π*/2 (not shown). More importantly, the widths of the distributions decrease with increasing *N*.

We have confirmed that this behavior is not restricted to the specific form of the covariance matrix by using, for example, a Cauchy covariance function $\u2061(1+|i\u2212j|2/\theta 2)\u22121$ (not shown). We have also confirmed that it is not restricted to Gaussian distributed data. This is done by rank transforming the Gaussian data followed by an appropriate scaling. We then get a distribution that has marginal uniform distributions and the same rank-correlation structure as the original Gaussian data.

There are several definitions and algorithms for the effective degrees of freedom (Bretherton et al. 1999; Wang and Shen 1999). Some of these were compared by Wang and Shen (1999) when applied to the surface temperature. The methods all have different strengths and weaknesses—e.g., some do not take anisotropy into account—and generally agree only within a factor of 2. The *ξ*^{2} method may be particularly relevant here as we consider sum of squares. For identically and independently distributed data we have the well-known result $\u22111Nxi2/N~N\u2061(\sigma 2,2\sigma 4/N)$ for large *N* (Ferguson 1996, chapter 7). In fact, this is a simple example of the concentration of measure as for large *N* the variance of $\u22111Nxi2/N$ goes to zero as 1/*N* and $\u22111Nxi2/N$ itself becomes *σ*^{2}. The *ξ*^{2} method is based on the assumption that for dependent data this decay should hold when substituting *N* with *N** (Lorenz 1969; Toth 1995). The effective degrees of freedom becomes $N*=N2/\u22111N\lambda i2$ where *λ*_{i}, *i* = 1, …, *N*, are the eigenvalues of the correlation matrix (Fraedrich et al. 1995).

For the example in Fig. B1, where we know the analytical covariance matrix, we get *N** = 23, 94, 188, and 376. Figure B2 more systematically shows the relation between *N** and the variance of the distributions of $\u22111Nxi2/N$. The $1/N*$ behavior is clearly seen with deviations only for the smallest *N**. The proportionality constant varies with the details of marginal distributions and the amount of anisotropy. In these examples the *ξ*^{2} method was applied to analytical covariance matrices. Unfortunately, the *ξ*^{2} method tends to strongly underestimate the degrees of freedom when applied to empirical covariance matrices (Wang and Shen 1999).

## REFERENCES

*Earth Syst. Dyn.*

*Geophys. Res. Lett.*

*J. Climate*

*Earth Syst. Dyn.*

*Pattern Recognition and Machine Learning*. 2nd ed. Springer-Verlag, 738 pp.

*Climate Dyn.*

*Foundations of Data Science*. Cornell University, 454 pp., https://www.cs.cornell.edu/jeh/book.pdf.

*Geophys. Res. Lett.*

*J. Climate*

*Nonlinear Dynamics New Directions*, Springer, 47–85.

*Learning from Data: Concepts, Theory, and Methods*. 2nd ed. John Wiley and Sons, 560 pp.

*J. Climate*

*J. Climate*

*J. Climate*

*Mon. Wea. Rev.*

*Int. J. Mod. Phys.*

*Conf. on Math Challenges of the 21st Century*, Los Angeles, CA, American Mathematical Society.

*Mon. Wea. Rev.*

*Environ. Res. Lett.*

*A Course in Large Sample Theory*. Chapman and Hall, 245 pp.

*Climate Change 2013: The Physical Science Basis*, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.

*J. Climate*

*J. Climate*

*J. Geophys. Res.*

*IFAC-PapersOnLine*

*Entropy*

*J. Roy. Stat. Soc.*

*Mon. Wea. Rev.*

*Climate Dyn.*

*Earth Syst. Dyn.*

*ECMWF Newsletter*, No. 159, ECMWF, Reading, United Kingdom, 17–24, https://www.ecmwf.int/en/elibrary/19027-global-reanalysis-goodbye-era-interim-hello-era5.

*Bull. Amer. Meteor. Soc.*

*Climatic Variations and Forcing Mechanisms of the Last 2000 Years*, Global Environmental Change, Vol. 41, Springer-Verlag, 625–644.

*J. Amer. Stat. Assoc.*

*Bull. Amer. Meteor. Soc.*

*J. Climate*

*Geophys. Res. Lett.*

*Geophys. Res. Lett.*

*Ann. Probab.*

*J. Geophys. Res. Atmos.*

*J. Climate*

*J. Atmos. Sci.*

*J. Adv. Model. Earth Syst.*

*Geophys. Res. Lett.*

*J. Adv. Model. Earth Syst.*

*J. Climate*

*Rep. Prog. Phys.*

*ECMWF Newsletter*, No. 106, ECMWF, Reading, United Kingdom, 10–17, https://www.ecmwf.int/sites/default/files/elibrary/2006/18024-ensemble-prediction-pedagogical-perspective.pdf.

*J. Climate*

*Environ. Sci. Policy*

*Est modus in rebus*: Analytical properties of multi-model ensembles

*Atmos. Chem. Phys.*

*Geophys. Res. Lett.*

*J. Climate*

*Geosci. Model Dev.*

*J. Climate*

*J. Geophys. Res. Atmos.*

*Geophys. Res. Lett.*

*Environmetrics*

*Nat. Climate Change*

*Inst. Hautes Études Sci. Publ. Math.*

*Proc. Seminar on Predictability*, Reading, United Kingdom, ECMWF.

*Bull. Amer. Meteor. Soc.*

*Philos. Trans. Roy. Soc.*

*Tellus*

*Atmos. Environ.*

*J. Climate*

*Climate Dyn.*

## Footnotes

Denotes content that is immediately available upon publication as open access.

^{1}

The mean-square distance is $1/K\u222b0K\u2061(r\u2212K/2)2dr=K2/12$.

^{2}

This corresponds to a Bayesian model where the likelihood is $N\u2061(ck,\sigma k2)$ and the priors of *c*_{k} and $\sigma k2$ are $N\u2061(c0,\sigma c2)$ and $N\u2061(\sigma 02,\sigma \sigma 2)$. The distribution of $ynk$ is then found by marginalizing the product of the likelihood and the priors over $\sigma k2$ and *c*_{k}.