## Abstract

When comparing climate models to observations, it is often observed that the mean over many models has smaller errors than most or all of the individual models. This paper will show that a general consequence of the nonintuitive geometric properties of high-dimensional spaces is that the ensemble mean often outperforms the individual ensemble members. This also explains why the ensemble mean often has an error that is 30% smaller than the median error of the individual ensemble members. The only assumption that needs to be made is that the observations and the models are independently drawn from the same distribution. An important and relevant property of high-dimensional spaces is that independent random vectors are almost always orthogonal. Furthermore, while the lengths of random vectors are large and almost equal, the ensemble mean is special, as it is located near the otherwise vacant center. The theory is first explained by an analysis of Gaussian- and uniformly distributed vectors in high-dimensional spaces. A subset of 17 models from the CMIP5 multimodel ensemble is then used to demonstrate the validity and robustness of the theory in realistic settings.

## 1. Introduction

Comparing an ensemble of different model experiments with observations has become a key component of climate modeling. In such studies, the model mean (also known as the ensemble mean) has often turned out to be closer to the observations than all or most of the individual models for a wide range of climate variables, model systems, and error measures (Lambert and Boer 2001; Gleckler et al. 2008; Pincus et al. 2008; Knutti et al. 2010; Sillmann et al. 2013; Flato et al. 2013). Similar behavior has been described for numerical weather prediction and seasonal forecasts (Toth and Kalnay 1997; Hamill and Colucci 1997; Du et al. 1997; Hagedorn et al. 2005; Krishnamurti et al. 1999; Casanova and Ahrens 2009), as well as for air quality modeling (van Loon et al. 2007; Delle Monache and Stull 2003; McKeen et al. 2005). In the climate modeling context, explanations have been suggested by Annan and Hargreaves (2011) and Rougier (2016). Rougier (2016) noted that the mean squared error of the model mean is less than the average of the mean squared errors of the individual models. However, this does not hold for all metrics, in particular not for the root-mean-square error (RMSE), and it fails to explain why all or most of the individual ensemble members have larger errors than the model mean. Noting this, Rougier (2016) then suggested an explanation involving systematic biases. However, these previous explanations do not seem compelling. As the discussed behavior is ubiquitous and found across a range of climate variables and metrics, it calls for a simple and general explanation. Here we suggest such a general explanation based on the geometric properties of high-dimensional spaces. This explanation assumes only that the observations and the models are drawn from the same distribution, and it correctly predicts how much better the model mean will be. Furthermore, it reveals why the superiority of the model mean becomes more common as the dimension increases.

The curse of dimensionality refers to the complexities of high-dimensional spaces that often defy our intuition based on two and three dimensions (Bishop 2007; Cherkassky and Mulier 2007). Apart from the well-known fact that the number of samples needed to obtain a given density grows exponentially with dimension, there are other less appreciated features of high-dimensional spaces (Blum et al. 2017). For a sphere or cube, the volume increasingly concentrates near the surface when the dimension increases. Independent random vectors in high-dimensional spaces are almost always orthogonal, and almost every point is an outlier in its own projection. Below, we will show that another consequence is that for a sample of random points in a high-dimensional space, the typical distance between two points is significantly larger than the typical distance between one point and the mean of the sample.

If one considers a spatial field of, for example, the surface temperature, rainfall, or the 500-hPa geopotential height, and uses, for example, the root-mean-square error over the grid points as a measure of the difference between models and observations, then one operates in a space with the dimension given by the number of grid points. The number of grid points may be very large, though in many cases, nearby grid points are not independent, and the correct dimension should reflect the effective number of independent points. This number of spatial degrees of freedom is difficult to estimate (Christiansen 2015) and depends on the field, the time scale, and the region under consideration (Jones and Briffa 1996; North et al. 2011). It is on the order of 50–100 for monthly surface temperatures in the Northern Hemisphere (Wang and Shen 1999; Bretherton et al. 1999) and probably an order of magnitude larger for daily means. For daily precipitation, the number of spatial degrees of freedom is much larger (Moron et al. 2006; Benestad 2013).

The relevant effect of the high-dimensional space is easy to demonstrate in any programming language. Draw 30 independent vectors with 100 elements, and calculate the mean vector. Draw an additional vector, also with 100 elements, to represent the observations. Calculate the errors with respect to the observations for the 30 individual vectors and for the mean vector. Observe that the error of the mean vector is (almost always) smaller than all the errors of the individual vectors.
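
The steps above can be sketched in NumPy (a minimal illustration, not the paper's code; the seed is arbitrary, and K = 30, N = 100 follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 30, 100  # number of "models" and dimension of each vector

models = rng.standard_normal((K, N))   # the 30 model vectors
obs = rng.standard_normal(N)           # the "observations", from the same distribution

mean_model = models.mean(axis=0)       # the ensemble mean vector
member_errors = np.linalg.norm(models - obs, axis=1)  # Euclidean error of each member
mean_error = np.linalg.norm(mean_model - obs)         # error of the ensemble mean

print(mean_error < member_errors.min())  # almost always True in high dimensions
```

Rerunning with different seeds shows how robust the effect is: for N = 100, the ensemble mean practically never loses to even the best individual member.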

The theory is explained and illustrated by simple examples in section 2. In section 3, we extend the simple examples and compare our study with previous work. Section 4 demonstrates that the predictions of the theory hold when a subset of phase 5 of the Coupled Model Intercomparison Project (CMIP5) multimodel ensemble is considered.

## 2. Simple analysis and theory

Consider *K* vectors $\mathbf{x}_k$, $k = 1, \dots, K$, in the *N*-dimensional space $\mathbb{R}^N$. These vectors represent *K* models, and $\mathbf{x}_k$ could, for example, be a spatial field, with *N* then being the number of grid points. The model mean $\mathbf{z} = K^{-1}\sum_{k=1}^{K}\mathbf{x}_k$ is then a vector of dimension *N*. The observations are also given by a vector of dimension *N*, $\mathbf{o}$. We now make the simple assumption that the observations and the models are drawn independently from the same distribution. This describes the situation when the only differences between the individual models, and between the observations and the individual models, are due to internal variability. Thus, the models are assumed to be without bias and with a realistic variability.

We first assume that all components are drawn from a standard Gaussian distribution (zero mean and unit variance). We begin by considering the Euclidean metric $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = [\sum_{i=1}^{N}(x_i - y_i)^2]^{1/2}$. In one dimension, the distance becomes simply the absolute value of the difference, $|x - y|$. The length of the *k*th model is $\|\mathbf{x}_k\|$, and its error is $\|\mathbf{x}_k - \mathbf{o}\|$. Likewise, the error of the model mean is $\|\mathbf{z} - \mathbf{o}\|$. Note that this error differs by a factor of $\sqrt{N}$ from the usual root-mean-square error. This is a convenient choice, as the error then equals the distance from the model to the observations. This choice does not influence the comparison of errors of the individual models with errors of the model mean.

Figure 1 shows the situation in the one-dimensional case with 30 models $(K = 30)$. Here, 30 independent random numbers are drawn to represent the models, and one additional number is drawn to represent the observations. In the one-dimensional case, the distances are simply the absolute values, and the errors are the absolute values of the differences. The model mean is closer to zero than most of the individual models, but the histogram of the lengths of the individual models peaks at zero (left panel). Considering the errors (right panel), the model mean therefore does not stick out drastically from the individual models. However, even for $N = 1$, there is already a tendency for the model mean to be better than most individual models. This can be understood from the assumption that both models and observations are drawn from the same distribution. Let us assume that the observation $o$ is positive. For large *K*, the model mean *z* will be close to zero. If the distribution is symmetric, then around half of the individual models will be negative and have larger errors than the model mean. In fact, all models larger than $2o$ will also have larger errors than the model mean. Analytical integrations show that on average, for large *K*, the model mean will be better than around 65% [$5/8$ and $1 - \arctan(2)/\pi \approx 0.65$ for uniformly and Gaussian-distributed numbers, respectively] of the individual ensemble members.
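
These one-dimensional fractions are easy to reproduce with a small Monte Carlo sketch (not from the paper; the seed, the number of trials, and the large K are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
trials, K = 2000, 1000  # many trials; large K so the model mean is near zero

# Gaussian case: fraction of members with larger error than the model mean
models = rng.standard_normal((trials, K))
obs = rng.standard_normal((trials, 1))
z = models.mean(axis=1, keepdims=True)
frac_gauss = (np.abs(models - obs) > np.abs(z - obs)).mean()

# Uniform case, same construction
u_models = rng.uniform(-1, 1, (trials, K))
u_obs = rng.uniform(-1, 1, (trials, 1))
u_z = u_models.mean(axis=1, keepdims=True)
frac_unif = (np.abs(u_models - u_obs) > np.abs(u_z - u_obs)).mean()

print(round(frac_gauss, 2), round(frac_unif, 2))  # near 0.65 and 0.625
```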

For higher dimensions, the situation changes drastically (Fig. 2). Here, 30 independent random vectors of 50 dimensions are drawn representing the models, and one additional 50-dimensional vector is drawn to represent the observations. Now the distribution of the lengths of the individual models does not peak at zero, but at a finite value. This distribution is quite narrow, and the model mean is closer to zero and totally separated from the individual models (Fig. 2, left panel). In the right panel of Fig. 2, we see the desired result: the error of the model mean is smaller than the errors of the individual models. This is due to two properties. The first is that the model mean is near zero, so the error of the model mean has the value of a typical length of an individual model. The second property is that, in contrast, the errors of the individual models are larger than the typical lengths of the individual models.

The dependence on the dimension *N* is shown in Fig. 3. For each *N*, *K* independent random *N*-dimensional vectors are drawn representing the *K* models. One additional random *N*-dimensional vector is drawn to represent the observations. The errors of the individual models are calculated, and their mean and standard deviation are shown with the cyan curves. The error of the model mean is also calculated (red curves). As above, this error is comparable to the lengths of the individual models, as the model mean is situated near zero (green curve), and the individual models and the observations are drawn from the same distribution. For *N* larger than about 15, the errors of the individual models are significantly larger than the error of the model mean. As the dimension *N* increases, both the errors of the individual models and the error of the model mean increase. However, the errors of the individual models increase faster, while the standard deviations remain almost constant. Thus, when the dimension increases, the model mean becomes an increasingly better estimate of the observations than the individual models.

We can understand this analytically. The surface area of a hypersphere with radius *r* in *N* dimensions is $S_N(r) = [2\pi^{N/2}/\Gamma(N/2)]\,r^{N-1}$.^{1} The standard Gaussian probability distribution is $p(\mathbf{x}) = (2\pi)^{-N/2}\exp(-\|\mathbf{x}\|^2/2)$. So as a function of *r*,

$$p(r) = S_N(r)\,(2\pi)^{-N/2} e^{-r^2/2} = \frac{r^{N-1} e^{-r^2/2}}{2^{N/2-1}\,\Gamma(N/2)}. \quad (1)$$

The maximum of $p(r)$ is reached for $r = \sqrt{N-1}$, and the width (standard deviation) of the peak approaches $1/\sqrt{2}$, independent of *N*. Thus, for increasing *N*, the models lie in an annulus in $\mathbb{R}^N$ with radius $\sqrt{N-1} \approx \sqrt{N}$. The mean distance between the center and the observations is therefore also near $\sqrt{N}$. As the model mean will be situated near the center (see next paragraph), this is also the mean distance between the model mean and the observations (the model mean error).

The mean distance between two points on the surface of the unit hypersphere (the mean chord length) is $\sqrt{2}$ in the limit of large *N*. This is a special case of the general fact that in high-dimensional space, two independent random vectors are almost always orthogonal. Thus, the mean distance between individual models is $\sqrt{2N}$, which is also the mean distance between the observations and the individual models, that is, the individual model errors. Because the mean radius of the annulus increases while the width stays the same for increasing *N*, the model mean will become more separated from the individual models.

Let us also consider how the length of the model mean approaches zero. We have $\|\mathbf{z}\| = \|K^{-1}\sum_{k=1}^{K}\mathbf{x}_k\|$. Each component $z_i$ is Gaussian-distributed with variance $1/K$, so $\sqrt{K}\,\|\mathbf{z}\|$ is *χ* distributed with *N* degrees of freedom. As the mean and variance of a *χ* distribution with *N* degrees of freedom are $\mu_N = \sqrt{2}\,\Gamma[(N+1)/2]/\Gamma(N/2)$ and $N - \mu_N^2$, it is seen that $\|\mathbf{z}\|$ has a mean of $\mu_N/\sqrt{K}$ and a standard deviation of $\sqrt{(N - \mu_N^2)/K}$. Note that $\mu_N \approx \sqrt{N}$ for large *N*, so $\|\mathbf{z}\| \approx \sqrt{N/K}$. We see that the required number of models *K* for $\|\mathbf{z}\|$ to be small relative to $\sqrt{N-1}$, for which $p(r)$ peaks, is independent of *N*.
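
The three asymptotic lengths derived above ($\sqrt{N}$ for individual vectors, $\sqrt{2N}$ for pairwise distances, and $\sqrt{N/K}$ for the mean) can be checked numerically; the following sketch uses arbitrary values N = 400 and K = 30:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 400, 30
x = rng.standard_normal((K, N))

lengths = np.linalg.norm(x, axis=1)          # individual lengths, concentrated near sqrt(N)
diffs = x[:, None, :] - x[None, :, :]        # all pairwise difference vectors
dists = np.linalg.norm(diffs, axis=2)
pair_mean = dists[np.triu_indices(K, 1)].mean()  # mean pairwise distance, near sqrt(2N)
mean_len = np.linalg.norm(x.mean(axis=0))    # length of the ensemble mean, near sqrt(N/K)

print(round(lengths.mean(), 1), round(np.sqrt(N), 1))   # both ~20.0
print(round(pair_mean, 1), round(np.sqrt(2 * N), 1))    # both ~28.3
print(round(mean_len, 1), round(np.sqrt(N / K), 1))     # both ~3.7
```

The narrow spread of `lengths` (standard deviation near $1/\sqrt{2}$, independent of N) is the annulus described above.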

The results are not particular for the Gaussian distribution, as we have confirmed by repeating the numerical experiments with *t*-distributed and uniformly distributed numbers. This can also be argued from the central limit theorem, which implies that the squared lengths of random vectors will be Gaussian-distributed for large *N* independent of the distributions of the individual components. In the non-Gaussian cases, further analytic results are difficult to obtain, but note that a cube in *N* dimensions will have $2^N$ vertices. For a unit hypercube, the volume is one, but when *N* increases, the volume will be increasingly concentrated in a thin layer near the surface. The vertices will increase their distance from the center as $\sqrt{N}/2$, while the volume of the central *N*-dimensional sphere inscribed in the cube will decrease very fast, as $(\pi/4)^{N/2}/\Gamma(N/2 + 1)$. Thus, the hypercube will be very spiky and can be perceived, when projected on a three-dimensional space, to have the form of a sea urchin or porcupine (Hecht-Nielsen 1990). The model mean is therefore special, as it is located near the center, while the individual models are located in the spikes.

Above, we have shown that the model mean often outperforms the individual ensemble members, but the results are also valid when considering the median of the models instead. Figure 4 shows results when all components are drawn from a uniform distribution and the median of the models is considered instead of the mean.

The results are also not sensitive to the distance function used (and hence not to the error metric) and do not, in particular, depend on the convexity of the distance function. This has been confirmed by repeating the numerical experiments with alternative distance functions, both convex and nonconvex. Spatial correlations could also be used as a measure.

In the validation of model ensembles (Gleckler et al. 2008; Sillmann et al. 2013; Flato et al. 2013), the errors of the ensemble members are often given relative to their median error. Thus, a value of 0.1 means that the specific ensemble member has an error 10% larger than the median error of the model ensemble. Remarkably, the relative error for the model mean is consistently around −0.3 for a large range of fields and measures. This is clearly seen in Figs. 3 and 7 in Gleckler et al. (2008), Fig. 10 in Sillmann et al. (2013), and Figs. 9.7 and 9.37 in Flato et al. (2013). Thus, the model mean is very often 30% better than the typical individual model.

While this might seem surprising, it is a simple consequence of the theory above. In high-dimensional space, two independent random points are almost always orthogonal, and they are almost at the same distance from the center where the model mean is situated. Thus, the observations, any ensemble member, and the model mean form an isosceles (two equal sides) right triangle, and the relative error of the model mean is, therefore, $1/\sqrt{2} - 1 \approx -0.29$.
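
The predicted relative error can be verified with a quick simulation (a sketch; the seed and N are arbitrary, and K = 17 anticipates the CMIP5 subset used later):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 17, 1000
models = rng.standard_normal((K, N))
obs = rng.standard_normal(N)

errors = np.linalg.norm(models - obs, axis=1)           # individual model errors
mean_error = np.linalg.norm(models.mean(axis=0) - obs)  # error of the model mean

# error of the mean relative to the median error of the members
relative = mean_error / np.median(errors) - 1
print(round(relative, 2))  # close to 1/sqrt(2) - 1, i.e., about -0.29
```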

Figure 5 (top) shows the relative error of Gaussian-distributed values as a function of the dimension *N*. Here we have chosen the number of models *K* to be 17 to be consistent with the next section. As the dimension *N* increases, the range of individual model errors decreases. However, the relative error of the model mean is close to −0.30 for *N* larger than 5–10. As a result, the error of the model mean is lower than the errors of all individual models for *N* larger than approximately 40. The relative errors do not depend on the width of the distribution, and very similar results are obtained for uniformly distributed values (Fig. 5, bottom).

Thus, the curse of dimensionality does not only predict that the model mean is better, but also how much better.

## 3. Extending the model and previous work

In the previous section, both the ensemble members and the observations were independently drawn from the same Gaussian distribution, $\mathcal{N}(\mathbf{0}, \mathbf{1})$, where $\mathbf{0}$ and $\mathbf{1}$ are the *N*-dimensional zero vector and the $N \times N$ identity matrix, respectively. Obviously, the choice of zero means does not change the generality of the results. Choosing different variances for the different dimensions also does not change the results, as long as the total variance is not dominated by a few dimensions.

We now more systematically extend the model to allow for a common bias between models and observations and for models and observations to have different variances. Thus, the ensemble members are drawn from $\mathcal{N}(\mathbf{b}, \sigma^2\mathbf{1})$ and the observations from $\mathcal{N}(\mathbf{0}, \mathbf{1})$. With $\beta^2 = \|\mathbf{b}\|^2/N$, and using the general results about the length and orthogonality for large *N* and *K*, we get

$$\|\mathbf{x}_k - \mathbf{o}\| \approx \sqrt{N(1 + \sigma^2 + \beta^2)} \quad (2)$$

and

$$\|\mathbf{z} - \mathbf{o}\| \approx \sqrt{N(1 + \beta^2)}. \quad (3)$$

For the relative error of the model mean, we get

$$\sqrt{\frac{1 + \beta^2}{1 + \sigma^2 + \beta^2}} - 1. \quad (4)$$

The relative error is shown as a function of $\beta$ and $\sigma$ in Fig. 6. The relative error changes only slowly with both parameters: for example, for $\beta = 1$ and $\sigma = 1$, the relative error is still around −20%. When the bias $\beta$ grows, the relative error increases (approaches zero), making the error of the ensemble mean closer to the errors of the individual ensemble members. The relative error also increases when $\sigma$ decreases, as could be the case if the models are imperfectly tuned. However, it is important to note that, as for the simpler model in the last section, the ensemble mean will still outperform all ensemble members, as follows from comparing Eqs. (2) and (3). This happens because the distribution of $\|\mathbf{x}_k - \mathbf{o}\|$ is an annulus with fixed width but with a radius that increases with *N*.
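
Equation (4) can be compared directly with a simulation of the biased model (a sketch; the bias direction and the values of β and σ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 1000, 200
beta, sigma = 1.0, 1.0                 # common bias and model spread (arbitrary)

b = np.full(N, beta)                   # bias vector with ||b||^2 = N * beta^2
models = b + sigma * rng.standard_normal((K, N))   # members ~ N(b, sigma^2 I)
obs = rng.standard_normal(N)                       # observations ~ N(0, I)

mean_error = np.linalg.norm(models.mean(axis=0) - obs)
median_error = np.median(np.linalg.norm(models - obs, axis=1))

rel_sim = mean_error / median_error - 1
rel_theory = np.sqrt((1 + beta**2) / (1 + sigma**2 + beta**2)) - 1
print(round(rel_sim, 2), round(rel_theory, 2))  # both near -0.18
```

Setting `beta = 0` and `sigma = 1` recovers the unbiased case and a relative error near −0.29.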

Van Loon et al. (2007) observed that when observations and ensemble members are drawn from a one-dimensional distribution $\mathcal{N}(\mu, \sigma^2)$, the expected value of the mean squared error of the ensemble mean is $\sigma^2(1 + 1/K)$, and the expected value of the mean squared error of an individual ensemble member is $2\sigma^2$. This corresponds to the one-dimensional model considered in the beginning of section 2. However, as shown there, one cannot conclude from this result that the ensemble mean is closer to the observations than any individual ensemble member.

Annan and Hargreaves (2011) assumed, as in the present paper, a model where observations and ensemble members are drawn from the same distribution. Based on numerical analysis, they found that for high dimensions, the errors of the individual ensemble members are a factor of $\sqrt{2}$ larger than the error of the ensemble mean. However, they only provided heuristic arguments, failing to realize that random vectors in high dimensions are almost always orthogonal. The factor of $\sqrt{2}$ was previously also noted by Du et al. (1997), based on theoretical arguments by Leith (1974).

Rougier (2016) studied the asymptotic situation for large *N*. He considered the model where the ensemble members are drawn from $\mathcal{N}(a_k\mathbf{b}, \mathbf{1})$, where $\mathbf{b}$ is a fixed vector and the $a_k$ are scalars; that is, the ensemble members have biases along the same direction but of different lengths. If $a_k = 0$ for all *k*, this reduces to the situation where the ensemble members are scattered randomly around the observation. In this situation, the error of the ensemble mean will trivially be smaller than the errors of the individual ensemble members for large *K* (the relative error will be −1). This limit is also found for large $\sigma$ in the model above [Eq. (4)]. Rougier (2016) studied the conditions when the $a_k$ values are small enough for the error of the ensemble mean to still be smaller than the errors of all the individual ensemble members. For large *K* and *N*, one such condition is $(\bar{a}^2 - a_k^2)\|\mathbf{b}\|^2 < N$ for all *k*. In the spirit of the present paper, we have in the limit of large *K* that $\mathbf{z} \approx \bar{a}\mathbf{b}$ and $\|\mathbf{z} - \mathbf{o}\|^2 \approx \bar{a}^2\|\mathbf{b}\|^2$ (overline denoting ensemble mean). And in the limit of large *N* (because of the orthogonality), $\|\mathbf{x}_k - \mathbf{o}\|^2 \approx a_k^2\|\mathbf{b}\|^2 + N$. From this follows Rougier's "Result 2" in the limit of large *K*. We also find that the relative error of the ensemble mean is $\bar{a}\|\mathbf{b}\|/\sqrt{a_{\mathrm{med}}^2\|\mathbf{b}\|^2 + N} - 1$, where $a_{\mathrm{med}}$ denotes the median of the $a_k$. Note that in contradiction with the model suggested in the present paper, this does not predict that the relative error of the ensemble mean is $\approx -0.3$, as found in the analyses of climate models.

We close this section with the related question considered by, for example, Hagedorn et al. (2005): How can a poor model add skill? The curse of dimensionality might help explain this. As we have seen in the limit of large *N*, the center is a special point that is vacant of both observations and individual ensemble members. Therefore, if the new model drags the ensemble mean closer to zero, then it might add skill to the ensemble mean. Assume we already have *K* models drawn from the same distribution. The mean is $\mathbf{z}_K = K^{-1}\sum_{k=1}^{K}\mathbf{x}_k$. In high dimensions, all $\mathbf{x}_k$ have the length *r* and are orthogonal, so $\|\mathbf{z}_K\|^2 = r^2/K$. Adding a model with length $r'$ gives the mean $\mathbf{z}_{K+1}$ with squared length $(Kr^2 + r'^2)/(K+1)^2$. The new length is less than the old when $r' < r\sqrt{2 + 1/K}$. Thus, a model that is poor in the sense that its error is larger than the errors of the other models can still add skill to the mean model, as long as its error does not exceed the errors of the other models by a factor of more than $\sqrt{2 + 1/K}$.
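
The $\sqrt{2 + 1/K}$ threshold follows from the algebra above and can be checked deterministically (a sketch with hypothetical lengths r and r′; the values of K and r are arbitrary):

```python
import numpy as np

def new_mean_sq_length(K, r, r_new):
    """Squared length of the mean after adding one model of length r_new
    to K mutually orthogonal models of length r (cross terms vanish)."""
    return (K * r**2 + r_new**2) / (K + 1)**2

K, r = 10, 1.0
old_sq = r**2 / K                       # squared length of the old mean
threshold = r * np.sqrt(2 + 1 / K)      # critical length of the added model

print(new_mean_sq_length(K, r, 0.9 * threshold) < old_sq)  # True: the "poor" model adds skill
print(new_mean_sq_length(K, r, 1.1 * threshold) > old_sq)  # True: too poor, it degrades the mean
```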

## 4. The CMIP5 multimodel ensemble

Here, we investigate the errors in a subset of 17 climate models from the CMIP5 (Taylor et al. 2012). We will show how the superiority of the model mean depends on the degrees of freedom. We will also show that there is symmetry between models and observations; it is not some particularity of either the observations (observational errors) or the models (biases) that make the model mean superior.

The models that are chosen from the major modeling centers are briefly identified in Table 1. Details can be found in Flato et al. (2013; their Table 9.A.1). We use monthly means from the historical experiments, and from each model, we use one ensemble member (r1i1p1). As observations, we use NCEP–NCAR data (Kalnay et al. 1996). We focus on the near-surface temperature (TAS). The model data, which come in different horizontal resolutions, are interpolated to the horizontal NCEP resolution (2.5° × 2.5° longitude–latitude grid) using a simple nearest-neighbor procedure.

The monthly and annual mean climatologies of the TAS are calculated using the period 1980–2005. The climatologies are calculated for all grid points, for the zonal means, and for the Northern Hemisphere (NH) mean. The errors are then calculated as the root-mean-square error over the included space and time, as in Gleckler et al. (2008). Thus, for monthly zonal mean climatologies, the root-mean-square errors are calculated over the 73 latitudes and the 12 calendar months.

The resulting relative errors (Fig. 7, top five rows) show the expected behavior. When most degrees of freedom are included, as for monthly climatologies for all grid points, the model mean has a relative error around −0.3 and smaller errors than all the individual models. When the number of degrees of freedom is reduced, as for the annual and monthly climatologies for the NH mean, several individual models are better than the model mean. It is difficult to estimate the number of degrees of freedom in the monthly climatology, but we note from Fig. 5 that the full separation between model mean and individual models can be expected for *N* around 35. Also note that the scatter of the model mean between −0.4 and −0.2 found in Fig. 7 is consistent with Fig. 5 for *N* less than 40. Figure 7 (sixth row) also shows the root-mean-square error of monthly anomalies of the TAS over the period 1980–2005 and over all grid points. This example has a huge number of degrees of freedom, and the model mean has a relative error near −0.3 and is well separated from the individual models. Note that this happens even though the models are initialized a long time before 1980 and any forecast skill is lost.

The errors in the upper part of Fig. 7 were based on models that had been bias corrected by adding in each grid point the difference between the long-term means of the model and the observations. We have confirmed that the bias correction had no effect on the results regarding the difference between the model mean and the individual models. The two rows in the middle of Fig. 7 demonstrate this.

The basic assumption that separates the high-dimensional explanation from explanations involving model biases is that the observation and the models are drawn from the same distribution. We test this symmetry by exchanging the observations with one of the 17 models and calculating the new model mean and the new set of errors. We can do this swapping for each of the 17 models, therefore giving us 17 test cases. Thus, in the *i*th case, the observations and the *i*th model are swapped so the model ensemble now includes 16 real models and the observations. The model mean is now calculated from this ensemble, and the errors are calculated relative to the *i*th model. As in the upper row of Fig. 7, the root-mean-square errors are calculated over the monthly climatology and all grid points. The results are shown in the lower part of Fig. 7. We see that in almost all cases (15 out of 17), the model mean has lower errors than the individual models. We also see that the relative errors are close to −0.29, except for one case. This demonstrates that the superiority of the model mean is an inherent property of the averaging process and is not connected to any model bias or imperfect observations.
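
The swap experiment is easy to emulate with synthetic data drawn from a single distribution (a sketch, not the CMIP5 calculation; 17 members and N = 500 are arbitrary stand-ins for the real setting):

```python
import numpy as np

rng = np.random.default_rng(5)
K, N = 17, 500
ensemble = rng.standard_normal((K, N))  # 16 "models" + 1 "observation", all iid

rel_errors = []
for i in range(K):                      # let member i play the observations
    truth = ensemble[i]
    rest = np.delete(ensemble, i, axis=0)          # the remaining 16 members
    errors = np.linalg.norm(rest - truth, axis=1)
    mean_error = np.linalg.norm(rest.mean(axis=0) - truth)
    rel_errors.append(mean_error / np.median(errors) - 1)

print(all(r < 0 for r in rel_errors))        # mean beats the median error in every case
print(round(float(np.mean(rel_errors)), 2))  # close to -0.27
```

Under the same-distribution hypothesis, every swap looks statistically alike, which is the symmetry the CMIP5 test probes.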

Note that the spread of the relative errors of the individual models in the swap experiments is larger than for the original data (first row in Fig. 7). This suggests that deviations from the hypothesis—observations and models are drawn from the same distribution—do exist. We find that this hypothesis is sufficient to explain the superiority of the model mean, but the swap experiment should not be seen as a general test of the interchangeability of models and observations, as it probably lacks statistical power. Adding an additional model closely related to one of the existing models (such as an r2i1p1 ensemble member or a model from the same family as an already included model) will not change the general results about the model mean. However, when one of these closely related models is swapped with the observations, its twin will have a much smaller relative error than the rest.

## 5. Conclusions

We have given a general explanation for the observation that the ensemble mean often outperforms all the individual ensemble members. This explanation does not need any assumptions beyond the assumption that observations and models are drawn from the same distribution. The explanation holds even in the simplest examples of Gaussian- or uniformly distributed data. The explanation is based on the nonintuitive properties of high-dimensional spaces. In such spaces, randomly chosen vectors are almost always orthogonal. Another important property is that the largest part of the volume is concentrated in a thin layer, with a distance to the center that increases with the dimension. On the other hand, the ensemble mean is exceptional, as it is located close to the center. This explanation also predicts the observation that the error of the ensemble mean often is 30% lower than the median error of the individual ensemble members.

Studying a subset of the CMIP5 multimodel ensemble, we confirmed previously published results that the model mean is, in general, superior to individual models and that the relative error of the model mean is around −30%. Here, we considered monthly and annual climatologies, as well as monthly anomalies of the near-surface temperatures. Similar results are found for other fields, such as zonal wind and geopotential height at different pressure levels.

We found, as expected from the theory, that the superiority of the ensemble mean is positively correlated with the number of degrees of freedom. We also found that the superiority of the ensemble mean does not depend on the application of simple bias correction. The validity, in the present context, of the assumption that observations and models are drawn from the same distribution was investigated by swapping individual models with the observations and repeating the calculations of the model mean and the relative errors. Again, we found that the new model mean has lower errors than the individual models and that the relative errors are close to −0.29. This demonstrates that the lower errors of the model mean are due to the averaging process itself and are not connected to imperfections in the models or observations.

More meaningful comparisons of individual models and model means could be done after applying dimension-reduction techniques such as principal component analysis.

## Acknowledgments

The NCEP–NCAR reanalysis data were provided by the NOAA/CIRES Climate Diagnostics Center (Boulder, Colorado) from their website (http://www.cdc.noaa.gov/). We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1) for producing and making available their model output. For CMIP, the U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals.

## REFERENCES

Bishop, C. M., 2007: *Pattern Recognition and Machine Learning.* 2nd ed. Information Science and Statistics Series, Springer, 738 pp.

Cherkassky, V., and F. Mulier, 2007: *Learning from Data: Concepts, Theory, and Methods*. 2nd ed. John Wiley & Sons, 538 pp.

Flato, G., and Coauthors, 2013: Evaluation of climate models. *Climate Change 2013: The Physical Science Basis*, T. F. Stocker et al., Eds., Cambridge University Press, 741–866, https://doi.org/10.1017/CBO9781107415324.020.

Hecht-Nielsen, R., 1990: *Neurocomputing*. Addison-Wesley, 433 pp.

Jones, P. D., and K. R. Briffa, 1996: *Climatic Variations and Forcing Mechanisms of the Last 2000 Years*, P. D. Jones, R. S. Bradley, and J. Jouzel, Eds., NATO ASI Series, Vol. 41, Springer, 625–644, https://doi.org/10.1007/978-3-642-61113-1_30.

## Footnotes

A comment/reply has been published regarding this article and can be found at http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-18-0274.1 and http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-18-0416.1

© 2018 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

^{1} $S_N(r) = \frac{2\pi^{N/2}}{\Gamma(N/2)}\,r^{N-1}$, where $\Gamma(m/2) = (m/2 - 1)!$ for *m* even and $\Gamma(m/2) = \sqrt{\pi}\,(m-2)!!/2^{(m-1)/2}$ for *m* odd.