## Abstract

In fields such as climate science, it is common to compile an ensemble of different simulators for the same underlying process. It is a striking observation that the ensemble mean often outperforms at least half of the ensemble members in mean squared error (measured with respect to observations). In fact, as demonstrated in the most recent IPCC report, the ensemble mean often outperforms all or almost all of the ensemble members across a range of climate variables. This paper shows that these could be mathematical results based on convexity and averaging but with implications for the properties of the current generation of climate simulators.

## 1. Introduction

A striking feature of Fig. 9.7 of chapter 9 of the Fifth Assessment Report of the IPCC (Flato et al. 2013, p. 766) is that the ensemble mean consistently outperforms more than half of the individual simulators on all variables, as shown by the blue rectangles on the left-hand side of Fig. 9.7 in the column labeled “MMM.” Even more strikingly, the ensemble mean often outperforms all of the individual simulators (deep blue rectangles). This paper provides a mathematical explanation of these features.

Section 2 shows that it is a mathematical certainty that the ensemble mean will have a mean squared error (MSE) that is no larger than the arithmetic mean of the MSEs of the individual ensemble members. This result holds for any convex loss function, of which squared error is but one example. While this does not imply the same relation for root-mean-square error (RMSE) and the median ensemble member (as represented in Fig. 9.7), it makes a similar result plausible.

Section 3 establishes a stronger result, concerning the rank of the ensemble mean MSE among the individual MSEs (with an identical result for RMSE). This is based on a simple model of simulator biases and on an asymptotic treatment of the behavior of MSE in the case where the number of pixels increases without limit. Section 4 argues that this is a plausible explanation for the stronger result that the ensemble mean outperforms all of the individual simulators. A crucial aspect of this explanation is that it does not rely on “offsetting biases,” which would be inappropriate for the current generation of climate simulators.

In this paper, I exercise my strong preference for “simulator” over “model” when referring to the code that produces climate-like output (see Rougier et al. 2013). This allows me to use the word model without ambiguity to refer to statistical models.

## 2. Convex loss functions

Let be the output for simulator i in pixel j, where there are k simulators and n pixels. I will always use i to index simulators and j to index pixels and suppress the limits on sums, to reduce clutter. Let be the ensemble arithmetic mean for pixel j:

Let be the observation for pixel j. Denote the mean squared error as for simulator i, and for the ensemble mean:

Then we have the following result—that the MSE of the ensemble mean is never larger than the arithmetic mean of the MSEs of the individual simulators.

Result 1: Proof:

where (*) follows by Jensen’s inequality.

This result holds for any convex function; replacing with gives the same inequality for mean absolute deviation (MAD). The result for has previously appeared in the climate literature in Stephenson and Doblas-Reyes (2000) and Annan and Hargreaves (2011), but these authors failed to discover that it is the convexity of the loss function that induces this result.

This result falls short of being an explanation for the blue rectangles in the MMM column of Fig. 9.7 in Flato et al. (2013) in two respects. First, Fig. 9.7 is drawn for RMSE, not MSE, and second, it is drawn with respect to the median of the RMSEs of the ensemble, not the mean.

A weaker result is available for the RMSE; unfortunately Jensen’s inequality does not help here because the square root is a concave function (i.e., it bends the wrong way to extend the inequality). Let be the RMSE of simulator i, and define the sample mean and sample variance of the RMSEs as follows:

Then, starting from result 1,

If the variation in the simulators’ RMSEs is small relative to their mean, then we would expect the RMSE of the ensemble mean to be no larger than the mean of the RMSEs of the individual simulators (although this outcome is not a mathematical certainty).

Extending the result from the mean to the median is trickier. A histogram of MSEs will typically be very positively skewed, and the histogram of RMSEs will remain positively skewed. Therefore the median and the mean of the RMSEs will not be similar. Typically the median will be lower, and therefore the mean being an upper bound does not imply that the median is an upper bound. But progress can be made using the result from the next section.

## 3. A simple systematic bias model

Flato et al. (2013, p. 767) and others have commented on the “notable feature” that is typically smaller than any of the individual MSEs. This statement holds equally for MSE and RMSE because rankings are preserved under increasing transformations, such as the square root.

A simple thought experiment suggests that this is indeed notable. If all of the simulators had MSE and then we took pixel j in simulator i and steadily increased its value, then the MSE of simulator i and of the ensemble mean would increase. The rank of among would be , where the rank is defined as follows:

where , and denotes the number of elements in the set. By result 1, a rank of k is impossible if the are not identical. Thus, ranks of between 0 and are attainable, in principle, and ranks that are consistently small invite an explanation.

A candidate explanation is found in weather forecast verification, in which it is sometimes found that a high-resolution simulation has a larger MSE than a lower-resolution simulation when evaluated with high-resolution observations (see, e.g., Mass et al. 2002). The explanation is that if the high-resolution simulation puts a local feature such as a peak in slightly the wrong place (in space or time), then it suffers a “double penalty,” while a lower-resolution simulation, which does not contain the feature at all, suffers only a single penalty. Following similar reasoning, we might argue that the ensemble mean is flatter than any individual member and is thus penalized less if the individual members are putting local features in slightly wrong places. However, this argument is not compelling for the IPCC climate simulations, in which the observations have low resolution and there is already substantial averaging in the individual simulator output.

I propose a different explanation, in terms of the simulators’ “biases.” Suppose each simulator has a systematic bias . Then over a large number of pixels the MSE of simulator i would be approximately plus a constant. The ensemble mean at each pixel, though, would average the biases to the value . Then over a large number of pixels the MSE of the ensemble mean would be approximately plus a smaller constant (see the proof of result 2). This looks to be a promising explanation, but there is work to be done to establish the conditions under which the MSE of the ensemble mean is driven down toward or even below the smallest of the individual MSEs and also to establish that this is not just an offsetting biases argument, which would be inappropriate for climate simulators (see section 4).

The mathematical challenge is that are not mutually independent; this difficulty was noted by Annan and Hargreaves (2011, p. 4532), who were unable to go beyond a heuristic explanation. One way to finesse this difficulty is with asymptotics (i.e., to consider the limit as the number of pixels increases without bound). This is not literally possible with climate simulators (which have a fixed domain), but, as is common practice in statistics, asymptotic results serve to illuminate the situation when n is large (see, e.g., Cox and Hinkley 1974, chapter 9). Results established asymptotically can be checked for finite n by simulation.

An asymptotic approach requires a statistical model of the joint relationship between the simulator output and the observations. Any results that are proven on the basis of the model are likely to hold for actual ensembles that might have been simulated from the model. Therefore we look to make the model as general as possible; the approach below is to start with a simple model and then to check that the results generalize.

Define , and take the following as the statistical model:

where is the “bias” of simulator i, , and, for usage below, is the arithmetic mean of . From now on, treat and as fixed constants in order to avoid writing in every probability statement. The following result shows that, for this model, in the limit as the rank of among is completely and simply determined by and .

Result 2: for the model given in (2),

where “rank” was defined in (1).

The asymptotic theory in the following proof can be found in van der Vaart (1998, chapter 2; hereinafter VDV).

Proof: in terms of ,

Equation (2) and the weak law of large numbers (WLLN; VDV, section 2.16) imply

where denotes convergence in probability (VDV, section 2.1). For ,

where . Equation (2) and the WLLN imply

Each is converging in probability, and is converging in probability; hence, is converging in probability (VDV, section 2.7). Then the continuous mapping theorem (VDV, section 2.3) implies the following:

This implies that in the limit as ,

from which result 2 follows directly.

The result for ranking under RMSE is identical.

### Generalizations

The normal distribution for is unnecessary; all that is required for the WLLN is that . Different loss functions, such as MAD, can replace squared loss, providing that , where L is the loss function. The common value of for all simulators can be relaxed, at the expense of a slightly more complicated expression in result 2. The independence can be relaxed somewhat, as long as the correlation length across pixels is small relative to the size of the domain. In particular, neighboring pixels from the same simulator can have some positive correlation in their . These generalizations affect the rate of convergence, and thus the accuracy of the result for finite n, but they do not affect the limit.

## 4. Interpretation

Result 2 shows that offsetting biases across the simulators in the ensemble, leading to , would suffice to ensure that rank = 0 in the limit as . However, offsetting biases is an implausible hypothesis for the current generation of climate simulators, as has been shown empirically in the “genealogy” study of Knutti et al. (2013). Simulators have common biases, which cannot be expected to offset each other over the entire ensemble to give an overall mean of approximately zero. Additionally, as a reviewer has pointed out, the use of imperfect observations for introduces another common bias across all simulators.

Therefore it is interesting that result 2 can provide other sufficient conditions for which rank = 0, or is very small.

Result 3: if for all i, then

Proof: If for all i then , and . Thus, a necessary (but not sufficient) condition for in the limit as is , where the right-hand side goes to 0 as .

The condition in result 3 can be summarized as the simulators’ biases are smaller in absolute size than the large pixel errors. If individual simulators are tuned more on their overall bias than their large pixel errors, then we might expect something similar to this condition to hold.

Result 2 also illustrates when the ensemble mean performs badly. The two situations, good (for the ensemble mean, according to result 3) and bad, are shown in Figs. 1 and 2, in the limit as . In the first case, and rank = 0. In the second case, and . The larger than σ pull the value of above , and this allows the small to pass under the threshold in result 2 and raise the rank.

Fig. 1.

A configuration of biases with for which rank = 0 in the limit as . Here, and .

Fig. 1.

A configuration of biases with for which rank = 0 in the limit as . Here, and .

Fig. 2.

As in Fig. 1, but with all of the values increased by σ; the asymptotic rank has increased from 0 to 3.

Fig. 2.

As in Fig. 1, but with all of the values increased by σ; the asymptotic rank has increased from 0 to 3.

There is a reason to distrust the asymptotic result when n is small. The distribution of is very positively skewed so that, for small n, the value of will typically be less than its expectation, possibly much less. The value of will typically be closer to its expectation. Therefore, the asymptotic rank is likely to be similar to a lower bound on the finite-n rank.

This can be tested in a stochastic simulation study. In each simulation, the configuration is generated using the following:

For each configuration, the distribution of is computed using the model in (2). I then repeat with all of the increased by σ. I set , the same as Fig. 9.7 in Flato et al. (2013), and , which is the number of land regions in Giorgi and Mearns (2002) and much lower than the typical number of pixels. This small n is used to challenge the asymptotic nature of result 2.

The simulation study reveals that the asymptotic approximation is accurate in the “good” configurations of Fig. 1. For all 30 configurations, the asymptotic value for the rank of in is 0. In every configuration, the probability of rank = 0 is at least 0.94, and the median probability across the configurations is 0.99. Across the configurations, the maximum value for the largest rank is 3, and the median value for the largest rank is 1. In summary, under the good configuration of the biases it is highly probable that the ensemble mean will outperform all of the individual simulators.

The outcome for the “bad” configurations (see Fig. 2) is shown in Fig. 3. As anticipated, the distribution of the rank has shifted upward away from 0 for each configuration, and it is clearer that the asymptotic result provides an approximate lower bound on the rank. The median rank for this simulation study is 21 since . Under the bias model, the distribution of the rank is located entirely below the median for each configuration. Were n to be increased, the distribution of the rank in each configuration would collapse toward its asymptotic value. The median asymptotic value across the configurations is . In summary, under the bad configuration of the biases it is highly probable that the ensemble mean will perform far better than the median simulator.

Fig. 3.

Simulation study for when the configuration of μ does not satisfy , as shown in Fig. 2; 30 different configurations of μ are used, with the boxplots showing the distribution of the rank of in . The “ledge” for each configuration shows the asymptotic rank, using result 2. The box-and-whisker plots have the standard definition, for which the box shows the interquartile range and the median, and the whiskers extend to the most extreme value no more than 1.5 box widths from the box.

Fig. 3.

Simulation study for when the configuration of μ does not satisfy , as shown in Fig. 2; 30 different configurations of μ are used, with the boxplots showing the distribution of the rank of in . The “ledge” for each configuration shows the asymptotic rank, using result 2. The box-and-whisker plots have the standard definition, for which the box shows the interquartile range and the median, and the whiskers extend to the most extreme value no more than 1.5 box widths from the box.

Thus, the mathematics and the stochastic simulations show that the simulator biases model provides an explanation for Flato et al.’s (2013) “notable feature” of their Fig. 9.7: perhaps it is because for many of the variables the simulators’ biases are smaller in absolute size than the large pixel errors. In this case, the notable feature of Fig. 9.7 is not just a mathematical artifact but is telling us something interesting about the current generation of climate simulators.

Finally I would like to end with a caution about how to report and summarize ensemble model experiments. During the process of tuning the parameters of a climate simulator, a research group creates an ensemble of simulator versions with slightly different parameterizations. Result 2 suggests that they may get a lower MSE from the ensemble mean than from their best-tuned simulator—again, we cannot assume offsetting biases in this case. If simulators are judged by the wider community on their MSEs, with more kudos and funding going to those research groups with lower MSEs, then the temptation will be to publicize the output from the ensemble mean rather than the best-tuned simulator. And yet the ensemble mean is “less physical” at the pixel scale since the space of climate states is not convex: linear combinations of valid climate states are not necessarily valid climate states. This makes the ensemble mean less suitable for providing boundary conditions (e.g., for regional downscaling and risk assessment). Therefore research groups might consider how to certify the output they publicize if they do not want to put their simulators in the public domain.

## Acknowledgments

I would like to thank Yi Yu for her very helpful comments on my mathematics, Reto Knutti and Ben Sanderson for a discussion that prompted me to investigate the result in section 3, and Ken Mylne for an illuminating conversation about weather forecast verification. Two reviewers made perceptive comments on all aspects of this paper and substantial improvements in its clarity. This research was supported by the EPSRC SuSTaIn Grant EP/D063485/1.

## REFERENCES

REFERENCES
Annan
,
J.
, and
J.
Hargreaves
,
2011
:
Understanding the CMIP3 multimodel ensemble
.
J. Climate
,
24
,
4529
4538
, doi:.
Cox
,
D.
, and
D.
Hinkley
,
1974
:
Theoretical Statistics.
Chapman and Hall, 528 pp.
Flato
,
G.
, and Coauthors
,
2013
: Evaluation of climate models. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.
Giorgi
,
F.
, and
L.
Mearns
,
2002
:
Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the “reliability ensemble averaging” (REA) method
.
J. Climate
,
15
,
1141
1158
, doi:.
Knutti
,
R.
,
D.
Masson
, and
A.
Gettelman
,
2013
:
Climate model genealogy: Generation CMIP5 and how we got there
.
Geophys. Res. Lett.
,
40
,
1194
1199
, doi:.
Mass
,
C.
,
D.
Owens
,
K.
Westrick
, and
B.
Colle
,
2002
:
Does increasing horizontal resolution produce more skillful forecasts?
Bull. Amer. Meteor. Soc.
,
83
,
407
430
, doi:.
Rougier
,
J.
,
M.
Goldstein
, and
L.
House
,
2013
:
Second-order exchangeability analysis for multi-model ensembles
.
J. Amer. Stat. Assoc.
,
108
,
852
863
, doi:.
Stephenson
,
D.
, and
F.
Doblas-Reyes
,
2000
:
Statistical methods for interpreting Monte Carlo ensemble forecasts
.
Tellus
,
52A
,
300
322
, doi:.
van der Vaart
,
A.
,
1998
:
Asymptotic Statistics.
Cambridge University Press, 462 pp.