1. Introduction
A striking feature of Fig. 9.7 of chapter 9 of the Fifth Assessment Report of the IPCC (Flato et al. 2013, p. 766) is that the ensemble mean consistently outperforms more than half of the individual simulators on all variables, as shown by the blue rectangles on the left-hand side of Fig. 9.7 in the column labeled “MMM.” Even more strikingly, the ensemble mean often outperforms all of the individual simulators (deep blue rectangles). This paper provides a mathematical explanation of these features.
Section 2 shows that it is a mathematical certainty that the ensemble mean will have a mean squared error (MSE) that is no larger than the arithmetic mean of the MSEs of the individual ensemble members. This result holds for any convex loss function, of which squared error is but one example. While this does not imply the same relation for root-mean-square error (RMSE) and the median ensemble member (as represented in Fig. 9.7), it makes a similar result plausible.
Section 3 establishes a stronger result, concerning the rank of the ensemble mean MSE among the individual MSEs (with an identical result for RMSE). This is based on a simple model of simulator biases and on an asymptotic treatment of the behavior of MSE in the case where the number of pixels increases without limit. Section 4 argues that this is a plausible explanation for the stronger result that the ensemble mean outperforms all of the individual simulators. A crucial aspect of this explanation is that it does not rely on “offsetting biases,” which would be inappropriate for the current generation of climate simulators.
In this paper, I exercise my strong preference for “simulator” over “model” when referring to the code that produces climate-like output (see Rougier et al. 2013). This allows me to use the word model without ambiguity to refer to statistical models.
2. Convex loss functions









Result 1: The MSE of the ensemble mean is no larger than the arithmetic mean of the MSEs of the individual ensemble members.


This result holds for any convex loss function: replacing squared error by any other convex loss, such as absolute error, gives the same inequality by the same argument.
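To fix ideas, here is a minimal sketch of the statement, in notation chosen for this purpose: write $y = (y_1, \dots, y_n)$ for the observations on $n$ pixels, $x^{(1)}, \dots, x^{(m)}$ for the outputs of the $m$ simulators on the same pixels, and $\bar{x} = m^{-1}\sum_i x^{(i)}$ for the ensemble mean. Then

$$\mathrm{MSE}(\bar{x}) \;=\; \frac{1}{n}\sum_{j=1}^{n}\Bigl(\frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} - y_j\Bigr)^{2} \;\le\; \frac{1}{n}\sum_{j=1}^{n}\frac{1}{m}\sum_{i=1}^{m}\bigl(x_j^{(i)} - y_j\bigr)^{2} \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathrm{MSE}\bigl(x^{(i)}\bigr),$$

where the inequality is Jensen's inequality applied pixel by pixel to the convex function $t \mapsto (t - y_j)^2$.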
This result falls short of being an explanation for the blue rectangles in the MMM column of Fig. 9.7 in Flato et al. (2013) in two respects. First, Fig. 9.7 is drawn for RMSE, not MSE, and second, it is drawn with respect to the median of the RMSEs of the ensemble, not the mean.

The first respect is easily dealt with: RMSE is proportional to the Euclidean norm of the error vector and is therefore itself a convex function of the simulator output, so result 1 holds with RMSE in place of MSE.
Extending the result from the mean to the median is trickier. A histogram of MSEs will typically be strongly positively skewed, and the histogram of the corresponding RMSEs will remain positively skewed. Therefore the median and the mean of the RMSEs will not be similar: typically the median will be lower, and so the fact that the mean of the RMSEs is an upper bound for the RMSE of the ensemble mean does not imply that the median is an upper bound as well. But progress can be made using the result from the next section.
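A small stochastic illustration of both points is sketched below (the ensemble here is synthetic, with made-up biases and noise, and is not drawn from any archive): the ensemble-mean MSE and RMSE never exceed the corresponding ensemble averages, whereas the comparison with the median RMSE is not guaranteed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 12                      # pixels, ensemble members (illustrative sizes)
y = rng.normal(size=n)               # synthetic "observations"
bias = rng.normal(scale=0.5, size=m) # one systematic bias per member (made up)
x = y + bias[:, None] + rng.normal(scale=1.0, size=(m, n))  # member outputs

mse = ((x - y) ** 2).mean(axis=1)    # MSE of each member
rmse = np.sqrt(mse)
xbar = x.mean(axis=0)                # ensemble mean field
mse_bar = ((xbar - y) ** 2).mean()

print(mse_bar <= mse.mean())                 # always True (result 1)
print(np.sqrt(mse_bar) <= rmse.mean())       # always True (RMSE is convex)
print(np.sqrt(mse_bar) <= np.median(rmse))   # not guaranteed, but often True
```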
3. A simple systematic bias model
Flato et al. (2013, p. 767) and others have commented on the "notable feature" that the multimodel mean often outperforms all of the individual simulators.











A candidate explanation is found in weather forecast verification, in which it is sometimes found that a high-resolution simulation has a larger MSE than a lower-resolution simulation when evaluated with high-resolution observations (see, e.g., Mass et al. 2002). The explanation is that if the high-resolution simulation puts a local feature such as a peak in slightly the wrong place (in space or time), then it suffers a “double penalty,” while a lower-resolution simulation, which does not contain the feature at all, suffers only a single penalty. Following similar reasoning, we might argue that the ensemble mean is flatter than any individual member and is thus penalized less if the individual members are putting local features in slightly wrong places. However, this argument is not compelling for the IPCC climate simulations, in which the observations have low resolution and there is already substantial averaging in the individual simulator output.
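A toy numerical illustration of the double penalty (the fields below are invented): a sharp forecast that displaces a single peak by one pixel is penalized twice, once for missing the peak and once for the false alarm, whereas a flat forecast that omits the feature is penalized only once.

```python
import numpy as np

truth = np.array([0., 0., 4., 0., 0.])       # a single peak at pixel 2
sharp = np.array([0., 0., 0., 4., 0.])       # peak displaced by one pixel
flat  = np.array([0.8, 0.8, 0.8, 0.8, 0.8])  # smooth forecast, no peak

mse = lambda x: ((x - truth) ** 2).mean()
print(mse(sharp))  # 6.40 -> double penalty (miss at pixel 2, false alarm at pixel 3)
print(mse(flat))   # 2.56 -> single, smaller penalty despite omitting the feature
```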
I propose a different explanation, in terms of the simulators' "biases." Suppose each simulator has a systematic bias, in addition to its pixel-level errors.
The mathematical challenge is that the rank of the ensemble-mean MSE among the individual MSEs is a complicated function of the realized errors, and it is hard to analyze exactly for a finite number of pixels; hence the asymptotic treatment in which the number of pixels increases without limit.
An asymptotic approach requires a statistical model of the joint relationship between the simulator output and the observations. Any results that are proven on the basis of the model are then likely to hold for actual ensembles that look as though they could have been simulated from the model. Therefore we look to make the model as general as possible; the approach below is to start with a simple model and then to check that the results generalize.
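As a minimal sketch of a model of this type (the particular form and the symbols $\mu_i$, $\sigma$, and $\epsilon_{ij}$ are illustrative assumptions), suppose that at pixel $j$ the output of simulator $i$ satisfies

$$x_j^{(i)} = y_j + \mu_i + \epsilon_{ij}, \qquad \epsilon_{ij} \overset{\mathrm{iid}}{\sim} N(0, \sigma^2),$$

where $\mu_i$ is the systematic bias of simulator $i$ and the $\epsilon_{ij}$ are pixel-level errors.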















The asymptotic theory in the following proof can be found in van der Vaart (1998, chapter 2; hereinafter VDV).
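To indicate the flavor of the argument under the illustrative sketch above (a sketch only, not a substitute for the proof): by the law of large numbers, as the number of pixels $n \to \infty$,

$$\mathrm{MSE}\bigl(x^{(i)}\bigr) \xrightarrow{\text{a.s.}} \mu_i^2 + \sigma^2, \qquad \mathrm{MSE}(\bar{x}) \xrightarrow{\text{a.s.}} \bar{\mu}^{\,2} + \sigma^2/m, \qquad \bar{\mu} = \frac{1}{m}\sum_{i=1}^{m}\mu_i,$$

so that whenever these limiting values are all distinct, the rank of $\mathrm{MSE}(\bar{x})$ among the $\mathrm{MSE}(x^{(i)})$ eventually equals, with probability one, the rank of $\bar{\mu}^{\,2} + \sigma^2/m$ among the $\mu_i^2 + \sigma^2$.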




















The result for ranking under RMSE is identical.
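A few lines of code make the limiting ranking concrete under the illustrative sketch (the bias vectors and error scale below are free inputs chosen for illustration):

```python
import numpy as np

def asymptotic_rank(mu, sigma):
    """Rank of the ensemble-mean limit among the member limits
    (0 = ensemble mean beats every member), under the illustrative
    bias-plus-noise sketch with common pixel-error s.d. sigma."""
    mu = np.asarray(mu, dtype=float)
    m = mu.size
    member_limits = mu**2 + sigma**2          # limiting MSE of each member
    mean_limit = mu.mean()**2 + sigma**2 / m  # limiting MSE of the ensemble mean
    return int((member_limits < mean_limit).sum())

print(asymptotic_rank([0.3, -0.2, 0.5, 0.1], sigma=1.0))  # 0: biases small vs sigma
print(asymptotic_rank([2.0, 1.8, 2.2, 1.9], sigma=0.5))   # large shared bias: rank > 0
```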
Generalizations
The normal distribution for
4. Interpretation
Result 2 shows that offsetting biases across the simulators in the ensemble, leading to a mean bias close to zero, would be sufficient for the ensemble mean to have rank 0 or nearly so; but, as noted in the introduction, offsetting biases would be an inappropriate assumption for the current generation of climate simulators.
Therefore it is interesting that result 2 can provide other sufficient conditions for which the rank is 0 or very small.










The condition in result 3 can be summarized as follows: the simulators' biases are smaller in absolute size than the large pixel errors. If individual simulators are tuned more on their overall bias than on their large pixel errors, then we might expect something similar to this condition to hold.
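Under the illustrative sketch of section 3, one way such a condition can arise is as follows (this particular bound is derived from the sketch, not quoted from result 3): since $|\bar{\mu}| \le \max_i |\mu_i|$, the limiting rank is 0 whenever

$$\max_i |\mu_i| \;\le\; \sigma\sqrt{1 - 1/m},$$

because then $\bar{\mu}^{\,2} + \sigma^2/m \le \sigma^2 \le \mu_i^2 + \sigma^2$ for every $i$: biases that are smaller in absolute size than the pixel error scale guarantee that, in the limit, the ensemble mean does at least as well as every member.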
Result 2 also illustrates when the ensemble mean performs badly. The two situations, good (for the ensemble mean, according to result 3) and bad, are shown in Figs. 1 and 2, in the limit as the number of pixels becomes large.

Fig. 1. A configuration of biases with …

Fig. 2. As in Fig. 1, but with all of the …
There is a reason to distrust the asymptotic result when n is small. The distribution of







The simulation study reveals that the asymptotic approximation is accurate in the "good" configurations of Fig. 1. For all 30 configurations, the asymptotic value for the rank of the ensemble-mean MSE is 0, and the simulated ranks are concentrated at or close to 0.
The outcome for the “bad” configurations (see Fig. 2) is shown in Fig. 3. As anticipated, the distribution of the rank has shifted upward away from 0 for each configuration, and it is clearer that the asymptotic result provides an approximate lower bound on the rank. The median rank for this simulation study is 21 since

Fig. 3. Simulation study for when the configuration of μ does not satisfy …
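The finite-n behavior is easy to probe with a small Monte Carlo sketch along these lines (the configurations below are invented for illustration and are not the study reported above):

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_of_ensemble_mean(mu, sigma, n, reps=1000):
    """Monte Carlo distribution of the rank of MSE(ensemble mean) among the
    member MSEs, under the illustrative bias-plus-noise sketch, for n pixels."""
    mu = np.asarray(mu, dtype=float)
    m = mu.size
    ranks = np.empty(reps, dtype=int)
    for r in range(reps):
        eps = rng.normal(scale=sigma, size=(m, n))
        err = mu[:, None] + eps                 # member errors at each pixel
        mse = (err**2).mean(axis=1)             # member MSEs
        mse_bar = (err.mean(axis=0)**2).mean()  # MSE of the ensemble mean
        ranks[r] = (mse < mse_bar).sum()
    return ranks

# a "good" configuration: biases small relative to the pixel-error s.d.
good = rank_of_ensemble_mean(mu=rng.normal(scale=0.3, size=20), sigma=1.0, n=50)
print(np.bincount(good))   # mass piled up at rank 0

# a "bad" configuration: all biases shifted to one side
bad = rank_of_ensemble_mean(mu=2.0 + rng.normal(scale=0.3, size=20), sigma=1.0, n=50)
print(np.median(bad))      # rank well above 0
```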
Thus, the mathematics and the stochastic simulations show that the systematic bias model provides an explanation for Flato et al.'s (2013) "notable feature" of their Fig. 9.7: perhaps, for many of the variables, the simulators' biases are smaller in absolute size than the large pixel errors. In this case, the notable feature of Fig. 9.7 is not just a mathematical artifact but is telling us something interesting about the current generation of climate simulators.
Finally I would like to end with a caution about how to report and summarize ensemble model experiments. During the process of tuning the parameters of a climate simulator, a research group creates an ensemble of simulator versions with slightly different parameterizations. Result 2 suggests that they may get a lower MSE from the ensemble mean than from their best-tuned simulator—again, we cannot assume offsetting biases in this case. If simulators are judged by the wider community on their MSEs, with more kudos and funding going to those research groups with lower MSEs, then the temptation will be to publicize the output from the ensemble mean rather than the best-tuned simulator. And yet the ensemble mean is “less physical” at the pixel scale since the space of climate states is not convex: linear combinations of valid climate states are not necessarily valid climate states. This makes the ensemble mean less suitable for providing boundary conditions (e.g., for regional downscaling and risk assessment). Therefore research groups might consider how to certify the output they publicize if they do not want to put their simulators in the public domain.
Acknowledgments
I would like to thank Yi Yu for her very helpful comments on my mathematics, Reto Knutti and Ben Sanderson for a discussion that prompted me to investigate the result in section 3, and Ken Mylne for an illuminating conversation about weather forecast verification. Two reviewers made perceptive comments on all aspects of this paper and substantial improvements in its clarity. This research was supported by the EPSRC SuSTaIn Grant EP/D063485/1.
REFERENCES
Annan, J., and J. Hargreaves, 2011: Understanding the CMIP3 multimodel ensemble. J. Climate, 24, 4529–4538, doi:10.1175/2011JCLI3873.1.
Cox, D., and D. Hinkley, 1974: Theoretical Statistics. Chapman and Hall, 528 pp.
Flato, G., and Coauthors, 2013: Evaluation of climate models. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.
Giorgi, F., and L. Mearns, 2002: Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the “reliability ensemble averaging” (REA) method. J. Climate, 15, 1141–1158, doi:10.1175/1520-0442(2002)015<1141:COAURA>2.0.CO;2.
Knutti, R., D. Masson, and A. Gettelman, 2013: Climate model genealogy: Generation CMIP5 and how we got there. Geophys. Res. Lett., 40, 1194–1199, doi:10.1002/grl.50256.
Mass, C., D. Owens, K. Westrick, and B. Colle, 2002: Does increasing horizontal resolution produce more skillful forecasts? Bull. Amer. Meteor. Soc., 83, 407–430, doi:10.1175/1520-0477(2002)083<0407:DIHRPM>2.3.CO;2.
Rougier, J., M. Goldstein, and L. House, 2013: Second-order exchangeability analysis for multi-model ensembles. J. Amer. Stat. Assoc., 108, 852–863, doi:10.1080/01621459.2013.802963.
Stephenson, D., and F. Doblas-Reyes, 2000: Statistical methods for interpreting Monte Carlo ensemble forecasts. Tellus, 52A, 300–322, doi:10.1034/j.1600-0870.2000.d01-5.x.
van der Vaart, A., 1998: Asymptotic Statistics. Cambridge University Press, 462 pp.