## 1. Introduction

The makers of dynamical medium-range forecast ensembles such as the European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al. 1996) and the National Centers for Environmental Prediction (NCEP; Tracton and Kalnay 1993) have to decide how to allocate resources to produce the best forecast. This means making a difficult choice between higher resolution, longer integrations, larger ensemble size, more accurate data assimilation schemes, and so on. One of the difficulties is that what is meant by “best” is also unclear. Each specific application of forecasting would lead to a different definition. Some users are only interested in the next 48 hours and would find short ensembles with high resolution and large ensemble size the most useful. Others are more interested in the next 3 weeks and would prefer longer ensembles at the expense of resolution and ensemble size. The difficult nature of this decision is illustrated by the fact that ECMWF runs 50 members but only for 10 days, whereas NCEP chooses to run 11 members for 16 days.

We contribute to this discussion by considering the trade-off between ensemble size and length of integration for one particular application: the pricing of weather swap contracts. Weather swaps are simple financial contracts with a payout dependent on the weather. A weather index is defined over a period of time, usually a week, a month, or 5 months. If the index turns out high, then the seller of the contract pays the buyer. If the index turns out low, then the buyer pays the seller. All payments are proportional to the difference between the index and a predefined *strike,* which is the level at which no payment occurs in either direction. Pricing of such contracts consists of estimating the appropriate level for the strike. For the strike to be fair to both parties, it can be set at the mean value of the index. Because most contracts are effectively based on average temperature,^{1} this fair strike level is often just the current best estimate for the average temperature for the contract period. For more details on weather derivatives and how they are priced, see Jewson et al. (2002) and Brix et al. (2002).

We will focus our attention on monthly swap contracts based on London Heathrow airport temperatures. These are currently the most commonly traded contracts in Europe. They are traded both well before the start of the contract month, in which case the fair strike value is estimated using historical data, and during the contract month itself, in which case the fair strike is continually reestimated using the latest forecasts. It is this latter case that we are most interested in for the purpose of this study: the prediction of the mean monthly temperature, made at some point during the month. How can this metric be estimated most accurately? Are the current ensemble products appropriately designed to solve this problem? In particular, we ask whether the trade-off between the number of ensemble members and the length of the ensemble is such that forecasts are as accurate as they could be.

One method to answer these questions would be to make a set of forecasts extending out to, say, 30 days, with ensemble size of 100, for 1 yr. Subensembles could then be constructed, and the skill of the subensembles could be mapped as a function of ensemble size and length. For a particular level of computing resources, the optimum ensemble size could then be determined according to whatever measure of value has been chosen: in our case, the accuracy of the predictions of the monthly mean. However, such an approach would be extremely expensive in terms of resources and can only be attempted by the forecasting centers themselves. It is fortunate that some progress toward answering the question of optimal ensemble size and length can be made at a much simpler level by estimating the skill of ensembles from the known skill of individual forecasts. This is the approach that we take. In section 2, we discuss the ways that optimal forecasts of the mean temperature can be derived from dynamical model integrations and derive theoretical expressions for the skill of ensembles as a function of ensemble size and length. In section 3, we present the forecast that will be used to provide the necessary underlying information about the skill of individual model forecasts and calculate the skill of predictions of the monthly mean derived from ensembles of this forecast. In section 4, we consider again how ensemble size and length affect accuracy of prediction of the monthly mean, but this time with the computing resources held fixed so that as ensemble size increases, forecast length decreases. In section 5, we summarize our results.

## 2. Optimal temperature forecasts

We now consider the various ways in which forecasts of the daily and monthly mean temperature can be generated and compare their relative skills. We assume that seasonal cycles in the mean have been removed from both the forecasts and the observed values.

### a. Realistic-variance single forecasts

We first consider a forecast *f*_{s} created from a single integration of a numerical model. The lead-dependent linear anomaly correlation with observations is given by *ρ,* and the lead-dependent linear anomaly correlation between different forecasts in the ensemble is given by *ρ̂.* The variance of the forecast is *σ*^{2}, and we assume that this variance is realistic (i.e., equal to the variance of the observations). This assumption is reasonable because if it were not the case, it would be easy to correct using the statistics of past forecasts. We do not restrict *σ* to being constant in time, and it would be expected to vary seasonally.

The MSE of this forecast is

⟨(*f*_{s} − *o*)^{2}⟩ = 2*σ*^{2}(1 − *ρ*),     (1)

where *o* are the observed values and the angle brackets denote an expectation over many forecasts. Thus we see that when the correlation is high this forecast has a low MSE but that when the correlation is low (*ρ* < 1/2) the MSE is very large. In fact, in the low-correlation case, the MSE is larger even than the MSE of a forecast that consists simply of the climatological mean, which has a constant MSE of *σ*^{2}. Realistic-variance single forecasts are therefore not a practical way to predict the mean temperature.
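These expressions can be checked by direct simulation. The following is a minimal Python sketch, not part of the paper's calculations: the Gaussian setup and the parameter values are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, n = 0.25, 1.0, 200_000  # illustrative values

# Observations o and a forecast f_s that has realistic variance (sigma^2)
# and anomaly correlation rho with the observations.
o = rng.normal(0.0, sigma, n)
f_s = rho * o + np.sqrt(1.0 - rho**2) * rng.normal(0.0, sigma, n)

mse_single = np.mean((f_s - o) ** 2)  # theory: 2*sigma^2*(1 - rho) = 1.5
mse_clim = np.mean(o**2)              # theory: sigma^2 = 1.0

print(mse_single, mse_clim)
```

For *ρ* < 1/2 the simulated forecast MSE indeed exceeds the climatological MSE, confirming that undamped single forecasts are worse than climatology at low correlations.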

### b. Damped single forecasts

A simple way to reduce the MSE of a single track forecast is to *damp* it toward the climatological mean (in our case, toward zero because the mean has been removed) (Leith 1974). This is done by regressing past forecasts onto appropriate observed values using least squares linear regression. The regression coefficient then determines the optimum damping, because least squares regression minimizes MSE by definition. The optimum damping coefficient is just the anomaly correlation *ρ,* and the new forecast is given by *f*_{ds} = *ρf*_{s}.

The variance of this new forecast is *ρ*^{2}*σ*^{2}, the correlation with observations is *ρ* (because linear correlations are not affected by linear transformations), and the MSE is given by

⟨(*f*_{ds} − *o*)^{2}⟩ = *σ*^{2}(1 − *ρ*^{2}).     (2)

For 0 < *ρ* < 1, this is less than the MSE for the undamped single forecast and is also less than the MSE for the trivial case of guessing the climatological mean every time. Damped single forecasts are therefore practical forecasts for the mean temperature if nothing better is available.

We note that this expression for the MSE is a theoretical lower limit only attainable given perfect knowledge of *ρ.* In practice, because *ρ* has to be estimated from a finite amount of past forecast data, the actual MSE will likely be slightly greater.
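The damping procedure amounts to a least squares regression of observations onto past forecasts. A Python sketch with synthetic data (variable names and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, sigma, n = 0.25, 1.0, 200_000  # illustrative values

# Synthetic past forecasts with realistic variance and correlation rho.
o = rng.normal(0.0, sigma, n)
f_s = rho * o + np.sqrt(1.0 - rho**2) * rng.normal(0.0, sigma, n)

# Least squares slope of o on f_s; for a realistic-variance forecast this
# recovers the anomaly correlation rho, which is the optimum damping.
damping = np.dot(f_s, o) / np.dot(f_s, f_s)
f_ds = damping * f_s

mse_damped = np.mean((f_ds - o) ** 2)  # theory: sigma^2*(1 - rho^2) = 0.9375
```

The fitted damping coefficient approximates *ρ,* and the damped MSE falls below both the undamped MSE [2*σ*^{2}(1 − *ρ*) = 1.5 here] and the climatological MSE of 1.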

### c. Ensemble mean forecasts

An alternative method for improving the skill of single forecasts is to form an ensemble of such forecasts from different initial conditions and different models. The mean of such an ensemble is more skillful than undamped single forecasts because the errors in the different members of the ensemble are partly uncorrelated.

We will consider both “perfect” and “imperfect” ensembles. By perfect ensemble we mean that the different ensemble members and reality are all sampled from the same distribution and are, thus, correlated with each other in pairs with the same correlations. In other words, *ρ̂* = *ρ.* As we will see, real ensembles are not close to this idealization. For this reason, we will also consider imperfect ensembles, in which the correlations between members of the ensemble are different from the correlations between the ensemble members and reality. Typical real ensembles suffer from the problem that the ensemble members are too similar, and thus we would expect that *ρ̂* > *ρ.* We will see that this is indeed the case for the NCEP ensemble.

Denoting the individual forecasts by *f*_{i} for *i* = 1, … , *N* and the ensemble mean by *f*_{em}, for a perfect ensemble we have

var(*f*_{em}) = *α*_{N}*σ*^{2},     (3)

where *α*_{N} is a factor that depends on *N* and is given by

*α*_{N} = [1 + (*N* − 1)*ρ*]/*N*.     (4)

For *N* = 1, *α*_{N} = 1, and we recover the case of a realistic-variance single forecast. As *N* → ∞, *α*_{N} → *ρ,* and so for large ensembles the variance of the ensemble mean is roughly proportional to the anomaly correlation. For *ρ* = 1, *α*_{N} = 1 and the ensemble mean variance is equal to the variance of the observations; for *ρ* = 0, *α*_{N} = 1/*N,* which is the residual variance left by averaging together *N* forecasts, none of which has any skill.
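The factor *α*_{N} is simple enough to state directly in code (a sketch; the function name is ours):

```python
def alpha(N: int, rho: float) -> float:
    """Variance factor of an N-member ensemble mean whose members are
    pairwise correlated with coefficient rho: var(f_em) = alpha * sigma^2."""
    return (1.0 + (N - 1) * rho) / N

print(alpha(1, 0.25))        # 1.0: single realistic-variance forecast
print(alpha(10, 0.0))        # 0.1: averaging N skill-free forecasts
print(alpha(10_000, 0.25))   # ~0.25: alpha_N -> rho as N grows
```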

For an imperfect ensemble, the corresponding variance is

var(*f*_{em}) = *α̂*_{N}*σ*^{2},     (5)

where

*α̂*_{N} = [1 + (*N* − 1)*ρ̂*]/*N*.     (6)

Because *ρ̂* > *ρ,* we see that *α̂*_{N} > *α*_{N} and, hence, that the variance of the ensemble mean is greater in the imperfect case. This can be easily understood as being due to the greater correlations between the ensemble members.

The MSE of the ensemble mean in the perfect ensemble case is

⟨(*f*_{em} − *o*)^{2}⟩ = *σ*^{2}(1 + *α*_{N} − 2*ρ*).     (7)

For *N* = 1 we recover the result for a realistic-variance single forecast. As *N* → ∞, the MSE tends to *σ*^{2}(1 − *ρ*). For small values of *ρ* (*ρ* < 1/*N*), we note that the MSE of the ensemble mean is higher than that of the single damped forecast. It is therefore not the case that ensemble means are always more accurate than single forecasts.
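The crossover at *ρ* = 1/*N* can be verified numerically with *σ* = 1 (a sketch using the expressions above; function names are ours):

```python
def alpha(N, rho):
    # Variance factor of the perfect ensemble mean.
    return (1.0 + (N - 1) * rho) / N

def mse_ensemble_mean(N, rho):
    # Perfect-ensemble MSE of the undamped ensemble mean, with sigma = 1.
    return 1.0 + alpha(N, rho) - 2.0 * rho

def mse_damped_single(rho):
    # Damped single forecast MSE, with sigma = 1.
    return 1.0 - rho**2

N = 10
# Below rho = 1/N the damped single forecast wins; above it,
# the undamped ensemble mean wins.
print(mse_ensemble_mean(N, 0.05) > mse_damped_single(0.05))  # True
print(mse_ensemble_mean(N, 0.30) < mse_damped_single(0.30))  # True
```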

Because the covariance of *f*_{em} with *o* is *ρσ*^{2} in both cases, the MSE in the imperfect ensemble case is

⟨(*f*_{em} − *o*)^{2}⟩ = *σ*^{2}(1 + *α̂*_{N} − 2*ρ*).     (8)

The anomaly correlation of the ensemble mean with observations in the perfect ensemble case is given by *ρ*/*α*_{N}^{1/2}. For *ρ* < 1 and *N* > 1, this is a larger anomaly correlation than for either the realistic-variance or damped single forecasts. As *N* → ∞, this anomaly correlation tends toward *ρ*^{1/2}.

In the imperfect ensemble case, the anomaly correlation is given by *ρ*/*α̂*_{N}^{1/2}, which tends to *ρ*/*ρ̂*^{1/2} as *N* → ∞.

### d. Damped ensemble mean forecasts

The observation that the MSE for a single damped forecast can be lower than the MSE for an ensemble mean motivates the idea that we should also damp the ensemble mean. The optimum damping coefficients are given by *ρ*/*α*_{N} and *ρ*/*α̂*_{N} in the perfect and imperfect ensemble cases and give new forecasts with variances of *ρ*^{2}*σ*^{2}/*α*_{N} and *ρ*^{2}*σ*^{2}/*α̂*_{N}. For *ρ* < 1, these variances are lower than the variances of the original ensemble means. In a practical environment in which *ρ* and *ρ̂* are not known exactly, the damping coefficient would be estimated by regressing past ensemble means onto the corresponding observations.

The MSEs of the damped ensemble mean forecasts are

*σ*^{2}(1 − *ρ*^{2}/*α*_{N})     (9)

and

*σ*^{2}(1 − *ρ*^{2}/*α̂*_{N})     (10)

in the perfect and imperfect ensemble cases, respectively. For *ρ* < 1, these MSEs are slightly lower than the MSEs of the corresponding undamped ensemble means. As *N* → ∞, the MSEs for both the damped ensemble mean and the undamped ensemble mean forecasts tend to *σ*^{2}(1 − *ρ*) in the perfect ensemble case.
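The damped and undamped ensemble mean MSEs can be tabulated directly, in the spirit of Fig. 1 (perfect ensemble, *σ* = 1, *ρ* = 0.25; a sketch, not the paper's exact computation):

```python
rho = 0.25  # a deliberately low correlation, as in Fig. 1

def alpha(N, r):
    # Variance factor of the perfect ensemble mean.
    return (1.0 + (N - 1) * r) / N

for N in (1, 5, 11, 50):
    a = alpha(N, rho)
    mse_em = 1.0 + a - 2.0 * rho   # undamped ensemble mean MSE
    mse_dem = 1.0 - rho**2 / a     # damped ensemble mean MSE
    print(N, round(mse_em, 4), round(mse_dem, 4))
```

The damped MSE is never larger than the undamped one [their difference is (*α*_{N} − *ρ*)^{2}*σ*^{2}/*α*_{N} ≥ 0], and both converge to 1 − *ρ* = 0.75 as *N* grows, which is why damping offers little extra benefit for large ensembles.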

Figure 1 shows the MSE for each of the forecasts described above, along with the MSE for the simple forecast consisting of just the climatological mean value, for a correlation of *ρ* = 0.25 and as a function of ensemble size. We choose a low correlation to emphasize the differences among the various forecasts. The graph shows that the undamped ensemble mean performs very poorly for small ensembles. For ensembles of size 11 (like the NCEP ensemble), the damped ensemble mean shows definite improvement over the undamped mean. For ensembles of size greater than 20 (like the ECMWF ensemble), the difference is very small and damping offers little extra benefit.

### e. Forecasts of the monthly mean temperature

We now consider the skill of forecasts for the monthly mean temperature, based on forecasts for the daily mean. We imagine that we are making the forecast on the first day of a 30-day month and that skillful seasonal forecasts are not available. The best forecast for the monthly mean is given by the sum of the best daily mean forecasts. The most accurate of the daily mean forecasts we have presented above is the damped ensemble mean, and so we will use that forecast for deriving the prediction of the monthly mean.

Let *e* be the monthly mean error and *e*_{i} be the daily mean error; then the monthly mean MSE when using an *M*-day forecast (where *M* < 30) is given by

⟨*e*^{2}⟩ = (1/30^{2}) Σ_{i=1}^{30} Σ_{j=1}^{30} *ρ*_{ij}*σ*_{i}*σ*_{j},     (11)

where *σ*_{i} = *σ*_{i}^{fc} for days covered by the forecast (*i* ≤ *M*) and *σ*_{i} = *σ*_{i}^{clim} for the remaining days of the month (*i* > *M*).

The *σ*^{fc} values are simply the square roots of the MSE values derived in Eqs. (9) and (10). The *σ*^{clim} values are climatological standard deviations and can be estimated from historical data. We will make the assumption that there are no errors in this estimation. The *ρ*_{ij} are the correlations between temperatures on different days. In principle, we could try to derive some of the *ρ*_{ij} from an ensemble forecast. In practice, it is not clear whether the predicted values would be any more accurate than climatological values. For this reason, and in order to keep the analysis simple, we will use climatological values for all of the *ρ*_{ij}.
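Equation (11) translates directly into code. A sketch (the function name and the identity correlation matrix in the example are our own illustration):

```python
import numpy as np

def monthly_mse(M, sigma_fc, sigma_clim, corr, days=30):
    """MSE of the monthly mean: forecast error std devs for days 1..M,
    climatological std devs for the remaining days, combined with the
    day-to-day error correlations corr[i, j] as in Eq. (11)."""
    s = np.where(np.arange(days) < M, sigma_fc, sigma_clim)
    return float(s @ corr @ s) / days**2

# Example with uncorrelated days (corr = identity): a 10-day forecast whose
# daily errors have std dev 0.5, vs a climatological std dev of 1.0.
corr = np.eye(30)
print(monthly_mse(0, 0.5, 1.0, corr))   # all climatology: 30/900 = 1/30
print(monthly_mse(10, 0.5, 1.0, corr))  # (10*0.25 + 20)/900 = 0.025
```

In practice *σ*^{fc} varies with lead time and corr would hold the climatological day-to-day correlations, so arrays would replace the scalar arguments here.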

We will now use Eq. (11) to analyze how the skill of the forecast of the monthly mean temperature depends on ensemble size and length of forecast, for given values of *ρ* and *ρ̂.* First, however, *ρ* and *ρ̂* must be estimated from past forecasts and observations.

## 3. Forecast skill results

### a. The NCEP ensemble forecast

To evaluate Eq. (11), we need to estimate *ρ* and *ρ̂,* from which we can calculate *σ*^{fc}. These are clearly model dependent. It would seem that, out to 10 days, the highest values of *ρ* probably come from the ECMWF model because it has the highest resolution of the various operational forecasting models. However, because the ECMWF integrations stop at 10 days, we cannot use them to ask the question of whether it might be better to extend forecasts *beyond* 10 days. The NCEP model, on the other hand, is integrated out to 16 days (although only the first 15 days were available to us). For this reason, we base our estimates of *ρ* and *ρ̂* on the NCEP ensemble. Higher-resolution models would presumably give higher values of *ρ,* but such forecasts are not available to the authors.

### b. Anomaly correlations

We estimate *ρ* and *ρ̂* empirically using 1 yr of past NCEP forecasts for London Heathrow and the corresponding observations; Fig. 2 shows the resulting values of *ρ* and *ρ̂* as a function of lead time. The most striking feature of *ρ* (the lower curve) is that there is clear skill right out to the end of the forecast, and it looks plausible that the forecast would still have skill beyond 15 days if it were extended. Here, *ρ̂* (the upper curve) is greater than *ρ* at all lead times, with a fairly constant difference of around 0.2. The difference between *ρ* and *ρ̂* shows that the ensemble members are considerably more similar to one another than they are to the observations; the ensemble is far from perfect. These estimated values of *ρ* and *ρ̂* are used in all of the calculations that follow.

### c. Monthly MSE versus ensemble size

From Eq. (11) we see that the MSE of forecasts of the monthly mean depends on both the length of the forecast *M* and the size of the ensemble *N* through *σ*^{fc}. Given the *ρ* and *ρ̂* curves estimated above, we can now evaluate this MSE; for simplicity, we set *σ* = 1. In Fig. 3, we show how the monthly MSE varies with ensemble size for a fixed length of forecast (taken to be 11 days). The lower curve is for a perfect ensemble, and the upper curve is for an imperfect ensemble. For the imperfect ensemble, we see that there is an initial rapid reduction in MSE up to an ensemble size of about 10 but that beyond 10 members increasing ensemble size only has a small effect on the MSE. We also note that there is a dramatic difference between the perfect and imperfect ensembles. A perfect ensemble of size 3 performs better than an imperfect ensemble of size 50.

### d. Monthly MSE versus length of the forecast

From the slow decay of the anomaly correlations shown in Fig. 2, it is clear that using a longer forecast will reduce the MSE on the predicted monthly mean, right up to 15 days and perhaps beyond. However, the marginal benefit of adding one more day will presumably decrease because of the decreasing correlations. Figure 4 shows the effect of forecast length on MSE for a fixed ensemble size of 50 members. The lower curve is for a perfect ensemble, and the upper curve is for an imperfect ensemble. We see that the MSE reduces rapidly as the length of the forecast increases. The rate of reduction slows slightly as we approach 15 days, but it would certainly seem likely that using a forecast longer than 15 days would give a further significant reduction in MSE. We note that a perfect ensemble of length 9 days performs as well as an imperfect ensemble of length 15 days.

## 4. Optimal forecasts for a fixed resource

An important question for the makers of ensemble forecasts is how best to allocate resources. Clearly longer forecasts and larger ensembles are better, but if one has to choose between longer or larger, which gives the greater benefit? As was mentioned in the introduction, the answer to this question depends to a great extent on what one is trying to predict. Extreme wind storms cannot be predicted by long integrations of low-resolution models, for instance.

We address this question using Eq. (11) and consider the accurate prediction of monthly mean temperatures as the goal. Many practical uses of weather forecasts depend on aspects of the forecast other than just the prediction of the mean. In these cases, it is often necessary to introduce utility theory or cost–loss analysis to ascertain the extent to which the forecast is useful. For weather swap pricing, however, only the mean temperature is important because of the simple linear payoff structure.

We introduce the constraint of fixed computing resources by fixing the number of total forecast days, defined as the length of the forecast multiplied by the ensemble size. For ECMWF, the total number of forecast days in current use is 500 (50 ensemble members for 10 days); for our version of the NCEP forecast, the total number of forecast days is 165 (11 ensemble members for 15 days). We calculate the monthly mean MSE as a function of the length of the forecast but adjust the ensemble size to keep the total number of forecast days fixed. The longer forecasts will therefore use smaller ensembles, and vice versa. The upper panel of Fig. 5 shows the MSE versus forecast length for a resource level of 500 forecast days, corresponding to the ECMWF prediction system. The lower curve is for a perfect ensemble, and the upper curve is for an imperfect ensemble. The lower panel of Fig. 5 shows the corresponding number of ensemble members. Of interest, we see that the skill of the forecast of the monthly mean increases monotonically with length of forecast, even though the number of ensemble members is decreasing. The benefit of longer forecasts more than outweighs the disadvantage of having fewer members. This result implies that, of all the ensemble-size–forecast-length configurations tested, the most accurate would be forecasts of length 15 days with 33 members in the ensemble. The fact that the MSE does not appear to reach a minimum in the imperfect ensemble case suggests that longer forecasts with fewer members might be even better.
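The fixed-resource experiment can be sketched end to end. Everything below is illustrative: the exponential decay for *ρ,* the offset *ρ̂* = *ρ* + 0.2 (motivated by Fig. 2), and the AR(1)-style day-to-day correlations are stand-ins for the empirical curves, not the paper's data.

```python
import numpy as np

DAYS = 30

def alpha(N, r):
    # Variance factor of an N-member ensemble mean with pairwise correlation r.
    return (1.0 + (N - 1) * r) / N

def monthly_mse(M, N):
    leads = np.arange(1, DAYS + 1)
    # Assumed skill curves: slow decay for rho, rho_hat roughly 0.2 higher.
    rho = 0.8 * np.exp(-leads / 15.0)
    rho_hat = np.minimum(rho + 0.2, 0.99)
    # Damped-ensemble-mean error std dev for forecast days, climatological
    # std dev (sigma = 1) for the rest of the month.
    s = np.where(leads <= M, np.sqrt(1.0 - rho**2 / alpha(N, rho_hat)), 1.0)
    corr = 0.6 ** np.abs(np.subtract.outer(leads, leads))  # assumed rho_ij
    return float(s @ corr @ s) / DAYS**2

RESOURCE = 500  # total forecast days, as for the ECMWF system
mse = {M: monthly_mse(M, max(RESOURCE // M, 1)) for M in range(1, 16)}
best_M = min(mse, key=mse.get)
print(best_M, RESOURCE // best_M)
```

With these assumed curves the monthly MSE keeps falling as the forecast lengthens, even though the ensemble shrinks, echoing the behavior seen in Fig. 5.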

The upper panel of Fig. 6 shows the MSE for a resource level of 165 forecast days, and the lower panel shows the corresponding ensemble size. Given the lower level of resources, only smaller ensembles are possible, and one might expect that the optimum length of forecast might be reached before 15 days. Again, however, the optimum of all the combinations tested is at a forecast length of 15 days, implying once more that longer forecasts with fewer members could possibly be even better.

## 5. Summary

We have discussed the question of how ensemble forecast systems should be designed in order to maximize the benefit to the users of the forecasts. We have focused in particular on the problem of pricing weather swap contracts, for which an accurate estimate of the mean monthly temperature is critical, and have investigated the optimum trade-off between size of ensemble and length of forecast for a given number of total forecast days. We have derived expressions for the skill of the damped ensemble mean forecast for both perfect and imperfect ensembles and have used that to derive expressions for the skill of predictions of monthly means.

The results suggest that long ensembles are more useful than wide ones for this purpose: the damped ensemble mean of a 15-day forecast with 11 ensemble members is significantly more skillful than that of a 10-day forecast with 50 members, for instance. To the extent that our model for monthly mean MSE is realistic, and to the extent that predicting accurate monthly means is the goal of producing ensemble forecasts, we have shown that ECMWF could produce more useful forecasts with the same level of computer resources by reducing the number of ensemble members and increasing the length of their forecasts, to at least 15 days and possibly beyond. As an alternative, they could make forecasts that are equally useful but need much less computer power. Even NCEP, whose forecast system is much closer to an optimum ensemble design for this application (perhaps not by accident), would possibly produce more useful forecasts if it sacrificed a little ensemble size for longer integrations.

The underlying single model anomaly correlation score *ρ* that we have used for our analysis is not likely to be as skillful as can be achieved. This is for two reasons: first, the downscaling system used to produce the forecast was not particularly state of the art and, second, the NCEP model itself is not the highest-resolution forecast model available. Increasing *ρ* would tend to reinforce our conclusions with respect to the need for longer, rather than wider, ensembles.

Last, we reiterate that the method we have used is only one possible way to measure the value of ensembles. Evaluation by other criteria would lead to different results. For the purposes of predicting extreme events, for instance, short and wide ensembles are likely to be more useful than long, thin ones. However, we do believe that predicting the mean temperature as far as possible in advance and as well as possible is an important goal that is likely relevant not just for the pricing of weather swaps but also for many other applications of meteorological forecasting.

## REFERENCES

Brix, A., S. Jewson, and C. Ziehmann, 2002: Weather derivative modelling and valuation: A statistical perspective. *Climate Risk and the Weather Market,* R. Dischel, Ed., Risk Books, 127–150.

Jewson, S., A. Brix, and C. Ziehmann, 2002: Use of meteorological forecasts in weather derivative pricing. *Climate Risk and the Weather Market,* R. Dischel, Ed., Risk Books, 169–184.

Leith, C., 1974: Theoretical skill of Monte Carlo forecasts. *Mon. Wea. Rev.,* **102,** 409–418.

Molteni, F., R. Buizza, T. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. *Quart. J. Roy. Meteor. Soc.,* **122,** 73–119.

Tracton, M., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center. *Wea. Forecasting,* **8,** 379–398.

Fig. 2. The empirical anomaly correlation for a single NCEP forecast for London Heathrow airport (lower curve), calculated using 1 yr of forecasts, and the empirical anomaly correlation between single NCEP forecasts (upper curve) taken from the same ensemble.

Citation: Weather and Forecasting 18, 4; 10.1175/1520-0434(2003)018<0675:WSPATO>2.0.CO;2

Fig. 3. The MSE of a damped ensemble mean prediction of the monthly mean temperature made on the first of the month vs ensemble size. The lower curve corresponds to a perfect ensemble, and the upper curve corresponds to an imperfect ensemble.

Fig. 4. As in Fig. 3, but vs ensemble length.

Fig. 5. (top) As in Fig. 4, but for a fixed amount of computing CPU time, expressed as a fixed number of forecast days. Thus the longer the forecast is, the smaller the ensemble must be. In this case, 500 forecast days were used, to correspond to the computing resources used to produce the ECMWF ensemble. (bottom) Ensemble size vs forecast length.

Fig. 6. As in Fig. 5, but for 165 forecast days, corresponding to the computing resources used to generate the NCEP ensemble.

^{1} We say *effectively* because in fact cumulative temperatures or degree days are used. For many locations these are equivalent to average temperature, however.