## 1. Introduction

At present, ensemble forecasting is an important product delivered by a number of operational meteorological centers around the globe. By perturbing the initial state of a high-resolution deterministic forecast, a number of ensemble members is integrated, usually at a lower resolution. From the spread in such ensembles, an estimate of the quality of the high-resolution deterministic forecast is made, and in the medium range, clusters of members are a guide to the likelihood of alternative weather scenarios. Examples of meteorological centers that produce ensemble systems on an operational basis are the National Centers for Environmental Prediction (NCEP; Toth and Kalnay 1993; Tracton and Kalnay 1993), the Canadian Meteorological Centre (CMC; Houtekamer et al. 1996), and the European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al. 1996; Buizza et al. 1998).

An important aspect of a forecasting system in general, and for an ensemble system in particular, is the assessment of performance. Because of its probabilistic nature, verification for ensemble systems is more complex than it is for deterministic forecasts. Popular tools that measure the performance of probabilistic forecasts are rank histograms, reliability diagrams, Brier scores, receiver operating characteristic (ROC) curves, and cost–loss analyses [a detailed description may be found in Strauss and Lanzinger (1996), Wilks (1995), and Richardson (2000)]. For all of these tools, ensemble forecasts are compared to the verifying weather pattern. This truth is measured either directly by using observations, or by using verifying analysis fields. The errors in these quantities are usually neglected. The argument is that they are small compared to the errors in the forecasts and therefore will have a negligible effect on the results. However, in the short range, where forecast errors are still small, this argument may not be justified. For instance, at ECMWF, the accuracy of the 1-day forecast in the geopotential height at 500 hPa is comparable to the observation error of radiosondes (Simmons and Hollingsworth 2002).

In this paper, the impact of observation errors at verification time on verification tools is studied. The focus will be on rank histograms and reliability diagrams, that is, tools that measure the statistical aspects of ensemble forecasts. One way of doing this is to transform the probability density function (PDF) for the truth to a PDF for the verifying observations or analysis. This means that the PDF created by an ensemble system is to be convolved with the verifying observation error before it is presented to the verification tool under consideration. In practice, this can be accomplished by adding normal noise to the ensemble members, with a standard deviation given by these observation errors. The same approach has been suggested by Anderson (1996), who also points out the importance of taking into account observation errors when validating ensemble forecasts. A theoretical basis for this method is presented in section 2. In addition, a general expression is derived for the frequency of outliers (in cases when the verifying observation is found to be outside the ensemble). Special attention is given to the influence of nonnormality of the underlying distributions.

Systematic model errors or biases can obscure the results of verification or make them difficult to interpret. To isolate the sensitivity of verification tools to observation errors, a perfect model approach may be used. In such a study, the initial state of one of the ensemble members is defined as the truth; because the model is assumed to be perfect, the trajectory starting from this state then represents the true evolution. Observations may be created by perturbing the truth according to their errors. This approach is followed in section 3 by using the three-variable Lorenz-63 model (Lorenz 1963) as a simple version of a nonlinear dynamical system.

In section 4 the effect of the incorporation of verifying observation errors is considered for the ensemble prediction system (EPS) that is operational at ECMWF. Both the verification on the basis of observations (ocean wave heights are taken as an example) and on the basis of verifying analyses (geopotential at 500 hPa) are considered. It is shown that in the short range rank histograms are quite sensitive to observation errors and that there is even a nonnegligible effect in the medium range. The effect on reliability diagrams is found to be limited.

The paper ends with some concluding remarks.

## 2. Analytical considerations

### a. General

In the case of a perfect ensemble, each ensemble member is an independent sample from the PDF of which the truth is also a realization. Therefore, for a parameter *x,* the collection of the *N* ensemble members *x*_{1}, … , *x*_{N} plus the verifying truth *x*_{T}, forms a set of *N* + 1 quantities with identical statistical properties. The probability that one given quantity satisfies a certain condition (e.g., being the smallest, second smallest, largest) must be the same for all, that is, 1/(*N* + 1). In particular, the probability that the *i*th smallest quantity is the verifying truth, that is, that the verifying truth is in between the ordered ensemble members *i* − 1 and *i,* is 1/(*N* + 1). For large samples, this will give rise to a flat rank histogram (see, e.g., Hamill 2001; Talagrand and Vautard 1997). This is a general result that is true for any PDF for the truth (normal or nonnormal, skewed or nonskewed).
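The flatness argument can be checked numerically. The following Python sketch is not part of the original study; the gamma-distributed truth, the sample size, and the helper name `rank_histogram` are illustrative assumptions. It draws the truth and *N* = 10 members from the same skewed PDF and counts how often the truth takes each of the *N* + 1 ranks.

```python
import numpy as np

def rank_histogram(ensemble, verification):
    """Relative frequency of each rank of the verification within the ensemble.

    ensemble: array (n_cases, N); verification: array (n_cases,).
    Rank k (0..N) is the number of members below the verifying value.
    """
    n_cases, n_members = ensemble.shape
    ranks = np.sum(ensemble < verification[:, None], axis=1)
    counts = np.bincount(ranks, minlength=n_members + 1)
    return counts / n_cases

rng = np.random.default_rng(0)
n_cases, n_members = 200_000, 10

# Perfect ensemble: members and truth are independent draws from one skewed PDF
# (a gamma distribution, chosen only for illustration).
ensemble = rng.gamma(2.0, size=(n_cases, n_members))
truth = rng.gamma(2.0, size=n_cases)

freq = rank_histogram(ensemble, truth)
print(np.round(freq, 3))
```

Each of the 11 relative frequencies should come out close to 1/11 ≈ 0.091, irrespective of the skewness of the sampled PDF.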

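Now consider the case in which the PDFs of the ensemble members and of the verifying observation are not necessarily the same, and let them be denoted by *ρ*_{E} and *ρ*_{O}, respectively. Furthermore, in general let the standard deviation of a PDF *ρ*_{α} be denoted by *σ*_{α} and its cumulative distribution function by

$$P_\alpha(x) = \int_{-\infty}^{x} \rho_\alpha(y)\,dy.$$

The condition that the verifying observation is an upper outlier can be expressed as

$$\prod_{i=1}^{N} H(x_O - x_i), \tag{1}$$

where *H* is the Heaviside function, being 0 for *x* < 0 and 1 for *x* ≥ 0. Its expectation value, denoted by FR_{u}, can be found by weighting Eq. (1) over the joint PDF for (*x*_{O}, *x*_{1}, … , *x*_{N}):

$$\mathrm{FR}_u = \int dx_O\,\rho_O(x_O)\prod_{i=1}^{N}\int dx_i\,\rho_E(x_i)\,H(x_O - x_i) = \int_{-\infty}^{\infty}\rho_O(x)\,P_E^{N}(x)\,dx. \tag{2}$$

Likewise, the frequency of lower outliers FR_{l} is given by Eq. (2), if *P* is replaced by *Q* = 1 − *P.* In the special case that *ρ*_{O} = *ρ*_{E}, for example, when both observation and ensemble members are realizations from the PDF for the truth *ρ*_{T}, this integral can be evaluated exactly:

$$\mathrm{FR}_u = \int_{-\infty}^{\infty}\rho_T(x)\,P_T^{N}(x)\,dx = \left[\frac{P_T^{N+1}(x)}{N+1}\right]_{-\infty}^{\infty} = \frac{1}{N+1}, \tag{3}$$

and similarly for FR_{l}. When observation errors are to be taken into account (though assumed to be uncorrelated with the PDF for the truth), *ρ*_{O} will be the convolution between *ρ*_{T} and the PDF for the observation error *ρ*_{o}:

$$\rho_O(x) = (\rho_o * \rho_T)(x) = \int_{-\infty}^{\infty}\rho_o(x - y)\,\rho_T(y)\,dy. \tag{4}$$

As a result, the frequency of outliers will in general exceed 1/(*N* + 1), even for a perfect ensemble with *ρ*_{E} = *ρ*_{T}. This effect can be neutralized by artificially perturbing the ensemble members with respect to a PDF *ρ*_{e} with the same standard deviation as the observation error (*σ*_{e} = *σ*_{o}). In this case *ρ*_{E} → *ρ*_{e} ∗ *ρ*_{T}, which will then be equal to *ρ*_{O} again, and the frequency of outliers will be restored to 1/(*N* + 1). A deviation from this frequency can then be attributed to incorrect statistical properties of the ensemble system.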
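The effect of observation error on the frequency of outliers, and its removal by perturbing the ensemble members, can be illustrated with a small Monte Carlo experiment. The sketch below is an illustration only: the normal distributions, the values *σ*_{T} = 1 and *σ*_{o} = 0.5, the sample size, and the helper name `outlier_frequency` are assumptions made for this example and are not taken from the experiments in this paper.

```python
import numpy as np

def outlier_frequency(ensemble, verification):
    """Fraction of cases in which the verification lies outside the ensemble."""
    above = verification > ensemble.max(axis=1)
    below = verification < ensemble.min(axis=1)
    return np.mean(above | below)

rng = np.random.default_rng(1)
n_cases, n_members = 50_000, 50
sigma_t, sigma_o = 1.0, 0.5          # illustrative values only

# Perfect ensemble: truth and members share the same PDF.
truth = rng.normal(0.0, sigma_t, n_cases)
ensemble = rng.normal(0.0, sigma_t, (n_cases, n_members))

# Verifying observation = truth + observation error.
obs = truth + rng.normal(0.0, sigma_o, n_cases)

# Neutralize the effect: add noise with sigma_e = sigma_o to every member.
perturbed = ensemble + rng.normal(0.0, sigma_o, ensemble.shape)

expected = 2.0 / (n_members + 1)                  # upper plus lower outliers
fr_truth = outlier_frequency(ensemble, truth)     # verified against the truth
fr_obs = outlier_frequency(ensemble, obs)         # observation error ignored
fr_fixed = outlier_frequency(perturbed, obs)      # observation error accounted for
print(expected, fr_truth, fr_obs, fr_fixed)
```

In this setting `fr_obs` is clearly larger than 2/(*N* + 1), while `fr_truth` and `fr_fixed` both stay close to it.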

### b. Normal distributions

Before going into more detail on how *ρ*_{E} and *ρ*_{O} are related to an underlying truth, a simple situation in which *ρ*_{O} and *ρ*_{E} are given is considered. In Fig. 1, FR_{u} is shown for the case where the PDFs for both observations and ensemble members are normally distributed and unbiased, that is, *ρ*_{O} = *N*(0, *σ*_{O}) and *ρ*_{E} = *N*(0, *σ*_{E}). The effect of a relative bias on rank histograms is not considered [examples may be found in Hamill (2001)]. From a variable transformation *x* → *x*/*σ*_{E}, and the fact that [*N*(0, *σ*)](*x*) = (1/*σ*)[*N*(0, 1)](*x*/*σ*), it follows directly that FR_{u} is a function of the ratio (*σ*_{O}/*σ*_{E}). From Fig. 1a it is seen that increasing ensemble size always decreases the fraction of outliers. This also shows, together with Fig. 1b, that the sensitivity with respect to a deviation of *σ*_{O} from *σ*_{E} rapidly increases with ensemble size *N.* For a deterministic forecast (*N* = 1) there are by definition only outliers; in the unbiased case 50% upper and 50% lower, independent of *σ*_{O}. For large ensembles, the frequency of outliers rapidly decreases as *σ*_{O} becomes smaller. In the extreme case that *σ*_{O} = 0, that is, when the true state is perfectly predictable and observed without error, *ρ*_{O} will become a Dirac delta function and FR_{u} = *P*^{N}(0) = 2^{−N}. In the opposite extreme, when the ensemble spread is so small that it resembles a deterministic forecast, the value of 50% is recovered. The effect of observation errors thus becomes increasingly important for larger ensemble size: relative to the expected frequency, a misfit of 20% between *σ*_{O} and *σ*_{E} increases the number of outliers by 37% for *N* = 10, doubles it for *N* = 50, and almost quadruples it for *N* = 500.
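For normal PDFs, the frequency of upper outliers is the integral of *ρ*_{O} times the *N*th power of the cumulative distribution of *ρ*_{E}, and it can be evaluated by simple quadrature. The sketch below is one possible way to do this; the function name `upper_outlier_fraction`, the integration grid, and the reading of a "20% misfit" as *σ*_{O}/*σ*_{E} = 1.2 are assumptions of this example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def normal_cdf(z):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def upper_outlier_fraction(ratio, n_members, half_width=12.0, n_points=24001):
    """FR_u for rho_O = N(0, sigma_O) and rho_E = N(0, sigma_E).

    ratio = sigma_O / sigma_E. The integration variable z is the
    observation in units of sigma_O, so that P_E(x) = Phi(ratio * z).
    """
    z, dz = np.linspace(-half_width, half_width, n_points, retstep=True)
    pdf_obs = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf_ens = normal_cdf(ratio * z)
    # Plain Riemann sum; the integrand is smooth and vanishes at both ends.
    return float(np.sum(pdf_obs * cdf_ens**n_members) * dz)

for n in (10, 50, 500):
    matched = upper_outlier_fraction(1.0, n)       # should equal 1 / (N + 1)
    misfit = upper_outlier_fraction(1.2, n)        # sigma_O 20% too large
    print(n, matched * (n + 1), misfit / matched)
```

The second printed column should be 1 for every *N*, and the third column gives the relative increase in upper outliers, which can be compared with the values quoted in the text.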

### c. Small deviations from the perfect situation

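Now consider how *ρ*_{O} and *ρ*_{E} relate to the PDF of the underlying truth, *ρ*_{T}. If the observation error does not depend on the value of the verifying truth, *ρ*_{O} is given by Eq. (4). If, in addition, ensemble members are (by hand) perturbed according to a PDF *ρ*_{e}, one has *ρ*_{E} → *ρ*_{e} ∗ *ρ*_{E}. In case all distributions are normal, it follows that

$$\mathrm{FR}_u = F_N\!\left(\sqrt{\frac{\sigma_T^2 + \sigma_o^2}{\sigma_E^2 + \sigma_e^2}}\right), \tag{5}$$

where *F*_{N} is the function displayed in Figs. 1a and 1b.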

If the deviations of *ρ*_{E} and *ρ*_{O} from *ρ*_{T} are small, the following expansion can be made:

$$\mathrm{FR}_u \simeq \frac{1}{N+1}\left[1 + \alpha_N\,\frac{\sigma_T - \sigma_E}{\sigma_T} + \beta_N\,\frac{\sigma_o^2 - \sigma_e^2}{2\sigma_T^2}\right], \tag{6}$$

where

$$\alpha_N = -(N+1)\int_{-\infty}^{\infty}\frac{d\,[x\rho(x)]}{dx}\,P^{N}(x)\,dx, \tag{7}$$

$$\beta_N = (N+1)\int_{-\infty}^{\infty}\frac{d^2\rho(x)}{dx^2}\,P^{N}(x)\,dx, \tag{8}$$

with *P* the cumulative distribution function of *ρ,* and *ρ*_{T}(*x*) = *ρ*(*x*/*σ*_{T})/*σ*_{T}. Here it is assumed that the deviation of *ρ*_{E} from *ρ*_{T} is represented by a misfit in spread only, that is, the shape of both distributions is the same. The corresponding term in Eq. (6) then follows from a variable change *x* → (*σ*_{T}/*σ*_{E})*x* and the Taylor expansion of (1 + *δ*)*ρ*[(1 + *δ*)*x*] in *δ.* The term that accounts for observation error can be found by inserting expression (4) into Eq. (2) and expanding with respect to ε. The term that reflects the noise added to ensemble members is found by realizing that for *σ*_{e} = *σ*_{o} and *σ*_{E} = *σ*_{T} it must counterbalance the effect of observation error exactly. It should be stressed that Eq. (6) is only valid if the individual contributions are small and of similar magnitude. Otherwise higher-order terms including interactions between the various error sources must be included.

From Eq. (6) it follows that FR_{u} depends to lowest order linearly on a misfit in ensemble spread and quadratically on *σ*_{o} and *σ*_{e}. The sensitivities *α*_{N} and *β*_{N} do not depend on the details of the observation error or the noise applied to the ensemble members. They are determined by the PDF of the underlying truth. If *ρ*_{T} is normal, *α*_{N} and *β*_{N} are equal. Numerical results are displayed by the solid line in Fig. 1c. It clearly confirms the growing sensitivity toward larger ensemble size as was observed in Figs. 1a and 1b. This property can be understood by realizing that for large *N,* *P*^{N} becomes a sharp transition function. In the extreme,

$$N\rho(x)\,P^{N}(x) \to \delta(x - x_N), \tag{9}$$

where *δ* denotes the Dirac delta function. The location *x*_{N}, and therefore the result of Eqs. (2), (7), and (8), depends on the details of the tail of the PDF for the truth. For a normal distribution, *x*_{N} increases only slowly with *N*: for *N* = 50, *x*_{N} ≈ 2.2*σ*_{T}; for *N* = 500, *x*_{N} ≈ 3.0*σ*_{T}.
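The statement that *P*^{N} acts as a sharp transition function for large *N* can be made concrete for a normal PDF. The sketch below normalizes the weight *ρ*(*x*)*P*^{N}(*x*) to unit area and computes its mean and width, in units of *σ*_{T}. Identifying the mean of this weight with *x*_{N} is an assumption of this example, as are the function name `kernel_moments` and the integration grid.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def kernel_moments(n_members, half_width=12.0, n_points=24001):
    """Mean and standard deviation of the normalized weight rho(x) P^N(x)
    for a standard normal rho (x in units of sigma_T)."""
    x, dx = np.linspace(-half_width, half_width, n_points, retstep=True)
    pdf = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))
    weight = pdf * cdf**n_members
    weight /= np.sum(weight) * dx              # normalize to unit area
    mean = np.sum(x * weight) * dx
    var = np.sum((x - mean) ** 2 * weight) * dx
    return float(mean), float(math.sqrt(var))

m50, s50 = kernel_moments(50)
m500, s500 = kernel_moments(500)
print(m50, s50)
print(m500, s500)
```

The weight moves into the tail and narrows as *N* grows, so for large ensembles the integrals that determine the frequency of outliers probe the PDF of the truth only in the neighborhood of this location.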

### d. The effect of a nonnormal tail

Although a normal distribution may be a reasonable approximation for *ρ*_{T} within a few standard deviations of the mean, its validity in the tail may be disputable. As an illustration of the dependency of the frequency of outliers on the details of the tail of *ρ*_{T}, the Laplace distribution will be considered here, written for a standard deviation *σ*_{T}:

$$\rho_T(x) = \frac{1}{\sqrt{2}\,\sigma_T}\exp\!\left(-\frac{\sqrt{2}\,|x|}{\sigma_T}\right). \tag{11}$$

Although its cusp at *x* = 0 is in practice not realistic, its tail, going much more slowly to 0 than a normal distribution, might be. The sensitivity parameters *α*_{N} and *β*_{N} can be evaluated exactly for this case. The resulting expression for *α*_{N} involves Euler's constant, *γ* = 0.57721⋯, and its approximation is within 0.1% for *N* > 5. For the Laplace distribution, *α*_{N} and *β*_{N} differ. Their dependency on *N* is illustrated in Fig. 1c. It shows that *β*_{N} quickly saturates. Therefore, in contrast to the normal case [see Eq. (6)], the influence of observation errors on FR_{u} does not grow for *N* > 10. The sensitivity of FR_{u} to incorrect ensemble spread does continue to grow with *N,* although less rapidly than for the normal case. As a consequence, for large *N,* misfits in ensemble spread will have a larger effect on FR_{u} than unaccounted observation errors have.
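The different sensitivity of normal and Laplace tails to a misfit in ensemble spread can be explored by direct quadrature of the frequency of upper outliers. The sketch below is illustrative only: both PDFs are scaled to unit standard deviation, the ensemble is assumed to have the same shape as the truth with a spread that is 5% too small, and the function names and grid are choices made for this example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)
ROOT2 = math.sqrt(2.0)

def normal_pdf(x):
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + _erf(x / ROOT2))

def laplace_pdf(x):
    """Laplace PDF with unit standard deviation."""
    return np.exp(-ROOT2 * np.abs(x)) / ROOT2

def laplace_cdf(x):
    return np.where(x < 0.0, 0.5 * np.exp(ROOT2 * x), 1.0 - 0.5 * np.exp(-ROOT2 * x))

def relative_outliers(pdf, cdf, spread_ratio, n_members,
                      half_width=40.0, n_points=80001):
    """(N + 1) * FR_u when the truth has unit spread and the ensemble
    has spread sigma_E / sigma_T = spread_ratio (same shape)."""
    x, dx = np.linspace(-half_width, half_width, n_points, retstep=True)
    integrand = pdf(x) * cdf(x / spread_ratio) ** n_members
    return float((n_members + 1) * np.sum(integrand) * dx)

n = 50
normal_gain = relative_outliers(normal_pdf, normal_cdf, 0.95, n)
laplace_gain = relative_outliers(laplace_pdf, laplace_cdf, 0.95, n)
print(normal_gain, laplace_gain)
```

For this ensemble size the relative increase in outliers is larger for the normal PDF than for the Laplace PDF, in line with the slower growth of the spread sensitivity for the Laplace case.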

### e. Averages over large samples

In practice, rank histograms are accumulated over large samples, which contain cases of both high predictability (narrow *ρ*_{T}) and low predictability (broad *ρ*_{T}). This means that expressions for the frequency of outliers, such as Eq. (6), are to be weighted over a PDF *ρ*(*ρ*_{T}) expressing the temporal variability in predictability. If one assumes that for most cases the shape of *ρ*_{T} is similar, this weighting can be transformed to a weighting over *σ*_{T}. If an ensemble system obeys a good spread–skill relation, but consistently over- or underpredicts spread, it may be argued that the ratio *σ*_{E}/*σ*_{T} is constant. This means that *σ*_{E}/*σ*_{T} in the term for *α* in Eq. (6) can be replaced by its average value. This is not true for the term for *β,* since the constant observation error (and similarly for *σ*_{e}) leads to case-dependent ratios of *σ*_{o}/*σ*_{T}. Although for this term *σ*_{T}, *σ*_{o}, and *σ*_{e} can be replaced by their average values, it will alter the value of *β*:

$$\langle\beta\rangle = \beta\,\langle\sigma_T\rangle^{2}\int \frac{\rho(\sigma_T)}{\sigma_T^{2}}\,d\sigma_T, \tag{14}$$

where 〈*σ*_{T}〉 is the average of *σ*_{T} over *ρ*(*σ*_{T}).

Methods to measure the temporal variability in predictability were explored by Kruizinga and Kok (1988) and Houtekamer (1992). They assume that *σ*_{T} ∼ exp(−*γd*), where *d* is drawn from a standard normal distribution. This leads to 〈*β*〉 = *β* exp(3*γ*^{2}). The parameter *γ* expresses the degree of variability. For the geopotential at 500-hPa, Houtekamer (1992) obtains realistic values of *γ* between 0.3 and 0.4, which would lead to an enhancement of *β* by 30% to 60%.
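The enhancement factor exp(3*γ*^{2}) can be verified numerically. The sketch below assumes, as a reading of the text, that the *β* term scales as 1/*σ*_{T}^{2} and that the averaged expression is written in terms of 〈*σ*_{T}〉, so that the factor equals 〈*σ*_{T}〉^{2}〈*σ*_{T}^{−2}〉. The averages over the standard normal variable *d* are computed with Gauss–Hermite quadrature; the function name `beta_enhancement` and the quadrature degree are choices made for this example.

```python
import math
import numpy as np

def beta_enhancement(gamma, degree=60):
    """<sigma_T>^2 * <sigma_T^-2> for sigma_T = exp(-gamma * d), d ~ N(0, 1)."""
    # Nodes and weights for the weight function exp(-d^2 / 2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(degree)
    weights = weights / math.sqrt(2.0 * math.pi)   # turn into N(0, 1) weights
    sigma_t = np.exp(-gamma * nodes)
    mean_sigma = np.sum(weights * sigma_t)
    mean_inverse_square = np.sum(weights * sigma_t**-2)
    return float(mean_sigma**2 * mean_inverse_square)

for gamma in (0.3, 0.4):
    print(gamma, beta_enhancement(gamma), math.exp(3.0 * gamma**2))
```

For *γ* = 0.3 and *γ* = 0.4 both columns give about 1.31 and 1.62, respectively, that is, the enhancement of roughly 30% to 60% mentioned above.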

## 3. Idealized experiments with the Lorenz model

In the previous section attention was focused on the frequency of outliers. The effect of observation errors on other parts of the rank histogram, and also on reliability diagrams, is considered in this section. For this, a synthetic dataset under idealized conditions, where initial spread, uncertainties, and observation errors could all be manipulated, was produced using the nonlinear dynamical system originally suggested by Lorenz (1963).

The three variables *X,* *Y,* and *Z* of the Lorenz model are determined by the equations

$$\frac{dX}{dt} = \sigma(Y - X), \qquad \frac{dY}{dt} = rX - Y - XZ, \qquad \frac{dZ}{dt} = XY - bZ, \tag{15}$$

where *σ,* *r,* and *b* are constants. To solve these equations numerically the double approximation procedure suggested by Lorenz was used. The constants were taken to be *σ* = 10, *r* = 28, and *b* = 8/3, and the dimensionless time step Δ*t* = 0.001. In order to generate a dataset that could later be used as initial conditions, the model was run for a period of 100 000 time steps. The result is shown as the gray lines in Fig. 2 and yields the well-known Lorenz attractor.
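A minimal sketch of such an integration is given below. It interprets the double approximation procedure as two successive forward Euler steps whose result is averaged with the starting state (equivalent to Heun's method); this interpretation, the starting point, the function names, and the shorter run of 20 000 steps (chosen to keep the example fast) are assumptions of this illustration.

```python
import numpy as np

SIGMA, R, B = 10.0, 28.0, 8.0 / 3.0
DT = 0.001

def lorenz_tendency(state):
    """Right-hand side of the Lorenz (1963) equations; last axis is (X, Y, Z)."""
    x, y, z = state[..., 0], state[..., 1], state[..., 2]
    return np.stack([SIGMA * (y - x), R * x - y - x * z, x * y - B * z], axis=-1)

def step(state, dt=DT):
    """One time step: two forward steps, averaged with the starting state."""
    first = state + dt * lorenz_tendency(state)
    second = first + dt * lorenz_tendency(first)
    return 0.5 * (state + second)

def integrate(state, n_steps, dt=DT):
    """Return the trajectory, including the initial state."""
    trajectory = np.empty((n_steps + 1,) + np.shape(state))
    trajectory[0] = state
    for i in range(n_steps):
        trajectory[i + 1] = step(trajectory[i], dt)
    return trajectory

# Arbitrary starting point; after a short transient the trajectory
# follows the attractor.
attractor = integrate(np.array([1.0, 1.0, 20.0]), 20_000)
print(attractor.min(axis=0), attractor.max(axis=0))
```

Points taken from the later part of `attractor` can serve as true initial states for ensemble experiments.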

When generating the dataset used in the statistical analysis, randomly chosen points from the Lorenz attractor were used as the true state of the system. By assuming that the observations were subject to errors, the analysis was found by adding normally distributed noise to the true initial states. The standard deviation used for the analysis is denoted *σ*_{a}. This analysis was used to initialize the control forecasts. For the ensemble members, the initial conditions were calculated by adding normally distributed noise, with standard deviation *σ*_{ε}, to the analysis. The numerical model was then used to propagate the true state, analysis, and ensemble members forward in time. In this way the initial PDFs for the true states and ensembles evolve to PDFs for the verifying truth and ensemble members with (flow dependent) standard deviations *σ*_{T} and *σ*_{E}, respectively. To make this system resemble the EPS currently running operationally at ECMWF, 50 ensemble members in addition to the control forecast were used. In each case, the forecast range was 2000 time steps. When choosing this forecast range, we did a number of initial tests to make sure the forward integration was long enough to produce decent spread of the ensemble members and at the same time short enough to retain predictability. Two thousand time steps was found to be a reasonable compromise. An example of the trajectory of one true state is shown as the dotted line in Fig. 2. If *σ*_{ε} < *σ*_{a}, the ensemble spread will be too low, and vice versa. A system with perfect spread is obtained if *σ*_{ε} = *σ*_{a}.
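The construction of a single ensemble forecast described above can be sketched as follows. The value *σ*_{a} = 0.1, the spin-up length, and the function names are hypothetical choices for this example (the values actually used are those of Table 1), and the time stepping again assumes that the double approximation procedure is equivalent to Heun's method.

```python
import numpy as np

SIGMA, R, B, DT = 10.0, 28.0, 8.0 / 3.0, 0.001

def lorenz_tendency(state):
    x, y, z = state[..., 0], state[..., 1], state[..., 2]
    return np.stack([SIGMA * (y - x), R * x - y - x * z, x * y - B * z], axis=-1)

def step(state):
    first = state + DT * lorenz_tendency(state)
    second = first + DT * lorenz_tendency(first)
    return 0.5 * (state + second)

def propagate(state, n_steps):
    for _ in range(n_steps):
        state = step(state)
    return state

def ensemble_forecast(true_initial, sigma_a, sigma_eps, rng,
                      n_members=50, n_steps=2000):
    """Return final truth, final control, and final members (n_members, 3)."""
    # Analysis: true state plus normally distributed analysis error.
    analysis = true_initial + rng.normal(0.0, sigma_a, 3)
    # Members: analysis plus normally distributed initial perturbations.
    members = analysis + rng.normal(0.0, sigma_eps, (n_members, 3))
    # Integrate truth, control, and members together with the same model.
    states = np.vstack([true_initial, analysis, members])
    final = propagate(states, n_steps)
    return final[0], final[1], final[2:]

rng = np.random.default_rng(2)
true_initial = propagate(np.array([1.0, 1.0, 20.0]), 5000)   # spin-up
sigma_a = 0.1                                                # hypothetical value
truth, control, members = ensemble_forecast(true_initial, sigma_a, sigma_a, rng)
print(truth, members.std(axis=0))
```

Setting `sigma_eps` equal to `sigma_a`, as in the call above, corresponds to the configuration with perfect spread; smaller or larger values of `sigma_eps` give too little or too much initial spread.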

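For verification, the parameter *E* was used, defined as the distance between the origin in phase space and the state of the system, normalized by its average value:

$$E = \frac{\sqrt{X^2 + Y^2 + Z^2}}{\left\langle\sqrt{X^2 + Y^2 + Z^2}\right\rangle}. \tag{16}$$

Observations were obtained by adding normally distributed noise, with standard deviation *σ*_{o}, to the true value at the final states. Figure 2 shows an example of an ensemble forecast using this model together with a close-up of the final state for all the ensemble members together with the true final state.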

To use this system to test the effects of observation errors, 12 different experiments were defined. These experiments can be sorted into three groups: experiments with perfect spread, too low initial spread, and too large initial spread. For the cases with too low initial spread, the standard deviation for the ensemble members was *σ*_{ε} = *σ*_{a}/4. In the experiments with too large spread, the standard deviation for the ensemble members was *σ*_{ε} = 4*σ*_{a}. A summary of all the values used to determine the spread together with the standard deviation used for measuring the final state for each experiment is given in Table 1. For each experiment, 60 000 observations and forecasts were generated.

As was discussed in section 2 (or see, e.g., Hamill 2001), the ensemble members and the verifying observations should ideally be random draws from the same probability distribution. This means that the ranks of an observation when mixed with the ensemble members should be evenly distributed. The frequencies of such ranks (in this application ideally *p* = 1/52 ≈ 0.019) can be illustrated graphically in a rank histogram (Anderson 1996; Talagrand and Vautard 1997; Hamill 2001). The result from the experiment where *σ*_{ε} = *σ*_{a} and *σ*_{o} = 0 is shown in the rank histogram in Fig. 3 (experiment 1). As expected for this case, the rank histogram is almost flat.

To test the ability of the ensemble to correctly forecast probabilities of a certain event, reliability diagrams may be used (Strauss and Lanzinger 1996; Wilks 1995). Here, we will consider the event that the distance between the origin in the phase space and the final state of the system is larger than average; that is, that *E* > 1, where *E* is given by Eq. (16). The forecast probabilities are split into discrete bins ranging from 0 to 1. For each probability class, the proportion of times the event is observed, compared to the total number of ensemble forecasts in that class, called the observed frequency, is plotted against the corresponding probability class. For a perfectly reliable forecasting system, these points lie on the diagonal line. The reliability diagram for the case with perfect spread and no observation errors is shown in Fig. 4 (experiment 1).
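The construction of a reliability diagram can be sketched in a few lines. In the example below the forecast probability is taken as the fraction of ensemble members for which the event occurs, and the curve is tested on synthetic, perfectly reliable forecasts; the function names, the choice of 10 probability bins, and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def event_probability(ensemble, threshold=1.0):
    """Forecast probability: fraction of members exceeding the threshold."""
    return np.mean(ensemble > threshold, axis=1)

def reliability_curve(forecast_prob, event_occurred, n_bins=10):
    """Mean forecast probability, observed frequency, and count per bin."""
    forecast_prob = np.asarray(forecast_prob, dtype=float)
    event_occurred = np.asarray(event_occurred, dtype=float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    index = np.digitize(forecast_prob, inner_edges)      # values 0 .. n_bins - 1
    counts = np.bincount(index, minlength=n_bins)
    sum_prob = np.bincount(index, weights=forecast_prob, minlength=n_bins)
    sum_event = np.bincount(index, weights=event_occurred, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):  # empty bins give NaN
        return sum_prob / counts, sum_event / counts, counts

# Synthetic, perfectly reliable forecasts: the event occurs with exactly
# the forecast probability.
rng = np.random.default_rng(4)
prob = rng.uniform(0.0, 1.0, 200_000)
occurred = rng.uniform(0.0, 1.0, prob.size) < prob
mean_prob, observed_freq, counts = reliability_curve(prob, occurred)
print(np.round(mean_prob, 3))
print(np.round(observed_freq, 3))
```

For these synthetic forecasts the observed frequencies follow the diagonal; for ensemble data, `event_probability` would supply the forecast probabilities of the event *E* > 1.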

In experiment 2 the ensemble forecasts and observations are pure random draws from a normal distribution. To obtain this, both ensembles and observations were given the same value before normally distributed noise with standard deviation of 0.25 was added. Although the rank histogram is almost flat, the reliability diagram in Fig. 4 (experiment 2) reveals no forecast skill at all.

In the first experiment, perfect ensemble forecasts were compared with observations that were taken to be the true state of the system. To simulate the effects of observation errors, normally distributed noise can be added to this final state. The effect of this on the rank histogram is shown in Fig. 3, experiment 3, where *σ*_{o} = 0.02. As can be seen, this has a rather dramatic effect on the frequencies in the two extreme ranks and can easily be misinterpreted as too low spread in the ensemble forecasts. However, in this case the ensemble spread is perfect, and the strongly u-shaped rank histogram is solely caused by observation errors. The corresponding reliability diagram seems to be much less sensitive in this respect (Fig. 4, experiment 3). The frequency of outliers (see Fig. 3) is around 40%, which for a normal distribution would imply [see Eq. (5) and Fig. 1a] *σ*_{T} = 0.082, while 〈*σ*_{T}〉 = 0.25 was observed. The large difference is attributed to a large temporal variability of predictability (see section 2e).

As argued in the introduction, the proper verification is obtained by adding noise to the ensemble members as well. The rank histogram method does not distinguish between large and small differences in the parameter to be ranked. It is of course meaningless to rank the data according to differences that are much smaller than the observation errors. Hence, this effect must be filtered out before calculating the frequencies of observations in the different ranks. In experiment 4, this has been done by adding the same amount of noise to the ensemble members as was used for the observations. The effect on rank histograms is striking (see Fig. 3): the u shape for experiment 3 has been removed completely, leaving a flat histogram, as it should. Again, the effect on the reliability diagram (Fig. 4) is much less apparent.

In Fig. 3, experiment 5, the effect of too small initial spread on the rank histogram is demonstrated. Here, the true values are used for the observations. In this case the rank histogram is also strongly u shaped and looks very similar to the result obtained in experiment 3, although in that case, this was caused by observation errors. This shows that it is not possible to distinguish between the effects of observation errors or too low ensemble spread using the rank histogram. The frequency of outliers (2 × 28%) agrees well with FR_{u} given by Eq. (5) with *σ*_{E} = *σ*_{T}/4.

When noise was added to the evolved ensemble member for this last case, the observed frequencies in the two extreme ranks were reduced (Fig. 3, experiment 6). However, in contrast to the case with true measurement errors, a bell shape also appeared in the center of the histogram. The combined effect of too small ensemble spread and observation errors can be seen in Fig. 3, experiment 7. In this case, the u shape of the rank histogram is even more pronounced. Experiment 8 shows the result for this last case when observation errors are filtered by adding noise to the ensemble members. Note that the frequency of outliers is considerably lower than for experiment 5. Neutralizing the observation error in this case does, in contrast to approximation (6), have a residual effect on outliers. The reason for this is that *σ*_{E} and *σ*_{T} differ too much for Eq. (6) to be valid. In the framework of normal distributions, the reduction of outliers is clear, since the addition of observation error and noise to ensemble members makes the argument of *F*_{N} in Eq. (5) deviate less from unity.

From Fig. 4 (experiments 5 to 8), it is clear that too low ensemble spread has a stronger impact on reliability diagrams than observation errors have. For the two cases without noise added to the ensemble members, the reliability curves are s shaped, underforecasting low probabilities and overforecasting high probabilities. When noise is added to the ensemble members, the curves are improved.

The effect of too large ensemble spread is demonstrated by experiments 9 to 12. The rank histogram for experiment 9 shows the Gaussian shape obtained when the true values are used for the observations. In experiment 10, the result of adding noise to the ensemble members in the case where the observations are perfect is shown. The combined effect of observation errors and too large initial spread is shown by experiment 11. Finally, the result when noise is added to the evolved ensemble members is given by experiment 12. This removes the overrepresentation in the extreme ranks, while a dome shape remains.

The reliability diagrams for the cases with too large spread (Fig. 4, experiments 9 to 12) also indicate that these are more sensitive to errors in the ensemble spread than to observation errors. The reliability curves for the four cases are almost the same, with a pronounced s shape, overforecasting low probabilities and underforecasting high probabilities.

## 4. The ensemble prediction system at ECMWF

In this section the impact of errors in the verifying observations on rank histograms is studied for the EPS running operationally at ECMWF (Molteni et al. 1996; Buizza et al. 2000). First the situation in which verification is performed with respect to observations is considered. As an example, the focus will be on ocean wave heights. Then the case in which rank histograms are based on verifying analyses is studied. This is done for the most commonly used parameter in this situation, which is the geopotential height field at 500 hPa.

### a. Ocean waves

To see how observation errors may affect the interpretation of the ensemble spread for the EPS, ensemble forecasts of ocean wave heights are considered. In June 1998, the ocean wave model (WAM; Komen et al. 1994) was coupled to the atmospheric circulation model (Janssen et al. 2002). From then on, ensemble forecasts of ocean waves have been available on a daily basis.

Saetra and Bidlot (2002) used buoy and platform observations for the assessment of the ECMWF wave ensembles. The model was compared with wave data obtained via the Global Telecommunication System (GTS) for the period from September 1999 to March 2002 for offshore locations around the U.S. and Canadian coasts and on both sides of the British Isles. This dataset is used here. The focus is exclusively on the effect of observation errors and the interpretation of the number of outliers in rank histograms as a measure of the ensemble spread. A detailed description of the observations used is given in Saetra and Bidlot (2002). In Fig. 5, the rank histograms for significant wave height and peak period are shown for four different forecast ranges. At day 3 (*T* + 72), more than 25% of the significant wave height observations are outside the ensemble. For the peak period, this number is more than 35%. At later forecast steps, the number of outliers is reduced. But even at day 10, the percentage of observations in the two extreme ranks is still about 15% for the significant wave height.

Similar results have been obtained by several authors. Strauss and Lanzinger (1996) compared the global temperatures at 850 hPa with the analysis and found the percentage of outliers to be approximately 22% at day 6, compared to the theoretical value of 6% for the 32-member ensemble used in their study. Evans et al. (2000) gave rank histograms for the 500-hPa height at *T* + 156 (day 6.5) over the North Atlantic–European region. Here, about 15% of the observations were in the two extreme ranks. Buizza et al. (2000) showed the difference between the percentage of outliers and the reference value for the 500-hPa height over Europe for the 51-member ensembles. For the period after 1998, they found that for the day 5 forecasts, the percentage of cases when the analysis was not included in the ensemble forecast range minus the reference (3.8%) was approximately 10%. In all cases, the observations were taken as the true value. In the light of the result obtained in the previous section, the relatively large fraction of outliers may be explained, at least to some extent, by the presence of observation errors. In Fig. 5, the numbers of outliers in the uppermost and lowermost ranks differ. This skewness in the histograms cannot be a result of observation errors but is more likely caused by bias in the wave forecasts.

For wave height observations obtained from the GTS, the mean error has been estimated to be approximately 12% of the mean wave height (Janssen et al. 2003). In an attempt to filter out the effects of variability smaller than the uncertainties in the observations, errors for the significant wave height are assumed to be normally distributed with standard deviation of 12% of the true value. The same amount of noise will therefore be added to each ensemble member. For the peak period, we use a standard deviation for the errors of 1 s. This is based on the fact that this corresponds approximately to the average resolution of the wave spectra and that the observations are truncated to the nearest second. The results of this are given in Fig. 6. Although the rank histograms still indicate that the ensembles have too little spread, this is far less pronounced than if the observations are taken to be the true value. For the significant wave height, the number of outliers is reduced to less than 10% for all forecast ranges. Also for the peak period, the number of outliers is about 10% in all cases. One interesting feature is that for the wave height the number of outliers is more or less the same for all forecast ranges. This is in contrast to what was obtained before observation errors were taken into account and demonstrates that the system is more stable than the earlier results indicated.
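The perturbation of the wave ensemble members can be sketched as follows. It is assumed here that the noise added to each member has a standard deviation of 12% of that member's own wave height, and that the peak-period noise has a fixed standard deviation of 1 s; the function names and the synthetic input are illustrative only.

```python
import numpy as np

def perturb_wave_height(members, rng, relative_error=0.12):
    """Add normal noise with a standard deviation of relative_error times
    the member value (assumption: scaled by each member's own value)."""
    noise = rng.normal(0.0, 1.0, np.shape(members))
    return members * (1.0 + relative_error * noise)

def perturb_peak_period(members, rng, error_seconds=1.0):
    """Add normal noise with a fixed standard deviation in seconds."""
    return members + rng.normal(0.0, error_seconds, np.shape(members))

rng = np.random.default_rng(5)
wave_height = np.full((20_000, 50), 2.0)     # synthetic 2-m wave heights
peak_period = np.full((20_000, 50), 8.0)     # synthetic 8-s peak periods

noisy_height = perturb_wave_height(wave_height, rng)
noisy_period = perturb_peak_period(peak_period, rng)
print(noisy_height.std() / 2.0, noisy_period.std())
```

The perturbed members, rather than the raw ones, are then ranked against the buoy observations when the rank histograms are computed.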

In Fig. 7, rank histograms and reliability diagrams are shown for two cases with 12% (left) and 24% (right) noise added to the ensemble members. The forecast step here is 120 h. When too much noise is added, the number of outliers is reduced considerably and gives a too broad ensemble. Frequencies of outliers were calculated for a range of *σ*_{e} between 0.1% and 120%. For *σ*_{e} ≤ 15%, the quadratic behavior according to Eq. (6) was observed for both upper and lower outliers (results not shown). The effect on the reliability diagram is again small. We see some differences for high probabilities, but overall the reliability is good for both cases.

### b. Geopotential at 500 hPa

Most commonly, ensemble systems are verified with respect to verifying analysis fields rather than with respect to observations. A practical advantage of using analyses is that they are easily available. In addition, they represent the same scales as the forecast fields of the ensemble members and, therefore, avoid representativity and/or collocation errors that would be introduced when using observations (such as the buoys in section 4a).

Before the impact of the errors in these fields on rank histograms can be assessed, an estimate for their typical magnitude (denoted by *σ*_{a}) needs to be made. For October 2000, such an estimate for 500-hPa geopotential (Z500) is displayed in Fig. 8. It is based on the spread over 10 assimilation experiments averaged over the specified period (M. Fisher 2002, personal communication). For each experiment, observations were randomly perturbed according to their uncertainty before they were assimilated into the ECMWF four-dimensional variational data assimilation (4DVAR) system. Analysis errors are found to be small over western Europe and over the eastern part of the United States, and large over the Pacific and the Arctic. This reflects the respectively high and low density of the conventional observational network (mainly radiosondes and aircraft measurements) in these areas.

The estimates for the *Z*500 analysis error for October 2000 were used for the verification period of October 2001. In the case where the analysis error at verification time was incorporated, ensemble members were perturbed according to these estimates. So on average, larger perturbations were applied in the Pacific than in the European area. For six different areas for the October 2001 period, the sum of upper and lower outliers for *Z*500 is shown in Fig. 9. Solid curves are for unperturbed ensemble members (*σ*_{e} = 0), dashed curves for *σ*_{e} = *σ*_{a}, and dotted curves for *σ*_{e} = *σ*_{a}/2. Averages were based on a regular lat–lon grid with a spacing of 2.5°. The Northern Hemisphere is defined by all points with a latitude between 20° and 80°N; the five other areas are determined by the boxes in Fig. 8.

As was seen for the other cases considered in this paper, the impact of taking “observation” errors into account is considerable. When they are not incorporated, the frequency of outliers is far too high in the short range. For this range the inclusion of errors in the verifying analysis (dashed curves) has the largest effect. The frequency of outliers becomes much lower and more constant as a function of forecast time. The impact is smallest for Europe because there (see Fig. 8), average analysis errors are smaller than in the other five areas. It is also exactly this region for which the frequency of outliers was smallest (7% compared to 10%–12% for the other areas) when the errors in the verifying analysis were neglected. Both the largest impact in the short range and the smallest mismatch for the unperturbed method in regions where the analysis errors are smallest favor the conjecture that the effect of observation errors on rank histograms is realistic. When ensemble members are perturbed with *σ*_{a}/2 (dotted curves), the impact is more than halved [25% according to Eq. (6)], though still considerable.

Note that at analysis time the frequency of outliers is 0 regardless of whether errors in the (verifying) analysis are taken into account or not. This results from the fact that the initial ensemble members are symmetrically perturbed around the analysis. By definition, the analysis is located in the center of the ensemble and, therefore, can never be an outlier. This effect forms an extra complication that obscures the assessment of the statistical quality of an ensemble system. Figure 9 suggests that rank histograms are affected up to day 1. This undesirable feature mainly applies when verification is performed with respect to the verifying analysis fields. It is less pronounced when the comparison is made with “real” observations (as was the case in the previous section).

Finally, histograms of the rescaled ensemble distributions were determined (deviations from the average geopotential are denoted by Δ*Z*500). For each forecast time these were accumulated for ensembles at all grid points north of 20°N and all 31 days. It was found (results not shown) that for each forecast time, deviations from a standard normal distribution were only present beyond three standard deviations. At initial time, *ρ*_{E} goes to 0 faster for |Δ*Z*500| > 3*σ*_{E}, which must be a result of the deterministic manner in which initial perturbations are created (Molteni et al. 1996). In addition, the initial distribution shows a small spike at (Δ*Z*500) = 0, which is induced by the inclusion of the (unperturbed) control in the ensemble. It was found to disappear after day 1. For day 3 and later, the decay for (Δ*Z*500) < −3*σ*_{E} is slower than normal and more similar to the tail of a Laplace distribution [see Eq. (11)]. However, for (Δ*Z*500) > 3*σ*_{E} the resemblance with a normal distribution is still good. According to the analysis performed in section 2c, this indicates that for *Z*500, the increasing sensitivity of outliers to observation errors toward larger ensemble size *N* (as shown in Fig. 1) should apply for at least *N* ≤ 500.

## 5. Conclusions

Based on the analysis performed in section 2 and the results obtained by using the Lorenz model in section 3, it was found that observation errors may have a rather dramatic effect on rank histograms, leading to an increase in the number of observations in the lowest and highest ranks. It was shown that this sensitivity depends on the tail of the underlying distributions. For normal distributions an increasing sensitivity was found for larger ensemble size. A nonnormal tail could change this picture.

Ranking data obtained with a perfect model for both forecasts and observations resulted in u-shaped histograms when normally distributed noise was added to the observations. The results were almost identical to those obtained with perfect observations but too low ensemble spread. This false impression of too low ensemble spread is caused by the fact that the rank histogram method does not distinguish between large and small errors. Only significant differences—that is, differences larger than the observation errors—should contribute to the lowest and highest ranks when observations are outside the ensemble range. To account for this, the probability distribution from the ensemble forecast must be convolved with the observation errors. By adding normally distributed noise to each ensemble member, using the same standard deviation as for the observation errors, the flat rank histogram was restored for the perfect model case.
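This behavior can be reproduced with a toy experiment that is much simpler than the Lorenz model of section 3. In the sketch below, which is an illustration under stated assumptions and not the experiment of this paper, truth and ensemble members are drawn independently from the same standard normal distribution, so that the ensemble is perfect by construction. The function names and the parameters `sigma_o` (observation error) and `sigma_e` (noise added to the members) are chosen here for illustration.

```python
import random

def rank_histogram(n_cases, n_members, sigma_o, sigma_e, seed=0):
    """Rank histogram for a perfect Gaussian toy ensemble.

    sigma_o: std dev of the noise added to the truth to form the observation.
    sigma_e: std dev of the noise added to each ensemble member.
    Returns a list of n_members + 1 counts (rank 0 = below all members).
    """
    rng = random.Random(seed)
    counts = [0] * (n_members + 1)
    for _ in range(n_cases):
        truth = rng.gauss(0.0, 1.0)
        # Members come from the same distribution as the truth (perfect spread).
        members = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, sigma_e)
                   for _ in range(n_members)]
        observation = truth + rng.gauss(0.0, sigma_o)
        rank = sum(m < observation for m in members)
        counts[rank] += 1
    return counts

def outlier_fraction(counts):
    """Fraction of cases in the lowest and highest ranks."""
    return (counts[0] + counts[-1]) / sum(counts)
```

With, for example, `rank_histogram(20000, 10, sigma_o=0.5, sigma_e=0.0)`, the outer ranks should be overpopulated relative to the ideal fraction 2/(N + 1), whereas `sigma_e=0.5` should bring the fraction back close to that value. The reason is that perturbed members and noisy observations are then again drawn from one and the same distribution.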

The reliability diagrams are less sensitive in this respect. For the case with perfect spread, the reliability diagrams were almost identical for cases with perfect observations and observations with noise added to them. Errors in the ensemble spread on the other hand had a stronger impact on the reliability curves. Too low ensemble spread resulted in underforecasting of low probabilities. Again, adding normally distributed noise to the ensemble members slightly improved the results. Too large ensemble spread resulted in s-shaped reliability curves, overforecasting for low probabilities and underforecasting for high probabilities.
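For reference, the points of a reliability curve can be computed along the following lines. This is a generic sketch rather than the procedure used for the figures; it assumes that the event is the exceedance of a fixed threshold and that the forecast probability is the fraction of members exceeding it. The function name and data layout are hypothetical.

```python
def reliability_curve(cases, threshold):
    """Observed event frequency per forecast probability.

    cases: list of (members, observation) pairs (hypothetical layout).
    Assumption: the event is "value > threshold" and the forecast
    probability is the fraction of members for which the event occurs.
    Returns a sorted list of (forecast_probability, observed_frequency, count).
    """
    totals = {}
    hits = {}
    for members, observation in cases:
        p = sum(m > threshold for m in members) / len(members)
        totals[p] = totals.get(p, 0) + 1
        hits[p] = hits.get(p, 0) + (1 if observation > threshold else 0)
    return [(p, hits[p] / totals[p], totals[p]) for p in sorted(totals)]
```

For a reliable system the observed frequencies lie on the diagonal, that is, they equal the forecast probabilities. Systematic departures from the diagonal correspond to the underforecasting and overforecasting discussed above.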

Many investigations have pointed out that the ECMWF EPS does not have enough spread (Strauss and Lanzinger 1996; Evans et al. 2000; Buizza et al. 2000). Most studies of ensemble spread are based on either rank histograms or counting the number of outliers. Since the observations or verifying analysis fields are usually taken to be the truth, these results can to some extent be explained by the presence of errors in these quantities. When comparing the wave ensembles with buoy observations, we have demonstrated that the total number of outliers for the day 3 forecasts is reduced from more than 25% to less than 10% when a reasonable estimate of the observation errors is taken into account. Although the spread is still too low, the performance of the EPS is in this respect much better than often suggested.

For the *Z*500 forecasts, the impact of errors in the verifying analysis on ensemble spread was also considerable. Traditionally, a peak in the number of outliers has been found in the short forecast range, which was considered to be proof of a lack of short-range ensemble spread. The fact that results have usually been better over western Europe and the United States, where observation networks are denser, indicates, however, that unaccounted errors in the verifying analysis fields may have contributed to this picture. Indeed, when noise based on estimates of such errors is added to the ensemble members, the peak in the number of outliers in the short range is found to almost vanish. Also, the number of outliers becomes more constant in time. Although the residual overpopulation of outliers may indicate some lack of ensemble spread, the situation is far more positive than previously believed.

## Acknowledgments

This research was partly funded by the European Commission through the SEAROUTES project under Contract G3RD-CT-2000-00309. We would like to thank Peter Janssen, Saleh Abdalla, and Mike Fisher for support and valuable discussions.

## REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. *J. Climate*, **9**, 1518–1530.

Buizza, R., T. Petroliagis, T. Palmer, J. Barkmeijer, M. Hamrud, A. Hollingsworth, A. Simmons, and N. Wedi, 1998: Impact of model resolution and ensemble size on the performance of an ensemble prediction system. *Quart. J. Roy. Meteor. Soc.*, **124B**, 1935–1960.

Buizza, R., J. Barkmeijer, T. M. Palmer, and D. S. Richardson, 2000: Current status and future developments of the ECMWF Ensemble Prediction System. *Meteor. Appl.*, **7**, 163–175.

Evans, R. E., M. S. J. Harrison, R. J. Graham, and K. R. Mylne, 2000: Joint medium-range ensembles from The Met. Office and ECMWF systems. *Mon. Wea. Rev.*, **128**, 3104–3127.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.*, **129**, 550–560.

Houtekamer, P. L., 1992: Predictability in models of the atmospheric circulation. Ph.D. thesis, Landbouwuniversiteit Wageningen, Netherlands, 121 pp.

Houtekamer, P. L., L. Lavaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. *Mon. Wea. Rev.*, **124**, 1225–1242.

Janssen, P. A. E. M., J. D. Doyle, J. Bidlot, B. Hansen, L. Isaksen, and P. Viterbo, 2002: Impact and feedback of ocean waves on the atmosphere. *Atmosphere–Ocean Interactions*, W. Perrie, Ed., Advances in Fluid Mechanics, Vol. I, WIT Press, 155–197.

Janssen, P. A. E. M., S. Abdalla, and H. Hersbach, 2003: Error estimation of buoy, satellite, and model wave height data. ECMWF Research Dept. Tech. Memo. 402, 17 pp.

Komen, G. J., L. Cavaleri, M. Doneland, K. Hasselmann, S. Hasselmann, and P. A. E. M. Janssen, Eds., 1994: *Dynamics and Modelling of Ocean Waves*. Cambridge University Press, 533 pp.

Kruizinga, S., and C. J. Kok, 1988: Evaluation of the ECMWF experimental skill prediction scheme and a statistical analysis of forecast errors. *Proc. ECMWF Workshop on Predictability in Medium and Extended Range*, Reading, United Kingdom, ECMWF, 403–415.

Lorenz, E. N., 1963: Deterministic nonperiodic flow. *J. Atmos. Sci.*, **20**, 130–141.

Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. *Quart. J. Roy. Meteor. Soc.*, **122**, 73–119.

Richardson, D. S., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. *Quart. J. Roy. Meteor. Soc.*, **126**, 649–668.

Saetra, Ø., and J-R. Bidlot, 2002: Assessment of the ECMWF ensemble prediction system for waves and marine winds. ECMWF Research Dept. Tech. Memo. 388, 29 pp.

Simmons, A. J., and A. Hollingsworth, 2002: Some aspects of the improvement in skill of numerical weather prediction. *Quart. J. Roy. Meteor. Soc.*, **128**, 647–677.

Strauss, B., and A. Lanzinger, 1996: Verification of the Ensemble Prediction System (EPS). *ECMWF Newsletter*, No. 72, ECMWF, Reading, United Kingdom, 9–15.

Talagrand, O., and R. Vautard, 1997: Evaluation of probabilistic prediction systems. *Proc. ECMWF Workshop on Predictability*, Reading, United Kingdom, ECMWF, 1–25.

Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. *Bull. Amer. Meteor. Soc.*, **74**, 2317–2330.

Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. *Wea. Forecasting*, **8**, 379–398.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences: An Introduction*. Academic Press, 467 pp.

An example of the trajectory in phase space of the “true” solution (dotted line) together with the final states of the corresponding ensemble forecast (small stars): (a) the *xz* projection, (b) the *xy* projection, and (c), (d) close-ups of the final state. In (a) and (b), the large star denotes the initial state. In (c) and (d) the large star shows the final true state.

Citation: Monthly Weather Review 132, 6; 10.1175/1520-0493(2004)132<1487:EOOEOT>2.0.CO;2

Rank histograms based on artificial datasets generated using the Lorenz model. Expts 1, 3, and 4 in the upper row are the cases with perfect spread in the ensemble forecasts. Expt 2 represents pure normally distributed noise. The experiments in the second row are cases with too low ensemble spread. The results for too large ensemble spread are given by the four cases shown in the third row

Reliability diagrams for experiments of Fig. 3 based on artificial datasets generated using the Lorenz model. Expts 1, 3, and 4 in the upper row are the cases with perfect spread in the ensemble forecasts. Expt 2 represents pure normally distributed noise. The experiments in the second row are cases with too low ensemble spread. The results for too large ensemble spread are given by the four cases shown in the third row

Rank histogram for ECMWF forecasts of (left) wave height and (right) peak period compared with GTS wave data. Here, the observations are taken as the true value. Forecast steps are 72, 120, 168, and 240 h. The period covered spans from Sep 1999 to Mar 2002.

As in Fig. 5, but with noise added to ensemble members

Rank histogram and reliability diagrams for forecast step 120 h. (left) Cases when normally distributed errors with standard deviation of 12% of the true value are added to the ensemble members. (right) Results when too much error is added. Here, 24% error is used. The period covered spans from Sep 1999 to Mar 2002

Estimate for the error covariance of the analysis in *Z*500 for the period between 2 and 31 Oct 2000. Average values for North America and Europe (solid boxes), for the North Atlantic plus Europe and North Pacific (dashed boxes), and for Asia (dashed-dotted box) are 2.86 m (NAmerica), 2.68 m (Europe), 2.90 m (AtWEur), 3.73 m (Pacific), and 3.19 m (Asia), respectively. The average over the Northern Hemisphere (20° ≤ lat ≤ 80°N) is 3.24 m

Frequencies of outliers as a function of forecast time, averaged over Oct 2001 for six different areas. Solid curves are for the (usual) case in which observation errors at verification are neglected, dashed curves for the case in which estimates of such errors (given in Fig. 8) are included, and dotted curves for the case in which ensemble members are randomly perturbed by half of these estimates. The constant solid lines represent the ideal frequency of 3.8%. AtWEur = Atlantic and western Europe.

Specification of the 12 experiments for the Lorenz model. Ensembles (ENS) are defined by the spread of the initial ensemble members and the noise σ_{e} added to the integrated ensemble members. Their relation to reality is characterized by the average analysis error σ_{a} (ANA) at initial time and observation error σ_{o} (OBS) at final time