## 1. Introduction

The use of ensemble predictions allows for the estimation of possible outcomes of specific weather events. From the ensemble, a “best” deterministic forecast can be extracted as the ensemble mean (Ehrendorfer 1997), while the forecast uncertainty is often reduced to one measure, called the ensemble *spread*, which summarizes the information on the uncertainties associated with the presence of initial-condition and model errors (Nicolis et al. 2009). Although much information is lost by reducing the ensemble forecast to two numbers, such an approach does offer benefits since it condenses the most essential and most reliable knowledge. Moreover, conveying additional information could be counterproductive for most end users as its interpretation can be difficult. Whereas the added value of ensemble forecasting for minimizing the error of the ensemble mean is beyond doubt, the analysis of uncertainty information remains a nontrivial task (Toth et al. 2003; Wilks 2011) and forms the subject of intense ongoing research (Hopson 2014; Christensen et al. 2014).

The verification of the forecast uncertainty, based on the spread only, is intricate for two reasons: for each probabilistic forecast only one observation is realized, and the spread is not meant to provide an exact prediction of the error. The statistical analysis of binned forecasts of similar spread can be used to overcome the latter issue (Wang and Bishop 2003; Kolczynski et al. 2011; Leutbecher 2009; Grimit and Mass 2007). Within the context of forecast uncertainty verification, however, much attention has been devoted to verification scores that assess the *deterministic* rather than the probabilistic aspects of the forecast. For instance, the strengths and weaknesses of the spread–error correlation for different spread–error metrics were explored in detail by Grimit and Mass (2007) and Hopson (2014). Christensen, Moroz, and Palmer (2014), on the other hand, go beyond this approach by extending the mean-square error of the (squared) error forecast into a proper score that takes into account the ensemble skewness.

In the present work, we first argue that, rather than using the spread as a deterministic forecast of the uncertainty, it is preferable to construct a *probabilistic* forecast for the uncertainty. Spread–error verification can then be based on conventional probabilistic scores. Second, using such verification, one can assess whether the uncertainty estimate based on spread only degrades or even improves upon the full-ensemble uncertainty estimate. For instance, the impact on statistical scores of replacing an ensemble with a Gaussian distribution with corresponding mean and variance can be checked. Third, rather than inferring relations of the form

With that perspective in mind the following questions are addressed: (i) Given a series of ensemble spreads and associated ensemble errors, how can one best model the uncertainty forecast based on the spread such that it reproduces well the uncertainty information contained in the full-ensemble uncertainty forecast (section 2)? (ii) What are the appropriate scores for the verification of the spread-based uncertainty forecast (section 3a)? (iii) Can we assess to what extent uncertainty estimates that are based on spread only degrade or even improve the full-ensemble estimates (section 3b)? (iv) Is the ensemble spread useful as a predictor for the uncertainty (section 4)?

The approach is then tested within the context of the Ensemble Prediction System (EPS) of the European Centre for Medium-Range Weather Forecasts (ECMWF), in order to clarify whether information is lost when using the ensemble spread instead of the full ensemble for uncertainty estimation (section 5a), and at what lead times the spread is a good predictor for the EPS uncertainty (section 5b). Finally, the influence of limited ensemble size, non-Gaussian ensemble statistics, limited predictability, the spread–error metric, and the spread-based model are also investigated in sections 5a and 5b, and our methodology is related to the binning approach in section 5c. The results are further discussed in section 6 from the perspective of previous works and our conclusions are drawn in section 7.

## 2. The forecast of the uncertainty and its modeling

### a. General approach

*e*, and

*O* being the observation. Once the definition of the error metric

*full-ensemble error forecast or uncertainty forecast*

*perfect spread consistency* as perfect reliability conditional on spread as the only uncertainty measure.

*error metric* but not on the error (with respect to the observation) itself. The first proposed spread-based model for the

*uncertainty forecast* assumes Gaussian-distributed errors: Note that this model applies to the

*error* distribution and not the

*ensemble* distribution. Both are equivalent only for the ensemble-mean error metric

*spread-based models for the uncertainty forecast* are: Here,

All quantities

Different definitions for combinations of spreads and error metrics are given in the first three columns. For the given metrics, the last column suggests the preferred spread-based model, along with its parameters. For the last two spread–error metrics (EM-GEO and AV-GEO) the regression coefficients of the log-transformed regression as defined in Eq. (20) for a perfectly spread-consistent forecast are independent of any hypothesis on the nature of the ensemble distribution. Note that more results are given in Table 2 but the metrics specified here are restricted to the ones that have universal parameters (i.e., independent of assumptions on the ensemble distribution). We denote member *e* of the ensemble forecast as *O*.

For different combinations of spread and error metrics the parameters of the spread-based models in section 2 are specified. The approach for their calculation is given in the section 2b. In addition, the regression coefficients *α*_{perf} and *β*_{perf} of the log-transformed regression as defined in Eq. (20) and corresponding to a perfectly reliable ensemble forecast are given. Member *e* of the ensemble forecast is denoted as *O*. Results that are independent of any hypothesis on the nature of the ensemble distribution are denoted with check mark (✓) while results that are considered “not applicable” are denoted as N/A. Note that

### b. Computation of model parameters for two specific error metrics

*e* based on the spread–error metrics provided in the second column of Table 2. Then, the parameters

Now we need to express all model parameters of Eq. (9) in terms of the spread
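As an illustration of such a spread-based Gaussian uncertainty forecast, the following sketch evaluates the CDF of an error forecast under a model-I-type assumption in which the ensemble-mean error is distributed as N(0, s²), with s the ensemble spread. The function name and the exact parameterization are illustrative, not the paper's notation.

```python
import math

def gaussian_error_cdf(e, spread):
    """CDF of a spread-based Gaussian uncertainty forecast (sketch).

    Assumption (illustrative): the ensemble-mean error e is distributed
    as N(0, s^2), where s is the ensemble spread."""
    return 0.5 * (1.0 + math.erf(e / (spread * math.sqrt(2.0))))

# Probability that the error stays within +/- one spread unit
s = 2.0
p_within = gaussian_error_cdf(s, s) - gaussian_error_cdf(-s, s)  # ~0.683
```

Under this assumption the error falls within one spread unit roughly 68% of the time, which is the kind of probabilistic statement a spread-based uncertainty forecast makes.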

## 3. Verification of the uncertainty forecast

### a. Verification of the uncertainty forecast based on spread

*q*. The

Note that the reliability of the uncertainty forecast can be probed, for instance, by considering the reliability component of the CRPS (Hersbach 2000) or the histograms of the probability integral transform (PIT; Gneiting et al. 2007).
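The PIT diagnostic mentioned above can be sketched on synthetic Gaussian forecasts (illustrative data, not EPS output): for a reliable forecast, evaluating each forecast CDF at its verifying observation yields uniformly distributed values.

```python
import numpy as np
from math import erf, sqrt

def pit_values(forecast_cdfs, observations):
    """Probability integral transform: each forecast CDF evaluated at its
    verifying observation. For a reliable forecast the PIT values are
    uniform on [0, 1], so their histogram should be flat."""
    return np.array([cdf(o) for cdf, o in zip(forecast_cdfs, observations)])

# Synthetic, perfectly reliable example: observations drawn from the
# forecast distribution N(mu, 1) itself.
rng = np.random.default_rng(0)
n = 10_000
mu = rng.normal(size=n)
obs = mu + rng.normal(size=n)
cdfs = [lambda x, m=m: 0.5 * (1 + erf((x - m) / sqrt(2))) for m in mu]
pit = pit_values(cdfs, obs)
hist, _ = np.histogram(pit, bins=10, range=(0, 1))
# A flat histogram (about n/10 counts per bin) signals reliability.
```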

### b. Comparison of full-ensemble and spread-based uncertainty forecasts

The skill of the spread-based uncertainty models can now be compared with that of the full-ensemble uncertainty. Such a comparison is useful in estimating the information loss caused by reducing the full-ensemble uncertainty information to one quantity, the spread, using the models proposed in section 2. One may expect that ensemble distributions that are skewed or have high kurtosis, as for wind speed, are not appropriately characterized by the spread only. The modeling of such higher moments, however, is crucial for meteorological purposes since it determines the tails of the ensemble distributions that may pertain to rare but high-impact events. On the other hand, the uncertainty estimate of ensembles of small size may be improved using a spread-based uncertainty forecast, as the latter provides an overall smoother forecast distribution, as well as nonzero probabilities for the more unlikely events.

*q*) integrated over all quantile values

*q* amount to the CRPS. The quantile score can thus be used to analyze the origin of CRPS changes.

The skill-score formalism as presented here analyzes if, *with respect to a particular score*, spread alone statistically improves upon or degrades the full ensemble for the uncertainty estimation. Of course, many aspects of such information reduction depend on the application, as well as on the user, but as long as these aspects can be quantified using scores, this formalism applies. The skill scores in Eq. (17) will be tested in section 5a for different ensemble sizes and for different meteorological variables including wind speed.
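The comparison underlying such skill scores can be sketched as follows: compute the CRPS of a raw ensemble and of its reduction to a Gaussian with the same mean and spread. The data are synthetic and a standard textbook CRPS estimator is used rather than the paper's exact expressions [Eq. (A6)].

```python
import numpy as np
from math import erf, exp, pi, sqrt

def crps_ensemble(members, obs):
    """Textbook CRPS estimator for an ensemble forecast."""
    m = np.asarray(members, dtype=float)
    return np.mean(np.abs(m - obs)) - 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)."""
    z = (obs - mu) / sigma
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / sqrt(pi))

# Reduce a synthetic 51-member ensemble to its mean and spread and
# compare the two CRPS values; the gap between them measures the
# information lost (or gained) by the reduction.
rng = np.random.default_rng(1)
ens = rng.normal(0.0, 1.0, size=51)
obs = 0.3
crps_full = crps_ensemble(ens, obs)
crps_reduced = crps_gaussian(ens.mean(), ens.std(ddof=1), obs)
```

Averaging such paired scores over many forecast cases, and forming a skill score from the two averages, is the kind of comparison Eq. (17) formalizes.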

## 4. Spread as a predictor of uncertainty

### a. Introduction

So far, an approach has been proposed to assess the overall forecast quality and its reliability based on spread and error only. However, a forecast distribution that corresponds to the climatological distribution satisfies perfect reliability, and a good ensemble forecast is expected to provide additional information. Hence, a more advanced aspect of ensemble forecasting, called resolution, concerns whether the flow-dependent variations of spread are indicative of the variations of the uncertainty. Consider, for instance, an ensemble system for which the spread is systematically twice that of a perfectly spread-consistent forecast system. This feature entails that the spread is a perfect predictor for the true uncertainty (their correlation is one), even though the spread–error scores as introduced in the last section may not be good. Note that in such a case the scores could be improved by ensemble-spread calibration techniques along the lines of Gneiting et al. (2005) and Van Schaeybroeck and Vannitsem (2015).

A seemingly straightforward approach to testing whether the spread is a good predictor for the uncertainty is through the computation of the Pearson correlation between spread and error. However, since, even for a perfectly reliable ensemble system, the spreads and errors on a scatter diagram do not align, the associated Pearson correlation is not equal to one (Whitaker and Loughe 1998). The correlation coefficient also appears as the most important quantity for ordinary least squares (OLS) regression. Applied on spread and error, however, OLS regression is statistically unjustified since it assumes homoscedasticity. The latter means that the scatter variance around the regression line is constant. For a perfectly reliable and therefore also for a perfectly spread-consistent system, this would imply

### b. Modeling spread as predictor of uncertainty

*C* is a constant, independent of the spread. The last term can be considered the new noise term, which now is

*independent of the spread*. Therefore, the logarithmically transformed spread and error are homoscedastic against one another. This is also apparent in Fig. 2, which is simply the log–log equivalent of Fig. 1b. Note that the logarithmic transformation can only be applied to errors that are always positive and it is therefore applicable to all metrics of Table 2 except for EM-RAW.
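A minimal sketch of such a log-transformed regression, on synthetic spread-consistent data (the symbols α and β follow the text; the data and the use of a plain least-squares fit are illustrative):

```python
import numpy as np

def loglog_regression(spread, error):
    """OLS fit of ln(error) = alpha + beta * ln(spread) + noise.

    After the log transform the scatter around the regression line no
    longer grows with the spread, so OLS is statistically justified."""
    beta, alpha = np.polyfit(np.log(spread), np.log(error), deg=1)
    return alpha, beta

# Synthetic spread-consistent data: error magnitude proportional to spread.
rng = np.random.default_rng(2)
s = rng.uniform(0.5, 3.0, size=5000)
err = s * np.abs(rng.normal(size=5000))
alpha, beta = loglog_regression(s, err)
# beta close to 1: the error scales linearly with the spread.
```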

*α* and

*β* should be compared to

^{1}The perfect-model regression parameters for these are independent of any hypothesis on the nature of the ensemble distribution and satisfy

*α* and

*β* can be used to quantify the predictive power of the spread as a measure of uncertainty. Even more importantly, it allows for evaluating the quality of any ensemble system as compared to a perfectly spread-consistent ensemble, as will be demonstrated in the next section. For systems that are close to perfectly spread consistent, heteroscedasticity is expected to be present, so Eq. (19) provides a statistically sound basis for estimating the spread–error relationship.

Note that even for a perfectly spread-consistent ensemble, the random term of order one that appears in Eq. (18) may be non-Gaussian. Additional transformations may improve Gaussianity (Yeo and Johnson 2000; Draper and Smith 1998) but care should be taken that this is effective for all values of the spread.

## 5. Application on ECMWF EPS

The spread–error assessments are now applied to forecasts for 500-hPa geopotential height (denoted Z500), 500-hPa wind velocity (V500), and wind velocity at 10 m (V10m) over Europe based on the ECMWF’s EPS. These are compared with their analysis. Daily forecasts (at 0000 UTC) from 20 June 2012 until 19 March 2013 (240 cases) are considered, a period during which no model change was performed; lead times are taken at 12-h intervals. Note also that a resolution change occurs after the 10-day lead time.

The choice of variables allows us to study the potential impact of statistical characteristics and predictability on the theory proposed in section 2. Indeed, upper-air variables (Z500 and V500) are known to have much better predictability features than surface variables (V10m). In addition, the statistical characteristics of Z500 are expected to be well represented using Gaussian statistics, in contrast to wind speed (V500 and V10m).

### a. Verification of spread-based models

Figure 3 shows the CRPSSs [see Eq. (17a)] against lead time for different meteorological variables (different columns: Z500, V500, and V10m) and different ensemble sizes (different rows: 51, 15, and 5 ensemble members). In each panel in Fig. 3, three uncertainty forecasts (EM-RAW, EM-SQU, and AV-SQU), associated with different spread–error metrics, are considered, each of which is modeled using spread-based uncertainty forecasts. More specifically, EM-RAW is modeled using model I (blue line) while all models (I, II, and III) are used for EM-SQU (red lines) and AV-SQU (black lines). The full-ensemble uncertainty forecast improves upon the spread-based uncertainty forecast for positive CRPSS (shaded in red). The analytic expressions in Eq. (A6) were used to calculate the CRPS of the spread-based models.

Surprisingly, almost all conclusions that can be drawn from Fig. 3 are independent of the lead time and, most importantly, of the variable under consideration. These conclusions are as follows:

- For all cases including the ones with a five-member ensemble, the CRPS skill improvement or degradation is at most 10%. This indicates that the uncertainty information results of the spread-based models and the full ensemble are largely comparable. Skill changes among the full-ensemble forecasts due to changes of the ensemble sizes are usually larger (not shown).
- For EM-RAW the spread-based model I shows an improvement for all variables and ensemble sizes. Therefore replacing each ensemble by a Gaussian distribution with its corresponding mean and variance results in an improvement in the CRPS score.
- For each uncertainty forecast at least one spread-based model is able to improve upon or be marginally worse than the full-ensemble forecast. For the EM-SQU uncertainty, forecast model II leads to the best results. The quality of this model is related to the fact that the associated model parameters are independent of any hypothesis on the nature of the ensemble distribution, as indicated in Table 1. For AV-SQU, model III is far better than models I or II and induces only a marginal skill loss upon use of 51 ensemble members. Note that although the errors of EM-SQU and AV-SQU are always positive, model I allows for negative values. This may explain why model I is never the best model for the AV-SQU and EM-SQU uncertainties. Results for EM-ABS and AV-ABS are not shown but were very similar to those of EM-SQU and AV-SQU, respectively. Therefore, it is concluded that for the ensemble-mean-based error metrics (except for EM-RAW) the best spread-based model is model II. For all ensemble-average error metrics (indicated with “AV”) the use of model III is advised.
- All spread-based models improve upon the full-ensemble forecast for five-member ensembles as a result of the smoother representation of the ensemble distribution by the spread-based model.

Figure 4 shows the QSSs as given in Eq. (17b) for the same variables (columns: Z500, V500, and V10m), spread–error metrics, and associated models as in Fig. 3. The first and second rows in Fig. 4 are for the 10th and 90th percentiles, respectively. All 51 ensemble members are used. The main conclusions that were drawn from Fig. 3 remain valid for Fig. 4: for EM-RAW the spread-based model I shows an overall improvement for all variables and quantiles while the same is valid for model II applied on the EM-SQU uncertainty forecast. Finally, model III is the best model for the AV-SQU uncertainty forecast.

However, differences are present between the QSSs for the different quantiles and the different variables. Good skill scores are found for all models for the 90th percentile, and thus for predicting events that are unlikely according to the forecast. Exceptions exhibiting minor skill decreases include model III applied on the EM-SQU forecast of Z500 and V500, and model III applied on the AV-SQU forecast of V10m. This indicates that, at least for the cases considered, the reduction of uncertainty information to one quantity (spread) does not strongly affect the prediction of events that are unlikely according to the forecast. The spread-based model I applied on the EM-RAW uncertainty forecast even gradually improves upon the full-ensemble forecast as a function of lead time. The fact that models I and II strongly deteriorate the overall AV-SQU forecast quality (see Fig. 3) results from their poor prediction of small errors (10th percentile). As mentioned before, model I allows for negative values while the EM-SQU and AV-SQU uncertainties are always positive, and it is therefore worse than models II and III, as well as the full-ensemble forecast.

Note that again the results for EM-ABS and AV-ABS are not shown but were very similar to those for EM-SQU and AV-SQU, respectively.

The 10th and 90th percentiles pertain to the ensemble distribution of the error forecast, which, except for EM-RAW, is expected to be asymmetric and skewed because of the positive character of the error. Therefore, even for a variable that is expected to be described reasonably well using Gaussian statistics (Z500), the results for the 10th and 90th percentiles strongly differ. For EM-RAW the results at the 10th and 90th percentiles are similar for Z500 because of the symmetric character of the ensemble distributions, but different for the wind variables for which skewed distributions are still present.

To summarize, the use of spread as the only measure of uncertainty, combined with the spread-based models in section 2, does not lead to a strong uncertainty information loss. Replacing each ensemble by a Gaussian distribution with corresponding mean and variance yields a small but significant improvement in the statistical scores. A similar result, based on the Brier score, was obtained by Atger (1999) for the ECMWF EPS. Also, model II applied on the uncertainty forecasts involving ensemble-mean errors (EM metrics) systematically improves the full-ensemble uncertainty forecasts, even for 51 members. Model III is suitable for uncertainty forecasts involving ensemble-average errors (AV metrics) but overall induces weak skill degradation.

### b. Verification of spread as a predictor of uncertainty

The spread–error relation measured with the regression parameters as specified in Eqs. (20a) and (20b) can now be evaluated for the spread of the ECMWF EPS. Each panel in Fig. 5 shows the coefficients *β* for the EPS as well as for a “persistence” forecast over Europe. The persistence forecast is constructed by subtracting from each ensemble member at each lead time a constant value such that its ensemble mean equals that of the forecast at lead time zero. Therefore, only the ensemble mean and not the ensemble spread is affected. The different columns in Fig. 5 correspond to different variables (Z500, V500, and V10m) while the different rows are for different spread–error metrics (EM-SQU, AV-SQU, and AV-GEO). Also shown are the regression values corresponding to a perfectly reliable forecast

It is found that for a fixed variable the qualitative behavior of the regression coefficients pertaining to different spread–error metrics is very similar. This indicates that the results are independent of the assumption of Gaussianity [Eq. (14)]. For instance for Z500 (Figs. 5a,d,g), *α* and *β* for all spread–error metrics are far from *α* and *β* are stationary and very close to *α* and *β* depart again from *β* of the EPS and the persistence forecasts overlap one another.

As can be seen in Figs. 5b, 5e, and 5h, the evolution of the coefficients characterizing consistency for wind velocity at 500 hPa qualitatively differs from that of Z500. Good spread consistency is visible already at 12 h and persists up to a lead time of 5 days, after which a progressive degradation sets in. As opposed to the upper-air variables (Z500 and V500), the wind velocity at 10 m (see Figs. 5c,f,i) is much less predictable, which is visible through the close proximity of *β* for the EPS to that of the persistence forecast for all spread–error metrics. Remarkably, V10m becomes more and more consistent as the lead time increases.

A general feature for all variables and spread–error metrics is the fact that *α* is generally larger than *β* is smaller than

Let us now focus specifically on the spread consistency of Z500. Strong deviations from the perfectly spread-consistent behavior still appear at 24 h when spread and error are related by

### c. Impact of ensemble size

Limited ensemble sizes are known to have a strong impact on spread–skill measures (Grimit and Mass 2007; Kolczynski et al. 2011). The spread–error relation is examined in Fig. 6 for different ensemble sizes. Each panel in Fig. 6 shows *α* and *β* as a function of lead time for the ECMWF EPS with 51 members (solid lines), 10 members (long-dashed lines), and 5 members (short-dashed lines). The influence of a change in ensemble size is independent of the spread–error metric used (Fig. 6, rows) and independent of the variable (Fig. 6, columns). More specifically a decrease of ensemble size induces a larger departure of *α* and *β* from their perfect values

### d. Comparison with a binning approach

As specified before, for perfectly reliable ensemble forecasts the error for a forecast of fixed ensemble spread may vary over a wide range of values proportional to the spread. Approaches that partially avoid the problem of heteroscedasticity are the categorization of forecasts depending on the state (Ziehmann 2001; Toth et al. 2001), or binning, and averaging over errors with comparable spread (Leutbecher 2009; Wang and Bishop 2003; Kolczynski et al. 2011; Grimit and Mass 2007).
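A minimal sketch of such a binning procedure on synthetic data (bin edges chosen as spread quantiles; this is an illustration, not the cited authors' exact setup):

```python
import numpy as np

def bin_errors_by_spread(spread, error, n_bins=10):
    """Bin forecasts by ensemble spread (quantile bin edges) and return,
    per bin, the mean spread, the mean error, and the error standard
    deviation, so that errors of comparable spread are averaged together."""
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    return np.array([
        (spread[idx == b].mean(), error[idx == b].mean(), error[idx == b].std())
        for b in range(n_bins)
    ])

# Synthetic spread-consistent data: both the binned mean error and its
# within-bin standard deviation grow with the binned spread.
rng = np.random.default_rng(3)
s = rng.uniform(0.5, 3.0, size=20000)
e = s * np.abs(rng.normal(size=20000))
table = bin_errors_by_spread(s, e)
```

The growth of the within-bin error standard deviation with the spread is exactly the heteroscedasticity discussed in the text.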

Figure 7 uses the approach of binning for 500-hPa geopotential height and we consider how such an approach can supplement and support our findings on the spread–error relationship. At a fixed lead time, the entire verification set

In Figs. 7a and 7d the intervals around the gray dots indicate the standard deviation of the error distribution within each bin. Even though the ensemble is not perfectly spread consistent, the standard deviation is still proportional to the spread and as argued before such heteroscedasticity can be overcome by a logarithmic transformation. Therefore, similar to the binning performed in Figs. 7a and 7d, the green dots in Figs. 7b and 7e show

Diagrams of binned error against binned spread are very informative and yield a good picture of the spread–error relationship at a fixed lead time. However, this relationship must be summarized in order to see its evolution as a function of lead time. A straightforward way to do this is to apply OLS regression of

The full red lines in Figs. 7b and 7e are obtained using OLS regression of all logarithmically transformed error and spread (not of the binned averages) and thereby establish a relationship of the form

An exponential transformation is performed on the green dots, the full red line, and the black dashed line in Figs. 7b and 7e, and the results are shown in Figs. 7a and 7d as the blue dots, the dashed red line, and the black dashed line, respectively. Note that the dotted line relates the spread and error based on the arithmetic mean

Conclusions drawn from binned data in Figs. 7b and 7e remain subject to doubt because the line corresponding to perfect reliability associated with the EM-ABS metric relies on Gaussian assumptions for the ensemble. A more rigorous approach is therefore to consider the EM-GEO or AV-GEO schemes for which the regression coefficients associated with a perfectly reliable system are

Figures 7c and 7f show the same results as Figs. 7b and 7e but for the EM-GEO spread–error metric. The strong heteroscedasticity is again removed by the logarithmic transformation. On the other hand, the dashed lines in Figs. 7c and 7f corresponding to perfect reliability are universal and independent of ensemble assumptions. Again, the same conclusions are drawn concerning underdispersion and overdispersion at the different lead times.

Based on Fig. 7, it is concluded that a binning of spread and error is useful for studying the spread–error relationship for fixed lead times. This approach has been extended in order to allow for a quantification of the spread–error relation for different lead times. Such quantification requires certain simple statistical relationships, for which the assertions can be checked using binning. In particular, the inference of the link between binned spread and error is feasible together with a comparison of the results at different lead times. However, such an inference must necessarily take into account the strong heteroscedasticity. One option is to apply a weighted linear regression to the binned data while OLS regression could also be applied to the unbinned but logarithmically transformed data. The latter should be preferred, in particular in cases of limited data availability.

Perfectly reliable systems satisfy certain relationships between averages of binned quantities and feature strong heteroscedasticity with the standard deviations of the binned errors being linear with respect to the spread. For cases of near-perfect spread consistency, as discussed in Fig. 7, the linear relationships are modified; however, the heteroscedasticity remains similar to a large extent. Therefore, the statistical inference on logarithmically transformed data remains justified.

## 6. Discussion

The fact that, for a perfectly reliable ensemble forecast, the observation is statistically indistinguishable from the ensemble members should be reflected in the relation between error and spread. A common way to verify the spread–error relation is by direct comparison of the *average* ensemble variance with the average error; such a comparison, however, does not probe the *flow-dependent* uncertainties. The coincidence of the averages of spread and error does not imply a good flow-dependent or point-by-point correspondence.

*n*.

For unskewed ensembles, the

The strengths and weaknesses of the spread–error correlation as a basis of spread–error assessment are explained in Whitaker and Loughe (1998), Grimit and Mass (2007), and Hopson (2014). These works also highlight the prerequisite of sufficient day-to-day variability of ensemble spread for good spread–error assessment. Several works also noted that the spread–error correlation is mostly determined by only a few cases featuring anomalous spread (Barker 1991; Houtekamer 1993; Whitaker and Loughe 1998; Grimit and Mass 2007). This issue can be understood in terms of heteroscedasticity and is analogous to the previous argument that errors should preferably be assessed relative to the ensemble spread. Usually the spread–error Pearson correlation is compared to the perfectly reliable model, which is then considered as the highest obtainable correlation—an interpretation that has recently been criticized by Kumar et al. (2014). Moreover, correlation values comparable to these maximal values seem difficult to obtain (Barker 1991; Buizza 1997; Houtekamer 1993; Grimit and Mass 2007; Van Schaeybroeck and Vannitsem 2011; Eckel et al. 2012). Despite the well-known weaknesses of the Pearson correlation, it is still often used.
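This limitation is easy to reproduce on synthetic data: in the following sketch (illustrative distributions) the system is perfectly spread consistent by construction, yet the Pearson correlation between spread and absolute error is far below one.

```python
import numpy as np

# Synthetic, perfectly spread-consistent system: the absolute error is
# the spread times |N(0, 1)| noise, so the spread fully determines the
# error distribution. The Pearson spread-error correlation is
# nevertheless well below one because the error scatters widely at any
# fixed value of the spread.
rng = np.random.default_rng(4)
spread = rng.uniform(0.5, 3.0, size=50_000)
abs_error = spread * np.abs(rng.normal(size=50_000))
r = np.corrcoef(spread, abs_error)[0, 1]
```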

Postprocessing techniques based on spread calibration improve reliability (Gneiting et al. 2005; Raftery et al. 2005; Wilks 2009; Roulin and Vannitsem 2012) and are therefore expected to improve the spread–skill scores as proposed in this work. Moreover, along the lines of Veenhuis (2013) and Van Schaeybroeck and Vannitsem (2015), new spread-calibration techniques could be designed that directly improve the new spread–error relations. In section 5a it was concluded that for the ECMWF EPS, no substantial information is lost when using the spread as the measure of uncertainty. It is, however, unclear at this stage if this conclusion would hold after spread calibration. Note also that many spread-calibration techniques only correct the ensemble mean and ensemble spread (Gneiting et al. 2005; Wilks 2009) and therefore no additional information is retained. This is different for “member by member” correction schemes for which each ensemble member is corrected separately and ensemble kurtosis and skewness are naturally preserved (Johnson and Bowler 2009; Van Schaeybroeck and Vannitsem 2015). Therefore, tests are in order to see whether, after member-by-member postprocessing, the full-ensemble uncertainty forecast of the ECMWF EPS improves upon the spread-based uncertainty forecast.

When model errors are present, systematic biases of both the ensemble mean and spread are observed. In particular, a systematic underdispersion at intermediate lead times is experienced [see Van Schaeybroeck and Vannitsem (2015)]. Ensemble calibration will transform the distribution, in general, by modifying the ensemble mean and/or spread. Moreover, these modifications are strongly dependent on the lead time and will obviously affect the verification of spread. This will be the subject of future investigation.

## 7. Conclusions

As most end users do not require all the uncertainty information contained in an ensemble, it is common to provide only a single uncertainty measure, the ensemble spread. Statistically justified models are proposed to assess the spread–error relation. The uncertainty forecast should be considered probabilistically rather than deterministically, the main reason being that both the mean and the standard deviation of the observed error are proportional to the spread. The verification of such forecasts can then be done by using proper scores like the logarithmic score or the CRPS. Because of the sole dependence of the uncertainty forecast on the spread, simple analytic expressions for these scores can be derived (see the appendix). A methodology is outlined that allows us to assess whether the uncertainty estimation based on spread only is worse or better than the full-ensemble estimation.

To verify whether spread is a good predictor for uncertainty, a regression of log-transformed spread and error is proposed. Such a transformation is justified as it overcomes the problem of heteroscedasticity. Two spread–error metrics are proposed [see Eq. (21)] for which the regression coefficients are independent of any hypothesis on the nature of the ensemble distribution. For other spread–error metrics the regression coefficients are also provided, and are derived based on the assumption of Gaussian-distributed ensembles. Note that, given spread as the only measure of uncertainty, the identification of perfect reliability is not possible and one can only identify perfect spread consistency, that is, behavior that is consistent with a perfectly reliable system. Binning was shown to provide additional insights into the approach and is recommended for the determination of ensemble overdispersion or underdispersion as a function of spread.

For the two upper-air variables (Z500 and V500) and one surface variable (V10m) of the ECMWF 51-member EPS, the uncertainty forecasts based on spread (spread-based models) do not entail a substantial loss of uncertainty information, at least not with respect to the CRPS and quantile scores. More specifically, for the conventional EM-RAW spread–error metric, the spread-based model I [Eq. (5)] even improves upon the full-ensemble uncertainty estimate. This implies that replacing each ensemble by a Gaussian distribution with the corresponding mean and variance improves the statistical scores. For other error metrics that involve the ensemble mean, model II [Eq. (8a)] also systematically improves upon the full-ensemble forecast. Moreover, for error metrics related to nontrivial ensemble averages, model III performs best, even though it still slightly degrades the full-ensemble uncertainty forecast. For small ensemble sizes, all spread-based models are found to improve upon the full ensemble.
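The full-ensemble side of such a comparison uses the standard empirical CRPS of a discrete ensemble (e.g., Hersbach 2000). The snippet below is an assumed generic implementation, not the code used in the study:

```python
def crps_ensemble(members, y):
    """Empirical CRPS of a discrete E-member ensemble for observation y:
    the mean absolute member error minus half the mean absolute
    pairwise distance between members."""
    e = len(members)
    term1 = sum(abs(x - y) for x in members) / e
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (2.0 * e * e)
    return term1 - term2

print(crps_ensemble([0.0, 1.0], 0.5))  # 0.25
```

Averaging this score over many forecast cases and comparing it with the analytic score of the spread-based surrogate quantifies how much uncertainty information is lost (or gained) by the reduction to spread alone.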

Concerning spread as a predictor of uncertainty, the best results are obtained for Z500. For lead times between 2 and 7 days the regression coefficients indicate that the ensemble is close to a perfectly spread-consistent ensemble.

The sensitivity of the new scores on the ensemble-mean state (Ziehmann 2001; Toth et al. 2001), spatial averages (Barker 1991), and the impact of systematic biases and extreme events (Whitaker and Loughe 1998) are important issues worth investigating in the future.

## Acknowledgments

This work has benefited from fruitful discussions with Martin Leutbecher, Olivier Talagrand, Hannah Christensen, Geert Smet, Alex Deckmyn, and Lesley De Cruz. Michael Scheuerer, Hannah Christensen, Pieter Smets, and three anonymous referees are thanked for their constructive comments on the manuscript. This work is partially supported by the Belgian Federal Science policy office under Contract SD/CA/04A.

## APPENDIX

### Analytic Derivation of Probabilistic Scores

#### a. Proposed models

The three probabilistic models for the uncertainty forecast introduced in section 2 are detailed here, and analytic expressions for the CRPS and the QS are derived for each model. In all three cases the uncertainty forecast depends on the spread only.

- The first model (model I) is well suited when the error distribution is approximately Gaussian. Its error metric is the error of the ensemble mean, its spread is the ensemble standard deviation, and the predictive density of the error is a Gaussian whose width is set by the spread.
- The second model (model II) is suitable when the error is bounded from below by zero (all metrics in Table 2 except EM-RAW).
- The third model (model III) is a variant of the latter.

#### b. Scores used for verification of the spread-based forecast models

In the expressions below, *q* is the quantile level as used in Eq. (16b), *ϕ* is the standard normal PDF, and *Φ* is the corresponding CDF.

Note that for the EM-RAW spread–error metric, the CRPS of the full-ensemble uncertainty forecast is identical to the CRPS of the full-ensemble forecast *F*.

## REFERENCES

Abramowitz, M., and I. A. Stegun, 1972: *Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables*. 9th ed. Dover, 1047 pp.

Atger, F., 1999: The skill of ensemble prediction systems. *Mon. Wea. Rev.*, **127**, 1941–1953, doi:10.1175/1520-0493(1999)127<1941:TSOEPS>2.0.CO;2.

Barker, T. W., 1991: The relationship between spread and forecast error in extended-range forecasts. *J. Climate*, **4**, 733–742, doi:10.1175/1520-0442(1991)004<0733:TRBSAF>2.0.CO;2.

Benedetti, R., 2010: Scoring rules for forecast verification. *Mon. Wea. Rev.*, **138**, 203–211, doi:10.1175/2009MWR2945.1.

Bentzien, S., and P. Friederichs, 2014: Decomposition and graphical portrayal of the quantile score. *Quart. J. Roy. Meteor. Soc.*, **140**, 1924–1934, doi:10.1002/qj.2284.

Box, G. E. P., J. S. Hunter, and W. G. Hunter, 2005: *Statistics for Experimenters: Design, Innovation, and Discovery*. 2nd ed. Wiley-Interscience, 664 pp.

Buizza, R., 1997: Potential forecast skill of ensemble prediction and spread and skill distributions of the ECMWF Ensemble Prediction System. *Mon. Wea. Rev.*, **125**, 99–119, doi:10.1175/1520-0493(1997)125<0099:PFSOEP>2.0.CO;2.

Buizza, R., M. Leutbecher, L. Isaksen, and J. Haseler, 2010: Combined use of EDA- and SV-based perturbations in the EPS. *ECMWF Newsletter*, No. 123, ECMWF, Reading, United Kingdom, 22–28.

Candille, G., C. Côté, P. L. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. *Mon. Wea. Rev.*, **135**, 2688–2699, doi:10.1175/MWR3414.1.

Chatterjee, S., and A. S. Hadi, 2006: *Regression Analysis by Example*. 4th ed. Wiley Series in Probability and Statistics, J. Wiley and Sons, 375 pp.

Christensen, H. M., I. M. Moroz, and T. N. Palmer, 2014: Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. *Quart. J. Roy. Meteor. Soc.*, **141**, 538–549, doi:10.1002/qj.2375.

Draper, N. R., and H. Smith, 1998: *Applied Regression Analysis*. J. Wiley and Sons, 706 pp.

Eckel, F. A., M. S. Allen, and M. C. Sittel, 2012: Estimation of ambiguity in ensemble forecasts. *Wea. Forecasting*, **27**, 50–69, doi:10.1175/WAF-D-11-00015.1.

Ehrendorfer, M., 1997: Predicting the uncertainty of numerical weather forecasts: A review. *Meteor. Z.*, **6**, 147–183.

Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. *J. Amer. Stat. Assoc.*, **102**, 359–378, doi:10.1198/016214506000001437.

Gneiting, T., A. E. Raftery, A. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, doi:10.1175/MWR2904.1.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc.*, **69B**, 243–268, doi:10.1111/j.1467-9868.2007.00587.x.

Grimit, E. P., and C. F. Mass, 2007: Measuring the ensemble spread–error relationship with a probabilistic approach: Stochastic ensemble results. *Mon. Wea. Rev.*, **135**, 203–221, doi:10.1175/MWR3262.1.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Hopson, T. M., 2014: Assessing the ensemble spread–error relationship. *Mon. Wea. Rev.*, **142**, 1125–1142, doi:10.1175/MWR-D-12-00111.1.

Houtekamer, P. L., 1993: Global and local skill forecasts. *Mon. Wea. Rev.*, **121**, 1834–1846, doi:10.1175/1520-0493(1993)121<1834:GALSF>2.0.CO;2.

Johnson, C., and N. Bowler, 2009: On the reliability and calibration of ensemble forecasts. *Mon. Wea. Rev.*, **137**, 1717, doi:10.1175/2009MWR2715.1.

Kolczynski, W. C., D. R. Stauffer, S. E. Haupt, N. S. Altman, and A. Deng, 2011: Investigation of ensemble variance as a measure of true forecast variance. *Mon. Wea. Rev.*, **139**, 3954–3963, doi:10.1175/MWR-D-10-05081.1.

Kumar, A., P. Peitao, and C. Mingyue, 2014: Is there a relationship between potential and actual skill? *Mon. Wea. Rev.*, **142**, 2220–2227, doi:10.1175/MWR-D-13-00287.1.

Leutbecher, M., 2009: Diagnosis of ensemble forecasting systems. *Seminar on Diagnosis of Forecasting and Data Assimilation Systems*, ECMWF, Reading, United Kingdom, 235–266.

Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. *J. Comput. Phys.*, **227**, 3515–3539, doi:10.1016/j.jcp.2007.02.014.

Lorenz, E. N., 1996: Predictability—A problem partly solved. *Proc. Seminar on Predictability*, Vol. 1, ECMWF, Reading, United Kingdom, 1–18.

Nicolis, C., R. A. P. Perdigao, and S. Vannitsem, 2009: Dynamics of prediction errors under the combined effect of initial condition and model errors. *J. Atmos. Sci.*, **66**, 766–778, doi:10.1175/2008JAS2781.1.

Raftery, A. E., F. Balabdaoui, T. Gneiting, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174, doi:10.1175/MWR2906.1.

Roulin, E., and S. Vannitsem, 2012: Postprocessing of ensemble precipitation predictions with extended logistic regression based on hindcasts. *Mon. Wea. Rev.*, **140**, 874–888, doi:10.1175/MWR-D-11-00062.1.

Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. *Mon. Wea. Rev.*, **130**, 1653–1660, doi:10.1175/1520-0493(2002)130<1653:EPFUIT>2.0.CO;2.

Scherrer, S. C., C. Appenzeller, P. Eckert, and D. Cattani, 2004: Analysis of the spread–skill relations using the ECMWF Ensemble Prediction System over Europe. *Wea. Forecasting*, **19**, 552–565, doi:10.1175/1520-0434(2004)019<0552:AOTSRU>2.0.CO;2.

Toth, Z., Y. Zhu, and T. Marchok, 2001: The use of ensembles to identify forecasts with small and large uncertainty. *Wea. Forecasting*, **16**, 463–477, doi:10.1175/1520-0434(2001)016<0463:TUOETI>2.0.CO;2.

Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. *Forecast Verification: A Practitioner's Guide in Atmospheric Science*, I. Jolliffe and D. B. Stephenson, Eds., J. Wiley and Sons, 137–163.

Van Schaeybroeck, B., and S. Vannitsem, 2011: Post-processing through linear regression. *Nonlinear Processes Geophys.*, **18**, 147–160, doi:10.5194/npg-18-147-2011.

Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. *Quart. J. Roy. Meteor. Soc.*, **141**, 807–818, doi:10.1002/qj.2397.

Veenhuis, B. A., 2013: Spread calibration of ensemble MOS forecasts. *Mon. Wea. Rev.*, **141**, 2467–2482, doi:10.1175/MWR-D-12-00191.1.

Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158, doi:10.1175/1520-0469(2003)060<1140:ACOBAE>2.0.CO;2.

Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. *Mon. Wea. Rev.*, **126**, 3292–3302, doi:10.1175/1520-0493(1998)126<3292:TRBESA>2.0.CO;2.

Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. *Meteor. Appl.*, **16**, 361–368, doi:10.1002/met.134.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences*. 3rd ed. Elsevier, 676 pp.

Yeo, I.-K., and R. A. Johnson, 2000: A new family of power transformations to improve normality or symmetry. *Biometrika*, **87**, 954–959, doi:10.1093/biomet/87.4.954.

Ziehmann, C., 2001: Skill prediction of local weather forecasts based on the ECMWF ensemble. *Nonlinear Processes Geophys.*, **8**, 419–428, doi:10.5194/npg-8-419-2001.

^{1} For an ensemble of discrete size *E*, the AV-GEO and EM-GEO spreads and errors can be reduced to a product over the *E* ensemble members.