Decomposition of a New Proper Score for Verification of Ensemble Forecasts

H. M. Christensen, Atmospheric, Oceanic and Planetary Physics, University of Oxford, Oxford, United Kingdom


Abstract

A new proper score, the error-spread score (ES), has recently been proposed for evaluation of ensemble forecasts of continuous variables. The ES is formulated with respect to the moments of the ensemble forecast. It is particularly sensitive to evaluating how well an ensemble forecast represents uncertainty: is the probabilistic forecast well calibrated? In this paper, it is shown that the ES can be decomposed into its reliability, resolution, and uncertainty components in a similar way to the Brier score. The first term evaluates the reliability of the forecast standard deviation and skewness, rewarding systems where the forecast moments reliably indicate the properties of the verification. The second term evaluates the resolution of the forecast standard deviation and skewness, and rewards systems where the forecast moments vary from the climatological moments according to the predictability of the atmospheric flow. The uncertainty term depends only on the observed error distribution and is independent of the forecast standard deviation or skewness. The decomposition was demonstrated using forecasts made with the European Centre for Medium-Range Weather Forecasts ensemble prediction system, and was able to identify the source of the skill in the forecasts at different latitudes.

Corresponding author address: H. M. Christensen, Atmospheric, Oceanic and Planetary Physics, Clarendon Laboratory, Parks Road, Oxford, OX1 3PU, United Kingdom. E-mail: h.m.christensen@atm.ox.ac.uk


1. Introduction

A problem common to many scientific disciplines is how to forecast an event in a system given some prior knowledge of the system's past behavior together with a theoretical model of the system. A secondary problem is then how to verify the accuracy of that forecast once the event has taken place.

Scoring rules provide a framework for forecast verification. They quantitatively score the success of the forecast based on the forecast probability distribution and the actual outcome. Scoring rules are particularly useful in the field of meteorology, as they allow for the unambiguous comparison of different weather forecasts. This allows the strengths and weaknesses of forecasts from different centers, or for different regions of the world, or for different times of year, to be analyzed. Scoring rules are often considered to be a reward that should be maximized. It is important to carefully design scoring rules to encourage honesty from the forecaster: for example, they must not encourage the forecaster to hedge his or her bets by being deliberately vague, or equally to back a certain option instead of issuing the full probabilistic forecast. Such a score is a proper score, and its expectation is optimized if the “true” probability distribution is predicted. The expectation of a strictly proper score is optimized if and only if the true distribution is predicted.

Proper scores are sensitive to two properties of a forecast. The first is reliability. This is a measure of the statistical consistency of a forecast, and tests whether the verification behaves like a random sample from the issued forecast probability density function (PDF). The second property is resolution. This is a measure of how case dependent the probabilistic forecast is—is the forecasting system able to distinguish between different outcomes of the system, effectively sorting the observations into separate groups? It is important that a forecast has resolution, but is also reliable. Bröcker (2009) showed that all strictly proper scores can be explicitly decomposed into a reliability and resolution component. The third term of the decomposition is uncertainty, which is independent of the forecasting system and depends only on the statistics of the observations.

It is useful to decompose a score into its constituent components, reliability and resolution, as it gives insight into the source of skill of a forecast. It allows the user to identify the strengths of one forecasting system over another. Importantly, it indicates the characteristics of the forecast that require improvement, providing focus for future research efforts. Many of the existing scoring rules have been decomposed into their constituent components. The Brier score (Brier 1950) has been decomposed in several ways (e.g., Sanders 1963; Murphy 1973, 1986; Young 2010; Ferro and Fricker 2012). Similarly, the continuous ranked probability score can be decomposed into two parts: scoring reliability and resolution/uncertainty (Hersbach 2000). Tödter and Ahrens (2012) show that a generalization of the ignorance score can also be decomposed into reliability, resolution, and uncertainty components. Weijs et al. (2010) present a new “divergence score,” closely related to the ignorance score, as well as its three-component decomposition. This decomposition can also be generalized to account for uncertainty in the observation (Weijs and van de Giesen 2011). In each of these cases, the decomposition allows the source of skill in a forecast to be identified.

A new proper score, the error-spread score (ES), has been proposed for evaluation of ensemble forecasts (Christensen et al. 2015). It is formulated purely with respect to moments of the ensemble forecast distribution, instead of using the full distribution itself. This means that the verifier does not need to estimate or store the full forecast PDF,1 though this simplification will necessarily result in some loss of information.2 Nevertheless, the usefulness of scores that depend only on forecast moments has been recognized by other authors [e.g., Eq. (27) in Gneiting and Raftery (2007)]. The ES is suitable for evaluation of continuous forecasts, and does not require the discretization of the forecast using bins, as is the case for the categorical Brier and ranked probability scores. The score is designed to evaluate how well a forecast represents uncertainty: is the forecast able to distinguish between cases where the atmospheric flow is very predictable from those where the flow is unpredictable? A well-calibrated probabilistic forecast that represents uncertainty is essential for decision making, and therefore has high value to the user of the forecast. The ES is particularly sensitive to testing this requirement.

It is desirable to be able to decompose the ES into its constituent components as has been carried out for the Brier and continuous ranked probability scores. This paper shows how this can be achieved. The score can be decomposed into a reliability, resolution, and uncertainty component. The reliability component has two terms: the first evaluates the reliability of the forecast spread while the second evaluates the reliability of the forecast skewness. Similarly, the resolution component evaluates the resolution of the forecast spread and skewness, respectively. The third term is the uncertainty term, which depends only on the measured climatological error distribution. There is also a fourth term, which is an explicit function of the bias in the forecast. In section 2 the ES is introduced, and in section 3 the decomposition for the ES is presented. To illustrate the use of the decomposition, it is evaluated for forecasts made using the operational Ensemble Prediction System (EPS) at the European Centre for Medium-Range Weather Forecasts (ECMWF) in section 4. Some conclusions are drawn in section 5.

2. The error-spread score

A new score, the error-spread score, which is particularly suitable for the evaluation of ensemble forecasts, has been proposed by Christensen et al. (2015). It is formulated using the moments of the ensemble forecast, so it does not require an estimate of the full forecast PDF.

The ES is defined as

\mathrm{ES} = \frac{1}{n} \sum_{k=1}^{n} \left( s_k^2 - e_k^2 - e_k s_k g_k \right)^2 ,   (1)

where s_k^2 is the forecast variance and g_k is the forecast skewness of the kth forecast. The variable e_k is the error in the ensemble mean:

e_k = m_k - z_k ,   (2)

where z_k is the verification and m_k is the ensemble mean. The score is averaged over the n forecast–verification pairs. The moments of the ensemble forecast (m_k, s_k^2, g_k) are defined in the usual way (see appendix B): for F ensemble forecast members x_{k,f}, f = 1, \ldots, F,

m_k = \frac{1}{F} \sum_{f=1}^{F} x_{k,f} ,   (3)

s_k^2 = \frac{1}{F-1} \sum_{f=1}^{F} \left( x_{k,f} - m_k \right)^2 ,   (4)

g_k = \frac{F}{(F-1)(F-2)} \sum_{f=1}^{F} \left( \frac{x_{k,f} - m_k}{s_k} \right)^3 ,   (5)

where the unbiased estimates of variance and skewness have been used.
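
As an illustration, a minimal NumPy sketch of these moment estimates for a set of ensemble forecasts is given below. The function name is an illustrative choice, and the adjusted Fisher–Pearson form is assumed here for the unbiased skewness estimator of Eq. (5).

```python
import numpy as np

def ensemble_moments(ens):
    """Mean, unbiased variance, and unbiased skewness of each ensemble forecast.

    ens : array of shape (n_pairs, F), one row of F members per forecast.
    Returns (m, s2, g) following Eqs. (3)-(5): the 1/(F - 1) variance estimator
    and the adjusted Fisher-Pearson skewness estimator (assumed here).
    """
    ens = np.asarray(ens, dtype=float)
    F = ens.shape[1]
    m = ens.mean(axis=1)                      # Eq. (3): ensemble mean
    s2 = ens.var(axis=1, ddof=1)              # Eq. (4): unbiased variance
    s = np.sqrt(s2)
    g = F / ((F - 1) * (F - 2)) * np.sum(((ens - m[:, None]) / s[:, None]) ** 3, axis=1)
    return m, s2, g
```
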
The error-spread score was derived through consideration of the spread–error relationship (Leutbecher and Palmer 2008; Leutbecher 2010). For a statistically consistent ensemble, it is expected that the spread of the ensemble should represent the uncertainty in the forecast, and therefore should indicate the expected error in the ensemble mean:
\overline{s^2} = \overline{e^2} ,   (6)
where the overbar indicates that the ensemble variance and squared error have been averaged over many forecast–verification pairs. The third term in the error-spread score depends on the forecast skewness, and acknowledges that information is contained in the higher moments of the forecast PDF. For example, the skewness could indicate if the error in the ensemble mean is more likely to be positive or negative. The form of the error-spread score is derived and explained in more detail in Christensen et al. (2015).

To be useful, a scoring rule must be proper. The ES cannot be a strictly proper score as it is a function of the moments of the forecast PDF, not the full PDF itself. This results in an equal score for two different PDFs with the same moments. Nevertheless, the ES is a proper score, and its expectation is minimized if the forecast mean, spread, and skewness match the moments of the true PDF. The ES is proved to be a proper score in Christensen et al. (2015).

The ES has its lowest instantaneous value for the perfect deterministic forecast (a delta function centered on the verification), as well as for a subset of well-calibrated probabilistic forecasts with

e_k = \frac{s_k}{2} \left( -g_k \pm \sqrt{g_k^2 + 4} \right) .   (7)

Note that if the forecast is not skewed (g_k = 0), the instantaneous lowest value is for an error equal to \pm s_k.
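
A sketch of the score itself, written against the form of Eqs. (1) and (2) above, follows; the function signature and the toy data are illustrative only.

```python
import numpy as np

def error_spread_score(s2, g, e):
    """Error-spread score averaged over forecast-verification pairs.

    s2, g : forecast variance and skewness for each pair (e.g., from ensemble_moments)
    e     : error in the ensemble mean for each pair, e_k = m_k - z_k  [Eq. (2)]
    Returns the mean of (s_k^2 - e_k^2 - e_k s_k g_k)^2 over all pairs  [Eq. (1)].
    """
    s2, g, e = map(np.asarray, (s2, g, e))
    s = np.sqrt(s2)
    return np.mean((s2 - e ** 2 - e * s * g) ** 2)

# Toy check: a statistically consistent Gaussian ensemble gives a small score
rng = np.random.default_rng(0)
ens = rng.normal(size=(5000, 50))      # 50-member forecasts of an N(0, 1) variable
obs = rng.normal(size=5000)            # verifications drawn from the same PDF
m, s2 = ens.mean(axis=1), ens.var(axis=1, ddof=1)
g = np.zeros(5000)                     # skewness neglected in this symmetric toy case
print(error_spread_score(s2, g, m - obs))
```
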

3. Decomposition of the error-spread score

The error-spread score, as a proper score, evaluates both reliability and resolution (Bröcker 2009). In this section, the decomposition into these components is presented.

In this decomposition, the focus is on assessing the spread and shape of the ensemble forecast: does the forecast probability distribution correctly indicate the uncertainty in the forecast? To investigate this, we consider whether the observed error distribution is consistent with the forecast PDF, after the forecast PDF has been centered on zero. To test for this consistency, ideally one would have access to many identical forecasts for different verification events; it would then be possible to evaluate whether the verification statistics follow the forecast PDF. This is not possible in practice since forecasts vary significantly from day to day. As an alternative, we assume that the predicted spread can only take I discrete values s_i, i = 1, \ldots, I. Here k is the forecast–verification pair index, k = 1, \ldots, n, where n is the total number of forecast–verification pairs. Similarly, it is assumed that the predicted skewness can only take J discrete values g_j, j = 1, \ldots, J. The measured errors e_k are then binned according to the predicted spread s_i and the predicted skewness g_j. In practice, the forecast–verification pairs are first sorted according to the forecast spread, and split into I equally populated bins. Within each bin, the pairs are sorted according to forecast skewness, and split into J equally populated bins. The average spread and skew within each bin are assumed to be representative of the bin. The forecast moments within one bin are sufficiently similar to one another that we can statistically evaluate whether the error distribution follows the forecast distribution for that bin.
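
A minimal sketch of this two-stage binning (sort by spread into I equally populated bins, then by skewness into J bins within each spread bin) is given below; the function name and the use of np.array_split are illustrative choices, not the paper's code.

```python
import numpy as np

def bin_by_spread_and_skew(s, g, I=10, J=10):
    """Indices of forecast-verification pairs for each (spread, skewness) bin.

    s, g : 1D arrays of forecast standard deviation and skewness, one per pair.
    Pairs are first split into I equally populated spread bins; each spread bin
    is then split into J equally populated skewness bins, giving I * J bins.
    """
    s, g = np.asarray(s), np.asarray(g)
    bins = []
    for spread_bin in np.array_split(np.argsort(s), I):
        order_within = spread_bin[np.argsort(g[spread_bin])]
        bins.extend(np.array_split(order_within, J))
    return bins

# Example: 2000 synthetic pairs split into 10 x 10 bins of ~20 members each
rng = np.random.default_rng(1)
bins = bin_by_spread_and_skew(rng.gamma(2.0, size=2000), rng.normal(size=2000))
print(len(bins), [len(b) for b in bins[:3]])
```
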

The technique used to perform the decomposition is similar to that used for the decomposition of the Brier score (Murphy 1973), and is shown in full in appendix A. First, define
n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} ,   (8)

\overline{e^2}_{ij} = \frac{1}{n_{ij}} \sum_{k \in (i,j)} e_k^2 ,   (9)

\overline{e^2} = \frac{1}{n} \sum_{k=1}^{n} e_k^2 ,   (10)
where n_{ij} is the number of forecast–verification pairs in bin (i, j). Here \overline{e^2}_{ij} is the average squared error in each bin and \overline{e^2} is the climatological squared error; both represent sample estimates of the expected values of these errors.
Similarly, define the measured shape factor T_{ij},
e11
which is approximately equal to the third moment of the error distribution in each bin, estimated using a finite sample size. It can be shown (see appendix A) that if the forecast standard deviation and skewness are accurate (i.e., are equal to those of the true distribution), the measured shape factor should obey
T_{ij} \approx -s_i^3 g_j .   (12)
Finally, define the climatological shape factor \overline{T}:
e13
Appendix A shows that the ES can be decomposed into a sum of reliability [terms (i) and (ii)], resolution [terms (iii) and (iv)], uncertainty [term (v)], and bias [term (vi)] contributions:
e14

The first term evaluates the reliability of the forecast. This has two components that test (i) the reliability of the ensemble spread and (ii) the reliability of the ensemble shape, respectively. Term (i) is the squared difference between the forecast variance s_i^2 and the observed mean square error \overline{e^2}_{ij} for that forecast variance. For a reliable forecast, these terms should be equal (Leutbecher and Palmer 2008; Leutbecher 2010). The smaller term (i) is, the more reliable the forecast spread. Term (ii) is the standardized squared difference between the measured shape factor T_{ij} and the expression the shape factor takes if the ensemble spread and skew are accurate, -s_i^3 g_j, following Eq. (12). If the forecast skewness, or "shape," of the probability distribution is a good indicator of the skewed uncertainty in the forecast distribution, this term will be small. For both terms (i) and (ii), the sum is weighted by the number of forecast–verification cases in each bin, n_{ij}.

The second term evaluates the resolution of the forecast. This also has two components, testing (iii) the resolution of the predicted spread and (iv) the resolution of the predicted shape. Both terms evaluate how well the forecasting system is able to distinguish between situations with different forecast uncertainty characteristics. Term (iii) is the squared difference between the mean square error in each bin and the climatological mean squared error. If the forecast has high resolution in the error variance, the forecast should separate predictions into cases with low uncertainty (low mean square error), and those with high uncertainty (high mean square error), resulting in a large value for term (iii). If the forecast PDF does not indicate the expected error in the forecast, term (iii) will be small as all binned mean squared errors will be close to the climatological value. Therefore, a large absolute value of term (iii) indicates high resolution in the predicted error variance. This is subtracted when calculating the error-spread score, contributing to the low value of ES for a skillful forecast. Similarly, term (iv) indicates the resolution of the skewness (shape) of the error distribution, evaluating the squared difference between the binned and climatological shape factors. If this term is large, the forecast has successfully distinguished between situations with different degrees of skewness in the forecast uncertainty: it has high shape resolution. Again, for both terms (iii) and (iv), the sum is weighted by the number of forecast–verification pairs in each bin.
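
To make terms (i) and (iii) concrete, the following sketch computes the spread components of the reliability and resolution terms from binned errors, following the verbal definitions above. The shape and bias terms of Eq. (14) are omitted, and the n_ij/n weighting and normalization used here are assumptions for illustration.

```python
import numpy as np

def spread_reliability_resolution(e, s2, bins):
    """Spread components of the ES reliability and resolution terms.

    e    : 1D array of errors in the ensemble mean (m - z), one per pair
    s2   : 1D array of forecast variances, one per pair
    bins : list of index arrays from the spread/skew binning
    Term (i)   ~ sum_ij (n_ij / n) * (s2_i - mean(e^2)_ij)^2           (reliability of spread)
    Term (iii) ~ sum_ij (n_ij / n) * (mean(e^2)_ij - mean(e^2)_clim)^2 (resolution of spread)
    Only the spread parts of Eq. (14) are computed; shape and bias terms are omitted.
    """
    e, s2 = np.asarray(e), np.asarray(s2)
    n = e.size
    e2_clim = np.mean(e ** 2)                # climatological mean squared error
    rel = res = 0.0
    for idx in bins:
        w = idx.size / n                     # weight by bin population, n_ij / n (assumed)
        e2_bin = np.mean(e[idx] ** 2)        # average squared error in the bin
        s2_bin = np.mean(s2[idx])            # representative forecast variance for the bin
        rel += w * (s2_bin - e2_bin) ** 2    # spread part of the reliability term (i)
        res += w * (e2_bin - e2_clim) ** 2   # spread part of the resolution term (iii)
    return rel, res

# Usage with synthetic, spread-consistent data, binned by forecast spread only
rng = np.random.default_rng(3)
s2 = rng.gamma(2.0, size=5000)
e = rng.normal(scale=np.sqrt(s2))            # errors drawn consistently with the spread
bins = np.array_split(np.argsort(s2), 10)
print(spread_reliability_resolution(e, s2, bins))
```
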

The third term, (v), is the uncertainty in the forecast, which is not a function of the binning process. It depends only on the measured climatological error distribution, compared to the individual measurements. Nevertheless, unlike for the Brier score decomposition, this term is not independent of the forecast system, and instead provides information about the error characteristics of the forecast system. For example, for a system with large variability in the forecast error, the first term in (v) will be large, whereas that term is small if the forecast consistently has errors of a similar magnitude.

The last term, (vi), is the bias in the forecast. This is a function of the binning process, and is dependent on how biases in the forecasting system depend on the spread and skew of the forecast. If this term is large, it indicates that the forecasting system is systematically biased under certain forecasting conditions. However, cancellation of positive and negative average errors means that a small value for (vi) does not rule out the possibility of conditional biases in the system.

4. Evaluation of forecasts from the ECMWF Ensemble Prediction System

The decomposition of the ES was tested using operational 10-day forecasts made using the ECMWF EPS. The EPS uses a spectral atmosphere model with a horizontal triangular truncation of T639,3 and 62 vertical levels, and persisted sea surface temperature anomalies instead of a dynamical ocean. The EPS uses a 50-member ensemble. Initial condition uncertainty is sampled using an ensemble of data assimilations (EDA; Isaksen et al. 2010), combined with perturbations from the leading singular vectors. When using the ES, it is important that the number of ensemble members is large enough to reliably estimate the ensemble forecast moments. In Christensen et al. (2015), it was shown that the expected value of the ES converges for ensembles with 40 or more members. The 50-member EPS ensemble used here is therefore large enough for this purpose.

The EPS uses stochastic parameterization schemes to represent model uncertainty in the forecast. This uncertainty stems from the finite resolution of the forecasting model, which results in simplifications and approximations in the representation of small-scale processes. The EPS uses two stochastic schemes. The first is stochastically perturbed parameterization tendencies (SPPT; Palmer et al. 2009), which addresses uncertainty due to these unresolved small-scale processes. In SPPT, the deterministic parameterized tendencies in temperature, humidity, and horizontal wind are perturbed by a spatially and temporally correlated random number with unit mean. The second scheme is the stochastic kinetic energy backscatter (SKEB) scheme (Berner et al. 2009). This uses random streamfunction perturbations to represent upscale kinetic energy transfer, which is otherwise missing from the model.

Ten-day forecasts are considered, initialized from 30 dates between 14 April and 15 September in each of 2010, 2011, and 2012, and from 10 dates in the same period in 2009: this large sample of 100 forecast–verification pairs at each grid point is required since the forecasts are binned in two dimensions.4 The high-resolution 4DVar analyses (T1279, 16 km) are used for verification. Forecast and verification fields are truncated to T159 (125 km) before verification. Forecasts of temperature at 850 hPa (approximately 1.5 km above ground level) are considered.

To perform the decomposition, the forecast–verification pairs are sorted into 10 bins of equal population according to the forecast standard deviation. Within each of these bins, the forecasts are sorted into 10 further bins of equal population according to their skewness. To increase the sample size, for each latitude–longitude point the forecast–verification pairs within a radius of 285 km are used in the binning process, which has the additional effect of spatially smoothing the calculated scores. The number of data points within a bin varies slightly depending on latitude, but is approximately 20. The average standard deviation and skewness are calculated for each bin, as are the average error characteristics required by Eq. (14).

For comparison with the EPS, a perfect static probabilistic forecast is generated in an analogous way to the idealized hypothetical forecasts in Leutbecher (2010). The mean of the EPS forecast is calculated and used as a deterministic forecast. The error between this deterministic forecast and the 4DVar analysis is computed for each 10-day forecast, and an error PDF is constructed from these errors as a function of latitude. The deterministic forecast is dressed with 50 errors randomly sampled from this latitudinally dependent distribution. This dressed deterministic (DD) ensemble is a "perfect static" forecast: its error distribution is correct by construction when averaged over all start dates for a latitudinal band. However, it does not vary from day to day, or from longitude to longitude, with the predictability of the atmospheric flow. The decomposition of the ES should distinguish between this perfect static forecast and the dynamic probabilistic forecasts made using the EPS, and identify in what way the dynamic probabilistic forecast improves over the perfect static case.
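
A sketch of how such a dressed deterministic ensemble might be constructed is given below. The latitude-band error climatology, the array shapes, and the function name are illustrative assumptions rather than the operational procedure.

```python
import numpy as np

def dressed_deterministic(det_fc, analysis, lat_band, n_members=50, rng=None):
    """Build a 'perfect static' dressed deterministic (DD) ensemble.

    det_fc   : array (n_dates, n_points) of deterministic forecasts (here, EPS ensemble means)
    analysis : array (n_dates, n_points) of verifying analyses
    lat_band : array (n_points,) of integer latitude-band labels
    For each latitude band, the pooled forecast errors over all dates form an empirical
    error distribution; each deterministic forecast is dressed with n_members errors
    drawn from its band's distribution, so the DD spread is static in time and longitude.
    """
    rng = np.random.default_rng() if rng is None else rng
    det_fc, analysis = np.asarray(det_fc), np.asarray(analysis)
    err = det_fc - analysis
    dd = np.empty(det_fc.shape + (n_members,))
    for band in np.unique(lat_band):
        mask = lat_band == band
        pool = err[:, mask].ravel()                              # error climatology for this band
        draws = rng.choice(pool, size=(err.shape[0], mask.sum(), n_members))
        dd[:, mask, :] = det_fc[:, mask, None] + draws
    return dd
```
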

Figure 1a shows the forecasting skill of the EPS evaluated using the ES. The lower the value of the score, the better the forecast. A strong latitudinal dependence in the value of the score is observed, with better scores found at low latitudes. This can be attributed largely to the climatological variability, which is strongly latitudinally dependent: at high latitudes the variability is greater and the mean-square error larger, so the ES is larger. This is explained in more detail in Christensen et al. (2015). Figure 1b shows the forecasting skill of the EPS evaluated using the ES, where the ES has been calculated by summing the components of the decomposition described in Eq. (14). The results are similar to those obtained from the raw ES, confirming that the decomposition is valid. The small observed differences can be attributed to two causes. First, the decomposition assumes that spread and skew are discrete variables constant within a bin, which is not strictly true. Second, the decomposition uses neighboring forecast–verification pairs to increase the sample size for the binning process, which is not necessary when the ES is evaluated using Eq. (1).

Fig. 1. Forecasting skill of the EPS evaluated using the error-spread score. The 10-day forecasts of T850 are compared with the 4DVar analysis and averaged over 100 dates sampled from April to September 2009–12. (a) The score calculated in the standard way using Eq. (1). (b) The score calculated using the decomposition in Eq. (14). (c) The error-spread skill score for the EPS forecast with respect to the DD forecast, calculated in the standard way. In (a) and (b), the score is plotted on a logarithmic scale: a contour level of n indicates a score of 10^n.

Figure 1c shows the error-spread skill score (ESS) calculated for the EPS with reference to the DD forecasts:
\mathrm{ESS} = 1 - \frac{\mathrm{ES}_{\mathrm{EPS}}}{\mathrm{ES}_{\mathrm{DD}}} .   (15)
A positive value of the skill score indicates an improvement over the DD forecast whereas a negative value indicates the DD forecast was more skillful. The results overwhelmingly indicate the EPS is more skillful than the DD, with positive scores over much of the globe. The highest skill is found in two bands north and south of the equator in the western Pacific Ocean, while there are some small regions with negative skill over equatorial land regions and over the equatorial east Pacific.

To investigate the source of skill in the EPS compared to the DD forecast, the decomposition of the ES was calculated for both sets of forecasts. Figure 2 shows the reliability, resolution, uncertainty, and bias terms calculated for the EPS (left-hand column) and DD (middle column) forecasts. Visually, the plots in the two columns look similar. Comparing Figs. 2a and 2b indicates that the reliability term tends to be smaller for the EPS across much of the tropics, and comparing Figs. 2d and 2e shows that the resolution term tends to be smaller for the DD. The uncertainty term, shown in Figs. 2g and 2h, is similar for the EPS and DD. In Fig. 2b, the method of construction of the DD forecast results in a strong horizontal banding across much of the equatorial Pacific Ocean. The standard deviation of the DD forecast is constant as a function of longitude, and the error characteristics are similar, so the reliability term is approximately constant.

Fig. 2. Source of forecasting skill evaluated using the ESS, comparing the EPS and DD forecasts. See text for more details. The reliability component of (a) the EPS forecasts and (b) the DD forecasts. (c) The reliability skill score (RELSS): positive values indicate the EPS has better reliability than the DD forecast. The resolution component of (d) the EPS forecasts and (e) the DD forecasts. (f) The resolution skill score (RESSS): negative values indicate the EPS has better resolution than the DD forecast. The uncertainty component of (g) the EPS forecasts and (h) the DD forecasts. (i) The uncertainty skill score (UNCSS): positive values indicate the EPS has lower (better) uncertainty than the DD forecast. The absolute value of the bias component of (j) the EPS forecasts and (k) the DD forecasts. The color bar in (a) also corresponds to (b),(d),(e),(g),(h),(j), and (k). The color bar in (c) also corresponds to (f) and (i). In (a),(b),(d),(e),(g),(h),(j), and (k), the components of the score are plotted on a logarithmic scale: a contour level of n indicates a score of 10^n.

To ease comparison, Figs. 2c, 2f, and 2i show the skill score calculated for the EPS forecasts with reference to the DD forecast for the reliability, resolution, and uncertainty components of the ES (RELSS, RESSS, and UNCSS), respectively:
\mathrm{RELSS} = 1 - \frac{\mathrm{REL}_{\mathrm{EPS}}}{\mathrm{REL}_{\mathrm{DD}}} ,   (16)

\mathrm{RESSS} = 1 - \frac{\mathrm{RES}_{\mathrm{EPS}}}{\mathrm{RES}_{\mathrm{DD}}} ,   (17)

\mathrm{UNCSS} = 1 - \frac{\mathrm{UNC}_{\mathrm{EPS}}}{\mathrm{UNC}_{\mathrm{DD}}} .   (18)
Figure 2c shows the ES reliability skill score. High skill scores indicate the EPS is more reliable than the DD. Very high skill scores of greater than 0.8 are found in two bands north and south of the equator in the western Pacific Ocean, with lower positive scores observed over much of the Pacific Ocean. This indicates that the high skill in these regions, as indicated by the ESS, is largely attributable to an improvement in the reliability of the forecast.

At polar latitudes and in the southeast Pacific, the reliability skill score is negative, indicating the DD is more reliable than the EPS. However in these regions, Fig. 2f shows an ES resolution skill score that is large and negative. Because resolution contributes negatively to the total score, a large value of resolution is desirable and negative values of the resolution skill score indicate skill in the EPS forecast. At polar latitudes and in the southeast Pacific, the EPS forecasts have better resolution than the DD forecasts. Therefore, despite their low reliability at these latitudes, the overall ESS indicates an improvement over the DD. The improvement in ES in these regions can be attributed to an improvement in resolution of the forecast. At low latitudes, the resolution of the EPS is similar to that of the DD.

Figure 2i shows the ES uncertainty skill score. This is zero over much of the globe, indicating the EPS and DD forecasts have very similar uncertainty characteristics. This is as expected, since the forecast error characteristics are nearly identical. The small deviations from zero can be attributed to sampling: the sample distribution of errors used to dress the deterministic forecast does not necessarily have a mean of zero. The bias term, shown in Figs. 2j and 2k, is approximately an order of magnitude smaller than each of the other three terms, so contributes little to the total ES. In the tropics, the bias term is smaller for the EPS forecasts than the DD forecasts, though it is larger for the EPS in the extratropics.

a. Significance of observed differences in skill

Three regions of interest were defined by consideration of Fig. 2. The three regions are indicated in Fig. 3. Region 1 (R1) is defined as 10°–25°N, 120°–200°E, and covers the region in the northwest Pacific Ocean with a very high reliability skill score. Region 2 (R2) is defined as 0°–8°N, 220°–290°E, and covers the region in the east Pacific Ocean with a very low (negative) reliability skill score. Region 3 (R3) is defined as 35°–50°S, 200°–280°E, and covers a region in the southeast Pacific Ocean with a negative reliability skill score, but also a negative resolution skill score indicating an improvement in resolution.

Fig. 3. The three regions of interest defined by considering the decomposition of the ES. Region 1 is defined as 10°–25°N, 120°–200°E. Region 2 is defined as 0°–8°N, 220°–290°E. Region 3 is defined as 35°–50°S, 200°–280°E.

Skill scores for the reliability and resolution components of the ES were evaluated for each region of interest, and are shown in Table 1. It is important to consider whether there is a statistically significant difference between the scores for different regions. Significance tests were performed to evaluate whether the variation in RELSS and RESSS between regions indicates a difference in skill, or whether it can be attributed to sampling errors (see appendix C for full details). To test whether the three regions are indistinguishable, bootstrap confidence intervals were calculated for the difference in RELSS and RESSS between regions. Spatial correlations were preserved by sampling subregions from within the regions of interest, and temporal correlation was preserved by employing the block bootstrap. The skill of EPS forecasts with respect to DD forecasts was calculated for 10 000 synthetic datasets generated by sampling the original dataset with replacement, and the difference in skill between the regions was evaluated. The distribution of differences in skill is shown in gray in Fig. 4 for the RELSS and RESSS for each region, compared to the observed difference in skill (blue line with crisscrosses). The test rejects the null hypothesis at the 5% level if the observed difference in skill falls outside the red lines, in which case the regions are said to be significantly different.

Table 1. Skill scores for the reliability and resolution components of the ES (RELSS and RESSS, respectively) for the ECMWF EPS forecast compared to the DD forecast, for each of the three regions defined in the text.
Fig. 4. (a)–(f) Difference in skill scores between each of the three regions defined in the text for the reliability (RELSS) and resolution (RESSS) components of the ES (blue line with crisscrosses). For each component of the score, for each region pair, the significance of the difference in skill is considered by calculating the variability to be expected if each region were indistinguishable from the other regions, using a block bootstrap (gray histogram). If the measured difference in skill scores falls outside the red lines, the difference is said to be significant at the 5% level. See appendix C and the text for more details.

Figure 4 indicates that R1 has a significantly higher reliability skill score than R2, since the measured difference in skill score falls outside of the 95% confidence intervals for this case. However, Fig. 4d indicates there is no significant difference between the resolution of forecasts in these two regions. R3 is significantly less reliable than R1, but has significantly better resolution than either R1 or R2. There is no significant difference between the reliability of forecasts for R2 and R3.

b. Comparison with the RMS spread-error diagnostic

The ES decomposition has indicated in what ways the EPS forecast is more skillful than the DD forecast, and has also highlighted regions of concern. It is of interest to see if this skill is reflected in other diagnostic tools. The calibration of the second moment of the ensemble forecast can be evaluated by constructing root-mean-square (RMS) spread-error diagrams, which test whether Eq. (6) is followed. This graphical diagnostic is a more comprehensive analysis of the calibration of the forecast, and can be used to identify the shortcomings of a forecast in more detail.

The forecast–verification pairs are sorted and binned according to the forecast variance; the RMS spread and the RMS error are evaluated for each bin and shown as a scatter diagram. The spread reliability and spread resolution can both be identified on these diagrams. A forecast with high spread reliability has points lying close to the diagonal line. If the vertical range spanned by the points is large, the forecast has successfully sorted the cases according to their uncertainty, and it has high resolution.
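
The points of the RMS spread-error diagram can be constructed as sketched below (plotting is omitted); the binning by forecast variance follows the description above, and the function name is illustrative.

```python
import numpy as np

def rms_spread_error_points(s2, e, n_bins=10):
    """Points for an RMS spread-error diagram.

    s2 : 1D array of forecast variances; e : matching errors in the ensemble mean.
    Pairs are binned by forecast variance; for each bin the RMS spread and the RMS
    error are returned. For a reliable ensemble the points lie close to the diagonal,
    and a large vertical range of the points indicates high spread resolution.
    """
    s2, e = np.asarray(s2), np.asarray(e)
    rms_spread, rms_error = [], []
    for idx in np.array_split(np.argsort(s2), n_bins):
        rms_spread.append(np.sqrt(np.mean(s2[idx])))
        rms_error.append(np.sqrt(np.mean(e[idx] ** 2)))
    return np.array(rms_spread), np.array(rms_error)
```
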

Figure 5 shows the RMS spread-error diagnostic evaluated for each region in Fig. 3 for both the EPS and DD forecasts. The diagnostic can be compared to the skill score for the ES reliability and resolution components calculated for each region, shown in Table 1. For comparison, a perfectly reliable forecast is generated for each region considered: for each forecast, one ensemble member is randomly selected to be the verification while the remaining members are used as the reliable ensemble (see appendix C for details). The RMS spread-error diagnostic is evaluated for this perfectly reliable forecast, and the process is repeated to indicate the variability to be expected in this diagnostic. The perfectly reliable forecasts are shown in Fig. 5 as the pale gray plume.

Fig. 5. RMS error-spread plot for forecasts made using the EPS (midgray) and the DD system (dark gray). Also shown is the set of perfectly reliable forecasts generated following appendix C (pale gray). The 10-day forecasts of T850 are considered for three regions: (a) region 1: 10°–25°N, 120°–200°E; (b) region 2: 0°–8°N, 220°–290°E; and (c) region 3: 35°–50°S, 200°–280°E. The ensemble forecasts are sorted and binned according to their forecast spread. The RMS error in each bin is plotted against the RMS spread for each bin. For a reliable ensemble, these should lie on the diagonal shown (Leutbecher and Palmer 2008).

Figure 5a shows the results from R1. As expected, the reliability of the EPS is markedly better than that of the DD, with the scattered points for the EPS forecasts falling close to the diagonal as required for a statistically consistent ensemble forecast. There is a slight improvement in resolution, dominated by the cases with the highest uncertainty. This is reflected in Fig. 2f and Table 1, which show an improvement in resolution in the region.

In Fig. 5b, the results are shown for R2. The reliability of the EPS forecast is indeed poorer than for the DD forecast; the ensemble is consistently underdispersive. However, the figure indicates an improvement in resolution in this region. This improvement can be traced to a tongue of very low resolution skill score extending northwest from the Peruvian coast, which is visible in Fig. 2f.

Figure 5c shows the results for R3. The EPS forecast is less reliable than the DD forecast, being somewhat underdispersive, though the difference is small. The resolution of the EPS forecast is better than for the DD forecast, as expected from the ES decomposition. The ES decomposition has correctly identified regions of interest for the ECMWF EPS that have particularly high or low skill with respect to the reliability or resolution of the forecast. Investigating these regions further using more complete, graphical tests of statistical consistency, can then indicate in what way the forecast is unreliable or has poor resolution.

Figure 5 also indicates that, in all regions, there is a significant departure from perfectly reliable forecasts for both EPS and DD forecasts. There is some difference between the set of RMS spread-error diagnostics generated for the perfectly reliable forecasts, but the set remains close together. This indicates that the sampling uncertainty to be expected for this diagnostic for a sample of this size is small, and that the forecast–verification sample size is sufficient to detect a departure from a perfectly reliable forecast.

c. Reliability and resolution as a function of forecast error

The ES decomposition focuses on how well the PDF captures the uncertainty in the ensemble mean: spread and skew are chosen as the two binning dimensions. However, it is interesting to also consider a third dimension: the accuracy of the ensemble mean. In particular, is the skill of the forecast spread and skew a function of the forecast mean? For example, if the model predicts a lower than average temperature, is this accompanied by a worse than average estimate of uncertainty? Asking this question can reveal information about how to interpret and calibrate probabilistic forecasts, and also about model biases that can guide model development.

The model climatological mean was evaluated as a function of latitude. The forecast temperature anomaly was evaluated as the difference between the forecast and climatological means for each spatial point for each start date. The dataset was sorted according to how anomalously cold or warm the forecast conditions were, before being binned into 10 equally populated groups. The ES decomposition was evaluated for each decile, and the reliability and resolution terms shown as a function of forecast temperature anomaly in Fig. 6. For comparison, the gray plumes in Fig. 6 indicate the expected variation in RELSS and RESSS if the dataset were randomly separated into tenths, and the decomposition performed on each random tenth.
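
The conditioning on forecast temperature anomaly can be sketched as follows, assuming a simple model climatology attached to each forecast–verification pair; the decile construction by sorting and equal splitting follows the description above, and the names are illustrative.

```python
import numpy as np

def anomaly_deciles(fc_mean, model_clim, n_deciles=10):
    """Indices of forecast-verification pairs grouped by ensemble-mean anomaly decile.

    fc_mean    : 1D array of ensemble-mean forecasts, one per pair
    model_clim : matching model climatological means (e.g., the zonal-mean model
                 climatology attached to each pair)
    Returns a list of n_deciles index arrays, coldest anomalies first; the ES
    decomposition can then be evaluated separately within each group.
    """
    anomaly = np.asarray(fc_mean) - np.asarray(model_clim)
    return np.array_split(np.argsort(anomaly), n_deciles)
```
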

Fig. 6. (a) ES reliability skill score and (b) ES resolution skill score as a function of forecast ensemble mean temperature anomaly (black); see text for more details. Low (high) deciles correspond to forecasts that are colder (warmer) than average. The 100-member gray plume indicates the variability to be expected from randomly distributing the data into 10 bins before calculating the decomposition.

The figure indicates a significant variation in forecast skill as a function of forecast temperature anomaly. Anomalously cold conditions (low deciles) have worse reliability on average, whereas forecasts of anomalously warm conditions tend to be more reliable. However, the resolution of forecasts is improved if the event is more extreme—forecasts of anomalously warm or cold conditions have better resolution than forecasts of moderate conditions. The gray plume indicates that the observed differences as a function of temperature anomaly are significant.

5. Conclusions

This paper has shown how the error-spread score (ES), presented by Christensen et al. (2015), can be decomposed into three components. The score is designed to evaluate ensemble forecasts of continuous variables, and to test how well uncertainty is represented in the forecast: the score is particularly sensitive to the calibration of the forecast [i.e., how well the probabilistic forecast represents uncertainty; Christensen et al. (2015)]. In a similar manner to other proper scores, the ES can be decomposed into reliability, resolution, and uncertainty components. The ES reliability component evaluates the reliability of the forecast spread and skewness. This term is small if the forecast and verification are statistically consistent, and the moments of the ensemble forecast are a reliable indication of the statistical characteristics of the verification. Similarly, the ES resolution component evaluates the resolution of the forecast spread and shape. This term contributes negatively to the ES, so a large resolution term is desirable. This term is large if the spread and skewness of the ensemble forecast vary according to the state of the atmosphere and the predictability of the atmospheric flow. The forecast PDF from a system with high ES resolution separates forecast situations with high uncertainty (large mean square error) from those with low uncertainty. The ES uncertainty component depends only on the measured (climatological) error distribution, and is independent of the forecast spread or skewness. A large variability in the forecast error will contribute toward a larger (poorer) uncertainty component.

The decomposition of the ES was used to evaluate forecasts made using the European Centre for Medium-Range Weather Forecasts Ensemble Prediction System (EPS), and compared to a statistically produced “perfect static” forecast. The EPS was found to have skill across most of the globe. The ES decomposition attributed the improvement in skill at low latitudes to an improvement in reliability, whereas the skill at higher latitudes was due to an improvement in resolution.

The ES decomposition was used to highlight a number of regions of interest for the EPS: the RELSS and RESSS for these regions showed statistically significant differences from each other. The RMS spread-error diagnostic was calculated for these regions. The results were as expected from the ES decomposition, but also indicated in what way the forecast was reliable or showed resolution. Finally, the ES decomposition was evaluated as a function of forecast mean anomaly. The reliability and resolution of the forecast were found to vary significantly as a function of the forecast anomaly. The decomposition shown in this paper is therefore a useful tool for analyzing the source of skill in ensemble forecasts, and for identifying regions that can be investigated further using more comprehensive graphical diagnostic tools.

Acknowledgments

Thanks to J. Bröcker, C. Ferro, and M. Leutbecher for helpful discussions, and to I. Moroz and T. Palmer for their support in the writing of this paper. Thanks to two anonymous reviewers, C. Ferro, and the editor for their helpful comments. This research was supported by a NERC studentship and European Research Council Grant 291406.

APPENDIX A

The Decomposition of the Error-Spread Score

a. General decomposition

Assume that the predicted spread can only take I discrete values s_i, i = 1, \ldots, I. Assume the predicted skewness can only take J discrete values g_j, j = 1, \ldots, J. Bin the measured errors e_k according to the predicted spread s_i and the predicted skewness g_j. As in section 3, define
n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} ,   (A1)

\overline{e^2}_{ij} = \frac{1}{n_{ij}} \sum_{k \in (i,j)} e_k^2 ,   (A2)

and

\overline{e^2} = \frac{1}{n} \sum_{k=1}^{n} e_k^2 ,   (A3)
where n is the total number of forecast–verification pairs and n_{ij} is the number of forecast–verification pairs in bin (i, j). The term \overline{e^2}_{ij} is the average squared error in each bin and \overline{e^2} is the climatological squared error; both represent sample estimates of the expected values of these errors.
The error-spread score can be rewritten as a sum over the bins as
ea4
Consider the first term A, evaluating the spread of the forecast:
ea5
Here it has been recognized that s_i is a discrete variable, constant within a bin, so it can be moved outside the summation. Using the definitions of \overline{e^2}_{ij} and \overline{e^2}, the square is completed twice to give:
eq1
Recalling the definition of \overline{e^2}, and the definition of \overline{e^2}_{ij} from Eq. (A2),
eq2
ea6
Since the squared errors e_k^2 in this term have not been sorted according to spread and skew, the multiple summation terms can be replaced by a single summation over k:
ea7

The first term, A, has been decomposed into a sum of squared terms.

Consider the second two terms, B, which evaluate the shape of the forecast:
eq3
Define
ea8
As in section 3, define the measured shape factor T_{ij},
ea9
which is approximately equal to the third moment of the error distribution in each bin, estimated using a finite sample size. Also define the climatological shape factor \overline{T}:
ea10
Variable B can be written in terms of the shape factor as
ea11
Completing the square in Eq. (A11) by adding and subtracting the climatological shape factor \overline{T}:
eq4
ea12
Combining this with Eq. (A7), the ES has been decomposed into
ea13

b. Comments to aid interpretation

The form of the decomposition presented above does not rely on any assumptions. Two assumptions will now be considered to aid interpretation of the decomposition.

First, for an unbiased forecast,
ea14
If the mean error \bar{e}_{ij} within each bin is zero for all i and j, term (vi) in Eq. (A13) would be approximately zero. However, a real forecasting system may have complicated biases (conditional on forecast spread and skew), so this term is not necessarily negligible. The bias term (vi) has therefore been retained separately in the above decomposition.

Second, consider a forecast that, in addition to having no bias, has perfect forecast spread and skew.

If the forecast spread is always perfect, so that the forecast variance equals the expected squared error in the ensemble mean, then

\overline{e^2}_{ij} = s_i^2 .   (A15)

For such a forecast, the first term (i) in Eq. (A13) will be approximately zero.
Consider the shape factor for the perfect forecast, for which the bias \bar{e}_{ij} = 0. Using Eq. (B3) (see appendix B),
T_{ij} = \gamma \left( \overline{e^2}_{ij} \right)^{3/2} ,   (A16)
where γ is the observed skewness of the error distribution. Combining this with Eq. (6), it can be shown that if the forecast standard deviation and skewness are accurate (i.e., they are equal to those of the true distribution), the measured shape factor should obey

T_{ij} \approx -s_i^3 g_j ,   (A17)

where the negative sign arises because the verification enters with a negative sign in the definition of the error, e = m - z (so that the skewness of the error distribution is -g_j for an accurate forecast), and we have used the fact that \overline{e^2}_{ij} = s_i^2 for an accurate forecast. The second term (ii) in Eq. (A13) is therefore zero if the forecast shape is perfect.

APPENDIX B

Mathematical Properties of Moments

For a random variable X that is drawn from a probability distribution, P(X), the moments of the distribution are defined as follows. The mean μ:
\mu = E[X] = \int x \, P(x) \, dx .   (B1)

The variance \sigma^2:

\sigma^2 = E\left[ (X - \mu)^2 \right] .   (B2)

The skewness γ:

\gamma = \frac{E\left[ (X - \mu)^3 \right]}{\sigma^3} .   (B3)

APPENDIX C

Significance Testing

The significance of the results reported here will be tested in two ways in order to address two different null hypotheses.

a. Hypothesis 1: There is no significant difference between the EPS observed behavior and that of a perfectly reliable forecast

It is important to consider whether the values of the scores vary between regions simply because of the limited sample size used to evaluate the scores, or if in fact the difference indicates a significant departure from some nominal behavior. If hypothesis 1 is true, the deviation from the diagonal in Fig. 5 is due to sampling, and reveals no significant deviation from a reliable forecast in any region. Similarly, the variation in calculated RELSS between regions does not reveal a significant difference in reliability, and is as expected from sampling errors.

To test this hypothesis, a set of perfectly reliable forecasts is generated for each region considered. For each forecast within the region (as a function of both position and date), one ensemble member is randomly selected to be the verification while the remaining members are used as the reliable ensemble. This reproduces both the spatial and temporal correlations in the data. The RMS spread-error diagnostic is evaluated for this perfectly reliable forecast, as is the REL and RES components of the ES. This process is repeated 10 000 times, where each iteration resamples one member from each forecast to be the verification.
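
One way to implement this resampling is sketched below: for each forecast one member is drawn at random to act as the verification, and the remaining members form the perfectly reliable ensemble. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def perfectly_reliable_resample(ens, rng=None):
    """One realization of a perfectly reliable forecast-verification set.

    ens : array (n_pairs, F) of ensemble members. For each pair one member is
    randomly chosen as the pseudo-verification and removed from the ensemble,
    so that the verification is, by construction, a draw from the forecast PDF.
    Returns (reduced ensemble of shape (n_pairs, F - 1), pseudo-verifications).
    """
    rng = np.random.default_rng() if rng is None else rng
    ens = np.asarray(ens)
    n, F = ens.shape
    pick = rng.integers(F, size=n)                 # member index used as the verification
    pseudo_obs = ens[np.arange(n), pick]
    keep = np.ones((n, F), dtype=bool)
    keep[np.arange(n), pick] = False
    reduced = ens[keep].reshape(n, F - 1)          # the remaining F - 1 members
    return reduced, pseudo_obs
```
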

The variation in REL and RES skill scores for each region for this perfectly reliable forecast is shown as the pale gray histograms in Fig. C1. The skill scores for the reliable forecast vary little from iteration to iteration, and are significantly different from the skill scores calculated for the EPS forecast, shown in blue. This indicates that the forecast–verification sample size is sufficient to detect a departure from a perfectly reliable forecast.

Fig. C1. Skill scores for the reliability and resolution components of the ES (RELSS and RESSS, respectively) for the ECMWF EPS forecast compared to the DD forecast, for each of the three regions defined in the text (blue line with crisscrosses). For each component of the score, for each region, the significance of the skill score is revealed using two techniques: first, the variability in skill scores to be expected from sampling a perfectly reliable forecast (pale gray histogram), and second, the expected distribution of skill scores if each region were indistinguishable from the other regions (dark gray histogram).

The variation in the RMS spread-error diagnostic for this perfectly reliable forecast is shown as the pale gray plume in Fig. 5. Similarly to the REL component of the ES, the spread-error diagnostic also indicates that forecasts for regions 1, 2, and 3 are significantly less reliable than the perfectly reliable forecast generated here.

b. Hypothesis 2: There is no significant difference between the EPS observed behavior in each region

If this hypothesis is true, the variation in RELSS and RESSS between regions is due to sampling errors from the limited number of forecast–verification pairs, and does not reveal a significant difference in the reliability and resolution of the EPS forecasts between each region.

To test whether the three regions are indistinguishable, a block bootstrap technique is used to generate 10 000 synthetic datasets. First, region 1 and region 3 are split into two nonoverlapping subregions of approximately the same size and shape as region 2. Points within each of these five subregions will be sampled together to preserve spatial correlations in the dataset. In time, the data is split into 10 nonoverlapping sections each containing 10 consecutive forecast–verification pairs.C1 The dates within these blocks will be sampled together to preserve temporal correlations in the dataset.

A total of 20 space–time blocks of data are randomly selected with replacement for each of region 1 and 3, and 10 space–time blocks are selected for region 2, accounting for the difference in size of the original regions. The skill of EPS forecasts with respect to DD forecasts is then calculated for these randomized regions. The process is repeated 10 000 times.

The distribution of skill calculated for the resampled datasets is compared to that evaluated for each original region and shown as the dark gray histogram in Fig. C1. In addition, the difference in skill scores is calculated for each region pair. The distribution of differences in skill is shown in gray in Fig. 4 for the RELSS and RESSS for each region, and compared to the observed difference in skill (blue line with crisscrosses). The red lines indicate the 2.5th and 97.5th percentiles of the distribution of differences. If the magnitude of the difference in skill between the true regions falls outside the red lines estimated from the bootstrap test, the difference in skill is said to be significant at the 5% level.
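
A simplified, schematic version of this block-bootstrap test for the difference in a mean score between two regions is sketched below. It pools per-block scores and resamples pseudo-regions under the null hypothesis; the actual procedure of appendix C resamples spatial subregions and temporal blocks and works with the RELSS and RESSS statistics, so this is an illustration of the idea only.

```python
import numpy as np

def block_bootstrap_null(blocks_a, blocks_b, n_boot=10000, rng=None):
    """Null distribution for the difference in a mean score between two regions.

    blocks_a, blocks_b : 1D arrays of per-block scores, where each block is a
    space-time block of forecast-verification pairs (so correlations within a
    block are preserved). Under the null hypothesis that the regions are
    indistinguishable, pseudo-regions of the original sizes are drawn with
    replacement from the pooled blocks, and the difference in their mean scores
    is recorded. Returns the 2.5th and 97.5th percentiles of that distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    blocks_a, blocks_b = np.asarray(blocks_a), np.asarray(blocks_b)
    pool = np.concatenate([blocks_a, blocks_b])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(pool, size=blocks_a.size, replace=True).mean()
        b = rng.choice(pool, size=blocks_b.size, replace=True).mean()
        diffs[i] = a - b
    return np.percentile(diffs, [2.5, 97.5])

# An observed difference lying outside the returned interval is taken to be
# significant at the 5% level.
```
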

REFERENCES

  • Berner, J., G. J. Shutts, M. Leutbecher, and T. N. Palmer, 2009: A spectral stochastic kinetic energy backscatter scheme and its impact on flow-dependent predictability in the ECMWF ensemble prediction system. J. Atmos. Sci., 66, 603–626, doi:10.1175/2008JAS2677.1.
  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  • Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores. Quart. J. Roy. Meteor. Soc., 135, 1512–1519, doi:10.1002/qj.456.
  • Christensen, H. M., I. M. Moroz, and T. N. Palmer, 2015: Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quart. J. Roy. Meteor. Soc., 141, 538–549, doi:10.1002/qj.2375.
  • Ferro, C. A. T., and T. E. Fricker, 2012: A bias-corrected decomposition of the Brier score. Quart. J. Roy. Meteor. Soc., 138, 1954–1960, doi:10.1002/qj.1924.
  • Gneiting, T., 2011: Making and evaluating point forecasts. J. Amer. Stat. Assoc., 106, 746–762, doi:10.1198/jasa.2011.r10138.
  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, doi:10.1198/016214506000001437.
  • Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
  • Isaksen, L., M. Bonavita, R. Buizza, M. Fisher, J. Haseler, M. Leutbecher, and L. Raynaud, 2010: Ensemble of data assimilations at ECMWF. Tech. Rep. 636, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom, 48 pp.
  • Leutbecher, M., 2010: Diagnosis of ensemble forecasting systems. Proc. Seminar on Diagnosis of Forecasting and Data Assimilation Systems, Reading, United Kingdom, ECMWF, 235–266.
  • Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, doi:10.1016/j.jcp.2007.02.014.
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  • Murphy, A. H., 1986: A new decomposition of the Brier score: Formulation and interpretation. Mon. Wea. Rev., 114, 2671–2673, doi:10.1175/1520-0493(1986)114<2671:ANDOTB>2.0.CO;2.
  • Palmer, T. N., R. Buizza, F. Doblas-Reyes, T. Jung, M. Leutbecher, G. J. Shutts, M. Steinheimer, and A. Weisheimer, 2009: Stochastic parametrization and model uncertainty. Tech. Rep. 598, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom, 44 pp.
  • Sanders, F., 1963: On subjective probability forecasting. J. Appl. Meteor., 2, 191–201, doi:10.1175/1520-0450(1963)002<0191:OSPF>2.0.CO;2.
  • Tödter, J., and B. Ahrens, 2012: Generalization of the ignorance score: Continuous ranked version and its decomposition. Mon. Wea. Rev., 140, 2005–2017, doi:10.1175/MWR-D-11-00266.1.
  • Weijs, S. V., and N. van de Giesen, 2011: Accounting for observational uncertainty in forecast verification: An information-theoretical view on forecasts, observations, and truth. Mon. Wea. Rev., 139, 2156–2162, doi:10.1175/2011MWR3573.1.
  • Weijs, S. V., R. van Nooijen, and N. van de Giesen, 2010: Kullback–Leibler divergence as a forecast skill score with classic reliability-resolution-uncertainty decomposition. Mon. Wea. Rev., 138, 3387–3399, doi:10.1175/2010MWR3229.1.
  • Young, R. M. B., 2010: Decomposition of the Brier score for weighted forecast–verification pairs. Quart. J. Roy. Meteor. Soc., 136, 1364–1370, doi:10.1002/qj.641.
1

Note that if the forecaster only supplies the verifier with the mean, standard deviation, and skewness instead of the full ensemble forecast, strictly speaking the score cannot be called proper, and should instead be called “consistent, though not strictly consistent, for the mean, standard deviation and skewness” (Gneiting 2011).

2

For example, in the case of bimodal forecasts, the first three moments will be insufficient to characterize the full forecast PDF.

3

A spectral resolution of T639 corresponds to a reduced Gaussian grid of N320, or 30-km resolution, or a 0.28° latitude–longitude grid.

4

The selection of 100 dates is a pragmatic choice since 100 bins will be used for the decomposition—this guarantees that the data may be evenly distributed between the bins. Forecasts are constrained to boreal summer, as the impact of the annual cycle on forecast skill will not be considered.

C1. The temporal correlation is approximately 0.1 after this lag.

  • Fig. 1.

    Forecasting skill of the EPS evaluated using the error-spread score. The 10-day forecasts of T850 are compared with the 4DVar analysis and averaged over 100 dates sampled from April to September 2009–12. (a) The score is calculated in the standard way using Eq. (1). (b) The score is calculated using the decomposition in Eq. (14). (c) The error-spread skill score for the EPS forecast, calculated in the standard way, with respect to the DD forecast. For (a) and (b), the score is plotted on a logarithmic scale—a contour level of n indicates a score of 10^n.

  • Fig. 2.

    Source of forecasting skill evaluated using the ESS, comparing the EPS and DD forecasts. See text for more details. The reliability component of (a) the EPS forecasts and (b) the DD forecasts. (c) The reliability skill score (RELSS): positive values indicate the EPS has better reliability than the DD forecast. The resolution component of (d) the EPS forecasts and (e) the DD forecasts. (f) The resolution skill score (RESSS): negative values indicate the EPS has better resolution than the DD forecast. The uncertainty component of (g) the EPS forecasts and (h) the DD forecasts. (i) The uncertainty skill score (UNCSS): positive values indicate the EPS has lower (better) uncertainty than the DD forecast. The absolute value of the bias component of (j) the EPS forecasts and (k) the DD forecasts. The color bar in (a) also corresponds to (b),(d),(e),(g),(h),(j), and (k). The color bar in (c) also corresponds to (f) and (i). In (a),(b),(d),(e),(g),(h),(j), and (k), the components of the score are plotted on a logarithmic scale—a contour level of n indicates a score of 10^n.

  • Fig. 3.

    The three regions of interest defined by considering the decomposition of the ES. Region 1 is defined as 10°–25°N, 120°–200°E. Region 2 is defined as 0°–8°N, 220°–290°E. Region 3 is defined as 35°–50°S, 200°–280°E.

  • Fig. 4.

    (a)–(f) Difference in skill scores between each of the three regions defined in the text for the reliability (RELSS) and resolution (RESSS) components of the ES (blue line with crisscrosses). For each component of the score, and for each region pair, the significance of the difference in skill is assessed by calculating, with a block bootstrap, the variability to be expected if each region were indistinguishable from the other regions (gray histogram). If the measured difference in skill scores falls outside the red lines, the difference is deemed significant at the 5% level. See appendix C and the text for more details.

  • Fig. 5.

    RMS error-spread plot for forecasts made using the EPS (midgray) and the DD system (dark gray). Also shown is the set of perfectly reliable forecasts generated following appendix C (pale gray). The 10-day forecasts of T850 are considered for three regions: (a) region 1: 10°–25°N, 120°–200°E; (b) region 2: 0°–8°N, 220°–290°E; and (c) region 3: 35°–50°S, 200°–280°E. The ensemble forecasts are sorted and binned according to their forecast spread. The RMS error in each bin is plotted against the RMS spread for each bin. For a reliable ensemble, these points should lie on the diagonal shown (Leutbecher and Palmer 2008). A minimal recipe for constructing such a diagram is sketched after the figure captions.

  • Fig. 6.

    (a) ES reliability skill score and (b) ES resolution skill score as a function of forecast ensemble mean temperature anomaly (black)—see text for more details. Low (high) deciles correspond to forecasts that are colder (warmer) than average. The 100-member gray plume indicates the variability to be expected from randomly distributing the data into 10 bins before calculating the decomposition.

  • Fig. C1.

    Skill scores for the reliability and resolution components of the ES (RELSS and RESSS, respectively) for the ECMWF EPS forecast compared to the DD forecast, for each of the three regions defined in the text (blue line with crisscrosses). For each component of the score, and for each region, the significance of the skill score is assessed using two techniques: first, the variability in skill scores to be expected from sampling a perfectly reliable forecast (pale gray histogram), and second, the expected distribution of skill scores if each region were indistinguishable from the other regions (dark gray histogram).
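The pooled-resampling test described in the captions of Fig. 4 and Fig. C1 can be summarized in a short sketch. The code below is illustrative only and is not the implementation used in the paper: the function name block_bootstrap_null, the default block length, and the choice of the mean as the test statistic are assumptions made here for concreteness; in practice the block length would be chosen so that successive blocks are approximately independent (cf. footnote C1).

```python
import numpy as np

def block_bootstrap_null(scores_a, scores_b, block_len=5, n_boot=1000, seed=0):
    """Null distribution of the difference in mean score between two regions,
    assuming the regions are indistinguishable: the per-date scores are pooled
    and contiguous blocks are resampled to respect temporal correlation."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([scores_a, scores_b])
    n = len(scores_a)
    n_blocks = int(np.ceil(n / block_len))

    def resample():
        # draw random block start points and stitch the blocks together
        starts = rng.integers(0, len(pooled) - block_len + 1, size=n_blocks)
        return np.concatenate([pooled[s:s + block_len] for s in starts])[:n]

    return np.array([resample().mean() - resample().mean() for _ in range(n_boot)])

# A measured difference in skill between two regions would be deemed
# significant at the 5% level if it fell outside these bounds.
rng = np.random.default_rng(1)
null = block_bootstrap_null(rng.normal(size=100), rng.normal(size=100))
lower, upper = np.percentile(null, [2.5, 97.5])
```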
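Likewise, the RMS error-spread diagram in Fig. 5 follows a simple recipe: sort the forecast cases by their ensemble spread, split them into equally populated bins, and compare the RMS spread with the RMS ensemble-mean error in each bin. The sketch below is a minimal illustration with synthetic data; the variable names spread and error and the number of bins are placeholders rather than the settings used in the paper. For a reliable ensemble, the resulting points lie close to the 1:1 diagonal.

```python
import numpy as np

def rms_error_spread(spread, error, n_bins=10):
    """Bin forecast cases by ensemble spread and return the RMS spread and the
    RMS ensemble-mean error in each (roughly equally populated) bin."""
    order = np.argsort(spread)              # sort cases by forecast spread
    bins = np.array_split(order, n_bins)    # roughly equal-sized bins
    rms_spread = np.array([np.sqrt(np.mean(spread[b] ** 2)) for b in bins])
    rms_error = np.array([np.sqrt(np.mean(error[b] ** 2)) for b in bins])
    return rms_spread, rms_error

# Synthetic check: draw errors whose standard deviation equals the forecast
# spread, so the binned RMS error should track the binned RMS spread.
rng = np.random.default_rng(0)
spread = rng.uniform(0.5, 2.0, size=1000)
error = rng.normal(0.0, spread)
rms_s, rms_e = rms_error_spread(spread, error)
```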
