1. Introduction
A problem common to many scientific disciplines is that of forecasting an event given some knowledge of the system's past behavior together with a theoretical model of the system. A secondary problem is then how to verify the accuracy of that forecast once the event has taken place.
Scoring rules provide a framework for forecast verification. They quantitatively score the success of a forecast based on the forecast probability distribution and the actual outcome. Scoring rules are particularly useful in the field of meteorology, as they allow for the unambiguous comparison of different weather forecasts. This allows the strengths and weaknesses of forecasts from different centers, for different regions of the world, or for different times of year to be analyzed. Scoring rules are often considered to be a reward that should be maximized. It is important to design scoring rules carefully to encourage honesty from the forecaster: for example, they must not encourage the forecaster to hedge his or her bets by being deliberately vague, or, equally, to back a single outcome instead of issuing the full probabilistic forecast. A score with this property is a proper score: its expectation is optimized if the “true” probability distribution is predicted. The expectation of a strictly proper score is optimized if and only if the true distribution is predicted.
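As a simple illustration, consider the Brier score (Brier 1950) for a binary event that occurs with true probability $q$ and is forecast with probability $p$. The expected score is
$$
\mathbb{E}\big[\mathrm{BS}(p)\big] = q\,(1-p)^2 + (1-q)\,p^2,
\qquad
\frac{d}{dp}\,\mathbb{E}\big[\mathrm{BS}(p)\big] = 2\,(p-q),
$$
which is minimized only at $p = q$: in expectation the forecaster can do no better than issuing the true probability, which is what makes the Brier score strictly proper for binary events.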
Proper scores are sensitive to two properties of a forecast. The first is reliability. This is a measure of the statistical consistency of a forecast, and tests whether the verification behaves like a random sample from the issued forecast probability density function (PDF). The second property is resolution. This is a measure of how case dependent the probabilistic forecast is—is the forecasting system able to distinguish between different outcomes of the system, effectively sorting the observations into separate groups? It is important that a forecast has resolution, but is also reliable. Bröcker (2009) showed that all strictly proper scores can be explicitly decomposed into a reliability and resolution component. The third term of the decomposition is uncertainty, which is independent of the forecasting system and depends only on the statistics of the observations.
It is useful to decompose a score into its constituent components, reliability and resolution, as it gives insight into the source of skill of a forecast. It allows the user to identify the strengths of one forecasting system over another. Importantly, it indicates the characteristics of the forecast that require improvement, providing focus for future research efforts. Many of the existing scoring rules have been decomposed into their constituent components. The Brier score (Brier 1950) has been decomposed in several ways (e.g., Sanders 1963; Murphy 1973, 1986; Young 2010; Ferro and Fricker 2012). Similarly, the continuous ranked probability score can be decomposed into two parts: scoring reliability and resolution/uncertainty (Hersbach 2000). Tödter and Ahrens (2012) show that a generalization of the ignorance score can also be decomposed into reliability, resolution, and uncertainty components. Weijs et al. (2010) present a new “divergence score,” closely related to the ignorance score, as well as its three-component decomposition. This decomposition can also be generalized to account for uncertainty in the observation (Weijs and van de Giesen 2011). In each of these cases, the decomposition allows the source of skill in a forecast to be identified.
A new proper score, the error-spread score (ES), has been proposed for the evaluation of ensemble forecasts (Christensen et al. 2015). It is formulated purely with respect to moments of the ensemble forecast distribution, instead of using the full distribution itself. This means that the verifier does not need to estimate or store the full forecast PDF,1 though this simplification will necessarily result in some loss of information.2 Nevertheless, the usefulness of scores that depend only on forecast moments has been recognized by other authors [e.g., Eq. (27) in Gneiting and Raftery (2007)]. The ES is suitable for the evaluation of continuous forecasts, and does not require the discretization of the forecast using bins, as is the case for the categorical Brier and ranked probability scores. The score is designed to evaluate how well a forecast represents uncertainty: is the forecast able to distinguish cases where the atmospheric flow is very predictable from those where the flow is unpredictable? A well-calibrated probabilistic forecast that represents uncertainty is essential for decision making, and therefore has high value to the user of the forecast. The ES is particularly sensitive to this requirement.
It is desirable to be able to decompose the ES into its constituent components, as has been carried out for the Brier and continuous ranked probability scores. This paper shows how this can be achieved. The score can be decomposed into reliability, resolution, and uncertainty components. The reliability component has two terms: the first evaluates the reliability of the forecast spread, while the second evaluates the reliability of the forecast skewness. Similarly, the resolution component has two terms that evaluate the resolution of the forecast spread and skewness. The third term is the uncertainty term, which depends only on the measured climatological error distribution. There is also a fourth term, which is an explicit function of the bias in the forecast. In section 2 the ES is introduced, and in section 3 the decomposition of the ES is presented. To illustrate the use of the decomposition, it is evaluated for forecasts made using the operational Ensemble Prediction System (EPS) at the European Centre for Medium-Range Weather Forecasts (ECMWF) in section 4. Conclusions are drawn in section 5.
2. The error-spread score
The error-spread score, proposed by Christensen et al. (2015), is particularly suitable for the evaluation of ensemble forecasts. It is formulated using the moments of the ensemble forecast, and so does not require an estimate of the full forecast PDF.
To be useful, a scoring rule must be proper. The ES cannot be a strictly proper score as it is a function of the moments of the forecast PDF, not the full PDF itself. This results in an equal score for two different PDFs with the same moments. Nevertheless, the ES is a proper score, and its expectation is minimized if the forecast mean, spread, and skewness match the moments of the true PDF. The ES is proved to be a proper score in Christensen et al. (2015).
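A minimal sketch of how such a moment-based score can be computed from an ensemble is given below. It assumes the functional form reported in Christensen et al. (2015), ES = (s^2 - e^2 - e s g)^2 averaged over forecast–verification pairs, with s the ensemble standard deviation, e the error of the ensemble mean, and g the ensemble skewness; the definitive expression and sign conventions are those of Eq. (1).

```python
import numpy as np

def error_spread_score(ensemble, verification):
    """Sketch of the error-spread score from ensemble moments.

    Assumes ES = mean over cases of (s^2 - e^2 - e*s*g)^2, with
    s the ensemble standard deviation, e the error of the ensemble
    mean, and g the ensemble skewness (Christensen et al. 2015).

    ensemble     : array of shape (n_cases, n_members)
    verification : array of shape (n_cases,)
    """
    ensemble = np.asarray(ensemble, dtype=float)
    verification = np.asarray(verification, dtype=float)
    m = ensemble.mean(axis=1)                                # ensemble mean
    s = ensemble.std(axis=1, ddof=1)                         # ensemble spread
    g = ((ensemble - m[:, None]) ** 3).mean(axis=1) / s**3   # ensemble skewness
    e = verification - m                                     # error of the ensemble mean
    return np.mean((s**2 - e**2 - e * s * g) ** 2)
```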
3. Decomposition of the error-spread score
The error-spread score, as a proper score, evaluates both reliability and resolution (Bröcker 2009). In this section, the decomposition into these components is presented.
In this decomposition, the focus is on assessing the spread and shape of the ensemble forecast: does the forecast probability distribution correctly indicate the uncertainty in the forecast? To investigate this, we consider whether the observed error distribution is consistent with the distribution predicted by the ensemble forecast.
The first term evaluates the reliability of the forecast. This has two components that test (i) the reliability of the ensemble spread and (ii) the reliability of the ensemble shape, respectively. Term (i) is the squared difference between the forecast variance and the observed mean square error for that forecast variance. For a reliable forecast, these quantities should be equal (Leutbecher and Palmer 2008; Leutbecher 2010). The smaller term (i) is, the more reliable the forecast spread. Term (ii) is the standardized squared difference between the measured shape factor of the error distribution and the shape factor predicted by the ensemble forecast.
The second term evaluates the resolution of the forecast. This also has two components, testing (iii) the resolution of the predicted spread and (iv) the resolution of the predicted shape. Both terms evaluate how well the forecasting system is able to distinguish between situations with different forecast uncertainty characteristics. Term (iii) is the squared difference between the mean square error in each bin and the climatological mean squared error. If the forecast has high resolution in the error variance, the forecast should separate predictions into cases with low uncertainty (low mean square error), and those with high uncertainty (high mean square error), resulting in a large value for term (iii). If the forecast PDF does not indicate the expected error in the forecast, term (iii) will be small as all binned mean squared errors will be close to the climatological value. Therefore, a large absolute value of term (iii) indicates high resolution in the predicted error variance. This is subtracted when calculating the error-spread score, contributing to the low value of ES for a skillful forecast. Similarly, term (iv) indicates the resolution of the skewness (shape) of the error distribution, evaluating the squared difference between the binned and climatological shape factors. If this term is large, the forecast has successfully distinguished between situations with different degrees of skewness in the forecast uncertainty: it has high shape resolution. Again, for both terms (iii) and (iv), the sum is weighted by the number of forecast–verification pairs in each bin.
The third term, (v), is the uncertainty in the forecast, which is not a function of the binning process. It depends only on the measured climatological error distribution, compared to the individual measurements. Nevertheless, unlike for the Brier score decomposition, this term is not independent of the forecast system, and instead provides information about the error characteristics of the forecast system. For example, for a system with large variability in the forecast error, the first term in (v) will be large, whereas that term is small if the forecast consistently has errors of a similar magnitude.
The last term, (vi), is the bias in the forecast. This is a function of the binning process, and is dependent on how biases in the forecasting system depend on the spread and skew of the forecast. If this term is large, it indicates that the forecasting system is systematically biased under certain forecasting conditions. However, cancellation of positive and negative average errors means that a small value for (vi) does not rule out the possibility of conditional biases in the system.
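The spread-related terms of the decomposition can be sketched directly from the verbal description above. The fragment below implements only terms (i) and (iii), weighting each bin by its population as described; the shape terms (ii) and (iv), the uncertainty term (v), the bias term (vi), and the exact normalization are given by Eq. (14) and are not reproduced here.

```python
import numpy as np

def spread_reliability_resolution(s2, e2, bin_index):
    """Sketch of the spread-related terms of the ES decomposition.

    Implements the verbal description of term (i), reliability of the
    spread, and term (iii), resolution of the spread, only; see Eq. (14)
    of the paper for the complete decomposition.

    s2        : forecast variance for each forecast-verification pair
    e2        : squared error of the ensemble mean for each pair
    bin_index : integer bin label (by forecast spread) for each pair
    """
    s2, e2, bin_index = map(np.asarray, (s2, e2, bin_index))
    n_total = len(s2)
    mse_clim = e2.mean()                         # climatological mean squared error
    rel = res = 0.0
    for b in np.unique(bin_index):
        sel = bin_index == b
        weight = sel.sum() / n_total             # fraction of pairs in this bin
        mse_k = e2[sel].mean()                   # observed mean squared error in bin
        var_k = s2[sel].mean()                   # mean forecast variance in bin
        rel += weight * (var_k - mse_k) ** 2     # term (i): spread reliability
        res += weight * (mse_k - mse_clim) ** 2  # term (iii): spread resolution
    return rel, res
```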
4. Evaluation of forecasts from the ECMWF Ensemble Prediction System
The decomposition of the ES was tested using operational 10-day forecasts made with the ECMWF EPS. The EPS uses a spectral atmosphere model at a horizontal triangular truncation of T639,3 with 62 vertical levels, and persisted sea surface temperature anomalies in place of a dynamical ocean. The EPS uses a 50-member ensemble. Initial condition uncertainty is sampled using an ensemble of data assimilations (EDA; Isaksen et al. 2010), combined with perturbations from the leading singular vectors. When using the ES, it is important that the number of ensemble members is large enough to estimate the ensemble forecast moments reliably. Christensen et al. (2015) showed that the expected value of the ES converges for ensembles with 40 or more members. The 50-member EPS ensemble used here is therefore large enough for this purpose.
The EPS uses stochastic parameterization schemes to represent model uncertainty in the forecast. This uncertainty stems from the finite resolution of the forecasting model, which results in simplifications and approximations in the representation of small-scale processes. The EPS uses two stochastic schemes. The first is stochastically perturbed parameterization tendencies (SPPT; Palmer et al. 2009), which addresses uncertainty due to these unresolved small-scale processes. In SPPT, the deterministic parameterized tendencies in temperature, humidity, and horizontal wind are perturbed by a spatially and temporally correlated random number with unit mean. The second scheme is the stochastic kinetic energy backscatter (SKEB) scheme (Berner et al. 2009). This uses random streamfunction perturbations to represent upscale kinetic energy transfer, which is otherwise missing from the model.
Ten-day forecasts are considered, initialized on 30 dates between 14 April and 15 September in each of 2010, 2011, and 2012, and on 10 dates from the same period in 2009: this large sample of 100 forecast–verification pairs is required since the forecast is binned in two dimensions.4 The high-resolution 4DVar analyses (T1279, 16 km) are used for verification. Forecast and verification fields are truncated to T159 (125 km) before verification. Forecasts of temperature at 850 hPa (approximately 1.5 km above ground level) are considered.
To perform the decomposition, the forecast–verification pairs are sorted into 10 bins of equal population according to the forecast standard deviation. Within each of these bins, the forecasts are sorted into 10 further bins of equal population according to their skewness. To increase the sample size, for each latitude–longitude point the forecast–verification pairs within a radius of 285 km are used in the binning process, which has the additional effect of spatially smoothing the calculated scores. The number of data points within a bin varies slightly depending on latitude, but is approximately 20. The average standard deviation and skewness are calculated for each bin, as are the average error characteristics required by Eq. (14).
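A sketch of this two-level, equal-population binning is given below for a single set of forecast–verification pairs; the pooling of points within a 285-km radius and the resulting latitude dependence of the bin populations are not shown.

```python
import numpy as np

def equal_population_bins(values, n_bins):
    """Assign each value to one of n_bins quantile (equal-population) bins."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    return np.searchsorted(edges[1:-1], values, side="right")

def two_level_binning(spread, skewness, n_bins=10):
    """Sketch of the binning described in the text: 10 equal-population
    bins in ensemble spread, then 10 further equal-population bins in
    skewness within each spread bin, giving 100 bins in total."""
    spread, skewness = np.asarray(spread), np.asarray(skewness)
    spread_bin = equal_population_bins(spread, n_bins)
    joint_bin = np.empty_like(spread_bin)
    for b in range(n_bins):
        sel = spread_bin == b
        joint_bin[sel] = b * n_bins + equal_population_bins(skewness[sel], n_bins)
    return joint_bin      # integer label in the range [0, n_bins**2)
```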
For comparison with the EPS, a perfect static probabilistic forecast is generated in an analogous way to the idealized forecasts in Leutbecher (2010). The mean of the EPS forecast is calculated and used as a deterministic forecast. The error between this deterministic forecast and the 4DVar analysis is computed for each 10-day forecast, and an error PDF is constructed from these errors as a function of latitude. The deterministic forecast is dressed with 50 errors randomly sampled from this latitudinally dependent distribution. This dressed deterministic (DD) ensemble is a “perfect static” forecast: the error distribution is correct by construction when averaged over all start dates for a latitudinal band. However, it does not vary from day to day, or longitudinally from position to position, according to the predictability of the atmospheric flow. The decomposition of the ES should distinguish between this perfect static forecast and the dynamic probabilistic forecasts made using the EPS, and identify in what way the dynamic probabilistic forecast improves over the perfect static case.
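The construction of the DD forecast can be sketched as follows; the width of the latitude bands used to pool the errors is an illustrative choice rather than a value taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dressed_deterministic(ens_mean, analysis, lat, n_members=50, lat_band=5.0):
    """Sketch of the 'perfect static' dressed deterministic forecast:
    the ensemble mean is used as a deterministic forecast and dressed
    with errors resampled from the pooled error distribution of its
    latitude band (pooled over all start dates and longitudes).

    ens_mean, analysis : arrays of shape (n_dates, n_points)
    lat                : latitude of each point, shape (n_points,)
    Returns an array of shape (n_dates, n_points, n_members).
    """
    ens_mean = np.asarray(ens_mean, dtype=float)
    analysis = np.asarray(analysis, dtype=float)
    errors = analysis - ens_mean
    band = np.floor(np.asarray(lat) / lat_band).astype(int)
    dd = np.empty(ens_mean.shape + (n_members,))
    for b in np.unique(band):
        cols = band == b
        pool = errors[:, cols].ravel()            # error PDF for this latitude band
        draws = rng.choice(pool, size=(ens_mean.shape[0], cols.sum(), n_members))
        dd[:, cols, :] = ens_mean[:, cols, None] + draws
    return dd
```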
Figure 1a shows the forecasting skill of the EPS evaluated using the ES. The lower the value of the score, the better the forecast. A strong latitudinal dependence in the value of the score is observed, with better scores found at low latitudes. This can be attributed largely to the climatological variability, which is strongly latitudinally dependent: at high latitudes the variability is greater and the mean square error larger, so the ES is larger. This is explained in more detail in Christensen et al. (2015). Figure 1b shows the forecasting skill of the EPS evaluated using the ES, where the ES has been calculated by summing the components of the decomposition described in Eq. (14). The results are similar to those for the raw ES, confirming that the decomposition is valid. The small observed differences can be attributed to two causes. First, the decomposition assumes that spread and skew are discrete variables, constant within a bin, which is not true. Second, the decomposition uses neighboring forecast–verification pairs to increase the sample size for the binning process, which is not necessary when the ES is evaluated using Eq. (1).
To investigate the source of skill in the EPS compared to the DD forecast, the decomposition of the ES was calculated for both sets of forecasts. Figure 2 shows the reliability, resolution, uncertainty, and bias terms calculated for the EPS (left-hand column) and DD (middle column) forecasts. Visually, the plots in the two columns look similar. Comparing Figs. 2a and 2b indicates that the reliability term tends to be smaller for the EPS across much of the tropics, and comparing Figs. 2d and 2e shows that the resolution term tends to be smaller for the DD. The uncertainty term, shown in Figs. 2g and 2h, is similar for the EPS and DD. In Fig. 2b, the method of construction of the DD forecast results in a strong horizontal banding across much of the equatorial Pacific Ocean. The standard deviation of the DD forecast is constant as a function of longitude, and the error characteristics are similar, so the reliability term is approximately constant.
At polar latitudes and in the southeast Pacific, the reliability skill score is negative, indicating that the DD is more reliable than the EPS. However, in these regions, Fig. 2f shows an ES resolution skill score that is large and negative. Because resolution contributes negatively to the total score, a large value of resolution is desirable, and negative values of the resolution skill score indicate skill in the EPS forecast. At polar latitudes and in the southeast Pacific, the EPS forecasts therefore have better resolution than the DD forecasts. Despite their low reliability at these latitudes, the overall ESS indicates an improvement over the DD; the improvement in ES in these regions can be attributed to an improvement in the resolution of the forecast. At low latitudes, the resolution of the EPS is similar to that of the DD.
Figure 2i shows the ES uncertainty skill score. This is zero over much of the globe, indicating the EPS and DD forecasts have very similar uncertainty characteristics. This is as expected, since the forecast error characteristics are nearly identical. The small deviations from zero can be attributed to sampling: the sample distribution of errors used to dress the deterministic forecast does not necessarily have a mean of zero. The bias term, shown in Figs. 2j and 2k, is approximately an order of magnitude smaller than each of the other three terms, so contributes little to the total ES. In the tropics, the bias term is smaller for the EPS forecasts than the DD forecasts, though it is larger for the EPS in the extratropics.
a. Significance of observed differences in skill
Three regions of interest were defined by consideration of Fig. 2. The three regions are indicated in Fig. 3. Region 1 (R1) is defined as 10°–25°N, 120°–200°E, and covers the region in the northwest Pacific Ocean with a very high reliability skill score. Region 2 (R2) is defined as 0°–8°N, 220°–290°E, and covers the region in the east Pacific Ocean with a very low (negative) reliability skill score. Region 3 (R3) is defined as 35°–50°S, 200°–280°E, and covers a region in the southeast Pacific Ocean with a negative reliability skill score, but also a negative resolution skill score indicating an improvement in resolution.
Skill scores for the reliability and resolution components of the ES were evaluated for each region of interest, and are shown in Table 1. It is important to consider whether there is a statistically significant difference between the scores for different regions. Significance tests were performed to evaluate whether the variation in RELSS and RESSS between regions indicates a difference in skill, or whether it can be attributed to sampling errors (see appendix C for full details). To test whether the three regions are indistinguishable, bootstrap confidence intervals were calculated for the difference in RELSS and RESSS between regions. Spatial correlations were preserved by sampling subregions from within the regions of interest, and temporal correlation was preserved by employing the block bootstrap. The skill of EPS forecasts with respect to DD forecasts was calculated for 10 000 synthetic datasets generated by sampling the original dataset with replacement, and the difference in skill between the regions was evaluated. The distribution of differences in skill is shown in gray in Fig. 4 for the RELSS and RESSS for each region, compared to the observed difference in skill (blue line with crisscrosses). The test rejects the null hypothesis at the 5% level if the observed difference in skill falls outside the red lines, in which case the regions are said to be significantly different.
Table 1. Skill scores for the reliability and resolution components of the ES (RELSS and RESSS, respectively) for the ECMWF EPS forecast compared with the DD forecast, for each of the three regions defined in the text.
Figure 4 indicates that R1 has a significantly higher reliability skill score than R2, since the measured difference in skill score falls outside of the 95% confidence intervals for this case. However, Fig. 4d indicates there is no significant difference between the resolution of forecasts in these two regions. R3 is significantly less reliable than R1, but has significantly better resolution than either R1 or R2. There is no significant difference between the reliability of forecasts for R2 and R3.
b. Comparison with the RMS spread-error diagnostic
The ES decomposition has indicated in what ways the EPS forecast is more skillful than the DD forecast, and has also highlighted regions of concern. It is of interest to see whether this skill is reflected in other diagnostic tools. The calibration of the second moment of the ensemble forecast can be evaluated by constructing root-mean-square (RMS) spread-error diagrams, which test whether Eq. (6) is satisfied. This graphical diagnostic provides a more comprehensive analysis of the calibration of the forecast, and can be used to identify its shortcomings in more detail.
The forecast–verification pairs are sorted and binned according to the forecast variance; the RMS error and RMS spread are then evaluated for each bin and shown as a scatter diagram. The spread reliability and spread resolution can both be identified on these diagrams. A forecast with high spread reliability has scattered points lying close to the diagonal line. If the vertical range spanned by the scattered points is large, the forecast has successfully sorted the cases according to their uncertainty, and the forecast has high resolution.
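A sketch of this diagnostic is given below; the number of bins and the plotting details are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def rms_spread_error_plot(spread, error, n_bins=10, ax=None):
    """Sketch of the RMS spread-error diagnostic: bin the pairs by
    forecast spread and plot, for each bin, the RMS spread against the
    RMS error of the ensemble mean. Points near the one-to-one line
    indicate a reliable spread; a wide range along the diagonal
    indicates resolution."""
    spread, error = np.asarray(spread), np.asarray(error)
    ax = ax or plt.gca()
    order = np.argsort(spread)
    for chunk in np.array_split(order, n_bins):              # equal-population bins
        rms_spread = np.sqrt(np.mean(spread[chunk] ** 2))
        rms_error = np.sqrt(np.mean(error[chunk] ** 2))
        ax.plot(rms_spread, rms_error, "ko")
    lim = ax.get_xlim()
    ax.plot(lim, lim, "k--", linewidth=0.8)                  # one-to-one line
    ax.set_xlabel("RMS spread")
    ax.set_ylabel("RMS error")
    return ax
```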
Figure 5 shows the RMS spread-error diagnostic evaluated for each region in Fig. 3 for both the EPS and DD forecasts. The diagnostic can be compared to the skill scores for the ES reliability and resolution components calculated for each region, shown in Table 1. For comparison, a perfectly reliable forecast is generated for each region considered: for each forecast, one ensemble member is randomly selected to be the verification while the remaining members are used as the reliable ensemble (see appendix C for details). The RMS spread-error diagnostic is evaluated for this perfectly reliable forecast, and the process is repeated to indicate the variability to be expected in this diagnostic. The perfectly reliable forecasts are shown in Fig. 5 as the pale gray plume.
Figure 5a shows the results for R1. As expected, the reliability of the EPS is markedly better than that of the DD, with the scattered points for the EPS forecasts falling close to the diagonal, as required for a statistically consistent ensemble forecast. There is a slight improvement in resolution, dominated by the cases with the highest uncertainty. This is reflected in Fig. 2f and Table 1, which show an improvement in resolution in this region.
In Fig. 5b, the results are shown for R2. The reliability of the EPS forecast is indeed poorer than that of the DD forecast; the ensemble is consistently underdispersive. However, the figure indicates an improvement in resolution in this region. This improvement can be traced to a tongue of very low (negative) resolution skill score extending northwest from the Peruvian coast, which is visible in Fig. 2f.
Figure 5c shows the results for R3. The EPS forecast is less reliable than the DD forecast, being somewhat underdispersive, though the difference is small. The resolution of the EPS forecast is better than that of the DD forecast, as expected from the ES decomposition. The ES decomposition has correctly identified regions of interest for the ECMWF EPS that have particularly high or low skill with respect to the reliability or resolution of the forecast. Investigating these regions further using more complete graphical tests of statistical consistency can then indicate in what way the forecast is unreliable or has poor resolution.
Figure 5 also indicates that, in all regions, there is a significant departure from perfectly reliable forecasts for both EPS and DD forecasts. There is some spread among the RMS spread-error diagnostics generated for the perfectly reliable forecasts, but they remain close together. This indicates that the sampling uncertainty to be expected in this diagnostic for a sample of this size is small, and that the forecast–verification sample size is sufficient to detect a departure from a perfectly reliable forecast.
c. Reliability and resolution as a function of forecast error
The ES decomposition focuses on how well the PDF captures the uncertainty in the ensemble mean: spread and skew are chosen as the two binning dimensions. However, it is also interesting to consider a third dimension: the accuracy of the ensemble mean. In particular, is the skill of the spread and skew of the forecast a function of the forecast mean? For example, if the model predicts a lower than average temperature, is this accompanied by a worse than average estimate of uncertainty? Asking this question can reveal information about how to interpret and calibrate probabilistic forecasts, as well as information about model biases that can guide model development.
The model climatological mean was evaluated as a function of latitude. The forecast temperature anomaly was evaluated as the difference between the forecast and climatological means for each spatial point and each start date. The dataset was sorted according to how anomalously cold or warm the forecast conditions were, before being binned into 10 equally populated groups. The ES decomposition was evaluated for each decile, and the reliability and resolution terms are shown as a function of forecast temperature anomaly in Fig. 6. For comparison, the gray plumes in Fig. 6 indicate the expected variation in RELSS and RESSS if the dataset were randomly separated into tenths and the decomposition performed on each random tenth.
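A sketch of the decile split by forecast anomaly, and of the random split used to generate the reference plumes, is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def anomaly_groups(forecast_mean, clim_mean, n_groups=10, randomize=False):
    """Sketch of the split used in section 4c: sort the cases by forecast
    temperature anomaly (forecast mean minus model climatology) and split
    them into equally populated groups. With randomize=True the cases are
    split into random tenths instead, as for the gray plumes in Fig. 6."""
    anomaly = np.asarray(forecast_mean) - np.asarray(clim_mean)
    order = rng.permutation(anomaly.size) if randomize else np.argsort(anomaly)
    return np.array_split(order, n_groups)   # list of index arrays, one per group
```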
The figure indicates a significant variation in forecast skill as a function of forecast temperature anomaly. Anomalously cold conditions (low deciles) have worse reliability on average, whereas forecasts of anomalously warm conditions tend to be more reliable. However, the resolution of forecasts is improved if the event is more extreme—forecasts of anomalously warm or cold conditions have better resolution than forecasts of moderate conditions. The gray plume indicates that the observed differences as a function of temperature anomaly are significant.
5. Conclusions
This paper has shown how the error-spread score (ES), presented by Christensen et al. (2015), can be decomposed into three components. The score is designed to evaluate ensemble forecasts of continuous variables, and to test how well uncertainty is represented in the forecast: the score is particularly sensitive to the calibration of the forecast [i.e., how well the probabilistic forecast represents uncertainty; Christensen et al. (2015)]. In a similar manner to other proper scores, the ES can be decomposed into reliability, resolution, and uncertainty components. The ES reliability component evaluates the reliability of the forecast spread and skewness. This term is small if the forecast and verification are statistically consistent, and the moments of the ensemble forecast are a reliable indication of the statistical characteristics of the verification. Similarly, the ES resolution component evaluates the resolution of the forecast spread and shape. This term contributes negatively to the ES, so a large resolution term is desirable. This term is large if the spread and skewness of the ensemble forecast vary according to the state of the atmosphere and the predictability of the atmospheric flow. The forecast PDF from a system with high ES resolution separates forecast situations with high uncertainty (large mean square error) from those with low uncertainty. The ES uncertainty component depends only on the measured (climatological) error distribution, and is independent of the forecast spread or skewness. A large variability in the forecast error will contribute toward a larger (poorer) uncertainty component.
The decomposition of the ES was used to evaluate forecasts made using the European Centre for Medium-Range Weather Forecasts Ensemble Prediction System (EPS), and compared to a statistically produced “perfect static” forecast. The EPS was found to have skill across most of the globe. The ES decomposition attributed the improvement in skill at low latitudes to an improvement in reliability, whereas the skill at higher latitudes was due to an improvement in resolution.
The ES decomposition was used to highlight a number of regions of interest for the EPS: the RELSS and RESSS for these regions showed statistically significant differences from each other. The RMS spread-error diagnostic was calculated for these regions. The results were as expected from the ES decomposition, but also indicated in what way the forecast was reliable or showed resolution. Finally, the ES decomposition was evaluated as a function of forecast mean anomaly. The reliability and resolution of the forecast were found to vary significantly as a function of the forecast anomaly. The decomposition shown in this paper is therefore a useful tool for analyzing the source of skill in ensemble forecasts, and for identifying regions that can be investigated further using more comprehensive graphical diagnostic tools.
Acknowledgments
Thanks to J. Bröcker, C. Ferro, and M. Leutbecher for helpful discussions, and to I. Moroz and T. Palmer for their support in the writing of this paper. Thanks to two anonymous reviewers, C. Ferro, and the editor for their helpful comments. This research was supported by a NERC studentship and European Research Council Grant 291406.
APPENDIX A
The Decomposition of the Error-Spread Score
a. General decomposition
The first term, A, has been decomposed into a sum of squared terms.
b. Comments to aid interpretation
The form of the decomposition presented above does not rely on any assumptions. Two assumptions will now be considered to aid interpretation of the decomposition.
Second, consider a forecast that, in addition to having no bias, has perfect forecast spread and skew.
APPENDIX B
Mathematical Properties of Moments
APPENDIX C
Significance Testing
The significance of the results reported here will be tested in two ways in order to address two different null hypotheses.
a. Hypothesis 1: There is no significant difference between the EPS observed behavior and that of a perfectly reliable forecast
It is important to consider whether the values of the scores vary between regions simply because of the limited sample size used to evaluate the scores, or if in fact the difference indicates a significant departure from some nominal behavior. If hypothesis 1 is true, the deviation from the diagonal in Fig. 5 is due to sampling, and reveals no significant deviation from a reliable forecast in any region. Similarly, the variation in calculated RELSS between regions does not reveal a significant difference in reliability, and is as expected from sampling errors.
To test this hypothesis, a set of perfectly reliable forecasts is generated for each region considered. For each forecast within the region (as a function of both position and date), one ensemble member is randomly selected to be the verification, while the remaining members are used as the reliable ensemble. This reproduces both the spatial and temporal correlations in the data. The RMS spread-error diagnostic is evaluated for this perfectly reliable forecast, as are the REL and RES components of the ES. This process is repeated 10 000 times, where each iteration resamples one member from each forecast to be the verification.
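One iteration of this resampling can be sketched as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def perfectly_reliable_sample(ensemble):
    """Sketch of one iteration of the perfectly reliable forecast: for
    each case one ensemble member is drawn at random to act as the
    verification, and the remaining members form the reduced ensemble.
    Repeating this (10 000 times in the paper) gives the variability of
    any diagnostic expected under perfect reliability.

    ensemble : array of shape (n_cases, n_members)
    Returns (reduced_ensemble, pseudo_verification).
    """
    ensemble = np.asarray(ensemble)
    n_cases, n_members = ensemble.shape
    pick = rng.integers(n_members, size=n_cases)           # member used as "truth"
    pseudo_obs = ensemble[np.arange(n_cases), pick]
    keep = np.ones((n_cases, n_members), dtype=bool)
    keep[np.arange(n_cases), pick] = False
    reduced = ensemble[keep].reshape(n_cases, n_members - 1)
    return reduced, pseudo_obs
```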
The variation in the REL and RES skill scores for each region for this perfectly reliable forecast is shown as the pale gray histograms in Fig. C1. The skill scores for the reliable forecast vary little from iteration to iteration, and are significantly different from the skill scores calculated for the EPS forecast, shown in blue. This indicates that the forecast–verification sample size is sufficient to detect a departure from a perfectly reliable forecast.
The variation in the RMS spread-error diagnostic for this perfectly reliable forecast is shown as the pale gray plume in Fig. 5. As with the REL component of the ES, the spread-error diagnostic also indicates that forecasts for regions 1, 2, and 3 are significantly less reliable than the perfectly reliable forecast generated here.
b. Hypothesis 2: There is no significant difference between the EPS observed behavior in each region
If this hypothesis is true, the variation in RELSS and RESSS between regions is due to sampling errors from the limited number of forecast–verification pairs, and does not reveal a significant difference in the reliability and resolution of the EPS forecasts between each region.
To test whether the three regions are indistinguishable, a block bootstrap technique is used to generate 10 000 synthetic datasets. First, region 1 and region 3 are each split into two nonoverlapping subregions of approximately the same size and shape as region 2. Points within each of the resulting five subregions will be sampled together to preserve spatial correlations in the dataset. In time, the data are split into 10 nonoverlapping sections, each containing 10 consecutive forecast–verification pairs.C1 The dates within these blocks will be sampled together to preserve temporal correlations in the dataset.
A total of 20 space–time blocks of data are randomly selected with replacement for each of region 1 and 3, and 10 space–time blocks are selected for region 2, accounting for the difference in size of the original regions. The skill of EPS forecasts with respect to DD forecasts is then calculated for these randomized regions. The process is repeated 10 000 times.
The distribution of skill calculated for the resampled datasets is compared to that evaluated for each original region and shown as the dark gray histogram in Fig. C1. In addition, the difference in skill scores is calculated for each region. The distribution of differences in skill is shown in gray in Fig. 4 for the RELSS and RESSS for each region, and compared to the observed difference in skill (blue line with crisscrosses). The red lines indicate the 2.5th and 97.5th percentiles of the distribution of skill. If the magnitude of the difference in skill between the true regions is outside the red lines estimated from the bootstrap test, the difference in skill is said to be significant at the 95% level.
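A simplified sketch of the bootstrap test is given below. It assumes that a single skill value is available for each space–time block and that the skill of a resampled region is the average over its sampled blocks; in the paper the skill scores are recomputed from the resampled forecast–verification pairs themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_skill_difference(blocks_a, blocks_b, n_boot=10000):
    """Sketch of the block bootstrap test: resample the space-time blocks
    of each region with replacement, form the difference in skill between
    the two resampled regions, and return the bootstrap distribution of
    that difference together with its 2.5th and 97.5th percentiles."""
    a = np.asarray(blocks_a, dtype=float)
    b = np.asarray(blocks_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        sample_a = rng.choice(a, size=a.size, replace=True)   # resample blocks
        sample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = sample_a.mean() - sample_b.mean()
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return diffs, lower, upper
```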
REFERENCES
Berner, J., G. J. Shutts, M. Leutbecher, and T. N. Palmer, 2009: A spectral stochastic kinetic energy backscatter scheme and its impact on flow-dependent predictability in the ECMWF ensemble prediction system. J. Atmos. Sci., 66, 603–626, doi:10.1175/2008JAS2677.1.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores. Quart. J. Roy. Meteor. Soc., 135, 1512–1519, doi:10.1002/qj.456.
Christensen, H. M., I. M. Moroz, and T. N. Palmer, 2015: Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quart. J. Roy. Meteor. Soc., 141, 538–549, doi:10.1002/qj.2375.
Ferro, C. A. T., and T. E. Fricker, 2012: A bias-corrected decomposition of the Brier score. Quart. J. Roy. Meteor. Soc., 138, 1954–1960, doi:10.1002/qj.1924.
Gneiting, T., 2011: Making and evaluating point forecasts. J. Amer. Stat. Assoc., 106, 746–762, doi:10.1198/jasa.2011.r10138.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, doi:10.1198/016214506000001437.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
Isaksen, L., M. Bonavita, R. Buizza, M. Fisher, J. Haseler, M. Leutbecher, and L. Raynaud, 2010: Ensemble of data assimilations at ECMWF. Tech. Rep. 636, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom, 48 pp.
Leutbecher, M., 2010: Diagnosis of ensemble forecasting systems. Proc. Seminar on Diagnosis of Forecasting and Data Assimilation Systems, Reading, United Kingdom, ECMWF, 235–266.
Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, doi:10.1016/j.jcp.2007.02.014.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
Murphy, A. H., 1986: A new decomposition of the Brier score: Formulation and interpretation. Mon. Wea. Rev., 114, 2671–2673, doi:10.1175/1520-0493(1986)114<2671:ANDOTB>2.0.CO;2.
Palmer, T. N., R. Buizza, F. Doblas-Reyes, T. Jung, M. Leutbecher, G. J. Shutts, M. Steinheimer, and A. Weisheimer, 2009: Stochastic parametrization and model uncertainty. Tech. Rep. 598, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom, 44 pp.
Sanders, F., 1963: On subjective probability forecasting. J. Appl. Meteor., 2, 191–201, doi:10.1175/1520-0450(1963)002<0191:OSPF>2.0.CO;2.
Tödter, J., and B. Ahrens, 2012: Generalization of the ignorance score: Continuous ranked version and its decomposition. Mon. Wea. Rev., 140, 2005–2017, doi:10.1175/MWR-D-11-00266.1.
Weijs, S. V., and N. van de Giesen, 2011: Accounting for observational uncertainty in forecast verification: An information-theoretical view on forecasts, observations, and truth. Mon. Wea. Rev., 139, 2156–2162, doi:10.1175/2011MWR3573.1.
Weijs, S. V., R. van Nooijen, and N. van de Giesen, 2010: Kullback–Leibler divergence as a forecast skill score with classic reliability-resolution-uncertainty decomposition. Mon. Wea. Rev., 138, 3387–3399, doi:10.1175/2010MWR3229.1.
Young, R. M. B., 2010: Decomposition of the Brier score for weighted forecast–verification pairs. Quart. J. Roy. Meteor. Soc., 136, 1364–1370, doi:10.1002/qj.641.
Footnotes

1. Note that if the forecaster only supplies the verifier with the mean, standard deviation, and skewness instead of the full ensemble forecast, strictly speaking the score cannot be called proper, and should instead be called "consistent, though not strictly consistent, for the mean, standard deviation, and skewness" (Gneiting 2011).

2. For example, in the case of bimodal forecasts, the first three moments will be insufficient to characterize the full forecast PDF.

3. A spectral resolution of T639 corresponds to a reduced Gaussian grid of N320, or 30-km resolution, or a 0.28° latitude–longitude grid.

4. The selection of 100 dates is a pragmatic choice, since 100 bins will be used for the decomposition; this guarantees that the data may be evenly distributed between the bins. Forecasts are constrained to boreal summer, as the impact of the annual cycle on forecast skill will not be considered.

C1. The temporal correlation is approximately 0.1 after this lag.