1. Answers to comments by Wilson
Wilson’s (2000) comments concerned the way the relative operating characteristic (ROC) curve was computed and the interpretation of the area under a ROC curve in terms of forecast skill.
a. On the computation of the ROC
Wilson’s comments on how the ROC curve should be computed are more than welcome and will lead to a revision of our computation methodology.
b. On the impact of the EPS system upgrade
We agree that the impact of the Ensemble Prediction System (EPS) upgrade of December 1996 on the area under the ROC curve is less evident when Wilson’s methodology is followed. Table 1 is the equivalent of
c. On the impact of atmospheric variability between different seasons
The fact that atmospheric variability may have affected the results is clearly stated in the text where results referring to different seasons are compared (sections 5 and 6), and in the conclusions. The reader is referred to Buizza et al. (1998) for a “cleaner” comparison of different ensemble systems on the same set of 14 cases (unfortunately, with large, high-resolution ensemble systems there is a practical limit to the number of cases that can be run in different ensemble configurations).
d. On the 0.7 value for the ROC area as a limit of a useful prediction
We understand Wilson’s criticism. Generally speaking, we agree that there is a good deal of arbitrariness in the choice of a threshold value; theory justifies only the 0.5 value as the limit for a skillful system. But we feel that the use of a threshold value can ease the comparison of the performance of two different systems for a certain time period, or of one system during different periods, especially when there is a need to condense results, for example, in a few figures or tables. This is why the 0.7 threshold was introduced.
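For concreteness, a minimal sketch of a standard ROC-area computation for probabilistic forecasts of a binary event (hit rate versus false-alarm rate accumulated over probability thresholds, with the area obtained by trapezoidal integration) is given below. This is a generic illustration in Python/NumPy, not the exact procedure of Wilson (2000) or of Buizza et al. (1999b):

```python
import numpy as np

def roc_area(probs, obs, thresholds=np.linspace(0.0, 1.0, 11)):
    """Area under the relative operating characteristic (ROC) curve.

    probs: forecast probabilities of the event in [0, 1] (e.g., the
           fraction of ensemble members predicting the event), shape (n,).
    obs:   observed occurrence of the event (0 or 1), shape (n,).
    """
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=bool)
    hit_rates = []
    false_alarm_rates = []
    for t in thresholds:
        warn = probs >= t  # issue a "yes" forecast at this probability threshold
        hits = np.sum(warn & obs)
        misses = np.sum(~warn & obs)
        false_alarms = np.sum(warn & ~obs)
        correct_negatives = np.sum(~warn & ~obs)
        hit_rates.append(hits / max(hits + misses, 1))
        false_alarm_rates.append(false_alarms / max(false_alarms + correct_negatives, 1))
    # Close the curve at (0, 0), the "never warn" limit.
    hit_rates.append(0.0)
    false_alarm_rates.append(0.0)
    # Integrate hit rate over false-alarm rate with the trapezoidal rule.
    order = np.argsort(false_alarm_rates)
    far = np.array(false_alarm_rates)[order]
    hr = np.array(hit_rates)[order]
    return float(np.sum(np.diff(far) * 0.5 * (hr[1:] + hr[:-1])))
```

For a system with no skill the curve lies along the diagonal and the area is 0.5, the theoretical limit mentioned above; the 0.7 value is simply a stricter, practical threshold.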
e. On atmospheric predictability
We do not agree with the final criticism of Wilson: “If it is atmospheric predictability that is referred to, how can this be judged when no atmospheric data are used in the verification?” We discussed the problem of precipitation verification rather extensively in an appendix of Buizza et al. (1999b), where we clearly stated the limits of our approach and our belief that 0–24-h forecasts can be used as a practical way of verifying the model on a scale consistent with its formulation. The reader is referred to Mullen and Buizza (1999, manuscript submitted to Mon. Wea. Rev.) for an assessment of the performance of the European Centre for Medium-Range Weather Forecasts’ Ensemble Prediction System over the United States for a 2-yr period against a verifying field defined using observations.
2. Answers to comments by Juras
Juras (2000) raises three main points: the first concerns the choice of 0.70 as the threshold value of the area under a ROC curve for a useful prediction, the second the influence of climatic variability on skill measures, and the third the “no resolution” line of the reliability diagrams (Figs. 1c,d and 18 of Buizza et al. 1999b).
a. On the 0.7 value for the ROC area as a limit of a useful prediction
We accept the criticism. We did not intend to suggest that a ROC area of 0.7 should be accepted as a universal limit for verification; rather, we think that the identification of a threshold value can ease the comparison of the performance of two different systems for a certain time period, or of one system during different periods, especially when there is a need to condense results, for example, in a few figures or tables.
b. On the climatic variability of precipitation
Juras correctly points out that areas with different climatological frequencies of precipitation should not be combined. One way to overcome this problem would be to compute, for each grid point and for each month (or season), the climatological distribution of observed precipitation, and then to consider probabilistic predictions of precipitation events such as “precipitation beyond the last decile (or quartile)” instead of events defined by fixed (i.e., geographically constant) thresholds such as “precipitation amount greater than 2 mm day−1.”
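As a minimal sketch of this quantile-based event definition (the array shapes, the synthetic gamma-distributed data, and the choice of the last decile as the 90th percentile are illustrative assumptions, not part of our methodology):

```python
import numpy as np

# Illustrative inputs: daily observed precipitation for one calendar month
# accumulated over many years, shape (n_days, n_lat, n_lon), and one
# ensemble forecast of daily precipitation, shape (n_members, n_lat, n_lon).
rng = np.random.default_rng(0)
precip_clim = rng.gamma(shape=0.5, scale=4.0, size=(900, 20, 30))
ens_forecast = rng.gamma(shape=0.5, scale=4.0, size=(51, 20, 30))

# Last decile of the local climatology: one threshold per grid point,
# instead of a geographically constant threshold such as 2 mm/day.
local_threshold = np.quantile(precip_clim, 0.9, axis=0)  # (n_lat, n_lon)

# Probability of "precipitation beyond the last decile of the local
# climatology": fraction of ensemble members exceeding the local threshold.
event_prob = np.mean(ens_forecast > local_threshold, axis=0)
```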
Unfortunately, such gridded climatological distributions of precipitation are not available over Europe, to the authors’ knowledge. By contrast, it is easy to compute such a field for upper-level variables such as the 500-hPa geopotential height, using the so-called analysis fields (Buizza et al. 1999a).
Juras is therefore correct to say that the statement “the predictive skill is lower for small regions” reported in Buizza et al. (1999b) is inadequate. It should be kept in mind, however, that the smaller the area, the longer the sample needed to achieve significant results; this is one of the reasons why, when considering areas as large as Europe, one can find a larger number of independent events.
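To make this sampling point concrete, a hedged illustration: a percentile bootstrap over forecast-observation pairs (reusing the hypothetical roc_area function from the sketch in section 1d) shows how the uncertainty of a ROC area grows as the number of independent events shrinks:

```python
import numpy as np

def bootstrap_roc_ci(probs, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the ROC area.

    Resamples forecast-observation pairs with replacement. With fewer
    independent events the interval widens, so over a small region a
    longer sample is needed before the ROC area can be distinguished
    from the 0.5 no-skill value. Reuses roc_area from the earlier sketch.
    """
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=int)
    rng = np.random.default_rng(seed)
    n = obs.size
    areas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        areas[i] = roc_area(probs[idx], obs[idx])
    lo, hi = np.quantile(areas, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)
```

With only a few dozen independent events the resulting interval can easily span 0.5, so a small region may fail to demonstrate skill that the same system shows over an area as large as Europe.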
REFERENCES
Buizza, R., T. Petroliagis, T. N. Palmer, J. Barkmeijer, M. Hamrud, A. Hollingsworth, A. Simmons, and N. Wedi, 1998: Impact of model resolution and ensemble size on the performance of an ensemble prediction system. Quart. J. Roy. Meteor. Soc., 124, 1935–1960.
——, J. Barkmeijer, T. N. Palmer, and D. S. Richardson, 1999a: Current status and future developments of the ECMWF Ensemble Prediction System. Meteor. Appl., 6, 1–14.
——, A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999b: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. Wea. Forecasting, 14, 168–189.
Juras, J., 2000: Comments on “Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System.” Wea. Forecasting, 15, 365–366.
Wilson, L. J., 2000: Comments on “Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System.” Wea. Forecasting, 15, 361–364.
TABLE 1. Forecast time (day) at which the Brier skill score first crosses the zero line.