1. Introduction
The relative (or receiver) operating characteristic (ROC) is a representation of the skill of a forecasting system in which the hit rate and the false-alarm rate are compared (Swets 1973; Mason 1982). The related ROC score is often used to evaluate the quality of probability forecasts (Stanski et al. 1989; Buizza and Palmer 1998; Mason and Graham 1999).
The ROC has its origins in signal detection theory [e.g., see the review by Swets (1973)]. Typically, the ROC is described in terms of the parameters of two hypothetical, often Gaussian, probability distributions (e.g., Mason 1982; Harvey et al. 1992). One distribution represents the evidence strength associated with the occurrence of the event, and the other represents the nonevent distribution. The two distributions are used to obtain the hit rate and the false-alarm rate as a function of the decision criterion. In the present study, the ROC is derived in the framework of a simple analog of a climate forecasting system (Kharin and Zwiers 2003, hereinafter KZ2003) and the link between the ROC score and deterministic potential predictability (Zwiers 1996; Rowell 1998) is established.
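In this signal-detection framing, the hit rate and false-alarm rate at a given decision criterion are simply the right-tail probabilities of the event and nonevent evidence distributions. The following is a minimal sketch, assuming unit-variance Gaussian evidence distributions with hypothetical means (1 for events, 0 for nonevents); the function names are illustrative, not from the paper:

```python
from math import erf, sqrt

def gauss_tail(x, mu, sigma=1.0):
    """P(X > x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 - erf((x - mu) / (sigma * sqrt(2.0))))

def hr_far(criterion, mu_event=1.0, mu_nonevent=0.0):
    """Hit rate and false-alarm rate when a warning is issued
    whenever the evidence strength exceeds the criterion."""
    hr = gauss_tail(criterion, mu_event)      # P(evidence > c | event)
    far = gauss_tail(criterion, mu_nonevent)  # P(evidence > c | nonevent)
    return hr, far

# Sweeping the criterion from large to small traces out the ROC curve
# from (0, 0) toward (1, 1).
curve = [hr_far(c / 10.0) for c in range(40, -41, -1)]
```

Because the event distribution lies to the right of the nonevent distribution, the hit rate exceeds the false-alarm rate at every criterion, so the curve bows above the no-skill diagonal.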
The ROC score is relatively independent of forecast calibration, that is, the correspondence between the forecast probability and observed relative frequency. It is qualitatively similar to resolution, that is, the ability of the forecast system to discriminate between event occurrences and nonoccurrences [see, e.g., Mason (1982) and, more recently, Wilks (2001) and Mullen and Buizza (2001)]. The insensitivity to some types of forecast biases is demonstrated here in the context of a simple analog of a climate forecasting system. In particular, it is shown that under certain conditions the ROC score is a function of the properties of the observed system only.
The outline of this note is as follows. The ROC curve and score are derived and discussed in section 2. Some findings are illustrated with a collection of 24-member ensemble hindcasts of seasonal mean 700-hPa temperature in section 3, followed by a summary in section 4.
2. Relative operating characteristic
In this section, a simple analog of a climate forecasting system is introduced and expressions for the hit rate and false-alarm rate, the essential ingredients in the ROC curve definition, are derived. Examples of ROC curves are given for a “perfect” forecasting system in a Gaussian setting, and the relationship between the ROC score and potential predictability is illustrated.
a. Climate forecasting system analog
b. The hit rate and false-alarm rate
We begin by considering a deterministic forecasting system in which a warning is issued when an event is predicted to occur. The operation of such a forecasting system can be summarized in a 2 × 2 contingency table (Table 1). Assume that there are a total of N forecasts and verifying observations. Let the total number of warnings issued be W and the number of nonwarnings be W′ = N − W. Similarly, let O be the number of events that occurred, and let O′ = N − O be the number of nonevents. Also let H be the number of hits, for which an event occurred and a warning was issued; let FA be the number of false alarms, for which a warning was issued but an event did not occur; let M be the number of misses, for which an event occurred but a warning was not issued; and let CR be the number of correct rejections, for which an event did not occur and a warning was not issued.
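The counts in the contingency table determine the two rates that define the ROC: the hit rate HR = H/O and the false-alarm rate FAR = FA/O′. A small sketch, with hypothetical counts chosen only for illustration:

```python
def rates(H, FA, M, CR):
    """Hit rate and false-alarm rate from a 2 x 2 contingency table.
    H: hits, FA: false alarms, M: misses, CR: correct rejections."""
    O = H + M          # number of events that occurred
    O_prime = FA + CR  # number of nonevents
    return H / O, FA / O_prime

# Hypothetical counts with N = 100 forecasts:
hr, far = rates(H=30, FA=10, M=10, CR=50)
# hr = 0.75, far ≈ 0.167
```

Note that HR conditions on the observed events (H out of O) while FAR conditions on the nonevents (FA out of O′), so the two rates are computed over disjoint subsets of the N forecasts.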
c. The ROC curve
A new contingency table and the corresponding HR and FAR can be determined for every probability threshold Pcr. A ROC curve is obtained by varying Pcr and plotting the resulting HRs versus the FARs. The ROC curve for no-skill forecasts coincides with the 45° line from the origin, and that for perfect forecasts connects the points (0, 0), (0, 1), and (1, 1). For deterministic forecasts, a ROC curve can be constructed by plotting the HR and FAR for the deterministic forecast system and connecting this point to the points obtained for a forecast system issuing perpetual warnings (1, 1) and a perpetual no-warnings forecast system (0, 0) (e.g., Mason and Graham 1999). Examples of ROC curves are displayed in Fig. 1. These curves are constructed for probability forecasts of three equiprobable categories (below normal “B,” near normal “N,” and above normal “A”) produced with a perfect forecasting system (β′ = β and
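The construction described above can be sketched as follows: for each threshold Pcr, a warning is issued whenever the forecast probability exceeds Pcr, a (FAR, HR) point is computed, and the curve is closed with the perpetual-warnings point (1, 1) and the perpetual no-warnings point (0, 0). The forecast probabilities and outcomes below are hypothetical:

```python
def roc_curve(probs, events, thresholds):
    """(FAR, HR) points obtained by issuing a warning whenever the
    forecast probability exceeds the threshold Pcr, for each Pcr."""
    n_events = sum(events)
    n_nonevents = len(events) - n_events
    points = [(1.0, 1.0)]  # perpetual warnings: every HR and FAR is 1
    for pcr in thresholds:
        warn = [p > pcr for p in probs]
        hits = sum(w and e for w, e in zip(warn, events))
        fas = sum(w and not e for w, e in zip(warn, events))
        points.append((fas / n_nonevents, hits / n_events))
    points.append((0.0, 0.0))  # perpetual no-warnings
    return points

# Hypothetical forecasts of a dichotomous event:
points = roc_curve(probs=[0.9, 0.8, 0.3, 0.1],
                   events=[True, True, False, False],
                   thresholds=[0.2, 0.5, 0.8])
```

Raising Pcr moves the operating point toward (0, 0); lowering it moves the point toward (1, 1), which is why the two trivial forecast systems supply the curve's endpoints.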
d. The ROC score
Figure 2 displays the ROC skill score of three-category probability forecasts produced with the perfect forecasting system in the Gaussian setting as a function of the potential predictability
Expressions for the HR and FAR [(7), (8)] demonstrate clearly that the ROC score is insensitive to some types of forecast biases, a fact that is already well known to the meteorological community, as mentioned in the introduction. In particular, any two forecasts that are related as P2(β) = P1(aβ), a ≠ 0, have identical ROC curves and equal ROC scores. Rescaling the signal-to-noise ratio in the forecast system (1) does not change the ROC score. Thus, the ROC score should be regarded as a measure of potential rather than actual skill.
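This insensitivity can be checked numerically: any strictly increasing transformation of the forecast probabilities leaves the ranking of the forecasts, and hence the ROC curve and the area beneath it, unchanged. A sketch of this related illustration (the rank-based area computation and the forecast data are hypothetical, not from the paper):

```python
def roc_area(scores, events):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen event receives a higher score
    than a randomly chosen nonevent (ties count one-half). The result
    depends only on the ranking of the scores."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.9, 0.7, 0.6, 0.4, 0.2]          # hypothetical forecast probabilities
events = [True, True, False, True, False]
rescaled = [p ** 3 for p in probs]         # a strictly increasing transform
# roc_area(probs, events) == roc_area(rescaled, events)
```

A calibration adjustment that merely rescales the forecast signal is such a monotone transformation, which is why it can improve the Brier score while leaving the ROC score untouched.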
KZ2003 discuss a statistical technique for improving biased probability forecasts. The technique assumes a biased system of the form (1) and adjusts the amplitude of the forecast signal and noise to minimize the corresponding Brier score (i.e., the mean-square error of probability forecasts). It is clear from the above considerations that the ROC score cannot be improved by such a technique. On the contrary, the cross-validated ROC scores of statistically improved forecasts are likely to be degraded by sampling errors in the parameters of the statistical improvement technique.
3. ROC scores of HFP hindcasts
In this section we describe the ROC scores of probability hindcasts of 700-hPa temperature (T700) derived from a collection of 24-member ensemble hindcasts. The hindcasts were produced for 26 northern winters [December–January–February (DJF)] for the 1969–95 period with the second-generation general circulation model of the Canadian Centre for Climate Modelling and Analysis (McFarlane et al. 1992). These integrations were performed as part of the Canadian Historical Forecast Project (HFP; Derome et al. 2001). Each HFP ensemble member is initialized from reanalyzed fields (Kalnay et al. 1996) lagged at 6-h intervals prior to the forecast season. The monthly mean sea surface temperature anomalies observed in the month prior to the forecast period are “persisted” throughout the forecast season. These anomalies are obtained from the Global Sea Ice and Sea Surface Temperature (GISST) dataset (version 2.2; Rayner et al. 1996). Sea ice extent is specified from climatological data. The initial snow line in the Northern Hemisphere is specified from National Centers for Environmental Prediction satellite observations for the week before the forecast period. The soil conditions are initialized from climatological data.
The improved hindcasts P̂
Figure 3 shows the ROC curves of the unadjusted Gaussian and improved probability hindcasts in the Tropics and in the North American sector. The ROC skill score for each category is indicated in the upper-left corner of each diagram. The ROC curves for the probability hindcasts in the Tropics deviate farther from the no-skill diagonal toward the upper-left corner than those for the North American sector, indicating better skill in the Tropics. The hindcasts of the A category are somewhat more skillful than those of the B category, and the hindcasts of the N category are substantially less skillful than those of the other two categories. These results are consistent with the findings in KZ2003.
The statistically improved hindcasts have slightly smaller ROC skill scores than unadjusted hindcasts, a result that is not unexpected given the discussion of the ROC score properties in section 2 in which it was argued that the ROC score cannot be improved by adjusting the amplitude of the predictable signal. The degradation of the ROC skill scores of the “improved” hindcasts is apparently the result of sampling errors introduced by the statistical technique.
4. Summary
The properties of the ROC curve and ROC score have been discussed in the context of a simple analog of a seasonal forecasting system in which the total variability may be decomposed into two components, one that is associated with the potentially predictable signal and another that is unpredictable. The relationship between the ROC skill score and the potential predictability, defined as the ratio of the potentially predictable signal variance to the total variance, is established in the Gaussian setting. It is shown that there is a one-to-one correspondence between the deterministic potential predictability and the ROC scores of optimal probability forecasts. This correspondence depends on the definition of the forecast event of interest. In particular, the same level of potential predictability results in a smaller ROC score for the N category than that for the A or B category.
The insensitivity of the ROC score to certain types of forecast biases is demonstrated in the framework of the simple forecasting system. In an idealized setting in which the model-simulated signal is a function of the observed potentially predictable signal only, the ROC score is shown to be a function of the properties of the observed system only. If sampling variability can be neglected, probability forecasts derived from a forecast system that systematically under- or oversimulates the amplitude of the potentially predictable signal have the same ROC score as perfectly reliable probability forecasts. That is, the ROC score does not penalize forecasting biases. Thus, care is needed when interpreting and intercomparing the performance of model forecasts based on the ROC score.
The findings are illustrated with a collection of 24-member ensemble seasonal hindcasts of 700-hPa temperature. It is shown that although statistically improved temperature hindcasts are more reliable (less biased) than the unadjusted hindcasts, the ROC scores of the statistically improved hindcasts are slightly degraded as compared with those of the unadjusted hindcasts.
REFERENCES
Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126 , 2503–2518.
Char, B. W., K. O. Geddes, G. H. Gonnet, B. L. Leong, M. B. Monagan, and S. M. Watt, 1991: Maple V Library Reference Manual. Springer, 698 pp.
Derome, J., G. Brunet, A. Plante, N. Gagnon, G. J. Boer, F. W. Zwiers, S. Lambert, and H. Ritchie, 2001: Seasonal predictions based on two dynamical models. Atmos.–Ocean, 39 , 485–501.
Harvey, L. O., Jr., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: Application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120 , 863–883.
Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. Bull. Amer. Meteor. Soc., 77 , 437–471.
Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. J. Climate, 16 , 1684–1701.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30 , 291–303.
Mason, S. J., and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Wea. Forecasting, 14 , 713–725.
McFarlane, N. A., G. J. Boer, J-P. Blanchet, and M. Lazare, 1992: The Canadian Climate Centre second-generation general circulation model and its equilibrium climate. J. Climate, 5 , 1013–1044.
Mullen, S. L., and R. Buizza, 2001: Quantitative precipitation forecasts over the United States by the ECMWF ensemble prediction system. Mon. Wea. Rev., 129 , 638–663.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115 , 1330–1338.
Rayner, N. A., E. B. Horton, D. E. Parker, C. K. Folland, and R. B. Hackett, 1996: Version 2.2 of the Global Sea-Ice and Sea Surface Temperature Data Set, 1903–1994. Hadley Centre Climate Research Tech. Note CRTN 74, 21 pp.
Rowell, D. P., 1998: Assessing potential predictability with an ensemble of multidecadal GCM simulations. J. Climate, 11 , 109–120.
Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. WMO World Weather Watch Tech. Rep. 8, WMO TD 358, 114 pp.
Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182 , 990–1000.
Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8 , 209–219.
Zwiers, F. W., 1996: Interannual variability and predictability in an ensemble of AMIP climate simulations conducted with the CCC GCM2. Climate Dyn., 12 , 825–848.
Fig. 1. ROC curves for probability forecasts of the (left) below- or above-normal category and (right) near-normal category from a perfect forecasting system (1) for
Citation: Journal of Climate 16, 24; 10.1175/1520-0442(2003)016<4145:OTRSOP>2.0.CO;2
Fig. 2. The ROC skill score SROC of the probability forecasts from a perfect forecasting system (1) as a function of potential predictability
Fig. 3. The ROC curves for the probability hindcasts of the A, N, and B categories derived from the 24-member HFP ensemble DJF T700 hindcasts. The HR and FAR are averaged over (left) the Tropics and (right) the North American sector. (top) The unadjusted Gaussian hindcasts P̂
Table 1. The two-by-two contingency table for verification of a dichotomous forecasting system.