1. Introduction
Signal detection theory (Green and Swets 1966) is a technique used to assess the quality of a classifier and is widespread in many disciplines, such as signal processing, pattern recognition, medical diagnosis, psychological testing, and weather forecasting. In forecast verification a contingency table (e.g., Wilks 2006) is built for assessing the quality of a binary classifier, which is a tool to forecast the occurrence of an event. Let us introduce these concepts in more detail.
Instead, what is usually done is to consider the two conditional probability densities, obtained by normalizing fY and fN by the prior probabilities. The first conditional density, also called the event likelihood, is that of observing a value x given that the event occurs: p(x | YES) = fY(x)/PYES. The second conditional density, namely the nonevent likelihood, is that of observing a value x given that the event does not occur: p(x | NO) = fN(x)/PNO.
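The relation between the joint components and the likelihoods can be sketched in a few lines of Python. The numbers below are hypothetical, not the paper's data; the point is only that dividing each component by its prior turns it into a proper conditional density that sums to one.

```python
# Sketch with hypothetical numbers: recovering the two likelihoods from the
# joint components fY, fN and the prior probabilities.

# Joint components: fractions of ALL N cases falling in each predictor bin,
# split by event occurrence (YES) and nonoccurrence (NO).
fY = {0: 0.01, 1: 0.04, 2: 0.05}   # sums to PYES = 0.10
fN = {0: 0.50, 1: 0.30, 2: 0.10}   # sums to PNO  = 0.90

P_YES = sum(fY.values())
P_NO = sum(fN.values())

# Likelihoods: conditional densities p(x | YES) and p(x | NO).
p_x_given_yes = {x: v / P_YES for x, v in fY.items()}
p_x_given_no = {x: v / P_NO for x, v in fN.items()}

# Each conditional density now sums to 1 over the predictor domain.
assert abs(sum(p_x_given_yes.values()) - 1.0) < 1e-9
assert abs(sum(p_x_given_no.values()) - 1.0) < 1e-9
```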
If a threshold, let us say xT, is chosen, then the continuous forecast is dichotomized into yes–no classes: a yes forecast is issued when x ≥ xT and a no forecast otherwise.
In this way, it is finally possible to build the contingency table and derive many statistical measures of the classifier skill, as is done for the example in Table 1. In this case the event studied is the occurrence of at least 20 mm of rain during a 6-h period in the Friuli Venezia Giulia region (northeast Italy, hereafter FVG) using as a simple predictor the mid- to low-level relative humidity (MRH) derived from the lowest 500 hPa of the Udine radiosounding, released at the beginning of the 6-h period. These data will be used in section 3.
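A minimal Python sketch (with synthetic data, not the FVG sample) shows how a threshold on a continuous predictor produces the four contingency-table counts and the Peirce skill score derived from them:

```python
# Synthetic example: thresholding a predictor into a 2x2 contingency table
# and computing the Peirce skill score (PSS).

def contingency(values, occurred, threshold):
    """Count hits a, false alarms b, misses c, and correct negatives d
    for the rule: forecast YES when value >= threshold."""
    a = b = c = d = 0
    for x, obs in zip(values, occurred):
        forecast_yes = x >= threshold
        if forecast_yes and obs:
            a += 1          # hit
        elif forecast_yes and not obs:
            b += 1          # false alarm
        elif not forecast_yes and obs:
            c += 1          # miss
        else:
            d += 1          # correct negative
    return a, b, c, d

def pss(a, b, c, d):
    """Peirce skill score: hit rate minus false-alarm rate."""
    return a / (a + c) - b / (b + d)

# Hypothetical predictor values and observed occurrences (0/1).
values = [10, 30, 50, 70, 80, 90, 95, 99]
occurred = [0, 0, 0, 0, 1, 0, 1, 1]
a, b, c, d = contingency(values, occurred, 75)
print(pss(a, b, c, d))  # -> 0.8
```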
2. The maximum Peirce skill score
In ORP it has also been shown that the threshold that maximizes PSS, let us say xP, identifies the point on the ROC curve having the maximum distance H from the no-skill bisector line.
Richardson (2000) has already shown how the classifier forecast relative value (relative to a climatological forecast) is equal to its PSS and is then maximized at xP.
Finally, in ORP it was shown that PSS is also equitable, in the sense introduced by Gandin and Murphy (1992), for an asymmetrical scoring matrix. So, it was suggested that all of these properties make the maximum Peirce skill score, PSS(xP), a good scalar measure of the absolute classifier skill.
a. The xP and the likelihood ratio
If the two conditional probabilities overlap in part of their domain and are unimodal, then there is only one threshold that maximizes PSS: the point where they intersect. At this threshold the ratio of the components is fN(xP)/fY(xP) = α, while the likelihood ratio is Λ(xP) = p(xP | YES)/p(xP | NO) = 1.
Finally, note that even if the threshold that maximizes PSS varies as the event climatology changes, as stated before, the maximum PSS value itself theoretically does not. Woodcock (1976) has shown that PSS does not change if one uses a database with NYES ≠ NNO or a random subsample of it that is “equalized” with respect to the event and nonevent frequency, so that NsubYES = NsubNO. In that case, the underlying likelihood distributions, that is, the shapes of p(x | YES) and p(x | NO), are the same for the original and the equalized datasets.
Also in ORP it was shown, for Gaussian likelihoods, how the maximum PSS does not change with varying α if the likelihood means and standard deviations are the same. In any case, it seems more likely to obtain a higher maximum PSS for rare events than for near-equiprobable event problems, because the likelihood shapes are more likely to change. Thus, the event climatology must be shown when reporting the classifier skill, because the maximum PSS can vary for events with different frequencies.
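Both properties can be checked numerically. In the sketch below (a toy example, not from the paper: two hypothetical Gaussian likelihoods N(2, 1) for events and N(0, 1) for nonevents), PSS is written purely in terms of the likelihoods, so the priors drop out, and a grid search recovers the maximizing threshold at the point where the two densities intersect (x = 1 for equal standard deviations):

```python
import math

def norm_sf(x, mu, sigma):
    """Gaussian survival function P(X > x), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# Hypothetical Gaussian likelihoods: p(x|YES) ~ N(2,1), p(x|NO) ~ N(0,1).
def pss(t):
    hit_rate = norm_sf(t, 2.0, 1.0)          # P(x > t | YES)
    false_alarm_rate = norm_sf(t, 0.0, 1.0)  # P(x > t | NO)
    # Note: no prior probabilities appear, so max PSS is unchanged
    # by the event climatology as long as the likelihood shapes are fixed.
    return hit_rate - false_alarm_rate

# Grid search for the PSS-maximizing threshold.
grid = [i / 1000.0 for i in range(-2000, 4000)]
t_best = max(grid, key=pss)
print(t_best)  # -> 1.0, where the two likelihoods intersect
```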
b. The xP and the event posterior probability
Equation (11) can be particularly useful when X is the output of a complex model that correctly estimates the event posterior probability. For example, if an artificial neural network is used to predict the event posterior probability from the value of many different predictors (inputs), then one can simply choose as an output threshold the event prior, PYES, to dichotomize the forecast in yes–no classes maximizing the PSS of the model.
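This rule can be verified in a small Bayes-theorem sketch. The setup is hypothetical (Gaussian likelihoods N(2, 1) and N(0, 1), prior PYES = 0.1, not the paper's model): thresholding the calibrated posterior at the event prior fires exactly where the two likelihoods are equal, i.e., at the PSS-maximizing threshold.

```python
import math

def gauss(x, mu, sigma):
    """Gaussian probability density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

P_YES = 0.1            # hypothetical event prior probability
P_NO = 1.0 - P_YES

def posterior(x):
    """Bayes' theorem: p(YES | x) for likelihoods N(2,1) vs N(0,1)."""
    num = P_YES * gauss(x, 2.0, 1.0)
    return num / (num + P_NO * gauss(x, 0.0, 1.0))

# The rule "forecast YES when posterior >= P_YES" switches on exactly at the
# likelihood crossing point (x = 1 here), where the posterior equals the prior.
print(posterior(1.0))  # equals P_YES
```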
In Manzato (2005a) a method for transforming a predictor X into its event posterior probability was shown. It is interesting to note that (10) shows how the event posterior probability is a monotonic transformation of the likelihood ratio Λ (Kupinski et al. 2001). It is straightforward to show that the ROC curve is invariant for any monotonic transformation of the thresholded variable, because it is just a relabeling of the threshold (e.g., Green and Swets 1966). This means that converting the original X variable into its posterior probability or into its likelihood ratio will produce the same ROC curve for the new variable.
The fact that the same ROC is obtained from either the posterior probability or the likelihood ratio transformation is important because the ROC curve obtained using the likelihood ratio Λ(x) as a mapping function is the optimal ROC, that is, a curve that will always lie on or above the ROC curve made with the original X values, or with any other transformation of X (Green and Swets 1966; Zhang and Mueller 2005). This is a consequence of the Neyman–Pearson criterion (Neyman and Pearson 1928). Therefore, it can be said that transforming the predictor X into its p(YES | x), as was done in Manzato (2005a), is an optimal preprocessing choice for classification problems.
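The invariance under monotonic relabeling is easy to demonstrate on a finite sample. The sketch below uses a small hypothetical dataset: the set of (false-alarm rate, hit rate) points is unchanged by any strictly increasing transform of the predictor, because the ordering of the cases, and hence every thresholded partition, is preserved.

```python
import math

# Hypothetical finite sample: predictor values and observed occurrences (0/1).
values = [0.2, 0.5, 1.1, 1.7, 2.3, 2.9, 3.4, 4.0]
occurred = [0, 0, 0, 1, 0, 1, 1, 1]

def roc_points(vals, obs):
    """Set of (false-alarm rate, hit rate) pairs over all sample thresholds."""
    n_yes = sum(obs)
    n_no = len(obs) - n_yes
    pts = set()
    for t in vals:
        hits = sum(1 for v, o in zip(vals, obs) if v >= t and o)
        false_alarms = sum(1 for v, o in zip(vals, obs) if v >= t and not o)
        pts.add((false_alarms / n_no, hits / n_yes))
    return pts

# Any strictly increasing map (here exp) is just a relabeling of thresholds.
transformed = [math.exp(v) for v in values]
print(roc_points(values, occurred) == roc_points(transformed, occurred))  # -> True
```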
3. A practical example
Let us show how the previous properties can be applied in a concrete example. Figure 1a shows the two estimated conditional components (normalized by N) of the sounding-derived mean relative humidity in the lowest 500 hPa. These two histograms are built by splitting the normalized histogram of all the MRH values into the cases when rainfall greater than 20 mm occurred (in the 6 h after the sounding release) in the FVG plain, fY(MRH), and the cases when it did not, fN(MRH). The conditional probability densities, or likelihoods, could be obtained by dividing these components by the prior probabilities estimated from the data sample with (1).
Varying the threshold over the whole MRH domain, one can compute many contingency tables and their derived PSSs. The vertical dashed line in Fig. 1a indicates the threshold (MRHP ≅ 71%) that gives the maximum PSS in this empirical way and sets the four coefficients of the contingency table shown in Table 1. The corresponding ratio of the two conditional components is 23, while the sample estimate of α is 21, which is quite close. The ratio of the two conditional probabilities (figure not shown) is 1.1, which is very close to the theoretical value (Λ = 1). These small discrepancies arise because continuous density functions have been approximated by discrete histograms.
Figure 1b shows the sample estimate posterior probability fit derived for the same dataset. The small circles are obtained by splitting the MRH domain into 21 equal bins and then, for each single bin, dividing the number of event occurrences by the number of total cases populating the bin. The continuous black line is a two-parameter exponential fit of these points, weighted with the population of each bin. Other details on how to interpret this kind of figure and the fitting line can be found in Manzato (2005a).
The gray dashed horizontal line shows the event prior probability, PYES = 1/(1 + α) ≅ 0.045, while the gray dashed vertical line shows the threshold MRHP, which empirically maximizes PSS. These two lines intersect very close to the fitted posterior probability line. So, instead of empirically computing the threshold that maximizes PSS, one can transform the original values into their posterior probability (which means issuing a calibrated event probability forecast) and then use the event prior probability as the threshold that maximizes PSS. Because of the optimality of the posterior probability-derived ROC, this method has the advantage that the PSS will be greater than the original one when the posterior probability fit is not monotonic.
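The binned sample estimate of the posterior probability (the small circles in Fig. 1b) can be sketched as follows. The data are synthetic, not the sounding sample, and the function name is ours: split the predictor range into equal bins and take the event frequency in each populated bin.

```python
# Binned sample estimate of p(YES | x): event frequency per equal-width bin.
# Synthetic data; empty bins are reported as None.

def binned_posterior(values, occurred, n_bins, lo, hi):
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    events = [0] * n_bins
    for v, o in zip(values, occurred):
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the upper edge
        counts[i] += 1
        events[i] += o
    return [events[i] / counts[i] if counts[i] else None for i in range(n_bins)]

values = [5, 15, 15, 25, 25, 25, 35, 35]
occurred = [0, 0, 0, 0, 1, 1, 1, 1]
print(binned_posterior(values, occurred, 4, 0, 40))  # -> [0.0, 0.0, 0.666..., 1.0]
```

In the paper these binned frequencies are then fitted (weighted by bin population) with a two-parameter exponential curve; any smooth monotonic fit would play the same role here.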
Figure 2 shows the ROC curve and the point corresponding to PSS(MRHP). The value of the Peirce skill score is given by the vertical and the horizontal gray segments, while the third steep segment shows the maximum distance H from the no-skill bisector. In this particular case, one obtains the same ROC curve if the sample estimate event posterior probability of MRH is used as the thresholded variable, because p(YES | MRH) is monotonic.
For an example of a nonmonotonic transformation, if we consider the vertical component of the water vapor flux in the lowest 3 km (VFlux), then we obtain a fY(VFlux) function with a minimum around the mean VFlux value and two maxima around the range extremes. This leads to a U-shaped posterior probability fit. The threshold that maximizes PSS on VFlux leads to a PSS of 0.56, while using the posterior probability transformed data leads to a higher ROC curve, with a maximum PSS of 0.57.
4. Conclusions
The threshold that maximizes the Peirce skill score identifies the point on the ROC curve that has the maximum distance H from the no-skill line. In this sense, it is the ROC point that maximizes the skill of the classifier. As shown by Richardson (2000), this threshold maximizes the forecast relative value. It has been shown how the likelihood ratio and the event posterior probability are known a priori for that particular threshold. In particular, at the threshold that maximizes PSS, the two likelihoods have the same value (Λ = 1) and the event posterior probability is equal to the event prior probability.
These results support the conclusion made in ORP that it seems reasonable to use the maximum PSS as a scalar measure of the absolute classifier skill, together with an estimate of the event climatology (like PYES or α). This can be particularly useful when comparing different classifiers, especially when applied to different datasets. Of course, this does not mean that the threshold that maximizes PSS is always the best one for all end-user purposes, which can be differently sensitive to false alarms or missed events. This is because of the complex relationship between forecast skill and forecast value (e.g., Roebber and Bosart 1996; Semazzi and Mera 2006).
Acknowledgments
The author thanks his friend Luciano Sbaiz (EPFL, Lausanne, Switzerland) and Prof. Matthew Kupinski (Optical Sciences, The University of Arizona, Tucson, Arizona) for their support via e-mail. Three anonymous reviewers provided very useful suggestions to improve an earlier version of this note. This work was done using only open-source software, in particular the R statistical software package, the Python scripting language, the Emacs editor, and the LaTeX typesetting system, under the Linux Ubuntu distribution.
REFERENCES
Bayes, T., 1763: An essay towards solving a problem in the doctrine of chances. Philos. Trans. Roy. Soc. London, 53, 370–418.
Choi, B. C. K., 1998: Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. Amer. J. Epidemiol., 148, 1127–1132.
Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. Mon. Wea. Rev., 120, 361–370.
Green, D. M., and J. A. Swets, 1966: Signal Detection Theory and Psychophysics. J. Wiley and Sons, 455 pp. (Reprinted by R. E. Krieger Publishing, 1974.)
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. J. Wiley and Sons, 240 pp.
Katz, R. W., and M. Ehrendorfer, 2006: Bayesian approach to decision making using ensemble weather forecasts. Wea. Forecasting, 21, 220–231.
Kupinski, M., D. C. Edwards, M. L. Giger, and C. Metz, 2001: Ideal observer approximation using Bayesian classification neural networks. IEEE Trans. Med. Imaging, 20, 886–899.
Manzato, A., 2005a: The use of sounding-derived indices for a neural network short-term thunderstorm forecast. Wea. Forecasting, 20, 896–917.
Manzato, A., 2005b: An odds ratio parameterization for ROC diagram and skill score indices. Wea. Forecasting, 20, 918–930.
Manzato, A., 2007: Sounding-derived indices for neural network based short-term thunderstorm and rainfall forecasts. Atmos. Res., 83, 349–365.
Mason, I. B., 1979: On reducing probability forecasts to yes/no forecasts. Mon. Wea. Rev., 107, 207–211.
Mason, I. B., 2003: Binary events. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., J. Wiley and Sons, 37–76.
Neyman, J., and E. S. Pearson, 1928: On the problem of the most efficient test of statistical hypotheses. Philos. Trans. Roy. Soc. London, 231, 289–337.
Peirce, C. S., 1884: The numerical measure of the success of predictions. Science, 4, 453–454.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–667.
Richardson, D. S., 2003: Economic value and skill. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., J. Wiley and Sons, 165–188.
Roebber, P. J., and L. F. Bosart, 1996: The complex relationship between forecast skill and forecast value: A real-world analysis. Wea. Forecasting, 11, 544–558.
Semazzi, F. H. M., and R. J. Mera, 2006: An extended procedure for implementing the relative operating characteristic graphical method. J. Appl. Meteor. Climatol., 45, 1215–1223.
Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182, 990–1000.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. Academic Press, 648 pp.
Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104, 1209–1214.
Zhang, J., and S. T. Mueller, 2005: A note on ROC analysis and non-parametric estimate of sensitivity. Psychometrika, 70, 145–154.
Fig. 1. (a) The two conditional component histograms of the mean relative humidity, fN(MRH) and fY(MRH), together with the threshold that maximizes the PSS (about 71%). The four areas produced by the threshold line correspond to the contingency table coefficients, normalized by N. (b) The sample estimate event posterior probability derived from the previous histograms and its theoretical value PYES ≅ 0.045 for the threshold that maximizes PSS. The vertical bars indicate how populated each bin interval is (see Manzato 2005a for more details). The continuous and dashed tick marks identify the mean values of the nonoccurrence and occurrence samples, respectively, while the numbers in between are their differences divided by the (95%–5%) quantiles interval.
Citation: Weather and Forecasting 22, 5; 10.1175/WAF1041.1
Fig. 2. The ROC curve obtained by this binary classifier and the segments that show the maximum PSS and the maximum distance H from the no-skill bisector line.
Table 1. The contingency table and the derived scores obtained for the rainfall > 20 mm classifier built using the mean relative humidity in the lowest 500 hPa of the Udine sounding and the threshold that optimizes the PSS. A total of 18 555 soundings (from 1992 to 2005, 4 times per day) without missing MRH have been used. The rain occurrence was measured in the 6 h after the sounding release.
It is interesting to note that Eq. (14) of Semazzi and Mera (2006) extends the PSS definition to a new skill score computed as the vertical distance between the ROC point and a generic “baseline,” which can be different from the bisector line because it takes into account the user-defined loss–cost ratio.
In general, this optimal point is not necessarily the nearest to the top-left corner of the ROC diagram. It surely is for symmetric ROCs, like those obtained for Gaussian likelihoods with the same standard deviation.
As shown in ORP, if α = 1, then PSS = HSS. In other cases, it is not possible to find similar properties for HSS.
An example of this behavior can be found in Fig. 11 of Manzato (2007), where the maximum PSS of a regression neural network increases almost linearly as the event prior probability decreases.
Katz and Ehrendorfer (2006) have shown how (12) is the Bayesian estimator of the event probability in case of a uniform prior distribution, while the “face value,” PYES, is obtained for a prior that is a limiting case of a beta distribution.