1. Introduction
Consider the problem of assessing the quality of forecasts produced for binary observations (here labeled 0 and 1). The forecast quantity may be a continuous quantity ranging from −∞ to +∞, or it may be a probability, ranging from 0 to 1. It was shown by Murphy and Winkler (1987, 1992) that this problem is best cast into a framework based on the joint probability distribution of the forecasts and observations. Figure 1 depicts the general situation, where L0 and L1 are the likelihoods for the two classes. In other words, Li(x) is the probability of the forecast x, given that the observation is from the ith class.¹ This figure illustrates an example of what Murphy and Winkler call a discrimination diagram. There, it was shown that the quality of forecasts can be assessed with complete generality in terms of several such diagrams; other diagrams gauge different facets of that quality, for example, refinement, resolution, reliability, etc.
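For a given dataset, the likelihoods can be estimated as class-conditional, normalized histograms of the forecasts. The following minimal sketch (in Python) constructs such a discrimination diagram from hypothetical, synthetic forecasts; the Gaussian parameters are chosen only so that the two likelihoods overlap partially.

```python
# A sketch of a discrimination diagram: L_i(x) is approximated by a
# normalized histogram of the forecasts, conditioned on the observed class.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x0 = rng.normal(-0.5, 1.0, 1000)   # forecasts issued when class 0 was observed
x1 = rng.normal(+0.5, 1.0, 1000)   # forecasts issued when class 1 was observed

bins = np.linspace(-4.0, 4.0, 41)
plt.hist(x0, bins=bins, density=True, histtype="step", label="L0(x)")
plt.hist(x1, bins=bins, density=True, histtype="step", label="L1(x)")
plt.xlabel("forecast x")
plt.ylabel("likelihood")
plt.legend()
plt.show()
```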
Meteorologists (Harvey et al. 1992; Mason 1982; Mason and Graham 1999; Stephenson 2000; Wilks 2001; Atger 2004) have also become interested in a procedure heavily utilized in medical circles (Dorfman and Alf 1969; Dorfman et al. 1997; Metz et al. 1998; Shapiro 1999; Zhou et al. 2002; Zou 2003; Coffin and Sukhatme 1997). The procedure is based on the receiver operating characteristic (ROC) curve, sometimes referred to as relative operating characteristic. In its simplest form it is a parametric plot of the hit rate (or probability of detection) versus the false alarm rate, as a decision threshold is varied across the full range of a continuous forecast quantity. The diagonal line corresponds to random forecasts, and the amount of concavity is taken to be a measure of performance. The area under the ROC curve (AUC) is often taken as a scalar measure (Hanley and McNeil 1982). An AUC of 0.5 reflects random forecasts, while AUC = 1 implies perfect forecasts. It has also been shown by Mylne (1999) and Richardson (2000, 2001) that AUC is closely related to the economic value of a forecast system.
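For concreteness, the following sketch computes an empirical ROC curve and its AUC by sweeping a decision threshold across synthetic, hypothetical continuous forecasts; the trapezoidal rule is one common way of estimating the area.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-0.5, 1.0, 1000)   # forecasts when the observation was 0
x1 = rng.normal(+0.5, 1.0, 1000)   # forecasts when the observation was 1

# Sweep a decision threshold t across the full range of the forecasts:
ts = np.linspace(min(x0.min(), x1.min()), max(x0.max(), x1.max()), 501)
H = np.array([(x1 > t).mean() for t in ts])   # hit rate (probability of detection)
F = np.array([(x0 > t).mean() for t in ts])   # false alarm rate

# AUC by the trapezoidal rule (sorted so that F is increasing):
order = np.argsort(F)
auc = np.sum(np.diff(F[order]) * (H[order][1:] + H[order][:-1]) / 2.0)
print(f"AUC = {auc:.3f}")   # roughly 0.76 for these synthetic forecasts
```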
The ROC framework is somewhat different from the Murphy–Winkler framework. For example, for probabilistic forecasts the Murphy–Winkler framework does not require, and indeed discourages, the reduction of the forecasts into categorical classes. The ROC analysis, by contrast, is based on the contingency table and, therefore, requires the introduction of a decision threshold for the purpose of reducing the continuous forecasts into binary forecasts. Of course, the introduction of a threshold does not imply that ROC analysis is in any way inferior to the Murphy–Winkler framework; it is simply another method of assessment, with an emphasis on different facets of performance. The Murphy–Winkler framework is more suitable for comparing different sets of forecasts (e.g., from two forecasters), while the explicit presence of a decision threshold in ROC analysis lends itself to the situation where a decision must be made, or action must be taken, in response to forecasts.
In this paper, a number of questions are addressed regarding the shape of ROC curves. A few examples are provided to motivate the questions, and five toy models are utilized to answer them. The toy models, although somewhat unrealistic, are designed to be progressively better approximations to the general problem depicted in Fig. 1. The primary aim of this study is to introduce an awareness of the connections between the Murphy–Winkler framework and ROC analysis. As such, the results reported here are specific to the toy models considered and are not assured to hold in general. One model (based on Gaussians) is apt to be more widely applicable, however, and all of the considered examples are sufficiently flexible to reproduce a number of ROC behaviors observed in realistic situations. The simplicity of the models offers a transparent environment wherein observed ROC behaviors can be explained in terms of more basic quantities, namely, the parameters of the class-conditional distributions of the forecasts (i.e., the likelihoods).
Figure 2a displays 16 ROC curves representing different levels of performance. These curves gauge the performance of a Markov chain model for forecasting tornadic activity in four different regions of the United States, during four seasons (Drton et al. 2003). The behavior of these curves is canonical in that they do what they are expected to: they all begin at the point (0, 0) and end at (1, 1). Note, too, the high degree of symmetry about the diagonal(s). Figure 2b displays another set of 16 ROC curves, this time from a statistical model for predicting hail size (Marzban and Witt 2001). Although these curves are not pathological in any sense, they do display a few features that are common to many ROC curves. The lowest-performing models have symmetric ROC curves, but the midrange models begin to lose that symmetry. A natural question is whether this asymmetry can be explained in terms of the underlying distributions.
Another feature that often emerges is the extensive overlap of the ROC curve with one (or two) of the axes of the diagram. In Fig. 2b, this can be seen in the most concave curves (i.e., those corresponding to the best-performing models). These yield ROC curves that overlap the top axis for all false alarm rates higher than 0.4. What is the explanation for this type of overlap? And what about an overlap with the y axis?
Another type of asymmetry (not shown here) arises when the ROC curve crosses the diagonal at some (usually one) point. What causes this type of crossover?
Many users of ROC curves observe that in dealing with a wide range of forecasts in different situations, most forecasts appear to lead to highly concave ROC curves, or equivalently, high AUC values. AUC values of, say, 0.9995 are not uncommon. Figure 2c displays eight sets of ROC curves with extreme concavity. These are related to a neural network developed for the prediction of ceiling and visibility (Marzban et al. 2003). The forecasts underlying the curves have different forecast characteristics (in terms of the various attributes of probabilistic forecasts computed within the Murphy–Winkler framework), yet they all lead to very concave ROC curves. The AUC values for these curves vary from 0.990 to 0.996. Why are these AUC values exceedingly near 1? Is it because the forecasts are of extraordinary quality? Or is it an artifact of the AUC itself? If the former is true, then a histogram of all AUC values would be peaked near 1 (i.e., with a heavy tail to the left). This is difficult to test, because the necessary data would be difficult to compile. On the other hand, if the culprit is the measure itself, then testing that hypothesis would be unnecessary, for an explanation would then be at hand. And what sort of artifact would lead to near-one AUC values?
As mentioned above, although the two approaches have different emphases, they are related. After all, the quantities from which an ROC curve is derived (hit rate and false alarm rate) are areas under the conditional distributions, above some decision threshold. Moreover, although the computation of ROC curves does not require knowledge of these distributions, an assessment of the statistical significance of ROC curves does (Dorfman and Alf 1969; Hanley and McNeil 1982; Stephenson 2000; Dorfman et al. 1997). For example, in order to compute standard errors for the ROC curve or AUC (in a parametric approach), one makes some assumptions regarding these underlying distributions. It is natural, then, to utilize the connection between the ROC curve and the underlying distributions to answer the above questions. The answers, in turn, offer a means of interpreting ROC curves at a more fundamental level.
In summary, several toy models are utilized here to relate some characteristic features of ROC curves to features of the underlying distributions. As such, the shape of the ROC curve can be interpreted, or "explained." Knowledge of the underlying distributions can, in turn, guide the development of better forecasts. AUC is also examined within the toy models. It is important to emphasize that the distributions examined here are toy models and mostly of pedagogical value. The five distributions considered are shown in Figs. 3a–7a. They are referred to as 1) uniform, 2) triangular with unconstrained support, 3) Gaussian, 4) triangular with constrained support, and 5) beta distributions. The first three are appropriate for cases where the forecast quantity varies over the real line from −∞ to +∞, while the last two apply to probabilistic forecasts.
2. Uniform distribution
Several observations can be made. First, (3) implies that two models with different means and widths can yield the same ROC curve, provided they give rise to the same slope and intercept (see Fig. 3b). As such, the ROC curve does not uniquely specify the underlying parameters. In other words, there is a family of underlying distributions that give rise to the same ROC curve, a fact verified numerically in the sketch below. This is a known fact even for more general distributions (Zhou et al. 2002).
Second, the length of the vertical segment overlapping the y axis is determined by two quantities, δc and w0/w1. This is sensible since the “goodness” of the underlying model is determined by both quantities. By contrast, the slope of the middle segment depends only on the ratio of the half-widths (and not δc). As such, the inequality of w0 and w1 reflects itself as an asymmetric ROC curve.
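Both observations are easy to check numerically. Under the configuration of Fig. 3a (likelihoods uniform on [ci − wi, ci + wi], with c0 < c1 and overlapping supports), one finds F(t) = (c0 + w0 − t)/(2w0) and H(t) = (c1 + w1 − t)/(2w1) on the middle segment, so that H is linear in F with slope w0/w1 and intercept (δc + w1 − w0)/(2w1). The following sketch, with hypothetical parameter values, confirms that two different parameter sets sharing these two quantities trace the same ROC curve.

```python
import numpy as np

def hit_rate(F, c0, w0, c1, w1):
    """H as a function of F for uniform likelihoods on [c_i - w_i, c_i + w_i],
    for F strictly inside (0, 1): invert F(t), then evaluate H(t)."""
    t = c0 + w0 - 2.0 * w0 * np.asarray(F)   # threshold yielding this F
    return np.clip((c1 + w1 - t) / (2.0 * w1), 0.0, 1.0)

F = np.linspace(0.001, 0.999, 999)
# Two hypothetical parameter sets with equal slope (w0/w1 = 1) and equal
# intercept ((dc + w1 - w0)/(2 w1) = 0.5):
H_a = hit_rate(F, c0=0.0, w0=0.5, c1=0.5, w1=0.5)
H_b = hit_rate(F, c0=0.0, w0=0.4, c1=0.4, w1=0.4)
print(np.allclose(H_a, H_b))   # True: different parameters, same ROC curve
```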
As a function of the measure δc, AUC is a parabola. Figure 3c shows two instances: w0 = w1 = 0.4, and w0 = 0.4, w1 = 0.6. The AUC curve rises rapidly and then flattens. It is this nonlinear behavior that explains the appearance of near-one AUC values in practice. For example, in Fig. 3c, as a model improves in terms of δc, its AUC value increases quickly to 0.99 at around δc ∼ 0.8. And the infinitely many better models, with δc ≥ 0.8, result in only comparable AUC values, still around 0.99. In other words, the frequent appearance of high AUC values in practice suggests that the corresponding models are all in the "good" range of the AUC curve. One can say that AUC discriminates well between "good" and "bad" models, but not between good models, where those adjectives are gauged in terms of the underlying distributions.⁵ Similar arguments apply to the parameters w0 and w1; AUC flattens off for sharper distributions.
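The flattening is likewise easy to reproduce. The sketch below estimates AUC by Monte Carlo, using the identity AUC = P(x1 > x0) for continuous forecasts (Hanley and McNeil 1982), with the half-widths of the second instance above (w0 = 0.4, w1 = 0.6).

```python
import numpy as np

rng = np.random.default_rng(1)
w0, w1 = 0.4, 0.6
for dc in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    x0 = rng.uniform(-w0, w0, 200_000)            # class 0, centered at 0
    x1 = rng.uniform(dc - w1, dc + w1, 200_000)   # class 1, centered at dc
    auc = (x1 > x0).mean()                        # AUC = P(x1 > x0)
    print(f"dc = {dc:.1f}  AUC = {auc:.3f}")
# AUC climbs rapidly at first, then flattens: it is already ~0.98 at
# dc = 0.8, and changes little as dc increases further.
```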
3. Triangular distribution with unconstrained support
From the endpoints of the middle segment (Fig. 4b), it follows that the ROC curve is asymmetric if and only if w0 ≠ w1. Specifically, if the concavity is mostly to the left, then w0 < w1; bowing to the right suggests w0 > w1. Note that the asymmetry is independent of the ci.
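As an illustration, the following sketch (assuming symmetric triangular likelihoods centered at ci with half-widths wi, as in Fig. 4a, and hypothetical parameter values) traces the two cases; per the statements above, w0 < w1 should bow to the left and w0 > w1 to the right.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import triang

def roc_triangular(c0, w0, c1, w1, n=1001):
    """ROC for symmetric triangular likelihoods centered at c_i with
    half-width w_i; triang(0.5, loc, scale) puts the mode at loc + scale/2."""
    L0 = triang(0.5, loc=c0 - w0, scale=2 * w0)
    L1 = triang(0.5, loc=c1 - w1, scale=2 * w1)
    t = np.linspace(min(c0 - w0, c1 - w1), max(c0 + w0, c1 + w1), n)
    return L0.sf(t), L1.sf(t)   # (false alarm rate, hit rate)

F_l, H_l = roc_triangular(0.0, 0.3, 0.5, 0.7)   # w0 < w1: bows to the left
F_r, H_r = roc_triangular(0.0, 0.7, 0.5, 0.3)   # w0 > w1: bows to the right
plt.plot(F_l, H_l, label="w0 < w1")
plt.plot(F_r, H_r, label="w0 > w1")
plt.plot([0, 1], [0, 1], "k--")                  # random forecasts
plt.xlabel("false alarm rate")
plt.ylabel("hit rate")
plt.legend()
plt.show()
```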
4. Gaussian distribution
A common error is to assume that a theoretical ROC curve based on Gaussian distributions is constrained to obey the canonical ROC behavior, that is, to lie entirely above (or entirely below) the diagonal. Although this is true for the symmetric case where w0 = w1, in general the ROC curve is not strictly concave. It is easy to show that if w0 ≠ w1, then the ROC curve crosses the diagonal at precisely one point (other than the endpoints). Proof: The ROC curve crosses the diagonal where Φ[(c0 − t)/w0] = Φ[(c1 − t)/w1], that is, when c1/w1 − c0/w0 = (1/w1 − 1/w0)t. Being linear in t, this equation has exactly one nontrivial solution when w0 ≠ w1. The value of F at this crossing point is given by Φ(δc/δw), where δc = c1 − c0 and δw = w1 − w0. Figure 5b illustrates this crossover.
This result must be interpreted cautiously. Specifically, it does not imply that an apparently concave empirical ROC curve suggests w0 = w1. Even if w0 ≠ w1, the ROC curve can still appear to be mostly concave (i.e., without a visible crossover). This is because Φ(x) is a rapidly increasing function of x: it is nearly 0 or 1 when x is nearly −2 or +2, respectively. Therefore, a concave empirical ROC curve suggests one of two possibilities: either w0 = w1, or w0 ≠ w1 but with |δc/δw| greater than approximately 2, in which case the crossover occurs too close to a corner of the diagram to be visible.
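Both the crossover and its predicted location, Φ(δc/δw), are straightforward to verify numerically. In the following sketch the parameter values are hypothetical, chosen so that |δc/δw| = 0.5, well below 2, and the crossover is therefore plainly visible.

```python
import numpy as np
from scipy.stats import norm

c0, w0 = 0.0, 1.0   # class 0: mean 0.0, standard deviation 1.0
c1, w1 = 0.5, 2.0   # class 1: mean 0.5, standard deviation 2.0

t = np.linspace(-8.0, 8.0, 100001)
F = norm.sf(t, loc=c0, scale=w0)   # false alarm rate = P(x > t | class 0)
H = norm.sf(t, loc=c1, scale=w1)   # hit rate        = P(x > t | class 1)

# Locate the interior crossing of the diagonal (H = F):
mask = (F > 0.01) & (F < 0.99)
i = np.argmin(np.abs(H - F)[mask])
print(F[mask][i])                          # ~0.691, the observed crossing
print(norm.cdf((c1 - c0) / (w1 - w0)))     # Phi(dc/dw) = 0.691, as predicted
```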
5. Triangular distribution with constrained support
In some situations the forecast quantity is a probability, calling for distributions that are restricted to that range. The first of the two such distributions considered here is shown in Fig. 6a. This model assumes that the forecasts span the full range of possibilities (i.e., 0 to 1); in the language of Murphy and Winkler (1987, 1992), the forecasts are assumed to be well refined. Also note that in this approximation, the only parameters are the two modes: c0 and c1.⁷
From an expression for the slope, it follows that a symmetric ROC curve implies c0 + c1 = 1. Any other combination of c0 and c1 will result in an asymmetric curve. It is also easy to show that there does not exist a crossover; a nontrivial curve is either always above or always below the diagonal. It also follows that the ROC curve will bow to the left if c0 ∼ 0.5, and to the right if c1 ∼ 0.5.
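The symmetry condition can be checked numerically, taking "symmetric" to mean that the curve is invariant under reflection about the line from (0, 1) to (1, 0), that is, under (F, H) → (1 − H, 1 − F). The sketch below uses the triangular distribution on [0, 1] with hypothetical modes.

```python
import numpy as np
from scipy.stats import triang

def roc_constrained(c0, c1, n=2001):
    """ROC for triangular likelihoods on [0, 1] with modes c0 and c1."""
    t = np.linspace(0.0, 1.0, n)
    return triang(c0).sf(t), triang(c1).sf(t)   # (F, H)

def is_symmetric(F, H, tol=1e-3):
    """Is the curve invariant under (F, H) -> (1 - H, 1 - F)?
    Interpolate the reflected curve at the original F values and compare."""
    H_reflected = np.interp(F, 1.0 - H, 1.0 - F)   # 1 - H is increasing in t
    return np.max(np.abs(H_reflected - H)) < tol

F, H = roc_constrained(c0=0.3, c1=0.7)   # c0 + c1 = 1
print(is_symmetric(F, H))                # True
F, H = roc_constrained(c0=0.3, c1=0.9)   # c0 + c1 != 1
print(is_symmetric(F, H))                # False
```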
6. Beta distribution
7. Summary and conclusions
Several models are examined for the purpose of explicitly illustrating some features of ROC curves and the area under the curve (AUC). The findings aid in interpreting the shape of the ROC curve in terms of the parameters defining the class-conditional distributions of the forecast quantity. In addition to providing a pedagogical exposition of ROC analysis, the work also offers some guidance for interpreting ROC curves and the AUC. The guidance is based only on the models examined here. As such, the generality of the results is not assured by any means. Nevertheless, all of the examples shown in Fig. 2 are found to be completely consistent with the findings here. The following statements should be interpreted only as qualitative guidance; more quantitative statements are found in the text.
For unbounded forecasts, an asymmetric ROC curve suggests unequal widths for the underlying distributions. If the class with the larger mean is labeled as 1, then concavity to the top suggests w0 > w1, and concavity to the bottom suggests w0 < w1. In other words, in attempting to explain any asymmetry in an empirical ROC curve, it is advisable to examine the widths of the underlying distributions. The amount of overlap with the axes is also a measure of the difference in the widths. The crossing of the diagonal by an ROC curve suggests that the quantity |δc/δw| is smaller than some critical value. For example, if the distributions are Gaussian, then that critical value is approximately 2.
For bounded forecasts, the distributions examined here do not generate an overlap with the axes. The existence of a significant overlap in an empirical ROC plot suggests that the underlying distributions are different from the ones examined here in some significant way. The symmetry and crossover of the ROC are determined by a combination of means and variances, for example, (20).
For both bounded and unbounded forecasts, the AUC increases nonlinearly with respect to natural measures of forecast quality derived from the parameters of the underlying distributions. Moreover, in the examples considered here, the more realistic models display more of this nonlinearity. The nonlinearity is such that the ability of AUC to distinguish between models diminishes as performance increases. As such, the frequent occurrence of near-1 AUC values observed empirically indicates only that many forecasts are of "reasonable" quality, not necessarily of extraordinary quality.
Acknowledgments
The author is grateful to Rich Caruana for invaluable discussions and a reading of an early version of this article.
REFERENCES
Atger, F., 2004: Estimation of the expected reliability of ensemble-based probabilistic forecasts. Quart. J. Roy. Meteor. Soc., 130, 627–646.
Coffin, M., and S. Sukhatme, 1997: Receiver operating characteristic studies and measurement errors. Biometrics, 53, 823–837.
Dorfman, D. D., and E. Alf Jr., 1969: Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals. J. Math. Psychol., 6, 487–496.
Dorfman, D. D., K. S. Berbaum, C. E. Metz, R. V. Lenth, J. A. Hanley, and H. Abu Dagga, 1997: Proper receiver operating characteristic analysis: The bigamma model. Acad. Radiol., 4, 138–149.
Drton, M., C. Marzban, P. Guttorp, and J. T. Schaefer, 2003: A Markov chain model of tornadic activity. Mon. Wea. Rev., 131, 2941–2953.
Hanley, J. A., and B. J. McNeil, 1982: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
Hanley, J. A., and B. J. McNeil, 1983: A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148, 839–843.
Harvey, L. O., Jr., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: Application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863–883.
Marzban, C., and A. Witt, 2001: A Bayesian neural network for hail size prediction. Wea. Forecasting, 16, 600–610.
Marzban, C., S. Leyton, and B. Colman, cited 2003: Nonlinear post-processing of model output: Ceiling and visibility. NWS/COMET Rep. [Available online at http://www.nhn.ou.edu/marzban/comet1.pdf.]
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
Mason, S. J., and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Wea. Forecasting, 14, 713–725.
Metz, C. E., B. A. Herman, and J. H. Shen, 1998: Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat. Med., 17, 1033–1053.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Murphy, A. H., and R. L. Winkler, 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, 435–455.
Mylne, K. R., 1999: The use of forecast value calculations for optimal decision making using probability forecasts. Preprints, 17th Conf. on Weather Analysis and Forecasting, Denver, CO, Amer. Meteor. Soc., 235–239.
Richardson, D. S., 2000: Applications of cost–loss models. Proc. Seventh Workshop on Meteorological Operational Systems, Reading, United Kingdom, ECMWF, 209–213.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Shapiro, D. E., 1999: The interpretation of diagnostic tests. Stat. Methods Med. Res., 8, 113–134.
Stephenson, D. B., 2000: Use of the "odds ratio" for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.
Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8, 209–219.
Zhou, X-H., D. K. McClish, and N. A. Obuchowski, 2002: Statistical Methods in Diagnostic Medicine. John Wiley and Sons, 464 pp.
Zou, K. H., cited 2003: Receiver operating characteristic (ROC) literature research. [Available online at http://splweb.bwh.harvard.edu:8000/pages/ppl/zou/roc.html.]
Fig. 1. A generic situation involving a forecast of two classes.
Fig. 2. Examples of ROC curves representing different levels of performance quality. The diagonal line corresponds to random forecasts (i.e., poor performance), while the curves away from the diagonal represent higher levels of performance. The following features are noted: (a) symmetric ROC curves, (b) symmetric and asymmetric curves, also overlapping one axis, and (c) extremely concave curves.
Fig. 3. Schematics of (top) uniform class-conditional distributions, (middle) the corresponding ROC curve, and (bottom) the AUC curve as a function of δc = c1 − c0.
Fig. 4. Same as in Fig. 3 but for triangular distributions over unbounded forecasts.
Fig. 5. Same as in Fig. 3 but for Gaussian distributions.
Fig. 6. Same as in Fig. 3 but for bounded (e.g., probabilistic) forecasts.
Fig. 7. Same as in Fig. 3 but for beta distributions. The corresponding parameters are b0 = 2, b1 = 3, a0 = 2, with a1 taking values 2, 3, 4, and 5 (from top to bottom).
¹ For a given dataset, a normalized histogram of x is the best way of visualizing the likelihood.
² Throughout this paper, the symbols c and w refer to measures of central tendency and half-width, respectively, of the respective distribution. For the case of the Gaussian, they coincide with the mean and the standard deviation of the distribution.
³ The expressions in (2) are specific to Fig. 3a; changing the relative position of c0 and c1, or the magnitudes of the widths, yields different expressions.
⁵ This is not a problem in model selection, because the standard error of the AUC converges to 0 as AUC approaches 1 (Hanley and McNeil 1983).
⁶ In decision-theoretic applications where one seeks an "optimal" decision threshold, this expression is often given to argue for the threshold at which slope = 1. However, that choice assumes that the two classes have equal prior probabilities (p0 = p1). Sometimes p1 and p0 are referred to as the base rate and its complement. The optimal threshold should be the one corresponding to slope = p0/p1.
⁷ First, note that in this section, c stands for the mode (not the mean) of the distribution. Also, the widths of the distributions are not independent quantities. The mean is given as (1 + c)/3, and the variance as (1 − c + c²)/18.
⁸ Technically, this expression should be multiplied by the ratio of the respective prior probabilities as well. They are neglected here because they are not functions of x.