The paper briefly reviews measures that have been proposed since the 1880s to assess accuracy and skill in categorical weather forecasting. The majority of the measures consist of a single expression, for example, a proportion, the difference between two proportions, a ratio, or a coefficient. Two exemplar single-expression measures for 2 × 2 categorical arrays that chronologically bracket the 130-yr history of this effort—Doolittle's inference ratio i and Stephenson's odds ratio skill score (ORSS)—are reviewed in detail. Doolittle's i is appropriately calculated using conditional probabilities, and the ORSS is a valid measure of association, but both measures are limited in ways that variously mirror all single-expression measures for categorical forecasting. The limitations that variously affect such measures include their inability to assess the separate accuracy rates of different forecast–event categories in a matrix, their sensitivity to the interdependence of forecasts in a 2 × 2 matrix, and the inapplicability of many of them to the general k × k (k ≥ 2) problem. The paper demonstrates that Wagner's unbiased hit rate, developed for use in categorical judgment studies with any k × k (k ≥ 2) array, avoids these limitations while extending the dual-measure Bayesian approach proposed by Murphy and Winkler in 1987.
1. The attempt to develop a single-numerical measure of Sgt. Finley's tornado forecasting skill
In 1884, Sgt. J. P. Finley of the U.S. Army Signal Corps published the results of a four-month study of his method for predicting the occurrence of tornadoes (Finley 1884). Finley cross-classified the data from a total of 2803 tornado–no-tornado events by using the two variables of predicted and actual events (Table 1). Finley claimed a prediction success rate of 0.966, a result that he derived by simply adding the correct tornado and no-tornado predictions (28 + 2680 = 2708) and dividing by the number of events (2803).
Within a few months of the publication of Finley's data and continuing for several years afterward, Finley's results were challenged by scholars (Gilbert 1884; Peirce 1884; Doolittle 1884, 1888; Curtis 1887; Hazen 1887, 1892; Clayton 1889, 1891, 1927, 1934; Köppen 1893; Heidke 1926). There was agreement among them that Finley's use of the proportion-correct method produced a misleading measure of the success of his forecasts. Gilbert (1884) noted that tornadoes hit so rarely during Finley's field study that a person who simply predicted “no tornado” for the entire four months would have exceeded Finley's claimed success rate of 96.6%, achieving a rate of 98.2% (2752 actual no-tornado events divided by 2803 total events).
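Gilbert's comparison can be checked with a few lines of Python. This is only an illustrative sketch using the counts quoted above; the variable names are ours and do not come from any of the cited papers.

```python
# Counts quoted in the text for Finley's four-month study.
correct_tornado = 28        # tornado forecast and tornado observed
correct_no_tornado = 2680   # no-tornado forecast and no tornado observed
no_tornado_events = 2752    # occasions on which no tornado occurred
total_events = 2803

# Finley's proportion correct: all correct forecasts over all events.
finley_rate = (correct_tornado + correct_no_tornado) / total_events

# Gilbert's counterexample: always forecast "no tornado".
always_no_rate = no_tornado_events / total_events

print(round(finley_rate, 3), round(always_no_rate, 3))
```

The constant "no tornado" forecast scores higher than Finley's method on proportion correct, which is the substance of Gilbert's objection.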
Beyond noting that Finley's 96.6% success rate was misleading, there was little else that these authors agreed on. Differences arose in their reasoning, the coefficients of association they developed, and their calculations of chance. This was to be expected in light of the newness and difficulty of their task. In Doolittle's unequaled prose, this was the problem to be solved (Doolittle 1888, p. 85):
Having given the number of instances respectively in which things are both thus and so, in which they are thus but not so, in which they are so but not thus, and in which they are neither thus nor so, it is required to eliminate the general quantitative relativity inhering in the mere thingness of the things, and to determine the special quantitative relativity subsisting between the thusness and the soness of the things.
As an example of the differences in approach and results that this problem invited, Doolittle noted that his own formula for measuring that part of Finley's tornado–no-tornado forecasting success that was due to skill, rather than to skill and chance together, yielded what Doolittle called an inference ratio of 0.142, while Gilbert's method yielded 0.216 and Peirce's yielded 0.523, which is the equivalent of the hit rate minus false alarm rate, or H − F (see the appendix for cell identifications, a glossary of terms, the formula for calculating Wagner's unbiased hit rate, and the properties of Wagner's and others' categorical measures).
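Two of the three values Doolittle compared can be reproduced from Finley's counts with the short sketch below. It assumes that Gilbert's method corresponds to the form now usually written as (a − a_r)/(a + b + c − a_r), where a_r is the number of correct tornado forecasts expected by chance; that formula is not stated in the text and is an assumption here, as are the variable names. The cell labels follow the layout implied by the formulas in section 3.

```python
# Finley's table: a = tornado forecast and observed, b = tornado forecast only,
# c = tornado observed only, d = neither forecast nor observed.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

# Peirce's measure: hit rate minus false alarm rate, H - F.
peirce = a / (a + c) - b / (b + d)

# Assumed form of Gilbert's score: correct tornado forecasts beyond chance,
# relative to all occasions on which a tornado was forecast or observed.
a_random = (a + b) * (a + c) / n
gilbert = (a - a_random) / (a + b + c - a_random)

print(round(peirce, 3), round(gilbert, 3))
```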
Subsequent reviews of this early work were in agreement that doubts remained about its value for binary weather forecasting analysis (see, e.g., Goodman and Kruskal 1954, 127–132; Murphy 1996). Goodman and Kruskal (1954, p. 131) observed:
It may, of course, be cogently argued that in situations such as Finley's it is misleading to search for a single numerical measure of predictive success; and that rather the whole 2 × 2 table should be considered, or at least two numbers from it, the proportions of false positives and false negatives.
Murphy (1996, p. 7) noted but did not explicate a second problem: the apparent inability to apply single-numerical measures formulated for 2 × 2 verification problems such as Finley's to the general k × k (k ≥ 2) verification problem. A century before, Doolittle (1884, p. 126) had commented on the problem, also without proposing an explanation of it. (The problem derives from differences in degrees of freedom as between k × k matrices of k = 2 and k > 2; see section 3.)
The early work on binary categorical assessment problems in weather forecasting replicated the early search in other emerging scientific disciplines for single-numerical expressions of association between like cross-classified categorical variables (Pearson et al. 1899; Pearson and Lee 1903; Yule 1900, 1903; Mumford and Young 1923; Niles 1922; Cohen and Nagel 1934; Fisher 1934; Bartlett 1935; Norton 1939). This approach in meteorology has persisted to the present day. Coefficients like Gilbert's skill score, the Heidke skill score, the true skill statistic, and Stephenson's odds ratio skill score (ORSS) are in current use. Arguably, they are limited in ways that Goodman and Kruskal glimpsed but that have not been discussed in detail in the literature. The crux of the problem is that a cross-classification of forecasts and events–nonevents like Finley's data matrix does not constitute a cross-classification of like variables. Rather, it constitutes a confusion matrix, a variant of a contingency table whose analysis is enhanced by moving beyond single-numerical expressions.
An implicit premise of Finley's calculations is that predictions of “tornado” are independent of predictions of “no tornado”; that is, their respective accuracy rates can be completely and accurately represented with simple proportion-correct measures. In a 2 × 2 contingency table that cross-classifies, for example, the similar variables of high–low plant yield Y against drip–surface irrigation method X, one could treat the two X columns as distinct binomial samples and separately condition Y on them (Agresti 1996). This would yield results after the same fashion as Finley's calculations, which would be appropriate because mistaken judgments, either false negatives or false positives, are not at issue.
This is not the case in a confusion matrix such as Finley's, where intentional choice (hence, bias) is operating. Using Finley's logic, we would find that his forecasts of tornado were correct 54.9% of the time (28/51). This ignores the problem that while Finley did correctly forecast 28 of the 51 tornadoes that occurred during the study period, he also forecast tornado more than twice as often as that (72 times) when no tornado would have been the correct forecast. These 72 erroneous forecasts diminish his accuracy in forecasting both tornado and no tornado, because in a 2 × 2 confusion matrix every forecast of a weather event affects the true rate of accuracy in forecasting both categories of an event. The 72 mistaken forecasts of tornado did not just disappear; they were mistaken as being the null forecast. Intentionality and bias in judgments are the hidden influences in a confusion matrix. To account for them, measures of accuracy and of chance are required that use both conditional probabilities of identification accuracy and forecast accuracy and that do so for each separate event–forecast category. H. L. Wagner, an experimental psychologist, developed the unbiased hit rate in recognition of this in his own field.
2. H. L. Wagner's unbiased hit rate
Wagner (1993) developed his measure for use with cross-classified categorical data in the general k × k (k ≥ 2) array where the stimulus variables are known facial expressions of emotion and the response variables are human subjects' judgments of the emotions being expressed, in other words, where the contingency table is a confusion matrix representing the judgment accuracy and bias of human subjects when assessing visual cues given off by facial expressions of basic emotions. In meteorological studies, the judgment accuracy of a confusion matrix is the accuracy of a forecaster or a forecast model, and the known stimuli are the weather cues used by the forecaster or the model.
The analysis of a confusion matrix should account for the accuracy of responses to every stimulus presented to the judge (i.e., for the accuracy of forecasts of every event–null event category), especially when, as with Sgt. Finley's data, forecast accuracy varies widely as between the event and null event categories. Wagner demonstrated that commonly used measures for confusion matrices in his own specialty were variously deficient: that they were biased, calculating accuracy only as a proportion of stimuli correctly identified (as with Finley's calculations, which were based only on correct identifications of events) or as a proportion of responses correctly given (i.e., forecasts correctly used), rather than as a proportion of both; that they used inappropriate a priori chance calculations; that some of them were inapplicable to studies with fewer than three categories of judgment; and that many of them could calculate only a single, overall accuracy rate rather than an accuracy rate for each of the separate stimulus–response categories in a study. The single-numerical expressions that have been developed for categorical weather forecasting skill are variously subject to the same problems.1
Wagner recognized, as had Doolittle a century before (Doolittle 1884, p. 123), that classification accuracy in judgment studies is the product of the two conditional probabilities that 1) a stimulus (i.e., an event or null event) is correctly identified given that cues to its identity are presented at all to the subject and 2) a response (i.e., a forecast) is used correctly given that it is used at all by the subject. Wagner named the product of these two conditional probabilities the “unbiased hit rate,” or Hu. In terms familiar to meteorologists (see the appendix), Hu is the product of the probability of detection (POD) and the frequency of hits (FOH). The measure can be used to assess the separate accuracy rates of category judgments in studies of two or more categories, and it accounts simultaneously for stimulus and response accuracy in each category. Wagner also developed an appropriate category-level chance rate calculation that can be used in tests of significance (see section 3).
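The definition of Hu as a product of two conditional probabilities can be sketched for any k × k confusion matrix as follows. This is an illustration, not Wagner's own code: the function name, the row and column orientation (rows as forecasts, columns as observed events), and the handling of an empty category are assumptions made for the example.

```python
def unbiased_hit_rates(matrix):
    """Return Hu for each category of a square confusion matrix.

    Assumed layout: rows are forecasts (responses), columns are observed
    events (stimuli), so matrix[i][i] counts correct forecasts of category i.
    """
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][j] for r in range(k)) for j in range(k)]
    rates = []
    for i in range(k):
        hits = matrix[i][i]
        if row_totals[i] == 0 or col_totals[i] == 0:
            # Category never forecast or never observed; reported as 0 here
            # (a choice made for this sketch, not taken from the text).
            rates.append(0.0)
        else:
            # (hits / events observed) * (hits / forecasts issued)
            rates.append((hits / col_totals[i]) * (hits / row_totals[i]))
    return rates


# Finley's data: [tornado forecast row, no-tornado forecast row].
finley = [[28, 72], [23, 2680]]
hu = unbiased_hit_rates(finley)
print([round(x, 3) for x in hu])
```

Because the function works category by category, the same call applies unchanged to a 3 × 3 or larger matrix.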
Since its introduction, Hu has been adopted by many scientists in Wagner's field of nonverbal behavior and communication, particularly in facial expression and vocalization studies (e.g., Elfenbein et al. 2002; Marsh et al. 2003; Scherer 2003; Davis et al. 2006; Sauter and Scott 2007; Goeleven et al. 2008; Hawk et al. 2009). While the literature in the field of nonverbal behavior reflects substantial internal debate (Russell and Fernández-Dols 1997), no critiques of Wagner's unbiased hit rate have appeared in that or any other literature. It calculates accuracy by determining the proportions correct of event identifications and forecasts for each category of a matrix, so it is insensitive to stimulus or response bias, to proportions of stimuli of different types, to the number of categories of stimulus–response, and to the interdependence of responses (i.e., forecasts) in 2 × 2 matrices. It has been used in studies with matrices from size 2 × 2 to 8 × 8 and larger. Recently, the statistic has been used in the reanalysis of criminological data, and has been proposed for wider use in dichotomous and two-category judgment studies outside the field of experimental psychology (Armistead 2011, 2012). The current paper extends that proposal to categorical measures of weather forecasting accuracy and skill. For purposes of simplifying the argument, only binary studies are considered in this paper, even though the argument applies equally to any k × k (k ≥ 2) matrix.
3. Comparing Doolittle's i and Stephenson's ORSS to Wagner's Hu
Doolittle's i, or inference ratio, and Stephenson's odds ratio skill score are notable measures that stand on either end of the long history of attempts to develop a valid and useful single-expression measure of categorical forecasting skill. In the following discussion, they—and in less detail, other popular single-expression measures—will be compared to Wagner's measures of accuracy and chance.
a. Doolittle's i
Doolittle's “inference ratio” is a single-numerical expression that he introduced in 1884, shortly after Finley's report appeared. He considered i to represent “that part of the success which is due to skill and not to chance…” (Doolittle 1884, p. 123). Doolittle argued that when i was used as the dividend in a ratio measure with raw accuracy, it captured the skillful portion of both tornado and no-tornado prediction success, as opposed to the skillful and chance portions of success “mingled indistinguishably” (Doolittle 1884, p. 124). Like Wagner a century later, Doolittle defined raw accuracy in a confusion matrix as the product of the two conditional probabilities of forecast and identification accuracy. Thus, for the Finley tornado data, Doolittle calculated raw accuracy as (28/100)(28/51) = 0.154. He performed the same calculation for the no-tornado cases and obtained the proportion 0.966 = (2680/2752)(2680/2703). These are the same values as Wagner's unbiased hit rates for the two forecast–event categories.
Doolittle presented two formulas for i in his paper, each yielding the value of 0.142 for the Finley data. It is easier to understand how he arrived at i if we restate his final version of it in current terms and consider that his inspiration for the measure arose from his critique of a measure that Peirce (1884) had developed for the Finley data: H − F. Doolittle's formula expressed in modern terms (see the appendix) is i = (H − F)(PPV − FN), or hit rate minus false alarm rate multiplied by the positive predictive value minus false negative rate.
The terms hit rate, false alarm rate, etc. were not used in Doolittle's work. His final equation (see Doolittle 1884, p. 124 et seq.) was more cumbersome as a result, but it is faithfully restated in current terms as above. In traditional meteorological terms, Doolittle's expression is (POD − POFD)(FOH − DFR). In terms of cell identifiers, the equation when applied to the tornado events is [a/(a + c) − b/(b + d)][a/(a + b) − c/(c + d)]. When applied to the no-tornado events, the equation is the mirror image of this: [d/(b + d) − c/(a + c)][d/(c + d) − b/(a + b)]. Substituting the tornado prediction values from Finley's data into the formula gives (0.549 − 0.026)(0.28 − 0.0085), or 0.523 × 0.272 = 0.142. Substituting the no-tornado prediction values from Finley's data into the formula yields the same value: (0.974 − 0.451)(0.992 − 0.72), or 0.523 × 0.272 = 0.142.
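The cell-identifier form of the equation translates directly into code. The sketch below is an illustration under the cell layout used in the formulas above; the function name is ours. The no-tornado value is obtained by the mirror-image substitution described in the text (a exchanged with d, and b exchanged with c).

```python
def doolittle_i(a, b, c, d):
    """Doolittle's inference ratio in cell terms: (H - F)(PPV - FN)."""
    h_minus_f = a / (a + c) - b / (b + d)
    ppv_minus_fn = a / (a + b) - c / (c + d)
    return h_minus_f * ppv_minus_fn


a, b, c, d = 28, 72, 23, 2680            # Finley's data
i_tornado = doolittle_i(a, b, c, d)
i_no_tornado = doolittle_i(d, c, b, a)   # mirror image for the null event

print(round(i_tornado, 3), round(i_no_tornado, 3))
```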
This final formulation of i was a simple and intentional modification of Peirce's measure H − F, an expression that has come to be known in meteorology as the true skill statistic (TSS). Doolittle formulated i explicitly as a corrective to Peirce's measure, noting that Peirce accounted only for “sins of omission”—failures of identification—but not for “sins of commission”—failures of prediction (Doolittle 1884, p. 124). Doolittle proposed to correct this by multiplying Peirce's result 0.523 by its complement PPV − FN = 0.272 thereby obtaining i = 0.142, which he argued is that proportion of correct identifications and responses that were obtained through skill alone, not through skill and chance together.
Since Doolittle's formula for i yielded the same value (0.142) for forecasts of both event categories (cells a and d and cells b and c are interchangeable in the calculation), he used 0.142 as a constant term in a ratio calculation with raw accuracy to determine what proportion of Finley's success at predicting both the event and the null event could be considered skill and what proportion chance. Since 0.142/0.154 = 0.923, Doolittle concluded that 92.3% of the unbiased hit rate of 0.154 constituted the nonchance portion of raw accuracy involved in Finley's 28 correct predictions of tornado, while 7.7% would have been by chance. For Finley's correct no-tornado predictions, 0.142/0.966 = 0.147, so Doolittle concluded that “of (Finley's) success in predicting the non-occurrence of tornadoes, only 0.147 is due to skill, and 0.853 is due to chance” (Doolittle 1884, p. 125).
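Doolittle's two ratios can be reproduced as follows. This is a sketch of the arithmetic only, with variable names of our choosing; it carries unrounded values through the division, which is why the first ratio comes out at 0.923 rather than the 0.922 obtained from the rounded figures 0.142 and 0.154.

```python
a, b, c, d = 28, 72, 23, 2680   # Finley's data

# Doolittle's i, identical for both categories in a 2 x 2 matrix.
i = (a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d))

# Raw accuracy (Wagner's unbiased hit rate) for each category.
hu_tornado = (a / (a + c)) * (a / (a + b))
hu_no_tornado = (d / (b + d)) * (d / (c + d))

# Share of raw accuracy that Doolittle attributed to skill.
skill_share_tornado = i / hu_tornado
skill_share_no_tornado = i / hu_no_tornado

print(round(skill_share_tornado, 3), round(skill_share_no_tornado, 3))
```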
We can see three problems with Doolittle's formula for i and his use of it.
1) Interdependence of responses in a 2 × 2 matrix and the inapplicability of Doolittle's i to the general k × k (k ≥ 2) problem
Doolittle observed (Doolittle 1884, p. 126) that he had found it impossible to calculate a single-numerical inference ratio that was identical for both an event and a nonevent in a k × k (k > 2) situation. He did not propose a reason for this difficulty. The reason for it is also the reason that Peirce's and Doolittle's single expressions are identical for an event and a null event for the Finley data. In 2 × 2 matrices, df = 1; that is, given the marginals and any single cell value, the value of each of the remaining three cells is absolutely determined and so the proportions of correct and erroneous forecasts for each event and nonevent category are interdependent. This interdependence is reflected in the symmetry of proportions in H − F and PPV − FN for both an event and a nonevent. The H − F for tornado responses, for example, is calculated as 0.549 − 0.026 = 0.523; for no-tornado responses, it is 0.974 − 0.451 = 0.523. Since df > 1 in the k × k (k > 2) situation (e.g., df = 4 in a 3 × 3 matrix), the interdependence of response proportions that characterizes a 2 × 2 matrix is not present and a single numerical value like Peirce's 0.523 or Doolittle's 0.142 that applies to all forecast–event categories in the k > 2 case cannot be calculated.
2) Anomalies in the meaning of the value of +1 for i
If we were to assume, in place of the Finley data, a matrix with zero accuracy rates for both events (tornado, no tornado), and with false alarm rates of 100% for both, the unbiased hit rate would be 0, as it should be: [0/(a + b) × 0/(a + c)] = 0. Doolittle's i, however, yields the anomalous value of (0 − 1)(0 − 1) = +1, which it also yields when both hit rates are perfect: (1 − 0)(1 − 0) = +1. Doolittle dismisses this anomaly as merely a demonstration that the “logical connection” between prediction and event is perfect, just negatively perfect (Doolittle 1884). This reasoning ignores Doolittle's definition of i as being that part of success that is due to skill. A forecasting method that achieves an i value of +1 must be interpreted according to Doolittle's definition of i as having done so by the operation of perfect skill unalloyed with chance, which is a contradiction in terms in the case of a method that never is correct. While a dichotomous prediction method that is always wrong can be useful (one could simply report the opposite of any given forecast), it can hardly be said to represent skill.
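The anomaly is easy to reproduce. In the sketch below the two matrices are hypothetical (the cell counts of 50 are chosen only for illustration and appear nowhere in the source), and the function names are ours.

```python
def doolittle_i(a, b, c, d):
    """Doolittle's inference ratio, (H - F)(PPV - FN), in cell terms."""
    return (a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d))


def unbiased_hit_rate(a, b, c):
    """Wagner's Hu for the event whose correct forecasts are in cell a."""
    return (a / (a + c)) * (a / (a + b))


# Hypothetical matrices: every forecast wrong, and every forecast right.
always_wrong = dict(a=0, b=50, c=50, d=0)
always_right = dict(a=50, b=0, c=0, d=50)

i_wrong = doolittle_i(**always_wrong)
i_right = doolittle_i(**always_right)
hu_wrong = unbiased_hit_rate(always_wrong["a"], always_wrong["b"], always_wrong["c"])
hu_right = unbiased_hit_rate(always_right["a"], always_right["b"], always_right["c"])

print(i_wrong, i_right, hu_wrong, hu_right)
```

Doolittle's i cannot tell the two matrices apart, whereas Hu separates them cleanly.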
Similarly, as to i being +1 in the opposite situation, when the hit rates are perfect and false alarm and false negative rates are zero, its value would have to be interpreted as having determined that all predictions were a matter of skill and none were due to the operation of chance. This is an impossible result. Chance successes cannot be ruled out, even though the probability of chance occurrence would be extremely low. Consider, for example, the hypothetical matrix at Table 2.
Doolittle's i would be +1 for these data. Given his definition of i, the interpretation of this value would be that not only is accuracy perfect, but that perfect identification and predictive skill fully explain the accuracy. In fact, however, the meaning of the perfect coefficient of +1 here is only that the prediction method made no errors, the issue of chance versus skill aside.
Wagner's measure, by contrast, would be +1 for both tornado and no-tornado forecasts, but with the following respective chance proportions:
The meaning of these results is that prediction in this hypothetical study was perfect but chance still was operating. In the case of no-tornado forecasts, the forecaster (or the forecasting model) used the no-tornado response 75% of the time: 300 times out of a total of 400 responses. By the operation of chance, then, 75% of the 300 actual no-tornado events could have been classified as no tornado, or 225 chance predictions. Using the normal approximation to the binomial distribution gives us z = (300 − 225)/√(300 × 0.75 × 0.25) = 10; p < 0.0001. So, while the probability of this perfect hit rate of 300 out of 300 no-tornado predictions being by chance is extremely low, the operation of chance cannot be foreclosed. The same is true of the perfect accuracy in predicting tornado: z = 17; p < 0.0001.
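The two z values can be checked with the sketch below. The no-tornado figures (300 events, response used 75% of the time) are stated in the text; the tornado figures (100 events, response used 25% of the time) are inferred from the total of 400 responses and should be read as an assumption. The one-sided p value is computed from the standard normal upper tail, which is our choice of test direction.

```python
import math


def z_normal_approx(hits, n_events, p_chance):
    """Normal approximation to the binomial: (observed - expected) / sd."""
    expected = n_events * p_chance
    sd = math.sqrt(n_events * p_chance * (1 - p_chance))
    return (hits - expected) / sd


def upper_tail_p(z):
    """One-sided upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z_no_tornado = z_normal_approx(hits=300, n_events=300, p_chance=0.75)
z_tornado = z_normal_approx(hits=100, n_events=100, p_chance=0.25)  # inferred counts

print(round(z_no_tornado, 2), round(z_tornado, 2))
print(upper_tail_p(z_no_tornado) < 0.0001, upper_tail_p(z_tornado) < 0.0001)
```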
At the root of both these anomalies is that i for any given 2 × 2 matrix is at base the product of an imperfect second calculation of Wagner's unbiased hit rate. This is explicated below.
3) Doolittle's i is an unnecessary and imperfect recalculation of Wagner's unbiased hit rate
Doolittle's argument implies that even though raw accuracy rates differ as between response categories—0.154 for tornado forecasts; 0.966 for no-tornado forecasts—those portions of accurate forecasts that represent skill are a function of i, in this case, 0.142. The reverse is true, however; the value 0.142 is a derivative of the correct calculation of the unbiased hit rates and of the very different chance rates of Finley's tornado and no-tornado forecasts. The value 0.142 for the Finley data is simply the unbiased hit rate of both tornado and no-tornado forecast–event categories after the true chance rates are calculated and the nonchance values of the forecasts are substituted into cells a and d of the original matrix, creating an altered data array, marginals, and n.
In a confusion matrix, the chance rate of success in correctly forecasting an event or nonevent category is a function of the proportion of total responses that are of that category and is calculated using the same logic as the unbiased hit rate: it is the product of the two probabilities that cues to an event are presented and that the specific event is forecast: [(a + b)/n][(a + c)/n] for tornado forecasts; [(b + d)/n][(c + d)/n] for no-tornado forecasts (Wagner 1993, pp. 6 and 18–19). The nonrandom rate varies accordingly. This calculation of the true chance rate of each response category is the same as that used to compute the expected (that is, chance) values for the χ² test, and the results are the same as Stephenson's calculation of a random no-skill forecast for the Finley data (Stephenson 2000, p. 223, Table 4).
In the case of Finley's data, the chance of random success in predicting tornado is the product of the two probabilities that a tornado actually did present itself (51/2803) and that the prediction, tornado, had been made (100/2803), or p(c) = 0.00065. In absolute numbers, the chance portion of the 28 successful predictions of tornado is 1.8 predictions, (100 × 51)/2803, and what Doolittle called the skillful portion is 26.2 predictions. In the case of the no-tornado predictions, the chance rate is calculated as (2703/2803)(2752/2803) = 0.947. In absolute numbers, the chance portion of the 2680 correct predictions is 2653.8 = (2703 × 2752)/2803. The skillful portion is 26.2, the same as the skillful portion of the tornado forecasts. In other words, because of the interdependence of forecasts in a 2 × 2 confusion matrix the true chance rates of the forecast categories may differ greatly, but they yield the same nonchance value for each category. This is the case for every 2 × 2 confusion matrix for the same reason that H − F and PPV − FN always yield the same value for the two respective event categories: the interdependence of forecast accuracy in a 2 × 2 matrix.
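The chance rates, chance counts, and skillful portions quoted above can be reproduced with the following sketch. It is an illustration of the arithmetic using the cell layout of the formulas in this section; the variable names are ours.

```python
a, b, c, d = 28, 72, 23, 2680   # Finley's data
n = a + b + c + d

# Chance rates: product of the forecast and event marginal proportions.
chance_rate_tornado = ((a + b) / n) * ((a + c) / n)
chance_rate_no_tornado = ((c + d) / n) * ((b + d) / n)

# Chance counts: the expected cell values used in a chi-square test.
chance_hits_tornado = (a + b) * (a + c) / n
chance_hits_no_tornado = (c + d) * (b + d) / n

# Correct forecasts remaining once the chance counts are removed.
skill_hits_tornado = a - chance_hits_tornado
skill_hits_no_tornado = d - chance_hits_no_tornado

print(round(chance_rate_tornado, 5), round(chance_rate_no_tornado, 3))
print(round(chance_hits_tornado, 1), round(chance_hits_no_tornado, 1))
print(round(skill_hits_tornado, 1), round(skill_hits_no_tornado, 1))
```

Both skillful portions reduce algebraically to (ad − bc)/n, which is why they coincide in every 2 × 2 matrix.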
Table 3 is useful in demonstrating that we obtain the value of Doolittle's i by simply substituting the nonrandom correct forecast values into cells a and d of the matrix and calculating Hu for both forecast–event categories. Because of the interdependence of forecasts in a 2 × 2 matrix, the new rates are identical: 26.2²/(98.2 × 49.2) = 0.142.
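The substitution can be sketched in code as follows. This is an illustration under the same cell layout as the formulas in this section, with variable names of our choosing; it uses the unrounded nonchance count rather than 26.2.

```python
a, b, c, d = 28, 72, 23, 2680   # Finley's data
n = a + b + c + d

# Nonchance correct forecasts; the same value results for cell d.
skill_hits = a - (a + b) * (a + c) / n

# Substitute into cells a and d, leaving the error cells b and c unchanged.
a_adj = d_adj = skill_hits

hu_tornado_adj = (a_adj / (a_adj + c)) * (a_adj / (a_adj + b))
hu_no_tornado_adj = (d_adj / (b + d_adj)) * (d_adj / (c + d_adj))

# Doolittle's i from the original, unaltered matrix.
doolittle_i = (a / (a + c) - b / (b + d)) * (a / (a + b) - c / (c + d))

print(round(hu_tornado_adj, 3), round(hu_no_tornado_adj, 3), round(doolittle_i, 3))
```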
While Doolittle believed that the value of 0.142 represents the logical connection between prediction and event in the matrix, it is more to the point to observe that it represents only the absolute interdependence of forecasts—in this case, only the nonchance correct ones—that is always present in a 2 × 2 matrix. It adds no value to the calculation of unbiased hit rates, chance rates, and chance values from the original Finley data. In fact, it derives from those calculations. Calculating the unbiased hit rates and the appropriate chance rates from the original data for each category allows us to see the true n, the extremely disparate numbers of correct event and null event forecasts, the correspondingly disparate rates of chance, and the true chance values of each forecast–event category.
While Doolittle's formula produced the same value as Hu in Table 3, the anomalous meanings of +1 for i in the case of completely erroneous and completely perfect forecast matrices, and its failure to account for chance in either case, make it an imperfect second calculation of the unbiased hit rate. If it were considered important to develop a single-numerical expression for the nonchance accuracy rates of both forecast–event categories in a 2 × 2 matrix, we could simply recalculate Wagner's unbiased hit rate correctly after substituting the nonchance correct values into cells a and d of the matrix. The identical Hu obtained for both categories would be easier to calculate than Doolittle's formula and would not yield distorted values and a misunderstanding of the role of chance in the cases of perfect or completely erroneous forecasts.
Ironically, Doolittle correctly calculated the number of chance tornado and no-tornado forecasts from the original Finley data as an early step in his reasoning, recognizing that chance successes are calculated as the product of the number of actual tornadoes (or no tornadoes) and the number of tornado (or no tornado) predictions, divided by the total number of cases (Doolittle 1884, p. 123). Even though the correct chance values of tornado and no-tornado forecasts (1.82 and 2653.8, respectively) are implicit in Doolittle's formula, however, his measure is a flawed version of Hu.
Doolittle's i has seen little use over the years, perhaps because he eventually developed a poor opinion of single-expression categorical measures of forecasting ability. The TSS, however, was popular for many years and became known by different names in various disciplines (e.g., Woodworth 1938; Swets 1986; Beck and Feldman 1989). Doswell and Flueck (1989), Doswell et al. (1990), and Murphy and Daan (1985) discuss its meteorological utility at some length. More recently, however, there has been general acknowledgment that in the case of rare events its value tends toward that of the simple hit rate (POD) because the false alarm rate (POFD) becomes negligible (e.g., Stephenson 2000). In Wagner's terms, its bias (both its variables concern only measures of identification accuracy) has become apparent.
Three years after introducing his measure, Doolittle (1888, p. 95) noted in a talk on 25 May 1887 to the Philosophical Society of Washington that the skill score devised by Gilbert—if revised in a manner suggested by Doolittle in his second “inference ratio” measure—appeared to have accounted for hypothetical chance. Doolittle's new formula for i yielded the single coefficient 0.355, essentially the same value obtained by the coefficient of determination (0.352) and close to the value of phi (0.377), each of which is appropriately used in categorical tables that are not confusion matrices. Doolittle concluded his review of Gilbert's score, as well as his own and Finley's efforts, on a dismal note (Doolittle 1888, p. 96):
Mr. Gilbert says that he hopes “…to show that Mr. Finley's method involves a serious fallacy. This fallacy consists in the assumption that verifications of a rare event may be classed with verifications of the predictions of frequent events without any system of weighting.” It is not perceived that Mr. Gilbert has furnished any such system. The fallacy, perhaps, consists rather in the supposition that any valuable result can be obtained by averaging the percentages of verification of heterogeneous classes of predictions. Mr. Finley correctly computed his indiscriminate percentage of verifications, and thereby furnished a striking and, perhaps, much-needed illustration of the worthlessness of such computations. The elimination of hypothetical chance from such mixed percentages merely renders their worthlessness less apparent.
Despite Doolittle's conclusion that the proportions of verification of heterogeneous classes of predictions cannot be meaningfully represented in a single score, his second i measure appeared in another guise in 1926, 13 years after Doolittle's death. Heidke's skill score (Heidke 1926) yields the same value as Doolittle's second i in any 2 × 2 matrix, including the Finley data in Table 1 (0.355).2 In May 1887 when Doolittle presented the equivalent of the measure that would be attributed to Heidke 39 years later and that has been widely used since then, he rejected it (and others like it) inasmuch as it “successfully passes all the tests which Mr. Gilbert devised for his own formula, but it fails under others, and it is not maintained that it has any scientific value” (Doolittle 1888, 95–96).
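The value 0.355 can be reproduced from Finley's counts with the sketch below. The text does not give the formula, so the code assumes the form in which Heidke's score is usually written for a 2 × 2 table: proportion correct, corrected for the proportion expected by chance from the marginals. The closed-form cell expression is included as a cross-check; the variable names are ours.

```python
a, b, c, d = 28, 72, 23, 2680   # Finley's data
n = a + b + c + d

# Assumed chance-corrected form: (PC - E) / (1 - E).
proportion_correct = (a + d) / n
expected_correct = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
heidke = (proportion_correct - expected_correct) / (1 - expected_correct)

# Equivalent closed form in cell terms.
heidke_cells = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

print(round(heidke, 3), round(heidke_cells, 3))
```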
b. Stephenson's ORSS
Despite Doolittle's pessimism about single-expression measures of categorical forecasting skill, more efforts were made in subsequent years. One of the most recent among them is the ORSS. The climate statistician D. B. Stephenson used Sgt. Finley's data to illustrate the strengths and weaknesses of the various statistics that were employed during the past century for measuring dichotomous weather forecasting performance (Stephenson 2000; see also Thornes and Stephenson 2001; Jolliffe and Stephenson 2003). The measures he surveyed included the simple hit rate H; several different skill scores proposed by Heidke (1926), Doolittle (1884, 1888), Peirce (1884), and Gilbert (1884); the bias calculation; χ²; various formulas derived from signal detection theory; and Murphy and Winkler's (1987) joint distributions approach.
Stephenson then proposed as the best measure for assessing skill in dichotomous weather forecasts an odds ratio–based expression that he developed: the odds ratio skill score, which is Stephenson's restatement of Yule's Q (Yule 1900, 1903).3 Stephenson (2000, p. 227) wrote:
Despite its wide use for measuring association in contingency tables (Agresti 1996), until now, [Yule's Q] has never been applied for verifying meteorological forecasts. It is based entirely on the joint probabilities, and so is not influenced in any way by the marginal totals.
It is true that Yule's Q involves only the cross-products of cells, not the marginals, but it is also true that Stephenson's equivalent of Yule's Q can be calculated using only the conditional probability of correct identifications of events and nonevents, while the conditional probability of correct uses of forecasts is ignored (Stephenson 2000, p. 227): (H − F)/[(H + F) − 2HF].
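The equivalence of the two forms can be verified numerically. The sketch below is an illustration with Finley's data; H and F are computed from the cell layout used in the formulas of this section, and the variable names are ours.

```python
a, b, c, d = 28, 72, 23, 2680   # Finley's data

# Identification rates only: hit rate and false alarm rate.
H = a / (a + c)
F = b / (b + d)

# ORSS written in terms of H and F, as quoted from Stephenson (2000).
orss_from_rates = (H - F) / (H + F - 2 * H * F)

# Yule's Q written in terms of the cross-products of the cells.
yules_q = (a * d - b * c) / (a * d + b * c)

print(round(orss_from_rates, 3), round(yules_q, 3))
```

Neither form uses a/(a + b) or d/(c + d), so the conditional probability of correct uses of forecasts never enters the calculation.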
The ORSS statistic, like other statistics Stephenson surveys, yields a single overall measure of association. When applied to the data compiled by Sgt. Finley, it is 0.957, or 95.7%, which is not significantly different from Finley's own claim of 96.6% accuracy. Stephenson also notes (2000, p. 231) that a skill score obtained with his measure communicates only limited information:
It is important to realize that forecast skill does not necessarily imply anything about the possible utility or value of the forecasts… The purpose of skill scores is to quantify the overall agreement between the forecasts and the observations, and so by definition should not depend on what the user considers to be important (e.g., tornado rather than not tornado as the event).
The problem with the ORSS is that, as Stephenson writes, it is the equivalent of Yule's Q, which is sensitive to outsized values in a cell of a 2 × 2 table, as in this case. The Yule's Q calculation for Table 2 would be [(28 × 2680) − (23 × 72)]/[(28 × 2680) + (23 × 72)] = 0.957. The outsized value in cell d effectively guarantees a nearly perfect and positive association between Finley's forecasts and the occurrence–nonoccurrence of tornadoes, regardless of the weak true positive rate for the tornado forecasts. In the case of Finley's data, ORSS's representation of the overall agreement between forecasts and observations is valid but of limited value.
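As a check on the arithmetic, the following minimal Python sketch computes Yule's Q from the cell cross-products and the ORSS from H and F for Finley's counts. The function names are illustrative only and are not drawn from Stephenson (2000); the cell labels follow the appendix convention (a and d are the correct tornado and no-tornado forecasts, b and c the erroneous ones).

```python
# Finley's 2 x 2 counts, labeled as in the appendix:
# a = tornado forecast, tornado observed;    b = tornado forecast, none observed;
# c = no-tornado forecast, tornado observed; d = no-tornado forecast, none observed.
a, b, c, d = 28, 72, 23, 2680


def yules_q(a, b, c, d):
    """Yule's Q computed from the cross-products of the cells."""
    return (a * d - b * c) / (a * d + b * c)


def orss_from_rates(a, b, c, d):
    """ORSS written in terms of the hit rate H and the false alarm rate F."""
    H = a / (a + c)  # hit rate (POD)
    F = b / (b + d)  # false alarm rate (POFD)
    return (H - F) / ((H + F) - 2 * H * F)


q = yules_q(a, b, c, d)
orss = orss_from_rates(a, b, c, d)
print(round(q, 3), round(orss, 3))  # 0.957 0.957
```

The two forms agree because both reduce to (ad − bc)/(ad + bc); neither uses the conditional probability that a forecast, once issued, was correct.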
This is not surprising, since—like Doolittle's i—the measure cannot discriminate between the separate contributions of the two forecast choices. The same result, 0.957, would be obtained if the values in cells a and d were to be reversed, depicting a completely different set of facts than those in Table 1. When considered within the context of this limitation, Stephenson's claim that the measure is not influenced in any way by the marginal totals is mathematically correct—it is calculated using only the joint probabilities—but incomplete. As with Doolittle's measure, the marginals do have an effect on its adequacy. In general, the farther the marginals diverge from numerical equality, the less representative the measure will be of disparate rates of judgment skill as between the two categories.
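The insensitivity of the measure to a reversal of cells a and d can be illustrated with a short Python sketch. The reversed table is hypothetical, constructed only for this comparison, and the function names are illustrative; the unbiased hit rate for the first forecast category is included to show how differently the two tables would be read once both conditional probabilities are taken into account.

```python
def yules_q(a, b, c, d):
    """Yule's Q (equivalently the ORSS) from the cell cross-products."""
    return (a * d - b * c) / (a * d + b * c)


def hu_event1(a, b, c, d):
    """Unbiased hit rate for the first category: [a/(a + b)][a/(a + c)]."""
    return (a / (a + b)) * (a / (a + c))


finley = (28, 72, 23, 2680)
reversed_ad = (2680, 72, 23, 28)  # hypothetical table with cells a and d exchanged

q_finley = yules_q(*finley)
q_reversed = yules_q(*reversed_ad)
hu_finley = hu_event1(*finley)
hu_reversed = hu_event1(*reversed_ad)

print(round(q_finley, 3), round(q_reversed, 3))    # 0.957 0.957
print(round(hu_finley, 3), round(hu_reversed, 3))  # 0.154 0.966
```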
The failure of ORSS to account for both of the conditional probabilities that determine forecast accuracy is what leaves it susceptible to skewed cell values. By way of contrast, calculating Hu yields information of value both to an understanding of Sgt. Finley's skill at forecasting and to the practical utility of his forecasts. As we saw in the preceding section on Doolittle's i, the unbiased hit rates are as follows:
Hu of tornado forecasts—0.154 accuracy, with chance rate of 0.000 65 and
Hu of no-tornado forecasts—0.966 accuracy, with chance rate of 0.947.
These calculations illuminate Finley's record in a way that no single expression can. First, a single score masks the disparate accuracy rates of tornado and no-tornado predictions. Finley's accuracy in correctly predicting that a tornado will hit (which, given the very different implications of a tornado hitting rather than not hitting, is the more critical of the two judgments) is neither 95.7% (ORSS and Yule's Q) nor 96.6% (Finley). It is 15.4%. While well above the level of chance (z = 23, p < 0.0001), 15.4% accuracy still would give little assurance that Finley's tornado predictions were reliable. Nearly half the time (45%) that a tornado did occur, Finley had predicted no tornado, and nearly three-fourths of the time (72%) that he did predict tornado, no tornado hit. Thus, Hu demonstrates that Finley's more substantial accuracy was in predicting that a tornado was not on the way, but that here his accuracy was better than chance largely because n was a high number that was skewed toward the null event (z = 2.68, p = 0.004).
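The unbiased hit rates, chance rates, and error percentages quoted above can be reproduced from Finley's four cell counts. The Python sketch below is a minimal illustration; the variable names are the editor's own, and the chance rate for the no-tornado category is assumed to follow the same product-of-marginals form that the paper gives for the first category. The significance tests are not reproduced here.

```python
# Finley's counts, labeled as in the appendix (rows = forecasts, columns = events).
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d  # 2803 forecast-event pairs

# Unbiased hit rate: (proportion of forecasts correct) x (proportion of events detected).
hu_tornado = (a / (a + b)) * (a / (a + c))
hu_no_tornado = (d / (c + d)) * (d / (b + d))

# Chance rate: product of the forecast marginal and the event marginal, each over n.
chance_tornado = ((a + b) / n) * ((a + c) / n)
chance_no_tornado = ((c + d) / n) * ((b + d) / n)

# Error shares discussed in the text.
missed_tornadoes = c / (a + c)        # tornadoes that were forecast as no tornado
unverified_forecasts = b / (a + b)    # tornado forecasts followed by no tornado

print(round(hu_tornado, 3), round(chance_tornado, 5))        # 0.154 0.00065
print(round(hu_no_tornado, 3), round(chance_no_tornado, 3))  # 0.966 0.947
print(round(missed_tornadoes, 2), round(unverified_forecasts, 2))  # 0.45 0.72
```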
4. Concluding observations: Wagner's unbiased hit rate as an extension of Doolittle's use of conditional probabilities and of Murphy and Winkler's joint distributions approach
Doolittle's and Stephenson's measures (as well as Peirce's and Heidke's) illustrate the limitations of any single-expression representation of categorical weather forecast accuracy, and in doing so they assist in demonstrating the potential value of Wagner's unbiased hit rate and its appropriate chance measure, each calculated separately for forecasts of each of two or more weather events or conditions in the general k × k (k ≥ 2) format. Wagner's measure can be considered a logical next step in a chain of reasoning about conditional and joint probabilities presented briefly in Doolittle's 1884 paper and later in great detail in a 1987 paper by Murphy and Winkler. The latter paper proposed that a general framework for weather forecast verification be established that could bring coherence and structure to the field, particularly in light of the proliferation of analytical methods for different types of variables (categorical, continuous, ordered), different event likelihoods (routine, rare), and different forecast measures (categorical, probabilistic). The authors wrote (Murphy and Winkler 1987, p. 1330):
To be useful, such a framework should (inter alia) (i) unify and impose some structure on the overall body of verification methodology, (ii) provide insight into the relationships among verification measures, and (iii) create a sound scientific basis for developing and/or choosing particular verification measures in specific contexts.
Murphy and Winkler, like Doolittle and Wagner, realized that we need to account both for correct event identification rates and correct forecast rates. Their verification model proposed two factorizations, each of which would yield what Murphy and Winkler considered to be a distinct but complementary approach to verification: a “calibration refinement” factorization, which combines the conditional distributions of observations given the forecasts with the marginal distribution of the forecasts, and a “likelihood-base rate” factorization, which combines the conditional distributions of forecasts given the observations with the marginal distribution of the observations. Although Murphy and Winkler's intent was to assess both forecast and event identification accuracy, they did not propose that their two factorizations be combined in order to yield an unbiased expression of the accuracy of a forecaster, a model, a diagnostic test, and so on. Instead, they simply advocated that both the hit rate (likelihood-base rate) and the forecast rate (calibration refinement) be evaluated, rather than just one or the other, perhaps without realizing that each of those two distinct factorizations constitutes a biased value when considered in isolation from the other (Murphy and Winkler 1987, p. 1335):
We believe that these factorizations constitute complementary rather than alternative ways to approach the verification problem… . Thus, a complete verification study would necessarily involve the evaluation of factors associated with both factorizations. [Emphasis added.]
Murphy and Winkler proposed to demonstrate that their two factorizations were jointly evaluated by Bayes's theorem, where x represents observed events and f represents forecasts:
p(x | f) = p(x)p(f | x)/p(f). (1)
They wrote (Murphy and Winkler 1987, p. 1334):
Here the base rate p(x) and the likelihoods p(f | x) are multiplied to combine the two types of information that they represent. This product is then divided by p(f) to normalize and yield p(x | f) which reflects both the base rate information and the information contained in the forecast f.
While it is true that both the base rate of events and the probability of correct forecasts conditioned upon observations are represented in Bayes's theorem, this does not constitute the calculation of an accuracy rate that accounts for both conditional probabilities that determine accuracy. It calculates only the probability of correct identification of the tornado events conditioned on the forecasts of tornado. Substituting Finley's data for tornado into Eq. (1) yields p(x | f) = [(51/2803)(28/51)]/(100/2803) = 0.28. In Murphy and Winkler's terms, this is the calibration-refinement factorization. Using Wagner's calculus for the data in Table 1, this is simply a/(a + b), or 28/100, which in traditional meteorological terms is the frequency of hits (FOH), and in current terms is the positive predictive value (PPV) of a forecast of tornado. The complement of this would be
p(f | x) = p(f)p(x | f)/p(x). (2)
Equation (2), in Murphy and Winkler's terms, is the likelihood-base rate factorization, or the conditional probability that when a tornado occurred, Finley's method would have correctly predicted it. To Wagner, this is a/(a + c), or 28/51, which is the simple hit rate H of tornado forecasts, or in traditional terms POD.
The product of the two values from Eqs. (1) and (2) gives us Wagner's unbiased hit rate for Finley's predictions of tornado: 0.28 × 0.549 = 0.154. Absent this calculation, neither p(x | f)—the positive predictive value of tornado forecasts—nor p(f | x)—the simple hit rate—yields a valid measure of tornado forecasting accuracy, because both of them are biased. The first ignores false negatives and the second ignores false positives, and consequently each overestimates the true tornado forecasting rate. Murphy and Winkler are correct in arguing that each rate has its distinct utility for forecasters—the PPV allows a forecaster to evaluate predictive accuracy in order to refine a technique, while H allows a forecaster to determine how accurate a forecast can be expected to be for a particular type of event—but the true rate of both event identification accuracy and predictive precision is the product of these two factorizations. That Hu yields this rate can be illustrated using the hypothetical data in Table 4.
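The two Bayesian calculations and their product can be traced step by step for Finley's tornado category. The Python sketch below is illustrative only: the variable names are assumptions introduced here, and exact fractions are used simply to avoid rounding until the final step.

```python
from fractions import Fraction

# Finley's tornado category: 28 correct forecasts, 100 tornado forecasts,
# 51 observed tornadoes, 2803 forecast-event pairs in all.
hits, forecasts, events, n = 28, 100, 51, 2803

p_x = Fraction(events, n)             # base rate of tornadoes, p(x)
p_f = Fraction(forecasts, n)          # marginal rate of tornado forecasts, p(f)
p_f_given_x = Fraction(hits, events)  # likelihood, p(f | x)

# Bayes's theorem: base rate times likelihood, normalized by p(f).
p_x_given_f = p_x * p_f_given_x / p_f

# The complementary calculation recovers p(f | x) from p(x | f).
p_f_given_x_check = p_f * p_x_given_f / p_x

# Wagner's unbiased hit rate is the product of the two conditional probabilities.
hu = p_x_given_f * p_f_given_x

print(float(p_x_given_f))                  # 0.28
print(round(float(p_f_given_x_check), 3))  # 0.549
print(round(float(hu), 3))                 # 0.154
```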
An assumption that is common in research involving confusion matrices (e.g., Kraut 1980; Park and Levine 2001; Newman et al. 2003; Vrij et al. 2007; Vrij 2008, 198–199) is that the data in Table 4 depict a 50% error rate, or what is sometimes considered to be the equivalent of the chance rate of accuracy. The logic of this 50% chance rate (often expressed in the literature as “no better than the flip of a coin”) is that when a tornado occurred, half the time the forecaster had predicted that no tornado would occur, and, conversely, when a tornado did not occur, half the time the forecaster had predicted that one would occur. Murphy and Winkler would argue in evaluating the accuracy of this hypothetical clueless forecaster that we should examine both the hit rate—p(f | x), which is 50% (25/50)—and the response rate—p(x | f), also 50% (25/50), in order to have a complete picture.
The true accuracy (and predictive rate) of both tornado and no-tornado forecasts, however, is 25%, not 50% (or 0, as association measures calculate it). With each set of weather event cues placed before the clueless forecaster (or the clueless forecast model), there are two probabilities involved, not one, just as there would be if the forecaster were to actually flip a coin each time in order to make a prediction. There is the probability that the forecaster will choose to predict tornado instead of no tornado (or that the coin will land on heads rather than on tails), and there is the separate probability that a tornado will occur. The data in Table 4 indicate that each of those probabilities is 50%. The rate of success is determined by the product rule. In this case, since the forecaster (or the coin) is clueless and the predictions are independent of the events, the accuracy (and predictive) rate is 50% of 50%, or 25%.
Twenty-five percent is not only the true accuracy–predictive rate of the forecaster depicted in Table 4; it is the true chance rate, as well. When we calculate the unbiased hit rate for the data in Table 4, we obtain Hu = [a/(a + b)][a/(a + c)] = (25/50)(25/50) = 0.25. We obtain the same rate for chance success: p(c) = [(a + b)/n][(a + c)/n] = (50/100)(50/100) = 0.25, confirming that when the accuracy of the coin-flipping forecaster in Table 4 is calculated by the unbiased hit rate it is exactly equal to the chance rate, as it should be.
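The equality of Hu and the chance rate for the clueless forecaster can be verified directly. The Python sketch below assumes that every cell of Table 4 holds 25 cases, which is what the proportions 25/50 and 50/100 quoted above imply for a 2 × 2 table with n = 100; the variable names are illustrative.

```python
# Hypothetical coin-flipping forecaster: cell values inferred from the
# proportions quoted in the text (25/50 in each row and column, n = 100).
a, b, c, d = 25, 25, 25, 25
n = a + b + c + d

hit_rate = a / (a + c)  # p(f | x)
ppv = a / (a + b)       # p(x | f)

hu = ppv * hit_rate                      # unbiased hit rate
chance = ((a + b) / n) * ((a + c) / n)   # chance rate, p(c)

# An association measure such as Yule's Q reports no relationship at all.
yules_q = (a * d - b * c) / (a * d + b * c)

print(hit_rate, ppv)  # 0.5 0.5
print(hu, chance)     # 0.25 0.25
print(yules_q)        # 0.0
```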
In their real-world example of how their joint distributions framework would work in probabilistic forecasting,4 Murphy and Winkler calculated Eq. (2) and proposed it as the necessary complement to Eq. (1). They went no further in their analysis, however. Their probabilistic example (Murphy and Winkler 1987, 1334–1335) produced the two separate values, p(f | x) and p(x | f), without using the product rule to obtain the unbiased hit rate. They merely advised that we not focus on just one, but rather on both, of the values; in the case of the 0.40 probability of precipitation (PoP) forecast in their example, 0.0895 and 0.3662. Similarly, in their discussion of the operation of their framework in the case of 2 × 2 categorical forecasts of rainfall (Murphy and Winkler 1987, 1332–1334), they proposed to calculate hit rate and false alarm rate for rain and no-rain events, and to also calculate both p(f | x) and p(x | f) measures, but not to calculate measures for rain and no-rain events that account simultaneously for both event identification and forecast accuracy.
Murphy and Winkler's proposal that forecast verification be based on the joint distributions of observations and forecasts was a step in the right direction. Their calculations yielded two accuracy rates for each forecast–event category, however: one representing only the proportion of correct identifications of events–null events and the other representing only the proportion of correct uses of the forecasts. While each proportion has practical utility, it is also the case that each is biased. Neither Doolittle nor Murphy and Winkler developed a measure with appropriate calculations of chance and significance based simultaneously on the two conditional probabilities that determine accuracy and predictive success in a categorical analysis. The unbiased hit rate developed by Wagner does do so for each event–forecast category in any k × k (k ≥ 2) matrix, and it could enhance the assessment of dichotomous and multicategory weather forecasting performance.
The author thanks Emeritus Professor Robert E. Kleck and Professor A. S. R. Manstead for their recollections of conversations with the late Hugh L. Wagner when he was developing the unbiased hit rate. The author also thanks the editor and the anonymous reviewers for greatly improving the paper through their meticulous attention to detail.
Calculation of Hu, Glossary of Terms, and Properties of Selected Measures
a. Configuration of confusion matrices and calculation of Hu
Matrices in the paper are arranged such that columns indicate observed (i.e., confirmed) weather events and rows indicate forecasts. Cell identifiers are always italicized when used in the body of the paper. Here they are as follows, using O for observed events and F for forecasts: a = (O+/F+), b = (O−/F+), c = (O+/F−), and d = (O−/F−). The total of a + b + c + d is designated as n, not N, as all matrices in the paper are considered to represent samples, not populations.
Table A1 demonstrates the calculation of Wagner's unbiased hit rate Hu in a 2 × 2 matrix. Even though for reasons of brevity the paper does not extend its argument to k × k (k > 2) matrices, Table A2 demonstrates the calculation of Hu in a 4 × 4 matrix as an exemplar of any k × k (k > 2) matrix.
The Hu for event 1 is calculated as the product of 1) the proportion of uses of the forecast, event 1, that were correct and 2) the proportion of all instances of event 1 that were correctly forecast: [a/(a + b)][a/(a + c)], or a²/[(a + b)(a + c)]. Similarly, Hu for event 2 (or the null event) is the product of the conditional probabilities that the forecast, event 2, was used correctly and that event 2 was correctly forecast. Thus, Hu for event 2 is d²/[(c + d)(b + d)]. In terms widely used in the social sciences and biostatistics, Hu for event 1 is the product of the positive predictive value and the hit rate, or PPV × H. In terms more familiar to meteorologists, Hu for event 1 is the product of the frequency of hits and the probability of detection, or FOH × POD. When calculating Hu for cell d in the case when event 2 is just the null for event 1, the meteorological terminology would be FOCN × PON, the product of the frequency of correct null forecasts and the probability of null events.
The unbiased hit rate for event 1 in Table A2 is calculated as [a/(a + b + c + d)][a/(a + e + i + m)], or a²/[(a + b + c + d)(a + e + i + m)]. For each successive category in any k × k (k > 2) matrix, Hu is calculated similarly along the diagonal. For event 2, for example, Hu is calculated as f²/[(e + f + g + h)(b + f + j + n)], and for event 3 it would be k²/[(i + j + k + l)(c + g + k + o)].
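The diagonal calculation described above generalizes naturally to a short routine. The Python sketch below is one possible rendering, not the author's or Wagner's own code: the function name is hypothetical, the 4 × 4 counts are invented purely for illustration, and returning NaN for a category that was never forecast or never observed is an assumption made here for convenience.

```python
def unbiased_hit_rates(matrix):
    """Return Hu for each category of a square forecast-by-event matrix.

    Rows are forecasts and columns are observed events, as in the appendix.
    Hu for category i is (diagonal cell)^2 / (row total x column total).
    """
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][j] for r in range(k)) for j in range(k)]
    rates = []
    for i in range(k):
        denom = row_totals[i] * col_totals[i]
        # Assumption: Hu is left undefined (NaN) when a category has no
        # forecasts or no observed events.
        rates.append(matrix[i][i] ** 2 / denom if denom else float("nan"))
    return rates


# Finley's 2 x 2 data: rows = tornado / no-tornado forecasts.
finley = [[28, 72],
          [23, 2680]]
finley_hu = unbiased_hit_rates(finley)
print([round(h, 3) for h in finley_hu])  # [0.154, 0.966]

# Invented 4 x 4 counts, for illustration only.
example = [[10, 2, 1, 1],
           [3, 12, 2, 1],
           [1, 2, 8, 3],
           [0, 1, 2, 11]]
example_hu = unbiased_hit_rates(example)
```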
b. Glossary of terms
Disciplines such as biostatistics, computer sciences, artificial intelligence, bioinformatics, and the social sciences use the same key terms when working with confusion matrices. The field of meteorology has continued, for the most part, to use terms particular to its own history (e.g., Doswell et al. 1990, 577–578). Recently, however, the more common terms from outside the field are being adopted (Stephenson 2000; Wilks 2011), so these are used in this paper. They are defined below, along with the names and acronyms of their counterparts in the traditional meteorological literature.
Hit rate, H, known in meteorology as probability of detection (POD). It is calculated in this paper as a/(a + c) for Table A1 and as a/(a + e + i + m) for Table A2. In some fields, writers occasionally consider H to be the proportion of all correctly identified events or statuses [i.e., (a + d)/n in a 2 × 2 matrix or (a + f + k + p)/n in Table A2], even though the more common name for this value is accuracy. The author will use H to mean only the proportion of detection in a single event category.
H. L. Wagner's unbiased hit rate, Hu, is calculated in 2 × 2 matrices as [a/(a + b)][a/(a + c)]. In this work, Hu performs this calculation for every event–forecast category of a k × k (k ≥ 2) matrix, so while the calculation in the formula above remains the same, the cell identifiers will change depending on the event–forecast category being measured. For example, Hu for the no-tornado forecasts in Finley's data (Table 1) would be calculated as d²/[(c + d)(b + d)], or 2680²/(2703 × 2752) = 0.966.
In this paper, F is the false alarm rate, known in meteorology as the POFD, or probability of false detection (e.g., Doswell et al. 1990). Within the context of a 2 × 2 matrix, F is defined as the proportion of null events that were wrongly forecast as being events. Referring to Table 1 in the text, F would be calculated as 72/2752 = 0.026 for tornado forecasts. In addition, F can (and in this paper, will) be calculated for every event category in a matrix. Thus, it is considered that forecasts of no tornado can be correct or erroneous, and that the proportion of all no-tornado forecasts that were erroneous can be calculated as F for no tornado. In the case of the Finley data in Table 1, therefore, F for the forecasts of no tornado is calculated as 23/51 = 0.45.
4) True positives and true negatives
True positives and true negatives are, respectively, the value that represents both correct identifications of an event and correct uses of the forecast of that event, and the value that represents both correct identification of a null event and correct use of the forecast (null event). In a pure dichotomous 2 × 2 matrix (i.e., not simply a two-category matrix), the true positives reside in cell a, and the true negatives reside in cell d. In Table A2, where the categories are simply different events or statuses (i.e., not necessarily the opposites of each other), the true positives lie along the diagonal a–f–k–p. The author does not use the term true negative for any cells in such a matrix. A forecast (event 1) that falls, for example, in the cell for actual event 3 is simply a false positive event 1 forecast.
5) False positives and false negatives
False positives and false negatives are erroneously positive forecasts of a null event and erroneously negative forecasts of an event, respectively. False positives can be considered for each category in a confusion matrix. In Table A2, the false positives for event 1 are cells b, c, and d. The false positives for event 4 are m, n, and o. In Table A1, cell b contains the false positives for event 1 and cell c contains the false positives for event 2. It is important to note that in a 2 × 2 confusion matrix, the false positives for one event are also the false negatives for the other event. Thus, for the Finley data in Table 1, the 72 forecasts in cell b are the false positive tornado forecasts, and they also are the 72 false negative no-tornado forecasts. Conversely, the 23 forecasts in cell c are the false negative tornado forecasts and they also are the false positive forecasts for the no-tornado category.
False negative rate or fraction, in the traditional meteorological literature, is DFR, or detection failure ratio. This is the complement of the false alarm rate. The latter calculates that proportion of all null events that were erroneously forecast as being the event. The false negative rate calculates that proportion of all forecasts of a null event that were erroneously forecast as being the null event. Thus, in Table A1, the FN for event 1 is calculated as c/(c + d). For the Finley data in Table 1, the FN for tornado is 23/2703 = 0.0085.
Positive predictive value, in the meteorological literature, is FOH, or frequency of hits. It is calculated as that proportion of forecasts of an event that correctly forecast the event. In Table A1, the PPV of event 1 is calculated as a/(a + b). In Table A2, the PPV of event 2 is calculated as f/(e + f + g + h).
Negative predictive value is the proportion of all forecasts of a null event that are correct. In the meteorological literature, this is FOCN, or frequency of correct null forecasts, and in a 2 × 2 matrix it is calculated as d/(c + d). In matrices that are more appropriately considered to be two category (e.g., tornado/thunderstorm) rather than dichotomous (e.g., tornado–no-tornado), neither event should be considered to have an NPV.
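The glossary terms can be collected into a single calculation for the tornado category of Finley's data. The Python sketch below follows the formulas given in the definitions above; the dictionary keys and variable names are illustrative labels chosen here, not established notation.

```python
# Finley's counts (Table 1), labeled as in Table A1.
a, b, c, d = 28, 72, 23, 2680

measures = {
    "H (POD)": a / (a + c),     # hit rate: tornadoes correctly forecast
    "F (POFD)": b / (b + d),    # false alarm rate: null events forecast as tornado
    "FN (DFR)": c / (c + d),    # false negative rate, as defined in the glossary
    "PPV (FOH)": a / (a + b),   # tornado forecasts that verified
    "NPV (FOCN)": d / (c + d),  # no-tornado forecasts that verified
}
# Unbiased hit rate for tornado forecasts: PPV x H.
measures["Hu"] = measures["PPV (FOH)"] * measures["H (POD)"]

for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```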
Table A3 displays pertinent properties of the expressions that are reviewed in some detail in the paper.
A comment/reply has been published regarding this article and can be found at http://journals.ametsoc.org/doi/abs/10.1175/WAF-D-14-00004.1 and http://journals.ametsoc.org/doi/abs/10.1175/WAF-D-14-00008.1
The third deficiency mentioned here might appear to be the reverse of the problem noted by Doolittle and by Murphy, but it can be seen as a different aspect of the same problem, i.e., the inapplicability of many single-expression measures to the general k × k (k ≥ 2) case.
The critique of Stephenson's odds ratio–based measure that follows is not intended to be a commentary on other uses of the odds ratio in categorical weather forecasting, such as Bowler's use of it in accounting for observational error (Bowler 2006, p. 1604).
Murphy and Winkler's only example of the operation of their framework using a real-world data array concerned probabilistic forecasts of rainfall in a particular jurisdiction. They provided no databased example of how the framework would work in the case of categorical forecasting, but they indicated that their reasoning in these cases would be the same.