Abstract
The paper briefly reviews measures that have been proposed since the 1880s to assess accuracy and skill in categorical weather forecasting. Most of these measures consist of a single expression, for example, a proportion, the difference between two proportions, a ratio, or a coefficient. Two exemplar single-expression measures for 2 × 2 categorical arrays that chronologically bracket the 130-yr history of this effort, Doolittle's inference ratio i and Stephenson's odds ratio skill score (ORSS), are reviewed in detail. Doolittle's i is appropriately calculated using conditional probabilities, and the ORSS is a valid measure of association, but both are limited in ways that mirror, to varying degrees, the limitations of all single-expression measures for categorical forecasting. These limitations include an inability to assess the separate accuracy rates of different forecast–event categories in a matrix, sensitivity to the interdependence of forecasts in a 2 × 2 matrix, and, for many of the measures, inapplicability to the general k × k (k ≥ 2) problem. The paper demonstrates that Wagner's unbiased hit rate, developed for use in categorical judgment studies with any k × k (k ≥ 2) array, avoids these limitations while extending the dual-measure Bayesian approach proposed by Murphy and Winkler in 1987.
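For orientation, the two featured measures can be sketched in the standard contingency-table notation, which is assumed here rather than taken from the original sources: for the 2 × 2 case, let a, b, c, and d denote hits, false alarms, misses, and correct rejections, respectively. Stephenson's ORSS is the odds ratio θ = ad/bc mapped onto the interval [−1, 1]:

\[
\mathrm{ORSS} = \frac{\theta - 1}{\theta + 1} = \frac{ad - bc}{ad + bc}.
\]

Wagner's unbiased hit rate for category i of a general k × k table with counts \(n_{ij}\) combines the two conditional accuracies, the probability of a correct outcome given the forecast category and given the observed category, echoing the dual conditional perspectives of Murphy and Winkler:

\[
H_{u,i} = \frac{n_{ii}}{n_{i\cdot}} \times \frac{n_{ii}}{n_{\cdot i}} = \frac{n_{ii}^{2}}{n_{i\cdot}\, n_{\cdot i}},
\]

where \(n_{i\cdot}\) and \(n_{\cdot i}\) are the corresponding row and column totals. These are the standard forms of the two measures, given only as a sketch; the paper develops and compares them in detail.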
A comment/reply has been published regarding this article and can be found at http://journals.ametsoc.org/doi/abs/10.1175/WAF-D-14-00004.1 and http://journals.ametsoc.org/doi/abs/10.1175/WAF-D-14-00008.1.