On the Uniqueness of Gandin and Murphy’s Equitable Performance Measures

  • 1 National Severe Storms Laboratory, and Cooperative Institute for Mesoscale and Meteorological Studies, and Department of Physics, University of Oklahoma, Norman, Oklahoma
  • 2 National Severe Storms Laboratory, and Cooperative Institute for Mesoscale and Meteorological Studies, Norman, Oklahoma

Abstract

Gandin and Murphy (GM) have shown that if a skill score is linear in the scoring matrix, and if the scoring matrix is symmetric, then in the two-event case there exists a unique, “equitable” skill score, namely, the True Skill Score (or Kuipers’s performance index). As such, this measure is treated as preferable to other measures because of its equitability. However, in most practical situations the scoring matrix is not symmetric due to the unequal costs associated with false alarms and misses. As a result, GM’s considerations must be reexamined without the assumption of a symmetric scoring matrix. In this note, it will be proven that if the scoring matrix is nonsymmetric, then there does not exist a unique performance measure, linear in the scoring matrix, that would satisfy any constraints of equitability. In short, there does not exist a unique, equitable skill score for two-category events that have unequal costs associated with a miss and a false alarm.

Corresponding author address: Dr. Caren Marzban, National Severe Storms Laboratory, 1313 Halley Circle, Norman, OK 73069.

Email: marzban@nssl.noaa.gov

1. Introduction

Performance measures or skill scores are often required to be “equitable” in that their use must not induce the forecasters to make forecasts that differ from their best judgments. Gandin and Murphy (1992), hereafter referred to as GM92, considered measures that are linear in the scoring matrix and derived the constraints that must be placed on the scoring matrix in order to assure the equitability of the measure. In the two-event case, if in addition to the derived constraints the scoring matrix is assumed symmetric, then the number of constraints is equal to the number of elements of the scoring matrix. This, in turn, allows for the determination of a unique scoring matrix and, consequently, a unique, equitable measure—the True Skill Score (or Kuipers’s performance index, among other names). Gerrity (1992) expands on the multicategory considerations of GM92 and finds a closed formula for a symmetric, equitable scoring matrix in terms of the marginal probabilities of the various categories.

However, in most practical situations the scoring matrix is not symmetric. This occurs not only when the two events have different a priori (climatological) probabilities, but also when the cost (or loss) associated with a false alarm is different from that of a miss. Therefore, to check for the existence of a unique, equitable measure, GM92’s analysis must be reexamined without assuming that the scoring matrix is symmetric. It is, in fact, possible to generalize further: GM92 also assume that constant forecasts of two events must be assigned equal scores (e.g., zero). However, this assumption is too restrictive and so it, too, can be relaxed (see the discussion section).

Specifically, GM92 asked the following question. What scoring matrix will yield a performance measure, S, satisfying the following three constraints:
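$$S(\text{constant forecasts of event 0}) = S(\text{constant forecasts of event 1}) = \alpha,$$
and
$$S(\text{perfect forecasts}) = \beta,$$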
where α and β are constants defining a scale for the score? They defined such a measure as equitable and showed that in the two-event case if the scoring matrix is further assumed to be symmetric, then there is a unique measure that satisfies these constraints. The uniqueness of the solution may be anticipated based on the number of constraints (three) and unknowns (three), but it is not automatically implied. Unless these constraints yield an independent set of simultaneous linear equations, a unique solution will not exist.
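As a quick numerical illustration of these constraints, the sketch below evaluates the True Skill Score on three hypothetical contingency tables. It uses the standard definition of the score (hit rate minus false alarm rate) and assumes rows index the observed class and columns the forecast class; the function name and the sample counts are illustrative and not taken from GM92.

```python
import numpy as np

def true_skill_score(table):
    """True Skill Score for a 2x2 table (rows: observed class, columns: forecast class)."""
    table = np.asarray(table, dtype=float)
    hit_rate = table[1, 1] / table[1].sum()          # observed 1s that were forecast as 1
    false_alarm_rate = table[0, 1] / table[0].sum()  # observed 0s that were forecast as 1
    return hit_rate - false_alarm_rate

# Hypothetical sample: 80 observed non-events (class 0) and 20 observed events (class 1).
always_0 = [[80, 0], [20, 0]]   # constant forecasts of event 0
always_1 = [[0, 80], [0, 20]]   # constant forecasts of event 1
perfect = [[80, 0], [0, 20]]    # every forecast matches the observation

tss_always_0 = true_skill_score(always_0)
tss_always_1 = true_skill_score(always_1)
tss_perfect = true_skill_score(perfect)
print(tss_always_0, tss_always_1, tss_perfect)
```

For this sample both constant forecasts score 0 and the perfect forecasts score 1, which corresponds to the scale α = 0 and β = 1.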

Relaxing both the assumption of a symmetric scoring matrix and the equality of constant forecasts, the equitability constraints translate to four equations for four unknowns. As such, one might anticipate a unique solution. However, in this note it will be argued that without the assumption of a symmetric scoring matrix (or any other constraint on the scoring matrix) there does not exist a unique measure satisfying these or any set of (four) constraints placed on S. In other words, in most practical situations there is no unique, equitable measure. Or, said differently, there exists no definition of equitability that would yield a unique measure of performance linear in the scoring matrix.

2. Preliminaries

Many measures of performance are defined in terms of the elements of the contingency table. For dichotomous forecasts of two events, labeled as 0 and 1, the contingency table C is
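$$C = \begin{pmatrix} N_{00} & N_{01} \\ N_{10} & N_{11} \end{pmatrix},$$
where $N_{ij}$ is the number of class-$i$ observations that were forecast as class $j$. Note that $N_{i0} + N_{i1}$, denoted $N_{i\cdot}$, is simply the number of observations of the $i$th type, and $N_{0i} + N_{1i} \equiv N_{\cdot i}$ is the number of forecasts of the $i$th type ($i = 0, 1$).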
The joint probability of forecasts, f, and observations, o, is (Murphy and Winkler 1987)
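$$P_{ij} = \frac{N_{ij}}{N_{\cdot\cdot}},$$
where $N_{\cdot\cdot}$ is the total sample size. The two relevant conditional probabilities are the probability of assigning (forecasting) an observed event from the $i$th class ($i = 0, 1$) into the $j$th class ($j = 0, 1$),
$$p(f_j \mid o_i) = \frac{N_{ij}}{N_{i\cdot}} \equiv Q_{ij},$$
and the belief that a class-$i$ event was assigned to class $j$,
$$p(o_i \mid f_j) = \frac{N_{ij}}{N_{\cdot j}} \equiv B_{ij}.$$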
The matrices P, Q, and B are sometimes called the performance matrix, the percent confusion matrix, and the belief matrix, respectively.
The class-conditional risks $R_i(C)$ are given by
$$R_i(C) = \sum_{j=0}^{1} L_{ij}\, Q_{ij},$$
where $L_{ij}$ are the elements of the loss matrix, and Bayes risk (Bishop 1996) is then defined as
$$R(C) = \sum_{i=0}^{1} P_i\, R_i(C), \tag{1}$$
where $P_i = N_{i\cdot}/N_{\cdot\cdot}$ are the a priori (i.e., climatological) probabilities.

Bayes risk and the loss function are quantities central to the analysis of performance, and it is evident that the quantity that GM92 call the expected score is in fact equal to Bayes risk, and what they call the scoring matrix is equal to (the transpose of) the loss matrix. In other words, R = S if the loss matrix L is identified with (the transpose of) the scoring matrix.¹
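The quantities of this section can be computed directly from a contingency table. The following is a minimal sketch, assuming rows index the observed class and columns the forecast class, and taking the Bayes risk to be the prior-weighted sum of the class-conditional risks. The function names, the sample table, and the loss values are hypothetical.

```python
import numpy as np

def performance_matrices(table):
    """Return (P, Q, B) for a table with rows = observed class, columns = forecast class."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()                        # joint probabilities N_ij / N_..
    Q = table / table.sum(axis=1, keepdims=True)   # p(f_j | o_i): each row sums to 1
    B = table / table.sum(axis=0, keepdims=True)   # p(o_i | f_j): each column sums to 1
    return P, Q, B

def bayes_risk(table, loss):
    """Prior-weighted sum of the class-conditional risks R_i = sum_j L_ij Q_ij."""
    table = np.asarray(table, dtype=float)
    loss = np.asarray(loss, dtype=float)
    prior = table.sum(axis=1) / table.sum()        # climatological probabilities P_i
    _, Q, _ = performance_matrices(table)
    class_risk = (loss * Q).sum(axis=1)
    return float(prior @ class_risk)

# Hypothetical table and a nonsymmetric loss matrix:
# a miss (observed 1, forecast 0) costs five times as much as a false alarm.
table = [[70, 10], [5, 15]]
loss = [[0, 1], [5, 0]]

P, Q, B = performance_matrices(table)
risk = bayes_risk(table, loss)
print(risk)  # approximately 0.35 for these numbers
```

The helpers assume that every class is observed and forecast at least once. A constant forecast leaves one column of the table empty, in which case B is undefined for that column while Q and the risk remain well defined.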

3. Theorem

The question asked by GM92 can be asked at a more general level: what loss matrix yields a measure satisfying constraints of the form
$$R\big(C^{(k)}\big) = \alpha^{(k)}, \qquad k = 0, 1, \ldots, n - 1, \tag{2}$$
where $n$ is the number of constraints, and $C^{(k)}$ and $\alpha^{(k)}$ are the contingency table and the value of the measure associated with the $k$th constraint?
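Since the (2 × 2) loss matrix has four degrees of freedom, one would require four equations in order to uniquely solve for $L_{ij}$. For example,
$$\begin{aligned}
R(\text{constant forecasts of event 0}) &= \alpha^{(0)}, \\
R(\text{constant forecasts of event 1}) &= \alpha^{(1)}, \\
R(\text{random forecasts}) &= \alpha^{(2)}, \\
R(\text{perfect forecasts}) &= \alpha^{(3)}
\end{aligned}$$
constitute four such constraints. The special case $\alpha^{(0)} = \alpha^{(1)} = \alpha^{(2)}$ is the one considered by GM92.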
More generally, however, Eqs. (1) and (2) imply
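$$\sum_{i=0}^{1} \sum_{j=0}^{1} P_i\, L_{ij}\, Q^{(k)}_{ij} = \alpha^{(k)}, \qquad k = 0, 1, \ldots, n - 1,$$
where $Q^{(k)}_{ij}$ is the percent confusion matrix for the $k$th constraint, and $n$ is the number of constraints; the a priori probabilities $P_i$ are fixed by the observations and so carry no index $k$. For $n = 4$, this yields
$$\begin{pmatrix}
P_0 Q^{(0)}_{00} & P_0 Q^{(0)}_{01} & P_1 Q^{(0)}_{10} & P_1 Q^{(0)}_{11} \\
P_0 Q^{(1)}_{00} & P_0 Q^{(1)}_{01} & P_1 Q^{(1)}_{10} & P_1 Q^{(1)}_{11} \\
P_0 Q^{(2)}_{00} & P_0 Q^{(2)}_{01} & P_1 Q^{(2)}_{10} & P_1 Q^{(2)}_{11} \\
P_0 Q^{(3)}_{00} & P_0 Q^{(3)}_{01} & P_1 Q^{(3)}_{10} & P_1 Q^{(3)}_{11}
\end{pmatrix}
\begin{pmatrix} L_{00} \\ L_{01} \\ L_{10} \\ L_{11} \end{pmatrix}
=
\begin{pmatrix} \alpha^{(0)} \\ \alpha^{(1)} \\ \alpha^{(2)} \\ \alpha^{(3)} \end{pmatrix}.$$
Noting the identity $Q^{(k)}_{i0} + Q^{(k)}_{i1} = 1$, $\forall k$, it is then straightforward to show that the determinant of the 4 × 4 matrix is zero.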
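The vanishing determinant can be made explicit with a short derivation. This is a sketch under the assumption that row $k$ of the 4 × 4 matrix, written $M$ here, has entries $P_i Q^{(k)}_{ij}$ ordered as $(00, 01, 10, 11)$, with the same $P_0$ and $P_1$ in every row; the symbols $M$ and $v$ are introduced only for this illustration.

```latex
% Assumption: row k of M is
%   ( P_0 Q^{(k)}_{00},  P_0 Q^{(k)}_{01},  P_1 Q^{(k)}_{10},  P_1 Q^{(k)}_{11} ),
% and Q^{(k)}_{i0} + Q^{(k)}_{i1} = 1 for every k.
\begin{align*}
P_1\left(M_{k1} + M_{k2}\right) - P_0\left(M_{k3} + M_{k4}\right)
  &= P_1 P_0\left(Q^{(k)}_{00} + Q^{(k)}_{01}\right)
   - P_0 P_1\left(Q^{(k)}_{10} + Q^{(k)}_{11}\right) \\
  &= P_0 P_1 - P_0 P_1 = 0 \qquad \text{for every } k .
\end{align*}
% Hence M v = 0 with v = (P_1, P_1, -P_0, -P_0)^{T} \neq 0, and therefore \det M = 0.
```

Under the same assumption, adding any multiple of $v$ to $(L_{00}, L_{01}, L_{10}, L_{11})$ leaves every constrained value of $R$ unchanged, which is one concrete way to see why the constraints cannot single out one loss matrix.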

Consequently, the four equations are not linearly independent: depending on the prescribed values of the measure, the system has either no solution or infinitely many, so it cannot determine the loss matrix uniquely. As such, for a general scoring matrix the True Skill Score is no longer uniquely equitable. Therefore, in practical cases where the scoring matrix has no particular symmetry, there exists no unique, equitable score. Note that this is true for any definition of equitability based on four constraints of the aforementioned type [Eq. (2)].
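The result can also be checked numerically. The sketch below builds a 4 × 4 constraint matrix from four randomly drawn percent confusion matrices, assuming each row has the form $P_i Q^{(k)}_{ij}$ with fixed a priori probabilities, and verifies that the matrix is singular. The prior values, the random draws, and the loss vector are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.8, 0.2])  # hypothetical climatological probabilities P_0, P_1

def constraint_row(prior, q00, q11):
    """Coefficients of (L00, L01, L10, L11) in the risk for one confusion matrix."""
    Q = np.array([[q00, 1.0 - q00],
                  [1.0 - q11, q11]])       # each row of Q sums to 1
    return (prior[:, None] * Q).ravel()    # entries P_i * Q_ij in row-major order

# Four arbitrary forecast behaviours, one per constraint.
M = np.vstack([constraint_row(prior, *rng.uniform(0.0, 1.0, size=2)) for _ in range(4)])

rank = np.linalg.matrix_rank(M)
det = np.linalg.det(M)

# A direction in the null space of M: shifting the loss along it changes no risk value.
null_vec = np.array([prior[1], prior[1], -prior[0], -prior[0]])
residual = M @ null_vec

loss = np.array([0.0, 1.0, 5.0, 0.0])   # hypothetical nonsymmetric loss (L00, L01, L10, L11)
shifted = loss + 3.0 * null_vec         # a different loss matrix
same = np.allclose(M @ loss, M @ shifted)
print(rank, same)
```

Two distinct loss matrices reproduce the same four constraint values, so the constraints alone do not pick out a single matrix.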

Another family of risk functions, or performance measures, can be defined in terms of the belief matrix B (instead of the percent confusion matrix Q) (Duda and Hart 1973):
[Equation defining R′ in terms of the belief matrix B; the equation image is not available in this copy.]
but it can be shown that the above theorem applies to R′ as well, because $B_{0i} + B_{1i} = 1$.

4. Discussion

It is important to point out that the above result is not contained in GM92’s findings. The question asked in this article does partially reduce to that of GM92 in the limit $\alpha^{(0)} = \alpha^{(1)} = \alpha^{(2)}$. However, the number of equations and unknowns in this article (i.e., four) is different from that of GM92 (i.e., three), and so there is no smooth limit in which the two questions are related.

The findings herein generalize GM92’s results in that several assumptions made by GM92 are no longer invoked. Specifically, two independent assumptions are made by GM92: 1) $\alpha^{(0)} = \alpha^{(1)}$ and 2) $L_{ij} = L_{ji}$ (i.e., that the loss matrix is symmetric).²

GM92 motivate the first assumption by arguing that it discourages the forecaster from forecasting all observations as one event (i.e., f = 0) or as the other event (i.e., f = 1). It is true that such an assumption would be necessary if the performance measure R behaved like the dotted curve in Fig. 1, wherein a measure is plotted against a quantity, Q (not to be confused with the percent confusion matrix), whose extremes correspond to the constant forecasts, so that the measure takes the values R(f = 0) and R(f = 1) there. Two examples of Q are 1) the percentage of class-1 forecasts that the forecaster has issued and 2) the decision threshold that a forecaster must place on a probabilistic forecast in order to dichotomize the forecasts. More generally, however, the plot of a measure would have a shape similar to the dashed line in Fig. 1, wherein there is a local maximum marking the “optimal performance.” Indeed, in the case of the second example of Q it has been shown that most measures behave as such (Marzban 1998). Of course, the use of such a measure will still influence the forecaster’s judgment through attempts to reach the optimum; however, the limiting values of the measure at the two extremes are no longer of any concern. Hence, it is not necessary for the constant-forecast scores to be equal. It is important to emphasize that a truly equitable measure would behave like the solid curve in Fig. 1 in that its use would not induce the forecaster to significantly alter that judgment.

GM92 motivate the second assumption by emphasizing the issue of accuracy and requiring the generality of the ultimate results. However, in practice it is much more likely that the misclassification costs are unequal for the two classes. For example, the cost associated with missing a tornado is rarely the same as that associated with making a false tornado forecast. Therefore, the second assumption is far too stringent to be of any utility in most practical situations. For these reasons, neither assumption was invoked here.

Acknowledgments

CM is grateful to Gregory Heath of MIT Lincoln Laboratory for an invaluable discussion on Bayes risk. Partial support was provided by the FAA and the NWS/OSF.

REFERENCES

  • Bishop, C. M., 1996: Neural Networks for Pattern Recognition. Clarendon Press, 482 pp.

  • Duda, R. O., and P. E. Hart, 1973: Pattern Classification and Scene Analysis. John Wiley and Sons, 482 pp.

  • Gandin, L. S., and A. Murphy, 1992: Equitable skill scores for categorical forecasts. Mon. Wea. Rev., 120, 361–370.

  • Gerrity, J. P., 1992: A note on Gandin and Murphy’s equitable skill score. Mon. Wea. Rev., 120, 2709–2712.

  • Marzban, C., 1998: Scalar measures of performance in rare-event situations. Wea. Forecasting, 13, 753–763.

  • Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

Fig. 1. The generic behavior of R as a function of Q.

Citation: Monthly Weather Review 127, 6; 10.1175/1520-0493(1999)127<1134:OTUOGA>2.0.CO;2

¹ The contingency table in this article is the transpose of that of GM92.

² As pointed out by GM92, $\alpha^{(0)} = \alpha^{(1)}$ implies $\alpha^{(0)} = \alpha^{(1)} = \alpha^{(2)}$.
