## 1. Introduction

Performance measures or skill scores are often required to be “equitable” in that their use must not induce the forecasters to make forecasts that differ from their best judgments. Gandin and Murphy (1992), hereafter referred to as GM92, considered measures that are linear in the scoring matrix and derived the constraints that must be placed on the scoring matrix in order to assure the equitability of the measure. In the two-event case, if in addition to the derived constraints the scoring matrix is assumed symmetric, then the number of constraints is equal to the number of elements of the scoring matrix. This, in turn, allows for the determination of a unique scoring matrix and, consequently, a unique, equitable measure—the True Skill Score (or Kuipers’s performance index, among other names). Gerrity (1992) expands on the multicategory considerations of GM92 and finds a closed formula for a symmetric, equitable scoring matrix in terms of the marginal probabilities of the various categories.

However, in most practical situations the scoring matrix is not symmetric. This occurs not only when the two events have different a priori (climatological) probabilities, but also when the cost (or loss) associated with a false alarm is different from that of a miss. Therefore, to check for the existence of a unique, equitable measure, GM92’s analysis must be reexamined without assuming that the scoring matrix is symmetric. It is, in fact, possible to generalize further: GM92 also assume that constant forecasts of two events must be assigned equal scores (e.g., zero). However, this assumption is too restrictive and so it, too, can be relaxed (see the discussion section).

GM92 asked the following question: does there exist a measure, *S,* satisfying the following three constraints: 1) the value of *S* for constant forecasts of the first event is *α,* 2) the value of *S* for constant forecasts of the second event is also *α,* and 3) the value of *S* for perfect forecasts is *β,* where *α* and *β* are constants defining a scale for the score? They defined such a measure as equitable and showed that in the two-event case if the scoring matrix is further assumed to be symmetric, then there is a unique measure that satisfies these constraints. The uniqueness of the solution may be anticipated based on the number of constraints (three) and unknowns (three), but it is not automatically implied. Unless these constraints yield an independent set of simultaneous linear equations, a unique solution will not exist.

Relaxing both the assumption of a symmetric scoring matrix and the equality of constant forecasts, the equitability constraints translate to four equations for four unknowns. As such, one might anticipate a unique solution. However, in this note it will be argued that without the assumption of a symmetric scoring matrix (or any other constraint on the scoring matrix) there does not exist a unique measure satisfying these or any set of (four) constraints placed on *S.* In other words, in most practical situations there is no unique, equitable measure. Or, said differently, there exists no definition of equitability that would yield a unique measure of performance linear in the scoring matrix.
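For contrast with the asymmetric case analyzed below, the symmetric case can be checked directly. The following sketch sets up the three constraints, constant forecasts of either event scoring *α* and perfect forecasts scoring *β*, as a linear system for the three independent elements of a symmetric 2 × 2 scoring matrix; the climatological probabilities and the scale (*α* = 0, *β* = 1) are illustrative choices, not values from the text. The determinant of the system is nonzero, so a unique solution exists.

```python
# Sketch: uniqueness of an equitable score for a SYMMETRIC 2x2 scoring
# matrix.  The probabilities p0, p1 and the scale (alpha, beta) are
# hypothetical, illustrative values.
from fractions import Fraction

p0, p1 = Fraction(7, 10), Fraction(3, 10)  # a priori probabilities, p0 + p1 = 1
alpha, beta = Fraction(0), Fraction(1)     # scale: no-skill and perfect score

# Unknowns: (S00, S01, S11), with S10 = S01 by symmetry.
# Rows: constant forecasts of event 0, of event 1, and perfect forecasts.
M = [[p0, p1, 0],   # p0*S00 + p1*S01 = alpha
     [0, p0, p1],   # p0*S01 + p1*S11 = alpha
     [p0, 0, p1]]   # p0*S00 + p1*S11 = beta

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

d = det3(M)
print(d)   # p0*p1 = 21/100: nonzero, so the solution is unique
```

Symbolically the determinant is *p*₀²*p*₁ + *p*₀*p*₁² = *p*₀*p*₁, which is nonzero for any nondegenerate climatology; this is the independence that the symmetry assumption buys.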

## 2. Preliminaries

In the two-event case, the contingency table *C* is

$$C = \begin{pmatrix} N_{00} & N_{01} \\ N_{10} & N_{11} \end{pmatrix},$$

where *N*_{ij} is the number of times an event of the *i*th class was observed and forecast as belonging to the *j*th class. Note that *N*_{i0} + *N*_{i1}, represented by *N*_{i·}, is simply the sample size of the *i*th *observation,* and *N*_{0i} + *N*_{1i} ≡ *N*_{·i} is the number of *forecasts* of the *i*th type (*i* = 0, 1).

The joint probability distribution of forecasts, *f,* and observations, *o,* is (Murphy and Winkler 1987)

$$P_{ij} = \frac{N_{ij}}{N_{\cdot\cdot}},$$

where *N*_{··} is the total sample size. The two relevant conditional probabilities are the probability of assigning (forecasting) an observed event from the *i*th class (*i* = 0, 1) into the *j*th class (*j* = 0, 1),

$$p(f = j \,|\, o = i) = \frac{N_{ij}}{N_{i\cdot}} \equiv Q_{ij},$$

that is, the probability that an observed *i* event was assigned to class *j,* and

$$p(o = i \,|\, f = j) = \frac{N_{ij}}{N_{\cdot j}} \equiv B_{ij}.$$

*P, Q,* and *B* are sometimes called the performance matrix, the percent confusion matrix, and the belief matrix, respectively.

The conditional risks *R*_{i}(*C*) are given by

$$R_i(C) = \sum_j L_{ij}\, Q_{ij},$$

where *L*_{ij} are the elements of the loss matrix, and Bayes risk (Bishop 1996) is then defined as

$$R(C) = \sum_i P_i\, R_i(C) = \sum_{i,j} P_i\, L_{ij}\, Q_{ij},$$

where *P*_{i} = *N*_{i·}/*N*_{··} are the a priori (i.e., climatological) probabilities.
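As a concrete illustration of these definitions, the sketch below builds *P,* *Q,* and *B* from a hypothetical 2 × 2 contingency table and evaluates Bayes risk for an arbitrary loss matrix; the counts and losses are invented for illustration only.

```python
# Sketch of the definitions above for a hypothetical 2x2 contingency
# table N[i][j] (observations i in rows, forecasts j in columns).
from fractions import Fraction

N = [[40, 10],   # N00, N01
     [5, 45]]    # N10, N11
Ndd = sum(sum(row) for row in N)               # N.. : total sample size
Ni_ = [sum(N[i]) for i in range(2)]            # Ni. : observations of type i
N_j = [N[0][j] + N[1][j] for j in range(2)]    # N.j : forecasts of type j

P = [[Fraction(N[i][j], Ndd) for j in range(2)] for i in range(2)]     # joint
Q = [[Fraction(N[i][j], Ni_[i]) for j in range(2)] for i in range(2)]  # p(f=j|o=i)
B = [[Fraction(N[i][j], N_j[j]) for j in range(2)] for i in range(2)]  # p(o=i|f=j)

# Rows of Q and columns of B are probability distributions:
assert all(sum(Q[i]) == 1 for i in range(2))
assert all(B[0][j] + B[1][j] == 1 for j in range(2))

# Bayes risk for an illustrative loss matrix L: R = sum_ij Pi * Lij * Qij.
# Since Pi * Qij = Pij, this equals sum_ij Lij * Pij.
L = [[0, 1],   # hypothetical losses: a miss costs 4x a false alarm
     [4, 0]]
Pi = [Fraction(Ni_[i], Ndd) for i in range(2)]
R = sum(Pi[i] * L[i][j] * Q[i][j] for i in range(2) for j in range(2))
assert R == sum(L[i][j] * P[i][j] for i in range(2) for j in range(2))
print(R)   # 3/10
```

The final assertion checks the identity *P*_{i}*Q*_{ij} = *P*_{ij}, which is what allows Bayes risk to be written directly in terms of the joint distribution.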

Bayes risk and the loss function are quantities central to the analysis of performance, and it is evident that the quantity that GM92 call the expected score is in fact equal to Bayes risk, and what they call the scoring matrix is equal to (the transpose of) the loss matrix. In other words, *R* = *S* if the loss matrix *L* is identified with (the transpose of) the scoring matrix.^{1}

## 3. Theorem

The question posed in this note is the following: does there exist a measure *R* satisfying a set of constraints of the form

$$R(C^{(k)}) = \alpha^{(k)}, \qquad k = 0, 1, \ldots, n - 1,$$

where *n* is the number of constraints, and *C*^{(k)} and *α*^{(k)} are the contingency table and the value of the measure associated with the *k*th constraint? Each such constraint is a linear equation in the elements *L*_{ij}. For example, the contingency tables corresponding to constant forecasts of either event, random forecasts, and perfect forecasts constitute four such constraints. The special case *α*^{(0)} = *α*^{(1)} = *α*^{(2)} is the one considered by GM92.

In terms of the conditional probabilities, the constraints can be written as

$$\sum_{i,j} P_i\, L_{ij}\, Q^{(k)}_{ij} = \alpha^{(k)},$$

where *Q*^{(k)}_{ij} is the conditional probability matrix associated with the *k*th constraint, and *n* is the number of constraints. For *n* = 4, this yields a system of four linear equations for the four unknowns *P*_{i}*L*_{ij}, whose 4 × 4 coefficient matrix has the flattened *Q*^{(k)} as its rows. Noting the identity *Q*^{(k)}_{i0} + *Q*^{(k)}_{i1} = 1 for all *k,* every row of this matrix is orthogonal to the vector (1, 1, −1, −1), and it is then straightforward to show that the determinant of the 4 × 4 matrix is zero.

Consequently, the system of four equations and four unknowns is underdetermined. As such, for a general scoring matrix the true skill score is no longer uniquely equitable. Therefore, in practical cases where the scoring matrix has no particular symmetry, there exists no unique, equitable score. Note that this is true for any definition of equitability based on the four constraints of the aforementioned type [Eq. (2)].
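The vanishing of the determinant can be verified numerically. In the sketch below, each flattened row (*Q*_{00}, *Q*_{01}, *Q*_{10}, *Q*_{11}) = (*a,* 1 − *a,* *b,* 1 − *b*) is orthogonal to (1, 1, −1, −1), so any four such rows span at most a three-dimensional space. The particular probabilities used are arbitrary illustrative values.

```python
# Sketch: the 4x4 coefficient matrix formed from any four conditional-
# probability matrices Q(k) is singular, because every flattened row
# (Q00, Q01, Q10, Q11) = (a, 1-a, b, 1-b) is orthogonal to (1, 1, -1, -1).
from fractions import Fraction
from itertools import permutations

def det4(m):
    """Determinant of a 4x4 matrix via the Leibniz formula."""
    total = Fraction(0)
    for perm in permutations(range(4)):
        sign = 1                      # parity of the permutation
        for x in range(4):
            for y in range(x + 1, 4):
                if perm[x] > perm[y]:
                    sign = -sign
        prod = Fraction(1)
        for r in range(4):
            prod *= m[r][perm[r]]
        total += sign * prod
    return total

def row(a, b):
    """Flattened 2x2 conditional-probability matrix; each row sums to one."""
    return [a, 1 - a, b, 1 - b]

# Four constraint tables: constant forecasts of either event, random
# forecasts with an (arbitrary) frequency of 3/10, and perfect forecasts.
F = Fraction
M = [row(F(1), F(1)),           # constant forecasts of event 0
     row(F(0), F(0)),           # constant forecasts of event 1
     row(F(3, 10), F(3, 10)),   # random forecasts
     row(F(1), F(0))]           # perfect forecasts

print(det4(M))   # 0: the system for the L_ij is underdetermined

# The same holds for ANY four such rows, e.g. arbitrary probabilities:
M2 = [row(F(1, 2), F(1, 7)), row(F(2, 3), F(4, 5)),
      row(F(1, 9), F(3, 4)), row(F(5, 6), F(1, 3))]
assert det4(M2) == 0
```

Note that the singularity does not depend on which four contingency tables define the constraints; it is forced by the normalization of the rows of *Q* alone, which is the content of the theorem.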

Finally, note that if the measure is defined in terms of the belief matrix **B** instead of **Q**, then the determinant vanishes for the corresponding measure *R*′ as well, because *B*_{0i} + *B*_{1i} = 1.

## 4. Discussion

It is important to point out that the above result is not contained in GM92’s findings. The question asked in this article does partially reduce to that of GM92 in the limit *α*^{(0)} = *α*^{(1)} = *α*^{(2)}. However, the number of equations and unknowns in this article (i.e., four) is different from that of GM92 (i.e., three), and so there is no smooth limit in which the two questions are related.

The findings herein generalize GM92’s results in that several assumptions made by GM92 are no longer invoked. Specifically, two independent assumptions are made by GM92: 1) *α*^{(0)} = *α*^{(1)} and 2) *L*_{ij} = *L*_{ji} (i.e., that the loss matrix is symmetric).^{2}

GM92 motivate the first assumption by arguing that it precludes the forecaster from over- or underforecasting all observations as one event (i.e., *f* = 0) or the other event (i.e., *f* = 1). It is true that such an assumption would be necessary if the performance measure *R* behaved like the dotted curve in Fig. 1, wherein the behavior of a measure is plotted against a quantity, *Q,* that at its extremes coincides with *R*(*f* = 0) and *R*(*f* = 1). Two examples of *Q* are 1) the percentage of class-1 forecasts that the forecaster has issued and 2) the decision threshold that a forecaster must place on a probabilistic forecast in order to dichotomize the forecasts. More generally, however, the plot of a measure would have a shape similar to the dashed line in Fig. 1, wherein there is a local maximum marking the “optimal performance.” Indeed, in the case of the second *Q* example it has been shown that most measures behave as such (Marzban 1998). Of course, the use of such a measure will still influence the forecaster’s judgment via his attempts to reach the optimum critical point; however, the asymptotic limits of the measure are no longer of any concern. Hence, it is not necessary for the constant-forecast scores to be equal. It is important to emphasize that a truly equitable measure would behave like the solid curve in Fig. 1 in that its use would not induce the forecaster to significantly affect her judgment.

GM92 motivate the second assumption by emphasizing the issue of accuracy and requiring the generality of the ultimate results. However, in practice it is much more likely that the misclassification costs are unequal for the two classes. For example, the cost associated with missing a tornado is rarely the same as that associated with making a false tornado forecast. Therefore, the second assumption is far too stringent to be of any utility in most practical situations. For these reasons, neither assumption was invoked here.

## Acknowledgments

CM is grateful to Gregory Heath of MIT Lincoln Laboratory for an invaluable discussion on Bayes risk. Partial support was provided by the FAA and the NWS/OSF.

## REFERENCES

Bishop, C. M., 1996: *Neural Networks for Pattern Recognition.* Clarendon Press, 482 pp.

Duda, R. O., and P. E. Hart, 1973: *Pattern Classification and Scene Analysis.* John Wiley and Sons, 482 pp.

Gandin, L. S., and A. Murphy, 1992: Equitable skill scores for categorical forecasts. *Mon. Wea. Rev.,* **120,** 361–370.

Gerrity, J. P., 1992: A note on Gandin and Murphy's equitable skill score. *Mon. Wea. Rev.,* **120,** 2709–2712.

Marzban, C., 1998: Scalar measures of performance in rare-event situations. *Wea. Forecasting,* **13,** 753–763.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.,* **115,** 1330–1338.