## 1. Introduction

The behavior of performance measures computed from the 2 × 2 contingency table for dichotomous forecasts has received much attention in the literature over the past several decades. An excellent account of previous work is given in Baldwin and Kain (2006, hereafter BK). The primary purpose here is to offer another perspective from which to analyze the behavior of performance measures. As Murphy (1991) points out, the complexity of comparative forecast verification requires a thorough understanding of the measures being computed from the joint distributions of the probabilities involved.

The general approach given in this paper is made specific to address directly the issue of “hedging” as described by Marzban (1998) and others. Here, hedging refers to action on the part of forecasters to exploit the fact that performance measures may be more favorable if bias differs from one. Expanding or contracting a forecast area, or adjusting a threshold above or below what the forecast situation indicates, thereby becomes a tactic assumed to improve forecast scores. Even in the absence of hedging, forecasts rarely have a bias of exactly one. Misconceptions as to bias dependency can lead not only to poorer performance as assessed by the measures, but also to forecasts that may have less value to users. In addition, statistical analysis of comparative performance often requires that bias dependencies be considered (Hamill 1999; Mesinger and Brill 2004).

In addition to assessing sensitivity to hedging, this method may be used to assess the attainment of a goal set for a particular performance measure based on past performance. Usually, in such cases, achievement of the goal assumes an improvement or change in the bias characteristics of future forecasts by calibration or some other means. The method is also applied to analyze bias sensitivity for the geometric model used by BK.

In section 2, the 2 × 2 contingency table is constructed in a form useful for this approach. Section 3 lays out the generalized form for the analysis method. Section 4 applies the analysis method to the most commonly used performance measures. Section 5 applies the method to setting a goal for a performance measure based on past performance and includes a subsection related to assessing bias dependency using a geometric model (BK). Finally, section 6 summarizes and concludes the paper.

## 2. The 2 × 2 contingency table

The 2 × 2 contingency table is constructed from hits (*H*, points at which the event is both forecast and observed), forecasts (*F*) of the event, observations (*O*) of the event, and total points or area (*N*) examined for the occurrence or nonoccurrence of the event. The tabulated values in terms of *H*, *F*, *O*, and *N* may be areas or discrete counts cumulative over either space or time, or both space and time. This form of the table is shown in Table 1a, including the marginal totals. The arrangement of the row and column elements follows Wilks (2006). Table 1b is a transformation of Table 1a obtained by multiplying all *F* and *H* values by (*O*/*O*) = 1 and making the following replacements:

*B* = *F*/*O*,  (1)

*P* = *H*/*O*,  (2)

readily recognized as bias (sometimes denominated as “frequency bias”) and probability of detection (POD), respectively, as defined in Wilks (2006) and elsewhere. The cells are finally normalized by dividing by *N*, and (*O*/*N*) is replaced with event frequency (base rate), *α*. The term “event frequency” is not defined consistently in the literature. For example, Hilliker (2004) refers to the *number* of observations as the “event frequency,” while BK define the event frequency as the fraction of the total verification area covered by the observed event. It is the latter definition that is adopted here. The 2 × 2 contingency table in the form of Table 1b is not unique to this work (e.g., Stephenson et al. 2008).

## 3. Analysis method

A range of mathematical techniques has been used to analyze the 2 × 2 contingency table and performance measures computed from its entries. Doswell et al. (1990), Marzban (1998), and Marzban and Lakshmanan (1999) draw on linear algebra, exploiting the matrix aspects of the 2 × 2 contingency table. Baldwin and Kain (2006) use a geometrical model to analyze measures in a heuristic approach. Standard tools from mathematical statistics have been utilized (e.g., Doswell et al. 1990). Hilliker (2004) makes use of differential calculus to analyze the sensitivity of the threat score to hits. This work applies differential calculus, assuming that the quantities in the contingency table can be represented by continuous functions over the ranges of values allowed for bias, *B*, and POD, *P*.

Using Table 1b, any performance measure may be written in terms of *B*, *P*, and *α*. For any given sample of verification data, the event frequency *α* is independent of the forecasts and is treated as a constant so that performance measures may be written as a function of two variables. Results thereby derived may then be evaluated for different event frequencies.

This approach is analogous to the treatment in classical thermodynamics of a system composed of a constant volume filled with an ideal gas. In the analogy, bias (*B*) is like temperature, POD (*P*) is like pressure, and event frequency is the constant density of the gas. In the gas system, if temperature is increased, pressure increases. Likewise, for the contingency table system, if *B* increases, *P* also increases eventually, if not immediately. The analogy is limited because *P* is defined only in the range of zero to one, and there is no equation of state relating *P* and *B* as there is in thermodynamics relating pressure, *p*, and temperature, *T*. The thermodynamics analogy for any performance measure computed from *B* and *P* with *α* constant is the Gibbs function [defined in most texts on thermodynamics; e.g., Iribarne and Godson (1973)] whose total differential form is expressed in terms of *dp* and *dT*. Although *T* and *p* are related by an equation of state, they are treated mathematically as independent variables. A similar approach is used here, treating *B* and *P* as independent variables.

Treating *α* as a constant parameter, any performance measure, *S*, is expressed as

*S* = *S*(*B*, *P*; *α*).  (3)

The total differential of *S* is

*dS* = (∂*S*/∂*B*)_{P} *dB* + (∂*S*/∂*P*)_{B} *dP*,  (4)

where the subscripts indicate that the partial derivatives are taken with the indicated subscript variable held constant.

Performance measures that increase in value as the forecast improves have a positive orientation (Wilks 2006), while those that decrease with improving performance have a negative orientation. In general, the goal is to determine the conditions for *S* to indicate improvement as *B* is either increased past *β* or decreased below *β*, where *β* = 1 to assess hedging by over- or underforecasting to exploit bias sensitivity. If Eq. (4) is evaluated at *B* = *β* for a positively oriented measure, then *dS* > 0 is required to indicate improvement for one or the other case of bias decreasing or increasing through the value *β*. If a measure is negatively oriented, then *dS* < 0 is required to indicate improvement. There are four cases to consider: 1) *dB* > 0 for a positively oriented measure, 2) *dB* < 0 for a positively oriented measure, 3) *dB* > 0 for a negatively oriented measure, and 4) *dB* < 0 for a negatively oriented measure.

Furthermore, (∂*S*/∂*P*)_{B=β} > 0 is required for a positively oriented measure, where the subscript indicates that the derivative is to be evaluated for *B* = *β*. For a negatively oriented measure, (∂*S*/∂*P*)_{B=β} < 0 is required. In other words, the performance measure moves toward a more favorable value as POD increases. While this condition seems unlikely ever to be violated, it must nevertheless be checked.

The condition for a positively oriented performance measure *S* to improve for the first case is derived in detail, treating the differentials *dB* and *dP* in Eq. (4) as small finite increments:

Δ*S* = (∂*S*/∂*B*)_{B=β} Δ*B* + (∂*S*/∂*P*)_{B=β} Δ*P* > 0.  (5)

It follows that

(∂*S*/∂*P*)_{B=β} Δ*P* > −(∂*S*/∂*B*)_{B=β} Δ*B*.  (6)

Since Δ*B* > 0 and (∂*S*/∂*P*)_{B=β} > 0, the condition for a positively oriented measure *S* to indicate improvement by increasing as bias increases is

Δ*P*/Δ*B* > −(∂*S*/∂*B*)_{B=β}/(∂*S*/∂*P*)_{B=β} ≡ *ρ*.  (7)

The other three cases are derived similarly, paying heed to the sense of the inequality according to the algebraic signs of Δ*B* and (∂*S*/∂*P*)_{B=β}, since dividing both sides of an inequality by a negative quantity reverses the sense of the inequality. The results follow for cases 2, 3, and 4 in Eqs. (8)–(10). In case 2, *dS* > 0, Δ*B* < 0, and (∂*S*/∂*P*)_{B=β} > 0, resulting in

Δ*P*/Δ*B* < *ρ*.  (8)

In case 3, *dS* < 0, Δ*B* > 0, and (∂*S*/∂*P*)_{B=β} < 0, resulting in

Δ*P*/Δ*B* > *ρ*.  (9)

In case 4, *dS* < 0, Δ*B* < 0, and (∂*S*/∂*P*)_{B=β} < 0, resulting in

Δ*P*/Δ*B* < *ρ*.  (10)

Inequalities (7)–(10) state how the POD must change as bias is increased or decreased past *B* = *β*. If this condition is not met or is impossible to meet, the performance measure will not indicate improvement. For most measures, but not all, *ρ* in Eqs. (7)–(10) is a function of *B*, *P*, and *α*, before substituting *B* = *β*. The derivatives required to compute *ρ* in the inequalities above are evaluated for *B* = *β*, with *β* = 1 to assess hedging for otherwise unbiased forecasts. The derivatives can be evaluated for any *B* > 0; therefore, this approach is general.
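For any measure written as *S*(*B*, *P*; *α*), the ratio *ρ* can also be evaluated numerically, without deriving the partial derivatives by hand. A minimal sketch (function names are illustrative; the threat score is used only as a check against its CPR of *P*/(*B* + 1), consistent with the *B* = 1 value of *P*/2 quoted in section 4d):

```python
def cpr(S, B, P, alpha, h=1e-6):
    """Critical performance ratio rho = -(dS/dB)/(dS/dP) at (B, P),
    with alpha held constant, via central differences."""
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    return -dS_dB / dS_dP

def threat_score(B, P, alpha):
    # TS = H/(F + O - H) = P/(B + 1 - P) in Table 1b variables.
    return P / (B + 1.0 - P)

# At B = 1 the TS CPR should be half the POD (see section 4d).
rho = cpr(threat_score, B=1.0, P=0.6, alpha=0.1)
assert abs(rho - 0.3) < 1e-5
```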

Since *α* is constant, Eqs. (1) and (2) may be used to write Δ*P*/Δ*B* as Δ*H*/Δ*F*, clarifying the meaning of Eqs. (7)–(10). The l.h.s. of Eqs. (7)–(10) is the hit fraction for the forecasts added or removed in changing the bias. The performance measure cannot improve if the appropriate one of Eqs. (7)–(10) expresses an impossible condition. If the inequality requires that Δ*H*/Δ*F* be greater than one or less than zero, then the condition is impossible: One cannot add (subtract) more hits than the quantity of additional (removed) forecasts, nor can one have a decrease (an increase) in hits for an increase (a decrease) in forecasts. If *P* = 1 (*P* = 0), then Δ*H*/Δ*F* must be zero if bias is increased (decreased), because all (none) of the observations have been hit already (yet). These constraints are valid within the context intended here: analysis of forecast verification results, addressing under what conditions a performance measure could improve had the forecaster acted so as to expand or contract forecast coverage.

If a positively (negatively) oriented performance measure is either poorly designed or specifically designed for certain restricted conditions, then its partial derivative with respect to bias may be positive (negative) over some portions of the space defined by the ranges of *P*, *B*, and *α*. In such regions, the value of *ρ* will be negative (given that ∂*S*/∂*P* is as assumed above—*S* always responds favorably to increasing *P*), and the performance measure will always indicate improvement for an increase in bias because Δ*H*/Δ*F* will be positive or zero and satisfy Eq. (7) or (9). On the other hand, in such regions, the performance measure cannot indicate improvement for a decrease in bias because Δ*H*/Δ*F* cannot be negative as would be required by Eq. (8) or (10).

With the caveats described in the preceding two paragraphs in mind, the quantity *ρ* in Eq. (7) serves as a critical performance ratio (CPR), which Δ*H*/Δ*F* must exceed for the performance measure to indicate improvement as bias is increased. If bias is decreased, then Δ*H*/Δ*F* must be less than the CPR for the measure to indicate improvement. This is true regardless of the orientation of the performance measure. The higher the CPR value, the more difficult it is to improve the performance measure by *increasing* bias. The higher the CPR value, the easier it is to improve the performance measure by *decreasing* bias.

The steps required to analyze a performance measure, *S*, as bias changes past *B* = *β* are summarized as follows:

1. Express *S* in terms of the contingency table as given in Table 1b, simplifying the result as much as possible.
2. Derive the partial derivatives used in Eqs. (7)–(10).
3. Simplify each partial derivative for *B* = *β*.
4. Make sure that (∂*S*/∂*P*)_{B=β} has the correct algebraic sign according to the orientation of the performance measure.
5. Compute the CPR, *ρ*.
6. Select the appropriate inequality to apply based on the orientation of the measure and the sign of the bias change.
7. The performance measure will not indicate improvement unless Δ*H*/Δ*F* = Δ*P*/Δ*B* is possible. Accept or reject the validity of the condition given particular values of *P* and *α*, using the following axioms, the last two of which are trivial but are included for completeness:
    - (a) 0 ≤ Δ*H*/Δ*F* ≤ 1;
    - (b) Δ*H*/Δ*F* must be zero if *P* = 1 and Δ*B* > 0; and
    - (c) Δ*H*/Δ*F* must be zero if *P* = 0 and Δ*B* < 0.

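The checklist above can be sketched as a small routine for a positively oriented measure. This is an illustrative automation, not code from the paper; the axioms reduce to simple bounds on the attainable values of Δ*H*/Δ*F*:

```python
def can_improve(S, B, P, alpha, dB, h=1e-6):
    """Can a positively oriented measure S possibly improve for a small
    bias change dB past B?  Applies Eqs. (7)-(8) and axioms 7a-7c."""
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    assert dS_dP > 0, "measure is not positively oriented in P"
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    rho = -dS_dB / dS_dP                 # the CPR
    if dB > 0:
        # Need dH/dF > rho, with dH/dF attainable in [0, 1] (axiom 7a)
        # and forced to zero when P = 1 (axiom 7b).
        if P >= 1.0:
            return rho < 0.0
        return rho < 1.0
    # dB < 0: need dH/dF < rho, with dH/dF >= 0 (axioms 7a, 7c).
    return rho > 0.0

def pod(B, P, alpha):
    return P

# POD can be improved by increasing bias (CPR = 0) but never by
# decreasing it, matching section 4a.
assert can_improve(pod, 1.0, 0.5, 0.1, dB=+0.1)
assert not can_improve(pod, 1.0, 0.5, 0.1, dB=-0.1)
```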

## 4. Analysis of performance measures

Except for the extreme dependency score (EDS), the formulas for the performance measures presented in Table 2 are defined by Wilks (2006), where they are expressed in terms of the elements of the contingency table. Appropriate simplifications have been made after using Table 1b. The EDS is defined by Stephenson et al. (2008). Table 2 shows the results of the derivations for all of the performance measures considered here. For complete generality, the partial derivatives with respect to *B* and *P* are shown for each measure, along with the CPR formulas before and after substituting *B* = 1. The performance measures are discussed below in groups of two or more, each with reference to its row in Table 2 and following their order in the table. This order reflects a subjective judgment as to the increasing complexity of their explicit dependence on bias. The meaning of “explicit dependence,” or lack thereof, is discussed in appendix A.

### a. POD and EDS

Both POD and EDS are positively oriented measures. For both, (∂*S*/∂*P*) > 0 is satisfied (0 < *α* < 1 implies ln(*α*) < 0). By Eq. (7), either POD or EDS will improve as bias increases past 1 if Δ*P*/Δ*B* > *ρ* = 0. However, if bias is decreased, then Eq. (8) applies and must be rejected by the axiom given in step 7a above. Neither POD nor EDS can be improved by decreasing bias, but either can be improved by increasing bias if additional hits are made. The EDS was designed for verification of forecasts of rare events (Stephenson et al. 2008), but it is susceptible to hedging. Stephenson et al. (2008) are very careful to advise that bias accompany the EDS. The same advice applies to POD. Neither POD nor EDS has an explicit dependency on bias, yet both are readily amenable to hedging by increasing bias. Referring to the thermodynamics analogy, if there existed an equation of state relating *B*, *P*, and *α*, then this bias dependency would be readily apparent. No such equation exists, but *P* inevitably responds to increasing *B*, hence the sensitivity reflected by the low CPR value, zero, for these two measures. On the other hand, neither POD nor EDS can be hedged by decreasing bias, since Δ*P*/Δ*B* cannot be less than zero.
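The zero-CPR behavior is easy to check numerically. The sketch below assumes the EDS can be written in Table 1b variables as 2 ln(*α*)/ln(*αP*) − 1 (since *H*/*N* = *αP* and *O*/*N* = *α* in the Stephenson et al. 2008 definition); any added hits, however few, raise *P* and improve both measures:

```python
import math

def eds(P, alpha):
    # Extreme dependency score, assumed form 2*ln(alpha)/ln(alpha*P) - 1,
    # written in Table 1b variables (H/N = alpha*P, O/N = alpha).
    return 2.0 * math.log(alpha) / math.log(alpha * P) - 1.0

alpha, P = 0.01, 0.4
# Increasing bias with even a tiny fraction of added hits raises P,
# and both POD (= P itself) and EDS improve: CPR = 0 for both.
P_up = P + 0.001        # e.g., dB = 0.1 with only 1% of additions hitting
assert P_up > P                           # POD improves
assert eds(P_up, alpha) > eds(P, alpha)   # EDS improves
assert eds(1.0, alpha) == 1.0             # perfect score when P = 1
```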

### b. True skill score (TSS) and percent correct (PC)

Both TSS (sometimes called the true skill statistic) and PC are positively oriented performance measures and exhibit explicit bias dependency; however, this dependency disappears in taking the partial derivatives, leaving the CPR dependent only on event frequency for TSS and constant for PC. For both, (∂*S*/∂*P*) > 0 is satisfied. The TSS is particularly vulnerable to hedging by increasing bias as the event frequency and, consequently, the CPR decrease. Although the TSS was identified as “equitable” by Gandin and Murphy (1992), the disadvantages of the TSS have been discussed extensively by Doswell et al. (1990), Marzban and Lakshmanan (1999), and BK. This work confirms these previous admonitions regarding the TSS. In particular, the recommendation by Doswell et al. (1990) to eschew the TSS in verifying forecasts of rare events is well taken.

In most cases, the frequencies of significant events are less than 0.5. In fact, BK characterize an event having a frequency of 0.5 as a “very common” event. The contingency table is usually constructed asymmetrically with the rarer and more meteorologically significant of the two possible outcomes placed in the first column. That being the case, the PC is more difficult to hedge by increasing bias compared to TSS, requiring half of all added forecast area to hit observations regardless of event frequency. The opposite is true for decreasing bias, since the PC improves even if up to half of the removed forecasts are hits.
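The vanishing explicit bias dependence can be verified directly. The sketch below writes TSS and PC in Table 1b variables (forms consistent with the simplifications described above) and checks that their CPRs come out as *α* and 1/2 regardless of *B* and *P*:

```python
def cpr(S, B, P, alpha, h=1e-6):
    # rho = -(dS/dB)/(dS/dP), alpha held constant (central differences)
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    return -dS_dB / dS_dP

def tss(B, P, alpha):
    # True skill score = POD - POFD = (P - alpha*B)/(1 - alpha)
    return (P - alpha * B) / (1.0 - alpha)

def pc(B, P, alpha):
    # Percent correct = (hits + correct negatives)/N
    return alpha * P + (1.0 - alpha * (1.0 + B - P))

for a in (0.01, 0.1, 0.3):
    for p in (0.2, 0.6):
        assert abs(cpr(tss, 1.0, p, a) - a) < 1e-6    # TSS CPR = alpha
        assert abs(cpr(pc, 1.0, p, a) - 0.5) < 1e-6   # PC CPR = 1/2
```

The loop makes the point of the text concrete: as the event frequency shrinks, the TSS asks a vanishingly small fraction of added forecasts to be hits, while the PC always asks for half.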

### c. Probability of false detection (POFD) and false alarm ratio (FAR)

The POFD is also known as the false alarm rate. The FAR is sometimes confused with the false alarm rate and mistakenly denoted as such (e.g., Barnes et al. 2007). Both Doswell et al. (1990) and BK are careful to document the proper distinction between these two. POFD and FAR are negatively oriented and have explicit bias dependence. For both, (∂*S*/∂*P*) < 0 is satisfied as expected for negatively oriented measures (see Table 2).

The POFD has CPR = 1, impossibly demanding Δ*P*/Δ*B* > 1. It is, therefore, impossible to improve POFD by increasing bias according to axiom 7a in section 3. This stands to reason since increasing bias cannot reduce the number of forecasts for which the event is not observed. On the other hand, POFD is easily hedged by decreasing bias, since Δ*P*/Δ*B* ≤ 1 always; in fact, forecasting no events at all yields a perfect value of zero for POFD.

The FAR has no explicit dependency on event frequency. The behavior of FAR in this analysis method resembles that of the threat score (TS) discussed below, but FAR is more difficult to hedge by increasing bias past one than is TS. To improve (decrease) FAR by increasing bias past one requires that the fraction of added forecasts that are hits be greater than the POD; improving the TS at *B* = 1 requires a hit fraction of only half the POD. However, for decreasing bias, FAR is easier to hedge than TS, because up to a fraction of the relinquished area equivalent to the POD can be composed of hits, and FAR will still improve (decrease) as bias decreases below one. Example calculations are given in Table 3 and are discussed in more detail in the next subsection.
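The FAR–TS comparison at *B* = 1 (CPR of *P* versus *P*/2) can be spot-checked numerically with the same finite-difference helper (a sketch; names are illustrative):

```python
def cpr(S, B, P, alpha, h=1e-6):
    # rho = -(dS/dB)/(dS/dP), alpha held constant (central differences)
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    return -dS_dB / dS_dP

def far(B, P, alpha):
    # False alarm ratio = (F - H)/F = (B - P)/B (negatively oriented)
    return (B - P) / B

def ts(B, P, alpha):
    # Threat score = P/(B + 1 - P)
    return P / (B + 1.0 - P)

P = 0.6
assert abs(cpr(far, 1.0, P, 0.1) - P) < 1e-5        # FAR CPR = P at B = 1
assert abs(cpr(ts, 1.0, P, 0.1) - P / 2.0) < 1e-5   # TS CPR = P/2 at B = 1
```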

### d. Threat score (TS), equitable threat score (ETS), and Heidke skill score (HSS)

The TS and ETS are very commonly used measures of performance, with TS having a longer history than ETS (BK). Unlike the ETS, the TS is not explicitly dependent on event frequency (see appendix A). In this analysis, the HSS behaves identically to the ETS; therefore, all discussion pertinent to ETS applies to HSS as well. In fact, Schaefer (1990) shows that HSS = 2ETS/(1 + ETS). Also, for unbiased forecasts (*B* = 1), regardless of POD, the HSS is equivalent to the TSS, a fact easily obtained using the formulas in Table 2. Conditional equivalence of the HSS and TSS is noted in Doswell et al. (1990). Despite this equivalence, the CPR values for HSS and TSS are very different for *B* = 1. The TS, ETS, and HSS are positively oriented. From Table 2, it is clear that (∂*S*/∂*P*) > 0 for the TS. In addition, (∂*S*/∂*P*) > 0 is true for ETS and HSS as demonstrated in appendix B.
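The Schaefer (1990) relation and the *B* = 1 equivalence with the TSS can both be confirmed numerically. The sketch below writes the standard count-based Heidke formula in the normalized Table 1b cells (an illustration, not code from the paper):

```python
def ets(B, P, alpha):
    # Equitable threat score in Table 1b variables
    return (P - alpha * B) / (B + 1.0 - P - alpha * B)

def hss_from_counts(B, P, alpha):
    # Standard Heidke skill score from the normalized table cells
    a = alpha * P                    # hits
    b = alpha * (B - P)              # false alarms
    c = alpha * (1.0 - P)            # misses
    d = 1.0 - alpha * (1.0 + B - P)  # correct negatives
    return 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

def tss(B, P, alpha):
    # True skill score = (P - alpha*B)/(1 - alpha)
    return (P - alpha * B) / (1.0 - alpha)

B, P, alpha = 1.3, 0.55, 0.08
E = ets(B, P, alpha)
# Schaefer (1990): HSS = 2*ETS/(1 + ETS)
assert abs(hss_from_counts(B, P, alpha) - 2.0 * E / (1.0 + E)) < 1e-10
# For unbiased forecasts (B = 1), HSS equals TSS regardless of POD:
assert abs(hss_from_counts(1.0, P, alpha) - tss(1.0, P, alpha)) < 1e-10
```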

Figure 1 displays contours of CPR for TS on a portion of the POD–bias plane. The blank portion in Fig. 1 is the region where POD is impossibly greater than the bias. For a fixed value of POD, as bias increases, the TS becomes steadily easier to improve by increasing bias, because lower values of CPR are encountered in moving horizontally from left to right parallel to the bias axis. Going horizontally from right to left in Fig. 1, the threat score becomes easier to improve by *decreasing* bias. Moving vertically along a line of constant bias toward higher values of POD results in increasing values of CPR and greater difficulty (ease) in improving the TS by increasing (decreasing) bias. Along the vertical line identified as “Bias = 1,” CPR values are exactly one-half of the POD values.

For the ETS and, consequently, the HSS, the CPR is explicitly dependent on event frequency. Figure 2 displays contours of the CPR on a portion of the POD–bias plane for two event frequencies, 0.025 (Fig. 2a) and 0.25 (Fig. 2b), which would be characterized by BK as frequencies for rare and relatively common events, respectively. For rare events, there is not much difference between the CPR values for ETS and TS. From Table 2, it is apparent that the CPR for ETS reduces to that for TS for the limiting case of *α* approaching zero. Figure 2b shows that the CPR values increase for the higher event frequency, especially at lower values of POD. Since the orientation of the contour lines is the same in both panels of Fig. 2 as in Fig. 1, the changes of CPR along a line of constant POD or constant bias are similar to that described for the TS.

Figure 3a shows the ETS CPR for *B* = 1 as a function of event frequencies up to 0.5. As event frequency increases, ETS becomes more difficult to hedge by increasing bias as already demonstrated in Fig. 2. The ETS becomes more difficult to hedge as POD increases, but this tendency lessens as the event frequency approaches 0.5, where CPR = 0.5 for all values of POD. This agrees with the findings of BK that “more than 50% of the additional forecast area resulting from an increase in *B* must be correct” for ETS to increase when the event frequency is 0.5. The ETS CPR values continue to increase for event frequencies higher than 0.5, but are not shown in Fig. 3a. As stated previously, meteorologically significant events typically have event frequencies of much less than 0.5.

Table 3 presents specific calculations to demonstrate the relative behavior of FAR, TS, and ETS. Table 3 displays values of *B*, *P*, and the performance measure value, *S*, before and after an assumed bias change. Since FAR is negatively oriented, higher values of FAR indicate a degradation in the forecast. Rows 1–3 in Table 3 assume a modest 10% increase in bias past one with 30% of the added forecasts being hits.

Under these conditions, only the threat score indicates an improved forecast after the bias change, as expected since 0.3 exceeds the CPR value only for the TS. If 60% of the added forecasts are hits as in rows 4–6, then all three measures of performance indicate improvement, since 0.6 exceeds the CPR value for all three. The last six rows in Table 3 demonstrate the case of a 10% reduction in bias. If 30% of the removed forecasts have been hits, then only the TS indicates a degradation in performance for this change, since only for TS is 0.3 not less than the CPR as required by Eq. (8) or (10) for a decrease in bias. In rows 10–12 in Table 3, 60% of forecasts removed in decreasing the bias have been hits, and all three performance measures indicate a degraded forecast as expected, since 0.6 is greater than the CPR for all three.
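The mechanics behind such a table can be sketched with hypothetical values (not the actual entries of Table 3): a bias increase Δ*B* with hit fraction Δ*H*/Δ*F* moves *P* by (Δ*H*/Δ*F*)Δ*B*, and for a comfortably separated example each measure improves exactly when the CPR test predicts. The ETS CPR expression below is a form consistent with the *B* = 1 limits quoted in section 4d (→ *P*/2 as *α* → 0; 0.5 at *α* = 0.5):

```python
def ts(B, P):   return P / (B + 1.0 - P)
def far(B, P):  return (B - P) / B
def ets(B, P, alpha):
    return (P - alpha * B) / (B + 1.0 - P - alpha * B)

# Hypothetical starting point (not from Table 3 of the paper):
B, P, alpha = 1.0, 0.5, 0.1
dB, hit_frac = 0.1, 0.3             # 10% bias increase, 30% of additions hit
B2, P2 = B + dB, P + hit_frac * dB  # dP/dB = dH/dF

# CPR values at B = 1: TS: P/2, FAR: P, ETS: assumed closed form
cpr_ts, cpr_far = P / 2.0, P
cpr_ets = (P * (1.0 - 2.0 * alpha) + alpha) / (2.0 * (1.0 - alpha))

# Positively oriented TS and ETS improve iff hit_frac exceeds the CPR;
# negatively oriented FAR improves (decreases) under the same test.
assert (ts(B2, P2) > ts(B, P)) == (hit_frac > cpr_ts)
assert (ets(B2, P2, alpha) > ets(B, P, alpha)) == (hit_frac > cpr_ets)
assert (far(B2, P2) < far(B, P)) == (hit_frac > cpr_far)
```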

### e. Clayton skill score (CSS) and odds ratio skill score (ORSS)

Both the CSS and the ORSS are positively oriented measures of performance, with explicit dependence on bias being more complex for the ORSS than for the CSS. Here, (∂*S*/∂*P*) > 0 is always true for the CSS, because (1 − *αB*) > 0, since *αB* is the fraction of the verification domain covered by the forecast area and must be less than 1. Conditions for (∂*S*/∂*P*) > 0 for the ORSS are discussed in appendix B.

Figure 3b shows the CSS CPR for *B* = 1 as a function of event frequency up to the value 0.5. Comparison with Fig. 3a shows that the CSS is a bit easier to hedge by increasing bias than the ETS as event frequency increases for very low values of POD, but the CPR for CSS increases much more rapidly with increasing POD than for ETS. Therefore, for higher values of POD, the CSS is more difficult to hedge by increasing bias beyond one than ETS, for typical event frequencies of 0.5 or less.

The ORSS CPR for *B* = 1 is displayed in Fig. 3c. A singularity occurs at (*α* = 0.5, *P* = 0), resulting in termination of the contours near that point. The ORSS is easier to hedge than both ETS and CSS at very low POD as event frequency increases toward 0.5, but the CPR for ORSS increases more rapidly with increasing POD than does the CPR for ETS. Compared to the CSS, the ORSS CPR values increase less rapidly with increasing POD. For values of POD greater than about 0.2, the CSS is more difficult to hedge by increasing bias past one than either ORSS or ETS in the event frequency range up to about 0.25.
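These orderings can be spot-checked numerically. The sketch below assumes the standard forms CSS = *H*/*F* − (*O* − *H*)/(*N* − *F*) and ORSS = (*θ* − 1)/(*θ* + 1) with *θ* the odds ratio (definitions as in Wilks 2006 and Stephenson 2000), written in Table 1b variables, with the CPR obtained by finite differences:

```python
def cpr(S, B, P, alpha, h=1e-6):
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    return -dS_dB / dS_dP

def css(B, P, alpha):
    # Clayton skill score: H/F - (O - H)/(N - F)
    return P / B - alpha * (1.0 - P) / (1.0 - alpha * B)

def ets(B, P, alpha):
    return (P - alpha * B) / (B + 1.0 - P - alpha * B)

def orss(B, P, alpha):
    # Odds ratio skill score: (theta - 1)/(theta + 1)
    theta = (P * (1.0 - alpha * (1.0 + B - P))) / (alpha * (B - P) * (1.0 - P))
    return (theta - 1.0) / (theta + 1.0)

# At B = 1, high POD, modest event frequency: the CSS has the largest
# CPR of the three, i.e., it is the hardest to hedge by increasing bias.
B, P, alpha = 1.0, 0.7, 0.1
assert cpr(css, B, P, alpha) > cpr(orss, B, P, alpha)
assert cpr(css, B, P, alpha) > cpr(ets, B, P, alpha)
```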

## 5. Applications examples

### a. Analysis addressing a performance requirement

The National Centers for Environmental Prediction’s (NCEP) Hydrometeorological Prediction Center (HPC) has a mandated goal of cumulative annual TS = 0.29 for its quantitative precipitation forecasts (QPF) of 24-h accumulations of precipitation exceeding the 1-in. threshold at projection hour 24 (day 1) over the continental United States (CONUS). This mandate is set under the requirements of the Government Performance and Results Act (GPRA) of 1993. This GPRA goal is set to increase to 0.30 for fiscal year (FY) 2010 (1 October 2009 – 30 September 2010) as of this writing. For FY 2007, the HPC’s TS value was 0.3060 with POD = 0.5035 and bias = 1.149. There is no GPRA mandate for bias. The HPC 24-h QPF may be more useful to users if its bias were closer to 1.

The questions are: Could HPC forecasters reduce their bias to 1.0 and still meet the 2010 GPRA goal of 0.30, and how easy would that be? Once forecasters are aware of their bias, they can make a conscious effort to reduce it. An assumption is needed on how forecasters will perform in effectuating their bias reduction. It may be reasonable to assume that the performance will remain on or very near the relative operating characteristics (ROC) curve associated with their biased forecasts. It is also assumed that the event frequency will remain approximately that computed from the FY 2007 sample, for which *α* = 0.01552. The ROC curve traces POD as a function of POFD (false alarm rate). This curve may be parameterized in terms of the odds ratio as described by Stephenson (2000). Figure 4 shows this ROC curve for the day 1 HPC 24-h QPF exceeding the 1-in. threshold, with annotation showing the point corresponding to the 2007 performance. Staying on this parameterized ROC curve means that the HPC forecasters will have the same ratio of the odds of a hit to the odds of a false alarm for the bias-corrected forecast as for the original biased forecast. In other words, the odds ratio and, consequently, the ORSS will be conserved. A new value for *P* could be computed using the formula for ORSS or odds ratio with *B* = 1, but that requires solving a quadratic equation and deciding which of two solutions to use. Computing CPR values offers not only a shortcut to the calculations, but also yields additional insight as to what fraction of the forecasts relinquished in reducing bias can be hits.

Using the penultimate column in Table 2, the CPR values for both TS and ORSS are computed for the biased HPC forecasts. Since the required bias change is negative, if the CPR value for the ORSS is less than that for TS, then conserving the odds ratio will improve the TS, and a decision in favor of attempting the bias reduction is easy to make under that assumption. For the actual HPC forecasts, the ORSS CPR is 0.2812, and the TS CPR is 0.2343 (see Fig. 1). The ORSS CPR is not less than that for the TS; therefore, the TS value will decrease if bias is decreased to one with the ORSS conserved. If the bias is reduced to one, the ORSS is conserved if the accompanying Δ*P*/Δ*B* exactly equals the ORSS CPR. Since the change in bias is known (−0.149), a new value of *P* is obtained: *P*′ = *P* + (Δ*B*) (ORSS CPR). The changes associated with the bias reduction are given in the last column in Table 4. The ORSS improves slightly even though it was assumed to be conserved. This is because using the CPR to compute *P*′ for a finite change is an approximation and will not conserve the ORSS exactly. As a result, POFD improves by decreasing. But POD decreases, and the new TS value of 0.3001 for the unbiased forecasts is less than that for the biased forecasts, but is still slightly greater than the future GPRA goal. Based on this analysis, it may be reasonable for HPC forecasters to attempt to reduce the bias of the day 1 24-h QPF 1-in. threshold, thereby improving the quality of the forecasts while still meeting the 2010 GPRA goal. The fact that the 0.0003 increase in the ORSS exceeds the 0.0001 difference between the new TS value and the 2010 GPRA goal suggests a cautious approach. If it is attempted, performance must be monitored carefully, since a shift to a less favorable ROC curve may occur.
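The shortcut calculation can be reproduced from the quantities quoted above, with the ORSS CPR obtained by finite differences (the odds-ratio form of ORSS is assumed, as in section 4e; this is a sketch of the arithmetic, not the operational code):

```python
def orss(B, P, alpha):
    # Odds ratio skill score in Table 1b variables; theta is the odds ratio.
    theta = (P * (1.0 - alpha * (1.0 + B - P))) / (alpha * (B - P) * (1.0 - P))
    return (theta - 1.0) / (theta + 1.0)

def cpr(S, B, P, alpha, h=1e-6):
    # rho = -(dS/dB)/(dS/dP) by central differences
    dS_dB = (S(B + h, P, alpha) - S(B - h, P, alpha)) / (2 * h)
    dS_dP = (S(B, P + h, alpha) - S(B, P - h, alpha)) / (2 * h)
    return -dS_dB / dS_dP

# FY 2007 HPC day-1 QPF performance at the 1-in. threshold (from the text):
B, P, alpha = 1.149, 0.5035, 0.01552
ts_old = P / (B + 1.0 - P)             # ~0.3060

rho_orss = cpr(orss, B, P, alpha)      # ~0.2812, as quoted in the text
rho_ts = P / (B + 1.0)                 # TS CPR: ~0.2343

# Reduce bias to one, approximately conserving the odds ratio:
dB = 1.0 - B                           # -0.149
P_new = P + dB * rho_orss              # P' = P + (dB)(ORSS CPR)
ts_new = P_new / (2.0 - P_new)         # TS at B = 1

assert abs(ts_old - 0.3060) < 1e-3
assert abs(rho_orss - 0.2812) < 1e-3
assert abs(rho_ts - 0.2343) < 1e-3
assert abs(ts_new - 0.3001) < 1e-3     # still (barely) above the 0.30 goal
```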

On a more cynical note, HPC forecasters can stay comfortably ahead of the GPRA goal by increasing bias, provided the added forecast area has a hit rate exceeding the TS CPR. The bias sensitivity of the TS suggests that management performance goals should include bounds for bias in addition to the goal for TS.

### b. Bias dependency in a geometric model (BK)

The geometric model of BK was used to assess placement error and bias dependency. The contingency table elements convey no direct information about error in the relative placement of forecast and observed areas or points. All other things being the same, a change in the relative position of a forecast and observed area effects a change in the POD. Exactly how POD changes as displacement error increases or decreases depends on the shape and orientation of the forecast and observed areas. Shapes and orientations also influence how POD changes if bias changes. In other words, the geometry is important. The derivation of the analytic method presented in section 2 makes no geometrical assumptions; however, the method can be applied if geometrical assumptions are made.

BK express *P* as a function of *α*, *B*, and *D*, where *D* is the displacement error. Invoking the thermodynamics analogy, BK impose an equation of state on the 2 × 2 contingency table system:

*P* = *a*(*α*, *B*, *D*)/*α*,  (11)

where the function *a* is that given by BK’s Eq. (2) with *r*_{o} (radius of the observed circle in the BK model) replaced with √(*α*/*π*) for a verification domain of unit total area. In Eq. (11), *α* is constant, allowing assessment of performance measure behavior as a function of bias and displacement error for a given event frequency. If Eq. (11) is differentiated with respect to *B* (*D* is independent of *B*), an analytic expression for the Δ*P*/Δ*B* imposed by the assumed geometry is obtained. Wherever the Δ*P*/Δ*B* based on Eq. (11) exceeds the CPR for a particular positively oriented performance measure, an increase in that measure must occur as bias increases. Conversely, if the CPR exceeds the analytic Δ*P*/Δ*B*, the positively oriented measure must decrease with increasing bias.

Figure 5 shows two overlaid contoured fields computed using the BK geometric model applied to the ETS for the common event frequency (*α* = 0.28). The dark contours show the CPR for ETS in the *B*–*D*′ plane, where *D*′ is the normalized displacement error as described by BK. The gap in the upper-right corner respects the assumption of unit area for the verification domain, and the gap in the lower-right respects axiom 7b in section 3 above requiring that Δ*P*/Δ*B* = 0 for increasing *B* when *P* = 1 regardless of the CPR value, implying no possibility for the ETS to increase with increasing bias under that condition; therefore, ETS must decrease as shown in BK’s Fig. 3d. The lightly shaded contours depict the field obtained by subtracting the CPR field from the analytic Δ*P*/Δ*B* field. Where the contours of this difference field indicate positive (negative) values, ETS must increase (decrease) with increasing bias (see Fig. 3d in BK). The difference field zero contour matches the position of the axis of maximum ETS marked by the dashed line in BK’s Fig. 3d. Thus, wherever the Δ*P*/Δ*B* imposed by the geometric model exceeds (is less than) the CPR, the ETS increases (decreases) as bias increases as expected and depicted in BK’s Fig. 3d.

If there are no conditions imposed on Δ*P*/Δ*B*, the dark contours in Fig. 5 are curves along which hedging the ETS by increasing bias is of constant difficulty in terms of the fraction of added forecasts that must hit in order to increase the ETS. In the lower-left part of Fig. 5, the centers of the forecast and observed circles are close enough that expanding the forecast circle radius produces a ratio of new hit area to new forecast area exceeding the ETS CPR. Therefore, the assumed geometry dictates that ETS will increase with increasing bias in the lower-left quadrant of Fig. 5, where the light gray contours indicate positive values, and the geometry makes it impossible to hedge the ETS outside of that area. The simple geometric model of BK serves well in demonstrating two important points: 1) the geometry of certain situations can make the ETS (and other performance measures) more sensitive to bias than other geometrical configurations do, and 2) conclusions about bias sensitivity based on a specific analytic geometric model must be interpreted carefully and may not apply in general.
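The comparison underlying Fig. 5 can be reproduced in miniature. The following is a minimal numerical sketch, assuming a unit-area domain, the standard two-circle lens-overlap formula for the hit area, the ETS written as (*P* − *αB*)/(*B* + 1 − *P* − *αB*), and central differences for the derivatives; the function names are illustrative, and `d` here is the raw center distance rather than BK’s normalized displacement *D*′:

```python
import math

def circle_overlap(R, r, d):
    """Intersection area of two circles with radii R and r and center distance d."""
    if d >= R + r:
        return 0.0
    if d <= abs(R - r):                      # one circle entirely inside the other
        return math.pi * min(R, r) ** 2
    # standard lens-area formula for partial overlap
    a1 = r * r * math.acos((d * d + r * r - R * R) / (2.0 * d * r))
    a2 = R * R * math.acos((d * d + R * R - r * r) / (2.0 * d * R))
    a3 = 0.5 * math.sqrt((-d + r + R) * (d + r - R) * (d - r + R) * (d + r + R))
    return a1 + a2 - a3

def pod(B, d, alpha):
    """POD implied by the geometric model: hit area over observed area."""
    r_obs = math.sqrt(alpha / math.pi)       # observed circle has area alpha
    r_fc = math.sqrt(B * alpha / math.pi)    # forecast circle has area B*alpha
    return circle_overlap(r_obs, r_fc, d) / alpha

def ets(P, B, alpha):
    """ETS in terms of POD (P), bias (B), and event frequency (alpha)."""
    return (P - alpha * B) / (B + 1.0 - P - alpha * B)

def cpr_ets(P, B, alpha, h=1e-6):
    """CPR = -(dS/dB)/(dS/dP), evaluated by central differences."""
    dS_dB = (ets(P, B + h, alpha) - ets(P, B - h, alpha)) / (2.0 * h)
    dS_dP = (ets(P + h, B, alpha) - ets(P - h, B, alpha)) / (2.0 * h)
    return -dS_dB / dS_dP

def model_dP_dB(B, d, alpha, h=1e-6):
    """dP/dB imposed by the geometry at fixed displacement d."""
    return (pod(B + h, d, alpha) - pod(B - h, d, alpha)) / (2.0 * h)

alpha = 0.28
# Lower left of Fig. 5: forecast circle inside the observed one, so every
# added forecast point is a hit and dP/dB = 1 exceeds the CPR.
print(model_dP_dB(0.5, 0.05, alpha) > cpr_ets(pod(0.5, 0.05, alpha), 0.5, alpha))  # True
# Observed circle fully covered (P = 1): dP/dB = 0 falls below the CPR, so
# the ETS must decrease as bias increases further.
print(model_dP_dB(1.2, 0.0, alpha) < cpr_ets(pod(1.2, 0.0, alpha), 1.2, alpha))  # True
```

The two checks reproduce the qualitative behavior described above: hedging is possible only where the geometric Δ*P*/Δ*B* exceeds the CPR.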

## 6. Concluding remarks

This work derives the general form for computing a critical performance ratio (CPR) for any performance measure computed from the entries in the 2 × 2 contingency table for dichotomous forecasts. The CPR varies between zero and one, inclusive. If bias is increased, the CPR specifies the minimum fraction of the added forecast area in which the event must occur for the performance measure to indicate improvement; if bias is decreased, it specifies the maximum fraction of the removed forecast area in which the event occurs for which the performance measure still indicates improvement. The CPR may be a constant or a function of one or more of the independent variables of choice for this formulation: bias and POD, with event frequency as a constant parameter. The CPR for a performance measure serves as an indicator of how easy it is to “hedge” the performance measure toward a more favorable value by changing bias.
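This interpretation of the CPR can be demonstrated with a discrete example. The sketch below, which assumes the threat score written from counts as H/(F + O − H) (equivalently P/(B + 1 − P) in POD and bias) and computes the CPR by central differences, shows that adding forecasts with a hit fraction above the CPR improves the score while a hit fraction below it degrades the score; the counts are illustrative:

```python
def ts_counts(H, F, O):
    """Threat score from raw counts: hits / (hits + false alarms + misses)."""
    return H / (F + O - H)

def ts(P, B):
    """Threat score in terms of POD (P) and bias (B)."""
    return P / (B + 1.0 - P)

def cpr_ts(P, B, h=1e-6):
    """CPR = -(dS/dB)/(dS/dP) by central differences."""
    dS_dB = (ts(P, B + h) - ts(P, B - h)) / (2.0 * h)
    dS_dP = (ts(P + h, B) - ts(P - h, B)) / (2.0 * h)
    return -dS_dB / dS_dP

# Base forecast: 100 observed events, 100 forecast points, 60 hits (B = 1, P = 0.6)
base = ts_counts(60, 100, 100)
threshold = cpr_ts(0.6, 1.0)             # CPR for this (P, B)
# Expand the forecast area by 20 points (bias increases to 1.2):
above = ts_counts(60 + 8, 120, 100)      # hit fraction 8/20 = 0.40 > CPR
below = ts_counts(60 + 4, 120, 100)      # hit fraction 4/20 = 0.20 < CPR
print(round(threshold, 3), above > base, below < base)  # 0.3 True True
```

The score improves only when the fraction of added forecasts that hit exceeds the CPR, which is exactly the hedging threshold described above.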

The CPR is derived for a number of performance measures, documented in Table 2, and analyzed graphically for those that exhibit more complex dependency on POD, bias, and/or event frequency. All performance measures considered are shown to be sensitive to either increasing or decreasing bias or, more commonly, both. In addition to analyzing performance measures, the CPR is shown to be useful for developing strategies for meeting requirements as measured by a single performance measure, assuming future performance is characterized by the same relative operating characteristic (ROC) curve as current performance. The CPR is also applied to demonstrate that the bias dependencies induced by a simple geometric model (BK) are consistent with those expected from CPR values computed in the model domain.

Finally, the work undertaken here is not intended to identify any one particular performance measure as unconditionally better or worse than the others. Rather, it is intended to provide a general, rather simple, tool to analyze performance measures. Evaluation of the performance of dichotomous forecasts is best done by including bias along with one or more performance measures and a thorough understanding of their relative strengths and weaknesses. The needs of the users of the forecasts ultimately determine the sensitivity to bias, and that also must inform the selection of the performance measures.

## Acknowledgments

I am grateful to Dr. Michael Baldwin and Dr. Fedor Mesinger for the many interesting and educational conversations and e-mail exchanges related to bias sensitivity of performance measures for dichotomous forecasts over the past several years. Their respective reviews of the draft of this paper were very helpful. I am also grateful for the positive comments and helpful suggestions of the formal reviewers. Funding from NCEP/HPC to provide computational tools and to cover the publication costs is much appreciated.

## REFERENCES

Baldwin, M. E., and J. S. Kain, 2006: Sensitivity of several performance measures to displacement error, bias, and event frequency. *Wea. Forecasting*, **21**, 636–648.

Barnes, L. R., E. C. Gruntfest, M. H. Hayden, D. M. Schultz, and C. Benight, 2007: False alarms and close calls: A conceptual model of warning accuracy. *Wea. Forecasting*, **22**, 1140–1147.

Doswell, C. A., III, R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. *Wea. Forecasting*, **5**, 576–585.

Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. *Mon. Wea. Rev.*, **120**, 361–370.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hilliker, J. L., 2004: The sensitivity of the number of correctly forecasted events to the threat score: A practical application. *Wea. Forecasting*, **19**, 646–650.

Iribarne, J. V., and W. L. Godson, 1973: *Atmospheric Thermodynamics*. D. Reidel, 222 pp.

Marzban, C., 1998: Scalar measures of performance in rare-event situations. *Wea. Forecasting*, **13**, 753–763.

Marzban, C., and V. Lakshmanan, 1999: On the uniqueness of Gandin and Murphy’s equitable performance measures. *Mon. Wea. Rev.*, **127**, 1134–1136.

Mesinger, F., and K. Brill, 2004: Bias normalized precipitation scores. Preprints, *17th Conf. on Probability and Statistics*, Seattle, WA, Amer. Meteor. Soc., J12.6. [Available online at http://ams.confex.com/ams/pdfpapers/69561.pdf.]

Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. *Mon. Wea. Rev.*, **119**, 1590–1601.

Schaefer, J. T., 1990: The critical success index as an indicator of warning skill. *Wea. Forecasting*, **5**, 570–575.

Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. *Wea. Forecasting*, **15**, 221–232.

Stephenson, D. B., B. Casati, C. A. T. Ferro, and C. A. Wilson, 2008: The extreme dependency score: A non-vanishing measure for forecasts of rare events. *Meteor. Appl.*, **15**, 41–50.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. Academic Press, 630 pp.

# APPENDIX A

## Discussion of Explicit Dependency

A performance measure whose formula is written in terms of a particular set of parameters is said here to be *not explicitly dependent on* the parameter or parameters not appearing in the formula, allowing for the fact that a different form of the contingency table can reveal an explicit dependency.

*Explicit dependency* or lack thereof is determined by the formulation of the contingency table. For example, the threat score (TS) formula given in Table 2 using the contingency Table 1b exhibits no explicit dependency of TS on event frequency, *α*. However, if Table 1a is normalized by *N* and rewritten in terms of hit frequency (*h* = *H*/*N*), forecast frequency (*f* = *F*/*N*), and event frequency (*α* = *O*/*N*), then the formula for TS becomes

TS = *h*/(*f* + *α* − *h*). (A1)

Equation (A1) shows that the TS cannot be declared independent of event frequency. In the Table 2 formula, the TS dependence on event frequency is implicit in its bias dependency.
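The equivalence of the two formulations, assuming the Table 2 counts form TS = H/(F + O − H), can be checked numerically; the counts below are illustrative:

```python
# Illustrative counts: H hits, F forecasts, O observed events, N total points
H, F, O, N = 42, 90, 70, 500
h, f, alpha = H / N, F / N, O / N     # hit, forecast, and event frequencies

ts_counts = H / (F + O - H)           # counts form: alpha does not appear
ts_freq = h / (f + alpha - h)         # Eq. (A1) form: alpha appears explicitly
print(abs(ts_counts - ts_freq) < 1e-12)  # True: same score, different appearance
```

The two expressions differ only by dividing numerator and denominator by *N*, so the explicit *α* dependence in Eq. (A1) is a consequence of the chosen formulation, not a different score.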

In this work the odds ratio skill score (ORSS) is shown to be dependent on and sensitive to bias, contradicting Stephenson (2000), who states that the odds ratio (the positively oriented basis for the ORSS—the ratio of the statistical odds of a hit to the odds of a false alarm) is “independent of any bias between the observations and the forecasts.” Stephenson (2000) uses a formulation of the contingency table that conceals the bias dependency of the odds ratio, expressing it directly in terms of POD and POFD. From the forecast verification point of view adopted here, it is relatively easy to see that odds ratio and ORSS may be sensitive to bias: if a forecast area is expanded without change in the observed area (bias increases) and without additional hits, then the odds of a false alarm increase and the ORSS decreases. Perhaps Stephenson (2000) intended a different meaning for “independent,” since it is possible for the ORSS to remain constant as bias changes, provided the POD and/or event frequency change appropriately. Further examination of this issue is beyond the scope of this work.
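The thought experiment in this paragraph is easy to verify numerically. The sketch below uses the standard counts form of the odds ratio skill score (Yule's Q), ORSS = (HZ − FA·M)/(HZ + FA·M) with Z the number of correct negatives; the counts are illustrative:

```python
def orss(H, F, O, N):
    """Odds ratio skill score (Yule's Q) from raw contingency counts."""
    FA = F - H              # false alarms
    M = O - H               # misses
    Z = N - O - FA          # correct negatives
    return (H * Z - FA * M) / (H * Z + FA * M)

base = orss(120, 200, 200, 1000)     # B = 1
hedged = orss(120, 250, 200, 1000)   # forecast area expanded, no new hits
print(hedged < base)  # True: the ORSS decreases as bias increases
```

Expanding the forecast area without additional hits converts correct negatives into false alarms, raising the odds of a false alarm and lowering the ORSS, exactly as argued above.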

# APPENDIX B

## Demonstrations of (∂S/∂P) > 0 for ETS and ORSS

For the ETS, demonstrating (∂*S*/∂*P*) > 0 requires

1 + *B* − 2*αB* > 0. (B1)

Two cases are considered (although case 2 is not typically realized in practice):

1. *α* ≤ 1/2: Eq. (B1) is rewritten as *B*(1 − 2*α*) > −1. Regardless of the value of *B* > 0, the latter is always true for this case, supporting the truth of Eq. (B1).
2. 1/2 < *α* < 1: Eq. (B1) is rewritten as *B*(2*α* − 1) < 1. The largest possible value of *B* occurs if the forecast covers the entire domain, so that *B* = *F*/*O* = 1/*α*. Substituting this into the inequality results in 2 − (1/*α*) < 1. The latter is always true for this case.

The two cases are sufficient to demonstrate that Eq. (B1) is always true for 0 < *α* < 1.
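The case analysis can be cross-checked numerically. The sketch below assumes the ETS written as (*P* − *αB*)/(*B* + 1 − *P* − *αB*), for which ∂*S*/∂*P* has the sign of 1 + *B* − 2*αB*, and scans the admissible region 0 < *α* < 1, 0 < *B* ≤ 1/*α*:

```python
def ets(P, B, alpha):
    """ETS in terms of POD (P), bias (B), and event frequency (alpha)."""
    return (P - alpha * B) / (B + 1.0 - P - alpha * B)

def dets_dP(P, B, alpha, h=1e-7):
    """Numerical dS/dP by central differences."""
    return (ets(P + h, B, alpha) - ets(P - h, B, alpha)) / (2.0 * h)

ok = True
for i in range(1, 100):
    alpha = i / 100.0
    for j in range(1, 101):
        B = (j / 100.0) / alpha                  # B ranges over (0, 1/alpha]
        ok &= (1.0 + B - 2.0 * alpha * B > 0.0)  # the Eq. (B1) inequality
        P = 0.5 * min(1.0, B)                    # a physically admissible POD
        ok &= (dets_dP(P, B, alpha) > 0.0)       # derivative sign agrees
print(ok)  # True
```

Both the inequality and the numerically evaluated derivative stay positive everywhere on the grid, consistent with the two cases above.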

For the ORSS, (∂*S*/∂*P*) > 0 is demonstrated analytically only for the case *B* = 1 and *α* ≤ 1/2, not for the general case, which can be handled computationally by evaluating *Y* in Table 2. The case of *B* = 1 requires

(*P* − 1)(*P* − 2*α* + 1) < 0. (B3)
Since 0 ≤ *P* < 1 (*P* = 1 → ∂*S*/∂*P* = 0), the first factor is always less than zero; therefore, the second factor must be greater than zero for Eq. (B3) to hold true. The inequality of interest becomes *P* > 2*α* − 1. This is satisfied for *α* ≤ 1/2, but not necessarily for *α* > 1/2. Since even the most common events typically have frequencies less than 0.5, this condition on *α* for the ORSS is not particularly restrictive.
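The sign condition can be probed numerically. The sketch below writes the *B* = 1 contingency entries in normalized form (hits *αP*, false alarms and misses both *α*(1 − *P*), correct negatives 1 − 2*α* + *αP*), which follows from the normalized table at unit bias; the checks for *α* > 1/2 at low POD are formal algebra checks only, since such combinations are not typically realized in practice:

```python
def orss_b1(P, alpha):
    """ORSS at B = 1 with normalized contingency entries (unit total)."""
    a = alpha * P                   # hits
    b = alpha * (1.0 - P)           # false alarms (= misses when B = 1)
    d = 1.0 - 2.0 * alpha + a       # correct negatives
    return (a * d - b * b) / (a * d + b * b)

def dorss_dP(P, alpha, h=1e-6):
    """Numerical dS/dP by central differences."""
    return (orss_b1(P + h, alpha) - orss_b1(P - h, alpha)) / (2.0 * h)

print(dorss_dP(0.5, 0.3) > 0)    # True: alpha <= 1/2, derivative positive
print(dorss_dP(0.41, 0.7) > 0)   # True: P > 2*alpha - 1 = 0.4
print(dorss_dP(0.39, 0.7) < 0)   # True: P < 2*alpha - 1 (formal check only)
```

The derivative changes sign exactly at *P* = 2*α* − 1, in agreement with the inequality derived above.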

Table 1. (a) Contingency table in terms of *H*, *F*, *O*, and *N*. See text for details. (b) Normalized contingency table in terms of *B*, *P*, and *α*.

Table 2. Performance measure (PM) formulas, partial derivatives with respect to *B* and *P*, CPR, and CPR for bias equal to one. The second column (Or) refers to the orientation (see text). All other symbols and abbreviations are defined in the text.

Table 3. Example calculations of FAR, TS, and ETS after bias increases or decreases from one. The event frequency for ETS is 0.25. The last column indicates whether or not the PM indicates improvement.

Table 4. Changes of POD (*P*), bias (*B*), TS, ORSS, and POFD for deliberate bias change under the assumption of approximate conservation of ORSS. See text for details.