
Bayesian Probability and Scalar Performance Measures in Gaussian Models

1 National Severe Storms Laboratory, Cooperative Institute for Mesoscale Meteorological Studies, and Department of Physics, University of Oklahoma, Norman, Oklahoma

Abstract

The transformation of a real, continuous variable into an event probability is reviewed from the Bayesian point of view, after which a Gaussian model is employed to derive an explicit expression for the probability. In turn, several scalar (one-dimensional) measures of performance quality and reliability diagrams are computed. It is shown that if the optimization of scalar measures is of concern, then prior probabilities must be treated carefully, whereas no special care is required for reliability diagrams. Specifically, since a scalar measure gauges only one component of performance quality—a multidimensional entity—it is possible to find the critical value of prior probability that optimizes that scalar measure; this value of “prior probability” is often not equal to the “true” value as estimated from group sample sizes. Optimum reliability, however, is obtained when prior probability is equal to the estimate based on group sample sizes. Exact results are presented for the critical values of “prior probability” that optimize the fraction correct, the true skill statistic, and the reliability diagram, but the critical success index and the Heidke skill statistic are treated only graphically. Finally, an example based on surface air pressure data is employed to illustrate the results in regard to precipitation forecasting.

Corresponding author address: Dr. Caren Marzban, National Severe Storms Laboratory, 1313 Halley Circle, Norman, OK 73069.

marzban@gump.nssl.noaa.gov


Introduction

Frequently, one is faced with the task of “transforming” a variable (e.g., surface air pressure, gate-to-gate velocity difference, etc.) into a probability for a corresponding event (e.g., precipitation, tornado, etc.). A related problem is that of the performance of the forecaster, that is, the accuracy of the forecasts or the reliability of the generated probabilities (Murphy 1993, 1996; Murphy et al. 1989; Murphy and Winkler 1987, 1992; Wilks 1995).

Several subtleties arise in both problems. For instance, in forming forecast probabilities, it is important to consider the correct conditional probability, namely, the probability of an event, given the observation of a variable, and not the converse (Brooks and Doswell 1995; Murphy and Winkler 1987, 1992). The relationship among the various conditional probabilities is given by Bayes’ theorem (Kendall and Stuart 1969; O’Hagan 1994). Also, in assessing the performance of the forecaster, not only must the correct probabilities be assessed, but it is also important to acknowledge the multidimensionality of performance itself. For instance, it is entirely possible that one forecaster will outperform another in terms of a specific measure of performance, but not in terms of another measure.1

The “worst” measures are scalar (one-dimensional) and nonprobabilistic, while the “best” are multidimensional and probabilistic with admixtures also possible. However, sometimes the particular aspect of performance that is of interest is unambiguously specified (by funders, for example), in which case one may concentrate on only the corresponding scalar measure and effectively treat performance as a one-dimensional entity. There are also times when one has no choice but to appeal to a scalar measure; for instance, in deciding the winner of a forecasting contest, enforcing the multidimensionality of performance may lead to several winners—one for each dimension of performance. It is possible that a forecaster could outperform all others in terms of all the components of performance, but such situations are neither guaranteed nor likely. In this article, although the framework for forming forecasts is intrinsically probabilistic, primary consideration is given to scalar, nonprobabilistic measures based on a contingency table (Wilks 1995). Reliability diagrams will also be considered, but a complete treatment of multidimensional and probabilistic measures will be postponed to a later article.

One nonprobabilistic scalar measure that appears to satisfy most of the requirements that one could place on such measures (Marzban 1997, submitted to Wea. Forecasting) is Heidke’s skill statistic (HSS). A measure that is intimately related (Doswell et al. 1990) to HSS is the true skill statistic (TSS). Another popular measure in meteorology is the critical success index (CSI) (Donaldson et al. 1975), its popularity notwithstanding its “inequitability” (Gandin and Murphy 1992) in that its values for random guessing and persistence are unequal; in fact, technically, CSI is not even a measure of skill since it does not take into account either of these two factors. Schaefer (1990) also addresses CSI in the rare-event situation. Finally, there is the most notorious of measures, namely, the fraction correct (FRC), which, in spite of its numerous inadequacies, is still in common use. Its inclusion in the present study serves only as a point of contrast.

As for probabilistic and multidimensional measures, by virtue of being multidimensional, they cannot be represented by a single number, and one must appeal to multidimensional means, for example, two-dimensional diagrams, to represent such quantities. One example is the reliability diagram (Wilks 1995), which is the one that will be discussed here as an example of a probabilistic, multidimensional measure. Multicategory reliability diagrams have also been considered (Hamill 1997).

We begin with a review of the Bayesian approach of transforming a real, continuous variable into the posterior probability of an event, given an observation. Then, a Gaussian model is employed to allow for an explicit computation of the probabilities and several measures of accuracy.2 It is shown that by virtue of being scalar measures and also depending on prior probability, it is possible to “free” the prior probabilities from their“true” values, as estimated from the group sample sizes, and instead set them to critical values that maximize a given measure. These critical values of “prior probability” will be computed for the above-mentioned scalar measures. Exact results are found for FRC, TSS, and reliability diagrams, but CSI and HSS lend themselves mostly to an approximate (graphic) treatment.

Probabilities and Bayes’ theorem

In a probabilistic approach to forecasting, conditional probabilities play an important role (Brooks and Doswell 1995; Murphy and Winkler 1987), and so the conditions under which one obtains the probability of a given event must be carefully specified. For instance, consider the situation where there are only two possible hypotheses (e.g., tornado and no tornado, or rain and no rain, etc.), generically labeled as “1” and “0” for the existence of event and no event, respectively. Then, the probability of making an observation of a quantity x (e.g., wind speed, temperature, etc.), given the hypothesis, is a completely different quantity than the probability of a hypothesis being in effect when x is measured. The former is sometimes called the likelihood, and the latter is the posterior probability of an event, given the measurement x; the two probabilities are related through Bayes’ theorem (Kendall and Stuart 1969; O’Hagan 1994). It is this posterior probability that is of interest when a forecast is made, since x is measured first and one is then interested in the probability of the hypothesis that gave rise to that value of x.

In the two-group case, generally labeled “0” and “1,” Bayes’ theorem states that
$$P_1(x) = \frac{p_1\,L_1(x)}{p_0\,L_0(x) + p_1\,L_1(x)}, \tag{1}$$
where p0, p1 are the prior probabilities for the two groups and L0(x), L1(x) are the likelihood functions, while P1(x) is the posterior probability that hypothesis 1 is in effect when x is measured. Of course, p0 + p1 = 1 and P0 + P1 = 1.
How does one compute these probabilities from the data on x? First, one simply plots two histograms (i.e., a frequency plot)—one for the x values corresponding to the nonevents (0’s) and another for the events (1’s). These frequencies can be labeled as N0(x) and N1(x), respectively. The likelihoods are then defined as
$$L_i(x) = \frac{N_i(x)}{N_i},$$
where Ni (i = 0, 1) are the respective sample sizes. Clearly, the sum of all the observations in each histogram is equal to the total number of nonevents (N0) and events (N1), respectively, in the sample. Consequently, the sum of the observations under the L0(x) and L1(x) plots is equal to 1:
$$\sum_x L_0(x) = \sum_x L_1(x) = 1.$$
In other words, the likelihoods are simply normalized histograms. As for the prior probabilities, their estimates are given by the sample sizes as
$$p_0 = \frac{N_0}{N}, \qquad p_1 = \frac{N_1}{N},$$
where N = N0 + N1. The posterior probabilities are then computed from (1). This completes the transformation of an observation, x, into a posterior probability of a corresponding event.
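For concreteness, this entire recipe (histograms, normalized likelihoods, priors from the sample sizes, and the posterior from (1)) can be sketched in a few lines of Python. The function and array names, and the bin count, are illustrative choices, not part of the original analysis:

```python
import numpy as np

def posterior_from_histograms(x_event, x_nonevent, x_new, bins=30):
    """Estimate the posterior P1(x_new) of Eq. (1) from two raw samples."""
    n1, n0 = len(x_event), len(x_nonevent)
    p1, p0 = n1 / (n1 + n0), n0 / (n1 + n0)      # priors from sample sizes
    edges = np.histogram_bin_edges(
        np.concatenate([x_event, x_nonevent]), bins)
    # Normalized histograms serve as the likelihoods L1(x) and L0(x)
    L1 = np.histogram(x_event, edges)[0] / n1
    L0 = np.histogram(x_nonevent, edges)[0] / n0
    # Locate the bin containing the (scalar) new observation
    i = int(np.clip(np.searchsorted(edges, x_new) - 1, 0, bins - 1))
    den = p0 * L0[i] + p1 * L1[i]
    return p1 * L1[i] / den if den > 0 else float("nan")
```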

As will be shown, the value of p1 = N1/N does not necessarily imply optimum performance when performance is gauged in terms of scalar, categorical measures. One aim of this article is to derive the critical values of prior probability, p1c, that optimize a given measure.

A Gaussian model

There are many reasons (Wilks 1995) for assuming a parametric form for the likelihoods, Li(x), and to proceed to estimate the parameters from the sample data. A most common ansatz is that of normality, that is, that the single variable x is distributed in a normal or Gaussian fashion:
$$L_i(x) = \frac{1}{\sigma_i\sqrt{2\pi}}\,\exp\!\left[-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right], \tag{2}$$
where μi and σi are the mean and the standard deviation for the variable x, in group i (i = 0, 1), which are all estimated from the sample data. Figures 1a and 1b show some generic, Gaussian likelihood functions for two groups.

As stated in section 1, sometimes one is interested in nonprobabilistic measures such as CSI, TSS, or HSS, which are meaningful only for categorical forecasts. Probabilistic forecasts can always be reduced to categorical ones by the introduction of one or more thresholds. For example, in regard to Fig. 1a, a value of x can be assigned to (or forecasted as) the group with the higher likelihood. Then, the value of x at which the two curves intersect would be the natural threshold marking the boundary between the two groups. However, as discussed in section 1, this is the “wrong” probability to consider. A measurement x must be classified into the group with the higher posterior probability. Figure 1 is still instructive in that it allows for the interpretation of prior probability as a “generalized” threshold. As will be shown below, the qualifier generalized serves two functions. In the special case where the two groups have equal standard deviations, prior probability simply shifts the threshold away from the one at the crossing point of the two curves in Fig. 1a to the crossing point of the curves p0L0(x) and p1L1(x). The other sense in which prior probabilities represent generalized thresholds arises in the more general case where the group standard deviations are unequal; in that case, p0L0(x) and p1L1(x) cross not at one threshold but at two thresholds, marking the boundaries between the two groups. Here, the decision boundary is nonlinear. The effect follows simply from the relevant equations and will be shown below.

In obtaining measures of performance for categorical forecasts, as mentioned previously, the decision criterion must be based on P1(x)/P0(x) or, equivalently, log[P1(x)/P0(x)]. In a Gaussian model, it follows from (1) and (2) that
$$\log\frac{P_1(x)}{P_0(x)} = D_2(x),$$
where the so-called discriminant function, D2(x), is given by
$$D_2(x) = \log\frac{p_1}{p_0} + \log\frac{\sigma_0}{\sigma_1} + \frac{(x - \mu_0)^2}{2\sigma_0^2} - \frac{(x - \mu_1)^2}{2\sigma_1^2}. \tag{3}$$
Recall that the means and the standard deviations are estimated from the sample data, and then an observation, x (either from the same data or independent data), is assigned to group 1 if D2(x) > 0 [i.e., P1(x) > P0(x)]; otherwise it is classified (forecast) as a 0 [i.e., P0(x) > P1(x)].3 Also note that this larger-posterior-probability rule implies the unique threshold of 0.5 for posterior probability because P0(x) + P1(x) = 1. In other words, as far as posterior probability is concerned, a threshold other than 50% should not be considered. Therefore, the root(s) of (3) are the threshold(s), and they depend on p1. [Recall that these roots correspond to the values of x where the quantities p0L0(x) and p1L1(x) intersect.] It is clear that any scalar measure of performance, through its dependence on these root(s), also depends on p1. One can then find the critical value of p1 that optimizes a given measure.
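The decision rule transcribes directly into code; a minimal sketch of (3) and the larger-posterior-probability rule, with the parameter estimates assumed to be already in hand:

```python
import math

def D2(x, mu0, sigma0, mu1, sigma1, p1):
    """Discriminant function of Eq. (3): the log posterior odds for group 1."""
    p0 = 1.0 - p1
    return (math.log(p1 / p0) + math.log(sigma0 / sigma1)
            + (x - mu0) ** 2 / (2.0 * sigma0 ** 2)
            - (x - mu1) ** 2 / (2.0 * sigma1 ** 2))

def classify(x, **params):
    """Forecast 1 iff D2(x) > 0, i.e., P1(x) > P0(x).

    Ties (D2 = 0) may be broken at random (footnote 3); here they default to 0.
    """
    return 1 if D2(x, **params) > 0.0 else 0
```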

There are two cases that must be treated separately: the “linear” case, if σ0 = σ1, and the “quadratic” case, where σ0 ≠ σ1. The reason for these terms will be discussed below.

Linear discriminant function

In the fortunate situation where σ0 = σ1 = σ (i.e., if the data are so-called homoscedastic), the discriminant function becomes linear4 in x, and there is, then, only the one root (threshold)

$$t_{\text{linear}} = \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0}\,\log\frac{p_0}{p_1}. \tag{4}$$
As mentioned previously, the number of crossing points (1 or 2; excluding the ones at +∞ and −∞) is determined by the relative size of σ0 and σ1 (i.e., equal or not) and not by p0 or p1. Therefore, in the linear case there is only one threshold regardless of the value of prior probability.
We will assume μ1 > μ0 from this point—without loss of generality. This can be done simply by labeling the two groups appropriately. Then x > tlinear implies that x belongs to group 1, and x < tlinear implies that x belongs to group 0. It is then possible to calculate the false alarm rate and the miss rate in terms of which the various measures can be written (next section). The former is simply the rate at which 0’s are classified as 1’s, and the latter is the rate of misclassifying 1’s as 0’s, that is,
$$c_{01} = \int_{t_{\text{linear}}}^{\infty} L_0(x)\,dx, \qquad c_{10} = \int_{-\infty}^{t_{\text{linear}}} L_1(x)\,dx.$$
Substitution of the Gaussian expressions for Li(x) yields
$$c_{01} = \frac{1}{2}\left[1 - \operatorname{erf}(t_0)\right], \qquad c_{10} = \frac{1}{2}\left[1 - \operatorname{erf}(t_1)\right], \tag{5}$$
where erf(x) is the Gaussian error function and ti, (i = 0, 1), are defined as
$$t_i = \pm\,\frac{t_{\text{linear}} - \mu_i}{\sqrt{2}\,\sigma}, \tag{6}$$
where the plus and minus signs are for i = 0, 1, respectively, and we have defined the useful quantity
$$\delta = \frac{\mu_1 - \mu_0}{\sigma}, \tag{7}$$
to be determined from the sample data. The number of false alarms and misses is obtained by multiplication of the rates by the respective group sample sizes: number of false alarms equals N0c01, and number of misses equals N1c10.
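Composed with Python’s math.erf, Eqs. (4)–(7) give the linear-case rates directly; a sketch (the function name is illustrative):

```python
import math

def linear_rates(mu0, mu1, sigma, p1):
    """False alarm rate c01 and miss rate c10 in the linear case, Eqs. (4)-(7)."""
    p0 = 1.0 - p1
    t = 0.5 * (mu0 + mu1) + sigma ** 2 / (mu1 - mu0) * math.log(p0 / p1)  # Eq. (4)
    root2 = math.sqrt(2.0)
    c01 = 0.5 * (1.0 - math.erf((t - mu0) / (root2 * sigma)))   # P(x > t | group 0)
    c10 = 0.5 * (1.0 - math.erf((mu1 - t) / (root2 * sigma)))   # P(x < t | group 1)
    return c01, c10
```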

Quadratic discriminant function

If, as is too often the case, σ0 ≠ σ1, then (3), in general, has two roots, that is, two thresholds. Writing the discriminant function as

$$D_2(x) = \left(\frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}\right)(x - t_+)(x - t_-),$$
with the roots written as t±, then the false alarm rate and the miss rate are given by
$$c_{01} = \frac{1}{2}\left[\operatorname{erf}(t_0^+) - \operatorname{erf}(t_0^-)\right], \qquad c_{10} = 1 - \frac{1}{2}\left[\operatorname{erf}(t_1^+) - \operatorname{erf}(t_1^-)\right], \tag{8}$$
if σ0² > σ1². By exchanging 0 ⇔ 1 in these equations, one obtains the respective rates if σ1² > σ0². Here, t±i (i = 0, 1) is defined as
$$t_i^{\pm} = \frac{t_{\pm} - \mu_i}{\sqrt{2}\,\sigma_i}, \qquad i = 0, 1.$$
Thus far, we have not written the expression for the roots t± because the relevant quantities in (8) depend on t±0 and t±1, and they can be found to be
$$t_i^{\pm} = \frac{\delta_i}{\sqrt{2}}\;\frac{-\,\delta_j^2 \pm \sqrt{\delta_0^2\,\delta_1^2 + 2\Lambda\,(\delta_0^2 - \delta_1^2)}}{\delta_0^2 - \delta_1^2}, \qquad j \neq i, \tag{9}$$
where
$$\delta_i = \frac{\mu_1 - \mu_0}{\sigma_i}, \qquad \Lambda = \log\frac{p_0\,\sigma_1}{p_1\,\sigma_0} = \log\frac{p_0\,\delta_0}{p_1\,\delta_1}. \tag{10}$$
In (9), the upper signs “±” apply when δ0 > δ1, and the lower signs “∓” apply when δ1 > δ0.
The two roots are real (nonimaginary) only if the expression under the square root is nonnegative, which translates into a constraint on p1:
$$p_1 \le p_{cc} \quad (\delta_0 > \delta_1), \qquad p_1 \ge p_{cc} \quad (\delta_1 > \delta_0),$$
where pcc is defined as
$$p_{cc} = \left[\,1 + \frac{\delta_1}{\delta_0}\,\exp\!\left(-\frac{\delta_0^2\,\delta_1^2}{2\,(\delta_0^2 - \delta_1^2)}\right)\right]^{-1}. \tag{11}$$

In other words, there is a range of values of p1 for which there is no (real) threshold at all. For such values of p1, there does not exist a discriminant function. It is easy to understand this phenomenon. Recall that the two roots correspond to the two crossing points of p0L0(x) and p1L1(x), but there exist some values of p1 for which one of the two quantities lies entirely below the other, and so there are no crossing points. In such a case, all observations of x are persistently classified as belonging to the group with the higher piLi(x). Note that in the linear case, when σ1 = σ0, neither group can have a piLi(x) that lies entirely below the other, and so pcc is specific to the quadratic case.
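Rather than transcribing (9)–(11), the two thresholds can also be recovered numerically by expanding (3) into a quadratic in x; a sketch (it presumes σ0 ≠ σ1, i.e., the quadratic case):

```python
import math

def quadratic_thresholds(mu0, sigma0, mu1, sigma1, p1):
    """Roots t-/t+ of D2(x) = 0 in the quadratic case, or None when p1 lies
    in the forbidden range beyond pcc (no real crossing points)."""
    p0 = 1.0 - p1
    # Expand Eq. (3) as D2(x) = A x^2 + B x + C
    A = 0.5 / sigma0 ** 2 - 0.5 / sigma1 ** 2
    B = mu1 / sigma1 ** 2 - mu0 / sigma0 ** 2
    C = (mu0 ** 2 / (2.0 * sigma0 ** 2) - mu1 ** 2 / (2.0 * sigma1 ** 2)
         + math.log(p1 * sigma0 / (p0 * sigma1)))
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return None   # one scaled likelihood lies entirely below the other
    r = math.sqrt(disc)
    return tuple(sorted(((-B - r) / (2.0 * A), (-B + r) / (2.0 * A))))

# Syracuse statistics from section 7; the two thresholds bracket the
# no-precipitation region (compare the two roots quoted in the discussion)
print(quadratic_thresholds(998.48, 8.673, 1002.86, 7.738, p1=0.56))
```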

The above results are relevant mostly for scalar measures of categorical forecasts, and the critical values of p1 that optimize the measures will be presented in section 5. At the other extreme, if forecasts are probabilistic, and one has the full luxury of dealing with multidimensional, probabilistic measures, then a different approach must be adopted. Section 5 also presents the critical value of p1 that yields the most reliable plot in a reliability diagram (in complete generality, i.e., without reference to a Gaussian model).

Measures

In a Bayesian approach to forecasting, the dependence of posterior probabilities on prior probabilities is transmitted to the performance measures—both the scalar and multidimensional measures, although in the latter case there is no reason to set the prior probabilities to a value different from that estimated from the group sample sizes. The dependence of the false alarm rate and the miss rate on p1 is now explicit in (5) and (6) and (8) and (9). We must now write the measures in terms of these rates. The four scalar measures FRC, CSI, TSS, and HSS can be defined in terms of the contingency table [otherwise known as the confusion matrix (C matrix)] (Marzban 1997, submitted to Wea. Forecasting):

$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$

where a and d are the numbers of correct forecasts of nonevents and events, respectively, and b and c are the numbers of false alarms and misses, respectively, and are therefore given by b = c01N0 and c = c10N1. Note that N0 = a + b and N1 = c + d. These four measures can be manipulated to depend on c01, c10, and the ratio of the two group sample sizes only:

$$\mathrm{FRC} = \frac{a + d}{N}, \qquad \mathrm{CSI} = \frac{d}{b + c + d},$$

$$\mathrm{TSS} = \frac{d}{c + d} - \frac{b}{a + b}, \qquad \mathrm{HSS} = \frac{2\,(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}.$$
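In code, the four measures follow directly from the table entries; a minimal sketch using the standard two-by-two definitions above:

```python
def scores(a, b, c, d):
    """FRC, CSI, TSS, HSS from the 2x2 contingency table.

    a = correct nonevents, b = false alarms (b = c01*N0),
    c = misses (c = c10*N1), d = correct events.
    """
    n = a + b + c + d
    frc = (a + d) / n
    csi = d / (b + c + d)
    tss = d / (c + d) - b / (a + b)   # equals 1 - c01 - c10
    hss = 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
    return frc, csi, tss, hss

# Sanity check of the limits quoted below: for N0 = N1,
# HSS = TSS and FRC = (1 + TSS)/2
print(scores(80, 20, 10, 90))   # -> (0.85, 0.75, 0.7, 0.7)
```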
A reliability diagram is simply a graph of the observed ratio of the event sample size to the total sample size, at a given value of the forecast probability:

$$\left.\frac{N_1(x)}{N_0(x) + N_1(x)}\right|_{P_1},$$

where $|_{P_1}$ means “at a given value of P1.” This quantity will be derived in the next section.

Note that if N0 = N1, then HSS = TSS and FRC = (1 + TSS)/2. It is interesting that TSS has no dependence (explicit or implicit) on the sample sizes and, consequently, it is well defined in the rare-event limit (see section 8). As for the reliability diagram, a most reliable forecaster will produce a straight, diagonal line on that diagram, and points below (above) the diagonal line reflect over- (under-) forecasting. Recalling (5) and (8) for c01, c10, the dependence of the scalar measures on prior probability becomes explicit. Figures 2a–d show four measures as a function of p1, in the linear case, for several examples of N0/N1 and δ. Note that when N0 = N1 (Figs. 2a,b), all four measures peak around p1 = 1/2. The differences between the measures appear in full force, however, when N0 ≠ N1 (Figs. 2c,d); whereas TSS continues to peak at p1 = 1/2, FRC appears to peak at p1 = N1/N, and CSI and HSS have other critical values. All of these values will be derived in the next section.

The significance of setting the prior probability at its optimal value can be seen in Fig. 2c, for instance. Hastily setting p1 at 0.5 would result in an HSS of 18%, while p1 = N1/N would yield HSS = 6%. However, HSS at its critical value is 25%. The improvement in HSS is even more significant for smaller values of δ (Fig. 2c is for δ = 1).

It is important to point out that, as will be shown below, the critical value of p1 that yields optimal results in a reliability diagram is exactly the true value as estimated from the group sample sizes, that is, p1 = N1/N. Critical values of p1 that differ from this true value arise only for scalar, categorical measures.

For the quadratic case, the p1 dependence of the four measures, for several different values of N0/N1, δ0, and δ1, is illustrated in Figs. 2e,f. These figures show the effect of pcc—the value of p1 beyond which the roots of the discriminant function become imaginary [see (11)]. As can be seen, the behavior of the curves differs depending on whether δ0 > δ1 or δ1 > δ0; in the former (Fig. 2e), the curves behave similarly to the linear case (Fig. 2d), except that there is a forbidden region above pcc. However, if δ1 > δ0 (Fig. 2f), then not only is there a forbidden region but the true value of p1 = N1/N also falls in this forbidden region. In other words, it is possible that this natural choice of p1 will yield a classifier that has no skill at all, when performance is gauged with these measures. Also note that FRC reaches its maximum at pcc.

Having obtained and illustrated the p1 dependence of the measures, the task then becomes to differentiate these measures with respect to p1 and find the roots of the resulting expressions. These critical values, p1c, are the values at which the various measures are maximized. For the reliability diagram, this will not be necessary because the value of p1 corresponding to maximum reliability is given exactly by the true value (next section).
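In practice, this maximization can also be carried out numerically; a minimal brute-force sketch, where the callable score_of_p1 is a hypothetical composition of the rate and measure sketches given earlier (e.g., linear_rates with the HSS entry of scores):

```python
import numpy as np

def p1_critical(score_of_p1, eps=1e-3, n=2000):
    """Brute-force search for the critical prior p1c maximizing a scalar measure.

    score_of_p1 maps a trial prior p1 to a score; it may return float('nan')
    inside the forbidden region of the quadratic case.
    """
    grid = np.linspace(eps, 1.0 - eps, n)
    vals = np.array([score_of_p1(p) for p in grid])
    i = int(np.nanargmax(vals))
    return grid[i], vals[i]
```

The maximizer returned by such a search can be compared directly against the curves of Fig. 4a.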

Some exact results

Given the nonlinear nature of the equations, analytic expressions for p1c are difficult to derive. However, for FRC, TSS, and the reliability diagram, exact results are possible. In FRC and TSS, the appearance of c01, c10 in the numerator alone makes the calculation possible if one notes the identity

$$\frac{d}{dx}\,\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\,e^{-x^2},$$

which follows from the definition of the Gaussian error function

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt.$$
The details of the calculation will not be presented here, but we find, in both the linear and the quadratic case, the following:
$$p_{1c} = \frac{N_1}{N} \quad \text{(FRC)}, \tag{12}$$

$$p_{1c} = \frac{1}{2} \quad \text{(TSS)}. \tag{13}$$
Similarly, for the reliability diagram
$$p_{1c} = \frac{N_1}{N}. \tag{14}$$

These values of p1c will maximize the respective measures. Equations (12)–(14) are interesting in that they reproduce the two popular critical values—the one suggested by Bayes’ postulate (Kendall and Stuart 1969), that is, 1/2, and the true value, that is, N1/N. So, TSS appears to have an affinity for p1 = 1/2, while FRC and the reliability diagram have an affinity for N1/N.

It is worthwhile to outline the derivation of p1c for the reliability diagram found in (14) because it is true in general, that is, not for Gaussian distributions only. Recall from (1) that P1 (x) can be written as
$$P_1(x) = \left[1 + \frac{p_0}{p_1}\,\frac{N_1}{N_0}\,\frac{N_0(x)}{N_1(x)}\right]^{-1},$$
because Li(x) = Ni(x)/Ni, where Ni(x) is the sample size of the ith class at a given value of x. Meanwhile, the observed ratio of event sample size to the total sample size, also at a given value of x, is
$$\frac{N_1(x)}{N_0(x) + N_1(x)} = \left[1 + \frac{N_0(x)}{N_1(x)}\right]^{-1}.$$
Both expressions have an x dependence only through the ratio N0(x)/N1(x), and therefore, eliminating this ratio from both equations yields the P1 dependence of N1/(N0 + N1), that is, a reliability diagram:
$$\left.\frac{N_1}{N_0 + N_1}\right|_{P_1} = \frac{a\,P_1}{1 + (a - 1)\,P_1}, \tag{15}$$

where a = (p0N1)/(p1N0). Figure 3 shows the resulting reliability diagram for several values of a. Equation (15) implies that a = 1 yields maximum reliability; this value of a corresponds to p1c = N1/N. Note that this is also the value at which FRC (not the best of measures) is maximized.
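Equation (15) is simple enough to tabulate directly; the following sketch reproduces the family of curves in Fig. 3 (the function name is illustrative):

```python
import numpy as np

def reliability_curve(a, n=101):
    """Observed relative frequency vs forecast probability P1, Eq. (15).

    a = p0*N1/(p1*N0); a = 1 recovers the diagonal (optimum reliability).
    """
    P1 = np.linspace(0.0, 1.0, n)
    return P1, a * P1 / (1.0 + (a - 1.0) * P1)

for a in (0.25, 0.5, 1.0, 2.0, 4.0):
    P1, rel = reliability_curve(a)
    # points below the diagonal (rel < P1) indicate overforecasting
    print(a, rel[50])   # curve value at P1 = 0.5, which equals a/(1 + a)
```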

The existence of c01, c10 in the denominators of CSI and HSS renders the critical equations nonlinear, and, as a result, only limiting and numerical solutions for p1c are possible. As examples of the former, it is possible to show that for N0 ≫ N1, CSI and HSS have the same p1c, although that value itself can be found only numerically. The same holds if δ0, δ1 ≫ 1. And if N0 ≪ N1, then the p1c of CSI is equal to that of FRC, namely, N1/N. Also, for HSS, p1c → N1/N as δ → ∞; given that this value of p1 is the true value optimizing a reliability diagram, it is evident that a forecaster with an optimum HSS is apt to have a suboptimum reliability plot unless the dataset happens to have “large” δ or N0/N1.

Numerical results

For the sake of brevity, in this article only HSS is considered, although the CSI results are available as well. It is important to note that all of the scalar measures are written in terms of only N0/N1 and δ0, δ1. As a result, the p1c for HSS and the corresponding HSS itself [i.e., HSS(p1c)] can be tabulated in terms of these quantities. Again, recall that these quantities are obtained directly from the sample data.

For the linear case, δ0 = δ1 = δ, and so the p1c are determined from only two quantities—N0/N1 and δ. Figures 4a,b allow one to read off p1c and HSS(p1c), respectively, given δ and N0/N1. For values of δ and N0/N1 not given in Fig. 4, observe that the plots in Fig. 4a asymptotically approach N1/N for large δ and large N0/N1, as was shown in the previous section. Then, one can find HSS(p1c) for large values of δ and N0/N1 from Fig. 4b.

For the quadratic case, the critical values of p1 can be read off from Figs. 5a–f, given N0/N1 and δ0, δ1. It is sufficient to present only the N0/N1 = 2, 4, 8, 100, 500, 1000 and 0 ≤ δ0, δ1 ≤ 7 results since the results for other values can be found by extrapolation. As in the linear case, p1c → N1/N for large δ0 or δ1. The N0/N1 = 1 case is not considered numerically since in that case HSS = TSS, and from (13) we find p1c = 1/2 for HSS, exactly.

Similarly, HSS(p1c) (i.e., the maximum value of HSS) can be read off from Figs. 6a–f. Again, for N0/N1 = 1, HSS(p1c = 1/2) is not presented since it can be calculated exactly from (8)–(10). Hence, Figs. 5 and 6 allow one to obtain the critical value of p1 and the corresponding HSS, given N0/N1, δ0, and δ1, which are all obtained from the sample data. It is evident from the defining equations of the measures in section 3 that whereas CSI is not symmetric under the exchange 0 ⇔ 1, FRC, TSS, and HSS are. It is this symmetry that has justified the presentation of the results for only N0/N1 ≥ 1 since the results for N0/N1 < 1 can be obtained by the exchange 0 ⇔ 1 everywhere.

Example

To illustrate the above methodology, an example will be considered in this section. The example is carefully selected to point out some of the subtleties.

The hourly surface air pressures from Syracuse, New York, for the year 1990, were considered. Each hourly observation was also accompanied by whether some form of precipitation—rain, various types of snow, hail, etc.—was observed. The above methodology can be applied to deduce the optimal value of prior probability for precipitation when surface air pressure is employed to forecast precipitation; performance is gauged in terms of Heidke’s skill statistic and via a reliability diagram.

The number of precipitation and no-precipitation observations was 1776 and 6145, respectively; the mean pressures were 998.48 and 1002.86 mbar, and the corresponding standard deviations were 8.673 and 7.738 mbar. This is all that will be required. The actual frequency distributions and the corresponding Gaussian fits are shown in Fig. 7.

First, recall that the condition μ1 > μ0 imposed at the outset of the analysis (section 2) implies that the group with the larger mean must be labeled as group 1. Therefore, we have
$$\text{group 0 (precipitation):}\ \ \mu_0 = 998.48\ \text{mbar},\ \ \sigma_0 = 8.673\ \text{mbar},\ \ N_0 = 1776;$$

$$\text{group 1 (no precipitation):}\ \ \mu_1 = 1002.86\ \text{mbar},\ \ \sigma_1 = 7.738\ \text{mbar},\ \ N_1 = 6145.$$
Then,
$$\frac{N_0}{N_1} = 0.289, \qquad \delta_0 = 0.505, \qquad \delta_1 = 0.566,$$
with δ0 and δ1 found from (10).

If we are willing to overlook the difference between σ0 and σ1, then we may work in the linear scheme with δ = (δ0 + δ1)/2 = 0.536. Given δ and N0/N1, Fig. 4a is consulted to obtain the desired quantity; however, the N0/N1 = 0.289 results do not appear in that figure. At this point, we simply recall the symmetry of the results under the exchange of the labels 0 and 1. Then, we simply look up the results for N1/N0 = 1/0.289 ≈ 3.5. Figure 4a suggests a value of p1c = 0.45, but this is really the value of p0c since we relabeled the groups. As a result, p1c = 1 − p0c = 0.55. Similarly, Fig. 4b suggests a corresponding value of HSS ≈ 15%.

The single root of the linear discriminant function is 997.60 (mbar), computed from (4). This means that in this linear approximation, pressures below this value are more likely to be associated with precipitation, while those above are more likely to represent no precipitation.
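The arithmetic of this example can be checked in a few lines; a sketch using only the sample statistics quoted above (the use of the average of σ0 and σ1 as the common linear-scheme σ is an assumption, one that reproduces the quoted root):

```python
import math

# Syracuse 1990 sample statistics quoted above;
# group 0 = precipitation, group 1 = no precipitation (so that mu1 > mu0)
N0, mu0, sigma0 = 1776, 998.48, 8.673
N1, mu1, sigma1 = 6145, 1002.86, 7.738

delta0 = (mu1 - mu0) / sigma0            # ~0.505, from Eq. (10)
delta1 = (mu1 - mu0) / sigma1            # ~0.566
delta = 0.5 * (delta0 + delta1)          # ~0.536, linear-scheme compromise
print(round(N0 / N1, 3), round(N1 / (N0 + N1), 2))   # 0.289 and 0.78

# Single linear threshold, Eq. (4), at the critical prior p1c = 0.55;
# sigma below is an assumed stand-in for the common standard deviation
sigma = 0.5 * (sigma0 + sigma1)
p1c = 0.55
t = 0.5 * (mu0 + mu1) + sigma ** 2 / (mu1 - mu0) * math.log((1 - p1c) / p1c)
print(round(t, 2))                       # ~997.6 mbar
```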

On the other hand, if one assumes that σ0 and σ1 are statistically distinct (more on this below), then the quadratic method must be employed. According to Figs. 5a,b, since N0/N1 = 3.5 lies between N0/N1 = 2 and 4, p1c lies between 0.45 and 0.43. So, we may choose p1c = 0.44, but again, this is really p0c because of the relabeling. Therefore, p1c = 1 − p0c = 0.56. Similarly, HSS itself can be read off from Figs. 6a,b as HSS ≈ 20%.

This example was chosen for having a σ0 and σ1 that are relatively comparable in magnitude to allow for the illustration of both the linear and quadratic methods. However, if the object is more than an illustration, then it behooves one to question whether the difference between σ0 and σ1 is statistically significant. This can be done by computing the confidence intervals on σ0 and σ1. Without going into the details, suffice it to say that in this example, although the means μ0 and μ1 are statistically distinct at the 99% level, σ0 and σ1 are statistically equivalent (at the 99% level), and so the linear method is quite adequate. See section 8 for an interesting consequence that would have ensued if σ0 and σ1 had turned out to be statistically distinct.

As for the reliability diagram, the exact results of the previous section indicate that the most reliable plot is obtained when p1 takes the true value N1/N, that is, p1c = 0.78. Then the reliability plot is a diagonal line. Note that this value of p1 is different from the value that optimizes HSS in either the linear or the quadratic regime.

Discussion

In addition to allowing for improved (scalar) performance, there is one other reason for considering a value of p1 that is different from N1/N, and that arises if the single variable x is the output of some sort of a regression analysis. It is entirely possible that the group sample sizes in the training data may not be proportional to the climatological one. In such a situation, it is unclear which p1 should be selected—the p1 of the training data or that of the test data.

It is interesting that TSS does not depend on group sample sizes (section 3). This may seem contrary to what has previously been said about TSS in the rare-event limit. On one hand, if N0 ≫ N1, then any reasonable classifier will yield a C matrix with a ≫ b, c, d, from which it follows that (Doswell et al. 1990)

$$\mathrm{TSS} = \frac{d}{c + d} - \frac{b}{a + b} \approx \frac{d}{c + d}.$$

The rightmost term in this equation is the so-called probability of detection, which by itself constitutes an improper measure since one can optimize it by persistently forecasting all observations as 1’s. On the other hand, the independence of TSS from N0, N1 would imply that TSS is given by 1 − c01 − c10 regardless of whether events are rare. This may appear to be paradoxical. The “catch” is embedded in the expressions b = c01N0 and c = c10N1, with c01, c10 given by the N-independent (4)–(7), which together imply that TSS is independent of N0 and N1. But this is true only if there exist unique false alarm and miss rates (c01 and c10) that are independent of the sample sizes. In other words, assuming that there exist unique, N-independent false alarm and miss rates, TSS is a well-defined measure even in the rare-event limit. Even then, TSS can behave pathologically: if the false alarm rate is unusually small (i.e., c01 ≪ 1), then TSS ≈ 1 − c10; this time, however, the problem is peculiar to the classifier and not to any rare-event conditions present in the data.
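For completeness, the claimed sample-size independence of TSS follows in one line from the contingency-table entries of section 3, since a + b = N0 and c + d = N1:

$$\mathrm{TSS} = \frac{d}{c + d} - \frac{b}{a + b} = \frac{N_1(1 - c_{10})}{N_1} - \frac{c_{01}\,N_0}{N_0} = 1 - c_{01} - c_{10}.$$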

In the example above it was mentioned that σ0 and σ1 are statistically equivalent with 99% confidence, and it was concluded that for all practical purposes the linear method would be adequate. There is an interesting effect that would have occurred if σ0 and σ1 were statistically distinct: that would have implied two crossing points between p0L0(x) and p1L1(x). Indeed, the two roots can be found to be 996.5 mbar and 1043.5 mbar. This, in turn, would have implied that pressures below 996.5 mbar are more likely to be associated with precipitation, and those above 996.5 mbar are more likely to be associated with no precipitation. This is as it occurs in the linear scheme and is physically acceptable; however, the existence of the second root would have implied that pressures above 1043.5 mbar are more likely to be associated with precipitation! Although, in this case, the statistical equivalence of σ0 and σ1 precludes such nonlinear behavior in pressure, it is important to emphasize the “ease” with which such effects may occur; all that is required is that the two groups be normally distributed and have statistically distinct standard deviations. In such a case, the wider distribution is guaranteed to cross the narrower one twice, thereby giving rise to such nonlinear effects.

Summary

In this article, the Bayesian approach to the transformation of a single continuous variable to a posterior probability of a corresponding event is outlined. The issue of forecast quality is then addressed in terms of four scalar performance measures for categorical (reduced) forecasts and reliability diagrams. It is shown that these measures may be optimized by setting the event prior probability at certain critical values as given by p1c, where p1c = N1/N (i.e., the true value) for fraction correct and reliability diagrams, and p1c = 1/2 for the true skill statistic. The critical success index and Heidke’s skill statistic do not allow for exact results, and so their values of p1c are presented graphically. The use of the graphs requires only a knowledge of the means, the standard deviations, and the sample size ratio of the two groups.

Acknowledgments

The author is grateful to John Cortinas for providing the dataset employed for illustrating the methodology in the example section, and to Kim Elmore and Jeanne Schneider for a discussion of various aspects of the dataset.

REFERENCES

  • Brooks, H., and C. A. Doswell III, 1995: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288–303.

  • Donaldson, R. J., R. M. Dyer, and M. J. Krauss, 1975: An objective evaluator of techniques for predicting severe weather events. Preprints, Ninth Conference on Severe Local Storms, Norman, OK, Amer. Meteor. Soc., 321–326.

  • Doswell, C. A., III, R. Davies-Jones, and D. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. Wea. Forecasting, 5, 576–585.

  • Gandin, L. S., and A. H. Murphy, 1992: Equitable skill scores for categorical forecasts. Mon. Wea. Rev., 120, 361–370.

  • Hamill, T. M., 1997: Reliability diagrams for multicategory probabilistic forecasts. Wea. Forecasting, 12, in press.

  • Kendall, M. G., and A. Stuart, 1969: The Advanced Theory of Statistics. Vol. 1. Hafner Publishing, 439 pp.

  • Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.

  • Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.

  • ——, 1996: The Finley affair: A signal event in the history of forecast verification. Wea. Forecasting, 11, 3–20.

  • ——, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

  • ——, and ——, 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, 435–455.

  • ——, B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501.

  • O’Hagan, A., 1994: Kendall’s Advanced Theory of Statistics. Vol. 2B, Bayesian Inference. Halsted Press, 330 pp.

  • Schaefer, J. T., 1990: The critical success index as an indicator of warning skill. Wea. Forecasting, 5, 570–575.

  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Fig. 1. Gaussian likelihood functions for two groups with means at μ = 30 and μ = 50, (a) with equal variances and (b) with unequal variances. In the former, the two curves cross at only one point, but in the latter they cross at two points.

Fig. 2. The p1 dependence of four measures for some examples: (a) N0/N1 = 1, δ = 1; (b) N0/N1 = 1, δ = 2; (c) N0/N1 = 10, δ = 1; (d) N0/N1 = 10, δ = 2, in the linear case; and (e) N0/N1 = 100, δ0 = 5, δ1 = 1; (f) N0/N1 = 100, δ0 = 1, δ1 = 5, in the quadratic case. The vertical lines in each graph mark the “true” value of p1, i.e., p1 = N1/N.

Fig. 3. The reliability diagram for several values of a = p0N1/(p1N0). Optimum reliability corresponds to a = 1, which translates to p1 = N1/N.

Fig. 4. (a) The values of p1c and (b) the corresponding HSS, HSS(p1c), in the linear case, as a function of δ and for 10 values of N0/N1 = 1, 2, 4, 8, 16, 32, . . . , 2^10 = 1024.

Fig. 5. The values of p1c, in the quadratic case, as a function of δ0 and δ1 = 0.1, 0.5, 1.0, 1.5, 2.0, . . . , 7.0, for N0/N1 = 2, 4, 8, 100, 500, 1000. (For N0/N1 = 1, p1c = 1/2.)

Fig. 6. The values of HSS(p1c), in the quadratic case, as a function of δ0 and δ1 = 0.1, 0.5, 1.0, 1.5, 2.0, . . . , 7.0, for N0/N1 = 2, 4, 8, 100, 500, 1000. [For N0/N1 = 1, HSS is found from (8)–(10) with p1c = 1/2.]

Fig. 7. The distributions of hourly surface air pressures for Syracuse, NY, for 1990. The rough curves represent the data and the smooth curves are Gaussian fits to the data. The “larger” curve is for hours when some form of precipitation occurred, and the “smaller” curve is for when there was no precipitation.

1. In this article, “performance” refers to a measure of forecast quality (Murphy 1993).

2. Given the ubiquity of Gaussian distributions in meteorological data, these findings are valid in a wide range of applications. A Gaussian model has also been considered by Mason (1982).

3. An x that yields D2(x) = 0 can always be assigned to one of the groups on a random basis.

4. This has great utility in the multivariate case because then the coefficients of the various x terms would represent the predictive strength of the respective independent variables.
