A Rose by Any Other Name: On Basic Scores from the 2 × 2 Table and the Plethora of Names Attached to Them

Harold E. Brooks, NOAA/National Severe Storms Laboratory, Norman, Oklahoma, and School of Meteorology, University of Oklahoma, Norman, Oklahoma (https://orcid.org/0000-0003-3800-0199)

Montgomery L. Flora, Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma, and NOAA/National Severe Storms Laboratory, Norman, Oklahoma

Michael E. Baldwin, Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma, and NOAA/National Weather Service/Storm Prediction Center, Norman, Oklahoma

Open access

Abstract

Forecast evaluation metrics have been discovered and rediscovered in a variety of contexts, leading to confusion. We look at measures from the 2 × 2 contingency table and the history of their development and illustrate how different fields working on similar problems have developed different approaches to, and perspectives on, the same mathematical concepts. For example, probability of detection (POD) is a quantity in meteorology that was also called prefigurance in that field, while the same quantity is named recall in information science and machine learning, and sensitivity and true positive rate in the medical literature. Many of the scores that combine three elements of the 2 × 2 table can be seen as coming either from a perspective of Venn diagrams or from the Pythagorean means, possibly weighted, of two ratios of performance measures. Although there are algebraic relationships between the two perspectives, the approaches taken by authors led them in different directions, making it unlikely that they would discover scores that naturally arose from the other approach. We close by discussing the importance of understanding the implicit or explicit values expressed by the choice of scores. In addition, we make some simple recommendations about the appropriate nomenclature to use when publishing interdisciplinary work.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Harold E. Brooks, harold.brooks@noaa.gov


1. Introduction and background

The problems of forecast evaluation, classification, and comparative relationships between groups of concepts or things share some common characteristics. We can think of them as being concerned with the similarity or dissimilarity between two (or more) different descriptions of some “reality.” In weather forecasting, quantifying the similarity between the forecasts and the weather associated with them (the observations) describes the quality of the forecasts (Murphy 1993). When looking at the cultural traits of different people groups (Driver and Kroeber 1932) or the plant species in different areas (Jaccard 1901), we can compare the properties of one group to those of the other to see how alike they are. With similar basic problems facing many fields, multiple metrics that quantify the same properties have been developed and, often, given new names. Identical metrics, but with a bewildering array of names for the same things, can lead to confusion, especially in interdisciplinary work, where researchers from different backgrounds can end up, essentially, speaking foreign languages to each other. It is useful, however, to see how different fields have thought about solving similar problems and dealing with the issues that arise in them. In many cases, the applicability of a term developed in one field is obvious there, but not so obvious once the term is adopted by another field. In this paper, we will focus on some illustrative examples from what is a seemingly simple problem, the so-called 2 × 2 problem, and make some recommendations for improved, if not best, practices to ease the communication challenges.

2. The 2 × 2 contingency table and scoring rules

As a starting point, we begin with the classic 2 × 2 contingency table, introduced by Pearson (1904). It is illustrative to note that this is an example of something that has been given a different name in a specific context. Miller and Nicely (1955) introduced the term “confusion matrix” for the same thing as a contingency table in their study of how listeners confused different consonant sounds in English. The name is evident in that context, as it is in the study of cattle feeding behavior of Ruuska et al. (2018), but several other areas have adopted that term, such as psychology (Cheng et al. 2023), machine learning (Heydarian et al. 2022), statistical classification (Riehl et al. 2023), and environmental science (Phillips et al. 2024), where it is not so obviously connected to the original context. This does not mean the usage is wrong, but the additional terminology can confuse new users. For the simplest possible forecast problem of a yes/no forecast and a yes/no observation, we get a 2 × 2 table (Table 1).

Table 1. The 2 × 2 contingency table for forecasts and observations.

                  Observed yes          Observed no
Forecast yes      a (hit)               b (false alarm)
Forecast no       c (missed event)      d (correct nonevent)
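To make the table concrete, the four counts can be tallied directly from paired yes/no forecasts and observations. A minimal Python sketch (the data and variable names here are ours, purely for illustration):

```python
import numpy as np

# Paired yes/no (1/0) forecasts and observations for ten hypothetical cases
forecast = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
observed = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

a = int(np.sum((forecast == 1) & (observed == 1)))  # hits ("coincidences")
b = int(np.sum((forecast == 1) & (observed == 0)))  # false alarms
c = int(np.sum((forecast == 0) & (observed == 1)))  # missed events
d = int(np.sum((forecast == 0) & (observed == 0)))  # correct nonevents

print(a, b, c, d)  # 3 2 1 4
```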

The first use of contingency-table-style elements in meteorology was in the tornado forecast experiment of Finley (1884). The history of the early days after Finley’s experiment, the measures derived from the table that were discovered in the immediate aftermath, and the subsequent rediscoveries of those scores are described in Murphy (1996). Researchers discussed several fundamental issues in the first years after Finley, such as the quality of observations, the possible difference in importance of different kinds of forecast errors, and what was an appropriate baseline for how many forecasts should have been correct by chance.

One of the other issues discussed in some of the early papers was what we might now think of as the correct forecast of a “nonevent” (e.g., no tornado warning was issued and no tornado occurred), the “d” element of Table 1. In some fields, such as geobotany, this term has no meaning. In comparing the species of plants in different Swiss valleys, Jaccard (1901) ignored the existence of d completely. This was a sensible thing to do, given that it would be “all of the plants that don’t exist in either of two valleys.” Putting bounds on what constitutes “all” is challenging. Palmer and Allen (1949), in the context of weather forecasts, excluded d by framing the forecasts as being associated with threats, meaning that the d term would be a forecast for a nonthreat that did not happen. They assumed that those cases were “less difficult, perhaps less significant forecasting problem(s).” As a result of this issue, our primary emphasis will be on scores derived from the a, b, and c terms of the table. Despite the seeming simplicity, Brusco et al. (2021) discuss 18 different metrics relying on those three terms, a list that is not exhaustive. In total, Brusco et al. (2021) compiled 71 metrics derived from the 2 × 2 contingency table and used synthetic data to examine which scores behave similarly under a variety of conditions, producing clusters of metrics that behave like each other. Other lists of metrics derived from the 2 × 2 table have been compiled by Murphy (1996), Marzban (1998), and Warrens and van der Hoef (2022).

Researchers have approached two philosophical areas, leading to different metrics:

  1. Should we measure similarity or dissimilarity between our two sets?

  2. Should we consider the individual elements (a, b, c) or composite statistics like the total number of each category or ratios [a/(a + b)]?

As a result of addressing area 1, there are many scores that have complements of other scores, so that the two scores sum to one. Many of the dissimilarity metrics are referred to as “distance,” so that a perfect score would be 0, while most similarity metrics are called “index” or “score” with a perfect value of 1. As for area 2, although all of the scores can obviously be written in terms of the individual elements, many are written as combinations of the ratios, so the underlying approach makes some scores more accessible and others less so.

a. Gilbert’s ratio of verification and its copies

Gilbert (1884) produced what he referred to as a “ratio of verification,”
υg = a/(a + b + c), (1)
in terms of Table 1. Gilbert derived his ratio by considering the number of forecasts (a + b), the number of events (a + c), and the “coincidences” a, and wrote (1) as
υg = a/[(a + b) + (a + c) − a], (2)
which reduces to (1), emphasizing the forecasts and events as classes and subtracting out the coincidences. In meteorology, this score would be rediscovered by Palmer and Allen (1949) and given the name “% success,” later to be called the “threat score” by Larue and Younkin (1961). The word “threat” came from Palmer and Allen describing their work as relating to the forecasting of threats. Palmer and Allen worked directly with the three elements of Table 1. They explicitly ignored the d term in the table because “when no precipitation was forecast and no precipitation was observed, it is assumed that the possible occurrence of precipitation was a less difficult, perhaps insignificant forecast problem.” Probabilistic approaches to forecasting can help quantify the difficulty, and thresholding probabilities to create yes/no forecasts can separate the probabilistic forecasts into a series of 2 × 2 tables.
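As an illustration of (1) and (2), the following sketch (the function names are our own) evaluates both forms and confirms that they agree:

```python
def ratio_of_verification(a, b, c):
    """Gilbert's ratio of verification, Eq. (1): a / (a + b + c)."""
    return a / (a + b + c)

def ratio_via_classes(a, b, c):
    """Gilbert's equivalent form, Eq. (2): forecasts plus events minus coincidences."""
    forecasts, events = a + b, a + c
    return a / (forecasts + events - a)

a, b, c = 28, 72, 23  # counts in the spirit of Finley's tornado data; illustrative only
assert abs(ratio_of_verification(a, b, c) - ratio_via_classes(a, b, c)) < 1e-12
print(round(ratio_of_verification(a, b, c), 3))  # 0.228
```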

Donaldson et al. (1975a,b) also produced the same metric for severe thunderstorm forecasting by looking at the individual elements a, b, and c. They describe their choice of (1) as “arbitrary” and do not reference Gilbert, Palmer and Allen, or Larue and Younkin. They gave the score the name “critical success index” (CSI), which has become the most common name in use in meteorology and is the term used in the official verification of severe thunderstorm and tornado warnings by the U.S. National Weather Service.

One of the most common names for υg outside of meteorology is the Jaccard index, which came out of the field that might now be referred to as biogeography. It is widely used in medical research (e.g., Scheller et al. 2023), computer imaging (Amirkhani and Bastanfard 2021), information science (Leydesdorff 2008), and machine learning as seen in its use in Scikit-learn in Python (Pedregosa et al. 2011). This score was developed by the botanist Jaccard (1901)1 as the coefficient de communauté (community coefficient) in his work comparing the distribution of plant species in valleys in Switzerland. Jaccard followed a similar approach to Gilbert, in that he compiled lists of species in each valley and then did pairwise comparisons between two valleys at a time. In Gilbert’s terms, the list for one valley could be considered as the “forecast” for the “event” species in another valley. From the 2 × 2 table perspective, the d entry, which would be the plant species that did not appear in a pair of valleys, was problematic for Jaccard and, as a result, it was ignored.

The other primary development of the same score as υg comes from Tanimoto (1958). In contrast to Gilbert and Jaccard, Tanimoto developed his index from a set theory approach, with the score being the intersection of two datasets over their union, commonly known today as intersection over union (IoU) (Wilks et al. 1990). From a spatial perspective, this can easily be seen in the diagrams of Venn (1880). Larue and Younkin (1961) called it the threat score,2 and Stensrud and Wandishin (2000) referred to it as the correspondence ratio for areal forecasts, particularly with application to ensemble forecasts. A timeline showing the introduction of these different names for the score is given in Fig. 1. The applicability to both the identification of individual events and spatial fields was highlighted by Mason (1989). Marczewski and Steinhaus (1958) also used a set theory approach to define a dissimilarity metric between two sets that is 1 − υg. They show that this metric has all of the properties of a mathematical distance.

Fig. 1. Timeline of the introduction of scores of the form of υg.

As part of a larger effort to look at binary similarity coefficients, computational chemists Consonni and Todeschini (2012) built on υg by taking logarithmic transformations of the numerator and denominator, leading to a similarity coefficient they referred to as T4,
sT4 = log(1 + a)/log(1 + a + b + c). (3)
This was one of a set of five scores they developed using logarithmic transformations of previous scores from the literature, and it was the only one of the five that ignored the d term in the 2 × 2 table. They tested their “new” scores against the original scores as well as some other traditional scores. Most of the logarithmic transformations appeared to provide little new information compared to the original scores. The T4, however, gave a different ordering in their tests and, hence, provided a different view of performance in their problem. It has not been applied widely, so it is not clear how it might behave in meteorological applications and whether it has significant value.
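A one-function sketch of (3) (our own naming); note that the base of the logarithm cancels in the ratio:

```python
import math

def t4(a, b, c):
    """Consonni and Todeschini's T4, Eq. (3); the log base cancels in the ratio."""
    return math.log(1 + a) / math.log(1 + a + b + c)

print(round(t4(28, 72, 23), 3))  # 0.699 for the illustrative counts used above
```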

b. Ratios from the 2 × 2 table as the basis of scoring

Another approach to developing scores from the three elements is to consider ratios of terms from the rows and columns of Table 1. Specifically, many researchers have looked at a/(a + c) and a/(a + b) and their complements, c/(a + c) and b/(a + b), creating relationships between those ratios. Again, a plethora of names has been applied to those ratios.

In meteorology, the most common name for a/(a + c) is the probability of detection (POD), from Donaldson et al. (1975a,b). Prefigurance (Brier and Allen 1951)3 has also been used, but POD dominates in the modern literature. Its origins in meteorology appear to come from House (1960), who was studying what density of observational stations was needed to observe different atmospheric features and estimated the probability of detecting a feature of interest given the station spacing. It is unclear what impact House had on the choice by Donaldson et al.

In the machine learning community, recall is the most common descriptor of a/(a + c). It comes out of information retrieval and was first used by Kent et al. (1955) in the context of literature searching, with the name “recall factor.” They wrote, “This fraction, which we shall term the ‘recall factor,’ measures the proportion of pertinent documents to which the information retrieval system directed attention when a given search was conducted.” In effect, its name comes from the consideration of how many of the appropriate documents are recalled from the database. Sensitivity is another common name for this expression, coming out of the medical diagnostic community (Yerushalmy 1947) and deriving from measuring how sensitive a diagnostic test was in responding to a particular condition. Dice (1945) simply called both a/(a + b) and a/(a + c) “association indices,” with notations added to indicate which one was being referred to.

Although postagreement (U.S. Army Air Forces 1944, unpublished manuscript) was suggested in meteorology for the term a/(a + b), it has not been widely used. Neither has the probability of hits, suggested by Doswell et al. (1990). At this time, the most common term is the success ratio (SR) (Schaefer 1990). For reasons that are not particularly clear, meteorology has tended to focus on the complement of the SR, the fraction of forecasts of an event that did not have the event occur. The term b is often labeled as “false alarms,” so that the b/(a + b) term has been called the false alarm ratio (Donaldson et al. 1975a,b) and the false alarm rate (Olson 1965), among others. Given that the term b/(b + d) has also been called the false alarm rate, this has caused great confusion. Barnes et al. (2007) give an outstanding review of the conflicting usage.
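Because the naming is so tangled, it can help to compute the ratios with their aliases side by side. A sketch (the function name is our own), with the aliases carried in the comments:

```python
def table_ratios(a, b, c, d):
    """Basic ratios from the 2 x 2 table, with some of their aliases."""
    return {
        "POD": a / (a + c),   # recall, sensitivity, true positive rate, prefigurance
        "SR": a / (a + b),    # precision, relevancy, pertinency, postagreement
        "FAR": b / (a + b),   # false alarm ratio, the complement of SR
        "POFD": b / (b + d),  # probability of false detection; also called
                              # "false alarm rate," hence the confusion
    }

for name, value in table_ratios(28, 72, 23, 2680).items():
    print(f"{name}: {value:.3f}")
```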

As a companion to recall, Kent et al. (1955) named a/(a + b) the pertinency ratio. In the context of information retrieval, the origin of this term seems clear: it gives the fraction of retrieved pieces of information that is pertinent to the search request. The early 1960s provided a huge boost to the study of information retrieval, funded by the National Science Foundation to help scientists find the most important references and speed research in the aftermath of the Soviet launch of Sputnik. One of the biggest recipients of that support was at the Cranfield College of Aeronautics in the United Kingdom. The leader of that project, Cyril Cleverdon, followed up on Kent et al., adopting the use of recall but preferring relevancy over pertinency (Cleverdon 1962). The other big project, Information Storage and Retrieval (ISR), led by Gerard Salton at Harvard, introduced the term precision (Salton 1964) with no discussion of the reasons.4 Interestingly, in the same report in which Salton used “precision,” Rocchio (1964) used relevancy, in keeping with Cleverdon, but added a parenthetical comment that it “was sometimes called precision.” Eventually, at a closed-door meeting in Washington in December 1964, Cleverdon agreed to use precision (Cleverdon and Swanson 1965).

One of the biggest advances made by Cleverdon (1962) was the notion of plotting recall and relevancy on a two-dimensional plot. Although Kent et al. (1955) had plotted distributions of the individual terms, Cleverdon was the first to plot them jointly. It could be done for a single 2 × 2 table or for a series of tables where the events were the same but different thresholds were used (in the original case, by the addition of search terms to make the search more restrictive). That addition led to an apparent trade-off between recall and relevancy. A debate about whether that trade-off was real or an artifact of the way Cleverdon had done the analysis went on for some time, even though a connection could be made to the model of Tanner and Swets (1954) that supported the trade-off as real. For clarity, we have replotted Cleverdon’s original data (Fig. 2, his Tables 7.5 and 7.6). It is of particular interest that in this first plot of the two quantities, recall was plotted on the vertical axis, the same convention as used in the performance diagram of Roebber (2009). This also means that the vertical axis is the same as that used in the receiver operating characteristic (ROC) plots developed by Peterson and Birdsall (1953). Swets (1963) uses that convention in a discussion of what the horizontal axis on such a plot should be [relevancy, the b/(b + d) that Tanner and Swets (1954) had advocated, or, simply, b as used by Swanson in a 1962 conference presentation].

Fig. 2. Reproduction of Cleverdon (1962) Table 7.6. Legend is as used in Cleverdon, where different “Relevance” lines are associated with different search efforts. The dashed lines are from Cleverdon.

The flipping of the axes on this two-dimensional plot of recall and what was by then referred to as precision first appears in Salton (1964, Fig. 12), with no reason given for the change. No discussion appears in the reports of the Cranfield and ISR groups as to why these choices were made, but they appear to have been the working conventions of the respective groups. Interestingly, Michael Keen wrote sections in reports for both groups, implying he may have received support from each. Even though he always refers to the diagrams as recall–precision diagrams, when they are published in Cranfield reports, recall is the vertical axis (e.g., Keen 1966), and when they are published in ISR reports, precision is the vertical axis (e.g., Keen 1968).

The question of the “standard” orientation of the two axes was not resolved for a long time, and it is not clear why precision as the vertical axis became the default in the machine learning community. Outside of the two big groups (Cranfield and ISR), Lancaster (1968, 1979) and Sparck Jones (1972, 1979) were among the notable people who preferred recall on the vertical axis; Keen (1981) also used recall on the vertical. Heine (1973) used precision on the vertical axis and overlaid contours of the Marczewski–Steinhaus distance, marking the first appearance of a combined measure on such a plot. Van Rijsbergen (1974) and, later, Van Rijsbergen and Croft (1975) also used precision on the vertical axis.

Recently, de Elía (2022) proposed a model of forecast performance that looked at the trade-off between false alarms and missed events from the perspective of expected utility. The core idea came from consideration of a performance diagram and modeling POD as a power of (1 − SR), so that
POD = (1 − SR)^r. (4)
Using that framework, de Elía et al. (2023) have proposed a skill score,
Q = ln(1 − SR)/ln(POD), (5)
which represents a trade-off between POD and SR such that the total losses associated with misclassifications are constant along the curves. Given its recent development, this score has not been widely used. A big unanswered question is how well the power-law formulation in (4) applies to real forecasting or classification problems. The behavior of Q is illustrated in Fig. 3 as a function of SR for constant values of υg, so that as SR increases, POD decreases. POD and SR cannot be less than υg, by definition. As SR or POD approaches υg, Q becomes very large, meaning that Q is very sensitive to small changes at the edges of the range of values.
Fig. 3. de Elía’s Q as a function of SR. Each line represents a constant value of υg. Note the asymptotic behavior of Q as SR approaches υg. The asymptote near SR = 1 is a result of POD approaching υg.
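A minimal sketch of (5) (the function name is ours), guarding the boundary values where the logarithms are undefined:

```python
import math

def q_score(pod, sr):
    """de Elia et al.'s Q, Eq. (5): ln(1 - SR) / ln(POD)."""
    if sr >= 1.0 or pod <= 0.0 or pod >= 1.0:
        raise ValueError("Q is undefined at the boundaries of POD and SR")
    return math.log(1.0 - sr) / math.log(pod)

print(round(q_score(0.7, 0.6), 2))  # 2.57 for these illustrative values
```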

c. The Pythagorean means scores

If one starts with the two quantities POD and SR (or recall and precision), combining the quantities in ways that differ from what might be considered a Venn diagram approach seems logical. It is often possible to use algebra to show that scores developed in the two approaches are mathematically related, but the paths taken to get there are very different. Scores have repeatedly been developed that are simple functions of the three Pythagorean means.

Reviewing, the arithmetic mean (AM) of any two positive quantities x and y, such as POD and SR, is given by
AM = (x + y)/2, (6)
the geometric mean (GM) is given by
GM = √(xy), (7)
and the harmonic mean (HM) is given by
HM = 2/(1/x + 1/y). (8)
The three means are equal if, for nonzero values of x and y, x = y. If they are not equal, then HM < GM < AM for the pair. All three of these means have been the basis of scores combining POD and SR, even though, in most cases, the authors have not acknowledged it. To illustrate the difference, we show AM, GM, and HM for curves passing through points where x = y on a performance diagram (Fig. 4).
Fig. 4. Curves of Pythagorean mean scores equal to 0.25, 0.5, and 0.75 that pass through points where POD = SR.
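The three means of (6)-(8) are trivial to compute; the sketch below (our own naming) also checks the HM ≤ GM ≤ AM ordering:

```python
import math

def pythagorean_means(x, y):
    """Arithmetic, geometric, and harmonic means of two positive numbers."""
    am = (x + y) / 2.0
    gm = math.sqrt(x * y)
    hm = 2.0 / (1.0 / x + 1.0 / y)
    return am, gm, hm

am, gm, hm = pythagorean_means(0.8, 0.4)  # e.g., POD = 0.8, SR = 0.4
assert hm <= gm <= am
print(round(am, 3), round(gm, 3), round(hm, 3))  # 0.6 0.566 0.533
```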

Kulczynski (1927) used the arithmetic mean of POD and SR5 as a measure to compare plant distributions in different parts of the Pieniny Mountains in Poland. McConnaughey (1964) developed a measure
McC = (POD)(SR) − (1 − POD)(1 − SR). (9)
Expanding the products shows that McC = POD + SR − 1, which reduces to 2 × (AM − 0.5). In effect, it converts the AM from a scale of 0 to 1 to a measure that runs from −1 to 1 without providing additional information. Although the AM is available in many software packages, often under Kulczynski’s name, it does not seem to have been used much in the meteorological literature.

The GM has also been rediscovered a number of times over the years in a variety of fields, in a process that represents an excellent example of the curiosities of naming. It appears that the first use is from Thomson (1916), in the testing of psychological theories. It was cited by Driver and Kroeber (1932) in archaeology. Their application was similar to Jaccard’s, in that they were comparing the distribution of artifacts in an effort to measure the similarity, or lack thereof, of different cultural groups. Interestingly, one of the names that Brusco et al. (2021) give to the score is Driver and Kroeber, despite the fact that they reference Thomson. If any weight is given to the “reputation” of the authors in the adoption of names, it is curious to note that Thomson is regarded as one of the pioneers of intelligence research and wrote his paper as part of a debate with Spearman, of Spearman’s rank correlation fame.

Another field that has given the GM a name that has passed to other fields has been marine biology. A common name for the GM is the Ochiai coefficient (Ochiai 1957). Ochiai credits Otsuka (1936),6 but via an intermediate publication by Hamai (1955). To add to the naming confusion, Howarth (2017) credits Otuka but attributes the score to another researcher with the same family name.

Doolittle (1885), in his response to Finley (1884), used the product of POD and SR, which is obviously the square of GM, in developing his inference ratio, which is intended to measure the skill of a forecaster by adjusting for the number of forecasts that would be correct by chance. In his work on the distribution of mollusks in the Miocene, Sorgenfrei (1959) also used GM2. His problem was similar to that of Jaccard, in that he was looking at the probability of species occurring in each of two different areas and simply multiplying the two probabilities. Wagner (1993) referred to this quantity as the “unbiased hit rate” in a review paper about evaluation of behavioral psychology experiments. Armistead (2013) brought the metric back into meteorology, proposing it as a method for evaluating categorical forecasts. Armistead references Doolittle’s inference ratio without mentioning that Doolittle had used Wagner’s unbiased hit rate in its development, although in a later paper (Armistead 2016), he refers to it as Doolittle’s unattributed joint probability measure.

The third Pythagorean mean, HM, is widely used. With some algebraic manipulation, it can be shown that υg is related to HM by
υg = HM/(2 − HM). (10)
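For readers who want the algebra behind (10), a short derivation of our own, writing HM of POD and SR in terms of the table elements:

```latex
\mathrm{HM} = \frac{2\,\mathrm{POD}\cdot\mathrm{SR}}{\mathrm{POD}+\mathrm{SR}}
            = \frac{2a}{2a+b+c},
\qquad
\frac{\mathrm{HM}}{2-\mathrm{HM}}
  = \frac{2a/(2a+b+c)}{(2a+2b+2c)/(2a+b+c)}
  = \frac{a}{a+b+c} = \upsilon_g .
```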

Scores that are identical to HM have been used in many contexts and are known by a variety of names. In ecological studies, Gleason (1920) used it without naming it, Sørensen (1948) referred to it as the quotient of similarity, and Dice (1945) referred to it as the coincidence index. Neither of the names that Sørensen and Dice used seems to have persisted, with the score instead being called the Sørensen index, Dice’s coefficient, or the Sørensen–Dice index. An identical score was presented at a conference by van Rijsbergen and is often referred to as the F score or F1 score in the machine learning community. Sasaki (2007) indicates that the origin of the name was apparently accidental, the score being confused with another F function that van Rijsbergen and Croft (1975) introduced as a “combination” function in the derivation of a measure of effectiveness E.

van der Maarel (1969) created a score that begins with HM as a measure of similarity but then includes an additional term that expresses an estimate of dissimilarity. In terms of the 2 × 2 table, van der Maarel’s index can be written as
VDM = (2a − b − c)/(2a + b + c) = HM − (b + c)/(2a + b + c). (11)
The use of this score has been limited.

d. Scores that weight error terms differently

Any score that can be expressed as a simple function that weights the components of a Pythagorean mean equally (including υg) implicitly values errors of false alarms and missed events equally. For many users, this is an unrealistic assumption. For instance, it is highly likely that failure to diagnose a disease with a high fatality rate is a greater threat to health than treating for the disease when it is not present. In that case, a patient would be more interested in high POD than in high SR. There are two popular scores that provide weights. Although, for an important selection of weights discussed below, they can be shown to be algebraically related to each other, they start, as before, from different perspectives.

The first of these scores comes from van Rijsbergen and Croft (1975) and is usually given the notation Fβ, even though, as noted before, they never used that notation. It is defined as
Fβ = (1 + β²)(POD)(SR)/[β²(SR) + POD]. (12)
van Rijsbergen’s goal was to measure the “effectiveness” of a system where a user attaches β times as much importance to POD as to SR, and the derivation comes from determining the ratio of trade-offs between the two that the hypothetical user is willing to take. It can be interpreted as a variant of HM, with the two terms being weighted. A key part of the derivation is the definition of the effectiveness E in terms of α,
E = 1 − 1/[α/SR + (1 − α)/POD], (13)
with Fβ = 1 − E and
α = 1/(1 + β²). (14)
In terms of the three elements of the 2 × 2 table, (12) can be rewritten as
Fβ = (1 + β²)a/[(1 + β²)a + b + β²c]. (15)
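A sketch (our own function names) confirming numerically that (12) and (15) are the same score:

```python
def f_beta_from_ratios(pod, sr, beta):
    """F-beta from POD and SR, Eq. (12)."""
    return (1 + beta**2) * pod * sr / (beta**2 * sr + pod)

def f_beta_from_counts(a, b, c, beta):
    """F-beta directly from the table elements, Eq. (15)."""
    return (1 + beta**2) * a / ((1 + beta**2) * a + b + beta**2 * c)

a, b, c = 28, 72, 23  # illustrative counts
pod, sr = a / (a + c), a / (a + b)
for beta in (0.5, 1.0, 2.0):
    assert abs(f_beta_from_ratios(pod, sr, beta) - f_beta_from_counts(a, b, c, beta)) < 1e-12
    print(beta, round(f_beta_from_counts(a, b, c, beta), 3))
```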
Tversky (1977) approached the problem of weighting by putting weights on the error terms in the 2 × 2 table, rather than on the ratios, in an effort to develop a score to compare objects or stimuli that contain (or do not contain) common features. The attractiveness of Tversky’s approach is that it allows users who know the relative costs of errors to have a measure even when they cannot express that knowledge in terms of POD and SR. In general, his index (or coefficient) can be written as
T = a/(a + γb + δc). (16)
For specific values of the coefficients on b and c, T becomes other well-known scores. For instance, if γ = δ = 1, T = υg. If γ = δ = 0.5, T = HM. The situation where γ + δ = 1 is particularly interesting. In that case,
Tγ = a/[a + γb + (1 − γ)c]. (17)
If we divide the numerator and denominator by γ, this becomes
Tγ = (1/γ)a/[(1/γ)a + b + ((1 − γ)/γ)c]. (18)
Equations (15) and (18) are similar, and Tγ = Fβ if γ = α from (14). Thus, a value of β = 2, which implies that POD is twice as important as SR, is equivalent to α = γ = 0.2, which implies that missed events c are weighted 4 times as much as false alarms b. Assuming that there are no costs associated with correct forecasts, the α = γ value is equivalent to the decision threshold for taking action for a user whose decision problem can be described by the misclassification cost ratio discussed by Roebber and Bosart (1996) and Wandishin and Brooks (2002). Kumler-Bonfanti et al. (2020) use (17) with γ = 0.3, equivalent to β = √(7/3) ≈ 1.53. In general, the impacts of weighting POD and SR (via β) or b and c (via γ) can be seen in Fig. 5.
Fig. 5. As in Fig. 4, but for Fβ = 0.5, 1.0, and 2.0 (γ = 0.8, 0.5, and 0.2, respectively). The Fβ = 1.0 curves for 0.25 and 0.75 are not shown since they are in Fig. 4.
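The special cases named above can be checked directly. In this sketch (our own naming), γ = δ = 1 recovers υg, γ = δ = 0.5 recovers HM, and γ = α with δ = 1 − γ matches Fβ:

```python
def tversky(a, b, c, gamma, delta):
    """Tversky's index, Eq. (16): a / (a + gamma*b + delta*c)."""
    return a / (a + gamma * b + delta * c)

a, b, c = 28, 72, 23                       # illustrative counts
csi = a / (a + b + c)                      # upsilon_g / CSI
hm = 2 * a / (2 * a + b + c)               # harmonic mean of POD and SR
assert abs(tversky(a, b, c, 1.0, 1.0) - csi) < 1e-12
assert abs(tversky(a, b, c, 0.5, 0.5) - hm) < 1e-12

beta = 2.0
alpha = 1.0 / (1.0 + beta**2)              # Eq. (14)
f_beta = (1 + beta**2) * a / ((1 + beta**2) * a + b + beta**2 * c)  # Eq. (15)
assert abs(tversky(a, b, c, alpha, 1.0 - alpha) - f_beta) < 1e-12
```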

e. Impacts of score selection on apparent forecast performance

Brooks and Correia (2018) looked at long-term performance metrics for National Weather Service tornado warnings. One of the points of interest was the change in performance that took place in 2012–13 as a result of an apparent increase in emphasis on false alarms [equivalent to lowering b in (15) or increasing γ in (17)]. POD decreased and SR increased in that interval compared to the previous years. If we wish to consider some combination of the two scores to estimate an overall impact of the change, the impression we get depends on what combination is chosen. Time series of the three Pythagorean means and of Fβ for β = 0.5 and 2, computed from the results of Brooks and Correia updated through 2022, provide a range of impressions (Fig. 6). (Note that, as seen before, HM is related directly to υg.)

Fig. 6. Tornado warning performance by year [updated from Brooks and Correia (2018)], as estimated by the Pythagorean means and Fβ = 0.5 and 2.0 (equivalent to Tγ = 0.8 and 0.2, respectively).

HM shows less of an impact on apparent performance in 2012–13 than the other Pythagorean means and, in fact, shows increases back to the highest values ever seen by the late 2010s/early 2020s. GM and AM show large decreases with the increased emphasis on false alarms, and neither measure has yet returned to its previous values. For the weighted scores, the one that values POD more highly shows large decreases after 2011, while the one that values POD less shows continuous increases through to the present. We offer no judgment as to which of the scores is most “correct.” Our point is that the choice of performance measure, even for relatively simple evaluation exercises, can have a huge impact on how one judges performance. Optimizing to one metric may lead to much poorer performance by other metrics. This is likely to be particularly important in cases where the costs of different kinds of errors are very different.

Often, there is a disconnect between the loss function used for training a machine learning model and the verification metrics used for evaluating it. Lagerquist and Ebert-Uphoff (2022) explored using verification metrics such as the fractions skill score (Roberts and Lean 2008) and CSI as loss functions for training neural networks. One result was that using CSI as the loss function produced an overprediction bias, which is unsurprising given that CSI is a biased metric (i.e., one can improve CSI by overpredicting or underpredicting, a practice known as hedging). Loss functions often require mathematical properties such as continuity, differentiability, and convexity, while verification metrics are often more interpretable. A machine learning practitioner should carefully select a loss function appropriate for the decision task, and more work is required to determine which loss functions are appropriate for which tasks.
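To make the loss-function point concrete, here is a minimal sketch of a “soft” CSI loss of the general kind used in such studies, in which probabilistic predictions stand in for the yes/no counts so that the expression stays differentiable. The formulation and names are our own illustration, not the implementation of Lagerquist and Ebert-Uphoff (2022):

```python
import numpy as np

def soft_csi_loss(prob, target, eps=1e-8):
    """1 - CSI with expected counts: probabilities replace the yes/no counts
    so the loss is differentiable with respect to the predictions."""
    a = np.sum(prob * target)          # expected hits
    b = np.sum(prob * (1 - target))    # expected false alarms
    c = np.sum((1 - prob) * target)    # expected misses
    return 1.0 - a / (a + b + c + eps)

prob = np.array([0.9, 0.8, 0.2, 0.6])
target = np.array([1.0, 1.0, 0.0, 0.0])
print(round(soft_csi_loss(prob, target), 3))  # 0.393
```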

3. Concluding thoughts

First, we strongly recommend that authors show the contingency table they are using and define the basic terms they use in relation to that table. There are enough options available in the literature that it is very easy to confuse even experienced users. We recommend that interdisciplinary researchers primarily use the terms that are common in the field in which they are publishing, such as υg or “CSI” when publishing in a meteorological journal, to help the primary audience understand quickly. If another term for a metric is common in the discipline from which the researchers come, and may be their standard choice, it is extremely useful to include that term in a parenthetical remark or footnote. Marzban (1998) provides an example of this when he mentions that a “contingency table” is sometimes referred to as a “confusion matrix.” This practice helps readers from one discipline who are working through papers in a related, but not core, area to learn the language of the other discipline. It also introduces the term to them, so that they can find what the unfamiliar discipline has done in analyzing their problem and perhaps learn new techniques from it. As an example, biogeography, in its study of similarities and differences in the distribution of organisms in different situations, has produced a wide body of literature on classification metrics.

Much of what we have discussed has involved the rediscovery of metrics. As part of that process, we have seen the fruits of what might be considered less-than-complete literature reviews. Although we believe we have found many of the most important early works, we have, if nothing else, learned not to have the hubris to think we have been complete. We apologize to anyone who may have a favorite “old paper” that found one of these metrics and that we missed.

1 Although Jaccard (1901) is clearly the first use, some authors have credited Jaccard (1902, 1907, 1912) as the origin. The 1912 reference is an English translation of the 1907 reference.

2 Mason (1989) and Schaefer (1990) incorrectly credit Palmer and Allen (1949) as the source of “threat score.” Larue and Younkin (1961) do not reference Palmer and Allen.

3 Brier and Allen is the standard reference for this term, but they cite the unpublished U.S. Army Air Forces (1944) technical memo as the source. It is likely, based on the terminology used in the abstract of U.S. Army Air Forces (USAAF), that one or both of them are responsible for USAAF.

4 Multiple reports from both the Cleverdon and Salton groups are available online, along with a number of other documents from the history of information retrieval, from the Special Interest Group on Information Retrieval at https://www.sigir.org/resources/museum/.

5 For simplicity, we use POD and SR in the discussion of the Pythagorean means instead of recall and precision, or any other pair of names for the same quantities.

6 The name is sometimes spelled Otsuka.

Acknowledgments.

This material is based upon work supported by the Joint Technology Transfer Initiative Program within the NOAA/OAR Weather Program Office under Award NA22OAR4590171. Funding was also provided by NOAA/Office of Oceanic and Atmospheric Research under the NOAA-University of Oklahoma Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce. Allan Murphy inspired interest in the history of scoring metrics, and discussions between the lead author and him in the 1990s were useful in clarifying credit for discovery or rediscovery, even if Allan might not agree with all we have done here. More recently, exposure to the work done by AI/ML researchers seeking to improve forecasts and an awareness of the different language they use has highlighted opportunities for growth in multidisciplinary work. We thank Bob Glahn for providing a copy of Palmer and Allen (1949), which is in the process of being cataloged at the National Centers for Environmental Information after being lost for decades. Since that provision, Bob has unfortunately passed away, and we dedicate this paper to his memory, as well as that of Allan Murphy. We thank Naoko Sakaeda for translating a portion of the Japanese papers. We thank Amanda Schilling and Tracy Chapman, scientific librarians at the National Weather Center, for their invaluable assistance in tracking down some of the more obscure references, particularly Kulczynski (1927), Tanimoto (1958), and McConnaughey (1964). We hope that they enjoyed the hunt as much as we did.

Data availability statement.

The data and Python code used to generate the figures can be found at https://github.com/monte-flora/verification_diagrams. To encourage adoption and exploration of other 2 × 2 contingency-table metrics, this package contains all 71 scores from Brusco et al. (2021).

REFERENCES

  • Amirkhani, D., and A. Bastanfard, 2021: An objective method to evaluate exemplar‐based inpainted images quality using Jaccard index. Multimedia Tools Appl., 80, 26 19926 212, https://doi.org/10.1007/s11042-021-10883-3.

    • Search Google Scholar
    • Export Citation
  • Armistead, T. W., 2013: H. L. Wagner’s unbiased hit rate and the assessment of categorical forecasting accuracy. Wea. Forecasting, 28, 802814, https://doi.org/10.1175/WAF-D-12-00047.1.

    • Search Google Scholar
    • Export Citation
  • Armistead, T. W., 2016: Misunderstood and unattributed: Revisiting M. H. Doolittle’s measures of association, With a note on Bayes’ theorem. Amer. Stat., 70, 6373, https://doi.org/10.1080/00031305.2015.1086686.

    • Search Google Scholar
    • Export Citation
  • Barnes, L. R., E. C. Gruntfest, M. H. Hayden, D. M. Schultz, and C. Benight, 2007: False alarms and close calls: A conceptual model of warning accuracy. Wea. Forecasting, 22, 11401147, 10.1175/WAF1031.1; Corrigendum, 24, 1452–1454, 10.1175/2009WAF2222300.1.

    • Search Google Scholar
    • Export Citation
  • Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. Compendium of Meteorology, T. Malone, Ed., Amer. Meteor. Soc., 841–848.

  • Brooks, H. E., and J. Correia Jr., 2018: Long-term performance metrics for National Weather Service tornado warnings. Wea. Forecasting, 33, 15011511, https://doi.org/10.1175/WAF-D-18-0120.1.

    • Search Google Scholar
    • Export Citation
  • Brusco, M., J. D. Cradit, and D. Steinley, 2021: A comparison of 71 binary similarity coefficients: The effect of base rates. PLOS ONE, 16, e0247751, https://doi.org/10.1371/journal.pone.0247751.

    • Search Google Scholar
    • Export Citation
  • Cheng, Y., P. A. Pérez-Díaz, K. V. Petrides, and J. Li, 2023: Monte Carlo simulation with confusion matrix paradigm – A sample of internal consistency indices. Front. Psychol., 14, 1298534, https://doi.org/10.3389/fpsyg.2023.1298534.

    • Search Google Scholar
    • Export Citation
  • Cleverdon, C., and D. R. Swanson, 1965: The Cranfield hypotheses. Libr. Quart., 35, 121124, https://doi.org/10.1086/619319.

  • Cleverdon, C. W., 1962: Report on the testing and analysis of an investigation into the comparative efficiency of indexing systems. ASLIB Cranfield Research ProjectCranfield Aeronautics Laboratory, 322 pp.

  • Consonni, V., and R. Todeschini, 2012: New similarity coefficients for binary data. MATCH Commun. Math. Comp. Chem., 68, 581592.

  • de Elía, R., 2022: The false alarm/surprise trade-off in weather warnings systems: An expected utility theory perspective. Environ. Syst. Decisions, 42, 450461, https://doi.org/10.1007/s10669-022-09863-1.

    • Search Google Scholar
    • Export Citation
  • de Elía, R., J. J. Ruiz, V. Francce, P. Lohigorry, M. Saucedo, M. Menalled, and D. D’Amen, 2023: Early warning systems and end-user decision-making: A risk formalism tool to aid communication and understanding. Risk Anal., 15, 115, https://doi.org/10.1111/risa.14221.

    • Search Google Scholar
    • Export Citation
  • Dice, L. R., 1945: Measures of the amount of ecologic association between species. Ecology, 26, 297302, https://doi.org/10.2307/1932409.

    • Search Google Scholar
    • Export Citation
  • Donaldson, R. J., M. J. Kraus, and R. M. Dyer, 1975a: Operational benefits of meteorological Doppler radar. AFCRL Tech. Rep. AFCRL-TR-75-0103, 25 pp., https://apps.dtic.mil/sti/trecms/pdf/ADA010434.pdf.

  • Donaldson, R. J., R. M. Dyer, and M. J. Kraus, 1975b: An objective evaluator of techniques for predicting severe weather events. Preprints, Ninth Conf. on Severe Local Storms, Norman, OK, Amer. Meteor. Soc., 321–326.

  • Doolittle, M. H., 1885: The verification of predictions. Bull. Philos. Soc. Wash., 7, 122127.

  • Doswell, C. A., R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. Wea. Forecasting, 5, 576585, https://doi.org/10.1175/1520-0434(1990)005<0576:OSMOSI>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Driver, H. E., and A. L. Kroeber, 1932: Quantitative expression of cultural relationships. Univ. Calif. Publ. Amer. Archaeol. Ethnology, 31, 211256.

    • Search Google Scholar
    • Export Citation
  • Finley, J. P., 1884: Tornado predictions. Amer. Meteor. J., 1, 8588.

  • Gilbert, G. K., 1884: Finley’s tornado predictions. Amer. Meteor. J., 1, 166172.

  • Gleason, H. A., 1920: Some applications of the quadrat method. Bull. Torrey Bot. Club, 47, 2133, https://doi.org/10.2307/2480223.

  • Hamai, I., 1955: Stratification of community by means of “community coefficient” (continued) (most of text in Japanese). Japan J. Ecol., 5, 4145, https://doi.org/10.18960/seitai.4.4_171.

    • Search Google Scholar
    • Export Citation
  • Heine, M. H., 1973: Distance between sets as an objective measure of retrieval effectiveness. Inf. Storage Retr., 9, 181198, https://doi.org/10.1016/0020-0271(73)90066-1.

    • Search Google Scholar
    • Export Citation
  • Heydarian, M., T. E. Doyle, and R. Samavi, 2022: MLCM: Multi-label confusion matrix. IEEE Access, 10, 19 08319 095, https://doi.org/10.1109/ACCESS.2022.3151048.

    • Search Google Scholar
    • Export Citation
  • House, D. C., 1960: Remarks on the optimum spacing of upper air observations. Mon. Wea. Rev., 88, 97100, https://doi.org/10.1175/1520-0493(1960)088<0097:ROTOSO>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Howarth, R. J., 2017: Dictionary of Mathematical Geosciences: With Historical Notes. Springer, 893 pp., https://doi.org/10.1007/978-3-319-57315-1.

  • Jaccard, P., 1901: Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines (in French). Bull. Soc. Vaudoise Sci. Nat., 37, 241272, https://doi.org/10.5169/seals-266440.

    • Search Google Scholar
    • Export Citation
  • Jaccard, P., 1902: Lois de distribution florale dans le zone alpine (in French). Bull. Soc. Vaudoise Sci. Nat., 38, 69130, https://doi.org/10.5169/seals-266762.

    • Search Google Scholar
    • Export Citation
  • Jaccard, P., 1907: La distribution de la flore dans la zone alpine (in French). Rev. Gen. Sci., 18, 961967.

  • Jaccard, P., 1912: The distribution of the flora in the alpine zone. New Phytol., 11, 3750, https://doi.org/10.1111/j.1469-8137.1912.tb05611.x.

    • Search Google Scholar
    • Export Citation
  • Keen, M., 1966: Measures and averaging methods used in performance testing of indexing systems. ASLIB-Cranfield Research Project, 62 pp.

  • Keen, M., 1968: Evaluation parameters. Information Storage and Retrieval: Scientific Report No. ISR-13, G. Salton, Ed., National Science Foundation, II-1–II-67.

  • Keen, M., 1981: Laboratory tests of manual systems. Information Retrieval Experiment, K. Sparck Jones, Ed., Butterworths, 136–155.

  • Kent, A., M. M. Berry, F. U. Luehrs Jr., and J. W. Perry, 1955: Machine literature searching VIII. Operational criteria for designing information retrieval systems. Amer. Doc., 6, 93101, https://doi.org/10.1002/asi.5090060209.

    • Search Google Scholar
    • Export Citation
  • Kulczynski, M., 1927: Zespoły roślin w Pieninach—Die Pflanzenassoziationen der Pieninen. Bull. Int. Acad. Pol. Sci. Lett., 2, 57203.

    • Search Google Scholar
    • Export Citation
  • Kumler-Bonfanti, C., J. Stewart, D. Hall, and M. Govett, 2020: Tropical and extratropical cyclone detection using deep learning. J. Appl. Meteor. Climatol., 59, 19711985, https://doi.org/10.1175/JAMC-D-20-0117.1.

    • Search Google Scholar
    • Export Citation
  • Lagerquist, R., and I. Ebert-Uphoff, 2022: Can we integrate spatial verification methods into neural network loss functions for atmospheric science? Artif. Intell. Earth Syst., 1, e220021, https://doi.org/10.1175/AIES-D-22-0021.1.

    • Search Google Scholar
    • Export Citation
  • Lancaster, F. W., 1968: Evaluation of the MEDLARS demand search service. U. S. Department of Health, Education and Welfare, Public Health Service Rep., 288 pp.

  • Lancaster, F. W., 1979: Information Retrieval Systems: Characteristics, Testing and Evaluation. 2nd ed. John Wiley and Sons, 381 pp.

  • Larue, J. A., and R. J. Younkin, 1961: Weather Note: The Middle Mississippi Valley hydrometeorological storm of May 4–9, 1961. Mon. Wea. Rev., 89, 555559, https://doi.org/10.1175/1520-0493(1961)089<0555:WNTMMV>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Leydesdorff, L., 2008: On the normalization and visualization of author co-citation data: Salton’s Cosine versus the Jaccard index. J. Amer. Soc. Inf. Sci. Technol., 59, 7785, https://doi.org/10.1002/asi.20732.

    • Search Google Scholar
    • Export Citation
  • Marczewski, E., and H. Steinhaus, 1958: On a certain distance of sets and the corresponding distance of functions. Colloq. Math., 6, 319327, https://doi.org/10.4064/cm-6-1-319-327.

    • Search Google Scholar
    • Export Citation
  • Marzban, C., 1998: Scalar measures of performance in rare-event situations. Wea. Forecasting, 13, 753763, https://doi.org/10.1175/1520-0434(1998)013<0753:SMOPIR>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Mason, I. B., 1989: Dependence of the critical success index on sample climate and threshold probability. Aust. Meteor. Mag., 37, 7581.

    • Search Google Scholar
    • Export Citation
  • McConnaughey, B. H., 1964: The Determination and Analysis of Plankton Communities. Lembaga Penelitian Laut, 40 pp.

  • Miller, G. A., and P. A. Nicely, 1955: An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Amer., 27, 338352, https://doi.org/10.1121/1.1907526.

    • Search Google Scholar
    • Export Citation
  • Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. Wea. Forecasting, 11, 320, https://doi.org/10.1175/1520-0434(1996)011<0003:TFAASE>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Ochiai, A., 1957: Zoogeographic studies on the soleoid fishes found in Japan and its neighboring regions. Bull. J. Japan Soc. Fish. Sci., 22, 526530, https://doi.org/10.2331/suisan.22.526.

    • Search Google Scholar
    • Export Citation
  • Olson, R. H., 1965: On the use of Bayes’ theorem in estimating false alarm rates. Mon. Wea. Rev., 93, 557558, https://doi.org/10.1175/1520-0493(1965)093<0557:OTUOBT>2.3.CO;2.

    • Search Google Scholar
    • Export Citation
  • Otsuka, Y., 1936: The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan. Bull. Biogeogr. Soc. Japan, 6, 165170.

    • Search Google Scholar
    • Export Citation
  • Palmer, W. C., and R. A. Allen, 1949: Note on the accuracy of forecasts concerning the rain problem. U. S. Weather Bureau Manuscript, 2 pp.

  • Pearson, K., 1904: Mathematical Contributions to the Theory of Evolution. XIII: On the Theory of Contingency and its Relation to Association and Normal Correlation. Drapers’ Company Research Memoirs, Biometric Series, Vol. I, Dulau and Co, 35 pp.

  • Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 28252830.

  • Peterson, W. W., and T. G. Birdsall, 1953: The theory of signal detectability. Part I: The general theory. Electronic Defense Group, Department of Electrical Engineering, University of Michigan, Tech Rep. 13, 156 pp.

  • Phillips, G., and Coauthors, 2024: Setting nutrient boundaries to protect aquatic communities: The importance of comparing observed and predicted classifications using measures derived from a confusion matrix. Sci. Total Environ., 912, 168872, https://doi.org/10.1016/j.scitotenv.2023.168872.

    • Search Google Scholar
    • Export Citation
  • Riehl, K., M. Neunteufel, and M. Hemberg, 2023: Hierarchical confusion matrix for classification performance evaluation. J. Roy. Stat. Soc., 72C, 13941412, https://doi.org/10.1093/jrsssc/qlad057.

    • Search Google Scholar
    • Export Citation
  • Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 7897, https://doi.org/10.1175/2007MWR2123.1.

    • Search Google Scholar
    • Export Citation
  • Rocchio, J. J., 1964: Performance indices for document retrieval systems. Information Storage and Retrieval: Scientific Report No. ISR-8, G. Salton, Ed., National Science Foundation, III-1–III-18.

  • Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601608, https://doi.org/10.1175/2008WAF2222159.1.

    • Search Google Scholar
    • Export Citation
  • Roebber, P. J., and L. F. Bosart, 1996: The complex relationship between forecast skill and forecast value: A real-world analysis. Wea. Forecasting, 11, 544559, https://doi.org/10.1175/1520-0434(1996)011<0544:TCRBFS>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Ruuska, S., W. Hämäläinen, S. Kajava, M. Mughal, P. Matilaine, and J. Mononen, 2018: Evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle. Behav. Processes, 148, 5662, https://doi.org/10.1016/j.beproc.2018.01.004.

    • Search Google Scholar
    • Export Citation
  • Salton, G., 1964: The evaluation of automatic storage retrieval procedures—selected test results using the SMART system. Information Storage and Retrieval: Scientific Report No. ISR-8, G. Salton, Ed., National Science Foundation, IV-1–IV-36.

  • Sasaki, Y., 2007: The truth of the F-measure. Lecture Notes, 5 pp., https://www.cs.odu.edu/∼mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf.

  • Schaefer, J. T., 1990: The critical success index as an indicator of Warning skill. Wea. Forecasting, 5, 570575, https://doi.org/10.1175/1520-0434(1990)005<0570:TCSIAA>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Scheller, I. F., K. Lutz, C. Mertes, V. A. Yépez, and J. Gagneur, 2023: Improved detection of aberrant splicing with FRASER 2.0 and the intron Jaccard index. Amer. J. Hum. Genet., 110, 20562067, https://doi.org/10.1016/j.ajhg.2023.10.014.

    • Search Google Scholar
    • Export Citation
  • Sørensen, T., 1948: A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K. Dan. Vidensk. Selsk. Biol. Skr., 5, 134.

    • Search Google Scholar
    • Export Citation
  • Sorgenfrei, T., 1959: Molluscan assemblages from the marine Middle Miocene of South Jutland and their environments. Dan. Geol. Unders., 79, 356503, https://doi.org/10.34194/raekke2.v79.6869.

    • Search Google Scholar
    • Export Citation
  • Sparck Jones, K., 1972: A statistical interpretation of term specificity and its application in retrieval. J. Doc., 28, 1121, https://doi.org/10.1108/eb026526.

    • Search Google Scholar
    • Export Citation
  • Sparck Jones, K., 1979: Experiments in relevance weighting of search terms. Inf. Process. Manage., 15, 133144, https://doi.org/10.1016/0306-4573(79)90060-8.

    • Search Google Scholar
    • Export Citation
  • Stensrud, D. J., and M. S. Wandishin, 2000: The correspondence ratio in forecast evaluation. Wea. Forecasting, 15, 593602, https://doi.org/10.1175/1520-0434(2000)015<0593:TCRIFE>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Swets, J. A., 1963: Information retrieval systems. Science, 141, 245250, https://doi.org/10.1126/science.141.3577.245.

  • Tanimoto, T. T., 1958: An elementary mathematical theory of classification and prediction. Internal IBM Tech. Rep., 10 pp.

  • Tanner, W. P., Jr., and J. A. Swets, 1954: A decision-making theory of visual detection. Psychol. Rev., 61, 401–409, https://doi.org/10.1037/h0058700.

  • Thomson, G. H., 1916: A hierarchy without a general factor. Br. J. Psychol., 8, 271–281.

  • Tversky, A., 1977: Features of similarity. Psychol. Rev., 84, 327–352, https://doi.org/10.1037/0033-295X.84.4.327.

  • van der Maarel, E., 1969: On the use of ordination models in phytosociology. Vegetatio, 19, 21–46, https://doi.org/10.1007/BF00259002.

  • Van Rijsbergen, C. J., 1974: Foundation of evaluation. J. Doc., 30, 365–373, https://doi.org/10.1108/eb026584.

  • Van Rijsbergen, C. J., and W. B. Croft, 1975: Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Inf. Process. Manage., 11, 171–182, https://doi.org/10.1016/0306-4573(75)90006-0.

  • Venn, J., 1880: I. On the diagrammatic and mechanical representation of propositions and reasonings. London, Edinburgh, Dublin Philos. Mag. J. Sci., 10, 1–18, https://doi.org/10.1080/14786448008626877.

  • Wagner, H. L., 1993: On measuring performance in category judgment studies of nonverbal behavior. J. Nonverbal Behav., 17, 3–28, https://doi.org/10.1007/BF00987006.

  • Wandishin, M. S., and H. E. Brooks, 2002: On the relationship between Clayton’s skill score and expected values for forecasts of binary events. Meteor. Appl., 9, 455–459, https://doi.org/10.1017/S1350482702004085.

  • Warrens, M. J., and H. van der Hoef, 2022: Understanding the adjusted Rand index and other partition comparison indices based on counting object pairs. J. Classif., 39, 487–509, https://doi.org/10.1007/s00357-022-09413-z.

  • Wilks, Y., D. Fass, C.-M. Guo, J. E. McDonald, T. Plate, and B. M. Slator, 1990: Providing machine tractable dictionary tools. Mach. Transl., 5, 99–154, https://doi.org/10.1007/BF00393758.

  • Yerushalmy, J., 1947: Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep., 62, 1432–1449, https://doi.org/10.2307/4586294.

  • Fig. 1. Timeline of the introduction of scores of the form υg.

  • Fig. 2. Reproduction of Cleverdon (1962) Table 7.6. The legend is as used in Cleverdon, where different “Relevance” lines are associated with different search efforts. The dashed lines are from Cleverdon.

  • Fig. 3. de Elía’s Q as a function of SR. Each line represents a constant value of υg. Note the asymptotic behavior of Q as SR approaches υg. The right-side asymptote near SR = 1 results from POD approaching υg.