1. Introduction and background
The problems of forecast evaluation, classification, and comparative relationships between groups of concepts or things share some common characteristics. We can think of them as being concerned with the similarity or dissimilarity between two (or more) different descriptions of some “reality.” In weather forecasting, quantifying the similarity between the forecasts and the weather associated with those forecasts (the observations) describes the quality of the forecasts (Murphy 1993). When looking at the cultural traits of different groups of people (Driver and Kroeber 1932) or the plant species in different areas (Jaccard 1901), we can compare the properties of one group to those of another to see how alike they are. With similar basic problems facing many fields, multiple metrics that quantify the same properties have been developed and, often, given new names. Identical metrics carrying a bewildering array of names can lead to confusion, especially in interdisciplinary work, where researchers from different backgrounds can end up, essentially, speaking foreign languages to each other. It is useful, however, to see how different fields have thought about solving similar problems and dealing with the issues that arise within them. In many cases, the applicability of a term developed in one field is obvious, but when that term is adopted by another field, the connection is not so obvious. In this paper, we focus on some illustrative examples from a seemingly simple problem, the so-called 2 × 2 problem, and make some recommendations for improved, if not best, practices to ease the communication challenges.
2. The 2 × 2 contingency table and scoring rules
As a starting point, we begin with the classic 2 × 2 contingency table, introduced by Pearson (1904). It is illustrative to note that this is an example of something that has been given a different name in a specific context. Miller and Nicely (1955) introduced the term “confusion matrix” for the same thing as a contingency table in their study of how listeners confused different consonant sounds in English. The aptness of the name is evident in that context, as it is in the study of cattle feeding behavior by Ruuska et al. (2018), but several other areas have adopted the term, such as psychology (Cheng et al. 2023), machine learning (Heydarian et al. 2022), statistical classification (Riehl et al. 2023), and environmental science (Phillips et al. 2024), where it is not so obviously connected to the original context. This does not mean the usage is wrong, but the additional terminology can confuse new users. For the simplest possible forecast problem of a yes/no forecast and a yes/no observation, we get a 2 × 2 table (Table 1).
Table 1. The 2 × 2 contingency table for forecasts and observations. The entries are a = event forecast and observed (hit), b = event forecast but not observed (false alarm), c = event observed but not forecast (miss), and d = event neither forecast nor observed (correct nonevent).
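For readers who want to connect the table entries to data, the counts a, b, c, and d can be tallied directly from paired yes/no forecasts and observations. The short Python sketch below is a minimal illustration (the forecast and observation values are invented and the variable names are ours); it uses the confusion_matrix function from scikit-learn, a package that reappears later in connection with several of the scores discussed here.

```python
# Minimal sketch: tally the Table 1 entries from paired yes/no (1/0)
# forecasts and observations. The data here are invented for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

obs = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # observed event (1) or nonevent (0)
fcst = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # forecast event (1) or nonevent (0)

# confusion_matrix orders rows by the true label and columns by the predicted
# label, so for 0/1 labels the four entries map onto Table 1 as unpacked below.
(d, b), (c, a) = confusion_matrix(obs, fcst)
print(f"a (hits) = {a}, b (false alarms) = {b}, c (misses) = {c}, d (correct nonevents) = {d}")
```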
The first use of contingency-table-style elements in meteorology was in the tornado forecast experiment of Finley (1884). The history of the early days after Finley’s experiment, the measures derived from the table that were discovered in the immediate aftermath, and the subsequent rediscoveries of those scores are described in Murphy (1996). Researchers discussed several fundamental issues in the first years after Finley, such as the quality of observations, the possible difference in importance of different kinds of forecast errors, and what was an appropriate baseline for how many forecasts should have been correct by chance.
One of the other issues discussed in some of the early papers was what we might now think of as the correct forecast of a “nonevent” (e.g., no tornado warning was issued and no tornado occurred), the “d” element of Table 1. In some fields, such as geobotany, this term has no meaning. In comparing the species of plants in different Swiss valleys, Jaccard (1901) ignored the existence of d completely. This was a sensible thing to do, given that d would be “all of the plants that don’t exist in either of two valleys.” Putting bounds on what constitutes “all” is challenging. Palmer and Allen (1949), in the context of weather forecasts, excluded d by framing the forecasts as being associated with threats, meaning that the d term would be a forecast for a nonthreat that did not happen. They assumed that those cases were “less difficult, perhaps less significant forecasting problem(s).” As a result of this issue, our primary emphasis will be on scores derived from the a, b, and c terms of the table. Despite the seeming simplicity, Brusco et al. (2021) discuss 18 different metrics relying on those three terms, and even that list is not exhaustive. In total, Brusco et al. (2021) compiled 71 metrics derived from the 2 × 2 contingency table and used synthetic data to examine which scores behave similarly under a variety of conditions, producing clusters of metrics with similar behavior. Other lists of metrics derived from the 2 × 2 table have been compiled by Murphy (1996), Marzban (1998), and Warrens and van der Hoef (2022).
Researchers have addressed two philosophical areas, leading to different metrics:

1) Should we measure similarity or dissimilarity between our two sets?

2) Should we consider the individual elements (a, b, c) or composite statistics, such as the total number in each category or ratios [e.g., a/(a + b)]?
As a result of addressing area 1, many scores are complements of other scores, such that the two scores sum to one. Many of the dissimilarity metrics are referred to as a “distance,” so that a perfect score is 0, while most similarity metrics are called an “index” or “score,” with a perfect value of 1. As for area 2, although all of the scores can obviously be written in terms of the individual elements, many are written as combinations of the ratios, so the underlying approach makes some scores more accessible and others less so.
a. Gilbert’s ratio of verification and its copies
Donaldson et al. (1975a,b) also produced the same metric for severe thunderstorm forecasting by looking at the individual elements a, b, and c. They describe their choice of (1) as “arbitrary” and do not reference Gilbert, Palmer and Allen, or Larue and Younkin. They gave the score the name “critical success index” (CSI), which has become the most common name in use in meteorology and is the term used in the official verification of severe thunderstorm and tornado warnings by the U.S. National Weather Service.
One of the most common names for υg outside of meteorology is the Jaccard index, which came out of the field that might now be referred to as biogeography. It is widely used in medical research (e.g., Scheller et al. 2023), computer imaging (Amirkhani and Bastanfard 2021), information science (Leydesdorff 2008), and machine learning, as seen in its use in Scikit-learn in Python (Pedregosa et al. 2011). This score was developed by the botanist Jaccard (1901)1 as the coefficient de communauté (community coefficient) in his work comparing the distribution of plant species in valleys in Switzerland. Jaccard followed a similar approach to Gilbert, in that he compiled lists of species in each valley and then did pairwise comparisons between two valleys at a time. In Gilbert’s terms, the list for one valley could be considered as the “forecast” for the “event” species in another valley. From the 2 × 2 table perspective, the d entry, which would be the plant species that appeared in neither valley of a pair, was problematic for Jaccard and, as a result, it was ignored.
The other primary development of the same score as υg comes from Tanimoto (1958). In contrast to Gilbert and Jaccard, Tanimoto developed his index from a set theory approach, with the score being the intersection of two datasets over their union, commonly known as intersection over union (IoU) today (Wilks et al. 1990). In a spatial perspective, this can be easily seen in the diagrams of Venn (1880). Larue and Younkin (1961) called it the threat score,2 and Stensrud and Wandishin (2000) referred to it as the correspondence ratio for areal forecasts, particularly with application to ensemble forecasts. A timeline showing the introduction of these different names for the scores is given in Fig. 1. The applicability to both identification of individual events and spatial fields was highlighted by Mason (1989). Marczewski and Steinhaus (1958) also used a set theory approach to define a dissimilarity metric between two sets that is 1 − υg. They show that this metric has all of the properties of a mathematical distance.
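Whatever name is attached to it, the score discussed in this subsection is computed the same way from the Table 1 entries as in (1), υg = a/(a + b + c), and the Marczewski–Steinhaus distance is simply its complement. A minimal sketch (the function name and example counts are ours) follows.

```python
def gilbert_ratio(a, b, c):
    """Gilbert's ratio of verification, a/(a + b + c), also known as the
    critical success index, threat score, Jaccard index, and IoU."""
    return a / (a + b + c)

a, b, c = 3, 1, 1                 # hits, false alarms, misses (illustrative counts)
v_g = gilbert_ratio(a, b, c)      # 0.6
ms_distance = 1.0 - v_g           # Marczewski-Steinhaus distance, a true metric
print(v_g, ms_distance)
```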
b. Ratios from the 2 × 2 table as the basis of scoring
Another approach to developing scores from the three elements is to consider ratios formed from terms in the rows and columns of Table 1. Specifically, many researchers have looked at a/(a + c) and a/(a + b), and their complements, c/(a + c) and b/(a + b), creating relationships between those ratios. Again, a plethora of names has been applied to those ratios.
In meteorology, the most common name for a/(a + c) is the probability of detection (POD), from Donaldson et al. (1975a,b). Prefigurance (Brier and Allen 1951)3 has also been used, but POD dominates in the modern literature. Its origins in meteorology appear to come from House (1960), who was studying what density of observational stations was needed to observe different atmospheric features and estimated the probability of detecting a feature of interest given the station spacing. It is unclear what impact House had on the choice by Donaldson et al.
In the machine learning community, recall is the most common descriptor of a/(a + c). It comes out of information retrieval and was first used by Kent et al. (1955) in the context of literature searching, with the name “recall factor.” They wrote, “This fraction, which we shall term the ‘recall factor,’ measures the proportion of pertinent documents to which the information retrieval system directed attention when a given search was conducted.” In effect, its name comes from the consideration of how many of the appropriate documents are recalled from the database. Sensitivity is another common name for this expression, coming out of the medical diagnostic community (Yerushalmy 1947) and deriving from measuring how sensitive a diagnostic test was in responding to a particular condition. Dice (1945) simply called both a/(a + b) and a/(a + c) “association indices,” with notation added to indicate which one was being referred to.
Although postagreement (U.S. Army Air Forces 1944, unpublished manuscript) was suggested in meteorology for the term a/(a + b), it has not been widely used. Neither has the probability of hits, suggested by Doswell et al. (1990). At this time, the most common term is the success ratio (SR) (Schaefer 1990). For reasons that are not particularly clear, meteorology has tended to focus on the complement of the SR: the fraction of forecasts of an event for which the event did not occur. The b term is often labeled “false alarms,” so that the b/(a + b) term has been called the false alarm ratio (Donaldson et al. 1975a,b) and the false alarm rate (Olson 1965), among other names. Given that the term b/(b + d) has also been called the false alarm rate, this has caused great confusion. Barnes et al. (2007) give an outstanding review of the conflicting usage.
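Because the false alarm terminology is such a persistent source of confusion, it may help to spell the four ratios out explicitly. The sketch below is ours (the function name is illustrative); it simply evaluates the ratios from the Table 1 entries, keeping the false alarm ratio b/(a + b) and the quantity b/(b + d), which is the one also called the false alarm rate, clearly separated.

```python
def table_ratios(a, b, c, d):
    """Ratios from Table 1; each is known by several names across disciplines."""
    pod = a / (a + c)      # probability of detection, recall, sensitivity, prefigurance
    sr = a / (a + b)       # success ratio, precision, relevancy, pertinency ratio
    far = b / (a + b)      # false alarm ratio, the complement of SR
    fa_rate = b / (b + d)  # the b/(b + d) quantity also called the false alarm rate
    return pod, sr, far, fa_rate

print(table_ratios(3, 1, 1, 3))  # (0.75, 0.75, 0.25, 0.25) for the illustrative counts
```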
As a companion to recall, Kent et al. (1955) named a/(a + b) the pertinency ratio. In the context of information retrieval, the origin of this term seems clear: it gives the fraction of retrieved pieces of information that are pertinent to the search request. In the early 1960s, the study of information retrieval received a huge boost from National Science Foundation funding intended to help scientists find the most important references and thereby speed research in the aftermath of the Soviet launch of Sputnik. One of the biggest recipients of that support was at the Cranfield College of Aeronautics in the United Kingdom. The leader of that project, Cyril Cleverdon, followed up on Kent et al., adopting the use of recall but preferring relevancy over pertinency (Cleverdon 1962). The other big project, Information Storage and Retrieval (ISR), led by Gerard Salton at Harvard, introduced the term precision (Salton 1964) with no discussion of the reasons.4 Interestingly, in the same report in which Salton used “precision,” Rocchio (1964) used relevancy, in keeping with Cleverdon, but added a parenthetical comment that it “was sometimes called precision.” Eventually, at a closed-door meeting in Washington in December 1964, Cleverdon agreed to use precision (Cleverdon and Swanson 1965).
One of the biggest advances made by Cleverdon (1962) was the notion of plotting recall and relevancy on a two-dimensional plot. Although Kent et al. (1955) had plotted distributions of the individual terms, Cleverdon was the first to plot them jointly. The plot could be made for a single 2 × 2 table or for a series of tables where the events were the same but different thresholds were used; in the original case, the thresholds changed through the addition of search terms to make the search more restrictive. That addition led to an apparent trade-off between recall and relevancy, and a debate went on for some time about whether that trade-off was real or the result of the way Cleverdon had done the analysis, despite a connection that could be made to the model of Tanner and Swets (1954) that supported the trade-off as real. For clarity, we have replotted Cleverdon’s original data (Fig. 2, his Tables 7.5 and 7.6). It is of particular interest that in the first plot of the two quantities, recall was plotted on the vertical axis, the same convention as used in the performance diagram of Roebber (2009). This also means that the vertical axis is the same as used in the receiver operating characteristic (ROC) plots developed by Peterson and Birdsall (1953). Swets (1963) uses that convention in a discussion of what the horizontal axis on such a plot should be [relevancy, or the b/(b + d) that Tanner and Swets (1954) had advocated, or, simply, b as used by Swanson in a 1962 conference presentation].
The flipping of the axes on this two-dimensional plot of recall and what was by then referred to as precision first appears in Salton (1964, Fig. 12), with no reason given for the change. No discussion appears in the reports of the Cranfield and ISR groups as to why these choices were made, but they appear to reflect conventions used by the two groups. Interestingly, Michael Keen wrote sections in reports for both groups, implying he may have received support from each. Even though he always referred to the diagrams as recall–precision diagrams, when they were published in Cranfield reports, recall is on the vertical axis (e.g., Keen 1966), and when they were published in ISR reports, precision is on the vertical axis (e.g., Keen 1968).
The question of the “standard” orientation of the two axes was not resolved for a long time, and it is not clear why precision as the vertical axis became the default in the machine learning community. Outside of the two big groups’ choices (Cranfield and ISR), Lancaster (1968, 1979) and Sparck Jones (1972, 1979) were among the notable people who preferred recall on the vertical axis. Keen (1981) also used recall on the vertical. Heine (1973) used precision on the vertical axis and overlaid contours of the Marczewski–Steinhaus distance, marking the first appearance of a combined measure on such a plot. Van Rijsbergen (1974) and, later, Van Rijsbergen and Croft (1975) also used precision on the vertical axis.
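The plotting conventions described above are easy to reproduce. The sketch below is our own code with invented sample points (not the packaged code referenced in the data availability statement); it follows the orientation of Roebber (2009), with SR (precision) on the horizontal axis and POD (recall) on the vertical axis, and overlays contours of υg as an example of a combined measure on such a plot.

```python
# Minimal sketch of a performance-diagram-style plot: SR on the horizontal
# axis, POD on the vertical axis, with contours of Gilbert's ratio overlaid.
import numpy as np
import matplotlib.pyplot as plt

sr, pod = np.meshgrid(np.linspace(0.01, 1.0, 200), np.linspace(0.01, 1.0, 200))
v_g = 1.0 / (1.0 / pod + 1.0 / sr - 1.0)   # from 1/v_g = 1/POD + 1/SR - 1

fig, ax = plt.subplots(figsize=(5, 5))
contours = ax.contour(sr, pod, v_g, levels=np.arange(0.1, 1.0, 0.1), colors="gray")
ax.clabel(contours, fmt="%.1f")
ax.plot([0.75], [0.75], "ko")              # invented example point
ax.set_xlabel("Success ratio (precision)")
ax.set_ylabel("POD (recall)")
plt.show()
```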
c. The Pythagorean means scores
If one starts with the two quantities POD and SR (or recall and precision), combining the quantities in ways that differ from what might be considered a Venn diagram approach seems logical. It is often possible to use algebra to show that scores developed in the two approaches are mathematically related, but the paths taken to get there are very different. Scores have repeatedly been developed that are simple functions of the three Pythagorean means.
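For concreteness, the three Pythagorean means of POD and SR can be written directly in terms of the table entries. The sketch below (the function name and counts are ours) also verifies that the harmonic mean equals 2a/(2a + b + c), a monotonic function of υg, which is why, as noted later, HM is related directly to υg.

```python
import math

def pythagorean_means(a, b, c):
    """Arithmetic, geometric, and harmonic means of POD = a/(a+c) and SR = a/(a+b)."""
    pod, sr = a / (a + c), a / (a + b)
    am = 0.5 * (pod + sr)
    gm = math.sqrt(pod * sr)        # the Ochiai coefficient; gm**2 is the unbiased hit rate
    hm = 2 * pod * sr / (pod + sr)  # the Sorensen-Dice index / F1 score
    return am, gm, hm

a, b, c = 3, 1, 1
am, gm, hm = pythagorean_means(a, b, c)
v_g = a / (a + b + c)
assert abs(hm - 2 * v_g / (1 + v_g)) < 1e-12  # HM = 2a/(2a + b + c) = 2*v_g/(1 + v_g)
```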
The GM has also been rediscovered a number of times over the years in a variety of fields, in a process that represents an excellent example of the curiosities of naming. It appears that the first use is from Thomson (1916), in the testing of psychological theories. It was cited by Driver and Kroeber (1932) in archaeology. Their application was similar to Jaccard’s, in that they were comparing the distribution of artifacts in an effort to measure the similarity, or the lack thereof, of different cultural groups. Interestingly, one of the names that Brusco et al. (2021) give to the score is Driver and Kroeber, despite the fact that they reference Thomson. If any weight is given to the “reputation” of the authors in the adoption of names, it is curious to note that Thomson is regarded as one of the pioneers of intelligence research and wrote his paper as part of a debate with Spearman, of Spearman’s rank correlation fame.
Marine biology is another field that has given the GM a name that has passed to other fields. A common name for the GM is the Ochiai coefficient (Ochiai 1957). Ochiai credits Otsuka (1936),6 but via an intermediate publication by Hamai (1955). To add to the naming confusion, Howarth (2017) credits Otuka but attributes the score to another researcher with the same family name.
Doolittle (1885), in his response to Finley (1884), used the product of POD and SR, which is obviously the square of GM, in developing his inference ratio, which was intended to measure the skill of a forecaster by adjusting for the number of forecasts that would be correct by chance. In his work on the distribution of mollusks in the Miocene, Sorgenfrei (1959) also used GM². His problem was similar to that of Jaccard, in that he was looking at the probability of species occurring in each of two different areas and simply multiplying the two probabilities. Wagner (1993) referred to this quantity as the “unbiased hit rate” in a review paper about the evaluation of behavioral psychology experiments. Armistead (2013) brought the metric back into meteorology, proposing it as a method for evaluating categorical forecasts. Armistead references Doolittle’s inference ratio without mentioning that Doolittle had used the quantity Wagner would later call the unbiased hit rate in its development, although in a later paper (Armistead 2016), he refers to it as Doolittle’s unattributed joint probability measure.
Scores that are identical to HM have been used in many contexts and are known by a variety of names. In ecological studies, Gleason (1920) used it without naming it, Sørensen (1948) referred to it as the quotient of similarity, and Dice (1945) referred to it as the coincidence index. Neither of the names that Sørensen and Dice used seems to have persisted, with the score now being called the Sørensen index, Dice’s coefficient, or the Sørensen–Dice index. An identical score was presented at a conference by van Rijsbergen and is widely referred to as the F score or F1 score in the machine learning community. Sasaki (2007) indicates that the origin of the name was apparently accidental, the score being confused with another F function that van Rijsbergen and Croft (1975) introduced as a “combination” function in the derivation of a measure of effectiveness E.
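Because the same quantities appear in machine learning toolkits under these later names, a quick cross-check with scikit-learn (which provides jaccard_score and f1_score) may be useful; the small forecast and observation arrays below are the same invented example used earlier.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

obs = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fcst = np.array([1, 1, 0, 1, 0, 0, 1, 0])

print(jaccard_score(obs, fcst))  # 0.6  -> Gilbert's ratio / CSI / Jaccard index
print(f1_score(obs, fcst))       # 0.75 -> HM of recall and precision / Sorensen-Dice / F1
```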
d. Scores that weight error terms differently
Any score that can be expressed as a simple function of a Pythagorean mean that weights the components equally (including υg) implicitly values false alarms and missed events equally. For many users, this is an unrealistic assumption. For instance, it is highly likely that failure to diagnose a disease with a high fatality rate is a greater threat to health than treating for the disease when it is not present. In that case, a patient would be more interested in a high POD than a high SR. There are two popular scores that provide such weights. Although, for an important selection of weights discussed below, they can be shown to be algebraically related to each other, as before, they start from different perspectives.
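One widely used weighted combination, and the one shown in Fig. 6 below, is the Fβ score, which generalizes the harmonic mean by weighting POD relative to SR: values of β greater than 1 weight POD more heavily, and values less than 1 weight SR more heavily. A minimal sketch in terms of the table entries (the function name and counts are ours) follows.

```python
def f_beta(a, b, c, beta):
    """F-beta score: a weighted harmonic mean of SR (precision) and POD (recall).
    beta > 1 weights POD more heavily; beta < 1 weights SR more heavily."""
    pod, sr = a / (a + c), a / (a + b)
    return (1 + beta**2) * sr * pod / (beta**2 * sr + pod)

a, b, c = 3, 2, 1                 # illustrative counts with POD (0.75) > SR (0.6)
print(f_beta(a, b, c, beta=1.0))  # 0.667, identical to HM (the F1 score)
print(f_beta(a, b, c, beta=2.0))  # 0.714, pulled toward POD
print(f_beta(a, b, c, beta=0.5))  # 0.625, pulled toward SR
```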
e. Impacts of score selection on apparent forecast performance
Brooks and Correia (2018) looked at long-term performance metrics for National Weather Service tornado warnings. One of the points of interest was the change in performance that took place in 2012–13 as a result of an apparent increase in emphasis on reducing false alarms [equivalent to lowering b in (15) or increasing γ in (17)]. POD decreased and SR increased in that interval compared to the previous years. If we wish to consider some combination of the two scores to estimate the overall impact of the change, the impression we get depends on the combination chosen. Time series of the three Pythagorean means and of Fβ for β = 0.5 and 2, computed from the results of Brooks and Correia updated through 2022, provide a range of impressions (Fig. 6). (Note that, as seen before, HM is related directly to υg.)
HM shows less of an impact on apparent performance in 2012–13 than the other Pythagorean means and, in fact, increases back to the highest values ever seen by the late 2010s/early 2020s. GM and AM show large decreases with the increased emphasis on false alarms, and those measures have not yet returned to their previous values. For the weighted scores, the one that values POD more highly shows large decreases after 2011, while the score that values POD less shows continuous increases through to the present. We offer no judgment as to which of the scores is most “correct.” Our point is that the choice of performance measure, even for relatively simple evaluation exercises, can have a huge impact on how one judges performance. Optimizing to one metric may lead to much poorer performance by other metrics. This is likely to be particularly important in cases where the costs of different kinds of errors are very different.
Often, there is a disconnect between the loss function used to train a machine learning model and the verification metrics used to evaluate it. Lagerquist and Ebert-Uphoff (2022) explored using verification metrics such as the fractions skill score (Roberts and Lean 2008) and CSI as loss functions for training neural networks. One result was that using CSI as the loss function produced an overprediction bias, which is unsurprising given that CSI is a biased metric (i.e., one can maximize CSI by either overpredicting or underpredicting, a practice known as hedging). Loss functions often require mathematical properties such as continuity, differentiability, and convexity, while verification metrics are often more interpretable. A machine learning practitioner should carefully select a loss function appropriate for the decision task, and more work is required to determine which loss functions are appropriate for certain tasks.
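As an illustration of the idea, and not the implementation used by Lagerquist and Ebert-Uphoff (2022), one common way to use CSI as a loss function is to replace the hard counts a, b, and c with probability-weighted sums so that the quantity becomes differentiable. The NumPy sketch below (function and variable names are ours) shows the basic form; in practice, the same expression would be written with a deep learning framework's differentiable operations.

```python
import numpy as np

def soft_csi_loss(probs, targets, eps=1e-8):
    """Illustrative 'soft' CSI loss: replace hard counts with probability-weighted
    sums and minimize 1 - CSI. A sketch of the general idea only."""
    a = np.sum(probs * targets)        # soft hits
    b = np.sum(probs * (1 - targets))  # soft false alarms
    c = np.sum((1 - probs) * targets)  # soft misses
    return 1.0 - a / (a + b + c + eps)

probs = np.array([0.9, 0.7, 0.2, 0.1])  # predicted event probabilities (invented)
targets = np.array([1, 1, 0, 0])        # observed events
print(soft_csi_loss(probs, targets))    # about 0.30
```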
3. Concluding thoughts
First, we strongly recommend that authors show the contingency table they are using and how the basic terms they use are derived from that table. There are enough options available in the literature that it is very easy to confuse even experienced users. We recommend that interdisciplinary researchers primarily use the terms that are common in the field in which they are publishing, such as υg or “CSI” in a meteorological journal, to help the primary audience understand quickly. If another term for a metric is commonly used in the discipline from which the researchers come, and may be their standard choice, it would be extremely useful to include that term in a parenthetical remark or footnote. Marzban (1998) is an example of this practice, mentioning that a “contingency table” is sometimes referred to as a “confusion matrix.” This practice helps readers working in a related, but not core, area learn the language of the other discipline. It also introduces the term to them, so that they can find what the unfamiliar discipline has done in analyzing similar problems and perhaps learn new techniques from it. As an example, biogeography, in its study of the similarity and differences in the distribution of organisms in different situations, has produced a wide body of literature on classification metrics.
Much of what we have discussed has involved the rediscovery of metrics. As part of that process, we have seen the fruits of what might be considered less-than-complete literature reviews. Although we believe we have found many of the most important early works, we have, if nothing else, learned not to have the hubris to think we have been complete. We apologize to anyone whose favorite “old paper” introducing one of these metrics we may have missed.
Although Jaccard (1901) is clearly the first use, some authors have credited Jaccard (1902, 1907, 1912) as the origin. The 1912 reference is an English translation of the 1907 reference.
Mason (1989) and Schaefer (1990) incorrectly credit Palmer and Allen (1949) as the source of “threat index.” Larue and Younkin (1961) do not reference Palmer and Allen.
Brier and Allen is the standard reference for this term, but they cite the unpublished U.S. Army Air Force (1944, unpublished manuscript) technical memo as the source. It is likely, based on the terminology used in the abstract of U.S. Army Air Force (USAAF), that one or both of them are responsible for USAAF.
Multiple reports from both the Cleverdon and Salton groups are available online, along with a number of other documents from the history of information retrieval, from the Special Interest Group on Information Retrieval at https://www.sigir.org/resources/museum/.
For simplicity, we use POD and SR in the discussion of the Pythagorean means instead of recall and precision or any other pair of names for the same quantities.
The name is sometimes spelled Otsuka.
Acknowledgments.
This material is based upon work supported by the Joint Technology Transfer Initiative Program within the NOAA/OAR Weather Program Office under Award NA22OAR4590171. Funding was also provided by NOAA/Office of Oceanic and Atmospheric Research under the NOAA-University of Oklahoma Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce. Allan Murphy inspired interest in the history of scoring metrics and discussions between the lead author and him in the 1990s were useful in clarifying credit for discovery or rediscovery, even if Allan might not agree with all we have done here. More recently, exposure to the work done by AI/ML researchers seeking to improve forecasters and an awareness of the different language they use has highlighted opportunities for growth in multidisciplinary work. We thank Bob Glahn for providing a copy of Palmer and Allen (1949), which is in process of being cataloged at the National Centers for Environmental Information after being lost for decades. Since that provision, Bob has unfortunately passed away and we dedicate this paper to his memory, as well as that of Allan Murphy. We thank Naoko Sakaeda for translating a portion of the Japanese papers. We thank Amanda Schilling and Tracy Chapman, scientific librarians at the National Weather Center, for their invaluable assistance in tracking down some of the more obscure references, particularly Kulczynski (1927), Tanimoto (1958), and McConnaughey (1964). We hope that they enjoyed the hunt as much as we did.
Data availability statement.
The data and Python code used to generate the figures can be found at https://github.com/monte-flora/verification_diagrams. To encourage adoption and exploration of other 2 × 2 contingency-table metrics, this package contains all 71 scores from Brusco et al. (2021).
REFERENCES
Amirkhani, D., and A. Bastanfard, 2021: An objective method to evaluate exemplar-based inpainted images quality using Jaccard index. Multimedia Tools Appl., 80, 26 199–26 212, https://doi.org/10.1007/s11042-021-10883-3.
Armistead, T. W., 2013: H. L. Wagner’s unbiased hit rate and the assessment of categorical forecasting accuracy. Wea. Forecasting, 28, 802–814, https://doi.org/10.1175/WAF-D-12-00047.1.
Armistead, T. W., 2016: Misunderstood and unattributed: Revisiting M. H. Doolittle’s measures of association, with a note on Bayes’ theorem. Amer. Stat., 70, 63–73, https://doi.org/10.1080/00031305.2015.1086686.
Barnes, L. R., E. C. Gruntfest, M. H. Hayden, D. M. Schultz, and C. Benight, 2007: False alarms and close calls: A conceptual model of warning accuracy. Wea. Forecasting, 22, 1140–1147, https://doi.org/10.1175/WAF1031.1; Corrigendum, 24, 1452–1454, https://doi.org/10.1175/2009WAF2222300.1.
Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. Compendium of Meteorology, T. Malone, Ed., Amer. Meteor. Soc., 841–848.
Brooks, H. E., and J. Correia Jr., 2018: Long-term performance metrics for National Weather Service tornado warnings. Wea. Forecasting, 33, 1501–1511, https://doi.org/10.1175/WAF-D-18-0120.1.
Brusco, M., J. D. Cradit, and D. Steinley, 2021: A comparison of 71 binary similarity coefficients: The effect of base rates. PLOS ONE, 16, e0247751, https://doi.org/10.1371/journal.pone.0247751.
Cheng, Y., P. A. Pérez-Díaz, K. V. Petrides, and J. Li, 2023: Monte Carlo simulation with confusion matrix paradigm – A sample of internal consistency indices. Front. Psychol., 14, 1298534, https://doi.org/10.3389/fpsyg.2023.1298534.
Cleverdon, C., and D. R. Swanson, 1965: The Cranfield hypotheses. Libr. Quart., 35, 121–124, https://doi.org/10.1086/619319.
Cleverdon, C. W., 1962: Report on the testing and analysis of an investigation into the comparative efficiency of indexing systems. ASLIB Cranfield Research Project, Cranfield Aeronautics Laboratory, 322 pp.
Consonni, V., and R. Todeschini, 2012: New similarity coefficients for binary data. MATCH Commun. Math. Comp. Chem., 68, 581–592.
de Elía, R., 2022: The false alarm/surprise trade-off in weather warnings systems: An expected utility theory perspective. Environ. Syst. Decisions, 42, 450–461, https://doi.org/10.1007/s10669-022-09863-1.
de Elía, R., J. J. Ruiz, V. Francce, P. Lohigorry, M. Saucedo, M. Menalled, and D. D’Amen, 2023: Early warning systems and end-user decision-making: A risk formalism tool to aid communication and understanding. Risk Anal., 15, 1–15, https://doi.org/10.1111/risa.14221.
Dice, L. R., 1945: Measures of the amount of ecologic association between species. Ecology, 26, 297–302, https://doi.org/10.2307/1932409.
Donaldson, R. J., M. J. Kraus, and R. M. Dyer, 1975a: Operational benefits of meteorological Doppler radar. AFCRL Tech. Rep. AFCRL-TR-75-0103, 25 pp., https://apps.dtic.mil/sti/trecms/pdf/ADA010434.pdf.
Donaldson, R. J., R. M. Dyer, and M. J. Kraus, 1975b: An objective evaluator of techniques for predicting severe weather events. Preprints, Ninth Conf. on Severe Local Storms, Norman, OK, Amer. Meteor. Soc., 321–326.
Doolittle, M. H., 1885: The verification of predictions. Bull. Philos. Soc. Wash., 7, 122–127.
Doswell, C. A., R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. Wea. Forecasting, 5, 576–585, https://doi.org/10.1175/1520-0434(1990)005<0576:OSMOSI>2.0.CO;2.
Driver, H. E., and A. L. Kroeber, 1932: Quantitative expression of cultural relationships. Univ. Calif. Publ. Amer. Archaeol. Ethnology, 31, 211–256.
Finley, J. P., 1884: Tornado predictions. Amer. Meteor. J., 1, 85–88.
Gilbert, G. K., 1884: Finley’s tornado predictions. Amer. Meteor. J., 1, 166–172.
Gleason, H. A., 1920: Some applications of the quadrat method. Bull. Torrey Bot. Club, 47, 21–33, https://doi.org/10.2307/2480223.
Hamai, I., 1955: Stratification of community by means of “community coefficient” (continued) (most of text in Japanese). Japan J. Ecol., 5, 41–45, https://doi.org/10.18960/seitai.4.4_171.
Heine, M. H., 1973: Distance between sets as an objective measure of retrieval effectiveness. Inf. Storage Retr., 9, 181–198, https://doi.org/10.1016/0020-0271(73)90066-1.
Heydarian, M., T. E. Doyle, and R. Samavi, 2022: MLCM: Multi-label confusion matrix. IEEE Access, 10, 19 083–19 095, https://doi.org/10.1109/ACCESS.2022.3151048.
House, D. C., 1960: Remarks on the optimum spacing of upper air observations. Mon. Wea. Rev., 88, 97–100, https://doi.org/10.1175/1520-0493(1960)088<0097:ROTOSO>2.0.CO;2.
Howarth, R. J., 2017: Dictionary of Mathematical Geosciences: With Historical Notes. Springer, 893 pp., https://doi.org/10.1007/978-3-319-57315-1.
Jaccard, P., 1901: Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines (in French). Bull. Soc. Vaudoise Sci. Nat., 37, 241–272, https://doi.org/10.5169/seals-266440.
Jaccard, P., 1902: Lois de distribution florale dans le zone alpine (in French). Bull. Soc. Vaudoise Sci. Nat., 38, 69–130, https://doi.org/10.5169/seals-266762.
Jaccard, P., 1907: La distribution de la flore dans la zone alpine (in French). Rev. Gen. Sci., 18, 961–967.
Jaccard, P., 1912: The distribution of the flora in the alpine zone. New Phytol., 11, 37–50, https://doi.org/10.1111/j.1469-8137.1912.tb05611.x.
Keen, M., 1966: Measures and averaging methods used in performance testing of indexing systems. ASLIB-Cranfield Research Project, 62 pp.
Keen, M., 1968: Evaluation parameters. Information Storage and Retrieval: Scientific Report No. ISR-13, G. Salton, Ed., National Science Foundation, II-1–II-67.
Keen, M., 1981: Laboratory tests of manual systems. Information Retrieval Experiment, K. Sparck Jones, Ed., Butterworths, 136–155.
Kent, A., M. M. Berry, F. U. Luehrs Jr., and J. W. Perry, 1955: Machine literature searching VIII. Operational criteria for designing information retrieval systems. Amer. Doc., 6, 93–101, https://doi.org/10.1002/asi.5090060209.
Kulczynski, M., 1927: Zespoły roślin w Pieninach—Die Pflanzenassoziationen der Pieninen. Bull. Int. Acad. Pol. Sci. Lett., 2, 57–203.
Kumler-Bonfanti, C., J. Stewart, D. Hall, and M. Govett, 2020: Tropical and extratropical cyclone detection using deep learning. J. Appl. Meteor. Climatol., 59, 1971–1985, https://doi.org/10.1175/JAMC-D-20-0117.1.
Lagerquist, R., and I. Ebert-Uphoff, 2022: Can we integrate spatial verification methods into neural network loss functions for atmospheric science? Artif. Intell. Earth Syst., 1, e220021, https://doi.org/10.1175/AIES-D-22-0021.1.
Lancaster, F. W., 1968: Evaluation of the MEDLARS demand search service. U. S. Department of Health, Education and Welfare, Public Health Service Rep., 288 pp.
Lancaster, F. W., 1979: Information Retrieval Systems: Characteristics, Testing and Evaluation. 2nd ed. John Wiley and Sons, 381 pp.
Larue, J. A., and R. J. Younkin, 1961: Weather Note: The Middle Mississippi Valley hydrometeorological storm of May 4–9, 1961. Mon. Wea. Rev., 89, 555–559, https://doi.org/10.1175/1520-0493(1961)089<0555:WNTMMV>2.0.CO;2.
Leydesdorff, L., 2008: On the normalization and visualization of author co-citation data: Salton’s Cosine versus the Jaccard index. J. Amer. Soc. Inf. Sci. Technol., 59, 77–85, https://doi.org/10.1002/asi.20732.
Marczewski, E., and H. Steinhaus, 1958: On a certain distance of sets and the corresponding distance of functions. Colloq. Math., 6, 319–327, https://doi.org/10.4064/cm-6-1-319-327.
Marzban, C., 1998: Scalar measures of performance in rare-event situations. Wea. Forecasting, 13, 753–763, https://doi.org/10.1175/1520-0434(1998)013<0753:SMOPIR>2.0.CO;2.
Mason, I. B., 1989: Dependence of the critical success index on sample climate and threshold probability. Aust. Meteor. Mag., 37, 75–81.
McConnaughey, B. H., 1964: The Determination and Analysis of Plankton Communities. Lembaga Penelitian Laut, 40 pp.
Miller, G. A., and P. A. Nicely, 1955: An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Amer., 27, 338–352, https://doi.org/10.1121/1.1907526.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. Wea. Forecasting, 11, 3–20, https://doi.org/10.1175/1520-0434(1996)011<0003:TFAASE>2.0.CO;2.
Ochiai, A., 1957: Zoogeographic studies on the soleoid fishes found in Japan and its neighboring regions. Bull. J. Japan Soc. Fish. Sci., 22, 526–530, https://doi.org/10.2331/suisan.22.526.
Olson, R. H., 1965: On the use of Bayes’ theorem in estimating false alarm rates. Mon. Wea. Rev., 93, 557–558, https://doi.org/10.1175/1520-0493(1965)093<0557:OTUOBT>2.3.CO;2.
Otsuka, Y., 1936: The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan. Bull. Biogeogr. Soc. Japan, 6, 165–170.
Palmer, W. C., and R. A. Allen, 1949: Note on the accuracy of forecasts concerning the rain problem. U. S. Weather Bureau Manuscript, 2 pp.
Pearson, K., 1904: Mathematical Contributions to the Theory of Evolution. XIII: On the Theory of Contingency and its Relation to Association and Normal Correlation. Drapers’ Company Research Memoirs, Biometric Series, Vol. I, Dulau and Co, 35 pp.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Peterson, W. W., and T. G. Birdsall, 1953: The theory of signal detectability. Part I: The general theory. Electronic Defense Group, Department of Electrical Engineering, University of Michigan, Tech Rep. 13, 156 pp.
Phillips, G., and Coauthors, 2024: Setting nutrient boundaries to protect aquatic communities: The importance of comparing observed and predicted classifications using measures derived from a confusion matrix. Sci. Total Environ., 912, 168872, https://doi.org/10.1016/j.scitotenv.2023.168872.
Riehl, K., M. Neunteufel, and M. Hemberg, 2023: Hierarchical confusion matrix for classification performance evaluation. J. Roy. Stat. Soc., 72C, 1394–1412, https://doi.org/10.1093/jrsssc/qlad057.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Rocchio, J. J., 1964: Performance indices for document retrieval systems. Information Storage and Retrieval: Scientific Report No. ISR-8, G. Salton, Ed., National Science Foundation, III-1–III-18.
Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Roebber, P. J., and L. F. Bosart, 1996: The complex relationship between forecast skill and forecast value: A real-world analysis. Wea. Forecasting, 11, 544–559, https://doi.org/10.1175/1520-0434(1996)011<0544:TCRBFS>2.0.CO;2.
Ruuska, S., W. Hämäläinen, S. Kajava, M. Mughal, P. Matilaine, and J. Mononen, 2018: Evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle. Behav. Processes, 148, 56–62, https://doi.org/10.1016/j.beproc.2018.01.004.
Salton, G., 1964: The evaluation of automatic storage retrieval procedures—selected test results using the SMART system. Information Storage and Retrieval: Scientific Report No. ISR-8, G. Salton, Ed., National Science Foundation, IV-1–IV-36.
Sasaki, Y., 2007: The truth of the F-measure. Lecture Notes, 5 pp., https://www.cs.odu.edu/~mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf.
Schaefer, J. T., 1990: The critical success index as an indicator of warning skill. Wea. Forecasting, 5, 570–575, https://doi.org/10.1175/1520-0434(1990)005<0570:TCSIAA>2.0.CO;2.
Scheller, I. F., K. Lutz, C. Mertes, V. A. Yépez, and J. Gagneur, 2023: Improved detection of aberrant splicing with FRASER 2.0 and the intron Jaccard index. Amer. J. Hum. Genet., 110, 2056–2067, https://doi.org/10.1016/j.ajhg.2023.10.014.
Sørensen, T., 1948: A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K. Dan. Vidensk. Selsk. Biol. Skr., 5, 1–34.
Sorgenfrei, T., 1959: Molluscan assemblages from the marine Middle Miocene of South Jutland and their environments. Dan. Geol. Unders., 79, 356–503, https://doi.org/10.34194/raekke2.v79.6869.
Sparck Jones, K., 1972: A statistical interpretation of term specificity and its application in retrieval. J. Doc., 28, 11–21, https://doi.org/10.1108/eb026526.
Sparck Jones, K., 1979: Experiments in relevance weighting of search terms. Inf. Process. Manage., 15, 133–144, https://doi.org/10.1016/0306-4573(79)90060-8.
Stensrud, D. J., and M. S. Wandishin, 2000: The correspondence ratio in forecast evaluation. Wea. Forecasting, 15, 593–602, https://doi.org/10.1175/1520-0434(2000)015<0593:TCRIFE>2.0.CO;2.
Swets, J. A., 1963: Information retrieval systems. Science, 141, 245–250, https://doi.org/10.1126/science.141.3577.245.
Tanimoto, T. T., 1958: An elementary mathematical theory of classification and prediction. Internal IBM Tech. Rep., 10 pp.
Tanner, W. P., Jr., and J. A. Swets, 1954: A decision-making theory of visual detection. Psychol. Rev., 61, 401–409, https://doi.org/10.1037/h0058700.
Thomson, G. H., 1916: A hierarchy without a general factor. Br. J. Psychol., 8, 271–281.
Tversky, A., 1977: Features of similarity. Psychol. Rev., 84, 327–352, https://doi.org/10.1037/0033-295X.84.4.327.
van der Maarel, E., 1969: On the use of ordination models in phytosociology. Vegetatio, 19, 21–46, https://doi.org/10.1007/BF00259002.
Van Rijsbergen, C. J., 1974: Foundation of evaluation. J. Doc., 30, 365–373, https://doi.org/10.1108/eb026584.
Van Rijsbergen, C. J., and W. B. Croft, 1975: Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Inf. Process. Manage., 11, 171–182, https://doi.org/10.1016/0306-4573(75)90006-0.
Venn, J., 1880: I. On the diagrammatic and mechanical representation of propositions and reasonings. London, Edinburgh, Dublin Philos. Mag. J. Sci., 10, 1–18, https://doi.org/10.1080/14786448008626877.
Wagner, H. L., 1993: On measuring performance in category judgment studies of nonverbal behavior. J. Nonverbal Behav., 17, 3–28, https://doi.org/10.1007/BF00987006.
Wandishin, M. S., and H. E. Brooks, 2002: On the relationship between Clayton’s skill score and expected values for forecasts of binary events. Meteor. Appl., 9, 455–459, https://doi.org/10.1017/S1350482702004085.
Warrens, M. J., and H. van der Hoef, 2022: Understanding the adjusted Rand index and other partition comparison indices based on counting object pairs. J. Classif., 39, 487–509, https://doi.org/10.1007/s00357-022-09413-z.
Wilks, Y., D. Fass, C.-M. Guo, J. E. McDonald, T. Plate, and B. M. Slator, 1990: Providing machine tractable dictionary tools. Mach. Transl., 5, 99–154, https://doi.org/10.1007/BF00393758.
Yerushalmy, J., 1947: Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep., 62, 1432–1449, https://doi.org/10.2307/4586294.