# Search Results

## You are looking at 1–10 of 57 items for:

- Author or Editor: ALLAN H. MURPHY
- Article
- Refine by Access: All Content

## Abstract

Skill scores defined as measures of relative mean square error—and based on standards of reference representing climatology, persistence, or a linear combination of climatology and persistence—are decomposed. Two decompositions of each skill score are formulated: 1) a decomposition derived by conditioning on the forecasts and 2) a decomposition derived by conditioning on the observations. These general decompositions contain terms consisting of measures of statistical characteristics of the forecasts and/or observations and terms consisting of measures of basic aspects of forecast quality. Properties of the terms in the respective decompositions are examined, and relationships among the various skill scores—and the terms in the respective decompositions—are described.

Hypothetical samples of binary forecasts and observations are used to illustrate the application and interpretation of these decompositions. Limitations on the inferences that can be drawn from comparative verification based on skill scores, as well as from comparisons based on the terms in decompositions of skill scores, are discussed. The relationship between the application of measures of aspects of quality and the application of the sufficiency relation (a statistical relation that embodies the concept of unambiguous superiority) is briefly explored.

The following results can be gleaned from this methodological study. 1) Decompositions of skill scores provide quantitative measures of—and insights into—multiple aspects of the forecasts, the observations, and their relationship. 2) Superiority in terms of overall skill is no guarantor of superiority in terms of other aspects of quality. 3) Sufficiency (i.e., unambiguous superiority) generally cannot be inferred solely on the basis of superiority over a relatively small set of measures of specific aspects of quality.

Neither individual measures of overall performance (e.g., skill scores) nor sets of measures associated with decompositions of such overall measures respect the dimensionality of most verification problems. Nevertheless, the decompositions described here identify parsimonious sets of measures of basic aspects of forecast quality that should prove to be useful in many verification problems encountered in the real world.
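The two conditionings described above correspond to the standard factorizations of the joint distribution of forecasts and observations: conditioning on the forecasts yields the calibration-refinement factorization, and conditioning on the observations yields the likelihood-base-rate factorization. A minimal sketch in Python (with an invented 2 × 2 joint distribution, not data from the paper) shows both factorizations and confirms that they reconstruct the same joint:

```python
import numpy as np

# Invented joint distribution p(f, x) of binary forecasts f and observations x
# (rows: f = 0, 1; columns: x = 0, 1). Not data from the paper.
joint = np.array([[0.45, 0.05],
                  [0.10, 0.40]])

# Conditioning on the forecasts (calibration-refinement factorization):
p_f = joint.sum(axis=1)              # marginal distribution of the forecasts
p_x_given_f = joint / p_f[:, None]   # conditional distributions of x given f

# Conditioning on the observations (likelihood-base-rate factorization):
p_x = joint.sum(axis=0)              # marginal distribution of the observations
p_f_given_x = joint / p_x[None, :]   # conditional distributions of f given x

# Both factorizations reconstruct the same joint distribution:
print(np.allclose(p_x_given_f * p_f[:, None], p_f_given_x * p_x[None, :]))  # True
```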

## Abstract

Heretofore it has been widely accepted that the contributions of W. E. Cooke in 1906 represented the first works related to the explicit treatment of uncertainty in weather forecasts. Recently, however, it has come to light that at least some aspects of the rationale for quantifying the uncertainty in forecasts were discussed prior to 1900 and that probabilities and odds were included in some weather forecasts formulated more than 200 years ago. An effort to summarize these new historical insights, as well as to clarify the precise nature of the contributions made by various individuals to early developments in this area, appears warranted.

The overall purpose of this paper is to extend and clarify the early history of probability forecasts. Highlights of the historical review include 1) various examples of the use of qualitative and quantitative probabilities or odds in forecasts during the eighteenth and nineteenth centuries, 2) a brief discussion in 1890 of the economic component of the rationale for quantifying the uncertainty in forecasts, 3) further refinement of the rationale for probability forecasts and the presentation of the results of experiments involving the formulation of quasi-probabilistic and probabilistic forecasts during the period 1900–25 (in reviewing developments during this early twentieth century period, the noteworthy contributions made by W. E. Cooke, C. Hallenbeck, and A. K. Ångström are described and clarified), and 4) a very concise overview of activities and developments in this area since 1925.

The early treatment of some basic issues related to probability forecasts is discussed and, in some cases, compared to their treatment in more recent times. These issues include 1) the underlying rationale for probability forecasts, 2) the feasibility of making probability forecasts, and 3) alternative interpretations of probability in the context of weather forecasts. A brief examination of factors related to the acceptance of—and resistance to—probability forecasts in the meteorological and user communities is also included.

## Abstract

An individual skill score (*SS*) and a collective skill score (*CSS*) are examined to determine whether these scoring rules are proper or improper. The *SS* and the *CSS* are both standardized versions of the Brier, or probability, score (*PS*) and have been used to measure the “skill” of probability forecasts. The *SS* is defined in terms of individual forecasts, while the *CSS* is defined in terms of collections of forecasts. The *SS* and the *CSS* are shown to be *improper* scoring rules, and, as a result, both the *SS* and the *CSS* encourage hedging on the part of forecasters.

The results of a preliminary investigation of the nature of the hedging produced by the *SS* and the *CSS* indicate that, while the *SS* may encourage a considerable amount of hedging, the *CSS*, in general, encourages only a modest amount of hedging, and even this hedging decreases as the sample size *K* of the collection of forecasts increases. In fact, *the CSS is approximately strictly proper for large collections of forecasts* (*K* ≥ 100).

Finally, we briefly consider two questions related to the standardization of scoring rules: 1) the use of different scoring rules in the assessment and evaluation tasks, and 2) the transformation of strictly proper scoring rules. With regard to the latter, we identify standardized versions of the *PS* which are strictly proper scoring rules and which, as a result, appear to be appropriate scoring rules to use to measure the “skill” of probability forecasts.
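The notion of propriety can be illustrated with a toy example: under a strictly proper rule such as the Brier score, a forecaster's expected score is optimized only by stating the true probability, whereas an improper rule rewards hedging toward an extreme. The sketch below uses the linear (absolute-error) score purely as a stand-in for an improper rule; it is not the paper's *SS* or *CSS*:

```python
import numpy as np

def expected_brier(stated_p, true_p):
    """Expected Brier score when the event occurs with probability true_p
    and the forecaster states stated_p."""
    return true_p * (stated_p - 1.0) ** 2 + (1.0 - true_p) * stated_p ** 2

def expected_linear(stated_p, true_p):
    """Expected linear (absolute-error) score -- a classic improper rule."""
    return true_p * (1.0 - stated_p) + (1.0 - true_p) * stated_p

true_p = 0.3
grid = np.linspace(0.0, 1.0, 101)

best_brier = grid[np.argmin([expected_brier(q, true_p) for q in grid])]
best_linear = grid[np.argmin([expected_linear(q, true_p) for q in grid])]

print(best_brier)   # ~0.3: honesty is optimal, so the Brier score is strictly proper
print(best_linear)  # 0.0: hedging to an extreme pays, so the linear score is improper
```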

## Abstract

A *new* vector partition of the probability, or Brier, score (*PS*) is formulated and the nature and properties of this partition are described. The relationships between the terms in this partition and the terms in the *original* vector partition of the *PS* are indicated. The new partition consists of three terms: 1) a measure of the uncertainty inherent in the events, or states, on the occasions of concern (namely, the *PS* for the sample relative frequencies); 2) a measure of the reliability of the forecasts; and 3) a new measure of the resolution of the forecasts. The measure of reliability is equivalent (i.e., linearly related) to the measure of reliability provided by the original partition, whereas the new measure of resolution is not. Two sample collections of probability forecasts are used to illustrate the differences and relationships between these partitions. Finally, the two partitions are compared, with particular reference to the attributes of the forecasts with which the partitions are concerned, the interpretation of the partitions in geometric terms, and the use of the partitions as the bases for the formulation of measures to evaluate probability forecasts. The results of these comparisons indicate that the new partition offers certain advantages vis-à-vis the original partition.
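For a binary event, a three-term partition of this kind takes the familiar form *PS* = uncertainty + reliability − resolution, where the reliability and resolution terms are computed over the groups of occasions sharing the same issued probability. A small sketch (invented sample; the helper name is mine) verifies that the partition is exact:

```python
import numpy as np
from collections import defaultdict

def brier_partition(p, o):
    """Three-term partition of the (half) Brier score for a binary event:
    PS = uncertainty + reliability - resolution."""
    p = np.asarray(p, float)
    o = np.asarray(o, float)
    n = len(p)
    obar = o.mean()
    uncertainty = obar * (1.0 - obar)

    # Group occasions by the distinct probability values that were issued.
    groups = defaultdict(list)
    for pi, oi in zip(p, o):
        groups[pi].append(oi)

    reliability = sum(len(g) * (pk - np.mean(g)) ** 2 for pk, g in groups.items()) / n
    resolution = sum(len(g) * (np.mean(g) - obar) ** 2 for g in groups.values()) / n
    return uncertainty, reliability, resolution

p = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9]  # invented forecasts
o = [0, 0, 1, 0, 1, 1, 0]                # corresponding observations

unc, rel, res = brier_partition(p, o)
ps = np.mean((np.asarray(p) - np.asarray(o)) ** 2)
print(abs(ps - (unc + rel - res)) < 1e-10)  # True: the partition is exact
```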

## Abstract

Scalar and vector partitions of the probability score (*PS*) in *N*-state (*N* > 2) situations are described and compared. In *N*-state, as well as in two-state (*N* = 2), situations these partitions provide similar, but not equivalent (i.e., linearly related), measures of the reliability and resolution of probability forecasts. Specifically, the vector partition, when compared to the scalar partition, decreases the reliability and increases the resolution of the forecasts. A sample collection of forecasts is used to illustrate the differences between these partitions in *N*-state situations.

Several questions related to the use of scalar and vector partitions of the *PS* in *N*-state situations are discussed, including the relative merits of these partitions and the effect upon sample size when forecasts are considered to be vectors rather than scalars. The discussions indicate that the vector partition appears to be more appropriate, in general, than the scalar partition, and that when the forecasts in a collection of forecasts are considered to be vectors rather than scalars the sample size of the collection may be substantially reduced.

## Abstract

Scalar and vector partitions of the probability score (PS) in the two-state situation are described and compared. These partitions, which are based upon expressions for the PS in which probability forecasts are considered to be scalars and vectors, respectively, provide similar, but not equivalent (i.e., linearly related), measures of the reliability and resolution of the forecasts. Specifically, the reliability (resolution) of the forecasts according to the scalar partition is, in general, greater (less) than their reliability (resolution) according to the vector partition. A sample collection of forecasts is used to illustrate the differences between these partitions.

Several questions related to the use of scalar and vector partitions of the PS in the two-state situation are discussed, including the interpretation of the results of previous forecast evaluation studies and the relative merits of these partitions. The discussions indicate that the partition most often used in such studies has been a special “scalar” partition, a partition which is equivalent to the vector partition in the two-state situation, and that the vector partition is more appropriate than the scalar partition.
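In the two-state situation the scalar and vector expressions for the PS differ only by a factor of two, since the second component of each probability vector is determined by the first. A quick numerical check (invented sample):

```python
import numpy as np

p = np.array([0.2, 0.7, 0.9])  # invented scalar probabilities of the event
o = np.array([0, 1, 1])        # corresponding observations

# Scalar form of the PS:
ps_scalar = np.mean((p - o) ** 2)

# Vector form: each forecast and observation is a 2-component probability vector.
P = np.column_stack([p, 1 - p])
O = np.column_stack([o, 1 - o])
ps_vector = np.mean(np.sum((P - O) ** 2, axis=1))

# In the two-state situation the vector score is exactly twice the scalar score:
print(np.isclose(ps_vector, 2 * ps_scalar))  # True
```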

## Abstract

Comparative operational evaluation of probabilistic prediction procedures in cost-loss ratio decision situations in which the evaluator's knowledge of the cost-loss ratio is expressed in probabilistic terms is considered. First, the cost-loss ratio decision situation is described in a utility framework and, then, measures of the expected utility of probabilistic predictions are formulated. Second, a class of expected-utility measures, the beta measures, in which the evaluator's knowledge of the cost-loss ratio is expressed in terms of a beta distribution, is described. Third, the beta measures are utilized to compare two prediction procedures on the basis of a small sample of predictions. The results indicate the importance, for comparative operational evaluation, of utilizing measures which provide a suitable description of the evaluator's knowledge. In particular, the use of the probability score, a measure equivalent to the *uniform* measure (which is a special beta measure), in decision situations in which the uniform distribution does not provide a suitable description of the evaluator's knowledge, may yield misleading results. Finally, the results are placed in proper perspective by describing several possible extensions to this study and by indicating the importance of undertaking such studies in actual operational situations.

## Abstract

Two fundamental characteristics of forecast verification problems—*complexity* and *dimensionality*—are described. To develop quantitative definitions of these characteristics, a general framework for the problem of absolute verification (AV) is extended to the problem of comparative verification (CV). Absolute verification focuses on the performance of individual forecasting systems (or forecasters), and it is based on the bivariate distribution of forecasts and observations and its two possible factorizations into conditional and marginal distributions.

Comparative verification compares the performance of two or more forecasting systems, which may produce forecasts under 1) identical conditions or 2) different conditions. The first type of CV is matched comparative verification, and it is based on a 3-variable distribution with six possible factorizations. The second and more complicated type of CV is unmatched comparative verification, and it is based on a 4-variable distribution with 24 possible factorizations.

Complexity can be defined in terms of the number of factorizations, the number of basic factors (conditional and marginal distributions) in each factorization, or the total number of basic factors associated with the respective frameworks. These definitions provide quantitative insight into basic differences in complexity among AV and CV problems. Verification problems involving probabilistic and nonprobabilistic forecasts are of equal complexity.

Dimensionality is defined as the number of probabilities that must be specified to reconstruct the basic distribution of forecasts and observations. It is one less than the total number of distinct combinations of forecasts and observations. Thus, CV problems are of higher dimensionality than AV problems, and problems involving probabilistic forecasts or multivalued nonprobabilistic forecasts exhibit particularly high dimensionality.

Issues related to the implications of these concepts for verification procedures and practices are discussed, including the reduction of complexity and/or dimensionality. Comparative verification problems can be reduced in complexity by making forecasts under identical conditions or by assuming conditional or unconditional independence when warranted. Dimensionality can be reduced by parametric statistical modeling of the distributions of forecasts and/or observations.

Failure to take account of the complexity and dimensionality of verification problems may lead to an incomplete and inefficient body of verification methodology and, thereby, to erroneous conclusions regarding the absolute and relative quality and/or value of forecasting systems.
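The definition of dimensionality above reduces to a simple count: one less than the number of distinct (forecast, observation) combinations. A sketch (the function name and examples are mine):

```python
def dimensionality(n_forecast_values, n_observed_values):
    """Number of probabilities needed to reconstruct the joint distribution of
    forecasts and observations: one less than the number of distinct
    (forecast, observation) combinations."""
    return n_forecast_values * n_observed_values - 1

# A binary nonprobabilistic forecast of a binary event (a 2 x 2 table):
print(dimensionality(2, 2))   # 3
# Probability forecasts issued in tenths (11 allowed values) of a binary event:
print(dimensionality(11, 2))  # 21
```

The jump from 3 to 21 illustrates why problems involving probabilistic forecasts exhibit particularly high dimensionality.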

## Abstract

Several skill scores are defined, based on the mean-square-error measure of accuracy and alternative climatological standards of reference. Decompositions of these skill scores are formulated, each of which is shown to possess terms involving 1) the coefficient of correlation between the forecasts and observations, 2) a measure of the nonsystematic (i.e., conditional) bias in the forecasts, and 3) a measure of the systematic (i.e., unconditional) bias in the forecasts. Depending on the choice of standard of reference, a particular decomposition may also contain terms relating to the degree of association between the reference forecasts and the observations. These decompositions yield analytical relationships between the respective skill scores and the correlation coefficient, document fundamental deficiencies in the correlation coefficient as a measure of performance, and provide additional insight into basic characteristics of forecasting performance. Samples of operational precipitation probability and minimum temperature forecasts are used to investigate the typical magnitudes of the terms in the decompositions. Some implications of the results for the practice of forecast verification are discussed.
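For the sample-climatology standard of reference, the decomposition described above is the algebraic identity SS = r² − (r − s_f/s_x)² − ((f̄ − x̄)/s_x)². A sketch with synthetic data (the helper and sample are mine, not the paper's operational forecasts) confirms the identity numerically:

```python
import numpy as np

def skill_score_terms(f, x):
    """Terms of the decomposition SS = r**2 - (r - s_f/s_x)**2
    - ((fbar - xbar)/s_x)**2 for the climatological skill score."""
    f = np.asarray(f, float)
    x = np.asarray(x, float)
    r = np.corrcoef(f, x)[0, 1]
    cond_bias = (r - f.std() / x.std()) ** 2              # nonsystematic (conditional) bias
    uncond_bias = ((f.mean() - x.mean()) / x.std()) ** 2  # systematic (unconditional) bias
    return r ** 2, cond_bias, uncond_bias

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, 500)            # synthetic "observations"
f = 0.8 * x + rng.normal(1.0, 1.0, 500)   # synthetic biased "forecasts"

r2, cb, ub = skill_score_terms(f, x)
ss = 1.0 - np.mean((f - x) ** 2) / np.mean((x - x.mean()) ** 2)
print(abs(ss - (r2 - cb - ub)) < 1e-10)   # True: the decomposition is exact
```

Note that superiority in r² alone says nothing about the two bias penalties, which is one of the deficiencies of the correlation coefficient the decomposition exposes.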

## Abstract

Meteorologists have devoted considerable attention to studies of the use and value of forecasts in a simple two-action, two-event decision-making problem generally referred to as the cost-loss ratio situation. An *N*-action, *N*-event generalization of the standard cost-loss ratio situation is described here, and the expected value of different types of forecasts in this situation is investigated. Specifically, expressions are developed for the expected expenses associated with the use of climatological, imperfect, and perfect information, and these expressions are employed to derive formulas for the expected value of imperfect and perfect forecasts. The three-action, three-event situation is used to illustrate the generalized model and the value-information results, by considering examples based on specific numerical values of the relevant parameters. Some possible extensions of this model are briefly discussed.
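In the standard two-action, two-event case that this paper generalizes, the expected value of forecasts is commonly expressed as V = (E_clim − E_fcst)/(E_clim − E_perf). A hedged sketch (invented numbers; the decision rule "protect iff p > C/L" is the usual one in this literature, not necessarily the paper's exact formulation):

```python
import numpy as np

def relative_value(probs, outcomes, cost, loss):
    """Relative economic value of probability forecasts in the two-action
    cost-loss situation: V = (E_clim - E_fcst) / (E_clim - E_perf).
    The forecast user protects whenever the stated probability exceeds C/L."""
    probs = np.asarray(probs, float)
    outcomes = np.asarray(outcomes, float)
    climatology = outcomes.mean()

    # Climatological user: always protect or never protect, whichever is cheaper.
    e_clim = min(cost, climatology * loss)
    # Forecast user: protect iff p > C/L; expense realized against what occurred.
    e_fcst = np.mean(np.where(probs > cost / loss, cost, outcomes * loss))
    # Perfect user: pays the protection cost exactly when the event occurs.
    e_perf = climatology * cost

    return (e_clim - e_fcst) / (e_clim - e_perf)

probs = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]  # invented forecasts
outcomes = [0, 0, 1, 1, 0, 1]           # corresponding events
v = relative_value(probs, outcomes, cost=1.0, loss=4.0)
print(round(v, 4))  # 0.6667
```

V = 1 for perfect forecasts and V = 0 for forecasts no better than climatology; the paper's generalization extends the same expected-expense comparison to *N* actions and *N* events.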
