Search Results

You are looking at 1–5 of 5 items for:

  • Author or Editor: Simon Mason
  • Monthly Weather Review
  • All content
Simon J. Mason

Abstract

The Brier and ranked probability skill scores are widely used as skill metrics of probabilistic forecasts of weather and climate. As skill scores, they measure the extent to which a forecast strategy outperforms a (usually simpler) reference forecast strategy. The most widely used reference strategy is that of “climatology,” in which the climatological probability (or probabilities, in the case of the ranked probability skill score) of the forecast variable is issued perpetually. The Brier and ranked probability skill scores are often considered harsh standards. It is shown that the scores are harsh because their expected value is less than 0 whenever nonclimatological forecast probabilities are issued for forecasts that carry no information about the outcome. As a result, negative skill scores can often hide useful information content in the forecasts. An alternative formulation of the skill scores, based on a reference strategy in which the outcome is independent of the forecast, is equivalent to using randomly assigned probabilities but is not strictly proper. Nevertheless, positive values of the Brier skill score with random guessing as the reference strategy correspond to positively sloped reliability curves, which is intuitively appealing because of the implication that the conditional probability of the forecast event increases as the forecast probability increases.
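
A minimal sketch of the climatology-referenced Brier skill score, illustrating why forecasts that carry no information about the outcome but are issued with nonclimatological probabilities score negatively on average (the function, variable names, and simulated example are illustrative, not taken from the paper):

```python
import numpy as np

def brier_skill_score(p_fcst, obs, p_clim):
    """Brier skill score of forecast probabilities p_fcst against a
    perpetual-climatology reference with probability p_clim.
    obs is a 0/1 array of event occurrences."""
    bs_fcst = np.mean((p_fcst - obs) ** 2)   # Brier score of the forecasts
    bs_ref = np.mean((p_clim - obs) ** 2)    # Brier score of the climatology reference
    return 1.0 - bs_fcst / bs_ref

# Illustration: outcomes independent of the forecasts (no information),
# climatological event probability 0.3, but a constant 0.5 is issued.
rng = np.random.default_rng(0)
obs = (rng.random(10_000) < 0.3).astype(float)
p_fcst = np.full(obs.shape, 0.5)
print(brier_skill_score(p_fcst, obs, p_clim=0.3))  # negative on average, reflecting the harshness discussed above
```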

Full access
Simon J. Mason and Andreas P. Weigel

Abstract

There are numerous reasons for calculating forecast verification scores, and considerable attention has been given to designing and analyzing the properties of scores that can be used for scientific purposes. Much less attention has been given to scores that may be useful for administrative reasons, such as communicating changes in forecast quality to bureaucrats and providing indications of forecast quality to the general public. The two-alternative forced choice (2AFC) test is proposed as a scoring procedure that is sufficiently generic to be usable on forecasts ranging from simple yes–no forecasts of dichotomous outcomes to forecasts of continuous variables, and that can be applied to deterministic or probabilistic forecasts without seriously degrading the more complex information when it is available. Although, as with any single verification score, the proposed test has limitations, it has broad intuitive appeal: the expected score of an unskilled set of forecasts (random guessing or perpetually identical forecasts) is 50%, and the score is interpretable as an indication of how often the forecasts are correct, even when the forecasts are expressed probabilistically and/or the observations are not discrete.
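
For the simplest case covered by the framework, dichotomous observations with forecasts on any ordinal scale (e.g., event probabilities), the 2AFC score reduces to the proportion of event/non-event pairs ranked in the correct order, which in this case coincides with the area under the ROC curve. A minimal sketch, with illustrative names rather than the paper's notation:

```python
import numpy as np

def two_afc_binary(forecasts, obs):
    """2AFC score for dichotomous observations (obs in {0, 1}) and forecasts on
    any ordinal scale (e.g., event probabilities). Returns the proportion of
    (event, non-event) pairs for which the event case received the higher
    forecast; ties count 0.5. A score of 0.5 corresponds to no skill."""
    forecasts = np.asarray(forecasts, dtype=float)
    obs = np.asarray(obs)
    f_event = forecasts[obs == 1]
    f_nonevent = forecasts[obs == 0]
    # Compare every event forecast with every non-event forecast.
    diff = f_event[:, None] - f_nonevent[None, :]
    hits = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return hits / (f_event.size * f_nonevent.size)

# Example: forecasts that tend to be higher when the event occurred.
print(two_afc_binary([0.8, 0.4, 0.6, 0.2, 0.7], [1, 0, 1, 0, 0]))  # > 0.5
```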

Full access
Andreas P. Weigel and Simon J. Mason

Abstract

This article builds on the study of Mason and Weigel, in which the generalized discrimination score D was introduced. This score quantifies whether a set of observed outcomes can be correctly discriminated by the corresponding forecasts (i.e., it is a measure of the skill attribute of discrimination). Because of its generic definition, D can be adapted to essentially all relevant verification contexts, ranging from simple yes–no forecasts of binary outcomes to probabilistic forecasts of continuous variables. For most of these cases, Mason and Weigel derived expressions for D, many of which turn out to be equivalent to scores already known under different names. However, no guidance was provided on how to calculate D for ensemble forecasts. This gap is aggravated by the fact that there are currently very few measures of forecast quality that can be applied directly to ensemble forecasts without requiring that probabilities be derived from the ensemble members prior to verification. This study seeks to close this gap. A definition is proposed of how ensemble forecasts can be ranked; the ranks of the ensemble forecasts can then be used as a basis for attempting to discriminate between the corresponding observations. Given this definition, formulations of D are derived that are directly applicable to ensemble forecasts.
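
One way to make the ranking idea concrete, offered here as a sketch of the general approach rather than the exact formulation derived in the paper: rank any two ensembles by the fraction of member pairs in which one exceeds the other (a Mann–Whitney-type comparison), and score the proportion of observation pairs whose ordering agrees with that ranking. Function names and the example are illustrative:

```python
import numpy as np

def ensemble_rank_stat(x, y):
    """Fraction of member pairs (xi, yj) with xi < yj (ties count 0.5).
    Values above 0.5 suggest ensemble y should be ranked above ensemble x."""
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[None, :]
    return ((x < y).sum() + 0.5 * (x == y).sum()) / (x.size * y.size)

def discrimination_ensemble(ensembles, obs):
    """Sketch of a discrimination score for ensemble forecasts: the proportion
    of observation pairs with distinct outcomes whose ordering is matched by
    the pairwise ranking of the corresponding ensembles (ties count 0.5)."""
    n = len(obs)
    agree, total = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if obs[i] == obs[j]:
                continue  # only pairs with distinct observed outcomes are informative
            total += 1
            f = ensemble_rank_stat(ensembles[i], ensembles[j])  # > 0.5 ranks forecast j above forecast i
            if f == 0.5:
                agree += 0.5
            elif (f > 0.5) == (obs[j] > obs[i]):
                agree += 1.0
    return agree / total if total else np.nan

# Example: four 3-member ensembles whose observations follow the ensemble ordering.
ens = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 2.5, 3.5], [5.0, 6.0, 7.0]])
obs = np.array([1.2, 4.8, 2.0, 6.1])
print(discrimination_ensemble(ens, obs))  # close to 1 for well-discriminating ensembles
```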

Full access
Simon J. Mason, Michael K. Tippett, Andreas P. Weigel, Lisa Goddard, and Balakanapathy Rajaratnam
Full access
Simon J. Mason, Jacqueline S. Galpin, Lisa Goddard, Nicholas E. Graham, and Balakanapathy Rajaratnam

Abstract

Probabilistic forecasts of variables measured on a categorical or ordinal scale, such as precipitation occurrence or temperatures exceeding a threshold, are typically verified by comparing the relative frequency with which the target event occurs given different levels of forecast confidence. The degree to which this conditional (on the forecast probability) relative frequency of an event corresponds with the actual forecast probabilities is known as reliability, or calibration. Forecast reliability for binary variables can be measured using the Murphy decomposition of the (half) Brier score, and can be presented graphically using reliability and attributes diagrams. For forecasts of variables on continuous scales, however, an alternative measure of reliability is required. The binned probability histogram and the reliability component of the continuous ranked probability score have been proposed as appropriate verification procedures in this context, but are subject to some limitations. A procedure is proposed that is applicable in the context of forecast ensembles and is an extension of the binned probability histogram. Individual ensemble members are treated as estimates of quantiles of the forecast distribution, and the conditional probability that the observed precipitation, for example, exceeds the amount forecast [the conditional exceedance probability (CEP)] is calculated. Generalized linear regression is used to estimate these conditional probabilities. A diagram showing the CEPs for ranked ensemble members is suggested as a useful method for indicating reliability when forecasts are on a continuous scale, and various statistical tests are proposed for quantifying the reliability.
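
A minimal sketch of the CEP calculation for a single ranked ensemble member, assuming a binomial GLM with a logit link as the "generalized linear regression" (the data, names, and use of statsmodels are illustrative, not the paper's):

```python
import numpy as np
import statsmodels.api as sm

def conditional_exceedance_probability(member_fcsts, obs):
    """For one ranked ensemble member, regress the indicator 'observation
    exceeds the forecast member' on the forecast value with a binomial GLM
    (logit link), and return the fitted CEP at each forecast value.
    member_fcsts: values of the k-th ranked member over many forecast cases.
    obs: the corresponding observations."""
    exceed = (obs > member_fcsts).astype(float)           # exceedance indicator
    X = sm.add_constant(np.asarray(member_fcsts, float))  # intercept + forecast value
    fit = sm.GLM(exceed, X, family=sm.families.Binomial()).fit()
    return fit.predict(X)                                 # CEP as a function of the forecast

# Synthetic example: if the k-th of m members were a perfect estimate of the
# k/(m+1) quantile, its CEP would be flat at roughly 1 - k/(m+1).
rng = np.random.default_rng(1)
truth = rng.gamma(2.0, 2.0, size=500)            # "observed" precipitation amounts
member = truth * rng.normal(1.0, 0.3, size=500)  # a noisy ensemble member
ceps = conditional_exceedance_probability(member, truth)
print(ceps.min(), ceps.max())  # range of fitted CEPs; a flat curve at the appropriate level indicates reliability
```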

Full access