1. Introduction
Forecasts are intended to provide information to the user. Forecast verification is the assessment of the quality of a single forecast or forecasting scheme (Jolliffe and Stephenson 2008). Verification should therefore assess the quality of the information provided by the forecast. It is important here to note the distinction between quality, which depends on the correspondence between forecasts and observations, and value, which depends on the benefits of forecasts to users (Murphy 1993). In this paper, we assume that verification is intended to quantitatively measure quality. Several scores and visualization techniques have been developed that measure certain desirable properties of forecasts with the purpose of assessing their quality. One of the most commonly used verification scores (Stephenson et al. 2008) is the Brier score (BS) (Brier 1950), which is applicable to probabilistic forecasts of binary events. The Brier skill score (BSS) measures the BS relative to some reference forecast, usually climatology. Murphy (1973) showed that the BS can be decomposed into three components: uncertainty, resolution, and reliability. These components give insight into different aspects of forecast quality. The first component, uncertainty, measures the inherent uncertainty in the process that is forecast. Resolution measures how much of this uncertainty is explained by the forecast. Reliability measures the bias in the probability estimates of the probabilistic forecasts. A perfect forecast has a resolution equal to the uncertainty (i.e., it fully explains it) and perfect reliability.
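For reference, and in notation of our own choosing (the paper's numbered equations are not reproduced in this excerpt), the Brier score for N forecast–observation pairs and Murphy's (1973) decomposition can be written as

$$\mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2 = \underbrace{\bar{o}(1-\bar{o})}_{\text{uncertainty}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(\bar{o}_k - \bar{o})^2}_{\text{resolution}} + \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(f_k - \bar{o}_k)^2}_{\text{reliability}},$$

where f_t is the forecast probability of occurrence, o_t ∈ {0, 1} is the observation, \bar{o} is the climatological frequency, and \bar{o}_k is the observed frequency conditional on each of the K unique forecast values f_k, each issued n_k times.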
Information theory provides a framework for measuring information and uncertainty (see Cover and Thomas 2006 for a good introduction). As forecast verification should assess the information that the forecaster provides to the user, using information theory for forecast verification appears to be a logical choice. A concept central to information theory is the measure of uncertainty named entropy. However, consulting two standard works on forecast verification, we noted that the word entropy is mentioned only three times in Jolliffe and Stephenson (2003) and not a single time in Wilks (1995). This indicates that the use of information–theoretical measures for forecast verification is not yet widespread, although important work has been done by Roulston and Smith (2002), Ahrens and Walser (2008), Leung and North (1990), and Kleeman (2002).
Leung and North (1990) used information–theoretical measures such as entropy and transinformation in relation to predictability. Kleeman (2002) proposed using the relative entropy between the climatic and the forecast distribution to measure predictability. Applications of information theory in the framework of predictability are mostly concerned with modeled distributions of states and how uncertainty evolves over time. Forecast verification, however, is concerned with comparing observed values with the forecast probability distributions. Roulston and Smith (2002) introduced the ignorance score, a logarithmic score for forecast verification, reinterpreting the logarithmic score (Good 1952) from an information–theoretical point of view. They related their score to the relative entropy between the forecast distribution and the “true” probability distribution function (PDF), which they defined as “the PDF of consistent initial conditions evolved forward in time under the dynamics of the real atmosphere.” Ahrens and Walser (2008) proposed information–theoretical skill scores to be applied to cumulative probabilities of multicategory forecasts. Very recently, Benedetti (2009) showed that the logarithmic score is the only score that simultaneously satisfies three basic requirements for a measure of forecast goodness: additivity, locality (which he interprets as exclusive dependence on physical observations), and strictly proper behavior. For a discussion of these requirements, see Benedetti (2009). Furthermore, Benedetti (2009) analyzed the Brier score and showed that it is equivalent to a second-order approximation of the logarithmic score. He concludes that the lasting success of the Brier score can be explained by the fact that it is an approximation of the logarithmic score. Benedetti also mentions the well-known and useful decomposition of the Brier score into uncertainty, resolution, and reliability as a possible reason for its popularity.
In this paper, we follow a route similar to Benedetti’s, but from a different direction. From an analogy with the Brier score, we propose to use the Kullback–Leibler divergence (or relative entropy) from the observation to the forecast distribution as a measure for forecast verification. The score is named the “divergence score.” When perfect observations are assumed, our measure is equal to the ignorance score or logarithmic score, and can be seen as a new reinterpretation of ignorance as the Kullback–Leibler divergence from the observation to the forecast distribution. Presenting a new decomposition into uncertainty, resolution, and reliability, analogous to the well-known decomposition of the Brier score (Murphy 1973), yields insight into the way the divergence score (DS) measures the information content of probabilistic binary forecasts. The decomposition can help acceptance and wider application of the logarithmic score in meteorology.
Section 2 of this paper presents the mathematical formulation of the DS and its components and shows the analogy with the Brier score components. Section 3 compares the divergence score with existing information–theoretical scores. It is shown that the DS is actually a reinterpretation of the ignorance score (Roulston and Smith 2002) and that one of the ranked mutual information scores defined by Ahrens and Walser (2008) is equal to the skill score version of the DS when the reliability component is neglected (perfect calibration assumed). A generalization to multicategory forecasts is presented in section 4. The inherent difficulty of formulating skill scores for ordinal category forecasts is also analyzed; we argue that it can be explained by explicitly distinguishing between information and information that is useful for some specific user. This distinction provides some insight into the roles of the forecaster and the user of the forecast. Section 5 presents an application to a real dataset of precipitation forecasts. Section 6 summarizes the conclusions and restates the main arguments for adopting the divergence score.
2. Definition of the divergence score
a. Background
By viewing the Brier score as a quadratic distance measure and translating it into the information–theoretical measures for uncertainty and divergence of one distribution from another, we formulate an information–theoretical twin of the Brier score and its components. First, some notation is introduced, followed by formulation of the Brier score. Then the information–theoretical concept of relative entropy is presented as an alternative scoring rule. In the second part of this section, it is shown how the new score can be decomposed into the classic Brier score components: uncertainty, resolution, and reliability.
b. Definitions
Consider a binary event, such as whether or not rain falls on a given day. This can be seen as a stochastic process with two possible outcomes. The outcome of the event can be represented in a probability mass function (PMF). For the case of binary events, the empirical PMF of the event after the outcome has been observed is a 2D vector, denoted by o = (1 − o, o)^T. Assuming certainty in the observations, o is either 0 (nonoccurrence) or 1 (occurrence), so the observation PMF places all probability mass on the category that was observed. Analogously, the probability forecast can be written as the PMF f = (1 − f, f)^T, where f is the forecast probability of occurrence.
In information theory, the difference between two probability distributions is measured by relative entropy or Kullback–Leibler divergence (DKL). The DKL is not symmetric and should therefore not be confused with a true distance such as the quadratic distance underlying the Brier score. In this paper, DKL(x‖y) will be referred to as the divergence from x to y (or of y from x). It measures the missing information when one assumes the distribution is y while the true distribution is x.
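For completeness, the definition used here (in our notation, measured in bits when base-2 logarithms are used) is

$$D_{\mathrm{KL}}(\mathbf{x}\,\|\,\mathbf{y}) = \sum_{i} x_i \log_2\frac{x_i}{y_i},$$

with the convention $0\log(0/y_i)=0$. It satisfies $D_{\mathrm{KL}}(\mathbf{x}\,\|\,\mathbf{y}) \ge 0$, with equality only if x = y, and in general $D_{\mathrm{KL}}(\mathbf{x}\,\|\,\mathbf{y}) \ne D_{\mathrm{KL}}(\mathbf{y}\,\|\,\mathbf{x})$.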
c. The divergence score
The divergence score can be interpreted as the information gain when one moves from the prior forecast distribution to the observation distribution. When this information gain is zero, the forecast already contained all the information that is in the observation and was therefore perfect. If the information gain from the forecast to the certain observation is equal to the climatological uncertainty, the forecast contained no more information than the climatology and was therefore useless. Another way to view the divergence score is as the remaining uncertainty about the true outcome after the forecast has been received (see Fig. 1).
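Although the paper's numbered equations are not reproduced in this excerpt, the score described here can be sketched (in the notation introduced above) as the average divergence of the observation PMF from the forecast PMF; because each binary observation is certain, it collapses to the logarithmic (ignorance) score:

$$\mathrm{DS} = \frac{1}{N}\sum_{t=1}^{N} D_{\mathrm{KL}}(\mathbf{o}_t\,\|\,\mathbf{f}_t) = -\frac{1}{N}\sum_{t=1}^{N}\bigl[o_t\log_2 f_t + (1-o_t)\log_2(1-f_t)\bigr].$$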
d. Decomposition
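The derivation is not reproduced in this excerpt; as a sketch in the notation used above, with \bar{\mathbf{o}}_k the observation PMF conditional on the kth unique forecast value \mathbf{f}_k (issued n_k times), \bar{\mathbf{o}} the climatological PMF, and H the entropy in bits, the decomposition takes the same form as Murphy's:

$$\mathrm{DS} = \underbrace{H(\bar{\mathbf{o}})}_{\text{uncertainty}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,D_{\mathrm{KL}}(\bar{\mathbf{o}}_k\,\|\,\bar{\mathbf{o}})}_{\text{resolution}} + \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,D_{\mathrm{KL}}(\bar{\mathbf{o}}_k\,\|\,\mathbf{f}_k)}_{\text{reliability}}.$$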
e. Relation to Brier score and its components
For the binary case, the Brier score can be seen as a second-order approximation of the divergence score (also noted by Benedetti 2009). Both scores attain their minimum only for a perfect forecast. When the forecast is not perfect, the Brier score is symmetric in the error in probabilities, while the divergence score is not, except for the case where the true probability is 0.5. The divergence score is therefore a double-valued function of the Brier score (Roulston and Smith 2002). Consequently, when two forecasting systems are compared, the forecasting system with the higher Brier score may have the lower divergence score.
The uncertainty component in the Brier score is a second-order approximation of the uncertainty term in the divergence score (entropy), with the same location of zero uncertainty (100% probability of one of the two events) and maximum uncertainty (equiprobable events with 50% probability, see Fig. 2). In the Brier score, the maximum of the uncertainty component is 0.5, while in the divergence score it is 1 (bit).
Resolution in the Brier score is the variance of the conditional mean probabilities: a mean of squared deviations from the climatic probability. Resolution in the divergence score is a mean of divergences, which are asymmetric in the probabilities. The resolution in both the Brier and the divergence score can take values between zero and the uncertainty. In both scores, it can be seen as the amount of uncertainty explained. The resolution in the Brier score is a second-order approximation of the resolution of the divergence score, with its minimum of zero in the same location (the climatic probability) and its maximum possible value equal to the inherent uncertainty of the forecast event (see Fig. 3).
Reliability in the Brier score is bounded between 0 and 1, whereas in the divergence score the reliability can reach infinity (see Fig. 4). This happens when wrong deterministic forecasts are issued. In general, the reliability in the divergence score is especially sensitive to events with near-deterministic wrong forecasts. Overconfident forecasting is therefore penalized more heavily than in the Brier score.
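A small numerical illustration of this sensitivity (a Python sketch of our own, not part of the original paper): for an event that did not occur, the single-forecast Brier penalty saturates at 1 as the forecast probability approaches certainty, while the divergence penalty grows without bound.

```python
import numpy as np

# Forecast probabilities of occurrence for an event that did NOT occur (o = 0).
f = np.array([0.5, 0.9, 0.99, 0.999, 0.9999])

brier = f**2                  # single-forecast contribution to the Brier score
divergence = -np.log2(1 - f)  # single-forecast contribution to the divergence score (bits)

for fi, b, d in zip(f, brier, divergence):
    print(f"f = {fi:6.4f}   Brier = {b:6.4f}   divergence = {d:7.3f} bits")
```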
f. Normalization to a skill score
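The normalization itself does not appear in this excerpt; as a sketch, by analogy with the Brier skill score and using the components above, a divergence skill score relative to climatology can be written as

$$\mathrm{DSS} = 1 - \frac{\mathrm{DS}}{H(\bar{\mathbf{o}})} = \frac{\text{resolution} - \text{reliability}}{\text{uncertainty}},$$

so that a perfectly reliable climatological forecast scores zero and a perfect forecast scores one.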
Summarizing, the divergence score and its components combine two types of measures to replace the quadratic components in the Brier score decomposition. First, the quadratic distances between probability distributions are replaced by Kullback–Leibler divergences, which are asymmetric. Care should therefore be taken in which direction the divergence is calculated. Second, the polynomial uncertainty term is replaced by the entropy of the climatology distribution. The total scores and components are visualized in Fig. 1.
3. Relation to existing information–theoretical scores
a. Relation to the ranked mutual information skill scores
b. Equivalence to the ignorance score
Using the decomposition presented in (12), the ignorance score for a series of forecasts can now also be decomposed into a reliability, a resolution, and an uncertainty component. This decomposition, until now applied only to the Brier score, has proven very useful for gaining insight into different aspects of forecast quality. Furthermore, the new interpretation of the ignorance score as the average divergence of the observation PMFs from the forecast PMFs links more directly to results from information theory.
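As a concrete illustration, the following Python sketch (our own; the authors provide Matlab and Octave scripts, see section 6) computes the divergence score and the decomposition described above for a series of binary probability forecasts, conditioning on the unique forecast values as in the Brier case.

```python
import numpy as np

def divergence_score_decomposition(f, o):
    """Divergence (ignorance) score in bits and its decomposition
    DS = uncertainty - resolution + reliability, for forecast
    probabilities f and binary observations o (equal-length arrays)."""
    f = np.asarray(f, dtype=float)
    o = np.asarray(o, dtype=float)
    n = len(o)

    def entropy(p):
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def dkl(p, q):
        # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q), in bits.
        p = np.clip(p, 1e-15, 1 - 1e-15)
        q = np.clip(q, 1e-15, 1 - 1e-15)
        return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

    o_bar = o.mean()                        # climatological frequency
    ds = np.mean(dkl(o, f))                 # total divergence score
    unc = entropy(o_bar)                    # uncertainty component

    res = rel = 0.0
    for fk in np.unique(f):                 # condition on unique forecast values
        idx = (f == fk)
        nk, ok_bar = idx.sum(), o[idx].mean()
        res += nk * dkl(ok_bar, o_bar) / n  # resolution: uncertainty explained
        rel += nk * dkl(ok_bar, fk) / n     # reliability: miscalibration penalty
    return ds, unc, res, rel                # ds equals unc - res + rel

# Toy example: ten forecasts (rounded to 10% bins) and the observed outcomes.
f = [0.1, 0.1, 0.8, 0.8, 0.8, 0.3, 0.3, 0.9, 0.9, 0.5]
o = [0,   0,   1,   1,   0,   0,   1,   1,   1,   1  ]
ds, unc, res, rel = divergence_score_decomposition(f, o)
print(f"DS = {ds:.4f}, UNC = {unc:.4f}, RES = {res:.4f}, REL = {rel:.4f}")
```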
4. Generalization to multicategory forecasts
a. Nominal category forecasts
When extending verification scores from forecasts of binary events to multicategory forecasts, it is important to differentiate between nominal and ordinal forecast categories. In nominal category forecasts, there is no natural ordering of the categories into which the forecast event is classified. In this case, there is basically one question relevant for assessing forecast quality: how well did the forecaster predict the category to which the outcome of the event belongs? There is no difference between the categories in which the event did not fall. Although the probabilities attached to those categories convey information at the moment the forecast is received, the only probability relevant for verification, after the observation has been made, is the probability that the forecaster attached to the event that did occur. The quadratic score of Brier (1950) can also be used for multicategory forecasts. In that case, f_t and o_t are the PMFs of the event before and after the observation, now having more than two elements. The problem with this score is that it depends on how the forecast probabilities are spread over the categories that did not occur. For nominal events this dependence is not desirable, as all probability attached to the events that did not occur is equally wrong.
The DS does not suffer from this problem because it depends only on the forecast probability of the event that did occur (19). The DS as presented in (3) can be applied directly to nominal category forecasts. A property of the score is that a high number of categories makes it more difficult to obtain a good score. To compare nominal category forecasts with different numbers of categories, the DS should be normalized to a skill score (DSS); see (13).
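A minimal sketch in our own notation, for K nominal categories with forecast PMF f_t and observed category index j_t:

$$\mathrm{DS} = -\frac{1}{N}\sum_{t=1}^{N}\log_2 f_t(j_t), \qquad \mathrm{DSS} = 1 - \frac{\mathrm{DS}}{H(\bar{\mathbf{o}})},$$

where \bar{\mathbf{o}} is the climatological PMF over the K categories; for equiprobable categories $H(\bar{\mathbf{o}}) = \log_2 K$, which makes explicit how the normalization compensates for the number of categories.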
b. Ordinal category forecasts
When dealing with forecasts of events in ordinal categories, there is a natural ordering in the categories. This means that the cumulative distribution function (CDF) starts to become meaningful. There are now two possible questions that can be relevant for verification of the probabilistic forecast.
- How well did the forecaster predict the category to which the outcome of the event belongs?
- How well did the forecaster predict the exceedance probabilities of the thresholds that define the category boundaries?
c. The ranked divergence score
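The formulation itself is not reproduced in this excerpt. As a sketch only, in the spirit of the ranked probability score (Epstein 1969; Murphy 1970, 1971), a ranked divergence score can be written as the sum of the binary divergence scores for the K − 1 exceedance events that define the category boundaries,

$$\mathrm{RDS} = \sum_{j=1}^{K-1}\mathrm{DS}_j,$$

where DS_j is the binary divergence score for (non)exceedance of threshold j; the two skill score formulations mentioned in the conclusions then differ in whether the per-threshold skill scores or the per-threshold absolute scores are weighted equally.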
d. Relation to ranked mutual information
Apart from the question of whether to include the reliability component, another question is how to weight the scores for the different thresholds to arrive at one aggregated score. As every binary decision by some user with a certain cost–loss ratio can be associated with some threshold, the weighting reflects the importance of the forecast to the various users. No matter what aggregation method is chosen, there will always be an implicit assumption about the users’ importance and stakes in a decision-making process. This is inherent to summarizing forecast performance in one score. A diagram that plots the two skill score components against the thresholds contains the relevant information characteristics for different users. In this way, each user can look up the score for the individual threshold that is relevant to his decision problem and compare it with the performance of some other forecasting system on that threshold.
e. Information and useful information
Forecasting is concerned with the transfer of information about the true outcome of uncertain future events that are important to a given specific user. The information in the forecast should reduce the uncertainty about the true outcome. It is important to note the difference between two estimations of this uncertainty. First, there is the uncertainty a receiver of a forecast has about the truth, estimated in hindsight, knowing the observation. This uncertainty is measured by the divergence score. Second, there is the perceived uncertainty about the truth in the eyes of the user after adopting the forecast, which is measured by the entropy of the forecast. The first depends on the observation, while the second does not. We note that information–theoretical concepts measure information objectively, without considering its use. The usefulness of information is different for each specific user. The amount of useful information in a forecast can explicitly be subdivided into two elements:
- reduction of uncertainty about the truth (the information–theory part); and
- the usefulness of this uncertainty reduction (the user part).
The second element of useful information in a forecast, usefulness, is user and problem specific. A forecast is useful if it is about a relevant subject. Communicating the exceedance probability of a threshold that is not a threshold for action for a specific user does not help him much. Usefulness also depends on how much importance is attached to events. This can be, for example, the costs associated with a certain outcome–action combination, typically reflected in a so-called payoff matrix. Information–theoretical scores also implicitly make an assumption about the usefulness of events: the assumption is that the user attaches his own importance to the events by placing repeated proportional bets, each time reinvesting his money. This is referred to as Kelly betting [for a more detailed explanation, see Kelly (1956) and Roulston and Smith (2002)]. In other words, the assumption is that the user maximizes the utility of his information in a fair game by strategically deciding on his importance or stakes.
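To make the Kelly-betting assumption concrete (a sketch of our own, following Roulston and Smith 2002; the symbols are illustrative): a gambler who each time divides all of his wealth over the possible outcomes in proportion to the forecast PMF f_t, at fair odds set from the climatological PMF c, multiplies his wealth at time t by f_t(j_t)/c(j_t), where j_t is the outcome that occurs. His average wealth growth, in bits (doublings) per forecast, is

$$\frac{1}{N}\sum_{t=1}^{N}\log_2\frac{f_t(j_t)}{c(j_t)} = \mathrm{DS}_{\mathrm{clim}} - \mathrm{DS},$$

so that maximizing long-run wealth growth under this betting scheme is equivalent to minimizing the divergence (ignorance) score.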
The explicit consideration of usefulness of information brings up an interesting question about the roles of the forecaster and the user of forecasts. The divergence score measures the remaining uncertainty after adopting the forecast, which is completely independent of the user. This focuses the score on evaluating a main task of the forecaster, which is to give the best possible estimate of probabilities. It might also be argued, however, that a forecaster should not just reduce uncertainty, but also deliver utility for users’ decisions. To be able to judge forecasts on that criterion, assumptions need to be made about the users and their decision problems. When scores based on these objectives are used to improve forecasting procedures, maximizing these two objectives does not always lead to the same answer. In such cases an improvement of the utility of forecasts may coincide with a reduction in informativeness. More research is needed to investigate the nature of this trade-off, which is strongly related to model complexity, overfitting, and calibration versus validation.
5. An example: Rainfall forecasts in the Netherlands
As an illustration of the practical application of the divergence score and its decomposition, it was applied to a series of probabilistic rainfall forecasts and corresponding observations for the measurement location De Bilt, Netherlands. The forecast series consists of daily forecasts of the probability of a precipitation amount equal to or larger than 0.3 mm, for 0–5 days ahead. The forecasts are derived using output of both the European Centre for Medium-Range Weather Forecasts (ECMWF) deterministic numerical weather prediction model and the ECMWF ensemble prediction model. Predictors from both models are used in a logistic regression to produce probability forecasts for precipitation. The data cover the period from December 2005 to November 2008, in which the average probability was 0.4613, leading to an initial uncertainty of 0.9957 bits.
Figure 5 confirms the expectation that both the Brier skill score and the divergence skill score decline with increasing lead time. It also shows that the forecasts possess skill over annual climatology up to a lead time of at least 5 days. The dashed lines show the potential skill that could be attained after recalibration. The estimation of this potential skill, however, depends on a correct decomposition. The decompositions of both the Brier and the divergence score need enough data (large enough n_k) to estimate the conditional distributions for all unique forecast values f_k; when the probabilities are given at a fine resolution, only a few forecast–observation pairs are available per unique value.
A solution for estimating the components with limited data is to round the forecast probabilities to a limited set of values. In this way, fewer conditional distributions need to be estimated and more data are available per distribution.
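A sketch of such rounding in Python (the 5% bin width is an assumed example; in practice the extreme bins may need special treatment, because a rounded probability of exactly 0 or 1 that turns out wrong makes the divergence score unbounded):

```python
import numpy as np

def round_to_bins(f, width=0.05):
    """Round forecast probabilities to the nearest multiple of `width`,
    so that fewer conditional distributions have to be estimated."""
    return np.clip(np.round(np.asarray(f, dtype=float) / width) * width, 0.0, 1.0)

f_raw = [0.03, 0.12, 0.47, 0.61, 0.98]
print(round_to_bins(f_raw))   # -> [0.05, 0.10, 0.45, 0.60, 1.00] (up to float precision)
```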
Figure 6 shows that for these 3 yr of data, using finer-grained probabilities as forecasts leads to an increasing overestimation of the reliability component. The skill scores themselves are not sensitive to this overestimation, because the lack of data causes a compensating overestimation of resolution. The potential skill, however, should be interpreted with caution, as solving reliability issues by calibrating on the basis of biased estimates of reliability does not lead to a real increase in skill. The figure also shows that too coarse-grained probabilities lead to a real loss of skill. In this case, giving the forecasts in 5% intervals seems to be the minimum needed to convey the full information that is in the raw forecast probabilities.
Figure 7 sheds more light on the relation between the Brier skill score and the divergence skill score, based on 5-day-ahead forecasts from a second dataset, which covers February 2008 to December 2009. For this set, the forecast probabilities ranged from 1% to 99%. The black dots indicate the scores attained for single forecast–observation pairs. They show that the BSS and DSS have a monotonic relation as scoring rules. The limits of this relation are at (1, 1) for perfect forecasts and, in this case, (−∞, −3.095) for certain but wrong forecasts. The worst forecast was a 98% probability of rain on a day when no rain fell.
The total scores for different weeks of forecasts are plotted as gray dots; each is the average of a set of seven black dots. Because the relation between the single-forecast scores is not a straight line, scatter occurs in the relation between the weekly average scores, which is therefore no longer monotonic. The scatter implies that two series of forecasts can be ranked differently by the Brier score and the divergence score. In this example, the scatter is relatively small (r^2 = 0.9938) and will probably have no significant implications, but it would be larger if many overconfident forecasts were issued. An interesting example of differently ranked forecast series is given by the two weeks indicated by the triangles, where the scores disagree on which of the two weeks was forecast better than climate. The downward-pointing triangle marks the score for forecasts in week A, whose performance according to the divergence score was worse than the climatological forecast (DSS = −0.0758) but according to the Brier score slightly better than climate (BSS = 0.0230). Conversely, the upward-pointing triangle marks week B, where the forecasts were worse than climate according to the Brier score (BSS = −0.0355) but still contained more information than climate according to the DSS (DSS = 0.0066).
Given that the scatter in this practical example is small, the Brier score appears to be a reasonable approximation of the divergence score and is useful for getting an idea of the quality of forecasts. More practical comparisons are needed to determine whether the approximation can lead to significantly different results in practice; differences are mostly expected when extreme probabilities are forecast.
The severe penalty the divergence score gives for errors in the extreme probabilities, which is sometimes seen as a weakness, should actually be viewed as one of its strengths. As the saying goes, “you have to be cruel to be kind.” It is constructive to give an infinite penalty to a forecaster who issues a wrong forecast that was supposed to be certain. This is fair because the value that a user would be willing to risk when trusting such a forecast is also infinite.
6. Conclusions
Analogous to the Brier score, which measures the squared Euclidean distance between the distributions of observation and forecast, we formulated an information–theoretical verification score that measures the Kullback–Leibler divergence between those distributions. More precisely, our score measures the divergence from the distribution of the event after the observation to the distribution that is the probability forecast. Our divergence score is a reinterpretation of the ignorance score or logarithmic score, which was previously not defined as a Kullback–Leibler divergence. Extending the analogy to the useful and well-known decomposition of the Brier score, the divergence score can be decomposed as uncertainty − resolution + reliability. For binary events, the Brier score and its components are second-order approximations of the divergence score and its components.
The divergence score and its decomposition generalize to multicategory forecasts, for which a distinction can be made between nominal and ordinal category forecasts. Scores based on the cumulative distribution over ordinal categories can be seen as combinations of binary scores at multiple thresholds. How the scores for the different thresholds should be weighted relative to each other depends on the user of the forecast. Scores on the cumulative distribution are therefore not exclusively dependent on physical observations, but contain subjective weights for the different thresholds. Two formulations of a ranked divergence skill score have been presented. The first weights the skill scores relative to climate equally, while the second weights the absolute scores equally. The second ranked divergence skill score is equal to the existing ranked mutual information skill score for the case of perfectly calibrated forecasts, but additionally includes a reliability component that measures miscalibration.
In forecasting, a distinction can be made between information and useful information in a forecast. The latter cannot be evaluated without a statement about the context in which the forecast will be used. The former depends only on how the forecasts relate to the observations and is therefore objective. In the authors’ opinion, information should therefore be the measure for forecast quality. It can be measured using the logarithmic score, which can now be interpreted as the Kullback–Leibler divergence of the forecast from the observation. Useful information, or forecast value, on the other hand, is a different aspect of forecast “goodness” (Murphy 1993), which should be evaluated while explicitly considering the decision problems of the users of the forecast.
The Brier score can be used as an approximation of quality or as an exact measure of value under the assumption of a group of users with uniformly distributed cost–loss ratios. In our opinion, these two applications should be clearly separated. When quality is to be assessed, information–theoretical scores should be preferred. If an approximation is sufficient, the Brier score can still be used, with the advantage that it is well understood and that extensive experience exists with its use. When extreme probabilities have to be forecast, however, the differences might become significant and the divergence score is to be preferred on theoretical grounds.
In case value is to be measured, an inventory of the users of the forecasts should be made to assess the total utility. When the user base is investigated explicitly, a better estimator for utility than the Brier score can probably be defined. Using the Brier score as a surrogate for forecast value, implicitly assuming that the emergent utility function is appropriate for a specific type of forecast, is clearly unsatisfactory. In this respect, it is important to stress that the divergence score does not measure value, but quality. Only in a very unrealistic case (a bookmaker offering fair odds) does a clear relation exist between the two. It might be argued that for practitioners in meteorology, quality is of more concern than value, because evaluating the latter in fact means evaluating decisions rather than forecasts. (To facilitate the use of the score and its decomposition, scripts that can be used in Matlab and Octave are available online at http://divergence.wrm.tudelft.nl.)
Acknowledgments
The authors thank the Royal Netherlands Meteorological Institute (KNMI) and Meteo Consult for kindly providing the forecast and observation data used in this research. We also thank both anonymous reviewers for their constructive comments.
REFERENCES
Ahrens, B., and A. Walser, 2008: Information-based skill scores for probabilistic forecasts. Mon. Wea. Rev., 136, 352–363.
Benedetti, R., 2009: Scoring rules for forecast verification. Mon. Wea. Rev., 138, 203–211.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Cover, T., and J. Thomas, 2006: Elements of Information Theory. 2nd ed. Wiley-Interscience, 776 pp.
Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.
Good, I., 1952: Rational decisions. J. Roy. Stat. Soc., 14B, 107–114.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 254 pp.
Jolliffe, I. T., and D. B. Stephenson, 2008: Proper scores for probability forecasts can never be equitable. Mon. Wea. Rev., 136, 1505–1510.
Kelly Jr., J., 1956: A new interpretation of information rate. IRE Trans. Info. Theory, 2, 185–189.
Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci., 59, 2057–2072.
Laio, F., and S. Tamea, 2007: Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrol. Earth Syst. Sci., 11, 1267–1277.
Leung, L., and G. North, 1990: Information theory and climate prediction. J. Climate, 3, 5–14.
Mason, S., 2008: Understanding forecast verification statistics. Meteor. Appl., 15, 31–40.
Murphy, A. H., 1970: The ranked probability score and the probability score: A comparison. Mon. Wea. Rev., 98, 917–924.
Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.
Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660.
Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757.
Wilks, D., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.