## Abstract

The information content, that is, the predictive capability, of a forecast system is often quantified with skill scores. This paper introduces two ranked mutual information skill (RMIS) scores, RMIS_{O} and RMIS_{Y}, for the evaluation of probabilistic forecasts. These scores are based on the concept of mutual information of random variables as developed in information theory. Like the ranked probability skill score (RPSS)—another and often applied skill score—the new scores compare cumulative probabilities for multiple event thresholds. The RMIS_{O} quantifies the fraction of information in the observational data that is explained by the forecasts. The RMIS_{Y} quantifies the amount of useful information in the forecasts. Like the RPSS, the new scores are biased, but they can be debiased with a simple and robust method. This and additional promising characteristics of the scores are discussed with ensemble forecast assessment experiments.

## 1. Introduction

Weather forecast systems have to be evaluated and the forecast quality of weather elements like temperature or precipitation has to be assessed. This paper introduces and discusses a pair of information-theoretical skill scores for the quality assessment of ensemble prediction system (EPS) forecasts. An EPS predicts forecast probabilities of weather events by the integration of an ensemble of numerical weather predictions, which are initialized with slightly different states (Ehrendorfer 1997; Palmer 2000). The forecast probability is often estimated as the relative frequency of occurrence of an event among the limited number of ensemble members (cf. discussion in Katz and Ehrendorfer 2006). The EPS forecasts approximate probability densities, and their usefulness in forecasting has to be assessed by comparison to observations. Here, probabilistic forecasts of precipitation will be compared to time series of observed precipitation *x*_{o}: *x*_{o,t}; *t* = 1, 2, . . . , *T*. The simulated precipitation series of a single forecast model run is denoted with *x*_{f}: *x*_{f,t}; *t* = 1, 2, . . . , *T*.

In the forecasting of precipitation and other weather elements, it is of common interest whether a forecast value is above or below some critical value *c* (e.g., *c* = 40 mm day^{−1}) and whether or not a warning should be delivered. Therefore, often the probability forecasts of dichotomous events are assessed. There are two forecast categories, the event *x*_{f,t} ≤ *c* and the complementary event *x*_{f,t} > *c*, and the probabilistic forecasts are *y*_{t} = *P*(*x*_{f,t} ≤ *c*) and 1 − *y*_{t}, respectively. If the forecast for the event *x*_{f,t} ≤ *c* is given, then there is no additional information in the forecast for the complementary event. These forecasts have to be compared against the observed occurrence: *o*_{t} = 1 if the event *x*_{o,t} ≤ *c* occurred and *o*_{t} = 0 if not (with perfect observations assumed here).

The assessment is often quantified by scores, and "the most commonly used example" (Wilks 2006) is the Brier score (BS), defined as

BS = (1/*T*) Σ^{T}_{t=1} (*y*_{t} − *o*_{t})^{2}. (1)

Following the original definition in Brier (1950), Eq. (1) defines the half Brier score, but it is now widely referred to as the BS. The *o*_{t} can be interpreted as the ex post probabilities *P*(*x*_{o,t} ≤ *c*) ∈ {0, 1}. Thus, the BS is a bivariate statistic applied on the time series of the event probabilities, *y* and *o*, and delivers the mean-squared error of the probabilistic forecasts.

Brier (1950) introduced more generally a quadratic score as the mean-squared error of probability forecasts for multiple mutually exclusive and exhaustive categories (e.g., the lowest, middle, and highest tercile categories) with category boundaries *c*_{j}; *j* = 1, 2, . . . , *J*. For each date and category there is a pair of probability masses *y*_{jt} and *o*_{jt}, and the score is defined as the temporal mean of the averaged *J* + 1 squared differences of *y*_{jt} and *o*_{jt}. It is not recommended to apply this quadratic score with more than two categories (Daan 1985; Stanski et al. 1989) because it does not take into account the distance between forecast categories and the observed category and because of its strong dependency on the prescription of category boundaries. The BS given in Eq. (1) is exactly half of the two-category (*J* = 1) quadratic score.

Instead, the ranked probability score (RPS) is suggested in the literature (see Jolliffe and Stephenson 2003, chapter 7; Wilks 2006, chapter 7) if more than one threshold, that is, *J* > 1, is to be considered. This score compares the cumulative forecast probabilities *Y*_{jt} = *P*(*x*_{f,t} ≤ *c*_{j}) = Σ^{j}_{k=1} *y*_{kt} for multiple thresholds *c*_{j}; *j* = 1, 2, . . . , *J* with the ex post probabilities *O*_{jt} = *P*(*x*_{o,t} ≤ *c*_{j}) = Σ^{j}_{k=1} *o*_{kt} ∈ {0, 1} through

RPS = (1/*T*) Σ^{T}_{t=1} Σ^{J}_{j=1} (*Y*_{jt} − *O*_{jt})^{2}. (2)

Thus, the RPS is a distance measure applied on the time series of cumulative probabilities, *Y* and *O*. In the special case of *J* = 1 the RPS reduces to the BS, as defined in Eq. (1), since *Y*_{1t} = *y*_{1t} and *O*_{1t} = *o*_{1t}.
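For illustration, the BS of Eq. (1) and the RPS can be computed as follows (a minimal Python sketch of the definitions above; the function names and the list-based input layout are our own choices, not part of the paper):

```python
def brier_score(y, o):
    """Half Brier score of Eq. (1): mean-squared error of event probabilities."""
    return sum((yt - ot) ** 2 for yt, ot in zip(y, o)) / len(y)

def ranked_probability_score(Y, O):
    """RPS: temporal mean of the summed squared differences of the
    cumulative probabilities over the J thresholds.
    Y[t][j], O[t][j] hold cumulative probabilities per date t, threshold j."""
    return sum(sum((Yjt - Ojt) ** 2 for Yjt, Ojt in zip(Yt, Ot))
               for Yt, Ot in zip(Y, O)) / len(Y)

# With J = 1 the RPS reduces to the BS:
y = [0.8, 0.1, 0.6, 0.3]  # forecast probabilities y_t
o = [1.0, 0.0, 1.0, 0.0]  # observed occurrences o_t
assert abs(ranked_probability_score([[p] for p in y], [[p] for p in o])
           - brier_score(y, o)) < 1e-12
```

The assertion mirrors the reduction stated in the text: for a single threshold the two scores coincide.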

The values of scores (*S*), like the BS or RPS, are often compared against score values for reference forecasts. This relative performance of the system is measured by skill scores (SS). Here, as often, the reference forecast is chosen to be the observed climatology *O*, given by the cumulative probabilities *O*_{j} in the case of the RPS. With *S*(*O*, *O*) the score value of a perfect forecast (i.e., zero in the case of the BS and RPS), the skill score SS is defined by

SS = [*S*(*Y*, *O*) − *S*(*Y*_{ref}, *O*)]/[*S*(*O*, *O*) − *S*(*Y*_{ref}, *O*)] (3)

for scores applied on cumulative probabilities, or defined equivalently with small letters for scores applied on probability masses. The skill score value is zero if the performance of the evaluated forecast system equals the performance of the reference forecast system. This standardization of the forecast score through a reference score is mainly made to increase the comparability of evaluation results across different situations (season, region, reference dataset, etc.). Comparison of the Brier and ranked probability forecast scores against scores for the climatological forecast is effectively done by score normalization with a measure of the uncertainty inherent in the events being forecast [*O*_{1}(1 − *O*_{1}) in the case of the Brier skill score (BSS), with *O*_{1} the observed event climatology (Murphy 1973; Mason 2004)].

If extreme event thresholds *c*_{j} are chosen in the evaluation, which have small but finite climatological uncertainty [as in Finley's classical tornado forecasts, cf. Murphy (1996)], and if these events are neither observed nor forecast in the evaluation period, then the BSS and ranked probability skill score (RPSS) assess the forecasts as perfect, with *S*(*Y*, *O*) = 0 and consequently SS = 1. But instead of the model system being perfect, the evaluation period is too short and provides no information on the system's forecast uncertainty, and thus the evaluation is inconclusive. In this case an appropriate skill score should not delude the user into thinking a system is perfect but should indicate the shortage of evaluated events.

Uncertainty and information are the basic concepts of information theory. Since forecast assessment has to deal with uncertainty, this has motivated the introduction of scores based on the information content in the realizations of some considered process *X*. The information content, or surprisal *I*, in the realization of an event in the *j*th category, *c*_{j−1} < *x*_{jt} ≤ *c*_{j}, is defined as

*I*(*x*_{jt}) = −log *P*(*x*_{jt}). (4)

Here the process is either forecasting or observation, and thus *P*(*x*_{jt}) is *y*_{jt} or *o*_{jt}, respectively. The information concept has appealing properties. For example, if a perfectly certain event is considered [i.e., *P*(*x*_{j′t}) = 1 and *P*(*x*_{jt}) = 0 for all *j* ≠ *j*′], the future outcome is not surprising and its realization yields no additional information: *I* = 0. Additionally, there is no negative surprise and there is no information loss in any realization. Therefore, a valid forecast system delivers positive information.

Examples of scores based on the concept of information are the logarithmic score (Good 1952; Selten 1998), the ignorance score (Roulston and Smith 2002), or the mutual information (Leung and North 1990; DelSole 2004; Molini et al. 2006). Here, the ignorance score (Roulston and Smith 2002) shall briefly be discussed. For one forecast and observation pair the ignorance is given by

IGN_{t} = −log *y*_{j(t)t}, (5)

with *y*_{j(t)t} the probability mass of the forecast in the observed category *j*(*t*) at date *t*. To be well defined, forecasters should replace any zero forecast probability value in the IGN_{t} determination with a very small positive probability ɛ, as motivated in Roulston and Smith (2002). The IGN_{t} is a statistic defined for single dates and thus is a scoring rule (Murphy 1997). As for the BS and RPS, the ignorance has to be averaged over a longer dataset of forecast–observation pairs to be representative. This defines the ignorance score:

IGN = (1/*T*) Σ^{T}_{t=1} IGN_{t}. (6)

In the case of perfect forecasts all of the *y*_{j(t)t} equal one and there is no ignorance. If the forecaster always delivers *y*_{j(t)t} = ɛ, then the ignorance is maximized at −log ɛ, independently of the distance between the forecast categories with nonzero probability masses and the observed category. Therefore, the IGN, like the quadratic score, is insensitive to the distance between categories and so should not be recommended.
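A minimal sketch of the IGN computation, including the ɛ safeguard (the value of ɛ and the input layout are our own assumptions, not prescribed by the paper):

```python
import math

EPS = 1e-10  # small positive floor replacing zero forecast probabilities

def ignorance_score(y_obs):
    """Mean ignorance: y_obs[t] is the forecast probability mass y_{j(t)t}
    assigned to the category observed at date t (natural log, units of nats)."""
    return sum(-math.log(max(p, EPS)) for p in y_obs) / len(y_obs)

print(ignorance_score([1.0, 1.0]))  # 0.0 for perfect forecasts
```

A forecaster who always assigns ɛ to the observed category scores −log ɛ, regardless of how close the remaining probability mass sits to the observed category, which is the insensitivity criticized above.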

_{t}In summary, the available information scores consider the probability masses *y _{jt}* and

*o*, like the Brier score or, more generally, Brier’s quadratic score. This paper introduces two new scores that adopt the idea of mutual information between two processes but, instead of the usual random variables precipitation forecast and precipitation observation, the cumulative probabilities

_{jt}*Y*and

_{j}*O*are compared. Therefore, just like in the RPS, the new scores are multithreshold scores and the distance of categories is considered in the evaluation of probabilistic forecasts. This and further advantages of the new scores are discussed in the following sections after the introduction of the scores.

_{j}## 2. Ranked mutual information skill scores

Before the new scores are defined, the concept of mutual information will briefly be introduced. In Eq. (4) the information content *I* in each realization of some stochastic process *X*, for example, precipitation forecasting or observation, has been introduced. The average information of a stochastic process, which is discretized into *J* + 1 categories as above, is

*H*(*X*) = −Σ^{J+1}_{j=1} *P*(*x*_{j}) log *P*(*x*_{j}). (7)

The average information is a measure of uncertainty in the process and is called the information entropy of *X* (Shannon 1948). By convention 0 log0 = 0 as justified by continuity. Here, the log is to the base *e* and thus the entropy is measured in the natural units nats. For more detailed introductions into information theory we refer to Cover and Thomas (1991) or, for example, DelSole (2004) and Molini et al. (2006) in the literature related to weather forecasting.

Equation (7) states that the uncertainty in a process is defined by its probability distribution. In the case of a uniform distribution we are maximally uncertain about a future outcome. If the probability mass distribution is heavily concentrated in one category, a forecast in this category is not very surprising and the forecaster is quite certain about the forecast’s future outcome.

If two processes *X*_{1} and *X*_{2} are considered, then the joint entropy is obtained in the form

where *P*(*x _{j}*

_{1},

*x*

_{j}_{2}) is the joint probability mass and the summation is over all possible combinations of categories. Two independent processes with

*P*(

*x*

_{j}_{1},

*x*

_{j}_{2}) =

*P*(

*x*

_{j}_{1})

*P*(

*x*

_{j}_{2}) imply

Hence, it is impossible to learn about the process *X*_{1} from the outcome of an independent process *X*_{2} and vice versa.

In the case of dependent processes, the joint entropy is smaller than the sum of the single-process entropies:

MI(*X*_{1}, *X*_{2}) = *H*(*X*_{1}) + *H*(*X*_{2}) − *H*(*X*_{1}, *X*_{2}) ≥ 0. (10)

Equation (10) defines the mutual information (MI) that can be applied as a distance measure between the processes *X*_{1} and *X*_{2} since it is positive and symmetric by definition and since it satisfies the triangle inequality (Cover and Thomas 1991). For *X*_{2} = *X*_{1}, the MI(*X*_{1}, *X*_{1}) = *H*(*X*_{1}) and thus the information entropy is also known as self-information. Of course, most information on any process *X*_{1} is given by *X*_{1} itself and consequently MI(*X*_{1}, *X*_{1}) ≥ MI(*X*_{1}, *X*_{2}). If, for example, forecasting and related observation are the considered processes and these processes are independent (in the case of completely useless forecasts), then the mutual information is zero.

In the following, the concept of mutual information is used to quantify the distance between the two processes of forecasting and observation. For any date *t* the ex post observed information entropy *H*(*X*_{o,t}) is

*H*(*X*_{o,t}) = 0, (11)

since *o*_{jt} ∈ {0, 1}. Therefore, for whichever date *t*, the mutual information is zero and the joint information is equal to the information in the forecast, written as

*H*(*X*_{o,t}, *X*_{f,t}) = *H*(*X*_{f,t}), (12)

and thus nothing can be learned about the forecast system performance from one good or bad forecast. Therefore, the mutual information cannot be applied in a scoring rule like the IGN discussed above. Instead, the concept of mutual information is relevant only given time series of forecast probabilities *y* and observations *o*. The crucial step is to estimate the probabilities *P*(*y*), *P*(*o*), and *P*(*y*, *o*) of the probabilistic forecasts and observations for the evaluated time series.

This is discussed here with a hypothetical forecast system for a binary outcome given by *x*_{o,t} ≤ *c*. For this system the observed probabilities *o*_{t} are 1 if the event occurs and 0 otherwise. The forecast probability is *y*_{t} = 1 if *x*_{f,t} ≤ *c*, and 0 otherwise; that is, the forecasts are deterministic. If, for example, 100 hypothetical forecast–observation pairs are evaluated, then a relative contingency table as given in Table 1 might be the result. The skillful system predicts the event 10 + 60 = 70 times, that is, 70% of the time and 60% of the time correctly. A random forecast system predicts the event 50% of the time but only 35% correctly by chance.

There are numerous well-known verification scores based on such 2 × 2 contingency tables, for example, the false alarm rate or the hit rate (Wilks 2006). Alternatively, the mutual information MI between forecast and observation can be evaluated if the relative frequencies given in the table are interpreted as the joint and marginal probability masses of the random variables *y* and *o*. Hence, the contingency table of the skillful system evaluation yields *H*(*o*) = −(0.3 log 0.3 + 0.7 log 0.7), *H*(*y*) = −(0.3 log 0.3 + 0.7 log 0.7), *H*(*y*, *o*) = −(0.2 log 0.2 + 0.1 log 0.1 + 0.1 log 0.1 + 0.6 log 0.6); thus MI(*y*, *o*) = 0.13. The random forecast system and the observations have no mutual information since the corresponding table yields MI(*y*, *o*) = 0.00.
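The entropy arithmetic of this example can be reproduced as follows (a minimal sketch; the nested-list table layout, with rows as forecast classes and columns as observed classes, is our own choice):

```python
import math

def entropy(probs):
    """Shannon entropy in nats; the convention 0 log 0 = 0 is applied."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """MI(y, o) = H(y) + H(o) - H(y, o) from a relative contingency table
    joint[i][j] over forecast classes i and observed classes j."""
    h_y = entropy([sum(row) for row in joint])          # forecast margin
    h_o = entropy([sum(col) for col in zip(*joint)])    # observation margin
    h_yo = entropy([p for row in joint for p in row])   # joint distribution
    return h_y + h_o - h_yo

skillful = [[0.2, 0.1],   # forecast "no":  observed no / observed yes
            [0.1, 0.6]]   # forecast "yes": observed no / observed yes
random_fc = [[0.15, 0.35],
             [0.15, 0.35]]  # independent margins, 0.5 x 0.3 and 0.5 x 0.7
print(round(mutual_information(skillful), 2))        # 0.13
print(abs(mutual_information(random_fc)) < 1e-12)    # True
```

The skillful table reproduces the MI(*y*, *o*) = 0.13 quoted in the text, and the independent table of the random system yields zero mutual information.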

Consider an EPS forecast system in which the forecast probabilities are defined by the proportion of ensemble members that predict an event to occur (see Katz and Ehrendorfer 2006). If there is only one ensemble member (i.e., the forecasts are deterministic), there is a maximum of *K* = 2 classes of forecast probability: 0 and 1. More generally, given *M* ensemble members, there are *K* = *M* + 1 classes of possible forecast probabilities. For example, with *M* = 3 the possible forecast probabilities are 0, ⅓, ⅔, and 1, and with a time series of forecasts at maximum four classes are populated. With perfect observations two classes are observed. The resulting *K* × 2 contingency tables can be evaluated using reliability diagrams or so-called relative operating curves (ROC; Wilks 2006) or using the concept of mutual information (DelSole 2004). The information entropy of the forecasts is calculated with

*H*(*y*) = −Σ^{K}_{k=1} *P*(*y* = *p*_{k}) log *P*(*y* = *p*_{k}), (13)

with *p*_{k} = (*k* − 1)/*M* the *K* possible forecast probability values. In the special case of uniform probabilities the uncertainty in the forecasts and observations is log *K* and log 2, respectively, and thus *H*(*y*) ≥ *H*(*o*).

Generally, the entropy depends on the number of probability classes (more classes, i.e., options, yield more uncertainty). Here, as the most straightforward choice, the entropies are determined by adopting the maximally possible number of populated classes, that is, two for the observation entropy, *K* = *M* + 1 for the probabilistic forecast entropy and 2 × *K* for the joint entropy. Fewer classes would unnecessarily waste information in the probabilistic forecasts.

Now, as suggested in this paper, the mutual information shall be used in a multiple-threshold evaluation with *c*_{j}; *j* = 1, 2, . . . , *J*. In analogy to Brier's quadratic score, a multiple-category score could be introduced with Σ^{J}_{j=1} MI(*y*_{j}, *o*_{j}), but this also neglects the distance of the forecast and observation categories. Instead, we propose to apply the cumulative probabilities *Y* and *O* in analogy to the RPS and to evaluate the relative contingency tables of these cumulative probabilities for each event threshold, yielding

*H*(*Y*_{j}) = −Σ^{K}_{k=1} *P*(*Y*_{j} = *p*_{k}) log *P*(*Y*_{j} = *p*_{k}) (14)

and equivalently for *H*(*O*_{j}) and *H*(*Y*_{j}, *O*_{j}). Furthermore, the ranked mutual information score (RMI) is defined by

RMI(*Y*, *O*) = Σ^{J}_{j=1} MI(*Y*_{j}, *O*_{j}). (15)

Therefore, the score is the sum of the mutual informations for the *J* thresholds.

Often applied as a reference forecast is the climatological probability forecast. It delivers a time-independent probability mass for each event category, or a cumulative probability for each threshold event *x*_{f} ≤ *c*_{j}. If, for example, an event is climatologically expected 30% of the time, then the climatological forecast is *Y*_{ref,j} = 0.3 for all forecast dates. Thus, the information in the climatological forecast series is *H*(*Y*_{ref,j}) = 0, and the mutual information between observations and climatology is MI(*Y*_{ref,j}, *O*_{j}) = 0. In consequence, the RMI is equitable: it takes the smallest possible value, 0, for all unskillful forecasts, that is, for climatological and random forecasts.

The equitability of the RMI avoids a weakness of the BSS with climatological forecasts as reference forecasts (Mason 2004): if the reliability (conditional bias) of the forecast is larger than the event resolution, the BSS is negative and the forecasts are therefore assessed as unskillful. Nevertheless, there might be some useful resolution in the forecasts that can be exploited after forecast recalibration. The RMI does not measure conditional bias but lack of resolution. For example, if a forecast is always "yes" when "no" is observed and vice versa, then the system is qualified as perfect. This is similar to using the explained variance, that is, the square of the linear correlation coefficient, as a performance measure: perfect correlation is as good as perfect anticorrelation. This is not a disadvantage of the RMI since (i) perfect reliability is impossible as long as the perfect model is impossible, that is, forever, and (ii) weather forecasts are calibrated explicitly by some regression-based system like a model output statistics system (Wilks 2006) or implicitly by the experienced human forecaster.

Another often cited important property of good scores is that they cannot be hedged: forecasting something different than what the available information implies must not improve the score. The MI and RMI cannot be hedged. This is a consequence of the data processing inequality of information theory (Cover and Thomas 1991): random variables *O*, *Y*, and *Y** that form a Markov chain satisfy the inequality MI(*O*, *Y**) ≤ MI(*O*, *Y*). In particular, a function *Y** = *g*(*Y*) forms a Markov chain (i.e., *Y** depends only on *Y* and is conditionally independent of *O*) and MI(*O*, *g*(*Y*)) ≤ MI(*O*, *Y*). Therefore, any modification of the forecast *Y* without additional information cannot improve the mutual information between the forecast and the observation. This is equivalent to a proper scoring rule, a positive feature of the BS and RPS.

Now, following Eq. (3), a skill score RMIS_{O} can be defined by

RMIS_{O}(*Y*, *O*) = [RMI(*Y*, *O*) − RMI(*Y*_{ref}, *O*)]/[RMI(*O*, *O*) − RMI(*Y*_{ref}, *O*)]. (17)

If climatology or a random forecast is applied as reference, then RMI(*Y*_{ref}, *O*) = 0 and, since MI(*O*_{j}, *O*_{j}) = *H*(*O*_{j}),

RMIS_{O}(*Y*, *O*) = RMI(*Y*, *O*)/Σ^{J}_{j=1} *H*(*O*_{j}). (18)

With this definition the skill score values are in the interval [0, 1]. The RMIS_{O}(*Y*, *O*) normalizes the ranked mutual information with the uncertainty in the event observations as measured by the sum of the information entropies *H*(*O*_{j}) for the *J* event thresholds. The more peaked the distributions *P*(*O*_{j}) are, the smaller is the denominator in Eq. (18).

_{j}The RMIS* _{O}* can be interpreted as the fraction of information in the observation time series that is explained by the forecasts. It is generally accepted that larger ensembles of skillful forecasts explain a larger fraction of the entropy in the observations, and therefore RMIS

*(*

_{O}*Y*,

*O*) should increase with

*M*. It is interesting to note that for very long time series

*H*(

*O*) ≈ 2

*O*(1 −

*O*); thus the normalization in RMIS

*approximates twice the value of the uncertainty term in the BSS.*

The fraction of useful information in the EPS forecasts can be quantified with

RMIS_{Y}(*Y*, *O*) = RMI(*Y*, *O*)/Σ^{J}_{j=1} *H*(*Y*_{j}). (19)

In the case of a deterministic forecast system, *M* = 1, the possible outcomes for *Y*_{j} are 0 and 1, as for *O*_{j}. Therefore, the denominators in Eqs. (18) and (19) are equal for very long evaluation periods if observation and forecast have the same climatology. With increasing ensemble size *M* the number of potentially populated probability classes increases. Therefore, there is an increase in the information delivered by the forecasts, and the denominator increases as does the numerator. As mentioned above, with large ensemble forecasts the explained information RMIS_{O} is expected to be large, but at the same time the fraction of useful information RMIS_{Y} in the EPS might decrease [RMI(*Y*, *Y*) increases faster than RMI(*Y*, *O*)], which has to be considered when delivering forecasts to customers. It is desirable to keep the ratio RMIS_{Y}(*Y*, *O*) as large as possible while increasing the ensemble size.

Alternatively, the normalization of the RMI could also be done with Σ^{J}_{j=1} *H*(*O*_{j}, *Y*_{j}), which would lead to a robust generalization of a squared correlation measure like the so-called explained variance (Ahrens 2003). However, the discrimination of explained and useful information fractions would be discounted and, thus, this type of normalized mutual information is not proposed here as a skill score. The two new scores fit well into the verification framework developed in Murphy and Winkler (1987): they interpret the joint forecast–observation distribution, which "contains all the information relevant to verification," and illuminate two complementary aspects of the forecasts.
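The estimation of the RMI and of the two skill scores from evaluated time series can be sketched as follows (our own minimal illustration without any debiasing; the input layout, one series of cumulative probabilities per threshold, is an assumption):

```python
import math
from collections import Counter

def plug_in_entropy(series):
    """Plug-in entropy (nats) of a series from relative class frequencies."""
    T = len(series)
    return -sum((n / T) * math.log(n / T) for n in Counter(series).values())

def mutual_information(a, b):
    """MI = H(a) + H(b) - H(a, b) from the joint relative frequencies."""
    return (plug_in_entropy(a) + plug_in_entropy(b)
            - plug_in_entropy(list(zip(a, b))))

def rmi(Y, O):
    """Ranked mutual information: sum of MI(Y_j, O_j) over the J thresholds.
    Y[j][t], O[j][t] hold cumulative probabilities per threshold j, date t."""
    return sum(mutual_information(Yj, Oj) for Yj, Oj in zip(Y, O))

def rmis_o(Y, O):
    """Fraction of observed information explained by the forecasts."""
    return rmi(Y, O) / sum(plug_in_entropy(Oj) for Oj in O)

def rmis_y(Y, O):
    """Fraction of useful information in the forecasts."""
    return rmi(Y, O) / sum(plug_in_entropy(Yj) for Yj in Y)

# A forecast identical to the observations explains all observed information:
O = [[0.0, 1.0, 0.0, 1.0]]  # one threshold, four dates
assert rmis_o(O, O) == 1.0 and rmis_y(O, O) == 1.0
```

Note that the plug-in frequency estimates carry the entropy-estimation bias discussed in the experiments of section 3.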

_{j}An important advantage of the new skill scores over the RPSS and BSS is their extreme value characteristics. As mentioned in the introduction, the BSS would deliver a perfect skill score if values above some critical value *c* are neither forecast nor observed. The information score would deliver no skill without informative events and with zero mutual information between forecast and observation.

A less extreme example is the following. An event has been observed once within an observation period of 100 dates. There are predictions from three forecast systems: system 1 never predicts the event, system 2 predicts the event once but at the wrong date, and system 3 hits the event without false alarms. Obviously, system 3 performs best. Systems 1 and 2 show similar performance, but perhaps with some advantage for system 2 because it has proven that it is able to predict the event with the correct relative frequency. For this example the BSS gives the skill scores −0.1, −1.02, and 1, respectively. Therefore, system 2 comes off far worse than system 1, contrary to intuition. The RMIS_{O} is 0.0, 0.002, and 1, and the RMIS_{Y} is 0/0, 0.002, and 1, respectively. This is closer to the authors' intuition: the results for RMIS_{O} are similar for systems 1 and 2, and we are warned through RMIS_{Y} = 0/0 that system 1 has not forecast the event during the evaluation period and thus the evaluation period might be too short.

_{Y}## 3. Experiments with synthetic data

In a simple setup for testing the introduced scores we created synthetic, stochastic time series of precipitation that we perturbed with errors. The stochastic time series generator applied is a simple Poisson process with an hourly time scale and with exponentially distributed interstorm periods (60 h on average), storm durations (20 h on average), and precipitation intensities (0.5 mm h^{−1} on average). The chosen parameters are typical values for the midlatitudes (Eagleson 2002), and after aggregation of the generated hourly series to daily series they provide realistically distributed precipitation intensities compared to observations in Zurich (cf. Fig. 1), with 51% observed and 47% simulated rain days. Of course, the applied stochastic generator is the simplest method that is "at least roughly concordant with data" (Rodriguez-Iturbe et al. 1987), but it is good enough to test the skill scores presented in this paper.
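The generator can be sketched as follows (our own minimal illustration of such a rectangular-pulse Poisson process with the stated parameter values; the function name and the aggregation details are assumptions, not the paper's exact implementation):

```python
import random

def synthetic_daily_precip(days, mean_dry=60.0, mean_wet=20.0,
                           mean_intensity=0.5, seed=1):
    """Hourly rectangular-pulse Poisson process with exponentially distributed
    interstorm periods (h), storm durations (h), and intensities (mm/h),
    aggregated to daily totals (mm/day)."""
    rng = random.Random(seed)
    hours = days * 24
    hourly = [0.0] * hours
    t = rng.expovariate(1.0 / mean_dry)  # first interstorm period
    while t < hours:
        duration = rng.expovariate(1.0 / mean_wet)
        intensity = rng.expovariate(1.0 / mean_intensity)
        for h in range(int(t), min(hours, int(t + duration))):
            hourly[h] = intensity  # constant intensity over the storm
        t += duration + rng.expovariate(1.0 / mean_dry)  # next interstorm period
    return [sum(hourly[d * 24:(d + 1) * 24]) for d in range(days)]

series = synthetic_daily_precip(180)
print(len(series))  # 180
```

Independent draws of such series serve as the "observations" and as the ensemble members in the experiments below.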

In Müller et al. (2005) a white noise experiment was conducted to illustrate the dependence of the RPSS on ensemble size. Here, we perform a similar experiment. With the described random process an ensemble of independent random forecast series is generated and compared to an independently generated synthetic observation time series. Afterward, skill scores for ensembles with up to 51 members are estimated. Forecast and observation time series are independent, and thus the MI skill scores should measure a skill of zero. Additionally, an ensemble of independent forecasts approximates a climatological forecast, and thus the RPSS should be zero by construction too. Since the skill scores depend on the particular realization of the time series, a set of 100 ensemble forecast experiments has been carried out to allow us to estimate a 95% confidence interval on the skill scores.

Figure 2 shows simulated biases for the skill scores RPSS, RMIS_{O}, and RMIS_{Y} with confidence intervals, based on time series with a length of 180 days. For the sake of simplicity only one precipitation threshold is applied, *x*_{0.66} = 1.4 mm day^{−1}, with the index indicating the 0.66 quantile of the stochastic process (66% of the realized values are smaller than the threshold). Figure 2 shows the often-mentioned negative bias of the RPSS for small ensemble sizes (Müller et al. 2005; Weigel et al. 2006a). The RMIS_{O} has a bias too—a bias that increases with ensemble size. The RMIS_{Y} bias is small.

_{Y}The biases originate from sampling errors. In the case of the RPSS the bias is a direct consequence of the climatological reference being perfectly reliable, while the EPS is not, due to the discretization of forecast probabilities with a finite number of ensemble members (Weigel et al. 2006b). In case of entropy-based measures the bias is caused by statistical fluctuations in the estimation of the probability mass functions *p*(*Y*). These fluctuations render the *p* less uniform and this yields entropy underestimation (Miller 1955; Paninski 2003). The fluctuations have increasing importance with shorter time series and more discretization classes in the contingency table. Since this paper applies the maximum choice for the number of classes, that is, *K* = 2 for *H*(*O*), *K* = *M* + 1 for *H*(*Y*), and *K* = 2(*M* + 1) for *H*(*O*, *Y*) estimation, the fluctuation amount is maximized.

For the discussion, we assume that the entropy and the entropy underestimation bias depend linearly on the number of classes. Additionally, we assume that the bias is small compared to the entropy. Then *H*(*Y*) increases with *M* and compensates a positive bias of the order of *M* in the numerator of the skill scores. Therefore, the RMIS_{Y} shows a very small bias. The entropy estimate *H*(*O*) is independent of *M*, and thus there is no bias compensation in the case of RMIS_{O}, as Fig. 2 proves.

_{O}The bias in RMIS* _{O}* is not a consequence of limited ensemble size as in the case of the RPSS. It can be reduced by comparison of longer time series or, at least, its influence can be estimated by evaluation of a time series of different length. Alternatively, the bias decreases by choosing a smaller number

*K*′ <

*K*of probability classes in the

*H*(

*Y*) and MI(

_{j}*Y*,

_{j}*O*) estimations. But this also decreases the evaluated information amount that is inherent in the probabilistic forecasts. The effects of these debiasing strategies are shown in Fig. 3.

_{j}Here, another simple entropy debiasing strategy is applied following Miller (1955) and Paninski (2003): the entropies are debiased by adding a term (*m* − 1)/2*T*, with *m* being an estimate of the number of classes with nonzero probability and *T* the length of the time series. More sophisticated approaches are available (Paninski 2003; Schuermann 2004), but as Fig. 3 demonstrates the simple approach is sufficient here. The experiment’s ideal skill score of zero is well within the confidence interval for the debiased skill score estimates and for all shown ensemble sizes. In assessment of even larger ensembles, and thus many categories, either longer time series or a more efficient debiasing approach could be applied.

It is possible to study the skill score behavior in more realistic ensemble forecast experiments by generating ensemble forecasts as disturbed versions of the observation time series. Figure 4 shows an experiment in which the skill scores are estimated for forecasts generated as disturbed observations. The disturbance is done by multiplying the observations with a noise process. The noise, with a mean of 1 and a standard deviation of 0.9, is realized by sampling a transformed symmetric beta distribution, which guarantees nonnegative precipitation and a maximum positive disturbance by a factor of 2. Additionally, 10% of the daily values are swapped pairwise to simulate gross forecast errors.

Figure 4 shows that the fraction of information in the observation time series that is explained by the forecasts, as measured by RMIS_{O}, increases rapidly with ensemble size *M* for *M* < 10. But for larger *M* the increase in explained information slows down. The fraction of useful forecast information, RMIS_{Y}, decreases slightly if more than about five members are applied, but the decrease also slows down with increasing *M*. Remember that the difference in definition between RMIS_{O} and RMIS_{Y}, Eqs. (18) and (19), is in the denominator. The increase in RMIS_{O} is due to the increase of the mutual information between ensemble forecast and observation time series. In RMIS_{Y} the increase in mutual information is overcompensated by the increase of useless information in the ensemble forecast: the useful information in the forecast increases slightly more slowly with ensemble size than the useless information. Additionally, it can be seen, as expected, that the information increase per additional member decelerates with increasing *M*.

The RPSS in Fig. 4 increases quickly for small *M* but seems to be close to saturation for *M* > 10. Comparison with the debiased variant, RPSS_{D}, also shown in the figure, illustrates the marked bias of the RPSS for small ensemble sizes. Following Weigel et al. (2006a), debiasing is done by adding the term 1/*M* × *O*_{1}(1 − *O*_{1}) to the BS of the climatological forecast (since if only one threshold is applied the RPSS collapses to the BSS). Obviously, the RPSSs saturate quickly with increasing *M* if evaluation errors are considered. This is as expected since the RPSSs are an integral measure of the useful and useless information in the EPS, and thus their slope is expected to lie between the slopes of RMIS_{O} and RMIS_{Y}, hiding the increase of potentially good information in the forecast.

_{Y}Without gross errors (not shown) the levels and evolution of RMIS* _{O}* and RPSS are quite similar to the ones in Fig. 4, but the RMIS

*is about 0.4 and even slightly increasing with ensemble size. This is because of the convergence of the mean ensemble forecast to the observation without gross errors by construction and, as a consequence, the increase in the denominator is relatively small with increasing member size.*

_{Y}Figure 5 shows the same forecast experiment skill scores but with multiple precipitation thresholds (*c*_{0.66}, *c*_{0.77}, *c*_{0.88}, *c*_{0.99}) = (1.4, 3.6, 8.1, 31.9) mm day^{−1} where the respective precipitation quantiles are given as indices. By construction there are no systematic errors in the forecasts (like “the forecast system often underestimates heavy rain events”), so the skill scores differ only slightly to the skill scores of the single threshold experiment.
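The single-threshold debiasing of the skill score (the BSS case of Weigel et al. 2006a) can be sketched as follows; interpreting *O*_{1} as the observed base rate of the event is an assumption here, and the function name is illustrative:

```python
import numpy as np

def debiased_bss(fcst_prob, obs_event, M):
    """Debiased Brier skill score for an M-member ensemble.

    The intrinsic bias of finite ensembles is removed by adding the term
    O1*(1 - O1)/M to the Brier score of the climatological reference
    forecast, where O1 is the observed base rate of the event.
    """
    bs = np.mean((fcst_prob - obs_event) ** 2)           # Brier score of the EPS
    o1 = np.mean(obs_event)                              # climatological event frequency
    bs_clim = np.mean((o1 - obs_event) ** 2)             # BS of the climatological forecast
    return 1.0 - bs / (bs_clim + o1 * (1.0 - o1) / M)    # debiased skill score
```

The correction term shrinks with 1/*M*, which is why the raw and debiased curves in Fig. 4 converge for large ensembles.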

Often the mean or median forecast from the ensemble is used rather than the ensemble itself (Ehrendorfer 1997; Atger 1999). Figure 6 evaluates the mean forecasts of the synthetic EPS with gross errors. In this case the effective number of classes in *H*(*Y*) collapses to two: the precipitation forecast is either smaller or larger than the threshold. Additionally, the expected forecast and observation climatologies are approximately the same in this experiment and, hence, the denominators in the skill scores RMIS*_{O}* and RMIS*_{Y}* are almost equal. Consequently, the skill scores cannot discriminate between useful and useless information in the ensembles. In this specific experiment the RMIS*_{O}* and RMIS*_{Y}* indicate an increase in performance of the mean forecast with ensemble size, but the increase slows down with more and more ensemble members.

## 4. Experiments with real data

As an application of the skill scores, this section compares the forecasts of an operational EPS with daily precipitation as observed at a single Federal Office of Meteorology and Climatology MeteoSwiss station in Switzerland (Zurich). Direct EPS output from the grid cell that contains the observing station is applied in the comparison. Representativity questions related to different scales of the pointlike station site and the areal model grid cell are not being considered here since this single station comparison is not meant as a full forecast system evaluation, but as a test bed for our scores.

The applied ensemble data are forecast by the Consortium for Small-Scale Modeling (COSMO; see http://www.cosmo-model.org for information on the applied model system) Limited-Area Ensemble Prediction System (COSMO-LEPS; Montani et al. 2003). One member of the limited-area EPS is nested into one representative forecast of a coarser grid global EPS. The representative forecasts of the global EPS are selected by a cluster analysis on the simulation domain of the limited-area model. Therefore, the global forecasts are representative for a large domain (∼continental Europe) and not necessarily for the station site in Zurich. Consequently, in the experiments shown here, all limited-area EPS members are weighted uniformly, not by the cluster size as is done operationally. The implementation of the COSMO-LEPS is formally validated in Marsigli et al. (2005).

The operational COSMO-LEPS with numerical grid spacing of 10 km was initiated in November 2002. The number of representative members was set to *M* = 5 until May 2004 and 10 afterward until recently (February 2006) when it was set to 16. Here, for evaluation, ensemble forecasts from November to May in 2003/04 and 2004/05 are used. Therefore, two comparable periods (about 210 days each) in the same seasons are selected, but they have differences in the EPS setup: the first period is forecast with a 5-member EPS and the second period with a 10-member EPS that is slightly updated (e.g., by nesting into the two instead of the three most recent global EPS forecasts, which are initialized every 12 h, by accounting for model errors using two different convection parameterizations selected randomly and by introducing cloud ice and prognostic precipitation schemes). The limited-area EPS forecasts are initiated each day at 1200 UTC, and precipitation simulated for the forecast hours 18 to 42 (the 1-day forecast) and 42 to 66 (the 2-day forecast) is considered.

The evaluation uncertainty is of interest. Here confidence intervals are estimated following a bootstrap procedure (Efron and Tibshirani 1993). The idea of bootstrapping is to construct a large number of new samples of the original length out of the original sample (i.e., the time series of EPS forecasts) by random selection with replacement. For each new sample a new skill score is determined, and from these the mean skill score as well as the 95% confidence interval are estimated. For the full 5-member as well as the 10-member EPS, in total 504 samples are generated. Bootstrapping, as described, slightly underestimates the width of the confidence intervals in the case of persistence in the daily evaluation time series (Ahrens et al. 1998).
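The bootstrap procedure can be sketched as follows (function names are illustrative):

```python
import numpy as np

def bootstrap_ci(series, score_fn, n_boot=504, alpha=0.05, seed=0):
    """Bootstrap mean and (1 - alpha) confidence interval of a skill score.

    series   : daily evaluation time series (resampled with replacement).
    score_fn : maps a resampled series to a single skill score.
    """
    rng = np.random.default_rng(seed)
    T = len(series)
    boot = np.array([score_fn(series[rng.integers(0, T, size=T)])
                     for _ in range(n_boot)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return boot.mean(), (lo, hi)
```

Because whole days are drawn independently, any day-to-day persistence in the series is broken, which is the source of the slight underestimation of the interval width noted above.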

In operational applications it is important to quantify whether more members can improve the EPS performance. Therefore, all possible combinations of EPS members are evaluated. For example, 10!/(5!5!) = 252 combinations without repetition of 5 members out of the *M* = 10 EPS members of the 2004/05 season are evaluated. Each smaller-ensemble forecast series is bootstrapped until the sample size of skill scores is about 500 and an average skill score with confidence interval can be estimated.
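Enumerating every possible sub-ensemble can be sketched with `itertools.combinations`; the wrapper function is illustrative:

```python
from itertools import combinations

import numpy as np

def subensemble_scores(members, score_fn, k):
    """Evaluate a skill score for every k-member subset of an M-member EPS.

    members  : array of shape (M, T) of member forecast series.
    score_fn : maps a (k, T) sub-ensemble to a scalar skill score.
    For M = 10 and k = 5 this evaluates all 10!/(5!5!) = 252 combinations.
    """
    M = members.shape[0]
    return np.array([score_fn(members[list(subset)])
                     for subset in combinations(range(M), k)])
```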

Figure 7 shows that the mutual information between EPS forecasts and observation time series increases with ensemble size *M*. However, the fraction of useful information in the EPS forecasts decreases with ensemble size. The RPSS increases strongly with ensemble size for *M* ≤ 4, but this is partly caused by the bias problem discussed before. As the RPSS does not discriminate between useful and useless information in the EPS the RPSS indicates earlier EPS saturation than the information skill scores.

The information skill scores shown in this section are debiased. For example, the 10-member RMIS*_{O}* is reduced by 5% through debiasing. The bias is thus smaller than the standard deviation of 13%, and it can be assumed that the debiasing is effective enough. A more difficult problem in EPS evaluation is the width of the confidence intervals of RMIS*_{O}* and RPSS. These will decrease only with larger datasets (i.e., to be realized with longer time series).

The confidence intervals of RMIS*_{O}* and RPSS are of similar size. The RPS can in principle be estimated at any instant of data time and is therefore a scoring rule (Matheson and Winkler 1976), but the RPS has to be averaged over longer periods for statistically conclusive results with reasonably small sampling errors in evaluation. The information skill scores can only be estimated if sequences of dates are evaluated, but in terms of evaluation uncertainty this is not a disadvantage.

Figure 8 compares skill scores using a single ⅔ quantile threshold with skill scores using a double threshold at the ⅔ and ⅘ quantiles. The measured skill of the EPS decreases if the larger threshold is also taken into consideration. This is consistent with the expectation of lower EPS performance for extreme events than for more typical weather situations.
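The multi-threshold comparison of cumulative probabilities that underlies the RPS can be sketched as follows; this is a simplified sketch using non-exceedance probabilities, not the operational implementation:

```python
import numpy as np

def ranked_probability_score(ens, obs, thresholds):
    """Ranked probability score from cumulative probabilities.

    ens        : ensemble forecasts, shape (M, T).
    obs        : observations, shape (T,).
    thresholds : event thresholds (e.g., precipitation quantiles).
    Sums, over thresholds, the mean squared difference between the forecast
    cumulative probability P(x <= c) and the observed indicator I(x <= c).
    """
    rps = 0.0
    for c in thresholds:
        y = np.mean(ens <= c, axis=0)      # forecast cumulative probability
        o = (obs <= c).astype(float)       # observed cumulative "probability"
        rps += np.mean((y - o) ** 2)
    return rps
```

Adding a higher quantile threshold adds the squared error for the rarer event, so the measured skill can only reflect, never mask, the poorer performance on extremes.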

The assessment of ensemble-mean forecasts in Fig. 8 indicates that additional members do not necessarily improve the mean forecast. The EPS mean approaches the model climatology (and not the observations, as in the synthetic experiment shown in Fig. 6). Additionally, the smaller values of RMIS*_{Y}* in comparison with RMIS*_{O}* are a consequence of the uncertainty *H*(*Y*) (44% of the mean forecasts are in the above-normal tercile, compared with 38% of the ensemble member forecasts) being larger than the uncertainty *H*(*O*) (only 29% of the observations are above the threshold). A calibration of the forecast system would yield equivalence of the scores for the mean forecasts.

The skill scores of the two-day forecasts and of the older five-member EPS forecasts are estimated as a test of skill score sensitivity. The older EPS is slightly worse, by about 10%, than the newer one in terms of the information scores, as shown in Fig. 9. The difference in the mean RPSS (not shown) is smaller than 1%, but with even slightly larger confidence intervals than for the information scores (cf. Fig. 7). Thus the RPSS is less sensitive to the performance change of the EPS than the information scores. The performance of the two-day forecasts is obviously worse than that of the one-day forecasts. But in all cases the model's event climatology is comparable with the observed event climatology, as shown by the intersection of the RMIS*_{O}* and RMIS*_{Y}* skill scores for single-member forecasts.

## 5. Conclusions

This paper presents the application of two new skill scores, RMIS*_{O}* and RMIS*_{Y}*, to assessing the performance of probabilistic forecasts. These skill scores are based on an information score, RMI, that measures the mutual information in time series of observed events and probabilistic forecasts. The skill scores define events by thresholds and not by mutually exclusive categories. Thus the forecasts' cumulative probabilities (as in the RPSS), and not their probability densities (as in the probability score or logarithmic score), are compared. The RPSS measures the distance as the multiple-threshold average of the mean squared differences between the cumulative probabilities of forecasts and observations. The information skill scores, on the other hand, estimate the average mutual information, an entropy-based quantity introduced in information theory, for multiple probability tables. The probability tables (one table for each threshold) summarize how often the forecasts and the observations assign which cumulative probability to an event. The RPSS as well as the information skill scores consider the cumulative forecast probabilities as the random variables.

It is shown that the mutual information score has several beneficial properties. Like the ranked probability score, it penalizes forecasts increasingly if more probability is assigned to categories that are more distant from the one containing the observation. Conceptually, the information scores measure the mutual information between two different ordered random variables. In the present paper these variables are observed and simulated precipitation probabilities, but they could equally well be simulated temperature and simulated precipitation probabilities, if the statistical dependence between these variables is of interest.

The proposed scores are equitable and cannot be hedged. It is proposed that the scores be normalized by division with the information entropy in the observation data, *H*(*O*), or in the forecasts, *H*(*Y*), yielding RMIS*_{O}* and RMIS*_{Y}*, respectively. Thereby, two different aspects of the EPS forecasts are assessed. First, RMIS*_{O}* quantifies how much of the observed information is explained by the forecasts. Second, RMIS*_{Y}* quantifies the fraction of useful information in the EPS forecasts. Both quantities are helpful and easy to interpret, as is demonstrated in the experimental evaluation with synthetic and real precipitation forecasts.

The new skill scores are slightly biased, but the bias is well behaved since it decreases with the length of the evaluation period. Additionally, a simple debiasing of the necessary entropy estimation following Miller (1955) is shown to be sufficient. A climatology is not necessary for debiasing. This is a substantial advantage over the necessary debiasing in RPSS estimation (Weigel et al. 2006a), since the climatologies are often not well known. This is especially important in the case of extreme events.
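Miller's (1955) correction adds (K − 1)/(2N) to the plug-in entropy estimate, where K is the number of occupied cells and N the sample size; a minimal sketch:

```python
import numpy as np

def entropy_miller(counts):
    """Plug-in entropy (nats) with the Miller (1955) bias correction.

    The naive plug-in estimator underestimates entropy from finite samples;
    Miller's correction adds (K - 1) / (2N), with K the number of occupied
    cells and N the total count.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    h_plugin = -np.sum(p * np.log(p))
    k = np.count_nonzero(counts)
    return h_plugin + (k - 1) / (2.0 * n)
```

Note that the correction depends only on the observed table, which is why no external climatology is needed.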

Therefore, the new skill scores appear to be a promising complement to the RPSS in assessing the performance of probabilistic forecasts.

## Acknowledgments

This study was supported by the Swiss National Science Foundation through the National Centre for Competence in Research (NCCR-Climate). Valuable discussions with A. Weigel, Zurich, are acknowledged.

## REFERENCES


## Footnotes

*Corresponding author address:* Bodo Ahrens, IAU, University of Frankfurt, Frankfurt, Germany. Email: bodo.ahrens@iau.uni-frankfurt.de