## 1. Introduction

Numerical weather predictions are often expressed in the form of categorical or probabilistic forecasts of discrete predictands (a discrete predictand is an observable variable that takes one and only one of a finite set of possible values). A typical example is the prediction of more than 10 mm of precipitation or of temperature below freezing. The prediction of discrete events can be based either on categorical forecasts (“the event will/will not occur”) or on probabilistic forecasts (“there is a 30% probability of occurrence”). Generally speaking, categorical forecasts are defined as forecasts consisting of a flat statement that one and only one of a possible set of events will occur (Wilks 1995), while probabilistic forecasts are given in terms of the probability that the event under consideration will occur.

Numerical weather forecasts are often used by decision makers to decide whether or not to take an action to protect against a possible loss. Typically, a decision maker would spend an amount *C,* if an event were predicted, to protect against a loss *L* (with *L* > *C*). The potential economic value of a forecasting system can be assessed by using skill measures defined by coupling contingency tables and cost–loss decision models (Katz et al. 1982; Murphy 1985; Wilks and Hamill 1995).

In this work, the *potential economic value* of a forecast is defined, as in Richardson (2000), by a function of the probability of detection and the probability of false detection of the system. Since this measure is defined for both categorical and probabilistic forecasts, it can be used to compare the potential economic value of a single forecast and of an ensemble forecasting system.

Forecasts and observed values are compared over a set of *N*_{G} = 1581 contiguous points, which can be considered as representing Europe (latitude 30°N ≤ *ϕ* ≤ 60°N, longitude 20°W ≤ *λ* ≤ 30°E) on a 1° regular grid. Synthetic observed and forecast patterns are defined as a combination of two-dimensional Gaussian functions and random fields. The sensitivity of different measures of forecast accuracy to imposed errors is investigated, and the potential economic benefit of an ensemble forecasting system instead of a single deterministic forecast is assessed.

The following aspects of the forecast accuracy and potential economic value of categorical and probabilistic forecasts are investigated in particular.

- The sensitivity of categorical deterministic and probabilistic forecasts to imposed model errors.
- The relative potential economic value of single deterministic and probabilistic forecasts.
- The sensitivity of the potential economic value of probabilistic predictions to ensemble size and model accuracy.

After this introduction, section 2 describes how the synthetic observed and forecast patterns are defined. The accuracy measures for categorical and probabilistic forecasts are introduced in section 3 and are applied to a single-case study in section 4. The sensitivity of the accuracy measures and of the potential economic value to imposed forecast errors is investigated in section 5. The potential economic values of single deterministic and probabilistic forecasts are then compared, on average over 90 cases, in section 6. Conclusions are drawn in section 7.

## 2. Definition of synthetic forecast and observed fields

Denote by *g*_{j}(*λ,* *ϕ*) a Gaussian function of the longitude *λ* and the latitude *ϕ,* defined by the maximum amplitude *A*_{j}, the rotation angle *θ*_{j}, the coordinates of the maximum value *λ*_{j} and *ϕ*_{j}, and the standard deviations *σ*_{x,j} and *σ*_{y,j} along the unrotated original axes. Figure 1a shows schematically how the parameters *λ*_{j}, *ϕ*_{j}, and *θ*_{j} define the position and the orientation of the Gaussian function *g*_{j}(*λ,* *ϕ*), and Fig. 1b shows how the parameters *A*_{j}, *σ*_{x,j}, and *σ*_{y,j} define its shape.

Consider a verification pattern *f*_{0}(*λ,* *ϕ*) and an ensemble of forecasts *f*_{j}(*λ,* *ϕ*), with each function defined by a different set of parameters *A*_{j}, *θ*_{j}, *λ*_{j}, *ϕ*_{j}, *σ*_{x,j}, and *σ*_{y,j}:

*f*_{j}(*λ,* *ϕ*) = *c*_{j}(*λ,* *ϕ*) *g*_{j}(*λ,* *ϕ*), (2)

where *c*_{j}(*λ,* *ϕ*) is either a constant function, specifically *c*_{j}(*λ,* *ϕ*) = 1, or is defined by a set of random numbers *c*_{j}(*λ,* *ϕ*) uniformly sampled in the interval [0, 2] [this choice guarantees a rescaling of the field *g*_{j}(*λ,* *ϕ*) by up to 100%]. Hereafter, *j* = 0 will identify the *verification pattern,* *j* = 1, …, 51 will identify the *ensemble of forecasts,* and *j* = 1 will identify the *control forecast* (note that the control forecast is not different from any randomly chosen ensemble member).
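As an illustration, the field construction of Eq. (2) can be sketched as follows (a minimal sketch assuming NumPy; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def gaussian_field(lam, phi, A, lam0, phi0, theta, sx, sy):
    """Rotated two-dimensional Gaussian g_j(lam, phi): amplitude A, centre
    (lam0, phi0), rotation angle theta (degrees), standard deviations
    sx and sy (degrees) along the unrotated axes."""
    t = np.deg2rad(theta)
    # coordinates rotated by theta about the centre of the Gaussian
    x = (lam - lam0) * np.cos(t) + (phi - phi0) * np.sin(t)
    y = -(lam - lam0) * np.sin(t) + (phi - phi0) * np.cos(t)
    return A * np.exp(-0.5 * ((x / sx) ** 2 + (y / sy) ** 2))

# 1-degree regular grid over "Europe": 30-60N, 20W-30E -> 31 x 51 = 1581 points
lam, phi = np.meshgrid(np.arange(-20.0, 31.0), np.arange(30.0, 61.0))

rng = np.random.default_rng(0)
g = gaussian_field(lam, phi, A=100.0, lam0=10.0, phi0=45.0, theta=30.0,
                   sx=1.5, sy=1.0)
c = rng.uniform(0.0, 2.0, size=g.shape)  # random rescaling c_j in [0, 2]
f = c * g                                # forecast field f_j = c_j * g_j
```

A constant `c = 1` instead of the random draw reproduces the purely Gaussian forecast case.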

At each grid point, the ensemble probability forecast of an event is defined as

*p*_{f}(*λ,* *ϕ*) = *n*_{f}(*λ,* *ϕ*)/*N*_{ens}, (4)

where *n*_{f}(*λ,* *ϕ*) is the number of forecasts predicting the event at the grid point with coordinates (*λ,* *ϕ*) and *N*_{ens} is the ensemble size. The spread *s*(*λ,* *ϕ*) of the ensemble is defined as the ensemble second-order moment or standard deviation, that is, as the root-mean-square distance of the ensemble members from the ensemble mean.

Alternatively, the ensemble spread could be defined as the average distance of a randomly chosen pair of forecasts, or as the average distance of the control forecast (*j* = 1) from the other members.
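Under the same illustrative setup, the grid point probability *p*_{f} = *n*_{f}/*N*_{ens} and the root-mean-square spread might be computed as follows (a sketch assuming NumPy and a members-first array layout):

```python
import numpy as np

def event_probability(members, threshold):
    """p_f = n_f / N_ens: fraction of members predicting the event
    (field value above the threshold) at each grid point."""
    return (members > threshold).mean(axis=0)

def ensemble_spread(members):
    """Root-mean-square distance of the members from the ensemble mean."""
    return np.sqrt(((members - members.mean(axis=0)) ** 2).mean(axis=0))

rng = np.random.default_rng(1)
members = rng.gamma(2.0, 5.0, size=(51, 31, 51))  # 51 toy fields on the grid
p_f = event_probability(members, threshold=10.0)
s = ensemble_spread(members)
```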

## 3. Verification scores

A brief introduction of the scores used in this work to measure the skill of categorical and probabilistic forecasts of discrete dichotomous predictands (i.e., predictands allowed to be in only two possible states, yes or no) is reported hereafter. The reader is referred to Wilks (1995) for more details.

### a. Scores for categorical forecasts

Categorical verification of dichotomous (i.e., binary) events can be based on the 2 × 2 contingency table that displays the number of all possible combinations of forecast and observed events (Table 1). The performance of each of the *N*_{ens} forecasts defined in Eq. (2) is measured by a contingency table constructed by adding an entry to the 2 × 2 contingency table for each of the *g* = 1, …, *N*_{G} grid points inside the area under investigation. A perfectly accurate forecast would clearly exhibit *b* = *c* = 0 in its corresponding contingency table.

#### 1) Some measures of forecast accuracy for binary events (hit rate, threat score, probability of detection, and probability of false detection)

Accuracy measures summarize the correspondence between individual forecasts and occurred events. Denote by *p*_{cli} = (*a* + *c*)/*n* the observed frequency of the event under consideration. Four of the most commonly used measures of *accuracy* (i.e., of the average correspondence between individual forecasts and the events they predict; see Wilks 1995) are the hit rate, the threat score, the probability of detection, and the probability of false detection (Doswell et al. 1990).
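With the usual labelling of the 2 × 2 table entries (*a* = hits, *b* = false alarms, *c* = misses, *d* = correct rejections; this convention is assumed to match Table 1), the four measures can be sketched as:

```python
def contingency(fc, ob):
    """2 x 2 contingency counts from parallel yes/no sequences."""
    a = sum(f and o for f, o in zip(fc, ob))          # hits
    b = sum(f and not o for f, o in zip(fc, ob))      # false alarms
    c = sum(o and not f for f, o in zip(fc, ob))      # misses
    d = sum(not f and not o for f, o in zip(fc, ob))  # correct rejections
    return a, b, c, d

def accuracy_scores(a, b, c, d):
    n = a + b + c + d
    return {
        "HR":  (a + d) / n,      # hit rate (fraction of correct forecasts)
        "TS":  a / (a + b + c),  # threat score
        "POD": a / (a + c),      # probability of detection
        "PFD": b / (b + d),      # probability of false detection
    }

a, b, c, d = contingency(fc=[True, True, False, False, True, False],
                         ob=[True, False, False, True, True, False])
print(accuracy_scores(a, b, c, d))
```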

#### 2) Bias

The bias *B* = (*a* + *b*)/(*a* + *c*), the ratio of the number of yes forecasts to the number of yes observations, measures systematic over- or underforecasting: an unbiased forecast has *B* = 1.

#### 3) The Kuipers skill score

For a generic measure of accuracy *A,* the skill score SS_{A} of a forecast with accuracy *A*_{f} with respect to the reference forecast with accuracy *A*_{ref} is given by

SS_{A} = (*A*_{f} − *A*_{ref})/(*A*_{perf} − *A*_{ref}),

where *A*_{perf} is the accuracy of a perfect forecast.

The Kuipers skill score KSS uses as reference an unbiased random forecast, for which *p*_{ref,D}(fc = yes) = *p*(ob = yes) and, by definition, *a*_{ref,D} = (*a* + *b*)^{2}/*n*^{2} and *d*_{ref,D} = (*b* + *d*)^{2}/*n*^{2}. Thus, the Kuipers skill score is

KSS = (*ad* − *bc*)/[(*a* + *c*)(*b* + *d*)].

The Kuipers skill score can be written in terms of the probability of detection POD = *a*/(*a* + *c*) and the probability of false detection PFD = *b*/(*b* + *d*):

KSS = POD − PFD. (13)

Note that both random and constant forecasts receive the same zero score, and that the contribution to the KSS of correct no (yes) forecasts increases as the event becomes more (less) likely. It should be mentioned that the Kuipers skill score approaches the probability of detection POD when correct forecasts of no events dominate the contingency table, and therefore it is vulnerable to hedging in rare-event forecasting (Doswell et al. 1990). Despite this, and despite the fact that Doswell et al. (1990) argue that the Heidke skill score (also called the S statistic) should be preferred, the Kuipers skill score will be used in this paper because of its relationship with the potential economic value (see section 3c).
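The rare-event behaviour just described is easy to verify numerically: holding hits, misses, and false alarms fixed while the correct no forecasts (the *d* entry) grow, PFD vanishes and KSS approaches POD. A small sketch (entry labels as in Table 1):

```python
def kss(a, b, c, d):
    """Kuipers skill score, KSS = POD - PFD."""
    return a / (a + c) - b / (b + d)

# POD is fixed at 8 / (8 + 2) = 0.8; only the correct rejections d vary
for d in (10, 1_000, 100_000):
    print(d, round(kss(a=8, b=4, c=2, d=d), 4))
# as d grows, PFD = 4 / (4 + d) vanishes and KSS approaches POD
```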

### b. Scores for probabilistic forecasts

The Brier score BS is the mean squared error of the probability forecasts *p*_{f,g} = *p*_{f}(*λ,* *ϕ*), where the index *g* = 1, …, *N*_{G} denotes the forecast–event pairs of all considered grid points. The observed probability function is defined to be *o*_{g} = 1 if the event occurs and *o*_{g} = 0 if the event does not occur:

BS = (1/*N*_{G}) ∑_{g=1,*N*_{G}} (*p*_{f,g} − *o*_{g})^{2}.

The Brier score can be computed as the sum of three terms related to reliability, resolution, and uncertainty:

BS = BS_{rel} − BS_{res} + BS_{unc}.

The relative operating characteristic (ROC) curve is constructed from the probability forecasts *p*_{f}(*λ,* *ϕ*) defined in Eq. (4), stratified according to observation into 51 categories as in Table 2. For any given probability threshold *j,* the entries of this table can be summed to produce the four entries of a 2 × 2 contingency table. From each of the *j*th contingency tables, the probability of detection POD_{j} and the probability of false detection PFD_{j} can be computed. The 51 pairs (PFD_{j}, POD_{j}) can be plotted one against the other on a graph; the result is a smooth curve called the ROC curve.

The area ROCA under the ROC curve is a measure of the skill of the probabilistic forecast: ROCA_{per} = 1 for a perfect forecast, and ROCA_{cli} = 0.5 for a climatological forecast (since then PFD = POD). Following the generic skill-score definition of section 3a, the ROC area skill score is ROCAS = (ROCA − ROCA_{cli})/(ROCA_{per} − ROCA_{cli}) = 2 ROCA − 1.
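A sketch of the ROC area computation (assuming NumPy; the paper's 51-category stratification is approximated here by 51 evenly spaced probability thresholds, scanned from strictest to most lenient so that the (PFD, POD) pairs are ordered for trapezoidal integration):

```python
import numpy as np

def roc_area(p_f, ob, n_thresholds=51):
    """Area under the ROC curve: one 2 x 2 contingency table per
    probability threshold, then trapezoidal integration of POD vs PFD."""
    pod, pfd = [], []
    for t in np.linspace(1.0, 0.0, n_thresholds):  # strictest threshold first
        yes = p_f >= t
        a, c = np.sum(yes & ob), np.sum(~yes & ob)
        b, d = np.sum(yes & ~ob), np.sum(~yes & ~ob)
        pod.append(a / (a + c))
        pfd.append(b / (b + d))
    x, y = np.array(pfd), np.array(pod)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

rng = np.random.default_rng(2)
ob = rng.random(1581) < 0.03                     # rare event, ~3% frequency
# probabilities well separated between events and non-events
skilful = np.where(ob, 0.6 + 0.4 * rng.random(1581), 0.4 * rng.random(1581))
print(round(roc_area(skilful, ob), 3))
print(round(roc_area(rng.random(1581), ob), 3))  # no-skill forecast, near 0.5
```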

Let *J*_{ev} be the number of (ranked) forecast events, *p*^{j}_{f,g} the forecast probability of the *j*th event, and *o*^{j}_{g} the corresponding observed probability (*o*^{k}_{g} = 1 if the *k*th event is observed, and *o*^{j}_{g} = 0 for *j* ≠ *k*), where *g* = 1, …, *N*_{G} denotes the *g*th grid point. The grid point ranked probability score RPS_{g} is computed from the squared error of the cumulative forecast and observed probabilities:

RPS_{g} = ∑_{j=1,*J*_{ev}} [∑_{i=1,*j*} (*p*^{i}_{f,g} − *o*^{i}_{g})]^{2}.

The area-average ranked probability score RPS is defined as the average of RPS_{g} over the *N*_{G} grid points. The ranked probability skill score RPSS is defined with respect to a forecast based on the sample climatology.

Another measure of ensemble performance is the percentage of observed values lying outside the ensemble forecast range, also called the percentage of outliers (POUTL; Strauss and Lanzinger 1995). As a reference value, POUTL_{ref} = 2/52 for an ensemble system with 51 members that randomly samples the forecast probability density function.
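For a single grid point, the ranked probability score can be sketched from the cumulative probabilities (a sketch assuming NumPy; no normalization is applied here, since conventions differ on dividing by *J*_{ev} − 1):

```python
import numpy as np

def rps_gridpoint(p_f, k_obs):
    """Squared error of cumulative forecast vs observed probabilities.
    p_f: forecast probabilities per ranked category (summing to 1);
    k_obs: index of the observed category."""
    o = np.zeros_like(p_f)
    o[k_obs] = 1.0
    cum_f, cum_o = np.cumsum(p_f), np.cumsum(o)
    return float(np.sum((cum_f - cum_o) ** 2))

# three ranked precipitation categories; the middle one is observed
print(rps_gridpoint(np.array([0.2, 0.5, 0.3]), k_obs=1))
# a forecast concentrated on the observed category scores a perfect 0
print(rps_gridpoint(np.array([0.0, 1.0, 0.0]), k_obs=1))
```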

### c. Potential forecast economic value

As in Richardson (2000), consider decision makers interested in protecting from the occurrence of the event under consideration. Suppose that if they take an action incurring a cost *C* they can avoid a loss *L* (with *L* > *C*). Table 3 summarizes this simple cost–loss model.

If the decision makers know only the climatological frequency of the event *p*_{cli} and assume that the sample observed frequency is equal to the long-term climatology, their optimal strategy would be to always protect if *C* < *p*_{cli}*L,* and their expected mean expense per unit loss would be

ME_{cli}/*L* = min(*C*/*L*, *p*_{cli}). (22)

If the decision makers have access to a perfect forecast, then their mean expense per unit loss would be

ME_{perf}/*L* = *p*_{cli}(*C*/*L*), (23)

since they would incur a cost *C*/*L* (per unit loss) only on the *p*_{cli} occasions when they protected themselves against the loss (always avoided).

Suppose now that the decision makers protect whenever the event is forecast, with the observed frequency of the event *p*_{cli} = (*a* + *c*)/*n.* Then, from Tables 1 and 3, it follows that their mean expense (per unit loss) would be

ME_{f}/*L* = [(*a* + *b*)/*n*](*C*/*L*) + *c*/*n*. (24)

The potential economic value FV of the forecast is defined as the reduction of the mean expense with respect to the reduction of the expense that could be achieved by a perfect forecast:

FV = (ME_{cli} − ME_{f})/(ME_{cli} − ME_{perf}). (25)

Applying Eqs. (22), (23), and (24) and the definition of the Kuipers skill score, the potential forecast economic value can be written as

FV = POD − PFD [(*C*/*L*)(1 − *p*_{cli})]/[*p*_{cli}(1 − *C*/*L*)] for *C*/*L* ≥ *p*_{cli},
FV = POD [*p*_{cli}(1 − *C*/*L*)]/[(*C*/*L*)(1 − *p*_{cli})] − PFD − (*p*_{cli} − *C*/*L*)/[(*C*/*L*)(1 − *p*_{cli})] for *C*/*L* ≤ *p*_{cli}. (26)

From Eq. (26) it is easy to see that

- if *C*/*L* > *p*_{cli} then FV ≤ KSS, with equality only if PFD = 0;
- if *C*/*L* < *p*_{cli} then FV ≤ KSS, with equality only if POD = 1;
- when *C*/*L* = *p*_{cli} the forecast value is maximum, FV = KSS.

Equation (26) highlights the fact that, given the observed frequency of the event *p*_{cli} and the user cost–loss ratio *C*/*L,* the forecast value depends only on the probability of false detection and the probability of detection of the system [since by definition KSS = POD − PFD; see Eq. (13)]. Furthermore, it shows that the potential economic value is a weighted difference between the probability of detection and the probability of false detection of the system, with weights a function of the event observed frequency (climatology) and the user cost–loss ratio. Equation (26) also indicates that the Kuipers skill score of the system can be considered as the maximum forecast value that can be obtained from the system.
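The cost–loss reasoning above can be sketched directly; the mean-expense expressions follow the definitions in the text, written in terms of POD, PFD, *p*_{cli}, and *C*/*L* (function and parameter names are illustrative):

```python
def forecast_value(pod, pfd, p_cli, cl_ratio):
    """FV = (ME_cli - ME_f) / (ME_cli - ME_perf), expenses per unit loss."""
    r = cl_ratio
    me_cli = min(r, p_cli)   # always or never protect, whichever is cheaper
    me_perf = p_cli * r      # protect exactly when the event occurs
    # protect whenever the event is forecast: hits and false alarms cost C/L,
    # misses cost the full unit loss
    me_f = r * (pod * p_cli + pfd * (1.0 - p_cli)) + (1.0 - pod) * p_cli
    return (me_cli - me_f) / (me_cli - me_perf)

# at C/L = p_cli the value reaches its maximum, KSS = POD - PFD
print(round(forecast_value(pod=0.8, pfd=0.1, p_cli=0.03, cl_ratio=0.03), 3))
# away from C/L = p_cli the value is smaller
print(round(forecast_value(pod=0.8, pfd=0.1, p_cli=0.03, cl_ratio=0.2), 3))
```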

Consider now the probabilistic forecasts of the ensemble *f*_{j}, which is again taken at face value (i.e., without adjusting the forecast for estimated model errors). As for the computation of the ensemble ROC area, consider the 51 pairs (PFD_{j}, POD_{j}) computed from the 2 × 2 contingency tables associated with the 51 probability thresholds. Applying Eq. (26), the 51 forecast values FV_{j} associated with the *j*th probability threshold can be computed. Given the observed frequency of the event *p*_{cli}, for each cost–loss ratio the forecast value of the ensemble is defined as

FV_{ens} = max_{j} FV_{j}. (27)

Equation (27) shows that each user can optimize the ensemble forecast value for each cost–loss ratio *C*/*L* by choosing the probability threshold that has maximum value at that specific ratio.
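Eq. (27) is an upper envelope over the threshold-wise values; a sketch (the `forecast_value` helper repeats the cost–loss expressions of section 3c, and the ROC points are made up for illustration):

```python
def forecast_value(pod, pfd, p_cli, r):
    """Potential economic value of one categorical forecast (section 3c)."""
    me_cli, me_perf = min(r, p_cli), p_cli * r
    me_f = r * (pod * p_cli + pfd * (1.0 - p_cli)) + (1.0 - pod) * p_cli
    return (me_cli - me_f) / (me_cli - me_perf)

def ensemble_value(roc_points, p_cli, r):
    """Eq.-(27)-style envelope: best value over all probability thresholds,
    each threshold contributing its own (PFD_j, POD_j) pair."""
    return max(forecast_value(pod, pfd, p_cli, r) for pfd, pod in roc_points)

# three hypothetical thresholds, from liberal (low) to conservative (high)
roc = [(0.30, 0.95), (0.10, 0.80), (0.02, 0.50)]
for r in (0.01, 0.03, 0.30):
    print(r, round(ensemble_value(roc, p_cli=0.03, r=r), 3))
# different cost-loss ratios are served best by different thresholds
```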

## 4. Scores of categorical and probabilistic forecasts: A single-case study

Denote by *f*_{0}(*λ,* *ϕ*) an observed pattern defined by (*A*_{0} = 100, *λ*_{0} = 10, *ϕ*_{0} = 45, *θ*_{0} = 30, *σ*_{x,0} = 1.5, *σ*_{y,0} = 1), and denote by *f*_{j}(*λ,* *ϕ*) an ensemble of forecasts defined by randomly sampling the parameters (*A*_{j}, *λ*_{j}, *ϕ*_{j}, *θ*_{j}, *σ*_{x,j}, *σ*_{y,j}) in the intervals defined in Table 4. These patterns can be considered to represent observed and forecast precipitation fields. Since the parameters used to define the observed pattern are included in the range of the forecast parameters, the ensemble is reliable. Furthermore, by construction each ensemble member is, on average, equally skillful. It is worth reminding the reader that the results discussed in this section refer to one case only and thus may present some peculiar features.

Figure 2 shows the observed pattern *f*_{0}(*λ,* *ϕ*) and the control forecast *f*_{1}(*λ,* *ϕ*) defined by (*A*_{1} = 102, *λ*_{1} = 12, *ϕ*_{1} = 44, *θ*_{1} = 50, *σ*_{x,1} = 1.4, *σ*_{y,1} = 1.1). Figure 3a shows the observed frequency *p*_{cli} of events characterized by a different value, that is, by a different amount of precipitation. Note that all frequencies are smaller than 0.03, indicating quite rare events for any threshold.

The control forecast is practically unbiased, with a higher threat score for low precipitation amounts and with a positive skill score KSS up to 70 mm (Fig. 3b). The fact that the KSS and the POD curves overlap indicates a very low PFD [see Eq. (13)] and is a consequence of the fact that correct forecasts of no event dominate the contingency table (Doswell et al. 1990). As mentioned in section 3, the forecast value depends on the cost–loss ratio *C*/*L* and on the event observed frequency (Fig. 3c), and its upper bound is given by the Kuipers skill score KSS, as is evident from the comparison of Figs. 3b and 3c. Figure 3d shows the Kuipers skill score KSS for the whole ensemble, composed of the control forecast and the 50 other forecasts: the ensemble mean forecast (dashed line) is more skillful than the control forecast (solid line) for precipitation amounts up to 50 mm.

The 51 ensemble forecasts have been used to generate probability forecasts of different precipitation amounts. Figure 4a shows that the probability forecasts have a positive Brier skill score for amounts up to 70 mm. The BSS of the ensemble forecast decreases with the threshold amount in a way similar to the KSS and the TS of the control forecast. By contrast, the ROC area skill score is always equal to 1 and does not decrease with the precipitation threshold, due to very low probabilities of false detection. Figure 4a also shows that the RPSS is always positive (dotted line; constant since it is an integrated measure computed considering all the precipitation amounts). Figure 4b shows the potential forecast value FV_{j} for all the probability thresholds *j* = 1, …, *N*_{ens} for the prediction of more than 10 mm of precipitation. Note that, for each cost–loss ratio, a different probability threshold achieves the maximum potential forecast value.

By definition, any score depends on the area over which it is computed. This sensitivity to the area definition is shown by comparing the forecast scores computed over a European subregion centered on the observed pattern (40° ≤ latitude ≤ 50°N, 0° ≤ longitude ≤ 20°E, 231 grid points) with the scores computed over the whole European area (1581 grid points). The reduction of the verification area increases the observed relative frequency *p*_{cli} by about a factor of 7 (see Fig. 5a for the small area and Fig. 3a for Europe). Since the contingency tables for the two areas differ mainly in the number of correctly forecast nonoccurrences, the only verification scores that are sensitive to the area reduction are those that depend on the “d” entry of the 2 × 2 contingency table. Indeed, the bias, the threat score, and the probability of detection do not change (see Figs. 3b and 5b). By contrast, the probability of false detection changes, and this affects the Kuipers skill score (see Figs. 3b and 5b) and the forecast value (see Figs. 3c and 5c). Similarly, the Kuipers skill score of all the ensemble forecasts (not shown) and the forecast value of probabilistic forecasts of precipitation amounts are affected (see Figs. 4b and 5d).

## 5. Score sensitivity: A single-case study

The sensitivity of the forecast scores to a priori imposed errors due to amplitude under- or overestimation, to position errors, and to “shape errors” (i.e., errors in the definition of the width of the Gaussian function) is investigated hereafter. All results discussed in this section refer to the prediction of more than 10 mm of precipitation.

Figure 6a shows the sensitivity of the control scores to amplitude errors. Results have been obtained by setting all parameters apart from *A*_{1} as in Table 4 (as in section 4) and with 0 ≤ *A*_{1} ≤ 200 (i.e., with 0 ≤ *A*_{1}/*A*_{0} ≤ 2 since *A*_{0} = 100 for the observed field; see Table 4). It is interesting to note that for this precipitation amount (10 mm) the threat score is always positive for any *A*_{1} > 0. The threat score relative, for example, to the prediction of 40 mm is zero for *A*_{1}/*A*_{0} < 0.6 (not shown).

Consider now an ensemble of 50 forecasts defined by (*A*_{j}, *λ*_{j}, *ϕ*_{j}, *θ*_{j}, *σ*_{x,j}, *σ*_{y,j}) generated as follows. For each *A*_{1}, the ensemble of 50 parameters *A*_{j} (*j* = 2, …, 51) is sampled in the interval max(0, *A*_{1} − 40) < *A*_{j} < (*A*_{1} + 40), while all the other parameters are sampled according to Table 4 (e.g., by setting 7 ≤ *λ*_{j} ≤ 13). Figure 6b shows the TS, the POD, and the KSS of the ensemble mean forecast. As for the control forecast, the ensemble mean scores are sensitive to the amplitude errors; compared to the control forecast (Fig. 6a), however, the ensemble mean has higher TS and KSS for any *A*_{1}/*A*_{0} > 0.2. Figure 6c shows the BSS, the ROCAS, and the RPSS for the probabilistic prediction. The BSS and the RPSS show a sensitivity to *A*_{1}/*A*_{0} similar to the sensitivity shown by the control and the ensemble mean forecasts, with smaller BSS and RPSS for *A*_{1}/*A*_{0} > 0.6. By contrast, the ROCAS is 1 for any ratio *A*_{1}/*A*_{0} > 0, and the POUTL is zero for any *A*_{1}/*A*_{0} (not shown).

Figure 7a shows the sensitivity of the control scores to a position error in predicting the longitude of the precipitation maximum (parameter *λ*). All parameters but *λ*_{1} were defined according to Table 4, while 0° ≤ *λ*_{1} ≤ 20°E [i.e., with errors −10 ≤ (*λ*_{1} − *λ*_{0}) ≤ 10 since *λ*_{0} = 10°E for the observed field; see Table 4]. Results indicate that the (unbiased) control forecast is skillful only for a certain range of position errors, with this range depending on the precipitation amount and on the width (*σ* parameters) of the observed and forecast fields. As before, for each *λ*_{1} an ensemble of 50 forecasts has been generated, each with a different parameter *λ*_{j} sampled in the interval (*λ*_{1} − 3) ≤ *λ*_{j} ≤ (*λ*_{1} + 3). Figure 7b shows the sensitivity to the position error of the ensemble mean scores, and Fig. 7c the sensitivity of the scores of the probabilistic predictions. Compared to the control forecast (Fig. 7a), the ensemble mean (Fig. 7b) has higher TS and KSS, but it also has a larger bias (see Hamill 1999 for a discussion of the impact of model bias on verification scores such as the equitable threat score). The scores of the probability forecasts show a strong sensitivity to the forecast position error (Fig. 7c): large position errors induce low values of ROCAS, negative BSS and RPSS, and high POUTL. The fact that forecasts with a positive ROCAS can have a negative Brier skill score and a negative RPSS confirms that different measures of forecast quality give different results. Generally speaking, the comparison between Figs. 6 and 7 suggests that a position error has a more severe effect on the forecast scores than an amplitude error. Similar results would have been obtained by varying the latitude of the maximum value (not shown).

Figure 8 shows the sensitivity to errors in the prediction of the Gaussian function standard deviation along the (unrotated) *x* axis, that is, *σ*_{x,1}. All parameters but *σ*_{x,1} were defined according to Table 4, while 0.75 ≤ *σ*_{x,1} ≤ 3.75° (i.e., with 0.5 ≤ *σ*_{x,1}/*σ*_{x,0} ≤ 2.5 since *σ*_{x,0} = 1.5° for the observed field; see Table 4). Results show that the ensemble mean scores are higher than the scores of the control forecast for any *σ*_{x,1} (Figs. 8a,b) and that the ensemble scores of the probabilistic precipitation prediction deteriorate for too small or too large *σ*_{x,1} (Fig. 8c), that is, if each forecast field is too narrow or too wide.

In Fig. 8c, the fact that POUTL > 0 for large *σ*_{x,1} is related to the way all the forecast parameters are set (see Table 4). These results document the sensitivity of the different measures of skill to under- or overestimation, to position errors, and to errors in the prediction of the correct shape (*σ* parameters) of the observed pattern. They also show that forecasts judged to be skillful according to one measure of forecast skill could be judged to have no skill according to others. Furthermore, they indicate that a reliable ensemble can be used to construct a single deterministic forecast (i.e., the ensemble mean) that is more skillful than each single ensemble member. This aspect is further analyzed in the following section, where categorical and probabilistic forecasts are compared considering a larger dataset.

## 6. Potential economic value of categorical and probabilistic forecasts: 90 cases average results

The impact of random and systematic model errors on the average performance (90 cases) is investigated hereafter.

### a. Impact of random errors

A high quality forecasting system (no systematic model error and with equally skillful ensemble members) can be simulated by randomly sampling each of the parameters that define the observed and the forecast fields from the same interval.

Figure 9a shows the average scores of each single deterministic forecast given by one ensemble member (e.g., the control). The forecast has no bias, as expected by construction, and has a positive Kuipers skill score for all precipitation thresholds. Figure 9b shows the reliability, resolution, and uncertainty terms of the BS for the ensemble probabilistic predictions: it indicates that the ensemble is reliable (almost null Brier score reliability term). Figure 9c shows that the ensemble has a ROCAS above zero for all thresholds, a positive BSS for thresholds up to 70 mm, a positive RPSS, and an almost null percentage of outliers. Figures 9d–f show the potential forecast value of the control and the ensemble mean, and the potential forecast value of the ensemble probabilistic predictions for three different thresholds, 1, 10, and 40 mm. The comparison of the potential forecast value curves confirms the indications of section 5 that the ensemble mean is more valuable than the control forecast, and that the ensemble probabilistic prediction has a higher value than any deterministic forecast. It is worth pointing out that all potential economic value curves peak for very small cost–loss ratios since all events are very rare, even for 1 mm.

In other words, decision makers interested in predicting a binary event “rainfall greater than an amount *x*” would have a higher return if they made decisions (protect/not protect) according to the ensemble probabilistic forecast than according to any single deterministic forecast. This result is summarized in the potential-forecast-value chessboards shown in Fig. 10. For any cost–loss ratio and any threshold amount, the potential forecast value is higher if actions are taken according to the ensemble probabilistic forecast rather than the control or the ensemble mean forecast (not shown).

Similar results have been obtained when a second source of random error [i.e., for forecasts defined applying Eq. (2) with *c*_{j} (*λ,* *ϕ*) set to be a random number uniformly sampled in the interval 0 ≤ *c*_{j} (*λ,* *ϕ*) ≤ 1] has been introduced in the generation of the ensemble forecasts (not shown).

### b. Impact of systematic over/underestimation

Consider now two ensemble systems characterized by random errors (as for the ensemble discussed in section 6a) and by a 40% under- or overestimation of the precipitation maxima. These results have been obtained by defining the forecasts as in section 6a but by multiplying the coefficient *A*_{0} of the observation field by a factor of 0.6 or by a factor of 1.4. These examples can be thought to describe the performance of ensemble systems based on a model characterized by a poor simulation of moist processes that induces either a rainfall overestimation or underestimation. Figure 11 shows the performance of these two systems.

On average, under- or overestimation errors induce a bias in each single deterministic forecast (see Figs. 11a,b and 9a), especially for thresholds larger than 60 mm, and they have a sizeable impact on the threat scores and the Kuipers skill scores for thresholds larger than 60 mm. The bias curve for the ensemble characterized by overestimation (Fig. 11b) drops to zero for precipitation values larger than 60 mm because, by construction, 60 mm is the maximum observed value. Considering the ensemble probabilistic predictions, both over- and underestimation have a sizeable impact on the ROCAS for thresholds larger than 60 mm, while overestimation has a larger impact than underestimation on the BSS. The impact of over- or underestimation on the potential forecast value reflects the impact on the ROCAS, that is, small for small thresholds (say, up to 40 mm; see Figs. 11e,f for the 10-mm threshold) and large for larger values (not shown). It is worthwhile to point out that the potential-forecast-value curves for the system affected by underestimation lie to the left of the forecast value curves of the system affected by overestimation (Figs. 11e,f). This indicates that, depending on whether a user has a high or a low cost–loss ratio *C*/*L*, an under- or an overestimating system could be more valuable.

### c. Impact of systematic position errors

Consider now two ensemble systems characterized by random errors (as for the ensemble discussed in section 6a) and by a 3° or a 6° position error (i.e., two or four times the standard deviation of the observed pattern; see Table 4). These results have been obtained by defining the forecasts as in section 6a but shifting the position of the maximum value of each day's verification field by 3° or 6°. These examples can be thought of as describing the performance of an ensemble of forecasts generated by a model with a tendency to predict too-weak zonal flows. Figure 12 shows the performance of these systems.

The impact of either a 3° or a 6° position error on the average scores of each single forecast is very small (see Figs. 12a,b and 9a) and almost undetectable in the threat score and the Kuipers skill score. The impact is larger on the probabilistic predictions (see Figs. 12c,d and 9c), especially for thresholds of 20 mm or more: a 3° or a 6° position error makes the BSS negative for thresholds larger than 30 or 20 mm, respectively. The impact on the potential forecast value reflects the impact on the ROCAS (see Figs. 12e,f and 9d).

It is interesting to compare the results of this section with those obtained in the previous section. In particular, the comparison of the potential forecast value for the 10-mm threshold (see Figs. 11e,f; 12e,f; and 9c) indicates that systematic position errors of 3° to 6° (i.e., of 2 to 4 standard deviations) reduce the potential forecast value of single deterministic and probabilistic forecasts more than systematic over- or underestimation errors of 40%.

### d. Impact of systematic “shape” errors

Consider now two ensemble systems characterized by random errors (as for the ensemble discussed in section 6a) and by the systematic prediction of a too broad or too narrow precipitation field. More specifically, consider two ensemble systems defined by a *σ*_{x,j} that is two times too small or too large. These results have been obtained by defining the forecasts as in section 6a but rescaling *σ*_{x,0} by a factor of 2 or by a factor of 0.5. Figure 13 shows the performance of these systems.

The impact on the average scores of each single forecast of predicting a two times too small or too large *σ*_{x,j} is qualitatively similar to the impact of under- or overestimation. The prediction of a two times too small *σ*_{x,j} leads to an average bias of 0.5 (Fig. 13a), while the prediction of a two times too large *σ*_{x,j} leads to an average bias of 2.0 (Fig. 13b). Again in analogy with under- or overestimation, the prediction of a two times too small *σ*_{x,j} has a small impact on the probabilistic scores (Fig. 13c), while the prediction of a two times too large *σ*_{x,j} leads to negative Brier skill scores for all thresholds (Fig. 13d). The impact on the ROCAS is smaller than the impact on the Brier skill score, and it is similar for both too small and too large predicted *σ*_{x,j} (Figs. 13c,d). It is worthwhile to point out that the potential-forecast-value curves for the system affected by a two times too small *σ*_{x,j} are shifted to the right with respect to the forecast value curves of the system affected by a two times too large *σ*_{x,j} (Figs. 13e,f). Note that this shift is qualitatively similar to the shift of the forecast value curves of systems affected by over- or underestimation (Figs. 11e,f).

### e. Impact of ensemble size and systematic model (position) errors

Consider now four ensemble configurations, each of them characterized by a different ensemble size and based on models with different systematic errors. Ensemble E51_6d has 51 members and uses a model affected by a 6° systematic position error; ensembles E51_3d and E11_3d have, respectively, 51 and 11 members with a 3° systematic position error; and finally E5_1.5 has only 5 members with a 1.5° systematic position error. The comparison of the performance of the four ensembles helps in addressing the question of whether an ensemble with a small ensemble size but based on an accurate forecast model has higher potential forecast value than an ensemble based on a larger set of integrations of a less accurate model. Figure 14 shows the potential forecast value for the four ensemble configurations.

The comparison of the potential forecast value of configurations E51_6d and E51_3d (Figs. 14a,b) confirms the results discussed above: a reduction of the model systematic error increases the forecast value. The comparison of the potential forecast value of configurations E51_3d and E11_3d (Figs. 14b,c) shows that a reduction of the ensemble size from 51 to 11 members decreases the forecast value of an ensemble system based on an accurate model to almost the same level as configuration E51_6d, which is based on a poor model (Fig. 14a). On the other hand, Fig. 14d confirms that a further reduction of the model systematic error can increase the potential forecast value. In other words, these results indicate that the potential forecast value of an ensemble system depends strongly on both ensemble size and model accuracy (in this particular case the potential forecast value is more sensitive to model error than to ensemble size).

### f. Sensitivity of the ensemble performance to ensemble spread

The results presented so far did not address the sensitivity of the ensemble scores to the ensemble spread. This point has to be addressed since an ensemble system must have the right level of spread to include the verification inside the range spanned by the ensemble forecast. On the other hand, it is worth investigating whether ensemble systems with a wrong level of spread may nevertheless provide probabilistic forecasts with a higher potential economic value than single deterministic forecasts.

Consider an ensemble of forecasts characterized by a mean position error of 6° in longitude (*λ*_{j} − *λ*_{0} = 6°), with the observed and the predicted Gaussian distributions characterized by 1.1 ≤ *σ*_{x,j} ≤ 1.9 and 0.8 ≤ *σ*_{y,j} ≤ 1.2. Suppose that the ensemble has the right level of spread in the meridional direction, and consider the sensitivity of the ensemble performance to the spread in the zonal direction. This can be investigated by considering ensembles with different ranges for the parameter *λ*_{j}. Figures 15 and 16 show some results for the 10-mm threshold.
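The synthetic fields used in these experiments are built from two-dimensional Gaussian functions on the 1° Europe grid described in the introduction. A loose Python sketch of such a field follows; it is an illustration of the general construction (a single Gaussian with prescribed center and widths), not a reproduction of the paper's exact Eq. (1), and the amplitude and parameter names are mine:

```python
import math

def gaussian_field(lam0, phi0, sig_x, sig_y, amp=20.0):
    """Synthetic 'precipitation' pattern on the 1-deg Europe grid
    (30-60N latitude, 20W-30E longitude): a single 2D Gaussian
    centered at (lam0, phi0) with zonal/meridional widths sig_x, sig_y.
    Illustrative only; not the paper's exact formulation."""
    field = {}
    for lam in range(-20, 31):          # longitude, 51 grid points
        for phi in range(30, 61):       # latitude, 31 grid points
            z = ((lam - lam0) / sig_x) ** 2 + ((phi - phi0) / sig_y) ** 2
            field[(lam, phi)] = amp * math.exp(-0.5 * z)
    return field

# One ensemble member with a zonal position error: shift lam0 eastward.
member = gaussian_field(5 + 6, 45, 3.0, 2.0)
```

Note that the grid contains 51 × 31 = 1581 points, matching the *N*_{G} = 1581 points of the verification domain; a position error or a wrong *σ*_{x,j} is imposed simply by perturbing the corresponding parameter.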

Due to the systematic position error, the control forecast of 10 mm of rain has no skill (in terms of either TS or KSS, not shown). By contrast, the ensemble mean has a positive TS and a positive KSS if the ensemble spread is neither too small nor too large (Fig. 15a). Similarly, the ensemble probabilistic prediction of 10 mm of rain has a positive BSS and a high ROCAS only if the ensemble spread is neither too small nor too large (Fig. 15b). Likewise, the potential forecast value is positive for both the ensemble mean forecast (Fig. 16a) and the ensemble probabilistic prediction (Fig. 16b) only if the spread is neither too small nor too large. Note that these results indicate that even if the ensemble spread is not properly tuned (but not outrageously wrong), ensemble probabilistic predictions can be skillful and have potential economic value.
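For reference, the Brier skill score used above can be computed from ensemble-derived event probabilities against a binary verification, with the climatological (observed-frequency) forecast as reference. The following is a minimal sketch under those assumptions; the function name is illustrative, not from the paper:

```python
def brier_skill_score(prob, obs):
    """Brier skill score of probabilistic forecasts of a dichotomous
    event, relative to a climatological reference forecast.

    prob: forecast probability of the event at each grid point
          (for an ensemble, the fraction of members predicting it)
    obs:  1 if the event occurred at that point, else 0
    """
    n = len(prob)
    bs = sum((p - o) ** 2 for p, o in zip(prob, obs)) / n
    clim = sum(obs) / n                              # observed frequency
    bs_ref = sum((clim - o) ** 2 for o in obs) / n   # climatological Brier score
    return 1.0 - bs / bs_ref
```

A perfect probabilistic forecast (probability 1 where the event occurs, 0 elsewhere) gives BSS = 1, while issuing the climatological frequency everywhere gives BSS = 0; negative values, as for the badly overdispersive systems above, indicate performance worse than climatology.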

## 7. Conclusions

Issues related to the verification of the accuracy of categorical and probabilistic forecasts of discrete dichotomous events (occurrence/nonoccurrence) have been discussed. Synthetic forecasts have been compared to synthetic verification fields. The accuracy of categorical and probabilistic forecasts has been assessed using a variety of accuracy and skill measures (hit rate, threat score, probability of detection and probability of false detection, bias, Kuipers skill score, Brier score and skill score, ROC area score and skill score, ranked probability skill score, probability of outliers).
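The contingency-table measures among these all derive from the four cells of the 2 × 2 table for a dichotomous event. A minimal Python sketch of their standard definitions (the function name and cell labels *a*–*d* are illustrative conventions, not notation from the paper):

```python
def contingency_scores(a, b, c, d):
    """Accuracy measures for a 2x2 contingency table of a dichotomous event.

    a: hits, b: false alarms, c: misses, d: correct rejections (counts).
    """
    n = a + b + c + d
    pod = a / (a + c)       # probability of detection (event hit rate)
    pofd = b / (b + d)      # probability of false detection
    return {
        "hit rate": (a + d) / n,          # fraction of correct forecasts
        "threat score": a / (a + b + c),  # ignores correct rejections
        "POD": pod,
        "POFD": pofd,
        "bias": (a + b) / (a + c),        # forecast freq. / observed freq.
        "KSS": pod - pofd,                # Kuipers skill score
    }
```

For example, `contingency_scores(40, 20, 10, 30)` gives POD = 0.8, POFD = 0.4, bias = 1.2, and KSS = 0.4; the Brier, ROC area, and ranked probability scores require the full probabilistic forecasts rather than a single table.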

A simple decision model has been used to estimate the potential economic value of both categorical and probabilistic forecasts (as in Richardson 2000). It has been shown that the potential economic value can be written as a weighted difference between the system's probability of detection and probability of false detection. It has also been shown that the maximum potential economic value, attained when the cost–loss ratio equals the climatological frequency of the event, is given by the Kuipers skill score.
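The decision-model computation can be sketched as follows: the relative value compares the mean expense of acting on the forecast with the expenses of the climatological and the perfect-forecast strategies. This is a minimal Python illustration of the Richardson (2000) relative-value formula; the variable names are mine:

```python
def potential_value(pod, pofd, clim, alpha):
    """Relative economic value V of a forecast system in the
    cost-loss decision model (following Richardson 2000).

    pod:   probability of detection, H
    pofd:  probability of false detection, F
    clim:  climatological frequency of the event
    alpha: cost-loss ratio C/L, with 0 < alpha < 1
    """
    # Mean expense per unit loss L of: acting on the forecast,
    # the best climatological strategy, and a perfect forecast.
    e_forecast = (pod * clim + pofd * (1.0 - clim)) * alpha + clim * (1.0 - pod)
    e_climate = min(alpha, clim)
    e_perfect = clim * alpha
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Setting `alpha` equal to `clim` reduces the expression algebraically to `pod - pofd`, i.e., the Kuipers skill score, which is how the maximum potential economic value arises; a perfect system (POD = 1, POFD = 0) gives V = 1 for every cost–loss ratio.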

Each forecast accuracy or skill measure summarizes in one number the differences between the observed and forecast patterns. It has been shown how difficult it is to associate a specific number with the actual difference between the two patterns, and that the use of more than one accuracy or skill measure gives a more complete picture of the performance of a system. Different measures of forecast accuracy have been shown to behave with a certain degree of coherence, all showing a qualitatively similar response to increasing model errors. Nevertheless, it has also been shown that quantitative disagreement can occur, with forecasts judged skillful according to one measure but unskillful according to others. This supports Murphy's (1991) indication of the large dimensionality of the verification problem.

The sensitivity of accuracy and skill measures to imposed random and systematic errors has been investigated. It has been shown how accuracy and skill measures are sensitive to the area definition, thus indicating that care must be taken when comparing forecast scores for different regions characterized by different observed frequencies. The sensitivity of accuracy and skill measures to amplitude, position, and “shape” errors has been studied. Considering the Brier skill score or the ROC area skill score, results indicated, for example, that position errors can have a bigger effect than over- or underestimation errors.

Ensembles with different sizes and systematic model errors have been compared to investigate the sensitivity of probabilistic forecasts to ensemble configuration. Results have shown that both model errors and ensemble size affect the accuracy and the potential economic value of an ensemble system, with small, accurate ensembles performing on average similarly to larger, less accurate ensembles. Results have also confirmed that an ensemble system should have the right level of spread to be skillful and have high potential economic value.

The potential economic value of categorical forecasts generated by single deterministic forecasts (given by one member of an ensemble of 51 forecasts) has been compared with the potential economic value of the probabilistic forecasts given by the whole ensemble. Results indicate that, independently of the model random or systematic error, ensemble-based probabilistic forecasts exhibit higher potential economic value than categorical forecasts.

These results indicate that the design of a forecasting system should follow the definition of its purposes (i.e., the definition of the accuracy measures used to gauge its performance). The design should be such that the ensemble system maximizes its outcome as assessed by the accuracy measures that best quantify the achievement of its purposes.

## Acknowledgments

I am very grateful to Robert Hine for all his editorial work, which substantially improved the quality of all the figures. David Richardson provided very useful comments on an earlier version of this manuscript. I am grateful to Steve Mullen and to two anonymous referees whose comments helped improve the first version of this manuscript.

## REFERENCES

Doswell, C. A., III, R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. *Wea. Forecasting*, **5**, 576–585.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hanssen, A. W., and W. J. A. Kuipers, 1965: On the relationship between the frequency of rain and various meteorological parameters. *Meded. Verh.*, **81**, 2–15.

Katz, R. W., A. H. Murphy, and R. L. Winkler, 1982: Assessing the value of frost forecasts to orchardists: A dynamic decision-making approach. *J. Appl. Meteor.*, **21**, 518–531.

Mason, I., 1982: A model for assessment of weather forecasts. *Austr. Meteor. Mag.*, **30**, 291–303.

Murphy, A. H., 1971: A note on the ranked probability score. *J. Appl. Meteor.*, **10**, 155–156.

Murphy, A. H., 1985: Decision making and the value of forecasts in a generalized model of the cost-loss ratio situation. *Mon. Wea. Rev.*, **113**, 362–369.

Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. *Mon. Wea. Rev.*, **119**, 1590–1601.

Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. *Wea. Forecasting*, **11**, 3–20.

Richardson, D. S., 2000: Skill and economic value of the ECMWF Ensemble Prediction System. *Quart. J. Roy. Meteor. Soc.*, **126**, 649–668.

Strauss, B., and A. Lanzinger, 1995: Validation of the ECMWF EPS. *Proc. ECMWF Seminar on Predictability*, Vol. 2, Shinfield Park, Reading, United Kingdom, ECMWF, 157–166.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. Academic Press, 467 pp.

Wilks, D. S., and T. M. Hamill, 1995: Potential economic value of ensemble-based surface weather forecasts. *Mon. Wea. Rev.*, **123**, 3564–3575.

Contingency table for dichotomous event

Table of occurrences/nonoccurrences for ROC area definition

Cost–loss decision model

Coefficients used to define the parameters of the observed and the forecast values [Eq. (1)]