## 1. Motivation and scope

There remains a place in weather forecast services for single-valued forecasts of weather parameters important for day-to-day decision-making (Lazo et al. 2009). Such forecasts provide easily understood information about the weather and can be used as a simple basis for decision-making. This is notwithstanding the widely recognized benefits of explicitly probabilistic forecasts for more sophisticated decision-makers (Palmer 2002; Buizza 2008; Verkade and Werner 2011; Ramos et al. 2013) and the increasing availability of such probabilistic forecasts based on dynamical ensemble prediction systems (Buizza and Leutbecher 2015) and statistical postprocessing systems (Vannitsem et al. 2018; Hemri et al. 2014).

If a single-valued forecast is to be provided, a decision must be made as to the most appropriate information to feed this service. This decision may reside in the way an automatically generated service is constructed, or it might be a choice being made day by day by a weather forecaster deciding whether or not to adjust forecast guidance from an automated system. Different types of single-valued forecasts have different characteristics. Some may be termed “deterministic,” in the sense that they relate to a particular forecast scenario such as provided by the output of a single numerical weather prediction (NWP) model. Others represent a statistical construct taken from the distribution of possible outcomes, rather than any one scenario. For instance, the forecast could be the average of an ensemble of forecasts, whether from a “poor-man’s ensemble” set of deterministic NWP systems or from an ensemble prediction system.

When single-valued forecasts are displayed in gridded form, the differences in the various types of information become more obvious, with the single model output giving a meteorological picture consistent with one possible future state of the atmosphere, and the ensemble mean giving a picture that is more or less fuzzy depending on the degree of uncertainty. The former might be preferred by the meteorologist who wants a precise picture to which they can attach conceptual models to craft an accompanying narrative to the forecast. Likewise, a forecast user such as a fire behavior modeler might need a dynamically consistent picture to get a realistic representation of how a cold front would affect a fire. On the other hand, the ensemble mean has the advantage of not representing unforecastable and potentially misleading detail. This could be preferable for users who base their decision-making on values of meteorological parameters at their location, without making inferences based on meteorological features.

In this paper we compare single-valued forecast systems by considering how valuable they are for users’ decision-making. This goes beyond how forecasts perform when compared with observations using standard metrics such as root mean squared error or mean absolute error. As has been pointed out (Murphy 1993; Roebber and Bosart 1996; Marzban 2012), the benefit of a forecast to a user (which we will refer to as “forecast value”) does not necessarily depend on forecast skill as measured by closeness of forecasts to observations. In section 2 we look at which is the more valuable out of two example forecast systems, using a simple representation of forecast value. We present a graphical depiction of this comparison across the range of user sensitivities and across the range of decision thresholds.

Section 3 explores how the results relate to the properties of each of the two forecast systems. We consider forecast value in the light of a simple linear forecast error model that expresses forecast system characteristics in terms of conditional bias, unconditional bias, and random error spread. In section 4 we draw some theoretical conclusions about which forecast system is the more valuable for all users, at particular decision thresholds. We further examine the relationship between forecast value and different forecast system characteristics in section 5 by performing experiments with synthetic series of observations and forecasts. This allows us to identify the superiority of consensus average approaches for forecast decisions in nonextreme conditions, due to their reduced spread of random errors. However, in more extreme conditions, a forecast system with less conditional bias than the consensus will benefit some users.

In section 6 we provide suggestions regarding aspects of forecast system performance for which improvements would increase value to users.

## 2. Comparing forecast value

### a. Relative economic value

Relative economic value is based on a simple cost–loss decision model (Richardson 2000). Consider a user who must decide whether to act ahead of a possible weather event: they incur a loss *L* if that event occurs, unless they protect in advance against the event with cost *C*. We consider the expense *E*_{climate} that they would have incurred due to the weather events if they had made the same decision on every occasion based on the climatology of the weather event, the expense *E*_{forecast} that they would have incurred had they acted based on whether or not the forecast system predicted the event, and the expense *E*_{perfect} that they would have incurred if they used a hypothetical perfect forecast. The relative economic value *V* of a forecast system is defined as

*V* = (*E*_{climate} − *E*_{forecast})/(*E*_{climate} − *E*_{perfect}). (1)

Over many forecast occasions, the outcomes can be counted as relative frequencies of hits *h*, misses *m*, false alarms *f*, and correct rejections *q* (Table 1). Following Richardson (2000), (1) can then be written as

*V* = [min(*α*, *s*) − *Fα*(1 − *s*) + *Hs*(1 − *α*) − *s*]/[min(*α*, *s*) − *sα*], (2)

where *α* = *C*/*L* is the cost–loss ratio, *s* = *h* + *m* is the climatological frequency of the event, and

*H* = *h*/(*h* + *m*),  *F* = *f*/(*f* + *q*) (3)

are the hit rate and false alarm rate, respectively.

Table 1. Contingency table for a binary event, where the different outcomes refer to relative frequencies of occurrence.
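As an illustration, the contingency-table frequencies and the expense-based value definition above can be computed directly from paired forecast and observation series. The following is a minimal sketch assuming NumPy; the function names are ours, not the paper's. Expenses are evaluated per unit loss (*L* = 1, so protecting costs *α*).

```python
import numpy as np

def contingency(fcst, obs, threshold):
    """Relative frequencies of hits, misses, false alarms, and correct
    rejections for the binary event obs > threshold, predicted by
    fcst > threshold."""
    fe, oe = fcst > threshold, obs > threshold
    n = len(obs)
    h = np.sum(fe & oe) / n    # hits
    m = np.sum(~fe & oe) / n   # misses
    f = np.sum(fe & ~oe) / n   # false alarms
    q = np.sum(~fe & ~oe) / n  # correct rejections
    return h, m, f, q

def relative_value(h, m, f, q, alpha):
    """Relative economic value V for a user with cost-loss ratio alpha = C/L.
    Expenses are per event with L = 1, so protecting costs alpha."""
    s = h + m                          # climatological event frequency
    e_forecast = (h + f) * alpha + m   # protect on forecast "yes"; lose on misses
    e_climate = min(alpha, s)          # better of always/never protecting
    e_perfect = s * alpha              # protect only when the event occurs
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

A perfect forecast series recovers *V* = 1, and a poor forecast can score below zero, meaning the user would have been better off with climatology.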

### b. Significant cost–loss ratios

The difference in relative economic value Δ*V* between two different sets of forecasts depends on the differences in hit rate Δ*H* and false alarm rate Δ*F* as follows:

Δ*V* = [*s*(1 − *α*)Δ*H* − (1 − *s*)*α*Δ*F*]/[min(*α*, *s*) − *sα*]. (4)

The denominator of (4) is positive for 0 < *α* < 1, so when Δ*H* and Δ*F* have the same sign the set of forecasts with a higher hit rate is more valuable for *α* < *α*_{equal} while the set of forecasts with a lower false alarm rate is more valuable for *α* > *α*_{equal}, where

*α*_{equal} = *s*Δ*H*/[*s*Δ*H* + (1 − *s*)Δ*F*]. (5)

There is a cost–loss ratio *α*_{low} below which a user is so sensitive to the event (loss ≫ cost) that it is better to always protect against an event than make a decision based on the forecast. Following Richardson [2000, his Eq. (10)], this holds for *α* < *α*_{low}, where

*α*_{low} = *s*(1 − *H*)/[*s*(1 − *H*) + (1 − *s*)(1 − *F*)]. (6)

Likewise, there is a cost–loss ratio *α*_{high} above which a user is so insensitive to the event (loss ≈ cost) that it is better to never protect, rather than use the forecast. Again following Richardson [2000, his Eq. (11)], this holds for *α* > *α*_{high}, where

*α*_{high} = *sH*/[*sH* + (1 − *s*)*F*]. (7)
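The significant cost–loss ratios can be sketched as follows, given contingency-table frequencies for two forecast sets verified against the same observations. This is a hedged sketch using the expressions as reconstructed above; *α*_{equal} is only meaningful when Δ*H* and Δ*F* have the same sign, and the function name is ours.

```python
def significant_ratios(hA, mA, fA, qA, hB, mB, fB, qB):
    """Cost-loss ratios bounding which users benefit from which system.
    alpha_equal: crossover at which systems A and B have equal value
                 (meaningful when dH and dF share a sign).
    alpha_low / alpha_high: below/above these, always/never protecting
                 beats using the forecast (computed here for system A)."""
    s = hA + mA                      # base rate (same observations for A and B)
    HA, FA = hA / (hA + mA), fA / (fA + qA)
    HB, FB = hB / (hB + mB), fB / (fB + qB)
    dH, dF = HA - HB, FA - FB
    alpha_equal = s * dH / (s * dH + (1 - s) * dF)
    alpha_low = s * (1 - HA) / (s * (1 - HA) + (1 - s) * (1 - FA))
    alpha_high = s * HA / (s * HA + (1 - s) * FA)
    return alpha_equal, alpha_low, alpha_high
```

For instance, with a base rate of 0.3 and system A having the higher hit rate but also the higher false alarm rate, users with *α* below *α*_{equal} prefer A and those above prefer B, within the window (*α*_{low}, *α*_{high}) where the forecast beats climatology.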

### c. Single-valued forecasts to compare

Studies looking at forecast value have typically compared probability forecasts to each other or have been used to show that probability forecasts are more valuable than single-valued forecasts (Richardson 2000; Mylne 2002; Zhu et al. 2002). Here we are applying the concept of relative economic value to compare different single-valued forecasts. One study that applied this idea to different types of single-valued forecasts was by Buizza (2001). He showed, in the case of a synthetic ensemble of rainfall forecasts, that the ensemble mean was more valuable than the control ensemble member for most rainfall thresholds, but that the control was more valuable for very sensitive users at the highest rainfall threshold examined.

Following Buizza (2001) but using real forecast data, we take two forecast systems in use at the Australian Bureau of Meteorology that are examples of these two different approaches to providing single-valued forecasts. The gridded Operational Consensus Forecast (OCF) is a poor-man’s ensemble, taking bias-corrected forecasts from several NWP models as input to a weighted consensus (Engel and Ebert 2007, 2012) with statistical downscaling to around 5-km resolution based on a gridded analysis. It exemplifies the statistical averaging approach. The Australian Community Climate and Earth-System Simulator NWP model (Puri et al. 2013), run in its regional configuration (ACCESS-R) at around 12-km horizontal resolution, is an example of the single-scenario approach. The ACCESS-R output is one of the NWP inputs into OCF. In each case, the forecasts are for temperature at 0600 UTC (between 1400 and 1700 local time depending on time zone) compared to observations at sites, at a lead time of 36 h. The forecasts are extracted from data presented to forecasters in the Graphical Forecast Editor (GFE), for use in provision of forecaster-curated gridded forecast services. The OCF and ACCESS-R forecasts have been converted through bilinear interpolation to a 3- or 6-km grid (gridded forecast resolution varies by state across Australia) and have been adjusted further based on the difference in elevation between guidance and observation using a standard adiabatic lapse rate of 6.5°C km^{−1}. This is the standard processing that forecast guidance receives on ingestion into the GFE, and as the same processing has been applied to data from each forecast system, we do not expect that it will affect comparisons between the two. The forecasts are valid at 200 observation stations across the southern part of Australia, with forecasts taken from the Southern Hemisphere summer seasons (December–February) for 2015/16, 2016/17, 2017/18 and 2018/19.

### d. Display of comparison

In Fig. 1a we consider a decision threshold of 30°C and show relative economic value curves as presented by Richardson (2000) and others, plotting relative economic value as a function of cost–loss ratio for each forecast system. The resulting curves show that for a decision threshold of 30°C, the consensus forecasts (OCF) are more valuable irrespective of user sensitivity to the event. By contrast, Fig. 1b for a decision threshold of 35°C shows that more sensitive users with cost–loss ratios up to 0.33 would have been better off using the deterministic NWP forecasts (ACCESS-R). Less sensitive users with higher cost–loss ratios would have been better off using the OCF forecasts. In both examples, the most sensitive and insensitive users at the extremes of cost–loss ratio would have been better off basing their decisions on climatology.

To obtain a more general idea of the usefulness of the single-valued forecasts, we proceed to look at comparisons for decisions based on whether the temperature forecasts exceed a threshold *X*, across a range of values of *X* and cost–loss ratios *α* between 0 and 1. Figure 2 shows which is the most valuable forecast system across these ranges, with the changeover point being given by *α*_{equal} from (5). The gray shaded areas show where neither forecast system is superior to a climatologically based decision, using *α*_{low} and *α*_{high} from (6) and (7). The hit rate and false alarm rate for each forecast system are also shown. The forecast system with better (higher) hit rate is OCF at decision thresholds below 33°C while ACCESS-R has better hit rate above. ACCESS-R has the better (lower) false alarm rate below 26°C and OCF is better above. Therefore OCF is more valuable for all user sensitivities, with higher hit rate and lower false alarm rate, for decision thresholds between 26° and 33°C. Figure 1a demonstrates this for a cross section of Fig. 2 at threshold 30°C. For thresholds exceeding 33°C, the deterministic NWP model proves to be the more valuable for more sensitive users while the consensus forecast is the more valuable for less sensitive users. Figure 1b corresponds to a cross section of Fig. 2 at threshold 35°C that lies in this regime. For thresholds beneath 26°C, the opposite result holds.
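A map in the style of Fig. 2 can be built by sweeping decision thresholds and cost–loss ratios and labeling, at each point, whichever forecast set gives the higher value, or "climate" where neither beats a climatological decision. The sketch below is self-contained (NumPy assumed); the synthetic data in the usage example and all names are ours, not the paper's.

```python
import numpy as np

def rates(fcst, obs, thr):
    """Base rate, hit rate, and false alarm rate for the event obs > thr."""
    fe, oe = fcst > thr, obs > thr
    h, m = np.mean(fe & oe), np.mean(~fe & oe)
    f, q = np.mean(fe & ~oe), np.mean(~fe & ~oe)
    s = h + m
    return s, h / s, f / (f + q)   # assumes 0 < s < 1 at this threshold

def value(s, H, F, alpha):
    """Relative economic value from s, H, F (per unit loss, L = 1)."""
    e_f = (s * H + (1 - s) * F) * alpha + s * (1 - H)
    return (min(alpha, s) - e_f) / (min(alpha, s) - s * alpha)

def best_system(fA, fB, obs, thresholds, alphas):
    """Label which forecast set is more valuable at each (threshold, alpha);
    'climate' where neither beats a climatology-based decision."""
    grid = np.empty((len(thresholds), len(alphas)), dtype=object)
    for i, thr in enumerate(thresholds):
        sA, HA, FA = rates(fA, obs, thr)
        sB, HB, FB = rates(fB, obs, thr)
        for j, a in enumerate(alphas):
            vA, vB = value(sA, HA, FA, a), value(sB, HB, FB, a)
            grid[i, j] = "climate" if max(vA, vB) <= 0 else ("A" if vA >= vB else "B")
    return grid
```

With two unbiased synthetic systems differing only in random error spread, the resulting grid should show the lower-spread system (or climatology) everywhere, consistent with section 4a.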

## 3. Effect of error characteristics on value

### a. Linear error model

The linear error model expresses the forecasts as

*y*_{i} = *λx*_{i} + *δ* + *ε*_{i}, (8)

where *y*_{i} is the *i*th prediction by the forecast system, *x*_{i} is the corresponding observation, *λ* and *δ* are constants, and *ε*_{i} is a random error drawn from a distribution with standard deviation *σ* and mean 0.

If the observations have mean *x̄* and the forecast system has unconditional bias *β* = (*λ* − 1)*x̄* + *δ* when averaged over all the events *i*, then (8) can be expressed as

*y*_{i} = *x̄* + *λ*(*x*_{i} − *x̄*) + *β* + *ε*_{i}. (9)

We refer to *β* as the "bias" and *λ* as the "scale."
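The error model parameters can be estimated by ordinary least squares: *λ* is the regression slope of forecasts on observations, *β* is the mean forecast error, and *σ* is the residual spread. A minimal sketch assuming NumPy; the paper does not specify its fitting procedure beyond least squares, so details such as the degrees-of-freedom correction are our choices.

```python
import numpy as np

def fit_error_model(fcst, obs):
    """Least-squares fit of the linear error model
    y_i = xbar + lam*(x_i - xbar) + beta + eps_i,  eps ~ (mean 0, sd sigma).
    Returns (lam, beta, sigma)."""
    x, y = np.asarray(obs, float), np.asarray(fcst, float)
    xbar = x.mean()
    xc = x - xbar
    lam = np.sum(xc * (y - y.mean())) / np.sum(xc * xc)  # regression slope
    beta = y.mean() - xbar                               # unconditional bias
    resid = y - (xbar + lam * xc + beta)                 # random error component
    sigma = resid.std(ddof=2)                            # two fitted parameters
    return lam, beta, sigma
```

Applied to synthetic forecasts generated from known parameters, the fit should recover those parameters to within sampling error.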

### b. Error characteristics of different types of forecast systems

We consider how the characteristics of a consensus forecast system compare with those of an individual NWP model within the linear error model framework. Say a consensus forecast system, with error model parameters *λ*_{C}, *β*_{C}, and *σ*_{C}, comprises the mean of forecasts from *n* individual NWP models, the *j*th of which has its own error model parameters *λ*_{j}, *β*_{j}, and *σ*_{j}.

The consensus mean of the *n* models will have lower spread of random errors, relative to the constituent models, due to the factor 1/√*n*: if the random errors of the constituent models are independent of one another, then

*σ*_{C} = (1/*n*)√(*σ*_{1}² + ⋯ + *σ*_{n}²), (10)

which reduces to *σ*_{j}/√*n* when all models have equal random error spread. In practice the random errors of the constituent models are correlated with one another, which increases *σ*_{C}. This is one reason why having more constituent models does not always improve performance of the consensus, as observed by Arribas et al. (2005).

Due to the linear form of the error model, the other parameters *λ*_{C} and *β*_{C} will simply be the mean of the corresponding parameters for the constituent models. The OCF consensus forecast system includes a removal of recently observed unconditional bias, which generally leads to forecast bias *β*_{C} ≈ 0. The scale parameter *λ* can be thought of as representing biases that are conditional on the observed value. This can be seen from the form of (9) where for *λ* closer to 1 the forecasts will follow the observations and for *λ* closer to 0 they will follow the mean observation and be insufficiently extreme. Conditional biases are not removed in the OCF system. Therefore the best performing NWP model in the consensus will be less conditionally biased (*λ* closer to 1) than the consensus that has the mean *λ*.
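These relations (λ_C and β_C equal to the means of the constituent parameters, and σ_C = √(Σσ_j²)/n for independent random errors) can be checked empirically with synthetic data. The three constituent parameter sets below are illustrative values of our own, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
xbar = 24.2
x = rng.normal(xbar, 6.7, n)                        # synthetic observations

# three hypothetical constituent models: (lam, beta, sigma), illustrative only
params = [(0.90, 0.3, 2.5), (0.95, -0.2, 2.2), (0.85, 0.1, 2.8)]
models = [xbar + lam * (x - xbar) + beta + rng.normal(0.0, sig, n)
          for lam, beta, sig in params]
consensus = np.mean(models, axis=0)                 # equal-weight consensus mean

# empirical consensus parameters via least squares
xc = x - x.mean()
lam_c = np.sum(xc * (consensus - consensus.mean())) / np.sum(xc * xc)
beta_c = consensus.mean() - x.mean()
sigma_c = (consensus - (x.mean() + lam_c * xc + beta_c)).std()

# with independent random errors, sigma_C = sqrt(sum sigma_j^2) / n
sigma_pred = np.sqrt(sum(s**2 for _, _, s in params)) / len(params)
```

Here the fitted λ_C and β_C come out near the constituent means (0.90 and 0.067) and σ_C near the predicted value, well below the smallest constituent spread.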

### c. Error characteristics of two forecast systems

Returning to the example used in section 2c, we use least squares fitting to estimate the error model parameters for the OCF and ACCESS-R forecast systems. The fitting was done separately for each observation site, to give an idea of the amount of variation in the results for different locations.

Figure 3a compares random error standard deviation for OCF and ACCESS-R, and shows that for most locations the error spread is smallest for the consensus system.

Figure 3b compares the scale parameter. It tends to be less than one, showing that both forecast systems tend to underforecast the extremes when verified at a point in time and space. However, the NWP system has the scale parameter nearer to one for the majority of locations, showing that it has less conditional bias.

The unconditional bias is shown in Fig. 4, and is more clustered around zero for the consensus system, reflecting the fact that a bias correction has been applied. There is still some scatter of bias for OCF, which may be due to the fact that the bias correction is done relative to a gridded analysis (Engel and Ebert 2007) while we are comparing to site observations.

The results for the parameters, from the fits for the 200 stations, are summarized in the means for each forecast system given in Table 2. These quantify the lower random error spread *σ* for OCF and better conditional error characteristics *λ* for ACCESS-R. The mean unconditional biases for both forecast systems are small, and in fact the mean result for ACCESS-R is slightly better than that for OCF in this example.

Table 2. Linear error model parameters fit to 36-h OCF and ACCESS-R forecasts of 0600 UTC temperature for summer 2015/16, 2016/17, 2017/18, and 2018/19, with the mean taken across 200 observation sites for southern Australia.

## 4. Analytic results

To enable us to derive theoretical relationships between the error characteristics of two forecast systems A and B, and the difference in relative economic value between A and B, we make the simplifying assumptions that they can be adequately described by a linear error model with random errors drawn from a Gaussian distribution. Tian et al. (2016) note that real forecast errors may not be well described by the linear error model. For instance, the simple linear model of forecast error does not take into account the possibility that error spread might be dependent on extremeness of the observations. Furthermore, the residual errors may not be well represented by samples from a Gaussian distribution.

While we acknowledge these caveats, these simplifications allow us to gain useful insights from relationships derived for three cases described in sections 4a, 4b, and 4c.

### a. Forecasts with no conditional or unconditional biases

In the case where both forecast systems have no unconditional or conditional biases, so *β*_{A} = *β*_{B} = 0 and *λ*_{A} = *λ*_{B} = 1, then if *σ*_{A} < *σ*_{B} it follows (see derivation in the appendix) that system A always has higher hit rate and lower false alarm rate than system B. As discussed in section 2b, this means that A has higher relative economic value than B. This holds for any cost–loss ratio and any decision threshold, so all users are better off making decisions based on forecasts from system A than on forecasts from system B (though as discussed in section 2b, they may still be better off making decisions based on climatology if particularly sensitive or insensitive to the event).

### b. Forecasts with conditional biases only

There are straightforward techniques to bias-correct forecasts, for instance by removing recent unconditional bias and assuming that the past bias of the forecast system will be a good indicator of the future bias. Thus, it is useful to consider forecast systems that have only conditional biases (*λ*_{A}, *λ*_{B} ≠ 1), with the unconditional biases having been removed (*β*_{A} = *β*_{B} = 0).

In this case it is shown in the appendix that, at a decision threshold equal to the mean of the observations, the forecast system with the smaller value of *σ*/*λ* has the higher hit rate and lower false alarm rate: system A is more valuable than system B for all users when

(*σ*_{A}/*λ*_{A}) < (*σ*_{B}/*λ*_{B}). (11)

This result can be understood as follows. If *λ* = 0, the forecasts are centered around the mean of the observations, and as *λ* increases toward 1, the forecasts are progressively closer to being centered on the observed value and thus more clearly separated from the forecasts made for cases on the other side of the threshold. Dividing the random error spread *σ* by the scale *λ* therefore gives an effective spread of forecast errors about the observed value, and the system with the smaller *σ*/*λ* discriminates the event more sharply.

As the decision threshold moves to more extreme values above (below) the mean, the result in this case no longer holds true. The forecast system that has *λ* closer to 1, being less conditionally biased, will forecast the event more (less) frequently and at some point will have more (fewer) of both hits and false alarms than the other forecast system. Then there will no longer be one forecast system that is most valuable for all users, but instead the more valuable system will depend on cost–loss ratio following (5). We will explore this behavior in more detail in section 5.

### c. Forecasts with unconditional biases

When the unconditional biases are nonzero, the threshold at which one forecast system is more valuable for all users moves away from the mean of the observations. As shown in the appendix, provided (*σ*_{A}/*λ*_{A}) < (*σ*_{B}/*λ*_{B}), system A has both the higher hit rate and the lower false alarm rate, and hence is more valuable for all users, at the decision threshold *X*:

*X* = *x̄* + (*β*_{A}/*σ*_{A} − *β*_{B}/*σ*_{B})/[(1 − *λ*_{A})/*σ*_{A} − (1 − *λ*_{B})/*σ*_{B}]. (12)

The more *β*_{A} and *β*_{B} differ, the more the threshold tends to move away from the observed mean. For instance, in the simple case where system A has zero unconditional bias, system B has negative unconditional bias, and they have the same conditional biases *λ* < 1, then the denominator of (12) is positive due to (11), which leads to *X* > *x̄*: the threshold at which system A is unequivocally the more valuable lies above the observed mean.

## 5. Synthetic results

To gain further insights into the relationship between the error characteristics of forecast systems and their comparative value, we generate series of synthetic observations and use the error model described earlier to create series of synthetic forecasts given these observations. The number of observations generated is large (5 million) so that the forecast performance is adequately sampled for the most extreme observations.

The relationships described in section 4 and the appendix are independent of the form of the distribution of the observations. For simplicity, we have drawn the synthetic observations from a Gaussian distribution. The synthetic observations in sections 5a, 5b, 5c, and 5e below are generated with mean value (24.2°C) and standard deviation (6.7°C) matching those of the real observations in the example in section 2c. In section 5d the standard deviation of the observations is varied to explore how this affects the forecast value comparisons.

### a. Forecasts with different random error spreads

We compare two hypothetical forecast systems A and B, neither of which has any biases (*β*_{A} = *β*_{B} = 0 and *λ*_{A} = *λ*_{B} = 1). System A has a random error standard deviation *σ*_{A} = 1.96°C while system B has *σ*_{B} = 2.39°C, matching the values for OCF and ACCESS-R, respectively, as given in Table 2. Figure 5 shows the more valuable of the two systems, in the same format as Fig. 2. As expected from section 4a, the system with the smaller spread of random errors has higher hit rate and lower false alarm rate and thus is more valuable for all users at all thresholds.
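This experiment is straightforward to reproduce in outline. The sketch below uses a smaller sample than the paper's 5 million and our own variable names; it generates unbiased synthetic forecasts with the two random error spreads and compares hit and false alarm rates across several decision thresholds.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000                         # large sample (the paper uses 5 million)
x = rng.normal(24.2, 6.7, n)        # synthetic observations
yA = x + rng.normal(0.0, 1.96, n)   # system A: unbiased, smaller random spread
yB = x + rng.normal(0.0, 2.39, n)   # system B: unbiased, larger random spread

def hit_false_alarm(y, x, thr):
    """Hit rate and false alarm rate for the event x > thr."""
    event, fc = x > thr, y > thr
    return np.mean(fc[event]), np.mean(fc[~event])

results = {}
for thr in (20.0, 24.2, 30.0, 35.0):
    results[thr] = hit_false_alarm(yA, x, thr) + hit_false_alarm(yB, x, thr)
```

At every threshold sampled, the smaller-spread system should show the higher hit rate and the lower false alarm rate, matching the analytic result of section 4a.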

### b. Forecasts with different conditional biases

Now we give our two hypothetical forecast systems A and B an equal random error spread (*σ*_{A} = *σ*_{B} = 2.18°C, in between the OCF and ACCESS-R values) and no unconditional biases (*β*_{A} = *β*_{B} = 0), but system A has scale *λ*_{A} = 0.911 while system B has scale *λ*_{B} = 0.943, again as per the values for OCF and ACCESS-R in Table 2. This will introduce conditional biases such that the forecasts will tend to be less extremely high or low compared to observations, more so in the case of A than B. As can be seen in Fig. 6, the result of the greater underforecasting of extremes by A is that the hit rate (false alarm rate) becomes worse more rapidly as event thresholds rise above (fall below) the mean.

As expected from (11), system B provides more valuable forecasts for all users when the decision threshold equals the mean of 24.2°C. However, users who are more insensitive (sensitive) to an extremely high (low) event threshold will obtain more value from system A.

### c. Forecasts with different biases and random error spreads

We make synthetic versions of the OCF and ACCESS-R forecasts by using all the error model parameters from Table 2. Because *σ*/*λ* is lower for OCF than ACCESS-R, it follows that the synthetic OCF forecasts are more valuable than synthetic ACCESS-R for all user sensitivities at the event threshold 28.9°C derived from (12). This can be seen in Fig. 7. However, as in section 5b, as one moves to higher thresholds, there comes a point where the system with larger conditional biases (synthetic OCF) ends up with a poorer hit rate and better false alarm rate than the other system due to underforecasting of the event. Beyond that point, (5) means that sensitive users (*α* < *α*_{equal}) would benefit more from the other forecast system (synthetic ACCESS-R) due to its better hit rate, despite its larger spread of random errors. Similarly, as one moves to lower thresholds, the overforecasting of the event by the system with larger conditional biases (synthetic OCF) leads to the other forecast system (synthetic ACCESS-R) being better for insensitive users (*α* > *α*_{equal}) due to its better false alarm rate.

By comparing Fig. 2 for the real forecasts and observations to Fig. 7 for the synthetic data, one can see similar features that suggest that, despite the many simplifications underpinning the theoretical results and synthetic calculations, they can provide insights into real data.

We have looked at a range of examples (not shown) for other parts of Australia, other times of day, other forecast systems (NWP forecasts from the European Centre for Medium-Range Weather Forecasts and manually produced forecasts) and other parameters (daily maximum and minimum temperature, dewpoint and wind speed). For temperatures, we find that results for real data match the theoretical expectations when the forecast systems being compared have well separated values of *σ*/*λ*. In other examples where *σ*/*λ* are close together, the theoretical results are not necessarily borne out by the real data, which is not surprising given the many assumptions that have been made. Also, there can be cases where the event threshold from (12) is extreme and therefore is not relevant to realistic ranges of forecast values and does not have well-sampled data around the threshold, for instance in cases where *σ*_{A} ≈ *σ*_{B} and *λ*_{A} ≈ *λ*_{B} but *β*_{A} ≠ *β*_{B}. Wind and dewpoint are not so well represented by the linear error model, and do not conform to the theoretical expectations.

### d. Dependence on observational spread

To show the effect of observational spread on the results, we repeat section 5b with the standard deviation of the synthetic observations being decreased and increased by 20% (Figs. 8 and 9, respectively). From this we see that when the observational scatter is smaller, the climatologically based decision is best for a larger range of user sensitivities (larger gray area on Fig. 8 compared with Fig. 9). In other words, if the weather doesn’t vary much, for instance for temperature in the tropics, a forecast is unlikely to help a user make a decision unless it is particularly accurate.

The range of user sensitivities for which system A, the more conditionally biased forecast system, is better than system B is also wider when the observational scatter is smaller. This is shown by a smaller tan area at extreme decision thresholds on Fig. 8 compared with Fig. 9. System A tends to be closer to climatology than to the observed temperature in the extremes, compared with system B. For a given high extreme, system A will have lower false alarm rates and hit rates. As the observational scatter is reduced, false alarms become increasingly prevalent relative to hits, and the false alarm term in *α*_{equal} (5) dominates, leading to a lower cost–loss ratio at which system B becomes the more valuable. Conversely, at low extremes the hit terms in (5) dominate, leading to a higher cost–loss ratio at which system B becomes the more valuable.

### e. Sensitivity to form of forecast error distribution

The derivations of the theoretical relationships in section 4 assume that the distribution of random forecast errors is Gaussian. To test how sensitive the results are to this assumption, we replace the Gaussian with the skew normal distribution (Azzalini 1985) that corresponds to a Gaussian distribution when its shape parameter is zero, while giving a skewed distribution with other choices of shape parameter.
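The skewness of a skew normal distribution with shape parameter *a* follows from its standard moment formulas, and zero-mean skewed errors can be sampled without special libraries via the representation *δ*|*Z*₀| + √(1 − *δ*²)*Z*₁ of a skew normal variate. The helper names below are ours; the paper does not describe its implementation.

```python
import math
import numpy as np

def skewnorm_skewness(a):
    """Moment skewness of the skew normal distribution with shape parameter a."""
    d = a / math.sqrt(1.0 + a * a)                    # delta
    num = 0.5 * (4.0 - math.pi) * (d * math.sqrt(2.0 / math.pi)) ** 3
    den = (1.0 - 2.0 * d * d / math.pi) ** 1.5
    return num / den

def skewnorm_errors(a, sigma, n, rng):
    """Zero-mean random errors with sd sigma and skew-normal shape a, using
    the representation d*|Z0| + sqrt(1 - d^2)*Z1 (Azzalini 1985)."""
    d = a / np.sqrt(1.0 + a * a)
    z = d * np.abs(rng.normal(size=n)) + np.sqrt(1.0 - d * d) * rng.normal(size=n)
    z -= d * np.sqrt(2.0 / np.pi)                     # remove the mean
    return sigma * z / np.sqrt(1.0 - 2.0 * d * d / np.pi)  # rescale to sd sigma
```

Shape parameter 0 gives zero skewness (the Gaussian case), shape 1 gives skewness ≈ 0.137, and shape 2 gives ≈ 0.454, close to the values quoted below.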

For example, Fig. 10 is a comparison of synthetic versions of OCF and ACCESS-R forecasts with the same parameters as in section 5c but with skewed random error distributions having shape parameter 1 for OCF (skewness 0.14) and shape parameter 2 for ACCESS-R (skewness 0.46). The skewness of the random errors of the actual OCF and ACCESS-R forecasts is 0.16 and 0.19, respectively. When synthetic forecasts are generated with random error distributions skewed by these amounts (not shown), the result is hard to distinguish from Fig. 7, so we have increased the difference in skewness for effect. It can be seen by comparing Fig. 7 and Fig. 10 that the area where synthetic ACCESS-R is most valuable decreases for high decision thresholds and increases for low thresholds. The skew moves the median of the random errors to the left of the mean, and the more skewed the distribution, the more hit rates reduce, for decision thresholds above the observational mean, and the more false alarm rates reduce, for decision thresholds below the observational mean.

To show that the analytic relationships of section 4 are useful for a range of distribution shapes beyond Gaussian, we have repeated the scenarios from sections 5a to 5d for the same choices of *σ*, *λ*, and *β* but with a variety of different shape parameter combinations for forecast systems A and B. The results remain consistent with the analytic relationships in all cases up until A reaches shape parameter −5 (skewness −0.85) and B reaches shape parameter 5 (skewness 0.85), when the analytic result for forecasts with no conditional or unconditional biases (section 4a) is no longer obeyed for all decision thresholds.

## 6. Discussion and conclusions

We have applied a user-oriented approach to compare the worth of single-valued temperature forecasts from two different forecast systems and have presented a new graphical depiction showing which system has higher relative economic value as a function of user cost–loss ratio and decision threshold. Through use of a linear error model we have been able to gain insights into how the nature of the forecast systems being compared affects which forecast system is more valuable.

Many assumptions and simplifications have been made along the way. Decisions made in the real world will not necessarily be well represented by a simple cost–loss proposition or a single decision threshold, and various more complex approaches have been suggested (Shorr 1966; Matte et al. 2017; Roulston and Smith 2004). There are other properties of the forecast that may affect user decision-making, for instance forecast stability (Griffiths et al. 2019), which is not considered in this framework. We have already noted limitations with the linear error model.

We have seen that synthetic forecasts generated using the linear error model can give qualitatively similar features to real forecasts when comparing relative economic value. This gives us reason to expect that the insights from a simple error model can also assist our understanding of how relative economic value compares for real-world forecasting systems.

The roles of unconditional and conditional bias can suggest strategies for increasing the usefulness of a forecast system. In the absence of biases, the spread of random errors is seen to have a direct relationship to forecast value for all user sensitivities. Therefore, averaging approaches to derive single-valued forecasts from an ensemble of independent forecasts are beneficial as they contribute to this aspect of forecast value. However, conditional biases degrade the value of forecasts in extreme conditions. While emphasis has been placed on the more straightforward task of correcting unconditional biases in forecast systems (Woodcock and Engel 2005), it is also beneficial to reduce conditional biases to maximize the value of a forecast system.

The results give insight into when a particular single-valued forecast system is going to be the most useful. We have seen, via synthetic forecasts, that forecasts become more valuable relative to climatology the more variable the range of observations is. We have shown that there can be event thresholds for which one forecast system is unequivocally better than another for all user sensitivities. For forecast systems where unconditional bias has been removed, this occurs around the mean observed value. We have seen how decisions based on whether conditions will be below or above normal can be better provided by a bias-corrected consensus than a deterministic NWP output. Such a forecast system provides the best information for all decision-makers in routine conditions that are not far removed from the mean.

However, single-valued forecasts from one forecast system are not going to be optimal for all types of users. For sensitive (insensitive) users in extreme high (low) conditions, a single output from a skillful NWP model can be more valuable than a consensus which fails to resolve the extremes. This is consistent with the findings of Buizza (2001) with synthetic rainfall forecasts. This has implications in a context where forecasters can intervene in the forecast production process. If forecasters start with a forecast system that is optimal in routine conditions, and can understand and consistently reduce conditional biases for more extreme conditions, this will have the effect of widening the range of decision thresholds for which the forecast is best for all users.

The most complete solution to addressing different user sensitivities is to move to multiple forecasts that are tailored to different user sensitivities, or else to provide explicitly probabilistic forecasts, particularly for thresholds corresponding to extreme and potentially impactful events, which will not be optimally served by one single-valued forecast system. However, for as long as there is an ongoing demand for a general single-valued forecast service, there will remain a need to consider how to optimize the user benefits of single-valued forecasts. The methods explored in this paper provide one avenue for such consideration.

## Acknowledgments

The authors thank Beth Ebert, Deryn Griffiths, Ioanna Ioannou, and two anonymous reviewers for their helpful review comments.

## APPENDIX

### Derivation of Theoretical Relationships

Here we give the derivation of the relationships in section 4 for comparative value of forecast systems in some particular cases, under the linear error model as described in section 3. We make the assumption that the random errors are drawn from a Gaussian distribution.

Forecast value increases with the hit rate *H*_{A} and decreases with the false alarm rate *F*_{A}. Say the event of interest is an observation *x* exceeding some threshold *X*. If the forecast system is well described by the error model in (9), then the distribution of forecast values for a given observation *x*, over many cases, is centered on *λ*_{A}*x* + *β*_{A} with standard deviation *σ*_{A}. For a Gaussian distribution, the proportion of forecast values that exceed the event threshold *X* is given by

Φ[(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A}],   (A1)

where Φ is the cumulative distribution function of the standard Gaussian distribution. The hit rate is then

*H*_{A} = ∫_{X}^{∞} Φ[(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A}] *P*(*x*) d*x* / ∫_{X}^{∞} *P*(*x*) d*x*,   (A2)

and the false alarm rate is

*F*_{A} = ∫_{−∞}^{X} Φ[(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A}] *P*(*x*) d*x* / ∫_{−∞}^{X} *P*(*x*) d*x*,   (A3)

where *P*(*x*) is the probability of *x* being observed.
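
To make the construction concrete, the hit and false alarm rates implied by this error model can be evaluated by numerical integration. The sketch below is illustrative only: the choice of a standard normal observation distribution *P*(*x*), the function and parameter names (`hit_false_rates`, `lam`, `beta`, `sigma`), and the parameter values are assumptions for demonstration, not taken from the paper.

```python
import math

def phi(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hit_false_rates(lam, beta, sigma, X, n=20000, lo=-8.0, hi=8.0):
    """Hit rate and false alarm rate for the event x > X under the linear
    error model: forecasts for observation x are Gaussian, centered on
    lam * x + beta with standard deviation sigma.  The observed climate
    P(x) is taken as standard normal (an illustrative choice)."""
    dx = (hi - lo) / n
    num_h = den_h = num_f = den_f = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx                         # midpoint rule in x
        p = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        exceed = phi((lam * x + beta - X) / sigma)      # fraction of forecasts > X
        if x > X:                                       # event cases -> hit rate
            num_h += exceed * p * dx
            den_h += p * dx
        else:                                           # non-events -> false alarms
            num_f += exceed * p * dx
            den_f += p * dx
    return num_h / den_h, num_f / den_f

# An unbiased system with purely random errors (lam = 1, beta = 0):
H, F = hit_false_rates(lam=1.0, beta=0.0, sigma=0.5, X=1.0)
print(round(H, 3), round(F, 3))
```

Shrinking `sigma` raises the computed hit rate and lowers the false alarm rate, as the derivation below shows analytically for the random-error-only case.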

Forecast system A will be more valuable than forecast system B for all user cost–loss ratios if Δ*H* = *H*_{A} − *H*_{B} > 0 and Δ*F* = *F*_{A} − *F*_{B} < 0.

Δ*H* > 0 if (but not only if)

Φ[(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A}] > Φ[(*λ*_{B}*x* + *β*_{B} − *X*)/*σ*_{B}]   (A4)

for all *x* > *X* [since then the integrand in the expression for *H*_{A} includes a larger area than in the equivalent expression for *H*_{B}]. Because Φ is monotonically increasing, (A4) is equivalent to

(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A} > (*λ*_{B}*x* + *β*_{B} − *X*)/*σ*_{B}.   (A5)

Similarly, Δ*F* < 0 if (but not only if)

(*λ*_{A}*x* + *β*_{A} − *X*)/*σ*_{A} < (*λ*_{B}*x* + *β*_{B} − *X*)/*σ*_{B}   (A6)

for all *x* < *X* [since then the integrand in the expression for *F*_{A} includes a smaller area than in the equivalent expression for *F*_{B}].

Consider first the case where the forecast systems have random errors only: *β*_{A} = *β*_{B} = 0 and *λ*_{A} = *λ*_{B} = 1. Expression (A5) becomes

(*x* − *X*)/*σ*_{A} > (*x* − *X*)/*σ*_{B},   (A7)

which is satisfied for all *x* > *X* whenever *σ*_{A} < *σ*_{B}. Likewise, (A6) becomes

(*x* − *X*)/*σ*_{A} < (*x* − *X*)/*σ*_{B},   (A8)

which is satisfied for all *x* < *X* under the same condition. Hence, if *σ*_{A} < *σ*_{B}, forecast system A is more valuable than forecast system B for all users, regardless of the event threshold *X*.
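
The conclusion of this case, that the system with the smaller random-error spread is more valuable at any threshold, can be checked by direct simulation. The sketch below uses made-up parameter values (error spreads 0.4 and 0.8, a standard normal climate) purely for illustration; it is not code from the study.

```python
import random

random.seed(1)

def rates(obs, fcst, X):
    """Hit rate and false alarm rate for the event obs > X."""
    hits = sum(1 for o, f in zip(obs, fcst) if o > X and f > X)
    events = sum(1 for o in obs if o > X)
    false_alarms = sum(1 for o, f in zip(obs, fcst) if o <= X and f > X)
    non_events = len(obs) - events
    return hits / events, false_alarms / non_events

# Synthetic observations plus purely random forecast errors of two sizes.
n = 200_000
obs = [random.gauss(0.0, 1.0) for _ in range(n)]
fcst_a = [o + random.gauss(0.0, 0.4) for o in obs]   # system A: smaller spread
fcst_b = [o + random.gauss(0.0, 0.8) for o in obs]   # system B: larger spread

for X in (-1.0, 0.0, 1.5):                           # arbitrary thresholds
    H_a, F_a = rates(obs, fcst_a, X)
    H_b, F_b = rates(obs, fcst_b, X)
    print(X, H_a > H_b, F_a < F_b)                   # expect True, True at each X
```

At every threshold tried, the smaller-error system has the higher hit rate and the lower false alarm rate, so it is more valuable for all cost–loss ratios.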

Now consider *λ*_{A} and *λ*_{B} ≠ 1 and *β*_{A} and *β*_{B} ≠ 0. We can rearrange (A5) to obtain

*x* > [*X*(*σ*_{B} − *σ*_{A}) + *σ*_{A}*β*_{B} − *σ*_{B}*β*_{A}]/(*σ*_{B}*λ*_{A} − *σ*_{A}*λ*_{B})   (A9)

for *x* > *X*, where we take (*σ*_{A}/*λ*_{A}) < (*σ*_{B}/*λ*_{B}) so that the denominator of (A9) is positive. For Δ*H* > 0 to be true since *x* > *X*, we therefore require

*X*[*σ*_{B}(*λ*_{A} − 1) − *σ*_{A}(*λ*_{B} − 1)] ≥ *σ*_{A}*β*_{B} − *σ*_{B}*β*_{A}.   (A10)

Similarly, for Δ*F* < 0 to be true, we require

*X*[*σ*_{B}(*λ*_{A} − 1) − *σ*_{A}(*λ*_{B} − 1)] ≤ *σ*_{A}*β*_{B} − *σ*_{B}*β*_{A}.   (A11)

Therefore Δ*H* > 0 and Δ*F* < 0 both hold when (*σ*_{A}/*λ*_{A}) < (*σ*_{B}/*λ*_{B}) and *X* = (*σ*_{A}*β*_{B} − *σ*_{B}*β*_{A})/[*σ*_{B}(*λ*_{A} − 1) − *σ*_{A}(*λ*_{B} − 1)], which leads to the result in section 4c. The result in section 4b follows when *β*_{A} = *β*_{B} = 0. Unlike the random-error-only case, in these cases the value comparisons at other thresholds *X* depend on *P*(*x*) and may differ for users with different cost–loss ratios.
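
This algebra can be spot-checked numerically. In the sketch below (hypothetical parameter values chosen for illustration; `lhs` and `rhs` denote the two standardized quantities compared in (A5)), the crossover threshold at which the hit-rate and false-alarm-rate requirements are simultaneously satisfied is computed, and (A5) is verified above it with its reverse below it.

```python
# Hypothetical parameters for two forecast systems with conditional and
# unconditional biases (not values from the paper).
lam_a, beta_a, sig_a = 1.1, 0.3, 0.5
lam_b, beta_b, sig_b = 1.4, -0.2, 0.9

assert sig_a / lam_a < sig_b / lam_b   # keeps the denominator of (A9) positive

# Threshold at which both dominance requirements hold with equality:
X = (sig_a * beta_b - sig_b * beta_a) / (sig_b * (lam_a - 1) - sig_a * (lam_b - 1))

def lhs(x):
    """Standardized exceedance argument for system A, as in (A5)."""
    return (lam_a * x + beta_a - X) / sig_a

def rhs(x):
    """Standardized exceedance argument for system B, as in (A5)."""
    return (lam_b * x + beta_b - X) / sig_b

above = all(lhs(X + d) > rhs(X + d) for d in (0.01, 0.5, 2.0, 10.0))  # (A5), x > X
below = all(lhs(X - d) < rhs(X - d) for d in (0.01, 0.5, 2.0, 10.0))  # reverse, x < X
print(above, below)  # -> True True
```

At this particular threshold, system A (the system with the smaller *σ*/*λ*) dominates for all user sensitivities; at other thresholds the comparison depends on *P*(*x*), as noted above.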

In the case where *σ*_{A} = *σ*_{B} and *λ*_{A} = *λ*_{B} but *β*_{A} ≠ *β*_{B} (for instance, when comparing a forecast system without bias correction against the same forecasts with unconditional biases removed), (A5) and (A6) cannot both be satisfied, so we do not obtain a simple general condition under which one forecast system is better than the other for all user sensitivities.

## REFERENCES

Arribas, A., K. B. Robertson, and K. R. Mylne, 2005: Test of a poor man’s ensemble prediction system for short-range probability forecasting. *Mon. Wea. Rev.*, **133**, 1825–1839, https://doi.org/10.1175/MWR2911.1.

Azzalini, A., 1985: A class of distributions which includes the normal ones. *Scand. J. Stat.*, **12**, 171–178.

Brunk, H. D., 1965: *An Introduction to Mathematical Statistics*. 2nd ed. Blaisdell, 429 pp.

Buizza, R., 2001: Accuracy and potential economic value of categorical and probabilistic forecasts of discrete events. *Mon. Wea. Rev.*, **129**, 2329–2345, https://doi.org/10.1175/1520-0493(2001)129<2329:AAPEVO>2.0.CO;2.

Buizza, R., 2008: The value of probabilistic prediction. *Atmos. Sci. Lett.*, **9**, 36–42, https://doi.org/10.1002/asl.170.

Buizza, R., and M. Leutbecher, 2015: The forecast skill horizon. *Quart. J. Roy. Meteor. Soc.*, **141**, 3366–3382, https://doi.org/10.1002/qj.2619.

Engel, C., and E. Ebert, 2007: Performance of hourly Operational Consensus Forecasts (OCFs) in the Australian region. *Wea. Forecasting*, **22**, 1345–1359, https://doi.org/10.1175/2007WAF2006104.1.

Engel, C., and E. Ebert, 2012: Gridded operational consensus forecasts of 2-m temperature over Australia. *Wea. Forecasting*, **27**, 301–322, https://doi.org/10.1175/WAF-D-11-00069.1.

Griffiths, D., M. Foley, I. Ioannou, and T. Leeuwenburg, 2019: Flip-flop index: Quantifying revision stability for fixed-event forecasts. *Meteor. Appl.*, **26**, 30–35, https://doi.org/10.1002/met.1732.

Hemri, S., M. Scheuerer, F. Pappenberger, K. Bogner, and T. Haiden, 2014: Trends in the predictive performance of raw ensemble weather forecasts. *Geophys. Res. Lett.*, **41**, 9197–9205, https://doi.org/10.1002/2014GL062472.

Lazo, J. K., R. E. Morss, and J. L. Demuth, 2009: 300 billion served: Sources, perceptions, uses, and values of weather forecasts. *Bull. Amer. Meteor. Soc.*, **90**, 785–798, https://doi.org/10.1175/2008BAMS2604.1.

Marzban, C., 2012: Displaying economic value. *Wea. Forecasting*, **27**, 1604–1612, https://doi.org/10.1175/WAF-D-11-00138.1.

Matte, S., M. A. Boucher, V. Boucher, and T. C. Fortier Filion, 2017: Moving beyond the cost-loss ratio: Economic assessment of streamflow forecasts for a risk-averse decision maker. *Hydrol. Earth Syst. Sci.*, **21**, 2967–2986, https://doi.org/10.5194/hess-21-2967-2017.

Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. *Wea. Forecasting*, **8**, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.

Mylne, K. R., 2002: Decision-making from probability forecasts based on forecast value. *Meteor. Appl.*, **9**, 307–315, https://doi.org/10.1017/S1350482702003043.

Palmer, T. N., 2002: The economic value of ensemble forecasts as a tool for risk assessment: From days to decades. *Quart. J. Roy. Meteor. Soc.*, **128**, 747–774, https://doi.org/10.1256/0035900021643593.

Puri, K., and Coauthors, 2013: Implementation of the initial ACCESS numerical weather prediction system. *Aust. Meteor. Oceanogr. J.*, **63**, 265–284, https://doi.org/10.22499/2.6302.001.

Ramos, M. H., S. J. Van Andel, and F. Pappenberger, 2013: Do probabilistic forecasts lead to better decisions? *Hydrol. Earth Syst. Sci.*, **17**, 2219–2232, https://doi.org/10.5194/hess-17-2219-2013.

Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. *Quart. J. Roy. Meteor. Soc.*, **126**, 649–667, https://doi.org/10.1002/qj.49712656313.

Roebber, P. J., and L. F. Bosart, 1996: The complex relationship between forecast skill and forecast value: A real-world analysis. *Wea. Forecasting*, **11**, 544–559, https://doi.org/10.1175/1520-0434(1996)011<0544:TCRBFS>2.0.CO;2.

Roulston, M. S., and L. A. Smith, 2004: The boy who cried wolf revisited: The impact of false alarm intolerance on cost–loss scenarios. *Wea. Forecasting*, **19**, 391–397, https://doi.org/10.1175/1520-0434(2004)019<0391:TBWCWR>2.0.CO;2.

Shorr, B., 1966: The cost/loss utility ratio. *J. Appl. Meteor.*, **5**, 801–803, https://doi.org/10.1175/1520-0450(1966)005<0801:TCUR>2.0.CO;2.

Tian, Y., G. S. Nearing, C. D. Peters-Lidard, K. W. Harrison, and L. Tang, 2016: Performance metrics, error modeling, and uncertainty quantification. *Mon. Wea. Rev.*, **144**, 607–613, https://doi.org/10.1175/MWR-D-15-0087.1.

Vannitsem, S., D. S. Wilks, and J. Messner, 2018: *Statistical Postprocessing of Ensemble Forecasts*. Elsevier Science, 362 pp.

Verkade, J. S., and M. G. Werner, 2011: Estimating the benefits of single value and probability forecasting for flood warning. *Hydrol. Earth Syst. Sci.*, **15**, 3751–3765, https://doi.org/10.5194/hess-15-3751-2011.

Woodcock, F., and C. Engel, 2005: Operational consensus forecasts. *Wea. Forecasting*, **20**, 101–111, https://doi.org/10.1175/WAF-831.1.

Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. *Bull. Amer. Meteor. Soc.*, **83**, 73–83, https://doi.org/10.1175/1520-0477(2002)083<0073:TEVOEB>2.3.CO;2.