Some Remarks on the Reliability of Categorical Probability Forecasts

Jochen Bröcker Max-Planck-Institut für Physik Komplexer Systeme, Dresden, Germany


Abstract

Studies on forecast evaluation often rely on estimating limiting observed frequencies conditioned on specific forecast probabilities (the reliability diagram or calibration function). Obviously, statistical estimates of the calibration function are based on only limited amounts of data and therefore contain residual errors. Although errors and variations of calibration function estimates have been studied previously, they are often either assumed to be small or unimportant, or they are ignored altogether. It is demonstrated how these errors can be described in terms of bias and variance, two concepts well known in the statistics literature. Bias and variance adversely affect estimates of the reliability and sharpness terms of the Brier score, recalibration of forecasts, and the assessment of forecast reliability through reliability diagram plots. Ways to communicate and appreciate these errors are presented. It is argued that these errors can become quite substantial if individual sample points have too large an influence on the estimate, which can be avoided by using regularization techniques. As an illustration, it is discussed how to choose an appropriate bin size in the binning and counting method, and an appropriate bandwidth parameter for kernel estimates.

Corresponding author address: Jochen Bröcker, Max-Planck-Institut für Physik Komplexer Systeme, Nöthnitzer Strasse 34, 01187 Dresden, Germany. Email: Broecker@pks.mpg.de


1. Reliability

Assume the objective is to forecast whether a real-world event will or will not occur, for example whether the temperature in Dresden, Germany (or rather the temperature as measured by a specific thermometer in Dresden), at noontime on a given day n falls below 0°C. We define the variable Y, referred to as the verification, to be 1 if that event actually happens and 0 if it does not. As forecasters, we may or may not have some information available that we can employ to build our forecasts. As a probabilistic forecast for Y, we denote any function ρ that maps our information data onto a number between 0 and 1, requiring no further properties so far. Suppose, for example, that we have access to ensemble temperature forecasts for Dresden, produced by a numerical weather prediction system. These ensemble forecasts constitute the aforementioned information data, while a possible choice for the function ρ could be the fraction of ensemble members exhibiting temperatures below 0°C. Another example would be to take a deterministic temperature forecast x for Dresden and use logistic regression to obtain a forecast ρ (see Tippett et al. 2007; Wilks 2006a; Hamill et al. 2004, and references therein for various alternatives). The details of the employed information data play no role in the discussion of this paper, whence we shall have no need to distinguish between the information data and ρ itself.
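
The following minimal sketch illustrates the two constructions just mentioned (ensemble fraction and logistic regression). The data arrays are synthetic placeholders, not the Dresden data referred to above; the use of scikit-learn is an assumption of this sketch.

```python
# Two ways to turn forecast information into a probabilistic forecast rho for the
# event "temperature below 0 degC". All data here are fake and for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
ensemble = rng.normal(loc=1.0, scale=3.0, size=(500, 50))   # hypothetical ensemble temperatures (degC)
det_fcst = ensemble.mean(axis=1)                            # hypothetical deterministic forecast
y = (rng.normal(det_fcst, 2.0) < 0.0).astype(int)           # hypothetical verification Y

# (i) fraction of ensemble members below 0 degC
rho_ens = (ensemble < 0.0).mean(axis=1)

# (ii) logistic regression on the deterministic forecast
logreg = LogisticRegression().fit(det_fcst.reshape(-1, 1), y)
rho_logreg = logreg.predict_proba(det_fcst.reshape(-1, 1))[:, 1]
```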

We have carefully chosen the term probabilistic forecast rather than probability forecast, as we do not a priori assume our forecast to coincide with any observed frequencies. To be a probability in any objective sense though, a forecast should be reliable,1 which means being in agreement with observed frequencies in the following sense (see also Toth et al. 2003; Wilks 2006b): Whenever the probabilistic forecast ρ for the event Y = 1 falls into a small interval [r, r + Δr], the event Y = 1 should in fact occur with a relative frequency close to r, or more precisely, with a relative frequency equal to the average of all ρ in the interval [r, r + Δr]. To introduce a proper mathematical formulation of reliability, we will assume from now on and throughout the remainder of this paper that both Y and ρ are random variables. In terms of probability theory, reliability means that conditioned on the event ρ = r, the probability (as a limiting observed frequency) of Y = 1 is actually r, or as a formula,
P(Y = 1 | ρ = r) = r.   (1)
This follows from the definition of conditional probabilities. Throughout the paper, ρ will refer to the forecast as a random variable, while r denotes a realization of ρ. From Eq. (1), we see that a reliable forecast ρ can be written as a conditional probability, namely the conditional probability of Y given ρ itself (Murphy and Winkler 1987). The conditional probability on the left-hand side of Eq. (1), seen as a function of ρ, will be referred to as the calibration function2 from now on. We will furthermore adopt κ(r) as the notation for the calibration function. In the forecasting of binary events, the calibration function is an extremely important concept both theoretically and in applications. This is for the following reasons (which are not independent from each other). First, the calibration function would enable the forecaster to issue calibrated forecasts. In fact, it is easily seen that issuing κ(ρ) as a forecast instead of ρ itself would result in calibrated forecasts. Second, the forecasts κ(ρ) would not just be calibrated, but actually more skillful than the forecast ρ itself (according to a large variety of reasonable measures of forecast skill). In fact, as we will discuss in section 3b, κ(ρ) features optimum skill among all forecasts of the form f (ρ) for arbitrary functions f. Roughly speaking, this means that applying the function κ is the best we can do in terms of postprocessing the forecast ρ. Third, the calibration function κ is an extremely useful diagnostic, allowing for a detailed analysis of the forecast system. In this sense, the calibration function can help to identify possible weaknesses in the process that generates ρ. The practical problem here is that the calibration function is usually not available. In fact, if it were, there would no longer be any excuse for not issuing fully calibrated forecasts. Algorithms for estimating the calibration functions from data have been proposed, and in the course of this paper, several improvements will be suggested. But obviously, any estimate of the calibration function will always contain residual errors. These errors will have adverse effects on subsequent applications of the calibration function estimate. Although variations and errors in calibration function estimates have been studied previously, we feel that a detailed analysis of how, for example, the reliability term of the Brier score or the skill of a calibrated forecast is affected, is lacking. The aim of this paper is to study these adverse effects, and to give recommendations as to how they should be communicated and dealt with, and, if possible, how they can be reduced. In particular, it will be argued that, although the calibration function in theory would be the cure-all for a variety of problems, this is not true for estimators of it, and different applications of the calibration function require different techniques to estimate it.

In section 2, the basic facts necessary for the argumentation in this paper will be outlined. There it is discussed in what sense the true calibration function and its estimates can actually differ. Two types of errors are introduced, leading to the (well-known) concepts of bias and variance. In section 3, three important applications of the calibration function will be revisited: the reliability diagram, the reliability term of the Brier score, and recalibration. Using arguments from section 2, the effects of errors in the calibration function estimates on these applications are studied. Section 4 revisits the problem of estimating the calibration function from a more theoretical perspective. It is argued that this constitutes an ill-posed problem, which calls for techniques allowing for regularization. We will outline how the popular “binning and counting” fits into this framework, and give recommendations as to how to choose a proper bin size. An example using temperature anomaly forecasts and verifications is presented in section 5. Here, the calibration function is estimated using kernel density estimators. It is demonstrated how the presented methodology can be employed here to choose an appropriate bandwidth parameter. Section 6 concludes the paper.

2. Estimating the calibration function: General considerations

If the calibration function κ(r) is unknown, we are faced with an estimation problem. Estimating unknown aspects of the distribution of data is one of the main objectives of statistics. The estimation problem considered here though certainly belongs to the trickier ones, as there is not only a single parameter to be estimated, but an entire function. There are of course important cases where the forecast ρ assumes only a finite set of values, in which case κ(r) needs to be estimated for only a finite number of arguments, r. But unless the available data contain a large number of forecast instances for each possible value of ρ, estimating κ(r) for a finite set of r’s is hardly any simpler than estimating κ(r) for continuous r. What is said in this paper generally applies to forecasts with discrete and continuous ranges of values alike, although the former case is given special attention if needed. However hard or simple the estimation of κ(r), estimation brings about errors, causing the calibration function estimate, which will be denoted by κ̂(r), to deviate from the true calibration function. These deviations can be of two types: systematic and random, or bias and variance, as they are often referred to. In this section, a proper definition of bias and variance is provided, after which the total error of κ̂(r) is studied in terms of bias and variance. To this end, some notational conventions have to be introduced first. The entire experiment under consideration can be expressed through the compound distribution function:
F(y, r) := P(Y = y, ρ < r),   y = 0, 1.
In words, F(1, r) is the probability that Y = 1 and ρ < r. All other probabilities can be expressed in terms of F. For example, by means of Bayes’ rule, the calibration function can be expressed using F and the marginal distribution
G(r) := P(ρ < r)
by
κ(r) = dF(1, r)/dG(r).
Since G(r) = F(1, r) + F(0, r), we obtain
κ(r) = dF(1, r) / [dF(0, r) + dF(1, r)].   (2)
To define bias and variance, we note first that κ̂(r) is a random variable, in contrast to κ(r), which is a fixed function. This is because κ̂(r) is estimated using a finite amount of random data, or more specifically, κ̂(r) is estimated employing a training set,
T = {(Yi, ρi); i = 1, . . . , N}
of forecast–verification pairs (Yi, ρi). The pairs (Yi, ρi) are assumed to be independent and to have the same distribution as (Y, ρ); that is, they have the distribution F, too. The size N of the training set T is assumed to be fixed. If necessary, the fact that the calibration function estimate depends on T will be indicated by writing κ̂T(r). Estimating the calibration function effectively means picking κ̂ from a class C of possible candidate functions, according to some rules. In other words, an estimator for the calibration function consists of a mapping that relates any T to a member κ̂T of the function class C. For an estimator of the calibration function, consider the average estimate given by
κ̄(r) := ET[κ̂T(r)],
where the expectation ET runs only over T, leaving r unaffected. The function κ̄(r) is not random. It is completely defined through the function class C, the distribution F of Y and ρ, and the size N of the training set. The systematic error or bias of the calibration curve estimate is defined as
b(r) := [κ̄(r) − κ(r)]²,
which is the squared difference between the average estimate and the true calibration function. The variance of the calibration curve estimate is defined as
υ(r) := ET[κ̂T(r) − κ̄(r)]²,
which is nothing but the variance of κ̂T(r) as a function of r. Both the variance and the bias are nonrandom functions of r. Roughly speaking, bias comes about because we have to assume a priori that the calibration function belongs to a certain class C, which in general does not contain the true calibration curve. But even if it does, there is no reason why the estimator κ̂T(r) necessarily coincides on average with κ(r) for all possible distributions F. Variance comes about because the calibration function estimate depends on the measured data, and different realizations of that data, although from the same source, will lead to different realizations of calibration function estimates κ̂T.
The concepts so far introduced are presumably illustrated best with a simple example. To fix the distributions of ρ and Y, assume that ρ has a uniform distribution on the unit interval. Suppose that the true calibration function is a given function κ(r). Given these specifications, it is straightforward to design a random experiment with output variables Y and ρ with the specified distribution and calibration function (Bröcker and Smith 2007). Furthermore, the joint distribution F(y, r) is easily calculated from this information, but we will not need it here. Now, we pretend not to know κ(r). To estimate the calibration function from the data, consider the following method, which is indeed very popular in weather forecasting (Toth et al. 2003; Murphy and Winkler 1977; Atger 2004; Bröcker and Smith 2007). First, bins B1, . . . , BK are defined by partitioning the unit interval into K equally long subintervals. Then, we calculate, using the training data T,
κ̂T(r) = #{i : Yi = 1, ρi ∈ Br} / #{i : ρi ∈ Br},   (3)
where for every r, Br is the bin that contains r and # denotes the number of elements. That is, we assign to every bin the number of forecasts in the bin that correspond to an event, divided by the total number of forecasts in the bin. This method will henceforth be referred to as binning and counting. For binning and counting, the function class C is given by all step functions that are constant on the given bins Bk. Later (in appendix B), it is demonstrated that the average estimate κ̄(r) is approximately given by
κ̄(r) ≅ (1/|Br|) ∫Br κ(s) ds,   (4)
where |Br| denotes the diameter of Br. In other words, the average estimate κ̄(r) turns out to be a coarse-grained version of κ(r), obtained by averaging κ(r) over the individual bins. It follows readily from Eq. (4) that κ̄(r) → κ(r) for |Br| → 0; that is, for decreasing bin size, the average estimate converges to the true calibration function, or in other words the bias goes to 0. The variance of κ̂T(r) is approximately given by
υ(r) ≅ κ̄(r)[1 − κ̄(r)] / (N|Br|),   (5)
as is also shown later (see appendix B). If we let |Br| → 0, the numerator stays finite, but the denominator vanishes, whence the variance diverges. The results presented so far indicate that there is a trade-off between the variance and bias, which for the binning and counting method is controlled by the bin size. Small bins reduce bias but increase variance, with large bins having the opposite effect. Strictly speaking, Eqs. (4) and (5) are valid only if the probability for the bin Br to be empty is very small. This obviously stops being true if the bin size |Br| is small. In this situation, it becomes necessary to decide how to deal with the possibility of empty bins. Therefore, Eqs. (4) and (5) only cover a certain range of the bias–variance trade-off.
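
A minimal sketch of the plain binning-and-counting estimate of Eq. (3) is given below, assuming forecasts in the unit interval, binary verifications, and K equally wide bins. How empty bins are treated (here they are left as NaN) is a choice left open in the text.

```python
# Binning and counting, Eq. (3): observed frequency of the event per forecast bin.
import numpy as np

def binning_and_counting(rho, y, K):
    """Return bin edges and the estimated calibration value for each of K bins."""
    rho = np.asarray(rho, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, K + 1)
    # np.digitize with the interior edges maps each forecast to a bin index 0..K-1
    idx = np.clip(np.digitize(rho, edges[1:-1]), 0, K - 1)
    kappa_hat = np.full(K, np.nan)
    for k in range(K):
        in_bin = idx == k
        if in_bin.any():
            kappa_hat[k] = y[in_bin].mean()   # events divided by forecasts in the bin
    return edges, kappa_hat
```

Shrinking the bins reduces the coarse-graining bias of Eq. (4) but inflates the variance of Eq. (5), which is the trade-off discussed above.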
Finally, consider the total error e(r) := ET[κ(r) − κ̂T(r)]² as a measure of the discrepancy between the true calibration function κ(r) and the estimate κ̂T(r). It is easy to see that
e(r) = b(r) + υ(r);   (6)
thus, the bias and variance together amount to the total error. The total error involves averaging over T and, thus, is a nonrandom function, like the bias and the variance. We have numerically investigated the bias and variance for a slightly more sophisticated binning and counting approach. The difference is that between the bins, the calibration function estimate is not constant but interpolated linearly. The abscissas of the nodes are chosen as
ri = [Σ{k: ρk ∈ Bi} ρk] / #{k : ρk ∈ Bi},   (7)
that is, as the mean value of the forecast probabilities in each bin. The ordinates of these nodes are obtained from Eq. (3).

It is common to choose the geometric midpoint of the bin Bi instead of the mean as in Eq. (7), but the choice presented here has an advantage, which will become clear in section 3a. Beyond the nodes corresponding to the two extreme bins, the calibration function estimate is extrapolated linearly, but values outside the unit interval are truncated. Figure 1 shows the true calibration function κ(r) (which we pretend not to know) as a black line. The shape of this calibration function might seem odd, as in practice, nonmonotonic calibration functions are rarely encountered. But since there is no general reason why calibration functions have to be monotonic, our example is by no means unrealistic. It was chosen here for illustrative purposes, which have something to do with the oscillations of the calibration function. However, the discussion in this paper is independent of whether the calibration function is monotonic or not.

Using κ(r) and a uniform distribution for the forecast probabilities ρ, an archive of 100 forecast–verification pairs was generated. Then, the calibration function was estimated from these forecast–verification pairs, using the aforementioned version of binning and counting with three bins. This experiment was repeated 10 000 times, with a new archive of forecast–verification pairs being generated every time. A few of the calibration function estimates are shown as gray lines in Fig. 1. From these 10 000 calibration function estimates, we computed κ̄(r) and υ(r), which are shown in Fig. 2. The average calibration function estimate κ̄(r) is represented as a piecewise linear graph, marked with white circles. The variance υ(r) is represented by the dashed line (plotted on a different scale, as indicated on the right-hand side of the figure). The squared discrepancy between κ̄(r) and κ(r) gives the bias. Figures 3 and 4 show exactly the same experiment, but this time the estimator uses 24 bins instead of just 3. Comparing Figs. 1 and 2 with Figs. 3 and 4, the already discussed effects of the bin size on the bias and variance are evident. Using few bins results in fairly similar calibration function estimates and hence in small variance. The price is bias, as there is a considerable discrepancy between the average calibration function estimate and the true calibration function (Fig. 2). The opposite effect is observed if the number of bins is large (Figs. 3 and 4). A possible way to plot the bias–variance trade-off for different bin choices is presented in Fig. 5. Bias and variance were averaged over r and plotted against each other, for various numbers of bins. The gray lines indicate loci of constant total error. The qualitative behavior of the bias and variance turns out to be as discussed above for the simpler binning and counting approach. For fewer than 24 bins, we obtain the interesting part of the bias–variance trade-off, featuring the mutually inverse behavior of the bias and variance, while for more than 24 bins, both quantities grow. As there were 100 instances to fill the bins, it is evident that this effect is due to an increasing number of empty bins.
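
The Monte Carlo experiment just described can be sketched as follows. The calibration function used here is an arbitrary illustrative (nonmonotonic) choice, not the one shown in Fig. 1, and the sketch uses the plain step-function variant of binning and counting rather than the interpolated variant used for the figures.

```python
# Monte Carlo estimate of bias b(r) and variance v(r) for binning and counting,
# under a uniform forecast distribution and a known, illustrative calibration function.
import numpy as np

def kappa(r):                      # illustrative nonmonotonic calibration function
    return 0.5 + 0.4 * np.sin(3 * np.pi * r)

def estimate(rho, y, K=3):         # plain binning and counting, Eq. (3)
    edges = np.linspace(0, 1, K + 1)
    idx = np.clip(np.digitize(rho, edges[1:-1]), 0, K - 1)
    vals = np.array([y[idx == k].mean() if (idx == k).any() else np.nan
                     for k in range(K)])
    return lambda r: vals[np.clip(np.digitize(r, edges[1:-1]), 0, K - 1)]

rng = np.random.default_rng(1)
grid = np.linspace(0.01, 0.99, 99)
N, reps = 100, 2000
curves = np.empty((reps, grid.size))
for j in range(reps):
    rho = rng.uniform(size=N)
    y = (rng.uniform(size=N) < kappa(rho)).astype(float)   # reliable by construction
    curves[j] = estimate(rho, y)(grid)

kappa_bar = np.nanmean(curves, axis=0)        # average estimate
bias = (kappa_bar - kappa(grid)) ** 2         # b(r)
variance = np.nanvar(curves, axis=0)          # v(r)
```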

At this point, the reader might well have some questions as to how the calculations presented so far bear on the problems this paper proposes to address. In particular,

  1. Is the trade-off between the bias and variance a general phenomenon, or does it occur only in the binning and counting approach?

  2. Is there a way to assess the bias–variance trade-off to determine, for example, a good bin size?

  3. How are the bias and variance related to the reliability diagram or the reliability term of the Brier score?

  4. How do the bias and variance affect recalibration?

The first question will be addressed in more detail in section 4, but the short answer is, it is a general phenomenon, for the simple reason that changing the model class (e.g., by changing the bin size in the binning and counting approach) generally affects both the bias and the variance. As seen in the example, it is of course not always true that decreasing the bias increases the variance, and vice versa. In some problems of statistics, the bias–variance trade-off is easy to get a handle on, for example if the problem allows for estimators without bias (e.g., the empirical mean as an estimator of the expected value). In section 4 and appendix A, it will be discussed that estimates of the calibration function are generally biased. This also applies to the (already mentioned) special case of forecasts with discrete values. It might seem surprising at first that in this case too the calibration function estimates are biased. It has to be kept in mind though that the number of occasions when the forecast assumes a specific value is also random. The well-known fact that observed frequencies are unbiased estimates of probabilities applies only if the number of instances is fixed. Furthermore, one might deliberately aggregate forecasts with different (but similar) values into one bin in order to improve the statistics, which means that the calibration function has to be interpolated at several values.

As to the second question, section 4 will contain suggestions on how to assess the bias–variance trade-off. The third and fourth questions form the core of this paper and will be addressed in the next section.

3. Applications of calibration function estimates

Calibration function estimates are employed for various purposes. The natural question arises as to how errors in the calibration function estimate affect its subsequent application. We investigate three common and important applications of the calibration function, namely the recalibration of forecasts, and the assessment of forecast reliability through either reliability diagrams or the reliability term of the Brier score. For all three applications, errors in the calibration function estimate have substantial negative impacts on the reliability of the result.

a. The reliability diagram

A common and important tool for the reliability analysis of categorical forecasts is the reliability diagram (Murphy and Winkler 1977; Toth et al. 2003; Wilks 2006b), which consists of a plot of the calibration function estimate. The true calibration function of a fully reliable forecast is equal to the diagonal. Therefore, deviations of the calibration function estimate from the diagonal are commonly attributed to a lack of reliability. But since estimators of the calibration function contain errors, even fully reliable forecast systems can exhibit reliability diagrams that deviate considerably from the diagonal. Thus, deviations of a reliability diagram from the diagonal always have to be compared to the deviations that are to be expected if the forecast is reliable. Again, these deviations are due to both the bias and variance. Within this context, the variance is manifest in that different datasets from the same source will give different calibration function estimates, while the bias is manifest in that, when the different calibration function estimates are averaged, the resulting function might still deviate from the diagonal, even for reliable forecast systems. The last statement might be surprising, but it is essentially saying that in general there is no reason why κ̄(r) should be equal to r just because κ(r) = r. This might be misleading for reliability diagrams, and hence it is advisable to use estimators that have small bias under the assumption that the forecast is reliable, or that κ(r) = r. This is, for example, the case if binning and counting is modified as already indicated in section 2: For each bin Bi, one defines
κ̂T(ri) = #{k : Yk = 1, ρk ∈ Bi} / #{k : ρk ∈ Bi}   (8)
at the nodes given by Eq. (7) and interpolates linearly in between. Under the assumption of reliability, all forecast probabilities should match the corresponding observed frequencies, whence forecast probabilities averaged over a particular bin should match the corresponding averaged observed frequencies. The latter statement just says that for N large, κ̂T(ri) in Eq. (8) is expected to be equal to ri in Eq. (7) for reliable forecasts. Therefore, this variant of binning and counting is better (in terms of bias) than taking ri as the geometric midpoint of the bin Bi, as is often done, especially if the forecast probabilities have a fairly nonuniform distribution.
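
A minimal sketch of this low-bias variant, Eqs. (7) and (8), is given below. Note that np.interp holds the end values constant beyond the extreme nodes rather than extrapolating linearly as described above; that simplification, and the skipping of empty bins, are assumptions of this sketch.

```python
# Low-bias binning and counting: node abscissas are the bin-mean forecasts, Eq. (7),
# node ordinates the observed frequencies, Eq. (8), with linear interpolation between nodes.
import numpy as np

def reliability_nodes(rho, y, K):
    rho, y = np.asarray(rho, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, K + 1)
    idx = np.clip(np.digitize(rho, edges[1:-1]), 0, K - 1)
    r_nodes, k_nodes = [], []
    for k in range(K):
        m = idx == k
        if m.any():
            r_nodes.append(rho[m].mean())   # Eq. (7): mean forecast in the bin
            k_nodes.append(y[m].mean())     # Eq. (8): observed frequency in the bin
    return np.array(r_nodes), np.array(k_nodes)

def kappa_hat(r, r_nodes, k_nodes):
    # piecewise linear interpolation, values kept inside the unit interval
    return np.clip(np.interp(r, r_nodes, k_nodes), 0.0, 1.0)
```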

For reliable forecasts, unbiased calibration curve estimates might still exhibit deviations from the diagonal due to variance. It is important to note that the variance depends on the distribution of the forecast ρ. Thus, certain deviations of the reliability diagram from the diagonal might be typical for one (reliable) forecast system, but not for another. This implies that deviations from the diagonal cannot be compared in terms of metric distance between forecast systems that exhibit different distributions of forecast probabilities. In Bröcker and Smith (2007), a methodology was suggested that allows for comparing the variations of the estimated calibration function with those to be expected if the forecast system were in fact reliable. To this end, the expected variations are plotted as consistency bars onto the diagonal, giving the forecaster an idea as to the amount of deviation from the diagonal expected from a reliable forecast. If an estimated calibration function (reliability diagram) falls outside these limits, there is an indication that the deviation from the diagonal is not purely due to chance, but that the forecast system is in fact not reliable. In Bröcker and Smith (2007), the variations of the estimated calibration function are generated using a bootstrap approach, which takes into account that the number of forecasts in each bin is random. If the random bin populations are not an issue, standard confidence intervals for sampling proportions such as in Wilks (2006b, section 7.9.1) or Jolliffe and Stephenson (2003, section 3.3.5) can be used. The latter approach though neglects bias, which is a further reason to use the binning and counting approach in the “low bias” version defined in Eqs. (7) and (8).
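
A sketch in the spirit of the consistency bars of Bröcker and Smith (2007) follows: under the null hypothesis of reliability, surrogate verifications are drawn with P(Y = 1 | ρ) = ρ, the bin frequencies are recomputed for each surrogate, and pointwise quantiles give the bars. The fixed bin populations and the particular quantiles are simplifications of this sketch, not the published recipe.

```python
# Consistency bars for a reliability diagram under the hypothesis of reliability.
import numpy as np

def consistency_bars(rho, K, n_boot=1000, q=(0.05, 0.95), seed=0):
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, float)
    edges = np.linspace(0.0, 1.0, K + 1)
    idx = np.clip(np.digitize(rho, edges[1:-1]), 0, K - 1)
    freqs = np.full((n_boot, K), np.nan)
    for b in range(n_boot):
        y_surr = (rng.uniform(size=rho.size) < rho).astype(float)  # reliable surrogate
        for k in range(K):
            m = idx == k
            if m.any():
                freqs[b, k] = y_surr[m].mean()
    lower, upper = np.nanquantile(freqs, q, axis=0)
    return lower, upper   # one pair of bar limits per bin, to be plotted on the diagonal
```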

There is obviously an epistemological problem here: The preceding discussion seems to suggest that for reliability diagrams, estimators should be employed that give small variance and small bias under the assumption that the forecast is reliable. It is not hard to find such an estimator—simply ignore the data and take the diagonal. The problem with this estimator is of course that it would be unable to detect any unreliable forecasts. The ability of a test to catch cases where the null hypothesis is false is called the power of the test. Hence, in order for the reliability diagram to have any power, the reliability diagram estimates need to have some variability. We are apparently facing another trade-off here: power versus propensity of the reliability diagram to label reliable forecasts as unreliable (also called size). This trade-off will not be further investigated in this paper.

b. Recalibrating forecasts

An important and widely applied measure of forecast performance is the Brier score (Brier 1950; Toth et al. 2003; Wilks 2006b). For a forecast ρ, the Brier score is defined as the expectation EY,ρ[Y − ρ]². It is possible to show that for any function f (r), it holds that
EY,ρ[Y − f (ρ)]² = EY,ρ[Y − κ(ρ)]² + Eρ[κ(ρ) − f (ρ)]².   (9)
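
A brief sketch of the standard argument behind Eq. (9), not reproduced from the paper's appendix: expand the square around κ(ρ) and note that the cross term vanishes because E[Y | ρ] = κ(ρ) for the binary verification Y.

```latex
\begin{align*}
E_{Y,\rho}[Y - f(\rho)]^2
  &= E_{Y,\rho}\bigl[\{Y - \kappa(\rho)\} + \{\kappa(\rho) - f(\rho)\}\bigr]^2 \\
  &= E_{Y,\rho}[Y - \kappa(\rho)]^2 + E_{\rho}[\kappa(\rho) - f(\rho)]^2
     + 2\,E_{\rho}\Bigl[\{\kappa(\rho) - f(\rho)\}\, E_Y[Y - \kappa(\rho) \mid \rho]\Bigr],
\end{align*}
```

and the last expectation is zero since E[Y | ρ] = κ(ρ).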
Application of a function f (ρ) as in Eq. (9) will henceforth be referred to as postprocessing the forecast ρ. Obviously, the best possible postprocessing is the function κ(r), as this would set the second term in Eq. (9) to 0. Hence, the calibration function could be characterized as a postprocessing that minimizes the Brier score [and in fact any other proper score; see Bröcker (2008)], as was already mentioned in section 1. It follows from basic probability theory that this property uniquely defines the calibration function (this is true as well for any other strictly proper score; see Bröcker 2008). Thus, better agreement between forecasts and observed frequencies is not only an intuitively appealing goal but also actually a good strategy to achieve a better score. Hence, it is tempting to recalibrate forecasts by redefining
κ̂(ρ)
as the new forecast (Toth et al. 2003; Atger 2003). This approach would be optimal if κ̂(r) were the true calibration function κ(r). Unfortunately, recalibration is adversely affected by errors in the calibration function estimates. The aim of this section is to study these errors. More specifically, we are interested in the effects caused by the bias and variance of calibration function estimates. Applying decomposition (9) to f (r) = κ̂T(r) yields
EY,ρ[Y − κ̂T(ρ)]² = EY,ρ[Y − κ(ρ)]² + Eρ[κ̂T(ρ) − κ(ρ)]².   (10)
What is essential here is the assumption that T is independent of ρ. This assumption is justified though, as T is a forecast archive containing past instances of forecast–verification pairs, while we compute the score over future instances of forecast–verification pairs. In other words, relation (10) concerns the Brier score of the forecast κ̂T(ρ) evaluated on instances of ρ that are not contained in T, which is exactly what we are interested in if we want to deploy the recalibrated forecast operationally. Equation (10) still contains a stochastic component: the training data T. Taking the expectation over T in Eq. (10), interchanging EY,ρ and ET, and using the decomposition (6) of the total error results in
EY,ρ,T[Y − κ̂T(ρ)]² = EY,ρ[Y − κ(ρ)]² + Eρ[b(ρ)] + Eρ[υ(ρ)].   (11)
Hence, recalibrating a forecast using a calibration function estimate rather than the exact calibration function results in a deviation from ideal behavior. This deviation is given by a sum of the bias and the variance, both averaged over the range of forecasts. These two quantities are plotted in Fig. 5 for the particular experiment discussed in section 2. The bias–variance decomposition of quadratic scoring rules is well known in the field of statistical learning; see, for example, Hastie et al. (2001). As discussed previously, bias and variance are unavoidable in calibration function estimates. The present discussion is not attempting to suggest that calibration, in view of bias and variance, is a bad idea. Actually, quite the contrary: Eq. (9) demonstrates that recalibrating forecasts gives a good Brier score if and only if the function used for calibration is in good agreement with the true calibration function. This suggests that calibration functions be estimated by minimizing the Brier score. This is nothing but least squares regression—a widely used and studied problem (see, e.g., Hastie et al. 2001). Although binning and counting might be an intuitively appealing approach, from regression and statistical learning we are provided with various (presumably superior) alternatives. Variants have already been applied within the context of postprocessing ensemble forecasts, for example, logistic regression (Wilks 2006a; Hamill et al. 2004), an approach that, unlike standard linear regression, takes into account that probabilities have to be between zero and one.
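
A minimal sketch of recalibration by regression in the sense just described: the calibration function is fitted on a training archive by logistic regression of Y on (the logit of) ρ, an alternative to binning and counting, and the fitted function is then applied to new forecasts. The train/test split and the logit feature are assumptions of this sketch, not the paper's recipe.

```python
# Recalibration by logistic regression, evaluated by the Brier score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibration(rho_train, y_train, eps=1e-6):
    r = np.clip(np.asarray(rho_train, float), eps, 1 - eps)
    X = np.log(r / (1 - r)).reshape(-1, 1)          # logit of the raw forecast
    model = LogisticRegression().fit(X, np.asarray(y_train))
    def kappa_hat(rho_new):
        rn = np.clip(np.asarray(rho_new, float), eps, 1 - eps)
        Xn = np.log(rn / (1 - rn)).reshape(-1, 1)
        return model.predict_proba(Xn)[:, 1]        # recalibrated probabilities
    return kappa_hat

def brier(y, p):
    return np.mean((np.asarray(y, float) - np.asarray(p, float)) ** 2)
```

Whether such a recalibration actually improves the score must, as argued below, be judged on data not used for the fit.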
After having estimated the calibration function and recalibrated his or her forecasts, the forecaster might wonder whether these efforts actually improved the performance. In this respect, Eq. (9) might be deceiving, since in theory, recalibrating with the true calibration function cannot fail. In reality though, the score of a recalibrated forecast is in fact given by Eq. (11), and if the bias or variance of the estimator is large, recalibration might actually diminish the performance, which therefore needs to be checked. The problem here is that the Brier score EY,ρ,T[Y − κ̂T(ρ)]² cannot simply be estimated by a sample average over T. The “in sample” Brier score,
(1/N) Σi [Yi − κ̂T(ρi)]²,
will give too optimistic results. For a detailed discussion, see Hastie et al. (2001), but roughly speaking the point is that κ̂T(r) has been adapted to the data T already, while we are interested in the performance for instances of (Y, ρ), which are not part of the training data, as was argued above. There are several ways to estimate the out-of-sample performance, with the most popular choice presumably being cross validation (CV). For details, we again refer the reader to Hastie et al. (2001), but the general idea is the following: Divide the data into K nonoverlapping sections, and let Tk be the training set with the kth section being left out. Then, construct an individual calibration curve estimate, κ̂k(r) := κ̂Tk(r), for every k. Note that for every i = 1, . . . , N, there exists exactly one k(i) so that (Yi, ρi) is not in Tk(i). Therefore, the forecast–verification pair (Yi, ρi) was not used in the construction of κ̂k(i). The cross-validation estimate of the Brier score is obtained by evaluating each calibration function estimate κ̂k exactly on those forecast–verification pairs that were not used to construct it. Hence, the cross-validation estimate of the Brier score is given by
BCV = (1/N) Σi [Yi − κ̂k(i)(ρi)]².   (12)
A reasonable choice for K is 10, but in some situations, K = N is a possible choice too. In section 4, we will demonstrate how cross validation can also be used to choose the number of bins in the binning and counting approach.
A convenient feature of the cross-validation approach is that it gives not only realistic estimates of the score, but can also be used to estimate variations of the score. A possible estimator of the variance of BCV is given by
σ²CV = [1/N(N − 1)] Σi {[Yi − κ̂k(i)(ρi)]² − BCV}²,   (13)
which can be used to build the standard ±2σ confidence intervals for BCV.
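
A sketch of the K-fold cross-validation estimate of the Brier score, Eq. (12), together with a variance estimate in the spirit of Eq. (13), is given below. The argument `fit` is any routine mapping a training archive (ρ, Y) to a callable calibration estimate, for example the logistic-regression recalibration sketched in section 3b.

```python
# K-fold cross-validated Brier score and an estimate of its sampling variance.
import numpy as np

def cv_brier(rho, y, fit, K=10, seed=0):
    rho, y = np.asarray(rho, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(y)
    fold = rng.permutation(n) % K                  # assign each pair to one of K folds
    sq_err = np.empty(n)
    for k in range(K):
        test = fold == k
        kappa_k = fit(rho[~test], y[~test])        # estimate without fold k
        sq_err[test] = (y[test] - kappa_k(rho[test])) ** 2
    b_cv = sq_err.mean()                           # Eq. (12)
    var_b_cv = sq_err.var(ddof=1) / n              # variance estimate, cf. Eq. (13)
    return b_cv, var_b_cv
```

The square root of the returned variance gives the σ used for the ±2σ intervals mentioned above.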

c. The reliability term of the Brier score

Considering the special case f (r) = r in Eq. (9) results in an interesting and well-known (Murphy and Winkler 1987; Murphy 1996) decomposition of the Brier score; namely,
EY,ρ[Y − ρ]² = EY,ρ[Y − κ(ρ)]² + Eρ[κ(ρ) − ρ]².   (14)
The terms on the right-hand side are the sharpness term and the reliability term, respectively. The sharpness term can be further decomposed; namely,
EY,ρ[Y − κ(ρ)]² = p1(1 − p1) − Eρ[κ(ρ) − p1]²,   (15)
with p1 = P(Y = 1). Often the second term in Eq. (15) is referred to as the resolution3 (Toth et al. 2003; Wilks 2006b). It is mainly due to the decomposition (14) that sharpness and reliability have been appreciated as the two virtues of probabilistic forecasts. Furthermore, decomposition (14) suggests that the reliability term be used as a diagnostic for forecast reliability. Our objective is to study possible estimators of the reliability term. Common practice (see, e.g., Atger 2004) is to estimate both terms by means of replacing κ(r) with κ̂(r). In particular, the reliability term is estimated by
Eρ,T[κ̂T(ρ) − ρ]².   (16)
Easy manipulation shows that
Eρ,T[κ̂T(ρ) − ρ]² = Eρ[κ(ρ) − ρ]² + Eρ[υ(ρ)] + Eρ[(κ̄(ρ) − κ(ρ))(κ̄(ρ) + κ(ρ) − 2ρ)].
From these calculations we conclude that the estimated reliability term can carry quite a substantial error. The error can be either positive or negative, unlike the sharpness term [Eq. (10)], which is always overestimated. Furthermore, the estimated sharpness and reliability terms on average do not add up to the correct Brier score of ρ; that is,
EY,ρ,T[Y − κ̂T(ρ)]² + Eρ,T[κ̂T(ρ) − ρ]² ≠ EY,ρ[Y − ρ]².
In other words, the Brier score does not decompose into those sharpness and reliability estimates obtained from the estimated calibration function. In view of this problem, one might consider estimating the quantity
EY,ρ[Y − ρ]² − EY,ρ,T[Y − κ̂T(ρ)]²   (17)
and using it as a reliability term. But Eq. (17) is nothing but the overall Brier score less the Brier score of the recalibrated forecast discussed in the last subsection. By construction, this approach has the advantage of amounting to an exact decomposition of the Brier score on average. It effectively measures by how much the score is improved by calibration. Using the bias and variance, we can write
EY,ρ[Y − ρ]² − EY,ρ,T[Y − κ̂T(ρ)]² = Eρ[κ(ρ) − ρ]² − Eρ[b(ρ)] − Eρ[υ(ρ)],   (18)
which demonstrates that, using this estimator, the reliability term is always underestimated. To decide between the two approaches [i.e., Eqs. (16) and (17)], it is worth recapitulating what the sharpness and reliability terms are supposed to tell us. In view of Eq. (14) and the already discussed interpretation of the calibration function, the sharpness term is the optimum score we can possibly obtain by recalibrating the forecast. It can therefore be interpreted as the hidden potential or information content of the forecast. The reliability term is the remainder, measuring the deviation from the perfect calibration of the forecast. But for what are these quantities used? First and most important, I cannot see the value of just stating the reliability term of the Brier score for a forecast.4 It is not difficult to issue a reliable forecast (e.g., climatology) or to make a given forecast more reliable, if no constraint is imposed on the sharpness. The task is to improve the reliability, while at the same time not impairing the sharpness. It was pointed out in Murphy and Winkler (1987) that recalibration “does require some effort and knowledge of the characteristics of the forecaster or forecast system. For less sophisticated users, who take the forecasts at face value, calibration and refinement are important. A poorly calibrated forecast system would lead such users astray. Since we cannot assume that consumers of weather forecasts are sophisticated, properties such as calibration and refinement are important in practice although they might be dismissed as relatively unimportant in theory in a perfect world with completely knowledgeable users of forecasts.”

As far as the sophisticated user is concerned, this statement is even truer in view of the difficulties of recalibration mentioned in section 3b. As far as the less sophisticated user is concerned, we take the quoted statement to mean that for users who take the forecasts at face value, forecasts have to come (more or less) calibrated. It is therefore doubtful whether a less sophisticated user has any use for a reliability and sharpness assessment at all. It is hard to see how knowing how much of the score can be attributed to sharpness and reliability, respectively, can possibly change such a person’s position. The sophisticated user though, who is able to recalibrate the forecast if needs be, should be interested in how much the forecast can at least be improved through recalibration. This is exactly what the estimator in Eq. (17) would reveal. This amounts to reporting the original score as well as the score of the recalibrated forecast—giving an upper bound of the sharpness, as discussed in section 3b. This will tell the sophisticated user whether attempts to recalibrate the forecast are likely to meet with success.

As a final point, we would like to discuss an often-cited decomposition of the Brier score of the binning and counting method computed over samples, as for example in Wilks [(2006b), Eq. (7.40)]. The fact that here the “classical” reliability and sharpness estimates exactly add up to the Brier score seems to have led to the misconception that there is no error in the calibration function estimate if it is just calculated “in the right way.” The mathematics behind the mentioned equation are of course correct, but they rest on two problematic assumptions. First, the forecast is assumed to have values in a finite set of numbers only, which is not an uncommon situation. Each bin is chosen so as to contain exactly one of these values. The decomposition stops being true if this is not the case, for example because the forecast has a continuous range of values, or if several forecast values are collected in one bin to improve the statistics. In this situation, the decomposition could be rendered true again by suitably reducing the set of forecast values, which, however, negatively affects the sharpness. Second, and more importantly, the mentioned decomposition is in-sample. Both the sharpness and reliability terms are evaluated on exactly the same data that were used to construct the calibration function estimate (or the recalibrated forecasts). What is relevant though is the performance of the forecast on data that were not used to construct the calibration function estimate, that is, hitherto unseen future data. This point was already stressed in section 3b.
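
The two reliability-term estimates discussed in this subsection, Eqs. (16) and (17), can be computed out of sample as in the following sketch; the explicit train/test split is an assumption of the sketch.

```python
# Out-of-sample estimates of the reliability term, Eqs. (16) and (17).
import numpy as np

def reliability_estimates(rho_train, y_train, rho_test, y_test, fit):
    kappa_hat = fit(rho_train, y_train)            # any calibration estimator
    p = kappa_hat(np.asarray(rho_test, float))
    rho_test, y_test = np.asarray(rho_test, float), np.asarray(y_test, float)
    rel_16 = np.mean((p - rho_test) ** 2)                        # Eq. (16)
    rel_17 = np.mean((y_test - rho_test) ** 2) - np.mean((y_test - p) ** 2)   # Eq. (17)
    return rel_16, rel_17
```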

4. The bias–variance trade-off revisited

In this section, we will take a closer look at the bias–variance trade-off and give recommendations for how to control it. Much of the presented material was taken from Hastie et al. (2001) and has been adapted to the particular problem considered in the present paper. Suppose we have an estimator of the calibration function, which in this section will often be written as κ̂(r; T, δ), where as before r is the argument of the calibration function, T is the training set (consisting of archived forecast–verification pairs), and δ is a parameter that, in one way or another, controls the degrees of freedom of the estimator. Typically, the actual degrees of freedom of the estimator are a function of δ and the size N of the training set T. For binning and counting, for example, the parameter δ controls the number of bins. In this case, the parameter δ is discrete, but in many estimation approaches designed for active degrees of freedom control, δ is actually continuous, describing the “freezing” of the degrees of freedom. For example, suppose we envisage estimating κ(r) by a harmonic series; then δ could be the order of the series, but it could also determine by how much higher-frequency terms are damped.

The difficulty with estimating the calibration function (and in fact with regression in general) is that the “right” solution, which is an entire functional relationship, has to be determined based on only a limited amount of information. The set of conceivable candidate functions has many more degrees of freedom than can possibly be determined from the data. If the estimator is allowed to employ too many degrees of freedom to obtain what seems to be a good fit to the data, the outcome of the estimation is in fact largely determined by chance. A different way of phrasing this problem is that if we try to extract too much information from too little data, individual sample points gain an unduly large influence on the result. To give a somewhat drastic example, suppose we have a forecast system with a continuous distribution of values ρ. Then, in a training set T with a finite number of samples (ρi, Yi), no value of ρi appears twice. The unit interval is split into very small bins so that each bin Bi contains a single forecast ρi only. Consider the estimate of the calibration function given by κ̂(r) = Yi for r ∈ Bi. This estimate produces a perfect fit to the data in T. To see what is wrong with it, consider a “test” pair (ρ̃, Ỹ) that is not in T, but is from the same source. Obviously, κ̂(ρ̃) is either 0 or 1, which is presumably very far away from κ(ρ̃), the desired solution. If we use a different training set from the same source, we would get a very different answer, or in other words, the variance of this approach is prohibitively large. This estimator obviously allows individual samples to wield too large an influence on the result. An estimator with too few degrees of freedom on the other hand (e.g., a binning and counting approach with only one bin) is unlikely to give good results either, as the algorithm then lacks versatility. In other words, if the influence of the individual sample points is too small, the estimator simply fails to extract information from the data. Such estimates will have low variance, but exhibit bias. Within the context of reliability diagrams, it was noted by Atger (2004) that interpolating the calibration function reduces the variance. In that paper, instead of binning and counting, both F(1, r) and F(0, r) in Eq. (2) are approximated with normal models. This approach has very few degrees of freedom, which is why it has a low variance. However, the bias of this approach remains to be investigated. Furthermore, this approach does not allow us to control the degrees of freedom, whence it is impossible to adapt it to situations where more data are available.

The idea of regularization (Hastie et al. 2001; Vapnik 1998) is to use algorithms (or to modify existing algorithms) that allow for controlling the influence of individual samples by means of a parameter δ, thereby getting a handle on the bias–variance trade-off. In the binning and counting approach for example, the influence of individual sample points is controlled by the bin size, as was already discussed in section 2. The problem obviously lies in how to choose the regularization parameter δ in practice. There exists a considerable body of literature (see, e.g., Vapnik 1998) on regularization, which is largely concerned with asymptotic results, that is, how to choose δ as a function of N in order to ensure that κ̂(r) → κ(r) for N → ∞. For example, in the binning and counting approach, it can be shown that under suitable regularity conditions, this result holds provided that, as N goes to infinity, the bin diameter δ goes to 0, but slowly enough that the number of samples in each bin still goes to infinity. These results are of value for deciding which algorithms allow for regularization, but they provide little guidance as to how to set the regularization parameter for a dataset of fixed size. The relevant information is contained in the bias–variance trade-off, as in Fig. 5 (where N = 100). Generating such a plot though requires many simulations to be carried out and, thus, an essentially unlimited amount of data.

A feasible approach is to estimate the score of the recalibrated forecast and minimize this quantity as a function of the regularization parameter. In view of Eq. (11), this has the effect of minimizing (an estimate of) the expected total error Eρ[b(ρ)] + Eρ[υ(ρ)]. The difficulty is again in estimating the Brier score for unseen (future) forecast–verification pairs, similar to what was discussed in section 3b. For the purpose of choosing a suitable regularization parameter, a simple version of cross validation called leave-one-out cross validation shall be discussed here, as it is particularly suitable for the binning and counting method. Again, we refer the reader to Hastie et al. (2001) for various alternative techniques of model assessment. Leave-one-out cross validation is cross validation with K = N. The leave-one-out cross-validation estimate of the Brier score BLOO is obtained by building a calibration function estimate κ̂(ρi; Ti, δ) for each Ti, which is the entire dataset but with a single data point (ρi, Yi) left out, and then evaluating it on that single data point. In a formula,
BLOO = (1/N) Σi [Yi − κ̂(ρi; Ti, δ)]².
This approach requires the construction of N estimates, which is tedious in general but fairly simple in a number of important cases, as for example the binning and counting approach, whence it is used here. In fact, using the notation of Eq. (3),
κ̂(ρi; Ti, δ) = [#{k : Yk = 1, ρk ∈ Bi} − Yi] / [#{k : ρk ∈ Bi} − 1],   (19)
where Bi is the bin that ρi falls into. If binning and counting is used with linear interpolation, then Eq. (19) gives the value at the node corresponding to bin Bi. The other nodes do not change.

A plot of BLOO for the binning and counting approach is shown in Fig. 6. All parameters here are as in Figs. 2–5. Consistent with the latter plots (in particular the bias–variance trade-off in Fig. 5), Fig. 6 shows that the performance remains pretty much constant for up to 12 bins, but then starts to deteriorate. From this plot, one would conclude that six bins is likely to be a safe choice. Obviously, Fig. 6 does not give as detailed information as Fig. 5. The bars are ±2σ confidence intervals, where σ was computed via Eq. (13). They indicate some uncertainty in the performance assessment. Nevertheless, Fig. 6 certainly provides guidance on how to choose the number of bins, while using only 100 data points, which is the amount of data assumed to be available. Hence, the approach is operationally feasible. Finally, note that this approach can be used not only with the Brier score, but also with any score suited to the problem at hand.

5. An example using weather forecasts

In this section, some of the tools discussed above are applied to ensemble forecasts of 2-m temperature anomalies. Results are presented for 2-m temperature at the Helgoland Düne, Germany, weather station [World Meteorological Organization (WMO) station 10015]. The forecasts consist of the 50-member (perturbed) medium-range ensemble, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) Ensemble Prediction System. Both the station data and forecasts were kindly provided by ECMWF. Forecasts were available for the years 2001–05 for lead times of 10 days. All data verified at noon. The observations from years previous to 2001 were used to fit a temperature normal, consisting of a fourth-order trigonometric polynomial. The normal was subtracted from both the ensembles and verifications. Furthermore, the ensembles were debiased, using the years 2001 and 2002. Eventually, we used a subsample of 362 forecast–verification pairs. A positive anomaly, that is, a temperature exceeding the normal, was considered an event. The ensembles were converted to probabilities by counting the ensemble members that exhibited positive anomalies. The goal was to estimate the sharpness (i.e., the score of the recalibrated forecast) and the reliability. For this study, instead of the binning and counting approach, we used kernel density estimators to estimate F(0, r) and F(1, r), and subsequently derived the calibration curve via Eq. (2). This is reminiscent of the already mentioned approach of Atger (2004), albeit with the advantage that kernel density estimators allow for controlling the degrees of freedom. The kernel density estimators employed here are of the form
F(0, r) ≅ (p0/N0) Σ{i: Yi = 0} Φ[(r − ρi)/δ],   (20)
where Φ is the standard error function, N0 = #{i; Yi = 0}, and p0 = P(Y = 0), which is estimated from the data. F(1, r) is estimated mutatis mutandis. Thus, the densities of F(0, r) and F(1, r) are estimated by a sum of Gaussian bumps of width δ, called the bandwidth. This is one of the simplest ways to form kernel density estimators, and for details, the reader is referred to Silverman (1986). In this approach, the bandwidth plays the role of the regularization parameter. The bandwidth controls the smoothness of the estimate, with a large bandwidth giving rather smooth estimates, but diminishing the influence of individual samples, while a small bandwidth has the opposite effect. It should be noted that the bandwidth plays a role similar to that of the bin diameter in the binning and counting approach (somewhat inverse to the number of bins). Figure 7 shows the Brier score of the recalibrated temperature anomaly forecast (the sharpness), estimated using leave-one-out cross validation for different kernel bandwidths (solid line). A bandwidth around 0.32 gives an optimal score of the recalibrated forecast and, hence, the best approximation to the true sharpness. The Brier score of the uncalibrated temperature anomaly forecast is shown as a thin dashed line. Note that only for bandwidths between 0.02 and 0.32 does the recalibration in fact yield significant improvement in performance. The two estimates of the reliability term are shown as a thick dashed line [Eρ,T[κ̂(ρ) − ρ]², Eq. (16)], and as a dashed–dotted line [EY,ρ[Y − ρ]² − EY,ρ,T[Y − κ̂(ρ)]², Eq. (17)], respectively. The vertical bars represent ±2σ confidence intervals. Note that the sharpness and reliability add up to the total Brier score only for the second estimate of the reliability. A further interesting point is that at the optimal bandwidth, both estimates of reliability coincide, indicating a small total error in the calibration function estimate.
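
A minimal sketch of this kernel-based calibration estimate follows: the densities of F(0, r) and F(1, r) are modelled as sums of Gaussian bumps of bandwidth δ, cf. Eq. (20), and the calibration curve is obtained from the ratio in Eq. (2). The Gaussian kernel and the handling of the boundary are assumptions of this sketch; the bandwidth would then be chosen by minimizing a cross-validated Brier score of the recalibrated forecast, as in section 3b.

```python
# Kernel-based estimate of the calibration function via Eq. (2) and Eq. (20).
import numpy as np

def kernel_calibration(rho_train, y_train, delta):
    rho_train = np.asarray(rho_train, float)
    y_train = np.asarray(y_train, float)
    p1 = y_train.mean()                         # estimate of P(Y = 1)
    def kappa_hat(r):
        r = np.atleast_1d(np.asarray(r, float))[:, None]
        bumps = np.exp(-0.5 * ((r - rho_train[None, :]) / delta) ** 2)
        # normalization constants cancel in the ratio below
        f1 = p1 * bumps[:, y_train == 1].mean(axis=1) if p1 > 0 else 0.0
        f0 = (1 - p1) * bumps[:, y_train == 0].mean(axis=1) if p1 < 1 else 0.0
        return f1 / (f0 + f1 + 1e-12)           # Eq. (2) with density estimates
    return kappa_hat
```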

6. Conclusions

In this paper, the problem of estimating the calibration function from data is revisited. It was demonstrated how the estimation errors can be described in terms of the bias and variance. Variance arises because different data from the same source would give slightly different calibration function estimates. Bias is due to systematic deviations between the estimated and the true calibration functions. The bias and variance are typically subject to a nontrivial trade-off, which was studied in detail for binning and counting, a popular method of estimating calibration functions. It was argued that to better control the bias–variance trade-off, the influence of individual sample points on the final estimate has to be controlled, which is the central aim of regularization techniques. As a simple illustration, how to choose an appropriate bin size in the binning and counting method was discussed. The bias and variance adversely affect estimates of the reliability and sharpness terms of the Brier score, recalibration of forecasts, and the assessment of forecast reliability through reliability diagram plots. Ways to communicate and appreciate these errors were presented that avoid overly optimistic or misleading forecast assessments. Furthermore, part of the methodology was applied to temperature anomaly forecasts, demonstrating its feasibility under operational constraints.

Acknowledgments

This paper presents research to a large extent carried out while the author was with the Centre for the Analysis of Time Series (CATS) at the London School of Economics. Fruitful discussions with the members of CATS are kindly acknowledged, in particular Liam Clarke, Milena Cuellar, and Leonard A. Smith. Valuable comments and suggestions by Sarah Hallerberg and in particular Markus Niemann, MPI für Physik komplexer Systeme, further improved the manuscript. Forecasts and observation data were kindly provided by the European Centre for Medium-Range Weather Forecasts.

REFERENCES

  • Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. Mon. Wea. Rev., 131, 1509–1523.

  • Atger, F., 2004: Estimation of the reliability of ensemble based probabilistic forecasts. Quart. J. Roy. Meteor. Soc., 130, 627–646.

  • Bickel, P. J., and E. L. Lehmann, 1969: Unbiased estimation in convex families. Ann. Math. Stat., 40, 1523–1535.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probabilities. Mon. Wea. Rev., 78, 1–3.

  • Bröcker, J., 2008: Decomposition of proper scores. Tech. Rep., Max-Planck-Institut für Physik Komplexer Systeme, Dresden, Germany.

  • Bröcker, J., and L. A. Smith, 2007: Increasing the reliability of reliability diagrams. Wea. Forecasting, 22, 651–661.

  • Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Mon. Wea. Rev., 132, 1434–1447.

  • Hastie, T., R. Tibshirani, and J. Friedman, 2001: The Elements of Statistical Learning. 1st ed. Springer, 533 pp.

  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.

  • Murphy, A. H., 1996: General decompositions of MSE-based skill scores: Measures of some basic aspects of forecast quality. Mon. Wea. Rev., 124, 2353–2369.

  • Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. Appl. Stat., 26, 41–47.

  • Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

  • Silverman, B. W., 1986: Density Estimation for Statistics and Data Analysis. 1st ed. Chapman and Hall, 175 pp.

  • Tippett, M. K., A. G. Barnston, and A. W. Robertson, 2007: Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. J. Climate, 20, 2210–2228.

  • Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 137–164.

  • Vapnik, V. N., 1998: Statistical Learning Theory. John Wiley and Sons, 736 pp.

  • Wilks, D. S., 2006a: Comparison of ensemble–MOS methods in the Lorenz’96 setting. Meteor. Appl., 13, 243–256.

  • Wilks, D. S., 2006b: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Vol. 59, Academic Press, 627 pp.

APPENDIX A

Bias in Estimators of the Calibration Function

In this appendix, we prove that unbiased estimation of the calibration function is impossible, using a device discussed in Bickel and Lehmann (1969). Whether an estimator is unbiased or not also depends on the range of possible distributions underlying the data: an estimator might give unbiased estimates for one class of distributions but be biased for another. The main result of this appendix is that if the class of possible distributions underlying the data is convex, no unbiased estimator of the calibration function exists. A class 𝓕 of distribution functions is convex if, with any two members F and G, it also contains the convex combination λF + (1 − λ)G for any λ between 0 and 1. For example, the class of all distribution functions on the unit interval is a convex class.

To state and prove our results, we will have to introduce (and recall) some additional notation. Let T = {(Yn, ρn); n = 1, . . . , N} denote the data from which the calibration function is to be estimated. Furthermore, let 𝓕 be the class of distribution functions of forecast–verification pairs. We fix F ∈ 𝓕 to be the true distribution of the pair (Yn, ρn). We assume F to be independent of n. By F_T we denote the compound distribution of T. If the pairs (Yn, ρn) are independent, we have
F_T = ∏_{n=1}^{N} F.
If F is furthermore a convex combination of the form F = λG + (1 − λ)H with G, H ∈ 𝓕, then F_T is a polynomial of degree N in λ.
Let q be the quantity to be estimated. In general, q is a functional of F, which we indicate by writing q(F) where appropriate. For example, if q = κ(r), which is the calibration function, then by Eq. (2),
q(F) = κ(r) = F(1, dr) / [F(0, dr) + F(1, dr)],     (A1)
where F(y, dr) denotes the probability under F that Y = y and ρ falls into the infinitesimal interval dr. An estimator for q is a function q̂(T), which maps the entire data T onto possible values of q. An estimator is unbiased if
E[q̂(T)] = q(F) for all F ∈ 𝓕,     (A2)
where the expectation is taken with respect to F_T.
The key observation is that the left-hand side is linear in F_T. Keeping this in mind, we now assume that the class 𝓕 is convex. Equation (A2) must hold if F is a convex combination F = λG + (1 − λ)H of two distributions, G and H. If we now furthermore assume that the pairs (Yn, ρn) are independent, F_T becomes a polynomial in λ, and so does the left-hand side of Eq. (A2), the degree being at most N (as some coefficients might cancel, depending on the particular choice of q̂, G, and H). We have shown the following assertion [Lemma 2.1(i) in Bickel and Lehmann (1969)]: “If there exists an estimator q̂(T) for q that is unbiased for a convex family 𝓕 of distribution functions, then q[λG + (1 − λ)H] must be a polynomial of degree not larger than the sample size N.”
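The assertion that the left-hand side of Eq. (A2) is a polynomial in λ can be made explicit by expanding the product measure; the following short calculation, written out here for convenience in the notation introduced above, fills in this step:

```latex
\mathbf{E}_{F_T}\bigl[\hat{q}(T)\bigr]
  = \int \hat{q}(T)\,\prod_{n=1}^{N}
      \bigl[\lambda\,\mathrm{d}G(Y_n,\rho_n) + (1-\lambda)\,\mathrm{d}H(Y_n,\rho_n)\bigr]
  = \sum_{S \subseteq \{1,\dots,N\}} \lambda^{|S|}\,(1-\lambda)^{N-|S|}
      \int \hat{q}(T)\,\prod_{n \in S}\mathrm{d}G(Y_n,\rho_n)
      \prod_{n \notin S}\mathrm{d}H(Y_n,\rho_n).
```

The integrals in the last sum no longer depend on λ, so the left-hand side of Eq. (A2) is a polynomial in λ of degree at most N; if unbiasedness is to hold for all λ, the right-hand side q[λG + (1 − λ)H] must therefore be such a polynomial as well.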
We apply this to our problem by substituting F = λG + (1 − λ)H into Eq. (A1) and checking whether q(F) becomes a polynomial in λ. The substitution yields
q(F) = [λ G(1, dr) + (1 − λ) H(1, dr)] / [λ A(dr) + (1 − λ) B(dr)],     (A3)
where we use the abbreviations
A(dr) = G(0, dr) + G(1, dr),   B(dr) = H(0, dr) + H(1, dr).
Note that A and B are the marginal distributions of ρ under G and H, respectively. If A ≠ B, then q(F), viewed as a function of λ, converges for |λ| → ∞ but is not constant, and hence cannot be a polynomial, since polynomials are either constant or unbounded for |λ| → ∞. Obviously, q(F) is a polynomial in λ if A = B, which in turn is the case if the marginal distributions of ρ under G and H are the same. We thus arrive at the following conclusion: if the family 𝓕 of distributions is convex and contains at least two distributions with different marginals for ρ, then there exists no unbiased estimator of the calibration function, for any sample size N.

For convex families of distributions whose members all share the same marginal distribution for ρ,A1 the presented techniques give no answer as to whether unbiased estimators exist. If they do exist, they necessarily depend on the marginal distribution of ρ; in other words, in order to build the estimator, the distribution of the forecast values would have to be known. This is rarely the case in practical applications.
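As a numerical complement to this appendix, the following Python sketch (hypothetical names; the quadratic calibration function is an arbitrary choice for illustration) averages many binning-and-counting estimates at a fixed point r. The average approaches the bin average of κ rather than κ(r) itself, which illustrates the bias discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
kappa = lambda r: r ** 2          # assumed true calibration function (illustration only)
N, n_bins, n_trials = 100, 3, 20000
edges = np.linspace(0.0, 1.0, n_bins + 1)
r = 0.9                           # point at which the estimator is evaluated
b = np.digitize(r, edges[1:-1])   # index of the bin B_r containing r

estimates = []
for _ in range(n_trials):
    rho = rng.uniform(size=N)                 # forecasts uniform on [0, 1]
    y = rng.uniform(size=N) < kappa(rho)      # verifications
    in_bin = (rho >= edges[b]) & (rho < edges[b + 1])
    if in_bin.any():                          # ignore the rare empty-bin case, as in the text
        estimates.append(y[in_bin].mean())

# Analytic bin average of kappa(s) = s^2 over B_r, for comparison
bin_average = (edges[b + 1] ** 3 - edges[b] ** 3) / (3.0 * (edges[b + 1] - edges[b]))
print("mean of kappa_hat(r):", np.mean(estimates))   # close to the bin average of kappa
print("bin average of kappa:", bin_average)
print("kappa(r):            ", kappa(r))              # differs from the above: bias
```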

APPENDIX B

Demonstration of Eqs. (4) and (5)

The purpose of this appendix is to make Eqs. (4) and (5) plausible. As was mentioned already in section 2, these equations are, strictly speaking, not quite correct, as they ignore the slim, albeit nonzero, chance that bins might be empty. This deficit is already inherent in the definition of the binning and counting estimator [Eq. (3)], which ceases to make sense if the bin Br is empty and thus the denominator becomes zero. This subtle difficulty will be ignored in the discussion below. In what follows, some familiarity with the manipulation of mathematical expectations is assumed. In particular, we write E[. . .] for the usual mathematical expectation. Whenever a quantity (say ρ) under the expectation needs to be held fixed, the conditional expectation is used and is written as E[. . .|ρ]. This formalism replaces the subscripted expectation values used before, as they become cumbersome when variables are no longer independent.

Suppose again that training data T = {(Yi, ρi); i = 1, . . . , N} of verifications Yi and the corresponding forecast probabilities ρi are given. Let the unit interval be partitioned into nonoverlapping subintervals (bins) B1, . . . , Bk. For any number r between 0 and 1, we denote by Br the bin that contains r. Define the random variables Ri = 1 if ρi ∈ Br and Ri = 0 otherwise. (The Ri implicitly depend on r.) Then, the binning and counting estimate in Eq. (3) can be written as
κ̂(r) = (Σi Ri Yi) / (Σi Ri).     (B1)
When taking the mathematical expectation conditioned on all Ri (written as E[. . .|{Ri}]) on both sides of Eq. (B1), the Ri are left unaffected, but it has to be taken into account that the Yi and Ri are not independent. We obtain
E[κ̂(r) | {Ri}] = (Σi Ri E[Yi | {Ri}]) / (Σi Ri).     (B2)
Since for every i, Yi depends on Ri only, we have E[Yi|{Ri}] = E[Yi|Ri]. Since forecasts and verifications are identically distributed for all i, this conditional expectation does not depend on i. Furthermore, only E[Yi|Ri = 1] is relevant, since E[Yi|Ri = 0] gets multiplied by 0 in the numerator of Eq. (B2). Therefore, Eq. (B2) reduces to
E[κ̂(r) | {Ri}] = E[Y1 | R1 = 1].     (B3)
Using standard rules of probability theory, one obtains
E[Y1 | R1 = 1] = E[Y | ρ ∈ Br] = (1/G(Br)) ∫_{Br} κ(s) dG(s),     (B4)
where we used the distribution function G of the forecast probabilities ρ and the definition of κ. Substituting Eq. (B4) for E[Yi|Ri = 1] in Eq. (B3) and taking the expectation value on both sides, we obtain
E[κ̂(r)] = (1/G(Br)) ∫_{Br} κ(s) dG(s).     (B5)
If ρ has a uniform distribution, Eq. (4) results.
To establish Eq. (5), we look at the mathematical expectation of κ̂² first. Since
κ̂(r)² = (Σi,j Ri Rj Yi Yj) / (Σi Ri)²,     (B6)
we have
E[κ̂(r)² | {Ri}] = (Σi,j Ri Rj E[Yi Yj | {Ri}]) / (Σi Ri)².     (B7)
It turns out that
E[Yi Yj | {Ri}] = E[Yi | Ri] E[Yj | Rj] for i ≠ j, while E[Yi² | {Ri}] = E[Yi | Ri],
since the Yi are binary and the pairs are independent. After some algebraic manipulation [writing κ̄ for the conditional expectation E[Y1 | R1 = 1] of Eq. (B4) for brevity], one gets
E[κ̂(r)² | {Ri}] = κ̄² + κ̄(1 − κ̄)/(Σi Ri).
Taking the mathematical expectation on both sides, we obtain
E[κ̂(r)²] = κ̄² + κ̄(1 − κ̄) (1/N) E[N/(Σi Ri)].     (B8)
According to the law of large numbers,
(1/N) Σi Ri → G(Br), and hence N/(Σi Ri) → 1/G(Br),
for N → ∞. This does not automatically imply that E[N/(Σi Ri)] converges to 1/G(Br) though, since (as was argued before) there is a nonvanishing chance that Σi Ri = 0, that is, that no forecast probability falls into bin Br. For full mathematical rigor, this problem would need to be taken care of. For the purpose of the present discussion though, we can assume this to be approximately true if neither N nor the bins themselves are too small, thus rendering Eqs. (4) and (5) plausible.
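The plausibility argument can also be checked numerically. The Python sketch below (again with hypothetical names and an assumed calibration function) compares E[N/(Σi Ri)] with 1/G(Br), and the Monte Carlo variance of κ̂(r) with κ̄(1 − κ̄) E[N/(Σi Ri)]/N, the expression suggested by the derivation above.

```python
import numpy as np

rng = np.random.default_rng(3)
kappa = lambda r: 0.2 + 0.6 * r         # assumed calibration function (illustration only)
N, n_trials = 100, 20000
a, c = 2.0 / 3.0, 1.0                   # bin B_r = [2/3, 1); with uniform rho, G(B_r) = 1/3
kappa_bar = 0.2 + 0.6 * (a + c) / 2.0   # bin average of the (linear) calibration function

ratios, estimates = [], []
for _ in range(n_trials):
    rho = rng.uniform(size=N)
    y = rng.uniform(size=N) < kappa(rho)
    R = (rho >= a) & (rho < c)
    if R.any():                         # skip the (very rare) empty-bin case
        ratios.append(N / R.sum())
        estimates.append(y[R].mean())

print("E[N / sum_i R_i]:", np.mean(ratios))                    # close to 1/G(B_r) = 3
print("Monte Carlo variance of kappa_hat(r):", np.var(estimates))
print("kappa_bar (1 - kappa_bar) E[N / sum_i R_i] / N:",
      kappa_bar * (1.0 - kappa_bar) * np.mean(ratios) / N)
```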

Fig. 1. Realizations of calibration function estimates, using the binning and counting approach: observed frequency vs forecast probability. The data are synthetic with forecasts drawn from a uniform distribution and verifications drawn according to the calibration function shown by the solid black line. The calibration function estimates were based on 100 forecast–verification pairs. The number of bins was three. Several estimates for different realizations of the data are shown as gray lines.

Fig. 2. The mean (white circles) and variance (dashed line) of the calibration function estimates in Fig. 1: observed frequency vs forecast probability. Both the mean and the variance were calculated from 10 000 realizations of the calibration curve estimate.

Fig. 3. As in Fig. 1 but with 24 bins. Several estimates for different realizations of the data are shown as gray lines. Evidently, the estimates exhibit larger variations than for only three bins.

Fig. 4. As in Fig. 2 but with 24 bins. The variance is larger than for only three bins, but the bias is negligible.

Fig. 5. Bias–variance trade-off for the experiment described in Figs. 1–4 but for several numbers of bins. The bias and the variance, strictly speaking functions of r, were converted to numbers by integrating over r. The gray lines show loci of constant total error.

Fig. 6. Brier score of the recalibrated forecast, estimated using leave-one-out cross validation for various numbers of bins (solid line). The Brier score of the uncalibrated forecast is shown as a thin dashed line. The calibration curve was estimated using the binning and counting approach. The data were generated as in Fig. 1. The vertical bars represent ±2σ confidence intervals. The graph clearly indicates a decay of performance beyond 12 bins. Note also that when using more than 12 bins, recalibration does not significantly improve the Brier score.
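A leave-one-out estimate of the Brier score of the recalibrated forecast, of the kind shown in Fig. 6, can be sketched in Python as follows (a minimal illustration with hypothetical names and synthetic data; in particular, the fallback to the raw forecast for empty bins is an ad hoc choice, not taken from the paper).

```python
import numpy as np

def loo_brier(rho, y, n_bins):
    """Leave-one-out Brier score of the recalibrated forecast.

    For each case i, the calibration function is estimated by binning and
    counting on all other cases, and rho_i is replaced by the observed
    frequency in its bin; empty bins fall back to the raw forecast.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(rho, edges[1:-1])
    score = 0.0
    for i in range(len(rho)):
        others = np.arange(len(rho)) != i
        same_bin = others & (idx == idx[i])
        recal = y[same_bin].mean() if same_bin.any() else rho[i]
        score += (recal - y[i]) ** 2
    return score / len(rho)

# Synthetic data in the spirit of Fig. 1 (assumed calibration function).
rng = np.random.default_rng(4)
N = 100
rho = rng.uniform(size=N)
y = (rng.uniform(size=N) < 0.2 + 0.6 * rho).astype(float)

for n_bins in (3, 6, 12, 24):
    print(n_bins, round(loo_brier(rho, y, n_bins), 4))   # scores typically degrade for many bins
```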

Fig. 7. Brier score of the recalibrated temperature anomaly forecast, estimated using leave-one-out cross validation for different kernel bandwidths (solid line). The calibration curve was estimated using kernel estimators. The Brier score of the uncalibrated temperature anomaly forecast is shown as a thin dashed line. A bandwidth around 0.32 gives optimal estimates. Note that only in this range does recalibration yield significant improvements in performance. The two estimates of the reliability term are shown as a thick dashed line for E_{ρ,T}[κ̂(ρ) − ρ]² [see Eq. (16)] and as a thick dashed–dotted line for E_ρ[Y − ρ]² − E_{ρ,T}[Y − κ̂(ρ)]² [Eq. (17)]. The vertical bars represent ±2σ confidence intervals.

Footnotes

1. Other authors speak of calibrated forecasts.

2. Although this conditional probability is featured prominently in various publications, there seems to be no generally adopted name for it. “Calibration function” is used by Wilks (2006a,b), while Toth et al. (2003) call it a “reliability curve.”

3. The resolution is the same as the variance of the forecast probabilities, which, confusingly, some authors also call sharpness.

4. I found such investigations in three papers and one book.

A1. Note that this is a convex condition.
