1. Reliability
Assume the objective is to forecast whether a real-world event will or will not occur, for example whether the temperature in Dresden, Germany (or rather the temperature as measured by a specific thermometer in Dresden), at noontime on a given day n falls below 0°C. We define the variable Y, referred to as the verification, to be 1 if that event actually happens and 0 if it does not. As forecasters, we may or may not have some information available that we can employ to build our forecasts. By a probabilistic forecast for Y, we mean any function ρ that maps our information data onto a number between 0 and 1; no further properties are required at this point. Suppose, for example, that we have access to ensemble temperature forecasts for Dresden, produced by a numerical weather prediction system. These ensemble forecasts constitute the aforementioned information data, while a possible choice for the function ρ could be the fraction of ensemble members exhibiting temperatures below 0°C. Another example would be to take a deterministic temperature forecast x for Dresden and use logistic regression to obtain a forecast ρ (see Tippett et al. 2007; Wilks 2006a; Hamill et al. 2004, and references therein for various alternatives). The details of the employed information data play no role in the discussion of this paper, whence we shall have no need to distinguish between the information data and ρ itself.
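To make these two constructions concrete, the sketch below shows both in Python. It is purely illustrative: the ensemble values, the threshold, and the logistic regression coefficients a and b are hypothetical and would in practice be estimated from an archive of past forecasts and verifications.

```python
# Two illustrative ways of turning forecast information into a probability rho
# for the event "temperature at noon falls below 0 degrees C".
import numpy as np

def rho_from_ensemble(ensemble_temps, threshold=0.0):
    """Fraction of ensemble members below the threshold."""
    ensemble_temps = np.asarray(ensemble_temps, dtype=float)
    return float(np.mean(ensemble_temps < threshold))

def rho_from_deterministic(x, a=-1.2, b=0.4):
    """Logistic regression on a deterministic forecast x.
    The coefficients a and b are placeholders; in practice they are estimated
    from an archive of forecast-verification pairs."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

# A hypothetical 10-member ensemble and a deterministic forecast of -0.5 C:
print(rho_from_ensemble([-1.3, 0.2, -0.1, -2.0, 0.5, -0.7, 0.1, -1.1, 0.3, -0.4]))
print(rho_from_deterministic(-0.5))
```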
In section 2, the basic facts necessary for the argumentation in this paper will be outlined. There it is discussed in what sense the true calibration function and its estimates can actually differ. Two types of errors are introduced, leading to the (well-known) concepts of bias and variance. In section 3, three important applications of the calibration function will be revisited: the reliability diagram, the reliability term of the Brier score, and recalibration. Using arguments from section 2, the effects of errors in the calibration function estimates on these applications are studied. Section 4 revisits the problem of estimating the calibration function from a more theoretical perspective. It is argued that this constitutes an ill-posed problem, which calls for techniques allowing for regularization. We will outline how the popular “binning and counting” approach fits into this framework, and give recommendations as to how to choose a proper bin size. An example using temperature anomaly forecasts and verifications is presented in section 5. Here, the calibration function is estimated using kernel density estimators. It is demonstrated how the presented methodology can be employed to choose an appropriate bandwidth parameter. Section 6 concludes the paper.
2. Estimating the calibration function: General considerations
It is common to choose the geometric midpoint of the bin Bi instead of the mean as in Eq. (7), but the choice presented here has an advantage, which will become clear in section 3a. Beyond the nodes corresponding to the two extreme bins, the calibration function estimate is extrapolated linearly, but values outside the unit interval are truncated. Figure 1 shows the true calibration function κ(r) (which we pretend not to know) as a black line. The shape of this calibration function might seem odd, as, in practice, nonmonotonic calibration functions are rarely encountered. But since there is no general reason why calibration functions have to be monotonic, our example is by no means unrealistic. It was chosen here for illustrative purposes related to the oscillations of the calibration function. However, the discussion in this paper is independent of whether the calibration function is monotonic or not.
Using κ(r) and a uniform distribution for the forecast probabilities ρ, an archive of 100 forecast–verification pairs was generated. Then, the calibration function was estimated from these forecast–verification pairs, using the aforementioned version of binning and counting with three bins. This experiment was repeated 10 000 times, with a new archive of forecast–verification pairs being generated every time. A few of the calibration function estimates are shown as gray lines in Fig. 1. From these 10 000 calibration function estimates, we computed the pointwise mean and the variance, which are shown in Fig. 2.
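The structure of this experiment is sketched below. Since the true calibration function of the paper is not reproduced here, an arbitrary oscillatory stand-in kappa_true is used, and the treatment of empty bins is a convention of the sketch rather than of the paper; the estimator places a node at the mean forecast of each bin, interpolates linearly, extrapolates beyond the extreme nodes, and truncates to the unit interval, as described above.

```python
# Monte Carlo sketch of the binning-and-counting experiment: 100 uniform
# forecasts per archive, three bins, 10 000 repetitions; kappa_true is an
# arbitrary oscillatory stand-in for the paper's true calibration function.
import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)

def kappa_true(r):
    return np.clip(r + 0.15 * np.sin(3 * np.pi * r), 0.0, 1.0)

def estimate_calibration(rho, y, n_bins, r_grid):
    """Nodes at the in-bin mean forecast, observed frequency as node value,
    linear interpolation/extrapolation, truncated to [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    nodes, freqs = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (rho >= lo) & ((rho < hi) | (hi == 1.0))
        if in_bin.any():                      # empty bins are simply skipped here
            nodes.append(rho[in_bin].mean())
            freqs.append(y[in_bin].mean())
    f = interp1d(nodes, freqs, bounds_error=False, fill_value="extrapolate")
    return np.clip(f(r_grid), 0.0, 1.0)

r_grid = np.linspace(0.01, 0.99, 99)
estimates = np.empty((10_000, r_grid.size))
for k in range(estimates.shape[0]):
    rho = rng.uniform(size=100)
    y = (rng.uniform(size=100) < kappa_true(rho)).astype(float)
    estimates[k] = estimate_calibration(rho, y, n_bins=3, r_grid=r_grid)

bias = estimates.mean(axis=0) - kappa_true(r_grid)   # pointwise bias
variance = estimates.var(axis=0)                     # pointwise variance
print(bias.mean(), variance.mean())
```

Averaging the estimates and comparing with kappa_true gives the pointwise bias, while the spread across repetitions gives the variance, in analogy to Figs. 1 and 2 (albeit for a different true calibration function).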
At this point, the reader might well have some questions as to how the calculations presented so far bear on the problems this paper proposes to address. In particular,
Is the trade-off between the bias and variance a general phenomenon, or does it occur only in the binning and counting approach?
Is there a way to assess the bias–variance trade-off to determine, for example, a good bin size?
How are the bias and variance related to the reliability diagram or the reliability term of the Brier score?
How do the bias and variance affect recalibration?
The first question will be addressed in more detail in section 4, but the short answer is that it is a general phenomenon, for the simple reason that changing the model class (e.g., by changing the bin size in the binning and counting approach) generally affects both the bias and the variance. As seen in the example, it is of course not always true that reducing the bias increases the variance and vice versa. In some problems of statistics, the bias–variance trade-off is easy to get a handle on, for example if the problem allows for estimators without bias (e.g., the empirical mean as an estimator of the expected value). In section 4 and appendix A, it will be discussed that estimates of the calibration function are generally biased. This also applies to the (already mentioned) special case of forecasts with discrete values. It might seem surprising at first that the calibration function estimates are biased in this case too. It has to be kept in mind though that the number of occasions on which the forecast assumes a specific value is also random. The well-known fact that observed frequencies are unbiased estimates of probabilities applies only if the number of instances is fixed. Furthermore, one might deliberately aggregate forecasts with different (but similar) values into one bin in order to improve the statistics, which means that the calibration function has to be interpolated at several values.
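The following sketch illustrates the point about random bin populations. Even if a bin contains a single forecast value, the number of cases falling into that bin is random and can be zero, so some convention for empty bins is needed; here a fall back to a climatological base rate is assumed purely for illustration, and with it the unconditional mean of the observed frequency no longer equals the true conditional probability.

```python
# Sketch: observed frequencies with a random (possibly zero) bin population.
# All numbers below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_per_archive = 200_000, 20
p_value = 0.1       # probability that the forecast assumes this particular value
kappa_value = 0.5   # true conditional event probability for that forecast value
base_rate = 0.2     # hypothetical climatological fallback for empty bins

n_in_bin = rng.binomial(n_per_archive, p_value, size=n_trials)  # random bin population
hits = rng.binomial(n_in_bin, kappa_value)                      # events among those cases
freq = np.where(n_in_bin > 0, hits / np.maximum(n_in_bin, 1), base_rate)

print("mean observed frequency:", freq.mean(), " true conditional probability:", kappa_value)
# The mean falls systematically short of 0.5 because of the empty-bin cases.
```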
As to the second question, section 4 will contain suggestions on how to assess the bias–variance trade-off. The third and fourth questions form the core of this paper and will be addressed in the next section.
3. Applications of calibration function estimates
Calibration function estimates are employed for various purposes. The natural question arises as to how errors in the calibration function estimate affect its subsequent application. We investigate three common and important applications of the calibration function, namely the recalibration of forecasts and the assessment of forecast reliability through either reliability diagrams or the reliability term of the Brier score. For all three applications, errors in the calibration function estimate can have substantial negative impacts on the quality of the results.
a. The reliability diagram
For reliable forecasts, unbiased calibration curve estimates might still exhibit deviations from the diagonal due to variance. It is important to note that the variance depends on the distribution of the forecast ρ. Thus, certain deviations of the reliability diagram from the diagonal might be typical for one (reliable) forecast system, but not for another. This implies that deviations from the diagonal cannot be compared in terms of metric distance between forecast systems that exhibit different distributions of forecast probabilities. In Bröcker and Smith (2007), a methodology was suggested that allows for comparing the variations of the estimated calibration function with those to be expected if the forecast system were in fact reliable. To this end, the expected variations are plotted as consistency bars onto the diagonal, giving the forecaster an idea as to the amount of deviation from the diagonal expected from a reliable forecast. If an estimated calibration function (reliability diagram) falls outside these limits, there is an indication that the deviation from the diagonal is not purely due to chance, but that the forecast system is in fact not reliable. In Bröcker and Smith (2007), the variations of the estimated calibration function are generated using a bootstrap approach, which takes into account that the number of forecasts in each bin is random. If the random bin populations are not an issue, standard confidence intervals for sampling proportions such as in Wilks (2006b, section 7.9.1) or Jolliffe and Stephenson (2003, section 3.3.5) can be used. The latter approach though neglects bias, which is a further reason to use the binning and counting approach in the “low bias” version defined in Eqs. (7) and (8).
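The basic idea can be sketched as follows: surrogate verifications are drawn under the hypothesis that the forecast is reliable, the bin frequencies are recomputed for each surrogate, and per-bin quantiles delimit the deviations from the diagonal expected by chance. This is a simplification; the procedure of Bröcker and Smith (2007) is a bootstrap that also resamples the forecast values, thereby accounting for the random bin populations.

```python
# Simplified consistency-bar sketch: range of bin frequencies expected from a
# reliable forecast with the given forecast values (no resampling of forecasts).
import numpy as np

def consistency_bars(rho, n_bins=10, n_surrogates=1000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(rho, edges) - 1, 0, n_bins - 1)
    freqs = np.full((n_surrogates, n_bins), np.nan)
    for s in range(n_surrogates):
        y_surr = rng.uniform(size=rho.size) < rho   # reliable by construction
        for b in range(n_bins):
            in_bin = bin_idx == b
            if in_bin.any():
                freqs[s, b] = y_surr[in_bin].mean()
    lo = np.nanquantile(freqs, (1.0 - level) / 2.0, axis=0)
    hi = np.nanquantile(freqs, 1.0 - (1.0 - level) / 2.0, axis=0)
    return lo, hi   # plotted around the diagonal, one bar per bin

rho = np.random.default_rng(2).uniform(size=100)
lo, hi = consistency_bars(rho)
print(np.round(lo, 2))
print(np.round(hi, 2))
```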
There is obviously an epistemological problem here: The preceding discussion seems to suggest that for reliability diagrams, estimators should be employed that give small variance and small bias under the assumption that the forecast is reliable. It is not hard to find such an estimator—simply ignore the data and take the diagonal. The problem with this estimator is of course that it would be unable to detect any unreliable forecasts. The ability of a test to catch cases where the null hypothesis is false is called the power of the test. Hence, in order for the reliability diagram to have any power, the reliability diagram estimates need to have some variability. We are apparently facing another trade-off here: power versus propensity of the reliability diagram to label reliable forecasts as unreliable (also called size). This trade-off will not be further investigated in this paper.
b. Recalibrating forecasts
c. The reliability term of the Brier score

As far as the sophisticated user is concerned, this statement is even truer in view of the difficulties of recalibration mentioned in section 3b. As far as the less sophisticated user is concerned, we take the quoted statement to mean that for users who take the forecasts at face value, forecasts have to come (more or less) calibrated. It is therefore doubtful whether a less sophisticated user has any use for a reliability and sharpness assessment at all. It is hard to see how knowing how much of the score can be attributed to sharpness and reliability, respectively, can possibly change such a person’s position. The sophisticated user though, who is able to recalibrate the forecast if need be, should be interested in how much the forecast can at least be improved through recalibration. This is exactly what the estimator in Eq. (17) would reveal. This amounts to reporting the original score as well as the score of the recalibrated forecast, giving an upper bound on the sharpness, as discussed in section 3b. This will tell the sophisticated user whether attempts to recalibrate the forecast are likely to meet with success.
As a final point, we would like to discuss an often-cited decomposition of the Brier score computed over samples with the binning and counting method, as for example in Wilks [(2006b), Eq. (7.40)]. The fact that here the “classical” reliability and sharpness estimates exactly add up to the Brier score seems to have led to the misconception that there is no error in the calibration function estimate if it is just calculated “in the right way.” The mathematics behind the mentioned equation are of course correct, but they rest on two problematic assumptions. First, the forecast is assumed to take values in a finite set of numbers only, which is not an uncommon situation. Each bin is chosen so as to contain exactly one of these values. The decomposition ceases to hold if this is not the case, for example because the forecast has a continuous range of values, or if several forecast values are collected in one bin to improve the statistics. In this situation, the decomposition could be rendered true again by decreasing the range of forecast values in a suitable way, which, however, negatively affects the sharpness. Second, and more importantly, the mentioned decomposition is in-sample. Both the sharpness and reliability terms are evaluated on exactly the same data that were used to construct the calibration function estimate (or the recalibrated forecasts). What is relevant though is the performance of the forecast on data that were not used to construct the calibration function estimate, that is, hitherto unseen future data. This point was already stressed in section 3b.
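Both points can be checked with a few lines of synthetic data. In the sketch below, the finitely many forecast values and the true conditional probabilities are arbitrary assumptions; when the conditional frequencies are estimated on the same sample on which the decomposition is evaluated, reliability minus resolution plus uncertainty reproduces the Brier score exactly, whereas on an independent sample the identity no longer holds.

```python
# In-sample vs out-of-sample behaviour of the classical Brier score decomposition
# for a forecast taking finitely many values (one bin per value).
import numpy as np

rng = np.random.default_rng(3)
values = np.array([0.1, 0.3, 0.5, 0.7, 0.9])            # possible forecast values
kappa_vals = np.array([0.15, 0.25, 0.55, 0.65, 0.80])   # assumed true conditional probs

def sample(n):
    idx = rng.integers(0, values.size, size=n)
    y = (rng.uniform(size=n) < kappa_vals[idx]).astype(float)
    return idx, y

def rel_res_unc(idx, y, cond_freq):
    """Reliability - resolution + uncertainty for given per-value frequencies."""
    ybar = y.mean()
    rel = res = 0.0
    for k in range(values.size):
        m = idx == k
        if m.any():
            rel += m.mean() * (values[k] - cond_freq[k]) ** 2
            res += m.mean() * (cond_freq[k] - ybar) ** 2
    return rel - res + ybar * (1.0 - ybar)

idx, y = sample(500)
in_sample_freq = np.array([y[idx == k].mean() for k in range(values.size)])
print("in-sample Brier score :", np.mean((values[idx] - y) ** 2))
print("rel - res + unc       :", rel_res_unc(idx, y, in_sample_freq))      # identical

idx_t, y_t = sample(500)   # hitherto unseen data
print("out-of-sample Brier   :", np.mean((values[idx_t] - y_t) ** 2))
print("same decomposition    :", rel_res_unc(idx_t, y_t, in_sample_freq))  # differs
```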
4. The bias–variance trade-off revisited
In this section, we will take a closer look at the bias–variance trade-off and give recommendations for how to control it. Much of the presented material was taken from Hastie et al. (2001) and has been adapted to the particular problem considered in the present paper. Suppose we have an estimator of the calibration function, which in this section will often be written as κ̂(r; T, δ), where as before r is the argument of the calibration function, T is the training set (consisting of archived forecast–verification pairs), and δ is a parameter that, in one way or another, controls the degrees of freedom of the estimator. Typically, the actual degrees of freedom of the estimator are a function of δ and the size N of the training set T. In binning and counting, for example, the parameter δ controls the number of bins. In this case, the parameter δ is discrete, but in many estimation approaches designed for active control of the degrees of freedom, δ is actually continuous, describing the “freezing” of the degrees of freedom. For example, if we envisage estimating κ(r) by a harmonic series, δ could be the order of the series, but it could also determine by how much higher-frequency terms are damped.
The difficulty with estimating the calibration function (and in fact with regression in general) is that the “right” solution, which is an entire functional relationship, has to be determined based on only a limited amount of information. The set of conceivable candidate functions has many more degrees of freedom than can possibly be determined from the data. If the estimator is allowed to employ too many degrees of freedom to obtain what seems to be a good fit to the data, the outcome of the estimation is in fact largely determined by chance. A different way of phrasing this problem is that if we try to extract too much information from too little data, individual sample points gain an unduly large influence on the result. To give a somewhat drastic example, suppose we have a forecast system with a continuous distribution of values ρ. Then, in a training set T with a finite number of samples (ρi, Yi), no value of ρi appears twice. Split the unit interval into very small bins so that each bin Bi contains a single forecast ρi only, and consider the estimate of the calibration function given by κ̂(r) = Yi for r ∈ Bi. This estimate produces a perfect fit to the data in T. To see what is wrong with it, consider a “test” pair (ρ̃, Ỹ) that is not in T, but comes from the same source. Obviously, κ̂(ρ̃) is either 0 or 1, which is presumably very far away from κ(ρ̃), the desired solution. If we used a different training set T̃ from the same source, we would get a very different answer; in other words, the variance of this approach is prohibitively large. This estimator obviously allows individual samples to wield too large an influence on the result. An estimator with too few degrees of freedom, on the other hand (e.g., a binning and counting approach with only one bin), is unlikely to give good results either, as the algorithm then lacks versatility. In other words, if the influence of the individual sample points is too small, the estimator simply fails to extract information from the data. Such estimates will have low variance, but exhibit bias. In the context of reliability diagrams, it was noted by Atger (2004) that interpolating the calibration function reduces the variance. In that paper, instead of binning and counting, both F(1, r) and F(0, r) in Eq. (2) are approximated with normal models. This approach has very few degrees of freedom, which is why it has a low variance. However, the bias of this approach remains to be investigated. Furthermore, this approach does not allow us to control the degrees of freedom, whence it is impossible to adapt it to situations where more data are available.
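A minimal sketch of the drastic example above follows. Every training forecast effectively occupies its own tiny bin (implemented here as a nearest-neighbour lookup), so the fit on the training archive is perfect, yet on unseen pairs from the same source the estimate takes only the values 0 and 1 and performs far worse than the true calibration function; the latter is again an arbitrary assumption made for the sake of the example.

```python
# Overfitting sketch: one training forecast per bin, kappa_hat(rho_i) = Y_i.
import numpy as np

rng = np.random.default_rng(4)
kappa_true = lambda r: 0.2 + 0.6 * r   # assumed truth, for illustration only

def draw(n):
    rho = rng.uniform(size=n)
    y = (rng.uniform(size=n) < kappa_true(rho)).astype(float)
    return rho, y

rho_train, y_train = draw(100)

def kappa_hat(r):
    """Each training forecast sits in its own bin; return the verification of
    the nearest training forecast."""
    nearest = np.abs(np.subtract.outer(np.atleast_1d(r), rho_train)).argmin(axis=1)
    return y_train[nearest]

print("in-sample Brier     :", np.mean((kappa_hat(rho_train) - y_train) ** 2))  # 0.0

rho_test, y_test = draw(10_000)
print("out-of-sample Brier :", np.mean((kappa_hat(rho_test) - y_test) ** 2))
print("Brier of true kappa :", np.mean((kappa_true(rho_test) - y_test) ** 2))
```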
The idea of regularization (Hastie et al. 2001; Vapnik 1998) is to use algorithms (or to modify existing algorithms) that allow for controlling the influence of individual samples by means of a parameter δ, thereby getting a handle on the bias–variance trade-off. In the binning and counting approach, for example, the influence of individual sample points is controlled by the bin size, as was already discussed in section 2. The problem obviously lies in how to choose the regularization parameter δ in practice. There exists a considerable body of literature (see, e.g., Vapnik 1998) on regularization, which is largely concerned with asymptotic results, that is, how to choose δ as a function of N in order to ensure that κ̂(r) → κ(r) for N → ∞. For example, in the binning and counting approach, it can be shown that under suitable regularity conditions, this result holds provided that, as N goes to infinity, the bin diameter δ goes to 0, but slowly enough that the number of samples in each bin still goes to infinity. These results are of value for deciding which algorithms allow for regularization, but they provide little guidance as to how to set the regularization parameter for a dataset of fixed size. The relevant information is contained in the bias–variance trade-off, as in Fig. 5 (where N = 100). Generating such a plot though requires many simulations to be carried out and, thus, an essentially unlimited amount of data.
A plot of BLOO for the binning and counting approach is shown in Fig. 6. All parameters here are as in Figs. 2–5. Consistent with the latter plots (in particular the bias–variance trade-off in Fig. 5), Fig. 6 shows that the performance remains pretty much constant for up to 12 bins, but then starts to deteriorate. From this plot, one would conclude that six bins is likely to be a safe choice. Obviously, Fig. 6 does not give as detailed information as Fig. 5. The bars are ±2σ confidence intervals, where σ was computed via Eq. (13). They indicate some uncertainty in the performance assessment. Nevertheless, Fig. 6 certainly provides guidance on how to choose the number of bins, while using only 100 data points, which is only as much data as is assumed to be available. Hence, the approach is operationally feasible. Finally, note that this approach can be used not only with the Brier score, but also with any score suited to the problem at hand.
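A sketch of the selection procedure is given below. It uses plain per-bin observed frequencies rather than the interpolated estimator of Eqs. (7) and (8), omits the confidence intervals of Eq. (13), and relies on an arbitrarily assumed true calibration function; it therefore conveys the mechanics of the leave-one-out choice of the bin number rather than the exact computation behind Fig. 6.

```python
# Leave-one-out Brier score of the recalibrated forecast,
# B_LOO = (1/N) * sum_i (Y_i - kappa_hat_{-i}(rho_i))^2, for several bin numbers.
import numpy as np

rng = np.random.default_rng(5)
kappa_true = lambda r: np.clip(r + 0.15 * np.sin(3 * np.pi * r), 0.0, 1.0)  # assumption

rho = rng.uniform(size=100)
y = (rng.uniform(size=100) < kappa_true(rho)).astype(float)

def b_loo(rho, y, n_bins):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(rho, edges) - 1, 0, n_bins - 1)
    errors = np.empty(rho.size)
    for i in range(rho.size):
        mask = bins == bins[i]
        mask[i] = False                     # leave observation i out
        if mask.any():
            k_hat = y[mask].mean()          # observed frequency in i's bin
        else:
            k_hat = np.delete(y, i).mean()  # fallback: climatology (our convention)
        errors[i] = (y[i] - k_hat) ** 2
    return errors.mean()

print("Brier score of rho itself:", np.mean((y - rho) ** 2))
for n_bins in (1, 2, 3, 6, 12, 24, 48):
    print(n_bins, "bins -> B_LOO =", round(b_loo(rho, y, n_bins), 4))
```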
5. An example using weather forecasts

6. Conclusions
In this paper, the problem of estimating the calibration function from data was revisited. It was demonstrated how the estimation errors can be described in terms of the bias and the variance. Variance arises because different data from the same source would give a slightly different calibration function estimate. Bias is due to systematic deviations between the estimated and the true calibration functions. The bias and variance are typically subject to a nontrivial trade-off, which was studied in detail for binning and counting, a popular method of estimating calibration functions. It was argued that to better control the bias–variance trade-off, the influence of individual sample points on the final estimate has to be controlled, which is the central aim of regularization techniques. As a simple illustration, it was discussed how to choose an appropriate bin size in the binning and counting method. The bias and variance adversely affect estimates of the reliability and sharpness terms of the Brier score, the recalibration of forecasts, and the assessment of forecast reliability through reliability diagram plots. Ways to communicate and account for these errors were presented that avoid overly optimistic or misleading forecast assessments. Furthermore, part of the methodology was applied to temperature anomaly forecasts, demonstrating its feasibility under operational constraints.
Acknowledgments
This paper presents research to a large extent carried out while the author was with the Centre for the Analysis of Time Series (CATS) at the London School of Economics. Fruitful discussions with the members of CATS are kindly acknowledged, in particular Liam Clarke, Milena Cuellar, and Leonard A. Smith. Valuable comments and suggestions by Sarah Hallerberg and in particular Markus Niemann, MPI für Physik komplexer Systeme, further improved the manuscript. Forecasts and observation data were kindly provided by the European Centre for Medium-Range Weather Forecasts.
REFERENCES
Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. Mon. Wea. Rev., 131, 1509–1523.
Atger, F., 2004: Estimation of the reliability of ensemble-based probabilistic forecasts. Quart. J. Roy. Meteor. Soc., 130, 627–646.
Bickel, P. J., and E. L. Lehmann, 1969: Unbiased estimation in convex families. Ann. Math. Stat., 40, 1523–1535.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probabilities. Mon. Wea. Rev., 78, 1–3.
Bröcker, J., 2008: Decomposition of proper scores. Tech. Rep., Max-Planck-Institut für Physik komplexer Systeme, Dresden, Germany.
Bröcker, J., and L. A. Smith, 2007: Increasing the reliability of reliability diagrams. Wea. Forecasting, 22, 651–661.
Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Mon. Wea. Rev., 132, 1434–1447.
Hastie, T., R. Tibshirani, and J. Friedman, 2001: The Elements of Statistical Learning. 1st ed. Springer, 533 pp.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
Murphy, A. H., 1996: General decompositions of MSE-based skill scores: Measures of some basic aspects of forecast quality. Mon. Wea. Rev., 124, 2353–2369.
Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. Appl. Stat., 26, 41–47.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Silverman, B. W., 1986: Density Estimation for Statistics and Data Analysis. 1st ed. Chapman and Hall, 175 pp.
Tippett, M. K., A. G. Barnston, and A. W. Robertson, 2007: Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. J. Climate, 20, 2210–2228.
Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 137–164.
Vapnik, V. N., 1998: Statistical Learning Theory. John Wiley and Sons, 736 pp.
Wilks, D. S., 2006a: Comparison of ensemble–MOS methods in the Lorenz’96 setting. Meteor. Appl., 13, 243–256.
Wilks, D. S., 2006b: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Vol. 59, Academic Press, 627 pp.
APPENDIX A
Bias in Estimators of the Calibration Function
In this section, we prove that unbiased estimation of the calibration function is impossible, using a device discussed in Bickel and Lehmann (1969). Whether an estimator is unbiased or not also depends on the range of possible distributions underlying the data. An estimator might give unbiased estimates for one class of distributions, but might be biased for other classes. The main result of this section will be that if the class of possible distributions underlying the data is convex, no unbiased estimator of the calibration function exists. A class of distributions is called convex if, along with any two of its members, it also contains all mixtures of the two.
In the case of convex families of distributions featuring similar marginal distributions for ρ,A1 the presented techniques give no answer as to whether there are unbiased estimators in this situation. If they exist, they necessarily depend on the marginal distribution of ρ, or in other words, in order to build the estimator, the distribution of forecast values would have to be known. This would rarely be the case in any practical application.
APPENDIX B
Demonstration of Eqs. (4) and (5)
Fig. 1. Realizations of calibration function estimates, using the binning and counting approach: observed frequency vs forecast probability. The data are synthetic, with forecasts drawn from a uniform distribution and verifications drawn according to the calibration function shown by the solid black line. The calibration function estimates were based on 100 forecast–verification pairs. The number of bins was three. Several estimates for different realizations of the data are shown as gray lines.
Fig. 2. The mean (white circles) and variance (dashed line) of the calibration function estimates in Fig. 1: observed frequency vs forecast probability. Both the mean and variance were calculated from 10 000 realizations of the calibration curve estimate.
Fig. 3. As in Fig. 1, but with 24 bins. Several estimates for different realizations of the data are shown as gray lines. Evidently, the estimates exhibit larger variations than for only three bins.
Fig. 4. As in Fig. 2, but with 24 bins. The variance is larger than for only three bins, but the bias is negligible.
Fig. 5. Bias–variance trade-off for the experiment described in Figs. 1–4, but for several numbers of bins. The bias and the variance, strictly speaking functions of r, were converted to numbers by integrating over r. The gray lines show loci of constant total error.
Fig. 6. Brier score of the recalibrated forecast, estimated using leave-one-out cross validation for various numbers of bins (solid line). The Brier score of the uncalibrated forecast is shown as a thin dashed line. The calibration curve was estimated using the binning and counting approach. The data were generated as in Fig. 1. The vertical bars represent ±2σ confidence intervals. The graph clearly indicates a decay of performance beyond 12 bins. Note also that when using more than 12 bins, recalibration does not significantly improve the Brier score.
Fig. 7. Brier score of the recalibrated temperature anomaly forecast, estimated using leave-one-out cross validation for different kernel bandwidths (solid line). The calibration curve was estimated using kernel estimators. The Brier score of the uncalibrated temperature anomaly forecast is shown as a thin dashed line. A bandwidth around 0.32 gives optimal estimates. Note that only in this range does recalibration yield significant improvements in performance. The two estimates of the reliability term are shown as a thick dashed line (Eρ,T[κ̂(ρ) − ρ]²; see Eq. (16)) and as a thick dashed–dotted line (Eρ[Y − ρ]² − Eρ,T[Y − κ̂(ρ)]²; Eq. (17)). The vertical bars represent ±2σ confidence intervals.
Other authors speak of calibrated forecasts.
Although this conditional probability is featured prominently in various publications, there seems to be no generally adopted name for it. “Calibration function” is used by Wilks (2006a, b), while Toth et al. (2003) call it a “reliability curve.”
The resolution is the same as the variance of the forecast probabilities, which, confusingly, some authors also call sharpness.
I found such investigations in three papers and one book.
Note that this is a convex condition.