• Barnston, A. G., S. Li, S. J. Mason, D. G. DeWitt, L. Goddard, and X. Gong, 2010: Verification of the first 11 years of IRI's seasonal climate forecasts. J. Appl. Meteor. Climatol., 49, 493–520.

  • Epstein, E. S., and A. H. Murphy, 1965: A note on the attributes of probabilistic predictions and the probability score. J. Appl. Meteor., 4, 297–299.

  • Jolliffe, I. T., and D. B. Stephenson, 2012: Introduction. Forecast Verification: A Practitioner's Guide in Atmospheric Sciences, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley-Blackwell, 1–9.

  • Jupp, T. E., R. Lowe, C. A. S. Coelho, and D. B. Stephenson, 2012: On the visualization, verification and recalibration of ternary probabilistic forecasts. Philos. Trans. Roy. Soc. London, A370, 1100–1120.

  • Katz, R. W., and A. H. Murphy, Eds., 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.

  • Livezey, R. E., and M. M. Timofeyeva, 2008: The first decade of long-lead U.S. seasonal forecasts—Insights from a skill analysis. Bull. Amer. Meteor. Soc., 89, 843–854.

  • Livezey, R. E., K. Y. Vinnikov, M. M. Timofeyeva, R. Tinker, and H. M. Van den Dool, 2007: Estimation and extrapolation of climate normals and climatic trends. J. Appl. Meteor. Climatol., 46, 1759–1776.

  • Murphy, A. H., 1972: Scalar and vector partitions of the probability score: Part II. N-state situation. J. Appl. Meteor., 11, 1183–1192.

  • Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.

  • Murphy, A. H., 1997: Forecast verification. Economic Value of Weather and Climate Forecasts, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

  • Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. Appl. Stat., 26, 41–47.

  • Murphy, A. H., and H. Daan, 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, A. H. Murphy and R. W. Katz, Eds., Westview, 379–437.

  • Murphy, A. H., and W.-R. Hsu, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293.

  • Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

  • Murphy, A. H., and R. L. Winkler, 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, 435–455.

  • Murphy, A. H., B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501.

  • Van den Dool, H., 2007: Empirical Methods in Short-Term Climate Prediction. Oxford University Press, 215 pp.

  • Van den Dool, H., and Z. Toth, 1991: Why do forecasts for "near normal" often fail? Wea. Forecasting, 6, 76–85.

  • Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. J. Climate, 13, 2389–2403.

  • Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8, 209–219.

  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Academic Press, 676 pp.

  • Wilks, D. S., 2013: Projecting "normals" in a nonstationary climate. J. Appl. Meteor. Climatol., 52, 289–302.

  • Wilks, D. S., and C. M. Godfrey, 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. J. Climate, 15, 1369–1377.

The Calibration Simplex: A Generalization of the Reliability Diagram for Three-Category Probability Forecasts

1 Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, New York

Abstract

Full exposition of the performance of a set of forecasts requires examination of the joint frequency distribution of those forecasts and their corresponding observations. In settings involving probability forecasts, this joint distribution has a high dimensionality, and communication of its information content is often best achieved graphically. This paper describes an extension of the well-known reliability diagram, which displays the joint distribution for probability forecasts of dichotomous events, to the case of probability forecasts for three disjoint events, such as “below,” “near,” and “above normal.” The resulting diagram, called the calibration simplex, involves a discretization of the 2-simplex, which is an equilateral triangle. Characteristics and interpretation of the calibration simplex are illustrated using both idealized verification datasets, and the 6–10- and 8–14-day temperature and precipitation forecasts produced by the U.S. Climate Prediction Center.

Corresponding author address: Daniel S. Wilks, Dept. of Earth and Atmospheric Sciences, Bradfield Hall, Rm. 1113, Cornell University, Ithaca, NY 14853. E-mail: dsw5@cornell.edu


1. Introduction

Forecast verification is the process of systematically and quantitatively describing the relationships between forecasts and the events they are meant to predict. It is an essential process both for optimal use of forecasts in decision making (Katz and Murphy 1997), and as a part of the process of improving the forecasts themselves (e.g., Murphy 1997; Wilks 2011; Jolliffe and Stephenson 2012).

Most often forecast quality is characterized using scalar (i.e., single number) verification statistics such as mean squared error, which has been referred to as the measures-oriented approach to verification (Murphy 1997). Although restricting attention to one or a small number of scalar statistics simplifies the verification process both computationally and conceptually, this practice inevitably masks some aspects of forecast performance even in the most restricted verification settings. The problem arises because the dimensionality (Murphy 1991) of the joint distribution of forecasts and observations (Murphy and Winkler 1987) is large, especially in settings involving probability forecasts. Distributions-oriented (Murphy 1997), or diagnostic (Murphy et al. 1989; Murphy and Winkler 1992), verification methods communicate the richness of verification datasets by respecting their dimensionality, but at the same time must overcome the problem of how best to portray the high-dimensional information.

Arguably the most effective approaches to communicating the full information content of a joint distribution of forecasts and observations have involved the use of well-designed graphics, although only a small number of these have yet been devised. Murphy et al. (1989) used bivariate histograms to convey the joint distribution of nonprobabilistic daily temperature forecasts and their corresponding observations, and also introduced the conditional quantile plot to portray the same information in terms of both the calibration-refinement and the likelihood-base rate factorizations (Murphy and Winkler 1987) of the joint distribution. By far the most commonly used diagnostic verification graphic is the reliability diagram (Murphy and Winkler 1977; Wilks 2011). Reliability diagrams show the calibration-refinement factorization of the joint distribution of probability forecasts for dichotomous (yes–no) events, by plotting the subsample event relative frequency as a function of forecast probability (the calibration), and also including a plot of the frequency of use of each of the possible scalar probability forecasts (the refinement).

Probability forecasts are most frequently issued for dichotomous events, so that the reliability diagram, or the closely allied attributes diagram (Murphy and Hsu 1986), is appropriate and effective for graphically displaying the relevant joint distribution. However, probability forecasts may also be issued jointly for more than two predictand categories, notably including three-category temperature or precipitation forecasts at lead times of a week and longer (e.g., Van den Dool 2007; Livezey and Timofeyeva 2008; Barnston et al. 2010). In this format the predictand categories are often defined in terms of the climatological terciles, so that the below-normal (cold or dry), near-normal, and above-normal (warm or wet) categories each have climatological occurrence probabilities of ⅓. Diagnostic verification of such forecasts has been approached by reducing the three-element probability forecast vectors to collections of dichotomous probabilities (e.g., Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010), but that approach is not fully satisfying because it neglects the relationships among probabilities assigned to the different events (Murphy and Hsu 1986).

This paper proposes an extension of the well-known reliability diagram to the verification of probability forecast vectors pertaining to three distinct outcome categories, using a two-dimensional graphic called the calibration simplex. The calibration simplex graphically represents the calibration-refinement factorization of the full joint distribution of these forecasts and their corresponding observations. Section 2 defines the structure of the calibration simplex in terms of the joint distribution of the forecasts and observations, and in relation to the conventional reliability diagram. Characteristic calibration simplex forms that are diagnostic for important aspects of forecast performance are illustrated in section 3, using artificial data. Section 4 shows example results using 6–10- and 8–14-day forecasts for U.S. temperature and precipitation, and section 5 concludes the paper.

2. Structure of the calibration simplex in relation to the reliability diagram

Diagnostic verification methods are those that communicate the joint frequency distribution of the forecasts and their corresponding observations (Murphy and Winkler 1987):
Pr{fi, oj},  i = 1, …, I;  j = 1, …, J.    (1)
Here, there are I distinct forecasts fi possible, each of which pertains to J possible observations or outcomes oj. In the case of forecasts for dichotomous outcomes, J = 2. For the purposes of constructing a reliability diagram, the forecast probabilities are rounded to a discrete set of I values if they have not already been issued as such. For example, I = 11 for probability forecasts rounded to 10ths, in which case the forecasts range from f1 = 0.0 through f11 = 1.0.
It can be more informative to work with the joint distribution in terms of one of its factorizations (Murphy and Winkler 1987). The reliability diagram is based on the calibration-refinement factorization:
Pr{fi, oj} = Pr{oj | fi} Pr{fi}.    (2)
Thus, the full joint distribution can be decomposed into a collection of I conditional distributions Pr{oj | fi} of the observations given each of the possible forecasts fi, called the calibration distributions, and a single I-element frequency distribution Pr{fi} specifying the frequencies of use of the possible forecasts, called the refinement distribution. In the case of the reliability diagram, the calibration distributions are Bernoulli (i.e., binomial with N = 1) distributions, defined by the probability distribution function:
Pr{o | fi} = pi^o (1 − pi)^(1−o),  o = 0, 1,    (3)
where each pi is estimated by the empirical (within the verification dataset) conditional relative frequency of the "yes" event on occasions when the corresponding forecast fi was issued. Each of the I calibration distributions is fully characterized by its estimated Bernoulli probability p̂i, and collectively these define the vertical positions of the plotted points in the main portion of the reliability diagram. A histogram, or other quantitative representation of the refinement distribution, completes the graphical portrayal of Eq. (2).
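As a concrete sketch of how the quantities in Eq. (2) are estimated in the dichotomous case, the calibration and refinement distributions reduce to simple counting over the discretized forecasts. The function below is illustrative only (the name and signature are assumptions, not taken from any standard verification package):

```python
import numpy as np

def calibration_refinement(forecasts, observations, n_bins=11):
    """Estimate the calibration-refinement factorization [Eq. (2)] for
    probability forecasts of a dichotomous (0/1) event.  Forecasts are
    rounded to n_bins equally spaced values (n_bins = 11 -> tenths)."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(observations, dtype=float)
    idx = np.rint(f * (n_bins - 1)).astype(int)        # discretized forecast index i
    refinement = np.bincount(idx, minlength=n_bins) / len(f)   # Pr{f_i}
    counts = np.bincount(idx, minlength=n_bins)
    hits = np.bincount(idx, weights=o, minlength=n_bins)
    # Conditional relative frequency of the "yes" event, Pr{o = 1 | f_i};
    # NaN where a forecast value was never used.
    calibration = np.where(counts > 0, hits / np.maximum(counts, 1), np.nan)
    return calibration, refinement
```

For forecasts rounded to tenths (I = 11 bins), calibration[i] estimates the conditional event relative frequency given forecast fi, and refinement[i] estimates the frequency of use of fi.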

Probability vectors fi = [fB,i, fN,i, fA,i]T pertaining to three mutually exclusive and collectively exhaustive outcomes (e.g., below, near, and above normal) can be plotted in two dimensions because the three forecast probabilities in each vector must sum to 1. The geometrically appropriate graph in this case is the regular 2-simplex (Epstein and Murphy 1965; Murphy 1972), which takes the shape of an equilateral triangle. Each of the corners of the 2-simplex corresponds to forecast certainty (i.e., 100% probability) for one of the three outcomes being forecast. The point within the simplex at which a three-element probability vector fi is plotted is located at distances proportional to the probabilities for each of the three outcomes, perpendicularly from the sides of the simplex opposite the respective corners. This barycentric plotting system (a plotting position is at the center of masses corresponding to the probabilities, placed at the simplex vertices) generalizes the reliability diagram because the 1-simplex appropriate to two-element probability forecasts for dichotomous events is the unit interval on the real line, which is the horizontal axis of the reliability diagram.

Figure 1 illustrates the plotting of discretized forecast vectors onto the simplex, which has been rendered as a tessellation of hexagons. Each scalar forecast probability has been rounded to one of the K = 10 values 0/9, 1/9, …, 9/9, yielding I = K(K + 1)/2 = 55 distinct possible vector forecasts fi, each of which is represented by one of the hexagons. This choice for the discretization seems natural for forecasts pertaining to tercile-based outcome definitions, although any regular discretization (e.g., the K = 11 scalar forecast elements 0.0, 0.1, …, 1.0) may be used. The hexagons at the three vertices represent forecasts assigning all probability to the outcome labeled at that corner. Hexagons representing other forecast vectors are located at perpendicular distances from the respective opposite sides, which are indicated by the probability labels in the margins. For example, forecasts of equal probability for the middle category, fN, are located along the same horizontal row of hexagons (perpendicularly upward from the horizontal bottom edge of the simplex), as indicated by the probability labels increasing upward along the left edge of the figure. Probability labels for the above-normal category, fA, increase downward along the right edge of the simplex, and probability labels for the below-normal category, fB, increase to the left along the bottom edge of the simplex. The result, for example, is that the large dot in the center of Fig. 1a locates the climatological forecast vector fclim = [1/3, 1/3, 1/3]T. Similarly, the forecasts in Fig. 1a having the largest above-normal probabilities are [2/9, 2/9, 5/9]T and [1/9, 3/9, 5/9]T, which are represented by the glyphs located farthest to the right in that diagram. Any two of the three forecast vector elements are sufficient to locate that vector's position on the simplex.
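The barycentric plotting rule and the K = 10 discretization just described can be sketched as follows. The vertex layout (below normal at the lower left, above normal at the lower right, near normal at the apex) is an assumption matching the description of Fig. 1, and the function names are hypothetical:

```python
import numpy as np

# Simplex vertices: below normal (B) lower left, above normal (A) lower
# right, near normal (N) at the apex of the equilateral triangle.
V_B = np.array([0.0, 0.0])
V_A = np.array([1.0, 0.0])
V_N = np.array([0.5, np.sqrt(3.0) / 2.0])

def simplex_point(fB, fN, fA):
    """Barycentric plotting position of the probability vector
    [fB, fN, fA]; the elements must sum to 1."""
    assert abs(fB + fN + fA - 1.0) < 1e-9
    return fB * V_B + fN * V_N + fA * V_A

def discrete_forecasts(K=10):
    """Enumerate the I = K(K + 1)/2 forecast vectors whose elements lie
    on the grid 0/(K-1), 1/(K-1), ..., (K-1)/(K-1)."""
    grid = []
    for b in range(K):              # numerator of fB
        for a in range(K - b):      # numerator of fA
            grid.append((b / (K - 1), (K - 1 - b - a) / (K - 1), a / (K - 1)))
    return grid
```

The climatological vector fclim = [⅓, ⅓, ⅓]T maps to the centroid of the triangle, and discrete_forecasts(10) enumerates the I = 55 possible vectors, one per hexagon.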

Fig. 1.

Calibration simplexes illustrating well-calibrated forecasts exhibiting (a) lower and (b) higher sharpness.

Citation: Weather and Forecasting 28, 5; 10.1175/WAF-D-13-00027.1

Other choices for arranging the labels on the simplex edges are possible and equally valid, but those in Fig. 1 have been used in order for the outcomes from below to above normal to increase from left to right. Figure 1 illustrates the graphical representation of refinement distributions Pr{fi} as glyph scatterplots (e.g., Wilks 2011), where the circle areas are proportional to the subsample sizes. Empty hexagons represent forecast vectors that were never used in the verification datasets under consideration. The two refinement distributions represented in Fig. 1 will be defined quantitatively in section 3.

Generalizing Eq. (3) for the calibration distributions, in the case of three-element vector forecasts each of the I calibration distributions is multinomial, again with N = 1:
Pr{oB, oN, oA | fi} = pB,i^oB pN,i^oN pA,i^oA,  oB + oN + oA = 1,  oB, oN, oA ∈ {0, 1}.    (4)
Thus, each of the I calibration distributions is fully determined by any two of the three empirical (within the verification dataset) conditional relative frequencies ōB,i, ōA,i, and ōN,i = 1 − ōB,i − ōA,i of the three events being forecast by fi, and can therefore be represented by a two-dimensional vector defined by, for example, ōB,i and ōA,i within the ith hexagon of the simplex.
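Estimating the multinomial calibration distributions of Eq. (4) amounts to tabulating, for each distinct forecast vector, the relative frequencies of the three outcomes. A minimal sketch (illustrative names; outcomes coded 0, 1, 2 for below, near, and above normal):

```python
import numpy as np

def ternary_calibration(forecasts, outcomes):
    """Empirical multinomial calibration distributions [Eq. (4)]:
    conditional outcome relative frequencies for each distinct forecast
    vector.  `forecasts` is an (n, 3) array of probability vectors;
    `outcomes` holds 0, 1, or 2 for below, near, and above normal."""
    forecasts = np.round(np.asarray(forecasts, dtype=float), 6)
    counts = {}
    for fvec, o in zip(map(tuple, forecasts), outcomes):
        counts.setdefault(fvec, np.zeros(3))[int(o)] += 1.0
    # Normalize each subsample's counts to conditional relative frequencies.
    return {f: c / c.sum() for f, c in counts.items()}
```

Each dictionary entry corresponds to one hexagon of the simplex, with the subsample size given by the total count before normalization.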
Figure 2 illustrates schematically how these conditional relative frequencies for the below- and above-normal outcomes are represented within each hexagon, using the corresponding miscalibration errors, or differences between the conditional average observations ōB,i and ōA,i and the respective forecast vector elements:
eB,i = ōB,i − fB,i  and  eA,i = ōA,i − fA,i.    (5)
These vector conditional miscalibration errors are plotted within their respective hexagons, using barycentric coordinates similar to those in the overall simplex, but with the origins at the centers of the hexagons. For a well-calibrated forecast subsample fi, both elements of Eq. (5) will be zero, and the dot representing the corresponding subsample size will be plotted at the center of the ith hexagon, as illustrated by the gray dot in Fig. 2, and by all subsamples in both panels of Fig. 1. In contrast, the solid dot in Fig. 2 shows the plotting position when the above-normal category has been conditionally underforecast by 0.2 probability units, together with overforecasting of the below- and near-normal categories by 0.1 probability units each. In this case the location for the glyph representing the corresponding element of the refinement distribution will be displaced toward the “A” vertex of the simplex, consistent with the conditional outcome vector having a higher relative frequency than forecast for the above-normal category.
Fig. 2.

Plotting schematic showing locations of refinement distribution glyphs for calibrated forecasts (zero miscalibration errors, gray dashed arrows and point), and miscalibrated forecasts (nonzero calibration errors, black solid arrows and point).


It is necessary to choose a scale for these refinement distribution glyph displacements, which should be indicated in a figure's legend. Here, this scale has been chosen as ⅔ of a probability unit corresponding to a long (vertex to vertex) diameter of the hexagons. With this choice, displacement of a plotting glyph of ⅓ the distance from the hexagon center to one of the corners equates to a miscalibration of 1/9, which corresponds to one discrete probability category when K = 10. Of course, the scale in Fig. 2 could be increased to one or as much as two probability units in order to deemphasize random variations due to small subsample sizes, or to accommodate large conditional miscalibration errors. Similarly, the scale could be decreased in order to better discern subtle miscalibrations exhibited by a generally well-calibrated set of forecasts with large subsample sizes. It may be more convenient when programming plotting software to use the horizontal and vertical Cartesian displacements from the hexagon centers, which are proportional (depending on the chosen scaling) to x = −tan(30°)eB + tan(30°)eA = −0.577eB + 0.577eA, and y = −eB − eA, respectively.
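The Cartesian displacements quoted above can be wrapped in a small helper. This is one plausible implementation under the ⅔-probability-unit scaling convention, not code from the paper:

```python
import math

def glyph_displacement(eB, eA, scale=2.0 / 3.0):
    """Cartesian displacement of a refinement-distribution glyph from
    its hexagon center, using the proportionalities quoted in the text:
    x ~ -tan(30 deg)*eB + tan(30 deg)*eA and y ~ -eB - eA.  `scale` is
    the number of probability units spanning one long hexagon diameter
    (2/3 in the figures here), so the result is expressed in units of
    that diameter.  A sketch, not the paper's plotting code."""
    t = math.tan(math.radians(30.0))
    x = (-t * eB + t * eA) / scale
    y = (-eB - eA) / scale
    return x, y
```

For the Fig. 2 example (eB = −0.1, eA = +0.2), the glyph is displaced to the right and slightly downward, toward the A vertex of the simplex.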

3. Idealized forecast examples

Characteristic patterns of simplex glyph sizes and placements are diagnostic for different aspects of overall forecast performance, analogously to the situation for the reliability diagram (Wilks 2011, p. 335). These are illustrated in this section using simple idealized statistical models for the calibration and refinement distributions, designed to demonstrate results for a range of cases with known expected behaviors.

a. Sharpness: Refinement distributions

Sharpness is a characteristic of the refinement distribution alone, independent of the relationship between the forecasts and observations. Forecast sharpness is commonly characterized using a measure of the dispersion of the refinement distribution, such as the standard deviation or variance (e.g., Wilks 2001). Forecasts that deviate frequently, and by relatively large amounts, from the climatological forecast are referred to as sharp. Forecasts exhibiting poor sharpness deviate rarely, and relatively little, from the climatological forecast.

Although in the present setting the refinement distribution is trivariate, it is sufficient to define it in terms of the joint distribution of two of the three forecast probabilities because they must all sum to 1. Here, joint distributions of the log-odds transformations of fB and fA,
λB = ln[fB/(1 − fB)]  and  λA = ln[fA/(1 − fA)],    (6)
will be modeled using bivariate normal distributions, in order to ensure 0 < fB < 1 and 0 < fA < 1. This is an arbitrary but convenient assumption that allows an easy exposition of the characteristic diagnostics for the diagram. The climatological forecast of fB = fA = ⅓ maps to E[λB] = E[λA] = ln(½) = −0.6931 for the log-odds bivariate normal distributions. In addition, a strong negative correlation of −0.95 yields vanishingly small probabilities for the impossible fB + fA > 1, as well as relatively small deviations of the near-normal forecast fN from its mean value of ⅓. Within this framework, relatively sharp forecasts are generated using var[λB] = var[λA] = 1, and forecasts with low sharpness are generated using var[λB] = var[λA] = 0.05. Some 10^6 independent bivariate normal realizations were generated to construct each of the two refinement distributions.
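The idealized refinement distributions can be simulated directly from this bivariate normal log-odds model [Eq. (6)]. The sketch below makes the same assumptions as the text (mean ln(½), correlation −0.95); the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_forecasts(n, var=1.0, rho=-0.95):
    """Draw n forecast vectors whose log-odds lambda_B, lambda_A
    [Eq. (6)] are bivariate normal with mean ln(1/2) (which maps back
    to the climatological 1/3), common variance var, and correlation rho."""
    mu = np.log(0.5)
    cov = var * np.array([[1.0, rho], [rho, 1.0]])
    lam = rng.multivariate_normal([mu, mu], cov, size=n)
    fB = 1.0 / (1.0 + np.exp(-lam[:, 0]))   # inverse log-odds transform
    fA = 1.0 / (1.0 + np.exp(-lam[:, 1]))
    fN = 1.0 - fB - fA   # strong negative rho keeps fB + fA > 1 vanishingly rare
    return np.column_stack([fB, fN, fA])
```

Using var=1.0 reproduces the sharp case of Fig. 1b, and var=0.05 the low-sharpness case of Fig. 1a, after rounding each element to the K = 10 grid.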

Figure 1 shows these two refinement distributions, plotted as glyph histogram scatterplots on the simplex. In both cases the resulting forecasts are shown as being perfectly calibrated, so all of the circles representing subsample relative frequencies are plotted at the centers of their hexagons. For the low-sharpness forecasts in Fig. 1a, it is clear that the most common forecast is fclim = (⅓, ⅓, ⅓)T, which accounts for 68.3% of the 10^6 realizations. Furthermore, the individual forecast elements deviate no more than 2/9 from fclim, and only very rarely are those deviations larger than 1/9. In contrast, the sharper forecasts in Fig. 1b often deviate quite strongly from fclim (which accounts for only 10.7% of the 10^6 realizations), and take on values throughout the unit interval for the extreme-category probabilities fB and fA.

b. Conditional and unconditional biases: Calibration distributions

Figure 3 shows calibration simplexes illustrating overconfident forecasts (Fig. 3a), underconfident forecasts (Fig. 3b), and forecasts exhibiting an unconditional overforecasting bias for the above-normal category (Fig. 3c). Each of these cases is illustrated using the higher-sharpness refinement distribution shown in Fig. 1b.

Fig. 3.

Calibration simplexes illustrating (a) overconfident [Eq. (7)], (b) underconfident [Eq. (8)], and (c) unconditionally biased [Eq. (9)] forecasts. In each case the refinement distribution is the same as in Fig. 1b.


For the overconfident forecasts shown in Fig. 3a, the parameters of the multinomial calibration distributions [Eq. (4)] are based on those used to define overconfident idealized reliability diagrams in Wilks (2001):
ōX,i = ⅓ + ½(fX,i − ⅓),  X = B, N, A.    (7)
These functional relationships describe conditionally biased forecasts, which exhibit overforecasting for forecast probabilities above the climatological ⅓, and underforecasting for forecast probabilities below ⅓. These were chosen in Wilks (2001) to correspond to the “no skill” line (Murphy and Hsu 1986) on the attributes diagram. If such forecasts had been generated by a dynamical ensemble forecasting system, the overconfidence would be diagnostic for underdispersion (Wilks 2011, p. 372). Figure 3a shows that the signature for overconfidence in the calibration simplex is a displacement of the relative frequency glyphs toward the center of the diagram, analogously to the calibration function in conventional reliability or attributes diagrams being tilted from the ideal 45° diagonal toward the horizontal, climatological, “no resolution” (Murphy and Hsu 1986) line. It can be seen in Fig. 3a that the degree of miscalibration increases as the forecast probabilities become more extreme, so that the glyphs for the most extreme probabilities in the two lower corners of the diagram (fB = 9/9 and fA = 9/9) have been plotted at the hexagon vertices nearest the center of the diagram, indicating conditional miscalibrations of ⅓ of a probability unit.
Figure 3b shows the calibration simplex for conditionally biased forecasts that are underconfident, in the sense that probabilities smaller than ⅓ are overforecast and probabilities larger than ⅓ are underforecast, according to
ōX,i = ⅓ + 2(fX,i − ⅓),  X = B, N, A.    (8)
Here, conditional average observations more extreme than the unit interval are truncated at 0 or 1, and the values for the three outcomes are renormalized to sum to 1 if necessary. The relationships in Eq. (8) also generalize the underconfident calibration function used in Wilks (2001), and would be diagnostic for overdispersion of a dynamical ensemble. Figure 3b shows that the signature for underconfidence in the calibration simplex is the displacement of the relative frequency glyphs away from the climatological forecast at the center of the diagram, and toward the corners of the simplex. Again, this is analogous to the signature for underconfidence in the reliability diagram, in which the calibration function is tilted at an angle steeper than 45°, and away from the climatological horizontal no-resolution line.
Finally, Fig. 3c shows the calibration simplex for unconditionally biased forecasts, exhibiting uniform overforecasting of the above-normal category equally at the expense of both the below- and near-normal categories:
ōB,i = fB,i + δ/2,  ōN,i = fN,i + δ/2,  ōA,i = fA,i − δ,  δ > 0.    (9)
Again, results outside the unit interval are truncated and renormalized if necessary. In Fig. 3c the relative frequency glyphs are displaced upward and to the left, away from the A corner of the simplex. Equivalently, the below- and near-normal categories have been uniformly and equally underforecast, at the expense of the above-normal category, and the result is that the glyphs are displaced toward the simplex edge connecting the B and N corners.
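The three idealized calibration functions of this section can be sketched as follows: the overconfident case places the conditional outcome frequencies on the no-skill line halfway between the forecast probability and the climatological ⅓, the underconfident case uses the corresponding slope-2 line, and the unconditional bias shifts a constant amount δ of probability from the above-normal outcome equally to the other two. The specific value of δ and the function names are illustrative assumptions, with truncation and renormalization applied as described in the text:

```python
import numpy as np

def clip_renorm(p):
    """Truncate to the unit interval and renormalize to sum to 1."""
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0)
    return p / p.sum()

def overconfident(f):
    """No-skill line: conditional outcome frequency halfway between the
    forecast probability f and the climatological 1/3 [cf. Eq. (7)]."""
    return 1.0/3.0 + 0.5 * (f - 1.0/3.0)

def underconfident(f):
    """Slope-2 counterpart of the no-skill line [cf. Eq. (8)]."""
    return 1.0/3.0 + 2.0 * (f - 1.0/3.0)

def biased_above(fB, fN, fA, delta=0.1):
    """Uniform overforecasting of the above-normal category, equally at
    the expense of the other two [cf. Eq. (9)]; delta = 0.1 is an
    assumed illustrative bias, not a value from the paper."""
    return clip_renorm([fB + delta/2.0, fN + delta/2.0, fA - delta])
```

Applying these functions to each forecast vector, and plotting the resulting conditional frequencies, reproduces the characteristic glyph displacement patterns of Figs. 3a–c.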

4. Real forecast examples

This section illustrates the use of the calibration simplex to understand the performance of the Climate Prediction Center (CPC) “extended range” forecasts for average temperature and accumulated precipitation, at lead times of 6–10 and 8–14 days. These are subjective probability forecasts generated on weekdays, during 2001–12 for the 6–10-day forecasts, and 2004–12 for the 8–14-day forecasts, and interpolated to station locations from the graphical map products posted operationally online (http://www.cpc.ncep.noaa.gov/). The text-format versions and general information on these forecasts are also available online (http://www.cpc.ncep.noaa.gov/products/archives/short_range/ and http://www.cpc.ncep.noaa.gov/products/archives/short_range/README.6-10day.txt, respectively).

Figure 4 shows the calibration simplexes for the temperature forecasts, which include n = 413 773 six-to-ten-day forecasts (Fig. 4a) and n = 313 523 eight-to-fourteen-day forecasts (Fig. 4b). Glyphs are plotted only for forecasts having subsample sizes of 20 or more, and verifying categories have been determined relative to the 1971–2000 normals for forecasts made through April 2011, and using the 1981–2010 normals thereafter.

Fig. 4.

Calibration simplexes for (a) 6–10- and (b) 8–14-day temperature forecasts.


At each of the lead times, the overwhelming majority of forecasts include the climatological near-normal probability fN = ⅓, consistent with the conventional expectation that forecasts of the near-normal category will exhibit intrinsically weak skill (Van den Dool and Toth 1991). Thus, the sharpness for the near-normal probabilities is quite low. The most frequent forecast for each of the lead times is the climatological probability vector fclim, which was issued for 34.9% of the 6–10-day forecasts and 35.4% of the 8–14-day forecasts. These percentages correspond to nclim = 144 211 and 111 010 for the climatological forecast fclim (largest dots in the middles of the plots) in Figs. 4a and 4b, respectively (cf. glyph sizes in the legend). The corresponding error vectors [eB, eN, eA]T are [−0.085, 0.056, 0.029]T (Fig. 4a) and [−0.088, 0.036, 0.052]T (Fig. 4b). Accordingly, both of these glyphs have been displaced away from the B vertices, indicating too few below-normal verifications when the climatological temperature forecast was issued. The more extreme probability vectors were used correspondingly less frequently at the 8–14-day lead time, but overall the 6–10-day forecasts are only slightly sharper.

For both lead times, the temperature forecasts are only moderately well calibrated, with typical miscalibration errors in the range of 1/9–2/9. The above-normal outcome is underforecast for the larger (≥4/9) above-normal and near-normal probabilities, as these subsample-size glyphs are displaced toward the A corner of the simplex. The glyph representing the very large subsample of fclim forecasts is displaced toward the edge connecting the N and A vertices, indicating a smaller fraction of below-normal outcomes than the climatologically expected ⅓. Both of these results are consistent with the baseline 30-yr normals lagging the quasi-linear warming trend that has been evident in U.S. temperature data since the mid-1970s (e.g., Livezey et al. 2007; Wilks 2013), together with the mean forecasts not fully tracking the warming, as has also been observed for seasonal tercile forecasts (Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010). On the other hand, displacement of glyphs away from the simplex center for below-normal probabilities of 4/9 and larger indicates an overall underconfidence (consistent with the pattern of glyph dispersion away from the center of Fig. 3b), suggesting that these forecasts could be somewhat sharper without degrading their skill.

Calibration simplexes for the precipitation forecasts are shown in Fig. 5, which include n = 406 856 six-to-ten-day forecasts (Fig. 5a) and n = 306 594 eight-to-fourteen-day forecasts (Fig. 5b). The precipitation forecasts exhibit notably less sharpness than their temperature counterparts in Fig. 4, with 44.9% of the 6–10-day forecasts and 44.7% of the 8–14-day forecasts using the climatological probabilities fclim. As was the case for the temperature forecasts, fN = ⅓ is overwhelmingly the most common near-normal forecast. The most extreme probabilities have been used somewhat less frequently in the 8–14-day precipitation forecasts, but overall these are only slightly less sharp than the 6–10-day forecasts. At both lead times there is a clear tendency for the subsample-size glyphs to be displaced toward the B vertex, indicating underforecasting of the below-normal category, and corresponding overforecasting in roughly equal proportions of the near-normal and above-normal outcomes. Figure 5 also indicates strong overconfidence in the larger (≥4/9) probabilities for the near-normal category, for both lead times.

Fig. 5.

As in Fig. 4, but for precipitation forecasts.

Citation: Weather and Forecasting 28, 5; 10.1175/WAF-D-13-00027.1

5. Summary and conclusions

Gaining a full appreciation of the performance of a set of forecasts requires investigation of the joint distribution of the forecasts and corresponding observations (Murphy and Winkler 1987), and a graphical approach to this exposition will often be the most immediately informative. The well-known reliability diagram for probability forecasts of dichotomous events is the most commonly encountered such graphical device. This paper has defined and illustrated a natural extension of the reliability diagram to probability forecasts for three disjoint events, called the calibration simplex. It displays the refinement distribution for such forecasts using a glyph scatterplot histogram of the K(K + 1)/2 possible vector forecasts that result when each probability element has been rounded to one of K discrete values. Simultaneously, it shows the two empirical outcome relative frequencies conditional on each forecast vector that are necessary to fully characterize the corresponding calibration distributions, using displacements of the glyphs from central plotting locations.
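The discretized forecast vectors that the glyph histogram tallies can be enumerated directly. This short sketch, not from the paper and with an illustrative function name, confirms the K(K + 1)/2 count for probabilities expressed in ninths (K = 10, giving 55 possible vectors):

```python
def simplex_grid(K):
    """All 3-element probability vectors whose components take one of K
    discrete values 0, 1/(K-1), ..., 1 and sum to 1. Looping b over its K
    possible values and n over the remaining probability mass gives
    K + (K-1) + ... + 1 = K(K+1)/2 vectors."""
    d = K - 1
    return [(b / d, n / d, (d - b - n) / d)
            for b in range(d + 1)
            for n in range(d + 1 - b)]

grid = simplex_grid(10)           # probabilities in ninths
assert len(grid) == 10 * 11 // 2  # 55 possible forecast vectors
assert (1/3, 1/3, 1/3) in grid    # the climatological forecast lies on the grid
```

Each element of this grid corresponds to one potential glyph position on the 2-simplex.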

Two somewhat similar approaches to this same problem have been proposed previously, although apparently neither has been used subsequently. Epstein and Murphy (1965) suggested that the information in the calibration distributions (although not using this terminology) could be sketched by printing numerical values of the scalar lengths of the calibration error vectors [Eq. (5)] inside each of a tessellation of equilateral triangles within the 2-simplex. However, this approach neither expressed the vector nature of the miscalibration errors, nor did it communicate the refinement distribution in any way. Murphy and Daan (1985, p. 418) proposed to draw line segments on the 2-simplex connecting points locating each distinct forecast vector fi with its corresponding average outcome vector, and indicating the corresponding subsample sizes with numerals printed at the midpoints of these line segments. This approach does indeed express the full information content of the joint distribution, but overall the diagram is somewhat difficult to read. Recently, this latter idea has been independently rediscovered by Jupp et al. (2012), who call it the ternary reliability diagram. They also propose a color scheme for representing the three-element probability vectors.

In principle, the calibration simplex could be extended to settings having more than three predictand categories, although this would be difficult or impossible to realize in practice because of the graphical and cognitive limitations of rendering the higher-dimensional geometries visually. For example, the 3-simplex, which would be appropriate for probability forecasts of four distinct events, is a three-dimensional regular tetrahedron, within the volume of which the spherical glyphs corresponding to possible four-element forecast vectors would reside. It is difficult to imagine how such a geometrical object could be presented graphically in an understandable way.

Use of the calibration simplex has been illustrated using the CPC 6–10- and 8–14-day subjective temperature and precipitation forecasts. These temperature forecasts exhibit an unconditional bias consistent with the average forecasts lagging the ongoing climate warming, as has also been observed for seasonal forecasts, but these graphs also indicate that greater sharpness for the below- and above-normal elements of the forecast vectors could be employed without degrading the overall accuracy or skill. The near-normal elements of the precipitation forecast vectors were seen to be strongly overconfident, with overall underforecasting of the below-normal category and corresponding overforecasting, in roughly equal proportions, of the near-normal and above-normal outcomes.

Acknowledgments

I thank Scott Handel for supplying the CPC forecasts and corresponding verifications. This research was supported by the National Science Foundation under Grant AGS-1112200.

REFERENCES

  • Barnston, A. G., Li S., Mason S. J., DeWitt D. G., Goddard L., and Gong X., 2010: Verification of the first 11 years of IRI's seasonal climate forecasts. J. Appl. Meteor. Climatol., 49, 493–520.

  • Epstein, E. S., and Murphy A. H., 1965: A note on the attributes of probabilistic predictions and the probability score. J. Appl. Meteor., 4, 297–299.

  • Jolliffe, I. T., and Stephenson D. B., 2012: Introduction. Forecast Verification: A Practitioner's Guide in Atmospheric Sciences, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley-Blackwell, 1–9.

  • Jupp, T. E., Lowe R., Coelho C. A. S., and Stephenson D. B., 2012: On the visualization, verification and recalibration of ternary probabilistic forecasts. Philos. Trans. Roy. Soc. London, A370, 1100–1120.

  • Katz, R. W., and Murphy A. H., Eds., 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.

  • Livezey, R. E., and Timofeyeva M. M., 2008: The first decade of long-lead U.S. seasonal forecasts—Insights from a skill analysis. Bull. Amer. Meteor. Soc., 89, 843–854.

  • Livezey, R. E., Vinnikov K. Y., Timofeyeva M. M., Tinker R., and Van den Dool H. M., 2007: Estimation and extrapolation of climate normals and climatic trends. J. Appl. Meteor. Climatol., 46, 1759–1776.

  • Murphy, A. H., 1972: Scalar and vector partitions of the probability score: Part II. N-state situation. J. Appl. Meteor., 11, 1183–1192.

  • Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.

  • Murphy, A. H., 1997: Forecast verification. Economic Value of Weather and Climate Forecasts, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

  • Murphy, A. H., and Winkler R. L., 1977: Reliability of subjective probability forecasts of precipitation and temperature. Appl. Stat., 26, 41–47.

  • Murphy, A. H., and Daan H., 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, A. H. Murphy and R. W. Katz, Eds., Westview, 379–437.

  • Murphy, A. H., and Hsu W.-R., 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293.

  • Murphy, A. H., and Winkler R. L., 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

  • Murphy, A. H., and Winkler R. L., 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, 435–455.

  • Murphy, A. H., Brown B. G., and Chen Y.-S., 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501.

  • Van den Dool, H., 2007: Empirical Methods in Short-Term Climate Prediction. Oxford University Press, 215 pp.

  • Van den Dool, H., and Toth Z., 1991: Why do forecasts for “near normal” often fail? Wea. Forecasting, 6, 76–85.

  • Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. J. Climate, 13, 2389–2403.

  • Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8, 209–219.

  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Academic Press, 676 pp.

  • Wilks, D. S., 2013: Projecting “normals” in a nonstationary climate. J. Appl. Meteor. Climatol., 52, 289–302.

  • Wilks, D. S., and Godfrey C. M., 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. J. Climate, 15, 1369–1377.