## 1. Introduction

Forecast verification is the process of systematically and quantitatively describing the relationships between forecasts and the events they are meant to predict. It is an essential process both for optimal use of forecasts in decision making (Katz and Murphy 1997), and as a part of the process of improving the forecasts themselves (e.g., Murphy 1997; Wilks 2011; Jolliffe and Stephenson 2012).

Most often forecast quality is characterized using scalar (i.e., single number) verification statistics such as mean squared error, which has been referred to as the measures-oriented approach to verification (Murphy 1997). Although restricting attention to one or a small number of scalar statistics simplifies the verification process both computationally and conceptually, this practice inevitably masks some aspects of forecast performance even in the most restricted verification settings. The problem arises because the dimensionality (Murphy 1991) of the joint distribution of forecasts and observations (Murphy and Winkler 1987) is large, especially in settings involving probability forecasts. Distributions-oriented (Murphy 1997), or diagnostic (Murphy et al. 1989; Murphy and Winkler 1992), verification methods communicate the richness of verification datasets by respecting their dimensionality, but at the same time must overcome the problem of how best to portray the high-dimensional information.

Arguably the most effective approaches to communicating the full information content of a joint distribution of forecasts and observations have involved the use of well-designed graphics, although only a small number of these have yet been devised. Murphy et al. (1989) used bivariate histograms to convey the joint distribution of nonprobabilistic daily temperature forecasts and their corresponding observations, and also introduced the conditional quantile plot to portray the same information in terms of both the calibration refinement, and the likelihood-base rate factorizations (Murphy and Winkler 1987) of the joint distribution. By far the most commonly used diagnostic verification graphic is the reliability diagram (Murphy and Winkler 1977; Wilks 2011). Reliability diagrams show the calibration-refinement factorization of the joint distribution of probability forecasts for dichotomous (yes–no) events, by plotting the subsample event relative frequency as a function of forecast probability (the calibration), and including also a plot of the frequency of use of each of the possible scalar probability forecasts (the refinement).
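Computationally, the two factors a reliability diagram displays are simple conditional and marginal relative frequencies. The following minimal sketch (function and variable names are illustrative, not from the paper) tallies both from a list of discretized probability forecasts and binary outcomes:

```python
from collections import Counter

def reliability_quantities(forecasts, outcomes):
    """Empirical calibration and refinement distributions for probability
    forecasts of a dichotomous event, with forecasts already rounded to a
    discrete set of values (e.g., tenths)."""
    counts = Counter(forecasts)                                # subsample sizes
    events = Counter(f for f, o in zip(forecasts, outcomes) if o == 1)
    n = len(forecasts)
    refinement = {f: counts[f] / n for f in counts}            # Pr{f_i}
    calibration = {f: events[f] / counts[f] for f in counts}   # Pr{o = 1 | f_i}
    return calibration, refinement

# Toy data: four forecasts of 0.1 (one event observed), two of 0.9 (both events)
cal, ref = reliability_quantities([0.1, 0.1, 0.1, 0.1, 0.9, 0.9],
                                  [0, 1, 0, 0, 1, 1])
```

Plotting `calibration` against the forecast values gives the calibration function, and `refinement` supplies the frequency-of-use inset.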

Probability forecasts are most frequently issued for dichotomous events, so that the reliability diagram, or the closely allied attributes diagram (Murphy and Hsu 1986), is appropriate and effective for graphically displaying the relevant joint distribution. However, probability forecasts may also be issued jointly for more than two predictand categories, notably including three-category temperature or precipitation forecasts at lead times of a week and longer (e.g., Van den Dool 2007; Livezey and Timofeyeva 2008; Barnston et al. 2010). In this format the predictand categories are often defined in terms of the climatological terciles, so that the below-normal (cold or dry), near-normal, and above-normal (warm or wet) categories each have climatological occurrence probabilities of ⅓. Diagnostic verification of such forecasts has been approached by reducing the three-element probability forecast vectors to collections of dichotomous probabilities (e.g., Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010), but that approach is not fully satisfying because it neglects the relationships among probabilities assigned to the different events (Murphy and Hsu 1986).

This paper proposes an extension of the well-known reliability diagram, to verification of probability forecast vectors pertaining to three distinct outcome categories using a two-dimensional graphic called the calibration simplex. The calibration simplex graphically represents the calibration-refinement factorization of the full joint distribution of these forecasts and their corresponding observations. Section 2 defines the structure of the calibration simplex in terms of the joint distribution of the forecasts and observations, and in relation to the conventional reliability diagram. Characteristic calibration simplex forms that are diagnostic for important aspects of forecast performance are illustrated in section 3, using artificial data. Section 4 shows example results using 6–10- and 8–14-day forecasts for U.S. temperature and precipitation, and section 5 concludes the paper.

## 2. Structure of the calibration simplex in relation to the reliability diagram

In general there are *I* distinct forecasts *f*_{i} possible, each of which pertains to one of *J* possible observations or outcomes *o*_{j}. In the case of forecasts for dichotomous outcomes, *J* = 2. For the purposes of constructing a reliability diagram, the forecast probabilities are rounded to a discrete set of *I* values if they have not already been issued as such. For example, *I* = 11 for probability forecasts rounded to tenths, in which case the forecasts range from *f*_{1} = 0.0 through *f*_{11} = 1.0.

The calibration–refinement factorization of the joint distribution consists of *I* conditional distributions Pr{*o*_{j} | *f*_{i}} of the observations given each of the possible forecasts *f*_{i}, called the calibration distributions, and a single *I*-element frequency distribution Pr{*f*_{i}} specifying the frequencies of use of the possible forecasts, called the refinement distribution. In the case of the reliability diagram, the calibration distributions are Bernoulli (i.e., binomial with *N* = 1) distributions, defined by the probability distribution function Pr{*o* | *f*_{i}} = *p*_{i}^{x}(1 − *p*_{i})^{1−x}, where *x* = 1 if the event occurs and *x* = 0 otherwise. The Bernoulli parameter *p*_{i} is estimated by its empirical (in the verification dataset) conditional relative frequency, given the forecast *f*_{i}. Each of the *I* calibration distributions is thus fully characterized by its estimated Bernoulli probability.

Probability vectors **f**_{i} = [*f*_{B,i}, *f*_{N,i}, *f*_{A,i}]^{T} pertaining to three mutually exclusive and collectively exhaustive outcomes (e.g., below, near, and above normal) can be plotted in two dimensions because the three forecast probabilities in each vector must sum to 1. The geometrically appropriate graph in this case is the regular 2-simplex (Epstein and Murphy 1965; Murphy 1972), which takes the shape of an equilateral triangle. Each of the corners of the 2-simplex corresponds to forecast certainty (i.e., 100% probability) for one of the three outcomes being forecast. The point within the simplex at which a three-element probability vector **f**_{i} is plotted is located at distances proportional to the probabilities for each of the three outcomes, perpendicularly from the sides of the simplex opposite the respective corners. This barycentric plotting system (a plotting position is at the center of masses corresponding to the probabilities, placed at the simplex vertices) generalizes the reliability diagram, because the 1-simplex appropriate to two-element probability forecasts for dichotomous events is the unit interval on the real line, which is the horizontal axis of the reliability diagram.
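This barycentric placement can be sketched as a probability-weighted average of the triangle's vertex coordinates. The vertex layout below is an illustrative choice (the paper does not prescribe one), matching the left-to-right below/near/above ordering of Fig. 1:

```python
import math

# Illustrative vertex layout (an assumption): below normal (B) at lower left,
# above normal (A) at lower right, near normal (N) at the top.
VERTICES = {"B": (0.0, 0.0), "A": (1.0, 0.0), "N": (0.5, math.sqrt(3) / 2)}

def simplex_point(f_B, f_N, f_A):
    """Barycentric plotting position: the center of mass of the three
    probabilities placed at the simplex vertices."""
    assert abs(f_B + f_N + f_A - 1.0) < 1e-9, "probabilities must sum to 1"
    x = f_B * VERTICES["B"][0] + f_N * VERTICES["N"][0] + f_A * VERTICES["A"][0]
    y = f_B * VERTICES["B"][1] + f_N * VERTICES["N"][1] + f_A * VERTICES["A"][1]
    return x, y

# The climatological forecast [1/3, 1/3, 1/3] plots at the triangle's centroid.
x, y = simplex_point(1 / 3, 1 / 3, 1 / 3)
```

A certainty forecast such as (1, 0, 0) lands exactly on the corresponding vertex, reproducing the corner interpretation described above.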

Figure 1 illustrates the plotting of discretized forecast vectors onto the simplex, which has been rendered as a tessellation of hexagons. Each scalar forecast probability has been rounded to one of the *K* = 10 values 0/9, 1/9, …, 9/9, yielding *I* = *K*(*K* + 1)/2 = 55 distinct possible vector forecasts **f**_{i}, each of which is represented by one of the hexagons. This choice for the discretization seems natural for forecasts pertaining to tercile-based outcome definitions, although any regular discretization (e.g., the *K* = 11 scalar forecast elements 0.0, 0.1, …, 1.0) may be used. The hexagons at the three vertices represent forecasts assigning all probability to the outcome labeled at that corner. Hexagons representing other forecast vectors are located at perpendicular distances from the respective opposite sides, which are indicated by the probability labels in the margins. For example, forecasts of equal probability for the middle category, *f*_{N}, are located along the same horizontal row of hexagons (perpendicularly upward from the horizontal bottom edge of the simplex), as indicated by the probability labels increasing upward along the left edge of the figure. Probability labels for the above-normal category, *f*_{A}, increase downward along the right edge of the simplex, and probability labels for the below-normal category, *f*_{B}, increase to the left along the bottom edge of the simplex. The result, for example, is that the large dot in the center of Fig. 1a locates the climatological forecast vector **f**_{clim} = [1/3, 1/3, 1/3]^{T}. Similarly, the forecasts in Fig. 1a having the largest above-normal probabilities are [2/9, 2/9, 5/9]^{T} and [1/9, 3/9, 5/9]^{T}, which are represented by the glyphs located farthest to the right in that diagram. Any two of the three forecast vector elements are sufficient to locate that vector's position on the simplex.
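The count *I* = *K*(*K* + 1)/2 = 55 of distinct discretized forecast vectors can be checked by direct enumeration, as in this small sketch:

```python
from itertools import product

K = 10  # each probability element takes one of the K values 0/9, 1/9, ..., 9/9

# Enumerate all (f_B, f_N, f_A), expressed in ninths, that sum to 9/9 = 1
vectors = [(b, n, a)
           for b, n, a in product(range(K), repeat=3)
           if b + n + a == K - 1]

print(len(vectors))  # 55 distinct possible forecast vectors
```

Each tuple corresponds to one hexagon of the tessellation; the vertex forecasts are (9, 0, 0), (0, 9, 0), and (0, 0, 9).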

Other choices for arranging the labels on the simplex edges are possible and equally valid, but those in Fig. 1 have been used in order for the outcomes from below to above normal to increase from left to right. Figure 1 illustrates the graphical representation of refinement distributions Pr{**f**_{i}} as glyph scatterplots (e.g., Wilks 2011), where the circle areas are proportional to the subsample sizes. Empty hexagons represent forecast vectors that were never used in the verification datasets under consideration. The two refinement distributions represented in Fig. 1 will be defined quantitatively in section 3.

In the case of the calibration simplex, each of the *I* calibration distributions is multinomial, again with *N* = 1, so that each is specified by the three conditional outcome probabilities, which must sum to 1. Each of the *I* calibration distributions is therefore fully determined by any two of the three empirical (within the verification dataset) conditional relative frequencies of the outcomes given **f**_{i}, and therefore can be represented by a two-dimensional miscalibration error vector defined by, for example,

(*e*_{B,i}, *e*_{A,i}) = (Pr{*o*_{B} | **f**_{i}} − *f*_{B,i}, Pr{*o*_{A} | **f**_{i}} − *f*_{A,i}),   (5)

where the conditional probabilities are estimated by the corresponding subsample relative frequencies. This error vector determines the displacement of the plotted glyph from the center of the *i*th hexagon of the simplex.

For well-calibrated forecasts **f**_{i}, both elements of Eq. (5) will be zero, and the dot representing the corresponding subsample size will be plotted at the center of the *i*th hexagon, as illustrated by the gray dot in Fig. 2, and by all subsamples in both panels of Fig. 1. In contrast, the solid dot in Fig. 2 shows the plotting position when the above-normal category has been conditionally underforecast by 0.2 probability units, together with overforecasting of the below- and near-normal categories by 0.1 probability units each. In this case the location for the glyph representing the corresponding element of the refinement distribution will be displaced toward the "A" vertex of the simplex, consistent with the conditional outcome vector having a higher relative frequency than forecast for the above-normal category.
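Estimating these displacements from a verification dataset amounts to tabulating, for each distinct forecast vector, the outcome relative frequencies in its subsample and subtracting the forecast probabilities. A minimal sketch (names and data layout are illustrative):

```python
from collections import defaultdict

def calibration_errors(forecasts, outcomes):
    """Miscalibration error vector (e_B, e_A) for each distinct forecast
    vector: conditional outcome relative frequency minus forecast
    probability, estimated from the verification subsamples."""
    subsamples = defaultdict(list)
    for f, o in zip(forecasts, outcomes):     # o is one of "B", "N", "A"
        subsamples[f].append(o)
    errors = {}
    for f, obs in subsamples.items():
        n = len(obs)
        e_B = obs.count("B") / n - f[0]       # f = (f_B, f_N, f_A)
        e_A = obs.count("A") / n - f[2]
        errors[f] = (e_B, e_A)
    return errors

# A forecast of (0.2, 0.3, 0.5), issued five times, verifying "A" four times:
errs = calibration_errors([(0.2, 0.3, 0.5)] * 5, ["A", "A", "B", "A", "A"])
```

Here the above-normal category verifies more often than forecast, so its glyph would be displaced toward the A vertex.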

Fig. 2. Plotting schematic showing locations of refinement distribution glyphs for calibrated forecasts (zero miscalibration errors, gray dashed arrows and point), and miscalibrated forecasts (nonzero calibration errors, black solid arrows and point).

Citation: Weather and Forecasting 28, 5; 10.1175/WAF-D-13-00027.1


It is necessary to choose a scale for these refinement distribution glyph displacements, which should be indicated in a figure's legend. Here, this scale has been chosen as ⅔ of a probability unit corresponding to a long (vertex to vertex) diameter of the hexagons. With this choice, displacement of a plotting glyph ⅓ of the distance from the hexagon center to one of its corners equates to a miscalibration of 1/9, which corresponds to one discrete probability category when *K* = 10. Of course the scale in Fig. 2 could be increased to one or as much as two probability units in order to deemphasize random variations due to small subsample sizes, or to accommodate large conditional miscalibration errors. Similarly, the scale could be decreased in order to better discern subtle miscalibrations exhibited by a generally well-calibrated set of forecasts with large subsample sizes. It may be more convenient when programming plotting software to use the horizontal and vertical Cartesian displacements from the hexagon centers, which are proportional (depending on the chosen scaling) to *x* = −tan(30°)*e*_{B} + tan(30°)*e*_{A} = −0.577*e*_{B} + 0.577*e*_{A} and *y* = −*e*_{B} − *e*_{A}, respectively.
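Those Cartesian displacement formulas translate directly into code; the `scale` argument below stands for the user-chosen probability-units-to-plot-units factor discussed above:

```python
import math

def glyph_displacement(e_B, e_A, scale=1.0):
    """Cartesian offset of a glyph from its hexagon center, computed from
    the miscalibration errors e_B and e_A."""
    x = scale * math.tan(math.radians(30)) * (e_A - e_B)  # = -0.577*e_B + 0.577*e_A
    y = scale * (-e_B - e_A)
    return x, y

# The Fig. 2 example: above normal underforecast by 0.2 (e_A = +0.2), below
# and near normal each overforecast by 0.1 (e_B = -0.1)
dx, dy = glyph_displacement(-0.1, 0.2)
```

For the Fig. 2 case the offset points rightward and slightly downward, i.e., toward the A vertex, as the text describes.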

## 3. Idealized forecast examples

Characteristic patterns of simplex glyph sizes and placements are diagnostic for different aspects of overall forecast performance, analogously to the situation for the reliability diagram (Wilks 2011, p. 335). These are illustrated in this section using simple idealized statistical models for the calibration and refinement distributions, designed to demonstrate results for a range of cases with known expected behaviors.

### a. Sharpness: Refinement distributions

Sharpness is a characteristic of the refinement distribution, independently of the relationship between the forecasts and observations. Forecast sharpness is commonly characterized using a measure of the dispersion of the refinement distribution, such as the standard deviation or variance (e.g., Wilks 2001). Forecasts that deviate frequently, and by relatively large amounts, from the climatological forecast are referred to as sharp. Forecasts exhibiting poor sharpness deviate rarely, and relatively little, from the climatological forecast.

The idealized forecasts used here are generated by simulating the two extreme-category probabilities, *f*_{B} and *f*_{A}, from a bivariate normal distribution truncated so that 0 < *f*_{B} < 1 and 0 < *f*_{A} < 1. This is an arbitrary but convenient assumption that allows an easy exposition of the characteristic diagnostics for the diagram. The climatological forecast of *f*_{B} = *f*_{A} = 1/3 maps to E[*f*_{B}] = E[*f*_{A}] = 1/3, with the truncation also excluding probability pairs for which *f*_{B} + *f*_{A} > 1, and the construction yielding relatively small deviations of the near-normal forecast *f*_{N} from its mean value of ⅓. Within this framework, relatively sharp forecasts are generated using a larger variance for the simulated probabilities, and low-sharpness forecasts using a smaller variance. In each case, 10^{6} independent bivariate normal realizations were generated to construct each of the two refinement distributions.

Figure 1 shows these two refinement distributions, plotted as glyph histogram scatterplots on the simplex. In both cases the resulting forecasts are shown as being perfectly calibrated, so all of the circles representing subsample relative frequencies are plotted at the centers of their hexagons. For the low-sharpness forecasts in Fig. 1a, it is clear that the most common forecast is **f**_{clim} = (⅓, ⅓, ⅓)^{T}, which accounts for 68.3% of the 10^{6} realizations. Furthermore, the individual forecast elements deviate no more than 2/9 from **f**_{clim}, and only very rarely are those deviations larger than 1/9. In contrast, the sharper forecasts in Fig. 1b often deviate quite strongly from **f**_{clim} (which accounts for only 10.7% of the 10^{6} realizations), and take on values throughout the unit interval for the extreme-category probabilities *f*_{B} and *f*_{A}.

### b. Conditional and unconditional biases: Calibration distributions

Figure 3 shows calibration simplexes illustrating overconfident forecasts (Fig. 3a), underconfident forecasts (Fig. 3b), and forecasts exhibiting an unconditional overforecasting bias for the above-normal category (Fig. 3c). Each of these cases is illustrated using the higher-sharpness refinement distribution shown in Fig. 1b.

Fig. 3. Calibration simplexes illustrating (a) overconfident [Eq. (7)], (b) underconfident [Eq. (8)], and (c) unconditionally biased [Eq. (9)] forecasts. In each case the refinement distribution is the same as in Fig. 1b.


In Fig. 3a, the glyphs for the most extreme forecasts (*f*_{B} = 9/9 and *f*_{A} = 9/9) have been plotted at the hexagon vertices nearest the center of the diagram, indicating conditional miscalibrations of ⅓ of a probability unit.

## 4. Real forecast examples

This section illustrates the use of the calibration simplex to understand the performance of the Climate Prediction Center (CPC) “extended range” forecasts for average temperature and accumulated precipitation, at lead times of 6–10 and 8–14 days. These are subjective probability forecasts generated on weekdays, during 2001–12 for the 6–10-day forecasts, and 2004–12 for the 8–14-day forecasts, and interpolated to station locations from the graphical map products posted operationally online (http://www.cpc.ncep.noaa.gov/). The text-format versions and general information on these forecasts are also available online (http://www.cpc.ncep.noaa.gov/products/archives/short_range/ and http://www.cpc.ncep.noaa.gov/products/archives/short_range/README.6-10day.txt, respectively).

Figure 4 shows the calibration simplexes for the temperature forecasts, which include *n* = 413 773 six-to-ten-day forecasts (Fig. 4a) and *n* = 313 523 eight-to-fourteen-day forecasts (Fig. 4b). Glyphs are plotted only for forecasts having subsample sizes of 20 or more, and verifying categories have been determined relative to the 1971–2000 normals for forecasts made through April 2011, and using the 1981–2010 normals thereafter.

Fig. 4. Calibration simplexes for (a) 6–10- and (b) 8–14-day temperature forecasts.


At each of the lead times, the overwhelming majority of forecasts include the climatological, or near-climatological, near-normal forecast *f*_{N} = ⅓, consistent with the conventional expectation that forecasts of the near-normal category will exhibit intrinsically weak skill (Van den Dool and Toth 1991). Thus, the sharpness for the near-normal probabilities is quite low. The most frequent forecast for each of the lead times is the climatological probability vector **f**_{clim}, which was issued for 34.9% of the 6–10-day forecasts and 35.4% of the 8–14-day forecasts. These percentages correspond to *n*_{clim} = 144 211 and 111 010 for the climatological forecast **f**_{clim} (largest dots in the middles of the plots) in Figs. 4a and 4b, respectively (cf. glyph sizes in the legend). The corresponding error vectors [*e*_{B}, *e*_{N}, *e*_{A}]^{T} are [−0.085, 0.056, 0.029]^{T} (Fig. 4a) and [−0.088, 0.036, 0.052]^{T} (Fig. 4b). Accordingly, both of these glyphs have been displaced away from the B vertices, indicating too few below-normal verifications when the climatological temperature forecast was issued. The more extreme probability vectors were used correspondingly less frequently at the 8–14-day lead time, but overall the 6–10-day forecasts are only slightly sharper.

For both lead times, the temperature forecasts are only moderately well calibrated, with typical miscalibration errors in the range of 1/9–2/9. The above-normal outcome is underforecast for the larger (≥4/9) above-normal and near-normal probabilities, as these subsample-size glyphs are displaced toward the A corner of the simplex. The glyph representing the very large subsample of **f**_{clim} forecasts is displaced toward the edge connecting the N and A vertices, indicating a smaller fraction of below-normal outcomes than the climatologically expected ⅓. Both of these results are consistent with the baseline 30-yr normals lagging the quasi-linear warming trend that has been evident in U.S. temperature data since the mid-1970s (e.g., Livezey et al. 2007; Wilks 2013), together with the mean forecasts not fully tracking the warming, as has also been observed for seasonal tercile forecasts (Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010). On the other hand, displacement of glyphs away from the simplex center for below-normal probabilities of 4/9 and larger indicates an overall underconfidence (consistent with the pattern of glyph dispersion away from the center of Fig. 3b), suggesting that these forecasts could be somewhat sharper without degrading their skill.

Calibration simplexes for the precipitation forecasts are shown in Fig. 5, which include *n* = 406 856 six-to-ten-day forecasts (Fig. 5a) and *n* = 306 594 eight-to-fourteen-day forecasts (Fig. 5b). The precipitation forecasts exhibit notably less sharpness than their temperature counterparts in Fig. 4, with 44.9% of the 6–10-day forecasts and 44.7% of the 8–14-day forecasts using the climatological probabilities **f**_{clim}. As was the case for the temperature forecasts, *f*_{N} = ⅓ is overwhelmingly the most common near-normal forecast. The most extreme probabilities have been used somewhat less frequently in the 8–14-day precipitation forecasts, but overall these are only slightly less sharp than the 6–10-day forecasts. At both lead times there is a clear tendency for the subsample-size glyphs to be displaced toward the B vertex, indicating underforecasting of the below-normal category, and corresponding overforecasting in roughly equal proportions of the near-normal and above-normal outcomes. Figure 5 also indicates strong overconfidence in the larger (≥4/9) probabilities for the near-normal category, for both lead times.

Fig. 5. As in Fig. 4, but for precipitation forecasts.


## 5. Summary and conclusions

Gaining a full appreciation of the performance of a set of forecasts requires investigation of the joint distribution of the forecasts and corresponding observations (Murphy and Winkler 1987), and a graphical approach to this exposition will often be the most immediately informative. The well-known reliability diagram for probability forecasts of dichotomous events is the most commonly encountered such graphical device. This paper has defined and illustrated a natural extension of the reliability diagram to probability forecasts for three disjoint events, called the calibration simplex. It displays the refinement distribution for such forecasts using a glyph scatterplot histogram of the *K*(*K* + 1)/2 possible vector forecasts that result when each probability element has been rounded to one of *K* discrete values. Simultaneously, it shows the two empirical outcome relative frequencies conditional on each forecast vector that are necessary to fully characterize the corresponding calibration distributions, using displacements of the glyphs from central plotting locations.

Two somewhat similar approaches to this same problem have been proposed previously, although apparently neither has been used subsequently. Epstein and Murphy (1965) suggested that the information in the calibration distributions (although not using this terminology) could be sketched by printing numerical values of the scalar lengths of the calibration error vectors [Eq. (5)] inside each of a tessellation of equilateral triangles within the 2-simplex. However, this approach neither expressed the vector nature of the miscalibration errors, nor did it communicate the refinement distribution in any way. Murphy and Daan (1985, p. 418) proposed to draw line segments on the 2-simplex connecting points locating each distinct forecast vector **f**_{i} with its corresponding average outcome vector.

In principle, the calibration simplex could be extended to settings having more than three predictand categories, although this would be difficult or impossible to realize in practice because of graphical and cognitive limitations with respect to visually rendering the high-dimensional geometries. For example, the 3-simplex, which would be appropriate for probability forecasts of four distinct events, is a three-dimensional regular tetrahedron, within the volume of which the spherical glyphs corresponding to possible four-element forecast vectors would reside. It is difficult to imagine how such a geometrical object could be presented graphically in an understandable way.

Use of the calibration simplex has been illustrated using the CPC 6–10- and 8–14-day subjective temperature and precipitation forecasts. These temperature forecasts exhibit an unconditional bias consistent with the average forecasts lagging the ongoing climate warming, as has been observed also for seasonal forecasts, but these graphs also indicate that greater sharpness for the below- and above-normal elements of the forecast vectors could be employed without degrading the overall accuracy or skill. The near-normal elements of the precipitation forecast vectors were seen to be strongly overconfident, with overall underforecasting of the below-normal category, and corresponding overforecasting in roughly equal proportions of the near-normal and above-normal outcomes.

## Acknowledgments

I thank Scott Handel for supplying the CPC forecasts and corresponding verifications. This research was supported by the National Science Foundation under Grant AGS-1112200.

## REFERENCES

Barnston, A. G., Li S., Mason S. J., DeWitt D. G., Goddard L., and Gong X., 2010: Verification of the first 11 years of IRI's seasonal climate forecasts. *J. Appl. Meteor. Climatol.*, **49**, 493–520.

Epstein, E. S., and Murphy A. H., 1965: A note on the attributes of probabilistic predictions and the probability score. *J. Appl. Meteor.*, **4**, 297–299.

Jolliffe, I. T., and Stephenson D. B., 2012: Introduction. *Forecast Verification: A Practitioner's Guide in Atmospheric Sciences*, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley-Blackwell, 1–9.

Jupp, T. E., Lowe R., Coelho C. A. S., and Stephenson D. B., 2012: On the visualization, verification and recalibration of ternary probabilistic forecasts. *Philos. Trans. Roy. Soc. London*, **A370**, 1100–1120.

Katz, R. W., and Murphy A. H., Eds., 1997: *Economic Value of Weather and Climate Forecasts*. Cambridge University Press, 222 pp.

Livezey, R. E., and Timofeyeva M. M., 2008: The first decade of long-lead U.S. seasonal forecasts—Insights from a skill analysis. *Bull. Amer. Meteor. Soc.*, **89**, 843–854.

Livezey, R. E., Vinnikov K. Y., Timofeyeva M. M., Tinker R., and Van den Dool H. M., 2007: Estimation and extrapolation of climate normals and climatic trends. *J. Appl. Meteor. Climatol.*, **46**, 1759–1776.

Murphy, A. H., 1972: Scalar and vector partitions of the probability score: Part II. *N*-state situation. *J. Appl. Meteor.*, **11**, 1183–1192.

Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. *Mon. Wea. Rev.*, **119**, 1590–1601.

Murphy, A. H., 1997: Forecast verification. *Economic Value of Weather and Climate Forecasts*, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

Murphy, A. H., and Winkler R. L., 1977: Reliability of subjective probability forecasts of precipitation and temperature. *Appl. Stat.*, **26**, 41–47.

Murphy, A. H., and Daan H., 1985: Forecast evaluation. *Probability, Statistics, and Decision Making in the Atmospheric Sciences*, A. H. Murphy and R. W. Katz, Eds., Westview, 379–437.

Murphy, A. H., and Hsu W.-R., 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. *Int. J. Forecasting*, **2**, 285–293.

Murphy, A. H., and Winkler R. L., 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338.

Murphy, A. H., and Winkler R. L., 1992: Diagnostic verification of probability forecasts. *Int. J. Forecasting*, **7**, 435–455.

Murphy, A. H., Brown B. G., and Chen Y.-S., 1989: Diagnostic verification of temperature forecasts. *Wea. Forecasting*, **4**, 485–501.

Van den Dool, H., 2007: *Empirical Methods in Short-Term Climate Prediction*. Oxford University Press, 215 pp.

Van den Dool, H., and Toth Z., 1991: Why do forecasts for "near normal" often fail? *Wea. Forecasting*, **6**, 76–85.

Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. *J. Climate*, **13**, 2389–2403.

Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. *Meteor. Appl.*, **8**, 209–219.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences*. 3rd ed. Academic Press, 676 pp.

Wilks, D. S., 2013: Projecting "normals" in a nonstationary climate. *J. Appl. Meteor. Climatol.*, **52**, 289–302.

Wilks, D. S., and Godfrey C. M., 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. *J. Climate*, **15**, 1369–1377.