1. Introduction
It is common practice in the atmospheric sciences to evaluate a set of forecasts by using scalar summary measures to quantify specific attributes of the quality of the relationship between the forecasts and the corresponding observations (i.e., forecast quality). This approach is known as “measures-oriented” verification (Murphy 1997). Familiar examples of summary measures include mean absolute error and mean error, which are used to evaluate accuracy and unconditional bias, respectively. Such summary measures, however, provide a very limited description of the complex relationship between a set of forecasts and the corresponding set of observations. This was recognized by Murphy and Winkler (1987), who conceived of a new approach to forecast evaluation called distributions-oriented verification (also known as diagnostic verification), which aims to analyze forecast quality as comprehensively as possible rather than attempting to summarize it with a single number.
Specifically, distributions-oriented (DO) verification is concerned with analysis of the joint probability distribution of forecasts and observations, p( f, x), which describes all time-independent information about the forecasts, f; the corresponding observations, x; and their relationship1 (Murphy and Winkler 1987). The literature on DO verification describes estimation of the joint distribution based on a verification data sample, {( fk, xk); k = 1, . . . , N}, and methods of interpreting the joint distribution (for a general overview, see Murphy 1997; Jolliffe and Stephenson 2003; Wilks 2006). Studies concerning deterministic forecasts of a continuous scalar variable have utilized a primitive model in estimation of the joint distribution, in which p( f, x) is represented by a set of discrete categories (each covering a range of forecast and observed values) that are assigned probabilities according to the empirical relative frequency of occurrence in the verification data sample (Murphy et al. 1989; Brooks and Doswell III 1996; Brooks et al. 1997; de Elía and Laprise 2003). Distributions-oriented verification studies concerning probabilistic forecasts of a dichotomous variable have primarily used statistical models in the estimation of the joint distribution, reducing the number of quantities that need to be estimated relative to the primitive model-based technique (Murphy and Wilks 1998; Wilks 2000; Wilks and Godfrey 2002; Bradley et al. 2003, 2004). While the methods used to interpret the joint distribution are largely dependent on how it is estimated, the aforementioned authors emphasize the ability of the DO approach to highlight deficiencies in forecast quality that would be missed by the traditional measures-oriented approach, potentially providing useful feedback to modelers and forecasters.
Although the conceptual outlook and methodologies of DO verification are slowly being adopted (e.g., Charba et al. 2003; Maini et al. 2003; Nachamkin et al. 2005; Myrick and Horel 2006; Elsberry et al. 2007; Engel and Ebert 2007; Schulz et al. 2007), it is not clear whether their potential to direct changes that improve forecast quality is being generally realized. In the specific context of a deterministic forecast system predicting a scalar variable, the goal of optimizing a summary measure of forecast accuracy (e.g., mean absolute error, mean squared error) is often the sole driver of forecast system development. Modelers and forecasters know how their predictions will be evaluated, and as such, strive to optimize the appropriate summary accuracy measure through changes to model formulation or forecasting technique. Thus, forecast system development is driven through a process of summary accuracy measure optimization. This framework of interaction between forecast verification procedure and forecaster/modeler implicitly assumes that forecast accuracy serves as a proxy for the broader concept of forecast quality. Indeed, forecast accuracy is intimately related to forecast quality, but exactly how summary accuracy measure optimization influences the full relationship between forecasts and observations is unclear.
To address this issue, a primitive model-based distributions-oriented verification approach is used here to investigate the quality of scalar predictions from deterministic forecast systems that have been developed through summary accuracy measure optimization. The goal is to explore the consequences of driving forecast system development with summary accuracy measure optimization for the full scope of forecast quality. Operational deterministic tropical cyclone (TC) intensity forecasts and the corresponding observations will serve as verification data samples for DO verification, as TC intensity forecast system development is primarily driven by summary accuracy measure optimization (Bender et al. 2007; Franklin, cited 2006; DeMaria et al. 2005; Knaff et al. 2005; Emanuel et al. 2004; Knaff et al. 2003; Kumar et al. 2003). The particular choice to verify TC intensity forecasts is motivated in part by the socioeconomic importance of such predictions, especially for situations involving a TC expected to make landfall. For these forecasts, it is thus especially important to come to a comprehensive understanding of their quality, and how its first-order features are shaped by summary accuracy measure optimization. Although forecast quality and value are not precisely synonymous (Murphy and Ehrendorfer 1987; Murphy 1997), the more complete view of forecast quality revealed through DO verification (relative to measures-oriented verification) should allow users to better optimize their decisions (Murphy and Winkler 1987; Wilks 2000, and references therein).
The remainder of this paper is organized as follows. First, the details of the verification data samples are described in section 2. Then, section 3 introduces a graphical representation of the joint probability distribution of forecasts and observations, the fundamental instrument of DO verification, and applies it to display the joint distributions for some select verification data samples. Section 4 utilizes two factorizations of the joint distribution into the product of a marginal distribution and conditional distributions to further analyze the verification data samples and diagnose the influence of summary accuracy measure optimization. Section 5 discusses the use of summary measures in a role complementary to the DO verification approach, before a summary of the results and concluding thoughts are presented in section 6.
2. Verification data samples
The variable of interest here is tropical cyclone intensity, a continuous scalar variable representing the 1-min maximum sustained surface wind of a tropical cyclone (as defined by the National Hurricane Center). Intensity lends itself to the discretization involved in a primitive model-based DO verification, as observations of intensity are traditionally reported in multiples of 5 kt. Thus, an observation falls into one of roughly nx = 30 categories (e.g., 67.5–72.5, 72.5–77.5 kt, etc.), given the range of intensities over which TCs are known to exist. Forecasts are categorized into an analogous set of roughly nf = 30 categories. Assuming nx = nf = 30, the joint distribution is approximated with nf × nx = 900 categories, each assigned a probability p( fi, xj) according to the empirical relative frequency of the forecast falling in the ith category and observation in the jth category (i = 1, . . . , nf; j = 1, . . . , nx) in the verification data sample. Such a primitive model–based estimate of the joint distribution has a dimensionality of 899, the number of relative frequencies needed to completely determine p( fi, xj) (Murphy 1991). An extensive verification data sample is needed to estimate the relative frequencies in such a high-dimensional approximation of the joint distribution [henceforth, “joint distribution” will be understood to mean the primitive model–based estimate, and will be denoted as p( f, x)].
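As a concrete illustration, the sketch below estimates such a primitive model–based joint distribution from a verification data sample of paired forecast and observed intensities; the 5-kt bin edges and all names are illustrative assumptions, not part of any operational verification code.

```python
import numpy as np

def primitive_joint_distribution(forecasts, observations, bin_width=5.0,
                                 lo=2.5, hi=182.5):
    """Estimate p(f_i, x_j) by empirical relative frequencies on 5-kt bins.

    forecasts, observations : 1-D arrays of intensities (kt) for one lead time.
    Returns the common bin edges and the matrix of relative frequencies,
    which sums to 1 over all (i, j) categories.
    """
    edges = np.arange(lo, hi + bin_width, bin_width)  # 2.5, 7.5, ..., 182.5 kt
    counts, _, _ = np.histogram2d(forecasts, observations, bins=[edges, edges])
    p_fx = counts / counts.sum()                      # relative frequencies
    return edges, p_fx

# edges, p_fx = primitive_joint_distribution(f, x)    # one homogeneous sample
```

The marginal distributions s( f ) and t(x) used in section 4 are then simply the row and column sums of the returned matrix.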
In constructing verification data samples, only Atlantic basin TCs are considered, as the highest quality intensity observations are to be found there. These observations are synthesized into a “best track” analyzed intensity for each storm every 6 h while it is in existence; here, these analyses are considered to be the observed intensities for the verification. Forecasts are also issued at 6-h intervals and, since the 2001 season, extend to lead times of up to 120 h. Official forecasts for the intensity of Atlantic basin tropical cyclones are produced by forecasters at the National Hurricane Center (NHC).2 These forecasts are supported by three primary numerical models (DeMaria et al. 2005). The most basic is the 5-day Statistical Hurricane Intensity Forecast model (SHF5), which uses a multiple linear regression model (trained on data from past TCs) to predict intensity change, given predictors describing the current state of the TC and its recent history (Knaff et al. 2003). The Statistical Hurricane Intensity Prediction Scheme (SHIPS) improves on this concept by including predictors about the environment of the TC, at both the initialization time and at forecast lead times (DeMaria and Kaplan 1994, 1999; DeMaria et al. 2005). Predictors of future storm environmental conditions are calculated from the Global Forecast System (GFS) weather model forecast, at future storm positions predicted by the NHC. A postprocessing routine is applied to the raw SHIPS intensity forecast to account for any land-induced decay that may occur if the NHC track forecast brings the TC over or near land. The result, called the Decay-SHIPS (DSHP) forecast, is provided to forecasters. The last of the three primary models is the Geophysical Fluid Dynamics Laboratory–University of Rhode Island (GFDL–URI) coupled hurricane–ocean model, as run at the National Centers for Environmental Prediction (NCEP; henceforth this model will be referred to as the GFDL). The GFDL is a dynamical model, in contrast to the statistically based SHF5 and DSHP. It couples a nested-grid atmospheric model centered on the TC (Kurihara et al. 1998) to an ocean model (Bender and Ginis 2000) to explicitly account for interaction between the TC and ocean. Forecasts from the three aforementioned models (SHF5, DSHP, and GFDL) and the official forecasts from the NHC (which will be abbreviated as OFCL) are all verified in this study.
The verification data samples used here are homogeneous in the OFCL, SHF5, DSHP, and GFDL forecasts, and span the 2001–05 Atlantic basin seasons.3 Homogeneity in the four types of forecasts requires that for a given TC, forecast initialization time, and lead time, all four forecasts exist and are able to be verified against an existing best-track observation. This ensures that any comparison between forecast systems is fair, as each is verified over the same set of situations. Unlike the NHC’s current verification methodology,4 forecast–observation pairs are not excluded here according to storm classification at the forecast initialization time and verification time. Thus, some forecast–observation pairs included here pertain to situations where the best-track storm classification is extratropical, tropical wave, or remnant low. Table 1 shows the sample size for each lead time in the homogeneous verification data samples. Sample size decreases with lead time as long-lead forecasts do not exist at the beginning of the TC’s life (forecasts are initialized once a weather system is defined as a TC) or are not made because dissipation is expected to occur. Sample sizes are on the order of the dimensionality of the joint distribution of forecasts and observations, even with 5 yr of data considered. Still, interpretation of the joint distribution and its relatives is useful, as will be shown subsequently.
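Assembling such homogeneous samples amounts to a sequence of inner joins over storm and initialization-time keys. A minimal sketch, assuming each model's forecasts are stored in a table with the illustrative column names shown (this is not the NHC verification code):

```python
import pandas as pd

def homogeneous_sample(frames, lead_h):
    """Keep only (storm, initialization time) cases for which every forecast
    system has a forecast at the requested lead time.

    frames : dict mapping model name ("OFCL", "SHF5", "DSHP", "GFDL") to a
             DataFrame with columns ["storm_id", "init_time", "lead_h",
             "forecast_kt"]; best-track observations are merged afterward.
    """
    keys = ["storm_id", "init_time"]
    merged = None
    for name, df in frames.items():
        sub = df.loc[df["lead_h"] == lead_h, keys + ["forecast_kt"]]
        sub = sub.rename(columns={"forecast_kt": name})
        merged = sub if merged is None else merged.merge(sub, on=keys, how="inner")
    return merged
```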
Before continuing, it must be noted that the DSHP and GFDL models underwent significant changes over the 5 yr encompassed in the verification data samples. The predictors used in DSHP are constantly evolving, as documented in DeMaria et al. (2005) for the 2001–03 seasons. Further updates to DSHP for the 2004 and 2005 seasons are described by DeMaria (2006). These include new predictors based on satellite observations, a new postprocessing routine to account for interaction with narrow landmasses (based on DeMaria et al. 2006), and an adjustment of the SST predictor to account for ocean mixing processes under the eyewall. The GFDL has undergone many significant changes since the coupled version was implemented operationally in 2001. These include upgraded physics, vastly increased resolution of the atmospheric model, and improved initialization procedures for both the atmosphere and ocean (Bender et al. 2007; Falkovich et al. 2005). Thus, the verification results presented here will not necessarily reflect the performance characteristics of the latest version of these models. However, the modelers’ (and forecasters’) goal of summary accuracy measure optimization has not changed [e.g., Knaff et al. (2003), DeMaria et al. (2005), Franklin (2006), and Bender et al. (2007) for SHF5, DSHP, OFCL, and GFDL forecasts, respectively], and it is argued here that this dominates the first-order nature of the DO verification results.
3. The joint distribution
The methodology of DO verification is heavily dependent on graphical representations to convey the rich information contained in the joint distribution of forecasts and observations. Graphical representation of the joint distribution itself builds upon the concept of the scatterplot, the basic tool for exploratory analysis of paired data. Whereas the density of points in a scatterplot of f versus x gives a qualitative sense of the relative frequency of ( f, x) pairs, a graphical representation of the joint distribution must quantify this information. Within the context of a primitive model–based estimate of the joint distribution, there is a grid of ( f, x) category relative frequencies to represent. For a very large number of categories, it is advantageous to contour the relative frequency field, as was done for the ( f, x) frequency field in Engel and Ebert (2007) and Schulz et al. (2007). For relatively few categories, the relative frequencies can be represented individually. Bivariate histograms are used by Potts (2003) and Murphy et al. (1989) for this purpose, but the three-dimensional perspective employed to view the histograms makes interpretation difficult, as pointed out by Wilks (2006). Wilks (2006) instead suggests utilizing a “glyph scatterplot” to display the relative frequencies of the joint distribution, where some characteristic of the scattered points corresponds to the relative frequency magnitude.
Here, the graphical representation style of the joint distribution of forecasts and observations suggested by Wilks (2006) is adopted. Figures 1–4 show the joint distribution pertaining to the OFCL, GFDL, DSHP, and SHF5 forecasts, respectively. Each figure shows the joint distribution at four different lead times in the following panels: (a) 0, (b) 36, (c) 72, and (d) 120 h. Dots are drawn for all ( f, x) with nonzero relative frequency in the corresponding verification data sample, with the colors representing the magnitude of the relative frequency according to the nonlinear scale detailed below Fig. 1.
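A glyph scatterplot of this kind can be drawn along the following lines; the axis orientation, marker size, and logarithmic color scale are stand-in assumptions for the nonlinear scale used in Figs. 1–4.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def glyph_scatterplot(edges, p_fx, ax=None):
    """Dots at the bin centers of all (f, x) categories with nonzero relative
    frequency, colored by the magnitude of that frequency."""
    ax = ax or plt.gca()
    centers = 0.5 * (edges[:-1] + edges[1:])
    i, j = np.nonzero(p_fx)                       # occupied (forecast, obs) bins
    sc = ax.scatter(centers[i], centers[j], c=p_fx[i, j],
                    cmap="viridis", norm=LogNorm(), s=25)
    ax.plot(centers, centers, "k-", lw=0.5)       # thin f = x diagonal
    ax.set_xlabel("forecast intensity (kt)")
    ax.set_ylabel("observed intensity (kt)")
    plt.colorbar(sc, ax=ax, label="relative frequency")
    return ax
```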
Along with the joint distribution of forecasts and observations, each panel in Figs. 1–4 has a thin black line along the diagonal, where f = x. If a set of deterministic forecasts were perfect, all dots would be clustered along this diagonal. One can see that this is not the case for the TC intensity forecasts, even at the 0-h lead time, as operational analyses of intensity do not necessarily match the corresponding best-track values (which are based partially on observations taken after the time in question). At the 36-h lead time, all four forecast sets show a widening of the joint distribution about the (diagonal) major axis, indicating a growing proportion of large forecast errors. For the OFCL, DSHP, and SHF5 forecasts, the joint distribution has about equal probability on both sides of the diagonal, but for the GFDL forecasts, probability is concentrated below the diagonal, where f > x. This is evidence of unconditional high bias in the GFDL forecasts, which will be quantified in section 5 using a measures-oriented approach.
By the 72-h lead time in Figs. 1–4, one can see that not only is each joint distribution widening about its major axis, but this axis is also rotating into a more vertical orientation. This is especially evident for the SHF5 forecasts, but is subtly present in the joint distributions for the other three forecast sets. Such rotation of the joint distribution is symptomatic of growing conditional bias: for high-intensity observations, the forecasts are generally too low, and for low-intensity observations, the forecasts are generally too high. Moving on to the 120-h lead time, Figs. 1–4 show that all four forecast sets have substantial conditional bias, and if anything, probability has gathered in toward the nearly vertical major axis of the distribution, instead of spreading out further. The capability of DO verification to reveal conditional bias will be discussed in greater detail in section 4, as detection of conditional bias is an advantage of DO verification over the traditional measures-oriented approach (which cannot detect conditional bias at all).
4. Marginal and conditional distributions
a. Factorizations of the joint distribution
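The two factorizations used in the remainder of this section are the standard calibration–refinement (CR) and likelihood–base rate (LBR) factorizations of Murphy and Winkler (1987), written here in the notation adopted below:

```latex
p(f, x) \;=\; q(x \mid f)\, s(f) \;=\; r(f \mid x)\, t(x),
```

where s( f ) is the marginal distribution of the forecasts, t(x) is the marginal distribution of the observations (the sample climatology), q(x| f ) is the conditional distribution of the observations given the forecast, and r( f |x) is the conditional distribution of the forecasts given the observation.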
b. Marginal distribution analysis
The aforementioned marginal and conditional distributions are useful tools in DO verification, as each draws out relevant aspects of p( f, x) that are not easy to directly analyze from Figs. 1–4. First, consider the marginal distributions s( f ) and t(x). For a perfect set of deterministic forecasts, the marginal distribution of those forecasts would be exactly the same as the marginal distribution of the observations. However, equivalence of the two marginal distributions does not necessarily imply a perfect set of forecasts, just that the forecast distribution is consistent with the sample climatology. This is because individual forecasts can be erroneous, but taken as a whole, the set of forecast values can still be distributed as the observations. So, ultimately, comparative analysis of the marginal distributions is most informative when the two are different, as this is an unequivocal sign of a flawed forecast set. Furthermore, the nature of the differences can allow one to infer some reasons for the discrepancy, as will be seen for the TC intensity forecasts.
The literature shows a number of possibilities for the graphical comparison of marginal distributions. Boxplots (see section 4e) for each marginal distribution can be juxtaposed to compare the sample quantiles of the observations with the sample quantiles of the corresponding forecasts (Murphy et al. 1989; Potts 2003). A more detailed comparison of the form of the (discrete) marginal distributions can be carried out with the aid of relative frequency histograms (Potts 2003; Elsberry et al. 2007). However, when multiple forecast distributions are to be compared with an observed distribution, the relative frequency histogram approach becomes unwieldy. An alternative way to compare the forms of the marginal distributions is to superimpose estimates of the underlying (continuous) forecast distributions upon an estimate of the underlying (continuous) observed distribution. This is done in Fig. 5 for the four sets of TC intensity forecasts, with each panel showing a different lead time. The forecast distributions are plotted with dashed lines, while the observed distribution is plotted in solid black. The continuous marginal distributions are estimated with a nonparametric kernel density smoothing technique (Wilks 2006), using Gaussian kernels and a 5-kt smoothing bandwidth.
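A minimal sketch of this smoothing, assuming the Gaussian kernels are applied directly to the sampled intensities with a fixed 5-kt bandwidth (names are illustrative):

```python
import numpy as np

def gaussian_kde_fixed_bw(values, grid, bandwidth=5.0):
    """Kernel density estimate with Gaussian kernels of fixed bandwidth (kt).

    values : 1-D array of forecast or observed intensities.
    grid   : intensities (kt) at which to evaluate the smoothed density.
    """
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(values) * bandwidth)

# grid = np.linspace(0.0, 180.0, 361)
# density_obs = gaussian_kde_fixed_bw(x, grid)   # solid curve in Fig. 5
# density_fc  = gaussian_kde_fixed_bw(f, grid)   # one dashed curve per model
```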
In each panel in Fig. 5 the values of the mean and median of the observed distribution are marked along the abscissa with a black triangle and a gray triangle, respectively. There are slight differences in the central tendencies of the observed distributions among the four lead times, indicative of differences in the observed distributions themselves. This is due to subsampling of the full set of observations (distributed as in Fig. 5a) in accordance with the lead time of the corresponding forecast set. For example, the observed distribution in Fig. 5d only reflects observations taken when 120-h lead time forecasts are verifying. In the days immediately after a TC forms, there are no 120-h forecasts to verify, as forecasts are initiated only upon TC formation. Thus, the observed distributions corresponding to sets of forecasts at longer lead times are increasingly depleted of low-intensity observations characteristic of formative TCs.
The primary feature in Fig. 5, however, is the divergence with lead time of the four forecast distributions from the observed distribution. At the 36-h lead time, the forecast distributions fall into two groups. The OFCL, DSHP, and SHF5 forecast sets are overpopulated (relative to the observed distribution) in the 40–80-kt intensity range and underpopulated elsewhere. The GFDL shows somewhat different behavior, overpopulating the 70–110-kt intensity range, while leaving the 30–60-kt range and the highest intensities deficient. These general patterns largely persist through the 72- and 120-h lead times, becoming increasingly amplified with lead time, especially for SHF5 and GFDL. By 120 h, SHF5 virtually eliminates forecasts below 30 kt and above 100 kt, instead favoring a very narrow forecast distribution centered at 60 kt. Though not as extreme as SHF5, GFDL also lacks a sufficient number of forecasts at the lowest and highest intensities, while overpredicting the number of low- to moderate-intensity hurricanes (65–105 kt). The OFCL and DSHP show a similar pattern, but instead overpopulate strong tropical storm and weak hurricane intensities (50–90 kt).
While there are substantial differences among the forecast distributions shown in Fig. 5, the overall theme that emerges is an increasing tendency with lead time to predict moderate intensities rather than those at the low and high ends of the observed intensity range. This behavior can be explained as a response to the use of summary accuracy measure optimization as the driving principle behind TC intensity forecast system development. Summary accuracy measure optimization is explicit in the formulation of the two statistical models, DSHP and SHF5, that use multiple linear regression to find an optimal relationship between predictors and intensity change. Specifically, the linear relationship is optimal in the least squares sense, implying that these statistical models are designed to produce forecasts that minimize the mean squared error (MSE). How satisfying the demand of MSE minimization ends up producing the characteristic peaked forecast distributions in Fig. 5 is perhaps best demonstrated within the context of a very simple statistical forecast model.
c. Performance of a single linear regression (SLR) model
The coefficients of the aforementioned linear regression are listed in Table 2, along with the corresponding coefficients of determination, for each lead time. There is a gradual deterioration of the relationship between the initial and observed intensities, with the two variables essentially unrelated at the 120-h lead time. As seen in Fig. 6, this causes the best-fit line to rotate from the diagonal at t = 0 h to nearly horizontal at t = 120 h. While the slope tails off toward zero with increasing lead time, the intercept approaches the mean observation of the training data.
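A sketch of such a fit is given below; it assumes, as described in the text, that the SLR model regresses the observed intensity at each lead time on the initial (t = 0 h) intensity, with coefficients chosen by least squares over the training data. The exact form of Eq. (3) is not reproduced here, so the names and details are illustrative.

```python
import numpy as np

def fit_slr(initial_intensity, observed_intensity):
    """Least-squares fit of the observed intensity at one lead time onto the
    initial intensity, x_t ~ a + b * x_0 (an assumed form of the SLR model).

    Returns (a, b) minimizing mean squared error over the training data.
    """
    b, a = np.polyfit(initial_intensity, observed_intensity, deg=1)
    return a, b

# As the correlation between x_0 and x_t decays with lead time, b -> 0 and
# a -> mean(x_t): the forecasts collapse toward the training-sample mean
# observation, consistent with the behavior in Table 2 and Fig. 6.
```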
The deterioration of the initial intensity–observed intensity relationship with lead time has profound implications for the quality of forecasts from the SLR model. Consider the set of SLR forecasts homogeneous with those in the verification data samples described in section 2, created by applying Eq. (3) with the coefficients in Table 2. The joint distribution for this set of forecasts and corresponding observations is shown in Fig. 7, at four different lead times. The SLR joint distribution essentially shows exaggerated versions of the primary traits seen in the joint distributions for the OFCL, GFDL, DSHP, and SHF5 forecast sets in Figs. 1–4: 1) rotation of the major axis of the joint distribution from diagonal to vertical as lead time increases and 2) initial widening of the distribution about the major axis at early lead times, followed by contraction at later lead times. The joint distribution for the SLR model is most similar to that of the SHF5 model in Fig. 4, which is reasonable, as SLR is closest to SHF5 in the nature of its statistical model formulation. However, all the forecast systems show this same general behavior.
As one would expect from the joint distribution, the marginal distribution of the SLR forecasts also displays an exaggerated version of the main pattern seen in Fig. 5: the preponderance of moderate intensity forecasts at the expense of low- and high-intensity predictions (relative to the observed distribution). Figure 8 shows the marginal distributions of SLR forecasts and SHF5 forecasts superimposed on the marginal distribution of the observations, in the manner of Fig. 5. The SLR and SHF5 forecast distributions are very similar in nature, with both distributions sharpening with lead time, as their respective modes converge toward the mean observation (marked by a black triangle along the abscissa). Again, while the SHF5 model has a forecast distribution most similar to that of the SLR model, all forecast systems show the same general behavior.
d. Inferring the influence of summary accuracy measure optimization
Although the SLR model is a vastly simpler forecast system than the operational intensity forecast systems introduced earlier, all these forecast systems share the same first-order characteristics of the joint distribution and forecast distribution. Here, it is argued that all are responding in the same qualitative manner to summary accuracy measure optimization, by sharpening the forecast distribution as lead time increases. This sharpening of the forecast distribution is manifested in the joint distribution as a rotation of the major axis from the diagonal into a more vertical orientation as lead time increases, and an attendant contraction of the distribution about the more vertical major axis. Such an evolution of the features of the joint distribution is necessary to accommodate the sharpening of the forecast distribution.
Summary accuracy measure optimization is explicit in the formulation of the SLR model, as the linear statistical model coefficients are chosen to minimize MSE (in the dependent, training data). These coefficients change drastically with lead time in response to the deterioration of the relationship between initial intensity and observed intensity, causing the range of intensity values forecasted by the SLR model to shrink as lead time increases. The shrinking range of SLR-forecasted values can be explained by considering the two limiting cases in the relationship between initial intensity and observed intensity. In the limit of a perfect relationship (e.g., at t = 0 h, approximately), the range of SLR-forecasted values is exactly the same as the range of observed values, as the forecast always equals the observed intensity. This can be seen by comparing the t = 0 h marginal distribution of SLR forecasts and the marginal distribution of the observations in Fig. 8a. In the opposite limit of no relationship between the initial and observed intensities (e.g., at t = 120 h, approximately), the SLR model always predicts the same value: the mean observation in the training data. With no useful information provided by the predictor, this is simply the course of action that must be taken to minimize MSE. At t = 120 h, the marginal distribution of SLR forecasts in Fig. 8d shows SLR-predicted values in a small range around the value of the mean observation. The SLR-forecasted intensity values here are not all the same (see Fig. 7d), but it is clear that the marginal distribution of SLR forecasts has sharpened substantially relative to that at the 0-h lead time. The marginal distributions of SLR forecasts at 36 and 72 h represent intermediate cases between the two limiting scenarios described above.
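The two limiting cases follow directly from the standard least-squares coefficients. Writing the fit generically (this is not the notation of Eq. (3)) as

```latex
\hat{x}_t \;=\; a + b\,x_0, \qquad
b \;=\; r_{x_0 x_t}\,\frac{\sigma_{x_t}}{\sigma_{x_0}}, \qquad
a \;=\; \bar{x}_t - b\,\bar{x}_0 ,
```

a perfect linear relationship (r → 1) reproduces the observed intensities, whereas r → 0 gives b → 0 and the forecast collapses to the training-sample mean observation, the constant forecast that minimizes MSE when the predictor carries no information.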
The similarity of the joint distributions and forecast distributions for the SLR and SHF5 models is relatively straightforward to understand, as both statistical models are explicitly designed to minimize the MSE of the forecasts. Thus, the two models behave in the same way as their predictors become uncorrelated with observed intensity at later lead times. With its considerably more substantial array of predictors, SHF5 can take advantage of better correlations at later lead times than those available to the SLR model (see Table 2), delaying the onset of behaviors seen in the SLR joint and forecast distributions. The same logic extends to DSHP, with its even larger array of useful predictors than SHF5. However, DSHP also has the ability to decay intensities predicted by the statistical model in a postprocessing scheme, which may be partly responsible for the differences in its forecast distribution relative to that of SHF5 (cf. dark blue and light blue curves in Fig. 5). These two models have very similar forecast distributions for intensities above 70 kt at all lead times, but show increasingly divergent behavior below 70 kt as lead time increases. SHF5 consolidates forecasts around the mean observation, while DSHP forms two modes: one near the mean observation and a secondary mode near 30 kt. Perhaps this secondary mode in the DSHP forecast distribution is due to the effects of the decay model applied during postprocessing.
To understand the reasons for the similarity of the joint distributions and forecast distributions for the statistical models to those for the GFDL and OFCL, the concept of a response to summary accuracy measure optimization must be generalized in terms of optimal deterministic forecasts. An optimal deterministic forecast optimizes the expected value of a particular summary accuracy measure, with the expectation calculated over the true forecast probability distribution. Production of optimal deterministic forecasts is the ultimate goal of driving forecast system development with summary accuracy measure optimization. The optimal deterministic forecast is slightly different for forecast trajectories verified with MAE relative to those verified with MSE: The MAE-optimal forecast trajectory is the time evolution of the median of the true forecast probability distribution, while the time evolution of the mean of the true forecast probability distribution minimizes MSE. In the case of TC intensity, there is no obvious reason that the time evolution of these two central tendencies of the true forecast probability distribution should be radically different. Thus, whether an intensity forecast system is driven explicitly to minimize MSE (e.g., the statistical models) or implicitly to minimize MAE (e.g., GFDL, OFCL), the goal is the same: production of intensity forecast trajectories that match the time evolution of the central tendency of the true forecast probability distribution.
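These are the standard optimality results for point forecasts: for a true forecast probability distribution π of the verifying intensity x,

```latex
\operatorname*{arg\,min}_{f}\ \mathbb{E}_{\pi}\!\left[(f - x)^{2}\right] \;=\; \mathbb{E}_{\pi}[x]
\qquad \text{and} \qquad
\operatorname*{arg\,min}_{f}\ \mathbb{E}_{\pi}\!\left[\,\lvert f - x \rvert\,\right] \;=\; \operatorname{median}_{\pi}(x),
```

so a forecast system developed against MSE is pushed toward the conditional mean, and one developed against MAE toward the conditional median, of that distribution.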
While the time evolution of the central tendency of a true forecast probability distribution cannot be known exactly, its basic characteristics can be inferred. For TC intensity prediction, a true forecast probability distribution is fairly sharp at t = 0, representing uncertainty in the initial intensity of a TC. It can reasonably be assumed that such a distribution is centered on the operationally designated initial intensity. As lead time increases, the true forecast probability distribution ultimately evolves to take a form similar to that of the observed distribution in Fig. 5d, as uncertainty saturates. Note that the mean and median of the observed distribution in Fig. 5d are very similar, both near 60 kt. Thus, the time evolution of the central tendency of the true forecast probability distribution can be described as starting at the operationally designated initial intensity and asymptoting with lead time to about 60 kt.
Consider the evolution with lead time of the marginal distribution of a set of such optimal forecasts. At the initial time, the marginal distribution of forecasts would be the same as the marginal distribution of observations (operationally designated intensities). As lead time advances, the marginal distribution of forecasts would become sharper and sharper, as forecasts asymptote toward 60 kt. This is exactly the pattern seen in the evolution of the marginal distribution of forecasts for the OFCL, GFDL,5 DSHP, SHF5, and SLR forecast systems, as shown in Figs. 5 and 8. All these forecast systems are responding to summary accuracy measure optimization in the same qualitative manner (i.e., trying, however imperfectly, to predict optimal deterministic forecasts), regardless of the specific summary accuracy measure used (MAE or MSE) or how it is optimized. The primary difference among the forecast systems is how quickly the forecasts asymptote to 60 kt; this depends on the capability of the model or forecaster to predict the time trajectory of the central tendency of the true forecast probability distribution.
e. Conditional distribution analysis
Given the characteristic traits imparted to the marginal distribution of forecasts, s( f ), and the joint distribution of forecasts and observations, p( f, x), by the response of a forecast system to summary accuracy measure optimization, one would also anticipate some distinguishing characteristics to emerge in the set of conditional distributions of the observations given the forecast, q(x| f ), and the set of conditional distributions of the forecasts given the observation, r( f |x). Because q(x| f ) and r( f |x) represent sets of distributions, comparative analysis of the q(x| f ) or r( f |x) pertaining to each of the forecast systems cannot be done as for the marginal distributions in section 4b. Effective display of the conditional distributions for a particular forecast system is a challenging task in its own right. The approach taken by previous investigators has been to plot certain quantiles of each of the conditional distributions (Déqué 2003; Maini et al. 2003; Schulz et al. 2007), perhaps accompanied by one or both of the marginal distributions (Murphy et al. 1989; de Elía and Laprise 2003). The resultant figure is generally referred to as a conditional quantile diagram.
Figures 9 and 10 show conditional quantile diagrams pertaining to the OFCL TC intensity forecasts (at four different lead times), with the conditioning on the forecast intensity and observed intensity, respectively. In Fig. 9, five quantiles of the q(x| f ) are displayed using a set of boxplots (Potts 2003; Wilks 2006). In each boxplot, 1) the box extends from the lower quartile to the upper quartile, 2) the whiskers extend from the upper and lower quartiles to the extrema, and 3) the median is marked with a circle.6 A boxplot for t(x), the marginal (i.e., unconditional) distribution of observations, is included on the left side of each panel in Fig. 9, for comparison to the q(x| f ). Finally, a histogram representing the marginal distribution of forecasts, s( f ), is included at the bottom of each panel in Fig. 9. Figure 10 is similar to Fig. 9 but shows boxplots for the r( f |x) and s( f ), as well as a histogram for t(x).
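The quantiles underlying such boxplots can be computed directly from the binned verification sample. The sketch below uses the whiskers-at-the-extrema convention described above; names are illustrative.

```python
import numpy as np

def conditional_quantiles(cond_values, target_values, edges,
                          probs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Quantiles of the target variable conditioned on binned values of the
    conditioning variable: quantiles of x given f for Fig. 9-style boxplots,
    or of f given x for Fig. 10.

    Returns a dict mapping bin center -> (min, lower quartile, median,
    upper quartile, max) of the target values falling in that bin.
    """
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.digitize(cond_values, edges) - 1
    out = {}
    for k, c in enumerate(centers):
        members = target_values[idx == k]
        if members.size:
            out[c] = np.quantile(members, probs)
    return out

# q_x_given_f = conditional_quantiles(f, x, edges)   # boxplot data for q(x|f)
# r_f_given_x = conditional_quantiles(x, f, edges)   # boxplot data for r(f|x)
```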
Consider first the set of conditional distributions of the observations given the OFCL forecast, as displayed in Fig. 9. The range (extent of the whiskers) and interquartile range (extent of the box) of the conditional distributions expand with lead time, reflecting growing uncertainty in the observed intensity that follows a particular OFCL intensity prediction. Nonetheless, Fig. 9d shows that at the 120-h lead time, the conditional distributions of the observations still substantially differ from the unconditional distribution of the observations. In particular, the conditional medians still generally align along the diagonal, where f = x, as is the case at the other lead times (see Figs. 9a–c). Thus, it can be inferred from Fig. 9 that the OFCL forecasts have little type I conditional bias.
Type I conditional bias (often called reliability or calibration) describes deviation of a forecast value from the mean observation given that forecast value, f − μx|f, where μx|f is calculated from q(x| f ). Further investigation of type I conditional bias, for all the TC intensity forecast systems, can be carried out with the aid of type I conditional bias comparative scatterplots, as shown in Fig. 11. In Fig. 11, there are five sets of colored dots in each panel, each set corresponding to one of the five forecast systems. A set of dots marks ( f, μx|f) for all values of f predicted by a particular forecast system at the given lead time. With five forecast systems, there is then a maximum of five dots lined up vertically at each value of f (at the early lead times, the dots partially overlap). The vertical displacement of the dots from the solid black diagonal line shows the magnitude and direction of the type I conditional bias in the forecast sets. There appears to be a slight overall tendency for positive type I conditional biases ( f > μx|f) for high-intensity forecasts and negative type I conditional biases ( f < μx|f) for low-intensity forecasts, but this tendency is not particularly pronounced.
Hence, it can be concluded from Figs. 9 and 11 that all five forecast systems qualitatively have little type I conditional bias, which in and of itself is a desirable feature of a forecast system. A user can expect that the observed intensity, in an average sense, will be near the forecast intensity. To accomplish this, however, the forecast systems have had to sacrifice the refinement of their marginal distributions of forecasts (i.e., the marginal distributions of forecasts are too sharp). For an extreme example of this, consider the SLR model (magenta dots) in Fig. 11. Note that the number of dots radically decreases with increasing lead time, as the range of forecasted values collapses down to only those near μx, consistent with the marginal distribution of forecasts in Fig. 8. Because of this, it is easy for the 120-h lead time SLR forecasts to be type I conditionally unbiased: if the mean observation is always predicted, the type I conditional bias is precisely zero ( f = μx = μx|f). The sharpening of the marginal distribution of forecasts in response to summary accuracy measure optimization favors a low-magnitude type I conditional bias.
The analysis of the conditional distributions, however, is not yet complete. Consider now the set of conditional distributions of the forecasts given the observation, r( f|x). For example, the r( f|x) for the OFCL forecasts are displayed in Fig. 10. Here, it is apparent that the range and interquartile range of the conditional distributions expand substantially during the first 72 h of the forecast (but not during the 72–120-h interval), reflecting growing uncertainty in the forecast intensity that precedes a particular observed intensity value. Figure 10 also shows that the conditional distributions of the forecasts and the unconditional distribution of the forecasts become more similar to each other as lead time increases. The systematic migration of the conditional median forecasts away from the diagonal and toward the unconditional median forecast is of particular interest, as this is evidence of increasing type II conditional bias with lead time.
Type II conditional bias describes the deviation of a mean forecast given an observed value from that observed value, μf|x − x, where μf|x is calculated from r( f|x). As for type I conditional bias, the investigation of type II conditional bias for all the TC intensity forecast systems is aided by comparative scatterplots, as shown in Fig. 12. Here, a set of dots mark (μf|x, x) for all values of x observed at the given lead time. The dots line up horizontally, with five dots for every observed value of x. In these type II conditional bias comparative scatterplots, the forecast systems have no “choice” over the number of dots (corresponding to conditional distributions) that will exist, as this is wholly controlled by the observations. The horizontal displacement of the dots from the solid black diagonal line shows the magnitude and direction of the type II conditional bias in the forecast sets. Figure 12 shows that there is little type II conditional bias at the initial time, but as lead time increases, the μf|x all migrate closer to the dashed black line, which marks a representative value of μf , the mean forecast.7 At the 120-h lead time, some forecast systems (the three statistical models, in particular) are near the limiting case where μf|x = μf for all x, meaning that the mean forecast conditioned on the observation is the same as the mean forecast. In other words, the distribution of forecasts that precede a particular observation is basically the same for every observation; the forecast system cannot “discriminate” (Murphy and Winkler 1987) between different observations.
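The conditional means plotted in Figs. 11 and 12 can be obtained from the same binning; the following sketch (with illustrative names) returns the dots for either comparative scatterplot.

```python
import numpy as np

def conditional_means(cond_values, target_values, edges):
    """Mean of the target variable within each bin of the conditioning
    variable: mu_x|f when conditioning on f (type I, Fig. 11) or mu_f|x when
    conditioning on x (type II, Fig. 12)."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.digitize(cond_values, edges) - 1
    pairs = [(c, target_values[idx == k].mean())
             for k, c in enumerate(centers) if np.any(idx == k)]
    return np.array(pairs)   # columns: conditioning value, conditional mean

# type_I  = conditional_means(f, x, edges)   # rows (f, mu_x|f); bias = f - mu_x|f
# type_II = conditional_means(x, f, edges)   # rows (x, mu_f|x); bias = mu_f|x - x
```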
In contrast to the type I conditional bias, it is clear from Figs. 10 and 12 that the TC intensity forecast systems show qualitatively substantial type II conditional bias, with noticeable differences among the forecast systems. OFCL has the least type II conditional bias, while SLR has the most. Both forecast systems are responding to summary accuracy measure optimization by collapsing their marginal distributions of forecasts toward the mean observation, but at different rates according to the ability of the two forecast systems to mimic the time trajectory of the central tendency of the true forecast probability distribution. The type II conditional bias comparative scatterplot (Fig. 12) shows the effects of this phenomenon quite explicitly, whereas it has to be inferred in a very indirect fashion from the type I conditional bias comparative scatterplot (Fig. 11). Qualitatively, these diagrams are useful diagnostics, but for quantitative results concerning the conditional bias of the forecast systems, one can turn to summary measures as a tool to be used in conjunction with the distributions-oriented techniques described thus far.
5. The use of summary measures to complement DO verification techniques
Figure 14b shows the MSE due to the shape term (dashed lines) and the conditional bias term (dotted lines) of the CR-based MSE decomposition for the five forecast systems. The sum of these two components yields the total MSE (Fig. 14a). Clearly, the shape term is the dominant contributor to the total MSE, as the type I conditional bias is quite low for all the forecast systems, especially OFCL and SLR. The conditional bias term quantifies what was observed qualitatively in Fig. 11, while the shape term gives an indication of what the conditional bias comparative scatterplot cannot show, the spread of each of the conditional distributions.
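Both decompositions can be evaluated directly from a binned joint distribution such as the one constructed in section 2; the sketch below is illustrative, and the binning introduces a small discretization error relative to the exact MSE.

```python
import numpy as np

def cr_lbr_mse_terms(edges, p_fx):
    """Shape and conditional bias contributions to MSE from the binned joint
    distribution p_fx (forecast bins along axis 0, observation bins along 1).

    Returns (cr_shape, cr_bias, lbr_shape, lbr_bias): cr_bias is the type I
    (reliability) term and lbr_bias the type II term; each pair sums to the
    (binned) total MSE.
    """
    v = 0.5 * (edges[:-1] + edges[1:])              # bin-center intensities (kt)
    s_f = p_fx.sum(axis=1)                          # marginal of forecasts, s(f)
    t_x = p_fx.sum(axis=0)                          # marginal of observations, t(x)

    with np.errstate(invalid="ignore"):
        q_x_given_f = p_fx / s_f[:, None]           # q(x|f); empty rows give NaN
        r_f_given_x = p_fx / t_x[None, :]           # r(f|x); empty columns give NaN

    mu_x_f = np.nansum(q_x_given_f * v[None, :], axis=1)
    var_x_f = np.nansum(q_x_given_f * v[None, :]**2, axis=1) - mu_x_f**2
    mu_f_x = np.nansum(r_f_given_x * v[:, None], axis=0)
    var_f_x = np.nansum(r_f_given_x * v[:, None]**2, axis=0) - mu_f_x**2

    cr_shape = np.nansum(s_f * var_x_f)
    cr_bias = np.nansum(s_f * (v - mu_x_f)**2)      # type I conditional bias
    lbr_shape = np.nansum(t_x * var_f_x)
    lbr_bias = np.nansum(t_x * (mu_f_x - v)**2)     # type II conditional bias
    return cr_shape, cr_bias, lbr_shape, lbr_bias
```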
6. Summary and conclusions
In this paper, distributions-oriented verification techniques were used to investigate the quality of deterministic tropical cyclone intensity predictions from four operational forecast systems. Development of these operational forecast systems is driven by summary accuracy measure optimization, which is not equivalent to optimization of the broader concept of forecast quality. The primary goal here, then, was to explore the consequences of driving deterministic forecast system development with summary accuracy measure optimization for the full scope of forecast quality, as embodied in the joint distribution of forecasts and observations.
Despite differences among the TC intensity forecast systems in the summary accuracy measure utilized (MSE or MAE) and differences in exactly how the demand of summary accuracy measure optimization was imposed (explicitly in the model formulation, or implicitly over many changes to the forecast system), DO verification of predictions from the forecast systems yielded similar results. Based on the similarities, it was argued that all of the TC intensity forecast systems must be responding to summary accuracy measure optimization in a similar fashion: by asymptoting forecasts toward the central tendency of the observed distribution with lead time, in an attempt to predict the optimal deterministic forecast trajectory. This causes forecasts to become more similar to each other with lead time, resulting in the sharpening of the marginal distribution of forecasts, as seen in Fig. 5. Sharpening of the marginal distribution of forecasts is manifested in the joint distribution of forecasts and observations as a rotation of the major axis of the distribution from the diagonal into the vertical, and a contraction of probability about that major axis (see Figs. 1–4).
The sharpening of the marginal distribution of forecasts is also reflected in the unconditional and conditional biases of the forecasts. A small-magnitude unconditional bias is theoretically promoted by summary accuracy measure optimization, as forecasts asymptote with lead time to a limiting case of zero unconditional bias (all forecasts equal the mean observation). This limiting case is also one of zero type I conditional bias (conditioning on the forecast), so summary accuracy measure optimization theoretically promotes good performance with respect to this attribute of forecast quality. Indeed, the type I conditional bias comparative scatterplot (Fig. 11) suggests that such conditional bias is of small magnitude (dots are along the diagonal) for the TC intensity forecasts, a conclusion quantitatively bolstered by the type I conditional bias contribution to MSE shown in Fig. 14b. Summary accuracy measure optimization, however, theoretically promotes a large-magnitude type II conditional bias (conditioning on the observations): in the limiting case where all forecasts equal the mean observation, there will be large differences between some observations and the corresponding forecasts given those observations. These large differences are seen in Fig. 12, the type II conditional bias comparative scatterplot for the TC intensity forecasts. The contribution of type II conditional bias to MSE is correspondingly large, as shown in Fig. 14c.
The large-magnitude type II conditional bias in the TC intensity forecasts clearly demonstrates that not all attributes of forecast quality are optimized by driving deterministic forecast system development with summary accuracy measure optimization. The implication of this result is that forecast system developers should be careful not to conflate forecast quality optimization with summary accuracy measure optimization. The decision to use a particular summary measure to guide forecast system development should be based on the attribute(s) of forecast quality considered most desirable, as it is unrealistic to expect optimization of a particular summary measure to promote good forecasts with respect to all attributes of forecast quality. For example, if one values good forecast accuracy and small unconditional and type I conditional biases, but is not bothered by very poor type II conditional bias, driving forecast system development with summary accuracy measure optimization is an appropriate choice. However, if one values type II conditional bias above other attributes of forecast quality, one must seek some other summary measure, such as the conditional bias term in the LBR-based MSE decomposition of Eq. (9), to optimize in the forecast system development process.
Ultimately, it would be best to utilize distributions-oriented verification techniques in driving forecast system development, but the complexity of probability distributions (joint, marginal, conditional) hinders straightforward, objective comparison of those from competing forecast systems. For example, it is much easier to pass judgment concerning the relative performance of the TC intensity forecast systems based on the MAE time series in Fig. 13b than based on the 40 joint distributions corresponding to the eight lead times and five forecast systems represented there. Thus, the primacy of summary measures in the forecast system development process is likely unavoidable, leaving DO techniques to play a supporting role.
One important aspect of this supporting role for DO verification is the facilitation of “DO dependent” summary measure calculation. DO-dependent summary measures, which can only be expressed as functions of the joint distribution (or its marginal and conditional components), quantify attributes of forecast quality that cannot be measured by prototypical summary measures (e.g., ME, MAE, MSE), which can be expressed either as a function of the joint distribution or as a function of the set of forecast–observation realizations. Two examples of DO-dependent summary measures are the type I conditional bias term of the CR-based MSE decomposition [Eq. (8)] and the type II conditional bias term of the LBR-based MSE decomposition [Eq. (9)], each of which quantifies a conditional bias attribute of forecast quality. As described above, such DO-dependent summary measures aid in the interpretation of the graphical DO verification results, and could potentially be used in forecast system development. The principles of information theory can be used to define another DO-dependent summary measure, quantifying an information content attribute of forecast quality. This summary measure, the mutual information between the forecasts and observations (Leung and North 1990; DelSole 2005), measures the average amount of information a forecast provides about the observation, relative to prior knowledge of the sample climatology. Mutual information has a number of interesting properties, including the ability to verify forecasts categorized into a mixture of nominal and ordinal categories. For example, in the context of TC intensity predictions, a forecast category of “dissipated” could be included along with a set of ordinal categories. The utility of mutual information as a summary measure to complement DO verification of deterministic forecasts will be explored by the author in a future paper.
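In terms of the joint distribution and its marginals, the mutual information takes the standard form

```latex
I(F; X) \;=\; \sum_{i=1}^{n_f} \sum_{j=1}^{n_x} p(f_i, x_j)\,
\log \frac{p(f_i, x_j)}{s(f_i)\, t(x_j)} ,
```

which vanishes when forecasts and observations are independent and grows as the forecasts become more informative about the observations; because no ordering of the categories is required, nominal categories such as “dissipated” can be mixed with the ordinal intensity bins.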
Another route of future investigation is to apply the distributions-oriented approach in the verification of deterministic predictions of tropical cyclone track, the other primary TC-related predictand (besides intensity). The challenge inherent in this endeavor is to formulate a categorization of TC position that sufficiently limits the dimensionality of the joint distribution such that it can be adequately populated by the verification data sample, but still provides sufficient distinction between the predicted category and the observed category (i.e., the forecast and observed categories are not always the same). Discretizing the forecast and observed latitude and longitude of the TC center into 30 categories each, similar to the approach taken here for intensity, would result in a joint distribution with 30⁴ − 1 (roughly 8.1 × 10⁵) independent elements, vastly too extensive to populate with a verification data sample similar to those utilized here. Coarser discretization and/or transformation of position into a storm-relative coordinate system are likely needed to move forward with DO verification of deterministic TC track predictions.
Acknowledgments
The author thanks two anonymous reviewers, whose insightful comments led to the improvement of this manuscript. The author also thanks Kerry Emanuel and James A. Hansen for their guidance, and acknowledges the support of ONR Grant N00014-05-1-0323.
REFERENCES
Bender, M. A., and Ginis I. , 2000: Real-case simulations of hurricane–ocean interaction using a high-resolution coupled model: Effects on hurricane intensity. Mon. Wea. Rev., 128 , 917–946.
Bender, M. A., Ginis I. , Tuleya R. , Thomas B. , and Marchok T. , 2007: The operational GFDL coupled hurricane–ocean prediction system and summary of its performance. Mon. Wea. Rev., 135 , 3965–3989.
Bradley, A. A., Hashino T. , and Schwartz S. S. , 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18 , 903–917.
Bradley, A. A., Schwartz S. S. , and Hashino T. , 2004: Distributions-oriented verification of ensemble streamflow predictions. J. Hydrometeor., 5 , 532–545.
Brooks, H. E., and Doswell C. A. III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11 , 288–303.
Brooks, H. E., Witt A. , and Eilts M. D. , 1997: Verification of public weather forecasts available via the media. Bull. Amer. Meteor. Soc., 78 , 2167–2177.
Charba, J. P., Reynolds D. W. , McDonald B. E. , and Carter G. M. , 2003: Comparative verification of recent quantitative precipitation forecasts in the National Weather Service: A simple approach for scoring forecast accuracy. Wea. Forecasting, 18 , 161–183.
de Elía, R., and Laprise R. , 2003: Distributions-oriented verification of limited-area model forecasts in a perfect-model framework. Mon. Wea. Rev., 131 , 2492–2509.
DelSole, T., 2005: Predictability and information theory. Part II: Imperfect forecasts. J. Atmos. Sci., 62 , 3368–3381.
DeMaria, M., 2006: Statistical tropical cyclone intensity forecast improvements using GOES and aircraft reconnaissance data. Preprints, 27th Conf. on Hurricanes and Tropical Meteorology, Monterey, CA, Amer. Meteor. Soc., 14A.3. [Available online at http://ams.confex.com/ams/pdfpapers/108035.pdf.].
DeMaria, M., and Kaplan J. , 1994: A Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin. Wea. Forecasting, 9 , 209–220.
DeMaria, M., and Kaplan J. , 1999: An updated Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic and eastern North Pacific basins. Wea. Forecasting, 14 , 326–337.
DeMaria, M., Mainelli M. , Shay L. K. , Knaff J. A. , and Kaplan J. , 2005: Further improvements to the Statistical Hurricane Intensity Prediction Scheme (SHIPS). Wea. Forecasting, 20 , 531–543.
DeMaria, M., Knaff J. A. , and Kaplan J. , 2006: On the decay of tropical cyclone winds crossing narrow landmasses. J. Appl. Meteor. Climatol., 45 , 491–499.
Déqué, M., 2003: Continuous variables. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 97–119.
Elsberry, R. L., Lambert T. , and Boothe M. , 2007: Accuracy of Atlantic and eastern North Pacific tropical cyclone intensity forecast guidance. Wea. Forecasting, 22 , 747–762.
Emanuel, K., DesAutels C. , Holloway C. , and Korty R. , 2004: Environmental control of tropical cyclone intensity. J. Atmos. Sci., 61 , 843–858.
Engel, C., and Ebert E. , 2007: Performance of hourly operational consensus forecasts (OCFs) in the Australian region. Wea. Forecasting, 22 , 1345–1359.
Falkovich, A., Ginis I. , and Lord S. , 2005: Ocean data assimilation and initialization procedure for the coupled GFDL/URI hurricane prediction system. J. Atmos. Oceanic Technol., 22 , 1918–1932.
Franklin, J., cited 2006: National Hurricane Center forecast verification. [Available online at http://www.nhc.noaa.gov/verification].
Jolliffe, I. T., and Stephenson D. B. , 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 240 pp.
Knaff, J. A., DeMaria M. , Sampson C. R. , and Gross J. M. , 2003: Statistical, 5-day tropical cyclone intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18 , 80–92.
Knaff, J. A., Sampson C. R. , and DeMaria M. , 2005: An operational Statistical Typhoon Intensity Prediction Scheme for the western North Pacific. Wea. Forecasting, 20 , 688–699.
Kumar, T. S. V. V., Krishnamurti T. N. , Fiorino M. , and Nagata M. , 2003: Multimodel superensemble forecasting of tropical cyclones in the Pacific. Mon. Wea. Rev., 131 , 574–583.
Kurihara, Y., Tuleya R. E. , and Bender M. A. , 1998: The GFDL hurricane prediction system and its performance in the 1995 hurricane season. Mon. Wea. Rev., 126 , 1306–1322.
Leung, L., and North G. R. , 1990: Information theory and climate prediction. J. Climate, 3 , 5–14.
Maini, P., Kumar A. , Rathore L. S. , and Singh S. V. , 2003: Forecasting maximum and minimum temperatures by statistical interpretation of numerical weather prediction model output. Wea. Forecasting, 18 , 938–952.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12 , 595–600.
Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119 , 1590–1601.
Murphy, A. H., 1996: General decompositions of MSE-based skill scores: Measures of some basic aspects of forecast quality. Mon. Wea. Rev., 124 , 2353–2369.
Murphy, A. H., 1997: Forecast verification. The Economic Value of Weather and Climate Forecasts, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.
Murphy, A. H., and Ehrendorfer M. , 1987: On the relationship between accuracy and value of forecasts in the cost–loss ratio situation. Wea. Forecasting, 2 , 243–251.
Murphy, A. H., and Winkler R. L. , 1987: A general framework for forecast verification. Mon. Wea. Rev., 115 , 1330–1338.
Murphy, A. H., and Wilks D. S. , 1998: A case study in the use of statistical models in forecast verification: Precipitation probability forecasts. Wea. Forecasting, 13 , 795–810.
Murphy, A. H., Brown B. G. , and Chen Y. , 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4 , 485–501.
Myrick, D. T., and Horel J. D. , 2006: Verification of surface temperature forecasts from the National Digital Forecast Database over the western United States. Wea. Forecasting, 21 , 869–892.
Nachamkin, J. E., Chen S. , and Schmidt J. , 2005: Evaluation of heavy precipitation forecasts using composite-based methods: A distributions-oriented approach. Mon. Wea. Rev., 133 , 2163–2177.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.
Schulz, E. W., Kepert J. D. , and Greenslade D. , 2007: An assessment of marine surface winds from the Australian Bureau of Meteorology numerical weather prediction systems. Wea. Forecasting, 22 , 613–636.
Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. J. Climate, 13 , 2389–2403.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. Academic Press, 648 pp.
Wilks, D. S., and Godfrey C. M. , 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. J. Climate, 15 , 1369–1377.
APPENDIX
MSE Decompositions
CR-based MSE decomposition
Equation (A6) differs from Eq. (8) in how the MSE due to the shapes of the conditional distributions is handled, expressing it as the difference of two terms [“uncertainty” and “resolution,” the first two terms on the right-hand side of Eq. (A6)] rather than as a single term. When comparing homogeneous forecast sets, uncertainty is constant, so differences in MSEshape are due exclusively to differences in resolution. Resolution can be qualitatively estimated from the type I conditional bias comparative scatterplot (Fig. 11), whereas MSEshape cannot, so Eq. (A6) is arguably a better companion to that scatterplot than Eq. (8). However, in the context of Fig. 14, it is more intuitive to interpret MSE as the sum of the two positive terms of Eq. (8) than as the positive and negative contributions of Eq. (A6).
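The two forms being compared can be written out explicitly. Since Eqs. (8) and (A6) are not reproduced in this excerpt, the expressions below are a reconstruction consistent with the surrounding discussion and with the standard calibration–refinement decomposition; s(f) denotes the marginal distribution of forecasts, and μx|f and σ²x|f the conditional mean and variance of the observations given the forecast.

```latex
% Eq. (8)-type form: MSE = (shape of q(x|f)) + (type I conditional bias)
\mathrm{MSE} = \sum_f s(f)\,\sigma^2_{x|f} + \sum_f s(f)\,\bigl(f - \mu_{x|f}\bigr)^2

% Eq. (A6)-type form: shape term expanded as uncertainty minus resolution
\mathrm{MSE} = \sigma_x^2 - \sum_f s(f)\,\bigl(\mu_{x|f} - \mu_x\bigr)^2
             + \sum_f s(f)\,\bigl(f - \mu_{x|f}\bigr)^2
```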
LBR-based MSE decomposition
Like the CR-based MSE decomposition of Eq. (A6), Eq. (A12) also expresses the MSE due to the shapes of the conditional distributions as the difference of two terms rather than as a single term. Since σf² can vary substantially between forecast systems, there is no advantage to using Eq. (A12) as a companion to the type II conditional bias comparative scatterplot (Fig. 12) rather than Eq. (9), as differences in MSEshape between forecast systems cannot be inferred from Σx t(x)(μf|x − μf)² alone. Like the CR-based MSE decomposition of Eq. (8), the LBR-based MSE decomposition of Eq. (9) also seems preferable for use in Fig. 14.
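The corresponding LBR-based forms can be reconstructed in the same hedged sense, with t(x) the marginal distribution of observations and μf|x and σ²f|x the conditional mean and variance of the forecasts given the observation.

```latex
% Eq. (9)-type form: MSE = (shape of r(f|x)) + (type II conditional bias)
\mathrm{MSE} = \sum_x t(x)\,\sigma^2_{f|x} + \sum_x t(x)\,\bigl(x - \mu_{f|x}\bigr)^2

% Eq. (A12)-type form: shape term expanded using the forecast variance
\mathrm{MSE} = \sigma_f^2 - \sum_x t(x)\,\bigl(\mu_{f|x} - \mu_f\bigr)^2
             + \sum_x t(x)\,\bigl(x - \mu_{f|x}\bigr)^2
```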
Fig. 1. Joint distribution of official NHC forecasts and observations at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Dots mark all (f, x) for which there is nonzero relative frequency in the corresponding verification data sample. The colors represent the magnitude of the relative frequency, according to the following scale: 0 < p(f, x) ≤ 0.0025 (purple); 0.0025 < p(f, x) ≤ 0.005 (dark blue); 0.005 < p(f, x) ≤ 0.01 (light blue); 0.01 < p(f, x) ≤ 0.015 (green); 0.015 < p(f, x) ≤ 0.025 (yellow); 0.025 < p(f, x) ≤ 0.05 (orange); and 0.05 < p(f, x) ≤ 1 (red). The thin black line marks the diagonal, where f = x.
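The relative frequencies shown as colored dots in this and the following joint-distribution figures can be estimated directly from a verification sample. The sketch below is a minimal illustration, assuming paired forecast and observed intensities that are already discrete (e.g., reported in 5-kt increments); the function and variable names are illustrative and not taken from the paper.

```python
from collections import Counter

def joint_relative_frequency(forecasts, observations):
    """Estimate p(f, x) as empirical relative frequencies over a
    verification sample of paired (forecast, observation) values.
    Assumes both are already discrete (e.g., intensities in 5-kt
    increments)."""
    pairs = list(zip(forecasts, observations))
    counts = Counter(pairs)
    n = len(pairs)
    return {fx: c / n for fx, c in counts.items()}

# Hypothetical usage: each key (f, x) with nonzero relative frequency
# corresponds to one colored dot in the figure.
p = joint_relative_frequency([60, 65, 65, 70], [55, 65, 70, 70])
```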
Fig. 2. As in Fig. 1 but for the GFDL model forecasts.
Fig. 3. As in Fig. 1 but for the Decay-SHIPS model forecasts.
Fig. 4. As in Fig. 1 but for the SHF5 model forecasts.
Fig. 5. Marginal distributions of OFCL forecasts (dashed red), GFDL forecasts (dashed green), DSHP forecasts (dashed dark blue), SHF5 forecasts (dashed light blue), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation and the gray triangle marks the median observation in each panel.
Fig. 6. As in Fig. 1 but for persistence forecasts. A persistence forecast is defined to take the value of the operationally designated initial intensity; thus, the joint distributions here can be interpreted as weighted scatterplots of the training data used to estimate the linear statistical model coefficients in Eq. (3). The magenta line in each panel shows the best linear fit, in the least squares sense. Its slope and intercept are used as the coefficients in the linear statistical model for each lead time.
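The least squares fit described in this caption can be sketched as follows: for each lead time, the slope and intercept of the regression of the verifying intensity on the operationally designated initial intensity serve as the linear model coefficients, and the coefficient of determination is the quantity tabulated with them. The function below is an illustrative assumption, not the code used in the paper.

```python
import numpy as np

def fit_slr_coefficients(initial_intensity, observed_intensity):
    """Least squares fit of the verifying intensity on the operationally
    designated initial intensity for a single lead time. Returns the
    slope and intercept used as linear model coefficients, plus the
    coefficient of determination."""
    x = np.asarray(initial_intensity, dtype=float)
    y = np.asarray(observed_intensity, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r_squared = 1.0 - residuals.var() / y.var()
    return slope, intercept, r_squared
```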
Fig. 7. As in Fig. 1 but for forecasts from the SLR model described in the text.
Fig. 8. Marginal distributions of the SHF5 forecasts (dashed light blue), SLR forecasts (dashed magenta), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation in each panel. Note that the probability range is twice as great as in Fig. 5.
Fig. 9. Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the observations given the forecast, q(x|f). A boxplot for the marginal distribution of observations, t(x), is shown to the left of the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram at the bottom of each panel represents the marginal distribution of forecasts, s(f). The solid gray line marks the diagonal, where f = x.
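To make the construction of these boxplots concrete, the sketch below groups the observations by forecast value and computes quantiles of each conditional distribution q(x|f), skipping conditioning values with fewer than 10 cases as noted in the footnotes; the quantile levels and names are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def conditional_quantiles(forecasts, observations, min_n=10,
                          probs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Quantiles of the conditional distributions q(x|f), estimated by
    pooling all observations that share the same forecast value.
    Conditioning values with fewer than min_n cases are skipped."""
    by_forecast = defaultdict(list)
    for f, x in zip(forecasts, observations):
        by_forecast[f].append(x)
    return {f: np.quantile(xs, probs)
            for f, xs in by_forecast.items() if len(xs) >= min_n}
```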
Fig. 10. Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the forecasts given the observation, r(f|x). A boxplot for the marginal distribution of forecasts, s(f), is shown below the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram on the left of each panel represents the marginal distribution of observations, t(x). The solid gray line marks the diagonal, where f = x.
Fig. 11. Type I conditional bias comparative scatterplot at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark (f, μx|f) for all values of f predicted by the forecast system. In each panel, the solid black line marks the diagonal and the dashed black line marks the value of the mean observation, μx.
Fig. 12. Type II conditional bias comparative scatterplot at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark (μf|x, x) for all values of x. In each panel, the solid black line marks the diagonal and the dashed black line marks a representative value of the mean forecast, μf, as described in the text.
Fig. 13. The (a) ME and (b) MAE as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems.
Fig. 14. (a) MSE as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. (b) MSE due to the shapes of the conditional distributions q(x|f) (dashed) and MSE due to type I conditional bias (dotted), the two terms in the CR-based MSE decomposition of Eq. (8). (c) MSE due to the shapes of the conditional distributions r(f|x) (dashed) and MSE due to type II conditional bias (dotted), the two terms in the LBR-based MSE decomposition of Eq. (9).
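The terms plotted in (b) and (c) can be computed directly from a verification sample in which conditioning is done on the exact forecast and observed values, as in the joint-distribution figures above. The sketch below is an illustrative assumption of one way to do so, not the paper's code.

```python
import numpy as np
from collections import defaultdict

def mse_decompositions(forecasts, observations):
    """Total MSE and its two-term partitions:
    CR-based  [Eq. (8)-type]: shape of q(x|f) + type I conditional bias
    LBR-based [Eq. (9)-type]: shape of r(f|x) + type II conditional bias"""
    f = np.asarray(forecasts, dtype=float)
    x = np.asarray(observations, dtype=float)
    n = len(f)
    mse = float(np.mean((f - x) ** 2))

    # Condition on the forecast value (CR-based terms).
    by_f = defaultdict(list)
    for fi, xi in zip(f, x):
        by_f[fi].append(xi)
    shape_cr = sum(len(xs) / n * np.var(xs) for xs in by_f.values())
    bias_1 = sum(len(xs) / n * (fi - np.mean(xs)) ** 2
                 for fi, xs in by_f.items())

    # Condition on the observed value (LBR-based terms).
    by_x = defaultdict(list)
    for fi, xi in zip(f, x):
        by_x[xi].append(fi)
    shape_lbr = sum(len(fs) / n * np.var(fs) for fs in by_x.values())
    bias_2 = sum(len(fs) / n * (xi - np.mean(fs)) ** 2
                 for xi, fs in by_x.items())

    return {"mse": mse, "cr": (shape_cr, bias_1), "lbr": (shape_lbr, bias_2)}
```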
Table 1. Sample size, N, as a function of lead time for the homogeneous verification data samples described in the text.
Table 2. Coefficients for the SLR model relating the operationally designated initial intensity to the observed intensity, as a function of lead time. Coefficients of determination for the linear relationship are also listed.
Notation in this paper follows that of Murphy (1997).
Public dissemination of official forecasts with lead times beyond 72 h has only occurred since 2003.
Archived forecasts and best-track observations were obtained from the “A-decks” and “B-decks,” respectively, of the NHC’s digital forecast database (information online at ftp://ftp.nhc.noaa.gov/pub/atcf/; accessed November 2006).
According to the NHC Web site (http://www.nhc.noaa.gov/verification), realizations are only included if the storm is classified as tropical or subtropical at both the forecast initialization time and verification time.
Note that for GFDL, forecasts asymptote toward the “wrong” value: 85 kt instead of 60 kt. This may partly reflect the greater difficulty of developing a dynamical model toward the goal of optimal forecast production: the model dynamics constrain the possible trajectories that can be produced (perhaps excluding the optimal one), whereas DSHP and SHF5, as statistical models, are subject to no such constraint.
The box and whiskers are not shown for conditional distribution estimates based on a sample size of less than 10.
Strictly, the panels should show the mean forecast for each of the five forecast systems. For graphical clarity, a single “representative” μf is used, equal to μx, as μf is generally within 5 kt of μx (see Fig. 13a).
This bias was corrected in the version of the decay model used operationally, starting in 2005.
Lines of constant f − x parallel the diagonal, where f − x = 0.