  • Bender, M. A., and Ginis I., 2000: Real-case simulations of hurricane–ocean interaction using a high-resolution coupled model: Effects on hurricane intensity. Mon. Wea. Rev., 128, 917–946.
  • Bender, M. A., Ginis I., Tuleya R., Thomas B., and Marchok T., 2007: The operational GFDL coupled hurricane–ocean prediction system and summary of its performance. Mon. Wea. Rev., 135, 3965–3989.
  • Bradley, A. A., Hashino T., and Schwartz S. S., 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.
  • Bradley, A. A., Schwartz S. S., and Hashino T., 2004: Distributions-oriented verification of ensemble streamflow predictions. J. Hydrometeor., 5, 532–545.
  • Brooks, H. E., and Doswell C. A. III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288–303.
  • Brooks, H. E., Witt A., and Eilts M. D., 1997: Verification of public weather forecasts available via the media. Bull. Amer. Meteor. Soc., 78, 2167–2177.
  • Charba, J. P., Reynolds D. W., McDonald B. E., and Carter G. M., 2003: Comparative verification of recent quantitative precipitation forecasts in the National Weather Service: A simple approach for scoring forecast accuracy. Wea. Forecasting, 18, 161–183.
  • de Elía, R., and Laprise R., 2003: Distributions-oriented verification of limited-area model forecasts in a perfect-model framework. Mon. Wea. Rev., 131, 2492–2509.
  • DelSole, T., 2005: Predictability and information theory. Part II: Imperfect forecasts. J. Atmos. Sci., 62, 3368–3381.
  • DeMaria, M., 2006: Statistical tropical cyclone intensity forecast improvements using GOES and aircraft reconnaissance data. Preprints, 27th Conf. on Hurricanes and Tropical Meteorology, Monterey, CA, Amer. Meteor. Soc., 14A.3. [Available online at http://ams.confex.com/ams/pdfpapers/108035.pdf.]
  • DeMaria, M., and Kaplan J., 1994: A Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin. Wea. Forecasting, 9, 209–220.
  • DeMaria, M., and Kaplan J., 1999: An updated Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic and eastern North Pacific basins. Wea. Forecasting, 14, 326–337.
  • DeMaria, M., Mainelli M., Shay L. K., Knaff J. A., and Kaplan J., 2005: Further improvements to the Statistical Hurricane Intensity Prediction Scheme (SHIPS). Wea. Forecasting, 20, 531–543.
  • DeMaria, M., Knaff J. A., and Kaplan J., 2006: On the decay of tropical cyclone winds crossing narrow landmasses. J. Appl. Meteor. Climatol., 45, 491–499.
  • Déqué, M., 2003: Continuous variables. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 97–119.
  • Elsberry, R. L., Lambert T., and Boothe M., 2007: Accuracy of Atlantic and eastern North Pacific tropical cyclone intensity forecast guidance. Wea. Forecasting, 22, 747–762.
  • Emanuel, K., DesAutels C., Holloway C., and Korty R., 2004: Environmental control of tropical cyclone intensity. J. Atmos. Sci., 61, 843–858.
  • Engel, C., and Ebert E., 2007: Performance of hourly operational consensus forecasts (OCFs) in the Australian region. Wea. Forecasting, 22, 1345–1359.
  • Falkovich, A., Ginis I., and Lord S., 2005: Ocean data assimilation and initialization procedure for the coupled GFDL/URI hurricane prediction system. J. Atmos. Oceanic Technol., 22, 1918–1932.
  • Franklin, J., cited 2006: National Hurricane Center forecast verification. [Available online at http://www.nhc.noaa.gov/verification.]
  • Jolliffe, I. T., and Stephenson D. B., 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 240 pp.
  • Knaff, J. A., DeMaria M., Sampson C. R., and Gross J. M., 2003: Statistical, 5-day tropical cyclone intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18, 80–92.
  • Knaff, J. A., Sampson C. R., and DeMaria M., 2005: An operational Statistical Typhoon Intensity Prediction Scheme for the western North Pacific. Wea. Forecasting, 20, 688–699.
  • Kumar, T. S. V. V., Krishnamurti T. N., Fiorino M., and Nagata M., 2003: Multimodel superensemble forecasting of tropical cyclones in the Pacific. Mon. Wea. Rev., 131, 574–583.
  • Kurihara, Y., Tuleya R. E., and Bender M. A., 1998: The GFDL hurricane prediction system and its performance in the 1995 hurricane season. Mon. Wea. Rev., 126, 1306–1322.
  • Leung, L., and North G. R., 1990: Information theory and climate prediction. J. Climate, 3, 5–14.
  • Maini, P., Kumar A., Rathore L. S., and Singh S. V., 2003: Forecasting maximum and minimum temperatures by statistical interpretation of numerical weather prediction model output. Wea. Forecasting, 18, 938–952.
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
  • Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.
  • Murphy, A. H., 1996: General decompositions of MSE-based skill scores: Measures of some basic aspects of forecast quality. Mon. Wea. Rev., 124, 2353–2369.
  • Murphy, A. H., 1997: Forecast verification. The Economic Value of Weather and Climate Forecasts, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.
  • Murphy, A. H., and Ehrendorfer M., 1987: On the relationship between accuracy and value of forecasts in the cost–loss ratio situation. Wea. Forecasting, 2, 243–251.
  • Murphy, A. H., and Winkler R. L., 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
  • Murphy, A. H., and Wilks D. S., 1998: A case study in the use of statistical models in forecast verification: Precipitation probability forecasts. Wea. Forecasting, 13, 795–810.
  • Murphy, A. H., Brown B. G., and Chen Y., 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501.
  • Myrick, D. T., and Horel J. D., 2006: Verification of surface temperature forecasts from the National Digital Forecast Database over the western United States. Wea. Forecasting, 21, 869–892.
  • Nachamkin, J. E., Chen S., and Schmidt J., 2005: Evaluation of heavy precipitation forecasts using composite-based methods: A distributions-oriented approach. Mon. Wea. Rev., 133, 2163–2177.
  • Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.
  • Schulz, E. W., Kepert J. D., and Greenslade D., 2007: An assessment of marine surface winds from the Australian Bureau of Meteorology numerical weather prediction systems. Wea. Forecasting, 22, 613–636.
  • Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. J. Climate, 13, 2389–2403.
  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. Academic Press, 648 pp.
  • Wilks, D. S., and Godfrey C. M., 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. J. Climate, 15, 1369–1377.

  • Fig. 1.

    Joint distribution of official NHC forecasts and observations at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Dots mark all (f, x) for which there is nonzero relative frequency in the corresponding verification data sample. The colors represent relative frequency magnitude, according to the following scale: 0 < p(f, x) ≤ 0.0025 (purple); 0.0025 < p(f, x) ≤ 0.005 (dark blue); 0.005 < p(f, x) ≤ 0.01 (light blue); 0.01 < p(f, x) ≤ 0.015 (green); 0.015 < p(f, x) ≤ 0.025 (yellow); 0.025 < p(f, x) ≤ 0.05 (orange); and 0.05 < p(f, x) ≤ 1 (red). The thin black line marks the diagonal, where f = x.

  • Fig. 2.

    As in Fig. 1 but for the GFDL model forecasts.

  • Fig. 3.

    As in Fig. 1 but for the Decay-SHIPS model forecasts.

  • Fig. 4.

    As in Fig. 1 but for the SHF5 model forecasts.

  • Fig. 5.

    Marginal distributions of OFCL forecasts (dashed red), GFDL forecasts (dashed green), DSHP forecasts (dashed dark blue), SHF5 forecasts (dashed light blue), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation and the gray triangle marks the median observation in each panel.

  • Fig. 6.

    As in Fig. 1 but for persistence forecasts. A persistence forecast is defined to take the value of the operationally designated initial intensity; thus, the joint distributions here can be interpreted as weighted scatterplots of the training data used to estimate the linear statistical model coefficients in Eq. (3). The magenta line in each panel shows the best linear fit, in the least squares sense. Its slope and intercept are used as the coefficients in the linear statistical model for each lead time.

  • Fig. 7.

    As in Fig. 1 but for forecasts from the SLR model described in the text.

  • Fig. 8.

    Marginal distributions of the SHF5 forecasts (dashed light blue), SLR forecasts (dashed magenta), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation in each panel. Note that the probability range is twice as great as in Fig. 5.

  • Fig. 9.

    Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the observations given the forecast, q(x|f). A boxplot for the marginal distribution of observations, t(x), is shown to the left of the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram at the bottom of each panel represents the marginal distribution of forecasts, s(f). The solid gray line marks the diagonal, where f = x.

  • Fig. 10.

    Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the forecasts given the observation, r(f|x). A boxplot for the marginal distribution of forecasts, s(f), is shown below the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram on the left of each panel represents the marginal distribution of observations, t(x). The solid gray line marks the diagonal, where f = x.

  • Fig. 11.

    Type I conditional bias comparative scatterplot, at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark (f, μ_{x|f}) for all values of f predicted by the forecast system. In each panel, the solid black line marks the diagonal and the dashed black line marks the value of the mean observation, μ_x.

  • Fig. 12.

    Type II conditional bias comparative scatterplot, at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark (μ_{f|x}, x) for all values of x. In each panel, the solid black line marks the diagonal and the dashed black line marks a representative value of the mean forecast, μ_f, as described in the text.

  • Fig. 13.

    The (a) ME and (b) MAE, as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems.

  • Fig. 14.

    (a) MSE, as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. (b) MSE due to the shapes of the conditional distributions q(x|f) (dashed) and MSE due to type I conditional bias (dotted), the two terms in the CR-based MSE decomposition of Eq. (8). (c) MSE due to the shapes of the conditional distributions r(f|x) (dashed) and MSE due to type II conditional bias (dotted), the two terms in the LBR-based MSE decomposition of Eq. (9).


A Case Study of Deterministic Forecast Verification: Tropical Cyclone Intensity

  • 1 Massachusetts Institute of Technology, Cambridge, Massachusetts

Abstract

Deterministic predictions of tropical cyclone (TC) intensity from operational forecast systems traditionally have been verified with a summary accuracy measure (e.g., mean absolute error). Since the forecast system development process is coupled to the verification procedure, it follows that TC intensity forecast systems have been developed with the goal of producing predictions that optimize the chosen summary accuracy measure. Here, the consequences of this development process for the quality of the resultant forecasts are diagnosed through a distributions-oriented (DO) verification of operational TC intensity forecasts. DO verification techniques examine the full relationship between a set of forecasts and the corresponding set of observations (i.e., forecast quality), rather than just the accuracy attribute of that relationship.

The DO verification results reveal similar first-order characteristics in the quality of predictions from four TC intensity forecast systems. These characteristics are shown to be consistent with the theoretical response of a forecast system to the imposed goal of summary accuracy measure optimization: production of forecasts that asymptote with lead time to the central tendency of the observed distribution. While such forecasts perform well with respect to the accuracy, unconditional bias, and type I conditional bias attributes of forecast quality, they perform poorly with respect to type II conditional bias. Thus, it is clear that optimization of forecast accuracy is not equivalent to optimization of forecast quality. Ultimately, developers of deterministic forecast systems must take care to employ a verification procedure that promotes good performance with respect to the most desired attributes of forecast quality.

Corresponding author address: Jonathan R. Moskaitis, MIT, Rm. 54-1721, 77 Massachusetts Ave., Cambridge, MA 02139. Email: jonmosk@mit.edu

1. Introduction

It is common practice in the atmospheric sciences to evaluate a set of forecasts by using scalar summary measures to quantify specific attributes of the quality of the relationship between the forecasts and the corresponding observations (i.e., forecast quality). This approach is known as “measures oriented” verification (Murphy 1997). Familiar examples of summary measures include mean absolute error and mean error, which are used to evaluate accuracy and unconditional bias, respectively. Such summary measures, however, provide a very limited description of the complex relationship between a set of forecasts and the corresponding set of observations. This was recognized by Murphy and Winkler (1987), who conceived of a new approach to forecast evaluation called distributions-oriented verification (also known as diagnostic verification), which aims to analyze forecast quality as comprehensively as possible rather than attempt to sum it up with just one number.

Specifically, distributions-oriented (DO) verification is concerned with analysis of the joint probability distribution of forecasts and observations, p(f, x), which describes all time-independent information about the forecasts, f; the corresponding observations, x; and their relationship¹ (Murphy and Winkler 1987). The literature on DO verification describes estimation of the joint distribution based on a verification data sample, {(f_k, x_k); k = 1, ..., N}, and methods of interpreting the joint distribution (for a general overview, see Murphy 1997; Jolliffe and Stephenson 2003; Wilks 2006). Studies concerning deterministic forecasts of a continuous scalar variable have utilized a primitive model in estimation of the joint distribution, in which p(f, x) is represented by a set of discrete categories (each covering a range of forecast and observed values) that are assigned probabilities according to the empirical relative frequency of occurrence in the verification data sample (Murphy et al. 1989; Brooks and Doswell 1996; Brooks et al. 1997; de Elía and Laprise 2003). DO verification studies concerning probabilistic forecasts of a dichotomous variable have primarily used statistical models in the estimation of the joint distribution, reducing the number of quantities that need to be estimated relative to the primitive model–based technique (Murphy and Wilks 1998; Wilks 2000; Wilks and Godfrey 2002; Bradley et al. 2003, 2004). While the methods used to interpret the joint distribution are largely dependent on how it is estimated, the aforementioned authors emphasize the ability of the DO approach to highlight deficiencies in forecast quality that would be missed by the traditional measures-oriented approach, potentially providing useful feedback to modelers and forecasters.

Although the conceptual outlook and methodologies of DO verification are slowly being adopted (e.g., Charba et al. 2003; Maini et al. 2003; Nachamkin et al. 2005; Myrick and Horel 2006; Elsberry et al. 2007; Engel and Ebert 2007; Schulz et al. 2007), it is not clear whether their potential to direct changes that improve forecast quality is being generally realized. In the specific context of a deterministic forecast system predicting a scalar variable, the goal of optimizing a summary measure of forecast accuracy (e.g., mean absolute error, mean squared error) is often the sole driver of forecast system development. Modelers and forecasters know how their predictions will be evaluated, and as such, strive to optimize the appropriate summary accuracy measure through changes to model formulation or forecasting technique. Thus, forecast system development is driven through a process of summary accuracy measure optimization. This framework of interaction between forecast verification procedure and forecaster/modeler implicitly assumes that forecast accuracy serves as a proxy for the broader concept of forecast quality. Indeed, forecast accuracy is intimately related to forecast quality, but exactly how summary accuracy measure optimization influences the full relationship between forecasts and observations is unclear.

To address this issue, a primitive model–based DO verification approach is used here to investigate the quality of scalar predictions from deterministic forecast systems that have been developed through summary accuracy measure optimization. The goal is to explore the consequences of driving forecast system development with summary accuracy measure optimization for the full scope of forecast quality. Operational deterministic tropical cyclone (TC) intensity forecasts and the corresponding observations will serve as verification data samples for DO verification, as TC intensity forecast system development is primarily driven by summary accuracy measure optimization (Bender et al. 2007; Franklin cited 2006; DeMaria et al. 2005; Knaff et al. 2005; Emanuel et al. 2004; Knaff et al. 2003; Kumar et al. 2003). The particular choice to verify TC intensity forecasts is motivated in part by the socioeconomic importance of such predictions, especially for situations involving a TC expected to make landfall. For these forecasts, it is thus especially important to reach a comprehensive understanding of their quality, and of how its first-order features are shaped by summary accuracy measure optimization. Although forecast quality and value are not precisely synonymous (Murphy and Ehrendorfer 1987; Murphy 1997), the more complete view of forecast quality revealed through DO verification (relative to measures-oriented verification) should allow users to better optimize their decisions (Murphy and Winkler 1987; Wilks 2000, and references therein).

The remainder of this paper is organized as follows. First, the details of the verification data samples are described in section 2. Then, section 3 introduces a graphical representation of the joint probability distribution of forecasts and observations, the fundamental instrument of DO verification, and applies it to display the joint distributions for some select verification data samples. Section 4 utilizes two factorizations of the joint distribution into the product of a marginal distribution and conditional distributions to further analyze the verification data samples and diagnose the influence of summary accuracy measure optimization. Section 5 discusses the use of summary measures in a role complementary to the DO verification approach, before a summary of the results and concluding thoughts are presented in section 6.
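For reference, the two factorizations mentioned above can be written out explicitly. The symbols q, s, r, and t follow the notation of the figure captions of this paper, which also use the labels CR and LBR; reading those abbreviations as "calibration–refinement" and "likelihood–base rate" is an assumption based on the terminology of Murphy and Winkler (1987).

```latex
\begin{align}
p(f, x) &= q(x \mid f)\, s(f) && \text{(CR factorization)} \\
p(f, x) &= r(f \mid x)\, t(x) && \text{(LBR factorization)}
\end{align}
```

Here s(f) and t(x) are the marginal distributions of the forecasts and observations, q(x | f) is the conditional distribution of the observations given the forecast, and r(f | x) is the conditional distribution of the forecasts given the observation.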

2. Verification data samples

The variable of interest here is tropical cyclone intensity, a continuous scalar variable representing the 1-min maximum sustained surface wind of a tropical cyclone (as defined by the National Hurricane Center). Intensity lends itself to the discretization involved in a primitive model–based DO verification, as observations of intensity are traditionally reported in multiples of 5 kt. Thus, an observation falls into one of roughly n_x = 30 categories (e.g., 67.5–72.5 kt, 72.5–77.5 kt, etc.), given the range of intensities over which TCs are known to exist. Forecasts are categorized into an analogous set of roughly n_f = 30 categories. Assuming n_x = n_f = 30, the joint distribution is approximated with n_f n_x = 900 categories, each assigned a probability p(f_i, x_j) according to the empirical relative frequency of the forecast falling in the ith category and the observation in the jth category (i = 1, ..., n_f; j = 1, ..., n_x) in the verification data sample. Such a primitive model–based estimate of the joint distribution has a dimensionality of 899, the number of relative frequencies needed to completely determine p(f_i, x_j) (Murphy 1991). An extensive verification data sample is needed to estimate the relative frequencies in such a high-dimensional approximation of the joint distribution [henceforth, “joint distribution” will be understood to mean the primitive model–based estimate, and will be denoted as p(f, x)].
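The primitive model–based estimate described above can be illustrated with a minimal sketch. The function names, the 5-kt snapping rule, and the sample values below are illustrative assumptions and are not taken from the operational verification software. The dimensionality helper reflects the fact that the relative frequencies must sum to one, which leaves n_f n_x − 1 free values.

```python
from collections import Counter

BIN_WIDTH_KT = 5  # intensities are reported in multiples of 5 kt


def to_category(value_kt, width=BIN_WIDTH_KT):
    """Snap an intensity to the centre of its category (67.5-72.5 kt -> 70)."""
    return int(width * ((value_kt + width / 2) // width))


def joint_distribution(pairs):
    """Empirical relative frequency p(f, x) from (forecast, observation) pairs."""
    counts = Counter((to_category(f), to_category(x)) for f, x in pairs)
    n = len(pairs)
    return {cell: count / n for cell, count in counts.items()}


def dimensionality(n_f, n_x):
    """Free relative frequencies: all cells minus the sum-to-one constraint."""
    return n_f * n_x - 1


# Made-up verification data sample, for illustration only.
sample = [(70, 65), (70, 65), (68, 75), (100, 110)]
p = joint_distribution(sample)
```

Only cells with nonzero relative frequency are stored, which matches how the joint distribution is displayed later (dots are drawn only where the relative frequency is nonzero).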

In constructing verification data samples, only Atlantic basin TCs are considered, as the highest quality intensity observations are to be found there. These observations are synthesized into a “best track” analyzed intensity for each storm every 6 h while it is in existence; here, these analyses are considered to be the observed intensities for the verification. Forecasts are issued at 6-h intervals as well, for lead times up to 120 h since the 2001 season. Official forecasts for the intensity of Atlantic basin tropical cyclones are produced by forecasters at the National Hurricane Center (NHC).² These forecasts are supported by three primary numerical models (DeMaria et al. 2005). The most basic is the 5-day Statistical Hurricane Intensity Forecast model (SHF5), which uses a multiple linear regression model (trained on data from past TCs) to predict intensity change, given predictors describing the current state of the TC and its recent history (Knaff et al. 2003). The Statistical Hurricane Intensity Prediction Scheme (SHIPS) improves on this concept by including predictors about the environment of the TC, both at the initialization time and at forecast lead times (DeMaria and Kaplan 1994, 1999; DeMaria et al. 2005). Predictors of future storm environmental conditions are calculated from the Global Forecast System (GFS) weather model forecast, at future storm positions predicted by the NHC. A postprocessing routine is applied to the raw SHIPS intensity forecast to account for any land-induced decay that may occur if the NHC track forecast brings the TC over or near land. The result, called the Decay-SHIPS (DSHP) forecast, is provided to forecasters. The last of the three primary models is the Geophysical Fluid Dynamics Laboratory–University of Rhode Island (GFDL–URI) coupled hurricane–ocean model, as run at the National Centers for Environmental Prediction (NCEP; henceforth this model will be referred to as the GFDL). The GFDL is a dynamical model, in contrast to the statistically based SHF5 and DSHP. It couples a nested-grid atmosphere centered on the TC (Kurihara et al. 1998) to an ocean model (Bender and Ginis 2000) to explicitly account for interaction between the TC and ocean. Forecasts from the three aforementioned models (SHF5, DSHP, and GFDL) and the official forecasts from the NHC (which will be abbreviated as OFCL) are all verified in this study.

The verification data samples used here are homogeneous in the OFCL, SHF5, DSHP, and GFDL forecasts, and span the 2001–05 Atlantic basin seasons.³ Homogeneity in the four types of forecasts requires that for a given TC, forecast initialization time, and lead time, all four forecasts exist and can be verified against an existing best-track observation. This ensures that any comparison between forecast systems is fair, as each is verified over the same set of situations. In contrast to current practice in the NHC’s verification methodology,⁴ forecast–observation pairs are not excluded here according to storm classification at the forecast initialization time and verification time. Thus, some forecast–observation pairs included here pertain to situations where the best-track storm classification is extratropical, tropical wave, or remnant low. Table 1 shows the sample size for each lead time in the homogeneous verification data samples. Sample size decreases with lead time because long-lead forecasts do not exist at the beginning of the TC’s life (forecasts are initialized once a weather system is defined as a TC) or are not made because dissipation is expected to occur. Sample sizes are on the order of the dimension of the joint distribution of forecasts and observations, even with 5 yr of data considered. Still, interpretation of the joint distribution and its relatives is useful, as will be shown subsequently.
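The homogeneity requirement amounts to intersecting the sets of cases available for each forecast system and for the best track. A minimal sketch follows, assuming a hypothetical data layout in which each case is keyed by (storm, initialization time, lead time); neither the layout nor the function name comes from the NHC archive.

```python
def homogeneous_sample(forecasts_by_system, best_track):
    """Keep only the cases that every system forecast and that can be verified.

    forecasts_by_system: {system: {(storm, init_time, lead_h): forecast_kt}}
    best_track:          {(storm, init_time, lead_h): observed_kt}
    Returns {system: [(forecast, observation), ...]} over the common cases.
    """
    common = set(best_track)
    for cases in forecasts_by_system.values():
        common &= set(cases)
    ordered = sorted(common)  # same case order for every system
    return {
        system: [(cases[key], best_track[key]) for key in ordered]
        for system, cases in forecasts_by_system.items()
    }


# Made-up example: the 72-h case is dropped because GFDL has no forecast for it.
forecasts = {
    "OFCL": {("A", "t0", 36): 70, ("A", "t0", 72): 80},
    "GFDL": {("A", "t0", 36): 75},
}
observed = {("A", "t0", 36): 65, ("A", "t0", 72): 90}
pairs = homogeneous_sample(forecasts, observed)
```

Because every system is reduced to the same ordered list of cases, the resulting samples have equal size at each lead time and can be compared directly.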

Before continuing, it must be noted that the DSHP and GFDL models underwent significant changes over the 5 yr encompassed in the verification data samples. The predictors used in DSHP are constantly evolving, as documented in DeMaria et al. (2005) for the 2001–03 seasons. Further updates to DSHP for the 2004 and 2005 seasons are described by DeMaria (2006). These include new predictors based on satellite observations, a new postprocessing routine to account for interaction with narrow landmasses (based on DeMaria et al. 2006), and an adjustment of the SST predictor to account for ocean mixing processes under the eyewall. The GFDL has undergone many significant changes since the coupled version was implemented operationally in 2001. These include upgraded physics, vastly increased resolution of the atmospheric model, and improved initialization procedures for both the atmosphere and ocean (Bender et al. 2007; Falkovich et al. 2005). Thus, the verification results presented here will not necessarily reflect the performance characteristics of the latest version of these models. However, the modelers’ (and forecasters’) goal of summary accuracy measure optimization has not changed [e.g., Knaff et al. (2003), DeMaria et al. (2005), Franklin (2006), and Bender et al. (2007) for SHF5, DSHP, OFCL, and GFDL forecasts, respectively], and it is argued here that this dominates the first-order nature of the DO verification results.

3. The joint distribution

The methodology of DO verification is heavily dependent on graphical representations to convey the rich information contained in the joint distribution of forecasts and observations. Graphical representation of the joint distribution itself builds upon the concept of the scatterplot, the basic tool for exploratory analysis of paired data. Whereas the density of points in a scatterplot of f versus x gives a qualitative sense of the relative frequency of (f, x) pairs, a graphical representation of the joint distribution must quantify this information. Within the context of a primitive model–based estimate of the joint distribution, there is a grid of (f, x) category relative frequencies to represent. For a very large number of categories, it is advantageous to contour the relative frequency field, as was done for the (f, x) frequency field in Engel and Ebert (2007) and Schulz et al. (2007). For relatively few categories, the relative frequencies can be represented individually. Bivariate histograms are used by Potts (2003) and Murphy et al. (1989) for this purpose, but the three-dimensional perspective employed to view the histograms makes interpretation difficult, as pointed out by Wilks (2006). Wilks (2006) instead suggests utilizing a “glyph scatterplot” to display the relative frequencies of the joint distribution, where some characteristic of the scattered points corresponds to the relative frequency magnitude.

Here, the graphical representation style of the joint distribution of forecasts and observations suggested by Wilks (2006) is adopted. Figures 1–4 show the joint distribution pertaining to the OFCL, GFDL, DSHP, and SHF5 forecasts, respectively. Each figure shows the joint distribution at four different lead times in the following panels: (a) 0, (b) 36, (c) 72, and (d) 120 h. Dots are drawn for all ( f, x) with nonzero relative frequency in the corresponding verification data sample, with the colors representing the magnitude of the relative frequency according to the nonlinear scale detailed below Fig. 1.

Along with the joint distribution of forecasts and observations, each panel in Figs. 1–4 has a thin black line along the diagonal, where f = x. If a set of deterministic forecasts were perfect, all dots would be clustered along this diagonal. One can see that this is not the case for the TC intensity forecasts, even at the 0-h lead time, as operational analyses of intensity do not necessarily match the corresponding best-track values (which are based partially on observations taken after the time in question). At the 36-h lead time, all four forecast sets show a widening of the joint distribution about the (diagonal) major axis, indicating a growing proportion of large forecast errors. For the OFCL, DSHP, and SHF5 forecasts, the joint distribution has about equal probability on both sides of the diagonal, but for the GFDL forecasts, probability is concentrated below the diagonal, where f > x. This is evidence of unconditional high bias in the GFDL forecasts, which will be quantified in section 5 using a measures-oriented approach.

By the 72-h lead time in Figs. 1–4, one can see that not only is each joint distribution widening about its major axis, but this axis is also rotating into a more vertical orientation. This is especially evident for the SHF5 forecasts, but is subtly present in the joint distributions for the other three forecast sets. Such rotation of the joint distribution is symptomatic of growing conditional bias: for high-intensity observations, the forecasts are generally too low, and for low-intensity observations, the forecasts are generally too high. Moving on to the 120-h lead time, Figs. 1–4 show that all four forecast sets have substantial conditional bias, and if anything, probability has gathered in toward the nearly vertical major axis of the distribution, instead of spreading out further. The capability of DO verification to reveal conditional bias will be discussed in greater detail in section 4, as detection of conditional bias is an advantage of DO verification over the traditional measures-oriented approach (which cannot detect conditional bias at all).

4. Marginal and conditional distributions

a. Factorizations of the joint distribution

The joint distribution of forecasts and observations can be expressed as a product of marginal and conditional distributions in two distinct factorizations: one conditioning on the forecasts and the other conditioning on the observations (Murphy and Winkler 1987). The calibration-refinement factorization, which conditions on the forecasts, is written as
$$p(f, x) = q(x \mid f)\, s(f) \tag{1}$$
where s( f ) is the marginal distribution of the forecasts (i.e., forecast distribution) and q(x|f ) represents the set of conditional distributions of the observations given the forecast. Note that there is only one s( f ), but there is a separate conditional distribution for each category of f. The second factorization is similar, but with the conditionality on the observations. It is called the likelihood-base rate factorization:
$$p(f, x) = r(f \mid x)\, t(x) \tag{2}$$
where t(x) is the marginal distribution of the observations (i.e., observed distribution) and r( f|x) represents the set of conditional distributions of the forecasts given the observation. Again, there is only one t(x), but there is a separate conditional distribution for each category of observation.
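The two factorizations can be illustrated with a minimal sketch that estimates p( f, x) as category relative frequencies and then derives the marginal and conditional distributions from it. The function names and the six (forecast, observation) pairs are hypothetical and chosen only for illustration; they are not taken from the verification data samples.

```python
from collections import Counter, defaultdict

def joint_distribution(pairs):
    """Relative frequency p(f, x) of each (forecast, observation) category pair."""
    counts = Counter(pairs)
    n = len(pairs)
    return {fx: c / n for fx, c in counts.items()}

def calibration_refinement(p):
    """Return s(f) and q(x|f), conditioning on the forecasts."""
    s = defaultdict(float)
    for (f, x), pr in p.items():
        s[f] += pr
    q = {(f, x): pr / s[f] for (f, x), pr in p.items()}
    return dict(s), q

def likelihood_base_rate(p):
    """Return t(x) and r(f|x), conditioning on the observations."""
    t = defaultdict(float)
    for (f, x), pr in p.items():
        t[x] += pr
    r = {(f, x): pr / t[x] for (f, x), pr in p.items()}
    return dict(t), r

# Hypothetical (forecast, observation) intensity pairs in kt, for illustration only.
pairs = [(60, 55), (60, 65), (60, 65), (80, 65), (80, 90), (100, 90)]
p = joint_distribution(pairs)
s, q = calibration_refinement(p)
t, r = likelihood_base_rate(p)

# Both factorizations recover the same joint distribution.
for (f, x), pr in p.items():
    assert abs(q[(f, x)] * s[f] - pr) < 1e-12
    assert abs(r[(f, x)] * t[x] - pr) < 1e-12
```

Note that `s` and `t` each hold a single distribution, whereas `q` and `r` hold one conditional distribution per forecast category and per observation category, respectively.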

b. Marginal distribution analysis

The aforementioned marginal and conditional distributions are useful tools in DO verification, as each draws out relevant aspects of p( f, x) that are not easy to directly analyze from Figs. 1–4. First, consider the marginal distributions s( f ) and t(x). For a perfect set of deterministic forecasts, the marginal distribution of those forecasts would be exactly the same as the marginal distribution of the observations. However, equivalence of the two marginal distributions does not necessarily imply a perfect set of forecasts, just that the forecast distribution is consistent with the sample climatology. This is because individual forecasts can be erroneous, but taken as a whole, the set of forecast values can still be distributed as the observations. So, ultimately, comparative analysis of the marginal distributions is most informative when the two are different, as this is an unequivocal sign of a flawed forecast set. Furthermore, the nature of the differences can allow one to infer some reasons for the discrepancy, as will be seen for the TC intensity forecasts.

The literature shows a number of possibilities for the graphical comparison of marginal distributions. Boxplots (see section 4e) for each marginal distribution can be juxtaposed to compare the sample quantiles of the observations with the sample quantiles of the corresponding forecasts (Murphy et al. 1989; Potts 2003). A more detailed comparison of the form of the (discrete) marginal distributions can be carried out with the aid of relative frequency histograms (Potts 2003; Elsberry et al. 2007). However, when multiple forecast distributions are to be compared with an observed distribution, the relative frequency histogram approach becomes unwieldy. An alternative way to compare the forms of the marginal distributions is to superimpose estimates of the underlying (continuous) forecast distributions upon an estimate of the underlying (continuous) observed distribution. This is done in Fig. 5 for the four sets of TC intensity forecasts, with each panel showing a different lead time. The forecast distributions are plotted with dashed lines, while the observed distribution is plotted in solid black. The continuous marginal distributions are estimated with a nonparametric kernel density smoothing technique (Wilks 2006), using Gaussian kernels and a 5-kt smoothing bandwidth.
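A minimal sketch of the kernel density smoothing step is given here, assuming the standard Gaussian-kernel estimator with a fixed 5-kt bandwidth. The `gaussian_kde` helper and the intensity samples are hypothetical; the exact estimator configuration used for Fig. 5 may differ in detail.

```python
import math

def gaussian_kde(samples, bandwidth=5.0):
    """Return a density function built from Gaussian kernels of fixed bandwidth (kt)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(v):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return norm * sum(math.exp(-0.5 * ((v - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical intensity samples in kt, for illustration only.
observed = [30, 35, 45, 50, 60, 65, 90, 120]
forecast = [45, 50, 55, 60, 60, 65, 70, 80]

obs_pdf = gaussian_kde(observed)
fcst_pdf = gaussian_kde(forecast)

# Evaluate both smoothed distributions on a common grid so they can be superimposed.
curves = [(v, obs_pdf(v), fcst_pdf(v)) for v in range(0, 181, 5)]
```

In this toy sample the forecast curve exceeds the observed curve near 60 kt and falls below it near 120 kt, which is the kind of over- and underpopulation that the superimposed curves are meant to expose.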

In each panel in Fig. 5 the values of the mean and median of the observed distribution are marked along the abscissa with a black triangle and a gray triangle, respectively. There are slight differences in the central tendencies of the observed distributions among the four lead times, indicative of differences in the observed distributions themselves. This is due to subsampling of the full set of observations (distributed as in Fig. 5a) in accordance with the lead time of the corresponding forecast set. For example, the observed distribution in Fig. 5d only reflects observations taken when 120-h lead time forecasts are verifying. In the days immediately after a TC forms, there are no 120-h forecasts to verify, as forecasts are initiated only upon TC formation. Thus, the observed distributions corresponding to sets of forecasts at longer lead times are increasingly depleted of low-intensity observations characteristic of formative TCs.

The primary feature in Fig. 5, however, is the divergence with lead time of the four forecast distributions from the observed distribution. At the 36-h lead time, the forecast distributions fall into two groups. The OFCL, DSHP, and SHF5 forecast sets are overpopulated (relative to the observed distribution) in the 40–80-kt intensity range and underpopulated elsewhere. The GFDL shows somewhat different behavior, overpopulating the 70–110-kt intensity range, while leaving the 30–60-kt range and the highest intensities deficient. These general patterns largely persist through the 72- and 120-h lead times, becoming increasingly amplified with lead time, especially for SHF5 and GFDL. By 120 h, SHF5 virtually eliminates forecasts below 30 kt and above 100 kt, instead favoring a very narrow forecast distribution centered at 60 kt. Though not as extreme as SHF5, GFDL also lacks a sufficient number of forecasts at the lowest and highest intensities, while overpredicting the number of low- to moderate-intensity hurricanes (65–105 kt). The OFCL and DSHP show a similar pattern, but instead overpopulate strong tropical storm and weak hurricane intensities (50–90 kt).

While there are substantial differences among the forecast distributions shown in Fig. 5, the overall theme that emerges is an increasing tendency with lead time to predict moderate intensities rather than those at the low and high ends of the observed intensity range. This behavior can be explained as a response to the use of summary accuracy measure optimization as the driving principle behind TC intensity forecast system development. Summary accuracy measure optimization is explicit in the formulation of the two statistical models, DSHP and SHF5, that use multiple linear regression to find an optimal relationship between predictors and intensity change. Precisely, the linear relationship is optimal in the least squares sense, implying that these statistical models are designed to produce forecasts that minimize the mean squared error (MSE). The manner in which satisfying the demand of MSE minimization ends up producing the characteristic peaked forecast distributions in Fig. 5 is perhaps best demonstrated within the context of a very simple statistical forecast model.

c. Performance of a single linear regression (SLR) model

Consider a linear statistical model that predicts observed intensity, I, solely on the basis of the operationally designated initial intensity, Iinit:
$$I(t) = a(t)\, I_{\mathrm{init}} + b(t) \tag{3}$$
The coefficients a and b are calculated for each lead time, t, via single linear regression of operationally designated initial intensities onto observed intensities, using the set of t = 0 OFCL forecasts and corresponding best-track observations (as described in section 2) as training data. This process can be visualized with the aid of the joint distribution of persistence forecasts and observations, as shown in Fig. 6 at four different lead times. A persistence forecast is defined here as a forecast of the operationally designated initial intensity. Thus, the joint distribution of persistence forecasts and observations can be thought of as a weighted scatterplot of operationally designated initial intensities versus the observed intensities. The best-fit (in the least squares sense) line relating these two quantities is shown in magenta in each panel of Fig. 6. The slope of the magenta line, a(t), and its intercept, b(t), are used as the coefficients in the linear statistical model of Eq. (3). The resultant intensity forecast model will henceforth be called the single linear regression (SLR) model.

The coefficients of the aforementioned linear regression are listed in Table 2, along with the corresponding coefficients of determination, for each lead time. There is a gradual deterioration of the relationship between the initial and observed intensities, with the two variables basically unrelated at the 120-h lead time. As seen in Fig. 6, this causes the best-fit line to rotate from the diagonal at t = 0 to nearly horizontal at t = 120. While the slope tails off toward zero with increasing lead time, the intercept approaches the mean observation of the training data.
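The behavior of the slope and intercept in the two limiting cases can be reproduced with a short least squares sketch. The `fit_slr` helper and both training sets are hypothetical; they are constructed so that one set has a perfect relationship between initial and observed intensity and the other has exactly zero covariance.

```python
def fit_slr(initial, observed):
    """Least squares slope a and intercept b for observed ~ a * initial + b."""
    n = len(initial)
    mean_i = sum(initial) / n
    mean_o = sum(observed) / n
    sxx = sum((i - mean_i) ** 2 for i in initial)
    sxy = sum((i - mean_i) * (o - mean_o) for i, o in zip(initial, observed))
    a = sxy / sxx
    b = mean_o - a * mean_i
    return a, b

# Hypothetical training data (kt): a perfect relationship, as at a short lead time.
a_short, b_short = fit_slr([30, 60, 90], [30, 60, 90])

# Hypothetical training data (kt): zero covariance, as at a very long lead time.
a_long, b_long = fit_slr([40, 40, 80, 80], [50, 70, 50, 70])

print(a_short, b_short)  # slope 1, intercept 0: the best-fit line is the diagonal
print(a_long, b_long)    # slope 0, intercept 60: every forecast is the mean observation
```

Because the intercept is b = (mean observation) − a × (mean initial intensity), a slope near zero forces the intercept toward the mean observation of the training data.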

The deterioration of the initial intensity–observed intensity relationship with lead time has profound implications for the quality of forecasts from the SLR model. Consider the set of SLR forecasts homogeneous with those in the verification data samples described in section 2, created by applying Eq. (3) with the coefficients in Table 2. The joint distribution for this set of forecasts and corresponding observations is shown in Fig. 7, at four different lead times. The SLR joint distribution essentially shows exaggerated versions of the primary traits seen in the joint distributions for the OFCL, GFDL, DSHP, and SHF5 forecast sets in Figs. 1–4: 1) rotation of the major axis of the joint distribution from diagonal to vertical as lead time increases and 2) initial widening of the distribution about the major axis at early lead times, followed by contraction at later lead times. The joint distribution for the SLR model is most similar to that of the SHF5 model in Fig. 4, which is reasonable, as SLR is closest to SHF5 in the nature of its statistical model formulation. However, all the forecast systems show this same general behavior.

As one would expect from the joint distribution, the marginal distribution of the SLR forecasts also displays an exaggerated version of the main pattern seen in Fig. 5: the preponderance of moderate intensity forecasts at the expense of low- and high-intensity predictions (relative to the observed distribution). Figure 8 shows the marginal distributions of SLR forecasts and SHF5 forecasts superimposed on the marginal distribution of the observations, in the manner of Fig. 5. The SLR and SHF5 forecast distributions are very similar in nature, with both distributions sharpening with lead time, as their respective modes converge toward the mean observation (marked by a black triangle along the abscissa). Again, while the SHF5 model has a forecast distribution most similar to that of the SLR model, all forecast systems show the same general behavior.

d. Inferring the influence of summary accuracy measure optimization

Although the SLR model is a vastly simpler forecast system than the operational intensity forecast systems introduced earlier, all these forecast systems share the same first-order characteristics of the joint distribution and forecast distribution. Here, it is argued that all are responding in the same qualitative manner to summary accuracy measure optimization, by sharpening the forecast distribution as lead time increases. This sharpening of the forecast distribution is manifested in the joint distribution as a rotation of the major axis from the diagonal into a more vertical orientation as lead time increases, and an attendant contraction of the distribution about the more vertical major axis. Such an evolution of the features of the joint distribution is necessary to accommodate the sharpening of the forecast distribution.

Summary accuracy measure optimization is explicit in the formulation of the SLR model, as the linear statistical model coefficients are chosen to minimize MSE (in the dependent, training data). These coefficients change drastically with lead time in response to the deterioration of the relationship between initial intensity and observed intensity, causing the range of intensity values forecasted by the SLR model to shrink as lead time increases. The shrinking range of SLR-forecasted values can be explained by considering the two limiting cases in the relationship between initial intensity and observed intensity. In the limit of a perfect relationship (e.g., at t = 0 h, approximately), the range of SLR-forecasted values is exactly the same as the range of observed values, as the forecast always equals the observed intensity. This can be seen by comparing the t = 0 h marginal distribution of SLR forecasts and the marginal distribution of the observations in Fig. 8a. In the opposite limit of no relationship between the initial and observed intensities (e.g., at t = 120 h, approximately), the SLR model always predicts the same value: the mean observation in the training data. With no useful information provided by the predictor, this is simply the course of action that must be taken to minimize MSE. At t = 120 h, the marginal distribution of SLR forecasts in Fig. 8d shows SLR-predicted values in a small range around the value of the mean observation. The SLR-forecasted intensity values here are not all the same (see Fig. 7d), but it is clear that the marginal distribution of SLR forecasts has sharpened substantially relative to that at the 0-h lead time. The marginal distributions of SLR forecasts at 36 and 72 h represent intermediate cases between the two limiting scenarios described above.

The similarity of the joint distributions and forecast distributions for the SLR and SHF5 models is relatively straightforward to understand, as both statistical models are explicitly designed to minimize the MSE of the forecasts. Thus, the two models behave in the same way as their predictors become uncorrelated with observed intensity at later lead times. With its considerably more substantial array of predictors, SHF5 can take advantage of better correlations at later lead times than those available to the SLR model (see Table 2), delaying the onset of behaviors seen in the SLR joint and forecast distributions. The same logic extends to DSHP, with its even larger array of useful predictors than SHF5. However, DSHP also has the ability to decay intensities predicted by the statistical model in a postprocessing scheme, which may be partly responsible for the differences in its forecast distribution relative to that of SHF5 (cf. dark blue and light blue curves in Fig. 5). These two models have very similar forecast distributions for intensities above 70 kt at all lead times, but show increasingly divergent behavior below 70 kt as lead time increases. SHF5 consolidates forecasts around the mean observation, while DSHP forms two modes: one near the mean observation and a secondary mode near 30 kt. Perhaps this secondary mode in the DSHP forecast distribution is due to the effects of the decay model applied during postprocessing.

To understand the reasons for the similarity of the joint distributions and forecast distributions for the statistical models to those for the GFDL and OFCL, the concept of a response to summary accuracy measure optimization must be generalized in terms of optimal deterministic forecasts. An optimal deterministic forecast optimizes the expected value of a particular summary accuracy measure, with the expectation calculated over the true forecast probability distribution. Production of optimal deterministic forecasts is the ultimate goal of driving forecast system development with summary accuracy measure optimization. The optimal deterministic forecast is slightly different for forecast trajectories verified with MAE relative to those verified with MSE: The MAE-optimal forecast trajectory is the time evolution of the median of the true forecast probability distribution, while the time evolution of the mean of the true forecast probability distribution minimizes MSE. In the case of TC intensity, there is no obvious reason that the time evolution of these two central tendencies of the true forecast probability distribution should be radically different. Thus, whether an intensity forecast system is driven explicitly to minimize MSE (e.g., the statistical models) or implicitly to minimize MAE (e.g., GFDL, OFCL), the goal is the same: production of intensity forecast trajectories that match the time evolution of the central tendency of the true forecast probability distribution.
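The distinction between the MAE-optimal and MSE-optimal deterministic forecasts can be checked numerically with a brute-force search. In this sketch, a small hypothetical sample of equally likely outcomes stands in for the true forecast probability distribution, which is not known in practice.

```python
import statistics

def expected_mse(forecast, outcomes):
    """Expected squared error of a single forecast value over equally likely outcomes."""
    return sum((forecast - x) ** 2 for x in outcomes) / len(outcomes)

def expected_mae(forecast, outcomes):
    """Expected absolute error of a single forecast value over equally likely outcomes."""
    return sum(abs(forecast - x) for x in outcomes) / len(outcomes)

# Hypothetical, positively skewed sample of possible intensities (kt).
outcomes = [35, 45, 50, 55, 60, 70, 105]

# Search all integer forecast values from 20 to 140 kt for the minimizers.
candidates = range(20, 141)
best_mse = min(candidates, key=lambda f: expected_mse(f, outcomes))
best_mae = min(candidates, key=lambda f: expected_mae(f, outcomes))

print(best_mse, statistics.mean(outcomes))    # the mean minimizes expected MSE
print(best_mae, statistics.median(outcomes))  # the median minimizes expected MAE
```

For this skewed sample the two optimal forecasts differ by 5 kt; for a more symmetric distribution they would nearly coincide.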

While the time evolution of the central tendency of a true forecast probability distribution cannot be known exactly, its basic characteristics can be inferred. For TC intensity prediction, a true forecast probability distribution is fairly sharp at t = 0, representing uncertainty in the initial intensity of a TC. It can reasonably be assumed that such a distribution is centered on the operationally designated initial intensity. As lead time increases, the true forecast probability distribution ultimately evolves to take a form similar to that of the observed distribution in Fig. 5d, as uncertainty saturates. Note that the mean and median of the observed distribution in Fig. 5d are very similar, both near 60 kt. Thus, the time evolution of the central tendency of the true forecast probability distribution can be described as starting at the operationally designated initial intensity and asymptoting with lead time to about 60 kt.

Consider the evolution with lead time of the marginal distribution of a set of such optimal forecasts. At the initial time, the marginal distribution of forecasts would be the same as the marginal distribution of observations (operationally designated intensities). As lead time advances, the marginal distribution of forecasts would become sharper and sharper, as forecasts asymptote toward 60 kt. This is exactly the pattern seen in the evolution of the marginal distribution of forecasts for the OFCL, GFDL, DSHP, SHF5, and SLR forecast systems, as shown in Figs. 5 and 8. All these forecast systems are responding to summary accuracy measure optimization in the same qualitative manner (i.e., trying, however imperfectly, to predict optimal deterministic forecasts), regardless of the specific summary accuracy measure used (MAE or MSE) or how it is optimized. The primary difference among the forecast systems is how quickly the forecasts asymptote to 60 kt; this depends on the capability of the model or forecaster to predict the time trajectory of the central tendency of the true forecast probability distribution.

e. Conditional distribution analysis

Given the characteristic traits imbued to the marginal distribution of forecasts, s( f ), and the joint distribution of forecasts and observations, p( f, x), by the response of a forecast system to summary accuracy measure optimization, one would also anticipate some distinguishing characteristics to emerge in the set of conditional distributions of the observations given the forecast, q(x| f ), and the set of conditional distributions of the forecasts given the observation, r( f |x). Because q(x| f ) and r( f |x) represent sets of distributions, comparative analysis of the q(x| f ) or r( f |x) pertaining to each of the forecast systems cannot be done as for the marginal distributions in section 4b. Effective display of the conditional distributions for a particular forecast system is a challenging task in its own right. The approach taken by previous investigators has been to plot certain quantiles of each of the conditional distributions (Déqué 2003; Maini et al. 2003; Schulz et al. 2007), perhaps accompanied by one or both of the marginal distributions (Murphy et al. 1989; de Elía and Laprise 2003). The resultant figure is generally referred to as a conditional quantile diagram.

Figures 9 and 10 show conditional quantile diagrams pertaining to the OFCL TC intensity forecasts (at four different lead times), with the conditioning on the forecast intensity and observed intensity, respectively. In Fig. 9, five quantiles of the q(x| f ) are displayed using a set of boxplots (Potts 2003; Wilks 2006). In each boxplot, 1) the box extends from the lower quartile to the upper quartile, 2) the whiskers extend from the upper and lower quartiles to the extrema, and 3) the median is marked with a circle. A boxplot for t(x), the marginal (i.e., unconditional) distribution of observations, is included on the left side of each panel in Fig. 9, for comparison to the q(x| f ). Finally, a histogram representing the marginal distribution of forecasts, s( f ), is included at the bottom of each panel in Fig. 9. Figure 10 is similar to Fig. 9 but shows boxplots for the r( f |x) and s( f ), as well as a histogram for t(x).

Consider first the set of conditional distributions of the observations given the OFCL forecast, as displayed in Fig. 9. The range (extent of the whiskers) and interquartile range (extent of the box) of the conditional distributions expand with lead time, reflecting growing uncertainty in the observed intensity that follows a particular OFCL intensity prediction. Nonetheless, Fig. 9d shows that at the 120-h lead time, the conditional distributions of the observations still substantially differ from the unconditional distribution of the observations. In particular, the conditional medians still generally align along the diagonal, where f = x, as is the case at the other lead times (see Figs. 9a–c). Thus, it can be inferred from Fig. 9 that the OFCL forecasts have little type I conditional bias.

Type I conditional bias (often called reliability or calibration) describes deviation of a forecast value from the mean observation given that forecast value, f − μx|f, where μx|f is calculated from q(x| f ). Further investigation of type I conditional bias, for all the TC intensity forecast systems, can be carried out with the aid of type I conditional bias comparative scatterplots, as shown in Fig. 11. In Fig. 11, there are five sets of colored dots in each panel, each set corresponding to one of the five forecast systems. A set of dots marks ( f, μx|f) for all values of f predicted by a particular forecast system at the given lead time. With five forecast systems, there is then a maximum of five dots lined up vertically at each value of f (at the early lead times, they overlap a bit). The vertical displacement of the dots from the solid black diagonal line shows the magnitude and direction of the type I conditional bias in the forecast sets. There appears to be a slight overall tendency for positive type I conditional biases ( f > μx|f) for high-intensity forecasts and negative type I conditional biases ( f < μx|f) for low-intensity forecasts, but this tendency is not particularly pronounced.

Hence, it can be concluded from Figs. 9 and 11 that all five forecast systems qualitatively have little type I conditional bias, which in and of itself is a desirable feature of a forecast system. A user can expect that the observed intensity, in an average sense, will be near the forecast intensity. To accomplish this, however, the forecast systems have had to sacrifice the refinement of their marginal distributions of forecasts (i.e., the marginal distributions of forecasts are too sharp). For an extreme example of this, consider the SLR model (magenta dots) in Fig. 11. Note that the number of dots radically decreases with increasing lead time, as the range of forecasted values collapses down to only those near μx, consistent with the marginal distribution of forecasts in Fig. 8. Because of this, it is easy for the 120-h lead time SLR forecasts to be type I conditionally unbiased, as if the mean observation is always predicted, the type I conditional bias is precisely zero ( f = μx = μx|f). The sharpening of the marginal distribution of forecasts in response to summary accuracy measure optimization favors a low-magnitude type I conditional bias.

The analysis of the conditional distributions, however, is not yet complete. Consider now, the set of conditional distributions of the forecasts given the observation, r( f|x). For example, the r( f|x) for the OFCL forecasts are displayed in Fig. 10. Here, it is apparent that the range and interquartile range of the conditional distributions expand substantially during the first 72 h of the forecast (but not during the 72–120-h interval), reflecting growing uncertainty in the forecast intensity that precedes a particular observed intensity value. Figure 10 also shows that the conditional distributions of the forecasts and the unconditional distribution of the forecasts become more similar to each other as lead time increases. The systematic migration of the conditional median forecasts away from the diagonal and toward the unconditional median forecast is of particular interest, as this is evidence of increasing type II conditional bias with lead time.

Type II conditional bias describes the deviation of a mean forecast given an observed value from that observed value, μf|x − x, where μf|x is calculated from r( f|x). As for type I conditional bias, the investigation of type II conditional bias for all the TC intensity forecast systems is aided by comparative scatterplots, as shown in Fig. 12. Here, a set of dots marks (μf|x, x) for all values of x observed at the given lead time. The dots line up horizontally, with five dots for every observed value of x. In these type II conditional bias comparative scatterplots, the forecast systems have no “choice” over the number of dots (corresponding to conditional distributions) that will exist, as this is wholly controlled by the observations. The horizontal displacement of the dots from the solid black diagonal line shows the magnitude and direction of the type II conditional bias in the forecast sets. Figure 12 shows that there is little type II conditional bias at the initial time, but as lead time increases, the μf|x all migrate closer to the dashed black line, which marks a representative value of μf , the mean forecast. At the 120-h lead time, some forecast systems (the three statistical models, in particular) are near the limiting case where μf|x = μf for all x, meaning that the mean forecast conditioned on the observation is the same as the mean forecast. In other words, the distribution of forecasts that precede a particular observation is basically the same for every observation; the forecast system cannot “discriminate” (Murphy and Winkler 1987) between different observations.
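The contrast between the two types of conditional bias can be made concrete with a small sketch of the limiting case in which a forecast system always predicts the mean observation. The `conditional_means` helper and the three (forecast, observation) pairs are hypothetical.

```python
from collections import defaultdict

def conditional_means(pairs):
    """Return the mean observation given each forecast and the mean forecast given each observation."""
    by_f, by_x = defaultdict(list), defaultdict(list)
    for f, x in pairs:
        by_f[f].append(x)
        by_x[x].append(f)
    mu_x_given_f = {f: sum(v) / len(v) for f, v in by_f.items()}
    mu_f_given_x = {x: sum(v) / len(v) for x, v in by_x.items()}
    return mu_x_given_f, mu_f_given_x

# Hypothetical long-lead forecasts (kt) that always equal the mean observation of 60 kt.
pairs = [(60, 30), (60, 60), (60, 90)]
mu_x_given_f, mu_f_given_x = conditional_means(pairs)

# Type I conditional bias: forecast minus the mean observation given that forecast.
type1 = {f: f - mu for f, mu in mu_x_given_f.items()}
# Type II conditional bias: mean forecast given an observation minus that observation.
type2 = {x: mu - x for x, mu in mu_f_given_x.items()}

print(type1)  # zero for the single forecast value of 60 kt
print(type2)  # +30 kt at the 30-kt observation, 0 at 60 kt, -30 kt at 90 kt
```

The forecasts are type I conditionally unbiased, yet the mean forecast is the same for every observation, so the type II conditional bias is large away from the mean: the forecasts are too high for weak observed intensities and too low for strong ones.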

As opposed to type I conditional bias, it is clear from Figs. 10 and 12 that the TC intensity forecast systems show qualitatively substantial type II conditional bias, with noticeable differences among the forecast systems. OFCL has the least type II conditional bias, while SLR has the most. Both forecast systems are responding to summary accuracy measure optimization by collapsing their marginal distributions of forecasts toward the mean observation, but at different rates according to the ability of the two forecast systems to mimic the time trajectory of the central tendency of the true forecast probability distribution. The type II conditional bias comparative scatterplot (Fig. 12) shows the effects of this phenomenon quite explicitly, whereas it has to be inferred in a very indirect fashion from the type I conditional bias comparative scatterplot (Fig. 11). Qualitatively, these diagrams are useful diagnostics, but for quantitative results concerning the conditional bias of the forecast systems, one can turn to summary measures as a tool to be used in conjunction with the distributions-oriented techniques described thus far.

5. The use of summary measures to complement DO verification techniques

As described in Murphy (1997), summary verification measures can be expressed as functions of the joint distribution of forecasts and observations, p( f, x), or as functions of the components of its factorizations in Eqs. (1) and (2). A pertinent example is mean error, which can be expressed as the difference in the means of the marginal distributions of forecasts and observations:
$$\mathrm{ME} = \mu_f - \mu_x = \sum_f f\, s(f) - \sum_x x\, t(x) \tag{4}$$
Mean error is a measure of unconditional bias. Part of its appeal as a scalar value is that it can be plotted as a function of lead time, as is done in Fig. 13a for the five TC intensity forecast systems. GFDL has the largest magnitude unconditional bias among the five forecast systems, which was qualitatively noted in section 3 from the appearance of its joint distribution. This positive unconditional bias is almost certainly a reflection of GFDL’s tendency to asymptote forecasts toward a value larger than that of the mean observation. DSHP has a negative unconditional bias at the longer lead times, consistent with a tendency of the decay scheme used operationally from 2001 to 2004 to overestimate the weakening of TCs during traverses of islands and peninsulas (DeMaria et al. 2006). Of the operational forecast systems, OFCL has the lowest magnitude unconditional bias.
In verification of TC predictions, forecast accuracy is typically measured with mean absolute error, which is expressed as
$$\mathrm{MAE} = \sum_f \sum_x p(f, x)\, |f - x| \tag{5}$$
in the DO context. With respect to the graphical display of the joint distribution (e.g., Fig. 1), Eq. (5) states that MAE is the weighted [by p( f, x)] sum of the distances of the colored dots to the diagonal. The larger the (weighted) “spread” of dots about the diagonal, the larger the MAE will be. One can see this correspondence between the nature of the joint distributions and MAE by comparing Figs. 1–4 and 7 to Fig. 13b, which shows MAE as a function of lead time for the five forecast systems.
Another commonly used summary accuracy measure is mean squared error, written in the DO context as
$$\mathrm{MSE} = \sum_f \sum_x p(f, x)\,(f - x)^2. \tag{6}$$
Like MAE, MSE can be interpreted in light of the graphical display of the joint distribution as the weighted sum of the (squared) distances of the colored dots from the diagonal. Figure 14a shows MSE as a function of lead time; the relative performance of the five forecast systems is essentially unchanged from that seen in Fig. 13b for MAE.
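As a purely illustrative check of how ME, MAE, and MSE follow from a joint distribution, the sketch below evaluates all three on a small, made-up p(f, x). The toy values in `p` and the function names are assumptions introduced here for illustration; they are not data or code from this study.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(f, x): keys are (forecast, observation)
# pairs, values are relative frequencies that sum to 1. Exact fractions are
# used so that the results carry no rounding error.
p = {
    (30, 30): F(3, 10),
    (30, 40): F(1, 10),
    (40, 30): F(1, 10),
    (40, 40): F(2, 10),
    (50, 40): F(1, 10),
    (50, 60): F(2, 10),
}

def mean_error(p):
    """ME: difference of the marginal means, mu_f - mu_x."""
    mu_f = sum(w * f for (f, x), w in p.items())
    mu_x = sum(w * x for (f, x), w in p.items())
    return mu_f - mu_x

def mean_absolute_error(p):
    """MAE: p-weighted sum of |f - x| over all (f, x) pairs."""
    return sum(w * abs(f - x) for (f, x), w in p.items())

def mean_squared_error(p):
    """MSE: p-weighted sum of (f - x)^2 over all (f, x) pairs."""
    return sum(w * (f - x) ** 2 for (f, x), w in p.items())

me = mean_error(p)
mae = mean_absolute_error(p)
mse = mean_squared_error(p)
```

Only pairs with nonzero relative frequency need to be stored, which mirrors the way the joint distribution figures plot a dot only where p(f, x) > 0.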
MSE has the further advantage that it can be readily decomposed into sums of components describing different attributes of forecast quality, via utilization of the calibration-refinement (CR) and likelihood-base rate (LBR) factorizations of the joint distribution of forecasts and observations. This was first done by Murphy (1973) using the CR factorization in the context of probabilistic forecasts of a binary predictand, then later extended to the LBR factorization in the same context (Murphy and Winkler 1987), and finally advocated for use in any context (Murphy 1996, 1997). Here, related MSE decompositions are developed, along the lines of de Elía and Laprise (2003). These decompositions express the total MSE as the sum of that due to two aspects of the conditional distributions: 1) the inherent forms of the conditional distributions (shape) and 2) the displacements of the conditional distributions from their optimal locations (conditional bias). The form of these MSE decompositions will thus be
$$\mathrm{MSE} = \mathrm{MSE}_{\mathrm{Shape}} + \mathrm{MSE}_{\mathrm{CB}}, \tag{7}$$
where $\mathrm{MSE}_{\mathrm{Shape}}$ and $\mathrm{MSE}_{\mathrm{CB}}$ refer, respectively, to the MSE attributable to the two factors enumerated above. Derivations of the MSE decompositions and a description of their relationship to those developed by Murphy (1997) are contained in the appendix.
The CR-based MSE decomposition is written as
$$\mathrm{MSE} = \sum_f s(f) \sum_x q(x|f)\,(\mu_{x|f} - x)^2 + \sum_f s(f)\,(f - \mu_{x|f})^2. \tag{8}$$
The first term on the rhs of Eq. (8) is the weighted sum of the MSE due to the shape of each of the conditional distributions, q(x|f). It can be interpreted as being the total MSE that would exist if all the q(x|f) were shifted such that $\mu_{x|f} = f$ for all f, that is, if there were no type I conditional bias. This shape term is minimized by sharp conditional distributions, rather than broad ones, rewarding a forecast system in which each f only corresponds to a small number of different x. The second term on the rhs of Eq. (8) is the weighted sum of the MSE due to type I conditional bias, that is, the deviations of $\mu_{x|f}$ from f. It is minimized by keeping dots along the diagonal, within the context of the type I conditional bias comparative scatterplots in Fig. 11. The shape term in Eq. (8) is ignored in the simplifications necessary to produce Fig. 11, which is one reason why it is useful to complement it with the CR-based MSE decomposition. The shape term in Eq. (8) can also serve as a quantitative complement to the ranges and interquartile ranges of the conditional distributions represented in a conditional quantile diagram, such as Fig. 9.

Figure 14b shows the MSE due to the shape term (dashed lines) and the conditional bias term (dotted lines) of the CR-based MSE decomposition for the five forecast systems. The sum of these two components yields the total MSE (Fig. 14a). Clearly, the shape term is the dominant contributor to the total MSE, as the type I conditional bias is quite low for all the forecast systems, especially OFCL and SLR. The conditional bias term quantifies what was observed qualitatively in Fig. 11, while the shape term gives an indication of what the conditional bias comparative scatterplot cannot show, the spread of each of the conditional distributions.
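The CR-based split into a shape term and a type I conditional bias term can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the joint distribution `p` is a made-up toy example, and `cr_decomposition` is a hypothetical helper name, not code used to produce Fig. 14b.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint distribution p(f, x); values sum to 1.
p = {
    (30, 30): F(3, 10), (30, 40): F(1, 10),
    (40, 30): F(1, 10), (40, 40): F(2, 10),
    (50, 40): F(1, 10), (50, 60): F(2, 10),
}

def cr_decomposition(p):
    """Return (shape, conditional_bias) for the CR-based MSE split."""
    s = defaultdict(F)       # marginal distribution of forecasts, s(f)
    sum_x = defaultdict(F)   # sum over x of p(f, x) * x, for each f
    for (f, x), w in p.items():
        s[f] += w
        sum_x[f] += w * x
    # Conditional mean of the observations given each forecast value.
    mu_x_given_f = {f: sum_x[f] / s[f] for f in s}
    # Because p(f, x) = s(f) q(x|f), weighting by p covers both sums.
    shape = sum(w * (mu_x_given_f[f] - x) ** 2 for (f, x), w in p.items())
    cond_bias = sum(s[f] * (f - mu_x_given_f[f]) ** 2 for f in s)
    return shape, cond_bias

shape, cond_bias = cr_decomposition(p)
mse = sum(w * (f - x) ** 2 for (f, x), w in p.items())
```

For this toy distribution the two terms add up exactly to the total MSE, and the shape term is the larger of the two, which is the same qualitative ordering described for the forecast systems above.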

Analogous to the CR-based MSE decomposition, the LBR-based MSE decomposition is written as
$$\mathrm{MSE} = \sum_x t(x) \sum_f r(f|x)\,(\mu_{f|x} - f)^2 + \sum_x t(x)\,(\mu_{f|x} - x)^2. \tag{9}$$
The first term on the rhs of Eq. (9) is the MSE due to the shapes of the conditional distributions r(f|x), and the second term on the rhs of Eq. (9) is the MSE due to type II conditional bias, the deviations of $\mu_{f|x}$ from x. Like the conditional bias term for the CR-based MSE decomposition, this second term on the rhs of Eq. (9) is the only one that can be directly related to the conditional bias comparative scatterplot (Fig. 12). Qualitatively, Fig. 12 does show substantial type II conditional bias, which is reflected in the conditional bias term MSE. This can be seen in Fig. 14c, which shows the MSE due to the shape term (dashed lines) and the conditional bias term (dotted lines) of the LBR-based MSE decomposition for the five forecast systems. The two terms are generally of comparable magnitude, especially for OFCL, GFDL, and DSHP. For these forecast systems, $\mathrm{MSE}_{\mathrm{Shape}}$ levels off around the 72-h lead time, while $\mathrm{MSE}_{\mathrm{CB}}$ continues to grow through the 120-h lead time. SHF5 and SLR show a somewhat different pattern, with $\mathrm{MSE}_{\mathrm{CB}}$ coming to dominate quite early, while $\mathrm{MSE}_{\mathrm{Shape}}$ actually decreases at later lead times. The growing conditional bias and leveling off (or decrease) of the shape term MSE are symptomatic of the convergence of the forecasts toward the mean observation. The LBR-based MSE decomposition of Eq. (9) provides a simple, quantitative measures-oriented technique to document these phenomena that were uncovered by the distributions-oriented approach of sections 3 and 4.
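The LBR-based split can be sketched in the same way, this time conditioning on the observation. As before, the joint distribution `p` is a made-up toy example and `lbr_decomposition` is a hypothetical helper name, offered only to illustrate the bookkeeping.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint distribution p(f, x); values sum to 1.
p = {
    (30, 30): F(3, 10), (30, 40): F(1, 10),
    (40, 30): F(1, 10), (40, 40): F(2, 10),
    (50, 40): F(1, 10), (50, 60): F(2, 10),
}

def lbr_decomposition(p):
    """Return (shape, conditional_bias) for the LBR-based MSE split."""
    t = defaultdict(F)       # marginal distribution of observations, t(x)
    sum_f = defaultdict(F)   # sum over f of p(f, x) * f, for each x
    for (f, x), w in p.items():
        t[x] += w
        sum_f[x] += w * f
    # Conditional mean of the forecasts given each observed value.
    mu_f_given_x = {x: sum_f[x] / t[x] for x in t}
    # Because p(f, x) = t(x) r(f|x), weighting by p covers both sums.
    shape = sum(w * (mu_f_given_x[x] - f) ** 2 for (f, x), w in p.items())
    cond_bias = sum(t[x] * (mu_f_given_x[x] - x) ** 2 for x in t)
    return shape, cond_bias

shape, cond_bias = lbr_decomposition(p)
mse = sum(w * (f - x) ** 2 for (f, x), w in p.items())
```

In this toy case the type II conditional bias term is driven almost entirely by the largest observed value, for which the conditional mean forecast falls well short of the observation.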

6. Summary and conclusions

In this paper, distributions-oriented verification techniques were used to investigate the quality of deterministic tropical cyclone intensity predictions from four operational forecast systems. Development of these operational forecast systems is driven by summary accuracy measure optimization, which is not equivalent to optimization of the broader concept of forecast quality. The primary goal here, then, was to explore the consequences of driving deterministic forecast system development with summary accuracy measure optimization for the full scope of forecast quality, as embodied in the joint distribution of forecasts and observations.

Despite differences among the TC intensity forecast systems in the summary accuracy measure utilized (MSE or MAE) and differences in exactly how the demand of summary accuracy measure optimization was imposed (explicitly in the model formulation, or implicitly over many changes to the forecast system), DO verification of predictions from the forecast systems yielded similar results. Based on the similarities, it was argued that all of the TC intensity forecast systems must be responding to summary accuracy measure optimization in a similar fashion: by asymptoting forecasts toward the central tendency of the observed distribution with lead time, in an attempt to predict the optimal deterministic forecast trajectory. This causes forecasts to become more similar to each other with lead time, resulting in the sharpening of the marginal distribution of forecasts, as seen in Fig. 5. Sharpening of the marginal distribution of forecasts is manifested in the joint distribution of forecasts and observations as a rotation of the major axis of the distribution from the diagonal into the vertical, and a contraction of probability about that major axis (see Figs. 1–4).

The sharpening of the marginal distribution of forecasts is also reflected in the unconditional and conditional biases of the forecasts. A small-magnitude unconditional bias is theoretically promoted by summary accuracy measure optimization, as forecasts asymptote with lead time to a limiting case of zero unconditional bias (all the forecasts are of the mean observation). This limiting case is also one of zero type I conditional bias (conditioning on the forecast), so summary accuracy measure optimization theoretically promotes good performance with respect to this attribute of forecast quality. Indeed, the type I conditional bias comparative scatterplot (Fig. 11) suggests that such conditional bias is of small magnitude (dots are along the diagonal) for the TC intensity forecasts, a conclusion quantitatively bolstered by the type I conditional bias contribution to MSE shown in Fig. 14b. Summary accuracy measure optimization, however, theoretically promotes a large-magnitude type II conditional bias (conditioning on the observations): in the limiting case where all forecasts equal the mean observation, there will be large differences between some observations and the corresponding forecasts given those observations. These large differences are seen in Fig. 12, the type II conditional bias comparative scatterplot for the TC intensity forecasts. The contribution of type II conditional bias to MSE is correspondingly large, as shown in Fig. 14c.

The large-magnitude type II conditional bias in the TC intensity forecasts clearly demonstrates that not all attributes of forecast quality are optimized by driving deterministic forecast system development with summary accuracy measure optimization. The implication of this result is that forecast system developers should be careful not to conflate forecast quality optimization with summary accuracy measure optimization. The decision to use a particular summary measure to guide forecast system development should be based on the attribute(s) of forecast quality considered most desirable, as it is unrealistic to expect optimization of a particular summary measure to promote good forecasts with respect to all attributes of forecast quality. For example, if one values good forecast accuracy, unconditional bias, and type I conditional bias, but is not bothered by very poor type II conditional bias, driving forecast system development with summary accuracy measure optimization is an appropriate choice. However, if one values type II conditional bias above other attributes of forecast quality, one must seek some other summary measure, such as the conditional bias term in the LBR-based MSE decomposition of Eq. (9), to optimize in the forecast system development process.

Ultimately, it would be best to utilize distributions-oriented verification techniques in driving forecast system development, but the complexity of probability distributions (joint, marginal, conditional) hinders straightforward, objective comparison of those from competing forecast systems. For example, it is much easier to pass judgment concerning the relative performance of the TC intensity forecast systems based on the MAE time series in Fig. 13b than based on the 40 total joint distributions corresponding to the eight lead times and five forecast systems represented in Fig. 13b. Thus, the primacy of summary measures in the forecast system development process is likely unavoidable, leaving DO techniques to play a supporting role.

One important aspect of this supporting role for DO verification is the facilitation of “DO dependent” summary measure calculation. DO-dependent summary measures, which can only be expressed as functions of the joint distribution (or its marginal and conditional components), quantify attributes of forecast quality that cannot be measured by prototypical summary measures (e.g., ME, MAE, MSE), which are able to be expressed either as a function of the joint distribution or as a function of the set of forecast–observation realizations. Two examples of a DO-dependent summary measure are the type I conditional bias term of the CR-based MSE decomposition [Eq. (8)] and the type II conditional bias term of the LBR-based MSE decomposition [Eq. (9)], each of which quantifies a conditional bias attribute of forecast quality. As described above, such DO-dependent summary measures aid in the interpretation of the graphical DO verification results, and could potentially be used in forecast system development. The principles of information theory can be used to define another DO-dependent summary measure, quantifying an information content attribute of forecast quality. This summary measure, the mutual information between the forecasts and observations (Leung and North 1990; DelSole 2005), measures the average amount of information a forecast provides about the observation, relative to prior knowledge of the sample climatology. Mutual information has a number of interesting properties, including the ability to verify forecasts categorized into a mixture of nominal and ordinal categories. For example, in the context of TC intensity predictions, a forecast category of “dissipated” could be included along with a set of ordinal categories. The utility of mutual information as a summary measure to complement DO verification of deterministic forecasts will be explored by the author in a future paper.
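To make the mutual information measure concrete, the sketch below uses the standard information-theoretic definition, the p-weighted sum of log2[p(f, x) / (s(f) t(x))]. This is an assumption for illustration: the text does not specify the exact formulation to be used in the future paper, and the function name and toy distributions are hypothetical. Category labels can be any hashable values, so a nominal "dissipated" category can sit alongside numeric intensity categories.

```python
import math
from collections import defaultdict

def mutual_information(p):
    """Mutual information, in bits, between forecasts and observations.

    `p` maps (forecast_category, observed_category) pairs to relative
    frequencies that sum to 1. Categories need not be numeric.
    """
    s = defaultdict(float)  # marginal distribution of forecasts, s(f)
    t = defaultdict(float)  # marginal distribution of observations, t(x)
    for (f, x), w in p.items():
        s[f] += w
        t[x] += w
    # Pairs with zero relative frequency contribute nothing to the sum.
    return sum(w * math.log2(w / (s[f] * t[x]))
               for (f, x), w in p.items() if w > 0)

# Hypothetical toy cases mixing a nominal category with an ordinal one.
perfect = {("dissipated", "dissipated"): 0.5, (65, 65): 0.5}
uninformative = {
    ("dissipated", "dissipated"): 0.25, ("dissipated", 65): 0.25,
    (65, "dissipated"): 0.25, (65, 65): 0.25,
}

mi_perfect = mutual_information(perfect)
mi_uninformative = mutual_information(uninformative)
```

In the first toy case the forecast always matches the observation, so it removes all uncertainty about a two-category, equally likely observation (one bit). In the second, forecasts and observations are statistically independent, so the forecast adds nothing beyond the sample climatology.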

Another route of future investigation is to apply the distributions-oriented approach in the verification of deterministic predictions of tropical cyclone track, the other primary TC-related predictand (besides intensity). The challenge inherent in this endeavor is to formulate a categorization of TC position that sufficiently limits the dimensionality of the joint distribution such that it can be adequately populated by the verification data sample, but still provides sufficient distinction between the predicted category and the observed category (i.e., the forecast and observed categories are not always the same). Discretizing the forecast and observed latitude and longitude of the TC center into 30 categories each, similar to the approach taken here for intensity, would result in a joint distribution with $30^4 - 1$ independent elements, vastly too extensive to populate with a verification data sample similar to those utilized here. Coarser discretization and/or transformation of position into a storm-relative coordinate system are likely needed to move forward with DO verification of deterministic TC track predictions.
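The dimensionality count quoted above is simple arithmetic, sketched here for concreteness. The function name is hypothetical, and the two-variable case is included only as a point of comparison for a single forecast variable paired with a single observed variable at the same number of categories.

```python
def independent_elements(categories_per_variable, n_variables):
    """Number of free probabilities in a joint distribution.

    The joint distribution has categories_per_variable ** n_variables
    cells; one is fixed by the constraint that probabilities sum to 1.
    """
    return categories_per_variable ** n_variables - 1

# Forecast latitude, forecast longitude, observed latitude, observed
# longitude, each discretized into 30 categories.
track_elements = independent_elements(30, 4)

# One forecast variable and one observed variable, 30 categories each.
two_variable_elements = independent_elements(30, 2)
```

The four-variable case has 809,999 independent elements, roughly 900 times as many as the 899 of the two-variable case.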

Acknowledgments

The author thanks two anonymous reviewers, whose insightful comments led to the improvement of this manuscript. The author also thanks Kerry Emanuel and James A. Hansen for their guidance, and acknowledges the support of ONR Grant N00014-05-1-0323.

REFERENCES

  • Bender, M. A., and Ginis I., 2000: Real-case simulations of hurricane–ocean interaction using a high-resolution coupled model: Effects on hurricane intensity. Mon. Wea. Rev., 128, 917–946.

  • Bender, M. A., Ginis I., Tuleya R., Thomas B., and Marchok T., 2007: The operational GFDL coupled hurricane–ocean prediction system and summary of its performance. Mon. Wea. Rev., 135, 3965–3989.

  • Bradley, A. A., Hashino T., and Schwartz S. S., 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.

  • Bradley, A. A., Schwartz S. S., and Hashino T., 2004: Distributions-oriented verification of ensemble streamflow predictions. J. Hydrometeor., 5, 532–545.

  • Brooks, H. E., and Doswell C. A. III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288–303.

  • Brooks, H. E., Witt A., and Eilts M. D., 1997: Verification of public weather forecasts available via the media. Bull. Amer. Meteor. Soc., 78, 2167–2177.

  • Charba, J. P., Reynolds D. W., McDonald B. E., and Carter G. M., 2003: Comparative verification of recent quantitative precipitation forecasts in the National Weather Service: A simple approach for scoring forecast accuracy. Wea. Forecasting, 18, 161–183.

  • de Elía, R., and Laprise R., 2003: Distributions-oriented verification of limited-area model forecasts in a perfect-model framework. Mon. Wea. Rev., 131, 2492–2509.

  • DelSole, T., 2005: Predictability and information theory. Part II: Imperfect forecasts. J. Atmos. Sci., 62, 3368–3381.

  • DeMaria, M., 2006: Statistical tropical cyclone intensity forecast improvements using GOES and aircraft reconnaissance data. Preprints, 27th Conf. on Hurricanes and Tropical Meteorology, Monterey, CA, Amer. Meteor. Soc., 14A.3. [Available online at http://ams.confex.com/ams/pdfpapers/108035.pdf.]

  • DeMaria, M., and Kaplan J., 1994: A Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin. Wea. Forecasting, 9, 209–220.

  • DeMaria, M., and Kaplan J., 1999: An updated Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic and eastern North Pacific basins. Wea. Forecasting, 14, 326–337.

  • DeMaria, M., Mainelli M., Shay L. K., Knaff J. A., and Kaplan J., 2005: Further improvements to the Statistical Hurricane Intensity Prediction Scheme (SHIPS). Wea. Forecasting, 20, 531–543.

  • DeMaria, M., Knaff J. A., and Kaplan J., 2006: On the decay of tropical cyclone winds crossing narrow landmasses. J. Appl. Meteor. Climatol., 45, 491–499.

  • Déqué, M., 2003: Continuous variables. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 97–119.

  • Elsberry, R. L., Lambert T., and Boothe M., 2007: Accuracy of Atlantic and eastern North Pacific tropical cyclone intensity forecast guidance. Wea. Forecasting, 22, 747–762.

  • Emanuel, K., DesAutels C., Holloway C., and Korty R., 2004: Environmental control of tropical cyclone intensity. J. Atmos. Sci., 61, 843–858.

  • Engel, C., and Ebert E., 2007: Performance of hourly operational consensus forecasts (OCFs) in the Australian region. Wea. Forecasting, 22, 1345–1359.

  • Falkovich, A., Ginis I., and Lord S., 2005: Ocean data assimilation and initialization procedure for the coupled GFDL/URI hurricane prediction system. J. Atmos. Oceanic Technol., 22, 1918–1932.

  • Franklin, J., cited 2006: National Hurricane Center forecast verification. [Available online at http://www.nhc.noaa.gov/verification.]

  • Jolliffe, I. T., and Stephenson D. B., 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 240 pp.

  • Knaff, J. A., DeMaria M., Sampson C. R., and Gross J. M., 2003: Statistical, 5-day tropical cyclone intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18, 80–92.

  • Knaff, J. A., Sampson C. R., and DeMaria M., 2005: An operational Statistical Typhoon Intensity Prediction Scheme for the western North Pacific. Wea. Forecasting, 20, 688–699.

  • Kumar, T. S. V. V., Krishnamurti T. N., Fiorino M., and Nagata M., 2003: Multimodel superensemble forecasting of tropical cyclones in the Pacific. Mon. Wea. Rev., 131, 574–583.

  • Kurihara, Y., Tuleya R. E., and Bender M. A., 1998: The GFDL hurricane prediction system and its performance in the 1995 hurricane season. Mon. Wea. Rev., 126, 1306–1322.

  • Leung, L., and North G. R., 1990: Information theory and climate prediction. J. Climate, 3, 5–14.

  • Maini, P., Kumar A., Rathore L. S., and Singh S. V., 2003: Forecasting maximum and minimum temperatures by statistical interpretation of numerical weather prediction model output. Wea. Forecasting, 18, 938–952.

  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.

  • Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.

  • Murphy, A. H., 1996: General decompositions of MSE-based skill scores: Measures of some basic aspects of forecast quality. Mon. Wea. Rev., 124, 2353–2369.

  • Murphy, A. H., 1997: Forecast verification. The Economic Value of Weather and Climate Forecasts, R. W. Katz and A. H. Murphy, Eds., Cambridge University Press, 19–74.

  • Murphy, A. H., and Ehrendorfer M., 1987: On the relationship between accuracy and value of forecasts in the cost–loss ratio situation. Wea. Forecasting, 2, 243–251.

  • Murphy, A. H., and Winkler R. L., 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

  • Murphy, A. H., and Wilks D. S., 1998: A case study in the use of statistical models in forecast verification: Precipitation probability forecasts. Wea. Forecasting, 13, 795–810.

  • Murphy, A. H., Brown B. G., and Chen Y., 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501.

  • Myrick, D. T., and Horel J. D., 2006: Verification of surface temperature forecasts from the National Digital Forecast Database over the western United States. Wea. Forecasting, 21, 869–892.

  • Nachamkin, J. E., Chen S., and Schmidt J., 2005: Evaluation of heavy precipitation forecasts using composite-based methods: A distributions-oriented approach. Mon. Wea. Rev., 133, 2163–2177.

  • Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.

  • Schulz, E. W., Kepert J. D., and Greenslade D., 2007: An assessment of marine surface winds from the Australian Bureau of Meteorology numerical weather prediction systems. Wea. Forecasting, 22, 613–636.

  • Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995–98. J. Climate, 13, 2389–2403.

  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. Academic Press, 648 pp.

  • Wilks, D. S., and Godfrey C. M., 2002: Diagnostic verification of the IRI net assessment forecasts, 1997–2000. J. Climate, 15, 1369–1377.

APPENDIX

MSE Decompositions

CR-based MSE decomposition

The CR-based decomposition of MSE in Eq. (8) is derived by adding and subtracting $(\mu_{x|f} - x)^2$ inside the summations of Eq. (6) to give
$$\mathrm{MSE} = \sum_f \sum_x p(f, x)\,(\mu_{x|f} - x)^2 + \sum_f \sum_x p(f, x)\left[(f - x)^2 - (\mu_{x|f} - x)^2\right]. \tag{A1}$$
After applying the CR factorization of p(f, x), the first term on the rhs of Eq. (A1) is the same as that in Eq. (8). Applying the CR factorization to the second term on the rhs of Eq. (A1) gives $\sum_f s(f) \sum_x q(x|f)\,(f^2 - 2fx - \mu_{x|f}^2 + 2\mu_{x|f}x)$, which reduces to $\sum_f s(f)\,(f^2 - 2f\mu_{x|f} + \mu_{x|f}^2)$, using the fact that $\sum_x q(x|f)\,x = \mu_{x|f}$. Simplifying again yields
$$\sum_f s(f)\,(f - \mu_{x|f})^2, \tag{A2}$$
the second term on the rhs of Eq. (8).
A derivation of the Murphy (1997) CR-based MSE decomposition follows a similar process. One starts by adding and subtracting both $(\mu_{x|f} - x)^2$ and $(\mu_x - x)^2$ inside the summations of Eq. (6), giving
$$\mathrm{MSE} = \sum_f \sum_x p(f, x)\,(\mu_x - x)^2 - \sum_f \sum_x p(f, x)\left[(\mu_x - x)^2 - (\mu_{x|f} - x)^2\right] + \sum_f \sum_x p(f, x)\left[(f - x)^2 - (\mu_{x|f} - x)^2\right]. \tag{A3}$$
There are now three terms on the rhs of Eq. (A3), rather than two, as in Eq. (A1). The first term on the rhs of Eq. (A3) is the MSE that would be earned by always predicting the mean observation. It does not depend on the forecasts and simplifies to
$$\sum_x t(x)\,(\mu_x - x)^2 = \sigma_x^2 \tag{A4}$$
in the notation of Murphy (1997). For the second term on the rhs of Eq. (A3) (apart from its leading minus sign), apply the CR factorization of p(f, x) to get $\sum_f s(f) \sum_x q(x|f)\,(\mu_x^2 - 2\mu_x x - \mu_{x|f}^2 + 2\mu_{x|f}x)$, and use $\sum_x q(x|f)\,x = \mu_{x|f}$ to simplify it to $\sum_f s(f)\,(\mu_x^2 - 2\mu_x\mu_{x|f} + \mu_{x|f}^2)$. Further simplification yields
$$\sum_f s(f)\,(\mu_{x|f} - \mu_x)^2. \tag{A5}$$
The third term on the rhs of Eq. (A3) is the same as the second term on the rhs of Eq. (A1), and simplifies in the same manner. Thus, using Eqs. (A4), (A5), and (A2), Eq. (A3) can be written as
$$\mathrm{MSE} = \sigma_x^2 - \sum_f s(f)\,(\mu_{x|f} - \mu_x)^2 + \sum_f s(f)\,(f - \mu_{x|f})^2, \tag{A6}$$
the CR-based MSE decomposition in Murphy (1997).

Equation (A6) differs from Eq. (8) in how the MSE due to the shapes of the conditional distributions is handled, expressing it as the difference of two terms [“uncertainty” and “resolution,” the first two terms on the rhs of Eq. (A6)] rather than just one term. When comparing homogeneous forecast sets, uncertainty is constant, so differences in $\mathrm{MSE}_{\mathrm{Shape}}$ are due exclusively to differences in resolution. Resolution can be qualitatively estimated from the type I conditional bias comparative scatterplot (Fig. 11), whereas $\mathrm{MSE}_{\mathrm{Shape}}$ cannot, so perhaps Eq. (A6) is a better companion to the type I conditional bias comparative scatterplot than Eq. (8). However, in the context of Fig. 14, it is more intuitive to interpret MSE as the sum of the two positive terms of Eq. (8) than the positive and negative contributions of Eq. (A6).
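The relationship between the two CR-based forms can be checked on a toy example: the uncertainty minus the resolution should reproduce the single shape term of Eq. (8). The sketch below is illustrative only; the joint distribution `p` is made up, and `murphy_cr_terms` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint distribution p(f, x); values sum to 1.
p = {
    (30, 30): F(3, 10), (30, 40): F(1, 10),
    (40, 30): F(1, 10), (40, 40): F(2, 10),
    (50, 40): F(1, 10), (50, 60): F(2, 10),
}

def murphy_cr_terms(p):
    """Return (uncertainty, resolution, type_I_conditional_bias)."""
    s = defaultdict(F)       # marginal distribution of forecasts, s(f)
    sum_x = defaultdict(F)   # sum over x of p(f, x) * x, for each f
    for (f, x), w in p.items():
        s[f] += w
        sum_x[f] += w * x
    mu_x = sum(sum_x.values())                       # mean observation
    mu_x_given_f = {f: sum_x[f] / s[f] for f in s}   # conditional means
    # Variance of the observations; independent of the forecasts.
    uncertainty = sum(w * (x - mu_x) ** 2 for (f, x), w in p.items())
    # Spread of the conditional means about the mean observation.
    resolution = sum(s[f] * (mu_x_given_f[f] - mu_x) ** 2 for f in s)
    cond_bias = sum(s[f] * (f - mu_x_given_f[f]) ** 2 for f in s)
    return uncertainty, resolution, cond_bias

uncertainty, resolution, cond_bias = murphy_cr_terms(p)
shape = uncertainty - resolution
mse = sum(w * (f - x) ** 2 for (f, x), w in p.items())
```

Because the uncertainty depends only on the observations, two forecast systems verified against the same sample can differ in `shape` only through `resolution`, which is the point made above about homogeneous forecast sets.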

LBR-based MSE decomposition

The derivation of the LBR-based decomposition of MSE in Eq. (9) follows a similar procedure to that of the CR-based decomposition described above. First, add and subtract $(\mu_{f|x} - f)^2$ inside the summations of Eq. (6) to give
$$\mathrm{MSE} = \sum_x \sum_f p(f, x)\,(\mu_{f|x} - f)^2 + \sum_x \sum_f p(f, x)\left[(f - x)^2 - (\mu_{f|x} - f)^2\right], \tag{A7}$$
noting also that the order of summation has been reversed. Application of the LBR factorization to p(f, x) in the first term on the rhs of Eq. (A7) yields the first term on the rhs of Eq. (9). Following the same procedure for the second term on the rhs of Eq. (A7) gives $\sum_x t(x) \sum_f r(f|x)\,(-2fx + x^2 - \mu_{f|x}^2 + 2\mu_{f|x}f)$, which simplifies to $\sum_x t(x)\,(-2x\mu_{f|x} + x^2 + \mu_{f|x}^2)$, noting that $\sum_f r(f|x)\,f = \mu_{f|x}$. Further simplification yields
$$\sum_x t(x)\,(\mu_{f|x} - x)^2, \tag{A8}$$
the second term on the rhs of Eq. (9).
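A derivation of the Murphy (1997) LBR-based MSE decomposition is analogous to that shown for the Murphy (1997) CR-based MSE decomposition. Add and subtract $(\mu_{f|x} - f)^2$ and $(\mu_f - f)^2$ inside the summations of Eq. (6) to get
$$\mathrm{MSE} = \sum_x \sum_f p(f, x)\,(\mu_f - f)^2 - \sum_x \sum_f p(f, x)\left[(\mu_f - f)^2 - (\mu_{f|x} - f)^2\right] + \sum_x \sum_f p(f, x)\left[(f - x)^2 - (\mu_{f|x} - f)^2\right], \tag{A9}$$
noting that the order of summations has been reversed. The first term on the rhs of Eq. (A9) describes the variance of the forecasts:
$$\sum_f s(f)\,(\mu_f - f)^2 = \sigma_f^2. \tag{A10}$$
To simplify the second term on the rhs of Eq. (A9) (apart from its leading minus sign), apply the LBR factorization of p(f, x) to get $\sum_x t(x) \sum_f r(f|x)\,(\mu_f^2 - 2\mu_f f - \mu_{f|x}^2 + 2\mu_{f|x}f)$, and then use $\sum_f r(f|x)\,f = \mu_{f|x}$ to get $\sum_x t(x)\,(\mu_f^2 - 2\mu_f\mu_{f|x} + \mu_{f|x}^2)$. This can be further simplified to give
$$\sum_x t(x)\,(\mu_{f|x} - \mu_f)^2. \tag{A11}$$
Finally, using Eqs. (A10), (A11), and (A8) in Eq. (A9) yields
$$\mathrm{MSE} = \sigma_f^2 - \sum_x t(x)\,(\mu_{f|x} - \mu_f)^2 + \sum_x t(x)\,(\mu_{f|x} - x)^2, \tag{A12}$$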
the LBR-based MSE decomposition in Murphy (1997).

Like the CR-based MSE decomposition of Eq. (A6), Eq. (A12) also expresses the MSE due to the shapes of the conditional distributions as the difference of two terms, instead of one term. Since $\sigma_f^2$ can vary substantially between forecast systems, there is no advantage to using Eq. (A12) as a companion of the type II conditional bias comparative scatterplot (Fig. 12) rather than Eq. (9), as differences in $\mathrm{MSE}_{\mathrm{Shape}}$ between forecast systems cannot be inferred from only $\sum_x t(x)\,(\mu_{f|x} - \mu_f)^2$. Like the CR-based MSE decomposition of Eq. (8), the LBR-based MSE decomposition of Eq. (9) also seems preferable for use in Fig. 14.
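The same kind of toy check applies to the LBR-based forms: the forecast variance minus the spread of the conditional mean forecasts should reproduce the single shape term of Eq. (9). As elsewhere, the joint distribution `p` is made up and the helper and variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint distribution p(f, x); values sum to 1.
p = {
    (30, 30): F(3, 10), (30, 40): F(1, 10),
    (40, 30): F(1, 10), (40, 40): F(2, 10),
    (50, 40): F(1, 10), (50, 60): F(2, 10),
}

def murphy_lbr_terms(p):
    """Return (forecast_variance, cond_mean_spread, type_II_cond_bias)."""
    t = defaultdict(F)       # marginal distribution of observations, t(x)
    sum_f = defaultdict(F)   # sum over f of p(f, x) * f, for each x
    for (f, x), w in p.items():
        t[x] += w
        sum_f[x] += w * f
    mu_f = sum(sum_f.values())                       # mean forecast
    mu_f_given_x = {x: sum_f[x] / t[x] for x in t}   # conditional means
    # Variance of the forecasts; this differs between forecast systems.
    forecast_variance = sum(w * (f - mu_f) ** 2 for (f, x), w in p.items())
    # Spread of the conditional mean forecasts about the mean forecast.
    cond_mean_spread = sum(t[x] * (mu_f_given_x[x] - mu_f) ** 2 for x in t)
    cond_bias = sum(t[x] * (mu_f_given_x[x] - x) ** 2 for x in t)
    return forecast_variance, cond_mean_spread, cond_bias

forecast_variance, cond_mean_spread, cond_bias = murphy_lbr_terms(p)
shape = forecast_variance - cond_mean_spread
mse = sum(w * (f - x) ** 2 for (f, x), w in p.items())
```

Unlike the CR case, both `forecast_variance` and `cond_mean_spread` depend on the forecasts, so neither one alone indicates how the shape term differs between forecast systems.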

Fig. 1.

Joint distribution of official NHC forecasts and observations at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Dots mark all ( f, x) for which there is nonzero relative frequency in the corresponding verification data sample. The colors represent relative frequency magnitude, according to the following scale: 0 < p( f, x) ≤ 0.0025 (purple); 0.0025 < p( f, x) ≤ 0.005 (dark blue); 0.005 < p( f, x) ≤ 0.01 (light blue); 0.01 < p( f, x) ≤ 0.015 (green); 0.015 < p( f, x) ≤ 0.025 (yellow); 0.025 < p( f, x) ≤ 0.05 (orange); and 0.05 < p( f, x) ≤ 1 (red). The thin black line marks the diagonal, where f = x.

Citation: Weather and Forecasting 23, 6; 10.1175/2008WAF2222133.1

Fig. 2.

As in Fig. 1 but for the GFDL model forecasts.

Fig. 3.

As in Fig. 1 but for the Decay-SHIPS model forecasts.

Fig. 4.

As in Fig. 1 but for the SHF5 model forecasts.

Fig. 5.

Marginal distributions of OFCL forecasts (dashed red), GFDL forecasts (dashed green), DSHP forecasts (dashed dark blue), SHF5 forecasts (dashed light blue), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation and the gray triangle marks the median observation in each panel.

Fig. 6.

As in Fig. 1 but for persistence forecasts. A persistence forecast is defined to take the value of the operationally designated initial intensity; thus, the joint distributions here can be interpreted as weighted scatterplots of the training data used to estimate the linear statistical model coefficients in Eq. (3). The magenta line in each panel shows the best linear fit, in the least squares sense. Its slope and intercept are used as the coefficients in the linear statistical model for each lead time.

Fig. 7.

As in Fig. 1 but for forecasts from the SLR model described in the text.

Fig. 8.

Marginal distributions of the SHF5 forecasts (dashed light blue), SLR forecasts (dashed magenta), and observations (solid black) at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. The black triangle marks the mean observation in each panel. Note that the probability range is twice as great as in Fig. 5.

Fig. 9.

Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the observations given the forecast, q(x|f). A boxplot for the marginal distribution of observations, t(x), is shown to the left of the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram at the bottom of each panel represents the marginal distribution of forecasts, s(f). The solid gray line marks the diagonal, where f = x.

Fig. 10.

Conditional quantile diagram for the OFCL forecasts at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. Each panel shows boxplots for the set of conditional distributions of the forecasts given the observation, r( f |x). A boxplot for the marginal distribution of forecasts, s( f ), is shown below the dashed gray line in each panel and is marked with a “U” (for “unconditional”). A histogram on the left of each panel represents the marginal distribution of observations, t(x). The solid gray line marks the diagonal, where f = x.

Fig. 11.

Type I conditional bias comparative scatterplot, at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark $(f, \mu_{x|f})$ for all values of f predicted by the forecast system. In each panel, the solid black line marks the diagonal and the dashed black line the value of the mean observation, $\mu_x$.


Fig. 12.

Type II conditional bias comparative scatterplot, at lead times of (a) 0, (b) 36, (c) 72, and (d) 120 h. For a given lead time, a set of dots is plotted for each of the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. The dots in each set mark (μf|x, x) for all values of x. In each panel, the solid black line marks the diagonal and the dashed black line marks a representative value of the mean forecast, μf, as described in the text.

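The dots in Figs. 11 and 12 are conditional means. As a hedged illustration, the points could be computed as below. The helper name `conditional_means` is hypothetical, and exact grouping on discrete forecast or observation values is an assumption.

```python
from collections import defaultdict
from statistics import fmean


def conditional_means(pairs):
    """Return sorted (key, mean of the values sharing that key) points.

    With (f, x) pairs the result is the set of (f, mu_{x|f}) points of the
    type I plot. With (x, f) pairs it is (x, mu_{f|x}), which the type II
    plot draws with the axes swapped, as (mu_{f|x}, x).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, fmean(values)) for key, values in sorted(groups.items())]
```

Points lying on the diagonal indicate no conditional bias for that forecast (type I) or observation (type II) value.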

Fig. 13.

The (a) ME and (b) MAE, as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems.


Fig. 14.

(a) MSE, as a function of lead time for the OFCL (red), GFDL (green), DSHP (dark blue), SHF5 (light blue), and SLR (magenta) forecast systems. (b) MSE due to the shapes of the conditional distributions q(x|f) (dashed) and MSE due to type I conditional bias (dotted), the two terms in the CR-based MSE decomposition of Eq. (8). (c) MSE due to the shapes of the conditional distributions r(f|x) (dashed) and MSE due to type II conditional bias (dotted), the two terms in the LBR-based MSE decomposition of Eq. (9).

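The two-term splits plotted in Fig. 14b,c can be illustrated with a short sketch. Equations (8) and (9) are not reproduced in this part of the paper, so the code assumes they take the standard form that follows from the law of total expectation: the MSE equals the probability-weighted conditional variance (the "shape" term) plus the probability-weighted squared conditional bias. The function names and the use of population (1/N) variances are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def mse(pairs):
    """Mean square error over (f, x) pairs."""
    return fmean([(f - x) ** 2 for f, x in pairs])


def conditional_decomposition(pairs):
    """Split the MSE by conditioning on the first element of each pair.

    Returns (shape_term, bias_term), where
      shape_term = sum_k p(k) * Var(value | key = k)
      bias_term  = sum_k p(k) * (mean(value | key = k) - k) ** 2
    Pass (f, x) pairs for the CR-based split and (x, f) pairs for the
    LBR-based split. The two terms sum to the MSE in both cases.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    n = sum(len(values) for values in groups.values())
    shape = sum(len(v) / n * pvariance(v) for v in groups.values())
    bias = sum(len(v) / n * (fmean(v) - k) ** 2 for k, v in groups.items())
    return shape, bias
```

For a small invented sample such as `[(50, 40), (50, 60), (80, 70), (80, 70)]`, the CR-based call splits the MSE evenly between the two terms, whereas the LBR-based call (with the pairs swapped) assigns all of it to type II conditional bias.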

Table 1.

Sample size, N, as a function of lead time for the homogeneous verification data samples described in the text.

Table 2.

Coefficients for the SLR model relating the operationally designated initial intensity to the observed intensity, as a function of lead time. Coefficients of determination for the linear relationship are also listed.
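As a rough illustration of how coefficients like those in Table 2 might be obtained, the sketch below fits a straight line by ordinary least squares and reports the coefficient of determination. It assumes that SLR denotes a simple linear regression fitted separately at each lead time. The function name `fit_slr` and the sample values in the usage note are invented for illustration and are not taken from the paper.

```python
from statistics import fmean


def fit_slr(initial, observed):
    """Least squares fit of observed = a + b * initial.

    initial:  initial intensities (predictor), one per case.
    observed: verifying observed intensities at one lead time.
    Returns (a, b, r2): intercept, slope, coefficient of determination.
    """
    mean_i, mean_o = fmean(initial), fmean(observed)
    sxx = sum((u - mean_i) ** 2 for u in initial)
    syy = sum((v - mean_o) ** 2 for v in observed)
    sxy = sum((u - mean_i) * (v - mean_o) for u, v in zip(initial, observed))
    b = sxy / sxx
    a = mean_o - b * mean_i
    r2 = (sxy * sxy) / (sxx * syy)
    return a, b, r2
```

For example, `fit_slr([30, 50, 70, 90], [40, 50, 60, 70])` gives an intercept of 25, a slope of 0.5, and a coefficient of determination of 1, because the invented points lie exactly on a line.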


1

Notation in this paper follows that of Murphy (1997).

2

Public dissemination of official forecasts with lead times beyond 72 h has only occurred since 2003.

3

Archived forecasts and best-track observations were obtained from the “A-decks” and “B-decks,” respectively, of the NHC’s digital forecast database (information online at ftp://ftp.nhc.noaa.gov/pub/atcf/; accessed November 2006).

4

According to the NHC Web site (http://www.nhc.noaa.gov/verification), realizations are only included if the storm is classified as tropical or subtropical at both the forecast initialization time and verification time.

5

Note that for GFDL, forecasts asymptote toward the “wrong” value: 85 kt instead of 60 kt. This may in part be due to the difficulty of developing a dynamical model toward the goal of optimal forecast production, relative to statistical models. The model dynamics constrain the possible trajectories that can be produced (perhaps excluding the optimal one), whereas the DSHP and SHF5 have no such constraints.

6

The box and whiskers are not shown for conditional distribution estimates based on a sample size of less than 10.

7

Strictly, the panels should show the mean forecast for each of the five forecast systems. For graphical clarity, a single “representative” μf is used, equal to μx, as μf is generally within 5 kt of μx (see Fig. 13a).

8

This bias was corrected in the version of the decay model used operationally, starting in 2005.

9

Lines of constant f − x parallel the diagonal, where f − x = 0.
