## 1. Introduction

The National Hurricane Center (NHC) Wind Speed Probability Forecast Product (WPFP) gives a forecast probability for three wind speed categories at predefined locations for threatening tropical cyclones (TCs). The Monte Carlo–based WPFP depends on established track and intensity errors (DeMaria et al. 2009). The model has undergone a series of upgrades, the most recent of which selects the track and intensity errors based on the spread from a small set of models (DeMaria et al. 2013), an approach initially explored by Hauke (2006). NHC issues a WPFP text product for selected coastal and inland locations for the 0–12-, 12–24-, 24–36-, 36–48-, 48–72-, 72–96-, and 96–120-h periods that includes both an interval (IP) and a cumulative (CP) probability forecast for the 34-, 50-, and 64-kt (where 1 kt = 0.51 m s^{−1}) wind speed radii. Interval probabilities reflect the likelihood of onset (i.e., that a wind speed threshold will begin within the interval), whereas the cumulative product indicates the likelihood that conditions will occur during the cumulative period (http://www.nhc.noaa.gov/aboutnhcprobs2.shtml).

The WPFP forecast probabilities have been shown to be reliable (DeMaria et al. 2009, 2013). Previous evaluation of the WPFP IP text product indicated a tendency for the product to overforecast TC events at higher probabilities for landfalling tropical cyclones (Splitt et al. 2010). This result was based on both direct evaluation of the probabilities (e.g., reliability diagrams) and indirect assessment of bias. The latter requires conversion of probability values to binary forecasts using an optimal threshold, where the chosen threshold maximizes the value of a particular skill score such as the true skill statistic (TSS) or the Heidke skill score (HSS). The TSS is defined as the forecast hit rate minus the false alarm rate (see the appendix) and is commonly interpreted as a measure that best discriminates between yes and null events. The HSS is based on the proportion of correct forecasts relative to that expected simply by chance (Wilks 2006). The optimal decision thresholds for TSS and HSS were shown to vary with forecast period interval as well as wind speed category—consistent with the results of DeMaria et al. (2013) for the threat score (TS). The threat score is a variant of the proportion correct (or hit rate) that ignores correct null forecasts and is thus useful in rare-event scenarios (Wilks 2006). In Splitt et al. (2010), the TSS, HSS, and bias score were used to select decision thresholds that ranged from 1% to 55% depending on the score applied, wind speed category, and time interval. Because of its reduced sensitivity to the filtering of correct negatives and better WPFP bias scores, the HSS was recommended over the TSS for determining optimal thresholds. Thresholds were higher (lower) for the shorter- (longer-) range forecasts and lower (higher) wind speed categories. Sampson et al. (2012) used the TSS to evaluate the skill of an objective Tropical Cyclone Conditions of Readiness (TC-CORs) system that used WPFP forecasts. 
Thresholds for this system were determined using the probability of detection and false alarm rate. In a study that focused on operational decision making, Santos et al. (2010) showed similar results regarding the temporal dependency of the skill thresholds for the 34- and 64-kt wind categories. Roeder and Szpak (2010) also used an optimal threshold approach with the WPFP to produce guidance based upon a normalized threat scale. The normalization uses the lowest and highest forecast gridded probabilities observed over a multiyear dataset to provide a bounded estimate of the relative probability.

In contrast to previous studies, the work presented here is designed to provide context in terms of the expectations associated with a Monte Carlo–based system. To accomplish this, a simple forecast system that mimics a tropical cyclone wind speed probability problem is developed. The simple model depends on two factors: 1) feature size (analogous to a tropical cyclone wind speed radius) and 2) location uncertainty (analogous to a tropical cyclone track error). The simple model is then used to assess results from the WPFP. This includes the operational WPFP dataset and a rerun of the most recent version of the WPFP with 2012 error statistics and the Goerss ensemble adjustment (Goerss 2007; DeMaria et al. 2013) for the 2004–11 Atlantic hurricane seasons. The rerun simulations were provided by M. DeMaria and A. Schumacher of National Oceanic and Atmospheric Administration/National Environmental Satellite, Data, and Information Service/Center for Satellite Applications and Research (NOAA/NESDIS/STAR), Cooperative Institute for Research in the Atmosphere/Colorado State University (CIRA/CSU).

## 2. The simple Monte Carlo model

### a. Model description and probability expectations

The simple Monte Carlo model is characterized by two parameters: a circular feature radius *R*_{f} and location error standard deviation *σ*. For simplicity, it is assumed that there is no radius error (in some instances, however, this might be a critical parameter). Figure 1 illustrates a random displacement (from some initial location) of a circular feature within a Monte Carlo simulation. Permutations of the circular feature, which delineates a particular wind speed isotach, result in an overlap of the individual ensemble members and visually represent a probabilistic forecast. The darker shades of gray depict areas with higher probabilities that a particular location is within a given feature. The highest probability would be expected to be at the center of the ensemble spread if there is no bias. Assuming a bivariate normal distribution (e.g., Schwarzenbach et al. 2003, appendix A), the maximum probability *p*_{max} of being within *R*_{f} at the ensemble center can be determined analytically from

*p*_{max} = 1 − exp[−*R*_{f}^{2}/(2*σ*^{2})].     (1)

For simplicity it is assumed that *σ*_{x} = *σ*_{y} and that the errors are uncorrelated, yielding the special case of a circular normal distribution (for the anisotropic case where *σ*_{x} ≠ *σ*_{y}, the across- and along-track errors differ, resulting in an elliptical error pattern).
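
As a numerical check on Eq. (1), the maximum probability for a circular normal error distribution, *p*_{max} = 1 − exp[−*R*_{f}^{2}/(2*σ*^{2})], can be evaluated directly. The following is a minimal Python sketch; the function name is an illustrative choice, not from the WPFP code:

```python
import math

def p_max(rf, sigma):
    """Maximum probability of being within a circular feature of radius
    rf at the ensemble center, for circular normal location errors with
    standard deviation sigma (the Rayleigh CDF evaluated at rf)."""
    return 1.0 - math.exp(-rf ** 2 / (2.0 * sigma ** 2))

# the ceiling is ~39% when the feature radius equals the location error
print(round(p_max(1.0, 1.0), 5))   # prints 0.39347
```

Note that the bound depends only on the ratio *R*_{f}/*σ*, not on the number of Monte Carlo realizations.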

For locations away from the ensemble center, the probability of being within *R*_{f} does not have a simple analytic solution. The probability of a given “station” location *s* being within *R*_{f}, *p*(*s* ∈ *R*_{f}), is equivalent to the probability of the feature center being within *R*_{f} of a given location. In either case, the problem can be solved via integration of the bivariate normal distribution (Gilliland 1962). Here, the probability is approximated over a (gridded) circular region of radius *R*_{f} with respect to a coordinate system origin (*x*_{0}, *y*_{0}) at the center of the feature:

*p*(*s* ∈ *R*_{f}) ≈ Σ [1/(2π*σ*^{2})] exp[−*r*′^{2}/(2*σ*^{2})] Δ*x*Δ*y*,     (2)

where the sum is taken over the grid boxes within *R*_{f} and *r*′ is the distance from *s*(*x*_{s}, *y*_{s}) to the center of a grid box located within *R*_{f} (Fig. 2). For the case in which the station location is the same as the feature location (i.e., *x*_{s} = *x*_{0}, *y*_{s} = *y*_{0}), Eq. (2) yields *p*_{max} [Eq. (1)]. Figure 3, an adaptation of Eq. (2), shows the probability of a point being within a circular feature, for varying values of *R*_{f}/*σ*, as a function of normalized range (*r*/*R*_{f}). The probability of a location being within the feature decreases as the distance from the feature center increases. We observe *p*_{max} where the curves intersect the ordinate axis (i.e., when the location is at the “forecast” feature center). When the feature location error is relatively small (as in a short-term forecast), the probabilities exhibit a high degree of bimodality, whereas for larger location error the probability varies more gradually with range. To illustrate this, contours of the operational WPFP incremental 50-kt wind probabilities are shown for both short (6–12 h) and relatively long (42–48 h) lead times for Hurricane Earl (2010) advisory 33 (Fig. 4). The incremental product differs from the IP in that it represents the probability of a condition occurring, rather than beginning, in the time interval. The bimodality of the short–lead time forecast is readily apparent, with probability values decreasing rapidly outward from the center of the cyclone track. Conversely, the later forecast interval probability contours decrease more gradually.

Maximum probability estimates from the gridded approximation of Eq. (2) are quite good—with only slight differences from their theoretical values [i.e., Eq. (1)]. As an example, for *R*_{f}/*σ* equal to 1.0, the two are equivalent to within 10^{−5} (0.393 48 versus 0.393 47, respectively). Hence, the maximum probability for the simple Monte Carlo approach is approximately 40% when the feature radius equals the location error. While it might have been thought that the upper limit in the maximum forecast probabilities within the NHC WPFP was, in part, due to the small number of years from which the error distribution is estimated, this analysis shows that this bound, which is less than 100%, depends explicitly on *R*_{f} and *σ* and not sample size.
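
The gridded approximation can be reproduced with a short numerical sketch that sums the bivariate (circular) normal density over grid boxes inside the feature; the grid spacing and function name here are illustrative assumptions:

```python
import numpy as np

def p_within(rf, sigma, xs=0.0, ys=0.0, dx=0.005):
    """Approximate the probability that station (xs, ys) lies within a
    circular feature of radius rf, by summing the bivariate (circular)
    normal density over grid boxes covering the feature."""
    c = np.arange(-rf, rf, dx) + dx / 2.0          # grid-box centers
    xx, yy = np.meshgrid(c, c)
    inside = xx ** 2 + yy ** 2 <= rf ** 2          # boxes within the feature
    r2 = (xx[inside] - xs) ** 2 + (yy[inside] - ys) ** 2
    dens = np.exp(-r2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return float(dens.sum() * dx * dx)

# a station at the feature center recovers the analytic maximum of Eq. (1)
approx = p_within(rf=1.0, sigma=1.0)
exact = 1.0 - np.exp(-0.5)
print(round(approx, 3))
```

For *R*_{f}/*σ* = 1 the gridded sum agrees with the analytic ceiling to within the small discretization error quoted in the text.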

Gridded values of the WPFP incremental forecast product for 2004–11 were used to compare to the *p*_{max} estimates obtained from the simple model [i.e., Eq. (1)]. Given that the incremental product is the probability of a condition occurring within forecast intervals that range from 12- to 24-h duration, it will likely differ somewhat from the instantaneous probabilities that are calculated using information available at the end of the time interval. NHC public advisories were used to acquire average forecast radii information for 34-, 50-, and 64-kt features (available through 72 h) from an average of the four quadrants. Estimates of track error as a function of forecast hour were obtained from the NHC official forecast error database using the 2010 and 2011 seasons. Error estimates as a function of forecast hour and maximum wind speed were also derived from the database using linear regression with the error decreasing as the speed increases (regression coefficients are included in Table 1). These error estimates, which are strictly climatological, are less rigorous than those used by the WPFP. The advisories were filtered to exclude those forecasts for which a tropical cyclone was expected to be inland and when the forecast maximum wind was equal to that of a particular wind speed category (e.g., the 50-kt radius feature would be excluded, but not that at 34 kt, if the forecast maximum wind was 50 kt). The scatterplot of the *p*_{max} estimates versus the WPFP incremental product maximum gridded probabilities indicates that the two compare reasonably well (Fig. 5). A better fit is obtained using both forecast hour and maximum wind speed (plus signs in Fig. 5) to estimate the error rather than using forecast hour alone (gray-filled circles).

Linear regression coef and goodness of fit for NHC track error as a function of max wind speed. Units for slope and intercept include nautical miles (n mi), where 1 n mi = 1.852 km.

The previous Hurricane Earl advisory (Fig. 4) is used to illustrate how the instantaneous probabilities from the simple model compare to the WPFP. The rerun WPFP IP maximum probabilities and the maximum incremental probabilities (not available in the text product) from the gridded data and for the location of Nantucket, off the coast of Massachusetts, are given in Table 2. The IP probabilities from the original operational product are also included for Nantucket. Recall that the IP is the probability of a condition beginning in a forecast interval, whereas the incremental product is the probability of a condition occurring in the forecast interval. The probability from the simple model is an instantaneous quantity that is valid at the end of the forecast intervals shown. The simple model maximum probabilities are lower than the maximum incremental probabilities, which might be expected because the simple model is obtained from the end of the forecast interval, and compare better with the WPFP IP maximum probabilities. Notable differences between the WPFP rerun using the Goerss error estimate and the original operational product are observed at Nantucket for the IP product. The maximum probabilities for all models are higher than the IP for Nantucket, which is consistent with the center of the storm remaining offshore. In this case, the maximum probabilities might be useful in interpreting the relative risk at Nantucket and perhaps as a general tool for probability interpretation.

Max probabilities (%) for Hurricane Earl advisory 33 issued at 1500 UTC 2 Sep 2010.

### b. Probability frequency distributions

The annular area *A* within the probability range *p* ± Δ*p* is given by

*A* = π[*r*^{2}(*p* − Δ*p*) − *r*^{2}(*p* + Δ*p*)],     (3)

where *r*(*p* + Δ*p*) and *r*(*p* − Δ*p*) are the radii bounding the probability interval. This area, which represents a frequency distribution of the forecast probabilities, can be calculated using a combination of Eqs. (2) and (3). Here, an alternative route is taken whereby gridded probabilities (that a particular grid point lies within a feature) are estimated using 100 000 realizations from the simple Monte Carlo model. A count of the number of grid points for a given probability range (which represents an annular area) yields nearly identical results to direct integration. From Fig. 3 it is anticipated, in general, that frequencies will be higher for lower-probability intervals. The annular region defined by a given probability interval will typically be larger at lower probabilities because both the average *r*′ and the Δ*r*′ are larger. Figure 6 shows the frequency distribution (both counts and normalized frequencies) of forecast probabilities for five different feature size–location error (*R*_{f}/*σ*) ratios. Each of the distributions follows a near-linear log–log relationship over a large range of probabilities. For the case with relatively large location error (i.e., *R*_{f}/*σ* = 0.5), the frequency distribution trends toward a large number of low probabilities, and the data distribution ends (i.e., zero count) near the maximum probability predicted by Eq. (1). Conversely, for relatively small location error [i.e., *R*_{f}/*σ* = *O*(10)], the data distribution becomes more U shaped and thus bimodal, as shown earlier in Fig. 3. Note that the observed values of *R*_{f}/*σ* for the previous analysis (i.e., Fig. 5) ranged from 0.3 to 9.5. The mean ratio decreases with increasing wind speed and forecast lead time (Table 3). For example, the 50-kt, 24-h category has approximately the same mean ratio as the 34-kt, 48-h category.
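
The construction just described (gridded probabilities from Monte Carlo realizations, counted by probability bin) can be sketched as follows. For brevity this illustration uses far fewer realizations and a coarser grid than the 100 000-member simulation in the text, and all names and sizes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def probability_grid(rf, sigma, n_real=5000, half_width=6.0, nx=80):
    """Gridded probability that each grid point lies inside a randomly
    displaced circular feature of radius rf.  Feature centers are drawn
    from a circular normal distribution with standard deviation sigma."""
    centers = rng.normal(0.0, sigma, size=(n_real, 2))
    x = np.linspace(-half_width, half_width, nx)
    xx, yy = np.meshgrid(x, x)
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1)       # (nx*nx, 2)
    hits = np.zeros(len(pts))
    for chunk in np.array_split(centers, 50):              # limit memory use
        d2 = ((pts[:, None, :] - chunk[None, :, :]) ** 2).sum(-1)
        hits += (d2 <= rf ** 2).sum(axis=1)
    return (hits / n_real).reshape(nx, nx)

# R_f/sigma = 0.5: probabilities are bounded near p_max from Eq. (1),
# and counting grid points per probability bin gives the frequency curve
p = probability_grid(rf=1.0, sigma=2.0)
counts, edges = np.histogram(p[p > 0.005], bins=np.arange(0.0, 1.01, 0.05))
```

Consistent with Fig. 6, the low-probability bins dominate the counts and the distribution terminates near the analytic maximum.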

Estimated mean feature size–location error (i.e., *R*_{f}/*σ*) using NHC forecast radii and track error as a function of wind speed (kt) and forecast lead time (h). NHC advisories do not provide radius estimates beyond 36 h for the 64-kt radii.

The various curves in Fig. 6 were fit to a power-law distribution of the form *f*(*p*) = *Ap*^{−α}. In cases where the distribution was U shaped, the high-probability data were excluded (i.e., right portions of Fig. 6). Although these data are relevant, the focus here is on the trend within the near-linear log–log portions of the probability space where most of the probabilities occur. The results of the regression fit are provided in Table 4. Power-law distributions are common in other disciplines. For example, tree-size frequency distributions (Niklas et al. 2003), earthquake magnitudes (Main 1996), and near-Earth asteroid sizes (Chapman 2004) are each well approximated (over some range) by a power law. The WPFP IP frequency distributions are assessed with respect to the power-law distribution in more detail in section 3b.
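
Fitting *f*(*p*) = *Ap*^{−α} amounts to a straight-line fit in log–log space; a minimal sketch with synthetic data (names are illustrative):

```python
import numpy as np

def fit_power_law(p, freq):
    """Least-squares fit of f(p) = A * p**(-alpha) as a straight line in
    log-log space; p holds probability bin centers (%), freq the counts
    (or normalized frequencies) in each bin."""
    slope, intercept = np.polyfit(np.log(p), np.log(freq), 1)
    return np.exp(intercept), -slope               # (A, alpha)

# synthetic check: data drawn exactly from a power law are recovered
p = np.arange(1.0, 60.0)
A, alpha = fit_power_law(p, 500.0 * p ** (-1.5))
print(round(alpha, 3))   # prints 1.5
```

For U-shaped distributions, the fit would be restricted to the near-linear low-probability portion, as done in the text.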

Power-law coef (*A*, *α*) obtained by fitting the simple model frequency distributions for varying ratios of the feature radius–location error (i.e., *R*_{f}/*σ*).

### c. Implications for skill scores determined from decision thresholds

Expectations of forecast skill can be made via the simple Monte Carlo model if the forecast system is assumed to be reliable and follows a power-law relationship. Tables 5 and 6 describe the 2 × 2 contingency tables for both power and bounded power-law distributions given a total number of events (i.e., *n*), probability threshold (i.e., *p*_{t}; %), and power-law coefficients *A* and *α*. Marginal totals are shown for Table 5 only (far-right column and bottom row). By their nature, the contingency tables are constructed in a complementary fashion such that the entire probability space is represented. The rows are characterized by an inequality and specified threshold *p*_{t} that delineates between forecast events (*p*_{t}, 100%) and nonevents (1%, *p*_{t}). Because the observed frequency is assumed to be equal to the forecast probability (reliability constraint, top-left box of the contingency table), the nonevent probabilities are given by 100% − *p*, where *p* is the forecast probability.
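
The table entries can be written as integrals of *f*(*p*) = *Ap*^{−α} weighted by the reliability constraint (observed frequency equal to *p*/100). A hypothetical sketch with closed-form integrals (function and variable names are ours):

```python
import numpy as np

def contingency(alpha, p_t, A=1.0, lo=1.0, hi=100.0):
    """Contingency-table entries (H, M, FA, CN) for a reliable forecast
    system whose probabilities (%) follow f(p) = A * p**(-alpha) on
    [lo, hi], split at the decision threshold p_t.  The reliability
    constraint sets the observed frequency equal to p/100."""
    def integral(power, a, b):                       # closed-form of p**power
        if np.isclose(power, -1.0):
            return np.log(b / a)
        return (b ** (power + 1) - a ** (power + 1)) / (power + 1)
    H = A * integral(1.0 - alpha, p_t, hi) / 100.0   # forecast yes, observed yes
    M = A * integral(1.0 - alpha, lo, p_t) / 100.0   # forecast no, observed yes
    FA = A * integral(-alpha, p_t, hi) - H           # forecast yes, observed no
    CN = A * integral(-alpha, lo, p_t) - M           # forecast no, observed no
    return H, M, FA, CN
```

By construction the four entries sum to the total number of forecasts, so the full probability space is represented, as noted above.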

Contingency table for a power-law distribution with coef *A* and *α*. Note that *p*_{t} is a threshold probability (%), and *n* is the total number of events.

Contingency table for a bounded power-law distribution with coef *α*. Here, *p*_{max} is the max probability (%), *p*_{t} is a threshold probability (%), and *n* is the total number of events.

The TSS can be written as

TSS = *H*/(*H* + *M*) − FA/(FA + CN),     (4)

where *H*, *M*, FA, and CN are the numbers of hits, misses, false alarms, and correct negatives, respectively. Using the integral form of either the power-law or bounded power-law distribution for each of the components (i.e., *H*, *M*, FA, and CN), Eq. (4) can be differentiated with respect to the probability threshold to find the threshold that maximizes the TSS. The probability threshold that maximizes the TSS (TSS_{o}), over the probability interval from 1% to 100%, is given by (see appendix for details)

TSS_{o} = (1 − *α*)(100^{2−*α*} − 1)/[(2 − *α*)(100^{1−*α*} − 1)].     (5)

For example, if *α* = 0.5, the optimal threshold for a reliable forecast system with a power-law distribution would be expected to be 37%. Thus, the optimal TSS decision threshold for a reliable forecast system with a power-law distribution is strictly a function of the frequency distribution. The forecast probability distribution is, in turn, related to the ratio of feature size to location error (*R*_{f}/*σ*). Figure 7 shows the variation of TSS_{o} (black line) as a function of the power-law exponent *α* obtained using Eq. (5). Optimal HSS, TS, and equitable threat score (ETS) thresholds are also shown, but were arrived at computationally. The ETS is a variant of the TS that takes into account the number of correct yes forecasts that could be arrived at by chance. The HSS and ETS follow the same threshold curve. When *α* = 0, the frequency distribution is flat and all probabilities are forecast with equal frequency. In this case, the HSS, ETS, and TSS optimal thresholds converge to 50%, a result that is consistent with the data distribution (i.e., half the data are above or below this value). As *α* increases, TSS_{o} decreases much more rapidly than HSS_{o} (or ETS_{o}), with the largest difference between the two at *α* = 1.5. The TSS is thus more sensitive to changes in the underlying frequency distribution. This is consistent with the previously mentioned recommendation to optimize the probability threshold using HSS due to the sensitivity of TSS (Splitt et al. 2010). It is also worth pointing out that the differences between the optimal TS and ETS decrease with increasing *α*.
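
The optimal TSS threshold can also be found by a direct numerical scan, which reproduces the 37% quoted for *α* = 0.5. A sketch under the same reliability and power-law assumptions (names are ours):

```python
import numpy as np

def tss(alpha, p_t, lo=1.0, hi=100.0):
    """TSS for a reliable forecast system whose probabilities (%) follow
    f(p) ~ p**(-alpha) on [lo, hi], converted to yes/no at threshold p_t."""
    def integral(power, a, b):                     # closed-form of p**power
        if np.isclose(power, -1.0):
            return np.log(b / a)
        return (b ** (power + 1) - a ** (power + 1)) / (power + 1)
    H = integral(1.0 - alpha, p_t, hi) / 100.0     # hits
    M = integral(1.0 - alpha, lo, p_t) / 100.0     # misses
    FA = integral(-alpha, p_t, hi) - H             # false alarms
    CN = integral(-alpha, lo, p_t) - M             # correct negatives
    return H / (H + M) - FA / (FA + CN)

# scan thresholds: the maximum for alpha = 0.5 sits at the quoted 37%
thresholds = np.linspace(1.5, 99.5, 981)
best = thresholds[np.argmax([tss(0.5, t) for t in thresholds])]
print(int(round(best)))   # prints 37
```

The scan also confirms that the optimum drifts toward low thresholds as *α* increases, matching the behavior in Fig. 7.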

## 3. Application to the NHC Wind Speed Probability Forecast Product

The expectations for a probabilistic wind speed forecast system, developed in section 2, provide the basis for verification of the NHC WPFP. The evaluation focuses on the IP forecasts, as they are closest to the instantaneous probabilities of the simple Monte Carlo system. The WPFP is first described followed by the verification procedure. Results include a discussion of the WPFP IP frequency distributions, system reliability, and decision thresholds.

### a. NHC WPFP data

The NHC issues the WPFP when a TC is a potential threat to the United States or countries in the Atlantic basin (i.e., at least one city has a forecast probability greater than 0%). The wind speed probabilities are calculated based, in part, on running 5-yr errors in the official NHC track and intensity forecasts. The WPFP has both a graphical and a text component, the latter of which is evaluated here. For a given storm and designated cities, the text product includes both the wind speed category and probability forecasts (e.g., see Table 1; Splitt et al. 2010). The rerun WPFP forecasts used the 2012 error statistics and the Goerss ensemble adjustment for the 2004–11 Atlantic hurricane seasons, even though track forecasts improved throughout this period. In total, the dataset includes 136 land-threatening storms. As in Splitt et al. (2010), a 400-km buffer zone is implemented to limit the number of null forecasts (i.e., probability forecasts less than 0.5%). The nulls directly impact the “correct negatives” and can thus skew performance evaluation. However, the buffer is large enough to ensure that no “missed” forecasts are thrown out. The locations used in this work include 155 cities in the United States and other countries in the Atlantic basin and eight points in the Gulf of Mexico (Fig. 8). The NHC issues the WPFP every 6 h for wind speeds of at least 34, 50, and 64 kt and forecast time intervals including 0–12, 12–24, 24–36, 36–48, 48–72, 72–96, and 96–120 h (NWS 2008). Hereafter, the intervals in the NHC products are referred to by the end hour; for example, the 0–12-h interval is referred to as 12 h.

### b. WPFP text product probability frequency distributions

Normalized frequency distributions of the WPFP text product IP values were generated for each of the wind speed radii and forecast time intervals (Fig. 9). The IP product frequency distributions have attributes similar to those developed from the simple Monte Carlo system (cf. Fig. 6) even with the wide range of feature sizes in the observed dataset. The 120-h IP frequency distributions (gray line) appear to be fairly well described by a power-law distribution, and the distributions cut off well before reaching 100%. This cutoff is expected given that the 120-h forecast will likely have the largest location errors and, thus, lower *R*_{f}/*σ* and smaller *p*_{max} [Eq. (1)]. In addition, the maximum probability for the 120-h interval successively decreases for increasing wind speed category (e.g., the 64-kt radius is smaller than the 34-kt radius). While a portion of the 12-h forecast interval has a frequency distribution that is well described by a power law, the distribution becomes increasingly U shaped as a result of the smaller location errors associated with the shorter forecast interval. Table 7 lists the values determined for the power-law exponent *α* from a best-fit line for each of the data subsets. In cases with a U-shaped distribution, a subjectively determined probability range was used. The *α* values range from about 1.3 to 2.4 and are larger than the idealized values in Table 4. As expected, the *α* values change systematically from short to long forecast lead times due to the higher uncertainty in the extended forecasts. Although not an instantaneous probability, the IP product follows the general characteristics observed in the simple Monte Carlo model.

Bounded power-law fits to WPFP IP frequency distributions.

### c. Verification data for the WPFP

HURREVAC (short for Hurricane Evacuation; FEMA 1995; Sea Island Software Inc. 2006), a GIS-based hurricane decision assistance program for emergency managers, is used to verify the WPFP. To maintain consistency with previous studies (Splitt et al. 2010), the HURREVAC radii for 34-, 50-, and 64-kt wind speeds for each storm advisory were used. Shape files consisting of the radii polygons were output from HURREVAC and used for verification. It is possible that HURREVAC might overestimate (high bias) the actual wind speeds due to the use of the maximum extent of the NHC wind radii within a storm quadrant, but an underforecasting due to this type of error was not observed (Splitt et al. 2010).

The IP product verification is different for the first forecast interval (the 12-h forecast), because each newly issued WPFP advisory is independent of the previous forecast advisory. As a result, the 12-h forecast interval verification is unique in that it is based on the occurrence (rather than onset) of the condition during this interval (i.e., it counts as a hit regardless of whether the onset of the wind speed category occurred in the previous advisory). For all other time intervals, the IP product verifies as a hit only if the forecast wind condition started during the forecast interval. If the condition begins before or after the interval, the forecast would be considered a miss.
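
The interval verification rules can be expressed as a small classifier. This sketch simplifies by assuming that a condition, once begun, persists into the first interval; the function name and structure are hypothetical, not taken from the actual verification code:

```python
def verify_ip(onset_hour, interval_start, interval_end):
    """Classify one IP forecast as 'hit' or 'miss'.

    onset_hour: hour, relative to the advisory time, at which the wind
    threshold was first met (may be negative if it began before the
    advisory), or None if the threshold was never met.  The first
    (0-12-h) interval verifies on occurrence; later intervals verify
    only if the condition BEGAN within the interval.  Assumes (for the
    sketch) that a condition that has begun persists into the interval."""
    if onset_hour is None:
        return 'miss'
    if interval_start == 0:                 # 12-h interval: occurrence
        return 'hit' if onset_hour <= interval_end else 'miss'
    if interval_start < onset_hour <= interval_end:
        return 'hit'                        # onset within the interval
    return 'miss'                           # began before or after it
```

For example, an onset before the advisory still verifies the 12-h interval, while an onset at hour 5 is a miss for the 12–24-h interval.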

### d. Attributes diagrams

Attributes diagrams (Hsu and Murphy 1986) were generated using the R (R Core Team 2013) verification package (Pocernich 2013). In an attributes diagram, the closer each probability point pair (forecast, observed) is located to the one–one line, the better the forecast. The numbers plotted in the diagram adjacent to each point (with subsample bin width of 10%) provide information regarding the relative frequency distribution of the forecast probabilities. The horizontal dashed line labeled “no resolution” delineates the region of the attributes diagram where the subsample relative frequency is equal to the full sample (or climatological) frequency. In this study, this represents a conditional climatological frequency when TCs are present in the basin (i.e., when NHC issues advisories for tropical cyclones). By definition, the “no skill” line is located equidistant between the perfect reliability and horizontal no-resolution lines. If a point lies below this line, the Brier skill score is negative. The shaded area represents positive skill. The 95% confidence intervals, calculated using a bootstrap methodology with replacement, are also shown (whiskers).
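
The point pairs of an attributes diagram are binned forecast probabilities paired with observed relative frequencies. The study itself used the R verification package; the sketch below is an independent Python illustration of the binning, not that package:

```python
import numpy as np

def reliability_points(forecast, observed, bin_width=0.10):
    """Bin forecast probabilities and pair each bin's mean forecast with
    the observed relative frequency (the points of an attributes
    diagram), plus the forecast count per bin (the refinement info)."""
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    edges = np.arange(0.0, 1.0 + bin_width, bin_width)
    centers, obs_freq, counts = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (forecast >= lo) & (forecast < hi)
        if sel.any():
            centers.append(forecast[sel].mean())
            obs_freq.append(observed[sel].mean())
            counts.append(int(sel.sum()))
    return centers, obs_freq, counts

# synthetic, perfectly reliable forecasts fall near the one-one line
rng = np.random.default_rng(1)
f = rng.uniform(0.0, 1.0, 50_000)
o = (rng.uniform(0.0, 1.0, 50_000) < f).astype(float)
centers, obs_freq, counts = reliability_points(f, o)
```

Points above or below the one–one line at high probabilities would indicate the kind of overforecasting discussed next.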

The reliability for the entire (i.e., all wind speed categories and forecast intervals for 2004–11) set of WPFP IP text products is depicted in an attributes diagram (Fig. 10) with refinement distribution. The reliability is fairly close to the one–one line but diverges at higher forecast probabilities (i.e., greater than 60%) at which point the forecasts lack resolution and are biased somewhat high (i.e., overforecast). Although these high probability forecasts compose a small subset of the total (less than 4%), the results are similar to those in Splitt et al. (2010). While the overforecasting at higher probabilities is not present in the reliability diagrams in DeMaria et al. (2013), their reliability curves have a similar shape to that in Fig. 10 such that both depict poorer resolution for the forecast system at higher probabilities. The differences in the reliability curves may be an artifact of selecting verification locations that are predominantly land based as opposed to the full WPFP grid. Here, the focus on land-based locations of the text product may produce some observational and/or forecast biases during landfall or near-landfall scenarios. The no-resolution line for the full dataset is at 2.4% and represents the conditional climatology for the full dataset. For individual wind radii and lead times (forecast intervals), the conditional climatological values range from approximately 4% (shorter lead times) to 1% (longer lead times).

### e. Decision thresholds and bootstrap resampling

Indirect methods of probability forecast evaluation are used to calculate verification statistics (e.g., skill scores). In this approach, the probabilities are converted to a binary (yes or no) forecast using a threshold, whereas the observations are already binary (i.e., event occurred or event did not occur). The WPFP probabilities and verification data for each forecast location and advisory time interval are classified via the standard 2 × 2 contingency table as previously described. The classification depends on two factors: 1) the threshold that is used to determine whether or not an event is forecast to occur and 2) whether the event was “observed” to occur (as verified using HURREVAC). The following verification statistics were calculated: probability of detection (PoD); probability of false detection (PoFD); false alarm ratio (FAR); threat score, which is also known as the critical success index (CSI); Heidke skill score; true skill statistic; and bias score. One advantage of the TSS and HSS is that they take into account all categories of the contingency table and are, thus, generally more robust. In addition, the interpretation of the TSS and HSS is intuitive; they measure skill relative to random guessing.
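
All of the listed statistics derive from the four contingency-table counts; a compact sketch (the formulas are the standard ones in Wilks 2006, and the names are ours):

```python
def scores(H, M, FA, CN):
    """Verification statistics from a 2x2 contingency table of hits,
    misses, false alarms, and correct negatives."""
    n = H + M + FA + CN
    pod = H / (H + M)                        # probability of detection
    pofd = FA / (FA + CN)                    # probability of false detection
    far = FA / (H + FA)                      # false alarm ratio
    csi = H / (H + M + FA)                   # threat score / CSI
    tss = pod - pofd                         # true skill statistic
    h_rand = (H + M) * (H + FA) / n          # hits expected by chance (ETS)
    hss = 2.0 * (H * CN - M * FA) / (
        (H + M) * (M + CN) + (H + FA) * (FA + CN))
    ets = (H - h_rand) / (H + M + FA - h_rand)
    bias = (H + FA) / (H + M)
    return dict(PoD=pod, PoFD=pofd, FAR=far, CSI=csi,
                TSS=tss, HSS=hss, ETS=ets, bias=bias)
```

A perfect table (no misses or false alarms) gives TSS = HSS = 1, while a purely random one gives TSS = HSS = ETS = 0.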

Optimal decision thresholds are commonly determined by finding the particular threshold that maximizes the value of a given skill score (Hennon and Hobgood 2003). Here, the optimal decision threshold is the probability associated with the maximum value of either the Heidke skill score (HSS based) or true skill statistic (TSS based). While uncertainty in the optimal threshold values has been documented (e.g., Hennon et al. 2005), it is uncommon to include these uncertainties in the actual selection process. Here, the verification data are mined using bootstrap resampling (Efron and Tibshirani 1993) and the mean of all the thresholds from the resampling is chosen as the best overall threshold. In addition to providing statistically robust threshold values, the bootstrap methodology yields confidence interval estimates for the TSS- and HSS-based thresholds. This approach is particularly useful for data that are not normally distributed, as is the case for both the TSS and HSS where the skill scores as a function of threshold are skewed distributions.

For both the full verification dataset and subsets, paired data (forecast and verification) are randomly selected (with replacement) to construct a new dataset. The total number of pairs in each of the new datasets is constrained to equal the original. TSS- and HSS-based decision thresholds (and associated skill scores) are determined for each of the 1000 samples and are sorted in ascending order. Using the 2.5th and the 97.5th percentiles of the bootstrap distribution, the 95% confidence intervals are obtained. Also, the mean and mode values for the decision thresholds are mined from the bootstrap set.
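
The bootstrap procedure (resample pairs with replacement, recompute the optimal threshold, then take the mean and the 2.5th/97.5th percentiles) can be sketched as follows; the sample sizes, threshold grid, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_threshold(forecast, observed, thresholds, n_boot=1000):
    """Bootstrap the TSS-optimal decision threshold: resample
    forecast/observation pairs with replacement, find the threshold
    maximizing the TSS in each resample, and report the mean threshold
    and a 95% percentile confidence interval."""
    forecast = np.asarray(forecast)
    observed = np.asarray(observed)
    n = len(forecast)
    best = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample with replacement
        f, o = forecast[idx], observed[idx]
        tss_vals = []
        for t in thresholds:
            yes = f >= t
            H = np.sum(yes & (o == 1))
            M = np.sum(~yes & (o == 1))
            FA = np.sum(yes & (o == 0))
            CN = np.sum(~yes & (o == 0))
            pod = H / (H + M) if H + M else 0.0
            pofd = FA / (FA + CN) if FA + CN else 0.0
            tss_vals.append(pod - pofd)
        best[b] = thresholds[int(np.argmax(tss_vals))]
    return best.mean(), np.percentile(best, [2.5, 97.5])
```

The spread of the resampled optima directly quantifies the threshold uncertainty discussed below.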

The resulting optimal decision thresholds from both the peak skill value and the bootstrap resampling procedure are shown in Figs. 11 and 12 (TSS and HSS, respectively). The figures include the bootstrap median, mean, and associated confidence intervals. As expected, the optimal TSS and HSS thresholds decrease as the forecast time interval, and hence the location error, increases. The optimal (peak skill) threshold values differ from the bootstrap mean and median by a few percent (at most) and are within the bootstrap confidence intervals. The relative threshold uncertainty is larger for the TSS than for the HSS, which appears to be a result of the TSS values being generally low. Furthermore, the observed power-law exponents in Table 7 fall in a range that maximizes the difference between the TSS and HSS optimal thresholds (Fig. 7), which explains the large differences observed in these decision thresholds for the WPFP.

The optimal TSS and HSS thresholds derived from a bounded power-law fit to each of the observed data distributions (i.e., Table 7) are also included in Figs. 11 and 12. The best-fit thresholds depend on the exponent *α*, which, in turn, is sensitive to the extent to which the data distribution approximates a power law. Hence, agreement between the best-fit power law and the other optimal-based thresholds is best at the longer forecast intervals, while the results are degraded for the shorter intervals where the data distributions tend to be U shaped. The relatively good agreement between the power-law fit and optimal-based thresholds is consistent with the expectation that these thresholds are largely determined by the feature location error.

As a baseline, Figs. 11 and 12 include the conditional climatological probability, that is, the probability of occurrence (for the data subset) given that a land-threatening tropical cyclone is in the basin. While the optimal TSS thresholds can become nearly indistinguishable from this conditional climatological value, especially at the longer forecast intervals, the HSS thresholds remain distinct. This is an inherent advantage of the HSS over the TSS for this forecast system.

## 4. Concluding comments

A simple Monte Carlo scheme to assess wind speed probabilities from tropical cyclones provides a framework for interpreting the probabilities obtained from a more advanced system such as the WPFP. In the simple model, the maximum probability forecast is a function of the feature size and location error. The probability distributions from such systems tend toward a bounded power law when the location error is large relative to the feature size and toward a U shape for small location error with respect to the feature size. Both of these characteristics are observable in the WPFP IP product. Because the power-law slope is observed to be larger for lower-skill forecasts, it could serve as a metric to track improvements within the forecast system.
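A minimal version of such a Monte Carlo scheme makes the two regimes concrete. The circular feature, Gaussian location error, and all parameter values below are hypothetical; the forecast probability at a site is simply the fraction of perturbed feature centers that place the site inside the feature:

```python
import numpy as np

def wind_probability(dist, radius, sigma, n=2000, rng=None):
    """Probability that a site at distance `dist` from the forecast feature
    center lies inside a circular feature of radius `radius`, given a 2-D
    Gaussian location error with standard deviation `sigma`."""
    rng = rng or np.random.default_rng(0)
    dx = rng.normal(0, sigma, n)            # realized center offsets
    dy = rng.normal(0, sigma, n)
    inside = (dist - dx)**2 + dy**2 <= radius**2   # site fixed at (dist, 0)
    return inside.mean()

rng = np.random.default_rng(42)
radius = 100.0                              # hypothetical feature size (km)
sites = rng.uniform(0, 400, 1500)           # sites scattered around the track

for sigma in (25.0, 300.0):                 # small vs large location error
    probs = np.array([wind_probability(d, radius, sigma, 500, rng) for d in sites])
    pc = 100 * probs[probs > 0]
    hist, _ = np.histogram(pc, bins=[1, 25, 50, 75, 100])
    # bin counts for probabilities in [1,25), [25,50), [50,75), [75,100]
    print(f"sigma={sigma:>5}: {hist}")
```

With the small location error (relative to the feature size) the binned probabilities carry more weight in the extreme bins than in the middle (the U shape), while with the large error the maximum attainable probability is only a few percent and the distribution collapses toward the lowest bins.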

Given probability distributions that approximate power laws, theoretical skill score values are shown to be a function of the power-law slope if the forecast system is reliable. In this case, the optimal threshold is directly tied to the data distribution and is thus prescribed by the forecast system model error. Hence, the optimal threshold is essentially just another metric for model error (i.e., a higher threshold implies lower error and vice versa) rather than a value that actually optimizes decision making. Given the data distributions for the WPFP, the TSS tends to have a fairly small range between data subsets compared to the HSS. When used within the context of decision thresholds, the TSS might benefit from being evaluated with higher-precision probabilities (e.g., tenths of a percent). Additionally, TSS-based thresholds become nearly indistinguishable from the conditional climatological probabilities at the longer forecast time intervals, so it may be better to use the HSS or another skill score with greater fidelity. In either case, however, this has no bearing on the inherent error in the forecast system.

The analysis of Monte Carlo wind probability forecast products presented here could be used to improve interpretation tools such as the one developed for the NHC WPFP (Roeder and Szpak 2010). That tool provides interpretation categories by scaling the forecast probabilities using the highest probability historically issued for each forecast interval in each wind speed category. An alternative approach would be to scale the probabilities by the highest attainable probability (i.e., *p*_{max}) in each individual forecast interval. The conditional climatological values could also be used to help interpret the lower end of the forecast probabilities. These modifications to the interpretation tool are currently being developed. While the focus here is on the use of the simple model as a conceptual and interpretive tool, it could be enhanced for operational use by addressing, for example, the wind speed radius errors.

## Acknowledgments

This research was supported by funding from the 45th Weather Squadron (Grant FA2521-11-P-0152) and by NASA KSC (Grant NNX10AE31G). In particular, the authors sincerely acknowledge the contributions of Dr. Mark DeMaria and Andrea Schumacher of NOAA/NESDIS/STAR, CIRA/CSU, for providing updated (i.e., with the Goerss adjustment) probability forecasts for the 2004–11 hurricane seasons; without their generous support, this project would not have been possible. The authors thank the reviewers, in particular John Knaff, for their helpful comments.

## APPENDIX

### Optimal TSS from a Reliable Forecast System with a Power-Law Distribution

The TSS is defined as

$$\mathrm{TSS} = \frac{H}{H + M} - \frac{\mathrm{FA}}{\mathrm{FA} + \mathrm{CN}}, \tag{A1}$$

where *H* is the number of hits, *M* is the number of misses, FA is the number of false alarms, and CN is the number of correct negatives. Equation (A1) can be rearranged as

$$\mathrm{TSS} = \frac{H}{H + M} + \frac{\mathrm{CN}}{\mathrm{FA} + \mathrm{CN}} - 1. \tag{A2}$$

The optimal TSS threshold from a power-law distribution (with probabilities that range from 1% to 100%) can be obtained by substituting the integral form (of the components of Table 5) for probabilities between 1 and *p*_{max} into Eq. (A2):

$$\mathrm{TSS} = \frac{\int_{p_t}^{p_{\max}} p\,p^{-\alpha}\,dp}{\int_{1}^{p_{\max}} p\,p^{-\alpha}\,dp} + \frac{\int_{1}^{p_t} (100 - p)\,p^{-\alpha}\,dp}{\int_{1}^{p_{\max}} (100 - p)\,p^{-\alpha}\,dp} - 1, \tag{A3}$$

where *p*_{t} is a specified probability threshold and *α* is defined in Tables 5 and 6. Performing the various integrations in Eq. (A3) yields

$$\mathrm{TSS} = \frac{\left[\dfrac{p^{2-\alpha}}{2-\alpha}\right]_{p_t}^{p_{\max}}}{\left[\dfrac{p^{2-\alpha}}{2-\alpha}\right]_{1}^{p_{\max}}} + \frac{\left[\dfrac{100\,p^{1-\alpha}}{1-\alpha} - \dfrac{p^{2-\alpha}}{2-\alpha}\right]_{1}^{p_t}}{\left[\dfrac{100\,p^{1-\alpha}}{1-\alpha} - \dfrac{p^{2-\alpha}}{2-\alpha}\right]_{1}^{p_{\max}}} - 1. \tag{A4}$$

After some cancelation and applying the limits of integration,

$$\mathrm{TSS} = \frac{p_{\max}^{2-\alpha} - p_t^{2-\alpha}}{p_{\max}^{2-\alpha} - 1} + \frac{\dfrac{100\left(p_t^{1-\alpha} - 1\right)}{1-\alpha} - \dfrac{p_t^{2-\alpha} - 1}{2-\alpha}}{\dfrac{100\left(p_{\max}^{1-\alpha} - 1\right)}{1-\alpha} - \dfrac{p_{\max}^{2-\alpha} - 1}{2-\alpha}} - 1. \tag{A5}$$

Differentiating Eq. (A5) with respect to *p*_{t}, setting the result equal to zero, and solving for *p*_{t} gives the desired result (i.e., the optimal TSS threshold TSS_{o}):

$$\mathrm{TSS_o} = \frac{(1-\alpha)\left(p_{\max}^{2-\alpha} - 1\right)}{(2-\alpha)\left(p_{\max}^{1-\alpha} - 1\right)}. \tag{A6}$$

Note that Eq. (A6) is simply the mean of the bounded power-law distribution on [1, *p*_{max}]; that is, for a reliable forecast system the optimal TSS threshold coincides with the conditional climatological probability.
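The closed-form threshold TSS_{o} derived above can be checked numerically. The sketch below assumes the setup of the derivation (probabilities in percent on [1, *p*_{max}], event frequency equal to the forecast probability) and compares the closed form with a brute-force search over a synthetic reliable forecast set; the particular *α* and *p*_{max} values are illustrative:

```python
import numpy as np

def tss_opt_threshold(alpha, p_max):
    """Closed-form optimal TSS threshold (percent) for a reliable forecast
    system with a bounded power-law distribution on [1, p_max]; this is
    the mean of that distribution."""
    return ((1 - alpha) * (p_max**(2 - alpha) - 1)) / \
           ((2 - alpha) * (p_max**(1 - alpha) - 1))

def brute_force_threshold(alpha, p_max, n=200000, seed=0):
    """Empirical check: sample a reliable power-law forecast set and find
    the TSS-maximizing threshold by exhaustive search."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)
    # inverse-CDF sample of f(p) ~ p**(-alpha) on [1, p_max] (percent)
    p = (1 + u * (p_max**(1 - alpha) - 1))**(1 / (1 - alpha))
    events = rng.random(n) * 100 < p        # reliability: event chance = p/100
    best_t, best_s = 1.0, -np.inf
    for t in np.arange(1.0, p_max, 1.0):
        yes = p >= t
        h, m = np.sum(yes & events), np.sum(~yes & events)
        fa, cn = np.sum(yes & ~events), np.sum(~yes & ~events)
        s = h / (h + m) - fa / (fa + cn)    # TSS at this threshold
        if s > best_s:
            best_t, best_s = t, s
    return best_t

alpha, p_max = 1.5, 80.0
print(tss_opt_threshold(alpha, p_max))      # ≈ 8.94 (percent)
print(brute_force_threshold(alpha, p_max))  # empirical optimum, nearby
```

Because the TSS surface is flat near its maximum, the empirical optimum wanders by a few percent from sample to sample, which is the same sensitivity seen in the bootstrap confidence intervals for the TSS thresholds.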

## REFERENCES

Chapman, C. R., 2004: The hazard of near-Earth asteroid impacts on Earth. *Earth Planet. Sci. Lett.*, **222**, 1–15, doi:10.1016/j.epsl.2004.03.004.

DeMaria, M., J. A. Knaff, R. Knabb, C. Lauer, C. R. Sampson, and R. T. DeMaria, 2009: A new method for estimating tropical cyclone wind speed probabilities. *Wea. Forecasting*, **24**, 1573–1591, doi:10.1175/2009WAF2222286.1.

DeMaria, M., and Coauthors, 2013: Improvements to the operational tropical cyclone wind speed probability model. *Wea. Forecasting*, **28**, 586–602, doi:10.1175/WAF-D-12-00116.1.

Efron, B., and R. Tibshirani, 1993: *An Introduction to the Bootstrap.* Monographs on Statistics and Applied Probability, Vol. 57, Chapman and Hall, 436 pp.

FEMA, 1995: A hurricane inland winds model for the southeast U.S. [HURREVAC Inland Winds 1.0]: Documentation and user's guide. Division of Emergency Management, Florida Dept. of Community Affairs, Region IV, Federal Emergency Management Agency, 33 pp. [Available from Division of Emergency Management, FEMA, 2555 Shumard Oak Blvd., Tallahassee, FL 32399-2100.]

Gilliland, D. C., 1962: Integral of the bivariate normal distribution over an offset circle. *J. Amer. Stat. Assoc.*, **57**, 758–768, doi:10.1080/01621459.1962.10500813.

Goerss, J. S., 2007: Prediction of consensus tropical cyclone track forecast error. *Mon. Wea. Rev.*, **135**, 1985–1993, doi:10.1175/MWR3390.1.

Hauke, M. D., 2006: Evaluating Atlantic tropical cyclone track error distributions based on forecast confidence. M.S. thesis, Dept. of Meteorology, Naval Postgraduate School, Monterey, CA, 105 pp.

Hennon, C. C., and J. S. Hobgood, 2003: Forecasting tropical cyclogenesis over the Atlantic basin using large-scale data. *Mon. Wea. Rev.*, **131**, 2927–2940, doi:10.1175/1520-0493(2003)131<2927:FTCOTA>2.0.CO;2.

Hennon, C. C., C. Marzban, and J. S. Hobgood, 2005: Improving tropical cyclogenesis statistical model forecasts through the application of a neural network classifier. *Wea. Forecasting*, **20**, 1073–1083, doi:10.1175/WAF890.1.

Hsu, W. R., and A. H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. *Int. J. Forecasting*, **2**, 285–293, doi:10.1016/0169-2070(86)90048-8.

Main, I., 1996: Statistical physics, seismogenesis, and seismic hazard. *Rev. Geophys.*, **34**, 433–462, doi:10.1029/96RG02808.

Niklas, K. J., J. J. Midgley, and R. H. Rand, 2003: Tree size frequency distributions, plant density, age and community disturbance. *Ecol. Lett.*, **6**, 405–411, doi:10.1046/j.1461-0248.2003.00440.x.

NWS, 2008: Operations and Services Tropical Cyclone Weather Services Program. National Weather Service Instruction 10-601, NWSPD 10-6 Tropical Cyclone Products, 128 pp.

Pocernich, M., cited 2013: R verification package. [Available online at http://CRAN.R-project.org/.]

R Core Team, cited 2013: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]

Roeder, W. P., and M. C. Szpak, 2010: A tool to interpret the National Hurricane Center wind probability forecasts. *Proc. More Effectively Communicating the Science of Tropical Climate and Tropical Cyclones*, Seattle, WA, Amer. Meteor. Soc., 3.4. [Available online at https://ams.confex.com/ams/91Annual/webprogram/Paper180682.html.]

Sampson, C. R., and Coauthors, 2012: Objective guidance for use in setting tropical cyclone conditions of readiness. *Wea. Forecasting*, **27**, 1052–1060, doi:10.1175/WAF-D-12-00008.1.

Santos, P., M. DeMaria, and D. W. Sharp, 2010: Determining optimal thresholds for inland locations of tropical cyclone incremental wind speed probabilities to support the provision of expressions of uncertainty within text forecast products. Preprints, *20th Conf. on Probability and Statistics in the Atmospheric Sciences/14th Conf. on Aviation, Range, and Aerospace Meteorology/8th Conf. on Artificial Intelligence Applications to Environmental Science*, Atlanta, GA, Amer. Meteor. Soc., J10.2. [Available online at http://ams.confex.com/ams/pdfpapers/160384.pdf.]

Schwarzenbach, R. P., P. M. Gschwend, and D. M. Imboden, 2003: *Environmental Organic Chemistry.* 2nd ed. John Wiley and Sons, 1312 pp.

Splitt, M. E., J. A. Shafer, S. M. Lazarus, and W. P. Roeder, 2010: Evaluation of the National Hurricane Center's tropical cyclone wind speed probability forecast product. *Wea. Forecasting*, **25**, 511–525, doi:10.1175/2009WAF2222279.1.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences.* 2nd ed. International Geophysics Series, Vol. 100, Academic Press, 467 pp.