## 1. Introduction

It is commonly accepted that an a priori expectation of forecast error is a necessary part of every forecast (Kalnay and Dalcher 1987; Molteni and Palmer 1991; Tennekes et al. 1988; Palmer and Tibaldi 1988). In other words, a perfect forecast provides no societal benefit unless the end user knows beforehand the forecast is likely to perform better than average. Creating forecasts for any meteorological variable is a complex problem due to initial condition errors, imperfect model formulations, and the inherent uncertainty associated with the particular atmospheric flow pattern (Kalnay and Dalcher 1987). Although assessing the effects of model deficiencies on a given forecast is close to impossible, there have been many attempts to use proxies for the intrinsic stability of the atmospheric state and the quality of the initial conditions to predict forecast error (see the reviews by Ehrendorfer 1997; Wobus and Kalnay 1995). As a technique to quantify the level of uncertainty in a dynamical regime, this study links forecast error to synoptic parameters that represent the large-scale flow surrounding a tropical cyclone. Estimating the relationship between a particular atmospheric regime and forecast accuracy would be useful for quantifying confidence in an individual forecast. The economic value of having prior knowledge whether a particular forecast will be more or less reliable than average is well documented (Pielke and Carbone 2002; Katz and Murphy 1997; Wilks and Hamill 1995). Even with the obvious benefits of error predictions, to our knowledge there have been no studies applying such predictions to tropical cyclone intensity forecasts.

Operationally, the National Hurricane Center (NHC) includes a wind speed probability graphic in its suite of products that provides users with an array of probabilities for different intensity outcomes. However, the percentage likelihood of the different intensity outcomes is based on a random sampling of the previous 5 yr of NHC official forecast (OFCL) errors and does not consider information unique to the particular storm (and its environment) being evaluated (National Hurricane Center 2012b). Additionally, persons and institutions that rely on this information for important financial decisions and their own safety cannot easily translate the data presented into a measure of confidence in the deterministic forecast. Another concern for end users of intensity forecasts is that improvements in intensity forecasts have lagged considerably behind those of track forecasts over the past 20 yr. According to the NHC official verification results, the average forecast errors from 1990 to 2010 for the 24–72-h official intensity forecasts have improved by less than 1 kt, and in the case of the 24-h forecast, the error has actually increased (Cangialosi and Franklin 2011).

The main goal of this study is to test whether environmental parameters can be used to anticipate forecast uncertainty at the time of a forecast. These results are a first step toward providing real-time confidence guidance to accompany each deterministic intensity forecast, which would increase the value of forecasts without necessarily reducing forecast error. To test the feasibility of predicting the short-range forecast error of tropical cyclone forecasts, the Logistic Growth Equation Model (LGEM; DeMaria 2009), Statistical Hurricane Intensity Prediction Scheme (SHIPS; DeMaria and Kaplan 1994a; DeMaria et al. 2005), 5-Day Statistical Hurricane Intensity Forecast (SHF5; Knaff et al. 2003), Geophysical Fluid Dynamics Laboratory (GFDL; Bender et al. 2007) hurricane model, and OFCL are evaluated based on different performance metrics. SHF5 uses a statistical algorithm based on climatology and persistence and is one of the simplest statistical models; for this reason, it is considered a satisfactory benchmark for mean forecast errors (DeMaria et al. 2007). The U.S. Navy's version of GFDL (GFDN; Bender et al. 2007) is not included in our analysis because both GFDL and GFDN solve the dynamical equations nearly identically and their verification statistics for intensity forecasts are very similar (GFDL produces slightly better results at shorter forecast lengths). The better-performing inland decay version (DSHP; DeMaria et al. 2006) of the SHIPS model is used instead of SHIPS. A brief summary of the models^{1} assessed and their methodologies is available at the NHC model summary page (National Hurricane Center 2012a).

The performance of each model is evaluated by computing the mean absolute error (MAE), bias, and the percent skill (PS) relative to the SHF5 model for 24-, 48-, and 72-h forecasts in the Atlantic basin. This study focuses on the shorter forecast periods because in recent years these forecasts have not improved at the same rate as the longer ones (Cangialosi and Franklin 2011). Additionally, dynamical parameters are forecasted more accurately at shorter forecast lengths (McNoldy et al. 2012). These performance metrics are binned according to the magnitude of six dynamical parameters (“predictors”) and computed for each of the different models. Conventional one-variable histograms (hereafter referred to as histograms) and two-variable (joint) histograms are created to display the forecast performance metrics based on the bins. We believe that the statistical significance established between different bins in an individual model and corresponding bins in different models indicates that the predictor values (and synoptic regimes) are related to forecast error.
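The three performance metrics can be sketched as follows. This is a minimal illustration, not the authors' verification code; the function name is our own, NumPy is assumed, and the PS formula (percent reduction in MAE relative to the benchmark) is the standard definition of skill relative to a reference forecast, which the paper does not state explicitly.

```python
import numpy as np

def performance_metrics(forecast_kt, verified_kt, benchmark_kt):
    """MAE, bias, and percent skill relative to a benchmark model (e.g., SHF5).

    All inputs are intensities in knots for the same set of verified cases.
    Hypothetical helper; assumes PS = 100 * (MAE_ref - MAE) / MAE_ref.
    """
    forecast_kt = np.asarray(forecast_kt, dtype=float)
    verified_kt = np.asarray(verified_kt, dtype=float)
    benchmark_kt = np.asarray(benchmark_kt, dtype=float)

    err = forecast_kt - verified_kt
    mae = np.mean(np.abs(err))
    bias = np.mean(err)                      # positive = over-forecast on average
    mae_ref = np.mean(np.abs(benchmark_kt - verified_kt))
    ps = 100.0 * (mae_ref - mae) / mae_ref   # > 0 means the model beats the benchmark
    return mae, bias, ps
```

For example, a model that halves the benchmark's MAE would score a PS of 50%.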

The data analyzed to calculate model performance are described in section 2. Section 3 discusses the methods used to process the data and compute statistical significance. Section 4 displays a sample of the histograms and joint histograms created with emphasis placed on the inclusion of statistically significant results. Section 5 provides a summary and conclusions.

## 2. Data

The 24-, 48-, and 72-h intensity forecasts for all models are located in the National Oceanic and Atmospheric Administration's (NOAA) Automated Tropical Cyclone Forecast (ATCF) guidance comma delimited files (a-decks). Throughout this paper, the word “intensity” will refer to the wind speed of a tropical cyclone at the appropriate forecast or verification time. More precisely, tropical cyclone intensity is defined as the maximum 1-min average sustained surface wind (NWS 2012). The forecasts are verified with two different datasets: the NHC “best track” digital database (Landsea and Franklin 2013) and the 0-h operational intensity estimates from the aforementioned a-deck. In the a-deck files, the intensity forecasts archived for the GFDL, DSHP, LGEM, and SHF5 models are recorded to the nearest knot (1 kt = 0.51 m s^{−1}), while the OFCL forecast is recorded to the nearest 5 kt. The best-track intensities are also provided to the nearest 5 kt. For consistency, any measure of intensity for the rest of this paper will have the units of knots.

Forecasts are verified against both the best-track intensities and the operational intensities of each model, but this study focuses on the best-track results. Comparing forecasts against the best-track data offers a more consistent and accurate verification technique because the disagreement between operational analyses from different models can reach 45 kt. Also, best-track values are a combination of tropical cyclone data from many diverse sources (surface observations, ship and buoy reports, aircraft measurements, dropsonde measurements, and satellite observations), yielding a better-informed intensity estimate (Landsea and Franklin 2013). Summary statistics in section 4 will highlight the differences between the verification techniques, but the best-track results should be interpreted as more reliable.

The predictor values are available in the stext (SHIPS) files.

The POT predictor in SHIPS is the difference between the maximum potential intensity (MPI) and the current storm intensity (DeMaria and Kaplan 1994b). The MPI is determined empirically (DeMaria and Kaplan 1994b) and sometimes differs considerably from the theoretical MPI of Bister and Emanuel (1998). The shear predictor is the magnitude of the difference between the 850- and 200-hPa wind vectors. From 2007 to 2010, the horizontal wind components used to determine shear were computed from a spatial average (vortex removed) of all the GFS model grid points within 500 km of the 850-hPa storm center at the appropriate height levels. In 2006, the spatial average consisted of grid points between radii of 200 and 800 km because the vortex removal technique had not been implemented (Knaff et al. 2007). In both scenarios, the shear predictor is particularly adept at capturing the large-scale environment of a tropical cyclone. The direction of the 850–200-hPa shear vector has units of degrees and follows the convention that the shear is coming from the given heading. For example, a value of 90° means the shear vector is pointing west, 180° is a shear vector pointing north, etc. All of the predictors assessed are available in real time and therefore can serve as useful tools for predicting the error of tropical cyclone intensity forecasts.
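The "coming from" heading convention above can be made concrete with a small sketch that converts a from-heading into the direction the shear vector points toward. The function name is illustrative, not part of SHIPS; only the convention stated in the text (90° points west, 180° points north) is taken from the source.

```python
import math

def shear_pointing_vector(from_heading_deg):
    """Convert a 'shear coming from' heading (degrees) into a unit vector
    (east, north) giving the direction the shear vector points toward.

    Meteorological headings: 0 = north, increasing clockwise. A from-heading
    of 90 (from the east) yields a west-pointing vector; 180 (from the south)
    yields a north-pointing vector, matching the paper's convention.
    """
    to_heading = (from_heading_deg + 180.0) % 360.0  # direction pointed toward
    rad = math.radians(to_heading)
    east = math.sin(rad)
    north = math.cos(rad)
    return east, north
```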

The 5-yr period between 2006 and 2010 for the Atlantic basin is an exceptional sample for statistical analysis because the evaluated models received no major upgrades during this time period. Only the underlying global model, GFS, which provides initial conditions to the GFDL, DSHP, and LGEM, evolved considerably during the five hurricane seasons (documentation for GFS upgrades is available online at http://www.emc.ncep.noaa.gov/gmb/STATS/html/model_changes.html). These GFS upgrades were experienced by each model concurrently, thereby keeping the dataset consistent between the models. Also, the models selected for this study were operational for the duration of the 2006–10 Atlantic hurricane seasons (for this reason, HWRF is not included in our analysis). As a result, the dataset is homogeneous for the different models and a large number of verified forecasts are available. Still, some adjustments to the SHIPS and a-deck files are necessary to achieve proper alignment between the predictors and the forecasts. In 2006, NHC added a new storm (storm 2) to the a-deck and reorganized the rest of the storms accordingly. The SHIPS files are created in real time, so this postseason adjustment is not reflected in them. Additionally, it is important to note that data for “invests” (low pressure areas monitored by forecasters for possible development) are not included in our analysis. As a result, the performance of the different models is calculated using storms that at least became tropical or subtropical depressions during their life cycles.

However, the model verification presented here contains more cases (forecasts with corresponding 0-h best-track verification) than what is listed from 2006 to 2010 in the NHC verification report (Cangialosi and Franklin 2011). Similar to this study, the NHC excludes invests from their verification statistics. The discrepancy in the number of cases between the two verification results originates from the way the NHC treats weakening storms. If a tropical cyclone is forecasted to dissipate, then the NHC will not include that storm in their OFCL forecast performance metrics. For example, a storm with a current intensity of 60 kt and a 48-h forecast for a 25-kt low (LO in best-track file) in the best-track file is not included in the NHC verification results even if the verification exists. Additionally, when a hurricane is forecasted to transition into an extratropical storm (EX in best-track file), NHC excludes these cases (J. Cangialosi 2012, personal communication). Results from both scenarios are included in our analysis, thereby expanding the sample size for each forecast time. Table 1 shows the number of forecasts verified for the models at each of the forecast times.

Number of verified forecasts (best-track verification), for DSHP, GFDL, LGEM, and OFCL during each of the forecast periods. These totals are for Atlantic basin storms between 2006 and 2010.

## 3. Methodology

There are 12 synoptic variables used as predictors at each forecast time: six are taken from the initial forecast time and six are derived from averages over the forecast time period. To evaluate intensity forecasts for each model, histograms are made for individual predictors by selectively binning a predictor and plotting the performance metrics based on those bins. Joint histograms are created by graphing the performance metrics against two predictors. Each square on the joint histograms represents a range of values for each predictor, and each square is shaded to indicate the magnitude of MAE, PS, or bias. Binning is accomplished through two different methods: either dividing the data into three approximately equal-sized bins or selecting arbitrary bin ranges to gain insight on whether certain synoptic regimes yield anomalous results. The equal-sized bins are determined by collecting all the predictor values for the different models and splitting them into thirds based on the values of the predictor. The convention for plotting both single and joint histograms is that each bin includes data that fall on the exact value of its upper limit, while exact values on the lower limit belong to the preceding bin. The only exception is the lowest bin in each figure.
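The binning conventions above (upper edge inclusive, lowest bin also including its lower limit, and equal-sized terciles) can be sketched as follows. This is a minimal illustration under those stated rules; the function names are our own and NumPy is assumed.

```python
import numpy as np

def bin_indices(values, edges):
    """Assign each value to a bin whose upper edge is inclusive.

    edges = [e0, e1, ..., ek] defines bins (e0, e1], (e1, e2], ...;
    the lowest bin additionally includes its lower limit e0, matching
    the paper's convention. Out-of-range values are clamped into the
    end bins here (a simplification for illustration).
    """
    idx = np.digitize(values, edges, right=True) - 1
    return np.clip(idx, 0, len(edges) - 2)

def tercile_edges(values):
    """Equal-sized bins: split the pooled predictor values into thirds."""
    return np.quantile(values, [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
```

For example, with edges `[0, 10, 20]`, a value of exactly 10 lands in the first bin and a value of exactly 0 does as well (lowest-bin exception).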

The 0-h and time-averaged predictors at each forecast time yield many possible predictor combinations with which to qualitatively analyze the data. The 12 different predictors for each of the three forecast times and three performance metrics lead to the creation of 108 histograms for each of the four models. The joint histograms lead to 15 possible two-predictor combinations (shear paired with intensity, shear paired with storm speed, shear paired with POT, etc.). These combinations can be plotted for both 0-h and time-averaged predictors for three forecast times and three performance metrics (time-averaged predictors are only paired with time-averaged predictors and 0-h predictors are only paired with 0-h predictors). As a result, displaying the different permutations requires 270 joint histograms for each of the four models. Because of space limitations, we will be showing only a small sample of the figures, focusing on the statistically significant results.
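The figure counts above follow directly from the combinatorics and can be checked in a couple of lines (variable names are ours; the 15 pairings are the combinations of six predictors taken two at a time, restricted to within-type pairs):

```python
from math import comb

predictors_per_type = 6   # six 0-h and six time-averaged predictors
forecast_times = 3        # 24, 48, and 72 h
metrics = 3               # MAE, bias, PS

# One-variable histograms: 12 predictors x 3 forecast times x 3 metrics
histograms = (2 * predictors_per_type) * forecast_times * metrics      # 108

# Joint histograms: pairs only within a predictor type (0-h or averaged)
pair_combos = comb(predictors_per_type, 2)                             # 15
joint_histograms = pair_combos * 2 * forecast_times * metrics          # 270
```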

After creating the standard and joint histograms, two-sample *t* tests (e.g., Wilks 2006) are carried out to establish statistical significance between the different models and synoptic regimes within each model. When determining the significance between different bins in a histogram for a particular model (e.g., the 40–70-kt initial intensity compared to the 70–100-kt initial intensity for the GFDL model), an unpaired *t* test is conducted to test if the differences in the means of individual bins are significant. Equation (5.8) from Wilks (2006), adjusted to account for serial correlation between the forecasts [see Wilks's Eq. (5.12)], is used to determine the Gaussian test statistic *z*, which is converted to a *p* value. When the *p* value is less than the significance threshold of 0.05 for the two-sided test, the difference in the means of individual bins is considered statistically significant. In joint histograms, there are smaller bin sizes and additional bins to compare against, so it is harder for a bin to achieve statistical significance against all other bins.

For establishing significance between the same bin in different models (e.g., comparing the 40–70-kt initial intensity bin for GFDL with the 40–70-kt initial intensity bin for LGEM), paired *t* tests are used. In this case, Eq. (5.11) from Wilks (2006), again adjusted to account for serial correlation, is used to determine the Gaussian test statistic *z*, which is converted to a *p* value. The *t* test is paired in this scenario, because the data values making up the corresponding bins in different models are observed simultaneously. The two-sided *t*-test *p* value of 0.05 is again used as the statistical significance threshold.
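A minimal sketch of both tests is given below. It is not the authors' code: the function names are ours, NumPy is assumed, and the serial-correlation adjustment is the usual effective-sample-size reduction based on the lag-1 autocorrelation (in the spirit of Wilks's Eq. 5.12); clamping negative autocorrelations to zero is our conservative simplification.

```python
import numpy as np
from math import erf, sqrt

def _effective_n(x):
    """Effective sample size reduced for lag-1 serial correlation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    rho = float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))
    rho = max(rho, 0.0)  # conservative: never inflate the sample size
    return len(x) * (1.0 - rho) / (1.0 + rho)

def _two_sided_p(z):
    """Two-sided p value for a Gaussian test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def unpaired_z_test(x1, x2):
    """Difference-of-means test for two bins with independent membership."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    se = sqrt(x1.var(ddof=1) / _effective_n(x1)
              + x2.var(ddof=1) / _effective_n(x2))
    z = (x1.mean() - x2.mean()) / se
    return z, _two_sided_p(z)

def paired_z_test(x1, x2):
    """Paired test on per-case differences, for the same bin in two models."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    z = d.mean() / sqrt(d.var(ddof=1) / _effective_n(d))
    return z, _two_sided_p(z)
```

The paired form is appropriate when the two samples are observed case by case on the same forecasts, exactly as described above for corresponding bins in different models.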

The *t* tests require individual bin entries to be compared, so *t* tests are not carried out for PS histograms (PS is computed from aggregate errors rather than case by case).

## 4. Results

In this section, the performance of the intensity forecasts for four operational models during five Atlantic basin hurricane seasons is discussed in detail. Tables 2 and 3, respectively, show the average MAE and bias for the different models at each forecast time. Tables 4 and 5 display similar information to Tables 2 and 3 except the forecasts are verified with operational intensity estimates for the individual models instead of the best-track data. In general, the best-track verification results in smaller errors for every model and identifies OFCL as the best-performing model and DSHP as the worst-performing model for 48- and 72-h forecasts (GFDL is the worst at 24 h). The operational intensity verification results are considerably different; using these data, the GFDL consistently provides the poorest intensity forecasts while the LGEM is the best-performing model for 48- and 72-h intensity forecasts (OFCL is the best at 24 h).

MAE (kt), using best-track verification, for DSHP, GFDL, LGEM, and OFCL during each forecast period.

Bias (kt), using best-track verification, for DSHP, GFDL, LGEM, and OFCL during each forecast period.

MAE (kt), using 0-h operational analyses, for DSHP, GFDL, LGEM, and OFCL during each forecast period.

Bias (kt), using 0-h operational analyses, for DSHP, GFDL, LGEM, and OFCL during each forecast period.

Tables 3 and 5 are generally consistent irrespective of the verification technique, although the best-track verification usually results in lower bias values. Tables 3 and 5 also indicate that longer forecast lead times are associated with higher biases. It is clear that the statistical–dynamical models account for model bias with a forecast correction whereas the dynamical model, GFDL, does not and consequently displays the largest mean bias. The discrepancies between the best-track and operational intensity tables demonstrate that the verification method employed, and specifically each model's analysis of 0-h intensity, can greatly influence the conclusions one draws about the performance of different models. Although the topic of seeking the “best estimate” of intensity is outside the scope of this paper, the effects of using two different verification datasets demonstrate that this issue deserves further attention. Subsequent analysis will focus on the more reliable best-track verification results.

Several important observations should be highlighted from the multitude of figures created. First, the verification trends at different forecast lead times occasionally show discrepancies. Certain bin ranges are statistically significant for all models or a particular model at one forecast hour but exhibit dissimilar behavior at another forecast hour. Second, more statistically significant bins are present for the time-averaged predictors than the 0-h predictors when verifying long-range forecasts. The more statistically robust relationships with time-averaged parameters arise largely because those parameters account for the time variation of the large-scale flow over the longer forecast period.

When evaluating the performance of a model at a particular lead time, it is important to review all the different performance metrics as a way to develop an understanding of the observed forecast error. If a model records high MAE values for a bin, one needs to question whether this MAE arises from a positive or negative bias in the model. When the difference in the mean bias between this bin and the other bins is statistically significant, then a conditional bias correction might be possible or there potentially is a correctible flaw in the model. If the same bin shows comparatively high PS even with the anomalously large MAE, then it is clear that the model is still improving on the benchmark model. Moreover, a high-MAE bin might incorrectly imply a model flaw, whereas a high PS value indicates that the large error is due to an inherently unpredictable synoptic pattern. Finally, adding a second predictor is found to provide useful information about the relationship between the meteorological conditions surrounding a tropical cyclone and forecast error. A histogram will frequently highlight a bin range as anomalously different from the mean of the forecast lead time, but adding another predictor better defines the synoptic environment that is leading to the anomalous forecast errors.

### a. One-variable histograms

Figures 1–3 convey how predictor values can be differently associated with forecast performance depending on the forecast lead time. Figure 1 shows the MAE of the 48-h intensity forecasts for the DSHP, GFDL, LGEM, and OFCL models plotted against the 48-h average POT. The black numbers at the bottom of each histogram entry represent the number of cases in each bin. The histogram indicates that tropical cyclones with lower forecasted POTs produce lower errors for all models, while the 140–160-kt POT bin is the worst-performing bin for every model. Additionally, an unpaired *t* test reveals that both GFDL and OFCL record a statistically significant difference between the 100–120-kt bin and every other bin. A paired *t* test demonstrates the MAE of this bin in GFDL and OFCL is significantly different from the corresponding bin in LGEM and DSHP (but not against each other). On the other hand, the 140–160-kt bin is not statistically significant against more than one bin in each of the models.

Figure 2 contains the same predictor and performance metric as in Fig. 1 but with equal-sized bins. Figure 2 also captures the same general trend as Fig. 1, with MAE rising as POT increases. However, the coarser binning in Fig. 2 obscures the fact that the tropical cyclones with the highest mean forecast POT are actually not the hardest to forecast. Figure 1 shows that the 140–160-kt bin has higher MAE than the 160–180-kt bin. Since the equal-sized-bin graphics frequently reveal trends similar to the manually selected bins and often miss some important details in the performance metrics, the following results will focus on manually selected bins.

Figure 3 is similar to Fig. 1, but it shows the MAE of 72-h forecasts plotted against 72-h average POT. In Fig. 3, the 100–120-kt bin also contains the lowest MAE for every model but unlike Fig. 1, no model contains a 100–120-kt bin that is statistically significant against every other bin. Also, the 140–160-kt bin does not record the highest MAE in GFDL and OFCL. Figures 1 and 3 show that the different lead-time error statistics communicate a different message about the effect of POT on model performance. The 48-h MAE histogram highlights a definitive average POT range where GFDL and OFCL excel, while a similar deduction is not possible for the 72-h case. Although it is not a statistically robust conclusion, Fig. 1 also highlights a bin where all models perform the worst. The physical justification for these results is not straightforward but the method used to create a DSHP forecast suggests that they should be expected. DSHP uses a different set of regression coefficients for each forecast length (DeMaria and Kaplan 1994a). In other words, the same dynamical parameters (storm speed, shear, etc.) are weighted differently based on the forecast period (24 h, 48 h, etc.). The results presented here support the practice of time-dependent weighting, and further exploration of the dynamical properties behind these trends could be useful.

Figures 4 and 5 illustrate that selecting average parameters as predictors for long-range forecasts often produces more statistically significant results. Figure 4 displays the 72-h MAE histograms for the four models with 0-h shear as the predictor. There are no bins in the GFDL and DSHP histograms that are statistically significant compared to any of the other bins for each model. In fact, the 30–40-kt initial shear bin compared to the 10–20-kt initial shear bin in LGEM and OFCL are the only statistically significant bin comparisons. However, in Fig. 5, there are much larger discrepancies between the MAEs of different bins. In Fig. 5, the predictor is the average forecast shear over the 72-h forecast. For the GFDL model, an unpaired *t* test shows the anomalously low MAE of the 30–40-kt shear bin is statistically significant at the 99% level compared to every other GFDL bin. OFCL also achieves low MAE for 72-h average shear between 20 and 30 kt; the MAE of this bin along with the highest shear bin is significant against both of the low-shear bins for OFCL at the 99% level. Additionally, the 20–30-kt shear bins for LGEM and DSHP are statistically significant against the two lower-shear bins in each model. Therefore, it appears that the mean forecast shear for a tropical cyclone over the longer forecast period is better linked to MAE than the initial shear. This observation is a common theme throughout all of the results.

Figures 6–8 illustrate the various analysis methods utilized in this study and emphasize how using multiple performance metrics can provide more detailed conclusions about forecast errors. These figures show the MAE, bias, and PS histograms for 24-h forecasts with 0-h intensity as the independent variable. Although Table 2 lists OFCL forecasts as the most skillful at 24 h, Fig. 6 indicates OFCL contains the bin (100–130 kt) with the highest MAE out of all the models. LGEM is the best-performing model in this bin range, with almost 5 kt less MAE than OFCL. A paired *t* test reveals that the difference in the means of the 100–130-kt intensity bin for the two models is statistically significant. Therefore, when producing 24-h forecasts of major hurricanes, it appears LGEM guidance should be weighted more heavily.

Figure 7 provides the bias values for the same predictor and bin ranges, supplementing the MAE results with a possible explanation for the surprisingly poor performance of OFCL for high-intensity tropical cyclones. All models have a high positive bias when the 0-h intensity falls between 100 and 130 kt. The OFCL bin has the largest bias with a mean of 12.4 kt for 95 verified 24-h forecasts. A *t*-test calculation indicates the difference between the mean bias of this high-intensity bin and every other OFCL bin is statistically significant. Additionally, a paired *t* test is used to compare the 100–130-kt bin for OFCL with the same bin in other models; the bias of this bin is statistically significant compared to all the other models. In summary, it appears that OFCL forecasts are less skillful than other models for high-intensity storms because forecasts in this bin have an anomalously high positive bias. A possible explanation for this surprising result is that the NHC is employing a “better safe than sorry” protocol for strong hurricanes and, consequently, maintaining or intensifying strong tropical cyclones in their forecasts. This technique is important for hurricanes approaching land because an underestimate in intensity for strong hurricanes could cause civilians and emergency managers to inadequately prepare, resulting in additional fatalities and monetary loss.

Figure 8 shows the same results but for PS. Clearly, OFCL forecasts achieve the lowest PS for the high-intensity bin, which is also observed to have the highest bias and MAE. However, OFCL still obtains a PS of 16.6%, and GFDL, LGEM, and DSHP all have at least one bin that achieves a lower PS. The fact that the 100–130-kt intensity bin achieves the highest MAE but does not record the lowest PS emphasizes that OFCL is struggling in a regime that is inherently uncertain. Nevertheless, all the other analyzed models attain considerably higher PS values for short-range forecasts of strong hurricanes.

### b. Joint histograms

To better define the environment that is lowering the PS for all models in high-intensity storms, 0-h shear is paired with 0-h intensity to create a joint histogram. Figures 9–11 show, respectively, the MAE, bias, and PS in joint histograms for 24-h forecasts. The white numbers at the bottom-right corner of each box represent the number of cases per bin. It is difficult to establish statistical significance between the larger number of bins in joint histograms, so only three bins are prescribed for each independent variable. The reduction in bins requires the 0-h intensity predictor to use different bin ranges than those seen in Figs. 6–8; the intensity bin ranges are instead chosen to approximately represent tropical depressions, tropical storms, and hurricanes. Figure 9 shows a similar trend to previous histograms with MAE increasing with higher initial intensity. When initial intensity is greater than 70 kt and the 850–200-hPa shear is between 10 and 20 kt, the 24-h forecast MAE for OFCL is 16.1 kt, almost 6 kt greater than the average MAE of OFCL 24-h forecasts. This high-intensity, medium-shear bin represents the synoptic regime with the largest error in each model, so it is clear tropical cyclone intensity is difficult to forecast in these situations regardless of the model. A *t* test demonstrates that the difference in the MAE between the OFCL high-intensity, medium-shear bin and almost every other OFCL bin is statistically significant. The only exception is the intensity bin of greater than 70 kt and a shear between 0 and 10 kt; the difference in the means of these two bins achieves a *p* value of only 0.09.
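The joint histograms above are, at bottom, a two-dimensional binned aggregation: each box collects the cases whose two predictor values fall in its ranges, then reports a metric and a case count. A minimal sketch (our own function name, NumPy assumed, upper bin edges inclusive as in the one-variable histograms, with out-of-range values clamped into the end bins as a simplification):

```python
import numpy as np

def joint_histogram(pred_a, pred_b, errors, edges_a, edges_b):
    """MAE and case counts on a 2-D grid defined by two predictors."""
    ia = np.clip(np.digitize(pred_a, edges_a, right=True) - 1,
                 0, len(edges_a) - 2)
    ib = np.clip(np.digitize(pred_b, edges_b, right=True) - 1,
                 0, len(edges_b) - 2)
    shape = (len(edges_a) - 1, len(edges_b) - 1)
    mae = np.full(shape, np.nan)          # NaN marks empty boxes
    counts = np.zeros(shape, dtype=int)
    abs_err = np.abs(np.asarray(errors, dtype=float))
    for i in range(shape[0]):
        for j in range(shape[1]):
            mask = (ia == i) & (ib == j)
            counts[i, j] = int(mask.sum())
            if counts[i, j] > 0:
                mae[i, j] = abs_err[mask].mean()
    return mae, counts
```

For instance, with intensity edges roughly separating depressions, tropical storms, and hurricanes and three shear bins, the result is the 3 x 3 grid of shaded boxes described in the text, with the per-box counts in the corners.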

It is important to mention that OFCL also has the bin with the lowest MAE in Fig. 9. When the initial intensity of a tropical cyclone is between 0 and 35 kt and the 0-h shear is greater than 20 kt, the MAE of OFCL is only 5.1 kt; the mean of this bin is significantly different from every other OFCL bin. All models appear to have small errors for this bin but only OFCL has a bin that is significant against all of the other bins within the model. Figure 10 shows the bias of the different models with the same predictors as Fig. 9. In all of the models, the forecasts for hurricane-strength tropical cyclones have a strong positive bias, which agrees well with Fig. 7. The large positive biases are collocated with the largest MAE values in Fig. 9, which suggests the biases could be responsible for the highest MAE. As expected, the high-intensity, medium-shear bin for the OFCL joint histogram records the highest positive bias.

Figure 11 shows PS as the performance metric. All three bins for the 10–20-kt shear range for the OFCL forecasts show low PS. This observation is consistent with the previous two figures. The superior PS of the GFDL, LGEM, and DSHP models for hurricane-strength storms is largely attributable to the lower biases seen in Fig. 10. OFCL also attains a positive PS (even with the highest MAE) for hurricanes because SHF5 is even worse at forecasting in this inherently chaotic regime. From the analysis of Figs. 9–11, it is apparent that medium-shear environments are contributing the most error for forecasts involving high-intensity storms. Some hypotheses are presented as to why strong storms lead to poor intensity forecasts, but it is less clear why medium shear is detrimental to forecast performance. It is possible the shear range between 10 and 20 kt is associated with less reliable 24-h forecasts because models are struggling to determine when moderate shear either fosters or stifles tropical cyclone development. Unlike with very weak or strong shear, models and forecaster intuition do not always agree on how a medium-shear environment will affect tropical cyclone intensification. It is likely that this uncertainty for medium-shear cases is largely due to the current deficiencies of dynamical models, which are unable to capture the small-scale interactions that control exactly how shear interacts with storm structure.

### c. Other notable results

Determining how background environments affect the performance of the statistical models is another focal point of this investigation. DSHP and especially LGEM have emerged as two of the best-performing intensity models over the last decade (Cangialosi and Franklin 2011). Nevertheless, LGEM and DSHP typically struggle with 850–200-hPa easterly shear, especially at lower latitudes (M. DeMaria 2012, personal communication). The data analysis conducted in this study is particularly well suited to testing such a hypothesis, and we created figures for all three forecast times and performance metrics. Three joint histograms (Figs. 12–14) are presented with 72-h average latitude and shear direction as the independent variables. Although these figures show only 72-h intensity forecast verification results, shorter forecast lead times also indicate that LGEM and DSHP have high errors in the low-latitude, easterly shear regime (not shown).

The cutoff between low and high latitudes is 20°N, and only two latitude bins are used (below 20°N and above 20°N) to keep the sample size large for all bins. Shear direction is divided into four bins to capture the effects of the shear vector pointing toward each of the four cardinal directions. Figure 12 shows the MAE for the four models and confirms that LGEM and DSHP are less skillful at low latitudes when the 850–200-hPa shear vector is directed westward. In fact, all of the models have their largest MAE when the 72-h average latitude falls between 0° and 20°N and the 72-h average 850–200-hPa shear direction is between 0° and 90° (pointing from the northeast quadrant). The bin with the largest MAE occurs in the LGEM joint histogram; the orange color in the low-latitude, northeasterly shear bin represents an MAE of 32.2 kt. This MAE is over 15 kt larger than the average MAE for 72-h LGEM forecasts. DSHP also performs poorly for this bin, recording an MAE that exceeds 30 kt. For both of these models, a paired *t* test shows that the mean of this bin is not statistically significantly larger than that of the corresponding bin in GFDL and OFCL, owing to the small sample size of the analyzed bins. However, an unpaired *t* test reveals that the mean of the low-latitude, northeasterly shear bin in LGEM and DSHP is significantly different from all of the other bins in the respective models (except for the 90°–180° shear direction, 0°–20°N bin in each model and the 0°–90° shear direction, >20°N bin in DSHP).
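
The unpaired bin-versus-bin comparison above can be sketched as follows. This is a minimal illustration using Welch's two-sample *t* statistic, which does not assume equal variances or sample sizes between bins; the function and the error samples are hypothetical, not the study's actual data.

```python
from statistics import mean, variance

def welch_t(bin_a, bin_b):
    """Welch's unpaired two-sample t statistic comparing the mean
    absolute errors of two (possibly unequally sized) histogram bins."""
    va, vb = variance(bin_a), variance(bin_b)  # sample variances
    se = (va / len(bin_a) + vb / len(bin_b)) ** 0.5
    return (mean(bin_a) - mean(bin_b)) / se

# Hypothetical absolute errors (kt): a high-error bin vs. a typical bin.
# A large |t| (compared against a t-distribution critical value) would
# indicate the bin means differ at a statistically significant level.
print(welch_t([30, 34, 28, 36], [15, 18, 12, 17, 16]))
```

The unpaired form is the right tool here because the two bins contain different forecast cases, unlike the paired test used when two models are verified on the same cases.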

Figure 13 depicts joint histograms with the same independent variables but with bias as the performance metric. For the 0°–90° shear direction, 0°–20°N latitude bin, DSHP and LGEM have a positive bias, but the mean bias of this bin is not statistically distinguishable from that of any other bin. Therefore, the extremely high MAE in these two bins is not necessarily attributable to bias. Figure 14 illustrates that the PS of this bin for LGEM and DSHP is considerably lower than that of the same bin in the other models. LGEM and DSHP obtain PSs of 12.9% and 17.9%, respectively, while GFDL and OFCL have PSs of 30.5% and 33.2%, respectively. Although the positive PS for DSHP and LGEM is surprising, this result is possible because SHF5 also performs very poorly in this bin.

Figure 15 is very similar to Fig. 12 except that the number of shear direction bins is reduced so there are only westerly and easterly shear bins. These larger-bin results agree with the previous figures: all models perform poorly at low latitudes with easterly shear. The low-latitude, easterly shear bin for LGEM records the highest MAE of all the bins. This bin is different from all of the other bins in LGEM at a *p* value of 0.08 or lower. The results are therefore consistent irrespective of how the predictors are binned and thus present an accurate diagnosis of the dependence of forecast error on tropical cyclone latitude and shear direction. Furthermore, the analysis of the figures suggests that forecasters should avoid relying on DSHP and especially LGEM when a tropical cyclone is in the low-latitude, easterly shear synoptic regime. These statistical–dynamical models appear to excel overall because their performance in westerly shear and at high latitudes compensates for their forecasts in these troublesome environmental conditions.

There are some interesting atmospheric patterns in which GFDL performs significantly worse than the other models. Figure 16 shows the MAE of 24-h intensity forecasts with 0-h storm speed as the independent variable. The >15-kt storm speed bin in the GFDL histogram contains a noticeably higher MAE than any other bin in all of the models. A paired *t* test reveals that this GFDL bin is significantly different from the corresponding bin in every other model. With the exception of the 5–10-kt 0-h storm speed bin (*p* value of 0.12), the MAEs of the other GFDL bins also differ from that of the high storm speed bin at a statistically significant level. Surprisingly, the other models show no real trend between MAE and the translation speed of the tropical cyclone. Therefore, we can confidently conclude that 24-h GFDL forecasts for storms traveling at greater than 15 kt are not only worse than other GFDL forecasts but also worse than the other models' forecasts for these fast-moving storms.
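
The paired model-versus-model comparison above can be sketched as follows; it is paired because both models are verified on the same set of forecast cases, so errors are differenced case by case. A minimal illustration; the function and matched error samples are hypothetical.

```python
from statistics import mean, stdev

def paired_t(errors_a, errors_b):
    """Paired t statistic on matched absolute errors: index i holds the
    two models' errors for the same storm and verification time."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(d) / (stdev(d) / len(d) ** 0.5)

# Hypothetical matched 24-h errors (kt) for two models on four cases.
print(paired_t([18, 22, 25, 19], [12, 15, 16, 13]))
```

Pairing removes the case-to-case variability shared by both models, which is why it is the appropriate test when comparing the same bin across models rather than different bins within one model.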

By pairing 0-h storm speed with 0-h POT, we gain further insight into the environmental conditions that lead to the large GFDL errors for fast-moving storms. We select 0-h POT as the second predictor instead of initial intensity because the results are more statistically significant. In Fig. 17, both 0-h storm speed and 0-h POT are used as independent variables for an MAE joint histogram. The GFDL joint histogram captures two high storm speed bins in which the MAE is substantially larger than the average MAE for 24-h forecasts. The GFDL bin that represents a 0-h storm speed between 15 and 22.5 kt and a 0-h POT between 150 and 180 kt has an MAE of 17.7 kt, almost 6 kt larger than the GFDL 24-h forecast average. The 120–150-kt POT bin for GFDL also has a very high MAE of 15.2 kt. Because of the small sample size of the higher POT bin, only the 120–150-kt POT bin differs from the corresponding bin in all of the other models at a statistically significant level. The other three models do not exhibit a regime in storm speed–POT space that causes anomalously poor forecasts. Figure 18 displays bias as a function of the same two independent variables. Neither bin with anomalously high MAE records a large bias value. In Fig. 19, PS is plotted against 0-h storm speed and 0-h POT. The GFDL bin corresponding to a 0-h storm speed between 15 and 22.5 kt and a 0-h POT between 120 and 150 kt stands out as the bin with the lowest PS. In fact, the PS of this bin is −16.3% for 92 verified forecasts. Based on this information, a forecaster could discount GFDL guidance when producing short-range intensity predictions of fast-moving, medium-POT tropical cyclones.
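
The joint-histogram construction itself can be sketched as below: each forecast case is assigned to a two-dimensional bin by its two predictors, and MAE is computed per bin. The function, bin edges, and the tiny sample are hypothetical; the sample values are chosen only to reproduce the 17.7-kt bin value quoted above, not drawn from the study's data.

```python
def joint_histogram_mae(cases, x_edges, y_edges):
    """Bin absolute errors by two predictors and return per-bin MAE.

    cases: list of (x_value, y_value, abs_error) tuples.
    Returns {(i, j): mae} keyed by bin index along each predictor.
    """
    bins = {}
    for x, y, err in cases:
        i = next((k for k in range(len(x_edges) - 1)
                  if x_edges[k] <= x < x_edges[k + 1]), None)
        j = next((k for k in range(len(y_edges) - 1)
                  if y_edges[k] <= y < y_edges[k + 1]), None)
        if i is None or j is None:
            continue  # case falls outside the analyzed predictor ranges
        bins.setdefault((i, j), []).append(err)
    return {ij: sum(v) / len(v) for ij, v in bins.items()}

# Storm-speed edges (kt) and POT edges (kt) loosely mirroring Fig. 17.
speed_edges = [0, 7.5, 15, 22.5]
pot_edges = [0, 120, 150, 180]
cases = [(18, 160, 20.0), (18, 165, 15.4), (5, 100, 6.0)]
mae = joint_histogram_mae(cases, speed_edges, pot_edges)
print(mae[(2, 2)])  # MAE of the fast-moving, high-POT bin → 17.7
```

Each bin's error sample can then be fed to the paired or unpaired *t* tests to establish significance between bins.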

## 5. Summary and conclusions

This study represents one of the first attempts to determine whether tropical cyclone intensity forecast performance depends on the synoptic environment. A small sample of figures is presented with accompanying *t* tests to provide an example of the possible deductions from this innovative binning technique. The statistical significance established between bins provides robust evidence that forecast error is often related to the surrounding atmospheric environment. Important conclusions are established for the overall behavior of all models as well as the environment-based performance of each model. For all models, we frequently observed that the same predictor leads to varying evaluations of model performance depending on the forecast period. In other words, bins that captured a significant synoptic regime for particular models at a short forecast lead time would not necessarily be as statistically robust at a longer lead time. Also, for long-range forecasts, time-averaged predictors are found to produce more statistically significant results than 0-h predictors.

Several observations about individual models are also possible using this regime-dependent analysis. Although strong hurricanes lead to poor 24-h forecasts for every model, OFCL records the highest MAE. The bias histograms suggest this anomalously poor performance is attributable to a strong positive bias. Pairing initial intensity with initial shear reveals that hurricanes in medium shear are responsible for the high error observed for hurricane-strength storms in the MAE histogram. For 24-h forecasts with a 0-h storm speed above 15 kt, GFDL performs significantly worse than the other models. The joint histogram with initial storm speed and initial POT as predictors shows that a storm speed between 15 and 22.5 kt and a POT between 120 and 150 kt results in negative PS. A synoptic regime that is not conducive to good forecasts by DSHP and LGEM is also presented. When a tropical cyclone is at low latitudes in easterly shear, both DSHP and LGEM struggle at all forecast times (only results for 72-h forecasts are shown). The general trends and the individual inferences about each model that emerge from this environment-based evaluation technique emphasize the utility of this more detailed validation of tropical cyclone intensity forecasts.

In future work, additional dynamical and climatological parameters that are available in the SHIPS files will be added as predictors. Along with the already discussed synoptic variables, other predictors that serve as proxies for initial condition error and atmospheric instability will be used as independent variables. Predictors utilized by Molteni and Palmer (1991) and Palmer and Tibaldi (1988) will serve as a guide. An example of a predictor worth future investigation is the standard deviation (or spread) of the different models' initial operational analyses (from the ATCF a-decks); this parameter may represent the uncertainty in the initial environment. Also, the error in the previous 12-h forecast appears to be a good indication of the quality of the initial analysis for the current forecast (Palmer and Tibaldi 1988). The dispersion of the different models' forecasts has also been shown to be a good estimate of forecast error. The consistency between forecasts initialized at neighboring times (e.g., today's 48-h forecast versus yesterday's 72-h forecast) is another possible predictor. These predictors will be used to develop the verification statistics for additional models and Pacific basin storms.
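
As a concrete sketch of the first proposed predictor, the spread could be computed as the standard deviation of the different models' 0-h intensity analyses for a given storm and time. The function and the sample values are hypothetical; an actual implementation would parse the ATCF a-deck entries.

```python
from statistics import stdev

def analysis_spread(initial_intensities):
    """Spread (sample standard deviation) of several models' initial
    intensity analyses (kt) for one storm/time: a candidate proxy
    for uncertainty in the initial environment."""
    return stdev(initial_intensities)

# Hypothetical 0-h intensity analyses (kt) from four a-deck entries;
# a large spread would flag an uncertain initial state.
print(round(analysis_spread([62, 58, 65, 55]), 1))  # → 4.4
```

A low spread would suggest the models agree on the initial environment, motivating higher confidence in the resulting forecasts.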

Although the results represent a first step toward providing real-time confidence guidance for each model's intensity forecasts, these binned objective statistics have a variety of uses. First, by knowing when models consistently underperform or succeed, forecasters can gain intuition about which situations produce forecasts that deserve higher or lower confidence. If a tropical cyclone is approaching land and models are in a high-confidence regime, then emergency managers can better focus their evacuations and storm preparations because of the greater reliability of the landfall prediction. Second, a handful of statistical, dynamical, and "hybrid" (a mixture of the two) models have recently been developed, but no individual model consistently excels (DeMaria and Gross 2003). If forecast solutions diverge, knowledge of which model is reliable in a given situation can help NHC forecasters decide which model to favor and, consequently, produce better-verifying forecasts. Finally, if forecasts of forecast errors reveal that certain environmental conditions or dynamical instabilities in the current flow pattern consistently lead to more or less accurate forecasts, then further investigation into these regimes is worthwhile. Modelers can focus their efforts on improving a model in these less reliable situations and explore the dynamical mechanisms that cause low-confidence regimes.

## Acknowledgments

K. Bhatia and D. Nolan were supported by the NOAA Office of Weather and Air Quality (OWAQ) through its funding of the OSSE Testbed at the Atlantic Oceanographic and Meteorological Laboratory, and by the NOAA Unmanned Aerial Systems (UAS) program. We thank Mark DeMaria and John Kaplan for their informative explanations and useful suggestions.

## REFERENCES

Bender, M. A., I. Ginis, R. Tuleya, B. Thomas, and T. Marchok, 2007: The operational GFDL coupled hurricane–ocean prediction system and a summary of its performance. *Mon. Wea. Rev.*, **135**, 3965–3989.

Bister, M., and K. A. Emanuel, 1998: Dissipative heating and hurricane intensity. *Meteor. Atmos. Phys.*, **65**, 233–240.

Cangialosi, J. P., and J. L. Franklin, 2011: 2010 National Hurricane Center forecast verification report. National Hurricane Center, 77 pp. [Available online at http://www.nhc.noaa.gov/verification/pdfs/Verification_2010.pdf.]

DeMaria, M., 2009: A simplified dynamical system for tropical cyclone intensity prediction. *Mon. Wea. Rev.*, **137**, 68–82.

DeMaria, M., and J. Kaplan, 1994a: A Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin. *Wea. Forecasting*, **9**, 209–220.

DeMaria, M., and J. Kaplan, 1994b: Sea surface temperature and the maximum intensity of Atlantic tropical cyclones. *J. Climate*, **7**, 1324–1334.

DeMaria, M., and J. Kaplan, 1999: An updated Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic and eastern North Pacific basins. *Wea. Forecasting*, **14**, 326–337.

DeMaria, M., and J. M. Gross, 2003: Evolution of prediction models. *Hurricane! Coping with Disaster: Progress and Challenges since Galveston, 1900*, R. Simpson, Ed., Amer. Geophys. Union, 103–126.

DeMaria, M., M. Mainelli, L. K. Shay, J. A. Knaff, and J. Kaplan, 2005: Further improvement to the Statistical Hurricane Intensity Prediction Scheme (SHIPS). *Wea. Forecasting*, **20**, 531–543.

DeMaria, M., J. A. Knaff, and J. Kaplan, 2006: On the decay of tropical cyclone winds crossing narrow landmasses. *J. Appl. Meteor. Climatol.*, **45**, 491–499.

DeMaria, M., J. A. Knaff, and C. R. Sampson, 2007: Evaluation of long-term trends in operational tropical cyclone intensity forecasts. *Meteor. Atmos. Phys.*, **97**, 19–28.

Ehrendorfer, M., 1997: Predicting the uncertainty of numerical weather forecasts: A review. *Meteor. Z.*, **6**, 147–183.

Kalnay, E., and A. Dalcher, 1987: Forecasting forecast skill. *Mon. Wea. Rev.*, **115**, 349–356.

Katz, R. W., and A. H. Murphy, Eds., 1997: *Economic Value of Weather and Climate Forecasts*. Cambridge University Press, 222 pp.

Knaff, J. A., M. DeMaria, C. R. Sampson, and J. M. Gross, 2003: Statistical, 5-day tropical cyclone intensity forecasts derived from climatology and persistence. *Wea. Forecasting*, **18**, 80–92.

Knaff, J. A., M. DeMaria, and J. Kaplan, 2007: Improved statistical intensity forecast models. Joint (NOAA, Navy, and NASA) Hurricane Testbed Final Rep., 9 pp. [Available online at http://www.nhc.noaa.gov/jht/05-07reports/final_Knaffetal_JHT07.pdf.]

Landsea, C., and J. Franklin, 2013: How "good" are the best tracks? Estimating uncertainty in the Atlantic hurricane database. *Mon. Wea. Rev.*, in press.

McNoldy, B., K. D. Musgrave, and M. DeMaria, 2012: Diagnostics and verification of the tropical cyclone environment in regional models. Preprints, *30th Conf. on Hurricanes and Tropical Meteorology*, Jacksonville, FL, Amer. Meteor. Soc., 15A.3. [Available online at https://ams.confex.com/ams/30Hurricane/webprogram/Paper204581.html.]

Molteni, F., and T. N. Palmer, 1991: A real-time scheme for the prediction of forecast skill. *Mon. Wea. Rev.*, **119**, 1088–1097.

National Hurricane Center, cited 2012a: NHC track and intensity models. [Available online at http://www.nhc.noaa.gov/modelsummary.shtml.]

National Hurricane Center, cited 2012b: NHC tropical cyclone graphical product descriptions. [Available online at http://www.nhc.noaa.gov/aboutnhcgraphics.shtml?#WINDTABLE.]

NWS, 2012: Tropical cyclone definitions. National Weather Service Instruction 10-604, 12 pp. [Available online at http://nws.noaa.gov/directives/sym/pd01006004curr.pdf.]

Palmer, T. N., and S. Tibaldi, 1988: On the prediction of forecast skill. *Mon. Wea. Rev.*, **116**, 2453–2480.

Pielke, R., Jr., and R. E. Carbone, 2002: Weather impacts, forecasts, and policy: An integrated perspective. *Bull. Amer. Meteor. Soc.*, **83**, 393–403.

Tennekes, H., A. P. M. Baede, and J. D. Opsteegh, 1988: Forecasting forecast skill. *ECMWF Workshop Proceedings, 16–18 May 1988: Predictability in the Medium and Extended Range*, ECMWF, 277–302.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. Academic Press, 627 pp.

Wilks, D. S., and T. M. Hamill, 1995: Potential economic value of ensemble-based surface weather forecasts. *Mon. Wea. Rev.*, **123**, 3565–3575.

Wobus, R. L., and E. Kalnay, 1995: Three years of operational prediction of forecast skill at NMC. *Mon. Wea. Rev.*, **123**, 2132–2148.

^{1} Throughout the paper, the OFCL forecast generated by NHC is occasionally referred to as a "model" forecast for convenience. We are aware that the OFCL forecast is created by an NHC employee who uses a synthesis of model guidance and their own expertise to produce a forecast.