## 1. Introduction

Tropical cyclone track prediction continues to be one of the most difficult problems facing forecasters. The various operational hurricane track forecast models commonly provide skillful predictions, but none performs well consistently, and they often give solutions that diverge widely from each other (Sheets 1990). A significant task, then, is to supply accurate estimates of the track forecast errors for each model in addition to the track forecasts themselves.

Attempts to predict the accuracy or skill of forecast models began with the stochastic dynamic modeling technique (Epstein 1969). Leith (1974) approximated this with a less-expensive Monte Carlo method that examined the sensitivity of each particular forecast to small perturbations in the initial conditions; ensembles of forecasts with similar solutions were shown to be relatively skillful. Both of these techniques, however, require large amounts of computer resources.

Several simpler and less-expensive methods for the prediction of forecast skill have been developed. Hoffman and Kalnay (1983) introduced lagged average forecasting, in which model runs initialized at different synoptic times and verifying at the same time are compared; as in the Monte Carlo method, sets of forecasts that are similar are likely to be skillful. Palmer and Tibaldi (1986) discovered that variables derived from the current synoptic situation could predict the skill of model runs, and Kistler et al. (1988) found that forecasts of cases in which the atmosphere is slowly changing tend to be more skillful than others in which the atmosphere changes more rapidly. Palmer and Tibaldi (1988) listed the methods for predicting forecast skill, noting the four important types of predictors: those derived from 1) the synoptic situation, 2) the amount of persistence in the atmosphere, 3) the consistency of forecasts initialized at different times and verifying at the same time, and 4) the recent past performance of the model.

These four types of variables are used for the prediction of model performance (POMP) of a barotropic hurricane track forecast model (VICBAR). Aberson and DeMaria (1994, hereafter AD) showed that during the 1989–93 Atlantic hurricane seasons VICBAR had skill that was competitive with the best prediction models available but that the forecasts were not consistently good. The version of VICBAR described in AD supplies the forecast sample for this study. For a detailed description of VICBAR model results, the reader is referred to that paper. A discussion of the predictors used in this study is given in section 2. Descriptions of the discriminant analyses that provide POMP are given in section 3, with results in section 4.

## 2. Predictors used for POMP

Linear multiple regression and discriminant analyses are used to examine the possibility of providing an operational estimate of the performance of each VICBAR track forecast. All potential predictors tested for POMP are summarized in Table 1. AD showed that a number of synoptic and climatological variables can be important in operationally assessing the performance of individual VICBAR forecasts. For example, VICBAR has smaller forecast errors for tropical cyclones that were initially of hurricane strength than for weaker storms, probably because weak storms tend to be steered by the flow averaged over a layer shallower than the 850–200-mb deep layer mean (DLM) used by VICBAR (Velden 1993). Forecasts made in early summer tend to be better than those made in late fall, since baroclinic processes reach farther into tropical regions during the latter period. Storms originating in the southern areas of the basin tend to have smaller errors than those farther north, because the atmosphere is more likely to be barotropic at lower latitudes. AD also noted a weak relationship between the size of VICBAR errors and the initial longitude of tropical cyclones, attributed to data coverage being more extensive over the American continent than over the Atlantic Ocean. In addition, storms whose current motion has a large northward or eastward component (those that are recurving or have recurved) tend to have larger forecast errors than those moving southward or westward. Therefore, initial intensity, Julian date, latitude, longitude, and the difference between the initial direction of storm motion and due west are included as potential predictors; the VICBAR-forecasted latitude, longitude, and direction of storm motion are also included. However, the regression does not choose the Julian date because it is highly correlated with some of the other potential predictors.

Here, $p_T$ and $p_B$ are the top and bottom of the DLM, respectively; $f$ is the Coriolis parameter; $R$ the ideal gas constant; and $u$ and $v$ the zonal and meridional wind components, respectively. Also, the beta and advection model (BAM; Marks 1992) is run for three different levels in the troposphere (shallow-layer, middle-layer, and deep-layer mean). The distance between the three BAM model forecasts at each forecast time may provide information on the future vertical shear in the vicinity of the tropical cyclone.
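One way to reduce the three BAM tracks to a single dispersion value is to average the pairwise great-circle distances between the forecast positions valid at the same time. The sketch below assumes that pairwise-mean definition and a spherical Earth of radius 6371 km. The text does not state how the distance is actually computed, and the function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bam_dispersion_km(positions):
    """Mean pairwise distance between forecast positions valid at the same time.

    `positions` is a list of (lat, lon) tuples, one per BAM layer (assumed layout).
    """
    pairs = list(combinations(positions, 2))
    return sum(great_circle_km(*p, *q) for p, q in pairs) / len(pairs)
```

A large value would indicate that the shallow-, middle-, and deep-layer steering flows disagree, which is the sense in which the predictor may reflect vertical shear.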

The deformation ($D$) of the flow is given by the combination of the shearing deformation ($H$) and stretching deformation ($T$), where $x$ and $y$ are the zonal and meridional distances, respectively.
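The standard kinematic definitions consistent with these symbols can be written as follows. They are the textbook forms and are assumed, rather than confirmed, to match the expressions used for this predictor:

$$
H = \frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}, \qquad
T = \frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}, \qquad
D = \sqrt{H^{2} + T^{2}}.
$$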

The predictors that describe the amount of persistence in the atmosphere are the initial and VICBAR-forecasted forward speed of the storm, the 12-h intensity change of the storm, and the 12-h change in the CSMV. Additionally, the distance between track forecasts from the two versions of the NHC90 model (McAdie 1991) initialized at the same time, one run from the 12-h-old global spectral model run and the other from the current run, can quantify the amount of persistence in the atmosphere. Cases in which any of these values, except the intensity change, are large may have large track forecast errors. Curiously, the correlation between the NHC90 difference predictor and forecast error is positive for the 0000 and 1200 UTC cases and negative for the 0600 and 1800 UTC cases; both correlations are statistically significant at the 90% level.

All of the above predictors are available for every synoptic time except the vertical shear, the temperature advection, and the NHC90 predictor, which are available only for 0000 and 1200 UTC cases. The previous values of these predictors are used for the 0600 and 1800 UTC forecasts.

The consistency between current and past VICBAR forecasts may provide an indication of future model performance. Figure 1 shows the correlation coefficients of time-lagged VICBAR forecasts. The correlations never surpass 0.4 and are generally highest around 24 h, possibly due to many rawinsonde sites in the Caribbean Basin reporting only every 24 h. The calculation of the statistical separation time between forecasts (AD) also confirms that correlations between successive forecasts are not very large. Despite these relatively small correlations, all possible combinations of the lagged-average predictors, and also those related to the recent performance of VICBAR, are tested by the linear regression analysis, though only some are included in the final analyses. If the necessary previous VICBAR run is not available, the value of the mean of the particular predictor is substituted. Table 1 shows the mean values of the lagged-average and recent past performance predictors.
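The mean-substitution step for unavailable lagged predictors can be sketched as below. This is a minimal illustration in which the mean is taken over whatever values are passed in; operationally the developmental-sample means listed in Table 1 would presumably be used instead. The function name is hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the available values."""
    available = [v for v in values if v is not None]
    if not available:
        raise ValueError("no available values to compute a mean from")
    mean = sum(available) / len(available)
    return [mean if v is None else v for v in values]
```

Substituting the mean leaves a case neutral with respect to that predictor in a linear scheme, since its deviation from the mean is zero.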

## 3. Methods for POMP

### a. Linear multiple regression analyses

Multiple linear regression analyses are performed on the 5-yr (1989–93) sample of VICBAR runs to predict the performance of individual forecasts. Because 0000 and 1200 UTC VICBAR forecasts have different error characteristics than those from 0600 and 1800 UTC, the regression analyses of the two sets are done separately. A stepwise regression technique includes those predictors that are statistically significant at the 90% level at any of the five forecast times (12, 24, 36, 48, and 72 h). The most important predictors in both samples are the initial speed of motion of the storm, the initial thermal advection in the vicinity of the tropical cyclone, the VICBAR-forecasted latitude of the storm, and the forecast horizontal wind shear in the vicinity of the storm (BAM dispersion predictor). Additional important predictors in the 0000 and 1200 UTC sample are the 12- and 24-h forecast errors from the most recent VICBAR forecast available, the initial horizontal wind shear in the vicinity of the storm, and the VICBAR-forecasted direction of motion of the storm. Additional important predictors in the 0600 and 1800 UTC samples are the initial and VICBAR-forecasted longitudes of the storm, the initial deformation of the DLM flow in the vicinity of the storm, and the VICBAR-forecasted speed of motion of the storm.

Neumann et al. (1977) note that the significance of the predictors can be spuriously inflated due to the large number of predictors tested. Despite this, the regression analyses generally explain less than one-third of the variance of the errors at the different forecast times, and thus few forecasts of errors deviate significantly from the mean. However, distinguishing cases in which the errors are above or below average still may be possible. Actual forecast errors and those predicted by the regression analyses are each stratified into three categories (good, average, and poor) with equal numbers of cases in each group. Classification tables and a skill score (appendix), shown in Table 2, assess the ability of the regression analyses to group the forecasts correctly. If the classification were random, 33% of the cases would be grouped accurately, and 22% would be predicted good when really poor, and vice versa. The regression analyses are able to classify 44% (0000 and 1200 UTC samples) and 46% (0600 and 1800 UTC samples) of the cases correctly, and only 9% (0000 and 1200 UTC samples) and 12% (0600 and 1800 UTC samples) of the cases are forecast to be good when they are in fact poor, and vice versa. All classification tables are significantly different from chance at the 95% level using the chi-square test (appendix), yet the skill scores are only modest. The regression analysis therefore performs better than a chance forecast.
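The stratification into equal-sized groups and the resulting classification table can be sketched as follows. With three equally likely groups, a random classification places 3 of the 9 table cells on the diagonal (1/3, or about 33% correct) and 2 of 9 in the good/poor corners (2/9, or about 22%). The function names and the integer labels (0 = good, 1 = average, 2 = poor) are assumptions made for illustration.

```python
def tercile_labels(values):
    """Rank values and split them into three equal-sized groups.

    Returns 0 (good, smallest errors), 1 (average), or 2 (poor) for each case.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = 3 * rank // n
    return labels


def classification_table(actual, predicted):
    """3x3 table of counts: rows are the actual group, columns the predicted group."""
    table = [[0] * 3 for _ in range(3)]
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table


def summarize(table):
    """Return (fraction classified correctly, fraction of good/poor reversals)."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(3))
    reversals = table[0][2] + table[2][0]
    return correct / total, reversals / total
```

Applying `tercile_labels` separately to the actual and the regression-predicted errors, and passing both label lists to `classification_table`, gives a table of the kind summarized in Table 2.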

Figure 2 shows the average actual forecast errors of the cases that are predicted to be good, average, and poor (0000 and 1200 UTC cases only). The regression scheme is unable to clearly distinguish the good, average, and poor cases. A one-sided test of the hypothesis that the mean of the errors in the predicted-good group is smaller than in the other two groups, and that the mean of the predicted-average group is lower than that of the predicted-poor group, is performed for these samples. The 0000 UTC forecasts are tested separately from the 1200 UTC forecasts, as are the 0600 and 1800 UTC forecasts, to reduce the effect of serial correlation in the dataset. None of the differences between the means of the groups is statistically significant at any forecast time. Although the classification tables are significantly better than chance and show modest skill, the regression technique is therefore unsuitable for POMP.

### b. Linear discriminant analyses

Unlike the linear regression analyses, linear discriminant analyses (Lachenbruch 1975) are specifically designed to classify cases into different groups (Allen and Le Marshall 1994). The discriminant analysis classifies each forecast as good, average, or poor by maximizing the differences between the three groups based upon values of the predictors. The predictors that the multiple linear regression analysis chose as most effective (Table 1) are used in the linear discriminant analysis.
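A minimal sketch of a three-group linear discriminant classifier is given below. It assumes the common Gaussian formulation with a pooled within-group covariance matrix and priors taken from the group frequencies, which may differ in detail from the formulation of Lachenbruch (1975) used here. It relies on NumPy, and the function names are hypothetical.

```python
import numpy as np


def fit_lda(X, y):
    """Estimate group means, the pooled within-group covariance, and priors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        pooled += centered.T @ centered
    pooled /= len(X) - len(classes)
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, pooled, priors


def predict_lda(model, X):
    """Assign each case to the group with the largest linear discriminant score."""
    classes, means, pooled, priors = model
    X = np.asarray(X, dtype=float)
    # Solve pooled @ W = means.T rather than inverting the covariance explicitly.
    W = np.linalg.solve(pooled, means.T)  # shape (n_predictors, n_groups)
    # score_k(x) = x . W_k - 0.5 * mu_k . W_k + log(prior_k)
    scores = X @ W - 0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

In this setting each row of `X` would hold the selected predictors for one forecast, and `y` the good, average, or poor group of its actual error.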

Table 3 shows the discriminant analysis classification tables, and Fig. 3 shows the average actual forecast errors for the cases predicted to be in the three groups (0000 and 1200 UTC samples only). For both samples, more than half of the cases are correctly classified, and only about one-tenth of the cases are classified as good or poor when they are actually poor or good, respectively. The skill is much higher than that of the multiple regression analyses in all cases, and the chi-square test shows that all classification tables are significantly different from chance at the 95% level. The differences between the means of the predicted-poor and predicted-average groups in Fig. 3 are statistically significant at all times. The differences between the means of the predicted-good and predicted-average groups are significant only at 36 and 72 h in the 0000 and 1200 UTC samples. This scheme therefore seems best suited to distinguishing cases that are likely to be poorly forecast from others. Linear discriminant analyses seem better able to provide accurate POMP than regression analyses.

Results from the dependent sample would not be attainable in an operational setting. Cases from the 1994 hurricane season are therefore used as an independent dataset to test how well this scheme might perform operationally. Table 4 shows classification tables for all forecast times for this independent dataset. These discriminant analyses correctly classify 40%–42% of both samples. Classification as good when poor, and vice versa, occurs 8% and 18% of the time in the 0000 and 1200 UTC and the 0600 and 1800 UTC samples, respectively. The early forecast times of both samples have no skill; the skill increases to 15%–17% in the 0000 and 1200 UTC sample by 48 and 72 h but is lower in the 0600 and 1800 UTC sample. These skill values are comparable to those from the dependent sample in the linear multiple regression. The classification tables for these later times are significantly different from chance at the 95% level. Significance tests also show that the differences between the means of the predicted-average and predicted-poor groups at all times except 72 h, and between the predicted-good and predicted-poor groups at all times, are statistically significant in the 0000 and 1200 UTC samples (Fig. 4). Similarly, the differences between the means of the predicted-average and predicted-poor groups at 36, 48, and 72 h, and between the predicted-good and predicted-poor groups at all forecast times, are statistically significant in the 0600 and 1800 UTC samples. Therefore, the scheme is best able to distinguish the worst cases from the others in this relatively small independent sample. Since forecasters have previously had no method to accurately discern poor model forecasts from others, this tool can be especially helpful in operational forecasting.

Since the value of forecasts that are inconsistent from one forecast time to the next may be uncertain, all forecasts from the independent sample reaching 72 h are examined for consistency. Only about 7% of the predictions varied from good to poor, or vice versa, between forecast times. Therefore, the results of the discriminant analyses are usually consistent through the forecasts and may present an accurate prognosis of the ability of each VICBAR forecast to predict the track of tropical cyclones.
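A consistency check of this kind can be sketched as below. The sketch assumes that "between forecast times" means consecutive forecast times (for example 12 to 24 h) and that each case is stored as a list of predicted classes ordered by forecast time. The function name and the string labels are hypothetical.

```python
def fraction_inconsistent(cases):
    """Fraction of cases whose predicted class jumps between good and poor.

    `cases` is a list of label sequences, one per forecast, ordered by forecast time.
    """
    def flips(labels):
        return any({a, b} == {"good", "poor"} for a, b in zip(labels, labels[1:]))

    return sum(flips(labels) for labels in cases) / len(cases)
```

A change from good to average or from average to poor is not counted, which matches the text's focus on the extreme reversals.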

## 4. Summary

A linear discriminant analysis scheme using predictors that are important in forecasting VICBAR model behavior has been developed to assess the performance of individual VICBAR forecasts operationally. This technique uses predictors based upon the synoptic situation, the persistence of the atmosphere, the consistency of model runs initialized at different times but verifying at the same time, and the previous performance of the model. These analyses are effective in predicting the model’s performance, and such forecasts can be provided with the actual prediction to enable the hurricane forecaster to better utilize the forecasts. Discriminant analyses are being developed for the model output of the other operational hurricane track forecast models in the Atlantic basin.

The author wishes to thank Dr. Lloyd Shapiro for his valuable suggestions during the creation of the POMP scheme and Dr. Robert Burpee for his encouragement during the work. Additional reviews of the manuscript by James Franklin and three anonymous reviewers have helped to make the manuscript clearer and more accessible.

## REFERENCES

Aberson, S. D., and M. DeMaria, 1994: Verification of a nested barotropic hurricane track forecast model (VICBAR). *Mon. Wea. Rev.,* **122,** 2804–2815.

Allen, G., and J. F. Le Marshall, 1994: An evaluation of neural networks and discriminant analysis methods for application in operational rain forecasting. *Aust. Meteor. Mag.,* **43,** 17–28.

Epstein, E. S., 1969: Stochastic dynamic prediction. *Tellus,* **21,** 739–759.

Hoffman, R. N., and E. Kalnay, 1983: Lagged average forecasting, an alternative to Monte Carlo forecasting. *Tellus,* **35A,** 100–118.

Kistler, R. E., E. Kalnay, and M. S. Tracton, 1988: Forecast agreement, persistence, and forecast skill. Preprints, *Eighth Conf. on Numerical Weather Prediction,* Baltimore, MD, Amer. Meteor. Soc., 641–646.

Lachenbruch, P. A., 1975: *Discriminant Analysis.* Hafner Press, 128 pp.

Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. *Mon. Wea. Rev.,* **102,** 409–418.

Marks, D. G., 1992: The beta and advection model for hurricane track forecasting. NOAA Tech. Memo. NWS NMC 70, 89 pp. [Available from National Center for Environmental Prediction, 5200 Auth Rd., Camp Springs, MD 20746-4304.]

McAdie, C. J., 1991: A comparison of tropical cyclone track forecasts produced by NHC90 and an alternate version (NHC90A) during the 1990 hurricane season. Preprints, *19th Conf. on Hurricanes and Tropical Meteorology,* Miami, FL, Amer. Meteor. Soc., 290–294.

Neumann, C. J., M. B. Lawrence, and E. L. Caso, 1977: Monte Carlo significance testing as applied to statistical tropical cyclone prediction models. *J. Appl. Meteor.,* **16,** 1165–1174.

Palmer, T. N., and S. Tibaldi, 1986: Forecast skill and predictability. ECMWF Tech. Memo. 139. [Available from ECMWF, Shinfield Park, Reading, RG2 9AX, United Kingdom.]

Palmer, T. N., and S. Tibaldi, 1988: On the prediction of forecast skill. *Mon. Wea. Rev.,* **116,** 2453–2480.

Panofsky, H. A., and G. W. Brier, 1958: *Some Applications of Statistics to Meteorology.* The Pennsylvania State University, 224 pp.

Sheets, R. C., 1990: The National Hurricane Center—Past, present, and future. *Wea. Forecasting,* **5,** 185–232.

Velden, C., 1993: The relationship between tropical cyclone motion, intensity, and the vertical extent of the environmental steering layer in the Atlantic basin. Extended Abstracts, *20th Conf. Hurricanes and Tropical Meteorology,* San Antonio, TX, Amer. Meteor. Soc., 31–34.

## APPENDIX

### Skill Scores and Statistical Independence Tests

A skill score,

$$S = \frac{C - E}{T - E},$$

where $C$ is the number of correct forecasts, $T$ is the total number of forecasts, and $E$ is the number of forecasts expected to be correct based on chance, can be calculated for each classification table (Panofsky and Brier 1958). The skill score will be unity if all cases are correctly predicted, zero if the number of correct forecasts equals that expected by chance, and negative if it is smaller.

In the chi-square test, $H$ stands for the numbers in a chance classification table and $O$ represents the observation classification table. The number of degrees of freedom is ($m$ − 1), where $m$ is the number of rows or columns in the classification table.
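A chi-square statistic comparing an observed classification table with its chance table can be sketched as follows. The sketch assumes the usual Pearson form, the sum of $(O - H)^2 / H$ over all cells, and builds the chance table $H$ from the row and column totals of $O$; neither choice is confirmed by the text. The function name is hypothetical.

```python
def chi_square_statistic(observed):
    """Pearson chi-square of an observed square table against its chance table."""
    m = len(observed)
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(observed[i][j] for i in range(m)) for j in range(m)]
    chi2 = 0.0
    for i in range(m):
        for j in range(m):
            # H: count expected in this cell if classification were by chance.
            chance = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - chance) ** 2 / chance
    return chi2
```

The result would then be compared with the chi-square critical value for the stated degrees of freedom and significance level. The function assumes that no row or column of the table is empty.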

Table 1. Predictors tested by the multiple linear regression and linear discriminant analyses. Predictors positively correlated with forecast error are marked with a plus sign, those negatively correlated with a minus sign, and those not chosen by the regression with an “N.” Predictors marked with an asterisk are calculated 0–500 km from the current storm center. Those marked with a double asterisk are calculated 500–1000 km from the current storm center. Mean values (in km) of the consistency and past performance predictors are given. The left- and right-hand columns signify use in the 0000 and 1200 UTC and the 0600 and 1800 UTC samples, respectively.

Table 2. Classification tables for VICBAR forecast errors for both the 0000 and 1200 UTC samples (left) and the 0600 and 1800 UTC samples (right) for the linear multiple regression analysis. Columns signify the predicted group; rows the actual group. Here, N is the number of cases, and S is the skill score for each table.

Table 3. Classification tables for VICBAR forecast errors for both the 0000 and 1200 UTC samples (left) and the 0600 and 1800 UTC samples (right) for the linear discriminant analysis. Columns signify the predicted group; rows the actual group. Here, N is the number of cases, and S is the skill score for each table.

Table 4. Classification tables for VICBAR forecast errors for both the 0000 and 1200 UTC samples (left) and the 0600 and 1800 UTC samples (right) for the independent sample for the linear discriminant analysis. Columns signify the predicted group; rows the actual group. Here, N is the number of cases, and S is the skill score for each table.