The International Research Institute for Climate and Society (IRI) has been issuing experimental seasonal tropical cyclone activity forecasts for several ocean basins since early 2003. In this paper the method used to obtain these forecasts is described and the forecast performance is evaluated. The forecasts are based on tropical cyclone–like features detected and tracked in a low-resolution climate model, namely ECHAM4.5. The simulation skill of the model using historical observed sea surface temperatures (SSTs) over several decades, as well as with SST anomalies persisted from the previous month’s observations, is discussed. These simulation skills are compared with skills of purely statistically based hindcasts using as predictors recently observed SSTs. For the recent 6-yr period during which real-time forecasts have been made, the skill of the raw model output is compared with that of the subjectively modified probabilistic forecasts actually issued.
Despite variations from one basin to another, the levels of hindcast skill for the dynamical and statistical forecast approaches are found, overall, to be approximately equivalent at fairly modest but statistically significant levels. The dynamical forecasts require statistical postprocessing (calibration) to be competitive with, and in some circumstances superior to, the statistical models. Skill levels decrease only slowly with increasing lead time up to 2–3 months. During the recent period of real-time forecasts, the issued forecasts have had higher probabilistic skill than the raw model output, due to the forecasters’ subjective elimination of the “overconfidence” bias in the model’s forecasts. Prospects for the future improvement of dynamical tropical cyclone prediction are considered.
Tropical cyclones (TCs; see the appendix for a list of the acronyms used in this paper) are one of the most devastating types of natural disasters. Seasonal forecasts of TC activity could help coastal populations prepare for an upcoming TC season and reduce economic and human losses.
Currently, many institutions issue operational seasonal TC forecasts for various regions. In most cases, these are statistical forecasts, such as the Atlantic hurricane outlooks produced by NOAA (information online at http://www.cpc.noaa.gov/products/outlooks/hurricane.shtml), and Colorado State University (Gray et al. 1993; Klotzbach 2007a), the typhoon activity forecasts of the City University of Hong Kong (Chan et al. 1998, 2001), and Tropical Storm Risk (Saunders and Lea 2004). A review of TC seasonal forecasts is found in Camargo et al. (2007a), and the skill levels of some of them were discussed in Owens and Landsea (2003).
Since April 2003 the International Research Institute for Climate and Society (IRI) has been issuing experimental dynamical seasonal forecasts for five ocean basins (information online at http://portal.iri.columbia.edu/forecasts). In this paper, we describe how these forecasts are produced and discuss their skills when the atmospheric general circulation model (AGCM) is forced by predicted sea surface temperature (SST) in a two-tiered prediction system.
The possible use of dynamical climate models for forecasting seasonal TC activity has been explored by various authors (e.g., Bengtsson et al. 1982). Although the low horizontal resolution of climate general circulation models of the early 2000s is not adequate to realistically reproduce the structure and behavior of individual cyclones, such models are capable of forecasting with some skill several aspects of the general level of TC activity over the course of a season (Bengtsson 2001; Camargo et al. 2005). Dynamical TC forecasts can serve specific applications, for example, TC landfall activity over Mozambique (Vitart et al. 2003). The level of performance of dynamical TC forecasts depends on many factors, including the model used (Camargo et al. 2005), the model resolution (Bengtsson et al. 1995), and the inherent predictability of the large-scale circulation regimes (Vitart and Anderson 2001), including those related to El Niño–Southern Oscillation (ENSO) (Wu and Lau 1992; Vitart et al. 1999).
In addition to IRI’s dynamically based experimental TC forecasts, such forecasts are also produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) (Vitart 2006), the Met Office, and the European Seasonal to Interannual Prediction (EUROSIP) superensemble of ECMWF, Met Office, and Météo-France coupled models (Vitart et al. 2007). An important consideration is the dynamical design used to produce the forecasts. The European dynamical TC forecasts are produced using fully coupled atmosphere–ocean models (Vitart and Stockdale 2001; Vitart 2006). At IRI, a two-tiered (Bengtsson et al. 1993), multimodel (Rajagopalan et al. 2002; Robertson et al. 2004) procedure is used to produce temperature and precipitation forecasts once an SST forecast (or set of them) is first established (Mason et al. 1999; Goddard et al. 2003; Barnston et al. 2003, 2005). The IRI experimental TC forecasts use a subset of the IRI two-tier forecast system, in that only a single AGCM is used, compared with several AGCMs for surface climate. As described below, more than one SST forcing scenario is used.
TCs in low-resolution models have many characteristics comparable to those observed, but at much lower intensity and larger spatial scale (Bengtsson et al. 1995; Vitart et al. 1997). The climatology, structure, and interannual variability of model TCs have been examined (Bengtsson et al. 1982, 1995; Vitart et al. 1997; Camargo and Sobel 2004). A successful aspect of this work has been that, over the course of a TC season in a statistical sense, the spatial and temporal distributions, as well as the interannual anomalies of the number and total energy content, of model TCs roughly follow those of observed TCs (Vitart et al. 1997; Camargo et al. 2005). There have been two general methods in which climate models are used to forecast TC activity. One method is to analyze large-scale variables known to affect TC activity (Ryan et al. 1992; Thorncroft and Pytharoulis 2001; Camargo et al. 2007c). Another approach, and the one used here, is to detect and track the cyclonelike structures in climate models (Manabe et al. 1970; Broccoli and Manabe 1990; Wu and Lau 1992), coupled ocean–atmosphere models (Matsuura et al. 2003; Vitart and Stockdale 2001), and regional climate models (Landman et al. 2005; Knutson et al. 2007). These methods have also been used in studies of possible changes in TC intensity due to global climate change using AGCMs (Bengtsson et al. 1996; Royer et al. 1998; Bengtsson et al. 2007a,b) and regional climate models (Walsh and Ryan 2000; Walsh et al. 2004).
In section 2 we describe how the real-time seasonal TC forecasts are produced at IRI. The model’s performance over a multidecadal hindcast period and over the recent 6-yr period of real-time forecasting is discussed in section 3. A comparison of the AGCM performance result with that of simple SST-based statistical forecasts is shown in section 4. The conclusions are given in section 5.
2. Description of the real-time forecasts
The IRI climate forecast system (Mason et al. 1999) is two-tiered: SSTs are first forecasted, and then each of a set of atmospheric models is forced with several tropical SST forecast scenarios. Many ensemble members of atmospheric response are produced from each model forced with the SST scenarios. For the TC seasonal forecasts, just one atmospheric model is used: ECHAM4.5, which is run on a monthly basis. Six-hourly output data are used, as this fine temporal resolution makes possible the detection of the needed TC characteristics. The ECHAM4.5 was developed at the Max Planck Institute for Meteorology in Hamburg, Germany (Roeckner et al. 1996), and has been studied extensively for various aspects of seasonal TC activity (Camargo and Zebiak 2002; Camargo et al. 2005, 2007c).
The integrations of the ECHAM4.5 model are subject to differing tropical SST forcing scenarios (Table 1). In all of the scenarios, the extratropical SST forecasts consist simply of the damped persistence of the anomalies from the previous month’s observation (added to the forecast season’s climatology), with an anomaly e-folding time of 3 months (Mason et al. 1999). In the tropics, multimodel, mainly dynamical, SST forecasts are used for the Pacific, while statistical and dynamical forecasts are combined for the Indian and Atlantic Oceans. Statistical forecasts play the greatest role in the tropical Atlantic. The models contributing to the tropical SST forecasts, particularly for the Pacific, have changed during our study period as forecast-producing centers have introduced newer, more advanced prediction systems. In the non-Pacific tropical basins during seasons having near-zero apparent SST forecast predictive skill, damped persisted SST anomalies are used, but at a lower damping rate than that used in the extratropics. (No damping occurs in the first 3 months, followed by linear damping that reaches zero by month 8.) However, for seasons in which SST predictive skill is found beyond that of damped persistence, CCA models are used in the Indian Ocean (Mason et al. 1999) and the tropical Atlantic Ocean (Repelli and Nobre 2004).
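The two damping schemes described above can be illustrated with a minimal Python sketch. It is not taken from the IRI system: the function names, the convention that the lead is counted in months after the last observed month, and the exact shape of the linear ramp are assumptions made for illustration only.

```python
import math


def exponential_damping_weight(lead_months, e_folding_months=3.0):
    """Fraction of the initial anomaly retained after `lead_months` months."""
    return math.exp(-lead_months / e_folding_months)


def linear_damping_weight(lead_months, hold_months=3, zero_month=8):
    """No damping through `hold_months`, then a linear ramp to zero at `zero_month`."""
    if lead_months <= hold_months:
        return 1.0
    if lead_months >= zero_month:
        return 0.0
    return (zero_month - lead_months) / (zero_month - hold_months)


def damped_persistence_sst(climatology, initial_anomaly, weight):
    """Forecast SST = climatology of the target month + damped anomaly."""
    return climatology + weight * initial_anomaly


# Hypothetical example: a +1.0 degC anomaly observed last month and a
# target-month climatology of 27.0 degC (both values are made up).
for lead in range(1, 9):
    extratropics = damped_persistence_sst(27.0, 1.0, exponential_damping_weight(lead))
    tropics = damped_persistence_sst(27.0, 1.0, linear_damping_weight(lead))
    print(lead, round(extratropics, 2), round(tropics, 2))
```

Under these assumptions the exponentially damped anomaly falls to about 37% of its initial value after 3 months, whereas the linearly damped anomaly is still at full strength at that lead and vanishes only at month 8.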
Globally undamped anomalous SST persisted from the previous month, applied to the climatology of the months being forecast, is used as an additional SST forcing scenario (called FSSTp). In this case the 24 ensemble members of ECHAM4.5 are integrated using persisted SST anomalies out to 5 months beyond the previous month. For example, for a mid-January forecast, the model is forced from January to May using undamped persisted SST anomalies from December globally.1
In the case of the nonpersisted, evolving forecasted SST anomalies (denoted by FSSTe), the AGCM is run out to 7 months beyond the previous month’s observed SST (e.g., for a mid-January forecast, observed SST exists for December, and the model is forced from January to July with evolving SST predictions). Several versions of the forecasted SST anomalies have been used since 2001. These are described in detail in Camargo and Barnston (2008).2
The ECHAM4.5 was also forced with the actual observed SSTs (OSSTs; Reynolds et al. 2002) prescribed during the period from 1950 to the present. These AMIP-type runs provide estimates of the upper limit of the skill of the model in forecasting TC activity, as discussed in previous studies (Camargo and Zebiak 2002; Camargo et al. 2005). The skill levels presented below are broken out into three SST forcing types: 1) FSST (for real-time forecasts, comprising FSSTp and FSSTe), 2) HSSTp (long-term hindcast anomaly persisted SST), and 3) OSST (long-term observed SST for AMIP-type AGCM simulations).
For any type of SST forcing, we analyze the output of the AGCM for TC activity. To define and track TCs in the models, we used objective algorithms (Camargo and Zebiak 2002) based in large part on prior studies (Vitart et al. 1997; Bengtsson et al. 1995). The algorithm has two parts: detection and tracking. In the detection part, storms that meet environmental and duration criteria are identified. A model TC is identified when chosen dynamical and thermodynamical variables exceed thresholds calibrated to the observed tropical storm climatology.3 Most studies (Bengtsson et al. 1982; Vitart et al. 1997) use a single set of threshold criteria globally. However, to take into account model biases and deficiencies, we use basin- and model-dependent threshold criteria, based on analyses of the correspondence between the modeled and observed climatologies (Camargo and Zebiak 2002). Thus, we use a threshold exclusive to ECHAM4.5. Once detected, the TC tracks are obtained from the vorticity centroid, defining the center of the TC, using relaxed criteria appropriate for the weak model storms. The detection and tracking algorithms have been applied to regional climate models (Landman et al. 2005; Camargo et al. 2007b) and to multiple AGCMs (Camargo and Zebiak 2002; Camargo et al. 2005).
Following detection and tracking, we count the number of TCs (NTC) and compute the model accumulated cyclone energy (ACE) index (Bell et al. 2000) over a TC season. ACE is defined as the sum of the squares of the wind speeds in the TCs active in the model at each 6-h interval. For the observed ACE, only TCs of tropical storm intensity or greater are included.
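The NTC and ACE definitions above can be sketched in a few lines of Python. The data layout (one list of 6-hourly wind speeds, in knots, per tracked storm), the function names, and the sample values are hypothetical. The optional 35-kt threshold stands in for "tropical storm intensity" in the observed ACE, and applying it record by record is an assumption of this sketch rather than a detail stated in the text.

```python
def count_tcs(tracks):
    """Number of tropical cyclones (NTC): one entry per tracked storm."""
    return len(tracks)


def accumulated_cyclone_energy(tracks, min_wind_kt=None):
    """Sum of squared wind speeds (kt^2) over all 6-hourly records.

    tracks: list of storms, each a list of wind speeds in knots sampled
            every 6 h.
    min_wind_kt: optional threshold; records below it are skipped
                 (assumed here to mimic the observed-ACE intensity filter).
    """
    total = 0.0
    for winds in tracks:
        for wind in winds:
            if min_wind_kt is None or wind >= min_wind_kt:
                total += wind * wind
    return total


# Two made-up storms: one with three 6-hourly records, one with a single record.
tracks = [[30.0, 40.0, 50.0], [35.0]]
print(count_tcs(tracks))                                     # 2
print(accumulated_cyclone_energy(tracks))                    # 6225.0
print(accumulated_cyclone_energy(tracks, min_wind_kt=35.0))  # 5325.0
```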
The model ACE and NTC results are then corrected for bias, based on the historical model and observed distributions of NTC and ACE over the 1971–2000 period, on a per basin basis. Corrections yield matching values within a percentile reference framework (i.e., a correspondence is achieved nonparametrically). Using 1971–2000 as the climatological base period, tercile boundaries for model and observed NTCs and ACEs are then defined, since the forecasts are probabilistic with respect to tercile-based categories of the climatology (below, near, and above normal).4
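A nonparametric, percentile-based correction of this kind can be sketched as empirical rank matching: a model value is assigned the observed value that occupies the same rank in its own historical distribution. The Python sketch below is one simple way to do this and is not necessarily the procedure used operationally; the function name, the treatment of ties, the clamping at the ends of the distribution, and the sample values are assumptions.

```python
from bisect import bisect_right


def percentile_match(model_value, model_history, observed_history):
    """Map a model value to the observed value of the same empirical rank."""
    model_sorted = sorted(model_history)
    observed_sorted = sorted(observed_history)
    # Number of historical model seasons at or below the value to correct.
    rank = bisect_right(model_sorted, model_value)
    # Rescale the rank to the observed sample size (integer ceiling), convert
    # to a zero-based index, and clamp to the range of the observations.
    n_model, n_obs = len(model_sorted), len(observed_sorted)
    index = (rank * n_obs + n_model - 1) // n_model - 1
    index = max(0, min(index, n_obs - 1))
    return observed_sorted[index]


# Made-up base-period seasonal totals for one basin.
model_history = [10, 20, 30, 40, 50]
observed_history = [2, 4, 6, 8, 10]
print(percentile_match(30, model_history, observed_history))  # 6
print(percentile_match(55, model_history, observed_history))  # 10
```

Values outside the historical model range are mapped to the lowest or highest observed value, which is a deliberate simplification of this sketch.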
For each of the SST forcing designs, we count the number of ensemble members having their NTCs and ACEs in a given ocean basin in the below normal, normal, and above normal categories, and divide by the total number of ensemble members. These constitute the “raw,” objective probability forecasts. In a final stage of forecast production, the IRI forecasters examine and discuss these objective forecasts and develop subjective final forecasts that are posted on the IRI Web site. The most typical difference between the raw and the subjective forecasts is that the latter have weaker probabilistic deviations from climatology, given the knowledge that the models are usually too “confident.” The overconfidence of the model may be associated with too narrow an ensemble spread, too strong a model signal (deviation of ensemble mean from climatology), or both of these. The subjective modification is intended to increase the probabilistic reliability of the predictions. The issues of model overconfidence, calibration to correct it, and probabilistic reliability will be discussed in more detail in section 3b. Another consideration in the subjective modification is the degree of agreement among the forecasts, in which less agreement would suggest greater uncertainty and thus more caution with respect to the amount of deviation from the climatological probabilities.
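The counting step that yields the raw probabilities can be sketched as follows. The function name, the handling of values that fall exactly on a tercile boundary (counted here as near normal), and the sample ensemble are assumptions; the tercile boundaries are taken as given inputs.

```python
def tercile_probabilities(members, lower_boundary, upper_boundary):
    """Fractions of ensemble members below, near, and above normal."""
    n_members = len(members)
    below = sum(1 for value in members if value < lower_boundary)
    above = sum(1 for value in members if value > upper_boundary)
    near = n_members - below - above
    return (below / n_members, near / n_members, above / n_members)


# Made-up bias-corrected NTC values from a 24-member ensemble, with
# hypothetical tercile boundaries of 4 and 6 storms.
members = [3] * 6 + [5] * 6 + [8] * 12
print(tercile_probabilities(members, 4, 6))  # (0.25, 0.25, 0.5)
```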
The raw objective forecasts are available starting from August 2001. The first subjective forecast for the western North Pacific basin was produced in real time in April 2003. However, subjective hindcasts were also produced for August 2001–April 2003 without knowledge of the observed result, making for 6 yr of experimental forecasts.
For each ocean basin, forecasts are produced only for the peak TC season, from certain initial months prior to that season (Table 2), and updated monthly until the first month of the peak season.5 The lead time of this latest forecast is defined as being zero, and the lead times of earlier forecasts are defined by the number of months earlier that they are issued.
The basins in which forecasts are issued are shown in Fig. 1, and the numbers of years available for each SST scenario and basin are indicated in Table 3. In the Southern Hemisphere (South Pacific and Australian regions), only forecasts for NTC are produced, while in the Northern Hemisphere basins both NTC and ACE forecasts are issued. ACE is omitted for the Southern Hemisphere because ACE is more sensitive to data quality than NTC, and the observed TC data from the Southern Hemisphere are known to be of somewhat questionable quality, particularly in the earlier half of the study period (e.g., Chu et al. 2002; Buckley et al. 2003; Landsea et al. 2006; Trewin 2008; Harper et al. 2008).
The observed TC data used to correct historical model biases and for verification of the model forecasts are the best-track data from the National Hurricane Center (Atlantic and eastern North Pacific; information online at http://www.nhc.noaa.gov) and the Joint Typhoon Warning Center (western North Pacific and Southern Hemisphere; information online at https://metocph.nmci.navy.mil/jtwc.php).
3. Performance in hindcasts and real-time forecasts
NTC or ACE historical simulation and real-time predictive skill results are computed for each ocean basin for their respective peak TC seasons. Both deterministic and probabilistic skills are examined.
a. Deterministic skills
Temporal anomaly correlation skills are shown in Table 4 for NTC by lead time, for each type of SST forcing, and likewise for ACE in Table 5. The simulation skills are shown both for the full period of 1950–2005 and for 1970–2005, during which time the TC data are known to be of higher quality. The correlations for the real-time predictions are uncentered.6 Simulation skills (OSST) are seen to be at statistically significant levels for most of the ocean basins. Skills for the more recent period (OSSTr) tend to exceed those for the full 1950–2005 period, due both to better average data quality and the greater ENSO variability following 1970. Consistent with Camargo et al. (2005), the highest skill results occur in the Atlantic basin with correlations of roughly 0.50, with more modest skill levels in the other basins. Skill levels for zero-lead forecasts using SST anomalies persisted from those of the most recent month (HSSTp, lead 0), as expected, are usually lower than those of observed simultaneous SSTs. For the three Northern Hemisphere basins, simulation skills are higher for ACE than for NTC, as noted also in Camargo et al. (2005). This may be related to the continuous nature of ACE as opposed to the discrete, more nonparametric, character of NTC.
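An uncentered correlation is commonly computed from anomalies taken relative to a fixed long-term climatology, without subtracting the sample means of the short verification period. The sketch below assumes that standard definition; it is offered only as an illustration, and the function name and sample values are hypothetical.

```python
import math


def uncentered_correlation(forecast_anomalies, observed_anomalies):
    """Correlation of anomalies without removing their sample means."""
    cross = sum(f * o for f, o in zip(forecast_anomalies, observed_anomalies))
    forecast_power = sum(f * f for f in forecast_anomalies)
    observed_power = sum(o * o for o in observed_anomalies)
    if forecast_power == 0.0 or observed_power == 0.0:
        raise ValueError("anomalies must not all be zero")
    return cross / math.sqrt(forecast_power * observed_power)


# Made-up anomalies relative to a long-term climatology.
print(uncentered_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(uncentered_correlation([1.0, 1.0], [1.0, -1.0]))           # 0.0
```

Because the sample means are not removed, a forecast set that is biased in the same direction as the observations over a short period is credited for that agreement, which a centered correlation would ignore.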
A reference forecast more difficult to beat than a random or a climatology forecast is that of simple persistence of the observed TC activity from the previous year. The correlation score for such a reference forecast is just the 1-yr autocorrelation coefficient over the 1971–2005 base period, and is shown at the bottom of Tables 4 and 5. The persistence correlation scores are lower than those of the AGCM’s forecast using observed or persisted SST, with the one exception of the NTC forecasts in the northwestern Pacific.
Real-time predictive verification skill levels (FSST in Tables 4 and 5) over the basins not only have lower expected values than those using simultaneous observed SST due to the imperfection of the predicted SST forcing, but also much greater sampling errors given only six to seven cases per lead time per basin (Table 3). These skills range from near or below zero for the western North Pacific NTC to approximately 0.5 for the three shortest leads for the eastern North Pacific ACE. For all basins collectively and for NTC and ACE together, the skill results approximate those of HSSTp, with individual differences likely due foremost to sampling variability. Consistent with the small sample problem, the correlations for FSST for all of the basin–lead time combinations are statistically nonsignificant, as nearly 0.8 is required for significance.
A look at the possible impact of differing SST forcing scenarios and lead times on the real-time forecast skills is more meaningful when results for all ocean basins are combined, lessening the sampling problem. Basin-combined skill results by lead time and SST forcing type are shown in Table 6 for NTC and ACE. Table 6 shows NTC results for Northern Hemisphere basins only, allowing a direct comparison between NTCs and ACEs. The results show higher skill levels for forecasts of ACE than NTC, and only a very weak tendency for decreasing skill with increasing lead time. This is summarized further in the bottom row of Table 6, showing the results for NTC and ACE combined.
Skill levels were evaluated using additional deterministic verification measures: the Spearman rank correlation, the Heidke skill score, and the mean squared error skill score (MSESS). Table 7 provides an example of the four scores together, for ACE in the northwestern Pacific basin. The rank correlation and Heidke skill scores are roughly consistent with the correlation skill, allowing for expected scaling differences where the Heidke is roughly one-half of the correlation (Barnston 1992). The MSESS, however, which uses the 1971–2000 climatology as the zero-skill reference forecast, is comparatively unfavorable: some of the cases having positive correlation and Heidke skills have negative MSESS results. This outcome is attributable to a marked tendency of the model forecasts toward too great a departure from climatological forecasts, given the degree of inherent uncertainty and thus the relatively modest level of true predictability. Such “overconfidence” in the model forecasts, which can be adjusted for statistically, will be discussed in more detail below within the context of probabilistic verification, where a detrimental effect on scores comparable to that seen in MSESS will become apparent.
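The contrast between a positive correlation and a negative MSESS can be reproduced with a small, entirely synthetic example. The sketch below uses the usual definition MSESS = 1 − MSE(forecast)/MSE(climatology); the function names and the numbers are hypothetical and are not taken from Table 7.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def msess(forecasts, observations, climatology):
    """Mean squared error skill score with climatology as the reference."""
    n = len(observations)
    mse_forecast = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    mse_reference = sum((climatology - o) ** 2 for o in observations) / n
    return 1.0 - mse_forecast / mse_reference


# Synthetic anomalies (climatology = 0): forecasts with the right sign in
# half the cases, but with far too large an amplitude.
observed = [1.0, -1.0, 1.0, -1.0]
overconfident = [4.0, -4.0, -1.0, 1.0]
damped = [0.25 * f for f in overconfident]

print(round(correlation(overconfident, observed), 3))  # 0.514
print(msess(overconfident, observed, 0.0))             # -5.5
print(msess(damped, observed, 0.0))                    # 0.21875
```

Shrinking the forecast anomalies toward climatology leaves the correlation unchanged but turns the MSESS positive in this example, which is the sense in which overconfidence can be adjusted for statistically.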
b. Probabilistic skills
The TC forecasts were verified probabilistically using the ranked probability skill score (RPSS), likelihood skill score, and, for the real-time forecasts, the relative operating characteristics (ROC) score.
RPSS (Epstein 1969; Goddard et al. 2003) measures the sum of the squared errors between categorical forecast probabilities and the observed categorical probabilities, cumulative over categories, relative to a reference (or standard baseline) forecast—here, the climatology forecast of 1/3 probability for each category. The observed probabilities are 1 for the observed category and 0 for the other categories.
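As a minimal illustration of this definition, the following Python sketch computes the ranked probability score for three-category forecasts and the corresponding skill score against climatology. Aggregating over cases as one minus the ratio of summed scores is a common convention assumed here; the function names and the example probabilities are illustrative only.

```python
def ranked_probability_score(probabilities, observed_category):
    """Sum of squared differences between cumulative forecast and observed
    probabilities; categories are indexed 0 (below), 1 (near), 2 (above)."""
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    score = 0.0
    for category, probability in enumerate(probabilities):
        cumulative_forecast += probability
        cumulative_observed += 1.0 if category == observed_category else 0.0
        score += (cumulative_forecast - cumulative_observed) ** 2
    return score


def rpss(forecasts, observed_categories, reference=(1 / 3, 1 / 3, 1 / 3)):
    """Skill relative to a fixed reference forecast (default: climatology)."""
    forecast_total = sum(
        ranked_probability_score(p, o) for p, o in zip(forecasts, observed_categories)
    )
    reference_total = sum(
        ranked_probability_score(reference, o) for o in observed_categories
    )
    return 1.0 - forecast_total / reference_total


# Illustrative case: the same forecast issued twice, verifying once in the
# above normal category (2) and once in the below normal category (0).
sharp = (0.05, 0.10, 0.85)
cautious = (0.20, 0.30, 0.50)
print(round(rpss([sharp, sharp], [2, 0]), 3))        # -0.485
print(round(rpss([cautious, cautious], [2, 0]), 3))  # -0.062
```

In this toy case the sharp forecast scores far worse than the cautious one when it misses, which is the asymmetry that penalizes overconfident probabilities under RPSS.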
Verifications using the RPSS are shown for NTC and ACE in Tables 8 and 9. These skills are mainly near or below zero. This poor result can be attributed to the lack of probabilistic reliability of the model ensemble-based TC predictions, as is seen in many predictions made by individual AGCMs, not just for TC activity but for most climate variables (Anderson 1996; Barnston et al. 2003; Wilks 2006). Climate predictions by AGCMs have model-specific systematic biases, and their uncorrected probabilities tend to deviate too strongly from climatological probabilities due to too small an ensemble spread and/or too large a mean shift from climatology. This problem leads to comparatively poor probability forecasts, despite positive correlation skills for the ensemble means of the same forecast sets. Positive correlations, but negative probabilistic verification, are symptomatic of poorly calibrated probability forecasts, a condition that can be remedied using objective statistical correction procedures.
Probabilistic persistence may be a more competitive simple reference forecast than forecasts of climatological probabilities. Based on the weak but generally positive year-to-year autocorrelations shown in Tables 4 and 5, we designed the persistence probabilistic forecasts to be 0.4 for the tercile-based category observed the previous year, and 0.3 for the other two categories. Resulting RPSSs are shown at the bottom of Tables 8 and 9. These weakly persistent probabilistic forecasts often have better RPSS scores than those of the AGCM forced with persisted SSTs (HSSTp), and sometimes as good as or better than those forced with observed SSTs. Rather than showing that use of the AGCM with observed or predicted SSTs is unsuccessful, this outcome again shows that probabilities that deviate only mildly from climatological probabilities, even if derived from something as simple as the TC activity of the previous year, fare better under calibration-sensitive probabilistic verification measures (here, RPSS) than the higher-amplitude probability shifts from climatology typically produced by today’s AGCMs without proper statistical calibration.
The probability forecasts actually issued by IRI begin with the “raw” AGCM probabilities, modified to what the forecasters judge to have better probabilistic reliability. This nearly universally involves damping the amplitude of the model’s deviation from climatological probabilities. A typical adjustment might be to modify the model’s predicted probabilities of 5%, 10%, and 85% to 20%, 30%, and 50% for the below, near, and above normal categories, respectively. A less common adjustment is that of “rounding out” a bimodal probability forecast such as 35%, 5%, and 60% to a more Gaussian distribution such as 25%, 30%, and 45%.7 Part of the reason for sharply bimodal distributions is assumed to be the limited (24 member) ensemble size. A still less common case for modification, and one that does not always improve the forecast quality, is that of the forecasters’ judgment against the model forecasts, believing there is a model bias. Such doubt can pertain also to the SST forecast used to force the AGCM.
Tables 8 and 9 indicate that the actually issued forecasts have better probabilistic reliability than the forecasts of the model output. Likelihood skill scores (not shown), and especially RPSS, are mainly positive for the issued forecasts, although modest in magnitude. This implies that the probability forecasts of the AGCM are potentially useful, once they are calibrated to correct for overconfidence or an implausible distribution shape. Such calibration could be done objectively, based on the longer hindcast history, rather than subjectively by the forecasters as done to first order here.
Figure 2 shows the approximately 6-yr record of AGCM ensemble forecasts of NTC and ACE at all forecast lead times for each of the ocean basins. The vertical boxes show the interquartile range among the ensemble members, and the vertical dashed lines (“whiskers”) extend to the ensemble member forecasts outside of that range. The asterisk indicates the observation value. Favorable and unfavorable forecast outcomes can be identified, such as, respectively, the ACE forecasts for the western North Pacific for 2002 and the ACE forecasts for the North Atlantic for 2004.
Figure 3 shows the same forecasts, except probabilistically for each of the tercile-based categories, both for the AGCM’s forecasts (crisscross symbols) and for the subjectively modified publicly issued forecasts (circle symbols connected by lines). The AGCM’s probability forecasts often deviate by large amounts from climatology, while the issued forecasts remain closer to climatology. Figure 4 shows the RPSSs of these probability forecasts in the same format. The AGCM’s probability forecasts result in highly variable skill (including both strongly negative and positive cases), leading to a somewhat negative overall skill. The issued forecasts, while never reaching positive magnitudes as great as those of some of the AGCM forecasts, also avoid negative overall skill levels of more than small magnitude.8 Hence, the humanly modified TC forecasts have a higher average probabilistic skill level using RPSS.
The “overconfidence” of the AGCM forecasts is shown in more concrete terms in a reliability (or attributes) diagram (Hsu and Murphy 1986) in Fig. 5. Here, the correspondence of the forecast probabilities with the observed relative frequency of occurrence is shown for the above normal and below normal categories. When the forecast probabilities closely match the observed relative frequencies, as would be desired, the lines approximate the dotted 45° line. Figures 5a and 5b show, for the 6-yr period of forecasts, the reliabilities for the issued forecasts and for the AGCM’s forecasts prior to subjective modification, respectively. Despite the “jumpy” lines due to the small sample sizes, the lines for the issued forecasts are seen to have slopes roughly resembling the 45° line, indicating favorable reliability, while the lines for the AGCM’s forecasts have a less obvious upward slope. The AGCM’s forecast probabilities for the above or below normal categories of TC activity deviate from the climatological probabilities of 1/3 by much greater amounts than do their corresponding observed relative frequencies (see bottom inset in the panels of Fig. 5), resulting in low probabilistic forecast skill. The issued forecasts’ deviations from climatological probabilities are limited by the forecasters according to the perceived level of uncertainty, and within the restricted probability ranges an approximate correspondence to the observed relative frequencies is achieved. The more reliable issued forecasts carry appropriately limited utility as represented by the lack of forecast sharpness—that is, that the forecast probabilities rarely deviate appreciably from climatology, and from one another.
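The quantities plotted in a reliability diagram can be tabulated with a short routine such as the one below, which groups forecasts into probability bins and compares the mean forecast probability in each bin with the observed relative frequency. This is a generic sketch, not the procedure used to produce Fig. 5; the function name, the bin count, and the sample data are hypothetical.

```python
from collections import defaultdict


def reliability_table(forecast_probabilities, outcomes, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per bin.

    outcomes: True where the forecast category was observed, else False.
    """
    bins = defaultdict(list)
    for probability, occurred in zip(forecast_probabilities, outcomes):
        # Small offset guards against floating-point values such as 0.3 * 10
        # falling just below the intended bin edge.
        index = min(int(probability * n_bins + 1e-9), n_bins - 1)
        bins[index].append((probability, occurred))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_probability = sum(p for p, _ in pairs) / len(pairs)
        observed_frequency = sum(1 for _, hit in pairs if hit) / len(pairs)
        table.append((mean_probability, observed_frequency, len(pairs)))
    return table


# Made-up forecasts of the above normal category and whether it occurred.
probabilities = [0.1, 0.1, 0.8, 0.8, 0.8, 0.8]
occurred = [False, True, True, True, False, False]
for row in reliability_table(probabilities, occurred):
    print(row)
```

In this made-up sample the observed frequency is 0.5 in both bins even though the forecast probabilities are near 0.1 and 0.8, giving a flat reliability curve of the kind that indicates overconfidence.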
The bottom panels in Fig. 5 show reliabilities for the longer historical period of AGCM hindcasts using prescribed observed SSTs (OSSTr; Fig. 5c) and the persisted SST anomaly (HSSTp; Fig. 5d). Here, the lines are smoother due to the larger sample sizes. Both diagrams show forecasts having some informational value, as the lines have positive slopes, but the slopes are considerably shallower than the 45° line, indicating forecast overconfidence. The slopes for forecasts using observed SSTs are slightly steeper than those for forecasts using the persisted SST anomaly, as would be expected with the higher skill realized in forecasts forced by the actually observed lower boundary conditions.
That the TC activity forecasts of the AGCM have mainly positive correlation skill is consistent with their positive slopes in Figs. 5b–d. Additionally, their mainly negative RPSS (Tables 8 and 9) is expected when the positive slopes in the reliability diagram (Fig. 5) are shallower than one-half of the ideal 45° slope = 1 line (i.e., slope < 0.5) because then the forecasts’ potential information value is more than offset by the miscalibration of the forecast probabilities (Hsu and Murphy 1986; Mason 2004). This is consistent with the deterministic TC forecasts having positive correlation skill but negative MSESSs using climatology as the reference forecast, due to forecast anomalies that are stronger than warranted for the expected skill level.
The skills of the real-time probabilistic forecasts over the approximately 6-yr period are summarized in full aggregation (over basins and TC variables) in Table 10 using the RPSS, likelihood [based on the concept of maximum likelihood estimation; Aldrich (1997)], and ROC (Mason 1982) verification measures. The comparisons between the objective AGCM forecast output and the actually issued forecasts again underscore the need for the calibration of AGCM forecasts that greatly underestimate the real-world forecast uncertainty. The AGCM’s nontrivially positive scaled ROC areas for both above and below normal observed outcomes reveal their ability to provide useful information, as the ROC lacks sensitivity to calibration in a manner analogous to correlation for deterministic, continuous forecasts. In this particular set of forecasts, greater capability to discriminate the above from the below normal TC activity is suggested by the ROC skill.
c. A favorable and an unfavorable real-time forecast
Identification of “favorable” or “unfavorable” forecasts, while straightforward when considered deterministically, is less clear when comparing an observed outcome with its corresponding probability forecast. Probabilistic forecasts implicitly contain expressions of uncertainty. The position of an observed outcome within the forecast distribution is expected to vary across cases, and many cases are required to confirm that this variation is well described by the forecast probability distributions. When an observation lies on a tail of the forecast distribution, it is impossible to determine whether this represents an unfavorable forecast or is an expected rare case, without examining a large set of forecasts. The forecast distribution may be fully appropriate given the known forcing signals (Barnston et al. 2005). Here, we identify favorable and unfavorable cases in terms of the difference between the deterministic forecast (the model ensemble mean, which usually also approximates the central tendency of the forecast probability distribution) and the corresponding observation.
A critical aspect of the SST forcing to be forecast is the ENSO state during the peak season. Figure 6 shows the IRI’s forecasts of the seasonal Niño-3.4 index at 2-month lead time (e.g., a forecast for ASO SST issued in mid-June, with observed data through May) during the period of issued TC forecasts, with the corresponding observed seasonal SST. A moderate El Niño (EN) occurred during 2002–03, with weak ENs in 2004–05 and late 2006. A weak, brief La Niña (LN) condition was observed in late 2005 and early 2006, and a stronger LN developed during mid-2007. The average of the observed Niño-3.4 SST anomalies over the approximately 5-yr period is 0.45, compared with an average 2-month lead forecast anomaly of 0.37, indicating a small forecast bias. The uncentered correlation coefficient for the period is in the range of 0.70–0.79 for forecasts for the Northern Hemisphere peak seasons, and in the range of 0.80–0.89 for forecasts for the Southern Hemisphere peak season, suggesting somewhat skillful forecasts of tropical Pacific SST fluctuations for these peak TC seasons.
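As described in the footnote on correlation skill for short periods, the uncentered correlation takes anomalies relative to the longer base-period means rather than the subperiod means. The following minimal sketch, with a hypothetical function name and invented sample values, contrasts it with the ordinary (centered) correlation for a short period in which forecasts and observations vary out of phase but lie on the same side of the base-period mean.

```python
import numpy as np

def uncentered_correlation(fcst, obs, fcst_base_mean, obs_base_mean):
    """Correlation using long base-period means instead of subperiod means.

    The base-period means are used both for the anomalies in the numerator
    and for the standard-deviation-like terms in the denominator.
    """
    f = np.asarray(fcst, dtype=float) - fcst_base_mean
    o = np.asarray(obs, dtype=float) - obs_base_mean
    return float(np.sum(f * o) / np.sqrt(np.sum(f**2) * np.sum(o**2)))

# Hypothetical short-period values; the long base-period means are taken as 0.
fcst = np.array([0.6, 0.4, 0.6, 0.4])
obs = np.array([0.5, 0.7, 0.5, 0.7])

r_uncentered = uncentered_correlation(fcst, obs, 0.0, 0.0)
r_centered = float(np.corrcoef(fcst, obs)[0, 1])

print(f"uncentered: {r_uncentered:.2f}, centered: {r_centered:.2f}")
```

In this constructed case the centered correlation is strongly negative because of the small out-of-phase variations, while the uncentered correlation is strongly positive because both series lie well above the base-period mean.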
A favorable forecast for ACE in the western North Pacific took place in 2002. Figure 2d shows that the observation was in the above-normal category, and that the AGCM forecasts were not far from this number for the four lead times. For ACE in the western North Pacific, the ENSO condition is key, with EN (LN) associated with higher (lower) ACE. Between April and June of 2002 it became clear that an EN was developing, although the SST predictions contained a weaker EN than was observed (Fig. 7). Nonetheless, the SST predictions contained ENSO-related anomaly patterns of sufficient amplitude to force an above normal ACE prediction that verified positively. The favorable AGCM forecasts are shown probabilistically in Fig. 3d, with a positive RPSS verification shown in Fig. 4d.
An unfavorable forecast outcome occurred for ASO 2004, when the North Atlantic ACE was observed to be 2.41 × 10⁶ kt², the highest on record after 1970 for this season, but the AGCM forecasts from all five lead times were for between 0.5 and 1.0 × 10⁶ kt², only in the near-normal category. A weak EN developed just prior to the peak season, which, while somewhat underpredicted, was present in the SST forecasts. But despite weak EN conditions during the 2004 peak season, NTC and especially ACE were well above normal (Figs. 2e and 2f). A feature of the EN that likely weakened its inhibiting effect on Atlantic TC development was its manifestation mainly in the central part of the tropical Pacific, rather than in the Niño-3 region that appears more critical. Coupling of the warmed SSTs to the overlying atmosphere was also modest in ASO. However, aspects of the SST that were not well predicted were those that mattered more critically in this case: the North and tropical Atlantic SSTs (Goldenberg et al. 2001; Vimont and Kossin 2007), including the main development region (Goldenberg and Shapiro 1996). These regions developed markedly stronger positive anomalies than had been observed in April and May or forecast for the forthcoming peak season months, and are believed to have been a major cause of the high 2004 Atlantic TC activity level.
Both examples described above highlight the importance of the quality of the SST forecast for the peak TC season in the relevant tropical and subtropical ocean regions. ENSO-related Pacific SST is known to have some predictability, but there is room for improvement in capturing it, and the seasonal prediction of SST in the tropical Atlantic is a yet more serious challenge.
4. Comparison with simple statistical predictions
One reasonably might ask whether the skill levels of the AGCM simulations and predictions are obtainable using statistical models derived purely from the historically observed TC data and the immediately preceding environmental data such as sea level pressure or SST conditions. How much does the dynamical approach to TC prediction offer that is not obtainable using empirical approaches? Here, we explore this for deterministic skill, using observed predictors in multiple regressions. To minimize the artificial skill associated with “fishing” for accidentally skillful predictors, four restrictions are imposed: a maximum of two predictors is used for each basin, the same predictors are used for NTC as for ACE for each basin, all predictors are SSTs averaged over rectangular index regions,⁹ and all predictors must have a plausible physical relevance to the TC activity. “Leave out one” cross validation is applied to assess the expected real-time predictive skill of the statistical models. We use mainly SST because of the well-documented influence of SST anomaly patterns, including in particular the state and direction of the evolution of ENSO, on the interannual variability of TC activity in most ocean basins. Statistical predictions are made at a lead time of 1 month (e.g., June SST predicting the Atlantic peak season of ASO). A similar “prediction” is made for the simulation of TC activity using predictors simultaneous with the center month of the peak TC season.¹⁰ The simulation predictors are usually the same as those used for the 1-month lead prediction.
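The cross-validation design can be sketched as follows. This is a minimal illustration of “leave out one” cross validation for a two-predictor multiple regression, not the code used in this study; the function name, the random predictor values, and the regression coefficients are all hypothetical. Each year is withheld in turn, the regression (including its intercept) is refit on the remaining years, and the withheld year is then predicted.

```python
import numpy as np

def loo_cv_predictions(X, y):
    """Leave-one-out predictions from an ordinary least squares regression.

    X has shape (n_years, n_predictors); y has shape (n_years,).
    The intercept is refit each time, so the mean is also cross validated.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # withhold year i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ coef            # predict the withheld year
    return preds

# Hypothetical data: 36 years, two SST index predictors, noisy predictand.
rng = np.random.default_rng(0)
n_years = 36
X = rng.standard_normal((n_years, 2))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.standard_normal(n_years)

# In-sample multiple correlation versus cross-validated correlation.
A_full = np.column_stack([np.ones(n_years), X])
coef_full, *_ = np.linalg.lstsq(A_full, y, rcond=None)
r_fit = float(np.corrcoef(A_full @ coef_full, y)[0, 1])
r_cv = float(np.corrcoef(loo_cv_predictions(X, y), y)[0, 1])

print(f"in-sample r = {r_fit:.2f}, cross-validated r = {r_cv:.2f}")
```

The cross-validated correlation is typically lower than the in-sample value, which is why it serves as the less biased estimate of expected real-time skill.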
Selection of the predictor SST indices is based on previous studies and on examination of the geographical distribution of the interannual correlation between SST and the TC variables using 1970–2005 data. For example, Fig. 7 shows the correlation field for SST in June versus Atlantic NTC during the ASO peak season, indicating the well-known inverse relationship with warm ENSO, and positive association with SSTs in the North Atlantic, associated with the Atlantic meridional mode (Vimont and Kossin 2007) and the Atlantic multidecadal oscillation (Goldenberg et al. 2001). When September SST is used, simultaneous with the Atlantic TC activity, these same two key regions remain important, but even stronger correlations appear for SST in the main development region (Goldenberg and Shapiro 1996).
Table 11 identifies the predictors used for each ocean basin for 1-month lead forecasts and for simultaneous simulations. In the cases of the Atlantic and western North Pacific forecasts, the first predictor contains both a recent SST level and a recent time derivative for the same region, to capture the ENSO status and direction of evolution. Many of the statistical predictors are ENSO related. The Niño-3 SST in the east-central tropical Pacific is found to be more relevant to Atlantic TC activity (Gray et al. 1993) than the location more central to ENSO itself [i.e., Niño-3.4; Barnston et al. (1997)]. Niño-3.4 is used for western North Pacific TC activity, with the second predictor being SST in the subtropical northeastern Pacific associated with the North Pacific atmospheric circulation pattern that is found to be linked with the TC activity (Barnston and Livezey 1987; Chan et al. 1998). For northeast Pacific TC activity, the SST regions highlight an ENSO-related east–west dipole at northern subtropical latitudes, while for the South Pacific and Australia the regional TC predictions, also ENSO governed, are tailored to their Southern Hemisphere locations.
Table 12 indicates the strengths of the relationship between each of the predictors and the predictand, the predictors’ correlations with one another, and the resulting multiple correlation coefficients both within the model development sample and upon using cross validation. The latter skill is considered to be a less biased skill estimate with which the AGCM-based skill (shown in the subsequent column in Table 12) can be compared. Results for this comparison are mixed. The dynamical forecasts and simulations are slightly more skillful for South Pacific NTCs, as well as in the eastern North Pacific basin in most cases. The statistical model produces higher skill levels in most cases in the Atlantic and western North Pacific for forecasts and for simulations in some cases. Statistical tests indicate that none of the dynamical–statistical skill differences are significant for the 36-case sample. Considering this, and the alternation of skill rank between the approaches over the basins, there is no clear suggestion that one approach is generally superior to the other. That the dynamical approach tended to yield higher skill levels in the South Pacific, and results that were no lower than the statistical method in the Australian region, could be related to the comparatively lower quality of the SST predictor data south of the equator, as well as NTC data in the Southern Hemisphere, particularly in the 1970s. It is possible that less accurate SST data would degrade the statistical forecasts more than the AGCM forecasts forced by the SST because the SST indices used in the statistical forecasts represent relatively smaller regions than the aggregate of the SST regions influencing the behavior of the AGCM. The larger areas of SST influencing the model may allow the opportunity for opposing error impacts, leading to smaller net impacts.
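To give a sense of why modest skill differences are not significant for a 36-case sample, the following sketch applies the Fisher z transformation to two correlations. The particular statistical test used for Table 12 is not specified here, so this choice is an assumption; it also treats the two correlations as independent, which is a simplification because both approaches are verified against the same observations. The correlation values are hypothetical.

```python
import math

def fisher_z_difference(r1, r2, n1, n2):
    """Two-sided test for the difference between two independent correlations.

    Returns the z statistic and the two-sided p value based on the
    Fisher z transformation and a normal approximation.
    """
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical skills: one approach at r = 0.5, the other at r = 0.3,
# each based on 36 cases.
z_stat, p_value = fisher_z_difference(0.5, 0.3, 36, 36)
print(f"z = {z_stat:.2f}, p = {p_value:.2f}")
```

Under these assumptions, even a correlation difference of 0.2 falls well short of conventional significance thresholds for 36 cases.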
Some notable features of this methodological comparison are (i) the statistical models used here were restricted to be fairly simple, and may not be near optimum; (ii) despite the use of cross validation, some “fishing” may still have occurred in selecting the predictor SST indices, and there may be some artificial skill; and (iii) the one-year-out cross validation design has a negative skill bias in truly low predictability situations (Barnston and van den Dool 1993). Such caveats of opposing implications suggest that the skill comparisons should be considered as rough estimates, intended to detect obvious skill differences—and such differences are not revealed here. One might expect that much of the skill of a near-perfect dynamical model would be realizable by a sophisticated (e.g., containing nonlinearities) statistical model if accurate observed data were available, since the observations should occur because of, and be consistent with, the dynamics of the ocean–atmosphere system with noise added. Seasonal climate has been shown to be statistically modeled fairly well using only linear relationships (Peng et al. 2000). However, linearity may compromise statistical skill in forecasting seasonal phenomena such as TC activity, with its highly nonlinear hydrodynamics in individual storms that may not reduce to linear behavior even upon aggregating over a TC peak season.
The IRI has been issuing experimental TC activity forecasts for several ocean basins since early 2003. The forecasts are based on TC-like features detected and tracked in a climate model at low horizontal resolution. The model is forced at its lower boundary by SSTs that are predicted first, using several other dynamical and statistical models. The skill of the model’s TC predictions using historical observed SSTs is discussed as a reference against which skill levels using several types of predicted SSTs (including persisted SST anomalies) are compared. The skill of the raw model output is also compared with that of subjective probabilistic forecasts actually developed since mid-2001, where the subjective forecasts attempt to correct the “overconfident” probabilistic forecasts from the AGCM. The skill levels of the AGCM-based forecasts are also compared with those from simple statistical forecasts based on observed SSTs preceding the period being forecast.
Results show that low-resolution uncoupled climate models deliver statistically significant, but fairly modest, skill in predicting the interannual variability of TC activity. The levels of correlation skill are comparable to the levels obtained with simple empirical forecast models (here, models employing two-predictor multiple regression using preceding area-average SST anomalies and their recent time derivative). In ocean basins where observed SST predictor data are of questionable quality, statistical prediction is less effective. Even though this same SST is used as the boundary forcing for the climate model, the dynamical predictions tend to slightly outperform the statistical predictions in this circumstance.
In a two-tiered dynamical prediction system such as that used in this study, the effect of imperfect SST prediction is noticeable in the skill levels of TC activity compared with skill levels when the model is forced with historically observed SSTs.
As with climate forecasts made by AGCMs, the probabilistic reliability of the AGCM’s TC activity forecasts is not favorable, in that the model ensemble forecasts usually deviate too strongly from the climatological distribution, due to too narrow an ensemble spread or to too large a shift in the ensemble mean from climatology. This overconfidence of the AGCM forecasts is partly due to their being based on specific representations of the physics, through parameterizations, and to their own hindcast performance not being taken into account in forming the ensemble forecasts. Upon subjective human intervention, the forecasts are made more probabilistically conservative and reliability is improved, leading to higher probabilistic verification scores than for the uncalibrated AGCM forecasts.
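The penalty that overconfidence incurs in the RPSS can be illustrated with a minimal sketch for tercile forecasts. The function names and sample probabilities are hypothetical, and the linear damping toward climatology is only an assumed stand-in for the forecasters’ subjective adjustment, not the procedure actually used. Categories are indexed 0 (below normal), 1 (near normal), and 2 (above normal).

```python
import numpy as np

CLIM = np.array([1.0, 1.0, 1.0]) / 3.0  # climatological tercile probabilities

def rps(prob, obs_cat):
    """Ranked probability score: squared differences of cumulative probabilities."""
    prob = np.asarray(prob, dtype=float)
    obs = np.zeros(3)
    obs[obs_cat] = 1.0
    return float(np.sum((np.cumsum(prob) - np.cumsum(obs)) ** 2))

def rpss(prob, obs_cat):
    """Skill score relative to the climatological forecast (1 is perfect)."""
    return 1.0 - rps(prob, obs_cat) / rps(CLIM, obs_cat)

def damp(prob, weight):
    """Shrink a forecast toward climatology (assumed form of calibration)."""
    return CLIM + weight * (np.asarray(prob, dtype=float) - CLIM)

# Worst case: all probability on an outer category, near normal observed.
print(rpss([1.0, 0.0, 0.0], 1))          # approximately -3.5

# Hypothetical overconfident forecast and a damped version of it.
raw = [0.0, 0.1, 0.9]
damped = damp(raw, 0.5)
for obs_cat in (1, 2):                    # near normal, then above normal
    print(obs_cat, round(rpss(raw, obs_cat), 2), round(rpss(damped, obs_cat), 2))
```

For these sample values, damping costs some skill when the favored category verifies but avoids a much larger penalty when it does not, which is the asymmetry that makes overconfident forecasts score poorly in aggregate.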
The potential skill levels (i.e., discriminating information, but needing calibration) seen in the approximately 6-yr-period real-time AGCM TC predictions are tentative and nonrobust due to the small sample size. However, these skill levels are not inconsistent with those of longer-period AGCM-based hindcasts forced by SST anomalies persisted from the previous month, as seen in comparing the correlation skills for leads of 0 and 1 of FSST and HSSTp in Tables 4 and 5. Thus, assuming a calibration step, real-time skill levels are expected to be fairly modest but contain useful informational value (e.g., most short-lead HSSTp correlations are 0.2–0.5) with the highest skill potential appearing in the Atlantic, eastern Pacific, and South Pacific, and somewhat lower levels of skill potential found in the western Pacific and Australia regions. Skill levels are generally seen to decrease slowly with increasing lead time, such that forecasts issued several months prior to the peak season onset are also expected to have some informational value.
We plan to examine the skill of other models in the hope of adding more information, and ideally skill, to our seasonal TC forecasts. The problem of overconfidence in AGCMs is relieved to some extent by the use of multimodel ensembles (Kharin and Zwiers 2002; Vitart 2006; Tippett et al. 2007): including further models should help restrain the probabilistic amplitude exhibited by a single model. The merging of TC forecasts made by AGCMs and by statistical methods may also prove beneficial. Another possibility that could be explored in the future is to combine dynamical forecasts using the direct method (tracking of model storms) and the indirect method (using only the large-scale fields of the model).
Issues not examined here are the role of AGCM spatial resolution in governing predictive skill, and the impact of using a fully coupled dynamical system rather than a two-tiered system as is employed here. Although the prospects for the future improvement of dynamical TC prediction are uncertain, it appears likely that additional improvements in dynamical systems will make possible better TC predictions. As is the case for dynamical approaches to ENSO and near-surface climate prediction, future improvements will depend on a better understanding of the underlying physics, more direct physical representation through higher spatial resolution, and substantial increases in computer capacity. Hence, improved TC prediction should be a natural by-product of the improved prediction of ENSO, global tropical SST, and climate across various spatial scales.
This work was supported by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (Grant NA05OAR4311004). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies.
Acronyms and Their Definitions
ACE accumulated cyclone energy
AGCM atmospheric general circulation model
AMIP Atmospheric Model Intercomparison Project (AGCM is forced using observed SST)
CCA canonical correlation analysis
CFS Climate Forecast System (global coupled model of NOAA/NCEP)
DJFM December–January–February–March (and similarly for other multimonth periods)
ECHAM4.5 ECMWF–Hamburg, Germany, AGCM, version 4.5
ECMWF European Centre for Medium-Range Weather Forecasts
EN El Niño
ENSO El Niño–Southern Oscillation
EUROSIP European Seasonal to Interannual Prediction superensemble
FSST forecasted SST used for real-time AGCM forecasts, 2001 onward
FSSTe FSST using evolving (predicted) SST anomalies
FSSTp FSST using persisted SST anomalies observed from previous month
HSSTp hindcasted SST, covering a long past history, using persisted anomalies observed from the previous month
IRI International Research Institute for Climate and Society
LDEO Lamont-Doherty Earth Observatory (a campus of Columbia University)
LN La Niña
MSESS mean squared error skill score
NCEP National Centers for Environmental Prediction
NOAA National Oceanic and Atmospheric Administration
NTC number of tropical cyclones
OSST observed SST used for a long past history of AGCM hindcasts, starting in 1950
OSSTr OSST for a relatively more recent period, starting in 1970
ROC relative operating characteristics
RPSS ranked probability skill score
SST sea surface temperature
TC tropical cyclone
Corresponding author address: Suzana Camargo, Lamont-Doherty Earth Observatory, The Earth Institute at Columbia University, 61 Rte. 9W, P.O. Box 1000, Palisades, NY 10964-8000. Email: firstname.lastname@example.org
The tropical Pacific SST has been based on one or more of the following: the NCEP coupled ENSO prediction model (Ji et al. 1998), the NCEP Climate Forecast System (NCEP-CFS) (Saha et al. 2006), the Lamont-Doherty Earth Observatory intermediate model, version 5 (LDEO-5) (Chen et al. 2004), and the statistical constructed analog (CA) model (van den Dool 1994, 2007, chapter 7).
A model TC needs to exceed simultaneously thresholds for low-level vorticity (850 hPa), surface wind speed, and vertically integrated local temperature anomaly for at least 2 days, and must also have a relative local minimum of sea level pressure, local maximum of temperature anomalies in various levels, and mean wind speed at 850 hPa larger than at 300 hPa.
For the South Pacific, a temporal aspect of bias correction is used: the TC forecast season is DJFM, but the model output for NDJF is used to forecast DJFM, including the bias correction. It is found that hindcast skill levels are appreciably higher with this 1-month offset, which we consider a temporal aspect of the bias correction.
The data available for the forecast released during the first month of the TC peak season cover only through the end of the previous month.
In computing the correlation skill for forecasts for much shorter periods than the climatological base period, the subperiod means are not removed and are not used for computing the standard deviation terms. Instead, the longer base period means are used. This is done so that, for example, if in the subperiod the forecasts and observations have small-amplitude out-of-phase variations but both are generally on the same side of the longer period mean, a positive correlation would result, and we believe justifiably.
It is expected that NTC and ACE are composed of enough independent individual TC events over the season that the central limit theorem would result in a smoother, and usually unimodal, forecast probability distribution.
Because the RPSS is computed as a sum of squares of cumulative (over tercile categories) differences between forecast and observed probabilities, the lower limit of RPSS (−3.5) is farther below zero than the upper limit (+1.0) is above zero. Thus, high probabilities forecast for an incorrect category outweigh high probabilities forecast for the correct category, and “overconfident” forecasts result in severe penalties even when the forecasts have some positive level of informational value.
The only exception to this is Australia, in which sea level pressure at Darwin is used.
The second month is used for both 3- and 4-month peak seasons.