This study analyzes the differences between an objective, automated identification of tropical cyclones (TCs) that undergo extratropical transition (ET), and the designation of ET determined subjectively by human forecasters in best track data in all basins globally. The objective identification of ET is based on the cyclone phase space (CPS), calculated from the Japanese 55-yr Reanalysis (JRA-55) or the ECMWF interim reanalysis (ERA-Interim). The resulting classification into ET storms and non-ET storms underlies the global climatology of ET presented in Part I of this study. Here, the authors investigate how well the CPS classifications, calculated from JRA-55 or from ERA-Interim data, agree with those in the best track records. According to F1 scores and Matthews correlation coefficients (MCCs), the classification of ET storms in the CPS agrees best with the best track classification in the western North Pacific (MCC > 0.7) and the North Atlantic (MCC > 0.5). In other basins, the correlation between the CPS classification and the best track classification is only slightly higher than that of a random classification. The JRA-55 classification achieves higher performance scores than does the ERA-Interim classification, and the differences are statistically significant in all basins. The lower performance of ERA-Interim is mainly due to a higher false alarm rate, particularly in the eastern North Pacific. Overall, the results show that while the CPS-based classifications are good enough to be useful for many purposes, there is almost certainly room for improvement: in the representation of the storms in reanalyses, in our objective metrics of ET, and in our scientific understanding of the ET process.
1. Introduction
Extratropical transition (ET) is a process in which a tropical cyclone (TC) loses its radially symmetric warm-core structure and becomes an extratropical cyclone with frontal features and a cold core (Jones et al. 2003; Evans et al. 2017). To identify the ET of individual storms, forecasters in TC warning centers analyze a wide range of satellite images, model output, and observations. In the TC best track archives, a storm that is determined to have completed ET based on this analysis (and after a poststorm review taking into account all available data) receives an “extratropical” label.
The exact procedure for determining whether a cyclone is considered tropical or extratropical varies among different TC warning centers. Usually, the decision is based on a combination of satellite imagery, model forecast fields, and other operational tools such as the CPS; Fogarty (2010) provides an overview of ET-related operational forecast practices in many agencies. Examples of satellite products consulted in ET forecasts include cloud imagery, wind retrievals from scatterometers, or Advanced Microwave Sounding Unit temperature and moisture soundings. These products are used to monitor the defining characteristics of ET: the increasing asymmetry of the cloud pattern, expansion of the wind field, intrusion of dry air from the midlatitude trough, and the erosion of the TC’s warm core structure (Fogarty 2010). Sometimes a “human dimension” may be included because public perception of a cyclone’s threat changes when the system is declared extratropical (Masson 2014). The “extratropical” labels thus represent a definition of ET that involves subjective expert judgment. In contrast, the cyclone phase space (CPS) framework proposed by Hart (2003) can be used to define ET in a purely objective, automatable way. The CPS has become widely used and has been applied to operational analysis and reanalysis data (e.g., Hart 2003; Kitabatake 2011; Wood and Ritchie 2014) as well as climate model output (Zarzycki et al. 2017; Liu et al. 2017).
In the first part of this study (Bieli et al. 2019, hereafter Part I), we used two reanalyses, the Japanese 55-yr Reanalysis (JRA-55; Kobayashi et al. 2015) and the European Centre for Medium-Range Weather Forecasts’ (ECMWF) interim reanalysis (ERA-Interim; Dee et al. 2011), to locate TCs in the CPS and study ET in seven global ocean basins. For comparison, statistics obtained from the storm type information (i.e., the “extratropical” labels) in the TC best track data were included as well. The resulting geographical, seasonal, and temporal characteristics of ET differed between the basins, but also between the two reanalyses and the best track labels. This raises the question to what extent the globally consistent view obtained from a reanalysis is consistent from one reanalysis dataset to another and also with forecaster judgment.
Objective definitions for the onset and completion of ET in the CPS were developed by Evans and Hart (2003) using 61 Atlantic TCs, all of which had been declared by the National Hurricane Center (NHC) to have undergone ET. The study includes a comparison of the timing of ET in the CPS with that in the best track data from the NHC. However, Evans and Hart (2003) did not examine how the classification into “ET storms” (i.e., storms that undergo ET at some point in their lifetimes) and “non-ET storms” (i.e., storms that do not undergo ET) obtained from the CPS compares to that in the best tracks, when considering a set of TCs with unknown classification. Applying the CPS to identify ET in a set of recurving TCs, Kofron et al. (2010) found that the CPS does not discriminate between ET storms and non-ET storms. However, their definition of ET is not based on the best track labels but on a manual examination of each cyclone’s surface pressure field in reanalysis data.
The dependence on the dataset used to locate the TCs in the CPS makes it difficult to isolate the effect of the methodological differences between the definition of ET in the CPS and that in the best tracks. An example of this is the fraction of TCs undergoing ET as presented in Part I: The classification obtained from ERA-Interim diagnoses a larger number of storms as undergoing ET than does the JRA-55 classification. As there is no universal definition of ET, it is not possible to assess the correctness of the two classifications in absolute terms. However, we can evaluate how well the CPS classifications agree with the best track records, and how that agreement depends on whether the CPS is calculated from JRA-55 or from ERA-Interim data. This second part of the study sets out to answer these questions on a global basis.
2. Data and methods
a. TC best track and reanalysis datasets
This study is based on the same data as Part I: The cyclone data are best track datasets from the National Hurricane Center in the North Atlantic (NAT) and in the eastern North Pacific (ENP), from the Joint Typhoon Warning Center (JTWC) in the north Indian Ocean (NI), the Southern Hemisphere (SH), and the western North Pacific (WNP), and from the Japan Meteorological Agency (JMA) in the WNP. Within the SH, we distinguish the south Indian Ocean (SI), the Australian region (AUS), and the South Pacific (SP). Table 1 provides an overview of the basin acronyms and best track datasets used in this study.
In Part I, we considered TCs with tropical storm intensity or higher that occurred in the satellite era 1979–2017. Here, we consider only the years for which the best track data provide the “extratropical” labels that denote TCs that have undergone ET, as declared by the respective operational meteorological agencies. The time periods for which these labels are available vary by basin (Table 1).
We use two reanalysis datasets, the Japanese 55-yr Reanalysis (1.25° × 1.25°) released by the JMA (Kobayashi et al. 2015) and the ECMWF interim reanalysis (0.7° × 0.7°; Dee et al. 2011). Both reanalyses apply a four-dimensional variational data assimilation. A unique feature of the JRA-55 assimilation system is the use of artificial wind profile retrievals in the vicinity of TCs. In this retrieval scheme, three wind models are combined to reconstruct 3D wind profile data at certain locations around the storm center, using TC information from best track data (Fiorino 2002). In the assimilation process, the wind profiles are treated as if they were observations from dropwindsondes (Hatsushika et al. 2006; Ebita et al. 2011). In contrast, ERA-Interim does not assimilate any artificial TC information.
b. Cyclone phase space
We use the cyclone phase space proposed by Hart (2003) to objectively identify storms that undergo ET. In the CPS framework, the physical structure of cyclones is described based on three parameters: the B parameter measures the asymmetry in the layer-mean temperature surrounding the cyclone, and two thermal wind parameters, $-V_T^L$ and $-V_T^U$, assess whether the cyclone has a warm or cold core structure in the lower and upper troposphere, respectively (with the convention of the minus sign, positive values correspond to warm cores). As in Part I, ET onset is defined here as the first time a TC is either asymmetric (B > 11) or has a cold core ($-V_T^L < 0$ and $-V_T^U < 0$), and ET completion is defined as the time when the second criterion is met. This definition allows us to distinguish three pathways of ET in the CPS: B → VT ETs start when the TC becomes asymmetric and end with the formation of a cold core, VT → B ETs start with the formation of a cold core and end when the TC becomes asymmetric, and direct ETs become asymmetric and cold core at the same 6-hourly time step. The reader is referred to Hart (2003) and Evans and Hart (2003) for a comprehensive exposition of the CPS, and to Part I for details on its application to the definition of ET in this study.
After computing the CPS parameters along all best tracks, we applied the CPS criteria to classify each storm either as an ET storm if it completes the transition from a tropical to an extratropical system at some point during its lifetime or as a non-ET storm if it does not. This resulted in two binary classifications, one from the CPS parameters computed using JRA-55 data (the JRA-55 classifier), and one from the CPS parameters obtained from ERA-Interim data (the ERA-Interim classifier). A third is given by the storm type information in the best track archives, whose “extratropical” labels represent the classification proposed by the specialists at the operational warning centers.
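As an illustration of the storm-level classification described above, the following minimal sketch assigns a storm to the ET or non-ET class and, for ET storms, to one of the three pathways. It is not the code used in this study: the function name, the input layout (one value per 6-hourly time step), and the simplification that a criterion counts as met at its first occurrence are assumptions made for the example. The cold-core test assumes that both thermal wind parameters are negative, following the sign convention stated in the text.

```python
from typing import Optional, Sequence, Tuple

B_THRESHOLD = 11.0  # asymmetry threshold quoted in the text (B > 11)


def classify_storm(
    b: Sequence[float],
    vtl: Sequence[float],
    vtu: Sequence[float],
) -> Tuple[bool, Optional[str]]:
    """Return (is_et_storm, pathway) for one storm.

    b, vtl, vtu hold the 6-hourly B, -V_T^L and -V_T^U values along the track.
    Assumption: a criterion is considered met at the first time step at which
    it holds; no persistence requirement is applied.
    """
    # First time step at which the storm is asymmetric.
    first_asym = next((i for i, x in enumerate(b) if x > B_THRESHOLD), None)
    # First time step with a cold core in both layers (assumed: both values < 0).
    first_cold = next(
        (i for i, (lo, up) in enumerate(zip(vtl, vtu)) if lo < 0 and up < 0),
        None,
    )
    # ET completes only if both criteria are met at some point.
    if first_asym is None or first_cold is None:
        return False, None
    if first_asym < first_cold:
        return True, "B -> VT"
    if first_cold < first_asym:
        return True, "VT -> B"
    return True, "direct"


# Hypothetical three-step track: asymmetric at step 1, cold core at step 2.
print(classify_storm([5, 12, 15], [10, 5, -3], [8, 2, -1]))
```

Applying such a function to the CPS parameters from each reanalysis in turn would yield the two binary classifications compared in the remainder of the paper.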
c. Statistical performance measures
For the purpose of this study, we treat the best track labels as the “true” classifications of ET storms (see section 4 for a discussion of this assumption). Consequently, the performance of the CPS classifiers is assessed by comparing them to the ET events in the best track labels, both by checking the agreement on individual storms and by applying statistical performance measures. Two commonly used statistical performance metrics for binary classification algorithms are precision and recall (e.g., Ting 2010), which are defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$
TP, FP, and FN are the numbers of true positives, false positives, and false negatives. Thus, precision is the ratio of correctly classified positive observations (here: ET storms) to the total observations classified as positive, and answers the question, “Of all storms a CPS classifier declares to have undergone ET, what fraction actually did?” Recall is the ratio of correctly classified positive observations to the total positive observations, answering the question “Of all true ET storms, what fraction does the CPS classifier label as such?” The harmonic mean of precision and recall is called the F1 score and quantifies the overall performance of the CPS classifiers in a single number:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
The F1 score, precision, and recall all range from 0 to 1, with higher scores signaling better performance.
The Matthews correlation coefficient (MCC) introduced by Matthews (1975) additionally takes into account the number of true negatives (TN). It is defined as
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
The MCC can take on a value between −1 and 1, where 1 represents a perfect classification, 0 is equivalent to a random classification, and −1 indicates total disagreement between classification and observation.
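The performance measures of this section can be computed directly from the four entries of a confusion matrix. The sketch below is illustrative only: the function names are ours, the counts in the example are hypothetical and not taken from Table 2, and returning 0 when a denominator vanishes is a common convention rather than something specified in the text.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Fraction of storms classified as ET that are ET in the best tracks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of best track ET storms that the classifier identifies."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0 if the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a basin with 100 storms (not from the paper).
tp, fp, fn, tn = 40, 10, 10, 40
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn), mcc(tp, tn, fp, fn))
```

Because the MCC includes the true negatives, it can differ markedly from the F1 score in basins where non-ET storms dominate.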
d. Significance test for differences in F1 scores and MCCs
We use a subsampling method to assess the significance of the differences in the performance metrics (F1 scores and MCCs) achieved by the classifications obtained from JRA-55 and ERA-Interim. The method is based on n = 1000 draws of 5-yr subsets, sampled randomly without replacement. In each draw, the performance metrics of the two classifiers are calculated on the storms that occurred in the sampled 5 years, and the classifier that achieves the higher score is said to have won the draw. Given the $k_{\text{JRA-55}}$ draws won by the JRA-55 classifier and the $k_{\text{ERAInt}} = n - k_{\text{JRA-55}}$ draws won by the ERA-Interim classifier, we let $k = \max(k_{\text{JRA-55}}, k_{\text{ERAInt}})$ denote the number of draws won by the better performing classifier, and we define “success” to be the event that the better classifier wins a draw. Individual draws are treated as Bernoulli trials, that is, as independent random experiments with two possible outcomes (“success” and “failure”), in which the probability of success is the same every time the experiment is conducted.
The null hypothesis is that the JRA-55 and ERA-Interim classifiers are equally likely to win a draw (i.e., that the probability of success $p_s$ equals 0.5). The number k of successes in n Bernoulli trials with probability $p_s$ of success is a binomial(n, $p_s$) random variable. Thus, the probability of obtaining at least k successes is
$$P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i} p_s^{i} (1 - p_s)^{n-i}.$$
If this probability is smaller than a significance level of σ = 0.05, we reject the null hypothesis and conclude that the difference in the performance scores of the JRA-55 and ERA-Interim classifiers is statistically significant.
There is no set rule for determining the appropriate subset size S (Politis et al. 1999). To account for this, the subsampling was repeated with subsets of 7 and 10 years.
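The subsampling test can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for Table 3: `score_a` and `score_b` are hypothetical callables that return a performance metric (F1 score or MCC) for the storms of a given list of years, and ties are counted for the second classifier because the text defines its wins as the complement of the first classifier's wins. The tail probability is evaluated exactly for the null hypothesis of a success probability of 0.5.

```python
import random
from math import comb
from typing import Callable, Sequence, Tuple


def binomial_tail_half(k: int, n: int) -> float:
    """P(X >= k) for X ~ binomial(n, 0.5), using exact integer arithmetic."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


def subsample_test(
    years: Sequence[int],
    score_a: Callable[[Sequence[int]], float],
    score_b: Callable[[Sequence[int]], float],
    subset_size: int = 5,
    n_draws: int = 1000,
    seed: int = 0,
) -> Tuple[int, float]:
    """Return (k, p_value): draws won by the better classifier and its tail probability."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_draws):
        # Sample a subset of years without replacement.
        subset = rng.sample(list(years), subset_size)
        if score_a(subset) > score_b(subset):
            wins_a += 1
    # Ties fall to classifier b here (assumption, see lead-in).
    k = max(wins_a, n_draws - wins_a)
    return k, binomial_tail_half(k, n_draws)


# Toy example: classifier a always scores higher, so it wins every draw.
k, p_value = subsample_test(range(1979, 2018), lambda ys: 1.0, lambda ys: 0.0, n_draws=20)
print(k, p_value < 0.05)
```

Repeating the call with `subset_size=7` and `subset_size=10` corresponds to the sensitivity check on the subset size described above.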
3. Results
a. Spatial distribution of misclassifications
In our evaluation of the JRA-55 and ERA-Interim classifiers against the best track labels, we distinguish between misclassification of positive samples and negative samples. Misclassified positive samples are false negatives (i.e., actual ET storms that are not identified in the CPS), and misclassified negative samples are false positives (i.e., storms that are classified as ET storms in the CPS but not in the best track data). Similarly, correctly classified storms are either true positives or true negatives. Table 2 gives the complete breakdown for each basin and reveals that in four of seven basins, false negatives are the dominant source of error in JRA-55, whereas ERA-Interim has more false positives than false negatives in all basins. The classification difference is greatest in the ENP, where ERA-Interim has 130 false positives, compared to 38 for JRA-55 (this finding will be analyzed further in section 3b). It is likely that the wind profile retrievals used in the JRA-55 data assimilation mentioned in section 2a (Hatsushika et al. 2006; Ebita et al. 2011) enhance the tropical characteristics of the cyclones in the reanalysis, reducing the number of false positives while increasing the number of false negatives.
Table 2 demonstrates that a meaningful comparison of the CPS classification with the best track classification has to be based on a storm-by-storm evaluation, not on ET fractions: Part I showed that in the WNP, the difference between ERA-Interim’s ET fraction and the ET fraction in the best tracks is smaller for the JTWC data than for the JMA data. However, the percentage of correctly classified cyclones is greater for the JMA data (90.7% in JRA-55 and 85.8% in ERA-Interim) than for the JTWC data (79.8% and 72.8%).
Figure 1 presents the spatial distribution of the storm-by-storm evaluation for the NAT and the WNP, showing the prevailing correct classifications (true positives and true negatives, marked by green and blue dots) compared to the misclassified storms (false positives and false negatives, marked by red and orange triangles). The majority of false positives are located north of 20°N, but they can occur as far south as 6°N (ERA-Interim, WNP). We also note the absence of any obvious systematic differences in the spatial distribution of the wrongly classified storms between the two reanalyses.
For the SH, the distribution as well as the number of wrongly classified storms are similar in the results for JRA-55 (Fig. 2a) and ERA-Interim (not shown). There is a zonal band of true negatives with false positives at its southern edge, which implies that the CPS classifiers tend to declare ET more readily and farther north than the JTWC. At the same time, though, the CPS classification also fails to identify ET events that happen considerably farther south, as indicated by the false negatives poleward of 30°S.
ET in the NI (Fig. 2b) is more difficult to assess due to the blocking effect of the continental landmass, which prevents storms from moving far enough north to undergo ET. From 2004 to 2017, the JTWC only labeled two storms as extratropical. As a result, the evaluation of ET detection in the NI proved most sensitive to changes in the threshold values of the CPS parameters; for example, the JRA-55 classifier misclassifies only a single storm when increasing the asymmetry threshold of the B parameter from 11 to 14.
b. A closer look at the ENP
The discrepancy between the ET classifications of JRA-55 and ERA-Interim in the ENP (Table 2) motivates a closer inspection of that basin. Figure 3 confirms that the ET detection in JRA-55 matches the observations better, showing fewer false positives west of Mexico than does ERA-Interim. Hence, ERA-Interim’s overestimation of the ET fraction in the ENP is the result of wrongly classified ET events occurring over the ocean, in the latitude range from about 10° to 30°N. This proneness to false positives is also manifest in boxplots of all 6-hourly CPS parameters in the ENP (Fig. 4): compared to their counterparts in JRA-55, the distributions of all three parameters in ERA-Interim have larger fractions of their values in the extratropical range (i.e., B > 11, $-V_T^L < 0$, $-V_T^U < 0$).
Of all 96 storms that are false positive in the ERA-Interim classification but true negative in the JRA-55 classification, 62 (65%) do not begin ET based on the CPS in JRA-55; that is, they neither exceed the asymmetry threshold B = 11 nor exhibit a cold core ($-V_T^L < 0$ and $-V_T^U < 0$) at any point in their lifetimes. In the remaining cases, the JRA-55 classifier diagnoses the onset of ET, but the condition for the completion of ET is not satisfied.
Composite fields of geopotential height (Fig. 5) show the representation of these 96 storms in JRA-55 and ERA-Interim. The composites are the averages of fields centered on the best track storm location, which were extracted in a 20° latitude × 20° longitude box at the time when the ERA-Interim classifier declared ET completion. Both reanalyses feature a cyclone located in the center. Thus, positional differences between the locations of the storm centers in the best tracks and those in ERA-Interim are not the primary reason for ERA-Interim’s higher false alarm rate. At the 900- and 600-hPa levels, the composites of JRA-55 show a more radially symmetric and stronger cyclone than those of ERA-Interim. This is consistent with the lower values of the B parameter reached in JRA-55, which leads to fewer storms being diagnosed to have undergone ET.
Weak or dissipating stages at the end of a TC’s lifetime may produce CPS signatures similar to those of ET storms, which raises the question of whether there is a specific type of cyclone in the best track data that tends to be misdiagnosed as ET in the ERA-Interim classification. At the time when ET is completed according to the ERA-Interim classifier, about 45% of the cyclones are labeled “tropical storms” (TCs with an intensity of 34–63 kt; 1 kt ≈ 0.51 m s$^{-1}$) in the NHC best track data. “Tropical depressions” (TCs of intensity < 34 kt) and “lows” (lows of any intensity that are neither tropical, subtropical, nor extratropical cyclones) each account for about 20% of the cases (not shown). Thus, the false alarms in ERA-Interim cannot be attributed to a single type of storm. Instead, they are the result of storms that exhibit a persistent cold-core structure in ERA-Interim throughout much of their lifetimes: On average, a cold core is present at 53% of all time steps along the tracks of the ET storms, while the asymmetry threshold is exceeded at only 15% of the time steps. The median CPS trajectory of the false positives (Fig. S1 in the online supplemental material) makes only a brief excursion into the asymmetric range of the B parameter, but is located in the cold-core region from an early point on. Evidence for a bias in ERA-Interim toward cold-core structures in the representation of TCs was also found by Wood and Ritchie (2014) in their study of ET in the ENP.
The chance of a fluctuation into the B > 11 parameter range may be increased because the TCs in the ENP are the smallest of all basins (Knaff et al. 2014); they are about a third smaller than TCs in the NAT or the WNP. For small TCs, the (fixed) radius of 500 km used to calculate the CPS parameters may include less symmetric regions at the outer edge of the storm.
As mentioned in section 2a, JRA-55 uses historical data to produce artificial dropsonde observations in the vicinity of TCs, which are then processed like regular observations (Hatsushika et al. 2006). This is a key difference between JRA-55 and ERA-Interim, which does not apply a special TC treatment in its data assimilation process, and may help to explain the greater strength and higher symmetry of the vortices in the JRA-55 composites. Still, it does not explain why the resulting difference in classification skill is greater in the ENP than in the other basins. However, according to the best track classification, there are only nine ENP ET storms between 1988 and 2017. This small sample makes it difficult to analyze whether and how ET may differ in the ENP compared to other basins; thus, our analysis is limited to studying the character of false positives in the reanalysis datasets.
c. ET time
To analyze the timing of ET, probability density functions (PDFs) of the differences between the best track ET times (as defined by the operational warning centers) and the times of ET completion in the CPS were calculated using a Gaussian kernel density estimation (Fig. 6). These PDFs are based on the set of all ET events that were identified both in the CPS and in the best track archives (i.e., on the set of all true positives). The distributions in the NAT are broader than those in the WNP and the SH, indicating a higher variance in the declared ET times between the CPS and the NHC than between the CPS and either the JMA or the JTWC. In the NAT, ERA-Interim on average declares ET completion 32 h before the NHC assigns the first “extratropical” label. This is consistent with Evans and Hart (2003), who examined the ET time of 38 cyclones in the NAT and found that the time of ET completion diagnosed by the CPS in the ECMWF’s 15-yr Reanalysis (ERA-15; Gibson et al. 1997) occurs on average about 28 h earlier than in the NHC best tracks. In contrast, the mean difference between the ET time in JRA-55 and that of the NHC classification is only 10 h. The JRA-55 ET completion times also agree better with the JMA labels in the WNP than the ERA-Interim completion times do, while the PDFs of the ET time differences to the JTWC labels in the SH are almost identical for the two reanalysis datasets. Based on a t test for the sample mean and an F test for the sample variance, the inter-reanalysis differences in the transition time periods are significant in the NAT and the WNP, but not in the SH.
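For readers unfamiliar with kernel density estimation, the following minimal sketch shows how a PDF of ET time differences can be built from a set of true positives. The time differences listed are hypothetical values in hours, not data from this study, and the sign convention (best track time minus CPS completion time, so that positive values mean the CPS declares ET earlier) as well as the fixed bandwidth are assumptions made for the example; the text does not state how the bandwidth was chosen.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(samples: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a PDF estimate built from Gaussian kernels of fixed bandwidth."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x: float) -> float:
        # Sum of one Gaussian kernel per sample, normalized to integrate to 1.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return pdf


# Hypothetical time differences (h): best track ET time minus CPS completion time.
time_diffs_h = [-30.0, -12.0, -6.0, 0.0, 6.0, 18.0]
pdf = gaussian_kde(time_diffs_h, bandwidth=6.0)

# Evaluate the estimated density on a 6-hourly grid.
for x in range(-48, 49, 6):
    print(x, round(pdf(x), 4))
```

A broader estimated PDF, as found for the NAT, corresponds to a larger spread of the time differences in the sample.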
d. Precision, recall, F1 scores, and Matthews correlation coefficients
Figure 7a shows the F1 scores of the JRA-55 and ERA-Interim classifiers. The CPS classification agrees best with the observations in the WNP and the NAT, with F1 scores of 0.90 and 0.77, respectively, for JRA-55, and 0.86 and 0.76, respectively, for ERA-Interim. As already indicated in Table 2, the classification in the WNP based on the JTWC best tracks receives a lower F1 score than that based on the JMA best tracks. In Part I, it was shown that the JMA best tracks on average extend farther northeast than the JTWC best tracks. Thus, the operational treatment of ET in the JMA and the JTWC as well as the tracks themselves may contribute to the differences in the F1 scores.
Compared to the F1 scores in the NAT and the WNP, the scores in the ENP, the NI and the SH basins are lower for both reanalysis classifiers, but consistently higher for the JRA-55 classifier than for the ERA-Interim classifier.
The decomposition of the F1 scores into precision and recall (Fig. 7b) shows that the F1 scores in the NAT, the WNP, and the SH basins are composed of almost equal values of precision and recall—in other words, the CPS ET classification is equally good at avoiding false positives as at avoiding false negatives. The F1 performance in the NI and the ENP is more asymmetric, with a higher recall than precision. This is likely a result of the scarcity of ET events in these two basins, which makes it difficult to identify the rare true ET storms while avoiding false alarms.
As with the F1 scores, the MCCs (Fig. 8) are highest in the WNP and the NAT, and the MCCs of JRA-55 exceed those of ERA-Interim in all basins. The MCCs are greater than zero in all basins, indicating a better than random correlation with the best track classification (recall that the MCC ranges from −1 to 1), although only by a small margin for the ERA-Interim classifications in the SP and the ENP. In the SP, the MCC is considerably lower than in the other two SH basins, despite similar F1 scores. With that exception, the general pattern of the evaluation is robust with respect to the two performance metrics.
However, it is notable that if we used the proportion of correct classifications, also termed accuracy, as a measure of classification skill, the NI would achieve the highest scores (0.93 in JRA-55 and 0.85 in ERA-Interim), and the average score of the two reanalyses in the ENP would be higher than that in the NAT (0.82 compared to 0.78). These results make it clear that accuracy is a misleading performance metric when the two classes (ET storms and non-ET storms) are of very different sizes. To further illustrate this point, consider a hypothetical basin where only 1% of all storms undergo ET. A “dummy” classifier that, without performing any analysis, assigns each storm to the majority class (here: non-ET storms) would achieve an accuracy of 0.99 despite not having any classification skill.
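The hypothetical basin described above can be worked through numerically. The sketch below assumes 1000 storms, of which 10 undergo ET (these counts are chosen only to match the 1% example), and evaluates a majority-class “dummy” classifier. Setting the F1 score and the MCC to 0 when their denominators vanish is a common convention that we adopt here as an assumption.

```python
import math

# Hypothetical basin: 1% of all storms undergo ET.
n_storms = 1000
n_et = 10
labels = [True] * n_et + [False] * (n_storms - n_et)

# Dummy classifier: every storm is assigned to the majority class (non-ET).
preds = [False] * n_storms

tp = sum(1 for l, p in zip(labels, preds) if l and p)
tn = sum(1 for l, p in zip(labels, preds) if not l and not p)
fp = sum(1 for l, p in zip(labels, preds) if not l and p)
fn = sum(1 for l, p in zip(labels, preds) if l and not p)

accuracy = (tp + tn) / n_storms

# F1 written in terms of counts; equivalent to the harmonic mean of
# precision and recall whenever both are defined.
f1_denom = 2 * tp + fp + fn
f1 = 2 * tp / f1_denom if f1_denom else 0.0

mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

print(accuracy, f1, mcc)
```

The accuracy of 0.99 contrasts with an F1 score and an MCC of 0, which is why the latter two metrics are preferred in basins where ET is rare.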
Table 3 presents the results of the significance test described in section 2d, for the F1 score and the MCC. All differences between the performance scores of the JRA-55 and the ERA-Interim classifications are significant. Repeating the test with different subset sizes (S = 7 years and S = 10 years) did not change the significance of the results. Recall that a high statistical significance does not imply that the performance difference is large, but that a (possibly small) difference in classification skill is consistently present on randomly sampled subsets of storms.
e. Time series of classification skill
In the NAT and the WNP, the high quality of the best track datasets and the frequency of ET motivate a look at how the agreement between the CPS classification and the best track classification has evolved over time. One possible reason for changes in that agreement is a modification of the operational procedures at TC warning centers; for example, since 2005, the NHC has routinely used model-derived CPS parameters in operational forecast discussions.
Figure 9 shows time series of F1 scores and MCCs in these two basins, and Table 4 summarizes some statistics of these time series. In both basins, the slopes of the linear regression lines are positive, but only those in the WNP are statistically significant for both reanalysis classifiers. In the WNP, the MCCs are almost as high as the F1 scores, indicating that the CPS classifiers perform well both in classifying positive samples and in correctly recognizing negative samples.
The correlations between the time series of JRA-55 and ERA-Interim are high and statistically significant (Table 4). Thus, the two classifiers have similar F1 scores and MCCs not only on the set of all storms (Figs. 7 and 8), but also on individual 3-yearly subsets of storms.
The introduction of the CPS as an operational tool at the NHC does not lead to a jump in the F1 scores and MCCs in the NAT, which may reflect the fact that Evans and Hart (2003) originally built the CPS diagnostics of ET on the NHC classifications.
However, the performance of the CPS classifiers has an upward trend in both basins. Two conceivable reasons are that the increasing number of observations assimilated into JRA-55 and ERA-Interim has made the representation of TCs more accurate over time, or that there have been changes in the operational practices and attention dedicated to the ET designation at the warning centers.
4. Discussion
The fact that the JRA-55 classifier agrees better with the observed ETs recorded in the best track datasets than the ERA-Interim classifier is consistent with the study by Murakami (2014), in which JRA-55 comes out ahead in an evaluation of the representation of TCs in six reanalyses. As mentioned in section 3b, the high rate of false positives we found in the ENP is consistent with Wood and Ritchie (2014), who noted in their study of ET in the ENP that ERA-Interim has a bias toward cold-core values in the 900–600-hPa layer compared with both JRA-55 and the final operational global analysis (FNL) data from the Global Forecast System. However, deficiencies in the representation of TCs are by no means limited to ERA-Interim but are a well-known issue of reanalyses (including JRA-55; e.g., Schenkel and Hart 2012; Murakami 2014; Hodges et al. 2017) and climate model output (e.g., Randall et al. 2007; Camargo and Wing 2016) in general.
The most prominent problem associated with TCs in reanalyses is the substantial underestimation of the storm intensities. However, the CPS parameters are based on relative comparisons (layer thickness left and right of the storm for B, and vertical profiles of ΔZ for thermal wind parameters) and do not depend in any direct way on storm intensity. This offers the hope that the threshold parameters used to detect ET may not have to be adjusted to the increasing resolution and stronger intensities of cyclones in future reanalyses.
Of course, the performance evaluation of the CPS classifiers presented in this study hinges on the quality of the best track data, in particular on the labels indicating the tropical or extratropical nature of each cyclone. Even though the best tracks are the most accurate and comprehensive archives of historical TC data available, they are still prone to considerable uncertainty, especially the components that are derived from a forecaster’s subjective judgment (e.g., Landsea and Franklin 2013). In addition, there may be inhomogeneities in the data quality due to agencies putting less effort into the classification of transitioning storms or stopping the tracking earlier in basins where ET storms do not pose a threat to land.
Given these limitations, it is clear that assessing the CPS classifiers against the best track labels cannot in all cases be interpreted as a comparison with the “true” classification. Put simply, when the labels are wrong, high performance scores do not indicate good classification skill, and vice versa. However, the time series of best track ET fractions shown in Part I did not reveal any statistically significant trends at the 0.05 significance level that were robust between the two reanalyses, and neither did time series of the magnitude of the difference between the CPS-based fractions and the best track labels (not shown). Trends were also absent in time series of the annual mean latitude of storm track end points (not shown). Taken together, these results indicate that operational procedures in the tracking and characterization of cyclones have been fairly consistent in the time period 1979–2017, which provides some reassuring evidence that the best track labels can to a reasonable approximation be assumed to represent the “ET truth.” In basins where that assumption is less valid, it still provides a means to examine differences in the ET classifications of the two reanalysis datasets, but there is limited value in interpreting the observed differences in terms of classification skill.
5. Summary and concluding remarks
In this study, we have analyzed the statistical performance of a global classification of tropical cyclones (TCs) that undergo extratropical transition (ET). The classification is used in Part I of this study for an examination of the geographical, seasonal, and temporal characteristics of ET in seven ocean basins. Here, we have investigated how well the ET storms defined in the CPS agree with those defined in the best track records, and how that agreement depends on whether the CPS is calculated from JRA-55 or from ERA-Interim data. At the core of this evaluation is the binary classification into ET storms (TCs that undergo ET at some point in their lifetimes) and non-ET storms (TCs that do not undergo ET) obtained from the CPS analysis using JRA-55 data (the JRA-55 classifier) and ERA-Interim data (the ERA-Interim classifier).
Our results can be summarized as follows:
According to the F1 score and the Matthews correlation coefficient (MCC), two performance metrics that balance classification sensitivity and specificity, the CPS classification agrees best with the best track classification in the western North Pacific (MCC > 0.7) and the North Atlantic (MCC > 0.5).
The correlations between the CPS classification and the best track classification are considerably weaker in the other basins. In the South Pacific and the eastern North Pacific, the MCC of the ERA-Interim classification is only slightly higher than that of a random classification.
The JRA-55 classifier achieves higher performance scores than does the ERA-Interim classifier. The differences are statistically significant in all basins.
The lower performance of ERA-Interim is mainly due to a higher false alarm rate, which is especially pronounced in the eastern North Pacific. The false positives in the eastern North Pacific are the result of a bias toward cold-core structures in the representation of TCs in ERA-Interim.
On average, ET completion in the North Atlantic and the western North Pacific occurs earlier in ERA-Interim than in JRA-55, but almost simultaneously in the Southern Hemisphere.
In the North Atlantic and the western North Pacific, the agreement between the CPS classification and the best track classification (as measured by the MCC and the F1 score) has increased from 1979 to 2017, but only the trend in the western North Pacific is statistically significant for both the JRA-55 and the ERA-Interim classifiers.
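For reference, the two performance metrics named in the summary above can be computed from the four cells of a binary confusion matrix, with the best track label treated as the reference and the CPS classification as the prediction. The following is a minimal sketch using the standard definitions of the F1 score and the MCC. The function names and the counts are hypothetical and are chosen only to show how additional false alarms lower both scores; they are not results from this study.

```python
import math


def f1_score(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Values near 0 correspond to agreement no better than random.
    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for 100 storms; "positive" means an ET storm.
# Case A: few false alarms.
f1_a = f1_score(tp=50, fp=10, fn=10)
mcc_a = mcc(tp=50, fp=10, fn=10, tn=30)

# Case B: same hits and misses, but 20 non-ET storms are now false alarms.
f1_b = f1_score(tp=50, fp=30, fn=10)
mcc_b = mcc(tp=50, fp=30, fn=10, tn=10)
```

With these illustrative counts, the MCC falls much more sharply than the F1 score when false alarms increase, because the MCC also accounts for the correctly identified non-ET storms.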
Our results show that the CPS computed from reanalysis data can be used to provide a globally consistent dataset that, while by no means in perfect agreement with the diagnoses of ET produced by forecasters, is nonetheless close enough (especially in the basins where ET is most common) to be usable for some kinds of climatological studies, as long as the limitations are understood. At the same time, improvement is clearly possible. While we are not certain, it seems plausible that JRA-55 yields higher performance scores than ERA-Interim here because of JRA-55’s special procedures to initialize TCs; this suggests that a better representation of TCs in reanalysis datasets (whether through higher resolution, improved physics, data assimilation, or other TC-specific initialization procedures) might yield further improvements. The CPS itself is also an imperfect measure, and exploration of other objective metrics of ET is warranted, as also suggested by Evans et al. (2017). Since diagnosing ET is in some sense a problem in pattern recognition, machine learning or other advanced statistical approaches might be beneficial; we are exploring a small subset of such methodologies and will report on this in due course.
It is also possible that even the forecaster-generated best track datasets we take here as ground truth are themselves imperfect indicators of ET, and perhaps even that in some cases there might be fundamental scientific uncertainty (i.e., not simply a consequence of inadequate data) as to whether a storm should be considered tropical or extratropical at a given moment, or even whether a binary classification is adequate to describe what might be better thought of as a gradual transition process (Beven 2008, 2012). In cases where different metrics of ET (including CPS from different reanalyses and/or best track datasets) yield strongly different results, in-depth case studies to examine physical mechanisms could be valuable, and could add to our fundamental understanding of the ET process.
The funding for this research was provided by NASA Cooperative Agreement NNX15AJ05A, and by NSF under Grant ATM-1322532. The authors also thank the following organizations for making the data used in this study available: ECMWF (ERA-Interim reanalysis data), JMA (JRA-55 reanalysis data and western North Pacific best track data), NHC (North Atlantic and eastern North Pacific best track data), and JTWC (western North Pacific, North Indian Ocean, and Southern Hemisphere best track data).
Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-18-0052.1.s1.
This article has a companion article which can be found at http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-17-0518.1.
Publisher’s Note: This article was revised on 5 August 2019 in order to correct a typographical error in section 3d.