## Abstract

This paper introduces a logistic regression model for the extratropical transition (ET) of tropical cyclones in the North Atlantic and the western North Pacific, using elastic net regularization to select predictors and estimate coefficients. Predictors are chosen from the 1979–2017 best track and reanalysis datasets, and verification is done against the tropical/extratropical labels in the best track data. In an independent test set, the model skillfully predicts ET at lead times up to 2 days, with latitude and sea surface temperature as its most important predictors. At a lead time of 24 h, it predicts ET with a Matthews correlation coefficient of 0.4 in the North Atlantic, and 0.6 in the western North Pacific. It identifies 80% of storms undergoing ET in the North Atlantic and 92% of those in the western North Pacific. In total, 90% of transition time errors are less than 24 h. Select examples of the model’s performance on individual storms illustrate its strengths and weaknesses. Two versions of the model are presented: an “operational model” that may provide baseline guidance for operational forecasts and a “hazard model” that can be integrated into statistical TC risk models. As instantaneous diagnostics for tropical/extratropical status, both models’ zero lead time predictions perform about as well as the widely used cyclone phase space (CPS) in the western North Pacific and better than the CPS in the North Atlantic, and predict the timings of the transitions better than CPS in both basins.

## 1. Motivation

Tropical cyclones (TCs) moving into midlatitude regions can transform into strong extratropical cyclones, changing their structures and the nature of the hazards to coastal populations and infrastructure. In this transformation, which is called extratropical transition (ET), TCs lose their radial symmetry and become cold-cored systems with fronts. Transitioning cyclones typically grow larger and accelerate their forward motion, producing intense precipitation, strong winds, and large surface water waves (Jones et al. 2003; Evans et al. 2017). In the western North Pacific and the North Atlantic, 40%–50% of all TCs undergo ET (Hart and Evans 2001; Kitabatake 2011), posing a serious threat to coastal regions in East Asia, Japan, northeast North America, and western Europe.

Forecasts of the timing of ET as well as of the track and structural evolution of the transitioning storm are difficult for numerical weather prediction models (e.g., Buizza et al. 2005; Fogarty 2010; Keller et al. 2011; Evans et al. 2017; Keller et al. 2019), as these forecasts depend on the representation of both the TC and the midlatitude circulation into which the TC is moving. Errors in forecasts of ET can be large as a result of nonlinear interactions between errors in the simulation of the TC structure and of the large-scale environment (Chan and Kepert 2010). Dynamical models may be supplemented by statistical models to provide baseline guidance for evaluation or, in situations where dynamical models do not yet produce skillful forecasts, to provide actual operational guidance. For example, the Statistical Hurricane Intensity Prediction Scheme (SHIPS; DeMaria and Kaplan 1994), a simple linear regression scheme, is still one of the National Hurricane Center’s best performing products for the prediction of TC intensity changes (DeMaria et al. 2014). Aberson (2014) developed a climatological “no-skill” baseline model that predicts the “phase” of the storm (i.e., whether it is in a tropical, subtropical, extratropical, dissipated, or “remnant low” stage of its life cycle). The model is a simple linear discriminant analysis scheme whose predictors match those of the National Hurricane Center’s baseline models for track and intensity prediction. Our intent here is broadly similar to that of Aberson (2014), with the difference that the statistical model presented here is optimized for the prediction of the extratropical phase and ET.

Despite its important implications for risk assessment (Loridan et al. 2015), ET has received little attention in the development of TC hazard models for the (re)insurance industry (e.g., AIR Worldwide 2015) or in academia (e.g., Emanuel et al. 2006; Hall and Jewson 2007; Lee et al. 2018). The primary output of a hazard model is the annual probability of a storm intensity exceeding a given threshold at a specific location. Hazard models used in the industry (so-called catastrophe, or “cat” models) additionally incorporate vulnerability curves and insurance exposure data, which allow them to convert hazard intensity into an estimated level of financial loss. Due to the limited observed record, these models often use large ensembles of stochastically generated synthetic storms that resemble those in the historical data. In some, the storms’ track and intensity evolution is guided by a few environmental parameters such as the state of the El Niño–Southern Oscillation, or seasonal or monthly average sea surface temperature. Integrating ET into hazard models thus requires a statistical prediction without real-time information about the environment.

Here, we introduce a statistical model that predicts how likely a TC is to be extratropical, given some information about the TC’s present characteristics and environment. Two versions of the models are presented: an “operational model” and a “hazard model.” The operational model makes forecasts at lead times up to four days, based on predictors that include real-time observations of the storm environment. This operational model is not intended to outperform dynamical models. Rather, its primary purpose is to serve as a “baseline model” that provides a benchmark for the evaluation of dynamical models. In other words, a dynamical model can be said to make skillful predictions if it beats the operational model presented here. There are numerous examples of forecast validations that define skill as an improvement over a benchmark given by a statistical model (e.g., Landsea and Knaff 2000; Vitart et al. 2010; Rashid et al. 2011; Eden et al. 2015). Note, however, that contrary to what its name may suggest, a baseline model can itself have actual skill when compared to a simple reference model such as climatology or persistence, or to a model that makes random predictions (and it does in this case, as will be shown in section 6). Sometimes, a so-called baseline model may even produce more accurate forecasts than a numerical weather prediction model (e.g., Cangialosi and Franklin 2016). The second version, the hazard model, is designed for use in TC risk assessments; it does not make true forecasts at all, but is diagnostic (i.e., it makes predictions at a lead time of 0 h) and works with a monthly mean representation of the environment.

This paper is organized as follows: section 2 introduces the datasets used to train and evaluate the regression models. Section 3 outlines the underlying mathematics and concepts of the models, and section 4 describes the model fitting and evaluation. The evaluation is based on the performance metrics presented in section 5. The results of the performance evaluation as well as some case studies are shown in section 6. Section 7 compares the models to cyclone phase space (CPS) diagnostics (Hart 2003), a well-established objective method to identify storms undergoing ET. Throughout the paper, we focus first on the operational model, and follow each segment with the corresponding analysis of the hazard model. The paper concludes with a summary of the findings (section 8).

## 2. Data: TC best track and reanalysis datasets

The features (predictors) are a combination of storm properties from 6-hourly TC best track data and environmental parameters derived from reanalysis data. All features are scaled to zero mean and unit variance. We use the best track archive of the National Hurricane Center (NHC) in the North Atlantic (NAT), and that of the Japan Meteorological Agency (JMA) in the western North Pacific (WNP). These datasets provide information on the position of the storm center, central pressure, and the storm phase. The phase information is used to generate the predictand, which is a Boolean variable taking the value 1 if a storm is extratropical (EX) and 0 (non-EX) otherwise. Our data consist of 501 storms in the NAT and 966 storms in the WNP in the period 1979–2017.

Features taken from best track datasets are the same for the operational model and the hazard model, but the two models use different environmental features and reanalysis datasets.

### a. Operational model

The 6-hourly environmental fields surrounding the TCs are taken from the Japanese 55-year Reanalysis (JRA-55; Kobayashi et al. 2015), which provides data on a regular latitude–longitude grid at a horizontal resolution of 1.25°. Geopotential height fields from JRA-55 are used to characterize the storms using the CPS (Hart 2003), which captures the physical structure of cyclones in terms of three parameters: the *B* parameter measures the asymmetry in the layer-mean temperature surrounding the cyclone, and two thermal wind (*V*_{T}) parameters assess whether the cyclone has a warm or cold core structure in the upper $(\u2212VTU)$ and lower $(\u2212VTL)$ troposphere (with the convention of the minus sign, positive values correspond to warm cores). A detailed explanation of the CPS can be found in Hart (2003).

Vertical wind shear is computed using area-weighted azimuthal mean Cartesian wind components at 200 and 850 hPa, following the method of Hanley et al. (2001). The azimuthal averaging, performed in five annular rings of 100-km width around the storm center, removes a symmetric vortex. Thus, the resulting areal-mean winds are measures of the environmental flow across the cyclone. The components of the areal-mean winds are given by

where *u* and *υ* are the Cartesian wind components; *A* is the area of a circle of radius 500 km; *A*_{i} is the area of an annular ring with inner radius *r*_{i−1} and outer radius *r*_{i} (*r*_{0} = 0 km, *r*_{1} = 100 km, …, *r*_{5} = 500 km); ⟨⋅⟩ indicates an area average; and the overbar is an azimuthal average. The 200–850-hPa vertical shear (SHR) is then calculated as the magnitude of the vector difference between the area-averaged wind at the two pressure levels:

A further environmental feature is sea surface temperature (SST), which is averaged within a circle of radius 500 km around the storm center.

### b. Hazard model

The hazard model is customized to function as a component in the Columbia Hazard model (CHAZ) developed by Lee et al. (2018) and therefore uses data from the European Centre for Medium-Range Weather Forecasts interim reanalysis (ERA-Interim; Dee et al. 2011) with a 0.75° horizontal grid spacing. For this application, monthly means of potential intensity (PI; Emanuel 1988) and 200–850-hPa vertical wind shear at the ERA-Interim grid point closest to the storm location are linearly interpolated from monthly means to daily values. PI is used instead of SST because CHAZ is designed for TC risk assessment in the present-day as well as in a future climate, and in the latter case, the relationship between SST and TC intensity is expected to change (Wing et al. 2007; Swanson 2008; Johnson and Xie 2010).

## 3. Logistic regression with elastic net

We apply a logistic regression model with elastic net regularization (Zou and Hastie 2005). This approach has the following advantages:

Logistic regression produces probabilistic results and, as a generalized linear model, allows us to understand the impact of a feature on the predictand.

The elastic net performs both regularization and feature selection, which enhances the prediction accuracy and interpretability of the model. It is often used when the number of features is large, perhaps even larger than the number of observations (e.g., Sokolov et al. 2016; Laing et al. 2018; Liu et al. 2018). In the application presented here, elastic net regularization was chosen out of preference for a simple model with as few predictors as possible, even though the initial number of features was not particularly large.

The elastic net encourages a “grouping effect” (Zou and Hastie 2005), which is the property that highly correlated predictors tend to get equal coefficients (up to a change of sign if negatively correlated). This is an advantage over least absolute shrinkage and selection operator (LASSO) regularization (Tibshirani 1996), which also performs variable selection, but does not exhibit a grouping effect: it tends to pick only one of a set of correlated features, such that the resulting selection of features may not include all features that provide predictive information.

At the core of logistic regression is the logistic function *h*_{θ}(**x**), which maps predicted values to probabilities:

Given a feature vector **x** at time *t*, *h*_{θ}(**x**) is interpreted as $P^\u2061(EX|x)$ (i.e., as an estimate of the probability that the system will be extratropical at time *t* + Δ*t*, where Δ*t* is the lead time considered). If desired, this probability can be converted into a deterministic binary prediction by applying a decision threshold. In general, the optimal value of the decision threshold may be problem dependent. In this study, however, it was set to a fixed value of 0.5, as treating the threshold as a model parameter that is determined separately for each lead time only slightly improved model performance. Thus, the class is determined according to

The vector ** θ** contains the coefficients (weights), which are found by minimizing a cost function

*J*(

**) defined as**

*θ*where *N* is the number of samples, **x**^{(i)} is the feature vector of the *i*th sample, *y*^{(i)} is the *i*th predictand (1 for EX, 0 non-EX), and

The second term on the right-hand side of (3), *λP*_{α}, is the elastic net term, which regularizes the model by penalizing a linear combination of the sum of the L2 norm and the L1 norm of each coefficient. The parameter *α* determines the relative contributions of the two norms. For *α* = 0, the elastic net penalty is equivalent to the ridge penalty (Hoerl and Kennard 1970), and for *α* = 1, it is equivalent to the LASSO penalty (Tibshirani 1996). The parameter *λ* controls the strength of the regularization. The parameters *α* and *λ* are the so-called hyperparameters of the model. Their values are determined in a grid search over a range of possible combinations, each of which is evaluated in a 10-fold cross validation, as described in the next section.

## 4. Model fitting and evaluation

### a. Operational model

The model is fit on a training set containing about 85% of the data, while the remaining 15% constitute the test set used to evaluate the model. Feature selection serves the purpose of producing a sparse model with better physical interpretability, and it is done here in the following way: On the training set, we perform cross-validated logistic regression with elastic net regularization for a range of different regularization parameters. As the regularization gets stronger, more and more features drop out (i.e., their coefficients shrink to zero). So-called “regularization paths,” plots of the feature coefficients as a function of the regularization strength, are shown in Fig. S1 in the online supplemental material. The higher the regularization strength required to drive a feature’s coefficient to zero, the more predictive skill that feature contributes (e.g., the most important feature in both ocean basins is sea surface temperature, which is the last one to drop out as the regularization strength increases). The combination of these regularization paths with the corresponding sequence of cross-validation performance scores (black dashed line in Fig. S1) describes a trade-off between model sparsity and performance. Determining the optimal set of features within this trade-off is partly a matter of preference for certain model properties (e.g., one may be willing to sacrifice some predictive skill for the benefit of having fewer predictors) and may depend on the particular application.

Based on the regularization paths in the NAT and the WNP, we determine a subset of 8 (out of 20) features that will be used for both basins. This feature selection is done once, for a lead time of 24 h. Model fits are then calculated using the training data for lead times from 0 to 96 h, in 6-hourly time steps, using only those eight features. Each model fit is the result of a grid search over a range of possible values for the hyperparameters *α* (the relative balance between the L1 and the L2 penalty) and *λ* (the overall regularization strength), where each combination of *α* and *λ* is evaluated in a 10-fold cross validation. In a 10-fold cross validation, the training data are divided into 10 folds. Out of the 10 folds, nine sets are used for training while the remaining set is used for testing. The model is trained and tested 10 times, each time a new set is used as testing set while the remaining sets are used for training. The final result of the 10-fold cross-validation is then obtained by averaging the results of all 10 folds.

We apply this procedure separately for the WNP and the NAT data, which gives rise to two sets of feature weights for each lead time. To estimate confidence intervals for the feature weights, we apply a bootstrap procedure, in which we resample the training set 1000 times (with replacement), fit a model to each of these training sets, and collect the feature weights of each bootstrap replicate.

The evaluation of the model performance on the independent test set addresses three questions: how well does the model predict the future status (EX or not EX) of a storm, how well does it predict transitions (i.e., changes in status), and how well does it classify TCs into “ET storms” (storms that undergo ET at some point in their lifetimes) and “non-ET storms” (storms that do not undergo ET). Predicting the status is what the model learns in the training phase; evaluating the performance on this task in the test set is thus straightforward. The model’s ability to forecast transitions is evaluated by sliding a time window of 24 h along the predicted and the true time series of a storm’s “extratropical” status, computing the difference in the storm’s phase occurring between the last and the first time step of the time window. To account for the rare cases of tropical transitions (which would result in values of −1), the absolute value of the differences is taken. The sliding-window transformation results in time series that are 1 at time steps within 24 h of a status change, and 0 otherwise. For example, considering a lead time of 6 h, the transformed predictand is 1 if a phase transition occurs between 6 and 30 h, and 0 if no phase transition occurs within that time period.

We evaluate the model performance for 6-hourly lead times up to 96 h. To estimate the sensitivity of the performance to different random splits into training and test set, we resample the data to generate 50 bootstrap replicates for each lead time.

Finally, the classification into ET storms and non-ET storms is obtained from the model’s predictions at individual time steps. If, at any point in a TC’s lifetime, the model’s prediction switches from 0 (not extratropical) to 1 (extratropical), a TC is classified as an ET storm, and otherwise it is classified as a non-ET storm.

### b. Hazard model

The hazard model is only fit for a lead time of 0 h, since forecasting a storm’s future phase is not necessary for the application in TC risk models. All features are variables that are already used in CHAZ, and no feature selection is performed. Due to the model’s diagnostic nature, evaluating the prediction of status changes (in the way described in the previous section) is not possible as changes can only occur over nonzero time intervals.

Table 1 summarizes the features used for the operational model and the hazard model, and Table S1 lists all 20 features from which the features for the operational model were selected.

## 5. Performance metrics

The Matthews correlation coefficient (MCC; Matthews 1975) is regarded as a balanced skill metric for binary classification tasks, which can be used even if the two classes are of very different sizes. It is defined as

where TP, TN, FP, and FN are the number of true positives, true negatives, false positives, and false negatives, respectively. The MCC takes on values between −1 and 1. A coefficient of 1 represents a perfect prediction, 0 is equivalent to a random prediction (e.g., as would be output by a classifier that generates predictions uniformly at random, or by a classifier that makes random predictions respecting the training set’s class distribution), and −1 indicates total disagreement between prediction and observation.

The Brier score loss (BSL; Brier 1950) is the mean squared difference between forecast probabilities and actual outcomes:

Where *N* is the number of samples, *h*_{θ}(**x**^{(i)}) is the model’s prediction (a probability between 0 and 1) on sample **x**^{(i)}, and *y*^{(i)} is the true outcome (either 0 or 1). The BSL ranges between 0 and 1, with smaller values indicating better predictions.

Two further metrics used in this study are precision and recall (e.g., Ting 2010), which are defined as follows:

Precision is the ratio of correctly classified positive observations (here ET storms) to the total observations classified as positive. Applied to the classification into ET storms and non-ET storms explained in the previous section, it answers the question, “Of all storms the model declares to have undergone ET, what fraction actually did?” Recall is the ratio of correctly classified positive observations to all observations in the actual class—the corresponding question is “Of all true ET storms, what fraction does the model identify?”

The harmonic mean of precision and recall is called the F1 score (Chinchor 1992):

Precision, recall, and F1 score reach their best value at 1 and worst value at 0.

## 6. Results

### a. Operational model

#### 1) Feature weights

Figure 1a shows coefficients of the operational model for a lead time of 24 h. In both ocean basins, increasing latitude and decreasing SST cause the largest increases in a cyclone’s odds of being extratropical at time *t* + 24 h. (Note that the coefficients of a logistic regression model describe the effect of a feature on the predictand in terms of odds ratios—a brief guide on how to interpret logistic regression coefficients is given in the appendix.) The influence of the SST is smaller in the NAT than in the WNP, possibly because TCs in the NAT experience a smaller spatial variability in SST as they tend to travel along mean SST gradients rather than across them (Foltz et al. 2018).

During ET, a storm loses its upper-level warm core and radial symmetry and becomes a cold-cored system with fronts, which is consistent with the signs of the $\u2212VTU$ and *B* coefficients. Positive anomalies in translational speed and vertical wind shear, indicating a TC’s acceleration by midlatitude westerlies and interaction with a baroclinic environment, are associated with enhanced odds of being in the extratropical phase as well.

The heading angle *H*_{a} has a bimodal distribution in which negative anomalies correspond to directions ranging from north to northeastward. These directions are associated with TCs that recurve from their westward-moving tracks in the tropics into eastward ones in higher latitudes where they are more likely to undergo ET.

#### 2) Performance evaluation

Figure 2 shows the MCCs and BSLs for lead times from 0 to 96 h. For the status prediction (i.e., whether a storm will be extratropical or tropical), the model is most accurate for lead times up to one day, with maximum MCCs of about 0.9 in the WNP and 0.7 in the NAT (Fig. 2a). Note that the prediction at a lead time of 0 h is not trivial (and doesn’t necessarily have to achieve the highest score), since all predictions are made without knowledge of the storm’s current phase, and the feature selection has been made based on a lead time of 24 h. The scores in the two basins decrease at similar rates, dropping by about 0.3 between the 24- and 96-h forecasts. The performance difference between the two basins is confirmed by the lower (i.e., better) BSLs of the WNP model. It is noteworthy that the WNP model outperforms the NAT model even when they are trained on equal-sized training sets (recall that the two datasets differ in size by a factor of almost two due to the higher TC activity in the WNP).

The MCCs for the prediction of ET (computed on the time series resulting from applying a sliding time window of 24 h; Fig. 2c) are lower than those for the status prediction, which reflects the difficulty of predicting the precise timing of transitions. At a lead time of 24 h, the model predicts phase changes with an MCC of 0.4 in the NAT, and 0.6 in the WNP. Recall that a model that makes random predictions has an MCC ≈ 0 (e.g., Boughorbel et al. 2017), and that an MCC > 0 thus means that a model has skill when compared to random predictions. In one-sample Wilcoxon signed rank tests (one test for each lead time), the sample mean MCC for the prediction of ET is significantly greater than zero for all lead times considered. However, even if the sample mean remains statistically distinguishable from zero for all lead times, Fig. 2c shows that the MCCs in both basins decrease rapidly from the 6-h lead time to the 48-h lead time forecast, and not much predictive skill is retained beyond a lead time of about three days.

The performance scores presented thus far are calculated on the basis of the model’s predictions at individual time steps. These predictions can be used to assign each storm to one of two classes: ET storms, which undergo ET at some point in their lifetimes, and non-ET storms, which do not undergo ET. The confusion matrices (2 × 2 contingency tables) in Fig. 3a summarize the performance of the resulting classification. In the NAT, the model classifies 36 of the test set storms as ET storms, 8 of which are false positives (precision = 0.78), and it correctly identifies 28 of the 35 true ET storms (recall = 0.80). In the WNP, 5 of the 63 storms classified as ET storms are false positives (precision = 0.92), and 58 of the 63 true ET storms are identified (recall = 0.92).

Distributions of the time difference between the predicted and actual occurrence of ET (Fig. 4a) are obtained from the true positive ET storms of the test set. For the calculation of these time differences, the forecast TC was considered to undergo ET at the first 24-h forecast with $P^\u2061(EX|x)>0.5$, regardless of whether the forecast decreased in later forecasts. In both ocean basins, the median time difference is 6 h, and the difference between the predicted and the true ET time is less than 24 h in 93% of the cases in the NAT, and in 97% of the cases in the WNP.

#### 3) Examples

Figure 5 shows the model’s 24-h forecast for six storms in the test set (i.e., storms not used in the model development), whose meteorological histories we will outline in the following using information from the TC reports provided by the NHC (available at http://www.nhc.noaa.gov/data/tcr) and the JMA (available at http://www.jma.go.jp/jma/jma-eng/jma-center/rsmc-hp-pub-eg/annualreport.html). The examples are chosen to illustrate different phase evolutions and degrees of model performance.

Karl (2004; Fig. 5a) was a Cape Verde hurricane that moved northwestward across the Atlantic before recurving due to a baroclinic trough developing north of the storm. Continuing northeastward, Karl underwent ET and eventually made landfall in Norway as an extratropical storm. Karl’s predicted probability of being extratropical (yellow line) increases from nearly zero to one over the course of about 30 h, exceeding the threshold of 0.5 exactly 24 h before the NHC declares the system to be extratropical (blue line) (i.e., when plotted as a function of the verification date, the two lines cross the threshold of 0.5 at the same time). Thus, the model accurately forecasts Karl’s phase evolution and transition time.

Similarly, the model captures the ET of Halong (2014; Fig. 5b), which took place over the Sea of Japan. Formed about 450 km east of Guam, Halong had rapidly intensified into a category 5 supertyphoon, but weakened to tropical storm intensity by the time it made landfall as an extratropical system over the southern part of Japan.

Kit (1981; Fig. 5c) was a slow moving system that tracked to the northwest, passing just south of Guam, and then continued westward. Kit remained tropical throughout its lifetime, and the model indeed predicts the storm to remain tropical with almost 100% certainty.

Ivan (2004; Fig. 5d) is a rare example of a TC that undergoes ET and then undergoes “tropical transition” (e.g., Davis and Bosart 2004; Pezza and Simmonds 2008) back into a tropical system. Ivan first made landfall in Alabama, traveled through the eastern United States, looped back around the Atlantic Ocean and re-entered the Gulf of Mexico. Though not a perfect match, the model predicts the transition to the extratropical phase as well as the return to the tropical phase.

Helene (2006; Fig. 5e) formed from a tropical wave off the coast of Africa and gradually strengthened on its way northwest. Recurving along the northwest periphery of the subtropical ridge, Helene weakened and turned into a hybrid storm with both tropical and extratopical characteristics, featuring a deep warm core and an asymmetric, frontal-like appearance. However, the storm retained hurricane strength until becoming extratropical about 500 km northwest of the Azores. The model predicts Helene to become extratropical, but the transition is delayed and nonmonotonic. Consistent with the NHC’s report, the $\u2212VTU$ feature (not shown) indicates a persistent warm core in the upper troposphere that intensifies during the day before Helene is declared extratropical, delaying ET in the model prediction.

After its formation in the eastern tropical Atlantic, Marilyn (1995; Fig. 5f) gradually intensified on its northwestward track, peaking as a category 3 hurricane shortly after hitting the Lesser Antilles at category 1 strength. Heading north past Bermuda, Marilyn weakened and became extratropical. The model forecasts Marilyn to become increasingly extratropical over the three days before the NHC changes the status of the system, but does not predict a complete ET. Instead, the output probability drops back to nearly zero. The model’s failure is the result of Marilyn abruptly turning south and then southeast after its ET, slowing down and meandering over an area with higher SST for the remainder of its lifetime.

The model’s forecasts for these six storms are summarized in Fig. 6 as a function of lead time and verification date, in so-called chiclet diagrams (Carbin et al. 2016). The true phase evolution (as given by the best track datasets of the NHC and the JMA) is shown at the bottom of each panel. For any combination of lead time and verification date, the color shows the probability that the storm is extratropical at the verification date, as predicted by the model at the given lead time. For example, the chiclet diagram of Karl (2004) shows that the white color indicating the switch from tropical to extratropical is aligned with the actual transition up to a lead time of about 48 h. At longer lead times, the model still predicts an ET, but with increasing delay (as indicated by the positive slope of the line connecting the white blocks). Halong’s (2014) ET is predictable on longer time scales: the white blocks remain vertically aligned up to a lead time of about three days, even though the transition becomes less “sharp.” In contrast to Karl, the prediction of Halong’s ET shifts to earlier times for longer forecast horizons.

Kit’s (1981) tropical nature is highly predictable at all lead times considered here. Though present even in the 96-h forecast, the “double transition” of Ivan (2004) occurs with an increasing lag for longer lead times. For Helene (2006), the model’s skill is almost independent of lead time, while Marilyn’s (1995) phase evolution is, in fact, more accurately predicted at longer lead times, which show a longer extratropical phase (but still an incorrect reversal to a tropical status).

### b. Hazard model

#### 1) Feature weights

In the hazard model, storm latitude and central pressure are the most important features for determining if a storm is in the extratropical phase (Fig. 1b). Changes over the previous 12 h contribute less to the prediction than do the current values of the features. Central pressure is given more weight in the hazard model than in the 24-h forecast of the operational model (Fig. 1a), especially in the NAT. The importance of central pressure depends strongly on the lead time considered: when the operational model is trained for the same diagnostic (0-h lead time) task the hazard model is trained for, central pressure is the most important feature in the WNP, and also contributes predictive skill in the NAT (not shown). Over a 24-h time period, central pressure may vary too strongly to provide much predictive skill (e.g., in rapidly intensifying TCs, the central pressure falls by more than 40 hPa over the course of one day) (Holliday and Thompson 1979).

#### 2) Performance evaluation

The hazard model’s performance scores for the prediction of the extratropical status in the NAT (MCC = 0.62, BSL = 0.05) and the WNP (MCC = 0.86, BSL = 0.02) are similar to those achieved by the operational model at short lead times.

The confusion matrices in Fig. 3b show that in the NAT, the model classifies 33 of the test set storms as ET storms, 5 of which are false positives (precision = 0.85), and it correctly identifies 28 of the 35 true ET storms (recall = 0.80). In the WNP, 9 of the 68 storms classified as ET storms are false positives (precision = 0.87), and 59 of the 63 true ET storms are identified (recall = 0.94). The model diagnoses 86% of all ET times in the NAT and 95% of those in the WNP with an error of less than 24 h (Fig. 4b).

Figure S2 shows the output of the hazard model for the six storms discussed for the operational model (Fig. 5). Similar to the operational model, the hazard model has difficulties diagnosing the phase evolutions of Ivan, Helene, and Marilyn, but captures the ETs of Karl and Halong as well as the purely tropical life cycle of Kit with high precision.

## 7. Comparison with CPS diagnostics

As mentioned in section 2, the CPS proposed by Hart (2003) has become a widely accepted diagnostic tool for defining ET, both in research and operational communities (Evans et al. 2017). However, the CPS is unable to distinguish between recurving TCs that undergo ET, and TCs that recurve without undergoing ET (Kofron et al. 2010). Further, the ET classification obtained from a CPS analysis depends on the dataset used to calculate the CPS parameters and is sensitive to choices of parameter thresholds (e.g., Hart 2003; Wood and Ritchie 2014; Bieli et al. 2019a,b). Evans et al. (2017) therefore recommend the continued evaluation of other ET classifiers. Note that an alternative model design could use CPS diagnostics (instead of the best track labels) to generate the predictand. Such a model was developed and tested, but, due to the reasons outlined above and the more reliable quality of the best track data in the NAT and the WNP since beginning of the satellite era (e.g., Holland and Webster 2007), it was not included in this paper.

Here, we compare the performance of the logistic regression model introduced in this paper to that of the CPS analysis in Bieli et al. (2019b, hereafter BCS19), who examine how well ET storms defined in the CPS agree with those defined in the best track records, using a global set of TCs from 1979 to 2017. Some of the statistics discussed here are not shown explicitly in BCS19, but were reproduced here by the authors for the purpose of this comparison. In BCS19, ET onset is defined as the first time a TC is either asymmetric (*B* > 11) or has a cold core ($\u2212VTU<0$ and $\u2212VTL<0$), and ET completion is when the second criterion is met. Two reanalysis datasets, JRA-55 and ERA-Interim, were used to calculate the CPS. To minimize false alarms due to thermal asymmetry in the early stage of a TC’s lifetime, ET onset was only declared if a storm had a wind speed of at least 33 kt (1 kt ≈ 0.51 m s^{−1}). The CPS analysis thus differs from the logistic regression model in that its definition of ET is not based on a statistical regression but on a combination of threshold criteria for the onset and completion of ET. Based on these criteria, BCS19 classify each TC as an ET storm or as a non-ET storm, and the resulting CPS classification is evaluated against the best track extratropical labels using the MCC and the F1 score. Table 2 compares their results to the test set performance scores obtained for the operational model in “diagnostic mode” (i.e., at a lead time of 0 h) and for the hazard model. In the NAT, both logistic regression models achieve higher F1 scores and MCCs than do the CPS diagnostics. In the WNP, the classifications from the logistic regression models and the JRA-55-based CPS diagnostics have almost equal scores, while those of the ERA-Interim-based CPS diagnostics are a bit lower.

Figure 7 compares the timing errors made with respect to the true (i.e., best track) ET time: In BCS19, the mean difference between the time of ET completion in the CPS and the true ET time for storms in the NAT is 10 h using JRA-55 and 32 h using ERA-Interim. The operational model and the hazard model have mean timing differences of 1 and 4 h, respectively. In the WNP, the CPS on average misses the true ET time by 5 h using JRA-55, and by 19 h using ERA-Interim, compared to time differences of 2 h for the operational model and 7 h for the hazard model. In both basins, the CPS timing errors have a higher variance. Thus, the logistic regression models diagnose the timing of ET with less bias and a lower variance than do the CPS diagnostics.

A separate analysis to evaluate the models’ ability to distinguish recurving TCs that undergo ET from TCs that recurve without undergoing ET was not performed. When a storm moves into higher latitudes along a recurving track, the associated changes in heading angle and latitude increase its probability of being classified as extratropical. On the other hand, if the storm stays in a tropically favorable local environment characterized by high values of SST or PI, the model may still classify the storm as tropical. Thus, identifying recurving TCs that do not undergo ET is possible for the model in principle, but the performance would likely be lower than on the set of all (recurving and nonrecurving) storms.

## 8. Summary

We have introduced a logistic regression model to predict the extratropical transition (ET) of tropical cyclones (TCs), using predictors from best track and reanalysis datasets. Two versions of the model are presented: a baseline “operational model” that uses observed properties of the storm and its environment to forecast the storm’s phase (extratropical or not extratropical), and a “hazard model” that diagnoses the phase without access to a representation of the storm’s environment. The operational model was developed with a focus on predictive performance as well as physical interpretability and thus resides at the interface between machine learning and traditional statistics. It may be applied to provide baseline guidance for operational forecasts. The hazard model was developed with the purpose of integrating ET into statistical TC risk models used for hazard assessment.

Our findings can be summarized as follows:

The operational model has skill in forecasting ET at lead times up to two days. At a lead time of 24 h, the Matthews correlation coefficient for the prediction of phase changes is 0.6 in the western North Pacific, and 0.4 in the North Atlantic.

The model can be used to classify each TC of the test set as an ET storm (if it undergoes ET at some point in its lifetime) or as a non-ET storm (if it does not undergo ET). Table 3 shows precision, recall, and the percentage of all true positive storms whose ET occurs within 24 h of the model’s predicted ET time. Both the operational and the hazard model correctly identify 80% of true ET events in the North Atlantic and more than 90% of true ET events in the western North Pacific.

For the 24-h lead time forecast, the most important predictors of the operational model are latitude and sea surface temperature. In the hazard model, the largest contributions come from latitude and storm central pressure.

Using six storms from the test set, we illustrate that the model can predict the phase evolution of storms that undergo ET as well as of storms that remain tropical throughout their lifetimes, and that it also allows for tropical transitions (phase changes from extratropical to tropical).

Used as an instantaneous diagnostic of a storm’s tropical/extratropical status, the model performs about as well as the widely used cyclone phase space (CPS) in the western North Pacific and better than the CPS in the North Atlantic, and predicts the timings of the transitions better than CPS in both basins.

A next step will be to test the effectiveness of the operational model in real-time applications. In principle, the model is portable to any dataset that provides the environmental features used for model input. However, applying the model to a different dataset will require to refit the model, as the optimal coefficients likely depend on the dataset. Future work may also use the operational model as a benchmark for the validation of numerical weather prediction models’ forecasts of ET. Regarding the model’s integration into a TC risk assessment system, we hope that the ability to identify storms undergoing ET can provide a starting point for modeling the associated changes in wind or precipitation fields, which determine the risk of TC damage.

## Acknowledgments

The first author gratefully acknowledges support from the NASA Modeling, Analysis, and Prediction (MAP) program, which provided the funding for this research under NASA Cooperative Agreement NNX15AJ05A. S.J.C. and A.H.S. acknowledge partial support from NOAA MAPP Grant NA16OAR4310079. The authors thank the three anonymous reviewers; their thoughtful comments helped improve the paper.

### APPENDIX

#### Interpreting the Coefficients of a Logistic Regression

There is an important distinction between linear and logistic regression in the interpretation of the regression coefficients: In linear regression, a coefficient *θ*_{i} means that raising *x*_{i} by 1 causes an increase of *θ*_{i} in the expected value of the predictand *y*. In logistic regression, a coefficient *θ*_{i} means that raising *x*_{i} by 1 causes a multiplicative increase of $e\theta i$ in the odds that *y* occurs. In the following, we present a brief mathematical derivation of this result. A comprehensive treatment is given, for example, in Hosmer et al. (2013).

First, recall that in logistic regression, a given feature vector **x** is transformed by the logistic sigmoid function *h*_{θ}(**x**) to give an estimate of the probability $P^$ that **x** is in class 1:

Letting *z* = *θ*^{T}**x**, $P^$ can be viewed as a function of *z*. It can easily be verified that the inverse of $P^$ is the logarithm of the odds, also known as the logit function:

The odds are a monotonic transformation of probabilities *p* and describe the chance of an event (here: “**x** is in class 1”) as the ratio of the probability that the event does occur to the probability that it does not occur.

Exponentiating (A2) yields

Thus, the coefficients ** θ** determine the odds of class 1 for a given

**x**. To express the effect of an individual coefficient

*θ*

_{i}, suppose we increase feature

*x*

_{i}by one unit, holding all other features constant, and consider the odds ratio of

*x*, OR(

*x*

_{i}):

Thus, each exponentiated coefficient $e\theta i$ is the expected multiplicative change in the odds of class 1 for a unit increase in the corresponding predictor *x*_{i}, holding the other predictors constant. Note that this represents a *constant* effect that is independent of **x**. While the odds resulting *from a given***x** in (A3) can be converted to a probability, it is not possible to express the constant effect of logistic regression coefficients in terms of probabilities.

## REFERENCES

*Wea. Forecasting*

*J. Climate*

*J. Climate*

*PLOS ONE*

*Mon. Wea. Rev.*

*Mon. Wea. Rev.*

*Bull. Amer. Meteor. Soc.*

*Global Perspectives on Tropical Cyclones: From Science to Mitigation*

*Proc. Fourth Conf. on Message Understanding*, McLean, VA, Association for Computational Linguistics, 22–29

*Bull. Amer. Meteor. Soc.*

*Quart. J. Roy. Meteor. Soc.*

*Wea. Forecasting*

*Bull. Amer. Meteor. Soc.*

*Geosci. Model Dev.*

*J. Atmos. Sci.*

*Bull. Amer. Meteor. Soc.*

*Mon. Wea. Rev.*

*Proc. Seventh Int. Workshop on Tropical Cyclones*, St. Gilles Les Bains, La Reunion, France, World Meteorological Organization, 2.5

*Mon. Wea. Rev.*

*Tellus*

*Mon. Wea. Rev.*

*Mon. Wea. Rev.*

*J. Climate*

*Technometrics*

*Philos. Trans. Roy. Soc. London*

*Mon. Wea. Rev.*

*Applied Logistic Regression*

*Nat. Geosci.*

*Wea. Forecasting*

*Geophys. Res. Lett.*

*Mon. Wea. Rev.*

*J. Meteor. Soc. Japan*

*J. Meteor. Soc. Japan*

*Mon. Wea. Rev.*

*Sleep*

*Bull. Amer. Meteor. Soc.*

*J. Adv. Model. Earth Syst.*

*BMC Bioinf.*

*J. Appl. Meteor. Climatol.*

*Biochim. Biophys. Acta*

*Ann. N.Y. Acad. Sci.*

*Climate Dyn.*

*PLOS Comput. Biol.*

*Geochem. Geophys.*

*J. Roy. Stat. Soc.*

*Encyclopedia of Machine Learning*, C. Sammut and G. I. Webb, Eds., Springer, 781 pp

*Mon. Wea. Rev.*

*Geophys. Res. Lett.*

*J. Climate*

*J. Roy. Stat. Soc.*