1. Introduction
A tropical cyclone (TC) is a low-pressure system that originates over tropical or subtropical waters and develops by drawing energy from the sea. It is characterized by a warm core, organized deep convection, and a closed surface wind circulation about a well-defined center. Every year, tropical cyclones cause hundreds of deaths and billions of dollars of damage to households and businesses (Grinsted et al. 2019). Therefore, producing accurate predictions of TC track and intensity with sufficient lead time is critical for undertaking life-saving measures.
The forecasting task encompasses the track, intensity, size, and structure of TCs, as well as associated storm surges, rainfall, and tornadoes. Most forecasting models focus on producing track (trajectory) forecasts and intensity forecasts, i.e., intensity measures such as the maximum sustained wind speed in a particular time interval. Current operational TC forecasts can be classified into dynamical models, statistical models, and statistical–dynamical models (Cangialosi 2020). Dynamical models, also known as numerical models, utilize powerful supercomputers to simulate the evolution of atmospheric fields using dynamical and thermodynamical equations (Biswas et al. 2018; ECMWF 2019). Statistical models approximate historical relationships between storm behavior and storm-specific features and, in general, do not explicitly consider the physical process (Aberson 1998; Knaff et al. 2003). Statistical–dynamical models use statistical techniques but further include atmospheric variables provided by dynamical models (DeMaria et al. 2005). In addition, ensemble models combine the forecasts made by multiple runs of a single model (Cangialosi 2020), and consensus models typically combine individual operational forecasts with a simple or weighted average (Sampson et al. 2008; Simon et al. 2018; Cangialosi 2020; Cangialosi et al. 2020).
In addition, recent developments in deep learning (DL) enable machine learning (ML) models to employ multiple data processing techniques to process and combine information from a wide range of sources and create sophisticated architectures to model spatial–temporal relationships. Several studies have demonstrated the use of recurrent neural networks (RNNs) to predict TC trajectory based on historical data (Moradi Kordmahalleh et al. 2016; Gao et al. 2018; Alemany et al. 2019). Convolutional neural networks (CNNs) have also been applied to process reanalysis data and satellite data for track forecasting (Mudigonda et al. 2017; Lian et al. 2020; Giffard-Roisin et al. 2020) and storm intensification forecasting (Chen et al. 2019; Su et al. 2020).
This paper introduces a machine learning framework called Hurricast (HUML) for both intensity and track forecasting by combining several data sources using deep learning architectures and gradient-boosted trees.
Our contributions are threefold:
- We present novel multimodal¹ machine learning techniques for TC intensity and track predictions by combining distinct forecasting methodologies to utilize multiple individual data sources. Our Hurricast framework employs XGBoost models to make predictions using statistical features based on historical data and spatial–temporal features extracted with deep learning encoder–decoder architectures from atmospheric reanalysis maps.
- Evaluating in the North Atlantic and eastern Pacific basins, we demonstrate that our machine learning models produce comparable results to currently operational models for 24-h lead time for both intensity and track forecasting tasks.
- Based on our testing, adding one machine learning model as a member to a consensus model can improve the performance, suggesting the potential for incorporating machine learning approaches for hurricane forecasting.
The paper is structured as follows: section 2 describes the data used in the scope of this study; section 3 explains the operational principles underlying our machine learning models; section 4 describes the experiments conducted; and section 5 deals with conclusions from the results and validates the effectiveness of our framework. Finally, section 6 discusses limitations and future work needed for the potential operational deployment of such ML approaches.
2. Data
In this study, we employed three kinds of data dating from 1980: historical storm data, reanalysis maps, and operational forecast data. We used all storms from the seven TC basins since 1980 that reach 34 kt (1 kt ≈ 0.51 m s−1) maximum intensity at some time, i.e., are classified at least as a tropical storm, and for which more than 60 h of data are available after they first reached 34 kt. Table 1 summarizes the distribution of TCs in each basin included in our data.
Table 1. Number of TCs satisfying our selection criteria from the dataset. We list the number of TCs for each basin and storm category: from tropical storm (TS) to hurricanes of category 1–5. We also report the total number of 3-h interval cases we used from each basin.
a. Historical storm dataset
We obtained historical storm data from the National Oceanic and Atmospheric Administration through the postseason storm analysis dataset IBTrACS (Knapp et al. 2010). Among the available features, we selected time, latitude, longitude, and minimum pressure at the center of the TC, distance-to-land, translation speed of the TC, direction of the TC, TC type (disturbance, tropical, extratropical, etc.), basin (North Atlantic, eastern Pacific, western Pacific, etc.), and maximum sustained wind speed from the WMO agency (or from the regional agency when not available). Our overall feature choice is consistent with previous statistical forecasting approaches (DeMaria and Kaplan 1994; DeMaria et al. 2005; Giffard-Roisin et al. 2020). In this paper, we will refer to this data as statistical data (see Table 2).
Table 2. List of features included in our statistical data.
The IBTrACS dataset interpolates some features to a 3-h frequency from the original 6-h recording frequency. It provides a spline interpolation of the position features (e.g., latitude and longitude) and a linear interpolation of the features not related to position (wind speed, pressure reported by regional agencies). However, the WMO wind speed and pressure were not interpolated by IBTrACS and we interpolated them linearly to match the 3-h frequency.
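The linear interpolation of the WMO wind speed and pressure from a 6-h to a 3-h frequency can be sketched as follows. This is only an illustration, assuming NumPy is available; the wind values and variable names are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical 6-hourly WMO wind reports (kt) for one storm.
hours_6h = np.array([0, 6, 12, 18])
wind_6h = np.array([35.0, 45.0, 50.0, 60.0])

# Target 3-hourly time axis covering the same period.
hours_3h = np.arange(0, 19, 3)

# np.interp performs piecewise-linear interpolation between the known reports.
wind_3h = np.interp(hours_3h, hours_6h, wind_6h)
```

Every second value of `wind_3h` reproduces an original 6-h report, and the values in between are midpoints of the neighboring reports.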
We processed statistical data through several steps before inputting it into machine learning models. First, we treated the categorical features using the one-hot encoding technique: for a specific categorical feature, we converted each possible category into an additional binary feature, with 1 indicating the sample belongs to this category and 0 otherwise. We encoded the basin and the nature of the TC as one-hot features. Second, we encoded cyclical features using cosine and sine transformations to avoid singularities at endpoints. Features processed using this smoothing technique include date, latitude, longitude, and storm direction.²
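The two encodings described above can be sketched in a few lines. The helper names, the list of basin codes, and the 360° period used for storm direction are assumptions made for illustration; the periods used for the other cyclical features are not specified in this section.

```python
import math

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def cyclical(value, period):
    """Map a periodic quantity to (cos, sin) so both ends of the period coincide."""
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

BASINS = ["NA", "EP", "WP"]  # illustrative subset of basin codes (assumption)
basin_features = one_hot("EP", BASINS)

# A storm direction of 0 degrees and 360 degrees map to the same point.
north = cyclical(0.0, period=360.0)
also_north = cyclical(360.0, period=360.0)
```

With a raw angle, 0° and 360° would look maximally different to a model even though they describe the same heading; the (cos, sin) pair removes that discontinuity.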
We also engineered two additional features per time step to capture first-order dynamical effects: the latitude and longitude displacements in degrees between two consecutive steps.
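A minimal sketch of these displacement features is given below. Setting the first step to zero is an assumption (it has no predecessor), the coordinates are made up, and the sketch does not handle longitudes that wrap across the antimeridian.

```python
def displacements(values):
    """Difference between each step and the previous one.

    The first step has no predecessor, so it is set to 0.0 (assumption).
    """
    return [0.0] + [b - a for a, b in zip(values, values[1:])]

# Hypothetical storm positions at consecutive 3-h steps (degrees).
lat = [15.0, 15.5, 16.25]
lon = [-40.0, -41.0, -42.5]

dlat = displacements(lat)
dlon = displacements(lon)
```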
Finally, the maximum sustained wind speed feature reported can have different averaging policies depending on the specific reporting agency: 1 min for U.S. basins and 10 min for other WMO Regional Specialized Meteorological Centers. We adjusted all averaging time periods to 1 min by dividing the 10-min values by 0.93 as recommended by Harper et al. (2010).
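The averaging-period adjustment amounts to a single division. The sketch below uses hypothetical function and constant names; only the 0.93 factor comes from the text.

```python
TEN_TO_ONE_MIN_FACTOR = 0.93  # conversion factor cited from Harper et al. (2010)

def to_one_minute_wind(wind_kt, averaging_minutes):
    """Express a maximum sustained wind speed as a 1-min average."""
    if averaging_minutes == 1:
        return wind_kt
    if averaging_minutes == 10:
        return wind_kt / TEN_TO_ONE_MIN_FACTOR
    raise ValueError(f"unsupported averaging period: {averaging_minutes} min")

adjusted = to_one_minute_wind(93.0, averaging_minutes=10)
```

A 10-min wind of 93 kt therefore corresponds to a 1-min wind of about 100 kt.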
b. Reanalysis maps
In our work, we used the extensive ERA5 reanalysis dataset (Hersbach et al. 2020) developed by the European Centre for Medium-Range Weather Forecasts (ECMWF). ERA5 provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables. The data cover Earth on a 30-km grid and resolve the atmosphere using 137 levels from the surface up to a height of 80 km.
We extracted 25° × 25° maps centered at the storm locations given, at each time step, by the IBTrACS dataset described previously. The maps have a resolution of 1° × 1°, i.e., each cell corresponds to one degree of latitude and longitude, and offer a sufficient frame size to capture the entire storm. We obtained nine reanalysis maps for each TC time step, corresponding to three different features (geopotential height z, zonal component of the wind u, and meridional component of the wind υ) at three pressure levels (225, 500, and 700 hPa), as illustrated in Fig. 1. We chose the three features to incorporate physical information that would influence the TC evolution; this choice is motivated by previous literature applying ML techniques to reanalysis maps (Shimada et al. 2018; Chen et al. 2019; Giffard-Roisin et al. 2020).
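The storm-centered extraction can be illustrated as follows. This sketch assumes each field has already been regridded to a global 1° grid stored as a (181, 360) NumPy array, with rows running from 90°S to 90°N and columns from 0° to 359°E; the function name and the random placeholder fields are hypothetical, and the actual preprocessing may differ.

```python
import numpy as np

def crop_storm_window(field, lat_c, lon_c, half_width=12):
    """Cut a (2*half_width + 1)-degree square window centered on the storm.

    field: (181, 360) array on a 1-degree grid (assumed layout, see above).
    """
    row = int(round(lat_c)) + 90
    col = int(round(lon_c)) % 360
    rows = np.arange(row - half_width, row + half_width + 1)
    # Longitudes wrap around the globe, latitudes do not.
    cols = np.arange(col - half_width, col + half_width + 1) % 360
    if rows.min() < 0 or rows.max() >= field.shape[0]:
        raise ValueError("window extends past a pole")
    return field[np.ix_(rows, cols)]

# Nine placeholder fields standing in for z, u, v at three pressure levels.
rng = np.random.default_rng(0)
fields = rng.standard_normal((9, 181, 360))
maps = np.stack([crop_storm_window(f, lat_c=20.3, lon_c=-45.7) for f in fields])
```

Here `maps` has shape (9, 25, 25): one 25° × 25° window per feature and pressure level for a single time step.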
Fig. 1. Representation of the nine reanalysis maps repeatedly extracted for each time step, corresponding to three different features (geopotential height z, zonal component of the wind u, and meridional component of the wind υ) at three pressure levels (225, 500, and 700 hPa). Each map is of size 25° × 25°, centered on the TC center location, and each pixel corresponds to the average field value at the given latitude and longitude degree.
Citation: Weather and Forecasting 37, 6; 10.1175/WAF-D-21-0091.1
As a remark, we acknowledge two main limitations of using reanalysis maps for TC forecasting. First, since they are reanalysis products, they are not available in real time, which significantly hinders operational use. Second, they have deficiencies in representing tropical cyclones (Schenkel and Hart 2012; Hodges et al. 2017; Bian et al. 2021); for instance, the size of large TCs in particular is underestimated (Bian et al. 2021).
c. Operational forecast models
We obtained operational forecast data from the Automated Tropical Cyclone Forecasting (ATCF) dataset, maintained by the National Hurricane Center (NHC) (Sampson and Schrader 2000; National Hurricane Center 2021). The ATCF data contain historical forecasts by operational models used by the NHC for its official forecasting of tropical cyclones and subtropical cyclones in the North Atlantic and eastern Pacific basins. To compare the performance of our models with a benchmark, we selected the strongest operational forecasts with a sufficient number of cases concurrently available: DSHP, GFSO, HWRF, FSSE, and OFCL for the intensity forecast, and CLP5, HWRF, GFSO, AEMN, FSSE, and OFCL for the track forecast (see detailed list in Table 3). We extracted the forecast data using the Tropycal Python package (Burg and Lillo 2020).
Table 3. Summary of all operational forecast models included in our benchmark.
3. Methodology
Our Hurricast framework makes predictions based on time series data with different formats: three-dimensional vision-based reanalysis maps and one-dimensional historical storm data consisting of numerical and categorical features. The problem of simultaneously using different types of data is broadly known as multimodal learning in the field of machine learning.
Overall, we adopt a three-step approach to combine the multiple data sources. First, we extract a one-dimensional feature representation (embedding) from each sequence of reanalysis maps. Second, we concatenate this one-dimensional embedding with the statistical data to form a one-dimensional vector. Third, we make our predictions using gradient-boosted tree XGBoost models (Chen and Guestrin 2016) trained on the selected features.
At a given time step (forecasting case), we perform two 24-h lead-time forecasting tasks: intensity prediction, i.e., predicting the maximum sustained wind speed at a 24-h lead time; and displacement prediction, i.e., predicting the latitude and longitude storm displacement in degrees between the current time and 24 h later. Figure 2 illustrates the three-step pipeline.
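To make the two prediction targets concrete, the sketch below builds them from a single storm track sampled every 3 h, so that a 24-h lead corresponds to eight steps ahead. The function name and the synthetic track are hypothetical.

```python
STEPS_PER_24H = 8  # 24 h divided by the 3-h sampling interval

def make_targets(lat, lon, wind, lead=STEPS_PER_24H):
    """Return (intensity, dlat, dlon) targets for every usable forecasting case."""
    n_cases = len(wind) - lead
    if n_cases <= 0:
        raise ValueError("track is too short for the requested lead time")
    intensity = [wind[t + lead] for t in range(n_cases)]       # wind 24 h later
    dlat = [lat[t + lead] - lat[t] for t in range(n_cases)]    # latitude displacement
    dlon = [lon[t + lead] - lon[t] for t in range(n_cases)]    # longitude displacement
    return intensity, dlat, dlon

# Synthetic 10-step track: only the first two steps have a value 24 h ahead.
lat = [10.0 + 0.5 * t for t in range(10)]
lon = [-30.0 - 1.0 * t for t in range(10)]
wind = [30.0 + 5.0 * t for t in range(10)]
intensity, dlat, dlon = make_targets(lat, lon, wind)
```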
Fig. 2. Representation of our multimodal machine learning framework using the two data sources: statistical and reanalysis maps. During step 1, we extract embeddings from the reanalysis maps. In particular, we use encoder–decoder architectures or tensor decomposition to obtain a one-dimensional representation. During step 2, we concatenate the statistical data with the features extracted from the reanalysis maps. During step 3, we train one XGBoost model for each of the prediction tasks: intensity in 24 h, latitude displacement in 24 h, and longitude displacement in 24 h.
To perform the feature extraction in step 1, we have experimented with two computer vision techniques to obtain the reanalysis maps embeddings: 1) encoder–decoder neural networks and 2) tensor decomposition methods. The former is a supervised learning method; for each input, we use an associated prediction target to train the network. On the other hand, tensor decomposition is an unsupervised method; there is no specific labeled prediction target, and instead, embeddings are drawn directly from the patterns within the data.
a. Feature extraction
1) Encoder–decoder architectures
The encoder–decoder neural network architecture refers to a general type of deep learning architecture consisting of two components: an encoder, which maps the input data into a latent space; and a decoder, which maps the latent space embeddings into predictions. It is well suited to deal with multimodal data as different types of neural network layers can be adapted to distinct modalities.
In our work, the encoder component consists of a CNN, a successful computer vision technique to process imagery data (LeCun et al. 1989; He et al. 2016; Krizhevsky et al. 2017).
We compare two decoder variations. The first one relies on gated recurrent units (GRU) (Chung et al. 2014), a recurrent neural network well suited to modeling temporal dynamic behavior in sequential data. The second one uses transformers (Vaswani et al. 2017), a state-of-the-art architecture for sequential data. While the GRU models the temporal aspect through a recurrence mechanism, the transformer uses attention mechanisms and positional encoding (Bahdanau et al. 2015; Vaswani et al. 2017) to model long-range dependencies.
First, we train the encoder–decoder architectures using standard backpropagation to update the weights parameterizing the models (Rumelhart et al. 1987; Goodfellow et al. 2016). We use a mean squared error loss with either an intensity or track objective and add an L2 regularization penalty on the network’s weights. We then freeze the encoder–decoder’s weights when training is completed.
To perform feature extraction from a given input sequence of reanalysis maps and statistical data, we pass them through the whole frozen encoder–decoder, except the last fully connected layer (see Figs. 3 and 4). The second fully connected layer after the GRU, or the pooling layer after the transformer, outputs a vector of relatively small size, e.g., 128 features, to compress information and provide predictive features. This vector constitutes our one-dimensional reanalysis maps embedding that we extract from the initial 45 000 (8 × 9 × 25 × 25) features forming the spatial–temporal input. The motivation is that since the encoder–decoder acquired intensity or track prediction skills during training, it should capture relevant reanalysis maps information in the embeddings. Using these internal features as input to an external model is a method inspired by transfer learning and distillation, which are generally efficient in visual imagery (Yosinski et al. 2014; Kiela and Bottou 2014; Hinton et al. 2015; Tan et al. 2018).
Fig. 3. Schematic of our CNN-encoder GRU-decoder network for an eight-time-step TC sequence. At each time step, we utilize the CNN to produce a one-dimensional representation of the reanalysis maps. Then, we concatenate these embeddings with the corresponding statistical features to create a sequence of inputs fed sequentially to the GRU. At each time step, the GRU outputs a hidden state passed to the next time step. Finally, we concatenate all the successive hidden states and pass them through three fully connected layers to predict intensity or track with a 24-h lead time. We finally extract our spatial–temporal embeddings as the output of the second fully connected layer.
Fig. 4. Schematic of our CNN-encoder transformer-decoder network for an eight-time-step TC sequence. At each time step, we utilize the CNN to produce a one-dimensional representation of the reanalysis maps. Then, we concatenate these embeddings with the corresponding statistical features to create a sequence of inputs fed as a whole to the transformer. The transformer outputs a new 8-time-step sequence that we average (pool) feature-wise and then feed into one fully connected layer to predict intensity or track with a 24-h lead time. We finally extract our spatial–temporal embeddings as the output of the pooling layer.
Figures 3 and 4 illustrate the encoder–decoder architectures. More details on all components are given in the appendix.
2) Tensor decomposition
We also explored tensor decomposition methods as a means of feature extraction. The motivation for using tensor decomposition is to represent high-dimensional data using low-dimensional features. Throughout this work, we use the Tucker decomposition, also known as the higher-order singular value decomposition. In contrast to the aforementioned neural network-based feature processing techniques, tensor decomposition is an unsupervised extraction technique, meaning features are not learned with respect to specific prediction targets.
At each time step, we treated past reanalysis maps over past time steps as a four-dimensional tensor of size 8 × 9 × 25 × 25 (corresponding to 8 past time steps of 9 reanalysis maps of size 25 pixels by 25 pixels). We used the core tensor obtained from the Tucker decomposition as extracted features after flattening it. We decomposed the tensor using the multilinear singular value decomposition (SVD) method, which is computationally efficient (De Lathauwer et al. 2000).
The size of the core tensor, i.e., the Tucker rank of the decomposition, is a hyper-parameter to be tuned. Based on validation, the Tucker rank is tuned to size 3 × 5 × 3 × 3. More details on tensor decomposition methodology can be found in the appendix.
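As an illustration of this extraction step, a truncated higher-order SVD can be written in a few lines of NumPy. This is a sketch under stated assumptions rather than the implementation used in this work: the function names are hypothetical, the input tensor is random placeholder data, and no refinement of the factors is performed after truncation.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def truncated_hosvd(tensor, ranks):
    """Return (core, factors) of a truncated higher-order SVD."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the mode-n unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        # Mode-n product with u.T: project axis `mode` onto the retained basis.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
maps = rng.standard_normal((8, 9, 25, 25))  # placeholder for one forecasting case
core, factors = truncated_hosvd(maps, ranks=(3, 5, 3, 3))
features = core.ravel()  # flattened core tensor used as the extracted features
```

With the Tucker rank 3 × 5 × 3 × 3 quoted above, the flattened core yields 135 features per forecasting case, compared with 45 000 input values.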
b. Forecasting models
During step 2, we concatenated features from relevant data sources to form a one-dimensional input vector corresponding to each forecasting case.
First, we reshaped the statistical data sequence corresponding to the fixed window size of past observations into a one-dimensional vector. Then, we concatenated it to the one-dimensional reanalysis maps embeddings obtained with one of the feature extraction techniques.
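The reshaping and concatenation in step 2 can be sketched as follows. The number of statistical features per time step (30) is a hypothetical value, and the arrays are filled with random placeholder data; the eight-step window and the 128-feature embedding follow the sizes mentioned earlier in the text.

```python
import numpy as np

N_STEPS, N_STAT, EMBED_DIM = 8, 30, 128  # N_STAT is a hypothetical value

rng = np.random.default_rng(0)
stat_sequence = rng.standard_normal((N_STEPS, N_STAT))  # statistical data window
map_embedding = rng.standard_normal(EMBED_DIM)          # output of step 1

# Flatten the window time step by time step, then append the embedding.
x = np.concatenate([stat_sequence.reshape(-1), map_embedding])
```

The resulting vector `x` has 8 × 30 + 128 = 368 entries and is the input for one forecasting case in step 3.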
During step 3, we used XGBoost models for the track and intensity forecasts. XGBoost is a gradient-boosted tree-based model widely used in the machine learning community for its superior modeling skills and efficient computation time. During the experimentation phase, we compared several other machine learning models, including linear models, support vector machines, decision trees, random forests, and feed-forward neural networks, and found XGBoost to be generally the best performing.
c. Summary of models
This section lists all the forecast models tested and retained and summarizes the methodologies employed in Table 4.
Table 4. Summary of the various versions of the Hurricast framework for which we report results. Models differ in architecture and data used and are named based on these two characteristics.
Models 1–4 are variations of the three-step framework described in Fig. 2, differing in input data source or processing technique. Model 1, HUML-(stat, xgb), has the simplest form, utilizing only statistical data. Models 2–4 utilize statistical and vision data and are referred to as multimodal models. They differ in the extraction technique used on the reanalysis maps. Model 2, HUML-(stat/viz, xgb/td), uses vision features extracted with the tensor decomposition technique. In contrast, Models 3 and 4 utilize vision features extracted with the encoder–decoder, with GRU and transformer decoders, respectively. Model 5, HUML-ensemble, is a weighted consensus model of Models 1–4. The weights given to each model are optimized using ElasticNet. Model 6 is a simple average consensus of a few operational forecast models used by the NHC and our Model 4, HUML-(stat/viz, xgb/cnn/transfo). We use Model 6 to explore whether the Hurricast framework can benefit current operational forecasts by evaluating the effect of its inclusion as a member model.
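The simple average consensus of Model 6 can be illustrated as follows. The member names and forecast values are purely illustrative, and the sketch assumes every member provides a forecast for the same cases.

```python
def simple_average_consensus(member_forecasts):
    """Equal-weight average of member forecasts, case by case."""
    members = list(member_forecasts.values())
    n_cases = len(members[0])
    if any(len(m) != n_cases for m in members):
        raise ValueError("all members must cover the same forecasting cases")
    return [sum(case) / len(members) for case in zip(*members)]

# Hypothetical 24-h intensity forecasts (kt) for two cases.
forecasts_kt = {
    "HWRF": [60.0, 80.0],
    "GFSO": [70.0, 90.0],
    "HUML": [65.0, 85.0],
}
consensus = simple_average_consensus(forecasts_kt)
```

A weighted consensus such as Model 5 would replace the equal weights with fitted coefficients; that fitting step is not shown here.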
4. Experiments
a. Evaluation metrics
We computed the mean geographical distance error in kilometers between the actual position and the predicted position in 24 h to evaluate our track forecasts’ performance, using the Haversine formula. The Haversine metric (see the appendix for the exact formula) calculates the great-circle distance between two points—i.e., the shortest distance between these two points over Earth’s surface.
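For reference, the great-circle distance can be computed with the standard Haversine formula as sketched below. The mean Earth radius of 6371 km is an assumption; the exact constant and formula used in this work are those given in the appendix.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The track error of a forecast is then `haversine_km` evaluated between the predicted and the observed positions at the 24-h lead time.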
We also report the mean absolute error (MAE), the standard deviation of the error, and the forecast skill, using Decay-SHIPS and CLP5 as the baselines for intensity and track, respectively.
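The sketch below shows how a skill score relative to a baseline is commonly computed, namely as the percentage reduction in error with respect to the baseline. This definition is an assumption based on common forecast-verification practice, and the numbers are illustrative only.

```python
def forecast_skill_percent(model_mae, baseline_mae):
    """Percentage improvement of a model's MAE over a baseline's MAE."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae

# Illustrative numbers: a model with 60% of the baseline error has 40% skill.
skill = forecast_skill_percent(model_mae=60.0, baseline_mae=100.0)
```

Under this definition, a positive skill means the model improves on the baseline, and a negative skill means it does worse.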
b. Training, validation, and testing protocol
We separated the dataset into training (80% of the data), validation (10% of the data), and testing (10% of the data). The training set ranges from 1980 to 2011, the validation set from 2012 to 2015, and the test set from 2016 to 2019. Within each set, we treated all samples independently.
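The chronological split can be expressed as a simple rule on the year of each forecasting case. The function name is hypothetical; the year ranges are those stated above.

```python
def assign_split(year):
    """Map the year of a forecasting case to its data split."""
    if 1980 <= year <= 2011:
        return "train"
    if 2012 <= year <= 2015:
        return "validation"
    if 2016 <= year <= 2019:
        return "test"
    raise ValueError(f"year {year} is outside the study period")

splits = [assign_split(y) for y in (1995, 2013, 2018)]
```

Splitting by year rather than at random keeps all cases of a given storm in the same set and ensures that the test period lies strictly after the training period.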
The test set comprises all the TC cases between 2016 and 2019 from the NA and EP basins where the operational forecast predictions are concurrently available as benchmarks. We compared all models on the same cases.
We used data from all basins during training and validation, but we only report performance on the North Atlantic and eastern Pacific basins, where we have operational forecast data available.
The precise validation-testing methodology and hyper-parameter tuning strategy are detailed in the appendix.
c. Computational resources
Our code is available at https://github.com/leobix/hurricast. We used Python 3.6 (Van Rossum and Drake 1995) and we coded neural networks using PyTorch (Paszke et al. 2019). We trained all our models using one Tesla V100 GPU and six CPU cores. Typically, our encoder–decoders trained within an hour, reaching the best validation performance after 30 epochs. XGBoost models trained within two minutes. When making a new prediction at test time, the whole model (feature extraction + XGBoost) runs within a couple of seconds, which shows practical interest for deployment. The only bottleneck lies in the acquisition of the reanalysis maps. We further discuss this point in section 6a.
5. Results
a. Stand-alone machine learning models produce a comparable performance to operational models
For 24-h lead-time track forecasting, as shown in Table 5, the best Hurricast model, HUML-(stat/viz, xgb/cnn/transfo), has a skill with respect to CLP5 of 40% in the EP basin. In comparison, HWRF has a skill of 45% and GFSO 46%. In the NA basin, HUML-(stat/viz, xgb/cnn/transfo) has a skill of 46%, compared to 63% for HWRF and 65% for GFSO.
Table 5. Mean absolute error (MAE), forecast skill with respect to CLP5, and standard deviation of the error (error sd) of stand-alone Hurricast models and operational forecasts on the same test set between 2016 and 2019, for 24-h lead-time track forecasting task. Bold values highlight the best performance in each category.
For 24-h lead-time intensity forecasting, as shown in Table 6, the multimodal Hurricast models have a better MAE and lower standard deviation in errors than Decay-SHIPS, HWRF, and GFSO in the EP basin. In particular, our best model, HUML-(stat/viz, xgb/cnn/transfo), outperforms Decay-SHIPS by 12% and HWRF by 3% in MAE. In the NA basin, HUML-(stat/viz, xgb/cnn/transfo) underperforms Decay-SHIPS by 2% and HWRF by 7% but has a lower error standard deviation.
Table 6. Mean absolute error (MAE), forecast skill with respect to Decay-SHIPS, and standard deviation of the error (error sd) of stand-alone Hurricast models and operational forecasts on the same test set between 2016 and 2019, for 24-h lead-time intensity forecasting task. Bold values highlight the best performance in each category.
These results highlight that machine learning approaches can emerge as a new methodology alongside the existing forecasting methodologies in the field. In addition, we believe there is potential for improvement given more data sources.
b. Machine learning models bring additional insights to consensus models
Consensus models often produce better performance than individual models by averaging out errors and biases. Hence, we tested two consensus models: HUML-ensemble is the weighted average of all individual Hurricast variations; HUML/OP-average is a simple average of HUML-(stat/viz, xgb/cnn/transfo) and the other stand-alone operational models included in our benchmark.
As shown in Tables 7 and 8, HUML-ensemble consistently improves upon the best performing Hurricast variation in terms of MAE, showcasing the possibility of building practical ensembles from machine learning models.
Table 7. Mean absolute error (MAE), forecast skill with respect to CLP5, and standard deviation of the error (error sd) of consensus models compared with NHC’s official model OFCL on the same test set between 2016 and 2019 for track forecasting task. Bold values highlight the best performance in each category.
Table 8. Mean absolute error (MAE), forecast skill with respect to Decay-SHIPS, and standard deviation of the error (error sd) of consensus models compared with NHC’s official model OFCL on the same test set between 2016 and 2019 for intensity forecasting task. Bold values highlight the best performance in each category.
Moreover, OP-average consensus is the equal-weighted average of available operational forecasts. We constructed the HUML/OP-average consensus with the additional inclusion of the HUML-(stat/viz, xgb/cnn/transfo) model. Results show that the inclusion of our machine learning model brings value into the consensus for both track and intensity tasks. In addition, HUML/OP-average produces lower MAE and standard deviation under our testing scope than the NHC’s official forecast OFCL for 24-h lead time.
In particular, in our 24-h lead-time testing scope, in terms of intensity MAE, HUML/OP-average outperforms OFCL by 8% on the EP basin and 2% on the NA basin. In track MAE, HUML/OP-average outperforms OFCL by 7% on the EP basin and 14% on the NA basin.
As a remark, we do not consider the computational time lag of operational model forecasts in our experiments. Computational time varies and can take several hours for dynamical models. Nevertheless, these results highlight the complementary benefits of machine learning models to operational models.
c. A multimodal approach leads to more accurate forecasts than using single data sources
As shown in Tables 5 and 6, for both track and intensity forecasts, multimodal models achieve higher accuracy and lower standard deviation than the model using only statistical data.
The deep learning feature extraction methods outperform the tensor-decomposition-based approach. This is not surprising, as our encoder–decoders were trained with a supervised learning objective, which means the extracted features are tailored to the particular downstream prediction task. Tensor decomposition has the advantage of being label agnostic, but it did not extract features with enough predictive information to improve the performance.
6. Limitations and extensions
a. The use of reanalysis maps
A significant limitation of reanalysis maps is the computation time needed to construct them, as they are produced by assimilating observational data. Thus, although our models can compute forecasts in seconds, the dependence on reanalysis maps is a bottleneck in real-time forecasting. Therefore, a natural extension for effective deployment is to train our models using real-time observational data or field forecasts from powerful dynamical models such as HWRF. Since dynamical models are constantly updated with improved physics, higher resolution, and bug fixes, reforecast products (e.g., Hamill et al. 2013) should be well suited for training our encoder–decoders. Nevertheless, we hope our framework can provide guidance and a reference for building operational machine learning models in the future.
b. Incorporate additional data
Within the scope of this work, we used nine reanalysis maps per time step, corresponding to the geopotential height (z) and the zonal (u) and meridional (υ) components of the wind fields at three pressure levels. One natural extension is to include additional features, such as the sea surface temperature, the temperature, and the relative humidity, and to include information from more vertical levels to potentially improve model performance.
In addition, one could include more data sources, such as satellite and radar data. Notably, we highlight the flexibility of our framework that can easily incorporate new data: we can adopt different feature extraction architectures and then append or substitute extracted features in the XGBoost forecasting model accordingly.
c. Longer-term forecasts
We conducted our experiments for 24-h lead-time predictions to demonstrate the potential of ML techniques in hurricane forecasting tasks. However, experiments on longer-term forecasts are needed before deploying such approaches. For example, the official NHC forecast provides guidance for up to 5 days. Nevertheless, our framework can be extended to longer lead-time forecasts. In particular, we recommend extending the input window size (currently 24 h), as our models can process arbitrarily long input sequences.
7. Conclusions
This work demonstrates a novel multimodal machine learning framework for tropical cyclone intensity and track forecasting utilizing historical storm data and meteorological reanalysis data. We present a three-step pipeline to combine multiple machine learning approaches, consisting of 1) deep feature extraction, 2) concatenation of all processed features, and 3) prediction. We demonstrate that a successful combination of deep learning techniques and gradient-boosted trees can achieve strong predictions for both track and intensity forecasts, producing comparable results to current operational forecast models, especially in the intensity task. We acknowledge that the unavailability of real-time reanalysis data poses a challenge for operational use, and suggest future work to extend our framework with other operational data sources.
We demonstrate that multimodal encoder–decoder architectures can successfully serve as a spatial–temporal feature extractor for downstream prediction tasks. In particular, this is also the first successful application of a transformer-decoder architecture in tropical cyclone forecasting.
Furthermore, we show that consensus models that include our machine learning model could improve upon the NHC’s official forecast for both intensity and track, thus demonstrating the potential value of developing machine learning approaches as a new branch of methodology for tropical cyclone forecasting.
Moreover, once models are trained, they can compute results in seconds, showing practical interest for real-time forecasting. We remark that the bottleneck of machine learning approaches lies only in data acquisition, and we propose extensions and guidance for effective real-world deployment.
In conclusion, our work demonstrates that machine learning can provide valuable additions to the field of tropical cyclone forecasting. We hope this work opens the door for further use of machine learning in meteorological forecasting.
¹ Multimodality in machine learning refers to the simultaneous use of different data formats, including, for example, tabular data, images, time series, free text, and audio.
² For example, we encoded the latitude value by
Acknowledgments.
We thank the review team of the Weather and Forecasting journal for insightful comments that improved the paper substantially. We thank Louis Maestrati, Sophie Giffard-Roisin, Charles Guille-Escuret, Baptiste Goujaud, David Yu-Tung Hui, Ding Wang, and Tianxing He for useful discussions. We thank Nicolò Forcellini and Miya Wang for proofreading. The work was partially supported by a grant to MIT by the OCP Group. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing high-performance computing resources that have contributed to the research results reported within this paper.
Data availability statement.
All data we used are open-source and can be accessed directly from the Internet: IBTrACS for TC features, Tropycal for operational forecasts, and ERA5 for reanalysis maps. Our code is available at https://github.com/leobix/hurricast.
APPENDIX
Technical Components
a. Encoder–decoder architectures
1) Overall architecture and mechanisms
(i) The CNN-encoder
At each time step, we feed the nine reanalysis maps into the CNN-encoder, which produces one-dimensional embeddings. The CNN-encoder consists of three convolutional layers, with ReLU activation and MaxPool layers in between, followed by two fully connected layers.
Next, we concatenate the reanalysis map embeddings with the processed statistical data corresponding to the same time step. At this point, the data are still structured as a sequence of eight time steps to be passed on to the decoder.
(ii) The GRU-decoder
Our GRU-decoder consists of two unidirectional layers. The data sequence embedded by the encoder is fed sequentially in chronological order into the GRU-decoder. For each time step, the GRU-decoder outputs a hidden state representing a “memory” of the previous time steps. Finally, a track or intensity prediction is made based upon these hidden states concatenated all together and given as input to fully connected layers (see Fig. 3).
(iii) The transformer-decoder
In contrast to the GRU-decoder, we feed the sequence as a whole into the transformer-decoder. The time-sequential aspect is lost since attention mechanisms allow each hidden representation to attend holistically to the other hidden representations. Therefore, we add a positional encoding token at each time-step input, following standard practices (Vaswani et al. 2017). This token represents the relative position of a time step within the sequence; it reintroduces some information about the inherent sequential order of the data and experimentally improves performance.
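As an illustration of the positional encoding step, the following minimal sketch builds the sinusoidal encoding table of Vaswani et al. (2017) for a sequence of 8 time steps of size 142. The text does not state which encoding variant was used, so the sinusoidal form and the function name are assumptions made for illustration only.

```python
import math


def sinusoidal_positional_encoding(num_steps, dim):
    """Return a num_steps x dim table of sinusoidal position encodings.

    Assumption: the fixed sinusoidal variant of Vaswani et al. (2017);
    the paper only states that a positional encoding is added.
    """
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            # Each (even, odd) channel pair shares one frequency.
            freq = 1.0 / (10000 ** ((2 * (i // 2)) / dim))
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# 8 time steps, each of size 142, as in the transformer-decoder input.
pe = sinusoidal_positional_encoding(num_steps=8, dim=142)
```

Each row of `pe` would be added elementwise to the embedded input of the corresponding time step before the attention layers.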
Then, we use two transformer layers that transform the 8 time steps (of size 142) into an 8-time-step sequence with similar dimensions. To obtain a unique representation of the sequence, we average the output sequence feature-wise into a one-dimensional vector, following standard practices. Finally, a track or intensity prediction is made based upon this averaged vector input into one fully connected layer (see Fig. 4).
(iv) Loss function
2) Technical details on the CNN-encoder GRU-decoder network
We provide more formal and precise explanations of our encoder–decoder architectures.
CNN-encoder GRU-decoder architecture details
Let t be the instant when we want to make a 24-h lead-time prediction. Let
First,
Representation of our CNN-encoder. We use three convolutional layers, with batch normalization, ReLU and MaxPool in between. We use fully connected (dense) layers to obtain in the end a one-dimensional vector
Citation: Weather and Forecasting 37, 6; 10.1175/WAF-D-21-0091.1
Finally, we concatenate
To extract the spatial–temporal embedded features, we use the output of the second fully connected layer, of dimension 128. This technique therefore allows us to reduce 8 × 9 × 25 × 25 = 45 000 features to 128 predictive features that can be input into our XGBoost models.
For each convolutional layer of the CNN, we use the following parameters: kernel size = 3, stride = 1, padding = 0. For each MaxPool layer, we use the following parameters: kernel size = 2, stride = 2, padding = 0.
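The effect of these parameters on the spatial size of a 25 × 25 reanalysis map can be checked with a short calculation. The sketch below assumes that MaxPool layers appear only between the three convolutional layers (two pooling layers in total), as the overall description suggests; the helper names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_spatial_sizes(size=25):
    """Track the spatial side length through conv/pool/conv/pool/conv.

    Assumption: pooling layers sit only between the three convolutions.
    """
    sizes = [size]
    for layer in ("conv", "pool", "conv", "pool", "conv"):
        if layer == "conv":
            size = out_size(size, kernel=3, stride=1, padding=0)
        else:
            size = out_size(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


sizes = encoder_spatial_sizes()
print(sizes)  # [25, 23, 11, 9, 4, 2]
```

Under these assumptions, the last convolutional layer outputs 2 × 2 maps per channel, which are flattened before the fully connected layers.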
The CNN-encoder architecture is inspired by Giffard-Roisin et al. (2020). The combination with the GRU-decoder or transformer-decoder and the feature extraction are contributions of our work.
3) Technical details on the transformer-decoder architecture
As with the CNN-encoder GRU-decoder network, the spatial–temporal inputs are processed and concatenated with the statistical data to obtain a sequence of input
We used self-attention layers (i.e., Q = K = V), specifically 2 layers with 2 heads, the model’s dimension dk being fixed to 142 and the feedforward dimension set to 128.
We then averaged the outputs of our transformer
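To make the self-attention and averaging steps concrete, here is a minimal single-head sketch in plain Python with Q = K = V. It deliberately omits the learned projection matrices, the split into 2 heads, the feedforward sublayer, and layer normalization of a real transformer layer, so it illustrates the mechanism rather than the trained model; all function names and the toy input are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq (no projections)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def mean_pool(seq):
    """Average a sequence feature-wise into a single vector."""
    n = len(seq)
    return [sum(row[j] for row in seq) / n for j in range(len(seq[0]))]


MODEL_DIM, NUM_HEADS, NUM_STEPS = 142, 2, 8
HEAD_DIM = MODEL_DIM // NUM_HEADS  # size each head would receive

# Toy input standing in for the 8 embedded time steps.
seq = [[math.sin(t + j) for j in range(MODEL_DIM)] for t in range(NUM_STEPS)]
attended = self_attention(seq)
pooled = mean_pool(attended)
```

The output `attended` keeps the 8 × 142 shape of the input, and `pooled` is the single 142-dimensional vector that a final fully connected layer would consume.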
b. Tucker decomposition for tensors
The multilinear singular value decomposition (SVD) expresses a tensor
Illustration of the tensor decomposition of a three dimensional tensor. Tensor
Analogous to truncated SVD, we can reduce the dimensionality of tensor
Finally, we flatten the truncated core tensor
c. Experiment details
1) Testing methodology
We employed the validation set to perform hyper-parameter tuning. Then, we retrained the models on the training and validation set combined using the best combination of hyper-parameters. We then evaluated our models’ performance on the test set.
We report the performance obtained on the NA and EP test sets with each method at 24-h lead time for both intensity and track forecasts. In practice, operational model forecasts often become available only after a time lag. This lag is shorter for statistical models but longer for dynamical models (up to several hours) because of their computational cost. Because the lag varies across models, we do not account for it in our comparisons with operational models; that is, we compare model results as if all forecasts were computed instantaneously. We hope to provide an overall sense of the predictive power of our methodology, although we acknowledge that reanalysis maps are not available in real time. We discuss this bottleneck in section 6.
2) The specific protocol for HUML-ensemble
For the HUML-ensemble model, we used the HUML models 1–4 trained on the training set only (i.e., data until 2011). We then used their forecasts on the unseen validation set (2012–15) and their forecasts on the unseen test set (2016–19) as the training and testing data for the ensemble. The goal is to understand how each model behaves with respect to the others on unseen data. We cross-validated the ElasticNet parameters on the 2012–15 HUML forecasts and we finally tested on the same cases as before using the best hyper-parameter combination found.
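The year-based protocol can be sketched as a simple partition of forecast cases. The record schema (a dictionary with a `year` key) and the function name are assumptions for illustration and do not reflect the layout of the released code.

```python
def split_for_ensemble(records):
    """Partition cases by year following the HUML-ensemble protocol.

    records: iterable of dicts with a 'year' key (hypothetical schema).
    Returns (base_train, ensemble_train, ensemble_test).
    """
    base_train, ensemble_train, ensemble_test = [], [], []
    for record in records:
        year = record["year"]
        if year <= 2011:
            base_train.append(record)      # used to fit HUML models 1-4
        elif 2012 <= year <= 2015:
            ensemble_train.append(record)  # unseen forecasts used to fit the ensemble
        elif 2016 <= year <= 2019:
            ensemble_test.append(record)   # unseen forecasts used for the final test
    return base_train, ensemble_train, ensemble_test


cases = [{"year": y} for y in (2005, 2011, 2012, 2015, 2016, 2019)]
base_train, ensemble_train, ensemble_test = split_for_ensemble(cases)
```

Keeping the three periods disjoint ensures that the ensemble weights are learned only from forecasts the base models made on data they had not seen.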
3) Hyper-parameter tuning
We distinguish five categories of hyper-parameters to tune: 1) the data-related features, 2) the neural network–related features, 3) the tensor decomposition–related features, 4) the tree-based-method-related features, and 5) the consensus models–related features.
(i) Data-related features
The data-related features include the area covered by the reanalysis maps (grid size) and the number of historical time steps of data to use for each forecast. We tune these features by comparing the 24-h lead-time forecast performance of the encoder–decoders for each different hyper-parameter configuration.
We found that using eight past time steps (i.e., up to 21 h in the past) and a grid size of 25° × 25° for the reanalysis maps was the best combination. We also found that standardizing the vision and statistical data—i.e., rescaling each feature to mean 0 and standard deviation 1—yielded better results than normalizing—i.e., rescaling each feature to the [0, 1] range.
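The two rescaling schemes compared here differ as in the following sketch. The sample values are made up, and in practice the statistics would be computed on the training data only and reused for the validation and test data.

```python
import statistics


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values of a single feature, for illustration only.
sample = [990.0, 1000.0, 1010.0]
standardized = standardize(sample)
normalized = min_max_normalize(sample)
print(normalized)  # [0.0, 0.5, 1.0]
```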
(ii) Neural network–related features
The neural network–related features include the optimizer, the architecture itself, the batch size during training, and the loss function’s regularizer.
The best results were obtained using a batch size of 64, a λ regularization term of 0.01, and the encoder–decoder architectures described previously. Regarding the optimizer, we use Adam (Kingma and Ba 2014) with a learning rate of 10⁻³ for the intensity forecast and 4 × 10⁻⁴ for the track forecast.
(iii) Tensor decomposition features
The tensor decomposition algorithm requires choosing the core tensor size, i.e., the compressed size of the original tensor. Recall that the original tensor size is 8 × 9 × 25 × 25. Based on empirical testing, we found that a small core tensor size of 3 × 5 × 3 × 3 yielded the best performance when the compressed reanalysis maps are included as features in XGBoost models.
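For intuition about this compression, the sketch below applies a truncated higher-order SVD (one standard way to compute a Tucker decomposition) to a single random 8 × 9 × 25 × 25 tensor and keeps a 3 × 5 × 3 × 3 core, i.e., 135 features instead of 45 000. It uses NumPy only and is not the implementation used in this work; in particular, how the factor matrices are fitted and shared across samples is not specified here, and the function names are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize `tensor` so that `mode` indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` of shape (new_dim, old_dim)."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def truncated_hosvd(tensor, ranks):
    """Return (core, factors) of a truncated higher-order SVD."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each mode unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto the kept subspace
    return core, factors


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 9, 25, 25))  # stand-in for one sample's maps
core, factors = truncated_hosvd(x, ranks=(3, 5, 3, 3))
features = core.ravel()  # flattened core used as tabular features
print(core.shape, features.size)  # (3, 5, 3, 3) 135
```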
(iv) Tree-based method features
Based on empirical testing, we found that XGBoost models consistently outperformed decision trees, random forests, and other ML methods such as support vector machines, regularized linear regression, and multilayer perceptrons. XGBoost also trains fast, which is a considerable advantage for heavy hyper-parameter search. Therefore, we selected XGBoost as the core model for prediction.
The best combination of hyper-parameters varies with the task (track or intensity), the basin (NA or EP), and the data sources used (statistical data, various reanalysis map embeddings). However, the following hyper-parameters were typically important, and their best values fell in these ranges: maximum tree depth (between 6 and 9), number of estimators (between 100 and 300), learning rate (between 0.03 and 0.15), subsample ratio (between 0.6 and 0.9), column sampling by tree (between 0.7 and 1), and minimum child weight (between 1 and 5).
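These ranges can be written as a search space from which candidate configurations are drawn. The sketch below uses parameter names from the XGBoost scikit-learn interface; mapping the last item of the list above to `min_child_weight` is an assumption, and random sampling is only one possible search strategy, not necessarily the one used here.

```python
import random

# Ranges reported above, keyed by XGBoost scikit-learn parameter names.
# Assumption: the "minimum child" setting corresponds to min_child_weight.
SEARCH_SPACE = {
    "max_depth": ("int", 6, 9),
    "n_estimators": ("int", 100, 300),
    "learning_rate": ("float", 0.03, 0.15),
    "subsample": ("float", 0.6, 0.9),
    "colsample_bytree": ("float", 0.7, 1.0),
    "min_child_weight": ("int", 1, 5),
}


def sample_config(rng):
    """Draw one hyper-parameter configuration uniformly from SEARCH_SPACE."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive bounds
        else:
            config[name] = rng.uniform(low, high)
    return config


config = sample_config(random.Random(0))
```

A sampled `config` could then be passed as keyword arguments to an XGBoost regressor and scored on the validation set.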
(v) Consensus models–related features
We tested different kinds of consensus models on the HUML forecasts, including ElasticNet (Zou and Hastie 2005), tree-based models, and multilayer perceptrons (MLPs) as metalearners. MLPs performed similarly to ElasticNet, but since they are less interpretable and less stable, ElasticNet is the strongest ensembler candidate and our final choice for HUML-ensemble. We tune the L1/L2 ratio between 0 and 1 and the regularization penalty between 10⁻⁴ and 10.
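One possible way to cover these two ranges is a grid with a log-spaced penalty and an evenly spaced mixing ratio, as sketched below. The grid resolution and the variable names are assumptions; only the end points of the ranges come from the text.

```python
def log_grid(low_exp, high_exp):
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


# Regularization penalty from 1e-4 to 10, one value per decade (assumed spacing).
penalties = log_grid(-4, 1)
# L1/L2 mixing ratio from 0 (pure L2) to 1 (pure L1), in steps of 0.1 (assumed).
l1_ratios = [i / 10 for i in range(11)]

grid = [(penalty, ratio) for penalty in penalties for ratio in l1_ratios]
print(len(grid))  # 66
```

Each pair in `grid` would be evaluated by cross-validation on the 2012–15 forecasts, and the best pair retained for the final test.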
4) Metrics
(i) Haversine formula
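As a reminder of how the haversine formula turns two latitude–longitude pairs into a great-circle distance, here is a minimal sketch. The Earth radius of 6371 km and the function name are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Quarter of the equator: from (0N, 0E) to (0N, 90E).
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
```

The track error of a forecast would be this distance between the forecast position and the verifying best-track position.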
(ii) Skill
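For orientation, forecast skill is conventionally expressed as the percentage improvement of a model's error over that of a baseline forecast. The sketch below shows this conventional form; it is offered as an assumption about the general definition, and the choice of baseline and the function name are hypothetical.

```python
def skill_percent(model_error, baseline_error):
    """Percentage improvement of a forecast over a baseline (conventional form).

    Positive values mean the model has a lower error than the baseline.
    """
    return 100.0 * (baseline_error - model_error) / baseline_error


# A model with a mean error of 80 against a baseline error of 100 (made-up numbers).
example_skill = skill_percent(model_error=80.0, baseline_error=100.0)
print(example_skill)  # 20.0
```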
REFERENCES
Aberson, S. D., 1998: Five-day tropical cyclone track forecasts in the North Atlantic basin. Wea. Forecasting, 13, 1005–1015, https://doi.org/10.1175/1520-0434(1998)013<1005:FDTCTF>2.0.CO;2.
Alemany, S., J. Beltran, A. Perez, and S. Ganzfried, 2019: Predicting hurricane trajectories using a recurrent neural network. Proc. Conf. AAAI Artif. Intell., 33, 468–475, https://doi.org/10.1609/aaai.v33i01.3301468.
Bahdanau, D., K. Cho, and Y. Bengio, 2015: Neural machine translation by jointly learning to align and translate. arXiv, 1409.0473, https://doi.org/10.48550/arXiv.1409.0473.
Bian, G.-F., G.-Z. Nie, and X. Qiu, 2021: How well is outer tropical cyclone size represented in the ERA5 reanalysis dataset? Atmos. Res., 249, 105339, https://doi.org/10.1016/j.atmosres.2020.105339.
Biswas, M. K., and Coauthors, 2018: Hurricane Weather Research and Forecasting (HWRF) Model: 2018 scientific documentation. Developmental Testbed Center, 112 pp.
Burg, T., and S. P. Lillo, 2020: Tropycal: A Python package for analyzing tropical cyclones and more. 34th Conf. on Hurricanes and Tropical Meteorology, online, Amer. Meteor. Soc., 16B.9, https://ams.confex.com/ams/34HURR/meetingapp.cgi/Paper/373965.
Cangialosi, J. P., 2020: National Hurricane Center forecast verification report: 2020 hurricane season. National Hurricane Center, 77 pp., https://www.nhc.noaa.gov/verification/pdfs/Verification_2020.pdf.
Cangialosi, J. P., E. Blake, M. DeMaria, A. Penny, A. Latto, E. Rappaport, and V. Tallapragada, 2020: Recent progress in tropical cyclone intensity forecasting at the National Hurricane Center. Wea. Forecasting, 35, 1913–1922, https://doi.org/10.1175/WAF-D-20-0059.1.
Chen, R., X. Wang, W. Zhang, X. Zhu, A. Li, and C. Yang, 2019: A hybrid CNN-LSTM model for typhoon formation forecasting. GeoInformatica, 23, 375–396, https://doi.org/10.1007/s10707-019-00355-0.
Chen, T., and C. Guestrin, 2016: XGBoost: A scalable tree boosting system. KDD’16 Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, Association for Computing Machinery, 785–794.
Chung, J., Ç. Gülçehre, K. Cho, and Y. Bengio, 2014: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv, 1412.3555, https://doi.org/10.48550/arXiv.1412.3555.
De Lathauwer, L., B. De Moor, and J. Vandewalle, 2000: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21, 1253–1278, https://doi.org/10.1137/S0895479896305696.
DeMaria, M., and J. Kaplan, 1994: A Statistical Hurricane Intensity Prediction Scheme (SHIPS) for the Atlantic basin. Wea. Forecasting, 9, 209–220, https://doi.org/10.1175/1520-0434(1994)009<0209:ASHIPS>2.0.CO;2.
DeMaria, M., M. Mainelli, L. K. Shay, J. A. Knaff, and J. Kaplan, 2005: Further improvements to the Statistical Hurricane Intensity Prediction Scheme (SHIPS). Wea. Forecasting, 20, 531–543, https://doi.org/10.1175/WAF862.1.
ECWMF, 2019: IFS documentation CY46R1—Part III: Dynamics and numerical procedures. ECMWF, No. 3, 31 pp., https://www.ecmwf.int/node/19307.
Gao, S., P. Zhao, B. Pan, Y. Li, M. Zhou, J. Xu, S. Zhong, and Z. Shi, 2018: A nowcasting model for the prediction of typhoon tracks based on a long short term memory neural network. Acta Oceanol. Sin., 37, 8–12, https://doi.org/10.1007/s13131-018-1219-z.
Giffard-Roisin, S., M. Yang, G. Charpiat, C. Kumler Bonfanti, B. Kégl, and C. Monteleoni, 2020: Tropical cyclone track forecasting using fused deep learning from aligned reanalysis data. Front. Big Data, 3, 1, https://doi.org/10.3389/fdata.2020.00001.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. The MIT Press, 800 pp.
Grinsted, A., P. Ditlevsen, and J. H. Christensen, 2019: Normalized US hurricane damage estimates using area of total destruction, 1900–2018. Proc. Natl. Acad. Sci. USA, 116, 23 942–23 946, https://doi.org/10.1073/pnas.1912277116.
Hamill, T., G. Bates, J. Whitaker, D. Murray, M. Fiorino, T. Galarneau, Y. Zhu, and W. Lapenta, 2013: NOAA’s second-generation global medium-range ensemble reforecast dataset. Bull. Amer. Meteor. Soc., 94, 1553–1565, https://doi.org/10.1175/BAMS-D-12-00014.1.
Harper, B., J. Kepert, and J. Ginger, 2010: Guidelines for converting between various wind averaging periods in tropical cyclone conditions. Tech. Doc. WMO/TD-1555, WMO, 64 pp., https://library.wmo.int/doc_num.php?explnum_id=290.
He, K., X. Zhang, S. Ren, and J. Sun, 2016: Deep residual learning for image recognition. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, Institute of Electrical and Electronics Engineers, 770–778.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Hinton, G., O. Vinyals, and J. Dean, 2015: Distilling the knowledge in a neural network. arXiv, 1503.02531, http://arxiv.org/abs/1503.02531.
Hodges, K., A. Cobb, and P. L. Vidale, 2017: How well are tropical cyclones represented in reanalysis datasets? J. Climate, 30, 5243–5264, https://doi.org/10.1175/JCLI-D-16-0557.1.
Kiela, D., and L. Bottou, 2014: Learning image embeddings using convolutional neural networks for improved multi-modal semantics. Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Association for Computational Linguistics, 36–45.
Kingma, D., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980, https://arxiv.org/abs/1412.6980.
Knaff, J., M. DeMaria, C. Sampson, and J. Gross, 2003: Statistical 5-day tropical cyclone intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18, 80–92, https://doi.org/10.1175/1520-0434(2003)018<0080:SDTCIF>2.0.CO;2.
Knapp, K. R., M. C. Kruk, D. H. Levinson, H. J. Diamond, and C. J. Neumann, 2010: The International Best Track Archive for Climate Stewardship (IBTrACS): Unifying tropical cyclone best track data. Bull. Amer. Meteor. Soc., 91, 363–376, https://doi.org/10.1175/2009BAMS2755.1.
Krizhevsky, A., I. Sutskever, and G. E. Hinton, 2017: ImageNet classification with deep convolutional neural networks. Commun. ACM, 60, 84–90, https://doi.org/10.1145/3065386.
LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, 1989: Backpropagation applied to handwritten zip code recognition. Neural Comput., 1, 541–551, https://doi.org/10.1162/neco.1989.1.4.541.
Lian, J., P. Dong, Y. Zhang, J. Pan, and K. Liu, 2020: A novel data-driven tropical cyclone track prediction model based on CNN and GRU with multi-dimensional feature selection. IEEE Access, 8, 97 114–97 128, https://doi.org/10.1109/ACCESS.2020.2992083.
Moradi Kordmahalleh, M., M. Gorji Sefidmazgi, and A. Homaifar, 2016: A sparse recurrent neural network for trajectory prediction of Atlantic hurricanes. GECCO’16: Proc. Genetic and Evolutionary Computation Conf. 2016, Denver, CO, Association for Computing Machinery, 957–964, https://doi.org/10.1145/2908812.2908834.
Mudigonda, M., and Coauthors, 2017: Segmenting and tracking extreme climate events using neural networks. 31st Conf. on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, Neural Information Processing Systems, https://dl4physicalsciences.github.io/files/nips_dlps_2017_20.pdf.
National Hurricane Center, 2021: Automated Tropical Cyclone Forecasting System (ATCF). National Hurricane Center (NHC), accessed 4 June 2021, https://ftp.nhc.noaa.gov/atcf/.
Paszke, A., and Coauthors, 2019: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, H. Wallach et al., Eds., Curran Associates, Inc., 8024–8035.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams, 1987: Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, D. E. Rumelhart and J. L. McClelland, Eds., MIT Press, 318–362.
Sampson, C., and A. J. Schrader, 2000: The Automated Tropical Cyclone Forecasting System (version 3.2). Bull. Amer. Meteor. Soc., 81, 1231–1240, https://doi.org/10.1175/1520-0477(2000)081<1231:TATCFS>2.3.CO;2.
Sampson, C., J. L. Franklin, J. A. Knaff, and M. DeMaria, 2008: Experiments with a simple tropical cyclone intensity consensus. Wea. Forecasting, 23, 304–312, https://doi.org/10.1175/2007WAF2007028.1.
Schenkel, B. A., and R. E. Hart, 2012: An examination of tropical cyclone position, intensity, and intensity life cycle within atmospheric reanalysis datasets. J. Climate, 25, 3453–3475, https://doi.org/10.1175/2011JCLI4208.1.
Shimada, U., H. Owada, M. Yamaguchi, T. Iriguchi, M. Sawada, K. Aonashi, M. DeMaria, and K. D. Musgrave, 2018: Further improvements to The Statistical Hurricane Intensity Prediction Scheme using tropical cyclone rainfall and structural features. Wea. Forecasting, 33, 1587–1603, https://doi.org/10.1175/WAF-D-18-0021.1.
Simon, A., A. B. Penny, M. DeMaria, J. L. Franklin, R. J. Pasch, E. N. Rappaport, and D. A. Zelinsky, 2018: A description of the real-time HFIP Corrected Consensus Approach (HCCA) for tropical cyclone track and intensity guidance. Wea. Forecasting, 33, 37–57, https://doi.org/10.1175/WAF-D-17-0068.1.
Su, H., L. Wu, J. H. Jiang, R. Pai, A. Liu, A. J. Zhai, P. Tavallali, and M. DeMaria, 2020: Applying satellite observations of tropical cyclone internal structures to rapid intensification forecast with machine learning. Geophys. Res. Lett., 47, e2020GL089102, https://doi.org/10.1029/2020GL089102.
Tan, C., F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, 2018: A survey on deep transfer learning. arXiv, 1808.01974, https://doi.org/10.48550/arXiv.1808.01974.
Van Rossum, G., and F. L. Drake Jr., 1995: Python Tutorial. Centrum voor Wiskunde en Informatica, 54 pp.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, 2017: Attention is all you need. 31st Conf. on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, Neural Information Processing Systems, 11 pp., https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Yosinski, J., J. Clune, Y. Bengio, and H. Lipson, 2014: How transferable are features in deep neural networks? arXiv, 1411.1792, https://doi.org/10.48550/arXiv.1411.1792.
Zou, H., and T. Hastie, 2005: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc., 67B, 301–320, https://doi.org/10.1111/j.1467-9868.2005.00503.x.