TCDetect: a new method of detecting the presence of tropical cyclones using deep learning

Tropical cyclones are high-impact weather events that have large human and economic effects, so it is important to be able to understand how their location, frequency, and structure might change in a future climate. Here, a lightweight deep learning model is presented that is intended for detecting the presence or absence of tropical cyclones during the execution of numerical simulations for use in an online data reduction method. This will help to avoid saving vast amounts of data for analysis after the simulation is complete. With run-time detection, it might be possible to reduce the need for some of the high-frequency high-resolution output that would otherwise be required. The model was trained on ERA-Interim reanalysis data from 1979 to 2017, and the training was concentrated on delivering the highest possible recall rate (successful detection of cyclones) while rejecting enough data to make a difference in outputs. When tested using data from the two subsequent years, the recall, or probability of detection, was 92%. The precision, or success ratio, was 36%. For the desired data reduction application, if the desired target included all tropical cyclone events, even those that did not obtain hurricane-strength status, the effective precision was 85%. The recall rate and the area under the precision-recall curve (AUC-PR) compare favorably with other methods of cyclone identification while using the smallest number of parameters for both training and inference.


Introduction
Tropical cyclones (TCs) are extreme weather events that can leave devastating effects on human populations; for example, Hurricane Irma impacted the Caribbean Islands and the southeast United States in September 2017, causing 47 direct deaths, 82 indirect deaths, hundreds of injuries, and an estimated monetary damage of around USD 50 billion (Cangialosi et al. 2018). Climate models can be used to understand how the properties of TCs and other meteorological phenomena might evolve in a changing climate, but such global circulation models (GCMs) produce large amounts of data. A method to reduce the data volumes in order to target analysis would be useful, and such a method is presented here that is based on a deep learning model that detects the presence of tropical cyclones in simulation output. Simulated time steps are split into eight regions that the deep learning model uses to infer the presence or absence of TCs. The method is intended for eventual deployment online (i.e., while a simulation model is running) to preclude the need to output data that does not include TCs (at least for the situation where TCs are the product of interest). Hence, the ability to not miss TCs is more important than maximum data reduction, as any analysis would be impaired if TC occurrences are missed. The method presented here is lightweight, with relatively short training and inference times when tested offline (i.e., after the meteorological data are output), and requires no explicit a priori thresholds in meteorological variables. It is also shown to perform similarly to other more standard, and more complex, deep learning models.

Motivation
Data volumes from climate simulations are huge. The current phase of the Climate Model Intercomparison Project (CMIP6; Eyring et al. 2016) comprises hundreds of different model simulations and was projected to produce 18 PB of data (Balaji et al. 2018). While it may not reach that volume, it will be close, and that total does not include the data that were produced and analyzed in the production of the archived data. Whether in or out of the CMIP archive, such data are costly to store and maintain, and the volume makes analysis difficult.
A method of automatically detecting interesting phenomena in a model before saving the data could have two major benefits: 1) a fast way of finding and tabulating summary data (without writing out the actual data used), or 2) a method for reducing the need to write all the data to disk for subsequent analysis; for example, only data from a specific region and time where an event was present could be saved.
In both cases, this would likely increase scientific productivity, in that there would be significant savings in time and in the volume of data saved, leading to more efficient science (efficient in time saved, storage costs, and storage energy consumption).
The possibility of efficiencies arises because, although many simulations are carried out to target multiple use cases, some are carried out to investigate specific phenomena (e.g., when checking the impact of resolution on simulated TCs as in Roberts et al. 2015). In these cases, data relating to other phenomena might not be needed. However, currently, in order to be able to retrieve the data for specific phenomena, the simulation will store sufficient data for postprocessing analysis at fixed intervals. The first postprocessing step then involves the retrieval of relevant data only; the other data are not used.
To select the correct data for analysis, it is important to have confidence in the method used for identifying the feature of interest. There is sometimes a conflict between the abstract notion of the feature of interest (in this case, a TC) and the practical implementation of a definition for a TC. The latter is intimately related to the tool for discovering it; for example, if the practical definition of a TC is the same as the metric for detecting it, of course we have confidence in it, but this definition may miss things we would abstractly consider to be TCs, or detect phenomena we would not consider to be TCs but that fall inside a poorly drawn definition. This could be seen in the differentiation between a TC of category 1 on the Saffir-Simpson scale and a tropical storm. Both show similar properties, but the main difference is the strength of the associated 10-m wind speeds; hence two definitions of a TC, one that includes 10-m wind strength and another that does not, would not produce the same outcome. Many previous techniques for TC identification in numerical data conflate the detection method with the definition. Deep learning can avoid this conflation, as the definition needs to be clearly laid out before any deep learning work is undertaken, so understanding the distinction is important.
Although our initial interest is in detecting TCs, the conceptual method is expected to be extensible to other important phenomena such as fronts and atmospheric rivers.

Previous work
Several methods used to detect TCs have been developed. Most operate by using thresholds set for a few meteorological variables to determine the presence of a tropical cyclone. The use of thresholds leads to two problems: setting such thresholds involves scientific subjectivity, and the combination of method and threshold may not be transferable across different models, data, or climates affected by climate change. More recently, deep learning has been used, and while deep learning may suffer from aspects of the transferability problem, it should be possible to avoid subjectivity.

a. TC detection using conventional techniques
Conventional techniques for identifying TC centers usually work by applying various thresholds to the available data. A few examples of such methods follow, with a tabular summary in Table 1. Vitart et al. (1997) used the closest minimum of mean sea level pressure (MSLP) to a local maximum of relative vorticity at 850 hPa over 3.5 × 10⁻⁵ s⁻¹ as a storm center. A warm-core check is performed to classify the storm center as a TC. This requires that the closest local maximum of the average temperature between 550 and 200 hPa must be within 2° of the storm center, and the temperature must decrease by at least 0.5°C for at least 8° latitude in all directions. Also, the closest maximum thickness between 1000 and 200 hPa must be within 2° of the storm center, and the thickness must decrease by at least 50 m for at least 8° latitude in all directions. Camargo and Zebiak (2002) introduce a detection method that uses vorticity at 850 hPa, surface wind speed, and a vertically integrated temperature anomaly as variables on which to impose basin-dependent thresholds. Kleppek et al. (2008) use multiple thresholds to identify the TC centers. The first is that a local minimum of sea level pressure (SLP) needs to be observed within a neighborhood of an 8 × 8 square of grid points. This is assigned as the storm center. For it to be designated a TC center, there needs to be a maximum relative vorticity at 850 hPa above 5 × 10⁻⁵ s⁻¹ at the storm center. The presence of vertical wind shear between 850 and 200 hPa of at least 10 m s⁻¹ is also required, as well as an event lifetime of 36 or more hours. Finally, if the storm center is over land, the relative vorticity condition has to be fulfilled or the wind speed maximum at 850 hPa needs to be inside 250 km from the TC center.
A final example is that of Roberts et al. (2015), who use the method explained by Hodges (1995, 1996, 1999) and Bengtsson et al. (2007), where a TC is identified by a maximum of 850-hPa relative vorticity in data that have been spectrally filtered to keep features greater than 250 km in scale, and a warm-core check similar to that performed by Vitart et al. (1997) using features larger than 180 km using the 850-, 500-, 300-, and 200-hPa levels.

b. TC detection using deep learning
A newer crop of algorithms has been developed to detect and track TCs using deep learning methods. These are summarized in Table 2. Liu et al. (2016) used an image cropped in such a way that if a TC was present, it was centered in the image. They used eight different meteorological variables: surface level pressure, meridional and zonal wind speed at 850 mb (1 mb = 1 hPa) and at the lowest available model level, temperature at 200 and 500 mb, and total vertically integrated precipitable water. They then predicted whether the image was one of a TC or not. The model obtained a 99% accuracy with a relatively simple model, but the preprocessing cropping step involved significant noise reduction that would have helped obtain good performance. Racah et al. (2017) created a method where a deep learning model takes in a snapshot of the world simulated by the CAM5 climate model with 16 different meteorological variables and creates bounding boxes around the detected TCs.
The variables used were total precipitation rate, surface pressure, sea level pressure, reference height humidity, temperature at 200 and 500 mb, total vertically integrated precipitable water, reference height temperature, radiative surface temperature, meridional and zonal wind speed at 850 mb and at the lowest available model level, and geopotential at 100 and 200 mb and the lowest model level height. The architecture used was that of an autoencoder with three smaller networks using the bottleneck layer to draw a box around a suspected TC. Given the size of the inputs and number of kernels used in the convolution layers, the model presented was expected to be time consuming to train. It certainly required supercomputing: an adaptation of this deep learning model was trained by Kurth et al. (2017) using 9622 nodes of 68 cores each with a peak throughput of 15.04 petaflops (PF) s⁻¹ and reached a sustained throughput of 13.27 PF s⁻¹, although the total time to train was not reported. The accuracy for this model was specified as the percentage of overlap between the predicted box and the box given as the ground truth (an intersection over union, IOU), where the ground truth was created using the Toolkit for Extreme Climate Analysis (TECA) (Prabhat et al. 2012, 2015). This is a collection of climate analysis algorithms aimed at extreme event detection and tracking. Of the predicted boxes, 24.74% had an overlap of at least 10% with the ground truth, while 15.53% had an overlap of at least 50% with the ground truth.
Mudigonda et al. (2017) created a deep learning model that used integrated water vapor (IWV) snapshots and image segmentation techniques to classify whether each pixel in an image was a part of a TC. It used an adaptation of the Tiramisu model, which applies the DenseNet architecture to semantic segmentation. The labels were created using TECA and Otsu's method (Otsu 1979). It was trained and tested on images in which at least 10% of the pixels were part of a TC, and an accuracy of 92% was obtained.
Similarly, Kumler-Bonfanti et al. (2020) used a U-net network to perform image segmentation for TCs using the total precipitable water field from a weather forecast model: the National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS). A U-net network is very similar to an autoencoder network, with the difference being that connections between the two branches of the network are used to convey any information missed while creating the dense representation of the inputs at the end of the encoder branch of the network. The inputs used also take in a measure of time, as two snapshots are given to the deep learning model, one approximating the state of the atmosphere at the time the inference was initiated and another having the forecasted state of the atmosphere three hours in the future. Labels of areas belonging to a TC were generated by creating a 25 × 25 pixel box, approximating 300 km², around a latitude and longitude pair obtained from the International Best Track Archive for Climate Stewardship (IBTrACS; Knapp et al. 2010, 2018) dataset. This model managed to obtain an IOU of 75%, but it should be noted that the labeling boxes, taken as the ground truth, were of the same size, so it is possible that the value for the IOU is inflated.
Table 2 (recovered row fragment for Kumler-Bonfanti et al. 2020): GFS weather model; IBTrACS; U-net of 6 layers; IOU 75%; 720 × 361 pixels; 0.5° resolution.

Deep learning model
This section presents the data used to train the deep learning model and the model architecture, and summarizes the method used to develop it. Full details of the training appear in the appendices.

a. Data
The deep learning model TCDetect uses meteorological data from a region to detect the presence of a TC. An output higher than 0.5 signifies that a TC was detected, while a value of 0.5 or lower implies the absence of a TC. TCDetect was trained, tested, and validated on data extracted from the ERA-Interim reanalysis dataset (Dee et al. 2011), with the validation data used for manual hyperparameter tuning as described in appendix B. The testing set was used for producing the final testing statistics and for interpreting the results produced by the trained model.
Five 6-hourly fields from 1 January 1979 until 31 July 2017 were used: MSLP, 10-m wind speed, and relative vorticity at 850, 700, and 600 hPa, each at a spatial resolution of approximately 0.7° × 0.7°. These fields were chosen because they had been used in previous TC detection algorithms and produced the best-performing deep learning model during hyperparameter tuning, as shown in appendix B. Also, the results obtained from TCDetect are compared to those obtained from a non-machine learning (ML) algorithm in a companion paper (Galea and Lawrence 2023, manuscript submitted to Artif. Intell. Earth Syst.); hence, the choice of inputs makes the comparison more faithful. Spherical filtering was performed on each field to reduce some of the smaller scale features. For the MSLP and 10-m wind speed fields, spherical harmonic filtering is performed to keep wavenumbers between 5 and 106. The vorticity fields were spherical harmonic filtered between wavenumbers 1 and 63. These field and spherical filtering options were chosen to match those used in the TC tracking algorithm developed in Hodges (1995, 1996, 1999).
Each field was further split into eight regions as shown in Fig. 1. These regions were loosely based on those used by the IBTrACS dataset. The regions were collected into a single dataset that TCDetect was then trained on. Thus, the entire dataset included 450 944 individual regions (14 092 days × 4 time steps per day × 8 regions per time step), each with an 86 × 114 gridpoint domain with 5 channels.
Labels for these cases were derived from the IBTrACS dataset. IBTrACS contains temporal information, including category, latitude, and longitude of the storm center, for all major storms across the globe. The labels were set up in such a way that each case was labeled according to the presence or absence of a TC recorded in IBTrACS. At the end of the labeling process, 22 826 (5.06%) positive labels were generated as well as 428 118 (94.94%) negative labels.
This dataset was used for training and validation; it was split by taking data from 1979, 1986, 1991, 1996, 2001, 2006, 2011, and 2016 to make up the validation set and the rest of the data to make up the training set. Spreading the validation years across the record in this way was intended to account for the possible effects of a changing climate, so that any hyperparameter tuning performed would not be skewed toward one part of the period.
Additional data from 1 August 2017 until 31 August 2019 were used as a testing set. This had a total of 24 352 cases, 1342 (5.51%) with and 23 010 (94.49%) without a TC. Table 3 shows how the splits are made and that the split between positive and negative cases is similar across training and testing.
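The case counts and the year-based split quoted above can be reproduced from the stated date ranges. The following is a minimal sketch using only the Python standard library; the function and constant names are illustrative and are not taken from the TCDetect code.

```python
from datetime import date

STEPS_PER_DAY = 4      # 6-hourly fields
REGIONS_PER_STEP = 8   # regions per time step (Fig. 1)
VALIDATION_YEARS = {1979, 1986, 1991, 1996, 2001, 2006, 2011, 2016}


def n_days(start, end):
    """Number of calendar days from start to end, both inclusive."""
    return (end - start).days + 1


def n_cases(start, end):
    """Number of region-level cases between two dates, both inclusive."""
    return n_days(start, end) * STEPS_PER_DAY * REGIONS_PER_STEP


def split_for(day):
    """Return 'validation' or 'training' for a date in the 1979-2017 period."""
    return "validation" if day.year in VALIDATION_YEARS else "training"


train_val_start, train_val_end = date(1979, 1, 1), date(2017, 7, 31)
test_start, test_end = date(2017, 8, 1), date(2019, 8, 31)

print(n_days(train_val_start, train_val_end))   # 14092
print(n_cases(train_val_start, train_val_end))  # 450944
print(n_cases(test_start, test_end))            # 24352
print(split_for(date(1996, 9, 5)))              # validation
```

The three counts agree with the 14 092 days, 450 944 training and validation cases, and 24 352 testing cases given in the text.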
All data were preprocessed to reduce resolution to 1/16 of the original ERA-Interim resolution by taking the mean value of all data points inside a 4 × 4 box, reducing the dimensionality of each case to a 22 × 29 gridpoint domain with 5 channels. This reduction of resolution was arrived at during hyperparameter tuning. For each channel (variable), standardization was employed on each individual field according to

$$x' = \frac{x - \mu_{\mathrm{field}}}{\sigma_{\mathrm{field}}},$$

where $\mu_{\mathrm{field}}$ and $\sigma_{\mathrm{field}}$ are the mean and standard deviation of the values in that field for that case. The resulting standardized fields have a mean of 0 with a standard deviation of 1. Figure 2 shows an example of the data used, before and after preprocessing, from the time when Hurricane Katrina obtained its maximum strength, 1800 UTC 28 August 2005.
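The two preprocessing steps can be sketched in a few lines of NumPy. This is an illustration rather than the authors' code: the function names are hypothetical, the input is synthetic, and because 86 and 114 are not multiples of 4, the sketch assumes that the partial boxes at the trailing edges are averaged over the points that exist, which is one way to arrive at the stated 22 × 29 grid.

```python
import numpy as np


def coarsen_mean(case, box=4):
    """Average non-overlapping box x box blocks of a (rows, cols, channels) array.

    Assumption: trailing rows/columns that do not fill a whole box are
    averaged over the points that exist (NaN padding followed by nanmean).
    """
    rows, cols, channels = case.shape
    pad_r = (-rows) % box
    pad_c = (-cols) % box
    padded = np.pad(case.astype(float), ((0, pad_r), (0, pad_c), (0, 0)),
                    constant_values=np.nan)
    blocks = padded.reshape((rows + pad_r) // box, box,
                            (cols + pad_c) // box, box, channels)
    return np.nanmean(blocks, axis=(1, 3))


def standardize(case):
    """Standardize each channel of one case to zero mean and unit std."""
    mean = case.mean(axis=(0, 1), keepdims=True)
    std = case.std(axis=(0, 1), keepdims=True)
    return (case - mean) / std


rng = np.random.default_rng(0)
raw = rng.normal(loc=1000.0, scale=15.0, size=(86, 114, 5))  # synthetic stand-in
prepared = standardize(coarsen_mean(raw))
print(prepared.shape)  # (22, 29, 5)
```

Standardizing per case and per field, as described in the text, means the model sees the spatial pattern of each field rather than its absolute magnitude.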

b. Architecture
TCDetect uses a convolutional base attached to a fully connected classifier that outputs a value between 0 and 1, with any values larger than 0.5 signifying that the model detects a TC. A more detailed explanation of the model architecture can be found in appendix A, while a graphical view is shown in Fig. 3.
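The Fig. 3 caption states that a 2 × 2 convolution removes one row and one column, that the following 2 × 2 MaxPool removes one more of each, and that this pair is applied four times in total. The short sketch below traces the spatial size of the feature maps under those statements. It assumes unpadded convolutions and a pooling stride of 1 (inferred from the one-row reduction, not stated explicitly), and that every repeat uses the same window sizes; the helper names are illustrative.

```python
def conv_valid(shape, kernel=2):
    """Spatial size after an unpadded convolution with a square kernel."""
    rows, cols = shape
    return rows - kernel + 1, cols - kernel + 1


def max_pool(shape, window=2, stride=1):
    """Spatial size after max pooling with a square window."""
    rows, cols = shape
    return (rows - window) // stride + 1, (cols - window) // stride + 1


shape = (22, 29)          # preprocessed input grid
trace = [shape]
for _ in range(4):        # one block plus "three more times"
    shape = max_pool(conv_valid(shape))
    trace.append(shape)

print(trace)  # [(22, 29), (20, 27), (18, 25), (16, 23), (14, 21)]
```

The number of feature maps in the later blocks is not given in the caption, so only the spatial dimensions are tracked here.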
To arrive at the model architecture presented, stepwise manual hyperparameter tuning was used to determine which changes to the architecture performed well (see appendix B). Various metrics could have been used to determine how the architecture should be changed to produce a better-performing model. Accuracy, defined by

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives, was considered, but was not suitable due to the large class imbalance present in our training set. Had accuracy been the chosen metric, the model would learn to infer "no TC" for all cases. This leads to TN being a high number, producing a high accuracy, but a model with no skill. Given that obtaining the highest possible number of TCs is important for the use of the developed model, recall, defined by

$$\mathrm{recall} = \frac{TP}{TP + FN},$$

could have been used, as a high value would indicate that the number of false negatives, that is, TC cases that were not detected, is small. However, this would not have considered the need to detect non-TC cases as such. For this, precision, defined by

$$\mathrm{precision} = \frac{TP}{TP + FP},$$

could be used, as a high value would indicate that the number of false positives, that is, non-TC cases inferred as TC cases, is small. To get the right balance between the two functions of the model, the area under the precision-recall curve (AUC-PR) was used. This gives a single value that takes into account the two important functions of the model. This value can still be ambiguous, as the same value could be produced by high precision and low recall, by high recall and low precision, or by average recall and precision. However, as the model was being developed primarily to identify data for further postprocessing, false negatives would be a bigger problem than false positives. Therefore, when a hyperparameter change was assessed, improvements in recall were favored over improvements in precision if AUC-PR varied only marginally or if the balance between recall and precision needed to be addressed.
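The behavior of these metrics under class imbalance can be checked numerically. The sketch below uses the testing-set class counts quoted earlier (1342 TC and 23 010 non-TC cases) to show that a model that always infers "no TC" scores about 94.5% accuracy with zero recall. The `auc_pr` function is one simple way of integrating a precision-recall curve (trapezoidal rule over a sweep of the decision boundary) and is an assumption of this sketch; it is not necessarily the estimator used during training.

```python
import numpy as np


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc_pr(scores, labels):
    """Trapezoidal area under the precision-recall curve.

    Cases are ranked by score; each prefix of the ranking corresponds to
    one choice of decision boundary. Ties in score are not treated specially.
    """
    order = np.argsort(-np.asarray(scores), kind="stable")
    ranked = np.asarray(labels)[order]
    tp = np.cumsum(ranked)
    fp = np.cumsum(1 - ranked)
    prec = tp / (tp + fp)
    rec = tp / ranked.sum()
    # Start the curve at recall 0 with the precision of the top-ranked case.
    rec = np.concatenate([[0.0], rec])
    prec = np.concatenate([[prec[0]], prec])
    return float(np.sum(np.diff(rec) * (prec[1:] + prec[:-1]) / 2))


# A model that always infers "no TC" on the testing set.
always_no = dict(tp=0, fp=0, tn=23010, fn=1342)
print(round(accuracy(**always_no), 4))  # 0.9449
print(recall(always_no["tp"], always_no["fn"]))  # 0.0

# A perfect ranking of four illustrative cases gives an AUC-PR of 1.
print(auc_pr([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```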
The training of the deep learning model was done using an NVIDIA Volta 100 GPU on a node having 32 GB of RAM and 32 CPU cores. Some initial tests were performed on a cloud instance provided by Oracle, while the main model development was performed on the Joint Analysis System Meeting Infrastructure Needs (JASMIN) platform (Lawrence et al. 2012). The software packages used were Python 3.6.8 and TensorFlow, version 2.20 (Abadi et al. 2015). The training set was traversed 21 times for the model to converge to a solution, with a total time to train of 12 min. Although the time taken to train the final model was relatively short, much more time was taken up in progressing through the various optimizations detailed in appendix B, mainly due to the use of tenfold cross validation.

Results
The resulting deep learning model, TCDetect, was evaluated using the test set described above. The inferences obtained were also investigated to understand how the model generates its results. We present these results in this section.
The outcome could have been further varied by changing the value that is the boundary between a positive and a negative prediction, currently 0.5. Figure 4 shows the AUC-PR curve for the model with the values at each point signifying

FIG. 2. An example of the data that were used to train TCDetect. The data are from the reanalysis at the point where Hurricane Katrina obtained its maximum strength (1800 UTC 28 Aug 2005). (left) The original data from ERA-Interim and (right) how these data are transformed after preprocessing. The rows are as follows (in reverse order of height above the surface): MSLP, 10-m wind speed, and the vorticity at 850, 700, and 600 hPa.

b. Comparison with standard models
There are many existing deep learning standard models, so it is reasonable to ask whether any of those do better than the model developed here.
To test this, the convolutional base, that is, the part of the network that learns spatial patterns, from a variety of standard model architectures was used in conjunction with the classifier developed here and compared. The convolutional bases of the following standard architectures were used: DenseNet121, DenseNet169, and DenseNet201 (Huang et al. 2017 (Chollet 2017). The weights of all of these bases were obtained by training on the ImageNet dataset (Deng et al. 2009). These were then frozen and the weights of the classifier were retrained on data from all regions with the presented model's learning rate, momentum value, and L2 normalization factor.
Given that these convolutional bases required inputs of at least 75 × 75 pixels with 3 channels, some changes to the inputs were required. First, as an input with only 3 channels is required for most of the above architectures, the fields retained were those of relative vorticity at 850 hPa, relative vorticity at 600 hPa, and MSLP. This choice was made as these three fields were deemed the most important for the model being presented by tests detailed in section 5a. Second, the input size was extended fivefold from 22 × 29 to 110 × 145 pixels by interpolating any intermediate values.
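The fivefold enlargement can be illustrated with separable linear interpolation in NumPy. The text does not say which interpolation scheme was used, so linear interpolation is an assumption here, as are the function name and the synthetic input; the sketch only shows how a 22 × 29 field with 3 retained channels becomes a 110 × 145 × 3 input.

```python
import numpy as np


def upscale_linear(field, factor=5):
    """Enlarge a 2D field by linear interpolation along each axis in turn."""
    rows, cols = field.shape
    new_r = np.linspace(0, rows - 1, rows * factor)
    new_c = np.linspace(0, cols - 1, cols * factor)
    # Interpolate along the column direction for every original row ...
    wide = np.stack([np.interp(new_c, np.arange(cols), row) for row in field])
    # ... then along the row direction for every (new) column.
    return np.stack([np.interp(new_r, np.arange(rows), col) for col in wide.T],
                    axis=1)


rng = np.random.default_rng(1)
case = rng.normal(size=(22, 29, 3))  # synthetic stand-in for the 3 retained fields
enlarged = np.stack([upscale_linear(case[..., k]) for k in range(3)], axis=-1)
print(enlarged.shape)  # (110, 145, 3)
```

Linear interpolation adds no new information to the fields; it only satisfies the minimum input size of the pretrained convolutional bases.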
Of the standard architectures tested, none managed to obtain better AUC-PR or loss values on the test set (Fig. 5). Table 5 compares the complexity of some of these more standard models and their performance metrics with TCDetect. All of the more standard models had far higher complexity in terms of the number of parameters used than TCDetect, and the latter also outperforms the others in terms of AUC-PR, precision rate, and loss. While the recall rate obtained by TCDetect does not outperform all of the other models, it is in the top third of the list.
This shows that standard image processing models that are pretrained on a benchmark dataset did not perform very well when applied to meteorological data. This is expected, as the latter has different properties than, for example, images of animals, and the standard models had not been trained with any meteorological data. Therefore, the previous analysis was redone, this time with the whole network retrained, not just the fully connected classifier. When this is performed, a different result is obtained. As can be seen in Fig. 6 and tabulated in Table 6, the AUC-PR and loss values for all models are much closer. Similarly, the recall and precision rates are very close across the different models. Encouragingly, TCDetect is only outperformed by 1% for recall rate, but the maximum precision obtained is 44% by ResNet50, 8 percentage points higher than TCDetect. This shows that standard models pretrained on standard image processing benchmark datasets are not optimized for meteorological data. This analysis also shows that TCDetect is comparable to standard deep learning architectures when these are trained on the same meteorological dataset, despite having less complexity in terms of the number of parameters used in the network.

FIG. 3. Visual representation of the architecture of TCDetect. The inputs, having 22 rows, 29 columns, and 5 fields, are passed through a 2 × 2 convolution window whose weights are learnt, producing 8 feature maps but losing one row and column. The resulting feature maps are passed through a MaxPool window, of size 2 × 2, which takes the maximum value in its window. This further reduces the feature maps' size by a row and a column, to 20 rows and 27 columns. This is repeated three more times, with each step producing more feature maps. The final ones are combined and reshaped into one long array. This array is used as the input to a fully connected layer of 128 nodes, each of which outputs a single value. Hence, 128 values are now obtained and are used as inputs to the next layer. This is repeated three more times, ending with the final value.

c. Comparison with threshold-based method
It is reasonable to ask what, if any, value deep learning adds beyond simple threshold-based methods. To address this, and recognizing that most existing threshold-based methods use relative vorticity at 850 hPa as one of the criteria when detecting and tracking TCs, we use this variable for a simple comparison.
As we discussed in section 2, the value of a cutoff for such a threshold is subjective and can be data dependent, and so we investigate a range of possible thresholds. A case is classified  Clearly this is a fast method, and it is capable of getting either good accuracy or good recall, but it struggles with precision (as do all techniques). Given that for our use case we want the best possible recall, the precision rate of 6% at high recall values is far below that which can be achieved with deep learning (36%).
One might then ask, what about more sophisticated threshold and tracking methods, and what is it about the meteorology that drives these results? We address these issues in a companion paper (Galea et al. 2023, manuscript submitted to Artif. Intell. Earth Syst.), which compares TCDetect with a more sophisticated method, and investigates the interaction of the meteorology with the detection technique.

Model explainability
Deep learning outcomes can be inscrutable and can arise from unexpected aspects of the inputs, so in order to trust the inferences, it is helpful to try to explain aspects of the outcomes in terms of the inputs and the process. In this section, four aspects are investigated: feature importance, to determine which inputs influence the inferences most; feature strength, that is, how the strength of a TC influences the results; how results are influenced by regional location; and how the size of the training dataset used affects the model's performance. In the discussion that follows, it is important to remember that the data used as input to TCDetect are from ERA-Interim, which is itself an imperfect representation of reality. It does not give a perfect reconstruction of history, because some physical processes cannot be simulated, so there will inevitably be an element of discrepancy that can be explained by the data used as input, and not the deep learning tools themselves. Also, as a reanalysis dataset, ERA-Interim is a compromise between observations and modeling. This is most notable in the strength of 10-m wind speeds, which do not match reality (since they represent values over a larger spatial scale than the observations). This introduces noise in the data that may inhibit TCDetect from obtaining better performance. Finally, not all TCs that occurred are part of IBTrACS, and strength classification might not be correct due to a lack of observational data.

a. Input feature importance
It is possible to quantify the relative importance of each input field to the learned output. Two methods are employed for this: the Breiman method (Breiman 2001) and the Lakshmanan method (Lakshmanan et al. 2015).
The Breiman method, sometimes called the single-pass permutation test, involves permuting the data from one field across all the test cases and then retesting the model with this modified dataset. A decrease in the model's performance is expected, with the most important field obtaining the largest decrease in performance.
The Lakshmanan method, sometimes called the multipass permutation test, involves several steps. First, each field is permuted individually, as in the previous method. Once the field with the most importance, that is, the field that produces the largest decline in performance, is found, it is kept permuted, while the other fields are permuted individually. The next most important field is now found by repeating the algorithm on the remaining fields. This process is repeated until all the fields are permuted.
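Both permutation tests can be written compactly for any scoring function. The sketch below is illustrative only: `score_fn` stands in for evaluating a trained model (for example, by AUC-PR), the toy scorer and synthetic data are assumptions made so that the example runs on its own, and each test is run once rather than averaged over repetitions.

```python
import numpy as np


def permute_field(x, field, rng):
    """Copy of x (cases, rows, cols, fields) with one field shuffled across cases."""
    out = x.copy()
    out[..., field] = x[..., field][rng.permutation(len(x))]
    return out


def single_pass(score_fn, x, y, rng):
    """Breiman-style test: score drop when each field is permuted on its own."""
    baseline = score_fn(x, y)
    return {f: baseline - score_fn(permute_field(x, f, rng), y)
            for f in range(x.shape[-1])}


def multi_pass(score_fn, x, y, rng):
    """Lakshmanan-style test: repeatedly fix the most damaging permutation."""
    current = x.copy()
    remaining = list(range(x.shape[-1]))
    ranking = []
    while remaining:
        trials = {f: permute_field(current, f, rng) for f in remaining}
        scores = {f: score_fn(trials[f], y) for f in remaining}
        worst = min(scores, key=scores.get)   # largest decline in performance
        current = trials[worst]               # keep this field permuted
        ranking.append((worst, scores[worst]))
        remaining.remove(worst)
    return ranking


# Toy stand-in for a trained model: the "prediction" depends only on field 0.
def score_fn(x, y):
    predicted = x[..., 0].mean(axis=(1, 2)) > 0
    return float(np.mean(predicted == y))


rng = np.random.default_rng(42)
x = rng.normal(size=(400, 2, 2, 3))
y = x[..., 0].mean(axis=(1, 2)) > 0

drops = single_pass(score_fn, x, y, rng)
ranking = multi_pass(score_fn, x, y, rng)
print(max(drops, key=drops.get))  # 0
print(ranking[0][0])              # 0
```

In this toy setting both tests identify field 0 as the most important, since it is the only field the stand-in scorer uses.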
Both methods were performed 30 times each using the testing dataset, and an average was taken to ensure consistent and robust results; Fig. 7 shows the results. The most important field was found to be that of relative vorticity at 850 hPa, similar to what Roberts et al. (2015) found, with the Breiman method showing a decrease in AUC-PR from 0.7173 to 0.0936. The Breiman method ranks the remaining fields, in decreasing order of importance, as follows: MSLP, relative vorticity at 600 hPa, relative vorticity at 700 hPa, and 10-m wind speed. The Lakshmanan method shows a slightly different ranking, with MSLP demoted from being the second-most important field to the fourth-most important field. Not much should be read into this slight change, as the difference in AUC-PR values is minimal. Together these suggest the most important field is that of relative vorticity at 850 hPa and the second-most is that of relative vorticity at 600 hPa. If this is because the model is matching areas of high relative vorticity at 850 and 600 hPa, then this would be consistent with a physical interpretation that it is checking for horizontal flow deformations associated with deep convection.

b. Performance by strength of tropical cyclone
A manual exploration of instances incorrectly classified by the deep learning model being presented here indicated that stronger tropical cyclones are detected with more success.To provide quantitative support for this conclusion, the recall rate stratified by the tropical cyclone category on the input labels was examined.This allowed us to examine the proportion of positively labeled cases being correctly classified as a function of labeled cyclone strength.
Table 8 shows very high recall for all Saffir-Simpson scale (Kelman 2013) categories, as recorded in the IBTrACS database.The model has a recall rate of 88% for tropical cyclones of category 1 weakest class of TC) and a perfect recall rate for category 5 (the strongest).As expected, an increasing trend of recall against category can be seen as higher category cyclones are easier to detect.
A possible reason for this trend of increasing recall with strength is that the deep learning model developed used data at 1/16 of ERA-Interim's original resolution. This means that the model was using data with a resolution of around 2.8°, or around 280 km. While this value was chosen during the manual hyperparameter search as detailed in appendix B, and optimizes the overall results, it is likely to impact the representation of the weakest features the most.
The hypothesis as to why the hyperparameter search selected such averaging is that the coarsening of the data filtered out some of the noise present in the data while still preserving the structure of any system present.
The nature of the cases labeled by TCDetect as TCs was also investigated and compared to expectations in Table 9. It was found that only 506 out of the 3397 cases that obtained a positive inference from the model were cases that had no meteorological system present. This suggests that the deep learning model has learnt the required pattern but is mislabeling weaker features as TCs, despite the averaging discussed earlier.
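The averaging referred to here can be pictured as a block mean over non-overlapping groups of grid points. The sketch below assumes a factor of 4 in each direction (consistent with the 1/16 figure) and uses a hypothetical function name. How partial blocks at the edges were treated is not stated; dropping them is an assumption made here and need not reproduce the grid sizes used for TCDetect.

```python
import numpy as np


def block_mean(field, factor=4):
    """Coarsen a 2D field by averaging non-overlapping factor x factor blocks."""
    rows, cols = field.shape
    # Assumption: trailing rows/columns that do not fill a whole block are dropped.
    rows_kept, cols_kept = rows - rows % factor, cols - cols % factor
    trimmed = field[:rows_kept, :cols_kept]
    blocks = trimmed.reshape(rows_kept // factor, factor, cols_kept // factor, factor)
    return blocks.mean(axis=(1, 3))


example_field = np.arange(64, dtype=float).reshape(8, 8)
coarse = block_mean(example_field, factor=4)
```

Averaging in this way acts as a simple low-pass filter, which is in line with the noise-reduction hypothesis given above.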

c. Impact of location
Having investigated the impact of the strength of the features of interest and which fields influence the results the most, the next question is to what extent the results are dependent on regionality.
During development, due to time and computational constraints, the manual hyperparameter tuning was carried out on only the western Atlantic and western Pacific (WAWP) regions (see appendix B). When doing this, two linked assumptions were made: any change made to the architecture that caused an improvement in performance would result in a similar improvement when the architecture was trained and tested on data from all regions; a model trained on data from the WAWP regions would generalize well when tested on data from all regions of the world.
The first and third columns of Table 10 show that the first assumption holds, although it can be seen that the magnitude of the improvements between the two models can vary. Also, as shown in Table 11, the architecture has similar performance when trained and tested only on data from the WAWP regions and when trained and tested on data from all regions.
However, the second assumption was found not to hold. The first two columns of Table 10 show that a model trained on WAWP data decreased in performance considerably when tested on data from all regions. This is mirrored when using the final models, as shown in Table 11.
A possible reason for this is that the background meteorological state in the WAWP regions differs from that of the whole world. Figure 8 shows the mean state from the WAWP regions in the left column and the mean state of all regions in the right column, with each row corresponding to each variable used, that is, MSLP, 10-m wind speed, and relative vorticity at 850, 700, and 600 hPa. The individual regions all differ significantly, which leads to the noise seen in the average. These differences could be confounding the results when the model is trained on TCs seen only in one region.
To further understand how the model trained on WAWP data differs from that trained on data from all regions, the results have been split by basin and shown in Table 12.
As expected, the model trained on WAWP data performs best in the western Atlantic and western Pacific regions, with a recall of 90.80% and 90.75%, respectively. It also performs well in the eastern Pacific region with a recall of 80.15%. However, all other regions do not surpass the recall rate of 60%.
On the other hand, when the model trained on data from all regions is used, all recall rates improve, some significantly.
The western Atlantic and western Pacific regions improve their recall rates by 2.25% and 6% to 93% and 96.80%, respectively. The most improved region is the one bounded by 100°E-180° in the Southern Hemisphere, with its recall rate increasing by more than half from 58.41% to 92.92%. All but one region obtained a recall rate of at least 80%, with many surpassing a recall rate of 90%. The region that did not do well was that bounded by 180°-260°E in the Southern Hemisphere, which obtained a recall rate of 42.31%. A possible reason for this region not performing as well as the others is that fewer cases with TCs are available for it, with only 26 in the test set.

d. Size of training dataset
Finally, though not strictly relevant for explaining the inferences, it is interesting to address the impact of the size of the training dataset on the results. To do this, the final architecture was trained using varying amounts of the training dataset from 10% to 100%. When training using the entire global dataset (all regions), it is seen (Fig. 9) that, despite the fluctuations, the AUC-PR is generally increasing, while the testing loss is mostly flat, as more data are added. (The average testing loss essentially measures bad predictions: lower is better; if it is flat while the desired output is increasing, that is a good thing!) This suggests that a larger training dataset would have been beneficial.

Summary
A deep learning method to identify the presence or absence of tropical cyclones in simulation data is presented. Trained on ERA-Interim data, the deep learning model, TCDetect, obtained an AUC-PR of 0.7173, a recall rate of 93%, and a precision rate of 36% on a test set made up of 24 352 cases. As TCDetect is intended to be used during a GCM simulation, not missing TCs (i.e., recall) was deemed more important than obtaining maximum data reduction (i.e., precision), as a lack of recall would impair any future analysis of data output.
The performance of TCDetect is comparable to other standard (and more complex) models in terms of test loss and the AUC-PR metric that takes into account both precision and recall. It is relatively cheap to train and run the inference, although there is a preprocessing stage that involves relatively expensive spherical harmonic smoothing.
The architecture of the model was developed using manual hyperparameter tuning on the western Pacific and western Atlantic basins, two of the eight regions used for labeling the presence or absence of tropical cyclones from the IBTrACS dataset. While the model trained on those regions did not generalize well to all regions, when that model architecture is used with the global training set it does better. An analysis of the influence of the size of the training dataset suggests that even better results might be obtained with more training data.
TCDetect performs best with the strongest events, with a 100% recall rate for category 5 TCs. The most important inputs were found to be relative vorticity at 850 and 600 hPa, suggesting that the key physical characteristic of the data the deep learning has identified is the presence of strong deep convection. While recall increases with storm strength, there are also many tropical storms misidentified as TCs by TCDetect. While affecting performance measured using deep learning metrics, the inclusion of such storms is not a problem from an application point of view: tropical storms are important for those studying dynamics and can also wreak significant damage in their own right.
While the training data were obtained from ERA-Interim, the ground truth used was IBTrACS, which introduces an element of uncertainty in interpreting the results: is an incorrect label (presence/absence) a consequence of the presence or absence of the TC in the ERA-Interim or IBTrACS data, or an issue with the deep learning? It is known that reanalysis data cannot resolve the full strength of storms, and so will likely undercount TCs, and hence depress the possible accuracy rates. We discuss the impact of such uncertainties in a companion paper (Galea et al. 2023, manuscript submitted to Artif. Intell. Earth Syst.).
Future work includes attempting to improve TCDetect to better handle TCs of a low category, potentially via ideas imported from other standard techniques or using different meteorological fields, as well as implementing an inference step using a version of the model in a full general circulation model to evaluate the pros and cons of avoiding data output.

Data availability statement.
The data and code to produce the dataset, deep learning model/s, and subsequent results are available to access at Galea et al. (2022). Also included is the IBTrACS version 4 dataset, obtained from Knapp et al. (2018).

Model Architecture
Table A1 gives the details of the architecture that makes up TCDetect: an input of dimensions 22 rows × 29 columns × 5 fields goes through five convolutional blocks, each made up of a convolutional layer of 8, 16, 32, 64, and 128 kernels, respectively, with weights initialized using the Glorot Uniform method (Glorot and Bengio 2010) with ReLU activation functions, each with strides of 1 and a kernel size of 2 × 2; a dropout layer with a dropout rate of 10%; and a maximum pooling layer with strides equal to 1. The resulting kernels are flattened and passed through three fully connected blocks, each made up of a dense layer of 128, 64, and 32 hidden nodes, respectively, with L2 normalization with a normalization factor of 0.005, weights initialized by the Glorot Uniform method, and a dropout layer with a dropout rate of 10%. TCDetect ends with another fully connected layer of one node that outputs a prediction, this time using the sigmoid activation function with weights initialized by the Glorot Uniform method, as well as L2 normalization with a normalization factor of 0.005. The optimizer used was the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 and momentum of 0.8, with the loss function being binary cross entropy.

Hyperparameter Tuning
TCDetect was developed on data from the western Atlantic and western Pacific regions. The developments used the training set as described in section 3a to perform tenfold cross validation. Each fold was selected such that consecutive time steps, which are bound to be similar, are mostly kept in single folds to reduce data leakage across folds. Each fold was then evaluated using the validation set. The metric used to assess whether a change improved the model performance was the AUC-PR. Given the major class imbalance in the dataset used, and given that the model is intended to be used as a filtering technique, this metric is used because it weights both precision and recall equally. Also, due to the metric chosen, it was ensured that each fold used in the cross-validation process had a ratio of positive cases to negative cases similar to that of the full training set. Finally, as this model was being developed to identify data for further postprocessing, false negatives are a bigger problem than false positives, so improvements in recall were favored over those in precision if AUC-PR varied only marginally as a change was implemented.
Development and optimization using this metric proceeded as described below, with the final models being described and evaluated using the testing dataset in sections 3b and 4, respectively. Table B1 shows a summary of the steps taken during hyperparameter tuning.
The initial architecture that was used as the starting point for developing TCDetect consisted of an input of dimensions 84 rows × 110 columns × 2 channels that passed through 5 convolutional blocks, each made up of a convolutional layer of 8, 16, 32, 64, and 128 kernels, respectively, with weights initialized using the Glorot Uniform method with ReLU activation functions, each with strides of 1 and a kernel size of 2 × 2; and a MaxPooling2D layer with strides equal to 1. The resulting kernels are flattened and passed through three fully connected blocks, each made up of a dense layer of 128, 64, and 32 hidden nodes, respectively, with weights initialized by the Glorot Uniform method. The model finishes with another fully connected layer of one node, this time using the sigmoid activation function with weights initialized by the Glorot Uniform method, which was used to output a prediction. The optimizer used was the SGD optimizer with the default learning rate of 0.01, with the loss function being binary cross entropy. Finally, a batch size of 32 cases was initially used.
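One way to keep consecutive time steps mostly within the same fold is to assign contiguous blocks of time-ordered cases to folds. The sketch below is only an illustration of that idea: the function names and the block length of 40 cases are hypothetical, and the helper merely reports the positive fraction of a fold rather than enforcing the class ratio described above.

```python
def blocked_folds(n_cases, n_folds=10, block_length=40):
    """Assign time-ordered case indices to folds in contiguous blocks."""
    folds = [[] for _ in range(n_folds)]
    for start in range(0, n_cases, block_length):
        block = range(start, min(start + block_length, n_cases))
        # Round-robin over blocks keeps neighboring time steps together.
        folds[(start // block_length) % n_folds].extend(block)
    return folds


def positive_fraction(fold, labels):
    """Share of positively labeled cases in one fold, to compare against the full set."""
    return sum(labels[i] for i in fold) / len(fold)


example_folds = blocked_folds(n_cases=1000, n_folds=10, block_length=40)
```

Only cases at block boundaries have a temporal neighbor in a different fold, so longer blocks reduce leakage at the cost of less even fold sizes.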
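For readers unfamiliar with the tuning metric, the sketch below estimates the area under the precision-recall curve with the step-wise "average precision" sum. This is one common estimator and is an assumption here; the exact AUC-PR implementation used during development is not specified in this appendix. The sketch assumes at least one positive label and ignores tied scores.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Each recovered positive adds 1/total_positives of recall
            # at the precision reached at this rank.
            area += (true_positives / rank) / total_positives
    return area


# Hypothetical labels and model scores, for illustration only.
example_labels = [1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_auc_pr = average_precision(example_labels, example_scores)
```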

a. Choice of data
The first optimization made was to choose the number and type of meteorological fields to supply as inputs to the model. Four possible configurations were tested:
• MSLP and 10-m wind speed
• MSLP, 10-m wind speed, and relative vorticity at 850, 700, and 600 hPa
• MSLP and 10-m wind speed with spherical harmonic filtering between wavenumbers 5 and 106
• MSLP and 10-m wind speed with spherical harmonic filtering between wavenumbers 5 and 106, and relative vorticity at 850, 700, and 600 hPa with spherical harmonic filtering between wavenumbers 1 and 63
The last option provided the best mean AUC-PR, that of 0.5309.

b. Early stopping
Next, it was noted that the model was overfitting, as Fig. B1a shows. Except for the first two epochs, the training loss gets smaller while the validation loss gets larger with an increasing number of epochs. Figure B1b shows similar behavior for AUC-PR.
To overcome this issue, model training was stopped when the training and validation AUC-PR started to diverge. Several values of patience, that is, the number of epochs to wait before stopping to make sure that training was not stopped too early, were trialed to get the best possible performance. The patience values trialed were 2, 5, 10, and 20 epochs. A patience of 10 epochs obtained the best mean AUC-PR of 0.6788.
Resolution of the data used was checked next. The resolution used up to this stage was that of the original ERA-Interim dataset, but resolutions of 1.4° × 1.4°, 2.1° × 2.1°, 2.8° × 2.8°, and 3.5° × 3.5° were tested. The resolution of 2.8° × 2.8°, which was obtained by taking the mean of every four pixels of the image in both the x and y directions, produced the best mean AUC-PR of 0.7842.
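The role of the patience parameter in the early stopping step of this subsection can be illustrated with a small sketch. It assumes one common reading of the stopping rule, namely that training halts once the validation AUC-PR has not improved for `patience` consecutive epochs; the function name and the example scores are hypothetical.

```python
def early_stopping_epoch(validation_scores, patience=10):
    """Return (stop_epoch, best_epoch) for a higher-is-better validation metric."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_epoch
    return len(validation_scores) - 1, best_epoch


# Hypothetical validation AUC-PR per epoch, for illustration only.
example_val_auc_pr = [0.50, 0.60, 0.65, 0.64, 0.63, 0.66, 0.62, 0.61, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(example_val_auc_pr, patience=3)
```

With a shorter patience the same sequence would stop before the later, higher peak is reached, which is the risk that trialing several patience values is meant to guard against.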

e. Dataset balancing
One problem that was known when starting hyperparameter optimization was that the dataset was heavily dominated by negatively labeled cases. In fact, the training dataset with data from the WAWP regions had 89.46% of the cases negatively labeled, while that with data from all regions had 95% of cases negatively labeled. This imbalance would inhibit the model from learning the right pattern to maximize its performance. Therefore, six ways of balancing the dataset were investigated:
• Naive oversampling: making copies of the positively labeled cases until the dataset is balanced.
• Undersampling without replacement: undersampling the negatively labeled cases prior to training, so some data are not used.
• Undersampling with replacement: undersampling the negatively labeled cases during training, so they change from epoch to epoch; possible overfitting on positively labeled cases.
• Weighting the cases: weighting the cases so that the negatively labeled cases have less influence on the learning process.
• Adding bias: adding a bias to the output layer to prevent the model from learning the bias.
• Weighting the cases and adding bias: a combination of the previous two options.
Undersampling with replacement produced the best performance with a mean AUC-PR of 0.7839. It can be noted that the model's performance decreased marginally from the previous step, but this option was still selected as recall became much more favored by the model, which is important for the data reduction method in mind.
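The selected option, in which the negatively labeled cases are redrawn at every epoch, might look as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, a 1:1 class ratio per epoch is assumed, and "with replacement" is read as returning the negatives to the pool between epochs (within one epoch each negative is drawn at most once).

```python
import random


def balanced_epoch_indices(labels, rng):
    """Indices for one epoch: all positives plus an equal-sized fresh draw of negatives."""
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    # A new subset of negatives is drawn at every call, i.e., at every epoch.
    drawn_negatives = rng.sample(negatives, k=len(positives))
    epoch_indices = positives + drawn_negatives
    rng.shuffle(epoch_indices)
    return epoch_indices


rng = random.Random(0)
example_labels = [1] * 5 + [0] * 95  # hypothetical set with 95% negative cases
first_epoch = balanced_epoch_indices(example_labels, rng)
second_epoch = balanced_epoch_indices(example_labels, rng)
```

Because the positives are reused in every epoch while the negatives rotate, the model sees the same positive cases repeatedly, which is the overfitting risk noted in the list above.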

f. Loss and optimizer
The model so far used the binary cross-entropy loss function with the SGD optimizer. All possible combinations of the mean absolute error, mean squared error, and binary cross-entropy loss functions and the SGD, SGD with momentum using a momentum parameter of 0.9 (Qian 1999), RMSProp, Adam (Kingma and Ba 2014), Adagrad (Duchi et al. 2011), Adamax (Kingma and Ba 2014), and Nadam (Dozat 2016) optimizers were examined.
Binary cross-entropy loss with the SGD optimizer with momentum using a momentum parameter of 0.9 obtained the best mean AUC-PR of 0.7890.

g. Learning rate and momentum
A grid search for the best learning rate and momentum parameters was performed. The learning rates tested were 0.0001, 0.0005, 0.001, 0.005, 0.01, and 0.05, while the momentum values ranged from 0.1 to 1 in steps of 0.1. The combination that produced the best-performing model was a learning rate of 0.01 and a momentum of 0.8.
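The grid itself is small enough to enumerate exhaustively. The sketch below lists the combinations, assuming both endpoints of the momentum range are included, and selects the best pair under a placeholder objective. The objective is a hypothetical stand-in that is deliberately peaked at the reported optimum; in practice it would be the mean cross-validated AUC-PR of a model trained with each setting.

```python
from itertools import product

learning_rates = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
momenta = [round(0.1 * step, 1) for step in range(1, 11)]  # 0.1, 0.2, ..., 1.0


def grid_search(evaluate):
    """Return the (learning_rate, momentum) pair with the highest score from evaluate."""
    return max(product(learning_rates, momenta), key=lambda pair: evaluate(*pair))


# Placeholder objective (assumption), peaked at the reported optimum for illustration.
def toy_objective(learning_rate, momentum):
    return -abs(learning_rate - 0.01) - abs(momentum - 0.8)


best_pair = grid_search(toy_objective)
```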

h. Data augmentation methods
Several techniques, including random rolls, rotations, adding random noise, flipping the input data along either the x or y directions, and random cropping, were evaluated. The augmentation rate was set to 50%. The options that obtained a comparable or better mean AUC-PR were rolling the picture along the x direction, flipping the picture left to right, and rotating the image by a random amount. These were all included in the model and the combined methods produced a mean AUC-PR of 0.7988.
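Two of the retained augmentations, rolling along the x direction and flipping left to right, can be sketched as below. The function name is hypothetical, and the augmentation rate is assumed to be the probability with which each transform is applied to a case, which is one possible interpretation. Rotation by a random amount is omitted because it requires interpolation.

```python
import numpy as np


def augment_case(case, rng, rate=0.5):
    """Randomly roll along x and/or flip left-right a case of shape (rows, cols, fields)."""
    if rng.random() < rate:
        # Roll along the x (column) axis by a random number of grid points.
        case = np.roll(case, shift=int(rng.integers(case.shape[1])), axis=1)
    if rng.random() < rate:
        # Flip left to right, i.e., reverse the column order.
        case = case[:, ::-1, :]
    return case


rng = np.random.default_rng(0)
example_case = rng.normal(size=(22, 29, 5))
augmented = augment_case(example_case, rng, rate=0.5)
```

Both transforms only rearrange grid points, so the values in each field are preserved and all fields of a case are moved together.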

i. Data augmentation rate
The data augmentation rate was also varied from 10% to 100% in steps of 10% to find the best possible rate. The best-performing model, with a mean AUC-PR of 0.8018, was that with an augmentation rate of 60%.

j. Dropout position and rate
Dropout was investigated next. It was trialed in three places, namely the convolutional base only, the fully connected classifier only, and throughout the model, with dropout rates varying from 10% to 100% in steps of 10%. The model with the best AUC-PR, that of 0.8104, was that employing dropout with a rate of 10% throughout the model.

k. L2 normalization position and factor
L2 normalization was also investigated. It was trialed in the same three places as in the previous optimization. The normalization factors checked were 0.000 01, 0.000 05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, and 0.5. The model that produced the best performance, with a mean AUC-PR of 0.8128, was that having L2 normalization in the classifier only with a factor of 0.005.

l. Batch size
The final optimization tested was that of the batch size. Batch sizes of 8, 16, 64, 128, 256, 512, 1024, and 2048 were tested, with the first option producing the best-performing model with a mean AUC-PR of 0.8135.

m. Others
Other optimizations tested that did not produce a model with an improved performance included batch normalization, varying the number of hidden layers and nodes, and the boundary at which the corresponding recall and precision rates are obtained.

FIG. 5. Test AUC-PR (bars) and test loss (points) for standard convolutional bases, pretrained on the ImageNet database, attached to the fully connected classifier developed in the presented model. The classifier was retrained for each convolutional base with data from all regions.

FIG. 6. Test AUC-PR (bars) and test loss (points) for standard convolutional bases attached to the fully connected classifier developed in the presented network. The whole network was retrained for each convolutional base with data from all regions.

FIG. 7. Feature importance using the (top) Breiman and (bottom) Lakshmanan methods for model trained and tested on data from all regions.

FIG. 8. Mean case (left) for data originating only from the western Atlantic and western Pacific regions and (right) for data originating from all regions. (first row) MSLP, (second row) 10-m wind speed, (third row) vorticity at 850 hPa, (fourth row) vorticity at 700 hPa, and (fifth row) vorticity at 600 hPa.

Acknowledgments.
This work was funded by the Natural Environment Research Council (NERC) as part of the U.K. Government Department for Business, Energy and Industrial Strategy (BEIS) National Productivity Investment Fund (NPIF), Grant NE/R008868/1 under the SCENARIO Doctoral Training Partnership hosted by the University of Reading. Additional support was provided by Mr. Jeff Adie from NVIDIA, and Oracle Corporation.

FIG. B1. Evolution of training and validation loss and AUC-PR across the training process. (a) Loss for model trained and tested on data from the western Atlantic and western Pacific regions before applying early stopping. (b) AUC-PR for model trained and tested on data from the western Atlantic and western Pacific regions before applying early stopping.

TABLE 1. Overview of thresholds applied to meteorological variables for detecting and tracking tropical cyclones with the conventional techniques given.

TABLE 2. Previous deep learning models that detect and track tropical cyclones.

TABLE 4. Confusion matrix resulting from inference on the testing dataset.
Precision-recall curve for final trained model evaluated on the testing dataset.

TABLE 5. Comparison of total parameters used and performance metrics for TCDetect and similar networks using more standard convolutional bases.
as having a TC present if the threshold is exceeded by any value in that area at that time. The results of this exercise are tabulated in Table 7 and similar results were observed for the training and validation datasets (not shown).

TABLE 6. Comparison of total parameters used and performance metrics for TCDetect and similar networks using more standard convolutional bases.

TABLE 8. Number of cases having a TC and associated recall rate stratified by TC category when using test data.

TABLE 7. Classification metrics when using a threshold on the field of relative vorticity at 850 hPa.

TABLE 9. The IBTrACS classification seen for the 3397 cases where TCDetect inferred the presence of a TC.

TABLE 10. Results when using validation data.

TABLE 12. Evolution of accuracy during model development by basin (see text for explanation of rows).
FIG. 9. Test AUC-PR and loss for model trained and tested on data from regions around the world.

TABLE A1. The architecture of TCDetect.