1. Introduction and background
Deep convective storms are sustained by updrafts that rapidly lift lower-tropospheric air into the upper troposphere and lower stratosphere (UTLS), and these storms can adversely impact society. Such updrafts generate hazardous weather such as hail, heavy rainfall, tornadoes, and destructive straight-line winds, as well as in-flight aircraft icing and turbulence (Reynolds 1980; Negri 1982; Fujita 1989; Liu and Zipser 2005; Dworak et al. 2012; Bedka et al. 2018; Liu et al. 2020). These storms also inject moisture, aerosols, and trace gases into the UTLS, with major direct and indirect impacts on chemistry, dynamics, and radiation (Holton et al. 1995; Forster and Shine 1999; Stohl et al. 2003; Dessler and Sherwood 2004; Liu and Zipser 2005; Gettelman et al. 2011; Anderson et al. 2012; Pan et al. 2014; Smith et al. 2017; Herman et al. 2017; Anderson et al. 2017; Gordon et al. 2024). Intense convection occurs globally and is especially common over the United States (Liu and Zipser 2005; Bedka et al. 2010; Solomon et al. 2016; Cooney et al. 2018; Liu et al. 2020). Over the past decade, extreme weather events have increased notably in frequency and intensity due to rising greenhouse gas concentrations (IPCC 2021). In 2023, there were 28 U.S. “Billion Dollar Disasters,” 19 of which were attributed to severe thunderstorms (Smith 2024).
Automated satellite-based methods that pinpoint storms most likely to be severe are valuable for weather and climate analysis, especially in regions with insufficient ground-based weather radar coverage. Severe storms generate multiple distinct patterns within their cloud tops that are readily discernible by the human eye in satellite visible (VIS) and infrared (IR) imagery (Bedka et al. 2010). Two of these patterns are called overshooting tops (OTs) and above-anvil cirrus plumes (AACPs). Figure 1 shows numerous OTs and AACPs in Geostationary Operational Environmental Satellite (GOES-16) imagery of severe convection that produced hail, damaging wind, and tornadoes. OTs signify rapidly rising updrafts that appear as “dome-like protrusions above a cumulonimbus anvil” (Glickman 2000). These strong updrafts are evident via cold temperatures in IR imagery, dome-like texture in VIS imagery, and high lightning flash rates (Bedka et al. 2010; Bedka and Khlopenkov 2016; Bedka et al. 2018).
GOES-16 satellite imagery showing a cluster of intense storms generating many OTs and AACPs over Colorado, Kansas, and Nebraska, valid at 2300:00 UTC 26 May 2019. The rainbow colorbar uses the color-blind-friendly Turbo colormap developed by Google AI.
As the updraft penetrates the environment’s equilibrium level, typically located near the anvil cloud altitude (Bedka and Khlopenkov 2016), gravity wave breaking can occur at storm top, injecting cirrus cloud above the anvil (Homeyer et al. 2017; O’Neill et al. 2021). Unlike OTs, AACPs are not rapidly rising and typically reside in the lower stratosphere, allowing the plume to mix with a stratospheric environment that can be over 30 K warmer than the “parent” OT responsible for generating the AACP (Brunner et al. 2007; Homeyer et al. 2017; O’Neill et al. 2021). In high-tropopause tropical environments, some AACPs do not reach the stratosphere and can remain nearly as cold as their parent OTs, even though their texture in VIS imagery is the same as that of warm AACPs (Murillo and Homeyer 2022). A U- or V-shaped region of cold temperature is often present in conjunction with an AACP, with an OT located at the apex of the U/V shape. The AACP is bracketed by the “arms” of the U/V, referred to extensively in the literature as the “enhanced-V” signature (McCann 1983; Brunner et al. 2007). In weaker anvil-level wind environments, the arms form a circular shape around the AACP, leading to the name “cold-ring pattern” (Setvak et al. 2010).
Although OTs are ubiquitous atop deep convection, only a fraction of OT-producing storms generate severe weather (Dworak et al. 2012; Khlopenkov et al. 2021). AACPs are considered a stronger indicator of a severe storm (Homeyer et al. 2017; Bedka et al. 2018). Severe weather outbreaks can feature dozens of long-lived AACP-producing storm cells (Bedka et al. 2018). Bedka et al. (2018) analyzed over 4500 storms (405 of which produced AACPs) using a fusion of 1-min GOES imagery and gridded Next Generation Weather Radar (NEXRAD) network volumes. They found AACP-producing storms were 14× more severe (in terms of severe weather report counts) than deep convection without AACPs. Furthermore, over 85% of storm cells that generated hail with a diameter of 5+ cm and EF-2+ tornadoes were from AACP-producing storms. On average, AACPs appeared 30 min prior to the first severe weather report.
Various methods and datasets have been developed to identify OTs and AACPs. Some are geometric (radar, lidar, stereoscopy), while others are indirect (passive microwave, VIS, IR, and near-IR imaging) (Setvak and Doswell 1991; Liu and Zipser 2005; Bedka et al. 2010, 2012; Setvak et al. 2013; Bedka and Khlopenkov 2016; Cooney et al. 2018; Homeyer et al. 2017; Bang and Cecil 2019; Liu et al. 2020; Khlopenkov et al. 2021). Prior to launch, the GOES-R program categorized an OT and AACP/enhanced-V detection product as an “Option 2/Future Capability” (NOAA 2018). Bedka et al. (2010, 2012) developed an algorithm to detect these patterns within the GOES-R Aviation Algorithm Working Group. OT detection was based on the premise that OTs are significantly colder than the surrounding anvil and often colder than the tropopause. The authors used fixed thresholds, requiring that an OT candidate “cold spot” have a brightness temperature (BT) ≤215 K and be embedded within an anvil with BT ≤225 K. Bedka et al. (2010, 2012) also required cold spots to be at least 6.5 K colder than the surrounding anvil, resulting in a binary yes/no OT detection. At the time of their publications, detection performance was found to exceed that of another published method based on positive differences between the ∼6.7-μm water vapor (WV) absorption channel and ∼11-μm IR window BTs (WVIRDIFF; Schmetz et al. 1997; Bedka et al. 2010). AACPs were identified by finding a cluster of anomalously warm pixels relative to the surrounding anvil (5–35 K warmer) near (≤50 km) and downwind of an OT.
Although these algorithms provided a proof of concept that OTs could be detected in an automated way from satellite data, their fixed criteria limited detection to only the most prominent features, thereby missing less apparent features that could still be associated with hazardous weather. Subsequent refinements to the Bedka et al. (2010, 2012) OT methods eliminated the fixed thresholds and introduced greater sophistication to the detection process (Bedka et al. 2019; Khlopenkov et al. 2021). The net result is an OT probability product, rather than a binary detection mask, with improved overall performance (Sandmæl et al. 2019; Cooney et al. 2021; Khlopenkov et al. 2021). Bedka and Khlopenkov (2016) also introduced a method for quantifying VIS texture associated with OTs. There is only one known published work on automated AACP detection (Zibert and Zibert 2013). While these methods have been described in the literature, they can be complicated to implement, and no detection software has been widely released for public use.
With the advent of artificial intelligence/machine learning (AI/ML), alternative OT and AACP detection approaches are possible. ML methods expand upon traditional statistics-based techniques (Glahn and Lowry 1972) to provide fast, user-friendly, and widely distributable products for real-time and research applications (McGovern et al. 2017). Automated, open-source spatial pattern recognition approaches based on deep learning, such as convolutional neural networks (CNNs), have emerged and been used for severe storm research (McGovern et al. 2017; Lagerquist et al. 2019; McGovern et al. 2019; Cintineo et al. 2020). CNNs have also been used specifically for OT detection (Kim et al. 2018; Lee et al. 2021). Deep learning models such as U-Net have been very popular in the medical imaging community (Ronneberger et al. 2015; Oktay et al. 2018; Ibtehaz and Rahman 2020; Minaee et al. 2022). U-Net models are semantic segmentation models intended to identify exactly where a feature is present in an image. Unlike recurrent neural networks (RNNs; Williams and Zipser 1989; Hochreiter and Schmidhuber 1997; Tealab 2018), these models treat each scene independently and do not take advantage of temporal evolution. Given that trained analysts can see OT and AACP features in imagery, ML algorithms should also be able to detect them.
We present a novel OT and AACP detection method developed at NASA Langley Research Center that has been released by NASA as open-source software. We anticipate this software will be useful for a variety of weather nowcasting and research applications. It has the potential to aid in the issuance of severe weather warnings, the development of storm climatologies, and the understanding of chemical transport between the lower troposphere and UTLS, especially in regions without ground-based radar coverage. While this is the first publicly available OT and AACP detection tool for both operational and research purposes, it should be viewed as a proof of concept, and users should be cautious before treating it as a fully operational system that performs equally well in a broad range of weather scenarios. To train, validate, and test the detection models, a team of analysts manually labeled ∼50 h of 1-min GOES-16 imagery spanning seven severe storm events. Three semantic segmentation ML models, U-Net, MultiResUnet, and AttentionUnet, were evaluated using various model input combinations of GOES-16 Advanced Baseline Imager (ABI) channels, Geostationary Lightning Mapper (GLM) data, and numerical model tropopause data to identify the combination of inputs yielding the best detection performance. Detection accuracy is quantified with human analyst labels. OT performance is also assessed via comparison with precipitation echo tops derived from volumetric NEXRAD data.
2. Data
a. Model data inputs
1) Satellite imagery
Level 1b (L1b) GOES-16 ABI 1-min resolution mesoscale domain sector (MDS) and GLM data are the primary satellite inputs for training and testing the ML models. CONUS scans at 5-min intervals are also used to test the models via comparison with NEXRAD precipitation echo top. All GOES-16 data used in this study are obtained from Google Cloud Platform (GCP) in the GOES-R series netCDF format (Google 2024). While GCP provides free, efficient access to real-time GOES data, it is not required to run the detection software.
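For readers who wish to retrieve the same inputs, the following minimal Python sketch illustrates anonymous access to the public GCP bucket. It is not part of the released software; the gcsfs package, the example date (day 146 of 2019 corresponds to 26 May), and the output filename are our own illustrative choices.

```python
# Minimal sketch: anonymous listing/download of GOES-16 L1b mesoscale files
# from the public GCP bucket. Paths follow the GOES-R bucket layout
# (product/year/day-of-year/hour); the specific date/hour is illustrative.
import gcsfs

fs = gcsfs.GCSFileSystem(token="anon")  # public bucket; no credentials needed
paths = fs.ls("gcp-public-data-goes-16/ABI-L1b-RadM/2019/146/23/")
radm1 = [p for p in paths if "RadM1" in p]  # mesoscale sector 1 files only
fs.get(radm1[0], "goes16_radm1.nc")         # download one file for processing
```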
The ABI is the GOES-R series IR, shortwave IR, and VIS wavelength imager (Schmit et al. 2017, 2018). It samples 16 spectral bands at 0.5–2-km horizontal resolution every minute within an MDS. MDSs span ∼1000 km × 1000 km domains, and their locations are often chosen by NOAA to provide detailed observations of high-impact weather phenomena like severe storms. GOES-16 spectral bands (Ayd 2024) include the 0.64 μm (VIS band), 1.37 μm (cirrus band), 1.6 μm (snow/ice band), 6.2 μm (upper-level tropospheric WV band), 10.3 μm (clean IR window band), and 12.3 μm (dirty IR window band). To distinguish between the two IR window channels, the remainder of this paper refers to the 10.3-μm band as the IR channel and the 12.3-μm band as the dirtyIR channel. The VIS and 1.6-μm channels have 0.5- and 1-km pixel spacing at GOES nadir, respectively, while the remaining channels have 2-km spacing (Schmit et al. 2017). IR BT and shortwave IR reflectance are calculated from L1b radiances following Losos (2019).
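As an illustration of these conversions, the sketch below applies the standard GOES-R L1b relationships, whose coefficients (planck_fk1, planck_fk2, planck_bc1, planck_bc2, and kappa0) are stored in each L1b netCDF file; the input filename carries over from the download sketch above.

```python
# Sketch of the standard GOES-R L1b conversions: radiance -> brightness
# temperature for emissive (IR) bands, radiance -> reflectance factor for
# reflective (VIS/near-IR) bands. All coefficients come from the L1b file.
import numpy as np
import xarray as xr

ds = xr.open_dataset("goes16_radm1.nc")
rad = ds["Rad"]  # spectral radiance

if "planck_fk1" in ds:  # emissive band
    fk1, fk2 = ds["planck_fk1"], ds["planck_fk2"]
    bc1, bc2 = ds["planck_bc1"], ds["planck_bc2"]
    bt = (fk2 / np.log(fk1 / rad + 1.0) - bc1) / bc2  # brightness temperature [K]
else:  # reflective band
    refl = ds["kappa0"] * rad  # Lambertian-equivalent reflectance factor
```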
GLM flash extent density (FED) is tested as another model input. FED quantifies the number of flashes (both cloud-to-ground and cloud-to-cloud) that occur within a grid cell over a given time period (Rudlosky 2018). GLM data are provided at a horizontal resolution of ∼10 km, with files available every 20 s. GLM data are gridded using the glmtools Python library (Bruning et al. 2019). Data within ±2.5 min of the GOES scan start times are gridded to generate the FED product and collocated with the ABI data.
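The gridding in this study is performed by glmtools; as a simplified stand-in, the sketch below only illustrates the ±2.5-min time-window selection and per-cell flash accumulation using GLM L2 flash variables, and should not be read as the glmtools implementation.

```python
# Simplified stand-in for FED gridding (glmtools is used in practice):
# count flash centroids per grid cell within +/-2.5 min of the ABI scan
# start time t0 (numpy datetime64).
import numpy as np
import xarray as xr

def flash_extent_density(glm_files, t0, lon_edges, lat_edges, window_s=150.0):
    fed = np.zeros((len(lat_edges) - 1, len(lon_edges) - 1))
    for path in glm_files:  # 20-s GLM L2 granules covering the window
        g = xr.open_dataset(path)
        dt = (g["flash_time_offset_of_first_event"].values - t0) / np.timedelta64(1, "s")
        keep = np.abs(dt) <= window_s
        counts, _, _ = np.histogram2d(g["flash_lat"].values[keep],
                                      g["flash_lon"].values[keep],
                                      bins=[lat_edges, lon_edges])
        fed += counts
    return fed
```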
2) GFS tropopause data
The Global Forecast System (GFS) model tropopause temperature product is differenced from the GOES IR temperature and tested as a model input. GFS was chosen due to the ease of access to near-real-time and historical data, as well as the fact that GFS covers the entire globe, allowing the software to be applied anywhere in the world. GFS grib files are downloaded from the NCAR Research Data Archive (RDA; https://rda.ucar.edu/datasets/ds627.0/). The GFS lapse-rate tropopause temperature (TropT) analysis, available 6 hourly on a 0.25° × 0.25° longitude–latitude grid (NCEP et al. 2015), is linearly interpolated in time and space to the GOES pixels. The lapse-rate tropopause is identified by finding the lowest level at which the lapse rate decreases to 2 K km−1 or less (WMO 1957). Modeled TropT fields can depict unrealistic spatial variability (Khlopenkov et al. 2021). In addition, a model may accurately capture the primary tropopause in one location while nearby incorrectly labeling the secondary tropopause as the primary; this is most common near jet streams (Bethan et al. 1996; Schmidt et al. 2005; Homeyer et al. 2010; Solomon et al. 2016). To mitigate potential artifacts, TropT is smoothed with a Lanczos kernel using a 2.5° smoothing radius, as described by Khlopenkov et al. (2021).
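For reference, a minimal sketch of the WMO lapse-rate tropopause search on a single profile is given below; the 5-km lower search bound is a common convention to avoid low-level inversions (it is not stated in the text), and the unweighted mean of layer lapse rates is a simplification of the 2-km averaging requirement.

```python
# Sketch of the WMO (1957) lapse-rate tropopause: the lowest level where
# the lapse rate -dT/dz falls to 2 K/km or less, provided the average
# lapse rate over the 2 km above also does not exceed 2 K/km.
import numpy as np

def lapse_rate_tropopause(z_km, t_k, z_min_km=5.0):
    lapse = -np.diff(t_k) / np.diff(z_km)        # layer lapse rates [K/km]
    z_mid = 0.5 * (z_km[:-1] + z_km[1:])         # layer midpoints [km]
    for i in np.where((lapse <= 2.0) & (z_mid >= z_min_km))[0]:
        above = (z_mid > z_mid[i]) & (z_mid <= z_mid[i] + 2.0)
        if above.any() and lapse[above].mean() <= 2.0:
            trop_t = np.interp(z_mid[i], z_km, t_k)   # TropT at that level
            return trop_t, z_mid[i]
    return np.nan, np.nan                        # no tropopause found
```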
3) Cases analyzed
Data for the ML models are broken into three categories: 1) training, the data sample used to fit the model; 2) validation, used to prevent overfitting while the model is being tuned; and 3) testing, used to provide an unbiased evaluation of the final model (Brownlee 2020). Table 1 shows the GOES-16 MDS (either M1 or M2 sector) dates and their categories. We chose seven storm outbreaks: five are used for training, one for validation, and one for testing performance. Figure 2 shows the MDS regions for each of these cases. They were selected because they featured a very high number of storms with a diversity of sizes and morphologies (e.g., isolated cells, squall lines, supercells, and mesoscale convective systems). Each event features differences in the maximum and minimum cloud-top temperatures of the OTs, AACPs, and their surrounding anvils that we believe are representative of the spectrum of storms found across much of the world. Figure 3 provides an example scene from each of the outbreaks.
Dates and times of convective cases included in this study. The case classifier indicates whether the time period was used as a training or validation dataset. The independent model testing data were never input to the model.
MDS regions for the ML model training, validation, and testing case. Domains are color coded by date.
GOES-16 IR/VIS sandwich imagery of example scenes from each labeled case day. The black contours correspond to analyst-identified OT locations, while magenta and white contours correspond to CP and AP regions, respectively (see section 3b). The blue arrows point to previously labeled AACPs whose parent updrafts dissipated. The BT color enhancement table is the same as that shown in Fig. 1.
b. Human analyst label database
As will be described in section 3b, human analysts created an OT and AACP label database from MDS images for each event listed in Table 1. These labels serve as our “truth” and “masks” to train the ML models. The labelers consisted of two experienced analysts and three less-experienced student analysts. One experienced analyst labeled every event, while the other labeled the validation and independent test cases. The students labeled some, but not all, of the training cases. Labeling is an extremely tedious and arduous process; it is possible that some OTs and AACPs were missed by the labelers or that a label was incorrectly placed. The various labels were combined into an optimal “master” label database that was quality controlled by the experienced analysts.
c. Validation and comparison datasets
1) Khlopenkov et al. (2021) OT detection
OT detections were generated using the Khlopenkov et al. (2021) method (hereafter OTK) for the 26–27 May 2019 test case for comparison with ML model detections. The Khlopenkov et al. (2021) method is currently one of the leading methods for identifying OTs in geostationary imagery and has successfully detected overshooting convection depicted by radar as well as severe hailstorm cell tracks (Cooney et al. 2021; Scarino et al. 2023). This method produces an OT probability product with values that range from 0 to 1. OTK probability values ≥0.5 were found to provide skillful detection.
2) NEXRAD GridRad data
This study also makes use of a gridded NEXRAD data product, referred to as GridRad. GridRad composites all available NEXRAD azimuth scans within ±3.8 min of an analysis time onto a regular, three-dimensional (longitude–latitude–altitude) grid (Cooney et al. 2018). GridRad data are obtained from the GridRad-Severe database on NCAR’s RDA. GridRad-Severe data are centered on areas where there is a concentration of severe storm reports (Murphy et al. 2023). The data have a grid spacing of 2 km in the horizontal and 1 km in the vertical, and a 5-min frequency (Cooney et al. 2018; Murphy et al. 2023), providing valuable insights into the vertical structure of storms. GridRad data were obtained for the model test case as well as for the cases and regions shown in Fig. 4. Comparison with GridRad enables a broader assessment of OT detection performance across multiple seasons and geographic regions than what is possible with human analyst labels but with some uncertainty to be discussed later. For more details on GridRad and its applications, refer to Homeyer (2014), Cooney et al. (2018, 2021), Homeyer and Bowman (2022), and Murphy et al. (2023).
Severe storm domains containing GridRad data used to test the OT model detection product. Domains are color coded by date.
3) MERRA-2 reanalysis
To generate GridRad tropopause-relative echo-top heights zr, tropopause altitudes are calculated from the Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2). MERRA-2 was chosen here to remain consistent with the methods outlined in Khlopenkov et al. (2021). MERRA-2 data are available 3 hourly in the MERRA2_400.inst3_3d_asm_Np collection on a global grid with a longitude–latitude resolution of ∼0.625° × ∼0.5° and 42 irregularly spaced pressure levels (Bosilovich and Lucchesi 2016). Because the MERRA-2 grid is coarser than the GridRad grid, temperature and geopotential height products are linearly interpolated to the GridRad grid for comparison. The tropopause height is then calculated by applying the World Meteorological Organization (WMO) lapse-rate tropopause definition described in section 2a(2) (WMO 1957) and linearly interpolated to the GridRad 5-min analysis times. This same approach is used in Homeyer (2014), Solomon et al. (2016), and Cooney et al. (2018, 2021).
3. Methods
a. Semantic segmentation models
We test three model types: the original U-Net (Ronneberger et al. 2015), MultiResUnet (Ibtehaz and Rahman 2020), and AttentionUnet (Oktay et al. 2018). Figure 5 shows a diagram of each architecture’s convolutional, pooling, and concatenation layers. For each model tested, all convolutional layers except for the output layer are activated by the rectified linear unit (ReLU) activation function. The output layer for each is activated by the sigmoid activation function. In each model, we use a batch size of 16 and minimize binary cross-entropy loss using an Adam optimizer with a learning rate set to 0.001. Other batch sizes and Adam optimizer learning rates were briefly tested, but we found this combination to work best (not shown).
Semantic segmentation model architecture for (a) U-Net, (b) MultiResUnet, and (c) AttentionUnet. The arrow colors correspond to each transformation between layers.
AACP and OT detection models are trained with their corresponding labels. The output of each model consists of checkpoint files. Model checkpoints are snapshots of a model’s weights (the influence the input data have on the output product) such that the model can resume training or make predictions from this point (Lakshmanan 2019). Checkpoint files provided with the software contain the weights saved from each model’s most accurate state. Predictions yield pixel-by-pixel likelihood values ranging from 0 to 1, with 1 representing the highest likelihood that a pixel is part of an OT or AACP and 0 the lowest.
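A minimal Keras sketch of this training configuration is shown below; build_model(), the training/validation arrays, the input shape, the epoch count, and the checkpoint filename are all illustrative placeholders rather than the released code.

```python
# Sketch of the stated training setup: binary cross-entropy loss, Adam
# optimizer (learning rate 0.001), batch size 16, and a checkpoint
# callback that saves the model's most accurate (lowest val_loss) state.
import tensorflow as tf

model = build_model(input_shape=(128, 128, 2))  # e.g., OT model with two inputs
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
checkpoint = tf.keras.callbacks.ModelCheckpoint("best_weights.h5",
                                                monitor="val_loss",
                                                save_best_only=True,
                                                save_weights_only=True)
model.fit(train_x, train_y, batch_size=16, epochs=100,
          validation_data=(val_x, val_y), callbacks=[checkpoint])
```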
1) U-Net model
A U-Net is a specific type of CNN used for semantic segmentation that was originally designed for biomedical image segmentation (Ronneberger et al. 2015) but can detect patterns in imagery useful in other fields (McGovern et al. 2017). In semantic segmentation, the model does not simply determine the likelihood of a specific feature being present somewhere in an image, as ordinary CNNs do, but rather identifies where the desired feature is present at the pixel scale. U-Net models often take 3-channel RGB images as input but can be adapted to accept n-channel data arrays as required. The architecture consists of contracting convolutional and pooling layers, resembling traditional CNNs. However, instead of concluding with fully connected layers for binary or multiclass prediction, the contracting layers are followed by further convolutional and upsampling layers, expanding the model’s output to match the original input dimensions (Ronneberger et al. 2015). This contraction and expansion resemble a “U” shape, hence the name U-Net. Moreover, the expanding layers of the U-Net incorporate skip connection concatenations from corresponding contraction layers, improving model accuracy (Ronneberger et al. 2015).
2) MultiResUnet model
The MultiResUnet model (Ibtehaz and Rahman 2020) attempts to enhance the U-Net architecture by introducing “MultiRes” blocks. It has been suggested that U-Net models struggle with variations in image scale and with fusing features passed through skip connections with decoder features that have undergone more processing (Oktay et al. 2018; Ibtehaz and Rahman 2020). Instead of keeping the number of filters the same in sequential convolutional layers, the MultiRes block increases the number of filters. Additionally, the MultiResUnet model replaces the skip connections with residual path connections (Ibtehaz and Rahman 2020), as shown in Fig. 5.
3) AttentionUnet model
The AttentionUnet model (Oktay et al. 2018) also attempts to improve upon U-Net by introducing an attention gate that selectively suppresses feature activations in regions unlikely to contain the object of interest. This allows the model to focus more computational resources on regions of interest (Oktay et al. 2018). These attention gates filter the features propagating through the skip connections.
b. OT and AACP labels
Segmentation models rely on labeled imagery with masks identifying separate classes of interest for detection (Everingham et al. 2015). GridRad can be used to identify echo tops above the tropopause as an OT proxy; however, not every OT surpasses the tropopause. Figure 6 shows the frequency of the minimum tropopause-relative 10.3-μm BT within all OT labels across the seven storm outbreaks listed in Table 1. While most OTs apparent to the analysts are colder than the tropopause, thousands are warmer. Even when OTs overshoot the tropopause, echo-top regions do not precisely outline OT and AACP boundaries. Figure 7 shows an example of the difference between analyst-drawn OT labels and GridRad echo-top heights above the tropopause. The analyst labels in Fig. 7d most precisely outline the OT regions, while the GridRad tropopause-overshooting contours in Figs. 7b and 7c show broad regions that cover the OT along with some of the AACP and other nearby regions. The 1-km vertical grid spacing of echo-top heights contributes to this uncertainty. Attempts to train models using 10- or 20-dBZ echo tops at multiple levels above the tropopause yielded worse results than manually drawn labels (not shown). U-Nets perform best when trained with precise labels. Since there are no suitable truth OT and AACP datasets available to train the model, we created our own.
A histogram of the minimum tropopause-relative BT difference within every OT labeled by analysts during the seven severe storm outbreaks listed in Table 1.
Example scene valid 2245:09 UTC 17 May 2019. (a) IR/VIS sandwich. The contours in black show (b) 20-dBZ GridRad echo-top heights at least 1 km above the tropopause, (c) 20-dBZ GridRad echo-top heights above the tropopause, and (d) human analyst OT labels. GridRad data were adjusted in reverse for parallax to match GOES-16 cloud features.
IR/VIS sandwich images (Setvak et al. 2013) are used as the basis for labeling. Label examples are provided in Fig. 3. Anomalously warm and uniquely textured regions, outlined in magenta and white, are embedded within the colder anvil and downwind of OTs, outlined in black. Sandwich images are labeled using LabelMe (Wada 2021), a graphical annotation tool that allows users to zoom and contrast-enhance imagery, enabling precise OT and AACP outlines. LabelMe outputs a json file that contains the image coordinates of every vertex of an analyst-drawn label polygon.
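The sketch below shows one way such a json file can be rasterized into per-class binary masks for training; the helper is our own and assumes LabelMe’s standard output structure (a “shapes” list with a “label” name and polygon “points” per annotation).

```python
# Sketch: rasterize LabelMe polygon annotations into per-class masks.
import json
import numpy as np
from skimage.draw import polygon as draw_polygon

def labels_to_masks(json_path, ny, nx,
                    classes=("overshoot", "plume", "confident plume", "null")):
    with open(json_path) as f:
        shapes = json.load(f)["shapes"]
    masks = {c: np.zeros((ny, nx), dtype=np.uint8) for c in classes}
    for shape in shapes:
        if shape["label"] in masks:
            pts = np.asarray(shape["points"])      # (x, y) vertex coordinates
            rr, cc = draw_polygon(pts[:, 1], pts[:, 0], shape=(ny, nx))
            masks[shape["label"]][rr, cc] = 1      # fill the polygon interior
    return masks
```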
There are four classes: “overshoot,” “plume,” “confident plume,” and “null.” Confident and nonconfident AACPs are differentiated to allow training of the model on the features most apparent to an analyst, which should also be most easily detectable by an ML model. AACPs can be either cold or warm, depending on the level of the tropopause relative to the storm top and their distance downstream from the parent OT (Murillo and Homeyer 2022). AACP warm anomaly regions directly adjacent to an OT are considered confident detections because they are easiest to see in the sandwich imagery. In addition, AACPs can span hundreds of kilometers, and their peripheries may not be near the parent OT; thus, models may confuse similarly textured anvil regions with AACPs. Approximately 2.4% of all labeled plumes include pixels extending beyond the 256 × 256 pixel regions used for model training. In contrast, confident plume labels are entirely within these regions. Sandwich imagery videos were also used by analysts to determine AACP timing, duration, and extent. See section 3e for the discussion of null scenes.
Analysts can disagree in terms of both where features are present and their spatial extent. OTs and AACPs are rapidly evolving phenomena, and labeling them is tedious. The duration over which a particular OT or AACP is labeled can differ greatly. For example, the blue arrows in Fig. 3 point to regions in which an AACP, labeled in many previous timesteps, detached from its decayed parent updraft. Warm anomalies from detached plumes can persist for over an hour. Some analysts continued to label this feature, while others did not because it was no longer associated with an active OT. To optimize the labels, input from each analyst is combined into a master set based on the judgment of an experienced analyst. The master set merges each analyst’s labels into a single json file for each image and is then quality controlled using LabelMe by the experienced analysts. In total, the master label set identifies 71 741 OTs and 24 737 AACPs, of which 16 759 are considered confident AACPs. Evaluation of individual analyst labels relative to the master set indicates analysts disagree up to 30% of the time (not shown). There is also greater subjectivity and disagreement for AACPs than OTs, particularly in terms of spatial extent. Since humans do not perfectly match one another, we should not expect a perfect match between model detections and the labels.
ML models are trained to identify OTs and AACPs separately. No model training is conducted to identify both OTs and AACPs simultaneously. Automated AACP detection methods have not been extensively studied, so a primary goal is to examine the impact of GOES-16 inputs on OT and AACP model performance. We want to identify which input combination works best for identifying OTs and which works best for identifying AACPs. For AACPs, two models are trained for all possible data input combinations. One model is trained using all plume labels (confident and not confident plume labels), and the other model is trained using just the confident plume labels. This paper refers to them as the confident plume (CP) and all plume (AP) models.
c. Preprocessing satellite imagery inputs
1) GOES shortwave channel normalization
Reflectance derived from the 0.64-, 1.37-, and 1.6-μm channels is impacted by variations in solar illumination throughout the day. As the sun sets, the brightness of sunlight reflected to the satellite from clouds and Earth’s surface diminishes, which could impact OT and AACP detection using these channels. To mitigate this, each pixel’s radiance is normalized by dividing it by the cosine of the solar zenith angle (SZA). A pixel’s SZA is calculated by taking the difference between 90° and the sun elevation angle, which is provided by the suncalc Python package (Barron et al. 2014). Reflectance is calculated by multiplying the measured L1b radiance by the kappa factor, which is included in the L1b files and represents the incident Lambertian-equivalent radiance for each channel (Losos 2019). All model training periods are kept within daytime hours, which we define as SZA <85° at the midpoint of the GOES-16 image. Model input combinations that do not use shortwave channels are unbiased throughout the day and can be used equally well at night. The 0.64-, 1.37-, and 1.6-μm data are invalid at night, so restricting training, validation, and testing to daytime scenes ensures equivalent comparisons among all model inputs. Furthermore, analysts are much less confident when labeling without VIS imagery.
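A sketch of this normalization is given below; the get_position call follows the Python port of suncalc and is assumed to return the sun elevation (“altitude”) in radians, and the function name and arguments are our own.

```python
# Sketch of the shortwave normalization: reflectance = kappa0 * radiance,
# divided by cos(SZA), with pixels beyond the daytime SZA limit masked.
import numpy as np
from suncalc import get_position

def normalized_reflectance(rad, kappa0, scan_time, lon, lat, max_sza_deg=85.0):
    pos = get_position(scan_time, lon, lat)   # per-pixel sun position
    sza = np.pi / 2.0 - pos["altitude"]       # SZA = 90 deg - elevation (radians)
    refl = kappa0 * rad / np.cos(sza)
    return np.where(np.degrees(sza) < max_sza_deg, refl, np.nan)
```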
2) Multispectral IR channel differencing
Multispectral ABI channel combinations can provide more information than individual channels alone. Several studies have used the BT difference between the upper-level WV (6.2 μm) absorption channel and the IR window channel (10.3 μm) (WV-IR BTD) to detect OT regions (Schmetz et al. 1997; Bedka et al. 2010, 2011, 2012; Scarino et al. 2023). This difference could be useful for detecting OTs and AACPs at night when shortwave data are unavailable. In addition, AACPs are often associated with enhanced WV in the UTLS that could potentially generate a unique signal in the WV-IR BTD. The 12.3-μm dirtyIR channel has a higher sensitivity to WV than other IR channels, and its difference with the 10.3-μm IR channel has been widely used to identify cirrus clouds (Heidinger and Pavolonis 2009). Figures 1f and 1g show examples of how OT and AACP regions are depicted in these multispectral channel data.
In addition to penetrating the environment’s equilibrium level, deep convective updrafts often penetrate or nearly penetrate the tropopause. OT and AACP IR BTs can change considerably with latitude and season (Cooney et al. 2018; Khlopenkov et al. 2021). Because we seek an automated detection solution applicable anywhere in any season, the difference between the GFS analysis TropT and the 10.3-μm IR BT (TROPDIFF) is used to standardize the IR BT input across regions and seasons.
3) Normalizing model inputs
To prepare the data for model input, each input is normalized onto a 0–1 scale using threshold values chosen from our current knowledge of typical ranges for OTs and AACPs. The thresholds enhance features of interest in the imagery and remove from consideration data that we believe will not be useful for detection. Table 2 shows the thresholds applied to each input. Normalized channel weights increase linearly from zero at the minimum weight threshold to one at the maximum weight threshold; values outside this range receive weights of 0 or 1, depending on which side of the range they fall. A different set of thresholds could work better for certain regions of the globe or times of year, but the thresholds chosen here should encompass the global range of the likeliest OT and AACP values (Bedka et al. 2010; Cooney et al. 2021).
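This normalization amounts to a clipped linear ramp, sketched below with placeholder limits rather than the Table 2 values.

```python
# Sketch of the 0-1 input normalization: linear ramp between the minimum
# and maximum weight thresholds, saturating at 0 and 1 outside the range.
import numpy as np

def normalize(field, vmin, vmax):
    return np.clip((field - vmin) / (vmax - vmin), 0.0, 1.0)

ir_input = normalize(ir_bt, 195.0, 230.0)  # illustrative thresholds only
```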
ABI channel and multispectral channel combinations tested as model inputs. The thresholding columns show the range of values used to normalize the inputs between 0 and 1.
4) Input combinations at varying resolution
Our analysis tests 18 different model input combinations: 1) IR, 2) VIS, 3) TROPDIFF, 4) IR+GLM, 5) IR+WVIRDIFF, 6) IR+VIS, 7) IR+Cirrus, 8) IR+Snow/Ice, 9) IR+DIRTYIRDIFF, 10) IR+TROPDIFF, 11) VIS+TROPDIFF, 12) GLM+TROPDIFF, 13) IR+VIS+TROPDIFF, 14) VIS+GLM+TROPDIFF, 15) IR+VIS+GLM, 16) TROPDIFF+DIRTYIRDIFF, 17) IR+VIS+DIRTYIRDIFF, and 18) VIS+TROPDIFF+DIRTYIRDIFF. As mentioned in section 2a, higher-resolution channels like VIS (0.5 km) contain more pixels per MDS image (2000 × 2000) than IR and some other shortwave channels (500 × 500). To comprehensively evaluate model performance, ML models are trained for each input combination at the VIS image resolution. To ensure consistency across model inputs, nearest neighbor interpolation is applied to lower-resolution channels to match the VIS image. Due to the much lower resolution of GLM, FED data are smoothed after interpolation using a two-dimensional median filter with a window size of ∼6 km. For input combinations with no shortwave channels, models are trained at the native 2-km IR pixel spacing. For these combinations, we interpolate GLM data to match the IR image resolution and apply the same smoothing process.
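The resolution matching and GLM smoothing can be sketched as follows, assuming aligned grids with integer resolution ratios; the 20× GLM replication factor and 13-pixel window are our approximations of the ∼10-km GLM spacing and ∼6-km window at the 0.5-km VIS spacing.

```python
# Sketch: nearest-neighbor replication to the 0.5-km VIS grid, plus a 2D
# median filter applied to the interpolated GLM FED field.
import numpy as np
from scipy.ndimage import median_filter

def upsample_nearest(field, factor):
    return np.repeat(np.repeat(field, factor, axis=0), factor, axis=1)

ir_on_vis_grid = upsample_nearest(ir_input, 4)                  # 2 km -> 0.5 km
fed_on_vis_grid = median_filter(upsample_nearest(fed, 20),      # ~10 km -> 0.5 km
                                size=13)                        # ~6-km window
```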
d. Subsetting satellite data for model training
OTs and AACPs are spatially small and infrequent. Training an ML model across the entire domain would result in a significant class imbalance, as the number of OT/AACP pixels is vastly outnumbered by the null pixels. Consequently, such a model could default to the null case, overlooking the severe storm signatures (i.e., underfitting). This was observed in our initial attempts to train the model over the full domains (not shown). To address this challenge, we extract subset image scenes, prior to model input, centered on all OT or AACP identifications made by analysts, depending on which signature the model is being trained to identify. The center of each labeled object is determined by calculating the mean longitude and latitude of the pixels within the labeled area. We investigated subset sizes of 8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128, and 256 × 256 pixels around the OT/AACP object center. Through experimentation described in section 4c, we found the optimal subset sizes for OTs and AACPs to be 128 × 128 and 256 × 256 pixels, respectively. For IR-resolution models (2-km spatial resolution), the best performance is seen when using subsets of 64 × 64 pixels for OTs and 128 × 128 pixels for AACPs.
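A sketch of the subset extraction around a labeled object is given below; the boundary clamping is our own detail, added so windows near the image edge remain the full subset size.

```python
# Sketch: extract a fixed-size training subset centered on the mean
# row/column of a labeled object's pixels.
import numpy as np

def extract_subset(image, label_mask, size=128):
    rows, cols = np.nonzero(label_mask)
    rc, cc = int(rows.mean()), int(cols.mean())             # object center
    half = size // 2
    r0 = int(np.clip(rc - half, 0, image.shape[0] - size))  # clamp to image
    c0 = int(np.clip(cc - half, 0, image.shape[1] - size))
    return image[r0:r0 + size, c0:c0 + size]
```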
e. Adding null cases to training and validation datasets
Our initial models were trained with only OT/AACP labels and no null scenes. These initial AACP models exhibited a tendency to misclassify anvils as AACPs, prompting an alternative approach. Within our master label set, analysts identified null features, strategically positioned within anvil regions without OTs/AACPs across all training cases, in an attempt to teach the model the difference between ordinary convection and the desired features. For example, null features were placed in wispy anvil regions not associated with an OT, as well as where two convective cells merge and the BT pattern is reminiscent of warm anomalies (see section 4e). This deliberate placement aimed to simulate faulty detections observed from the initial models; our rationale extends beyond balancing classes, as exposing the models to these null scenes fortifies them against such erroneous identifications. We handpicked 1162 null scenes for inclusion, ensuring consistency across the various model combinations for both OTs and AACPs. While this number may seem small relative to the total number of OTs and AACPs, each subset image already contains null pixels; these additional scenes were specifically chosen to mitigate faulty detections through a targeted approach.
f. Model validation statistics
As described earlier, model training produces checkpoint files containing the weights that determine the influence the input data have on the output product. By saving the model’s most accurate state in these files, we can make OT and AACP predictions on new scenes. These predictions are likelihood scores, ranging from zero (least confidence) to one (most confidence), for each pixel in the domain.
1) OT and AACP object creation
Figure 8 illustrates the object creation process for a fictional AACP model output example. Each rectangle in the three panels corresponds to a pixel. OT and AACP objects are found by first removing all pixels with model output likelihood values <0.05, as they are unlikely to be part of either an OT or AACP. Remaining pixels (≥0.05) that are contiguous at the sides and corners are grouped together to form preliminary objects. Within each preliminary object, the maximum likelihood value is extracted. Using this value, all pixels within 50% (90%) of the maximum value of the OT (AACP) are considered part of the object, while the remaining pixels are discarded. This ensures only the most confident OT and AACP pixels are included. We chose 90% for AACPs to better capture the full plume extent, which is not necessary for OTs, which typically contain high likelihood values throughout the detected feature. Each pixel remaining within the object is then treated as if it has the same likelihood value as the maximum value. This allows us to incrementally check model viability at different thresholds and determine a threshold value that provides the best combination of the probability of detection (POD), false alarm ratio (FAR), and intersection over union (IoU).
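The sketch below implements one reading of these steps, in which “within 50% (90%) of the maximum” keeps pixels with likelihoods of at least 50% (10%) of the object maximum, consistent with the larger AACP fraction retaining more of the plume.

```python
# Sketch of Fig. 8's object creation: drop likelihoods < 0.05, group the
# rest with 8-connectivity (sides and corners), keep pixels within the
# stated fraction of each object's maximum, and assign every kept pixel
# that object maximum.
import numpy as np
from scipy import ndimage

def build_objects(likelihood, floor=0.05, frac=0.5):  # frac: 0.5 OT, 0.9 AACP
    labeled, n = ndimage.label(likelihood >= floor, structure=np.ones((3, 3)))
    objects = np.zeros_like(likelihood)
    for i in range(1, n + 1):
        region = labeled == i
        peak = likelihood[region].max()
        keep = region & (likelihood >= (1.0 - frac) * peak)
        objects[keep] = peak              # kept pixels take the object maximum
    return objects
```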
Fictional example of the AACP object identification process. The three main steps progress from (top) to (bottom). Each rectangle in the panels corresponds to a GOES pixel. Each color represents a distinct AACP object.
Thresholding ML model outputs at a specific value (e.g., ≥0.35) is often employed for optimization purposes. For example, if we choose a threshold of 0.9 for the objects identified in Fig. 8, only the blue- and green-shaded objects would remain. This approach enhances model performance by filtering out lower-confidence predictions, thereby focusing on more reliable estimations. In determining the optimal thresholds for each model, we prioritize minimizing false detections to give software users confidence that the detections are real. This emphasis on FAR is offset, though, by ensuring the model detects the majority of occurrences. Although IoU remains an important metric for capturing the entire OT and AACP extent, we consider accurately predicting object occurrences to be of higher importance, which is represented by the POD, FAR, and critical success index (CSI) statistics.
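For reference, the metrics used throughout the evaluation can be written as below, with POD, FAR, and CSI computed from object-level hit/miss/false-alarm counts and IoU computed pixelwise.

```python
# Sketch of the verification metrics: POD = H/(H+M), FAR = FA/(H+FA),
# CSI = H/(H+M+FA), and pixelwise IoU = |A and B| / |A or B|.
import numpy as np

def pod_far_csi(hits, misses, false_alarms):
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    csi = hits / (hits + misses + false_alarms)
    return pod, far, csi

def iou(pred_mask, truth_mask):
    intersection = np.logical_and(pred_mask, truth_mask).sum()
    union = np.logical_or(pred_mask, truth_mask).sum()
    return intersection / union
```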
2) Spatial matching of radar and satellite data
Weather radar data have been used extensively to detect and study overshooting convection (Homeyer 2014; Homeyer and Kumjian 2015; Solomon et al. 2016; Cooney et al. 2018, 2021; Jellis et al. 2023). These studies assume that if an updraft surpasses the tropopause, then it also overshoots the environment’s equilibrium level. This is correct; however, the tropopause can often be much higher than the equilibrium level, particularly in late summer months and in tropical environments where a warmer atmospheric column elevates the tropopause (Cooney et al. 2018, 2021). While the tropopause is a useful benchmark for identifying overshooting convection, severe storms with updrafts that overshoot the equilibrium level but not the tropopause are common.
Following Cooney et al. (2018, 2021), zr is calculated. Spatial matches between individual GOES and GridRad data are identified by first reverse parallax-correcting the GridRad output using a cloud-top height of 15 km, which is typical for deep convective storms across the United States (Cooney et al. 2018). To ensure uniform gridbox adjustment and maintain the original GridRad grid spacing, the average GOES-16 parallax correction at 15-km altitude is calculated for each storm event domain. A “reverse” correction is applied by using the inverse of the calculated GOES-16 parallax correction, adjusting the GridRad domain longitudes and latitudes accordingly. Because the parallax correction and height assignments are inexact, GridRad and model overshooting locations are considered spatial matches if they are within 10 km of one another, following Cooney et al. (2021).
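The 10-km match test can be sketched with a KD-tree on locally projected coordinates, as below; the equirectangular projection is our simplification and is adequate at these distances.

```python
# Sketch: flag GridRad overshoot locations that have a model OT detection
# within 10 km, using a KD-tree in locally projected (km) coordinates.
import numpy as np
from scipy.spatial import cKDTree

def has_match_within(lon_a, lat_a, lon_b, lat_b, max_km=10.0):
    km_per_deg = 111.32
    coslat = np.cos(np.deg2rad(np.mean(lat_a)))
    xy_a = np.column_stack([lon_a * km_per_deg * coslat, lat_a * km_per_deg])
    xy_b = np.column_stack([lon_b * km_per_deg * coslat, lat_b * km_per_deg])
    dist, _ = cKDTree(xy_b).query(xy_a, distance_upper_bound=max_km)
    return np.isfinite(dist)  # True where a match exists within max_km
```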
4. Results
a. Model output
As an introduction to OT and AACP model predictions, Fig. 9 shows two IR+VIS+GLM OT and CP model prediction scenes during the independent test case. The highest confidence OT pixels correspond to the most “bubbly” texture and coldest BTs. For AACPs, the highest confidence pixels are often located at the point of the largest temperature gradient between the parent OT and AACP.
Example OT and AACP model outputs using the IR+VIS+GLM data as inputs into the MultiResUnet model. Results from the CP model are shown. Likelihood values <0.05 are set to transparent. (a)–(c) 2222:15 UTC 26 May 2019. (d)–(f) 2339:15 UTC 26 May 2019.
Figure 10 shows the same scenes as Fig. 9 but instead displays the overlap of OT and AACP objects with analyst labels (green) or lack of overlap (dark red and dark blue). OT (AACP) objects are only kept if the maximum likelihood value within the object is ≥0.4 (0.35), following the methods outlined in section 3f. For these scenes, the OT and CP models detect the occurrence of OT or AACP objects. OT detections (Figs. 10a,c) do a much better job of encompassing their respective analyst-labeled objects than the CP model (Figs. 10b,d). The CP model is trained to focus on the region with the strongest AACP warm anomaly, leaving it unable to capture the colder outer peripheries of many AACPs.
Example (a),(c) OT and (b),(d) AACP objects using the IR+VIS+GLM input combination and MultiResUnet model. (a),(b) 2222:15 UTC 26 May 2019. (c),(d) 2339:15 UTC 26 May 2019. Results from the CP model relative to AP label masks are shown. Dark red shading corresponds to analyst-labeled regions that were not captured by the model, dark blue shading corresponds to model detection regions that were not labeled by the analyst, and green shading corresponds to regions where the analyst and model agree.
b. Comparing semantic segmentation model performance
Before comparing the three U-Net models, we first evaluate whether they are overfitting or underfitting by comparing performance on training and validation datasets. Overfitting is a common concern in ML models, particularly for extreme events like severe storms, due to the inherent imbalance in the data. Figure 11 presents the performance curves for the training and validation cases using the MultiResUnet model (U-Net and AttentionUnet results are not shown). The dashed lines represent the input combinations with shortwave channels (IR+VIS for OTs and IR+VIS+DIRTYIRDIFF for AACPs), while the solid lines show the input combinations unbiased by the time of day (IR-only for OTs and IR+DIRTYIRDIFF for AACPs). Across both training and validation, the CSI consistently peaks between 0.5 and 0.7, indicating a good balance between hits, false alarms, and misses and avoiding overfitting by not overly matching the training data. POD reaches ∼0.9, while FAR remains substantially lower across thresholds, confirming effective detection without an excessive number of false positives. Finally, IoU remains stable without extreme fluctuations, reflecting a consistent balance between precision and recall.
POD, FAR, CSI, and IoU curves for MultiResUnet model predictions on the (a),(c) training and (b),(d) validation cases. (a),(b) OT detection model curves for IR+VIS (dashed lines) and IR-only (solid lines) model input combinations. (c),(d) AACP CP detection model curves for IR+VIS+DIRTYIRDIFF (dashed line) and IR+DIRTYIRDIFF model input combinations (solid lines).
The three U-Nets are validated at the OT object scale to quantify performance differences using a variety of input combinations (Figs. 12 and 13). The CSI is generally highest for the MultiResUnet model for each input combination tested. The maximum CSI for the MultiResUnet and U-Net models with the IR+VIS combination is nearly the same; however, the goal of the OT and AACP models is to minimize false alarms while still detecting a significant number of OTs and AACPs. With this in mind, the MultiResUnet model outperforms the other ML models tested. The AttentionUnet and U-Net architectures achieve nearly equivalent POD but often yield worse FAR than the MultiResUnet model. MultiResUnet has the lowest FAR while detecting >70% of the analyst-labeled OTs for each model input combination. In addition, MultiResUnet offers a wider range of detection-confidence thresholds that simultaneously yield POD > 0.7 and FAR < 0.25 than the other models; this is true for all of the OT and AACP model input combinations tested (not shown). This wider range allows software users to choose model confidence thresholds to suit their applications: some users may accept false alarms as long as severe storm signature occurrences are not missed, while others may seek to be more conservative. As a result, the only model type we continue to discuss is the MultiResUnet.
CSI curves for each ML model type tested over the independent test case (26–27 May 2019). All models run at a GOES-16 VIS resolution of 0.5 km. The circles correspond to 0.05 likelihood threshold increments.
Performance diagram curves for each ML model type tested over the independent test case (26–27 May 2019). The black dashed line corresponds to the 1:1 line. All models are at the GOES-16 0.5-km VIS resolution. The circles correspond to 0.05 likelihood threshold increments. Threshold values at the endpoints are written in associated model color.
c. Sensitivity to image subset size
Figures 14 and 15 demonstrate the impact of subset size on detection performance at the IR and VIS resolutions. The OT (AACP) tests here use the IR-only (IR+DIRTYIRDIFF) model input combination. Colored lines missing from a particular panel correspond to subset sizes that resulted in extremely poor detection performance (i.e., uncorrectable underfitting); in these tests, the models were unable to learn anything, and maximum IoU scores were less than 0.001. Because AACPs are often much larger than OTs, the smallest subset size lines are missing for AACPs, while OT models did not produce usable results at the largest subset sizes tested. AACPs are likely too large for the small subset sizes, so images need to encompass more of their surroundings, whereas OTs are so small relative to large subsets that the model defaults to generating very low likelihoods throughout the domain.
Performance diagram curves for MultiResUnet models over the independent test case (26–27 May 2019). (a),(c) OT models using the IR-only input combination. (b),(d) CP models using the IR+DIRTYIRDIFF input combination. The black dashed line corresponds to the 1:1 line. The circles correspond to 0.05 likelihood threshold increments. Threshold values at the endpoints are written in the associated subset image size color. Missing colored lines correspond to subset sizes that were tested multiple times but never achieved usable results.
CSI curves for MultiResUnet models over the independent test case (26–27 May 2019). (a),(c) OT models using the IR-only input combination. (b),(d) CP models using the IR+DIRTYIRDIFF input combination. Missing colored lines correspond to subset sizes that were tested multiple times but never achieved usable results.
Smaller subset sizes often show higher FAR but achieve higher POD (magenta and dark green lines in Figs. 14a,b), while the opposite is often true of large subset sizes. However, this is not true for the OT VIS-resolution model (Fig. 14c), where 64 × 64 provides better FAR but worse POD than 128 × 128 when POD < 0.4, and worse FAR but better POD when POD > 0.4. In this case, evaluation of the CSI in Fig. 15c shows 128 × 128 to be consistently more than 0.1 better than the 64 × 64 subset size. Thus, we believe 128 × 128 is a better subset size that requires less model tuning to achieve valid results. The OT IR-resolution model performs well for the 16 × 16 and 32 × 32 pixel subsets, but we ultimately believe the optimal size is 64 × 64, as it provides lower FAR and nearly equivalent POD at likelihood thresholds yielding FAR < 0.25. In addition, its maximum CSI is slightly higher than that of the other subset sizes (Fig. 15a).
In the AACP IR-resolution performance diagram provided in Fig. 14b, 128 × 128 pixel subsets (cyan) have 0.05–0.1 lower FAR than 64 × 64 (red) for nearly equivalent POD. The CSI is also slightly higher for 128 × 128 than 64 × 64 (Fig. 15b). The 32 × 32 subset size CSI (purple) is larger than that of 128 × 128 and 64 × 64 at higher AACP thresholds due to multiple object detections overlapping with the same AACP mask. The optimal subset size for the AACP VIS-resolution model (256 × 256) is apparent in Figs. 14d and 15d, with a much better POD-to-FAR ratio and much higher CSI scores.
d. Model performance from varying input combinations
This study tests 18 different model input combinations at the VIS spacing and eight input combinations at the IR spacing. The AACP models in Figs. 16d–f and 17d–f correspond to CP model results; CP models are validated against every plume labeled by the analysts, not just the CP labels. The VIS-only models are not included because they exhibited extreme underfitting. Even though VIS performs poorly as the only input, it generally improves detection when used in combination with other inputs. Figures 16a–c show that OT models using shortwave channels (orange shading) generally outperform models that do not (tan shading).
(a),(d),(g) Performance diagram, (b),(e),(h) CSI, and (c),(f),(i) IoU plots for the best (a)–(c) OT, (d)–(f) CP, and (g)–(i) AP models for each input configuration at the GOES VIS data resolution across the independent test case. Circles along the performance diagram curves correspond to likelihood thresholds. The likelihood thresholds are spaced at 0.05 intervals starting with 0.05 at the top right and ending with 0.95 at the bottom left. Threshold values at the endpoints are written in the associated model input combination color. The black dashed line corresponds to the 1:1 line. The shaded orange (tan) regions reveal the range of daytime (nighttime) models.
(a),(d) Performance diagram, (b),(e) CSI, and (c),(f) IoU plots for each (a)–(c) OT and (d)–(f) CP model input configuration at the GOES native IR resolution for the independent test case. The circles along the performance diagram curves correspond to likelihood thresholds. The likelihood thresholds are spaced at 0.05 intervals starting with 0.05 at the top right and ending with 0.95 at the bottom left. Threshold values at the endpoints are written in the associated model input combination color. The black dashed line corresponds to the 1:1 line. For comparison, the dashed cyan lines correspond to the (a)–(c) OT IR-only and (d)–(f) CP IR+DIRTYIRDIFF model runs at the VIS pixel spacing (plotted in Fig. 16). The shaded tan regions reveal the range of the IR-resolution models.
The best daytime OT models use IR+VIS and IR+VIS+GLM inputs, which have very comparable validation statistics. IoU is highest (>0.4) for the IR+VIS+GLM OT detection model, but the maximum CSI for both models is ∼0.65. Beyond the extra time and computational resources required to grid, process, and calculate GLM FED data, GLM may not provide much information for storms that generate little lightning or whose lightning is obscured by optically thick ice cloud (Peterson et al. 2022). Thus, IR+VIS is the OT model we recommend for daytime. We sought OT models with >0.7 POD and <0.25 FAR; this range encompasses the analyst label uncertainty of up to 30% (not shown). The selection of the best OT model thresholds is straightforward, leveraging performance diagram, CSI, and IoU curves from the test (Figs. 16a–c and 17a–c) and validation cases (not shown). Using the performance diagram curves in Figs. 16a and 17a, we identify the likelihood value near the inflection point where FAR decreases more slowly than POD. For example, the optimal threshold for the IR+VIS configuration (red) is 0.25.
OTK detections are shown in pink in Figs. 16a–c. Recall that this algorithm uses 10.3-μm IR data and TropT to make detections. Khlopenkov et al. (2021) suggest using OT probability ≥0.5 as the detection threshold where the CSI for the test case is ∼0.5. While OTK detections have fewer false detections at higher probability thresholds than the IR-only model (shown in cyan), the IR-only model, using a threshold of 0.4, provides >0.1 improvement in FAR with a slightly better POD over the 0.5 probability OTK detections. IoU and CSI are also considerably better (>0.1) for the IR-only OT model. The addition of TROPDIFF (IR+TROPDIFF), which provides a more 1:1 comparison with OTK detections, tends to make OT detection likelihood more conservative. Using a threshold of 0.25, we see 0.2 (0.07) lower FAR (POD) than OTK 0.5 detections. In addition, the CSI is ∼0.07 better for IR+TROPDIFF. The best daytime models show ∼0.20 decrease in the FAR for nearly the same POD as OTK 0.5 probability detections.
Figure 16 also compares detection performance between CP (Figs. 16d–f) and AP (Figs. 16g–i) models. CP models exhibit far better POD and FAR than AP models. One exception is the IR+VIS+DIRTYIRDIFF model. This AP model shows better IoU and POD metrics than CP, but slightly poorer FAR. All AP input configurations have IoU scores higher than their CP counterparts, even though FAR is also higher and maximum CSI scores are lower. Since IoU is calculated on a pixel-by-pixel basis and false alarms negatively impact this metric, better IoU suggests that the AP models better depict AACP spatial extent. This is reinforced by displays of AP and CP model output shown in section 4e.
FAR and POD are generally lower for the IR-resolution models (Figs. 17a,d) than for models with the same inputs applied at the VIS pixel spacing (tan shading in Figs. 16a,d). This is likely due to the way the data are preprocessed. Recall that the optimal image subset size for IR-resolution models is half that of VIS-resolution models even though the VIS channel has 4× smaller pixel spacing. Thus, each IR-resolution model training image contains a higher fraction of null pixels, which reduces FAR, as we see from the subset size tests in Figs. 14 and 15. Ultimately, we believe this is favorable because running model predictions at a degraded resolution should yield less confidence than at higher resolutions. One input combination to note is IR+WVIRDIFF, which performs substantially better at the IR pixel spacing (2 km) than at the VIS spacing (0.5 km). We trained the model multiple times at both resolutions and found this to consistently be the case. The reason is currently unknown, but performance on the training and validation cases is not as favorable as on the test case for the IR-resolution IR+WVIRDIFF model. Because IR+DIRTYIRDIFF performs consistently well, we believe it is the best AACP nighttime model combination.
In general, OT models appear more reliable than AACP models. This was expected because detecting AACPs is more challenging for analysts and therefore should be harder for a model. In addition, the model may continue to label an AACP long after the parent updraft has dissipated or the AACP has detached from its parent OT, whereas the analysts stopped labeling it (discussed further in section 4e). This is one reason FAR is consistently higher for AACPs than for OTs. Overall, the CP model often captures the existence of an AACP but does not capture its spatial extent well. The best-performing AACP detection models use combinations of IR, VIS, and DIRTYIRDIFF. Figure 1g shows that DIRTYIRDIFF exhibits greater values in AACP regions, but the importance of DIRTYIRDIFF to AACP detection was unexpected at the outset of this research, as it had not been previously documented.
e. Comparison of AACP model detection behavior
Because statistical metrics depend heavily upon the characteristics of the analyst labels, determining optimal AACP model thresholds is more challenging than for OTs and necessitates a multifaceted approach that includes qualitative evaluation of test case scenes in addition to statistical analysis. Maps of AACP object likelihood contours overlaid on IR/VIS sandwich test case scenes, such as those shown in Figs. 18 and 19, are used to identify commonalities among false positives and to incrementally raise detection thresholds to eliminate them. Examples from two daytime and all-day models are provided in Figs. 18 and 19. The blue arrows in Fig. 18 mark locations where daytime AP models incorrectly detect, with high confidence, wispy clouds in the anvil that are reminiscent of AACPs in VIS imagery. Sometimes these detections are detached AACPs, but occasionally the model misidentifies an anvil. The IR-only model clearly struggled to learn this pattern because of the lack of well-defined BT gradients in AACPs. DIRTYIRDIFF helps constrain the AP model detections, but IR+DIRTYIRDIFF still falsely detects some anvil regions.
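The per-object likelihood contours used in this evaluation can be emulated by grouping contiguous detection pixels into objects and assigning each object its maximum likelihood. The sketch below uses SciPy connected-component labeling; the minimum-likelihood floor of 0.05 and the variable names are illustrative assumptions, not our exact implementation.

```python
import numpy as np
from scipy import ndimage

def object_max_likelihood(likelihood, floor=0.05):
    """Label contiguous pixels with likelihood >= floor as objects and
    return the object map plus each object's maximum likelihood."""
    object_ids, n_objects = ndimage.label(likelihood >= floor)
    if n_objects == 0:
        return object_ids, np.array([])
    # Maximum likelihood within each labeled object (objects numbered from 1)
    maxima = ndimage.maximum(likelihood, labels=object_ids,
                             index=np.arange(1, n_objects + 1))
    return object_ids, np.asarray(maxima)
```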
Example scenes where AP models incorrectly identify AACPs in anvil regions. Contour colors correspond to the maximum likelihood value in each AACP object. (a),(c) Daytime and (b),(d) nighttime models. The blue arrows point to some examples of false detections in the anvil. Green arrows point to some examples of false detections within anvil mergers and on the outside of enhanced Vs.
Example scenes where CP models incorrectly identify AACPs near anvil mergers and on the outside of the enhanced-V arms (green arrows). Contour colors correspond to the maximum likelihood value in each AACP object. (a),(c) Daytime and (b),(d) nighttime models.
Figures 19b and 19d show that CP model false alarms typically occur near anvil mergers and outside of the arms of enhanced Vs (green arrows). For example, the green arrows in the IR+DIRTYIRDIFF scene point to these false detections, which all have likelihoods <0.4. The CP models identify these regions as AACPs because of the large temperature gradient: the "bridge" between merging anvils is warmer than the adjacent anvil, reminiscent of a plume between enhanced-V arms. These incorrect identifications are also seen in the AP models (green arrows in Figs. 18b,d). Even though these occurrences are rare and often signify severe convection, attempts were made to reduce their frequency by using targeted null cases around anvil merger points in training scenes. Some high-FAR input combinations showed improvement while others did not (not shown). Raising the required likelihood threshold greatly reduces these false detections. The optimal threshold for IR+DIRTYIRDIFF is found to be 0.45.
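Raising a per-object threshold amounts to discarding every object whose maximum likelihood falls below the chosen value. Continuing the earlier sketch (hypothetical variable names), the 0.45 threshold adopted for IR+DIRTYIRDIFF could be applied as follows:

```python
import numpy as np

def apply_object_threshold(likelihood, object_ids, object_maxima, threshold=0.45):
    """Zero out whole objects whose maximum likelihood is below threshold.

    object_ids and object_maxima come from object_max_likelihood() above;
    object labels start at 1, so index i of object_maxima maps to label i+1."""
    keep_ids = np.flatnonzero(object_maxima >= threshold) + 1
    mask = np.isin(object_ids, keep_ids)
    return np.where(mask, likelihood, 0.0)
```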
Figures 20 and 21 show examples of thresholded detections from the IR+VIS+DIRTYIRDIFF CP and AP models, respectively, and can be compared with the analyst labels in Fig. 3. As described previously, the CP model detects AACP warm anomalies, while the AP model better captures their peripheries. The 0024:33 UTC 6 May 2019 scene in Fig. 21c shows the AP model capturing two detached AACPs. These are not incorrect detections, but they were no longer labeled because their parent updrafts had dissipated, penalizing the performance statistics. The remaining scenes exhibit very good AACP detection: there are no false detections, and nearly every AACP is captured by both the CP and AP models.
The same scenes as in Figs. 3 and 21, but with black contours representing MultiResUnet model detections from the IR+VIS OT input configuration and magenta contours representing model detections from the IR+VIS+DIRTYIRDIFF CP model configuration. All OT objects shown have maximum likelihood values ≥0.25. All AACP objects shown have maximum likelihood values ≥0.30.
As in Figs. 3 and 20, but black (magenta) contours representing MultiResUnet model detections from the IR+VIS OT and IR+VIS+DIRTYIRDIFF AP models. All OT objects shown have maximum likelihood values ≥0.25. All AACP objects shown have maximum likelihood values ≥0.45.
f. Model detections aggregated across cases
In addition to generally performing well from scene to scene, the models also accurately capture OTs and AACPs when aggregated over entire cases. Figure 22 shows the time aggregation of model detections, analyst labels, and GridRad zr from the best daytime and nighttime OT/AACP models throughout three labeled cases: a training, a validation, and a test case. While the IR-only OT model (Figs. 22a,e,i) exhibits more false positives than the IR+VIS model (Figs. 22b,f,j), it yields better overlap with labels when aggregated across these cases (55%–68% vs 52%–63%). A lower IR+VIS threshold may provide better aggregated overlap; however, it would also introduce more false alarms from scene to scene. OT and AACP model detections generally coincide with GridRad identifications of tropopause-overshooting convection. The overlap of the models relative to GridRad is worse than that relative to analyst labels because GridRad depicts much larger overshooting areas (white to red shading) than the GOES imagery shows, as illustrated previously in Fig. 7. For each aggregated event, ≤5.2% (11%) of IR+VIS (IR-only) OT model detection pixels are observed without a 10-dBZ echo above the tropopause. Further GridRad comparisons are provided in the following section. The CP model exhibits worse overlap for the 5 May and 17 May 2019 cases (33% and 35% vs 41%) due to the very large AACPs observed during those events. For both cases, the CP model makes very few false detections and overlaps nearly every labeled AACP at some point during the storm’s lifetime, even if it does not capture the full AACP extent.
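The aggregated-overlap statistic can be sketched as follows, assuming time-stacked boolean arrays of model detections and analyst labels with dimensions (time, y, x). Defining overlap as the fraction of time-aggregated labeled pixels that are also detected is our reading of the figure and is flagged as an assumption; the paper's exact bookkeeping may differ.

```python
import numpy as np

def aggregated_overlap(detections, labels):
    """Fraction of time-aggregated labeled pixels overlapped by detections.

    detections, labels: boolean arrays of shape (time, y, x). A pixel counts
    as detected/labeled if it is flagged at any time during the case
    (assumed definition)."""
    det_any = detections.any(axis=0)
    lab_any = labels.any(axis=0)
    overlap = np.logical_and(det_any, lab_any).sum()
    return overlap / lab_any.sum() if lab_any.any() else float("nan")
```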
(a)–(d) Time-aggregated maps between 1930 UTC 5 May 2019 and 0100 UTC 6 May 2019. (e)–(h) Time-aggregated maps between 1800 UTC 17 May 2019 and 0120 UTC 18 May 2019. (i)–(l) Time-aggregated maps between 1800 UTC 26 May 2019 and 0055 UTC 27 May 2019. (a)–(c),(e)–(g),(i)–(k) Time aggregation of model detections and analyst labels. Dark red corresponds to pixels labeled by the analyst with no matching model detection. Dark blue corresponds to model detection pixels with no matching analyst label. Green shows the model and analyst overlapping pixels. Analyst overlap provides the time-aggregated fraction of overlapping pixels between model and analyst-labeled pixels, as well as model and GridRad overshooting regions. (d),(h),(l) Tropopause-relative GridRad 10-dBZ precipitation echo top. The white lines signify edges of the GOES MDS domain.
g. Comparing model and GridRad OT detections
While radar data were not considered effective for training the models (see section 3b and Fig. 7), they can be used to evaluate satellite-based OT detection quality over broad domains and time periods impractical for human analysts to label. Following Cooney et al. (2021), OT detection accuracy is quantified using POD and FAR, with GridRad zr serving as a reference. Figure 23 shows the IR-only (left) and IR+VIS OT (right) model performance (at VIS pixel spacing) across the various regions seen in Fig. 4. For an echo-top region to be considered an OT, GridRad must have ≥40-dBZ echo in the column and ≥10-dBZ echo above the tropopause. The performance curves show that both models skillfully detect GridRad OTs, with POD exceeding FAR. Performance varies from case to case but is generally better for the spring and summer cases. Performance also differs for the same case between the IR-only and IR+VIS models: compared with IR+VIS, IR-only performance is much better relative to echo tops than relative to the human labels (magenta solid versus dashed lines). This suggests that cold spots in IR imagery are closely related to high-altitude echo tops that exhibit little to no texture; if these high echo tops had OT-like texture, they would have been labeled by an analyst. To better elucidate performance differences across the cases, Fig. 24 provides histograms of maximum column reflectivity and 10-dBZ echo tops within 10 km of IR-only model OT pixels with confidence values >0.05. Cases with worse FAR often exhibit a higher frequency of echo tops 1–2 km beneath the tropopause than cases with better FAR. In each case, though, the reflectivity distribution is quite similar, peaking around 50 dBZ. This implies that the detected convection is comparably vigorous from a reflectivity perspective, but not all strong storms have tops that reach the tropopause. Thus, while both models exhibit skillful performance relative to both echo tops and labels, these results further reinforce the challenge of definitively assessing OT detection model performance.
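The echo-top OT criterion used as the radar reference can be expressed compactly. The sketch below assumes a hypothetical GridRad-like reflectivity volume on regular altitude levels and a 2D tropopause-height field; it is a schematic of the stated 40- and 10-dBZ thresholds, not the GridRad-Severe processing chain.

```python
import numpy as np

def echo_top_ot_mask(reflectivity, altitude_km, tropopause_km):
    """Columns meeting the OT reference criterion: >=40 dBZ anywhere in the
    column and >=10 dBZ at some level above the local tropopause.

    reflectivity:  (nz, ny, nx) volume in dBZ (NaN where no echo)
    altitude_km:   (nz,) level altitudes
    tropopause_km: (ny, nx) tropopause altitude"""
    refl = np.nan_to_num(reflectivity, nan=-999.0)
    strong_column = (refl >= 40.0).any(axis=0)
    above_tropopause = altitude_km[:, None, None] > tropopause_km[None, :, :]
    overshooting_echo = ((refl >= 10.0) & above_tropopause).any(axis=0)
    return strong_column & overshooting_echo
```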
Performance diagrams for (left) IR+VIS and (right) IR-only OT model detections relative to 10-dBZ echo-top heights that surpass the tropopause. For reference, the dashed pink lines with box symbols correspond to (left) IR+VIS and (right) IR-only OT model detections relative to human-labeled OTs for the 26–27 May 2019 test case. The gray dashed line corresponds to the 1:1 line. Threshold values at the endpoints are written in the associated case color.
Histograms of (left) maximum column reflectivity within 10 km of IR-only OT pixels with OT probability greater than 0.05 and (right) 10-dBZ tropopause-relative echo top. The line colors correspond to the case dates.
5. Conclusions
This paper presents models for skillful detection of OT and AACP patterns depicted in GOES-R series data using novel, open-source, machine-learning-based software. ML models are trained and validated using analyst OT and AACP labels across seven widespread severe storm outbreaks. OT model output is also compared with OT estimates from precipitation echo tops for several other cases across the United States. Multiple ML models and input combinations are tested to find those that perform best, determined through quantitative and qualitative analyses. Of the three ML models tested, the MultiResUnet model provides the best ratio of POD to FAR as well as a larger range of likelihood scores. A variety of statistical metrics are used to determine optimal likelihood thresholds. AACPs are more challenging because of questions over how long to label an AACP and whether detached AACPs would be labeled and captured by the models, necessitating qualitative analysis in addition to statistical metrics.
OT models show very good performance on the independent test case in terms of detection accuracy and spatial extent. IR+VIS+GLM and IR+VIS are the best daytime models, with ∼0.65 CSI; however, IR+VIS is the recommended OT model because lightning is inconsistent from case to case and because of the time it takes to grid and calculate the GLM FED product. The best OT model for day + night operation uses only IR data. OT detection regions overlap with >50% of the labeled regions across a multihour event and show improvement over a leading detection method from Khlopenkov et al. (2021). The models are skillful relative to OT detections from GridRad echo tops surpassing the tropopause across many events in various regions and seasons. Echo-top data show that not all strong convective cells associated with localized cold spots in IR satellite imagery have tops above the tropopause, contributing to uncertainty when quantifying satellite OT detection performance.
Models trained using confident plume labels generally outperform models trained with all plumes. The IR+VIS+DIRTYIRDIFF model is the best-performing AP model; however, further analysis is needed before it can be integrated into the detection software because of concerns about high FAR across all of the AP models. Additional training cases may be required to improve performance before applying these models globally. Our goal is to increase the number of detection labels by bootstrapping the IR+VIS+DIRTYIRDIFF model on new cases and regions and quality controlling its detections, which will allow us to continue refining and testing the AP models for broader use. CP models accurately detect AACPs with few false detections but are unable to capture their full spatial extent. Knowledge that an OT updraft is producing an AACP could be sufficient to consider issuing a severe weather warning, given the strong correspondence between AACPs and severe weather (Bedka et al. 2018); however, those interested in studying the impact of severe convection on UTLS composition may desire better detection of plume extent. The best CP models are the IR+DIRTYIRDIFF and IR+VIS+DIRTYIRDIFF combinations. We recommend IR+DIRTYIRDIFF as the optimal day + night CP model, providing users with consistent AACP detections across an entire day. This model exhibits >30% overlap with labeled regions when aggregated across multihour events.
The software can be acquired at https://github.com/nasa/svrstormsig. It is the only open-source software we know of for detecting OTs and AACPs. It can efficiently process imagery and keep pace with the ABI imaging frequency for near-real-time operations, while also accommodating historical cases. We envision it being useful for a variety of operational weather nowcasting and weather and climate research applications. While we believe this software product and research are highly impactful, they represent only a first step. Because of the limited number of training cases and case regions, the software should be considered proof of concept at this stage. For example, the performance of OT detections relative to GridRad (Fig. 23) showed some regional and seasonal variation, which needs to be better understood and addressed. Future work will focus on expanding the training dataset by using the machine learning models presented in this paper to make detections across additional regions and seasons, followed by manual quality control of those detections. These semiautomated detections, combined with the original dataset, will form a larger and more diverse training set to improve the performance and fidelity of future models. We also plan to simulate coarser-resolution historical geostationary data from current data and retrain the models so that they perform consistently throughout the climate data record and with other satellites such as the Meteosat Second and Third Generation series. Efforts will continue to be devoted to improving AACP spatial-extent detection while maintaining high accuracy. Labeling of 1-min MDS imagery also enables the possibility of using an RNN to detect the unique temporal evolution of OTs and AACPs, which could further improve accuracy.
Acknowledgments.
Three anonymous University of Oklahoma undergraduate students helped create the OT and AACP labels used as masks for model training. Google’s AI-generated color-blind-friendly color table, Turbo, was used to generate figures. The authors acknowledge funding from the NASA ROSES-2019 A.33 “Research from Geostationary Satellites” award.
Data availability statement.
The master set of OT and AACP labels used in this study, netCDF files containing model input variables, and model-trained checkpoint files can be found at https://science-data.larc.nasa.gov/LaRC-SD-Publications/2023-05-05-001-JWC/data/. All software can be found at https://github.com/nasa/svrstormsig. All GridRad volumes were obtained from the GridRad-Severe database. All raw GOES L1b files were obtained from the Google Cloud Platform.
REFERENCES
Anderson, J. G., D. M. Wilmouth, J. B. Smith, and D. S. Sayres, 2012: UV dosage levels in summer: Increased risk of ozone loss from convectively injected water vapor. Science, 337, 835–839, https://doi.org/10.1126/science.1222978.
Anderson, J. G., and Coauthors, 2017: Stratospheric ozone over the United States in summer linked to observations of convection and temperature via chlorine and bromine catalysis. Proc. Natl. Acad. Sci. USA, 114, E4905–E4913, https://doi.org/10.1073/pnas.1619318114.
Ayd, P., 2024: GOES-16 band reference guide. 4 pp., https://www.weather.gov/media/crp/GOES_16_Guides_FINALBIS.pdf.
Bang, S. D., and D. J. Cecil, 2019: Constructing a multifrequency passive microwave hail retrieval and climatology in the GPM domain. J. Appl. Meteor. Climatol., 58, 1889–1904, https://doi.org/10.1175/JAMC-D-19-0042.1.
Barron, K., V. Sarago, and B. Varga, 2014: suncalc-py. GitHub, https://github.com/kylebarron/suncalc-py.
Bedka, K. M., and K. Khlopenkov, 2016: A probabilistic multispectral pattern recognition method for detection of overshooting cloud tops using passive satellite imager observations. J. Appl. Meteor. Climatol., 55, 1983–2005, https://doi.org/10.1175/JAMC-D-15-0249.1.
Bedka, K. M., J. Brunner, R. Dworak, W. Feltz, J. Otkin, and T. Greenwald, 2010: Objective satellite-based detection of overshooting tops using infrared window channel brightness temperature gradients. J. Appl. Meteor. Climatol., 49, 181–202, https://doi.org/10.1175/2009JAMC2286.1.
Bedka, K. M., J. Brunner, and W. Feltz, 2011: Objective overshooting top and enhanced-V signature detection for the GOES-R Advanced Baseline Imager. NOAA/NESDIS Algorithm Theoretical Basis Doc., version 1.0, 53 pp., https://www.goes-r.gov/products/ATBDs/option2/Aviation_OvershootingTop_v1_no_color.pdf.
Bedka, K. M., R. Dworak, J. Brunner, and W. Feltz, 2012: Validation of satellite-based objective overshooting cloud-top detection methods using CloudSat cloud profiling radar observations. J. Appl. Meteor. Climatol., 51, 1811–1822, https://doi.org/10.1175/JAMC-D-11-0131.1.
Bedka, K., E. M. Murillo, C. R. Homeyer, B. Scarino, and H. Mersiovsky, 2018: The above-anvil cirrus plume: An important severe weather indicator in visible and infrared satellite imagery. Wea. Forecasting, 33, 1159–1181, https://doi.org/10.1175/WAF-D-18-0040.1.
Bedka, K., and Coauthors, 2019: Analysis and automated detection of ice crystal icing conditions using geostationary satellite datasets and in situ ice water content measurements. SAE Int. J. Adv. Curr. Prac. Mobility, 2, 35–57, https://doi.org/10.4271/2019-01-1953.
Bethan, S., G. Vaughan, and S. J. Reid, 1996: A comparison of ozone and thermal tropopause heights and the impact of tropopause definition on quantifying the ozone content of the troposphere. Quart. J. Roy. Meteor. Soc., 122, 929–944, https://doi.org/10.1002/qj.49712253207.
Bosilovich, M., and R. Lucchesi, 2016: MERRA-2: File specification. Global Modeling and Assimilation Office, https://gmao.gsfc.nasa.gov/pubs/docs/Bosilovich785.pdf.
Brownlee, J., 2020: What is the difference between test and validation datasets? Machine Learning Mastery, https://machinelearningmastery.com/difference-test-validation-datasets/.
Bruning, E. C., and Coauthors, 2019: Meteorological imagery for the geostationary lightning mapper. J. Geophys. Res. Atmos., 124, 14 285–14 309, https://doi.org/10.1029/2019JD030874.
Brunner, J., S. A. Ackerman, A. S. Bachmeier, and R. M. Rabin, 2007: A quantitative analysis of the enhanced-V feature in relation to severe weather. Wea. Forecasting, 22, 853–872, https://doi.org/10.1175/WAF1022.1.
Cintineo, J. L., M. J. Pavolonis, J. M. Sieglaff, L. Cronce, and J. Brunner, 2020: NOAA ProbSevere v2.0—ProbHail, ProbWind, and ProbTor. Wea. Forecasting, 35, 1523–1543, https://doi.org/10.1175/WAF-D-19-0242.1.
Cooney, J. W., K. P. Bowman, C. R. Homeyer, and T. M. Fenske, 2018: Ten year analysis of tropopause-overshooting convection using GridRad data. J. Geophys. Res. Atmos., 123, 329–343, https://doi.org/10.1002/2017JD027718.
Cooney, J. W., K. M. Bedka, K. P. Bowman, and K. V. Khlopenkov, 2021: Comparing tropopause-penetrating convection identifications derived from NEXRAD and GOES over the contiguous United States. J. Geophys. Res., 126, e2020JD034319, https://doi.org/10.1029/2020JD034319.
Cui, W., X. Dong, B. Xi, Z. Feng, and J. Fan, 2020: Can the GPM IMERG final product accurately represent MCSs’ precipitation characteristics over the central and eastern United States? J. Hydrometeor., 21, 39–57, https://doi.org/10.1175/JHM-D-19-0123.1.
Dessler, A. E., and S. C. Sherwood, 2004: Effect of convection on the summertime extratropical lower stratosphere. J. Geophys. Res., 109, D23301, https://doi.org/10.1029/2004JD005209.
Dworak, R., K. M. Bedka, J. Brunner, and W. Feltz, 2012: Comparison between GOES-12 overshooting-top detections, WSR-88D radar reflectivity, and severe storm reports. Wea. Forecasting, 27, 684–699, https://doi.org/10.1175/WAF-D-11-00070.1.
Everingham, M., L. V. Gool, S. A. Eslami, C. K. Williams, J. Winn, and A. Zisserman, 2015: The PASCAL visual object classes challenge: A retrospective. Int. J. Comput. Vision, 111, 98–136, https://doi.org/10.1007/s11263-014-0733-5.
Forster, P. M. D., and K. P. Shine, 1999: Stratospheric water vapour changes as a possible contributor to observed stratospheric cooling. Geophys. Res. Lett., 26, 3309–3312, https://doi.org/10.1029/1999GL010487.
Fujita, T. T., 1989: The Teton-Yellowstone tornado of 21 July 1987. Mon. Wea. Rev., 117, 1913–1940, https://doi.org/10.1175/1520-0493(1989)117<1913:TTYTOJ>2.0.CO;2.
Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Gettelman, A., P. Hoor, L. L. Pan, W. J. Randel, M. I. Hegglin, and T. Birner, 2011: The extratropical upper troposphere and lower stratosphere. Rev. Geophys., 49, RG3003, https://doi.org/10.1029/2011RG000355.
Glickman, T. S., 2000: Glossary of Meteorology. Amer. Meteor. Soc., 855 pp.
Google, 2024: gcp-public-data-goes-16. https://console.cloud.google.com/marketplace/product/noaa-public/goes.
Gordon, A. E., and Coauthors, 2024: Airborne observations of upper troposphere and lower stratosphere composition change in active convection producing above-anvil cirrus plumes. Atmos. Chem. Phys., 24, 7591–7608, https://doi.org/10.5194/acp-24-7591-2024.
Heidinger, A. K., and M. J. Pavolonis, 2009: Gazing at Cirrus Clouds for 25 years through a split window. Part I: Methodology. J. Appl. Meteor. Climatol., 48, 1100–1116, https://doi.org/10.1175/2008JAMC1882.1.
Herman, R. L., and Coauthors, 2017: Enhanced stratospheric water vapor over the summertime continental United States and the role of overshooting convection. Atmos. Chem. Phys., 17, 6113–6124, https://doi.org/10.5194/acp-17-6113-2017.
Hochreiter, S., and J. Schmidhuber, 1997: Long short-term memory. Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
Holton, J. R., P. H. Haynes, M. E. McIntyre, A. R. Douglass, R. B. Rood, and L. Pfister, 1995: Stratosphere-troposphere exchange. Rev. Geophys., 33, 403–439, https://doi.org/10.1029/95RG02097.
Homeyer, C. R., K. P. Bowman, and L. L. Pan, 2010: Extratropical tropopause transition layer characteristics from high-resolution sounding data. J. Geophys. Res., 115, https://doi.org/10.1029/2009JD013664.
Homeyer, C. R., 2014: Formation of the enhanced-V infrared cloud-top feature from high-resolution three-dimensional radar observations. J. Atmos. Sci., 71, 332–348, https://doi.org/10.1175/JAS-D-13-079.1.
Homeyer, C. R., and M. R. Kumjian, 2015: Microphysical characteristics of overshooting convection from polarimetric radar observations. J. Atmos. Sci., 72, 870–891, https://doi.org/10.1175/JAS-D-13-0388.1.
Homeyer, C. R., and K. P. Bowman, 2022: Algorithm description document for version 4.2 of the three-dimensional gridded NEXRAD WSR-88D radar (GridRad) dataset. Tech. Doc., 30 pp., https://gridrad.org/pdf/GridRad-v4.2-Algorithm-Description.pdf.
Homeyer, C. R., J. D. McAuliffe, and K. M. Bedka, 2017: On the development of above-anvil cirrus plumes in extratropical convection. J. Atmos. Sci., 74, 1617–1633, https://doi.org/10.1175/JAS-D-16-0269.1.
Ibtehaz, N., and M. S. Rahman, 2020: MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks, 121, 74–87, https://doi.org/10.1016/j.neunet.2019.08.025.
IPCC, 2021: Climate Change 2021: The Physical Science Basis. V. Masson-Delmotte et al., Eds., Cambridge University Press, 2391 pp., https://doi.org/10.1017/9781009157896.
Jellis, D., K. P. Bowman, and A. D. Rapp, 2023: Lifetimes of overshooting convective events using high-frequency gridded radar composites. Mon. Wea. Rev., 151, 1979–1992, https://doi.org/10.1175/MWR-D-23-0032.1.
Khlopenkov, K. V., K. M. Bedka, J. W. Cooney, and K. Itterly, 2021: Recent advances in detection of overshooting cloud tops from longwave infrared satellite imagery. J. Geophys. Res. Atmos., 126, e2020JD034359, https://doi.org/10.1029/2020jd034359.
Kim, M., J. Lee, and J. Im, 2018: Deep learning-based monitoring of overshooting cloud tops from geostationary satellite data. GIScience Remote Sens., 55, 763–792, https://doi.org/10.1080/15481603.2018.1457201.
Lagerquist, R., A. McGovern, and D. J. Gagne II, 2019: Deep learning for spatially explicit prediction of synoptic-scale fronts. Wea. Forecasting, 34, 1137–1160, https://doi.org/10.1175/WAF-D-18-0183.1.
Lakshmanan, L., 2019: ML design pattern #2: Checkpoints. Towards Data Science, 27 September, https://medium.com/data-science/ml-design-pattern-2-checkpoints-e6ca25a4c5fe.
Lee, J., M. Kim, J. Im, H. Han, and D. Han, 2021: Pre-trained feature aggregated deep learning-based monitoring of overshooting tops using multi-spectral channels of GeoKompsat-2A advanced meteorological imagery. GIScience Remote Sens., 58, 1052–1071, https://doi.org/10.1080/15481603.2021.1960075.
Liu, C., and E. J. Zipser, 2005: Global distribution of convection penetrating the tropical tropopause. J. Geophys. Res., 110, D23104, https://doi.org/10.1029/2005JD006063.
Liu, N., C. Liu, and L. Hayden, 2020: Climatology and detection of overshooting convection from 4 years of GPM precipitation radar and passive microwave observations. J. Geophys. Res., 125, e2019JD032003, https://doi.org/10.1029/2019JD032003.
Losos, D., 2019: GOES-R Series Product Definition and Users’ Guide. 3rd ed., 402 pp.
McCann, D. W., 1983: The enhanced-V: A satellite observable severe storm signature. Mon. Wea. Rev., 111, 887–894, https://doi.org/10.1175/1520-0493(1983)111<0887:TEVASO>2.0.CO;2.
McGovern, A., K. L. Elmore, D. J. Gagne II, S. E. Haupt, C. D. Karstens, R. Lagerquist, T. Smith, and J. K. Williams, 2017: Using artificial intelligence to improve real-time decision-making for high-impact weather. Bull. Amer. Meteor. Soc., 98, 2073–2090, https://doi.org/10.1175/BAMS-D-16-0123.1.
McGovern, A., C. D. Karstens, T. Smith, and R. Lagerquist, 2019: Quasi-operational testing of real-time storm-longevity prediction via machine learning. Wea. Forecasting, 34, 1437–1451, https://doi.org/10.1175/WAF-D-18-0141.1.
Minaee, S., Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, 2022: Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 44, 3523–3542, https://doi.org/10.1109/TPAMI.2021.3059968.
Murillo, E. M., and C. R. Homeyer, 2022: What determines above-anvil cirrus plume infrared temperature? J. Atmos. Sci., 79, 3184–3194, https://doi.org/10.1175/JAS-D-22-0080.1.
Murphy, A. M., C. R. Homeyer, and K. Q. Allen, 2023: Development and investigation of GridRad-Severe, a multiyear severe event radar dataset. Mon. Wea. Rev., 151, 2257–2277, https://doi.org/10.1175/MWR-D-23-0017.1.
Negri, A. J., 1982: Cloud-top structure of tornadic storms on 10 April 1979 from rapid scan and stereo satellite observations. Bull. Amer. Meteor. Soc., 63, 1151–1159, https://doi.org/10.1175/1520-0477-63.10.1151.
NOAA, 2018: Geostationary Operational Environmental Satellite-R (GOES-R) Series Program User Readiness Plan. 410-R-PLN-0250, 38 pp., https://www.goes-r.gov/org/docs/user-readiness-plan.pdf.
NCEP, NWS, NOAA, and U.S. Department of Commerce, 2015: NCEP GFS 0.25 degree global forecast grids historical archive. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, accessed 03 March 2024, https://doi.org/10.5065/D65D8PWK.
Oktay, O., J. Schlemper, L. Le Folgoc, M. Lee, M. Heinrich, K. Misawa, and D. Rueckert, 2018: Attention U-Net: Learning where to look for the pancreas. arXiv, 1804.03999v3, https://doi.org/10.48550/arXiv.1804.03999.
O’Neill, M. E., L. Orf, G. M. Heymsfield, and K. Halbert, 2021: Hydraulic jump dynamics above supercell thunderstorms. Science, 373, 1248–1251, https://doi.org/10.1126/science.abh3857.
Pan, L. L., and Coauthors, 2014: Thunderstorms enhance tropospheric ozone by wrapping and shedding stratospheric air. Geophys. Res. Lett., 41, 7785–7790, https://doi.org/10.1002/2014GL061921.
Peterson, M., T. E. L. Light, and D. Mach, 2022: The illumination of thunderclouds by lightning: 2. The effect of GLM instrument threshold on detection and clustering. Earth Space Sci., 9, e2021EA001943, https://doi.org/10.1029/2021EA001943.
Reynolds, D. W., 1980: Observations of damaging hailstorms from geosynchronous satellite digital data. Mon. Wea. Rev., 108, 337–348, https://doi.org/10.1175/1520-0493(1980)108<0337:OODHFG>2.0.CO;2.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. arXiv, 1505.04597v1, https://doi.org/10.48550/arXiv.1505.04597.
Rudlosky, S., 2018: Geostationary Lightning Mapper: Gridded products (FED) quick guide. 2nd ed.
Sandmæl, T. N., C. R. Homeyer, K. M. Bedka, J. M. Apke, J. R. Mecikalski, and K. Khlopenkov, 2019: Evaluating the ability of remote sensing observations to identify significantly severe and potentially tornadic storms. J. Appl. Meteor. Climatol., 58, 2569–2590, https://doi.org/10.1175/jamc-d-18-0241.1.
Scarino, B., K. Itterly, K. Bedka, C. R. Homeyer, J. Allen, S. Bang, and D. Cecil, 2023: Deriving severe hail likelihood from satellite observations and model reanalysis parameters using a deep neural network. Artif. Intell. Earth Syst., 2, e220042, https://doi.org/10.1175/AIES-D-22-0042.1.
Schmetz, J., S. A. Tjemkes, M. Gube, and L. van de Berg, 1997: Monitoring deep convection and convective overshooting with Meteosat. Adv. Space Res., 19, 433–441, https://doi.org/10.1016/S0273-1177(97)00051-3.
Schmidt, T., S. Heise, J. Wickert, G. Beyerle, and C. Reigber, 2005: GPS radio occultation with CHAMP and SAC-C: Global monitoring of thermal tropopause parameters. Atmos. Chem. Phys., 5, 1473–1488.
Schmit, T. J., P. Griffith, M. M. Gunshor, J. M. Daniels, S. J. Goodman, and W. J. Lebair, 2017: A closer look at the ABI on the GOES-R series. Bull. Amer. Meteor. Soc., 98, 681–698, https://doi.org/10.1175/BAMS-D-15-00230.1.
Schmit, T. J., S. S. Lindstrom, J. J. Gerth, and M. M. Gunshor, 2018: Applications of the 16 spectral bands on the Advanced Baseline Imager (ABI). J. Oper. Meteor., 6, 33–46, https://doi.org/10.15191/nwajom.2018.0604.
Setvak, M., and C. A. Doswell, 1991: The AVHRR channel 3 cloud top reflectivity of convective storms. Mon. Wea. Rev., 119, 841–847, https://doi.org/10.1175/1520-0493(1991)119<0841:TACCTR>2.0.CO;2.
Setvak, M., and Coauthors, 2010: Satellite-observed cold-ring-shaped features atop deep convective clouds. Atmos. Res., 97, 80–96, https://doi.org/10.1016/j.atmosres.2010.03.009.
Setvak, M., K. M. Bedka, D. T. Lindsey, A. Sokol, Z. Charvat, J. Stastka, and P. K. Wang, 2013: A-Train observations of deep convective storm tops. Atmos. Res., 123, 229–248, https://doi.org/10.1016/j.atmosres.2012.06.020.
Smith, A., 2024: 2023: A historic year of U.S. billion-dollar weather and climate disasters. NOAA, accessed 05 May 2025, https://www.climate.gov/news-features/blogs/beyond-data/2023-historic-year-us-billion-dollar-weather-and-climate-disasters.
Smith, J. B., and Coauthors, 2017: A case study of convectively sourced water vapor observed in the overworld stratosphere over the United States. J. Geophys. Res. Atmos., 122, 9529–9554, https://doi.org/10.1002/2017JD026831.
Solomon, D. L., K. P. Bowman, and C. R. Homeyer, 2016: Tropopause-penetrating convection from three-dimensional gridded NEXRAD data. J. Appl. Meteor. Climatol., 55, 465–478, https://doi.org/10.1175/JAMC-D-15-0190.1.
Stohl, A., H. Wernli, P. James, M. Bourqui, C. Forster, M. A. Liniger, P. Seibert, and M. Sprenger, 2003: A new perspective of stratosphere–troposphere exchange. Bull. Amer. Meteor. Soc., 84, 1565–1574, https://doi.org/10.1175/BAMS-84-11-1565.
Tealab, A., 2018: Time series forecasting using artificial neural networks methodologies: A systematic review. Future Comput. Inf. J., 3, 334–340, https://doi.org/10.1016/j.fcij.2018.10.003.
Wada, K., 2021: labelme: Image polygonal annotation with python. GitHub, https://github.com/wkentaro/labelme.
Williams, R. J., and D. Zipser, 1989: A learning algorithm for continually running fully recurrent neural networks. Neural Comput., 1, 270–280, https://doi.org/10.1162/neco.1989.1.2.270.
WMO, 1957: Meteorology—A three-dimensional science: Second session of the commission for aerology. WMO Bull., IV (4), 134–138.
Zibert, M. I., and J. Zibert, 2013: Monitoring and automatic detection of the cold-ring patterns atop deep convective clouds using Meteosat data. Atmos. Res., 123, 281–292, https://doi.org/10.1016/j.atmosres.2012.08.007.