1. Introduction
Clouds have a substantial influence over Earth’s climate and weather. They impact the hydrological cycle, vertical moisture and heat transport, and crucially, the planetary radiative budget. Difficulties representing clouds and their climate feedbacks (Hartmann et al. 2001; Bony and Dufresne 2005; Andrews et al. 2012; Tan et al. 2019; Zelinka et al. 2020) as well as their interactions with atmospheric aerosols (Fan et al. 2016; Bellouin et al. 2020) have long been some of the largest sources of uncertainty in climate projections (Boucher et al. 2014; Forster et al. 2021). Marine low clouds are of particular importance because of their diversity and substantial radiative impact (Wood 2012; Hartmann 2015; Forster et al. 2021). Over the last several decades, significant effort in the form of field campaigns, laboratory experiments, modeling studies, and satellite observational research has been devoted to improving understanding of the complex processes that control the evolution of marine clouds. While our knowledge and modeling capabilities have improved (Forster et al. 2021), major shortcomings remain that contribute to substantial climate prediction uncertainties.
Satellite observations are an indispensable resource for studying marine clouds. We now have decades-long passive satellite imagery datasets from weather satellites (Rossow and Schiffer 1999) and even several single-instrument records that span decades (Hubanks et al. 2019). Most cloud retrievals developed for these imagers estimate cloud physical properties such as optical thickness, brightness temperature, and effective radius, on a per-pixel (or block of pixels) basis (Schmit et al. 2012; Platnick et al. 2017). These measures convey useful information about clouds’ radiative and microphysical properties and can be emulated by climate models for direct comparison to observations (Bodas-Salcedo et al. 2011). However, marine low clouds exhibit a diversity of mesoscale structures and organizational patterns (Wood 2012; Houze 2014) that conventional satellite cloud retrievals fail to sufficiently characterize (Yuan et al. 2020) (Figs. 1 and 2 from this manuscript show some good examples of this diversity). Many imagers have adequate spatial resolution (down to subkilometer scales) to visually assess mesoscale cloud morphology, but doing so for a large volume of imagery requires a prohibitive number of person-hours from subject matter experts, and automating this process is challenging. Several studies have tackled this problem with machine learning (ML). Wood and Hartmann (2006) used a conventional neural network, while Rasp et al. (2020), Yuan et al. (2020), and Segal-Rozenhaimer et al. (2023) used deep convolutional neural networks (Goodfellow et al. 2016). While their precise methodology differs, each of these studies used supervised learning approaches that follow a similar ML pipeline: 1) define several distinct cloud categories of interest, 2) have experts label a collection of images, 3) train a ML algorithm to sort the labeled images, and 4) deploy the algorithm to categorize previously unseen images (Bishop 2006). These projects have led to studies of cloud processes that leverage the granular cloud morphologies enabled by the ML and inspired further ML research in this area (Muhlbauer et al. 2014; McCoy et al. 2017; Mohrmann et al. 2021; Schulz et al. 2021; Vogel et al. 2021; Eastman et al. 2022; Beucler et al. 2022; Ver Hoef et al. 2023; McCoy et al. 2023).
A query image from each category of the SGFF dataset (shown in the far-left column) and the five closest images from the unlabeled MODIS imagery dataset, as judged by the cosine similarity of their vector embeddings.
A query image from each category of the Yuan et al. (2020) dataset (shown in the far-left column) and the five closest images from the unlabeled MODIS imagery dataset, as judged by the cosine similarity of their vector embeddings.
Steps 1 and 2 in the supervised learning pipeline above are limiting factors for this problem. Step 1 is limiting because marine cloud morphology is more diverse than the finite sets of categories used by past supervised ML studies, and predefined cloud types constrain research to only those types. Step 2 is limiting because generating labeled data requires excessive time and effort. Self-supervised and unsupervised methods are a class of ML algorithms that avoid the use of human annotations for training, and a variety of techniques for learning image representations without labeled data have been proposed over the years (e.g., Vincent et al. 2010; Chen et al. 2016; T. Chen et al. 2020). Many of these algorithms learn to project images into a low-dimensional vector embedding space rather than a finite set of categories and thus can address the above limitations of supervised schemes. A drawback is that they have historically been less powerful than supervised methods when applied to downstream image classification tasks. Several past studies have applied self- and unsupervised methods to satellite cloud imagery: Kurihana et al. (2022a,b) used a convolutional autoencoder, Yuan (2019) used an information-maximizing generative adversarial network, and Denby (2020) used a contrastive method that involves comparing vector embeddings of image triplets (Jean et al. 2019). Recently, new self-supervised deep learning schemes have been developed that have significant advantages over previous approaches: the quality of their learned internal features and vector embedding spaces is improved, and their skill is competitive with supervised methods for classification tasks.
Deep learning and deep convolutional neural networks (CNNs) have recently revolutionized the fields of computer vision and machine learning (Goodfellow et al. 2016) and are used in many of the mesoscale cloud studies discussed above. In particular, deep CNNs have led to a significant leap forward in the skill of ML-based image recognition (Krizhevsky et al. 2012). It has also been shown that deep CNNs trained on very large labeled image datasets are able to learn complex and meaningful representations of the semantic content of the images (Erhan et al. 2009; Mahendran and Vedaldi 2015; Nguyen et al. 2019). The size of the datasets required to train these algorithms, and the resulting large number of person-hours required to produce accurate training labels, is a major limitation, however (Deng et al. 2009). Though the underlying idea was introduced decades ago (Bromley et al. 1993), it has recently been shown that self-supervised methods based on joint embeddings can be used to learn deep feature representations of image data that are of comparable, if not higher, quality than those learned by supervised CNNs (Misra and van der Maaten 2019; Caron et al. 2021). Joint-embedding methods involve processing image pairs with CNNs that reduce them to a vector representation and training the CNNs by applying a loss function in the vector embedding space. A wide range of approaches to this concept exist (He et al. 2019; Xu et al. 2019; Grill et al. 2020; Bardes et al. 2021; Assran et al. 2023). Perhaps the most prevalent among these involves generating augmented (randomly altered) image pairs and processing them with a “Siamese network” (a pair of identical CNNs with shared weights) (T. Chen et al. 2020; Chen and He 2020; Zbontar et al. 2021). While a wide variety of loss functions and CNN implementations have been developed to solve this problem, they all share a common theme: vector representations of image datasets are learned by comparing pairs or groups of images that are known to have either similar or dissimilar content, usually because they are drawn from either the same or different source images, which avoids the use of hand labels. At the time of writing, modern self-supervised methods have seen very little use in the atmospheric sciences. Lv et al. (2022) applied contrastive learning to ground-based total-sky imagers, Wang et al. (2022) applied it to reanalysis data, and several authors have demonstrated its potential for satellite land surface remote sensing (e.g., J. Chen et al. 2020; Stojnic and Risojevic 2021). The atmospheric and Earth sciences are fields with massive unlabeled datasets from satellite imagers, models, and other instruments such as radars and lidars. These new self-supervised schemes could be extremely impactful for analyzing these types of data and extracting information describing complex nonlinear phenomena. In this study, we train and evaluate a modern self-supervised deep learning scheme (Zbontar et al. 2021) for analysis of satellite cloud imagery. We show that its learned vector embedding space can be used to achieve classification accuracy competitive with supervised methods. We also interrogate its learned features, demonstrate its potential for searching cloud imagery with few training samples, and show that it is both robust when applied to different instruments and resilient to changing scene illumination.
2. Data
This study uses two training datasets and two evaluation datasets. The two training datasets are large collections of unlabeled satellite cloud imagery from two different instruments, the Terra Moderate Resolution Imaging Spectroradiometer (MODIS) and the GOES-17 Advanced Baseline Imager (ABI). The CNN trains on these image libraries in a self-supervised manner to learn internal representations of common cloud structures. The two evaluation datasets are much smaller collections of hand-labeled satellite images, collected by both Terra MODIS and Aqua MODIS, containing several specific cloud morphologies produced by past studies (Yuan et al. 2020; Rasp et al. 2020). Here, we have leveraged their past work to perform objective evaluation of the quality of our self-supervised CNN’s image representations by applying the networks trained on the large unlabeled imagery datasets to image classification tasks on the smaller labeled imagery datasets.
a. Unlabeled MODIS imagery
MODIS is a multispectral imager that orbits aboard NASA’s EOS Terra satellite (Salomonson et al. 1989, 2006). Since 1999, it has collected data in a polar, sun-synchronous orbit with an equatorial crossing time of 1030 local time (LT) and has now produced a multidecade, single-instrument dataset of high-quality images of Earth. Here, we have constructed a dataset of Terra MODIS images for training the self-supervised model. Specifically, we use the 500-m Level-1b radiance product (file name: MOD02HKM; MODIS Characterization Support Team 2017) and convert the radiances to reflectances. Reflectances are used because neural networks generally perform better with bounded and/or standardized data; reflectance is naturally bounded to the range [0, 1] and is a physically meaningful value. The dataset includes a full year (2020) of MODIS images taken during daytime over the world’s oceans. We use data from band 4 (550 nm), which is ideal for daytime visible-wavelength observations of clouds. It is also used for aerosol and land surface retrievals and is the green band for generating true-color [red–green–blue (RGB)] MODIS imagery (Salomonson et al. 2006). MODIS uses an internal mirror to scan perpendicular to its orbit and has a resulting swath width of about 2330 km. Some distortion caused by the along-orbit motion of the satellite and the changing pixel footprint while scanning (the “bowtie” effect) can occur near the edges of the swath (Gómez-Landesa et al. 2004). Here, we opt to crop the MODIS images to the center 1024 pixels (of 2708 for the 500-m product), where the distortion is negligible, which limits the images in the dataset to a maximum viewing zenith angle of about 21°. The dataset is further limited to granules that have no more than 10% of their pixels flagged as land or sub-50-m-deep ocean and to granules with an average solar zenith angle less than 60° to ensure sufficient scene illumination. Finally, we break up the cropped granules along the swath into 1024 × 1024 pixel images that are a more convenient size for our ML algorithm to work with. These training images are further subsampled during the training process to generate the 256 × 256 pixel image pairs used by the CNN, as described in section 3c, but the training dataset is stored on disk as a collection of 64 340 unique 1024 × 1024 pixel images.
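The granule screening and chipping described above amount to a short preprocessing routine. The following is a minimal sketch of that logic, assuming the 500-m reflectance, a land/shallow-ocean mask, and the solar zenith angle have already been read from a MOD02HKM granule into NumPy arrays; the function and variable names are illustrative rather than taken from the project code.

```python
# Illustrative sketch (not the authors' code) of the granule screening and
# chipping described above.
import numpy as np

def granule_to_chips(reflectance, land_or_shallow_mask, solar_zenith, chip=1024):
    """Crop to the swath center, screen the granule, and cut 1024x1024 chips."""
    n_lines, n_pixels = reflectance.shape          # ~(>2000, 2708) for the 500-m product
    c0 = (n_pixels - chip) // 2                    # center crop across the swath
    refl = reflectance[:, c0:c0 + chip]
    mask = land_or_shallow_mask[:, c0:c0 + chip]

    # Granule-level screening: <=10% land / shallow ocean, mean SZA < 60 deg
    if mask.mean() > 0.10 or solar_zenith.mean() >= 60.0:
        return []

    # Break the cropped granule into nonoverlapping chips along the orbit
    chips = []
    for i in range(0, n_lines - chip + 1, chip):
        chips.append(refl[i:i + chip, :])
    return chips
```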
b. Unlabeled ABI imagery
The Advanced Baseline Imager (ABI) (Schmit et al. 2017) is a multispectral, 16-band imager aboard the GOES-17 satellite, which was launched in 2018 and occupies a geostationary orbit above 137.2°W at the equator. Here, we use reflectance images from the ABI’s band 2, which has a central wavelength of 640 nm and an approximate pixel resolution of 500 m at nadir and is well suited for observing daytime clouds. The ABI operates in several scan modes that are changed based on imaging needs but currently (since 2019) collects a “full disk” scan approximately every 10 min (Schmit et al. 2017). Here, to sample a diversity of cloud scenes, we have constructed a cloud image dataset by acquiring one full-disk scan for each day of 2021, selecting the scan taken closest to solar noon at 137.2°W, which ensures that the majority of the scene is illuminated. The year 2021 was chosen to avoid overlap with the MODIS dataset. As with the MODIS dataset, the full-disk scans were broken into nonoverlapping 1024 × 1024 pixel images, and we require that each image have an average solar zenith angle less than 60° and less than 10% land pixels. These constraints result in 91 077 total images.
c. Labeled imagery from Yuan et al. (2020)
Yuan et al. (2020) constructed an expert-labeled cloud imagery dataset using 1-km resolution Aqua MODIS band 4 (550 nm) observations broken into 128 × 128 pixel image chips. Their study focuses on marine low-cloud morphology, so they filtered the image dataset to exclude scenes with more than 10% high cloud, less than 5% low cloud, or more than 10% land coverage. They asked a group of experts to separate the cloud imagery into six categories to create a dataset of hundreds of labeled cloud images (the numbers in parentheses below indicate the number of labeled samples from each category used in this study):
- Stratus (455): A mostly uniform layer of low cloud.
- Closed cellular (428): A cellular structure with cloudy centers that frequently organizes into a honeycomb pattern.
- Disorganized cellular (427): A combination of strong convective elements surrounded by stratiform cloud.
- Open cellular (468): Spatial cells with a clear center and vigorous convection at the edges.
- Clustered cumulus (368): Aggregate convective cloud elements with relatively strong convection.
- Suppressed cumulus (363): Multiple small disperse cumulus clouds.
They then used a pretrained CNN [Visual Geometry Group, model 16 (VGG-16); Simonyan and Zisserman 2014] to demonstrate automated classification of cloud morphology, achieving a 93% categorical accuracy on a held-out testing set. They demonstrated the use of the trained CNN to study the frequency of occurrence of the cloud regimes for winter 2011 in the North Atlantic and to generate a 2003–18 cloud climatology for the southeast Pacific.
The authors of Yuan et al. (2020) have provided a dataset of 2509 hand-generated annotations of Aqua MODIS images from 2010 and 2020 for use in this study (these are not identical to the ones used in their study). Their original samples are 128 × 128 pixel images of 1-km 550-nm reflectance, and we have retrieved 256 × 256 pixel versions at 500-m resolution for consistency with our unlabeled MODIS training data.
d. Labeled imagery from Rasp et al. (2020)
Stevens et al. (2020) performed an in-depth study of cloud regimes that occur in the North Atlantic trade wind region during the winter (December–February). They used visual inspection of MODIS L1B corrected reflectance imagery to identify four distinct and common mesoscale cloud structures:
- Sugar: Disorganized fine-scale clouds. Primarily composed of cumulus humilis.
- Gravel: Larger boundary layer cumulus features, often precipitating with deeper development than sugar.
- Fish: Organized regions of boundary layer cumulus with a skeletal structure separated from other clouds by large clear regions.
- Flowers: Large organized stratiform cloud regions with a precipitating cumulus core. Typically, multiple distinct neighboring clouds occur at the same time.
Their choice of these four categories was based on their frequency of occurrence and the ability of independent labelers to consistently identify them. Ultimately, 12 experts hand-labeled 10 winter seasons’ worth of data (900 images) using images that covered 10° × 10° regions. Stevens et al. (2020) used the dataset to analyze differences between labelers and the rate of class confusion and provided some additional exposition on the 3D cloud structure of each category using radar observations. Rasp et al. (2020) later extended this work by using crowd-sourced labeling to significantly increase the size of the labeled cloud imagery dataset, from which they trained and evaluated a CNN to perform the same classification task. They used Terra and Aqua MODIS cloud imagery from two regions in the subtropical Pacific and one in the subtropical Atlantic observed between 2007 and 2017. Their labels are human-drawn rectangular bounding boxes that contain primarily a single cloud type from one of the categories above. They estimate that approximately 250 person-hours were required to generate 49 000 unique labels.
The neural networks trained here, and the majority of self-supervised image processing schemes, are designed to operate on image chips. Here “image chips” refers to 256 × 256 pixel samples of a satellite image. Because the dataset produced by Rasp et al. (2020) is structured as labeled bounding boxes applied to large MODIS images, using it for evaluation of our models required reprocessing the original labeled images to produce a new dataset of labeled image chips. On inspection of the bounding boxes, it is clear that labelers often draw large bounding boxes that may contain a cloud type of interest but are also likely to contain large regions of clear sky or disorganized cloud that does not clearly match one of the class definitions. Furthermore, multiple labelers were asked to label the same images and they often produced overlapping bounding boxes with different labels or with similar labels but significant disagreement on the spatial extent of the bounding box (the inconsistency of human labelers is another drawback of supervised methods when class boundaries are not well defined). To generate an evaluation dataset with consistent samples, we extracted all regions where at least three labelers agreed on the cloud category from the 9904 images used by Rasp et al. (2020). These were then split into 256 × 256 pixel chunks with no overlap. Because the classes are imbalanced, with significantly more samples of “gravel” than “flowers” for instance, we retained only 2200 samples from each category (corresponding to the class with the fewest samples) resulting in an evaluation dataset of 8800 image chips. Because of the number of steps required to generate this dataset, we have made it publicly available (see the data availability statement below). Throughout the remainder of this article, we use the acronym of sugar, gravel, flower, fish (SGFF) to refer to this dataset.
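To make the reprocessing above concrete, the following sketch illustrates one way the agreement filtering and chipping could be implemented. It is not the exact pipeline used to build the published dataset; the bounding-box format, array shapes, and function name are assumptions made for illustration.

```python
# Simplified sketch of deriving labeled chips from crowd-sourced bounding
# boxes: accumulate per-class "votes" on a pixel grid, keep regions where at
# least three labelers agree, and cut nonoverlapping 256x256 chips from them.
import numpy as np

def chips_from_boxes(image, boxes, n_classes=4, min_votes=3, chip=256):
    """boxes: list of (class_id, row0, row1, col0, col1) from individual labelers."""
    votes = np.zeros((n_classes,) + image.shape, dtype=np.int32)
    for cls, r0, r1, c0, c1 in boxes:
        votes[cls, r0:r1, c0:c1] += 1

    chips, labels = [], []
    for cls in range(n_classes):
        agree = votes[cls] >= min_votes           # pixels where >=3 labelers agree
        for i in range(0, image.shape[0] - chip + 1, chip):
            for j in range(0, image.shape[1] - chip + 1, chip):
                if agree[i:i + chip, j:j + chip].all():
                    chips.append(image[i:i + chip, j:j + chip])
                    labels.append(cls)
    return chips, labels
```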
3. Model
This work uses an implementation of the “Barlow Twins” model (Zbontar et al. 2021), one of several recently developed self-supervised learning techniques discussed in section 1. Barlow Twins attempts to learn deep feature representations from unlabeled image data with a loss function meant to reduce redundancy in a learned image vector embedding space. The neural network components of the model consist of a CNN “encoder” that maps two-dimensional image data to a vector embedding space and a fully connected “projector” neural network that maps from the embedding space to a higher-dimensional vector embedding on which the loss is computed. The networks are trained as Siamese networks: during each training step, two corresponding batches of training samples are generated and processed in tandem by identical copies of the neural networks with shared weights. Each sample in a batch is drawn from the same randomly selected MODIS image as the corresponding sample in the other batch, but it may come from a different location in the source image, and it is randomly altered using a variety of image transformations (“image augmentation”). Corresponding images between the two training batches are therefore likely to contain similar cloud features without looking identical to each other, while noncorresponding samples are unlikely to contain similar cloud features. Each training batch is passed through the encoder and projector, and a cross-correlation matrix is computed between the two sets of resulting vector projections along the batch dimension. A loss function [Eq. (1)] is applied that pushes this output cross-correlation matrix toward the identity matrix. This approach ensures that the encoding is invariant under the types of augmentations applied to the inputs and that the components of the embedding space do not encode redundant (correlated) information. Details about our neural network implementation and training procedure are provided below.
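For readers unfamiliar with Barlow Twins, a minimal sketch of the redundancy-reduction loss in Eq. (1) is shown below, following Zbontar et al. (2021): each embedding dimension is standardized over the batch, the cross-correlation matrix between the two views is computed, and its diagonal is pushed toward 1 while its off-diagonal terms are pushed toward 0. The trade-off weight shown here is illustrative rather than the value used in our experiments.

```python
# Minimal sketch of the Barlow Twins redundancy-reduction loss (Zbontar et al. 2021).
import tensorflow as tf

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """z_a, z_b: (batch, dim) projector outputs for the two augmented views."""
    n = tf.cast(tf.shape(z_a)[0], tf.float32)

    # Standardize each embedding dimension over the batch
    z_a = (z_a - tf.reduce_mean(z_a, axis=0)) / (tf.math.reduce_std(z_a, axis=0) + 1e-8)
    z_b = (z_b - tf.reduce_mean(z_b, axis=0)) / (tf.math.reduce_std(z_b, axis=0) + 1e-8)

    # Cross-correlation matrix between the two views, computed over the batch
    c = tf.matmul(z_a, z_b, transpose_a=True) / n

    # Push the diagonal toward 1 (invariance to augmentation) and the
    # off-diagonal toward 0 (redundancy reduction)
    on_diag = tf.reduce_sum(tf.square(tf.linalg.diag_part(c) - 1.0))
    off_diag = tf.reduce_sum(tf.square(c)) - tf.reduce_sum(tf.square(tf.linalg.diag_part(c)))
    return on_diag + lambda_offdiag * off_diag
```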
a. Encoder
We use a ResNet-style (He et al. 2016) CNN as the encoder. This is a common and powerful architecture for image processing tasks and is the architecture used by Zbontar et al. (2021). ResNet is notable for including a large number of skip connections that allow the network to easily learn representations across multiple resolutions and combat the vanishing gradient problem (Goodfellow et al. 2016). Here, we have reimplemented the ResNet instead of replicating one of the architectures proposed by He et al. (2016), but we use the same key building blocks. There are two main differences in our implementation: we have opted to use 128-km (256 × 256 pixel) image chips for consistency with past mesoscale cloud morphology studies (Wood and Hartmann 2006; Yuan et al. 2020), instead of the 224 × 224 pixel inputs that are typically used for training on ImageNet (Deng et al. 2009), and have modified the architecture to accommodate this. We also use a smaller latent space and output a 1024-dimensional vector, whereas ResNet-50 outputs 2048 features from its CNN component (before applying a classifier head). The exact implementation details can be found in the project’s code repository, and a diagram of both the encoder and projector is provided in Fig. A1 in the appendix. Overall, the encoder CNN has 18.6 million trainable parameters.
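As a point of reference, the sketch below shows a generic ResNet-style residual block in Keras. It is meant only to illustrate the building blocks referred to above; the actual filter counts, block ordering, and downsampling arrangement of our encoder follow Fig. A1 and the project code repository.

```python
# Generic ResNet-style residual block (illustrative; not the exact encoder block).
from tensorflow.keras import layers

def residual_block(x, filters, downsample=False):
    stride = 2 if downsample else 1
    shortcut = x
    if downsample or x.shape[-1] != filters:
        # Projection shortcut when the spatial size or channel count changes
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)

    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)

    y = layers.Add()([y, shortcut])               # skip connection
    return layers.Activation("relu")(y)
```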
b. Projector
The projector network is a three-layer fully connected neural network that maps the vector outputs of the ResNet encoder to a higher-dimensional space where the loss function is computed. Because we use a smaller image embedding space than Zbontar et al. (2021), we also opted for a smaller projector network than theirs, with progressively increasing layer sizes of 2048, 4096, and 8192 (also shown in Fig. A1). The lower number of trainable parameters reduces the computational requirements of training; meanwhile, the output from the projector network remains high dimensional, which has been observed to improve model skill. We hypothesize that a slightly lower-complexity model will be adequate to generate useful representations (vector embeddings) of the image data because cloud imagery likely contains less diverse content than the assorted photographs used in ImageNet (Deng et al. 2009), though additional experimentation would be required to verify this. Nonetheless, this projector configuration has 44 million trainable parameters and is substantially larger than the encoder, which is a common feature of Barlow Twins and related self-supervised schemes (Bardes et al. 2021).
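A sketch of a projector head with these layer widths is shown below. The exact placement of normalization and activations in our projector follows the project repository, so the arrangement here (batch normalization and ReLU after all but the final layer, as in Zbontar et al. 2021) should be read as an assumption.

```python
# Illustrative projector head with layer widths 2048, 4096, and 8192.
from tensorflow.keras import layers, models

def build_projector(embedding_dim=1024, widths=(2048, 4096, 8192)):
    inp = layers.Input(shape=(embedding_dim,))
    x = inp
    for i, width in enumerate(widths):
        x = layers.Dense(width, use_bias=False)(x)
        if i < len(widths) - 1:                   # no BN/ReLU after the final layer
            x = layers.BatchNormalization()(x)
            x = layers.Activation("relu")(x)
    return models.Model(inp, x)
```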
c. Image augmentations
Reliance on image augmentations is a critical aspect of most modern self-supervised schemes (Gidaris et al. 2018; T. Chen et al. 2020; Bardes et al. 2021; Grill et al. 2020; Chen and He 2020). Performing image comparisons is an effective and intuitive way to learn deep feature representations of unlabeled image data, but augmentations must be used to avoid learning trivial solutions (comparing bulk image statistics like image color, brightness, or contrast, for instance) and to drive the models to distill semantically meaningful image representations. This work uses a procedure for generating image pairs during training that accounts for the unique properties of satellite cloud imagery. First, the unlabeled imagery described in sections 2a and 2b has been sampled to nonoverlapping 1024 × 1024 pixel scenes. The semantic content in these scenes is often spatially autocorrelated, meaning that neighboring regions in the 1024 × 1024 pixel training images tend to contain the same cloud types. We leverage this when generating training samples by randomly selecting paired image chips from two locations in the same 1024 × 1024 pixel region. Second, single cloud types can vary drastically in scale; Stevens et al. (2020) note that the potential spatial extent of individual cloud elements in their flowers category can span scales from 20 to 200 km, for instance. When sampling image pairs, we randomly choose image chips between 128 × 128 pixels and 512 × 512 pixels in size and rescale them to 256 × 256 pixels using bicubic interpolation. Third, image flips and rotations do not change the cloud type observed, so samples are randomly flipped horizontally and vertically and randomly rotated by integer multiples of 90°. Last, scene illumination can change drastically depending on viewing and solar zenith angles, and the same cloud morphology can occur under a variety of different lighting conditions. We account for this by randomly altering the image brightness and contrast by factors of 0.5–1.5 during training. This specific procedure was chosen based on our knowledge of the dataset, and these image transforms can be applied safely without changing the cloud type present in an image. While we did not perform extensive experiments to determine the relative value of individual image augmentations like T. Chen et al. (2020), we believe that these are reasonable choices based on domain knowledge and show in section 4 that they were effective for training the CNN.
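The pairing and augmentation procedure can be summarized by a short sampling routine; the sketch below is an illustrative version, assuming `scene` is one of the 1024 × 1024 pixel reflectance arrays described above. The specific random-number handling, interpolation call, and contrast implementation are ours rather than the project code’s.

```python
# Illustrative augmented-view sampler for one 1024x1024 reflectance scene.
import numpy as np
import tensorflow as tf

def augmented_view(scene, out_size=256):
    # Randomly sized crop (128-512 px) from a random location in the scene
    size = np.random.randint(128, 513)
    i = np.random.randint(0, scene.shape[0] - size + 1)
    j = np.random.randint(0, scene.shape[1] - size + 1)
    chip = scene[i:i + size, j:j + size, None]

    # Rescale to 256x256 with bicubic interpolation
    chip = tf.image.resize(chip, (out_size, out_size), method="bicubic")

    # Random flips and 90-degree rotations (cloud type is unchanged)
    if np.random.rand() < 0.5:
        chip = tf.image.flip_left_right(chip)
    if np.random.rand() < 0.5:
        chip = tf.image.flip_up_down(chip)
    chip = tf.image.rot90(chip, k=np.random.randint(4))

    # Random brightness and contrast scaling by factors of 0.5-1.5
    brightness = np.random.uniform(0.5, 1.5)
    contrast = np.random.uniform(0.5, 1.5)
    mean = tf.reduce_mean(chip)
    chip = (chip - mean) * contrast + mean
    return chip * brightness

# Two views drawn from the same scene form one training pair:
# view_a, view_b = augmented_view(scene), augmented_view(scene)
```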
d. Training procedure
4. Results
In this section, we evaluate the quality of the representations learned by the self-supervised model by assessing their utility for classifying the labeled cloud imagery datasets introduced in sections 2c and 2d.
a. Evaluation procedure
Quantitative evaluation of the quality of a self-supervised image vector embedding space is not particularly straightforward because there is no ground truth for comparison. We follow an approach that has been frequently deployed in the ML literature: training a simple linear classifier model to ingest vector embeddings created by the self-supervised encoder CNN and produce image labels for the labeled image datasets introduced in sections 2c and 2d (van den Oord et al. 2018; Misra and van der Maaten 2019; T. Chen et al. 2020; Chen and He 2020). For large image datasets, a simple linear classifier does not have the capacity to directly classify image pixel data. Therefore, if a self-supervised CNN can learn representations of an unlabeled image dataset that a linear classifier can leverage to achieve high classification accuracy on a labeled dataset, then the CNN must have learned useful feature representations of the images.
The linear model is a single-layer neural network with a softmax output (also commonly known as a linear classifier or logistic regression). It outputs a probability for each cloud category in the corresponding labeled imagery dataset and has one input for each value in the image embedding space, meaning it has 1024 × C trainable weights and C biases, where C is either 4 or 6 depending on the dataset used. We train it by concatenating it to the trained encoder CNNs after freezing their weights. This means that while the encoder CNNs are included as part of these models, only the linear classifiers’ weights can be updated when training on the labeled imagery. Because of the relatively small size of the labeled datasets, we apply a 20% dropout to the linear classifier’s input during training to prevent overfitting and use random horizontal and vertical image flips to augment the labeled cloud image datasets at this stage. We use a 70%/10%/20% train/validation/test split of the labeled imagery datasets. The classifiers are trained while monitoring the validation loss with an early-stopping callback using a patience of 10 epochs. This is done once with an initial learning rate of 10⁻³ and then again after lowering the learning rate by a factor of 10. We use the Adam optimizer (Kingma and Ba 2014) with the Keras (Chollet et al. 2015) default settings of β₁ = 0.9 and β₂ = 0.999 and a batch size of 32. The test-set accuracy of these linear classifiers on each category in the labeled imagery datasets is shown in Tables 1 and 2, the overall categorical accuracy is compared to that of other benchmark models (introduced in the following sections) in Table 3, and the results are discussed in more detail in section 4d.
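A minimal sketch of this linear-probe setup in Keras is shown below. The array names are placeholders, and only the first learning-rate stage of the two-stage schedule described above is shown.

```python
# Illustrative linear probe: frozen self-supervised encoder + single softmax layer.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_linear_probe(encoder, n_classes):
    encoder.trainable = False                       # freeze the pretrained CNN
    inp = layers.Input(shape=(256, 256, 1))
    z = encoder(inp, training=False)                # 1024-dimensional embedding
    z = layers.Dropout(0.2)(z)                      # 20% dropout on the embedding
    out = layers.Dense(n_classes, activation="softmax")(z)
    return models.Model(inp, out)

# probe = build_linear_probe(encoder, n_classes=6)
# probe.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# probe.fit(train_images, train_labels, batch_size=32,
#           validation_data=(val_images, val_labels),
#           callbacks=[callbacks.EarlyStopping(patience=10, restore_best_weights=True)])
```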
Confusion matrix for a simple linear classifier model trained to recognize the sugar, gravel, fish, and flowers cloud types using vector embeddings from the self-supervised model. Bold values are true positives.
Confusion matrix for a simple linear classifier model trained to recognize the cloud types from Yuan et al. (2020) using vector embeddings from the self-supervised model. Bold values are true positives.
Categorical accuracy for linear classifiers trained on top of several different convolutional neural networks for two different labeled cloud imagery datasets. Two self-supervised models are shown, one trained on ABI imagery and one trained on MODIS imagery. DenseNet-121 and ResNet-50 are CNNs that have been pretrained on the ImageNet dataset, and the linear classifier is applied to the features from their last convolutional layer (after applying global average pooling). ResNet-CNN is a ResNet-style CNN that is trained from scratch with a densely connected layer identical to the linear classifier as the output layer.
b. Benchmark models
To provide context for the accuracy of the self-supervised classifier approach (described above), we compare against several benchmark models. Two types of benchmarks are considered: training a full CNN directly on the labeled imagery datasets (“from scratch”) and training a linear classifier on the image embeddings from two different types of CNNs that were pretrained in a supervised manner on ImageNet (“pretrained”). The categorical accuracy of the self-supervised classifiers is compared to these benchmark models in Table 3 and discussed in greater detail in section 4d.
The “from scratch” benchmark can be described as follows: the traditional approach to using a CNN for a classification task is to train the CNN (with a linear classifier as its last layer) solely on a labeled dataset. For this benchmark, the same type of linear classifier head is appended to a CNN with randomly initialized trainable weights and the exact same architecture as the self-supervised encoder CNN (Fig. A1). The whole CNN and classifier head are trained directly on each of the small labeled imagery datasets using the same procedure and train/validation/test split described in the section above. Comparing the skill of classifiers in which the CNN is trained on the much larger unlabeled imagery dataset and only the linear classifier is trained on the labeled dataset against that of full CNNs trained entirely on the labeled dataset gives an estimate of both the value of self-supervised transfer learning and the relative quality of the deep representations learned in the self-supervised training process. These trained from scratch models generally perform much worse than the self-supervised models, though they could likely perform comparably given a substantially larger labeled training dataset.
The “pretrained” benchmark can be described as follows: a common approach when facing limited labeled training data is to use a CNN that has been pretrained on a very large labeled imagery dataset in a supervised manner and fine-tune the weights of the classification layers at the end of the model for the target task [this technique was used by Yuan et al. (2020)]. Here, we follow a setup similar to the one used for our self-supervised models: we train a linear classifier applied to a global average pooling of the output from the final convolutional layer of ResNet-50 (He et al. 2016), pretrained on ImageNet, to classify the labeled cloud imagery datasets. ResNet-50 has a 2048-value embedding space, so we also compare to DenseNet-121 (Huang et al. 2018), which has the same embedding dimensionality as our model. The pretrained weights for these two CNNs are distributed as part of Keras (Chollet et al. 2015). These two CNNs were not originally trained on 256 × 256 pixel imagery, so a resizing layer is used to convert the images to their native input resolution (224 × 224 pixels). This experiment helps characterize the relative value of training on cloud imagery versus a largely unrelated natural image dataset, assessing whether the self-supervised approach can learn representations of comparable quality to those learned by a supervised CNN trained on a large labeled dataset.
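A sketch of this pretrained benchmark is shown below for concreteness. The frozen ImageNet backbone, resizing layer, global average pooling, and linear classifier follow the description above; replicating the single reflectance band to three channels, the dropout rate, and the omission of ImageNet-specific input preprocessing are simplifications made here for illustration.

```python
# Illustrative "pretrained" benchmark: frozen ImageNet ResNet-50 + linear classifier.
from tensorflow.keras import layers, models, applications

def build_imagenet_benchmark(n_classes):
    backbone = applications.ResNet50(include_top=False, weights="imagenet")
    backbone.trainable = False                      # only the classifier is trained

    inp = layers.Input(shape=(256, 256, 1))
    x = layers.Concatenate()([inp, inp, inp])       # single band -> 3 channels
    x = layers.Resizing(224, 224)(x)                # ResNet-50's native input size
    x = backbone(x, training=False)
    x = layers.GlobalAveragePooling2D()(x)          # 2048-value feature vector
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```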
Last, an identical self-supervised model was also trained on a library of ABI imagery, which is discussed in greater detail below in section 4c. All of the benchmark models were trained using the same approach (early stopping, optimizer, learning rate, etc.) as the combined self-supervised and linear classifier model.
c. Cross-instrument application
The image augmentations used while training the self-supervised CNN prevent it from learning trivial ways to minimize the Barlow Twins loss function, but they also encourage resilience against these types of transformations in downstream applications. The image resizing, flips, rotations, and brightness adjustments used during training make the CNN’s embeddings robust against external sources of similar variability, such as small differences in resolution or spatial distortions from a different satellite imager, different spatial orientations of cloud patterns in different geographic regions and weather patterns, or different scene illumination. Most algorithms underlying satellite retrievals are bespoke, and direct cross-instrument application is not possible, so the prospect of an algorithm that can be trained using one instrument and applied to others (or potentially trained on a multi-instrument dataset) is particularly novel. In this experiment, we test whether the feature representations learned by a self-supervised CNN from ABI images can be leveraged by a linear classifier to accurately identify cloud types in images from MODIS; the two are distinct imagers in different orbits.
The most straightforward way to test this concept would have been to evaluate the self-supervised CNN that has already been trained on MODIS imagery on its ability to accurately label a hand-labeled ABI cloud imagery dataset, but the only hand-labeled images available for testing are from MODIS (sections 2c and 2d). Instead, a second copy of the self-supervised model was trained using the unlabeled ABI imagery dataset (section 2b) following the same training procedure described in section 3d. Table 3 also shows the same linear classifier evaluation procedure discussed in section 4a applied to image embeddings of the MODIS evaluation datasets produced by the CNN trained only on ABI imagery (in the columns labeled “self-supervised” and “ABI”).
d. Classification skill
Tables 1 and 2 show confusion matrices for the linear classifiers trained on the self-supervised CNN’s embeddings of the SGFF (Rasp et al. 2020) and the Yuan et al. (2020) datasets, respectively. For the SGFF dataset, skill is lowest for the fish category and highest for the sugar category; for the Yuan et al. (2020) dataset, skill is lowest for clustered cumulus and highest for open cellular. Both linear classifiers achieve skill much higher than random guessing for all cloud categories. From this we can infer that the vector embeddings generated by the self-supervised algorithm successfully encode information about mesoscale cloud structure.
Table 3 shows the test-split categorical accuracy for both labeled datasets across the various classifier model types. In the remainder of this paragraph, we have italicized the words that directly correspond to columns and subcolumns in the table. As a short summary, this table includes results for two trained from scratch CNNs, one trained on each of the evaluation datasets; two self-supervised CNNs, trained on each of the unlabeled imagery datasets, each with a separate linear classifier trained on each of the evaluation datasets; and two CNNs pretrained on ImageNet, each with a separate linear classifier trained for each evaluation dataset. For SGFF (Rasp et al. 2020), the larger of the two labeled imagery datasets, both self-supervised models, with CNNs trained on either the unlabeled MODIS or ABI imagery datasets, outperform the CNNs that were pretrained on ImageNet and the trained from scratch CNNs by several percent. The self-supervised model trained on MODIS imagery performs slightly better than the one trained on ABI imagery, which is expected because the labeled image datasets contain MODIS imagery. For the smaller Yuan et al. (2020) dataset, the ResNet-50 pretrained model performs the best with the MODIS- and ABI-trained self-supervised models performing second and third best, respectively. However, both self-supervised models outperform the DenseNet-121 pretrained model, which has the same size embedding space. The trained from scratch CNN underperforms the others by a wider margin for this dataset, likely because of the smaller training-split size and the need to use early stopping to prevent overfit. Our expectation is that the trained from scratch CNN would perform comparably to the other models if we had access to significantly more labeled training data but performs much worse in this data-limited setting.
The categorical accuracy of the self-supervised model trained with ABI images is only slightly lower than that of the model trained with MODIS images: by 0.6% for SGFF and 1.9% for the Yuan et al. (2020) dataset. This indicates that the self-supervised CNN embeddings are relatively robust against changes in imager properties. Importantly, a linear classifier leveraging embeddings from the self-supervised CNN that was trained on unlabeled ABI data performs better than a CNN trained directly on the labeled MODIS datasets, demonstrating this method’s potential for cross-instrument pretraining in data-limited scenarios. While here we have demonstrated a cross-instrument application between two contemporary instruments, this result suggests that applying the algorithm across multiple generations of instruments should also be possible. For instance, the ABI’s 640-nm band was carried over from the previous generation of GOES imagers, and the Visible Infrared Imaging Radiometer Suite (VIIRS), which will carry on some of MODIS’s functionality when MODIS deorbits (Platnick et al. 2021), includes 550- and 640-nm bands at 750- and 375-m resolution, respectively. An instrument-agnostic algorithm also suggests the possibility of multi-instrument studies of cloud morphology and climatology without the same degree of concern about introducing cross-instrument artifacts, a common frustration for past multi-instrument cloud studies (Rossow and Schiffer 1999).
Overall, these results demonstrate the ability of self-supervised CNNs to learn deep representations of cloud images that efficiently encode information about the mesoscale cloud morphology. For the purposes of future cloud studies that focus on cloud structure, and particularly those that might have limited labeled data to work with, pretrained models using self-supervision will be a more powerful tool than classifiers trained directly on the available labeled data or ImageNet pretrained CNNs. This approach, catalyzed by the abundance of historical cloud imagery, permits future cloud studies that were previously impossible or prohibitively labor intensive.
5. Additional experiments
The self-supervised CNNs have a range of valuable uses beyond the image classification task in the previous section. In this section, we perform some additional experiments that demonstrate several other unique properties and applications.
a. Comparisons in embedding space
b. Detection and search with limited labels
A valuable use case for the vector representations is performing image search and classification with only one, or perhaps a handful, of labeled images. This can be done via comparisons of images in the embedding space under the assumption that nearby embeddings will correspond to similar cloud morphologies. Figures 1 and 2 show a single query image from each cloud category in the two labeled imagery datasets followed by the five closest images drawn from the unlabeled MODIS dataset. While we did not enforce this constraint, no two images of the same cloud type in Figs. 1 and 2 were captured on the same day, with the exception of two of the flowers images in Fig. 1, and even those were drawn from different MODIS granules. By examination, the cosine similarity comparison has identified new samples that contain cloud morphologies similar to each of the query images and has identified unique cases from throughout the imagery dataset.
For a more quantitative evaluation of embedding-space comparisons, we can calculate the classification skill of a k-nearest-neighbor algorithm applied to the vector embeddings of each of the labeled imagery datasets (Bishop 2006). Here, we chose k = 40 by evaluating a k-nearest-neighbor classifier on the validation splits using various values of k. Then, for each image in the testing split, we assign the most common label among the 40 closest (by cosine distance) images in the training split. This approach achieves a categorical accuracy of 84.4% on the SGFF dataset and 73.1% on the Yuan et al. (2020) dataset. These accuracies are high considering the simplicity of the k-nearest-neighbor model and approach the skill of the linear classifiers trained in section 4. This result further verifies the quality of the embedding space for comparing different cloud morphologies.
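Both the one-shot search behind Figs. 1 and 2 and the k-nearest-neighbor evaluation reduce to simple operations on precomputed embeddings. The sketch below illustrates this, assuming embedding arrays for the training split, testing split, and unlabeled archive already exist; the function and variable names are placeholders.

```python
# Illustrative embedding-space classification and search.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import cosine_similarity

def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k=40):
    # Assign each test image the most common label among its k nearest
    # training images by cosine distance, as described in the text.
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_emb, train_labels)
    return knn.score(test_emb, test_labels)

def search_archive(query_emb, archive_emb, n_results=5):
    # One-shot search (Figs. 1 and 2): rank archive images by cosine
    # similarity to a single query embedding and return the top matches.
    sims = cosine_similarity(query_emb.reshape(1, -1), archive_emb)[0]
    return sims.argsort()[::-1][:n_results]
```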
c. Visualizing learned representations
CNNs’ internal representations are difficult to interpret. Direct examination of model weights, for instance, does not provide much information about what CNNs learn except in the early layers of a network, so numerous indirect interrogation techniques have been developed (Qin et al. 2018; Nguyen et al. 2019). Here, we apply “activation maximization” (Erhan et al. 2009; Simonyan and Zisserman 2014) to visualize the variety of features learned from cloud imagery.
Activation maximization leverages the automatic differentiation capabilities of ML software libraries (Chollet et al. 2015). The process involves selecting a single activation from the encoder network to interrogate, generating an initial image with random pixel intensities, and then iteratively updating the image’s pixel values to maximize the value output by the target activation. Our application of activation maximization seeks to identify what has been learned by the output neurons from the encoder network. In other words, what cloud morphologies are represented by each dimension of the embedding space? We used 2500 iterations of stochastic gradient descent to iteratively push white-noise images toward pixel values that excited each output neuron from the trained encoder and found that the process resulted in a wide variety of distinct spatial patterns. Figure 3a shows 12 example cases. While there is a great deal of variety in the activation maximization images, we note that some output neurons produce qualitatively similar patterns to others, perhaps implying that the Barlow Twins algorithm has not entirely eliminated redundancies in the embedding space or that a smaller embedding space would have been sufficient. Nonetheless, the visualizations show that the CNN has internal representations of a range of distinct cloud structures.
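A minimal sketch of this gradient-ascent procedure is shown below, assuming `encoder` is the trained Keras encoder mapping a 256 × 256 pixel image to its 1024-dimensional embedding; the step size and gradient normalization are illustrative choices rather than the exact settings used to make Fig. 3.

```python
# Illustrative activation maximization by gradient ascent on the input image.
import tensorflow as tf

def maximize_activation(encoder, neuron_index, n_steps=2500, step_size=1.0):
    # Start from a random-noise image and iteratively push its pixels toward
    # values that maximize the chosen output neuron of the encoder.
    image = tf.Variable(tf.random.uniform((1, 256, 256, 1)))
    for _ in range(n_steps):
        with tf.GradientTape() as tape:
            activation = encoder(image, training=False)[0, neuron_index]
        grad = tape.gradient(activation, image)
        grad /= tf.norm(grad) + 1e-8               # normalize for a stable step
        image.assign_add(step_size * grad)         # gradient *ascent*
    return image.numpy()[0, :, :, 0]
```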
(a) Activation maximization applied to a selection of neurons in the output layer of the encoder network. Each panel shows hallucinated spatial patterns that excite a specific neuron. (b) Closest images in the MODIS imagery dataset to each of the samples in (a), determined by cosine similarity in the image embedding space.
Activation maximization typically does not yield crisp or realistic images, instead producing images that repeat specific patterns learned by the CNN. As an example, performing activation maximization on an output neuron of a CNN trained to detect dogs might produce outputs that are a mishmash of dog eyes, ears, and noses, but not anything that closely resembles a dog. More generally, we expect the process to show patterns that give clear indications of what excites a neuron but perhaps look nothing like any samples in the training data (Simonyan and Zisserman 2014; Nguyen et al. 2019). In Fig. 3b we show a collection of images from the MODIS training dataset that have the closest vector embeddings [Eq. (2)] to the corresponding panels in Fig. 3a. There are clear similarities between the activation maximization patterns and the corresponding cloud images. The upper-left panel in Fig. 3a, for instance, has hallucinated a series of small bright objects overlaid with dimmer fuzzy structures, and this appears to correspond with cirrus overriding fair-weather cumulus, a common occurrence in the trades. This experiment, while not particularly quantitative, provides an important confirmation that the CNN detects meaningful cloud structures. In the face of limited training data, transfer learning (fine-tuning a CNN trained on a different dataset to operate on a new one) is a common and effective practice (Goodfellow et al. 2016). While often successful (Yosinski et al. 2014; Yuan et al. 2020), the concept of classifying cloud imagery based on its resemblance to ImageNet classes (e.g., vehicles, plants, animals) seems dubious, particularly in a scientific setting. This experiment demonstrates that the CNN has learned feature representations relevant to classifying cloud morphology. The CNNs trained in this work are deep neural networks and inherit the same interpretation difficulties as all deep CNNs. Nonetheless, the fact that their learned internal representations clearly correspond to different cloud types is likely desirable from an artificial intelligence trustworthiness and interpretability standpoint (Ebert-Uphoff and Hilburn 2020; McGovern et al. 2022) and indicates that self-supervised methods are a promising way to pretrain models for domain-specific transfer learning.
d. Sensitivity to scene brightness
As a consequence of the aggressive data augmentation used to train the model, we expect that the CNN’s vector embeddings are robust against changing image brightness, and in the context of cloud imagery, this is a useful property for several downstream tasks. Many conventional passive satellite cloud retrieval algorithms are extremely sensitive to changes in observed spectral radiance. For instruments that are in orbit for long periods of time, even mild sensor degradation can lead to spurious trends in cloud properties derived from that sensor. As an example, Lyapustin et al. (2014) describe how a small long-term calibration degradation in Terra MODIS’s visible and near-infrared bands affected dependent retrievals, including a −13% per decade false trend in marine cloud optical thickness, a key parameter used to assess cloud type and radiative effects. Furthermore, interinstrument differences in calibration and hardware can be a major hurdle for generating multi-instrument records. Finally, for satellite imagers, scene illumination can change drastically depending on solar zenith angle and viewing angle. Therefore, classification invariance under different scene illumination or observed spectral radiance is a useful feature.
Here, we evaluate the stability of the vector embeddings generated by the self-supervised CNN under extreme changes in image brightness. First, the cosine similarity is computed between the vector embedding of each image in each labeled MODIS imagery dataset and the nearest-neighbor vector embedding from the same dataset. Then, copies of each dataset with different brightness alterations ranging from −50% to 50% are generated, and the cosine similarity between each image embedding and the embedding of its brightness-altered copy is computed. The blue regions in Fig. 4 show the resulting cosine similarity as a function of image brightness changes, while the red regions show the nearest-neighbor cosine similarity for comparison. The two levels of shading encompass 90% and 50% of the samples in each labeled imagery dataset. Even for extreme changes in scene brightness, the image vector embeddings remain relatively unaltered such that images remain closer to their brightness-altered counterparts in the embedding space than the nearest unaltered image from the same MODIS imagery dataset. While interpretation in terms of cosine similarity may be less intuitive than classification accuracy for instance, it demonstrates that the entire range of potential downstream uses for the vector embeddings (classification, clustering, searching, etc.) should be resilient to these kinds of changes in image properties.
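The sensitivity test can be reproduced with a few lines of array arithmetic, as sketched below; the encoder, image arrays, and scale increments are assumed to be available in memory, and the names here are placeholders.

```python
# Illustrative brightness-sensitivity check for the embedding space (cf. Fig. 4).
import numpy as np

def embedding_brightness_sensitivity(encoder, images, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    # Compare each image's embedding with the embedding of a brightness-scaled
    # copy of the same image.
    base = encoder.predict(images)                              # (n, 1024) embeddings
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    similarity = {}
    for s in scales:
        scaled = encoder.predict(images * s)
        scaled /= np.linalg.norm(scaled, axis=1, keepdims=True)
        similarity[s] = np.sum(base * scaled, axis=1)           # per-image cosine similarity
    return similarity
```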
Sensitivity of image vector embeddings to changing scene brightness. Blue represents the cosine similarity (vertical axis) between image vector embeddings and the embeddings of the same images with scaled brightness (horizontal axis). Red represents the cosine similarity between images and their nearest neighbor. The two levels of shading contain 90% and 50% of the samples from the testing datasets and the thick lines represent the mean. Higher values indicate greater similarity. The plot indicates that even for substantial changes in scene brightness, image vector embeddings remain relatively stable, and their classification would likely remain unchanged for most classification schemes.
6. Conclusions and discussion
This work has demonstrated the effectiveness of recently developed deep self-supervised machine learning techniques, specifically the Barlow Twins method (Zbontar et al. 2021), for interpreting satellite images of marine clouds. When applied to classification tasks, the self-supervised CNN’s image representations can be used alongside even very simple classifiers with limited training data to achieve classification accuracy competitive with supervised deep learning. We demonstrated how the self-supervised schemes can be used to perform image searching based on single query images, for example, identifying more cases of a cloud type in an image library from only one sample case. The models also learn meaningful internal representations of cloud structure, are robust in cross-instrument applications, and are resilient to changing scene brightness.
There are several areas of future research that could improve the quality of self-supervised models applied to clouds. We did not perform thorough hyperparameter optimization for our model. This class of self-supervised learning is relatively new, and several past authors have performed large ablation studies to determine optimal batch sizes, training durations, neural architectures, image augmentations, etc. (T. Chen et al. 2020). Satellite imagery is significantly different from the natural image datasets these algorithms were developed for, so an in-depth hyperparameter optimization would likely be beneficial. Furthermore, our image augmentations relied heavily on the idea that cloud types are spatially autocorrelated, and a study to determine the optimal sampling distances and image sizes based on decorrelation length could likely improve the results. Last, the 1-yr libraries of MODIS and ABI data we used for training are fairly large datasets, but substantially more data are available. Training a self-supervised model using multiple data sources and longer-duration datasets may improve results.
In addition to the tests performed in this study, we envision several future use cases for this model. First, the ability to search an image database using a limited number of query images could be used to generate climatologies, improve data browsing, identify trends in very specific cloud types, and create new training datasets for classifiers with relatively little time investment. The embedding vectors generated by the CNN are much smaller, in terms of storage space, than the images themselves, so precomputing image embeddings for an entire imager mission is feasible and would be valuable for this purpose. Second, unsupervised schemes like clustering and linear decompositions can be applied in the embedding space, and some clustering schemes (k-means) can directly use the cosine similarity during optimization, so the image embeddings are potentially useful for exploratory data analysis. Third, there is potential to apply this approach to cloud fields generated by models for the purposes of model evaluation. While climate simulations and operational weather models do not have sufficient resolution to emulate diverse marine boundary layer cloud fields, large eddy simulations can (Zhou et al. 2018; Zhou and Bretherton 2019). Last, different mesoscale marine cloud types have different dynamical, radiative, and microphysical properties with unique interactions with atmospheric aerosol. This algorithm can be used as a research tool to enable highly targeted studies of these properties for specific mesoscale cloud regimes, or to examine the cloud regimes present under specific atmospheric conditions.
On the ML side, the high-quality internal feature representations learned by self-supervised CNNs have several potentially important applications in the atmospheric sciences. Transfer learning—using model weights pretrained on one dataset to create a classifier for another—is one key area that has already been discussed above. Several other applications for feature representations learned on ImageNet have been developed by the ML community though, for instance, image style transfer (Gatys et al. 2015)—altering one image to imitate the properties of another, Frechet inception distance (Heusel et al. 2017)—a metric used to judge the realism of images created by generative models, and perception loss (Johnson et al. 2016)—which has been used in image superresolution (Ledig et al. 2016) and image to image translation tasks (Isola et al. 2017). Several studies in the atmospheric sciences have already successfully used these approaches, but imposing features learned from the images of cars, animals, plants, etc. in ImageNet on satellite cloud imagery seems dangerously akin to cloud gazing, and self-supervision provides a way to learn robust deep feature representations that are more appropriate in a scientific setting.
In closing, recently developed self-supervised image processing schemes are very powerful compared to their predecessors and have potential to be impactful in Earth and atmospheric sciences, where large regularly gridded datasets are plentiful but expert annotations are in short supply. In the context of satellite cloud imagery, these schemes can learn image feature representations with a myriad of potential downstream use cases and enable future satellite studies of cloud morphology with an unprecedented level of granularity.
Acknowledgments.
This research was a component of the Integrated Cloud, Land-surface, and Aerosol System Study (ICLASS) supported by the U.S. Department of Energy Office of Science Biological and Environmental Research as part of the Atmospheric Systems Research (ASR) Program. Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RLO1830. Tianle Yuan and Hua Song were supported by NASA Grant 80NSSC20K0132.
Data availability statement.
MODIS 500-m L1B Calibrated Radiances are available at https://ladsweb.modaps.eosdis.nasa.gov/archive/Science%20Domain/Level-1/MODIS%20Level-1/MODIS%20Terra%20C6.1%20-%20Level%201B%20Calibrated%20Radiances%20-%20500m/ (last accessed: 9 March 2023); https://doi.org/10.5067/MODIS/MOD02HKM.006. GOES-17 ABI L1B Full-disk images are currently available at https://noaa-goes17.s3.amazonaws.com/index.html#ABI-L1b-RadF/ (last accessed: 9 March 2023); https://doi.org/10.7289/V5BV7DSR. Trained copies of the MODIS- and ABI-based self-supervised models along with the labeled image chips we produced based on the SGFF dataset are available in the project’s Zenodo repository: https://doi.org/10.5281/zenodo.7823778. The Yuan et al. (2020) labels have not yet been published and were provided courtesy of Tianle Yuan (tianle.yuan@nasa.gov) and Hua Song. The project code is currently available at https://github.com/avgeiss/self_supervised_clouds and has been permanently archived at Zenodo (https://doi.org/10.5281/zenodo.7864558).
APPENDIX
Encoder and Projector Diagram
Figure A1 shows a diagram of the ResNet-style encoder CNN and the projector network.
A diagram of the ResNet-style encoder CNN and the projector network. Note that the yellow shading indicates layers that are only included in the residual blocks of the ResNet that include a spatial downsampling operation, though the connections that pass through them, shown by black arrows, are always on.
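For readers who prefer code to diagrams, the sketch below outlines a generic ResNet-style residual block with an optional downsampling shortcut and a small projector MLP. It is a simplified stand-in: the input size, filter counts, layer ordering, and projector width are placeholders rather than the exact configuration shown in Fig. A1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, downsample=False):
    """Generic ResNet-style block; the shortcut passes through a strided 1x1
    convolution only when the block downsamples (analogous to the layers that
    appear only in the downsampling blocks of Fig. A1)."""
    stride = 2 if downsample else 1
    shortcut = x
    if downsample:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def projector(x, width=512, out_dim=128):
    """Small MLP projector applied to the pooled encoder features during
    self-supervised training."""
    x = layers.Dense(width, activation="relu")(x)
    return layers.Dense(out_dim)(x)

# Toy encoder assembled from the pieces above (sizes are illustrative only).
inputs = tf.keras.Input(shape=(128, 128, 1))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = residual_block(x, 32)
x = residual_block(x, 64, downsample=True)
x = layers.GlobalAveragePooling2D()(x)
embedding = x                      # used downstream for search and clustering
projection = projector(embedding)  # used only by the self-supervised loss
model = tf.keras.Model(inputs, [embedding, projection])
```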
REFERENCES
Andrews, T., J. M. Gregory, M. J. Webb, and K. E. Taylor, 2012: Forcing, feedbacks and climate sensitivity in CMIP5 coupled atmosphere-ocean climate models. Geophys. Res. Lett., 39, L09712, https://doi.org/10.1029/2012GL051607.
Assran, M., Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas, 2023: Self-supervised learning from images with a joint-embedding predictive architecture. arXiv, 2301.08243v3, https://doi.org/10.48550/arXiv.2301.08243.
Bardes, A., J. Ponce, and Y. LeCun, 2021: VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv, 2105.04906v3, https://doi.org/10.48550/arXiv.2105.04906.
Bellouin, N., and Coauthors, 2020: Bounding global aerosol radiative forcing of climate change. Rev. Geophys., 58, e2019RG000660, https://doi.org/10.1029/2019RG000660.
Beucler, T., I. Ebert-Uphoff, S. Rasp, M. Pritchard, and P. Gentine, 2022: Machine learning for clouds and climate. ESS Open Archive, 10506925, https://doi.org/10.1002/essoar.10506925.1.
Bodas-Salcedo, A., and Coauthors, 2011: COSP: Satellite simulation software for model assessment. Bull. Amer. Meteor. Soc., 92, 1023–1043, https://doi.org/10.1175/2011BAMS2856.1.
Bony, S., and J.-L. Dufresne, 2005: Marine boundary layer clouds at the heart of tropical cloud feedback uncertainties in climate models. Geophys. Res. Lett., 32, L20806, https://doi.org/10.1029/2005GL023851.
Boucher, O., and Coauthors, 2014: Clouds and aerosols. Climate Change 2013: The Physical Science Basis, Cambridge University Press, 571–658, https://doi.org/10.1017/CBO9781107415324.016.
Bromley, J., I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, 1993: Signature verification using a “Siamese” time delay neural network. Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, Eds., Vol. 6, Morgan-Kaufmann, 737–744.
Caron, M., H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, 2021: Emerging properties in self-supervised vision transformers. arXiv, 2104.14294v2, https://doi.org/10.48550/arXiv.2104.14294.
Chen, J., Z. Yuan, J. Peng, L. Chen, H. Huang, J. Zhu, Y. Liu, and H. Li, 2020: DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 14, 1194–1206, https://doi.org/10.1109/JSTARS.2020.3037893.
Chen, T., S. Kornblith, M. Norouzi, and G. E. Hinton, 2020: A simple framework for contrastive learning of visual representations. arXiv, 2002.05709v3, https://doi.org/10.48550/arXiv.2002.05709.
Chen, X., and K. He, 2020: Exploring simple Siamese representation learning. arXiv, 2011.10566v1, https://doi.org/10.48550/arXiv.2011.10566.
Chen, X., Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, 2016: InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv, 1606.03657v1, https://doi.org/10.48550/arXiv.1606.03657.
Chollet, F., and Coauthors, 2015: Keras. GitHub, https://github.com/fchollet/keras.
Denby, L., 2020: Discovering the importance of mesoscale cloud organization through unsupervised classification. Geophys. Res. Lett., 47, e2019GL085190, https://doi.org/10.1029/2019GL085190.
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, 2009: ImageNet: A large-scale hierarchical image database. 2009 IEEE Conf. on Computer Vision and Pattern Recognition, Miami, FL, IEEE, 248–255, https://doi.org/10.1109/CVPR.2009.5206848.
Eastman, R., I. L. McCoy, and R. Wood, 2022: Wind, rain, and the closed to open cell transition in subtropical marine stratocumulus. J. Geophys. Res. Atmos., 127, e2022JD036795, https://doi.org/10.1029/2022JD036795.
Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning, and interpretation of neural networks for working with images in meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149–E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.
Erhan, D., Y. Bengio, A. Courville, and P. Vincent, 2009: Visualizing higher-layer features of a deep network. Université de Montréal Département d’Informatique et Recherche Opérationnelle Tech. Rep. 1341, 13 pp.
Fan, J., Y. Wang, D. Rosenfeld, and X. Liu, 2016: Review of aerosol–cloud interactions: Mechanisms, significance, and challenges. J. Atmos. Sci., 73, 4221–4252, https://doi.org/10.1175/JAS-D-16-0037.1.
Forster, P., and Coauthors, 2021: The Earth’s energy budget, climate feedbacks, and climate sensitivity. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 923–1054, https://doi.org/10.1017/9781009157896.009.
Gatys, L. A., A. S. Ecker, and M. Bethge, 2015: A neural algorithm of artistic style. arXiv, 1508.06576v2, https://doi.org/10.48550/arXiv.1508.06576.
Gidaris, S., P. Singh, and N. Komodakis, 2018: Unsupervised representation learning by predicting image rotations. arXiv, 1803.07728v1, https://doi.org/10.48550/arXiv.1803.07728.
Gómez-Landesa, E., A. Rango, and M. Bleiweiss, 2004: An algorithm to address the MODIS bowtie effect. Can. J. Remote Sens., 30, 644–650, https://doi.org/10.5589/m04-028.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 800 pp.
Goyal, P., and Coauthors, 2017: Accurate, large Minibatch SGD: Training ImageNet in 1 hour. arXiv, 1706.02677v2, https://doi.org/10.48550/arXiv.1706.02677.
Grill, J., and Coauthors, 2020: Bootstrap your own latent: A new approach to self-supervised learning. arXiv, 2006.07733v3, https://doi.org/10.48550/arXiv.2006.07733.
Hartmann, D. L., 2015: Global Physical Climatology. International Geophysics, Elsevier Science, 498 pp., https://doi.org/10.1016/C2009-0-00030-0.
Hartmann, D. L., L. A. Moy, and Q. Fu, 2001: Tropical convection and the energy balance at the top of the atmosphere. J. Climate, 14, 4495–4511, https://doi.org/10.1175/1520-0442(2001)014<4495:TCATEB>2.0.CO;2.
He, K., X. Zhang, S. Ren, and J. Sun, 2016: Deep residual learning for image recognition. 2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, IEEE, 770–778, https://doi.org/10.1109/CVPR.2016.90.
He, K., H. Fan, Y. Wu, S. Xie, and R. Girshick, 2019: Momentum contrast for unsupervised visual representation learning. arXiv, 1911.05722v3, https://doi.org/10.48550/ARXIV.1911.05722.
Heusel, M., H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, 2017: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, I. Guyon et al., Eds., Vol. 30, Curran Associates, Inc., 6626–6637, https://proceedings.neurips.cc/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf.
Houze, R., 2014: Cloud Dynamics. International Geophysics, Vol. 104, Elsevier Science, 496 pp.
Huang, G., Z. Liu, L. van der Maaten, and K. Q. Weinberger, 2018: Densely connected convolutional networks. arXiv, 1608.06993v5, https://doi.org/10.48550/arXiv.1608.06993.
Hubanks, P., S. Platnick, M. King, and B. Ridgway, 2019: MODIS atmosphere L3 gridded product Algorithm Theoretical Basis Document (ATBD) and users guide. MODIS Algorithm Theoretical Basis Doc. ATBD-MOD-30, 96 pp., https://modis.gsfc.nasa.gov/data/atbd/atbd_mod30.pdf.
Isola, P., J.-Y. Zhu, T. Zhou, and A. A. Efros, 2017: Image-to-image translation with conditional adversarial networks. 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, IEEE, 5967–5976, https://doi.org/10.1109/CVPR.2017.632.
Jean, N., S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon, 2019: Tile2Vec: Unsupervised representation learning for spatially distributed data. Proc. Conf. AAAI Artif. Intell., 33, 3967–3974, https://doi.org/10.1609/aaai.v33i01.33013967.
Johnson, J., A. Alahi, and L. Fei-Fei, 2016: Perceptual losses for real-time style transfer and super-resolution. arXiv, 1603.08155v1, https://doi.org/10.48550/arXiv.1603.08155.
Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/ARXIV.1412.6980.
Krizhevsky, A., I. Sutskever, and G. E. Hinton, 2012: ImageNet classification with deep convolutional neural networks. NIPS’12: Proc. 25th Int. Conf. on Neural Information Processing Systems, Lake Tahoe, NV, Association for Computing Machinery, 1097–1105, https://dl.acm.org/doi/10.5555/2999134.2999257.
Kurihana, T., E. J. Moyer, and I. T. Foster, 2022a: AICCA: AI-driven cloud classification atlas. Remote Sens., 14, 5690, https://doi.org/10.3390/rs14225690.
Kurihana, T., and Coauthors, 2022b: Cloud classification with unsupervised deep learning. arXiv, 2209.15585v1, https://doi.org/10.48550/arXiv.2209.15585.
Ledig, C., and Coauthors, 2016: Photo-realistic single image super-resolution using a generative adversarial network. arXiv, 1609.04802v5, https://doi.org/10.48550/arXiv.1609.04802.
Lv, Q., Q. Li, K. Chen, Y. Lu, and L. Wang, 2022: Classification of ground-based cloud images by contrastive self-supervised learning. Remote Sens., 14, 5821, https://doi.org/10.3390/rs14225821.
Lyapustin, A., and Coauthors, 2014: Scientific impact of MODIS C5 calibration degradation and C6+ improvements. Atmos. Meas. Tech., 7, 4353–4365, https://doi.org/10.5194/amt-7-4353-2014.
Mahendran, A., and A. Vedaldi, 2015: Understanding deep image representations by inverting them. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, IEEE, 5188–5196, https://doi.org/10.1109/CVPR.2015.7299155.
McCoy, I. L., R. Wood, and J. K. Fletcher, 2017: Identifying meteorological controls on open and closed mesoscale cellular convection associated with marine cold air outbreaks. J. Geophys. Res. Atmos., 122, 11 678–11 702, https://doi.org/10.1002/2017JD027031.
McCoy, I. L., D. T. McCoy, R. Wood, P. Zuidema, and F. A.-M. Bender, 2023: The role of mesoscale cloud morphology in the shortwave cloud feedback. Geophys. Res. Lett., 50, e2022GL101042, https://doi.org/10.1029/2022GL101042.
McGovern, A., I. Ebert-Uphoff, D. J. Gagne II, and A. Bostrom, 2022: Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environ. Data Sci., 1, E6, https://doi.org/10.1017/eds.2022.5.
Misra, I., and L. van der Maaten, 2019: Self-supervised learning of pretext-invariant representations. arXiv, 1912.01991v1, https://doi.org/10.48550/arXiv.1912.01991.
MODIS Characterization Support Team, 2017: MODIS 500m calibrated radiance product. NASA MODIS Adaptive Processing System, Goddard Space Flight Center, https://doi.org/10.5067/MODIS/MOD02HKM.061.
Mohrmann, J., R. Wood, T. Yuan, H. Song, R. Eastman, and L. Oreopoulos, 2021: Identifying meteorological influences on marine low-cloud mesoscale morphology using satellite classifications. Atmos. Chem. Phys., 21, 9629–9642, https://doi.org/10.5194/acp-21-9629-2021.
Muhlbauer, A., I. L. McCoy, and R. Wood, 2014: Climatology of stratocumulus cloud morphologies: Microphysical properties and radiative effects. Atmos. Chem. Phys., 14, 6695–6716, https://doi.org/10.5194/acp-14-6695-2014.
Nguyen, A., J. Yosinski, and J. Clune, 2019: Understanding neural networks via feature visualization: A survey. arXiv, 1904.08939v1, https://doi.org/10.48550/arXiv.1904.08939.
Platnick, S., and Coauthors, 2017: The MODIS cloud optical and microphysical products: Collection 6 updates and examples from Terra and Aqua. IEEE Trans. Geosci. Remote Sens., 55, 502–525, https://doi.org/10.1109/TGRS.2016.2610522.
Platnick, S., and Coauthors, 2021: The NASA MODIS-VIIRS continuity cloud optical properties products. Remote Sens., 13, 2, https://doi.org/10.3390/rs13010002.
Qin, Z., F. Yu, C. Liu, and X. Chen, 2018: How convolutional neural network see the world—A survey of convolutional neural network visualization methods. arXiv, 1804.11191v2, https://doi.org/10.48550/ARXIV.1804.11191.
Rasp, S., H. Schulz, S. Bony, and B. Stevens, 2020: Combining crowdsourcing and deep learning to explore the mesoscale organization of shallow convection. Bull. Amer. Meteor. Soc., 101, E1980–E1995, https://doi.org/10.1175/BAMS-D-19-0324.1.
Rossow, W. B., and R. A. Schiffer, 1999: Advances in understanding clouds from ISCCP. Bull. Amer. Meteor. Soc., 80, 2261–2288, https://doi.org/10.1175/1520-0477(1999)080<2261:AIUCFI>2.0.CO;2.
Salomonson, V. V., W. Barnes, P. Maymon, H. Montgomery, and H. Ostrow, 1989: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Trans. Geosci. Remote Sens., 27, 145–153, https://doi.org/10.1109/36.20292.
Salomonson, V. V., W. Barnes, and E. J. Masuoka, 2006: Introduction to MODIS and an overview of associated activities. Earth Science Satellite Remote Sensing, Springer, 12–32, https://doi.org/10.1007/978-3-540-37293-6_2.
Schmit, T., M. Gunshor, G. Fu, T. Rink, K. Bah, W. Zhang, and W. Wolf, 2012: GOES-R Advanced Baseline Imager (ABI) Algorithm Theoretical Basis Document for Cloud and Moisture Imagery Product (CMIP) version 3.0. NOAA NESDIS Center for Satellite Applications and Research, 63 pp., https://www.star.nesdis.noaa.gov/goesr/docs/ATBD/Imagery.pdf.
Schmit, T. J., P. Griffith, M. M. Gunshor, J. M. Daniels, S. J. Goodman, and W. J. Lebair, 2017: A closer look at the ABI on the GOES-R series. Bull. Amer. Meteor. Soc., 98, 681–698, https://doi.org/10.1175/BAMS-D-15-00230.1.
Schulz, H., R. Eastman, and B. Stevens, 2021: Characterization and evolution of organized shallow convection in the downstream North Atlantic trades. J. Geophys. Res. Atmos., 126, e2021JD034575, https://doi.org/10.1029/2021JD034575.
Segal-Rozenhaimer, M., D. Nukrai, H. Che, R. Wood, and Z. Zhang, 2023: Cloud mesoscale cellular classification and diurnal cycle using a convolutional neural network (CNN). Remote Sens., 15, 1607, https://doi.org/10.3390/rs15061607.
Simonyan, K., and A. Zisserman, 2014: Very deep convolutional networks for large-scale image recognition. arXiv, 1409.1556v6, https://doi.org/10.48550/ARXIV.1409.1556.
Stevens, B., and Coauthors, 2020: Sugar, gravel, fish and flowers: Mesoscale cloud patterns in the trade winds. Quart. J. Roy. Meteor. Soc., 146, 141–152, https://doi.org/10.1002/qj.3662.
Stojnic, V., and V. Risojevic, 2021: Self-supervised learning of remote sensing scene representations using contrastive multiview coding. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Nashville, TN, IEEE, 1182–1191, https://doi.org/10.1109/CVPRW53098.2021.00129.
Tan, I., L. Oreopoulos, and N. Cho, 2019: The role of thermodynamic phase shifts in cloud optical depth variations with temperature. Geophys. Res. Lett., 46, 4502–4511, https://doi.org/10.1029/2018GL081590.
van den Oord, A., Y. Li, and O. Vinyals, 2018: Representation learning with contrastive predictive coding. arXiv, 1807.03748v2, https://doi.org/10.48550/arXiv.1807.03748.
Ver Hoef, L., H. Adams, E. J. King, and I. Ebert-Uphoff, 2023: A primer on topological data analysis to support image analysis tasks in environmental science. Artif. Intell. Earth Syst., 2, e220039, https://doi.org/10.1175/AIES-D-22-0039.1.
Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, 2010: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11, 3371–3408.
Vogel, R., H. Konow, H. Schulz, and P. Zuidema, 2021: A climatology of trade-wind cumulus cold pools and their link to mesoscale cloud organization. Atmos. Chem. Phys., 21, 16 609–16 630, https://doi.org/10.5194/acp-21-16609-2021.
Wang, L., Q. Li, and Q. Lv, 2022: Self-supervised classification of weather systems based on spatiotemporal contrastive learning. Geophys. Res. Lett., 49, e2022GL099131, https://doi.org/10.1029/2022GL099131.
Wood, R., 2012: Stratocumulus clouds. Mon. Wea. Rev., 140, 2373–2423, https://doi.org/10.1175/MWR-D-11-00121.1.
Wood, R., and D. L. Hartmann, 2006: Spatial variability of liquid water path in marine low cloud: The importance of mesoscale cellular convection. J. Climate, 19, 1748–1764, https://doi.org/10.1175/JCLI3702.1.
Wu, Z., Y. Xiong, S. Yu, and D. Lin, 2018: Unsupervised feature learning via non-parametric instance-level discrimination. arXiv, 1805.01978v1, https://doi.org/10.48550/ARXIV.1805.01978.
Xu, D., J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, 2019: Self-supervised spatiotemporal learning via video clip order prediction. 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, IEEE, 10 326–10 335, https://doi.org/10.1109/CVPR.2019.01058.
Yosinski, J., J. Clune, Y. Bengio, and H. Lipson, 2014: How transferable are features in deep neural networks? arXiv, 1411.1792v1, https://doi.org/10.48550/arXiv.1411.1792.
You, Y., I. Gitman, and B. Ginsburg, 2017: Large batch training of convolutional networks. arXiv, 1708.03888v3, https://doi.org/10.48550/ARXIV.1708.03888.
You, Y., and Coauthors, 2019: Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv, 1904.00962, https://doi.org/10.48550/arXiv.1904.00962.
Yuan, T., 2019: Understanding low cloud mesoscale morphology with an information maximizing generative adversarial network. EarthArXiv, 877, https://doi.org/10.31223/OSF.IO/GVEBT.
Yuan, T., H. Song, R. Wood, J. Mohrmann, K. Meyer, L. Oreopoulos, and S. Platnick, 2020: Applying deep learning to NASA MODIS data to create a community record of marine low-cloud mesoscale morphology. Atmos. Meas. Tech., 13, 6989–6997, https://doi.org/10.5194/amt-13-6989-2020.
Zbontar, J., L. Jing, I. Misra, Y. LeCun, and S. Deny, 2021: Barlow Twins: Self-supervised learning via redundancy reduction. Proc. 38th Int. Conf. on Machine Learning, Online, PMLR, 12 310–12 320, https://proceedings.mlr.press/v139/zbontar21a/zbontar21a.pdf.
Zelinka, M. D., T. A. Myers, D. T. McCoy, S. Po-Chedley, P. M. Caldwell, P. Ceppi, S. A. Klein, and K. E. Taylor, 2020: Causes of higher climate sensitivity in CMIP6 models. Geophys. Res. Lett., 47, e2019GL085782, https://doi.org/10.1029/2019GL085782.
Zhou, X., and C. S. Bretherton, 2019: Simulation of mesoscale cellular convection in marine stratocumulus: 2. Nondrizzling conditions. J. Adv. Model. Earth Syst., 11, 3–18, https://doi.org/10.1029/2018MS001448.
Zhou, X., A. S. Ackerman, A. M. Fridlind, and P. Kollias, 2018: Simulation of mesoscale cellular convection in marine stratocumulus. Part I: Drizzling conditions. J. Atmos. Sci., 75, 257–274, https://doi.org/10.1175/JAS-D-17-0070.1.