Understanding Spatial Context in Convolutional Neural Networks Using Explainable Methods: Application to Interpretable GREMLIN

Kyle A. Hilburn, Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado

Open access

Abstract

Convolutional neural networks (CNNs) are opening new possibilities in the realm of satellite remote sensing. CNNs are especially useful for capturing the information in spatial patterns that is evident to the human eye but has eluded classical pixelwise retrieval algorithms. However, the black-box nature of CNN predictions makes them difficult to interpret, hindering their trustworthiness. This paper explores a new way to simplify CNNs that allows them to be implemented in a fully transparent and interpretable framework. This clarity is accomplished by moving the inner workings of the CNN out into a feature engineering step and replacing the CNN with a regression model. The specific example of the GOES Radar Estimation via Machine Learning to Inform NWP (GREMLIN) is used to demonstrate that such simplifications are possible and to show the benefits of the interpretable approach. GREMLIN translates images of GOES radiances and lightning into images of radar reflectivity, and previous research used explainable artificial intelligence (XAI) approaches to explain some aspects of how GREMLIN makes predictions. However, the Interpretable GREMLIN model shows that XAI missed several strategies, and XAI does not provide guarantees on how the model will respond when confronted with new scenarios. In contrast, the interpretable model establishes well-defined relationships between inputs and outputs, offering a clear mapping of the spatial context utilized by the CNN to make accurate predictions, and providing guarantees on how the model will respond to new inputs. The significance of this work is that it provides a new approach for developing trustworthy artificial intelligence models.

Significance Statement

Convolutional neural networks (CNNs) are very powerful tools for interpreting and processing satellite imagery. However, the black-box nature of their predictions makes them difficult to interpret, compromising their trustworthiness when applied in the context of high-stakes decision-making. This paper develops an interpretable version of a CNN model, showing that it has similar performance as the original CNN. The interpretable model is analyzed to obtain clear relationships between inputs and outputs, which elucidates the nature of spatial context utilized by CNNs to make accurate predictions. The interpretable model has a well-defined response to inputs, providing guarantees for how it will respond to novel inputs. The significance of this work is that it provides an approach to developing trustworthy artificial intelligence models.

© 2023 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Kyle A. Hilburn, kyle.hilburn@colostate.edu

1. Introduction

Convolutional neural networks (CNNs) are opening new opportunities in Earth remote sensing. For example, CNNs provide the ability to anticipate cloud structures beneath obscuring cirrus clouds, making greater usage of the visible and infrared radiances observed by geostationary satellites (Meng et al. 2022; Haynes et al. 2022; Hilburn et al. 2021; Veillette et al. 2018). CNNs do this by using the information in image gradients and the spatial context in which a pixel is embedded, and in that way are mimicking how a human analyst visually interprets imagery. However, the black-box nature of CNNs hinders their trustworthiness: what exactly is the spatial context they use? This paper seeks to provide insights by exposing an approximation of the CNN input feature space and using that to evaluate the CNN predictions.

The GOES Radar Estimation via Machine Learning to Inform NWP (GREMLIN) model (Hilburn et al. 2021) uses a CNN to perform image-to-image translation from GOES radiances and lightning to Multi-Radar Multi-Sensor (MRMS) composite reflectivity (REFC), showing good accuracy for warm-season convection over the contiguous United States (CONUS). The explainable artificial intelligence (XAI) technique of layerwise relevance propagation (LRP; Bach et al. 2015; Montavon et al. 2018; Lapuschkin et al. 2019; Mamalakis et al. 2022) was used to gain insight on how GREMLIN makes accurate predictions, by showing in attention maps which channel and where in the image the network focuses its attention to predict radar reflectivity for a particular pixel. The use of LRP identified three physical strategies GREMLIN uses for predicting REFC (Ebert-Uphoff and Hilburn 2020; Hilburn et al. 2021): 1) lightning is a very strong predictor of strong radar echoes, 2) cold brightness temperatures (TBs) are associated with stronger radar echoes, and 3) stronger echoes are more likely on the cold side of strong TB gradients. These are all physically reasonable strategies, and the second strategy is the classical method for relating infrared observations to radar reflectivity and precipitation intensity (e.g., Arkin and Meisner 1987).

While the insights provided by LRP were helpful for confirming these three strategies were encoded in GREMLIN, there were gaps in our knowledge after using LRP. Had all the strategies used by GREMLIN been identified? This was not clear because LRP requires selection of individual pixels in particular samples for analysis. While the three strategies above were observed for the samples analyzed, LRP does not provide any guarantees for how GREMLIN would respond to new samples. It was obvious from LRP that GREMLIN was using spatial information, but it was not clear exactly how it was doing so in a quantitative manner. The inability to quantitatively define spatial context makes it hard to know how the model will generalize. For example, to what degree does the preferred wind shear direction over CONUS, where GREMLIN was trained, influence the spatial structure of GREMLIN predictions? Thus, LRP is not satisfactory for full understanding, and this provides motivation for a new approach to dig deeper. Since GREMLIN is being considered for operational NOAA applications, it is important to gain a deep understanding of the model. Note that LRP is not the only XAI method available, but an evaluation of various XAI methods is beyond the scope of this work. Newer methods such as Shapley additive explanations (Lundberg and Lee 2017) show promise to overcome limitations of LRP.

The objective of this paper is to provide insight on the nature of spatial context utilized by a CNN through the development of an interpretable version of GREMLIN. This will lead to a model that is easier to understand, help elucidate how well the model will generalize to unseen regimes, and identify the conditions in which predictions are uncertain due to lack of information content. The interpretation of a CNN is complicated by the presence of many layers (Olah et al. 2017), but if everything can be brought up into the first layer, it would make interpretation of the relationship between inputs and outputs much easier. Figure 1 illustrates our approach, which is to pull the inner workings from inside the CNN out into a feature engineering step. By using a manual rather than automatic feature engineering approach, the features can be input into different model classes (e.g., neural networks, linear regression, random forests). Section 2 will discuss the three elements necessary to reproduce CNN accuracy: nonlinear transformations of the input channels, use of predefined image kernels to capture spatial patterns, and use of an image pyramid to capture multiresolution context. The interpretable approach produces maps where each pixel has a vector containing all the pieces of information used by the CNN. This mapping provides the desired interpretability by characterizing how data inside the network are being represented (Gilpin et al. 2018). Having extracted the CNN representation of the spatial context, the CNN can then be replaced with a different regression framework, such as a fully connected (dense) neural network (NN) or a linear regression. Use of linear regression represents the gold standard of interpretability because the resulting model has a weight for each input that tells us exactly how that input contributes to the prediction.

Fig. 1.

Schematic comparing the original convolutional neural network approach (left branch) with the interpretable framework (right branch).

Citation: Artificial Intelligence for the Earth Systems 2, 3; 10.1175/AIES-D-22-0093.1

Thus, the fundamental approach of this paper is knowledge distillation (Hinton et al. 2015) with a simpler proxy model that behaves like the original model but is easier to explain. The results are relevant to the original model provided the simpler model has a high degree of completeness relative to the original CNN model (Gilpin et al. 2018). At this point, a quantitative definition of completeness is lacking, and section 3 provides several evaluations of model outputs as a function of the inputs to argue for completeness. Note that in this work, the linear model is being fit to the training dataset, as opposed to other interpretable approaches that use linear regression to approximate the full model in a local manner [local interpretable model-agnostic explanations (LIME) in Ribeiro et al. (2016)]. Fitting the model to the full dataset gives information about the global properties of the model, not just local sensitivity about a particular input state.

The outcome of this paper is an Interpretable GREMLIN model. The term “interpretable” is used deliberately to indicate that the explainability has been built into the model right from the start [Du et al. (2020) call this “intrinsic interpretability”], rather than being derived post hoc from a trained black-box model as with XAI. The terms interpretable and explainable are used in the same sense as Došilović et al. (2018), where interpretable indicates mapping of concepts into a human understandable domain, while explainable indicates the contributions of a collection of features toward an output. Similarly, Doshi-Velez and Kim (2017) define interpretability as the ability to explain or to present in understandable terms to a human. Flora et al. (2022) provide a survey of the use of the terms interpretable and explainable in machine learning (ML) literature.

The main advantage of an interpretable model is that it provides building blocks to enable the trustworthiness needed to serve as an input to decision-making activities. Rudin (2019) argues that for high-stakes decisions, it is better to use a model that is interpretable from the start, rather than trying to explain a black-box model using post hoc techniques, because XAI has many pitfalls (Molnar et al. 2022) and can provide unreliable and misleading explanations. Also, the interpretable model offers a path forward toward understanding model biases, which is needed for ethical AI (McGovern et al. 2022). The main drawback of the interpretable methodology is that it explodes the size of the datasets involved. So, either more computer memory is required, or streamwise estimation approaches that do not hold the entire dataset in memory are required. Thus, this work confirms, as discussed in Rudin (2019), that interpretable models require more effort to construct in terms of computation and domain expertise.

Consideration of these advantages and disadvantages to interpretable methodologies highlights that CNNs are very efficient implementations: they are quick and easy to train and have small memory footprints. This work finds that developing linear regression models capable of replacing machine learning models requires substantially more effort than the traditional CNN development process. In addition, because of the bias–variance trade-off in ML model development, the end user may be willing to sacrifice interpretability if it means reducing model bias. So, this paper is not advocating that all CNNs be reformulated as described herein; there may be cases where that is not desirable. Instead, the interpretability approach in this paper provides a new tool for analyzing satellite data and for understanding more complex ML models.

The paper is structured as follows. Section 2 describes the data and methodology including use of an image pyramid (section 2a), the prescribed convolutional kernels (section 2b), nonlinear transformations used in data preprocessing (section 2c), the linear regression model (section 2d), finding the weights of the regression model (section 2e), and handling unbalanced data (section 2f). Section 3 provides results and discussion, confirming that the interpretable model matches the accuracy of the CNN using several metrics (section 3a), providing an interpretation of the features and explaining spatial context (section 3b), examining information in the gradient direction (section 3c), multiresolution information (section 3d), and multichannel information (section 3e). The paper closes with summary and conclusions (section 4).

2. Data and method

The dataset used to train and evaluate GREMLIN’s estimates of radar reflectivity is described in detail by Hilburn et al. (2021). It consists of brightness temperatures from three bands (Table 1) of the Advanced Baseline Imager (ABI; Schmit et al. 2017) and lightning group extent density (GED) from the Geostationary Lightning Mapper (GLM; Goodman et al. 2013) on GOES-16. All datasets were resampled to the High-Resolution Rapid Refresh (HRRR) CONUS 3-km grid using a drop-in-the-bucket technique and matched in time to a 15-min refresh rate. The data were resampled onto the HRRR grid because the original use case for GREMLIN was initializing convection in the HRRR model. Parallax shift in ABI imagery was removed assuming a fixed 10-km cloud-top height. While this is reasonable for deep convection over CONUS, it will overcorrect the shift for shallow storms. GREMLIN was trained against MRMS (Smith et al. 2016) REFC. There are other channels on ABI and other parameters from GLM that have been found to be helpful for GREMLIN, but they are not used in this work so that Interpretable GREMLIN results can be compared directly with the GREMLIN CNN. The dataset, named CONUS2 (Hilburn 2022), was reduced to 1798 training and 448 testing samples by selecting 256 × 256 pixel image patches over a 92-day period from April to July 2019 to maximize the number of storm reports.

Table 1.

ABI bands and their nicknames used in this paper.

a. Image pyramid

The GREMLIN CNN architecture shown in Fig. 2a is a simple version of a U-Net (Ronneberger et al. 2015). The model has four input channels and one output channel, is four layers deep, uses convolution-pooling blocks, and has 32 kernels of size 3 × 3 in each convolutional layer. The key insight enabling Interpretable GREMLIN is that this architecture corresponds to the combination of an image pyramid (Fig. 2b), a filter bank, and a regression framework. The image pyramid (Burt and Adelson 1983; Adelson et al. 1984) is the mechanism by which GREMLIN captures multiresolution information content. The convolutional kernels capture the gradients and spatial patterns. The regression framework models the nonlinearity between inputs and outputs. For Interpretable GREMLIN, there are also four input channels and a four-level image pyramid. In the GREMLIN CNN, the image pyramid is internally constructed through 2 × 2 max-pooling layers in the encoder branch and 2 × 2 upsampling in the decoder branch. For Interpretable GREMLIN (Fig. 2c), 2 × 2 max-pooling is applied from 0 to 3 times, then the kernels are applied, and upsampling is applied from 0 to 3 times to get each pyramid layer back to the original resolution. In developing Interpretable GREMLIN, applying a 3 × 3 binomial smoother before each pooling and after each upsampling was found to yield better predictions, but note that no smoothing layers were used in the GREMLIN CNN.
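
The pooling and upsampling steps described above can be sketched for a single channel as follows. This is a minimal illustration, not the GREMLIN code: it assumes NumPy, nearest-neighbor upsampling, and image sides divisible by 8; the function names are hypothetical, and the binomial smoother and the kernel step are omitted for brevity.

```python
import numpy as np

def max_pool_2x2(img):
    """Downsample by taking the maximum over non-overlapping 2 x 2 blocks."""
    ny, nx = img.shape
    return img.reshape(ny // 2, 2, nx // 2, 2).max(axis=(1, 3))

def upsample_2x2(img):
    """Upsample by repeating each pixel over a 2 x 2 block (nearest neighbor, assumed)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def pyramid_levels(img, n_levels=4):
    """Return n_levels images at the original resolution.

    Level i is pooled i times and then upsampled i times.
    """
    levels = []
    for i in range(n_levels):
        out = img
        for _ in range(i):
            out = max_pool_2x2(out)
        # In the interpretable model, the prescribed kernels would be applied here.
        for _ in range(i):
            out = upsample_2x2(out)
        levels.append(out)
    return np.stack(levels)

image = np.arange(64, dtype=float).reshape(8, 8)
levels = pyramid_levels(image)
print(levels.shape)  # (4, 8, 8)
```

Level 0 is the unchanged input, while the coarsest level of this 8 × 8 example collapses to the single image maximum, which shows how each pooling step trades resolution for spatial reach.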

Fig. 2.

(a) GREMLIN model architecture, with number of parameters given under the blue arrows and image sizes shown in green boxes. In (a), the convolutional kernels are learned. (b) Image pyramid corresponding to GREMLIN, where level 0 is the original resolution input image and three levels of pooling are applied. (c) The architecture of the interpretable model. In (c), the convolutional kernels are prescribed. Dimensions are number of samples Ns, number of channels Nc, number of kernels Nk, number of pyramid levels Np, and the image dimensions Nx and Ny.

b. Convolutional kernels

In the original GREMLIN CNN, the kernels are learned, while Interpretable GREMLIN has prescribed kernels. Guilloteau and Foufoula-Georgiou (2020) suggested that a small number of prescribed kernels can capture much of the information content learned by CNNs. Indeed, this work finds that a large filter bank was not necessary to reproduce GREMLIN accuracy, and in fact, for this application, only four kernels were found to be necessary to reproduce GREMLIN accuracy: identity I, Sobel DX, Sobel DY, and Laplacian (LAP), given by the following equations:
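$$ \mathbf{I} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, $$
$$ \mathbf{DX} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, $$
$$ \mathbf{DY} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \quad \text{and} $$
$$ \mathbf{LAP} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}, $$
where
$$ \mathbf{SMOOTH} = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} $$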
gives the smoothing kernel used in constructing the image pyramid. Such a simple filter bank may not be adequate for all types of problems, and at this point it is not possible to offer general guidelines for reformulating CNNs as interpretable models. Thus, as in ML development, there is a certain amount of trial and error that is involved to configure the interpretable version of the model. An advantage of these kernels is that they come with physical interpretations as edge and point detectors. The CNN learned many additional kernels with less obvious interpretations. Tests supplementing the model with additional kernels did not find significant additional improvements to accuracy. The actual number of inputs for each pixel is then given by the product of the number of input channels Nc, the number of pyramid layers Np, and the number of image kernels Nk. For Interpretable GREMLIN this is 4 × 4 × 4 = 64 inputs per pixel.
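
As a hedged sketch of how such a filter bank might be applied, the example below cross-correlates an image with the four prescribed kernels and keeps only pixels with full spatial context. It assumes NumPy 1.20 or later; the kernel signs follow the standard Sobel and Laplacian conventions, which is an assumption, and `apply_kernel` and `filter_bank` are hypothetical names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Prescribed 3 x 3 kernels; the signs are assumed (standard Sobel/Laplacian conventions).
KERNELS = {
    "I": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "DX": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "DY": np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    "LAP": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
}

def apply_kernel(img, kernel):
    """Cross-correlate img with a 3 x 3 kernel, keeping only fully covered pixels."""
    windows = sliding_window_view(img, kernel.shape)  # shape (ny - 2, nx - 2, 3, 3)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def filter_bank(img):
    """Stack the responses of all prescribed kernels: shape (Nk, ny - 2, nx - 2)."""
    return np.stack([apply_kernel(img, k) for k in KERNELS.values()])

# A horizontal ramp has a constant x gradient, no y gradient, and zero Laplacian.
ramp = np.tile(np.arange(6, dtype=float), (6, 1))
features = filter_bank(ramp)
print(features.shape)  # (4, 4, 4)

# Inputs per pixel: channels x pyramid levels x kernels.
n_channels, n_levels, n_kernels = 4, 4, len(KERNELS)
print(n_channels * n_levels * n_kernels)  # 64
```

On the ramp, only the identity and DX responses are nonzero, which is the edge-detector behavior described in the text.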
Bringing the internal workings of the CNN to the input preparation step provides the advantage that the sensitivity of model outputs to model inputs can now be determined, including the spatial context utilized by the CNN. It comes with the disadvantage, however, that the input dataset is now much larger, by a factor of 16 in this case. A minor point that is obscured with CNNs, but is obvious with the interpretable model, is that near image edges there is a loss of spatial context, unless some sort of edge padding is used. For this work, only pixels with full spatial context were utilized. For 3 × 3 kernels, 2 × 2 pooling, and 3 × 3 smoothing this results in dimensions of the valid data after each block of smoothing, pooling, upsampling, and smoothing of (Nx − di, Ny − di) where the “pixels lost” di for each pyramid level i follow the recurrence relationship
$$ d_0 = 2 \quad \text{and} $$
$$ d_i = 2 d_{i-1} + 2 $$
so that the resulting images are reduced in size to (Nx − 30, Ny − 30), removing 15 pixels from each edge for this model with four pyramid levels.
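
The recurrence can be checked numerically with a few lines of Python. This is only an arithmetic illustration; `pixels_lost` is a hypothetical helper name.

```python
def pixels_lost(n_levels):
    """Pixels lost per image dimension at each pyramid level.

    Implements d_0 = 2 and d_i = 2 * d_(i-1) + 2.
    """
    d = [2]
    for _ in range(1, n_levels):
        d.append(2 * d[-1] + 2)
    return d

lost = pixels_lost(4)
print(lost)             # [2, 6, 14, 30]
print(256 - lost[-1])   # 226 valid pixels per side for a 256 x 256 patch
```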

c. Data preprocessing

To keep the linear regression simple, it was found advantageous to apply a gamma correction (e.g., Gonzalez and Woods 2002) to the inputs to remove the mean nonlinearity, shown in Fig. 3. The equations for scaling the inputs are given by
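$$ x_{\mathrm{scaled}} = \left[ \frac{x - x_{\min}}{x_{\max} - x_{\min}} \right]^{\gamma} \quad \text{and} \tag{3a} $$
$$ x_{\mathrm{scaled}} = \left[ \frac{x_{\max} - x}{x_{\max} - x_{\min}} \right]^{\gamma}, \tag{3b} $$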
where xmin, xmax, and the γ exponents are given in Table 2. The exponents were calculated by fitting a power law to the mean radar reflectivity versus each of the four input channels. Equation (3a) is the regular scaling used for lightning, and Eq. (3b) is the inverted scaling used for infrared bands. The water vapor band is the most linear input, and the lightning is the most nonlinear input. Note that the data were already scaled into the range 0–1; applying an exponent to linearize the data is another way in which the data preparation resembles the process used to create red–green–blue (RGB) image products (e.g., Miller et al. 2020). The motivation behind the gamma correction is further discussed in section 3b.

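
A minimal sketch of the two scalings is given below, assuming NumPy. The parameter values in the example are placeholders chosen for illustration and are not the Table 2 values; clipping inputs to the range [xmin, xmax] is also an assumption, and `scale_input` is a hypothetical name.

```python
import numpy as np

def scale_input(x, xmin, xmax, gamma, inverted=False):
    """Scale x into [0, 1] and apply a gamma exponent.

    inverted=False is the regular scaling (lightning);
    inverted=True is the inverted scaling (infrared bands), so cold TBs map toward 1.
    """
    x = np.clip(np.asarray(x, dtype=float), xmin, xmax)  # clipping is an assumption
    if inverted:
        frac = (xmax - x) / (xmax - xmin)
    else:
        frac = (x - xmin) / (xmax - xmin)
    return frac ** gamma

# Placeholder parameters for illustration only (not the Table 2 values).
tb = np.array([200.0, 250.0, 300.0])
scaled = scale_input(tb, xmin=200.0, xmax=300.0, gamma=2.0, inverted=True)
print(scaled)  # coldest TB maps to 1, warmest to 0
```
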
Fig. 3.

The mean radar reflectivity as a function of input value for the four input channels showing data (blue) and power-law fit (orange) for (a) C07, (b) C09, (c) C13, and (d) GED.

Table 2.

Scaling parameters for each input channel.

d. Linear regression model

Given those inputs to the interpretable model, a regression framework is needed for making the predictions of REFC. Two approaches are used. The first uses a fully connected dense NN (called DENSE), which serves as a nonlinear function approximator using a model with 2 hidden layers and 32 units per layer. Using this approach is a quick way to confirm that the interpretable model can indeed reproduce the accuracy of the original GREMLIN CNN architecture, and in some cases even performs better. However, a fully connected dense NN is not very interpretable, and so the second approach is to replace that with a linear model (called LINEAR). The DENSE model indicated that the ability to represent nonlinearity is an important contributor to ML accuracy, and this provided guidance on how much nonlinearity must be included in the linear regression. It was found that GREMLIN accuracy could be reproduced with a model of the form
$$ y = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \le i}}^{n} w_{i,j} x_i x_j = \sum_{i=1}^{N} w_i z_i, \tag{4} $$
where n is the number of inputs x. This model includes linear functions of each input, two-way interaction terms, and quadratic functions of each input. The number of terms z in this model is given by N = n + n(n + 1)/2 = 2144 for 64 inputs. Note that while linear regression is fully interpretable, the large number of features here means that this approach is technically not simulatable in the sense of Murdoch et al. (2019). In other words, despite being linear regression, the large number of terms hinders the ability of a human to easily interpret the model, and additional methods are required to distill the model behavior into something digestible. Section 3 provides analysis along several dimensions of variability in the inputs to obtain interpretations.
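
To make the term count concrete, the sketch below expands n inputs into the linear terms plus all products with j ≤ i, which covers both the quadratic terms (j = i) and the two-way interactions. It assumes NumPy; `expand_terms` is a hypothetical name, and the ordering of the terms is an arbitrary choice.

```python
import numpy as np

def expand_terms(x):
    """Map n inputs to the N = n + n(n + 1)/2 regression terms z.

    x has shape (n_pixels, n). The output holds the n linear terms followed
    by all products x_i * x_j with j <= i.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[1]
    i_idx, j_idx = np.tril_indices(n)  # all index pairs with j <= i
    products = x[:, i_idx] * x[:, j_idx]
    return np.concatenate([x, products], axis=1)

rng = np.random.default_rng(0)
x = rng.random((5, 64))
z = expand_terms(x)
print(z.shape)  # (5, 2144)
```
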
An advantage of linear regression models over machine learning is that they easily allow calculation of the condition number κ of the model from
$$ \kappa = \lVert \mathbf{C}_x \rVert \, \lVert \mathbf{C}_x^{-1} \rVert, \tag{5} $$
where ‖·‖ is the Frobenius norm and Cx is defined below. The condition number measures the sensitivity of a model output to changes in the inputs (e.g., Press et al. 1992). Starting with just the 64 linear terms in Eq. (4), the condition number is 3.1 × 10⁶. Adding the quadratic terms increases this to 4.6 × 10⁸, and including all of the terms increases it to 2.1 × 10¹³. Thus, the linear model will be highly sensitive to noise in the inputs, and if one wanted to operationalize the linear model, further work on feature selection and regularization would be needed. However, what is important is that the process of constructing the linear model exposes an approximation of the effective input feature space of the CNN. It is the use of that input space that provides the interpretability shown and discussed in section 3.
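
A minimal sketch of the condition number calculation with the Frobenius norm is shown below, assuming NumPy. The two small matrices are toy examples, not GREMLIN covariances; they only illustrate how near-collinear inputs inflate κ.

```python
import numpy as np

def condition_number(c_x):
    """Condition number kappa = ||C|| * ||C^-1|| using the Frobenius norm."""
    return np.linalg.norm(c_x, "fro") * np.linalg.norm(np.linalg.inv(c_x), "fro")

# Toy covariance matrices: independent inputs versus nearly collinear inputs.
independent = np.array([[1.0, 0.0], [0.0, 1.0]])
collinear = np.array([[1.0, 0.999], [0.999, 1.0]])

print(condition_number(independent))  # 2 for the 2 x 2 identity
print(condition_number(collinear))    # roughly three orders of magnitude larger
```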

e. Finding regression weights

The standard approach to solving the generalized least squares problem is through applying singular value decomposition (SVD) to the normal equations. However, SVD involves calculating the inverse of a matrix that has the shape of the number of inputs (64) by the number of pixels across all images (8 × 10⁷), which produces memory exhaustion. The standard approach for linear regression when the dataset is too large for memory is stochastic gradient descent (SGD). However, SGD was found to be prohibitively slow, and many passes over the dataset are required for convergence. Thus, the approach used to solve Eq. (4) is employing the linear minimum mean-squared error (MMSE) estimator.

From the orthogonality principle (Papoulis and Pillai 2002), the solution of Eq. (4) is
$$ \mathbf{w}^{\mathrm{T}} = \mathbf{C}_x^{-1} \mathbf{C}_{yx}, \tag{6} $$
where Cx is the autocovariance matrix and Cyx is the cross-correlation matrix. The memory required goes as the number of inputs and does not depend on the number of data points. These matrices can be calculated with just one pass over the data, accumulating the sums of xi, y, xixj, and yxi for inputs i and j. This approach also has the advantage that ablation studies can be conducted without needing to refit the model: you simply drop the rows and columns in Cx and Cyx for the inputs you want to remove and recalculate w using Eq. (6). One might raise the question whether this type of model is properly called “machine learning.” Since calculating the sums required to solve for the weights in Eq. (4) can only be performed by machine, this approach is perhaps the simplest form of machine learning. One important difference with ML is that this approach does not have an optimizer since weights are calculated explicitly, rather than through an iterative process.
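
The one-pass accumulation and the ablation shortcut can be sketched as follows, assuming NumPy. The class name, the batch interface, and the explicit intercept recovered from the means are illustrative assumptions rather than the paper's implementation, and the per-observation weighting of section 2f is omitted.

```python
import numpy as np

class MomentAccumulator:
    """Accumulate the sums needed for C_x and C_yx without holding the dataset in memory."""

    def __init__(self, n_inputs):
        self.count = 0
        self.sum_x = np.zeros(n_inputs)
        self.sum_y = 0.0
        self.sum_xx = np.zeros((n_inputs, n_inputs))
        self.sum_yx = np.zeros(n_inputs)

    def update(self, x, y):
        """Add one batch: x has shape (n_pixels, n_inputs), y has shape (n_pixels,)."""
        self.count += x.shape[0]
        self.sum_x += x.sum(axis=0)
        self.sum_y += y.sum()
        self.sum_xx += x.T @ x
        self.sum_yx += x.T @ y

    def solve(self, keep=None):
        """Return (weights, intercept); keep selects a subset of inputs for ablation."""
        mean_x = self.sum_x / self.count
        mean_y = self.sum_y / self.count
        c_x = self.sum_xx / self.count - np.outer(mean_x, mean_x)
        c_yx = self.sum_yx / self.count - mean_y * mean_x
        if keep is not None:
            c_x, c_yx, mean_x = c_x[np.ix_(keep, keep)], c_yx[keep], mean_x[keep]
        w = np.linalg.solve(c_x, c_yx)
        return w, mean_y - w @ mean_x

# Synthetic check: ten batches standing in for ten images.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
acc = MomentAccumulator(3)
for _ in range(10):
    x = rng.normal(size=(1000, 3))
    acc.update(x, x @ true_w + 4.0)

w, b = acc.solve()
w_ablate, _ = acc.solve(keep=[0, 1])  # drop the third input without another data pass
print(np.round(w, 6), round(float(b), 6))
```

The memory footprint depends only on the number of inputs, and the ablated weights are obtained from the stored sums alone.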

f. Handling unbalanced data

The last detail in the methodology of Interpretable GREMLIN is dealing with the imbalanced nature of the dataset, given that a PDF of REFC falls off exponentially with increasing REFC. If not addressed, this behavior leads to a poor probability of detection/false alarm trade-off, where strong echoes are severely underpredicted. Using a weighted mean-squared-error (MSE) loss function, as in Hilburn et al. (2021), achieves a balance between underprediction and overprediction across the full range of REFC values. The loss function weights W are given by
$$ W = e^{b y^{c}}, $$
where y is the true value (from MRMS), and the coefficients b and c are given for each model in Table 3. For the CNN and DENSE models, the weights were implemented in the loss function, and for the LINEAR model, they were implemented as weights on each observation. The weights were found, as in Hilburn et al. (2021), to produce the minimum categorical bias across the range from 5 to 50 dBZ in steps of 5 dBZ. The weights were found for the training dataset, to keep weights independent of the testing dataset.

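
A minimal sketch of the weighting is given below, assuming NumPy. The coefficients are placeholders rather than the Table 3 values, the reflectivity is assumed to be scaled to 0–1, and normalizing the weighted error by the number of pixels is also an assumption.

```python
import numpy as np

def loss_weights(y_true, b, c):
    """Weights W = exp(b * y**c) that emphasize the rarer, stronger echoes."""
    return np.exp(b * np.asarray(y_true, dtype=float) ** c)

def weighted_mse(y_true, y_pred, b, c):
    """Mean-squared error with each pixel weighted by W(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(loss_weights(y_true, b, c) * (y_pred - y_true) ** 2)

# Placeholder coefficients for illustration only (not the Table 3 values).
y_true = np.array([0.0, 0.5, 1.0])  # reflectivity scaled to 0-1 (assumed)
weights = loss_weights(y_true, b=5.0, c=4.0)
print(weights)
```
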
Table 3.

The number of parameters and loss function weights for each model.

g. Evaluation metrics

Model performance will be characterized using several metrics that compare the estimated (Y) and true (X) radar reflectivity pixel by pixel and then sum over the entire image. The squared Pearson correlation coefficient R² and root-mean-square difference (RMSD) are respectively given by
$$ R^2 = \frac{\{E[(X - E[X])(Y - E[Y])]\}^2}{\{E[(X - E[X])^2]\}\{E[(Y - E[Y])^2]\}} \quad \text{and} $$
$$ \mathrm{RMSD} = \sqrt{E[(Y - X)^2]}, $$
where E[] is the expected value. Categorical statistics are calculated for a given REFC threshold by creating binary objects from both the predicted and true datasets using the greater-than operator and then forming a contingency table of the hits H (echoes in both the predicted and true datasets), false alarms F (echoes in the predicted dataset but not in the true dataset), and misses M (echoes in the true dataset but not in the predicted dataset). The probability of detection (POD), false alarm ratio (FAR), critical success index (CSI), and bias are given by
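$$ \mathrm{POD} = \frac{H}{H + M}, $$
$$ \mathrm{FAR} = \frac{F}{H + F}, $$
$$ \mathrm{CSI} = \frac{H}{H + F + M}, \quad \text{and} $$
$$ \mathrm{bias} = \frac{H + F}{H + M}. $$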
The statistics were calculated for the common subset of data in the center of the image with full spatial context (section 2b). Confidence intervals for each metric were estimated using bootstrap resampling. The resampling was performed on the level of samples (images), which produces larger confidence intervals than resampling on the level of pixels (i.e., having flattened all the images together). A total of 10 000 resamples were generated to bootstrap the distribution of the metric, and the 95% confidence interval was calculated by
$$ \mathrm{CI} = [\mu - 1.96\sigma,\ \mu + 1.96\sigma], $$
where μ is the sample mean and σ is the sample standard deviation.
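
The categorical statistics and the image-level bootstrap can be sketched as follows, assuming NumPy and synthetic data. The function names are hypothetical, and the default of 1000 resamples is chosen only to keep the example fast (the text uses 10 000).

```python
import numpy as np

def categorical_stats(pred, true, threshold):
    """POD, FAR, CSI, and bias for echoes exceeding a reflectivity threshold."""
    p, t = pred > threshold, true > threshold
    hits = np.sum(p & t)
    false_alarms = np.sum(p & ~t)
    misses = np.sum(~p & t)
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "CSI": hits / (hits + false_alarms + misses),
        "bias": (hits + false_alarms) / (hits + misses),
    }

def rmsd(pred, true):
    """Root-mean-square difference over all pixels."""
    return np.sqrt(np.mean((pred - true) ** 2))

def bootstrap_ci(pred, true, metric, n_resamples=1000, seed=0):
    """95% interval from resampling whole images (axis 0) with replacement."""
    rng = np.random.default_rng(seed)
    n_images = pred.shape[0]
    values = np.empty(n_resamples)
    for k in range(n_resamples):
        idx = rng.integers(0, n_images, size=n_images)
        values[k] = metric(pred[idx], true[idx])
    mu, sigma = values.mean(), values.std(ddof=1)
    return mu - 1.96 * sigma, mu + 1.96 * sigma

# Synthetic "images": 20 samples of 16 x 16 pixels with noisy predictions.
rng = np.random.default_rng(1)
true = rng.uniform(0.0, 60.0, size=(20, 16, 16))
pred = true + rng.normal(0.0, 5.0, size=true.shape)

print(categorical_stats(pred, true, threshold=35.0))
ci_low, ci_high = bootstrap_ci(pred, true, rmsd)
print(ci_low, ci_high)
```

Resampling along the image axis keeps all pixels of an image together, which is why the resulting intervals are wider than those from pixel-level resampling.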

3. Results and discussion

This section begins by showing how the interpretable model performance is materially similar to that of the CNN (section 3a). Then, the interpretable nature of the model will be used to interpret the input features (section 3b). The input space of the interpretable model will then be used to analyze all three models to explore the meaning of spatial context (sections 3c and 3d), and address the role of multichannel information in section 3e.

a. Interpretable model performance

For the interpretable model to be useful, it must reasonably reproduce the accuracy of the CNN. To assess this performance in the context of nonlinearity, results are provided for interpretable models that use the DENSE and LINEAR regressions described in section 2d. Figure 4 provides a comparison of the models using several performance metrics, and it shows that, while the CNN has the highest overall accuracy, the interpretable models are not far behind and thus the interpretable models are mostly capturing CNN accuracy. The confidence interval of the R² (Fig. 4a) of the LINEAR model overlaps the DENSE model, and the DENSE model overlaps the CNN. The RMSD (Fig. 4b) of the interpretable models is higher than the CNN. The CSI (Fig. 4c) of all three models matches well across reflectivity thresholds, except in the highest two bins for the LINEAR model. The bias (Fig. 4d) indicates that the interpretable models do exhibit somewhat more overprediction of echo areas at lower REFC. At higher REFC, the CNN and DENSE models have nearly identical performance, while the LINEAR model underpredicts the echo area. This disparity between DENSE and LINEAR models at higher REFC may suggest the importance of nonlinearity in reproducing the strongest echoes. Achieving that balance in a dataset where the frequency of occurrence falls off exponentially is difficult, and it is possible that additional tuning of the interpretable models could bring the bias in line with the CNN; however, for the purposes of this analysis, these results are reasonably similar to the CNN. These results support the idea in Rudin (2019) that it is a myth there is always a necessary trade-off between accuracy and interpretability and that complicated models are required for top performance.

Fig. 4.

Performance statistics for the three models (CNN in blue, DENSE in red, and LINEAR in yellow), calculated from the testing dataset: (a) R², (b) RMSD, (c) CSI, and (d) bias.

Citation: Artificial Intelligence for the Earth Systems 2, 3; 10.1175/AIES-D-22-0093.1

The interpretable models perform reasonably in a statistical sense, but it is essential to consider the spatial variability of model predictions to get a sense for how well the models perform where it matters the most. Figure 5 shows an example test case for the models. The tendency for overprediction of echo area is evident in the interpretable models, although in this case it provides significantly better POD than the CNN does, and only slightly worse FAR. This yields better representation of the stratiform area east of Akron, Ohio, in the interpretable models. Note that the interpretable models have more detail in the spatial variability than the CNN, although changes to the CNN architecture could likely improve that, such as adding skip connections or adding additional convolutional layers after the last upsampling layer. But it is encouraging that the interpretable models are following the patterns in the GOES input data (not shown). Note that while all the ML models get the basic distribution of weak and strong echoes correct, none of the ML predictions give the same fine-scale details seen in MRMS, and thus meteorologists should be cautious about overinterpreting the meaning of a particular detail in the ML predictions.

Fig. 5.

REFC for the (a) MRMS (truth), (b) GREMLIN CNN, (c) DENSE interpretable model, and (d) LINEAR interpretable model. The location of Akron is indicated with an “A” label.


b. Feature interpretation

The motivating question of this research is what exactly is the spatial context used by CNNs to make accurate predictions of REFC? Having brought the kernel convolution and multiresolution representation from inside the CNN to the input space, this question can finally be answered. Figure 6 provides an overall summary of how the mean output varies as a function of each of the individual inputs. This figure, and subsequent figures in section 3, were computed using the same methodology. The figures show the mean output, which comes from MRMS data and predictions from the CNN, DENSE, and LINEAR models. The input value comes from the interpretable model data preprocessing (sections 2a–2c), and thus can be applied to analyze the MRMS data and CNN predictions as well. In other words, by exposing an approximation of the input space, it can be applied to analyze all models by computing bin-average statistics. Each panel in Fig. 6 is a different combination of GOES input and kernel, and the figure shows results for level 0 of the image pyramid (results for the other levels are discussed in section 3d). Keep in mind that the ABI TBs were inversely scaled, so zero is warm and one is cold. The bin averages were constructed from the training dataset to characterize how well the models fit the training data and used a bin width of 0.02.
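The bin-average statistic described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `bin_average`, the default input range of 0 to 1 (gradient-kernel inputs would need a range that includes negative values), and the synthetic data are all assumptions. The bin width of 0.02 follows the text, and the masking of bins with fewer than 10 points follows the Fig. 6 caption.

```python
import numpy as np

def bin_average(x, y, bin_width=0.02, x_range=(0.0, 1.0), min_count=10):
    """Mean of y in equal-width bins of x; bins with too few samples are NaN."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    lo, hi = x_range
    n_bins = int(round((hi - lo) / bin_width))
    edges = np.linspace(lo, hi, n_bins + 1)
    # Bin index per sample; values on or beyond the range limits are
    # folded into the first or last bin by the clip
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    count = np.bincount(idx, minlength=n_bins)
    total = np.bincount(idx, weights=y, minlength=n_bins)
    mean = np.full(n_bins, np.nan)
    enough = count >= min_count
    mean[enough] = total[enough] / count[enough]
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, mean, count

# Synthetic stand-in data (not GREMLIN data): output rises with the scaled input
rng = np.random.default_rng(0)
x_demo = rng.random(5000)
y_demo = 60.0 * x_demo + rng.normal(0.0, 5.0, x_demo.size)
centers, mean_refc, count = bin_average(x_demo, y_demo)
```

Calling the same function with `y` taken in turn from the MRMS data and from each model's predictions, while holding `x` fixed to one engineered input, would give one curve per model in a single panel.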

Fig. 6.

The mean output vs each input for level 0 of the image pyramid. Shown are the MRMS data (black), CNN (blue), DENSE (red), and LINEAR (yellow). Each row is a different input channel, and each column is a different image kernel. A bin width of 0.02 is used, and bins with fewer than 10 points are masked.


The identity kernel (Figs. 6a,e,i,m) shows the first two of GREMLIN’s strategies that were also identified by LRP: pixels with colder TBs tend to have stronger echoes, and lightning is associated with stronger echoes. There are small vertical offsets of the models relative to the data, which is the result of weighting the stronger echoes more using Eq. (7). The LINEAR model captures well the shape of the lightning nonlinearity (Fig. 6m), which is because of the gamma correction described in section 2c, without which, the curve would bend downward for scaled GED above 0.5 and would lead to underprediction of the strongest echoes. In fact, it was Fig. 6m that revealed the need for the gamma correction.

Trying to use LRP to diagnose such an issue would be nearly impossible, and thus the interpretable model is already providing much more information that can be used to understand its performance and improve predictions. Comparing the LINEAR and DENSE models with the CNN provides insights into the important role of nonlinearity in model performance, a factor that is completely obscured in the CNN. The performance of a CNN is tuned through the number of filters per layer and the number of layers in the model, which commingles factors related to pattern representation and nonlinearity. In the interpretable models, these factors are separated and can be tuned separately, providing more flexibility to create the best possible model with the smallest number of trainable parameters.

The gradient kernels (Fig. 6 rightmost three columns) show the third strategy that was also found by LRP: strong gradients are associated with stronger echoes. For the Laplacian kernel (Figs. 6d,h,l,p) and for the water vapor (Figs. 6f,g) and lightning channels (Figs. 6n,o), there is a minimum at zero, with increasing mean output as the gradients become stronger. The shortwave (Figs. 6b,c) and longwave (Figs. 6j,k) channels follow a similar form, but the minimum is not at zero, with an asymmetry in the model response to gradients that depends on the direction of the gradient. In other words, replacing the individual DX and DY gradient components with just the gradient magnitude would produce a less accurate model. The source of this asymmetry is likely related to the wind shear sampled in the training dataset. Since all these models are trained over CONUS warm-season convection, it would suggest that if the models were applied to a different region, with a significantly different preferred wind direction, larger errors would likely be observed because the model would not generalize to unseen wind shear regimes. That might suggest wind shear should be included as a training parameter. Section 3c will investigate gradient direction information further.

Figure 6 provides a very useful overall characterization of the different versions of the GREMLIN model, and it also demonstrates the three strategies learned by the CNN that were identified by LRP. However, Fig. 6 does not tell the whole story of what is being learned because the 1D bin averages do not show interactions among variables. Once such interactions are considered, this becomes a very high-dimensional space, and fully characterizing the model would require more figures than can fit in a journal article. However, there naturally are correlations among channels, among levels, and among kernels, which means that taking representative slices through the space can convey the important relationships. The slices are chosen based on domain knowledge, and although there are other less subjective ways to explore the high-dimensional space [e.g., t-distributed stochastic neighbor embedding (t-SNE); Van der Maaten and Hinton 2008], there are 64 × 63/2 = 2016 ways to make 2D slices through the 64-dimensional space, so the problem is overwhelming (not to mention higher-dimensional slices as well). Instead, the remainder of this section uses 2D bin averages to compare the models, examining two aspects of spatial context, directional information (section 3c) and multiresolution information (section 3d), as well as multichannel information (section 3e).
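A 2D bin average over a pair of inputs can be sketched in the same spirit. The count of 2016 pairs follows directly from the text; the breakdown of the 64 inputs into four channels, four kernels, and four pyramid levels is an inference from the descriptions in this section and should be treated as an assumption. The function name `bin_average_2d`, the number of bins, the input range, and the minimum-count mask are likewise illustrative choices rather than the author's settings.

```python
import math
import numpy as np

# Assumed breakdown of the 64 inputs: channels x kernels x pyramid levels
n_features = 4 * 4 * 4
n_pairs = math.comb(n_features, 2)  # number of distinct 2D slices

def bin_average_2d(x1, x2, y, bins=50,
                   value_range=((0.0, 1.0), (0.0, 1.0)), min_count=10):
    """Mean of y on a 2D grid of (x1, x2) bins; sparse cells are NaN."""
    count, x1_edges, x2_edges = np.histogram2d(
        x1, x2, bins=bins, range=value_range)
    total, _, _ = np.histogram2d(
        x1, x2, bins=bins, range=value_range, weights=y)
    mean = np.full(count.shape, np.nan)
    enough = count >= min_count
    mean[enough] = total[enough] / count[enough]
    return mean, x1_edges, x2_edges
```

Here `x1` and `x2` would be any two engineered inputs (for example, two kernels of one channel, two pyramid levels, or two channels), and `y` the MRMS data or a model's prediction.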

c. Directional information

It has already been shown that stronger TB gradient magnitudes are generally associated with stronger radar echoes. But an interesting question is whether the gradient directions also contain information, and whether this has a relationship with wind shear. Figure 7 provides the mean output as a function of both −DX and −DY kernel inputs together, where the minus sign puts the directions in the meteorological direction convention (0° represents wind from the north and a gradient direction pointing to the south). All of the models are learning a directional preference in the data: radar echoes are strongest when gradients have a southwest orientation. For longwave (Figs. 7a–d), which captures weaker echoes, the maximum mean output is in the west-southwest direction, while for GED (Figs. 7e–h), which captures stronger echoes, the maximum is in the southwest direction. The models show a stronger REFC response than the data, which is a consequence of using weighted loss functions to emphasize the strong echoes. A directional histogram for radar echoes exceeding 35 dBZ (not shown) has a mode at 201° (south-southwest) with the peak between the south and southwest directions. The vertical wind shear, estimated from the HRRR 250 hPa wind components, has a mode at 250° (west-southwest) with most samples between the south and northwest directions. This suggests that the directional relationships learned by these models are specific to CONUS, and if these models were applied to different wind shear regimes, such as the tropics, this directional information may not generalize. It would be difficult to get this insight from LRP, since LRP only considers one sample at a time, whereas consideration of the whole dataset is needed for this lesson to emerge. At this point it is just a hypothesis, and verifying it would require examining (using an analysis like Fig. 7) models that are trained on a set of tropical samples.
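The direction convention described above can be made concrete with a short sketch. It assumes that DX and DY are the eastward and northward components of the gradient (the actual sign and axis conventions of the kernels are defined in section 2 and may differ), and it mirrors the usual formula for meteorological wind direction. The function name `gradient_direction_deg` is hypothetical.

```python
import numpy as np

def gradient_direction_deg(dx, dy):
    """Direction in degrees, with 0 meaning a gradient that points south."""
    dx = np.asarray(dx, dtype=float)
    dy = np.asarray(dy, dtype=float)
    # Same form as the wind-direction formula atan2(-u, -v): the angle is
    # measured clockwise from north and names where the vector comes from
    return np.mod(np.degrees(np.arctan2(-dx, -dy)), 360.0)

# Assumed components: a gradient pointing south, west, north, and northeast
directions = gradient_direction_deg([0.0, -1.0, 0.0, 1.0],
                                    [-1.0, 0.0, 1.0, 1.0])
```

With this convention, a gradient vector pointing toward the northeast is assigned a southwest direction, in the same way that a wind blowing toward the northeast is called a southwesterly wind.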

Fig. 7.

The mean output (color fill) vs the DX and DY kernel inputs for the (a)–(d) longwave and (e)–(h) lightning channels and each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for level 0.


d. Multiresolution information

Figure 6 shows results for the base of the image pyramid (level 0), and in an approximate sense, the results for other levels of the pyramid are similar, but the curves get progressively squashed down toward the x axis (not shown). That is because the higher levels of the pyramid contain values that correspond to a larger portion of the image, and thus there is averaging over a greater range of output values for a given input value. However, the more interesting story emerges when considering the relationships between two different pyramid levels, which is how multiresolution information is encoded in the model.

Figure 8 presents results for the base of the pyramid (level 0) versus the top of the pyramid (level 3). Figures 8a–d show that not only are strong echoes associated with cold TBs, but they are maximized when the TBs are warmer on other pyramid levels. This implies a relationship between the strongest echoes and the distance from the cloud edge, determined by the pair of levels being considered. Even though Fig. 8 is for the identity kernel, the use of image pyramiding is capturing information in spatial variability (results for the gradient kernels are noisy and not shown). Cold TBs on level 0 imply positioning on the interior side of the cloud edge, while warmer level 3 TBs imply that the cloud edge is nearby and warmer ground pixels are contributing. There is a relative minimum along the 1–1 line when TBs are cold on both pyramid levels, reflecting the uncertainty in locating strong radar echoes when cold TBs cover large areas, unless there is texture or lightning to provide clues.
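The way a pair of pyramid levels encodes distance from the cloud edge can be illustrated with a toy example. The sketch below uses simple block averaging as a stand-in for the image pyramid; the pyramid actually used by the interpretable model is defined in section 2 and may use different smoothing and resampling. The function name `block_mean_pyramid` and the idealized 16 × 16 scene are assumptions for illustration only.

```python
import numpy as np

def block_mean_pyramid(image, n_levels=4):
    """Return levels 0..n_levels-1, each expanded to the original image size."""
    image = np.asarray(image, dtype=float)
    ny, nx = image.shape
    levels = []
    for level in range(n_levels):
        f = 2 ** level  # side length of the averaging block at this level
        if ny % f or nx % f:
            raise ValueError("image dimensions must be divisible by 2**level")
        coarse = image.reshape(ny // f, f, nx // f, f).mean(axis=(1, 3))
        # Repeat coarse values so each level aligns pixel by pixel with level 0
        levels.append(np.repeat(np.repeat(coarse, f, axis=0), f, axis=1))
    return levels

# Idealized scene of inversely scaled TB: 1 = cold cloud, 0 = warm ground
scaled_tb = np.zeros((16, 16))
scaled_tb[:, :12] = 1.0  # cloud occupies the left 12 columns

levels = block_mean_pyramid(scaled_tb)
interior = (levels[0][8, 2], levels[3][8, 2])     # deep inside the cloud
near_edge = (levels[0][8, 10], levels[3][8, 10])  # inside, close to the edge
```

In this toy scene both pixels are equally cold on level 0, but the pixel near the edge has a warmer (smaller) level 3 value because warm ground falls inside its averaging block, which is the kind of contrast between levels that the text describes.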

Fig. 8.

The mean output (color fill) vs level-0 and level-3 input values for the (a)–(d) longwave band and (e)–(h) lightning for each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for the identity kernel.


Lightning has a somewhat similar pattern (Figs. 8e–h), but not as strongly bifurcated along the 1–1 line. The weakest echoes are associated with little to no lightning on level 0, but as the level 3 lightning increases, it indicates nearby lightning, making moderate echoes more likely. Thus, the use of multiple image pyramid levels as inputs to these models provides the capability to construct multiresolution representations of the phenomena. In other words, it provides the model with the ability to locate strong echoes some distance inward from cloud edge. Pixelwise retrieval methods simply cannot make use of this multiscale information, but given how strong the mean response is in longwave, exceeding 50 dBZ in certain situations, this information is clearly important for extracting full value from GOES-R Series observations.

e. Multichannel information

The interpretable approach also provides new insights regarding the information content coming from multiple channels together. Figure 9 provides results for key channel combinations of interest and uses physical units to simplify the interpretation. While cold longwave TBs are associated with stronger echoes, that relationship is stronger for certain ranges of the shortwave band, which has a bimodal distribution with maxima near 210 and 250 K that is due to night versus day. The shortwave band has a solar reflected component during daytime, which augments the thermal emission component, leading to warmer temperatures during daytime. Strikingly, the data and models have a feature with REFC > 40 dBZ occurring for cold longwave (<210 K) but also very warm shortwave (>275 K). Examining samples with those features reveals that the value in shortwave comes from the fact that it can see through thin cirrus better than longwave. When the warm shortwave and cold longwave signature appears, it is because shortwave can see breaks between storms that have warmer surface contributions, while longwave only sees one large area of cold cloud. Examining the visible band for daytime cases confirms that shortwave is seeing cloud edges and breaks between clouds that are obscured by very thin cirrus in longwave.

Fig. 9.

Mean output (color fill) vs channel combinations (a)–(d) shortwave and longwave, (e)–(h) water vapor and longwave, and (i)–(l) longwave and lightning for each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for level 0 and the identity kernel.


Figure 9 also shows a familiar relationship where the strongest echoes are associated with small differences between the water vapor and longwave bands (Schmetz et al. 1997). This channel difference has been found to be valuable for distinguishing shallow from deep precipitation (Kurino 1997) and overshooting tops (Bedka et al. 2012). Usually, longwave is warmer than water vapor because longwave has greater sensitivity to the lower troposphere; however, once clouds grow deep enough and thick enough, the difference goes to zero. The strongest echoes occur when water vapor is warmer than longwave, which is indicative of water vapor in the stratosphere (i.e., overshooting tops).
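The channel difference described above amounts to a simple subtraction in physical units. The sketch below uses hypothetical brightness temperatures chosen only to illustrate the three situations in the text; no detection thresholds are implied, and the function name `wv_minus_lw` is an assumption.

```python
import numpy as np

def wv_minus_lw(tb_wv, tb_lw):
    """Water vapor minus longwave brightness temperature difference (K)."""
    return np.asarray(tb_wv, dtype=float) - np.asarray(tb_lw, dtype=float)

# Hypothetical TBs (K): mostly clear, deep thick cloud, water vapor warmer
tb_wv = np.array([235.0, 215.0, 208.0])
tb_lw = np.array([285.0, 215.0, 205.0])

btd = wv_minus_lw(tb_wv, tb_lw)
wv_warmer = btd > 0.0  # the situation the text associates with overshooting tops
```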

Figure 9 shows that, while strong echoes are associated with cold TBs and lightning individually, when considered together they provide additional information that can reduce false alarms. For example, when TBs are very cold but there is little lightning, as in thin cirrus, strong echoes are relatively less likely. Similarly, when lightning is strong, but TBs are warm, strong echoes are also relatively less likely. Physically, this situation can occur when the light from lightning reflects off nearby low clouds (Wolf 2018).

4. Summary and conclusions

This paper used Interpretable AI to take a deeper look at the GREMLIN CNN that transforms GOES-R Series radiances and lightning into synthetic radar reflectivity fields. A knowledge distillation approach was taken and found the original 47 000-parameter CNN could be replaced by feature engineering plus a linear regression with as few as 2000 parameters without a substantial loss of accuracy. The CNN used 32 learned filters with seven convolutional layers, resulting in a total of 5280 different kernels across the CNN. However, the DENSE and LINEAR models were able to reasonably reproduce the accuracy of the CNN with just four prescribed kernels. This provides evidence that in the GREMLIN application, a huge filter bank is not necessary for explaining the CNN’s accuracy.
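The kernel count quoted above can be reproduced with simple arithmetic under one plausible layer layout. The layout is an inference rather than something stated in this section: it assumes the four input channels feed a first layer of 32 filters, five further layers keep 32 filters, and a final layer has a single output filter, with one 2D kernel per pair of input channel and output filter. The function name `count_kernels` is hypothetical.

```python
def count_kernels(channels):
    """Count 2D kernels: one per (input channel, output filter) pair per layer."""
    return sum(c_in * c_out for c_in, c_out in zip(channels[:-1], channels[1:]))

# Assumed channel sequence: 4 inputs, six layers of 32 filters, 1 output filter
channels = [4, 32, 32, 32, 32, 32, 32, 1]
n_layers = len(channels) - 1         # seven convolutional layers
n_kernels = count_kernels(channels)  # 4*32 + 5*32*32 + 32*1
```

Under this assumed layout the sum is 5280, which matches the number of kernels given in the text and contrasts with the four prescribed kernels of the interpretable models.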

The interpretable approach took the image pyramiding and kernel convolutions happening inside the CNN and brought them out to the preprocessing stage through feature engineering. This allows examining the sensitivity of the output to the individual inputs. This showed that the spatial context utilized by the CNN to make predictions is a combination of 1) spatial patterns identified by the four kernels and 2) a multiresolution representation provided by an image pyramid. Both factors are important for the additional accuracy provided by the CNN relative to a purely pixelwise approach. The interpretable approach has the benefit of disentangling the role of nonlinearity in CNN accuracy. In the CNN, the ability to resolve spatial context and represent nonlinearity are inherently mixed together when specifying CNN model architecture parameters. The nonlinearity was found to be an important factor, especially for the accuracy in strong echoes.

Explainable AI (specifically LRP) had helped to identify three strategies of the GREMLIN CNN: 1) presence of lightning, 2) cold brightness temperatures, and 3) strong brightness temperature gradients. The Interpretable AI model developed herein was able to identify an additional five strategies. 4) Gradient direction: stronger echoes are more likely when gradients have a southwestern orientation. This is also the predominant wind shear direction in the training data, which raises questions about how well such directional information will generalize to different regimes. 5) Multiscale information: while stronger gradients are associated with stronger echoes, the stronger echoes are more likely when the gradients on other scales are weaker. 6) Shortwave-longwave multichannel information: using warm shortwave TBs to find cloud edges where thin cirrus obscures them in longwave. 7) Longwave-water vapor multichannel information: deep convection with stronger echoes is associated with small differences between these channels. 8) Longwave-lightning multichannel information: cold TBs are more likely to have strong echoes when also associated with high flash rates. There was some evidence for strategy 8 from LRP: when lightning was set to zero, the CNN changed how it interpreted the TBs. However, it was the interpretable model that clarified how. Since LRP provides heatmaps for each channel individually, it was not well suited for identifying the multichannel strategies that were obvious using the interpretable model.

A potential concern with this approach is the Rashomon effect (Rudin et al. 2022), in which a set of different models with similar performance may exist. That means one cannot just assume that what is learned by the simpler model is necessarily valid for the complex model even if they have similar performance. However, the analysis presented in Figs. 5–9 shows that the CNN and simpler models are learning similar behavior, which is also seen in the data. The simpler models do not capture all aspects of the CNN, in particular the sequential application of convolutional filters and the use of a spatially aware decoder. So, it is not appropriate to conclude that the simpler models can capture every aspect of the CNN model.

It is important to point out that GREMLIN was trained over CONUS and this paper only evaluates over CONUS. Thus, it is premature to draw conclusions regarding regime dependence of GREMLIN performance. However, the linear model approach used herein could naturally be extended to generalized additive models (GAMs) for a more complex treatment of nonlinearity (Rudin et al. 2022). By replacing the fixed coefficients in Eq. (4) with coefficients that vary by some regime characterizing variable (Zhou and Hooker 2022), it might be possible to extend the interpretable model to different regimes. This seems like a good fit for meteorological regime dependence where the underlying model does not change, but where thresholds may vary with regime. An alternative approach for incorporating regime dependence is a mixture of experts (Hinton et al. 2015) where different specialist models are trained on different subsets of the data (i.e., different regimes). That approach would make more sense where the underlying structure of the model does change significantly with regime.

In closing, this work is not intended to be an end state for interpretability, but a beginning. Interpretable GREMLIN helped identify several new strategies that were not found using XAI, and based on physics, some strategies confer a greater degree of confidence than others. Extending this research, it ought to be possible to combine Interpretable GREMLIN with an explanation-producing system so that users could decide for themselves whether they choose to believe a particular prediction. However, should that be expressed in terms of a confidence flag, as is common in satellite meteorology, or in terms of discrete scenarios (i.e., storylines in the sense of Shepherd 2019) depending on the strategy employed? A further question is how that would work for a model such as GREMLIN, which has images as outputs, when “screen real estate” is so limited in weather forecast offices. One possible approach is the ProbSevere approach (Cintineo et al. 2018), where mousing over a particular echo object provides metadata about what informs the prediction. However, this highlights the larger issue that interpretability is not just about physical science, but about social science, and input from users is required to determine what constitutes a good explanation for a particular application (Miller 2019). In other words, developing AI is the starting point, and it is the human–AI interaction that matters for decision-making.

1

For example, in Python, scipy.linalg.lstsq.

2

For example, in Python, sklearn.linear_model.SGDRegressor.partial_fit.

3

In Python, sklearn.utils.resample.

Acknowledgments.

Support of the GOES-R Program under Grant NA19OAR4320073, and of the NOAA RDHPCS for access to the Fine Grain Architecture System on Hera, are gratefully acknowledged. Helpful comments from Yoonjin Lee, Imme Ebert-Uphoff, and Steve Miller on an earlier draft are acknowledged. The inspiration for this paper originated from author Hilburn’s final project in the Introduction to Causal Discovery course taught by Peter Jan van Leeuwen at Colorado State University (CSU).

Data availability statement.

The dataset used for this research is available through the CSU Mountain Scholar Repository (https://doi.org/10.25675/10217/235392). The code used for this research is available online [https://zenodo.org/badge/latestdoi/628414580 (https://doi.org/10.5281/zenodo.7832223)].

REFERENCES

  • Adelson, E. H., C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, 1984: Pyramid methods in image processing. RCA Eng., 29, 33–41.

  • Arkin, P. A., and B. N. Meisner, 1987: The relationship between large-scale convective rainfall and cold cloud over the Western Hemisphere during 1982–84. Mon. Wea. Rev., 115, 51–74, https://doi.org/10.1175/1520-0493(1987)115<0051:TRBLSC>2.0.CO;2.

  • Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, 2015: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10, e0130140, https://doi.org/10.1371/journal.pone.0130140.

  • Bedka, K. M., R. Dworak, J. Brunner, and W. Feltz, 2012: Validation of satellite-based objective overshooting cloud-top detection methods using CloudSat cloud profiling radar observations. J. Appl. Meteor. Climatol., 51, 1811–1822, https://doi.org/10.1175/JAMC-D-11-0131.1.

  • Burt, P., and E. Adelson, 1983: The Laplacian pyramid as a compact image code. IEEE Trans. Commun., 31, 532–540, https://doi.org/10.1109/TCOM.1983.1095851.

  • Cintineo, J. L., and Coauthors, 2018: The NOAA/CIMSS ProbSevere Model: Incorporation of total lightning and validation. Wea. Forecasting, 33, 331–345, https://doi.org/10.1175/WAF-D-17-0099.1.

  • Doshi-Velez, F., and B. Kim, 2017: Towards a rigorous science of interpretable machine learning. arXiv, 1702.08608v2, https://doi.org/10.48550/arXiv.1702.08608.

  • Došilović, F. K., M. Brčić, and N. Hlupić, 2018: Explainable artificial intelligence: A survey. 41st Int. Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, IEEE, 210–215, https://doi.org/10.23919/MIPRO.2018.8400040.

  • Du, M., N. Liu, and X. Hu, 2020: Techniques for interpretable machine learning. Commun. ACM, 63, 68–77, https://doi.org/10.1145/3359786.

  • Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning and interpretation of neural networks for meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149–E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.

  • Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022: Comparing explanation methods for traditional machine learning models. Part 1: An overview of current methods and quantifying their disagreement. arXiv, 2211.08943v1, https://doi.org/10.48550/arXiv.2211.08943.

  • Gilpin, L. H., D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, 2018: Explaining explanations: An overview of interpretability of machine learning. IEEE Fifth Int. Conf. on Data Science and Advanced Analytics, Turin, Italy, IEEE, 80–89, https://doi.org/10.1109/DSAA.2018.00018.

  • Gonzalez, R. C., and R. E. Woods, 2002: Digital Image Processing. 2nd ed. Prentice-Hall, 793 pp.

  • Goodman, S. J., and Coauthors, 2013: The GOES-R Geostationary Lightning Mapper (GLM). Atmos. Res., 125–126, 34–49, https://doi.org/10.1016/j.atmosres.2013.01.006.

  • Guilloteau, C., and E. Foufoula-Georgiou, 2020: Beyond the pixel: Using patterns and multiscale spatial information to improve the retrieval of precipitation from spaceborne passive microwave imagers. J. Atmos. Oceanic Technol., 37, 1571–1591, https://doi.org/10.1175/JTECH-D-19-0067.1.

  • Haynes, K., C. Slocum, J. Knaff, K. Musgrave, and I. Ebert-Uphoff, 2022: Aiding tropical cyclone forecasting by simulating 89-GHz imagery from operational geostationary satellites. 35th Conf. on Hurricanes and Tropical Meteorology, New Orleans, LA, Amer. Meteor. Soc., 8A.2, https://ams.confex.com/ams/35Hurricanes/meetingapp.cgi/Paper/401833.

  • Hilburn, K. A., 2022: GREMLIN CONUS2 dataset. Colorado State University, accessed 23 June 2022, https://doi.org/10.25675/10217/235392.

  • Hilburn, K. A., I. Ebert-Uphoff, and S. D. Miller, 2021: Development and interpretation of a neural network-based synthetic radar reflectivity estimator using GOES-R satellite observations. J. Appl. Meteor. Climatol., 60, 3–21, https://doi.org/10.1175/JAMC-D-20-0084.1.

  • Hinton, G., O. Vinyals, and J. Dean, 2015: Distilling the knowledge in a neural network. arXiv, 1503.00253v1, https://doi.org/10.48550/arXiv.1503.00253.

  • Kurino, T., 1997: A satellite infrared technique for estimating “deep/shallow” precipitation. Adv. Space Res., 19, 511–514, https://doi.org/10.1016/S0273-1177(97)00063-X.

  • Lapuschkin, S., S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, 2019: Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096, https://doi.org/10.1038/s41467-019-08987-4.

  • Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, ACM, 4768–4777, https://dl.acm.org/doi/10.5555/3295222.3295230.

  • Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2022: Neural network attribution methods for problems in geoscience: A novel synthetic benchmark dataset. Environ. Data Sci., 1, E8, https://doi.org/10.1017/eds.2022.7.

  • McGovern, A., I. Ebert-Uphoff, D. J. Gagne II, and A. Bostrom, 2022: Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environ. Data Sci., 1, e6, https://doi.org/10.1017/eds.2022.5.

  • Meng, F., T. Song, and D. Xu, 2022: Simulating tropical cyclone passive microwave rainfall imagery using infrared imagery via generative adversarial networks. IEEE Geosci. Remote Sens. Lett., 19, 1005105, https://doi.org/10.1109/LGRS.2022.3152847.

  • Miller, S. D., D. T. Lindsey, C. J. Seaman, and J. E. Solbrig, 2020: GeoColor: A blending technique for satellite imagery. J. Atmos. Oceanic Technol., 37, 429–448, https://doi.org/10.1175/JTECH-D-19-0134.1.

  • Miller, T., 2019: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell., 267, 1–38, https://doi.org/10.1016/j.artint.2018.07.007.

  • Molnar, C., and Coauthors, 2022: General pitfalls of model-agnostic interpretation methods for machine learning models. xxAI—Beyond Explainable AI, A. Holzinger et al., Eds., Springer, 39–68, https://doi.org/10.1007/978-3-031-04083-2_4.

  • Montavon, G., W. Samek, and K.-R. Müller, 2018: Methods for interpreting and understanding deep neural networks. Digital Signal Process., 73, 1–15, https://doi.org/10.1016/j.dsp.2017.10.011.

  • Murdoch, W. J., C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, 2019: Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. USA, 116, 22 071–22 080, https://doi.org/10.1073/pnas.1900654116.

  • Olah, C., A. Mordvintsev, and L. Schubert, 2017: Feature visualization. Distill, 2, e7, https://doi.org/10.23915/distill.00007.

  • Papoulis, A., and S. U. Pillai, 2002: Probability, Random Variables, and Stochastic Processes. 4th ed. McGraw Hill, 852 pp.

  • Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical recipes in Fortran 77: The Art of Scientific Computing. 2nd ed. Cambridge University Press, 1003 pp.

  • Ribeiro, M. T., S. Singh, and C. Guestrin, 2016: “Why should I trust you?” Explaining the predictions of any classifier. arXiv, 1602.04938v3, https://doi.org/10.48550/arXiv.1602.04938.

  • Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Springer, 234–241.

  • Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206215, https://doi.org/10.1038/s42256-019-0048-x.

    • Search Google Scholar
    • Export Citation
  • Rudin, C., C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong, 2022: Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv., 16, 1–85, https://doi.org/10.1214/21-SS133.

  • Schmetz, J., S. A. Tjemkes, M. Gube, and L. van de Berg, 1997: Monitoring deep convection and convective overshooting with Meteosat. Adv. Space Res., 19, 433–441, https://doi.org/10.1016/S0273-1177(97)00051-3.

  • Schmit, T. J., P. Griffith, M. M. Gunshor, J. M. Daniels, S. J. Goodman, and W. J. Lebair, 2017: A closer look at the ABI on the GOES-R series. Bull. Amer. Meteor. Soc., 98, 681–698, https://doi.org/10.1175/BAMS-D-15-00230.1.

  • Shepherd, T. G., 2019: Storyline approach to the construction of regional climate change information. Proc. Roy. Soc., 475A, 20190013, https://doi.org/10.1098/rspa.2019.0013.

  • Smith, T. M., and Coauthors, 2016: Multi-Radar Multi-Sensor (MRMS) severe weather and aviation products: Initial operating capabilities. Bull. Amer. Meteor. Soc., 97, 1617–1630, https://doi.org/10.1175/BAMS-D-14-00173.1.

  • Van der Maaten, L., and G. Hinton, 2008: Visualizing data using t-SNE. J. Mach. Learn. Res., 9, 2579–2605.

  • Veillette, M. S., E. P. Hassey, C. J. Mattioli, H. Iskenderian, and P. M. Lamey, 2018: Creating synthetic radar imagery using convolutional neural networks. J. Atmos. Oceanic Technol., 35, 2323–2338, https://doi.org/10.1175/JTECH-D-18-0010.1.

  • Wolf, P., 2018: Utilizing radar and satellite to provide meaningful lightning initiation and cessation information for effective decision-making. FDTD Satellite Applications Webinars, https://rammb2.cira.colostate.edu/training/visit/satellite_chat/.

  • Zhou, Y., and G. Hooker, 2022: Decision tree boosted varying coefficient models. Data Min. Knowl. Discovery, 36, 2237–2271, https://doi.org/10.1007/s10618-022-00863-y.

Save
  • Adelson, E. H., C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, 1984: Pyramid methods in image processing. RCA Eng., 29, 3341.

    • Search Google Scholar
    • Export Citation
  • Arkin, P. A., and B. N. Meisner, 1987: The relationship between large-scale convective rainfall and cold cloud over the Western Hemisphere during 1982–84. Mon. Wea. Rev., 115, 5174, https://doi.org/10.1175/1520-0493(1987)115<0051:TRBLSC>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, 2015: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10, e0130140, https://doi.org/10.1371/journal.pone.0130140.

    • Search Google Scholar
    • Export Citation
  • Bedka, K. M., R. Dworak, J. Brunner, and W. Feltz, 2012: Validation of satellite-based objective overshooting cloud-top detection methods using CloudSat cloud profiling radar observations. J. Appl. Meteor. Climatol., 51, 18111822, https://doi.org/10.1175/JAMC-D-11-0131.1.

    • Search Google Scholar
    • Export Citation
  • Burt, P., and E. Adelson, 1983: The Laplacian pyramid as a compact image code. IEEE Trans. Commun., 31, 532540, https://doi.org/10.1109/TCOM.1983.1095851.

    • Search Google Scholar
    • Export Citation
  • Cintineo, J. L., and Coauthors, 2018: The NOAA/CIMSS ProbSevere Model: Incorporation of total lightning and validation. Wea. Forecasting, 33, 331345, https://doi.org/10.1175/WAF-D-17-0099.1.

    • Search Google Scholar
    • Export Citation
  • Doshi-Velez, F., and B. Kim, 2017: Towards a rigorous science of interpretable machine learning. arXiv, 1702.08608v2, https://doi.org/10.48550/arXiv.1702.08608.

  • Došilović, F. K., M. Brčić, and N. Hlupić, 2018: Explainable artificial intelligence: A survey. 41st Int. Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, IEEE, 210–215, https://doi.org/10.23919/MIPRO.2018.8400040.

  • Du, M., N. Liu, and X. Hu, 2020: Techniques for interpretable machine learning. Commun. ACM, 63, 6877, https://doi.org/10.1145/3359786.

    • Search Google Scholar
    • Export Citation
  • Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning and interpretation of neural networks for meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.

    • Search Google Scholar
    • Export Citation
  • Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022: Comparing explanation methods for traditional machine learning models. Part 1: An overview of current methods and quantifying their disagreement. arXiv, 2211.08943v1, https://doi.org/10.48550/arXiv.2211.08943.

  • Gilpin, L. H., D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, 2018: Explaining explanations: An overview of interpretability of machine learning. IEEE Fifth Int. Conf. on Data Science and Advanced Analytics, Turin, Italy, IEEE, 80–89, https://doi.org/10.1109/DSAA.2018.00018.

  • Gonzalez, R. C., and R. E. Woods, 2002: Digital Image Processing. 2nd ed. Prentice-Hall, 793 pp.

  • Goodman, S. J., and Coauthors, 2013: The GOES-R Geostationary Lightning Mapper (GLM). Atmos. Res., 125–126, 3449, https://doi.org/10.1016/j.atmosres.2013.01.006.

    • Search Google Scholar
    • Export Citation
  • Guilloteau, C., and E. Foufoula-Georgiou, 2020: Beyond the pixel: Using patterns and multiscale spatial information to improve the retrieval of precipitation from spaceborne passive microwave imagers. J. Atmos. Oceanic Technol., 37, 15711591, https://doi.org/10.1175/JTECH-D-19-0067.1.

    • Search Google Scholar
    • Export Citation
  • Haynes, K., C. Slocum, J. Knaff, K. Musgrave, and I. Ebert-Uphoff, 2022: Aiding tropical cyclone forecasting by simulating 89-GHz imagery from operational geostationary satellites. 35th Conf. on Hurricanes and Tropical Meteorology, New Orleans, LA, Amer. Meteor. Soc., 8A.2, https://ams.confex.com/ams/35Hurricanes/meetingapp.cgi/Paper/401833.

  • Hilburn, K. A., 2022: GREMLIN CONUS2 dataset. Colorado State University, accessed 23 June 2022, https://doi.org/10.25675/10217/235392.

  • Hilburn, K. A., I. Ebert-Uphoff, and S. D. Miller, 2021: Development and interpretation of a neural network-based synthetic radar reflectivity estimator using GOES-R satellite observations. J. Appl. Meteor. Climatol., 60, 321, https://doi.org/10.1175/JAMC-D-20-0084.1.

    • Search Google Scholar
    • Export Citation
  • Hinton, G., O. Vinyals, and J. Dean, 2015: Distilling the knowledge in a neural network. arXiv, 1503.00253v1, https://doi.org/10.48550/arXiv.1503.00253.

  • Kurino, T., 1997: A satellite infrared technique for estimating “deep/shallow” precipitation. Adv. Space Res., 19, 511514, https://doi.org/10.1016/S0273-1177(97)00063-X.

    • Search Google Scholar
    • Export Citation
  • Lapuschkin, S., S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, 2019: Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096, https://doi.org/10.1038/s41467-019-08987-4.

    • Search Google Scholar
    • Export Citation
  • Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, ACM, 4768–4777, https://dl.acm.org/doi/10.5555/3295222.3295230.

  • Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2022: Neural network attribution methods for problems in geoscience: A novel synthetic benchmark dataset. Environ. Data Sci., 1, E8, https://doi.org/10.1017/eds.2022.7.

    • Search Google Scholar
    • Export Citation
  • McGovern, A., I. Ebert-Uphoff, D. J. Gagne II, and A. Bostrom, 2022: Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environ. Data Sci., 1, e6, https://doi.org/10.1017/eds.2022.5.

    • Search Google Scholar
    • Export Citation
  • Meng, F., T. Song, and D. Xu, 2022: Simulating tropical cyclone passive microwave rainfall imagery using infrared imagery via generative adversarial networks. IEEE Geosci. Remote Sens. Lett., 19, 1005105, https://doi.org/10.1109/LGRS.2022.3152847.

    • Search Google Scholar
    • Export Citation
  • Miller, S. D., D. T. Lindsey, C. J. Seaman, and J. E. Solbrig, 2020: GeoColor: A blending technique for satellite imagery. J. Atmos. Oceanic Technol., 37, 429448, https://doi.org/10.1175/JTECH-D-19-0134.1.

    • Search Google Scholar
    • Export Citation
  • Miller, T., 2019: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell., 267, 138, https://doi.org/10.1016/j.artint.2018.07.007.

    • Search Google Scholar
    • Export Citation
  • Molnar, C., and Coauthors, 2022: General pitfalls of model-agnostic interpretation methods for machine learning models. xxAI—Beyond Explainable AI, A. Holzinger et al., Eds., Springer, 39–68, https://doi.org/10.1007/978-3-031-04083-2_4.

  • Montavon, G., W. Samek, and K.-R. Müller, 2018: Methods for interpreting and understanding deep neural networks. Digital Signal Process., 73, 115, https://doi.org/10.1016/j.dsp.2017.10.011.

    • Search Google Scholar
    • Export Citation
  • Murdoch, W. J., C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, 2019: Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. USA, 116, 22 07122 080, https://doi.org/10.1073/pnas.1900654116.

    • Search Google Scholar
    • Export Citation
  • Olah, C., A. Mordvintsev, and L. Schubert, 2017: Feature visualization. Distill, 2, e7, https://doi.org/10.23915/distill.00007.

  • Papoulis, A., and S. U. Pillai, 2002: Probability, Random Variables, and Stochastic Processes. 4th ed. McGraw Hill, 852 pp.

  • Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical recipes in Fortran 77: The Art of Scientific Computing. 2nd ed. Cambridge University Press, 1003 pp.

  • Ribeiro, M. T., S. Singh, and C. Guestrin, 2016: “Why should I trust you?” Explaining the predictions of any classifier. arXiv, 1602.04938v3, https://doi.org/10.48550/arXiv.1602.04938.

  • Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Springer, 234–241.

  • Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206215, https://doi.org/10.1038/s42256-019-0048-x.

    • Search Google Scholar
    • Export Citation
  • Rudin, C., C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong, 2022: Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv., 16, 185, https://doi.org/10.1214/21-SS133.

    • Search Google Scholar
    • Export Citation
  • Schmetz, J., S. A. Tjemkes, M. Gube, and L. van de Berg, 1997: Monitoring deep convection and convective overshooting with Meteosat. Adv. Space Res., 19, 433441, https://doi.org/10.1016/S0273-1177(97)00051-3.

    • Search Google Scholar
    • Export Citation
  • Schmit, T. J., P. Griffith, M. M. Gunshor, J. M. Daniels, S. J. Goodman, and W. J. Lebair, 2017: A closer look at the ABI on the GOES-R series. Bull. Amer. Meteor. Soc., 98, 681698, https://doi.org/10.1175/BAMS-D-15-00230.1.

    • Search Google Scholar
    • Export Citation
  • Shepherd, T. G., 2019: Storyline approach to the construction of regional climate change information. Proc. Roy. Soc., 475A, 20190013, https://doi.org/10.1098/rspa.2019.0013.

    • Search Google Scholar
    • Export Citation
  • Smith, T. M., and Coauthors, 2016: Multi-Radar Multi-Sensor (MRMS) severe weather and aviation products: Initial operating capabilities. Bull. Amer. Meteor. Soc., 97, 16171630, https://doi.org/10.1175/BAMS-D-14-00173.1.

    • Search Google Scholar
    • Export Citation
  • Van der Maaten, L., and G. Hinton, 2008: Visualizing data using t-SNE. J. Mach. Learn. Res., 9, 25792605.

  • Veillette, M. S., E. P. Hassey, C. J. Mattioli, H. Iskenderian, and P. M. Lamey, 2018: Creating synthetic radar imagery using convolutional neural networks. J. Atmos. Oceanic Technol., 35, 23232338, https://doi.org/10.1175/JTECH-D-18-0010.1.

    • Search Google Scholar
    • Export Citation
  • Wolf, P., 2018: Utilizing radar and satellite to provide meaningful lightning initiation and cessation information for effective decision-making. FDTD Satellite Applications Webinars, https://rammb2.cira.colostate.edu/training/visit/satellite_chat/.

  • Zhou, Y., and G. Hooker, 2022: Decision tree boosted varying coefficient models. Data Min. Knowl. Discovery, 36, 22372271, https://doi.org/10.1007/s10618-022-00863-y.

    • Search Google Scholar
    • Export Citation
  • Fig. 1.

    Schematic comparing the original convolutional neural network approach (left branch) with the interpretable framework (right branch).

  • Fig. 2.

    (a) GREMLIN model architecture, with number of parameters given under the blue arrows and image sizes shown in green boxes. In (a), the convolutional kernels are learned. (b) Image pyramid corresponding to GREMLIN, where level 0 is the original resolution input image and three levels of pooling are applied. (c) The architecture of the interpretable model. In (c), the convolutional kernels are prescribed. Dimensions are number of samples Ns, number of channels Nc, number of kernels Nk, number of pyramid levels Np, and the image dimensions Nx and Ny.

  • Fig. 3.

    The mean radar reflectivity as a function of input value for the four input channels showing data (blue) and power-law fit (orange) for (a) C07, (b) C09, (c) C13, and (d) GED.

  • Fig. 4.

    Performance statistics for the three models (CNN in blue, DENSE in red, and LINEAR in yellow), calculated from the testing dataset: (a) R², (b) RMSD, (c) CSI, and (d) bias.

  • Fig. 5.

    REFC for the (a) MRMS (truth), (b) GREMLIN CNN, (c) DENSE interpretable model, and (d) LINEAR interpretable model. The location of Akron is indicated with an “A” label.

  • Fig. 6.

    The mean output vs each input for level 0 of the image pyramid. Shown are the MRMS data (black), CNN (blue), DENSE (red), and LINEAR (yellow). Each row is a different input channel, and each column is a different image kernel. A bin width of 0.02 is used, and bins with fewer than 10 points are masked.

  • Fig. 7.

    The mean output (color fill) vs the DX and DY kernel inputs for the (a)–(d) longwave and (e)–(h) lightning channels and each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for level 0.

  • Fig. 8.

    The mean output (color fill) vs level-0 and level-3 input values for the (a)–(d) longwave band and (e)–(h) lightning for each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for the identity kernel.

  • Fig. 9.

    Mean output (color fill) vs channel combinations (a)–(d) shortwave and longwave, (e)–(h) water vapor and longwave, and (i)–(l) longwave and lightning for each model [(left) data, (left center) CNN, (right center) DENSE, and (right) LINEAR] for level 0 and the identity kernel.
