1. Introduction
Deep learning (DL) has become a widely used tool in climate science, supporting tasks such as nowcasting (Shi et al. 2015; Han et al. 2017; Bromberg et al. 2019), climate or weather monitoring (Hengl et al. 2017; Anantrasirichai et al. 2019) and forecasting (Ham et al. 2019; Chen et al. 2020; Scher and Messori 2021), numerical model enhancement (Yuval and O’Gorman 2020; Harder et al. 2021), and upsampling of satellite data (Wang et al. 2021; Leinonen et al. 2021). However, deep neural networks (DNNs) are mostly considered black boxes because their decision-making process is not readily accessible. This lack of interpretability limits their trustworthiness and application in climate research, as DNNs should not only achieve high predictive performance but also provide accessible and consistent predictive reasoning aligned with existing theory (McGovern et al. 2019; Mamalakis et al. 2020; Camps-Valls et al. 2020; Sonnewald and Lguensat 2021; Clare et al. 2022; Flora et al. 2022). Explainable artificial intelligence (XAI) aims to address this lack of interpretability by explaining potential reasons behind the predictions of a network. In the climate context, XAI can help validate DNNs and, for a well-performing model, provide researchers with new insights into physical processes (Ebert-Uphoff and Hilburn 2020; Hilburn et al. 2021). For example, Gibson et al. (2021) used XAI to demonstrate that DNNs produce skillful seasonal precipitation forecasts based on known relevant physical processes. Moreover, XAI has been used to improve the forecasting of droughts (Dikshit and Pradhan 2021), teleconnections (Mayer and Barnes 2021), and regional precipitation (Pegion et al. 2022); to assess external drivers of global climate change (Labe and Barnes 2021); and to understand subseasonal drivers of high-temperature summers (Van Straaten et al. 2022). Additionally, Labe and Barnes (2022) showed that XAI applications can aid in the comparative assessment of climate models.
Generally, explainability methods can be divided into ante hoc and post hoc approaches (Samek et al. 2019) (see Table 1). Ante hoc approaches modify the DNN architecture to improve interpretability, e.g., by adding an interpretable prototype layer that learns human-understandable representations for different classes [see, e.g., Chen et al. (2019) and Gautam et al. (2022, 2023)] or by constructing mathematically similar but interpretable models (Hilburn 2023). Such approaches are also called self-explaining neural networks and link to the field of interpretability (Flora et al. 2022). Post hoc XAI methods, on the other hand, can be applied to any neural network architecture (Samek et al. 2019), and here, we focus on three characterizing aspects (Samek et al. 2019; Letzgus et al. 2022; Mamalakis et al. 2022b), as shown in Table 1. The first considers the explanation target (i.e., what is explained), which can differ between local and global decision-making. Local explanations explain the network decision for a single data point (Baehrens et al. 2010; Bach et al. 2015; Vidovic et al. 2016; Ribeiro et al. 2016), e.g., by assessing the contribution of each pixel in a given image to the predicted class, whereas global explanations reveal the overall decision strategy, e.g., by showing a map of important features or image patterns learned by the model for the whole class (Vidovic et al. 2015; Nguyen et al. 2016; Lapuschkin et al. 2019; Grinwald et al. 2022; Bykov et al. 2022a). The second aspect concerns the components used to calculate the explanation, differentiating between model-aware and model-agnostic methods. Model-aware methods use components of the trained model, such as the network weights, to calculate the explanation, while model-agnostic methods consider the model as a black box and only assess the change in the output caused by a perturbation of the input (Strumbelj and Kononenko 2010; Ribeiro et al. 2016). The third aspect considers the DNN explanation output. Here, we can differentiate between sensitivity methods, in which the value assigned to a pixel indicates the sensitivity of the network to that pixel (e.g., the absolute gradient); salience methods, which display the positive or negative contribution of a pixel to the predicted class [e.g., layerwise relevance propagation (LRP); see section 2c]; and methods presenting input examples that lead to the same prediction. Beyond these three characteristics, recent efforts (Flora et al. 2022) also differentiate between feature importance methods, encompassing mostly global methods that calculate feature contributions based on the network performance (e.g., accuracy), and feature relevance methods, describing mostly local methods that calculate contributions to the model prediction. In climate research, the decision patterns learned by DNNs have been analyzed with local explanation methods such as LRP or Shapley values (Gibson et al. 2021; Dikshit and Pradhan 2021; Mayer and Barnes 2021; Labe and Barnes 2021; He et al. 2021; Felsche and Ludwig 2021; Labe and Barnes 2022). However, different local explanation methods can identify different input features as being important to the network decision, subsequently leading to different scientific conclusions (Sturmfels et al. 2020; Covert et al. 2021; Han et al. 2022; Flora et al. 2022).
Thus, with the increasing number of XAI methods available, selecting the most suitable method for a specific task poses a challenge, and the practitioner’s choice of a method is often based on popularity or ease of access (Krishna et al. 2022). To navigate the field of XAI, recent climate science publications have compared and assessed different explanation techniques using benchmark datasets, where an XAI method is judged by comparing its explanation with a defined target considered as ground truth (Mamalakis et al. 2022b,a). While benchmark datasets (Yang and Kim 2019; Arras et al. 2020; Agarwal et al. 2022) certainly contribute to the understanding of local XAI methods, the existence of a ground truth explanation is highly debated (e.g., Janzing et al. 2020; Sturmfels et al. 2020). In the case of DNNs, ground truth explanation labels can only be considered approximations and are not guaranteed to align precisely with the model’s decision process or the features it utilizes (Ancona et al. 2019; Hedström et al. 2023a). For an exact ground truth, either perfect knowledge of how the model handles the available information or a carefully engineered model would be required, which is usually not the case. Additionally, post hoc explanation methods are generally only approximations of a model’s behavior (Lundberg and Lee 2017; Han et al. 2022), and the distinct mathematical concepts of the different methods consequently lead to distinct ground truth explanations.
Table 1. Overview and categorization of research on the transparency and understandability of neural networks. For this categorization, we follow works such as Samek et al. (2019), Ancona et al. (2019), Mamalakis et al. (2022b), Letzgus et al. (2022), and Flora et al. (2022).
Here, we address these challenges by introducing XAI evaluation in the context of climate science to compare different local explanation methods. The field of XAI evaluation has emerged recently and refers to the development of metrics to compare, benchmark, and rank explanation methods in different explainability contexts (e.g., Adebayo et al. 2018; Hedström et al. 2023b,a). As discussed below in more detail, evaluation metrics allow us to quantitatively assess the robustness, complexity, localization, randomization, and faithfulness of explanation methods, making them comparable regarding their suitability, strengths, and weaknesses (Hoffman et al. 2018; Arrieta et al. 2020; Mohseni et al. 2021; Hedström et al. 2023b).
In this work, we discuss these properties in an exemplary manner and build upon the work of Labe and Barnes (2021). In their work, a multilayer perceptron (MLP) was trained on global annual temperature anomaly maps, and the network’s task was to assign the respective year or decade of occurrence, which is possible because global mean warming progresses over time. Using LRP, they then identified the signals relevant to the network’s decision and found the North Atlantic, Southern Ocean, and Southeast Asia to be key regions. Here, we use their work as a case study and train an MLP and a convolutional neural network (CNN) for the same prediction task (see step 1 in Fig. 1). Then, we apply several explanation methods and show the variation in their explanation maps, which can lead to different scientific insights (step 2 in Fig. 1). We then introduce XAI evaluation metrics and quantify the skill of the different XAI methods against a random baseline for the different properties to compare their performance with respect to the underlying task.
Fig. 1. Schematic of the XAI evaluation procedure. Based on an annual temperature anomaly map as input, the network predicts the respective decade (box 1). The explanation methods (Grad—gradient, SG—SmoothGrad applied to gradient, and LRP—layerwise relevance propagation) then provide insights (i.e., “shine a light”; see box 2) into the specific network’s decision. The different explanation maps (marked in orange—Grad, green—SG, and blue—LRP) highlight different areas as positively (red) and negatively (blue) contributing to the network decision. Here, XAI evaluation can “shine a light” on the explanation methods and help choose a suitable method (here indicated by the first rank) since evaluation explores the explanation maps regarding their robustness, faithfulness, localization, complexity, and randomization properties.
This paper is structured as follows. In section 2, we describe the dataset and network types used and briefly introduce the different analyzed explanation methods. Section 3 introduces XAI evaluation and describes five evaluation properties. Then, in section 4, we first discuss the performance of both network types and provide a motivational example highlighting the risks of an uninformed choice of an explanation method. Next, we evaluate different XAI methods applied to the MLP, using two different metrics for each evaluation property, and then compare the XAI evaluation results for the different networks (see sections 4b and 4c). Finally, in section 4d, we present a guideline on using XAI evaluation to choose a suitable XAI method. The discussion of our results and our conclusions are presented in section 5.
2. Data and methods
a. Data
We analyze data simulated by the global climate model CESM1 (Hurrell et al. 2013), focusing on the “ALL” configuration (Kay et al. 2015), which is discussed in detail in Labe and Barnes (2021). We use the global 2-m air temperature (T2m) maps from 1920 to 2080. The data Ω consist of I = 40 ensemble members Ωi∈{1,…,I}, and each member is generated by varying the atmospheric initial conditions zi with fixed external forcing θclima. Following Labe and Barnes (2021), we compute annual averages and apply bilinear interpolation. This results in T = 161 temperature maps for each member.
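As a rough illustration of this data layout (not the preprocessing code of this study), the following NumPy sketch arranges the ensemble into training samples; the grid size and the per-grid-cell standardization shown here are assumptions made only for the example.

```python
import numpy as np

I, T = 40, 161            # ensemble members, annual maps 1920-2080
n_lat, n_lon = 96, 144    # assumed grid after bilinear interpolation

rng = np.random.default_rng(0)
t2m = rng.normal(size=(I, T, n_lat, n_lon))   # stand-in for the T2m annual means

# Standardize each grid cell across members and years (scheme assumed),
# analogous to preparing standardized input maps for the networks.
mean = t2m.mean(axis=(0, 1), keepdims=True)
std = t2m.std(axis=(0, 1), keepdims=True)
x = (t2m - mean) / std

# Each map is one training sample; labels are the year (or decade) of occurrence.
years = np.arange(1920, 1920 + T)
X = x.reshape(I * T, n_lat * n_lon)   # flattened maps, e.g., for the MLP
y = np.tile(years, I)
print(X.shape, y.shape)               # (6440, 13824) (6440,)
```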
b. Networks
Additionally, we construct a CNN
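Since the architectural details are given in Labe and Barnes (2021) and appendix B rather than reproduced here, the following Keras sketch is only a hypothetical stand-in for the two network types: all layer sizes, filter counts, and the number of decade classes are placeholder assumptions; only the CNN learning rate (lCNN = 0.001) is taken from appendix B.

```python
import tensorflow as tf

n_lat, n_lon = 96, 144     # grid size assumed for illustration
n_classes = 16             # number of decade classes assumed for illustration

# Hypothetical MLP on flattened T2m maps (layer sizes are placeholders).
mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_lat * n_lon,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])

# Hypothetical CNN on the unflattened maps (filter counts are placeholders).
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_lat, n_lon, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.AveragePooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])

# l_CNN = 0.001 is taken from appendix B; the MLP follows the Labe and
# Barnes (2021) settings, which are not reproduced in this sketch.
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="categorical_crossentropy", metrics=["accuracy"])
```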
c. XAI
In this work, we focus on local model-aware explanation methods belonging to the group of feature attribution methods (Ancona et al. 2019; Das and Rad 2020; Zhou et al. 2022). For the mathematical details, we refer to appendix A, section a.
- Gradient (Baehrens et al. 2010) explains the network decision by computing the first partial derivative of the network output f(x) with respect to the input. This explanation method feeds backward the network’s prediction to the features in the input x, indicating the change in network prediction given a change in the respective features. The explanation values correspond to the network’s sensitivity to each feature, thus belonging to the group of sensitivity methods. The absolute gradient, often referred to as the saliency map, can also be used as an explanation (Simonyan et al. 2014).
- Input times gradient is an extension of the gradient method and computes the product of the gradient and the input. In the explanation map, a high relevance is assigned to an input feature if it has a high value and the model gradient is sensitive to it. Therefore, contrary to the gradient as a sensitivity method, input times gradient and other methods including the input pixel value are considered salience methods (Ancona et al. 2019) [or attribution methods, e.g., Mamalakis et al. (2022a)].
- Integrated Gradients (Sundararajan et al. 2017) extends input times gradient by integrating the gradient along a straight-line path from a baseline (generally a reference vector for which the network’s output is zero, e.g., all zeros for standardized data) to the explained sample x. In practice, the gradient explanations of a set of images lying between the baseline and x are averaged and multiplied by the difference between the baseline and the explained input [see Eq. (A3)]. Hence, Integrated Gradients is a salience method and highlights the difference between the features important to the prediction of x and the features important to the prediction of the baseline value.
- LRP (Bach et al. 2015; Montavon et al. 2019) is a salience method that computes the relevance of each input feature by feeding the network’s prediction backward through the model, layer by layer, until the prediction score is distributed over the input features. Different propagation rules can be used, all following a conservation principle analogous to energy conservation, i.e., the sum of all relevances within one layer is equal to the original prediction score. In the case of the α-β rule, relevance is assigned at each layer to each neuron: all positively contributing activations of connected neurons in the previous layer are weighted by α, while β is used to weight the contribution of the negative activations. The default values are α = 1 and β = 0, where only positively contributing activations are considered. In contrast, the z rule calculates the explanation by including both negative and positive neuron activations. Hence, the corresponding explanations, visualized as heatmaps, display both positive and negative evidence. The composite rule combines various rules for different layer types, accounting for the variety of layer structures in CNNs, such as fully connected, convolutional, and pooling layers.
- SmoothGrad (Smilkov et al. 2017) aims to filter out the background noise [i.e., the gradient shattering effect, where gradients resemble white noise with increasing layer number (Balduzzi et al. 2017)] to enhance the interpretability of the explanation. To this end, multiple noisy samples are generated by adding random noise to the input; then, the explanations of the noisy samples are computed and averaged, such that the most important features are enhanced and the less important features are “canceled out.”
- NoiseGrad (Bykov et al. 2022b) perturbs the weights of the model rather than the input features (as SmoothGrad does). The explanations resulting from explaining the predictions of the noisy versions of the model on the same image are averaged to suppress the background noise of the image in the final explanation.
- FusionGrad (Bykov et al. 2022b) combines SmoothGrad and NoiseGrad by perturbing both the input features and the network weights. The purpose of the method is to account for uncertainties within the network and the input space (Bykov et al. 2021).
- Deep Shapley additive explanations (DeepSHAP) (Lundberg and Lee 2017) estimates Shapley values for the full DNN by dividing it into small network components, calculating the Shapley values, and averaging them across all components. The idea behind SHAP values is to fairly distribute the contribution of each feature to the prediction of a specific instance, considering all possible feature combinations. Following the game-theoretic concept of Shapley values (Shapley 1951), DeepSHAP satisfies properties such as local accuracy, missingness, and consistency (Lundberg and Lee 2017) and is a salience method.
In this work, we keep the literature values for most hyperparameters of the explanation methods. Exceptions are the perturbation levels of NoiseGrad and FusionGrad, which we adjust, as discussed in Bykov et al. (2022b), to ensure at most a 5% loss in accuracy. All hyperparameters are presented in Table B1 (see appendix B, section a). Additionally, both Integrated Gradients and DeepSHAP require background images as reference points to calculate the explanations [see also Lundberg and Lee (2017) and appendix A, section a]. To allow for a fair performance comparison, for both methods, we sample 100 maps containing all zero values. We note that there are other possible reference values, e.g., input maps from the training data or all-ones maps, and this choice can affect the explanation performance (Sturmfels et al. 2020). Last, the baseline for SmoothGrad, NoiseGrad, and FusionGrad can be any local explanation method, and here, we use the gradient explanations. Accordingly, gradient, SmoothGrad, NoiseGrad, and FusionGrad are sensitivity methods.
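In our experiments, the explanations are computed with iNNvestigate and the implementations of Bykov et al. (2022b). Purely as an illustration of the logic of the gradient, input times gradient, and SmoothGrad methods described above, a framework-agnostic TensorFlow sketch could look as follows; the function names, noise level, and sample count are illustrative choices, not the hyperparameters of Table B1.

```python
import numpy as np
import tensorflow as tf

def gradient_explanation(model, x, class_idx):
    """Sensitivity map: d f_c(x) / d x for the explained class c."""
    x = tf.convert_to_tensor(x[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x)[:, class_idx]
    return tape.gradient(score, x).numpy()[0]

def input_times_gradient(model, x, class_idx):
    """Salience map: element-wise product of the input and its gradient."""
    return x * gradient_explanation(model, x, class_idx)

def smoothgrad(model, x, class_idx, n=25, noise_level=0.1, seed=0):
    """Average gradient explanations over noisy copies of the input."""
    rng = np.random.default_rng(seed)
    sigma = noise_level * (x.max() - x.min())
    grads = [gradient_explanation(model, x + rng.normal(0.0, sigma, x.shape),
                                  class_idx) for _ in range(n)]
    return np.mean(grads, axis=0)
```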
3. Evaluation techniques
Due to the lack of a ground truth explanation, XAI research has developed alternative metrics to assess the reliability of an explanation method. These evaluation metrics analyze different properties an explanation method should fulfill and can be used to compare different explanation methods (Hoffman et al. 2018; Arrieta et al. 2020; Mohseni et al. 2021; Hedström et al. 2023b). Following Hedström et al. (2023b), we describe five different evaluation properties, and based on the classification task from Labe and Barnes (2021), we illustrate each property in a schematic diagram (see Figs. 2–4).
Fig. 2. Diagram of the concept behind the robustness property. Perturbed input images are created by adding uniform noise maps of small magnitude to the original temperature map (left part of the figure). The perturbed maps are passed to the network, resulting in an explanation map for each prediction. The explanation maps of the perturbed inputs (explanation maps with gray outlines) are then compared to (indicated by a minus sign) the explanation of the unperturbed input (explanation map with black outline). A robust XAI method is expected to produce similar explanations for the perturbed inputs and unperturbed inputs.
Fig. 3. Diagram of the concept behind the faithfulness property. Faithfulness assesses the impact of highly relevant pixels in the explanation map on the network decision. First, the explanation values are sorted to identify the highest relevance values (here shown in red). Next, the corresponding pixel positions in the flattened input temperature map are identified (see dotted arrows) and masked (marked in black); i.e., their value is set to a chosen masking value, such as 0 or 1. Both the masked and the original input maps are passed through the network, and their predictions are compared. If the masking is based on a faithful explanation, the prediction of the masked input (j; gray) is expected to change compared with (indicated by a minus sign) the unmasked input (i; black), e.g., a different decade is predicted.
Fig. 4. Diagram of the concept behind the complexity property. Complexity assesses how the evidence values are distributed across the explanation map. For this, the distribution of the relevance values from the original explanation is compared to a “random” explanation drawn from a random uniform distribution. Here, shown in a 1D example, the evidence distribution of the explanation exhibits clear maxima and minima (see maxima in red oval), which is considered desirable and linked to increased scores. The noisy features show a uniform distribution linked to a low complexity score.
a. Robustness
Robustness measures the stability of an explanation with respect to small changes in the input image x + δ (Alvarez-Melis and Jaakkola 2018b; Yeh et al. 2019; Montavon et al. 2018). Ideally, these small changes (δ < ϵ) in the input sample should produce only small changes in the model prediction and, in turn, only small changes in the explanation (see Fig. 2).
The robustness metrics assess the difference between the explanations of the original and the perturbed input [see Eqs. (2) and (3)]. Accordingly, the lowest score represents the highest robustness.
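The two robustness metrics used below (average sensitivity and the local Lipschitz estimate) are computed with Quantus; the following NumPy sketch only illustrates the underlying idea and is not the exact Quantus formulation (perturbation scheme, norms, and aggregation are simplified assumptions). Here, explain_fn is assumed to wrap a fixed model and explained class, e.g., a lambda around a gradient explanation.

```python
import numpy as np

def avg_sensitivity_sketch(explain_fn, x, n=10, eps=0.05, seed=0):
    """Average relative change of the explanation under small uniform input
    perturbations (lower = more robust explanation)."""
    rng = np.random.default_rng(seed)
    a = explain_fn(x)
    changes = []
    for _ in range(n):
        delta = rng.uniform(-eps, eps, size=x.shape)
        changes.append(np.linalg.norm(explain_fn(x + delta) - a)
                       / np.linalg.norm(a))
    return float(np.mean(changes))

def local_lipschitz_sketch(explain_fn, x, n=10, eps=0.05, seed=0):
    """Largest ratio of explanation change to input change over the sampled
    perturbations (lower = more robust explanation)."""
    rng = np.random.default_rng(seed)
    a = explain_fn(x)
    ratios = []
    for _ in range(n):
        delta = rng.uniform(-eps, eps, size=x.shape)
        ratios.append(np.linalg.norm(explain_fn(x + delta) - a)
                      / np.linalg.norm(delta))
    return float(np.max(ratios))
```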
b. Faithfulness
Faithfulness measures whether changing a feature to which an explanation method assigned high relevance also changes the network prediction (see Fig. 3). This can be examined through the iterative perturbation of an increasing number of input pixels corresponding to high relevance values and the subsequent comparison of each resulting model prediction to the original model prediction. Since explanation methods assign relevance to features based on their contribution to the network’s prediction, changing high-relevance features should have a larger impact on the model prediction than changing features of lesser relevance (Bach et al. 2015; Samek et al. 2017; Montavon et al. 2018; Bhatt et al. 2020; Nguyen and Martínez 2020).
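A minimal sketch of this masking logic is given below; it is not the ROAD or faithfulness correlation implementation used in our experiments (ROAD relies on noisy linear imputation, and faithfulness correlation on a correlation coefficient). Here, model_fn is assumed to return class probabilities for a batch, e.g., a wrapped Keras predict call.

```python
import numpy as np

def masking_faithfulness_sketch(model_fn, x, a, class_idx,
                                fractions=(0.01, 0.05, 0.1, 0.2),
                                mask_value=0.0):
    """Mask an increasing share of the most relevant pixels and record the
    drop in the explained class score; a faithful explanation should yield
    large drops already for small fractions of masked pixels."""
    order = np.argsort(a.ravel())[::-1]              # pixels by relevance
    base = model_fn(x[None])[0, class_idx]
    drops = []
    for frac in fractions:
        k = int(frac * a.size)
        x_masked = x.ravel().copy()
        x_masked[order[:k]] = mask_value             # remove the top-k pixels
        pred = model_fn(x_masked.reshape(x.shape)[None])[0, class_idx]
        drops.append(base - pred)
    return np.asarray(drops)
```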
c. Complexity
Complexity is a measure of conciseness, indicating that an explanation should consist of only a few highly important features (Chalasani et al. 2020; Bhatt et al. 2020) (see Fig. 4). The assumption is that concise explanations, characterized by prominent features, facilitate interpretation by researchers and potentially carry higher informational value with reduced noise.
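As a rough sketch of the two notions used below, an entropy-based complexity score and a Gini-index-based sparseness score (both computed with Quantus in our experiments), the following NumPy functions illustrate the idea; the exact normalization and edge-case handling of the Quantus implementation may differ.

```python
import numpy as np

def explanation_entropy(a):
    """Complexity sketch: Shannon entropy of the normalized absolute
    relevances (lower entropy = more concise explanation)."""
    p = np.abs(a).ravel()
    p = p / p.sum()                 # assumes at least one nonzero relevance
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def gini_sparseness(a):
    """Sparseness sketch: Gini index of the absolute relevances
    (close to 1 = relevance concentrated on few pixels, 0 = uniform)."""
    v = np.sort(np.abs(a).ravel())  # ascending; assumes nonzero total
    n = v.size
    cum = np.cumsum(v)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)
```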
d. Localization
For localization, the quality of an explanation is measured based on its agreement with a user-defined region of interest (ROI; see Fig. 5). Accordingly, the position of pixels with the highest relevance values (given by the XAI explanation) is compared to the labeled areas, e.g., bounding boxes or segmentation masks. Based on the assumption that the ROI should be mainly responsible for the network decision (ground truth) (Zhang et al. 2018; Arras et al. 2020; Theiner et al. 2022; Arias-Duart et al. 2022), an explanation map yields high localization if high-relevance values are assigned to the ROI.
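A simplified sketch of the two localization metrics used below (Top-K intersection and relevance rank accuracy) is given here; roi_mask is assumed to be a Boolean array marking the ROI (e.g., the NA box), and ties as well as other edge cases are ignored.

```python
import numpy as np

def topk_intersection(a, roi_mask, k_frac=0.1):
    """Top-K sketch: fraction of the k most relevant pixels that fall inside
    the region of interest (here k = 10% of all pixels, as in appendix B)."""
    k = int(k_frac * a.size)
    top_idx = np.argsort(a.ravel())[::-1][:k]
    return float(roi_mask.ravel()[top_idx].mean())

def relevance_rank_accuracy(a, roi_mask):
    """Relevance-rank-accuracy sketch: take as many top-ranked pixels as the
    ROI contains and report the share of them that lies inside the ROI."""
    k = int(roi_mask.sum())
    top_idx = np.argsort(a.ravel())[::-1][:k]
    return float(roi_mask.ravel()[top_idx].sum() / k)
```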
Fig. 5. Diagram of the concept behind the localization property. First, an expected region of high relevance for the network decision, the ROI, is defined in the input temperature map (blue box). Here, the NA is chosen, as this region has been discussed to affect the prediction (see Labe and Barnes 2021). Next, the sorted explanation values of the ROI, encompassing k pixels, are compared to the k highest values of the sorted explanation values across all pixels. An explanation method with strong localization should assign the highest relevance values to the ROI.
e. Randomization
Randomization assesses how a random perturbation scenario changes the explanation (see Fig. 6). Either the network weights (case 2; Adebayo et al. 2018) are randomized or a random class that was not predicted by the network for the input sample x is explained (case 1; Sixt et al. 2020). In both cases, a change in the explanation is expected, since the explanation of an input x should change if the model changes or if a different class is explained.
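A sketch of the model parameter randomization test for a Keras model follows (the random logit variant instead recomputes the explanation for a randomly chosen non-predicted class and compares it to the original one). Here, explain_fn, the noise scale, and the layer ordering are illustrative assumptions; explain_fn could be, e.g., the gradient sketch from section 2c.

```python
import numpy as np

def model_parameter_randomization_sketch(model, explain_fn, x, class_idx, seed=0):
    """Progressively randomize the layer weights (starting from the output
    layer, cf. 'bottom_up' in appendix B) and correlate each resulting
    explanation with the original one; low correlation indicates an
    explanation that is sensitive to the model parameters."""
    rng = np.random.default_rng(seed)
    a_orig = explain_fn(model, x, class_idx).ravel()
    trained_weights = model.get_weights()
    correlations = []
    for layer in reversed(model.layers):
        if not layer.get_weights():
            continue                               # skip parameter-free layers
        layer.set_weights([rng.normal(0.0, 0.05, w.shape)
                           for w in layer.get_weights()])
        a_rand = explain_fn(model, x, class_idx).ravel()
        correlations.append(float(np.corrcoef(a_orig, a_rand)[0, 1]))
    model.set_weights(trained_weights)             # restore the trained model
    return correlations
```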
Fig. 6. Diagram of the concept behind the randomization property. In the middle row, the original input temperature map is passed through the network, and the explanation map is calculated based on the predicted (gray background) decade. For the random logit metric (first row, labeled 1), the input temperature map and the network remain unchanged, but the decade k used to calculate the explanation is randomly chosen (pink font). The resulting explanation map is then compared to the original explanation (indicated by a minus sign) to test its dependence on the class. For the model parameter randomization test (bottom row, labeled 2), the network is perturbed (see green box) with noisy parameters (θ1 = θ + noise), potentially altering the predicted decade (j; gray). The explanation map of the perturbed model should differ from the original explanation map if the explanation is sensitive to the model parameters.
f. Metric score calculation
4. Experiments
a. Network predictions, explanations, and motivating example
In the following, we evaluate the network performance and discuss the application of the explanation methods for both network architectures. To ensure comparability between the networks and with our case study (Labe and Barnes 2021), we use a similar set of hyperparameters for the MLP and the CNN during training. A detailed performance discussion is provided in appendix B, section a. The similar performance of the two networks ensures that XAI evaluation score differences between the MLP and the CNN are not caused by differences in network accuracy.
After training and performance evaluation, we explain all correctly predicted temperature maps in the training, validation, and test samples (see appendix B, section a for details). Such explanations are most often the subject of further research on the physical phenomena learned by the network (Barnes et al. 2020; Labe and Barnes 2021; Barnes et al. 2021; Labe and Barnes 2022). We apply all XAI methods presented in section 2c to both networks, with the exception of the LRP composite rule, which converges to the LRP-z rule for the MLP model due to its purely dense layer architecture (Montavon et al. 2019). The corresponding explanation maps across all XAI methods and for both networks are displayed in Figs. B4 and B5. Despite explaining the same network predictions, different methods assign different relevance values to the same areas, revealing the disagreement problem in XAI (Krishna et al. 2022).
To illustrate this explanation disagreement, we show the explanation maps for the year 2068 given by DeepSHAP and LRP-z, alongside the input temperature map, in Fig. 7. According to the primary publication (Labe and Barnes 2021), the cooling patch in the North Atlantic (NA), depicted in the zoomed-in map sections (10°–80°W, 20°–60°N) of Fig. 7, significantly contributes to the network prediction for all decades. Thus, it is reasonable to assume high relevance values in this region. However, the two XAI methods display contrary signs of relevance in some areas, impeding visual comparison and interpretation. The varying sign can be attributed to DeepSHAP being based on feature removal and modified gradient backpropagation, whereas LRP-z is theoretically equivalent to input times gradient. Thus, the two explanations potentially display different aspects of the network decision (Clare et al. 2022), and explanations can vary in sign depending on the input image [see also the discussion on input shift invariance in Mamalakis et al. (2022a)]. Nonetheless, we also find common features, for example, in Australia and throughout the Antarctic region. Thus, a deeper understanding of explanation methods and their properties is necessary to enable an informed method choice.
Fig. 7. Motivating example visualizing the difference between different XAI methods. Shown are the T2m maps (a) for the year 2068 with the corresponding (b) DeepSHAP and (c) LRP-z explanation maps of the MLP. For both XAI methods, red indicates a pixel contributed positively, and blue indicates a negative contribution to the predicted class. Next to the explanation maps, a zoomed-in map of the NA region (10°–80°W, 20°–60°N) is shown, demonstrating different evidence for DeepSHAP and LRP-z.
b. Assessment of explanation methods
To introduce the application of XAI evaluation, we compare the different XAI methods applied to the MLP and calculate their skill scores across all five XAI method properties (see section 3). For each property, two representative metrics (hyperparameters are listed in appendix B, section b) are computed and compared. Each skill score is averaged across 50 random samples drawn from the explanations of all correctly predicted inputs, and we provide the standard error of the mean (SEM) (see appendix A, section b for details). To account for potential biases resulting from the choice of the testing period, we also compute the scores for random samples not limited to correct predictions; these scores (not shown) are qualitatively consistent with those reported here. Our results are depicted in Fig. 8.
Fig. 8. Barplot of skill scores based on the random baseline reference for two different metrics in each of the (a) robustness, (b) faithfulness, (c) complexity, (d) localization, and (e) randomization properties. The different metrics are indicated by hatches or no hatches on the bar. We report the mean skill score (as bar labels) and the SEM, indicated by the error bars in black on each bar. The bar color scheme indicates the grouping of the XAI methods into sensitivity (violet tones) and salience/attribution methods (earthy tones).
For the robustness property, we find that all tested explanation methods result in similar, high, and closely distributed skill scores (≥0.85 and ≤0.93) for both the average sensitivity metric (hatches in Fig. 8a) and the local Lipschitz estimate metric (no hatches), where the latter shows slightly higher values overall. For both metrics, we find that salience (earthy tones) and sensitivity (violet tones) methods show a similar robustness skill and that perturbation-based methods (SmoothGrad, NoiseGrad, and Integrated Gradients) do not significantly improve the skill compared to their respective baseline explanations (gradient and input times gradient). We relate the latter finding to the low signal-to-noise ratio of the climate data and the variability between different ensemble members, which complicate the choice of an effective perturbation level for these explanation methods. Nonetheless, these findings disagree with previous studies suggesting robustness improvements when applying salience and perturbation-based methods (Smilkov et al. 2017; Sundararajan et al. 2017; Bykov et al. 2022b; Mamalakis et al. 2022a).
For faithfulness, we find pronounced skill score differences between the two metrics, with ROAD scores indicating positive skill for all methods, whereas faithfulness correlation scores include negative values for the sensitivity methods (hatched violet bars in Fig. 8b). This disparity arises from the faithfulness correlation metric being based on the correlation coefficient and from the distinct interpretations of relevance values in salience maps versus sensitivity maps. Since sensitivity maps display the network’s sensitivity toward a change in the value of each pixel (the sign conveys the direction), the impact of the masking value depends on the discrepancy between the original pixel value and the masking value, leading to a negative correlation. Nonetheless, across metrics, the best skill scores (≤0.6) are achieved by input times gradient, Integrated Gradients, and LRP-z, followed by LRP-α-β (≤0.42). Furthermore, sensitivity methods (violet tones) achieve overall lower skill scores. Although DeepSHAP exhibits a lower faithfulness correlation skill [which we attribute to the challenge of applying Shapley values to continuous data (Han et al. 2022) and the vulnerability toward feature correlation (Flora et al. 2022)], the method still outperforms the sensitivity methods, indicating that salience (or attribution) methods provide more faithful relevance values. However, this is because salience methods indicate the contribution of each pixel to the prediction, as required by faithfulness. Thus, sensitivity methods inherently result in less faithful explanations. We note that the input multiplication of salience methods can lead to a loss of information when using standardized input pixels, as zero values in the input (i.e., values close to climatology) will result in zero values in the explanation regardless of the network’s sensitivity to them [see section 2c and Mamalakis et al. (2022a) discussing “ignorant to zero input”].
For complexity (Fig. 8c), all explanation methods exhibit low skill scores for the complexity metric compared to the sparseness metric, indicating that explanations on climate data exhibit an entropy similar to uniformly sampled values. This similarity in entropy can be attributed to the increased variability and subsequently low signal-to-noise ratio of climate data (Sonnewald and Lguensat 2021; Clare et al. 2022). For the sparseness metric, salience (attribution) methods show improved skill scores. We also find slight skill score improvements for NoiseGrad and FusionGrad, suggesting that incorporating network perturbations may decrease explanation complexity.
To compute the results of the localization metrics, Top-K (hatches in Fig. 8d) and relevance rank accuracy (no hatches), we select the region in the North Atlantic (10°–80°W, 20°–60°N) as our ROI, with the cooling in this region being a recognized feature of climate change (Labe and Barnes 2021). For both metrics, all explanation methods yield low skill scores. This is consistent with the lower sparseness skill scores in complexity (≤0.47), indicating that high relevance values are spread out and that the ROI therefore includes fewer distinct features. In addition, high relevance in the ROI depends on whether the network learned this specific area. Thus, our results potentially indicate an inadequate choice of the ROI (either size or location) and show that localization metrics can be used to probe whether the network learned a specific region. Nonetheless, LRP-α-β yields the highest skill across metrics, indicating that attributing only positive relevance values improves the distinctiveness of features in the NA region. Similar to complexity, salience methods (earthy tones) yield a slightly higher localization skill than sensitivity methods (violet tones), with the exception of NoiseGrad.
Last, we present the randomization results (Fig. 8e). For the random logit metric, all XAI methods yield lower skill scores (≥0.1 and ≤0.58). This can be attributed to the network task classes being defined based on decades with an underlying continuous temperature trend. Thus, the differences in temperature maps can be small for subsequent years, and the network decision and explanation for different classes may include similar features. Nonetheless, we find salience (earthy tones) and sensitivity methods (violet tones) to yield no clear separation. Instead, XAI methods using perturbation result in higher skill scores, with mean improvements for FusionGrad exceeding the SEM, as well as a slight improvement for NoiseGrad and SmoothGrad over gradient and Integrated Gradients over input times gradient. Thus, while input perturbations already slightly improve the class separation in the explanation, also including network perturbation yields favorable improvement. For the model parameter randomization test scores, skill scores are overall higher (≥0.58 and ≤0.99) across all explanation methods, and sensitivity methods outperform salience methods, the latter aligning with Mamalakis et al. (2022b). Similar to the complexity results, the DeepSHAP skill score aligns with other salience method results. In addition, LRP-α-β yields the worst skill across metrics, potentially due to neglecting negatively contributing neurons during backpropagation [see Eq. (A4) in appendix A, section a] and corresponding variations across classes and under parameter randomization.
c. Network-based comparison
To compare the performance of the explanation methods for the MLP and the CNN, we selected one metric per property: the local Lipschitz estimate for robustness, ROAD for faithfulness, sparseness for complexity, Top-K for localization, and the model parameter randomization test for randomization.
For robustness (see Fig. 9a), XAI methods applied to the CNN yield strong skill score variations, with the MLP results showing overall higher skill scores. For the CNN, the LRP composite rule provides the best robustness skill. We find salience methods to exhibit slightly higher skill scores, the exception being FusionGrad outperforming LRP-α-β and DeepSHAP. This suggests that, due to the differences in learned patterns between the CNN and the MLP, including both network and input perturbations yields more robust explanations, whereas combining a removal-based technique (Covert et al. 2021) with a modified gradient backpropagation (Ancona et al. 2019), as in DeepSHAP, or neglecting negatively contributing neurons, as in LRP-α-β, worsens robustness compared with other salience methods. Moreover, explanation methods using input perturbations improve the robustness of sensitivity explanations for the CNN (SmoothGrad and FusionGrad), while methods using only network perturbations decrease the robustness skill (NoiseGrad).
Fig. 9. Barplot of skill scores based on the random baseline reference for the MLP (star hatches) and CNN (no hatches) in each of the (a) robustness, (b) faithfulness, (c) complexity, (d) localization, and (e) randomization properties. We report the skill score (as bar labels) and the SEM of all scores, indicated by the error bars in black on each bar. The bar color scheme indicates the grouping of the XAI methods into sensitivity (violet tones) and salience/attribution methods (earthy tones). Note that for LRP composite (LRP-comp), we only report the CNN results (for details, see section 4a).
For the faithfulness property (see Fig. 9b), salience explanation methods (Integrated Gradients, input times gradient, and LRP) achieve higher skill for both networks, aligning with previous research (Mamalakis et al. 2022b,a) and the theoretical differences (see section 4b). However, the LRP composite rule is the exception, adding insight to the findings of other studies (Mamalakis et al. 2022a), as it sacrifices faithful evidence for a less complex [human-aligned; Montavon et al. (2019)] and more robust explanation. Moreover, perturbation-based explanation methods (SmoothGrad, NoiseGrad, FusionGrad, and Integrated Gradients) do not significantly increase the faithfulness skill compared to their respective baseline explanations (gradient and input times gradient), except for Integrated Gradients for the MLP. Similar to the MLP results, LRP-α-β acts as an outlier compared with other salience methods. For the CNN, DeepSHAP’s faithfulness skill is also decreased, contradicting theoretical claims and other findings (Lundberg and Lee 2017; Mamalakis et al. 2022a). Since the CNN learns more clustered patterns (groups of pixels according to the filter-based architecture), we attribute this outcome to both DeepSHAP’s theoretical definitions (Han et al. 2022) and its vulnerability toward feature correlation (Flora et al. 2022), with the latter making partitionSHAP a more suitable option (Flora et al. 2022).
In complexity, salience methods exhibit a slight skill improvement over sensitivity methods across networks, except for LRP-α-β for the CNN (Fig. 9c). This indicates that neglecting negative feature relevance affects the CNN’s explanation more strongly, leading to fewer distinct features in the explanation, while the lower DeepSHAP skill further confirms the previously discussed disadvantages of DeepSHAP for the CNN.
In localization, both the MLP and the CNN show similarly low overall skill scores (≤0.33), indicating that the size or location of the ROI was not optimally chosen for this case study. Nonetheless, the skill scores across XAI methods are in line with the complexity results, except for the worst and best skill scores. The LRP composite rule yields the lowest localization skill, further confirming its trade-off between faithfulness and interpretability, also within the ROI. FusionGrad provides the highest localization skill for the CNN. In contrast, LRP-α-β yields the highest skill for the MLP but the second lowest skill score for the CNN. The differences in results across networks for complexity and localization can be attributed to differences in the learned patterns (as discussed above), affecting properties that assess the spatial distribution of evidence in the explanation.
Last, for randomization (see Fig. 9e), regardless of the network, sensitivity methods outperform salience methods, indicating a decreased susceptibility to changes in the network parameters. While slightly lower, the randomization skill score of DeepSHAP does agree with other salience methods aligning with Mamalakis et al. (2022b,a).
Overall, our results show that while explanation methods applied to different network architectures retain similar faithfulness and randomization properties, their robustness, complexity, and localization properties depend on the specific architecture.
d. Choosing an XAI method
Evaluation metrics enable the comparison of different explanation methods based on various properties for different network architectures, allowing us to assess their suitability for specific tasks. Here, we propose a framework to select an appropriate XAI method.
Practitioners first determine which explanation properties are essential for their network task. For instance, for physically informed networks, randomization (the model parameter randomization test) is crucial, as the parameters are meaningful and explanations should respond to their successive randomization. Similarly, localization might be less important if an ROI cannot be determined beforehand. Second, practitioners calculate evaluation scores for each selected property across various XAI methods. We suggest calculating the skill score (see section 3f) to improve score interpretability. Third and last, the optimal XAI method for the task can be chosen based either directly on the skill scores or on the rank of each explanation method, as in previous studies (Hedström et al. 2023b; Tomsett et al. 2020; Rong et al. 2022b; Brocki and Chung 2022; Gevaert et al. 2022).
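As a toy illustration of steps 2 and 3, the snippet below ranks a few methods by an unweighted mean over the selected properties; the skill values are invented, the aggregation is just one possible choice (ranking per property is equally valid), and in our experiments the scores themselves come from the evaluation metrics (here computed with Quantus).

```python
import numpy as np

# Step 1 (assumed): robustness, faithfulness, complexity, and randomization
# were selected as the essential properties for the task.
properties = ["robustness", "faithfulness", "complexity", "randomization"]

# Step 2: skill scores per method and property (values below are made up).
skill = {
    "Gradient":             [0.90, 0.10, 0.05, 0.95],
    "Input times gradient": [0.88, 0.55, 0.30, 0.70],
    "LRP-z":                [0.89, 0.58, 0.32, 0.72],
}

# Step 3: rank methods by mean skill over the selected properties.
mean_skill = {method: float(np.mean(scores)) for method, scores in skill.items()}
for method in sorted(mean_skill, key=mean_skill.get, reverse=True):
    print(f"{method}: {mean_skill[method]:.2f}")   # best-performing method first
```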
In our case study, for example, the explanation method should exhibit robustness toward variation across climate model ensemble members, display concise features (complexity) without sacrificing faithfulness, and capture the randomization of the network parameters (randomization). Using the Quantus XAI evaluation library (Hedström et al. 2023b), we visualize the evaluation results for the MLP in a spider plot (Fig. 10a), with the outermost line indicating the best-performing XAI method in each property. All methods yield a similar robustness skill but differ in their randomization, faithfulness, and complexity skills. Methods such as LRP-z (light beige), input times gradient (ocher), Integrated Gradients (orange), and DeepSHAP (brown) provide the most faithful explanations [similar to findings in Mamalakis et al. (2022a)], with DeepSHAP providing a slightly lower randomization and robustness skill.
Fig. 10. Visualization of the proposed procedure to choose an appropriate XAI method. (a) In the spider plot, the mean skill scores for all properties across nine explanation methods (MLP explanations) are visualized, according to Fig. 9. The spider plot can be used as a visual aid alongside the skill scores or ranks in each essential property to identify the best-performing XAI method. In the plot, the best results correspond to the furthest distance from the center of the graph. (b) The LRP-z explanation map of the decade prediction on the temperature map of 2068 is shown with (c) a zoom to the NA region.
Based on the different strengths and weaknesses, we would select LRP-z to explain the MLP predictions (Fig. 10b) and analyze the impact of the NA region (Fig. 10c) on the network predictions. According to the explanation, the network heavily depends on the North Atlantic region and the cooling patch pattern, suggesting its relevance in correctly predicting the decade in this global warming simulation scenario. However, we also stress that additionally applying a sensitivity method such as gradient-based SmoothGrad potentially illuminates further aspects of this network decision, as sensitivity methods provide a strong randomization skill, in contrast to LRP-z.
5. Discussion and conclusions
AI models, particularly DNNs, can learn complex relationships from data and subsequently predict unseen data points. However, their black box character restricts human understanding of the learned input–output relation, making DNN predictions challenging to interpret. To illuminate the model’s behavior, local XAI methods have been developed that identify the input features responsible for individual predictions and offer novel insights into climate artificial intelligence research (Camps-Valls et al. 2020; Gibson et al. 2021; Dikshit and Pradhan 2021; Mayer and Barnes 2021; Labe and Barnes 2021; Van Straaten et al. 2022; Labe and Barnes 2022). Nevertheless, the increasing number of available XAI methods and their visual disagreement (Krishna et al. 2022), illustrated in our motivating example (Fig. 7), raise two important questions: Which explanation method is trustworthy, and which is the appropriate choice for a given task?
To address these questions, we introduced XAI evaluation to climate science, building upon existing climate XAI research as our case study (Labe and Barnes 2021). We evaluate and compare various local explanation methods for an MLP and a CNN regarding five properties, i.e., robustness, faithfulness, randomization, complexity, and localization, using metrics provided by the Quantus library (Hedström et al. 2023b). Furthermore, we improve the interpretability of the evaluation scores by calculating a skill score with reference to a random uniform explanation.
In the first experiment, we showcase the application of XAI evaluation on the MLP explanations using two metrics for each property (Alvarez-Melis and Jaakkola 2018b; Montavon et al. 2019; Yeh et al. 2019; Bhatt et al. 2020; Arras et al. 2020; Rong et al. 2022a; Hedström et al. 2023b). Our results indicate that salience methods (i.e., input times gradient, Integrated Gradients, and LRP) yield an improvement in faithfulness and complexity skill but a reduced randomization skill. Contrary to salience methods, sensitivity methods (gradient, SmoothGrad, NoiseGrad, and FusionGrad) show higher randomization skill scores while sacrificing faithfulness and complexity skill. These results indicate that a combination of explanation methods can be favorable depending on the explainability context. We also establish that evaluating explanation methods in a climate context requires careful consideration. For example, due to the natural variability in the data, the sparseness metric is best suited for determining explanation complexity. Further, the random logit metric is better suited to classification tasks with pronounced class separation than to datasets with continuous features spanning multiple classes. Last, we highlight the importance of correctly identifying an ROI to ensure an informative localization evaluation and note that localization metrics enable probing the network regarding learned physical phenomena.
In the second experiment, we compare the properties of the MLP and CNN explanations across all XAI methods. Both the localization and complexity evaluations show larger variations between the networks, due to differences in how the networks learn features in the input. The robustness results exhibit similar variation: contrary to the MLP, the CNN shows higher skill scores for all input perturbation-based methods, such as SmoothGrad, FusionGrad, and Integrated Gradients, with the exception of NoiseGrad. Independent of the network architecture, explanations using averages across input perturbations, like SmoothGrad and Integrated Gradients, do not consistently increase and, in some cases, even decrease the faithfulness skill. Furthermore, sensitivity methods result in less faithful and more complex explanations but capture network parameter changes more reliably. In contrast, salience methods are less complex, except for LRP-α-β explaining the CNN. Moreover, salience methods exhibit a higher faithfulness skill and lower randomization skill than sensitivity methods, consistent with findings in Mamalakis et al. (2022b,a) and in line with salience methods presenting the contribution of each input pixel rather than the sensitivity to it (see section 4b), due to the input multiplication. Contrary to previous research (Mamalakis et al. 2022a), the LRP composite rule was an outlier among the salience methods, sacrificing a faithful explanation for an improved complexity skill and higher robustness. Similarly, LRP-α-β and DeepSHAP stand out as exceptions among the salience methods applied to the CNN due to almost consistently lower skill scores. We attribute both findings to the mathematical definitions of the methods: the LRP composite rule is optimized toward improved interpretation, resulting in less feature content; DeepSHAP is based on feature removal and modified gradient backpropagation and is vulnerable to correlation among the CNN’s features; and LRP-α-β neglects negatively contributing neurons during backpropagation.
Last, we propose a framework using XAI evaluation to support the selection of an appropriate XAI method for a specific research task. The first step is to identify important XAI properties for the network and data, followed by calculating evaluation skill scores across the properties for different XAI methods. Then, the resulting skill scores across XAI methods can be ranked or compared directly to determine the best-performing method or combination of methods. In our case study, LRP-z (alongside input times gradient and Integrated Gradients) yields suitable results in the MLP task, allowing the reassessment of our motivating example (Fig. 7) and the trustworthy interpretation of the NA region as a contributing input feature.
Overall, our results demonstrate the value of XAI evaluation for climate AI research. Due to their technical and theoretical differences (Letzgus et al. 2022; Han et al. 2022; Flora et al. 2022), the various explanation methods can reveal different aspects of the network decision and exhibit different strengths and weaknesses. Evaluation metrics allow us to compare explanation methods by assessing their suitability and properties in different explainability contexts. Alongside benchmark datasets, evaluation metrics thus also contribute to the benchmarking of explanation methods. XAI evaluation can support researchers in the choice of an explanation method, independent of the network structure and targeted to their specific research problem.
Acknowledgments.
This work was funded by the German Ministry for Education and Research through project Explaining 4.0 (ref. 01IS200551). M. K. acknowledges funding from XAIDA (European Union’s Horizon 2020 research and innovation program under Grant Agreement 101003469). The authors also thank the CESM Large Ensemble Community Project (Kay et al. 2015) for making the data publicly available. Support for the Twentieth Century Reanalysis Project version 3 dataset is provided by the U.S. Department of Energy, the Office of Science Biological and Environmental Research (BER), the National Oceanic and Atmospheric Administration Climate Program Office, and the NOAA Earth System Research Laboratory Physical Sciences Laboratory.
Data availability statement.
Our study is based on the RCP8.5 configuration of the CESM1 Large Ensemble simulations (https://www.cesm.ucar.edu/community-projects/lens/instructions). The data are freely available (https://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html). The source code for all experiments will be accessible at https://github.com/philine-bommer/Climate_X_Quantus. All experiments and code are based on Python v3.7.6, NumPy v1.19 (Harris et al. 2020), SciPy v1.4.1 (Virtanen et al. 2020), and colormaps provided by Matplotlib v3.2.2 (Hunter 2007). Additional Python packages used for the development of the ANN, explanation methods, and evaluation include Keras/TensorFlow (Abadi et al. 2016), iNNvestigate (Alber et al. 2019), and Quantus (Hedström et al. 2023b). We implemented all explanation methods except for NoiseGrad and FusionGrad using iNNvestigate (Alber et al. 2019). For the XAI methods by Bykov et al. (2022b) and for Quantus (Hedström et al. 2023b), we present a Keras/TensorFlow (Abadi et al. 2016) adaptation in our repository. All dataset references are provided throughout the study.
APPENDIX A
Additional Methodology
a. Explanations
To provide a theoretical background, we give the formulas for the different XAI methods compared in this work in the following.
1) Gradient
2) Input times gradient
3) Integrated Gradients
4) LRP
For LRP, the relevances of each neuron i in each layer l are calculated based on the relevances of all connected neurons j in the later layer l + 1 (Samek et al. 2017; Montavon et al. 2017).
5) SmoothGrad
6) NoiseGrad
7) FusionGrad
8) DeepSHAP (Lundberg and Lee 2017)
b. Evaluation metrics
1) Random baseline
Similar to Rieger and Hansen (2020), we establish a random baseline as an uninformative baseline explanation. The artificial explanation is drawn from a random uniform distribution and serves as the reference in the skill score calculation.
2) Score calculation
As discussed in section 3f, we calculate the skill score according to the optimal metric outcome. Thus, the skill scores reported for the average sensitivity, local Lipschitz estimate, ROAD, complexity, model parameter randomization test, and random logit metrics are calculated based on the first case of Eq. (15), while the skill scores based on the faithfulness correlation, Top-K, relevance rank accuracy, and sparseness scores are calculated following the bottom case of Eq. (15).
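Equation (15) itself is not reproduced in this excerpt; the sketch below therefore shows only one common skill score form relative to the random-baseline score, with the two branches corresponding to the "lower is better" and "higher is better" cases described above (the exact formulation of Eq. (15) may differ), together with an illustrative random-baseline explanation.

```python
import numpy as np

def skill_score_sketch(score, random_score, lower_is_better, optimum):
    """Skill relative to an uninformative random-baseline explanation:
    1 corresponds to the optimal metric outcome and 0 to no improvement
    over the random baseline (one common form, not necessarily Eq. (15))."""
    if lower_is_better:                                # e.g., average sensitivity
        return (random_score - score) / (random_score - optimum)
    return (score - random_score) / (optimum - random_score)

# Uninformative baseline explanation drawn from a uniform distribution,
# as described above (shape and value range are illustrative).
rng = np.random.default_rng(0)
a_random = rng.uniform(-1.0, 1.0, size=(96, 144))
```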
An exception is the ROAD metric: as discussed in section 3, the curve used in the AUC calculation results from the average over N = 50 samples. Thus, we repeat the AUC calculation for V = 10 draws of N = 50 samples and calculate the mean skill score and the SEM.
APPENDIX B
Additional Experiments
a. Network and explanation
Aside from the learning rate l (lCNN = 0.001), we maintain a similar set of hyperparameters to Labe and Barnes (2021) and use the fuzzy classification setup for the performance validation. To assess the predictions of the network for each individual input, we include the network predictions for the 20CRv3 data, i.e., observations (Slivinski et al. 2019). We measure performance using both the RMSE = R between true year
Fig. B1. Learning curve of the MLP including (a) accuracy and (b) loss. In both plots, the scatter graph represents the training performance, and the line graph represents the performance on the validation data.
Fig. B2. Learning curve of the CNN including (a) accuracy and (b) loss. In both plots, the scatter graph represents the training performance, and the line graph represents the performance on the validation data.
In Fig. B3, we also show the number of correct predictions for both architectures (all points on the regression line). In these graphs, we observe changing numbers of correct predictions across different years. Thus, we apply all explanation methods to the full model data Ω, to ensure access to correct samples across all years.
Fig. B3. Network performance based on the RMSE of the predicted years to the true years of both the (a) MLP and (b) CNN [cf. Fig. 3c in Labe and Barnes (2021)]. The light gray dots correspond to the agreement of the predictions based on the training and validation data to the actual years and the dark gray dots show agreement between the predictions on the test set and the actual years, with the black line showing the linear regression across the full model data (training, validation, and test data). In blue, we also included the predictions on the reanalysis data with the linear regression line in dark blue.
We show examples of the MLP and CNN across all explanation methods in Figs. B4 and B5. Following Labe and Barnes (2021), we adopt a criterion requiring a correct year regression within an error of ±2 years, to identify a correct prediction. We average correct predictions across ensemble members and display time periods of 40 years based on the temporal average of explanations [see Fig. 6 in Labe and Barnes (2021)].
Fig. B4. The MLP explanation maps averaged over 1920–60, 1960–2000, 2000–40, and 2040–80 for all XAI methods. The first row shows the average input temperature map.
Fig. B5. The CNN explanation maps averaged over 1920–60, 1960–2000, 2000–40, and 2040–80 for all XAI methods. The first row shows the average input temperature map.
In comparison, both figures highlight the difference in the learned spatial patterns: the CNN relevance focuses on groups of pixels, whereas the MLP relevance can vary pixelwise. In Table B1, we list the hyperparameters of the explanation methods compared in our experiments, using the notation introduced in appendix A, section a. We use Integrated Gradients with the baseline given in Table B1 (see the sketch after Table B1).
Table B1. The hyperparameters of the XAI methods. Note that parameters vary across explanation methods; we report only adjusted parameters and use a dash (—) for all others. We denote the maximum and minimum values across all temperature maps X in the dataset Ω as x_max and x_min, respectively.
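As a sketch of the Integrated Gradients computation, the Riemann-sum (trapezoidal) approximation below takes a generic baseline argument; the baseline, the model, and the number of steps are placeholders, with the settings actually used given in Table B1.

```python
import tensorflow as tf

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Trapezoidal approximation of Integrated Gradients for one input of
    shape (lat, lon, channels). The baseline is a placeholder argument; the
    baseline used in the experiments is reported in Table B1."""
    x = tf.cast(x, tf.float32)
    baseline = tf.cast(baseline, tf.float32)
    alphas = tf.linspace(0.0, 1.0, steps + 1)[:, None, None, None]
    interpolated = baseline[None] + alphas * (x[None] - baseline[None])
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        logits = model(interpolated)
        target = logits[:, target_class]
    grads = tape.gradient(target, interpolated)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # trapezoid rule
    return (x - baseline) * avg_grads
```

For the MLP, the input would be flattened and the broadcasting of the interpolation coefficients adapted accordingly.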
b. Evaluation metrics
1) Hyperparameters
In Table B2, we list the hyperparameters of the different metrics. We list only the adapted parameters; for all others, we use the Quantus default values (see Hedström et al. 2023b). The normalization parameter refers to the normalization of the explanation according to Eq. (A12).
Table B2. The hyperparameters of the XAI evaluation metrics based on the Quantus package calculations (Hedström et al. 2023b). We consider the metrics average sensitivity (AS), local Lipschitz estimate (LLE), faithfulness correlation (FC), ROAD, model parameter randomization test (MPT), random logit (RL), complexity (COM), sparseness (SPA), Top-K, and relevance rank accuracy (RRA). Note that parameters vary across metrics, and we report settings only for parameters that exist in each metric (for all others, we use a dash —).
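For orientation, a single metric evaluation with Quantus default hyperparameters might look like the sketch below; the tiny model, the random data, and the choice of the sparseness metric are placeholders rather than the configuration of our experiments, and the call signature follows the Quantus documentation (Hedström et al. 2023b).

```python
import numpy as np
import tensorflow as tf
import quantus  # Hedström et al. (2023b)

# Tiny stand-in classifier and random data, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(96, 144, 1)),
    tf.keras.layers.Dense(20, activation="softmax"),
])
rng = np.random.default_rng(0)
x_batch = rng.normal(size=(8, 96, 144, 1)).astype("float32")
y_batch = rng.integers(0, 20, size=8)
a_batch = rng.normal(size=(8, 96, 144, 1))  # precomputed explanations (placeholder)

# Instantiating a metric without arguments keeps the Quantus default
# hyperparameters, mirroring how unlisted parameters are handled in Table B2.
metric = quantus.Sparseness()
scores = metric(model=model, x_batch=x_batch, y_batch=y_batch, a_batch=a_batch)
print(np.mean(scores))
```

Robustness, faithfulness, and randomization metrics additionally take an explanation function so that explanations can be recomputed for perturbed inputs or randomized models.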
(i) Faithfulness
In Table B2, the perturbation function “Indices” refers to a baseline replacement at the indices of the highest-valued pixels in the explanation, and “Linear” refers to noisy linear imputation [see Rong et al. (2022a) for details]. Please note that the evaluation of the faithfulness property depends strongly on the choice of perturbation baseline. For standardized weather data, as used here, we therefore advise choosing the uniform baseline, as it most strongly resembles noise.
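A minimal sketch of such an index-based perturbation with a uniform baseline, assuming standardized maps whose extrema are approximated by illustrative x_min/x_max values and an illustrative perturbation fraction:

```python
import numpy as np

def perturb_top_pixels_uniform(x, explanation, frac=0.1, x_min=-3.0, x_max=3.0, seed=0):
    """Replace the `frac` highest-relevance pixels of a temperature map with
    values drawn uniformly from [x_min, x_max].

    x_min/x_max stand in for the dataset-wide extrema of the standardized
    maps; frac and the seed are illustrative choices, not the paper's settings."""
    rng = np.random.default_rng(seed)
    flat = explanation.ravel()
    k = int(frac * flat.size)
    idx = np.argsort(flat)[-k:]                  # indices of the most relevant pixels
    x_pert = x.copy().ravel()
    x_pert[idx] = rng.uniform(x_min, x_max, size=k)
    return x_pert.reshape(x.shape)
```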
(ii) Randomization
For the model parameter randomization test score calculations, we randomly perturb the layer weights starting from the output layer and proceeding toward the input layer, referred to as “bottom_up” in Table B2. To ensure comparability, we use the Pearson correlation as the similarity function for both randomization metrics.
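A generic sketch of such a cascading randomization is given below; the explain_fn helper, the noise scale, and the layer-wise reinitialization are hypothetical and do not reproduce the Quantus implementation.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr

def parameter_randomization_correlations(model, explain_fn, x, y, seed=0):
    """Cascadingly randomize layer weights from the output toward the input and
    record the Pearson correlation between the original explanation and the
    explanation of the progressively randomized model.

    `explain_fn(model, x, y)` is a hypothetical helper returning an explanation
    array for one sample."""
    rng = np.random.default_rng(seed)
    reference = explain_fn(model, x, y)
    randomized = tf.keras.models.clone_model(model)
    randomized.set_weights(model.get_weights())
    correlations = []
    for layer in reversed(randomized.layers):        # "bottom_up": output -> input
        new_weights = [rng.normal(scale=0.05, size=w.shape) for w in layer.get_weights()]
        layer.set_weights(new_weights)
        corr, _ = pearsonr(reference.ravel(), explain_fn(randomized, x, y).ravel())
        correlations.append(corr)
    return correlations
```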
(iii) Localization
For Top-K, we consider k = 0.1d, i.e., the 10% most relevant of all d pixels in the temperature map.
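A minimal sketch of the corresponding Top-K overlap, assuming a hypothetical binary localization mask that stands in for the metric's ground-truth region:

```python
import numpy as np

def top_k_intersection(explanation: np.ndarray, mask: np.ndarray, frac: float = 0.1) -> float:
    """Fraction of the k = frac * d most relevant pixels that fall inside a
    binary localization mask (the mask is a placeholder for the localization
    target used by the metric)."""
    d = explanation.size
    k = int(frac * d)
    top_idx = np.argsort(explanation.ravel())[-k:]   # k most relevant pixels
    return float(mask.ravel()[top_idx].mean())
```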
REFERENCES
Abadi, M., and Coauthors, 2016: TensorFlow: A system for large-scale machine learning. Proc. 12th USENIX Symp. on Operating Systems Design and Implementation (OSDI’16), Savannah, GA, USENIX Association, 265–283, https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
Adebayo, J., J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, 2018: Sanity checks for saliency maps. NIPS’18: Proc. 32nd Int. Conf. on Neural Information Processing Systems, Montréal, Canada, Curran Associates Inc., 9525–9536, https://dl.acm.org/doi/10.5555/3327546.3327621.
Agarwal, C., S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, and H. Lakkaraju, 2022: OpenXAI: Towards a transparent evaluation of model explanations. Advances in Neural Information Processing Systems, S. Koyejo et al., Eds., Curran Associates Inc., 15 784–15 799.
Alber, M., and Coauthors, 2019: iNNvestigate neural networks! J. Mach. Learn. Res., 20 (93), 1–8.
Alvarez-Melis, D., and T. S. Jaakkola, 2018a: Towards robust interpretability with self-explaining neural networks. NIPS’18: Proc. 32nd Int. Conf. on Neural Information Processing Systems, Montréal, Canada, Curran Associates Inc., 7786–7795, https://dl.acm.org/doi/10.5555/3327757.3327875.
Alvarez-Melis, D., and T. S. Jaakkola, 2018b: On the robustness of interpretability methods. arXiv, 1806.08049v1, https://doi.org/10.48550/arXiv.1806.08049.
Anantrasirichai, N., J. Biggs, F. Albino, and D. Bull, 2019: A deep learning approach to detecting volcano deformation from satellite imagery using synthetic datasets. Remote Sens. Environ., 230, 111179, https://doi.org/10.1016/j.rse.2019.04.032.
Ancona, M., E. Ceolini, C. Öztireli, and M. Gross, 2019: Gradient-based attribution methods. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Samek, W. et al., Eds., Springer International Publishing, 169–191, https://doi.org/10.1007/978-3-030-28954-6_9.
Arias-Duart, A., F. Parés, D. Garcia-Gasulla, and V. Giménez-Ábalos, 2022: Focus! Rating XAI methods and finding biases. 2022 IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE), Padua, Italy, Institute of Electrical and Electronics Engineers, 1–8, https://doi.org/10.1109/FUZZ-IEEE55066.2022.9882821.
Arras, L., A. Osman, and W. Samek, 2020: Ground truth evaluation of neural network explanations with CLEVR-XAI. arXiv, 2003.07258v2, https://doi.org/10.48550/arXiv.2003.07258.
Arrieta, A. B., and Coauthors, 2020: Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion, 58, 82–115, https://doi.org/10.1016/j.inffus.2019.12.012.
Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, 2015: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10, e0130140, https://doi.org/10.1371/journal.pone.0130140.
Baehrens, D., T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, 2010: How to explain individual classification decisions. J. Mach. Learn. Res., 11, 1803–1831.
Balduzzi, D., M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, 2017: The Shattered Gradients Problem: If resnets are the answer, then what is the question? arXiv, 1702.08591v2, https://doi.org/10.48550/arXiv.1702.08591.
Barnes, E. A., B. Toms, J. W. Hurrell, I. Ebert-Uphoff, C. Anderson, and D. Anderson, 2020: Indicator patterns of forced change learned by an artificial neural network. J. Adv. Model. Earth Syst., 12, e2020MS002195, https://doi.org/10.1029/2020MS002195.
Barnes, E. A., R. J. Barnes, and N. Gordillo, 2021: Adding uncertainty to neural network regression tasks in the geosciences. arXiv, 2109.07250v1, https://doi.org/10.48550/arXiv.2109.07250.
Bhatt, U., A. Weller, and J. M. Moura, 2020: Evaluating and aggregating feature-based model explanations. arXiv, 2005.00631v1, https://doi.org/10.48550/arXiv.2005.00631.
Brocki, L., and N. C. Chung, 2022: Fidelity of interpretability methods and perturbation artifacts in neural networks. arXiv, 2203.02928v4, https://doi.org/10.48550/arXiv.2203.02928.
Bromberg, C. L., C. Gazen, J. J. Hickey, J. Burge, L. Barrington, and S. Agrawal, 2019: Machine learning for precipitation nowcasting from radar images. arXiv, 1912.12132v1, https://arxiv.org/abs/1912.12132.
Bykov, K., M. M.-C. Höhne, A. Creosteanu, K.-R. Müller, F. Klauschen, S. Nakajima, and M. Kloft, 2021: Explaining Bayesian neural networks. arXiv, 2108.10346v1, https://doi.org/10.48550/arXiv.2108.10346.
Bykov, K., M. Deb, D. Grinwald, K.-R. Müller, and M. M.-C. Höhne, 2022a: DORA: Exploring outlier representations in deep neural networks. arXiv, 2206.04530v4, https://doi.org/10.48550/arXiv.2206.04530.
Bykov, K., A. Hedström, S. Nakajima, and M. M.-C. Höhne, 2022b: NoiseGrad: Enhancing explanations by introducing stochasticity to model weights. Proc. 36th AAAI Conf. on Artificial Intelligence, Online, AAAI Press, 6132–6140, https://doi.org/10.1609/aaai.v36i6.20561.
Camps-Valls, G., M. Reichstein, X. Zhu, and D. Tuia, 2020: Advancing deep learning for earth sciences from hybrid modeling to interpretability. IGARSS 2020 – 2020 IEEE Int. Geoscience and Remote Sensing Symp., Waikoloa, HI, Institute of Electrical and Electronics Engineers, 3979–3982, https://doi.org/10.1109/IGARSS39084.2020.9323558.
Chalasani, P., J. Chen, A. R. Chowdhury, S. Jha, and X. Wu, 2020: Concise explanations of neural networks using adversarial training. ICML’20: Proc. 37th Int. Conf. on Machine Learning, Online, PMLR, https://proceedings.mlr.press/v119/chalasani20a/chalasani20a.pdf.
Chen, C., O. Li, D. Tao, A. J. Barnett, C. Rudin, and J. K. Su, 2019: This looks like that: Deep learning for interpretable image recognition. NIPS’19: Proc. 33rd Int. Conf. on Neural Information Processing Systems, Vancouver, British Columbia, Canada, 8930–8941, https://dl.acm.org/doi/10.5555/3454287.3455088.
Chen, K., P. Wang, X. Yang, N. Zhang, and D. Wang, 2020: A model output deep learning method for grid temperature forecasts in Tianjin area. Appl. Sci., 10, 5808, https://doi.org/10.3390/app10175808.
Clare, M. C., M. Sonnewald, R. Lguensat, J. Deshayes, and V. Balaji, 2022: Explainable artificial intelligence for Bayesian neural networks: Toward trustworthy predictions of ocean dynamics. J. Adv. Model. Earth Syst., 14, e2022MS003162, https://doi.org/10.1002/essoar.10511239.1.
Covert, I. C., S. Lundberg, and S.-I. Lee, 2021: Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res., 22, 9477–9566.
Das, A., and P. Rad, 2020: Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv, 2006.11371v2, https://doi.org/10.48550/arXiv.2006.11371.
Dikshit, A., and B. Pradhan, 2021: Interpretable and explainable AI (XAI) model for spatial drought prediction. Sci. Total Environ., 801, 149797, https://doi.org/10.1016/j.scitotenv.2021.149797.
Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning and interpretation of neural networks for working with images in meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149–E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.
Felsche, E., and R. Ludwig, 2021: Applying machine learning for drought prediction in a perfect model framework using data from a large ensemble of climate simulations. Nat. Hazards Earth Syst. Sci., 21, 3679–3691, https://doi.org/10.5194/nhess-21-3679-2021.
Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022: Comparing explanation methods for traditional machine learning models Part 1: An overview of current methods and quantifying their disagreement. arXiv, 2211.08943v1, https://doi.org/10.48550/arXiv.2211.08943.
Gautam, S., A. Boubekki, S. Hansen, S. Salahuddin, R. Jenssen, M. Höhne, and M. Kampffmeyer, 2022: ProtoVAE: A trustworthy self-explainable prototypical variational model. 36th Conf. on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, NeurIPS, 17 940–17 952, https://proceedings.neurips.cc/paper_files/paper/2022/file/722f3f9298a961d2639eadd3f14a2816-Paper-Conference.pdf.
Gautam, S., M. M.-C. Höhne, S. Hansen, R. Jenssen, and M. Kampffmeyer, 2023: This looks more like that: Enhancing self-explaining models by prototypical relevance propagation. Pattern Recognit., 136, 109172, https://doi.org/10.1016/j.patcog.2022.109172.
Gevaert, A., A.-J. Rousseau, T. Becker, D. Valkenborg, T. De Bie, and Y. Saeys, 2022: Evaluating feature attribution methods in the image domain. arXiv, 2202.12270v1, https://doi.org/10.48550/arXiv.2202.12270.
Gibson, P. B., W. E. Chapman, A. Altinok, L. Delle Monache, M. J. DeFlorio, and D. E. Waliser, 2021: Training machine learning models on climate model output yields skillful interpretable seasonal precipitation forecasts. Commun. Earth Environ., 2, 159, https://doi.org/10.1038/s43247-021-00225-4.
Grinwald, D., K. Bykov, S. Nakajima, and M. M.-C. Höhne, 2022: Visualizing the diversity of representations learned by Bayesian neural networks. arXiv, 2201.10859v2, https://doi.org/10.48550/arXiv.2201.10859.
Ham, Y.-G., J.-H. Kim, and J.-J. Luo, 2019: Deep learning for multi-year ENSO forecasts. Nature, 573, 568–572, https://doi.org/10.1038/s41586-019-1559-7.
Han, L., J. Sun, W. Zhang, Y. Xiu, H. Feng, and Y. Lin, 2017: A machine learning nowcasting method based on real-time reanalysis data. J. Geophys. Res. Atmos., 122, 4038–4051, https://doi.org/10.1002/2016JD025783.
Han, T., S. Srinivas, and H. Lakkaraju, 2022: Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. arXiv, 2206.01254v3, https://doi.org/10.48550/arXiv.2206.01254.
Harder, P., D. Watson-Parris, D. Strassel, N. Gauger, P. Stier, and J. Keuper, 2021: Emulating aerosol microphysics with machine learning. arXiv, 2109.10593v2, https://doi.org/10.48550/arXiv.2109.10593.
Harris, C. R., and Coauthors, 2020: Array programming with NumPy. Nature, 585, 357–362, https://doi.org/10.1038/s41586-020-2649-2.
He, S., X. Li, T. DelSole, P. Ravikumar, and A. Banerjee, 2021: Sub-seasonal climate forecasting via machine learning: Challenges, analysis, and advances. Proc. 35th Conf. AAAI Artificial Intelligence, Online, AAAI Press, https://doi.org/10.1609/aaai.v35i1.16090.
Hedström, A., P. Bommer, K. K. Wickstrøm, W. Samek, S. Lapuschkin, and M. M.-C. Höhne, 2023a: The meta-evaluation problem in explainable AI: Identifying reliable estimators with MetaQuantus. arXiv, 2302.07265v2, https://doi.org/10.48550/arXiv.2302.07265.
Hedström, A., L. Weber, D. Krakowczyk, D. Bareeva, F. Motzkus, W. Samek, S. Lapuschkin, and M. M.-C. Höhne, 2023b: Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res., 24 (34), 1–11.
Hengl, T., and Coauthors, 2017: SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE, 12, e0169748, https://doi.org/10.1371/journal.pone.0169748.
Hilburn, K. A., 2023: Understanding spatial context in convolutional neural networks using explainable methods: Application to interpretable GREMLIN. Artif. Intell. Earth Syst., 2, 220093, https://doi.org/10.1175/AIES-D-22-0093.1.
Hilburn, K. A., I. Ebert-Uphoff, and S. D. Miller, 2021: Development and interpretation of a neural-network-based synthetic radar reflectivity estimator using GOES-R satellite observations. J. Appl. Meteor. Climatol., 60, 3–21, https://doi.org/10.1175/JAMC-D-20-0084.1.
Hoffman, R. R., S. T. Mueller, G. Klein, and J. Litman, 2018: Metrics for explainable AI: Challenges and prospects. arXiv, 1812.04608v2, https://doi.org/10.48550/arXiv.1812.04608.
Hunter, J. D., 2007: Matplotlib: A 2D graphics environment. Comput. Sci. Eng., 9, 90–95, https://doi.org/10.1109/MCSE.2007.55.
Hurley, N., and S. Rickard, 2009: Comparing measures of sparsity. 2008 IEEE Workshop on Machine Learning for Signal Processing, Cancun, Mexico, Institute of Electrical and Electronics Engineers, 4723–4741, https://doi.org/10.1109/MLSP.2008.4685455.
Hurrell, J. W., and Coauthors, 2013: The community earth system model: A framework for collaborative research. Bull. Amer. Meteor. Soc., 94, 1339–1360, https://doi.org/10.1175/BAMS-D-12-00121.1.
Janzing, D., L. Minorics, and P. Blöbaum, 2020: Feature relevance quantification in explainable AI: A causal problem. Proc. 23rd Int. Conf. on Artificial Intelligence and Statistics, Palermo, Italy, PMLR, 2907–2916, https://proceedings.mlr.press/v108/janzing20a/janzing20a.pdf.
Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bull. Amer. Meteor. Soc., 96, 1333–1349, https://doi.org/10.1175/BAMS-D-13-00255.1.
Krishna, S., T. Han, A. Gu, J. Pombra, S. Jabbari, S. Wu, and H. Lakkaraju, 2022: The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv, 2202.01602v3, https://doi.org/10.48550/arXiv.2202.01602.
Labe, Z. M., and E. A. Barnes, 2021: Detecting climate signals using explainable AI with single-forcing large ensembles. J. Adv. Model. Earth Syst., 13, e2021MS002464, https://doi.org/10.1029/2021MS002464.
Labe, Z. M., and E. A. Barnes, 2022: Comparison of climate model large ensembles with observations in the Arctic using simple neural networks. Earth Space Sci., 9, e2022EA002348, https://doi.org/10.1002/essoar.10510977.1.
Lapuschkin, S., S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, 2019: Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096, https://doi.org/10.1038/s41467-019-08987-4.
Leinonen, J., D. Nerini, and A. Berne, 2021: Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Trans. Geosci. Remote Sens., 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790.
Letzgus, S., P. Wagner, J. Lederer, W. Samek, K.-R. Müller, and G. Montavon, 2022: Toward explainable artificial intelligence for regression models: A methodological perspective. IEEE Signal Process. Mag., 39, 40–58, https://doi.org/10.1109/MSP.2022.3153277.
Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Curran Associates Inc., 4765–4774, https://dl.acm.org/doi/10.5555/3295222.3295230.
Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2020: Explainable artificial intelligence in meteorology and climate science: Model fine-tuning, calibrating trust and learning new science. xxAI: Int. Workshop on Extending Explainable AI beyond Deep Models and Classifiers, Vienna, Austria, Springer, 315–339, https://doi.org/10.1007/978-3-031-04083-2_16.
Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2022a: Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst., 1, e220012, https://doi.org/10.1175/AIES-D-22-0012.1.
Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2022b: Neural network attribution methods for problems in geoscience: A novel synthetic benchmark dataset. Environ. Data Sci., 1, e8, https://doi.org/10.1017/eds.2022.7.
Mayer, K. J., and E. A. Barnes, 2021: Subseasonal forecasts of opportunity identified by an explainable neural network. Geophys. Res. Lett., 48, e2020GL092092, https://doi.org/10.1029/2020GL092092.
McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
Mohseni, S., N. Zarei, and E. D. Ragan, 2021: A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Trans. Interact. Intell. Syst., 11 (3–4), 1–45, https://doi.org/10.1145/3387166.
Montavon, G., S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, 2017: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit., 65, 211–222, https://doi.org/10.1016/j.patcog.2016.11.008.
Montavon, G., W. Samek, and K.-R. Müller, 2018: Methods for interpreting and understanding deep neural networks. Digit. Signal Process., 73, 1–15, https://doi.org/10.1016/j.dsp.2017.10.011.
Montavon, G., A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, 2019: Layer-wise relevance propagation: An overview. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer, 193–209.
Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116, 2417–2424, https://doi.org/10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.
Murphy, A. H., and H. Daan, 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, CRC Press, 379–437.
Nguyen, A., A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, 2016: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. NIPS’16: Proc. 30th Int. Conf. on Neural Information Processing Systems, Barcelona, Spain, Curran Associates Inc., 3395–3403, https://dl.acm.org/doi/10.5555/3157382.3157477.
Nguyen, A.-p., and M. R. Martínez, 2020: On quantitative aspects of model interpretability. arXiv, 2007.07584v1, https://doi.org/10.48550/arXiv.2007.07584.
Olah, C., A. Mordvintsev, and L. Schubert, 2017: Feature visualization. Distill, 2, e7, https://doi.org/10.23915/distill.00007.
Pegion, K., E. J. Becker, and B. P. Kirtman, 2022: Understanding predictability of daily southeast U.S. precipitation using explainable machine learning. Artif. Intell. Earth Syst., 1, e220011, https://doi.org/10.1175/AIES-D-22-0011.1.
Petsiuk, V., A. Das, and K. Saenko, 2018: RISE: Randomized input sampling for explanation of black-box models. arXiv, 1806.07421v3, https://doi.org/10.48550/arXiv.1806.07421.
Ribeiro, M. T., S. Singh, and C. Guestrin, 2016: “Why should I trust you?”: Explaining the predictions of any classifier. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, Association for Computing Machinery, 1135–1144, https://doi.org/10.1145/2939672.2939778.
Rieger, L., and L. K. Hansen, 2020: IROF: A low resource evaluation metric for explanation methods. arXiv, 2003.08747v1, https://doi.org/10.48550/arXiv.2003.08747.
Rong, Y., T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci, 2022a: A consistent and efficient evaluation strategy for attribution methods. arXiv, 2202.00449v2, https://doi.org/10.48550/arXiv.2202.00449.
Rong, Y., T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci, 2022b: Evaluating feature attribution: An information-theoretic perspective. arXiv, 2202.00449v1, https://www.hci.uni-tuebingen.de/assets/pdf/publications/rong2022evaluating.pdf.
Samek, W., A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, 2017: Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Networks Learn. Syst., 28, 2660–2673, https://doi.org/10.1109/TNNLS.2016.2599820.
Samek, W., G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller, 2019: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Vol. 11700. Springer Nature, 439 pp.
Sawada, Y., and K. Nakamura, 2022: C-SENN: Contrastive self-explaining neural network. arXiv, 2206.09575v2, https://doi.org/10.48550/arXiv.2206.09575.
Scher, S., and G. Messori, 2021: Ensemble methods for neural network-based weather forecasts. J. Adv. Model. Earth Syst., 13, e2020MS002331, https://doi.org/10.1029/2020MS002331.
Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, 2017: Grad-CAM: Visual explanations from deep networks via gradient-based localization. 2017 Proc. IEEE Int. Conf. on Computer Vision, Venice, Italy, Institute of Electrical and Electronics Engineers, 618–626, https://doi.org/10.1109/ICCV.2017.74.
Shapley, L. S., 1951: Notes on the N-person game—II: The value of an N-person game. RAND Corporation, 19 pp., https://www.rand.org/pubs/research_memoranda/RM0670.html.
Shi, X., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, 2015: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. NIPS’15: Proc. 28th Int. Conf. on Neural Information Processing System, Montreal, Canada, MIT Press, 802–810, https://dl.acm.org/doi/proceedings/10.5555/2969239.
Shrikumar, A., P. Greenside, A. Shcherbina, and A. Kundaje, 2016: Not just a black box: Learning important features through propagating activation differences. arXiv, 1605.01713v3, https://doi.org/10.48550/arXiv.1605.01713.
Simonyan, K., A. Vedaldi, and A. Zisserman, 2014: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv, 1312.6034v2, https://arxiv.org/abs/1312.6034.
Sixt, L., M. Granz, and T. Landgraf, 2020: When explanations lie: Why many modified BP attributions fail. Proc. 37th Int. Conf. on Machine Learning (ICML 2020), Online, PMLR, https://researchr.org/publication/icml-2020.
Slivinski, L. C., and Coauthors, 2019: Towards a more reliable historical reanalysis: Improvements for version 3 of the twentieth century reanalysis system. Quart. J. Roy. Meteor. Soc., 145, 2876–2908, https://doi.org/10.1002/qj.3598.
Smilkov, D., N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, 2017: SmoothGrad: Removing noise by adding noise. arXiv, 1706.03825v1, https://doi.org/10.48550/arXiv.1706.03825.
Sonnewald, M., and R. Lguensat, 2021: Revealing the impact of global heating on North Atlantic circulation using transparent machine learning. J. Adv. Model. Earth Syst., 13, e2021MS002496, https://doi.org/10.1029/2021MS002496.
Strumbelj, E., and I. Kononenko, 2010: An efficient explanation of individual classifications using game theory. J. Mach. Learn. Res., 11, 1–18.
Sturmfels, P., S. Lundberg, and S.-I. Lee, 2020: Visualizing the impact of feature attribution baselines. Distill, 5, e22, https://doi.org/10.23915/distill.00022.
Sundararajan, M., A. Taly, and Q. Yan, 2017: Axiomatic attribution for deep networks. Proc. 34th Int. Conf. on Machine Learning, Sydney, New South Wales, Australia, PMLR, 3319–3328, https://dl.acm.org/doi/10.5555/3305890.3306024.
Theiner, J., E. Müller-Budack, and R. Ewerth, 2022: Interpretable semantic photo geolocation. 2022 Proc. IEEE/CVF Winter Conf. on Applications of Computer Vision, Waikoloa, HI, Institute of Electrical and Electronics Engineers, 750–760, https://doi.org/10.1109/WACV51458.2022.00154.
Tomsett, R., D. Harborne, S. Chakraborty, P. Gurram, and A. Preece, 2020: Sanity checks for saliency metrics. Proc. 34th AAAI Conf. on Artificial Intelligence, New York, NY, AAAI Press, 6021–6029, https://doi.org/10.1609/aaai.v34i04.6064.
Van Straaten, C., K. Whan, D. Coumou, B. Van den Hurk, and M. Schmeits, 2022: Using explainable machine learning forecasts to discover subseasonal drivers of high summer temperatures in western and central Europe. Mon. Wea. Rev., 150, 1115–1134, https://doi.org/10.1175/MWR-D-21-0201.1.
Vidovic, M. M.-C., N. Görnitz, K.-R. Müller, G. Rätsch, and M. Kloft, 2015: Opening the black box: Revealing interpretable sequence motifs in kernel-based learning algorithms. ECML PKDD 2015: Machine Learning and Knowledge Discovery in Databases, Porto, Portugal, Springer, 137–153, https://doi.org/10.1007/978-3-319-23525-7_9.
Vidovic, M. M.-C., N. Görnitz, K.-R. Müller, and M. Kloft, 2016: Feature importance measure for non-linear learning algorithms. arXiv, 1611.07567v1, https://doi.org/10.48550/arXiv.1611.07567.
Virtanen, P., and Coauthors, 2020: SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods, 17, 261–272, https://doi.org/10.1038/s41592-019-0686-2.
Wang, J., K. Gao, Z. Zhang, C. Ni, Z. Hu, D. Chen, and Q. Wu, 2021: Multisensor remote sensing imagery super-resolution with conditional GAN. J. Remote Sens., 2021, 9829706, https://doi.org/10.34133/2021/9829706.
Yang, M., and B. Kim, 2019: Benchmarking attribution methods with relative feature importance. arXiv, 1907.09701v2, https://doi.org/10.48550/arXiv.1907.09701.
Yeh, C.-K., C.-Y. Hsieh, A. S. Suggala, D. I. Inouye, and P. K. Ravikumar, 2019: On the (in)fidelity and sensitivity of explanations. NIPS’19: Proc. 33rd Int. Conf. on Neural Information Processing Systems, Vancouver, British Columbia, Canada, Curran Associates Inc., 10 967–10 978, https://dl.acm.org/doi/abs/10.5555/3454287.3455271.
Yuval, J., and P. A. O’Gorman, 2020: Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions. Nat. Commun., 11, 3295, https://doi.org/10.1038/s41467-020-17142-3.
Zhang, J., S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, 2018: Top-down neural attention by excitation backprop. Int. J. Comput. Vis., 126, 1084–1102, https://doi.org/10.1007/s11263-017-1059-x.
Zhou, Y., S. Booth, M. T. Ribeiro, and J. Shah, 2022: Do feature attribution methods correctly attribute features? Proc. 36th Conf. AAAI Artificial Intelligence, Online, AAAI Press, https://doi.org/10.1609/aaai.v36i9.21196.