Finding the Right XAI Method—A Guide for the Evaluation and Ranking of Explainable AI Methods in Climate Science

Philine Lou Bommer,a,b Marlene Kretschmer,c,d Anna Hedström,a,b Dilyara Bareeva,a,b and Marina M.-C. Höhnea,b,e,f,g

a Understandable Machine Intelligence Lab, Technical University Berlin, Berlin, Germany
b Department of Data Science, ATB, Potsdam, Germany
c Leipzig Institute for Meteorology, University of Leipzig, Leipzig, Germany
d Department of Meteorology, University of Reading, Reading, United Kingdom
e Institute of Computer Science, University of Potsdam, Potsdam, Germany
f Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
g Machine Learning Group, UiT The Arctic University of Norway, Tromsø, Norway

Open access

Abstract

Explainable artificial intelligence (XAI) methods shed light on the predictions of machine learning algorithms. Several different approaches exist and have already been applied in climate science. However, the typical lack of ground truth explanations complicates their evaluation and comparison, in turn impeding the choice of an XAI method. Therefore, in this work, we introduce XAI evaluation in the climate context and discuss different desired explanation properties, namely, robustness, faithfulness, randomization, complexity, and localization. To this end, we choose previous work as a case study in which the decade of annual-mean temperature maps is predicted. After training both a multilayer perceptron (MLP) and a convolutional neural network (CNN), multiple XAI methods are applied and their skill scores in reference to a random uniform explanation are calculated for each property. Independent of the network, we find that XAI methods such as Integrated Gradients, layerwise relevance propagation, and input times gradients exhibit considerable robustness, faithfulness, and complexity while sacrificing randomization performance. Sensitivity methods, i.e., gradient, SmoothGrad, NoiseGrad, and FusionGrad, match the robustness skill but sacrifice faithfulness and complexity for the randomization skill. We find architecture-dependent performance differences regarding the robustness, complexity, and localization skills of different XAI methods, highlighting the necessity for research task-specific evaluation. Overall, our work offers an overview of different evaluation properties in the climate science context and shows how to compare and benchmark different explanation methods, assessing their suitability based on strengths and weaknesses for the specific research problem at hand. By that, we aim to support climate researchers in the selection of a suitable XAI method.

Significance Statement

Explainable artificial intelligence (XAI) helps to understand the reasoning behind the prediction of a neural network. XAI methods have been applied in climate science to validate networks and provide new insight into physical processes. However, the increasing number of XAI methods can overwhelm practitioners, making it difficult to choose an explanation method. Since XAI methods’ results can vary, uninformed choices might cause misleading conclusions about the network decision. In this work, we introduce XAI evaluation to compare and assess the performance of explanation methods based on five desirable properties. We demonstrate that XAI evaluation reveals the strengths and weaknesses of different XAI methods. Thus, our work provides climate researchers with the tools to compare, analyze, and subsequently choose explanation methods.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Philine Bommer, philine.l.bommer@tu-berlin.de


1. Introduction

Deep learning (DL) has become a widely used tool in climate science and assists various tasks, such as nowcasting (Shi et al. 2015; Han et al. 2017; Bromberg et al. 2019), climate or weather monitoring (Hengl et al. 2017; Anantrasirichai et al. 2019) and forecasting (Ham et al. 2019; Chen et al. 2020; Scher and Messori 2021), numerical model enhancement (Yuval and O’Gorman 2020; Harder et al. 2021), and upsampling of satellite data (Wang et al. 2021; Leinonen et al. 2021). However, a deep neural network (DNN) is mostly considered a black box due to its inaccessible decision-making process. This lack of interpretability limits their trustworthiness and application in climate research, as DNNs should not only have high predictive performance but also provide accessible and consistent predictive reasoning aligned with existing theory (McGovern et al. 2019; Mamalakis et al. 2020; Camps-Valls et al. 2020; Sonnewald and Lguensat 2021; Clare et al. 2022; Flora et al. 2022). Explainable artificial intelligence (XAI) aims to address the lack of interpretability by explaining potential reasons behind the predictions of a network. In the climate context, XAI can help to validate DNNs and, for a well-performing model, provide researchers with new insights into physical processes (Ebert-Uphoff and Hilburn 2020; Hilburn et al. 2021). For example, Gibson et al. (2021) demonstrated using XAI that DNNs produce skillful seasonal precipitation forecasts based on known relevant physical processes. Moreover, XAI was used to improve the forecasting of droughts (Dikshit and Pradhan 2021), teleconnections (Mayer and Barnes 2021), and regional precipitation (Pegion et al. 2022); to assess external drivers of global climate change (Labe and Barnes 2021); and to understand subseasonal drivers of high-temperature summers (Van Straaten et al. 2022). Additionally, Labe and Barnes (2022) showed that XAI applications can aid in the comparative assessment of climate models.

Generally, explainability methods can be divided into ante hoc and post hoc approaches (Samek et al. 2019) (see Table 1). Ante hoc approaches modify the DNN architecture to improve interpretability, like adding an interpretable prototype layer to learn humanly understandable representations for different classes [see, e.g., Chen et al. (2019) and Gautam et al. (2022, 2023)] or constructing mathematically similar but interpretable models (Hilburn 2023). Such approaches are also called self-explaining neural networks and link to the field of interpretability (Flora et al. 2022). Post hoc XAI methods, on the other hand, can be applied to any neural network architecture (Samek et al. 2019), and here, we focus on three characterizing aspects (Samek et al. 2019; Letzgus et al. 2022; Mamalakis et al. 2022b), as shown in Table 1. The first considers the explanation target (i.e., what is explained), which can differ between local and global decision-making. While local explanations provide explanations of the network decision for a single data point (Baehrens et al. 2010; Bach et al. 2015; Vidovic et al. 2016; Ribeiro et al. 2016), e.g., by assessing the contribution of each pixel in a given image to the predicted class, global explanations reveal the overall decision strategy, e.g., by showing a map of important features or image patterns, learned by the model for the whole class (Vidovic et al. 2015; Nguyen et al. 2016; Lapuschkin et al. 2019; Grinwald et al. 2022; Bykov et al. 2022a). The second aspect concerns the components used to calculate the explanation, differentiating between model-aware and model-agnostic methods. Model-aware methods use components of the trained model for the explanation calculation, such as network weights, while model-agnostic methods consider the model as a black box and only assess the change in the output caused by a perturbation in the input (Strumbelj and Kononenko 2010; Ribeiro et al. 2016). 
The third aspect considers the DNN explanation output. Here, we can differentiate between methods in which the assigned value indicates the sensitivity of the network to that pixel, called sensitivity methods, such as the absolute gradient; methods that display the positive or negative contribution of a pixel to the class prediction, called salience methods, such as layerwise relevance propagation (LRP; see section 3); and methods presenting input examples leading to the same prediction. Beyond these three characteristics, recent efforts (Flora et al. 2022) also differentiate between feature importance methods, encompassing mostly global methods, which calculate feature contribution based on the network performance (e.g., accuracy), and feature relevance methods, describing mostly local methods, which calculate contributions to the model prediction. In climate research, the decision patterns learned by DNNs have been analyzed with local explanation methods such as LRP or Shapley values (Gibson et al. 2021; Dikshit and Pradhan 2021; Mayer and Barnes 2021; Labe and Barnes 2021; He et al. 2021; Felsche and Ludwig 2021; Labe and Barnes 2022). However, different local explanation methods can identify different input features as being important to the network decision, subsequently leading to different scientific conclusions (Sturmfels et al. 2020; Covert et al. 2021; Han et al. 2022; Flora et al. 2022). Thus, with the increasing number of XAI methods available, selecting the most suitable method for a specific task poses a challenge, and the practitioner’s choice of a method is often based upon popularity or easy access (Krishna et al. 2022). To navigate the field of XAI, recent climate science publications have compared and assessed different explanation techniques using benchmark datasets, where an XAI method is assessed by comparing its explanation with a defined target, considered as ground truth (Mamalakis et al. 2022b,a).
While benchmark datasets (Yang and Kim 2019; Arras et al. 2020; Agarwal et al. 2022) certainly contribute to the understanding of local XAI methods, the existence of a ground truth explanation is highly debated (e.g., Janzing et al. 2020; Sturmfels et al. 2020). In the case of DNNs, ground truth explanation labels can only be considered approximations and are not guaranteed to align precisely with the model’s decision process or the features it utilizes (Ancona et al. 2019; Hedström et al. 2023a). For exact ground truth, either perfect knowledge of how the model handles the available information or a carefully engineered model would be required, which is usually not the case. Additionally, post hoc explanation methods are generally only approximations of a model’s behavior (Lundberg and Lee 2017; Han et al. 2022), and the distinct mathematical concepts of the different methods consequently lead to distinct ground truth explanations.

Table 1.

Overview and categorization of research on the transparency and understandability of neural networks. For this categorization, we follow works such as Samek et al. (2019), Ancona et al. (2019), Mamalakis et al. (2022b), Letzgus et al. (2022), and Flora et al. (2022).


Here, we address these challenges by introducing XAI evaluation in the context of climate science to compare different local explanation methods. The field of XAI evaluation has emerged recently and refers to the development of metrics to compare, benchmark, and rank explanation methods in different explainability contexts (e.g., Adebayo et al. 2018; Hedström et al. 2023b,a). As discussed below in more detail, using evaluation metrics, we are able to quantitatively assess the robustness, complexity, localization, randomization, and faithfulness of explanation methods, making them comparable regarding their suitability, strengths, and weaknesses (Hoffman et al. 2018; Arrieta et al. 2020; Mohseni et al. 2021; Hedström et al. 2023b).

In this work, we discuss these properties in an exemplary manner and build upon work from Labe and Barnes (2021). In their work, an MLP was trained with global annual temperature anomaly maps, and the network’s task was to assign the respective year or decade of occurrence. The MLP achieves this assignment as global mean warming progresses. Using LRP, they then identified the signals relevant to the network’s decision and found the North Atlantic, Southern Ocean, and Southeast Asia to be key regions. Here, we use their work as a case study and train an MLP and a CNN for the same prediction task (see step 1 in Fig. 1). Then, we apply several explanation methods and show the variation in their explanation maps, potentially leading to different scientific insights (step 2 in Fig. 1). We then introduce XAI evaluation metrics and quantify the skill of the different XAI methods against a random baseline for each property to compare their performance with respect to the underlying task.

Fig. 1.

Schematic of the XAI evaluation procedure. Based on an annual temperature anomaly map as input, the network predicts the respective decade (box 1). The explanation methods (Grad—gradient, SG—SmoothGrad applied to gradient, and LRP—layerwise relevance propagation) then provide insights (i.e., “shine a light”; see box 2) into the specific network’s decision. The different explanation maps (marked in orange—Grad, green—SG, and blue—LRP) highlight different areas as positively (red) and negatively (blue) contributing to the network decision. Here, XAI evaluation can “shine a light” on the explanation methods and help choose a suitable method (here indicated by the first rank) since evaluation explores the explanation maps regarding their robustness, faithfulness, localization, complexity, and randomization properties.

Citation: Artificial Intelligence for the Earth Systems 3, 3; 10.1175/AIES-D-23-0074.1

This paper is structured as follows. In section 2, we discuss the used dataset and network types and briefly describe the different analyzed explanation methods. Section 3 introduces XAI evaluation and describes five evaluation properties. Then, in section 4, we first discuss the performance of both network types and provide a motivational example highlighting the risks of an uninformed choice of an explanation method. Next, we evaluate different XAI methods applied to the MLP, using two different metrics for each evaluation property, and then compare the XAI evaluation results for the different networks (see sections 4b and 4c). Finally, in section 4d, we present a guideline on using XAI evaluation to choose a suitable XAI method. The discussion of our results and our conclusions are presented in section 5.

2. Data and methods

a. Data

We analyze data simulated by the general climate model, CESM1 (Hurrell et al. 2013), focusing on the “ALL” configuration (Kay et al. 2015), which is discussed in detail in Labe and Barnes (2021). We use the global 2-m air temperature (T2m) maps from 1920 to 2080. The data Ω consist of I = 40 ensemble members Ω_i, i ∈ {1, …, I}, and each member is generated by varying the atmospheric initial conditions z_i with fixed external forcing θ_clima. Following Labe and Barnes (2021), we compute annual averages and apply bilinear interpolation. This results in T = 161 temperature maps for each member Ω_i ∈ ℝ^(T×υ×h), with υ = 144 and h = 95 denoting the number of longitudes and latitudes, with 1.9° sampling in latitude and 2.5° sampling in longitude. Accordingly, the whole dataset X ∈ ℝ^(I×T×υ×h) contains I × T samples. The data are split into a training set Ω_tr and a test set Ω_test. More precisely, we sample 20% of the ensemble members (i.e., in total eight ensemble members) as a test set X_test ∈ ℝ^(0.2I×T×υ×h) and use the remaining 80% (i.e., 32 ensemble members) for training and validation. Of these 32 ensemble members, all temperature maps are split into a training set (80% of the data points, i.e., 64% of all ensemble members) and a validation set (20% of the temperature maps, i.e., 16% of all ensemble members). All temperature maps x ∈ ℝ^(υ×h) are standardized by subtracting the mean and subsequently dividing by the corresponding standard deviation at each grid point individually, whereby the mean x_mean ∈ ℝ^(υ×h) and standard deviation x_std ∈ ℝ^(υ×h) are computed over the training set only.
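The split-and-standardize procedure can be sketched in a few lines of numpy. The array below is a small random stand-in (the actual dimensions are I = 40, T = 161, υ = 144, h = 95, shrunk here to keep the example light), and variable names such as `trainval` are illustrative:

```python
import numpy as np

# Stand-in dimensions (the paper uses I=40, T=161, v=144, h=95; shrunk here).
I, T, v, h = 40, 16, 18, 12
rng = np.random.default_rng(0)
data = rng.normal(loc=280.0, scale=15.0, size=(I, T, v, h))  # fake T2m maps

# Hold out 20% of the ensemble members (here 8 of 40) as the test set.
n_test = int(0.2 * I)
test, trainval = data[:n_test], data[n_test:]

# Standardize every grid point with statistics from the training data only.
x_mean = trainval.mean(axis=(0, 1))        # shape (v, h)
x_std = trainval.std(axis=(0, 1))          # shape (v, h)
trainval_std = (trainval - x_mean) / x_std
test_std = (test - x_mean) / x_std         # test set reuses training statistics
```

Note that the test maps are standardized with the training statistics, which avoids leaking test-set information into the preprocessing.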

b. Networks

Following Labe and Barnes (2021), we train an MLP f_MLP: ℝ^d → ℝ^C with network weights W, to solve a fuzzy classification problem by combining classification and regression. As input x ∈ Ω, the network considers the flattened temperature maps with dimensionality d = υ × h. Given the goal of fuzzy classification, first, the network assigns each map to one of the C = 20 different classes, where each class corresponds to one decade between 1900 and 2100 [see Fig. 1 in Labe and Barnes (2021)]. The network output f(x), thus, is a probability vector y ∈ ℝ^(1×C) across C = 20 classes. Afterward, since the network can assign a nonzero probability to more than one class, regression is used to predict the year ŷ of the input as follows:

ŷ = Σ_{i=1}^{C} y_i Ȳ_i,    (1)

where y_i is the probability of class i, predicted by the network y = f(x) in the classification step, and Ȳ_i denotes the central year of the corresponding decade class i (e.g., for class i = 1, Ȳ_1 = 1925 represents the decade 1920–29). Accordingly, the task ensures the association of temperature patterns with the respective year or decade. Here, we train using the binary cross-entropy loss, considering Eq. (1) only for performance evaluation.
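The regression step of Eq. (1) can be sketched as follows; the layout of the decade centers matches the class i = 1 → 1925 example in the text, but the full vector is an illustrative assumption:

```python
import numpy as np

def fuzzy_year(y, central_years):
    """Eq. (1): weighted sum of decade-center years by class probability."""
    return float(np.dot(y, central_years))

# Decade centers laid out so that class i=1 -> 1925 (decade 1920-29),
# as in the text; the full 20-class span is an illustrative assumption.
C = 20
central_years = 1925.0 + 10.0 * np.arange(C)   # 1925, 1935, ..., 2115

# Probability mass split evenly between the decades centered on 1995 and
# 2005 yields the year in between.
y = np.zeros(C)
y[7], y[8] = 0.5, 0.5
print(fuzzy_year(y, central_years))            # -> 2000.0
```

This illustrates why the fuzzy formulation is useful: even when the network hesitates between two neighboring decades, the probability-weighted sum still produces a sensible year estimate.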

Additionally, we construct a CNN f_CNN: ℝ^(υ×h) → ℝ^C that maintains the longitude–latitude grid of the data x_img ∈ ℝ^(υ×h) for each input sample (see section 2a), unlike the flattened input used for the MLP. The CNN consists of a 2D convolutional layer (2dConv) with a 6 × 6 window size and a stride of 2. The second layer is a max-pooling layer with a 2 × 2 window size, followed by a dense layer with L2 regularization and a softmax output layer.
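The feature-map sizes implied by these layer settings can be checked with the usual convolution arithmetic. The padding mode ('valid', i.e., unpadded) and the pooling stride are assumptions here, so the numbers are a sketch rather than the exact architecture:

```python
def conv2d_out(size, kernel, stride):
    """Output length along one axis for a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

# Longitude x latitude input grid and the layer settings from the text.
v, h = 144, 95
c1 = (conv2d_out(v, 6, 2), conv2d_out(h, 6, 2))    # 6x6 conv, stride 2
p1 = (c1[0] // 2, c1[1] // 2)                      # 2x2 max pooling, stride 2
print("conv:", c1, "pool:", p1)
```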

c. XAI

In this work, we focus on local model-aware explanation methods belonging to the group of feature attribution methods (Ancona et al. 2019; Das and Rad 2020; Zhou et al. 2022). For the mathematical details, we refer to appendix A, section a.

  1. Gradient (Baehrens et al. 2010) explains the network decision by computing the first partial derivative of the network output f(x) with respect to the input. This explanation method feeds backward the network’s prediction to the features in the input x, indicating the change in network prediction given a change in the respective features. The explanation values correspond to the network’s sensitivity to each feature, thus belonging to the group of sensitivity methods. The absolute gradient, often referred to as the saliency map, can also be used as an explanation (Simonyan et al. 2014).

  2. Input times gradient is an extension of the gradient method and computes the product of the gradient and the input. In the explanation map, a high relevance is assigned to an input feature if it has a high value and the model gradient is sensitive to it. Therefore, contrary to the gradient as a sensitivity method, input times gradient and other methods including the input pixel value are considered salience methods (Ancona et al. 2019) [or attribution methods, e.g., Mamalakis et al. (2022a)].

  3. Integrated Gradients (Sundararajan et al. 2017) extends input times gradient, by integrating a gradient along a line path from a baseline (generally a reference vector for which the network’s output is zero, e.g., all zeros for standardized data) to the explained sample x. In practice, the gradient explanations of a set of images lying between the baseline and x are averaged and multiplied by the difference between the baseline and the explained input [see Eq. (A3)]. Hence, the Integrated Gradients method is a salience method and highlights the difference between the features important to the prediction of x and the features important to the prediction of the baseline value.

  4. LRP (Bach et al. 2015; Montavon et al. 2019) computes the relevance for each input feature by feeding the network’s prediction backward through the model, layer by layer, until the prediction score is distributed over the input features and is a salience method. Different propagation rules can be used, all resembling the energy conservation rule, i.e., the sum of all relevances within one layer is equal to the original prediction score. In the case of the α-β rule, relevance is assigned at each layer to each neuron. All positively contributing activations of connected neurons in the previous layer are weighted by α, while β is used to weigh the contribution of the negative activations. The default values are α = 1 and β = 0, where only positively contributing activations are considered. Contrary to that, the z rule calculates the explanation by including both negative and positive neuron activations. Hence, the corresponding explanations, visualized as heatmaps, display both positive and negative evidence. The composite rule combines various rules for different layer types. The method accounts for layer structure variety in CNNs, such as fully connected, convolutional, and pooling layers.

  5. SmoothGrad (Smilkov et al. 2017) aims to filter out the background noise [i.e., the gradient shattering effect, where gradients resemble white noise with increasing layer number (Balduzzi et al. 2017)] to enhance the interpretability of the explanation. To this end, multiple noisy samples are generated by adding random noise to the input; then, the explanations of the noisy samples are computed and averaged, such that the most important features are enhanced and the less important features are “canceled out.”

  6. NoiseGrad (Bykov et al. 2022b) perturbs the weights of the model, instead of the input feature as done by SmoothGrad. The explanations, resulting from explaining the predictions made by the noisy versions of the model on the same image, are averaged to suppress the background noise of the image in the final explanation.

  7. FusionGrad (Bykov et al. 2022b) combines SmoothGrad and NoiseGrad by perturbing both the input features and the network weights. The purpose of the method is to account for uncertainties within the network and the input space (Bykov et al. 2021).

  8. Deep Shapley additive explanations (DeepSHAP) (Lundberg and Lee 2017) estimates Shapley values for the full DNN by dividing it into small network components, calculating the Shapley values, and averaging them across all components. The idea behind SHAP values is to fairly distribute the contribution of each feature to the prediction of a specific instance, considering all possible feature combinations. Following the game-theoretic concept of Shapley values (Shapley 1951), DeepSHAP satisfies properties such as local accuracy, missingness, and consistency (Lundberg and Lee 2017) and is a salience method.
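As a minimal sketch of Integrated Gradients (item 3 above), the path integral can be approximated with a Riemann sum of gradients along the line from the baseline to the input. The quadratic toy model and its analytic gradient are assumptions for illustration:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of the Integrated Gradients path integral:
    (x - baseline) times the average gradient along the line baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model (an assumption for illustration): f(x) = (w . x)^2,
# with analytic gradient 2 (w . x) w.
w = np.array([1.0, -2.0, 3.0])
f = lambda x: np.dot(w, x) ** 2
grad_f = lambda x: 2.0 * np.dot(w, x) * w

x = np.array([0.5, 1.0, -0.5])
baseline = np.zeros(3)                 # all-zero baseline, as used in the text
attr = integrated_gradients(grad_f, x, baseline)

# Completeness: the attributions sum to f(x) - f(baseline).
print(attr.sum(), f(x) - f(baseline))
```

The printed pair illustrates the completeness property that distinguishes Integrated Gradients from the plain gradient: the attributions account for the full change in the network output relative to the baseline.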
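The conservation idea behind LRP (item 4 above) can likewise be sketched for a single bias-free linear layer using the z rule; the layer sizes and the stabilizer `eps` are illustrative assumptions:

```python
import numpy as np

def lrp_z_rule(a, W, R_out, eps=1e-9):
    """One LRP backward pass through a bias-free linear layer (z rule):
    output relevance R_out is redistributed to the inputs in proportion
    to their contributions a_i * W[i, j] to each pre-activation z_j."""
    z = a @ W                              # pre-activations, shape (J,)
    s = R_out / (z + eps * np.sign(z))     # stabilized relevance ratios
    return a * (W @ s)                     # input relevances, shape (I,)

rng = np.random.default_rng(1)
a = rng.normal(size=4)                     # input activations (illustrative)
W = rng.normal(size=(4, 3))                # layer weights (illustrative)
R_out = rng.normal(size=3)                 # relevance from the layer above

R_in = lrp_z_rule(a, W, R_out)
# Conservation: the total relevance is (approximately) preserved.
print(np.allclose(R_in.sum(), R_out.sum()))
```

Applying such a step layer by layer, from the output score back to the input, yields the heatmaps described in the text; the α–β rule additionally splits the positive and negative contributions before redistribution.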

In this work, we maintain literature values for most hyperparameters of the explanation methods. Exceptions are hyperparameters of explanation methods such as NoiseGrad and FusionGrad. We adjust the perturbation levels of the parameters, as discussed in Bykov et al. (2022b), to ensure at most 5% loss in accuracy. All hyperparameters are presented in Table B1 (see appendix B, section a). Additionally, both Integrated Gradients and DeepSHAP require background images as reference points to calculate the explanations [see also Lundberg and Lee (2017) and appendix A, section a]. To allow for a fair performance comparison, for both methods, we sample 100 maps containing all zero values. We note that there are other possible reference values, e.g., natural images from the training set or all-ones maps, and this choice can affect the explanation performance (Sturmfels et al. 2020). Last, the baseline for SmoothGrad, NoiseGrad, and FusionGrad can be any local explanation method, and here, we use the gradient explanations. Accordingly, gradient, SmoothGrad, NoiseGrad, and FusionGrad are sensitivity methods.
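The averaging step shared by the SmoothGrad family can be sketched as follows. The linear toy model, noise level, and sample count are illustrative assumptions; its gradient is constant, so the noisy average recovers it exactly, whereas for real networks the averaging instead suppresses gradient shattering:

```python
import numpy as np

def smoothgrad(grad_f, x, noise_std=0.1, n_samples=200, seed=0):
    """SmoothGrad: average the gradient explanation over noisy copies of x."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(scale=noise_std, size=(n_samples,) + x.shape)
    return np.mean([grad_f(xi) for xi in noisy], axis=0)

# Toy linear model (an assumption for illustration): f(x) = w . x has
# the constant gradient w.
w = np.array([2.0, -1.0, 0.5])
grad_f = lambda x: w
x = np.array([0.3, -0.7, 1.2])
print(smoothgrad(grad_f, x))
```

NoiseGrad follows the same scheme but perturbs the model weights instead of `x`, and FusionGrad perturbs both.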

3. Evaluation techniques

Due to the lack of a ground truth explanation, XAI research developed alternative metrics to assess the reliability of an explanation method. These evaluation metrics analyze different properties an explanation method should fulfill and can serve to evaluate different explanation methods (Hoffman et al. 2018; Arrieta et al. 2020; Mohseni et al. 2021; Hedström et al. 2023b). Following Hedström et al. (2023b), we describe five different evaluation properties, and based on the classification task from Labe and Barnes (2021), we illustrate each property in a schematic diagram (see Figs. 2–4).

Fig. 2.

Diagram of the concept behind the robustness property. Perturbed input images are created by adding uniform noise maps of small magnitude to the original temperature map (left part of the figure). The perturbed maps are passed to the network, resulting in an explanation map for each prediction. The explanation maps of the perturbed inputs (explanation maps with gray outlines) are then compared to (indicated by a minus sign) the explanation of the unperturbed input (explanation map with black outline). A robust XAI method is expected to produce similar explanations for the perturbed inputs and unperturbed inputs.


Fig. 3.

Diagram of the concept behind the faithfulness property. Faithfulness assesses the impact of highly relevant pixels in the explanation map on the network decision. First, the explanation values are sorted to identify the highest relevance values (here shown in red). Next, the corresponding pixel positions in the flattened input temperature map are identified (see dotted arrows) and masked (marked in black); i.e., their value is set to a chosen masking value, such as 0 or 1. Both the masked and the original input maps are passed through the network, and their predictions are compared. If the masking is based on a faithful explanation, the prediction of the masked input (j; gray) is expected to change compared with (indicated by a minus sign) the unmasked input (i; black), e.g., a different decade is predicted.


Fig. 4.

Diagram of the concept behind the complexity property. Complexity assesses how the evidence values are distributed across the explanation map. For this, the distribution of the relevance values from the original explanation is compared to a “random” explanation drawn from a random uniform distribution. Here, shown in a 1D example, the evidence distribution of the explanation exhibits clear maxima and minima (see maxima in red oval), which is considered desirable and linked to increased scores. The noisy features show a uniform distribution linked to a low complexity score.


a. Robustness

Robustness measures the stability of an explanation regarding small changes δ in the input image x + δ (Alvarez-Melis and Jaakkola 2018b; Yeh et al. 2019; Montavon et al. 2018). Ideally, these small changes (‖δ‖ < ϵ) in the input sample should produce only small changes in the model prediction and successively only small changes in the explanation (see Fig. 2).

To measure robustness, we choose the local Lipschitz estimate q_LLE,m (Alvarez-Melis and Jaakkola 2018b) and the average sensitivity q_AS,m (Yeh et al. 2019) as representative metrics. Both use a Monte Carlo sampling-based approximation to measure the Lipschitz constant or the average sensitivity of an explanation. For an explanation Φ_m(f, c, x) ∈ ℝ^d of an XAI method m and a given input x, the scores are defined by

q_LLE,m = max_{x+δ∈N_ϵ(x)} ‖Φ_m(f, c, x) − Φ_m(f, c, x + δ)‖_2 / ‖x − (x + δ)‖_2,    (2)

q_AS,m = E_{x+δ∈N_ϵ(x)} [‖Φ_m(f, c, x) − Φ_m(f, c, x + δ)‖],    (3)

where ϵ defines the discrete, finite-sample neighborhood radius N_ϵ for every input x ∈ X, N_ϵ(x) = {x + δ ∈ X | ‖x − (x + δ)‖ ≤ ϵ}, and c denotes the true class of the input sample [for more details on intuition and calculation, we also suggest the primary publications (Alvarez-Melis and Jaakkola 2018b; Yeh et al. 2019)].

The robustness metrics assess the difference between the explanations of the original and the perturbed input, as shown in Eqs. (2) and (3). Accordingly, the lowest score represents the highest robustness.
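The Monte Carlo estimation underlying the average sensitivity in Eq. (3) can be sketched as follows (up to normalization details). The two toy "explanation functions" are assumptions chosen so that one is visibly more robust than the other:

```python
import numpy as np

def avg_sensitivity(explain, x, eps=0.05, n_samples=200, seed=0):
    """Monte Carlo sketch of average sensitivity: mean norm of the change
    in the explanation under small uniform input perturbations."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    diffs = [np.linalg.norm(base - explain(x + rng.uniform(-eps, eps, x.shape)))
             for _ in range(n_samples)]
    return float(np.mean(diffs))

# Two toy "explanation functions" (assumptions for illustration):
x = np.linspace(-1.0, 1.0, 16)
smooth = lambda x: x                        # changes gently with the input
jagged = lambda x: np.sign(np.sin(50 * x))  # flips under tiny perturbations

s_smooth = avg_sensitivity(smooth, x)
s_jagged = avg_sensitivity(jagged, x)
print(s_smooth < s_jagged)                  # the smooth method scores lower
```

In practice, `explain` would wrap one of the XAI methods from section 2c applied to the trained network, and lower scores indicate more robust explanations.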

b. Faithfulness

Faithfulness measures whether changing a feature to which an explanation method assigned high relevance also changes the network prediction (see Fig. 3). This can be examined through the iterative perturbation of an increasing number of input pixels corresponding to high-relevance values and the subsequent comparison of each resulting model prediction to the original model prediction. Since explanation methods assign relevance to features based on their contribution to the network's prediction, changing high-relevance features should have a larger impact on the model prediction than changing features of lesser relevance (Bach et al. 2015; Samek et al. 2017; Montavon et al. 2018; Bhatt et al. 2020; Nguyen and Martínez 2020).

To measure this property, we apply "Remove And Debias" (ROAD) (Rong et al. 2022a), which returns a curve of scores q̂ := (q̂_1, …, q̂_I) for a chosen percentage range p ∈ ℝ^{1×I}, with I ∈ ℕ being the number of percentage steps [curve visualizations can be found in Rong et al. (2022a)]. For each curve value q̂_i, a percentage p_i ∈ p, p_i ∈ [0, 1] of the pixels in the input x_n is perturbed according to their value in the explanation Φ^m(f, c_n, x_n) (starting with the highest relevance). The predictions based on the input x_n and the corresponding perturbed input x̂_n^i are compared, resulting in 1 for equal predictions and 0 otherwise. The procedure is repeated for several inputs n. Accordingly, the ROAD score q̂_{ROAD,m,i} for each percentage i corresponds to the average and is defined as
$$\hat{q}_{\mathrm{ROAD},m,i} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{c_{n}}(c_{\mathrm{pred},n}), \quad \text{with} \quad \mathbb{1}_{c_{n}}(c_{\mathrm{pred},n}) = \begin{cases} 1 & c_{n} = c_{\mathrm{pred},n}, \\ 0 & \text{otherwise}, \end{cases} \tag{4}$$
where 𝟙_{c_n}: C → {0, 1} is an indicator function comparing the predicted class c_{pred,n} = f(x̂_n^i) of the perturbed input x̂_n^i to c_n = f(x_n), the predicted class of the unperturbed input x_n. We calculate the score values for replacement of up to 50% of the most relevant pixels, in steps of 1%, resulting in a curve q̂_{ROAD}^m. For faithful explanations, this curve should degrade quickly as the percentage of perturbed pixels increases [see Eq. (5)]. The area under the curve (AUC) is then used as the final ROAD score q_{ROAD,m}:
$$q_{\mathrm{ROAD},m} = \mathrm{AUC}\big(p, \hat{q}_{\mathrm{ROAD}}^{m}\big). \tag{5}$$
Accordingly, a lower ROAD score corresponds to higher faithfulness.
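The pixel-flipping idea behind ROAD can be sketched as follows. For brevity, this illustration masks pixels with a constant value rather than ROAD's noisy linear imputation, so it is a simplified stand-in; `predict_fn`, `phi`, and all other names are our assumptions.

```python
import numpy as np

def faithfulness_curve(predict_fn, x, phi, percentages, mask_value=0.0):
    """Mask the top-p% most relevant pixels (per explanation phi) and
    record whether the predicted class is unchanged (1) or changed (0).
    Note: ROAD itself debiases the masking with noisy linear imputation;
    constant masking is used here only for illustration."""
    order = np.argsort(phi.ravel())[::-1]            # most relevant first
    c_orig = int(np.argmax(predict_fn(x)))
    curve = []
    for p in percentages:
        k = int(p * x.size)
        x_pert = x.copy().ravel()
        x_pert[order[:k]] = mask_value
        c_pert = int(np.argmax(predict_fn(x_pert.reshape(x.shape))))
        curve.append(1.0 if c_pert == c_orig else 0.0)
    return np.array(curve)

def road_style_score(percentages, curve):
    """Trapezoidal area under the degradation curve; lower = more faithful."""
    p = np.asarray(percentages, dtype=float)
    return float(np.sum(np.diff(p) * (curve[1:] + curve[:-1]) / 2.0))
```

For a toy model whose prediction depends entirely on one highly relevant pixel, the curve drops to zero as soon as that pixel is masked, yielding a small AUC.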
Furthermore, to measure faithfulness, we consider the faithfulness correlation qFC,m (Bhatt et al. 2020), defined as
$$q_{\mathrm{FC},m} = \operatorname{corr}_{S \subseteq [d]} \Big( \bar{\phi}_{S}^{m},\; f(x) - f\big(x_{[x_{S} = \bar{x}_{S}]}\big) \Big), \tag{6}$$
where S ⊆ [d] is a set of |S| random indices drawn from all pixel indices d in sample x and φ̄_S^m := Σ_{i∈S} Φ_i^m(f, c, x) is the sum across the explanation map values i that are part of the random subset i ∈ S. This set of random indices S is masked (i.e., perturbed) in the input x_{[x_S = x̄_S]}, with x̄ ∈ ℝ^d being an array filled with the perturbation values (e.g., 0 or 1), which replace all indices i ∈ S in the perturbed input. Accordingly, the correlation between the prediction difference of perturbed and unperturbed inputs f(x) − f(x_{[x_S = x̄_S]}) and the sum across the explanation values of the perturbed pixels φ̄_S^m is calculated [see Bhatt et al. (2020) for more details and visualizations]. Unlike ROAD, the faithfulness correlation score increases as faithfulness improves.
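A minimal NumPy sketch of the faithfulness correlation logic, assuming a scalar-output `predict_fn` and zero masking; all names are illustrative rather than taken from any library.

```python
import numpy as np

def faithfulness_correlation(predict_fn, x, phi, subset_size=3,
                             n_subsets=30, mask_value=0.0, seed=0):
    """Pearson correlation between the summed relevance of random pixel
    subsets and the prediction drop caused by masking those subsets.
    A higher correlation indicates a more faithful explanation."""
    rng = np.random.default_rng(seed)
    f_x = predict_fn(x)
    attr_sums, pred_drops = [], []
    for _ in range(n_subsets):
        idx = rng.choice(x.size, size=subset_size, replace=False)
        x_masked = x.copy().ravel()
        x_masked[idx] = mask_value                   # x[x_S = x̄_S]
        pred_drops.append(f_x - predict_fn(x_masked.reshape(x.shape)))
        attr_sums.append(phi.ravel()[idx].sum())     # φ̄_S
    return float(np.corrcoef(attr_sums, pred_drops)[0, 1])
```

For a linear model f(x) = Σ x_i with the explanation Φ = x, the prediction drop equals the summed relevance of the masked pixels, so the correlation is 1.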

c. Complexity

Complexity is a measure of conciseness, indicating that an explanation should consist of a few highly important features (Chalasani et al. 2020; Bhatt et al. 2020) (see Fig. 4). The assumption is that concise explanations with prominent features are easier for researchers to interpret and potentially carry higher informational value with less noise.

Here, we use complexity qCOM,m (Bhatt et al. 2020) and sparseness qSPA,m (Chalasani et al. 2020) as representative metric functions, which can be formulated as follows:
$$q_{\mathrm{COM},m} = H\big[P(\Phi^{m})\big], \quad \text{with} \quad P(\Phi^{m})_{i} := \frac{|\Phi^{m}(f,c,x)_{i}|}{\sum_{j \in [d]} |\Phi^{m}(f,c,x)_{j}|}, \tag{7}$$
$$q_{\mathrm{SPA},m} = \frac{\sum_{i=1}^{d} (2i - d - 1)\, \Phi^{m}(f,x)_{i}}{d \sum_{i=1}^{d} \Phi^{m}(f,x)_{i}}, \tag{8}$$
where H(⋅) is the Shannon entropy, P(Φ^m) is a valid probability distribution across the fractional contribution of each feature x_i of x to the total magnitude of the explanation values Σ_{j∈[d]} |Φ(f, c, x)_j|, d is the total number of pixels in x, f is the network function, and c is the explained class. Sparseness is based on the Gini index (Hurley and Rickard 2009), computed on the explanation values sorted in ascending order, while complexity is calculated using the entropy [see also Bhatt et al. (2020) and Chalasani et al. (2020), where both metric functions are discussed in more detail]. A lower entropy indicates a less complex explanation, whereas a higher Gini index indicates less complexity.
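The entropy- and Gini-based formulations above can be sketched in a few lines of NumPy (illustrative only; function names are ours):

```python
import numpy as np

def complexity_score(phi):
    """Shannon entropy of the normalized absolute relevance values;
    lower entropy = a less complex (more concentrated) explanation."""
    p = np.abs(phi).ravel()
    p = p / p.sum()
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

def sparseness_score(phi):
    """Gini index of the absolute relevance values (sorted ascending);
    higher = sparser, i.e., a less complex explanation."""
    v = np.sort(np.abs(phi).ravel())
    d = v.size
    i = np.arange(1, d + 1)
    return float(((2 * i - d - 1) * v).sum() / (d * v.sum()))
```

A uniform map has maximal entropy (log d) and zero sparseness; a one-hot map has zero entropy and sparseness close to 1.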

d. Localization

For localization, the quality of an explanation is measured based on its agreement with a user-defined region of interest (ROI; see Fig. 5). Accordingly, the position of pixels with the highest relevance values (given by the XAI explanation) is compared to the labeled areas, e.g., bounding boxes or segmentation masks. Based on the assumption that the ROI should be mainly responsible for the network decision (ground truth) (Zhang et al. 2018; Arras et al. 2020; Theiner et al. 2022; Arias-Duart et al. 2022), an explanation map yields high localization if high-relevance values are assigned to the ROI.

Fig. 5.

Diagram of the concept behind the localization property. First, an expected region of high relevance for the network decision, the ROI, is defined in the input temperature map (blue box). Here, the NA is chosen, as this region has been discussed to affect the prediction (see Labe and Barnes 2021). Next, the sorted explanation values of the ROI, encompassing k pixels, are compared to the k highest values of the sorted explanation values across all pixels. An explanation method with strong localization should assign the highest relevance values to the ROI.


As localization metrics, we use the top-K-pixel metric (also referred to as Top-K) (Theiner et al. 2022), which is computed as follows:
$$q_{\text{Top-}K,m} = \frac{|K \cap s|}{|K|}, \tag{9}$$
where K := Φ^m_{r_{|K|}} denotes the vector of indices of the explanation Φ corresponding to the |K| highest-ranked features, with r = Rank[Φ^m(f, c, x)], and s refers to the set of pixel indices in the ROI [see Theiner et al. (2022) for more details]. Furthermore, we consider the relevance rank accuracy q_{RRA,m} (Arras et al. 2020):
$$q_{\mathrm{RRA},m} = \frac{\big|\Phi_{|s|}^{m} \cap s\big|}{|s|}, \tag{10}$$
where Φ^m_{|s|} := Φ^m_{r_{|s|}} denotes the vector of indices of the explanation Φ corresponding to the highest-ranked features r_{|s|} ∈ ℝ^{1×|s|} and |s| is the number of pixels in the ROI [details on the calculation and intuition can be found in Arras et al. (2020)]. Thus, Top-K and relevance rank accuracy coincide if |K| is chosen equal to the number of pixels in the ROI |s|. Both scores are high for well-performing methods and low for explanations with weak localization.
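Both localization scores can be illustrated with a small NumPy function (a sketch under our own naming; for k equal to the ROI size it returns the relevance rank accuracy):

```python
import numpy as np

def top_k_score(phi, roi_mask, k=None):
    """Fraction of the k highest-relevance pixels that fall inside the
    ROI. With k equal to the number of ROI pixels, this coincides with
    the relevance rank accuracy."""
    roi_idx = np.flatnonzero(np.asarray(roi_mask).ravel())
    if k is None:
        k = roi_idx.size              # Top-K == RRA in this case
    top_idx = np.argsort(phi.ravel())[::-1][:k]
    return float(np.intersect1d(top_idx, roi_idx).size / k)
```

In practice, `roi_mask` would be a boolean map marking, e.g., the NA region, and `phi` the explanation map.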

e. Randomization

Randomization assesses how a random perturbation scenario changes the explanation (see Fig. 6). Either the network weights (case 2; Adebayo et al. 2018) are randomized or a random class that was not predicted by the network for the input sample x is explained (case 1; Sixt et al. 2020). In both cases, a change in the explanation is expected, since the explanation of an input x should change if the model changes or if a different class is explained.

Fig. 6.

Diagram of the concept behind the randomization property. In the middle row, the original input temperature map is passed through the network, and the explanation map is calculated based on the predicted (gray background) decade. For the random logit metric (first row, labeled 1), the input temperature map and the network remain unchanged, but the decade k used to calculate the explanation is randomly chosen (pink font). The resulting explanation map is then compared to the original explanation (indicated by a minus sign) to test its dependence on the class. For the model parameter randomization test (bottom row, labeled 2), the network is perturbed (see green box) with noisy parameters (θ1 = θ + noise), potentially altering the predicted decade (j; gray). The explanation map of the perturbed model should differ from the original explanation map if the explanation is sensitive to the model parameters.


Here, we evaluate randomization based on the model parameter randomization test (Adebayo et al. 2018). The score qMPT,m is defined as the average correlation coefficient between the explanation of the original model f and the randomized model fW over all layers L:
$$q_{\mathrm{MPT},m} = \frac{1}{L} \sum_{l=1}^{L} \rho\big[\Phi^{m}(f,c,x), \Phi^{m}(f_{l},c,x)\big], \tag{11}$$
where ρ denotes the Spearman rank correlation coefficient and fl is the true model with additive perturbed weights of layer l [see Adebayo et al. (2018) for further details].
We also consider the random logit score qRL,m (Sixt et al. 2020), which can be defined as, e.g., structural similarity index (SSIM) or Pearson’s correlation between an explanation map of a random class c^ [with f(x)=c,c^c] and an explanation map of the predicted class c [see also Sixt et al. (2020) for further details and visualization]:
$$q_{\mathrm{RL},m} = \mathrm{SSIM}\big[\Phi^{m}(f,c,x), \Phi^{m}(f,\hat{c},x)\big]. \tag{12}$$
Here, the metrics return scores q_{m,n} := q_{MPT/RL,m,n} with n ∈ {1, …, N}, either for all layers (model parameter randomization test), N = L, or for all other classes (c ≠ c_true), N = Γ. Thus, we average across L or Γ to obtain q_m as follows:
$$q_{\mathrm{MPT/RL},m} = \frac{1}{N} \sum_{n=1}^{N} q_{m,n}. \tag{13}$$
The metric scores of randomization and robustness metrics are interpreted similarly, i.e., low metric scores indicate strong performance.
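The layer-wise correlation at the heart of the model parameter randomization test can be sketched as follows (NumPy only; the tie-free Spearman computation and all names are our simplification, and in practice the perturbed explanations come from model copies with one randomized layer each):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for tie-free 1D arrays, computed as the
    Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

def mpt_score(phi_orig, phi_randomized):
    """Average rank correlation between the original explanation and the
    explanations obtained after randomizing each layer's weights; lower
    values indicate the desired sensitivity to model parameters."""
    return float(np.mean([spearman_rho(phi_orig.ravel(), phi_l.ravel())
                          for phi_l in phi_randomized]))
```

An explanation that is unaffected by weight randomization scores 1 (undesirable), while one that changes strongly scores near or below 0.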

f. Metric score calculation

The differing scales of the evaluation metric outputs (e.g., sparseness ranges between 0 and 1, faithfulness correlation between −1 and 1, and the local Lipschitz estimate between 0 and ∞) and their respective interpretations (e.g., for the first two metrics the best score is 1, whereas for the latter the best score is 0) complicate their comparison. Therefore, following Murphy and Daan (1985), we introduce a skill score S measuring the improvement of forecast performance A_f over the performance of a reference forecast A_r relative to the perfect performance A_p, where A_p = 0 if performance is measured by the mean-square error (Murphy and Daan 1985; Murphy 1988). The skill score S is given by
$$S(A_{f}) = \frac{A_{f} - A_{r}}{A_{p} - A_{r}}. \tag{14}$$
Here, we calculate the skill score S(qm) for an explanation method based on the metric scores in each property. The skill score allows us to compare the performance of explanation methods relative to a reference score Ar = qr. To establish this reference score qr, we create a uniform random baseline explanation similar to Rieger and Hansen (2020), maximizing the violation of each property’s underlying assumptions and creating a bad-skill scenario (for details, see appendix A, section b). The skill score then measures whether an explanation method improves upon this baseline score.
As the respective perfect score q* varies across metrics and takes values of both 0 (e.g., for the local Lipschitz estimate) and 1 (e.g., for sparseness), the skill score is
$$S(q_{m}) = \begin{cases} 1 - \dfrac{q_{m}}{q_{r}} & \text{if } q^{*} = 0, \\[2mm] \dfrac{q_{m} - q_{r}}{1 - q_{r}} & \text{if } q^{*} = 1, \end{cases} \tag{15}$$
where q_m ∈ ℝ represents the raw or aggregated metric score (for details, see appendix A, section b).
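The case distinction translates directly into code (a sketch; `q_star` encodes the metric's perfect score):

```python
def skill_score(q_m, q_r, q_star):
    """Skill of a metric score q_m against the random-baseline score q_r,
    following S = (A_f - A_r) / (A_p - A_r) with perfect score A_p = q_star."""
    if q_star == 0:                    # e.g., local Lipschitz estimate
        return 1.0 - q_m / q_r
    return (q_m - q_r) / (1.0 - q_r)   # q_star == 1, e.g., sparseness
```

A perfect score yields skill 1, a baseline-level score yields 0, and worse-than-baseline scores become negative.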

4. Experiments

a. Network predictions, explanations, and motivating example

In the following, we evaluate the network performance and discuss the application of the explanation methods for both network architectures. To ensure comparability between networks and comparability to our case study (Labe and Barnes 2021), we use a similar set of hyperparameters for the MLP and the CNN during training. A detailed performance discussion is provided in appendix B, section a. The achieved similar performance ensures that XAI evaluation score differences between the MLP and the CNN are not caused by differences in network accuracy.

After training and performance evaluation, we explain all correctly predicted temperature maps in the training, validation, and test samples (see appendix B, section a for details). These explanations are most often subject to further research on physical phenomena learned by the network (Barnes et al. 2020; Labe and Barnes 2021; Barnes et al. 2021; Labe and Barnes 2022). We apply all XAI methods presented in section 2c to both networks, with the exception of the LRP composite rule, which reduces to the LRP-z rule for the MLP due to its exclusively dense layers (Montavon et al. 2019). The corresponding explanation maps across all XAI methods and for both networks are displayed in Figs. B4 and B5. Despite explaining the same network predictions, different methods assign different relevance values to the same areas, revealing the disagreement problem in XAI (Krishna et al. 2022).

To illustrate this explanation disagreement, we show the explanation maps for the year 2068 given by DeepSHAP and LRP-z, alongside the input temperature map in Fig. 7. According to the primary publication (Labe and Barnes 2021), the cooling patch in the North Atlantic (NA), depicted in the zoomed-in map sections of 10°–80°W, 20°–60°N of Fig. 7, significantly contributes to the network prediction for all decades. Thus, it is reasonable to assume high relevance values in this region. However, the two XAI methods display contrary signs of relevance in some areas, impeding visual comparison and interpretation. The varying sign can be attributed to DeepSHAP being based on feature removal and modified gradient backpropagation, whereas LRP-z is theoretically equivalent to input times gradient. Thus, the two explanations potentially display different aspects of the network decision (Clare et al. 2022), and explanations can vary in sign depending on the input image [see also the discussion on input shift invariance in Mamalakis et al. (2022a)]. Nonetheless, we also find common features, for example, in Australia or throughout the Antarctic region. Thus, a deeper understanding of explanation methods and their properties is necessary to enable an informed method choice.

Fig. 7.

Motivating example visualizing the difference between different XAI methods. Shown are the T2m maps (a) for the year 2068 with the corresponding (b) DeepSHAP and (c) LRP-z explanation maps of the MLP. For both XAI methods, red indicates a pixel contributed positively, and blue indicates a negative contribution to the predicted class. Next to the explanation maps, a zoomed-in map of the NA region (10°–80°W, 20°–60°N) is shown, demonstrating different evidence for DeepSHAP and LRP-z.


b. Assessment of explanation methods

To introduce the application of XAI evaluation, we compare the different XAI methods applied to the MLP and calculate their skill scores across all five XAI method properties (see section 3). For each property, two representative metrics (hyperparameters are listed in appendix B, section b) are computed and compared. Each skill score is averaged across 50 random samples drawn from the explanations of all correctly predicted inputs, and we provide the standard error of the mean (SEM) (see appendix A, section b for details). To account for potential biases resulting from the choice of the testing period, we also compute the scores for random samples not limited to correct predictions. The resulting findings (not shown) are qualitatively consistent with the scores presented here. Our results are depicted in Fig. 8.

Fig. 8.

Barplot of skill scores based on the random baseline reference for two different metrics in each property: (a) robustness, (b) faithfulness, (c) complexity, (d) localization, and (e) randomization. The different metrics are indicated by hatches or no hatches on the bars. We report the mean skill score (as bar labels) and the SEM, indicated by the black error bars on each bar. The bar color scheme indicates the grouping of the XAI methods into sensitivity (violet tones) and salience/attribution methods (earthy tones).


For the robustness property, we find that all tested explanation methods result in similar, high, and closely distributed skill scores (≥0.85 and ≤0.93) for both the average sensitivity metric (hatches in Fig. 8a) and local Lipschitz estimate metric (no hatches), where the latter shows slightly higher values overall. For both metrics, we find that salience (earthy tones) and sensitivity (violet tones) methods show a similar robustness skill and perturbation-based methods (SmoothGrad, NoiseGrad, and Integrated Gradients) do not significantly improve skill compared to the respective baseline explanations (gradient and input times gradient). We relate the latter finding to the low signal-to-noise ratio of the climate data and variability between different ensemble members, complicating the choice of an efficient perturbation threshold for the explanation methods. Nonetheless, these findings disagree with previous studies regarding suggested robustness improvements when applying salience and perturbation-based methods (Smilkov et al. 2017; Sundararajan et al. 2017; Bykov et al. 2022b; Mamalakis et al. 2022a).

For faithfulness, we find pronounced skill score differences between the two metrics, with ROAD scores indicating positive skill for all methods, whereas faithfulness correlation scores include negative values for the sensitivity methods (hatched violet bars in Fig. 8b). This disparity arises from the faithfulness correlation metric being based on the correlation coefficient and from the distinct interpretations of relevance values in salience maps versus sensitivity maps. Since sensitivity maps display the network's sensitivity toward a change in the value of each pixel (the sign conveys the direction), the impact of the masking value depends on the discrepancy between the original pixel value and the masking value, leading to a negative correlation. Nonetheless, across metrics, the highest skill scores (≤0.6) are achieved by input times gradient, Integrated Gradients, and LRP-z, followed by S(qLRP-α-β) ≤ 0.42. Furthermore, sensitivity methods (violet tones) achieve overall lower skill scores. Although DeepSHAP exhibits a lower faithfulness correlation skill [which we attribute to the challenge of applying Shapley values to continuous data (Han et al. 2022) and its vulnerability to feature correlation (Flora et al. 2022)], the method still outperforms the sensitivity methods, indicating that salience (or attribution) methods provide more faithful relevance values. This is expected, as salience methods indicate the contribution of each pixel to the prediction, as required by faithfulness; sensitivity methods thus inherently result in less faithful explanations. We note that the input multiplication of salience methods can lead to a loss of information when using standardized input pixels, as zero values in the input (i.e., values close to climatology) will result in zero values in the explanation regardless of the network's sensitivity to them [see section 2c and Mamalakis et al. (2022a) discussing "ignorant to zero input"].

For complexity (Fig. 8c), all explanation methods exhibit low skill scores for the complexity metric compared to the sparseness metric, indicating that the explanations on climate data exhibit an entropy similar to uniformly sampled values. This similarity in entropy can be attributed to the increased variability and subsequently low signal-to-noise ratio of climate data (Sonnewald and Lguensat 2021; Clare et al. 2022). For the sparseness metric, skill scores show an improvement for salience (attribution) methods. We also find slight skill score improvements for NoiseGrad and FusionGrad, suggesting that incorporating network perturbations may decrease explanation complexity.

To compute the results of the localization metrics, Top-K (hatches in Fig. 8d) and relevance rank accuracy (no hatches), we select the region in the North Atlantic (10°–80°W, 20°–60°N) as our ROI, with the cooling in this region being a recognized feature of climate change (Labe and Barnes 2021). In both metrics, all explanation methods yield low skill scores. This is consistent with lower sparseness skill scores in complexity (≤0.47), indicating that high-relevance values are spread out, with the ROI also including fewer distinct features. In addition, high relevance in the ROI depends on whether the network learned this specific area. Thus, our results potentially indicate an inadequate choice of the ROI (either size or location) and show that localization metrics can identify a learned region. Nonetheless, LRP-α-β yields the highest skill across metrics, indicating that attributing only positive relevance values improves the distinctiveness of features in the NA region. Similar to complexity, salience methods (earthy tones) yield a slightly higher localization skill than sensitivity methods (violet tones) with the exception of NoiseGrad.

Last, we present the randomization results (Fig. 8e). For the random logit metric, all XAI methods yield lower skill scores (≥0.1 and ≤0.58). This can be attributed to the network task classes being defined based on decades with an underlying continuous temperature trend. Thus, the differences in temperature maps can be small for subsequent years, and the network decision and explanation for different classes may include similar features. Nonetheless, we find salience (earthy tones) and sensitivity methods (violet tones) to yield no clear separation. Instead, XAI methods using perturbation result in higher skill scores, with mean improvements for FusionGrad exceeding the SEM, as well as a slight improvement for NoiseGrad and SmoothGrad over gradient and Integrated Gradients over input times gradient. Thus, while input perturbations already slightly improve the class separation in the explanation, also including network perturbation yields favorable improvement. For the model parameter randomization test scores, skill scores are overall higher (≥0.58 and ≤0.99) across all explanation methods, and sensitivity methods outperform salience methods, the latter aligning with Mamalakis et al. (2022b). Similar to the complexity results, the DeepSHAP skill score aligns with other salience method results. In addition, LRP-α-β yields the worst skill across metrics, potentially due to neglecting negatively contributing neurons during backpropagation [see Eq. (A4) in appendix A, section a] and corresponding variations across classes and under parameter randomization.

c. Network-based comparison

To compare the performance of explanation methods for the MLP and CNNs, we selected one metric per property: local Lipschitz estimate for robustness, ROAD for faithfulness, sparseness for complexity, Top-K for localization, and model parameter randomization test for randomization.

For robustness (see Fig. 9a), XAI methods applied to the CNN yield strong skill score variations, with the MLP results showing overall higher skill scores. For the CNN, the LRP composite rule provides the best robustness skill. We find salience methods to exhibit slightly higher skill scores, the exception being FusionGrad outperforming LRP-α-β and DeepSHAP. This suggests that due to the differences in learned patterns between the CNN and MLP, including both network and input perturbations yields more robust explanations, while the combination of a removal-based technique (Covert et al. 2021) with a modified gradient backpropagation (Ancona et al. 2019) as in DeepSHAP and neglecting negatively contributing neurons as in LRP-α-β worsens robustness compared with other salience methods. Moreover, explanation methods using input perturbations improve sensitivity explanation robustness for the CNN (SmoothGrad and FusionGrad), while methods using only network perturbations decrease robustness skill (NoiseGrad).

Fig. 9.

Barplot of skill scores based on the random baseline reference for the MLP (star hatches) and CNN (no hatches) in each property: (a) robustness, (b) faithfulness, (c) complexity, (d) localization, and (e) randomization. We report the skill score (as bar labels) and the SEM of all scores, indicated by the black error bars on each bar. The bar color scheme indicates the grouping of the XAI methods into sensitivity (violet tones) and salience/attribution methods (earthy tones). Note that for LRP composite (LRP-comp), we only report the CNN results (for details, see section 4a).


In the faithfulness property (see Fig. 9b), salience explanation methods (Integrated Gradients, input times gradient, and LRP) achieve higher skill for both networks, aligning with previous research (Mamalakis et al. 2022b,a) and the theoretical differences (see section 4b). However, the LRP composite rule is the exception, adding additional insight to the findings of other studies (Mamalakis et al. 2022a), as it sacrifices faithful evidence for a less complex [human-aligned; Montavon et al. (2019)] and more robust explanation. Moreover, perturbation-based explanation methods (SmoothGrad, NoiseGrad, FusionGrad, and Integrated Gradients) do not significantly increase the faithfulness skill compared to their respective baseline explanations (gradient and input times gradient), except for Integrated Gradients for the MLP. Similar to the MLP results, LRP-α-β acts as an outlier compared with the other salience methods. For the CNN, DeepSHAP's faithfulness skill is also decreased, contradicting theoretical claims and other findings (Lundberg and Lee 2017; Mamalakis et al. 2022a). Since the CNN learns more clustered patterns (groups of pixels according to the filter-based architecture), we attribute this outcome to both DeepSHAP's theoretical definitions (Han et al. 2022) and its vulnerability to feature correlation (Flora et al. 2022), with the latter making partitionSHAP a more suitable option (Flora et al. 2022).

In complexity, salience methods exhibit a slight skill improvement over sensitivity methods across networks, except for LRP-α-β for the CNN (Fig. 9c). This indicates that neglecting negatively contributing neurons is more influential for the CNN's explanation, leading to fewer distinct features in the explanation, while the lower DeepSHAP skill further confirms the previously discussed disadvantages of DeepSHAP for the CNN.

In localization, both the MLP and CNN show similarly low overall skill scores (≤0.33), indicating that the size or location of the ROI was not optimally chosen for the case study. Nonetheless, the skill scores across XAI methods are in line with the complexity results, except for the worst and best skill scores. The LRP composite rule yields the lowest localization skill, further confirming its trade-off between faithfulness and interpretability, also in the ROI. FusionGrad provides the highest localization skill for the CNN. In contrast, LRP-α-β yields the highest skill for the MLP but the second lowest skill score for the CNN. The difference in results across networks for complexity and localization can be attributed to differences in learned patterns (as discussed above), affecting properties that assess the spatial distribution of evidence in the explanation.

Last, for randomization (see Fig. 9e), sensitivity methods outperform salience methods regardless of the network, indicating a decreased susceptibility to changes in the network parameters. While slightly lower, DeepSHAP's randomization skill agrees with that of the other salience methods, in line with Mamalakis et al. (2022b,a).

Overall, our results show that while explanation methods applied to different network architectures retain similar faithfulness and randomization properties, their robustness, complexity, and localization properties depend on the specific architecture.

d. Choosing an XAI method

Evaluation metrics enable the comparison of different explanation methods based on various properties for different network architectures, allowing us to assess their suitability for specific tasks. Here, we propose a framework to select an appropriate XAI method.

Practitioners first determine which explanation properties are essential for their network task. For instance, for physically informed networks, randomization (the model parameter randomization test) is crucial, as parameters are meaningful and explanations should respond to their successive randomization. Similarly, localization might be less important if an ROI cannot be determined beforehand. Second, practitioners calculate evaluation scores for each selected property across various XAI methods. We suggest calculating the skill score (see section 3f) to improve score interpretability. Third and last, the optimal XAI method for the task can be chosen based either directly on the skill scores or on the ranks of the explanation methods, as in previous studies (Hedström et al. 2023b; Tomsett et al. 2020; Rong et al. 2022b; Brocki and Chung 2022; Gevaert et al. 2022).
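As a sketch of the third step, ranking methods by their skill scores could look as follows. The method names and score values are hypothetical, and averaging assumes the practitioner weights all selected properties equally:

```python
import numpy as np

def rank_methods(skill_scores):
    """Rank XAI methods by their mean skill across the properties the
    practitioner selected (dict: method name -> list of skill scores)."""
    means = {m: float(np.mean(s)) for m, s in skill_scores.items()}
    return sorted(means, key=means.get, reverse=True)

# Hypothetical skill scores for, e.g., (faithfulness, randomization):
scores = {"lrp_z": [0.9, 0.6], "gradient": [0.9, 0.2], "smoothgrad": [0.8, 0.4]}
```

For these made-up values, `rank_methods(scores)` returns `['lrp_z', 'smoothgrad', 'gradient']`; a weighted mean could emphasize the properties most critical to the task.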

In our case study, for example, the explanation method should exhibit robustness toward variation across climate model ensemble members, display concise features (complexity) without sacrificing faithfulness, and capture randomization of the network parameters (randomization). Using the Quantus XAI evaluation library (Hedström et al. 2023b), we visualize the evaluation results for the MLP in a spider plot (Fig. 10a), with the outermost line indicating the best-performing XAI method in each property. All methods yield a similar robustness skill but differ in randomization, faithfulness, and complexity skills. Methods such as LRP-z (light beige), input times gradient (ocher), Integrated Gradients (orange), and DeepSHAP (brown) provide the most faithful explanations [similar to findings in Mamalakis et al. (2022a)], with DeepSHAP providing slightly lower randomization and robustness skill.

Fig. 10.

Visualization of the proposed procedure to choose an appropriate XAI method. (a) In the spider plot, the mean skill scores for all properties across nine explanation methods (MLP explanations) are visualized, according to Fig. 9. The spider plot can be used as a visual aid alongside the skill scores or ranks in each essential property to identify the best-performing XAI method. In the plot, the best results correspond to the furthest distance from the center of the graph. (b) The LRP-z explanation map of the decade prediction on the temperature map of 2068 is shown with (c) a zoom to the NA region.


Based on the different strengths and weaknesses, we would select LRP-z to explain the MLP predictions (Fig. 10b) and analyze the impact of the NA region (Fig. 10c) on the network predictions. According to the explanation, the network heavily depends on the North Atlantic region and the cooling patch pattern, suggesting its relevance in correctly predicting the decade in this global warming simulation scenario. However, we also stress that additionally applying a sensitivity method such as gradient-based SmoothGrad potentially illuminates more aspects of this network decision, as sensitivity methods provide strong randomization, in contrast to LRP-z.

5. Discussion and conclusions

AI models, particularly DNNs, can learn complex relationships from data and subsequently predict unseen data points. However, their black box character restricts human understanding of the learned input–output relation, making DNN predictions challenging to interpret. To illuminate a model's behavior, local XAI methods have been developed that identify the input features responsible for individual predictions, offering novel insights into climate artificial intelligence research (Camps-Valls et al. 2020; Gibson et al. 2021; Dikshit and Pradhan 2021; Mayer and Barnes 2021; Labe and Barnes 2021; Van Straaten et al. 2022; Labe and Barnes 2022). Nevertheless, the increasing number of available XAI methods and their visual disagreement (Krishna et al. 2022), illustrated in our motivating example (Fig. 7), raise two important questions: Which explanation method is trustworthy, and which is the appropriate choice for a given task?

To address these questions, we introduced XAI evaluation to climate science, building upon existing climate XAI research as our case study (Labe and Barnes 2021). We evaluate and compare various local explanation methods for an MLP and a CNN regarding five properties, i.e., robustness, faithfulness, randomization, complexity, and localization, that are provided by the Quantus library (Hedström et al. 2023b). Furthermore, we improve the interpretation of the evaluation scores by calculating a skill score in reference to a random uniform explanation.

In the first experiment, we showcase the application of XAI evaluation on the MLP explanations using two metrics for each property (Alvarez-Melis and Jaakkola 2018b; Montavon et al. 2019; Yeh et al. 2019; Bhatt et al. 2020; Arras et al. 2020; Rong et al. 2022a; Hedström et al. 2023b). Our results indicate that salience methods (i.e., input times gradient, Integrated Gradients, and LRP) yield an improvement in faithfulness and complexity skill but a reduced randomization skill. Contrary to salience methods, sensitivity methods (gradient, SmoothGrad, NoiseGrad, and FusionGrad) show higher randomization skill scores while sacrificing faithfulness and complexity skills. These results indicate that a combination of explanation methods can be favorable depending on the explainability context. We also establish that evaluating explanation methods in a climate context mandates careful consideration. For example, due to the natural variability in the data, the sparseness metric is best suited for determining explanation complexity. Further, the random logit metric is favored for classification with pronounced class separations rather than datasets with continuous features spanning multiple classes. Last, we highlight that the correct identification of an ROI is essential for an informative localization evaluation and that localization metrics enable probing the network for learned physical phenomena.

In the second experiment, we compare the properties of the MLP and CNN explanations across all XAI methods. Both localization and complexity evaluation show larger variations between networks, due to differences in how the networks learn features in the input. The robustness results exhibit similar variation, with the CNN showing higher skill scores for all input perturbation-based methods like SmoothGrad, FusionGrad, and Integrated Gradients, contrary to the MLP, with the exception of NoiseGrad. Independent of network architecture, explanations using averages across input perturbations, like SmoothGrad and Integrated Gradients, do not consistently increase and, in some cases, even decrease the faithfulness skill. Furthermore, sensitivity methods result in less faithful and more complex explanations but capture network parameter changes more reliably. In contrast, salience methods are less complex, except for LRP-α-β explaining the CNN. Moreover, salience methods exhibit a higher faithfulness skill and lower randomization skill than sensitivity methods, consistent with findings in Mamalakis et al. (2022b,a) and in line with salience methods presenting the contribution of each input pixel rather than sensitivity (see section 4b), due to input multiplication. Contrary to previous research (Mamalakis et al. 2022a), the LRP composite rule was an outlier among salience methods, sacrificing a faithful explanation for an improved complexity skill and higher robustness. Similarly, LRP-α-β and DeepSHAP stand out as an exception among salience methods applied to the CNN due to almost consistently lower skill scores. We attribute both findings to the mathematical definition of each method. 
While the LRP composite rule is optimized toward improved interpretability, resulting in less feature content, DeepSHAP is based on feature removal and modified gradient backpropagation, making it vulnerable to the correlations among CNN features, and LRP-α-β neglects negatively contributing neurons during backpropagation.

Last, we propose a framework using XAI evaluation to support the selection of an appropriate XAI method for a specific research task. The first step is to identify important XAI properties for the network and data, followed by calculating evaluation skill scores across the properties for different XAI methods. Then, the resulting skill scores across XAI methods can be ranked or compared directly to determine the best-performing method or combination of methods. In our case study, LRP-z (alongside input times gradient and Integrated Gradients) yields suitable results in the MLP task, allowing the reassessment of our motivating example (Fig. 7) and the trustworthy interpretation of the NA region as a contributing input feature.
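The selection procedure above can be sketched in a few lines of NumPy. The method names mirror this study, but all skill scores, the choice of "essential" properties, and the rank aggregation are purely illustrative stand-ins, not measured results:

```python
import numpy as np

# Hypothetical mean skill scores per XAI method (rows) and property
# (columns: robustness, faithfulness, randomization, complexity,
# localization); all numbers are illustrative only.
methods = ["LRP-z", "Gradient", "SmoothGrad"]
scores = np.array([
    [0.8, 0.7, 0.1, 0.6, 0.5],   # LRP-z
    [0.8, 0.2, 0.7, 0.1, 0.4],   # Gradient
    [0.7, 0.3, 0.8, 0.2, 0.4],   # SmoothGrad
])

# Step 2/3: rank the methods within each property (1 = best skill score).
ranks = scores.shape[0] - scores.argsort(axis=0).argsort(axis=0)

# Aggregate over the properties deemed essential for the task at hand.
essential = [0, 1, 3]            # e.g., robustness, faithfulness, complexity
mean_rank = ranks[:, essential].mean(axis=1)
best = methods[int(mean_rank.argmin())]
```

With these toy numbers, the aggregation would favor `LRP-z`, mirroring the outcome of our case study; with different essential properties, a sensitivity method could win instead.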

Overall, our results demonstrate the value of XAI evaluation for climate AI research. Due to their technical and theoretical differences (Letzgus et al. 2022; Han et al. 2022; Flora et al. 2022), the various explanation methods can reveal different aspects of the network decision and exhibit different strengths and weaknesses. Evaluation metrics allow us to compare explanation methods by assessing their suitability and properties in different explainability contexts. Next to benchmark datasets, evaluation metrics also contribute to the benchmarking of explanation methods. XAI evaluation can support researchers in the choice of an explanation method, independent of the network structure and targeted to their specific research problem.

Acknowledgments.

This work was funded by the German Ministry for Education and Research through project Explaining 4.0 (ref. 01IS200551). M. K. acknowledges funding from XAIDA (European Union’s Horizon 2020 research and innovation program under Grant Agreement 101003469). The authors also thank the CESM Large Ensemble Community Project (Kay et al. 2015) for making the data publicly available. Support for the Twentieth Century Reanalysis Project version 3 dataset is provided by the U.S. Department of Energy, the Office of Science Biological and Environmental Research (BER), the National Oceanic and Atmospheric Administration Climate Program Office, and the NOAA Earth System Research Laboratory Physical Sciences Laboratory.

Data availability statement.

Our study is based on the RCP8.5 configuration of the CESM1 Large Ensemble simulations (https://www.cesm.ucar.edu/community-projects/lens/instructions). The data are freely available (https://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html). The source code for all experiments will be accessible at https://github.com/philine-bommer/Climate_X_Quantus. All experiments and code are based on Python v3.7.6, NumPy v1.19 (Harris et al. 2020), SciPy v1.4.1 (Virtanen et al. 2020), and colormaps provided by Matplotlib v3.2.2 (Hunter 2007). Additional Python packages used for the development of the ANN, explanation methods, and evaluation include Keras/TensorFlow (Abadi et al. 2016), iNNvestigate (Alber et al. 2019), and Quantus (Hedström et al. 2023b). We implemented all explanation methods except for NoiseGrad and FusionGrad using iNNvestigate (Alber et al. 2019). For the XAI methods by Bykov et al. (2022b) and Quantus (Hedström et al. 2023b), we present a Keras/TensorFlow (Abadi et al. 2016) adaptation in our repository. All dataset references are provided throughout the study.

APPENDIX A

Additional Methodology

a. Explanations

To provide a theoretical background, we give formulas for the different XAI methods we compare in the following section.

1) Gradient
The gradient method is the weak derivative $\nabla x := \nabla f(x)$ of the network output $f(x)$ with respect to each entry of the temperature map $x \in X$ (Baehrens et al. 2010):
$$\Phi[f(x)] = \nabla x. \tag{A1}$$
Accordingly, the raw gradient has the same dimensions as the input sample $x$, i.e., $\Phi[f(x)], x \in \mathbb{R}^D$.
2) Input times gradient
Input times gradient explanations are based on a pointwise multiplication of the impact of each temperature map entry on the network output, i.e., the weak derivative $\nabla x$, with the value of the entry in the explained temperature map $x$. All explanations are calculated as follows:
$$\Phi[f(x)] = x \odot \nabla x, \tag{A2}$$
with $\Phi[f(x)], x, \nabla x \in \mathbb{R}^D$.
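For illustration, a minimal NumPy sketch of both explanations for a toy differentiable model $f(x) = \tanh(w \cdot x)$; in the study the gradients come from the trained networks via iNNvestigate, so the model and values here are purely illustrative assumptions:

```python
import numpy as np

# Toy "network" f(x) = tanh(w @ x) with a closed-form gradient;
# w and x stand in for network parameters and a flattened temperature map.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)

def f(x):
    return np.tanh(w @ x)

def grad_f(x):
    # Closed-form derivative of the toy model with respect to x.
    return (1.0 - np.tanh(w @ x) ** 2) * w

phi_grad = grad_f(x)        # gradient (sensitivity):  Phi = grad_x f(x)
phi_ixg = x * grad_f(x)     # input times gradient:    Phi = x ⊙ grad_x f(x)
```

The pointwise product turns the sensitivity map into a salience map: entries with a large gradient but a near-zero input value receive little relevance.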
3) Integrated Gradients
The Integrated Gradients method aggregates gradients along the straight-line path from the baseline $\bar{x}$ to the input temperature map $x$. The relevance attribution function is defined as follows:
$$\Phi[f(x)] = (x - \bar{x}) \odot \int_0^1 \nabla f[\bar{x} + \alpha (x - \bar{x})]\, d\alpha, \tag{A3}$$
where $\odot$ denotes the elementwise product and $\alpha$ is the step width from $\bar{x}$ to $x$.
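In practice the integral is approximated by a Riemann sum. A minimal sketch for a toy linear model with an all-zero baseline follows; `n_steps`, the model, and the baseline are illustrative choices, not the study's configuration:

```python
import numpy as np

# Toy linear model f(x) = w @ x, whose gradient is the constant w.
rng = np.random.default_rng(1)
w = rng.normal(size=4)

def grad_f(x):
    return w

def integrated_gradients(x, baseline, n_steps=50):
    # Midpoint-rule approximation of the path integral in Eq. (A3).
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)   # (x - x̄) ⊙ ∫ grad dα

x = rng.normal(size=4)
ig = integrated_gradients(x, np.zeros(4))
```

For a linear model the approximation is exact, so `ig` reduces to `x * w`, i.e., input times gradient with a zero baseline.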
4) LRP

For LRP, the relevances of each neuron i in each layer l are calculated based on the relevances of all connected neurons j in the later layer l + 1 (Samek et al. 2017; Montavon et al. 2017).

For the α-β rule, the weighted contribution of a neuron $i$ to a neuron $j$, i.e., $z_{ij} = a_i^{(l)} w_{ij}^{(l,l+1)}$ with $a_i^{(l)} = x_i$, is separated into a positive part $z_{ij}^+$ and a negative part $z_{ij}^-$. Accordingly, the propagation rule is defined by
$$R_i^{(l)} = \sum_j \left( \alpha \frac{z_{ij}^+}{\sum_i z_{ij}^+} + \beta \frac{z_{ij}^-}{\sum_i z_{ij}^-} \right) R_j^{(l+1)}, \tag{A4}$$
with $\alpha$ as the positive weight, $\beta$ as the negative weight, and $\alpha + \beta = 1$ to maintain relevance conservation. We set $\alpha = 1$ and $\beta = 0$.
The z rule accounts for the bounded domain that input images in image classification exhibit, by multiplying the positive network weights $w_{ij}^+$ with the lowest pixel value $l_i$ in the input and the negative weights $w_{ij}^-$ by the highest input pixel value $h_i$ (Montavon et al. 2017). The relevance is calculated as follows:
$$R_i^{(l)} = \sum_j \frac{z_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-}{\sum_i \left( z_{ij} - l_i w_{ij}^+ - h_i w_{ij}^- \right)} R_j^{(l+1)}. \tag{A5}$$
For the composite rule, the relevances of the last layers with high neuron numbers are calculated based on LRP-0 (see Bach et al. 2015), which we drop due to our small network. In the middle layers, propagation is based on LRP-ϵ, defined as
$$R_i^{(l)} = \sum_j \frac{z_{ij}}{\epsilon + \sum_i z_{ij}} R_j^{(l+1)}. \tag{A6}$$
The relevance of neurons in the layer before the input follows from LRP-γ and is calculated as
$$R_i^{(l)} = \sum_j \frac{a_i \left( w_{ij} + \gamma w_{ij}^+ \right)}{\sum_i a_i \left( w_{ij} + \gamma w_{ij}^+ \right)} R_j^{(l+1)}, \tag{A7}$$
and the relevance of the input layer is calculated based on Eq. (A5).
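A single backward step of the α-β rule with our setting $\alpha = 1$, $\beta = 0$ can be sketched in NumPy for one dense layer. The weights, activations, output relevance, and the small stabilizer `eps` are toy assumptions, not the networks of this study:

```python
import numpy as np

# One LRP alpha-beta step through a dense layer a_out = relu(W @ a_in).
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 5))          # weights w_ij from layer l (5) to l+1 (3)
a_in = rng.uniform(size=5)           # activations a_i^(l)

z = a_in[None, :] * W                # contributions z_ij, shape (j, i)
z_pos = np.clip(z, 0, None)          # z_ij^+
z_neg = np.clip(z, None, 0)          # z_ij^-
R_out = np.maximum(W @ a_in, 0.0)    # toy relevance R_j^(l+1)

alpha, beta, eps = 1.0, 0.0, 1e-12   # eps: numerical stabilizer (assumption)
R_in = ((alpha * z_pos / (z_pos.sum(axis=1, keepdims=True) + eps)
         + beta * z_neg / (z_neg.sum(axis=1, keepdims=True) - eps))
        * R_out[:, None]).sum(axis=0)
# With alpha = 1 and beta = 0, the propagated relevance is conserved:
# R_in.sum() equals R_out.sum().
```

The conservation check mirrors the constraint stated below Eq. (A4): relevance is redistributed, not created or destroyed, by the backward pass.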
5) SmoothGrad
The SmoothGrad explanations are defined as the average over the explanations of $M$ perturbed input images $x + g_i$ with $i = 1, \ldots, M$:
$$\Phi[f(x)] = \frac{1}{M+1} \sum_{i=0}^{M} \Phi_0[f(x + g_i)], \tag{A8}$$
with $g_0 = 0$ recovering the unperturbed input. The additive noise $g_i \sim \mathcal{N}(0, \sigma)$ is generated using a Gaussian distribution.
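A minimal sketch of this averaging follows, using a toy linear model whose gradient is the constant $w$ as the baseline explanation $\Phi_0$; `sigma` and `M` are illustrative hyperparameters:

```python
import numpy as np

# SmoothGrad: average a baseline explanation over M noisy input copies.
rng = np.random.default_rng(3)
w = rng.normal(size=6)

def phi0(x):
    return w                        # baseline explanation: gradient of w @ x

def smoothgrad(x, sigma=0.1, M=25):
    noisy = [x] + [x + rng.normal(0.0, sigma, size=x.shape) for _ in range(M)]
    return np.mean([phi0(xn) for xn in noisy], axis=0)   # 1/(M+1) sum

x = rng.normal(size=6)
sg = smoothgrad(x)
```

Because the gradient of a linear model is constant, `sg` equals `w` exactly here; for a nonlinear network the averaging is what suppresses noisy, locally fluctuating gradients.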
6) NoiseGrad
NoiseGrad samples $N$ sets of perturbed network parameters $\hat{\theta}_i = \eta_i \theta$ using multiplicative noise $\eta_i \sim \mathcal{N}(1, \sigma)$. Each set of perturbed parameters $\hat{\theta}_i$ results in a perturbed network $f_i(x) := f(x; \hat{\theta}_i)$, each of which is explained by a baseline explanation method $\Phi_0[f(x)]$. The NoiseGrad explanation is calculated as follows:
$$\Phi[f(x)] = \frac{1}{N+1} \sum_{i=0}^{N} \Phi_0[f_i(x)], \tag{A9}$$
with $f_0(x) = f(x)$ being the unperturbed network.
7) FusionGrad
For FusionGrad, the NoiseGrad (NG) procedure is extended by combining the SG procedure, using $M$ perturbed input samples, with the NG calculations. Accordingly, FusionGrad (FG) explanations are calculated as follows:
$$\Phi[f(x)] = \frac{1}{M+1} \frac{1}{N+1} \sum_{j=0}^{M} \sum_{i=0}^{N} \Phi_0[f_i(x_j)]. \tag{A10}$$
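Both procedures can be sketched together for a toy linear model $f(x; \theta) = \theta \cdot x$, whose gradient with respect to $x$ is $\theta$ itself; `sigma`, `M`, and `N` are illustrative hyperparameters, not the values used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=6)           # unperturbed network parameters
x = rng.normal(size=6)

def phi0(x, th):
    return th                        # baseline explanation: gradient = theta

def noisegrad(x, sigma=0.2, N=50):
    # Multiplicative Gaussian noise on the parameters, eta ~ N(1, sigma).
    thetas = [theta] + [theta * rng.normal(1.0, sigma, size=theta.shape)
                        for _ in range(N)]
    return np.mean([phi0(x, th) for th in thetas], axis=0)

def fusiongrad(x, sigma_x=0.1, sigma_t=0.2, M=10, N=10):
    # Combine parameter noise (NoiseGrad) with input noise (SmoothGrad).
    out = []
    for i in range(N + 1):
        th = theta if i == 0 else theta * rng.normal(1.0, sigma_t,
                                                     size=theta.shape)
        for j in range(M + 1):
            xj = x if j == 0 else x + rng.normal(0.0, sigma_x, size=x.shape)
            out.append(phi0(xj, th))
    return np.mean(out, axis=0)
```

Setting either noise scale to zero collapses the average back onto the unperturbed explanation, which is a convenient sanity check when porting these methods to a new framework.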
8) DeepSHAP (Lundberg and Lee 2017)
The DeepSHAP explainer uses the concept of DeepLift (Shrikumar et al. 2016) to approximate Shapley values. Formally, we can express the Shapley values as follows:
$$\phi_{d_i}(f_W, x) = \sum_{S \subseteq d \setminus \{d_i\}} \frac{|S|! \left( |d| - |S| - 1 \right)!}{|d|!} \left[ f(x) - f(x_S) \right], \tag{A11}$$
where $x$ is the input with features $d$ and individual features $d_i \in d$, $f$ is our model, and $x_S := x \setminus d_i$ is the masked input, containing only the features in $S \subseteq d \setminus \{d_i\}$, i.e., the subsets that do not contain the feature $d_i$. For DeepSHAP, the network $f$ is separated into individual components $f_i$ according to the layer structure, as proposed in DeepLift. Similar to Integrated Gradients, DeepSHAP uses a reference value (here chosen as an all-zero reference image), relative to which the contributions of each feature are calculated. This is achieved by determining the multiplicators for each layer according to DeepLift and back-propagating them to the input layer (Shrikumar et al. 2016; Lundberg and Lee 2017).
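The combinatorial definition that DeepSHAP approximates can be evaluated exactly for very few features by enumerating all subsets, masking removed features with the all-zero reference. The linear toy model and feature values below are illustrative, not part of the study:

```python
import itertools
import math
import numpy as np

def shapley(f, x):
    # Exact Shapley values by subset enumeration (exponential in len(x)).
    d = len(x)
    phi = np.zeros(d)
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for r in range(d):
            for S in itertools.combinations(rest, r):
                x_S = np.zeros(d)            # masked input x_S (zero reference)
                x_S[list(S)] = x[list(S)]
                x_Si = x_S.copy()
                x_Si[i] = x[i]               # add feature d_i back in
                weight = (math.factorial(r) * math.factorial(d - r - 1)
                          / math.factorial(d))
                phi[i] += weight * (f(x_Si) - f(x_S))
    return phi

w = np.array([1.0, -2.0, 0.5])
x = np.array([3.0, 1.0, 4.0])
vals = shapley(lambda z: w @ z, x)
```

For a linear model with a zero reference, each attribution reduces to $w_i x_i$; DeepSHAP exists precisely because this enumeration is infeasible for the thousands of pixels in a temperature map.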
For visualizations, as depicted in appendix B (see Figs. B4 and B5), we maintain comparability of the relevance maps $\Phi[f(X_{i,t})] = \bar{R}^{(i,t)} \in \mathbb{R}^{v \times h}$ across different methods by applying a min-max normalization to all explanations:
$$\bar{R}_i = I_{\max} \odot \frac{R_i}{\max\left( \left\{ r_{jk} \mid r_{jk} \in R_i,\ j \in [1, v],\ k \in [1, h] \right\} \right)} - I_{\min} \odot \frac{R_i}{\min\left( \left\{ r_{jk} \mid r_{jk} \in R_i,\ j \in [1, v],\ k \in [1, h] \right\} \right)}, \tag{A12}$$
with $I_{\min}, I_{\max} \in \mathbb{R}^{v \times h}$ defining the corresponding minimum/maximum indicator masks, i.e., for the minimum indicator, each entry $i_{\min}(jk) = 1$ if $r_{jk} < 0$ and $i_{\min}(jk) = 0$ if $r_{jk} \geq 0$, and for the maximum indicator, entries are defined reversely: $i_{\max}(jk) = 1$ if $r_{jk} \geq 0$ and $i_{\max}(jk) = 0$ otherwise. The normalization maps the pixelwise relevance $r_{jk} \mapsto \bar{r}_{jk}$ with $\bar{r}_{jk} \in [-1, 1]$ for methods identifying positive and negative relevance and $\bar{r}_{jk} \in [0, 1]$ for methods contributing only positive relevance values.
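A minimal sketch of this indicator-based normalization (the 2×2 relevance map is a toy example): positive entries are divided by the map maximum, negative entries by the magnitude of the map minimum, so the result lies in [−1, 1]:

```python
import numpy as np

def normalize(R):
    # Indicator-based min-max normalization of a relevance map.
    R = np.asarray(R, dtype=float)
    r_max, r_min = R.max(), R.min()
    out = np.zeros_like(R)
    pos = R >= 0                    # maximum indicator mask I_max
    neg = ~pos                      # minimum indicator mask I_min
    if r_max > 0:
        out[pos] = R[pos] / r_max
    if r_min < 0:
        out[neg] = R[neg] / abs(r_min)
    return out

R = np.array([[2.0, -1.0], [0.5, -4.0]])
R_bar = normalize(R)                # entries now lie in [-1, 1]
```

Maps with only nonnegative relevance keep their zero entries and land in [0, 1], matching the two cases distinguished above.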

b. Evaluation metrics

1) Random baseline

Similar to Rieger and Hansen (2020), we establish a random baseline as an uninformative baseline explanation. The artificial explanation $\Phi_{\text{rand}} \in \mathbb{R}^{h \times v}$ is drawn from a uniform distribution $\Phi_{\text{rand}} \sim U(0, 1)$. Each time a metric reapplies the explanation function, for example, in the robustness metrics when the perturbed input is subject to the explanation method, we redraw each random explanation. The only exception for the re-explanation step is the randomization metric, as it aims for a maximally different explanation. Thus, to maximally violate the metric assumptions, we fix the explanation, emulating a constant explanation for a changing network, $\Phi(x, f_\theta) = \Phi(x, f_{\hat{\theta}})$.

2) Score calculation

As discussed in section 3e, we calculate the skill score according to the optimal metric outcome. Thus, the skill scores reported for the average sensitivity, local Lipschitz estimate, ROAD, complexity, model parameter randomization test, and random logit metrics are calculated based on the first case of Eq. (15), while the skill scores for the faithfulness correlation, Top-K, relevance rank accuracy, and sparseness scores are calculated following the bottom case of Eq. (15).

We calculate the mean skill scores $Q_m$ and corresponding SEM reported in Figs. 8 and 9 based on the skill scores of $I = 50$ explanation samples. We choose this number of samples to provide valid statistics, while maintaining computational efficiency, for both networks. All samples are drawn randomly from the calculated explanations (both training and test data). For each explanation method $m$, the mean skill score $Q_m$ and corresponding SEM $\bar{s}_m$ are calculated as follows:
$$Q_m = \frac{1}{I} \sum_{j=1}^{I} \bar{q}_{m,j}, \qquad \bar{s}_m = \frac{s}{\sqrt{I}},$$
with $s$ being the standard deviation of the normalized scores $\bar{q}_{m,j}$ (see section 3) across explanation samples.
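In NumPy, this aggregation is a one-liner each; the per-sample skill scores below are random stand-ins for the normalized scores computed in the study:

```python
import numpy as np

rng = np.random.default_rng(5)
I = 50
q = rng.uniform(0.2, 0.8, size=I)   # stand-in normalized skill scores q̄_{m,j}

Q_m = q.mean()                      # Q_m = (1/I) * sum_j q̄_{m,j}
sem = q.std(ddof=1) / np.sqrt(I)    # SEM: s̄_m = s / sqrt(I)
```

Note the `ddof=1` for the sample standard deviation; NumPy's default `ddof=0` would slightly underestimate the SEM.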

An exception is the ROAD metric: as discussed in section 3, the curve used in the AUC calculation results from the average over N = 50 samples. Thus, we repeat the AUC calculation for V = 10 draws of N = 50 samples and calculate the mean skill score and the SEM from these repetitions.

APPENDIX B

Additional Experiments

a. Network and explanation

Aside from the learning rate $l$ ($l_{\text{CNN}} = 0.001$), we maintain a similar set of hyperparameters to Labe and Barnes (2021) and use the fuzzy classification setup for the performance validation. To assess the predictions of the network for each individual input, we include the network predictions for the 20CRv3 data, i.e., observations (Slivinski et al. 2019). We measure performance using both the RMSE $R$ between the true year $y_{\text{true}}$ and the predicted year $\hat{y}$ as well as the accuracy on the test set. Both the MLP and the CNN perform similarly to the networks in the primary publication. We show below (in Fig. B3) the regression curves for the model data (gray) and reanalysis data (blue) of the MLP (Fig. B3a) and CNN (Fig. B3b) [see also Fig. 3c in Labe and Barnes (2021)]. We train both networks to exhibit no significant performance differences and prevent overfitting. The learning curves for the MLP, achieving a test accuracy of $\text{Acc}_{\text{MLP}} = 67\% \pm 4\%$, and the CNN, with $\text{Acc}_{\text{CNN}} = 71\% \pm 2\%$ (estimated across 50 trained networks), are shown in Figs. B1 and B2, respectively. Additionally, we consider the RMSE of the predicted years and see comparable RMSE for the test data with $R_{\text{MLP}} = 5.1$ and $R_{\text{CNN}} = 4.5$.

Fig. B1.

Learning curve of the MLP including (a) accuracy and (b) loss. In both plots, the scatter graph represents the training performance, and the line graph represents the performance on the validation data.


Fig. B2.

Learning curve of the CNN including (a) accuracy and (b) loss. In both plots, the scatter graph represents the training performance, and the line graph represents the performance on the validation data.


In Fig. B3, we also show the number of correct predictions for both architectures (all points on the regression line). In these graphs, we observe changing numbers of correct predictions across different years. Thus, we apply all explanation methods to the full model data Ω, to ensure access to correct samples across all years.

Fig. B3.

Network performance based on the RMSE of the predicted years to the true years of both (a) MLP and (b) CNN [cf. to Fig. 3c in Labe and Barnes (2021)]. The light gray dots correspond to the agreement of the predictions based on the training and validation data to the actual years and the dark gray dots show agreement between the predictions on the test set and the actual years, with the black line showing the linear regression across the full model data (training, validation, and test data). In blue, we also included the predictions on the reanalysis data with the linear regression line in dark blue.


We show examples of the MLP and CNN explanations across all explanation methods in Figs. B4 and B5. Following Labe and Barnes (2021), we identify a prediction as correct if the regressed year lies within ±2 years of the true year. We average correct predictions across ensemble members and display time periods of 40 years based on the temporal average of explanations [see Fig. 6 in Labe and Barnes (2021)].

Fig. B4.

The MLP explanation map average over 1920–60, 1960–2000, 2000–40, and 2040–80 for all XAI methods. (first row) The average input temperature map T¯ with the color bar ranging from maximum (red) to minimum (blue) temperature anomalies. All consecutive rows show the explanation maps of the different XAI methods with the color bar ranging from 1 (red) to −1 (blue).


Fig. B5.

The CNN explanation map average over 1920–60, 1960–2000, 2000–40, and 2040–80 for all XAI methods. (first row) The average input temperature map T¯ with the color bar ranging from maximum (red) to minimum (blue) temperature anomalies. All consecutive rows show the explanation maps of the different XAI methods with the color bar ranging from 1 (red) to −1 (blue).


In comparison, both figures highlight the difference in spatial learning patterns, with the CNN relevance focusing on pixel groups, whereas the MLP relevance can change pixelwise. In Table B1, we list the hyperparameters of the explanation methods compared in our experiments. We use the notation introduced in appendix A, section a. We use Integrated Gradients with the baseline $\bar{x}$ generated by default by iNNvestigate.

Table B1.

The hyperparameters of the XAI methods. Note that parameters vary across explanation methods; we report only adjusted parameters and mark all others with a dash (—). We denote the maximum and minimum values across all temperature maps X in the dataset Ω as $x_{\max}$ and $x_{\min}$, respectively.


b. Evaluation metrics

1) Hyperparameters

In Table B2, we list the hyperparameters of the different metrics. We list only the adapted parameters; for all others, we use the Quantus default values (see Hedström et al. 2023b). The normalization parameter refers to normalization of the explanation according to Eq. (A12).

Table B2.

We show the hyperparameters of the XAI evaluation metrics based on the Quantus package calculations (Hedström et al. 2023b). We consider the metrics average sensitivity (AS), local Lipschitz estimate (LLE), faithfulness correlation (FC), ROAD, model parameter randomization test (MPT), random logit (RL), complexity (COM), sparseness (SPA), Top-K, and relevance rank accuracy (RRA). Note that parameters vary across metrics; we report settings only for existing parameters in each metric (for all others, we use a dash —).


(i) Faithfulness

In Table B2, the perturbation function “Indices” refers to the baseline replacement by indices of the highest value pixels in the explanation and “Linear” refers to noisy linear imputation [see Rong et al. (2022a) for details]. Please note that the evaluation of the faithfulness property strongly depends on the choice of perturbation baseline. Thus, we advise the reader to choose the uniform baseline, as determined here for standardized weather data, as it most strongly resembles noise.

(ii) Randomization

For the model parameter randomization test score calculations, we perturb the layer weights starting from the output layer to the input layer, referred to as “bottom_up” in Table B2. To ensure comparability, we use the Pearson correlation as the similarity function for both metrics.

(iii) Localization

For Top-k, we consider k = 0.1d, i.e., the 10% most relevant of all d pixels in the temperature map.
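As a worked example of this choice, a minimal NumPy sketch of a Top-k localization score computes the fraction of the k most relevant pixels falling inside the ROI; the relevance map and the ROI box are toy assumptions, not this study's NA mask:

```python
import numpy as np

def top_k_score(R, roi_mask, k_frac=0.1):
    # Fraction of the k = k_frac * d most relevant pixels inside the ROI.
    R = R.ravel()
    roi = roi_mask.ravel().astype(bool)
    k = max(1, int(k_frac * R.size))        # k = 0.1 d
    top_idx = np.argsort(R)[-k:]            # indices of the k largest relevances
    return roi[top_idx].mean()

rng = np.random.default_rng(6)
R = rng.uniform(size=(10, 10))
roi = np.zeros((10, 10), dtype=bool)
roi[2:5, 3:7] = True                        # toy region of interest
R[2:5, 3:7] += 1.0                          # make ROI pixels the most relevant
score = top_k_score(R, roi)
```

A score of 1 means all top-k pixels lie inside the ROI; an uninformative uniform explanation would, on average, score close to the ROI's area fraction.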

REFERENCES

  • Abadi, M., and Coauthors, 2016: TensorFlow: A system for large-scale machine learning. Proc. 12th USENIX Symp. on Operating Systems Design and Implementation (OSDI’16), Savannah, GA, USENIX Association, 265–283, https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.

  • Adebayo, J., J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, 2018: Sanity checks for saliency maps. NIPS’18: Proc. 32nd Int. Conf. on Neural Information Processing Systems, Montréal, Canada, Curran Associates Inc., 9525–9536, https://dl.acm.org/doi/10.5555/3327546.3327621.

  • Agarwal, C., S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, and H. Lakkaraju, 2022: OpenXAI: Towards a transparent evaluation of model explanations. Advances in Neural Information Processing Systems, S. Koyejo et al., Eds., Curran Associates Inc., 15 784–15 799.

  • Alber, M., and Coauthors, 2019: iNNvestigate neural networks! J. Mach. Learn. Res., 20 (93), 1–8.

  • Alvarez-Melis, D., and T. S. Jaakkola, 2018a: Towards robust interpretability with self-explaining neural networks. NIPS’18: Proc. 32nd Int. Conf. on Neural Information Processing Systems, Montréal, Canada, Curran Associates Inc., 7786–7795, https://dl.acm.org/doi/10.5555/3327757.3327875.

  • Alvarez-Melis, D., and T. S. Jaakkola, 2018b: On the robustness of interpretability methods. arXiv, 1806.08049v1, https://doi.org/10.48550/arXiv.1806.08049.

  • Anantrasirichai, N., J. Biggs, F. Albino, and D. Bull, 2019: A deep learning approach to detecting volcano deformation from satellite imagery using synthetic datasets. Remote Sens. Environ., 230, 111179, https://doi.org/10.1016/j.rse.2019.04.032.

  • Ancona, M., E. Ceolini, C. Öztireli, and M. Gross, 2019: Gradient-based attribution methods. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, W. Samek et al., Eds., Springer International Publishing, 169–191, https://doi.org/10.1007/978-3-030-28954-6_9.

  • Arias-Duart, A., F. Parés, D. Garcia-Gasulla, and V. Giménez-Ábalos, 2022: Focus! Rating XAI methods and finding biases. 2022 IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE), Padua, Italy, Institute of Electrical and Electronics Engineers, 1–8, https://doi.org/10.1109/FUZZ-IEEE55066.2022.9882821.

  • Arras, L., A. Osman, and W. Samek, 2020: Ground truth evaluation of neural network explanations with CLEVR-XAI. arXiv, 2003.07258v2, https://doi.org/10.48550/arXiv.2003.07258.

  • Arrieta, A. B., and Coauthors, 2020: Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion, 58, 82–115, https://doi.org/10.1016/j.inffus.2019.12.012.

  • Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, 2015: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10, e0130140, https://doi.org/10.1371/journal.pone.0130140.

  • Baehrens, D., T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, 2010: How to explain individual classification decisions. J. Mach. Learn. Res., 11, 1803–1831.

  • Balduzzi, D., M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, 2017: The Shattered Gradients Problem: If resnets are the answer, then what is the question? arXiv, 1702.08591v2, https://doi.org/10.48550/arXiv.1702.08591.

  • Barnes, E. A., B. Toms, J. W. Hurrell, I. Ebert-Uphoff, C. Anderson, and D. Anderson, 2020: Indicator patterns of forced change learned by an artificial neural network. J. Adv. Model. Earth Syst., 12, e2020MS002195, https://doi.org/10.1029/2020MS002195.

  • Barnes, E. A., R. J. Barnes, and N. Gordillo, 2021: Adding uncertainty to neural network regression tasks in the geosciences. arXiv, 2109.07250v1, https://doi.org/10.48550/arXiv.2109.07250.

  • Bhatt, U., A. Weller, and J. M. Moura, 2020: Evaluating and aggregating feature-based model explanations. arXiv, 2005.00631v1, https://doi.org/10.48550/arXiv.2005.00631.

  • Brocki, L., and N. C. Chung, 2022: Fidelity of interpretability methods and perturbation artifacts in neural networks. arXiv, 2203.02928v4, https://doi.org/10.48550/arXiv.2203.02928.

  • Bromberg, C. L., C. Gazen, J. J. Hickey, J. Burge, L. Barrington, and S. Agrawal, 2019: Machine learning for precipitation nowcasting from radar images. arXiv, 1912.12132v1, https://arxiv.org/abs/1912.12132.

  • Bykov, K., M. M.-C. Höhne, A. Creosteanu, K.-R. Müller, F. Klauschen, S. Nakajima, and M. Kloft, 2021: Explaining Bayesian neural networks. arXiv, 2108.10346v1, https://doi.org/10.48550/arXiv.2108.10346.

  • Bykov, K., M. Deb, D. Grinwald, K.-R. Müller, and M. M.-C. Höhne, 2022a: DORA: Exploring outlier representations in deep neural networks. arXiv, 2206.04530v4, https://doi.org/10.48550/arXiv.2206.04530.

  • Bykov, K., A. Hedström, S. Nakajima, and M. M.-C. Höhne, 2022b: NoiseGrad: Enhancing explanations by introducing stochasticity to model weights. Proc. 36th AAAI Conf. on Artificial Intelligence, Online, AAAI Press, 6132–6140, https://doi.org/10.1609/aaai.v36i6.20561.

  • Camps-Valls, G., M. Reichstein, X. Zhu, and D. Tuia, 2020: Advancing deep learning for earth sciences from hybrid modeling to interpretability. IGARSS 2020 – 2020 IEEE Int. Geoscience and Remote Sensing Symp., Waikoloa, HI, Institute of Electrical and Electronics Engineers, 3979–3982, https://doi.org/10.1109/IGARSS39084.2020.9323558.

  • Chalasani, P., J. Chen, A. R. Chowdhury, S. Jha, and X. Wu, 2020: Concise explanations of neural networks using adversarial training. ICML’20: Proc. 37th Int. Conf. on Machine Learning, Online, PMLR, https://proceedings.mlr.press/v119/chalasani20a/chalasani20a.pdf.

  • Chen, C., O. Li, D. Tao, A. J. Barnett, C. Rudin, and J. K. Su, 2019: This looks like that: Deep learning for interpretable image recognition. NIPS’19: Proc. 33rd Int. Conf. on Neural Information Processing Systems, Vancouver, British Columbia, Canada, 8930–8941, https://dl.acm.org/doi/10.5555/3454287.3455088.

  • Chen, K., P. Wang, X. Yang, N. Zhang, and D. Wang, 2020: A model output deep learning method for grid temperature forecasts in Tianjin area. Appl. Sci., 10, 5808, https://doi.org/10.3390/app10175808.

  • Clare, M. C., M. Sonnewald, R. Lguensat, J. Deshayes, and V. Balaji, 2022: Explainable artificial intelligence for Bayesian neural networks: Toward trustworthy predictions of ocean dynamics. J. Adv. Model. Earth Syst., 14, e2022MS003162, https://doi.org/10.1002/essoar.10511239.1.

  • Covert, I. C., S. Lundberg, and S.-I. Lee, 2021: Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res., 22, 9477–9566.

  • Das, A., and P. Rad, 2020: Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv, 2006.11371v2, https://doi.org/10.48550/arXiv.2006.11371.

  • Dikshit, A., and B. Pradhan, 2021: Interpretable and explainable AI (XAI) model for spatial drought prediction. Sci. Total Environ., 801, 149797, https://doi.org/10.1016/j.scitotenv.2021.149797.

  • Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning and interpretation of neural networks for working with images in meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149–E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.

  • Felsche, E., and R. Ludwig, 2021: Applying machine learning for drought prediction in a perfect model framework using data from a large ensemble of climate simulations. Nat. Hazards Earth Syst. Sci., 21, 3679–3691, https://doi.org/10.5194/nhess-21-3679-2021.

  • Flora, M., C. Potvin, A. McGovern, and S. Handler, 2022: Comparing explanation methods for traditional machine learning models Part 1: An overview of current methods and quantifying their disagreement. arXiv, 2211.08943v1, https://doi.org/10.48550/arXiv.2211.08943.

  • Gautam, S., A. Boubekki, S. Hansen, S. Salahuddin, R. Jenssen, M. Höhne, and M. Kampffmeyer, 2022: ProtoVAE: A trustworthy self-explainable prototypical variational model. 36th Conf. on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, NeurIPS, 17 940–17 952, https://proceedings.neurips.cc/paper_files/paper/2022/file/722f3f9298a961d2639eadd3f14a2816-Paper-Conference.pdf.

  • Gautam, S., M. M.-C. Höhne, S. Hansen, R. Jenssen, and M. Kampffmeyer, 2023: This looks more like that: Enhancing self-explaining models by prototypical relevance propagation. Pattern Recognit., 136, 109172, https://doi.org/10.1016/j.patcog.2022.109172.

  • Gevaert, A., A.-J. Rousseau, T. Becker, D. Valkenborg, T. De Bie, and Y. Saeys, 2022: Evaluating feature attribution methods in the image domain. arXiv, 2202.12270v1, https://doi.org/10.48550/arXiv.2202.12270.

  • Gibson, P. B., W. E. Chapman, A. Altinok, L. Delle Monache, M. J. DeFlorio, and D. E. Waliser, 2021: Training machine learning models on climate model output yields skillful interpretable seasonal precipitation forecasts. Commun. Earth Environ., 2, 159, https://doi.org/10.1038/s43247-021-00225-4.

  • Grinwald, D., K. Bykov, S. Nakajima, and M. M.-C. Höhne, 2022: Visualizing the diversity of representations learned by Bayesian neural networks. arXiv, 2201.10859v2, https://doi.org/10.48550/arXiv.2201.10859.

  • Ham, Y.-G., J.-H. Kim, and J.-J. Luo, 2019: Deep learning for multi-year ENSO forecasts. Nature, 573, 568–572, https://doi.org/10.1038/s41586-019-1559-7.

  • Han, L., J. Sun, W. Zhang, Y. Xiu, H. Feng, and Y. Lin, 2017: A machine learning nowcasting method based on real-time reanalysis data. J. Geophys. Res. Atmos., 122, 4038–4051, https://doi.org/10.1002/2016JD025783.

  • Han, T., S. Srinivas, and H. Lakkaraju, 2022: Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. arXiv, 2206.01254v3, https://doi.org/10.48550/arXiv.2206.01254.

  • Harder, P., D. Watson-Parris, D. Strassel, N. Gauger, P. Stier, and J. Keuper, 2021: Emulating aerosol microphysics with machine learning. arXiv, 2109.10593v2, https://doi.org/10.48550/arXiv.2109.10593.

  • Harris, C. R., and Coauthors, 2020: Array programming with NumPy. Nature, 585, 357–362, https://doi.org/10.1038/s41586-020-2649-2.

  • He, S., X. Li, T. DelSole, P. Ravikumar, and A. Banerjee, 2021: Sub-seasonal climate forecasting via machine learning: Challenges, analysis, and advances. Proc. 35th AAAI Conf. on Artificial Intelligence, Online, AAAI Press, https://doi.org/10.1609/aaai.v35i1.16090.

  • Hedström, A., P. Bommer, K. K. Wickstrøm, W. Samek, S. Lapuschkin, and M. M.-C. Höhne, 2023a: The meta-evaluation problem in explainable AI: Identifying reliable estimators with MetaQuantus. arXiv, 2302.07265v2, https://doi.org/10.48550/arXiv.2302.07265.

  • Hedström, A., L. Weber, D. Krakowczyk, D. Bareeva, F. Motzkus, W. Samek, S. Lapuschkin, and M. M.-C. Höhne, 2023b: Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res., 24 (34), 1–11.

  • Hengl, T., and Coauthors, 2017: SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE, 12, e0169748, https://doi.org/10.1371/journal.pone.0169748.

  • Hilburn, K. A., 2023: Understanding spatial context in convolutional neural networks using explainable methods: Application to interpretable GREMLIN. Artif. Intell. Earth Syst., 2, e220093, https://doi.org/10.1175/AIES-D-22-0093.1.

  • Hilburn, K. A., I. Ebert-Uphoff, and S. D. Miller, 2021: Development and interpretation of a neural-network-based synthetic radar reflectivity estimator using GOES-R satellite observations. J. Appl. Meteor. Climatol., 60, 3–21, https://doi.org/10.1175/JAMC-D-20-0084.1.

  • Hoffman, R. R., S. T. Mueller, G. Klein, and J. Litman, 2018: Metrics for explainable AI: Challenges and prospects. arXiv, 1812.04608v2, https://doi.org/10.48550/arXiv.1812.04608.

  • Hunter, J. D., 2007: Matplotlib: A 2D graphics environment. Comput. Sci. Eng., 9, 90–95, https://doi.org/10.1109/MCSE.2007.55.

  • Hurley, N., and S. Rickard, 2009: Comparing measures of sparsity. 2008 IEEE Workshop on Machine Learning for Signal Processing, Cancun, Mexico, Institute of Electrical and Electronics Engineers, 4723–4741, https://doi.org/10.1109/MLSP.2008.4685455.

  • Hurrell, J. W., and Coauthors, 2013: The Community Earth System Model: A framework for collaborative research. Bull. Amer. Meteor. Soc., 94, 1339–1360, https://doi.org/10.1175/BAMS-D-12-00121.1.

  • Janzing, D., L. Minorics, and P. Blöbaum, 2020: Feature relevance quantification in explainable AI: A causal problem. Proc. 23rd Int. Conf. on Artificial Intelligence and Statistics, Palermo, Italy, PMLR, 2907–2916, https://proceedings.mlr.press/v108/janzing20a/janzing20a.pdf.

  • Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bull. Amer. Meteor. Soc., 96, 1333–1349, https://doi.org/10.1175/BAMS-D-13-00255.1.

  • Krishna, S., T. Han, A. Gu, J. Pombra, S. Jabbari, S. Wu, and H. Lakkaraju, 2022: The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv, 2202.01602v3, https://doi.org/10.48550/arXiv.2202.01602.

  • Labe, Z. M., and E. A. Barnes, 2021: Detecting climate signals using explainable AI with single-forcing large ensembles. J. Adv. Model. Earth Syst., 13, e2021MS002464, https://doi.org/10.1029/2021MS002464.

  • Labe, Z. M., and E. A. Barnes, 2022: Comparison of climate model large ensembles with observations in the Arctic using simple neural networks. Earth Space Sci., 9, e2022EA002348, https://doi.org/10.1002/essoar.10510977.1.

  • Lapuschkin, S., S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, 2019: Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096, https://doi.org/10.1038/s41467-019-08987-4.

  • Leinonen, J., D. Nerini, and A. Berne, 2021: Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Trans. Geosci. Remote Sens., 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790.

  • Letzgus, S., P. Wagner, J. Lederer, W. Samek, K.-R. Müller, and G. Montavon, 2022: Toward explainable artificial intelligence for regression models: A methodological perspective. IEEE Signal Process. Mag., 39, 40–58, https://doi.org/10.1109/MSP.2022.3153277.

  • Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Curran Associates Inc., 4765–4774, https://dl.acm.org/doi/10.5555/3295222.3295230.

  • Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2020: Explainable artificial intelligence in meteorology and climate science: Model fine-tuning, calibrating trust and learning new science. xxAI: Int. Workshop on Extending Explainable AI beyond Deep Models and Classifiers, Vienna, Austria, Springer, 315–339, https://doi.org/10.1007/978-3-031-04083-2_16.

  • Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2022a: Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst., 1, e220012, https://doi.org/10.1175/AIES-D-22-0012.1.

  • Mamalakis, A., I. Ebert-Uphoff, and E. A. Barnes, 2022b: Neural network attribution methods for problems in geoscience: A novel synthetic benchmark dataset. Environ. Data Sci., 1, e8, https://doi.org/10.1017/eds.2022.7.

  • Mayer, K. J., and E. A. Barnes, 2021: Subseasonal forecasts of opportunity identified by an explainable neural network. Geophys. Res. Lett., 48, e2020GL092092, https://doi.org/10.1029/2020GL092092.

  • McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.

  • Mohseni, S., N. Zarei, and E. D. Ragan, 2021: A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Trans. Interact. Intell. Syst., 11 (3–4), 1–45, https://doi.org/10.1145/3387166.

  • Montavon, G., S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, 2017: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit., 65, 211–222, https://doi.org/10.1016/j.patcog.2016.11.008.

  • Montavon, G., W. Samek, and K.-R. Müller, 2018: Methods for interpreting and understanding deep neural networks. Digit. Signal Process., 73, 1–15, https://doi.org/10.1016/j.dsp.2017.10.011.

  • Montavon, G., A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, 2019: Layer-wise relevance propagation: An overview. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer, 193–209.

  • Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116, 2417–2424, https://doi.org/10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.

  • Murphy, A. H., and H. Daan, 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, CRC Press, 379–437.

  • Nguyen, A., A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, 2016: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. NIPS’16: Proc. 30th Int. Conf. on Neural Information Processing Systems, Barcelona, Spain, Curran Associates Inc., 3395–3403, https://dl.acm.org/doi/10.5555/3157382.3157477.

  • Nguyen, A.-p., and M. R. Martínez, 2020: On quantitative aspects of model interpretability. arXiv, 2007.07584v1, https://doi.org/10.48550/arXiv.2007.07584.

  • Olah, C., A. Mordvintsev, and L. Schubert, 2017: Feature visualization. Distill, 2, e7, https://doi.org/10.23915/distill.00007.

  • Pegion, K., E. J. Becker, and B. P. Kirtman, 2022: Understanding predictability of daily southeast U.S. precipitation using explainable machine learning. Artif. Intell. Earth Syst., 1, e220011, https://doi.org/10.1175/AIES-D-22-0011.1.

  • Petsiuk, V., A. Das, and K. Saenko, 2018: RISE: Randomized input sampling for explanation of black-box models. arXiv, 1806.07421v3, https://doi.org/10.48550/arXiv.1806.07421.

  • Ribeiro, M. T., S. Singh, and C. Guestrin, 2016: “Why should I trust you?”: Explaining the predictions of any classifier. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, Association for Computing Machinery, 1135–1144, https://doi.org/10.1145/2939672.2939778.

  • Rieger, L., and L. K. Hansen, 2020: IROF: A low resource evaluation metric for explanation methods. arXiv, 2003.08747v1, https://doi.org/10.48550/arXiv.2003.08747.

  • Rong, Y., T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci, 2022a: A consistent and efficient evaluation strategy for attribution methods. arXiv, 2202.00449v2, https://doi.org/10.48550/arXiv.2202.00449.

  • Rong, Y., T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci, 2022b: Evaluating feature attribution: An information-theoretic perspective. arXiv, 2202.00449v1, https://www.hci.uni-tuebingen.de/assets/pdf/publications/rong2022evaluating.pdf.

  • Samek, W., A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, 2017: Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Networks Learn. Syst., 28, 2660–2673, https://doi.org/10.1109/TNNLS.2016.2599820.

  • Samek, W., G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller, 2019: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Vol. 11700. Springer Nature, 439 pp.

  • Sawada, Y., and K. Nakamura, 2022: C-SENN: Contrastive self-explaining neural network. arXiv, 2206.09575v2, https://doi.org/10.48550/arXiv.2206.09575.

  • Scher, S., and G. Messori, 2021: Ensemble methods for neural network-based weather forecasts. J. Adv. Model. Earth Syst., 13, e2020MS002331, https://doi.org/10.1029/2020MS002331.

  • Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, 2017: Grad-CAM: Visual explanations from deep networks via gradient-based localization. Proc. 2017 IEEE Int. Conf. on Computer Vision, Venice, Italy, Institute of Electrical and Electronics Engineers, 618–626, https://doi.org/10.1109/ICCV.2017.74.

  • Shapley, L. S., 1951: Notes on the N-person game—II: The value of an N-person game. RAND Corporation, 19 pp., https://www.rand.org/pubs/research_memoranda/RM0670.html.

  • Shi, X., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, 2015: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. NIPS’15: Proc. 28th Int. Conf. on Neural Information Processing Systems, Montreal, Canada, MIT Press, 802–810, https://dl.acm.org/doi/proceedings/10.5555/2969239.

  • Shrikumar, A., P. Greenside, A. Shcherbina, and A. Kundaje, 2016: Not just a black box: Learning important features through propagating activation differences. arXiv, 1605.01713v3, https://doi.org/10.48550/arXiv.1605.01713.

  • Simonyan, K., A. Vedaldi, and A. Zisserman, 2014: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv, 1312.6034v2, https://arxiv.org/abs/1312.6034.

  • Sixt, L., M. Granz, and T. Landgraf, 2020: When explanations lie: Why many modified BP attributions fail. Proc. 37th Int. Conf. on Machine Learning (ICML 2020), Online, PMLR, https://researchr.org/publication/icml-2020.

  • Slivinski, L. C., and Coauthors, 2019: Towards a more reliable historical reanalysis: Improvements for version 3 of the Twentieth Century Reanalysis system. Quart. J. Roy. Meteor. Soc., 145, 2876–2908, https://doi.org/10.1002/qj.3598.

  • Smilkov, D., N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, 2017: SmoothGrad: Removing noise by adding noise. arXiv, 1706.03825v1, https://doi.org/10.48550/arXiv.1706.03825.

  • Sonnewald, M., and R. Lguensat, 2021: Revealing the impact of global heating on North Atlantic circulation using transparent machine learning. J. Adv. Model. Earth Syst., 13, e2021MS002496, https://doi.org/10.1029/2021MS002496.

  • Strumbelj, E., and I. Kononenko, 2010: An efficient explanation of individual classifications using game theory. J. Mach. Learn. Res., 11, 1–18.

  • Sturmfels, P., S. Lundberg, and S.-I. Lee, 2020: Visualizing the impact of feature attribution baselines. Distill, 5, e22, https://doi.org/10.23915/distill.00022.