Interpretable Quality Control of Sparsely Distributed Environmental Sensor Networks Using Graph Neural Networks

Elżbieta Lasota, Institute of Meteorology and Climate Research Atmospheric Environmental Research, Karlsruhe Institute of Technology, Garmisch-Partenkirchen, Germany (https://orcid.org/0000-0002-6276-2188)

Timo Houben, Research Data Management and Department of Monitoring and Exploration Technologies, Helmholtz Centre for Environmental Research – UFZ, Leipzig, Germany

Julius Polz, Institute of Meteorology and Climate Research Atmospheric Environmental Research, Karlsruhe Institute of Technology, Garmisch-Partenkirchen, Germany

Lennart Schmidt, Research Data Management and Department of Monitoring and Exploration Technologies, Helmholtz Centre for Environmental Research – UFZ, Leipzig, Germany

Luca Glawion, Institute of Meteorology and Climate Research Atmospheric Environmental Research, Karlsruhe Institute of Technology, Garmisch-Partenkirchen, Germany

David Schäfer, Research Data Management and Department of Monitoring and Exploration Technologies, Helmholtz Centre for Environmental Research – UFZ, Leipzig, Germany

Jan Bumberger, Research Data Management and Department of Monitoring and Exploration Technologies, Helmholtz Centre for Environmental Research – UFZ, Leipzig, Germany; German Centre for Integrative Biodiversity Research Halle-Jena-Leipzig, Leipzig, Germany

Christian Chwala, Institute of Meteorology and Climate Research Atmospheric Environmental Research, Karlsruhe Institute of Technology, Garmisch-Partenkirchen, Germany

Abstract

Environmental sensor networks play a crucial role in monitoring key parameters essential for understanding Earth’s systems. To ensure the reliability and accuracy of collected data, effective quality control (QC) measures are essential. Conventional QC methods struggle to handle the complexity of environmental data. Conversely, advanced techniques such as neural networks are typically not designed to process data from sensor networks with irregular spatial distribution. In this study, we focus on anomaly detection in environmental sensor networks using graph neural networks, which can represent sensor network structures as graphs. We investigate their performance on two datasets with distinct dynamics and resolution: commercial microwave link (CML) signal levels used for rainfall estimation and SoilNet soil moisture measurements. To evaluate the benefits of incorporating neighboring sensor information for anomaly detection, we compare two models: a graph convolution network (GCN) and a graph-less baseline, a long short-term memory (LSTM) network. Our robust evaluation through five-fold cross validation demonstrates the superiority of the GCN models. For CML, the mean area under the receiver operating characteristic curve for the GCN was 0.941 compared to 0.885 for the baseline LSTM, and for SoilNet, it was 0.858 for the GCN and 0.816 for the baseline LSTM. Visual inspection of CML time series revealed that the GCN proficiently classified anomalies and remained resilient against rain-induced events often misidentified by the baseline LSTM. However, for SoilNet, the advantage of the GCN was less pronounced, likely due to inconsistent and less precise labeling. Through interpretable model analysis, we demonstrate how feature attributions vividly illustrate the significance of neighboring sensor data, particularly in distinguishing between anomalies and expected changes in the signal level in the time series.

© 2025 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Christian Chwala, christian.chwala@kit.edu


1. Introduction

Climate information is the key ingredient for successful climate change adaptation and the mitigation of impacts of extreme events. Accurate and dense climate observations are essential for risk management and the prediction of natural hazards. However, there is a large gap in the global availability of climate information, especially in developing countries (UNFCCC 2022; Lorenz and Kunstmann 2012). To close this gap, multiple options are available. First, the installation of cost- and maintenance-intensive dedicated sensors, such as those used in the Integrated Carbon Observation System (ICOS) or Terrestrial Environmental Observatories (TERENO) (Bogena 2016; Rebmann et al. 2018). Second, the use of existing infrastructure, such as commercial microwave links (CMLs) for precipitation estimation (Chwala and Kunstmann 2019). Third, the deployment of low-cost sensors, like personal weather stations (Graf et al. 2021).

a. Quality control of environmental sensor data

A common theme across all efforts to observe Earth’s environment outside controlled laboratory settings is the need for extensive quality control (QC) of the data. Inevitably, environmental sensors are subject to numerous disruptive influences and, thus, exhibit erroneous data, manifesting as unacceptable deviations from the expected values or ground truth of the measured variable. Causes include instrument constraints such as battery voltage drops and malfunctions, technical failures during data transmission, or environmental influences that interfere with the measurement principle (Gandin 1988). Common QC approaches are 1) manual data inspection by domain experts, a task that lacks reproducibility and explainability and is often too laborious for operational processing of large data volumes (Jones et al. 2018). 2) Automated workflows using rule-based or parametric statistical tests, such as defining rules for physically plausible value limits or detecting outliers with respect to a given statistical distribution (Schmidt et al. 2023; Horsburgh et al. 2015). Finding the best combination and parameterization of such tests involves time-consuming trial-and-error and requires significant expert knowledge, especially when considering cross dependencies between sensors or variables (Sturtevant et al. 2021). 3) Deep learning (DL) algorithms for anomaly detection, which have the potential to enhance automated QC routines. These algorithms can process and learn from diverse and extensive datasets, enabling them to capture complex relationships that traditional rule-based methods may overlook. Despite varying requirements for accurately labeled training data depending on the complexity of the problem (Erhan et al. 2021; Wang et al. 2022), their ability to autonomously learn from data enhances their versatility and effectiveness in QC tasks. However, DL methods are often considered black boxes, meaning that the models’ decisions are not self-explanatory and are hard to interpret.

b. Deep learning for anomaly detection

In this study, we focus on the methodological development of improved DL approaches for the QC of environmental sensor data. DL has been extensively applied in fields such as cybersecurity, medicine, food industry, and manufacturing (Zhang et al. 2021; Vandewinckele et al. 2020; Nayak et al. 2020; Cioffi et al. 2020). However, most studies have benefited from ready-to-use benchmark datasets, which enable the comparison of algorithm performance across different studies (Erhan et al. 2021). Currently, the few applications of anomaly detection on real-world environmental sensor networks primarily focus on detecting anomalies in uni- or multivariate time series from single sensors using well-established methods such as autoregressive integrated moving average, support vector machines, and long short-term memory (LSTM) models (Russo et al. 2021; Jones et al. 2022; Muharemi et al. 2019). There has been relatively little emphasis on addressing the challenges associated with DL-based QC of sensor networks that consider neighboring sensor information. These challenges include spatial sparsity due to irregular sensor network layouts and variations in data availability resulting from evolving network layouts and sensor malfunctions.

Considering not only single-sensor anomalies but also contextual anomalies, it can be assumed that signals from multiple sensors distributed across space, and their interrelationships, help in detecting erroneous behavior (Chalapathy and Chawla 2019). Thus, a neural network architecture capable of encoding the spatial proximity of sparse and variable inputs is essential to enhance DL-based anomaly detection. One approach that can meet these needs is graph neural networks (GNNs), which possess a remarkable ability to handle irregular and unstructured data containing relational information, naturally represented as graphs. Traditional neural networks are more tailored to process structured data such as images, individual sequences, or rasters (Egmont-Petersen et al. 2002; Zhang et al. 2022; Sutskever et al. 2014). Existing GNN applications span diverse domains, including social network analysis, recommendation systems, and chemistry (Fan et al. 2019; Zhang et al. 2022; Coley et al. 2019). Several well-performing GNN applications exist for anomaly detection in controlled experiments, such as benchmark datasets, synthetic pollution events in air quality data, or simulated attacks on wastewater testbed systems (Guan et al. 2022; Lin et al. 2022; Deng and Hooi 2021). However, the application of GNNs for the QC of sparse, real-world environmental sensor networks has not yet been studied.

c. The need for XAI

Crucial requirements for QC in an operational setting are the interpretability and reproducibility of classification results. While reproducibility is achieved by establishing and sharing a deterministic algorithm, the interpretability of neural network outputs necessitates dedicated techniques summarized under the term explainable artificial intelligence (XAI). Generally, XAI for GNNs is an active field of research, predominantly consisting of theoretical work (Agarwal et al. 2022). Gradient-based XAI methods analyze gradients with respect to neural network input–output pairs to attribute model predictions to input features. Several gradient-based methods for GNNs have been proposed (Baldassarre and Azizpour 2019; Pope et al. 2019), but there are only a few applications (Kosasih and Brintrup 2022; Rathee et al. 2022; Yin et al. 2023). An evaluation of XAI for the interpretability of GNN-based QC of environmental time series sensor data is still missing.

d. Study outline

To close the abovementioned knowledge gaps, this study aims to answer the following two research questions:

  1. Can GNNs improve automated QC of environmental sensor data by integrating spatial information from sensor networks that are distributed irregularly in space and provide varying amounts of observations for each time step?

  2. Can XAI reveal information about the influence of neighboring sensors to explain the decisions of the proposed graph convolution network (GCN) model?

To provide comprehensive answers to these questions, we selected two different datasets: one with CML signal level observations from a large network scattered across Germany and one with soil moisture observations from a local-scale environmental observatory. Both datasets represent environmental sensors and share challenges such as irregular spatial distribution, sensitivity to environmental factors, and a high number of sensors resulting in a large volume of observed data. Differences include variable dynamics, spatial coverage, spatial resolutions, and sampling rate. These differences define the context in which results from this specific study may be generalized.

2. Methods

a. Data

1) Commercial microwave links

CMLs provide line-of-sight radio connections in mobile phone networks (Chwala and Kunstmann 2019). Since the wavelength of the transmitted signal is in the order of raindrop diameters, the signal is significantly attenuated by rainfall through scattering and absorption processes (Atlas and Ulbrich 1977). This offers an opportunity to accurately estimate rainfall amounts, as the rainfall-induced path-integrated attenuation is related to the path-averaged rainfall rate in a close-to-linear manner (Messer et al. 2006). Additionally, CMLs have extensive global coverage in inhabited areas, with more than 90% of the human population living in regions with broadband telecommunication access (Global System for Mobile Communications Association 2022). However, other causes such as dew formation on the antenna, multipath propagation, or mixed-phase precipitation can cause fluctuations in the signal level, thereby disturbing accurate measurements (van Leth et al. 2018).

The CML data used in this study are a subset of a larger dataset collected in cooperation with Ericsson Germany, using a custom CML data acquisition system (Chwala et al. 2016). The full dataset covers 3904 CMLs across Germany. The length of CML paths ranges from 0.1 to over 30 km, and the transmission frequencies range from 10 to 40 GHz. For each CML, received signal level (RSL) and transmitted signal level (TSL) are recorded at a temporal resolution of 1 min and power resolutions of 0.3 and 1.0 dB for RSL and TSL, respectively. The difference between TSL and RSL yields total signal loss (TL), which can be measured for two sublinks within each communication link. This is feasible because each link transmits in both directions, so the signal loss can be measured separately for each of the two sublinks.

The subset used in this study focuses on 20 CMLs that were manually checked and labeled by four independent experts for March, May, and July 2019, using a specifically designed tool for visualization and labeling (Polz et al. 2023). Each expert categorized anomalies into different classes (jump, dew, fluctuation, and unknown). Since this study addresses anomaly detection as a binary classification problem, assigning a single flag required agreement from at least three experts regarding the specific anomaly type. For each of the 20 quality-checked CMLs, data from a selected set of neighboring CMLs are also included, as illustrated by the example in Fig. 1. The neighbor selection procedure is described in section 3.

Fig. 1.

Example of (left) CML and (right) SoilNet data used as input for the GCN models. (a),(b) The basic principles of CML and SoilNet techniques, respectively. (c) The spatial connections between sensors in the CML network at the time of classification, with the labeled sensor highlighted in red to indicate its role in the analysis. (d) The complete 3D network configuration of SoilNet. Due to the large size of the network, only one selected sensor is highlighted with a red edge color, showing its neighboring sensors and their signals for clarity. The colors of nodes in the graphs represent TL and moisture values. (e) The TL time series of the labeled sensors (highlighted in red) and their neighbors. (f) The time series of soil moisture and battery voltage for the labeled sensor (red) and its neighbors. In both panels, red vertical lines mark the classification times.


2) SoilNet

SoilNet sensor networks are used for battery-operated wireless soil moisture and soil temperature measurements as described in Bogena et al. (2010).

The SoilNet data used in this study are a subset of continuous measurements from the Hohes Holz observatory, which is part of the TERENO network (Wollschläger et al. 2017). Hohes Holz is a 1-ha patch of mixed beech forest where meteorological, hydrological, and ecological variables are observed at high spatial and temporal resolutions (Rebmann et al. 2017). The dataset includes measurements of soil moisture (vol %) measured via the capacitance method, soil temperature (°C) measured by an integrated digital thermometer, and a device battery voltage for the year 2014. The variables were measured at 15-min intervals in a network of 180 sensors distributed across 35 spatial sampling locations at irregular spacing. At each sampling location, six sensors were vertically aligned below the soil surface, positioned at approximately 0.1-m intervals up to a depth of 0.6 m.

Errors in soil moisture measurements generally stem from the diverse nature of soil properties and environmental factors (Mittelbach et al. 2012). Temperature fluctuations, improper sensor installation, and calibration errors may introduce inaccuracies. Additionally, the evolving presence of roots, stones, and preferential flow pathways in the soil can lead to spatial variability in moisture content (Mittelbach et al. 2012; Susha Lekshmi et al. 2014). For the site in question, battery voltage drops, transmission errors, and sensor deterioration over time were observed as additional sources of measurement errors.

The dataset was quality checked using a semiautomated routine consisting of three automated tests followed by manual checking. The first automated test, the range test, flagged data points that fell outside a physically plausible value range (between 5% and 60% for soil moisture and from −25° to 50°C for soil temperature). Next, a custom spike test identified physically implausible jumps and outliers in soil moisture and temperature. Finally, the BattV test flagged both soil moisture and temperature if the battery voltage dropped below 3 V. Since these automated routines were not sufficient, all data were manually reviewed and, if necessary, flagged by domain experts. Although multiple experts flagged anomalies during the measurement campaign, each time step was flagged by only one expert. In this study, we focus on the automated detection of manually labeled anomaly flags and nonanomalous data.

3) Data preprocessing

Before model development, we prepared both datasets to optimize their quality and suitability for training. First, we selected relevant features for both datasets. For CML, we used TL from both channels, while for SoilNet, we included moisture, battery voltage, and temperature. Subsequent preprocessing steps involved graph sample preparation with adjacency matrix establishment, missing data imputation, data normalization, and splitting the time series into fixed-length samples. All preprocessing parameters were optimized experimentally.

In the CML dataset, only 20 out of 3904 sensors were flagged, while their neighbors were not quality checked. In contrast, all sensors in the SoilNet dataset were labeled. This difference in label availability required distinct approaches to preparing the samples for both datasets. Due to the limited availability of flagged CML sensors, to form a graph sample for GCN models, all sensors in a 20-km radius around a flagged CML were selected as graph nodes, and nodes with a maximum distance of 10 km were connected by edges. For SoilNet, all available sensors at the given time step were used as nodes and, due to the 3D structure (longitude, latitude, and depth), sensors were connected by edges if they were within a 30-m distance and shared the same depth, or if they were located at the same position and within a vertical distance of up to 0.1 m.
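To make the graph construction concrete, the sketch below builds the node set and binary adjacency matrix for one flagged CML from pairwise sensor distances, using the 20-km node radius and 10-km edge threshold described above. The coordinate array and function names are illustrative assumptions, not the study's actual code.

```python
import numpy as np

def build_cml_adjacency(coords_km, flagged_idx, node_radius_km=20.0, edge_radius_km=10.0):
    """Sketch of distance-based graph construction for one flagged CML.

    coords_km   : (N, 2) array of sensor coordinates in km (hypothetical input)
    flagged_idx : index of the flagged sensor the graph is centered on
    Returns the indices of the selected nodes and their binary adjacency matrix.
    """
    # Pairwise Euclidean distances between all sensors
    dists = np.linalg.norm(coords_km[:, None, :] - coords_km[None, :, :], axis=-1)

    # Nodes: every sensor within the node radius of the flagged sensor
    nodes = np.where(dists[flagged_idx] <= node_radius_km)[0]

    # Edges: connect selected nodes that are closer than the edge threshold
    sub = dists[np.ix_(nodes, nodes)]
    adjacency = (sub <= edge_radius_km).astype(np.float32)
    np.fill_diagonal(adjacency, 0.0)  # self-loops are added later as A + I
    return nodes, adjacency
```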

Following the graph definition, we proceeded to generate time series samples for training and testing. To increase the number of available samples, we applied linear gap interpolation in time, filling up to five missing data points—up to 5 min for CML data and up to 60 min for SoilNet data.
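A minimal pandas sketch of this gap filling is shown below; the `max_gap` of five points corresponds to 5 min for CML, and the helper is purely illustrative of the procedure, not the study's implementation.

```python
import numpy as np
import pandas as pd

def fill_short_gaps(series: pd.Series, max_gap: int = 5) -> pd.Series:
    """Linearly interpolate only gaps of at most `max_gap` consecutive missing values.

    Longer gaps are left as NaN so that the affected samples can be discarded later.
    """
    isna = series.isna()
    # Label consecutive NaN runs and compute their lengths
    run_id = (isna != isna.shift()).cumsum()
    run_len = isna.groupby(run_id).transform("sum")
    # Interpolate everything, then restore NaNs for runs that are too long
    filled = series.interpolate(method="linear", limit_area="inside")
    filled[isna & (run_len > max_gap)] = np.nan
    return filled
```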

Afterward, we normalized the data. For CML, we used “rolling median removal,” which has proven effective for CML applications (Polz et al. 2020), subtracting the median value from the original time series for the 5 days prior to the classification time. For SoilNet, we applied a minimum–maximum scaler based on variable-specific criteria: moisture ranged from a minimum of 0% to a maximum of 60%, battery voltage from a minimum of 2.8 V to a maximum of 3.6 V, and temperature from a minimum of −20°C to a maximum of 40°C.
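The two normalization schemes can be sketched as follows. The rolling window length and min-max bounds follow the values stated above; the series are assumed to carry a DatetimeIndex, and all function names are illustrative.

```python
import pandas as pd

def normalize_cml_tl(tl: pd.Series, window: str = "5D") -> pd.Series:
    """Rolling median removal: subtract the median of the preceding 5 days (CML)."""
    rolling_median = tl.rolling(window, min_periods=1).median()  # requires a DatetimeIndex
    return tl - rolling_median

def minmax_scale(x: pd.Series, lo: float, hi: float) -> pd.Series:
    """Min-max scaling with fixed, variable-specific bounds (SoilNet)."""
    return (x - lo) / (hi - lo)

# Bounds from the text: moisture 0-60 %, battery voltage 2.8-3.6 V, temperature -20 to 40 degC
# moisture_scaled = minmax_scale(moisture, 0.0, 60.0)
```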

The dataset was then partitioned into time series samples of varying lengths. For CML, the time series included 2 h before and 1 h after the flagged time step. For SoilNet, the time series included 72 h before and 12 h after the flagged time step.
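Splitting the normalized series into fixed-length samples around each labeled time step could then look like the following sketch (2 h before and 1 h after the flagged step for CML, i.e., 181 one-minute values); the helper name and signature are hypothetical.

```python
from typing import Optional

import numpy as np
import pandas as pd

def extract_sample(series: pd.Series, t: pd.Timestamp,
                   before: str = "2H", after: str = "1H") -> Optional[np.ndarray]:
    """Cut a fixed-length window around the labeled time step t.

    Returns None if the window still contains missing values after interpolation,
    so that the sample can be excluded (cf. the sample exclusion step below).
    """
    window = series.loc[t - pd.Timedelta(before): t + pd.Timedelta(after)]
    if window.isna().any():
        return None
    return window.to_numpy()
```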

Samples containing any missing values after interpolation were excluded, which could result in gaps in a classification time series larger than the missing period (see “no data” in Fig. 5). For SoilNet, we excluded samples flagged by automated QC tests, as these flags are relatively easy to detect and could skew the model’s performance. As a result, the final dataset includes only those samples flagged manually and nonanomalous data, ensuring a more accurate and challenging assessment of the model’s performance. Eventually, the sample selection and preparation procedure resulted in 2 558 577 samples for CML and 18 639 for SoilNet, comprising a total of 2 588 730 nodes used for model development. Illustrations of exemplary CML and SoilNet graphs and time series are depicted in Fig. 1.

b. GNN for anomaly detection

In this study, we leverage GNNs, specifically focusing on the core operation of graph convolution (GC), to tackle anomaly detection in environmental sensor networks. To comprehensively evaluate the efficacy of our GC-based anomaly detection framework and assess the advantages of incorporating neighboring sensor information, we introduce and compare two distinct models: a GCN model that uses neighboring sensor information and a corresponding baseline LSTM model, which does not.

1) Graph neural network

The following section provides an explanation of GNNs and assumes that readers have a basic understanding of fully connected neural networks (FCNs). While the concepts of GNNs are distinct and tailored to graph-structured data, familiarity with FCNs and their fundamental principles is helpful for grasping the explanations and mechanisms discussed.

The fundamental components of every graph are its nodes $V$ and edges $E$, where nodes ($v_i \in V$) represent the entities and edges ($e_{ij} \in E$) define the relationships between nodes $v_i$ and $v_j$. The arrangement of these elements is captured by the square adjacency matrix $A$, where $A_{ij}$ is an entry indicating the presence (or absence) of an edge between nodes $i$ and $j$. The basic concept underlying GNNs involves the simultaneous processing of information from node features and their interconnected neighbors, as defined by the edges, enabling the propagation of information throughout the graph. This fundamental operation is termed GC and comprises several sequential steps. Initially, node embedding is conducted by associating each node with a feature vector $h_v^{(0)}$. Subsequently, exploiting the adjacency matrix $A$, information is aggregated from all neighboring nodes to update the nodes’ representations. A typical representation of the update rule in graph convolution involves a transformation with learnable parameters and, optionally, an activation function, which may be expressed by the rule (Kipf and Welling 2016; Chen et al. 2020; You et al. 2020):

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right), \qquad (1)$$

where $H^{(l)}$ is the matrix of node features at layer $l$, $\sigma$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops added via the identity matrix $I_N$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $W^{(l)}$ is the learnable weight matrix at layer $l$. This operation is performed iteratively across multiple layers, allowing the model to progressively refine node representations and incorporate complex relationships within the graph structure.
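As a minimal illustration of Eq. (1), the following NumPy sketch performs one propagation step with a dense adjacency matrix; it is meant to make the matrix operations concrete rather than to reproduce the library implementation used in this study.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One graph-convolution step: H' = sigma(D^-1/2 (A + I) D^-1/2 H W).

    H : (N, F) node feature matrix at the current layer
    A : (N, N) binary adjacency matrix without self-loops
    W : (F, F_out) learnable weight matrix
    """
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    deg = A_tilde.sum(axis=1)                     # node degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetrically normalized adjacency
    return activation(A_hat @ H @ W)
```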

2) Model development

We developed two separate models, the GCN model and the baseline LSTM model, for classifying anomalies in CML and SoilNet datasets. The unique characteristics and availability of labels in each dataset determined the model architectures, with simplified versions presented in Fig. 2. To assess GCN effectiveness and the benefits of incorporating neighboring sensor information, we compare it with the baseline LSTM model, which lacks the GC layer and only uses one time series as input. All four models were implemented using Python 3.9.16 with TensorFlow and Keras API (2.11.1) and Spektral (1.3.0) libraries.

Fig. 2.

Schematic representation of two anomaly detection models’ architectures. (left) The GCN model, incorporating GC for anomaly detection; (right) the baseline LSTM without GC, illustrating the structural similarities and differences between the two approaches.


Preprocessed time series samples (see section 3) in the forms of graphs or individual time series served as input to the models. The GNN model architecture (Fig. 2) starts with a GC layer, which captures spatial relationships between sensors and is applied separately at each time step. The GC output is then concatenated with the original time series of flagged sensors. To ensure clarity in the explainability analysis, we refer to the input to the concatenation layer as the flagged sensor (FS) series and to the GC layer as a self-reference cycle or neighbor zero (N0).

Due to varying availability of labeled sensors, the CML GCN predicts an anomaly probability for only one sensor of interest (graph classification problem), while the SoilNet GCN predicts scores for all sensors in the input graph (node classification problem). As a result, the CML version applies global average pooling, calculating mean values at each time step, to ensure consistent tensor shapes before concatenation. For SoilNet, where all neighbors were labeled, this step was not necessary. Given that GC operates independently at each time step, the model incorporates LSTM stacks to capture time dependencies. These stacks comprise LSTM layers combined with average pooling layers to downsample feature maps and reduce their size. The model concludes with dense layers, which allow the network to extract high-level features, reduce the output dimension, and, together with LSTM layers, learn nonlinear relationships.
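The following Keras sketch illustrates the overall structure of the CML GCN variant: a graph convolution applied independently at each time step, global average pooling over nodes, concatenation with the flagged-sensor series, LSTM stacks with average pooling, and dense output layers. All layer sizes, the number of nodes, and the custom-layer implementation are hypothetical placeholders, not the authors' exact architecture or hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T, N, F = 181, 16, 2   # time steps, nodes, features per node (illustrative sizes)

class TimeDistributedGC(layers.Layer):
    """Applies A_hat @ (H W) independently at every time step [cf. Eq. (1)]."""

    def __init__(self, units, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.transform = layers.Dense(units, use_bias=False)  # learnable W
        self.activation = layers.Activation(activation)

    def call(self, inputs):
        h, a_hat = inputs                                # h: (B, T, N, F), a_hat: (B, N, N)
        h = self.transform(h)                            # feature transform H W
        h = tf.einsum("bij,btjf->btif", a_hat, h)        # aggregate over neighbors
        return self.activation(h)

node_feats = layers.Input(shape=(T, N, F))               # time series of all graph nodes
a_hat = layers.Input(shape=(N, N))                       # pre-normalized adjacency matrix
flagged = layers.Input(shape=(T, F))                     # series of the flagged sensor

gc = TimeDistributedGC(16)([node_feats, a_hat])
gc_pooled = layers.Lambda(lambda z: tf.reduce_mean(z, axis=2))(gc)  # global average pooling over nodes
x = layers.Concatenate(axis=-1)([gc_pooled, flagged])
x = layers.LSTM(32, return_sequences=True)(x)
x = layers.AveragePooling1D(pool_size=2)(x)
x = layers.LSTM(16)(x)
x = layers.Dense(8, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)           # anomaly probability

model = Model([node_feats, a_hat, flagged], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```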

Finally, the network was trained using the Adam optimizer and binary cross entropy as the loss function. Model hyperparameters were tuned using manual search, adjusting parameters such as batch size, number of epochs, data normalization, learning rate, number of LSTM stacks and units, as well as activation functions for the GC, dense, and LSTM layers. This manual search, particularly regarding model architecture, was guided by prior experience with CML classification tasks and expert knowledge, including insights from rain event detection studies such as Polz et al. (2020). Additionally, we performed five-fold cross validation (CV) to ensure that the chosen set of hyperparameters did not overfit to a particular train/validation/test split. This process helped us select a robust and generalizable model architecture. The full list of tuned hyperparameters is available in Table A1 in appendix A. Unless otherwise specified, default TensorFlow hyperparameters were used for the respective layers.

c. Feature attribution through integrated gradients

We showcase the potential of feature attribution for QC by applying a model-agnostic XAI technique, called the integrated gradient (IG) method, to the GCN model and present results for two selected CML examples. Interpreting data-driven models through the visualization of feature attribution provides insight into the impact of specific input features on the model’s output. The IG methodology, developed by Sundararajan et al. (2017), is often applied in image analysis but can be seamlessly adapted to time series (classification) problems (Assaf and Schumann 2019; Jiang et al. 2022; Choi et al. 2022). Several related approaches exist, such as layer-wise relevance propagation (Bach et al. 2015), DeepLIFT (Shrikumar et al. 2017), and SmoothGrad (Smilkov et al. 2017). However, no single method has yet been universally established in the literature (Minh et al. 2022). Therefore, we selected IG based on practical considerations, including the ease of implementation within our code structure and the authors’ prior experience with this technique.

1) Theoretical background

Similar to other gradient-based approaches, IG calculates the gradient of the input with respect to the model output. The method integrates gradients across various model inputs that represent intermediate steps, forming a linear interpolation between a user-defined feature baseline and the actual features (Sundararajan et al. 2017). The outcome is an attribution of the model’s output to the individual input features.

Mathematically, the integrated gradients are expressed as the path integral from the baseline to the model input, where the integrated gradient along the ith dimension is defined as
$$\mathrm{IG}_i(x) ::= (x_i - x_i')\int_{\alpha=0}^{1} \frac{\partial F[x' + \alpha(x - x')]}{\partial x_i}\, d\alpha, \qquad (2)$$

where $x$ is the model input, $x'$ is the baseline, $\partial F(x)/\partial x_i$ is the gradient of $F(x)$ along the $i$th dimension, and $\alpha$ is a scalar parameter ranging from 0 to 1, representing the interpolation factor between the baseline input and the actual input. The path integral can be approximated by a Riemann sum, which accumulates the gradients over sufficiently small steps. IG satisfies the completeness axiom, ensuring that the sum of the attributions for all features in a sample equals the difference between the model output for the original input and the baseline. This property can serve as a sanity check and enables qualitative and numeric comparisons of attributions (Sundararajan et al. 2017). Additionally, it facilitates quantitative comparisons of attributions across samples from different time steps.
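A simple TensorFlow sketch of this Riemann-sum approximation for a model with a single input tensor is given below; in the multi-input GCN setting the same procedure would be repeated for each input tensor. Function and argument names are illustrative.

```python
import tensorflow as tf

def integrated_gradients(model, x, baseline=None, steps=100):
    """Riemann-sum approximation of Eq. (2) for one input sample.

    model    : a tf.keras model returning a scalar score per sample
    x        : input tensor of shape (1, ...) for one sample
    baseline : same shape as x; defaults to the all-zero baseline
    """
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    if baseline is None:
        baseline = tf.zeros_like(x)

    # Interpolation path from the baseline to the actual input
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (-1,) + (1,) * (len(x.shape) - 1))
    interpolated = baseline + alphas * (x - baseline)

    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        preds = model(interpolated)
    grads = tape.gradient(preds, interpolated)

    # Average the gradients along the path and scale by the input difference
    avg_grads = tf.reduce_mean(grads, axis=0, keepdims=True)
    return (x - baseline) * avg_grads
```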

2) Implementation

We applied the IG method to the CML dataset to interpret the model’s decisions and understand the contribution of features from the flagged time series compared to those of neighboring sensors processed by the GCN.

The model output at the baseline should represent a neutral state, ensuring that the baseline prediction is near zero (Sundararajan et al. 2017). After testing random and mean baselines, we chose a zero baseline, i.e., all input feature values set to zero. The random baseline introduced significant noise into the final attribution pattern, while the mean baseline resulted in smoothed attribution patterns with insufficient contrast.

We used the TensorFlow library to compute gradients through back-propagation and automatic differentiation (Samek et al. 2021). Gradients were retrieved for 100 interpolation steps [i.e., 100 increments of α in Eq. (2)] and integrated using a Riemann sum (Sundararajan et al. 2017). After integration, the resulting attributions of the input features to the model output were visualized in two types of heatmaps (Fig. 3):

  1. Single-sample feature attribution: for each classification at a sensor at time step t, the model receives an input time series sample from the sensor itself and its neighbors. For all time steps of the dataset, we calculate feature attribution for all input features in the respective time series sample.

  2. Aggregated feature attribution: since the input time series samples are sequentially sampled and overlap, it can be challenging to visualize the evolution of attributions over time. Therefore, we aggregate each sample’s attributions by taking the arithmetic mean, resulting in one attribution value for each input feature series at each time step t. This approach allows us to visualize attributions over the entire dataset time series.
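Assuming the single-sample attributions are stored as an array indexed by classification time, the aggregation reduces to averaging over the sample window, as in this sketch (array layout and names are assumptions for illustration):

```python
import numpy as np

def aggregate_attributions(sample_attributions: np.ndarray) -> np.ndarray:
    """Collapse each single-sample attribution map to one value per feature series.

    sample_attributions : (n_samples, n_feature_series, window_length)
    Returns an array of shape (n_samples, n_feature_series) that can be plotted
    as a heatmap over the classification times of the dataset.
    """
    return sample_attributions.mean(axis=-1)
```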

Fig. 3.

Schematic view of the two types of attribution heatmaps used for analysis. (a) Single-sample feature attribution: Each single-sample heatmap corresponds to the attribution of input features for a single model classification at one sensor at a specific time step t. (b) Aggregated feature attribution: each single-sample attribution is averaged to produce a mean attribution value for each input feature series at each time step t. This results in a heatmap showing attributions over the entire dataset time series.


As a third option for visualization, video sequences showing the evolution of single-sample feature attributions are provided in the online supplemental material to the manuscript.

3. Model performance evaluation

After developing the anomaly detection models for the CML and SoilNet datasets, we evaluated their performance using key classification metrics covering different aspects. For this purpose, we used the receiver operating characteristic (ROC) curve and the Matthews’ correlation coefficient (MCC).

The ROC curve is a commonly used graphical representation of the performance of a binary classification model. It depicts the trade-off between the true positive rate (TPR) and the false positive rate (FPR) for different discrimination thresholds, which determine the classification boundary between positive and negative classifications. The TPR represents the ratio of correctly classified positive observations to the total actual positives:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad (3)$$
while FPR is the proportion of actual negative instances incorrectly identified as positive by the model:
$$\mathrm{FPR} = \frac{FP}{FP + TN}. \qquad (4)$$
Here, TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives. TPR and FPR depend on the classification threshold, which parameterizes the ROC curve, depicting the model’s ability to discriminate between positive and negative instances. In general, a steeper ROC curve indicates a better model performance. To quantify the performance shown in the ROC, the area under ROC curve (AUC) can be used. Its score ranges between 0 and 1, with 0.5 representing random classification performance and higher values indicating better model performance.
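For reference, the ROC curve and AUC can be obtained from model scores with scikit-learn as in the following generic sketch (placeholder data, not study results):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: binary anomaly labels, y_score: predicted anomaly probabilities (placeholders)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.95, 0.3, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```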
The MCC is another widely used metric for evaluating binary classifiers and can be calculated using the following equation:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \qquad (5)$$

The equation results in values between −1, indicating total disagreement between prediction and true label, and 1, meaning perfect prediction, while a score of 0 denotes random guessing. The MCC takes into account all elements of the confusion matrix composed of TP, TN, FP, and FN and indicates good performance only if there is high accuracy for both positive and negative classes. Therefore, it is extremely useful for evaluating classification performance when the dataset is highly imbalanced.

Our evaluation procedure was comprehensive and covered several steps to provide a robust assessment of the models’ performance. First, we conducted a five-fold CV to analyze the potential sensitivity of the model to different data splits. Following this standard procedure, each dataset was partitioned into five equal-sized subsets, and the models were iteratively trained five times, using four subsets (80%) for training and one subset (20%) for validation in each iteration, resulting in five different models being trained.

Following CV, we proceeded to train the final models using data split into training, validation, and test datasets in a 6:2:2 ratio, with equally sized temporal blocks for each split. To avoid data leakage and ensure the independence of our datasets, we carefully removed overlapping time series. For the CML dataset, where the data resolution is high (1 min) and the sample length is short (181 min in total), we excluded the last day between consecutive data subsets. For the SoilNet dataset, which is split based on months, we rejected the last 4 days of each month to eliminate any temporal autocorrelation between subsets. Throughout the training process, we monitored the loss function on the validation dataset and used early stopping with a patience of five epochs. Upon completion, we assessed the model performance on the independent test dataset using the model with the lowest recorded validation loss. To establish anomaly flags for predictions, we carefully selected thresholds for all models (CV and final) that maximized the MCC scores on the validation dataset for the final models. Eventually, we concluded our evaluation with aggregated statistics regarding the individual sensors.
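The threshold selection can be sketched as a grid search over candidate thresholds on the validation scores, keeping the one that maximizes the MCC; the function below is an illustrative stand-in for the study's procedure.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_mcc_threshold(y_true, y_score, candidates=None):
    """Return the classification threshold maximizing MCC on a validation set."""
    if candidates is None:
        candidates = np.linspace(0.01, 0.99, 99)
    mccs = [matthews_corrcoef(y_true, (y_score >= t).astype(int)) for t in candidates]
    best = int(np.argmax(mccs))
    return candidates[best], mccs[best]

# threshold, mcc = best_mcc_threshold(y_val, val_scores)
```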

4. Results

a. Cross validation

Figure 4 presents the ROC curves of each of the five CV folds, along with the mean value for both GCN and baseline LSTM models. For both the CML and SoilNet datasets, there is a clear superiority of the GCN over the baseline LSTM. The mean GCN ROC curves are consistently positioned to the left of the baseline models’ curves, indicating a higher TPR for the same FPR. However, the CML GCN and baseline LSTM models exhibit significantly better performance than the SoilNet models, as evidenced by their closer proximity to the upper-left corner of the plot and an average increase in AUC scores of approximately 0.08. Also, for CML, most of the GCN folds performed better than any of the baseline LSTM model runs, whereas for SoilNet, individual GCN and baseline runs overlap with each other. The numeric values of the particular splits, along with the mean AUC scores and results for the final model, are detailed in Table 1, demonstrating the superiority of GCN over the baseline LSTM runs.

Fig. 4.

ROC curves comparing the performance of GCN (red) and baseline LSTM models (green) in five-fold CV. The bold lines represent the mean performance, while the thin lines illustrate individual runs. (a) CML data; (b) results for the SoilNet.


Table 1

Summary of AUC and MCC scores from final models, five runs of CV and their means, for both CML and SoilNet datasets. MCC scores are calculated based on the threshold obtained from the final model.


The same table also presents the MCC scores, which were consistently higher for the GCN models in both datasets. To calculate the MCC scores, it was necessary to determine the classification thresholds. Following the procedure described in section 3, for the CML dataset, we established thresholds of 0.956 and 0.044 for the GCN and baseline LSTM, respectively. Similarly, for SoilNet, the corresponding thresholds were 0.814 and 0.640. For CML, the maximum MCC achieved by GCN was 0.588, compared to 0.310 for the baseline LSTM. For SoilNet, the maximum MCC reached 0.551 for GCN, while for the baseline LSTM, it was higher than that achieved on the CML data, reaching 0.513.

b. Final model evaluation: visual and statistical analysis

We proceeded with the evaluation of the models’ performance through a visual analysis of the original time series, together with the classification results for the final baseline LSTM and GCN models. All presented results are based on the test data split. Figure 5 illustrates examples of classified time series using the GCN and baseline LSTM models, with Figs. 5a and 5b for CML and Figs. 5c and 5d for SoilNet. Figures presenting the classified time series along with their corresponding neighboring time series are provided in appendix B. For the CML examples, both models effectively captured the majority of anomalies; however, the GCN model exhibited greater accuracy.

Fig. 5.

(a)–(d) Classification results of CML and SoilNet time series data from selected sensors. The upper panel in each subplot displays the original time series, with blue and green lines representing two different TL channels for CML, and battery voltage (blue line) and soil moisture (green line) for SoilNet. The panels below showcase the classification outcomes for the GCN and the baseline LSTM, respectively, with distinct colors representing the four classes derived from the confusion matrix, as well as no data (see section 3) and samples with automatic flags. The red vertical line in (b) points to the event described in the XAI analysis (section 4d), while the dashed lines in (c) mark the periods analyzed later in the results.


Figure 5a depicts three anomaly events with signal fluctuations of up to 10 dB, each lasting for approximately 9 h. Both models struggled with the correct classification of the first event and performed similarly on the second and third. However, the baseline LSTM showed a high number of false positives, such as on 17 and 19 July, while the GCN demonstrated high accuracy during these periods. Figure 5b shows a similar picture, with a time series covering two anomalous CML events and one rain event (intended measurement) between 1800 UTC 28 July and 0600 UTC 29 July. The first anomaly lasted 11 h and displayed a jagged shape, while the second, shorter event lasted 5 h and featured a sharp trough. Although both events were detected, the onset of the first event was delayed. The GCN performed better in detecting the anomalies (more TP) and, at the same time, recognized the rain event as a nonanomalous pattern, while the baseline LSTM faced challenges primarily related to false positives.

In both examples for the SoilNet dataset presented in Figs. 5c and 5d, prolonged periods are labeled as no data due to missing periods longer than our maximum interpolation length. Consequently, data samples with a length of 84 h, which form the input, contained missing values and had to be excluded from the analysis. The first SoilNet example (Fig. 5c) displays an anomalous soil moisture time series, including several automatic anomaly flags. The first analyzed period, between 14 and 20 October, exhibits a constant battery voltage (blue line) of around 3.2 V and a relatively steady moisture level of around 39% (teal line), with a few minor drops. During the second period, from 24 to 28 October, the battery level decreases slightly and fluctuates, while more noticeable fluctuations are visible in moisture, which decreases slightly toward the end of the series. Both the GCN and baseline LSTM struggled to accurately predict the start and end of the first event, identifying only the central part as anomalous. Here, the baseline LSTM outperformed the GCN by detecting the longer anomaly period. In the second period, where the moisture anomaly was clearer, both models correctly classified the onset of the anomaly. Nevertheless, the baseline LSTM failed to detect the second half of the event and erroneously identified an anomaly on 9 October, resulting in an FP detection.

The second analyzed time series, presented in Fig. 5d, contains four anomalous events. The battery voltage values show a strong diurnal cycle throughout the entire time series. The GCN showed superior performance for the first and second periods, characterized by notable bumps and a gradual decrease in moisture afterward. However, for the third period, where the moisture anomaly was less obvious and manifested only as a steady decline, and the last event registered as a sharp peak, the baseline LSTM performed better. In summary, while the GCN model applied to the SoilNet time series demonstrated only a slight advantage over the baseline LSTM, the quantitative statistics presented below paint a different picture, particularly when more nonanomalous periods are included in the analyzed data.

A quantitative analysis using the final models on the entire test dataset confirms the advantages of incorporating neighboring information (Table 1). Specifically, for the CML dataset, the GCN model achieved an AUC of 0.974 and an MCC of 0.683, outperforming the baseline LSTM (AUC: 0.880; MCC: 0.306). For SoilNet, although scores were lower, the GCN model still outperformed the baseline LSTM, with an AUC of 0.859 and an MCC of 0.462. However, it is important to note that MCC is a threshold-dependent score, which influences its interpretation. To gain insight into how different thresholds impact performance, we calculated FPR and TPR for the chosen thresholds. For the CML dataset, the GCN reached an FPR of 0.230 and a TPR of 0.966, compared to 0.166 and 0.825 for the baseline LSTM. For SoilNet, the GCN demonstrated a lower FPR (0.095) than the baseline LSTM (0.123) and a higher TPR (0.597) compared to the baseline LSTM (0.541). These results underscore the benefits of leveraging neighboring information in improving the overall predictive performance of the GCN model.

c. Classification performance for individual sensors

We aggregated the results and computed metrics separately for each sensor to evaluate the consistency of prediction skills. The top panels in Fig. 6 present the scatterplots of AUC values of baseline LSTM and GCN models for the CML and SoilNet datasets.

Fig. 6.

(top) Scatterplots and (bottom) boxplots illustrating the model performance metrics for baseline LSTM and GCN models on (left) CML and (right) SoilNet datasets. Each data point in the scatterplots reflects the AUC score for a specific sensor, while the boxplots depict the distribution of AUC scores and MCC values on the left and right sides of each panel, respectively. The middle line of each boxplot represents the median, while the box itself spans from the first quartile to the third quartile. The whiskers extend from the box to the most distant data points within 1.5 times the interquartile range from the box. Data points beyond this range are considered outliers and are plotted individually. The GCN outcomes are presented in red, while the baseline LSTM results are shown in teal.


In the CML dataset (Fig. 6a), all points lie on the right side of the diagonal, indicating the GCN’s superiority. Most sensors show a GCN AUC score approximately 0.1 higher than that of the baseline LSTM, with one exception where the GCN achieved nearly 1 compared to 0.40 for the baseline LSTM. This trend is confirmed by the boxplots in Fig. 6c, where the median AUC value for the GCN is almost 1, while the baseline LSTM drops slightly below 0.9. The GCN demonstrates a negligible interquartile range, with two outliers at 0.80 and 0.90, while the baseline LSTM has one outlier and a wider box, spreading between slightly over 0.75 and 0.90. For MCC, the GCN model performed better, with a median of 0.75 compared to 0.30 for the baseline LSTM, although MCC scores varied substantially across sensors.

Results for the SoilNet dataset (Figs. 6b,d) exhibit more heterogeneity. The scatterplot (Fig. 6b) shows points concentrated in the upper-right corner, indicating comparable model performance. While the GCN model outperformed the baseline LSTM for most instances, there were also cases where the GCN failed and performed below random prediction, as indicated by AUC scores below 0.5. This varied performance is reflected in the boxplots (Fig. 6d), where the median AUC and MCC scores for both models are identical, measuring 0.89 and 0, respectively. Nevertheless, the AUC interquartile range for the baseline LSTM is slightly larger, with six outliers, while the GCN had four outliers with a minimum AUC of 0.16. The boxes for MCC extend from 0 up to 0.60 and 0.76 for the baseline LSTM and GCN models, respectively, with long whiskers reaching up to 1, indicating significant variability in individual sensor performance. Given these results, it appears that both the GCN and baseline LSTM models have limitations for the SoilNet problem and may not be fully suitable for QC in this context. Further research, including the use of more and different data, would help to clarify this issue and potentially improve model performance.

d. Single-sample feature attribution

The attribution values of all model input features, that is, the normalized time series values of the flagged sensor and its neighbors, are visualized as a heatmap with colors ranging from blue (negative) through white (zero) to red (positive). Negative attribution values drive the model output toward zero (no anomaly). Conversely, positive attributions guide the model output toward one (anomaly). Figure 7 depicts the heatmap for the CML time series shown in Fig. 5b at 0000:40 UTC 29 July 2019, where four representative neighbors out of 21 were selected for demonstration. The complete figure is provided in appendix C (Fig. C2). The features of the flagged time series were not subject to the GC process and obtained markedly higher attributions than the neighbors and the self-reference cycle, so we scaled the attributions of the neighbors and the self-reference cycle by a factor of 25 for visualization purposes.

Fig. 7.

Single-sample feature attribution heatmap for the CML sensor (Fig. 5b) at 0000:40 UTC 29 July 2019. Input feature time series TL1 and TL2 from the FS, the self-reference cycle, and its neighbors are plotted over the sampling time interval. The background color indicates the IG attribution of each feature. The attributions of the self-reference cycle and the neighboring sensor inputs are scaled by a factor of 25 for visualization purposes.


The upper panel in Fig. 7 illustrates the flagged time series, indicating the onset of a rain event around time step 115. The integrated gradient analysis showed a positive attribution for the flagged sensor, leading to an increase in the model’s prediction toward one and triggering an anomaly classification. However, similar rain events with comparable shapes occurred at neighboring sensors N4 and N11 and approximately 40 time steps earlier at sensor N15. These neighboring sensors received negative attribution, resulting in a decrease in the model output and suggesting no anomaly. This influence ultimately led the model decision away from an FP classification and instead toward a TN classification. Consequently, the model has learned that when certain patterns observed in the flagged time series also appear at neighboring sensors, the likelihood of an anomaly is reduced, indicating that a rain event might be responsible for the increase in sensor readings. The neighboring information was missing in the baseline LSTM; thus, the discussed event was misclassified in the prediction of the baseline LSTM (Fig. 5b).

e. Aggregated feature attribution

In Fig. 8, we present the aggregated feature attributions for each time step (corresponding to each single-sample feature attribution), visualized over time from the first sensor shown in Fig. 5a. The aggregation is performed by averaging the attributions of each sample across the sample window, as indicated in Fig. 3 and explained in section 2.

Fig. 8.

Aggregated feature attribution for the CML sensor depicted in Fig. 5a. Each single-sample feature attribution was averaged across the sample time interval (181 min) to obtain one value for each input feature series at a time step t. This results in aggregated feature attribution heatmaps for the entire time series, which are visualized along with the model input time series. The top panel depicts the model prediction and the resulting classification, with the numbers above indicating the analyzed event number. The second and third panels show the time series of the FS and the corresponding aggregated attribution, separately for channels TL1 and TL2. The time series in the other panels show the model input for the self-reference cycle and neighbors. The background color indicates the aggregated attribution for that respective time step, averaged across channels TL1 and TL2. The attribution of the neighbors and self-reference cycle was scaled with a factor of 25 for visualization purposes.


Similar to the significance of neighboring sensors discussed in section 4d, a comparable importance of neighbors was observed in two instances of TN events with high model predictions in this time series. For the first TN event (event 1), the self-reference cycle N0 and neighbor N3 showed a similar pattern in the TL record as the flagged sensor but received negative attribution, thus leading away from an FP classification. The second TN event appeared later (event 4), where sensors N0, N4, N5, N6, and N7 showed comparable TL records to the flagged sensor, hence showing negative attribution and avoiding misclassification here as well. In an optimal model, a true negative event should have a low prediction score and low attributions. However, the model has learned that flagged sensor events can be anomalies, leading it to assign positive attributions to these values. Normally, this would result in an anomaly classification if no additional context (e.g., no neighboring time series, see the baseline LSTM model) was available. However, when similar events were detected in neighboring time series, the model assigned negative attributions to them, reducing the overall attribution in the single-sample heatmap. This, in turn, prevented misclassification and correctly identified the event as nonanomalous. The baseline LSTM failed to correctly classify these events (Fig. 5a), demonstrating a clear advantage of the GCN model over the baseline LSTM. In general, the positive attribution mainly accumulated along the flagged time series, while the self-reference cycle N0 and the other neighbors were predominantly attributed with negative values. Exceptionally strong negative attribution could be observed during the first anomaly event (event 2). The self-reference cycle and presumably similar patterns at the neighboring sensors N1, N3, and N4 caused a model prediction drop in the middle of the anomaly event leading to FN classifications. For the third and fifth events, the self-reference cycle also showed negative attribution, but in the absence of a clear signal from the neighboring sensors, this resulted in only a small negative overall attribution. Thus, the positive attribution of the flagged time series led the majority of the event toward a TP classification. Here, the IG method shows that, in some cases, when similar sensor readings appeared at neighboring sensors, the model assigned an overly strong negative attribution to fluctuations in neighboring sensor signals, reducing the overall attribution and resulting in an FN classification. The same plot for the sensor shown in Fig. 5b is provided in appendix C (Fig. C1).

5. Discussion

Our research explores the application of GCNs for anomaly detection in two diverse environmental datasets: CML and SoilNet. Despite differences in spatial and temporal resolution, both datasets feature irregularly and sparsely distributed environmental sensors across their respective regions. Employing GCNs allowed us to leverage information from neighboring sensors through message passing, enhancing anomaly detection compared to a baseline LSTM that does not employ GC. This study is the first to demonstrate the merits of GCNs in detecting anomalies in real-world environmental sensor data, whereas previous research has primarily focused on synthetic or benchmark datasets commonly used in AI applications. Furthermore, existing QC frameworks for environmental sensor data typically classify individual sensors using established methods such as autoregressive integrated moving average, support vector machine, and 1D convolutional neural network (CNN) models. Using a 2D CNN would require either a very dense grid, which is computationally expensive, or averaging values into grid cells, potentially leading to a loss of critical information. These traditional methods generally neglect the potential benefits of incorporating neighbor information, highlighting the value of exploring alternative approaches like GNNs that explicitly model these relationships.
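To make this contrast concrete, the sketch below (assuming PyTorch and PyTorch Geometric) combines a per-sensor LSTM encoder with graph convolution and global mean pooling over the sensor graph. Layer sizes, ordering, and the pooling choice are illustrative assumptions and do not reproduce the exact architecture trained in this study.

```python
# Hypothetical sketch: LSTM time-series encoder + graph convolution + pooling.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool


class SketchGCNAnomalyClassifier(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        # Encode each sensor's input time series into a fixed-size embedding.
        self.encoder = nn.LSTM(input_size=n_features, hidden_size=hidden,
                               batch_first=True)
        # Exchange information between neighboring sensors (message passing).
        self.gc1 = GCNConv(hidden, hidden)
        self.gc2 = GCNConv(hidden, hidden)
        # One anomaly score per graph sample after pooling over all nodes.
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, time_steps, n_features]; edge_index: [2, num_edges];
        # batch: [num_nodes] graph-membership index for pooling.
        _, (h, _) = self.encoder(x)                  # h: [1, num_nodes, hidden]
        z = h.squeeze(0)
        z = torch.relu(self.gc1(z, edge_index))
        z = torch.relu(self.gc2(z, edge_index))
        g = global_mean_pool(z, batch)               # one vector per graph sample
        return torch.sigmoid(self.head(g)).squeeze(-1)
```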

Our robust evaluation, employing five-fold CV, consistently demonstrated superior scores for the GCNs over the baseline LSTM, illustrating the benefits of incorporating neighbor information in anomaly detection. The added benefit was more pronounced for the CML dataset than for SoilNet data.
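For reference, the two metrics used in this comparison can be computed per CV fold as sketched below; the label and score arrays and the 0.5 decision threshold are placeholders, not study data.

```python
# Hypothetical per-fold metric computation with scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                            # stand-in expert labels
y_score = np.clip(0.6 * y_true + 0.5 * rng.random(1000), 0, 1)    # stand-in model scores

auc = roc_auc_score(y_true, y_score)                       # threshold-independent
mcc = matthews_corrcoef(y_true, (y_score > 0.5).astype(int))  # at a fixed 0.5 threshold
print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")
```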

The higher performance scores on the CML dataset may be attributed to the robust and precise labeling strategy and the generally better data quality. The CML data underwent meticulous examination by four independent experts, whose flags were merged into reliable anomaly labels by majority vote. In contrast, the SoilNet flagging was performed by different experts, which led to potential inconsistencies. The flags we used included both clearly erroneous data and suspicious periods in which automated tests had already flagged many data points, e.g., because of low battery voltage. Consequently, there were instances where flagging was not executed with high temporal precision, leading to valid observations being erroneously labeled as anomalies and introducing incorrect information into the model. Soil moisture observations exhibit strong small-scale variability and sensitivity to numerous factors (Mittelbach et al. 2012; Susha Lekshmi et al. 2014). Both lead to diverse signal fluctuations and disturbances, resulting in significant intraclass variability for both the anomaly and no-anomaly classes, which impacted the detection accuracy. Moreover, automatic flags, which were easily detectable and subsequently excluded from our study, reduced the number of available samples and graph nodes, further influencing model performance.
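The majority vote over several experts' binary flags, as used for the CML labels, can be illustrated with a short hypothetical example (the flag values are invented):

```python
# Hypothetical illustration of merging expert flags by majority vote.
import numpy as np

def majority_vote(flags: np.ndarray) -> np.ndarray:
    """flags: array of shape (n_experts, n_timesteps) with 0/1 anomaly flags."""
    # A time step is labeled anomalous if more than half of the experts flag it.
    return (flags.sum(axis=0) > flags.shape[0] / 2).astype(int)

labels = majority_vote(np.array([[0, 1, 1, 0],
                                 [0, 1, 0, 0],
                                 [1, 1, 1, 0],
                                 [0, 1, 0, 0]]))
print(labels)   # -> [0 1 0 0]
```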

This work emphasizes the power of explainable AI (XAI) in interpreting model predictions and showcases the importance of neighbors for the GCN CML model. This is achieved by utilizing interpretable attributions derived from integrated gradients of the input features. Through single-sample and aggregated attribution heatmaps, we illustrated the extent to which information from neighboring sensors influenced the final classification outcome. During rain events, where the baseline LSTM erroneously flagged anomalies, the GCN model accurately identified the rain event by recognizing similar sensor reading patterns across neighboring sensors. This highlights the essential role of neighboring sensors in informing the model and aiding in the distinction between rain events and anomalies. However, if an anomaly event identified by experts coincides with signal fluctuations at other sensors, it may lead to a decrease in model accuracy.
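For a sense of how such attributions are obtained in practice, the sketch below computes integrated gradients for a toy model with the Captum library; the toy architecture, the zero baseline, and the step count are illustrative assumptions, not the configuration used in this study.

```python
# Integrated-gradient attributions with Captum on a toy classifier.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(181 * 2, 1), nn.Sigmoid())
x = torch.randn(1, 181, 2, requires_grad=True)   # one sample: 181 steps, 2 channels

ig = IntegratedGradients(toy_model)
attributions = ig.attribute(x, baselines=torch.zeros_like(x),
                            target=0, n_steps=50)
# Attributions have the same shape as the input; positive values push the
# score toward "anomaly", negative values push it away (cf. Fig. 7).
print(attributions.shape)
```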

While our study offers valuable insights, it also has limitations. We worked with a limited three-month CML dataset, analyzing data from only 20 out of 3904 sensors that underwent manual quality checks, which resulted in unlabeled neighbors within the graph. To address this, we applied global pooling after GC, which smooths artifacts present only in particular neighbor signals. Additionally, our approach did not consider sensor correlations when establishing graph links, relying solely on experimentally chosen distances, which may affect model performance. Last, during data preparation we employed simple linear interpolation to fill short data gaps. Longer gaps remained, however, reducing the number of available samples, which was particularly noticeable in the SoilNet dataset due to the frequent occurrence of such gaps. While these limitations do not weaken our claim that the GCNs performed better in this study, it is important to acknowledge that even higher model performance could potentially be achieved. Overcoming these limitations poses challenges, however, as manual labeling efforts are extensive, and addressing data gaps would require sophisticated infilling methods that are yet to be developed.
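The gap handling mentioned above can be sketched with pandas; the 10-min spacing and the three-step gap limit are assumptions chosen for illustration, not the actual preprocessing parameters.

```python
# Fill only short gaps by linear interpolation and keep long gaps missing.
import numpy as np
import pandas as pd

ts = pd.Series([0.31, np.nan, np.nan, 0.34, np.nan, np.nan, np.nan, np.nan, 0.30],
               index=pd.date_range("2019-07-01", periods=9, freq="10min"))

max_gap = 3
is_na = ts.isna()
gap_id = (is_na != is_na.shift()).cumsum()           # label consecutive runs
gap_len = is_na.groupby(gap_id).transform("sum")     # NaN-run length per point

filled = ts.interpolate(method="linear", limit_area="inside")
filled[is_na & (gap_len > max_gap)] = np.nan         # restore gaps longer than max_gap
print(filled)
```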

6. Conclusions

This study demonstrates the potential of graph neural networks (GNNs) for improving automated quality control (QC) of environmental sensor networks. Answering the first research question, GNNs can indeed enhance automated QC by integrating spatial information from sensor networks that are distributed irregularly in space and provide varying numbers of observations at each time step. The superior performance of the GCN models across both datasets highlights the significance of incorporating spatial context into anomaly detection tasks. Regarding the second research question, XAI techniques revealed that neighboring sensors influence the decisions made by the GCN model, shedding light on how these sensors contribute to the model’s predictions. The visualization of feature attributions confirms the importance of information from neighboring sensors and can support experts in understanding the AI model’s behavior when discriminating anomalies from valuable observations, such as rain events in the case of CML data. We found that the GCN consistently achieved higher evaluation metrics than the baseline LSTM model. For CML, the AUC for the GCN was notably higher at 0.974 compared to 0.880 for the baseline LSTM, accompanied by MCC scores of 0.683 (GCN) and 0.306 (baseline LSTM). SoilNet showed lower performance overall, with the GCN achieving an AUC of 0.859 and the baseline LSTM 0.782; correspondingly, the GCN attained an MCC score of 0.462 compared to 0.345 for the baseline LSTM.

Visual inspection of flagged time series demonstrated the clear superiority of the GCN, which proved proficient at classifying anomalies and resilient against events often misidentified by the baseline LSTM. However, this advantage of the GCN was less evident for SoilNet than for the CML data. We found consistent performance across CML sensors, while there was notable variation across SoilNet sensors. The CV results showed that robust performance can be achieved even with a comparatively small amount of flagged data, making automated QC of much larger datasets feasible. Comparing the results of the two datasets leads to the hypothesis that more carefully flagged data, with multiple experts required to agree on a label, can enhance the performance of the proposed algorithm; this may be tested in future research. The same holds for how performance scales with an increased amount of training data, which should be considered for the operationalization of such an approach.

Future research should involve more extensive data collection for both the CML and SoilNet datasets, with thorough expert verification of all neighboring sensors to ensure robust anomaly detection. Refining the graph structure to better capture temporal correlations and incorporating attention mechanisms, as suggested by recent advances in GNNs (Zhao et al. 2020; Deng and Hooi 2021; Veličković et al. 2017), could enhance detection accuracy. Additionally, employing advanced missing-data imputation techniques for longer data gaps would increase the number of available samples and thereby improve overall model reliability. Our ongoing research therefore includes developing a model that combines GCN and LSTM layers for missing-data imputation, as incorrectly predicted time series can propagate errors and negatively influence anomaly detection.

Acknowledgments.

Elżbieta Lasota was funded by the Helmholtz AI project RESEAD from the Initiative and Networking Fund of the Helmholtz Association (Grant ZT-I-PF-5-148), Julius Polz by the German Research Foundation (Grant CH 1785/1-2, RealPEP), Luca Glawion by the Helmholtz Innopool project SCENIC, and Lennart Schmidt by the Federal Ministry of Education and Research of Germany (BMBF, Grant 02WDG1641B, i-SEWER project). We thank the Helmholtz Association for establishing the Helmholtz.AI initiative, which initially brought together the authors of this work. We also thank the Helmholtz Association and BMBF for supporting the DataHub Initiative of the Research Field Earth and Environment. The DataHub enables overarching and comprehensive research data management, following FAIR principles, for all topics in the Program Changing Earth—Sustaining our Future. XAI analyses were conducted at the HPC Cluster EVE, jointly operated by the Helmholtz Centre for Environmental Research—UFZ and the German Centre for Integrative Biodiversity Research Halle-Jena-Leipzig. We appreciate Mohit Anand’s feedback on the IG methodology. The Hohes Holz observatories are supported by TERENO, funded by the Helmholtz Association and BMBF. We acknowledge all experts involved in manual QC, especially Corinna Rebmann, Juliane Mai, and Matthias Cuntz, for providing labeled SoilNet data. We thank Ericsson Germany for supporting the acquisition of CML data.

Data availability statement.

The CML data supporting this research were provided to the authors by Ericsson. These data were not publicly available, as Ericsson has restricted its distribution due to their commercial interest. To obtain CML data for research purposes, a separate and individual agreement with the network provider must be established. The SoilNet data used in this study are available upon request at https://www.ufz.de/record/dmp/logger/806/en/.

APPENDIX A

Hyperparameters Used in Model Development

Table A1 lists the final set of hyperparameters employed during the training of the models for anomaly detection in the CML and SoilNet datasets.

Table A1. Hyperparameters used in model training for the CML and SoilNet datasets.

APPENDIX B

Time Series Comparison of Classified and Neighboring Sensors

This appendix provides a visual comparison of the classified CML and SoilNet sensors’ data, as described in detail in section 4b, with the data from the neighboring sensors (Figs. B1–B4).

Fig. B1.

Flagged CML time series with neighboring sensors and anomaly detection results. The uppermost panel presents the flagged CML time series, previously shown in Fig. 5a, with additional panels below illustrating the time series of neighboring sensors. The colors indicate the confusion matrix results of the anomaly detection.


Fig. B2.

As in Fig. B1, but for the CML sensor depicted in Fig. 5b.


Fig. B3.

Flagged SoilNet time series with neighboring sensors and anomaly detection results. The uppermost panel presents the flagged SoilNet time series, previously shown in Fig. 5c, with additional panels below illustrating the time series of neighboring sensors. The colors indicate the confusion matrix results of anomaly detection, as well as instances of no data and samples with automatic flags.


Fig. B4.

As in Fig. B3, but for the SoilNet sensor depicted in Fig. 5d.


APPENDIX C

Additional Figures of Integrated Gradient Attributions

This appendix provides a figure of aggregated feature attributions for the other sensor (Fig. C1), as well as the full version of the single-sample attribution heatmap from Fig. 7 (Fig. C2).

Fig. C1.

As in Fig. 8, but for the CML sensor depicted in Fig. 5b.


Fig. C2.

Complete version of Fig. 7, including all neighboring sensors and their IG attributions.


REFERENCES

  • Agarwal, C., M. Zitnik, and H. Lakkaraju, 2022: Probing GNN explainers: A rigorous theoretical and empirical analysis of GNN explanation methods. Proc. 25th Int. Conf. on Artificial Intelligence and Statistics, Valencia, Spain, PMLR, 8969–8996, https://doi.org/10.48550/arXiv.2106.09078.

  • Assaf, R., and A. Schumann, 2019: Explainable deep neural networks for multivariate time series predictions. Proc. 28th Int. Joint Conf. on Artificial Intelligence (IJCAI-19), Macao, China, IJCAI, 6488–6490, https://doi.org/10.24963/ijcai.2019/932.

  • Atlas, D., and C. W. Ulbrich, 1977: Path- and area-integrated rainfall measurement by microwave attenuation in the 1–3 cm band. J. Appl. Meteor., 16, 1322–1331, https://doi.org/10.1175/1520-0450(1977)016<1322:PAAIRM>2.0.CO;2.

  • Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, 2015: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10, e0130140, https://doi.org/10.1371/journal.pone.0130140.

  • Baldassarre, F., and H. Azizpour, 2019: Explainability techniques for graph convolutional networks. arXiv, 1905.13686v1, https://doi.org/10.48550/arXiv.1905.13686.

  • Bogena, H. R., 2016: TERENO: German network of terrestrial environmental observatories. J. Large-Scale Res. Facil., 2, A52, https://doi.org/10.17815/jlsrf-2-98.

  • Bogena, H. R., M. Herbst, J. A. Huisman, U. Rosenbaum, A. Weuthen, and H. Vereecken, 2010: Potential of wireless sensor networks for measuring soil water content variability. Vadose Zone J., 9, 1002–1013, https://doi.org/10.2136/vzj2009.0173.

  • Chalapathy, R., and S. Chawla, 2019: Deep learning for anomaly detection: A survey. arXiv, 1901.03407v2, https://doi.org/10.48550/arXiv.1901.03407.

  • Chen, M., Z. Wei, Z. Huang, B. Ding, and Y. Li, 2020: Simple and deep graph convolutional networks. Proc. 37th Int. Conf. on Machine Learning, Online, PMLR, 1725–1735, https://proceedings.mlr.press/v119/chen20v.html.

  • Choi, H., C. Jung, T. Kang, H. J. Kim, and I.-Y. Kwak, 2022: Explainable time-series prediction using a residual network and gradient-based methods. IEEE Access, 10, 108 469–108 482, https://doi.org/10.1109/ACCESS.2022.3213926.

  • Chwala, C., and H. Kunstmann, 2019: Commercial microwave link networks for rainfall observation: Assessment of the current status and future challenges. Wiley Interdiscip. Rev.: Water, 6, e1337, https://doi.org/10.1002/wat2.1337.

  • Chwala, C., F. Keis, and H. Kunstmann, 2016: Real-time data acquisition of commercial microwave link networks for hydrometeorological applications. Atmos. Meas. Tech., 9, 991–999, https://doi.org/10.5194/amt-9-991-2016.

  • Cioffi, R., M. Travaglioni, G. Piscitelli, A. Petrillo, and F. De Felice, 2020: Artificial intelligence and machine learning applications in smart production: Progress, trends, and directions. Sustainability, 12, 492, https://doi.org/10.3390/su12020492.

  • Coley, C. W., W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen, 2019: A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci., 10, 370–377, https://doi.org/10.1039/C8SC04228D.

  • Deng, A., and B. Hooi, 2021: Graph neural network-based anomaly detection in multivariate time series. Proc. Conf. AAAI Artif. Intell., 35, 4027–4035, https://doi.org/10.1609/aaai.v35i5.16523.

  • Egmont-Petersen, M., D. de Ridder, and H. Handels, 2002: Image processing with neural networks—A review. Pattern Recognit., 35, 2279–2301, https://doi.org/10.1016/S0031-3203(01)00178-9.

  • Erhan, L., M. Ndubuaku, M. Di Mauro, W. Song, M. Chen, G. Fortino, O. Bagdasar, and A. Liotta, 2021: Smart anomaly detection in sensor systems: A multi-perspective review. Inf. Fusion, 67, 64–79, https://doi.org/10.1016/j.inffus.2020.10.001.

  • Fan, W., Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin, 2019: Graph neural networks for social recommendation. WWW’19: The World Wide Web Conf., San Francisco, CA, Association for Computing Machinery, 417–426, https://doi.org/10.1145/3308558.3313488.

  • Gandin, L. S., 1988: Complex quality control of meteorological observations. Mon. Wea. Rev., 116, 1137–1156, https://doi.org/10.1175/1520-0493(1988)116<1137:CQCOMO>2.0.CO;2.

  • Graf, M., A. El Hachem, M. Eisele, J. Seidel, C. Chwala, H. Kunstmann, and A. Bárdossy, 2021: Rainfall estimates from opportunistic sensors in Germany across spatio-temporal scales. J. Hydrol., 37, 100883, https://doi.org/10.1016/j.ejrh.2021.100883.

  • Guan, S., B. Zhao, Z. Dong, M. Gao, and Z. He, 2022: GTAD: Graph and temporal neural network for multivariate time series anomaly detection. Entropy, 24, 759, https://doi.org/10.3390/e24060759.

  • Horsburgh, J. S., S. L. Reeder, A. S. Jones, and J. Meline, 2015: Open source software for visualization and quality control of continuous hydrologic and water quality sensor data. Environ. Modell. Software, 70, 32–44, https://doi.org/10.1016/j.envsoft.2015.04.002.

  • Jiang, S., E. Bevacqua, and J. Zscheischler, 2022: River flooding mechanisms and their changes in Europe revealed by explainable machine learning. Hydrol. Earth Syst. Sci., 26, 6339–6359, https://doi.org/10.5194/hess-26-6339-2022.

  • Jones, A. S., J. S. Horsburgh, and D. P. Eiriksson, 2018: Assessing subjectivity in environmental sensor data post processing via a controlled experiment. Ecol. Inform., 46, 86–96, https://doi.org/10.1016/j.ecoinf.2018.05.001.

  • Jones, A. S., T. L. Jones, and J. S. Horsburgh, 2022: Toward automating post processing of aquatic sensor data. Environ. Modell. Software, 151, 105364, https://doi.org/10.1016/j.envsoft.2022.105364.

  • Kipf, T. N., and M. Welling, 2016: Semi-supervised classification with graph convolutional networks. arXiv, 1609.02907v4, https://doi.org/10.48550/arXiv.1609.02907.

  • Kosasih, E. E., and A. Brintrup, 2022: A machine learning approach for predicting hidden links in supply chain with graph neural networks. Int. J. Prod. Res., 60, 5380–5393, https://doi.org/10.1080/00207543.2021.1956697.

  • Lin, X., H. Wang, J. Guo, and G. Mei, 2022: A deep learning approach using graph neural networks for anomaly detection in air quality data considering spatiotemporal correlations. IEEE Access, 10, 94 074–94 088, https://doi.org/10.1109/ACCESS.2022.3204284.

  • Lorenz, C., and H. Kunstmann, 2012: The hydrological cycle in three state-of-the-art reanalyses: Intercomparison and performance analysis. J. Hydrometeor., 13, 1397–1420, https://doi.org/10.1175/JHM-D-11-088.1.

  • Messer, H., A. Zinevich, and P. Alpert, 2006: Environmental monitoring by wireless communication networks. Science, 312, 713, https://doi.org/10.1126/science.1120034.

  • Minh, D., H. X. Wang, Y. F. Li, and T. N. Nguyen, 2022: Explainable artificial intelligence: A comprehensive review. Artif. Intell. Rev., 55, 3503–3568, https://doi.org/10.1007/s10462-021-10088-y.

  • Mittelbach, H., I. Lehner, and S. I. Seneviratne, 2012: Comparison of four soil moisture sensor types under field conditions in Switzerland. J. Hydrol., 430–431, 39–49, https://doi.org/10.1016/j.jhydrol.2012.01.041.

  • Muharemi, F., D. Logofătu, and F. Leon, 2019: Machine learning approaches for anomaly detection of water quality on a real-world data set. J. Inf. Telecommun., 3, 294–307, https://doi.org/10.1080/24751839.2019.1565653.

  • Nayak, J., K. Vakula, P. Dinesh, B. Naik, and D. Pelusi, 2020: Intelligent food processing: Journey from artificial neural network to deep learning. Comput. Sci. Rev., 38, 100297, https://doi.org/10.1016/j.cosrev.2020.100297.

  • Polz, J., C. Chwala, M. Graf, and H. Kunstmann, 2020: Rain event detection in commercial microwave link attenuation data using convolutional neural networks. Atmos. Meas. Tech., 13, 3835–3853, https://doi.org/10.5194/amt-13-3835-2020.

  • Polz, J., L. Glawion, M. Graf, N. Blettner, E. Lasota, L. Schmidt, H. Kunstmann, and C. Chwala, 2023: Expert flagging of commercial microwave link signal anomalies: Effect on rainfall estimation and ambiguity of flagging. 2023 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Rhodes Island, Greece, Institute of Electrical and Electronics Engineers, 1–5, https://doi.org/10.1109/ICASSPW59220.2023.10193654.

  • Pope, P. E., S. Kolouri, M. Rostami, C. E. Martin, and H. Hoffmann, 2019: Explainability methods for graph convolutional neural networks. 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, Institute of Electrical and Electronics Engineers, 10 764–10 773, https://doi.org/10.1109/CVPR.2019.01103.

  • Rathee, M., T. Funke, A. Anand, and M. Khosla, 2022: BAGEL: A benchmark for assessing graph neural network explanations. arXiv, 2206.13983v1, https://doi.org/10.48550/arXiv.2206.13983.

  • Rebmann, C., S. Claudia, M.-J. Sara, G. Sebastian, Z. Matthias, S. Luis, and C. Matthias, 2017: Integrative measurements focusing on carbon, energy and water fluxes at the forest site ‘Hohes Holz’ and the grassland ‘Grossen Bruch’. Geophysical Research Abstracts, Vol. 19, Abstract 9727, https://meetingorganizer.copernicus.org/EGU2017/EGU2017-9727.pdf.

  • Rebmann, C., and Coauthors, 2018: ICOS eddy covariance flux-station site setup: A review. Int. Agrophys., 32, 471–494, https://doi.org/10.1515/intag-2017-0044.

  • Russo, S., and Coauthors, 2021: The value of human data annotation for machine learning based anomaly detection in environmental systems. Water Res., 206, 117695, https://doi.org/10.1016/j.watres.2021.117695.

  • Samek, W., G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller, 2021: Explaining deep neural networks and beyond: A review of methods and applications. Proc. IEEE, 109, 247–278, https://doi.org/10.1109/JPROC.2021.3060483.

  • Schmidt, L., and Coauthors, 2023: System for automated Quality Control (SaQC) to enable traceable and reproducible data streams in environmental science. Environ. Modell. Software, 169, 105809, https://doi.org/10.1016/j.envsoft.2023.105809.

  • Shrikumar, A., P. Greenside, and A. Kundaje, 2017: Learning important features through propagating activation differences. Proc. 34th Int. Conf. on Machine Learning, Sydney, New South Wales, Australia, JMLR, 3145–3153, https://doi.org/10.48550/ARXIV.1704.02685.

  • Smilkov, D., N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, 2017: SmoothGrad: Removing noise by adding noise. arXiv, 1706.03825v1, https://doi.org/10.48550/ARXIV.1706.03825.

  • Sturtevant, C., S. Metzger, S. Nehr, and T. Foken, 2021: Quality assurance and control. Springer Handbook of Atmospheric Measurements, Springer, 47–90, https://doi.org/10.1007/978-3-030-52171-4_3.

  • Sundararajan, M., A. Taly, and Q. Yan, 2017: Axiomatic attribution for deep networks. Proc. 34th Int. Conf. on Machine Learning, Sydney, New South Wales, Australia, PMLR, 3319–3328, https://doi.org/10.48550/ARXIV.1703.01365.

  • Sutskever, I., O. Vinyals, and Q. V. Le, 2014: Sequence to sequence learning with neural networks. NIPS’14: Proc. 27th Int. Conf. on Neural Information Processing Systems, Montreal, QC, Canada, MIT Press, 3104–3112, https://dl.acm.org/doi/10.5555/2969033.2969173.

  • Susha Lekshmi, S. L., D. N. Singh, and M. S. Baghini, 2014: A critical review of soil moisture measurement. Measurement, 54, 92–105, https://doi.org/10.1016/j.measurement.2014.04.007.

  • UNFCCC, 2022: Sharm el-Sheikh Implementation Plan. Revised draft decision—/CMA.4. UNFCCC, 12 pp., https://unfccc.int/documents/621908.

  • Vandewinckele, L., M. Claessens, A. Dinkla, C. Brouwer, W. Crijns, D. Verellen, and W. van Elmpt, 2020: Overview of artificial intelligence-based applications in radiotherapy: Recommendations for implementation and quality assurance. Radiother. Oncol., 153, 55–66, https://doi.org/10.1016/j.radonc.2020.09.008.

  • van Leth, T. C., A. Overeem, H. Leijnse, and R. Uijlenhoet, 2018: A measurement campaign to assess sources of error in microwave link rainfall estimation. Atmos. Meas. Tech., 11, 4645–4669, https://doi.org/10.5194/amt-11-4645-2018.

  • Veličković, P., G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, 2017: Graph attention networks. arXiv, 1710.10903, https://doi.org/10.48550/arXiv.1710.10903.

  • Wang, S., J. Cao, and P. S. Yu, 2022: Deep learning for spatio-temporal data mining: A survey. IEEE Trans. Knowl. Data Eng., 34, 3681–3700, https://doi.org/10.1109/TKDE.2020.3025580.

  • Wollschläger, U., and Coauthors, 2017: The Bode hydrological observatory: A platform for integrated, interdisciplinary hydro-ecological research within the TERENO Harz/Central German Lowland Observatory. Environ. Earth Sci., 76, 29, https://doi.org/10.1007/s12665-016-6327-5.

  • Yin, G., and Coauthors, 2023: Automatic recognition of schizophrenia from brain-network features using graph convolutional neural network. Asian J. Psychiatry, 87, 103687, https://doi.org/10.1016/j.ajp.2023.103687.

  • You, J., Z. Ying, and J. Leskovec, 2020: Design space for graph neural networks. Advances in Neural Information Processing Systems, Curran Associates, Inc., 17 009–17 021.

  • Zhang, J., L. Pan, Q.-L. Han, C. Chen, S. Wen, and Y. Xiang, 2021: Deep learning based attack detection for cyber-physical system cybersecurity: A survey. IEEE/CAA J. Autom. Sin., 9, 377–391, https://doi.org/10.1109/JAS.2021.1004261.

  • Zhang, M., S. Wu, X. Yu, Q. Liu, and L. Wang, 2022: Dynamic graph neural networks for sequential recommendation. IEEE Trans. Knowl. Data Eng., 35, 4741–4753, https://doi.org/10.1109/TKDE.2022.3151618.

  • Zhao, H., and Coauthors, 2020: Multivariate time-series anomaly detection via graph attention network. 2020 IEEE Int. Conf. on Data Mining (ICDM), Sorrento, Italy, Institute of Electrical and Electronics Engineers, 841–850, https://doi.org/10.1109/ICDM50108.2020.00093.

