A Deep Learning Data Fusion Model Using Sentinel-1/2, SoilGrids, SMAP, and GLDAS for Soil Moisture Retrieval

Vishal Batchu, Google Research, Bangalore, India (https://orcid.org/0000-0003-0461-0730); Grey Nearing, Google Research, Mountain View, California; and Varun Gulshan, Google Research, Bangalore, India

Open access

Abstract

We develop a deep learning–based convolutional-regression model that estimates the volumetric soil moisture content in the top ∼5 cm of soil. Input predictors include Sentinel-1 (active radar) and Sentinel-2 (multispectral imagery), as well as geophysical variables from SoilGrids and modeled soil moisture fields from SMAP and GLDAS. The model was trained and evaluated on data from ∼1000 in situ sensors globally over the period 2015–21 and obtained an average per-sensor correlation of 0.707 and ubRMSE of 0.055 m3 m−3, and it can be used to produce a soil moisture map at a nominal 320-m resolution. These results are benchmarked against 14 other soil moisture retrieval studies at different locations, and an ablation study was used to identify important predictors.

Significance Statement

Soil moisture is a key variable in various agriculture and water management systems. Accurate and high-resolution estimates of soil moisture have multiple downstream benefits, such as reduced water wastage through better understanding and management of water consumption, smarter irrigation methods, and more effective canal water management. We develop a deep learning–based model that estimates the volumetric soil moisture content in the top ∼5 cm of soil at a nominal 320-m resolution. Our results demonstrate that machine learning is a useful tool for fusing different modalities with ease, while producing high-resolution models that are not location specific. Future work could explore the possibility of using temporal input sources to further improve model performance.

© 2023 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Vishal Batchu, vishalbatchu@google.com


1. Introduction

Soil moisture is one of the primary hydrological state (memory) variables in terrestrial systems (Brocca et al. 2017; Laiolo et al. 2016; Robinson et al. 2008), and is one of the primary controls for agriculture and water management (Dobriyal et al. 2012; Rossato et al. 2017). Soil moisture affects evapotranspiration and vegetation water availability, which are at the core of the climate–carbon cycle (Falloon et al. 2011) and play an important role in hydrological risks such as floods, drought, erosion, and landslides (Kim et al. 2019; Legates et al. 2011; Tramblay et al. 2012). Accurate measurement of soil moisture has numerous downstream benefits (Moran et al. 2015), including reduced water wastage through better understanding and management of water consumption (Brocca et al. 2018; Foster et al. 2020), smarter irrigation methods (Kumar et al. 2014), and effective canal water management (Zafar et al. 2021), in which soil moisture estimates are used to track and reroute canal water to distribute it more effectively.

The most accurate way to measure soil moisture is via ground-based methods, either direct gravimetric measurements (Klute 1986) or the indirect techniques used by in situ sensors, such as dielectric reflectometry and capacitance sensing (Bittelli 2011; Walker et al. 2004). However, in situ sensors are expensive to install and maintain and are difficult to scale spatially. Remote sensing–based methods scale globally and provide modestly accurate estimates of top-soil moisture (typically 0–5 cm) (Ahmed et al. 2011) at far lower deployment and maintenance costs: they trade some accuracy relative to in situ measurements for spatial coverage.

A large body of remote sensing–based methods use microwave-band radiometric reflectance to quantify soil moisture. These sensors can be located on aerial or satellite platforms and can be broadly classified into two types: passive and active. Passive remote sensing mainly uses L/C-band brightness temperatures. Early retrieval methods were site-specific, semiempirical models such as Oh (Oh et al. 1992), Dubois/Topp (Dubois et al. 1995), and the Integral Equation Model (IEM) (Baghdadi et al. 2004; Chen et al. 2003). These methods provide reasonably accurate estimates of soil moisture; however, they are sensitive to site-specific parameters such as soil roughness (Mattia et al. 1997) and are only capable of estimating soil moisture on bare soils (Verhoest et al. 2008) or soils with low vegetation content. These models have been extensively tested and evaluated (Choker et al. 2017; Ma et al. 2021; MirMazloumi and Sahebi 2016; Panciera et al. 2014), showing that they are generally not reliable when scaled globally. More generalizable retrieval methods were developed on top of remote sensing systems such as SMAP (resolution of 36 km/9 km; Entekhabi et al. 2010), SMOS (resolution of 30–50 km; Kerr et al. 2001), and AMSR-E (resolution of 25 km; Njoku et al. 2003). Baseline algorithms developed for each of these systems (Chan et al. 2014; Kerr et al. 2012; Njoku et al. 2003) were physics-based models built on top of the microwave brightness temperatures and validated with in situ sensor grids (Al-Yaari et al. 2017; Cai et al. 2017; Chen et al. 2018; Colliander et al. 2017a). In addition to the brightness temperatures, these models also use various globally available static land surface parameters such as soil type and land cover (Chan et al. 2014). The main limitation of passive microwave products is their coarse spatial resolution; for example, the SMAP L2_SM_P_E product has a resolution of 9 km × 9 km (Entekhabi et al. 2010), which is too coarse for many agriculture-related tasks, e.g., field-scale monitoring.

Active remote sensing methods mainly utilize L/C-band synthetic aperture radar (SAR) and allow for higher-resolution estimates (Haider et al. 2004; Shi et al. 1997). However, the retrieval accuracy of these methods is not as high as that of passive remote sensing methods (Njoku et al. 2002). Active and passive methods can be combined (Das et al. 2018, 2019; Wu et al. 2017) to improve resolution while maintaining accuracy, which was an impetus for the SMAP mission. Passive and active methods are complementary: the former is more accurate but has a coarse spatial resolution, while the latter is less accurate but has a higher spatial resolution.

Recent work has looked at the combination of passive and active radar with other remote sensing sources, such as optical/thermal imagery (Gao et al. 2017; Ojha et al. 2021). A SMAP + Sentinel-1 product was also developed from the perspective of disaggregating SMAP radiometer brightness temperatures based on Sentinel-1 readings (Das et al. 2019); however, this method has not yet achieved accuracies similar to SMAP at a higher resolution.

In recent years, the use of machine learning (ML) has shown promise in various fields of geoscience (Lary et al. 2016). ML has been used specifically to help disaggregate microwave-based soil moisture estimates into higher-resolution products (Abbaszadeh et al. 2019; Kolassa et al. 2018; Liu et al. 2020; Mao et al. 2019). Machine learning provides a unique set of abilities that work well for soil moisture estimation: enabling the use of large datasets such as the International Soil Moisture Network (ISMN) database (Dorigo et al. 2011) without the need for site-specific tuning, modeling nonlinear relationships between multiple predictors (remotely sensed inputs) and the target [soil moisture (SM)], and fusing multiple input sources, which could potentially aid in handling vegetation and canopy cover (Greifeneder et al. 2021; Karthikeyan and Mishra 2021; Lee et al. 2018; Vergopolan et al. 2021). While some of these ML efforts have improved soil moisture estimation compared to previous methods, they mostly depend on a small number of inputs (Liu et al. 2020), face issues with speckle in active remote sensing data (Oliver and Quegan 1998), and are unable to perform the scene understanding required for soil moisture estimation (Davenport et al. 2008).

To address the aforementioned issues and continue pushing the boundaries of soil moisture models used in the community, we develop, train, and test a deep learning model that provides high-resolution (nominal resolution of 320 m; discussed in detail in section 3e) and accurate (average Pearson correlation of 0.727) estimates of soil moisture globally. Several elements together make our approach unique: the use of high-resolution inputs (Sentinel-1/2) that enable scene understanding, the use of deep learning (which does not require feature engineering), models that are location agnostic rather than site or region specific, a large-scale training dataset, and the ease of adding new input sources to the model for future exploration.

2. Data

We generate large datasets for training, evaluation, and testing. Each data point consists of (i) a set of model inputs from various sources (described presently), (ii) soil moisture labels that the model is trained to estimate, and (iii) additional metadata such as time stamp, geographical coordinates, etc. These data, their sources, and preprocessing are described in the following subsections.

a. Labels

1) International Soil Moisture Network

The largest repository (to our knowledge) of in situ sensor data for soil moisture is the ISMN “network of networks” (Dorigo et al. 2011). This dataset has been used extensively for calibrating, training, and evaluating models, and we use it to maintain consistency with previous studies. At each sensor location, ISMN provides measurements in volumetric units at hourly intervals at various depths. The data are quality checked and flagged for anomalies/inconsistencies (Dorigo et al. 2013) in accordance with NASA’s good validation practices (Montzka et al. 2020). We use ISMN soil moisture tagged with a depth of 0–5 cm or 5–5 cm; both correspond to the top layer of soil moisture, but the notation differs across provider networks.

The area of study is limited by the presence of in situ sensors. Although the ISMN repository has a global presence, sensors are primarily located in the United States, Europe, Australia, and China (Dorigo et al. 2021). This limits the applicability of models trained on these data to similar geographies. The sensor networks that we use from ISMN are listed in Table 1.

Table 1.

List of sensor networks from ISMN that we use as part of our study. The number of data points corresponds to the number of samples present in the final dataset after the merge with remotely sensed data, as described in section 2c. An asterisk (*) indicates networks where a very small number of data points were available.


2) SMAP core validation sites

In addition to the ISMN network, we also obtain data from the SMAP core validation sites (Colliander et al. 2017a) listed in Table 2 and combine them with the ISMN data to train and validate our models.

Table 2.

List of SMAP core validation sites that we use as part of our study. We only use a subset of all the core validation sites (Colliander et al. 2017b) due to data access limitations. The number of data points corresponds to the number of samples present in the final dataset after the merge with remotely sensed data, as described in section 2c.


b. Input data

Input data for our models were downloaded from Google Earth Engine (Gorelick et al. 2017), which provides a number of different satellite products and geophysical variables that can be combined and exported in a format suitable for machine learning. Earth Engine also facilitates the processing of imagery such as scaling it to a given spatial resolution, performing temporal joins, projecting the imagery to a specific projection, etc.

As mentioned in the introduction, there are a number of remote sensing sources that can be useful for soil moisture estimation (Ahmed et al. 2011). We select sources that have a significant correlation with soil moisture and/or have potential to help with the disaggregation of low-resolution soil moisture estimates.

1) High-resolution sources

(i) Sentinel-1 (S1)

The Copernicus Sentinel-1 (Torres et al. 2012) mission (2014–present) by the European Space Agency (ESA) provides global SAR readings (Rosen et al. 2000) at regular intervals. It has a revisit time of 6 days at the equator (Torres et al. 2012).

We use the Sentinel-1 GRD (Ground Range Detected) product from Earth Engine which consists of VV (vertical transmit–vertical receive polarization), VH (vertical transmit–horizontal receive polarization), and angle imagery corresponding to dual-band cross-polarized data at a 10-m resolution. The scenes undergo thermal noise removal, radiometric calibration, and terrain correction with the Sentinel-1 toolbox (Veci et al. 2014) to despeckle and denoise.

(ii) Sentinel-2 (S2)

The Copernicus Sentinel-2 (Drusch et al. 2012) mission (2015–present) by the ESA provides high-resolution multispectral imagery (Table 3 lists the Sentinel-2 bands used in this study). It has a revisit time of 5 days at the equator.

Table 3.

Sentinel-2 bands we use and their wavelengths/properties. S2A = Sentinel-2A; S2B = Sentinel-2B.


Unlike SAR, optical, near-infrared (NIR), and shortwave infrared (SWIR) imagery are dependent on cloud cover and the time of day of acquisition. A large fraction of Sentinel-2 scenes have significant cloud cover. We filter the data to retain only scenes containing less than 30% cloud cover (using the QA60 cloud_mask band from Sentinel-2).

We use the L1C top-of-atmosphere product, which consists of multiple bands with resolutions ranging from 10 to 60 m. All bands were upscaled to 10-m resolution using nearest-neighbor resampling (or kept at 10 m where already available) for use as inputs to our models.

(iii) NASA DEM

Digital elevation models capture the topography of bare ground, which helps estimate how much moisture the surface can hold. The “elevation” band from the NASA Digital Elevation Model (DEM) (NASA JPL 2020) on Earth Engine provides 30-m resolution estimates and is a reprocessing of the widely used Shuttle Radar Topography Mission (SRTM) product (Farr et al. 2000) that improves its accuracy globally. Data from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM); Ice, Cloud, and Land Elevation Satellite (ICESat) Geoscience Laser Altimeter System (GLAS); and PRISM are incorporated into the SRTM product to produce the refined NASA DEM. This is a single-acquisition product (acquired in 2000), but topography varies little over time, so the product remains relevant for the task at hand.

2) Low-resolution sources

SoilGrids

SoilGrids (Hengl et al. 2017) provides us with various environmental and soil profile layers with global coverage. We specifically use soil texture—i.e., sand, silt, and clay fractions—and bulk density. All of these mapped products are present at a resolution of 250 m.

We do not use pedotransfer functions explicitly since our models can learn the required mapping. Soil information is essential because it is related to maximum water holding capacity (porosity); infiltration; and, to some extent, evaporation rates, which are critical controls on water retention and storage (Beale et al. 2019; Pan et al. 2012). Soil information obtained from SoilGrids has been used as a proxy for the disaggregation of soil moisture in prior work (Leenaars et al. 2018; Montzka et al. 2018).

3) Coarse soil moisture products

(i) SMAP soil moisture

The NASA–USDA enhanced Soil Moisture Active Passive (SMAP) soil moisture product (Mladenova et al. 2020) provides surface level (0–5 cm) soil moisture estimates at a 10-km resolution. The product is produced by applying a Palmer model (Palmer 1965) followed by a 1D ensemble Kalman filter (EnKF) (Evensen 2003) to assimilate the Level 3 SMAP product. It has a temporal revisit period of ∼3 days.

The “ssm” band provided in this product corresponds to surface soil moisture and is in units of millimeters (mm), which is equivalent to kilograms per square meter (kg m−2). We convert this to volumetric soil moisture content by using 1000 kg m−3 as the density of water and a measurement depth of 5 cm.1

Ideally, we would have liked to use the SMAP L4 9-km product here, but it is unfortunately not available in Earth Engine; we chose the NASA–USDA SMAP product instead because its availability in Earth Engine enabled us to easily build large-scale data pipelines for this task.

(ii) GLDAS soil moisture

Global Land Data Assimilation System (GLDAS) 2.1 (Rodell et al. 2004) uses land surface modeling and data assimilation techniques to model various land surface states and fluxes. It provides soil moisture products at various depths, of which the 0–10-cm layer (SoilMoi0_10cm_inst band, at 25-km resolution) is closest to our target depth of 0–5 cm. Although this soil moisture product does not map exactly to top-level surface soil moisture, it still correlates well and is a useful input. Being a modeling product, GLDAS provides information indirectly from recent rainfall and meteorological inputs. Estimates are produced at a ∼3-h interval.

Similar to the SMAP data, we convert GLDAS SM provided in millimeters (kg m−2) to volumetric soil moisture using a measurement depth of 10 cm.
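Both conversions divide the areal water mass (mm, numerically equal to kg m−2) by the water density times the layer depth. A minimal sketch (the function name is ours, not from the paper):

```python
def mm_to_volumetric(sm_mm, depth_m, water_density=1000.0):
    """Convert areal soil moisture (mm, i.e., kg m^-2) to volumetric
    content (m^3 m^-3) for a layer of the given depth.

    1 mm of water over 1 m^2 weighs 1 kg; dividing by the water
    density (kg m^-3) gives a water depth in metres, and dividing by
    the layer depth gives the dimensionless volumetric fraction.
    """
    return sm_mm / (water_density * depth_m)

# SMAP "ssm" uses a 5-cm layer; GLDAS SoilMoi0_10cm_inst uses 10 cm.
smap_vol = mm_to_volumetric(15.0, depth_m=0.05)   # 15 mm over 5 cm  -> 0.30
gldas_vol = mm_to_volumetric(30.0, depth_m=0.10)  # 30 mm over 10 cm -> 0.30
```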

4) Input normalization

All inputs from Earth Engine data sources are normalized into a consistent range to prepare inputs for our machine learning model. A linear min–max scaling [x′ = (x − xmin)/(xmax − xmin)] is used for most of the data sources, as shown in Table 4.

Table 4.

Source statistics for each of the Earth Engine sources we use. Each of these sources is normalized via linear min–max scaling.


For Sentinel-2, cloudy pixels have much larger reflectance values than noncloudy pixels. As a result, min–max linear scaling across the entire dataset leaves a small dynamic range for noncloudy reflectance values. To account for this, we use a logarithm-based nonlinear scaling method (Brown et al. 2022) that yields better dynamic ranges for noncloudy pixels. Additional details are specified in the appendix.
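The two scaling schemes can be sketched as follows. The min–max formula matches Table 4; the logarithmic variant is only an illustrative stand-in, since the paper's exact transform (Brown et al. 2022) is given in its appendix:

```python
import math

def min_max_scale(x, x_min, x_max):
    """Linear min-max scaling: x' = (x - x_min) / (x_max - x_min)."""
    return (x - x_min) / (x_max - x_min)

def log_scale(x, x_min, x_max):
    """Illustrative logarithmic scaling: compresses large (cloudy)
    reflectances so that noncloudy values keep a usable dynamic range.
    This is a plausible stand-in, not the paper's exact transform."""
    return math.log1p(x - x_min) / math.log1p(x_max - x_min)
```

Under the log transform, a mid-range reflectance maps well above the midpoint of [0, 1], leaving more resolution for the darker, noncloudy pixels.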

c. Creating the dataset

Data from all the sources specified above are joined with a distributed Earth Engine export pipeline to create datasets for training and testing.

This pipeline consists of the following steps:

  1. Filter ISMN dataset based on the quality flags. We retain data that have the following ISMN flags: “G” (good), “C02” (soil moisture > 0.6), “C03” (soil moisture > saturation point), “C02, C03” (soil moisture > 0.6 and soil moisture > saturation point).

  2. For each remaining ISMN data point consisting of latitude, longitude, time stamp, and soil moisture reading:

    1. [Sentinel-2 source only] Filter out images with >30% cloud percentage.

    2. Perform a spatial join to find matching images where the (latitude, longitude) of ISMN data point lies within the image bounds.

    3. Perform a temporal join to retain images within a specified time bound in the past, where the bound depends upon the input source as specified in Table 5.

    4. Pick the temporally closest image to the ISMN data point from the filtered images. If none are available, we discard this data point.

    5. Reproject the image to 10-m resolution and the corresponding UTM projection based on the UTM zone of the data point.

    6. Crop the image to extract a 512 × 512 sized region centered around the (latitude, longitude) of the ISMN data point.

    7. Normalize the image following Table 4.

    8. Pair this image with the data point.

    Table 5.

    Temporal bounds for each of the sources used during the join while creating the dataset. Note that in situ readings are present on an hourly basis. The bold cell indicates the strictest bound (1 h).


Since each of the input sources have different revisit intervals that do not align perfectly with each other, e.g., Sentinel-1 (revisit of 6 days2) and Sentinel-2 (revisit of 5 days), we temporally anchor the data to Sentinel-1 (our primary high-resolution input) while setting temporal bounds for the other sources that allow us to acquire meaningful information. The next section describes the dataset created.
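The temporal-join step above (find the closest image within a source-specific bound in the past, else discard) can be sketched in plain Python. The bound values other than the 1-h in situ anchor are illustrative placeholders, not the paper's Table 5:

```python
from datetime import datetime, timedelta

# Per-source temporal bounds in hours; only the 1-h in situ bound is
# stated in the text, the rest are hypothetical examples.
TEMPORAL_BOUNDS_H = {"in_situ": 1, "sentinel2": 72, "smap": 72, "gldas": 3}

def closest_image(images, t, max_age_h):
    """Return the image acquired closest to (and within max_age_h hours
    before) time t, or None if no image qualifies.

    `images` is a list of (timestamp, image) pairs; only images at or
    before t are considered, mirroring the "in the past" bound.
    """
    candidates = [(ts, img) for ts, img in images
                  if timedelta(0) <= t - ts <= timedelta(hours=max_age_h)]
    if not candidates:
        return None  # data point is discarded
    return max(candidates, key=lambda pair: pair[0])[1]
```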

1) Datasets created

(i) Sentinel-1 anchored dataset: Full

For creating the dataset, we used Sentinel-1 data as the primary input. We chose to anchor upon Sentinel-1 since it is resilient to atmospheric conditions and is directly sensitive to soil moisture, unlike Sentinel-2, which primarily provides information for scene understanding. Data from other sources were used to enrich the Sentinel-1 inputs, so we allowed for a greater time slack there.

The dataset is created using the Sentinel-1 anchored temporal bounds specified in Table 5. It is then split into train, validation, and test splits in a 60:20:20 ratio in an IID (independent and identically distributed) manner at the sensor level; i.e., 20% of the sensors are put under validation, 20% under test, and the remaining are used for training. All data points belonging to a single sensor stay in the same split.

The dataset comprises a total of 124 207 data points, of which 80 867 (65.1%) are in the training split, 22 543 (18.1%) are in the validation split, and 20 797 (16.8%) are in the testing split. Note that since not all sensors have the same number of data points, the 60:20:20 ratio holds across sensors but not strictly across data points. Sample input imagery and labels are shown in the appendix.
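A sensor-level split of this kind can be sketched as follows; the function name and seed are ours, not the paper's:

```python
import random

def sensor_level_split(points, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split 60:20:20 by *sensor*, so all data points from one sensor
    land in the same split. `points` maps sensor_id -> list of points.
    Returns three lists of sensor ids (train, validation, test)."""
    sensors = sorted(points)
    random.Random(seed).shuffle(sensors)
    n = len(sensors)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = sensors[:n_train]
    val = sensors[n_train:n_train + n_val]
    test = sensors[n_train + n_val:]
    return train, val, test
```

Because sensors carry different numbers of data points, the resulting point-level ratio deviates from 60:20:20 even though the sensor-level ratio is exact, as noted above.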

In Table 5, the 1-h bound ensures that each Sentinel-1 image is paired with at most one in situ sensor reading, as the sensor readings are never less than 1 h apart. This ensures that we do not create duplicate data points in the dataset where the exact same set of imagery is paired with multiple in situ readings.

(ii) Sentinel-1 anchored dataset: Hard

The dataset created above is filtered further to create a harder dataset. The train split remains the same as above. The validation and test splits, however, are filtered to ensure that for each data point, the distance to the closest training sensor is >25 km. This results in much stronger validation/test splits that showcase model generalizability.

This also yields a smaller dataset with fewer sensor networks in the validation/test splits, but it provides a stronger benchmark for testing model generalizability and real-world applicability.

The dataset comprises a total of 99 411 data points, of which 80 867 are in the train split, 8794 are in the validation split, and 9750 are in the test split.
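The hard-split filter (keep an evaluation sensor only if its nearest training sensor is more than 25 km away) can be sketched with a spherical approximation of the geodesic distance; the function names are ours:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (spherical approximation of the
    geodesic distance used for the hard split)."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def hard_filter(eval_sensors, train_sensors, min_km=25.0):
    """Keep only evaluation sensors whose nearest training sensor is
    more than min_km away. Sensors are (lat, lon) tuples."""
    return [s for s in eval_sensors
            if min(haversine_km(s[0], s[1], t[0], t[1])
                   for t in train_sensors) > min_km]
```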

Note that we use the hard dataset primarily throughout the paper with the exception of section 5d (benchmarking) and section 6 (discussion on metrics versus distance from nearest train sensor), where the full dataset is used since it is better suited. Additional details are provided in the respective sections.

2) Dataset statistics

All the statistics provided in this section correspond to the full Sentinel-1 anchored dataset. The volumetric soil moisture label distribution in Fig. 1 is skewed toward the lower end, which can be attributed to the fact that most soils are not usually saturated. The time stamp distribution in the same figure shows a fair spread of data across time; however, winter seasons (Northern Hemisphere seasonality, since most of our data come from the Northern Hemisphere) have fewer data points, visible as spiky dips. This can be attributed to the ISMN flags filtering out soil moisture readings at temperatures below 0°C, as well as to cloud filtering on Sentinel-2.

Fig. 1.

(left) Volumetric soil moisture (m3 m−3) label distribution. The dashed line indicates the median soil moisture. (right) Time stamp distribution.

Citation: Journal of Hydrometeorology 24, 10; 10.1175/JHM-D-22-0118.1

Distribution of coarse soil moisture products in Fig. 2 shows that the SMAP soil moisture histogram resembles the label histogram in Fig. 1 more closely when compared to the GLDAS histogram, likely because GLDAS estimates soil moisture at a depth of 10 cm.

Fig. 2.

Distribution of coarse soil moisture products. (left) SMAP volumetric soil moisture (m3 m−3). (right) GLDAS volumetric soil moisture (m3 m−3). The dashed lines indicate the corresponding median soil moisture values. Note that both the distributions are computed from the same set of data points.


Figure 3 shows a heat map of sensor locations present in the dataset. Around 83% of the sensors are in the United States, 11% in Europe, and the remaining around the globe. Note that this shows the distribution of in situ sensors and not the actual data points so each sensor is represented only once.

Fig. 3.

A heat map of in situ sensor locations present in the dataset [obtained using Heatmaps (https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap)].


The soil texture distribution in Fig. 4 shows the distribution of our sensors based on the USDA soil texture classification (Groenendyk et al. 2015), where the input fractions of sand, silt, and clay were obtained from SoilGrids. The land cover distribution in Fig. 4 shows that the majority of sensors are spread across vegetation, croplands, and forests. Note that the two sensors that fall under the “Water” class are very close to rivers/water bodies and are not actually inside the water. This is a fairly broad distribution and allows us to see how our model performs across these different land cover types. We obtain land cover classes for each sensor from the Copernicus Land Cover Map (Buchhorn et al. 2020) on Earth Engine. Figure 5 shows the distribution of our sensors across the different climate zones (Kottek et al. 2006). The land cover, soil texture, and climate zone distributions are similar for training, validation, and testing (these distributions are shown in the appendix).

Fig. 4.

(left) USDA based soil texture distribution. (right) Land cover distribution derived from the Copernicus Land Cover Map across the sensors present in the dataset.


Fig. 5.

Köppen climate zone distribution across the sensors present in the dataset. We use only the first two identifiers of the Köppen classification in order to group similar climate zones together.


The distribution of the number of data points per sensor in Fig. 6 (left) shows that we have ∼100–200 data points per sensor on average. This ensures that the data span multiple seasons and years for a majority of the sensors. For each validation sensor that we evaluate on, we compute the geodesic distance to the nearest train sensor; these distances are shown as a cumulative histogram in Fig. 6 (right). This provides an understanding of how validation sensors are distributed with respect to train sensors.

Fig. 6.

(left) The distribution of the number of data points per sensor. The dashed line indicates the median. (right) Cumulative histogram of validation sensors with respect to the distance from the nearest train sensor. The distance on the x axis varies on a log scale.


Figure 7 captures the temporal variation of soil moisture at a single sensor. We observe that Sentinel-2 inputs can be quite visually indicative of the vegetation growth and dryness of a region, which correlate with soil moisture.

Fig. 7.

Variation of soil moisture at a randomly selected sensor across time and corresponding input Sentinel-2 RGB (top in each dotted image pair) and Sentinel-1 VV (bottom in each dotted image pair) imagery.


3. Methods

The problem of soil moisture estimation is framed as an image-based regression task (Sateesh Babu et al. 2016; Fu et al. 2018; Rogez et al. 2017) where remotely sensed sources and geophysical variables are used as inputs. These input sources provide the spatial and spectral covariates to estimate surface-level soil moisture. We employ various deep learning techniques that have proven to work well for image-based regression. These techniques allow our models to be site agnostic, which results in better generalization compared to site-specific/calibrated models, providing the capability to scale globally (Fig. 8).

Fig. 8.

Our model architecture for the task of soil moisture estimation.


a. Model

Starting with the two sets of inputs (high resolution and low resolution), we encode each set with Xception-based feature encoders, generating a single embedding (a compact feature representation with no spatial dimension; detailed in the next section) per set. The hypothesis is that embeddings generated from the high-resolution inputs capture a combination of the scene embedding (Main-Knorn et al. 2017; Raiyani et al. 2021) from the Sentinel-2 and DEM sources and a potential rough soil moisture estimate (Paloscia et al. 2013) from the Sentinel-1 and Sentinel-2 sources. Likewise, the embeddings generated from the low-resolution inputs capture soil properties such as water holding capacity, soil type, density, and texture. These embeddings are then fused by concatenating them and passing the concatenated vector through fusion head 1, a stack of [dropout (Srivastava et al. 2014), fully connected (FC), batch norm (Ioffe and Szegedy 2015), activation] layers followed by a [dropout, FC] block at the end, described in detail in section 3a(2). This results in a compact representation of features relevant for soil moisture.

We then fuse the coarse soil moisture inputs (SMAP and GLDAS SM estimates) with the output of the first fusion head, by passing the concatenated vector through another set of fully connected layers, called fusion head 2. This produces the final soil moisture estimate that is passed to the loss function to compare with training labels.

We also considered early-fusion approaches in which the low-resolution inputs are concatenated with the high-resolution inputs. However, these require the low-resolution inputs to be upsampled to the size of the high-resolution inputs, which involves a large amount of duplication and wasted compute.

1) Input encoders

We use the Xception (Chollet 2017) encoder for the feature encoders. We also experimented with ResNet (He et al. 2016) and MobileNet-V2 (Sandler et al. 2018) encoders; empirically, Xception performed the best of these choices.

In particular, we use the Xception-65 encoder to encode high-resolution imagery and an Xception-41 encoder for low-resolution imagery. Low-resolution imagery has far fewer input pixels, which carry more semantically meaningful features, so a smaller encoder with a fractional depth multiplier (Chollet 2017) works well and allows us to utilize model capacity better: faster training and fewer parameters.

The input to the high-res encoder consists of a 256 × 256 [we take a center crop from the 512 × 512 image in the dataset for computational reasons; ablation study in section 5d(1)] pixel image centered at the location of the sensor. High-resolution imagery is sampled at a 10-m resolution resulting in a 2.56 km × 2.56 km region being used as the input. The low-resolution encoder uses a 16 × 16 input size where low-resolution imagery (at 250-m original resolution) is bilinearly resampled at 160 m to ensure that we cover the same 2.56 km × 2.56 km region. The large input region allows the model to understand the context around the center of the image to estimate the soil moisture at the center. The impact of varying the region size on the model performance is explored in the experiments section.

The high-resolution encoder produces 8 × 8 × 2048 sized features. We then apply center-weighted global pooling, as shown in Fig. 9, to ensure that the embeddings generated at the center are given the most importance, since we are estimating the soil moisture at the center of the image. This produces a 1 × 1 × 2048 sized embedding. The low-resolution encoder produces 1 × 1 × 2048 sized features directly. These embeddings are flattened and concatenated to produce a 4096-dimensional embedding, which is passed on to the fusion head.
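The center-weighted pooling step can be sketched as follows. The paper does not specify the exact weight kernel beyond Fig. 9 (weights concentrated at the center, summing to 1), so the normalized Gaussian used here is an illustrative choice:

```python
import numpy as np

def center_weighted_pool(features, sigma=2.0):
    """Pool an (H, W, C) feature map into a (C,) embedding, weighting
    spatial positions by proximity to the center. The Gaussian kernel
    is an illustrative stand-in for the paper's unspecified weights."""
    h, w, _ = features.shape
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    weights = np.exp(-(yy**2 + xx**2) / (2.0 * sigma**2))
    weights /= weights.sum()  # weights sum to 1, as in Fig. 9
    # Weighted sum over both spatial axes -> one value per channel.
    return np.tensordot(weights, features, axes=([0, 1], [0, 1]))

# An 8 x 8 x 2048 feature map pools down to a 2048-dim embedding.
embedding = center_weighted_pool(np.random.rand(8, 8, 2048))
```

With `sigma` large, this degenerates toward standard average pooling (the right panel of Fig. 9).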

Fig. 9.

A representative comparison of the features generated by the Xception encoder. (left) Center weighted pooling. (right) The standard average pooling. Darker colors indicate a higher weight placed on a particular pixel. The sum of weights in both cases equals 1.


2) Fusion head

The fusion head consists of stacked blocks of [dropout, FC, batch norm, activation] with an additional set of dropout and FC layers toward the end without an activation or batch normalization since we do not want to restrict our predicted output range (Fig. 10). This head allows for nonlinear fusion of concatenated input features. Each fusion head can be configured by specifying the number of output channels per block and the number of blocks present.
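At inference time, the forward pass of such a fusion head can be sketched as below. This is a simplified NumPy sketch, not the training-time graph: dropout acts as the identity at inference and the batch norm statistics are assumed folded into the fully connected weights; all weight values are random placeholders:

```python
import numpy as np

def swish(x):
    """Swish activation, the choice selected in the paper's sweep."""
    return x / (1.0 + np.exp(-x))

def fusion_head(x, blocks, final_w, final_b):
    """Inference-time sketch: repeated [dropout, FC, batch norm,
    activation] blocks, then a final [dropout, FC] with no activation
    or batch norm so the output range is unrestricted."""
    for w, b in blocks:
        x = swish(x @ w + b)       # FC (+ folded batch norm) + Swish
    return x @ final_w + final_b   # final FC, unbounded output

rng = np.random.default_rng(0)
# First fusion head: 4096-dim concatenated embedding -> 128 -> 64 -> 1.
blocks = [(0.01 * rng.normal(size=(4096, 128)), np.zeros(128)),
          (0.01 * rng.normal(size=(128, 64)), np.zeros(64))]
estimate = fusion_head(rng.normal(size=(1, 4096)), blocks,
                       0.01 * rng.normal(size=(64, 1)), np.zeros(1))
```

The (128, 64, 1) shape mirrors the first fusion head's configuration described in section 3c.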

Fig. 10.

Our fusion head architecture consists of fully connected, batch norm, activation, and dropout layers. The dotted region shows one block of the fusion head, which is repeated N times.


We use two fusion heads in our model. The first one (with 3 blocks and output channels per block [128, 64, 1]) fuses information from the two input encoders encoding high-resolution and low-resolution imagery, respectively. The second one (with 2 blocks and output channels per block [8, 1]) fuses information from the coarse soil moisture inputs with the fused embedding generated by the first fusion head producing the final soil moisture estimate. Additional hyperparameter details are presented in section 3c.

b. Training

We train our model using TensorFlow in a synchronized, distributed Tensor Processing Unit (TPU; v1) (Cheng et al. 2017) setting where each model is trained on a 2 × 2 pod (8 TPU cores). We use stochastic gradient descent (SGD) with momentum (Rumelhart et al. 1986) as the optimizer. Training takes around 8–10 h on average. The learning rate transitions from its initial value to 0 via a polynomial decay schedule with a power of 0.9, applied across each minibatch during training. Additional hyperparameter details are presented in section 3c.
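The decay schedule above can be written as a small helper. This assumes the standard polynomial decay form with an end learning rate of 0, matching the description; it is a sketch, not the authors' exact code:

```python
def polynomial_decay_lr(step, total_steps, initial_lr, power=0.9):
    """Learning rate decays from initial_lr to 0 over training,
    following a polynomial schedule with power 0.9 applied per
    minibatch step, as described in section 3b."""
    progress = min(step, total_steps) / float(total_steps)
    return initial_lr * (1.0 - progress) ** power

# E.g. with the swept initial learning rate of 0.1 over 600k steps:
lr_start = polynomial_decay_lr(0, 600_000, 0.1)      # 0.1
lr_end = polynomial_decay_lr(600_000, 600_000, 0.1)  # 0.0
```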

In addition to the distributed training strategy, we also utilize the TensorFlow data service (Murray et al. 2021), allowing us to distribute dataset preprocessing across a cluster of central processing units (CPUs).

1) Loss

We use the Huber loss (Huber 1992) to train our models. The Huber loss is a combination of mean squared error (MSE) and mean absolute error (MAE), which allows our model to efficiently optimize inliers while paying less attention to outliers, with δ specifying the threshold that determines what constitutes an outlier:

$$\mathrm{loss} = \begin{cases} \frac{1}{2}(x - y)^2, & \text{if } |x - y| \le \delta \\ \delta|x - y| - \frac{1}{2}\delta^2, & \text{otherwise.} \end{cases}$$
We tried using MSE and MAE loss functions, but the Huber loss performed better empirically.
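The piecewise definition above translates directly to code; this NumPy sketch uses the δ = 0.4 selected in the hyperparameter sweep of section 3c:

```python
import numpy as np

def huber_loss(x, y, delta=0.4):
    """Elementwise Huber loss: quadratic for residuals within delta
    (inliers), linear beyond it (outliers). delta=0.4 is the value
    selected in the paper's sweep (section 3c)."""
    residual = np.abs(x - y)
    quadratic = 0.5 * residual**2
    linear = delta * residual - 0.5 * delta**2
    return np.where(residual <= delta, quadratic, linear)
```

At `residual == delta` the two branches agree (both give δ²/2), so the loss and its gradient are continuous there.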

2) Regularization

Although the dataset consists of a large number of data points, it contains only 958 distinct locations since the number of sensors is limited. As a result, our models could overfit to the small set of training locations. Regularization helps prevent this and improves generalization. We employ several regularization techniques to reduce overfitting, such as weight decay (Loshchilov and Hutter 2019), dropout (Srivastava et al. 2014), and drop path (Huang et al. 2016).

Augmentations

We also apply various augmentations on our inputs as an additional form of regularization (Table 6; Shorten and Khoshgoftaar 2019). We only use center preserving augmentations since we are estimating the soil moisture at the center of the image [see discussion about center-weighted pooling in section 3a(1)]. We do not use any color transform augmentations since our remote sensing inputs have a much larger spectral range (i.e., not just RGB imagery) and we do not want to lose the signal present in the absolute value of pixels.

Table 6.

Details on the exact set of augmentations used. The probability here indicates the probability with which a specific augmentation is applied on an input.


For every data point, each of the augmentations is applied according to its probability. This results in a large number of augmentation combinations, allowing us to produce a wide variety of augmented images, similar to the augmentation policies described in AutoAugment (Cubuk et al. 2019).
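Such a policy can be sketched as below. The specific augmentations and probabilities here are illustrative placeholders (the exact set is in Table 6); only center-preserving geometric transforms are used, and no color transforms, per the discussion above:

```python
import numpy as np

def augment(image, rng, policy):
    """Apply each center-preserving augmentation independently with
    its own probability, mirroring the policy style of Table 6.
    'policy' maps augmentation names to probabilities (illustrative)."""
    if rng.random() < policy.get("flip_lr", 0.0):
        image = image[:, ::-1]          # horizontal flip
    if rng.random() < policy.get("flip_ud", 0.0):
        image = image[::-1, :]          # vertical flip
    if rng.random() < policy.get("rot90", 0.0):
        image = np.rot90(image, k=rng.integers(1, 4))  # 90/180/270 rotation
    return image
```

Because each transform is sampled independently per data point, an image can receive any subset of the augmentations on a given epoch.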

c. Hyperparameters

The high-resolution encoder uses an Xception-65 backbone with an input crop size of 256 (selected from a sweep of [512, 256, 128, 64], discussed in the ablation in Table 12). The backbone has an output stride of 32 and uses a drop-path probability of 0.6 (selected from a sweep of [0.8, 0.6, 0.4, 0.2]). The low-resolution encoder uses an Xception-41 backbone with an input crop size of 16 and an output stride of 16, with the same drop-path probability as the high-resolution encoder.

The first fusion head has a shape of (128, 64, 1) with drop-keep probabilities of (0.5, 0.5, 0.6) and uses Swish (Ramachandran et al. 2017) activations (selected from a sweep of [Swish, ReLU, Leaky ReLU]). The second fusion head is simpler, with a shape of (8, 1), drop-keep probabilities of (0.9, 0.9), and the same Swish activations as the first fusion head.

We use a global batch size of 64, which is split across the 8 TPUs for training (data parallel distribution) with synchronized batch norm. We trained each model for a total of 600k steps to ensure training curves plateau (not reported). We use a Huber loss delta of 0.4 (selected from a sweep across [0, 0.2, 0.4, 0.6, 0.8, 1.0]). We use the momentum optimizer with momentum set to 0.9. We perform a joint sweep on the learning rate across [0.1, 0.05, 0.03] and weight decay across [0.0001, 0.000 05, 0.000 03] and pick the best performing model on our validation set.

We tried using ImageNet checkpoints to initialize the Xception encoders, since ImageNet-based initializations have been shown to perform well in various transfer learning scenarios (Kornblith et al. 2019), but we did not see any significant improvement in performance for the task at hand. Hence, we use random initialization for the weights of our models.

d. Evaluation

We evaluate our models both quantitatively and qualitatively. As a part of the qualitative evaluation, we plot multiple time series of model estimates and labels to visually assess temporal variation at randomly selected sensors. This is useful to identify whether any biases are present at different sensor locations, and to visually assess how biases might differ based on properties of the location such as the soil type, land cover, etc. For quantitative evaluation, we look at a set of metrics on the best model checkpoint during training as described below. During evaluation, we clamp model predictions to a range of [0, 1] since that is the valid range for soil moisture predictions.

Metrics

To measure the performance of our models, we compute each metric per sensor and then average over all sensors with at least 13 data points. We evaluate only on sensors with at least 3 months of data; given a Sentinel-1 revisit period of around 6–7 days, 3 months yields at least 13 data points:

$$\mathrm{final\_metric} = \frac{\sum_{s=1}^{M} \mathrm{metric}_s}{M},$$

where M = number of sensors having ≥13 data points and metric_s = per-sensor metric.
The metrics that we report are (i) ubRMSE (m3 m−3), (ii) RMSE (m3 m−3), and (iii) correlation (r), which are described as follows:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - y_i)^2}{N}},$$

$$\mathrm{ubRMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left[(x_i - \bar{x}) - (y_i - \bar{y})\right]^2}{N}},$$

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}},$$

where x_i = model estimate, y_i = label, \bar{x} = mean model estimate, \bar{y} = mean of labels at a particular sensor location, and N = number of data points.
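These definitions can be computed per sensor and aggregated as follows (a NumPy sketch; the function names are ours, not the paper's):

```python
import numpy as np

def sensor_metrics(x, y):
    """Per-sensor RMSE, ubRMSE, and Pearson correlation between model
    estimates x and in situ labels y, per the definitions above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rmse = np.sqrt(np.mean((x - y) ** 2))
    ubrmse = np.sqrt(np.mean(((x - x.mean()) - (y - y.mean())) ** 2))
    r = np.corrcoef(x, y)[0, 1]
    return rmse, ubrmse, r

def final_metric(sensors, index, min_points=13):
    """Average one metric (index 0=RMSE, 1=ubRMSE, 2=r) over all
    sensors with at least 13 data points (~3 months of revisits)."""
    vals = [sensor_metrics(x, y)[index] for x, y in sensors
            if len(x) >= min_points]
    return sum(vals) / len(vals)
```

Note that a constant per-sensor bias inflates RMSE but leaves ubRMSE and r untouched, which is why the paper focuses on the latter two.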

We focus on the ubRMSE since it is often the case that soil moisture estimates have large climatological differences with in situ sensors; however, we are often concerned with per-location dynamics instead of absolute values. RMSE and ubRMSE are unbounded metrics, and therefore difficult to contextualize outside of intermodel comparisons. We therefore also report the standard Pearson product-moment correlation coefficient.

e. Model resolution

Our models produce soil moisture estimates at a nominal resolution of 320 m. This is because our encoders reduce the high-resolution input (256 × 256 pixels at 10 m per pixel) by a factor of 32 to create a feature representation that is 8 × 8 pixels at 320 m per pixel. We use the term “nominal” because, although the model produces representations at a 320-m resolution, nearby predictions are not completely independent and even when we stride out inputs at 320 m to get estimates for adjacent locations, the inputs still overlap.
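The resolution arithmetic above can be made explicit; this is a worked restatement of the numbers in the text, not new results:

```python
# Nominal-resolution arithmetic from section 3e: the encoder's output
# stride of 32 maps 256 x 256 input pixels at 10 m/pixel to an
# 8 x 8 feature grid.
input_pixels, meters_per_pixel, output_stride = 256, 10, 32
feature_pixels = input_pixels // output_stride            # 8
nominal_resolution_m = meters_per_pixel * output_stride   # 320 m

# Striding inputs by 320 m for adjacent estimates still leaves the
# 2.56-km input windows heavily overlapping, hence "nominal":
window_m = input_pixels * meters_per_pixel                # 2560 m
overlap_m = window_m - nominal_resolution_m               # 2240 m
```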

Note that we use point estimates from in situ sensors at the center of the image as labels to train the model, rather than an average of soil moisture values across a 320 m × 320 m pixel, which is the resolution at which the model produces estimates. Ideally, we would have used dense sensor grids within 320 m × 320 m regions and averaged across the in situ sensors in each grid to produce the labels. However, due to data limitations, we use only the reading from the single in situ sensor at the center of each data point as the label.

4. Experiments

We adopt the following model naming convention: the names of all the sources that are used as inputs to the model are concatenated to identify the model. Please refer to Fig. 8 for details on how each of the inputs are used.

Hereafter, we refer to the Sentinel-1 + DEM + Sentinel-2 + SoilGrids + SMAP + GLDAS model as "our model"; it is the model used in experiments, results, and other analyses unless specified otherwise.

a. Overall model performance

The following set of baselines and ablations are used to measure the performance of our model quantitatively. Results are reported on the test set of the Sentinel-1 anchored hard dataset.

  • Baseline:

    • SMAP + GLDAS NN (neural network): The coarse soil moisture inputs passed through a fusion head to produce a soil moisture estimate.

  • Sentinel-1 + DEM + Sentinel-2 + SoilGrids + SMAP + GLDAS: Our model as described in Fig. 8.

  • Ablations:

    • Sentinel-1 + DEM: A Sentinel-1 only model (DEM allows the model to factor in the terrain).

    • Sentinel-1 + DEM + Sentinel-2: A Sentinel-1 and Sentinel-2 combined model where both of the inputs are passed together to the high-resolution encoder whose embeddings are then passed through a soil moisture head.

    • Sentinel-1 + DEM + Sentinel-2 + SoilGrids: Similar to the previous model, except that SoilGrids is also used.

b. Spatial and temporal analysis

1) Spatial stratification of model performance

To understand model fidelity, multiple stratified analyses are performed. We look at the performance variation across land cover classes, soil texture types and climate zones. The Copernicus Global Land Cover map is used for the land cover class, SoilGrids for the soil texture type, and the Köppen climate map for the climate zone.

To understand variation at a finer scale, quantitative analysis at an individual sensor level is also performed. This is done for each of the validation sensors present in the United States (since a majority of the sensors are located in the United States). We use the validation split of the Sentinel-1 anchored hard dataset for all these analyses.

2) Time series analysis

To understand temporal variations in the performance of our model, we look at the time series of model estimates versus in situ labels for a few randomly selected validation sensors from the Sentinel-1 anchored hard dataset. This helps understand the temporal stability of model estimates and visualize the bias present (if any).

3) Large-scale spatial estimation

Last, large-scale estimation is performed using our model, where we move the model pixel by pixel and estimate the soil moisture at each location. This provides insight into the spatial coherency of the model estimates and their variation with respect to the inputs.

c. Model exploration studies

Ablation studies and sensitivity analyses are performed to see how important various model inputs/parameters are. All the studies here were performed on the validation set of the Sentinel-1 anchored hard dataset.

1) Input size sensitivity analysis

Each of the high-resolution sources corresponds to 10 m × 10 m per pixel. The sensitivity study here uses the Sentinel-1 + Sentinel-2 + DEM model. We use only high-resolution sources here since they are the most sensitive to a change in input size. This helps us better understand the amount of context the model requires and how it ties to performance.

2) Input feature ablation study

To identify which features are most important for our models, we perform the following experiments. Starting with our model with all the input bands (Sentinel-1, DEM, Sentinel-2, SoilGrids, and the coarse soil moisture inputs SMAP and GLDAS) as the reference model, we remove the source whose feature importance we want to calculate and measure the resulting drop in performance.

d. Benchmarking

We pick some of the best existing global methods along with a few top regional models to compare and evaluate our model performance against. Information about these works is presented in Table 7.

Table 7.

Information on the works we compare against.

Table 7.

Not all methods in Table 7 use the same kind of train/validation/test splits and acquisition time ranges for in situ data, so rigorous comparisons are not possible. However, the sensor networks used are the same in all comparisons and sampling is always performed over a large time range, which provides a meaningful comparison. Many works also perform K-fold cross validation when reporting their results, but we do not, due to practical constraints: deep learning models take a long time to train. We instead use a rigorous train/validation/test split, where we looked at the numbers on the test set exactly once (at the time of writing this paper), after all model parameters were finalized on the validation split.

Note that in order to benchmark our results, we use the Sentinel-1 anchored full dataset since the Sentinel-1 anchored hard dataset contains only a small fraction of sensor networks that we can benchmark against. Additionally, most of the works we compare against use IID splits and we do the same to remain consistent while comparing our results. In addition to benchmarking, we also provide overall model performance results on the Sentinel-1 anchored full dataset here for completeness.

5. Results

a. Overall model performance

A quantitative evaluation of our models is presented in Table 8 along with comparisons to baselines (validation set results are present in Table A2). Our model performs better than the baseline SMAP + GLDAS NN in all the metrics. These results together show that the high-resolution sources/geophysical variables and coarse soil moisture sources provide complementary information and combining them gives us the best performance.

Table 8.

Test set results on the Sentinel-1 anchored hard dataset. All results presented are in volumetric (m3 m−3) units. The number in parentheses denotes the percentage change with respect to the SMAP + GLDAS NN baseline. A positive change in percentage denotes an increase in Pearson correlation and decrease in ubRMSE/RMSE. The test split consists of 9750 data points. Bold indicates results from the best baseline method and our best method.


b. Spatial and temporal analysis

1) Spatial stratification of model performance

Our model performs well across different kinds of land cover classes (Table 9), soil texture types (Table 10), and climatic zones (Table 11). Metrics on classes containing fewer than 5 sensors should be disregarded, since the sample size is too small for a reliable aggregate. In all of these stratifications, the model performs fairly consistently (correlation within ±0.2) across classes, showing the adaptability and robustness of the model.

Table 9.

Validation results on the Sentinel-1 anchored hard dataset for our model stratified by land cover type. All results presented are in volumetric (m3 m−3) units. An asterisk (*) indicates that there are <5 sensors available for the specific land cover class in the test data.

Table 10.

Validation results on the Sentinel-1 anchored hard dataset for our model stratified by soil texture type. All results presented are in volumetric (m3 m−3) units. An asterisk (*) indicates that there are <5 sensors available for the specific soil texture class in the test data.

Table 11.

Validation results on the Sentinel-1 anchored hard dataset for our model stratified by the climatic zone. All results presented are in volumetric (m3 m−3) units. An asterisk (*) indicates that there are <5 sensors available for the specific climate zone in the validation data.


Figure 11 shows the correlation at a sensor level for each of our validation sensors across the United States. Sensors along the coastlines show a slight drop in performance overall. Similar results for ubRMSE are provided in Fig. A16.

Fig. 11.

Correlation between model estimates and ground truth labels at each validation sensor from the Sentinel-1 anchored hard dataset located in the United States. Red–green circles indicate in situ sensor locations at which we report the correlation.


2) Time series analysis

Figure 12 shows time series plots for a few randomly selected sensors. In the majority of cases, our model estimates follow the variation in the sensors well. They often have a small location-dependent bias, but the overall trend is captured well by the model. This reflects what we see in our quantitative metrics, correlation and ubRMSE. Additional time series plots are presented in Fig. A14.

Fig. 12.

Sample time series plots showing how the estimates of the model (m3 m−3) vary over time at a few randomly picked validation sensors from the Sentinel-1 anchored hard dataset compared to the labels/SMAP estimates. (left) The sensor has the following properties: land cover, agriculture; soil type, silt loam; climate zone, Cfa. (right) This sensor has the same properties as the one on the left, except the soil type, which is loam.


3) Large-scale spatial estimation

In Fig. 13, we notice a fair amount of variation at a sub-SMAP pixel (10 km × 10 km) level as well. The local variation in the estimates can be explained to a fair extent by looking at the input imagery. These results along with the metrics we observe in Table 8 show that the model produces accurate high-resolution estimates of soil moisture.

Fig. 13.

Sample volumetric soil moisture (m3 m−3) prediction results with our model on ∼20 km × 20 km regions selected at random. The color bar shows the normalized input/output range for each source/prediction.


In Fig. 14, we notice that although the actual estimated values of soil moisture change, the visible patterns remain fairly consistent, which shows that the model is sensitive to spatial changes. At times, however, tiling artifacts appear at SMAP/GLDAS boundaries when there is large variation between adjacent SMAP/GLDAS tiles.

Fig. 14.

Model volumetric soil moisture (m3 m−3) prediction results across time (each row separated by a year) on a single ∼20 km × 20 km region selected at random.


c. Model exploration studies

1) Input size sensitivity analysis

Looking at the correlation, larger input sizes provide better results since the model has more spatial context and a larger number of parameters to work with, leading to greater representation power; however, the improvements diminish as we continue to scale up (Table 12). The ubRMSE remains fairly consistent throughout. Considering practical constraints, we chose 256 × 256 as the input size.

Table 12.

Validation results on the Sentinel-1 anchored hard dataset for the input size sensitivity analysis. All results presented are in volumetric (m3 m−3) units.


2) Input feature ablation study

SoilGrids is the most important source (largest drop from the baseline in terms of performance, looking at the correlation) for the model followed by SMAP and Sentinel-1 (the rest following afterward) in Table 13. We notice that SMAP has a higher importance than GLDAS; this could be because GLDAS has a coarser resolution and a different sensing depth of 10 cm instead of 5 cm.

Table 13.

Validation results on the Sentinel-1 anchored hard dataset for the feature importance ablation study. Shows the importance of each feature set/source. All results presented are in volumetric (m3 m−3) units. The number in parentheses denotes the percentage change with respect to the first row. A positive change in percentage denotes an increase in Pearson correlation and decrease in ubRMSE/RMSE. Bold indicates results of the baseline, which we compare to each ablated row below.


Some of these inputs are related and in order to understand their importance together, we drop features in groups. Dropping SMAP + GLDAS (SM estimates) together results in the largest drop in performance, followed by SoilGrids + DEM (static inputs) and finally Sentinel-1 + Sentinel-2 (high-resolution inputs). All three sets of inputs result in a significant drop in metrics which shows that the model uses all these different types of sources in order to estimate soil moisture accurately.

d. Benchmarking

Benchmark comparisons for correlation and ubRMSE in Tables 14 and 15, respectively, show that our model performs well across various sensor networks. Even where existing benchmark models perform better, our model trails closely behind. Our model, however, performs significantly worse than most benchmarks on the MAQU and RSMN networks in terms of correlation and on YANCO in terms of ubRMSE.

Table 14.

A comparison of Pearson correlation (for our best model) vs other works, stratified by sensor network. Bold metrics in each row indicate the best performing model for the network in question. All results presented are in volumetric (m3 m−3) units. An asterisk (*) indicates that the time range during which in situ data were obtained is different from the one we use for training our models. A double asterisk (**) indicates that the work uses region-specific models. A triple asterisk (***) indicates that the work does not ensure a sensor level training/validation/testing split (data points from the same sensor can be present in training and validation).

Table 15.

A comparison of ubRMSE (for our best model) vs other works stratified by the sensor network. Bold metrics in each row indicate the best performing model for the network in question. All results presented are in volumetric (m3 m−3) units. An asterisk (*) indicates that the time range during which in situ data were obtained is different from the one we use for training our models. A double asterisk (**) indicates that the work uses region-specific models. A triple asterisk (***) indicates that the work does not ensure a sensor level train/validation/test split (data points from the same sensor can be present in train and validation).


Note that these are not exact 1:1 comparisons, since the time ranges, sampling strategies, etc., differ for some benchmarks, and some of these methods aggregate in situ sensor readings falling within a single pixel of their model inputs or across a period of time, which simplifies the task to some extent.

To provide a comprehensive set of results to compare against other works, test set results on the Sentinel-1 anchored full dataset are presented in Table 16. These results can be used to compare against existing works that use IID-based splits.

Table 16.

Test set results on the Sentinel-1 anchored full dataset. All results presented are in volumetric (m3 m−3) units. The number in parentheses denotes the percentage change with respect to the SMAP + GLDAS NN baseline. A positive change in percentage denotes an increase in Pearson correlation and decrease in ubRMSE/RMSE. The test split consists of 20 797 data points. Bold indicates results from the best baseline method, our best method, and our method that does not use any of the baseline inputs.


6. Discussion

We developed machine learning–based models that fuse information from remote sensing, geophysical, and meteorological data sources at varying resolutions to produce soil moisture estimates at a nominal resolution of 320 m. The result is a trained model applicable in a large variety of settings across the world, which outperforms SMAP/GLDAS baselines and most of the other methods we compare against in most sensor networks. We perform various input sensitivity and ablation studies, which provide useful empirical insight into each of the remote sensing sources used. We have not yet used additional DEM-based topographic position indices or wetness indices as model inputs and plan to add them in the future. We also tried using vegetation indices such as NDVI and leaf area indices as inputs but did not see any improvement (shown in Table A3). To understand possible biases that arise during cloud filtering, we perform a few analyses in Figs. A12 and A13. We observed that cloud filtering does lead to a slight temporal bias (some months lose a higher fraction of data than others) as well as a bias toward lower soil moisture values. Although the effect is small, we note it here for completeness.

To further understand model generalizability, we examined how model performance varies across validation sensors as we move farther from the training sensors (Fig. 15). Specifically, for each validation sensor, we computed the distance to the nearest training sensor and plotted the aggregate correlation across all validation sensors at least a distance x from the nearest training sensor, for all x. This gives insight into how model performance varies as validation sensors move farther from training sensors. In general, we expect this curve to trend downward, because more out-of-domain examples show up. Models that pick up relevant and general features should see a smaller drop than models that pick up location-specific features. While we expect the curve to trend downward, it does not always do so strictly, because the geography changes with distance, and we may run into easy/hard regions as we vary the threshold. We report results on both the "hard" and "full" splits to better understand this generalizability behavior, and provide results for the full range of distance thresholds in Fig. 15 for the validation set. The results also highlight the importance of strong data splits for validating performance; direct IID splits might not be representative of true performance when models are deployed in the real world.

Fig. 15.

Aggregate correlation of validation sensors vs distance from the nearest training sensor. The x axis is in units of distance, the left y axis is the metric value (m3 m−3), and the right y axis shows the number of validation sensors used at any given point to compute the metric average. Each point in the plot represents the average metric value for all validation sensors whose nearest training sensor is ≥x distance away.

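The distance-threshold aggregation behind Fig. 15 can be sketched as follows; the data structure (per-sensor distance/correlation pairs) is illustrative, not the exact code used:

```python
import numpy as np

def correlation_vs_distance(val_sensors, thresholds):
    """For each threshold x, average the per-sensor correlation over
    all validation sensors whose nearest training sensor is >= x away
    (the curve in Fig. 15). val_sensors holds (distance, correlation)
    pairs; returns (threshold, mean correlation, sensor count) tuples."""
    curve = []
    for x in thresholds:
        kept = [r for d, r in val_sensors if d >= x]
        curve.append((x, float(np.mean(kept)) if kept else float("nan"),
                      len(kept)))
    return curve
```

As the threshold grows, the sensor count shrinks, which is why the figure also reports the number of sensors used at each point.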

Our models capture soil moisture well even where there is high input variability in terrain and topography, as shown in Fig. A11. They are not restricted to homogeneous regions and can be applied in a wide range of settings. This is important since agricultural farmlands often have high variability in a given spatial context (Addis et al. 2015). We do notice some unexpected model behavior in the last row of Fig. 13, where irrigated/farmland regions (toward the center of the image) have lower soil moisture than neighboring regions. SoilGrids data for the region show that the sand fraction and bulk density for the farmland regions are higher than in surrounding areas, which could explain the lower SM estimates there.

An important learning from this study is that tasks in hydrology such as soil moisture estimation can benefit from modern advances in deep learning. We observed that deep learning models can leverage a larger context around the location of interest through their scene-understanding capability and learn location-agnostic models (seen in the significant improvement in performance as the amount of context is increased). Image-based CNNs, while more computationally expensive than traditional ML techniques, are still fairly efficient on images and allow us to utilize additional input context with ease. Traditional methods such as random forests require significant feature engineering (especially with image inputs) and do not match the performance of deep learning methods. Our models also make it simple to experiment with additional input sources and fuse them in a multistage manner, which has shown large performance improvements in our experiments.

Last, we also observed that different remote sensing/geophysical sources are complementary and that combining them leads to better performance. Looking forward, we would like to add inputs such as thermal bands and precipitation time series, and extend our model to process time series inputs. This would enable the model to leverage autocorrelations in time and to forecast soil moisture. We are also interested in applying more advanced self-/semisupervised methods (Patel et al. 2021) to improve model generalization to unseen regions and to scale better.

We release our models for anyone interested to use. Additionally, we curate and release a large-scale soil moisture dataset that others can use to train and evaluate remote sensing–based soil moisture models with ease.

1 Additional information on the volumetric soil moisture conversion is available at https://ldas.gsfc.nasa.gov/faq/.

2 Sentinel-1 had a revisit period of 6 days before the Sentinel-1B malfunction.

Acknowledgments.

We would like to thank Prof. Muddu Sekhar, John C. Platt, Christopher H. Van Arsdale, Kevin James McCloskey, and Rob von Behren for reviewing the paper in full, and Nikhilesh Kumar for providing invaluable suggestions throughout the experimental process.

Data availability statement.

The USDA Agricultural Research Service, Grazinglands Research Laboratory, El Reno, Oklahoma, provided FORTCOBB and LITTLEWASHITA data. The OzNet hydrological monitoring network provided YANCO and KYEAMBA data. The University of Twente via DANS provided TWENTE data. The University of Texas at Austin provided TXSON data via the Texas Data Repository. We release the dataset we created and used in the paper along with our trained models at https://github.com/google-research/google-research/tree/master/soil_moisture_retrieval.

APPENDIX

Additional Information, Experiments, and Analyses

a. Sentinel-2 normalization scheme

We use the following nonlinear normalization scheme, applied in log space, which ensures a broader dynamic range for noncloudy imagery after normalization.

Formulas are as follows:
scaled_band = exp{[log(original_band × 0.005 + 1.0) − log_mean] / log_std} × 5.0 − 1.0, and
normalized_band = [scaled_band / (scaled_band + 1.0)] × 255.0,
where the log_mean and log_std for each of the bands are as specified in Table A1. These were obtained by computing statistics over the United States.
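The normalization can be sketched in code as follows. Our reading of the formulas assumes the raw band is scaled by 0.005 before the log, and that the exponential term is rescaled by ×5.0 and offset by −1.0; the log_mean/log_std values below are placeholders, not the Table A1 statistics:

```python
import numpy as np

def normalize_s2_band(original_band, log_mean, log_std):
    """Log-space normalization of a Sentinel-2 band to roughly [0, 255]."""
    # Z-score in log space using precomputed per-band statistics.
    z = (np.log(original_band * 0.005 + 1.0) - log_mean) / log_std
    scaled = np.exp(z) * 5.0 - 1.0
    # Squash to a bounded range; this broadens the dynamic range of
    # noncloudy (low-reflectance) pixels after normalization.
    return (scaled / (scaled + 1.0)) * 255.0

band = np.array([100.0, 1000.0, 5000.0])  # raw band values (illustrative)
out = normalize_s2_band(band, log_mean=1.0, log_std=0.5)
```

Note that scaled + 1.0 = 5 exp(z) is always positive, so the squashing step is well defined and monotonic in the input band value.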
Table A1.

Normalization statistics for Sentinel-2 bands.


b. Sentinel-1 anchored dataset validation/test statistics

We provided statistics for the Sentinel-1 anchored full dataset in section 2c(2). To verify that our validation/test splits are truly IID, we compute the same set of statistics on each of the splits from the full dataset.

From the plots in Figs. A1–A5 and Figs. A6–A10, we observe that the data distributions in our validation and test splits are similar to our training distribution. This validates that our splits are indeed IID.
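Beyond the visual comparison, distribution similarity between splits can also be checked quantitatively; a minimal sketch using a two-sample Kolmogorov–Smirnov statistic on soil moisture labels (the label arrays here are synthetic placeholders, not our dataset):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    all_vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_vals, side="right") / a.size
    cdf_b = np.searchsorted(b, all_vals, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
train_sm = rng.beta(2.0, 5.0, size=5000)  # illustrative labels in m3/m3
val_sm = rng.beta(2.0, 5.0, size=1000)
stat = ks_statistic(train_sm, val_sm)  # small value -> similar distributions
```

A statistic near zero indicates the two empirical distributions are close; the same check can be repeated per variable (labels, SMAP/GLDAS values, soil texture fractions) across train/validation/test splits.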

Fig. A1.

(left) Volumetric soil moisture (m3 m−3) label distribution. The dashed line indicates the median soil moisture. (right) Time stamp distribution on the validation split of the dataset.


Fig. A2.

Distribution of coarse soil moisture products. (left) SMAP volumetric soil moisture (m3 m−3). (right) GLDAS volumetric soil moisture (m3 m−3) on the validation split of the dataset. The dashed lines indicate the corresponding median soil moisture values.


Fig. A3.

A heat map of in situ sensor locations present in the validation split of the dataset.


Fig. A4.

(left) USDA-based soil texture distribution. (right) Land cover distribution derived from the Copernicus Land Cover Map on the validation split of the dataset.


Fig. A5.

Köppen climate zone distribution on the validation split of the dataset. We use only the first two identifiers of the Köppen classification in order to group similar climate zones together.


Fig. A6.

(left) Volumetric soil moisture (m3 m−3) label distribution. The dashed line indicates the median soil moisture. (right) Time stamp distribution on the test split of the dataset.


Fig. A7.

Distribution of coarse soil moisture products. (left) SMAP volumetric soil moisture (m3 m−3). (right) GLDAS volumetric soil moisture (m3 m−3) on the test split of the dataset. The dashed lines indicate the corresponding median soil moisture values.


Fig. A8.

A heat map of in situ sensor locations present in the test split of the dataset.


Fig. A9.

(left) USDA-based soil texture distribution. (right) Land cover distribution derived from the Copernicus Land Cover Map on the test split of the dataset.


Fig. A10.

Köppen climate zone distribution on the test split of the dataset. We use only the first two identifiers of the Köppen classification in order to group similar climate zones together.


Figure A11 shows samples from the Sentinel-1 anchored full dataset. Tables A2 and A3 provide the validation-set results for all the models specified in Tables 8 and 16. We observe similar results on the validation and test sets for both datasets.

Fig. A11.

Randomly selected inputs and labels from the dataset. The SoilGrids image is a false-color RGB image generated from the [sand, silt, clay] bands mapped to R, G, B, respectively. The labels are in volumetric (m3 m−3) units.


Table A2.

Validation results for our models on the Sentinel-1 anchored hard dataset. The number in brackets denotes the percentage change with respect to the SMAP + GLDAS NN baseline. All results presented are in volumetric (m3 m−3) units. A positive change in percentage denotes an increase in Pearson correlation and decrease in ubRMSE/RMSE. The validation split consists of 8794 data points. Bold indicates results from the best baseline method, our best method, and our method that does not use any of the baseline inputs.

Table A3.

Validation results for our models on the Sentinel-1 anchored full dataset. The number in brackets denotes the percentage change with respect to the SMAP + GLDAS NN baseline. All results presented are in volumetric (m3 m−3) units. A positive change in percentage denotes an increase in Pearson correlation and decrease in ubRMSE/RMSE. The validation split consists of 22 543 data points. Bold indicates results from the best baseline method, our best method, and our method that does not use any of the baseline inputs.
