## 1. Introduction

Mutual differences between various soil moisture (SM) modeling or analysis products are notoriously large (Koster et al. 2009; Qiu et al. 2014). As a result, it is generally inadvisable to cross-apply a given model SM product, based on a single land surface model (LSM), to initialize, calibrate, or evaluate a second LSM. This lack of transferability is particularly pressing for SM analyses generated by a land data assimilation system (LDAS). In their most common form, such systems assimilate satellite-based, L-band brightness temperature (Tb) observations, or surface SM retrievals based on Tb observations, into LSMs to mitigate the impact of random modeling errors. SM analyses derived from such systems are now operationally available with a medium product latency (i.e., 3–5 days) that is potentially suitable for a range of water resource, agricultural monitoring, and hydrological forecasting applications. Evaluation of these analyses reveals that SM (or Tb) data assimilation has a significant positive impact on agricultural drought monitoring (Bolten and Crow 2012; Mladenova et al. 2019) and the initialization of SM states for rainfall–runoff modeling (Crow et al. 2017).

The climatological characteristics of LDAS SM products are determined to a large degree by the selection of a single LSM to serve as their modeling core (Koster et al. 2009). Since existing drought monitoring and hydrologic forecasting applications have been developed around a variety of LSMs, there exists a transferability problem for LDAS SM analyses that limits their broader applicability. That is, a SM analysis from an existing operational LDAS cannot be directly applied to replace or enhance SM estimates generated by a second, application-specific LSM (in, e.g., agricultural drought monitoring) or to initialize an application-specific LSM (in, e.g., hydrological forecasting). At the same time, the resources and expertise required to develop and run a medium-latency, operational LDAS make it impractical to support a dedicated LDAS for every possible end-use application.

What is required instead is a set of tools capable of transferring multilayer SM estimates derived from a particular LSM (or LDAS) into a second LSM (or LDAS) suitable for a particular application. Once developed, these tools would allow any application, regardless of the LSM it employs or has been designed around, to leverage a single, centralized source of operational SM products.

The appropriate form of such tools depends on the nature of inter-LSM SM differences. A subset of these differences is attributable to *random* error in the surface meteorological forcing data required to drive an offline LSM or LDAS. When comparing SM estimates acquired from multiple LSMs employing different forcing datasets, these errors will manifest as random SM differences (i.e., statistically independent of past and current SM states). In addition, LDAS products will exhibit random mutual SM differences depending on the exact set of land observations they assimilate.

Other sources of SM differences are *systematic* in nature and therefore statistically connected to current and past LSM states. Available large-scale SM products generally rely on LSM SM estimates to define their climatological characteristics. However, such characteristics are notoriously model dependent (Koster et al. 2009; Dong et al. 2020). In addition, the systematic characteristics of SM products are linked to the specific vertical discretization applied to define SM states. Finally, systematic differences in meteorological forcing data will also result in systematic SM differences between LSMs.

A relatively large literature exists describing efforts to filter random SM error via the application of data fusion or assimilation techniques (e.g., Afshar et al. 2019; Gruber et al. 2017). Less guidance is available for systematic errors. Despite the ubiquity of systematic errors in LSM SM products, existing tools for intermodel SM transformations generally make the simplistic assumption that temporal percentiles of SM estimates generated by two separate LSMs are equal across time or, equivalently, that different SM products can be transformed into standardized anomalies to facilitate intermodel transfer. Such percentile matching (PM) approaches address only a subset of potential systematic errors affecting model-based SM products and have not been appropriately critiqued. Here, we will explore the development of a broader set of regression tools and compare the out-of-sample performance of such tools to baseline transformation results using only a direct PM approach.

The underlying motivation for this work is to support the establishment of a centralized source of operational SM information that, once corrected for inter-LSM systematic errors, can be leveraged for a wider variety of SM applications. Given that our emphasis here is on systematic SM differences, the implicit assumption is that such information will be provided by an operational LDAS source that (i) employs a near-optimal set of available forcing data (given input data latency requirements), (ii) assimilates the best-available observational sources of SM, or SM proxy, information, and (iii) applies a high-quality data assimilation method. If these assumptions hold, such a centralized LDAS source will already minimize the impact of random errors on its SM analysis. Consequently, systematic errors in the operational SM estimates will be of primary concern.

An obvious candidate for a centralized LDAS SM source is the NASA Soil Moisture Active Passive (SMAP) Level-4 Soil Moisture (L4_SM) product (Reichle et al. 2019). The L4_SM product is based on the sequential assimilation of SMAP L-band microwave brightness temperature observations into a continuous integration of the Catchment Land Surface Model (CLSM) to produce a multilayer, global, 9-km surface and root-zone SM product with a latency of ~3 days. Extensive validation evidence suggests that the L4_SM product is significantly more precise than competing global SM products (Dong et al. 2019); however, its climatological SM characteristics are defined by the CLSM and therefore different from those of many potential operational SM applications.

While our ultimate focus is on an operational LDAS system like the L4_SM product, we will start our discussion with an examination of LSM-to-LSM transformations before moving on to LDAS-to-LDAS conversions. To this end, section 2 provides background on our various regression-based transformation approaches, and section 3 describes our evaluation approach. Results are presented in section 4 and summarized in section 5.

## 2. Background

### a. Percentile mapping

Let *X* and *Y* be time series of SM estimates generated by two separate LSMs (or LDAS) for a specific soil layer with known marginal distributions. Hereinafter, *Y* denotes SM estimates obtained from a centralized operational source (e.g., L4_SM), and *X* denotes equivalent SM estimates produced by a target application LSM—which are assumed to be known in retrospect but are unavailable in an operational setting. The LSMs applied to generate *X* and *Y* will be referred to as “LSM X” and “LSM Y,” respectively.

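Let *F*_{X} be the cumulative distribution function that describes the probability of *X* ≤ *x* as

$$F_X(x) = P(X \le x), \tag{1}$$

with *F*_{Y} defined analogously for *Y*. The PM approach assumes that the percentile ranks of the two products are equal,

$$F_X(X_t) = F_Y(Y_t), \tag{2}$$

at any time *t*. Once defined, (2) can be applied to transform operational SM estimates obtained from (the centralized operational source) LSM Y into equivalent estimates appropriate for (the target application based on) LSM X.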
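As a minimal sketch of the PM transform in (2), the snippet below estimates both distributions from a retrospective training sample and maps an operational LSM Y value into LSM X units by equating percentiles. The function names, the use of a simple empirical CDF, and the synthetic data are illustrative assumptions rather than the implementation used in this study.

```python
import numpy as np

def empirical_cdf(train):
    """Return a function mapping SM values to percentile ranks in [0, 1],
    based on the sorted training sample (a simple empirical CDF)."""
    sorted_train = np.sort(np.asarray(train, dtype=float))
    n = sorted_train.size

    def cdf(values):
        # Fraction of training values <= each query value.
        return np.searchsorted(sorted_train, values, side="right") / n

    return cdf

def percentile_match(y_operational, y_train, x_train):
    """Map LSM Y values to LSM X units by equating percentiles, as in (2)."""
    f_y = empirical_cdf(y_train)
    p = f_y(y_operational)                      # F_Y(Y_t)
    # Invert F_X using the empirical quantile function of the X training sample.
    return np.quantile(np.asarray(x_train, dtype=float), np.clip(p, 0.0, 1.0))

# Synthetic (hypothetical) training data with a one-to-one X-Y relationship.
rng = np.random.default_rng(0)
y_train = rng.uniform(0.10, 0.40, size=1000)
x_train = 0.05 + 1.5 * y_train
x_hat = percentile_match(np.array([0.25]), y_train, x_train)
```

Because the synthetic relationship is one-to-one, PM recovers the LSM X value almost exactly here; the sections that follow concern cases where that assumption fails.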

### b. Static regression fitting

The simplicity of (2) is appealing; however, it holds only if there is a one-to-one relationship between *X* and *Y* (i.e., each *X* can be mapped to a unique value of *Y* and vice versa). This is generally an oversimplification for SM products generated by a pair of LSMs. For example, if LSM Y is impacted by unique random rainfall error, then *X* and *Y* will have a probabilistic relationship. Therefore, the relationship between *F*_{X}(*X*_{t}) and *F*_{Y}(*Y*_{t}) must instead be described using a copula approach to account for the fact that knowledge of *X*_{t} does not fix *Y*_{t}—rather it conditions a probability density function for *Y*_{t}. Based on this reasoning, copulas are commonly applied to the analysis and transformation of SM time series (e.g., Gao et al. 2007; Verhoest et al. 2015).

A second possibility, examined here, is that *F*_{X} and *F*_{Y} have a deterministic relationship that is not one-to-one. For example, SM time series generated by a pair of LSMs do not generally contain the same degree of temporal memory. Even in a scenario where two LSMs ostensibly represent SM within a matching range of vertical soil depths (e.g., a 0–10-cm “surface” or a 30–100-cm “root-zone” layer), they are still likely to reflect different SM temporal memory characteristics (Qiu et al. 2014; Raoult et al. 2018). Such differences introduce hysteresis (i.e., path dependency) into the relationship between *F*_{X} and *F*_{Y}.

Figure 1 shows an example of this for a 5-yr comparison of daily SM time series estimates acquired from CLSM and the Noah LSM for an arbitrary 1° grid cell located in North America (centered at 50.5°N and 119.5°W). Plotted CLSM and Noah SM values in Fig. 1a correspond to average SM conditions within a 0–100-cm soil layer and are expressed as equivalent percentile ranks [i.e., *F*_{X}(*X*_{t})]. Despite being forced by the same time-varying meteorological data and having the same vertical support, the CLSM and Noah percentile SM time series do not track perfectly in time. Rather, Noah has less high-frequency temporal variability and is slightly lagged in time relative to CLSM. As a result, percentile comparisons of CLSM and Noah SM estimates exhibit a degree of path dependency (Fig. 1b). For the grid cell investigated here, such dependency is primarily seasonal in nature. For any given CLSM SM percentile, the equivalent Noah SM percentile depends on whether true SM is rising during the cold season (October–March) or declining during the growing season (April–September). As a result, CLSM and Noah SM percentiles are not equal at any fixed point in time, and error will be incurred when applying (2) to estimate Noah SM percentiles directly from CLSM SM percentiles.

As a first generalization of (2), we introduce a regression operator **β**:

$$F_X(X_t) = \boldsymbol{\beta}\left[\mathbf{F}_Y(\mathbf{Y}_t)\right], \tag{3}$$

where **Y**_{t} is a state vector containing all multilayer SM estimates produced at time *t* by LSM Y. Likewise, **F**_{Y} is now a function that converts a vector of SM states into an equivalent vector of SM state percentiles. Note that (3) must be applied separately to each layered SM state estimate provided by LSM X. Like (2), (3) is a static transformation that considers only LSM Y SM estimates obtained at a fixed point in time. However, unlike (2), (3) allows for the weighted averaging of percentile information obtained from multiple layered SM states.
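The static transform in (3) can be sketched as an ordinary least squares fit from the LSM Y percentile vector to a single LSM X percentile. The intercept term, the clipping of estimates to [0, 1], and all names below are assumptions made for illustration.

```python
import numpy as np

def fit_static_beta(fy_train, fx_train):
    """Least-squares fit of F_X(X_t) ~ beta[F_Y(Y_t)].

    fy_train: (T, K) percentiles of K LSM Y layers.
    fx_train: (T,) percentiles of one LSM X layer.
    An intercept column is appended (an assumption, not stated in the text).
    """
    design = np.column_stack([fy_train, np.ones(len(fy_train))])
    coef, *_ = np.linalg.lstsq(design, fx_train, rcond=None)
    return coef

def apply_static_beta(coef, fy):
    """Apply fitted coefficients and keep estimates in the percentile range."""
    design = np.column_stack([fy, np.ones(len(fy))])
    return np.clip(design @ coef, 0.0, 1.0)

# Synthetic check: LSM X percentiles are a fixed blend of three LSM Y layers.
rng = np.random.default_rng(1)
fy = rng.uniform(size=(500, 3))
fx = 0.2 * fy[:, 0] + 0.7 * fy[:, 1] + 0.1 * fy[:, 2]
coef = fit_static_beta(fy, fx)
fx_hat = apply_static_beta(coef, fy)
```

In this toy case the fitted weights reproduce the blend used to generate the data, which is the sense in which (3) permits weighted averaging across layers.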

### c. Lagged regression fitting

While (3) considers only SM estimates available at time *t*, Fig. 1 suggests that lagged variables should be applied as predictors to resolve temporal trends (e.g., cold-season wetting or warm-season drying cycles) that underlie the path dependency seen in Fig. 1. Along these lines, we will attempt to improve upon the accuracy of (3) by also considering SM estimates made by LSM Y at fixed points in the recent past:

$$F_X(X_t) = \boldsymbol{\beta}\left[\mathbf{F}_Y(\mathbf{Y}_{t-L_1}), \mathbf{F}_Y(\mathbf{Y}_{t-L_2}), \ldots, \mathbf{F}_Y(\mathbf{Y}_{t-L_N})\right], \tag{4}$$

where *L*_{1}, *L*_{2}, …, *L*_{N} is a set of *N* nonnegative integer temporal lags, and **F**_{Y}(**Y**_{t−L_i}) is the vector of SM state percentiles produced at time *t* − *L*_{i} by LSM Y. Note that only nonnegative time lags are considered to ensure that (4) is applicable in an operational setting. The fundamental difference between (3) and (4) is that, in addition to SM values obtained in the same layer and at the same time, as in (3), (4) leverages SM in *all* layers and across a fixed set of time lags, thereby providing the additional predictors necessary to further reduce the path dependency between *F*_{X}(*X*_{t}) and *F*_{Y}(*Y*_{t}) illustrated in Fig. 1b.
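The predictor set in (4) can be assembled by stacking lagged copies of the LSM Y percentile vector. The sketch below uses the lag schedule *L*_{i} = (*i* − 1)^{2} days with *N* = 13 given in section 3; the helper name and the choice to drop early rows that lack a full set of lagged predictors are assumptions.

```python
import numpy as np

def lagged_design(fy, lags):
    """Stack F_Y(Y_{t-L_i}) for every lag L_i into one predictor matrix.

    fy: (T, K) percentile series for K layers.
    Rows earlier than max(lags) are dropped because their lagged
    predictors are unavailable. Returns (design, first_valid_index).
    """
    lags = np.asarray(lags, dtype=int)
    if np.any(lags < 0):
        raise ValueError("only nonnegative lags are usable operationally")
    start = int(lags.max())
    n_times = fy.shape[0]
    blocks = [fy[start - lag : n_times - lag] for lag in lags]
    return np.hstack(blocks), start

lags = [(i - 1) ** 2 for i in range(1, 14)]        # 0, 1, 4, ..., 144 days
fy = np.arange(400, dtype=float).reshape(200, 2)   # toy two-layer series
design, start = lagged_design(fy, lags)
```

The resulting matrix can be passed to the same least squares fit used for the static case, with the target series trimmed to begin at `start`.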

### d. Lagged regression fitting of anomalies

An important simplification in (4) is the neglect of seasonality. SM seasonality represents the most important nonstationary component of a SM time series, and scaling relationships between SM product pairs are often different for seasonal climatology versus subseasonal anomaly SM time series components (Su and Ryu 2015). For this reason, we will also explore the decomposition of *F*_{Y}(*Y*_{t}) into its seasonal climatology and anomaly components *prior* to the application of (4). This seasonal climatology will reflect the mean of all SM values (sampled across all years of the historical SM dataset for a single soil layer) falling within a 31-day moving window centered on a particular day-of-year. Anomalies are then the difference between this climatology and the actual SM value at time *t*.

We then apply the **β** regression operator to these anomalies:

$$\hat{F}_X(X_t) = \boldsymbol{\beta}\left[\hat{\mathbf{F}}_Y(\mathbf{Y}_{t-L_1}), \hat{\mathbf{F}}_Y(\mathbf{Y}_{t-L_2}), \ldots, \hat{\mathbf{F}}_Y(\mathbf{Y}_{t-L_N})\right], \tag{5}$$

where $\hat{F}_X(X_t)$ is the anomaly of *F*_{X}(*X*_{t}) at time *t*, and $\hat{\mathbf{F}}_Y(\mathbf{Y}_{t-L_N})$ is the anomaly of **F**_{Y}(**Y**) relative to a seasonal expectation at time *t* − *L*_{N}. Note that the overall order of processing represented in (5) is as follows: (i) transform *Y* into percentile ranks, (ii) remove the seasonal cycle from these ranks, (iii) fit **β** to the resulting rank anomalies, and (iv) add the seasonal cycle of *F*_{X}(*X*_{t}) to the transformed anomalies. Evaluation of results from (5) will therefore test the hypothesis that decomposition into anomalies prior to transformation improves our ability to fit **β** and recover *F*_{X}(*X*_{t}). For simplicity, we will generally drop the overhat notation applied in (5) when representing climatological anomalies below. Nevertheless, unless otherwise stated, **β** will be fit to percentile anomalies following the removal of a fixed seasonal cycle.

We can now define our four separate transform cases: (i) percentile matching (PM) based on (2), (ii) static (i.e., no time lag) fitting (SF) based on (3), (iii) lagged fitting (LF) based on (4), and (iv) lagged fitting of anomalies (LFA) based on (5).
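The seasonal decomposition used in the LFA case can be sketched as follows: a climatology is formed from the mean of all values falling within a 31-day window centered on each day-of-year, and anomalies are departures from it. The 365-day calendar (leap days assumed removed beforehand), the wraparound at year end, and the function names are assumptions for illustration.

```python
import numpy as np

def seasonal_climatology(values, doy, window=31, n_days=365):
    """Mean of all values whose day-of-year falls in a centered moving window.

    values: (T,) series; doy: (T,) integers in 1..n_days.
    Leap days are assumed to have been removed beforehand.
    Returns an array of length n_days (index 0 is day-of-year 1).
    """
    sums = np.bincount(doy - 1, weights=values, minlength=n_days)
    counts = np.bincount(doy - 1, minlength=n_days)
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Window indices wrap around the end of the year.
    idx = (np.arange(n_days)[:, None] + offsets[None, :]) % n_days
    return sums[idx].sum(axis=1) / counts[idx].sum(axis=1)

def to_anomaly(values, doy, clim):
    """Subtract the day-of-year climatology from each value."""
    return values - clim[doy - 1]

# Toy example: 10 years of a purely seasonal percentile series.
doy = np.tile(np.arange(1, 366), 10)
series = 0.5 + 0.3 * np.sin(2 * np.pi * doy / 365.0)
clim = seasonal_climatology(series, doy)
anom = to_anomaly(series, doy, clim)
```

For a purely seasonal input the anomalies are close to zero, which is the expected behavior; the seasonal cycle of the target series is added back after the regression step, as in step (iv) above.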

## 3. Approach

We applied the four transform cases (i.e., PM, SF, LF, and LFA) identified above to map SM estimates between pairs of LSM simulations and pairs of LDAS SM analyses. LSM results are based on global simulations generated by phase 2 of the Global Land Data Assimilation System (GLDAS-2) project (see section 3a below) without any data assimilation. LDAS results are based on both a synthetic twin data assimilation experiment [see section 3b(1) below] and the SMAP L4_SM data assimilation analysis. In particular, the L4_SM analysis is compared with corresponding Nature Run (NR) SM estimates based on the same forcing data and LSM but with no ensemble generation or data assimilation [see section 3b(2) below].

Note that all four transform cases, including the baseline PM case, require access to retrospective data from both LSM X and LSM Y to empirically determine the percentile transformations *F*_{X}(*X*) and *F*_{Y}(*Y*) (or **F**_{Y}). The SF, LF, and LFA cases additionally require access to historical data to fit **β** and, for the LFA case, to determine the mean seasonal cycles. Accordingly, **F**_{Y}(**Y**), **β**, and the mean seasonal cycles were all determined from training-period data and then applied out-of-sample during a separate testing period.

To minimize the impact of differences in marginal LSM SM distributions and facilitate direct comparison to the PM baseline, all evaluation during the testing period was performed in percentile space. Note that *F*_{X}(*X*_{t}) is used solely for evaluation purposes and appears only on the left-hand side of Eqs. (2)–(5). Therefore, among all terms in (2)–(5), *F*_{X}(*X*_{t}) alone was calculated via direct ranking of SM time series results during the testing period. We then sampled the root-mean-square error (RMSE) between these actual *F*_{X}(*X*_{t}) values and approximations of *F*_{X}(*X*_{t}) provided by the PM, SF, LF, and LFA transform cases.
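The percentile-space evaluation described above can be sketched as direct ranking of the testing-period LSM X series followed by an RMSE against a transform's output. The tie-breaking rule and the example transform output below are hypothetical.

```python
import numpy as np

def percentile_rank(series):
    """Percentile rank in (0, 1] of each value within its own series.
    Ties are broken by order of appearance (a simplification)."""
    series = np.asarray(series, dtype=float)
    order = np.argsort(series, kind="stable")
    ranks = np.empty(series.size, dtype=float)
    ranks[order] = np.arange(1, series.size + 1)
    return ranks / series.size

def percentile_rmse(fx_true, fx_estimate):
    """RMSE between actual and approximated percentile series."""
    diff = np.asarray(fx_true) - np.asarray(fx_estimate)
    return float(np.sqrt(np.mean(diff ** 2)))

x_test = np.array([0.30, 0.10, 0.20, 0.40])   # testing-period LSM X values
fx_true = percentile_rank(x_test)             # direct ranking
fx_est = np.array([0.75, 0.25, 0.75, 1.00])   # hypothetical transform output
score = percentile_rmse(fx_true, fx_est)
```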

Unless otherwise noted, all transformation results were based on (i) mutually exclusive 10-yr training and testing periods; (ii) fitting of *F*_{Y}(*Y*) (or **F**_{Y}) for the PM, SF, LF, and LFA cases using a fifth-order polynomial; (iii) fitting of **β** for the SF, LF, and LFA cases using multivariate linear regression (MVLR) with *L*_{i} = (*i* − 1)^{2} (days), where *i* = [1, 2, …, *N*] and *N* = 13; and (iv) sampling of

### a. Transformation of LSM SM estimates

LSM-to-LSM transformation results are based on an assumed scenario where CLSM SM estimates (i.e., *Y*) are operationally available at medium data latency (i.e., within 3–5 days of real time) and transformed to approximate (operationally unavailable) estimates generated by a second LSM (i.e., *X*) for a particular application. Both the Variable Infiltration Capacity (VIC-4.1.2; Liang et al. 1994) and Noah-3.6 (Ek et al. 2003) LSMs were considered for LSM X. The VIC-4.1.2 (Beaudoing and Rodell 2020a) and Noah-3.6 (Beaudoing and Rodell 2020b) SM products were obtained from version 2.1 GLDAS-2 (Rodell et al. 2004) simulations between 1 January 2000 and 31 December 2019. Both simulations were initialized with their respective climatological SM states on 1 January 1948. All GLDAS-2 SM products were acquired at 1° spatial resolution and averaged into daily (0000–2400 UTC) values from their native 3-h temporal resolution.

VIC, Noah, and CLSM all take different approaches to defining individual SM states. Both VIC and Noah use the common approach of defining nonoverlapping soil layers between predefined vertical depths. For Noah GLDAS-2 simulations, four separate SM layers are defined with the same vertical layer boundaries everywhere (0–10, 10–40, 40–100, and 100–200 cm beneath the surface). The representation of SM states in VIC is based on three separate layers with horizontally variable thicknesses *D*_{VIC,2} and *D*_{VIC,3} for the second and third soil layers. Specifically, the VIC SM layers are from 0 to 30 cm, from 30 cm to 30 + *D*_{VIC,2} (cm), and from 30 + *D*_{VIC,2} (cm) to 30 + *D*_{VIC,2} + *D*_{VIC,3} (cm) beneath the surface. See appendix A for global maps of *D*_{VIC,2} and *D*_{VIC,3}.

In contrast, CLSM calculates soil water deficit and excess values, with subsurface moisture dynamics governed by soil-texture-based parameters and the statistical properties of the surface topography within each model grid cell (or computational element) (Koster et al. 2000; Ducharne et al. 2000). Volumetric soil moisture is then diagnosed from the excess and deficit variables for three nested (i.e., vertically overlapping) layers, which yields estimates of surface (0–5 cm), root-zone (0–100 cm), and profile (from 0 cm to bedrock) SM. Note that CLSM depth to bedrock values vary horizontally.

Application of the baseline PM transform case requires the definition of “equivalent” SM state variables between each potential LSM pair. To compute VIC (or Noah) equivalents to CLSM root-zone and profile SM estimates, simple weighted averaging based on the relative depths of the contributing VIC (or Noah) layers was applied (see Table 1). In addition, since the crux of our analysis is measuring the relative performance of transform cases versus the PM baseline, such multilayer averaging was also applied to the VIC and Noah training and testing samples used as the basis of the SF, LF, and LFA regression-based cases. This standardization ensures that the VIC and Noah SM transformation results presented here are associated with a standard set of SM states for all four transform cases.

Table 1. Vertical layer boundaries of CLSM-equivalent surface, root-zone, and profile SM states defined for the VIC and Noah LSMs. For VIC and Noah root-zone and profile SM, we applied simple weighted averaging based on the relative thicknesses of the contributing layers.
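The thickness-weighted averaging used to build CLSM-equivalent states can be sketched as below. The Noah layer boundaries are taken from the text; weighting each layer by its overlap with a 0–100-cm target interval is an assumption consistent with the description, and the SM values are invented for illustration.

```python
import numpy as np

def layer_weighted_mean(sm_layers, top_cm, bottom_cm, target_top_cm, target_bottom_cm):
    """Thickness-weighted mean SM over a target depth interval.

    sm_layers: (T, K) volumetric SM for K layers.
    top_cm, bottom_cm: (K,) layer boundaries (cm below the surface).
    Each layer is weighted by its overlap with the target interval.
    """
    top = np.asarray(top_cm, dtype=float)
    bottom = np.asarray(bottom_cm, dtype=float)
    overlap = np.clip(
        np.minimum(bottom, target_bottom_cm) - np.maximum(top, target_top_cm),
        0.0,
        None,
    )
    if overlap.sum() == 0:
        raise ValueError("target interval does not overlap any layer")
    return np.asarray(sm_layers, dtype=float) @ (overlap / overlap.sum())

# Noah layers: 0-10, 10-40, 40-100, and 100-200 cm.
noah_top = [0, 10, 40, 100]
noah_bottom = [10, 40, 100, 200]
sm = np.array([[0.30, 0.25, 0.20, 0.15]])   # hypothetical SM values
root_zone = layer_weighted_mean(sm, noah_top, noah_bottom, 0, 100)
```

Here the three upper Noah layers receive weights of 0.1, 0.3, and 0.6, and the deepest layer is excluded.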

The application of (2)–(5) was based on the division of the entire GLDAS-2 retrospective time series into two mutually exclusive 10-yr periods. Unless otherwise noted, the 10-yr period from 1 January 2000 to 31 December 2009 was used for training and the 10-yr period from 1 January 2010 to 31 December 2019 for testing. All GLDAS-2 LSM results were based on land areas between 60°N and 60°S excluding 1° grid cells that lack adequate temporal SM variability, defined here as a temporal coefficient of variation below 0.075 for any of the three collocated surface SM estimates generated by the CLSM, VIC, and Noah LSMs.
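The variability screen described above can be sketched as a mask on the temporal coefficient of variation (standard deviation divided by mean) of each model's surface SM. The array layout and the synthetic series are assumptions.

```python
import numpy as np

def low_variability_mask(surface_sm_by_model, threshold=0.075):
    """True where any model's surface SM has a temporal CV below the threshold.

    surface_sm_by_model: (M, T, G) array for M models, T days, G grid cells.
    """
    mean = surface_sm_by_model.mean(axis=1)
    std = surface_sm_by_model.std(axis=1)
    cv = std / mean
    return (cv < threshold).any(axis=0)

# Two hypothetical grid cells: one variable, one nearly constant.
t = np.arange(100)
variable = 0.25 + 0.10 * np.sin(t / 5.0)
flat = 0.25 + 0.001 * np.sin(t / 5.0)
one_model = np.stack([variable, flat], axis=1)       # (T, G)
all_models = np.stack([one_model] * 3, axis=0)       # (M, T, G), three models
mask = low_variability_mask(all_models)
```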

### b. Transformation of LDAS SM analysis

We also examined the LDAS-to-LDAS transformation of SM analyses generated via the assimilation of satellite-derived surface SM retrievals (or equivalent SM proxy information) into an LSM. Assuming that *Ẋ*_{t} and *Ẏ*_{t} are SM time series estimates at time *t* produced by LDAS constructed around two separate LSMs (where overdots denote analysis state estimates generated by an LDAS), our goal is to measure the relative effectiveness of **β** when applied to transform **F**_{Ẏ}(**Ẏ**_{t}), available from an existing, medium-latency, operational LDAS based on LSM Y, into a *F*_{Ẋ}(*Ẋ*_{t}) analysis approximating state estimates that *would have been* generated by a second hypothetical LDAS analysis based on LSM X. Note that, following notation developed in (3)–(5), we will use bold font to express that **F**_{Ẏ}(**Ẏ**_{t}) represents a vector of SM percentiles for each SM state in a multilayer LSM, whereas *F*_{Ẋ}(*Ẋ*_{t}) represents the rank of a single layered SM state of interest. For convenience, we will drop the *t* subscript below.

The SM analyses captured in **F**_{Ẏ}(**Ẏ**) and *F*_{Ẋ}(*Ẋ*) are assumed to be forced by a common set of meteorological forcing data and assimilate the same observations. Therefore, their mutual differences stem solely from systematic differences between their two respective LSMs. In all results below, **F**_{Ẏ}(**Ẏ**) is assumed to be freely available (both operationally and historically).

An obvious consideration is how to fit the operator **β** applied in (3)–(5). To start, we assume that a (multilayer) *F*_{Ẋ}(*Ẋ*) LDAS analysis is historically, but not operationally, available. This allows us to fit **β** using historical *F*_{Ẋ}(*Ẋ*) and **F**_{Ẏ}(**Ẏ**) products and subsequently apply the resulting **β** operator to an operational **F**_{Ẏ}(**Ẏ**) analysis to obtain a corresponding operational estimate of *F*_{Ẋ}(*Ẋ*_{T}). Hereinafter, this will be referred to as the LDAS fit case.

However, *F*_{Ẋ}(*Ẋ*) may not be available either operationally or historically. In this case, our only recourse is to fit **β** using retrospective (model-only) LSM results that are not based on data assimilation. That is, fit **β** so that **β**[**F**_{Y}(**Y**)] approximates *F*_{X}(*X*) and then cross-apply the same **β** to **F**_{Ẏ}(**Ẏ**) to approximate the LDAS SM analysis *F*_{Ẋ}(*Ẋ*). This assumes that, if **β**[**F**_{Y}(**Y**)] approximates *F*_{X}(*X*), then **β**[**F**_{Ẏ}(**Ẏ**)] will also approximate *F*_{Ẋ}(*Ẋ*). Hereinafter, this case is referred to as the LSM/LDAS fit case since the transformation is *fit* to LSM SM products but then subsequently *applied* to an LDAS analysis. As in the LSM fit case discussed earlier, fitting was applied out-of-sample using data from a separate training period.

Objective evaluation of the pairwise LDAS and LSM/LDAS fit cases is challenging. It is difficult to find two existing LDAS products based on different LSMs that assimilate the same set of observations and are forced with a consistent set of meteorological data. As discussed above, such consistency is needed to ensure that regression results reflect only systematic LSM differences and not analogous differences between assimilated observations or the meteorological data used to force the LDAS. In addition, there is the challenge of simultaneously optimizing two parallel, but distinct, LDAS algorithms to eliminate systematic error associated with a suboptimal data assimilation implementation.

To avoid these complications, the LDAS and LSM/LDAS fit cases were evaluated in two distinct scenarios: (i) a synthetic twin data assimilation experiment (assuming a simplified linear form for both LSMs) and (ii) the comparison of L4_SM estimates with equivalent NR estimates generated using the same LSM as in the L4_SM algorithm but without the assimilation of SMAP data. The remainder of this section details these two approaches.

#### 1) Synthetic twin data assimilation analysis

In the simplified linear LSM assumed here, *A*, *B*, and *C* are nonpositive real numbers (mm day^{−1}); *P* is daily rainfall (mm day^{−1}); *D*_{t} is the time step (day); and *D*_{1} and *D*_{2} are the thicknesses (mm) of the surface and second soil layers, respectively. Note that *X* is expressed in unitless saturation fractions. In terms of approximate process representation, *A* characterizes the magnitude of vertical drainage from the top layer into the second layer, *B* captures transpiration extracted from the second layer, and *C* reflects vertical diffusive exchange between the two layers.

The model state vector for LSM X is **X**_{t} = [*X*_{1,t} *X*_{2,t}]^{T}, and the corresponding linear model operator is denoted *φ*_{X}.

A second model operator (*φ*_{Y}) with identical structure, but different parameters, was assumed to characterize LSM Y. In addition, a single *P*_{t} time series realization (applied to both LSM X and Y) was synthetically generated assuming precipitation occurs randomly on 10% of days, and daily precipitation accumulations follow a uniform [0, 70] (mm) distribution.
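The synthetic rainfall forcing described above (rain on a random 10% of days, with daily accumulations drawn from a uniform [0, 70] mm distribution) can be sketched as follows. The function name and seed are arbitrary choices.

```python
import numpy as np

def synthetic_rainfall(n_days, wet_fraction=0.10, max_depth_mm=70.0, seed=0):
    """Daily rainfall (mm) with random wet days and uniform wet-day depths."""
    rng = np.random.default_rng(seed)
    wet = rng.random(n_days) < wet_fraction           # which days receive rain
    depth = rng.uniform(0.0, max_depth_mm, size=n_days)
    return np.where(wet, depth, 0.0)

p = synthetic_rainfall(10_000)
wet_day_fraction = float((p > 0).mean())
```

The same realization would be supplied to both LSM X and LSM Y so that differences between the two models are not caused by forcing.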

Using randomized values of *A*, *B*, *C*, and *D*_{2} selected within physically realistic ranges (see Table B1 in appendix B), we generated a large number (50 000) of independent *φ*_{X} and *φ*_{Y} pairs. Assuming a soil porosity of 0.50, the parameter *D*_{1} was set to 25 mm in all cases to reflect the maximum water depth within the (approximate) 5-cm vertical measurement support of L-band Tb observations.

For each *φ*_{X} and *φ*_{Y} pair, we then conducted a pair of synthetic twin data assimilation experiments (i.e., assimilate synthetically perturbed observations generated from LSM X back into a degraded version of LSM X and, analogously, synthetically perturbed observations from LSM Y back into LSM Y). Since (7) is linear, data assimilation was based on a standard Kalman filter (KF) using an assumed model error covariance matrix and an assumed observation error variance (*R*) of assimilated surface SM retrievals. For each experimental pair, true values of *R* were randomly selected and assumed to be known. Likewise, off-diagonal terms of

Each of these 50 000 paired synthetic twin experiments was based on (i) the generation of a 10 000-day true LSM SM percentile representation [i.e., *F*_{X}(*X*) and *F*_{Y}(*Y*)] using a pair (i.e., *φ*_{X} and *φ*_{Y}) of model operators, (ii) the generation of synthetic observations through degradation of surface SM estimates using the assumed observation error variance *R*, and (iii) the KF-based assimilation of the resulting synthetic observations back into their corresponding models, following forecast degradation consistent with the assumed model error covariance, to produce the analyses *F*_{Ẋ}(*Ẋ*) and *F*_{Ẏ}(*Ẏ*). See Table B1 for further parameterization details.
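A single synthetic twin experiment of this kind can be sketched with a standard Kalman filter for a two-layer linear model in which only the surface layer is observed. The transition matrix `phi`, the error covariances, the forcing, and the initial conditions below are all hypothetical placeholders and do not correspond to the parameter values in Table B1.

```python
import numpy as np

def kalman_filter(phi, forcing, obs, q, r, x0, p0):
    """Standard KF for x_{t+1} = phi @ x_t + forcing_t + w_t, observing state 0.

    phi: (2, 2) hypothetical linear model operator; forcing: (T, 2);
    obs: (T,) surface observations (NaN = missing); q: (2, 2) model error
    covariance; r: scalar observation error variance.
    """
    h = np.array([[1.0, 0.0]])                 # observe the surface layer only
    x, p = np.asarray(x0, float), np.asarray(p0, float)
    analysis = np.empty((len(obs), 2))
    for t in range(len(obs)):
        # Forecast step
        x = phi @ x + forcing[t]
        p = phi @ p @ phi.T + q
        # Update step (skipped when no observation is available)
        if not np.isnan(obs[t]):
            s = h @ p @ h.T + r                # innovation variance, shape (1, 1)
            k = p @ h.T / s                    # Kalman gain, shape (2, 1)
            x = x + (k * (obs[t] - h @ x)).ravel()
            p = (np.eye(2) - k @ h) @ p
        analysis[t] = x
    return analysis

rng = np.random.default_rng(42)
n_days = 2000
phi = np.array([[0.90, 0.02], [0.08, 0.97]])   # hypothetical operator
q = np.diag([1e-4, 1e-4])                      # hypothetical model error covariance
r = 1e-4                                       # hypothetical observation error variance
forcing = np.zeros((n_days, 2))
forcing[:, 0] = 0.01 * (rng.random(n_days) < 0.10)

# "Truth" includes random model error that the filter's forecast model lacks.
truth = np.empty((n_days, 2))
x = np.array([0.02, 0.05])
for t in range(n_days):
    x = phi @ x + forcing[t] + rng.multivariate_normal(np.zeros(2), q)
    truth[t] = x
obs = truth[:, 0] + rng.normal(0.0, np.sqrt(r), size=n_days)

x0, p0 = np.array([0.02, 0.05]), np.eye(2) * 1e-3
analysis = kalman_filter(phi, forcing, obs, q, r, x0, p0)
open_loop = kalman_filter(phi, forcing, np.full(n_days, np.nan), q, r, x0, p0)

rmse_kf = np.sqrt(np.mean((analysis[:, 0] - truth[:, 0]) ** 2))
rmse_ol = np.sqrt(np.mean((open_loop[:, 0] - truth[:, 0]) ** 2))
```

Comparing `rmse_kf` with `rmse_ol` indicates how much the assimilation reduces random error relative to a model-only run, which is the role the LDAS plays in the argument above.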

Following the completion of these paired synthetic twin experiments, we examined the ability of SM transformations to approximate the synthetic SM analysis generated for LSM X using the comparable analysis generated for LSM Y. As in the GLDAS-2 case above, all CLSM SM estimates were converted into their percentile equivalents [i.e., **F**_{Y}(**Y**) and **F**_{Ẏ}(**Ẏ**)] prior to transformation. For the LDAS fit case, we fit an operator **β** to minimize the percentile difference **β**[**F**_{Ẏ}(**Ẏ**)] − *F*_{Ẋ}(*Ẋ*) for each pair of retrospective SM DA analyses. Likewise, for the LSM/LDAS fit case, we fit a separate operator **β** to minimize the percentile difference **β**[**F**_{Y}(**Y**)] − *F*_{X}(*X*) for each pair of LSM SM estimates.

All fittings were based on the LFA transform case described above in (5), where *L*_{i} = (*i* − 1)^{2} (days) and *i* = [1, 2, …, 13]. Testing-period evaluation was based on the RMSE between the approximated *F*_{Ẋ}(*Ẋ*_{t}) and the true *F*_{Ẋ}(*Ẋ*_{t}) time series generated during the first step of the identical twin data assimilation experiment.

#### 2) SMAP L4 versus Nature Run comparisons

The linear model defined in (7) may not capture all potential systematic differences between **F**_{Y}(**Y**) and **F**_{Ẏ}(**Ẏ**) impacting the performance of the LSM/LDAS fit case. For instance, the propagation of mean-zero, additive error as part of the Monte Carlo forecasting of ensemble states during implementation of an ensemble Kalman filter (EnKF) can introduce bias when applied to moderately nonlinear LSMs (Ryu et al. 2009). Therefore, we will expand our evaluation of the LSM/LDAS case to also consider analyses generated by the EnKF-based L4_SM product.

As described above, the performance of the LSM/LDAS fit case hinges on the presence of systematic differences between LSM SM integrations (i.e., **X** and **Y**) and the corresponding SM analysis generated by an LDAS (i.e., **Ẋ** and **Ẏ**). Stated another way, if the differences **F**_{Ẏ}(**Ẏ**) − **F**_{Y}(**Y**) are purely random [i.e., uncorrelated with all current and past members of the state vector **F**_{Y}(**Y**)], then the same **β** operator that minimizes **β**[**F**_{Y}(**Y**)] − *F*_{X}(*X*) will also minimize **β**[**F**_{Ẏ}(**Ẏ**)] − *F*_{Ẋ}(*Ẋ*).

One direct way to test for such randomness is to fit **β** to minimize **β**[**F**_{Y}(**Y**)] − *F*_{Ẏ}(*Ẏ*) and determine if such fitting robustly outperforms a PM transformation [i.e., simply assuming *F*_{Y}(*Y*) = *F*_{Ẏ}(*Ẏ*)]. Out-of-sample improvement for a fitted **β**, versus simple PM, would suggest the presence of systematic differences between *F*_{Y}(*Y*) and *F*_{Ẏ}(*Ẏ*). Conversely, no detectable improvement versus a PM baseline implies that any **β** fit to minimize **β**[**F**_{X}(**X**)] − *F*_{Y}(*Y*) will also minimize **β**[**F**_{Ẋ}(**Ẋ**)] − *F*_{Ẏ}(*Ẏ*), which would support the implementation of the LSM/LDAS fit case described above.

Based on this reasoning, we evaluated the LSM/LDAS fit case using version 4 of the EnKF-based, 9-km SMAP L4_SM analysis for LDAS results (Reichle et al. 2018, 2019) and equivalent LSM results generated using the 9-km NR product (v7.2)—based on the same modeling system and surface meteorological forcing data as the L4_SM v4 analysis but without any data assimilation (Reichle et al. 2019).

All regression fitting was based on daily average (i.e., 0000–2400 UTC) L4_SM and NR results during the period 1 April 2015–31 March 2019. Fitting was then evaluated for the period 1 April 2019–31 March 2020. All 9-km grid cells lacking adequate temporal SM variability (i.e., a temporal coefficient of variation below 0.075 for either L4_SM or NR surface SM estimates) were masked. Testing was based on comparing percentile RMSE results when transforming L4_SM estimates to match their corresponding NR results using both PM and the out-of-sample fitting of **β** based on the LFA transform case and *L*_{i} = (*i* − 1)^{2} (days), where *i* = [1, 2, 3, …, 13].

## 4. Results

### a. GLDAS LSM results

The top rows of Figs. 2 and 3 describe the quasi-global (60°S–60°N) ability of the PM assumption to capture the relationship between root-zone (i.e., 0–100 cm) SM percentile estimates from CLSM and both VIC (Fig. 2) and Noah (Fig. 3). Results illustrate relatively large spatial variability in the quality of PM results for both LSM transform pairs (i.e., CLSM to Noah and CLSM to VIC). Key drivers of this variability include LSM-to-LSM differences in the degree of temporal memory (i.e., temporal autocorrelation) and seasonality present in root-zone SM percentile time series (Fig. 1).

Fig. 2. (left) Quasi-global maps and (right) histograms for percentile RMSE results obtained by transforming CLSM root-zone (0–100 cm) SM estimates to match equivalent VIC SM estimates (see Table 1). Results are shown for all four fit cases (i.e., PM, SF, LF, and LFA). Global median RMSE percentile values are indicated on histograms with red vertical lines. Land grid cells with little or no temporal SM variability (i.e., a temporal coefficient of variation less than 0.075 in CLSM, VIC, or Noah surface SM estimates) are masked.

Citation: Journal of Hydrometeorology 22, 10; 10.1175/JHM-D-21-0061.1


As in Fig. 2, but for the transformation of CLSM to Noah root-zone (0–100 cm) SM estimates.


In addition to PM results, Figs. 2 and 3 also examine the regression-based SF, LF, and LFA transform cases introduced in section 2. It should be stressed that a portion of the error associated with the PM case is due to the ambiguity of defining equivalent SM states in any given LSM pair. Note, for example, that the vertical equivalence between various root-zone SM states defined for each LSM in section 3a is only approximate (see Table 1). The SF case mitigates this ambiguity by fitting a transform function (**β**) that converts all three CLSM states into a single optimized representation of contemporary VIC and Noah SM estimates. As a result, this static transformation improves upon the PM baseline (see RMSE reductions between the first and second rows of Figs. 2 and 3).

Despite the relative success of the SF transform case versus the PM baseline, the path dependency in Fig. 1 suggests that the accurate recovery of current VIC and Noah percentile estimates requires access to both contemporary and past CLSM SM estimates. To address this sensitivity, the LF case additionally utilizes past CLSM states to improve upon the estimation of VIC and Noah SM percentiles. This expansion leads to an additional reduction in percentile RMSE relative to the earlier PM and SF transform cases. Specifically, time-lagged information allows **β** to consider whether CLSM SM estimates at a given point in time are located within a wetting or drying SM phase. Such phasing information helps to account for the path dependency observed in Fig. 1.

A final incremental decrease in RMSE is found for the LFA case of applying the LF transformation to CLSM SM percentile *anomalies* and then converting the resulting transformed anomalies back into the VIC or Noah SM climatology. This decomposition provides another net reduction of percentile RMSE and produces the best-available approximation of VIC and Noah SM states using CLSM (see the RMSE reduction between the third and fourth rows of Figs. 2 and 3). Such improvement suggests that, for any pair of LSMs, systematic differences exist in the mutual relationship between SM seasonal cycles versus the relationship between subseasonal SM anomalies. As a result, removing fixed CLSM seasonal characteristics, transforming the resulting CLSM anomalies and then adding the transformed anomalies onto the original VIC or Noah reference climatology is a superior regression approach.
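The anomaly-based procedure described above (remove the source seasonal cycle, apply a lagged linear regression to the anomalies, and add the target climatology back) can be sketched as follows. This is a simplified, single-predictor illustration under several assumptions that are not drawn from the study: a 365-day toy calendar, a day-of-year mean climatology, synthetic series treated as if they were already in percentile units, and hypothetical helper names. The study itself uses multiple CLSM states as predictors.

```python
import numpy as np

PERIOD = 365  # toy no-leap calendar (assumption)

def ar1(forcing, alpha):
    out = np.zeros(forcing.size)
    for t in range(1, forcing.size):
        out[t] = alpha * out[t - 1] + forcing[t]
    return out

def doy_climatology(x):
    """Mean for each day of year; x must span a whole number of years."""
    return x.reshape(-1, PERIOD).mean(axis=0)

def lagged_matrix(x, lags):
    """Columns hold x delayed by each lag; the first max(lags) rows are dropped."""
    start = max(lags)
    return np.column_stack([x[start - lag: x.size - lag] for lag in lags])

def fit_lfa(src, tgt, n_params):
    lags = [(i - 1) ** 2 for i in range(1, n_params + 1)]  # quadratic lag series
    src_clim, tgt_clim = doy_climatology(src), doy_climatology(tgt)
    n_years = src.size // PERIOD
    src_anom = src - np.tile(src_clim, n_years)
    tgt_anom = tgt - np.tile(tgt_clim, n_years)
    beta, *_ = np.linalg.lstsq(lagged_matrix(src_anom, lags),
                               tgt_anom[max(lags):], rcond=None)
    return beta, lags, src_clim, tgt_clim

def apply_lfa(src, beta, lags, src_clim, tgt_clim):
    n_years = src.size // PERIOD
    src_anom = src - np.tile(src_clim, n_years)
    pred_anom = lagged_matrix(src_anom, lags) @ beta
    # Add the transformed anomalies onto the target (reference) climatology.
    return pred_anom + np.tile(tgt_clim, n_years)[max(lags):]

# Synthetic source/target pair with different seasonality and memory.
rng = np.random.default_rng(1)
n_train, n_val = 4 * PERIOD, 2 * PERIOD
t = np.arange(n_train + n_val)
forcing = rng.standard_normal(t.size)
src = 50 + 20 * np.sin(2 * np.pi * t / PERIOD) + 2.0 * ar1(forcing, 0.80)
tgt = 50 + 10 * np.cos(2 * np.pi * t / PERIOD) + 1.0 * ar1(forcing, 0.95)

model = fit_lfa(src[:n_train], tgt[:n_train], n_params=5)
pred = apply_lfa(src[n_train:], *model)          # out-of-sample application
skip = max(model[1])
truth = tgt[n_train:][skip:]
rmse_lfa = float(np.sqrt(np.mean((pred - truth) ** 2)))
rmse_pm = float(np.sqrt(np.mean((src[n_train:][skip:] - truth) ** 2)))
```

Here `rmse_pm` treats the source series as a direct stand-in for the target, which is the spirit of the PM baseline, while `rmse_lfa` reflects the out-of-sample anomaly regression. The toy setup exaggerates the seasonal mismatch, so the size of the improvement should not be read as representative of the GLDAS results.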

Despite the overall tendency to improve upon the PM baseline, there are limited areas of North Africa and the Middle East where the original PM case is superior to our more complex regression cases—see Figs. 2 and 3. These areas tend to display low levels of intra-annual SM variability relative to longer-term (i.e., decadal-scale) SM variability. This makes it difficult to fit regression equations within our training period (1 January 2000–31 December 2009) that are subsequently robust during our validation period (1 January 2010–31 December 2019). However, it should be noted that such conditions apply within only 10%–15% of the quasi-global domain considered here.

Figure 4 summarizes overall incremental improvements found within surface, root-zone and profile SM states for both VIC and Noah across all four transform cases (i.e., PM, SF, LF, and LFA). (Note that Figs. 2 and 3 reflect only root-zone SM results.) Marginal improvements are approximately equal for each incremental case transition (i.e., PM to SF, SF to LF, and LF to LFA). However, there is a modest underlying tendency for surface and root-zone estimates to benefit relatively more from the application of the SF, LF, and LFA transform cases than profile SM estimates. Across all results, the median global RMSE reduction for the final LFA transformation case versus the baseline PM case ranges from 20% to 40%. Note that the following section explores the sensitivity of this change to our regression approach—including a discussion regarding the possibility of overfitting.

For the PM, SF, LF, and LFA transformation cases, and all three SM states defined in Table 1, global-median percentile RMSE values for the (left) CLSM to VIC and (right) CLSM to Noah SM transformations. Results are shown for both (top) raw percentile RMSE values and (bottom) percentile RMSE results normalized by RMSE results for the baseline PM transformation case.


### b. Sensitivity to regression approach

It is useful to consider the regression methodology applied to fit **β**. To this end, Figs. 5 and 6 describe the sensitivity of LFA percentile SM results to training-period length (i.e., 2, 5, and 10 years) and the number of parameters fit per layer (*N*) during the MVLR-based fitting of **β**. In all cases, evaluation is conducted during the out-of-sample period of 1 January 2010–31 December 2019.

For CLSM to VIC SM transformation, the sensitivity of multilayer LFA percentile RMSE results to training-period length and number of fitted parameters (*N*, per layer). Percentile RMSE results are normalized by their equivalent PM RMSE results.


As in Fig. 5, but for CLSM to Noah SM transformations.


As expected, longer training periods lead to larger improvements relative to the PM baseline. However, net improvement is still noted for training periods as short as 2 years. Since all evaluation is performed out of sample, results in Figs. 5 and 6 also capture the impact of parameter overfitting when defining **β**. For a 10-yr training period, overfitting tends to occur after the fitting of around 9 parameters (*N*) per layer, which is equivalent to a maximum time lag of 64 days. That is, for *N* > 9, there is a general tendency for percentile RMSE to increase as *N* rises. Naturally, fewer parameters can be robustly fit for shorter training periods. For a 2-yr training period, the maximum number of parameters per layer (without encountering overfitting) is reduced to between 3 and 5 parameters and the maximum time lag to between 4 and 16 days. Overfitting can, in a worst-case scenario, lead to the LFA case underperforming the baseline PM case, as can be seen in the Noah SM results for a 2-yr training period (Fig. 6). Note that in real-world cases where the entire historical period is utilized for training, overfitting can likely be detected via the application of the Akaike information criterion or other regression evaluation approaches that penalize excessive fit complexity.
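As a rough illustration of the complexity-penalized selection mentioned above, the sketch below scores lagged least-squares fits of increasing *N* with the Gaussian least-squares form of the Akaike information criterion, AIC = *n* ln(RSS/*n*) + 2*k*. This is not an analysis performed in the study. The synthetic predictor and target, the single-layer setup, and the function names are all assumptions made for the example.

```python
import numpy as np

def aic_least_squares(y, y_hat, k):
    """AIC for a Gaussian least-squares fit with k parameters (up to a constant)."""
    n = y.size
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * k

def select_n_by_aic(x, y, n_max):
    """Score N = 1..n_max lagged regressions on a common sample; return best N."""
    all_lags = [(i - 1) ** 2 for i in range(1, n_max + 1)]
    start = all_lags[-1]  # drop the same leading rows for every N
    scores = {}
    for n_params in range(1, n_max + 1):
        lags = all_lags[:n_params]
        X = np.column_stack([x[start - lag: x.size - lag] for lag in lags])
        beta, *_ = np.linalg.lstsq(X, y[start:], rcond=None)
        scores[n_params] = aic_least_squares(y[start:], X @ beta, n_params)
    return min(scores, key=scores.get), scores

# Synthetic target that truly depends on lags 0, 1, and 4 days (i.e., N = 3).
rng = np.random.default_rng(2)
x = rng.standard_normal(2000)
y = 0.6 * x
y[1:] += 0.3 * x[:-1]
y[4:] += 0.1 * x[:-4]
y += 0.1 * rng.standard_normal(x.size)

best_n, scores = select_n_by_aic(x, y, n_max=6)
```

All candidate fits are scored on the same rows so that their AIC values are comparable. In practice a held-out evaluation period, as used in Figs. 5 and 6, remains the more direct check on overfitting.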

Another consideration is the appropriateness of the linearity assumption inherent in MVLR. To examine this issue, Fig. 7 replots root-zone VIC LFA results in Fig. 5 for the case of fitting **β** using both MVLR and a nonlinear support vector machine regression (SVMR) approach (Vapnik 1995). Due to the increased computational burden of SVMR, results are shown for North America only. These comparisons are repeated for a range of training-period lengths and values of *N*. Results in Fig. 7 show no meaningful difference between MVLR and SVMR results in all cases. This suggests that the fitting of **β** is an approximately linear regression problem that does not materially benefit from the application of more complex nonlinear regression approaches like SVMR.

For the LFA transformation approach applied to the (top) CLSM to VIC and (bottom) CLSM to Noah fit cases, the comparison of percentile root-zone RMSE results for MVLR (solid lines) and SVMR (circles) fitting across a range of training-period lengths (i.e., 2, 5, and 10 years) and number of fitted parameters per soil layer (*N*). For computational reasons, results are based on a North American domain.


For the LFA transform case applied to the LDAS and LSM/LDAS fit cases, percentile RMSE for second-layer SM estimates, for the simplified LSM defined in (7) and (8), as a function of temporal memory differences between LSM X and LSM Y. Such memory differences are defined as the lag-1 day autocorrelation of second-layer SM from LSM Y minus the lag-1 day autocorrelation of second-layer SM from LSM X. All plotted results are based on median values sampled within a moving window of width 0.02. Comparable percentile RMSE results for the baseline PM case are shown for reference. The histogram describes the relative fraction of LDAS pairs demonstrating each lag-1 autocorrelation difference.


Finally, LF and LFA results in Figs. 5 and 6 are based on time lags defined by the quadratic series *L*_{i} = (*i* − 1)^{2} (days), where *i* = [1, 2, 3, …, *N*]. A second possibility is a geometric series where *L*_{i} = 2^{i} − 1 (days). However, reproducing Figs. 5 and 6 for the case of geometric time lags produced essentially identical results (not shown).
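For reference, the quadratic lag series can be tabulated directly to see how the number of fitted parameters *N* maps onto the maximum time lag quoted in this section. The function name below is a hypothetical helper introduced only for this illustration.

```python
def quadratic_lags(n_params):
    """Time lags (days) L_i = (i - 1)^2 for i = 1, ..., n_params."""
    return [(i - 1) ** 2 for i in range(1, n_params + 1)]

lags_n5 = quadratic_lags(5)            # lags used when N = 5
max_lag_n9 = quadratic_lags(9)[-1]     # maximum lag when N = 9
max_lag_n13 = quadratic_lags(13)[-1]   # maximum lag when N = 13
```

With *N* = 9 the longest lag is 64 days, and with *N* between 3 and 5 it is between 4 and 16 days, which matches the values cited for the 10-yr and 2-yr training periods.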

### c. LDAS synthetic twin results

In this section, we examine the intermodel transferability of LDAS SM analyses (i.e., SM estimates obtained after the assimilation of surface SM information into an LSM). To start, we will examine the case of the simple, linear LSM summarized in (7) and (8).

Our first evaluation approach is based on a large number (50 000) of synthetic-twin data assimilation experiments conducted for randomized pairs of simple LSM parameterizations [section 3b(1)]. This analysis assumes the operational availability of an LDAS SM analysis based on LSM Y (hereinafter, LDAS Y), and transformations are designed to mimic a hypothetical SM analysis generated by a second LDAS based on LSM X (hereinafter, LDAS X). Later, in section 4d, we will examine parallel results for the case of a more complex/nonlinear LSM (i.e., CLSM).

Figure 8 summarizes our simple/linear synthetic experiments by plotting LFA percentile RMSE results for both the LDAS and LSM/LDAS fit cases [see section 3b(1)]. Specifically, LFA percentile RMSE is shown as a function of temporal memory differences between the second-layer SM estimates generated by LSM X and LSM Y as measured by the lag-1 (day) autocorrelation of their corresponding daily LSM SM estimates. The ordinate values in Fig. 8 reflect median RMSE values sampled within a 0.02 moving window along the plot abscissa. Since the simple LSM expressed in (7) and (8) lacks seasonality, the LF and LFA transformation case results are equivalent in this synthetic experiment.
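The two diagnostics used to construct Fig. 8, the lag-1 autocorrelation of a daily SM series and a median taken within a moving window along the abscissa, can be sketched as below. This is an illustrative implementation only. The function names, the centered-window convention, and the toy inputs are assumptions rather than a description of the plotting code used for Fig. 8.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a (daily) time series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

def moving_window_median(xvals, yvals, centers, width=0.02):
    """Median of yvals for samples whose xvals fall within width/2 of each center."""
    out = np.full(len(centers), np.nan)
    for j, c in enumerate(centers):
        sel = np.abs(xvals - c) <= width / 2
        if sel.any():
            out[j] = np.median(yvals[sel])
    return out

# Toy check of the autocorrelation estimate on an AR(1) series.
rng = np.random.default_rng(3)
noise = rng.standard_normal(20000)
series = np.zeros(noise.size)
for t in range(1, noise.size):
    series[t] = 0.9 * series[t - 1] + noise[t]
rho = lag1_autocorr(series)

# Toy check of the windowed median (x: autocorrelation difference, y: RMSE).
xvals = np.array([0.0, 0.005, 0.5])
yvals = np.array([1.0, 3.0, 10.0])
medians = moving_window_median(xvals, yvals, centers=np.array([0.0, 0.5]))
```

In the experiment itself, each of the randomized LSM pairs would contribute one (autocorrelation difference, percentile RMSE) point to the windowed median.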

As discussed above, the PM transform case assumes that the time series of SM percentiles for LDAS X results matches that of LDAS Y. Therefore, errors in the PM baseline case increase as a function of the (absolute) difference in SM autocorrelation between the underlying LSMs. As expected, the LFA-based fitting of **β** using retrospective LDAS analyses (in the LFA LDAS fit case) leads to a substantial reduction in percentile RMSE results relative to the PM baseline, particularly in the case of large SM autocorrelation differences between the underlying LSMs X and Y. In addition, the success of the LDAS fit case is nearly duplicated by the LFA LSM/LDAS case where **β** is fit using only retrospective LSM results. That is, **β** can be fit to historical pairs of LSM SM estimates and still provide accurate mapping between analogous operational LDAS SM analyses. This suggests that it is possible to use operational LDAS Y results to replicate hypothetical SM percentile estimates produced by LDAS X without actually implementing a data assimilation system for LSM X.

### d. SMAP L4 versus NR comparisons

The close correspondence between the LDAS and LSM/LDAS fit cases in Fig. 8 is intuitive since a well-parameterized LDAS applied to a linear model will differ from its LSM equivalent only via white-noise increments added at each assimilation update. As a result, differences between LSM and LDAS SM results above should, in theory, be wholly random and not impact the systematic relationship between SM estimates acquired from two different linear LSMs.

For all three CLSM SM state variables and the SMAP NR to L4_SM fit case, percentile RMSE results for the LFA transformation case (normalized by their equivalent PM transformation RMSE values). Plotted results are based on a 4-yr training period and a range of *N* values. To facilitate comparisons, the *y* axis is scaled to match that of Figs. 5 and 6.


Prospects are less clear for the application of complex, nonlinear LSMs. As expected in a well-parameterized LDAS, the SMAP L4_SM system generates nearly white-noise innovations (i.e., observation minus forecast differences) (Reichle et al. 2017). However, even white-noise increments (produced by white-noise innovations) can lead to systematic state biases when propagated through a nonlinear LSM (Ryu et al. 2009). Therefore, it is important to examine results equivalent to those shown in Fig. 8 but using a realistic, nonlinear LSM (e.g., the CLSM LSM core of the SMAP L4_SM product).
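The contrast between linear and nonlinear models drawn above can be illustrated with a toy experiment. The sketch below propagates the same zero-mean white-noise perturbations through a linear recursion and through a bounded (clipped) version of that recursion, and then compares each perturbed run with its unperturbed counterpart. The model form, the parameter values, and the bounds are hypothetical choices made for this example. They are not CLSM, the simple LSM of (7) and (8), or the correction approach of Ryu et al. (2009).

```python
import numpy as np

def run_model(noise, clip):
    """Toy storage recursion x <- a*x + forcing + noise, optionally bounded to [0, 1]."""
    a, forcing = 0.9, 0.005       # hypothetical parameters; equilibrium at x = 0.05
    x = 0.05
    out = np.empty(noise.size)
    for t in range(noise.size):
        x = a * x + forcing + noise[t]
        if clip:
            x = min(max(x, 0.0), 1.0)  # physical bounds make the model nonlinear
        out[t] = x
    return out

rng = np.random.default_rng(4)
n_steps = 200000
white_noise = 0.05 * rng.standard_normal(n_steps)   # zero-mean increments
no_noise = np.zeros(n_steps)

# Mean difference between perturbed and unperturbed runs.
bias_linear = float(np.mean(run_model(white_noise, clip=False)
                            - run_model(no_noise, clip=False)))
bias_nonlinear = float(np.mean(run_model(white_noise, clip=True)
                               - run_model(no_noise, clip=True)))
```

In the linear case the perturbed-minus-unperturbed difference averages out to nearly zero, whereas the lower bound in the clipped case truncates negative excursions and leaves a positive mean bias. This is the type of systematic effect that motivates the comparison with a realistic, nonlinear LSM.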

To check for such biases, Fig. 9 examines time series of CLSM NR and L4_SM root-zone SM estimates for the presence of systematic differences. As discussed in section 3b(2), it does so by comparing quasi-global percentile RMSE results (based on the transformation of L4_SM results into CLSM NR results) using the PM and LFA fit cases. Plotted values in Fig. 9 are global median LFA percentile RMSE results normalized by corresponding PM percentile RMSE values. All results are based on a 4-yr training period (1 April 2015–31 March 2019) and a 1-yr evaluation period (1 April 2019–31 March 2020).

Out-of-sample fitting advantages for the LFA case (relative to the PM baseline) in Fig. 9 are small (<5% for all layers) and only evident for very low *N* cases. This lack of a robust advantage for LFA results suggests that, despite the nonlinear nature of CLSM, temporal differences between NR and L4_SM state estimates remain generally random and, as a result, are not robustly rectified via the out-of-sample fitting of **β**. This, in turn, supports the assumption underlying the LSM/LDAS case that **β** can be fit to LSM results and subsequently cross-applied to the transformation of LDAS Y SM analyses into equivalent LDAS X estimates.

## 5. Summary and conclusions

Here, we describe the development of simple regression tools for translating SM time series estimates between different LSM or LDAS sources and target pairs. If available, such tools would aid in the application of existing centralized sources of medium-latency SM estimates like the NASA SMAP L4_SM product to support, for example, the initialization of profile SM states in a hydrological forecasting system or the diagnostic tracking of root-zone SM levels in an agricultural drought monitoring system.

A common transformation approach is simple PM based on the assumption that the temporal percentiles of SM estimates acquired from different LSMs (or multiple LDAS centered on different LSMs) match across all time. This baseline approach breaks down for the (very common) case of contrasting temporal memory (i.e., autocorrelation) in equivalent SM states obtained from different LSMs (see Fig. 1). As a result, even simple MVLR approaches can substantially improve upon baseline PM results (Figs. 2–4). Regression approaches become increasingly accurate when utilizing lagged SM estimates as predictors and applied to SM temporal anomalies obtained after the removal of a fixed SM seasonal cycle (Figs. 2–4). When applied quasi-globally to SM estimates generated by multiple GLDAS-2 LSMs, simple MVLR regression approaches remove between 20% and 40% of the percentile RMSE associated with a PM transformation baseline (Fig. 4).

Naturally, the quality of MVLR results improves as their training period is lengthened. However, significant improvement versus the PM baseline is evident for training periods as short as 2 years (Figs. 5 and 6). A 10-yr training period allows for the robust fitting of up to 9 parameters per layer without overfitting (Figs. 5 and 6). When regression fitting is performed in percentile space, simple and fast MVLR performs as well as more complex nonlinear regression approaches (Fig. 7).

Using a very large number of paired synthetic twin data assimilation experiments, comparable advantages are found when translating SM analyses between LDAS analyses based on the assimilation of surface SM information into different LSMs. Critically, when applied to LDAS SM analyses, **β** can be fit to LSM results and subsequently applied to equivalent LDAS SM analyses (Fig. 8). This LSM/LDAS fit strategy is of practical value for the case where an application wishes to mimic an unavailable LDAS built around a specific LSM using output from an existing LDAS (e.g., the SMAP L4_SM product) centered on a *different* LSM. The relative success of the LSM/LDAS fit case illustrates that it is not necessary to design and build a new, application-specific LDAS. Rather, only long-term SM records from the two LSMs are needed (acquired without any data assimilation).

It should be noted that a PM transformation already requires access to long-term, historical SM estimates to construct a cumulative density function to obtain percentile equivalents for multiple SM products. Therefore, the regression-based transformation approaches examined here do not require any additional LSM or LDAS historical integrations beyond what is already required by a baseline PM implementation.

Despite the relative success of regression-based SM transformations, they still reduce the SM percentile RMSE by less than half when translating SM percentiles between LSM pairs (Fig. 4). Therefore, in cases where an application-specific LDAS already exists, or can be easily generated, there is little justification for the approaches described here. However, for the much broader range of operational drought monitoring or medium-latency hydrologic forecasting applications where such work has not been completed, or is practically impossible given available resources, our proposed transformation approaches can be applied relatively quickly and affordably to maximally leverage existing medium-latency, operational LDAS SM analyses such as the SMAP L4_SM product.

## Acknowledgments

This research was supported by the NASA SMAP Science Team. The USDA is an equal opportunity employer. All data products used in the analysis are available for public download.

## APPENDIX A

### VIC Soil Layer Thicknesses

Figure A1 shows global maps of soil-layer thicknesses applied for VIC GLDAS simulations.

Thickness of the second (*D*_{VIC,2}) and third (*D*_{VIC,3}) VIC soil layers. Note variation in color-bar scales between the subplots and that the VIC implementation in GLDAS is somewhat unusual in that its deepest soil layer (i.e., VIC layer 3) is thinner than the soil layer immediately above it (i.e., VIC layer 2).


## APPENDIX B

### Parameterization of Synthetic Twin Data Assimilation Experiments

Table B1 shows the range of parameter values applied to (6) for synthetic twin data assimilation results described in section 4c.

Ranges for randomized parameter values in the pairwise synthetic twin data assimilation experiments. Parameters are randomly drawn from a uniform distribution bounded by the listed values. Note that values of *Q* are used to derive a 2 × 2 model error covariance matrix with perfect vertical correlation.

## REFERENCES

Afshar, M. H., M. T. Yilmaz, and W. T. Crow, 2019: Impact of rescaling approaches in simple fusion of soil moisture products. *Water Resour. Res.*, **55**, 7804–7825, https://doi.org/10.1029/2019WR025111.

Beaudoing, H., and M. Rodell, 2020a: GLDAS VIC Land Surface Model L4 3 hourly 1.0 × 1.0 degree V2.1. Goddard Earth Sciences Data and Information Services Center (GES DISC), accessed 20 March 2020, https://doi.org/10.5067/ZOG6BCSE26HV.

Beaudoing, H., and M. Rodell, 2020b: GLDAS Noah Land Surface Model L4 3 hourly 1.0 × 1.0 degree V2.1. Goddard Earth Sciences Data and Information Services Center (GES DISC), accessed 20 March 2020, https://doi.org/10.5067/IIG8FHR17DA9.

Bolten, J. D., and W. T. Crow, 2012: Improved prediction of quasi-global vegetation conditions using remotely sensed surface soil moisture. *Geophys. Res. Lett.*, **39**, L19406, https://doi.org/10.1029/2012GL053470.

Crow, W. T., F. Chen, R. H. Reichle, and Q. L. Liu, 2017: L-band microwave remote sensing and land data assimilation improve the representation of prestorm soil moisture conditions for hydrologic forecasting. *Geophys. Res. Lett.*, **44**, 5495–5503, https://doi.org/10.1002/2017GL073642.

Dong, J., W. T. Crow, R. H. Reichle, Q. Liu, F. Lei, and M. Cosh, 2019: A global assessment of added value in the SMAP Level 4 soil moisture product relative to its baseline land surface model. *Geophys. Res. Lett.*, **46**, 6604–6613, https://doi.org/10.1029/2019GL083398.

Dong, J., W. T. Crow, K. J. Tobin, M. H. Cosh, D. D. Bosch, P. J. Starks, M. Seyfried, and C. Holifield-Collins, 2020: Comparison of microwave remote sensing and land surface modeling for surface soil moisture climatology estimation. *Remote Sens. Environ.*, **242**, 111756, https://doi.org/10.1016/j.rse.2020.111756.

Ducharne, A., R. D. Koster, M. J. Suarez, M. Stieglitz, and P. Kumar, 2000: A catchment-based approach to modeling land surface processes in a general circulation model: 2. Parameter estimation and model demonstration. *J. Geophys. Res.*, **105**, 24 823–24 838, https://doi.org/10.1029/2000JD900328.

Ek, M. B., K. E. Mitchell, Y. Lin, E. Rogers, P. Grunmann, V. Koren, G. Gayno, and J. D. Tarpley, 2003: Implementation of Noah land surface model advances in the National Centers for Environmental Prediction operational mesoscale Eta model. *J. Geophys. Res.*, **108**, 8851, https://doi.org/10.1029/2002JD003296.

Gao, H., E. F. Wood, M. Drusch, and M. F. McCabe, 2007: Copula-derived observation operators for assimilating TMI and AMSR-E retrieved soil moisture into land surface models. *J. Hydrometeor.*, **8**, 413–429, https://doi.org/10.1175/JHM570.1.

Gruber, A., W. Dorigo, W. T. Crow, and W. Wagner, 2017: Triple collocation-based merging of satellite soil moisture retrievals. *IEEE Trans. Geosci. Remote Sens.*, **55**, 6780–6792, https://doi.org/10.1109/TGRS.2017.2734070.

Koster, R. D., M. J. Suarez, A. Ducharne, M. Stieglitz, and P. Kumar, 2000: A catchment-based approach to modeling land surface processes in a general circulation model: 1. Model structure. *J. Geophys. Res.*, **105**, 24 809–24 822, https://doi.org/10.1029/2000JD900327.

Koster, R. D., Z. Guo, R. Yang, P. A. Dirmeyer, K. Mitchell, and M. J. Puma, 2009: On the nature of soil moisture in land surface models. *J. Climate*, **22**, 4322–4335, https://doi.org/10.1175/2009JCLI2832.1.

Liang, X., D. P. Lettenmaier, E. F. Wood, and S. J. Burges, 1994: A simple hydrologically based model of land surface water and energy fluxes for general circulation models. *J. Geophys. Res.*, **99**, 14 415–14 428, https://doi.org/10.1029/94JD00483.

Mladenova, I. E., J. D. Bolten, W. T. Crow, N. Sazib, M. H. Cosh, C. J. Tucker, and C. Reynolds, 2019: Evaluating the operational application of SMAP for global agricultural drought monitoring. *IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.*, **12**, 3387–3397, https://doi.org/10.1109/JSTARS.2019.2923555.

Qiu, J., W. T. Crow, X. Mo, and S. Liu, 2014: Impact of temporal autocorrelation mismatch on the assimilation of satellite-derived surface soil moisture retrievals. *IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.*, **7**, 3534–3542, https://doi.org/10.1109/JSTARS.2014.2349354.

Raoult, N., B. Delorme, C. Ottlé, P. Peylin, V. Bastrikov, P. Maugis, and J. Polcher, 2018: Confronting soil moisture dynamics from the ORCHIDEE land surface model with the ESA-CCI product: Perspectives for data assimilation. *Remote Sens.*, **10**, 1786, https://doi.org/10.3390/rs10111786.

Reichle, R., and Coauthors, 2017: Global assessment of the SMAP Level-4 surface and root-zone soil moisture product using assimilation diagnostics. *J. Hydrometeor.*, **18**, 3217–3237, https://doi.org/10.1175/JHM-D-17-0130.1.

Reichle, R., G. De Lannoy, R. D. Koster, W. T. Crow, J. S. Kimball, and Q. Liu, 2018: SMAP L4 Global 3-hourly 9 km EASE-Grid Surface and Root Zone Soil Moisture Geophysical Data, version 4. NASA National Snow and Ice Data Center Distributed Active Archive Center, accessed 25 August 2020, https://doi.org/10.5067/KPJNN2GI1DQR.

Reichle, R. H., and Coauthors, 2019: Version 4 of the SMAP Level-4 soil moisture algorithm and data product. *J. Adv. Model. Earth Syst.*, **11**, 3106–3130, https://doi.org/10.1029/2019MS001729.

Rodell, M., and Coauthors, 2004: The Global Land Data Assimilation System. *Bull. Amer. Meteor. Soc.*, **85**, 381–394, https://doi.org/10.1175/BAMS-85-3-381.

Ryu, D., W. T. Crow, X. Zhan, and T. J. Jackson, 2009: Correcting unintended perturbation biases in hydrologic data assimilation using ensemble Kalman filter. *J. Hydrometeor.*, **10**, 734–750, https://doi.org/10.1175/2008JHM1038.1.

Su, C.-H., and D. Ryu, 2015: Multi-scale analysis of bias correction of soil moisture. *Hydrol. Earth Syst. Sci.*, **19**, 17–31, https://doi.org/10.5194/hess-19-17-2015.

Vapnik, V., 1995: *The Nature of Statistical Learning Theory*. Springer, 299 pp.

Verhoest, N. E. C., and Coauthors, 2015: Copula-based downscaling of coarse-scale soil moisture observations with implicit bias correction. *IEEE Trans. Geosci. Remote Sens.*, **53**, 3507–3521, https://doi.org/10.1109/TGRS.2014.2378913.