Regional Frequency Analysis at Ungauged Sites with Multivariate Adaptive Regression Splines

A. Msilini Canada Research Chair in Statistical Hydro-Climatology, INRS-ETE, Quebec, Quebec, Canada

P. Masselot Canada Research Chair in Statistical Hydro-Climatology, INRS-ETE, Quebec, Quebec, Canada

T. B. M. J. Ouarda Canada Research Chair in Statistical Hydro-Climatology, INRS-ETE, Quebec, Quebec, Canada



Abstract

Hydrological systems are naturally complex and nonlinear. A large number of variables, many of which are not yet well considered in regional frequency analysis (RFA), have a significant impact on hydrological dynamics and consequently on flood quantile estimates. Despite the increasing number of statistical tools used to estimate flood quantiles at ungauged sites, little attention has been dedicated to the development of new regional estimation (RE) models accounting for both nonlinear links and interactions between hydrological and physio-meteorological variables. The aim of this paper is to simultaneously take into account nonlinearity and interactions between variables by introducing the multivariate adaptive regression splines (MARS) approach in RFA. The predictive performances of MARS are compared with those obtained by one of the most robust RE models: the generalized additive model (GAM). Both approaches are applied to two datasets covering 151 hydrometric stations in the province of Quebec (Canada): a standard dataset (STA) containing commonly used variables and an extended dataset (EXTD) combining STA with additional variables dealing with drainage network characteristics. Results indicate that RE models using MARS with the EXTD slightly outperform RE models using GAM. Thus, MARS seems to allow for a better representation of the hydrological process and an increased predictive power in RFA.

© 2020 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Amina Msilini, amina.msilini@ete.inrs.ca

1. Introduction and literature review

The main objective of regional frequency analysis (RFA) is the estimation of the return period of extreme hydrological events at target sites where little or no hydrological data are available. Examples include flood and low-flow quantiles, which are crucial for infrastructure design and management. In general, RFA comprises two main steps: (i) the delineation of homogeneous regions (DHR) to determine gauged sites similar to the target one and (ii) regional estimation (RE) to transfer the information from the sites determined in the DHR step to the target one (e.g., Chebana and Ouarda 2008). Various methods have been suggested for each of these two steps (e.g., Ouarda 2016).

Among the most common DHR methods, we can mention the region of influence (ROI) (Burn 1990a) and the canonical correlation analysis (CCA) (Ouarda et al. 2001). Recently, several advanced nonlinear neighborhood approaches were suggested (e.g., Ouali et al. 2016; Wazneh et al. 2016). Among the commonly used RE approaches, we can distinguish the regression-based models and the index-flood models. Among the former, the log-linear regression models are the most commonly used ones in practice, because of their simplicity and good predictive performances. We focus here on regression-based models in the RE step.

Hydrological processes depend on a large number of variables, such as the topographic variability of the basins, their soil structure and texture, their geological formations, and the climatology. This leads to a natural complexity, which has been widely recognized and documented in the hydrological literature (e.g., Ibbitt and Woods 2004; Sivakumar 2007; Wang et al. 2008; Xu et al. 2010). In statistical terms, this complexity manifests itself through three aspects: (i) the high number of explanatory variables necessary to paint a realistic picture of the processes, (ii) the nonlinear impact of these explanatory variables, and (iii) the important interaction between the different explanatory variables. It is thus important that the RE step in RFA accounts for these three aspects in order to yield accurate estimations of the target site’s quantiles of interest.

In RFA studies, the RE step usually requires a large number of explanatory variables to result in satisfactory predictive performances. This number usually exceeds five, as in Ouarda et al. (2018), but should increase in the future with the discovery of new potential variables. For instance, evidence is growing that drainage network characteristics have a strong impact on hydrological dynamics, and are consequently linked to flood quantiles (Jung et al. 2017). Thus, integrating additional characteristics related to the drainage network may lead to more accurate estimates of the regional quantiles. Hence, there is a need to propose efficient approaches that are able to manage such high-dimensional databases.

Another consequence of the natural complexity of hydrological processes is the nonlinearity between explanatory variables and the at-site quantiles. To handle this problem and better reproduce the dynamics of hydrological processes, various nonlinear approaches have been proposed (e.g., Shu and Burn 2004). The classical log-linear method used in the RE step assumes that the relation between the logarithm of the response variable (hydrological) and explanatory variables (physio-meteorological) is linear, which is too simplistic for such complex nonlinear processes. Therefore, several RE approaches, such as random forest (RF), artificial neural network (ANN), and generalized additive models (GAM) have been proposed in the literature to account for the possible nonlinear links between variables (e.g., Aziz et al. 2014; Khalil et al. 2011; Ouali et al. 2017; Ouarda et al. 2018; Saadi et al. 2019).

Random forest (Breiman 2001) is a powerful nonlinear and nonparametric method commonly used to handle regression and classification problems based on decision trees. Due to its good performance, it has been applied in several fields, such as hydrology (e.g., Diez-Sierra and del Jesus 2019; Muñoz et al. 2018; Wang et al. 2015), ecology (e.g., Cutler et al. 2007; Prasad et al. 2006), environmental modeling (e.g., Masselink et al. 2017; Pourghasemi and Kerle 2016), and RFA (e.g., Booker and Woods 2014; Brunner et al. 2018). Despite its predictive power, RF suffers from major limitations such as the difficulty of interpretation and the large memory requirements for storing the model when used with a large dataset (Geurts et al. 2009).

The ANN is a nonparametric mathematical model, whose design is inspired by the biological functioning of brain neurons (Bishop 1995). It was considered in several RFA studies for the estimation of flood and low-flow quantiles at ungauged sites (e.g., Aziz et al. 2014; Ouarda and Shu 2009). However, ANNs share a major problem, namely the tendency to overfit (e.g., Gal and Ghahramani 2016; Lawrence and Giles 2000). In addition, their calibration is relatively complex, especially for novice users, and requires some subjective choices since no explicit regression equations can be given (Ouali et al. 2017).

GAMs do not suffer from the same drawbacks as ANNs. GAMs are flexible nonlinear regression models (Hastie and Tibshirani 1987) that were introduced in the RFA context by Chebana et al. (2014). The authors found that the GAM-based methods present the best performances when compared to the classical log-linear model and other common methods. GAMs are increasingly being adopted in several fields such as hydroclimatology and environmental modeling (e.g., Rahman et al. 2018; Wen et al. 2011), public health (e.g., Bayentin et al. 2010; Leitte et al. 2009), and renewable energy (e.g., Ouarda et al. 2016). However, they still present a number of disadvantages. Indeed, the method can be computationally intensive, especially when a large number of variables is involved. It can then be difficult to fit a GAM to high-dimensional databases because of memory limitations imposed by the numerical complexity of this model (Leathwick et al. 2006). More importantly, GAMs do not cope well with interactions between variables (e.g., Ramsay et al. 2003), which are difficult to integrate in the model.

The interaction between physiographical variables within the watershed has long been recognized (e.g., Niehoff et al. 2002). Thus, the inclusion of interaction terms between the explanatory variables used to model the hydrological dynamics seems to be essential for better estimates of flood quantiles. However, this aspect is difficult to take into account in the RE models due to the high complexity that it may add to the models (see above for the specific example of GAMs). This affects the quality of the estimates and makes them less accurate. Hence, the motivation behind the present paper is to propose and explore alternative techniques able to realistically reproduce the hydrological process while avoiding the problems mentioned above.

The method considered here is multivariate adaptive regression splines (MARS), a procedure designed to build complex nonlinear regression models in a high dimensional setting. It is attractive in the RFA context since it actually addresses the three issues developed above which are: high number of variables, nonlinearity, and interactions. Indeed, MARS is efficient in a high dimensional setting and naturally selects the relevant predictors in this context. In addition, it does not require assumptions about the form of the relationships between the response and the explanatory variables (Friedman 1991). MARS also allows the modeling of complex structures between variables, which are often hidden in high-dimensional data, without imposing strong model assumptions. Hence, it can easily include interactions between variables, allowing any degree of interaction to be considered (Lee et al. 2006).

All of these desirable properties lead to a very flexible approach able to adapt well to the hydrological phenomenon. Due to its simplicity and capacity to capture complex nonlinear relationships, it has been successfully applied in several fields such as ecology and environment (e.g., Balshi et al. 2009; Bond and Kennard 2017; Leathwick et al. 2006, 2005), finance (e.g., Lee and Chen 2005; Lee et al. 2006), geology (e.g., Zhang and Goh 2016; Zhang et al. 2015), energy (e.g., Li et al. 2016; Roy et al. 2018), and hydrology (e.g., Bond and Kennard 2017; Deo et al. 2017; Emamgolizadeh et al. 2015; Kisi 2015; Kisi and Parmar 2016). Despite the extensive use of the MARS model in various frameworks and contexts, its potential has never been exploited and investigated in the context of RFA of extreme hydrological events.

The main objective of the present study is to introduce the MARS approach in the RFA context to estimate flood quantiles and to evaluate its predictive potential when it is applied to an extensive database. It is applied in combination with DHR based on the CCA and the ROI approaches. MARS is also applied without DHR to test its performance when applied to all stations without consideration of hydrological neighborhoods. A jackknife procedure is used to evaluate the model performances, with GAMs used as a benchmark.

This paper is structured as follows. Section 2 presents the theoretical background of MARS and the other RFA approaches adopted. The considered methodology is outlined in section 3. Section 4 describes the case study and the considered datasets. The obtained results are presented and discussed in section 5. The conclusions of the study are summarized in the last section. The appendix contains a list of terms and abbreviations.

2. Theoretical background

In this section, the adopted statistical tools are briefly presented and discussed.

a. Neighborhood identification approaches

Here we present the two most commonly considered neighborhood identification approaches as a necessary step before the RE one.

1) Canonical correlation analysis approach

CCA (Hotelling 1935) is a multivariate analysis technique used to identify the possible correlations between two groups of variables. It consists of a linear transformation of two groups of random variables into pairs of canonical variables, which are established in such a way that the correlations between each pair are maximized.

Let X = (X1, X2, …, Xr) and Y = (Y1, Y2, …, Ys) be sets of random variables including, respectively, the r physio-meteorological variables and the s hydrological variables of n gauged sites. The objective of CCA is to construct linear combinations Vi and Wi (called canonical variables) of the variables X and Y, i.e.,
$$V_i = A_{i1} X_1 + A_{i2} X_2 + \cdots + A_{ir} X_r, \tag{1}$$
$$W_i = B_{i1} Y_1 + B_{i2} Y_2 + \cdots + B_{is} Y_s, \tag{2}$$
where i = 1, …, p, with p = min(r, s). The first weight vectors A1 and B1 maximize the correlation coefficient between the resulting canonical variables, i.e., λ1 = corr(V1, W1), under the constraint of unit variance. Once the first pair of canonical variables is identified, other pairs (Vi, Wi, i > 1) can be obtained under the constraint corr(Vi, Wj) = 0 (where i ≠ j).

For neighborhood delineation in RFA, the considered Xr are physio-meteorological variables while the Ys are the flood quantiles of interest. CCA is then used to construct canonical variables Wi that correlate well with physio-meteorological variables. The neighborhood is the set of sites such that the canonical hydrological score wk, k = 1, …, K, is close to the canonical physio-meteorological score of the target ungauged site υ0. The distance is measured by a Mahalanobis distance between the hydrological mean position of the target site Λυ0 and the positions of other sites wk, where Λ = diag(λ1, …, λp) and υ0 is the physio-meteorological canonical score of the target site. Provided the X variables are approximately normal, the Mahalanobis distance converges to a χ² distribution with p degrees of freedom. The size of the neighborhood is controlled by the parameter α, which defines the (1 − α) quantile of the χ² distribution with p degrees of freedom above which sites are excluded from the neighborhood. As extreme cases, all stations are considered if α = 0, and no station is included in the neighborhood when α = 1. For more details, the reader is referred to Ouarda et al. (2001).
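The neighborhood rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariance of the hydrological scores given υ0 is taken to be I − Λ² (which follows from unit-variance canonical variables under joint normality), so the Mahalanobis distance reduces to a weighted sum of squares. The function names and all numerical values are hypothetical, and the closed-form χ² quantile used here is valid only for p = 2.

```python
import math


def cca_distance(w, v0, lambdas):
    """Squared Mahalanobis distance between a site's hydrological canonical
    score w and the expected position Lambda * v0 of the target site.
    Assumes a diagonal covariance I - Lambda^2."""
    return sum(
        (w_i - lam * v_i) ** 2 / (1.0 - lam ** 2)
        for w_i, v_i, lam in zip(w, v0, lambdas)
    )


def cca_neighborhood(scores, v0, lambdas, threshold):
    """Indices of gauged sites whose distance does not exceed the threshold."""
    return [
        k for k, w in enumerate(scores)
        if cca_distance(w, v0, lambdas) <= threshold
    ]


# Toy example with p = 2 canonical pairs (all numbers are made up).
lambdas = [0.9, 0.6]            # canonical correlations
v0 = [1.0, -0.5]                # physio-meteorological score of the target
scores = [[0.9, -0.3], [-1.5, 2.0], [0.5, 0.1]]  # hydrological scores w_k

# For p = 2 the (1 - alpha) chi-square quantile equals -2 ln(alpha).
alpha = 0.30
threshold = -2.0 * math.log(alpha)
neighbors = cca_neighborhood(scores, v0, lambdas, threshold)
```

With these toy values the first and third sites fall inside the neighborhood, while the second one lies far from Λυ0 and is excluded.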

2) Region of influence approach

The ROI approach was introduced by Burn (1990b) to identify the neighborhood of a given target site based on the similarity between watershed characteristics. The similarity is measured using a Euclidean distance in the multidimensional physio-meteorological space (e.g., Burn 1990b; Tasker et al. 1996), i.e.,
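$$\mathrm{ROI}_i = \left\{ \text{sites } j \in (1, \ldots, n);\; D_{ij} = \left[ \sum_{k=1}^{r} W_k \left( X_{k,i} - X_{k,j} \right)^2 \right]^{1/2} \le \theta \right\}, \tag{3}$$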
where Dij is the weighted Euclidean distance between the target site i and the gauged site j (j = 1, …, n), Xk,j (k = 1, …, r) is the standardized value of the kth variable at site j, Wk is the weight associated with the kth variable, and θ represents the threshold value. The threshold value is defined for each site in such a way that it permits a compromise between the amount of information to be used and the degree of hydrological homogeneity of the neighborhood (Ouarda et al. 1999). For more details, the reader is referred to Burn (1990b) and GREHYS (1996).
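As a minimal sketch of the ROI rule, the Python snippet below computes the weighted Euclidean distance and selects the sites within a threshold. The function names and the toy attribute values are hypothetical. The second helper, which keeps a fixed number of closest sites, is one possible way of fixing the neighborhood size rather than the distance threshold.

```python
import math


def roi_distance(x_i, x_j, weights):
    """Weighted Euclidean distance D_ij between two standardized sites."""
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j))
    )


def region_of_influence(target, sites, weights, theta):
    """Indices of gauged sites whose distance to the target is at most theta."""
    return [
        j for j, x_j in enumerate(sites)
        if roi_distance(target, x_j, weights) <= theta
    ]


def nearest_sites(target, sites, weights, n_sites):
    """Alternative rule: keep the n_sites closest gauged sites."""
    order = sorted(
        range(len(sites)),
        key=lambda j: roi_distance(target, sites[j], weights),
    )
    return order[:n_sites]


# Toy standardized attributes for three gauged sites (made-up values).
sites = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
target = [0.0, 0.0]
weights = [1.0, 1.0]

roi = region_of_influence(target, sites, weights, theta=1.5)
closest = nearest_sites(target, sites, weights, n_sites=2)
```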

b. Regional estimation approaches

Once a neighborhood is identified, the methods described below are used to transfer information from the neighborhood stations to the target site.

1) Generalized additive model

GAM (Hastie and Tibshirani 1987) is a flexible class of nonlinear models that is able to efficiently model a wide variety of nonlinear relationships. In addition, it allows for non-Gaussian response variables (Wood 2006) making it relevant for streamflow data. Thus, GAM allows a more realistic description of the hydrological phenomenon because of the flexible nonparametric fitting of the smooth functions.

Formally, a GAM is defined as (Wood 2006)
$$g(Y) = \alpha + \sum_{j=1}^{m} f_j(X_j) + \varepsilon, \tag{4}$$
where g is a monotonic link function and fj are smooth functions giving the relationship between the explanatory variables Xj and the response Y. Parameter α is the intercept and ε is the error term. The structure of Eq. (4) allows for a distinct interpretation of each explanatory variable.
To estimate the model, the smooth functions fj are expressed as a set of q spline basis functions, a common choice for smoothing (Wahba 1990). They are expressed as
$$f_j(X) = \sum_{i=1}^{q} \beta_{ji}\, b_{ji}(X), \tag{5}$$
where βji are unknown parameters to be estimated and bji are the spline basis functions. The expansion in (5) linearizes the model, which can then be estimated through backfitting (Hastie and Tibshirani 1987) or simple penalized least squares (Wood 2004).

For more details, the reader is referred to Wood (2006, 2017).

2) Multivariate adaptive regression splines

MARS was introduced by Friedman (1991) as a flexible nonparametric regression approach able to deal with high-dimensional data. The MARS model f(X) can be seen as a flexible extension of GAM, in that it is expressed as a linear combination of basis functions and their interactions as
$$f(X) = \beta_0 + \sum_{n=1}^{r} \beta_n B_n(X), \tag{6}$$
where β0 is the intercept, and βn are the regression coefficients of the basis functions Bn(X). In the MARS model, the Bn(X) terms can take one of the following forms: (i) a constant (just one term), which represents the intercept; (ii) a linear spline function of a single variable Xj, called a hinge function, i.e., of the form hm(Xj) = (tm − Xj)+ or hm(Xj) = (Xj − tm)+, where tm is a knot; and (iii) a product of two or more hinge functions, e.g., Bn(X) = hm(Xj)hm(Xk) where j ≠ k. The latter represents an interaction between two or more variables. The hinge functions are defined in pairs separated by a knot, which represents an inflection point along the range of a given explanatory variable (see Fig. 1). Allowing products of several linear spline terms such as hm(Xj) = (tm − Xj)+ as basis functions enables the integration of interactions in the model, an aspect GAMs are not well designed for.
Fig. 1. Knots and linear splines for a simple example of MARS.

Citation: Journal of Hydrometeorology 21, 12; 10.1175/JHM-D-19-0213.1

In mathematical terms, the hinge functions hm(Xj) are defined as (Rounaghi et al. 2015)
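$$(t - X_j)_+ = \begin{cases} t - X_j, & \text{if } t > X_j \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$
$$(X_j - t)_+ = \begin{cases} X_j - t, & \text{if } X_j > t \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$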
where t is the knot position.
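To make the role of the hinge functions concrete, the following Python sketch implements the two hinges and evaluates a small model of the form (6) that includes one interaction term. The coefficients, knots, and the function name `mars_predict` are purely hypothetical and are not taken from any model fitted in this study.

```python
def hinge_pos(x, t):
    """(x - t)+ : zero up to the knot t, then linear."""
    return max(0.0, x - t)


def hinge_neg(x, t):
    """(t - x)+ : linear below the knot t, zero above it."""
    return max(0.0, t - x)


def mars_predict(x1, x2):
    """Evaluate a hypothetical MARS model with one interaction term."""
    return (
        2.0                                              # intercept
        + 1.5 * hinge_pos(x1, 3.0)                       # hinge on x1
        - 0.5 * hinge_neg(x1, 3.0)                       # mirrored hinge on x1
        + 0.8 * hinge_pos(x1, 3.0) * hinge_neg(x2, 1.0)  # interaction x1 * x2
    )


y_above = mars_predict(5.0, 0.0)  # x1 above its knot, x2 below its knot
y_below = mars_predict(1.0, 5.0)  # x1 below its knot, interaction inactive
```

Because each hinge is zero on one side of its knot, the interaction term only contributes where both of its factors are active, which is how a MARS model can represent local interaction effects.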

The main difference between MARS and GAM lies in the estimation algorithm. Whereas the spline bases are defined a priori in GAM, they are constructed iteratively in MARS and hence adapt to the data. Indeed, building the model in (6) is carried out in two phases: (i) a forward addition of linear spline terms [i.e., of the form (7) and (8)] to build a large model and (ii) a backward deletion to remove irrelevant terms. The forward phase begins with an empty model containing only the intercept β0. The basis functions Bn are then iteratively added to the model, each time choosing the variable and knot yielding the largest decrease in the residual error of the model. This process of adding basis functions continues until the model reaches some predetermined maximum number of terms, leading to a large model that may overfit the data. A backward deletion phase is then performed to improve the model performance by removing the least significant basis functions until the best submodel is obtained. Submodels are compared based on the generalized cross validation (GCV) criterion. Figure 2 illustrates the details of the MARS model algorithm.
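A heavily simplified Python sketch of one forward step is given below for illustration. It handles a single predictor, tries each interior data value as a candidate knot, fits the intercept plus the pair of hinges by ordinary least squares, and keeps the knot with the smallest residual sum of squares. It is not the implementation used in this study: the helper names are hypothetical, no interactions or backward phase are included, and the GCV form shown (with a per-term `penalty` whose default here is an assumption) is the one commonly associated with MARS.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss_of_fit(columns, y):
    """Residual sum of squares of the least-squares fit of y on the columns."""
    k = len(columns)
    gram = [
        [sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
        for i in range(k)
    ]
    rhs = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve(gram, rhs)
    fitted = [
        sum(beta[i] * columns[i][obs] for i in range(k))
        for obs in range(len(y))
    ]
    return sum((a - b) ** 2 for a, b in zip(y, fitted))


def best_knot(x, y):
    """Try interior data values as knots; return (knot, rss) with lowest RSS."""
    ones = [1.0] * len(x)
    best = None
    for t in sorted(set(x))[1:-1]:
        pos = [max(0.0, v - t) for v in x]   # (x - t)+
        neg = [max(0.0, t - v) for v in x]   # (t - x)+
        rss = rss_of_fit([ones, pos, neg], y)
        if best is None or rss < best[1]:
            best = (t, rss)
    return best


def gcv(rss, n_obs, n_terms, penalty=3.0):
    """GCV with effective parameters n_terms + penalty * (n_terms - 1) / 2."""
    effective = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_obs * (1.0 - effective / n_obs) ** 2)


# Toy data with a single kink at x = 3 (made-up values).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0 + 2.0 * max(0.0, v - 3.0) for v in x]
knot, rss = best_knot(x, y)
```

On this toy series the search recovers the knot at x = 3, where the hinge pair reproduces the data almost exactly. In the backward phase, a criterion such as `gcv` would be used to compare submodels of different sizes.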

Fig. 2. Graph of MARS modeling process.

Another interesting feature of MARS is the assessment of the variable importance for the prediction of the response. Variable importance can be measured in two different ways: (i) the number of submodels that include the variable, or (ii) the increase in GCV caused by deleting the considered variables from the final MARS model (e.g., Roy et al. 2018).

3. Methodology

a. Regional models

In this study, the methods presented in section 2 for neighborhood delineation (CCA and ROI) are used in combination with the regional estimation models GAM and MARS for transfer of hydrological information. As mentioned in section 1, other evaluated models are obtained by applying the GAM and MARS using all stations, i.e., without defining any neighborhoods. Table 1 summarizes all six resulting combinations.

Table 1. Adopted regional models.

The two most commonly used neighborhood approaches, the CCA and the ROI (Ouarda 2016), are applied to the DHR using two sets of variables. For these methods, the relevant variables are selected based on their correlation degree with the hydrological variables.

Considering the classical procedures used to define the threshold in ROI and CCA, the density of stations in the neighborhoods can vary considerably from one region to another. Indeed, for a given fixed threshold, stations located near the center of the point cloud defined by the canonical space for CCA or the Euclidean space for ROI will have more stations within their neighborhoods, and vice versa (Leclerc and Ouarda 2007). Since the sample size may affect the accuracy of the estimates obtained by regression models, it was decided that, for each target station, the size of the region is increased until a selected optimal size is reached. The optimal number of stations to be considered in the DHR step is chosen based on the optimization procedure of Ouarda et al. (2001). The optimal number of sites in the neighborhood is the one that minimizes a given performance criterion of the log-linear model applied in each neighborhood.

MARS is fitted using the R package “earth” (Milborrow 2018). The application of MARS needs the tuning of three main parameters (see Fig. 2): the maximum number of terms in the model in the forward phase (Nk), the degree of interaction (degree), and the maximum number of terms in the backward phase (N_prune). A range of values of these parameters was tested and evaluated in order to optimize them based on the GCV, the residual sum of squares (RSS), and the coefficient of determination (R2) criteria of the fitted models.

GAM is also implemented in R, through the package “mgcv” (Wood 2006). The thin plate regression spline is used in this study as the basis bji in the smoothing function fj in (5). It is selected due to its advantages, i.e., low calculation time, flexibility, and fewer parameters compared to other smoothing functions (Wood 2003). The link function g used in (4) is the identity function because the log-transformed quantiles are approximately normal, as considered in Ouali et al. (2017).

Different physio-meteorological variables are considered in each regional model. A backward stepwise approach is applied in this study to select the relevant explanatory variables to be used in each RE model (GAM and MARS). This method is presented in the next section.

b. Variable selection

The backward stepwise selection procedure is applied in this work to select the optimal explanatory variables as in Ouarda et al. (2018) and Chebana et al. (2014). It consists of progressively deleting the least effective variables from an initial full model containing all available variables. At each step, the removed variable is, for GAM, the one having the highest p value for the null hypothesis that its smooth term is zero and, for MARS, the one whose inclusion yields the largest increase in the GCV score of the model.

Note that the MARS algorithm naturally includes a variable selection feature since it builds a sparse model and a variable for which no term is added is by default discarded. This is not the case for GAM within which an automatic backward stepwise procedure was specially developed for this study.
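The general shape of a backward stepwise procedure can be sketched as follows. This is a generic illustration rather than the procedure developed for this study: the score function stands in for whichever criterion is used (lower is better), the stopping rule (stop when no removal improves the score) is an assumption, and the toy score and its set of "useful" variables are invented solely to make the example run.

```python
def backward_stepwise(variables, score):
    """Drop one variable at a time while doing so lowers score(subset)."""
    current = list(variables)
    best_score = score(current)
    while len(current) > 1:
        # Score every model obtained by removing a single variable.
        candidates = [
            (score([v for v in current if v != drop]), drop)
            for drop in current
        ]
        new_score, drop = min(candidates)
        if new_score >= best_score:
            break  # no removal improves the criterion
        current.remove(drop)
        best_score = new_score
    return current, best_score


# Hypothetical criterion: penalize missing useful variables and extra ones.
useful = {"AREA", "PLAKE"}


def toy_score(subset):
    missing = len(useful - set(subset))
    extra = len(set(subset) - useful)
    return 10.0 * missing + 1.0 * extra


selected, final_score = backward_stepwise(
    ["AREA", "MBS", "PLAKE", "MATP"], toy_score
)
```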

c. Validation

For each RFA combination in Table 1, performances are evaluated using a leave-one-out cross validation, commonly called jackknife procedure in the field of hydrology. It consists of temporarily deleting each site to consider it the target one and perform RE. This process is repeated for each gauged site. Then, the regional estimate is compared to its observed values. Note that, in statistics, the validation with the jackknife technique is carried out on the retained data, not on the data removed as in the leave-one-out cross validation (Quenouille 1949). However, we will retain the jackknife term for ease of presentation.
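The jackknife loop described above can be sketched as follows. The `fit` and `predict` arguments are placeholders for any regional model; the mean-based model used here is a deliberately trivial stand-in, and all values are made up.

```python
def jackknife(sites, responses, fit, predict):
    """Leave-one-out estimates: each site is removed in turn, the model is
    refitted on the remaining sites, and the removed site is predicted."""
    estimates = []
    for i in range(len(sites)):
        train_x = sites[:i] + sites[i + 1:]
        train_y = responses[:i] + responses[i + 1:]
        model = fit(train_x, train_y)
        estimates.append(predict(model, sites[i]))
    return estimates


# Placeholder regional model: the mean response of the retained sites.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


sites = [[1.0], [2.0], [3.0], [4.0]]      # toy site attributes
quantiles = [10.0, 20.0, 30.0, 40.0]      # toy at-site quantiles
loo = jackknife(sites, quantiles, fit_mean, predict_mean)
```

When neighborhoods are used, the delineation step would be repeated inside the loop for each removed site, so that the target never contributes to its own estimate.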

Based on the jackknife procedure, several standard performance criteria are used to evaluate the prediction power of each regional model (e.g., Ouali et al. 2016). First, the Nash criterion (NASH) gives a global evaluation of the prediction quality. Second, the root-mean-square error (RMSE) provides information about the accuracy of the prediction on an absolute scale, and the relative RMSE (RRMSE) removes the impact of each site’s order of magnitude from the RMSE computation. Finally, the bias (BIAS) and the relative bias (RBIAS) provide a measure of the magnitude of the systematic overestimation or underestimation of a model.
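For illustration, the five criteria can be computed as in the sketch below. The exact formulas are not given in this section, so the definitions used here are assumptions based on their usual forms: errors are taken as observed minus estimated, and the relative criteria are expressed in percent. The function name and the toy values are hypothetical.

```python
import math


def performance_criteria(observed, estimated):
    """Return NASH, RMSE, RRMSE (%), BIAS, and RBIAS (%) for jackknife output."""
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [q - q_hat for q, q_hat in zip(observed, estimated)]
    rel_errors = [e / q for e, q in zip(errors, observed)]
    sse = sum(e ** 2 for e in errors)
    return {
        "NASH": 1.0 - sse / sum((q - mean_obs) ** 2 for q in observed),
        "RMSE": math.sqrt(sse / n),
        "RRMSE": 100.0 * math.sqrt(sum(r ** 2 for r in rel_errors) / n),
        "BIAS": sum(errors) / n,
        "RBIAS": 100.0 * sum(rel_errors) / n,
    }


observed = [10.0, 20.0, 30.0, 40.0]     # toy at-site quantiles
estimated = [12.0, 18.0, 33.0, 36.0]    # toy jackknife estimates
criteria = performance_criteria(observed, estimated)
```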

4. Case study and datasets

The dataset considered in the present paper consists of 151 hydrometric stations located in the southern part of the province of Quebec, Canada (Fig. 3). Two versions of the dataset with different variables are considered. The first is a standard one (STA) with only well-known variables used in previous RFA studies (e.g., Shu et al. 2007; Chebana et al. 2014; Durocher et al. 2016; Ouali et al. 2016; Wazneh et al. 2013, 2015, 2016). Note that geographical coordinates of the stations are considered instead of the geographical coordinates of the centroids. The second is an extended dataset (EXTD) combining STA with less common variables characterizing the drainage network systems. Table 2 lists all variables considered and indicates whether they are in the EXTD dataset. Thorough definitions of the new variables can be found in, for example, Adhikary and Dash (2018). These new variables are calculated based on drainage networks extracted from digital elevation models (DEMs) using the D8 approach implemented in ArcGIS (Arc Hydro) (Jenson and Domingue 1988; O’Callaghan and Mark 1984; Tarboton et al. 1991). This method consists of calculating the flow direction and the flow accumulation layers based on the direction of the steepest slope among the eight neighbors of a given DEM cell. Using this information, the drainage networks can be defined considering a constant threshold value which represents the stream head locations (O’Callaghan and Mark 1984). Descriptive statistics of the new variables used in the EXTD dataset (Msilini et al. 2020, manuscript submitted to J. Hydrol.) are given in Table 3. In both datasets the considered hydrological response variables are at-site specific flood quantiles, chosen to match the specific return periods of 10, 50, and 100 years. These quantiles are thus denoted by QS10, QS50, and QS100.

Fig. 3. Geographical location of the studied sites in the southern part of the province of Quebec, Canada.

Table 2. Variables used in the STA and the EXTD. An asterisk indicates variables considered in the standard dataset (STA). Plus signs indicate variables considered in the extended dataset (EXTD). The variables considered in the neighborhoods and their transformations are presented in bold.

Table 3. Descriptive statistics of new physiographical variables.

To ensure the convergence of the Mahalanobis distance to a χ² distribution in CCA, a logarithmic transformation is applied to AREA, MBS, MATP, DDBZ, and RT, and a square root transformation to PLAKE and RC, in order to achieve approximate normality. After transformation, normal Q–Q plots indicate that all variables are approximately normal.

5. Results and discussion

a. Region delineation with CCA and ROI

The CCA and the ROI are applied to the DHR using two sets of variables. The first set contains variables from STA, which are the area (AREA), mean basin slope (MBS), percentage of the area occupied by lakes (PLAKE), mean annual total precipitation (MATP), mean annual degree days below 0°C (DDBZ), and the longitude of the centroid of the basin (LONGC). The second one includes variables from the EXTD, namely, PLAKE, MATP, DDBZ, LONGC, texture ratio (RT), and circularity ratio (RC).

The obtained optimum sizes of the neighborhood are nopt (STA) = 85 sites and nopt (EXTD) = 78 sites according to the RRMSE for the CCA method. For the ROI approach, we obtain nopt (STA) = 54 sites and nopt (EXTD) = 44 sites according to the same criterion. Thus, these neighborhood sizes are used for each target station.

b. Selection of optimal variables

The selection of significant explanatory variables is applied for each specific quantile (QS10, QS50, and QS100) and for each estimation model (GAM and MARS). Table 4 summarizes the final variables for each dataset (STA and EXTD). Following the application of the backward technique with GAM and MARS, we note the selection of the same new variables for the two models (RN, MRL, and DD). The definition of these variables can be found, for example, in Adhikary and Dash (2018). For each quantile and for each model, different combinations of variables are selected. The variables that seem to be the most important are AREA, PLAKE, MCL, and LONGC.

Table 4. Explanatory variables selected for the various regression models.

c. MARS model results

Figure 4 shows the variable importance graph for QS100 obtained using the EXTD (we present only the results of QS100 to avoid repetitions). The variable with the most influence for the QS100 is the percentage of the area occupied by lakes, PLAKE. Indeed, lakes act as a sponge absorbing the excess water during extreme events. Thus they may have a significant effect on flood peaks.

Fig. 4. Variable importance while predicting QS100. The red line represents the variation of the square root GCV values caused by the removal of a given variable from the MARS model during the backward phase. The black line represents the variation of the number of submodels including a given variable.

Citation: Journal of Hydrometeorology 21, 12; 10.1175/JHM-D-19-0213.1

Figure 5 shows the GCV R2 (GRSq) value for the QS100 predictions versus the number of terms in the final MARS model. The GCV R2 statistic is equivalent to the ordinary R2 statistic calculated with the variance for error replaced with the GCV statistic. It thus quantifies the goodness of fit expected on data not used to fit the model. The vertical dashed line at 12 indicates the optimal number of terms retained, beyond which marginal increases in GCV R2 are less than 0.001. The 12 final terms include seven variables in this case. Five terms are related to interaction effects.

Fig. 5. MARS model selection for QS100. The gray line and the red dashed line represent, respectively, the variation of the GCV R2 (GRSq) and the R2 (RSq) values in the backward phase. For this model, 12 terms were retained, which are based on seven predictors (nbr preds).
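The relation between RSq and GRSq can be sketched numerically. The code below assumes the GCV form commonly documented for MARS software such as the earth R package (Milborrow 2018), in which the effective number of parameters is the number of terms plus a penalty times (number of terms - 1)/2. The penalty value of 3 and all numerical inputs are hypothetical and are not taken from the fitted models.

```python
def gcv(rss, n_obs, n_terms, penalty=3.0):
    """Generalized cross-validation criterion for a MARS model with n_terms terms."""
    # Effective number of parameters: the terms plus a charge for the knot search.
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_obs * (1.0 - effective_params / n_obs) ** 2)

def grsq(rss, rss_null, n_obs, n_terms, penalty=3.0):
    """GCV R2: 1 - GCV(model) / GCV(intercept-only model)."""
    gcv_null = gcv(rss_null, n_obs, n_terms=1, penalty=penalty)
    return 1.0 - gcv(rss, n_obs, n_terms, penalty) / gcv_null

def rsq(rss, rss_null):
    """Ordinary R2, for comparison."""
    return 1.0 - rss / rss_null

# Hypothetical numbers: 54 stations in the neighborhood, 12 retained terms.
example_rsq = rsq(rss=2.0, rss_null=10.0)
example_grsq = grsq(rss=2.0, rss_null=10.0, n_obs=54, n_terms=12)
```

Because every additional term inflates the GCV denominator penalty, GRSq stays below RSq and eventually stops increasing, which is the behavior used to choose the number of terms.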

d. Comparison between MARS and GAM models

Table 5 shows the jackknife results for each model combination. The comparison of GAM and MARS models indicates that the simple piecewise-linear spline fitting performed by MARS captures more information from the EXTD than the more sophisticated smoothing functions used in GAM. Indeed, MARS adds terms iteratively, leading to a simple and well-performing model that includes interaction effects. This model performs well with the ROI, which contains a smaller number of stations than CCA. Thus, based on the results of our case study, MARS seems applicable in small neighborhoods even with complex terms (interaction effects) and able to give good predictions with fewer stations than GAM.

Table 5. Jackknife validation results (STA and EXTD). Best results are in bold.
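The jackknife procedure behind these results can be sketched as a leave-one-out loop in which each station is in turn treated as ungauged. The helper names and the ordinary least squares stand-in model below are assumptions used only to keep the example self-contained; in the study the regional model is GAM or MARS, fitted within the neighborhood of each target station.

```python
import numpy as np

def jackknife_predictions(x, y, fit, predict):
    """Leave-one-out predictions: each site is predicted from all the others."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i      # drop site i, treat it as ungauged
        model = fit(x[keep], y[keep])
        preds[i] = predict(model, x[i:i + 1])[0]
    return preds

# Stand-in regional model (ordinary least squares), for illustration only.
def fit_ols(x, y):
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef

rng = np.random.default_rng(1)
x_toy = rng.normal(size=(30, 2))
y_toy = 1.0 + x_toy @ np.array([0.5, -0.3])   # noise-free synthetic response
loo = jackknife_predictions(x_toy, y_toy, fit_ols, predict_ols)
```

The vector of leave-one-out predictions is then compared with the at-site quantiles through criteria such as NASH, RRMSE, and RBIAS.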

The response functions fitted by GAM and MARS models for selected explanatory variables are given in Fig. 6. It can be seen that the piecewise-linear functions fitted by MARS closely approximate the smooth curves fitted by GAM, in a simpler way. This result was also observed by Leathwick et al. (2006) in a comparative study of GAM and MARS in the field of ecology. The smooth curves generated by GAM add degrees of freedom to the model, which makes it relatively more complex. This may be the reason for the better prediction results obtained by MARS compared with GAM.

Fig. 6. Examples of smoothing functions produced by the GAM and MARS models for some explanatory variables. Dashed lines represent the 95% confidence intervals (CI). A Bayesian approach to variance estimation is used to calculate the CI for GAM. For MARS, the CI are obtained as for a linear regression model, since a MARS model is simply a linear regression on linear basis functions. All the terms are estimated with a sum-to-zero constraint, leading to lower uncertainty associated with the mean in the plots.
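To make the notion of linear basis functions concrete, the sketch below builds a pair of hinge functions around a single knot and recovers their coefficients by ordinary least squares. The knot location, the coefficients, and the synthetic response are hypothetical; a real MARS fit also searches for the knots, which is not shown here.

```python
import numpy as np

def hinge(x, knot, direction=+1):
    """MARS basis function: max(0, x - knot) or, with direction=-1, max(0, knot - x)."""
    return np.maximum(0.0, direction * (np.asarray(x, dtype=float) - knot))

# Synthetic response with a change of slope at x = 2 (illustrative only).
x = np.linspace(0.0, 4.0, 41)
y = 1.0 + 0.5 * hinge(x, 2.0, +1) - 1.5 * hinge(x, 2.0, -1)

# With the knot fixed, the coefficients follow from ordinary least squares.
design = np.column_stack([np.ones_like(x), hinge(x, 2.0, +1), hinge(x, 2.0, -1)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Once the basis functions are fixed, the model is linear in its coefficients, which is why standard linear regression intervals can be drawn around the MARS curves.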

Figure 7 illustrates the interaction effects between some explanatory variables fitted by GAM and MARS models. Note that, to make the comparison possible, we considered in GAM the same interactions that were automatically identified by MARS. The interaction surfaces generated by the two models are also close. GAM gives more continuous and complex interaction effects, which lead to a large model with a large number of coefficients. This makes it difficult or impossible to integrate interaction effects in GAM when the model contains a large number of explanatory variables. For example, for the QS100, integrating into GAM the interactions identified by MARS, with the same variables, gives a model with 79 coefficients, versus only 12 using MARS. In addition, MARS searches for and integrates interaction effects automatically into the model, which overall yields better flood quantile estimates than those obtained by GAM. As a simple example of interaction, we take the first effect illustrated in Fig. 7, which represents the predicted response (specific quantile) as DD and LONGC vary. It can be seen that LONGC has little effect on the level of the hydrological variable unless DD is high, in which case a nonlinear effect is seen.

Fig. 7. Examples of the multivariate effects of some explanatory variables produced by the GAM and MARS models on the response variable (interactions).
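A MARS interaction term is a product of hinge functions, so one predictor can switch the effect of another on or off. The following sketch uses hypothetical standardized values, knots, and a hypothetical coefficient; the argument names `x1` and `x2` merely stand for two predictors such as DD and LONGC and do not reproduce the fitted surface of Fig. 7.

```python
import numpy as np

def hinge(x, knot):
    """Right-sided MARS hinge: max(0, x - knot)."""
    return np.maximum(0.0, np.asarray(x, dtype=float) - knot)

def interaction_term(x1, x2, knot1, knot2, coef):
    """Degree-2 MARS term: one coefficient times the product of two hinges."""
    return coef * hinge(x1, knot1) * hinge(x2, knot2)

x2_values = np.array([0.5, 1.5])
# Below its knot, the first predictor cancels the term: x2 has no effect.
low_x1 = interaction_term(x1=-1.0, x2=x2_values, knot1=0.0, knot2=0.0, coef=2.0)
# Above its knot, the term responds to x2.
high_x1 = interaction_term(x1=1.0, x2=x2_values, knot1=0.0, knot2=0.0, coef=2.0)
```

Each such term costs a single coefficient, which is consistent with the small number of coefficients reported for the MARS model.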

e. Comparison of regional models

According to Table 5 (see above), the highest NASH values (0.80) and the lowest RRMSE values (28.30% for QS100) are given by the ROI/MARS/EXTD, which leads to the most accurate estimates compared to all other combinations. It can also be seen that, with ALL, MARS has a comparable performance to GAM for both datasets. However, using the neighborhoods, especially the ROI, MARS overall outperforms GAM in terms of RRMSE and RBIAS criteria. This may be attributable to the flexibility of MARS and its generalization ability in small neighborhoods.
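For reference, the criteria mentioned above can be computed as in the sketch below. The formulas are the forms commonly used in RFA and are assumed here rather than copied from the study; in particular, the sign convention of RBIAS (observed minus estimated) is an assumption, and the sample values are hypothetical.

```python
import numpy as np

def _relative_errors(obs, pred):
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return (obs - pred) / obs

def nash(obs, pred):
    """Nash efficiency criterion (1 is a perfect fit)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rrmse(obs, pred):
    """Relative root-mean-square error, in percent."""
    return 100.0 * np.sqrt(np.mean(_relative_errors(obs, pred) ** 2))

def rbias(obs, pred):
    """Relative mean bias, in percent."""
    return 100.0 * np.mean(_relative_errors(obs, pred))

# Hypothetical at-site and regional estimates at three stations.
obs_toy = [1.0, 2.0, 4.0]
pred_toy = [1.1, 1.8, 4.0]
scores = {"NASH": nash(obs_toy, pred_toy),
          "RRMSE": rrmse(obs_toy, pred_toy),
          "RBIAS": rbias(obs_toy, pred_toy)}
```

In this toy case the relative errors of the first two stations cancel, so RBIAS is close to zero while RRMSE remains positive, which shows why the two criteria are reported together.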

Figure 8 illustrates the relative error, which is the most important criterion (Hosking and Wallis 2005), for the best models (ROI/MARS/EXTD and ROI/GAM/EXTD) as a function of the sites ordered according to their area. One can notice that, overall, MARS with the EXTD performs better than GAM. The figure also shows that the performances for basins of extreme size are much worse than those for medium-sized basins.

Fig. 8. Relative errors associated with the at-site quantile QS100 calculated using ROI/GAM/EXTD and ROI/MARS/EXTD.

Figure 9 presents the differences between the relative errors of MARS and GAM calculated using ROI/EXTD. One can notice that, in terms of RRMSE, MARS outperforms GAM at 84 sites out of 151, which represents 56% of the total number of sites. Accordingly, MARS is shown to be a simple, well-performing model that can be considered as an alternative RE model.

Fig. 9. Differences in relative errors associated with the at-site quantile QS100 between MARS and GAM. The considered combinations are ROI/GAM/EXTD and ROI/MARS/EXTD.
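The site-by-site comparison summarized in Fig. 9 amounts to counting the stations at which one model has the smaller absolute relative error. The sketch below checks the proportion quoted in the text and applies the count to four hypothetical stations; the function name and the sample errors are illustrative assumptions.

```python
import numpy as np

def share_of_sites_improved(rel_err_a, rel_err_b):
    """Count sites where model A has a smaller absolute relative error than model B."""
    a = np.abs(np.asarray(rel_err_a, dtype=float))
    b = np.abs(np.asarray(rel_err_b, dtype=float))
    n_better = int(np.sum(a < b))
    return n_better, 100.0 * n_better / len(a)

# Proportion quoted in the text: 84 of 151 sites, about 56% after rounding.
reported_share = 100.0 * 84 / 151

# Hypothetical relative errors (%) at four sites, for illustration only.
n_better, share = share_of_sites_improved([5.0, -20.0, 12.0, 3.0],
                                          [-8.0, 15.0, 30.0, 2.0])
```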

6. Conclusions

The aim of this study is to introduce MARS in the RFA of extreme hydrological variables and to compare its performance to GAM. The MARS model is able to model complex relationships between physio-meteorological variables, including variables dealing with drainage network characteristics, and flood quantiles at ungauged sites.

MARS is hereby compared to the GAM, which is gaining popularity in RFA and is one of the best performing models. Results show that slightly better flood quantile estimates are obtained from regional models that combine MARS with the EXTD, which extends the STA with additional variables dealing with drainage network properties. Results also indicate that better performances are obtained with the ROI, which uses fewer stations than CCA. This suggests that MARS is able to transfer hydrological information adequately even with fewer data than GAM. Further efforts are required to generalize this conclusion and to evaluate the benefits of MARS in other study areas and with other hydrological variables.

Although MARS is an effective and simple tool for estimation that can be used in RFA, there are some constraints such as the maximum number of terms and the maximum allowable degree of interaction in the forward pass that have to be specified by the user. These depend on the problem at hand and should be considered carefully. In addition, MARS does not cope well with missing data and, like many machine learning algorithms, is prone to overfitting. Note, however, that the backward deletion phase is meant to address this drawback.

Aside from the abovementioned shortcomings, MARS is easy to use, as shown in this work. It is able to address the issues of a high number of variables, nonlinearity, and interactions involved in hydrological phenomena. This yields flood quantile estimates that compete with those obtained from GAM, while being simpler and more applicable to smaller datasets. Flood quantiles represent important information that is used in the design of hydraulic structures (e.g., dams), whose construction is very expensive. The availability of simple yet sophisticated tools for the reliable estimation of flood quantiles is crucial for hydraulic engineers.

In this work we considered linear neighborhood approaches (CCA and ROI), which are the most used methods in RFA. Future efforts can focus on the assessment of the performance of the MARS model in combination with nonlinear neighborhood approaches such as the nonlinear canonical correlation analysis (Ouali et al. 2016) and the nonlinear neighborhood based on the statistical depth function (Wazneh et al. 2016).

Acknowledgments

Financial support for this work was graciously provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs Program (CRC), and the University Mission of Tunisia in Montreal (MUTAN). The authors are grateful to Natural Resources Canada and the USGS services for the employed DEM data. The authors would also like to thank the Ministry of Sustainable Development, Environment, and Fight Against Climate Change (MDDELCC) services for the employed dataset (STA). The authors thank the Editor, Prof. Andrew Wood, and three anonymous reviewers for their comments, which helped improve the quality of the manuscript.

APPENDIX

Abbreviations

ANN     Artificial neural network
AREA    Basin area
BH      Basin relief
BIAS    Mean bias
CCA     Canonical correlation analysis
DD      Drainage density
DDBZ    Mean annual degree days below 0°C
DEM     Digital elevation model
DHR     Delineation of homogeneous regions
Edf     Estimated smooth degree of freedom
EXTD    Extended dataset
FS      Stream frequency
GAM     Generalized additive model
GCV     Generalized cross validation
IF      Infiltration number
LATC    Latitude of the centroid of the basin
LONGC   Longitude of the centroid of the basin
MALP    Mean annual liquid precipitation
MALPS   Mean annual liquid precipitation (summer–fall)
MARS    Multivariate adaptive regression splines
MASP    Mean annual solid precipitation
MATP    Mean annual total precipitation
MBS     Mean basin slope
MCL     Main channel length
MCS     Main channel slope
MRB     Mean bifurcation ratio
MRL     Mean stream length ratio
NASH    Nash efficiency criterion
NL-CCA  Nonlinear canonical correlation analysis
PFOR    Percentage of the area occupied by forest
PL1     Percentage of first-order stream lengths
PLAKE   Percentage of the area occupied by lakes
PN1     Percentage of first-order streams
QST     Specific quantile associated with the return period T
R2      Coefficient of determination
RB      Bifurcation ratio
RBIAS   Relative mean bias
RC      Circularity ratio
RE      Regional estimation
RFA     Regional frequency analysis
RL      Stream length ratio
RMSE    Root-mean-square error
RN      Ruggedness number
ROI     Region of influence
RRMSE   Relative root-mean-square error
RSS     Residual sum of squares
RT      Texture ratio
STA     Standard dataset
WMRB    Weighted mean bifurcation ratio

REFERENCES

  • Adhikary, P. P., and J. Dash, 2018: Morphometric analysis of Katra Watershed of Eastern Ghats: A GIS approach. Int. J. Curr. Microbiol. Appl. Sci., 7, 1651–1665, https://doi.org/10.20546/ijcmas.2018.703.198.

  • Aziz, R., A. Rahman, G. Fang, and S. Shrestha, 2014: Application of artificial neural networks in regional flood frequency analysis: A case study for Australia. Stochastic Environ. Res. Risk Assess., 28, 541–554, https://doi.org/10.1007/s00477-013-0771-5.

  • Balshi, M. S., A. D. McGuire, P. Duffy, M. Flannigan, J. Walsh, and J. Melillo, 2009: Assessing the response of area burned to changing climate in western boreal North America using a Multivariate Adaptive Regression Splines (MARS) approach. Global Change Biol., 15, 578–600, https://doi.org/10.1111/j.1365-2486.2008.01679.x.

  • Bayentin, L., S. El Adlouni, T. B. M. J. Ouarda, P. Gosselin, B. Doyon, and F. Chebana, 2010: Spatial variability of climate effects on ischemic heart disease hospitalization rates for the period 1989-2006 in Quebec, Canada. Int. J. Health Geogr., 9, 5, https://doi.org/10.1186/1476-072X-9-5.

  • Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Oxford University Press, 482 pp.

  • Bond, N. R., and M. J. Kennard, 2017: Prediction of hydrologic characteristics for ungauged catchments to support hydroecological modeling. Water Resour. Res., 53, 8781–8794, https://doi.org/10.1002/2017WR021119.

  • Booker, D. J., and R. A. Woods, 2014: Comparing and combining physically-based and empirically-based approaches for estimating the hydrology of ungauged catchments. J. Hydrol., 508, 227–239, https://doi.org/10.1016/j.jhydrol.2013.11.007.

  • Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.

  • Brunner, M. I., R. Furrer, A. E. Sikorska, D. Viviroli, J. Seibert, and A.-C. Favre, 2018: Synthetic design hydrographs for ungauged catchments: A comparison of regionalization methods. Stochastic Environ. Res. Risk Assess., 32, 1993–2023, https://doi.org/10.1007/s00477-018-1523-3.

  • Burn, D. H., 1990a: An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J., 35, 149–165, https://doi.org/10.1080/02626669009492415.

  • Burn, D. H., 1990b: Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res., 26, 2257–2265, https://doi.org/10.1029/WR026i010p02257.

  • Chebana, F., and T. B. M. J. Ouarda, 2008: Depth and homogeneity in regional flood frequency analysis. Water Resour. Res., 44, W11422, https://doi.org/10.1029/2007WR006771.

  • Chebana, F., C. Charron, T. B. M. J. Ouarda, and B. Martel, 2014: Regional frequency analysis at ungauged sites with the generalized additive model. J. Hydrometeor., 15, 2418–2428, https://doi.org/10.1175/JHM-D-14-0060.1.

  • Cutler, D. R., T. C. Edwards Jr., K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler, 2007: Random forests for classification in ecology. Ecology, 88, 2783–2792, https://doi.org/10.1890/07-0539.1.

  • Deo, R. C., O. Kisi, and V. P. Singh, 2017: Drought forecasting in eastern Australia using multivariate adaptive regression spline, least square support vector machine and M5Tree model. Atmos. Res., 184, 149–175, https://doi.org/10.1016/j.atmosres.2016.10.004.

  • Diez-Sierra, J., and M. del Jesus, 2019: Subdaily rainfall estimation through daily rainfall downscaling using random forests in Spain. Water, 11, 125, https://doi.org/10.3390/w11010125.

  • Durocher, M., F. Chebana, and T. B. M. J. Ouarda, 2016: On the prediction of extreme flood quantiles at ungauged locations with spatial copula. J. Hydrol., 533, 523–532, https://doi.org/10.1016/j.jhydrol.2015.12.029.

  • Emamgolizadeh, S., S. M. Bateni, D. Shahsavani, T. Ashrafi, and H. Ghorbani, 2015: Estimation of soil cation exchange capacity using genetic expression programming (GEP) and multivariate adaptive regression splines (MARS). J. Hydrol., 529, 1590–1600, https://doi.org/10.1016/j.jhydrol.2015.08.025.

  • Friedman, J. H., 1991: Multivariate adaptive regression splines. Ann. Stat., 19, 1–67, https://doi.org/10.1214/aos/1176347973.

  • Gal, Y., and Z. Ghahramani, 2016: A theoretically grounded application of dropout in recurrent neural networks. 30th Conf. on Advances in Neural Information Processing Systems, Barcelona, Spain, NIPS, 9 pp., https://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf.

  • Geurts, P., A. Irrthum, and L. Wehenkel, 2009: Supervised learning with decision tree-based methods in computational and systems biology. Mol. Biosyst., 5, 1593–1605, https://doi.org/10.1039/b907946g.

  • GREHYS, 1996: Presentation and review of some methods for regional flood frequency analysis. J. Hydrol., 186, 63–84, https://doi.org/10.1016/S0022-1694(96)03042-9.

  • Hastie, T., and R. Tibshirani, 1987: Generalized additive models: Some applications. J. Amer. Stat. Assoc., 82, 371–386, https://doi.org/10.1080/01621459.1987.10478440.

  • Hosking, J. R. M., and J. R. Wallis, 2005: Regional Frequency Analysis: An Approach Based on L-Moments. Cambridge University Press, 244 pp.

  • Hotelling, H., 1935: The most predictable criterion. J. Educ. Psychol., 26, 139–142, https://doi.org/10.1037/h0058165.

  • Ibbitt, R., and R. Woods, 2004: Re-scaling the topographic index to improve the representation of physical processes in catchment models. J. Hydrol., 293, 205–218, https://doi.org/10.1016/j.jhydrol.2004.01.016.

  • Jenson, S. K., and J. O. Domingue, 1988: Extracting topographic structure from digital elevation data for geographic information system analysis. Photogramm. Eng. Remote Sens., 54, 1593–1600.

  • Jung, K., P. R. Marpu, and T. B. M. J. Ouarda, 2017: Impact of river network type on the time of concentration. Arabian J. Geosci., 10, 546, https://doi.org/10.1007/s12517-017-3323-3.

  • Khalil, B., T. B. M. J. Ouarda, and A. St-Hilaire, 2011: Estimation of water quality characteristics at ungauged sites using artificial neural networks and canonical correlation analysis. J. Hydrol., 405, 277–287, https://doi.org/10.1016/j.jhydrol.2011.05.024.

  • Kisi, O., 2015: Pan evaporation modeling using least square support vector machine, multivariate adaptive regression splines and M5 model tree. J. Hydrol., 528, 312–320, https://doi.org/10.1016/j.jhydrol.2015.06.052.

  • Kisi, O., and K. S. Parmar, 2016: Application of least square support vector machine and multivariate adaptive regression spline models in long term prediction of river water pollution. J. Hydrol., 534, 104–112, https://doi.org/10.1016/j.jhydrol.2015.12.014.

  • Lawrence, S., and C. L. Giles, 2000: Overfitting and neural networks: Conjugate gradient and backpropagation. Proc. IEEE-INNS-ENNS Int. Joint Conf. on Neural Networks, Como, Italy, IEEE, 114–119, https://doi.org/10.1109/IJCNN.2000.857823.

  • Leathwick, J. R., D. Rowe, J. Richardson, J. Elith, and T. Hastie, 2005: Using multivariate adaptive regression splines to predict the distributions of New Zealand’s freshwater diadromous fish. Freshwater Biol., 50, 2034–2052, https://doi.org/10.1111/j.1365-2427.2005.01448.x.

  • Leathwick, J. R., J. Elith, and T. Hastie, 2006: Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecol. Modell., 199, 188–196, https://doi.org/10.1016/j.ecolmodel.2006.05.022.

  • Leclerc, M., and T. B. M. J. Ouarda, 2007: Non-stationary regional flood frequency analysis at ungauged sites. J. Hydrol., 343, 254–265, https://doi.org/10.1016/j.jhydrol.2007.06.021.

  • Lee, T.-S., and I.-F. Chen, 2005: A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Syst. Appl., 28, 743–752, https://doi.org/10.1016/j.eswa.2004.12.031.

  • Lee, T.-S., C.-C. Chiu, Y.-C. Chou, and C.-J. Lu, 2006: Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Comput. Stat. Data Anal., 50, 1113–1130, https://doi.org/10.1016/j.csda.2004.11.006.

  • Leitte, P., C. Petrescu, U. Franck, M. Richter, O. Suciu, R. Ionovici, O. Herbarth, and U. Schlink, 2009: Respiratory health, effects of ambient air pollution and its modification by air humidity in Drobeta-Turnu Severin, Romania. Sci. Total Environ., 407, 4004–4011, https://doi.org/10.1016/j.scitotenv.2009.02.042.

  • Li, Y., Y. He, Y. Su, and L. Shu, 2016: Forecasting the daily power output of a grid-connected photovoltaic system based on multivariate adaptive regression splines. Appl. Energy, 180, 392–401, https://doi.org/10.1016/j.apenergy.2016.07.052.

  • Masselink, R., A. J. A. M. Temme, R. Giménez, J. Casalí, and S. D. Keesstra, 2017: Assessing hillslope-channel connectivity in an agricultural catchment using rare-earth oxide tracers and random forests models. Geogr. Res. Lett., 43, 19–39, https://doi.org/10.18172/cig.3169.

  • Milborrow, S., 2018: Earth: Multivariate adaptive regression splines. R package, version 4.6.3, https://cran.r-project.org/web/packages/earth/index.html.

  • Muñoz, P., J. Orellana-Alvear, P. Willems, and R. Célleri, 2018: Flash-flood forecasting in an Andean mountain catchment—Development of a step-wise methodology based on the random forest algorithm. Water, 10, 1519, https://doi.org/10.3390/w10111519.

  • Niehoff, F., U. Fritsch, and A. Bronstert, 2002: Land-use impacts on storm-runoff generation: Scenarios of land-use change and simulation of hydrological response in a meso-scale catchment in SW-Germany. J. Hydrol., 267, 80–93, https://doi.org/10.1016/S0022-1694(02)00142-7.

  • O’Callaghan, J. F., and D. M. Mark, 1984: The extraction of drainage networks from digital elevation data. Comput. Vision Graphics Image Process., 28, 323–344, https://doi.org/10.1016/S0734-189X(84)80011-0.

  • Ouali, D., F. Chebana, and T. B. M. J. Ouarda, 2016: Non-linear canonical correlation analysis in regional frequency analysis. Stochastic Environ. Res. Risk Assess., 30, 449–462, https://doi.org/10.1007/s00477-015-1092-7.

  • Ouali, D., F. Chebana, and T. B. M. J. Ouarda, 2017: Fully nonlinear statistical and machine-learning approaches for hydrological frequency estimation at ungauged sites. J. Adv. Model. Earth Syst., 9, 1292–1306, https://doi.org/10.1002/2016MS000830.

  • Ouarda, T. B. M. J., 2016: Regional flood frequency modeling. Chow’s Handbook of Applied Hydrology, 3rd ed. V. P. Singh, Ed., McGraw-Hill, 77.71–77.78.

  • Ouarda, T. B. M. J., and C. Shu, 2009: Regional low-flow frequency analysis using single and ensemble artificial neural networks. Water Resour. Res., 45, W11428, https://doi.org/10.1029/2008wr007196.

  • Ouarda, T. B. M. J., M. Lang, B. Bobée, J. Bernier, and P. Bois, 1999: Synthèse de modèles régionaux d'estimation de crue utilisée en France et au Québec. Revue des sciences de l'eau/J. Water Sci., 12, 155–182, https://doi.org/10.7202/705347ar.

  • Ouarda, T. B. M. J., C. Girard, G. S. Cavadias, and B. Bobée, 2001: Regional flood frequency estimation with canonical correlation analysis. J. Hydrol., 254, 157–173, https://doi.org/10.1016/S0022-1694(01)00488-7.

  • Ouarda, T. B. M. J., C. Charron, P. R. Marpu, and F. Chebana, 2016: The generalized additive model for the assessment of the direct, diffuse, and global solar irradiances using SEVIRI images, with application to the UAE. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 9, 1553–1566, https://doi.org/10.1109/JSTARS.2016.2522764.

  • Ouarda, T. B. M. J., C. Charron, Y. Hundecha, A. St-Hilaire, and F. Chebana, 2018: Introduction of the GAM model for regional low-flow frequency analysis at ungauged basins and comparison with commonly used approaches. Environ. Modell. Software, 109, 256–271, https://doi.org/10.1016/j.envsoft.2018.08.031.

  • Pourghasemi, H. M., and N. Kerle, 2016: Random forests and evidential belief function-based landslide susceptibility assessment in Western Mazandaran Province, Iran. Environ. Earth Sci., 75, 185, https://doi.org/10.1007/s12665-015-4950-1.

  • Prasad, A. M., L. R. Iverson, and A. Liaw, 2006: Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9, 181–199, https://doi.org/10.1007/s10021-005-0054-1.

  • Quenouille, M. H., 1949: Problems in plane sampling. Ann. Math. Stat., 20, 355–375, https://doi.org/10.1214/aoms/1177729989.

  • Rahman, A., C. Charron, T. B. M. J. Ouarda, and F. Chebana, 2018: Development of regional flood frequency analysis techniques using generalized additive models for Australia. Stochastic Environ. Res. Risk Assess., 32, 123–139, https://doi.org/10.1007/s00477-017-1384-1.

  • Ramsay, T. O., R. T. Burnett, and D. Krewski, 2003: The effect of concurvity in generalized additive models linking mortality to ambient particulate matter. Epidemiology, 14, 18–23, https://doi.org/10.1097/00001648-200301000-00009.

  • Rounaghi, M. M., M. R. Abbaszadeh, and M. Arashi, 2015: Stock price forecasting for companies listed on Tehran stock exchange using multivariate adaptive regression splines model and semi-parametric splines technique. Physica, 438A, 625–633, https://doi.org/10.1016/j.physa.2015.07.021.

  • Roy, S. S., R. Roy, and V. E. Balas, 2018: Estimating heating load in buildings using multivariate adaptive regression splines, extreme learning machine, a hybrid model of MARS and ELM. Renewable Sustainable Energy Rev., 82, 4256–4268, https://doi.org/10.1016/j.rser.2017.05.249.

  • Saadi, M., L. Oudin, and P. Ribstein, 2019: Random forest ability in regionalizing hourly hydrological model parameters. Water, 11, 1540, https://doi.org/10.3390/w11081540.

  • Shu, C., and D. H. Burn, 2004: Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resour. Res., 40, W09301, https://doi.org/10.1029/2003WR002816.

  • Shu, C., and T. B. M. J. Ouarda, 2007: Flood frequency analysis at ungauged sites using artificial neural networks in canonical correlation analysis physiographic space. Water Resour. Res., 43, W07438, https://doi.org/10.1029/2006WR005142.

  • Sivakumar, B., 2007: Nonlinear determinism in river flow: Prediction as a possible indicator. Earth Surf. Processes Landforms, 32, 969–979, https://doi.org/10.1002/esp.1462.

  • Tarboton, D. G., R. L. Bras, and I. Rodriguez-Iturbe, 1991: On the extraction of channel networks from digital elevation data. Hydrol. Processes, 5, 81–100, https://doi.org/10.1002/hyp.3360050107.

  • Tasker, H., S. A. Hodge, and C. S. Barks, 1996: Region of influence regression for estimating the 50-year flood at ungaged sites. J. Amer. Water Resour. Assoc., 32, 163–170, https://doi.org/10.1111/j.1752-1688.1996.tb03444.x.

  • Wahba, G., 1990: Spline Models for Observational Data. SIAM, 181 pp.

  • Wang, W., X. Chen, P. Shi, and P. H. A. J. M. van Gelder, 2008: Detecting changes in extreme precipitation and extreme streamflow in the Dongjiang River Basin in southern China. Hydrol. Earth Syst. Sci., 12, 207–221, https://doi.org/10.5194/hess-12-207-2008.

  • Wang, Z., C. Lai, X. Chen, B. Yang, S. Zhao, and X. Bai, 2015: Flood hazard risk assessment model based on random forest. J. Hydrol., 527, 1130–1141, https://doi.org/10.1016/j.jhydrol.2015.06.008.

  • Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2013: Depth-based regional index-flood model. Water Resour. Res., 49, 7957–7972, https://doi.org/10.1002/2013WR013523.

  • Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2015: Delineation of homogeneous regions for regional frequency analysis using statistical depth function. J. Hydrol., 521, 232–244, https://doi.org/10.1016/j.jhydrol.2014.11.068.

  • Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2016: Identification of hydrological neighborhoods for regional flood frequency analysis using statistical depth function. Adv. Water Resour., 94, 251–263, https://doi.org/10.1016/j.advwatres.2016.05.013.

  • Wen, R., K. Rogers, N. Saintilan, and J. Ling, 2011: The influences of climate and hydrology on population dynamics of waterbirds in the lower Murrumbidgee River floodplains in Southeast Australia: Implications for environmental water management. Ecol. Modell., 222, 154–163, https://doi.org/10.1016/j.ecolmodel.2010.09.016.

  • Wood, S. N., 2003: Thin plate regression splines. J. Roy. Stat. Soc., 65, 95–114, https://doi.org/10.1111/1467-9868.00374.

  • Wood, S. N., 2004: Stable and efficient multiple smoothing parameter estimation for generalized additive models. J. Amer. Stat. Assoc., 99, 673686, https://doi.org/10.1198/016214504000000980.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Wood, S. N., 2006: Generalized Additive Models: An Introduction with R. 1st ed. CRC Press, 410 pp.

  • Wood, S. N., 2017: Generalized Additive Models: An Introduction with R. 2nd ed. CRC Press, 476 pp.

  • Xu, J., W. Li, M. Ji, F. Lu, and S. Dong., 2010: A comprehensive approach to characterization of the nonlinearity of runoff in the headwaters of the Tarim River, western China. Hydrol. Processes J., 24, 136146, https://doi.org/10.1002/hyp.7484.

    • Search Google Scholar
    • Export Citation
  • Zhang, G., A. T. C. Goh, Y. Zhang, Y. Chen, and Y. Xiao, 2015: Assessment of soil liquefaction based on capacity energy concept and multivariate adaptive regression splines. Eng. Geol., 188, 2937, https://doi.org/10.1016/j.enggeo.2015.01.009.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Zhang, W., and A. Goh, 2016: Evaluating seismic liquefaction potential using multivariate adaptive regression splines and logistic regression. Geomech. Eng., 10, 269280, http://doi.org/10.12989/gae.2016.10.3.269.

    • Crossref
    • Search Google Scholar
    • Export Citation
Save
  • Adhikary, P. P., and J. Dash, 2018: Morphometric analysis of Katra Watershed of Eastern Ghats: A GIS approach. Int. J. Curr. Microbiol. Appl. Sci., 7, 16511665, https://doi.org/10.20546/ijcmas.2018.703.198.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Aziz, R., A. Rahman, G. Fang, and S. Shrestha, 2014: Application of artificial neural networks in regional flood frequency analysis: A case study for Australia. Stochastic Environ. Res. Risk Assess., 28, 541554, https://doi.org/10.1007/s00477-013-0771-5.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Balshi, M. S., A. D. McGuire, P. Duffy, M. Flannigan, J. Walsh, and J. Melillo, 2009: Assessing the response of area burned to changing climate in western boreal North America using a Multivariate Adaptive Regression Splines (MARS) approach. Global Change Biol., 15, 578600, https://doi.org/10.1111/j.1365-2486.2008.01679.x.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Bayentin, L., S. El Adlouni, T. B. M. J. Ouarda, P. Gosselin, B. Doyon, and F. Chebana, 2010: Spatial variability of climate effects on ischemic heart disease hospitalization rates for the period 1989-2006 in Quebec, Canada. Int. J. Health Geogr., 9, 5, https://doi.org/10.1186/1476-072X-9-5.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Oxford University Press, 482 pp.

  • Bond, N. R., and M. J. Kennard, 2017: Prediction of hydrologic characteristics for ungauged catchments to support hydroecological modeling. Water Resour. Res., 53, 87818794, https://doi.org/10.1002/2017WR021119.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Booker, D. J., and R. A. Woods, 2014: Comparing and combining physically-based and empirically-based approaches for estimating the hydrology of ungauged catchments. J. Hydrol., 508, 227239, https://doi.org/10.1016/j.jhydrol.2013.11.007.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Breiman, L., 2001: Random forests. Mach. Learn., 45, 532, https://doi.org/10.1023/A:1010933404324.

  • Brunner, M. I., R. Furrer, A. E. Sikorska, D. Viviroli, J. Seibert, and A.-C. Favre, 2018: Synthetic design hydrographs for ungauged catchments: A comparison of regionalization methods. Stochastic Environ. Res. Risk Assess., 32, 19932023, https://doi.org/10.1007/s00477-018-1523-3.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Burn, D. H., 1990a: An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J., 35, 149165, https://doi.org/10.1080/02626669009492415.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Burn, D. H., 1990b: Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res., 26, 22572265, https://doi.org/10.1029/WR026i010p02257.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Chebana, F., and T. B. M. J. Ouarda, 2008: Depth and homogeneity in regional flood frequency analysis. Water Resour. Res., 44, W11422, https://doi.org/10.1029/2007WR006771.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Chebana, F., C. Charron, T. B. M. J. Ouarda, and B. Martel, 2014: Regional frequency analysis at ungauged sites with the generalized additive model. J. Hydrometeor., 15, 24182428, https://doi.org/10.1175/JHM-D-14-0060.1.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Cutler, D. R., T. C. Edwards Jr., K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler, 2007: Random forests for classification in ecology. Ecology, 88, 27832792, https://doi.org/10.1890/07-0539.1.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Deo, R. C., O. Kisi, and V. P. Singh, 2017: Drought forecasting in eastern Australia using multivariate adaptive regression spline, least square support vector machine and M5Tree model. Atmos. Res., 184, 149175, https://doi.org/10.1016/j.atmosres.2016.10.004.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Diez-Sierra, J., and M. del Jesus, 2019: Subdaily rainfall estimation through daily rainfall downscaling using random forests in Spain. Water, 11, 125, https://doi.org/10.3390/w11010125.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Durocher, M., F. Chebana, and T. B. M. J. Ouarda, 2016: On the prediction of extreme flood quantiles at ungauged locations with spatial copula. J. Hydrol., 533, 523532, https://doi.org/10.1016/j.jhydrol.2015.12.029.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Emamgolizadeh, S., S. M. Bateni, D. Shahsavani, T. Ashrafi, and H. Ghorbania, 2015: Estimation of soil cation exchange capacity using genetic expression programming (GEP) and multivariate adaptive regression splines (MARS). J. Hydrol., 529, 15901600, https://doi.org/10.1016/j.jhydrol.2015.08.025.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Friedman, J. H., 1991: Multivariate adaptive regression splines. Ann. Stat., 19, 167, https://doi.org/10.1214/aos/1176347973.

  • Gal, Y., and Z. Ghahramani, 2016: A theoretically grounded application of dropout in recurrent neural networks. 30th Conf. on Advances in Neural Information Processing Systems, Barcelona, Spain, NIPS, 9 pp., https://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf.

  • Geurts, P., A. Irrthum, and L. Wehenkel, 2009: Supervised learning with decision tree-based methods in computational and systems biology. Mol. Biosyst., 5, 15931605, https://doi.org/10.1039/b907946g.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • GREHYS, 1996: Presentation and review of some methods for regional flood frequency analysis. J. Hydrol., 186, 6384, https://doi.org/10.1016/S0022-1694(96)03042-9.

    • Search Google Scholar
    • Export Citation
  • Hastie, T., and R. Tibshirani, 1987: Generalized additive models: Some applications. J. Amer. Stat. Assoc., 82, 371386, https://doi.org/10.1080/01621459.1987.10478440.

    • Crossref
    • Search Google Scholar
    • Export Citation
  • Hosking,