1. Introduction and literature review
The main objective of regional frequency analysis (RFA) is the estimation of the return period of extreme hydrological events at target sites where little or no hydrological data are available. Examples of these events include floods and low-flow quantiles, which are crucial for infrastructure design and management. In general, RFA comprises two main steps: (i) the delineation of homogenous regions (DHR) to determine gauged sites similar to the target one and (ii) regional estimation (RE) to transfer the information from the sites determined in the DHR step to the target one (e.g., Chebana and Ouarda 2008). Various methods have been suggested for each of these two steps (e.g., Ouarda 2016).
Among the most common DHR methods, we can mention the region of influence (ROI) (Burn 1990a) and the canonical correlation analysis (CCA) (Ouarda et al. 2001). Recently, several advanced nonlinear neighborhood approaches were suggested (e.g., Ouali et al. 2016; Wazneh et al. 2016). Among the commonly used RE approaches, we can distinguish the regression-based models and the index-flood models. Among the former, the log-linear regression models are the most commonly used ones in practice, because of their simplicity and good predictive performances. We focus here on regression-based models in the RE step.
Hydrological processes depend on a large number of variables, such as the topographic variability of the basins, their soil structure and texture, their geological formations, and the climatology. This leads to a natural complexity, which has been widely recognized and documented in the hydrological literature (e.g., Ibbitt and Woods 2004; Sivakumar 2007; Wang et al. 2008; Xu et al. 2010). In statistical terms, this complexity manifests itself through three aspects: (i) the high number of explanatory variables necessary to paint a realistic picture of the processes, (ii) the nonlinear impact of these explanatory variables, and (iii) the important interaction between the different explanatory variables. It is thus important that the RE step in RFA accounts for these three aspects in order to yield accurate estimations of the target site’s quantiles of interest.
In RFA studies, the RE step usually requires a large number of explanatory variables to result in satisfactory predictive performances. This number usually exceeds five, as in Ouarda et al. (2018), but should increase in the future with the discovery of new potential variables. For instance, evidence is growing that drainage network characteristics have a strong impact on hydrological dynamics, and are consequently linked to flood quantiles (Jung et al. 2017). Thus, integrating additional characteristics related to the drainage network may lead to more accurate estimates of the regional quantiles. Hence, there is a need to propose efficient approaches that are able to manage such high-dimensional databases.
Another consequence of the natural complexity of hydrological processes is the nonlinearity between explanatory variables and the at-site quantiles. To handle this problem and better reproduce the dynamics of hydrological processes, various nonlinear approaches have been proposed (e.g., Shu and Burn 2004). The classical log-linear method used in the RE step assumes that the relation between the logarithm of the response variable (hydrological) and explanatory variables (physio-meteorological) is linear, which is too simplistic for such complex nonlinear processes. Therefore, several RE approaches, such as random forest (RF), artificial neural network (ANN), and generalized additive models (GAM) have been proposed in the literature to account for the possible nonlinear links between variables (e.g., Aziz et al. 2014; Khalil et al. 2011; Ouali et al. 2017; Ouarda et al. 2018; Saadi et al. 2019).
Random forest (Breiman 2001) is a powerful nonlinear and nonparametric method commonly used to handle regression and classification problems based on decision trees. Due to its good performance, it has been applied in several fields, such as hydrology (e.g., Diez-Sierra and del Jesus 2019; Muñoz et al. 2018; Wang et al. 2015), ecology (e.g., Cutler et al. 2007; Prasad et al. 2006), environmental modeling (e.g., Masselink et al. 2017; Pourghasemi and Kerle 2016), and RFA (e.g., Booker and Woods 2014; Brunner et al. 2018). Despite its predictive power, RF suffers from major limitations such as the difficulty of interpretation and the large memory requirements for storing the model when used with a large dataset (Geurts et al. 2009).
The ANN is a nonparametric mathematical model, whose design is inspired by the biological functioning of brain neurons (Bishop 1995). It was considered in several RFA studies for the estimation of flood and low-flow quantiles at ungauged sites (e.g., Aziz et al. 2014; Ouarda and Shu 2009). However, ANNs share a major problem, namely a tendency to overfit (e.g., Gal and Ghahramani 2016; Lawrence and Giles 2000). In addition, their calibration is relatively complex, especially for novice users, and requires some subjective choices since no explicit regression equations can be given (Ouali et al. 2017).
GAMs do not suffer the same drawbacks as ANNs. GAMs are flexible nonlinear regression models (Hastie and Tibshirani 1987) that have been introduced in the RFA context by Chebana et al. (2014). The authors found that the GAM-based methods present the best performances when compared to the classical log-linear model and other common methods. GAMs are increasingly being adopted in several fields such as hydroclimatology and environmental modeling (e.g., Rahman et al. 2018; Wen et al. 2011), public health (e.g., Bayentin et al. 2010; Leitte et al. 2009), and renewable energy (e.g., Ouarda et al. 2016). However, GAMs still present a number of disadvantages. The method can be computationally intensive, especially when a large number of variables is involved. It can then be difficult to fit GAMs to high-dimensional databases because of memory limitations imposed by the numerical complexity of this model (Leathwick et al. 2006). More importantly, GAMs do not cope well with interactions between variables (e.g., Ramsay et al. 2003), which are difficult to integrate in the model.
The interaction between physiographical variables within the watershed has long been recognized (e.g., Niehoff et al. 2002). Thus, the inclusion of interaction terms between the explanatory variables used to model the hydrological dynamics seems to be essential for better estimates of flood quantiles. However, this aspect is difficult to take into account in the RE models due to the high complexity that it may add to the models (see above for the specific example of GAMs). This affects the quality of the estimates and makes them less accurate. Hence, the motivation behind the present paper is to propose and explore alternative techniques able to realistically reproduce the hydrological process while avoiding the problems mentioned above.
The method considered here is multivariate adaptive regression splines (MARS), a procedure designed to build complex nonlinear regression models in a high dimensional setting. It is attractive in the RFA context since it actually addresses the three issues developed above which are: high number of variables, nonlinearity, and interactions. Indeed, MARS is efficient in a high dimensional setting and naturally selects the relevant predictors in this context. In addition, it does not require assumptions about the form of the relationships between the response and the explanatory variables (Friedman 1991). MARS also allows the modeling of complex structures between variables, which are often hidden in high-dimensional data, without imposing strong model assumptions. Hence, it can easily include interactions between variables, allowing any degree of interaction to be considered (Lee et al. 2006).
All of these desirable properties lead to a very flexible approach able to adapt well to the hydrological phenomenon. Due to its simplicity and capacity to capture complex nonlinear relationships, it has been successfully applied in several fields such as ecology and environment (e.g., Balshi et al. 2009; Bond and Kennard 2017; Leathwick et al. 2006, 2005), finance (e.g., Lee and Chen 2005; Lee et al. 2006), geology (e.g., Zhang and Goh 2016; Zhang et al. 2015), energy (e.g., Li et al. 2016; Roy et al. 2018), and hydrology (e.g., Bond and Kennard 2017; Deo et al. 2017; Emamgolizadeh et al. 2015; Kisi 2015; Kisi and Parmar 2016). Despite the extensive use of the MARS model in various frameworks and contexts, its potential has never been exploited and investigated in the context of RFA of extreme hydrological events.
The main objective of the present study is to introduce the MARS approach in the RFA context to estimate flood quantiles and evaluate its predictive potential when it is applied to an extensive database. It is hereby applied in combination with the DHR with the CCA and the ROI approaches. MARS is also applied without DHR to test its performance when applied to all stations without consideration of hydrological neighborhoods. A jackknife procedure is used to evaluate the model performances, with GAMs used as a benchmark.
This paper is structured as follows. Section 2 presents the theoretical background of MARS and the other RFA approaches adopted. The considered methodology is outlined in section 3. Section 4 describes the case study and the considered datasets. The obtained results are presented and discussed in section 5. The conclusions of the study are summarized in the last section. The appendix contains a list of terms and abbreviations.
2. Theoretical background
In this section, the adopted statistical tools are briefly presented and discussed.
a. Neighborhood identification approaches
Here we present the two most commonly considered neighborhood identification approaches as a necessary step before the RE one.
1) Canonical correlation analysis approach
CCA (Hotelling 1935) is a multivariate analysis technique used to identify the possible correlations between two groups of variables. It consists of a linear transformation of two groups of random variables into pairs of canonical variables, which are established in such a way that the correlations between each pair are maximized.
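As an illustration only (this is not the implementation used in the study), the canonical correlations between two groups of variables can be obtained by orthonormalizing each centered group and taking the singular values of the product of the two bases. The function name `canonical_correlations` and the synthetic data in the usage note are hypothetical, and both groups are assumed to have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two groups of variables (rows = sites)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the centered column spaces of each group.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx' Qy are the canonical correlations,
    # returned in decreasing order.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)
```

For example, if one column of `Y` is an exact linear combination of columns of `X`, the first canonical correlation returned is 1 up to rounding.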
For neighborhood delineation in RFA, the considered Xr are physio-meteorological variables while the YS are the flood quantiles of interest. CCA is then used to construct canonical variables Wi that correlate well with physio-meteorological variables. The neighborhood is the set of sites such that the canonical hydrological score wk, k = 1, …, K, is close to the canonical physio-meteorological score of the target ungauged site υ0. The distance is measured by a Mahalanobis distance between the hydrological mean position of the target site Λυ0 and the positions of other sites wk, where Λ = diag(λ1, …, λp) and υ0 is the physio-meteorological canonical score of the target site. Provided the X variables are approximately normal, the Mahalanobis distance converges to a χ2 distribution with p degrees of freedom. The size of the neighborhood is controlled by the parameter α that represents the (1 − α)
2) Region of influence approach
b. Regional estimation approaches
Once a neighborhood is identified, the methods described below are used to transfer information from the neighborhood stations to the target site.
1) Generalized additive model
GAM (Hastie and Tibshirani 1987) is a flexible class of nonlinear models that is able to efficiently model a wide variety of nonlinear relationships. In addition, it allows for non-Gaussian response variables (Wood 2006) making it relevant for streamflow data. Thus, GAM allows a more realistic description of the hydrological phenomenon because of the flexible nonparametric fitting of the smooth functions.
For more details, the reader is referred to Wood (2006, 2017).
2) Multivariate adaptive regression splines
Fig. 1. Knots and linear splines for a simple example of MARS.
Citation: Journal of Hydrometeorology 21, 12; 10.1175/JHM-D-19-0213.1
The main difference between MARS and GAM lies in the estimation algorithm. Whereas the spline bases are defined a priori in GAM, they are constructed iteratively in MARS and hence adapt to the data. Building the model in (6) is carried out in two phases: (i) a forward addition of linear spline terms [i.e., of the form (7) and (8)] to build a large model and (ii) a backward deletion to remove irrelevant terms. The forward phase begins with an empty model containing only the intercept β0. The Bn terms are then iteratively added to the model, each time choosing the variable and knot yielding the largest decrease in the residual error of the model. This process continues until the model reaches some predetermined maximum number of terms, leading to a large model that may overfit the data. A backward deletion phase is then performed to improve the model performance by removing the least significant Bn terms until the best submodel is obtained. Submodels are compared based on the generalized cross validation (GCV). Figure 2 illustrates the details of the MARS model algorithm.
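A single step of the forward phase can be sketched as follows for one predictor, assuming the usual mirrored pair of linear splines max(0, x − t) and max(0, t − x) around a knot t (Friedman 1991). This is a simplified illustration, not the algorithm of any particular package; the names `hinge_pair` and `best_knot` are hypothetical, and a full forward pass would repeat the search over all predictors and existing terms.

```python
import numpy as np

def hinge_pair(x, knot):
    """Mirrored linear splines max(0, x - knot) and max(0, knot - x)."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

def best_knot(x, y):
    """Try each interior observed value as a knot and keep the one that
    gives the smallest residual sum of squares (RSS)."""
    best = (None, np.inf)
    for knot in np.unique(x)[1:-1]:
        h_plus, h_minus = hinge_pair(x, knot)
        # Intercept plus the two candidate spline terms.
        B = np.column_stack([np.ones_like(x), h_plus, h_minus])
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        rss = float(np.sum((y - B @ coef) ** 2))
        if rss < best[1]:
            best = (float(knot), rss)
    return best
```

On data generated from a single broken line, the search recovers the location of the break as the selected knot.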
Fig. 2. Graph of MARS modeling process.
Another interesting feature of MARS is the assessment of the variable importance for the prediction of the response. Variable importance can be measured in two different ways: (i) the number of submodels that include the variable, or (ii) the increase in GCV caused by deleting the considered variables from the final MARS model (e.g., Roy et al. 2018).
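The role of the GCV in comparing submodels can be sketched as below. The form used here, RSS/N divided by the square of (1 − C/N) with C an effective number of parameters that charges a penalty for each knot, is a commonly used one; the penalty value and the function names are assumptions made for illustration and are not taken from the present study.

```python
def gcv(rss, n_obs, n_terms, penalty=3.0):
    """Generalized cross validation score of a MARS submodel.

    n_terms counts the basis functions including the intercept;
    `penalty` is an assumed cost per knot (a tuning choice)."""
    n_knots = (n_terms - 1) / 2.0
    effective_params = n_terms + penalty * n_knots
    if effective_params >= n_obs:
        return float("inf")
    return (rss / n_obs) / (1.0 - effective_params / n_obs) ** 2

def best_submodel(candidates, n_obs):
    """Return the (n_terms, rss) pair with the lowest GCV."""
    return min(candidates, key=lambda c: gcv(c[1], n_obs, c[0]))
```

With this criterion, a large submodel is retained only if its reduction in RSS outweighs the penalty attached to its additional terms.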
3. Methodology
a. Regional models
In this study, the methods presented in section 2 for neighborhood delineation (CCA and ROI) are used in combination with the regional estimation models GAM and MARS for transfer of hydrological information. As mentioned in section 1, other evaluated models are obtained by applying the GAM and MARS using all stations, i.e., without defining any neighborhoods. Table 1 summarizes all six resulting combinations.
Table 1. Adopted regional models.
The two most commonly used neighborhood approaches, the CCA and the ROI (Ouarda 2016), are applied to the DHR using two sets of variables. For these methods, the relevant variables are selected based on their correlation degree with the hydrological variables.
Considering the classical procedures used to define the threshold in ROI and CCA, the density of stations in the neighborhoods can vary considerably from one region to another. Indeed, for a given fixed threshold, stations located near the center of the point cloud defined by the canonical space for CCA or the Euclidean space for ROI will have more stations within their neighborhoods and vice versa (Leclerc and Ouarda 2007). Since the sample size may affect the accuracy of the estimates obtained by regression models, it was decided that, for each target station, the size of the region is increased until a selected optimal size is reached. The optimal number of stations to be considered in the DHR step is chosen based on the optimization procedure of Ouarda et al. (2001). The optimal number of sites in the neighborhood is the one that minimizes a given performance criterion of the log-linear model applied in each neighborhood.
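The fixed-size neighborhood described above can be sketched as the selection of the n gauged sites closest to the target in the chosen attribute space. The example below uses a plain Euclidean distance, as in the ROI setting, on attributes assumed to be already standardized; the function name `nearest_sites` is hypothetical, and the search for the optimal n is not shown.

```python
import numpy as np

def nearest_sites(attributes, target_index, n_sites):
    """Indices of the n_sites stations closest to the target station.

    attributes: (stations x variables) array, assumed standardized."""
    diff = attributes - attributes[target_index]
    dist = np.sqrt((diff ** 2).sum(axis=1))
    dist[target_index] = np.inf   # the target itself is treated as ungauged
    order = np.argsort(dist, kind="stable")
    return order[:n_sites]
```

Because the region always contains n_sites stations, targets lying at the edge of the point cloud get the same sample size as those near its center.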
MARS is fitted using the R package “earth” (Milborrow 2018). The application of MARS needs the tuning of three main parameters (see Fig. 2): the maximum number of terms in the model in the forward phase (Nk), the degree of interaction (degree), and the maximum number of terms in the backward phase (N_prune). A range of values of these parameters was tested and evaluated in order to optimize them based on the GCV, the residual sum of squares (RSS), and the coefficient of determination (R2) criteria of the fitted models.
GAM is also implemented in R, through the package “mgcv” (Wood 2006). The thin plate regression spline is used in this study as basis bji in the smoothing function fi in (5). It is selected for its advantages, i.e., low calculation time, flexibility, and fewer parameters compared to other smoothing functions (Wood 2003). The link function g used in (4) is the identity function because the log-transformed quantiles are approximately normal, as considered in Ouali et al. (2017).
Different physio-meteorological variables are considered in each regional model. A backward stepwise approach is applied in this study to select the relevant explanatory variables to be used in each RE model (GAM and MARS). This method is presented in the next section.
b. Variable selection
The backward stepwise selection procedure is applied in this work to select the optimal explanatory variables as in Ouarda et al. (2018) and Chebana et al. (2014). It consists of progressively deleting the least effective variables from an initial full model containing all available variables. At each step, the removed variable is, for GAM, the one with the highest p value for the null hypothesis that its smooth term is zero or, for MARS, the one whose inclusion yields the most significant increase in the GCV score of the model.
Note that the MARS algorithm naturally includes a variable selection feature since it builds a sparse model and a variable for which no term is added is by default discarded. This is not the case for GAM within which an automatic backward stepwise procedure was specially developed for this study.
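The general logic of a backward stepwise search can be sketched as below. This is a generic score-based variant written for illustration, not the procedure developed for this study: `score` stands for any user-supplied criterion where lower is better (for instance the GCV of a model refitted on the given variables), and the function name `backward_select` is hypothetical.

```python
def backward_select(variables, score):
    """Drop one variable at a time for as long as the score improves.

    `score` maps a tuple of variable names to a number (lower is better)."""
    current = tuple(variables)
    best = score(current)
    while len(current) > 1:
        # Candidate models obtained by removing each variable in turn.
        trials = [tuple(v for v in current if v != drop) for drop in current]
        trial_scores = [score(t) for t in trials]
        i = min(range(len(trials)), key=trial_scores.__getitem__)
        if trial_scores[i] >= best:
            break                  # no single removal improves the score
        current, best = trials[i], trial_scores[i]
    return current, best
```

A p value based rule, as used for GAM, would replace the score comparison with the removal of the term having the highest p value above a chosen level.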
c. Validation
For each RFA combination in Table 1, performances are evaluated using a leave-one-out cross validation, commonly called the jackknife procedure in the field of hydrology. It consists of temporarily removing each site, treating it as the target site, and performing RE. This process is repeated for each gauged site. The regional estimate is then compared to the observed value. Note that, in statistics, the validation with the jackknife technique is carried out on the retained data, not on the data removed as in the leave-one-out cross validation (Quenouille 1949). However, we will retain the jackknife term for ease of presentation.
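The leave-one-out loop can be sketched as follows. The regional model is passed in as a pair of `fit` and `predict` callables; the ordinary least squares functions given here are only a stand-in for GAM or MARS, the neighborhood delineation for each target is omitted, and all names are hypothetical.

```python
import numpy as np

def jackknife_predictions(X, y, fit, predict):
    """Leave-one-out estimates: each site is removed in turn, the model is
    fitted on the remaining sites, and the removed site is predicted."""
    n = len(y)
    estimates = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        estimates[i] = predict(model, X[i:i + 1])[0]
    return estimates

# Stand-in regional model (ordinary least squares), for illustration only.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef
```

The vector returned by `jackknife_predictions` is what is compared with the at-site values to compute the performance criteria.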
Based on the jackknife procedure, several standard performance criteria are used to evaluate the prediction power of each regional model (e.g., Ouali et al. 2016). First, the Nash criterion (NASH) gives a global evaluation of the prediction quality. Second, the root-mean-square error (RMSE) provides information about the accuracy of the prediction on an absolute scale, and the relative RMSE (RRMSE) removes the impact of each site’s order of magnitude from the RMSE computation. Finally, the bias (BIAS) and the relative bias (RBIAS) provide a measure of the magnitude of the systematic overestimation or underestimation of a model.
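These criteria can be computed as in the sketch below. Since their formulas are not given in this section, the definitions used here are assumptions following a common convention: errors are taken as observed minus estimated, means are divided by the number of sites, and relative criteria are expressed in percent. The function name `performance_criteria` is hypothetical.

```python
import math

def performance_criteria(observed, estimated):
    """NASH, RMSE, RRMSE (%), BIAS and RBIAS (%) for paired site values."""
    n = len(observed)
    errors = [o - e for o, e in zip(observed, estimated)]
    rel_errors = [(o - e) / o for o, e in zip(observed, estimated)]
    mean_obs = sum(observed) / n
    return {
        "NASH": 1.0 - sum(d * d for d in errors)
                      / sum((o - mean_obs) ** 2 for o in observed),
        "RMSE": math.sqrt(sum(d * d for d in errors) / n),
        "RRMSE": 100.0 * math.sqrt(sum(r * r for r in rel_errors) / n),
        "BIAS": sum(errors) / n,
        "RBIAS": 100.0 * sum(rel_errors) / n,
    }
```

Dividing each error by the observed quantile before averaging is what prevents the largest basins from dominating RRMSE and RBIAS.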
4. Case study and datasets
The dataset considered in the present paper consists of 151 hydrometric stations located in the southern part of the province of Quebec, Canada (Fig. 3). Two versions of the dataset with different variables are considered. The first is a standard one (STA) with only well-known variables used in previous RFA studies (e.g., Shu et al. 2007; Chebana et al. 2014; Durocher et al. 2016; Ouali et al. 2016; Wazneh et al. 2013, 2015, 2016). Note that geographical coordinates of the stations are considered instead of the geographical coordinates of the centroids. The second is an extended dataset (EXTD) combining STA with less common variables characterizing the drainage network systems. Table 2 lists all variables considered and indicates whether they are in the EXTD dataset. Thorough definitions of the new variables can be found in, for example, Adhikary and Dash (2018). These new variables are calculated based on drainage networks extracted from digital elevation models (DEMs) using the D8 approach implemented in ArcGIS (Arc Hydro) (Jenson and Domingue 1988; O’Callaghan and Mark 1984; Tarboton et al. 1991). This method consists of calculating the flow direction and the flow accumulation layers based on the direction of the steepest slope among the eight neighbors of a given DEM cell. Using this information, the drainage networks can be defined considering a constant threshold value which represents the stream head locations (O’Callaghan and Mark 1984). Descriptive statistics of the new variables used in the EXTD dataset (Msilini et al. 2020, manuscript submitted to J. Hydrol.) are given in Table 3. In both datasets, the considered hydrological response variables are at-site specific flood quantiles, chosen to match the specific return periods of 10, 50, and 100 years. These quantiles are thus denoted by QS10, QS50, and QS100.
Fig. 3. Geographical location of the studied sites in the southern part of the province of Quebec, Canada.
Table 2. Variables used in the STA and the EXTD. An asterisk indicates variables considered in the standard dataset (STA). Plus signs indicate variables considered in the extended dataset (EXTD). The variables considered in the neighborhoods and their transformations are presented in bold.
Table 3. Descriptive statistics of new physiographical variables.
To ensure the convergence of the Mahalanobis distance to a χ2 distribution in CCA, a logarithmic transformation is applied to AREA, MBS, MATP, DDBZ, and RT, and a square root transformation to PLAKE and RC, in order to achieve approximate normality. After transformation, normal Q–Q plots indicate that all variables are approximately normal.
5. Results and discussion
a. Region delineation with CCA and ROI
The CCA and the ROI are applied to the DHR using two sets of variables. The first set contains variables from STA, which are the area (AREA), mean basin slope (MBS), percentage of the area occupied by lakes (PLAKE), mean annual total precipitation (MATP), mean annual degree days below 0°C (DDBZ), and the longitude of the centroid of the basin (LONGC). The second one includes variables from the EXTD, namely, PLAKE, MATP, DDBZ, LONGC, texture ratio (RT), and circularity ratio (RC).
The obtained optimum sizes of the neighborhood are nopt (STA) = 85 sites and nopt (EXTD) = 78 sites according to the RRMSE for the CCA method. For the ROI approach, we obtain nopt (STA) = 54 sites and nopt (EXTD) = 44 sites according to the same criterion. Thus, these neighborhood sizes are used for each target station.
b. Selection of optimal variables
The selection of significant explanatory variables is applied for each specific quantile (QS10, QS50, and QS100) and for each estimation model (GAM and MARS). Table 4 summarizes the final variables for each dataset (STA and EXTD). Following the application of the backward technique with GAM and MARS, we note the selection of the same new variables for the two models (RN, MRL, and DD). The definition of these variables can be found, for example, in Adhikary and Dash (2018). For each quantile and for each model, different combinations of variables are selected. The variables that seem to be the most important are AREA, PLAKE, MCL, and LONGC.
Table 4. Explanatory variables selected for the various regression models.
c. MARS model results
Figure 4 shows the variable importance graph for QS100 obtained using the EXTD (we present only the results of QS100 to avoid repetitions). The variable with the most influence for the QS100 is the percentage of the area occupied by lakes, PLAKE. Indeed, lakes act as a sponge absorbing the excess water during extreme events. Thus they may have a significant effect on flood peaks.
Fig. 4. Variable importance while predicting QS100. The red line represents the variation of the square root GCV values caused by the removal of a given variable from the MARS model during the backward phase. The black line represents the variation of the number of submodels including a given variable.
Figure 5 shows the GCV R2 (GRSq) value for the QS100 predictions versus the number of terms in the final MARS model. The GCV R2 statistic is equivalent to the ordinary R2 statistic calculated with the variance for error replaced with the GCV statistic. It allows quantifying the goodness of fit for models that use unobserved data. The vertical dashed line at 12 indicates the optimal number of terms retained where marginal increases in GCV R2 are less than 0.001. The 12 final terms include seven variables in this case. Five terms are related to interaction effects.
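The relation between the GCV and the GCV R2, and one possible reading of the 0.001 rule stated above, can be sketched as follows. It is assumed here that the reference variance is the GCV of an intercept-only model and that the retained size is the smallest one after which an additional term raises the GCV R2 by less than the tolerance; both function names are hypothetical.

```python
def grsq(gcv_model, gcv_null):
    """GCV-based R2: 1 minus the model GCV over the intercept-only GCV."""
    return 1.0 - gcv_model / gcv_null

def terms_at_plateau(grsq_by_terms, tol=0.001):
    """Smallest number of terms after which one more term raises GRSq by
    less than tol; grsq_by_terms[k] is the GRSq with k + 1 terms."""
    for k in range(len(grsq_by_terms) - 1):
        if grsq_by_terms[k + 1] - grsq_by_terms[k] < tol:
            return k + 1
    return len(grsq_by_terms)
```

Unlike the ordinary R2, this statistic can decrease when terms are added, because the GCV penalizes model size.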
Fig. 5. MARS model selection for QS100. The gray line and the red dashed line represent, respectively, the variation of the GCV R2 (GRSq) and the R2 (RSq) values in the backward phase. For this model, 12 terms were retained, which are based on seven predictors (nbr preds).
d. Comparison between MARS and GAM models
Table 5 shows the jackknife results for each model combination. The comparison of GAM and MARS models confirms that the simple linear spline fitting generated by MARS captures more information from the EXTD than the more sophisticated smoothing functions used in GAM. Indeed, MARS adds the terms in an iterative way leading to a simple and performant model including the effects of interactions. This model performs well with the ROI, which contains a smaller number of stations than CCA. Thus, based on the results of our case study MARS seems applicable in small neighborhoods even with complex terms (interaction effects) and able to give good predictions with fewer stations than GAM.
Table 5. Jackknife validation results (STA and EXTD). Best results are in bold.
The response functions fitted by GAM and MARS models for selected explanatory variables are given in Fig. 6. It can be seen that the smoothing functions fitted by MARS approximate closely the more continuous smooth curves fitted by GAM, in a simpler way. This result has been observed by Leathwick et al. (2006) in a comparative study made between GAM and MARS applied in the field of ecology. The smooth curves generated by GAM add degrees of freedom to the model which makes it relatively more complex. This may be the reason for the better prediction results obtained by MARS than GAM.
Fig. 6. Examples of smoothing functions produced by the GAM and MARS models for some explanatory variables. Dashed lines represent the 95% confidence intervals (CI). A Bayesian approach to variance estimation is used to calculate the CI for GAM. For MARS, the CI is obtained as for a linear regression model, since MARS is simply a linear regression on linear basis functions. All the terms are estimated with a sum-to-zero constraint, leading to lower uncertainty associated with the mean in the plots.
Figure 7 illustrates the interaction effects between some explanatory variables fitted by GAM and MARS models. Note that we considered the same interactions automatically identified by MARS to be able to make the comparison. The interaction surfaces generated by the two models are also close. GAM gives more continuous and complex interaction effects, which lead to a large model with a large number of coefficients. This makes it difficult or impossible to integrate the interaction effects with GAM if the model contains a large number of explanatory variables. For example, for the QS100, integrating into GAM the same interactions identified by MARS, with the same variables, gives a model with 79 coefficients, versus only 12 using MARS. In addition, MARS searches for and integrates interaction effects automatically into the model, which yields flood quantile estimates that are overall better than those obtained by GAM. We take as a simple example of interaction the first effect illustrated in Fig. 7, which represents the predicted response (specific quantile) as DD and LONGC vary. It can be seen that LONGC has little effect on the level of the hydrological variable unless DD is high, where a nonlinear effect is seen.
Fig. 7. Examples of the multivariate effects of some explanatory variables produced by the GAM and MARS models on the response variable (interactions).
e. Comparison of regional models
According to Table 5 (see above), the highest NASH values (0.80) and the lowest RRMSE values (28.30% for QS100) are given by the ROI/MARS/EXTD, which leads to the most accurate estimates compared to all other combinations. It can also be seen that, with ALL, MARS has a comparable performance to GAM considering both databases. However, using the neighborhoods, especially the ROI, MARS overall outperforms GAM in terms of RRMSE and RBIAS criteria. This may be attributable to the flexibility of MARS and its generalization ability in small size neighborhoods.
Figure 8 illustrates the relative error, which is the most important criterion (Hosking and Wallis 2005), for the best models (ROI/MARS/EXTD and ROI/GAM/EXTD), with the sites ordered according to their area. One can notice that, overall, MARS with the EXTD performs better than GAM. The figure also shows that the performances for basins of extreme size are much worse than those obtained for medium size basins.
Fig. 8. Relative errors associated with the at-site quantile QS100 calculated using ROI/GAM/EXTD and ROI/MARS/EXTD.
Figure 9 presents the differences between relative errors of MARS and GAM calculated using ROI/EXTD. One can notice that, in terms of RRMSE, MARS outperforms GAM in 84 sites out of 151, which represents 56% of the total number of sites. Accordingly, MARS is shown to be a simple performant model that can be considered as an alternative RE model.
Fig. 9. Differences in relative errors between MARS and GAM for the at-site quantile QS100. The considered combinations are ROI/GAM/EXTD and ROI/MARS/EXTD.
6. Conclusions
The aim of this study is to introduce MARS in the RFA of extreme hydrological variables and to compare its performance to GAM. The MARS model is able to model complex relationships between physio-meteorological variables, including variables dealing with drainage network characteristics, and flood quantiles at ungauged sites.
MARS is hereby compared to the GAM, which is gaining popularity in RFA and is one of the best performing models. Results show that slightly better flood quantile estimates are obtained from regional models that combine MARS with the EXTD, which extends the STA with additional variables dealing with drainage network properties. Results also indicate that better performances are obtained with the ROI, which includes a lower density of stations than CCA. This suggests that MARS is able to transfer hydrological information adequately even with fewer data than GAM. Further efforts are required to generalize this conclusion and to evaluate the benefits of MARS in other study areas and with other hydrological variables.
Although MARS is an effective and simple tool for estimation that can be used in RFA, there are some constraints such as the maximum number of terms and the maximum allowable degree of interaction in the forward pass that have to be specified by the user. These depend on the problem at hand and should be considered carefully. In addition, MARS does not cope well with missing data and, like many machine learning algorithms, is prone to overfitting. Note, however, that the backward deletion phase is meant to address this drawback.
Aside from the abovementioned shortcomings, MARS is easy to use, as shown in this work. It is able to address the issues of high number of variables, nonlinearity, and interactions involved in the hydrological phenomena. This yields flood quantile estimates that compete with those obtained from GAM, while being simpler and more applicable to smaller datasets. Flood quantiles represent important information that is used in the design of hydraulic structures (e.g., dams). The construction of these structures is very expensive. The availability of simple and sophisticated tools for the reliable estimation of flood quantiles is crucial for hydraulic engineers.
In this work, we considered linear neighborhood approaches (CCA and ROI), which are the most commonly used methods in RFA. Future efforts can focus on assessing the performance of the MARS model in combination with nonlinear neighborhood approaches such as the nonlinear canonical correlation analysis (Ouali et al. 2016) and the nonlinear neighborhood based on the statistical depth function (Wazneh et al. 2016).
Acknowledgments
Financial support for this work was graciously provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs Program (CRC), and the University Mission of Tunisia in Montreal (MUTAN). The authors are grateful to Natural Resources Canada and the USGS services for the employed DEM data. The authors would also like to thank the Ministry of Sustainable Development, Environment, and Fight Against Climate Change (MDDELCC) services for the employed dataset (STA). The authors thank the Editor, Prof. Andrew Wood, and three anonymous reviewers for their comments, which helped improve the quality of the manuscript.
APPENDIX
Abbreviations
ANN | Artificial neural network |
AREA | Basin area |
BH | Basin relief |
BIAS | Mean bias |
CCA | Canonical correlation analysis |
DD | Drainage density |
DDBZ | Mean annual degree days below 0°C |
DEM | Digital elevation model |
DHR | Delineation of homogenous regions |
Edf | Estimated smooth degree of freedom |
EXTD | Extended dataset |
FS | Stream frequency |
GAM | Generalized additive model |
GCV | Generalized cross validation |
IF | Infiltration number |
LATC | Latitude of the centroid of the basin |
LONGC | Longitude of the centroid of the basin |
MALP | Mean annual liquid precipitation |
MALPS | Mean annual liquid precipitation (summer–fall) |
MARS | Multivariate adaptive regression splines |
MASP | Mean annual solid precipitation |
MATP | Mean annual total precipitation |
MBS | Mean basin slope |
MCL | Main channel length |
MCS | Main channel slope |
MRB | Mean bifurcation ratio |
MRL | Mean stream length ratio |
NASH | Nash efficiency criterion |
NL-CCA | Nonlinear canonical correlation analysis |
PFOR | Percentage of the area occupied by forest |
PL1 | Percentage of first-order stream lengths |
PLAKE | Percentage of the area occupied by lakes |
PN1 | Percentage of first-order streams |
QST | Specific quantile associated with the return period T |
R2 | Coefficient of determination |
RB | Bifurcation ratio |
RBIAS | Relative mean bias |
RC | Circularity ratio |
RE | Regional estimation |
RFA | Regional frequency analysis |
RL | Stream length ratio |
RMSE | Root-mean-square error |
RN | Ruggedness number |
ROI | Region of influence |
RRMSE | Relative root-mean-square error |
RSS | Residual sum of squares |
RT | Texture ratio |
STA | Standard dataset |
WMRB | Weighted mean bifurcation ratio |
REFERENCES
Adhikary, P. P., and J. Dash, 2018: Morphometric analysis of Katra Watershed of Eastern Ghats: A GIS approach. Int. J. Curr. Microbiol. Appl. Sci., 7, 1651–1665, https://doi.org/10.20546/ijcmas.2018.703.198.
Aziz, R., A. Rahman, G. Fang, and S. Shrestha, 2014: Application of artificial neural networks in regional flood frequency analysis: A case study for Australia. Stochastic Environ. Res. Risk Assess., 28, 541–554, https://doi.org/10.1007/s00477-013-0771-5.
Balshi, M. S., A. D. McGuire, P. Duffy, M. Flannigan, J. Walsh, and J. Melillo, 2009: Assessing the response of area burned to changing climate in western boreal North America using a Multivariate Adaptive Regression Splines (MARS) approach. Global Change Biol., 15, 578–600, https://doi.org/10.1111/j.1365-2486.2008.01679.x.
Bayentin, L., S. El Adlouni, T. B. M. J. Ouarda, P. Gosselin, B. Doyon, and F. Chebana, 2010: Spatial variability of climate effects on ischemic heart disease hospitalization rates for the period 1989–2006 in Quebec, Canada. Int. J. Health Geogr., 9, 5, https://doi.org/10.1186/1476-072X-9-5.
Bishop, C. M., 1995: Neural Networks for Pattern Recognition. Oxford University Press, 482 pp.
Bond, N. R., and M. J. Kennard, 2017: Prediction of hydrologic characteristics for ungauged catchments to support hydroecological modeling. Water Resour. Res., 53, 8781–8794, https://doi.org/10.1002/2017WR021119.
Booker, D. J., and R. A. Woods, 2014: Comparing and combining physically-based and empirically-based approaches for estimating the hydrology of ungauged catchments. J. Hydrol., 508, 227–239, https://doi.org/10.1016/j.jhydrol.2013.11.007.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Brunner, M. I., R. Furrer, A. E. Sikorska, D. Viviroli, J. Seibert, and A.-C. Favre, 2018: Synthetic design hydrographs for ungauged catchments: A comparison of regionalization methods. Stochastic Environ. Res. Risk Assess., 32, 1993–2023, https://doi.org/10.1007/s00477-018-1523-3.
Burn, D. H., 1990a: An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J., 35, 149–165, https://doi.org/10.1080/02626669009492415.
Burn, D. H., 1990b: Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res., 26, 2257–2265, https://doi.org/10.1029/WR026i010p02257.
Chebana, F., and T. B. M. J. Ouarda, 2008: Depth and homogeneity in regional flood frequency analysis. Water Resour. Res., 44, W11422, https://doi.org/10.1029/2007WR006771.
Chebana, F., C. Charron, T. B. M. J. Ouarda, and B. Martel, 2014: Regional frequency analysis at ungauged sites with the generalized additive model. J. Hydrometeor., 15, 2418–2428, https://doi.org/10.1175/JHM-D-14-0060.1.
Cutler, D. R., T. C. Edwards Jr., K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler, 2007: Random forests for classification in ecology. Ecology, 88, 2783–2792, https://doi.org/10.1890/07-0539.1.
Deo, R. C., O. Kisi, and V. P. Singh, 2017: Drought forecasting in eastern Australia using multivariate adaptive regression spline, least square support vector machine and M5Tree model. Atmos. Res., 184, 149–175, https://doi.org/10.1016/j.atmosres.2016.10.004.
Diez-Sierra, J., and M. del Jesus, 2019: Subdaily rainfall estimation through daily rainfall downscaling using random forests in Spain. Water, 11, 125, https://doi.org/10.3390/w11010125.
Durocher, M., F. Chebana, and T. B. M. J. Ouarda, 2016: On the prediction of extreme flood quantiles at ungauged locations with spatial copula. J. Hydrol., 533, 523–532, https://doi.org/10.1016/j.jhydrol.2015.12.029.
Emamgolizadeh, S., S. M. Bateni, D. Shahsavani, T. Ashrafi, and H. Ghorbani, 2015: Estimation of soil cation exchange capacity using genetic expression programming (GEP) and multivariate adaptive regression splines (MARS). J. Hydrol., 529, 1590–1600, https://doi.org/10.1016/j.jhydrol.2015.08.025.
Friedman, J. H., 1991: Multivariate adaptive regression splines. Ann. Stat., 19, 1–67, https://doi.org/10.1214/aos/1176347973.
Gal, Y., and Z. Ghahramani, 2016: A theoretically grounded application of dropout in recurrent neural networks. 30th Conf. on Advances in Neural Information Processing Systems, Barcelona, Spain, NIPS, 9 pp., https://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf.
Geurts, P., A. Irrthum, and L. Wehenkel, 2009: Supervised learning with decision tree-based methods in computational and systems biology. Mol. Biosyst., 5, 1593–1605, https://doi.org/10.1039/b907946g.
GREHYS, 1996: Presentation and review of some methods for regional flood frequency analysis. J. Hydrol., 186, 63–84, https://doi.org/10.1016/S0022-1694(96)03042-9.
Hastie, T., and R. Tibshirani, 1987: Generalized additive models: Some applications. J. Amer. Stat. Assoc., 82, 371–386, https://doi.org/10.1080/01621459.1987.10478440.
Hosking, J. R. M., and J. R. Wallis, 2005: Regional Frequency Analysis: An Approach Based on L-Moments. Cambridge University Press, 244 pp.
Hotelling, H., 1935: The most predictable criterion. J. Educ. Psychol., 26, 139–142, https://doi.org/10.1037/h0058165.
Ibbitt, R., and R. Woods, 2004: Re-scaling the topographic index to improve the representation of physical processes in catchment models. J. Hydrol., 293, 205–218, https://doi.org/10.1016/j.jhydrol.2004.01.016.
Jenson, S. K., and J. O. Domingue, 1988: Extracting topographic structure from digital elevation data for geographic information system analysis. Photogramm. Eng. Remote Sens., 54, 1593–1600.
Jung, K., P. R. Marpu, and T. B. M. J. Ouarda, 2017: Impact of river network type on the time of concentration. Arabian J. Geosci., 10, 546, https://doi.org/10.1007/s12517-017-3323-3.
Khalil, B., T. B. M. J. Ouarda, and A. St-Hilaire, 2011: Estimation of water quality characteristics at ungauged sites using artificial neural networks and canonical correlation analysis. J. Hydrol., 405, 277–287, https://doi.org/10.1016/j.jhydrol.2011.05.024.
Kisi, O., 2015: Pan evaporation modeling using least square support vector machine, multivariate adaptive regression splines and M5 model tree. J. Hydrol., 528, 312–320, https://doi.org/10.1016/j.jhydrol.2015.06.052.
Kisi, O., and K. S. Parmar, 2016: Application of least square support vector machine and multivariate adaptive regression spline models in long term prediction of river water pollution. J. Hydrol., 534, 104–112, https://doi.org/10.1016/j.jhydrol.2015.12.014.
Lawrence, S., and C. L. Giles, 2000: Overfitting and neural networks: Conjugate gradient and backpropagation. Proc. IEEE-INNS-ENNS Int. Joint Conf. on Neural Networks, Como, Italy, IEEE, 114–119, https://doi.org/10.1109/IJCNN.2000.857823.
Leathwick, J. R., D. Rowe, J. Richardson, J. Elith, and T. Hastie, 2005: Using multivariate adaptive regression splines to predict the distributions of New Zealand’s freshwater diadromous fish. Freshwater Biol., 50, 2034–2052, https://doi.org/10.1111/j.1365-2427.2005.01448.x.
Leathwick, J. R., J. Elith, and T. Hastie, 2006: Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecol. Modell., 199, 188–196, https://doi.org/10.1016/j.ecolmodel.2006.05.022.
Leclerc, M., and T. B. M. J. Ouarda, 2007: Non-stationary regional flood frequency analysis at ungauged sites. J. Hydrol., 343, 254–265, https://doi.org/10.1016/j.jhydrol.2007.06.021.
Lee, T.-S., and I.-F. Chen, 2005: A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Syst. Appl., 28, 743–752, https://doi.org/10.1016/j.eswa.2004.12.031.
Lee, T.-S., C.-C. Chiu, Y.-C. Chou, and C.-J. Lu, 2006: Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Comput. Stat. Data Anal., 50, 1113–1130, https://doi.org/10.1016/j.csda.2004.11.006.
Leitte, P., C. Petrescu, U. Franck, M. Richter, O. Suciu, R. Ionovici, O. Herbarth, and U. Schlink, 2009: Respiratory health, effects of ambient air pollution and its modification by air humidity in Drobeta-Turnu Severin, Romania. Sci. Total Environ., 407, 4004–4011, https://doi.org/10.1016/j.scitotenv.2009.02.042.
Li, Y., Y. He, Y. Su, and L. Shu, 2016: Forecasting the daily power output of a grid-connected photovoltaic system based on multivariate adaptive regression splines. Appl. Energy, 180, 392–401, https://doi.org/10.1016/j.apenergy.2016.07.052.
Masselink, R., A. J. A. M. Temme, R. Giménez, J. Casalí, and S. D. Keesstra, 2017: Assessing hillslope-channel connectivity in an agricultural catchment using rare-earth oxide tracers and random forests models. Geogr. Res. Lett., 43, 19–39, https://doi.org/10.18172/cig.3169.
Milborrow, S., 2018: Earth: Multivariate adaptive regression splines. R package, version 4.6.3, https://cran.r-project.org/web/packages/earth/index.html.
Muñoz, P., J. Orellana-Alvear, P. Willems, and R. Célleri, 2018: Flash-flood forecasting in an Andean mountain catchment—Development of a step-wise methodology based on the random forest algorithm. Water, 10, 1519, https://doi.org/10.3390/w10111519.
Niehoff, F., U. Fritsch, and A. Bronstert, 2002: Land-use impacts on storm-runoff generation: Scenarios of land-use change and simulation of hydrological response in a meso-scale catchment in SW-Germany. J. Hydrol., 267, 80–93, https://doi.org/10.1016/S0022-1694(02)00142-7.
O’Callaghan, J. F., and D. M. Mark, 1984: The extraction of drainage networks from digital elevation data. Comput. Vision Graphics Image Process., 28, 323–344, https://doi.org/10.1016/S0734-189X(84)80011-0.
Ouali, D., F. Chebana, and T. B. M. J. Ouarda, 2016: Non-linear canonical correlation analysis in regional frequency analysis. Stochastic Environ. Res. Risk Assess., 30, 449–462, https://doi.org/10.1007/s00477-015-1092-7.
Ouali, D., F. Chebana, and T. B. M. J. Ouarda, 2017: Fully nonlinear statistical and machine-learning approaches for hydrological frequency estimation at ungauged sites. J. Adv. Model. Earth Syst., 9, 1292–1306, https://doi.org/10.1002/2016MS000830.
Ouarda, T. B. M. J., 2016: Regional flood frequency modeling. Chow’s Handbook of Applied Hydrology, 3rd ed. V. P. Singh, Ed., McGraw-Hill, 77.71–77.78.
Ouarda, T. B. M. J., and C. Shu, 2009: Regional low-flow frequency analysis using single and ensemble artificial neural networks. Water Resour. Res., 45, W11428, https://doi.org/10.1029/2008wr007196.
Ouarda, T. B. M. J., M. Lang, B. Bobée, J. Bernier, and P. Bois, 1999: Synthèse de modèles régionaux d'estimation de crue utilisée en France et au Québec. Revue des sciences de l'eau/J. Water Sci., 12, 155–182, https://doi.org/10.7202/705347ar.
Ouarda, T. B. M. J., C. Girard, G. S. Cavadias, and B. Bobée, 2001: Regional flood frequency estimation with canonical correlation analysis. J. Hydrol., 254, 157–173, https://doi.org/10.1016/S0022-1694(01)00488-7.
Ouarda, T. B. M. J., C. Charron, P. R. Marpu, and F. Chebana, 2016: The generalized additive model for the assessment of the direct, diffuse, and global solar irradiances using SEVIRI images, with application to the UAE. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 9, 1553–1566, https://doi.org/10.1109/JSTARS.2016.2522764.
Ouarda, T. B. M. J., C. Charron, Y. Hundecha, A. St-Hilaire, and F. Chebana, 2018: Introduction of the GAM model for regional low-flow frequency analysis at ungauged basins and comparison with commonly used approaches. Environ. Modell. Software, 109, 256–271, https://doi.org/10.1016/j.envsoft.2018.08.031.
Pourghasemi, H. M., and N. Kerle, 2016: Random forests and evidential belief function-based landslide susceptibility assessment in Western Mazandaran Province, Iran. Environ. Earth Sci., 75, 185, https://doi.org/10.1007/s12665-015-4950-1.
Prasad, A. M., L. R. Iverson, and A. Liaw, 2006: Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9, 181–199, https://doi.org/10.1007/s10021-005-0054-1.
Quenouille, M. H., 1949: Problems in plane sampling. Ann. Math. Stat., 20, 355–375, https://doi.org/10.1214/aoms/1177729989.
Rahman, A., C. Charron, T. B. M. J. Ouarda, and F. Chebana, 2018: Development of regional flood frequency analysis techniques using generalized additive models for Australia. Stochastic Environ. Res. Risk Assess., 32, 123–139, https://doi.org/10.1007/s00477-017-1384-1.
Ramsay, T. O., R. T. Burnett, and D. Krewski, 2003: The effect of concurvity in generalized additive models linking mortality to ambient particulate matter. Epidemiology, 14, 18–23, https://doi.org/10.1097/00001648-200301000-00009.
Rounaghi, M. M., M. R. Abbaszadeh, and M. Arashi, 2015: Stock price forecasting for companies listed on Tehran stock exchange using multivariate adaptive regression splines model and semi-parametric splines technique. Physica, 438A, 625–633, https://doi.org/10.1016/j.physa.2015.07.021.
Roy, S. S., R. Roy, and V. E. Balas, 2018: Estimating heating load in buildings using multivariate adaptive regression splines, extreme learning machine, a hybrid model of MARS and ELM. Renewable Sustainable Energy Rev., 82, 4256–4268, https://doi.org/10.1016/j.rser.2017.05.249.
Saadi, M., L. Oudin, and P. Ribstein, 2019: Random forest ability in regionalizing hourly hydrological model parameters. Water, 11, 1540, https://doi.org/10.3390/w11081540.
Shu, C., and D. H. Burn, 2004: Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resour. Res., 40, W09301, https://doi.org/10.1029/2003WR002816.
Shu, C., and T. B. M. J. Ouarda, 2007: Flood frequency analysis at ungauged sites using artificial neural networks in canonical correlation analysis physiographic space. Water Resour. Res., 43, W07438, https://doi.org/10.1029/2006WR005142.
Sivakumar, B., 2007: Nonlinear determinism in river flow: Prediction as a possible indicator. Earth Surf. Processes Landforms, 32, 969–979, https://doi.org/10.1002/esp.1462.
Tarboton, D. G., R. L. Bras, and I. Rodriguez-Iturbe, 1991: On the extraction of channel networks from digital elevation data. Hydrol. Processes, 5, 81–100, https://doi.org/10.1002/hyp.3360050107.
Tasker, G. D., S. A. Hodge, and C. S. Barks, 1996: Region of influence regression for estimating the 50-year flood at ungaged sites. J. Amer. Water Resour. Assoc., 32, 163–170, https://doi.org/10.1111/j.1752-1688.1996.tb03444.x.
Wahba, G., 1990: Spline Models for Observational Data. SIAM, 181 pp.
Wang, W., X. Chen, P. Shi, and P. H. A. J. M. van Gelder, 2008: Detecting changes in extreme precipitation and extreme streamflow in the Dongjiang River Basin in southern China. Hydrol. Earth Syst. Sci., 12, 207–221, https://doi.org/10.5194/hess-12-207-2008.
Wang, Z., C. Lai, X. Chen, B. Yang, S. Zhao, and X. Bai, 2015: Flood hazard risk assessment model based on random forest. J. Hydrol., 527, 1130–1141, https://doi.org/10.1016/j.jhydrol.2015.06.008.
Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2013: Depth-based regional index-flood model. Water Resour. Res., 49, 7957–7972, https://doi.org/10.1002/2013WR013523.
Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2015: Delineation of homogeneous regions for regional frequency analysis using statistical depth function. J. Hydrol., 521, 232–244, https://doi.org/10.1016/j.jhydrol.2014.11.068.
Wazneh, H., F. Chebana, and T. B. M. J. Ouarda, 2016: Identification of hydrological neighborhoods for regional flood frequency analysis using statistical depth function. Adv. Water Resour., 94, 251–263, https://doi.org/10.1016/j.advwatres.2016.05.013.
Wen, R., K. Rogers, N. Saintilan, and J. Ling, 2011: The influences of climate and hydrology on population dynamics of waterbirds in the lower Murrumbidgee River floodplains in Southeast Australia: Implications for environmental water management. Ecol. Modell., 222, 154–163, https://doi.org/10.1016/j.ecolmodel.2010.09.016.
Wood, S. N., 2003: Thin plate regression splines. J. Roy. Stat. Soc., 65, 95–114, https://doi.org/10.1111/1467-9868.00374.
Wood, S. N., 2004: Stable and efficient multiple smoothing parameter estimation for generalized additive models. J. Amer. Stat. Assoc., 99, 673–686, https://doi.org/10.1198/016214504000000980.
Wood, S. N., 2006: Generalized Additive Models: An Introduction with R. 1st ed. CRC Press, 410 pp.
Wood, S. N., 2017: Generalized Additive Models: An Introduction with R. 2nd ed. CRC Press, 476 pp.
Xu, J., W. Li, M. Ji, F. Lu, and S. Dong, 2010: A comprehensive approach to characterization of the nonlinearity of runoff in the headwaters of the Tarim River, western China. Hydrol. Processes, 24, 136–146, https://doi.org/10.1002/hyp.7484.
Zhang, G., A. T. C. Goh, Y. Zhang, Y. Chen, and Y. Xiao, 2015: Assessment of soil liquefaction based on capacity energy concept and multivariate adaptive regression splines. Eng. Geol., 188, 29–37, https://doi.org/10.1016/j.enggeo.2015.01.009.
Zhang, W., and A. Goh, 2016: Evaluating seismic liquefaction potential using multivariate adaptive regression splines and logistic regression. Geomech. Eng., 10, 269–280, https://doi.org/10.12989/gae.2016.10.3.269.