The identification of the rainfall–runoff relationship is a significant precondition for surface–atmosphere process research and operational flood forecasting, especially in inadequately monitored basins. Based on an information diffusion model (IDM) improved by a genetic algorithm, a new algorithm (GIDM) is established for interpolating and forecasting monthly discharge time series; the input variables are the rainfall and runoff values observed during the previous time period. The genetic operators are carefully designed to avoid premature convergence and “local optima” problems while searching for the optimal window width (a parameter of the IDM). In combination with fuzzy inference, the effectiveness of the GIDM is validated using long-term observations. Conventional IDMs are also included for comparison. Twelve gauging stations on the Yellow River and Yangtze River are discussed, and the results show that the new method can simulate the observations more accurately than traditional IDMs, using only 50% or 33.33% of the total data for training. The low density of observations and the difficulties in information extraction are key problems for hydrometeorological research. Therefore, the GIDM may be a valuable tool for improving water management and providing acceptable input data for hydrological models when available measurements are insufficient.
1. Introduction
Scanty and incomplete data cannot meet the needs of hydrological modeling of physical processes. In addition, the rainfall–runoff relationship is one of the most complex hydrologic phenomena to comprehend because of the tremendous spatial and temporal variability of watershed characteristics and precipitation patterns (Sedki et al. 2009). Therefore, filling the existing observational data gaps and establishing an acceptable model for rainfall–runoff forecasting are crucial preconditions for hydrometeorological research and operational flood forecasting, especially in some undermonitored river basins. Tremendous efforts have been made over the last few decades to recover missing data and to improve hydrological predictions.
Most of the missing data recovery methods, such as kriging interpolation, polynomial interpolation, optimal interpolation, Kalman filtering, the successive corrections method, fractal interpolation, and phase space reconstruction prediction, have been widely applied to hydrologic and oceanographic interpolation. However, these methods may not achieve acceptable results when the known data make up less than 60% of the total (Wang et al. 2008).
Regarding how to estimate the relationship between rainfall and runoff accurately, many models have been proposed and have produced good results. These models can be broadly divided into three groups: regression-based methods, physical models, and artificial neural network (ANN) methods. The first group, which includes autoregressive moving-average models, has been widely used for reservoir design and optimization (Carlson et al. 1970; Chen and Rao 2002; Komorník et al. 2006; Salas 1993). However, these methods generally assume that the observations follow a normal distribution or require the form of the equation to be specified in advance. Therefore, it is very difficult to obtain a reasonable result for a small sample without any information about the shape of the population distribution.
For the physical models (Sorooshian et al. 1993; Todini 1996; Whigham and Crapper 2001), equations of mathematical physics are a popular means of describing the relationships in physical systems. However, the parameters of these models need to be estimated by minimizing objective functions, which generally leads to groups of unrealistic parameters that incorporate both data measurement errors and the errors present in the structure of the model itself; parameter observability conditions cannot always be guaranteed either.
In recent years, the ANN technique has been of particular interest in operational hydrology. It is capable of simulating a nonlinear system that is hard to describe using traditional physical modeling. ANNs and improved ANNs have been applied in many fields of hydrology and water resource research (Alvisi et al. 2006; Cheng et al. 2005; Keskin et al. 2006; Muttil and Chau 2006; Tao et al. 2008; Zounemat-Kermani and Teshnehlab 2008). However, ANNs generally require sufficient samples to set optimal connection weights and thresholds, which may become challenging when data information is incomplete.
To address the shortcomings of the existing interpolation methods and hydrologic models mentioned above, the information diffusion model (IDM) is introduced in this paper. The IDM is an effective method for dealing with the small-sample issue; it can capture complex nonlinear relationships without detailed knowledge of the physical processes (Huang 1997, 2001). In an IDM, an incomplete dataset is regarded as a piece of fuzzy information; by diffusing (spreading) the observations, additional information can be extracted. The diffusion coefficient [simple window width (SWW)] can be easily determined according to nearby criteria (Huang 1997) with incomplete data. Greater reliability for risk assessment and pattern recognition can be achieved using this method (Feng and Huang 2008; Huang 2002; Li et al. 2012). However, the IDM equipped with SWW (SIDM) is unable to precisely analyze meteorological, oceanic, and hydrological data that follow an asymmetrical, nonnormal distribution. To solve this issue, under the principle of least-mean-squared errors, Xinzhou et al. (2003) proposed the optimal window width (OWW)-based IDM (OIDM), which performs better than SIDM in estimating nonnormal populations. However, OWW uses the mean value of observations to iteratively compute an approximate result instead of taking all observations into account in one step, which may result in “local optima” and inconsistencies. Genetic algorithms (GAs) are based on the ideas of natural genetics and biological evolution; a GA works with a number of solution sets over the search domain rather than a single one, so that local optima can effectively be avoided (Goldberg and Holland 1988; Hong et al. 2013b). Hence, this paper presents a new method that uses a GA to obtain globally optimal diffusion coefficients for interpolating and forecasting monthly discharge time series.
To the best of our knowledge, no study has been reported in the hydrological literature that has used the IDM for hydrological modeling. Therefore, to facilitate our discussion, the principle of information diffusion and the algorithms to calculate the SWW and OWW are explained in section 2; in the same section, the new method to obtain the diffusion coefficients based on the GA is also discussed. Interpolation and prediction of river runoff using the IDM coupled with fuzzy inference are examined in section 3. To substantiate the new method, a step-by-step implementation of IDMs for monthly river runoff interpolation and forecasting at real gauging stations is presented in section 4. Finally, in section 5, some conclusions are presented and future work is proposed.
2. IDM and the window width
a. Principles of information diffusion and SWW
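Information diffusion refers to making an affirmation: when a knowledge sample is given, it can be used to compute a relationship between population and sample. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given sample (where curly brackets indicate a series of values) and let the universe of discourse be $U$. If and only if $X$ is incomplete, there must be a reasonable information diffusion function $\mu(x_i, u)$, $u \in U$, which can accurately estimate the real relation $R$. This is called the principle of information diffusion (Huang 1997). Let $x_1, x_2, \ldots, x_n$ be n independent identically distributed observations drawn from a population with density $p(x)$. Suppose $\mu(y)$ is a Borel measurable function in $(-\infty, \infty)$:

$$\hat{p}_n(x) = \frac{1}{n\Delta_n} \sum_{i=1}^{n} \mu\!\left(\frac{x - x_i}{\Delta_n}\right) \tag{1}$$

is called an information diffusion estimator about $p(x)$, where $\mu$ is the diffusion function and $\Delta_n$ is the window width or diffusion coefficient (Huang 1997). According to molecule diffusion theory and nearby criteria, Huang (1997) obtained the normal diffusion function as

$$\mu(y) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{y^2}{2}\right) \tag{2}$$

and the simple window width (SWW) as

$$\Delta_n = \begin{cases} 0.8146\,(b - a), & n = 5 \\ 0.5690\,(b - a), & n = 6 \\ 0.4560\,(b - a), & n = 7 \\ 0.3860\,(b - a), & n = 8 \\ 0.3362\,(b - a), & n = 9 \\ 0.2986\,(b - a), & n = 10 \\ 2.6851\,(b - a)/(n - 1), & n \ge 11 \end{cases} \tag{3}$$

where $b = \max_i x_i$ and $a = \min_i x_i$.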
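As a rough illustration of the estimator described in this subsection, the following Python sketch assumes the usual kernel-type form, in which every observation spreads its information through a normal diffusion function of width h. The function names and the sample values are hypothetical, and the factor 2.6851 for n ≥ 11 is the closed form commonly quoted for the SWW in the information diffusion literature; it is treated here as an assumption, and samples with fewer than 11 records are not covered.

```python
import math


def simple_window_width(sample):
    """SWW for n >= 11 (assumed closed form; smaller n uses tabulated factors)."""
    n = len(sample)
    if n < 11:
        raise ValueError("this sketch only covers samples with n >= 11")
    return 2.6851 * (max(sample) - min(sample)) / (n - 1)


def normal_diffusion(y):
    """Standard normal diffusion function."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def diffusion_estimate(x, sample, width):
    """Information diffusion estimate of the population density at x."""
    n = len(sample)
    spread = sum(normal_diffusion((x - xi) / width) for xi in sample)
    return spread / (n * width)


# Hypothetical sample of 12 values, for illustration only.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.0, 9.5, 11.2, 12.9, 15.3, 21.6]
h = simple_window_width(sample)
density_at_8 = diffusion_estimate(8.0, sample, h)
```

Because each observation contributes a normalized Gaussian, the estimate integrates to approximately one regardless of the window width; the width only controls how far each record is spread.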
b. Optimal window width
The SWW-based information diffusion method maximizes the amount of useful data extracted from the sample, thus improving the accuracy of system recognition and natural disaster risk evaluation (Huang 2001; Palm 2007). However, the method is invalid when the population from which observations are drawn does not follow a normal distribution. To obtain a more accurate information diffusion estimator for nonnormal populations, based on the principle of least-mean-squared error, Xinzhou et al. (2003) proposed an iterative method to obtain the optimal diffusion coefficient (OWW), which can be expressed by
where is the initial iterative value; and denote the ordinal number of records and iterations, respectively; and , , and return the largest, smallest, and mean elements in , respectively. When is determined, the iterative computations end with the OWW:
The IDM in conjunction with the OWW, that is, the OIDM, applies to data that follow either normal or nonnormal distributions and can estimate a real relationship more accurately than the traditional SIDM (Xinzhou et al. 2003). However, because the OWW is obtained by iterating on the mean value of observations instead of including all observations in one step, the local optima problem (Goldberg and Holland 1988) may emerge. To avoid this problem, GAs are employed.
c. Searching the general optimal window width using the GA
GAs are established based on the ideas of natural genetics and biological evolution; the algorithm works with a number of solution sets over the search domain so that local optima can effectively be avoided (Goldberg and Holland 1988). Hence, this paper uses a GA to search for globally optimal diffusion coefficients. The combination of the GA and IDM for window width searching consists of three major phases. In the first phase, the GA initializes a population composed of random codes from the search domain (Xinzhou et al. 2003), where b and a are the maximum and minimum values of the samples, respectively. Because binary-encoded GAs involve too many variables for such optimization problems, this paper selects a real-coded GA, in which each chromosome is encoded not with binary numbers but with real ones. The second phase is the evaluation of the fitness of all chromosomes. According to Xinzhou et al. (2003), the window width can be obtained by
where is the information diffusion estimator and denotes different records from sample . Motivated by second-order schemes (Flajolet and Sedgewick 1995),
which is the criterion for determining the window width. Thus, different records have their own window widths. To take all records into account at the same time and search for a single globally optimal , we select
as the fitness function. The third phase applies evolutionary operations, such as selection, crossover, and mutation, to the chromosomes according to their fitness, as discussed in Goldberg and Holland (1988) and Hong et al. (2013a). The evolution stops when the fitness is smaller than a predefined value. Finally, the chromosome with the lowest fitness value is adopted as the improved window width (IWW).
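The three phases can be sketched with a minimal real-coded GA in Python. This is an illustration under stated assumptions rather than the authors' implementation: the fitness used here is a stand-in (the leave-one-out negative mean log density of a normal diffusion estimator), not the fitness function of this section; the search stops after a fixed number of generations instead of at a fitness threshold; and the operator choices (tournament selection, arithmetic crossover, Gaussian mutation, elitism), all names, and all parameter values are hypothetical.

```python
import math
import random


def loo_fitness(width, sample):
    """Stand-in fitness: leave-one-out negative mean log density (lower is better)."""
    n = len(sample)
    norm = (n - 1) * width * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, xi in enumerate(sample):
        dens = sum(math.exp(-0.5 * ((xi - xj) / width) ** 2)
                   for j, xj in enumerate(sample) if j != i) / norm
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return -total / n


def real_coded_ga(fitness, low, high, pop_size=30, generations=40,
                  crossover_rate=0.9, mutation_rate=0.2, seed=0):
    """Minimize fitness over [low, high] with real-valued chromosomes."""
    rng = random.Random(seed)
    # Phase 1: random real-coded population over the search domain.
    pop = [rng.uniform(low, high) for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        # Phase 2: evaluate the fitness of all chromosomes.
        scores = [fitness(c) for c in pop]

        def tournament():
            i, j = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[i] if scores[i] <= scores[j] else pop[j]

        # Phase 3: selection, crossover, and mutation (best chromosome is kept).
        children = [best]
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            child = p1
            if rng.random() < crossover_rate:
                w = rng.random()
                child = w * p1 + (1.0 - w) * p2  # arithmetic crossover
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, 0.1 * (high - low))
            children.append(min(max(child, low), high))
        pop = children
        best = min(pop, key=fitness)
    return best


# Hypothetical sample; the search domain is a fraction of the range b - a.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.0, 9.5, 11.2, 12.9, 15.3, 21.6]
low = 0.01 * (max(sample) - min(sample))
high = max(sample) - min(sample)


def fitness(width):
    return loo_fitness(width, sample)


best = real_coded_ga(fitness, low, high)
```

Keeping the best chromosome in every generation means the best fitness never worsens, while mutation and the population-based search reduce the chance of stalling at a local optimum.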
After a brief overview of IDM and improved IDM techniques is presented, the procedure for interpolating and predicting river discharge based on the IDM is described in the next section.
3. IDM for runoff estimation using fuzzy inference
The IDM coupled with fuzzy inference is an approach that processes samples using a set numerical method (Huang 2002). Let be a training set of observations on ( is the real line), where input denotes the index of records sorted by chronological order or precipitation data, is the river discharge, and curly brackets indicate a series of values.
Let be the domain of and be the range of . The element of will be denoted by , the same for by , and curly brackets indicate a series of values. Let
The Cartesian product is called the illustrating space, and and are fuzzy sets of and , respectively. Recalling Eqs. (1) and (2) to deal with their membership functions, the following can be obtained:
An illustrating point is given by , and and are window widths that can be obtained by and based on the algorithms discussed in section 2. The information gain of is
Then we have
which consists of the information matrix (Huang 2002)
According to the theory of factor space (Pei-Zhuang 1990), a fuzzy relation matrix ,
can be obtained from an information matrix by using
To calculate the output fuzzy set , we use to denote the input fuzzy set,
Using the fuzzy inference formula
where operator “” denotes the maximum − minimum fuzzy composition rule,
where ; thus, we can obtain
Finally, the gravity center of the fuzzy set is generated as the output:
In general, we use the given sample to construct a relationship between the river discharge and its antecedent values or its meteorological influencing factor in the following form:
where is an input vector consisting of and is the flow in the next period or the lack of measurement. Thus, the value of river runoff occurring in a particular moment can be estimated by IDM with the help of fuzzy inference.
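The chain described in this section (information matrix, fuzzy relation matrix, max–min composition, and gravity center) can be sketched in Python as follows. The sketch rests on assumptions that should be checked against Huang (2002): both membership functions are taken as unnormalized Gaussians, the relation matrix is obtained by dividing each column of the information matrix by its maximum, and the input fuzzy set is scaled to a peak of one. All names, illustrating points, window widths, and data are hypothetical.

```python
import math


def gauss(x, u, h):
    """Normal diffusion of a value x onto an illustrating point u."""
    return math.exp(-((x - u) ** 2) / (2.0 * h * h))


def information_matrix(xs, ys, U, V, hx, hy):
    """Accumulate the information gain of every record on the grid U x V."""
    Q = [[0.0] * len(V) for _ in U]
    for x, y in zip(xs, ys):
        for j, u in enumerate(U):
            gx = gauss(x, u, hx)
            for k, v in enumerate(V):
                Q[j][k] += gx * gauss(y, v, hy)
    return Q


def fuzzy_relation(Q):
    """Assumed normalization: divide each column by its largest entry."""
    n_cols = len(Q[0])
    col_max = [max(row[k] for row in Q) for k in range(n_cols)]
    return [[row[k] / col_max[k] if col_max[k] > 0 else 0.0
             for k in range(n_cols)] for row in Q]


def infer(x0, U, V, R, hx):
    """Max-min composition followed by the gravity center of the output set."""
    a = [gauss(x0, u, hx) for u in U]
    peak = max(a)
    a = [m / peak for m in a]  # input fuzzy set with peak 1
    b = [max(min(a[j], R[j][k]) for j in range(len(U))) for k in range(len(V))]
    return sum(v * m for v, m in zip(V, b)) / sum(b)


# Synthetic example: y = 2x sampled at x = 0, 1, ..., 10.
xs = [float(i) for i in range(11)]
ys = [2.0 * x for x in xs]
U = [float(i) for i in range(11)]        # illustrating points for the input
V = [2.0 * k for k in range(11)]         # illustrating points for the output
R = fuzzy_relation(information_matrix(xs, ys, U, V, hx=1.0, hy=2.0))
estimate_low = infer(2.0, U, V, R, hx=1.0)
estimate_high = infer(8.0, U, V, R, hx=1.0)
```

Because the output is a gravity center over a fuzzy set with nonzero tails, the estimates are smoothed toward the middle of the output range; narrower window widths reduce this effect at the cost of using less of the diffused information.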
4. Case study
To investigate the effectiveness of the proposed model (GIDM, an IDM improved by a genetic algorithm) for runoff prediction and interpolation, two groups of experiments have been conducted. In section 4b, GIDM is applied to reconstruct and forecast monthly discharge and is compared with SIDM and OIDM using the same sparse observed data at Lijin station and neighboring stations. In section 4c, further validation at other gauging stations is discussed.
a. Study area and data
The Tangnaihai, Lanzhou, Zhengzhou, Huayuankou, Jinan, and Lijin stations on the Yellow River, along with the Yibin, Zhutuo, Wanzhou, Yichang, Wuhu, and Datong stations on the Yangtze River, have been selected for this study (see Fig. 1). The Yangtze River, the longest river in China with a total length of 6380 km and a drainage basin of 1.8 × 10⁶ km², covers 20% of China’s land area. The surface runoff of the Yangtze River is about 9.616 × 10¹¹ m³, which accounts for 36% of the total runoff in China. The Yellow River is equally vital in China’s hydrological cycle, with a mainstream length of 5464 km and an area of 752 443 km². It originates from the Tibetan Plateau and flows eastward, crossing six Chinese provinces and two autonomous regions on its course to Bo Hai. The Lijin station is the farthest downstream on the Yellow River and is the master regulation station for river discharge and sediment. Because of its importance, Lijin is selected to evaluate in detail the performance of GIDM in simulating changes in runoff, with the other stations serving as auxiliary evaluation sites. The monthly runoff data from January 1951 to December 2010 are published by the Yellow River Conservancy Commission (YRCC) and in the Bulletins of Chinese River Sediment compiled by the Ministry of Water Resources. Precipitation data from January 1981 to December 2010 are collected from the China Meteorology Administration.
b. Experiments estimating monthly runoff time series at Lijin
1) Experiment 1: Interpolating runoff time series using 50% of the total data
A real example of the monthly runoff data (×10⁸ m³) from January 1951 to December 1965, taken at Lijin station, is presented to illustrate the step-by-step implementation of different IDMs.
Step 1. Let records measured from January 1951 to December 1965 be .
Step 2. Based on the Monte Carlo method, 50% of the observation data are pseudorandomly selected as the input data (see Table 1) and the remaining 50% are missing data or lack of measurements.
Step 4. In the information diffusion technique, the selection of appropriate illustrating points is crucial for successful implementation because it provides the basic information about the system. Through a statistical analysis of the data series, the illustrating space can be well established. The input data are analyzed with respect to their distribution in Fig. 2. The more elements an interval contains, the more illustrating points should be placed in it. Therefore, the illustrating space (where curly brackets indicate a series of values) is designed as
To measure the consistency of SIDM, OIDM, and GIDM, the monthly runoff for the years 1966–80, 1981–95, and 1996–2010 is examined following the experiment discussed above. Figure 3 displays the aggregated time series of observed and interpolated runoffs. It can be observed that the GIDM outperforms the other two models. For example, in Fig. 3a, SIDM and OIDM produce some obvious undersimulations in July 1951, June 1955, November 1958, October 1961, and July 1964, as well as oversimulations in March 1956, January 1957, and October 1965. In contrast, the GIDM output correlates well with the observations in these months. Although some discrepancies exist between the observed data and those simulated using GIDM (e.g., from June to November 1953), the general tendency is acceptable, considering the limited number of training samples.
The root-mean-square error (RMSE; Wang et al. 2009; Nayak et al. 2004), the coefficient of correlation R (Wang et al. 2009), the Nash–Sutcliffe efficiency coefficient E (Nash and Sutcliffe 1970), and the mean absolute percentage error (MAPE; Hu et al. 2001) are employed as objective functions to calibrate the model. Table 3 shows the RMSE, R, E, and MAPE values for different models. It is clear from Table 3 that GIDM performs better than the traditional SIDM and OIDM. For example, in the years 1966–80, given a high value of 145.2000 × 10⁸ m³ and a very low value of 0.4692 × 10⁸ m³ at the Lijin gauging station, the GIDM, with an RMSE value of 20.4493 × 10⁸ m³, performed the interpolation satisfactorily. Moreover, the GIDM obtained the best R, E, and MAPE statistics of 0.8580, 0.7159, and 28.69, respectively. Coefficient R evaluates the linear correlation between the observed and computed flow, E evaluates the capability of the model in predicting flow values deviating from the mean, and MAPE measures the mean absolute percentage error of the forecast. Therefore, according to the values in Table 3, it can be concluded that the GIDM is robust and consistent.
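For reference, the four statistics can be computed as in the Python sketch below, which assumes their standard textbook definitions (the exact variants used in the cited papers may differ slightly, for example in how MAPE treats zero observations). The function names and the small test series are hypothetical.

```python
import math


def rmse(obs, sim):
    """Root-mean-square error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def correlation(obs, sim):
    """Pearson coefficient of correlation R."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency E (1 is a perfect fit)."""
    mo = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - residual / sum((o - mo) ** 2 for o in obs)


def mape(obs, sim):
    """Mean absolute percentage error, in percent (observations must be nonzero)."""
    return 100.0 * sum(abs((o - s) / o) for o, s in zip(obs, sim)) / len(obs)


# Hypothetical observed and simulated values, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
simulated = [12.0, 18.0, 33.0, 39.0]
scores = {
    "RMSE": rmse(observed, simulated),
    "R": correlation(observed, simulated),
    "E": nash_sutcliffe(observed, simulated),
    "MAPE": mape(observed, simulated),
}
```

Lower RMSE and MAPE and higher R and E indicate better agreement, which is the sense in which the models are ranked in the tables.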
2) Experiment 2: Interpolating runoff time series using 33.33% of the total data
Subsequent to experiment 1, in step 2 only 33.33% of the monthly discharge data are selected as the input, with the remaining 66.67% used for testing. Figure 4 shows a plot of observed and reconstructed discharges using different models. It can be observed that the GIDM still correlates well with the recorded discharges, although there are some slight oversimulations and undersimulations. Table 4 compares the models in terms of various performance statistics. For example, in the years 1981–95, relative to the SIDM, the GIDM reduced RMSE and MAPE by about 23.92% and 16.14%, respectively, and improved R and E by approximately 20.23% and 40.26%, respectively. Compared with the OIDM, the RMSE and MAPE values obtained by the GIDM decreased by 16.04% and 30.24%, while the R and E values increased by 10.75% and 20.59%. Overall, GIDM obtains better accuracy in terms of the different evaluation measures. In addition, as sparser datasets are used for training, all model simulations gradually deteriorate. This is because the samples contain less information about the river runoff.
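The pseudorandom subsampling in step 2 of experiments 1 and 2 can be sketched in Python as follows. The sketch assumes a uniform draw without replacement using Python's `random` module; the generator and seed actually used are not stated in this section, and the function name is hypothetical.

```python
import random


def split_observations(n_records, train_fraction, seed=0):
    """Pick a pseudorandom subset of record indices for training.

    Returns (train_idx, test_idx); the test indices play the role of the
    missing data that the model has to interpolate.
    """
    rng = random.Random(seed)
    n_train = round(n_records * train_fraction)
    train_idx = sorted(rng.sample(range(n_records), n_train))
    chosen = set(train_idx)
    test_idx = [i for i in range(n_records) if i not in chosen]
    return train_idx, test_idx


# January 1951 to December 1965 gives 15 years x 12 months = 180 records.
half_train, half_test = split_observations(180, 0.5)
third_train, third_test = split_observations(180, 1.0 / 3.0)
```

With 180 monthly records, the two settings keep 90 and 60 records for training, respectively. Fixing the seed makes a given split reproducible, so that all three IDMs can be compared on exactly the same sparse input.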
3) Experiment 3: 12- and 24-month lead time forecasting
A new framework is proposed using IDM to investigate the relationship between upstream rainfall and predicted discharges at Lijin station. Assuming that the discharges at Lijin from January to December 2010 are ungauged, the only observations we have are the broken time series of monthly flow data from January 1981 to December 2009 measured at Lijin station and broken precipitation data at neighboring Jinan (see Fig. 1) station from January 1981 to December 2010.
Step 1. Let records from January 1981 to December 2009 be denoted by , where and are the rainfall and flow data, respectively, and curly brackets indicate a series of values.
Step 2. Based on the Monte Carlo method, 33.33% of the observation data (see Table 5) are pseudorandomly selected for training the window widths, with the remaining 66.67% as missing data.
Step 3. Calculate the window widths. The values of SWW, OWW, and IWW are listed in Table 6.
Step 4. The corresponding illustrating space (where curly brackets indicate a series of values) with respect to the input distribution (see Fig. 5) is designed as
Following the steps discussed above, the results of SIDM, OIDM, and GIDM for runoff forecasting at Lijin can be obtained (see Fig. 6). Figure 6 shows that the variation of runoff at Lijin is closely related to changes in upstream precipitation. Some slight differences in tendency between rainfall and runoff suggest that the river flow responds not only to rainfall but also to other physical factors, such as evaporation and soil moisture, or to intensive human activities. In addition, Fig. 6 indicates a good match between the model output and observed runoff, especially in peak discharge forecasting using GIDM, which means the new method may be used as an operational flood forecasting tool.
To validate the effectiveness of SIDM, OIDM, and GIDM for runoff forecasting, a 24-month lead time prediction has also been made for the period from January 2009 to December 2010. The above-mentioned experiment is repeated 30 times using different rainfall and runoff data for training, from Zhengzhou (see Fig. 1) and Lijin, respectively. Table 8 shows the average performance of the different IDMs, and Fig. 7 presents the monthly hydrograph of observed and simulated river runoff for Lijin in the first experiment. They confirm that the variation of runoff at Lijin is sensitive to changes in precipitation at upstream stations and that the GIDM performed better than the other two models: the GIDM improved the performance of traditional models by about 0.6%–33% in terms of different evaluation criteria.
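Percentage improvements of this kind are presumably relative changes with respect to the baseline model; the Python sketch below shows that reading, which is an assumption because the definition is not spelled out in this section. The function name and the numbers are hypothetical and are not taken from the tables.

```python
def relative_improvement(baseline, candidate, lower_is_better=True):
    """Relative improvement of candidate over baseline, in percent.

    For error measures (RMSE, MAPE) a decrease counts as improvement;
    for skill measures (R, E) an increase does.
    """
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    change = baseline - candidate if lower_is_better else candidate - baseline
    return 100.0 * change / abs(baseline)


# Hypothetical scores, for illustration only.
rmse_gain = relative_improvement(25.0, 20.0)                      # error measure
r_gain = relative_improvement(0.50, 0.60, lower_is_better=False)  # skill measure
```

When an experiment is repeated many times, as with the 30 runs here, the scores would typically be averaged over the runs before the improvement is computed.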
c. More interpolation and forecasting experiments at other stations
Subsequent to experiments 2 and 3, the three IDMs have been used to interpolate monthly river discharges and to estimate the rainfall–runoff relationship at five other gauges. In particular, the runoff curves from Huayuankou and Yichang are given (see Figs. 8, 9) because these stations lie on the middle reaches of two different main rivers in China and have different physiographical factors, such as catchment area and underlying surface. Although there are some obvious underestimations in Fig. 8a, the results indicate that the GIDM provides better interpolation and prediction performance than traditional IDMs. Moreover, the analysis in Figs. 8b and 9 supports the consistency of the new method.
Estimates of discharge variations at the Lanzhou, Zhutuo, and Datong stations are shown in Tables 9 and 10. According to the values in Tables 9 and 10, the GIDM obtained the best RMSE, R, E, and MAPE statistics for the interpolations and predictions at Lanzhou, Zhutuo, and Datong. In summary, the GIDM shows considerable promise for the interpolation and prediction of river runoff from incomplete information.
5. Conclusions
A new algorithm for reconstructing and forecasting river discharges with incomplete data has been proposed in this paper. The purpose of the algorithm is to improve the diffusion coefficient of the IDM so that more information can be extracted from sparse data. Conventional IDMs are also included for comparison. The monthly runoff data from the Lijin, Lanzhou, Huayuankou, Zhutuo, Yichang, and Datong gauging stations and upstream rainfall data are employed to train and validate the different IDMs. According to the results obtained, the potential of the new method can be summarized as follows:
The GIDM is appropriate for estimating the relationship between monthly runoff and rainfall upstream with scanty data, which is crucial for flood prevention and water management in unmeasured basins.
With sparse observations, the GIDM is an operational tool for long-term interpolation of river runoff, which may provide acceptable input data for hydrologic modeling of physical processes; traditional time series approaches have to use much more information to obtain an acceptable result.
On average, the GIDM can improve traditional IDM interpolation and prediction by about 10%–40% in terms of the different performance criteria.
Although it is concluded that the new IDM is sufficient to model the runoff time series, its results may still be unacceptable when samples are much too sparse to permit effective simulation and forecasting. Therefore, it is hoped that future research will focus on establishing a more efficient diffusion function, on estimating the relationship between runoff and other meteorological factors, and on reducing the computational time needed to search for optimal IDM parameters, so as to improve the accuracy of hydrological simulation and to achieve better operation and management of the various engineering systems.
The authors are very grateful to Dr. Joe Turk and the anonymous reviewers for their valuable comments and constructive suggestions, which helped us significantly in improving the quality of the paper. This research was supported by the Chinese National Natural Science Fund (Grants 41375002, 41075045, 51190091, and 41071018) and the Chinese National Natural Science Fund of Jiangsu Province (BK2011123), the Program for New Century Excellent Talents in University (NCET-12-0262), the China Doctoral Program of Higher Education (20120091110026), the Qing Lan Project, the Elite Young Teachers Program, and the Excellent Disciplines Leaders in Midlife-Youth Program of Nanjing University.