## Abstract

Correlation analysis is used to determine the linear relationship between the Nile River flows and leading climatic indicators, such as SST and precipitation, in an effort to establish a basis for quantitative long-term streamflow prediction. The analysis of the lead–lag correlations between the Blue Nile River flows during the “flood season” [July–August–September–October (JASO)] and SSTs led to the identification of a number of regions in the oceans that are significantly correlated and suggests that the SSTs may be useful for predicting the Blue Nile flows. The significant correlation regions between SST in the Pacific and Blue Nile JASO flows evolve through time in a manner that is consistent with the ENSO development; that is, the evolution of the ENSO signal in the Pacific Ocean is reflected in the evolution of the referred cross-correlation field. In addition, the Blue Nile River JASO flow is significantly correlated with the previous year August–November Guinea precipitation, which suggests that the Guinea precipitation is another potential predictor of the Blue Nile River flows with 11 months of lead time. Furthermore, models based on multiple linear regression (MLR) and principal component analysis (PCA) are used to forecast the Blue Nile flows based on SST in the three oceans and the previous year of Guinea precipitation. The models based on PCA showed significant improvement in forecast accuracy over MLR models that were developed in terms of the original variables. The predictability is shown to be the highest for forecasts made in the preceding season and decreases as the lead time increases. The coefficients of multiple determination *R*^{2} for validation based on PCA models vary in the range 84%–59% for forecast lead times of 4–16 months. 
Further analysis using only SST predictors for the period 1913–89 indicates that the predictability of the Blue Nile River JASO flows is more affected by the variability of SSTs in the Pacific Ocean than by those of the other oceans. The conclusion is that long-range forecasting of the Blue Nile River flows with lead times over 1 yr is possible with a high degree of explained variance by using SST in a few regions in the Pacific Ocean and the previous year of Guinea precipitation.

## Introduction

For centuries, the flow of the Nile River has been of vital importance to several African nations, including Egypt and Sudan. The main uses of the Nile River waters have been agriculture and hydropower. Accurate forecasts of water availability are essential for efficient management of water resource facilities, like reservoirs and diversions, but forecasting the stream flows of the Nile River system is a daunting task. The river system is vast and varied, encompassing many different climates, landscapes, and geographical features. The immense size of the Nile River system combined with the great variability of the system's flow has posed a persistent challenge to forecasters. The problem has attracted the attention of many well-known hydrologists, water resources engineers, climatologists, and general scientists (e.g., Hurst 1951; Feller 1951; Yevjevich 1965; Mandelbrot and Wallis 1968; Quin 1992; Eltahir 1996).

The Nile River is formed by the confluence of the White Nile and Blue Nile Rivers. Downstream from their junction, the Atbara River joins the Nile as a tributary. The focus of this paper is on the predictability of the Blue Nile River flows. The major source of water of the Blue Nile and Atbara Rivers is the Ethiopian Plateau. The Ethiopian high plateau lies in the interaction zone of two monsoon regimes: the African (Sahelian) moisture flux, whose moisture originates far in the South Atlantic, and the Indian circulation, coming from the Indian Ocean (Awadallah and Rousselle 1999). The “big rains” of Ethiopia during the rainy season are also said to rely on moisture advection from the Congo basin through the southwesterly monsoon (Camberlin 1997). Convective instability, due to the intense heating of the high plateau land, also accounts for a large percentage of the rainfall (Seleshi 1991). Seleshi also indicated that the cause of the Ethiopian rainfall is the strong movement of moist air from the southwest (Gulf of Guinea, high) to the northeast (center of Arabia, low).

The predictability of hydrometeorological processes has intrigued hydrometeorologists and hydrologists for many years. The relationship between the variability of certain processes, such as precipitation and streamflow, with changes in large-scale oceanic–atmospheric circulation patterns has been a topic of increasing interest. For example, studies have shown a strong teleconnection between a particular oceanic–atmospheric phenomenon in the tropical Pacific Ocean, known as El Niño–Southern Oscillation (ENSO), and the hydroclimatic variations in many places of the globe (Philander 1990). In fact, sea surface temperature (SST) distribution over the world's oceans is now accepted to be one of the decisive factors that govern the level of global atmospheric activity. The monthly distributions of SST anomalies around the world appear to directly influence the monthly anomalies of the global atmosphere general circulation (Pan and Oort 1990).

Several studies have identified ENSO as an important factor in long-range hydroclimatic forecasting in many parts of the world (e.g., Dracup and Kahya 1994; Eltahir 1996; Guetter and Georgakakos 1996; Piechota and Dracup 1997; Piechota et al. 1998; Hamlet and Lettenmaier 1999; Wang and Eltahir 1999). The mechanism by which ENSO affects the Nile basin hydrology is not fully understood, although its effects on the Nile River flows have been identified (e.g., Bhatt 1989; Conway and Hulme 1993; Eltahir 1996). In general, low-flow years appear to be associated with warm events (El Niño) while high-flow years are associated with cold events (La Niña). However, not every low- or high-flow year in the Nile River is El Niño or La Niña induced. For example, the years 1913, 1915, 1940, 1979, and 1984 were low-flow years, but they were not El Niño years. Likewise, the years 1917, 1929, 1946, and 1969 were high-flow years, but they are not classified as La Niña years, according to Fraedrich and Muller (1992). Other large-scale oceanic–atmospheric phenomena, such as those related to the Atlantic and Indian Oceans, may exert a significant influence on the climate of the Nile basin. Thus, other predictors need to be identified in order to determine additional signals that may contribute to the variability of the Nile River flows.

Simpson et al. (1993a,b) used ENSO to forecast the annual discharges of Australian rivers. Their studies demonstrated that forecasts up to 1 yr in advance of the Pacific Ocean SST provide a mechanism for estimating probabilities of annual river discharge. These improvements in the understanding of global teleconnections of weather and climate have made it feasible to extend the lead time, or prediction horizon, of streamflow forecasts for the Nile River. Over the last few years, several Nile flow precursors have been identified and prediction tests performed. For example, Attia and Abulhoda (1992) and Eltahir (1996) found a prediction flow signal in the Pacific Ocean. Amarasekera et al. (1997) found that ENSO explains 25% of the Nile flow variability. Wang and Eltahir (1999) used Bayesian analysis to develop a discriminant flow-forecasting algorithm based on forecasted ENSO, rainfall, and recent flows for both the short range (lead times of the order of the hydrological response timescale of 2–3 months) and the long range (lead times longer than 2–3 months). Awadallah and Rousselle (1999) developed neural network and transfer function models to predict the Nile River inflows to the High Aswan Dam (HAD) in Egypt. They found significant improvement in model skill by incorporating SST in three locations of the Pacific, Indian, and Atlantic Oceans. The variance explained was about 63% for a 3-month forecasting horizon. These studies and others brought a new perspective to Nile River flow forecasting.

The aim of this study is to improve the accuracy of flow forecasts and to extend the forecast lead time of the Nile River flows by identifying significant regions of SST in the oceans and by using other predictors, such as the Guinea precipitation in western Africa. For this purpose, the research reported herein uses correlation analysis, multiple linear regression (MLR), and principal component analysis (PCA).

## Streamflow and SST data

The naturalized monthly flows of the Blue Nile River at Eldeim (station located upstream from the Roseires Reservoir in Sudan) are the main flow data utilized in this study. The original data were obtained from the Ministry of Public Works and Water Resources of the Arab Republic of Egypt as part of a major statistical simulation study undertaken for the entire Nile River system (Salas et al. 1995). For this purpose, data of unequal lengths at 12 flow sites were extended to the common period 1913–89 using correlation analysis and regression models. In particular, data for the Blue Nile River at Eldeim (1964–94) were extended based on the Blue Nile naturalized data at Roseires (1914–90) and Khartoum, Sudan (1913–92). The Blue Nile River is the main source of water for Sudan and Egypt (Johnson and Curtis 1994). It contributes about 60% of the Nile River flow. Also based on the record 1913–89, the July–August–September–October (JASO) flows account for about 85% of the Blue Nile River annual flows (at Eldeim) and about 68% of the Nile River annual flows (at Aswan). Likewise, the July–September precipitation on the Ethiopian Plateau accounts for 60% and 71% of the annual precipitation at Addis Ababa, Ethiopia, and Asmara, Eritrea, respectively (Seleshi 1991). For illustration, Fig. 1 shows the mean monthly flows for the Blue Nile River at Eldeim, which summarizes the flow regime with a well-defined summer and autumn maximum, where most of the precipitation over the Ethiopian Plateau occurs. The forecast of the flow data for the season JASO has been the objective of our study.

The primary data used as predictor variables in this study are SST in the oceans and the previous year's Guinea precipitation. The SSTs were obtained from two sources: 1) the Comprehensive Ocean–Atmosphere Data Set (COADS), which is available on a monthly basis for grid areas of 2° latitude × 2° longitude (Tanco and Berri 1999); the SST data for the period 1950–97 were extracted using the software CLIMLAB2000 (Tanco and Berri 1999) for the region 60°N–44°S and 180°W–180°E; and 2) the British Atmospheric Data Centre (BADC) global mean SST data for the period 1913–94 (Rayner et al. 1996). The SST data were then averaged for each of three seasons: November–December–January–February (NDJF), March–April–May–June (MAMJ), and JASO. We used Guinea precipitation (western Africa) as another predictor because it has been shown to be correlated with the Ethiopian precipitation (Seleshi 1991). These data have also been used in other forecasting studies (e.g., Gray et al. 1992). The Guinea precipitation data for the season comprising the months August–September–October–November (ASON) during 1953–89 were obtained from Gray et al. (1992, 1994).
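The seasonal averaging described above can be sketched as follows. This is a minimal illustration, not the CLIMLAB2000 procedure: the function name `seasonal_means`, the `(n_years, 12)` array layout, and the `SEASONS` table are assumptions introduced here. The convention that NDJF is dated by the year in which January occurs is the one adopted later in the paper.

```python
import numpy as np

# Hypothetical season table: (year offset, calendar month) pairs.
# NDJF is dated by the January year, so November and December
# come from the preceding calendar year (offset -1).
SEASONS = {
    "NDJF": [(-1, 11), (-1, 12), (0, 1), (0, 2)],
    "MAMJ": [(0, 3), (0, 4), (0, 5), (0, 6)],
    "JASO": [(0, 7), (0, 8), (0, 9), (0, 10)],
}


def seasonal_means(monthly, first_year, season):
    """Average monthly values into one 4-month seasonal value per year.

    monthly    : array of shape (n_years, 12), January..December per row
    first_year : calendar year of the first row
    season     : one of "NDJF", "MAMJ", "JASO"
    Returns (years, means); a year is skipped when the season would
    need months that precede the start of the record.
    """
    monthly = np.asarray(monthly, dtype=float)
    years, means = [], []
    for k in range(monthly.shape[0]):
        if min(k + dy for dy, _ in SEASONS[season]) < 0:
            continue  # e.g. NDJF of the first year lacks the prior Nov-Dec
        values = [monthly[k + dy, month - 1] for dy, month in SEASONS[season]]
        years.append(first_year + k)
        means.append(np.mean(values))
    return np.array(years), np.array(means)
```

Applied to each grid cell (or to the flow record), this yields one series per season that can then be paired with the JASO flows at the desired lag.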

## Methodology

We develop models for forecasting the Blue Nile River water discharge during the 4-month period JASO based on SST and precipitation data observed in previous 4-month periods, for example, MAMJ, NDJF, JASO, etc., for 4-, 8-, and 12-month forecast lead times, respectively. The forecasting methodology utilized in this paper can be summarized in three steps: 1) developing a set of potential predictors by using cross-correlation analysis, 2) defining the forecast model and estimating the model parameters, and 3) assessing the forecast efficiency of the model. In the first step, a cross-correlation field map representing the cross-correlation coefficients between the JASO flows and the average SSTs for a previous 4-month period (e.g., the average SST for JASO of the previous year) is developed to ascertain the regions in the oceans where the previous year JASO SSTs are significantly correlated with this year's JASO flows. This procedure is repeated for various forecast lead times such as 4, 8, 12, and 16 months. Thus, for a given forecast lead time, for example, 8 months, the pool of potential predictors may arise from SST data lagging 8, 12, 16, etc., months. In addition, the pool of potential predictors included the previous year ASON Guinea precipitation. Further explanation of this procedure is given in section 4a. In the second step, multiple linear regression and principal component analysis are utilized for developing the forecast models. Details are presented in section 3a below. The third step, assessing the forecast performance, involves a number of statistical measures, as shown in section 3b.

### Prediction models

The multiple linear regression model may be expressed as

$$
Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p , \tag{1}
$$

where *Y* is the predictand (Blue Nile River flows), the *b*_{i} are the regression coefficients that are estimated using the observed data, and the *x*_{i} are the regressors representing the predictors. The regression coefficients are determined by using the method of least squares. If a given coefficient was not significant at the 5% level, the corresponding predictor was removed from the equation. One problem that may occur when using multiple linear regression is multicollinearity of the predictor variables, which is common with hydrometeorological variables. Choosing the predictors carefully and checking for independence of the predictors could minimize this problem. Alternatively, one could use PCA to determine orthogonal variables, which reduces the dimensionality of the predictors, and retain the time series of the principal components (PCs). Then a regression model is constructed based on the PCs.

Principal component analysis is a widely used technique in meteorology and climatology. One objective of PCA is to find a small number of linear functions of the original variables that successively account for the maximum amount of variation in those variables. With PCA it is possible to reduce the size of large datasets while minimizing any loss of information. Another objective is to remove the multicollinearity that may be present in the independent variables. The principal component PC_{k} may be expressed as

$$
\mathrm{PC}_k = c_{k1} x_1 + c_{k2} x_2 + \cdots + c_{km} x_m , \tag{2}
$$

where *c*_{k1}, *c*_{k2}, … , *c*_{km} are coefficients (loadings) of the *m* variables for the *k*th PC. The first principal component PC_{1} is a linear function that accounts for the maximum possible variance. Then, PC_{2} is a linear function with a maximum variance but is not correlated with PC_{1}; the same applies to any PC_{k}. The first eigenvector is the set of coefficients contained in PC_{1}. The eigenvalue corresponds to the variance of each PC_{k} and is, therefore, a measure of its importance in explaining variation. Garen (1992) included the PCs in sequence, while others (e.g., Marsden and Davis 1968; Wortman 1989) allow models to skip PCs. The latter has been adopted in this study. As will be discussed below, the first 9 or 10 PCs (depending on the case) will be retained, and up to 5 PCs with significant regression coefficients will be included in the MLR models. Computations were done using the CLIMLAB2000, version 1.0, model developed by Tanco and Berri (1999).

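We retain only those PC_{k} that make a meaningful contribution to the total variance and give significant regression coefficients. The regression model, in terms of the principal components, is

$$
Y = a_i\,\mathrm{PC}_i + a_j\,\mathrm{PC}_j + a_l\,\mathrm{PC}_l + a_m\,\mathrm{PC}_m + a_n\,\mathrm{PC}_n , \tag{3}
$$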

where *i*, *j*, *l*, *m*, and *n* are the retained components (*i* ≠ *j* ≠ *l* ≠ *m* ≠ *n*) with significant regression coefficients *a*_{i}, *a*_{j}, *a*_{l}, *a*_{m}, and *a*_{n}, respectively. To limit artificial skill, the number of predictors is restricted to four for the MLR model of Eq. (1) based on the original variables, while there can be up to five for the PC model of Eq. (3). This is done in order to prevent redundancy and instability of the multiple regression equation and to ensure good results when applying it to new data (Bhalme et al. 1986). A screening process is conducted in order to select from a large set of possible predictors only those that contribute significantly and independently to forecasting the river flows.
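The principal component regression outlined above can be sketched as follows. This is a minimal NumPy illustration rather than the CLIMLAB2000 computation: the function names are hypothetical, the PCs are taken from the correlation matrix of the standardized predictors, and the significance screening of the coefficients is left to the caller through the `keep` argument. No intercept is fitted because both the standardized predictand and the PC scores have zero mean.

```python
import numpy as np


def standardize(a):
    """Subtract the mean and divide by the sample standard deviation."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def principal_components(X):
    """PCA of the predictors X with shape (n_years, m).

    Returns (eigenvalues, loadings, scores), sorted by decreasing
    eigenvalue. Column k of `loadings` holds the coefficients of the
    (k+1)th PC; column k of `scores` is its time series.
    """
    Z = standardize(X)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores


def fit_pc_regression(y, scores, keep):
    """Least-squares fit of the standardized predictand on chosen PCs.

    keep : list of PC column indices; they need not be consecutive,
           so PCs can be skipped.
    """
    A = scores[:, keep]
    coef, *_ = np.linalg.lstsq(A, standardize(y), rcond=None)
    return coef
```

Because the PC scores are mutually uncorrelated, dropping a PC from `keep` leaves the fitted coefficients of the remaining PCs unchanged, which is what makes skipping nonsignificant PCs straightforward.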

### Assessment of forecast performance

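A number of statistics are used to measure the forecast performance. They include the correlation coefficient *R* between the forecast *Ŷ*_{i} and the observed value *Y*_{i}, the root-mean-square error (rmse), the mean absolute error (MAE), and the bias. These four statistics may be determined as

$$
\begin{aligned}
R &= \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(\hat{Y}_i - \bar{\hat{Y}})}{\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \, \sum_{i=1}^{N} (\hat{Y}_i - \bar{\hat{Y}})^2}}, &
\mathrm{rmse} &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2}, \\
\mathrm{MAE} &= \frac{1}{N} \sum_{i=1}^{N} \left| \hat{Y}_i - Y_i \right|, &
\mathrm{bias} &= \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i),
\end{aligned} \tag{4}
$$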

in which $\bar{Y}$ is the mean of the data *Y*_{1}, … , *Y*_{N}; $\bar{\hat{Y}}$ is the mean of the predicted values *Ŷ*_{1}, … , *Ŷ*_{N}; and *N* is the sample size. For a given regression model, the coefficient of multiple determination *R*^{2} is bounded between 0 and 1, with values closer to 1 denoting a better-fitting model. However, increasing the number of regressor variables in the model, that is, increasing the number of parameters, can only increase the value of *R*^{2} or leave it unchanged (Garnett et al. 1998). For this reason, a modified measure called the adjusted coefficient of multiple determination for a *p*-order model (a model with *p* regressor variables) is defined as

$$
R_a^2 = 1 - \frac{N - 1}{N - p - 1}\,\left(1 - R^2\right). \tag{5}
$$

Similarly, the adjusted root-mean-square error is defined by

$$
\mathrm{rmse}_a = \sqrt{\frac{1}{N - p - 1} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2}. \tag{6}
$$

The adjusted *R*^{2} takes into account the degrees of freedom of the model and introduces a slight penalty for each regressor variable in the model. It may decrease in value if too many regressor variables are included in the model. The adjusted *R*^{2} is the statistic that is used in this paper for determining the degree of explained variance in a model. Thus, discussions and analysis throughout this paper will always refer to the adjusted coefficients and errors. In all of the analysis that follows, the predictors and predictand are standardized, respectively, by subtracting the mean and dividing by the standard deviation. Thus, each standardized series will have zero mean and unit standard deviation; that is, fluctuations in the series are nondimensional and they can be compared more easily among each other. One advantage of standardizing the data is to prevent principal component analysis from giving more weight to one variable over others.
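The performance measures discussed above can be computed with a short routine such as the one below. It is a sketch under stated assumptions: the function name `forecast_scores` is hypothetical, the bias is taken as forecast minus observed, and the adjusted statistics use *N* − *p* − 1 degrees of freedom, with *p* the number of regressor variables. Other conventions for these adjustments exist.

```python
import numpy as np


def forecast_scores(y_obs, y_hat, p):
    """Forecast performance statistics for paired observed/forecast values.

    y_obs, y_hat : 1-D sequences of equal length N
    p            : number of regressor variables in the model
                   (assumed degrees-of-freedom convention: N - p - 1)
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y_obs.size
    err = y_hat - y_obs                      # forecast minus observed (assumed sign)
    r = np.corrcoef(y_obs, y_hat)[0, 1]
    return {
        "R": r,
        "rmse": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "bias": np.mean(err),
        # Penalize each added regressor through the degrees of freedom.
        "R2_adj": 1.0 - (1.0 - r ** 2) * (n - 1) / (n - p - 1),
        "rmse_adj": np.sqrt(np.sum(err ** 2) / (n - p - 1)),
    }
```

For a fixed *R*^{2}, raising `p` lowers `R2_adj` and raises `rmse_adj`, which is the penalty on additional regressors that the text describes.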

In the construction of any forecast model, it is essential that the model be developed on a subset of the data and then be independently tested on a different subset of the data not used in the model calibration. This procedure is known as split-sample testing. It has been widely used for model validation with conceptual and statistical hydrologic models. Wortman (1989) listed some shortcomings of this scheme, including withholding critical information from the calibration step and providing unstable error statistics that are highly dependent on the subset size and information content. Thus, a variation on the “jackknife” procedure was recommended as a reasonable and robust approach for model validation. The validation of a forecast model may be accomplished by means of a drop-one jackknife procedure, which assures that the predicted value for any year is independent of the predicted values for other years (Gray et al. 1993). In this case each model is formulated based on *N* − 1 years and tested on the remaining year not used in the construction of the model. In this scheme a minimum amount of information is withheld from each fitted model. Given the dataset *Y*_{t}, *t* = 1, … , *N*, a particular value *Y*_{i} is removed from the set and a model is fitted based on the remaining *N* − 1 values. Then, the model is utilized to forecast the *i*th value; that is, *Ŷ*_{i} is obtained. This procedure is repeated *N* times and the forecast dataset *Ŷ*_{1}, … , *Ŷ*_{N} is determined. The two sets *Y*_{t} and *Ŷ*_{t}, *t* = 1, … , *N*, are then used in Eqs. (4)–(6) to assess the forecast model performance. The forecast equations are constructed based on different methods, and forecasts are made for different lead times.
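The drop-one jackknife described above can be sketched in a few lines. The function name `jackknife_forecasts` is hypothetical, and ordinary least squares with an intercept is assumed as the forecast model; any other fitting step could be substituted inside the loop.

```python
import numpy as np


def jackknife_forecasts(X, y):
    """Drop-one jackknife forecasts from a linear regression.

    X : predictors, shape (n_years, n_predictors)
    y : predictand, shape (n_years,)
    For each year i, the model is fitted on the other N - 1 years and
    then used to forecast year i, so no forecast sees its own target.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    y_hat = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i                      # withhold year i
        A = np.column_stack([np.ones(n - 1), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        y_hat[i] = coef[0] + X[i] @ coef[1:]
    return y_hat
```

The resulting pairs of observed and jackknife-forecast values are what the validation statistics are computed from. A stricter variant would also recompute any standardization or PCA inside the loop, using only the *N* − 1 retained years.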

## Results

First we examine the predictability of the Blue Nile River flows using the SST data for all of the oceans (COADS dataset) and the Guinea precipitation data for the period 1950–89. The results are described in sections 4a and 4b below. Then, section 4c describes the results of the predictability using SST data for the Pacific Ocean only (BADC data) for a longer period, 1913–89.

### Lead–lag SST–streamflow relationship

The Blue Nile River monthly flows at Eldeim gauging station, for the period 1913–89, and the SST monthly data (COADS), for the period 1950–97, were utilized in this study for identifying climatic predictors that may be useful for long-term flow forecasting of the Blue Nile River. Therefore, 40 yr of concurrent flow and SST data were available for the cross-correlation analysis. The correlation coefficients between the JASO river flows and the 4-month-averaged SST data for the periods, NDJF, MAMJ, and JASO were calculated for each region of the oceans and for different lead–lags. NDJF is dated by the year in which January occurs. In addition, we refer to years leading, concomitant, and lagging the central year as year (−1), year (0), and year (+1), respectively. The seasons for each year are defined in a similar manner. For example, SST for NDJF (0) and NDJF (−1) are average SSTs for the season NDJF for the current year and the previous year, respectively. In this study we use the leading indicators, such as SST, to build the forecasting models. Assuming that the data are Gaussian distributed, the computed correlations are considered to be significant if they exceed 0.32 (at the 5% significance level). The normality assumption has been checked for all of the variables used in the multiple regression model.
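The construction of a lead–lag correlation field can be illustrated with the sketch below. The function names and the `(n_years, n_lat, n_lon)` array layout are assumptions made here for illustration; the default threshold of 0.32 is the 5% significance value quoted above for about 40 yr of concurrent data.

```python
import numpy as np


def correlation_field(flow, sst, lag_years=0):
    """Correlate a yearly flow series with every cell of a seasonal SST field.

    flow      : shape (n_years,), e.g. JASO flows
    sst       : shape (n_years, n_lat, n_lon), seasonal-mean SST dated
                by the same years as `flow`
    lag_years : pairs the flow of year t with the SST of year t - lag_years
    Returns an (n_lat, n_lon) array of correlation coefficients; cells
    with constant SST yield NaN.
    """
    flow = np.asarray(flow, dtype=float)
    sst = np.asarray(sst, dtype=float)
    if lag_years > 0:
        flow, sst = flow[lag_years:], sst[:-lag_years]
    f = flow - flow.mean()
    s = sst - sst.mean(axis=0)
    cov = np.tensordot(f, s, axes=(0, 0))              # sum over years
    denom = np.sqrt((f ** 2).sum() * (s ** 2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        return cov / denom


def significant_mask(r, threshold=0.32):
    """Flag grid cells whose |r| exceeds the significance threshold."""
    return np.abs(r) > threshold
```

Calling `correlation_field` once per season and lag produces the sequence of maps from which contiguous significant regions can be delineated.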

Figure 2 shows the variation of the correlation coefficients between the averaged Blue Nile River flows during July–October of a given year, denoted as JASO (0), and the SSTs for MAMJ (−3), MAMJ (0), JASO (0), and NDJF (+1). The maps of correlations were computed for successive seasons beginning at MAMJ (−3) and continuing through NDJF (+2). However, for brevity, only the longest lead time, shortest lead time, contemporaneous, and following season are shown, respectively, in Figs. 2a–d. The long lead–lags may help track the effect of the evolution of large-scale climatic variables, such as ENSO. It is clear from the correlation maps that, as the lead time becomes shorter, both the correlation magnitudes and the extent of the correlated ocean regions increase. The highest positive correlations (of the order of 0.6) have been observed for NDJF (0) (not shown) in the tropical North Atlantic, and for NDJF (+1) around the midlatitudes in the East Pacific and in the North Atlantic, as shown in Fig. 2d. The highest negative correlations (of about −0.60) are observed for MAMJ (0), JASO (0), and NDJF (+1) around the equatorial Pacific, as shown in Figs. 2b–d, and in some spots of the Indian Ocean, as shown in Figs. 2c–d. The North Atlantic (north of 40°N) and the midlatitudes of the North and South Pacific consistently maintain a positive correlation with the Blue Nile flows in all seasons, while the South Atlantic is generally negatively correlated.

Figure 2 also shows that the Indian Ocean's SSTs and the Blue Nile River flows are generally negatively correlated; although, in some seasons, certain areas of the Indian Ocean (e.g., the Arabian Sea and the sea north of Australia in Fig. 2b) are positively correlated. Quin (1992) explained that in the summer a large low pressure system over India and the Arabian Sea dominates the airflow and brings strong, persistent southwesterlies to the southern part of the region. He further indicated that, at the same time, the intertropical convergence zone (ITCZ) lies just to the north of Eritrea so that the resulting convergence may bring significant precipitation over the headwaters of the Blue Nile basin. It may also influence the direction and the amount of moisture fluxes from the Atlantic Ocean and the Congo basin. In addition, warm SSTs in northern Australia and Indonesia are generally associated with high Blue Nile River flows, while low-river flows are usually accompanied and preceded by cold SSTs. The northern Australia–Indonesia region is clearly an important component in the large-scale interaction between the SST and the Blue Nile River hydrology. Also, significant negative correlations were found in the southwestern Indian Ocean in the regions east and/or south of Mauritius in the seasons MAMJ (−3), NDJF (−2), MAMJ (−2), JASO (−2), MAMJ (−1), and JASO (−1). Generally, negative correlations dominate the Southern Hemisphere SST while positive correlations prevail in the Northern Hemisphere. However, remarkable negative and positive correlations can also be found in the Northern and Southern Hemispheres, respectively.

Based on a composite scenario derived from six Pacific warm SST events (El Niño) that occurred during 1950–76, Rasmusson and Carpenter (1982) defined five phases for a warm event: antecedent [June–October of year (−1)], onset [November of year (−1) to January of year (0)], peak [March–May of year (0)], transition [June–October of year (0)], and mature [November of year (0) to February of year (+1)]. They named March–May of year (0) as the peak phase because the South American western coast reached its peak warming during that period for the pre-1977 events (Wang 1995). The definition of seasons in our study is almost identical to the foregoing definition of the phase duration by Rasmusson and Carpenter, which helps to compare the regions and the seasons. For example, negative correlations emerge in the southern part of the eastern Pacific Ocean centered around the coordinates 35°S, 100°W in MAMJ (−1). In JASO (−1), the area expands north to 20°S, west to 140°W, and east to the South American coast, and in NDJF (0) it further expands along the South American coast. The signal moves northwestward to the tropical eastern Pacific Ocean in MAMJ (0), intensifying and spreading westward in JASO (0) and NDJF (+1) until it covers a vast area of the central and eastern tropical Pacific (the whole ENSO region), as shown, respectively, in Figs. 2c and 2d. Rasmusson and Carpenter (1982) presented evidence of a westward migration of SST anomalies that appears along the Ecuadorian and Peruvian coast. Significant correlation regions between the SSTs and the Blue Nile flows are consistent with the ENSO development, which suggests that the evolution of the ENSO signal in the Pacific Ocean is reflected in the evolution of the cross-correlation field between the SST and the Blue Nile River flows.

In NDJF (0) a dipole is established over the Indo Pacific (Indonesian region) and eastern North Pacific. Then, in MAMJ (0) the sign of correlations changes in the eastern North Pacific from negative to positive. Also MAMJ (0) is the season when the central/western North Pacific and central/western South Pacific have high positive correlations. As expected, the correlations are stronger (negative) over the central and eastern Pacific basin during JASO (0) and NDJF (+1) as shown in Figs. 2c and 2d. The two seasons lie within the definition of the transition and mature phases of ENSO, respectively. The correlation for both seasons reaches −0.6 in the region (5°N–10°S, 75°W–165°E). This is in agreement with the previous studies of Eltahir (1996) and Amarasekera et al. (1997), which were carried out on a monthly basis. During these seasons the negative correlation field region reduces from the east-equatorial Pacific along the South American Pacific coast to the west-equatorial Pacific near Indonesia. This area includes Niño-12 (0°–10°S, 90°–80°W), Niño-3 (5°N–5°S, 150°–90°W), Niño-34 (5°N–5°S, 120°–170°W), and Niño-4 (5°N–5°S, 160°E–150°W). Correlations remain strongly negative over most of the oceans at the Tropics during NDJF (+1) and are strongly positive in the subtropics. Also in the periods MAMJ (−2) and NDJF (0), a dipole is established over the subtropical North Atlantic and tropical North Atlantic resembling the North Atlantic Oscillation.

The main conclusion that can be drawn from the foregoing correlation analysis is the existence of a number of potential predictors for long-range forecasting of the Blue Nile River flows based on SSTs. Our analysis indicates that significant lead–lag correlations exist between the oceanic surface temperatures and the Blue Nile flows. Significant correlations are found for various lag times and various regions of the Pacific, Atlantic, and Indian Oceans. These correlations may be useful for long-range prediction of the Blue Nile River streamflows for lead times spanning several seasons. This is consistent with results found by Wang and Eltahir (1999), who showed significant relationships between ENSO and the Nile River flows at Aswan for lead times of a few months to about 5 yr. Then, based on these correlation maps, key regions of SSTs with the largest correlations are identified, and averaged seasonal SSTs over each of these regions are extracted for all seasons.

### Multiple linear regression analysis

The main statistical tool used here is multiple linear regression. Linear regression has been a useful approach for long-range forecasting of large-scale events, such as monsoon rainfall and ENSO (Parthasarathy et al. 1988). Although atmospheric processes are basically nonlinear, linear statistical forecast models can contribute toward understanding the mechanisms associated with large-scale teleconnection patterns of the general atmospheric and ocean circulation (Kung and Sharif 1980). In our study, the relevant predictors are identified and multiple linear regression is employed to relate these predictors to the Blue Nile JASO flows. Our purpose is to find the smallest set of predictor variables that can perform efficiently in predicting the value of the dependent variable, that is, future flows of the Blue Nile River.

From the correlation maps such as those shown in Fig. 2, regions (areas) of SST data that significantly correlate with the JASO (0) Blue Nile River flows were identified. Then, the average SST series for each region was obtained from all of the gridded SST series contained in the region. In this manner, 84 regions with SST series that significantly correlate with the river flows were initially identified. Subsequently, from these 84 regions, 21 regions having serially independent SST series were selected for developing the multiple linear regression forecast models. In addition to the SST series, the previous year's August–November, ASON (−1), Guinea precipitation was also used as a predictor because it is significantly (positively) correlated with the Blue Nile River flows (*r* = 0.63 for the period 1953–89). This association appears to be stronger than that of any of the individual SST predictors, which suggests that Guinea precipitation could be a useful predictor for the Nile River flows with 11 months of lead time. We used the Guinea rainfall for August–November of the previous year because the studies carried out by Gray et al. (1992) showed that these data were an important predictor for forecasting hurricane activity in the Atlantic. The positive association between the ASON (−1) rainfall along the Gulf of Guinea and the current year's JASO flow of the Blue Nile River may be due to precipitation recycling in that part of the African continent. Large amounts of rainfall on the Gulf of Guinea during ASON may provide a significant source of moisture to the atmosphere (through soil moisture and evapotranspiration) that, in turn, may lead to abundant rainfall on the Ethiopian Plateau in the following year. Brubaker et al. (1993) estimated precipitation recycling on several parts of the globe, including the African region south of the Sahara desert and the Gulf of Guinea. Also, Gray et al. (1992) argue that abundant ASON precipitation at Guinea may lead to more rain in the Sahel during the following year, and conversely, a dry ASON period may contribute to drought in the Sahel several months later.
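The region-averaging and lead-lag correlation steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure actually coded for the study: the function names `region_average` and `lagged_correlation`, the rectangular index-based region, and the use of whole-year lags (rather than seasonal leads) are all assumptions made for the example.

```python
import numpy as np

def region_average(sst, lat_slice, lon_slice):
    """Average a (years, lat, lon) SST array over one rectangular region."""
    box = sst[:, lat_slice, lon_slice]
    # nanmean skips grid cells with missing observations
    return np.nanmean(box.reshape(box.shape[0], -1), axis=1)

def lagged_correlation(predictor, flow, lag_years=0):
    """Pearson r between the predictor in year t - lag_years and the flow in year t."""
    if lag_years > 0:
        predictor, flow = predictor[:-lag_years], flow[lag_years:]
    return np.corrcoef(predictor, flow)[0, 1]

# Synthetic demonstration data (hypothetical, not the COADS or Blue Nile records)
rng = np.random.default_rng(0)
years, nlat, nlon = 40, 6, 8
signal = rng.standard_normal(years)
sst = 0.5 * rng.standard_normal((years, nlat, nlon))
sst[:, 1:3, 2:5] += signal[:, None, None]        # one region carries the signal
flow = np.empty(years)
flow[0] = 0.0
flow[1:] = 0.8 * signal[:-1] + 0.3 * rng.standard_normal(years - 1)

series = region_average(sst, slice(1, 3), slice(2, 5))
r_lag0 = lagged_correlation(series, flow, lag_years=0)
r_lag1 = lagged_correlation(series, flow, lag_years=1)
```

Because the synthetic flow is built from the previous year's signal, `r_lag1` is large while `r_lag0` is close to zero, which mimics the kind of lagged association reported for the ASON (−1) Guinea precipitation.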

Figure 3 plots the standardized JASO Blue Nile flows together with the standardized precipitation in Guinea for the previous year (August–November). It shows that in most years the river flow and the precipitation are positively related (24 times out of 37). Table 1 summarizes the results obtained, including the locations of the 22 predictors (21 SST regions plus Guinea rainfall) and their respective correlation coefficients with the JASO Blue Nile River flows for various leading seasons. In addition, Fig. 4 displays the standardized anomalies of SSTs in 8 of the 21 regions included in Table 1, plus those of the Guinea rainfall and the Blue Nile River flows. A careful examination of the figures reveals the cross association between the Blue Nile River flows and the predictors (e.g., SST for region 4 with a +0.54 correlation and SST for region 16 with a −0.42 correlation).

Many combinations of the referred 22 predictors were used to forecast the Blue Nile River flows for different lead times. Table 2 shows five multiple linear regression forecast models (A models) for the JASO Blue Nile flows as a function of SST and Guinea precipitation. The subscripts stand for the region number, as shown in Table 1. The performance measures (refer to section 3b) shown in Table 2 help in judging the performance of the various models. For example, for the 4-month lead time (model A1) the adjusted *R*^{2} for validation reaches 64%, while it becomes 48% for the 16-month lead time forecast (model A5). As expected, the longer the lead time, the lower the forecast performance. In addition, the year-to-year variability of the forecasted flows (not shown) resembles that of the observed data reasonably well. In general, the forecast results for the five models indicate the potential predictability of the Blue Nile River flows based on SST at various locations in the oceans and Guinea precipitation, for different forecast lead times.

Multiple linear regression requires that the predictors are not correlated. To verify this requirement for each model, we regressed every predictor on the remaining predictors using multiple linear regression and computed the adjusted *R*^{2} coefficients. The percentage of variance of any given predictor that is explained by the other predictors ranges between 0% and 15%. Thus, one may assume that the predictors are independent and, consequently, there is no change in the number of degrees of freedom required for testing the significance of the correlations between the datasets. Berri and Flamenco (1999) followed a similar procedure. In addition, the hypotheses of normality for the original variables used in the models (Table 2) were not rejected. Furthermore, the signs of the regression coefficients agree with the signs of the correlation coefficients between the predictand and predictor, respectively.
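The independence check described above (regressing each predictor on all of the others and inspecting the adjusted *R*^{2}) can be sketched as follows. The helper names `adjusted_r2` and `predictor_dependence` and the synthetic predictors are hypothetical; the adjusted *R*^{2} formula used is the standard one and is assumed to match the definition in section 3b.

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R^2 of an ordinary least squares fit of y on X (intercept added)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def predictor_dependence(X):
    """Adjusted R^2 of each column of X regressed on all the other columns."""
    k = X.shape[1]
    return np.array([adjusted_r2(X[:, j], np.delete(X, j, axis=1)) for j in range(k)])

# Synthetic predictors (hypothetical): one independent set, one with a near-duplicate
rng = np.random.default_rng(1)
X_indep = rng.standard_normal((37, 4))
X_dep = X_indep.copy()
X_dep[:, 3] = X_dep[:, 0] + 0.1 * rng.standard_normal(37)

dep_indep = predictor_dependence(X_indep)   # small values: predictors nearly independent
dep_dep = predictor_dependence(X_dep)       # columns 0 and 3 are flagged as dependent
```

Values near zero, such as the 0%–15% range quoted above, indicate that little of a predictor's variance is shared with the rest of the set; values near one would signal collinearity.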

### Hybrid PC multiple linear regression model

One of the purposes of PC regression is to make sure that any intercorrelation among the predictor variables is removed. High correlation among the independent variables may result in computational errors or a singular matrix if regression on these variables is attempted (Marsden and Davis 1968). Because some of the predictors may be interrelated, we carried out principal component analysis on these predictors.

The nine variables used as predictors in the regression models included in Table 2 [models as in Eq. (1)] were in turn used to develop forecast models based on PCs as in Eq. (3). Because not all of the nine predictors can be utilized for all of the lead times, the number of PCs that can be determined varies with the leading season; that is, we determined nine PCs based on the nine predictors for the leading season MAMJ (0), eight for NDJF (0), seven for ASON (−1), six for JASO (−1), and five for MAMJ (−1). Figure 5 shows the percent variance explained by each of the PCs for the five cases analyzed. Most of the explained variance is contained in the first five components. The figure also illustrates the declining variance explained by successive PCs and supports the retention of the first five PCs. Thus, the forecast models (B models) for each lead season were determined based on five PCs and by retaining in the forecast model only those terms with significant coefficients. Table 3 gives the regression equations obtained in terms of PCs. The table includes fitting and validation statistics for B models for five leading seasons. Figure 6 shows a comparison of the validation adjusted *R*^{2}s obtained from the A and B forecast models. Clearly the performance of the forecasting models based on PCs (B models) is superior to that of the regression models based on the original variables (A models). This may be attributed to the higher amount of information that PCA extracts as compared to the original variable models. Figure 6 also shows that for both models the adjusted *R*^{2} decreases as the forecast lead time increases, which is expected. Also, it appears that the difference between the adjusted *R*^{2}s of models A and B narrows as the forecast lead time increases.
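A principal component regression of this kind can be sketched as below. This is an illustration under stated assumptions, not a reproduction of Eq. (3): the PCs are taken from the correlation matrix of the standardized predictors, the first five are retained, and no significance screening of the coefficients is performed. The function names `pc_scores` and `fit_ols` and the synthetic data are hypothetical.

```python
import numpy as np

def pc_scores(X, n_components):
    """Standardize X and project it on the leading eigenvectors of its correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigval, eigvec = np.linalg.eigh(corr)          # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]               # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()              # fraction of variance per PC
    return Z @ eigvec[:, :n_components], explained

def fit_ols(scores, y):
    """Least squares coefficients (intercept first) of y on the PC scores."""
    A = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Synthetic data (hypothetical): nine intercorrelated predictors, 37 years
rng = np.random.default_rng(3)
common = rng.standard_normal((37, 1))
X = 0.7 * common + rng.standard_normal((37, 9))
y = X[:, :3].sum(axis=1) + 0.5 * rng.standard_normal(37)

scores, explained = pc_scores(X, n_components=5)
beta = fit_ols(scores, y)
y_hat = np.column_stack([np.ones(37), scores]) @ beta
```

The PC scores are mutually uncorrelated by construction, which is the property that removes the intercorrelation among predictors before the regression step.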

To summarize, the overall measures adopted for assessment of the accuracy of the forecasts are the adjusted *R*^{2}, the MAE, the adjusted rmse, and the bias. The values of these performance measures are shown for A and B forecasting models in Tables 2 and 3, respectively. The adjusted *R*^{2}s for validation vary in the range 64%–48% for forecast lead times of 4–16 months, respectively, for model A, and 84%–59% for model B. These results also reflect the relatively small values of the rmse (refer to the adjusted rmse in Tables 2 and 3). Because the data are standardized, the square of the rmse represents the ratio of the error variance to the signal variance. In Table 3 these ratios vary in the range 18%–45%, respectively. In addition, the MAE values vary from about 30% to just under 50% of the standard deviation of the streamflow, and the absolute values of the bias are in the range 0.00–0.01, with most of the values being equal to zero.
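The four measures can be computed as in the following sketch. The exact definitions are given in section 3b and are not repeated here, so the formulas below (degrees-of-freedom adjustment with `n - n_params - 1`, bias as the mean of forecast minus observation) are assumptions and may differ in detail from those used for Tables 2 and 3. The function name `forecast_measures` and the synthetic series are hypothetical.

```python
import numpy as np

def forecast_measures(obs, pred, n_params):
    """Adjusted R^2, MAE, adjusted rmse, and bias of a forecast (assumed definitions)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    n = obs.size
    err = pred - obs
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    dof = n - n_params - 1                      # residual degrees of freedom
    return {
        "adj_r2": 1.0 - (ss_res / dof) / (ss_tot / (n - 1)),
        "mae": np.mean(np.abs(err)),
        "adj_rmse": np.sqrt(ss_res / dof),
        "bias": np.mean(err),
    }

# Synthetic standardized flows and forecasts (hypothetical)
rng = np.random.default_rng(2)
raw = rng.standard_normal(37)
obs = (raw - raw.mean()) / raw.std(ddof=1)      # unit sample variance
pred = obs + 0.4 * rng.standard_normal(37)
m = forecast_measures(obs, pred, n_params=3)
```

Under these assumed definitions and with a standardized predictand, the squared adjusted rmse equals one minus the adjusted *R*^{2}, which is one way to see why the squared rmse can be read as the ratio of error variance to signal variance.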

### Forecast assessment based on longer SST dataset

An additional assessment of the predictability of the Blue Nile River flows was carried out using a longer dataset. For this purpose we used the BADC SST dataset and the Blue Nile River flows for the period 1913–89 as described in section 4a. In this case, we focused on the SST data of the Pacific Ocean only. We followed a procedure similar to that described in the previous sections, that is, locating regions in the Pacific with significant correlations with the Blue Nile River flows for a specified lead time, using PCA and multiple linear regression forecast models (except that in this case we used the stepwise multiple linear regression approach), and assessing the forecast performance using the drop-one jackknife validation technique, in addition to comparing the fitted flows with the observed flows.
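The drop-one jackknife validation mentioned above can be sketched as follows: each year is withheld in turn, the regression is refitted on the remaining years, and the withheld year is forecast. The sketch assumes a fixed set of predictors; it does not repeat the stepwise selection or the PCA inside each refit, and whether the study did so is not stated here. The names `jackknife_predictions` and `r2` and the synthetic data are hypothetical.

```python
import numpy as np

def jackknife_predictions(X, y):
    """Drop-one forecasts: refit the regression without year i, then predict year i."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ beta
    return preds

def r2(obs, pred):
    """Fraction of the observed variance explained by the forecasts."""
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Synthetic predictors and flows (hypothetical)
rng = np.random.default_rng(4)
X = rng.standard_normal((40, 3))
y = X @ np.array([0.6, -0.4, 0.3]) + 0.5 * rng.standard_normal(40)

A_full = np.column_stack([np.ones(40), X])
beta_full, *_ = np.linalg.lstsq(A_full, y, rcond=None)
r2_fit = r2(y, A_full @ beta_full)              # fitting statistic
r2_val = r2(y, jackknife_predictions(X, y))     # validation statistic
```

For ordinary least squares the drop-one errors are never smaller in magnitude than the fitting residuals, so the validation value does not exceed the fitting value, in line with the gap between fitting and validation *R*^{2}s reported in the tables and figures.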

We examined the forecast performance (fitting and validation) statistics for lead times varying between one and eight seasons and for the following cases: (a) the periods 1913–49 and 1950–89, and (b) the entire period 1913–89. In the first case, we split the record into the periods 1913–49 and 1950–89 for comparison and because the forecasting analysis described in sections 4b and 4c was based on the COADS SST dataset, which is available only for the period after 1949. The results obtained in terms of the adjusted *R*^{2} are summarized in Fig. 7 for both fitting and validation. Figure 7a shows that the fitting *R*^{2}s for the periods 1913–49 and 1950–89 are essentially the same, but there is a difference of about 10% in the validation *R*^{2}s between the two periods. Figure 7b gives the fitting and validation *R*^{2}s for the entire period. Both Figs. 7a and 7b show consistent results and the tendency for the *R*^{2}s to decrease as the lead time increases, as is expected. Figures 8 and 9 show the comparison between the observed and forecasted river flows for one and eight seasons of lead time, which confirms that the SSTs of the Pacific can be quite useful for the long-term predictability of the Blue Nile River flows. One must also note that the *R*^{2}s obtained using the SSTs of the Pacific Ocean only are somewhat smaller than those based on models with predictors that include Guinea rainfall. For example, for a one-season lead time one gets a validation *R*^{2} of 84% (Table 3) using SSTs and Guinea rainfall as predictors, as compared with 72% when using only SSTs. In addition, a similar prediction analysis carried out using SST data for all oceans shows that no significant gain in the predictability of the Blue Nile River flows is attained by using SST data for all oceans as compared to using SSTs of the Pacific Ocean only.

## Final remarks and conclusions

Correlation analysis between historical gridded seasonal SST data from the oceans, Guinea precipitation, and Blue Nile River flows at Eldeim is utilized in an effort to establish a framework for quantitative long-term streamflow prediction. In addition, statistical models based on multiple linear regression and principal component analysis are developed to forecast the Blue Nile flows using SSTs in the oceans and the previous year of Guinea precipitation as predictors.

The systematic analysis of the lead–lag correlations between the Blue Nile River flows and SSTs (COADS dataset) for the period 1950–89 led to the identification of a number of regions in the oceans that are significantly correlated and, consequently, may contribute to the long-range predictability of the Blue Nile River flows. The correlations are the highest for the concurrent (lag 0) and succeeding (lag 1) data, which suggests that forecasting the evolution of the SSTs should help in predicting the Nile River flows. In addition, the significant correlation regions between SST in the Pacific and Blue Nile flows evolve through time in a manner that is consistent with the ENSO development. This suggests that the evolution of the ENSO signal in the Pacific Ocean is reflected in the evolution of the cross-correlation field between the SST and the Blue Nile River flows.

Furthermore, the Blue Nile River flows during JASO (0) are significantly and positively correlated with ASON (−1) precipitation over Guinea (*r* = 0.63 for the period 1953–89). This association is the strongest of any of the individual potential predictors (SSTs in the oceans) investigated herein and indicates that the Guinea precipitation is a potential predictor of the Nile River flows, with 11 months of lead time. Seleshi (1991) suggested that the flow of moist air from high pressure systems over the Gulf of Guinea toward the low pressure center in Arabia is an important source of the Ethiopian rainfall. Also, Gray et al. (1993) and Elsner and Schmertmann (1993) reported the strong association between hurricanes over the tropical Atlantic basin and the previous year's rainfall along the Gulf of Guinea. Gray et al. (1992) attributed this to feedback on the monsoon circulation from one year to the next. This may be the same cause for the strong relationship between the Blue Nile River flows and the previous year of Guinea precipitation and, thus, the referred feedback mechanism may contribute to the moisture of the Blue Nile basin.

The models based on PCA (referred to as B models) showed significant improvement in forecasting accuracy over multiple linear regression models developed in terms of the original variables (referred to as A models). The predictability is shown to be the highest in the preceding season and decreases as the lead time increases. The (adjusted) *R*^{2}s for validation based on B models vary in the range 84%–59% for forecast lead times of 4–16 months, respectively. These results also reflect the relatively small values of the (adjusted) rmse (because the data are standardized, the square of the rmse represents the ratio of the error variance to the signal variance). The referred variance ratios vary in the range 18%–45%, respectively. In addition, the absolute values of the bias are in the range 0.00–0.01, with most of the values being equal to zero. The conclusion is that long-range forecasting of the Blue Nile River flows with lead times over 1 yr is possible with a high degree of explained variance by using SST in a few regions in the oceans and the previous year of Guinea precipitation.

In addition, further analysis using a longer SST record for the Pacific (the BADC dataset) and the Blue Nile River flows for the period 1913–89 confirms the long-term predictability of the river flows. SSTs in some areas of the Pacific exert significant influence on the variability of the Blue Nile River flows. In fact, the predictability of the referred river flows appears to be more affected by the variability of the SSTs in the Pacific than by those in the other oceans.

## Acknowledgments

This paper forms part of the first author's Ph.D. dissertation at Colorado State University. The Ministry of Irrigation and Water Resources of the Republic of Sudan supported his graduate studies. In addition, we acknowledge the partial support of the National Science Foundation Grant CMS-9625685 on “Uncertainty and Risk Analysis Under Extreme Hydrologic Events.” Furthermore, the careful review and detailed comments by two anonymous reviewers substantially helped the authors in preparing the final manuscript.

## REFERENCES

## Footnotes

*Corresponding author address:* Dr. Jose D. Salas, Dept. of Civil Engineering, Colorado State University, Fort Collins, CO 80523. jsalas@engr.colostate.edu