Recent heightened concern regarding possible consequences of anthropogenically induced global warming has spurred analyses of data aimed at detection of climate change and more thorough characterization of the natural climate variability. However, there is greater concern regarding the extent and especially quality of the historical climate data. In this paper, rainfall records of 233 gauge stations over Ethiopia for the 1978–2007 period are employed in an analysis that involves homogenization, reconstruction, and gridding onto a regular 0.5° × 0.5° resolution grid. Inhomogeneity is detected and adjusted based on quantile matching. The regularized expectation-maximization and multichannel singular spectrum analysis algorithms are then utilized for imputation of missing values, and the latter has been determined to have a marginal advantage. Ordinary kriging is used to create a gridded monthly rainfall dataset. The spatial and temporal coherence of this dataset are assessed using harmonic analysis, self-organizing maps, and intercomparison with global datasets. The self-organizing map delineates Ethiopia into nine homogeneous rainfall regimes, which is consistent with seasonal and interannual rainfall variations. The harmonic analysis of the dataset reveals that the annual mode accounts for 55%–85% of the seasonal rainfall variability over western Ethiopia while the semiannual mode accounts for up to 40% over southern Ethiopia. The dataset is also intercompared with Global Precipitation Climatology Project (GPCP), Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP), Climatic Research Unit time series version 3 (CRUTS3.0), Tropical Rainfall Measuring Mission (TRMM), and interim ECMWF Re-Analysis (ERA-Interim) rainfall. The correlation of the dataset with global datasets ranges from 0.52 to 0.95 over sparse to dense rain gauge regions. The GPCP rainfall has a small bias and good correlation with the new dataset whereas TRMM and ERA-Interim have relatively large dry and wet biases, respectively.
An element common to both quality assessment for database construction and the use of data in the study of physical phenomena is the estimation of some statistical quantities aimed at characterizing certain aspects of the data. For instance, the trend in climate variables is one of the pressing issues. Present climate studies are primarily focused on the detection and attribution of climate change. A number of studies in different parts of the world on continental and subcontinental spatial scales have been performed to find a detectable human influence on past climate change, which is manifested mainly by a change in temperature, precipitation, sea level rise, mountain glaciers, and discharge of rivers. For instance, Stott (2003) detects the warming effects of increasing greenhouse gas concentrations in six continental-scale regions over the 1900 to 2000 period, using simulations from the Met Office Hadley Centre Coupled Model version 3 (HadCM3). A more comprehensive study of Knutson et al. (2006), which covers between 0.3% and 7.4% of the area of the globe including tropical and extratropical land and ocean regions, suggests that there is a detectable anthropogenic warming signal over many of the regions he examined.
Although the evidence is weaker than at continental scale, anthropogenic signals can now also be detected in some subcontinental-scale areas using formal detection methods. Zhang et al. (2006) detect anthropogenic fingerprints in China and southern Canada. Spagnoli et al. (2002) find some evidence for a human influence on 30-yr trends of summer daily minimum temperatures in France. Min et al. (2005) find an anthropogenic influence on East Asian temperature changes in a Bayesian framework. Studies have also applied attribution analysis to subcontinental temperatures to make inferences about changes in related variables. Stott and Tett (1998) detect an anthropogenic influence on southern European summer mean temperature changes of the past 50 years and then infer the likelihood of exceeding an extreme temperature threshold. Gillett et al. (2004) detect an anthropogenic contribution to summer season warming in Canada and demonstrate a statistical link with area burned in forest fires.
Precipitation is another climate variable, almost as important as atmospheric surface temperature, in providing rich evidence of climate change at different spatial scales. The basis of such role is linked to the fact that the amount of moisture in the atmosphere is expected to increase in a warming climate as saturation vapor pressure increases with temperature. Greenhouse gas increases, apart from their role in global warming, are also expected to cause enhanced horizontal transport of water vapor that is expected to lead to a drying of the subtropics and parts of the tropics (Kumar et al. 2007; Neelin 2006) and a further increase in precipitation in the equatorial region and at high latitudes (Emori and Brown 2004; Held and Soden 2006).
Local climate studies are needed to provide finescale climate information for impact assessment and adaptation purposes because climate impacts occur locally and can take many different forms in different places. As the signal of climate change strengthens, detection of changes in surface temperature and precipitation is becoming possible at increasingly smaller scales (Stott and Tett 1998). This enables important insights into local climate from studies focused particularly in areas characterized by a diverse and heterogeneous land surfaces such as Ethiopia, a country where extreme weather variations such as floods and droughts are increasing in magnitude and frequency in recent times, since the different surfaces have different degrees of vulnerability. However, climate-related studies at the national level in Ethiopia are at the infant stage. One of the reasons for limited investigation of climate change and variability over Ethiopia is absence of homogeneous and continuous climate variables such as temperature and rainfall. Missing data are, in particular, a source of problems in climate research (e.g., in the analysis and modeling of spatiotemporal variability).
Moreover the complexity of climate data rescue procedures—specifically rainfall, which exhibits notable spatial variability—has created data barrier for compressive investigation in the continent in general and in Ethiopia in particular. In the regional projection of the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR4), East Africa is the only region in which there is a likely increase in annual rainfall; the rest of the continent has a likelihood of decrease in annual rainfall. But the region has been experiencing an intensifying dipole rain pattern on the decadal time scale and the dipole is characterized by an increase over the northern sector and decline over the southern sector in rainfall during recent decades as revealed from sparse data (Schreck and Semazzi 2004). This disparity can only be resolved with the use of continuous, homogeneous, and good-quality climate data that characterize climate change and variability.
Therefore, there is a heightened need to rescue climate data in data-sparse regions of Africa. There may be various reasons for missing values, including equipment failure, errors in measurements or faults in data acquisition, natural hazards, and manmade events such as war. Such information should have been found in metadata. However, no metadata are available about data acquisition and the history of each station, including changes in location, measurement conditions, observers, and observation times. Metadata are very useful for assessing the quality of a time series and to identify possible errors, but are frequently missing in raw climate databases in most parts of the world. In effect, a number of methods have been developed to fill data gaps, assess the inhomogeneity, and make adjustments without metadata.
Different methods for dealing with missing values are available and they can be divided into three categories: listwise deletion strategy, imputation-based procedures, and model-based procedures (Little and Rubin 2002). In the listwise deletion strategy, which is also known as complete-case analysis, all observations with one or more missing values are discarded from the analysis. This strategy is easy to carry out and is the most commonly practiced one. In imputation-based procedures, missing values are imputed with plausible values rather than being deleted entirely. Scientists sometimes impute a value for the missing values by using, for example, the mean of the observed variables. This method is simple and efficient and gives generally good results. Such a procedure will, however, seriously distort statistical properties such as standard variation, correlations, or percentiles, thereby falsifying data relationships and distributions (Schafer and Graham 2002). In the model-based algorithm, information on spatial and temporal variabilities in the dataset is used.
For the past few decades, statisticians have tried to introduce model-based methods such as maximum likelihood using the expectation-maximization (EM) algorithm and multiple imputation, arguing that they provide sounder and more promising solutions for a wider range of situations. However, these methods carry assumptions about the missing data mechanisms. A general method for maximum likelihood in missing data problems was described by Dempster et al. (1977) in their seminal paper on the EM algorithm.
Regularized EM algorithm is a recent variant of EM algorithm that works based on estimated regression models between missing and available data. Schneider’s (2001) parametric method uses EM and ridge regression to iteratively estimate the mean and covariance matrix of the analyzed dataset. Schneider’s (2001) method has shown some improvement over traditional optimal interpolation (OI) (Smith et al. 1996; Kaplan et al. 1997; Mann et al. 1998) in estimating missing values for simulated SST data. However, this EM-based method relies on the Gaussianity of the data, as well as on the randomness in time of the missing values. Recently, Beckers and Rixen (2003) proposed a nonparametric, EOF-based interpolation method to fill in missing data. Both the EOFs and the missing data are iteratively estimated, thus removing the need for a priori assumptions about the spatial form and parameters of the covariance matrix (Alvera-Azcarate et al. 2005). This spatial EOF-based reconstruction, as well as Schneider’s (2001) EM method, however, utilize spatial correlations only and are therefore less well suited to deal with datasets that exhibit relatively long continuous gaps.
Standard spectral analysis tools require regular sampling, although some methods do allow uneven sampling (MacDonald 1989; Foster 2002; Schulz and Mudelsee 2002). Recently, Schoellhamer (2001) suggested a modified singular-spectrum analysis (SSA) algorithm to obtain spectral estimates from records with a large fraction of missing data. Analyzing the full extent of the climate time series, with the missing points filled in, allows for greater accuracy and better significance testing in the spectral analysis. The full record can also improve our knowledge on the evolution of the oscillatory modes in the gaps and provide new information on changes in climate. A novel, iterative form of multichannel SSA (MSSA) utilizes temporal as well as spatial correlations to fill in missing points; it thus generalizes Beckers and Rixen’s (2003) spatial EOF-based reconstruction method and is particularly useful for datasets that exhibit relatively long continuous gaps.
Data gaps are not the only problem in the use of climate data. The presence of inhomogeneities is another common problem in climate data series. The majority of these appear as abrupt changes in the average values, but they can also appear as changes in the trend of the series (Alexandersson and Moberg 1997). Inhomogeneities in climate series can result in substantial misinterpretation of the behavior and evolution of climate. Inhomogeneities can arise from human causes such as changes in the location of the observation station, alteration of the surrounding environment, observer changes, and instrumental replacement (Karl and Williams 1987). If a series is identified as nonhomogeneous, use of the data for trend and variability analysis becomes questionable, and it is usually discarded.
The density of climate observations is as important as data gaps and inhomogeneity, particularly for rainfall observations, because global precipitation is a major component of the global water and energy cycle that influences significantly the earth’s climate system, and it is in turn affected by climate system variability and change. As a result, many attempts have been made to develop and improve global and regional gridded precipitation gauge datasets in recent years (e.g., Mitchell and Jones 2005; New et al. 2000). However, because of the lack of observational precipitation datasets over land areas, for example over Asia, Yatagai et al. (2008, 2009) have created a high-resolution rain gauge–based daily precipitation grid dataset for the Far East, Middle East, and Russia. To achieve more accurate gridded precipitation data at country level, Javanmard et al. (2010) presented the improvement and comparison of gridded precipitation data over Iran using highly quality-controlled data from ~200 synoptic stations with satellite rainfall estimates of the Tropical Rain Measurement Mission (TRMM).
Although rain gauge observations provide longer records, they do not provide a reliable spatial representation of precipitation (Gruber and Levizzani 2008). For example, the distribution of gauge networks is sparse in lowland areas in Ethiopia. Remote sensing techniques, such as those using radar or satellites, are useful for monitoring rainfall over large mountainous and oceanic regions; and with their time series accuracy and precision being potentially locally lower than existing and corresponding datasets derived from ground-based measurements, they provide more homogeneous quality data compared to ground observations (Schulz et al. 2009). Satellite-based datasets need to be examined and verified by comparison with ground-based rain gauge data.
To this end, there are different interpolation techniques that have been used in the past to create gridded climate fields. Ordinary kriging (OK) is one of them and develops weights for the surrounding stations based on both the covariability between the stations and estimated covariability between the surrounding stations and the estimation points (Todini et al. 2001). The spatial covariability is often quite noisy (and is unknown between the surrounding stations and estimation points) so is modeled as a function of distance (e.g., using a variogram model). Fitting the variogram function is the key aspect of OK. Efforts have been made to fit variograms objectively so as to improve the estimation (Todini et al. 2001).
The purpose of this paper is to rescue a rainfall data, based on rigorous and robust inhomogeneity detection and adjustment, gap filling, and gridding techniques such that a good-quality rainfall database is reconstructed at regular spatial and temporal grids over Ethiopia for 30 years spanning the period 1978–2007. Gauge rainfalls from 233 stations are used in the study. The regularized EM algorithm based on iterated linear regression analysis (hereafter RegEM) (Schneider 2001) and MSSA are utilized for data gap filling. Although detailed descriptions of the two algorithms can be found elsewhere (e.g., Dempster et al. 1977; Schoellhamer 2001), we summarize their silent features in section 3. We have employed the RHtestsV3 software package for inhomogeneity detection and adjustment. The package includes provision of quantile-matching (QM) adjustments (Wang et al. 2010) in addition to mean adjustment. OK with spherical variogram is used to reconstruct gridded rainfall dataset from gauge rainfall data that are treated for inhomogeneity and gaps. The gridded rainfall dataset is then validated against other observations, namely those from the Global Precipitation Climatology Project (GPCP), Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP), Climate Research Unit (CRU), TRMM, and interim European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-Interim) rainfalls.
2. Description of the study area
Ethiopia lies in the tropics; however, because of its high-altitude mountains, the tropical climate is somehow moderated and a good part of the country enjoys a temperate climate type. Most of the western Ethiopia as well as the highlands of south and the east have tropical rainy climate, while the highlands in southwest have warm temperate and tropical rainy climates. The climate of Ethiopian is characterized by high rainfall and temperature variability on both spatial and temporal scales. The variability in distribution is related to altitude, latitude, humidity, and winds, which are the significant factors in affecting the weather system of the country.
Temperature varies greatly with altitude. Temporally, there is a seasonal variability of the mean maximum and minimum temperatures in the country. The “Belg” months (March, April, and May) are the warmest months almost all over the country when the annual extreme maximum temperatures are observed in many places. The “Bega” months (November, December, and January) are the coldest months.
Because of Ethiopia’s location, the Indian Ocean, the Atlantic Ocean, and both the African and the Asian landmasses have strong influences on the seasonal and climatic behavior of the country. The seasonal and annual rainfall variations in Ethiopia are associated with the macroscale pressure system on the oceans (St. Helena and Mascarena) as well as monsoon winds originating from the oceans (Seleshi and Zanke 2004). Since the sources of moisture are the Indian and Atlantic Oceans, the position and strength of the high pressure systems in these oceans, as well as the strength of the jet that transports moisture from the Indian Oceans and the presence of monsoon winds from Guinea and heat lows over Sudan/Chad, will have controlling roles on the amount of moisture transported to Ethiopia. As a result, Ethiopia’s climate is the combined effects of the macro pressure systems coiled in the two oceans and other driving systems. These driving systems are intertropical convergence zone (ITCZ), subtropical jet (STJ), tropical easterly jet, Red Sea convergence zone, and the Somali jet.
3. Data and methodology
a. Rainfall datasets
Rainfall data records of about 300 stations are acquired from National Meteorological services. However, we were able to use only a tenth of these records in one of the previous studies [see, e.g., the study by Bayable (2008)] because of inhomogeneity and data gaps present in the original dataset. Figure 1 shows the geographical locations of the rain gauge stations incorporated in this study (left panel) and the fraction of missing data as a function of cumulative number of stations with missing data (right panel). Rainfall data records of only 233 stations with long records are retained and analyzed for existence of inhomogeneity and then adjustment is made.
We have used CRU, CMAP, GPCP, ERA-Interim, and TRMM rainfalls over Ethiopia to assess the accuracy and deficiency of the new gridded gauge rainfall dataset that might have come from data gap filling, inhomogeneity adjustment, and gridding. The CRU global climate dataset has recently been improved and reported as CRU time series version 3 (CRUTS3.0) by Mitchell and Jones (2005). This is the most complete and updated dataset of gridded precipitation and temperature at the global scale, has a spatial resolution of 0.5, and covers the period 1901–2006. The CMAP dataset merges satellite and rain gauge data from a number of satellite sources and rain gauge sources. This version of the CMAP dataset also uses precipitation from numerical weather prediction (NWP) models. The GPCP dataset consists of monthly means of precipitation derived from satellite and gauge measurements from several centers. The gauge and multisatellite analyses are merged using the inverse error-weighting technique to form the final GPCP dataset (Adler et al. 2003). ECMWF is currently producing ERA-Interim, a global reanalysis of the data-rich period since 1989 based on cycle 31r2 of the Integrated Forecasting System (IFS). Relative to the ERA-40 system, which was based on IFS cycle 23r4, ERA-Interim incorporates many important IFS improvements such as model resolution and physics changes, the use of four-dimensional variational data assimilation (4D-Var), and various other changes in the analysis methodology (Dee et al. 2011). TRMM, dedicated exclusively to monitor tropical and subtropical rainfall, was launched in November 1997. The TRMM dataset used covers a span of 10 years from 1998 to 2007.
b. Homogeneity testing
A variety of methods have been developed to identify inhomogeneities in climate data series (e.g., Peterson et al. 1998; Beaulieu et al. 2007). There are two general types of homogenization procedure: 1) absolute, which considers only the information in the time series being tested, and 2) relative, in which data from other observatories are also used. The latter procedure is more reliable as it involves comparison of the temporal evolution of a candidate series with that of a reference series created from correlated series nearby but is often difficult to realize.
Easterling and Peterson (1992) reviewed some of the easily automated detection techniques that involve a comparison between a reference series and the data series that is being examined. Most of the techniques that they reviewed are designed for detecting abrupt discontinuities that are defined as a change in the mean of a time series. These include double mass analysis, about which they conclude that it is impossible to determine which one of the two series being compared contains a heterogeneity, and the parallel cumulative sum (CUSUM) method that uses comparisons between the data series and several reference series; for this they conclude that the method is subjective, tedious, and has the disadvantage of not immediately providing the means for calculating an adjustment factor.
We have used the RHtestsV3 software package, which includes provision of quantile-matching adjustments (Wang et al. 2010). The assumption underlying quantile-matching adjustments (Wang et al. 2010) is the Gaussianity of the rainfall distribution. This, however, is difficult to ensure in the presence of data gaps. As a result, some authors argue that data gap filling should precede inhomogeneity detection and adjustment. However, others argue that some data gap filling techniques also presuppose the Gaussianity of the rainfall distribution, which is difficult to ensure again in the presence of data gaps. Because of this ambiguity, there is no clear and accepted path to follow as far as order of executing inhomogeneity adjustment and imputation of data gaps are concerned. Therefore, whichever ways one prefers to follow, it is important to perform quality control assessment to identify and substitute anomalous and questionable records in the database (negative precipitation and extreme precipitation events) and to check whether while removing the abrupt changes in a data series, the adjustment and imputation procedures have introduced errors to the adjusted/imputed data series or not, because the adjustment/imputation procedures are not necessarily describing the actual changes. Data series that cover a period of less than 15 years were considered too short to be suitable for reconstruction in this study.
c. Data reconstruction and imputation methods
We used the RegEM (Schneider 2001) and MSSA (Ghil and Taricco 1997; Ghil et al. 2002) algorithms for filling missing data. The RegEM algorithm is based on iterative analysis of linear regressions of variables with missing values on variables with available values, with regression coefficients estimated by ridge regression. The two algorithms are also used to extend the time series into the past in a few cases, so that a 30-yr rainfall time series data is established for all the stations. In this work, we apply a novel iterative form of MSSA. MSSA uses temporal as well as spatial correlations to fill in missing point. Figure 1 (right panel) shows the percentage of missing data for a given percentage fraction of number of stations used in this study. It can be noted that 50% of the stations in this study have missing data on the order of 15%. This shows clearly that a data gap in important climate variables such as rainfall is a serious problem.
1) Regularized expectation-maximization algorithm
The RegEM algorithm is one of the popular methods used by geophysical scientists for imputation of missing values of geophysical data. Although details of the RegEM method are well documented (Kalteh and Hjorth 2009), in the following we highlight in brief how the RegEM algorithm can be utilized. For any iteration j of the RegEM algorithm, estimates of the mean vector and the covariance matrix are revised in three stages. In the first iteration, and are obtained from the completed database after substituting the missing values with the mean values. In the first stage, each observation x = xi(i = 1, 2, … , n) can be divided into two subsets x: xo, xm, where the vector xo represents variables with available values and the vector xm represents the remaining variables with missing values. The regression parameters of the variables can then be computed from the estimates of the mean and the covariance matrix as
where is the matrix of estimated regression coefficients; is the submatrix from the estimated covariance matrix that consists of the estimated variances and covariances of the variables for which, in the given observation, the values are available; the scalar h is a regularization parameter that is used for a rank-deficient matrix; is the diagonal covariance matrix; are submatrices that consist of the estimated cross-covariances of the variables; the vector is that partition of the estimated mean that represents the variables for which, in the given observation, the values are available; and the vector is that partition of the estimated mean that belongs to the variables for which, in the given observation, the values are missing. In the second stage, the missing values in the given observation are filled in with conditional expectation values given the available values, mean, and covariance matrix estimates. In the third stage, once the missing values in all observations were filled in, then we reestimate the mean and covariance matrix from the conditional expectation of the cross-products . Then the next iteration of the RegEM algorithm is conducted with the new estimates, and this continues until the filled-in values and the estimates of the mean and covariance matrix parameters do not change substantially.
2) Multichannel singular spectrum analysis
The SSA gap filling procedure utilizes temporal correlations in the data to fill in the missing points for a univariate time series whereas the MSSA gap filling procedure takes advantage of both spatial and temporal correlations for a multivariate dataset. In both univariate and multivariate cases, data gap filling is accomplished in two stages: iterative generation of estimates of missing data points, which are then used to compute a self-consistent lag-covariance matrix X and its EOFs Ek; and cross-validation to optimize the window width M and number of dominant SSA modes to fill the gaps.
A few leading EOFs correspond to the record’s dominant oscillatory and/or trend modes, while the rest is noise for many geophysical variables (Ghil et al. 2002). The original data are centered by computing the unbiased value of the mean and setting the missing-data values to zero. The inner-loop iteration is started by computing the leading EOF E1 of the centered, zero-padded record. Then the SSA algorithm is again applied on the new time series, in which the reconstructed component R1 corresponding to that EOF alone was used to obtain nonzero values in place of the missing points and correct the record’s mean, the covariance matrix, and EOFs.
The reconstruction of the missing data is repeated with a new estimate of R1 and tested against the previous one until a convergence test has been satisfied. Next, outer loop iterations are performed by adding a second EOF E2 for reconstruction, starting from the solution with data filled in by R1, and repeating the inner iteration.
It is insightful to consider SSA gap filling in terms of applying iteratively finite impulse response filters to understand the flow of information from the known to the missing data where each reconstruction filter
is symmetric and has a length of 2M − 1. Each reconstruction filter represents the combined influence of the EOFs used so far in the outer-loop iteration (Varadi et al. 1999). The reconstructed time series X*(t) can be considered as the original time series X(t) filtered with the weights wn:
At each inner iteration, the values of X at missing points are replaced with estimated X* values. Then, the EOFs and the filter coefficients wn are recalculated, and the whole procedure is repeated until a convergence criterion is met for X* at missing points. Then the next EOF is added in the reconstruction, and so on. For a continuous gap, Eq. (2) shows that missing data are filled with information being transferred inside the gap from adjacent portions of the time series.
MSSA has some similarity with spatial EOF reconstruction method (Beckers and Rixen 2003) as reported (e.g., Ghil et al. 2002; Ghil and Jiang 1998; Kondrashov et al. 2005). The quality of the reconstruction (e.g., the closeness of its oscillatory and/or trend modes to those of the original, gappy time series) will of course depend on the amount of noise, as well as on the number and distribution of missing points. As the amount of noise increases, the significant EOFs will be polluted more, making it more difficult to remove the noise contributions. Increasing the number of missing data yields the same effect, with the worst-case scenario being continuous gaps. Even in this case, the period of the oscillation can be determined correctly, provided the gap is not larger than any significant spatiotemporal correlations present in the data (i.e., the time period of the slowest oscillatory mode).
d. Gridding, harmonic analysis, and clustering of rainfall data
Because gridded datasets are easy to use for climate research and for applications in management and decision making, we develop a gridded monthly rainfall dataset from the quality-controlled and adjusted data for the period 1978–2007 based on the geostatistical method. We have employed ordinary kriging. Therefore, in the following we give a further description of ordinary kriging.
1) Rainfall data gridding by ordinary kriging
Kriging interpolates a value Z(xo) of a random field, in this case rainfall, at an unobserved location xo from observations Z(xi) at nearby locations (xi, i = 1, … , n) from the best linear unbiased estimator based on a model of spatial dependence quantified either by the covariance matrix or variogram. The ordinary kriging method calculates weights using a vector of covariances between the estimation point and surrounding observing stations and also a matrix of covariances between all observing stations as
where w is the vector of weight, is the covariance matrix between observation locations, and D is the vector of covariance between estimation point and observation locations.
Because the covariance structure between the estimation and observations points is unknown, a model that is a function of distance is needed to determine the weights. This model is referred to as the variogram (Journel and Huijbregts 1978). To derive this variogram model, covariance is estimated between all possible observation points within a specified distance. A smooth function is then fit to the observed covariances. This fitting process involves selecting a function (e.g., spherical, exponential, etc.) and its associated parameters (range, nugget, sill). Once the variogram is determined, the kriging estimator is given by a linear combination
of the observed values with weights wi such that is minimum. The mean is assumed constant but unknown in ordinary kriging. A spherical function is fitted to the observed covariances. As the rainfall data are rather skewed toward smaller values and kriging is sensitive to it, a normalized rainfall anomaly, with respect to the spatial mean of the region, is used during the gridding procedure.
2) Harmonic analysis and clustering of rainfall data
The spatial and temporal coherence of the gridded gauge rainfall are assessed using harmonic analysis (HA) and a self-organizing map (SOM). The purpose of using HA and SOM in the analysis of reconstructed gauge rainfall is twofold. First, both HA and SOM are used to verify whether the reconstructed rainfall time series capture the known climatological variations at different time and spatial scales over Ethiopia as known from various previous studies. Second, identification of homogeneous rainfall regimes based on SOM, if found to be reasonable, can be used for further analysis of seasonal and interannual rainfall variability of the region.
The dominant cyclic mode of variability of rainfall over Ethiopia can be identified from the amplitude of HA functions while the time of occurrence of peak rainfall is determined from their phases. In particular, the different seasonal modes can be described by HA.
The SOM (Kohonen 1982a,b, 2001) is a type of artificial neural network designed for unsupervised pattern recognition applications. The SOM learns to map, in a nonlinear fashion, from a high-dimensional input layer to a low-dimensional, mostly two-dimensional, discrete lattice of neurons in the output layer. We used SOM for clustering Ethiopia into homogeneous rainfall regimes and to assess whether the clustered rainfall regimes are the same as previous works, which are based on relatively smaller number of stations and traditional clustering techniques (NMSA 2004).
4. Results and discussion
The homogenization of gauge rainfall over Ethiopia by RHtestV3 software package, data gap filling based on advanced RegEM and MSSA algorithms, gridding of the resulting dataset on a regular high-resolution grid using ordinary kriging, harmonic analysis to identify dominant modes of variability, and clustering using SOM to demarcate homogeneous rainfall regimes have been accomplished. The dataset has been gridded onto a regular 0.5° × 0.5° latitude–longitude resolution for the 30-yr period spanning 1978–2007. The quality and physical coherence of the dataset have been analyzed against CRUTS3.0, CMAP, TRMM, ERA-Interim, and GPCP datasets. In the following sections, evaluation of the new dataset is undertaken with respect to its dominant modes of variability, its characterization of homogeneous rainfall regimes, and consistency with previous independent studies. Moreover, consistency of spatial and temporal variability within individual rainfall regimes in the dataset with that exhibited by global observational and reanalysis datasets, and consistency of seasonal variability of the dataset within each rainfall regimes with that manifested in the other observational and reanalysis datasets are also determined.
a. Performance of inhomogeneity adjustment, gap filling, and gridding techniques
Figure 2 shows example of inhomogeneity detection and adjustment for one of the synoptic station at Bahir Dar, northern Ethiopia. Figure 2 (top panel) shows there is a break point in the anomaly time series after 1982. The rainfall time series before and after 1982 have different trends as shown by gray line. The middle panel in Fig. 2 depicts the original (black curve) and QM adjusted (gray curve) rainfall time series. The RHtestsV3 software package provides a nominal level of confidence in the detected break points. Generally, we applied the adjustment if the detected change point satisfies the 95% significance level. Moreover, visual inspection of each case as well as other artificial change point detection methods such as a two-phase regression approach (Menne and Williams 2005) and double mass curve are performed before we apply the adjustment.
Figure 3 shows the relation between rainfall data filled for artificial gaps and the original observed monthly rainfall–based MSSA algorithm. The gaps are artificially introduced into the observational rainfall data and a random number generator is used to determine where within each time series the gaps should be placed. This ensures that the gaps are randomly distributed both in space and time as new random number is generated for each time series considered. The MSSA gap filling approach is robust as reflected in near-perfect reconstruction of observed rainfalls at the data gap points. A small dry bias for high observed rainfall and a bit of wet bias toward lower end of observed rainfall is noticed in Fig. 3. The method is robust as it depends on both spatial and temporal covariability. Comparable results (not shown) have been observed using the RegEM procedure. However, it has to be noted that this performance is a function of length of data gaps. In this case, the gaps are of a modest nature having a continuous length of 6 months randomly distributed in space and time whereas in reality the gaps could be much longer for some time series.
The gauge rainfall from all 233 stations has undergone inhomogeneity tests and adjustment. Those with data gaps are filled based on the MSSA and RegEM algorithms. The results from the two approaches are nearly identical. As a result, the rest of discussion refers to the dataset generated using the MSSA gap filling algorithm. The dataset was then subjected to simple, ordinary, and universal kriging procedures. The results from simple kriging have been found to be slightly poorer than those from ordinary and universal kriging. Whereas the gridded rainfalls by ordinary and universal kriging are identical, we found that universal kriging has a high estimation error (not shown). Therefore, we have chosen to perform cross-validation and gridding based on ordinary kriging.
Figure 4 shows gridded annual mean rainfall (left panel) and corresponding annual mean kriging estimation error (right panel) for the 1978–2007 period. Apart from spatially distinct annual rainfall climatology, one can notice that the estimation error is very low over the dense station networks of northwestern and southeastern highlands of Ethiopia. In contrast, maximum estimation errors are found for southeastern lowlands (see Fig. 1 for station density), as expected. Moreover, the performance of the OK method can be evaluated using cross validation. We employed a leave-one-out cross-validation scheme. Figure 5 shows the performance of OK for the different seasons in terms of correlation of estimated rainfall with observation (top-left panel), root-mean-square deviation (RMSD) from observed rainfall, and standard deviation (SD) of the observed seasonal rainfall itself (top-right panel), and bias with respect to observation (bottom-left panel). The application of OK for gridding rainfall over Ethiopia has resulted in robust estimation as reflected in significant and high correlation exceeding 0.9 as well as a small wet bias in the range of 0–3 mm of seasonal rainfall. Both the correlation and wet bias exhibit small seasonal variation. Moreover, the estimation error is not only small but is also smaller than the standard deviation of the monthly mean observed rainfall as depicted in Fig. 5 (top-right panel) by more than a factor of 2 during most of the seasons.
b. Dominant modes of rainfall variability and rainfall regimes
The HA of gridded rainfall data over Ethiopia is conducted to determine the dominant periodic components in the variability of rainfall over Ethiopia. One of the motivations for such analysis is to gain knowledge of peak rainfall timing as it is crucial on account of its close link with maximum potentials of streamflow and groundwater recharge. Therefore, the spatial distribution of the two dominant annual and semiannual components is discussed in this section.
The annual rainfall variation accounts for up to 80% of the seasonal variability over much of the western half of the country as shown in the top left of Fig. 6 while the semiannual variation shown in the top right of Fig. 6 accounts for up to 45% of the seasonal variability over the southern regions of the country. The annual mode of rainfall variability peaks during the months of mid-June to mid-August over western half and southeastern highlands of the country (see Fig. 6, bottom left). This is consistent with features observed from annual mean in Fig. 4.
The semiannual mode of rainfall variability peaks during the months of mid-March to mid-April during the first half of the year and mid-September to mid-October during the second half of the year over the southern parts of the country as depicted in the bottom right of Fig. 6. This shows that the amplitude and phase analyses reveal the monomodal and bimodal rainfall distributions in a more robust objective manner than the previous studies (NMSA 2004). The northern and western parts receive rainfall in July and August as depicted in the bottom left of Fig. 6. The spatial difference on time of occurrence of peak rainfall and the northward advancing phase patterns show great similarity with rainfall “clusters” identified over Ethiopia by Gissila et al. (2004). In the customary popular “rainfall calendar” of Ethiopia, rainfall intensity advances gradually since its onset and becomes maximum around mid of the season, although there are regional differences in the time of peak rainy season.
The gridded rainfall is also employed to demarcate homogeneous rainfall regimes over the country. Figure 7 (left panel) shows nine homogeneous rainfall regions. The western lowlands and highlands are marked as R-1, R-8, and R-9 in Fig. 7. These regions are known as monomodal type-1 rainfall regimes, which are characterized as having wet season from February/March to October/November, June to September, and April/May to October/November over the R-1, R-8, and R-9 regions respectively. These regions are characterized by one continuous rainfall occurrence time and single maxima rainfall pattern during a year. This can be checked from right panel of Fig. 7 with extended peak for R-1 and relatively narrower peaks spanning a period of 2–3 months for the R-8 and R-9 regions. The regions have tropical-like nature since they are dominated by two seasons: wet and dry. The wet season gradually decreases northward from about 10 months in the southwest to only about 2 months in the northwest (NMSA 2004; Mekonnen 1998; Lemma 2003; Haile and Yarotskaya 1987).
The rift valley and adjoining western and eastern escarpments are categorized as the R-7 rainfall regime in Fig. 7. This region is known to have a dry season in the October–January (ONDJ) months and two wet seasons in MAM and JJAS and referred to as bimodal type-1 rainfall regime (NMSA 2004). R-2 region covers the southeastern highlands and belongs to bimodal type-1 region as well. The southern and eastern lowlands are bimodal type-2 regions with rains in March–May (MAM) and September–November (SON), which in this case are demarcated as three distinct rainfall regimes (R-3, R-4, and R-5, respectively), possibly due to differences in the seasonal intensity of rainfall (see right panel of Fig. 7). Although the first wet period is more prominent, both wet seasons are equally important. Generally the annual rainfall decreases from west to east (NMSA 2004; Mekonnen 1998; Lemma 2003; Haile and Yarotskaya 1987). Wet spells during the MAM season have been shown in case studies to be associated with a southward bend of the Northern Hemisphere subtropical westerly jet (Habtemichael and Pedgley 1974). There has been to date no comprehensive analysis of the interannual variability of those rains. However, a relationship with the occurrence of tropical cyclones in the southwest Indian Ocean was found at both interannual and intraseasonal time scales (Shanko and Camberlin 2005).
The R-6 region includes extreme northeastern lowlands and adjoining escarpment of northeastern high lands. The lowlands of R-6 cover the regions formally known as monomodal type-2 seasonal rainfall distribution. This part of the region encounters warm and dry air associated with occasional gusty winds and is characterized by irregular and erratic rainfall from July to February, and the rainfall has no distinct seasonal pattern (NMSA 2004; Mekonnen 1998). However, since R-6 includes the adjoining escarpment of the northeastern highlands, the seasonal rain extends throughout the year.
The consistency of results from the HA and SOM techniques shown in the preceding paragraphs has shown not only the quality of the reconstructed dataset but also the importance of taking the time scale into account in a holistic manner. When investigating climate variability, although the use of annual means is easy, it can be criticized since there are often marked differences in the behavior of climate during the various phases of the year. In particular, several equatorial regions show two rainy seasons in a year, each with its own interannual variability, related to different atmospheric dynamics. There is thus a real need to take into account the seasonal cycle when identifying space–time structures of interannual variability.
Both SOM clustering and HA are able to capture the complex seasonal patterns over the eastern half of Ethiopia. They are also consistent with the dynamics that indicate that rainfall over southern lowlands of Ethiopia is identical to East African rainfall, which mostly occurs during the boreal spring (long rains; MAM) and autumn (short rains; SON) seasons as the ITCZ migrates through the equator from south to north, and vice versa. As in other tropical regions, the interannual variability of rainfall in Ethiopia results from complex interactions of forced and free atmospheric variations. These include interactions between sea surface temperature (SST) forcing, large-scale atmospheric patterns, and synoptic-scale weather disturbances including monsoon and trade winds, persistent mesoscale circulations, tropical cyclones, subtropical anticyclones, easterly/westerly wave perturbations, and extratropical weather systems (Camberlin and Philippon 2002; Habtemichael and Pedgley 1974; Fritsch and Chappell 1980). Relative to the long rains, the short rains tend to have stronger interannual variability, stronger spatial coherence of rainfall anomalies across a large part of the region, and a substantial association with ENSO.
c. Comparison with other datasets: Interannual variability over the homogeneous rainfall regimes
The reliability of gridded monthly rainfall dataset over Ethiopia has been evaluated through extensive intercomparisons against GPCP, CMAP, CRUTS3.0, TRMM, and ERA-Interim monthly rainfalls. The comparison is performed for the nine homogeneous rainfall regimes identified in section 4b separately. This dataset and validation datasets are compared based on their correlation, root-mean-square deviation, and bias within each cluster. Moreover, the gridded and reconstructed rainfall is spatially averaged for the nine homogeneous rainfall regimes and the resulting time series are compared with corresponding spatially averaged time series of other datasets. The discussion of correlation, RMSD, and bias in the following subsections refers to the values calculated from individual data within the clusters as well as those of cluster-average time series. The correlation and bias of cluster-average time series of the reconstructed dataset and global datasets are shown as a figure title of the corresponding time series whereas those computed from individual data points are plotted. The summary statistics are obtained for both MSSA- and RegEM-based reconstructed gauge rainfall; however, for sake of clarity and marginal differences between the two, we have only reported results from the MSSA techniques.
1) Comparison with GPCP
The gridded rainfall over Ethiopia is generally in a very good agreement with blended observational datasets of GPCP. However, there is variation in the degree of agreement from one homogeneous rainfall regime to another. Results of comparison of the reconstructed and gridded gauge rainfall versus GPCP rainfall for the northwestern (top left), southern (top right), Rift Valley and adjoining highlands (bottom left), and southwestern (bottom right) regions shown in Fig. 8 are discussed. These regions are delineated as the R-8, R-4, R-7, and R-1 homogeneous rainfall regimes in Fig. 7, respectively. The regions are selected for further analysis as they represent distinct rainfall regimes of Ethiopia. The homogeneous R-8 region is characterized by a single rainy period of June–September (JJAS) whereas the R-4 region is characterized by two rainy seasons of MAM and JJAS periods. Other regions with the same bimodal rainfall pattern and main rainfall period of MAM as R-4 are R-2, R-3, and R-5. The homogeneous R-7 region is also characterized by bimodal rainfall pattern but with the main rainfall in JJAS. The homogeneous R-1 region is characterized by long rainy season from March/April to October/November.
The agreement with GPCP is very good for all four regions throughout the 30-yr period of gauge rainfall. The correlation is remarkably high for the R-8 and R-7 regions. However, there is significant difference during some years for the R-4 region. The number of rain gauge stations included in the reconstruction and then gridding is very small over the R-4 region as shown in Fig. 1. On the other hand, the GPCP rainfall, which includes the blended data from different satellite products and gauge, may contain the same information from gauge over this region. The spatially averaged rainfall time series of GPCP for the R-4 region exceed those of the reconstructed rainfall although the phases of both time series are in excellent agreement as shown in Fig. 8. The correlation between the spatially averaged gridded gauge and GPCP in Fig. 8 (top right) is very high, exceeding a value of 0.81 calculated from individual data points in the region (see Fig. 9).
Figure 9 shows the correlation, root-mean-square deviation, and bias of gridded gauge rainfalls with respect to validation datasets for all homogeneous rainfall regimes over the country. The gridded gauge rainfall over the northwestern R-8 region has a very good correlation of 0.93 with GPCP. A small positive bias of 0.17 mm day−1 in gridded gauge dataset with respect to GPCP is also observed over this region. The correlation of this dataset with GPCP over regions R-4, R-7, and R-1 is 0.81, 0.88, and 0.81 at the 95% significance level whereas its bias with respect to GPCP is −0.29, −0.08, and −0.83 mm day−1, respectively. The correlation over the rest of the region is between 0.66 (minimum) to 0.93 (maximum) whereas the bias is between 0.03 (minimum) and 0.41 mm day−1 (maximum). The wet bias of 0.29 mm day−1 in GPCP rainfall over the R-4 region is probably due to the fact that the bias in multisatellite rainfall used in GPCP rainfall is not sufficiently corrected because of the limited number of rain gauge stations incorporated in the analysis. In similar manner, a dry bias in GPCP rainfall over the southwestern region delineated as the homogeneous rainfall regime R-1 might also be linked to the relatively small number of rain gauge stations used given the complex topography over this region. We also suspect that since this region gets almost continuous rainfall throughout the year, from both convective clouds during JJAS and low and warm clouds during the rest of the rainy period, the algorithms used for estimation of rainfall from satellite radiances may underestimate rainfall particularly for those satellites with visible and infrared (VIS/IR) sensors.
The dry bias in GPCP rainfall over the R-8 and R-1 regions with a relatively complex topography might have come from combined factors of low dense network of rain gauges over lowlands and orographic precipitation over the highlands. It has also been suggested in one of the previous studies that the GPCP analysis over land may be underestimating precipitation in some regions with orographic features (Nijssen et al. 2001). Adler et al. (2003) examined this underestimation and found it to be related mainly to the relative lack of rain gauges in mountainous regions. The density of the rain gauge data being used may not be sufficient to reliably reproduce spatial structures, even if the orographic information would be used as part of the analysis. Unfortunately the satellite observations, both passive microwave and IR, also have difficulty detecting shallow orographic precipitation. Improvement in this area will probably require additional gauges going into the analysis and/or gauge analysis techniques that use terrain information to adjust the gauges that are available.
2) Comparison with CMAP
The time series of spatially averaged rainfall over the representative homogeneous regions is depicted in Fig. 10. The agreement is very good for all four regions throughout the 30-yr period of gauge rainfall. The correlation is remarkably high for the R-8 and R-7 regions but slightly lower than that of correlation with GPCP. Despite the small number of gauge data included in the reconstruction and gridding over the R-4 region, the agreement with CMAP is better than GPCP. A factor that may contribute to the observed larger differences between the gridded gauge rainfall and GPCP over R-4 than CMAP is the methodology used when there is no gauge measurement in a grid box. In the GPCP analysis, the precipitation values in the grid boxes where no gauge measurements are available are determined by interpolating the gauge measurements from surrounding areas. But this procedure in data-sparse regions with steep gradients in mean precipitation such as R-4 introduces additional error. In the CMAP-merged analysis, the values of precipitation are determined by the modified satellite estimates and therefore do not suffer from interpolation error. This difference in analysis methods can contribute to some of the observed differences over data-sparse regions. Gruber et al. (2000) have also attributed the relative difference of GPCP and CMAP, approaching 50% over central equatorial Africa, to this difference of analysis.
The dry bias in CMAP rainfall over western and southwestern regions (R-8 and R-1) is less likely to be purely due to satellite bias as a result of weak constraint by gauge rainfall and warm clouds. Most GCM and RCM models overestimate rainfall over Ethiopian highlands while they underestimate over lowlands on account of the sensitivity of convection schemes to topography (Gebremariam 2011; Segele et al. 2008).
The gridded gauge rainfall over the R-8 region has a very good correlation of 0.91 and a positive bias of 0.56 mm day−1 with respect to CMAP calculated from individual grid points, which is higher than the corresponding value with respect to GPCP. The correlations of this dataset with CMAP over regions R-4, R-7, and R-1 are 0.66, 0.85, and 0.79 at a 95% significance level whereas its bias with respect to CMAP is −0.13, 0.09, and 1.28 mm day−1 as shown in Fig. 9. The relation between CMAP and the new datasets described in terms of correlation and bias over these rainfall regimes is slightly poorer than what has been found for GPCP. The better agreement of GPCP with the gridded dataset than CMAP over most parts of Ethiopia is partly attributed to the large number of gauge rainfalls blended with satellite data in this version of the GPCP dataset. This kind of discrepancy between GPCP and CMAP was also found in a different study by Gruber et al. (2000) but attributed to a difference in the methodology of the blending gauge and satellite datasets as well as to the use of corrected gauge in the case of GPCP. We cannot rule out the contribution of this difference to the observed discrepancy between the three datasets.
3) Comparison with CRUTS3.0
The time series of spatially averaged rainfall over the representative homogeneous regions is depicted in Fig. 11. The agreement is very good for all four regions throughout the overlap period of gauge rainfall, both in magnitude and phase. The high correlation and small bias are remarkably reflected in the spatially averaged time series for all regions. However, unlike GPCP and CMAP, which exhibit very good agreement over the whole data reconstruction period, there is some overshot in CRUTS3.0 dataset during a few years of coincident observations as shown in Fig. 11 for 1998–2000 and 2005 for R-8 (top left), 1999–2001 for R-7 (top right), and during several years over the R-4 regions. However, the R-1 region has a dry bias in GPCP over the whole 30-yr period, which is evident from wet bias of gauge rainfall relative to GPCP in Fig. 9.
The gridded gauge rainfall over the R-8 region has a very good correlation of 0.91 and small positive bias of 0.06 mm day−1 with respect to CRUTS3.0. This small bias shows that the gauge dataset and CRUTS3.0 do not only correlate well in the rainfall pattern but also agree well on the absolute magnitude of the rainfalls at each grid during the whole period of the time series. The correlation of this dataset with CRUTS3.0 over regions R-4, R-7, and R-1 is 0.77, 0.84, and 0.75 at the 95% significance level whereas its bias with respect to CRUTS3.0 is −0.25, −0.01, and 0.77 mm day−1 as shown in Fig. 9. Again, CRUTS3.0 captures the magnitude of the rainfall in the observations better than CMAP except for the R-4 region, for which both GPCP and CMAP have low bias. The correlation between CRUTS3.0 and the new datasets over the rainfall regimes is slightly lower than what has been found for GPCP. However, the bias with respect to CRUTS3.0 is substantially smaller than that found with respect to GPCP for some regions.
The low bias between CRUTS3.0 and the new dataset is due to the fact that both are gauge rainfalls except for the difference in the number of rain gauge stations, which is larger in our case than CRUTS3.0. The inhomogeneity adjustment mechanisms are different since CRUTS3.0 relies on reference series constructed from nearby stations and it is unlikely to find good ones in data-sparse regions of Africa.
4) Comparison with TRMM
Actual observed rainfall generally shows a significant variability on inter annual time scales. Figure 12 compares the interannual variability of the seasonal rainfall obtained from the gauge analysis with that of the TRMM 3B42 V6 estimates over the spatially coherent rainfall regimes. The low correlation and high positive bias are remarkably reflected in the average time series for all regions. The interannual variability of observed rainfall has the right phase with the satellite estimates though the amplitudes are consistently underestimated with the exceptions during 1999 for the R-8 and R-7 regions and during the period 2004–07 over the R-4 region. The performance of TRMM varies over the different climatic regimes depending on the topography and rainfall characteristics of the region, although the biases are consistent and systematic. TRMM 3B42 V6 estimates could have tremendous potential to be used for intraseasonal studies over most regions of Ethiopia.
The reconstructed and gridded gauge rainfall over the R-8 region has a very good correlation of 0.83 and positive bias of 0.64 mm day−1 with respect to TRMM. The correlation of this dataset with TRMM over the R-4, R-7, and R-1 regions is 0.61, 0.82, and 0.67 at the 95% significance level whereas its bias with respect to TRMM is 0.17, 0.31 and 1.13 mm day−1, respectively, as shown in Fig. 9. The correlation with TRMM is relatively lower than the rest of datasets. Moreover, the new dataset has a positive bias over all four regions, which shows that TRMM rainfall is slightly drier than the reconstructed dataset. The comparison of individual observations (not shown) reveals that the satellite measurements are incapable of capturing some of the heavy rain rates. Our observation is consistent with the report by Chokngamwong and Chiu (2006). TRMM tends to produce conservative rainfall estimate as reflected in its wet bias for low rainfall and dry bias in heavy observed rainfall over Ethiopia. This has been observed in a different study by Nair et al. (2009), who show that there is a net tendency for the lower observed rainfalls to be overestimated by the satellite, whereas the higher rainfall values are underestimated over the western state of India. They have found that the scatter is more for observed values greater than 15 mm day−1. However, in this study, the scatter differs from one homogeneous rainfall regime to the other. For instance, the scatter (not shown) is more for observed values greater than 2 mm day−1 for R-8, 1 mm day−1 for R-7 and R-1, and all observations for R-4 regions.
5) Comparison with ERA-Interim
The time series of average rainfall over the representative homogeneous regions is depicted in Fig. 13. Now, we can clearly see that the magnitude of seasonal rainfall from ERA-Interim is higher than the corresponding gridded and reconstructed gauge dataset. However, ERA-Interim captures correctly the different phases of monthly rainfall in contrast to TRMM, which fails to capture interannual rainfall variability during small rainy seasons.
The reconstructed and gridded gauge rainfall over R-8 region has a very good correlation of 0.88 and large negative positive bias of −1.48 mm day−1 with respect to ERA-Interim rainfall. The correlation of this dataset with ERA-Interim over the R-4, R-7, and R-1 regions is 0.72, 0.77, and 0.66 at 95% significant level whereas its bias with respect to ERA-Interim is −0.41, −0.16, and −0.99 mm day−1, respectively, as shown in Fig. 9. Although the correlation of the new dataset with ERA-Interim is comparable to that with TRMM, the biases are large and negative in clear contrast to TRMM. It is not surprising to find such large negative bias in the gauge rainfall with respect to ERA-Interim over R-8 regions since ERA-Interim rainfall is mainly driven by model physics, which often overestimates rainfall over the highlands because of its convection scheme. Our assertion is also confirmed by the fact that there is small negative bias over rift valley and adjoining highlands (R-7) and southern (R-4) regions. Besides, Dee et al. (2011) have concluded that ERA-Interim shows more rainfall than GPCC in most of the Northern Hemisphere, and in parts of South America. The large differences over central Africa are indicative of higher uncertainties due to the sparse radiosonde coverage there, particularly during the first decade of the averaging period. This is also evident in the decadal change estimates, where ERA-Interim probably overestimates the decrease in rainfall in the central African region. A possible explanation is the presence of a substantial warm bias in the model associated with underestimated aerosol optical depth in the region. This, combined with the improved data coverage during the second decade, results in apparent cooling accompanied by reduced precipitation. Moreover, Dee et al. (2011) have shown that ERA-Interim-GPCP for 1990 is positive over western half of Ethiopia and negative over eastern half, which seems to suggest that a combination of factors such as weakness in model physics and insufficient observation to constrain the model is a cause for the observed wet bias in ERA-Interim.
d. Comparison with other datasets: Seasonal variability over homogeneous rainfall regimes
The GPCP seasonal cycle shown in Fig. 14 clearly demonstrates the agreement and discrepancy observed in Fig. 8 for the four regions. Both GPCP and the gridded gauge dataset show good agreement on the phase and magnitude of 30-yr mean seasonal rainfall. The two datasets perfectly agree that MAM is the maximum rainfall season whereas JJAS is the small rainfall season of the bimodal R-4 region. The two datasets agree also that the R-7 region is a bimodal region with maximum rainfall in JJAS and minimum in MAM rainy seasons while the R-8 and R-1 regions are monomodal regions with a rainfall peak in JJAS and continuous rainfall from March to October, respectively. However, the magnitude of the mean seasonal rainfall from GPCP is constantly lower than the gridded gauge seasonal mean rainfall amount throughout the whole rainy period over the R-1 regions. This has been already noticed in Fig. 8 but it has now become quite visible in the seasonal mean in the form of a constant and systematic offset from observation. In contrast, GPCP mean seasonal rainfall is constantly higher than gridded gauge observation during both MAM and SON seasons, but the differences in relative magnitude of the two seasons from the datasets are consistent with each other. The discrepancy in the magnitude of the seasonal rainfall in GPCP and gridded gauge over the R-4 and R-1 homogeneous regions is likely attributable to orographic features and relative small number of gauges to constrain these features in GPCP over R-1 and extrapolating, from gauges over the adjacent highlands due to none or few gauges over the R-4 regions, is likely to fail due to strong rainfall spatial gradients.
The CMAP and gridded gauge rainfall seasonal cycles in Fig. 14 depict the agreement and discrepancy observed in the 30-yr time series in Fig. 10 more clearly for the four regions. Both CMAP and the gridded gauge datasets show good agreement on the timing of peak rainy period and the relative magnitude of the rainfall in the different seasons. The magnitude of mean seasonal rainfall is underestimated by CMAP for R-8, R-7, and R-1 for all the seasons. CMAP mean seasonal rainfall is also lower than that of GPCP over these regions, implying that CMAP has a dry bias with respect to both GPCP and gridded gauge rainfall. On the other hand, the magnitude of CMAP seasonal rainfall over the R-4 region shows good agreement with the gauge in a clear contrast to GPCP, which has a constant offset near peak rainfall period during the two seasons. This discrepancy between GPCP and CMAP has been attributed to the difference in analysis methodology in the previous section.
The patterns and magnitudes of CRUTS3.0 and gauge seasonal mean rainfalls in Fig. 14 depict the silent features of interannual variability observed in the corresponding time series in Fig. 11 for the four regions. Both CRUTS3.0 and the gauge datasets show good agreement on the time of peak rainy period and the relative magnitude of the rainfall in the different seasons. CRUTS3.0 mean seasonal rainfall exhibits a wet bias over the R-8, R-4, and R-7 regions but a dry bias over R-1 region with respect to gridded gauge rainfall. Nevertheless, CRUTS3.0 still has a wet bias with respect to GPCP and CMAP over the R-1 region. This demonstrates that the CRUTS3.0 dataset built from gauge observation lacks additional information that comes from satellite observations present in both GPCP and CMAP. The question that may follow from this inference is why the newly gridded and reconstructed dataset agrees better with GPCP and CMAP than with CRUTS3.0 given that both are built from gauge observations. There are two possible explanations. The obvious one is related to the fact that the number of gauges used in the two dataset is different; and the gridding techniques as well as reconstruction methods for the two datasets are different. The inhomogeneity adjustment might have also contributed to the discrepancy between the gridded gauge and CRUTS3.0 rainfalls since CRUTS3.0 relies on reference series, and it is difficult to find good and representative ones in data-sparse regions as discussed in the previous section.
A similar comparison of TRMM and gauge mean seasonal rainfall in Fig. 14 reveals that TRMM and the gauge dataset show good agreement on the phase of rainfall and the relative magnitude of the mean seasonal rainfall over the R-4 and R-7 regions while large discrepancies still exist over the R-8 and R-1 regions. Moreover, TRMM tends to indicate there are two rainy season for R-1 region unlike the gridded gauge, GPCP, CMAP, and CRUTS3.0 datasets. The general tendency is that TRMM underestimates observed rainfall over R-8, R-7, and R-1 except over R-1 regions for which it agrees quite well with gauge data at both peaks during MAM and SON. Apart from this exception, this work shares the conclusions of Javanmard et al. (2010), who have shown that TRMM underestimates mean annual precipitation over Iran. Furthermore, TRMM and CRUTS3.0 appear to be two extremes with respect to gridded gauge, GPCP, and CMAP datasets exhibiting relative dry and wet biases, respectively.
The large wet bias in ERA-Interim shown in Fig. 13 with respect to gridded gauge rainfall apparently characterizes the relation between ERA-Interim and all other datasets as shown in Fig. 14 over all four rainfall regimes. The wet bias ranges from as large as 5 mm day−1 over the northwestern, R-8 region to as low as 1 mm day−1 over the other three regions during peak rainfall months. The fact that ERA-Interim captures the phases as well as any of the other datasets but fails to capture the magnitude, which is offset by as much as 5 mm day−1 at the peak rainfall period, strengthens our previous argument that 1) the model physics, particularly the convection scheme in GCMs and regional climate models (RCMs), is far from perfect, and 2) the moisture field for assimilation in the model is hardly available over Ethiopia.
This study has carefully addressed the problem of data gaps and inhomogeneity in climate records such as rainfall. Two advanced data gap filling and reconstruction techniques have been utilized over Ethiopia. The change point detection and adjustment algorithms are also robust and widely used methods. The reconstructed and gap-filled rainfall data have been gridded using ordinary kriging. The applicability of MSSA techniques for data filling and ordinary kriging for gridding over Ethiopia is evaluated through creating artificial data gaps in the observation and cross validation, respectively. The performance evaluations have shown that both analysis tools are applicable to Ethiopian complex topography. As a result, a good-quality rainfall dataset is created for Ethiopia that can be used for various purposes such as detection of climate change and attribution studies related to different climate signals in the datasets. The quality of the gridded gauge rainfall dataset is evaluated based on its skill to capture the dominant seasonal cycles and rainfall distribution using harmonic analysis and self-organizing maps (SOMs). The dataset revealed the monomodal and bimodal rainfall types and their areas of influence in a more quantitative and robust manner than previous studies. It has been determined that Ethiopia has nine homogeneous rainfall regimes with distinct times of peak rainfall and duration of rainy seasons. Moreover, the variability of monthly rainfall, for instance in the dominant rainy season (JJAS) over western Ethiopia (categorized as subregions R-1, R-8, and R-9) accounts for ~65%–85% of the total rainfall variability over these regions and reaches its peak in July. The remaining variability of the monthly rainfall is due to rainfalls in the other months as these regions get rainfall throughout nearly the whole year.
The rift valley and adjoining highlands receive much of the rainfall during JJAS, which accounts for 30%–60% of rainfall variability, and secondary rainfall during MAM, which accounts for 20%–30% of the monthly rainfall variability. The percentage range shows spatial difference in the amount and duration of rainy season. The remaining small percentage of variability is accounted by small and sporadic rains during other months of the year. Similarly, the southern regions delineated as R-2, R-3, and R-4 by the SOM clustering approach receive rainfall during JJAS, which accounts for 5%–15%, and MAM and SON, which account for 30%–50% each, of the monthly rainfall variability.
The dataset is also validated against CRUTS3.0, TRMM, and ERA-Interim and blended gauge and satellite rainfalls of GPCP and CMAP datasets. The intercomparison of the datasets and gridded gauge rainfalls have shown that gridded gauge rainfall is in a very good agreement with GPCP over all homogeneous regimes except R-5 in terms of large correlation and small bias. CMAP and CRUTS3.0 also exhibit good agreement over most regions while TRMM and ERA-Interim have bias in extreme directions. The dry bias in TRMM with respect to gridded monthly gauge rainfall is consistent with similar observations in separate studies over other regions. The wet bias in ERA-Interim with respect to gridded monthly gauge rainfall is probably due to the combined effects of deficiency in model physics, particularly the moisture scheme and the absence of a moisture field for assimilation into the model over Ethiopia. To our knowledge, there has been only one radiosonde sounding once a day over Addis Ababa station in the recent past. The overall, spatial, and temporal coherence of the newly created gridded monthly rainfall dataset with GPCP, CMAP, CRUTS3.0, TRMM, and ERA-Interim datasets is quite remarkable. This gives us assurance that the new dataset characterizes the region very well as it is created from quality-controlled, inhomogeneity-adjusted and data gap–filled in situ observation. Therefore, the dataset is worth using to identify and understand variations and changes of regional and global climate, particularly for the detection of climate change, attribution of climate change, and more thorough characterization of the natural climate variability. Moreover, the dataset is a valuable source of information to validate statistically downscaled GCM and RCM rainfall over the region. The new dataset is available to the scientific community upon request free of charge.
The author acknowledges the generous support of the Ethiopian National Meteorological Services, which has provided the monthly rainfall data used in this study for graduate-level studies at different times over the last few years. The author would also like to extend his gratitude to all respective centers that provided GPCP, CMAP, TRMM, CRUTS3.0, and ERA-Interim rainfall datasets.