  • Badr, H. S., B. F. Zaitchik, and A. K. Dezfuli, 2015: A tool for hierarchical climate regionalization. Earth Sci. Inf., 8, 949–958, doi:10.1007/s12145-015-0221-7.

  • Bartle, A., 2002: Hydropower potential and development activities. Energy Policy, 14, 1231–1239, doi:10.1016/S0301-4215(02)00084-8.

  • Bekele, F., 1997: Ethiopian use of ENSO information in its seasonal forecasts. Internet J. Afr. Stud., 2. [Available online at http://www.bradford.ac.uk/research-old/ijas/ijasno2/bekele.html.]

  • Bisetegne, D., L. Ogallo, and J. Ininda, 1986: Rainfall characteristics in Ethiopia. Proc. First Technical Conf. on Meteorological Research in Eastern and Southern Africa, Nairobi, Kenya, UCAR.

  • Black, E., J. Slingo, and K. R. Sperber, 2003: An observational study of the relationship between excessively strong short rains in coastal East Africa and Indian Ocean SST. Mon. Wea. Rev., 131, 74–94, doi:10.1175/1520-0493(2003)131<0074:AOSOTR>2.0.CO;2.

  • Block, P. J., and B. Rajagopalan, 2007: Interannual variability and ensemble forecast of upper Blue Nile basin Kiremt season precipitation. J. Hydrometeor., 8, 327–343, doi:10.1175/JHM580.1.

  • Camberlin, P., 1997: Rainfall anomalies in the source region of the Nile and their connection with the Indian summer monsoon. J. Climate, 10, 1380–1392, doi:10.1175/1520-0442(1997)010<1380:RAITSR>2.0.CO;2.

  • Conway, D., 2000: The climate and hydrology of the upper Blue Nile River. Geogr. J., 166, 49–62, doi:10.1111/j.1475-4959.2000.tb00006.x.

  • Craven, P., and G. Wahba, 1978: Smoothing noisy data with spline functions. Numer. Math., 31, 377–403, doi:10.1007/BF01404567.

  • Degefu, W., 1987: Some aspects of meteorological drought in Ethiopia. Drought and Hunger in Africa, M. H. Glantz, Ed., Cambridge University Press, 23–36.

  • Dinku, T., K. Hailemariam, R. Maidment, E. Tarnavsky, and S. Connor, 2014: Combined use of satellite estimates and rain gauge observations to generate high-quality historical rainfall time series over Ethiopia. Int. J. Climatol., 34, 2489–2504, doi:10.1002/joc.3855.

  • Diro, G. T., D. I. F. Grimes, E. Black, A. O’Neill, and E. Pardo-Iguzquiza, 2009: Evaluation of reanalysis rainfall estimates over Ethiopia. Int. J. Climatol., 29, 67–78, doi:10.1002/joc.1699.

  • Diro, G. T., D. I. F. Grimes, and E. Black, 2011a: Teleconnections between Ethiopian summer rainfall and sea surface temperature: Part I—Observation and modelling. Climate Dyn., 37, 103–119, doi:10.1007/s00382-010-0837-8.

  • Diro, G. T., D. I. F. Grimes, and E. Black, 2011b: Teleconnections between Ethiopian summer rainfall and sea surface temperature: Part II. Seasonal forecasting. Climate Dyn., 37, 121–131, doi:10.1007/s00382-010-0896-x.

  • Eklundh, L., and P. Pilesjö, 1990: Regionalization and spatial estimation of Ethiopian mean annual rainfall. Int. J. Climatol., 10, 473–494, doi:10.1002/joc.3370100505.

  • Elagib, N. A., and M. M. Elhag, 2011: Major climate indicators of ongoing drought in Sudan. J. Hydrol., 409, 612–625, doi:10.1016/j.jhydrol.2011.08.047.

  • Gamachu, D., 1977: Aspects of Climate and Water Budget in Ethiopia. Tech. Monogr., Addis Ababa University Press, 71 pp.

  • Giannini, A., R. Saravanan, and P. Chang, 2003: Oceanic forcing of Sahel rainfall on interannual to interdecadal time scales. Science, 302, 1027–1030, doi:10.1126/science.1089357.

  • Gissila, T., E. Black, D. I. F. Grimes, and J. M. Slingo, 2004: Seasonal forecasting of the Ethiopian summer rains. Int. J. Climatol., 24, 1345–1358, doi:10.1002/joc.1078.

  • Goddard, L., and N. E. Graham, 1999: Importance of the Indian Ocean for simulating rainfall anomalies over eastern and southern Africa. J. Geophys. Res., 104, 19 099–19 116, doi:10.1029/1999JD900326.

  • Gong, X., and M. B. Richman, 1995: On the application of cluster analysis to growing season precipitation data in North America east of the Rockies. J. Climate, 8, 897–931, doi:10.1175/1520-0442(1995)008<0897:OTAOCA>2.0.CO;2.

  • Griffiths, J., 1972: Ethiopian highlands. Climates of Africa, J. Griffiths, Ed., World Survey of Climatology, Vol. 10, Elsevier, 369–388.

  • Hartigan, J. A., 1975: Clustering Algorithms. John Wiley & Sons, 351 pp.

  • Jain, A. K., M. N. Murty, and P. J. Flynn, 1999: Data clustering: A review. ACM Comput. Surv., 31, 264–323, doi:10.1145/331499.331504.

  • Kalnay, E., et al., 1996: The NCEP/NCAR 40-Year Reanalysis Project. Bull. Amer. Meteor. Soc., 77, 437–471, doi:10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.

  • Kassahun, B., 1987: Weather systems over Ethiopia. Proc. First Technical Conf. on Meteorological Research in Eastern and Southern Africa, Nairobi, Kenya, UCAR, 53–57.

  • Korecha, D., and A. G. Barnston, 2007: Predictability of June–September rainfall in Ethiopia. Mon. Wea. Rev., 135, 628–650, doi:10.1175/MWR3304.1.

  • Korecha, D., and A. Sorteberg, 2013: Validation of operational seasonal rainfall forecast in Ethiopia. Water Resour. Res., 49, 7681–7697, doi:10.1002/2013WR013760.

  • Latif, M., D. Dommenget, M. Dima, and A. Grötzner, 1999: The role of Indian Ocean sea surface temperature in forcing east African rainfall anomalies during December–January 1997/98. J. Climate, 12, 3497–3504, doi:10.1175/1520-0442(1999)012<3497:TROIOS>2.0.CO;2.

  • Manning, C. D., P. Raghavan, and H. Schütze, 2008: Introduction to Information Retrieval. Cambridge University Press, 506 pp.

  • NMSA, 1996: Climatic and agroclimatic resources of Ethiopia. National Meteorological Services Agency of Ethiopia Research Rep. 1, 37 pp.

  • Segele, Z. T., and P. J. Lamb, 2005: Characterization and variability of Kiremt rainy season over Ethiopia. Meteor. Atmos. Phys., 89, 153–180, doi:10.1007/s00703-005-0127-x.

  • Segele, Z. T., P. J. Lamb, and L. M. Leslie, 2009: Large-scale atmospheric circulation and global sea surface temperature associations with Horn of Africa June–September rainfall. Int. J. Climatol., 29, 1075–1100, doi:10.1002/joc.1751.

  • Seleshi, Y., and U. Zanke, 2004: Recent changes in rainfall and rainy days in Ethiopia. Int. J. Climatol., 24, 973–983, doi:10.1002/joc.1052.

  • Shanko, D., and P. Camberlin, 1998: The effects of the southwest Indian Ocean tropical cyclones on Ethiopian drought. Int. J. Climatol., 18, 1373–1388, doi:10.1002/(SICI)1097-0088(1998100)18:12<1373::AID-JOC313>3.0.CO;2-K.

  • Sugar, C. A., and G. M. James, 2003: Finding the number of clusters in a dataset. J. Amer. Stat. Assoc., 98, 750–763, doi:10.1198/016214503000000666.

  • Tadesse, T., 1994: The influence of the Arabian Sea storms/depressions over the Ethiopian weather. Proc. Int. Conf. on Monsoon Variability and Prediction, Geneva, Switzerland, World Meteorological Organization, 228–236.

  • Thorndike, R. L., 1953: Who belongs in the family? Psychometrika, 18, 267–276, doi:10.1007/BF02289263.

  • Tibshirani, R., G. Walther, and T. Hastie, 2001: Estimating the number of clusters in a data set via the gap statistic. J. Roy. Stat. Soc., 63B, 411–423, doi:10.1111/1467-9868.00293.

  • Tsegay, W., 1998: El Niño and drought early warning in Ethiopia. Internet J. Afr. Stud., 2. [Available online at http://www.bradford.ac.uk/research-old/ijas/ijasno2/Georgis.html.]

  • Viste, E., and A. Sorteberg, 2013a: The effect of moisture transport variability on Ethiopian summer precipitation. Int. J. Climatol., 33, 3106–3123, doi:10.1002/joc.3566.

  • Viste, E., and A. Sorteberg, 2013b: Moisture transport into the Ethiopian highlands. Int. J. Climatol., 33, 249–263, doi:10.1002/joc.3409.

  • Ward, J. H., Jr., 1963: Hierarchical grouping to optimize an objective function. J. Amer. Stat. Assoc., 58, 236–244, doi:10.1080/01621459.1963.10500845.


Optimal Cluster Analysis for Objective Regionalization of Seasonal Precipitation in Regions of High Spatial–Temporal Variability: Application to Western Ethiopia

  • 1 Department of Civil and Environmental Engineering, University of Wisconsin–Madison, Madison, Wisconsin
  • 2 School of Civil and Environmental Engineering, Addis Ababa Institute of Technology, Addis Ababa University, Addis Ababa, Ethiopia
  • 3 Department of Civil and Environmental Engineering, University of Wisconsin–Madison, Madison, Wisconsin

Abstract

Defining homogeneous precipitation regions is fundamental for hydrologic applications, yet nontrivial, particularly for regions with highly varied spatial–temporal patterns. Traditional approaches typically include aspects of subjective delineation around sparsely distributed precipitation stations. Here, hierarchical and nonhierarchical (k means) clustering techniques on a gridded dataset for objective and automatic delineation are evaluated. Using a spatial sensitivity analysis test, the k-means clustering method is found to produce much more stable cluster boundaries. To identify a reasonable optimal k, various performance indicators, including the within-cluster sum of square errors (WSS) metric, intra- and intercluster correlations, and postvisualization, are evaluated. Two new objective selection metrics (difference in minimum WSS and difference in difference) are developed based on the elbow method and gap statistics, respectively, to determine k within a desired range. Consequently, eight homogeneous regions are defined with relatively clear and smooth boundaries, as well as low intercluster correlations and high intracluster correlations. The underlying physical mechanisms for the regionalization outcomes not only help justify the optimal number of clusters selected, but also prove informative in understanding the local- and large-scale climate factors affecting Ethiopian summertime precipitation. A principal component linear regression model to produce cluster-level seasonal forecasts also proves skillful.

Corresponding author address: Paul Block, Department of Civil and Environmental Engineering, University of Wisconsin–Madison, 1415 Engineering Dr., Engineering Hall 2205, Madison, WI 53706. E-mail: paul.block@wisc.edu


1. Introduction

Defining homogeneous precipitation regions for hydrologic modeling, ecological and climate classification, prediction, or other analysis is nontrivial given variation in both temporal and spatial patterns. Multiple methods exist to delineate boundaries and define the optimal number of clusters (e.g., Bisetegne et al. 1986; Gong and Richman 1995; Jain et al. 1999; Gissila et al. 2004). Ideally, an objective method is selected to foster reproducibility; however, even so, traditional approaches typically include aspects of subjective delineation. Here we evaluate various regionalization methods for objective delineation and define a number of approaches for optimally selecting an appropriate number of clusters. These techniques are applied to seasonal precipitation in Ethiopia for illustration; however, transferability to other variables and regions is possible.

Precipitation in Ethiopia is tied to many important sectors, defining lives, livelihoods, and major parts of the domestic economy. Ethiopia is the source of the Blue Nile River and others, offering substantial hydropower potential, second only to that of the Democratic Republic of Congo (Bartle 2002). Also, given that only 1% of the cultivated land is irrigated (Korecha and Sorteberg 2013), rain-fed yields are chiefly subject to precipitation quantity and timing, effectively dictating Ethiopia’s agricultural economy. Precipitation extremes, both droughts and floods, are also not uncommon across the country, exacerbating Ethiopia’s vulnerability. Improving our understanding and attribution of Ethiopia’s interannual variability in precipitation could benefit the country by supporting development of a reliable seasonal prediction system to improve reservoir operations and water allocation, strategic planning of agricultural production, and preparation for potential natural disasters. However, with high temporal variability, this is a challenging task.

In addition to large interannual variability (Fig. 1), highly variable spatial patterns of precipitation also add complexity to attribution and prediction. Numerous studies to date point to the migration of the intertropical convergence zone (ITCZ), multiple regional hydroclimatic system interactions, and topographic influences as the leading explanatory mechanisms in describing precipitation variability in Ethiopia (e.g., Griffiths 1972; Gamachu 1977; NMSA 1996; Conway 2000; Seleshi and Zanke 2004). Teleconnected large-scale climate variables are also shown to be influential, particularly El Niño–Southern Oscillation (ENSO) (e.g., NMSA 1996; Camberlin 1997; Bekele 1997; Tsegay 1998; Gissila et al. 2004; Segele and Lamb 2005; Block and Rajagopalan 2007; Korecha and Barnston 2007; Diro et al. 2011a; Elagib and Elhag 2011). More recently, effects of the Indian Ocean are cited (e.g., Shanko and Camberlin 1998; Goddard and Graham 1999; Latif et al. 1999; Black et al. 2003), as are the Azores, St. Helena, and Mascarene high-pressure systems (Kassahun 1987; Tadesse 1994; NMSA 1996; Segele and Lamb 2005). These numerous and diverse drivers of variability, and their interactions, lead to a complex spatial and temporal precipitation regime. Some efforts at regionalization—specifically defining boundaries of homogeneous precipitation regions—have been undertaken but traditionally rely on subjective delineation.

Fig. 1.

Study region (framed) of western Ethiopia and sample sites with June–September seasonal total precipitation (mm) time series from 1983 to 2011. Red circles indicate the precipitation in 1997, which is a strong El Niño year.

Citation: Journal of Climate 29, 10; 10.1175/JCLI-D-15-0582.1

Currently, the National Meteorological Agency (NMA) of Ethiopia divides the country into eight homogeneous regions subjectively according to major atmospheric and oceanic circulation mechanisms and typical rain-producing systems affecting respective regions (Korecha and Sorteberg 2013). A number of research studies have also proposed regionalization methods and subsequent clusters. Gissila et al. (2004) group Ethiopia into four clusters by comparing the seasonal cycle subjectively and analyzing the coherence of interannual variability from 24 stations. Diro et al. (2009) follow the same methodology but divide Ethiopia into five regions based on data from 33 stations; in addition, they adjust regional boundaries according to the interannual variability for spring (February–May) and summer (June–September) seasons, respectively, considering different homogeneous regions affected by diverse large-scale forcings in different seasons. In other studies, principal component analysis (PCA) is applied to identify homogeneous rainfall zones. Bisetegne et al. (1986) create five regional groups based on only 21 Ethiopian rainfall stations using PCA, retaining four eigenvectors explaining 75% of the variance. Eklundh and Pilesjö (1990) also perform PCA on rainfall data from 63 stations and divide the country into seven regions. These methods were all applied to nonuniform station data and require subjective grouping of stations, interpretation, and hand delineation of boundaries. Not only are these outputs time-consuming to develop, but the underlying methods also introduce subjective errors that cannot be quantified.

This motivates analysis to objectively and automatically define homogeneous precipitation zones, preferably with a uniform dataset; for this study, we propose a k-means clustering statistical method applied to a gridded rainfall dataset. Western Ethiopia is clustered into homogeneous regions based on the Kiremt season total precipitation spanning June through September (JJAS). This season produces approximately 70% of the upper Blue Nile basin annual precipitation (Conway 2000) and coincides with the major agricultural activities (Degefu 1987). A cluster-based season-ahead prediction model is also presented for demonstration of utility; however, seasonal prediction is not the only application of regionalization. The clustering results can also be used for regional planning and management, hazard evaluation, and so forth. Hence, the objective of this paper is not centered on prediction techniques, but rather regionalization through cluster analysis that may subsequently lead to improved seasonal precipitation prediction.

2. Gridded precipitation datasets

A 0.1° × 0.1° gridded monthly precipitation dataset from NMA is utilized in this research (NMA 2014, unpublished data). The data are a merged product of satellite estimates and station measurements with spatial coverage over western Ethiopia from 1983 to 2011 (Fig. 1; Dinku et al. 2014). This product has been shown to reproduce station data over areas with both densely and sparsely distributed station networks. Data are aggregated to JJAS seasonal total precipitation over the 29 years for each of the 7320 grid cells.
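The seasonal aggregation described above can be illustrated with a minimal sketch. The nested-dictionary layout, the function name `jjas_totals`, and the monthly values are hypothetical and chosen only for illustration; they are not the format of the NMA product.

```python
# Hypothetical layout for one grid cell: monthly[year][month] -> precipitation (mm).
def jjas_totals(monthly, months=(6, 7, 8, 9)):
    """Return {year: June-September total precipitation} for one grid cell."""
    return {
        year: sum(by_month[m] for m in months)
        for year, by_month in sorted(monthly.items())
    }


# Toy grid cell with two years of made-up monthly values (mm).
cell = {
    1983: {m: 10.0 * m for m in range(1, 13)},
    1984: {m: 5.0 * m for m in range(1, 13)},
}
totals = jjas_totals(cell)
```

Applying the same function to every grid cell would give one seasonal time series per cell, which is the input assumed by the clustering sketches in this section.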

3. Hierarchical and nonhierarchical cluster analysis

a. Description

Two main types of clustering algorithms exist for analysis of gridded data: namely, hierarchical and nonhierarchical (Jain et al. 1999); the objective is to assign each data grid cell to a cluster based on the inter-time-series correlation given the spatial–temporal dataset, analogous to using the Euclidean distance for a two-dimensional matrix. A higher correlation is equivalent to a smaller Euclidean distance and dictates which grid cells are likely to be grouped together. Hierarchical clustering produces a cluster dendrogram (Fig. 2), where, based on differently structured criteria, the two most similar grid cells would be grouped into one branch for the first step and subsequently viewed as one new unit with an averaged time series entering the next step. Two units are combined at each step. Once a unit is assigned to a branch, it cannot be detached, and the algorithm continues until the last two units are combined. Where the hierarchical “tree” is cut determines the final number of clusters. Nonhierarchical clustering, often referred to as k-means clustering, is more flexible. In contrast to hierarchical clustering, nonhierarchical clustering algorithms allow a grid cell to be reassigned to reach an optimum result. Nonhierarchical (k means) clustering algorithms typically follow these steps (Fig. 3):
  1. preselect the number of clusters k;
  2. randomly pick centroids or time series of each cluster;
  3. assign grid cells to the most similar centroid;
  4. recalculate centroids by averaging all time series assigned to that centroid;
  5. iterate steps 3 and 4 until convergence.
This produces k clusters with k cluster means or centroids, and the within-cluster sum of square errors (WSS) is minimized (at least locally, for the given initial centroids):
$$\mathrm{WSS} = \sum_{j=1}^{k} \sum_{g \in g_j} \left\lVert t_g - \mu_j \right\rVert^2 , \tag{1}$$
where WSS is the sum of the squared errors between the time series in each grid cell g ($t_g$) in cluster j ($g_j$) and the average time series in cluster j ($\mu_j$, known as the mean or centroid), summed over all k clusters.
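The five steps listed above, together with the WSS, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the initial centroids are drawn from the data points themselves, squared Euclidean distance is the dissimilarity, and the six short series are made-up toy data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length time series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(rows):
    """Element-wise mean of a list of time series (the centroid)."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def kmeans(series, k, seed=0, max_iter=100):
    """Return (labels, centroids, wss) for one k-means run."""
    rng = random.Random(seed)
    # Step 2: pick k initial centroids (assumption: sampled from the data).
    centroids = [list(s) for s in rng.sample(series, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each series to its most similar centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in series
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:
                centroids[j] = mean_series(members)
    # WSS: squared error of every series to its own centroid, summed over clusters.
    wss = sum(squared_distance(s, centroids[lab]) for s, lab in zip(series, labels))
    return labels, centroids, wss


# Toy data: three rising and three falling series (hypothetical values).
series = [
    [1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 1.9, 3.1],
    [3.0, 2.0, 1.0], [3.1, 1.9, 1.0], [2.9, 2.1, 1.1],
]
labels, centroids, wss = kmeans(series, k=2)
```

Because the result depends on the initial centroids, a run like this is typically repeated with different seeds and the partition with the smallest WSS is retained.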
Fig. 2.

Synthetic hierarchical clustering dendrogram conceptualization initialized with 10 data units. At the cutoff point shown, three clusters are identified.


Fig. 3.

Synthetic k-means clustering conceptualization based on a two-dimensional dataset (X, Y). The diamonds and the crosses represent the data points and the centroids, respectively. (a) Scatterplot of the data. Data points assigned to the closest centroid (labeled with the same index as its assigned centroid): (b) randomly assigned centroids to initiate the algorithm, (c) recalculated centroids and reassigned data points, and (d) recalculated centroids and the same reassigned results (convergence).


b. Selection of k

Although data processing is automatic and objective for both hierarchical and nonhierarchical clustering analysis, determining k is still subjective. For hierarchical clustering, this requires a cutoff point or a desired number of clusters after the dendrogram is formed. For nonhierarchical clustering, a predetermined number of clusters is required in order to initiate the algorithm. An optimal number of clusters, however, can be estimated given a certain confidence level desired for intracorrelations (i.e., correlation between the centroid of each cluster and its members) or intercorrelations (i.e., correlation between centroids of two different clusters) (Badr et al. 2015). Intuitively, relatively high intracorrelation and low intercorrelation are desired. This estimation strategy is significantly more suitable for hierarchical clustering, as it has a fixed tree structure regardless of the number of clusters selected postanalysis; although nonhierarchical clustering requires a predetermined number of clusters, intra- and intercorrelations can still be examined for several trials of k.
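As an illustration of the two diagnostics defined above, the following sketch computes intracorrelations (each member against its cluster centroid) and intercorrelations (centroid against centroid) for a given set of cluster labels. The function names, the toy series, and the labels are hypothetical; the confidence-level test itself is not reproduced here.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def centroid(rows):
    """Element-wise mean time series of a cluster."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def intra_inter_correlations(series, labels):
    """Return (intra, inter): member-to-centroid and centroid-to-centroid correlations."""
    clusters = sorted(set(labels))
    cents = {
        j: centroid([s for s, lab in zip(series, labels) if lab == j])
        for j in clusters
    }
    intra = {
        j: [pearson(s, cents[j]) for s, lab in zip(series, labels) if lab == j]
        for j in clusters
    }
    inter = {
        (i, j): pearson(cents[i], cents[j])
        for i in clusters for j in clusters if i < j
    }
    return intra, inter


# Toy data: two clusters of three short series each (hypothetical values).
series = [
    [1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 1.9, 3.1],
    [3.0, 2.0, 1.0], [3.1, 1.9, 1.0], [2.9, 2.1, 1.1],
]
labels = [0, 0, 0, 1, 1, 1]
intra, inter = intra_inter_correlations(series, labels)
```

For a trial value of k, high values in `intra` together with low values in `inter` correspond to the desirable case described in the text.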

An alternative method for selecting an optimal number of clusters is to perform a sensitivity analysis of WSS given different k (Fig. 4), particularly for nonhierarchical clustering. By evaluating the improvement of WSS when one cluster is added, the potentially optimal k can be identified; this is known as the elbow method (Thorndike 1953). This method may be problematic, however, when a large number of grid cells are considered yet a relatively small number of clusters is desired. This context is common in the field of hydroclimatology, particularly as data resolution increases yet homogeneous climatic zones remain the same size. Similar methods for determining the optimal number of clusters based on the WSS include the Akaike information criterion (AIC), Bayesian information criterion (BIC), and generalized cross validation (GCV), as well as some more sophisticated methods, such as the gap statistic (Tibshirani et al. 2001) and the jump method (Sugar and James 2003); however, these all suffer from the same problem. Thus, new methods or extensions of these are warranted and developed here to identify a reasonable number of clusters.
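The elbow idea can be made concrete with a small sketch that tabulates how much the WSS improves each time one cluster is added. The WSS values below are invented toy numbers, not results from this study, and keeping the smallest WSS across repeated trials for each k is an assumption of the sketch.

```python
# Hypothetical WSS values from repeated k-means trials for each k
# (toy numbers for illustration only).
wss_trials = {
    2: [120.0, 118.5, 119.2],
    3: [90.4, 88.0, 93.1],
    4: [80.2, 79.5, 84.0],
    5: [76.0, 75.1, 77.3],
}


def min_wss(trials):
    """Smallest WSS found for each k across its trials."""
    return {k: min(values) for k, values in trials.items()}


def improvement(best):
    """Reduction in WSS gained by going from k - 1 to k clusters."""
    return {k: best[k - 1] - best[k] for k in sorted(best) if k - 1 in best}


best = min_wss(wss_trials)
gain = improvement(best)
```

In this toy case the gain drops sharply after k = 3, which is the kind of elbow the method looks for; with many grid cells and few clusters the gains can instead decay smoothly, which is the difficulty noted above.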

Fig. 4.

WSS, given different number of clusters k based on k-means clustering results on JJAS seasonal total precipitation over the complete study region. Note that, for each box plot, the line inside the box is the median, the box edges represent the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually with crosses.


c. Sources of uncertainty

In addition to uncertainties in selection methods for the optimal number of clusters, sources of uncertainty affecting cluster outputs include noise in the raw data, the data extent, and the initial cluster state, each of which may bias results. In hierarchical clustering, because reassignment is disallowed once two units are assigned to the same branch, a small bias leads to large biases in subsequent steps through a chain reaction. Comparatively, since nonhierarchical clustering is a flat divisive process, it is likely to be less affected by noisy data, if at all. If a subset of the data is initially selected and regionalized, subsequently extending the analysis spatially can be problematic, particularly for hierarchical clustering. Therefore, both hierarchical and nonhierarchical approaches should be subjected to a sensitivity test on spatial data extended from the initial analysis. It should also be noted that nonhierarchical clustering often suffers from nonexclusive convergence when subjected to different initial states (i.e., different cluster outcomes for different initial states). Nevertheless, by iteratively exploring different initial states, the optimal clustering result defined by the minimum WSS (minWSS) can be located.

d. Clustering criteria

Three criteria for hierarchical clustering (Ward’s method, average linkage, and regional linkage) are selected for comparison with nonhierarchical (k means) clustering. Both Ward’s method and k-means clustering minimize the WSS using the Euclidean distance, defined as one minus the Pearson correlation between the time series of any two units [Eq. (2)] (Ward 1963; Hartigan 1975); however, they differ in algorithmic structure. Average linkage is another commonly used criterion, which links each candidate unit in the current step to the two units merged in the previous hierarchical level separately to calculate the overall updated Pearson correlation distance weighted by the number of grid cells in each unit [Eq. (3)]. Regional linkage is an adjusted version of the average linkage criterion incorporating the standard deviation [Eq. (4)] (Badr et al. 2015). The equations for calculating the distance based on the clustering criteria discussed above are listed here:
$$d_{x,y} = 1 - r_{x,y}, \tag{2}$$
where $d_{x,y}$ is the equivalent Euclidean distance between x and y, based on $r_{x,y}$, which is the Pearson correlation of the two time series of units x and y;
$$d_{xy,z} = \frac{n_x d_{x,z} + n_y d_{y,z}}{n_x + n_y}, \tag{3}$$
where $d_{xy,z}$ is the overall updated Pearson correlation distance between candidate unit z and already merged units x and y in the previous hierarchical stage, and the number of members in units x and y are represented by $n_x$ and $n_y$; and
e4
where $\sigma_{xy}$ is the standard deviation of the mean time series of merged x and y at the final stage; all other parameters are the same as in Eq. (3).
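As a minimal sketch of the first two distance definitions, the snippet below assumes that Eq. (2) is one minus the Pearson correlation, as stated in the text, and that the average linkage update of Eq. (3) is a mean of the two distances weighted by unit size. The regional linkage adjustment of Eq. (4) is not reproduced, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_distance(x, y):
    """Correlation distance: one minus the Pearson correlation."""
    return 1.0 - pearson(x, y)

def average_linkage_update(d_xz, d_yz, n_x, n_y):
    """Distance from unit z to the merged unit (x, y), weighted by unit size."""
    return (n_x * d_xz + n_y * d_yz) / (n_x + n_y)
```

For example, a candidate unit at distance 0.2 from a three-cell unit and 0.8 from a one-cell unit lies at distance 0.35 from their merger under this weighting.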

Since the data are standardized beforehand, the time series in every grid cell has a mean of 0 and a variance of 1; therefore, the clustering is not affected by differences in mean or variance but is instead based on the correlation among the standardized time series. Variability from one time step to the next (up or down from one specific year to the next year) plays the major role in determining the clustering results.
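A short worked identity may clarify why clustering on standardized series reduces to clustering on correlation. As a sketch, assume each series is standardized with the population (1/n) variance over n years; the symbols $z_x$ and $z_y$ are illustrative and do not appear elsewhere in the text.

```latex
\begin{aligned}
\lVert z_x - z_y \rVert^2
  &= \sum_{t=1}^{n} z_{x,t}^2 + \sum_{t=1}^{n} z_{y,t}^2 - 2\sum_{t=1}^{n} z_{x,t}\, z_{y,t} \\
  &= n + n - 2n\, r_{x,y} \\
  &= 2n\,(1 - r_{x,y})
\end{aligned}
```

Under that assumption, the squared Euclidean distance between two standardized series is proportional to one minus their Pearson correlation, so differences in mean or variance play no role.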

e. Evaluation of techniques on western Ethiopian precipitation

A tool for hierarchical climate regionalization written in the R statistical language (Badr et al. 2015) is used to produce hierarchical clustering results based on preprocessed data, initially over the main region and then extended to include the additional southernmost portion of data (Fig. 1). These results are compared with nonhierarchical k-means clustering results on the same regions. All data are preprocessed by standardizing across years for each gridded time series. An optimal number of clusters k = 5 is obtained at the 99% confidence level (α = 0.01) using the regional linkage hierarchical clustering technique for both data regions. However, for nonhierarchical clustering, the sensitivity analysis of k versus WSS for 1000 trials per k produces a smooth scree plot with no apparent elbow to identify the optimal number of clusters (Fig. 4). Before further evaluation of an optimal k, hierarchical and nonhierarchical clustering results for k = 5 are compared to identify the preferred clustering method, if possible. Note that, for each k, the optimal k-means clustering result is the one with the minWSS among the 1000 trials with random initial states.

To evaluate the effect of extending the data analysis region, the hierarchical and nonhierarchical cluster techniques are applied first to approximately 95% of the data (southern portion omitted) and then to the full dataset, to compare how cluster delineations may change (Fig. 5). As expected, regional linkage and average linkage produce similar clustering results given their comparably structured criteria. Similarly, the Ward’s method and k-means clustering results resemble each other, given their equivalent objective of minimizing the WSS. The k-means clustering provides the lowest WSS, followed by Ward’s method (Table 1). When comparing the two data extents for each method, the differences in cluster boundaries for hierarchical clustering are quite notable, while k-means clustering produces essentially the same cluster delineations. Thus, k-means clustering is likely more robust than hierarchical clustering in terms of cluster boundary stability considering inevitable data noise; this robustness is attributable to its flexible algorithm of assigning and reassigning grid cells to clusters. However, this flexibility is also why it produces less smooth cluster borders than the hierarchical cluster methods.

Fig. 5.

Comparative results of hierarchical clustering using (a) regional linkage, (b) average linkage, and (c) Ward’s method as distance criteria and of nonhierarchical (d) k-means clustering over (top) region 1 (5.15°–14.95°N, 33.05°–39.95°E) and (bottom) region 1 with an additional region 2 (4.15°–5.15°N, 34.95°–38.75°E), all with the number of clusters k = 5.

Citation: Journal of Climate 29, 10; 10.1175/JCLI-D-15-0582.1

Table 1.

WSS computed for the clustering results in Figs. 5 and 6. The data regions are defined in Fig. 5. The smoothed k-means clustering results use a smoothing factor of 25.


Borders of k-means clusters can be smoothed using a smoothing factor, defined as the minimum number of grid cells allowed for an isolated group to remain (i.e., grid cells in the same cluster isolated from the main cluster). A higher smoothing factor produces a greater extent of smoothing. If an isolated group has fewer grid cells than the smoothing factor (e.g., 25 grid cells), then these grid cells are absorbed into the adjacent cluster. If more than one adjacent cluster exists, the cluster exhibiting the higher intercorrelation with the group is selected (Table 2). Smoothing factors of 5, 10, 15, and 25 grid cells are evaluated (Fig. 6). Even under the highest smoothing factor of 25, k-means clustering still produces the lowest WSS among all the clustering methods tested here (Table 1).
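The smoothing step can be sketched as a connected-component pass over the cluster map. This is a simplified illustration, not the procedure used here: groups are found with 4-neighbor connectivity (an assumption), and a small isolated group is absorbed into the neighboring cluster that shares the most border cells, which is swapped in for the intercorrelation rule described above because the sketch carries no time series. Function names are hypothetical.

```python
from collections import Counter, deque

def neighbours(i, j, rows, cols):
    """Yield the 4-connected neighbours of cell (i, j) inside the grid."""
    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
        if 0 <= ni < rows and 0 <= nj < cols:
            yield ni, nj

def connected_groups(grid, label):
    """Return the 4-connected groups of cells carrying `label`."""
    rows, cols = len(grid), len(grid[0])
    seen, groups = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != label or (r, c) in seen:
                continue
            group, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                i, j = queue.popleft()
                group.append((i, j))
                for ni, nj in neighbours(i, j, rows, cols):
                    if grid[ni][nj] == label and (ni, nj) not in seen:
                        seen.add((ni, nj))
                        queue.append((ni, nj))
            groups.append(group)
    return groups

def smooth_clusters(grid, factor):
    """Absorb isolated groups smaller than `factor` into a bordering cluster."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # leave the input map untouched
    for label in sorted({v for row in grid for v in row}):
        groups = sorted(connected_groups(grid, label), key=len, reverse=True)
        for group in groups[1:]:  # groups[0] is treated as the main cluster body
            if len(group) >= factor:
                continue
            # Count bordering cells per neighbouring cluster (simplified rule).
            border = Counter(
                grid[ni][nj]
                for i, j in group
                for ni, nj in neighbours(i, j, rows, cols)
                if grid[ni][nj] != label
            )
            if border:
                target = border.most_common(1)[0][0]
                for i, j in group:
                    out[i][j] = target
    return out
```

With a factor of 1 the map is returned unchanged, since no group can have fewer than one cell.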

Table 2.

Intercorrelation and intracorrelation for regional linkage, average linkage, Ward’s method, and k-means clustering.

Fig. 6.

The k-means clustering map under a smoothing factor of (top left) 5, (top right) 10, (bottom left) 15, and (bottom right) 25.


Intra- and intercorrelations may also be compared between techniques (Table 2). The intercorrelation is the correlation between any two centroids (i.e., the average time series in each cluster). The intracorrelation is defined as the average correlation between the time series in each grid cell in one cluster and the centroid of that cluster. As expected, relatively high intracorrelations and low intercorrelations are obtained for all four clustering methods. At the 0.01 significance level, regional and average linkage produce lower intercorrelations (all lower than 0.5) than Ward’s method (one surpassing 0.6) and k means (four exceeding 0.5). For the intracorrelations, k means produces the highest overall average, with all values ranging between 0.64 and 0.84. The average linkage approach also produces fairly stable (consistent) intracorrelations, albeit lower overall than k means. Ward’s method produces both the highest (0.85) and lowest (0.55) intracorrelation among all clustering methods; the lowest one is even less than some of its intercorrelation values. Regional linkage yields a perfect correlation of 1 within cluster 5 because a single grid cell constitutes this cluster. This is physically unrealistic yet statistically correct, suggesting that the regional linkage method is oversensitive. Depending on the application, intracorrelation may be favored over intercorrelation, implying that homogeneous regions are more highly valued than independent regions.
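The two diagnostics defined above can be computed as in the following sketch. The function names and the toy inputs are hypothetical; the definitions follow the text (centroid as the cluster-average time series, intracorrelation as the mean member-to-centroid correlation, intercorrelation as the centroid-to-centroid correlation).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def centroid(members):
    """Average time series of the grid cells in one cluster."""
    return [sum(values) / len(values) for values in zip(*members)]

def intra_inter(series, labels):
    """Return (intracorrelation per cluster, intercorrelation per cluster pair)."""
    clusters = {}
    for s, lab in zip(series, labels):
        clusters.setdefault(lab, []).append(s)
    cents = {lab: centroid(members) for lab, members in clusters.items()}
    intra = {
        lab: sum(pearson(s, cents[lab]) for s in members) / len(members)
        for lab, members in clusters.items()
    }
    labs = sorted(cents)
    inter = {
        (a, b): pearson(cents[a], cents[b])
        for i, a in enumerate(labs)
        for b in labs[i + 1:]
    }
    return intra, inter
```

A cluster made of a single grid cell returns an intracorrelation of 1 by construction, which is the behavior noted above for regional linkage cluster 5.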

Overall, nonhierarchical clustering tends to outperform hierarchical clustering for this particular application, given its low WSS, relatively high yet reasonable intracorrelations and acceptably low intercorrelations, and, most importantly, its flexibility to produce much more stable cluster delineations. Therefore, the following sections considering the selection of k and a sensitivity analysis are restricted to the nonhierarchical (k means) clustering technique only.

4. Selection of k for nonhierarchical cluster analysis

As previously discussed, delineation involving the selection of k is typically performed subjectively. For low-density station-based data, this may be appropriate; however, for high-resolution gridded data, an objective, automated process is appealing. Here we discuss several methods and their suitability for both objective selection and subjective evaluation, including the elbow method and gap statistics, as well as visualization of corresponding cluster maps. New methods, including the “difference in minWSS” and “difference in difference,” are also proposed and developed to facilitate objective selection of k. Note that prediction performance is intentionally isolated from the evaluation of k; we believe that the approach adopted here, with selection based on regionalization and not prediction, is conservative but provides a more objective and realistic expectation of prediction skill.

Previous studies addressing homogeneous June–September precipitation clusters for Ethiopia prescribe anywhere from 4 to 8 clusters; the NMA of Ethiopia officially divides Ethiopia into 8 clusters. These are all determined based on station-level data. Using the newly available high-resolution gridded dataset, but cognizant of previous work, an upper limit of 15 clusters is considered.

a. Elbow method and difference in minWSS

The elbow method measures how the WSS decreases with an increasing number of clusters. If the WSS improvement from adding one more cluster slows relative to the previous addition, an elbow forms (graphically). If this point is distinct, the elbow point should be selected as the optimal number of clusters. As previously mentioned, the common elbow method is not suitable in this case because no distinct elbow is established, given the large number of grid cells considered yet relatively small number of clusters desired (Fig. 4). Similarly, using AIC, BIC, or GCV [see Eqs. (5), (6), and (7); Craven and Wahba 1978; Manning et al. 2008] as alternative criteria only modifies the curve negligibly since the large number of grid cells produces a high value of WSS, relative to which a higher number of clusters is not strongly penalized (Fig. 7). Therefore, the elbow method still fails to identify an optimal number of clusters below 15. The equations for calculating AIC, BIC, and GCV are listed here:
e5
e6
e7
where minWSS is the minimum WSS obtained from 1000 k-means clustering runs with randomly selected initial states, K is the number of clusters (note that K is a variable here), M is the number of variables or the component-wise dimensions of the data (in this case, M is the number of years), and N is the total number of grid cells.
Fig. 7.

Minimum WSS, AIC, BIC, and GCV given different number of clusters k based on k-means clustering results on JJAS seasonal total precipitation over the complete study region.


To extend the elbow method, the difference in minWSS from k − 1 to k may be calculated and evaluated (Fig. 8). In contrast to the very smooth minWSS curve from the standard elbow approach, the difference in minWSS shows apparent elbows (downward elbows at k = 4 and k = 9 and upward elbows at k = 8 through k = 10), indicating a significant change in the slope. Whereas the typical elbow method evaluates the reduction of the minWSS (that is, the improvement itself), the difference in minWSS scrutinizes the rate of improvement to locate the optimal k, so that subtle changes in the “smooth” minWSS curve can be captured. In this case, the difference in minWSS (that is, the decrease of WSS from k − 1 to k) is initially large but decelerates sharply, as is typically expected for complex climate datasets; the deceleration then becomes noticeably more gradual and nearly constant from k = 4 to k = 8. However, from k = 8 to k = 9, the negative rate of improvement suddenly intensifies, indicating a faster decrease of the rate of improvement (a downward elbow), which is undesirable relative to the previous reductions; thus, an effective selection is reached at k = 8. A marginal selection of k = 2 may also be considered as a potential candidate for the optimal k. This extended method can effectively identify the best number of clusters, particularly when the desired number is relatively small compared with the large number of objects processed (e.g., grid cells), improving upon the typical elbow approach.
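The bookkeeping behind the difference in minWSS is simple enough to sketch. The snippet below assumes a mapping from k to minWSS is already available; the function names and the numbers in the usage note are hypothetical and are not values from this study.

```python
def difference_in_min_wss(min_wss):
    """Decrease of minWSS from k - 1 to k, for every k with a predecessor."""
    return {
        k: min_wss[k - 1] - min_wss[k]
        for k in sorted(min_wss)
        if k - 1 in min_wss
    }

def change_in_difference(diff):
    """Change of the difference from k - 1 to k (the rate of improvement)."""
    return {
        k: diff[k] - diff[k - 1]
        for k in sorted(diff)
        if k - 1 in diff
    }
```

For instance, with illustrative minWSS values of 100, 70, 50, 38, and 30 for k = 1 to 5, the differences are 30, 20, 12, and 8, and their changes are −10, −8, and −4. A sudden strengthening of the negative change marks a downward elbow in the sense used above.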

Fig. 8.

The minWSS and difference in minWSS from k − 1 to k.


b. Gap statistic and difference in difference

For a gap statistic approach (Tibshirani et al. 2001), the WSS from a cluster analysis on a randomly simulated (reference) dataset having the same dimensions as the original dataset is compared to the WSS from a cluster analysis on the original dataset. Intuitively, if clustering of the original dataset provides a similar WSS to the randomly simulated dataset, which should not have any cluster characteristics given its random nature, clustering of the original dataset is deemed inappropriate. A large difference between the WSS from the random and original datasets is preferred for the selection of k. This difference is the so-called gap.

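The gap statistic algorithm includes the following steps:
  1. generate a reference dataset;
  2. perform cluster analysis with varying k;
  3. compute the corresponding $\log(\mathrm{WSS}^{*}_{K,b})$ for reference dataset b;
  4. iterate steps 1–3 B times.
A reference WSS in logarithmic form is then calculated as the expectation of all $\log(\mathrm{WSS}^{*}_{K,b})$ for each k [Eq. (8)]. The gap is the difference between the $\mathrm{WSS}_K$ from the original dataset and from the reference dataset, both in logarithmic form [Eq. (9)]:
$$E^{*}[\log(\mathrm{WSS}_K)] = \frac{1}{B}\sum_{b=1}^{B}\log(\mathrm{WSS}^{*}_{K,b}) \qquad (8)$$
$$\mathrm{Gap}(K) = E^{*}[\log(\mathrm{WSS}_K)] - \log(\mathrm{WSS}_K) \qquad (9)$$
The standard deviation $sd_K$ and the simulation error $se_K$ based on the reference dataset are also required,
$$sd_K = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left\{\log(\mathrm{WSS}^{*}_{K,b}) - E^{*}[\log(\mathrm{WSS}_K)]\right\}^{2}} \qquad (10)$$
$$se_K = sd_K\sqrt{1 + \frac{1}{B}} \qquad (11)$$
The optimal number of clusters is then defined as the smallest k such that
$$\mathrm{Gap}(K) - \mathrm{Gap}(K+1) + se_{K+1} \geq 0 \qquad (12)$$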
The difference in gaps considering simulation errors is the “gap criterion” [Eq. (12)]. The gap statistic is applied with B = 100 trials and k ≤ 15. In this case, results show that the gap increases with k (Figs. 9a,b) and that the gap criterion has not reached a nonnegative value prior to k = 15 (Fig. 9c), indicating an optimal k greater than 15. However, as previously discussed, having more than 15 clusters is not preferred in this case. Thus, an additional step is added by computing the difference in the gap criterion, called the difference in difference (Fig. 9d), to measure the “speed vector” rather than the “state” of reaching the nonnegative value along the “time step” k. This serves as a secondary criterion for situations in which the optimal number of clusters from the traditional gap statistic approach exceeds the desired limit. The first nonnegative difference in difference occurs at k = 8, illustrating the decline of the gap criterion between k = 8 and k = 9. In contrast, a relatively large improvement occurs between k = 7 and k = 8. Therefore, k = 8 is a suitable number of clusters that falls within our desired range of k and is consistent with the difference in minWSS approach, further supporting the selection of k = 8.
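The gap statistic bookkeeping can be sketched as follows, taking the logarithmic WSS values as given inputs. The formulas follow the standard form in Tibshirani et al. (2001). The difference in difference is taken here as the criterion at k minus the criterion at k + 1, which is an assumption inferred from the description above rather than a confirmed definition; all function names are hypothetical.

```python
import math

def gap_statistic(log_wss, ref_log_wss):
    """Gap and simulation error per k.

    log_wss: dict k -> log(WSS) of the original dataset.
    ref_log_wss: dict k -> list of B log(WSS) values from reference datasets.
    """
    gap, se = {}, {}
    for k, refs in ref_log_wss.items():
        b = len(refs)
        mean_ref = sum(refs) / b
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in refs) / b)
        gap[k] = mean_ref - log_wss[k]
        se[k] = sd * math.sqrt(1.0 + 1.0 / b)
    return gap, se

def gap_criterion(gap, se):
    """Gap(k) - Gap(k + 1) + se(k + 1); nonnegative values satisfy the rule."""
    return {
        k: gap[k] - gap[k + 1] + se[k + 1]
        for k in sorted(gap)
        if k + 1 in gap
    }

def difference_in_difference(criterion):
    """Assumed form: criterion(k) - criterion(k + 1)."""
    return {
        k: criterion[k] - criterion[k + 1]
        for k in sorted(criterion)
        if k + 1 in criterion
    }

def smallest_k(values):
    """Smallest k with a nonnegative value, or None if there is none."""
    for k in sorted(values):
        if values[k] >= 0:
            return k
    return None
```

`smallest_k` can be applied either to the gap criterion (the traditional rule) or to the difference in difference (the secondary criterion), depending on which series is passed in.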
Fig. 9.

Gap statistic and difference in difference results.


c. Visualization of cluster maps

Visualization of cluster maps for different k is a more direct, albeit subjective, way to select an appropriate number of clusters. Nonetheless, it is a useful approach to confirm results from the objective measures described above. Cluster maps (from k = 2 to k = 10; Fig. 10) indicate the stability of certain clusters even as k changes; for example, the central zones remain relatively constant from k = 4 to k = 5 and from k = 6 to k = 9. For k = 2 through k = 5, a relatively low number of clusters tends to produce cluster partitioning; the northeastern and southern regions typically belong to the same cluster but are geographically separated. It is not until k = 7 that the two regions are completely assigned to separate clusters. Given that clusters represent homogeneous precipitation patterns, it is unlikely (though possible) that disjointed clusters make physical sense; it is instead preferred that such clusters be split into distinct independent clusters. Thus, for this dataset, k ≥ 7 is preferred. Comparing k = 8 and k = 9, cluster boundaries appear very similar; however, the former illustrates a cleaner delineation, minimizing the jumble of clusters in the southeast portion of the dataset evident for k = 9.

Fig. 10.

The k-means clustering maps given a different number of clusters k ranging from (top left) 2 to (bottom right) 10.


d. Sensitivity analysis using reduced time series

Although nonhierarchical clustering is relatively immune to extending spatial data, it can be sensitive to the length of the data record. Three shortened time series, each obtained by dropping 5 yr, are compared with the original time series of 29 yr: (i) without the first 5 yr, (ii) without the last 5 yr, and (iii) without the driest 5 yr. The driest 5 yr are defined based on the average JJAS seasonal total precipitation over the complete study region (Fig. 11) and include 1984, 1987, 1997, 2002, and 2011. Interestingly, for k = 7, 8, and 9, dropping the driest 5 yr has the least influence of the three shortened time series as compared with the original outcomes (Fig. 12). This is predominantly attributable to all grid cells tending to behave similarly in drought years given the extensive range of consistently dry conditions throughout the region. Thus, the lack of spatial differentiation in those years contributes very little to the cluster analysis. On the contrary, dropping moderate years affects clustering outcomes to a larger extent given spatial variability. Thus, dropping the first 5 yr of data, which contain 2 extreme years but 3 very moderate years, and the last 5 yr, which contain 1 extreme year and 4 moderate years, produces notably different cluster boundaries (Fig. 12).
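Constructing the three shortened records can be sketched as below. The helper names and the toy series are hypothetical; the only logic taken from the text is that the driest years are ranked by the regional mean JJAS total and that 5 yr are dropped in each variant.

```python
def driest_years(years, regional_mean, n=5):
    """Years with the n lowest regional-mean seasonal totals."""
    ranked = sorted(zip(regional_mean, years))
    return sorted(year for _, year in ranked[:n])

def shortened_variants(years, regional_mean, n=5):
    """Years to drop for each of the three sensitivity experiments."""
    return {
        "first": list(years[:n]),
        "last": list(years[-n:]),
        "driest": driest_years(years, regional_mean, n),
    }

def drop_years(years, series, drop):
    """Remove the listed years from every grid-cell time series."""
    dropped = set(drop)
    keep = [i for i, year in enumerate(years) if year not in dropped]
    kept_years = [years[i] for i in keep]
    kept_series = [[row[i] for i in keep] for row in series]
    return kept_years, kept_series
```

Each shortened dataset would then be standardized and clustered again so that the resulting maps can be compared with those from the full record.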

Fig. 11.

Mean time series of JJAS seasonal total precipitation over complete study area.


Fig. 12.

Clustering maps for k = (a) 7, (b) 8, and (c) 9 with (top left) full time series, (bottom left) dropping the 5 driest yr, (top right) dropping the first 5 yr, and (bottom right) dropping the last 5 yr.


Reducing the time series length provides less variability in the cluster analysis and thus results in more fractured clusters; in other words, the remaining variability is less likely to clearly distinguish clusters or identify their homogeneity. However, most cluster patterns are still recognizable, indicating robust regionalization.

In summary, for k-means clustering, the delineation is affected by the number of predetermined clusters and is sensitive to the length of available data and characteristics of historical years. For this Ethiopia precipitation analysis, k = 8 appears superior given the objective measures (both the difference in difference based on the gap statistic and the difference in minWSS based on the elbow method) and the subjective measures through visualization, including spatially coherent clusters and relatively smooth boundaries. For fewer clusters, single clusters fractured into multiple, geographically distant subregions are problematic and undesirable. Intra- and intercorrelations for k = 8 (Table 3) indicate an overall strong coherency within each cluster, further justifying selection of k = 8 as an optimal choice (Fig. 13). Individual cluster mean time series (Fig. 14) also indicate diversity between clusters.

Table 3.

Intra- and intercorrelation table for k-means clustering results given k = 8.

Fig. 13.

The k-means clustering maps given k = 8 with country boundaries (black) and river profiles (white).


Fig. 14.

Standardized mean time series (centroid) of JJAS seasonal total precipitation (mm) within each clustered region from 1983 to 2011 (29 yr); clustering results are based on k = 8.


5. Identification of local and large-scale cluster-level precipitation drivers

Another technique to differentiate the independence of clusters is to understand the local and large-scale drivers affecting precipitation patterns and variability. This is typically undertaken by evaluating correlations between cluster-level precipitation (JJAS seasonal total in this case) and hydroclimatic variables. From these, physical mechanisms may be explored and identified. Gridded (2.5° × 2.5°) global NCEP–NCAR reanalysis data (Kalnay et al. 1996) for five selected climate variables [sea surface temperature (SST), sea level pressure (SLP), geopotential height (GH) at 500 mb, surface air temperature (SAT), and outgoing longwave radiation (OLR)] are correlated with the mean time series of JJAS seasonal total precipitation from each cluster. Previous studies have identified relationships between Ethiopia’s precipitation and large-scale climate variables, such as SSTs in the equatorial Pacific, Indian Ocean, and equatorial/southern Atlantic Ocean and SLPs near the African continent, specifically the Azores high, St. Helena high, Mascarene high, and southwest Asian monsoon (Korecha and Barnston 2007; Segele et al. 2009; Diro et al. 2011b). As expected, each cluster is associated with a unique set of large-scale climate patterns, although similarities exist. For example, similar signals from concurrent-season SST patterns in the equatorial Pacific region (ENSO) are found across all clusters, but to varying degrees (Fig. 15a). From north to south in western Ethiopia, the influence of ENSO generally decreases. Northwestern Ethiopia, particularly the region close to the Rift Valley (cluster 5), is the most strongly negatively correlated with equatorial Pacific SST. This association weakens southward (central-western Ethiopia and then southwestern Ethiopia; clusters 7, 6, and 2), indicating a smaller but still significant influence from ENSO. All of the clustered regions illustrate a negative correlation: that is, warmer (colder) equatorial Pacific SST (from El Niño/La Niña) typically brings deficit (excess) JJAS seasonal total precipitation to the study region. This concurs with a number of previous studies (e.g., Korecha and Barnston 2007; Korecha and Sorteberg 2013; Segele and Lamb 2005). On the other hand, weak, insignificant associations with equatorial Pacific SST exist for some regions, including the southwest corner of Sudan (cluster 4) and its neighboring region in part of northwestern Ethiopia (cluster 1). These specific regions, however, show stronger correlations with Indian Ocean and southern Atlantic Ocean SST, perhaps implying a more direct influence on moisture transport from these oceanic regions (Fig. 15a). The overall influence of the Indian and southern Atlantic Oceans on JJAS seasonal total precipitation in western Ethiopia is nevertheless smaller than that of the Pacific Ocean, judging simply by the correlations with SST in each ocean.

Fig. 15.

Global correlation map of (a) SST and (b) SLP correlated with within-cluster mean time series of JJAS seasonal total precipitation.


Global correlation maps between concurrent-season SLPs and cluster-level time series illustrate diverse positive and negative patterns (Fig. 15b). For example, the Mascarene high-pressure system in the southern Indian Ocean is apparent for southwestern Ethiopia (clusters 2 and 6). As the moisture-laden winds caused by the Mascarene high cross the equatorial Indian Ocean from south to north, they change direction from southeasterly to southwesterly because of the Coriolis force (Viste and Sorteberg 2013b), affecting precipitation in southwestern Ethiopia. The Mascarene high is negatively correlated with JJAS precipitation in southwestern Ethiopia and may control the extent to which the ITCZ shifts. Previous studies (e.g., Segele et al. 2009) indicate that the Mascarene anticyclone is coupled to a weak, semipermanent surface ridge, which appears to limit the southern range of the ITCZ during the Northern Hemisphere summer. It is speculated that the timing of the development of this pressure system also matters. For example, an early development of the Mascarene high pushes the ITCZ to the north prior to JJAS, followed by an earlier weakening of the high-pressure system during JJAS (lower pressure). Meanwhile, the ITCZ moves back to the south, bringing more rainfall to the southern part of western Ethiopia (high precipitation and thus negatively correlated with lower pressure of the Mascarene high). Additionally, local effects from adjacent SLP to the north of Ethiopia are nonnegligible for regions extending from northwestern to southwestern Ethiopia (clusters 5, 6, 7, and 8), coinciding with previous findings identifying a moisture flow path connecting the Mediterranean Sea, Red Sea, and Arabian Peninsula with the northern Ethiopian highlands in the summertime (Viste and Sorteberg 2013a,b).

Concurrent-season surface air temperature (SAT) time series over the Sahel indicate remarkably strong negative correlations with seasonal precipitation for a large portion of northwestern and central-western Ethiopia (approximately −0.9 for northwestern Ethiopia; Fig. 16a). This is likely due to a decrease of monsoon-related continental convergence and rainfall from Senegal to Ethiopia (Giannini et al. 2003), causing a high SAT over the Sahel and low precipitation in Ethiopia. For concurrent-season OLR, the local index correlates strongly with southwestern Ethiopia (clusters 2 and 6) (Fig. 16b). Therefore, summertime precipitation in the southern part of the study region is likely to be more influenced by local climate variables compared with other clusters. Not surprisingly, clusters with high intercluster correlations (e.g., 2 and 6, and 5 and 7; Table 3) tend to produce similar correlation patterns and are therefore expected to be similarly affected by local and large-scale precipitation drivers.

Fig. 16.

Correlation maps of (a) SAT and (b) OLR with the within-cluster mean time series of JJAS seasonal total precipitation for (a) clusters 5 and 7 and (b) clusters 2 and 6, centered on Africa.


6. Predictions

One motivation for developing and evaluating clusters is to produce homogeneous regions for which precipitation may be uniformly predicted and subsequently applied to agriculture or hydrology models. To demonstrate, a simple principal component linear regression model [Eq. (13)] is applied to predict the JJAS seasonal precipitation total for each cluster through a drop-one cross-validation approach. This is arguably a simplified prediction technique but sufficient for demonstration purposes here. Predictor variables, including SST, SLP, GH, and SAT, from the previous month (May) over highly correlated and physically justifiable regions (Table 4) are selected. A PCA is then performed on all selected climate variables to remove multicollinearity and reduce the number of predictors. The top two principal components (PCs) for each cluster explain approximately 52%–82% of the total variance and are applied as predictors in the regression framework. The model equations are shown below:
e13a
e13b
where, for each cluster, $Y_i$ is the within-cluster JJAS seasonal total precipitation with year i left out. Similarly, $PC1_i$ and $PC2_i$ are the top two PCs based on the PCA of selected predictors with year i left out. The modeled precipitation for year i is computed from the coefficients estimated by the linear regression [Eq. (13a)] and that year’s own reconstructed PCs [Eq. (13b)]. Note that each cluster has a unique set of model inputs.
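The drop-one cross-validation loop can be illustrated with a simplified sketch. It takes the two predictor series as given, whereas the approach described above recomputes the PCA with year i left out; that step is omitted here for brevity. The ordinary least squares fit through the normal equations and all function names are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(pc1, pc2, y):
    """Least squares coefficients (intercept, slope on pc1, slope on pc2)."""
    rows = [[1.0, a, b] for a, b in zip(pc1, pc2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

def drop_one_predictions(pc1, pc2, y):
    """Predict each year from a model fitted on all other years."""
    preds = []
    for i in range(len(y)):
        keep = [j for j in range(len(y)) if j != i]
        b0, b1, b2 = fit_ols(
            [pc1[j] for j in keep], [pc2[j] for j in keep], [y[j] for j in keep]
        )
        preds.append(b0 + b1 * pc1[i] + b2 * pc2[i])
    return preds
```

Because each prediction uses coefficients estimated without the target year, the resulting series gives an out-of-sample view of skill.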
Table 4.

Climate variables (CVs) in May over different regions for each cluster [clusters 1–8 (C1–C8)] used as predictors, with corresponding correlation between the CV averaged over the region and the cluster-level JJAS seasonal total precipitation time series shown (only cells with correlation values shown are used as predictors). The regions are as follows: equatorial Pacific (EP), north Indian Ocean (NI), south Indian Ocean (SI), equatorial/South Atlantic Ocean (E/SA), local region (LO), Azores high (AH), St. Helena high (SH), Mascarene high (MH), and southwest Asian monsoon (AM).


Cross-validated predictions, including a 95% confidence interval conditioned on model errors, are generally quite skillful and closely mimic year-to-year observed variability (Fig. 17). Pearson correlations between median predictions and observations are greater than 0.7 for all clusters. Additionally, the median rank probability skill score (RPSS), a categorical measure, ranges from 19% to 75% for all clusters, indicating significantly more skill than from a climatological (historical averages) prediction (Table 5).
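For readers unfamiliar with the RPSS, a minimal sketch for a single forecast follows. It assumes ordered categories (e.g., terciles) and a climatological reference with fixed category probabilities; the category choice and function names are illustrative assumptions and do not describe the configuration used in this study.

```python
def rps(forecast_probs, obs_category):
    """Rank probability score: squared error of cumulative probabilities."""
    cum_forecast, cum_obs, total = 0.0, 0.0, 0.0
    for category, prob in enumerate(forecast_probs):
        cum_forecast += prob
        cum_obs += 1.0 if category == obs_category else 0.0
        total += (cum_forecast - cum_obs) ** 2
    return total

def rpss(forecast_probs, obs_category, clim_probs):
    """Skill relative to climatology: 1 is perfect, 0 matches climatology."""
    return 1.0 - rps(forecast_probs, obs_category) / rps(clim_probs, obs_category)
```

Positive values indicate more skill than the climatological forecast, and negative values indicate less.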

Fig. 17.

Modeled results using the first two PCs of selected climate variables as predictors, with drop-one-year cross validation.


Table 5.

Pearson correlation coefficients and median RPSS values corresponding to the modeled results compared with observations in Fig. 17.


7. Conclusions and discussion

In this paper, we evaluate various regionalization methods for objective delineation and define a number of approaches for optimally selecting an appropriate number of clusters, with a demonstration on western Ethiopia’s summertime precipitation. Given a high-resolution gridded precipitation dataset, objective, automated clustering and delineation is possible. The preference for hierarchical versus nonhierarchical (k means) clustering is typically case specific; in this case, k-means clustering outperforms hierarchical clustering, particularly considering its stable cluster patterns.

Various statistical methods are available to objectively define the optimal number of clusters; however, these approaches fail for datasets with large numbers of grid cells when a relatively small number of clusters is desired. Therefore, a “difference in minWSS” extension to the elbow method and a “difference in difference” extension to the gap statistic method are developed to objectively define the optimal number of clusters within the desired range. Visualization of cluster maps, a subjective tool, can reinforce objective outputs from these newly developed methods.

Only a few studies have explored regionalization over Ethiopia, and, to our knowledge, all use station data and therefore require subjective delineation. The k-means clustering result given the optimal selection of k = 8 generally agrees with the grouping of stations for summertime precipitation by Diro et al. (2009), with almost all stations still falling into equivalent clusters. Delineation of the boundaries differs, however, given the subjectivity of drawing boundaries conditioned on sparsely located stations. Other studies (e.g., Eklundh and Pilesjö 1990; Gissila et al. 2004; Korecha and Sorteberg 2013) differ in both the assignment of stations to clusters and the delineation of homogeneous regions, given different datasets, clustering techniques, and, most critically, the variability in precipitation considered. In other studies the overall interannual variability and month-to-month seasonality are evaluated, whereas the interannual variability of JJAS seasonal total precipitation is isolated in this study for regionalization, partially agreeing with Diro et al. (2009), focusing purely on the main rainy and agricultural season.

The regionalization techniques and evaluation metrics developed in this study can also be generally applied to other hydroclimatic datasets, serving different purposes. For studies focused exclusively on Ethiopia, the country mask can be applied; however, given the robust cluster patterns produced by k means, the clustering results would be similar (not shown). Regionalization can also be applied on a subset of months, such as June–July and August–September, when the physical mechanism of precipitation patterns migrating over the season is desired or a submonth precipitation prediction is the subsequent goal.

It is unclear how future climate change may affect regionalization. If climate change imposes regionally specific changes, clusters may eventually shift and reorient; however, if it influences the overall region consistently, then shifting of cluster boundaries may be minor.

Acknowledgments

This study was supported by NASA Project NNX14AD30G.

REFERENCES

  • Badr, H. S., B. F. Zaitchik, and A. K. Dezfuli, 2015: A tool for hierarchical climate regionalization. Earth Sci. Inf., 8, 949–958, doi:10.1007/s12145-015-0221-7.

  • Bartle, A., 2002: Hydropower potential and development activities. Energy Policy, 14, 1231–1239, doi:10.1016/S0301-4215(02)00084-8.

  • Bekele, F., 1997: Ethiopian use of ENSO information in its seasonal forecasts. Internet J. Afr. Stud., 2. [Available online at http://www.bradford.ac.uk/research-old/ijas/ijasno2/bekele.html.]

  • Bisetegne, D., L. Ogallo, and J. Ininda, 1986: Rainfall characteristics in Ethiopia. Proc. First Technical Conf. on Meteorological Research in Eastern and Southern Africa, Nairobi, Kenya, UCAR.

  • Black, E., J. Slingo, and K. R. Sperber, 2003: An observational study of the relationship between excessively strong short rains in coastal East Africa and Indian Ocean SST. Mon. Wea. Rev., 131, 74–94, doi:10.1175/1520-0493(2003)131<0074:AOSOTR>2.0.CO;2.

  • Block, P. J., and B. Rajagopalan, 2007: Interannual variability and ensemble forecast of upper Blue Nile basin Kiremt season precipitation. J. Hydrometeor., 8, 327–343, doi:10.1175/JHM580.1.

  • Camberlin, P., 1997: Rainfall anomalies in the source region of the Nile and their connection with the Indian summer monsoon. J. Climate, 10, 1380–1392, doi:10.1175/1520-0442(1997)010<1380:RAITSR>2.0.CO;2.

  • Conway, D., 2000: The climate and hydrology of the upper Blue Nile River. Geogr. J., 166, 49–62, doi:10.1111/j.1475-4959.2000.tb00006.x.

  • Craven, P., and G. Wahba, 1978: Smoothing noisy data with spline functions. Numer. Math., 31, 377–403, doi:10.1007/BF01404567.

  • Degefu, W., 1987: Some aspects of meteorological drought in Ethiopia. Drought and Hunger in Africa, M. H. Glantz, Ed., Cambridge University Press, 23–36.

  • Dinku, T., K. Hailemariam, R. Maidment, E. Tarnavsky, and S. Connor, 2014: Combined use of satellite estimates and rain gauge observations to generate high-quality historical rainfall time series over Ethiopia. Int. J. Climatol., 34, 2489–2504, doi:10.1002/joc.3855.

  • Diro, G. T., D. I. F. Grimes, E. Black, A. O’Neill, and E. Pardo-Iguzquiza, 2009: Evaluation of reanalysis rainfall estimates over Ethiopia. Int. J. Climatol., 29, 67–78, doi:10.1002/joc.1699.

  • Diro, G. T., D. I. F. Grimes, and E. Black, 2011a: Teleconnections between Ethiopian summer rainfall and sea surface temperature: Part I—Observation and modelling. Climate Dyn., 37, 103–119, doi:10.1007/s00382-010-0837-8.

  • Diro, G. T., D. I. F. Grimes, and E. Black, 2011b: Teleconnections between Ethiopian summer rainfall and sea surface temperature: Part II. Seasonal forecasting. Climate Dyn., 37, 121–131, doi:10.1007/s00382-010-0896-x.

  • Eklundh, L., and P. Pilesjö, 1990: Regionalization and spatial estimation of Ethiopian mean annual rainfall. Int. J. Climatol., 10, 473–494, doi:10.1002/joc.3370100505.

  • Elagib, N. A., and M. M. Elhag, 2011: Major climate indicators of ongoing drought in Sudan. J. Hydrol., 409, 612–625, doi:10.1016/j.jhydrol.2011.08.047.

  • Gamachu, D., 1977: Aspects of Climate and Water Budget in Ethiopia. Tech. Monogr., Addis Ababa University Press, 71 pp.

  • Giannini, A., R. Saravanan, and P. Chang, 2003: Oceanic forcing of Sahel rainfall on interannual to interdecadal time scales. Science, 302, 1027–1030, doi:10.1126/science.1089357.

  • Gissila, T., E. Black, D. I. F. Grimes, and J. M. Slingo, 2004: Seasonal forecasting of the Ethiopian summer rains. Int. J. Climatol., 24, 1345–1358, doi:10.1002/joc.1078.

  • Goddard, L., and N. E. Graham, 1999: Importance of the Indian Ocean for simulating rainfall anomalies over eastern and southern Africa. J. Geophys. Res., 104, 19 099–19 116, doi:10.1029/1999JD900326.

  • Gong, X., and M. B. Richman, 1995: On the application of cluster analysis to growing season precipitation data in North America east of the Rockies. J. Climate, 8, 897–931, doi:10.1175/1520-0442(1995)008<0897:OTAOCA>2.0.CO;2.

  • Griffiths, J., 1972: Ethiopian highlands. Climates of Africa, J. Griffiths, Ed., World Survey of Climatology, Vol. 10, Elsevier, 369–388.

  • Hartigan, J. A., 1975: Clustering Algorithms. John Wiley & Sons, 351 pp.

  • Jain, A. K., M. N. Murty, and P. J. Flynn, 1999: Data clustering: A review. ACM Comput. Surv., 31, 264–323, doi:10.1145/331499.331504.

  • Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. Bull. Amer. Meteor. Soc., 77, 437–471, doi:10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.

  • Kassahun, B., 1987: Weather systems over Ethiopia. Proc. First Technical Conf. on Meteorological Research in Eastern and Southern Africa, Nairobi, Kenya, UCAR, 53–57.

  • Korecha, D., and A. G. Barnston, 2007: Predictability of June–September rainfall in Ethiopia. Mon. Wea. Rev., 135, 628–650, doi:10.1175/MWR3304.1.

  • Korecha, D., and A. Sorteberg, 2013: Validation of operational seasonal rainfall forecast in Ethiopia. Water Resour. Res., 49, 7681–7697, doi:10.1002/2013WR013760.

  • Latif, M., D. Dommenget, M. Dima, and A. Grötzner, 1999: The role of Indian Ocean sea surface temperature in forcing east African rainfall anomalies during December–January 1997/98. J. Climate, 12, 3497–3504, doi:10.1175/1520-0442(1999)012<3497:TROIOS>2.0.CO;2.

  • Manning, C. D., P. Raghavan, and H. Schütze, 2008: Introduction to Information Retrieval. Cambridge University Press, 506 pp.

  • NMSA, 1996: Climatic and agroclimatic resources of Ethiopia. National Meteorological Services Agency of Ethiopia Research Rep. 1, 37 pp.

  • Segele, Z. T., and P. J. Lamb, 2005: Characterization and variability of Kiremt rainy season over Ethiopia. Meteor. Atmos. Phys., 89, 153–180, doi:10.1007/s00703-005-0127-x.

  • Segele, Z. T., P. J. Lamb, and L. M. Leslie, 2009: Large-scale atmospheric circulation and global sea surface temperature associations with Horn of Africa June–September rainfall. Int. J. Climatol., 29, 1075–1100, doi:10.1002/joc.1751.

  • Seleshi, Y., and U. Zanke, 2004: Recent changes in rainfall and rainy days in Ethiopia. Int. J. Climatol., 24, 973–983, doi:10.1002/joc.1052.

  • Shanko, D., and P. Camberlin, 1998: The effects of the southwest Indian Ocean tropical cyclones on Ethiopian drought. Int. J. Climatol., 18, 1373–1388, doi:10.1002/(SICI)1097-0088(1998100)18:12<1373::AID-JOC313>3.0.CO;2-K.

  • Sugar, C. A., and G. M. James, 2003: Finding the number of clusters in a dataset. J. Amer. Stat. Assoc., 98, 750–763, doi:10.1198/016214503000000666.

  • Tadesse, T., 1994: The influence of the Arabian Sea storms/depressions over the Ethiopian weather. Proc. Int. Conf. on Monsoon Variability and Prediction, Geneva, Switzerland, World Meteorological Organization, 228–236.

  • Thorndike, R. L., 1953: Who belongs in the family? Psychometrika, 18, 267–276, doi:10.1007/BF02289263.

  • Tibshirani, R., G. Walther, and T. Hastie, 2001: Estimating the number of clusters in a data set via the gap statistic. J. Roy. Stat. Soc., 63B, 411–423, doi:10.1111/1467-9868.00293.

  • Tsegay, W., 1998: El Niño and drought early warning in Ethiopia. Internet J. Afr. Stud., 2. [Available online at http://www.bradford.ac.uk/research-old/ijas/ijasno2/Georgis.html.]

  • Viste, E., and A. Sorteberg, 2013a: The effect of moisture transport variability on Ethiopian summer precipitation. Int. J. Climatol., 33, 3106–3123, doi:10.1002/joc.3566.

  • Viste, E., and A. Sorteberg, 2013b: Moisture transport into the Ethiopian highlands. Int. J. Climatol., 33, 249–263, doi:10.1002/joc.3409.

  • Ward, J. H., Jr., 1963: Hierarchical grouping to optimize an objective function. J. Amer. Stat. Assoc., 58, 236–244, doi:10.1080/01621459.1963.10500845.
