Logistical constraints can limit the number of global climate model (GCM) simulations considered in a climate change impact assessment. When dealing with annual or seasonal variables, one can visualize and manually select GCM scenarios to cover as much of the ensemble’s range of changes as possible. Most environmental systems are sensitive to climate conditions (e.g., extremes) that cannot be described by a small number of variables. Instead, algorithms like k-means clustering have been used to select representative ensemble members. Clustering algorithms are, however, biased toward high-density regions of climate variable space and tend to select scenarios that describe the central tendency rather than the full spread of an ensemble. Also, scenarios selected via clustering may not be ordered: that is, scenarios in the five-cluster solution may not appear in the six-cluster solution, which makes recommending a consistent set of scenarios to researchers with different needs difficult. Alternatively, an automated procedure based on a cluster initialization algorithm is proposed and applied to changes in 27 climate extremes indices between 1986–2005 and 2081–2100 from a large ensemble of phase 5 of the Coupled Model Intercomparison Project (CMIP5) simulations. Selections by the method are ordered and are designed to span the overall range of the ensemble. The number of scenarios required to account for changes spanned by at least 90% of the CMIP5 ensemble members is reported for 21 regions of the globe and compared with k-means clustering. On average, the proposed method requires 40% fewer scenarios to meet this threshold than k-means clustering does.
Regional climate change impact assessments often depend on domain-specific environmental models (e.g., hydrological models, crop models, ecosystem models) that are driven by global climate model (GCM) outputs. Although it is generally desirable to use as many scenarios as possible to quantify uncertainty, it is not always practical. For instance, computational costs can be large, which may make running the models with anything more than a small subset of available GCM simulations infeasible. Ideally, the subset would be selected so that it brackets as much of the full range in simulated future changes as possible. When considering climate variables such as mean annual or seasonal temperature and precipitation, one can quite readily visualize and manually select GCM scenarios that capture most of the range of simulated changes (Murdock and Spittlehouse 2011). Environmental systems are, however, sensitive to climate conditions, like monthly means or extremes, that cannot adequately be described by a small number of climate indicators. For example, the World Meteorological Organization’s Expert Team on Climate Change Detection and Indices (ETCCDI) has recommended a set of 27 core climate change indicators for annual temperature and precipitation extremes (Zhang et al. 2011). Visualizing and selecting representative climate change scenarios in such a high-dimensional space is no longer a simple task. At the same time, failing to incorporate relevant information in the selection process could lead to a subset that does not represent the full range of uncertainty in all the indicators of interest. Selecting representative climate scenarios when dealing with multivariate data is a challenging problem.
To address this problem, researchers have turned to automated multivariate statistical algorithms as an alternative to manual selection. In the context of numerical weather prediction, “representative members” have been selected from ensembles of forecast simulations using cluster analysis algorithms (Molteni et al. 2001). Similar techniques have also been used in climate change studies (Logan et al. 2011). For example, Houle et al. (2012) applied k-means clustering (Hartigan and Wong 1979) to 86 phase 3 of the Coupled Model Intercomparison Project (CMIP3) GCM simulations of monthly mean temperature and precipitation for 1971–2000 and 2071–2099. Five clusters were identified, and a single representative simulation from each cluster (the one lying closest to the cluster centroid) was selected to provide inputs to soil temperature and soil moisture models at sites in eastern Canada.
In the absence of a priori information about model reliability and skill in an “ensemble of opportunity,” such as CMIP3 or the recent CMIP5 multimodel ensemble (Taylor et al. 2012), it is recommended that scenarios be selected from all available simulations so that both internal variability (e.g., different realizations from the same model) and model uncertainty (e.g., different GCMs) are represented (Murdock and Spittlehouse 2011). Because k-means clustering attempts to maximize explained variance of an ensemble, it selects members that are representative of high-density regions in climate space. In CMIP3 and CMIP5, many GCMs share model components (Masson and Knutti 2011; Knutti et al. 2013). Also, some models have contributed several realizations to the ensemble, whereas others have contributed just one. More populated regions of climate space may therefore simply be an artifact of the opportunistic nature of the CMIP5 ensemble. In addition, k-means clustering is unlikely to produce an ordered sequence of solutions (i.e., the 6-member clustering may not include scenarios in the 5-member clustering). This makes it difficult to recommend a consistent set of scenarios to researchers in a region, each of whom may have different computational and logistical constraints.
To overcome these issues, use of an automated, objective procedure based on the cluster analysis initialization algorithm developed by Katsavounidis et al. (1994), called the Katsavounidis–Kuo–Zhang (KKZ) algorithm, is proposed. The KKZ algorithm was originally designed as a means of initializing k-means clustering, and it has been used in this manner in several climatological applications (Whitfield and Cannon 2000; Whitfield et al. 2002; Cannon et al. 2002). Unlike k-means clustering, KKZ recursively selects members that best span the spread of an ensemble rather than finding clusters that best characterize high-density regions of multivariate space. It is deterministic and ordered, incrementally adding scenarios to the ones previously selected. This paper demonstrates the KKZ algorithm in terms of selecting subsets of scenarios that capture the range of changes in 27 ETCCDI climate extremes indices in the CMIP5 multimodel ensemble.
To be consistent with results reported by the IPCC (Stocker et al. 2013), the focus of this study is on changes simulated by CMIP5 GCMs between the 1986–2005 historical period and the 2081–2100 period in the representative concentration pathway 4.5 (RCP4.5) radiative forcing scenario (Van Vuuren et al. 2011). ETCCDI indices (Table 1) computed for these simulations were obtained from the ETCCDI extremes indices archive hosted by the Canadian Centre for Climate Modelling and Analysis of Environment Canada (http://www.cccma.ec.gc.ca/data/climdex/). Validation of the ETCCDI indices and an analysis of their projected future changes are presented in Sillmann et al. (2013a,b). The focus is on climate change simulations over 21 continental and subcontinental regions of the globe (Giorgi and Francisco 2000). In total, 56 simulations from 27 different GCMs, each contributing between 1 and 10 members, are represented in the ensemble (Table 2).
Changes in ETCCDI indices between the 1986–2005 and 2081–2100 periods were calculated for the 21 continental and subcontinental regions shown in Fig. 1. As in Stocker et al. (2013), variables were first averaged regionally and then changes from the historical period were computed. In the regions of the Amazon basin, Australia, eastern Africa, southern Africa, Southeast Asia, and western Africa, the number of icing days (ID) and growing season length (GSL) variables were removed from consideration as several models simulate no days with maximum temperatures below 0°C and all days with daily mean temperatures above 5°C at the majority of grid cells.
a. k-means clustering
Given a dataset of dimension $N \times P$, where $N$ is the number of cases and $P$ is the number of variables, k-means clustering partitions cases into $K$ clusters such that the within-cluster sums of squared errors (SSE) is minimized. Each cluster is characterized by its centroid

$$\bar{x}_p^{(k)} = \frac{1}{N_k} \sum_{i=1}^{N_k} x_{ip}^{(k)}, \tag{1}$$

where $x_{ip}^{(k)}$ is the value of the pth variable for the ith of $N_k$ cases assigned to the kth cluster. The SSE is thus given by

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \sum_{p=1}^{P} \left( x_{ip}^{(k)} - \bar{x}_p^{(k)} \right)^2 \tag{2}$$

and is minimized by following the algorithm of Hartigan and Wong (1979). Minimizing SSE is equivalent to maximizing the proportion of variance explained by a clustering solution

$$\mathrm{EV} = 1 - \frac{\mathrm{SSE}}{\mathrm{SSE}_T}, \tag{3}$$

where $\mathrm{SSE}_T$ is the total SSE of the dataset (i.e., with $K = 1$). Because solutions from k-means clustering are sensitive to initial conditions, the algorithm is typically run multiple times, each time starting from a different random partition, and the best solution in terms of SSE is saved. Here, k-means clustering results are reported for the best run from 50 000 random initializations.
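As an illustration of Eqs. (1)–(3), the following minimal Python sketch evaluates the centroid, the within-cluster SSE, and the explained variance for a given assignment of cases to clusters. It does not implement the Hartigan and Wong (1979) search itself; the function names and the list-of-lists data layout are assumptions made for this example only.

```python
def centroid(cases):
    """Eq. (1): mean of each variable over the given cases."""
    n = len(cases)
    return [sum(column) / n for column in zip(*cases)]


def sse(cases, labels):
    """Eq. (2): within-cluster sum of squared errors for a partition.

    cases  -- list of N cases, each a list of P variable values
    labels -- list of N cluster labels, one per case
    """
    clusters = {}
    for case, label in zip(cases, labels):
        clusters.setdefault(label, []).append(case)
    total = 0.0
    for members in clusters.values():
        center = centroid(members)
        total += sum((x - c) ** 2
                     for case in members
                     for x, c in zip(case, center))
    return total


def explained_variance(cases, labels):
    """Eq. (3): one minus the ratio of SSE to the total (K = 1) SSE."""
    sse_total = sse(cases, [0] * len(cases))  # every case in one cluster
    return 1.0 - sse(cases, labels) / sse_total
```

Because the total SSE is fixed for a given dataset, any partition that lowers SSE raises explained variance by the same proportion, which is why the two criteria are equivalent.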
b. KKZ algorithm
The KKZ algorithm of Katsavounidis et al. (1994) was originally designed to deterministically identify a set of (nearly) optimal seed cases for initializing the centroids in k-means clustering. In this study, KKZ is instead used to select scenarios from an ensemble of climate simulations. The algorithm, as applied here, consists of the following four steps:
1) Select the case that lies closest to the ensemble centroid [Eq. (1) with K = 1] as the first scenario.
2) Select the case that lies farthest from the first scenario as the second scenario.
3) To select the next scenario,
   (i) calculate distances from each remaining case to the previously selected scenarios;
   (ii) associate each remaining case with the minimum distance calculated in step 3(i); and
   (iii) select the case with the maximum distance from step 3(ii) as the next scenario.
4) Repeat from step 3.
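The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the function name `kkz_select` is hypothetical, cases are assumed to be already standardized, and ties are broken by taking the first case encountered, which is an assumption not specified in the text.

```python
import math


def kkz_select(cases, n_select):
    """Return indices of the selected cases, in order of selection.

    cases    -- list of N cases, each a list of P standardized values
    n_select -- number of scenarios to select (1 <= n_select <= N)
    """
    n = len(cases)
    if not 1 <= n_select <= n:
        raise ValueError("n_select must be between 1 and the number of cases")

    # Step 1: the case closest to the ensemble centroid.
    center = [sum(column) / n for column in zip(*cases)]
    first = min(range(n), key=lambda i: math.dist(cases[i], center))
    selected = [first]

    # min_dist[i] is the distance from case i to its nearest selected scenario.
    min_dist = [math.dist(case, cases[first]) for case in cases]

    while len(selected) < n_select:
        remaining = [i for i in range(n) if i not in selected]
        # Step 2 and step 3(iii): the remaining case whose nearest
        # selected scenario is farthest away.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        # Steps 3(i) and 3(ii): update each minimum distance using the
        # newly selected scenario.
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(cases[i], cases[nxt]))
    return selected
```

Each pass only appends to the existing list, so the first five entries of a six-scenario selection are exactly the five-scenario selection.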
Variables are assumed to be standardized to zero mean and unit standard deviation, with the similarity between cases measured in terms of Euclidean distance. The KKZ algorithm (and k-means clustering), however, can be applied in conjunction with other distance metrics: for example, Mahalanobis distance (Mimmack et al. 2001).
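A minimal sketch of the standardization step is given below. Standardizing keeps indices with large numerical ranges (e.g., precipitation totals) from dominating the Euclidean distance. The function name is hypothetical, and the use of the sample standard deviation (divisor N − 1) is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def standardize(cases):
    """Scale each variable to zero mean and unit standard deviation.

    cases -- list of N cases, each a list of P variable values.
    A variable with zero variance would cause a division by zero and
    should be removed beforehand.
    """
    columns = list(zip(*cases))
    means = [mean(column) for column in columns]
    # Assumption: sample standard deviation (divisor N - 1).
    sds = [stdev(column) for column in columns]
    return [[(x - m) / s for x, m, s in zip(case, means, sds)]
            for case in cases]
```

Euclidean distances between the standardized cases can then be computed with `math.dist` from the Python standard library.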
c. Bivariate example
To illustrate, k-means clustering and the KKZ algorithm are applied to standardized changes in two variables derived from the CMIP5 ETCCDI indices: (TNx + TXn)/2, which is highly correlated with annual mean temperature, and PRCPTOT. Six scenario subsets for this bivariate case are shown in Fig. 2 for the western North America and Greenland regions. By design, the first scenario selected by KKZ lies closest to the ensemble centroid. In the two regions, all of the next five scenarios selected by KKZ are vertices of the convex hull, which highlights the algorithm’s efficiency at covering the spread in each variable. Selections are ordered, incremental, and deterministic, tending to aggressively increase the range covered as each new member is added to the subset. The full ensemble spread in both temperature and precipitation is accounted for by KKZ in western North America and for temperature in Greenland. In the Greenland region, note that the 10 ensemble members contributed by the CSIRO Mk3.6.0 GCM lie near the center of the ensemble: that is, the high density in this area of bivariate space is primarily due to natural climate variability simulated by a single model. In this case, k-means clustering selects four scenarios that are in a ±1 standard deviation box around the centroid, versus only one for KKZ.
The KKZ and k-means clustering algorithms are used to select subsets of scenarios from the CMIP5 RCP4.5 multimodel ensemble based on changes in 27 ETCCDI climate extremes indices for each of the 21 world regions. The number of scenarios required to span 90% of the range in projected changes in the ETCCDI indices is analyzed for each region. To cover this range, the subset must overlap more than 50 of the 56 scenarios. For reference, selected scenarios in each region are reported as supplemental material.
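One reading of this coverage criterion can be sketched as follows: for each index, count the ensemble members whose change lies within the minimum-to-maximum range of the selected subset, and require that fraction to reach 90% for every index. With 56 members, 50/56 ≈ 0.893 falls short and 51/56 ≈ 0.911 passes, which matches the "more than 50" requirement. The function names and the inclusive treatment of the range end points are assumptions made for this illustration.

```python
def coverage_by_variable(cases, selected):
    """Fraction of ensemble members inside the subset's range, per variable.

    cases    -- list of N cases, each a list of P changes
    selected -- indices of the cases in the selected subset
    """
    n = len(cases)
    fractions = []
    for p in range(len(cases[0])):
        subset_values = [cases[i][p] for i in selected]
        low, high = min(subset_values), max(subset_values)
        inside = sum(low <= case[p] <= high for case in cases)
        fractions.append(inside / n)
    return fractions


def meets_threshold(cases, selected, threshold=0.9):
    """True if every variable reaches the coverage threshold."""
    return all(f >= threshold
               for f in coverage_by_variable(cases, selected))


def smallest_subset_size(cases, order, threshold=0.9):
    """Smallest leading portion of an ordered selection that meets the threshold.

    order -- case indices in order of selection (e.g., from KKZ).
    Returns None if no leading portion of `order` meets the threshold.
    """
    for size in range(1, len(order) + 1):
        if meets_threshold(cases, order[:size], threshold):
            return size
    return None
```

For an ordered method such as KKZ, a single pass over the selection order is enough; for k-means clustering, each subset size is a separate solution and would need to be checked with `meets_threshold` individually.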
The percentage of the 27 ETCCDI variables that meet this threshold for the western North America and Greenland regions is compared in Fig. 3 for the KKZ and k-means clustering algorithms. KKZ reaches the 90% coverage threshold for 100% of the ETCCDI indices in 12 and 13 scenarios for western North America and Greenland, respectively, versus 24 and 19 scenarios for k-means clustering. The number of scenarios required to meet the 90%/100% criterion in each of the 21 world regions is shown in Fig. 4, with KKZ requiring, on average, 40% fewer scenarios than k-means clustering (a median of 15 versus 25 scenarios). In 19 of the 21 regions, KKZ requires fewer scenarios than k-means clustering; in the remaining two regions (Amazon basin and western Africa), the two methods require the same number.
For the sake of completeness, values of explained variance [Eq. (3)] are calculated for each number of selected scenarios in each of the 21 regions. In the context of the CMIP5 ensemble, where high-density regions of multivariate space may simply be due to the “genealogy” of contributing models (Knutti et al. 2013), explained variance is not a particularly meaningful measure. In addition, as noted earlier, maximizing explained variance is equivalent to minimizing the k-means clustering SSE objective function; hence, it is expected that k-means clustering will outperform the KKZ algorithm on this performance measure. Despite these caveats, differences between the two methods are small. For the 15-scenario solutions, k-means clustering explains, on average, 79% of variance over all indices and regions versus 75% for the KKZ algorithm. For the 25-scenario solutions, explained variance increases to 88% for k-means clustering versus 87% for KKZ.
The KKZ cluster initialization algorithm is proposed as a means of performing automated, objective GCM scenario selection, especially for high-dimensional, multivariate datasets drawn from large “ensembles of opportunity” like CMIP3 and CMIP5. The KKZ algorithm is tailored for use with such datasets. Note that other scenario selection methods may also be suitable in this situation (e.g., Evans et al. 2013). When members of an ensemble are created using a more rigorous sampling design, then algorithms like k-means clustering may be appropriate. In this case, KKZ may still have a role: for example, by ordering the scenarios selected by the clustering algorithm.
Scenarios selected by KKZ are ordered (i.e., the 6-member solution simply adds a scenario to the 5-member solution), and the algorithm is designed to rapidly capture the overall range of the ensemble. To demonstrate, the number of scenarios required to cover the range in simulated changes in ETCCDI climate extremes indices (1986–2005 to 2081–2100) spanned by at least 90% of the CMIP5 RCP4.5 ensemble members is reported for 21 regions of the globe and compared with k-means clustering. On average, KKZ requires 40% fewer scenarios to meet this threshold than k-means clustering does.
Because of its focus on characterizing the full range of variability represented by an ensemble of climate change simulations, the KKZ algorithm will preferentially select scenarios that lie on the periphery of the multivariate data cloud. For instance, in the bivariate data shown in Fig. 2, selections coincide with vertices of the convex hull that encloses the ensemble. Hence, if there is compelling evidence to suggest that outliers in terms of the projected climate change signal are associated with poor historical performance (e.g., Sherwood et al. 2014), then it may be prudent to screen these models from consideration prior to application of the KKZ algorithm.
Finally, while scenarios selected by the KKZ algorithm may span a given range in projected changes in GCM variables in an ensemble, it is not guaranteed that subsequent results from an impacts model will span the same range of impacts uncertainty, because of the complexity and potential nonlinearity of the underlying processes. In the case of the KKZ algorithm, however, adding scenarios during analysis is simplified because of the incremental, ordered nature of the selection process.
We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 2 of this paper) for producing and making available their model output. For CMIP, the U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We acknowledge the Canadian Centre for Climate Modelling and Analysis of Environment Canada for maintaining the ETCCDI extremes indices archive.
Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/JCLI-D-14-00636.s1.