  • ASHRAE, 2013: Climatic Data for Building Design Standards. ASHRAE Standard 169-2013, American Society of Heating, Refrigerating and Air-Conditioning Engineers, 98 pp.
  • Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Foundations and Trends, 128 pp., https://doi.org/10.1561/2200000016.
  • Briggs, R. S., R. G. Lucas, and Z. T. Taylor, 2003a: Climate classification for building energy codes and standards: Part 1—Development process. ASHRAE Trans., 109, 109–121.
  • Briggs, R. S., R. G. Lucas, and Z. T. Taylor, 2003b: Climate classification for building energy codes and standards: Part 2—Zone definitions, maps, and comparisons. ASHRAE Trans., 109, 122–130.
  • Chen, G., and G. Lerman, 2009: Spectral curvature clustering (SCC). Int. J. Comput. Vis., 81, 317–330, https://doi.org/10.1007/s11263-008-0178-9.
  • Cheng, B., J. Yang, S. Yan, Y. Fu, and T. S. Huang, 2010: Learning with ℓ1-graph for image analysis. IEEE Trans. Image Process., 19, 858–866, https://doi.org/10.1109/TIP.2009.2038764.
  • Elhamifar, E., and R. Vidal, 2013: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35, 2765–2781, https://doi.org/10.1109/TPAMI.2013.57.
  • Filippone, M., F. Camastra, F. Masulli, and S. Rovetta, 2008: A survey of kernel and spectral methods for clustering. Pattern Recognit., 41, 176–190, https://doi.org/10.1016/j.patcog.2007.05.018.
  • Gerstengarbe, F. W., P. C. Werner, and K. Fraedrich, 1999: Applying non-hierarchical cluster analysis algorithms to climate classification: Some problems and their solution. Theor. Appl. Climatol., 64, 143–150, https://doi.org/10.1007/s007040050118.
  • Jain, A. K., 2010: Data clustering: 50 years beyond K-means. Pattern Recognit. Lett., 31, 651–666, https://doi.org/10.1016/j.patrec.2009.09.011.
  • Lau, C. C., J. C. Lam, and L. Yang, 2007: Climate classification and passive solar design implications in China. Energy Convers. Manage., 48, 2006–2015, https://doi.org/10.1016/j.enconman.2007.01.004.
  • Liu, G., Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, 2013: Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35, 171–184, https://doi.org/10.1109/TPAMI.2012.88.
  • Lund, R., and B. Li, 2009: Revisiting climate region definitions via clustering. J. Climate, 22, 1787–1800, https://doi.org/10.1175/2008JCLI2455.1.
  • Mallenahalli, N. K., 2015: Predicting climate variability over the Indian region using data mining strategies. 8 pp., https://arxiv.org/abs/1509.06920.
  • Mimmack, G. M., S. J. Mason, and J. S. Galpin, 2001: Choice of distance matrices in cluster analysis: Defining regions. J. Climate, 14, 2790–2797, https://doi.org/10.1175/1520-0442(2001)014<2790:CODMIC>2.0.CO;2.
  • Ministry of Construction of China, 1993: Thermal design code for civil building (GB50176-93) (in Chinese). China Architecture and Building Press.
  • Netzel, P., and T. Stepinski, 2016: On using a clustering approach for global climate classification. J. Climate, 29, 3387–3401, https://doi.org/10.1175/JCLI-D-15-0640.1.
  • Ng, A. Y., M. I. Jordan, and Y. Weiss, 2002: On spectral clustering: Analysis and an algorithm. NIPS’01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, MIT Press, 849–856.
  • Olgyay, V., 2015: Design with Climate: Bioclimatic Approach to Architectural Regionalism. Princeton University Press, 224 pp.
  • Steinhaeuser, K., N. V. Chawla, and A. R. Ganguly, 2010: An exploration of climate data using complex networks. ACM SIGKDD Explorations Newsletter, No. 12, Association for Computing Machinery, New York, NY, 25–32, https://doi.org/10.1145/1882471.1882476.
  • Unal, Y., T. Kindap, and M. Karaca, 2003: Redefining the climate zones of Turkey using cluster analysis. Int. J. Climatol., 23, 1045–1055, https://doi.org/10.1002/joc.910.
  • Verichev, K., M. Zamorano, and M. Carpio, 2019: Assessing the applicability of various climatic zoning methods for building construction: Case study from the extreme southern part of Chile. Build. Environ., 160, 106165, https://doi.org/10.1016/j.buildenv.2019.106165.
  • Vidal, R., and P. Favaro, 2014: Low rank subspace clustering (LRSC). Pattern Recognit. Lett., 43, 47–61, https://doi.org/10.1016/j.patrec.2013.08.006.
  • Von Luxburg, U., 2007: A tutorial on spectral clustering. Stat. Comput., 17, 395–416, https://doi.org/10.1007/s11222-007-9033-z.
  • Walsh, A., D. Cóstola, and L. C. Labaki, 2017: Comparison of three climatic zoning methodologies for building energy efficiency applications. Energy Build., 146, 111–121, https://doi.org/10.1016/j.enbuild.2017.04.044.
  • Wan, K. K., D. H. Li, L. Yang, and J. C. Lam, 2010: Climate classifications and building energy use implications in China. Energy Build., 42, 1463–1471, https://doi.org/10.1016/j.enbuild.2010.03.016.
  • Wright, J., A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, 2009: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31, 210–227, https://doi.org/10.1109/TPAMI.2008.79.
  • Xiong, J., R. Yao, S. Grimmond, Q. Zhang, and B. Li, 2019: A hierarchical climatic zoning method for energy efficient building design applied in the region with diverse climate characteristics. Energy Build., 186, 355–367, https://doi.org/10.1016/j.enbuild.2019.01.005.
  • Zhang, X., and X. Yan, 2014: Temporal change of climate zones in China in the context of climate warming. Theor. Appl. Climatol., 115, 167–175, https://doi.org/10.1007/s00704-013-0887-z.
  • Zheng, J., Y. Yin, and B. Li, 2010: A new scheme for climate regionalization in China (in Chinese). Acta Geogr. Sin., 65, 3–13.

A Climate Classification of China through k-Nearest-Neighbor and Sparse Subspace Representation

  • 1 State Key Laboratory of Green Building in Western China, and School of Architecture, and School of Science, Xi’an University of Architecture and Technology, Xi’an, China
  • 2 State Key Laboratory of Green Building in Western China, and School of Architecture, Xi’an University of Architecture and Technology, Xi’an, China

Abstract

Climate classification aims to divide a given region into distinct groups on the basis of meteorological variables, and it has important applications in fields such as agriculture and building design. In this paper, we propose a novel spectral clustering–based method to partition 661 meteorological stations in China into several zones. First, the correlations among five meteorological variables are analyzed: daily average temperature, average relative humidity, sunshine hours, diurnal temperature range, and atmospheric pressure. The weak linear correlations found among them support classification under multiple views. Next, a similarity matrix/graph is constructed by combining the advantages of k-nearest-neighbor and sparse subspace representation. The block structure of the similarity matrix supports the rationality of the classification. We then consider climate classification under a single view and under multiple views. For the single view, atmospheric pressure has the highest degree of imbalance, and sunshine hours and diurnal temperature range have the strongest classification consistency. The consistency rates improve markedly in the multiple-view setting. Afterward, we propose a method for determining the number of clusters and carry out a sensitivity analysis on various parameters. Finally, we provide a further statistical analysis of the classification results and compare their consistency with results from another climate dataset. All experimental results show that the proposed classification method is feasible and effective for the climate zoning of China.

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Dr. L. Yang, 626224056@qq.com

1. Introduction

Climate types vary widely across the world, and they distinctly influence fields such as agriculture and the energy use of most residential and commercial buildings (Briggs et al. 2003a). Given sufficient climate data, climate classification can provide valuable guidance for agriculture and building design. Many climate classification methods exist, based on different meteorological variables and indices, and an appropriate method can be chosen according to the purpose of the classification.

Considering the impact of climate on building design principles and architectural similarities, Olgyay (2015) proposed four main climate types around the world, namely cool, temperate, hot and arid, and hot and humid. To improve the building energy codes and standards in the United States, a new climate classification technique has been developed in which 17 climate zones were designed (Briggs et al. 2003b). Recently, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) standard determined nine thermal climate zones using the heating and cooling degree-days of all stations under study (ASHRAE 2013). In China, a cumulative temperature above 10°C is an important index for defining temperature zones. The latest climate regionalization was established by using the daily climate observations from 609 stations (Zheng et al. 2010). For the thermal design of buildings, five climate types were presented, namely severe cold, cold, hot summer and cold winter, mild, and hot summer and warm winter (Ministry of Construction of China 1993).

Climate classification criteria usually involve the mean temperatures of the coldest and warmest months, annual cooling degree-days, annual heating degree-days, precipitation, humidity, and solar irradiation. When delineating climate zones or regions, an important principle is that two points or samples from the same zone should have similar climate characteristics, whereas those from different zones should have dissimilar characteristics. In a detailed classification, the observation data typically come from a set of meteorological stations or grid points, and climate data for one or more variables over many years are required (Mimmack et al. 2001). Mathematically, cluster analysis, also called data clustering (Jain 2010), is a statistical classification technique that is commonly used in climate classification.

The goal of cluster analysis is to group samples according to similarity or dissimilarity, where the samples from the same cluster have similar features and the features of samples from different clusters are as dissimilar as possible (Filippone et al. 2008). Clustering methods can be broadly divided into two categories: hierarchical and partitioning. The first category is used to seek nested clusters in an agglomerative manner. The second category finds all clusters simultaneously by solving a minimization problem whose objective function measures the dissimilarities between each sample and its corresponding potential cluster center.

Several representative climate classification studies based on cluster analysis are summarized here. Unal et al. (2003) used hierarchical cluster analysis to redefine the climate zones of Turkey based on temperatures (mean, maximum, and minimum) and total precipitation. Subsequently, hierarchical clustering was also employed to develop five solar climate zones according to the monthly average clearness index from 123 stations (Lau et al. 2007). Lund and Li (2009) considered three agglomerative methods and classified the state of Colorado into six zones based on monthly maximum and minimum temperatures from 292 meteorological stations. Wan et al. (2010) applied hierarchical clustering to annual cumulative heat and cold stress derived from monthly minimum dry-bulb temperature (DBT), maximum DBT, and vapor pressure, and classified 3973 grid points in China into five bioclimate zones. Gerstengarbe et al. (1999) discussed the cluster number and cluster separation of nonhierarchical cluster analysis, and investigated the classification of European climate using monthly precipitation, temperature, and temperature range at 228 meteorological stations. K-means clustering, a simple and popular nonhierarchical clustering method, was used to classify China into different climate types based on monthly temperature and precipitation from 753 meteorological stations (Zhang and Yan 2014). Expectation-maximization clustering was used to find climate regions based on gridded climate data over Indian regions (Mallenahalli 2015). Netzel and Stepinski (2016) considered monthly precipitation, temperature, and temperature range from grid points obtained by interpolation, and adopted two clustering methods: hierarchical clustering and partitioning around medoids. Three climate zoning methods (degree-days, cluster analysis, and administrative divisions) were compared for building energy efficiency applications in Nicaragua (Walsh et al. 2017). Recently, a two-tier climate zoning method was applied to the hot summer and cold winter zone of China, where the second tier adopted hierarchical agglomerative clustering on climate data (degree-days, relative humidity, solar radiation, and wind speed) (Xiong et al. 2019). The K-means algorithm was also used for climate zoning in the extreme southern part of Chile (Verichev et al. 2019).

Regardless of the clustering method used, two critical problems arise: how to determine the number of clusters and how to measure the dissimilarity or similarity between two samples. This paper focuses on the latter problem. Dissimilarity is typically quantified by a distance, and similarity is its opposite. Commonly used distance measures include the Euclidean distance on the original data, the Mahalanobis distance (Mimmack et al. 2001), and the Euclidean distance on normalized data (Lund and Li 2009). Similarity measures include the cross-correlation function (Steinhaeuser et al. 2010) and dynamic time warping (Netzel and Stepinski 2016). However, these distance and similarity measures may not be the most appropriate for high-dimensional and noisy data.

Among all clustering methods, K-means clustering attracts wide attention for its simplicity, ease of implementation, efficiency, and empirical success (Jain 2010). However, this technique is not well suited to high-dimensional data. As a more recent alternative to K-means clustering, spectral clustering has many fundamental advantages over traditional clustering algorithms such as K-means clustering and hierarchical clustering (Von Luxburg 2007). Spectral clustering originates from spectral graph theory, and it formulates the clustering problem as a graph cut problem (Filippone et al. 2008). In spectral clustering, we first construct a similarity graph/matrix (or affinity matrix) and then calculate the graph Laplacian matrix. Finally, K-means clustering is applied to the eigenvectors of the graph Laplacian matrix. The most difficult problem in spectral clustering remains how to build a good similarity matrix. To construct a better similarity matrix using global information, several spectral clustering-based methods have been proposed. For instance, spectral curvature clustering constructed multiway similarities taking the curvature into account (Chen and Lerman 2009). Low-rank recovery (Liu et al. 2013) and low-rank subspace clustering (Vidal and Favaro 2014) obtained the similarity graph by seeking a low-rank representation of the data. Sparse subspace clustering built the similarity matrix from a sparse subspace representation of all data (Elhamifar and Vidal 2013). One prominent advantage of these spectral clustering methods is that they are robust to noise and outliers. Among them, sparse subspace clustering is competitive with other state-of-the-art subspace clustering algorithms.

This paper proposes a novel spectral clustering-based method to perform climate classification. This method combines the advantages of k-nearest-neighbor and sparse subspace representation, and possesses strong stability and robustness. Furthermore, it can also deal with high-dimensional cases. We employ the proposed method to divide China into several zones according to different meteorological variables. These include daily average temperature, average relative humidity, sunshine hours, diurnal temperature range, and atmospheric pressure.

2. Data and methods

This section introduces the climate data used in this paper, describes two ways of constructing a similarity matrix, and presents a new spectral clustering method.

a. Climate data

China has a vast and diverse landscape, and its climate is dominated mainly by dry seasons and wet monsoons, with a significant temperature difference between winter and summer. We select 661 national meteorological stations in China for climate classification. Figure 1 shows the spatial distribution of all stations, where the color of each station indicates its altitude (in m). It can be seen from Fig. 1 that eastern China has a denser station distribution and lower altitudes than the west.

Fig. 1. Spatial distribution of 661 stations in China.

Citation: Journal of Climate 33, 1; 10.1175/JCLI-D-18-0718.1

The surface-based observation data of the 661 stations for the period 2004–13 were obtained from the National Climate Center of China. Five main meteorological variables are used as classification indexes (units in parentheses): daily average temperature (0.1°C), average relative humidity (%), sunshine hours (0.1 h), diurnal temperature range (0.1°C), and atmospheric pressure (10 Pa). To suppress the adverse effects of stochastic fluctuations, we split each time series into 365 subseries of 10 consecutive days and average the observed values within each subseries. Therefore, for a fixed meteorological variable, the observations of each station are represented by a d-dimensional column vector, where d = 365. This dimensionality reduction also preserves the interannual variability.
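
The 10-day averaging described above can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the function name `ten_day_means`, the sinusoidal test series, and the choice to drop trailing days that do not fill a whole block are assumptions, not details taken from the paper.

```python
import numpy as np

def ten_day_means(daily, block=10):
    """Average a 1-D daily series over consecutive blocks of `block` days.

    Trailing days that do not fill a whole block are dropped (an assumption;
    the text does not say how leftover days are handled).
    """
    daily = np.asarray(daily, dtype=float)
    n_blocks = daily.size // block
    return daily[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)

# Synthetic stand-in for one station: the 3653 days of 2004-13.
rng = np.random.default_rng(0)
days = np.arange(3653)
series = 15 + 10 * np.sin(2 * np.pi * days / 365.25) + rng.normal(0, 2, days.size)

y = ten_day_means(series)
print(y.shape)  # (365,)
```

Stacking one such vector per station as columns would give a d × N data matrix with d = 365.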

b. Construction of similarity matrix

1) k-nearest-neighbor

For each meteorological variable, we investigate the data from N meteorological stations; N = 661 for the given data. Let $Y = \{y_1, y_2, \ldots, y_N\}$ be the set of climate data, where $y_i \in \mathbb{R}^{d \times 1}$ is the observation vector of 10-day averages for the ith station over many years. In traditional data analysis, a common assumption is that the dataset Y lies in a linear subspace. Under this assumption, the distance between $y_i$ and $y_j$ is measured by the l2 norm: $d_{ij} = \|y_i - y_j\|_2$. This distance is also called the Euclidean distance.

The first step in spectral clustering is to build a weighted graph $G = (V, \mathcal{E})$, where V is a set of N nodes and $\mathcal{E}$ is a set of edges between pairs of nodes. The ith node denotes the observation vector yi, and the graph G is also called the similarity graph. Constructing the similarity graph is the most crucial step in spectral clustering. Several constructions are popular, such as the ɛ-neighborhood graph, the k-nearest-neighbor graph, and the fully connected graph. Among these, the k-nearest-neighbor graph is generally recommended as the first choice (Von Luxburg 2007). In the following, we introduce the construction of the similarity matrix based on the k-nearest-neighbor graph.

For each yi, we first calculate the N − 1 distances {di1, di2, …, di,i−1, di,i+1, …, diN} and then sort them in increasing order. The k nearest neighbors of yi are given by the k smallest distances. Based on this, we build a similarity graph G: if yj is one of the k nearest neighbors of yi or yi is one of the k nearest neighbors of yj, then we add an edge between the ith and the jth nodes; otherwise, there is no edge between the two nodes. Finally, the similarity matrix W = (wij)N×N is computed from the resulting similarity graph, where wij is the similarity between data points yi and yj. If there is an edge between the ith and the jth nodes, then we set wij = 1; otherwise, wij = 0. The constructed similarity matrix W is symmetric and binary, and wij = 1 means that yi and yj are highly similar. In addition, W is sparse when $k \ll N$.
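
The construction above can be sketched as follows. This is a minimal NumPy illustration, assuming the data are stored as a d × N array with one station per column; the function name `knn_similarity` is hypothetical, and ties between equal distances are broken by whatever order `argsort` returns.

```python
import numpy as np

def knn_similarity(Y, k):
    """Binary symmetric k-nearest-neighbor similarity matrix.

    Y : (d, N) array, one observation vector per column.
    k : number of neighbors per node (1 <= k < N).
    """
    N = Y.shape[1]
    # Pairwise squared Euclidean distances between columns.
    sq = np.sum(Y**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * Y.T @ Y
    np.fill_diagonal(D2, np.inf)  # a node is not its own neighbor
    # Indices of the k smallest distances in each row.
    nbrs = np.argsort(D2, axis=1)[:, :k]
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), k), nbrs.ravel()] = 1.0
    # "or" rule: keep an edge if either node lists the other as a neighbor.
    return np.maximum(W, W.T)

# Five points on a line (d = 1) with one neighbor each.
Y = np.array([[0.0, 1.0, 3.0, 10.0, 11.5]])
W = knn_similarity(Y, k=1)
```

Because of the "or" rule, a node can end up with more than k edges, as node 1 does in this small example.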

2) Sparse subspace representation

In the aforementioned subspace clustering, we make an additional assumption that the dataset Y lies in the union of c linear subspaces. The aim of clustering analysis is to identify and separate these c subspaces. If Y is noise-free and there are sufficient samples or points in each subspace, then each $y_i$ can be represented by a linear combination of the other samples:
$$y_i = c_{1i} y_1 + c_{2i} y_2 + \cdots + c_{i-1,i}\, y_{i-1} + c_{i+1,i}\, y_{i+1} + \cdots + c_{Ni}\, y_N, \tag{1}$$
where $c_{ji}$ ($j \neq i$) is the linear representation coefficient of the jth sample. A large $|c_{ji}|$ means that the ith and the jth samples might have strong similarity.
Denote $Y = (y_1, y_2, \ldots, y_N)$ and $c_i = (c_{1i}, c_{2i}, \ldots, c_{Ni})^T$, where $c_{ii} = 0$, i = 1, 2, …, N. Then Eq. (1) is rewritten as $y_i = Y c_i$. Generally, the linear system $y_i = Y c_i$ with respect to $c_i$ has infinitely many solutions when d < N. We usually want the solution vector $c_i$ to have some special structure, such as lower model complexity. In practice, the optimal $c_i$ can be determined by solving the following lp norm minimization problem:
$$\min_{c_i} \|c_i\|_p, \quad \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0. \tag{2}$$
The value of p is commonly set to 1 or 2. The case p = 1 corresponds to the sparse representation, which has led to notable breakthroughs in the field of pattern recognition (Wright et al. 2009). If samples $y_i$ and $y_j$ come from different subspaces, then ideally $c_{ij} = 0$. When the sample numbers of the c subspaces are roughly equal, the proportion of zero elements in the optimal vector $c_i$ is approximately (c − 1)/c, which indicates that $c_i$ is very sparse for large c.
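
For p = 2, problem (2) has a closed-form solution, the minimum-norm solution of the linear system, which gives a convenient baseline. The sketch below uses NumPy's least-squares routine on synthetic Gaussian data; the helper name `min_l2_representation` and the data are illustrative assumptions. The case p = 1 has no closed form and would need a linear-programming or proximal solver instead.

```python
import numpy as np

def min_l2_representation(Y, i):
    """Minimum l2-norm c with y_i = Y c and c[i] = 0 (problem (2), p = 2)."""
    d, N = Y.shape
    others = np.delete(np.arange(N), i)
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    coef, *_ = np.linalg.lstsq(Y[:, others], Y[:, i], rcond=None)
    c = np.zeros(N)
    c[others] = coef
    return c

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 12))   # d = 5 < N = 12, so solutions are not unique
c = min_l2_representation(Y, i=3)
print(np.allclose(Y @ c, Y[:, 3]), c[3])  # True 0.0
```

On generic data such as this, the minimum l2-norm solution spreads weight over essentially all other samples, which is the dense behavior that the l1 formulation is meant to avoid.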
Real observation data are usually contaminated by noise or even large sparse noise. In this situation, the vector $y_i$ is decomposed as the sum of three terms:
$$y_i = Y c_i + e_i + z_i, \tag{3}$$
where $z_i$ is a dense noise vector and $e_i$ is a large sparse noise vector. We suppose that $z_i$ is a Gaussian white noise vector, and that $c_i$ and $e_i$ follow two different multivariate Laplace distributions. According to maximum likelihood estimation, we can establish a minimization problem to obtain the optimal linear representation:
$$\min_{c_i, e_i, z_i} \|c_i\|_1 + \mu_1 \|e_i\|_1 + \frac{1}{2}\mu_2 \|z_i\|_2^2, \quad \text{s.t.}\ y_i = Y c_i + e_i + z_i,\ c_{ii} = 0, \tag{4}$$
where $\mu_1$ and $\mu_2$ are two positive regularization parameters.
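
One way to see where the objective of problem (4) comes from is to write out the negative log-density under the stated distributional assumptions. The worked step below assumes independent entries and introduces the scale symbols $\sigma$, $b_c$, and $b_e$ purely for illustration; they do not appear in the original text.

```latex
% Assumed densities (independent entries; \sigma, b_c, b_e are illustrative symbols)
p(z_i) \propto \exp\!\left(-\frac{\|z_i\|_2^2}{2\sigma^2}\right), \quad
p(c_i) \propto \exp\!\left(-\frac{\|c_i\|_1}{b_c}\right), \quad
p(e_i) \propto \exp\!\left(-\frac{\|e_i\|_1}{b_e}\right).

% Negative log of the product, up to an additive constant
-\log\bigl[p(c_i)\,p(e_i)\,p(z_i)\bigr]
  = \frac{\|c_i\|_1}{b_c} + \frac{\|e_i\|_1}{b_e}
    + \frac{\|z_i\|_2^2}{2\sigma^2} + \text{const}.

% Multiplying by b_c > 0 does not change the minimizer
\|c_i\|_1 + \frac{b_c}{b_e}\,\|e_i\|_1 + \frac{1}{2}\,\frac{b_c}{\sigma^2}\,\|z_i\|_2^2,
\qquad \mu_1 = \frac{b_c}{b_e}, \quad \mu_2 = \frac{b_c}{\sigma^2}.
```

Under this reading, a larger $\mu_2$ corresponds to a smaller assumed variance of the dense noise, and a larger $\mu_1$ corresponds to a sparse-noise term that is assumed to be smaller relative to the coefficients.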

The above optimization problem is more robust to dense or large sparse noise than problem (2). We call the representation based on problem (4) a sparse subspace representation. In practice, we further assume that each subspace is an affine subspace. Consequently, the constraint $c_i^T \mathbf{1} = 1$ is added to the constraints of problem (4), where $\mathbf{1}$ is an N-dimensional column vector whose entries are all ones. In the following, we take the temperature variable of Huashan station (34.48°N, 110.08°E), in Shaanxi Province of China, to illustrate the advantage of sparse subspace representation. The observation vector of the chosen station is represented as a linear combination of the vectors from the other 660 stations. The linear representation coefficients obtained by solving problems (2) and (4) are compared in Fig. 2.

Fig. 2. Coefficients under different representations. (a) Representation based on l2 norm. (b) Representation based on l1 norm. (c) Sparse subspace representation.

Two conclusions can be drawn from Fig. 2. First, compared with the l2 norm minimization model, more coefficients under l1 norm minimization are close to zero, which means that the linear representation based on the l1 norm is sparser than that based on the l2 norm. Second, most coefficients in the sparse subspace representation are zero and only a few are larger than zero. In other words, the sparse subspace representation yields the sparsest representation. This is consistent with the linear subspace assumption with noise. In summary, the sparse subspace representation model is more robust to noise than the other two models in this example.

All linear representation coefficient vectors $\{c_i\}_{i=1}^{N}$ can be integrated into a single model. Let $C = (c_1, c_2, \ldots, c_N)$, $E = (e_1, e_2, \ldots, e_N)$, and $Z = (z_1, z_2, \ldots, z_N)$. Then we have $Y = YC + E + Z$. The sparse subspace representations of all stations can be obtained simultaneously by solving the minimization problem:
$$\min_{C, E, Z} \|C\|_1 + \mu_1 \|E\|_1 + \frac{1}{2}\mu_2 \|Z\|_F^2, \quad \text{s.t.}\ Y = YC + E + Z,\ C^T \mathbf{1} = \mathbf{1},\ \mathrm{diag}(C) = \mathbf{0}, \tag{5}$$
where $\|\cdot\|_1$ and $\|\cdot\|_F$ denote the l1 and the Frobenius norms of a matrix, respectively, and $\mathrm{diag}(\cdot)$ is the vector formed by the main diagonal of a matrix. By eliminating the variable matrix Z, problem (5) is reformulated as
$$\min_{C, E} \|C\|_1 + \mu_1 \|E\|_1 + \frac{1}{2}\mu_2 \|Y - YC - E\|_F^2, \quad \text{s.t.}\ C^T \mathbf{1} = \mathbf{1},\ \mathrm{diag}(C) = \mathbf{0}. \tag{6}$$
The optimal C in the above optimization problem measures the similarities among all stations or nodes.
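
To give a feel for how such a coefficient matrix can be computed, the sketch below solves a deliberately simplified variant of problem (6): it drops the sparse-noise term E and the affine constraint, and it uses ISTA (proximal gradient descent) rather than ADMM. The function names, the value `mu2=50.0`, and the iteration count are illustrative assumptions, so this should be read as a toy stand-in and not as the solver used in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_codes_ista(Y, mu2=50.0, n_iter=500):
    """Approximate C minimizing ||C||_1 + (mu2/2) * ||Y - Y C||_F^2, diag(C) = 0.

    Simplifications relative to problem (6): no E term, no affine constraint,
    and ISTA (proximal gradient) instead of ADMM.
    """
    N = Y.shape[1]
    G = Y.T @ Y
    # Step size 1/L, where L = mu2 * ||Y||_2^2 is the gradient's Lipschitz constant.
    step = 1.0 / (mu2 * np.linalg.norm(Y, 2) ** 2)
    C = np.zeros((N, N))
    for _ in range(n_iter):
        grad = mu2 * (G @ C - G)                     # gradient of the smooth term
        C = soft_threshold(C - step * grad, step)    # prox of the l1 term
        np.fill_diagonal(C, 0.0)                     # enforce diag(C) = 0
    return C

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 10))   # synthetic data: 10 samples in 4 dimensions
C = sparse_codes_ista(Y)
```

Zeroing the diagonal after soft-thresholding is the exact proximal step for the l1 term combined with the constraint diag(C) = 0, since both act entry by entry.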

Although problem (6) is convex and continuous, it is not smooth. The alternating direction method of multipliers (ADMM) is a very efficient method for solving distributed optimization problems with linear equality constraints (Boyd et al. 2011). ADMM adopts an alternating update strategy; that is, we minimize or maximize the augmented Lagrangian function of problem (6) with respect to each block variable at each iteration. Once the optimal solution of the above problem is obtained, we can calculate the weights on the edge set based on the matrix C. For this purpose, each column of C is first normalized as $c_i \leftarrow c_i / \|c_i\|_\infty$, where $\|\cdot\|_\infty$ is the infinity norm of a vector. Then we set the weight matrix $W = (|C| + |C|^T)/2$, where $|\cdot|$ denotes the elementwise absolute value of a matrix. It is clear that W is nonnegative and that the maximum value of each column lies within the interval [0.5, 1]. Furthermore, W is sparse because C is sparse.
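
The normalization and symmetrization step can be sketched as below. The function name `ssr_similarity` and the guard that leaves all-zero columns unchanged are assumptions added for the illustration; the small matrix C is made up.

```python
import numpy as np

def ssr_similarity(C, eps=1e-12):
    """W = (|C| + |C|^T) / 2 after scaling each column of C by its infinity norm."""
    C = np.asarray(C, dtype=float)
    col_max = np.max(np.abs(C), axis=0)
    # Columns that are entirely zero are left unchanged (eps guard is an assumption).
    Cn = C / np.where(col_max > eps, col_max, 1.0)
    A = np.abs(Cn)
    return 0.5 * (A + A.T)

C = np.array([[0.0, 0.5, 0.0],
              [2.0, 0.0, -1.0],
              [-1.0, 0.25, 0.0]])
W = ssr_similarity(C)
print(W.max(axis=0))  # [1.   1.   0.75]
```

Each normalized column contains an entry of magnitude 1, and averaging with the transpose can at most halve it, which is why every column maximum of W lies in [0.5, 1].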

c. Spectral clustering through k-nearest-neighbor and sparse subspace representation

Let Wknn be the similarity matrix constructed from the k-nearest-neighbor graph and Wssr be the similarity matrix obtained by sparse subspace representation. Both Wknn and Wssr are sparse and symmetric by construction. The k-nearest-neighbor graph is a commonly used means of graph construction, and it is based on the assumption that the data space is locally linear. Its main disadvantage is that it does not use global information, which can make it sensitive to noise in the data. In contrast, the sparse subspace representation method uses the overall contextual information. The sparsity-based method can convey valuable information for classification, offers datum-adaptive neighborhoods, and is more robust to noise. In essence, the similarity graph constructed by sparse subspace representation is a robust version of the l1 graph (Cheng et al. 2010). However, a critical deficiency of the sparse subspace method is that two samples with a very large Euclidean distance may be assigned to the same cluster, which is generally contrary to intuition. In short, these two graph construction approaches complement each other.

To overcome the shortcomings of k-nearest-neighbor and sparse subspace representation, we propose a weighted similarity matrix defined by the compromise between Wknn and Wssr:
$$W = \beta W_{\mathrm{knn}} + (1 - \beta) W_{\mathrm{ssr}},$$
where the trade-off parameter β ∈ [0, 1]. If β = 0, the similarity matrix is constructed by sparse subspace representation. If β = 1, the similarity matrix is constructed by the k-nearest-neighbor graph. Large β indicates great importance of Wknn. Based on the resulting matrix W, we can divide the dataset Y into c clusters through the spectral clustering method. The following lists the implementation procedure of spectral clustering (Ng et al. 2002):
  • Step 1. Construct a diagonal matrix D whose (i, i) element is the sum of the ith row of W.
  • Step 2. Calculate the Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.
  • Step 3. Carry out eigendecomposition of L and obtain its c mutually orthogonal unit eigenvectors $\{x_i \in \mathbb{R}^{N \times 1}\}_{i=1}^{c}$ corresponding to the c largest eigenvalues.
  • Step 4. Form the matrix X = (x1, x2, …, xc) and normalize its rows to have unit length in the sense of the l2 norm.
  • Step 5. Treat each row of normalized X as a point and cluster these N points into c clusters via K-means clustering.
  • Step 6. The observation vector yi has the same class label as the ith row of normalized X.

If c < d, spectral clustering means that we first perform dimensionality reduction on the original data and then cluster them in fewer dimensions by K-means clustering. Compared with directly using K-means clustering on the original data, the spectral clustering can preserve the nonlinear manifold structure to some extent.
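The six steps above, together with the weighted compromise between Wknn and Wssr, can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the function names are hypothetical, the K-means routine is a plain Lloyd iteration with random restarts that uses squared Euclidean distance, empty clusters keep their previous centroid, and every row sum of W is assumed to be positive.

```python
import numpy as np

def combine_similarity(W_knn, W_ssr, beta=0.5):
    """Weighted compromise W = beta * W_knn + (1 - beta) * W_ssr."""
    return beta * W_knn + (1.0 - beta) * W_ssr

def kmeans(X, c, n_init=20, n_iter=100, seed=0):
    """Lloyd's K-means with restarts; keeps the run with the lowest total cost."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=c, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Assumption: an empty cluster keeps its previous centroid.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(c)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist[np.arange(len(X)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def spectral_clustering(W, c, n_init=20, seed=0):
    deg = W.sum(axis=1)                              # Step 1: row sums (diagonal of D)
    inv_sqrt = 1.0 / np.sqrt(deg)                    # assumes every row sum is positive
    L = inv_sqrt[:, None] * W * inv_sqrt[None, :]    # Step 2: D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
    X = vecs[:, -c:]                                 # Step 3: c largest eigenvalues
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)                 # Step 4: unit-length rows
    return kmeans(X, c, n_init=n_init, seed=seed)    # Steps 5 and 6

# Toy example: two disconnected groups of three nodes each.
block = np.ones((3, 3)) - np.eye(3)
zeros = np.zeros((3, 3))
W_knn = np.block([[block, zeros], [zeros, block]])
W_ssr = W_knn.copy()
W = combine_similarity(W_knn, W_ssr, beta=0.5)
labels = spectral_clustering(W, c=2)
print(labels)
```

The cluster indices themselves are arbitrary; only the grouping of nodes is meaningful.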

3. Results

Because China is roughly partitioned into five zones or clusters in the thermal design of buildings, the default value of c is 5 in this section. The experiments consist of six parts: correlation analysis, visualization of similarity matrices for five meteorological variables, single-view climate classification, multiple-view climate classification, determination of the number of clusters, and sensitivity analysis.

a. Correlation analysis

The exploration of correlation structure among variables is necessary for climate classification. For two arbitrary meteorological variables, we will investigate the correlation coefficients between two stations. Daily average temperature is an extremely important and special meteorological variable; the correlation coefficients between two stations are larger than 0.67 and 94.65% of coefficients are larger than 0.9. The above results show that two stations generally have a strong positive linear relationship, which can be explained as the fact that the temperature has a similar trend everywhere; namely, it reaches the highest value in summer and the lowest value in winter. The correlation matrices among five variables are displayed as 15 images shown in Fig. 3. In this figure, the x axis and y axis denote the number of stations. Moreover, Fig. 3 adopts the following abbreviations:

  • Tem = daily average temperature,
  • Hum = average relative humidity,
  • SunHour = sunshine hours,
  • TemRange = diurnal temperature range, and
  • Pressure = atmospheric pressure.

Fig. 3. Visualization of correlation coefficient matrices.

Citation: Journal of Climate 33, 1; 10.1175/JCLI-D-18-0718.1

Two conclusions are drawn by observing Fig. 3. For identical variables, each correlation coefficient matrix shows an apparent blocking phenomenon (see Figs. 3a,f,j,m,o), which is conducive to classification. For two different variables, the correlation coefficients vary over a relatively large range, which indicates that different variables do not have strong linear relationships. In summary, it is reasonable to carry out climate classification based on the five meteorological variables we selected.

b. Visualization of similarity matrix

This subsection sets the neighbor number k = 10 in constructing the k-nearest-neighbor graph. In minimization problem (6), we denote $\mu_1 = \alpha_1 / \min_i \max_{j \neq i} \|y_j\|_1$ and $\mu_2 = \alpha_2 / \min_i \max_{j \neq i} |y_i^{T} y_j|$. The parameters α1 and α2 must satisfy α1 > 1 and α2 > 1. We carry out experiments on each of the five meteorological variables separately. For the sake of brevity, both α1 and α2 are set to 50, and the compromise parameter β is chosen as 0.5.

For the constructed similarity graph, we expect that the similarity matrix W is sparse. The index of sparsity rate is employed to measure the sparsity of a matrix and its definition is given by
$$\mathrm{SR}(W) = 1 - \|W\|_0 / N^2,$$
where $\|\cdot\|_0$ is the ℓ0 norm of a matrix (i.e., the number of nonzero elements). The value of SR(W) lies in the interval [0, 1]. If W is the zero matrix, then SR(W) = 1. If all elements of W are nonzero, then SR(W) = 0. Large SR(W) means large sparsity.
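As a small illustration, the sparsity rate can be computed directly from the count of nonzero entries. The function name `sparsity_rate` is hypothetical, and the matrices below are toy inputs rather than the similarity matrices of the paper.

```python
import numpy as np

def sparsity_rate(W):
    """SR(W) = 1 - (number of nonzero entries) / N^2 for an N x N matrix W."""
    W = np.asarray(W)
    return 1.0 - np.count_nonzero(W) / W.size

print(sparsity_rate(np.zeros((4, 4))))  # zero matrix
print(sparsity_rate(np.ones((4, 4))))   # fully dense matrix
print(sparsity_rate(np.eye(4)))         # 4 nonzeros out of 16 entries
```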

The sparsity rates of five similarity matrices are daily average temperature (0.9713), average relative humidity (0.9656), sunshine hours (0.9634), diurnal temperature range (0.9663), and atmospheric pressure (0.7630). We can see that the first four meteorological variables have very sparse similarity matrices and the last variable (i.e., atmospheric pressure) has the worst sparsity. Figure 4 visualizes the similarity matrices of five meteorological variables, where the white pixel represents 0 and the dark red indicates 1. The first four subfigures in Fig. 4 indicate that most nonzero elements of the similarity matrices are concentrated near the subdiagonal, but the last subfigure shows a significant blocking effect in the subdiagonal. This observation can be interpreted as follows. The data matrix of all stations is stacked by the increasing order of station number. In general, the adjacent stations have similar meteorological characteristics owing to their relatively small geographic distances. In addition, the first four similarity matrices are sparser than the last matrix, which may be caused by small change of atmospheric pressure. By comparing Figs. 3 and 4, we find out that the similarity matrices are more sparse than the correlation coefficients matrices. This observation shows that it is unwise to measure the similarities by correlation coefficients.

Fig. 4. Visualization of similarity matrices for five meteorological variables: (a) daily average temperature, (b) average relative humidity, (c) sunshine hours, (d) diurnal temperature range, and (e) daily atmospheric pressure.

In the subsequent experiments, spectral clustering is used to classify all stations according to the constructed similarity graph or matrix under single view or multiple views. As an important component of spectral clustering, K-means clustering is sensitive to the centroids initialization. To reduce the uncertainty of clustering results, we repeat the K-means clustering 20 times and seek the solution with the lowest sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. We will plot cluster locations, the neighbors graph, and the mean of each zone. To simplify the neighbor relationship, we only draw the edges whose weights or similarities are larger than 0.2 in the neighbors graph. Moreover, the edge from the same zone is plotted by a solid line whose color is the same as its nodes, and the edge from different zones is drawn by a dotted line whose color is black.

c. Single-view climate classification

1) Daily average temperature

All stations are divided into five zones according to the data of daily average temperature. We visualize the neighbors graph to display the similarity relationship among all stations. The mean of each zone is utilized to evaluate the difference between different zones. Figure 5 shows the classification results of daily average temperature.

Fig. 5. Classification of daily average temperature. (a) Cluster locations of five zones. (b) Neighbors graph. (c) Mean of each zone.

In Fig. 5a, different colors or markers represent different climate zones or clusters. It can be seen from this subfigure that the adjacent stations are partitioned into the same cluster in most cases. By observing Figs. 5a and 5c, we find out the classification results are roughly consistent with five thermal climate clusters. Concretely speaking, most stations of Zone 1 are in the hot summer and warm winter region, and fewer stations are in the mild cluster. Zone 2 belongs to the cluster of hot summer and cold winter. Zone 3 involves two thermal climate clusters, the cold and the severe cold region. Zone 4 is part of the severe cold cluster. Zone 5 includes three thermal climate clusters, the severe cold, the cold, and the mild regions. Among these five zones, Zone 1 has the highest average temperature and Zone 2 has the second highest temperature. Most stations of Zone 5 are located in the Qinghai–Tibet Plateau and the Yunnan–Guizhou Plateau; compared with Zone 4, they have lower temperature in summer because of their higher altitudes. In addition, the latitudes of Zone 3 lie between Zone 5 and Zone 4. Figure 5b constructs the neighbors graph of daily average temperature. There are 4074 solid lines and 290 dotted lines in total, and each node has 13.2 neighbors in the average sense.

2) Average relative humidity

The classification results of average relative humidity are shown in Fig. 6. It can be seen from this figure that all stations are approximately divided into five disjoint regions. Compared with the distribution of dry and wet areas in China, we have the following observations from Figs. 6a and 6c. Zone 1 is the wet region with the highest relative humidity. Zone 2 includes two regions, the semiwet and the semiarid. Zone 3 has the lowest relative humidity and it consists of two regions, the arid and the semiarid. Zone 4 mainly belongs to the wet and the semiwet regions. Zone 5 is mainly located on the plateau and it has three regions, the wet, the semiwet, and the semiarid. In conclusion, the average relative humidity in China decreases from the southeast to the northwest. Figure 6b shows the neighbors graph of average relative humidity. In this subfigure, there are a total of 4017 solid lines within the same clusters and 336 dotted lines between different clusters. On average, each station has 13.17 neighbors.

Fig. 6. Classification of average relative humidity. (a) Cluster locations of five zones. (b) Neighbors graph. (c) Mean of each zone.

3) Sunshine hours

The sunshine hours of a location is a general indicator of cloudiness. Generally, the cloudiness has a strongly positive relationship with the relative humidity within a narrow latitude sector. Figure 7 gives the classification results of sunshine hours. We can see from Figs. 7a and 7c that the classification of sunshine hours is almost identical with that of average relative humidity except for Zone 1. Zone 1 in Fig. 7 has more stations than that in Fig. 6. Similarly, Zone 1 has the shortest sunshine hours and Zone 2 has the second shortest sunshine hours; the stations of Zones 3 and 4 are located at high altitude and they have the longest sunshine hours. Figure 7b plots the neighbors graph of sunshine hours. In this subfigure, 3887 solid lines and 291 dotted lines are drawn to describe the neighbor relationships and each station has 12.64 neighbors on average.

Fig. 7. Classification of sunshine hours. (a) Cluster locations of five zones. (b) Neighbors graph. (c) Mean of each zone.

4) Diurnal temperature range

Diurnal temperature range is the difference between the maximum and the minimum temperature. Generally speaking, the lower the relative humidity or the higher the altitude, the greater the diurnal temperature range. Based on this meteorological variable, the classification results are shown in Fig. 8. By observing Figs. 8a and 8c, we reach the following conclusions. Zone 1 has the smallest temperature range, which is consistent with this zone also having the highest relative humidity. Zone 2 has the second smallest temperature range. Both Zone 3 and Zone 5 have large temperature ranges, which makes sense given that Zone 3 is the driest region and Zone 5 has high altitude. Figure 8b displays the neighbors graph of diurnal temperature range. There are 4128 solid lines and 322 dotted lines, and each station has 13.46 neighbors.

Fig. 8. Classification of diurnal temperature range. (a) Cluster locations of five zones. (b) Neighbors graph. (c) Mean of each zone.

5) Atmospheric pressure

Atmospheric pressure decreases with increasing altitude. Figure 9 gives the classification results of atmospheric pressure. According to Figs. 9a and 9c, we draw several observations. Zone 5 has the lowest atmospheric pressure because the corresponding stations are located in the Yunnan–Guizhou Plateau and the Qinghai–Tibet Plateau. Zone 3 has the second lowest atmospheric pressure because of its high altitude. The other three zones have low altitude, and they are distributed from south to north in turn. The mean atmospheric pressure of Zone 1 lies between those of Zone 2 and Zone 4. Furthermore, the mean atmospheric pressure of Zone 5 changes only slightly, and the other four zones show the same tendency; namely, the mean atmospheric pressure in winter is higher than in summer. The final classification result is in line with the three steps of China’s terrain: Zone 5 belongs to the first step, Zone 3 lies in the second step, and the third step includes the other three zones.

Fig. 9. Classification of atmospheric pressure. (a) Cluster locations of five zones. (b) Neighbors graph. (c) Mean of each zone.

The sparsity rate of the revised W (i.e., the element less than 0.2 is replaced by 0) is reduced to 0.9390 and the neighbor relationships are shown in Fig. 9b. The subfigure shows 11 667 solid lines and 1976 dotted lines, and each station has 40.91 neighbors on the average. The large neighbor number may result from the fact that the atmospheric pressure has the worst sparsity among five similarity matrices. Figure 9c also illustrates the separability of the mean atmospheric pressure.

d. Multiple-view climate classification

In section 3c, we study the climate classification with only one meteorological variable. However, in many practical applications, we need to provide a comprehensive climate classification by integrating all meteorological variables of interest. This subsection presents a composite classification method via multiple views.

We utilize Q meteorological variables to perform classification and obtain the similarity matrices by the proposed method in this paper. Let Wi be the similarity matrix of the ith meteorological variable, where i = 1, 2, …, Q. The integrated similarity matrix is defined by
$$W = \sum_{i=1}^{Q} \gamma_i W_i,$$
where γi ≥ 0 is the weight of the ith meteorological variable and it holds that $\gamma_1 + \gamma_2 + \cdots + \gamma_Q = 1$. Large weight means great importance, and the case γi = 0 indicates that the ith meteorological variable is not considered in the classification procedure. The weights $\{\gamma_i\}_{i=1}^{Q}$ can be determined by the purpose of classification.
The following provides a method to determine automatically all weights. We first perform classification on the ith meteorological variable and all stations are partitioned into c clusters. Let $n_j^{(i)}$ be the number of stations from the jth cluster. For the ith meteorological variable, the imbalance degree of the above classification procedure is defined as
$$b^{(i)} = \left[\max_j n_j^{(i)} - \min_j n_j^{(i)}\right] / \max_j n_j^{(i)}.$$
A larger imbalance degree indicates a more imbalanced grouping. If there is no prior information, the case that $n_1^{(i)} = n_2^{(i)} = \cdots = n_c^{(i)}$ is expected, that is, $b^{(i)} = 0$. Large $b^{(i)}$ is propitious to classification. Hence, we set $\gamma_i = b^{(i)} / [b^{(1)} + b^{(2)} + \cdots + b^{(Q)}]$.
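The imbalance degree and the resulting weights can be sketched as below. The function names are hypothetical, and the three imbalance values passed to `view_weights` are made-up toy numbers, not results from the paper. The zone sizes 112, 135, 199, 89, and 126 are those reported later in this subsection for the multiple-view classification.

```python
import numpy as np

def imbalance_degree(counts):
    """b = (max_j n_j - min_j n_j) / max_j n_j for the cluster sizes n_j."""
    counts = np.asarray(counts, dtype=float)
    return (counts.max() - counts.min()) / counts.max()

def view_weights(b):
    """gamma_i = b_i / sum(b); assumes at least one b_i is positive."""
    b = np.asarray(b, dtype=float)
    return b / b.sum()

def integrate(similarities, gammas):
    """Integrated similarity matrix W = sum_i gamma_i * W_i."""
    return sum(g * W for g, W in zip(gammas, similarities))

# Zone sizes of the multiple-view classification reported in the text.
b_multi = imbalance_degree([112, 135, 199, 89, 126])
print(round(b_multi, 4))  # 0.5528

# Hypothetical imbalance degrees for three views, for illustration only.
gammas = view_weights([0.2, 0.6, 0.2])
W = integrate([np.eye(2), np.ones((2, 2)), np.eye(2)], gammas)
```

If every view were perfectly balanced, all imbalance degrees would be zero and the weights would be undefined, so a fallback such as equal weights would be needed in that case.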

According to the experimental results of section 3c, we list the number of stations in each zone for all meteorological variables, as shown in Table 1. The imbalance degree of each variable is also displayed in the last column of Table 1. From this table, we can see that the daily atmospheric pressure has the largest imbalance degree, the relative humidity has the second largest imbalance degree and the sunshine hours has the smallest imbalance degree. The above observations mean that both daily atmospheric pressure and relative humidity are of primary importance.

Table 1. Number of stations in each zone and imbalance degree. The bold font emphasizes the largest value.

Spectral clustering is carried out on the synthetic similarity matrix. Figure 10 shows the integrated classification results, where all stations are divided into five zones. It can be seen from this figure that each zone has a reasonable layout and that the classification result is closest to that of atmospheric pressure. The comprehensive zoning in Fig. 10a differs more from the zoning obtained with each of the other individual meteorological variables. Figure 10b plots the neighbors graph, and there are 5089 lines within the same zones and 517 lines between different zones. The numbers of stations in the five zones are 112, 135, 199, 89, and 126, respectively. The imbalance degree of the above classification is 0.5528, which lies between the smallest and the largest values of the five individual meteorological variables.

Fig. 10. Climate classification by multiple views. (a) Cluster locations of five zones. (b) Neighbors graph.

The consistency rate is employed to evaluate the compatibility of two classification methods. All stations are divided into c zones $\{Z_1^{(i)}, Z_2^{(i)}, \ldots, Z_c^{(i)}\}$ according to the ith classification method, where these zones are rearranged by some order. For a given set Z, let Card(Z) be the cardinality of Z. The consistency rate (CR) between the ith and the jth classification methods is defined by
$$\mathrm{CR}_{ij} = \frac{1}{N} \sum_{l=1}^{c} \mathrm{Card}\left[Z_l^{(i)} \cap Z_l^{(j)}\right].$$
The values of $\mathrm{CR}_{ij}$ lie in the interval [0, 1]. Large $\mathrm{CR}_{ij}$ means large consistency and $\mathrm{CR}_{ij} = 1$ when the ith and the jth classification methods have identical classification results.
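A possible way to compute the consistency rate from two label vectors is sketched below. The text says only that the zones are "rearranged by some order"; as an assumption, this sketch matches the zones by searching all permutations of the labels and keeping the one with the largest overlap, which is feasible for small c such as 5. The function name and the toy labels are hypothetical.

```python
from itertools import permutations

import numpy as np

def consistency_rate(labels_i, labels_j, c):
    """Fraction of stations in matched zones under the best relabeling (assumed matching)."""
    labels_i = np.asarray(labels_i)
    labels_j = np.asarray(labels_j)
    n = len(labels_i)
    # overlap[a, b] = number of stations in zone a of method i and zone b of method j
    overlap = np.zeros((c, c), dtype=int)
    np.add.at(overlap, (labels_i, labels_j), 1)
    best = max(
        sum(overlap[l, perm[l]] for l in range(c))
        for perm in permutations(range(c))
    )
    return best / n

# Same partition with renamed zones.
print(consistency_rate([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0], c=3))
# One station assigned differently.
print(consistency_rate([0, 0, 1, 1, 2, 2], [1, 1, 2, 0, 0, 0], c=3))
```

For larger c, the exhaustive search could be replaced by an assignment solver, but that goes beyond what the text specifies.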

Table 2 shows the consistency rates between different classification methods. It can be seen from this table that, among the five single-variable classifications, the sunshine hours variable has large classification consistency with diurnal temperature range and average relative humidity. This conclusion can be explained as follows. Both diurnal temperature range and average relative humidity are closely related to the sunshine hours: a small diurnal temperature range and a large average relative humidity tend to accompany shorter sunshine hours. In addition, the classification under multiple views generally improves the consistency rate, and it has the largest consistency with the classification result of atmospheric pressure.

Table 2. Consistency rates between different classification methods. The bold font emphasizes the largest and second largest values.

e. Determination of the number of clusters

Choosing the number c of clusters is a fundamental problem for all clustering tasks. In spectral clustering, the eigengap heuristic is a recommended tool designed for determining the number c (Von Luxburg 2007). Let {λ1, λ2, …, λN} be N eigenvalues of the Laplacian matrix L and they are sorted in descending order (only the real part is considered for the complex eigenvalue). It is obvious that the largest eigenvalue λ1 = 1. Inspired by matrix perturbation theory, we hope that the first c eigenvalues λ1, λ2, …, λc are relatively large, but λc+1 is relatively small. For determining c, the ith eigengap is defined as $\delta_i = \lambda_i - \lambda_{i+1}$, i = 1, 2, …, N − 1. The relatively large eigengap provides us a reference for choosing the number c.

For simplicity, we synthesize a similarity matrix W through averaging the similarity matrices of five meteorological variables used in this paper. In the process of constructing similarity matrices, three numbers of neighbors are considered, namely, k ∈ {0, 10, 30}. For the Laplacian matrix corresponding to W, we compute its first 30 eigengaps and plot them in Fig. 11. Without considering the first eigengap, δ5 reaches the largest value, which means that the number c of clusters should be greater than or equal to 5. We suggest that the number c of clusters should satisfy the following two conditions: $\delta_c \geq \delta_0$ and δc is a local maximum, where the threshold δ0 is a given small positive value. If δ0 is set to 0.02, then we can take c ∈ {5, 8, 11, 13, 16} for k = 0, c ∈ {5, 7, 9, 11, 13, 15} for k = 10, and c ∈ {5, 10, 15} for k = 30. On the whole, the number c of clusters can be chosen within the range from 5 to 16.
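The selection rule above (an eigengap that is at least δ0 and is a local maximum, ignoring the first eigengap) can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, W is taken to be symmetric with positive row sums so that a symmetric eigensolver applies, and "local maximum" is read as being no smaller than both neighboring eigengaps.

```python
import numpy as np

def eigengap_candidates(W, delta0=0.02, max_c=30):
    """Return the eigengaps of D^{-1/2} W D^{-1/2} and the candidate cluster numbers."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)                  # assumes positive row sums
    L = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L))[::-1]     # eigenvalues in descending order
    gaps = lam[:-1] - lam[1:]                      # gaps[i - 1] holds delta_i
    candidates = []
    for c in range(2, min(max_c, len(gaps)) + 1):  # the first eigengap is skipped
        gap = gaps[c - 1]
        left = gaps[c - 2]
        right = gaps[c] if c < len(gaps) else -np.inf
        if gap >= delta0 and gap >= left and gap >= right:
            candidates.append(c)
    return gaps, candidates

# Toy graph: three disconnected groups of four nodes each.
block = np.ones((4, 4)) - np.eye(4)
W = np.kron(np.eye(3), block)
gaps, candidates = eigengap_candidates(W)
print(candidates)
```

On this toy graph the only large eigengap is the third one, so the rule returns a single candidate.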

Fig. 11. Eigengaps for integrated Laplacian matrix.

f. Sensitivity analysis

1) Regularization parameters

This part discusses the influence of the regularization parameters on the experimental results. For the sake of simplicity, we take daily average temperature as an example. When both α1 and α2 are relatively small, the optimal coefficient matrix C* in problem (6) tends to the zero matrix. Conversely, when they are relatively large, the optimal noise matrix E* or Z* in problem (5) tends to the zero matrix. Based on numerical experiments, we have the following observations. If α1 = α2 and 1 ≤ α1 ≤ 4, some columns of C* are zero vectors, which conflicts with our motivation. If α1 = α2 = 50, each column of C* has 8.07 nonzero elements on average. If α1 = α2 = 500, each column of C* has 38.72 nonzero elements on average. Generally, the value of SR(C*) decreases as α1 and α2 increase. In summary, α1 = α2 = 50 is an appropriate choice.

Because the true classification result is unknown, we take the classification result for the case α1 = α2 = 50 as the reference and study the consistency rates under different values of the parameters α1 and α2. Set α1 = 10 + 20i and α2 = 10 + 20j, where i, j = 0, 1, …, 25. For all possible combinations of α1 and α2, the corresponding consistency rates are shown in Fig. 12. From this figure, we have the following observations. For fixed α1, the consistency rates with different α2 are relatively stable and the corresponding standard deviation is less than 0.032. By contrast, for fixed α2, the consistency rates show a decreasing trend as α1 increases. Furthermore, if α1 ∈ [30, 330] and α2 ∈ [10, 510], the smallest consistency rate is 0.9213, which means the classification results have good consistency. The above observations demonstrate that the classification results are not very sensitive to the choices of α1 and α2.

Fig. 12. Consistency rates under different α1 and α2.

2) Trade-off parameter

The trade-off parameter β balances the importance of sparse subspace representation and the k-nearest-neighbor graph. For the five given meteorological variables, we discuss the impact of different β on the classification results. Similarly, we take β = 1/2 as the reference and explore the consistency rates for other values of β. Let $\{i/10\}_{i=0}^{10}$ be the set of values of β. The consistency rates are compared in Fig. 13. If β lies in the interval [0.2, 0.8], the smallest consistency rates are daily average temperature (0.8835), average relative humidity (0.9198), sunshine hours (0.8729), diurnal temperature range (0.9592), and atmospheric pressure (0.8926). This observation means that the classification results have good consistency for the case β ∈ [0.2, 0.8]. Moreover, the consistency rate for β = 0 differs considerably from that for β = 1, which shows that it is valuable to choose an appropriate parameter β.

Fig. 13. Consistency rates under different β.

3) Number of neighbors

The choice of neighbor numbers directly affects the similarity matrix Wknn, which has a strong influence on the compromise matrix W. This part adopts the same computation formulation for integrated similarity matrix as section 3d. In the following, we will compare the classification results for different values of k. Let k = 5i, where i = 0, 1, 2, …, 10. The climate classification is performed for each k and the consistency rate is employed to evaluate the consistency of classification results under two values of k. According to experimental results, the consistency rate lies in the interval [0.8215, 0.9955], which indicates that the value of k has a certain impact on classification. The consistency rate comparison is displayed in Fig. 14 as an image.

Fig. 14. Consistency rates under different k.

Let k1 and k2 be two numbers of neighbors. Three observations can be seen from Fig. 14. In general, the smaller the value of $|k_1 - k_2|$ is, the larger the consistency rate is. The case that k1 = 0 and k2 = 40 yields the worst consistency rate. When $|k_1 - k_2| = 5$, the difference between two corresponding consistency rates is less than or equal to 0.0439 in the absolute sense. These observations mean that the climate classification results are stable for minor changes in k.

4) Number of clusters

In section 3e, we discuss a determination method for the number of clusters. For the climate data used in this paper, it is proposed that the number c of clusters ranges from 5 to 16. This part will consider several other values of c, that is, c ∈ {7, 9, 11, 13}. For the above four values of c, spectral clustering is performed on the similarity matrix used in section 3d. The detailed classification results are shown in Fig. 15.

Fig. 15. Classification under different numbers of clusters: (a) c = 7, (b) c = 9, (c) c = 11, and (d) c = 13.

By observing Figs. 10a and 15a, we can see that Zones 1 and 4 in Fig. 10a change only slightly, but the remaining three zones are approximately divided into five zones when c = 7. We also have the following observations by further comparing the four subfigures of Fig. 15. Zones 1 and 3 in Fig. 15a are each roughly partitioned into two zones when c = 9. Zones 1 and 7 in Fig. 15b are each split into two zones when c = 11. Zone 11 in Fig. 15c is segmented into Zone 1 and Zone 10 in Fig. 15d, and Zone 3 and Zone 6 are changed into three zones. Based on the above observations, we draw three conclusions: all stations are regrouped regularly as the number of clusters increases, there is no serious imbalance in the number of stations in each zone, and all stations are divided into several disjoint regions except for a few outliers.

4. Discussion

This section first conducts a further statistical analysis on the climate classification on the basis of daily average temperature, and then it discusses the classification consistency by providing another meteorological dataset.

a. Further statistics of classification results

Section 3c has compared the average results of each zone for five meteorological variables. This section will make a statistical analysis on other indicators. These include the mean temperature of the coldest month (January), the mean temperature of the warmest month (July), cooling degree-days based on 26°C (CDD26), and heating degree-days based on 18°C (HDD18). They are important classification elements in climate zoning for building thermal design.
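For readers unfamiliar with degree-days, the sketch below uses the common definition, in which each day contributes the positive part of the difference between its mean temperature and the base temperature. The paper does not spell out its formula, so this definition, the function names, and the toy temperatures are assumptions made for illustration.

```python
import numpy as np

def heating_degree_days(daily_mean_temp, base=18.0):
    """HDD: sum over days of max(base - T, 0), with T in degrees Celsius."""
    t = np.asarray(daily_mean_temp, dtype=float)
    return np.maximum(base - t, 0.0).sum()

def cooling_degree_days(daily_mean_temp, base=26.0):
    """CDD: sum over days of max(T - base, 0), with T in degrees Celsius."""
    t = np.asarray(daily_mean_temp, dtype=float)
    return np.maximum(t - base, 0.0).sum()

# Three hypothetical daily mean temperatures (degrees Celsius).
temps = [10.0, 20.0, 30.0]
hdd18 = heating_degree_days(temps)   # only the 10 degree day counts: 18 - 10
cdd26 = cooling_degree_days(temps)   # only the 30 degree day counts: 30 - 26
print(hdd18, cdd26)
```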

For brevity, only the classification result of daily average temperature in section 3c is utilized. To compute the aforementioned four indicators, the daily average temperatures for each station are averaged over 10 years and thus an annual average temperature time series is obtained. The boxplot is used to graphically depict groups of numerical data through their quartiles. The experimental results are displayed in Fig. 16. By observing the first two subfigures in Fig. 16, we find that Zone 1 has the highest average temperature in January and July, Zone 4 has the lowest average temperature in January, and Zone 5 has the lowest average temperature in July. There are in total three outliers in January and 21 outliers in July. It can be seen from the last two subfigures that the average degree-days in HDD18 gradually increase from Zone 1 to Zone 4, and CDD26 changes in the opposite direction. Zone 5 has a larger range than Zone 4 for HDD18, and the degree-days of these two zones are almost zero for CDD26. Coincidentally, HDD18 and CDD26 have 3 outliers and 21 outliers, respectively. To sum up, the statistical analysis of these four indicators demonstrates the rationality and effectiveness of the classification result.

Fig. 16. Comparison of four indicators on five zones: (a) January, (b) July, (c) HDD18, and (d) CDD26.

b. Comparison with another dataset

Due to data limitations, the previous work only considers the climate dataset covering a 10-yr period. In this subsection, we provide another dataset during the period from 1 January 2014 to 31 May 2017. This dataset is also from the National Climate Center of China. As for the aforementioned meteorological variables, atmospheric pressure is not included in the new dataset. Moreover, this dataset involves 706 stations shown in Fig. 17 and there are 552 stations common to the two datasets. Compared with Fig. 1, the stations located at the regions of Zhejiang, Jiangxi, Guizhou, and Tibet are not taken into account. By adopting a 5-day period, each meteorological variable is represented by a 249-dimensional vector.

Fig. 17. Spatial distribution of 706 stations in China.

For fair comparison, we set the same parameters as in the previous section, namely, c = 5, k = 10, and α1 = α2 = 50. Under single view, the resulting classifications are shown in Fig. 18. For each meteorological variable in both datasets, we compare the classification consistency on the 552 common stations, and the consistency rates are 0.6757, 0.7572, 0.8514, and 0.9275, respectively. The best consistency occurs for diurnal temperature range, which indicates that the two datasets have good consistency on classification. For the other three variables, the station distribution in Zone 5 is obviously influenced by the lack of stations in Tibet, which directly affects the stations in Zone 1 and Zone 3. In particular, Zone 5 and Zone 2 are severely disrupted in terms of daily average temperature. In summary, the proposed classification method has good stability and obtains acceptable consistency between two consecutive periods, except for daily average temperature.

Fig. 18. Classification under four meteorological variables for the second dataset: (a) daily average temperature, (b) average relative humidity, (c) sunshine hours, and (d) diurnal temperature range.

5. Conclusions

In this paper, we investigate climate classification by using a spectral clustering–based method. Daily climate observations of five variables from 661 stations in China during the period 2004–13 are employed. To obtain a stable and reliable climate classification, the daily observations are averaged over 10-day periods. For classification, a novel construction method of the similarity graph or matrix is proposed. The construction method combines the strengths of k-nearest-neighbor and sparse subspace representation.

In the experiments, weak linear correlation among the five meteorological variables is first validated by correlation analysis. Next, the classification of each individual meteorological variable is studied based on the constructed similarity matrix. Then the climate classification is performed by integrating all meteorological variables. Regardless of single-view or multiple-view classification, the proposed method has good consistency. Afterward, the determination method of the number of clusters is explored. Finally, we study the influence of various parameters on classification, and the results show that the proposed method is not very sensitive to these parameters.

In the discussion section, we carry out a further statistical analysis of the classification results and compare classification consistency using a second dataset. All experimental results validate the feasibility and effectiveness of the proposed climate classification method. This paper provides a mathematical model for climate zoning and offers a valuable reference for, and verification of, climate zones. In the future, we will consider other meteorological variables in climate classification, such as monthly precipitation, solar radiation, and wind. Furthermore, it is worth studying how to construct the similarity matrix by integrating all meteorological variables of interest. We will also explore the application of the climate classification to the building field.

Acknowledgments

This study was supported by “the 13th Five-Year” National Science and Technology Major Project of China (2018YFC0704500), the State Key Program of National Natural Science of China (51838011), and the China Postdoctoral Science Foundation (2017M613087).

REFERENCES

  • ASHRAE, 2013: Climatic Data for Building Design Standards. ASHRAE Standard 169-2013, American Society of Heating, Refrigerating and Air-Conditioning Engineers, 98 pp.

  • Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Foundations and Trends, 128 pp., https://doi.org/10.1561/2200000016.

  • Briggs, R. S., R. G. Lucas, and Z. T. Taylor, 2003a: Climate classification for building energy codes and standards: Part 1—Development process. ASHRAE Trans., 109, 109–121.

  • Briggs, R. S., R. G. Lucas, and Z. T. Taylor, 2003b: Climate classification for building energy codes and standards: Part 2—Zone definitions, maps, and comparisons. ASHRAE Trans., 109, 122–130.

  • Chen, G., and G. Lerman, 2009: Spectral curvature clustering (SCC). Int. J. Comput. Vis., 81, 317–330, https://doi.org/10.1007/s11263-008-0178-9.

  • Cheng, B., J. Yang, S. Yan, Y. Fu, and T. S. Huang, 2010: Learning with ℓ1-graph for image analysis. IEEE Trans. Image Process., 19, 858–866, https://doi.org/10.1109/TIP.2009.2038764.

  • Elhamifar, E., and R. Vidal, 2013: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35, 2765–2781, https://doi.org/10.1109/TPAMI.2013.57.

  • Filippone, M., F. Camastra, F. Masulli, and S. Rovetta, 2008: A survey of kernel and spectral methods for clustering. Pattern Recognit., 41, 176–190, https://doi.org/10.1016/j.patcog.2007.05.018.

  • Gerstengarbe, F. W., P. C. Werner, and K. Fraedrich, 1999: Applying non-hierarchical cluster analysis algorithms to climate classification: Some problems and their solution. Theor. Appl. Climatol., 64, 143–150, https://doi.org/10.1007/s007040050118.

  • Jain, A. K., 2010: Data clustering: 50 years beyond K-means. Pattern Recognit. Lett., 31, 651–666, https://doi.org/10.1016/j.patrec.2009.09.011.

  • Lau, C. C., J. C. Lam, and L. Yang, 2007: Climate classification and passive solar design implications in China. Energy Convers. Manage., 48, 2006–2015, https://doi.org/10.1016/j.enconman.2007.01.004.

  • Liu, G., Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, 2013: Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35, 171–184, https://doi.org/10.1109/TPAMI.2012.88.

  • Lund, R., and B. Li, 2009: Revisiting climate region definitions via clustering. J. Climate, 22, 1787–1800, https://doi.org/10.1175/2008JCLI2455.1.

  • Mallenahalli, N. K., 2015: Predicting climate variability over the Indian region using data mining strategies. 8 pp., https://arxiv.org/abs/1509.06920.

  • Mimmack, G. M., S. J. Mason, and J. S. Galpin, 2001: Choice of distance matrices in cluster analysis: Defining regions. J. Climate, 14, 2790–2797, https://doi.org/10.1175/1520-0442(2001)014<2790:CODMIC>2.0.CO;2.

  • Ministry of Construction of China, 1993: Thermal design code for civil building (GB50176-93) (in Chinese). China Architecture and Building Press.

  • Netzel, P., and T. Stepinski, 2016: On using a clustering approach for global climate classification. J. Climate, 29, 3387–3401, https://doi.org/10.1175/JCLI-D-15-0640.1.

  • Ng, A. Y., M. I. Jordan, and Y. Weiss, 2002: On spectral clustering: Analysis and an algorithm. NIPS’01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, MIT Press, 849–856.

  • Olgyay, V., 2015: Design with Climate: Bioclimatic Approach to Architectural Regionalism. Princeton University Press, 224 pp.

  • Steinhaeuser, K., N. V. Chawla, and A. R. Ganguly, 2010: An exploration of climate data using complex networks. ACM SIGKDD Explorations Newsletter, No. 12, Association for Computing Machinery, New York, NY, 25–32, https://doi.org/10.1145/1882471.1882476.

  • Unal, Y., T. Kindap, and M. Karaca, 2003: Redefining the climate zones of Turkey using cluster analysis. Int. J. Climatol., 23, 1045–1055, https://doi.org/10.1002/joc.910.

  • Verichev, K., M. Zamorano, and M. Carpio, 2019: Assessing the applicability of various climatic zoning methods for building construction: Case study from the extreme southern part of Chile. Build. Environ., 160, 106165, https://doi.org/10.1016/j.buildenv.2019.106165.

  • Vidal, R., and P. Favaro, 2014: Low rank subspace clustering (LRSC). Pattern Recognit. Lett., 43, 47–61, https://doi.org/10.1016/j.patrec.2013.08.006.

  • Von Luxburg, U., 2007: A tutorial on spectral clustering. Stat. Comput., 17, 395–416, https://doi.org/10.1007/s11222-007-9033-z.

  • Walsh, A., D. Cóstola, and L. C. Labaki, 2017: Comparison of three climatic zoning methodologies for building energy efficiency applications. Energy Build., 146, 111–121, https://doi.org/10.1016/j.enbuild.2017.04.044.

  • Wan, K. K., D. H. Li, L. Yang, and J. C. Lam, 2010: Climate classifications and building energy use implications in China. Energy Build., 42, 1463–1471, https://doi.org/10.1016/j.enbuild.2010.03.016.

  • Wright, J., A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, 2009: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31, 210–227, https://doi.org/10.1109/TPAMI.2008.79.

  • Xiong, J., R. Yao, S. Grimmond, Q. Zhang, and B. Li, 2019: A hierarchical climatic zoning method for energy efficient building design applied in the region with diverse climate characteristics. Energy Build., 186, 355–367, https://doi.org/10.1016/j.enbuild.2019.01.005.

  • Zhang, X., and X. Yan, 2014: Temporal change of climate zones in China in the context of climate warming. Theor. Appl. Climatol., 115, 167–175, https://doi.org/10.1007/s00704-013-0887-z.

  • Zheng, J., Y. Yin, and B. Li, 2010: A new scheme for climate regionalization in China (in Chinese). Acta Geogr. Sin., 65, 3–13.
