Lightning is a natural hazard that can lead to the ignition of wildfires, disruption and damage to power and telecommunication infrastructures, human and livestock injuries and fatalities, and disruption to airport activities. This paper examines the ability of six statistical and machine-learning classification techniques to distinguish between nonlightning and lightning days at the coarse spatial and temporal scales of current general circulation models and reanalyses. The classification techniques considered were 1) a combination of principal component analysis and logistic regression, 2) classification and regression trees, 3) random forests, 4) linear discriminant analysis, 5) quadratic discriminant analysis, and 6) logistic regression. Lightning-flash counts at six locations across Australia for 2004–13 were used, together with atmospheric variables from the ERA-Interim dataset. Tenfold cross validation was used to evaluate classification performance. It was found that logistic regression was superior to the other classifiers considered and that its prediction skill is much better than using climatological values. The sets of atmospheric variables included in the final logistic-regression models were primarily composed of spatial mean measures of instability and lifting potential, along with atmospheric water content. The memberships of these sets varied among climatic zones.
Appreciable attention has been given to the problems of thunderstorm classification and prediction in recent years. Many techniques have been used, including empirical orthogonal function and canonical correlation analyses (e.g., Muñoz et al. 2016), classification and regression trees (e.g., Burrows et al. 2005), random-forest classification (e.g., Blouin et al. 2016), quadratic discriminant analysis (e.g., Sánchez et al. 1998), logistic regression (e.g., Mazany et al. 2002; Sousa et al. 2013; Romps et al. 2014), and dynamical modeling (e.g., Yair et al. 2010; Lynn et al. 2012; Zepka et al. 2014). Subject areas have included very-short-range forecasting, seasonal prediction, and climatological studies. The systematic evaluation of the performances of several different classification techniques when applied to datasets from a wide range of climatic zones has not received much attention, however.
The principal aim of this study is to investigate the relationships between lightning activity and atmospheric conditions at the coarse spatial and temporal scales of currently available climate models and reanalyses, including a comparison of different modeling techniques that are based on a range of thermodynamic measures. The lightning data used in this study are from several locations in Australia, covering a variety of climate types. The description of study sites and data, multivariate analyses, and cross-validation experiments below parallels that of Bates et al. (2017). The material covered in section 2 is derived from there with minor modification and is intended to provide sufficient detail to allow readers to assess the validity and generalizability of the results presented here. Results are presented in section 3, and a summary of key research findings is given in section 4.
2. Data and methods
The daily lightning data used in this study were collected from six Comité Internationale des Grandes Réseaux Electriques (International Committee on Large Electric Systems; CIGRE) model “CIGRE 500” lightning-flash counters (Fig. 1; Table 1). The total number of flash counts was considered herein because 1) although the CIGRE 500 sensor was designed specifically to detect cloud-to-ground flashes, it also responded to cloud-to-cloud flashes, with about 68% of the lightning-flash counts recorded being due to cloud-to-ground flashes; 2) estimates of the effective horizontal ranges of the counters for cloud-to-ground flashes and cloud-to-cloud flashes are different (30 and 15 km, respectively); and 3) the ratio of intracloud to cloud-to-ground flashes can vary considerably depending on thunderstorm type and intensity, region of occurrence, and season (Rakov and Uman 2003). The counters were read manually each day between 0800 and 0900 local time. They were selected because of their record length and quality and their different climatic settings. The period of record varies from January of 2004 to at least December of 2010 (Townsville, Queensland, Australia) and at most February of 2013 (Melbourne, Victoria, Australia). A thunderstorm was deemed to have occurred during a 24-h period if the counter registered at least one lightning-flash count (LFC).
A second threshold of two LFCs per 24-h period was also used to assess the degree of the sensitivity to threshold selection, because of the (undocumented) possibility that some counts of one flash may have originated from a source other than lightning. For the Melbourne site, which has a high proportion of lightning days with LFC = 1 (Table 1), it was found that the results obtained from the procedures described below showed slight to some sensitivity in terms of atmospheric-variable selection and levels of prediction skill. There was no impact on the interpretation of the results, however. Therefore, the results obtained using the threshold of two LFCs in a 24-h period will not be reported here.
Data for 31 atmospheric fields were obtained from the European Centre for Medium-Range Weather Forecasts interim reanalysis (ERA-Interim) archive (Dee et al. 2011): the fields are listed in Table 2. The fields represent a broad variety of physical processes that can be associated with deep convection, including both dynamical and thermodynamical processes. The variables cover various measures of temperature lapse, moisture content, vertical motion, and water phase state at a range of different pressure levels. The spatial and temporal resolution of the dataset is 0.75° and 6 h. For each CIGRE 500 site, atmospheric data were extracted for the 49 reanalysis grid points closest to the sensor’s location. The lightning series was synchronized with the ERA-Interim series for 0600 UTC (1600 h eastern Australia time) within the 24-h period represented by the lightning data. This procedure was done because, in general, weather conditions are more favorable for lightning activity to occur during the late-afternoon period than at other times of the day or night (see, e.g., Christian et al. 2003; Dowdy and Mills 2009). Moreover, the additional information provided by data at other time steps was found to be largely redundant because correlations within a 24-h period were invariably high. For example, correlations between individual variables at 0600 and 1200 UTC, spanning the time period during which most deep-convective processes occur in Australia, are greater than 0.85 in every case examined, and most are greater than 0.95.
Quadratic surfaces and low-dimensional summary statistics (LDSS) were used to characterize the main features of the atmospheric fields on each day ( appendix A). Six LDSS were considered: the intercept of the quadratic surface (mu), the magnitude of the gradient vector (gd) and its direction (dr), Gaussian curvature (gc), vertical gradient (vg), and adjusted correlation coefficient squared R2 (r2). The adjusted R2, as a goodness-of-fit measure for the quadratic surfaces, was included as a measure of spatial (dis)organization in the atmospheric fields.
For each candidate variable (designated hereinafter by the convention “LDSS code.field name”), comparative box plots were used to contrast its values for lightning and nonlightning days. A data matrix was constructed using variables that showed the greatest contrast. The columns of this matrix were standardized to zero mean and unit variance. This ensures that the variables were placed on a commensurable scale without disturbing the shape of their probability distributions, facilitates interpretation of the results of a discriminant or regression analysis, and helps to concentrate precisely on the conditions that are present during nonlightning and lightning days because it focuses on the relative variations of each variable within its own physical limits. The “colldiag” function from the “perturb” package in the R computing software environment (https://cran.r-project.org/web/packages/perturb/index.html) was used to detect the presence of collinearity in the data matrix. Colldiag is an implementation of the regression collinearity diagnostic procedures found in Belsley (1991). It computes the condition indices of the data matrix and provides the variance decomposition proportions associated with each condition index. As a rule of thumb, variables with proportions that are greater than 0.99 were considered to be sources of severe collinearity. Thus, the corresponding columns were removed to form a reduced data matrix.
Six classification techniques were used: a principal component (empirical orthogonal) analysis–logistic-regression approach (PCA&LR), classification and regression trees (CART), random forests (RF), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression (LR). Four measures of prediction skill were considered: hit rate (HR), false-alarm ratio (FAR), Brier (1950) score (BS), and (for LR) the area under the receiver-operating-characteristic curve (AUC). For a perfect classification, HR = 1, FAR = 0, BS = 0, and AUC = 1. Values of HR near 0, FAR and BS values near 1, and AUC values near 0.5 indicate poor performance. Further details on the classifiers and the receiver-operating-characteristic curve can be found in appendix B and other sources (see, e.g., Breiman 2001; Venables and Ripley 2002; Hilbe 2009). Tenfold cross validation was used to assess how well the classifiers performed on an independent dataset. All analyses were carried out in the R computing environment (https://www.r-project.org/).
The proportions of lightning days for the CIGRE 500 sites are displayed in Fig. 1. The proportions range from 0.06 (Perth, Western Australia, Australia) to 0.43 (Darwin, Northern Territory, Australia). Median adjusted R2 values for the fitted quadratic surfaces varied across atmospheric fields and sites, with 8.1%–16% below 0.5 and 49%–70% above 0.75. Thus, the surfaces gave a reasonable representation of the main features of the fields. For PCA&LR, perusal of comparative box plots revealed that only the first principal component had any predictive power in terms of discriminating between lightning and nonlightning days. Therefore, the remaining principal components were not considered further.
Summaries of the cross-validation results for lightning days appear in Figs. 2 and 3. Every classifier considered performed well for Darwin: HRs are noticeably greater than FARs and the BSs are low (0.09–0.13). The RF and LR methods produced the highest HRs (0.92 and 0.93, respectively). For the five remaining sites, PCA&LR is the worst-performing classifier, with FAR > HR in every case. The LR method produced the highest HRs across these sites and, for Melbourne and Perth, the only cases in which HR > FAR. Across all sites, AUC values ranged from 0.80 (Melbourne) to 0.96 (Darwin). They indicate a prediction skill that is much better than use of climatological values (known as “climatology”; AUC = 0.5). Perusal of Fig. 3 reveals that Coffs Harbour (in New South Wales) and Melbourne, both in the vicinity of the extratropical east coast region of Australia, have the highest BSs (0.15 < BS ≤ 0.24). This is primarily due to low HRs for both lightning and nonlightning days and high FARs for nonlightning days. The relatively low prediction skill in this region could relate to the fact that lightning activity is sometimes associated with other synoptic-scale systems, such as subtropical cyclones known as “east coast lows” (Chambers et al. 2014; Dowdy and Kuleshov 2014). Although these synoptic-scale systems may exert some localized control on atmospheric conditions associated with deep convection (such as in relation to low-level convergence and advection of moisture), the cyclones in this region may not be well represented by the specific set of 0600 UTC thunderstorm and convection measures that are considered in this study. Furthermore, an improved understanding of the factors controlling cyclone activity in this region, as well as associated local severe-weather conditions, is an area of active research. Research topics include improved understanding of the region’s uniqueness in terms of the hybrid energetics characteristics of these cyclones (Pezza et al. 2014); the regional features of its lightning climatological behavior (Dowdy and Kuleshov 2014); and the interrelationships among cyclones, fronts, thunderstorms/lightning activity, and local atmospheric conditions (Dowdy and Catto 2017). Across the six classifiers, the lowest BSs were obtained for Perth (0.052 < BS < 0.11). This result is because of the high HR (0.90) and low FAR (0.017) obtained for nonlightning days and the low number of lightning days for that site.
A dot chart of the 15 variables included in the six final LR models is displayed in Fig. 4, and a key to the atmospheric fields that were used appears in Table 2. The plot and table reveal five key features. First, 10 variables are spatial mean measures of instability and lifting potential and of atmospheric water content. Second, only mu.CAPE appears in all six models, mu.TOTP appears in five, and mu.TTI, r2.SH, and mu.CONVP appear in four. Third, variables representing wind shear are not included in the LR models for these locations. Fourth, the LR models for Darwin and Townsville include mu.CAPE, mu.TOTP, mu.TTI, and r2.SH as variables whereas the models for Perth and Port Hedland (in Western Australia) include those variables as well as mu.CONVP. Fifth, despite the above similarities, the general pattern of scatter in the dot chart suggests that the optimal variable sets for the classification of lightning days may vary among different climatic zones.
The dominance of the spatial mean measures could be related to temporal variations in the timing of thunderstorms with respect to a given location since lightning can, at times, occur during hours other than late afternoon. Although CAPE and TTI are well known measures of storminess, including for Australian conditions (Hanstrum et al. 2002; Niall and Walsh 2005; Allen et al. 2011; Dowdy 2015), the prominence of mu.TOTP and mu.CONVP might reflect the strong relationship between convective parameterizations and dynamic variables such as moisture convergence and vertical velocity W at midlevels that was found by Davies et al. (2013). It is also noted that measures that are based on precipitation in combination with CAPE have also been shown to provide a good indication of lightning-flash density in other regions of the world, such as the United States (Romps et al. 2014). Perusal of the comparative box plots for r2.SH revealed that high values (>0.9) are associated with lightning days at Darwin, Townsville, Perth, and Port Hedland. These days are associated with noticeably high values of mu.SH and low values of vg.SH. Combined, these conditions indicate the presence of high levels of atmospheric moisture content near the surface (as indicated by r2.SH) with little change with height (as indicated by vg.SH). A high concentration of moisture throughout a range of levels in the lower troposphere could help to lead to enhanced moist convection over a considerable depth of the atmosphere. Given the important role of entrainment and detrainment in cumulus convection (De Rooy et al. 2013), these conditions are likely to be conducive to latent heat release (from condensation and/or freezing processes) acting to enhance potential updraft strengths. Although there are considerable uncertainties and complexities around the microphysical processes and combination of physical factors associated with lightning generation, such as the role of aerosols as potential cloud condensation nuclei in processes leading to lightning occurrence (Stolz et al. 2017; Thornton et al. 2017), it is widely accepted from a thermodynamic perspective that strong updraft speed (i.e., kinetic energy) is needed in regions of the cloud where ice is present to help to produce charge separation and the associated high potential differences that are required for atmospheric electrical breakdown (Rakov and Uman 2003).
Two variables that represent vertical wind shear were considered as candidate variables in the analysis (Table 2). Although some previous studies, such as Allen et al. (2011), have used variables for vertical wind shear within large-scale indicators of thunderstorm characteristics in Australia, one plausible explanation for the lack of wind-shear variables in the final LR models for lightning is that it is due to differences in the thunderstorm characteristics under study. For example, Allen et al. (2011) focused on severe-thunderstorm characteristics such as hail, tornados, and extreme winds and rainfall rather than on lightning occurrence.
The following key results were found for six different locations in Australia:
Low-dimensional summary statistics capture useful information about the structure of thunderstorms at coarse spatial and temporal scales. This result is consistent with the finding of Bates et al. (2017) that the use of LDSS adds value to the discrimination of dry and wet thunderstorms.
The overall performance of logistic regression was superior to that of the other classifiers considered.
The prediction skill of the LR was found to be much better than use of climatology.
The variables associated with the final LR models are dominated by spatial mean measures of instability and lifting potential and of atmospheric water content (10 of 15 variables). This dominance might be related to temporal variations in the timing of thunderstorms at a given location.
Although the same set of atmospheric variables was used for each CIGRE 500 site, the variables in the final LR models varied across climatic zones. The issue of whether it is possible to use the same variables and same classification method at different sites within a single climatic zone would be a fertile area for future research.
It is envisaged that the combined LDSS–LR approach advocated in this study will find application as finer-scale reanalyses and GCM runs become available. Such work might lead to results that are less dependent on climate-model parameterizations (such as mu.TOTP and mu.CONVP).
Lightning data were provided by the Observations and Engineering Branch of the Australian Bureau of Meteorology. The National Environmental Science Programme provided partial financial support for author AJD. All data used in this work are available on request from Andrew J. Dowdy (email@example.com). We also thank three anonymous referees and the editor of this journal, Andrew Ellis, for their thoughtful and constructive comments on the original typescript.
Representation of Atmospheric Variables
The material covered here and in appendix B is taken from Bates et al. (2017) with minor modification and is intended to provide sufficient methodological detail to allow readers to decide whether they need to read further. Most of the information on the daily atmospheric variables that are used herein is available at a single pressure level or is defined as a mean or difference for fixed pressure levels and hence can be considered as a function of two spatial dimensions: z = f(x, y). An exception is convective mass flux (CMF), which, by definition, has a constant value across all 49 reanalysis grid points for a given day and UTC time. Other variables such as air temperature, minimum geostrophic vorticity, vertical velocity, specific humidity, and zonal and meridional wind are defined for specific atmospheric pressure levels p at each grid point (Table 2). These variables can be considered as a function of three spatial dimensions: z = f(x, y, p). For each day, quadratic surfaces were fitted to the atmospheric fields for 0600 UTC using ordinary least squares. A quadratic surface in two spatial dimensions is defined by
and the corresponding surface in three spatial dimensions is defined by
Instead of fitting Eqs. (A1) and (A2) directly, the linear and quadratic terms were replaced by orthogonal polynomials to ensure that the intercept and linear and quadratic regression coefficients are independent of each other (i.e., they do not change when higher-order terms are added), and the estimates of the intercept and regression coefficients are placed on the same scale. Also, it allows the decomposition of relationships into general components of magnitude as well as into linear and nonlinear rates of change. The estimates were calculated in a coordinate system that was centered on the CIGRE 500 sensor (i.e., the 7 × 7 grid described in section 2). The adjusted R2 was used as a goodness-of-fit measure for the quadratic surfaces and a measure of spatial (dis)organization in the atmospheric fields.
Let θ1, …, θ10 denote the orthogonal polynomial regression coefficients. Six LDSS for the above surfaces were used to facilitate physical interpretation: the intercept, which is equivalent to the mean across the domain (mu = θ1); the magnitude of the gradient vector (gd) and its direction (dr) in the x–y plane; Gaussian curvature (gc); vertical gradient (vg = θ7); and adjusted R2 (r2) since it is a measure of spatial (dis)organization in the atmospheric fields. The magnitude of the gradient vector and its direction in terms of linear rate of change are defined by and dr = tan−1(θ4/θ2). Given the use of orthogonal polynomial regression, the values of gd and dr are the same as those that would have been obtained had a linear surface been fitted to the data. Gaussian curvature is an intrinsic geometric property of a surface that is independent of the coordinate system that is used to describe it. It is defined by
where det () denotes the determinant, is the Hessian matrix given by
and λ1 and λ2 are the eigenvalues of (and also the maximum and minimum principal curvatures).
Statistical and Machine-Learning Classification Techniques
The first classification technique used in this study involves dimensional reduction using PCA and classification with logistic regression. PCA uses an orthogonal transformation to convert an original set of possibly correlated variables into a new set of mutually uncorrelated variables that are arranged in decreasing order of importance. The first principal component is the linear combination of the original variables that captures as much of the variation in the original dataset as possible. The second component captures the maximum variability that is uncorrelated with the first component, and so on. PCA provides a useful reduction in complexity when a substantial proportion of the total variance in the data is accounted for by a few components. It is not in itself a classification technique.
The LR model can be written as
where πi is the probability of occurrence of class i (i = 1, 2), πi/(1 − πi) is the odds ratio for class i, p is the number of columns in the data matrix , and β0, …, βp are the regression coefficients, which are determined through maximum likelihood estimation. (It is obvious that with only two categories it is only necessary to estimate the coefficients for one of the categories since π2 = 1 − π1.) Classification on the basis of the variables is then done by setting a threshold τ, say, and allocating a day to category 1 if π1 > τ. For each site, a receiver-operating-characteristic curve (a plot of HR vs FAR as the threshold τ is varied across its full range) was used to estimate the threshold by minimizing the distance from the curve to the point representing perfect classification accuracy (HR = 1; FAR = 0). This was done to account for the fact that the sample sizes for nonlightning and lightning days were noticeably unequal at all sites (Table 1). Experiments using the Youden (1950) index indicated that threshold estimates were not sensitive to the selection technique that was used. With LR, by contrast with LDA and QDA, there is no formal requirement for multivariate normality of the explanatory variables within each category of the response variable, and the use of binary or categorical variables is acceptable. A combination of stepwise selection and analysis of deviance was used to determine the significance of variables in the LR models. All of the variables used in the final LR models are significant at the 0.05 level.
CART uses binary recursive partitioning to divide the data space, splitting it along the coordinate axes of the candidate variables to give increasingly homogenous subsets and hence the maximal separation of the classes until it is infeasible to continue. The measure of node heterogeneity is the deviance (a quality-of-fit statistic). The partitioning leads to a set of decision rules in the form of a binary tree. The tree is “pruned” to identify a parsimonious tree with acceptable misclassification rates. Cross validation can be used to determine an appropriate tree size.
The random-forests method is an ensemble learning algorithm that generates a large number of CART from bootstrap samples of the original data. An estimate of the misclassification rate can be obtained by using each tree to predict the data not in the bootstrap sample and averaging the predictions over all trees. Variable importance plots can be produced that reveal how important each variable is in classifying the data and contributing to the homogeneity of the nodes.
LDA is derived from an underlying model in which the distributions of the variables on dry and wet lightning days are both multivariate normal, with possibly different means and a common covariance matrix. LDA is somewhat robust with respect to minor violations of these assumptions. Although serious violations will often result in unreliable estimates of the coefficients, the procedure can still be a good heuristic. The discriminant function is a linear combination of the candidate variables, the coefficients of which are estimated by ordinary least squares so that the ratio of the between-classes variance and the within-classes variance is maximized. This function takes the value zero at the decision boundary. If the value of the discriminant function is negative, the variable vector is assigned to one class; if it is positive, the variable vector is assigned to the other class. Given that the variables are standardized, the coefficients indicate the relative importance of each variable in predicting class assignment.
Quadratic discriminant analysis is a generalization of LDA in which the two classes need not have the same covariance matrix, but the assumption of multivariate normality still applies. The interpretation of the coefficients in terms of the relative importance of each variable is more difficult to assess than for LDA as the discriminant function contains quadratic as well as linear and constant terms.