## Abstract

This work evaluates numerous thunderstorm predictors and investigates the use of artificial neural networks (ANNs) for identifying occurrences of thunderstorms in reanalysis data. Environmental conditions favorable for deep, moist convection are derived from 6-hourly ERA-Interim reanalyses, while thunderstorm occurrence in the following 6 h over Finland is derived from lightning location data. By taking advantage of the consistency and large sample size (14 summers) provided by the reanalysis, complex multivariate models can be trained for a robust estimation of convective weather events from model data. This and other methods are used to yield information on the most effective convective predictors in a multivariate setting, which can also benefit the forecasting community. The best ANN found uses 15 inputs and received a Heidke skill score (HSS) of 0.51 on an independent test sample. This is a substantial improvement over the best predictor when used alone, the most unstable lifted index (MULI) with HSS = 0.40, the multivariate model having fewer false alarms in particular. After MULI, the most important ANN input was relative humidity near 700 hPa. Dry air aloft was associated with significantly lower thunderstorm probability and flash density regardless of convective available potential energy (CAPE). Other important parameters for thunderstorm development were vertical velocity and low-level *θ*_{e} advection. Finally, the Peirce skill score indicates a clear meridional gradient in skill for categorical forecasts, with higher skill in northern Finland. This analysis suggests that the difference in skill is real and associated with a steeper thunderstorm probability curve in the north, but further studies are needed for a physical explanation.

## 1. Introduction

The accurate prediction of deep, moist convection in its many hazardous forms—such as large hail, lightning, severe straight-line winds, and tornadoes—remains one of the biggest challenges in meteorology. A reasonable representation of atmospheric convection in atmospheric models is crucial also because it constitutes a major form of heat and moisture transport between the boundary layer and the upper atmosphere (Wang et al. 2011). Finally, given the significant human and economic loss associated with severe weather (e.g., Changnon 2001), it is very important to determine if such events will become more frequent with climate change (Kunz et al. 2009). Unfortunately, answering this question based on the observational record is very difficult given that such records are rare, short, and/or affected by nonmeteorological factors (consider, e.g., the effect of population density on tornado reports). Therefore, previous studies have instead evaluated changes in large-scale environments conducive to convective weather events (Brooks 2009, 2013) but often without considering storm initiation (Robinson et al. 2013).

In recent years, convection-permitting models have gained ground and enabled a more realistic representation of convection by simulating it explicitly. However, these types of simulations are computationally very challenging to run on larger domains and time scales, especially when it comes to climate simulations spanning several decades, even if it can be done in some cases (Leutwyler et al. 2016). Therefore, there is still a need to develop better convective parameterization schemes, as existing ones have been shown to exhibit many problems. These include an unrealistic diurnal cycle of convection (Brockhaus et al. 2008) and underestimation of hourly precipitation intensities (Ban et al. 2014; Fosser et al. 2015; Hanel and Buishand 2010). For climate studies, it is not necessary to simulate individual storms realistically (and indeed, this has traditionally not been the goal of convective parameterizations) as long as reliable statistics of, for example, convective precipitation can be attained by other means. Both the parameterization problem and the operative forecasting issue come down to a few key questions, such as which environmental conditions cause convection, in its various forms, to trigger and develop? Which parameters can most effectively describe these conditions?

This study aims to address these questions and more by using artificial neural networks to identify and forecast thunderstorm occurrence from reanalysis data. Atmospheric reanalyses provide a “best guess” of the atmosphere for (i) long time periods and (ii) in a physically consistent manner. These qualities should allow for more robust estimates of trends in convective weather than offered by direct observational records of convective weather, assuming a strong relationship can be established between reanalysis-derived parameters and the convective weather event.

Past studies on severe weather and convective parameters have often focused on severe convective storms, where parameters such as convective available potential energy (CAPE) and vertical wind shear have been shown to efficiently characterize severe convective events such as tornadoes and large hail (Brooks and Dotzek 2007). In high-latitude regions such as Finland, high-CAPE environments, which give birth to severe storms such as found in the United States, seldom occur (Tuovinen et al. 2015). For forecasters, it is useful to know which indices, or their combinations, have the best skill in predicting the occurrence of thunderstorms in the specific region they operate in. By focusing on Finland, we aim to shed light on convective processes in high latitudes (60°–70°N).

Thus, a large number of thunderstorm predictors are calculated from ERA-Interim 6-hourly analyses spanning 14 summers in Finland. First, skill scores are used to evaluate the skill of 28 atmospheric stability indices for predicting thunderstorm occurrence, deduced from lightning location data, within six hours after the analysis. Stability indices are parameters derived from vertical profiles of thermodynamic variables and measure the potential for thunderstorm development based on the large-scale thermodynamic environment (Huntrieser et al. 1997). Having established the skill of the best univariate model, we then build a multivariate, nonlinear model for predicting thunderstorm occurrence. We have chosen the artificial neural network (ANN) for this task, as ANNs have been shown to be effective in solving many different kinds of problems (Osman and Laporte 1996), including the prediction of thunderstorms (Manzato 2005; McCann 1992). ANNs are able to approximate virtually any continuous function (Hornik et al. 1989), given sufficient complexity. Our hope is that the “big data” provided by an atmospheric reanalysis will enable building a skillful, highly nonlinear, and complex model without running into overfitting issues. A predictor-selection algorithm is used to select and combine effective ANN inputs from a pool of 36 parameters constituting some of the best-performing stability indices in the skill score test, and also various measures of moisture, lift, and convective inhibition—important factors for deep, moist convection (DMC; Doswell 1987; Johns and Doswell 1992; Lock and Houston 2014).

## 2. Methodology

### a. Data

#### 1) Calculation of indices from ERA-Interim

ERA-Interim (Dee et al. 2011) is the most recent global atmospheric reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), replacing ERA-40. It represents a major upgrade over ERA-40, using four-dimensional variational analysis (4D-Var) for data assimilation. ERA-Interim covers the period from 1979 onward, as it is continuously updated in near–real time. The spatial resolution is 0.75° (79 km globally), with 37 pressure levels and 60 model levels.

All indices and parameters are calculated from ERA-Interim pressure-level fields of geopotential height, temperature, specific humidity, two-dimensional wind components, and divergence. These are given at a level spacing of 25 hPa across the lower and 50 hPa across most of the upper troposphere, with 24 levels in total (100 to 1000 hPa). Additionally, surface parameters of temperature, dewpoint temperature, mean sea level pressure, and 10-m wind components were utilized. The reanalysis provides a best guess of the state of the atmosphere at 6-h intervals, and each vertical profile above a grid box can be regarded as a “pseudo-sounding” to distinguish them from real (observed) rawinsonde soundings. Pseudo-adiabatic parcel ascent was simulated on a 1-hPa vertical resolution using a cubic spline interpolation and the virtual temperature correction (Doswell and Rasmussen 1994).

The data used in this work cover the main thunderstorm season in Finland, May to August (Tuovinen et al. 2009), between 2002 and 2015. Using a time span of 14 yr should hopefully be sufficient to capture a decent variety of thunderstorm environments in Finland. This is important given that year-to-year variation, for example, in the number of flashes observed is typically very large (Mäkelä et al. 2014), and specific summers can be characterized by flow regimes and thunderstorms of a particular type (Tuovinen and Mäkelä 2003).

#### 2) Lightning location data

The lightning location data used in this study are based on the Nordic Lightning Information System (NORDLIS), covering the Nordic and Baltic countries (Mäkelä et al. 2014). The detection efficiency and median location accuracy regarding cloud-to-ground lightning flashes (i.e., the system performance) are estimated to be above 90% and less than 1 km, respectively, within the study area used in this paper (Mäkelä et al. 2010). Although the network evolves practically every year by, for example, adding/replacing sensors, for cloud-to-ground lightning, the performance can be considered somewhat constant since 2002 (Mäkelä et al. 2010).

Pseudo-soundings are associated with thunderstorm activity if at least 1 lightning flash is located inside the grid box within 6 h following the sounding (0000, 0600, 1200, and 1800 UTC). Otherwise, it is classified as a null event. Both cloud-to-ground (C-G) and cloud-to-cloud (C-C) flashes are considered, although the latter are associated with much lower detection efficiency (Mäkelä et al. 2010). The inclusion of C-C flashes, and the low lightning threshold of only one flash being required for a convective classification, means that weak convective events are also considered.

An alternative temporal criterion was also tested for proximity soundings, where thunderstorm activity was monitored within 3 h (±3 h) of the sounding. This led to slightly higher maximum skill score values, suggesting that the proximity soundings were more representative of convective conditions using this criterion. The reason this criterion was not chosen is that we are interested in the *preconvective* conditions (i.e., the conditions that led to storm development) since this is more relevant in an operational context.

The total number of pseudo-soundings amounts to around 1.87 × 10^{6}, or almost two million. Of all the 6-hourly cases, 6.10% were associated with lightning activity. Because the soundings are correlated in space and time, the effective number of statistically independent samples will be lower.

### b. Data analysis

In this study, several methods were used for assessing the efficacy of a large number of thunderstorm predictors. First, a bivariate analysis is performed, using performance metrics derived from the contingency table as a measure of skill of 28 different stability indices. Next, we carry out a multivariate analysis, using a nonlinear ANN to combine several predictors together. Inputs were selected using an iterative predictor-selection algorithm (Manzato 2007b) for building an “optimal” subset of inputs. This entails looping through a list of candidate predictors, finding the input that gives the highest performance, adding it as an input, finding the next best input, and so on. Given how computationally demanding such a procedure is, the number of candidate predictors needs to be kept reasonably low. Hence, a few different methods were used for finding promising candidate predictors for the ANN experiment.

#### 1) Skill scores

The traditional use of stability indices for forecasting convective weather falls into the domain of a dichotomous (two class) forecasting scheme, where the predictand is the occurrence (yes) versus nonoccurrence (no) of the convective event. Since the predictor can take on a continuous range of values, it is often made dichotomous by determining a threshold value that divides its range into two parts: “thunderstorm predicted” (yes = 1) and “thunderstorm not predicted” (no = 0). The dichotomous forecasting setting is illustrated in a 2 × 2 contingency table (Table 1, top). From the contingency table, it is possible to derive various scalars that characterize the forecast performance in some way (Table 1, bottom). The quality of forecasts is often estimated using *skill scores*, which are scalar measures of “overall” forecast skill. Although convenient, such measures are necessarily incomplete representations of forecast performance in a higher-dimensional setting (see Wilks 2006, section 7.2.3) and emphasize different aspects of forecast quality, always relative to some reference model. The two perhaps most widely used skill scores for verifying categorical forecasts are the Heidke skill score (HSS) and the Peirce skill score (PSS). Both of these receive a score of one for perfect forecasts. The reference model in PSS are random forecasts with a distribution equal to the sample climatology, random forecasts receiving a zero score and forecasts worse than this a negative score. HSS may be more useful for rare-event forecasts but has the disadvantage of being sensitive to event climatology. For a thorough discussion of different skill scores, and their respective merits, the reader is referred to section 7.2 of Wilks (2006).

In the bivariate part of this work, the threshold value for each candidate predictor is chosen by maximizing PSS, as suggested by Manzato (2007a). Maximum PSS (maxPSS) is therefore the ranking criterion for assessing the relative performance of indices.

#### 2) Composite analysis

In a multivariate forecasting setting, we are interested in the remaining separation between convective and nonconvective environments, after having accounted for potential instability by using a stability index.^{1} From another perspective, we wish to examine the typical environments in which the convective predictor—such as the lifted index (LI) or CAPE—failed and succeeded in predicting convection and nonevents and find parameters that may improve the forecast further. Previously, Suhas and Zhang (2014) addressed a similar question when evaluating the performance of convective trigger functions. The authors analyzed the vertical profiles of key variables, such as moisture advection and vertical velocity, for the different forecasting categories, such as correct prediction and overprediction of precipitation. In the study, the consistently best-performing trigger function accounted for the entrainment effect of large-scale advection on CAPE generation, with lower-tropospheric advection being particularly important for the good performance.

In a similar fashion, we composite key thermodynamic variables into the four different categories of the 2 × 2 contingency table. This is done for the best-performing stability index in the skill score test and allows us to see what was, for example, the typical humidity profile for cases where thunderstorms were forecast based on the predictor’s value falling in the “thunderstorm” range but then failed to develop (a false alarm).

#### 3) Thundery case probability

The predictive power of a predictor in a multivariate setting may also be estimated by calculating the empirical thunderstorm probability as a function of two input variables, which can be visualized on a two-dimensional diagram. One of the input variables may be a “primary predictor” such as lifted index, while the other may be regarded as a supplementary predictor. Here, we are interested to see if the secondary variable offers any discriminatory power after having accounted for potential instability with the stability index. In this case, we may bin the dependent variable (thunderstorm probability) into a table with, for example, LI on the *y* axis and a humidity variable on the *x* axis. If the humidity variable shows a clear gradient in thunderstorm probability along the *x* axis, for similar values of LI, it is included as a candidate predictor in our neural network experiment. If we instead use the output of an existing multivariate neural network as the primary predictor, the two-dimensional (2D) diagram may inform us whether an additional input could further improve performance. For calculating the empirical thunderstorm probability, we again define thunderstorm events by at least one flash being associated with a given gridbox sounding.

#### 4) Artificial neural networks for predicting lightning occurrence

Multivariate artificial neural networks (Bishop 2006, chapter 5) are trained for the prediction of thunderstorm occurrence in a supervised learning setting. The output of the ANN will be continuous but can later be made dichotomous in the same manner as the stability indices, by choosing a threshold value (here, one that maximizes PSS). The ANN used in this work is a feed-forward, multilayer perceptron with a single hidden layer. The learning function used was the scaled conjugate gradient algorithm, with a tan-sigmoid (log sigmoid) transfer function in the hidden (output) layer. The ANN experiment has been carried out with the Matlab ANN toolbox (Beale et al. 2012).

To avoid overfitting, the database is divided into a training subset (2003–12), validation subset (2002, 2013), and testing subset (2014/15). The training subset is used for the actual training of neural networks. This entails computing the gradient of the error surface and updating the network weights, which represent the strength of connections between neurons, in order to minimize the error function. We use the cross-entropy error (CEE) as the error function as it is generally recommended over least squares error for classification problems. CEE is defined as

where *N* is the total number of cases used in the training set, *y* is the output of the ANN, and *t* is the target output (observed thunderstorm occurrence). The objective under training is to find the weights (which represent the strength of connections between neurons) that minimize the error function.

The validation subset is for model selection (i.e., finding a good combination of inputs and hidden neurons) and also for *early stopping*, which is a method used to improve generalization. This entails monitoring the error for the validation subset during training and stopping training when the validation error begins to rise, a sign of overfitting. Because of the large training sample, a relatively weak early stopping criterion is used (the number of maximum validation failures was increased from the default 6 to 30). Finally, the testing subset is an independent dataset that is only used for the final evaluation of model performance.

For finding a good set of inputs and hidden neurons, we employ the following methodology:

Use the

*forward-selection*algorithm by Manzato (2007b) to iteratively add more inputs to the neural network as long as CEE on the validation set is reduced by at least 0.5% on each input addition.Loop over all 36 parameters, and test them one by one as an ANN input. The parameter that reduces the validation CEE the most is added to the ANN input list and removed from candidate inputs. Repeat to find the second-best input when used together with the first, and so forth, as long as performance still increases as the model grows in complexity.

At each input trial,

*N*= 15 neural networks are trained. Each time an ANN is retrained, the starting weights are slightly different (random initialization). Taking the average output of the ANN ensemble should increase the likelihood of finding a global minimum in the error space and make the selection less prone to an “unlucky” initialization, thus making the choice of input more robust.No manual preprocessing of inputs was performed before network training, but the Matlab feed-forward function automatically applies some preprocessing, such as [−1, 1] minimum–maximum (min–max) normalization.

Because of computational constraints, the number of hidden neurons

*H*is at this stage fixed at*H*=*N*_{inputs}.

Having found a good set of inputs, find a suitable number of hidden neurons

*H*by testing all*H*in the range ½ ×*N*_{inputs}≤*H*≤ 2 ×*N*_{inputs}. Thirty networks are trained on each trial, and selection is again based on validation error of the 30-member ensemble.Having found a good combination of inputs and number of hidden neurons, train

*N*= 100 networks and calculate the ensemble and individual ANN validation performances. The final ANN will be the one with the lowest validation CEE.

The algorithm should result in a good combination of inputs and hidden neurons despite being nonexhaustive. Since the procedure is still computationally very demanding, we aim to have a reasonably low number of candidate predictors. Furthermore, including parameters very similar to each other is simply unnecessary, especially regarding stability indices using the same parcel. Hence, parameters are divided into categories, with the idea that only the most promising predictors in each category are included. For guiding the selection of input candidates, we used the (i) composite analysis of the vertical profile of thermodynamic variables and (ii) two-dimensional thunderstorm probability tables. The resulting pool of input candidates (36 in total), along with stability indices featured in the skill score test, are presented in Table 2.

## 3. Results

### a. Skill scores

#### 1) Stability index rankings and optimal thresholds

The ranking of stability indices by their maximum Peirce skill score is presented in Table 3. According to this measure, the most unstable lifted index (MULI) is by far the best index for predicting thunderstorms in Finland. MULI was also the highest-ranking index by maximum HSS (not shown), for pseudo-soundings at different times of the day (0000 and 1200 UTC were analyzed), and when using the alternative ±3-h criterion for proximity soundings. Variants of the LI have performed well also in other studies in Europe. For example, LI using a mixed-layer parcel was the best-performing index for the prediction of thunderstorms in Germany (Kunz 2007) and for predicting hail-damage days (Mohr and Kunz 2013).

Other indices that consistently ranked in the top five in this work were the most unstable CAPE in the layer from −10° to −40°C (MUCAPE_{−10°to−40°C}; except for 1200 UTC, when it came in seventh place), the stability and wind shear index for thunderstorms in Switzerland (SWISS_{12}; Huntrieser et al. 1997), and the surface lifted index (SLI; except for 0000 UTC). However, the usefulness of SLI is called into question by it being outperformed by MULI even for 1200 UTC soundings, receiving a maxPSS of 0.642 as compared with 0.617 using SLI.

The threshold value for MUCAPE_{−10°to−40°C} is very low at ≥5.4 J kg^{−1}, suggesting that the existence of any positive buoyant energy in the layer important for electrification processes is a good predictor for the occurrence of lightning. According to maxPSS, CAPE values calculated using surface and mixed-layers parcels are relatively poor predictors for thunderstorm occurrence in Finland.

#### 2) Regional distribution of forecast skill

It is of interest to ask to what extent the results presented thus far are spatially homogenous for Finland. Because of the large dataset, the skill score test can be repeated for individual grid boxes with fairly robust statistics. Figure 1 shows the regional distribution of thunderstorm probability (left) and predictability (right) over Finland. The latter is evaluated using the maximum PSS of MULI forecasts. According to this metric, there is a significant meridional gradient in forecast skill over Finland, with higher skill in the north. The distribution of maxPSS is negatively correlated (*r* = −0.74) with the climatological event frequency in our 14-yr database, with convection occurring more frequently in the south. This raises an important question of whether the varying forecast skill is caused by a varying climatology, as a number of skill scores, such as the equitable threat score (ETS) and HSS, have been shown to exhibit a positive correlation with event frequency *p* (Hamill and Juras 2006). When repeating the analysis using HSS, no meridional gradient was seen, but this may be caused by the higher event frequency in the south increasing the skill score artificially and thus negating the physical gradient in skill. According to Mason (1989), the Peirce skill score has the property of being insensitive to changes in event frequency. However, Baldwin and Kain (2006) demonstrated the opposite and showed that the different conclusions can be explained by which strategy is used to vary event frequency. In light of this ambiguity, it is necessary to delve deeper into the apparent north–south division in order to establish whether or not the difference in skill is real.

Therefore, we calculate the frequency distribution of MULI for convective and nonconvective cases for the northern and southern half of Finland (Fig. 2). By separating the two frequency curves (event and null event) using the optimal threshold given by maxPSS for yes/no thunderstorm forecasts, we end up with the four categories of the dichotomous contingency table. For example, the area under the null event distribution on the “thundery” side of the threshold value (i.e., toward more unstable values) corresponds to the total number of false alarms *b* (thunderstorm forecast but not observed), shown in gray in Fig. 2. The point of this exercise is that we can visualize all aspects of forecast quality and how it changes when going from the south to north.

First, we again note that the overall thunderstorm probability is lower in the north, which is associated with fewer hits (39.3% fewer) and misses (54.7% fewer) relative to the southern domain. Second, we note that the null event distribution in the north is clearly shifted toward the right (i.e., a more stable regime). The shift in the null event distribution relative to the thundery distribution (which does not see a similar shift) is associated with a reduction in false alarms (19.8%) and increase in correct negatives (11.4%), which corresponds to more skillful forecasts. In effect, the relative number of cases that are “difficult to predict” is decreased. These are cases where MULI falls just above the threshold value, and a thunderstorm is therefore not predicted to occur. The forecast is more likely to be correct in this case since the thunderstorm probability is lower in the north for more stable values of MULI while being higher for very unstable values (Fig. 2a, solid cyan and orange lines). In other words, the steeper event probability curve of MULI means there is better separation between events and null events in the northern region.

Therefore, we argue that the difference in forecast skill between the north and south is *not* an artifact caused by a differing climatology but that thunderstorm forecasts do have higher skill on average in northern Finland because of a favorable shift in the distribution of null events relative to that of thundery events. Finally, we note that the optimal threshold of MULI, as given by maxPSS, showed no meridional gradient and exhibited a random spatial pattern in general. The exception was central and southeastern Finland, where the optimal MULI threshold was roughly 1°C lower than average.

### b. Composite analysis

Figure 3 shows the composites of different thermodynamic and dynamic parameters for categorical forecasts using MULI. Beginning with the humidity profile (Fig. 3a), we find a notable separation between convective and null events around 700 hPa. The atmosphere is the moistest for misses and hits, the relative humidity (RH) near 700 hPa being almost 16% higher (absolute difference) for these cases than for correct negatives. False alarms also show a positive humidity anomaly at this height but one not as large. Hence, given a sounding with a low relative humidity at this level, a multivariate model should be able to turn some false alarms into correct negatives by learning that thunderstorms are associated with a larger RH.

The *θ*_{e} advection (Fig. 3b) shows a strong anomaly for misses and a smaller anomaly for hits at two levels: 900 and 200 hPa. When decomposing the advection of *θ*_{e} into thermal advection and moisture advection (not shown), it was found that upper-level anomaly can be attributed entirely to warm-air advection, while the low-level anomaly is associated with the advection of warm and moist air. Advection of high-*θ*_{e} air at lower levels is associated with rising motion and destabilization under differential advection (warm-air advection decreasing with height). Therefore, it is not surprising that many of the “misses” of MULI forecasts of thunderstorms in the next 6 h are caused by ignoring the time tendency of instability or low-level moisture. Meanwhile, we have no immediate explanation for the warm-air advection anomaly near the tropopause for thundery cases (which is larger for misses than hits). From a potential instability perspective, this is unintuitive, but what may be more important is that midlevels have a cold *θ*_{e} anomaly.

The anomalies for horizontal wind convergence (Fig. 3c) indicate that convection is associated with low-level convergence and upper-level divergence 0–6 h prior to lightning being observed. Such wind fields are associated with rising motion, and indeed, the model vertical velocity (Fig. 3d) shows a strong positive anomaly throughout the troposphere for the convective cases. Although the precise time and location of convective initiation is largely determined by smaller-scale variability in thermodynamic and kinematic environments, Birch et al. (2014) found that convective initiation almost always occurs within areas of local-scale convergence (60 km by 60 km), which roughly corresponds to the horizontal resolution of ERA-Interim.

Finally, the anomaly profile for temperature (not shown) gave no suggestions for improving MULI forecasts, as the main separation was found between yes forecasts and no forecasts instead of event occurrences and nonoccurrences. Correct negatives were associated with a cold anomaly throughout the troposphere, while other categories had a warm anomaly. The planetary boundary layer (PBL) was especially warm (up to 5 K warmer than average) for cases where convection was correctly predicted to occur.

### c. Thunderstorm probability as a function of two predictors

Thunderstorm probability binned as a function of two variables (Fig. 4) illustrates the benefit of a multivariate forecasting approach: thunderstorm probability has a clear correlation with many secondary predictors (*x* axis) even after accounting for potential instability with MULI (*y* axis). The efficacy of a supplementary forecasting variable is in this case associated with high maximum thunderstorm probability (TP) values and a clear gradient along the *x* axis. By this metric, mean relative humidity in the 800–600-hPa layer (MRH_{800–600}) is one of the most effective predictors to be used together with MULI, with thunderstorm probability exceeding 0.8 for very unstable values of MULI and when the 800–600-hPa layer is very humid. Dry air above the PBL has been suggested to be detrimental for convective development because of entrainment, but James and Markowski (2010) found the inhibiting effect of dry midtropospheric air to be important only in low-CAPE environments. Our results suggest the opposite for MRH_{800–600}: Fig. 4a shows a larger gradient in thunderstorm probability along the *x* axis, the larger the instability. We also calculated flash density—number of flashes observed per 100 km^{2}—as a function of MUCAPE and MRH_{800–600}, with similar results. With humid air aloft (MRH = 80%–90%), flash density ranged from 84 for low CAPE (800–1200 J kg^{−1}) to over 400 for high CAPE (2400–2800 J kg^{−1}). However, when the 800–600-hPa layer was dry (MRH = 40%–50%), fewer than 30 flashes per 100 km^{2} were observed, regardless of CAPE.

One factor at play here may be dry air aloft correlating with low humidity in the boundary layer (PBL). We therefore calculated thunderstorm probability as a function of three variables: MUCAPE, MRH in the PBL (800–1000 hPa), and RH at 700 hPa. Given a PBL relative humidity above 75% and MUCAPE exceeding 1500 J kg^{−1}, thunderstorm probability was 0.63 (4890 cases in total) *if* RH at 700 hPa (RH_{700}) exceeded 50% but only 0.35 if RH_{700} was below 50% (336 cases). Similarly, when training an ANN with MULI and PBL humidity as inputs, thunderstorm probability exhibited a smaller but still substantial dependence on RH_{700} for larger values of the ANN output, corresponding to CAPE of 1000 J kg^{−1} and above. However, we are unable to infer causality or give a clear physical explanation, as MRH in the 800–600-hPa layer may still correlate with other underlying factors, such as frontal passages or a capping inversion caused by warm, dry air aloft. The effect of dry air aloft on convective storms remains an ambiguous and complex issue (Gilmore and Wicker 1998).

Convective development is more likely to occur when there is less convective inhibition (CIN) that must be overcome for a parcel to reach its lifted condensation level (LCL; Fig. 4b). We also calculated TP tables for other inhibition measures paired with MULI: strength of capping aloft (CAP), lid strength index (LSI), and most unstable CIN (MUCIN; see Table 2 for definitions). Based on this subjective analysis, MUCIN at the LCL (MUCIN_{LCL}) was clearly the best inhibition-based supplementary predictor. This is a hybrid measure as it can include positive area in a sounding (CAPE) and has the benefit of not requiring a level of free convection (LFC) to exist to be defined. Overall, “pure” inhibition measures performed poorly in this analysis.

Finally, we analyze two variables associated with convective initiation. Moisture flux convergence (MFC) has been widely used in short-term forecasting of convective initiation under the assumption that it is effective at identifying air-mass boundaries, along which convection is known to often initiate, and areas of “moisture pooling.” MFC is a term in the conservation of water vapor equation and has the expression

where *r* is the mixing ratio and **V**_{h} is the horizontal wind vector. The first term is the *advection term*, which represents the advection of humidity, and the second is the *convergence term*. Banacos and Schultz (2005) evaluated this parameter through the use of case studies and found that horizontal mass convergence is at least as effective as surface MFC for identifying boundaries. Our results show the same: vertically integrated horizontal mass convergence (VIHMC) is at least as useful, if not more, as a supplementary predictor for thunderstorm occurrence (Figs. 4c,d). Additionally, considering a deeper layer may be useful as the highest maximum TP values were obtained when convergence was integrated over a 500-hPa-deep layer above the surface.

It is important to ensure that these results are not contaminated by previous convective activity, where high humidity aloft in the pseudo-soundings might be associated with already ongoing convection and therefore lead to higher event probabilities also in the next 6-h period. The same may hold for convergence, since DMC is always a *cause* of low-level convergence and upper-level divergence. Therefore, the analysis in Figs. 4a and 4c was repeated after removing cases associated with previous activity, where lightning was observed both in the 6-h period prior and in the period after the analysis. This led to lower thunderstorm probabilities, since 34% of thundery soundings were removed, but no qualitative change was apparent in the relationships between the predictors and the predictand. This increases our confidence in the results, but the analysis does not rule out the existence of cumulus or altocumulus clouds, which may, for example, play a role in the strong relationship between thunderstorms and MRH_{800–600}.

### d. Artificial neural networks for predicting lightning occurrence

The evolution of the neural network as more inputs are added by the forward-selection algorithm is shown in Fig. 5. Not surprisingly, the first input selected was MULI, based on the cross-entropy error of the validation sample (validation error). From the start, we note that the training error is substantially lower than the validation error (0.162 and 0.180, respectively, for the single-input ANN). In this case, this seems to be not caused by overfitting but by the years in the training sample having a higher intrinsic predictability. This was indicated by randomizing the data allocation, which led to similar performance on training and validation sets. The early stopping technique—stopping network training when the validation error begins to increase—should improve generalization, and a good sign that overfitting is not taking place is that the validation and training errors do not diverge significantly at any point but retain a similar ratio during most of the input additions.

The second-best input is MRH_{800–600}, which is not surprising given the earlier results (Fig. 4). In some runs, the vertically integrated MFC between the surface and 700 hPa (VIMFC_{700}) was selected as the second input. This parameter includes not only lift but moisture information, combining them in a manner that can be argued to be arbitrary in the absence of a clear physical link to DMC (Banacos and Schultz 2005). The correlation with both mass convergence and low-level humidity can explain its success as an ANN input early on but might not lead to the optimal set of ANN inputs. We argue that it is desirable to keep quantities representing different physical processes separate to give the ANN the opportunity to learn about the underlying physical relationships between the variables. This is the central idea behind ingredients-based forecasting and is also why, for example, the SWISS_{12} index, which is a stability index incorporating low-level humidity and wind shear, was not included as a candidate predictor. Therefore, VIMFC was excluded from the candidate parameter pool initially and added only after MRH_{800–600} was chosen. It should be acknowledged that rejecting or selecting certain parameters as input candidates based on physical arguments means our approach is not a statistically blind one but lies between deterministic and statistical philosophies.

Subsequently selected inputs are, in order, low-level vertical velocity (*ω*_{low}), low-level *θ*_{e} advection (*θ*_{e}-adv_{low}), hours (UTC) of sounding (HH), and surface temperature (t2m). After this, some saturation seems to kick in. Nevertheless, performance increases are still noticeable for the next four parameters or so: vertically integrated horizontal mass convergence in the low to midtroposphere (VIHMC_{500}), the environmental lapse rate between 850 and 500 hPa (ELR_{mid}), CIN–CAPE between surface and 875 hPa (CIN_{875}), and total column water (TCW). Hereafter, the next best inputs barely fulfill the criterion for being added to the ANN (0.5% decrease in validation CEE). Eventually, five more inputs do pass the test, and we end up with a total of 15 inputs. The last inputs include a combined CAPE + CIN parameter using a mixed-layer parcel (MLCAPE_{LCL}) and low-level vertical wind shear (VWS_{low}). The very last input is MULI of the previous sounding. This parameter was included as a proxy for previous activity, that is, persistence (lightning occurrence in the previous 6 h). When the algorithm was run with previous activity included, it was selected as the second input but was ultimately excluded from the parameter pool since information regarding previous activity is generally not available in an operational context.

After the predictor-selection script completed its run, the number of hidden neurons was varied. The best validation performance was found with 17 hidden neurons. The predictor-selection script was tested once more, but no further inputs were selected using the updated *H*. Therefore, our final ANN has 17 hidden neurons in one hidden layer and utilizes 15 inputs that consist of, for example, two traditional stability indices, two measures of humidity above the PBL, two combined CAPE + CIN parameters, two measures of lift, synoptic hour, and two variables related to thermal and/or moisture advection.

The final performance of the ANN as evaluated on the test data is presented on a relative operating characteristics (ROC) diagram in Fig. 6. The ROC curve is obtained by varying the dichotomous forecasting threshold of the continuous ANN output and depicts probability of detection (POD) against probability of false detection (POFD) as the threshold is adjusted. In addition, we have added probability of false alarms (POFA) to Fig. 6 by mapping it onto a color bar. POFA, also known as false alarm ratio (FAR), is the proportion of “yes” forecasts that turned out to be wrong. The resulting visualization is a full description of forecast quality, which has 3 degrees of freedom for dichotomous forecasts. The diagram reveals that the performance of the ANN is substantially better than for the single-index model (MULI) for all aspects of forecast quality. In particular, false alarms are reduced by combining predictors in a multivariate ANN. This can be seen in the notable difference in POFA in Fig. 6 and also in the maximum Heidke skill score (see Table 4): maxHSS = 0.513 for the ANN as compared with maxHSS = 0.401 using MULI. HSS is a skill score that emphasizes a low POFA, while PSS emphasizes a high POD (Kunz 2007). PSS also suggests a meaningful improvement: 0.744 and 0.657 for the ANN and MULI, respectively.

Finally, we plotted the regional distribution of maximum PSS for the ANN (not shown). This was almost identical to that of MULI. The meridional gradient in maxPSS was also robust across different stability indices tested [Showalter index based at 850 hPa (SI_{850}) and SWISS_{12}]. This casts perhaps some doubt on the existence of a physical factor to explain the meridional gradient in PSS. It cannot be ruled out that the meridional gradient is associated with some bias of the reanalysis, but this seems unlikely.

### e. Formulation of a new index based on MULI

Given the success of the relative humidity near the 700-hPa level as a convective predictor, it may be advantageous to utilize this information in a stability index. This is not a novel idea but was done earlier by Huntrieser et al. (1997), who developed an index combining SLI, the 700-hPa dewpoint depression, and low-level wind shear for predicting thunderstorms in Switzerland. Huntrieser et al. (1997) included low-level shear based on hodograph analysis, revealing a distinct low-level jet associated with thunderstorms, and 0–3-km shear also succeeded as an input in our own ANN experiment. The resulting SWISS_{12} index has performed well in several subsequent studies (Manzato 2005, 2012), including ours, ranking second after MULI by maxPSS. However, since MULI performed far better than SLI in this study, we propose a modified version of the SWISS_{12} index for predicting thunderstorms in Finland:

where the coefficients were found by attempting to maximize the parameters’ linear correlation coefficient with thunderstorm occurrence, using the training and validation subsets. The maximum PSS of the newly developed “FIN” index is 0.678 on the test data, as compared with 0.657 with MULI, indicating a small but meaningful improvement (Table 4).

## 4. Discussion

Combining several predictors together by building a nonlinear, multivariate ANN led to an increase in skill that may be considered significant yet modest relative to the added complexity. This is indicative of a few things. One is that MULI is in itself a very effective parameter to forecast thunderstorms in Finland, which was evident in the skill score ranking. This index performed very well also in Haklander and Van Delden (2003), outperformed only by lifted index using mixed-layer parcels for dichotomous thunderstorm forecasts in the Netherlands. The advantage of using the most unstable parcel is that it can capture both elevated and surface-based instability (during daytime conditions, MULI is often equal to SLI). Another major benefit of MULI is that, unlike CAPE, it is not a bounded variable and can measure the potential “stability” also of near-neutral vertical profiles, while CAPE is defined only for potentially unstable profiles (when an LFC can be defined).

We further note that MULI is a large-scale variable in the sense that the thermodynamic instability required for thunderstorms is often created by large-scale processes (Doswell 1987). It can therefore be considered more robust than inputs related to lift and inhibition. Strictly speaking, convective initiation is a mesoscale process involving complex interactions of atmospheric motions on different scales. Key limitations in the ERA-Interim data with regards to building a skillful model may be both the horizontal and temporal resolutions. The former may be too coarse to capture, for example, important mesoscale convergence patterns, while the 6-hourly temporal resolution is quite low given how quickly the atmosphere can change, for example, by the sun heating the PBL. The coarse temporal resolution combined with the relatively small grid area and the proximity criterion, which ignores advection, may have significant implications for how representative the model pseudo-soundings are of the actual preconvective conditions. In the worst case, lightning can occur nearly 6 h after the analysis and be associated with a storm that formed far away (in another grid box). However, this issue is alleviated by including inputs that sample thermal or moisture advection into the grid cell, which may be why low-level *θ*_{e} advection was one of the most important ANN inputs.

In effect, the main benefit of using neural networks to predict convective initiation—the models’ inherent nonlinearity—is in this work likely to be constrained by the reanalysis’s limited capability for resolving convective initiation processes.

### a. Quality of pseudo-soundings

The reliability of the pseudo-soundings is a key uncertainty that has to be considered with respect to all results, especially regarding the relative performance of different predictors. While eventual biases are not an issue for the ANN, the soundings may exhibit significant root-mean-square errors (RMSEs) that act to degrade performance. Bao and Zhang (2013) evaluated reanalysis pseudo-soundings against independent sounding observations over the Tibetan Plateau and found that while newer-generation products such as ERA-Interim have smaller RMSE in general, significant errors and biases were found in relative humidity at all heights (up to 25%). Similarly, Gensini et al. (2014) found large errors in low-level moisture fields in the North American Regional Reanalysis. In our study, the maximum PSS of MULI forecasts varied with time (UTC) of the sounding. The maximum PSS for MULI forecasts is 0.62, 0.60, 0.64, and 0.58 for analyses at 0000, 0600, 1200, and 1800 UTC, respectively. The higher skill for +0.6-h forecasts made at midnight and noon may be associated with the fact that radiosonde soundings that are assimilated in ERA-Interim are carried out at three stations in Finland at 0000 and 1200 UTC only. As an example of a result that may be influenced by the above limitations, we note the poor success of variables measuring convective inhibition. No pure inhibition measures were selected by the input-selection algorithm, nor did they perform well in the thunderstorm probability tables, despite CIN having a clear physical link to convective initiation.

### b. Using stability indices for categorical thunderstorm forecasts

It was found that thunderstorm probability changed in a fairly linear manner as a function of sounding-derived indices such as lifted index and CAPE. This alone indicates that threshold values are arbitrary in the sense that they lack physical (but not practical) justification. This is not to say that skill score tests are useless; on the contrary, they allow evaluating the relative performance of indices in a convenient and robust way. Overall, probability forecasts may be considered more useful (AMS 2002), and they can be improved by considering several variables, as was shown by plotting the event probability as a function of two predictors.

## 5. Conclusions and future work

The presented study evaluated a large number of reanalysis-derived thunderstorm predictors for Finland in a multivariate setting using neural networks and other methods. The main findings are summarized as follows:

Training neural networks by using reanalysis data is advantageous as ANNs benefit from having a large amount of data to learn from. Owing to the large sample size, it was possible to train a very complex ANN without running into overtraining issues. The best architecture found had 15 inputs and 17 hidden neurons in one hidden layer.

The final neural network has significantly better performance on an independent test sample than any single stability index (Fig. 6), receiving a maximum PSS of 0.74, as compared with 0.66 using MULI. Heidke skill score suggests an even larger improvement (maxHSS = 0.51 as compared with 0.40 using MULI) because of false alarms being reduced especially. At the same time, the increase in skill is modest relative to the increase in model complexity. Overall, the results may be considered promising, keeping in mind the limitations of the input data (specifically, a coarse resolution). Our study area covered only Finland, but in the future, even larger datasets from higher-resolution models can be utilized for

*deep learning*, for example, training ANNs with more than one hidden layer.Evaluation of predictors in a multivariate setting is useful for gaining insight into the physical characteristics of thunderstorm environments at a given location. In the work conducted here, humidity variables were shown to be effective predictors for lightning occurrence in Finland, and the second-best ANN input was mean relative humidity between 800 and 600 hPa. In contrast to some other studies, dry air aloft was found to be very unfavorable for convection (based on lightning occurrence and flash density) at also higher CAPE values.

Regarding moisture flux convergence (MFC), our results are entirely in line with Banacos and Schultz (2005): horizontal mass convergence is at least as useful as MFC, if not more, for predicting thunderstorm occurrence when used as a supplementary predictor. We find no evidence of MFC offering any value for predicting convective initiation beyond its correlation with mass convergence, which is effective for identifying air-mass boundaries.

The Peirce skill score indicates a clear meridional gradient in forecasting skill, with thunderstorm predictability being higher in northern Finland. This is associated with the null event distribution being shifted toward a more stable regime in the north and thus being more separated from the thundery distribution. This may increase some measures of forecasting skill but does not in itself mean that, for a given instability, convective initiation is more likely in the north. On the contrary, for moderate and stable values of MULI, where thunderstorms are not forecast, thunderstorm probability is higher in the south. This leads to a higher miss rate, hereby reducing forecast skill. Further studies are needed to explain these findings, which were also obtained for the ANN, given that the physics of the atmosphere should not depend on location.

The application of neural networks for predicting convection, and convective initiation in particular, is a promising avenue for future research. ANNs may be used operationally by using the output from NWP models as ANN input. In the latter case, information gained in this study regarding the most effective ANN inputs can be utilized, keeping in mind that some of them are scale sensitive. The present approach of combining neural networks with reanalysis data may also be applied for constructing climatological trends. Robinson et al. (2013) produced a 20-yr climatology of severe thunderstorms using this approach but used a high-resolution numerical model to first dynamically downscale the reanalysis fields. The advantage of training the ANN directly using reanalysis data is that substantially more training data are available. In the near future, we envisage repeating this work for ECMWF’s upcoming next-generation ERA5 reanalysis, where a regression ANN would also be trained for estimating lightning intensity from reanalysis fields. ERA5 is currently in production (Hersbach and Dee 2016) and will have a much higher spatial and temporal resolution, which should yield better results for predicting convective initiation.

Weather forecasts and climate simulations would both benefit from a better representation of convective initiation. The prediction of convective initiation using nonlinear models, learning from vast quantities of data, is something that warrants further study.

## Acknowledgments

This work was partly funded by the Ministry of Employment and the Economy in Finland through the EXWE project (Extreme weather and nuclear power plants) of the Finnish Nuclear Power Plant Safety Research Programme 2015-2018 (SAFIR2018).

## REFERENCES

*Neural Network Toolbox: User’s Guide.*MathWorks, 446 pp.

*Pattern Recognition and Machine Learning.*Springer-Verlag, 738 pp.

*Climate Extremes and Society*, H. F. Diaz and R. J. Murnane, Eds., Cambridge University Press, 35–53.

*Statistical Methods in the Atmospheric Sciences.*2nd ed. Academic Press, 627 pp.

## Footnotes

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

^{1}

By potential instability, we refer to the definition by Manzato and Morgan (2003, p. 462), which invokes the lifted parcel theory: “This kind of instability is considered as ‘potential’ because it needs an *external agent*, which lifts the [parcel] up to the LFC. Above that level, the parcel is buoyant and the positive energy is the CAPE.” This differs from the classical definition, which is simply (*dθ*_{e}/*dz*) < 0, because it requires that low-level *θ*_{e} is in fact higher than the midlevel *θ*_{es}, the saturation equivalent potential temperature.