"Pazzo è bene da catene, Chi fastidio mai si dà Per saper quel che sarà …"

From the first act of *Sant'Alessio* (1631) by Stefano Landi (1587–1639), text by Giulio Rospigliosi (1600–69, also known as Pope Clemente IX).

He is truly a raving madman, whoever takes the trouble to know what the future holds …

## 1. Introduction

Hail forecasting is a challenging problem (e.g., Danielsen 1977; Brimelow et al. 2002). The first attempts at short-term hail forecasting were made by Fawbush and Miller (1953), who analyzed radiosoundings. Fawbush and Miller introduced a graphical method for forecasting the hailstone diameter from soundings observed in a skew *T*–log*p* diagram. Foster and Bates (1956) developed a similar approach, but with a mathematical formulation relating the updraft to the sounding buoyancy between the level of free convection (LFC) and the level where the lifted-parcel temperature becomes −10°C. Renick and Maxwell (1977) developed a diagram that relates hailstone size classes to the maximum updraft velocity and to the temperature measured by the sounding at the corresponding maximum-updraft level. Moore and Pino (1990) introduced a variant of the Foster and Bates (1956) approach, based on a more complex cloud model for forecasting the updraft.

Later approaches tried to improve the physical modeling of hail growth for explicit forecasts made by weather models. Wobrock et al. (2003) incorporated the Farley and Orville (1986) hail module inside the 3D nonhydrostatic Clark and Hall cloud model (Clark et al. 1996). Brimelow et al. (2002) developed an algorithm (HAILCAST) that couples a steady-state hydrostatic cloud model with a hail-growth model similar to that described in Rasmussen and Heymsfield (1987). On 160 different days (62 of which were hail days, so the event prior probability was 0.39), HAILCAST scored a Peirce skill score [called PSS, as suggested by Mason (2003)] of 0.66 and a Heidke skill score (HSS) of 0.64, both superior to the skill scores of the Renick and Maxwell (1977) method when applied to the same dataset.

Apart from these physical methods, purely statistical models were also developed. An example is the multivariate statistical approach of López et al. (2007), who developed a short-term hail-occurrence forecast from sounding-derived indices using logistic regression. Starting from 22 candidate indices, López et al. selected a set of seven variables, including instability (total totals index), wind speed (at 850 and 500 hPa), and the heights of the convective condensation level and of the wet-bulb zero. More generally, statistical analyses have revealed relations between hail and wind-derived fields. For example, Dessens (1986) studied the wind-profile environment of 30 strong hailstorms in France, finding an almost constant wind direction in the 3–12-km layer, but an increase in wind magnitude ("intensity shear"). Houze et al. (1993) found a relation between the bulk Richardson number (BRI) and the occurrence of right-moving severe hailstorms in Switzerland. Kitzmiller and McGovern (1990) analyzed wind-profiler and sounding data in Colorado, finding that large-hail events could be discriminated by a linear combination of wind speed at 8 km and buoyancy energy (in J kg^{−1}).

For developing a hail-forecasting method it is fundamental to have a good database of hail observations at the surface. Since the use of hailpad networks (Towery et al. 1976) or disdrometers is not widespread and since hailstreaks can be detected only by high-density observing networks (Morgan and Towery 1975), much work has been done to develop remote-sensing hail detection algorithms. Examples include Waldvogel et al. (1979), Balakrishnan and Zrnić (1990), Edwards and Thompson (1998), Witt et al. (1998), Zrnić et al. (2001), Féral et al. (2003), Heinselman and Ryzhkov (2006), and Lakshmanan et al. (2007).

Following this work on remote-sensing hail detection, it was natural to introduce radar-derived parameters also as candidate predictors for a statistical *nowcasting* of hail. For example, Marzban and Witt (2001) studied 386 hailstorms with a regression neural network for forecasting the maximum hailstone size and a classification neural network to sort the hailstone sizes by category. They used as predictors four radar-derived parameters together with five parameters derived from radiosoundings. The radar-derived parameters had a much higher linear correlation with the observed maximum hailstone size (*R* ≅ 0.5) than did the sounding-derived parameters (*R* ≤ 0.1). On the other hand, the nature of radar data limits the time of forecast validity, so that typically only nowcasting (0–60 min) can be done, instead of short-term forecasts (1–6 h, also called *nearcast*). On the validation sample, their regression neural network scored *R* = 0.63, showing a 26% improvement against the linear bivariate regression with a single radar predictor.

From this literature review, it appears that different authors found different hail predictors to be useful in different parts of the world, even though there is a general consensus that instability measures (buoyancy, CAPE, estimated updraft, lifted index), associated with some wind-profile characteristics (e.g., bulk shear), play an important role in hail preconvective environments. The wind field seems important, but its influence is not always considered in the same way. For example, Das (1962) suggested that strong wind shear increases the probability of hailstorms, but decreases the probability of having large hailstones.

The aim of this work is to apply nonlinear statistical methods to a multivariate analysis relating the hailpad data of northeast Italy's Friuli Venezia Giulia (FVG) plain to the sounding-derived indices of the Udine–Campoformido radiosounding.^{1} That is done for forecasting two kinds of phenomena: the hail occurrence and the number of hailpads hit in 6 h (hereafter 6-h). Manzato (2005) and Manzato (2007b) developed a methodology based on artificial neural networks (hereafter NNs) for short-term forecasting of thunderstorms and of maximum rain in the FVG plain from sounding-derived indices. Here, a similar approach will be applied to the hail problem, but new statistical tools (inspired by the artificial intelligence school) will be introduced. In particular, a new way of combining different NNs into an *ensemble* forecast will be presented. The next section briefly presents the data used. Section 3 discusses the hail-occurrence forecast, while section 4 deals with the hail-regression problem. Finally, section 5 presents our conclusions.

## 2. Data

As shown by Manzato (2012), hail is a relatively frequent phenomenon in the FVG plain. The contemporaneous presence of a network of hailpads and of a World Meteorological Organization (WMO) radiosounding (Udine–Campoformido, 46.04°N, 13.19°E; WMO ID 16044) for a long period makes this place a very good candidate for studying the relations between hail and sounding data. The bivariate diagnostic analysis made by Manzato (2012) fixes the baseline benchmark against which the performances of the new statistical models developed here will be compared. In particular, when classifying the occurrence of at least two hit hailpads in 6-h, Manzato (2012) found a maximum diagnostic PSS of 0.49 for the updraft speed and for the difference in temperature between the lifted parcel and the environment at the −15°C level (DTC). When estimating the number of hit hailpads in 6-h, the best correlation coefficient was obtained by the difference in temperature between the lifted parcel and the environment at 500 hPa (DT500) and by the Showalter index (Showalter 1953), with a diagnostic *R* of −0.36 (*p* value < 2 × 10^{−16}).

Figure 1 shows the target area, the plain of the FVG region (along the border with Slovenia), which has the Alps to the north and the Adriatic Sea to the south. Each of the 357 numbers represents the total number of hit hailpads collected at that station during the April–September 1992–2009 hail campaigns. The hailpad and sounding data have been exhaustively discussed in Manzato (2012), where the reader may find their climatology and further details. Here, it is only recalled that a *case* is defined as a 6-h period starting at the sounding launching time, that is at 0500, 1100, 1700, or 2300 UTC.

Since 1986, radiosoundings with wind measurements have mainly been conducted at the Udine–Campoformido station four times per day (0000, 0600, 1200, and 1800 UTC), even though the numbers of 1800 and 0600 UTC cases are lower than those at 0000 and 1200 UTC. Every sounding available in the studied period (11 209 cases) has been analyzed with the Sound_Analys.PY program (Manzato and Morgan 2003) to compute 52 sounding-derived indices. The sounding-derived indices are then associated with the hailpad measurements of the corresponding 6-h period. For example, the 1200 UTC sounding, launched at 1100 UTC, is associated with hailpads hit by hail between 1100 and 1700 UTC. If the sounding is missing, then that case is not considered, which explains why the total counts for the four periods of the day differ, being 3270, 2479, 3273, and 2187 for 0000, 0600, 1200, and 1800 UTC, respectively. Note that all the cases with an existing sounding were retained in this blended statistical study, so that there is no filtering of any sort, such as filters based on "contaminated" (saturated) soundings or others intended to "clean" the database.

Table 1 lists all the candidate predictors: the 52 sounding-derived indices used in this study along with the period of the day (HH), the Julian date (JJJ), and the sea surface temperature (SST) measured by the Osservatorio Meteorologico Regionale dell'Agenzia Regionale per la Protezione dell'Ambiente del Friuli Venezia Giulia (OSMER–ARPA FVG) station at Trieste. All of these indices have previously been described by the author (Manzato 2003, 2005, 2007b, 2012). The columns of Table 1 list the index acronym (which will be used hereafter for the sake of brevity) and unit; the name or description; a "yes" flag for the 15 indices computed in three different ways by the Sound_Analys.PY software [the T, Tv, and Tvc methods described in Manzato and Morgan (2003), recalled also in Manzato (2012), which will be used later]; the total number of cases with nonmissing values of that index; and, finally, the total number of cases with at least one hit hailpad and a nonmissing index value. The maximum number of 6-h cases in the classification problem is 11 209 (no missing cases at all). The maximum number of 6-h cases with at least one hit hailpad in the regression problem is 1070.

All the 55 candidate predictors, including 52 sounding-derived indices plus the period of the day (HH), the Julian day (JJJ), and the SST, with flags to identify the 15 indices computed in three different ways (T, Tv, and Tvc) and the total number of valid cases used in the classification and in the regression problems, respectively.

## 3. The hail-occurrence forecast

In this section the 11 209 six-hour cases will be classified based on hail occurrence. In the studied database there are 1070 cases with at least one hit hailpad, and 555 of these (52%) have only one hit hailpad in 6 h. Since the cases with only one hit hailpad do not form a very robust hail signal, they will be considered as nonoccurrences of hail. Thus, for this hail classification problem, we want to discriminate the 515 cases with at least two hit hailpads (active cases) from the 10 694 cases with zero or one hit hailpad (nonactive cases). The following subsections show first how to build an NN to solve this problem and then how to develop different NNs and combine them.

### a. A neural network recipe

#### 1) Input preprocessing

For each "*x*" index, the associated probability-transformed variable will be indicated as "*P*(*x*)", which is a contracted form of the more precise *p*(YES | *x*) notation, defined in terms of likelihood probability by Bayes's theorem (Bayes 1763):

$$P(x) = p(\mathrm{YES}\,|\,x) = \frac{p(x\,|\,\mathrm{YES})\,P_{\mathrm{yes}}}{p(x\,|\,\mathrm{YES})\,P_{\mathrm{yes}} + p(x\,|\,\mathrm{NO})\,P_{\mathrm{no}}},$$

where *p*(*x* | YES) is the event likelihood, *p*(*x* | NO) is the nonevent likelihood, and *P*no = 1 − *P*yes. Examples of this nonlinear transformation are shown in Fig. 2 for the hail diameter (HD, panel a) and the SWISS index^{2} (panel b) sounding-derived predictors. Mapping all the original input data to their empirical posterior probability makes them more comparable (the new data are all probabilities) and was shown by Manzato (2007a) to be an optimal preprocessing method for classification problems.
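In code, this empirical posterior transformation can be approximated with binned likelihoods. The sketch below only illustrates the Bayes formula: the function name, the quantile binning, and the add-one smoothing are assumptions for this example, not the paper's actual implementation.

```python
import numpy as np

def posterior_probability(x, event, bins=20):
    """Empirical posterior P(YES | x) via Bayes' theorem, estimated
    from binned likelihoods p(x | YES) and p(x | NO).
    x     : 1D array of index values (no missing values)
    event : 1D boolean array, True where the event occurred
    Returns the transformed array P(x), one probability per case."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch all values
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)

    p_yes = event.mean()                            # prior Pyes
    p_no = 1.0 - p_yes
    # binned likelihoods (add-one smoothing avoids empty bins)
    like_yes = np.bincount(idx[event], minlength=bins) + 1.0
    like_no = np.bincount(idx[~event], minlength=bins) + 1.0
    like_yes /= like_yes.sum()
    like_no /= like_no.sum()

    # Bayes' theorem applied bin by bin
    return like_yes[idx] * p_yes / (like_yes[idx] * p_yes + like_no[idx] * p_no)
```

Each raw index value is thus replaced by a probability in (0, 1), which is what makes the transformed predictors directly comparable.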

#### 2) Input selection based on mean validation error resampling

The database^{3} is then divided into three parts. The data from 2002 to 2005 are defined as the test sample and will be used only at the end of the simulation, for an independent test. The other data (1992–2001 plus 2006–09) will be called the total sample and will be resampled 12 times, dividing the dataset into two separate parts: each time 75% of the data will form the training sample, while the remaining 25% will form the validation sample. The division of the total sample in training and validation is done without repetition of the data already chosen—repeated holdout—and using 12 different deterministic procedures (not randomly). Each one of these 12 different training–validation realizations will be called a bootstrap.
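A minimal sketch of this repeated holdout follows. The paper's 12 deterministic splitting procedures are not described in this section, so this illustration substitutes reproducible index permutations seeded by the split number; only the 75%/25% partition without repetition is taken from the text.

```python
import numpy as np

def repeated_holdout(n_cases, n_splits=12, val_frac=0.25):
    """Repeated holdout without repetition inside each split: every
    realization partitions the cases into 75% training / 25% validation.
    Each split is a reproducible permutation seeded by its own index k
    (a stand-in for the paper's 12 deterministic procedures)."""
    idx = np.arange(n_cases)
    n_val = int(round(n_cases * val_frac))
    splits = []
    for k in range(n_splits):
        perm = np.random.default_rng(k).permutation(idx)
        # first n_val cases -> validation, the rest -> training
        splits.append((np.sort(perm[n_val:]), np.sort(perm[:n_val])))
    return splits
```

Each (training, validation) pair is one "bootstrap" in the paper's terminology, even though no case is drawn twice within a split.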

Then, a forward input-selection algorithm is applied: at each step, the candidate predictor that most reduces the mean cross-entropy error (CEE) on the *validation* sample (lowest validation error) is added. CEE is defined as

$$\mathrm{CEE} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i\,\ln y_i + (1 - t_i)\,\ln(1 - y_i)\right],$$

with *y* being the continuous NN output, *t* the observed hail-occurrence–nonoccurrence target variable (1 if there are at least two hit hailpads in 6 h or 0 otherwise), and *N* the total number of cases in the sample. Note that the continuous output of the single hidden-layer NN is computed by

$$y = f_o\!\left(\beta_0 + \sum_{j=1}^{H}\beta_j h_j\right), \qquad h_j = f\!\left(\omega_{j0} + \sum_{i=1}^{I}\omega_{ji}\,x_i\right),$$

where *x*_{1}, … , *x*_{I} are the NN inputs (*I* being their count in each iteration); *h*_{1}, … , *h*_{H} are the "neurons" in the NN hidden layer (with *H* held fixed during the input-selection phase); *f* and *f*_{o} are the logistic function [*f*(*x*) = (1 + *e*^{−x})^{−1}]; and *ω* and *β* are the NN weights, computed by the R nnet module (Venables and Ripley 2002) when fitting the training sample.
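The single hidden-layer network and the CEE can be sketched in a few lines of NumPy. The weight arrays are taken as given here (in the paper they are fit to the training sample by the R nnet module), and the function names are illustrative.

```python
import numpy as np

def logistic(x):
    """f(x) = (1 + e^(-x))^(-1), used for hidden and output neurons."""
    return 1.0 / (1.0 + np.exp(-x))

def nn_forward(X, W_hidden, b_hidden, w_out, b_out):
    """Single hidden-layer network: logistic activation f on the H
    hidden neurons and logistic output f_o, so y lies in (0, 1).
    X: (N, I) inputs; W_hidden: (H, I); b_hidden: (H,); w_out: (H,)."""
    h = logistic(X @ W_hidden.T + b_hidden)   # hidden-layer neurons h_j
    return logistic(h @ w_out + b_out)        # continuous output y

def cross_entropy_error(y, t, eps=1e-12):
    """CEE = -(1/N) * sum[ t ln(y) + (1 - t) ln(1 - y) ]."""
    y = np.clip(y, eps, 1.0 - eps)            # guard the logarithms
    return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```

With all weights at zero, every output is 0.5 and the CEE equals ln 2, a useful sanity check for a binary cross-entropy implementation.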

The first best input chosen is almost invariably the *P*(SWISS) variable, which has both a mean training CEE (called TE) and a mean validation CEE (called VE), over the 12 bootstraps, of 0.153. The second index, the one that most decreases the mean VE when added to *P*(SWISS) as an NN input, is the hail probability associated with the period of the day, *P*(HH), followed by the bulk shear *P*(BS850), showing the importance of low-level shear in combination with instability.

Figure 3a [similar to the training–validation diagram of Marzban (2000)] shows the 12 bootstrap (TE, VE) points of the first 10 inputs selected for the Tv database. The thick black dashed line in Fig. 3a joins the means of the 12 bootstraps (unfilled squares) and highlights overfitting when diverging far above the 45° bisector (black continuous line). The light gray dashed line passing through the mean bootstrap value has a slope determined by the training–validation sample-size ratio (75%/25% = 3). The number of hidden neurons *H* was held fixed for each algorithm iteration; after the selection of the set of 10 NN inputs, *H* is varied between 1 and 6, optimizing this parameter as well to minimize the mean validation CEE.
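The greedy forward input selection described in this subsection can be sketched as follows. Here `fit_eval` is a placeholder for whatever routine trains the NN on each bootstrap and returns the mean validation error; the function and argument names are illustrative.

```python
def forward_select(candidates, fit_eval, n_inputs=10):
    """Greedy forward selection: at each step add the candidate predictor
    that minimizes the mean validation error over all resampling splits.
    candidates : list of predictor names
    fit_eval   : callable(list_of_names) -> mean validation error; in the
                 paper this would fit an NN on each of the 12 bootstraps
                 (any error function works for this sketch).
    Returns the selected names in the order they were chosen."""
    selected = []
    while len(selected) < min(n_inputs, len(candidates)):
        remaining = [c for c in candidates if c not in selected]
        # score every candidate extension of the current input set
        scores = {c: fit_eval(selected + [c]) for c in remaining}
        selected.append(min(scores, key=scores.get))
    return selected
```

Because each candidate is evaluated together with the already selected inputs, the method can pick up predictors whose value only emerges in combination, at the price of retraining the model many times per step.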

#### 3) Verification of the NN on the total and test dataset

Since the input variables have been chosen with the goal of identifying the best mean performance, and there is no sign of large overfitting (e.g., VE is not much higher than TE), the whole "total" sample (which does not include the test sample, but only training plus validation) has been fit using those 10 inputs and a variable number of hidden neurons. The best results are found with an *I* = 10, *H* = 4 NN, which is able to fit the data with a "total" CEE = 0.108. The continuous NN output, *y*, is then dichotomized into a binary (yes–no) forecast using as a threshold the hail prior probability [*P*yes = 515/11 209 ≅ 0.046].
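Dichotomizing the continuous output at a threshold and scoring the result with the Peirce skill score (PSS = POD − POFD, computed from the standard 2 × 2 contingency table) can be sketched as below; the function name is illustrative.

```python
import numpy as np

def dichotomize_and_score(y, t, threshold):
    """Turn the continuous NN output y into a yes/no forecast using the
    given threshold (e.g., the hail prior probability), then compute the
    Peirce skill score PSS = POD - POFD from the contingency table."""
    f = y >= threshold                  # binary forecast
    obs = t.astype(bool)                # binary observation
    hits = np.sum(f & obs)
    misses = np.sum(~f & obs)
    false_alarms = np.sum(f & ~obs)
    corr_neg = np.sum(~f & ~obs)
    pod = hits / (hits + misses)                    # probability of detection
    pofd = false_alarms / (false_alarms + corr_neg)  # prob. of false detection
    return f, pod - pofd
```

A perfect forecast gives PSS = 1, a constant or random forecast gives PSS = 0, which is why the threshold can be tuned to maximize it.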

A possible explanation of the lower results for the test sample is that the complexity of this model is too great for the real “statistical signal” embedded in the data. If one chooses the simpler NN made with the same 10 inputs but only two neurons,^{4} then one can find a CEE of 0.112 and a PSS of 0.655 for the total sample and a CEE of 0.131 and a PSS of 0.632 for the test sample. Thus, both metrics are slightly worse for the total sample but improve for the test sample. Nevertheless, one should not rely on the “test” results during the process of developing the statistical model, since they must be used only to validate the final performance and one might have overfit the test sample. In the next section it will be shown that it is possible to obtain better performance using more NNs together rather than searching for the “very best” single NN. In fact, Berk (2005) and Navone et al. (2000) have shown that combining NNs together into an ensemble usually produces better—and statistically more robust—forecasts than any single NN member.

### b. Developing a neural network ensemble for classification

Two strategies are implemented for producing different NNs for solving the same problem: the first is to select different NN inputs from those obtained with the “mean VE” resampling method, while the second is to apply this methodology to the three different types of databases computed by Sound_Analys.PY, since, as explained in Manzato (2012), T, Tv, and Tvc have 15 indices computed in a different way.

#### 1) The “12 bootstraps” and the “four periods” resampling methods

The first new input selection method chooses the predictor set with the lowest VE independently for each one of the 12 bootstrap trials, instead of optimizing the average of the 12 VEs with a single predictor set. In this way, one obtains 12 different sets of input predictors, each one optimized for a particular training–validation realization. The second method, employed to produce another four NN input sets, is to use as the validation sample all the data with the same period of the day (same value of HH). For example, one can use the cases from 0600, 1200, and 1800 UTC to train the NN and all the cases at 0000 UTC to validate it. Note that, in general, this method is not the best choice if there is any diurnal effect; however, in this case, these additional NNs were developed with the aim of producing some very different classifiers, not the best one, although still a reasonable one. In fact, what is also important in an ensemble approach is the availability of many diverse members, producing a large variability in their forecasts (Carney and Cunningham 1999; Navone et al. 2000). Incidentally, it is noted that the ensemble forecast spread is very often too small and almost never too large: for example, it is much more common to find rank histograms (e.g., Hamill 2001) with a ∨ shape than with the complementary ∧ shape.

Figure 3b shows the training–validation division based on the four periods of the day, optimizing the input set differently for each of these four bootstraps. It is interesting to note how the NNs trained while excluding the 0600 or the 0000 UTC data have the highest training CEEs, but when applied to the corresponding cases they perform very well, obtaining low validation CEEs. Conversely, the NN trained while excluding the 1200 UTC data has the lowest TE, but this good performance is not confirmed by the 1200 UTC validation sample, because VE is very high. This suggests that the 1200 UTC cases (followed by those at 1800 UTC) are the most “difficult” cases to predict correctly with the sounding-derived indices, while the 0600 and 0000 UTC are the “easiest” to forecast, as if the intrinsic predictability is different for the different periods of the day.

It is possible (hypothesis A) that there are some physical processes acting during the daytime (in particular when the solar radiation is stronger) that are very important for thunderstorm development and that cannot be inferred from the statistical relations occurring during the other periods of the day. Another possibility (hypothesis B) to explain this result arises by noting that the intrinsic predictability seems to follow the same trend as the event prior probability [the probabilities of a hailpads ≥ 2 case are 0.05, 0.17, 0.11, and 0.04, respectively, during the (0500, 1100), (1100, 1700), (1700, 2300), and (2300, 0500) UTC periods], indicating that it should be easier to forecast the rarer events than the more common ones.^{5} On the other hand, that seems a weak explanation considering that a diurnal trend (not shown) similar to that in Fig. 3b has been found for the FVG rain >5 mm (accumulated in 6 h) event, which has quite similar prior probabilities during the four different periods of the day.

#### 2) Using the three resampling methods to produce 68 different NNs

Using these two new resampling strategies, another 12 plus 4 different sets of input predictors have been found, in addition to the one developed in section 3a(2), for a total of (1 + 12 + 4) = 17 different input sets. In practice, a number of neurons *H* varying in the 1–4 range has been tried for each of the 17 sets of inputs, obtaining a total of 17 × 4 = 68 different classification NNs for the same hail problem. Figure 4 shows the CEEs computed for the total and for the test samples of each of these 68 different NNs; the variability of the test error (*σ* ≅ 0.036) is about 8 times higher than that of the total error (*σ* ≅ 0.004). The dashed line is a linear fit of all the points. In reality there are two "bad" NNs whose test CEEs fall outside the range plotted in Fig. 4, one of which has a test CEE as high as 0.43. The ratio between the mean total and test CEE (0.82, or 0.85 excluding the worst NN) is less than 1, because the points lie above the 45° bisector (the continuous line visible near the bottom-right corner).

The three different gray levels used in Fig. 4 correspond to the three resampling methods: the “mean” set (only one input list), the “12 boots” input set (12 different lists), and the “four HH” set (one list of inputs for each period of the day). The four mean NNs developed in section 3a have the best performance on average (big filled squares). For each different *H* in the 1–4 range a different symbol is used to plot the (total CEE, test CEE) point. In general, the lowest test CEEs are obtained by NNs with one or two neurons in the hidden layer, while the poorest test performances are obtained by NNs with three or four neurons; however, there are exceptions. The two circled NNs correspond to the *H* = 2 and *H* = 4 NNs described at the end of section 3a (mean set). The *H* = 2 NN has the second-lowest test CEE, while the *H* = 4 NN has the lowest training CEE.

#### 3) Using three databases and combining NNs in an ensemble

What was done up until now on the Tv sounding-derived database can be applied also to the T and Tvc datasets, obtaining a final number of 68 × 3 = 204 NNs that solve the same hail-classification problem. These NNs can be used as candidate members for an ensemble forecast. To select a good subset of NN classifiers, a forward-model-selection algorithm has been applied. That is, the first classifier of the ensemble is the one that, taken alone, has the highest PSS^{6} for the *total* sample; the selection is then iterated, adding at each step the classification NN that increases the ensemble PSS further. A possible alternative evaluation of NN goodness could be an error weighted by sample size, like (¾ × Total MSE + ¼ × Test MSE), but, again, it must be stressed that the selection of the ensemble members must be based on minimizing the ensemble total error, to avoid using the test data when "building" the statistical model. This constraint produces worse results on the test sample, but it is needed to maintain the independence of the test data and to avoid overfitting it.

To combine in an ensemble forecast the different NN binary forecasts, the new modified Mojirsheibani major voting (MMMV) algorithm was used. The MMMV is introduced in the appendix; here, it is only noted that the original Mojirsheibani major voting algorithm (Mojirsheibani 1999) is based on the most probable observation for any given combination of forecasts and that our modified version can optimize PSS even in rare-event situations. As found in other ensemble methods, such as boosting (Freund and Schapire 1996), in this case the ensemble also reaches a better level of performance than any single member using a combination of just a few members.
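A sketch of the truth-table idea behind MMMV follows. The exact weight definition is given only in the paper's appendix, so the choice made here (the weight of a forecast combination is the observed event frequency among the cases showing that combination, compared against a low threshold to cope with rare events) is an assumption for illustration, not the published algorithm.

```python
import numpy as np

def build_mmmv_table(member_forecasts, obs, threshold):
    """Illustrative MMMV-style truth table.
    member_forecasts : (N, M) array of 0/1 forecasts from M members
    obs              : (N,) array of 0/1 observations
    Returns a dict mapping each observed member-forecast combination
    (as a tuple) to the ensemble 0/1 forecast: 1 where the observed
    event frequency of that combination exceeds the weight threshold."""
    combos = {}
    for row, o in zip(member_forecasts, obs):
        combos.setdefault(tuple(row), []).append(o)
    return {combo: int(np.mean(outcomes) > threshold)
            for combo, outcomes in combos.items()}
```

With a threshold near the event prior, a combination in which only a minority of members vote "yes" can still map to an ensemble "1", which is consistent with the behavior reported for Table 2.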

Figure 5a shows the increase in PSS as more classifiers are added to the MMMV ensemble. One can see that the ensemble forecast reaches a maximum on the test sample with just five members. Combining these five NNs in an ensemble gives a PSS of 0.679 for the test sample and of 0.747 for the total sample. These PSSs show a relative improvement of 12% and 8%, respectively, for the test and the total sample with respect to the single NN with *H* = 4 found in section 3a(3), and are 9% better than the PSS obtained by the first member of the ensemble, which was the mean-error sampling NN with *H* = 4 of the T database. Table 2 shows the so-called truth table of the ensemble model, that is, the MMMV output corresponding to any possible combination of forecasts of the five NN members, obtained with any weight threshold (see the appendix) in the 0.04–0.06 range. Note the unexpected "1" ensemble forecast associated with the member forecast combination numbers 9, 10, 12, and 20, which had only two NNs forecasting 1 and three forecasting 0.

Truth table of the MMMV ensemble based on the five NN forecasts, obtained with any "yes–no" weight threshold in the 0.04–0.06 range. The output forecast vector can be compressed into the single decimal number 7020319.

Table 3 shows the sounding-derived indices that are transformed into their hail empirical posterior probability and used as inputs for the five NNs chosen for this ensemble. All five NNs use 10 inputs and four neurons in the hidden layer. Three of them were developed for the T database, while the other two were found for the Tv and Tvc databases. The first NN was found with the "mean error" variables, while the third was found using the 0000 UTC data as the verification sample and the other three periods of the day as the training sample. The other three NNs were of the 12-bootstrap-sampling type. In general, one can see how SWISS (a measure of instability and shear), HH (period of the day), and BS850 (low-level bulk shear) are the predictors chosen most frequently in the first NN input positions. Also the bulk Richardson number (BRI), the wet-bulb zero height (WBZ), and some other instability measures (LI, EHI, UpDr, CIN, b_PBL, and MaxBuo; refer to Table 1) were chosen at least once among the first five predictors. Wind fields in the "mid-" levels (MLWu, MLWv, and MLWspd; refer to Table 1) are often chosen, but never in the first four positions.

The five NNs chosen in the MMMV ensemble for forecasting the occurrence of at least two hit hailpads. The NN chosen first is in the left column, while that chosen last (least important) is in the right column. One can see how SWISS, BS850, and HH are the most important predictors.

Figure 5b shows a zoomed receiver operating characteristic (ROC; Swets 1973^{7}) curve obtained by this five-member MMMV ensemble on the total (continuous line) and test (dashed line) samples. The different (POD, POFD) points are computed by varying the weight threshold of the MMMV algorithm (see the appendix). The ROC curve for the test sample is almost parallel to that of the total sample, showing just a little overfit. The maximum PSS for the total sample on the ROC curve corresponds to the POFD = 0.16 point. Table 4 shows a detailed forecast verification for this specific binary classifier, computed for the total (training plus validation) and the test samples for most of the widely used performance measures. The point that maximizes PSS has a high POD, but also a high frequency bias (BIAS) and false alarm ratio (FAR). In any case, all the performance measures in Table 4b are better than those computed in Table 4b of Manzato (2012). In fact, comparing the performance measures obtained for the diagnostic bivariate analysis with the "test" prognostic measures in Table 4b (last column on the right), the absolute value of the relative improvement (that is, |New Measure − Old Measure|/Old Measure) goes from a minimum of 7% (for FAR) to a maximum of 478% (for the odds ratio), with a mean relative improvement over the nine different measures of 84%, which reduces to 35% if the very large odds-ratio improvement is not taken into account.

(a) The definition of the contingency table and its values and (b) the definitions and values for many derived performance measures for the total and test samples.

As explained in Manzato (2007a), the threshold that maximizes PSS is not always the choice that maximizes the customer value (which depends on customer cost for missing events and false alarms), but it is just a convenient point to measure the potential skill of the whole forecasting system. If one needs a frequency BIAS lower than 5, then one should choose another point on the ROC curve, that is, a binary classifier obtained from the same ensemble but using a different weight threshold. For example, the BIAS = 1 point is given by the intersection of the ROC with the POD = 1 − (*α* × POFD) line, with *α* = Pno/Pyes.
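The quoted BIAS = 1 line follows directly from the standard contingency-table definitions (*a* = hits, *b* = false alarms, *c* = misses, *d* = correct negatives):

$$\mathrm{BIAS} = \frac{a+b}{a+c} = \mathrm{POD} + \frac{b}{a+c} = \mathrm{POD} + \mathrm{POFD}\,\frac{b+d}{a+c} = \mathrm{POD} + \alpha\,\mathrm{POFD},$$

since POD = *a*/(*a* + *c*), POFD = *b*/(*b* + *d*), and *α* = (*b* + *d*)/(*a* + *c*) = Pno/Pyes. Setting BIAS = 1 then gives POD = 1 − (*α* × POFD), the line whose intersection with the ROC identifies the unbiased classifier.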

## 4. The hail extension forecast

While the previous section was dedicated to the hail-occurrence classification problem, this one deals with the estimation of the hit-hailpad extent, which is a regression problem. In this case, the cases with zero hit hailpads (which are the majority) must be excluded, or they would heavily bias the forecast toward very low values. The 555 cases with *only* one hit hailpad in 6 h are retained, even if they were considered as hail nonoccurrences in the classification problem, because removing them would reduce the available database by more than half. Hence, here the subset of the 1070 six-hour cases with at least one hit hailpad is studied.

### a. Results for linear multiregressions

As a simple benchmark reference, a linear multiregression was also performed. A subset of predictors has been chosen from Table 1 using two different algorithms. The first was a simple forward-stepwise selection (stepwise regression). The second was an exhaustive subset selection, based on the LEAPS package, developed in the R software by T. Lumley (see http://cran.r-project.org/web/packages/leaps/leaps.pdf). For this simple linear approach, there is no division between training and validation samples: the whole total dataset was fitted, while the test sample was used to verify the prognostic performance.
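A generic forward-stepwise selection for linear multiregression can be sketched with ordinary least squares. The paper uses R (step/LEAPS); this NumPy version is only a stand-in, with illustrative names.

```python
import numpy as np

def stepwise_regression(X, y, max_predictors=15):
    """Forward-stepwise linear multiregression: at each step add the
    column of X whose inclusion gives the largest fit R^2 (ordinary
    least squares with intercept, on the full sample).
    Returns the chosen column indices and the R^2 path."""
    n, p = X.shape
    chosen, r2_path = [], []
    for _ in range(min(max_predictors, p)):
        best, best_r2 = None, -np.inf
        for j in range(p):
            if j in chosen:
                continue
            # design matrix: intercept + already chosen columns + candidate j
            A = np.column_stack([np.ones(n)] + [X[:, k] for k in chosen + [j]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            r2 = 1.0 - (y - A @ coef).var() / y.var()
            if r2 > best_r2:
                best, best_r2 = j, r2
        chosen.append(best)
        r2_path.append(best_r2)
    return chosen, r2_path
```

The R² path is nondecreasing on the fitted sample by construction, which is exactly why an independent test sample is needed to spot the overfitting point.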

Figure 6a shows how the square of the correlation coefficient improves when adding more and more predictors to the stepwise regression. While for the total sample *R*^{2} steadily increases from 1 to 15 predictors (*R*^{2} goes from 0.13 to 0.26), for the test sample there is a maximum value (*R*^{2} = 0.139, *p* value < 2 × 10^{−9}) at 7 predictors; beyond that point, the model is probably overfitting. Another interesting feature is that, even when the test and total *R*^{2} increase with a similar trend (e.g., from one to nine predictors), there is always a large difference between them; in fact, their relative ratio (|Test − Total|/Total) is bounded in the 0.37–0.48 range. It is also important to note that even in a very simple model, like a linear regression with only one predictor (Showalter index), there is a significant difference between the test and total *R*^{2} (0.07 versus 0.13).^{8} This effect is probably not overfitting (as likely happens for the complex models with more than nine predictors), but is most likely due to a different intrinsic predictability of the two samples and/or to the effects of the different sample sizes (25% against 75%).

The exhaustive LEAPS regression is computationally much more demanding, but it assures a total *R*^{2} higher than the corresponding value found with the stepwise selection for the same number of input predictors, even if the same does not hold for the test *R*^{2} (figure not shown). For example, choosing with LEAPS the best combination of 2–12 input predictors, the total *R*^{2} increases from 0.17 to 0.27. The best test *R*^{2} (0.142, *p* value < 2 × 10^{−9}) was found with the most complex multiregression of 12 predictors, while the second-best test *R*^{2} (0.135) was found using only 3 predictors (DT500, Trop, and HLWspd).

### b. Results for the neural network with the “mean VE” inputs

The NN used for this regression problem has the same architecture as the classification NN, except that the output-neuron transfer function *f*_{o} is no longer the logistic function (as it is for all the hidden-layer neurons) but is the *linear* function *f*_{o}(*x*) = *x*. Moreover, the performance used to quantify the NN error is the mean square error (MSE) instead of the CEE:

MSE = (1/*N*) Σ_{n=1}^{N} [FORHA6h_{n} − CALHA6h_{n}]^{2},

where FORHA6h (forecasted hail activity) is the continuous NN output. The final regression performances will also be measured in terms of the linear correlation coefficient *R*.
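A minimal sketch of this kind of network (logistic hidden layer, linear output) and of the MSE measure, with hypothetical weights standing in for the trained ones:

```python
import numpy as np

def nn_forward(x, W1, b1, w2, b2):
    """Forward pass of a regression NN of the type described here:
    logistic hidden-layer neurons followed by a linear output f_o(x) = x."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden layer (logistic)
    return float(w2 @ h + b2)                 # linear output neuron

def mse(forecasts, observations):
    """Mean square error, used instead of the CEE for this regression."""
    d = np.asarray(forecasts, float) - np.asarray(observations, float)
    return float(np.mean(d * d))
```

With all weights at zero each hidden neuron outputs 0.5, so the linear output neuron simply sums half of the second-layer weights.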

For this regression problem, the event posterior probability transformation was not used; the preprocessing was done with a simpler *Z*-score transformation [indicated by a *Z*(⋅) notation]; that is, each sounding index is standardized by subtracting its mean value and dividing the difference by its standard deviation (e.g., Abdi 2007). As done in section 3a(2), here the 2002–05 data will form the test sample, while the remaining data (total sample) will be divided into training (75%) and validation (25%) datasets, using 12 different methods (bootstraps). The best NN inputs are found by minimizing the mean validation MSE over the 12 bootstraps, while the number of hidden neurons is optimized in a following step.
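The *Z*-score preprocessing can be sketched in a few lines (in practice the training-sample mean and standard deviation would be reused for the test data):

```python
import numpy as np

def z_score(x, mean=None, std=None):
    """Standardize a sounding index: subtract the (training) mean and
    divide by the (training) standard deviation."""
    mean = x.mean() if mean is None else mean
    std = x.std(ddof=1) if std is None else std
    return (x - mean) / std
```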

Figure 7a shows the training and validation errors obtained by an NN when forecasting the Tv dataset. The different lines correspond to the 12 different ways used to resample, without repetitions, the total dataset. Comparing Fig. 7a with the corresponding figure for the classification problem (Fig. 3a), one can see that in this case adding more inputs does not seem to reduce the validation error as much as it did before, probably because of the smaller sample size. Moreover, the average points (unfilled squares connected by a solid black line) clearly depart from the 45° bisector after more than four mean-error variables are chosen by the forward-selection algorithm, showing overfitting problems. The eighth variable does not improve the training error either. The seven predictors with the lowest mean validation MSEs are listed in the first column of Table 5. Those chosen in the first positions are *Z*(SWISS), *Z*(ShowI), and *Z*(HLWspd), so that instability and the wind fields (the high-level speed and the low-level shear used in the SWISS computation) seem to be the most important predictors for estimating the hailstorm extent.

Table 5 lists the four NNs chosen as members of the linear multiregression model for forecasting the number of hit hailpads [ln(1 + Hit_Hailpads)/4.6]: the NN chosen first is in the leftmost column, while that chosen last is in the rightmost column. SWISS, ShowI, and wind speeds (high- or midlevel winds) are among the most important predictors.

After having chosen a subset of seven inputs, all of the total sample was refit, varying *H* in the 1–4 range. A good level of performance is already obtained with a seven-input, two-hidden-neuron (*I* = 7, *H* = 2) NN, which has a total MSE of 0.018 over 690 valid cases and a test MSE of 0.022 over 239 valid cases. The corresponding *R*^{2} values are 0.327 (*p* value < 2 × 10^{−16}) and 0.143 (*p* value < 2 × 10^{−9}), respectively, showing relative improvements of 150% and 10% with respect to the best diagnostic bivariate *R*^{2} (0.131) found in Manzato (2012), or of 20% and 1% with respect to the *R*^{2} found with the 12-predictor linear LEAPS regression. Thus, while for the total sample there is a significant improvement with respect to the bivariate correlation or to the linear multivariate approach, for the test sample the improvement is much smaller. To further improve this result, an ensemble NN method was developed.

### c. Regression neural network ensemble

#### 1) Generating 204 candidate NN members

As was already done in section 3b, here 204 NNs will also be developed, using two other resampling techniques and all three databases computed by the Sound_Analys.PY software (i.e., T, Tv, and Tvc). For example, Fig. 7b shows the training–validation MSE diagram when using all the data from the same period of the day to form the validation sample. As found in Fig. 4b, here there also seems to be a different predictability for the different periods of the day. The main difference is that here the lowest validation errors are obtained by the 0600 UTC cases instead of by the 0000 UTC cases. On the other hand, NNs built using the 0000, 0600, and 1800 UTC data performed very badly when validated on the 1200 UTC data, which again seem to be the most difficult to predict without their explicit inclusion in the training phase.

Using these alternative resampling techniques, a total of 1 + 12 + 4 = 17 different NN input sets were found. Varying *H* for each one in the 1–4 range produces 4 + 48 + 16 = 68 different regression NNs. Figure 8 shows the 68 [total MSE, test MSE] points obtained on the Tv dataset. One can see that the mean-error NNs (i.e., NNs whose inputs were chosen to minimize the mean bootstrap validation MSE) have on average the best performance, followed by the NNs built by resampling on the four different periods of the day. The 48 boot NNs (i.e., those with variables chosen independently for each of the 12 bootstrap trials) have a very high average test MSE because one of them has an extremely high error (test MSE = 0.15, which is off the figure scale). Excluding this outlier from the computation gives a mean ratio between the remaining 67 total and test MSEs of 0.84, which is very similar to that found for the classification problem. The circled NN is the one described at the end of the previous section and will be used as the starting point to build the regression ensemble.

The same procedure described up to now has been applied to the T and Tvc databases, producing a final number of 204 regression NNs. To increase the NN diversity, the number of NN inputs used in the T database was lower than for the Tv and Tvc datasets (six against seven).

#### 2) Combining NN members in a regression ensemble

To combine more regression NNs into one ensemble forecast, two methods were tried. The first is the bagging technique (Breiman 1996), while the second is a linear multiregression of the ensemble member outputs. The basic idea of bagging (which stands for bootstrap aggregation) is to develop a regression model for each of the many different bootstrap resamples. They are then aggregated by simply taking the mean value of their different forecasts. The best bagged ensemble has been found starting with the NN described at the end of the previous section and using a forward-model selection to add the NNs that further improve the *R*^{2} of the ensemble forecast on the total sample. The best result has been found when taking the mean value from four NNs and has a total *R*^{2} of 0.477 and a test *R*^{2} of 0.184, with MSEs of 0.0155 and 0.0200, respectively.
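The bagging idea can be sketched as follows (a generic illustration, not the NN training code used here; `fit` stands for any routine that returns a trained model callable):

```python
import numpy as np

def bagging(fit, X, y, n_models=4, seed=0):
    """Bagging (bootstrap aggregation, Breiman 1996): fit one model on
    each bootstrap resample of the training data, then forecast with
    the mean value of the members' forecasts."""
    rng = np.random.default_rng(seed)
    n = len(y)
    members = [fit(X[idx], y[idx])
               for idx in (rng.integers(0, n, size=n) for _ in range(n_models))]

    def predict(X_new):
        # aggregate by simply taking the mean of the member forecasts
        return np.mean([m(X_new) for m in members], axis=0)

    return predict
```

Averaging reduces the variance of the individual members, which is the source of the shrinking-toward-the-mean effect discussed below for the forecast variability.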

The MSE and *R* are important parameters, but they do not explain the whole story. As shown by Taylor (2001), there is a nice relation between the “centered” root MSE, the standard deviation of forecasts (*σ*_{for}) and of observations (*σ*_{obs}), and *R*:

RMSE_{c}^{2} = *σ*_{for}^{2} + *σ*_{obs}^{2} − 2*σ*_{for}*σ*_{obs}*R*,

with *γ* = *σ*_{for}/*σ*_{obs}. Of course, *σ*_{obs} is constant for a given dataset to be forecasted. NNs are usually very good in forecasting the mean value, so that BIAS^{9} ≅ 0. Thus, the only other parameter not already considered is *γ*, that is, the degree to which the forecast can reproduce the observed variability. For the bagged ensemble of four NNs, *γ* = 0.49 for the total sample and 0.42 for the test sample, which are far from the desired unit value. Basically, that is because the bagged output tends to shrink toward the mean observed value, and that limits the forecast variability.
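The Taylor (2001) relation quoted above can be checked numerically; this sketch computes both sides of the identity for arbitrary forecast and observation arrays (population standard deviations, so the identity is exact):

```python
import numpy as np

def taylor_identity(f, o):
    """Return both sides of the Taylor (2001) relation: the centered
    mean-square error and sigma_f^2 + sigma_o^2 - 2*sigma_f*sigma_o*R."""
    fa, oa = f - f.mean(), o - o.mean()
    e2 = np.mean((fa - oa) ** 2)       # centered MSE
    sf, so = f.std(), o.std()          # population standard deviations
    r = np.mean(fa * oa) / (sf * so)   # linear correlation coefficient
    return e2, sf ** 2 + so ** 2 - 2.0 * sf * so * r
```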

If one uses a linear multiregression (a linear fit between the NN forecasts and the observed CALHA6h) to combine the single NN members into an ensemble forecast, then one can obtain better performance for the total sample, but usually worse results for the test sample, because a weighted ensemble, like a linear multiregression, is a “more powerful” algorithm than a simple mean ensemble, like bagging (Granitto et al. 2005). For example, a simple linear regression made with the output of only two NNs gives a total *R*^{2} of 0.419 and a test *R*^{2} of 0.145, while increasing the number of NN members gives even lower test *R*^{2} values. In that case, however, *γ* increases to 0.65 for the total sample and 0.61 for the test sample, because there is no shrinking effect toward the mean value.
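A minimal sketch of such a weighted combination (an ordinary least-squares fit of the observed target on the member forecasts; the names are illustrative, not the operational code):

```python
import numpy as np

def combine_members(member_outputs, target):
    """Weighted ensemble: least-squares fit of the observed target on
    the member forecasts (an intercept plus one weight per member)."""
    A = np.column_stack([np.ones(len(target))] + list(member_outputs))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef  # [intercept, w_1, ..., w_M]
```

Unlike the bagged mean, the fitted weights can rescale the members, which is why this combination better reproduces the observed variability.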

Figure 6b shows how the total and test *R*^{2} increase, adding more and more models—the NN members chosen during the bagging algorithm—to the ensemble linear multiregression. While the total *R*^{2} shows a huge improvement from one to four members, the test *R*^{2} has a smaller improvement, but it is still significant. The equation used to build the ensemble output from these four NN outputs is Eq. (7), where NN1–NN4 are the four NNs described in Table 5. Looking at the multiplying coefficients, one can see that the starting model NN1 gives a negligible contribution to the multiregression output, while the other three NNs have similar importance. Hence, in practice this ensemble employs only three NNs.

#### 3) Verification of the regression forecast

With this hybrid approach, the four-NN linear multiregression obtains *R*^{2} values of 0.491 (*p* value < 2 × 10^{−16}) and 0.173 (*p* value < 2 × 10^{−11}), with MSEs of 0.0140 and 0.0213, and with *γ* values of 0.70 and 0.61, respectively, for the total (689 valid cases) and test (239 valid cases) samples. The ensemble forecast *R*^{2} has a relative improvement of 50% and 21%, for the total and test samples, with respect to those obtained using the single NN of the previous section, which were already better than those obtained by the diagnostic bivariate regression of the first part of this work and by the linear LEAPS method. Nevertheless, to improve the test *R*^{2} by a relatively small amount, it was necessary to substantially increase the complexity of the new model and the *R*^{2} on the total sample, which now has a total-to-test MSE ratio of only 0.66.

Figures 9a and 9b summarize these performances in the Taylor (2001) diagrams for the total and the test databases. The four-NN multiregression (black dots) is compared with the bagging ensemble of the same four NNs (dark gray), with the single *I* = 7, *H* = 2 NN developed in section 4b (medium gray), and with the 12-variable multiregression found by LEAPS (light gray). The open circle along the abscissa represents the observed standard deviation, and its distance from each forecast point is the centered RMSE. The solid black arc coming out from this point represents the *γ* = 1 condition, while the dotted lines show values of *R* (not *R*^{2}). It is possible to see that the regression ensemble is closer to the *γ* = 1 arc, even if in the test diagram it has a centered RMSE slightly higher than that of the bagged ensemble because, as mentioned before, the ensemble mean optimizes the centered RMSE at the expense of variability. Hence, one can consider the multiregression NN ensemble to be the best forecast model among those examined.

Figure 10 shows the scatterplot for the 689 cases (Fig. 10a) of the total sample with all the needed predictors available and for the 239 cases (Fig. 10b) of the test dataset without missing values. While for the total sample the cases with CALHA6h > 0.6 were forecasted relatively well (in particular the historical maximum), for the test sample all these cases have FORHA6h < 0.6; that is, the largest hailstorms are underestimated. Moreover, note that for the low observed values (CALHA6h = 0.15, i.e., only one hit hailpad) the forecast variability is very large. That may also reflect the fact that the one-hit hailpad data are not very reliable.

All the input predictors (transformed into their *Z* scores) of the four NNs used in this multiregression ensemble are listed in Table 5. The first two NNs are both of the mean-error resampling type (but computed on two different databases), which confirms it to be the best resampling method. The third NN is built using the 0000 UTC period of the day as validation, while the last NN is obtained by optimizing the predictor list for a specific bootstrap. The total number of different predictors used in the ensemble is 19. One can see that *Z*(SWISS), *Z*(ShowI), the wind fields [in particular *Z*(HLWspd) and *Z*(MLWv)], and the hail probability associated with the period of the day [P(HH)] are the predictors occupying the most important positions, that is, the higher rows of Table 5.

## 5. Conclusions

In Manzato (2012), hailpad data that were collected by about 360 stations positioned over the plain of the Friuli Venezia Giulia region during April–September 1992–2009 were studied, considering the number of hit hailpads in 6-h periods, during the (0500–1100), (1100–1700), (1700–2300), and (2300–0500) UTC periods of the day. A total of 11 209 Udine–Campoformido soundings were analyzed with the Sounding_Analys.PY software (Manzato and Morgan 2003), deriving the 52 indices listed in Table 1. They were used as hail-candidate predictors, together with the period of the day, the Julian date, and the sea surface temperature measured by the OSMER–ARPA FVG station located in Trieste. Fifteen of these sounding-derived indices were computed with three different methods, following different thermodynamic schemes in the sounding analysis, so that three candidate predictor databases (called T, Tv, and Tvc) were created.

In this new work, the challenging problem of hail forecasting from sounding-derived indices has been tackled with a multivariate approach, building more NNs and combining some of them into an ensemble forecast. A three-stage methodology was developed. The first stage is the nonlinear transformation of the sounding-derived indices into their hail-occurrence posterior probability (for the problem of classifying the 6-h cases with at least two hit hailpads) or into *Z*-score standardization (for the number-of-hailpad regression problem). The second stage is to select some of these predictors as inputs for a nonlinear classification NN, to estimate their hail-occurrence posterior joint probability. The third stage is to develop many of these NNs (using different resampling techniques and three partially different databases) and to combine the outputs of some of them into an ensemble forecast, to improve performance and reliability.

The occurrence NN ensemble has been built with five NNs, using a total of 29 different sounding-derived indices (Table 3). Among these, the ones that seem to play a more important role are *P*(SWISS) [mixing instability and shear; Huntrieser et al. (1997)], *P*(HH), and *P*(BS850) (low-level bulk shear), followed by *P*(BRI) and *P*(WBZ), mixed with other instability indices and the wind in particular in the midlevels (MLW). This five-NN ensemble gives a maximum PSS of 0.747 for the total sample (7673 diagnostic cases) and of 0.679 for the test sample (2627 prognostic cases). The maximum test PSS is 39% higher than that found for the diagnostic bivariate analysis, showing the validity of a multivariate nonlinear ensemble approach. A detailed binary forecast verification is shown in Table 4 and is found using the threshold that maximizes PSS (see Manzato 2007a). If one has a cost–loss ratio different from the event prior probability (Pyes = 0.045), then the classifier value will not be optimized by maximizing PSS and one can choose a different [POD − POFD] point on the ROC curve in Fig. 5b. In any case, comparing Table 4b with the corresponding Table 4b in Manzato (2012), one can find a mean improvement of 84% for the nine performance measures shown in the last column (or 35% when excluding the very favorable odds ratio).

Using more NNs in an ensemble is not a novel technique (e.g., Granitto et al. 2005 and references therein), but the method used to select and combine the binary NN outputs (from the 204 available candidate models) into an ensemble forecast is a novel variant of the original Mojirsheibani major voting algorithm (Mojirsheibani 1999), which is based on the most probable observation for a given combination of member forecasts and is described for the first time in the appendix. This method is very powerful and can be applied to other ensembles of binary classifiers. Table 2 summarizes how the ensemble output is obtained starting from all the possible combinations of the five forecasts given by the original NNs selected in the MMMV ensemble.

The regression estimate of the number of hit hailpads in 6 h has been done both with linear approaches and with a nonlinear NN approach. Applying a linear bivariate regression (only ShowI as predictor) to the independent test dataset produced a correlation of *R* = 0.26 (*p* value < 2 × 10^{−5}). The linear multivariate LEAPS method was able to find a maximum correlation coefficient *R* of 0.38 (*p* value < 2 × 10^{−9}) for the test sample. On the other hand, the ensemble developed with four NNs, combined as shown in Eq. (7), was able to score an *R* coefficient of 0.42 (*p* value < 2 × 10^{−11}) for the 240 test cases. Even though the absolute value of *R* is still low for many practical applications, it is 58% higher than that of the simple bivariate prognostic analysis and 10% higher than the LEAPS multiregression *R*, showing that the complex statistical methodology applied proved to be beneficial. Moreover, the ensemble diagnostic *R* for the total sample was much higher than before, being 0.70 for 689 cases. The large difference between the total and test performances may be due partially to the high complexity of the statistical model (i.e., prone to overfitting) and partially to intrinsic problems in the test sample or to a sensitivity to the very different sample sizes, since it is already present even in a linear regression with very few predictors (see Fig. 6a).

Particular emphasis has been given to improving the spread capability. The method used to select the ensemble members was based on the bagging technique (Breiman 1996), but the final method used to combine their forecasts was a linear multiregression, because the ensemble mean had too low a forecast variability; that is, *γ* = 0.42 for the averaged forecast versus *γ* = 0.61 for the multiregression forecast (on the test sample). Of all the 19 predictors used as inputs for the four NNs chosen for the regression ensemble, *Z*(SWISS) (combining instability and shear), *Z*(ShowI) (instability evaluated at 500 hPa), and some wind fields (in particular at mid- and high levels) seem to play the most important role. Even so, the relative improvement reached on this regression problem is not as high as it was in the classification problem (cf. Figs. 5a and 6b), and that is probably due to the more effective MMMV method developed for combining the ensemble members in the classification case.

In conclusion, it seems that a nonlinear multivariate hail forecast can be performed using a combination of pure potential instability indices (like ShowI) or mixed indices (like SWISS or BRI) in association with wind-derived indices, like low-level shear (BS850) or mid- and high-level wind speeds (MLW and HLW). The statistical evidence of the importance of the wind field (Dessens 1986) is confirmed, in particular when used together with measures of potential instability (as in Kitzmiller and Mcgovern 1990). The period of the day (HH) also seems to be a key parameter for forecasting hail occurrence, so that one could think about developing different statistical models for the different periods of the day, which may be able to better describe the different processes active in hail production during daytime and nighttime. Doing so, however, would decrease the sample size and hence diminish the statistical robustness. Figures 3b and 7b also suggest that the degrees of predictability of the four periods of the day are different, with the 1200 UTC cases (followed by the 1800 UTC ones) being “more difficult” than those at 0000 and 0600 UTC. Note that in both ensembles an NN has been chosen that was developed without the 0000 UTC data during the training phase, as if the 0000 UTC cases bring less useful information than the other three periods of the day.

The NN ensemble forecasts described in this work have been running in real time at OSMER–ARPA since February 2009. The occurrence ensemble is evaluated first; if hail occurrence is forecasted, then the regression ensemble also tries to estimate the hailstorm extent. In the future, we contemplate extending the short-term (nearcast) hail forecast further ahead in time using the pseudoindices derived from the European Centre for Medium-Range Weather Forecasts (ECMWF) model, as has already been done operationally for the thunderstorm and rain forecasts.

*Acknowledgments.* First of all, the author wants to thank AMS for granting a full-cost waiver for this manuscript. Thanks to Rich Caruana (Cornell University) for introducing the author to the bagging/ensemble technique during the AMS Short Course on Artificial Intelligence Applications to Environmental Science (Atlanta, GA, 28–29 January 2006), organized by Professor Caren Marzban, whose help never left the author. The author also wants to thank the “hail volunteers,” who manage the hailpad stations for free, and Rich Rotunno (NCAR) for improving the English of this paper. Last but not least, he thanks the “GNU generation” for having provided such good—and free—tools as R, Python, LaTeX, Emacs, and Linux itself.

# APPENDIX

## The Modified Mojirsheibani Major Voting

The problem discussed in this appendix is that of finding a “good” technique for combining different categorical classifiers into an ensemble. Suppose that there are *N* different cases available, each one characterized by an input vector **x**_{n} and the associated true event class *t*_{n}. Suppose that this database is used to build *M* different classifiers *C*^{1}, … , *C*^{M}. Each of these classifiers maps every input vector **x**_{n} in a forecast *y*_{n}^{m} in the 0, … , (*K* − 1) range, representing *K* different classes (same as *t*_{n}). Thus, for the *n*th case an *M* vector of forecasts, **w**_{n} = (*y*_{n}^{1}, … , *y*_{n}^{M}), is available. The simple major voting ensemble chooses as forecast *y*_{n} the class label that has been voted by the majority of the *M* classifiers; that is,

*y*_{n} = argmax_{k} Σ_{m=1}^{M} *I*(*y*_{n}^{m} = *k*),

where *I* gives 1 if its argument is true or 0 otherwise.

The Mojirsheibani (1999) major voting (MMV) works in a different way: all the distinct combinations **w** = (*y*^{1}, … , *y*^{M}) of the *M* classifier forecasts (which are at maximum *K*^{M}) that occurred in the dataset are listed and, for each of them, the mode of all the true classes *t*, observed when that particular combination of classifier forecasts was predicted, is chosen. Defining the subset *W*_{i} of all the *N*_{i} cases when the same vector **w**_{i} of forecasts occurred as *W*_{i} = {**w**_{n} ∀ *n* in 1, … , *N*: **w**_{n} = **w**_{i}}, it is found that

*y*_{i} = argmax_{k} Σ_{n ∈ W_{i}} *I*(*t*_{n} = *k*).

Thus, the MMV ensemble chooses as the forecast the class most often voted by the true observations instead of by the classifiers. Mojirsheibani (1999) has shown how this algorithm asymptotically outperforms each of the individual classifiers.

Consider this simple example with *K* = 2 (binary events). Suppose that the major voting forecast is “perfectly wrong”; that is, the mode of the classifiers is always the complement of the observed class. In that case the MMV forecast would be perfect, because it does not consider the classifiers’ “absolute” forecasts, but only the most probable observed class for any given occurrence of their sequence of forecasts. That is, the event posterior probability is associated with that particular combination of forecasts **w**, as in a Bayesian approach. Since it is built using as reference the *N* observed *t* classes, the MMV is a calibrated method and, in particular, its frequency BIAS is close to 1.
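A minimal sketch of the MMV lookup (illustrative Python, with each combination of classifier forecasts coded as a tuple; not the operational implementation):

```python
from collections import Counter

def mmv_table(forecast_vectors, observed):
    """Mojirsheibani major voting: for each distinct combination of the
    M classifier forecasts, store the mode of the observed classes."""
    counts = {}
    for w, t in zip(map(tuple, forecast_vectors), observed):
        counts.setdefault(w, Counter())[t] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def mmv_predict(table, w, default=0):
    """Look up the ensemble forecast; fall back to a default class for
    combinations never seen in the N training cases."""
    return table.get(tuple(w), default)
```

Note that the forecast is the mode of the *observations* associated with a combination, so a combination on which the classifiers are consistently wrong still yields the correct class.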

There are two characteristics of the MMV that here will be modified. The first one is not to use the entire table of different forecast combinations and corresponding MMV outputs but rather a compact (and complete) version of it. This is achieved by simply using a **v** vector of *K*^{M} elements, each having a value in the 0, … , (*K* − 1) range. The table rows can be coded in a *K*-base system, corresponding to the decimal numbers from 0 to *K*^{M} − 1, where this number is simply the index *i* of the vector element minus one. The value of the *i*th element of **v** is the MMV forecast corresponding to the particular ensemble forecast **w**_{i}, which written in the *K*-base system corresponds to *i* − 1. If a particular **w**_{i} never occurred in the *N* cases (MMV missing value), then one can set that ensemble forecast using just the normal major voting or simply assigning it to the most frequent class (highest prior probability *P*_{k}). In this way, the whole MMV truth table is condensed in the *K*-base **v** vector, which can also be converted into a simple decimal number. For example, all the nontrivial information in Table 2 (contained in the last column) can be coded as “7020319” if the top element is taken to be the “most significant bit.”
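The *K*-base condensation of the truth table can be sketched as a simple positional encoding (first element taken as the most significant digit, as in the “7020319” example; a hypothetical helper, not the operational code):

```python
def encode_v(v, K=2):
    """Condense the MMV forecast vector v (one base-K digit per
    classifier-forecast combination) into a single integer, with the
    first element as the most significant digit."""
    n = 0
    for digit in v:
        n = n * K + digit  # positional base-K accumulation
    return n
```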

The second, and more important, modification is introduced for the rare-event problems. Suppose there is a binary event with very low prior probability (e.g., Pyes = *P*_{1} = 0.05 and Pno = *P*_{0} = 0.95). Suppose also that the number of the ensemble classifiers is low (e.g., *M* = 3) and that each classifier has a BIAS > 1, because it is built to have a high probability of event detection (POD), even though that produces many false alarms. Then, it could happen that all the *K*^{M} (2^{3} = 8) possible combinations of the classifier forecasts have as MMV forecasts always 0, since the observed nonevent cases are the majority and MMV gives an output with BIAS ≅ 1. This happens because the ensemble “sample resolution,” 1/*K*^{M}, associated with the number of possible classifier combinations, is high when compared with the event prior probability *P*_{1}.

To overcome this limitation, in the modified MMV (MMMV) the observed *t* values are counted with different weights when computing the MMMV output. In that sense, it is a weighted majority voting approach. One possibility is to fix the weights as the class relative “rarity”; that is, (1 − *P*_{k})/(*K* − 1), where *P*_{k} is the prior probability of the *k*th class and the (*K* − 1) coefficient normalizes the weight sum to 1. In probability terms,

*y*_{i} = argmax_{k} [(1 − *P*_{k})/(*K* − 1)] Σ_{n ∈ W_{i}} *I*(*t*_{n} = *k*).

In the binary case, this means that the forecast *y*_{i} will be 1 when the number of observed events in *W*_{i} is greater than the number of nonevents divided by *α* = *P*_{0}/*P*_{1}.

Another possibility, in particular for *K* = 2, is simply to leave this weight unfixed and to use it as an ensemble classifier threshold. Using this threshold, the ensemble forecast can be varied and it is possible to build a ROC curve. For each point on this ensemble ROC curve, one can compute the table of contingencies and the associated statistics. What has been chosen in this work is to tune this threshold to maximize the ensemble PSS, as suggested in Manzato (2007a). This method has been introduced to manage the events with low probability, but also works quite well with the not-so-rare events.
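For the binary case, the weighted decision rule with a tunable threshold can be sketched as follows (the per-combination observation counts are assumed precomputed, e.g. with `collections.Counter`; names are illustrative):

```python
from collections import Counter

def mmmv_binary(table_counts, w, alpha):
    """Modified MMV for a binary event: forecast 1 when the number of
    observed events for this forecast combination exceeds the number of
    nonevents divided by alpha (alpha = P0/P1 gives the rarity weights;
    alternatively, alpha is left free and tuned as a threshold)."""
    c = table_counts.get(tuple(w), Counter())
    return 1 if c[1] > c[0] / alpha else 0
```

Sweeping `alpha` from small to large values moves the ensemble along its ROC curve; the value maximizing the ensemble PSS is then retained, as done in this work.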

## REFERENCES

Abdi, H., 2007: Z scores. *Encyclopedia of Measurement and Statistics,* N. J. Salkind, Ed., Sage, 1057–1058.

Balakrishnan, N., and D. Zrnić, 1990: Use of polarization to characterize precipitation and discriminate large hail. *J. Atmos. Sci.,* **47,** 1525–1540.

Bayes, T., 1763: An essay towards solving a problem in the doctrine of chances. *Philos. Trans. Roy. Soc. London,* **53,** 370–418.

Berk, R., 2005: An introduction to ensemble methods for data analysis. Dept. of Statistics Paper 2005032701, University of California, Los Angeles, 37 pp. [Available online at http://repositories.cdlib.org/uclastat/papers/2005032701.]

Bishop, C. M., 1996: *Neural Networks for Pattern Recognition.* Clarendon Press, 482 pp.

Breiman, L., 1996: Bagging predictors. *Mach. Learn.,* **24,** 123–140.

Brimelow, J. C., G. W. Reuter, and E. R. Poolman, 2002: Modeling maximum hail size in Alberta thunderstorms. *Wea. Forecasting,* **17,** 1048–1062.

Carney, J. G., and P. Cunningham, 1999: Tuning diversity in bagged neural network ensembles. Tech. Rep. TCD-CS-1999-44, Trinity College, 23 pp.

Clark, T. L., W. D. Hall, and J. L. Coen, 1996: Source code documentation for the Clark–Hall cloud-scale model: Code version G3CH01. NCAR Tech. Note NCAR/TN-426+STR, 174 pp.

Danielsen, E. F., 1977: Inherent difficulties in hail probability prediction. *Hail: A Review of Hail Science and Hail Suppression, Meteor. Monogr.,* No. 38, Amer. Meteor. Soc., 135–144.

Das, P., 1962: Influence of wind shear on the growth of hail. *J. Atmos. Sci.,* **19,** 407–413.

Dessens, J., 1986: Hail in southwestern France. I: Hailfall characteristics and hailstorm environment. *J. Climate Appl. Meteor.,* **25,** 35–47.

Edwards, R., and R. L. Thompson, 1998: Nationwide comparisons of hail size with WSR-88D vertically integrated liquid water and derived thermodynamic sounding data. *Wea. Forecasting,* **13,** 277–285.

Farley, R. D., and H. D. Orville, 1986: Numerical modeling of hailstorms and hailstone growth. Part I: Preliminary model verification and sensitivity tests. *J. Climate Appl. Meteor.,* **25,** 2014–2035.

Fawbush, W. J., and R. C. Miller, 1953: A method for forecasting hailstone size at the earth’s surface. *Bull. Amer. Meteor. Soc.,* **34,** 235–244.

Féral, L., H. Sauvageot, and S. Soula, 2003: Hail detection using S- and C-band radar reflectivity difference. *J. Atmos. Oceanic Technol.,* **20,** 233–248.

Foster, D. S., and F. C. Bates, 1956: A hail size forecasting technique. *Bull. Amer. Meteor. Soc.,* **37,** 135–141.

Freund, Y., and R. Schapire, 1996: Experiments with a new boosting algorithm. *Proc. 13th Int. Conf.,* Bari, Italy, Machine Learning Group, 148–156.

Galway, J. G., 1956: The lifted index as a predictor of latent instability. *Bull. Amer. Meteor. Soc.,* **37,** 528–529.

Granitto, P. M., P. F. Verdes, and H. A. Ceccatto, 2005: Neural networks ensembles: Evaluation of aggregation algorithms. *Artif. Intell.,* **163,** 139–162.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.,* **129,** 550–560.

Heinselman, P. L., and A. V. Ryzhkov, 2006: Validation of polarimetric hail detection. *Wea. Forecasting,* **21,** 839–850.

Houze, R. A., Jr., W. Schmid, R. G. Fovell, and H.-H. Schiesser, 1993: Hailstorms in Switzerland: Left movers, right movers, and false hooks. *Mon. Wea. Rev.,* **121,** 3345–3370.

Huntrieser, H., H. H. Schiesser, W. Schmid, and A. Waldvogel, 1997: Comparison of traditional and newly developed thunderstorm indices for Switzerland. *Wea. Forecasting,* **12,** 108–125.

Kitzmiller, D. H., and W. E. Mcgovern, 1990: Wind profiler observations preceding outbreaks of large hail over northeastern Colorado. *Wea. Forecasting,* **5,** 78–88.

Kleiber, C., 2008: The Lorenz curve in economics and econometrics. *Advances on Income Inequality and Concentration Measures, Collected Papers in Memory of Corrado Gini and Max O. Lorenz,* G. Betti and A. Lemmi, Eds., Routledge, 225–242.

Krzysztofowicz, R., and C. J. Maranzano, 2006: Bayesian processor of output for probability of precipitation occurrence. [Available online at www.faculty.virginia.edu/rk/BPO.htm.]

Lakshmanan, V., T. Smith, G. Stumpf, and K. Hondl, 2007: The Warning Decision Support System–Integrated Information. *Wea. Forecasting,* **22,** 596–612.

Lemon, J., 2006: Plotrix: A package in the red light district of R. *R News,* No. 4, R Foundation for Statistical Computing, 8–12. [Available online at http://www.r-project.org/doc/Rnews/Rnews_2006-4.pdf.]

López, L., E. García-Ortega, and J. L. Sánchez, 2007: A short-term forecast model for hail. *Atmos. Res.,* **83,** 176–184.

Lorenz, M. O., 1905: Methods of measuring the concentration of wealth. *Publ. Amer. Stat. Assoc.,* **9,** 209–219.

Manzato, A., 2003: A climatology of instability indices derived from Friuli Venezia Giulia soundings, using three different methods. *Atmos. Res.,* **67–68,** 417–454.

Manzato, A., 2005: The use of sounding-derived indices for a neural network short-term thunderstorm forecast. *Wea. Forecasting,* **20,** 896–917.

Manzato, A., 2007a: A note on the maximum Peirce skill score. *Wea. Forecasting,* **22,** 1148–1154.

Manzato, A., 2007b: Sounding-derived indices for neural network based short-term thunderstorm and rainfall forecasts. *Atmos. Res.,* **83,** 336–348.

Manzato, A., 2012: Hail in northeast Italy: Climatology and bivariate analysis with sounding-derived indices. *J. Appl. Meteor. Climatol.,* **51,** 449–467.

Manzato, A., and G. M. Morgan, 2003: Evaluating the sounding instability with the lifted parcel theory. *Atmos. Res.,* **67–68,** 455–473.

Marzban, C., 2000: A neural network for tornado diagnosis. *Neural Comput. Appl.,* **9,** 133–141.

Marzban, C., and A. Witt, 2001: A Bayesian neural network for severe-hail size prediction. *Wea. Forecasting,* **16,** 600–610.

Mason, I. B., 2003: Binary events. *Forecast Verification: A Practitioner’s Guide in Atmospheric Science,* I. T. Jolliffe and D. B. Stephenson, Eds., J. Wiley and Sons, 37–76.

Mojirsheibani, M., 1999: Combining classifiers via discretization. *J. Amer. Stat. Assoc.,* **94,** 600–609.

Moore, J. T., and J. P. Pino, 1990: An interactive method for estimating maximum hailstone size from forecast soundings. *Wea. Forecasting,* **5,** 508–525.

Morgan, G. M., Jr., and N. G. Towery, 1975: Small-scale variability of hail and its significance for hail prevention experiments.

,*J. Appl. Meteor.***14**, 763–770.Navone, H. D., , Verdes P. F. , , Granitto P. M. , , and Ceccatto H. A. , 2000: A new algorithm for selecting diverse members of neural network ensembles.

*Proc. VIth Int. Congress on Information Engineering,*Buenos Aires, Argentina. [Available online at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.9390.]Rasmussen, R. M., , and Heymsfield A. J. , 1987: Melting and shedding of graupel and hail. Part I: Model physics.

,*J. Atmos. Sci.***44**, 2754–2763.Renick, J. H., , and Maxwell J. B. , 1977: Forecasting hailfall in Alberta.

*Hail: A Review of Hail Science and Hail Suppression, Meteor. Monogr.,*No. 38, Amer. Meteor. Soc., 145–151.Showalter, A. K., 1953: A stability index for thunderstorm forecasting.

,*Bull. Amer. Meteor. Soc.***34**, 250–252.Swets, J. A., 1973: The relative operating characteristic in psychology.

,*Science***182**, 900–1000.Taylor, K. E., 2001: Summarizing multiple aspects of model performance in a single diagram.

,*J. Geophys. Res.***106**(D7), 7183–7192.Towery, N. G., , Changnon S. A. Jr. & , and Morgan G. M. Jr., 1976: A review of hail-measuring instruments.

,*Bull. Amer. Meteor. Soc.***57**, 1132–1140.Venables, W. N., , and Ripley B. D. , 2002:

*Modern Applied Statistics with S-PLUS.*4th ed. Springer-Verlag, 495 pp.Waldvogel, A., , Federer B. , , and Grimm P. , 1979: Criteria for the detection of hail cells.

,*J. Appl. Meteor.***18**, 1521–1525.Witt, A., , Eilts M. D. , , Stumpf G. J. , , Johnson J. T. , , Mitchell E. D. , , and Thomas K. W. , 1998: An enhanced hail detection algorithm for the WSR-88D.

,*Wea. Forecasting***13**, 286–303.Wobrock, W., , Flossmann A. I. , , and Farley R. D. , 2003: Comparison of observed and modelled hailstone spectra during a severe storm over the northern Pyrenean.

,*Atmos. Res.***67–68**, 685–703.Zrnić, D. S., , Ryzhkov A. , , Straka J. , , Liu Y. , , and Vivekanandan J. , 2001: Testing a procedure for automatic classification of hydrometeor types.

,*J. Atmos. Oceanic Technol.***18**, 892–913.

^{1} Radar and lightning data have not been investigated here because their data record is too short relative to that of the hailpads, and because the focus is on short-term forecasting and not nowcasting.

^{2} SWISS (Huntrieser et al. 1997) is an index that combines the lifted index (LI; Galway 1956) with the wind shear in the lowest 3 km and the dewpoint depression at 650 hPa.

^{3} For this first experiment, only the sounding-derived data computed with the Tv method are used.

^{4} Note that a 10*I*–4*H* NN already has 49 free parameters (adaptive weights) to be set, while a 10*I*–2*H* NN has only 25 weights.
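The weight counts in the footnote can be checked directly. The sketch below assumes a fully connected feed-forward network with one hidden layer, a single output, and a bias on every hidden and output unit (a standard architecture, but an assumption here, not a detail taken from the paper):

```python
# Free-parameter (adaptive weight) count of a fully connected
# feed-forward NN with one hidden layer, one output, and bias terms.
# Assumed architecture: input -> hidden -> single output.
def n_free_parameters(n_inputs, n_hidden, n_outputs=1):
    input_to_hidden = (n_inputs + 1) * n_hidden    # +1 for hidden-unit biases
    hidden_to_output = (n_hidden + 1) * n_outputs  # +1 for output bias
    return input_to_hidden + hidden_to_output

print(n_free_parameters(10, 4))  # 49, as in the footnote
print(n_free_parameters(10, 2))  # 25
```

Under these assumptions the counts reproduce the footnote exactly: (10 + 1) × 4 + (4 + 1) = 49 and (10 + 1) × 2 + 3 = 25.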

^{5} A similar effect was found by Manzato (2007a, his Fig. 11) for the PSS metric, when evaluating different thresholds of accumulated rain.

^{6} PSS was used instead of CEE because the ensemble classifier has a binary output, while CEE can be computed only for continuous forecasts in the range 0 < *y* < 1.
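For reference, the PSS of a binary classifier follows directly from the 2 × 2 contingency table; the function below is a generic sketch of that standard definition (names and example counts are illustrative, not taken from the paper):

```python
# Peirce skill score (PSS) from a 2x2 contingency table:
# a = hits, b = false alarms, c = misses, d = correct negatives.
# PSS = POD - POFD; 1 for a perfect forecast, 0 for a constant
# or random one.
def peirce_skill_score(a, b, c, d):
    pod = a / (a + c)    # probability of detection (hit rate)
    pofd = b / (b + d)   # probability of false detection
    return pod - pofd

print(peirce_skill_score(50, 0, 0, 50))  # perfect forecast -> 1.0
```

Because PSS needs only the four counts, it applies to the ensemble's yes/no output, whereas CEE requires a continuous probability.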

^{7} The ROC curve was developed before the Swets (1973) paper, during the Second World War, by engineers working on the analysis of radar signals. Kleiber (2008) recognizes the ROC curve as a variant of the Lorenz curve, first published in 1905 by the economist Max Otto Lorenz (Lorenz 1905).

^{8} It should be said that both correlations are very low from a practical point of view, even if they are *statistically significant* (the *p* values are <2 × 10^{−5} and <2 × 10^{−16}, respectively).

^{9} In the regression problem, BIAS means the average forecast minus the average observation; it should not be confused with the frequency BIAS used in the classification problem.
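The two notions of BIAS can be made concrete with a small sketch (the data and function names are illustrative, not taken from the paper):

```python
# Regression BIAS: mean forecast minus mean observation
# (0 means no systematic over- or underforecasting of the amount).
def regression_bias(forecasts, observations):
    return sum(forecasts) / len(forecasts) - sum(observations) / len(observations)

# Frequency BIAS for binary classification:
# (hits + false alarms) / (hits + misses), i.e. the number of "yes"
# forecasts over the number of "yes" observations (1 means the event
# is forecast as often as it occurs).
def frequency_bias(hits, false_alarms, misses):
    return (hits + false_alarms) / (hits + misses)

print(regression_bias([2.0, 3.0], [1.0, 2.0]))  # 1.0 (overforecast amounts)
print(frequency_bias(30, 10, 10))               # 1.0 (unbiased frequency)
```

Note the two can disagree: a classifier can have frequency BIAS of 1 while the underlying regression still over- or underestimates the magnitudes.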