## 1. Introduction

Verification is a key component of weather forecasting. In fact, verification not only allows one to monitor and compare the performance of weather forecasts, but also to analyze the nature of the forecast error. A *diagnostic* verification can help to detect the forecast weaknesses and systematic errors in numerical weather prediction (NWP) models. Therefore, a diagnostic verification provides guidance for forecasters and NWP modelers, which leads to new development and improvements. This work introduces a new diagnostic verification technique for probabilistic forecasts defined on a spatial domain.

Weather maps are often characterized by the presence of coherent spatial structures. Verification of forecasts defined on spatial domains needs to account for such spatial correlation. Moreover, weather phenomena are characterized by the presence of features on different scales. Phenomena on different scales are often driven by different physical processes. Verification on different spatial scales can therefore provide useful insight into the NWP model representation of different physical processes, and indicate which of these processes might benefit most from research and development. The verification technique introduced in this work aims to assess the quality of a probabilistic forecast on different spatial scales. For studying the forecast-predictability-scale limit, the technique aims also to establish at which scale there is a transition from negative to positive skill. Moreover, the technique aims to verify the capability of the forecast to reproduce the scale structure of the observations. The technique has been specifically designed to provide a verification framework, which can be used to compare the performance of probabilistic forecasts produced by models with different spatial resolutions.

A few techniques for the verification of *deterministic* forecasts on different scales can be found in the literature: Briggs and Levine (1997) introduced a wavelet-based verification method on different spatial scales that uses continuous verification statistics (e.g., the mean squared error); Casati et al. (2004) developed an intensity-scale verification technique based again on 2D wavelet decomposition and on a categorical verification approach; Zepeda-Arce et al. (2000) and Harris et al. (2001) assess the capability of the forecast to reproduce the observed spatiotemporal and multiscale spatial structure of precipitation fields. De Elia et al. (2002) and Denis et al. (2003) evaluate the forecast time-scale predictability limits as a function of the scale for high-resolution regional models. Currently, there is no technique specifically designed for the verification on different scales for *probabilistic* forecasts: the verification technique introduced in this work aims to fill this gap. Since it is scale oriented, the technique provides additional and complementary information to the classic verification methods (e.g., rank histograms and relative operating characteristic curves). However, it still strongly relates to traditional statistics, such as the Brier score and skill score, and their reliability and resolution components. The technique can be useful not only for the verification of probabilistic forecasts, such as those produced by an Ensemble Prediction System (EPS) or statistical models, but also for the verification of high-resolution deterministic models. In fact, some recent studies (e.g., Theis et al. 2005; Roberts and Lean 2007) transform the output of high-resolution models into probabilities prior to verification, to account for the time–space uncertainty and representativeness issues on scales smaller than the one of interest.

The verification method introduced in this paper is demonstrated on the Canadian Meteorological Centre (CMC) lightning probability forecast. This forecast was chosen because of the high spatial and temporal resolution of the lightning observations, so that verification can be performed on all the scales resolved by the forecast. The general features of the CMC lightning probability forecasts are reviewed in section 2. The verification method is then illustrated by using one representative case study in section 3. Interpretation of the verification results for the case study is presented along with the verification method description. Some monthly summary verification statistics for July 2004 are shown in section 4. Finally, in section 5, some conclusions are drawn.

## 2. The CMC lightning probability forecast

Lightning probability forecasts are produced operationally at the CMC (Burrows et al. 2005). The forecasts predict the probability that the number of lightning flashes in 3-h periods exceeds some specified thresholds. Forecasts are produced for time projections up to 48 h, on a domain of approximately 24-km resolution encompassing Canada and the northern United States. The forecasts are produced by the Classification and Regression Trees method (CART; Breiman et al. 1984). Separate regression trees were built for each 5° × 5° latitude–longitude sector, for each month from May to September, and for each of the 16 total 3-h time windows. Predictors were derived from the 24-km resolution Global Environmental Multiscale (GEM) NWP model output (Côté et al. 1998), run in a regional configuration. Predictands were obtained from reports of lightning flashes observed by the North American Lightning Detection Network (NALDN) and distributed by Vaisala (Orville et al. 2002). To match predictands and predictors, the number of lightning flashes in the 3-h time window was gridded on the 24-km resolution GEM domain. Each flash was assigned a weight of 1 if within a distance of 10 km from the grid point, and a weight decreasing linearly from 1 to 0 as the distance from the grid point increases from 10 to 20 km (note that this weighting procedure slightly smooths the observations and augments the number of lightning flashes). Predictors and predictands of the summers of 2000 and 2001 were used as training data to construct the regression model. A detailed description of the model producing the CMC lightning probability forecast can be found in Burrows et al. (2005).
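The weighting rule for the gridded flash counts can be written as a short function. The following is a minimal sketch and not the operational CMC code: the names `flash_weight` and `gridded_flash_count` are hypothetical, and the distances from a grid point to each flash are assumed to be already available in kilometres.

```python
import numpy as np

def flash_weight(distance_km):
    """Weight of one flash as a function of its distance from a grid point.

    1 within 10 km, decreasing linearly from 1 to 0 between 10 and 20 km,
    and 0 beyond 20 km.
    """
    d = np.asarray(distance_km, dtype=float)
    return np.clip((20.0 - d) / 10.0, 0.0, 1.0)

def gridded_flash_count(distances_km):
    """Weighted flash count at one grid point, given the distances to all flashes."""
    return float(flash_weight(distances_km).sum())
```

Because a flash lying within 20 km of more than one grid point contributes to each of them, the gridded totals can exceed the raw number of flashes, which is the augmentation mentioned in the text.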

Two categories of lightning forecast probabilities are verified in this work: probability of *any lightning*, which is defined as the probability that the number of lightning flashes in the 3-h time window is greater than 0.5, and probability of *intense lightning*, which is defined as the probability that the lightning occurrence in the 3-h time window exceeds the threshold of $e^{3} \approx 20.085$ flashes. The thresholds used to define the lightning probability categories were chosen while developing the regression model and are related to the characteristics of the lightning flash distribution, which is lognormal when many flashes are observed, and exponential when few flashes are observed (Burrows et al. 2005). The forecast probabilities are verified against lightning flash reports from the NALDN. The number of observed lightning flashes in each 3-h time window is gridded on the forecast domain. The observations used for the verification are treated using the same spatial weighting procedure that was used to construct the predictands when developing the statistical regression model. Then, forecast probabilities of any and intense lightning are verified against observed occurrences of the corresponding category (i.e., binary fields equal to 1 where the observed gridded lightning flashes exceed the corresponding category threshold, and equal to 0 elsewhere).
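The construction of the binary observed-occurrence fields amounts to a thresholding of the gridded flash counts. A minimal sketch follows; the function name `observed_occurrence` and the small example array are illustrative assumptions, not data from this study.

```python
import numpy as np

ANY_THRESHOLD = 0.5            # "any lightning": more than 0.5 gridded flashes
INTENSE_THRESHOLD = np.exp(3)  # "intense lightning": about 20.085 gridded flashes

def observed_occurrence(gridded_flashes, threshold):
    """Binary field: 1 where the gridded flash count exceeds the threshold, 0 elsewhere."""
    return (np.asarray(gridded_flashes) > threshold).astype(int)

# Synthetic gridded flash counts for a 2 x 3 patch of grid points.
flashes = np.array([[0.0, 0.4, 3.0],
                    [25.0, 20.0, 0.6]])
x_any = observed_occurrence(flashes, ANY_THRESHOLD)
x_intense = observed_occurrence(flashes, INTENSE_THRESHOLD)
```

Note that a gridded count of 20 flashes belongs to the any lightning category but not to the intense lightning category, since it does not exceed $e^{3}$.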

Figure 1 shows the gridded lightning flashes that were observed in the 3-h window from 1800 to 2100 UTC 20 July 2004. Figures 2a,c show the corresponding binary fields of observed occurrence for the two lightning categories. The binary fields were obtained by thresholding Fig. 1 with the appropriate category threshold. Figures 2b,d show the 21-h lead time lightning probability forecasts for the two lightning categories valid at the same time. Note that the forecast fields display some rectangular areas of homogeneous small probability values: these correspond to the 5° × 5° latitude–longitude sectors used to develop the regression model. The case shown is typical: the lightning activity on the east side of the domain is related to a large frontal system; the lightning activity on the west side of the domain is mainly related to small-scale convective activity in the region of the Rocky Mountains.

## 3. The verification approach

### a. Field decomposition on different spatial scales

The forecast and observation fields are decomposed into the sum of their components on different spatial scales by a 2D discrete Haar wavelet filter. For a field $X$,

$$X = \sum_{j=1}^{J} W^{m}_{j}(X) + W^{f}_{J}(X),$$

where $W^{m}_{j}(X)$ are the mother wavelet components of the field $X$ on the scale $j$, and $W^{f}_{J}(X)$ is the father wavelet component of $X$ on the largest scale $J$. Note that the father wavelet components on different scales are obtained by a smoothing of the original field: for the Haar wavelet, they are obtained by a smoothing through spatial averaging at different resolutions. The mother wavelet components are deviation fields from the smoothed father wavelet components and have zero spatial average: for the Haar wavelet, they are the variation-about-the-mean fields (see the appendix for further details). The resolution of the mother wavelet components for $j = 1, \ldots, J = 7$ is equal to $2^{j-1}$ grid points, corresponding approximately to 24, 48, 96, 192, 384, 768, and 1536 km, respectively. The resolution of the largest-scale father wavelet component is $2^{7}$ grid points, corresponding approximately to 3072 km.
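For the Haar wavelet, the decomposition described here can be sketched with plain block averaging. The code below is an illustration under stated assumptions, not the implementation used in this study: the field is assumed to be square with side $2^{J}$ grid points, and the names `block_mean` and `haar_components` are hypothetical. The father component on scale $j$ is the average over $2^{j} \times 2^{j}$ blocks, and the mother component on scale $j$ is the difference between two successive father components.

```python
import numpy as np

def block_mean(field, size):
    """Average over non-overlapping size x size blocks, returned on the full grid."""
    n = field.shape[0]
    blocks = field.reshape(n // size, size, n // size, size).mean(axis=(1, 3))
    return np.kron(blocks, np.ones((size, size)))

def haar_components(field):
    """Return ([W^m_1, ..., W^m_J], W^f_J) for a square field with side 2^J."""
    field = np.asarray(field, dtype=float)
    n = field.shape[0]
    J = n.bit_length() - 1
    if field.shape != (n, n) or 2 ** J != n:
        raise ValueError("expected a square field with side 2^J")
    mothers, father = [], field
    for j in range(1, J + 1):
        smoother = block_mean(field, 2 ** j)  # father component on scale j
        mothers.append(father - smoother)     # detail removed by this smoothing
        father = smoother
    return mothers, father
```

By construction the components add up to the original field, each mother component has zero domain average, and the largest-scale father component is the constant field equal to the domain mean.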

### b. Energy bias on different scales

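The squared energy of a field $X$ is the average of the squared values of $X$ over all the domain:

$$\mathrm{En}^{2}(X) = \overline{X^{2}},$$

where the overbar denotes the domain average. The squared energy of $X$ is equal to the sum of the squared energies of its spatial-scale components:

$$\mathrm{En}^{2}(X) = \sum_{j=1}^{J} \mathrm{En}^{2}\left[W^{m}_{j}(X)\right] + \mathrm{En}^{2}\left[W^{f}_{J}(X)\right].$$

This follows from the orthogonality of the wavelet components, that is, $\overline{W^{m}_{j}(X)\,W^{m}_{k}(X)} = 0$ for $j \neq k$, and $\overline{W^{m}_{j}(X)\,W^{f}_{J}(X)} = 0$ for $j = 1, \ldots, J$ (Mallat 1989). The percentage with which each scale contributes to the total squared energy was then evaluated as

$$\mathrm{En}^{2}\%\left[W^{m}_{j}(X)\right] = 100 \times \frac{\mathrm{En}^{2}\left[W^{m}_{j}(X)\right]}{\mathrm{En}^{2}(X)}, \qquad j = 1, \ldots, J,$$

and similarly for the largest-scale father wavelet component $W^{f}_{J}(X)$.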

Figure 3 shows the squared energy for the different scale components of forecast and observation fields for the case study illustrated in Fig. 2. For both forecast and observation and for both categories, the smallest scale exhibits the largest squared energy and then, as the scale increases, the squared energy decreases (except for the largest scale). This indicates that in both the forecast and the observation field there is a large number of small-scale events and then, as the scale increases, the number of events (and the intensity of the forecast probabilities) decreases. Figure 3a shows that the forecast squared energy is smaller than the observed one on all the scales but the largest one: this indicates that the forecast probabilities for the any lightning category underestimate the occurrence of events, in terms of both magnitude of the probabilities and the number of features, on all the scales smaller than 3000 km. Similarly, Fig. 3b shows that the forecast probabilities for the intense lightning category underestimate the magnitude of the observed occurrences on small scales, and overestimate the events on the larger scales. The forecast is smoother than the observed field, which suggests a lack of resolution.

The largest scale corresponds to the largest father wavelet component $W^{f}_{J}(X)$. Its behavior is different from the other scales: it provides a measure of the average value of the forecast and observation fields over the entire domain. For both lightning categories shown in Fig. 3, the largest-scale energy for the forecast is visibly larger than the one for the observations: this indicates an overall overforecasting, which is also clearly shown in Fig. 2 and which is amplified (especially for the any lightning category) by the artificial forecast of small probabilities over large areas, with an extent of 500 km or larger, introduced by the 5° × 5° latitude–longitude sectors.

The comparison of Figs. 3a,b shows some of the differences between the two lightning categories. The squared energy for the intense lightning category is significantly smaller than the squared energy for the any lightning category, on all scales. This is due to the presence of fewer events in this category, since it is defined by a higher threshold. Note also that, as the scale increases, the squared energy decreases more rapidly for the intense lightning category than for the any lightning category. However, the difference in sample climatologies (and therefore in magnitude of the squared energies) prevents a direct quantitative comparison of the two lightning categories.

To directly compare the qualitative behaviors of the squared energies of the two lightning categories, and therefore their scale structure, the squared energy percentages were calculated. Figure 4 shows the squared energy percentages and their ratio, for the different scale components of forecast and observation fields for the case study illustrated in Fig. 2. For both forecast (Fig. 4a) and observation (Fig. 4b), the squared energy percentages on small scales are larger for the intense lightning category than for the any lightning category. This is due to the presence of a larger number of small-scale events in the intense lightning category. Vice versa, as the scale increases, the fraction of squared energy on large scales for the any lightning category becomes larger than for the intense lightning category. This is due to the presence of a larger number of large-scale features in the forecast and observation fields of the any lightning category.

The ratio of the squared energy percentages (Fig. 4c) measures the differences in the scale structure of the observation and forecast fields, independent of their bias. The observed-scale structure for the any lightning category is well reproduced by the forecast on all the scales from 40 to 1500 km. The smallest scale exhibits a small overforecast of small-scale features: this is partially due to the spatial smoothing performed on the observations as described in section 2. The largest scale is dominated by the overall overforecasting, enhanced by the large areas of small forecast probabilities introduced artificially by the 5° × 5° latitude–longitude sectors. For the intense lightning category, the scale structure on small scales is well reproduced, whereas on large scales it exhibits overforecasting. This is partially due to the overall overforecasting and the large-scale artificial features induced by the 5° × 5° latitude–longitude sectors. However, the scale of 400 km exhibits a particularly large overforecast. As can be seen in Fig. 2, this is mainly due to the presence in the intense lightning forecast of some realistic features (e.g., the feature bordering Nebraska and Iowa) of approximately 400 km, which are not present in the observation binary field.
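The squared energies on different scales, their percentages, and the ratio of the percentages can be sketched as follows. This is an illustration only: the Haar components are obtained by block averaging on a square field of side $2^{J}$, all function names are hypothetical, and the ratio is assumed to be taken as forecast over observation (so that values above 1 indicate an overforecast of the features on that scale).

```python
import numpy as np

def haar_components(field):
    """Haar components of a 2^J x 2^J field: J mother fields, then the father field."""
    father = np.asarray(field, dtype=float)
    n = father.shape[0]
    components, size = [], 1
    while size < n:
        size *= 2
        blocks = father.reshape(n // size, size, n // size, size).mean(axis=(1, 3))
        smoother = np.kron(blocks, np.ones((size, size)))
        components.append(father - smoother)  # mother component on this scale
        father = smoother
    return components + [father]

def squared_energy(field):
    """Domain average of the squared values of the field."""
    return float(np.mean(np.square(field)))

def energy_by_scale(field):
    """Squared energy of each scale component (mothers first, father last)."""
    return np.array([squared_energy(c) for c in haar_components(field)])

def energy_percentages(field):
    """Percentage of the total squared energy carried by each scale."""
    en2 = energy_by_scale(field)
    return 100.0 * en2 / en2.sum()

def percentage_ratio(forecast, observed):
    """Ratio of forecast to observed energy percentages on each scale (assumed convention)."""
    return energy_percentages(forecast) / energy_percentages(observed)
```

Since the percentages are normalized by the total squared energy of each field, their ratio is insensitive to an overall difference in magnitude between forecast and observation, which is why it isolates the scale structure from the bias.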

### c. Brier score decomposition on different scales

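The Brier score for a probability forecast field ($Y$) versus its corresponding observed occurrence binary field ($X$) was evaluated as

$$\mathrm{BS} = \overline{(Y - X)^{2}} = \overline{Z^{2}}, \tag{5}$$

where $Z = Y - X$ is the probability error field, and the overbar denotes the domain average. The Brier score measures the forecast error. The Brier score for a perfect forecast is 0.

The probability error field can be decomposed into the sum of its components on different spatial scales,

$$Z = \sum_{j=1}^{J} W^{m}_{j}(Z) + W^{f}_{J}(Z),$$

where $W^{m}_{j}(Z)$ are the mother wavelet components of the probability error field $Z$ on the scale $j = 1, \ldots, J$, and $W^{f}_{J}(Z)$ is the largest-scale father wavelet component of $Z$. From this result, and following the arguments used in section 3b for the squared energy, it can be shown that the Brier score is equal to the sum of its components on different scales:

$$\mathrm{BS} = \sum_{j=1}^{J} \mathrm{BS}^{m}_{j} + \mathrm{BS}^{f}_{J}, \qquad \mathrm{BS}^{m}_{j} = \overline{\left[W^{m}_{j}(Z)\right]^{2}}, \quad \mathrm{BS}^{f}_{J} = \overline{\left[W^{f}_{J}(Z)\right]^{2}}. \tag{6}$$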
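The scale decomposition of the Brier score can be checked numerically with a short sketch. The code is an illustration under assumptions: the Haar components are computed by block averaging on a square field of side $2^{J}$, the forecast probabilities are expressed as fractions between 0 and 1, and the function names are hypothetical.

```python
import numpy as np

def haar_components(field):
    """Haar components of a 2^J x 2^J field: J mother fields, then the father field."""
    father = np.asarray(field, dtype=float)
    n = father.shape[0]
    components, size = [], 1
    while size < n:
        size *= 2
        blocks = father.reshape(n // size, size, n // size, size).mean(axis=(1, 3))
        smoother = np.kron(blocks, np.ones((size, size)))
        components.append(father - smoother)  # mother component on this scale
        father = smoother
    return components + [father]

def brier_score(forecast, observed):
    """Domain average of the squared probability error."""
    error = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(error ** 2))

def brier_by_scale(forecast, observed):
    """Brier score components: squared energy of each scale component of Z = Y - X."""
    error = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return np.array([np.mean(c ** 2) for c in haar_components(error)])
```

Because the scale components of the error field are orthogonal, the entries returned by `brier_by_scale` add up to the value returned by `brier_score`.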

### d. Scale decomposition of the Brier skill score

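The Brier skill score (BSS) is defined as

$$\mathrm{BSS} = \frac{\mathrm{BS} - \mathrm{BS}_{\mathrm{ref}}}{\mathrm{BS}_{\mathrm{perf}} - \mathrm{BS}_{\mathrm{ref}}} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{clim}}},$$

where $\mathrm{BS}_{\mathrm{perf}} = 0$ is the Brier score for a perfect forecast. The reference score $\mathrm{BS}_{\mathrm{ref}} = \mathrm{BS}_{\mathrm{clim}}$ is the Brier score one would obtain by forecasting for each grid point of the forecast field the sample climatology ($Y = \overline{X}$), and $\mathrm{BS}_{\mathrm{clim}} = \overline{(\overline{X} - X)^{2}}$ is equal to the observation variance $\sigma^{2}_{X}$.

The Brier skill score on each scale was evaluated as

$$\mathrm{BSS}_{j} = 1 - \frac{\mathrm{BS}^{m}_{j}}{\sigma^{2}_{W^{m}_{j}(X)}}, \qquad j = 1, \ldots, J,$$

where $\mathrm{BS}^{m}_{j}$ are the Brier score components on different scales given by Eq. (6), and $\sigma^{2}_{W^{m}_{j}(X)}$ is the variance of the mother wavelet component $W^{m}_{j}(X)$ of the observation field $X$ on the scale $j$. Note that, since the mother wavelets have zero mean, their squared energy and variance are equal and $\sigma^{2}_{W^{m}_{j}(X)} = \mathrm{En}^{2}\left[W^{m}_{j}(X)\right]$. The scale components of the observation variance are therefore shown in Fig. 3.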

Note that the BSS for the largest-scale father wavelet component is not evaluated. Its computation is not possible because the variance of this component is zero, since it is the variance of a constant field, $W^{f}_{J}(X) = \overline{X}$. Rather, since we are comparing two constant fields, the forecast error corresponding to the largest-scale father wavelet components is measured by the energy bias of the largest-scale component (section 3b).
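The skill score on each scale can be sketched as the ratio between the Brier score component and the squared energy of the corresponding mother wavelet component of the observation. The code below is an illustration with hypothetical function names, assuming a square field of side $2^{J}$, forecast probabilities expressed as fractions, and Haar components obtained by block averaging. The father wavelet component is left out, for the reason given above.

```python
import numpy as np

def haar_components(field):
    """Haar components of a 2^J x 2^J field: J mother fields, then the father field."""
    father = np.asarray(field, dtype=float)
    n = father.shape[0]
    components, size = [], 1
    while size < n:
        size *= 2
        blocks = father.reshape(n // size, size, n // size, size).mean(axis=(1, 3))
        smoother = np.kron(blocks, np.ones((size, size)))
        components.append(father - smoother)  # mother component on this scale
        father = smoother
    return components + [father]

def bss_by_scale(forecast, observed):
    """Brier skill score on each mother wavelet scale, relative to sample climatology."""
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    error_mothers = haar_components(forecast - observed)[:-1]  # drop the father component
    obs_mothers = haar_components(observed)[:-1]
    bs_j = np.array([np.mean(c ** 2) for c in error_mothers])
    # Mother components have zero mean, so their variance equals their squared energy.
    var_j = np.array([np.mean(c ** 2) for c in obs_mothers])
    return 1.0 - bs_j / var_j
```

With this definition, a perfect forecast gives a skill of 1 on every scale, and a forecast equal everywhere to the sample climatology gives a skill of 0 on every scale, since its error components are the observation components with the sign reversed.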

The BSS components on the different scales measure the skill of the forecast at each scale. $\mathrm{BSS}_{j}$ is equal to 1 for perfect skill; $\mathrm{BSS}_{j}$ is positive when the forecast performs better than the climatological forecast, and it is negative when the forecast performs worse than the climatological forecast (negative skill). Figure 7 shows the BSS components on the different scales for the case study illustrated in Fig. 2. For both lightning categories, the skill is negative on small and intermediate scales (from 24- to 400-km resolution), and it becomes positive only on very large scales (700 km and larger). The negative skill is due mainly to small-scale feature displacements. Positive skill on large scales indicates that large-scale features, such as frontal systems, are well detected by the forecast. The intense lightning category exhibits significantly negative skill at the 400-km scale due to the displacement and overforecasting of 400-km features. This can be seen in Fig. 2 and the overforecast was already diagnosed by the bias of the energy percentages (Fig. 4).

### e. Reliability and resolution on different scales

Probabilistic forecasts are often verified in terms of reliability and resolution (Jolliffe and Stephenson 2003, chapter 7). The reliability measures the conditional bias of the observation given the forecast (i.e., the agreement between the frequency of observing the event when a certain probability was forecast and the forecast probability itself). The resolution measures the capability of the forecast to separate situations for which the observed events have distinct frequency distributions (e.g., to distinguish intense events from climatology or nonevents). The scale structure of reliability and resolution is examined in this section.

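The Brier score can be partitioned into reliability, resolution, and uncertainty components:

$$\mathrm{BS} = \overline{(Y - R)^{2}} - \overline{(R - \overline{X})^{2}} + \sigma^{2}_{X} = \mathrm{BSrel} - \mathrm{BSres} + \mathrm{unc}, \tag{14}$$

where $R = \Pr(X \mid Y)$ is the conditional probability of the observed occurrence $X$ given the forecast probability $Y$; $\mathrm{BSrel} = \overline{(Y - R)^{2}}$ is the reliability component of the Brier score; $\mathrm{BSres} = \overline{(R - \overline{X})^{2}}$ is the resolution component of the Brier score; and $\mathrm{unc} = \sigma^{2}_{X}$ is the underlying uncertainty associated with the observation, and is equal to the observation variance.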

To estimate $R$, the forecast probabilities $Y$ were binned into the 18 intervals $[0]$, $(0, 2^{-6}]$, $(2^{-6}, 2^{-5}]$, . . . , $(2^{-1}, 2^{0}]$, $(1, 10]$, $(10, 20]$, . . . , $(90, 100]$. For each interval $I_{k}$, the conditional probability $\Pr(X \mid Y \in I_{k})$ was then estimated by using the gridpoint values of the fields $X$ and $Y$ over the entire domain, and averaging the binary values of the observed occurrences $X$ corresponding to the gridpoint values of the forecast probabilities $Y \in I_{k}$. Finally, the conditional probability field $R = \Pr(X \mid Y)$ is obtained by substituting each gridpoint value $y \in Y$ of the forecast probability field $Y$ with the corresponding conditional probability value $\Pr(X \mid y \in I_{k})$. Figure 8 shows the conditional probability fields for the categories of any and intense lightning for the case study shown in Fig. 2.

The choice of the intervals in which to bin the forecast probabilities can affect the Brier score partition in Eq. (14). Different binnings with different intervals were tested in this study. The chosen partition was based on the following criteria: 1) minimization of the difference between the Brier score evaluated from Eq. (5) and the sum of its reliability, resolution, and uncertainty components as given in Eq. (14) with $R$ estimated from the binning; 2) uniformity of the distribution of the forecast probabilities through the bins; 3) sample size of the bins, each of which was required to contain at least 500 values, considered sufficient to obtain a reliable estimate of the conditional probability $\Pr(X \mid Y \in I_{k})$. The effect due to the variability of the forecast probabilities within each bin can also affect the Brier score partition (D. B. Stephenson 2007, personal communication), and was therefore also analyzed. The forecast probabilities were substituted with discrete values, equal to the average value of the forecast probabilities in each interval. Then, the Brier scores evaluated from Eqs. (5) and (14) were compared, for both the discretized and nondiscretized forecast probabilities. The differences were negligible, of the order of the third/fourth significant figure. We chose to use the nondiscretized forecast probabilities.
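The estimate of the conditional probability field and the Brier score partition can be sketched as follows. This is a minimal illustration, not the code used in this study: the names `BIN_EDGES`, `conditional_probability_field`, and `brier_partition` are hypothetical, the forecast probabilities are assumed to be given in percent (as suggested by the bin limits) and are converted to fractions for the scores, and the minimum sample size of 500 values per bin is not enforced.

```python
import numpy as np

# Upper limits of the 18 bins, in percent:
# [0], (0, 2^-6], ..., (2^-1, 1], (1, 10], (10, 20], ..., (90, 100]
BIN_EDGES = np.concatenate(([0.0], 2.0 ** np.arange(-6, 1), np.arange(10.0, 101.0, 10.0)))

def conditional_probability_field(forecast_pct, observed, edges=BIN_EDGES):
    """Replace each forecast value by the observed frequency within its bin."""
    forecast_pct = np.asarray(forecast_pct, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # right=True gives intervals open on the left and closed on the right;
    # a forecast of exactly 0 falls in the first bin.
    bin_index = np.digitize(forecast_pct, edges, right=True)
    r = np.empty_like(forecast_pct)
    for k in np.unique(bin_index):
        in_bin = bin_index == k
        r[in_bin] = observed[in_bin].mean()
    return r

def brier_partition(forecast_pct, observed):
    """Return (BS, BSrel, BSres, unc), with probabilities expressed as fractions."""
    y = np.asarray(forecast_pct, dtype=float) / 100.0
    x = np.asarray(observed, dtype=float)
    r = conditional_probability_field(forecast_pct, x)
    bs = np.mean((y - x) ** 2)
    rel = np.mean((y - r) ** 2)
    res = np.mean((r - x.mean()) ** 2)
    unc = np.var(x)
    return bs, rel, res, unc
```

When the forecast takes a single value within each bin, `bs` equals `rel - res + unc` to rounding error. With nondiscretized probabilities the two differ by a term that depends on the variability of the forecast within the bins, which is the difference that the first criterion above seeks to minimize.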

The difference $Y - \Pr(X \mid Y)$ is large, and the forecasts also lack reliability. The BSS resolution component is very small compared to the reliability component, leading to negative skill.

Here $\mathrm{BSrel}^{m}_{j}$ and $\mathrm{BSres}^{m}_{j}$ are the mother wavelet components of BSrel and BSres on the scale $j$, and $\mathrm{unc}^{m}_{j}$ is the variance $\sigma^{2}_{W^{m}_{j}(X)}$ of the mother wavelet component of $X$ on the scale $j$. Finally, the percent components of the BSS reliability and resolution were evaluated on different scales as

$$\mathrm{BSSrel}\%_{j} = \frac{\mathrm{BSrel}\%^{m}_{j}}{\mathrm{unc}\%^{m}_{j}}, \qquad \mathrm{BSSres}\%_{j} = \frac{\mathrm{BSres}\%^{m}_{j}}{\mathrm{unc}\%^{m}_{j}}, \tag{17}$$

where $\mathrm{BSrel}\%^{m}_{j}$, $\mathrm{BSres}\%^{m}_{j}$, and $\mathrm{unc}\%^{m}_{j}$ are, respectively, the percentages of BSrel, BSres, and unc on the different scales. Note that the BSS percent components defined in Eq. (17) are not additive, nor are the Brier score and skill score reliability and resolution components, on each separate scale (i.e., $\mathrm{BS}^{m}_{j} \neq \mathrm{BSrel}^{m}_{j} - \mathrm{BSres}^{m}_{j} + \mathrm{unc}^{m}_{j}$; $\mathrm{BSS}_{j} \neq \mathrm{BSSres}_{j} - \mathrm{BSSrel}_{j}$). However, the scale decomposition of the reliability and resolution components helps in understanding the scale structure of the error. In particular, for the case study analyzed, the reliability and resolution percent components of the BSS on different scales diagnose how resolution and reliability are distributed across the scales, and help determine which scales are mainly responsible for the poor resolution and lack of reliability of the forecast.

Figure 9 shows $\mathrm{BSSrel}\%_{j}$ and $\mathrm{BSSres}\%_{j}$ on different scales for the case study illustrated in Fig. 2. The reliability for the any lightning category is almost constant across the scales. For the intense lightning category, the reliability is lacking mainly on the 400- and 1500-km scale, due again to the displacement and overforecast of 400-km features and to the smoothing and overforecast of large features. The resolution for the any lightning category is higher on scales larger than 700 km, and lower on the smaller scales. This can be seen also from the conditional probability field shown in Fig. 8a, where the very large scale features (e.g., the eastern frontal feature) are identified for the any lightning category. The resolution for the intense lightning category is almost constant across the small scales and then it improves almost linearly from the 400-km scale to the largest scale. The $\mathrm{BSSrel}\%_{j}$ and $\mathrm{BSSres}\%_{j}$ scale components show that the higher resolution on the scales larger than 700 km counterbalances the lack of reliability on the same scales, for both categories, leading to positive skill on these scales. The very large negative skill of the intense lightning category on the 400-km scale is due to a lack of reliability on this scale caused by the displacement and overforecast of 400-km features.

## 4. Monthly verification

The verification method illustrated in section 3 was used to verify the CMC lightning probability forecast for the GEM NWP run starting at 0000 UTC, with lead times up to 48 h, for the months of July and August 2004 and 2005. In this paper we show the verification results for the forecasts with 24-h lead time, since they predict for the 3-h time window from 2100 to 2400 UTC, when the more intense afternoon lightning activity occurs. We only show the verification for July 2004; verification scores for July 2005, August 2004, and August 2005 exhibit similar behaviors.

The Brier score and squared energy of the observation and forecast fields were evaluated on different scales for each of the 31 forecast–observation daily realizations over the month, for the two lightning categories. The scale components of the Brier score and observed and forecast squared energy over the month have then been evaluated as the average of their corresponding daily-scale components. Finally, the monthly percentages of the Brier score, observed and forecast squared energy, and the monthly BSS on different scales were evaluated from the monthly-scale components of the Brier score and observed and forecast squared energy, in the same fashion as described in section 3.

To assess the significance of the verification results, the 90% confidence intervals (CIs) associated with all the monthly verification statistics, along with the 0.25, 0.50, and 0.75 percentiles, were evaluated by using a bootstrapping technique (Efron and Tibshirani 1993). A total of 1001 samples of the same size as the one used to compute the monthly summary scores (31 forecast–observation realizations) were created by a random selection with replacement of the daily forecast–observation realizations. For each one of the 1001 31-day samples, the monthly summary statistics were evaluated. These 1001 statistics were used to create a distribution of scores, from which the 90% CIs, 0.25, 0.50, and 0.75 percentiles were deduced, for each of the monthly summary verification scores. In Figs. 10, 11 and 12, the monthly summary statistics for the categories of any and intense lightning are shown as continuous and dashed lines, respectively. The 90% CIs and 0.25, 0.50, and 0.75 percentiles of the monthly verification statistics are plotted by using box plots. The whiskers of the box plots extend to the 90% CI values, whereas the box of the box plots shows the 0.25, 0.50, and 0.75 percentiles of the score distribution.
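The monthly aggregation and the bootstrap can be sketched for the BSS on different scales. The code is an illustration under assumptions: the daily Brier score components and the daily observation variance components are assumed to be stored as arrays with one row per day and one column per scale, the 90% CI is assumed to be bounded by the 0.05 and 0.95 percentiles of the bootstrap distribution, and the function names are hypothetical.

```python
import numpy as np

def monthly_bss(bs_daily, var_daily):
    """Monthly BSS on each scale from the monthly averages of the daily components."""
    return 1.0 - bs_daily.mean(axis=0) / var_daily.mean(axis=0)

def bootstrap_summary(bs_daily, var_daily, n_boot=1001, seed=0):
    """Rows: 0.05, 0.25, 0.50, 0.75, and 0.95 percentiles of the bootstrapped monthly BSS."""
    rng = np.random.default_rng(seed)
    n_days, n_scales = bs_daily.shape
    scores = np.empty((n_boot, n_scales))
    for b in range(n_boot):
        # Resample whole days with replacement, keeping forecast and observation paired.
        days = rng.integers(0, n_days, size=n_days)
        scores[b] = monthly_bss(bs_daily[days], var_daily[days])
    return np.percentile(scores, [5, 25, 50, 75, 95], axis=0)
```

Resampling entire days, rather than individual grid points, keeps the spatial structure of each forecast–observation pair intact, so the resulting intervals reflect day-to-day variability only.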

Figures 10 and 11 show the observed and forecast squared energy on different scales, their percentages, and the ratio for July 2004. The squared energy and its percentages exhibit a behavior similar to the behavior of the case study analyzed in section 3: small scales exhibit the largest energy (i.e., both forecast and observation are characterized by a large number of small-scale events), and then the energy decreases as the scale increases (i.e., the number of events and the intensity of the probabilities decrease as the scale increases). The comparison of the squared energy of forecast and observation shows an underforecast on all the scales smaller than 3000 km, for both lightning categories. However, the largest scale again shows the tendency toward the overall overforecasting of the lightning probability forecasts (partially due to the artificial large areas of small probabilities forecast by the 5° × 5° latitude–longitude sectors). These two results together suggest poor spatial resolution in the forecast. The squared energy percentages reveal that the intense lightning category on small (large) scales exhibits more (less) energy than the any lightning category. This is due to the fact that the intense lightning category is characterized more by the presence of many small-scale intense events, whereas the any lightning category is characterized more by the presence of large-scale features. The ratio of the squared energy percentages shows that the forecast reproduces the observed-scale structure well on the small scales, up to 400 km. The ratio of the squared energy percentages for the intense lightning category shows a very light smoothing on small scales (except the smallest scale, which is well represented), and an overestimation of large-scale features. The ratio of the squared energy percentages for the any lightning category exhibits a parabolic behavior, associated with a small underforecasting of medium–large-scale features and overforecasting of very large scale features. The smallest scale is again very well represented. This might reflect a good representation of the smallest-scale variability by the statistical model, since the forecast probabilities are evaluated grid point by grid point. However, this might in part be also an artificial effect due to the spatial smoothing (from 20 to 40 km) introduced by the spatial weighting of the observed lightning flashes (see section 2).

Figure 12 shows the Brier score, its percentages, and the BSS on different scales for July 2004. The Brier score and its percentages exhibit a behavior similar to the behavior of the squared energies and their percentages and to the case study analyzed in section 3 (again, small scales exhibit the largest error, and then the error decreases as the scale increases). Such behavior follows from the dependency of the Brier score on the number of events present on each scale. As explained in section 3c, the Brier score, the squared energy, and the correlation are related such that the stronger the dependency on the number of events, the more similar the behaviors of the squared energy and Brier score, and the weaker the correlation. The relation given in Eq. (11) was used to evaluate the forecast–observation correlation on different scales from the scale components of Brier score and squared energies (not shown). The scores obtained revealed that the forecast and observation fields are weakly correlated on small scales, but then, as the scale increases, the correlation increases almost linearly. The BSS shows a systematic positive skill on scales larger than 700 km, and negative skill on scales smaller than 400 km. The transition scale from no skill to positive skill corresponds to the scale of the 5° × 5° latitude–longitude sectors used to construct the regression model. The model produces forecasts that resolve the sector scale and larger scales, but do not resolve the features within the sectors. The development of a new regression model, no longer based on the 5° × 5° latitude–longitude sectors, was recently started (W. Burrows 2006, personal communication). In the new model, all data from all the grid points are pooled together to evaluate the new regression equations. The spatial consistency of the lightning forecasts derives from the predictors, since it is preserved through the statistical process.

Note that the behavior of the statistics for the intense lightning category is qualitatively different from the behavior of the statistics for the any lightning category. For example, the behavior of the energy ratio with respect to the scale for the former category is almost linear, while for the latter category it is parabolic; the BSS for the former category is almost constant on scales up to 400 km, while for the latter category it increases linearly as the scale increases. These different behaviors are related to the different nature of the physical phenomena described by the intense lightning category (small-scale intense events) with respect to the any lightning category (large-scale nonintense features). Applying the scale separation to the verification of phenomena of different intensities enables a comparison of the scale structures of events of different nature, and a deep analysis of their different error-scale structure.

## 5. Conclusions

A new diagnostic verification technique for probabilistic forecasts defined on a spatial domain has been introduced in this work. The technique assesses the forecast quality and skill on different scales. In particular, the technique is capable of

- quantifying error, skill, and bias on different scales;
- establishing the no-skill/skill transition scale;
- measuring reliability and resolution on different scales; and
- verifying the ability of the forecast to reproduce the observed-scale structure.

The scale separation can be related to weather phenomena of different physical nature, such as large-scale frontal systems or small-scale convective events. Therefore, verification on different scales can provide guidance for improving how the forecast reproduces specific physical processes. The scale separation of the Brier score and the squared energy provides new insights and a deeper understanding of the scale structure of the forecast error. Therefore, it increases the diagnostic power of classic scores such as the Brier score, Brier skill score, and their reliability and resolution components. The verification approaches introduced by Briggs and Levine (1997) and Casati et al. (2004) link the verification on different scales with the traditional continuous and categorical verification approaches, respectively. The technique introduced in this paper links the scale-verification approaches with traditional verification scores for probabilistic forecasts.

The verification technique has been illustrated on the CMC lightning probability forecast. This forecast was chosen because of the availability of high spatial and temporal resolution lightning observations, so that verification could be performed on all the scales resolved by the forecast. To verify both modest and intense events, two categories of lightning activity have been assessed: any and intense lightning. The scale separation was performed by a 2D discrete Haar wavelet filter. Wavelets, rather than Fourier transforms, were chosen because they are locally defined, and therefore more suitable for representing discontinuous spatial fields characterized by the presence of few sparse nonzero values, such as the CMC lightning forecasts. Verification results for July 2004 are presented; verification scores for July 2005, August 2004, and August 2005 exhibit similar behaviors.

The squared energy and its percentages on different scales enable the assessment of the bias on different scales and the analysis and comparison of the forecast- and observation-scale structures. Both the forecast and observation are characterized by a large number of small-scale events, and the number of events decreases as the scale increases. Since they are independent of the total energy, the percentages enable a direct comparison of the scale structure of the two lightning categories. As expected, the intense lightning category is characterized by a larger (smaller) number of events on small (large) scales than the any lightning category. The comparison of forecast and observed squared energy shows that the CMC lightning probability forecast exhibits an overall overforecast for both categories. However, scales smaller than 3000 km are underforecast. These results together indicate a lack of spatial resolution in the forecast. The ratio of the squared energy percentages reveals that the observed-scale structure is well represented by the forecast on small scales, up to 400 km, whereas the large scales are dominated by the overforecast of large-scale features.

The Brier score on different scales shows that the CMC lightning probability forecast exhibits the largest error on the smallest scales. As the scale increases, the error decreases. The percentage of the error for intense lightning activity is larger (smaller) on small (large) scales than for modest lightning activity. The error depends on the number of events present in the forecast and observation fields on each scale. Therefore, the Brier score on different scales exhibits a behavior similar to that of the squared energy. The relation between squared energy, Brier score, and correlation on different scales enabled the diagnosis of a weak forecast–observation correlation on small scales, with an almost linear increase in correlation with scale.
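A relation of this kind follows from expanding the square in the Brier score on a single scale. The sketch below is an illustration under stated assumptions, not a restatement of Eq. (11): it assumes that the Brier score component on scale $j$ is the domain average of the squared difference between the forecast and observation mother wavelet components, here written $Y_j$ and $O_j$ (illustrative symbols), that the overbar denotes the domain average, and that $r_j$ is the spatial correlation between $Y_j$ and $O_j$, defined only when both squared energies are nonzero. Because mother wavelet components have zero domain average, the cross term reduces to the correlation times the geometric mean of the two squared energies:

```latex
\overline{(Y_j - O_j)^2}
  = \overline{Y_j^2} + \overline{O_j^2}
  - 2\, r_j \sqrt{\overline{Y_j^2}\;\overline{O_j^2}},
\qquad
r_j = \frac{\overline{Y_j O_j}}{\sqrt{\overline{Y_j^2}\;\overline{O_j^2}}}
```

Under these assumptions, when $r_j$ is close to zero the Brier score component is close to the sum of the two squared energies, which is consistent with the two quantities behaving alike on the weakly correlated small scales.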

The Brier skill score on different scales shows that the CMC lightning probability forecast exhibits positive skill on scales larger than 700 km, and negative skill on scales smaller than 400 km. This indicates that large-scale features, such as fronts, are relatively well forecast. The transition scale from no skill to positive skill corresponds to the scale of the 5° × 5° latitude–longitude sectors used to construct the regression model. The forecast resolves features of the size of the sectors or larger, but does not accurately resolve smaller features within the sectors. The development of a new regression model, no longer dependent on the 5° × 5° latitude–longitude sectors, has been started.

The reliability and resolution components of the Brier score and Brier skill score were evaluated on different spatial scales for a representative case study. The scale decomposition of reliability and resolution helped to diagnose the scale structure of the forecast error: as an example, it was shown that positive skill on large scales is due to the higher resolution on these scales. Note that the mathematical decomposition of the Brier skill score into reliability and resolution components depends on the climatology used as the reference forecast. In this paper, the decomposition is applied to only one case study, and the sample climatology of that single case is used. This was sufficient to diagnose the forecast error (as in an a posteriori process) and to illustrate the capabilities of the technique. When the evaluation of reliability and resolution is carried out on a larger sample, such as monthly data, the reference forecast is more representative of the long-term climatological average. A long-term climatology is a meaningful reference forecast, since it is known a priori, and can provide a different interpretation of the verification results.

The verification method is capable of diagnosing specific forecast errors associated with particular forecast situations. As an example, for the case study analyzed in this paper, the forecast for intense lightning activity exhibits a particular negative skill at the 400-km resolution scale. This is due mainly to the displacement and overforecasting of features on this scale, which shows up well in the scale-partitioned reliability component of the Brier score. Such specific overforecast of the 400-km scale features was also detected by the ratio of the energy percentages.

Verification of probabilistic forecasts on different spatial scales can be performed by using different spatial filters and other verification scores, such as the ranked probability score. In this study the Brier score and squared energy are used for two main reasons. First, these scores are strictly related to the correlation existing between the forecast and observation fields on different scales, and therefore enable a deeper analysis of the forecast–observation agreement. Second, the Brier score and squared energy are defined by quadratic rules. This fact, together with the orthogonality of the discrete wavelet transforms, enables the additivity of the different spatial components, and therefore the definition of percentages associated with each scale. The percentages, even more than the absolute values of the scores on each scale, were shown to provide useful information on the scale structure of the forecast error.
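The additivity argument can be written out as a short sketch. The symbols are assumptions introduced for illustration: $D = Y - O$ is the forecast-minus-observation difference field, $W^m_j$ and $W^f_J$ are the mother and father wavelet components described in the appendix, the overbar is the domain average, and $P_j$ is the percentage associated with scale $j$. The linearity of the decomposition gives $W^m_j(D) = W^m_j(Y) - W^m_j(O)$, and the orthogonality of the scale components removes every cross term when the square is averaged:

```latex
\begin{aligned}
D &= \sum_{j=1}^{J} W^m_j(D) + W^f_J(D), \\
\overline{D^2} &= \sum_{j=1}^{J} \overline{W^m_j(D)^2} + \overline{W^f_J(D)^2}, \\
P_j &= 100 \times \frac{\overline{W^m_j(D)^2}}{\overline{D^2}} .
\end{aligned}
```

The same steps applied to a single field instead of $D$ give the additivity of the squared energy. A score that is not quadratic, or a filter whose components are not orthogonal, would in general leave cross terms, so the scale contributions would no longer sum to the total.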

The technique introduced in this paper can be used to verify any probabilistic forecasts such as those produced by an EPS or statistical models. Moreover, the technique can be used for the verification of very high resolution deterministic forecasts, which can be transformed into probabilities, to account for the space and time uncertainty on scales smaller than the one of interest (e.g., Theis et al. 2005; Roberts and Lean 2007). Finally, the technique enables the comparison of forecasts with different resolutions. It is very well known that high-resolution forecasts, with their intrinsic high variability, tend to achieve lower scores than lower-resolution forecasts, when assessed with standard verification methods (see Nachamkin 2004, and references therein). The Brier skill score decomposition introduced in this work is defined on each scale by a normalization of the Brier score by the scale variance. This enables verification and a fairer comparison of forecasts with different resolutions, on each separate scale.

## Acknowledgments

The authors wish to thank B. Denis, V. Fortin, P. L. Houtekamer, and the three anonymous reviewers for their helpful comments on earlier versions of this paper.

## REFERENCES

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984: *Classification and Regression Trees*. Chapman and Hall/CRC Press, 368 pp.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Briggs, W. M., and R. A. Levine, 1997: Wavelets and field forecast verification. *Mon. Wea. Rev.*, **125**, 1329–1341.

Burrows, W. R., C. Price, and L. J. Wilson, 2005: Warm season lightning probability prediction for Canada and the northern United States. *Wea. Forecasting*, **20**, 971–988.

Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. *Meteor. Appl.*, **11**, 141–154.

Côté, J., S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998: The operational CMC–MRB Global Environmental Multiscale (GEM) model: Part I: Design considerations and formulation. *Mon. Wea. Rev.*, **126**, 1373–1395.

Daubechies, I., 1992: *Ten Lectures on Wavelets*. SIAM, 357 pp.

De Elia, R., R. Laprise, and B. Denis, 2002: Forecasting skill limits of nested, limited-area models: A perfect model approach. *Mon. Wea. Rev.*, **130**, 2006–2023.

Denis, B., R. Laprise, and D. Caya, 2003: Sensitivity of a regional climate model to the resolution of the lateral boundary conditions. *Climate Dyn.*, **20**, 107–126.

Efron, B., and R. J. Tibshirani, 1993: *An Introduction to the Bootstrap*. Chapman and Hall, 436 pp.

Harris, D., E. Foufoula-Georgiou, K. K. Droegemeier, and J. J. Levit, 2001: Multiscale statistical properties of a high-resolution precipitation forecast. *J. Hydrometeor.*, **2**, 406–418.

Jolliffe, I. T., and D. B. Stephenson, 2003: *Forecast Verification: A Practitioner's Guide in Atmospheric Science*. John Wiley and Sons, 240 pp.

Mallat, S. G., 1989: A theory for multiresolution signal decomposition: The wavelet representation. *IEEE Trans. Pattern Anal. Mach. Intell.*, **11**, 674–693.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Nachamkin, J. E., 2004: Mesoscale verification using meteorological composites. *Mon. Wea. Rev.*, **132**, 941–955.

Orville, R. E., G. R. Huffines, W. R. Burrows, R. L. Holle, and K. L. Cummins, 2002: The North American Lightning Detection Network (NALDN)—First results: 1998–2000. *Mon. Wea. Rev.*, **130**, 2098–2109.

Roberts, N. M., and H. W. Lean, 2007: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. *Mon. Wea. Rev.*, in press.

Theis, S. E., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. *Meteor. Appl.*, **12**, 257–268.

Zepeda-Arce, J., E. Foufoula-Georgiou, and K. K. Droegemeier, 2000: Space-time rainfall organization and its role in validating Quantitative Precipitation Forecasts. *J. Geophys. Res.*, **105**, 10129–10146.

## APPENDIX The 2D Discrete Haar Wavelet Filter

Wavelets are real functions characterized by a location and a scale (Daubechies 1992; Mallat 1989). Similar to Fourier transforms, wavelets can be used to represent a function as a sum of components on different spatial scales. Therefore, wavelets can be used to analyze the frequency structure of a signal or the scale structure of a field. Wavelets, rather than Fourier transforms, are used in this study because they are locally defined, and therefore more suitable for representing spatially discontinuous fields characterized by the presence of few sparse nonzero values, such as the CMC lightning probability forecasts.

Different types of wavelets exist. Each wavelet type is defined by a mother and a father wavelet, characterized by different shapes and mathematical properties (e.g., smoothness, symmetry, etc.). When performing a wavelet decomposition, it is often desirable to select a certain wavelet so as to gain from the characteristics of the wavelet itself. As an example, the wavelet chosen can be the one that best correlates with the function to be decomposed, so that the number of wavelet coefficients necessary to describe the decomposed function is minimized. In this study Haar wavelets are used because of their square shape, which best deals with the sharp discontinuities characterizing the lightning fields and so enables a very efficient decomposition. Figure A1 shows the one- and two-dimensional Haar wavelets. Note that the two-dimensional wavelets are generated simply as the Cartesian product of one-dimensional wavelets.

A discrete wavelet family is a set of wavelets of the same type generated from the mother and father wavelets by a deformation and a translation. The deformation characterizes the scale $j$ of the wavelet: it stretches the domain of the wavelet by a factor of $2^j$ and reduces its amplitude by a factor of $2^{-j/2}$ (this is to maintain its $L^2$ norm equal to 1). Therefore, wavelets on the scale $j$ have a domain that is twice as large as the domain of the wavelets on the spatial scale $j - 1$ (i.e., as the scale increases, wavelets are stretched by a factor of 2). The translation determines the location of the wavelet in the domain: wavelets of scale $j$ are translated by a multiple of $2^j$ units. For Haar wavelets this implies that wavelets of the same spatial scale cover the whole domain and their supports do not overlap.

Any finite real function defined on a grid, such as the lightning field $X$, can be expressed as a linear combination of discrete wavelets of the same family. The field $X$ is thus expressed as a sum of components on different spatial scales. Note that discrete wavelets, and the spatial-scale components obtained from a discrete wavelet decomposition, are orthogonal (Mallat 1989). This implies that the integral (and the average) over the spatial domain of the product of two different wavelet spatial-scale components is 0. Note that the integral (and the average) over the spatial domain of the mother wavelets and their spatial-scale components is also 0 (Mallat 1989). Finally, note that the discrete wavelet decomposition is a linear operator (Mallat 1989), that is, the wavelet decomposition of a linear combination of functions (or fields) is the linear combination of the wavelet decomposition of each function (field).

Discrete wavelet transforms, like Fourier transforms, can be explained by using the theory of functional analysis (e.g., Mallat 1989). However, the 2D discrete Haar wavelet filter can also be explained by an algorithm based on spatial averaging over $2^j \times 2^j$ gridpoint domains. Figure A2 shows a one-dimensional theoretical example of the Haar wavelet filter using this latter approach. In the remainder of this appendix we explain the two-dimensional Haar wavelet filter with this approach.

The 2D Haar wavelet filter is applied to a spatial field $X$ defined over a spatial domain of $2^J \times 2^J$ grid points. The Haar wavelet filter at its first step decomposes the spatial field $X$ into the sum of a coarser *mean* field (the first father wavelet component) and a detail *variation-about-the-mean* field (the first mother wavelet component). The father wavelet component is obtained from the spatial field $X$ by a spatial averaging over $2 \times 2$ grid points. The mother wavelet component is obtained as the difference between the spatial field $X$ and the father wavelet component.

At its second step the Haar wavelet filter decomposes the father wavelet component obtained from the first step into the sum of a coarser *mean* field (the second father wavelet component) and a detail *variation-about-the-mean* field (the second mother wavelet component). The second father wavelet component is obtained from the spatial field $X$ by a spatial averaging over $4 \times 4$ grid points. The second mother wavelet component is obtained as the difference between the first father wavelet component and the second father wavelet component.

The process is recursive and at each step the Haar wavelet filter decomposes the father wavelet component obtained from the $(j-1)$th step into the sum of a coarser *mean* field (the $j$th father wavelet component) and a detail *variation-about-the-mean* field (the $j$th mother wavelet component). The $j$th father wavelet component is obtained from the spatial field $X$ by a spatial averaging over $2^j \times 2^j$ grid points. The $j$th mother wavelet component is obtained as the difference between the $(j-1)$th and the $j$th father wavelet components.
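The recursive averaging described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code used in the study: the function names `block_mean` and `haar_filter_2d` are hypothetical, the field is assumed to be a square array with a power-of-2 side, and each component is kept on the full grid so that components can be added and multiplied point by point.

```python
import numpy as np


def block_mean(field, size):
    """Replace each size x size block of `field` by its average (same shape out)."""
    rows, cols = field.shape
    blocks = field.reshape(rows // size, size, cols // size, size)
    means = blocks.mean(axis=(1, 3))
    return np.repeat(np.repeat(means, size, axis=0), size, axis=1)


def haar_filter_2d(field):
    """Return (mother components for j = 1..J, J-th father component).

    Assumes a square field whose side is 2**J grid points.
    """
    field = np.asarray(field, dtype=float)
    if field.ndim != 2 or field.shape[0] != field.shape[1]:
        raise ValueError("field must be a square 2D array")
    n = field.shape[0]
    if n < 1 or n & (n - 1):
        raise ValueError("side length must be a power of 2")
    n_scales = n.bit_length() - 1  # J, since n == 2**J

    mothers = []
    father_prev = field  # the field itself plays the role of scale 0
    for j in range(1, n_scales + 1):
        # j-th father component: averages of the field over 2^j x 2^j blocks
        father_j = block_mean(field, 2 ** j)
        # j-th mother component: (j-1)-th father minus j-th father
        mothers.append(father_prev - father_j)
        father_prev = father_j
    return mothers, father_prev


# Toy example: a sparse binary field, loosely mimicking a lightning mask.
rng = np.random.default_rng(0)
toy_field = (rng.random((8, 8)) < 0.2).astype(float)
mothers, father = haar_filter_2d(toy_field)
energy_by_scale = [float(np.mean(m ** 2)) for m in mothers]
```

Because each mother component is the difference of two consecutive father components, the sum of all mother components and the last father component telescopes back to the original field. Averaging the field directly over $2^j \times 2^j$ blocks gives the same result as averaging the previous father component over $2 \times 2$ groups of its blocks, since all blocks have equal size.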

The process stops when the father wavelet component on the largest scale ($j = J$) is found. The spatial field $X$ is decomposed into the sum of the mother wavelet components on the spatial scales $j = 1, \ldots, J$ and the $J$th father wavelet component:

$$X = \sum_{j=1}^{J} W^m_j(X) + W^f_J(X).$$

The mother wavelet components $W^m_j(X)$ on the scale $j$ have resolution equal to $2^{j-1}$ grid points and the father wavelet component $W^f_J(X)$ on the largest scale $J$ has resolution equal to $2^J$ grid points.

The lightning forecasts verified in this study are on a polar stereographic grid of 295 × 183 grid points. The regression model was not developed over the entire domain; therefore, some of the more external grid points have missing values. The domain considered for verification purposes is a rectangular subdomain of 256 × 128 grid points embedded in the 295 × 183 gridpoint domain (Fig. A3). The verification domain was centered on the original domain in order to minimize the number of grid points with missing values. The dimensions of the verification domain were chosen to be powers of 2 in order to have a dyadic domain, which is appropriate for performing the 2D discrete wavelet transform. Note that the rectangular verification subdomain is the union of two square subdomains of $2^7 \times 2^7$ grid points. The wavelet decomposition is performed on the east and on the west square subdomains. Then, the union of the east and west wavelet components on each scale is considered. When performing the wavelet decomposition, the missing values within the rectangular verification subdomain are assigned the average of the nonmissing values of either the west or the east subdomain, depending on which of these square subdomains they belong to. Note that this value is the largest-scale father wavelet component value evaluated on the nonmissing values.
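The domain preparation described above can be sketched as follows. This is an illustration under several assumptions that the text does not specify: missing values are encoded as NaN, the array is stored as (rows, columns) with the column index increasing eastward, and the centered subdomain is located by integer division of the margins. The function names `center_crop` and `split_and_fill` are hypothetical.

```python
import numpy as np


def center_crop(field, rows, cols):
    """Extract a rows x cols subdomain centered on `field` (offsets rounded down)."""
    r0 = (field.shape[0] - rows) // 2
    c0 = (field.shape[1] - cols) // 2
    return field[r0:r0 + rows, c0:c0 + cols]


def split_and_fill(field):
    """Split an (n, 2n) field into west and east n x n squares.

    Missing values (assumed to be NaN) in each square are replaced by the
    average of that square's nonmissing values.
    """
    field = np.asarray(field, dtype=float)
    n, cols = field.shape
    if cols != 2 * n:
        raise ValueError("field must have twice as many columns as rows")
    west = field[:, :n].copy()
    east = field[:, n:].copy()
    for square in (west, east):
        missing = np.isnan(square)
        if missing.all():
            raise ValueError("a square subdomain has no nonmissing values")
        square[missing] = square[~missing].mean()
    return west, east


# Toy grid with the dimensions quoted in the text (183 rows, 295 columns).
full_grid = np.zeros((183, 295))
full_grid[:30, :] = np.nan  # pretend one edge of the grid is missing
verification_domain = center_crop(full_grid, 128, 256)
west, east = split_and_fill(verification_domain)
```

Filling with the subdomain mean leaves the average of each square unchanged, so the largest-scale father wavelet component of the filled square equals the average of its nonmissing values, as noted in the text. Each square can then be decomposed separately and the components on each scale joined side by side.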

Brier score and its reliability, resolution, and uncertainty components, and BSS and its reliability and resolution components, for the case study illustrated in Fig. 2.