## 1. Introduction

It is now evident that forecasts with a spatial structure should be verified in a manner that accounts for that structure. Such methods are generally referred to as “object oriented” because they acknowledge the existence of objects in both the forecast and observed fields, and attempt to quantify the quality of the former in terms of various error components, including displacement, size, and intensity. The landmark papers on this topic include the following: Baldwin et al. (2001, 2002), Brown et al. (2002, 2004), Bullock et al. (2004), Casati et al. (2004), Chapman et al. (2004), Davis et al. (2006a,b), Du and Mullen (2000), Ebert and McBride (2000), Hoffman et al. (1995), Marzban and Sandgathe (2006, 2008), Nachamkin (2004), and Venugopal et al. (2005).

In a sequence of two papers, Marzban and Sandgathe (2006, 2008) have demonstrated the utility of cluster analysis in identifying/defining the objects in one or both fields (observed and forecast). Cluster analysis (Everitt 1980) refers to a set of statistical techniques designed to identify structures in data. The generality of the methodology allows for the objects to be not only two-dimensional (as in a gridded field), but also multidimensional entities that include spatial information as well as other dimensions, including the intensity of the field, or the time at which it is recorded. As such, the verification procedure based on cluster analysis has three desirable features: It 1) is object oriented, 2) allows for a multitude of quantities to be included in the definition and identification of an object, and 3) is fully automated, allowing for nonsubjective verification of many forecasts.

Marzban and Sandgathe (2006) proposed to perform cluster analysis on a forecast field and an observation field, separately. The two fields are then compared in terms of the best pairing of clusters within them. This approach allows one to compute any measure of error between the two fields in an object-oriented fashion. An alternative was proposed in Marzban and Sandgathe (2008), where the clustering is performed on the combined set of forecasts and observations. This approach was named combinative cluster analysis (CCA). In CCA, one identifies clusters in the two fields, simultaneously. A single cluster in the combined set can be considered a “hit,” if it consists of comparable proportions of forecast and observed grid points. Otherwise, it amounts to a “miss” or a “false alarm.” In this way, it is possible to produce a contingency table reflecting the quality of a single forecast field. One may then summarize the contingency table by a scalar measure, for example, the critical success index (CSI), to assess the overall quality of a single forecast field.
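For concreteness, the contingency-table summary can be sketched as follows (the `csi` helper is illustrative only, not the authors' code):

```python
# Critical success index from object counts: CSI = H / (H + M + F),
# where H, M, and F are the numbers of clusters classified as hits,
# misses, and false alarms, respectively.
def csi(hits, misses, false_alarms):
    total = hits + misses + false_alarms
    return hits / total if total else 0.0

print(csi(7, 2, 1))  # -> 0.7
```

A single scalar of this kind makes it possible to compare many forecast fields automatically, which is the point of the methodology.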

Both approaches rely on an iterative/hierarchical variant of cluster analysis, wherein the number of clusters in a field (any field) is varied systematically. On the one hand, one may begin with a single cluster containing the entire dataset, and then proceed to break it up into ever smaller clusters, ending with as many clusters as cases.^{1} Alternatively, the procedure may begin with as many clusters as cases, and proceed to combine them into ever larger clusters, ending with a single cluster containing all the data. Either way, the number of clusters is varied iteratively. The latter version is called hierarchical agglomerative cluster analysis (HAC); it is the one underlying both approaches examined by Marzban and Sandgathe (2006, 2008) and it is also used here.
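The agglomerative procedure can be illustrated with off-the-shelf tools; in this sketch, scipy's `linkage` and `fcluster` stand in for the implementation actually used, and the points are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = rng.random((50, 2))  # 50 synthetic grid points in (x, y)

# Agglomerative clustering: begin with 50 singleton clusters and
# merge them pairwise until a single cluster remains.
Z = linkage(points, method="average")

# The hierarchy can then be cut at any desired number of clusters (NC).
for nc in (1, 5, 25, 50):
    labels = fcluster(Z, t=nc, criterion="maxclust")
    print(nc, len(np.unique(labels)))
```

Note that a single hierarchy can be cut at every value of NC, which is what makes a multiscale analysis cheap once the dendrogram is built.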

As mentioned previously, it is important to emphasize that cluster analysis is not constrained to two dimensions, or even to spatial variables. The basic variables of the method can be any set of quantities defined to characterize a cluster/object. In this article, *p* denotes the dimension of the space in which the cluster analysis is performed. And in a verification scheme, it is desirable for the variables to include information about spatial location. Here, if the analysis is done on *p* = 2 spatial coordinates only, it is referred to as an (*x*, *y*) analysis. In addition to (*x*, *y*) analysis, Marzban and Sandgathe also examine some *p* = 3 dimensional examples: an (*x*, *y*, log(precipitation)) analysis (Marzban and Sandgathe 2006) and an (*x*, *y*, reflectivity) analysis (Marzban and Sandgathe 2008).

When cluster analysis is employed for verification, one of the virtues of the hierarchical approach is that it allows one to examine the quality of a forecast field on different spatial scales, that is, for different numbers of clusters. It is worth noting that within an object-oriented framework, the number of clusters in a field is a more appropriate notion of scale than one based on a physical notion of distance. This is so because different clusters/objects in a given field (forecast or observed) may be wildly different in terms of their size. Furthermore, as mentioned above, cluster analysis may be performed in a space that includes nonspatial variables (e.g., reflectivity). In such cases, it would simply make no sense to employ a quantity based on length alone to explore different spatial scales. As such, the number of clusters (NC) is a generalized notion of scale, beyond spatial scale. In short, in hierarchical clustering (agglomerative or not), one can assess performance as a function of the number of clusters in a field, with the latter (inversely) related to scale. Therefore, henceforth, “scale” refers to the number of clusters, NC.

In contrast to hierarchical clustering, there exists a class of clustering methods generally called *k* means (Everitt 1980). The main advantage of *k* means is speed. This approach is generally much faster than a hierarchical method, because it assumes that the data consist of precisely *k* clusters and coerces the cases to fall into those clusters. Although its general disadvantage is that it can be somewhat sensitive to the initial choice of the *k* clusters, this is generally not a concern in large datasets. In the verification scheme, *k* means does have another disadvantage: it does not allow an exploration of different scales, because the number of clusters *k* is fixed. The idea of running multiple *k*-means clusterings, with *k* itself varying from *N*, the sample size, down to 1, turns out to be infeasible. However, *k* means can serve as an initial clustering method, to reduce the number of clusters from *N* to some reasonable number (say 100), before hierarchical clustering is employed to perform the remainder of the clustering from 100 clusters down to 1. Indeed, this is one of the steps taken in this work to expedite CCA. Other steps for expediting CCA are described below. The relatively slow performance of CCA allowed for the verification of only a handful of forecast days in Marzban and Sandgathe (2006, 2008). A fast CCA opens the possibility of applying the methodology to a large number of forecasts, and thereby comparing different forecast models in an objective and statistically reliable way.
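The two-stage idea can be sketched as follows; scipy's `kmeans2` and `linkage` are stand-ins for the actual implementation, and, for simplicity, the hierarchical stage here operates on the 100 cluster centroids rather than on the transposed vectors described later in the text:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
pts = rng.random((5000, 2))  # N = 5000 synthetic (x, y) points

# Stage 1: k-means quickly reduces the N points to k = 100 clusters.
centroids, labels = kmeans2(pts, 100, minit="++", seed=1)

# Stage 2: hierarchical clustering continues from 100 clusters down to 1.
Z = linkage(centroids, method="average")
coarse = fcluster(Z, t=10, criterion="maxclust")  # e.g., NC = 10
print(len(np.unique(coarse)))
```

The expensive all-pairs work of the hierarchical stage is thus done on 100 cases rather than 5000.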

The main goal of the current work is to apply CCA to the verification of reflectivity forecasts for 32 days, from two high-resolution versions of the Weather Research and Forecast model (WRF), and the National Oceanic and Atmospheric Administration (NOAA) Mesoscale Model (NMM). To that end, several methodological revisions to CCA are introduced and described. The three models are compared in terms of their CSI values as computed by the revised CCA in (*x*, *y*). The results are compared with those based on an expert’s assessment of the forecasts. An attempt is also made to assess errors in the timing of events. Additionally, a *p* = 3 dimensional analysis is performed, and some technical issues in that work are addressed.

## 2. The data and method

The dataset consists of pairs of 32 days of observations and 24-h forecasts of reflectivity exceeding 20 dB*Z*. This corresponds to the forecast of light to heavy precipitation. The 32 days span the dates 19 April–4 June 2005, and the grid spacing is 4.7625 km. Figure 1 displays the observations and the 24-h forecasts according to the University of Oklahoma 2-km resolution WRF (arw2), the National Center for Atmospheric Research (NCAR) 4-km resolution WRF (arw4), and the National Weather Service 4-km NMM (nmm4), for 13 May 2005. This forecast is one of the 32 forecasts that will be examined in this paper. The coordinates of the four corners of the region are 30°N, 70°W; 27°N, 93°W; 48°N, 67°W; and 44°N, 101°W, covering the United States, east of the Mississippi. The data come from the 2005 National Severe Storms Laboratory/Storm Prediction Center (NSSL/SPC) Spring Experiment, described by Baldwin and Elmore (2005) and Kain et al. (2005, 2006).

Although CCA is thoroughly discussed in Marzban and Sandgathe (2008), it is also reviewed here briefly. CCA amounts to performing HAC on the combined set of a forecast and observation field. CCA has several parameters, one of which is referred to as the “hit threshold.” This parameter determines whether a given cluster should be considered a hit, a miss, or a false alarm. The hit threshold pertains to the proportion of grid points in the cluster that belong to the observed field. For example, a hit threshold of 0.1 means that if fewer than 10% of a cluster’s grid points belong to the observed field, the cluster will be classified as a false alarm. Also, if fewer than 10% of the cluster consists of forecast points, then it is classified as a miss. Otherwise, the cluster is identified as a hit. A threshold of 0.5 corresponds to the most stringent requirement for a hit, requiring exactly an equal number of observation and forecast points. This would be a “perfect” forecast if one ignored displacement and distortion. Although different hit thresholds affect the overall evaluation of the forecasts, in the context of our model comparison (i.e., the goal of this work), the choice of the hit threshold is not critical. Several different thresholds were examined and it was determined that, within a reasonable range of values (0.05–0.3), the choice of the threshold did not alter the assessed relative performance of the three models. For that reason, the bulk of the analysis here is performed at a hit threshold of 0.1, which is easily justified physically.
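The classification rule can be sketched as follows (a hypothetical helper operating on the observed and forecast point counts of a single mixed cluster):

```python
# Classify one cluster by the hit threshold on the proportion of
# observed points (illustrative helper, not the authors' code).
def classify(n_obs, n_fcst, hit_threshold=0.1):
    total = n_obs + n_fcst
    if n_obs / total < hit_threshold:
        return "false alarm"   # cluster is nearly all forecast points
    if n_fcst / total < hit_threshold:
        return "miss"          # cluster is nearly all observed points
    return "hit"               # comparable proportions of each

print(classify(2, 98))   # -> 'false alarm'
print(classify(98, 2))   # -> 'miss'
print(classify(40, 60))  # -> 'hit'
```

Counting hits, misses, and false alarms over all clusters then yields the contingency table from which CSI is computed.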

As mentioned previously, the typical HAC approach has two drawbacks: It is prohibitively slow for efficient processing of multiple, large fields, as well as needlessly exhaustive in the number of clusters produced by the procedure. The revisions to CCA discussed here involve combining multiple clustering techniques, and sampling approaches, to produce large improvements in computational efficiency. An algorithmic sketch of this methodology is as follows:

1. Only grid points whose reflectivities exceed 20 dB*Z* are kept for analysis. In terms of the implied precipitation, this means that the verification is performed on light to heavy precipitation.
2. The clustering is performed on *p* = 2 spatial coordinates (*x*, *y*). As such, the comparison of the three models is done in terms of the quality of their spatial content. A *p* = 3 example is also performed for comparison; the relative weight of the three coordinates, that is, the metric, is discussed.
3. The two coordinates (*x*, *y*) are standardized by subtracting the respective mean and dividing by the pooled (across observation and forecast) standard deviation.^{2} In the *p* = 3 dimensional analysis, a transformation is applied in order to assure that all three coordinates are on the same footing. This issue is further addressed in section 6.
4. A *k*-means clustering is performed on the combined dataset, at a specified number of clusters, *k* = 100. This step clusters the data into 100 clusters much faster than hierarchical clustering can. Although the 100 resulting clusters are somewhat sensitive to the choice of the initial clusters, the differences are thought to be unimportant, because the 100 clusters will be further clustered by the hierarchical approach (step 7, below).
5. The procedure is further expedited by performing the analysis on a sample of size *n* taken (with replacement) from each of the *k* clusters.
6. A final step is taken in order to improve the computational efficiency. Some details of this step are presented in the appendix. Briefly, instead of performing CCA on *N* *p*-dimensional vectors, it is performed on *k* (*n* × *p*)-dimensional vectors.
7. CCA is performed, and CSI curves (Marzban and Sandgathe 2008) are produced. These are “curves” that display CSI values as a function of NC, the number of clusters in a field.
8. Steps 5–7 are repeated many times (101) to assess the influence of sampling on CSI curves. This type of resampling is often called bootstrapping in statistics. Only the average (over the 101 samples) of the CSI curves is computed for comparing the forecasts of the three models.
9. To assess timing errors, the entire procedure is applied to observations and forecasts with a time lag between them. The time lag values examined here are −3 to +3 h.^{3} The introduction of a time lag calls for a generalization of the CSI curves to CSI surfaces (i.e., CSI as a function of NC and time lag).
10. All of the ensuing results are compared with a human expert’s assessment of the quality of the forecasts. This is not an exact science, as only one trained forecaster is considered. Furthermore, the forecasts are very complex fields, and so the expert’s assessments are only qualitative.
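The standardization step can be sketched as follows, under one plausible reading: each coordinate is centered and then scaled by the standard deviation pooled over the combined observation and forecast points (the arrays below are made up):

```python
import numpy as np

def standardize(obs_xy, fcst_xy):
    """Center each coordinate and scale by the pooled standard deviation."""
    pooled = np.vstack([obs_xy, fcst_xy])
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0)
    return (obs_xy - mean) / std, (fcst_xy - mean) / std

obs = np.array([[0.0, 0.0], [2.0, 2.0]])
fcst = np.array([[1.0, 1.0], [3.0, 3.0]])
o, f = standardize(obs, fcst)
print(np.vstack([o, f]).mean(axis=0))
```

Using a single pooled scale for both fields keeps a displacement between forecast and observed objects from being absorbed into the scaling itself.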

Elaboration of some of the above steps follows:

Observation and forecast fields consist of a 501 × 601 grid of reflectivity values. This is the common grid for the forecast and observation fields. For the data at hand, there are approximately 300 000 points (i.e., grid points) with nonzero (in dB*Z*) reflectivity in each field. That number is reduced to about 10 000 for reflectivity exceeding 20 dB*Z*.
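The thresholding that produces these point counts can be sketched as follows (with a synthetic field standing in for the Spring Experiment data):

```python
import numpy as np

rng = np.random.default_rng(2)
field = rng.uniform(0.0, 60.0, size=(501, 601))  # synthetic reflectivity (dBZ)

# Keep only grid points exceeding 20 dBZ, as (x, y) pairs for clustering.
ys, xs = np.nonzero(field > 20.0)
points = np.column_stack([xs, ys]).astype(float)
print(points.shape)
```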

In practice, there is little interest in every possible cluster number, especially when the number of clusters is comparable to the number of points in the fields. Therefore, *k*-means clustering with *k* ∼ 100 serves as a reasonable and efficient starting point for clustering to fewer clusters. The *k*-means approach does not produce the same clusters as a hierarchical technique, but it can be thought of as a technique with which to dynamically reduce the resolution of the field based on objects (as opposed to nearby points).

Step 6, above, alludes to a transformation of the data for the purpose of improving the computational efficiency. Although computational efficiency is one of the reasons for the transformation, the primary reason is more technical. The *k*-means clustering produces cluster assignments for each point for the *k* clusters. However, HAC does not allow clustering to begin at some arbitrary initial clustering, because it requires a dissimilarity metric for every pair of points (i.e., a matrix of distances between every pair). As such, to take advantage of the cluster initialization from *k* means, *n* points are selected from each cluster and are “transposed” to an (*n* × *p*)-dimensional vector. Thus, instead of performing HAC on *N* *p*-dimensional vectors, this transposition allows one to perform HAC on *k* (*n* × *p*)-dimensional vectors. The transposition is described further in the appendix. Here, *n* = 25 points are sampled randomly (with replacement) from each of the *k* = 100 clusters.

For large *N*, performing HAC on *N* *p*-dimensional vectors would be computationally infeasible; but the transposition yields *k* ≪ *N* cases, with each case being a vector of dimension *n* × *p*. It turns out that the computational efficiency of HAC is not particularly sensitive to the dimension of the points over which clustering is performed, because the dissimilarity between pairs of points is computed just once, and the bulk of the processing typically occurs in the repeated search for the next optimal combination of clusters.
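The transposition can be sketched as follows (synthetic data, with deterministic stand-in labels in place of the k-means assignment; how the sampled points are ordered within each flattened vector is a detail left to the appendix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
pts = rng.random((2000, 2))                      # N = 2000 points, p = 2
labels = np.arange(2000) % 100                   # stand-in k-means labels

n, k = 25, 100
rows = []
for c in range(k):
    members = pts[labels == c]
    idx = rng.integers(0, len(members), size=n)  # sample with replacement
    rows.append(members[idx].ravel())            # flatten to an n*p vector
X = np.vstack(rows)                              # k cases of dimension n*p
print(X.shape)                                   # -> (100, 50)

Z = linkage(X, method="average")                 # HAC on k cases, not N
```

The dissimilarity matrix for HAC is thus 100 × 100 rather than 2000 × 2000, which is where the computational saving comes from.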

## 3. Results

Armed with CSI values for the three models over 32 days, and over a range of scales, a number of comparisons can be made. The most relevant ones can be divided into two classes: One class where the models are compared in terms of their actual CSI values, and another where the models are ranked in terms of their CSI and then compared based on their rankings. The conclusions from the two classes are somewhat different, because they address different facets of model performance.

### a. Comparisons of CSI values

Figure 2 shows the mean (over 101 samples) CSI curves for all 32 days, for the three models: arw2, arw4, and nmm4, colored black, red, and blue, respectively. Let us begin with a discussion of the CSI curves for a specific date: 13 May, whose fields are shown in Fig. 1.^{4} Not surprisingly, there is a significant amount of overlap between some of the models and for some values of NC. In other words, on this particular day the models appear to perform comparably. However, one may note that nmm4 does appear to produce systematically low CSI values across the full range of NC values, suggesting that it is the worst of the three. On larger scales (10 < NC < 40), arw2 (black) appears to be the best of the three, while on smaller scales (60 < NC < 100) that status is occupied by arw4 (red). However, these conclusions should be qualified: First, they pertain to the forecasts on a single day, and second, they ignore sampling variations. The first limitation is overcome here by examining such CSI curves for forecasts made on 32 days. As for sampling variations, it turns out that a larger portion of them can be attributed to between-day variation than to within-day variation. Therefore, to decide which of the models (if any) is best, one must examine all 32 days.

To get a sense of the numerical scale of these CSI values, it is beneficial to compute them for a “random” forecast. However, it is important to assure that the random field has a spatial structure similar to the actual forecasts. In other words, a random field consisting of white noise would be inappropriate. The number of points in the random forecast field must also be comparable to that of actual forecast fields. Such a random field is shown in Fig. 1.^{5}

Note that the field does visually resemble real forecasts. For each of the 32 forecasts, a similar random field is produced. The corresponding CSI curves are shown as the green, dashed lines in Fig. 2. Clearly, the CSI curves for the real forecasts are generally higher than those of a random forecast, certainly for cluster numbers exceeding 10 or so. For smaller cluster numbers (i.e., on larger scales), real forecasts and the random forecast are indistinguishable in terms of the resulting CSI curves.^{6}

One can now address the question of how the three models perform over all 32 days. The sampling variations (over the 101 samples) are not shown for clarity but are implicitly taken into account in the following observations. There is only 1 day for which the CSI curves for the three models have little overlap across the full range of scales, namely 27 April. In that case, the ranking of the models, in order of decreasing performance, is arw2, arw4, and nmm4. This ranking, although physically arguable, is clearly not generally true. On most days (22), arw2 and arw4 are comparable across the full range of scales, as expected for nearly identical models executing at only slightly different resolution; meanwhile, nmm4 is equally likely to be better or worse. On a few days the models actually switch rank at some scale. For example, on 3 May, nmm4 outperforms the other two for cluster numbers below 30, but on smaller scales immediately above NC = 30, arw2 appears to be the best of the three. Finally, on some days (e.g., 27 May) the models are not only comparable to each other but also comparable to a random forecast.

Although the CSI curves in Fig. 2 speak more to the complex relationship between performance and spatial scale, it is possible to summarize the results in a way that is conducive to a comparison of the three models “averaged” over the 32 days. One such way is to compare the three models two at a time. Figure 3 shows the box plot of the difference between the CSI curves of the models in a pairwise fashion. Each panel, therefore, compares two of the models.^{7}

Evidently, all of the box plots straddle the horizontal axis at 0. In other words, there does not appear to be a strong, statistically significant difference between the three models. A more rigorous comparison could be performed using a multivariate *t* test, but some of the assumptions of that test are violated here. For that reason, only graphical means, such as the box plots in Fig. 3, are examined in this study.

If one were to relax the requirement of statistical significance, the box plots in Fig. 3 suggest some tentative conclusions. The conclusions are better organized if they pertain to three distinct scales, broadly defined as NC < 20, 20 < NC < 60, and NC > 60. Consider the middle range first: The left panel in Fig. 3 implies that the CSI values for arw2 are generally higher than those of arw4. In fact, these differences appear to be statistically significant, in the sense that zero (i.e., the horizontal line) is just outside of the interquartile range of the box plots. The middle panel in Fig. 3 suggests that arw4 is generally worse than nmm4, while the right panel implies that arw2 is generally better than nmm4. In short, across the 32 days, the CSI values of the three models, for NC values between 20 and 60, can be ordered as CSI(arw2) > CSI(nmm4) > CSI(arw4).

For NC < 20, a similar analysis of Fig. 3 implies that arw2 and arw4 are comparable, and both are superior to nmm4; that is, CSI(arw2) ∼ CSI(arw4) > CSI(nmm4). And for NC > 60, arw2 and nmm4 are comparable, with both superior to arw4; that is, CSI(arw2) ∼ CSI(nmm4) > CSI(arw4).

In short, on larger scales (NC < 20), arw2 and arw4 are comparable, with both being superior to nmm4. On midrange scales (20 < NC < 60), arw2 outperforms nmm4, which in turn is superior to arw4. Finally, on smaller scales (NC > 60), arw2 and nmm4 are comparable, with both being better than arw4.

One can make an educated guess at the meaning of these results, although the model developers themselves will likely have a better interpretation. Starting with arw2 and arw4, essentially the same model at different resolutions, the higher resolution of arw2 is less important in comparison to arw4 at larger scales; therefore, they perform similarly. At smaller scales, however, higher resolution has a greater impact on model skill, allowing arw2 to score higher, that is, resolve and predict smaller features more reliably than arw4. When considering arw (2 or 4) versus nmm4, the results reveal the effects of both resolution and different model numerics and physics. The CSI data indicate that the model formulation in nmm4 is better at smaller scales than in arw; arw2 (higher resolution than nmm4) performs comparably to nmm4, and arw4 (same resolution) performs worse than nmm4. However, at larger scales, the arw model seems to have the edge over nmm4 (i.e., at NC ∼ 0, arw4 is better than nmm4 even though both have the same 4-km resolution).

Of course, many caveats apply to these conclusions: The entire dataset is restricted to only 32 days, in one spring, of a single year over a region representing predominately the Midwest, from the Gulf of Mexico to the Canadian border. Moreover, both the arw and nmm models have undergone many upgrades since the spring 2005 experiment. What is being demonstrated here is the ability to derive meaningful conclusions from an automated analysis of highly complex, very high-resolution, weather predictions.

### b. Comparisons of CSI-based ranks

The above tests compare the three models in terms of their CSI values. It is also possible to compare them in terms of their rank: 1, 2, or 3. The following lead-in questions set the stage for the analysis. In how many of the 32 days does CSI (at, say, NC = 20) suggest that arw2 is the best (i.e., rank = 1) of the three models? In how many does CSI suggest that arw2 has rank = 2 or rank = 3? Similarly, what are the answers to these questions if they pertain to arw4 and nmm4? The results can be tabulated as a contingency table, with the rows representing the three models, and the columns denoting the rank. For NC = 20 and 60, the results are shown in Table 1. The choice of NC = 20 and 60 is based partly on meteorological considerations and, so, is explained in section 5. Note that in rankings where a “tie” occurs, the models are given the same score; that is, if there is a tie for first place, there will be two ranks of 1 and one rank of 2, and if there is a tie for second place, there will be one rank of 1 and two ranks of 2; hence, there is a preponderance of rank = 2 and significantly fewer rank = 3 cases.
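The tie handling just described amounts to a dense ranking, sketched here with made-up CSI values:

```python
# Dense ranking: tied models share a rank, and the next-best model
# receives the next integer (illustrative helper, not the authors' code).
def rank_models(csi_by_model):
    distinct = sorted(set(csi_by_model.values()), reverse=True)
    return {m: distinct.index(v) + 1 for m, v in csi_by_model.items()}

print(rank_models({"arw2": 0.4, "arw4": 0.4, "nmm4": 0.3}))
# -> {'arw2': 1, 'arw4': 1, 'nmm4': 2}
```

A tie for first thus produces two ranks of 1 and one rank of 2, consistent with the convention above.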

Given that these contingency tables address the association between the choice of the model and a CSI-based rank, one can perform a chi-squared test of that association. For the NC = 20 and NC = 60 contingency tables, the p values of the test are 0.03 and 0.01, respectively. Therefore, at a significance level of 0.05, both of these p values are significant, implying that there is statistically significant association between the choice of the model and a CSI-based rank. In short, a knowledge of CSI can generally help one to predict the rank of a model, and vice versa. Said differently, CSI curves can distinguish between the ranking of the models, at a statistically significant level.^{8}
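The test itself is standard; a sketch with invented counts (not those of Table 1), using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = models (arw2, arw4, nmm4); columns = rank 1, 2, 3.
# The counts below are made up for illustration only.
table = np.array([[15, 14, 3],
                  [ 6, 22, 4],
                  [ 9, 17, 6]])
chi2, p, dof, expected = chi2_contingency(table)
print(dof)  # -> 4 degrees of freedom for a 3 x 3 table
```

A p value below the chosen significance level would indicate an association between the choice of model and its CSI-based rank.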

Since the associations are statistically significant, one may further diagnose these tables.^{9} For example, it is clear that arw2 rarely obtains rank = 3, for either NC = 20 or 60; arw2 is also approximately equally likely to obtain a rank of 1 or 2. Meanwhile, arw4 obtains a rank of 2 on the majority of days, again for both NC values, while nmm4 follows a slightly different pattern: at larger scales (NC = 20), it ranks second most frequently, but on smaller scales (NC = 60), its rank is a tie between first and second.

To summarize these observations, for NC = 20, both arw4 and nmm4 rank second, but arw2 ranks equally between 1 and 2. In this sense, one may conclude that arw2 is the better of the three at larger scales, as noted previously. On smaller scales (NC = 60), arw2 can still be considered the best of the three models, of course aided by its higher resolution; however, nmm4 follows closely, implying that a 2-km resolution version of nmm may score better than arw2.

## 4. Timing error

By virtue of clustering in (*x*, *y*) space, the above analysis implicitly takes into account the spatial error of the forecasts. To account for errors in the magnitude of the reflectivity as well, one may perform cluster analysis in (*x*, *y*, *z*) space, where *z* = reflectivity; see section 6. But what about timing errors? Although one can set up a frame for clustering in (*x*, *y*, *t*), or even in (*x*, *y*, *z*, *t*), it is more instructive to perform a series of (*x*, *y*) analyses for observations and forecasts that differ in their valid times. Here, for each observation day, in addition to a comparison with the “default” forecast for that hour, 12 additional forecasts are also examined at hourly lags from −6 to +6 h. Again, specializing to a given day, the observations and the forecasts for the three models are shown in Fig. 4, but only for hourly lags from −3 to +3 h (from bottom to top) to economize on the number of figures.

Visually detecting a time lag in these very complex weather patterns is difficult; however, a visual inspection of the forecasts in Fig. 4 suggests that the nmm4 forecast is noticeably different, at least for values above 20 dB*Z*, from arw2 and arw4. Looking specifically at the development in nmm4, some of the hourly predictions appear to match better with a previous observation (∼−3 h) than the verifying observation (e.g., the 28-h nmm4 with the 1301 or 1302 UTC observations, or the 25-h nmm4 with the 1322 or 1323 UTC observations).

This type of timing error can be quantified in the current verification scheme. For each of the 32 days, it is possible to produce a plot of CSI as a function of the number of clusters as well as the time lag. Figure 5 shows a sample of two such dates: 23 April and 13 May 2005. These two dates are selected for display, because they show two different patterns of time lags. Consider 13 May first: On large scales (small NC), arw2 and arw4 have their highest CSI values at lag ∼ 0; in other words, on large scales these models are neither fast nor slow. On smaller scales (large NC), however, they both have their highest CSI values at positive lag (∼2 h), implying that they are slow on small scales. The nmm4 results show a slight peak CSI at a negative lag (∼−2 h) across all scales. This means that on this day nmm4 is too fast on all scales.^{10} By contrast, on 23 April, all three models are too fast because their highest CSI values occur at negative lags, across the full range of scales. It is important to point out that these two dates are atypical cases and are discussed here only to illustrate the diagnostic nature of the verification scheme. A visual inspection of all 32 plots (not shown) suggests that the three models are “on time” on the average.
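In this scheme, the sign of the lag at which CSI peaks diagnoses the timing error; a sketch at one fixed NC, with made-up CSI values:

```python
# A peak at a negative lag means the forecast matches an earlier
# observation (the model is "too fast"); a peak at a positive lag
# means it is "too slow". The values below are invented.
csi_by_lag = {-3: 0.21, -2: 0.27, -1: 0.25, 0: 0.24,
               1: 0.22,  2: 0.20,  3: 0.18}
best_lag = max(csi_by_lag, key=csi_by_lag.get)
print(best_lag)  # -> -2, i.e., too fast at this scale
```

Repeating this over all NC values produces the CSI surfaces shown in Fig. 5.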

## 5. Expert opinion

As mentioned previously, the main aim of the verification methodology discussed here is to allow for the automatic verification of large numbers of forecasts. The question remains as to whether or not the results of the verification agree with human/expert assessment. To that end, three tests are performed.

The first test involves an expert’s (Sandgathe) visual assessment of the overall quality of the three forecasts combined. This is possible because the three forecast models’ predictions are quite similar to each other on each day; in fact, they are closer to one another than to the corresponding observations. (At this point, there is a desire to recommend more research into the model physics, i.e., all models appear to be similarly lacking; however, that point will be reserved for a different forum.) Based on a visual inspection of the three forecasts (e.g., Fig. 1), a subjective score (VIS) ranging from 0 to 10, representing poor to excellent forecasts, respectively, is assigned to each day. Given that CSI is the objective measure of model–forecast quality adopted in this study, the question is whether VIS is correlated with CSI across the 32 days. It is important to point out that the VIS scores are assigned *prior* to the expert’s viewing of the CSI scores. The results are shown in Fig. 6, for three values of the hit threshold (0.01, 0.1, 0.2), and two values of NC (20 and 60). We selected NC = 20 because the fields appear visually to have two to seven major “systems” and 15 to 30 major clusters on each day. As such, 20 is a physically meaningful cluster number. We selected NC = 60 because it corresponds to where the CSI curves (e.g., Fig. 2) become constant. Interestingly, NC = 20 and 60 are also suggested by the box plots shown in Fig. 3. The correlation coefficient, *r*, between VIS and CSI is also shown.

It can be seen that CSI and VIS are generally well correlated, with *r* values in the 0.7–0.8 range. These scatterplots confirm that the relation between CSI and VIS is generally linear, regardless of hit threshold or NC. As such, it follows that the verification method described here yields CSI values consistent with a human expert’s assessment of forecast quality. This conclusion is intended to be only qualitative in nature. The forecasts cover a large area involving multiple weather systems, and trade-offs between a “good” forecast for one system and a “poor” forecast for another system on a given day are very subjective. A more rigorous study would require multiple experts, careful attention to interoperator variability, reducing the forecast area to smaller regions, and a more thorough definition of forecast quality.
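The VIS-CSI comparison reduces to a simple correlation across days; a sketch with made-up scores (not the study's data):

```python
import numpy as np

# Invented (VIS, CSI) pairs across days, for illustration only.
vis = np.array([2.0, 5.0, 7.0, 4.0, 9.0, 3.0])        # expert scores, 0-10
csi_vals = np.array([0.10, 0.22, 0.30, 0.18, 0.41, 0.15])
r = np.corrcoef(vis, csi_vals)[0, 1]                   # Pearson correlation
print(round(r, 2))
```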

The above test gauges, via an expert’s assessment of the forecasts, the general skill of the three models on any given day. In other words, it answers the question of how well the models perform in general. The second test is more difficult visually and attempts to verify whether CSI is sufficiently sensitive to model skill to accurately rank the model forecasts on a given day.^{11} To that end, the three model forecasts are ranked visually for each of the 32 days. Again, ties are awarded the same ranking, resulting in scores skewed toward lower numbers (higher rank). The data are summarized in Table 2, and a chi-squared test of the association between model and visual rank yields a p value of 0.04. So, at a significance level of 0.05, one can conclude that visual ranking distinguishes between the models at a statistically significant level. Inspection of Table 2 indicates again that the models should be ranked, from best to worst, as arw2, nmm4, and arw4, in general agreement with Table 1 as discussed in section 3b, with the caveat that in this visual test the number of clusters is ignored. Note that this particular ranking of the three models is the same as that based on the values of CSI itself, at midrange NC values, as discussed in section 3a.
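The mechanics of such a chi-squared test of association can be sketched as follows. The counts below are hypothetical stand-ins, not the actual entries of Table 2:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 contingency table: rows are models, columns are visual
# ranks 1, 2, 3, with each row summing to the 32 days. These counts are
# illustrative only and do not reproduce the paper's Table 2.
table = np.array([[18, 10,  4],   # arw2
                  [12, 12,  8],   # nmm4
                  [ 8, 10, 14]])  # arw4

# Chi-squared test of independence between model and visual rank.
chi2, p, dof, expected = chi2_contingency(table)
```

For a 3 × 3 table the test has (3 − 1)(3 − 1) = 4 degrees of freedom; a p value below 0.05 would indicate a statistically significant association between model and rank.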

The last test, or comparison, is to ensure that “clear winners” and “clear losers” are identified by the CCA methodology. In this instance, only those cases where either the visual inspection or the CSI indicates a clear best or worst model are compared. Of the 22 cases so indicated, there is agreement on 10 and disagreement on 2; on the remaining 10, either the visual inspection or the CSI fails to indicate a clear winner (for CSI, statistically significant; for the visual inspection, too difficult to assess). Again, in order to perform this test more accurately, the region should be subdivided into smaller regions and additional experts should be employed to visually verify the fields.

## 6. A 3D example

The above CCA is performed in a *p* = 2 dimensional space, but it is instructive to examine at least one *p* = 3 dimensional analysis because it illuminates some important issues. On the one hand, by performing the analysis in (*x*, *y*, *z*) space, one expects the resulting clusters (at each iteration of HAC) to be more physically sensible than in an analysis performed in (*x*, *y*). That is the main reason for even pursuing this methodology in a higher-dimensional space. On the other hand, there are (at least) two complexities. First, the resulting clusters *when viewed in* (*x*, *y*) may appear to be unphysical. For instance, two spatially adjacent clusters may be labeled as distinct because they differ in terms of their nonspatial coordinates. Although this may create some problems in a visual assessment of the clusters, in truth the analysis is more faithful than an analysis in (*x*, *y*), because it relies on more information (even if that information may not be easily displayed in a 2D plot of the clusters).

The second complexity with performing the analysis in a higher-dimensional space is that it raises a question that is difficult to answer objectively. Specifically, what metric should be used to compute the distance between two points in a *p*-dimensional space? In other words, how should the nonspatial components be weighted against the spatial components? One way to address this issue is to introduce a “knob” controlling the strength of each nonspatial component relative to the spatial ones. This solution may be feasible if there is only one nonspatial variable, for example, reflectivity; however, it becomes impractical for large values of *p*.

Alternatively, one may appeal to some other criteria for choosing the metric, for example, one based on tolerance. One expert, or a consensus of experts, might advise “tolerances” for forecast errors in the spatial and nonspatial coordinates. These tolerances may be used to standardize the different coordinates to a common tolerance scale. For example, one might tolerate a spatial forecast error of 20 km. As for errors in reflectivity itself, one might tolerate a forecast error of one-quarter of the difference between reflectivities of 50 dB*Z* (very heavy rain) and 20 dB*Z* (very light rain). Distance errors and reflectivity errors may then be converted to “tolerance units” by dividing the spatial component of the distance by 20 km, and the reflectivity component by 7.5 dB*Z* [= (50 − 20)/4]. This is one of many possible criteria.
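As a sketch, the tolerance-based metric described above, with 20 km of displacement and 7.5 dB*Z* of reflectivity error each counting as one tolerance unit, might be implemented as follows (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

# Tolerances advised for each coordinate: 20 km of displacement and
# (50 - 20)/4 = 7.5 dBZ of reflectivity error each count as one unit.
SPATIAL_TOL_KM = 20.0
REFLECTIVITY_TOL_DBZ = (50.0 - 20.0) / 4.0  # 7.5 dBZ

def tolerance_distance(p, q):
    """Distance between two cases p = (x, y, z) and q = (x, y, z),
    with x, y in km and z in dBZ, after scaling each component by
    its tolerance so that all components share a common scale."""
    dx = (p[0] - q[0]) / SPATIAL_TOL_KM
    dy = (p[1] - q[1]) / SPATIAL_TOL_KM
    dz = (p[2] - q[2]) / REFLECTIVITY_TOL_DBZ
    return np.sqrt(dx**2 + dy**2 + dz**2)
```

By construction, a 20-km displacement with no reflectivity error and a 7.5-dB*Z* reflectivity error with no displacement both yield a distance of exactly one tolerance unit.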

Figure 7 shows the CSI surfaces for three analyses of the 13 May data. The first example (Fig. 7, top) is a *p* = 2 dimensional analysis in (*x*, *y*); it is Fig. 5 reproduced here for easier comparison. The other two are *p* = 3 dimensional analyses in [*x*, *y*, log(*z*)], where *z* denotes reflectivity; in one case (Fig. 7, middle) the three components are standardized individually (i.e., they are weighted equally), and in the other case (Fig. 7, bottom) the metric is based on the aforementioned tolerance considerations. In going from the *p* = 2 (Fig. 7, top) to the *p* = 3 analysis with equal weight given to all the components (Fig. 7, middle), it is evident that arw2 becomes more sensitive to the lag. This conclusion is based on the “crest” appearing in the middle panel of the arw2 column in Fig. 7. Another notable difference is that the above-mentioned “slowness” of arw4 is much less pronounced. Also, nmm4 is not affected by the inclusion of reflectivity in the analysis; it appears to be fast on that date.

Most of these features disappear when a tolerance-based metric is employed for computing distances. Indeed, the resulting figures (Fig. 7, bottom) resemble those of the (*x*, *y*) analysis (Fig. 7, top). It follows that the inclusion of reflectivity in the analysis does not drastically affect the assessment of performance; the three models are comparable in terms of the amount of reflectivity assigned to the various clusters, at least within the range of tolerances injected into the distance metric.

## 7. Summary and discussion

This is arguably a small sample for any definitive conclusions regarding the comparison of the three model formulations. Moreover, the three model formulations have changed significantly since these cases were forecast in the spring of 2005. But the data do appear to provide a valid test of the CCA methodology, and the methodology does appear to provide insight into the model formulations. The CSI curves, not surprisingly, indicate little difference between the various model formulations. However, there appears to be a tendency for arw2 to outperform nmm4, which in turn outperforms arw4, in the 20–60-cluster range, the range that is likely the most significant for the cases evaluated here. The evaluated arw4 formulation performs statistically worse in that range as borne out also by both visual and rank-based evaluations. Although the large, complex region under evaluation makes visual ranking difficult, comparison of visual rankings and CCA CSI scores indicates that CCA is faithfully capturing model performance.

A methodological contribution of the current study is the introduction of the transposed HAC (section 2 and the appendix). Without it, it would simply be impossible to analyze a large number of forecasts in any reasonable length of time. To assess the efficiency of the transposed HAC method, a few benchmark tests have been performed to compare transposed HAC with traditional HAC. The comparison was performed on a field containing about 20 000 points (even after implementing the 20-dB*Z* threshold). The key parameters of the method were set at *k* = 100 (used for *k* means), *n* = 25 (the number of sample points), and 101 CSI resamples (i.e., yielding 101 different CSI curves to be averaged). To assess the speed of the two methods, the number of points to be clustered was varied from 200 to 4000 for both methods, continuing up to 20 000 only for transposed HAC. All tests were performed on a computer using a relatively modest Intel Pentium 3.00-GHz CPU.

The results are striking though not surprising. For transposed HAC, the procedure ran in the range of 30–45 s depending on the number of points, from 200 to 20 000; this suggests that the timing in this range is perhaps dominated by the CSI resampling. For traditional HAC, run time grows rapidly with the number of points: for fewer than 1500 points, it ran in less than 30 s, growing to about 10 min for 4000 points. For 20 000 points, traditional HAC simply did not execute due to memory limitations on the computer; the matrix of pairwise distances repeatedly scanned in each HAC step has 20 000 × (20 000 − 1)/2 ≈ 2 × 10^{8} elements, which traditional HAC cannot accommodate. The transposed HAC technique introduces vast improvements into the computational efficiency, making the analysis of large datasets practical.
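A back-of-the-envelope calculation makes the memory limitation concrete:

```python
# Size of the pairwise-distance matrix that traditional HAC must hold
# for n = 20 000 cases: n(n - 1)/2 unique distances. At 8 bytes per
# double-precision value, this is roughly 1.6 GB, well beyond the memory
# of the Pentium-class machine used for the benchmarks.
n = 20_000
entries = n * (n - 1) // 2        # number of unique pairwise distances
gigabytes = entries * 8 / 1e9     # storage at 8 bytes per distance
```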

## Acknowledgments

The authors would like to acknowledge Michael Baldwin, Barbara Brown, Chris Davis, and Randy Bullock for contributing to all levels of this project. In particular we are grateful to M. Baldwin for providing the data for this analysis. Partial support for this project was provided by the Weather Research and Forecasting Model Developmental Testbed Center (WRF/DTC), and by the National Science Foundation (Grant 0513871).

## REFERENCES

Baldwin, M. E., and K. L. Elmore, 2005: Objective verification of high-resolution WRF forecasts during 2005 NSSL/SPC Spring Program. Preprints, *21st Conf. on Weather Analysis and Forecasting/17th Conf. on Numerical Weather Prediction,* Washington, DC, Amer. Meteor. Soc., 11B.4. [Available online at http://ams.confex.com/ams/pdfpapers/95172.pdf.]

Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, *Ninth Conf. on Mesoscale Processes,* Fort Lauderdale, FL, Amer. Meteor. Soc., 255–258.

Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2002: Development of an “events-oriented” approach to forecast verification. Preprints, *19th Conf. on Weather Analysis and Forecasting/15th Conf. on Numerical Weather Prediction,* San Antonio, TX, Amer. Meteor. Soc., 7B.3. [Available online at http://ams.confex.com/ams/pdfpapers/47738.pdf.]

Brown, B. G., J. L. Mahoney, C. A. Davis, R. Bullock, and C. K. Mueller, 2002: Improved approaches for measuring the quality of convective weather forecasts. Preprints, *16th Conf. on Probability and Statistics in the Atmospheric Sciences,* Orlando, FL, Amer. Meteor. Soc., 1.6. [Available online at http://ams.confex.com/ams/pdfpapers/29359.pdf.]

Brown, B. G., and Coauthors, 2004: New verification approaches for convective weather forecasts. Preprints, *11th Conf. on Aviation, Range, and Aerospace,* Hyannis, MA, Amer. Meteor. Soc., 9.4. [Available online at http://ams.confex.com/ams/pdfpapers/82068.pdf.]

Bullock, R., B. G. Brown, C. A. Davis, K. W. Manning, and M. Chapman, 2004: An object-oriented approach to quantitative precipitation forecasts. Preprints, *17th Conf. on Probability and Statistics in the Atmospheric Sciences/20th Conf. on Weather Analysis and Forecasting/16th Conf. on Numerical Weather Prediction,* Seattle, WA, Amer. Meteor. Soc., J12.4. [Available online at http://ams.confex.com/ams/pdfpapers/71819.pdf.]

Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. *Meteor. Appl.,* **11,** 141–154.

Chapman, M., R. Bullock, B. G. Brown, C. A. Davis, K. W. Manning, R. Morss, and A. Takacs, 2004: An object oriented approach to the verification of quantitative precipitation forecasts: Part II—Examples. Preprints, *17th Conf. on Probability and Statistics in the Atmospheric Sciences/20th Conf. on Weather Analysis and Forecasting/16th Conf. on Numerical Weather Prediction,* Seattle, WA, Amer. Meteor. Soc., J12.5. [Available online at http://ams.confex.com/ams/pdfpapers/70881.pdf.]

Davis, C. A., B. Brown, and R. Bullock, 2006a: Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. *Mon. Wea. Rev.,* **134,** 1772–1784.

Davis, C. A., B. Brown, and R. Bullock, 2006b: Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. *Mon. Wea. Rev.,* **134,** 1785–1795.

Du, J., and S. L. Mullen, 2000: Removal of distortion error from an ensemble forecast. *Mon. Wea. Rev.,* **128,** 3347–3351.

Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.,* **239,** 179–202.

Everitt, B. S., 1980: *Cluster Analysis.* 2nd ed. Heinemann, 136 pp.

Gneiting, T., H. Sevcikova, D. B. Percival, M. Schlather, and Y. Jiang, 2005: Fast and exact simulation of large Gaussian lattice systems in *R*^{2}: Exploring the limits. Dept. of Statistics Tech. Rep. 477, University of Washington, Seattle, WA, 17 pp. [Available online at http://www.stat.washington.edu/www/research/reports/2005/tr477.pdf.]

Hoffman, R. N., Z. Liu, J-F. Louis, and C. Grassotti, 1995: Distortion representation of forecast errors. *Mon. Wea. Rev.,* **123,** 2758–2770.

Kain, J. S., S. J. Weiss, M. E. Baldwin, G. W. Carbin, D. A. Bright, J. J. Levit, and J. A. Hart, 2005: Evaluating high-resolution configurations of the WRF model that are used to forecast severe convective weather: The 2005 SPC/NSSL Spring Experiment. Preprints, *21st Conf. on Weather Analysis and Forecasting/17th Conf. on Numerical Weather Prediction,* Washington, DC, Amer. Meteor. Soc., 2A.5. [Available online at http://ams.confex.com/ams/pdfpapers/94893.pdf.]

Kain, J. S., S. J. Weiss, J. J. Levit, M. E. Baldwin, and D. R. Bright, 2006: Examination of convection-allowing configurations of the WRF model for the prediction of severe convective weather: The SPC/NSSL Spring Program 2004. *Wea. Forecasting,* **21,** 167–181.

Marzban, C., and S. Sandgathe, 2006: Cluster analysis for verification of precipitation fields. *Wea. Forecasting,* **21,** 824–838.

Marzban, C., and S. Sandgathe, 2008: Cluster analysis for object-oriented verification of fields: A variation. *Mon. Wea. Rev.,* **136,** 1013–1025.

Nachamkin, J. E., 2004: Mesoscale verification using meteorological composites. *Mon. Wea. Rev.,* **132,** 941–955.

Venugopal, V., S. Basu, and E. Foufoula-Georgiou, 2005: A new metric for comparing precipitation patterns with an application to ensemble forecasts. *J. Geophys. Res.,* **110,** D08111, doi:10.1029/2004JD005395.

## APPENDIX

### Transposed Hierarchical Clustering

In this appendix, the method of “transposing” the data will be outlined.

A hierarchical clustering method (HAC) is inherently iterative. The agglomerative version (adopted here) begins by assigning every data case to a cluster, and ends when all the cases fall into a single cluster. In the current application, a “case” refers to the coordinates of a grid point, but, as described by Marzban and Sandgathe (2008), the point may reside in a larger-dimensional space. For example, the clustering may be performed in (*x*, *y*, *z*), where *x* and *y* denote the Cartesian coordinates of a grid point, and *z* refers to the reflectivity at that point.

The iterative nature of HAC is a desirable feature in the current application because it allows exploration of the clusters on different scales. However, for datasets involving a large number of cases the procedure is prohibitively slow. To expedite the procedure, a revision is considered: For simplicity, and without loss of generality, consider a one-dimensional dataset whose cases are denoted *x _{i}*, with *i* = 1, 2, 3, . . . , *n*. HAC begins with *n* clusters, and proceeds to identify the nearest pair, which in turn are merged into a new cluster.

Now, suppose one were to identify all of the nearest pairs but not combine them into clusters. Instead, suppose the two members of each pair are viewed as coordinates of a new “case” viewed in a two-dimensional Cartesian space. Figure A1 illustrates the idea on a hypothetical dataset consisting of six cases: *x* = 11, 12, 14, 15, 18, 19. Let *C _{i,j,k, . . .}* denote the cluster with the elements *i*, *j*, *k*, . . . . HAC begins with the six clusters *C _{1}* = 11, *C _{2}* = 12, *C _{3}* = 14, . . . , *C _{6}* = 19. Traditional HAC would yield the following sequence of iterations: *C _{1,2}*, *C _{3,4}*, *C _{5,6}*, *C _{1,2,3,4}*, and *C _{1,2,3,4,5,6}*. However, consider a transposition that maps the one-dimensional data to two dimensions according to the scheme displayed in Fig. A1. For example, the two cases, *x* = 11 and *x* = 12, get mapped to the single case at (11, 12). In this space, there are only three cases present, and, so, the size of the data has been reduced by 50%. HAC on the new data can then proceed in a traditional fashion.
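The transposition step can be sketched as follows. This is a plain greedy nearest-pair version; the exact pairing scheme of Fig. A1 may differ in its details, such as tie-breaking:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def transpose_once(points):
    """One transposition step (a sketch): greedily pair the nearest
    remaining cases and stack each pair into a single case of doubled
    dimension, so an n x p data matrix becomes roughly n/2 x 2p."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:                       # allow 1D input, e.g. x values
        pts = pts[:, None]
    d = squareform(pdist(pts))              # full pairwise-distance matrix
    np.fill_diagonal(d, np.inf)             # exclude self-pairs
    unused = set(range(len(pts)))
    new_cases = []
    while len(unused) > 1:
        # pick the globally nearest pair among the unused cases
        i, j = min(((a, b) for a in unused for b in unused if a < b),
                   key=lambda ab: d[ab])
        unused -= {i, j}
        new_cases.append(np.concatenate([pts[i], pts[j]]))
    return np.array(new_cases)
```

On the hypothetical data *x* = 11, 12, 14, 15, 18, 19 this returns the three 2D cases (11, 12), (14, 15), and (18, 19), halving the data size; HAC can then be run on the reduced set.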

The method is being called transposed HAC, because with each case being a *p*-dimensional vector, the original data, which can be viewed as an *n* × *p* matrix, are “transposed” into an *n*/2 × 2*p* matrix. As another example, consider *n* = 6 and *p* = 2. The data (*x _{i}*, *y _{i}*), *i* = 1, 2, . . . , *n*, are mapped into four dimensions, according to the scheme shown in Fig. A1. The only ambiguity in this map has to do with the order of the points. For instance, one might map two nearest neighbors, (*x _{1}*, *y _{1}*) and (*x _{2}*, *y _{2}*), to the point (*x _{1}*, *x _{2}*, *y _{1}*, *y _{2}*) or (*x _{1}*, *y _{1}*, *x _{2}*, *y _{2}*), etc. It has been confirmed that the final results of the analysis (e.g., CSI curves) are insensitive to the ordering of the points. This is so because in the resampling phase of the procedure (step 5 in section 2), the order of the points changes from sample to sample.

To make contact with the analysis performed in the body of the paper, note that the above-mentioned pairing of the cases is equivalent to performing some type of clustering (e.g., *k* means) on the original data.
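This two-stage idea, a clustering pass to reduce the data followed by HAC on the reduced set, can be sketched with off-the-shelf tools. The sizes below (2000 points, *k* = 100, NC = 20) are illustrative rather than the paper’s exact settings:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

# Stage 1: reduce a large set of cases with k-means.
# Stage 2: run agglomerative clustering (HAC) on the reduced set only.
rng = np.random.default_rng(1)
points = rng.standard_normal((2000, 2))            # stand-in for grid points
centroids, _ = kmeans2(points, 100, minit="++", seed=1)
Z = linkage(centroids, method="average")           # HAC on 100 cases only
labels = fcluster(Z, t=20, criterion="maxclust")   # cut the tree at NC = 20
```

Because the HAC stage sees only 100 cases instead of 2000, its pairwise-distance matrix shrinks by a factor of roughly 400, which is the source of the speedup.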

Table 1. The contingency table reflecting the association between the three models (rows) and a ranking of the three models according to CSI, at NC = 20 and NC = 60.

Table 2. The contingency table reflecting the association between the three models (rows) and a ranking of the three models according to a visual inspection of the forecast field and the corresponding observation field.

^{1}

A “case” refers to a grid point whose reflectivity meets some requirement, e.g., reflectivity >20 dB*Z*. This usage of the term is consistent with the statistical usage. For example, one speaks of performing cluster analysis on some number of cases.

^{2}

Note: This is different from the standardization adopted by Marzban and Sandgathe (2006, 2008), where unpooled std devs are employed. In fact, in the *p* = 2 dimensional case, standardization with the pooled std dev is unnecessary, since the *x* and *y* coordinates are already on the same scale.

^{3}

These CSI surfaces are different from those of Marzban and Sandgathe (2006); there, CSI is plotted as a function of two NCs, one for each of the fields.

^{4}

Hereafter, CSI curve refers to the curve resulting from averaging over *n* = 101 samples.

^{5}

What is a random Gaussian field? First, consider a sequence of random numbers and note that not all random sequences are alike. For example, the numbers may be uniformly distributed over some interval, or be a sample drawn from a Gaussian with a specified mean and variance, etc. Moving to the 2D case, it is relatively easy to show that if one organizes a sequence of uniformly distributed random numbers into an *n* × *m* array, then the resulting field (or “image”) will itself appear random in a uniform fashion. Generating a random field that has some nontrivial spatial structure calls for specifying something about that structure. One way is to assume that the numbers are distributed according to a multivariate Gaussian with specified means and a covariance matrix. In the field of spatial statistics, there are families of popular covariance structures and methods for simulating them. The one employed in this work is the so-called stable family, whose simulation is described in Gneiting et al. (2005).
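For illustration, a minimal (and slow, Cholesky-based) simulation of such a field on a small lattice might look like the following. The grid size, covariance range, and smoothness parameter here are arbitrary choices, and Gneiting et al. (2005) describe fast, exact methods for much larger lattices:

```python
import numpy as np

# Simulate a Gaussian random field on a small n x n lattice with a stable
# covariance C(h) = exp(-(h/scale)**alpha), via direct Cholesky factorization.
# All parameter values below are arbitrary, for illustration only.
rng = np.random.default_rng(0)
n = 16
xy = np.stack(np.meshgrid(np.arange(n), np.arange(n)), axis=-1).reshape(-1, 2)
h = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)  # pairwise lags
alpha, scale = 1.0, 5.0
C = np.exp(-(h / scale) ** alpha)                 # stable covariance matrix
L = np.linalg.cholesky(C + 1e-8 * np.eye(n * n))  # jitter for stability
field = (L @ rng.standard_normal(n * n)).reshape(n, n)
```

Unlike an array of independent uniform draws, the resulting `field` exhibits spatial structure: neighboring grid points are positively correlated.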

^{6}

Recall that for NC = 1 one would include all valid points in both the observed and forecast fields. Since our “random” field is distributed across the entire domain, it scores well for the 13 May case where a significant portion of the domain has observed precipitation.

^{7}

For technical reasons, it is difficult to assess the statistical significance of these differences. For example, a paired *t* test may seem appropriate, but then the issue of multiple testing (i.e., for 100 NC values) becomes a thorny one. One may address that concern via a Bonferroni adjustment, but such a correction ignores the clear dependence of CSI as a function of NC. For such reasons, a rigorous statistical test of the hypothesis of equal means is not performed here. The box plots, however, do provide a useful and visual tool for qualitatively assessing the difference between the means.

^{8}

Two other but equivalent conclusions are as follows: 1) The three models are not homogeneous with respect to their rank and 2) for at least one of the three ranks, the proportions of the three ranks (for each model) are not identical for the three models.

^{9}

Technically, one should convert all of the cell counts into row proportions. One can then compare the proportion of rank = 1 cases in arw2, with that in arw4, etc. However, given that the row marginals of these tables are equal (i.e., 32), one can compare the actual counts.

^{10}

The lags reflect the difference in the observation time rather than the forecast; i.e., the forecast time is fixed. A peak at a negative lag is, therefore, a peak at an earlier observation; i.e., the forecast is fast. Similarly, a positive lag implies a slow forecast.

^{11}

Ranking the actual forecast fields visually is much more difficult than ranking the models based on CSI, as a model may do well in one geographic region of the field, or on one weather complex, while doing poorly in another. For this reason, the following analysis should be considered qualitative, in spite of the appearance of statistical significance tests.