## 1. Introduction

Precipitation patterns are often complex and difficult to characterize. Since localized maxima produce high impacts over small areas, accurate forecasts are highly desirable. Precipitation prediction lends itself to the use of mesoscale models, and high-resolution numerical forecasts are providing guidance with increasing detail. These forecasts are the subject of intense study in the search for new verification methods, because comparing the similarity of two complex fields becomes quite difficult with increasing resolution (Koch 1985; White et al. 1999; Ebert and McBride 2000; Zepeda-Arce et al. 2000; Colle et al. 2001; Mass et al. 2002; Baldwin et al. 2002). The increased variability in both the forecasts and the observations increases the error values reported by many conventional statistics, such as the equitable threat score and the root-mean-square (rms) error. Objectively characterizing the intrinsic quality of these forecasts in the face of limited or imperfect observations is one of the greatest challenges of mesoscale verification.

Ebert and McBride (2000), Bullock et al. (2004), and Sandgathe and Heiss (2004) approach the problem with techniques that match the forecasts to the observations using methods related to those of Hoffman et al. (1995). Ebert and McBride (2000) consider mesoscale rain areas as contiguous entities and collect statistics regarding the placement, volumetric, and pattern errors. Sandgathe and Heiss (2004) approach the problem from a more generalized perspective, deriving distortion and amplitude errors from the differences between the forecast and an analyzed field. These methods reduce the intrinsic statistical variability by isolating specific areas and summarizing the errors over the scale of a particular event.

The related issue of scale was addressed by Zepeda-Arce et al. (2000). They found that the threat score for a given precipitation forecast improved as the data were smoothed to increasingly larger scales. In this case, statistical variability was removed by smoothing as opposed to compositing. Presumably, better forecasts have higher threat scores at correspondingly smaller scales (smaller averaging factors). However, no unique scale stands out as the most applicable. Uncertainty in the observations, associated with varying gauge densities and representativeness errors, is also a problem. Ultimately, these considerations factor into the forecast in terms of an uncertainty. Forecasts are issued as probabilities over general regions because the model guidance is inexact. The degree of model error determines the forecast probability and area over which the probability is distributed. Good forecasts result in high probabilities over small areas, whereas bad forecasts (large random errors) require reduced probabilities over larger areas. In this sense, “goodness” can be measured in terms of a probability density function, or more simply, in terms of the expected distributions of observations and forecasts given that a subgroup of events is predicted or observed. This approach is especially relevant considering that several models or ensembles of models are often used together, and each model or ensemble has its own error characteristics.

Murphy and Winkler (1987) examined the viability of using such a distributions-oriented approach as a verification tool. The main premise involved reporting the basic statistics directly from the conditional distributions of the observations and forecasts. This approach simplifies the mesoscale verification problem because conditional forecast and matching observed precipitation samples of multiple events are easily collected and evaluated. Variability still exists in the samples, but its nature can be directly quantified and, more importantly, modulated by adjusting the sampling criteria.

In the presence of limited observations, Nachamkin (2004) showed that composite methods can be used to derive statistically reliable wind distributions. In that study, forecasts of the mistral were evaluated against the oceanic Special Sensor Microwave Imager (SSM/I) winds. Although the SSM/I observations often did not fully cover any specific event, information regarding the mean model performance was derived through the composite statistics. Forecast quality was examined through the similarity of the conditional statistics given that a mistral was predicted or observed. Good forecasts displayed small differences in the conditional means since the characteristics of the conditional wind speed distributions were similar regardless of whether an event was either predicted or observed. Degrading forecast quality promotes large differences in the conditional statistics as the forecasts and observations diverge.

In this paper, operational forecasts of heavy precipitation events from the Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS^{1}) (Hodur 1997) are evaluated using the composite verification method introduced by Nachamkin (2004). The forecasts are diagnosed in terms of the expected rainfall distribution given that an event is either predicted or observed. The nature of the model errors is investigated, as are possible causes for the problems. Precipitation event statistics are compared to the statistics derived for the mistral by Nachamkin (2004). Although the quality of the precipitation event forecasts was not as high, the composites of precipitation conditional on the existence of observed and predicted events displayed many attributes indicating that the forecasts were useful. Additional statistics taken at multiple scales relative to the precipitation events reveal interesting limitations for optimal performance measures of high-variance phenomena.

## 2. Data and methods

The operational forecasts for the continental United States (CONUS) run at Fleet Numerical Meteorology and Oceanography Center (FNMOC) were used for this study. The 24- and 48-h precipitation forecasts for all model integrations initialized at 1200 UTC were collected from 15 April through 7 September 2003. For brevity, the primary verification analysis in this study focused on the 1-day forecasts of the 24-h accumulated precipitation valid at 1200 UTC. The forecasts were initialized using the multivariate optimal interpolation analysis (Barker 1992) with the previous COAMPS 6-h interim forecast as a first guess. Boundary conditions were supplied from the Navy Operational Global Atmospheric Prediction System (NOGAPS) (Hogan et al. 2002) at 3-h intervals using a Davies (1976) scheme. Two one-way nested grids with spacings of 81 and 27 km were used. The domain contained 30 vertical levels with the lowest at 10 m AGL and the highest near 35 km. Subgrid-scale convection was parameterized using the Kain–Fritsch scheme (Kain and Fritsch 1993), while the explicit microphysics was parameterized using a modified Rutledge and Hobbs (1983) scheme described by Schmidt (2001). The model employed a nonhydrostatic equation system with terrain-following sigma coordinates.

The precipitation forecasts were verified against the 24-h accumulated gauge-only rainfall analyses performed at the National Centers for Environmental Prediction (NCEP) (Olson et al. 1995). The analyses are valid at 1200 UTC each day, and are based on approximately 8000 gauge observations collected from the National Weather Service (NWS) River Forecast Centers (RFC). Precipitation data from the 4-km gauge analysis grid were interpolated to the 27-km mesoscale model grid. To do this, each model grid area was divided into 25 equally sized squares. In this case each box was 5.4 km on a side. The gauge analysis was sampled by selecting the analysis grid point that was closest to the center of each box. These samples were then averaged over the model grid square to obtain a single value. Chen et al. (2002) noted the errors associated with this interpolation were generally small compared to the model errors.
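The interpolation procedure described above can be sketched as follows. This is a minimal illustration, not the operational code: the function name is hypothetical, and the analysis points are assumed to lie on a regular grid at 4-km spacing with the model cell aligned to the same coordinate origin.

```python
import numpy as np

def cell_average(analysis, x0_km, y0_km, d_ana=4.0, d_model=27.0, n=5):
    """Average a fine-grid gauge analysis over one model grid cell.

    The model cell is divided into n x n (5 x 5) equal sub-boxes
    (5.4 km on a side for a 27-km cell), the analysis grid point
    nearest each sub-box center is sampled, and the 25 samples are
    averaged to give a single value for the model cell.
    Assumes analysis points lie at (i*d_ana, j*d_ana); illustrative only.
    """
    sub = d_model / n                      # sub-box width (5.4 km)
    vals = []
    for j in range(n):
        for i in range(n):
            cx = x0_km + (i + 0.5) * sub   # sub-box center
            cy = y0_km + (j + 0.5) * sub
            ii = int(round(cx / d_ana))    # nearest analysis point
            jj = int(round(cy / d_ana))
            ii = min(max(ii, 0), analysis.shape[1] - 1)
            jj = min(max(jj, 0), analysis.shape[0] - 1)
            vals.append(analysis[jj, ii])
    return float(np.mean(vals))
```

Sampling nearest points rather than averaging all 4-km points inside the cell keeps each sub-box equally weighted, which is consistent with the equal-area division described in the text.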

Heavy precipitation events were located in both the forecasts and observations for verification using the Nachamkin (2004) composite technique. Model performance is quantified by sampling only in areas where specific events of interest are known to be predicted or observed. Since predicted events are not necessarily collocated with observed events, the compositing is conducted twice. First, all qualifying events in the forecasts are located, and conditional samples of the forecasts and corresponding observations are taken based on the existence of the forecast event. Then another composite of forecasts and observations is taken based on the existence of an observed event. Each composite consists of a structured set of superimposed events that are known to exist and a corresponding, generally less structured, set of coexistent observations or forecasts where events might exist. The term “structured” refers to the reduction in the degrees of freedom in the composite distribution of meteorological variables brought on by constraining the sample criteria. For example, when the forecasts and observations are sampled based on the existence of a predicted event, the structure of the composite distribution of forecast winds, rain, etc. is directly controlled by the sample criteria. Highly defined criteria, such as all circular rain events with intensities between 25 and 30 mm spanning 100–200 grid points, lead to very well defined composites. Of course criteria that are too focused will only sample a small number of model forecasts and may not be very representative. In this example, the attributes of the composite distribution of observations are primarily controlled by the predictive ability of the model. If the model forecasts are “good,” then the observational composite will be well defined and will share many of the same attributes with the forecast composite. 
However, forecast errors change the structure of the composite distribution of observed variables with respect to the forecasts. Random errors dilute the properties of the composite distribution, while systematic errors result in systematic shifts. Thus, the quality of the forecasts can be measured in terms of the similarity between the composites. In this paper, the composite of predicted or observed events where the attributes are directly controlled by the sample criteria is referred to as the *independent* sample. The composite of forecasts or observations that correspond to the occurrence of the events in the independent sample is referred to as the *dependent* sample. Again, the structure of the dependent sample composite is largely determined by the model performance. If the predictions are “good” the forecasts and observations will closely resemble one another and the independent and dependent sample composites will have similar properties.

Heavy events were defined in this study as contiguous areas with 24-h accumulations greater than 25 mm (∼1 in.). All events containing 50–500 contiguous grid points were considered in the composites. These parameters were chosen as the best compromise between achieving a sharp composite and a meaningful sample that contained a large number of events. The statistics summarized over the remainder of this work indicate that these parameters were adequate for verification purposes. Once an event was identified in the forecasts (observations) by an automated algorithm, all surrounding data were transferred to a 31 point × 31 point (837 km × 837 km) relative grid with characteristics that were identical to the model grid (Fig. 1). The center of an event, defined as the centroid of the contiguous points composing the event, was positioned at the center of the relative grid. Then, all available observational (forecast) data were also positioned on the relative grid. Model data were templated by the available observations, such that all points in the forecast outside the contiguous coverage of the RFC analyses were removed from the set. The main goal of this sampling strategy is to minimize the variance within the independent sample. Low variance translates into high confidence in the properties of the independent sample composite. High confidence facilitates the determination of meaningful, systematic differences between the independent and dependent sample composites.
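The event-identification step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the operational algorithm: 4-connectivity is assumed because the text does not specify the adjacency rule, and the function name is hypothetical.

```python
import numpy as np
from collections import deque

def find_events(rain, thresh=25.0, min_pts=50, max_pts=500):
    """Locate heavy-rain events in a 24-h accumulation field.

    Flood-fills contiguous areas exceeding `thresh` (mm) and returns
    (centroid_j, centroid_i, n_points) for every area containing
    between min_pts and max_pts grid points, per the event definition
    in the text.  4-connectivity is an assumption.
    """
    mask = rain > thresh
    seen = np.zeros_like(mask, dtype=bool)
    ny, nx = mask.shape
    events = []
    for j0 in range(ny):
        for i0 in range(nx):
            if mask[j0, i0] and not seen[j0, i0]:
                # breadth-first flood fill of one contiguous area
                q = deque([(j0, i0)])
                seen[j0, i0] = True
                pts = []
                while q:
                    j, i = q.popleft()
                    pts.append((j, i))
                    for dj, di in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        jj, ii = j + dj, i + di
                        if (0 <= jj < ny and 0 <= ii < nx
                                and mask[jj, ii] and not seen[jj, ii]):
                            seen[jj, ii] = True
                            q.append((jj, ii))
                if min_pts <= len(pts) <= max_pts:
                    cj = sum(p[0] for p in pts) / len(pts)
                    ci = sum(p[1] for p in pts) / len(pts)
                    events.append((cj, ci, len(pts)))
    return events
```

Each returned centroid would then be placed at the center of the 31 × 31 relative grid before the surrounding forecast and observed data are transferred onto it.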

Ebert and McBride (2000) noted that data voids within the observations lead to errors in determining statistical quantities from the events. The conditional composites contingent on the existence of an event in the forecasts are unaffected by this because the relative grid position is determined by the forecasts alone. However, the conditional composites contingent on the existence of an event in the observations are affected since partially observed events lead to errors in the location of the event center. In this study the observations are only valid over the continental United States; thus any event or portion of an event outside that region is not sampled. Errors associated with partially sampled events effectively increase the variance in the observational conditional composite because the observed events are not all directly superimposed. Features like systematic phase errors may become less apparent. Ebert and McBride (2000) noted that the phase error, and thus event position, was among the least sensitive parameters to the data boundary errors. In a Monte Carlo experiment, they noted standard displacement errors on the order of two grid points or less for most events.

## 3. Precipitation distributions on the native model grid

Although the results are most intuitive when viewed as composites in the relative grid reference frame, the precipitation distributions on the native geographical model grid also contain useful information. The precipitation in Figs. 2 and 3 consists of the entire relative grid samples as templated by the available RFC data. These precipitation distributions were compiled to ensure that any overlapping samples were only counted once. The distributions in Fig. 2 were constructed based on the existence of a predicted event. Thus Fig. 2a represents the independent sample associated with known forecast events, while Fig. 2b represents the dependent sample of the observed precipitation associated with those forecasts. Precipitation did not necessarily exist in the observations because the samples were constrained to the location and time of the forecast events. The fact that the distributions are quite similar is an initial indication that the model does fairly well when an event is predicted. In Fig. 3, the samples were contingent on the existence of an observed event. Thus the observations in Fig. 3b compose the independent sample, while the corresponding 24-h forecasts in Fig. 3a compose the dependent sample. Conditioning the samples while constraining them to a limited areal window allows for focused assessments of model performance that time climatologies alone do not provide. Consider the case of an event that is predicted on a Saturday but occurs on a Sunday in the correct place; such an error vanishes in a time-mean climatology but is exposed by the conditional samples. Generally, the statistical sensitivity to errors increases as the size of the space–time sampling window decreases.

The degree of similarity between the observed and predicted precipitation indicates the degree of correspondence between the forecasts and observations within the sample template. The statistics can be investigated by directly comparing the forecasts to the observations, but comparisons between the independent and dependent samples are also useful. For example, the differences in the observational patterns in Figs. 2b and 3b come about because the independent sample of observed rainfall defined by the set of observed 25-mm events does not match the dependent sample of observed rainfall defined by the set of predicted 25-mm events. Some discrepancy must exist for this to occur. Of course highly accurate forecasts result in similar precipitation distributions regardless of the sampling method. The most useful diagnostics are those that quickly convey the basic forecast characteristics.

Almost all of the events in both the observed and predicted samples occurred in regions east of the Rocky Mountains; thus the results of this study are most applicable to forecasts in those areas. In all, 86 predicted events and 78 observed events were sampled. The degree of agreement between the known predicted events (Fig. 2a) and their corresponding observations (Fig. 2b) was relatively high through central Kansas, southern Texas, and much of the east coast from South Carolina to Pennsylvania. Precipitation amounts tended to be higher in the observed sample in most areas, especially across Oklahoma, Missouri, Arkansas, Mississippi, and Alabama (Fig. 2b). The degree of agreement between the known observed events and their corresponding forecasts (Fig. 3) was lower than that between the known predicted events and their corresponding observations (Fig. 2). Observed precipitation values were considerably higher than the forecasts over most areas in Fig. 3, especially in the Midwest, lower Mississippi Valley, and Florida. Note the expanded dynamic range in the shade scaling in Fig. 3. The model either completely missed or strongly underestimated precipitation amounts for a large number of events. Visual inspection of the data indicated that convection with little synoptic-scale organization was prevalent in these areas, and the model often underestimated the precipitation amounts. The degree of this underestimation and its possible association with any spatial errors are best investigated using the relative grid.

## 4. Precipitation distributions on the relative grid

For the composites contingent on an event in the 24-h forecasts (Fig. 4), the highest average predicted precipitation amounts of just over 50 mm were found near the center of the relative grid (Fig. 4a). Maximum observed amounts were about 10 mm less than the forecasts. Since the dependent sample is not constrained by the compositing process, the centers of the observed events were not necessarily superimposed. Random position errors tended to dilute the magnitude of the maximum in the observed precipitation distribution compared to the forecasts. This was reflected by elevated observed standard deviations (Fig. 4b) as well as a broadening of the observed frequency distribution compared to the forecasts (Fig. 4c). Systematic errors were also evident, especially in the average rainfall distributions. The highest average observed precipitation amounts were shifted about three grid points (81 km) south and east of the forecasts. This phase shift is also apparent in the frequency distributions, though it is not as well defined.

When events existed in the observations (Fig. 5), the composites were dominated by the underestimation errors that were apparent on the native model grids (note the reduced scaling in Figs. 5a,c). The average predicted maximum of just over 20 mm covered a small area compared to the large area of 40 mm in the observations (Fig. 5a). Underestimation biases in the frequency distributions were even more severe: in some areas near the grid center, the observed frequencies were up to 6 times greater than the forecasts (Fig. 5c). Standard deviations in the dependent sample composite, which in this case consisted of the forecasts, were higher with respect to the mean than those in the observed composite (Fig. 5b). This is similar to the behavior in the forecast-based composite (Fig. 4b), and again reflects random error. The focus of the forecast events was shifted north and west of the observations, indicating the same spatial error as the forecast-based composite.

The results above depict the average trends when events are predicted or observed, but they lack specific information regarding the individual forecasts. For instance, if the model predicts an event, how often will it result in an acceptable match with the observations? Comparing the rainfall frequency characteristics compiled over the individual relative grid samples used to generate the composites provides useful information in this regard. The relative grid-average 24-h rainfall rate, integrated 24-h rainfall, and maximum 24-h rainfall amounts were recorded for the observations and forecasts in each relative grid sample. Figures 6 and 7 depict the number frequencies of the ratios between the predicted and observed (forecast:observation) values of these quantities. Note that only those samples where the data coverage on the relative grid was 50% or greater were used for these statistics. Most of the samples rejected by this threshold were forecast events that occurred near the coast. These more stringent criteria were imposed to ensure adequate data coverage to estimate the statistics from each individual event. The restrictions were not necessary for the composite statistics because the composite means were calculated once for the set of all events. Partially observed events simply contributed a weighted portion to the total.
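The per-sample ratio calculation can be sketched as follows. This is a sketch under stated assumptions: a boolean mask is taken to mark the relative-grid points covered by the RFC analyses, observed totals are assumed nonzero, and the function name is illustrative.

```python
import numpy as np

def ratio_stats(fcst, obs, coverage_mask, min_coverage=0.5):
    """Forecast:observation ratios for one relative-grid sample.

    Returns the ratios of the grid-average, integrated, and maximum
    24-h rainfall, or None when the data coverage is below 50%, per
    the rejection criterion in the text.  Note that with a common
    coverage mask the average and integrated ratios coincide.
    Assumes the observed totals and maxima are nonzero.
    """
    if coverage_mask.mean() < min_coverage:
        return None                        # sample rejected
    f = np.where(coverage_mask, fcst, 0.0)
    o = np.where(coverage_mask, obs, 0.0)
    n = coverage_mask.sum()
    return {
        "average": (f.sum() / n) / (o.sum() / n),
        "integrated": f.sum() / o.sum(),
        "maximum": f.max() / o.max(),
    }
```

Binning these ratios over all qualifying samples yields number-frequency distributions of the kind shown in Figs. 6 and 7.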

When an event was predicted (Fig. 6), about half of the 32 qualifying forecasts were within 25% of the observations for all three quantities. Most of the other forecasts were within a factor of 1.75, though the distribution was slightly skewed in favor of greater observed values. No forecasts had grid average or integrated ratios greater than 1.75, while several had ratios of 0.5 or less. Otherwise, the distribution was fairly narrow and well balanced about a value of unity. This was not the case when an event was observed (Fig. 7). Less than one-third of the 72 qualifying events displayed average, integrated, and maximum ratios that were within 25% of unity, and nearly half of the events had average and integrated ratios that fell within the 0.67 bin. All three distributions were skewed toward a prevalence of severe underestimations with almost no large overestimations. The maximum precipitation ratios were the most heavily skewed, indicating a prevalence of broad areas of light rain in the predictions when heavy rain maxima were observed.

The distribution of the rainfall bias with rain amount is further investigated by calculating the ratio-based rainfall bias on the composite-grid over a number of rainfall amount categories (Fig. 8). These biases were calculated by dividing the area of the predicted 24-h rainfall by the corresponding observed area in each category. The calculations were performed for both the observation-based and forecast-based composites. The predicted and observed areas were derived for each event and then added such that the ratios were calculated once for the integrated total.
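The categorical area-bias computation can be sketched as follows. The sketch assumes `samples` is a list of (forecast, observed) relative-grid array pairs and that each category is a half-open rainfall range in millimeters; both conventions are illustrative assumptions.

```python
import numpy as np

def areal_bias_by_category(samples, categories):
    """Ratio-based rainfall bias per amount category (cf. Fig. 8).

    For each (lo, hi) category, the number of forecast and observed
    grid points whose 24-h rainfall falls in [lo, hi) is totaled over
    all relative-grid samples, and the forecast/observed area ratio is
    then computed once from the integrated totals, as in the text.
    """
    biases = []
    for lo, hi in categories:
        f_area = sum(((f >= lo) & (f < hi)).sum() for f, _ in samples)
        o_area = sum(((o >= lo) & (o < hi)).sum() for _, o in samples)
        biases.append(f_area / o_area if o_area else np.nan)
    return biases
```

Summing areas over all events before dividing, rather than averaging per-event ratios, prevents small-area events from dominating the statistic.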

When events were observed, the forecast-to-observation biases were high at low rain thresholds and low at high thresholds. In contrast, the forecast-based event biases were more uniform and close to unity at most values except at the highest and lowest thresholds. These trends reflect an underprediction tendency in the model in that it often predicted light rain when heavy rain was observed. However, in those cases when the model did predict heavy rain, the forecast was often correct.

## 5. Diagnostic statistics

The results to this point have established that the model has difficulties producing enough precipitation in many situations when heavy precipitation is observed. Ideally such deficits will be remedied, but short of that, knowledge of when and where a forecast goes bad is quite useful. Do the underestimates have any common factors? As an attempt to answer this, the composite of all observed events where the grid-average and integrated forecast precipitation ratio was less than 0.8 was derived. Since this made up the majority of the observed events (50 out of 78 total), the mean precipitation distributions (Fig. 9) were quite similar to those in Fig. 5. The errors were not completely random, as indicated by the maximum in the dependent forecast composite in Fig. 9. Many of these forecasts featured distinct areas of precipitation located relatively close to the observed area.

Perhaps the most revealing field was the percentage of the predicted rainfall total attributed to convectively parameterized precipitation (Fig. 10). Between 50% and 90% of the predicted precipitation near the center of the observed precipitation maximum (Fig. 10a) was produced by the Kain–Fritsch convective scheme. Contrast that with the percentage of convective precipitation associated with the composite of events where heavy precipitation was predicted by the model (Figs. 4, 10b). There, the convective contribution over the highest observed rainfall areas was about half that in Fig. 10a.

Not surprisingly, events that are strongly forced and/or well resolved result in enough explicit precipitation to compensate for deficits in the prediction of convective precipitation. Many of these made up the forecast-based composite of heavy rain events. Conversely, many of the observation-based events were highly convective, and model precipitation was more reliant on the convective parameterization scheme. The north-to-south gradient in the convective contribution fields (Fig. 10a) shows that most of the resolved precipitation fell in the northern portion of the average event. In the Northern Hemisphere, much of the strongest synoptic-scale forcing is located in the northern portion of most baroclinic systems. Convection dominates farther south in the moist, unstable warm sector. Visual inspection of multiple forecasts indicates that the northward phase shifts in the forecast precipitation maxima result from too much resolved precipitation in the north, and too little convective precipitation farther south. The convective scheme appeared to be triggering close to the right place but was not producing enough precipitation. Many factors may contribute to this problem, such as temperature and moisture biases as well as deficits in the convective parameterization scheme.

A typical example of this type of forecast error is shown in Fig. 11. The 24-h forecast valid for 1200 UTC 16 May 2003 depicts a broad shield of resolved precipitation across southern Indiana, Illinois, Missouri, and Kansas, giving way to convectively parameterized precipitation farther south. Observed amounts more than doubled the forecasts in Arkansas, while farther north in central Missouri and Illinois the forecasts contained more precipitation. Although the exact error field is quite complex, the area of resolved precipitation generally extended too far north and contained too much precipitation while convective amounts were too light. Other errors were apparent, such as the phase error in the convective line through central Arkansas, but these are very difficult to systematically characterize and similar types of errors occur almost randomly from event to event.

The general errors described above were intuitively suspected based on anecdotal accounts and individual case studies, though the errors were never systematically quantified. Compositing temperature and moisture variables as well as the convective trigger function could further determine the nature of the errors and why they occur. From a user’s perspective, a forecaster might want to know the probability that heavy rain will occur within a certain radius of a predicted event. Statistics detailed in the next section can help answer this question.

## 6. Conditional statistics and grid size dependency

Nachamkin (2004) used the differences in the composite conditional biases to define the degree of error contributed by missed forecasts, false alarms, and large phase errors. In general false alarms contribute to a positive tendency in the forecast-based composite bias, as defined by the forecasts minus the observations, while missed forecasts contribute to a negative tendency in the observation-based composite bias. Nachamkin (2004) noted that the error sensitivity of the conditional biases depends on the size of the relative grid. Large sample grids display a greater tolerance for displacement errors due to the increased sampling area. In general, small-scale errors contribute less to the conditional bias with increasing sample grid size. These errors often cancel one another, or are overwhelmed by large areas of correctly predicted dry weather with near-zero bias. As the sample grids approach the size of the model grid the conditional biases approach the magnitude of the true model bias. This behavior is due to the loss of coherence with increasing distance from the set of common events. The rate at which this coherence drops depends on the diversity in the shapes and sizes of the events as well as the variability within each event. Such grid size dependence suggests that summarizing the conditional statistics at a single grid size, as was done by Nachamkin (2004), may not fully represent the breadth of the error. Instead, sampling along a range of relative grid sizes integrates the composite distributions in Figs. 4 and 5 into forms better suited for model intercomparison.

These concepts are demonstrated in Fig. 12, which shows the 24-h relative grid statistics for both the observation- and forecast-based conditional composites calculated at successively increasing concentric grid squares centered at grid center. The curves in Fig. 12 depict the change in the composite mean with successively increasing averaging area. The conditional biases as well as the conditional bias differences (CBDs) (Fig. 12a) are quite large on the smallest grids compared to the means of the independent sample variables (Fig. 12b). The highest precipitation amounts were often located near (though not directly at) the center of the events, and even minor displacement errors contributed to large biases in these areas due to the strong precipitation gradients. Errors due to poor resolution of subgrid-scale phenomena may also contribute at this scale. Both bias curves converge toward values near −4 mm at large areas, but the forecast-based biases do so more rapidly. At grid dimensions of 15 points × 15 points the forecast-based bias reaches zero, meaning that the average predicted rainfall biases are negligible on grids of this size placed at the center of a forecast event. In this case the systematic displacement errors were primarily responsible for the positive bias on the smaller grids. Without these displacements the forecast-based biases would have converged even more rapidly toward zero. The systematic displacement component in the conditional bias could be removed by applying methods of Ebert and McBride (2000), though that was not performed in this study. In contrast to the forecast-based bias, the observation-based biases stayed negative for all grid sizes, reflecting the severe underestimation errors.
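The concentric-square averaging behind the bias curves can be sketched as follows, assuming odd-dimensioned square composite grids such as the 31 × 31 relative grid used here; the function name is illustrative.

```python
import numpy as np

def concentric_bias(fcst_comp, obs_comp):
    """Conditional bias versus averaging area (cf. Fig. 12a).

    Averages forecast-minus-observed composite rainfall over
    successively larger concentric squares centered on the relative
    grid, from the single center point out to the full grid.
    Assumes odd-dimensioned square composite arrays.
    """
    n = fcst_comp.shape[0]
    c = n // 2                             # center index
    curve = []
    for half in range(c + 1):
        sl = slice(c - half, c + half + 1)
        # mean bias over the (2*half + 1) x (2*half + 1) square
        curve.append(float(fcst_comp[sl, sl].mean()
                           - obs_comp[sl, sl].mean()))
    return curve
```

Computing the curve for both the forecast-based and observation-based composites, and differencing them, gives the conditional bias differences discussed above.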

In addition to predicting the occurrence of an event at the correct place and time, the model should be able to reproduce the general physical and spatial characteristics of the events. The independent sample composites are ideal for investigating these properties because each composite consists of coherently superimposed events collected entirely from one source. The independent sample of the forecast-based events represents the average structure of the predicted events, while the independent sample of the observation-based events represents the average structure of the observed events. The grid size dependence of the mean of the independent sample composites (Fig. 12b) conveys information regarding the ability of the model to reproduce general event structure. These distributions show that mean values for the predicted events peaked at higher values near grid center than the observations. However, the rate of decline of the mean with sample grid scale was higher for the forecasts. The standard deviations (Fig. 12b) followed similar trends, though they declined far more slowly with grid size. Observed 24-h rainfall occurrence frequencies (not shown) exceeded the forecasts by normalized factors of 30%–60% for all thresholds above 25 mm, indicating the higher forecast means were not due to greater maxima. Instead, the heavy precipitation maxima within the observed events were more broadly distributed about the event center compared to the forecasts.

So, given these statistics, how “good” are the precipitation event forecasts in the absolute sense? Any meaningful verification system should be able to objectively resolve the many gradations of quality associated with a given set of forecasts. Since the composite method is relatively new, it has been applied only to a limited set of event forecasts created from a single model, so the full sensitivity of the measurements is not yet known. However, Nachamkin (2004) applied the method to the mistral, which by all subjective and objective accounts was very well predicted. The high-quality mistral forecasts can be thought of as a base state representing the statistics for “good” forecasts. By comparing these to the precipitation statistics from the current study, the sensitivity of the composite statistics can be investigated. Small conditional biases (compared to the means) and independent sample composites that closely resemble one another are strong indications of “good” forecasts,^{2} and these qualities were reflected in the mistral statistics (Fig. 13). Not only were the conditional biases small, they also displayed weak grid size dependence (Fig. 13a). The conditional bias differences were only about 10% of the independent sample means, while this ratio was closer to 50%–100% for the precipitation events. Differences in structure between the independent samples of the wind and precipitation events were also quite apparent. Mean wind values decreased slowly with increasing relative grid size, and the standard deviations were actually minimized near grid center (Fig. 13b). Mistrals resembled one another to the point that compositing them reduced the wind speed standard deviation. This was not the case for the precipitation events. Clearly, the precipitation events possess considerable natural variability.
The magnitude of the standard deviations indicates that this natural variability places significant limits on the ability to quantify error in meaningful ways.

## 7. Discussion

Many of the issues encountered in this study relate to the variance of the forecast and observed fields. Following Murphy (1988), the mean-square error can be written as

$$\mathrm{mse} = (\bar{f} - \bar{o})^2 + S_f^2 + S_o^2 - 2 S_f S_o r_{fo},$$

where $\bar{f}$ and $\bar{o}$ are the forecast and observed means, and $S_f^2$, $S_o^2$, and $r_{fo}$ represent the predicted variance, the observed variance, and the correlation coefficient, respectively. For a fixed correlation, the rms error increases as the variance increases. Furthermore, increased variance tends to increase the sensitivity of the correlation coefficient to small displacements. Thus, collection methods that reduce the variance, such as collecting events of similar type, help identify systematic errors in highly variable fields. The collection categories should be broad enough to include enough events for a statistically significant sample, while being narrow enough to reduce the variance. Unfortunately, robust deterministic statistics are difficult to derive from precipitation events because of the extreme variability. Even in this study, where relatively high precipitation thresholds were applied to coarse (for convection) 27-km model grids with 24-h integrated precipitation accumulation fields, the standard deviations were not reduced near the composite center. Imagine the extreme variability associated with near-instantaneous radar images and high-resolution (∼3 km) model output!
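As a sanity check, the Murphy (1988) decomposition of the mean-square error into a squared bias, the two variances, and a covariance term can be verified numerically on synthetic data (a sketch only; the distributions and sample size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(10.0, 4.0, 1000)            # synthetic forecasts
o = 0.6 * f + rng.normal(0.0, 3.0, 1000)   # correlated synthetic observations

mse = np.mean((f - o) ** 2)

# Decomposition: squared bias + both variances - the covariance term.
bias2 = (f.mean() - o.mean()) ** 2
s_f, s_o = f.std(), o.std()
r_fo = np.corrcoef(f, o)[0, 1]
decomp = bias2 + s_f ** 2 + s_o ** 2 - 2 * s_f * s_o * r_fo

# mse and decomp agree to floating-point precision.
```

Increasing the variance of either field inflates `mse` even when `r_fo` and the bias are held fixed, which is the point made above.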

This lack of determinism raises questions about how much useful information can be collected from each event for verification purposes. Each event is different, so all but the most robust statistics are lost in the average. This is problematic and counterintuitive given the highly deterministic nature of the mesoscale forecasts. Ideally, one would want to know the extent to which the predicted deterministic structures will exist. Much of this knowledge can be inferred through refinements in the event collection criteria. Information regarding particular characteristics, such as precipitation type, orientation or skew, or the nature of the maxima (cellular versus linear), can be specified. However, highly deterministic information such as the exact structure, location, and shape of each convective cell may be too specific. Much of this information will be lost when compositing multiple events. Regardless of the refinement, some element of probability will inevitably remain. Though the absolute error is invariant, the contribution associated with the field variance depends on the collection method, so the criteria should be chosen carefully.

Although refinements to the collection criteria result in a greater number of categories, each with lowered statistical representativeness, such compromises are necessary to fully understand the nature of the error in terms of the phenomenon of interest. Without these refinements, the signal-to-noise ratio becomes small and statistical significance is lost. This behavior is demonstrated graphically in Fig. 12b. For small to moderate sample areas the composite means are higher than the standard deviations because of the organizing influence of the sample constraints. However, the means decline far more rapidly with increasing grid size than the standard deviations, and on the larger sample grids the standard deviations are on the order of the means. This behavior can cause problems when intercomparing errors from multiple models. High standard deviations reduce the confidence that the mean error from one model differs significantly from that of another, requiring larger mean differences and more observations to attain statistical significance. As the standard deviations approach the means, statistical significance becomes increasingly difficult to attain. This is especially true for models of relatively high quality, where the forecast means are close to the observed means. Events should be selected and sampled to maximize the signal-to-noise ratio by elevating the mean as far above the standard deviation as possible. At the same time, statistical relevance should not be sacrificed through overly specific criteria. Unfortunately, high variability reduces the number of common events in a given category, increases the standard deviation within each category, and increases the number of categories needed to fully describe the field. This paradox limits most attempts to characterize the error over the entire field in terms of a single robust measurement.
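The trade-off between mean differences, standard deviations, and sample size can be made concrete with a standard two-sample power approximation. This is a sketch only; the function, the two-sided 5% significance level, and the 80% power setting are illustrative and not anything used in the study:

```python
import math

def n_per_model(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Approximate number of cases per model needed to detect a
    mean-error difference `delta` between two models whose errors
    have standard deviation `sigma` (two-sided 5% test, 80% power)."""
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# The required sample grows with the square of sigma/delta, so as the
# standard deviations approach (and exceed) the mean differences of
# interest, the demand for cases grows rapidly.
small_sigma = n_per_model(delta=1.0, sigma=1.0)
large_sigma = n_per_model(delta=1.0, sigma=5.0)
```

Quintupling the standard deviation at a fixed mean difference multiplies the required sample by roughly 25, which is why high-variance precipitation fields make model intercomparison so difficult.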

Finally, it is interesting to compare attributes of the composite method to entity-based methods such as that of Ebert and McBride (2000). The major differences between the two approaches hinge on the data collection and comparison strategies. The final statistics for the entity-based methods are averaged from the daily statistics taken separately from each event, whereas the composite statistics are taken once from the total distribution of all qualifying events. The results are not identical because averaging daily statistics gives equal weight to all events, while taking statistics on the final distribution gives greater weight to the largest or most intense events. Neither method is free from bias. A single bad forecast of a very intense event can bias the composite statistics, whereas bad forecasts of small events can bias the entity-based statistics. In both cases, careful interpretation of multiple statistical measures from the set of collected events can alleviate this problem. The sample bias can also be reduced by restricting the breadth of the sample sets to specific types of events.
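The weighting difference between the two averaging strategies is easy to see in a toy example (all numbers are hypothetical):

```python
import numpy as np

# Per-event absolute errors: two small events and one large, intense event.
events = [
    np.array([1.0, 1.0]),                            # small event, low error
    np.array([1.0, 1.0]),                            # small event, low error
    np.array([9.0, 9.0, 9.0, 9.0, 9.0, 9.0]),        # large event, high error
]

# Entity-style: average the per-event means (every event weighted equally).
entity_mean = np.mean([e.mean() for e in events])

# Composite-style: pool all points first (large events dominate).
pooled_mean = np.concatenate(events).mean()
```

Here the pooled mean exceeds the entity mean because the large event contributes six points to the pooled distribution but only one vote to the entity average.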

The entity-based statistics provide more detailed accounts of the daily error with respect to the events and are especially useful in diagnosing phase errors. This utility requires well-correlated fields and high-quality observations, as observed and predicted events are not easily paired where the forecasts and observations are poorly correlated. Choosing low precipitation thresholds increases the number of correlated matches, but the interevent variability also increases. As noted above, increased variance reduces the statistical significance of the results. Also, since the error decomposition is reference frame dependent, specific errors associated with high precipitation thresholds in complex fields may not be isolated. This loss of specificity results in the same ambiguities that plague the simpler statistical measures. Convolving (Bullock et al. 2004) or clustering (C. Marzban 2004, personal communication) may alleviate these problems by allowing more complex entities to be correspondingly paired.

## 8. Conclusions

The simple composite statistics performed in this study reveal the most robust trends pertinent to the COAMPS heavy precipitation forecasts for the warm season. When the model predicted an event, the average observed rainfall amount was generally within a factor of 2 of the forecasts within the confines of the relative grid. The observed occurrence frequencies, as well as the center of the heaviest observed average rainfall, were shifted south and east of the predicted maxima by about three grid points, or 81 km. By far the most prevalent problem was the underprediction of observed heavy rain events, many of which were highly convective. Further diagnostic studies are needed to fully understand the causes of these errors, as temperature and moisture biases as well as deficiencies in the convective parameterization may all contribute. Further statistics indicate that the phase shift and the underprediction errors are interrelated through the propensity for stronger resolved dynamics in the northern portions of Northern Hemisphere synoptic systems. The dynamics tend to support heavier grid-resolved rainfall in the north in comparison with the convectively dominated precipitation farther south.

Thus far the response to this method has been generally positive. The results complement the threat score, which is often difficult to interpret because of its sensitivity to small errors in high-variance forecasts. The database nature of the composites has proven quite useful in grouping together forecasts with varying degrees of error, allowing more focused study of those forecasts with the greatest errors. To date, most of the tendencies illuminated by compositing were intuitively known, though the errors had never been adequately quantified. Compositing assigns numerical value to this intuition, which in turn allows for a quantifiable measure of improvement.

The interplay of the statistical errors demonstrates that the verification problem, like many problems in nature, can be approached using relational databases. The challenges come primarily from specifying the proper collection strategies and extracting meaningful relationships from the assembled data. Regarding the latter, projecting meteorological and physical relationships onto the data structures serves as a natural guide. The most robust and meaningful results come about when the data are analyzed with the meteorology in mind. This is especially true if known biases exist in certain parameterizations. Regarding the former, natural variability and statistical significance must be carefully balanced. In many ways the sampling issue illustrates the paradox facing mesoscale verification. Collecting information over broad event composites condenses the general model performance into a few statistics. However, high variability dilutes the statistics, and much of the detailed information is hidden in the standard deviation of the fields. Focusing the samples emphasizes the forecast–observation differences by reducing the amount of natural variability, but this comes at the expense of simplicity, statistical significance, and the overall breadth of the score. Unfortunately, there is no simple answer. As models become more explicit, their application as well as their verification will likely become very situation and user specific.

## Acknowledgments

This research is supported by the Office of Naval Research (ONR) through Program Element 62435N and the Space and Naval Warfare Systems Command (SPAWAR) through Program Element 603207N. Computing time was supported in part by a grant of high performance computing (HPC) time from the Department of Defense Major Shared Resource Center, Stennis Space Center, Mississippi. The work was performed on a Cray SV1. Computing time was also supported by an HPC grant from FNMOC as part of their Distributed Center for computing. The work was performed on an SGI Origin supercomputer. The RFC precipitation data were provided to us by Mike Baldwin.

## REFERENCES

Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2002: Development of an “events-oriented” approach to forecast verification. Preprints, *19th Conf. on Weather Analysis and Forecasting*, San Antonio, TX, Amer. Meteor. Soc., 255–258.

Barker, E. H., 1992: Design of the navy’s multivariate optimum interpolation analysis system. *Wea. Forecasting*, **7**, 220–231.

Bullock, R., B. G. Brown, C. A. Davis, K. W. Manning, and M. Chapman, 2004: An object-oriented approach to quantitative precipitation forecasts. Preprints, *17th Conf. on Probability and Statistics in the Atmospheric Sciences*, Seattle, WA, Amer. Meteor. Soc., CD-ROM, J12.4.

Chen, S., J. E. Nachamkin, J. M. Schmidt, and C. S. Liou, 2002: Quantitative precipitation forecast for the Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). Preprints, *19th Conf. on Weather Analysis and Forecasting*, San Antonio, TX, Amer. Meteor. Soc., 202–205.

Colle, B. A., C. F. Mass, and D. Ovens, 2001: Evaluation of the timing and strength of MM5 and Eta surface trough passages over the eastern Pacific. *Wea. Forecasting*, **16**, 553–572.

Davies, H. C., 1976: A lateral boundary formulation for multi-level prediction models. *Quart. J. Roy. Meteor. Soc.*, **102**, 405–418.

Ebert, E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.*, **239**, 179–202.

Hodur, R. M., 1997: The Naval Research Laboratory’s Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). *Mon. Wea. Rev.*, **125**, 1414–1430.

Hoffman, R. N., Z. Liu, J-F. Louis, and C. Grassotti, 1995: Distortion representation of forecast errors. *Mon. Wea. Rev.*, **123**, 2758–2770.

Hogan, T. F., M. S. Peng, J. A. Ridout, and W. M. Clune, 2002: A description of the impact of changes to NOGAPS convection parameterization and the increase in resolution to T239L30. NRL Memo. Rep. NRL/MR/7530-02-52, Naval Research Laboratory, Monterey, CA, 10 pp.

Kain, J. S., and J. M. Fritsch, 1993: Convective parameterization for mesoscale models: The Kain–Fritsch scheme. *The Representation of Cumulus Convection in Numerical Models*, *Meteor. Monogr.*, No. 46, Amer. Meteor. Soc., 165–170.

Koch, S. E., 1985: Ability of a regional-scale model to predict the genesis of intense mesoscale convective systems. *Mon. Wea. Rev.*, **113**, 1693–1713.

Mass, C. F., D. Ovens, K. Westrick, and B. A. Colle, 2002: Does increasing horizontal resolution produce more skillful forecasts? The results of two years of real-time numerical weather prediction over the Pacific Northwest. *Bull. Amer. Meteor. Soc.*, **83**, 407–430.

Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. *Mon. Wea. Rev.*, **116**, 2417–2424.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338.

Nachamkin, J. E., 2004: Mesoscale verification using meteorological composites. *Mon. Wea. Rev.*, **132**, 941–955.

Olson, D. A., N. W. Junker, and B. Korty, 1995: Evaluation of 33 years of quantitative precipitation forecasting at the NMC. *Wea. Forecasting*, **10**, 498–511.

Rutledge, S. A., and P. V. Hobbs, 1983: The mesoscale and microscale structure and organization of clouds and precipitation in midlatitude cyclones. VIII: A model for the “seeder-feeder” process in warm-frontal rainbands. *J. Atmos. Sci.*, **40**, 1185–1206.

Sandgathe, S. A., and L. Heiss, 2004: MVT—An automated mesoscale verification tool. Preprints, *17th Conf. on Probability and Statistics in the Atmospheric Sciences*, Seattle, WA, Amer. Meteor. Soc., CD-ROM, J13.1.

Schmidt, J. M., 2001: Moist physics development for the Naval Research Laboratory’s Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). BACIMO, Fort Collins, CO, 16 pp.

White, G. B., J. Paegle, W. J. Steenburgh, J. D. Horel, R. T. Swanson, L. K. Cook, D. J. Onton, and J. G. Miles, 1999: Short-term forecast validation of six models. *Wea. Forecasting*, **14**, 84–108.

Zepeda-Arce, J., E. Foufoula-Georgiou, and K. K. Droegemeier, 2000: Space–time rainfall organization and its role in validating quantitative precipitation forecasts. *J. Geophys. Res.*, **105** (D8), 10129–10146.