Beyond the Basics: Evaluating Model-Based Precipitation Forecasts Using Traditional, Spatial, and Object-Based Methods

Jamie K. Wolff, Michelle Harrold, Tressa Fowler, John Halley Gotway, Louisa Nance, and Barbara G. Brown

National Center for Atmospheric Research*/Research Applications Laboratory and Developmental Testbed Center, Boulder, Colorado


Abstract

While traditional verification methods are commonly used to assess numerical model quantitative precipitation forecasts (QPFs) using a grid-to-grid approach, they generally offer little diagnostic information or reasoning behind the computed statistic. On the other hand, advanced spatial verification techniques, such as neighborhood and object-based methods, can provide more meaningful insight into differences between forecast and observed features in terms of skill with spatial scale, coverage area, displacement, orientation, and intensity. To demonstrate the utility of applying advanced verification techniques to mid- and coarse-resolution models, the Developmental Testbed Center (DTC) applied several traditional metrics and spatial verification techniques to QPFs provided by the Global Forecast System (GFS) and operational North American Mesoscale Model (NAM). Along with frequency bias and Gilbert skill score (GSS) adjusted for bias, both the fractions skill score (FSS) and Method for Object-Based Diagnostic Evaluation (MODE) were utilized for this study with careful consideration given to how these methods were applied and how the results were interpreted. By illustrating the types of forecast attributes appropriate to assess with the spatial verification techniques, this paper provides examples of how to obtain advanced diagnostic information to help identify what aspects of the forecast are or are not performing well.

The National Center for Atmospheric Research is sponsored by the National Science Foundation.

Corresponding author address: Jamie K. Wolff, NCAR/RAL, P.O. Box 3000, Boulder, CO 80307-3000. E-mail: jwolff@ucar.edu


1. Introduction

A well-established approach in the numerical weather prediction (NWP) community for assessing quantitative precipitation forecasts (QPFs) is based on the use of traditional verification methods (Jolliffe and Stephenson 2011; Wilks 2011), which require near-perfect spatial and temporal placement for a forecast to be considered good. These approaches tend to favor the smoother forecast fields of coarser-resolution models and offer little or no meaningful insight regarding the reasons a forecast is assessed to be good or bad. It is also widely acknowledged that using traditional verification metrics for evaluation may unfairly penalize and fail to show the benefits of higher-resolution forecasts (Mass et al. 2002; Done et al. 2004; Davis et al. 2006; Clark et al. 2007; Ebert 2009). In contrast, more advanced spatial verification techniques (Ebert 2008, 2009; Ahijevych et al. 2009; Gilleland et al. 2009, 2010), such as neighborhood methods, can provide information on the spatial scale at which a forecast becomes skillful, and object-based methods can provide information on differences between forecast and observed features in terms of coverage areas, displacement, orientation, and intensity.

Numerous studies (e.g., Mittermaier and Roberts 2010; Duda and Gallus 2013; Johnson et al. 2013; Mittermaier et al. 2013; Clark et al. 2014) have demonstrated the utility of applying advanced spatial verification techniques to high-resolution models (<5-km horizontal grid spacing), whereas the application of these methods to mid- (5–20 km) and coarse-resolution (>20 km) models is not well documented in the literature—though there is no fundamental reason suggesting they would be inappropriate. The Developmental Testbed Center (DTC) utilized output from the Global Forecast System (GFS; EMC 2003) and the operational North American Mesoscale Model (NAM; Janjić 2003, 2004) to compare and contrast QPF performance when assessed using traditional, spatial, and object-based verification methods. Traditional verification metrics computed for this test included frequency bias and Gilbert skill score (GSS) with an adjustment accounting for the bias in the forecast.

To further investigate errors in the simulated mesoscale QPF features (with scales from a few to several hundred kilometers), two spatial techniques were also examined: the fractions skill score (FSS; Roberts and Lean 2008) and the Method for Object-Based Diagnostic Evaluation (MODE; Davis et al. 2006, 2009). These state-of-the-art verification techniques offer more diagnostic information when assessing forecast performance than do the widely applied traditional methods. The comprehensive evaluation conducted for the GFS and NAM QPFs offers an opportunity to illustrate advantages of applying these more advanced spatial techniques and suggests some “best practices” when using these methods for mid- and coarse-resolution models.

2. Data

a. Precipitation analyses

For this evaluation, forecast precipitation amounts in accumulation periods of 3 h were assessed utilizing the hourly 4-km National Centers for Environmental Prediction (NCEP) stage II analyses summed into 3-h accumulations. As summarized in Lin and Mitchell (2005), NCEP stage II refers to a real-time, high-resolution, multisensor precipitation analysis from hourly radar precipitation estimates and hourly rain gauge data. While some initial quality control steps are included in the stage II analysis (e.g., removal of anomalous propagation), no manual quality control (QC) is performed and, thus, some inherent biases may exist in the dataset. Spurious areas of precipitation can be a result of radar artifacts (e.g., beam blockage) not corrected for in the QC algorithms; this could potentially lead to spatially varying biases in the analysis field (Hunter 1996; Fulton et al. 1998). While it is acknowledged that NCEP’s stage IV analyses are produced with more advanced algorithms and some manual QC procedures, major benefits of the stage II analysis include its timeliness and consistency in producing hourly analyses and its near-full coverage over the contiguous United States (CONUS) (Lin and Mitchell 2005). For the latter reason, stage II analyses were chosen for use in this study. While issues with radar coverage are well documented in the mountainous regions of the western United States (Westrick et al. 1999; Maddox et al. 2002), the region is included in this study because demonstrating the application of objective verification techniques over this area can help forecasters and model developers better understand model QPF performance in this region.

b. Model output

Operational QPF output from GFS and NAM was retrieved from the National Oceanic and Atmospheric Administration (NOAA)/NCEP for 18 December 2008–15 December 2009. For this study, focus was placed on the 0000 UTC daily forecast initializations and the associated precipitation accumulations at 3-h intervals out to 84 h. The native datasets for the NAM output are on an Arakawa E grid staggered domain with approximately 12-km grid spacing, whereas the GFS output is on a global Gaussian grid with 0.5° × 0.5° (approximately 55 km) resolution. For this evaluation, the copygb program, developed by NCEP, was used to regrid the GFS and NAM native output onto the same grid as the precipitation analyses: a 4-km CONUS grid with a polar stereographic map projection. This step was necessary because the forecast and analysis fields must be collocated on the same grid in order to perform grid-to-grid comparisons. Thus, a choice had to be made regarding which grid to interpolate everything to for consistency. Choosing which common grid to interpolate to strongly depends on the research question that is being addressed. For this analysis, the decision to interpolate the GFS and NAM output to the 4-km observation domain was made for several reasons. In this case, we are interested in determining how well the models replicate the precipitation represented in the 4-km precipitation analysis and what is potentially gained from the higher-resolution model. While interpolating a coarse-resolution (e.g., GFS) QPF field to a higher-resolution grid will not artificially produce finescale structure, it is desired to preserve any additional smaller-scale structure provided by the midresolution (e.g., NAM) model. This approach also allows the precipitation analyses to remain on their native grid and not be subjected to any interpolation. Finally, since FSS is one of the spatial verification approaches being applied in this study—which allows comparisons across many different spatial scales—it is valuable to be able to start examining results at the finest resolution possible.

The budget interpolation option in the copygb program, also known as the remapping or simple nearest-neighbor averaging method [described in Baldwin (2012) and Accadia et al. (2003)], was utilized. This approach conserves the total area-average precipitation amounts of the native grid. While Accadia et al. (2003) demonstrated that interpolation can have a statistically significant impact on the resulting verification scores, they concluded that utilizing the budget interpolation option provides skill scores that “are generally closer to those computed on the native grid.”
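For readers who wish to reproduce this step, the following minimal Python sketch illustrates the idea behind budget interpolation: each target grid cell is subsampled, each subsample takes the nearest-neighbor source value, and the subsamples are averaged so that area-mean precipitation is approximately conserved. The regular latitude–longitude geometry and the 5 × 5 subsampling density are assumptions of the sketch; it does not reproduce the copygb implementation.

    import numpy as np

    def budget_interp(src, src_lat, src_lon, tgt_lat, tgt_lon, nsub=5):
        """Area-average ("budget") remapping sketch for regular lat-lon grids.

        Each target cell is subsampled nsub x nsub times; every subsample takes
        the nearest-neighbor source value, and the subsamples are averaged so
        that area-mean precipitation is approximately conserved."""
        dlat = tgt_lat[1] - tgt_lat[0]
        dlon = tgt_lon[1] - tgt_lon[0]
        offs = (np.arange(nsub) + 0.5) / nsub - 0.5  # subsample offsets within a cell
        out = np.empty((tgt_lat.size, tgt_lon.size))
        for j, lat in enumerate(tgt_lat):
            for i, lon in enumerate(tgt_lon):
                vals = []
                for oy in offs:
                    for ox in offs:
                        # nearest source grid point for this subsample location
                        jj = np.abs(src_lat - (lat + oy * dlat)).argmin()
                        ii = np.abs(src_lon - (lon + ox * dlon)).argmin()
                        vals.append(src[jj, ii])
                out[j, i] = np.mean(vals)
        return out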

3. Verification

a. Measures

To demonstrate the utility of applying advanced verification techniques to mid- and coarse-resolution models, several traditional metrics and spatial verification techniques were applied to QPFs provided by GFS and NAM. The Model Evaluation Tools (MET; Fowler et al. 2010) software package, which offers a wide variety of verification methods, was utilized in conjunction with software in the R statistical language (R Development Core Team 2013). The basic concept behind each approach is provided in the following section, while specific details on how each method was implemented for this study are provided in section 4, along with a discussion of the results.

1) Traditional metrics

As is commonly used in the NWP community, grid-to-grid comparisons of the forecast and observation fields were performed and standard 2 × 2 contingency tables (Table 1) were created for a range of accumulation thresholds and forecast lead times, from which a variety of verification measures and skill scores can be computed (Jolliffe and Stephenson 2011; Wilks 2011). For this study, the traditional metrics computed were frequency bias and GSS. Frequency bias [Eq. (1); terms defined in Table 1] measures the ratio of the frequency of forecast events to the frequency of observed events and indicates whether the forecast system has a tendency to underforecast (<1) or overforecast (>1) events (where subscript y indicates yes and n indicates no):
$$\text{frequency bias} = \frac{f_{yy} + f_{yn}}{f_{yy} + f_{ny}}. \qquad (1)$$
Table 1. Standard 2 × 2 contingency table. The f values represent counts of forecast–observation pairs of yes–no values (the first subscript denotes the forecast and the second the observation).
The Gilbert skill score [Eq. (2)] measures the fraction of observed events that were correctly predicted and is adjusted for the expected number of hits associated with random chance; it is a widely used metric for evaluating accuracy in precipitation forecasts. However, a downside to using GSS is that values can be inflated by model overprediction [i.e., frequency bias values over 1; Baldwin and Kain (2006)]. To account for this inherent problem, an adjustment similar to that discussed by Hamill (1999) was made to the GFS and NAM QPFs, separately, in order to debias the model forecasts prior to computing the GSS. The procedure includes, first, identifying the coverage area for each observed precipitation accumulation threshold of interest. Then, the forecast precipitation accumulation threshold that results in a similar coverage area, thereby providing a frequency bias as close to, without exceeding, 1 as possible, is identified. From there, the standard GSS is calculated using the observed and forecast precipitation accumulation thresholds with corresponding coverage areas. While the debiasing method removed nearly all of the bias in the GFS QPFs (i.e., frequency bias ≈ 1), the NAM QPFs were occasionally more difficult to debias, which resulted in a somewhat low bias after the adjustment. Values of GSS range from −⅓ to 1; a no-skill forecast would have a value of 0 and a perfect forecast would have GSS = 1. An event is defined when the specific threshold criteria are met and, otherwise, is considered a nonevent:
$$\text{GSS} = \frac{f_{yy} - f_{yy(\text{random})}}{f_{yy} + f_{yn} + f_{ny} - f_{yy(\text{random})}}, \qquad (2)$$

where

$$f_{yy(\text{random})} = \frac{(f_{yy} + f_{yn})(f_{yy} + f_{ny})}{f_{yy} + f_{yn} + f_{ny} + f_{nn}}$$

is the number of hits expected from a random forecast.
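To make Eqs. (1) and (2) and the debiasing step concrete, the following sketch computes frequency bias and GSS from binary event grids and applies a simple quantile-matching adjustment; the function names and the percentile-based threshold matching are assumptions of this illustration, not the exact operational procedure.

    import numpy as np

    def contingency(fcst_event, obs_event):
        """2 x 2 contingency counts from boolean forecast/observation grids."""
        f_yy = np.sum(fcst_event & obs_event)     # hits
        f_yn = np.sum(fcst_event & ~obs_event)    # false alarms
        f_ny = np.sum(~fcst_event & obs_event)    # misses
        f_nn = np.sum(~fcst_event & ~obs_event)   # correct negatives
        return f_yy, f_yn, f_ny, f_nn

    def frequency_bias(f_yy, f_yn, f_ny):
        """Eq. (1): ratio of forecast to observed event frequencies."""
        return (f_yy + f_yn) / (f_yy + f_ny)

    def gss(f_yy, f_yn, f_ny, f_nn):
        """Eq. (2): Gilbert skill score, adjusted for hits expected by chance."""
        total = f_yy + f_yn + f_ny + f_nn
        chance = (f_yy + f_yn) * (f_yy + f_ny) / total
        return (f_yy - chance) / (f_yy + f_yn + f_ny - chance)

    def debiased_gss(fcst, obs, obs_thresh):
        """Hamill (1999)-style adjustment: choose the forecast threshold whose
        coverage matches the observed coverage, then compute the standard GSS."""
        obs_event = obs >= obs_thresh
        coverage = obs_event.mean()                      # observed base rate
        fcst_thresh = np.quantile(fcst, 1.0 - coverage)  # matching coverage area
        fcst_event = fcst >= fcst_thresh
        return gss(*contingency(fcst_event, obs_event))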

2) Spatial techniques

To illustrate the additional diagnostic information provided by spatial verification methods, this study considered two categories of techniques: neighborhood (FSS) and feature based (MODE). First, FSS was applied to obtain an objective measure of how the forecast skill of each model varied with spatial scale. FSS includes the following steps, fully described in Roberts and Lean (2008): (i) convert the forecast F and observed O fields into binary fields for each threshold of interest, (ii) at each grid point across the full verification domain (Nx × Ny), compute the fraction of points within a square of length n that exceed the threshold, and (iii) compute the mean-squared error (MSE) of the fraction fields relative to a low-skill reference forecast MSEref, which equates to the largest possible MSE that would be found if no overlap between forecast and observed events occurred. FSS for a neighborhood of length n is given by
$$\text{FSS}_{(n)} = 1 - \frac{\text{MSE}_{(n)}}{\text{MSE}_{(n)\,\text{ref}}}, \qquad (3)$$

where

$$\text{MSE}_{(n)} = \frac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left[ O_{(n)i,j} - F_{(n)i,j} \right]^2$$

and

$$\text{MSE}_{(n)\,\text{ref}} = \frac{1}{N_x N_y} \left[ \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} O_{(n)i,j}^2 + \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} F_{(n)i,j}^2 \right],$$

with $O_{(n)i,j}$ and $F_{(n)i,j}$ the observed and forecast neighborhood fractions at grid point $(i, j)$.

The forecast skill associated with a uniform forecast is also defined by Roberts and Lean (2008) as the FSS that would be obtained at the grid scale (i.e., n = 1) for a forecast with a probability equal to the base rate at every point (FSSuniform = 0.5 + base rate/2). Here, the base rate is the fraction of the domain covered by the observed precipitation exceeding the threshold. The FSSuniform value falls approximately halfway between the random forecast skill (defined as the base rate, or fractional coverage of the domain) and perfect skill and is considered to be a reasonably skillful forecast at the lower bound of the useful spatial scales. Some advantages of FSS are that it is easy to implement, is less sensitive to localized errors than traditional metrics, and has a simple physical interpretation regarding the spatial scale at which forecasts are skillful. However, it provides only a limited level of diagnostic information, and no information on the spatial structure of the forecast being evaluated.
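A minimal sketch of Eq. (3) and the uniform forecast skill, assuming a uniform box filter is an acceptable stand-in for forming the neighborhood fraction fields, is given below; the scipy-based filtering and the simple edge handling are assumptions of the sketch and are not taken from the MET implementation.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def fss(fcst, obs, thresh, n):
        """Fractions skill score for an n x n neighborhood [Eq. (3)]."""
        f_bin = (fcst >= thresh).astype(float)
        o_bin = (obs >= thresh).astype(float)
        # fraction of points exceeding the threshold in each n x n neighborhood
        f_frac = uniform_filter(f_bin, size=n, mode="constant")
        o_frac = uniform_filter(o_bin, size=n, mode="constant")
        mse = np.mean((o_frac - f_frac) ** 2)
        mse_ref = np.mean(o_frac ** 2) + np.mean(f_frac ** 2)  # no-overlap reference
        return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

    def fss_uniform(obs, thresh):
        """Skill of a 'uniform' forecast: 0.5 + base rate / 2."""
        base_rate = np.mean(obs >= thresh)
        return 0.5 + base_rate / 2.0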

The second spatial verification approach applied is a feature-based method referred to as MODE. The process of identifying and verifying features (objects) with MODE is defined in Davis et al. (2006). Briefly, this approach consists of the following steps: (i) resolve forecast and observation objects—after convolving the raw fields and thresholding the smoothed data to create the resolved object, the raw precipitation values are reinserted within the objects for use in the remainder of the analysis; (ii) compute attributes (e.g., area and centroid) for each forecast and observation object identified; (iii) determine which objects in each field should be grouped together (merged); (iv) run a fuzzy logic algorithm on all possible pairs of forecast and observation objects to determine which should be matched between the two fields; and (v) write out attributes for single objects and pairs of matched forecast and observation objects to assess forecast quality. Because MODE was designed to automate the process of subjectively assessing a forecast field, it is generally intuitive to interpret and provides physically meaningful results. The method also provides extensive diagnostic information regarding the identified features within the forecast field. MODE, however, is highly configurable; the tuning of parameters will impact the process of identifying, merging, and matching features and, ultimately, the results (e.g., Clark et al. 2014). Thus, it is important to first determine the features of interest, and then select a set of MODE parameters that best capture the intended areas, prior to evaluation. Selecting the appropriate parameters is often an iterative process in order to determine the optimal configuration that best suits the research question; MODE parameter settings specific to this work are discussed in section 4.
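The matching step (iv) can be pictured as a fuzzy-logic weighted average of attribute-level interest values, as in the sketch below. The attribute names, the interest values, and the mapping of each attribute to [0, 1] are hypothetical and only meant to convey the idea; the relative weights shown match those listed for this study in section 4.

    def total_interest(pair_interest, weights):
        """Fuzzy-logic matching: weighted average of per-attribute interest
        values (each already mapped to [0, 1]) for one forecast-observation pair."""
        return (sum(weights[k] * pair_interest[k] for k in weights)
                / sum(weights.values()))

    # hypothetical object pair: attribute-level interest values in [0, 1]
    pair_interest = {"centroid_dist": 0.7, "boundary_dist": 0.9,
                     "angle_diff": 0.8, "area_ratio": 0.6, "int_area_ratio": 0.3}
    # relative weights as listed in section 4 (centroid distance 2, boundary
    # distance 4, orientation angle 1, area ratio 1, intersection/union ratio 2)
    weights = {"centroid_dist": 2, "boundary_dist": 4,
               "angle_diff": 1, "area_ratio": 1, "int_area_ratio": 2}

    matched = total_interest(pair_interest, weights) >= 0.7  # match if >= 0.7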

b. Methodology

Verification results were computed over the CONUS region (Fig. 1) for several temporal aggregations (Table 2) on the 4-km domain. The 3-h QPF verification scores for traditional metrics were evaluated every 3 h out to 84 h with a focus on a variety of accumulation thresholds (0.254, 0.508, 1.27, 2.54, 3.81, 6.35, 8.89, 12.7, and 25.4 mm) to include ordinary precipitation systems, as well as higher-impact events. Spatial techniques were also evaluated every 3 h out to 84 h, but for a subset of the accumulation thresholds used for the traditional metrics (described in greater detail below).

Fig. 1. Map showing the boundary of the CONUS verification domain (denoted by the boldface outline).

Table 2. Date ranges used to define temporal aggregations considered in this study.

Aggregate values were computed for the traditional and FSS methods, while median values of the distributions were used for the MODE attributes. Confidence intervals (CIs) at the 99% level were then computed for each statistic in order to estimate the uncertainty associated with sampling variability. With the large number of tests performed in this study, the more stringent confidence level of 99% is preferred to reduce the likelihood of obtaining significance by random chance. Observational uncertainty was not considered in this study. The CIs were computed using the appropriate statistical method (Gilleland 2010); in particular, either a bootstrapping technique was applied (for frequency bias, GSS, and FSS) or the standard error about the median was computed (for all MODE attributes except frequency bias). For the standard error algorithm, a normal distribution is assumed and the variance of the sample is considered, while bootstrapping provides an estimate of the uncertainty by applying a numerical resampling method. For this study, resampling with replacement was conducted 1500 times.

Forecasts from both operational NWP models were available for the same cases, which makes it possible to apply a pairwise difference methodology to the verification measures. This technique calculates differences between the NAM and GFS verification statistics and applies CIs to the difference statistic. The CIs for the pairwise differences between statistics for the two models provide an efficient, objective measure of whether the differences are statistically significant (SS); in particular, if the CIs for the pairwise differences include zero, the difference in performance is not SS. The pairwise difference was computed for GSS and FSS. For these verification measures, a positive (negative) difference indicates the NAM (GFS) has greater skill. Due to the nonlinear nature of frequency bias, it is not amenable to a pairwise difference calculation. Therefore, the more powerful pairwise difference method for establishing SS cannot be used and a more conservative estimate was employed based solely on whether the CIs of the aggregate statistic overlapped between the two models. If no overlap was noted, the frequency biases of the two models were considered statistically distinguishable at the 99% level.
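A minimal sketch of the resampling approach, assuming per-case score arrays are available for both models on the same cases, is shown below; it computes 99% percentile CIs from 1500 bootstrap replicates for an aggregate score and for the paired NAM − GFS difference. The synthetic data and the simple percentile method are assumptions of the illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_ci(values, stat=np.mean, n_boot=1500, alpha=0.01):
        """Percentile CI for an aggregate statistic via resampling with replacement."""
        values = np.asarray(values)
        reps = np.array([stat(rng.choice(values, size=values.size, replace=True))
                         for _ in range(n_boot)])
        return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

    # hypothetical per-case scores for the two models (same cases, same order)
    gss_nam = rng.normal(0.30, 0.05, size=200)
    gss_gfs = rng.normal(0.33, 0.05, size=200)

    lo, hi = bootstrap_ci(gss_nam - gss_gfs)  # CI on the pairwise (NAM - GFS) difference
    significant = not (lo <= 0.0 <= hi)       # SS at the 99% level if the CI excludes zero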

A key consideration related to obtaining meaningful verification results from aggregated datasets is ensuring that the underlying sample is consistent in physical characteristics. For traditional statistics, attributes such as threshold and valid time are most important for identifying meaningful subsets; in addition, for spatial verification approaches, the horizontal extent and intensity of the meteorological systems are also very important. The annual aggregation, along with the summer and winter seasons, is examined here when applying the traditional metrics and FSS. However, for the MODE analyses, the annual aggregation is not considered, and attention is focused on the individual summer and winter aggregated results. This approach is taken to ensure consistency among the identified meteorological systems included in the samples. Different accumulation thresholds were chosen for winter (0.254 mm) and summer (2.54 mm) to capture the meteorological systems generally of interest for each season; broader, synoptic-scale systems in the winter tend to produce larger areas of lighter precipitation, while smaller, convective-scale systems in the summer can produce more localized, higher precipitation totals.

4. Results

a. Traditional verification results

Traditional verification metrics have been widely used to assess the performance of forecast models for decades. Thus, it is useful to first establish baseline results using these standard metrics before demonstrating the additional information new spatial techniques can provide. Note that very few 3-h accumulations at and above 12.7 mm were found in the sample (the median frequency of observed events for this threshold was less than 1% for each of the temporal aggregations). This small sample size leads to higher uncertainty in the verification statistics for this and larger thresholds; for this reason, objective verification scores are only presented for thresholds below 12.7 mm. This is an example of the steps taken to understand the observational dataset being used and of recognizing when a sample size is too small to yield meaningful results.

1) Frequency bias

Time series plots of annually aggregated frequency bias show that the values for both models depend strongly on threshold and valid time, but have little variation as forecast lead time increases (Figs. 2a–d). The base rate, which is the ratio of total observed gridbox events to the total number of grid boxes summed over all cases, exhibits a peak between valid times 2100 and 0000 UTC and decreases with increasing threshold, where very few observations are associated with the largest accumulation values. Both models exhibit a strong diurnal signal with the largest frequency bias values at valid times near 1800 UTC (i.e., forecast hours 18, 42, and 66), while the smallest values are seen during the overnight hours (valid between 0300 and 1200 UTC). For the lowest three thresholds (Figs. 2a–c), the GFS has an SS high bias (i.e., where the lower bounds of the CIs are larger than 1) at most lead times, which transitions to an SS low bias at all but the 1800 UTC valid time for the highest threshold shown (Fig. 2d). While the NAM also has an SS high bias during the daytime hours, overnight the CIs more often encompass one, and the QPF is considered unbiased (Figs. 2a–c). Similar to the GFS, the NAM also transitions to an SS low bias for most lead times at the largest threshold shown (Fig. 2d). When compared to the GFS, the NAM has a statistically smaller bias at the 0.254- and 1.27-mm thresholds (Figs. 2a,b) throughout the forecast. For most lead times beyond 24 h, except at the 1800 UTC valid time for the 2.54-mm threshold and all forecast lead times for the 6.35-mm threshold, there are no SS differences between the GFS and NAM (Figs. 2c,d).

Fig. 2. Time series plots of frequency bias for 3-h QPFs aggregated across all model initializations (annual) for the (a) 0.254-, (b) 1.27-, (c) 2.54-, and (d) 6.35-mm thresholds. The GFS results are shown in red and the NAM results are in blue. The vertical bars represent the 99% CIs. The base rate is associated with the second y axis and shown in black.

When focusing on seasonal aggregations for a variety of thresholds at the 48-h lead time only, a uniform SS high bias for the winter season is found for GFS at all thresholds, whereas for NAM the bias is SS high for thresholds below 6.35 mm only, and the CIs encompass the value of one for larger thresholds (Fig. 3). For summer, both the GFS and NAM have SS low frequency biases at and above the 1.27-mm threshold. While the GFS forecasts are unbiased for thresholds below 1.27 mm, the NAM has an SS high frequency bias. The base rates for the summer and winter aggregations, which had the largest and smallest values, respectively, of any season, are also included in Fig. 3. The seasonal base rate influences the size of the CIs; the largest CIs bound the frequency bias values for the winter season, indicating a higher level of uncertainty in the aggregate value due to the smaller observed sample size.

Fig. 3. Threshold series plots of frequency bias for 3-h QPFs for the 48-h forecast lead time aggregated across the winter (solid) and summer (dash) seasons. The GFS results are shown in red and the NAM results are in blue. The vertical bars represent the 99% CIs. The base rates for the winter (solid) and summer (dash) aggregations are associated with the second y axis and shown in black.

2) GSS

A decrease in the debiased and annually aggregated GSS values for 3-h QPF with increasing threshold and forecast lead time is depicted in Figs. 4a–d. The lowest GSS values occur around the valid time of 0300 UTC, and the highest values are noted around 1200 UTC, except for the 6.35-mm precipitation accumulation threshold for which the highest values occur closer to 0900 UTC. This signal is associated with the times of generally higher and lower values of base rate, respectively. Pairwise differences for the annual aggregation reveal that the NAM forecast has an SS lower skill than the GFS (negative pairwise difference values) for all lead times at the 0.254-mm threshold (Fig. 4a). A similar result is noted at a majority of lead times for the 1.27- and 2.54-mm thresholds; the non-SS differences for these thresholds frequently correspond to the 1800 UTC valid time (Figs. 4b,c). Fewer SS pairwise differences are noted for the 6.35-mm threshold (Fig. 4d).

Fig. 4. Time series plots of debiased GSS for 3-h QPFs aggregated across all model initializations (annual) for the (a) 0.254-, (b) 1.27-, (c) 2.54-, and (d) 6.35-mm thresholds. The GFS results are shown in red, NAM results are in blue, and the pairwise difference (NAM − GFS) results are in green. The vertical bars represent the 99% CIs. The base rate is associated with the second y axis and shown in black.

When looking at pairwise differences for the seasonal breakdown, the GFS GSS values are significantly larger than the NAM values for all thresholds for the winter aggregation and the thresholds below 2.54 mm for the summer aggregation (Fig. 5). The decrease in seasonal base rate during the winter season is one possible contributor to the higher overall GSS values because of the larger proportion of correct negatives, which are generally easier to forecast. Another possible explanation is that the mesoscale systems during the winter season are more often strongly forced, which, again, makes them easier to forecast.

Fig. 5. Threshold series plots of debiased GSS for 3-h QPFs for the 48-h forecast lead time aggregated across the winter (solid) and summer (dash) seasons. The GFS results are shown in red, NAM results are in blue, and the pairwise difference (NAM − GFS) results are in green. The vertical bars represent the 99% CIs. The base rates for the winter (solid) and summer (dash) aggregations are associated with the second y axis and shown in black.

b. Spatial verification results

Spatial verification approaches provide additional diagnostic information when comparing the forecast performance of models with different horizontal scales, especially as the grid spacing decreases. While spatial verification approaches become critical when investigating forecast deficiencies at fine resolutions (<5 km), similar benefits are available at coarser resolutions. Advantages of two state-of-the-art spatial verification techniques are illustrated while keeping in mind the best practices and limitations of these types of approaches for mid- and coarse resolutions.

1) FSS

Forecast performance at a variety of spatial scales was investigated by changing the neighborhood width n (in grid squares), where the entire neighborhood size is defined as n × n grid squares. Figure 6 provides a visual example of neighborhood widths and sizes. For grid-to-grid comparisons (as used for traditional verification metrics such as frequency bias or GSS), the neighborhood width is n = 1, denoted by the solid outline in Fig. 6; the dotted and the dashed outlines illustrate larger neighborhood sizes of n = 3 and 5, respectively. Neighborhood widths of n = 3, 7, 11, …, 75 were applied to each model forecast for this evaluation.

Fig. 6. Illustration of neighborhood size and the relationship of forecast skill with varying spatial scale for a particular precipitation threshold. In the forecast and observed fields, the shaded squares represent a value of 1 if the forecast or observed precipitation in that square exceeds the designated threshold; the nonshaded squares represent a value of 0. The solid outline represents a single grid square. Evaluating each individual grid square using traditional verification metrics would reveal the forecast has no skill, as none of the forecast events overlaps with the observed events. However, as the neighborhood size increases from 9 (3 × 3 dotted outline) to 25 (5 × 5 dashed outline), both the forecast and observed fields have events in 6 of 25 grid squares. [Adapted from Roberts and Lean (2008), their Fig. 2.]

Verification quilt plots (Ebert 2009; Gilleland et al. 2009), as in Fig. 7, provide a clear summary of FSS as a function of spatial scale and threshold at a particular forecast lead time. For these plots, the neighborhood size increases toward the top of the plot, effectively representing coarsening of the grid, while the precipitation threshold increases toward the right. The FSS value associated with each combination of spatial scale and threshold is indicated by both the number and the color shading in each box; the warmer colors are associated with larger FSS values. Typically, the greatest skill will be associated with the coarsest resolution and lowest threshold (top-left corner), while the lowest skill will be associated with the finest resolution and largest threshold [bottom-right corner; Ebert (2009)]. Essentially, the least skill will frequently be associated with the most difficult forecast event to accurately predict—often, very localized, intense precipitation accumulation events. A similar result is found in this study as well; regardless of lead time, the largest FSS values are associated with the larger spatial scales and the lowest threshold (0.254 mm) while the smallest FSS values are associated with the smaller spatial scales and highest threshold (8.89 mm).

Fig. 7. Quilt plots of FSS as a function of spatial scale and threshold aggregated across all model initializations (annual) for the 24-h lead time. Shown are the (top) NAM and (bottom) GFS plots. The FSS value associated with each spatial scale and threshold is indicated by both the number and the color shading in each box; warmer colors are associated with larger FSS values. Values that are smaller than the uniform forecast skill value are denoted with parentheses.

As described in section 3, the uniform forecast skill is an important indicator of the scale at which the forecast becomes useful. Values less than this uniform forecast skill score are denoted with parentheses around them in the individual boxes of the quilt plot. In the FSS quilt plots generated for this comparison, the two highest precipitation thresholds of 6.35 and 8.89 mm are always associated with FSS values less than the calculated uniform forecast skill for both models regardless of spatial scale or forecast lead time. An overall decrease in skill was observed as lead time increases from 12 to 84 h, resulting in an increase in the number of FSS values that fall below the uniform forecast skill value with lead time. To focus on a generally more active time of day, in terms of precipitation, only the 24-h lead time (valid at 0000 UTC) is included in this discussion. For the annual aggregation, the NAM FSS values are consistently larger than the GFS FSS values for all scores larger than the uniform forecast skill score (Fig. 7). Fairly consistent behavior is evident for the summer and winter aggregations (not shown).

To further explore FSS by lead time and seasonal aggregation, two spatial scales (60 and 300 km) for the 0.254-mm threshold for the winter aggregation (Fig. 8) and the 2.54-mm threshold for the summer aggregation (Fig. 9) are shown. As seen in Figs. 8 and 9, FSS decreases with lead time for both seasonal aggregations. A diurnal cycle (weak in the winter) is also superimposed, with the largest FSS values typically occurring during the afternoon/evening hours, the time period corresponding to a higher observed frequency of precipitation events. FSS increases with neighborhood size, as expected, where the smallest spatial scale displayed (60 km) has smaller FSS values than the values associated with a larger spatial scale considered for the same model (300 km).

Fig. 8. Time series plot of FSS using a threshold of 0.254 mm aggregated across the winter season. The GFS (red), NAM (blue), and the pairwise differences (green) are shown for n = 15 (60-km spatial scale; triangle, dot–dash) and 75 (300-km spatial scale; circle, solid). The vertical bars on the pairwise differences represent the 99% CIs.

Fig. 9. Time series plot of FSS using a threshold of 2.54 mm aggregated across the summer season. The GFS (red), NAM (blue), and the pairwise differences (green) are shown for n = 15 (60-km spatial scale; triangle, dot–dash) and 75 (300-km spatial scale; circle, solid). The vertical bars on the pairwise differences represent the 99% CIs.

Pairwise differences were computed between the NAM and GFS FSS values for each neighborhood size at each lead time. For the winter aggregation, NAM exhibited larger FSS values at both the 60- and 300-km neighborhood sizes, with SS pairwise differences highlighting improved QPF performance in the NAM for nearly all lead times; the only exceptions are at the longer lead times (i.e., greater than 60 h for the 60-km scores and greater than 78 h for the 300-km scores), where the CIs on the difference line encompass zero (Fig. 8). For the summer aggregation, a majority of the CIs on the pairwise differences encompass zero. A few consistent results at valid times of 0000, 0300, 1500, and 1800 UTC indicate the NAM has significantly larger FSS values, with more SS pairwise differences noted for the 300-km neighborhood size (Fig. 9). The only differences showing better performance by the GFS are for the 60-km neighborhood size at the 21- and 45-h forecast times. The root cause of the intermittency of SS pairwise differences and the occasional larger FSS values for GFS for the summer aggregation is related to the large change in GFS FSS values between 1800 and 0000 UTC (lead times of 18–24, 42–48, and 66–72 h), which was not found for NAM.

2) MODE

As discussed in section 3, MODE is a highly configurable verification tool, and it is important to define the features of interest prior to beginning an evaluation. For this study, mesoscale precipitation systems were selected as the features of interest, and MODE was tuned (select settings defined in parentheses) to best suit this focus. A raw threshold (raw_thresh) of 0.254 mm was first applied to both the forecast and observation fields and all values that did not meet the threshold of interest were set to zero. A circular smoother (conv_radius) with a radius of 10 grid points was then used. Two thresholds (conv_thresh), 0.254 and 2.54 mm, were applied to the convolved 3-h precipitation accumulation fields to define discrete precipitation objects and the raw data values were reinserted. For each forecast–observation precipitation object pair, MODE computed a total interest value between 0 and 1 to quantify the similarity of the objects. The total interest is a weighted average of the following object pair attributes, each followed by its relative weight in parentheses: the distance between the objects’ centroids (2), the minimum (boundary) distance between the objects (4), the difference in the objects’ orientation angles (1), the ratio of the objects’ areas (1), and the ratio of the objects’ intersection area to their union area (2). Identified precipitation objects were matched between the forecast field and observed field if the total interest value for a forecast–observation object pair was greater than or equal to 0.7. While no merging was performed in the individual forecast and observation fields (merge_flag = none), merging of simple objects into a cluster object (group of related simple objects) was allowed in each field if two or more objects in one field (either forecast or observed) matched the same object in the other field (match_flag = merge_both). Examples of the objects created from the forecast and observation fields for 24-h forecasts from the NAM and GFS valid at 0000 UTC 14 May 2009 are shown in Fig. 10. Note that while the identified “simple” precipitation objects in the observation field are exactly the same in Fig. 10 (top) and Fig. 10 (bottom), the comparisons between the observed and forecast fields may identify different ways to match clusters of observed and forecasted objects, depending on the forecast field. This behavior is thought to mimic the typical subjective assessment process applied by some forecasters and other weather analysts when comparing observations to different forecast models.
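As a rough illustration of the object-resolution step with the settings used here (a 10-gridpoint circular smoother and convolution thresholds of 0.254 or 2.54 mm), the sketch below zeroes out values below the raw threshold, smooths the field with a disk-shaped kernel, thresholds the result, and labels contiguous regions; the scipy-based filtering is an assumption of the sketch and only approximates MODE's convolution filter.

    import numpy as np
    from scipy.ndimage import convolve, label

    def resolve_objects(precip, conv_radius=10, conv_thresh=2.54, raw_thresh=0.254):
        """Roughly follow the MODE object-resolution steps: zero out values below
        raw_thresh, smooth with a circular kernel of the given radius (in grid
        points), threshold the smoothed field, and label contiguous regions."""
        field = np.where(precip >= raw_thresh, precip, 0.0)
        y, x = np.ogrid[-conv_radius:conv_radius + 1, -conv_radius:conv_radius + 1]
        disk = (x ** 2 + y ** 2 <= conv_radius ** 2).astype(float)
        disk /= disk.sum()                              # disk-shaped smoothing kernel
        smoothed = convolve(field, disk, mode="constant")
        mask = smoothed >= conv_thresh
        objects, n_objects = label(mask)                # integer label per simple object
        raw_in_objects = np.where(mask, precip, 0.0)    # raw values reinserted in objects
        return objects, n_objects, raw_in_objects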

Fig. 10. Example illustrating the MODE objects created from the (top) NAM and (bottom) GFS (left) 3-h QPF fields and (right) associated stage II analysis field from a 24-h forecast valid at 0000 UTC 14 May 2009. Both the forecast and observation fields are on the 4-km domain. Similar colors between the fields indicate matched objects; royal blue objects in the forecast field are false alarms and in the observation field they are misses. The black lines surrounding objects are the convex hulls, which are the smallest set of curves bounding an object or group of objects together.

MODE computes a variety of measures that the user can examine depending on their specific application. When the aggregation of all objects in the forecast field is compared to the aggregation of all objects in the observed field, MODE attributes assess the bias of the forecast. Accuracy is evaluated when objects are matched between the forecast and observed field, and the differences between the forecast and observed MODE count, area, and location attributes are computed. The measures that are relevant for the dataset and approach in this study (i.e., attributes that are appropriate for examining regional mesoscale features) are discussed further in this section, starting with the total number of precipitation objects and the spatial coverage of each forecast object (i.e., the areas).

Identifying precipitation objects in both the forecast and observation fields provides a unique way of comparing the model and observed precipitation fields through a variety of attributes. Figure 11 shows the total counts of precipitation objects from the two models and the observation field as a function of forecast lead time. As was done for FSS, the results for the 0.254-mm threshold for the winter aggregation and 2.54-mm threshold for the summer aggregation will be discussed in detail. The counts represented here are the total number of simple (i.e., not matched or clustered) objects in each field summed by forecast lead time for each temporal aggregation, regardless of whether a matching precipitation object could be identified in the other field. For the winter aggregation [Fig. 11 (top)], a peak number of observed precipitation objects was found at 2100 UTC, and a minimum number was found at 1200 UTC. The GFS distribution exhibits a nearly opposite signal from the observed count time series, with a peak number of precipitation objects at 0300 UTC and a minimum near 2100 UTC. Regardless of lead time, the total number of precipitation objects identified in the GFS forecasts is substantially smaller than the number in the observed field. The forecast count series for NAM, on the other hand, is characterized by a double-peak structure in the total number of precipitation objects at 1200 and 0000 UTC. Hence, NAM generally underforecasted the total number of precipitation objects during the daytime, with a 3-h lag in the peak from the observed count, and overforecasted the total number of precipitation objects during the overnight hours.

Fig. 11. Time series plots of total object counts by lead time for the GFS (red), NAM (blue), and stage II analysis (black) fields aggregated across the (top) winter season for the 0.254-mm threshold and (bottom) summer season for the 2.54-mm threshold.

The total object count for the summer aggregation [Fig. 11 (bottom)] exhibits a clear diurnal signal consistent with the convective nature of the precipitation objects identified. The peak count in identified precipitation objects in the observation field is shifted 3 h later than that found for the winter aggregation (i.e., to 0000 UTC), likely due to the timing of convective initiation occurring later in connection with the maximum diurnal heating during the summer. A minimum in total precipitation objects is observed from the early morning to early afternoon between 0600 and 1800 UTC. The number of observed precipitation objects identified in the summer aggregation is about 1.5 times larger than the winter aggregation, likely due to the mesoscale versus synoptic scales of summer versus winter precipitation. For the NAM forecasts, the diurnal distribution of precipitation object counts is very similar to the diurnal distribution of the observed precipitation object counts; however, smaller total numbers of identified precipitation objects are associated with all lead times. The distribution of counts for the GFS forecasts is even further displaced toward fewer total precipitation objects, and the peak number of forecast precipitation objects, which occurs at 2100 UTC, is offset by 3 h from the observed peak. The plots in Fig. 11 suggest that while both the NAM and the GFS produced too few precipitation objects at many lead times, the NAM more closely reproduced the total number of precipitation objects found in the observation field for both seasonal aggregations and captured the timing of the convective peaks better during the summer. It is likely that the models are not able to resolve the appropriate number of precipitation objects because of the coarseness of their native resolutions.

To further investigate the simple precipitation objects identified in each field, box plots of the distributions of object area by lead time are shown in Fig. 12. In each box plot, the median value of the distribution is denoted by the “waist” of the box and the “notches” about the median approximate the 99% CIs for the median. The 25th and 75th percentiles are denoted by the lower and upper ends of the box, respectively, and the largest nonoutlier values, defined as 1.5 times the interquartile range, are contained within the whiskers of the box plot. For the winter aggregation [Fig. 12 (top)], the area of the identified precipitation objects in the observed field is consistent across valid times, with a median value of approximately 7000 km2 (for reference, the area is slightly larger than the state of Delaware). The overall area of the GFS precipitation objects was significantly larger than those identified in the observed field, regardless of the seasonal aggregation examined. At the 0.254-mm threshold, for the winter aggregation, regardless of the valid time, the median precipitation object area for the GFS is nearly double the median for the observed precipitation objects. In addition, the upper ends of the whiskers for the GFS box plots are substantially larger (approaching sizes closer to the state of Wisconsin) than the upper whiskers for the observed and NAM precipitation area distributions. This difference is likely due to the coarse native resolution of the GFS, which leads to large areas of forecast precipitation. The NAM median precipitation object areas are generally significantly smaller than the median observed precipitation object areas for the winter aggregation. While the NAM is able to produce larger synoptic-scale features, with the 75th percentile of the distribution similar to that seen for the observations, it also has a large number of relatively small precipitation objects. Related, peaks in total forecast precipitation object counts are seen in Fig. 11 (top) for the NAM at the 0000 and 1200 UTC valid times. This leads to the NAM distribution having lower values for the 25th percentile and smaller median values as compared to the distribution of the observed precipitation object areas.

Fig. 12. Box plots by lead time showing the size distributions for precipitation objects identified within the GFS (red), NAM (blue), and stage II analysis fields (gray) aggregated across the (top) winter season for the 0.254-mm threshold and (bottom) summer season for the 2.54-mm threshold. The bottom and top of each box plot correspond to the 25th and 75th percentiles, respectively; the black line at the “waist” is the median value and the “notches” about the median approximate the 99% CIs.

The median area of the observed precipitation objects for the summer aggregation [Fig. 12 (bottom)] is smaller than for the winter aggregation and is dependent on valid time; for 1200 UTC, the median value is around 6000 km2 and drops to about 5700 km2 at 0000 UTC. The smaller median area values of observed precipitation objects at 0000 UTC for the summer aggregation may be attributed to the climatological nature of convective initiation around that time, while at 1200 UTC the individual storm cells may have conglomerated into a smaller number of larger mesoscale-type convective systems overnight. The NAM generally replicated the size distribution of the observed precipitation objects, with CIs overlapping for the 0000 UTC valid times in the summer aggregation at the 2.54-mm threshold. For the summer samples valid at 1200 UTC, the median NAM forecast precipitation object areas are about 1.5 times too large, and the GFS medians are nearly twice the size of the observed precipitation objects.

The poor performance of coarse NWP models in predicting warm-season precipitation can be attributed to the inability of the models to capture the rudimentary climatology of warm-season rainfall (Davis et al. 2003). The results described above, related to the inconsistency of the GFS precipitation object counts and areas with the values for the observed field, are well aligned with this assertion from Davis et al. (2003). Because the GFS differences in total object areas and counts are so large, it was not meaningful to undertake further investigations of forecast accuracy with additional MODE attributes for the GFS. In other words, because the forecast model cannot reproduce the correct number or size of precipitation objects, it is not beneficial to continue diagnosing the accuracy of those precipitation objects. Thus, further diagnostic analyses were only performed for NAM, which more appropriately captured the number and size of observed precipitation objects. Additional attributes available through MODE that are examined for NAM include frequency bias, symmetric difference, centroid distance, and centroid displacement (illustrated and defined in Fig. 13).

Fig. 13. Illustration of MODE-matched object attributes used in this study. The forecast object is shown in blue and the observed object is in red. The symmetric difference is the total nonoverlap area between the matched objects, shaded in gray (smaller is better). The centroid distance is the distance between the centroids of the matched objects. The centroid displacement examines the x (nominally east–west) and y (nominally north–south) offsets of the centroids of two matched objects.

A MODE-based spatial version of frequency bias can be computed as the area ratio of all identified forecast precipitation objects to all identified observed precipitation objects (as with traditional frequency bias, a value greater than 1 is an overforecast and a value less than 1 is an underforecast). The NAM MODE frequency bias results by lead time depend largely on the temporal aggregation (Fig. 14). For the winter aggregation, the NAM has an SS high bias for all lead times. In contrast, the summer aggregation has an SS low bias for all lead times except those valid at 1800 UTC, where the CIs encompass one. For both aggregations, a diurnal signal is noted. During the winter aggregation, the largest (high) bias is associated with valid times between 0600 and 1200 UTC, which are also the lead times when the total count is too large [Fig. 11 (top)]. For the summer aggregation, the most extreme low bias is associated with the 0000 UTC valid times, which are also the lead times that tended to have precipitation objects with areas that were too small [Fig. 12 (bottom)]. In addition, not enough individual storm cells were forecast (as indicated by the counts), which also contributed to the low bias. This result may indicate that NAM is not able to initiate enough discrete storms at the operational grid spacing of ~12 km.

Fig. 14. Time series plot of the median MODE frequency bias for NAM aggregated across the winter season for the 0.254-mm threshold (solid) and the summer season for the 2.54-mm threshold (dash). The vertical bars represent the 99% CIs.

The accuracy of the NAM forecasts is further assessed by examining several MODE metrics that directly compare matched, or clustered, forecast and observed precipitation objects. First, the symmetric difference is examined to assess how well the identified and matched observed and forecast precipitation objects relate to each other, not only in size, but also location. Symmetric difference measures the nonintersecting area between the forecast–observed precipitation object pair, with larger values indicating less overlap; a symmetric difference of zero indicates the objects exactly overlap. In Fig. 15, the symmetric difference results for the NAM summer aggregation at the 2.54-mm threshold indicate the largest symmetric differences were found for the morning hours (1200–1800 UTC). This result may indicate a problem with the timing and propagation of the precipitation objects. In fact, that time period is also when the centroid distances between the forecast and observed precipitation objects were found to be largest (Fig. 16), indicating the center of mass for the identified objects was farther apart. Looking further, the x and y displacements (nominally west–east and north–south, respectively) of the centroids (Fig. 17) reveal that the NAM tended to have a general westerly bias in the location of objects, perhaps indicating a lag in system propagation. This result is consistent with previous investigations conducted by Davis et al. (2003), Grams et al. (2006), and Clark et al. (2010), which highlight problems with the west–east propagation of mesoscale systems in several NWP models with parameterized convection. In a result that is similar to a conclusion of Davis et al. (2003), the NAM precipitation objects have smaller errors in latitudinal position compared to the errors in longitudinal position, where there are no SS displacements of the centroid in the north–south direction.
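For concreteness, the matched-object attributes discussed above (and defined in Fig. 13) can be written directly in terms of boolean object masks, as in the sketch below; the uniform 4-km grid spacing, the mask inputs, and the assumed orientation of the grid axes are assumptions of the illustration.

    import numpy as np

    def matched_object_attributes(fcst_mask, obs_mask, dx_km=4.0):
        """Attributes for one matched pair of boolean object masks: area-ratio
        (MODE-style frequency) bias, symmetric difference, centroid distance,
        and x/y centroid displacements (forecast minus observed)."""
        cell_area = dx_km ** 2
        area_bias = fcst_mask.sum() / obs_mask.sum()
        sym_diff = np.logical_xor(fcst_mask, obs_mask).sum() * cell_area  # km^2

        def centroid(mask):
            j, i = np.nonzero(mask)          # row (j) and column (i) indices
            return i.mean(), j.mean()        # (x, y) centroid in grid units

        fx, fy = centroid(fcst_mask)
        ox, oy = centroid(obs_mask)
        # assuming i increases eastward and j increases northward on this grid
        dx = (fx - ox) * dx_km               # > 0: forecast object too far east
        dy = (fy - oy) * dx_km               # > 0: forecast object too far north
        return area_bias, sym_diff, np.hypot(dx, dy), dx, dy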

Fig. 15. Time series plot of the median symmetric difference for all NAM forecast objects compared to their matching observed objects aggregated across the summer season for the 2.54-mm threshold. The vertical bars represent the 99% CIs.

Fig. 16. Time series plot of the median centroid distance for all NAM forecast objects compared to their matching observed objects aggregated across the summer season for the 2.54-mm threshold. The vertical bars represent the 99% CIs.

Fig. 17. Time series plot of the median centroid displacements in the x (solid) and y (dash) directions for all NAM forecast objects compared to their matching observed objects aggregated across the summer season for the 2.54-mm threshold. A positive (negative) value in the x direction (CENTX) indicates an easterly (westerly) bias and a positive (negative) value in the y direction (CENTY) indicates a northerly (southerly) bias. The vertical bars represent the 99% CIs.

5. Summary

Multiple verification methods were applied to the operational GFS and NAM in order to highlight information provided on QPF performance when assessed using traditional, neighborhood, and object-based verification techniques for mid- and coarse-resolution models. The additional diagnostic information available from the advanced spatial verification techniques, such as FSS and MODE, is beneficial for informing forecasters and model developers why forecasts are or are not performing well. Information on the scale at which the forecast becomes skillful is available using the FSS neighborhood method. The use of MODE allows for the diagnosis of model performance in terms of coverage, displacement, and orientation—a richer evaluation than the grid overlap comparisons that more traditional, categorical metrics use. Using MODE within the context of this analysis provided additional opportunities to investigate and understand the accuracy and performance of the NAM QPFs.

When looking at the traditional metric of frequency bias, an SS high bias was noted for the winter aggregation for both GFS and NAM at most thresholds, whereas an SS low bias was found at higher thresholds for both models for the summer aggregation. While not shown, when the NAM traditional frequency bias is plotted by lead time for both seasonal aggregations, the results are consistent and closely emulate the diurnal pattern seen in the MODE frequency bias for NAM; thus, the two methods provide similar information. Looking further into the additional information provided by MODE, with regard to the total number of forecast precipitation objects identified, the GFS was found to have a significant low bias regardless of temporal aggregation. While the NAM also exhibited a low bias in the total number of forecast precipitation objects during the summer aggregation, the diurnal distribution is very similar to that found for the observation field. The GFS had far too few precipitation objects in the forecast field, and the areas of those identified objects were significantly too large, which was not unexpected given its coarse resolution. Even though the NAM forecast precipitation object size distributions were generally significantly smaller than those of the observed precipitation objects for the winter aggregation and for the summer aggregation at the 1200 UTC valid time, the CIs overlapped for the summer 0000 UTC valid time and, overall, the NAM matched the observed precipitation object size distribution more closely than the GFS did. Given this context provided by MODE, it is possible to further explore potential explanations for the frequency bias values for each model. For the winter aggregation, the high frequency bias can likely be attributed to two main issues: 1) the GFS object areas were significantly too large and 2) the total numbers of precipitation objects identified in the NAM forecast fields were too large, especially between the 0600 and 1200 UTC valid times. For the summer aggregation, the largest contributor to the low bias at the higher thresholds is likely the small numbers of forecast precipitation objects produced by both models. In addition, while the sizes of the identified forecast precipitation objects were generally closer to the sizes of the observed precipitation objects for the summer aggregation than for winter, the forecast precipitation object areas were generally smallest at the 0000 UTC valid time, concurrent with the time of the smallest frequency bias values.
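As a reminder of how the traditional scores discussed above are defined, the sketch below computes frequency bias and the standard Gilbert skill score from 2 x 2 contingency-table counts. The counts are hypothetical, and the bias adjustment applied to the GSS in this study is not reproduced here.

```python
def frequency_bias(hits, misses, false_alarms):
    """Traditional frequency bias: forecast event count over observed event count."""
    return (hits + false_alarms) / (hits + misses)

def gilbert_skill_score(hits, misses, false_alarms, correct_negatives):
    """Standard (unadjusted) Gilbert skill score from a 2 x 2 contingency table."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

# Hypothetical counts for a single threshold and lead time
print(frequency_bias(40, 20, 35))            # 1.25 -> high (over)forecast bias
print(gilbert_skill_score(40, 20, 35, 905))  # ~0.39
```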

Few SS differences between GFS and NAM were identified for the traditional frequency bias metric, and which model had the smaller bias values depended on the forecast valid time. For GSS, however, GFS consistently had more skill than NAM when pairwise SS differences were noted. The FSS evaluation contradicted this result, especially during the winter aggregation, which clearly showed that the higher-resolution NAM had larger FSS values for the same neighborhood sizes. With the exception of the 1800 UTC valid time during the summer aggregation, NAM is favored whenever SS pairwise differences in FSS are present. The FSS quilt plots provide a clear summary of forecast performance as a function of spatial scale and precipitation threshold. The quilt plot for the annual aggregation revealed that NAM consistently performed better than GFS; however, for both models, several spatial scales did not meet the uniform forecast skill value. By including FSS in the evaluation, it becomes clear that while the NAM QPF does not precisely overlap the observations, its spatial distribution is more representative of the observations than that of the GFS.
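For reference, a minimal sketch of how an FSS value and the corresponding uniform forecast skill value (following Roberts and Lean 2008) might be computed for a single forecast–observation pair is given below. This is not the MET implementation used in the study; the zero-padded edge handling and the absence of aggregation across many cases are simplifications.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, n):
    """Fractions skill score for one field pair at a given threshold and an
    n x n neighborhood (after Roberts and Lean 2008)."""
    fcst_bin = (fcst >= threshold).astype(float)
    obs_bin = (obs >= threshold).astype(float)
    # Neighborhood fractions: proportion of grid squares exceeding the threshold
    pf = uniform_filter(fcst_bin, size=n, mode="constant")
    po = uniform_filter(obs_bin, size=n, mode="constant")
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

def uniform_skill(obs, threshold):
    """FSS of a uniform forecast equal to the observed base rate; FSS above
    this value is commonly taken to indicate a skillful spatial scale."""
    f0 = np.mean(obs >= threshold)
    return 0.5 + f0 / 2.0
```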

When used thoughtfully, the additional MODE object attributes can help answer clearly defined verification questions. For this study, mesoscale precipitation systems were selected as the object sizes of interest, and several diagnostic MODE measures were examined to assess forecast accuracy, starting with the symmetric difference. This attribute revealed that the NAM forecast objects had the least overlap with the observed objects during the morning hours between 1200 and 1800 UTC; this result points to a possible offset in the timing and propagation of precipitation objects when compared with the observations. This conclusion is supported by the results for the centroid displacement; NAM tended to have a general westerly bias in the location of precipitation objects, indicating a potential lag in system propagation.

The major focus of this paper was to describe best practices for the evaluation of NWP model precipitation forecasts and, especially, for applying the newer spatial verification methods. The spatial verification techniques described in this paper are considered an advancement over traditional methods for evaluating forecast performance. These techniques have proven useful at mid- and coarse resolutions but will be critical at finer resolutions. As computational resources increase, NWP models will continue to move toward higher resolution and provide both a finer level of detail and a more realistic structure in the resulting forecasts. With regard to precipitation forecasts, benefits of high-resolution modeling (<5 km) include finer detail in the underlying topography and the ability to explicitly depict convection (e.g., Kain et al. 2006; Weisman et al. 2008; Schwartz et al. 2009). However, Roberts and Lean (2008) state that “the problem we may have to face is an inherent reduction in predictability at the new resolved scales as the grid spacing is reduced and convection is resolved.” Thus, having appropriate verification measures is imperative for showing the strengths and weaknesses of these high-resolution models. This paper provides examples of how to obtain diagnostic information regarding forecast performance on different scales; in particular, this study has illustrated the types of measures that are appropriate for assessing the performance of precipitation forecasts to answer particular types of questions and has demonstrated the kinds of forecast performance information that those measures can provide. In addition to determining which model is better, it is valuable for many purposes to ascertain which aspects of the forecast are or are not performing well. When the interpretation of results is carefully considered, these newer spatial verification methods can help answer such questions in a more objective manner.

Acknowledgments

The authors thank Ying Lin at NCEP/EMC for her assistance in acquiring the model and observation data used for this evaluation. We express gratitude to Paul Oldenburg and Tatiana Burek for their development work on verification graphics generation. Thanks, also, to Zach Trabold for providing additional assistance in the analysis of this work during his time as a student assistant. We appreciate the time Eric Gilleland, Matthias Steiner, and Edward Tollerud invested in providing their insightful suggestions for the improvement of an earlier version of this manuscript. Constructive comments from three anonymous reviewers were appreciated as they improved the quality of the final submission. The Developmental Testbed Center (DTC) is funded by the National Oceanic and Atmospheric Administration (NOAA), the Air Force Weather Agency (AFWA), the National Center for Atmospheric Research (NCAR), and the National Science Foundation (NSF).

REFERENCES

  • Accadia, C., Mariani S., Casaioli M., and Lavagnini A., 2003: Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids. Wea. Forecasting, 18, 918–932, doi:10.1175/1520-0434(2003)018<0918:SOPFSS>2.0.CO;2.
  • Ahijevych, D., Gilleland E., Brown B. G., and Ebert E., 2009: Application of spatial verification methods to idealized and NWP-gridded precipitation forecasts. Wea. Forecasting, 24, 1485–1497, doi:10.1175/2009WAF2222298.1.
  • Baldwin, M. E., cited 2012: Quantitative precipitation forecast verification documentation. NOAA/NCEP/Environmental Modeling Center. [Available online at http://www.emc.ncep.noaa.gov/mmb/ylin/pcpverif/scores/docs/mbdoc/pptmethod.html.]
  • Baldwin, M. E., and Kain J. S., 2006: Sensitivity of several performance measures to displacement error, bias, and event frequency. Wea. Forecasting, 21, 636–648, doi:10.1175/WAF933.1.
  • Clark, A. J., Gallus W. A. Jr., and Chen T.-C., 2007: Comparison of the diurnal precipitation cycle in convection-resolving and non-convection-resolving mesoscale models. Mon. Wea. Rev., 135, 3456–3473, doi:10.1175/MWR3467.1.
  • Clark, A. J., Gallus W. A. Jr., and Weisman M. L., 2010: Neighborhood-based verification of precipitation forecasts from convection-allowing NCAR WRF Model simulations and the operational NAM. Wea. Forecasting, 25, 1495–1509, doi:10.1175/2010WAF2222404.1.
  • Clark, A. J., Bullock R. G., Jensen T. L., Xue M., and Kong F., 2014: Application of object-based time-domain diagnostics for tracking precipitation systems in convection-allowing models. Wea. Forecasting, 29, 517–542, doi:10.1175/WAF-D-13-00098.1.
  • Davis, C. A., Manning K. W., Carbone R. E., Trier S. B., and Tuttle J. D., 2003: Coherence of warm-season continental rainfall in numerical weather prediction models. Mon. Wea. Rev., 131, 2667–2679, doi:10.1175/1520-0493(2003)131<2667:COWCRI>2.0.CO;2.
  • Davis, C. A., Brown B., and Bullock R., 2006: Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. Mon. Wea. Rev., 134, 1772–1784, doi:10.1175/MWR3145.1.
  • Davis, C. A., Brown B., Bullock R., and Halley Gotway J., 2009: The Method for Object-Based Diagnostic Evaluation (MODE) applied to numerical forecasts from the 2005 NSSL/SPC Spring Program. Wea. Forecasting, 24, 1252–1267, doi:10.1175/2009WAF2222241.1.
  • Done, J., Davis C. A., and Weisman M., 2004: The next generation of NWP: Explicit forecasts of convection using the Weather Research and Forecasting (WRF) Model. Atmos. Sci. Lett., 5, 110–117, doi:10.1002/asl.72.
  • Duda, J. D., and Gallus W. A., 2013: The impact of large-scale forcing on skill of simulated convective initiation and upscale evolution with convection-allowing grid spacings in the WRF. Wea. Forecasting, 28, 994–1018, doi:10.1175/WAF-D-13-00005.1.
  • Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, doi:10.1002/met.25.
  • Ebert, E. E., 2009: Neighborhood verification: A strategy for rewarding close forecasts. Wea. Forecasting, 24, 1498–1510, doi:10.1175/2009WAF2222251.1.
  • EMC, 2003: The GFS atmospheric model. NCEP Office Note 442, Global Climate and Weather Modeling Branch, Environmental Modeling Center, Camp Springs, MD, 14 pp. [Available online at http://www.emc.ncep.noaa.gov/officenotes/newernotes/on442.pdf.]
  • Fowler, T. L., Jensen T., Tollerud E. I., Halley Gotway J., Oldenburg P., and Bullock R., 2010: New Model Evaluation Tools (MET) software capabilities for QPF verification. Preprints, Third Int. Conf. on QPE, QPF and Hydrology, Nanjing, China, WMO/World Weather Research Programme. [Code and documentation available online at http://www.dtcenter.org/met/users/metoverview/index.php.]
  • Fulton, R. A., Breidenbach J. P., Seo D.-J., Miller D. A., and O’Bannon T., 1998: The WSR-88D rainfall algorithm. Wea. Forecasting, 13, 377–395, doi:10.1175/1520-0434(1998)013<0377:TWRA>2.0.CO;2.
  • Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Tech. Note NCAR/TN-479+STR, 71 pp., doi:10.5065/D6WD3XJM.
  • Gilleland, E., Ahijevych D., and Brown B. G., 2009: Intercomparison of spatial forecast verification methods. Wea. Forecasting, 24, 1416–1430, doi:10.1175/2009WAF2222269.1.
  • Gilleland, E., Ahijevych D., Brown B. G., and Ebert E., 2010: Verifying forecasts spatially. Bull. Amer. Meteor. Soc., 91, 1365–1373, doi:10.1175/2010BAMS2819.1.
  • Grams, J. S., Gallus W. A. Jr., Koch S. E., Wharton L. S., Loughe A., and Ebert E. E., 2006: The use of a modified Ebert–McBride technique to evaluate mesoscale model QPF as a function of convective system morphology during IHOP 2002. Wea. Forecasting, 21, 288–306, doi:10.1175/WAF918.1.
  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, doi:10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
  • Hunter, S. M., 1996: WSR-88D radar rainfall estimation: Capabilities, limitations and potential improvements. Natl. Wea. Dig., 20, 26–41.
  • Janjić, Z. I., 2003: A nonhydrostatic model based on a new approach. Meteor. Atmos. Phys., 82, 271–285, doi:10.1007/s00703-001-0587-6.
  • Janjić, Z. I., 2004: The NCEP WRF core. Preprints, 20th Conf. on Numerical Weather Prediction, Seattle, WA, Amer. Meteor. Soc., 12.7. [Available online at http://ams.confex.com/ams/pdfpapers/70036.pdf.]
  • Johnson, A., Wang X., and Xue M., 2013: Object-based evaluation of the impact of horizontal grid spacing on convection-allowing forecasts. Mon. Wea. Rev., 141, 3413–3425, doi:10.1175/MWR-D-13-00027.1.
  • Jolliffe, I. T., and Stephenson D. B., 2011: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
  • Kain, J. S., Weiss S. J., Levit J. J., Baldwin M. E., and Bright D. R., 2006: Examination of convection-allowing configurations of the WRF Model for the prediction of severe convective weather: The SPC/NSSL Spring Program 2004. Wea. Forecasting, 21, 167–181, doi:10.1175/WAF906.1.
  • Lin, Y., and Mitchell K. E., 2005: The NCEP stage II/IV hourly precipitation analyses: Development and applications. Preprints, 19th Conf. on Hydrology, San Diego, CA, Amer. Meteor. Soc., 1.2. [Available online at https://ams.confex.com/ams/pdfpapers/83847.pdf.]
  • Maddox, R. A., Zhang J., Gourley J. J., and Howard K. W., 2002: Weather radar coverage over the contiguous United States. Wea. Forecasting, 17, 927–934, doi:10.1175/1520-0434(2002)017<0927:WRCOTC>2.0.CO;2.
  • Mass, C. F., Ovens D., Westrick K., and Colle B. A., 2002: Does increasing horizontal resolution produce more skillful forecasts? Bull. Amer. Meteor. Soc., 83, 407–430, doi:10.1175/1520-0477(2002)083<0407:DIHRPM>2.3.CO;2.
  • Mittermaier, M., and Roberts N., 2010: Intercomparison of spatial forecast verification methods: Identifying skillful spatial scales using the fractions skill score. Wea. Forecasting, 25, 343–354, doi:10.1175/2009WAF2222260.1.
  • Mittermaier, M., Roberts N., and Thompson S. A., 2013: A long-term assessment of precipitation forecast skill using the Fractions skill score. Meteor. Appl., 20, 176–186, doi:10.1002/met.296.
  • R Development Core Team, cited 2013: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org.]
  • Roberts, N. M., and Lean H. W., 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, doi:10.1175/2007MWR2123.1.
  • Schwartz, C. S., and Coauthors, 2009: Next-day convection-allowing WRF Model guidance: A second look at 2-km versus 4-km grid spacing. Mon. Wea. Rev., 137, 3351–3372, doi:10.1175/2009MWR2924.1.
  • Weisman, M. L., Davis C., Wang W., Manning K. W., and Klemp J. B., 2008: Experiences with 0–36-h explicit convective forecasts with the WRF-ARW Model. Wea. Forecasting, 23, 407–437, doi:10.1175/2007WAF2007005.1.
  • Westrick, K. J., Mass C. F., and Colle B. A., 1999: The limitations of the WSR-88D radar network for quantitative precipitation measurement over the coastal western United States. Bull. Amer. Meteor. Soc., 80, 2289–2298, doi:10.1175/1520-0477(1999)080<2289:TLOTWR>2.0.CO;2.
  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 2nd ed. Elsevier, 704 pp.


